Understanding and Improving Shampoo and SOAP via Kullback-Leibler Minimization¶
Paper Information¶
- Conference: ICLR 2026
- arXiv: 2509.03378
- Code: https://github.com/yorkerlin/KL-Methods
- Area: LLM Pretraining
- Keywords: Shampoo, SOAP, KL divergence, Kronecker structure, second-order optimization, covariance estimation
TL;DR¶
This paper reinterprets the structured second-order moment estimation of Shampoo and SOAP through the lens of KL divergence minimization, reveals their inherent limitations, and proposes two practical methods—KL-Shampoo and KL-SOAP—that match or surpass the original methods without requiring Adam grafting.
Background & Motivation¶
Core Problem¶
Shampoo and its efficient variant SOAP employ Kronecker-structured second-order moment estimates for preconditioned optimization. However:

1. Shampoo typically requires step-size grafting with Adam to remain competitive.
2. SOAP mitigates this by running Adam in the Shampoo eigenbasis, but incurs additional memory overhead.
3. Prior analyses are primarily based on the Frobenius norm, which ignores the SPD (symmetric positive definite) constraint.
Why KL Divergence?¶
- KL divergence naturally respects the SPD constraint, whereas the Frobenius norm does not.
- In quasi-Newton methods (BFGS, DFP), KL provides a unified interpretive framework.
- The entries of an SPD matrix do not play equivalent roles, yet the Frobenius norm treats them uniformly.
- KL divergence extends naturally to the tensor-valued setting.
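To make the SPD point concrete, here is a small numerical illustration (mine, not from the paper): as an estimate \(\boldsymbol{S}\) approaches the boundary of the SPD cone, its KL divergence to a fixed SPD target blows up, while the Frobenius distance stays bounded, so KL-based estimation acts as a barrier that never leaves the cone.

```python
import numpy as np

# Target SPD second-moment matrix and two discrepancy measures.
Sigma = np.array([[2.0, 0.0],
                  [0.0, 1.0]])

def kl(S):
    # KL divergence between zero-mean Gaussians N(0, Sigma) and N(0, S)
    M = np.linalg.solve(S, Sigma)
    return 0.5 * (np.trace(M) - np.linalg.slogdet(M)[1] - Sigma.shape[0])

def fro(S):
    return np.linalg.norm(S - Sigma)

# Let S drift toward the SPD boundary (one eigenvalue -> 0):
# the Frobenius distance stays near 1, but the KL divergence diverges.
for eps in (1e-1, 1e-3, 1e-6):
    S = np.diag([2.0, eps])
    print(f"eps={eps:.0e}  fro={fro(S):.3f}  kl={kl(S):.1f}")
```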
Method¶
KL Interpretation of Shampoo¶
Claim 1: The estimation rule of Shampoo (\(p=1/2\)) can be derived as the optimal solution to a KL minimization problem:

\[\min_{\boldsymbol{S}_a \succ 0} \; \text{KL}\big(\mathbb{E}[\boldsymbol{g}\boldsymbol{g}^\top],\, \boldsymbol{S}\big)\]

where \(\boldsymbol{S} = (\frac{1}{d_b} \boldsymbol{S}_a) \otimes \boldsymbol{I}_b\), and the optimal solution is \(\boldsymbol{S}_a^* = \mathbb{E}[\boldsymbol{G}\boldsymbol{G}^\top]\).
Key Limitation: The one-sided approach of Shampoo does not adequately address the KL problem of jointly learning both factors.
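As a quick numerical sanity check (mine, not the paper's), one can verify that \(\mathbb{E}[\boldsymbol{G}\boldsymbol{G}^\top]\) minimizes the KL objective within the one-sided family \(\boldsymbol{S} = (\frac{1}{d_b}\boldsymbol{S}_a) \otimes \boldsymbol{I}_b\); the sketch below assumes row-major vectorization, so that \(\boldsymbol{S}_a \otimes \boldsymbol{I}_b\) acts on \(\text{vec}(\boldsymbol{G})\) by multiplying \(\boldsymbol{G}\)'s rows.

```python
import numpy as np

rng = np.random.default_rng(0)
da, db, n = 4, 3, 2000

# Sample matrix "gradients" G with a nontrivial row covariance
A = np.eye(da) + 0.3 * rng.standard_normal((da, da))
Gs = np.einsum('ij,njk->nik', A, rng.standard_normal((n, da, db)))

g = Gs.reshape(n, -1)            # row-major vec(G)
Sigma = g.T @ g / n              # empirical E[g g^T]

def kl(S):
    # KL divergence between zero-mean Gaussians N(0, Sigma) and N(0, S)
    M = np.linalg.solve(S, Sigma)
    return 0.5 * (np.trace(M) - np.linalg.slogdet(M)[1] - Sigma.shape[0])

Sa_star = np.einsum('nik,njk->ij', Gs, Gs) / n        # E[G G^T]
S_star = np.kron(Sa_star / db, np.eye(db))            # (1/d_b S_a*) ⊗ I_b

# The claimed optimum beats SPD-preserving perturbations of S_a
for _ in range(5):
    T = np.eye(da) + 0.1 * rng.standard_normal((da, da))
    S_pert = np.kron(T @ Sa_star @ T.T / db, np.eye(db))
    assert kl(S_star) <= kl(S_pert) + 1e-9
print("E[GG^T] minimizes the one-sided KL objective")
```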
KL-Shampoo: The Idealized Solution¶
Claim 2: The optimal solution to the joint KL minimization \(\min_{\boldsymbol{S}_a, \boldsymbol{S}_b} \text{KL}(\mathbb{E}[\boldsymbol{g}\boldsymbol{g}^\top], \boldsymbol{S})\), with \(\boldsymbol{S} = \boldsymbol{S}_a \otimes \boldsymbol{S}_b\), satisfies the coupled fixed-point conditions:

\[\boldsymbol{S}_a^* = \frac{1}{d_b}\,\mathbb{E}\big[\boldsymbol{G}\,(\boldsymbol{S}_b^*)^{-1}\boldsymbol{G}^\top\big], \qquad \boldsymbol{S}_b^* = \frac{1}{d_a}\,\mathbb{E}\big[\boldsymbol{G}^\top(\boldsymbol{S}_a^*)^{-1}\boldsymbol{G}\big]\]
Statistical Equivalence: KL-Shampoo \(=\) maximum likelihood estimation of a zero-mean matrix Gaussian \(=\) matrix Gaussian whitening.
EMA-Based Implementation¶
An EMA update that approximates the above fixed-point conditions (with a symmetric counterpart for \(\boldsymbol{S}_b\)):

\[\boldsymbol{S}_a \leftarrow (1-\beta_2)\boldsymbol{S}_a + \frac{\beta_2}{d_b}\boldsymbol{G}\boldsymbol{S}_b^{-1}\boldsymbol{G}^\top, \qquad \boldsymbol{S}_b \leftarrow (1-\beta_2)\boldsymbol{S}_b + \frac{\beta_2}{d_a}\boldsymbol{G}^\top\boldsymbol{S}_a^{-1}\boldsymbol{G}\]
Claim 3: This EMA scheme is a stochastic proximal gradient step for KL minimization.
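A toy, full-batch version of this scheme (my sketch; \(\beta_2\), sizes, and the gradient distribution are illustrative) iterates the two EMA updates over the empirical expectation and converges to the coupled fixed-point conditions of Claim 2:

```python
import numpy as np

rng = np.random.default_rng(1)
da, db, n = 5, 3, 2000
Gs = rng.standard_normal((n, da, db))        # stand-in gradient samples

def E_G_SbInv_Gt(Sb):   # empirical E[G Sb^{-1} G^T]
    return np.einsum('nij,jk,nlk->il', Gs, np.linalg.inv(Sb), Gs) / n

def E_Gt_SaInv_G(Sa):   # empirical E[G^T Sa^{-1} G]
    return np.einsum('nji,jk,nkl->il', Gs, np.linalg.inv(Sa), Gs) / n

# Full-batch EMA updates (the noiseless analogue of the per-step rule)
Sa, Sb, beta2 = np.eye(da), np.eye(db), 0.5
for _ in range(200):
    Sa = (1 - beta2) * Sa + (beta2 / db) * E_G_SbInv_Gt(Sb)
    Sb = (1 - beta2) * Sb + (beta2 / da) * E_Gt_SaInv_G(Sa)

# At convergence both fixed-point residuals vanish
res_a = np.linalg.norm(Sa - E_G_SbInv_Gt(Sb) / db)
res_b = np.linalg.norm(Sb - E_Gt_SaInv_G(Sa) / da)
print(res_a, res_b)
```

Note that the fixed point is only determined up to the scale trade-off \((c\boldsymbol{S}_a, \boldsymbol{S}_b/c)\); the iteration picks one normalization, and the product \(\boldsymbol{S}_a \otimes \boldsymbol{S}_b\) is unambiguous.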
Efficient Implementation: QR Decomposition + EMA Eigenvalues¶
Core technical contributions:

1. QR decomposition in place of eigendecomposition: achieves SOAP-level per-iteration runtime.
2. EMA eigenvalue estimation: a correction scheme for use with stale eigenbases.
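The QR trick is essentially orthogonal (subspace) iteration: rather than recomputing a fresh eigendecomposition, a stale basis is refreshed by QR steps, and eigenvalues are then read off the diagonal of the rotated moment matrix (tracked with an EMA in the optimizer). A generic sketch of the idea, not the paper's exact recipe:

```python
import numpy as np

rng = np.random.default_rng(2)
d = 6

# SPD moment matrix with well-separated eigenvalues (for fast convergence)
Q0, _ = np.linalg.qr(rng.standard_normal((d, d)))
true_evals = np.array([32.0, 16.0, 8.0, 4.0, 2.0, 1.0])
S = Q0 @ np.diag(true_evals) @ Q0.T

# Refresh a stale (here: random) eigenbasis by repeated QR steps
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
for _ in range(100):
    Q, _ = np.linalg.qr(S @ Q)

# Eigenvalue estimates are the diagonal of the rotated moment matrix;
# with a stale basis, an EMA of this diagonal replaces the instantaneous read.
est = np.sort(np.diag(Q.T @ S @ Q))[::-1]
print(est)   # ≈ [32, 16, 8, 4, 2, 1]
```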
Unified Framework: Divergence-Projection Perspective¶
| Method | Divergence | Preconditioner Structure | Estimation Scheme |
|---|---|---|---|
| KL-Shampoo | KL | Dense Kronecker | Maximum likelihood |
| Adafactor | von Neumann | Diagonal Kronecker | Matrix moment matching |
| F-Shampoo | Frobenius | Dense Kronecker | SVD-based |
Memory Comparison¶
| Method | Kronecker factors | Eigenbasis | Eigenvalues | Adam 2nd moment | Extra overhead |
|---|---|---|---|---|---|
| Shampoo | \(d_a^2+d_b^2\) | \(d_a^2+d_b^2\) | \(d_a+d_b\) | \(d_a d_b\) (grafting) | Yes |
| SOAP | \(d_a^2+d_b^2\) | \(d_a^2+d_b^2\) | N/A | \(d_a d_b\) (in eigenbasis) | Yes |
| KL-Shampoo | \(d_a^2+d_b^2\) | \(d_a^2+d_b^2\) | \(d_a+d_b\) | None | No |
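For concreteness, here is the per-matrix optimizer-state count implied by the table for a hypothetical \(4096 \times 1024\) weight (counts in stored entries; the layer size is my illustrative choice):

```python
da, db = 4096, 1024   # hypothetical weight-matrix dimensions

kron  = da**2 + db**2          # Kronecker factors
basis = da**2 + db**2          # cached eigenbasis
evals = da + db                # per-axis eigenvalues
adam  = da * db                # Adam second moment (full matrix)

shampoo_graft = kron + basis + evals + adam   # grafting keeps Adam state
soap          = kron + basis + adam           # Adam moment in the eigenbasis
kl_shampoo    = kron + basis + evals          # no Adam state at all

for name, v in [("Shampoo+grafting", shampoo_graft),
                ("SOAP", soap),
                ("KL-Shampoo", kl_shampoo)]:
    print(f"{name:>17}: {v/1e6:.2f}M entries")
```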
Key Experimental Results¶
Language Model Pretraining¶
Fair comparison using 150 random hyperparameter searches:
| Model | KL-Shampoo | SOAP | Shampoo+grafting | Shampoo (no grafting) |
|---|---|---|---|---|
| NanoGPT (123M) | Lowest loss | 2nd | 3rd | Poor |
| NanoRWKV7 (162M) | Lowest loss | 2nd | Middle | Complete failure |
| Llama (134M) | Lowest loss | 2nd | — | — |
| NanoMoE (227M, 3D tensors) | Lowest loss | 2nd | — | — |
Key Findings¶
- KL-Shampoo consistently outperforms SOAP: it achieves the lowest loss on all four models, a surprising result.
- KL-Shampoo requires no grafting: Shampoo (\(p=1/2\)) without grafting fails in all 150 runs on RWKV7.
- KL-Shampoo outperforms KL-SOAP: the core reason is that at the optimal eigenbasis, gradients are already Kronecker-diagonalized, making the additional Adam correction redundant.
- EMA eigenvalue estimation is critical: instantaneous estimation degrades severely when using a stale eigenbasis.
- VN-Shampoo (trace scaling) + EMA scheme also surpasses SOAP.
Highlights & Insights¶
- Deep theoretical insight: the KL perspective provides a unified interpretation of Shampoo, SOAP, and Adafactor.
- Practical improvement: eliminates Adam dependency, reduces memory, and maintains SOAP-level runtime.
- Explanation for KL-Shampoo > KL-SOAP: at the optimal eigenbasis, matrix Gaussian whitening is already satisfied, and no further diagonal correction is needed.
- Natural extension to tensors: the KL framework directly supports 3D+ weight tensors without reshaping.
Limitations & Future Work¶
- The EMA scheme of KL-Shampoo introduces additional matrix multiplications (\(\boldsymbol{G}\boldsymbol{S}_b^{-1}\boldsymbol{G}^\top\)).
- The theoretical analysis assumes zero-mean Gaussian gradients, which may not hold in practice.
- Experiments are primarily conducted on 100–200M scale models; billion-parameter regimes remain untested.
- PyTorch's QR decomposition (`torch.linalg.qr`) does not support half precision, so inputs must be cast to higher precision.
Related Work & Insights¶
- Shampoo: Gupta et al. (2018) — original Kronecker preconditioner.
- SOAP: Vyas et al. (2025a) — runs Adam in the Shampoo eigenbasis.
- Quasi-Newton methods: BFGS/DFP — classical applications of KL divergence.
- Second-order optimization: K-FAC, EKFAC — Fisher information matrix approximations.
Rating¶
- Novelty: ⭐⭐⭐⭐⭐ — The KL perspective provides a deep and unified new understanding.
- Experimental Thoroughness: ⭐⭐⭐⭐ — Fair comparison with 150 random searches is highly convincing.
- Writing Quality: ⭐⭐⭐⭐ — Mathematically rigorous, though somewhat lengthy.
- Value: ⭐⭐⭐⭐⭐ — Significant practical improvements with reduced memory and better performance.