Purifying Shampoo: Investigating Shampoo's Heuristics by Decomposing its Preconditioner
Conference: NeurIPS 2025 | arXiv: 2506.03595 | Code: Unavailable | Area: Optimization | Keywords: Shampoo optimizer, preconditioner, Kronecker factor, learning rate grafting, adaptive eigenbasis update
TL;DR
By decomposing Shampoo's preconditioner into eigenvalue and eigenbasis components, this work reveals that learning rate grafting essentially compensates for eigenvalue staleness and scaling bias, and proposes eigenvalue correction and adaptive eigenbasis update frequency as principled replacements for these heuristics.
Background & Motivation
Shampoo is a preconditioned stochastic gradient optimizer based on Kronecker factor decomposition. Its recent victory in the external tuning track of the AlgoPerf neural network training competition has renewed interest in non-diagonal preconditioning methods.
Limitations of Prior Work: Shampoo's success relies heavily on two key heuristics: (1) learning rate grafting, which rescales Shampoo's update direction using the per-layer update magnitude from Adam; and (2) stale preconditioning, which recomputes the inverse root of the preconditioner only every 100 steps. Both techniques increase algorithmic complexity, introduce additional hyperparameters, and lack theoretical justification. Empirically, Shampoo without grafting underperforms across a wide range of hyperparameter settings, sometimes even falling behind AdamW.
Gap in Prior Work: Although SOAP proposed an eigenvalue-corrected variant of Shampoo with favorable results, a systematic account of what role grafting plays, why it is necessary, and how to replace it on principled grounds has been lacking.
Core Idea: Decouple the preconditioner update into eigenvalue and eigenbasis components. Grafting essentially compensates for eigenvalue staleness and the scaling bias induced by the Kronecker structure. Directly correcting the eigenvalues eliminates the need for grafting; adaptively determining eigenbasis update frequency improves computational efficiency.
Method
Overall Architecture
Shampoo approximates the \(mn \times mn\) preconditioner of full-matrix Adam as a Kronecker product of two smaller matrices: \(\mathbf{C}_t^{\text{Shampoo}} = (\mathbf{R}_t \otimes \mathbf{L}_t)^{1/2}\), where \(\mathbf{L}_t \in \mathbb{R}^{m \times m}\) and \(\mathbf{R}_t \in \mathbb{R}^{n \times n}\). This paper analyzes the sources of error in this approximation through the lens of eigendecomposition.
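To make this structure concrete, here is a minimal PyTorch sketch of one Shampoo-style step for a single \(m \times n\) parameter. The function name, the EMA-style accumulation of \(\mathbf{L}_t\) and \(\mathbf{R}_t\), and the per-factor exponent `p` are illustrative assumptions rather than the paper's implementation; classical Shampoo uses \(p = 1/4\) with summed accumulators, and practical implementations amortize the eigendecomposition over many steps.

```python
import torch

def shampoo_step(G, L, R, beta2=0.95, p=0.25, eps=1e-12):
    """One Shampoo-style preconditioned step for an m x n gradient G.

    L (m x m) and R (n x n) are the Kronecker factors of the preconditioner;
    the full mn x mn matrix (R ⊗ L)^{1/2} is never materialized. The EMA
    accumulation and the per-factor exponent p are illustrative conventions,
    not the paper's reference implementation.
    """
    # Accumulate statistics for the two Kronecker factors.
    L.mul_(beta2).add_(G @ G.T, alpha=1 - beta2)   # m x m
    R.mul_(beta2).add_(G.T @ G, alpha=1 - beta2)   # n x n

    # Eigendecompose each factor: L = Q_L diag(lam_L) Q_L^T, likewise for R.
    lam_L, Q_L = torch.linalg.eigh(L)
    lam_R, Q_R = torch.linalg.eigh(R)

    # Apply L^{-p} G R^{-p} via the eigenbases: rotate, scale, rotate back.
    G_rot = Q_L.T @ G @ Q_R
    scale = lam_L.clamp_min(eps)[:, None] ** -p * lam_R.clamp_min(eps)[None, :] ** -p
    return Q_L @ (G_rot * scale) @ Q_R.T
```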
Key Designs
- Decoupled Analysis of Eigenvalues and Eigenbasis: Shampoo's update can be decomposed into three steps: (a) rotating the gradient into a new coordinate system via the eigenbasis, \(\tilde{\mathbf{G}}_t = \mathbf{Q}_{\mathbf{L}_t}^\top \mathbf{G}_t \mathbf{Q}_{\mathbf{R}_t}\); (b) applying coordinate-wise scaling to the rotated gradient; and (c) rotating back to the original coordinate system. Lemma 1 establishes that the update magnitude is determined entirely by the eigenvalues (the eigenbasis does not affect it), and that Shampoo's magnitude bound carries an additional dimension-dependent factor \(m^{-p/2} n^{-p/2}\), whereas the corresponding bounds for full-matrix Adam and EShampoo contain no such factor. This explains why grafting is so critical for Shampoo: it compensates for the inter-layer inconsistency in update magnitude caused by varying layer dimensions.
- Eigenvalue Correction (EShampoo): Rather than constraining the eigenvalues to a Kronecker product structure, a dense scaling matrix is maintained independently in the rotated coordinate system: \(\mathbf{D}_t = \beta_2 \mathbf{D}_{t-1} + (1 - \beta_2) \tilde{\mathbf{G}}_t^{\odot 2}\), where \(\tilde{\mathbf{G}}_t^{\odot 2}\) denotes the element-wise square of the transformed gradient. This enables per-step "eigenvalue" updates at \(O(mn)\) cost, entirely eliminating the need for grafting (see the sketch after this list). Three theoretical predictions are empirically validated: (a) Shampoo2 (a tighter Kronecker approximation) with \(S^{-1}\) scaling can substitute for grafting; (b) eigenvalue correction can match or exceed grafted Shampoo; and (c) the optimal learning rate for Adam transfers directly to EShampoo.
- Adaptive Eigenbasis Update Frequency: An adaptive criterion based on the QR algorithm's termination condition is proposed. For each Kronecker factor \(\mathbf{L}_t\), the approximate eigenvalue matrix \(\hat{\Lambda}_{\mathbf{L}_t} = \hat{\mathbf{Q}}_{\mathbf{L}_{t-1}}^\top \mathbf{L}_t \hat{\mathbf{Q}}_{\mathbf{L}_{t-1}}\) is computed; if its relative off-diagonal error falls below a threshold \(\tau\), the eigenbasis update is skipped and the previous eigenbasis is reused. This allows different layers and different Kronecker factors to have independent update frequencies.
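As a rough illustration of how the pieces above fit together, below is a minimal sketch of an eigenvalue-corrected update step, assuming the eigenbases \(\mathbf{Q}_{\mathbf{L}}\) and \(\mathbf{Q}_{\mathbf{R}}\) were computed at an earlier refresh. The names, the omission of momentum and bias correction, and the exact epsilon placement are simplifications, not the authors' code.

```python
import torch

def eshampoo_step(G, Q_L, Q_R, D, beta2=0.95, eps=1e-8):
    """Sketch of an eigenvalue-corrected (EShampoo / SOAP-style) update for an
    m x n gradient G, assuming the eigenbases Q_L, Q_R were computed on an
    earlier step (they are refreshed only every F steps, or adaptively).

    D is an m x n buffer of Adam-like second-moment estimates maintained in
    the rotated coordinate system; it plays the role of the corrected
    "eigenvalues" and removes the need for learning rate grafting.
    """
    # (a) Rotate the gradient into the current eigenbasis.
    G_rot = Q_L.T @ G @ Q_R

    # (b) Per-step eigenvalue correction at O(mn) cost:
    #     D_t = beta2 * D_{t-1} + (1 - beta2) * G_rot ** 2.
    D.mul_(beta2).add_(G_rot.square(), alpha=1 - beta2)

    # (c) Coordinate-wise scaling by D^{-1/2}, then rotate back.
    return Q_L @ (G_rot / (D.sqrt() + eps)) @ Q_R.T
```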
Loss & Training
The training objective is the standard \(\min_{\boldsymbol{\theta}} \mathcal{L}(\boldsymbol{\theta})\). EShampoo updates the eigenvalue-correction matrix \(\mathbf{D}_t\) at every step (the same memory overhead as Adam grafting) and refreshes the eigenbasis at a fixed or adaptive frequency via torch.linalg.eigh or a warm-started QR algorithm.
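One possible form of the adaptive refresh test from Key Designs is sketched below; the helper name and the exact normalization of the off-diagonal error are assumptions for illustration, and the paper's criterion may normalize differently.

```python
import torch

def maybe_refresh_eigenbasis(L, Q_prev, tau=0.1):
    """Sketch of the adaptive eigenbasis-refresh test for one Kronecker
    factor L, based on the QR-style stopping criterion described above.
    Returns the (possibly reused) eigenbasis and whether it was recomputed.
    """
    # Approximate eigenvalue matrix under the stale basis: Q_prev^T L Q_prev.
    approx = Q_prev.T @ L @ Q_prev

    # If the stale basis still (nearly) diagonalizes L, skip the refresh.
    off_diag = approx - torch.diag(torch.diagonal(approx))
    if off_diag.norm() / approx.norm() < tau:
        return Q_prev, False

    # Otherwise recompute the eigenbasis (torch.linalg.eigh here; a practical
    # implementation could instead run a QR iteration warm-started at Q_prev).
    _, Q_new = torch.linalg.eigh(L)
    return Q_new, True
```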
Key Experimental Results
Main Results (AlgoPerf Subset Workloads)
| Workload | Shampoo Variant | Runs Reaching Target | Steps to Target | Time to Target (min) |
|---|---|---|---|---|
| FastMRI | Adam grafting (F=100) | 4/5 | 4301±109 | 13.96±0.44 |
| FastMRI | EShampoo (F=100) | 5/5 | 2536±66 | 10.44±0.21 |
| FastMRI | EShampoo (τ=0.1, F=50) | 5/5 | 2468±145 | 10.81±0.72 |
| ImageNet ViT | Adam grafting (F=100) | 1/1 | 79907 | 894.27 |
| ImageNet ViT | EShampoo (F=100) | 1/1 | 76226 | 894.85 |
| OGBG | Adam grafting (F=100) | 2/5 | 12574±708 | 39.20±1.88 |
| OGBG | EShampoo (F=100) | 3/5 | 8320±1203 | 33.02±4.05 |
| OGBG | EShampoo (τ=0.1, F=50) | 5/5 | 7117±328 | 27.55±3.49 |
Ablation Study (Imagewoof ViT / ConvNeXt V2)
| Configuration | Result | Remark |
|---|---|---|
| Shampoo (F=1, no grafting) | Matches grafted on ViT; not on ConvNeXt V2 | Validates scaling issues from dimension differences |
| Shampoo2 + \(S^{-1}\) (F=1) | Matches or exceeds grafted version | Tighter Frobenius approximation |
| EShampoo (F=100) | Matches or exceeds grafted version | Eigenvalue correction eliminates need for grafting |
| τ=0.99 (eigenbasis computed only once) | Significantly outperforms AdamW (0.12 vs 0.3) | Single eigenbasis computation already yields large gains |
| 1D params with Adam + 2D params with EShampoo | Equivalent to full EShampoo | Eigenbasis updates for 1D parameters are unnecessary |
| 2D params with Adam + 1D params with EShampoo | Close to pure Adam | All gains stem from eigenbasis of 2D parameters |
Key Findings
- Eigenbasis updates matter most early in training: using a high-precision eigenbasis (τ=0.01) during the first 10% of iterations has a far greater impact on final loss than during the remaining 90%
- Eigenbasis change rates differ across parameter types: the eigenbases of bias and LayerNorm parameters drift roughly four times faster than those of weight matrices
- Adaptive frequency reduces wall-clock time by approximately 20% compared to fixed F=100
Highlights & Insights
- Depth of the decoupled perspective: Grafting, previously a black-box heuristic, is reduced to an eigenvalue correction problem with both theoretical justification (dimension-dependent magnitude bounds) and empirical validation
- Strong practicality: EShampoo requires only an \(mn\)-dimensional buffer on top of Shampoo (the same overhead as grafting), yet eliminates grafting and achieves superior performance
- Hyperparameter transfer from AdamW to EShampoo: This is practically significant, reducing the tuning cost of adopting a new optimizer
- The finding that "frozen eigenbasis still outperforms AdamW" is surprising and practically valuable
Limitations & Future Work
- Main experiments are conducted on small-to-medium-scale models; thorough validation on large language models is lacking (only a 324M-parameter Llama 3 analysis is included)
- Adaptive update frequency slightly degrades performance on certain workloads (e.g., ImageNet ViT), indicating that the optimal task-specific strategy remains to be explored
- An open theoretical question remains unresolved: how to incorporate approximation quality into Shampoo's regret bound
- Applicability to non-AdaGrad-family methods (e.g., K-FAC, TNT) is not extensively discussed
Related Work & Insights
- This work is closely related to SOAP but provides deeper theoretical analysis and more systematic experiments
- The QR-based criterion for adaptive eigenbasis updates draws from matrix factorization theory in numerical linear algebra
- The approach of "purifying" Kronecker-factor methods should generalize to other Kronecker approximation-based optimizers
Rating
- Novelty: ⭐⭐⭐⭐ The decoupled perspective is original, though eigenvalue correction itself already appeared in SOAP
- Experimental Thoroughness: ⭐⭐⭐⭐ Covers AlgoPerf workloads and diverse model architectures with detailed ablations
- Writing Quality: ⭐⭐⭐⭐⭐ Theoretical derivations are presented progressively, with clear correspondence between experimental validation and theoretical predictions
- Value: ⭐⭐⭐⭐⭐ Contributes significantly to understanding and improving Shampoo-class optimizers with strong practical applicability