Hyperparameter Transfer Enables Consistent Gains of Matrix-Preconditioned Optimizers Across Scales¶
Conference: NeurIPS 2025 | arXiv: 2512.05620 | Code: To be confirmed | Area: LLM/NLP | Keywords: optimizer scaling, μP, Shampoo, SOAP, Muon, hyperparameter transfer, matrix preconditioning
TL;DR¶
This paper investigates the hyperparameter scaling rules for matrix-preconditioned optimizers (Shampoo/SOAP/Muon) with respect to model width and depth under the μP framework, and demonstrates that correct hyperparameter scaling is the key to achieving consistent speedups. Using μP with \(1/\text{width}\) weight decay, all three optimizers consistently achieve approximately \(1.4\times\) speedup on Llama models ranging from 190M to 1.4B parameters.
Background & Motivation¶
Background: Several matrix-preconditioned optimizers (Shampoo, SOAP, Muon) have demonstrated significant speedups over AdamW (1.5–2×) in small-scale experiments. Muon has already been adopted in trillion-parameter-scale production training runs (most prominently Moonshot AI's Kimi K2).
Limitations of Prior Work: Reproduction attempts have reported highly inconsistent speedups—some teams report 2× acceleration, others only 1.1×, and some find that the gains vanish rapidly as scale increases. The root cause is the lack of reliable hyperparameter scaling rules.
Key Challenge: The loss-compute scaling exponent in language modeling is small (~0.05), meaning a 2% loss difference corresponds to a 40% compute difference. Grid search over hyperparameters is infeasible at large scale, making reliable hyperparameter transfer essential.
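As a rough sanity check of this arithmetic (a sketch assuming the usual power-law form of the loss-compute curve, with the exponent quoted above): if \(L(C) \propto C^{-\alpha}\) with \(\alpha \approx 0.05\), then to first order

\[
\frac{\Delta L}{L} \approx \alpha\,\frac{\Delta C}{C}
\quad\Longrightarrow\quad
\frac{\Delta C}{C} \approx \frac{1}{\alpha}\cdot\frac{\Delta L}{L} \approx \frac{0.02}{0.05} = 0.4,
\]

so a 2% gap in loss is worth on the order of 40% of the training compute, which is why mis-tuned hyperparameters can easily mask or mimic an optimizer's true speedup.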
Goal: Derive μP learning rate scaling rules for Shampoo/SOAP/Muon and empirically verify that correct scaling is critical for consistent speedups across model scales.
Key Insight: Apply μP (Maximal Update Parameterization) theory to derive per-optimizer scaling rules for learning rate and weight decay as functions of model width and depth.
Core Idea: The speedups from matrix-preconditioned optimizers are genuine, but correct μP scaling is required for them to manifest reliably at large scale.
Method¶
Overall Architecture¶
For each matrix-preconditioned optimizer, the paper derives μP scaling rules (governing how learning rate, weight decay, and regularization parameters scale with model width/depth), and validates these rules on language models ranging from 190M to 1.4B parameters.
Key Designs¶
- μP Derivation (General Procedure):
    - For each weight matrix \(W \in \mathbb{R}^{n \times m}\), ensure that the RMS of the update \(\Delta W\) scales so that the induced change in the layer's outputs remains \(\Theta(1)\) as width \(\to \infty\).
    - Because each optimizer preconditions differently, the learning-rate scaling differs: Adam scales as \(1/d_{\text{in}}\), Muon as \(\sqrt{d_{\text{out}}/d_{\text{in}}}\), and Shampoo's rule depends on its preconditioner exponents \(e_L, e_R\) (see the code sketch after this list).
- Mitigating Finite-Width Bias:
    - μP guarantees learning-rate transfer only in the infinite-width limit; at practical finite widths the optimal learning rate can still drift.
    - Blocking (partitioning large weight matrices into smaller blocks) and spectral normalization effectively mitigate this drift.
    - Grafting (taking one optimizer's update direction but rescaling it to the update norm of another optimizer) also affects the scaling behavior.
- Weight Decay Scaling:
    - The paper finds that an independent weight decay (decoupled from the learning rate) scaled as \(1/\text{width}\) is near-optimal across all optimizers.
    - This is consistent with the recommendation of Xiao (2024).
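The width scalings above can be summarized in a few lines of code. Below is a minimal sketch, not the paper's implementation: the helper names (`mup_lr`, `mup_weight_decay`) and their signatures are illustrative, only the Adam and Muon learning-rate rules and the \(1/\text{width}\) weight decay quoted above are encoded, and Shampoo's exponent-dependent rule is deliberately omitted.

```python
import math


def mup_lr(base_lr: float, d_in: int, d_out: int,
           base_d_in: int, base_d_out: int, optimizer: str) -> float:
    """Rescale a hidden-layer learning rate tuned at a proxy shape
    (base_d_in, base_d_out) to a target shape (d_in, d_out).

    Implements only the width scalings quoted above:
      - Adam/AdamW: lr proportional to 1 / d_in
      - Muon:       lr proportional to sqrt(d_out / d_in)
    Shampoo's rule additionally depends on its preconditioner
    exponents (e_L, e_R) and is not reproduced here.
    """
    if optimizer in ("adam", "adamw"):
        return base_lr * base_d_in / d_in
    if optimizer == "muon":
        return base_lr * math.sqrt(d_out / d_in) / math.sqrt(base_d_out / base_d_in)
    raise ValueError(f"no scaling rule implemented for {optimizer!r}")


def mup_weight_decay(base_wd: float, width: int, base_width: int) -> float:
    """Independent (learning-rate-decoupled) weight decay scaled as 1/width."""
    return base_wd * base_width / width
```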
Loss & Training¶
- Llama-architecture language models (190M, 470M, 1B, 1.4B) are trained on the FineWeb dataset.
- Hyperparameters are tuned on small models and transferred to large models via the derived μP scaling rules.
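To make the transfer step concrete, here is a hypothetical usage of the helpers sketched above; the proxy/target widths and base hyperparameter values are made up for illustration and are not the paper's configurations.

```python
# Hypothetical transfer from a small proxy model to a wider target model.
# Only the scaling rules themselves come from the summary above.
proxy_width, target_width = 1024, 2048

# Muon on a square hidden layer: sqrt(d_out/d_in) = 1 at both widths,
# so the tuned learning rate transfers unchanged.
muon_lr = mup_lr(base_lr=2e-2, d_in=target_width, d_out=target_width,
                 base_d_in=proxy_width, base_d_out=proxy_width, optimizer="muon")

# AdamW on the same layer: the 1/d_in rule halves the learning rate when width doubles.
adamw_lr = mup_lr(base_lr=3e-3, d_in=target_width, d_out=target_width,
                  base_d_in=proxy_width, base_d_out=proxy_width, optimizer="adamw")

# Independent weight decay: the 1/width rule also halves it.
wd = mup_weight_decay(base_wd=0.1, width=target_width, base_width=proxy_width)
```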
Key Experimental Results¶
Main Results¶
Compute-equivalent speedups relative to AdamW on Llama models from 190M to 1.4B:
| Optimizer | μP Scaling | 190M Speedup | 470M Speedup | 1B Speedup | 1.4B Speedup |
|---|---|---|---|---|---|
| Muon | ✅ | ~1.4× | ~1.4× | ~1.4× | ~1.4× |
| SOAP | ✅ | ~1.4× | ~1.4× | ~1.4× | ~1.4× |
| Shampoo | ✅ | ~1.4× | ~1.4× | ~1.3× | ~1.3× |
| Muon | ❌ (SP) | ~1.4× | ~1.2× | ~1.0× | Vanishes |
Ablation Study¶
| Configuration | Result | Notes |
|---|---|---|
| μP vs. standard parameterization | Consistent speedup under μP; vanishes under SP as scale increases | Hyperparameter transfer is critical |
| Blocking (128) | Mitigates finite-width bias | Practical and effective |
| Spectral normalization | Reduces learning rate sensitivity | Complementary to μP |
| WD=\(1/\text{width}\) vs. fixed WD | \(1/\text{width}\) is near-optimal | Consistent across optimizers |
| Compute-optimal model size | Matrix-preconditioned optimizers favor larger models | Different scaling law from AdamW |
Key Findings¶
- Incorrect hyperparameter scaling is the primary cause of prior reproduction failures: Under standard parameterization, the speedups from Muon/SOAP nearly vanish on models at the 1B+ scale.
- A \(1.4\times\) speedup is consistent and reliable: Under correct μP scaling, all three optimizers consistently achieve approximately \(1.4\times\) compute savings across all tested scales.
- μP scaling differs across optimizers: Directly applying Adam's μP rules to Muon is a common and consequential mistake.
Highlights & Insights¶
- Systematically resolves the community debate on whether matrix-preconditioned optimizers are beneficial—the answer is yes, but correct scaling is required. Prior inconsistent results can be attributed to hyperparameter misconfiguration.
- Generality of the μP derivation procedure: The paper provides a simple, general workflow for deriving scaling rules for any new optimizer.
- Practical guidance: Complete scaling formulas for each optimizer are provided (Table 1) and are directly applicable.
Limitations & Future Work¶
- Largest model tested is only 1.4B: Validation on models at 10B+ scale is absent.
- Only language modeling is evaluated: Other tasks such as vision and multimodal settings are not addressed.
- Relatively conservative speedup (\(1.4\times\)): This is more modest than the \(2\times\) reported in some prior work, possibly because the AdamW baseline is more carefully tuned.
- Future directions: Validation at larger scales; analysis of interactions with learning rate warmup/cooldown schedules; development of automated hyperparameter transfer tools.
Related Work & Insights¶
- vs. Liu et al. (2024, M-Muon): They report a \(2\times\) speedup for Muon without μP; this may result from an insufficiently tuned AdamW baseline.
- vs. Wen et al. (2024): They report that speedups vanish with scale—the present paper explains this as a consequence of hyperparameter drift under standard parameterization.
- vs. Yang et al. (μP): The original μP framework covers only Adam/SGD; this paper extends it to matrix-preconditioned optimizers.
Rating¶
- Novelty: ⭐⭐⭐⭐ The extension of μP to new optimizers is elegant, though the core framework builds on existing μP theory.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Multiple optimizers × multiple scales × ablations; very comprehensive.
- Writing Quality: ⭐⭐⭐⭐⭐ Clear and systematic; Table 1 provides an at-a-glance summary of scaling rules.
- Value: ⭐⭐⭐⭐⭐ Directly resolves a community controversy and delivers actionable practical guidance.