Hyperparameter Transfer Enables Consistent Gains of Matrix-Preconditioned Optimizers Across Scales¶
Conference: NeurIPS 2025 | arXiv: 2512.05620 | Code: To be confirmed | Area: LLM/NLP | Keywords: optimizer scaling, μP, Shampoo, SOAP, Muon, hyperparameter transfer, matrix preconditioning
TL;DR¶
This paper investigates the hyperparameter scaling rules for matrix-preconditioned optimizers (Shampoo/SOAP/Muon) with respect to model width and depth under the μP framework, and demonstrates that correct hyperparameter scaling is the key to achieving consistent speedups. Using μP with \(1/\text{width}\) weight decay, all three optimizers consistently achieve approximately \(1.4\times\) speedup on Llama models ranging from 190M to 1.4B parameters.
Background & Motivation¶
Background: Several matrix-preconditioned optimizers (Shampoo, SOAP, Muon) have demonstrated significant speedups over AdamW (1.5–2×) in small-scale experiments. Muon has already been adopted in trillion-parameter-scale production training runs (most prominently Moonshot AI's Kimi K2).
Limitations of Prior Work: Reproduction attempts have reported highly inconsistent speedups—some teams report 2× acceleration, others only 1.1×, and some find that the gains vanish rapidly as scale increases. The root cause is the lack of reliable hyperparameter scaling rules.
Key Challenge: The loss-compute scaling exponent in language modeling is small (~0.05), meaning a 2% loss difference corresponds to a 40% compute difference. Grid search over hyperparameters is infeasible at large scale, making reliable hyperparameter transfer essential.
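As a rough sanity check of this arithmetic (a sketch assuming the usual power-law form of the loss-compute curve, with the exponent quoted above): if \(L(C) \propto C^{-\alpha}\) with \(\alpha \approx 0.05\), then to first order

\[
\frac{\Delta L}{L} \approx \alpha\,\frac{\Delta C}{C}
\quad\Longrightarrow\quad
\frac{\Delta C}{C} \approx \frac{1}{\alpha}\cdot\frac{\Delta L}{L} \approx \frac{0.02}{0.05} = 0.4,
\]

so a 2% gap in loss is worth on the order of 40% of the training compute, which is why mis-tuned hyperparameters can easily mask or mimic an optimizer's true speedup.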
Goal: Derive μP learning rate scaling rules for Shampoo/SOAP/Muon and empirically verify that correct scaling is critical for consistent speedups across model scales.
Key Insight: Apply μP (Maximal Update Parameterization) theory to derive per-optimizer scaling rules for learning rate and weight decay as functions of model width and depth.
Core Idea: The speedups from matrix-preconditioned optimizers are genuine, but correct μP scaling is required for them to manifest reliably at large scale.
Method¶
Overall Architecture¶
For each matrix-preconditioned optimizer, the paper derives μP scaling rules (governing how learning rate, weight decay, and regularization parameters scale with model width/depth), and validates these rules on language models ranging from 190M to 1.4B parameters.
Key Designs¶
- μP Derivation (General Procedure):
    - For each weight matrix \(W \in \mathbb{R}^{n \times m}\), ensure that the RMS of the update \(\Delta W\) scales so that the induced change in the layer's outputs remains \(\Theta(1)\) as width \(\to \infty\).
    - Because each optimizer preconditions differently, the learning-rate scaling differs: Adam scales as \(1/d_{\text{in}}\), Muon as \(\sqrt{d_{\text{out}}/d_{\text{in}}}\), and Shampoo's rule depends on its preconditioner exponents \(e_L, e_R\) (see the code sketch after this list).
- Mitigating Finite-Width Bias:
    - μP guarantees learning-rate transfer only in the infinite-width limit; at practical finite widths the optimal learning rate can still drift.
    - Blocking (partitioning large weight matrices into smaller blocks) and spectral normalization effectively mitigate this drift.
    - Grafting (taking one optimizer's update direction but rescaling it to the update norm of another optimizer) also affects the scaling behavior.
- Weight Decay Scaling:
    - The paper finds that an independent weight decay (decoupled from the learning rate) scaled as \(1/\text{width}\) is near-optimal across all optimizers.
    - This is consistent with the recommendation of Xiao (2024).
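The width scalings above can be summarized in a few lines of code. Below is a minimal sketch, not the paper's implementation: the helper names (`mup_lr`, `mup_weight_decay`) and their signatures are illustrative, only the Adam and Muon learning-rate rules and the \(1/\text{width}\) weight decay quoted above are encoded, and Shampoo's exponent-dependent rule is deliberately omitted.

```python
import math


def mup_lr(base_lr: float, d_in: int, d_out: int,
           base_d_in: int, base_d_out: int, optimizer: str) -> float:
    """Rescale a hidden-layer learning rate tuned at a proxy shape
    (base_d_in, base_d_out) to a target shape (d_in, d_out).

    Implements only the width scalings quoted above:
      - Adam/AdamW: lr proportional to 1 / d_in
      - Muon:       lr proportional to sqrt(d_out / d_in)
    Shampoo's rule additionally depends on its preconditioner
    exponents (e_L, e_R) and is not reproduced here.
    """
    if optimizer in ("adam", "adamw"):
        return base_lr * base_d_in / d_in
    if optimizer == "muon":
        return base_lr * math.sqrt(d_out / d_in) / math.sqrt(base_d_out / base_d_in)
    raise ValueError(f"no scaling rule implemented for {optimizer!r}")


def mup_weight_decay(base_wd: float, width: int, base_width: int) -> float:
    """Independent (learning-rate-decoupled) weight decay scaled as 1/width."""
    return base_wd * base_width / width
```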
Loss & Training¶
- Llama-architecture language models (190M, 470M, 1B, 1.4B) are trained on the FineWeb dataset.
- Hyperparameters are tuned on small models and transferred to large models via the derived μP scaling rules.
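To make the transfer step concrete, here is a hypothetical usage of the helpers sketched above; the proxy/target widths and base hyperparameter values are made up for illustration and are not the paper's configurations.

```python
# Hypothetical transfer from a small proxy model to a wider target model.
# Only the scaling rules themselves come from the summary above.
proxy_width, target_width = 1024, 2048

# Muon on a square hidden layer: sqrt(d_out/d_in) = 1 at both widths,
# so the tuned learning rate transfers unchanged.
muon_lr = mup_lr(base_lr=2e-2, d_in=target_width, d_out=target_width,
                 base_d_in=proxy_width, base_d_out=proxy_width, optimizer="muon")

# AdamW on the same layer: the 1/d_in rule halves the learning rate when width doubles.
adamw_lr = mup_lr(base_lr=3e-3, d_in=target_width, d_out=target_width,
                  base_d_in=proxy_width, base_d_out=proxy_width, optimizer="adamw")

# Independent weight decay: the 1/width rule also halves it.
wd = mup_weight_decay(base_wd=0.1, width=target_width, base_width=proxy_width)
```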
Key Experimental Results¶
Main Results¶
Compute-equivalent speedups relative to AdamW on Llama models from 190M to 1.4B:
| Optimizer | μP Scaling | 190M Speedup | 470M Speedup | 1B Speedup | 1.4B Speedup |
|---|---|---|---|---|---|
| Muon | ✅ | ~1.4× | ~1.4× | ~1.4× | ~1.4× |
| SOAP | ✅ | ~1.4× | ~1.4× | ~1.4× | ~1.4× |
| Shampoo | ✅ | ~1.4× | ~1.4× | ~1.3× | ~1.3× |
| Muon | ❌ (SP) | ~1.4× | ~1.2× | ~1.0× | Vanishes |
Ablation Study¶
| Configuration | Result | Notes |
|---|---|---|
| μP vs. standard parameterization | Consistent speedup under μP; vanishes under SP as scale increases | Hyperparameter transfer is critical |
| Blocking (128) | Mitigates finite-width bias | Practical and effective |
| Spectral normalization | Reduces learning rate sensitivity | Complementary to μP |
| WD=\(1/\text{width}\) vs. fixed WD | \(1/\text{width}\) is near-optimal | Consistent across optimizers |
| Compute-optimal model size | Matrix-preconditioned optimizers favor larger models | Different scaling law from AdamW |
Key Findings¶
- Incorrect hyperparameter scaling is the primary cause of prior reproduction failures: Under standard parameterization, the speedups from Muon/SOAP nearly vanish on models at the 1B+ scale.
- A \(1.4\times\) speedup is consistent and reliable: Under correct μP scaling, all three optimizers consistently achieve approximately \(1.4\times\) compute savings across all tested scales.
- μP scaling differs across optimizers: Directly applying Adam's μP rules to Muon is a common and consequential mistake.
Highlights & Insights¶
- Systematically resolves the community debate on whether matrix-preconditioned optimizers are beneficial—the answer is yes, but correct scaling is required. Prior inconsistent results can be attributed to hyperparameter misconfiguration.
- Generality of the μP derivation procedure: The paper provides a simple, general workflow for deriving scaling rules for any new optimizer.
- Practical guidance: Complete scaling formulas for each optimizer are provided (Table 1) and are directly applicable.
Limitations & Future Work¶
- Largest model tested is only 1.4B: Validation on models at 10B+ scale is absent.
- Only language modeling is evaluated: Other tasks such as vision and multimodal settings are not addressed.
- Relatively conservative speedup (\(1.4\times\)): This is more modest than the \(2\times\) reported in some prior work, possibly because the AdamW baseline is more carefully tuned.
- Future directions: Validation at larger scales; analysis of interactions with learning rate warmup/cooldown schedules; development of automated hyperparameter transfer tools.
Related Work & Insights¶
- vs. Liu et al. (2024, M-Muon): They report a \(2\times\) speedup for Muon without μP; this may result from an insufficiently tuned AdamW baseline.
- vs. Wen et al. (2024): They report that speedups vanish with scale—the present paper explains this as a consequence of hyperparameter drift under standard parameterization.
- vs. Yang et al. (μP): The original μP framework covers only Adam/SGD; this paper extends it to matrix-preconditioned optimizers.
Rating¶
- Novelty: ⭐⭐⭐⭐ The extension of μP to new optimizers is elegant, though the core framework builds on existing μP theory.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Multiple optimizers × multiple scales × ablations; very comprehensive.
- Writing Quality: ⭐⭐⭐⭐⭐ Clear and systematic; Table 1 provides an at-a-glance summary of scaling rules.
- Value: ⭐⭐⭐⭐⭐ Directly resolves a community controversy and delivers actionable practical guidance.