# Convergence of Muon with Newton-Schulz
**Conference:** ICLR 2026 · **arXiv:** 2601.19156 · **Code:** To be confirmed · **Area:** Optimization / Theory · **Keywords:** Muon optimizer, Newton-Schulz, polar decomposition, matrix optimization, convergence analysis
## TL;DR
This work provides the first convergence guarantees for the Muon optimizer as it is actually used in practice, i.e. with Newton-Schulz (NS) approximation rather than exact SVD-based polar decomposition. It proves that the convergence rate matches the idealized SVD variant up to a constant factor \(C_q\) that approaches 1 doubly exponentially fast in the number of NS iterations \(q\), and that Muon enjoys a \(\sqrt{r}\) advantage over its vector-space counterpart SGD-M by avoiding the rank-induced loss.
## Background & Motivation
Background: The Muon optimizer updates matrix parameters by orthogonalizing the momentum matrix—rather than vectorizing it as Adam does—and has demonstrated strong empirical performance in LLM training. In practice, it employs Newton-Schulz (NS) iterations to approximate the polar decomposition, avoiding the costly SVD.
Limitations of Prior Work: Existing theoretical analyses of Muon (Shen et al.; Li & Hong) replace NS with exact SVD—yet SVD is never used in practice. It remains unclear how the NS approximation error affects convergence, and how many NS steps suffice.
Key Challenge: Muon with only a few NS steps achieves SVD-level performance empirically (with superior wall-clock time), yet no theoretical framework covers this regime—practice has far outpaced theory.
Key Insight: Directly analyze the polar decomposition approximation error \(\varepsilon_q\) induced by NS iterations, and prove that it decays doubly exponentially in the number of steps.
Core Idea: The NS approximation error \(\varepsilon_q\) decays doubly exponentially → a few NS steps suffice to bring Muon's convergence rate to the SVD level → each step costs far less than SVD → superior wall-clock time.
## Method
### Overall Architecture
At each step, Muon performs: (1) stochastic gradient computation \(G_t\); (2) momentum update \(M_t = \beta M_{t-1} + G_t\); (3) pre-scaling \(X_{t,0} = M_t/\alpha_t\); (4) \(q\) NS iterations \(X_{t,j} = p_\kappa(X_{t,j-1}X_{t,j-1}^\top)X_{t,j-1}\) to approximate orthogonalization; (5) parameter update \(W_t = W_{t-1} - \eta O_t\), where \(O_t = X_{t,q}\) is the NS approximation of the polar factor of \(M_t\). The analysis targets \(\frac{1}{T}\sum_t \mathbb{E}[\|\nabla f(W_{t-1})\|_*] \leq \epsilon\).
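The five steps above can be sketched in a few lines of NumPy. This is an illustrative sketch, not the paper's implementation: it uses the plain cubic NS polynomial \(1.5X - 0.5XX^\top X\) (the \(\kappa = 1\) member of the \(p_\kappa\) family), a hypothetical Frobenius-norm pre-scaling in place of \(\alpha_t\), and placeholder hyperparameter values.

```python
import numpy as np

def muon_step(W, M_prev, grad, lr=0.02, beta=0.95, q=5):
    """One Muon update, sketched with the plain cubic Newton-Schulz map.

    Hypothetical simplifications vs. the paper: the cubic polynomial
    1.5*X - 0.5*X X^T X (the kappa = 1 member of the p_kappa family), and
    Frobenius-norm pre-scaling standing in for alpha_t.
    """
    M = beta * M_prev + grad                 # (2) momentum update
    X = M / (np.linalg.norm(M) + 1e-12)      # (3) pre-scale: singular values in (0, 1)
    for _ in range(q):                       # (4) q NS iterations push them toward 1
        X = 1.5 * X - 0.5 * (X @ X.T) @ X
    O = X                                    # approximate polar factor of M
    return W - lr * O, M                     # (5) parameter update
```

Because the NS map \(x \mapsto 1.5x - 0.5x^3\) sends \((0,1]\) into \((0,1]\), the update direction always has singular values bounded by 1, regardless of how few NS steps are run.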
### Key Designs (Theoretical Contributions)
- Theorem 1 (non-convex convergence of NS-Muon): Muon with \(q\) NS steps reaches an \(\epsilon\)-stationary point in \(T = O\left(\frac{C_q \cdot L D}{\epsilon^2}\right)\) iterations, where \(C_q\) is the sole constant factor depending on the NS approximation quality.
- Theorem 2 (doubly exponential decay of the polar decomposition approximation error): \(\varepsilon_q \leq \varepsilon_0^{(2\kappa+1)^q}\), i.e. doubly exponential decay in \(q\), accelerated by the polynomial degree \(\kappa\).
    - Implication: \(q = 3\text{–}5\) steps with \(\kappa = 2\text{–}3\) suffice to achieve \(C_q \approx 1\), matching the SVD variant.
    - Wall-clock advantage: NS requires only matrix multiplications (GPU-efficient), whereas SVD costs \(O(mn\min(m,n))\).
- Theorem 3 (rank advantage over SGD-M): Muon converges \(\sqrt{r}\) times faster than SGD-M, where \(r = \min(m,n)\) is the maximal possible rank of an \(m \times n\) parameter matrix.
    - Reason: measuring progress in the nuclear norm exploits low-rank matrix structure, yielding more efficient search directions.
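The doubly exponential decay of Theorem 2 is easy to observe numerically. The sketch below uses the plain cubic NS polynomial (\(\kappa = 1\)), for which one can check that the orthogonality error \(\varepsilon_q = \|I - X_qX_q^\top\|_2\) satisfies \(\varepsilon_{q+1} \leq \varepsilon_q^2\) whenever the spectrum lies in \((0,1]\): already doubly exponential in \(q\). The paper's tuned degree-\((2\kappa+1)\) polynomials contract faster, which is where the \(\varepsilon_0^{(2\kappa+1)^q}\) bound comes from. The test matrix and its singular values here are illustrative choices, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# A 4x4 test matrix with known singular values in (0, 1], so no pre-scaling
# is needed; its exact polar factor is U @ V.T by construction.
U, _ = np.linalg.qr(rng.standard_normal((4, 4)))
V, _ = np.linalg.qr(rng.standard_normal((4, 4)))
X = U @ np.diag([1.0, 0.7, 0.4, 0.2]) @ V.T

# Track the orthogonality error eps_q = ||I - X_q X_q^T||_2 across NS steps.
errors = [np.linalg.norm(np.eye(4) - X @ X.T, ord=2)]
for _ in range(8):
    X = 1.5 * X - 0.5 * (X @ X.T) @ X        # cubic Newton-Schulz step
    errors.append(np.linalg.norm(np.eye(4) - X @ X.T, ord=2))
```

Running this, the error sequence collapses from near 1 to below \(10^{-8}\) in eight steps, with each step at least squaring the previous error.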
## Key Experimental Results
### Main Results (Convergence Comparison)
| Method | Metric | Convergence Rate | Rank Dependence |
|---|---|---|---|
| SGD-M | Frobenius gradient | \(O(1/\sqrt{T})\) | \(\sqrt{r}\) loss |
| Muon (SVD) | Nuclear gradient | \(O(1/\sqrt{T})\) | No \(\sqrt{r}\) loss |
| Muon (NS, \(q\) steps) | Nuclear gradient | \(O(C_q/\sqrt{T})\) | No \(\sqrt{r}\) loss |
### Ablation Study (\(C_q\) vs. Number of NS Steps \(q\))
| NS Steps \(q\) | \(C_q\) (\(\kappa=2\)) | \(C_q\) (\(\kappa=3\)) |
|---|---|---|
| 1 | Large | Moderate |
| 3 | \(\approx 1.01\) | \(\approx 1.001\) |
| 5 | \(\approx 1.0\) | \(\approx 1.0\) |
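The trend in the table follows directly from Theorem 2's bound \(\varepsilon_q \leq \varepsilon_0^{(2\kappa+1)^q}\). A quick sketch of the exponent growth, assuming a hypothetical initial error \(\varepsilon_0 = 0.5\) (the paper does not fix this value):

```python
# Theorem 2 bound: eps_q <= eps_0 ** ((2 * kappa + 1) ** q).
eps0 = 0.5  # hypothetical initial approximation error after pre-scaling
for kappa in (2, 3):
    for q in (1, 3, 5):
        eps_q = eps0 ** ((2 * kappa + 1) ** q)
        # At q = 5 the bound underflows float64 and prints as exactly 0.
        print(f"kappa={kappa} q={q}: eps_q <= {eps_q:.3e}")
```

Even from a mediocre start, \(q = 3\) with \(\kappa = 2\) drives the bound to roughly \(10^{-38}\), and \(\kappa = 3\) to below \(10^{-100}\), which is why \(C_q\) is indistinguishable from 1 in the table's lower rows.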
### Key Findings
- 3–5 NS steps match SVD: \(C_q\) converges doubly exponentially to 1, providing rigorous theoretical justification for the choices made in practice.
- \(\sqrt{r}\) advantage of Muon over SGD-M: The benefit is pronounced for high-rank matrix parameters such as large attention layers.
- Wall-clock advantage explained: NS iterations cost far less than SVD per step while achieving nearly identical iteration counts, yielding lower total runtime.
## Highlights & Insights
- First convergence guarantees for practical Muon: Closes the practice–theory gap; all prior theoretical work implicitly assumed exact SVD.
- Doubly exponential decay as the key insight: \(\varepsilon_q \leq \varepsilon_0^{5^q}\) for \(\kappa=2\); starting from, say, \(\varepsilon_0 = 0.1\), three steps already give \(\varepsilon_3 \leq 10^{-125}\), far below \(10^{-100}\).
- Nuclear norm as the natural metric: Working with the nuclear norm in matrix space naturally aligns with polar decomposition and reveals the rank advantage.
- Implications for future matrix optimizers: The general analysis framework for NS approximation can be extended to other matrix-based optimizers.
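The \(\sqrt{r}\) advantage ultimately rests on the standard norm equivalence \(\|G\|_F \leq \|G\|_* \leq \sqrt{r}\,\|G\|_F\) for an \(m \times n\) matrix with \(r = \min(m,n)\), with equality \(\|G\|_* = \|G\|_F\) in the rank-1 case. A quick numerical check of this equivalence (standard linear algebra, not code from the paper):

```python
import numpy as np

rng = np.random.default_rng(2)

m, n = 8, 5
r = min(m, n)
G = rng.standard_normal((m, n))

s = np.linalg.svd(G, compute_uv=False)
nuc = s.sum()                    # nuclear norm: sum of singular values
fro = np.sqrt((s ** 2).sum())    # Frobenius norm: l2 norm of singular values

# The standard equivalence behind the sqrt(r) factor:
assert fro <= nuc <= np.sqrt(r) * fro
```

A guarantee stated in the nuclear norm is therefore up to \(\sqrt{r}\) stronger than the same rate stated in the Frobenius norm, and the gap is widest exactly when the gradient is far from low-rank.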
## Limitations & Future Work
- This is a purely theoretical contribution with no new experiments (though the paper's explicit goal is to provide theoretical grounding for existing practice).
- The analysis assumes standard smoothness and bounded variance; adaptive methods in the style of Adam are not covered.
- Comparison with second-order methods such as Shampoo/SOAP is not analyzed.
## Related Work & Insights
- vs. Shen et al. / Li & Hong: Prior work analyzes SVD-Muon. This paper is the first to analyze NS-Muon—the only theory that matches actual practice.
- vs. Shampoo/SOAP: Second-order preconditioners that maintain curvature information. Muon is not a second-order method—it orthogonalizes momentum, a distinct mechanism that is potentially complementary.
- vs. Orthogonal-SGDM: Orthogonalization is applied before momentum accumulation. Muon applies momentum first, then orthogonalizes, and replaces SVD with NS.
## Rating
- Novelty: ⭐⭐⭐⭐⭐ First convergence theory for NS-Muon as used in practice; the doubly exponential decay result is elegant.
- Experimental Thoroughness: ⭐⭐⭐ Purely theoretical with no new experiments (justified, as the goal is to provide theoretical explanation for existing empirical practice).
- Writing Quality: ⭐⭐⭐⭐⭐ The research question is clearly stated, theorems build progressively, and the exposition is rigorous and fluent.
- Value: ⭐⭐⭐⭐⭐ Provides much-needed theoretical foundations for one of the most prominent matrix optimizers in current practice.