Taming Momentum: Rethinking Optimizer States Through Low-Rank Approximation¶

Conference: ICLR 2026
arXiv: 2602.24283
Code: github.com/mrflogs/LoRA-Pre
Area: Model Compression / Efficient Optimizers
Keywords: Low-rank optimizer, momentum compression, pretraining efficiency, LoRA, Adam, Muon

TL;DR¶

This work shows that EMA-based momentum updates are equivalent to online gradient descent on a linear regression objective, and builds on this insight to propose LoRA-Pre, a method that compresses optimizer momentum via low-rank factorization for memory-efficient LLM pretraining and fine-tuning. LoRA-Pre achieves the lowest validation perplexity at every tested model scale (60M to 1B) while using only 1/8 the rank required by baseline methods.

Background & Motivation¶

Optimizers such as Adam maintain first- and second-order momentum for every parameter, so the weights plus optimizer states occupy roughly three times the memory of the weights alone.
Existing low-rank optimization methods (GaLore, Flora, Fira, etc.) compress optimizer states by projecting gradients into lower-dimensional subspaces, but this approach has two drawbacks:
- Periodic subspace updates introduce optimization discontinuities and error accumulation.
- These methods cannot adapt instantaneously to shifting gradient subspaces.
There is a need for an efficient momentum compression approach that continuously adapts to the gradient subspace.

Method¶

Core Insight: Momentum as a Covert Online Linear Regressor¶

The EMA momentum update can be rewritten as:

\[m_{t+1} = \underbrace{m_t}_{\text{weight}} - \underbrace{(1-\beta)}_{\text{learning rate}} \cdot \underbrace{(m_t - g)}_{\text{gradient}}\]

This is equivalent to online gradient descent on the objective:

\[\min_m L(m; g) = \frac{1}{2} \|m - g\|_F^2\]

with learning rate $1-\beta$ and loss gradient $m_t - g$.
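The equivalence can be checked numerically. The sketch below compares the usual EMA form $\beta m + (1-\beta) g$ with one gradient step on $\frac{1}{2}\|m - g\|_F^2$; the function names and the random test matrices are illustrative assumptions, not taken from the paper's code.

```python
import numpy as np

def ema_update(m, g, beta):
    """Standard EMA momentum update."""
    return beta * m + (1.0 - beta) * g

def regression_gd_step(m, g, beta):
    """One gradient-descent step on 0.5 * ||m - g||_F^2 with step size 1 - beta."""
    grad = m - g  # gradient of the regression loss with respect to m
    return m - (1.0 - beta) * grad

# Hypothetical shapes and values, chosen only for the check.
rng = np.random.default_rng(0)
m = rng.standard_normal((4, 3))
g = rng.standard_normal((4, 3))
beta = 0.9

same = np.allclose(ema_update(m, g, beta), regression_gd_step(m, g, beta))
print(same)
```

Expanding the second form gives $m - (1-\beta)m + (1-\beta)g = \beta m + (1-\beta) g$, which is why the two functions agree.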

LoRA-Pre: Low-Rank Online Linear Regression¶

First-Order Momentum Compression¶

The full-rank momentum $m \in \mathbb{R}^{p \times q}$ is factorized as $m = m_B \cdot m_A$, where $m_B \in \mathbb{R}^{p \times r}$, $m_A \in \mathbb{R}^{r \times q}$, and $r \ll \min(p,q)$:

\[\min_{m_B, m_A} L(m_B, m_A; g) = \frac{1}{2} \|m_B m_A - g\|_F^2\]

Memory cost is reduced from $p \times q$ to $(p+q) \times r$.
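As a rough illustration of the saving, the snippet below counts stored entries for one momentum matrix. The layer shape and rank are hypothetical values chosen for the arithmetic, not settings reported in the paper.

```python
def momentum_entries(p, q, r=None):
    """Number of stored entries: p*q for full rank, (p+q)*r for the factorized form."""
    if r is None:
        return p * q
    return (p + q) * r

# Hypothetical layer shape and rank.
p, q, r = 4096, 4096, 128
full = momentum_entries(p, q)    # 4096 * 4096 = 16_777_216
low = momentum_entries(p, q, r)  # (4096 + 4096) * 128 = 1_048_576
ratio = full / low               # 16.0
print(full, low, ratio)
```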

Closed-form update rules are derived via Newton's method (Theorem 3.1):

\[m_B \leftarrow (1-\gamma_1) m_B + \gamma_1 g m_A^T (m_A m_A^T)^{-1}\]

\[m_A \leftarrow (1-\gamma_1) m_A + \gamma_1 (m_B^T m_B)^{-1} m_B^T g\]

These updates take the form of EMA and require no backpropagation.
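To see where the EMA form comes from, note that the gradient of the loss with respect to $m_B$ is $(m_B m_A - g) m_A^T$ and the corresponding Hessian block is $m_A m_A^T$, so a Newton step of size $\gamma_1$ gives $m_B - \gamma_1 (m_B m_A - g) m_A^T (m_A m_A^T)^{-1}$, which simplifies to the rule above. A minimal NumPy sketch of one update follows. It assumes both Gram matrices are invertible and that both factors are updated from their old values; the update ordering, the function name, and the shapes are assumptions of this sketch rather than details confirmed by the source.

```python
import numpy as np

def low_rank_momentum_step(m_B, m_A, g, gamma):
    """One EMA-style update of the factors so that m_B @ m_A tracks g.

    Assumes m_A @ m_A.T and m_B.T @ m_B (both r x r) are invertible.
    Both factors are updated from the old values (an assumption of this sketch).
    """
    # g m_A^T (m_A m_A^T)^{-1}, via a linear solve instead of an explicit inverse
    target_B = np.linalg.solve(m_A @ m_A.T, m_A @ g.T).T
    # (m_B^T m_B)^{-1} m_B^T g
    target_A = np.linalg.solve(m_B.T @ m_B, m_B.T @ g)
    new_B = (1.0 - gamma) * m_B + gamma * target_B
    new_A = (1.0 - gamma) * m_A + gamma * target_A
    return new_B, new_A

# Hypothetical sizes; random factors keep the Gram matrices invertible.
rng = np.random.default_rng(0)
p, q, r = 8, 6, 2
m_B = rng.standard_normal((p, r))
m_A = rng.standard_normal((r, q))
g = rng.standard_normal((p, q))

new_B, new_A = low_rank_momentum_step(m_B, m_A, g, gamma=0.1)
print(new_B.shape, new_A.shape)
```

With `gamma=1.0` each factor jumps to the least-squares solution given the other factor, so smaller values of `gamma` interpolate between the old factor and that solution, just as EMA interpolates between the old momentum and the new gradient.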

Second-Order Momentum Compression¶

Challenge: Adam's parameter update requires $\sqrt{v}$, which demands element-wise non-negativity of $v$.

Solution: Reparameterize as $v = (v_B v_A)^{\circ 2}$ (Hadamard square) and optimize:

\[\min_{v_B, v_A} L(v_B, v_A; g) = \frac{1}{2} \|v_B v_A - |g|\|_F^2\]

This guarantees element-wise non-negativity of $v$ while storing only the low-rank factors $v_B$ and $v_A$.
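The reparameterization can be sketched as follows. The summary gives only the objective for $v_B$ and $v_A$, so reusing the first-order update form with $|g|$ as the regression target, along with the rate name `gamma_2` and the helper names, is an assumption of this sketch and not a rule stated in the source.

```python
import numpy as np

def regression_step(B, A, target, gamma):
    """EMA-style step so that B @ A tracks `target` (assumed update form)."""
    new_B = (1.0 - gamma) * B + gamma * np.linalg.solve(A @ A.T, A @ target.T).T
    new_A = (1.0 - gamma) * A + gamma * np.linalg.solve(B.T @ B, B.T @ target)
    return new_B, new_A

def second_moment(v_B, v_A):
    """Hadamard square of the low-rank product: non-negative in every entry."""
    return (v_B @ v_A) ** 2

# Hypothetical sizes and rate.
rng = np.random.default_rng(0)
p, q, r = 8, 6, 2
v_B = rng.standard_normal((p, r))
v_A = rng.standard_normal((r, q))
g = rng.standard_normal((p, q))
gamma_2 = 0.05

v_B, v_A = regression_step(v_B, v_A, np.abs(g), gamma_2)
v = second_moment(v_B, v_A)
denom = np.sqrt(v)  # equals |v_B @ v_A|, so the square root is always defined
print(bool(np.all(v >= 0)))
```

Because $\sqrt{v} = |v_B v_A|$ element-wise, the square root needed by Adam is always defined even though the product $v_B v_A$ itself may contain negative entries.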

Generality¶

LoRA-Pre can be applied to any momentum-based optimizer:
- LoRA-Pre (Adam): compresses both $m$ and $v$
- LoRA-Pre (Muon): compresses the momentum of the Muon optimizer

Key Experimental Results¶

Pretraining: Validation Perplexity of Llama Models on C4 (↓)¶

Model	Full-rank Adam	GaLore	Flora	Fira	LoRA-Pre
60M	baseline	second-best	—	—	best
130M	baseline	second-best	—	—	best
350M	baseline	second-best	—	—	best
1B	baseline	second-best	—	—	best

Rank Efficiency Comparison¶

Method	Rank Required (to achieve comparable performance)
GaLore	baseline rank $r$
Flora	baseline rank $r$
LoRA-Pre	$r/8$

Fine-tuning: MetaMathQA → GSM8K / MATH-500¶

Method	Llama-3.1-8B	Llama-2-7B
Standard LoRA	baseline	baseline
LoRA-Pre	+3.14	+6.17

Ablation Study¶

Component	Effect
First-order compression only	effective but inferior to both orders
First-order + second-order compression	best
Varying rank $r$	robust to rank variation; $r/8$ suffices
Adam vs. Muon variants	both optimizers benefit

Key Findings¶

LoRA-Pre achieves the lowest validation perplexity across all model scales.
Only 1/8 the rank of baseline methods is needed to match or surpass their performance.
The method remains effective in fine-tuning settings, yielding a +6.17 improvement on Llama-2-7B.
Closed-form update rules require no backpropagation, ensuring computational efficiency.
The Hadamard square reparameterization for second-order momentum resolves the non-negativity constraint.

Highlights & Insights¶

Elegant theoretical contribution: The equivalence between EMA and online linear regression reveals a fundamentally new perspective on momentum.
From model compression to optimizer compression: The core idea of LoRA is transferred from model weights to optimizer states.
Continuous subspace adaptation: Unlike periodic-update methods such as GaLore, LoRA-Pre adapts to the gradient subspace at every step.
Exceptional rank efficiency: at 1/8 the rank of baseline methods, LoRA-Pre uses less optimizer memory while matching or surpassing their performance.
Unified framework: The same framework applies to both Adam and Muon, for both pretraining and fine-tuning.

Limitations & Future Work¶

Computing $(m_A m_A^T)^{-1}$ or $(m_B^T m_B)^{-1}$ means inverting an $r \times r$ matrix, which introduces additional overhead when $r$ is large.
The Hadamard reparameterization for second-order momentum introduces approximation error.
Validation is limited to the Llama architecture; cross-architecture generalization remains to be confirmed.
Communication efficiency in distributed training settings is insufficiently analyzed.

Related Work¶

Low-rank pretraining: GaLore (SVD projection), Flora (random projection), Fira (SGD complementary subspace)
Online momentum compression: MLorc, MoFaSGD, ADAPM
Parameter-efficient fine-tuning: LoRA, LoRA+, DoRA, LoFT, LoRA-Pro

Rating¶

Novelty: ⭐⭐⭐⭐⭐ — The EMA = online regression insight is remarkably elegant.
Technical Depth: ⭐⭐⭐⭐⭐ — Rigorous theoretical derivations with clean closed-form solutions.
Experimental Thoroughness: ⭐⭐⭐⭐ — Comprehensive coverage from 60M–1B pretraining to 7B–8B fine-tuning.
Practical Value: ⭐⭐⭐⭐⭐ — Directly reduces LLM training memory; highly deployable.
