Subspace Optimization for Large Language Models with Convergence Guarantees¶
Conference: ICML 2025
arXiv: 2410.11289
Code: https://github.com/pkumelon/Golore
Area: Optimization
Keywords: subspace optimization, LLM training, GaLore, low-rank projection, convergence guarantee
TL;DR¶
This paper reveals that GaLore (a subspace optimization algorithm) does not always converge under stochastic settings, and proposes GoLore (Gradient Random Low-rank Projection)—a provably convergent variant that guarantees convergence even under standard batch sizes.
Background & Motivation¶
Background: Pre-training and fine-tuning Large Language Models (LLMs) demand massive memory footprint. Subspace optimization algorithms such as GaLore (Zhao et al., 2024) reduce the memory usage of optimizer states by projecting gradients onto a low-rank subspace, achieving good empirical results.
Limitations of Prior Work: Although GaLore performs well in practice, its convergence lacks theoretical guarantees. Especially under stochastic gradient settings (standard mini-batch SGD), whether GaLore can always converge to the optimal solution remains an open fundamental question.
Key Challenge: GaLore selects the projection subspace through SVD factorization of the gradients. However, under stochastic settings, SVD factorizes stochastic gradients rather than true gradients, introducing systemic bias where the projection direction is inconsistent with the true gradient direction. Can this bias be eliminated?
Goal: (1) Delineate the convergence limits of GaLore, and (2) provide improved schemes that guarantee convergence.
Key Insight: Construct a counterexample to prove that GaLore does not converge, analyze the sufficient conditions for convergence, and design a new algorithm to circumvent projection bias.
Core Idea: Use random projection (instead of SVD-based, data-dependent projection) to avoid systemic bias, thereby ensuring convergence.
Method¶
Overall Architecture¶
Standard subspace optimization workflow: 1. Compute the mini-batch gradient \(\hat{G}\) 2. Project the gradient to a low-rank subspace: \(\hat{G}_r = P \hat{G}\) (\(P\) is the projection matrix) 3. Update parameters in the low-rank subspace using Adam/SGD 4. Periodically update the projection subspace
GaLore: \(P\) is determined by the SVD of \(\hat{G}\) (data-dependent, biased)
GoLore: \(P\) is generated by a random matrix (data-independent, unbiased)
Key Designs¶
-
Counterexample for GaLore Divergence:
- Function: Constructs a simple convex optimization problem where GaLore fails to converge to the optimal solution.
- Mechanism: In a 2D problem, when the true gradient components are close along the two axes but have different noise levels, the projection direction chosen by SVD systematically deviates from the true gradient direction.
- Key Formula: \(\mathbb{E}[P(\hat{G})\hat{G}] \neq P(G)G\), meaning the expectation of the projected gradient does not equal the projection of the true gradient.
- Design Motivation: Uncovers the theoretical defect of GaLore, pointing the way toward improvements.
-
Conditions for GaLore Convergence:
- Function: Identifies special conditions under which GaLore can still converge.
- Mechanism: GaLore converges in two scenarios: (i) using a sufficiently large mini-batch (reducing SVD directional bias), or (ii) gradient noise is isotropic (where the SVD directions are unaffected by noise).
- Design Motivation: Explains why GaLore often performs well in practice (large batch size + approximately isotropic noise).
-
GoLore: Gradient Random Low-rank Projection:
- Function: Replaces SVD projection with random projection.
- Mechanism: The projection matrix \(P\) is sampled from a random matrix distribution (updated every \(T_0\) steps), ensuring \(\mathbb{E}[P \hat{G}] \propto G\).
- Key Formula: \(P = Q Q^\top\), where \(Q\) consists of the first \(r\) columns of a random orthogonal matrix.
- Design Motivation: Random projection is data-independent, thus introducing no systemic bias. Although a single projection may discard certain components of the true gradient, it is unbiased in expectation.
Loss & Training¶
Standard LLM training loss (cross-entropy on next-token prediction). GoLore can be combined with the Adam optimizer, merely requiring the replacement of SVD with random projection in the projection module.
Key Experimental Results¶
Main Results¶
| Model/Task | Metric | GoLore | GaLore | Full-rank Adam | LoRA |
|---|---|---|---|---|---|
| LLaMA-130M Pre-training | Perplexity | 24.8 | 25.3 | 24.2 | 26.1 |
| LLaMA-350M Pre-training | Perplexity | 18.6 | 19.1 | 18.1 | 19.8 |
| LLaMA-7B Fine-tuning (GLUE) | Average Accuracy | 86.2% | 85.8% | 86.9% | 85.1% |
| Memory Footprint (350M) | GPU Memory | 4.2 GB | 4.2 GB | 8.7 GB | 5.1 GB |
Ablation Study¶
| Configuration | Perplexity (130M) | Description |
|---|---|---|
| GoLore (rank=128) | 24.8 | Best trade-off |
| GoLore (rank=64) | 25.9 | Information lost due to excessively low rank |
| GoLore (rank=256) | 24.5 | Close to full-rank but with increased memory |
| GaLore (rank=128, batch=512) | 25.0 | GaLore improves under large batch sizes |
| GaLore (rank=128, batch=64) | 26.2 | GaLore degrades under small batch sizes |
Key Findings¶
- GaLore indeed exhibits instability under small batch settings, validating the theoretical analysis.
- GoLore performs stably across all batch sizes, with performance close to full-rank Adam.
- GoLore shares the same memory footprint as GaLore (as random projection does not require extra memory).
- The projection update frequency \(T_0\) has little impact on performance; updates every 50-200 steps are sufficient.
Highlights & Insights¶
- High Consistency Between Theory and Practice: The counterexample and theoretical conditions precisely explain the empirical behavior of GaLore.
- Simple and Elegant Remedy: Random projection is the simplest possible fix and does not introduce additional computational or memory overhead.
- Practical Value: GoLore provides a theoretically guaranteed solution for LLM training under memory constraints.
Limitations & Future Work¶
- Random projection might be less efficient in information utilization compared to SVD, potentially requiring larger rank sizes.
- The theoretical analysis primarily focuses on convex cases; the non-convex optimization landscape of LLM training is not yet fully covered.
- The integration of GoLore with other memory-efficient methods (such as QLoRA and quantized training) has not been explored.
- Pre-training experiments on larger models (7B+) are limited.
Related Work & Insights¶
- GaLore (Zhao et al., 2024): The direct target of improvement in this work.
- LoRA (Hu et al., 2022): Another mainstream low-rank optimization methodology.
- GoLore's random projection approach can be extended to other optimization problems involving subspace constraints.
Rating¶
- Novelty: ⭐⭐⭐⭐ The combination of the counterexample and the proposed solution is highly elegant.
- Experimental Thoroughness: ⭐⭐⭐⭐ Comprehensively covers synthetic tasks to LLM pre-training/fine-tuning.
- Writing Quality: ⭐⭐⭐⭐⭐ Clear connection between theory and experiments, with a smooth narrative line.
- Value: ⭐⭐⭐⭐⭐ Highly important for both the theory and practice of subspace optimization.