COAP: Memory-Efficient Training with Correlation-Aware Gradient Projection
Background & Motivation
Training large language models (LLMs) is severely constrained by memory. Taking LLaMA-7B as an example, the model parameters alone occupy approximately 14GB in FP16, and the Adam optimizer maintains two state tensors of the same size as the parameters (the first- and second-moment estimates, i.e., momentum and variance), so optimizer states consume an additional 28GB of VRAM. This makes training medium-to-large models on a single consumer-grade GPU (such as an RTX 4090 with 24GB) almost impossible.
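The arithmetic behind these figures can be reproduced with a short sketch. It assumes a round 7e9 parameters, 2 bytes per value (FP16), and 1 GB = 1e9 bytes; the helper name `adam_memory_gb` is purely illustrative.

```python
def adam_memory_gb(n_params: float, bytes_per_value: int = 2) -> dict:
    """Estimate weight and Adam-state memory in GB (1 GB = 1e9 bytes)."""
    weights = n_params * bytes_per_value / 1e9
    # Adam keeps two tensors per parameter: first and second moment.
    states = 2 * weights
    return {"weights_gb": weights, "adam_states_gb": states}

# Assumed round parameter count for a 7B model.
estimate = adam_memory_gb(7e9)
print(estimate)
```

Under these assumptions the optimizer states alone exceed the 24GB of a single consumer GPU, before counting weights, gradients, or activations.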
Existing memory-efficient training methods mainly fall into two categories:
Low-Rank Optimizers (e.g., GaLore, Flora): Project gradients into a low-rank subspace to shrink the optimizer states. However, GaLore-style methods refresh the projection matrix via SVD, which incurs very high computational overhead.
Quantized Optimizers (e.g., Q-Adam): Quantize and compress optimizer states. However, quantization errors accumulate, affecting training quality.
The core problem of low-rank projection methods lies in how the projection matrix \(P_t\) is updated. GaLore performs a full SVD on the current gradient every \(T\) steps to update \(P_t\). For large matrices this is very slow (approximately 540 seconds per full SVD for a 7B model), which severely slows training.
The core insight of this paper: projection matrices from adjacent update cycles are highly correlated. Exploiting this property, the expensive full SVD can be replaced by low-cost incremental updates.
Method
Problem Formulation
Standard low-rank gradient projection maps the \(m \times n\) gradient matrix \(G_t\) into a rank-\(r\) subspace:
\[\hat{G}_t = P_t^\top G_t \in \mathbb{R}^{r \times n},\]
where \(P_t \in \mathbb{R}^{m \times r}\) is the projection matrix. Optimizer states (momentum, variance) are maintained in the low-rank space \(\mathbb{R}^{r \times n}\), reducing the optimizer-state memory from \(O(mn)\) to \(O(rn + mr)\).
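As a rough illustration of the formulation above, the following NumPy sketch runs one Adam step with both moments stored in the projected \(r \times n\) space and the update mapped back through \(P_t\). The function name `projected_adam_step`, the hyperparameters, and the random orthonormal `P` are assumptions made for the example, not details taken from the paper.

```python
import numpy as np

def projected_adam_step(W, G, P, m_state, v_state, step, lr=1e-3,
                        beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step with moments stored in the r x n projected space."""
    R = P.T @ G                                    # (r, n) projected gradient
    m_state = beta1 * m_state + (1 - beta1) * R
    v_state = beta2 * v_state + (1 - beta2) * R**2
    m_hat = m_state / (1 - beta1**step)            # bias correction
    v_hat = v_state / (1 - beta2**step)
    update = P @ (m_hat / (np.sqrt(v_hat) + eps))  # back to (m, n)
    return W - lr * update, m_state, v_state

rng = np.random.default_rng(0)
m, n, r = 64, 32, 4
W = rng.standard_normal((m, n))
G = rng.standard_normal((m, n))
# Hypothetical projection: a random matrix with orthonormal columns.
P, _ = np.linalg.qr(rng.standard_normal((m, r)))
m_state = np.zeros((r, n))
v_state = np.zeros((r, n))
W_new, m_state, v_state = projected_adam_step(W, G, P, m_state, v_state, step=1)

full_state_values = 2 * m * n              # two full-size Adam moments
low_rank_values = 2 * r * n + m * r        # two projected moments plus P
print(m_state.shape, W_new.shape, full_state_values, low_rank_values)
```

For these toy dimensions the stored values drop from \(2mn\) to \(2rn + mr\), which is the \(O(mn)\) to \(O(rn + mr)\) reduction described above.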
Correlation-Aware Projection Update
The core innovation of COAP is dividing the update of the projection matrix into two phases:
Phase 1: SGD Incremental Update (Executed Every Step)
Exploiting the correlation between successive projections, COAP updates the projection matrix \(P_t\) incrementally with a simple SGD step.
The computational complexity of this update is only \(O(mr)\), which is far smaller than the \(O(m^2n)\) of SVD.
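To make the idea of an incremental update concrete, here is a minimal sketch that takes gradient-descent steps on a plain reconstruction loss \(\lVert G - P P^\top G \rVert_F^2\). This objective is an assumption chosen for illustration: the paper's actual update rule is not reproduced here, and this naive version makes no attempt to match the complexity quoted above. The names `recon_loss` and `sgd_projection_step` are hypothetical.

```python
import numpy as np

def recon_loss(P, G):
    """Squared Frobenius error of reconstructing G from its projection."""
    return np.linalg.norm(G - P @ (P.T @ G)) ** 2

def sgd_projection_step(P, G, lr=1e-5):
    """One gradient-descent step on recon_loss with respect to P."""
    E = G - P @ (P.T @ G)                          # residual, shape (m, n)
    grad = -2.0 * (E @ (G.T @ P) + G @ (E.T @ P))  # d(loss)/dP, shape (m, r)
    return P - lr * grad

rng = np.random.default_rng(0)
m, n, r = 64, 32, 4
G = rng.standard_normal((m, n))
P, _ = np.linalg.qr(rng.standard_normal((m, r)))   # arbitrary starting basis

loss_before = recon_loss(P, G)
for _ in range(10):
    P = sgd_projection_step(P, G)
loss_after = recon_loss(P, G)
print(loss_before > loss_after)
```

The point of the sketch is only that a few cheap first-order steps move \(P\) toward a better subspace without any matrix factorization.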
Phase 2: Occasional Low-Cost SVD (Executed Every \(T\) Steps)
Every \(T\) steps, a warm-start SVD is executed: using the current \(P_t\) as initialization, a partial SVD is performed on the gradient. Since the initialization is already close to the optimal solution, convergence requires very few iterations.
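One generic way to realize a warm-started partial SVD is orthogonal (subspace) iteration initialized from the current basis. The sketch below uses that technique as a stand-in; it is not claimed to be the paper's exact procedure, and `warm_start_subspace`, `subspace_error`, and the synthetic spectrum are assumptions for the example.

```python
import numpy as np

def warm_start_subspace(G, P_init, n_iter=2):
    """Refine an orthonormal basis toward the top-r left singular subspace of G."""
    P = P_init
    for _ in range(n_iter):
        # One orthogonal iteration on G G^T, without ever forming G G^T.
        P, _ = np.linalg.qr(G @ (G.T @ P))
    return P

def subspace_error(P, U):
    """Frobenius distance between the projectors onto span(P) and span(U)."""
    return np.linalg.norm(P @ P.T - U @ U.T)

rng = np.random.default_rng(0)
m, n, r = 60, 40, 4
# Synthetic matrix with a known spectrum: 4 dominant singular values.
U, _ = np.linalg.qr(rng.standard_normal((m, n)))
V, _ = np.linalg.qr(rng.standard_normal((n, n)))
s = np.concatenate([[10.0, 9.0, 8.0, 7.0], np.full(n - r, 0.5)])
G = (U * s) @ V.T
U_r = U[:, :r]                                     # true top-r subspace

# Warm start: the true basis plus a perturbation, then re-orthonormalized.
P0, _ = np.linalg.qr(U_r + 0.1 * rng.standard_normal((m, r)))
P1 = warm_start_subspace(G, P0, n_iter=2)
err_before = subspace_error(P0, U_r)
err_after = subspace_error(P1, U_r)
print(err_after < err_before)
```

Because the starting basis is already near the target subspace and the spectrum has a clear gap, two iterations are enough here; a cold start or a flatter spectrum would need more.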
| Metric | GaLore SVD | COAP Warm-Start SVD | Speedup |
|---|---|---|---|
| Time per SVD call (LLaMA-7B) | ~540s | ~23s | ~20× |
| Update Frequency | Every 200 steps | Every 200 steps | - |
| Amortized Overhead per Step | 2.7s | 0.12s | ~23× |
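The amortized figures follow from dividing the cost of one SVD call by the update interval. A small sketch of that arithmetic, using only the numbers in the table (the helper name `amortized_overhead` is illustrative):

```python
def amortized_overhead(svd_seconds: float, update_interval: int) -> float:
    """Average per-step cost of an operation run once every update_interval steps."""
    return svd_seconds / update_interval

galore = amortized_overhead(540.0, 200)   # full SVD every 200 steps
coap = amortized_overhead(23.0, 200)      # warm-start SVD every 200 steps
ratio = galore / coap
print(f"{galore:.2f}s vs {coap:.3f}s per step, ratio {ratio:.1f}x")
```

Since both methods use the same update interval, the amortized ratio equals the per-call ratio, which comes out at a little over 23× from these rounded timings.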
Inter-Projection Correlation Analysis
The paper verifies experimentally that projection matrices from adjacent update cycles are highly correlated.
This observation provides the empirical basis for the SGD incremental updates: the projection subspace changes slowly, so a small-step incremental update can track the optimal subspace.
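One possible way to quantify how close two projection subspaces are is the mean squared cosine of their principal angles, which is 1 for identical subspaces and 0 for orthogonal ones. The metric and the name `subspace_overlap` are assumptions for illustration; the measure used in the paper is not specified here.

```python
import numpy as np

def subspace_overlap(P_a, P_b):
    """Mean squared cosine of principal angles between two orthonormal bases."""
    r = P_a.shape[1]
    # ||P_a^T P_b||_F^2 equals the sum of squared cosines of the principal angles.
    return np.linalg.norm(P_a.T @ P_b) ** 2 / r

rng = np.random.default_rng(0)
m, r = 64, 4
Q, _ = np.linalg.qr(rng.standard_normal((m, 2 * r)))
P_a = Q[:, :r]                 # reference basis
P_orth = Q[:, r:]              # basis orthogonal to P_a by construction
# A slowly drifting basis: small perturbation, then re-orthonormalized.
P_b, _ = np.linalg.qr(P_a + 0.05 * rng.standard_normal((m, r)))

same = subspace_overlap(P_a, P_a)
drifted = subspace_overlap(P_a, P_b)
orthogonal = subspace_overlap(P_a, P_orth)
print(same, drifted, orthogonal)
```

Tracking a score like this between consecutive projection updates would show whether the subspace drifts slowly enough for incremental updates to keep up.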
Memory Analysis
| Method | Optimizer Memory (LLaMA-1B) | Relative to Standard Adam |
|---|---|---|
| Adam (FP16) | 4.0 GB | 100% |
| Adam (BF16) | 4.0 GB | 100% |
| GaLore (r=256) | 1.8 GB | 45% |
| Flora (r=256) | 1.6 GB | 40% |
| COAP (r=256) | 1.56 GB | 39% |
COAP achieves a 61% reduction in optimizer memory relative to standard Adam (1.56 GB vs. 4.0 GB).
Experimental Results
LLaMA Pre-training
| Method | LLaMA-1B PPL↓ | LLaMA-7B PPL↓ | Training Speed (relative to Adam) |
|---|---|---|---|
| Adam | 14.89 | 12.31 | 1× |
| GaLore | 16.12 | 13.05 | 0.72× |
| Flora | 15.98 | 12.87 | 0.81× |
| COAP | 15.56 | 12.58 | 0.93× |
LLaVA-7B Fine-Tuning
| Method | Training Time | Accuracy | GPU Memory |
|---|---|---|---|
| LoRA | 12.3h | 88.1% | 18GB |
| Full fine-tuning (Adam) | 47.1h | 82.4% | 62GB |
| GaLore | 15.2h | 87.3% | 24GB |
| COAP | 7.6h | 92.3% | 22GB |
COAP achieves a 6.2× speedup over full fine-tuning with Adam (7.6h vs. 47.1h) on LLaVA-7B, while improving accuracy from 82.4% to 92.3%.
Downstream Task Evaluation
| Task | Adam | GaLore | COAP |
|---|---|---|---|
| MMLU (5-shot) | 46.2 | 43.8 | 45.7 |
| HellaSwag | 72.1 | 69.4 | 71.5 |
| ARC-Challenge | 41.3 | 38.9 | 40.8 |
| WinoGrande | 67.4 | 65.1 | 66.9 |
Summary & Outlook
Building on the observed high correlation between successive projection matrices, COAP uses an efficient two-phase projection update strategy: SGD incremental updates plus an occasional warm-start SVD. This design reduces the computational overhead of SVD by approximately 20× while maintaining projection quality comparable to full SVD. It achieves a PPL of 15.56 and a 61% reduction in optimizer memory during LLaMA-1B pre-training, and in LLaVA-7B fine-tuning it delivers a 6.2× speedup and a 9.9-percentage-point accuracy improvement over full fine-tuning with Adam.