COAP: Memory-Efficient Training with Correlation-Aware Gradient Projection

Background & Motivation

Training large language models (LLMs) faces severe memory bottlenecks. Taking LLaMA-7B as an example, the model parameters themselves occupy approximately 14GB (FP16), but the Adam optimizer needs to maintain two sets of state variables of the same size as the parameters (momentum and variance), resulting in an additional 28GB of VRAM being consumed by optimizer states. This makes training medium-to-large models on a single consumer-grade GPU (such as RTX 4090, 24GB) almost impossible.
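
The memory figures above follow from simple arithmetic. A minimal sketch, assuming exactly 7 billion parameters, 2 bytes per value, and 1 GB = 10^9 bytes (the helper name `adam_memory_gb` is ours, for illustration only):

```python
def adam_memory_gb(num_params: float, bytes_per_value: int = 2) -> dict:
    """Estimate parameter and Adam-state memory in GB (1 GB = 1e9 bytes)."""
    params_gb = num_params * bytes_per_value / 1e9
    # Adam keeps two state tensors per parameter: momentum and variance.
    states_gb = 2 * params_gb
    return {"params_gb": params_gb, "adam_states_gb": states_gb}


mem = adam_memory_gb(7e9)
print(mem)  # parameters: 14 GB, Adam states: 28 GB
```

Parameters plus optimizer states already total about 42 GB under these assumptions, before gradients and activations are counted.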

Existing memory-efficient training methods mainly fall into two categories:

Low-Rank Optimizers (e.g., GaLore, Flora): Project gradients into a low-rank subspace to reduce the dimensions of optimizer states. However, SVD-based updates of the projection matrix, as used in GaLore, incur extremely high computational overhead.

Quantized Optimizers (e.g., Q-Adam): Quantize and compress optimizer states. However, quantization errors accumulate, affecting training quality.

The core problem of low-rank projection methods lies in how the projection matrix \(P_t\) is updated. GaLore performs a full SVD on the current gradient every \(T\) steps to update \(P_t\). This operation is extremely time-consuming for large matrices (approximately 540 seconds per full SVD for a 7B model) and severely slows training.

The core insight of this paper: High correlation exists between projection matrices of adjacent update cycles. Leveraging this property, expensive full SVD can be replaced by incremental updates at an extremely low cost.

Method

Problem Formulation

Standard low-rank gradient projection projects the \(m \times n\) gradient matrix \(G_t\) into a rank-\(r\) subspace:

\[\tilde{G}_t = P_t P_t^T G_t\]

where \(P_t \in \mathbb{R}^{m \times r}\) is the projection matrix. Optimizer states (momentum, variance) are maintained in the low-rank space \(\mathbb{R}^{r \times n}\), reducing the memory from \(O(mn)\) to \(O(rn + mr)\).
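
To make the projection concrete, here is a minimal NumPy sketch. It assumes \(P_t\) is taken as the top-\(r\) left singular vectors of the gradient, as in the GaLore-style full-SVD update described earlier; the shapes and the helper names `project` and `project_back` are illustrative and not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, r = 64, 128, 8

G = rng.standard_normal((m, n))           # stand-in for a gradient matrix G_t

# Assumption: P_t holds the top-r left singular vectors of G_t.
U, _, _ = np.linalg.svd(G, full_matrices=False)
P = U[:, :r]                              # shape (m, r), orthonormal columns


def project(P: np.ndarray, G: np.ndarray) -> np.ndarray:
    """Compress the gradient into the r x n space where optimizer states live."""
    return P.T @ G


def project_back(P: np.ndarray, R: np.ndarray) -> np.ndarray:
    """Map a low-rank quantity back to the full m x n space."""
    return P @ R


R = project(P, G)                         # shape (r, n)
G_tilde = project_back(P, R)              # equals P P^T G, shape (m, n)

full_state = m * n                        # entries per optimizer state at full rank
low_rank_state = r * n + m * r            # low-rank state plus the projection matrix
print(R.shape, G_tilde.shape, full_state, low_rank_state)
```

With these toy sizes, one low-rank state plus the projection matrix takes 1536 entries instead of 8192, which is the \(O(rn + mr)\) versus \(O(mn)\) trade-off in miniature.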

Correlation-Aware Projection Update

The core innovation of COAP is dividing the update of the projection matrix into two phases:

Phase 1: SGD Incremental Update (Executed Every Step)

Utilizing the correlation between projections, the projection matrix \(P_t\) is incrementally updated via a simple SGD step:

\[P_{t+1} = P_t - \eta_P \nabla_{P} \mathcal{L}\]

The computational complexity of this update is only \(O(mr)\), which is far smaller than the \(O(m^2n)\) of SVD.
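
The update rule can be sketched in NumPy as follows. The objective \(\mathcal{L}\) is not specified in the text above, so this sketch assumes a plain reconstruction error \(\|G - P P^T G\|_F^2\); that choice, the function names, and the step size are all assumptions made for illustration, and this particular objective touches the full gradient, so its per-step cost is not the figure quoted above.

```python
import numpy as np


def reconstruction_loss(P, G):
    """Assumed objective: squared error of projecting G onto span(P)."""
    return np.linalg.norm(G - P @ (P.T @ G)) ** 2


def sgd_projection_step(P, G, lr):
    """One gradient step P <- P - lr * dL/dP for the assumed objective."""
    MP = G @ (G.T @ P)                    # (G G^T) P without forming G G^T
    grad = -4.0 * MP + 2.0 * MP @ (P.T @ P) + 2.0 * P @ (P.T @ MP)
    return P - lr * grad


rng = np.random.default_rng(0)
m, n, r = 32, 64, 4
G = rng.standard_normal((m, n))
G /= np.linalg.norm(G)                    # unit Frobenius norm keeps a fixed step stable

P, _ = np.linalg.qr(rng.standard_normal((m, r)))   # random orthonormal start
loss_before = reconstruction_loss(P, G)
for _ in range(100):
    P = sgd_projection_step(P, G, lr=0.05)
loss_after = reconstruction_loss(P, G)
print(loss_before, loss_after)
```

Each step nudges \(P\) toward a subspace that captures more of the gradient's energy, so the reconstruction loss should fall without any SVD being computed.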

Phase 2: Occasional Low-Cost SVD (Executed Every \(T\) Steps)

Every \(T\) steps, a warm-start SVD is executed: using the current \(P_t\) as initialization, a partial SVD is performed on the gradient. Since the initialization is already close to the optimal solution, convergence requires very few iterations.
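
The effect of a warm start can be illustrated with a small NumPy sketch. The text does not say which partial-SVD routine is used, so the sketch substitutes plain subspace (orthogonal) iteration seeded with the previous projection; the function names, the synthetic drifting gradient, and the iteration count are assumptions.

```python
import numpy as np


def warm_start_subspace(G, P_init, num_iters=2):
    """Refine an orthonormal basis for the top-r left singular subspace of G.

    Each iteration applies G G^T to the current basis and re-orthonormalizes
    it with a thin QR factorization (subspace iteration).
    """
    P = P_init
    for _ in range(num_iters):
        P, _ = np.linalg.qr(G @ (G.T @ P))
    return P


def subspace_error(P, G):
    """Relative error of projecting G onto span(P)."""
    return np.linalg.norm(G - P @ (P.T @ G)) / np.linalg.norm(G)


rng = np.random.default_rng(0)
m, n, r = 64, 128, 8
# Synthetic gradient with a dominant rank-r component that drifts slightly.
G_old = rng.standard_normal((m, r)) @ rng.standard_normal((r, n))
G_new = G_old + 0.05 * rng.standard_normal((m, n))

P_old = np.linalg.svd(G_old, full_matrices=False)[0][:, :r]    # previous projection
P_warm = warm_start_subspace(G_new, P_old, num_iters=2)        # cheap refinement
P_exact = np.linalg.svd(G_new, full_matrices=False)[0][:, :r]  # full SVD reference

err_warm = subspace_error(P_warm, G_new)
err_exact = subspace_error(P_exact, G_new)
print(err_warm, err_exact)
```

Because the starting basis already nearly spans the target subspace, a couple of iterations should bring the projection error close to that of the full-SVD reference in this toy setting.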

| Operation | GaLore SVD | COAP Warm-Start SVD | Speedup |
|---|---|---|---|
| Time per SVD (LLaMA-7B) | ~540s | ~23s | ~20× |
| Update Frequency | Every 200 steps | Every 200 steps | - |
| Amortized Overhead per Step | 2.7s | 0.12s | ~23× |
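
The amortized figures are the per-SVD time divided by the update interval. A short sketch of that arithmetic, using the times and the 200-step interval quoted in the table (the helper name `amortized_seconds` is ours):

```python
def amortized_seconds(svd_seconds: float, update_interval: int) -> float:
    """Average SVD cost charged to each training step."""
    return svd_seconds / update_interval


galore = amortized_seconds(540, 200)   # 2.7 s per step
coap = amortized_seconds(23, 200)      # 0.115 s per step, quoted as ~0.12 s
print(galore, coap, galore / coap)
```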

Inter-Projection Correlation Analysis

This paper experimentally verifies the high correlation between adjacent projection matrices:

\[\text{sim}(P_t, P_{t+T}) = \frac{\|P_t^T P_{t+T}\|_F}{\|P_t\|_F \|P_{t+T}\|_F} > 0.95\]

This observation provides the empirical justification for the SGD incremental updates: the projection subspace changes slowly, so a small-step incremental update can track the optimal subspace.

Memory Analysis

| Method | Optimizer Memory (LLaMA-1B) | Relative to Standard Adam |
|---|---|---|
| Adam (FP16) | 4.0 GB | 100% |
| Adam (BF16) | 4.0 GB | 100% |
| GaLore (r=256) | 1.8 GB | 45% |
| Flora (r=256) | 1.6 GB | 40% |
| COAP (r=256) | 1.56 GB | 39% |

COAP achieves a 61% reduction in optimizer memory relative to standard Adam (1.56 GB vs. 4.0 GB).
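
The relative figures can be reproduced from the absolute memory numbers in the table. A small sketch (the variable names are ours):

```python
adam_gb = 4.0
optimizer_gb = {"GaLore (r=256)": 1.8, "Flora (r=256)": 1.6, "COAP (r=256)": 1.56}

summary = {}
for name, gb in optimizer_gb.items():
    relative = gb / adam_gb            # fraction of Adam's optimizer memory
    reduction = 1.0 - relative         # fraction saved relative to Adam
    summary[name] = (relative, reduction)
    print(f"{name}: {relative:.0%} of Adam, {reduction:.0%} reduction")
```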

Experimental Results

LLaMA Pre-training

| Method | LLaMA-1B PPL↓ | LLaMA-7B PPL↓ | Relative Training Speed (tokens/s) |
|---|---|---|---|
| Adam | 14.89 | 12.31 | |
| GaLore | 16.12 | 13.05 | 0.72× |
| Flora | 15.98 | 12.87 | 0.81× |
| COAP | 15.56 | 12.58 | 0.93× |

LLaVA-7B Fine-Tuning

| Method | Training Time | Accuracy | GPU Memory |
|---|---|---|---|
| LoRA | 12.3h | 88.1% | 18GB |
| Full fine-tuning (Adam) | 47.1h | 82.4% | 62GB |
| GaLore | 15.2h | 87.3% | 24GB |
| COAP | 7.6h | 92.3% | 22GB |

Compared with full fine-tuning using Adam, COAP achieves a 6.2× speedup (7.6h vs. 47.1h) on LLaVA-7B fine-tuning while improving accuracy from 82.4% to 92.3%.

Downstream Task Evaluation

| Task | Adam | GaLore | COAP |
|---|---|---|---|
| MMLU (5-shot) | 46.2 | 43.8 | 45.7 |
| HellaSwag | 72.1 | 69.4 | 71.5 |
| ARC-Challenge | 41.3 | 38.9 | 40.8 |
| WinoGrande | 67.4 | 65.1 | 66.9 |

Summary & Outlook

By observing the high correlation between projection matrices of adjacent update cycles, COAP designs an efficient two-phase projection update strategy: per-step SGD incremental updates plus an occasional warm-start SVD. This design reduces the computational overhead of SVD by approximately 20× while maintaining projection quality comparable to full SVD. In LLaMA-1B pre-training it achieves a PPL of 15.56 with a 61% reduction in optimizer memory, and in LLaVA-7B fine-tuning it realizes a 6.2× speedup and a 9.9-percentage-point accuracy improvement over full fine-tuning with Adam.