GoRA: Gradient-Driven Adaptive Low Rank Adaptation¶
Conference: NeurIPS 2025 arXiv: 2502.12171 Code: GitHub Area: Model Compression / LLM Efficiency Keywords: LoRA, adaptive rank allocation, gradient-driven initialization, parameter-efficient fine-tuning, LLM
TL;DR¶
GoRA leverages gradient information pre-computed before training to perform adaptive rank allocation and weight initialization in one pass: per-layer ranks are assigned according to parameter sensitivity, and the \(B\) matrix is initialized by applying the pseudo-inverse of \(A\) to the gradient so that the initial adapter output approximates one step of gradient descent. This addresses both major bottlenecks of LoRA in a unified framework.
Background & Motivation¶
LoRA performance is constrained by two critical factors: rank selection and weight initialization.
Rank allocation problem:

- Higher rank consistently yields better performance, but raising the rank directly incurs large memory overhead
- AdaLoRA adjusts ranks dynamically during training via masking, but must pre-allocate larger matrices (1.5× the parameter count), capping the achievable rank
- Methods such as MeLoRA modify the LoRA structure at the cost of generality
Initialization problem:

- PiSSA/MiLoRA initialize from an SVD of the pretrained weights, which is task-agnostic and limited in generalization
- LoRA-GA initializes from singular features of the gradients, but must modify the pretrained weights as \(W_0 \leftarrow W_0 - A_0 B_0\), introducing a training–inference gap
- All such non-zero initialization methods require saving the modified pretrained weights, sacrificing LoRA's storage advantage
Core insight: LoRA adapters can be reinterpreted as gradient compressors. Experiments from LoRA-FA show that freezing a random \(A\) and training only \(B\) approximates full LoRA performance, where \(\Delta W = \frac{\alpha}{r} A_0 \Delta B_t = -\eta \frac{\alpha}{r} \sum_t A_0 A_0^T \frac{\partial \mathcal{L}}{\partial W_0}\), i.e., \(A_0^T\) compresses the gradient and \(A_0\) decompresses it.
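This compressor view can be checked numerically. A toy NumPy sketch of one SGD step on \(B\) with \(A_0\) frozen (shapes and step size are illustrative, and the \(\alpha/r\) scale is omitted):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, r, eta = 64, 32, 4, 1e-2    # toy shapes and step size (assumptions)

A0 = rng.standard_normal((m, r))  # frozen random "compressor"
B = np.zeros((r, n))              # trainable, zero-initialized
G = rng.standard_normal((m, n))   # stand-in for dL/dW at W0

# With W = W0 + A0 @ B, the chain rule gives dL/dB = A0.T @ G:
# the full gradient is compressed to rank r through A0.T.
B -= eta * (A0.T @ G)

# The induced weight update decompresses it back through A0,
# matching the Delta W formula above (up to the alpha/r scale).
delta_W = A0 @ B
assert np.allclose(delta_W, -eta * A0 @ A0.T @ G)
```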
Method¶
Overall Architecture¶
GoRA proceeds in three steps, all completed before formal training begins:
- Gradient pre-computation: Run forward and backward passes over \(N\) batches to accumulate the average gradient of each layer's weights, \(G = \frac{1}{N} \sum_i \frac{\partial \mathcal{L}_i}{\partial W_0}\)
- Adaptive rank allocation: Determine the per-layer rank based on an importance score derived from weight–gradient interaction
- Gradient-driven initialization: Initialize \(B_0\) via the gradient pseudo-inverse such that \(A_0 B_0 \approx -G\)
Key Designs¶
Adaptive rank allocation strategy:
1. Compute the per-layer importance (a parameter-sensitivity metric): \(I(W) = \text{avg}(|W \odot G|)\), i.e., the element-wise absolute mean of the Hadamard product of weights and gradients. Intuitively, layers with both large weights and large gradients are more important.
2. Normalize to advantage scores: \(a^i = I(W_0^i) / \sum_j I(W_0^j)\)
3. Compute the total parameter budget from a reference rank \(r^{\text{ref}}\): \(b = \sum_i \sqrt{m_i + n_i} \times r^{\text{ref}}\)
4. Allocate each layer's rank: \(r^i = \text{clip}\left(\text{round}\left(\frac{b \cdot a^i}{\sqrt{m_i + n_i}}\right), r^{\min}, r^{\max}\right)\)
Design objectives: (1) completed once before training with no dynamic shape changes; (2) total parameter count comparable to standard LoRA (±10%); (3) structurally compatible with LoRA.
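A minimal NumPy sketch of the allocation steps above, using toy layer shapes and random data (`allocate_ranks` is an illustrative helper, not the released implementation):

```python
import numpy as np

def allocate_ranks(weights, grads, r_ref=8, r_min=4, r_max=32):
    # Step 1: importance I(W) = avg(|W * G|) per layer.
    imp = np.array([np.abs(W * G).mean() for W, G in zip(weights, grads)])
    # Step 2: advantage scores a^i, normalized to sum to 1.
    adv = imp / imp.sum()
    # Step 3: budget b = sum_i sqrt(m_i + n_i) * r_ref.
    sizes = np.array([np.sqrt(sum(W.shape)) for W in weights])
    budget = (sizes * r_ref).sum()
    # Step 4: per-layer rank, rounded and clipped to [r_min, r_max].
    return np.clip(np.round(budget * adv / sizes), r_min, r_max).astype(int)

rng = np.random.default_rng(0)
shapes = [(64, 64), (64, 128), (128, 64)]      # toy layer shapes
Ws = [rng.standard_normal(s) for s in shapes]
Gs = [rng.standard_normal(s) for s in shapes]
ranks = allocate_ranks(Ws, Gs)
print(ranks)  # one integer rank per layer
```

Because the advantage scores are normalized, the total allocated parameters stay close to the \(r^{\text{ref}}\) budget while more sensitive layers receive higher ranks.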
Gradient-driven initialization:
\(A_0\) is initialized with a Kaiming uniform distribution (consistent with the PEFT library); \(B_0\) is initialized by applying the pseudo-inverse of \(A_0\) to the accumulated gradient: \(B_0 = -A_0^{+} G = -(A_0^T A_0)^{-1} A_0^T G\).
This makes \(A_0 B_0 = -A_0(A_0^T A_0)^{-1} A_0^T G\) the optimal low-rank approximation of \(G\) in the column space of \(A_0\) (minimizing \(\|A_0 B_0 + G\|_F\)).
Scaling factor \(\xi\): To match the true gradient magnitude, a scaling factor is introduced: \(\frac{\alpha}{\sqrt{r}} A_0(\xi B_0) \approx -\gamma G\), where \(\xi = \gamma \sqrt{m} / \alpha\) and \(\gamma\) is a tunable step-size hyperparameter (recommended: 5e-2).
No modification to pretrained weights: Unlike PiSSA/LoRA-GA, GoRA does not require \(W_0 \leftarrow W_0 - A_0 B_0\), because the initialization objective is to make \(A_0 B_0 \approx -G\) (approximating one step of gradient descent) rather than decomposing \(W_0\).
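A small NumPy sketch of this initialization (toy shapes; the Kaiming-uniform bound is a rough stand-in for PEFT's exact formula, and the \(\xi\) scaling is omitted):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, r = 64, 32, 8
G = rng.standard_normal((m, n))   # stand-in for the pre-computed average gradient

# A0: Kaiming-style uniform init (bound is an assumption, not PEFT's exact one).
bound = np.sqrt(6.0 / m)
A0 = rng.uniform(-bound, bound, size=(m, r))

# B0 = -(A0^T A0)^{-1} A0^T G = -pinv(A0) @ G, so that A0 @ B0 approximates -G.
B0 = -np.linalg.pinv(A0) @ G

# A0 @ B0 is the best Frobenius-norm approximation of -G within col(A0):
# the residual is orthogonal to every column of A0.
residual = A0 @ B0 + G
assert np.allclose(A0.T @ residual, 0.0, atol=1e-8)
```

No correction \(W_0 \leftarrow W_0 - A_0 B_0\) is applied afterwards: the product stands in for a gradient step rather than a decomposition of \(W_0\).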
Loss & Training¶
- Forward computation: \(W_t = W_0 + \frac{\alpha}{\sqrt{r}} A_t B_t\) (adopting the \(\sqrt{r}\) scaling from rsLoRA)
- Standard fine-tuning loss (next-token prediction or classification loss)
- The formal training procedure is identical to LoRA; GoRA's innovations are entirely within the initialization phase
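For reference, the forward computation can be sketched as follows (toy shapes; \(\alpha\) is an illustrative value):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, r, alpha = 64, 32, 8, 16      # toy dimensions (assumptions)

W0 = rng.standard_normal((m, n))    # frozen pretrained weight
A = rng.standard_normal((m, r))     # adapter factors
B = rng.standard_normal((r, n))

def forward(x, W0, A, B, alpha, r):
    # W_t = W0 + (alpha / sqrt(r)) * A @ B   (rsLoRA scaling)
    return x @ (W0 + (alpha / np.sqrt(r)) * A @ B)

x = rng.standard_normal((4, m))     # a batch of 4 inputs
y = forward(x, W0, A, B, alpha, r)
assert y.shape == (4, n)
```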
Key Experimental Results¶
Main Results¶
T5-Base on GLUE (\(r^{\text{ref}}=8\)):
| Method | MNLI | SST-2 | CoLA | QNLI | MRPC | Avg. |
|---|---|---|---|---|---|---|
| Full FT | 86.33 | 94.75 | 80.70 | 93.19 | 84.56 | 87.91 |
| LoRA | 85.30 | 94.04 | 69.35 | 92.96 | 68.38 | 82.08 |
| LoRA-GA | 85.70 | 94.11 | 80.57 | 93.18 | 85.29 | 87.77 |
| AdaLoRA | 85.45 | 93.69 | 69.16 | 91.66 | 68.14 | 81.62 |
| GoRA | 85.91 | 94.68 | 79.86 | 93.27 | 86.10 | 87.96 |
Llama-3.1-8B-Base on generation tasks:
| Method | MTBench | GSM8k | HumanEval |
|---|---|---|---|
| Full FT | 5.88 | 73.69 | 51.63 |
| LoRA | 6.15 | 67.78 | 43.09 |
| LoRA-GA | 5.99 | 71.39 | 43.29 |
| GoRA | 6.34 | 72.91 | 48.98 |
| GoRA (\(r^{\text{ref}}=128\)) | 5.82 | 75.74 | 52.03 |
GoRA (\(r^{\text{ref}}=128\)) surpasses full fine-tuning on GSM8k and HumanEval.
Ablation Study¶
Effect of rank allocation range (Llama-3.1-8B, \(\gamma=5e{-2}\)):
| \(r^{\min}\) | \(r^{\max}\) | GSM8k | HumanEval |
|---|---|---|---|
| 8 | 8 (fixed) | 72.10 | 44.75 |
| 6 | 15 | 72.25 | 45.85 |
| 4 | 32 | 72.88 | 48.98 |
Effect of initialization scaling factor \(\gamma\):
| \(\gamma\) | GSM8k | HumanEval |
|---|---|---|
| 0 (no initialization) | 72.45 | 46.34 |
| 3e-2 | 72.71 | 45.93 |
| 5e-2 | 72.88 | 48.98 |
| 8e-2 | 72.91 | 46.54 |
Comparison of importance metrics:
| Metric | GSM8k | HumanEval |
|---|---|---|
| \(\text{avg}(\lvert W \odot G \rvert)\) (GoRA) | 72.88 | 48.98 |
| \(\|G\|_*\) (nuclear norm) | 72.70 | 43.09 |
| \(\|W \odot G\|_*\) | 72.65 | 45.12 |
Key Findings¶
- Wider rank ranges are better: \((r^{\min}=4, r^{\max}=32)\) significantly outperforms fixed rank=8; most rank budget is allocated to \(W_v\) layers
- Initialization is critical: \(\gamma=0\) (no initialization) underperforms the optimal \(\gamma\) by 2.64 points on HumanEval
- Parameter sensitivity metric is best: outperforms gradient nuclear norm and the nuclear norm of the weight–gradient product
- Cross-modal consistency: GoRA outperforms baselines on NLU (T5), NLG (Llama), and visual classification (CLIP-ViT)
- Surpasses full fine-tuning at high rank: GoRA \(r^{\text{ref}}=128\) outperforms full fine-tuning by 2.05 points on mathematical reasoning
Highlights & Insights¶
- LoRA as a gradient compressor: This reinterpretation unifies the design logic for both rank allocation and initialization
- One-time pre-training setup: No runtime dynamic adjustment is required, making it fully compatible with distributed training (FSDP/ZeRO)
- No training–inference gap: Pretrained weights are not modified, preserving LoRA's storage advantage
- Automatic hyperparameter tuning: Adaptive gradient accumulation stopping and adaptive \(\gamma\) search strategies are proposed, approaching the performance of manual tuning
- Rank distribution pattern: \(W_v\) receives the most rank and \(W_q\) the least, consistent with observations in the original LoRA paper
Limitations & Future Work¶
- Gradient pre-computation requires additional forward pass overhead (\(N\) batches), which is non-negligible for very large models
- The optimal value of \(\gamma\) varies by task (GSM8k favors 8e-2, HumanEval favors 5e-2)
- \(r^{\min}\) and \(r^{\max}\) still require manual selection, although the paper shows that wide ranges are generally preferable
- The pseudo-inverse computation assumes \(A_0\) is full rank, which may be numerically unstable at very low ranks
- Combination with QLoRA (quantization) has not been explored
Related Work & Insights¶
- LoRA-GA: Also gradient-driven initialization, but requires modifying pretrained weights; GoRA avoids this via the scaling factor
- AdaLoRA: Also an adaptive rank method, but it works through training-time masking, inflating the parameter count by 1.5×; GoRA allocates ranks once before training and adds no such overhead
- rsLoRA: The \(\alpha/\sqrt{r}\) scaling rule is adopted by GoRA to better leverage high ranks
- Inspiration: The gradient compressor perspective may generalize to other PEFT methods (Adapter, Prefix-tuning), warranting exploration of a gradient-driven unified framework
Rating¶
- Novelty: ⭐⭐⭐⭐ — The view of LoRA as a gradient compressor is original and unifies two previously separate problems
- Technical Depth: ⭐⭐⭐⭐ — Theoretical analysis is clear; the optimality of the pseudo-inverse is formally proven
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ — Three modalities (NLU/NLG/vision), multiple models and benchmarks, comprehensive ablations
- Practicality: ⭐⭐⭐⭐⭐ — Plug-and-play, compatible with the LoRA ecosystem, code is open-sourced
- Overall: ⭐⭐⭐⭐