GORP: Continual Gradient Low-Rank Projection Fine-Tuning for LLMs¶

Conference: ACL 2025
arXiv: 2507.02503
Code: https://github.com/Wcxwcxw/GORP
Area: LLM/NLP
Keywords: Continual Learning, Gradient Projection, LoRA, Catastrophic Forgetting, Low-Rank Optimization

TL;DR¶

GORP projects the gradients of full-rank parameters and LoRA low-rank parameters into a shared low-rank gradient subspace and updates them jointly. It uses the first moment of Adam to implicitly construct a gradient space shared across tasks, which alleviates catastrophic forgetting. In continual learning settings on T5 and LLaMA2, its performance approaches the multi-task joint training upper bound.

Background & Motivation¶

Background: Continual fine-tuning of LLMs requires learning on sequential tasks and faces catastrophic forgetting. LoRA is widely used in continual learning due to its parameter efficiency.

Limitations of Prior Work:
- The low-rank limitation of LoRA restricts its expressiveness, and the limited search space leads to insufficient learning of new tasks.
- Explicit parameter constraint methods (e.g., regularization, sparsification) such as O-LoRA and MIGU cannot dynamically adapt to changes in the gradient space of new tasks.
- Calculating the implicit feature space covariance directly in the original high-dimensional space is computationally prohibitive.

Key Challenge: The low-rank nature of LoRA limits plasticity (learning new tasks), while explicit constraints, which are meant to provide stability, cannot adapt as the gradient space changes.

Key Insight: Gradient matrices naturally exhibit a low-rank structure during training, allowing for efficient operations in a low-rank space.

Core Idea: Project gradients of full-rank parameters into the low-rank space as well, jointly updating them with LoRA gradients in a unified shared gradient subspace.

Method¶

Overall Architecture¶

GORP consists of two components: (1) Gradient Shared Space Construction: After each task is trained, SVD is applied to the first moment of Adam to extract the principal gradient directions of the task, gradually expanding the cross-task shared gradient space; (2) Low-Rank Projection Optimization: When training a new task, gradients of both full-rank and LoRA parameters are projected into directions orthogonal to old tasks, preventing forgetting while expanding the search space.

Key Designs¶

Gradient Shared Space Construction:
- Function: Approximate the principal gradient directions of each task using the first moment of Adam \(M_t\), and take the top \(k\) basis vectors after SVD.
- Mechanism: For task 1, the first moment is decomposed as \(M_1^l = U_1^l \Sigma_1^l (V_1^l)^T\), and the top \(k\) vectors whose rank-\(k\) approximation satisfies \(\|(M_1^l)_k\|_F^2 > \epsilon_t^l \|M_1^l\|_F^2\) form \(\mathcal{S}_1^l\). For subsequent tasks, the first moment is first projected onto the orthogonal complement of the existing space, \(\hat{M}_2^l = M_2^l - \mathcal{S}^l(\mathcal{S}^l)^T M_2^l\), and SVD of the residual then expands the space.
- Design Motivation: The first moment aggregates historical gradient information, representing the overall gradient direction of the task more effectively than randomly sampled hidden features.
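
The construction above can be sketched in NumPy as follows. This is a minimal illustration, not the authors' implementation: the function names `top_k_basis` and `expand_shared_space` are hypothetical, the threshold is applied with "at least" rather than a strict inequality, and for later tasks the threshold is applied to the residual alone, which is a simplifying assumption.

```python
import numpy as np

def top_k_basis(M, eps):
    """Leading left singular vectors of M whose rank-k approximation
    keeps at least a fraction eps of the squared Frobenius norm of M."""
    U, s, _ = np.linalg.svd(M, full_matrices=False)
    total = np.sum(s ** 2)
    if total == 0.0:
        return U[:, :0]  # nothing new to add
    energy = np.cumsum(s ** 2) / total
    k = min(int(np.searchsorted(energy, eps)) + 1, len(s))
    return U[:, :k]

def expand_shared_space(S, M_new, eps):
    """Grow the shared basis S with the directions of M_new it does not yet span."""
    if S is None:  # first task: build the space from scratch
        return top_k_basis(M_new, eps)
    # M_hat = M - S S^T M: remove what the old tasks already cover
    M_hat = M_new - S @ (S.T @ M_new)
    return np.hstack([S, top_k_basis(M_hat, eps)])

# Toy first-moment matrices for two tasks (random stand-ins, not real data)
rng = np.random.default_rng(0)
M1 = rng.standard_normal((16, 3)) @ rng.standard_normal((3, 12))
M2 = rng.standard_normal((16, 2)) @ rng.standard_normal((2, 12))
S = expand_shared_space(None, M1, eps=0.99)
S = expand_shared_space(S, M2, eps=0.99)
print(S.shape)
```

Because the new directions are singular vectors of the residual, they are orthogonal to the existing basis, so the expanded `S` keeps orthonormal columns.
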
Joint Full-Rank and Low-Rank Projection:
- Function: Separately project the gradients of LoRA parameters and full-rank parameters onto the orthogonal complement of the shared space.
- Mechanism: LoRA gradient projection \(G_{A,l}' = G_{A,l} - \mathcal{S}_{t-1}^{A,l}(\mathcal{S}_{t-1}^{A,l})^T G_{A,l}\); full-rank gradients are first SVD-reduced to \(k\) dimensions \(G_{t,l}' = U_{l,k}^T G_{t,l} V_{l,k}\), and then projected onto the orthogonal direction \(P_{t,l} = G_{t,l}' - \mathcal{S}_{t-1}^l(\mathcal{S}_{t-1}^l)^T G_{t,l}'\).
- Design Motivation: Full-rank parameters enhance the flexibility of the search space, while low-rank projection ensures efficiency and mitigates forgetting.
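
The two projections above can be illustrated with a short NumPy sketch. The function names are hypothetical, and the shared basis `S` for the full-rank branch is assumed to be expressed already in the reduced \(k\)-dimensional coordinates, a detail this summary does not spell out.

```python
import numpy as np

def project_lora_grad(G_A, S_A):
    """G_A' = G_A - S_A S_A^T G_A (S_A has orthonormal columns)."""
    return G_A - S_A @ (S_A.T @ G_A)

def project_full_rank_grad(G, S, k):
    """Reduce G to k x k with its top-k singular bases, then remove
    the component lying in the old-task space S (shape k x r)."""
    U, _, Vt = np.linalg.svd(G, full_matrices=False)
    U_k, V_k = U[:, :k], Vt[:k, :].T
    G_red = U_k.T @ G @ V_k            # G' = U_k^T G V_k, shape (k, k)
    return G_red - S @ (S.T @ G_red)   # P = G' - S S^T G'

# Toy shapes: LoRA A matrix of rank 8, full-rank weight of size 64 x 48
rng = np.random.default_rng(0)
S_A, _ = np.linalg.qr(rng.standard_normal((8, 2)))
G_A = rng.standard_normal((8, 32))
S, _ = np.linalg.qr(rng.standard_normal((8, 3)))
G = rng.standard_normal((64, 48))

P_A = project_lora_grad(G_A, S_A)
P = project_full_rank_grad(G, S, k=8)
print(P_A.shape, P.shape)
print(np.abs(S.T @ P).max())  # numerically close to zero
```
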
Implicit vs. Explicit Constraints:
- Function: Replace parameter orthogonality (explicit) with gradient orthogonality (implicit).
- Mechanism: GORP does not directly constrain parameter changes; instead, it constrains the gradient direction to be orthogonal to old tasks, allowing the model to freely choose the optimal parameter update magnitude.
- Design Motivation: Gradient orthogonality better guarantees that the learning direction does not interfere with old tasks while allowing parameters to change flexibly along orthogonal directions.
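
A short derivation shows why constraining only the direction is enough. It assumes the columns of \(\mathcal{S}_{t-1}^l\) are orthonormal (they come from SVD), and it uses a step size \(\eta\) that is introduced here purely for illustration:

\[
(\mathcal{S}_{t-1}^l)^T P_{t,l}
= (\mathcal{S}_{t-1}^l)^T G_{t,l}' - \underbrace{(\mathcal{S}_{t-1}^l)^T \mathcal{S}_{t-1}^l}_{=\,I} (\mathcal{S}_{t-1}^l)^T G_{t,l}'
= 0
\]

\[
\Rightarrow\quad (\mathcal{S}_{t-1}^l)^T \left(-\eta\, P_{t,l}\right) = 0 \quad \text{for every } \eta .
\]

For a plain gradient step, the update therefore has no component in the old-task space regardless of its magnitude. An adaptive optimizer such as Adam rescales the projected gradient elementwise, so this exact identity should be read as a statement about the projected gradient rather than about the final parameter change.
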

Computational Efficiency Design¶

SVD of full-rank parameters is executed once every \(T=10\) steps (not every step), significantly reducing overhead.
Low-rank projection reduces the dimension of the matrix fed to the Adam optimizer to \(k \times k\) (\(k=8\)), which is much smaller than the original dimension.
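
Both points can be sketched in a few lines. The class `PeriodicProjector` and the helper `adam_state_elems` are hypothetical names, and the 1024 x 1024 weight shape is an arbitrary example rather than a figure from the paper.

```python
import numpy as np

def adam_state_elems(shape):
    """Adam keeps two moment buffers (first and second) per element."""
    return 2 * int(np.prod(shape))

class PeriodicProjector:
    """Caches the top-k singular bases of the gradient and recomputes
    them only once every T steps."""
    def __init__(self, k=8, T=10):
        self.k, self.T = k, T
        self.step = 0
        self.num_svds = 0
        self.U_k = self.V_k = None

    def reduce(self, G):
        if self.step % self.T == 0:  # refresh the bases
            U, _, Vt = np.linalg.svd(G, full_matrices=False)
            self.U_k, self.V_k = U[:, :self.k], Vt[:self.k, :].T
            self.num_svds += 1
        self.step += 1
        return self.U_k.T @ G @ self.V_k  # shape (k, k)

rng = np.random.default_rng(0)
proj = PeriodicProjector(k=8, T=10)
for _ in range(30):
    reduced = proj.reduce(rng.standard_normal((64, 48)))

print(proj.num_svds)                     # 3 SVDs for 30 steps
print(adam_state_elems((1024, 1024)))    # 2097152
print(adam_state_elems((8, 8)))          # 128
```

Between refreshes the cached bases are reused, so the projection costs only two small matrix products per step.
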

Key Experimental Results¶

Main Results (T5-Large)¶

Method	Standard CL (3 orders avg)	Large-scale Tasks (15 tasks, 3 orders avg)
O-LoRA	75.8	69.6
MIGU	76.6	70.0
N-LoRA	78.8	72.4
GORP	79.8	76.0
MTL (Upper Bound)	80.0	-

LLaMA2-7B Results¶

Method	Order 1-3 avg	BWT (forgetting)
O-LoRA	76.1	-7.8%
N-LoRA	77.6	-4.9%
GORP	78.6	-0.8%
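
For readers unfamiliar with the BWT column, the sketch below computes backward transfer under the commonly used definition (the average change in score on earlier tasks after the final task is learned). It is assumed, not confirmed by this summary, that the paper uses this definition; the score matrix is made-up toy data.

```python
import numpy as np

def backward_transfer(acc):
    """acc[i, j] = score on task j after training on task i (0-indexed).
    Returns the mean of acc[last, j] - acc[j, j] over all but the last task."""
    acc = np.asarray(acc, dtype=float)
    n = acc.shape[0]
    final = acc[n - 1, : n - 1]          # scores on old tasks at the end
    just_trained = np.diag(acc)[: n - 1]  # scores right after learning each task
    return float(np.mean(final - just_trained))

# Hypothetical scores for a 3-task sequence
acc = [
    [80.0,  0.0,  0.0],
    [75.0, 70.0,  0.0],
    [74.0, 68.0, 90.0],
]
print(backward_transfer(acc))  # -4.0
```

A value closer to zero means less forgetting; negative values indicate that earlier tasks degraded.
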

Ablation Study¶

Configuration	Avg Performance	Description
B (LoRA Projection Only)	baseline	Baseline
B+L (With Full-Rank Projection)	+0.7%	Full-rank parameters expand search space
B+S (With Gradient Space)	+2.0%	Shared gradient space contributes the most
B+L+S (Full GORP)	+3.9%	Complementary components

Key Findings¶

Close to MTL Upper Bound: GORP (79.8%) is close to MTL (80.0%) under standard CL, almost eliminating the gap in continual learning.
Significant Reduction in Forgetting: BWT improves from -7.8% (O-LoRA) to -0.8% (GORP) (T5).
Unseen Task Generalization: Outperforms O-LoRA by 26.2% and N-LoRA by 7.0%.
Computationally Efficient: Training time is comparable to O-LoRA, while FLOPs are 1/550 of O-LoRA.
Greater Advantage with More Tasks: Outperforms N-LoRA by 3.6% with 15 tasks, indicating that the dynamic gradient space handles long task sequences better.

Highlights & Insights¶

The insight of gradient orthogonality vs. parameter orthogonality is profound: explicitly constraining parameter orthogonality is a "conservative" strategy that easily limits new task learning; implicitly constraining gradient orthogonality is a "flexible" strategy, controlling only the direction of updates, not the magnitude.
Gradient space approximation via first moment is an ingenious engineering design: the momentum term of Adam naturally accumulates the gradient statistics of the task, which is far more efficient than additionally calculating the covariance matrix.
The idea of full-rank and low-rank complementarity is transferable: in other scenarios requiring parameter-efficient fine-tuning, there is no need to completely abandon full-rank parameters; instead, full-rank gradients can be operated within a low-rank subspace.

Limitations & Future Work¶

As the number of tasks increases, the dimension of the gradient space grows continuously, potentially causing dimension explosion.
Multiple hyperparameters require tuning: \(k=8\), \(\alpha\), \(T\), and different learning rates.
The performance in sequential editing and online learning scenarios has not been validated.
The scalability to larger models (70B+) remains unknown.

Related Work & Insights¶

vs. O-LoRA: O-LoRA uses explicit parameter constraints and is static; GORP uses implicit gradient constraints and is dynamically adaptive.
vs. MIGU: MIGU uses sparsification (updating only growing units), while GORP uses gradient projection, which is more flexible.
vs. N-LoRA: N-LoRA extends the orthogonal subspace of LoRA but remains within low-rank constraints; GORP introduces full-rank parameters to expand the search space.

Rating¶

Novelty: ⭐⭐⭐⭐ Shared gradient space + joint full-rank and low-rank projection is a meaningful innovation.
Experimental Thoroughness: ⭐⭐⭐⭐⭐ Two models, multiple task sequences, multiple benchmarks, and detailed ablations.
Writing Quality: ⭐⭐⭐⭐ Clear method derivations and systematic experimental design.
Value: ⭐⭐⭐⭐ High reference value for continual learning in LLMs.

GORP consistently outperforms methods like O-LoRA, MIGU, and N-LoRA on continual fine-tuning benchmarks, effectively balancing plasticity and stability.

Highlights & Insights¶

Joint optimization of full-rank and low-rank parameters in a unified low-rank gradient space—balancing both expressiveness and efficiency.
Replacing hidden feature covariance with the first moment—significantly reducing computational costs.

Limitations & Future Work¶

The frequency \(T\) of gradient SVD decomposition needs to be manually set.
Only validated up to the LLaMA2-7B scale.

Related Work & Insights¶

Refer to the detailed comparison in the Related Work section of the original paper.

Rating¶

Novelty: ⭐⭐⭐⭐ Joint full-rank + low-rank updates and implicit gradient space construction are innovative.
Experimental Thoroughness: ⭐⭐⭐⭐ Comparisons across multiple continual learning settings.
Writing Quality: ⭐⭐⭐⭐ Clear algorithm pseudocode.
Value: ⭐⭐⭐⭐ Continual fine-tuning is an important direction, and GORP provides a concise and effective solution.
