CL-LoRA: Continual Low-Rank Adaptation for Rehearsal-Free Class-Incremental Learning

Conference: CVPR 2025
arXiv: 2505.24816
Code: JiangpengHe/CL-LoRA
Institution: MIT / Purdue University
Area: Continual Learning / Model Compression
Keywords: Class-Incremental Learning, LoRA, PEFT, knowledge distillation, catastrophic forgetting

TL;DR

This paper proposes CL-LoRA, a dual-adapter architecture that pairs a task-shared LoRA with task-specific LoRAs. By combining knowledge distillation, gradient reassignment, and learnable block-wise weights, CL-LoRA achieves SOTA continual learning performance with only 0.3% trainable parameters.

Background & Motivation

Background: Class-incremental learning (CIL) requires models to sequentially learn new classes while preserving previous knowledge. Recently, pre-trained models (PTMs) combined with Parameter-Efficient Fine-Tuning (PEFT) have demonstrated promising results in rehearsal-free CIL without storing exemplar images from old tasks.

Limitations of Prior Work:
- Prompt-based methods (L2P, DualPrompt, CODA-Prompt): Require a large number of prompt parameters, and the task selection mechanisms are complex.
- Adapter-based methods (EASE, O-LoRA, InfLoRA): Create a new adapter for each task, leading to parameter redundancy.
- Existing methods fail to effectively utilize shared knowledge across tasks, and independent learning for each task leads to fragmented knowledge.

Key Challenge: How to maintain parameter efficiency while simultaneously learning task-shared knowledge and capturing task-specific features, thereby balancing stability (no forgetting) and plasticity (learning new knowledge).

Key Insight: Designing a dual-adapter architecture, where the first half of the Transformer layers uses a shared LoRA to learn cross-task knowledge, and the second half uses task-specific LoRAs to capture task-specific features.

Core Idea: Shared LoRA (with knowledge distillation + gradient reassignment) + Specific LoRA (with block-wise weights + orthogonality constraints) = Efficient Continual Learning.

Method

Overall Architecture

Based on the ViT-B/16 pre-trained model, the 12 Transformer layers are split into two halves: the first 6 layers insert the task-shared LoRA, and the remaining 6 layers insert the task-specific LoRAs. A prototype classifier is used during inference.
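
The split described above can be written down as a small lookup. This is only an illustrative sketch: the function name `adapter_layout` and the 1-indexed block numbering are assumptions made here, not identifiers from the paper or its code.

```python
def adapter_layout(num_blocks=12, num_shared=6):
    """Map each 1-indexed Transformer block to the adapter type it hosts.

    Blocks 1..num_shared get the task-shared LoRA; the remaining blocks
    get one task-specific LoRA per task. Names here are hypothetical.
    """
    if not 0 <= num_shared <= num_blocks:
        raise ValueError("num_shared must lie between 0 and num_blocks")
    return {
        j: ("shared" if j <= num_shared else "specific")
        for j in range(1, num_blocks + 1)
    }


layout = adapter_layout()          # ViT-B/16 setting from the text: 12 blocks, 6 shared
shared_blocks = [j for j, kind in layout.items() if kind == "shared"]
specific_blocks = [j for j, kind in layout.items() if kind == "specific"]
```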

Key Designs

Task-Shared Adapter
- Function: Inserts a shared LoRA in the first \(l=6\) Transformer blocks to learn general knowledge across tasks.
- Mechanism: Initializes \(\mathbf{A}_s\) with a random orthogonal matrix and only updates \(\mathbf{B}_s\).
- Knowledge Distillation: When learning a new task, the shared adapter from the previous task serves as the teacher to distill knowledge into the current adapter.
- Early Exit Strategy: Computes the distillation loss only at the last layer of the shared adapter, reducing computational overhead.
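
To make the shared-adapter mechanism concrete, here is a minimal pure-Python sketch under stated assumptions. It builds a frozen \(\mathbf{A}_s\) with orthonormal rows via Gram-Schmidt, applies the low-rank update \(\mathbf{B}_s \mathbf{A}_s x\), and measures distillation with a mean squared error. The MSE form of the distillation loss, the Gram-Schmidt construction, the zero initialization of \(\mathbf{B}_s\), and all function names are assumptions chosen for illustration; the summary above does not specify them.

```python
import math
import random


def random_orthonormal_rows(r, d, seed=0):
    """Return r orthonormal vectors of length d (Gram-Schmidt, needs r <= d)."""
    rng = random.Random(seed)
    rows = []
    while len(rows) < r:
        v = [rng.gauss(0.0, 1.0) for _ in range(d)]
        for u in rows:  # remove components along rows already accepted
            dot = sum(a * b for a, b in zip(v, u))
            v = [a - dot * b for a, b in zip(v, u)]
        norm = math.sqrt(sum(a * a for a in v))
        if norm > 1e-8:
            rows.append([a / norm for a in v])
    return rows


def matvec(M, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]


def lora_update(x, A, B):
    """Low-rank update B (A x): A is r x d and frozen, B is d_out x r and trainable."""
    return matvec(B, matvec(A, x))


def kd_loss(student_feat, teacher_feat):
    """Assumed distillation loss: mean squared error between feature vectors."""
    diffs = [(s - t) ** 2 for s, t in zip(student_feat, teacher_feat)]
    return sum(diffs) / len(diffs)


d, r = 8, 2
A_s = random_orthonormal_rows(r, d)       # fixed after initialization
B_s = [[0.0] * r for _ in range(d)]       # assumed zero start, so the update begins at zero
delta = lora_update([1.0] * d, A_s, B_s)
```

With the early-exit strategy, `kd_loss` would be evaluated once, on the output of the last block that carries the shared adapter, rather than on the final network output.
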
Gradient Reassignment
- Function: Identifies and protects parameters in the shared adapter that are important to old tasks.
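- Mechanism: Calculates the element-wise difference between the distillation-loss gradients of the teacher and student shared adapters, and uses its magnitude to rescale the distillation gradient, which reduces the effective learning rate for important parameters.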
- Implementation: \(\nabla \mathcal{L}'_{kd} = \nabla \mathcal{L}_{kd} \odot |\nabla_{\mathbf{B}_s^{t-1}} \mathcal{L}_{kd} - \nabla_{\mathbf{B}_s^{t}} \mathcal{L}_{kd}|\)
- Effect: Retains crucial shared knowledge more precisely.
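
The rescaling in the Implementation formula can be sketched directly as an element-wise operation on gradient matrices. The function name `reassign_gradient` and the nested-list representation are assumptions for illustration; how the teacher-side gradient is obtained in practice is not covered by the summary above and is left outside this sketch.

```python
def reassign_gradient(grad_kd, grad_teacher, grad_student):
    """Element-wise grad_kd * |grad_teacher - grad_student| for matrices as nested lists."""
    return [
        [g * abs(gt - gs) for g, gt, gs in zip(row_g, row_t, row_s)]
        for row_g, row_t, row_s in zip(grad_kd, grad_teacher, grad_student)
    ]


# Toy 2 x 2 gradients, made up for the example.
grad_kd = [[1.0, -2.0], [0.5, 4.0]]
grad_teacher = [[0.5, 0.5], [1.0, 0.0]]
grad_student = [[0.5, 1.5], [0.0, 0.5]]
reassigned = reassign_gradient(grad_kd, grad_teacher, grad_student)
```

Under this formula, an entry where the two gradients coincide receives a zero update, while entries with a larger gap keep more of the original distillation gradient.
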
Task-Specific Adapter and Block-wise Weights
- Function: Each task has an independent LoRA in the remaining \(N-l\) Transformer blocks.
- Block-wise Weights: Learns trainable block-wise scaling factors \(w_i^j\) for each specific adapter.
- Orthogonality Constraint: \(\mathcal{L}_{orth} = \sum_{j=l+1}^{N} \sum_{i \neq k} \| (\mathbf{B}_i^j)^\top \mathbf{B}_k^j \|_F^2\)
- Design Motivation: Different tasks may require differing levels of adaptation at different Transformer layers.
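
The orthogonality constraint can be evaluated with a few nested loops. The sketch below follows the formula literally, summing over task-specific blocks and over ordered pairs \(i \neq k\); the data layout `B[j][i]` (block first, then task) and the function names are assumptions for illustration.

```python
def frob_sq_of_product(Bi, Bk):
    """Return ||Bi^T Bk||_F^2 for matrices stored as lists of rows (same row count)."""
    rank_i, rank_k = len(Bi[0]), len(Bk[0])
    total = 0.0
    for a in range(rank_i):
        for b in range(rank_k):
            entry = sum(Bi[row][a] * Bk[row][b] for row in range(len(Bi)))
            total += entry * entry
    return total


def orth_loss(B):
    """B[j][i] is task i's B matrix in task-specific block j (hypothetical layout)."""
    loss = 0.0
    for block in B:
        for i, Bi in enumerate(block):
            for k, Bk in enumerate(block):
                if i != k:  # ordered pairs, as in the sum over i != k
                    loss += frob_sq_of_product(Bi, Bk)
    return loss


# One block, two tasks, rank 1: orthogonal columns give zero loss.
orthogonal = [[[[1.0], [0.0]], [[0.0], [1.0]]]]
# Overlapping columns are penalized.
overlapping = [[[[1.0], [0.0]], [[1.0], [1.0]]]]
```

Because the sum runs over ordered pairs, each unordered pair of tasks is counted twice; that constant factor can be absorbed into \(\lambda_2\).
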
Prototype Classifier Inference
- Features are computed for each learned task using its corresponding task-specific adapter.
- The cosine similarity between the features and each task's prototypes is computed.
- The class with the highest similarity is predicted.
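
The three inference steps above can be sketched as follows. Feature extractors are modeled as plain callables and prototypes as per-task dictionaries; these representations, the names `cosine` and `predict`, and the toy extractors are all assumptions for illustration rather than the authors' implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity of two vectors; returns 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def predict(image, task_extractors, task_prototypes):
    """Pick the class whose prototype is most similar to the matching task's feature.

    task_extractors[t](image) -> feature vector from task t's specific adapter.
    task_prototypes[t] -> {class_label: prototype_vector} for task t.
    """
    best_label, best_sim = None, -2.0  # cosine similarity is never below -1
    for extract, prototypes in zip(task_extractors, task_prototypes):
        feature = extract(image)
        for label, prototype in prototypes.items():
            sim = cosine(feature, prototype)
            if sim > best_sim:
                best_label, best_sim = label, sim
    return best_label, best_sim


# Toy setup: two "tasks" with stand-in extractors.
extractors = [lambda x: x, lambda x: [x[1], x[0]]]
prototypes = [{"cat": [1.0, 0.0], "dog": [0.0, 1.0]}, {"car": [1.0, 1.0]}]
label, similarity = predict([0.0, 2.0], extractors, prototypes)
```

Note that the cost of this scheme grows with the number of learned tasks, since one forward pass through the task-specific blocks is needed per task.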

Loss & Training

\[\mathcal{L} = \mathcal{L}_{ce} + \lambda_1 \mathcal{L}_{kd} + \lambda_2 \mathcal{L}_{orth}\]

Hyperparameters: \(\lambda_1 = 5\), \(\lambda_2 = 0.0001\), LoRA rank \(r = 10\).
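
As a quick worked example of how the three terms are weighted, the snippet below plugs the reported \(\lambda_1\) and \(\lambda_2\) into the total loss. The function name and the per-term loss values are made up for illustration.

```python
def total_loss(l_ce, l_kd, l_orth, lambda_1=5.0, lambda_2=1e-4):
    """Total objective: cross-entropy + lambda_1 * distillation + lambda_2 * orthogonality."""
    return l_ce + lambda_1 * l_kd + lambda_2 * l_orth


# Hypothetical per-batch values: the distillation term is weighted up,
# while the orthogonality term is weighted down by four orders of magnitude.
example = total_loss(l_ce=1.0, l_kd=0.2, l_orth=100.0)
```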

Key Experimental Results

Main Results: Average Accuracy \(\bar{A}\) (%) on 4 Benchmarks

Method	Params (%)	CIFAR-100 T=10	ImageNet-R T=20	ImageNet-A T=20	VTAB T=10
L2P	0.2	79.51	65.82	39.81	78.96
DualPrompt	0.5	80.44	67.41	56.43	82.51
EASE	1.4	85.71	78.04	68.92	93.01
RanPAC	3.1	87.62	78.53	66.14	89.61
InfLoRA	0.3	80.97	73.22	56.91	88.83
O-LoRA	0.4	81.26	72.52	55.02	87.22
CL-LoRA	0.3	85.32	81.58	70.15	94.57

ImageNet-R Long-Sequence (T=40) Average Accuracy

Method	\(\bar{A}\)
InfLoRA	47.04
O-LoRA	47.53
CL-LoRA	60.54

CL-LoRA's advantage widens on long sequences: it outperforms the runner-up, O-LoRA (47.53), by about 13 percentage points.

Ablation Study (CIFAR-100 T=10 / ImageNet-R T=20)

KD	GR	BW	CIFAR-100	ImageNet-R
✗	✗	✗	88.20	82.24
✓	✗	✗	90.83	83.42
✓	✓	✗	91.69	84.08
✗	✗	✓	89.01	82.93
✓	✓	✓	91.85	84.77

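Each of the three modules contributes to the performance. KD yields the largest gain over the baseline (+2.63 on CIFAR-100, +1.18 on ImageNet-R); adding GR on top of KD contributes a further +0.86 / +0.66, and BW alone contributes +0.81 / +0.69.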
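
The per-module gains can be recomputed from the ablation table with a few lines of arithmetic. The dictionary below copies the table rows; the helper name `gain` is an assumption for illustration.

```python
# Ablation rows: (KD, GR, BW) -> (CIFAR-100, ImageNet-R) accuracy in %.
ABLATION = {
    (False, False, False): (88.20, 82.24),
    (True, False, False): (90.83, 83.42),
    (True, True, False): (91.69, 84.08),
    (False, False, True): (89.01, 82.93),
    (True, True, True): (91.85, 84.77),
}


def gain(config, reference):
    """Accuracy difference (config minus reference) per benchmark, rounded to 2 decimals."""
    return tuple(
        round(a - b, 2) for a, b in zip(ABLATION[config], ABLATION[reference])
    )


baseline = (False, False, False)
kd_gain = gain((True, False, False), baseline)            # KD alone vs. baseline
gr_gain = gain((True, True, False), (True, False, False))  # adding GR on top of KD
bw_gain = gain((False, False, True), baseline)             # BW alone vs. baseline
full_gain = gain((True, True, True), baseline)             # all three modules
```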

Highlights & Insights

- Elegant design of the dual-adapter architecture: shared knowledge in early layers, specialized adaptation in deep layers.
- With only 0.3% trainable parameters, CL-LoRA obtains the best results in the main table on ImageNet-R, ImageNet-A, and VTAB. On CIFAR-100 it trails RanPAC (87.62) and EASE (85.71) but remains competitive while using roughly 10x fewer parameters than RanPAC (3.1%).
- The gradient reassignment mechanism protects critical parameters more precisely than naive knowledge distillation.
- The advantage is particularly pronounced on challenging benchmarks with distribution shifts (ImageNet-R/A).