CL-LoRA: Continual Low-Rank Adaptation for Rehearsal-Free Class-Incremental Learning
Conference: CVPR 2025
arXiv: 2505.24816
Code: JiangpengHe/CL-LoRA
Institution: MIT / Purdue University
Area: Continual Learning / Model Compression
Keywords: Class-Incremental Learning, LoRA, PEFT, knowledge distillation, catastrophic forgetting
TL;DR
This paper proposes CL-LoRA, a dual-adapter architecture that pairs a task-shared LoRA with task-specific LoRAs. By combining knowledge distillation, gradient reassignment, and learnable block-wise weights, CL-LoRA reaches the best average accuracy among the compared methods on ImageNet-R, ImageNet-A, and VTAB, and stays competitive on CIFAR-100, with only 0.3% trainable parameters.
Background & Motivation
Background: Class-incremental learning (CIL) requires models to sequentially learn new classes while preserving previous knowledge. Recently, pre-trained models (PTMs) combined with Parameter-Efficient Fine-Tuning (PEFT) have demonstrated promising results in rehearsal-free CIL without storing exemplar images from old tasks.
Limitations of Prior Work:
- Prompt-based methods (L2P, DualPrompt, CODA-Prompt): Require a large number of prompt parameters, and the task selection mechanisms are complex.
- Adapter-based methods (EASE, O-LoRA, InfLoRA): Create a new adapter for each task, leading to parameter redundancy.
- Existing methods fail to effectively utilize shared knowledge across tasks, and independent learning for each task leads to fragmented knowledge.
Key Challenge: How to maintain parameter efficiency while simultaneously learning task-shared knowledge and capturing task-specific features, thereby balancing stability (no forgetting) and plasticity (learning new knowledge).
Key Insight: Designing a dual-adapter architecture, where the first half of the Transformer layers uses a shared LoRA to learn cross-task knowledge, and the second half uses task-specific LoRAs to capture task-specific features.
Core Idea: Shared LoRA (with knowledge distillation + gradient reassignment) + Specific LoRA (with block-wise weights + orthogonality constraints) = Efficient Continual Learning.
Method
Overall Architecture
Based on the ViT-B/16 pre-trained model, the 12 Transformer layers are split into two halves: the first 6 layers insert the task-shared LoRA, and the remaining 6 layers insert the task-specific LoRAs. A prototype classifier is used during inference.
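The split described above can be sketched as a small layout helper. This is only an illustration at block granularity: the names `AdapterSlot`, `plan_adapters`, and `count_lora_modules` are hypothetical, and the sketch does not say which projections inside a block receive the LoRA.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AdapterSlot:
    block: int  # 1-based Transformer block index
    kind: str   # "shared" or "specific"


def plan_adapters(num_blocks: int = 12, num_shared: int = 6) -> list:
    """Assign a LoRA type to every Transformer block."""
    if not 0 <= num_shared <= num_blocks:
        raise ValueError("num_shared must lie in [0, num_blocks]")
    return [
        AdapterSlot(b, "shared" if b <= num_shared else "specific")
        for b in range(1, num_blocks + 1)
    ]


def count_lora_modules(num_tasks: int, num_blocks: int = 12, num_shared: int = 6) -> int:
    """One shared LoRA per early block, plus one specific LoRA per task per late block."""
    return num_shared + num_tasks * (num_blocks - num_shared)


plan = plan_adapters()
print([slot.kind for slot in plan])
print(count_lora_modules(num_tasks=10))  # 6 + 10 * 6 = 66
```

Under this layout only the task-specific half grows with the number of tasks; the shared half stays at a fixed size.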
Key Designs
- Task-Shared Adapter
    - Function: Inserts a single shared LoRA into the first \(l=6\) Transformer blocks to learn general knowledge across tasks.
    - Mechanism: Initializes \(\mathbf{A}_s\) as a random orthogonal matrix and keeps it fixed; only \(\mathbf{B}_s\) is updated.
    - Knowledge Distillation: When learning task \(t\), the shared adapter from task \(t-1\) serves as the teacher, and its knowledge is distilled into the current adapter.
    - Early Exit Strategy: Computes the distillation loss at the output of the last block that contains the shared adapter (block \(l\)) rather than at the end of the network, reducing computational overhead.
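A minimal NumPy sketch of the shared adapter follows. It makes several assumptions beyond the text: a mean-squared-error distillation loss, a student initialized from the teacher, and distillation applied to the LoRA update of a single block rather than to features propagated through all \(l\) blocks. The helper names are hypothetical; the hidden size 768 is that of ViT-B/16.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 768, 10  # ViT-B/16 hidden size; LoRA rank from the hyperparameter list


def random_orthogonal_rows(rank, dim, rng):
    """Return a (rank, dim) matrix with orthonormal rows, built via QR."""
    q, _ = np.linalg.qr(rng.standard_normal((dim, rank)))  # q: (dim, rank)
    return q.T


A_s = random_orthogonal_rows(r, d, rng)           # frozen down-projection
B_teacher = 0.01 * rng.standard_normal((d, r))    # B_s^{t-1}, from the previous task
B_student = B_teacher.copy()                      # B_s^{t}; init from teacher is an assumption


def lora_update(x, A, B):
    """Low-rank update for inputs x of shape (n, d): x A^T B^T."""
    return x @ A.T @ B.T


def kd_loss(x, A, B_stu, B_tea):
    """Assumed MSE distillation between student and teacher adapter outputs."""
    diff = lora_update(x, A, B_stu) - lora_update(x, A, B_tea)
    return float(np.mean(diff ** 2))


x = rng.standard_normal((4, d))                   # four dummy token features
loss_at_init = kd_loss(x, A_s, B_student, B_teacher)
B_student = B_student + 0.01 * rng.standard_normal((d, r))  # pretend one update step
loss_after_drift = kd_loss(x, A_s, B_student, B_teacher)
print(loss_at_init, loss_after_drift)
```

Because \(\mathbf{A}_s\) never changes, teacher and student share the same projection subspace, and the distillation signal depends only on how far \(\mathbf{B}_s\) has drifted.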
- Gradient Reassignment
    - Function: Identifies and protects parameters in the shared adapter that are important to old tasks.
    - Mechanism: Calculates the difference between the teacher's and the student's gradients and uses its magnitude to rescale the distillation gradient element-wise, which effectively lowers the learning rate for parameters important to old tasks.
    - Implementation: \(\nabla \mathcal{L}'_{kd} = \nabla \mathcal{L}_{kd} \odot |\nabla_{\mathbf{B}_s^{t-1}} \mathcal{L}_{kd} - \nabla_{\mathbf{B}_s^{t}} \mathcal{L}_{kd}|\)
    - Effect: Retains crucial shared knowledge more precisely.
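The element-wise rule in the Implementation formula can be written directly in NumPy. This sketch only reproduces the arithmetic of that formula; how the teacher-side gradient is obtained is not specified here, so all three gradients are passed in as plain arrays, and the function name `reassign_gradient` is hypothetical.

```python
import numpy as np


def reassign_gradient(grad_kd, grad_teacher, grad_student):
    """Return grad_kd * |grad_teacher - grad_student|, element-wise."""
    grad_kd = np.asarray(grad_kd, dtype=float)
    weight = np.abs(
        np.asarray(grad_teacher, dtype=float) - np.asarray(grad_student, dtype=float)
    )
    return grad_kd * weight


# Toy 2x2 gradients for B_s (values are made up for illustration).
grad_kd = [[0.5, -0.2], [0.1, 0.4]]
grad_teacher = [[0.3, 0.0], [0.1, -0.2]]
grad_student = [[0.1, 0.0], [0.1, 0.2]]

reassigned = reassign_gradient(grad_kd, grad_teacher, grad_student)
print(reassigned)
```

Entries where the two gradients agree receive a zero weight and are left untouched by the distillation update, while entries where they disagree keep a proportionally scaled gradient.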
- Task-Specific Adapter and Block-wise Weights
    - Function: Each task has an independent LoRA in the remaining \(N-l\) Transformer blocks.
    - Block-wise Weights: Learns trainable block-wise scaling factors \(w_i^j\) for each specific adapter, where \(i\) indexes the task and \(j\) the block.
    - Orthogonality Constraint: \(\mathcal{L}_{orth} = \sum_{j=l+1}^{N} \sum_{i \neq k} \| (\mathbf{B}_i^j)^\top \mathbf{B}_k^j \|_F^2\), where \(\mathbf{B}_i^j\) is the matrix of task \(i\) at block \(j\).
    - Design Motivation: Different tasks may require differing levels of adaptation at different Transformer layers.
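The orthogonality constraint above can be evaluated with a few nested loops. In this sketch `B[i][j]` holds task \(i\)'s matrix at the \(j\)-th task-specific block; the container layout and the function name `orthogonality_loss` are assumptions made for illustration, and the sum runs over ordered pairs \(i \neq k\) exactly as the formula is written.

```python
import numpy as np


def orthogonality_loss(B):
    """Sum of squared Frobenius norms of B_i^T B_k over blocks and task pairs i != k.

    B[i][j] is a (d, r) array: task i's matrix at specific block j.
    """
    num_tasks = len(B)
    num_blocks = len(B[0])
    total = 0.0
    for j in range(num_blocks):
        for i in range(num_tasks):
            for k in range(num_tasks):
                if i != k:
                    total += np.linalg.norm(B[i][j].T @ B[k][j], ord="fro") ** 2
    return float(total)


# Toy example with d=4, r=2 and a single task-specific block.
B_task0 = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 0.0], [0.0, 0.0]])
B_task1 = np.array([[0.0, 0.0], [0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])

loss_disjoint = orthogonality_loss([[B_task0], [B_task1]])  # column spaces do not overlap
loss_identical = orthogonality_loss([[B_task0], [B_task0]])  # fully overlapping
print(loss_disjoint, loss_identical)
```

The loss is zero when the tasks occupy disjoint subspaces and grows as their matrices overlap, which is the behavior the constraint is meant to penalize.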
- Prototype Classifier Inference
    - Features are computed for each learned task using its corresponding task-specific adapter.
    - The cosine similarity between the features and each task's prototypes is computed.
    - The class with the highest similarity is predicted.
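The three inference steps can be sketched as follows. The sketch assumes the per-task features have already been extracted (one feature vector per learned task for the same image) and that each task stores one prototype per class; `predict` and its argument names are hypothetical.

```python
import numpy as np


def cosine_similarity(features, prototypes):
    """Cosine similarity between rows of features (n, d) and prototypes (c, d)."""
    f = features / np.linalg.norm(features, axis=-1, keepdims=True)
    p = prototypes / np.linalg.norm(prototypes, axis=-1, keepdims=True)
    return f @ p.T


def predict(features_per_task, prototypes_per_task, class_ids_per_task):
    """Return (class_id, similarity) with the highest cosine similarity over all tasks."""
    best_class, best_sim = None, -np.inf
    for feat, protos, ids in zip(features_per_task, prototypes_per_task, class_ids_per_task):
        sims = cosine_similarity(feat[None, :], protos)[0]
        idx = int(np.argmax(sims))
        if sims[idx] > best_sim:
            best_class, best_sim = ids[idx], float(sims[idx])
    return best_class, best_sim


# Two learned tasks, two classes each, 2-D features (toy values).
prototypes = [np.array([[1.0, 0.0], [0.0, 1.0]]), np.array([[1.0, 1.0], [-1.0, 0.0]])]
class_ids = [[0, 1], [2, 3]]
features = [np.array([1.0, 0.2]), np.array([0.1, 1.0])]  # same image, one feature per adapter

predicted_class, similarity = predict(features, prototypes, class_ids)
print(predicted_class, round(similarity, 4))
```

Note that inference cost in this scheme scales with the number of learned tasks, since every task-specific adapter produces its own feature for the input.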
Loss & Training
Hyperparameters: \(\lambda_1 = 5\), \(\lambda_2 = 0.0001\), LoRA rank \(r = 10\).
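Only the hyperparameters are listed here. As a reading aid, and assuming that \(\lambda_1\) weights the distillation loss, \(\lambda_2\) weights the orthogonality loss, and the classification term is a cross-entropy loss \(\mathcal{L}_{ce}\) (none of which is stated in this summary), the overall objective would take the form:

\[
\mathcal{L} = \mathcal{L}_{ce} + \lambda_1 \, \mathcal{L}_{kd} + \lambda_2 \, \mathcal{L}_{orth}
\]

With the values above, this reading would put a comparatively large weight on distillation (\(\lambda_1 = 5\)) and a very small one on the orthogonality penalty (\(\lambda_2 = 0.0001\)).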
Key Experimental Results
Main Results: Average Accuracy \(\bar{A}\) (%) on 4 Benchmarks
| Method | Params (%) | CIFAR-100 T=10 | ImageNet-R T=20 | ImageNet-A T=20 | VTAB T=10 |
|---|---|---|---|---|---|
| L2P | 0.2 | 79.51 | 65.82 | 39.81 | 78.96 |
| DualPrompt | 0.5 | 80.44 | 67.41 | 56.43 | 82.51 |
| EASE | 1.4 | 85.71 | 78.04 | 68.92 | 93.01 |
| RanPAC | 3.1 | 87.62 | 78.53 | 66.14 | 89.61 |
| InfLoRA | 0.3 | 80.97 | 73.22 | 56.91 | 88.83 |
| O-LoRA | 0.4 | 81.26 | 72.52 | 55.02 | 87.22 |
| CL-LoRA | 0.3 | 85.32 | 81.58 | 70.15 | 94.57 |
ImageNet-R Long-Sequence (T=40) Average Accuracy
| Method | \(\bar{A}\) |
|---|---|
| InfLoRA | 47.04 |
| O-LoRA | 47.53 |
| CL-LoRA | 60.54 |
CL-LoRA demonstrates a more significant advantage on long sequences, outperforming the LoRA-based runner-up, O-LoRA (47.53), by about 13 percentage points.
Ablation Study (CIFAR-100 T=10 / ImageNet-R T=20)
| KD | GR | BW | CIFAR-100 | ImageNet-R |
|---|---|---|---|---|
| ✗ | ✗ | ✗ | 88.20 | 82.24 |
| ✓ | ✗ | ✗ | 90.83 | 83.42 |
| ✓ | ✓ | ✗ | 91.69 | 84.08 |
| ✗ | ✗ | ✓ | 89.01 | 82.93 |
| ✓ | ✓ | ✓ | 91.85 | 84.77 |
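Each of the three modules (KD: knowledge distillation, GR: gradient reassignment, BW: block-wise weights) contributes to the performance, with KD yielding the most substantial gain: +2.63 points on CIFAR-100 and +1.18 on ImageNet-R over the baseline without any module.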
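The per-module gains can be read off the ablation table with a few subtractions. The snippet below only re-derives differences from the rows shown above; the dictionary layout and the helper name `gain` are illustrative.

```python
# Keys are (KD, GR, BW) switches; values are (CIFAR-100, ImageNet-R) accuracy in %.
ablation = {
    (False, False, False): (88.20, 82.24),
    (True, False, False): (90.83, 83.42),
    (True, True, False): (91.69, 84.08),
    (False, False, True): (89.01, 82.93),
    (True, True, True): (91.85, 84.77),
}


def gain(after, before):
    """Accuracy difference (in points) between two ablation rows, per dataset."""
    return tuple(round(a - b, 2) for a, b in zip(ablation[after], ablation[before]))


baseline = (False, False, False)
kd_gain = gain((True, False, False), baseline)            # KD alone vs. baseline
gr_gain = gain((True, True, False), (True, False, False))  # GR added on top of KD
bw_gain = gain((False, False, True), baseline)             # BW alone vs. baseline
full_gain = gain((True, True, True), baseline)             # all three vs. baseline
print(kd_gain, gr_gain, bw_gain, full_gain)
```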
Highlights & Insights
- Elegant design of the dual-adapter architecture: shared knowledge in early layers, specialized adaptation in deep layers.
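- With only 0.3% trainable parameters, achieves the best average accuracy among the compared methods on ImageNet-R, ImageNet-A, and VTAB. On CIFAR-100 it trails RanPAC (87.62) and EASE (85.71) at 85.32 while using roughly 10x fewer parameters than RanPAC (3.1%).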
- The gradient reassignment mechanism protects critical parameters more precisely than naive knowledge distillation.
- The advantage holds on challenging benchmarks with distribution shifts: +3.05 points over the next-best method (RanPAC) on ImageNet-R and +1.23 over EASE on ImageNet-A.