Energy-Structured Low-Rank Adaptation for Continual Learning
Conference: ICML 2026
arXiv: 2605.27482
Code: Not yet released
Area: Model Compression / LoRA / Continual Learning
Keywords: continual learning, LoRA, orthogonal subspace, energy concentration, dynamic rank allocation
TL;DR
E2-LoRA shifts the perspective from parameter or input feature spaces to the "task-induced output feature drift" \(\Delta \mathbf{Y}_t = \Delta \mathbf{W}_t \mathbf{X}_t\). An SVD of this drift is used to rewrite the LoRA parameters onto an energy-concentrated, rank-ordered basis, so that low-energy ranks can be discarded and their capacity recycled for new tasks. Combined with an adaptive rank allocation strategy based on energy retention rates, the method reports SOTA performance across multiple continual learning benchmarks.
Background & Motivation
Background: Mainstream Continual Learning (CL) based on Pre-trained Models (PTMs) typically freezes the backbone and adds Parameter-Efficient Fine-Tuning (PEFT) modules (prompt / adapter / LoRA) for each new task. To reduce inter-task interference, a family of "orthogonalization" methods has emerged: either forcing LoRA parameters of different tasks to be orthogonal (O-LoRA, Param-Param) or performing SVD on historical input features to keep new task parameters orthogonal to principal singular vectors (GPM / DualGPM / InfLoRA, Input-Param).
Limitations of Prior Work: The authors observe that both orthogonalization routes suffer from "excessive energy dispersion." In the parameter space, old task knowledge is randomly scattered across the columns of \(\mathbf{B}\), causing the occupied subspace to grow linearly and squeeze the capacity available for new tasks. In the input space, PTM features are inherently high-dimensional and diverse; constraining new tasks based on input principal directions overly restricts learning directions, leading to a sharp drop in plasticity (see Figure 1 in the original paper). Fundamentally, these restricted subspaces are "rigid" and cannot be recycled.
Key Challenge: The trade-off between stability (preventing forgetting) and plasticity (learning new things) is amplified by the constraint that "orthogonal subspaces are non-recyclable"—once a subspace is allocated to a task, it is locked.
Goal: Find a low-dimensional, energy-concentrated, and intrinsically ordered subspace representation for old task knowledge, allowing low-energy directions to be released and recycled for new tasks while keeping the old task outputs nearly unchanged.
Key Insight: Instead of focusing on parameters or inputs, one should monitor the intermediate product where LoRA truly impacts the model—the output feature drift \(\Delta y_t(x) = \Delta \mathbf{W}_t x\). Empirical and theoretical findings suggest that although the input \(\mathbf{X}_t\) may be high-rank, the \(\Delta \mathbf{Y}_t\) induced by \(\Delta \mathbf{W}_t\) is usually concentrated in very few principal directions (as task semantics are inherently low-dimensional).
Core Idea: Perform PCA / SVD on \(\Delta \mathbf{Y}_t\) and rewrite the LoRA parameters \(\mathbf{B}_t, \mathbf{A}_t\) onto this set of orthogonal bases sorted by descending energy. Thus, the "top-\(r\) ranks" provide the optimal low-rank approximation of the task knowledge (theoretically provable, see Prop 3.1/3.2), while remaining low-energy ranks can be safely discarded to release capacity.
Method
Overall Architecture
Let \(\mathbf{W}_0\) be the pre-trained linear layer weights. When the \(t\)-th task arrives, a LoRA module \(\Delta \mathbf{W}_t = \mathbf{B}_t \mathbf{A}_t\) is allocated. E2-LoRA executes four stages per task:
- (1) Dynamic rank allocation: prune old-task LoRA modules to \(r_k^{(t)}\) ranks based on an energy retention threshold \(\rho\), and concatenate the released columns to form the initial frozen basis \(\mathbf{B}_t\) for the new task.
- (2) Training: optimize only \(\mathbf{A}_t\), with self-distillation to prevent forgetting.
- (3) Energy-structured transformation: perform SVD on \(\Delta \mathbf{Y}_t\) and rotate \((\mathbf{B}_t, \mathbf{A}_t)\) onto the energy basis so that the ranks are ordered by energy.
- (4) Classifier head alignment: fine-tune the classifier using synthetic features generated from class statistics.
Key Designs
- Output-Drift-Induced Orthogonalization:
    - Function: Changes the answer to "orthogonal to what?" by choosing the reference space that constrains the fewest dimensions and has the highest energy concentration.
    - Mechanism: Computes the output drift matrix \(\Delta \mathbf{Y}_t = \mathbf{B}_t \mathbf{A}_t \mathbf{X}_t \in \mathbb{R}^{d_\text{out}\times n}\) on a proxy input batch \(\mathbf{X}_t\) and takes its SVD, \(\Delta \mathbf{Y}_t = \mathbf{U}_t \mathbf{\Sigma}_t \mathbf{V}_t^\top\), where the columns of \(\mathbf{U}_t\) are output directions in descending energy order. The new task's \(\mathbf{B}_t\) is initialized by concatenating and freezing the low-energy columns released by previous tasks, so the output of \(\mathbf{B}_t \mathbf{A}_t\) is confined to directions that carry little or no energy in the retained old-task drifts. This acts as a hard orthogonality constraint in the output space.
    - Design Motivation: In parameter-level orthogonalization (O-LoRA), the columns of \(\mathbf{B}\) carry no energy meaning, which makes the constraint \(\|\mathbf{B}_i^\top \mathbf{B}_t\|_F^2\) "blind." In input-level orthogonalization (DualGPM/InfLoRA), PTM input representations are naturally high-rank and dispersed, so the constraints shut down most directions for new tasks. \(\Delta \mathbf{Y}_t\) reflects the directions in which the model output actually changed, and it remains energy-concentrated because task semantics are low-dimensional, which makes it a better-suited space for orthogonalization.
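The drift SVD described above can be illustrated with a small NumPy sketch. This is not the authors' code: the function name `output_drift_energy`, the random matrices, and the row scaling applied to `A` (used only to make a few ranks dominate) are assumptions chosen for illustration.

```python
import numpy as np

def output_drift_energy(B, A, X):
    """SVD of the output drift dY = B @ A @ X on a proxy batch X.

    Returns the left singular vectors U (output directions, descending energy),
    the singular values, and the cumulative energy fraction per rank.
    """
    dY = B @ A @ X                                   # shape (d_out, n)
    U, sigma, _ = np.linalg.svd(dY, full_matrices=False)
    energy = sigma ** 2
    return U, sigma, np.cumsum(energy) / energy.sum()

# Toy setup (all sizes and scales are hypothetical).
rng = np.random.default_rng(0)
d_out, d_in, r, n = 32, 48, 8, 256
B = rng.standard_normal((d_out, r))
row_scale = np.array([10, 5, 1, 0.5, 0.1, 0.1, 0.1, 0.1])[:, None]
A = rng.standard_normal((r, d_in)) * row_scale       # a few ranks dominate
X = rng.standard_normal((d_in, n))                   # full-rank proxy inputs

U, sigma, cum_energy = output_drift_energy(B, A, X)
input_rank = np.linalg.matrix_rank(X)
drift_rank = int(np.sum(sigma > 1e-8 * sigma[0]))
print(f"input rank = {input_rank}, drift rank = {drift_rank}")
print("cumulative energy of first ranks:", np.round(cum_energy[:4], 4))
```

In this toy case the inputs are full rank while the drift has rank at most `r`, and most of its energy sits in the first few output directions, which is the property the method relies on.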
- Energy-Structured Transformation + Rank Truncation Optimality:
    - Function: Rewrites \((\mathbf{B}_t, \mathbf{A}_t)\) so that "the first few ranks carry almost all task knowledge" while leaving the product \(\Delta \mathbf{W}_t = \mathbf{B}_t \mathbf{A}_t\) unchanged.
    - Mechanism: After training the current task, set \(\mathbf{A}_t \leftarrow (\mathbf{U}_t[:,:r_t])^\top \mathbf{B}_t \mathbf{A}_t\) and \(\mathbf{B}_t \leftarrow \mathbf{U}_t[:,:r_t]\), where the right-hand sides use the pre-update \(\mathbf{B}_t\). The paper proves (Prop 3.1) that among all updates with rank \(\le r\), \(\mathbf{B}_t[:,:r]\mathbf{A}_t[:r,:]\) minimizes the expected output reconstruction error \(\mathbb{E}_x \|\mathbf{B}_t[:,:r]\mathbf{A}_t[:r,:]x - \mathbf{B}_t\mathbf{A}_t x\|^2\). It also shows (Prop 3.2) that after truncating to the top \(r\) ranks, the expected error equals the sum of squares of the discarded singular values, \(\sum_{i=r+1}^{d_\text{out}} \sigma_i^2\).
    - Design Motivation: This step turns the observation that "low-energy ranks can be discarded" into an optimal truncation with theoretical guarantees. It is also what makes output-drift-induced orthogonalization practical: only once energy is sorted and concentrated can the tail be pruned and its columns handed to the next task without damaging learned knowledge.
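The transformation and the truncation-error identity can be checked numerically. The sketch below is a hedged illustration, not the paper's implementation: `energy_structured_transform` is a hypothetical name, the data are random, and the exact equality \(\mathbf{B}_t\mathbf{A}_t\) before and after the rotation assumes the proxy batch `X` has full row rank (otherwise the two products agree only on the span of the proxy inputs).

```python
import numpy as np

def energy_structured_transform(B, A, X):
    """Rotate (B, A) onto the energy basis of the output drift B @ A @ X."""
    r = B.shape[1]
    U, sigma, _ = np.linalg.svd(B @ A @ X, full_matrices=False)
    B_new = U[:, :r]                 # orthonormal, energy-ordered columns
    A_new = B_new.T @ B @ A          # uses the pre-update B
    return B_new, A_new, sigma

# Toy data (hypothetical sizes); n >= d_in so X has full row rank.
rng = np.random.default_rng(1)
d_out, d_in, r, n = 16, 24, 4, 128
B = rng.standard_normal((d_out, r))
A = rng.standard_normal((r, d_in))
X = rng.standard_normal((d_in, n))

B_new, A_new, sigma = energy_structured_transform(B, A, X)

# The weight update itself is unchanged by the rotation.
same_update = np.allclose(B_new @ A_new, B @ A)

# Truncating to the top-k ranks leaves exactly the discarded energy as error.
k = 2
truncated_out = B_new[:, :k] @ A_new[:k] @ X
err = np.linalg.norm(truncated_out - B @ A @ X) ** 2
discarded_energy = np.sum(sigma[k:] ** 2)
print(same_update, np.isclose(err, discarded_energy))
```

The error here is the unnormalized squared Frobenius norm over the proxy batch; dividing by `n` would give the empirical counterpart of the expectation in Prop 3.1.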
- Dynamic Rank Allocation:
    - Function: Dynamically schedules the current rank of each task within a shared \(d_\text{out}\) capacity pool based on "how much old tasks must retain" and "how much the new task needs."
    - Mechanism: For each old task \(k\), select the minimum \(r_k^{(t)}\) such that the energy retention ratio satisfies \(\sum_{i=1}^{r_k^{(t)}} \sigma_{k,i}^2 / \sum_{i=1}^{r_k} \sigma_{k,i}^2 \ge \rho\). The new task is guaranteed a minimum rank \(r_t^\text{min} = \lceil d_\text{out}/t \rceil\). If energy-based pruning does not free enough ranks for the new task, ranks are pruned uniformly from the tails of all old tasks until the minimum is met.
    - Design Motivation: Fixed rank allocation either compresses too early (losing accuracy) or occupies too much (collapsing plasticity). Binding pruning to an energy threshold \(\rho\) lets each task "occupy space as needed": simple tasks may take 1-2 ranks while complex ones take more, so the entire \(d_\text{out}\) capacity can be reused efficiently.
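The allocation rule can be sketched in a few lines of Python. This is a minimal reading of the description above under stated assumptions: the function names are hypothetical, "uniformly pruned" is interpreted as removing one tail rank per old task in round-robin order, and each old task is assumed to keep at least one rank.

```python
import math

def min_rank_for_energy(sigma, rho):
    """Smallest rank whose cumulative squared singular values reach rho of the total."""
    energy = [s * s for s in sigma]
    total = sum(energy)
    running = 0.0
    for rank, e in enumerate(energy, start=1):
        running += e
        if running >= rho * total:
            return rank
    return len(energy)

def allocate_ranks(old_sigmas, d_out, rho):
    """Return (ranks kept per old task, ranks free for the incoming task)."""
    t = len(old_sigmas) + 1                      # index of the incoming task
    kept = [min_rank_for_energy(s, rho) for s in old_sigmas]
    r_min = math.ceil(d_out / t)                 # minimum rank for the new task
    # Fallback: trim one tail rank per old task, round-robin, until r_min fits.
    while d_out - sum(kept) < r_min:
        pruned = False
        for k in range(len(kept)):
            # Assumption: every old task keeps at least one rank.
            if kept[k] > 1 and d_out - sum(kept) < r_min:
                kept[k] -= 1
                pruned = True
        if not pruned:
            break
    return kept, d_out - sum(kept)

# Two old tasks with descending singular values (made-up numbers).
old_sigmas = [[10, 5, 1, 0.5], [8, 7, 6, 1]]
kept_loose, free_loose = allocate_ranks(old_sigmas, d_out=8, rho=0.95)
kept_strict, free_strict = allocate_ranks(old_sigmas, d_out=8, rho=0.999)
print(kept_loose, free_loose)
print(kept_strict, free_strict)
```

With `rho=0.999` the energy rule alone frees no ranks in this example, so the fallback trims the tails until the incoming task reaches its minimum rank.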
Loss & Training
The total loss is \(\mathcal{L} = \mathcal{L}_\text{ce} + \lambda \mathcal{L}_\text{distill}\). \(\mathcal{L}_\text{ce}\) is the cross-entropy on new classes. \(\mathcal{L}_\text{distill}\) is a KL self-distillation term on old-class logits at temperature \(T=2\), where the teacher is the same network with the historical LoRA modules frozen. After training, the classifier head is aligned using synthetic features generated from per-class statistics.
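A minimal NumPy sketch of this objective follows. Several details are assumptions rather than facts from the paper: the function name `cl_loss`, the default `lam=1.0`, the choice to compute cross-entropy over the new-class logit slice only, and the direction KL(teacher || student) without any additional temperature scaling.

```python
import numpy as np

def log_softmax(z):
    """Numerically stable log-softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def cl_loss(student_logits, teacher_logits, labels, n_old, lam=1.0, T=2.0):
    """Cross-entropy on new classes plus KL self-distillation on old-class logits.

    student_logits: (batch, n_old + n_new); teacher_logits: (batch, n_old);
    labels: integer class ids indexed over all classes (new classes only).
    """
    # Assumption: cross-entropy is restricted to the new-class logits.
    logp_new = log_softmax(student_logits[:, n_old:])
    ce = -logp_new[np.arange(len(labels)), labels - n_old].mean()
    # KL(teacher || student) on old-class logits softened by temperature T.
    logp_s = log_softmax(student_logits[:, :n_old] / T)
    logp_t = log_softmax(teacher_logits[:, :n_old] / T)
    kl = (np.exp(logp_t) * (logp_t - logp_s)).sum(axis=-1).mean()
    return ce + lam * kl, ce, kl

# Toy batch: 6 old classes, 4 new classes (random logits).
rng = np.random.default_rng(0)
student = rng.standard_normal((4, 10))
labels = np.array([6, 7, 8, 9])

# A teacher that agrees with the student on old classes gives zero distillation loss.
loss_same, ce_same, kl_same = cl_loss(student, student[:, :6].copy(), labels, n_old=6)

# A teacher that disagrees yields a positive KL term.
teacher = student[:, :6] + rng.standard_normal((4, 6))
loss_diff, ce_diff, kl_diff = cl_loss(student, teacher, labels, n_old=6)
print(kl_same, kl_diff > 0)
```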
Key Experimental Results
Main Results
Using ViT-B/16-IN21K as the backbone, class-incremental continual learning was performed on ImageNet-R / CIFAR-100 / CUB-200 / Cars-196 (Last-Acc = accuracy after all tasks; Inc-Acc = average accuracy across the incremental process). Key baselines from Table 1:
| Method | ImageNet-R Last-Acc (%) | CIFAR-100 Last-Acc (%) | CUB-200 Last-Acc (%) | Cars-196 Last-Acc (%) |
|---|---|---|---|---|
| L2P | 66.49 | 82.76 | 62.21 | 38.18 |
| DualPrompt | 68.50 | 85.56 | 66.00 | 40.14 |
| SLCA | 77.00 | 91.53 | 84.71 | 67.73 |
| EASE | 77.75 | 86.54 | 79.90 | 35.46 |
| BiLoRA | 77.95 | 87.46 | — | — |
| SSIAT | 79.38 | 91.35 | 88.75 | 71.02 |
| MOS | 78.10 | 90.53 | 89.91 | 67.76 |
| E2-LoRA (Ours) | New SOTA | New SOTA | New SOTA | New SOTA |
On all four benchmarks, E2-LoRA's Last-Acc and Inc-Acc exceed the listed baselines (exact values are reported in the paper and are not reproduced here). Gains are most significant on long, fine-grained sequences such as Cars-196.
Ablation Study
| Configuration | Description |
|---|---|
| Param-Param Orthogonalization (O-LoRA style) | Subspace is scattered, new task directions are exhausted quickly, drops accuracy on long sequences |
| Input-Param Orthogonalization (InfLoRA style) | PTM input features are high-rank; constraints are too strong, reducing plasticity |
| Output-Drift Orthogonalization (Ours) | Single-task energy concentrates in few directions; old tasks are preserved + sufficient space for new tasks |
| No Energy-Structured Transformation | \(\mathbf{B}\) columns are unordered, preventing rank-based pruning; degrades to fixed-rank LoRA |
| No Dynamic Rank Allocation (Fixed \(r_t\)) | Insufficient capacity for later tasks, leading to reduced plasticity |
| No self-distillation | Logit drift for old classes, increasing forgetting |
Key Findings
- Perspective shift is more important than extra regularization: Simply changing the constraint space from "parameter/input" to "output drift" yields consistent improvements without adding trainable parameters, which indicates that the problem lies in the energy distribution of the subspace rather than in the constraint method itself.
- Truncation error has a closed form, \(\sum_{i>r} \sigma_i^2\) (Prop 3.2), so the energy retention rate \(\rho\) directly bounds the output reconstruction error, which makes parameter tuning interpretable.
- On the most challenging setting, Cars-196 (fine-grained + many tasks), the gap with other SOTA is the widest, highlighting the value of "dynamic capacity recycling" in long task sequences.
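The link between \(\rho\) and the reconstruction error can be written out explicitly. The following short derivation starts from the closed form stated in Prop 3.2 and applies it to the empirical (Frobenius-norm) error on the proxy batch. The notation \(\Delta \mathbf{Y}_t^{(r)}\) for the drift after truncation to the top \(r\) ranks is introduced here for illustration and is not taken from the paper.

\[
\frac{\|\Delta \mathbf{Y}_t - \Delta \mathbf{Y}_t^{(r)}\|_F^2}{\|\Delta \mathbf{Y}_t\|_F^2}
= \frac{\sum_{i>r} \sigma_i^2}{\sum_{i} \sigma_i^2}
= 1 - \frac{\sum_{i \le r} \sigma_i^2}{\sum_{i} \sigma_i^2}
\le 1 - \rho
\]

The last step uses the selection rule for \(r\), which keeps the retained energy fraction at or above \(\rho\). For example, \(\rho = 0.99\) caps the relative squared output error of a pruned task at 1%.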
Highlights & Insights
- The observation that "knowledge is hidden in the output rather than parameters," combined with a simple PCA on \(\Delta \mathbf{Y}_t\), pushes LoRA-based continual learning to a new level with almost no extra runtime overhead. This "problem reformulation" is often more powerful than adding new regularizers.
- Linking a physically interpretable knob, the energy retention \(\rho\), directly to rank allocation is user-friendly: a tolerance such as "I can accept a 5% relative (squared) output error" translates into \(\rho = 0.95\).
- Energy-structured transformation is essentially a "Principal Component Alignment" for LoRA: it suggests that any PEFT method (IA³, AdaLoRA, prefix tuning) in CL scenarios can be "aligned to task-intrinsic low-dimensional bases" before storage.
- Releasing low-energy bases to subsequent tasks essentially treats continual learning as an "auction of a finite capacity pool," which is closer to human learning under capacity constraints than "linearly growing subspaces."
Limitations & Future Work
- SVD on output drift is performed once per task; larger proxy batches are more accurate but slower. If \(d_\text{out}\) is very large (e.g., LLM hidden dim 4096+), SVD costs may become non-negligible.
- The empirical observation "output drift is low-rank" comes from ViT classification; whether it remains concentrated in scenarios with complex output distributions (LLM generation, retrieval, RLHF) is unverified.
- Dynamic rank allocation assumes all tasks share one \(d_\text{out}\) pool, but cross-layer rank coupling or cross-module balancing in multi-layer LoRA is not explicitly modeled.
- The threshold \(\rho\) is shared across the whole network and all tasks; whether layer-wise or task-adaptive \(\rho\) could provide further gains is unexplored.
Related Work & Insights
- vs O-LoRA: O-LoRA performs orthogonalization in parameter space \(\|\mathbf{B}_i^\top \mathbf{B}_t\|_F^2\); subspaces grow linearly and are non-recyclable. E2-LoRA "compresses and sorts" each task's LoRA by energy, allowing low-energy parts to be yielded.
- vs GPM / DualGPM / InfLoRA: They perform SVD on input features, but PTM input energy is dispersed. E2-LoRA uses the low-rank nature of output drift, offering more precise constraints and less damage to plasticity.
- vs AdaLoRA: AdaLoRA also uses energy-based pruning on LoRA but targets parameter efficiency for a single task; E2-LoRA extends this to multi-task settings with a "release to subsequent tasks" recycling mechanism.
- vs EASE / MOS / TUNA (adapter route): These rely on task-specific adapters + routing/fusion, where parameters grow linearly with tasks. E2-LoRA reuses ranks within a fixed \(d_\text{out}\) capacity, making it more efficient.
Rating
- Novelty: ⭐⭐⭐⭐ Clean perspective shift to "output drift" with accompanying PCA and proof of truncation optimality.
- Experimental Thoroughness: ⭐⭐⭐⭐ Comprehensive comparison across four classification benchmarks, class/domain-incremental settings, and various orthogonalization route ablations.
- Writing Quality: ⭐⭐⭐⭐ Uses Prop 3.1/3.2 to formalize intuition; Figure 1 clearly illustrates energy distribution differences.
- Value: ⭐⭐⭐⭐ Achieves SOTA in the mainstream CL + PEFT route; the method is plug-and-play for any LoRA-CL framework.