Exploiting Task Relationships in Continual Learning via Transferability-Aware Task Embeddings¶
Conference: NeurIPS 2025 arXiv: 2502.11609 Code: GitHub Area: Continual Learning Keywords: Continual Learning, task embedding, transferability, hypernetwork, H-score, catastrophic forgetting, LoRA
TL;DR¶
This paper proposes H-embedding, a transferability-aware task embedding based on H-score, and integrates it into a hypernetwork framework. By explicitly modeling inter-task relationships in the embedding space to guide parameter generation, the method achieves state-of-the-art final accuracy in a rehearsal-free setting.
Background & Motivation¶
Continual Learning (CL) requires a model to sequentially learn a series of tasks. The core challenge is catastrophic forgetting: learning new tasks degrades performance on previously learned ones. Existing approaches fall into three main categories:
Rehearsal-based: Store old samples for replay, but incur privacy and memory overhead.
Regularization-based: Constrain parameter updates to preserve old knowledge, potentially sacrificing adaptability to new tasks.
Architecture-based: Separate task-specific and shared components, but face scalability issues as the number of tasks grows.
These methods generally focus on model-level operations, overlooking a more fundamental question: the relationships between tasks. Capturing and exploiting inter-task transferability information could better enable forward and backward transfer.
The authors observe that:
- Transferability metrics naturally quantify compatibility between tasks.
- Existing transferability-based methods (e.g., Ermis et al., 2022) rely on storing old models and samples, making them incompatible with rehearsal-free settings.
- An online, efficient approach that requires no revisiting of old data is needed to encode task relationships.
Method¶
Overall Architecture¶
The framework consists of three core components:
- H-embedding: A task embedding derived from the H-score transferability metric, computed online before training on each new task.
- Hypernetwork: Takes task embeddings as input and generates model parameters for the corresponding task.
- Encoder-Decoder Guidance Module: Injects H-embedding information into the intermediate representations of the hypernetwork.
Workflow: When learning the \(j\)-th task, the hypernetwork first reconstructs model parameters for the previous \(j-1\) tasks to compute H-scores, then solves for the H-embedding \(\hat{e}^{(j)}\), and finally uses this embedding to guide hypernetwork optimization during training.
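The per-task workflow above can be sketched in code. Every name below is a placeholder stub standing in for the paper's actual components, not its API:

```python
def h_score_stub(old_model, new_data):
    """Placeholder for the H-score of an old task's model on the new task's data."""
    return 1.0

def fit_embedding_stub(h_scores, old_embs):
    """Placeholder for solving the MDS-style H-embedding objective."""
    return [float(sum(h_scores))]

def learn_task(j, data_j, hypernet, embeddings):
    """Sketch of learning the j-th task (1-indexed), rehearsal-free."""
    # 1. Reconstruct the j-1 old task models from stored embeddings only
    old_models = [hypernet(e) for e in embeddings[: j - 1]]
    # 2. Score each old task's transferability to the new task's data
    h_scores = [h_score_stub(m, data_j) for m in old_models]
    # 3. Solve for the new H-embedding from the (normalized) scores
    e_hat = fit_embedding_stub(h_scores, embeddings[: j - 1])
    # 4. (Hypernetwork training with e_hat as the guidance target goes here)
    embeddings.append(e_hat)
    return e_hat

embeddings = [[0.0], [1.0]]                  # embeddings of tasks 1 and 2
e3 = learn_task(3, None, lambda e: e, embeddings)
```

The key point the sketch captures: step 1 needs only the hypernetwork and stored embeddings, never the old data.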
Key Designs¶
Key Design 1: Computing H-embedding¶
H-score metric: A transferability measure grounded in information theory, defined as \(\mathcal{H}(f) = \operatorname{tr}\big(\operatorname{cov}(f(X))^{-1}\,\operatorname{cov}(\mathbb{E}[f(X)\mid Y])\big)\), where \(f\) is the feature extractor evaluated on the target task's data \((X, Y)\).
For transferability from task \(T_n\) to \(T_j\), the current task data \(D_j\) and old task model parameters \(\Theta^{(n)}\) (reconstructed by the hypernetwork) are used, without accessing old data.
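A minimal numpy sketch of the H-score computation (Bao et al., 2019): the trace of the inverse feature covariance times the covariance of the class-conditional mean features. The toy data and dimensions are illustrative:

```python
import numpy as np

def h_score(features, labels):
    """H-score: tr(cov(f)^{-1} cov(E[f|Y])). Higher means the features
    are more transferable to this label set."""
    f = features - features.mean(axis=0, keepdims=True)
    n = len(f)
    cov_f = f.T @ f / (n - 1)
    # Replace each sample by its class-conditional mean feature
    g = np.zeros_like(f)
    for y in np.unique(labels):
        idx = labels == y
        g[idx] = f[idx].mean(axis=0)
    cov_g = g.T @ g / (n - 1)
    # Pseudo-inverse guards against a singular feature covariance
    return float(np.trace(np.linalg.pinv(cov_f) @ cov_g))

# Toy check: two well-separated classes vs. shuffled (uninformative) labels
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (200, 8)), rng.normal(3, 1, (200, 8))])
y = np.array([0] * 200 + [1] * 200)
score_true = h_score(X, y)
score_shuffled = h_score(X, rng.permutation(y))
```

Informative labels yield class-conditional means far from zero, so `score_true` comes out much larger than `score_shuffled`.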
Embedding optimization: The H-embedding is obtained by minimizing the discrepancy between pairwise Euclidean distances in the embedding space and the normalized transferability scores, an MDS-style objective.
AHP normalization: Since absolute H-score values depend on target task features, directly aligning Euclidean distances with inverse H-scores introduces scale inconsistency. A pairwise tournament matrix \(W^{(j)}\) is constructed (where \(w_{m,n}^{(j)} = H(T_m, T_j) / H(T_n, T_j)\)), and its principal eigenvector serves as the normalized transferability score, subsequently mapped to distances via exponential transformation.
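The AHP step can be sketched in a few lines: build the pairwise ratio matrix and take its principal eigenvector as the scale-free score vector. (The subsequent exponential mapping to distances is omitted here; the input scores are illustrative.)

```python
import numpy as np

def ahp_normalize(h_scores):
    """Pairwise tournament matrix w[m, n] = H_m / H_n; its principal
    eigenvector gives scale-free relative transferability scores."""
    h = np.asarray(h_scores, dtype=float)
    W = h[:, None] / h[None, :]
    eigvals, eigvecs = np.linalg.eig(W)
    v = np.abs(eigvecs[:, np.argmax(eigvals.real)].real)
    return v / v.sum()

# The ratio matrix is rank one, so the principal eigenvector recovers
# the relative magnitudes of the raw scores exactly.
scores = ahp_normalize([2.0, 4.0, 8.0])
```

Because only ratios enter \(W^{(j)}\), the target-task-dependent scale of raw H-scores cancels out, which is exactly the inconsistency the normalization is meant to remove.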
Key Design 2: Hypernetwork Architecture and Embedding Guidance¶
The hypernetwork \(f_h(e, \Theta_h)\) maps task embedding \(e^{(j)}\) to task model parameters \(\Theta^{(j)}\). The guidance mechanism is implemented via an encoder-decoder:
- Encoder \(f_{Enc}\): The first half of the hypernetwork, mapping task embeddings to a hidden representation \(h\).
- Decoder \(f_{Dec}\): A lightweight MLP reconstructing the embedding \(\tilde{e}\) from \(h\), with the constraint \(\tilde{e} \approx \hat{e}\) (the H-embedding).
This ensures that the intermediate representations of the hypernetwork retain sufficient inter-task relationship information.
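A toy numpy sketch of the guidance mechanism; the layer shapes and linear/MLP choices here are illustrative, only the 32-dimensional embedding matches the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
emb_dim, hid_dim = 32, 64          # 32-dim task embeddings, as in the paper

# Toy weights: linear "encoder half" and a two-layer decoder MLP
W_enc = rng.normal(0, 0.1, (hid_dim, emb_dim))
W_d1 = rng.normal(0, 0.1, (hid_dim, hid_dim))
W_d2 = rng.normal(0, 0.1, (emb_dim, hid_dim))

def encode(e):
    return np.tanh(W_enc @ e)                 # hidden representation h

def decode(h):
    return W_d2 @ np.maximum(W_d1 @ h, 0.0)   # reconstructed embedding e~

def guidance_loss(e, e_hat):
    """L_e: one minus cosine similarity between the decoded embedding
    and the H-embedding target; bounded in [0, 2]."""
    e_tilde = decode(encode(e))
    cos = e_tilde @ e_hat / (np.linalg.norm(e_tilde) * np.linalg.norm(e_hat) + 1e-12)
    return 1.0 - cos

loss = guidance_loss(rng.normal(size=emb_dim), rng.normal(size=emb_dim))
```

Minimizing this loss forces the hidden representation \(h\) to remain decodable into \(\hat{e}\), i.e., to retain the inter-task relationship information.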
Key Design 3: LoRA Plugin Mode¶
The framework supports generating only LoRA parameters rather than full model weights, making it naturally compatible with parameter-efficient fine-tuning (PEFT):
- The pretrained backbone is frozen.
- The hypernetwork outputs only the low-rank matrices of LoRA adapters.
- This significantly reduces hypernetwork size and inference overhead.
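A quick back-of-the-envelope calculation shows the size reduction for a single weight matrix; the dimensions are ViT-B/16-like and the rank is an assumed value, not the paper's setting:

```python
# Parameter count the hypernetwork must emit for one weight matrix:
# full generation vs. LoRA plugin mode (illustrative dimensions).
d_in, d_out, rank = 768, 768, 8

full_params = d_in * d_out              # emit the whole weight matrix
lora_params = rank * (d_in + d_out)     # emit low-rank factors A and B
reduction = full_params / lora_params   # size of the output head shrinks ~48x
```

Since a hypernetwork's final layer scales with the number of parameters it emits, this reduction applies directly to the hypernetwork itself, not just to the generated adapters.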
Loss & Training¶
The total loss consists of three components:
| Term | Definition | Role |
|---|---|---|
| \(L_t\) | Cross-entropy supervised loss | Learning the current task |
| \(L_c\) | \(\frac{1}{j-1}\sum_{n=1}^{j-1}\|f_h(e^{(n)}, \Theta_h) - f_h(e^{(n)}, \Theta_h^*)\|^2\) | Anti-forgetting: ensures old embeddings generate similar model weights |
| \(L_e\) | \(\mathcal{L}(f_{Dec}(f_{Enc}(e^{(j)})), \hat{e}^{(j)})\) (cosine similarity loss) | H-embedding guidance: injects inter-task relationship priors |
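A minimal sketch of the anti-forgetting term \(L_c\) and the total loss composition; the linear "hypernetworks" and the trade-off coefficients are illustrative stand-ins:

```python
import numpy as np

def cl_regularizer(old_embs, hnet_current, hnet_snapshot):
    """L_c: mean squared difference between the parameters generated for
    old task embeddings by the current vs. snapshotted hypernetwork."""
    diffs = [np.sum((hnet_current(e) - hnet_snapshot(e)) ** 2)
             for e in old_embs]
    return float(np.mean(diffs))

rng = np.random.default_rng(0)
W_star = rng.normal(size=(16, 4))                # frozen snapshot Theta_h^*
W_cur = W_star + 0.01 * rng.normal(size=(16, 4)) # slightly drifted current weights
old_embs = [rng.normal(size=4) for _ in range(3)]

l_c = cl_regularizer(old_embs, lambda e: W_cur @ e, lambda e: W_star @ e)
# Illustrative weighting; the trade-off coefficients are hyperparameters
l_total = 0.5 + 0.1 * l_c + 0.1 * 0.2            # L_t + beta*L_c + gamma*L_e
```

Note that \(L_c\) penalizes drift in the hypernetwork's *outputs* for old embeddings, not in its weights directly, which is what makes the scheme rehearsal-free.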
Key Experimental Results¶
Main Results: CIFAR-100 & ImageNet-R (N=10)¶
| Backbone | Method | CIFAR-100 FAA(↑) | ImageNet-R FAA(↑) |
|---|---|---|---|
| ResNet-32 | WSN | 82.75 | 37.99 |
| ResNet-32 | HyperNet | 81.57 | 38.03 |
| ResNet-32 | H-embed Hnet | 83.08 | 38.16 |
| ViT-B/16 | HiDe-Prompt | 93.48 | 74.65 |
| ViT-B/16 | SD-LoRA | 87.26 | 77.18 |
| ViT-B/16 | H-embed Hnet-LoRA | 97.07 | 81.38 |
Improvements are particularly pronounced in the ViT-LoRA setting: +3.6 points over HiDe-Prompt on CIFAR-100, and +4.2 points over SD-LoRA on ImageNet-R.
Extended Results: Varying Task Counts & DomainNet¶
| Method | ImgNet-R (N=5) FAA | ImgNet-R (N=20) FAA | DomainNet (N=5) FAA |
|---|---|---|---|
| SD-LoRA | 79.01 | 74.05 | 72.58 |
| HiDe-Prompt | 74.77 | 73.59 | 72.20 |
| H-embed Hnet-LoRA | 79.27 | 79.90 | 76.64 |
Key finding: the performance advantage grows with the number of tasks (leading SD-LoRA by ~5.9 points at N=20), demonstrating stronger robustness in long task sequences.
Ablation Study¶
Conducted on ImageNet-R (N=5, 10, 20):
| Variant | Effect |
|---|---|
| w/o H-embedding guidance (w/o Hemb) | Noticeable FAA drop, validating the effectiveness of inter-task relationship priors |
| w/o CL regularization (w/o CLreg) | Significant FAA deterioration and increased forgetting |
| w/o AHP normalization (w/o AHP) | Performance degradation, especially reduced stability in long sequences |
Efficiency Analysis¶
- Task embedding dimensionality is only 32, with negligible storage overhead.
- Near-zero inference latency overhead: ResNet-32 on CIFAR-100 goes from 4.257s to 4.260s; ViT on ImageNet-R goes from 4.313s to 4.568s with LoRA.
- Both the decoder and H-embedding are lightweight two-layer MLPs and 32-dimensional vectors.
Highlights & Insights¶
- Novel perspective: Approaches continual learning through prior exploitation of task relationships rather than posterior model operations, providing an orthogonal dimension of improvement.
- Solid theoretical grounding: H-score is rooted in information-theoretic HGR maximal correlation analysis, providing a clear theoretical foundation.
- Online computation: H-embedding requires no revisiting of old data; it is computed solely from hypernetwork-reconstructed old model parameters, naturally suited to rehearsal-free settings.
- Elegant AHP normalization: The principal eigenvector of a pairwise tournament matrix elegantly resolves the cross-task scale inconsistency of H-scores.
- Plug-and-play: The framework can generate only PEFT parameters such as LoRA, integrating seamlessly with pretrained models.
- Long-sequence advantage: Performance gains amplify as the number of tasks increases, demonstrating greater value of task relationship modeling in complex scenarios.
Limitations & Future Work¶
- CIL adaptation is not fully natural: Class-Incremental Learning (CIL) requires an additional task-ID classifier (trained on frozen pretrained model features), which is not tightly coupled to the online nature of the framework.
- Assumptions underlying H-score: H-score relies on linear feature and conditional independence assumptions; its measurement accuracy may degrade for highly nonlinear inter-task relationships.
- Hypernetwork scalability: Although hypernetwork parameters are constrained to not exceed those of the main network, full-model generation is clearly infeasible for very large models (e.g., LLMs), making the LoRA variant essential.
- Task boundary assumption: The framework assumes clearly defined task boundaries and availability of task IDs during training (TIL setting), limiting applicability to online streaming scenarios with blurred boundaries.
- Only classification tasks evaluated: Experiments cover only image classification benchmarks; dense prediction tasks such as detection and segmentation remain unevaluated.
Related Work & Insights¶
- von Oswald et al. (2020): Established the foundational paradigm of hypernetworks for continual learning (the source of the CL regularization loss \(L_c\)); this paper introduces transferability guidance on top of that framework.
- SD-LoRA (Wu et al., 2025): A strong LoRA-based continual learning baseline, substantially outperformed here through task relationship modeling.
- H-score (Bao et al., 2019): An information-theoretic transferability metric; this paper extends it from a static evaluation tool to a dynamic embedding guidance signal.
- AHP normalization (Zamir et al., 2018): Originally from Taskonomy for task relationship modeling; the normalization technique is adapted here to resolve H-score scale inconsistency.
- Inspiration: Combining transferability metrics from transfer learning with continual learning frameworks is a promising direction; similar ideas could generalize to meta-learning or multi-task learning settings.
Rating¶
| Dimension | Score (1–5) | Notes |
|---|---|---|
| Novelty | 4 | Approaches continual learning from task transferability; H-embedding design is original |
| Technical Depth | 4 | Solid information-theoretic foundations; elegant AHP normalization; complete framework |
| Experimental Thoroughness | 4 | Multiple benchmarks, backbones, and settings; complete ablations; thorough efficiency analysis |
| Practicality | 3.5 | LoRA plugin mode is practical; CIL and large-model adaptation need further improvement |
| Writing Quality | 4 | Clear structure; motivation developed naturally; rigorous mathematical derivations |
| Overall | 3.9/5 | A solid continual learning work balancing theory and experiments; the introduction of inter-task relationship priors is a clear and well-motivated contribution |
Method Comparison¶
| Method | Category | Requires Replay | Task Relationship Modeling | Scalability |
|---|---|---|---|---|
| EWC / SI | Regularization | No | None (implicit via parameter importance) | Moderate; constraints accumulate with tasks |
| PackNet / WSN | Architecture | No | None | Poor; subnetwork capacity is limited |
| HyperNet (von Oswald) | Hypernetwork | No | None (task embeddings randomly initialized) | Moderate; embeddings lack prior guidance |
| Ermis et al. (2022) | Transferability | Yes (old samples + models) | Yes | Poor; large storage overhead |
| HiDe-Prompt | Prompt | No | None | Moderate; prompt pool size is constrained |
| SD-LoRA | LoRA | No | None | Moderate |
| Ours (H-embed Hnet) | Hypernetwork + Transferability | No | Yes (H-score prior) | Strong; advantage amplifies with more tasks |
Core distinction: This paper is the only method that explicitly exploits information-theoretic transferability measures to model task relationships in a rehearsal-free setting. Compared to vanilla HyperNet, H-embedding endows task embeddings with geometric structure—distances between embeddings reflect transferability—so that hypernetwork-generated parameters naturally align with task similarity.
Inspirations and Connections¶
- Transferability metrics → meta-learning signals: H-score was originally a static evaluation tool; this paper transforms it into an online optimization objective, inspiring the introduction of other transfer learning metrics (e.g., LogME, OTCE) into continual or multi-task learning.
- Geometric constraints in embedding space: Optimizing embeddings so that distances approximate transferability distances (MDS-style) resembles Task2Vec but is more lightweight—this idea could be explored for model selection or NAS.
- Generality of AHP normalization: The technique of pairwise comparison → principal eigenvector normalization generalizes to any scenario requiring cross-scale alignment (e.g., multi-metric AutoML).
- Hypernetwork + PEFT paradigm: Generating only LoRA parameters via a hypernetwork is worth exploring for LLM continual learning—e.g., generating LoRA adapters for each downstream task while a shared hypernetwork encodes task relationships.