Exploiting Task Relationships in Continual Learning via Transferability-Aware Task Embeddings

Conference: NeurIPS 2025 · arXiv: 2502.11609 · Code: GitHub · Area: LLM Evaluation · Keywords: Continual Learning, task embedding, transferability, hypernetwork, H-score, catastrophic forgetting, LoRA

TL;DR

This paper proposes H-embedding, a transferability-aware task embedding based on H-score, and integrates it into a hypernetwork framework. By explicitly modeling inter-task relationships in the embedding space to guide parameter generation, the method achieves state-of-the-art final accuracy in a rehearsal-free setting.

Background & Motivation

Continual Learning (CL) requires a model to sequentially learn a series of tasks. The core challenge is catastrophic forgetting: learning new tasks degrades performance on previously learned ones. Existing approaches fall into three main categories:

Rehearsal-based: Store old samples for replay, but incur privacy and memory overhead.

Regularization-based: Constrain parameter updates to preserve old knowledge, potentially sacrificing adaptability to new tasks.

Architecture-based: Separate task-specific and shared components, but face scalability issues as the number of tasks grows.

These methods generally focus on model-level operations, overlooking a more fundamental question: the relationships between tasks. Capturing and exploiting inter-task transferability information could better enable forward and backward transfer.

The authors observe that:

  • Transferability metrics naturally quantify compatibility between tasks.
  • Existing transferability-based methods (e.g., Ermis et al., 2022) rely on storing old models and samples, making them incompatible with rehearsal-free settings.
  • An online, efficient approach that requires no revisiting of old data is needed to encode task relationships.

Method

Overall Architecture

The framework consists of three core components:

  1. H-embedding: A task embedding derived from the H-score transferability metric, computed online before training on each new task.
  2. Hypernetwork: Takes task embeddings as input and generates model parameters for the corresponding task.
  3. Encoder-Decoder Guidance Module: Injects H-embedding information into the intermediate representations of the hypernetwork.

Workflow: When learning the \(j\)-th task, the hypernetwork first reconstructs model parameters for the previous \(j-1\) tasks to compute H-scores, then solves for the H-embedding \(\hat{e}^{(j)}\), and finally uses this embedding to guide hypernetwork optimization during training.

Key Designs

Key Design 1: Computing H-embedding

H-score metric: A transferability measure grounded in information theory, defined as:

\[H(f) = \mathrm{tr}\left(\mathrm{cov}(f(X))^{-1} \cdot \mathrm{cov}(\mathbb{E}_{P_{X|Y}}[f(X)|Y])\right)\]

For transferability from task \(T_n\) to \(T_j\), the current task data \(D_j\) and old task model parameters \(\Theta^{(n)}\) (reconstructed by the hypernetwork) are used, without accessing old data.
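To make the definition concrete, here is a minimal numpy sketch of the H-score formula above, applied to features `f(X)` and labels `Y`. The function name and the small ridge term `eps` (for numerical stability of the inverse) are our own additions, not from the paper's code.

```python
import numpy as np

def h_score(features: np.ndarray, labels: np.ndarray, eps: float = 1e-8) -> float:
    """H(f) = tr( cov(f(X))^{-1} @ cov(E[f(X)|Y]) )."""
    # Feature covariance cov(f(X)), with a small ridge for invertibility
    cov_f = np.cov(features, rowvar=False) + eps * np.eye(features.shape[1])
    classes = np.unique(labels)
    # Conditional means E[f(X)|Y=y], one row per class
    class_means = np.stack([features[labels == y].mean(axis=0) for y in classes])
    priors = np.array([(labels == y).mean() for y in classes])
    # Covariance of the conditional mean, weighted by class priors
    centered = class_means - features.mean(axis=0)
    cov_cond = (priors[:, None] * centered).T @ centered
    return float(np.trace(np.linalg.inv(cov_f) @ cov_cond))
```

Intuitively, discriminative features (whose class-conditional means are well separated relative to the overall feature covariance) score high, so the metric can rank how useful an old task's model is for the current task's data.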

Embedding optimization: H-embedding is obtained by minimizing the discrepancy between pairwise Euclidean distances in embedding space and normalized transferability scores:

\[\hat{e}^{(j)}, \gamma^{(j)} = \arg\min_{\hat{e}^{(j)}, \gamma^{(j)}} \sum_{n=1}^{j-1} \left(\|\hat{e}^{(j)} - e^{(n)}\|_2 - \gamma^{(j)} \exp(-\mathcal{AHP}(T_n, T_j))\right)^2\]

AHP normalization: Since absolute H-score values depend on target task features, directly aligning Euclidean distances with inverse H-scores introduces scale inconsistency. A pairwise tournament matrix \(W^{(j)}\) is constructed (where \(w_{m,n}^{(j)} = H(T_m, T_j) / H(T_n, T_j)\)), and its principal eigenvector serves as the normalized transferability score, subsequently mapped to distances via exponential transformation.
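The two steps above (AHP normalization, then distance matching) can be sketched as follows. `fit_h_embedding` uses plain gradient descent with illustrative hyperparameters (`lr`, `steps`, the 32-dim size); these are our assumptions, not the paper's solver.

```python
import numpy as np

def ahp_normalize(h_scores: np.ndarray) -> np.ndarray:
    """AHP step: build the pairwise tournament matrix w_mn = h_m / h_n and
    take its principal eigenvector as the scale-free transferability score."""
    W = h_scores[:, None] / h_scores[None, :]
    vals, vecs = np.linalg.eig(W)                 # W is non-symmetric -> eig
    v = np.abs(np.real(vecs[:, np.argmax(np.real(vals))]))
    return v / v.sum()

def fit_h_embedding(prev_embs, ahp_scores, dim=32, lr=0.05, steps=3000, seed=0):
    """Fit ê^(j) and γ^(j) so distances to old embeddings match
    γ · exp(-AHP score), as in the objective above (gradient-descent sketch)."""
    rng = np.random.default_rng(seed)
    e, gamma = rng.normal(scale=0.1, size=dim), 1.0
    targets = np.exp(-np.asarray(ahp_scores))
    for _ in range(steps):
        diffs = e - prev_embs                     # shape (j-1, dim)
        dists = np.linalg.norm(diffs, axis=1) + 1e-12
        resid = dists - gamma * targets           # per-task residuals
        e -= lr * 2 * (resid / dists) @ diffs / len(targets)
        gamma -= lr * -2 * np.dot(resid, targets) / len(targets)
    return e, gamma
```

Note the geometry this induces: tasks with higher normalized transferability get smaller target distances, so the new embedding lands closer to the embeddings of the tasks it transfers from best.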

Key Design 2: Hypernetwork Architecture and Embedding Guidance

The hypernetwork \(f_h(e, \Theta_h)\) maps task embedding \(e^{(j)}\) to task model parameters \(\Theta^{(j)}\). The guidance mechanism is implemented via an encoder-decoder:

  • Encoder \(f_{Enc}\): The first half of the hypernetwork, mapping task embeddings to a hidden representation \(h\).
  • Decoder \(f_{Dec}\): A lightweight MLP reconstructing the embedding \(\tilde{e}\) from \(h\), with the constraint \(\tilde{e} \approx \hat{e}\) (the H-embedding).

This ensures that the intermediate representations of the hypernetwork retain sufficient inter-task relationship information.
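A toy numpy forward pass of the guidance loss, assuming a one-layer `tanh` encoder and a linear decoder for brevity (the paper's modules are deeper); all weights and sizes here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
dim_e, dim_h = 32, 64                      # embedding / hidden sizes (illustrative)

W_enc = rng.normal(scale=0.1, size=(dim_h, dim_e))   # f_Enc: first half of hypernetwork
W_dec = rng.normal(scale=0.1, size=(dim_e, dim_h))   # f_Dec: lightweight reconstruction MLP

def cosine_guidance_loss(e_learn: np.ndarray, e_hat: np.ndarray) -> float:
    """L_e = 1 - cos(f_Dec(f_Enc(e)), ê): pushes the hypernetwork's hidden
    representation to retain the H-embedding's task-relationship information."""
    h = np.tanh(W_enc @ e_learn)           # hidden representation h
    e_tilde = W_dec @ h                    # reconstructed embedding ẽ
    cos = e_tilde @ e_hat / (np.linalg.norm(e_tilde) * np.linalg.norm(e_hat) + 1e-12)
    return float(1.0 - cos)

e_learn = rng.normal(size=dim_e)           # learnable task embedding e^(j)
e_hat = rng.normal(size=dim_e)             # precomputed H-embedding ê^(j)
loss = cosine_guidance_loss(e_learn, e_hat)
```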

Key Design 3: LoRA Plugin Mode

The framework supports generating only LoRA parameters rather than full model weights, naturally compatible with PEFT:

  • The pretrained backbone is frozen.
  • The hypernetwork outputs only the low-rank matrices of LoRA adapters.
  • This significantly reduces hypernetwork size and inference overhead.
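A sketch of the LoRA plugin mode with a single linear hypernetwork head for one layer; dimensions, names, and the one-layer head are our assumptions for illustration (the paper's hypernetwork is a full network over many layers).

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, rank, dim_e = 128, 128, 8, 32

# Frozen pretrained weight of one layer (stays untouched across tasks)
W0 = rng.normal(scale=0.02, size=(d_out, d_in))

# Hypernetwork head: task embedding -> flattened LoRA factors A, B
n_out = rank * d_in + d_out * rank
W_hyper = rng.normal(scale=0.01, size=(n_out, dim_e))

def generate_lora(e: np.ndarray):
    """Generate per-task low-rank LoRA factors from the task embedding."""
    flat = W_hyper @ e
    A = flat[:rank * d_in].reshape(rank, d_in)
    B = flat[rank * d_in:].reshape(d_out, rank)
    return A, B

e_task = rng.normal(size=dim_e)
A, B = generate_lora(e_task)
W_task = W0 + B @ A       # effective weight for this task; W0 remains frozen
```

For one `d_out × d_in` layer the hypernetwork emits only `rank · (d_in + d_out)` values instead of `d_in · d_out`, which is where the reduction in hypernetwork size comes from.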

Loss & Training

The total loss consists of three components:

\[L = L_t + \beta_e L_e + \beta_c L_c\]
| Term | Definition | Role |
|---|---|---|
| \(L_t\) | Cross-entropy supervised loss | Learning the current task |
| \(L_c\) | \(\frac{1}{j-1}\sum_{n=1}^{j-1}\lVert f_h(e^{(n)}, \Theta_h) - f_h(e^{(n)}, \Theta_h^*)\rVert^2\) | Anti-forgetting: ensures old embeddings generate similar model weights |
| \(L_e\) | \(\mathcal{L}(f_{Dec}(f_{Enc}(e^{(j)})), \hat{e}^{(j)})\), a cosine similarity loss | H-embedding guidance: injects inter-task relationship priors |
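The loss combination can be written out directly. This is a minimal sketch of the definitions above; the `beta` defaults are placeholders, not the paper's tuned hyperparameters, and `hnet_*` stand in for the hypernetwork before/after the current task.

```python
import numpy as np

def total_loss(L_t: float, L_e: float, L_c: float,
               beta_e: float = 0.01, beta_c: float = 0.05) -> float:
    """L = L_t + beta_e * L_e + beta_c * L_c  (beta values are placeholders)."""
    return L_t + beta_e * L_e + beta_c * L_c

def cl_reg_loss(hnet_now, hnet_snapshot, prev_embeddings) -> float:
    """L_c: old embeddings must still generate (nearly) the old task weights.
    hnet_now / hnet_snapshot are callables e -> flattened task parameters,
    where hnet_snapshot is frozen before learning the current task."""
    terms = [np.sum((hnet_now(e) - hnet_snapshot(e)) ** 2) for e in prev_embeddings]
    return float(np.mean(terms))
```

Because \(L_c\) only needs the stored 32-dim embeddings and a frozen copy of the hypernetwork, no old data or old task models need to be kept around.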

Key Experimental Results

Main Results: CIFAR-100 & ImageNet-R (N=10)

| Backbone | Method | CIFAR-100 FAA (↑) | ImageNet-R FAA (↑) |
|---|---|---|---|
| ResNet-32 | WSN | 82.75 | 37.99 |
| ResNet-32 | HyperNet | 81.57 | 38.03 |
| ResNet-32 | H-embed Hnet | 83.08 | 38.16 |
| ViT-B/16 | HiDe-Prompt | 93.48 | 74.65 |
| ViT-B/16 | SD-LoRA | 87.26 | 77.18 |
| ViT-B/16 | H-embed Hnet-LoRA | 97.07 | 81.38 |

Improvements are particularly pronounced in the ViT-LoRA setting: +3.6 points over HiDe-Prompt on CIFAR-100, and +4.2 points over SD-LoRA on ImageNet-R.

Extended Results: Varying Task Counts & DomainNet

| Method | ImageNet-R (N=5) FAA | ImageNet-R (N=20) FAA | DomainNet (N=5) FAA |
|---|---|---|---|
| SD-LoRA | 79.01 | 74.05 | 72.58 |
| HiDe-Prompt | 74.77 | 73.59 | 72.20 |
| H-embed Hnet-LoRA | 79.27 | 79.90 | 76.64 |

Key finding: the performance advantage grows with the number of tasks (leading SD-LoRA by ~5.9 points at N=20), demonstrating stronger robustness in long task sequences.

Ablation Study

Conducted on ImageNet-R (N=5, 10, 20):

| Variant | Effect |
|---|---|
| w/o H-embedding guidance (w/o Hemb) | Noticeable FAA drop, validating the effectiveness of inter-task relationship priors |
| w/o CL regularization (w/o CLreg) | Significant FAA deterioration and increased forgetting |
| w/o AHP normalization (w/o AHP) | Performance degradation, especially reduced stability in long sequences |

Efficiency Analysis

  • Task embedding dimensionality is only 32, with negligible storage overhead.
  • Near-zero inference latency overhead: ResNet-32 on CIFAR-100 goes from 4.257s to 4.260s; ViT on ImageNet-R goes from 4.313s to 4.568s with LoRA.
  • The decoder is a lightweight two-layer MLP, and each H-embedding a 32-dimensional vector.

Highlights & Insights

  1. Novel perspective: Approaches continual learning through prior exploitation of task relationships rather than posterior model operations, providing an orthogonal dimension of improvement.
  2. Solid theoretical grounding: H-score is rooted in information-theoretic HGR maximal correlation analysis, providing a clear theoretical foundation.
  3. Online computation: H-embedding requires no revisiting of old data; it is computed solely from hypernetwork-reconstructed old model parameters, naturally suited to rehearsal-free settings.
  4. Elegant AHP normalization: The principal eigenvector of a pairwise tournament matrix elegantly resolves the cross-task scale inconsistency of H-scores.
  5. Plug-and-play: The framework can generate only PEFT parameters such as LoRA, integrating seamlessly with pretrained models.
  6. Long-sequence advantage: Performance gains amplify as the number of tasks increases, demonstrating greater value of task relationship modeling in complex scenarios.

Limitations & Future Work

  1. CIL adaptation is not fully natural: Class-Incremental Learning (CIL) requires an additional task-ID classifier (trained on frozen pretrained model features), which is not tightly coupled to the online nature of the framework.
  2. Assumptions underlying H-score: H-score relies on linear feature and conditional independence assumptions; its measurement accuracy may degrade for highly nonlinear inter-task relationships.
  3. Hypernetwork scalability: Although hypernetwork parameters are constrained to not exceed those of the main network, full-model generation is clearly infeasible for very large models (e.g., LLMs), making the LoRA variant essential.
  4. Task boundary assumption: The framework assumes clearly defined task boundaries and availability of task IDs during training (TIL setting), limiting applicability to online streaming scenarios with blurred boundaries.
  5. Only classification tasks evaluated: Experiments cover only image classification benchmarks; dense prediction tasks such as detection and segmentation remain unevaluated.
Related Work

  • von Oswald et al. (2020): Established the foundational paradigm of hypernetworks for continual learning (the source of the CL regularization loss \(L_c\)); this paper introduces transferability guidance on top of that framework.
  • SD-LoRA (Wu et al., 2025): A strong LoRA-based continual learning baseline, substantially outperformed here through task relationship modeling.
  • H-score (Bao et al., 2019): An information-theoretic transferability metric; this paper extends it from a static evaluation tool to a dynamic embedding guidance signal.
  • AHP normalization (Zamir et al., 2018): Originally from Taskonomy for task relationship modeling; the normalization technique is adapted here to resolve H-score scale inconsistency.
  • Inspiration: Combining transferability metrics from transfer learning with continual learning frameworks is a promising direction; similar ideas could generalize to meta-learning or multi-task learning settings.

Rating

| Dimension | Score (1–5) | Notes |
|---|---|---|
| Novelty | 4 | Approaches continual learning from task transferability; H-embedding design is original |
| Technical Depth | 4 | Solid information-theoretic foundations; elegant AHP normalization; complete framework |
| Experimental Thoroughness | 4 | Multiple benchmarks, backbones, and settings; complete ablations; thorough efficiency analysis |
| Practicality | 3.5 | LoRA plugin mode is practical; CIL and large-model adaptation need further improvement |
| Writing Quality | 4 | Clear structure; motivation developed naturally; rigorous mathematical derivations |
| Overall | 3.9/5 | A solid continual learning work balancing theory and experiments; the introduction of inter-task relationship priors is a clear and well-motivated contribution |
Method Comparison

| Method | Category | Requires Replay | Task Relationship Modeling | Scalability |
|---|---|---|---|---|
| EWC / SI | Regularization | No | None (implicit via parameter importance) | Moderate; constraints accumulate with tasks |
| PackNet / WSN | Architecture | No | None | Poor; subnetwork capacity is limited |
| HyperNet (von Oswald) | Hypernetwork | No | None (task embeddings randomly initialized) | Moderate; embeddings lack prior guidance |
| Ermis et al. (2022) | Transferability | Yes (old samples + models) | Yes | Poor; large storage overhead |
| HiDe-Prompt | Prompt | No | None | Moderate; prompt pool size is constrained |
| SD-LoRA | LoRA | No | None | Moderate |
| Ours (H-embed Hnet) | Hypernetwork + Transferability | No | Yes (H-score prior) | Strong; advantage amplifies with more tasks |

Core distinction: This paper is the only method that explicitly exploits information-theoretic transferability measures to model task relationships in a rehearsal-free setting. Compared to vanilla HyperNet, H-embedding endows task embeddings with geometric structure—distances between embeddings reflect transferability—so that hypernetwork-generated parameters naturally align with task similarity.

Inspirations and Connections

  1. Transferability metrics → meta-learning signals: H-score was originally a static evaluation tool; this paper transforms it into an online optimization objective, inspiring the introduction of other transfer learning metrics (e.g., LogME, OTCE) into continual or multi-task learning.
  2. Geometric constraints in embedding space: Optimizing embeddings so that distances approximate transferability distances (MDS-style) resembles Task2Vec but is more lightweight—this idea could be explored for model selection or NAS.
  3. Generality of AHP normalization: The technique of pairwise comparison → principal eigenvector normalization generalizes to any scenario requiring cross-scale alignment (e.g., multi-metric AutoML).
  4. Hypernetwork + PEFT paradigm: Generating only LoRA parameters via a hypernetwork is worth exploring for LLM continual learning—e.g., generating LoRA adapters for each downstream task while a shared hypernetwork encodes task relationships.