Exploiting Task Relationships in Continual Learning via Transferability-Aware Task Embeddings¶
Conference: NeurIPS 2025 arXiv: 2502.11609 Code: GitHub Area: Continual Learning Keywords: Continual Learning, task embedding, transferability, hypernetwork, H-score, catastrophic forgetting, LoRA
TL;DR¶
This paper proposes H-embedding, a transferability-aware task embedding based on H-score, and integrates it into a hypernetwork framework. By explicitly modeling inter-task relationships in the embedding space to guide parameter generation, the method achieves state-of-the-art final accuracy in a rehearsal-free setting.
Background & Motivation¶
Continual Learning (CL) requires a model to sequentially learn a series of tasks. The core challenge is catastrophic forgetting: learning new tasks degrades performance on previously learned ones. Existing approaches fall into three main categories:
Rehearsal-based: Store old samples for replay, but incur privacy and memory overhead.
Regularization-based: Constrain parameter updates to preserve old knowledge, potentially sacrificing adaptability to new tasks.
Architecture-based: Separate task-specific and shared components, but face scalability issues as the number of tasks grows.
These methods generally focus on model-level operations, overlooking a more fundamental question: the relationships between tasks. Capturing and exploiting inter-task transferability information could better enable forward and backward transfer.
The authors observe that:
- Transferability metrics naturally quantify compatibility between tasks.
- Existing transferability-based methods (e.g., Ermis et al., 2022) rely on storing old models and samples, making them incompatible with rehearsal-free settings.
- An online, efficient approach that requires no revisiting of old data is needed to encode task relationships.
Method¶
Overall Architecture¶
The framework consists of three core components:
- H-embedding: A task embedding derived from the H-score transferability metric, computed online before training on each new task.
- Hypernetwork: Takes task embeddings as input and generates model parameters for the corresponding task.
- Encoder-Decoder Guidance Module: Injects H-embedding information into the intermediate representations of the hypernetwork.
Workflow: When learning the \(j\)-th task, the hypernetwork first reconstructs model parameters for the previous \(j-1\) tasks to compute H-scores, then solves for the H-embedding \(\hat{e}^{(j)}\), and finally uses this embedding to guide hypernetwork optimization during training.
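The per-task workflow above can be sketched in code. Every name below is a placeholder stub standing in for the paper's actual components, not its API:

```python
def h_score_stub(old_model, new_data):
    """Placeholder for the H-score of an old task's model on the new task's data."""
    return 1.0

def fit_embedding_stub(h_scores, old_embs):
    """Placeholder for solving the MDS-style H-embedding objective."""
    return [float(sum(h_scores))]

def learn_task(j, data_j, hypernet, embeddings):
    """Sketch of learning the j-th task (1-indexed), rehearsal-free."""
    # 1. Reconstruct the j-1 old task models from stored embeddings only
    old_models = [hypernet(e) for e in embeddings[: j - 1]]
    # 2. Score each old task's transferability to the new task's data
    h_scores = [h_score_stub(m, data_j) for m in old_models]
    # 3. Solve for the new H-embedding from the (normalized) scores
    e_hat = fit_embedding_stub(h_scores, embeddings[: j - 1])
    # 4. (Hypernetwork training with e_hat as the guidance target goes here)
    embeddings.append(e_hat)
    return e_hat

embeddings = [[0.0], [1.0]]                  # embeddings of tasks 1 and 2
e3 = learn_task(3, None, lambda e: e, embeddings)
```

The key point the sketch captures: step 1 needs only the hypernetwork and stored embeddings, never the old data.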
Key Designs¶
Key Design 1: Computing H-embedding¶
H-score metric: A transferability measure grounded in information theory, defined as \(\mathcal{H}(f) = \operatorname{tr}\big(\operatorname{cov}(f(X))^{-1}\,\operatorname{cov}(\mathbb{E}[f(X)\mid Y])\big)\), where \(f\) is the feature extractor evaluated on the target task's data \((X, Y)\).
For transferability from task \(T_n\) to \(T_j\), the current task data \(D_j\) and old task model parameters \(\Theta^{(n)}\) (reconstructed by the hypernetwork) are used, without accessing old data.
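A minimal numpy sketch of the H-score computation (Bao et al., 2019): the trace of the inverse feature covariance times the covariance of the class-conditional mean features. The toy data and dimensions are illustrative:

```python
import numpy as np

def h_score(features, labels):
    """H-score: tr(cov(f)^{-1} cov(E[f|Y])). Higher means the features
    are more transferable to this label set."""
    f = features - features.mean(axis=0, keepdims=True)
    n = len(f)
    cov_f = f.T @ f / (n - 1)
    # Replace each sample by its class-conditional mean feature
    g = np.zeros_like(f)
    for y in np.unique(labels):
        idx = labels == y
        g[idx] = f[idx].mean(axis=0)
    cov_g = g.T @ g / (n - 1)
    # Pseudo-inverse guards against a singular feature covariance
    return float(np.trace(np.linalg.pinv(cov_f) @ cov_g))

# Toy check: two well-separated classes vs. shuffled (uninformative) labels
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (200, 8)), rng.normal(3, 1, (200, 8))])
y = np.array([0] * 200 + [1] * 200)
score_true = h_score(X, y)
score_shuffled = h_score(X, rng.permutation(y))
```

Informative labels yield class-conditional means far from zero, so `score_true` comes out much larger than `score_shuffled`.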
Embedding optimization: The H-embedding is obtained by minimizing the discrepancy between pairwise Euclidean distances in the embedding space and the normalized transferability scores, an MDS-style objective.
AHP normalization: Since absolute H-score values depend on target task features, directly aligning Euclidean distances with inverse H-scores introduces scale inconsistency. A pairwise tournament matrix \(W^{(j)}\) is constructed (where \(w_{m,n}^{(j)} = H(T_m, T_j) / H(T_n, T_j)\)), and its principal eigenvector serves as the normalized transferability score, subsequently mapped to distances via exponential transformation.
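The AHP step can be sketched in a few lines: build the pairwise ratio matrix and take its principal eigenvector as the scale-free score vector. (The subsequent exponential mapping to distances is omitted here; the input scores are illustrative.)

```python
import numpy as np

def ahp_normalize(h_scores):
    """Pairwise tournament matrix w[m, n] = H_m / H_n; its principal
    eigenvector gives scale-free relative transferability scores."""
    h = np.asarray(h_scores, dtype=float)
    W = h[:, None] / h[None, :]
    eigvals, eigvecs = np.linalg.eig(W)
    v = np.abs(eigvecs[:, np.argmax(eigvals.real)].real)
    return v / v.sum()

# The ratio matrix is rank one, so the principal eigenvector recovers
# the relative magnitudes of the raw scores exactly.
scores = ahp_normalize([2.0, 4.0, 8.0])
```

Because only ratios enter \(W^{(j)}\), the target-task-dependent scale of raw H-scores cancels out, which is exactly the inconsistency the normalization is meant to remove.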
Key Design 2: Hypernetwork Architecture and Embedding Guidance¶
The hypernetwork \(f_h(e, \Theta_h)\) maps task embedding \(e^{(j)}\) to task model parameters \(\Theta^{(j)}\). The guidance mechanism is implemented via an encoder-decoder:
- Encoder \(f_{Enc}\): The first half of the hypernetwork, mapping task embeddings to a hidden representation \(h\).
- Decoder \(f_{Dec}\): A lightweight MLP reconstructing the embedding \(\tilde{e}\) from \(h\), with the constraint \(\tilde{e} \approx \hat{e}\) (the H-embedding).
This ensures that the intermediate representations of the hypernetwork retain sufficient inter-task relationship information.
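A toy numpy sketch of the guidance mechanism; the layer shapes and linear/MLP choices here are illustrative, only the 32-dimensional embedding matches the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
emb_dim, hid_dim = 32, 64          # 32-dim task embeddings, as in the paper

# Toy weights: linear "encoder half" and a two-layer decoder MLP
W_enc = rng.normal(0, 0.1, (hid_dim, emb_dim))
W_d1 = rng.normal(0, 0.1, (hid_dim, hid_dim))
W_d2 = rng.normal(0, 0.1, (emb_dim, hid_dim))

def encode(e):
    return np.tanh(W_enc @ e)                 # hidden representation h

def decode(h):
    return W_d2 @ np.maximum(W_d1 @ h, 0.0)   # reconstructed embedding e~

def guidance_loss(e, e_hat):
    """L_e: one minus cosine similarity between the decoded embedding
    and the H-embedding target; bounded in [0, 2]."""
    e_tilde = decode(encode(e))
    cos = e_tilde @ e_hat / (np.linalg.norm(e_tilde) * np.linalg.norm(e_hat) + 1e-12)
    return 1.0 - cos

loss = guidance_loss(rng.normal(size=emb_dim), rng.normal(size=emb_dim))
```

Minimizing this loss forces the hidden representation \(h\) to remain decodable into \(\hat{e}\), i.e., to retain the inter-task relationship information.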
Key Design 3: LoRA Plugin Mode¶
The framework supports generating only LoRA parameters rather than full model weights, making it naturally compatible with parameter-efficient fine-tuning (PEFT):
- The pretrained backbone is frozen.
- The hypernetwork outputs only the low-rank matrices of LoRA adapters.
- This significantly reduces hypernetwork size and inference overhead.
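A quick back-of-the-envelope calculation shows the size reduction for a single weight matrix; the dimensions are ViT-B/16-like and the rank is an assumed value, not the paper's setting:

```python
# Parameter count the hypernetwork must emit for one weight matrix:
# full generation vs. LoRA plugin mode (illustrative dimensions).
d_in, d_out, rank = 768, 768, 8

full_params = d_in * d_out              # emit the whole weight matrix
lora_params = rank * (d_in + d_out)     # emit low-rank factors A and B
reduction = full_params / lora_params   # size of the output head shrinks ~48x
```

Since a hypernetwork's final layer scales with the number of parameters it emits, this reduction applies directly to the hypernetwork itself, not just to the generated adapters.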
Loss & Training¶
The total loss consists of three components:
| Term | Definition | Role |
|---|---|---|
| \(L_t\) | Cross-entropy supervised loss | Learning the current task |
| \(L_c\) | \(\frac{1}{j-1}\sum_{n=1}^{j-1}\|f_h(e^{(n)}, \Theta_h) - f_h(e^{(n)}, \Theta_h^*)\|^2\) | Anti-forgetting: ensures old embeddings generate similar model weights |
| \(L_e\) | \(\mathcal{L}(f_{Dec}(f_{Enc}(e^{(j)})), \hat{e}^{(j)})\) (cosine similarity loss) | H-embedding guidance: injects inter-task relationship priors |
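A minimal sketch of the anti-forgetting term \(L_c\) and the total loss composition; the linear "hypernetworks" and the trade-off coefficients are illustrative stand-ins:

```python
import numpy as np

def cl_regularizer(old_embs, hnet_current, hnet_snapshot):
    """L_c: mean squared difference between the parameters generated for
    old task embeddings by the current vs. snapshotted hypernetwork."""
    diffs = [np.sum((hnet_current(e) - hnet_snapshot(e)) ** 2)
             for e in old_embs]
    return float(np.mean(diffs))

rng = np.random.default_rng(0)
W_star = rng.normal(size=(16, 4))                # frozen snapshot Theta_h^*
W_cur = W_star + 0.01 * rng.normal(size=(16, 4)) # slightly drifted current weights
old_embs = [rng.normal(size=4) for _ in range(3)]

l_c = cl_regularizer(old_embs, lambda e: W_cur @ e, lambda e: W_star @ e)
# Illustrative weighting; the trade-off coefficients are hyperparameters
l_total = 0.5 + 0.1 * l_c + 0.1 * 0.2            # L_t + beta*L_c + gamma*L_e
```

Note that \(L_c\) penalizes drift in the hypernetwork's *outputs* for old embeddings, not in its weights directly, which is what makes the scheme rehearsal-free.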
Key Experimental Results¶
Main Results: CIFAR-100 & ImageNet-R (N=10)¶
| Backbone | Method | CIFAR-100 FAA(↑) | ImageNet-R FAA(↑) |
|---|---|---|---|
| ResNet-32 | WSN | 82.75 | 37.99 |
| ResNet-32 | HyperNet | 81.57 | 38.03 |
| ResNet-32 | H-embed Hnet | 83.08 | 38.16 |
| ViT-B/16 | HiDe-Prompt | 93.48 | 74.65 |
| ViT-B/16 | SD-LoRA | 87.26 | 77.18 |
| ViT-B/16 | H-embed Hnet-LoRA | 97.07 | 81.38 |
Improvements are particularly pronounced in the ViT-LoRA setting: +3.6 points over HiDe-Prompt on CIFAR-100, and +4.2 points over SD-LoRA on ImageNet-R.
Extended Results: Varying Task Counts & DomainNet¶
| Method | ImgNet-R (N=5) FAA | ImgNet-R (N=20) FAA | DomainNet (N=5) FAA |
|---|---|---|---|
| SD-LoRA | 79.01 | 74.05 | 72.58 |
| HiDe-Prompt | 74.77 | 73.59 | 72.20 |
| H-embed Hnet-LoRA | 79.27 | 79.90 | 76.64 |
Key finding: the performance advantage grows with the number of tasks (leading SD-LoRA by ~5.9 points at N=20), demonstrating stronger robustness in long task sequences.
Ablation Study¶
Conducted on ImageNet-R (N=5, 10, 20):
| Variant | Effect |
|---|---|
| w/o H-embedding guidance (w/o Hemb) | Noticeable FAA drop, validating the effectiveness of inter-task relationship priors |
| w/o CL regularization (w/o CLreg) | Significant FAA deterioration and increased forgetting |
| w/o AHP normalization (w/o AHP) | Performance degradation, especially reduced stability in long sequences |
Efficiency Analysis¶
- Task embedding dimensionality is only 32, with negligible storage overhead.
- Near-zero inference latency overhead: ResNet-32 on CIFAR-100 goes from 4.257s to 4.260s; ViT on ImageNet-R goes from 4.313s to 4.568s with LoRA.
- Both the decoder and H-embedding are lightweight two-layer MLPs and 32-dimensional vectors.
Highlights & Insights¶
- Novel perspective: Approaches continual learning through prior exploitation of task relationships rather than posterior model operations, providing an orthogonal dimension of improvement.
- Solid theoretical grounding: H-score is rooted in information-theoretic HGR maximal correlation analysis, providing a clear theoretical foundation.
- Online computation: H-embedding requires no revisiting of old data; it is computed solely from hypernetwork-reconstructed old model parameters, naturally suited to rehearsal-free settings.
- Elegant AHP normalization: The principal eigenvector of a pairwise tournament matrix elegantly resolves the cross-task scale inconsistency of H-scores.
- Plug-and-play: The framework can generate only PEFT parameters such as LoRA, integrating seamlessly with pretrained models.
- Long-sequence advantage: Performance gains amplify as the number of tasks increases, demonstrating greater value of task relationship modeling in complex scenarios.
Limitations & Future Work¶
- CIL adaptation is not fully natural: Class-Incremental Learning (CIL) requires an additional task-ID classifier (trained on frozen pretrained model features), which is not tightly coupled to the online nature of the framework.
- Assumptions underlying H-score: H-score relies on linear feature and conditional independence assumptions; its measurement accuracy may degrade for highly nonlinear inter-task relationships.
- Hypernetwork scalability: Although hypernetwork parameters are constrained to not exceed those of the main network, full-model generation is clearly infeasible for very large models (e.g., LLMs), making the LoRA variant essential.
- Task boundary assumption: The framework assumes clearly defined task boundaries and availability of task IDs during training (TIL setting), limiting applicability to online streaming scenarios with blurred boundaries.
- Only classification tasks evaluated: Experiments cover only image classification benchmarks; dense prediction tasks such as detection and segmentation remain unevaluated.
Related Work & Insights¶
- von Oswald et al. (2020): Established the foundational paradigm of hypernetworks for continual learning (the source of the CL regularization loss \(L_c\)); this paper introduces transferability guidance on top of that framework.
- SD-LoRA (Wu et al., 2025): A strong LoRA-based continual learning baseline, substantially outperformed here through task relationship modeling.
- H-score (Bao et al., 2019): An information-theoretic transferability metric; this paper extends it from a static evaluation tool to a dynamic embedding guidance signal.
- AHP normalization (Zamir et al., 2018): Originally from Taskonomy for task relationship modeling; the normalization technique is adapted here to resolve H-score scale inconsistency.
- Inspiration: Combining transferability metrics from transfer learning with continual learning frameworks is a promising direction; similar ideas could generalize to meta-learning or multi-task learning settings.
Rating¶
| Dimension | Score (1–5) | Notes |
|---|---|---|
| Novelty | 4 | Approaches continual learning from task transferability; H-embedding design is original |
| Technical Depth | 4 | Solid information-theoretic foundations; elegant AHP normalization; complete framework |
| Experimental Thoroughness | 4 | Multiple benchmarks, backbones, and settings; complete ablations; thorough efficiency analysis |
| Practicality | 3.5 | LoRA plugin mode is practical; CIL and large-model adaptation need further improvement |
| Writing Quality | 4 | Clear structure; motivation developed naturally; rigorous mathematical derivations |
| Overall | 3.9/5 | A solid continual learning work balancing theory and experiments; the introduction of inter-task relationship priors is a clear and well-motivated contribution |
Method Comparison¶
| Method | Category | Requires Replay | Task Relationship Modeling | Scalability |
|---|---|---|---|---|
| EWC / SI | Regularization | No | None (implicit via parameter importance) | Moderate; constraints accumulate with tasks |
| PackNet / WSN | Architecture | No | None | Poor; subnetwork capacity is limited |
| HyperNet (von Oswald) | Hypernetwork | No | None (task embeddings randomly initialized) | Moderate; embeddings lack prior guidance |
| Ermis et al. (2022) | Transferability | Yes (old samples + models) | Yes | Poor; large storage overhead |
| HiDe-Prompt | Prompt | No | None | Moderate; prompt pool size is constrained |
| SD-LoRA | LoRA | No | None | Moderate |
| Ours (H-embed Hnet) | Hypernetwork + Transferability | No | Yes (H-score prior) | Strong; advantage amplifies with more tasks |
Core distinction: This paper is the only method that explicitly exploits information-theoretic transferability measures to model task relationships in a rehearsal-free setting. Compared to vanilla HyperNet, H-embedding endows task embeddings with geometric structure—distances between embeddings reflect transferability—so that hypernetwork-generated parameters naturally align with task similarity.
Inspirations and Connections¶
- Transferability metrics → meta-learning signals: H-score was originally a static evaluation tool; this paper transforms it into an online optimization objective, inspiring the introduction of other transfer learning metrics (e.g., LogME, OTCE) into continual or multi-task learning.
- Geometric constraints in embedding space: Optimizing embeddings so that distances approximate transferability distances (MDS-style) resembles Task2Vec but is more lightweight—this idea could be explored for model selection or NAS.
- Generality of AHP normalization: The technique of pairwise comparison → principal eigenvector normalization generalizes to any scenario requiring cross-scale alignment (e.g., multi-metric AutoML).
- Hypernetwork + PEFT paradigm: Generating only LoRA parameters via a hypernetwork is worth exploring for LLM continual learning—e.g., generating LoRA adapters for each downstream task while a shared hypernetwork encodes task relationships.