Attention Retention for Continual Learning with Vision Transformers¶
Conference: AAAI 2026 arXiv: 2602.05454 Code: None Area: Continual Learning Keywords: Continual Learning, Vision Transformer, Attention Retention, Catastrophic Forgetting, Gradient Masking
TL;DR¶
This paper proposes ARCL-ViT, a framework that prevents attention drift in Vision Transformers during continual learning via a two-step strategy of attention mask generation and gradient masking. It achieves state-of-the-art results on ImageNet-R and CIFAR-100, demonstrating that preserving attention patterns is key to mitigating catastrophic forgetting.
Background & Motivation¶
Background: Continual learning (CL) requires models to retain performance on previous tasks while learning new ones; Vision Transformers (ViTs) are increasingly adopted in CL settings.
Limitations of Prior Work: (a) Catastrophic forgetting in ViTs manifests as attention drift; (b) regularization methods (e.g., EWC) offer limited effectiveness for ViTs; (c) expansion methods (e.g., DualPrompt) introduce substantial additional parameters.
Key Challenge: Updating parameters to learn new tasks may disrupt the attention allocation established for discriminative features of old tasks.
Goal: Directly prevent the destruction of attention patterns corresponding to previously learned tasks in ViTs.
Key Insight: Inspired by the selective attention mechanism of the human primary visual cortex (V1), which maintains sustained focus on important features.
Core Idea: Generate attention masks from previous tasks and zero out gradients of Q/K/V weights in the corresponding regions during new task training, directly preventing attention drift.
Method¶
Overall Architecture¶
Input: tasks arriving sequentially. Output: a single ViT capable of handling all tasks learned so far. Two steps: (1) layer-wise rollout to extract attention maps → instance-adaptive thresholding → binary mask; (2) mask-based zeroing of Q/K/V weight gradients.
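As a concrete reference for step (1), below is a minimal PyTorch sketch of layer-wise attention rollout; the function name, the `residual_alpha` mixing weight, and the hook-based collection of per-layer attention maps are illustrative assumptions, not the paper's exact formulation.

```python
import torch

def attention_rollout(attn_maps, residual_alpha=0.5):
    """Aggregate per-layer attention into a single map via rollout.

    attn_maps: list of per-layer attention tensors, each of shape
               (batch, heads, tokens, tokens), e.g. collected with
               forward hooks on the ViT's attention modules.
    Returns:   a (batch, tokens, tokens) rollout matrix U.
    """
    rollout = None
    for attn in attn_maps:
        a = attn.mean(dim=1)  # average over heads -> (B, T, T)
        # Mix in the identity to account for the residual connection.
        eye = torch.eye(a.size(-1), device=a.device).unsqueeze(0)
        a = residual_alpha * a + (1.0 - residual_alpha) * eye
        a = a / a.sum(dim=-1, keepdim=True)  # keep rows normalized
        # Propagate attention through the layers by matrix product.
        rollout = a if rollout is None else torch.bmm(a, rollout)
    return rollout
```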
Key Designs¶
- Attention Mask Generation:
  - Function: Extract the attention regions from the previous task that must be protected.
  - Mechanism: Layer-wise rollout extracts the aggregated attention map \(\mathbf{U}_{t-1}\); instance-adaptive thresholding binarizes it into \(\bar{\mathbf{M}}_{t-1}\) (see the rollout sketch above).
  - Design Motivation: Identify attention regions critical to discriminative features of old tasks.
- Gradient Masking:
  - Function: Protect old attention patterns during new-task training.
  - Mechanism: \(\nabla \mathbf{W}'_{\theta,t} = \nabla \mathbf{W}_{\theta,t} \odot (1 - \bar{\mathbf{M}}_{t-1})\), with Adam-compatible update scaling \(\Delta\mathbf{W}'_{\theta,t} = (\nabla\mathbf{W}'_{\theta,t} / \nabla\mathbf{W}_{\theta,t}) \odot \Delta\mathbf{W}_{\theta,t}\); the element-wise ratio is 1 where the gradient survives and 0 where it was masked, so the Adam step is suppressed exactly on protected coordinates.
  - Design Motivation: Block modifications to critical regions of old tasks directly at the gradient level while remaining compatible with the Adam optimizer (see the sketch after this list).
- Instance-Adaptive Thresholding:
  - Function: Generate sample-specific thresholds for binarizing the rollout map (see the sketch after this list).
  - Design Motivation: Attention distributions vary considerably across tasks and samples, so a single global threshold is too rigid.
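A minimal sketch of what instance-adaptive thresholding could look like, assuming a per-sample quantile rule; `quantile=0.7` and the use of the [CLS] row of the rollout are hypothetical choices, since the paper's exact rule is not reproduced here.

```python
import torch

def adaptive_binary_mask(u_cls, quantile=0.7):
    """Instance-adaptive thresholding: one threshold per sample.

    u_cls: (batch, tokens) attention scores, e.g. the [CLS] row of the
           rollout matrix U for each sample.
    Returns a binary mask (1 = protected old-task region).
    """
    # Per-sample threshold: a quantile of that sample's own scores
    # (hypothetical rule), so masks adapt to each attention distribution.
    thresh = torch.quantile(u_cls, quantile, dim=-1, keepdim=True)  # (B, 1)
    return (u_cls >= thresh).float()
```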
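And a sketch mirroring the two gradient-masking equations, assuming \(\bar{\mathbf{M}}_{t-1}\) has already been broadcast or projected to the weight's shape (how token-level masks map onto Q/K/V weight coordinates is an assumption here).

```python
import torch

def masked_adam_update(grad, mask_bar, adam_update):
    """Mirror the two masking equations from the method description.

    grad:        raw gradient ∇W of a Q/K/V weight
    mask_bar:    binary mask M̄ (1 = protected old-task coordinate),
                 assumed broadcast to the weight's shape
    adam_update: the step ΔW that Adam would otherwise apply
    """
    masked_grad = grad * (1.0 - mask_bar)  # ∇W' = ∇W ⊙ (1 − M̄)
    # The element-wise ratio ∇W'/∇W is 1 where the gradient survives and
    # 0 where it was masked (it reduces to 1 − M̄); guard the 0/0 case.
    ratio = torch.where(grad != 0, masked_grad / grad,
                        torch.zeros_like(grad))
    return ratio * adam_update             # ΔW' = (∇W'/∇W) ⊙ ΔW
```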
Loss & Training¶
Standard cross-entropy loss is used. The gradient mask is applied after backpropagation, requiring no modification to the loss function.
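A hypothetical training step under these conventions; `weight_masks` is an assumed dict mapping Q/K/V parameter names to broadcast masks \(\bar{\mathbf{M}}_{t-1}\), and this simplified version only zeroes raw gradients in place, whereas the full method would additionally rescale the Adam step as sketched above.

```python
import torch.nn.functional as F

def train_step(model, optimizer, x, y, weight_masks):
    """One new-task step: plain cross-entropy, then gradient masking.

    weight_masks: hypothetical dict mapping Q/K/V parameter names to
                  binary masks M̄ broadcast to each weight's shape.
    """
    optimizer.zero_grad()
    loss = F.cross_entropy(model(x), y)  # loss function is unmodified
    loss.backward()
    # Zero gradients on protected coordinates before the optimizer step.
    for name, param in model.named_parameters():
        if name in weight_masks and param.grad is not None:
            param.grad.mul_(1.0 - weight_masks[name])
    optimizer.step()
    return loss.item()
```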
Key Experimental Results¶
Main Results¶
| Method | ImageNet-R (10-task split) | ImageNet-R (20-task split) | CIFAR-100 (10-task split) |
|---|---|---|---|
| CODA-Prompt | 75.45% | - | 86–89% |
| OS-Prompt++ | - | 73.77% | - |
| ARCL-ViT | SOTA | SOTA | ~87% |
Ablation Study¶
| Configuration | Performance | Note |
|---|---|---|
| Full model | Best | Attention mask + gradient mask |
| w/o gradient mask | Severe degradation | Degrades to sequential fine-tuning (Seq-FT) |
| w/o adaptive threshold | Slight drop | Global threshold lacks flexibility |
| Different pre-training schemes | Robust | Insensitive to pre-training choice |
Key Findings¶
- Attention drift is the primary cause of catastrophic forgetting in ViTs, clearly demonstrated through visualization.
- Gradient masking outperforms regularization and expansion methods.
- The approach is robust on long task sequences (20-task splits) and across different pre-training schemes.
Highlights & Insights¶
- Precise Problem Formulation: Attributing catastrophic forgetting to attention drift is well-motivated and supported by compelling visual evidence.
- Biologically Inspired, Elegant Solution: Simple gradient masking achieves attention retention in a concise yet effective manner.
- Adam Compatibility Design: The gradient/update scaling trick is a critical engineering contribution for practical applicability.
Limitations & Future Work¶
- Storing attention masks for previous tasks incurs memory overhead that grows linearly with the number of tasks.
- Binary masks may be overly coarse; soft masks could offer greater flexibility.
- Validation is limited to ViTs; applicability to CNNs remains unexplored.
- Task boundary information is required during training.
Related Work & Insights¶
- vs. EWC: EWC applies regularization in parameter space, whereas ARCL-ViT directly protects the attention space, yielding better effectiveness for ViTs.
- vs. DualPrompt/CODA-Prompt: Expansion methods increase model size; ARCL-ViT introduces no additional parameters.
- vs. PackNet: PackNet freezes pruning-selected weights per task in CNNs; ARCL-ViT transfers this freezing idea to protecting attention regions in ViTs.
Rating¶
- Novelty: ⭐⭐⭐⭐ The attention drift perspective is novel.
- Experimental Thoroughness: ⭐⭐⭐⭐ Multiple datasets, settings, and visualizations.
- Writing Quality: ⭐⭐⭐⭐ Clear and intuitive presentation.
- Value: ⭐⭐⭐⭐ Practical contribution to continual learning with ViTs.
Additional Remarks¶
- The methodology and experimental design offer useful reference for related research areas.
- Future work should validate generalizability and scalability across more diverse scenarios and larger scales.
- Potential research value exists in combining this approach with recent methods (e.g., RL/MCTS or multimodal frameworks).
- Deployment feasibility and computational efficiency should be assessed against practical application requirements.
- The choice of datasets and evaluation metrics may affect the generalizability of conclusions; cross-validation on additional benchmarks is recommended.