Attention Retention for Continual Learning with Vision Transformers¶
Conference: AAAI 2026 arXiv: 2602.05454 Code: None Area: Continual Learning Keywords: Continual Learning, Vision Transformer, Attention Retention, Catastrophic Forgetting, Gradient Masking
TL;DR¶
This paper proposes ARCL-ViT, a framework that prevents attention drift in Vision Transformers during continual learning via a two-step strategy of attention mask generation and gradient masking. It achieves state-of-the-art results on ImageNet-R and CIFAR-100, demonstrating that preserving attention patterns is key to mitigating catastrophic forgetting.
Background & Motivation¶
Background: Continual learning (CL) requires models to retain performance on previous tasks while learning new ones; Vision Transformers (ViTs) are increasingly adopted in CL settings.
Limitations of Prior Work: (a) Catastrophic forgetting in ViTs manifests as attention drift; (b) regularization methods (e.g., EWC) offer limited effectiveness for ViTs; (c) expansion methods (e.g., DualPrompt) introduce substantial additional parameters.
Key Challenge: Updating parameters to learn new tasks may disrupt the attention allocation established for discriminative features of old tasks.
Goal: Directly prevent the destruction of attention patterns corresponding to previously learned tasks in ViTs.
Key Insight: Inspired by the selective attention mechanism of the human primary visual cortex (V1), which maintains sustained focus on important features.
Core Idea: Generate attention masks from previous tasks and zero out gradients of Q/K/V weights in the corresponding regions during new task training, directly preventing attention drift.
Method¶
Overall Architecture¶
Input: tasks arriving sequentially. Output: a single ViT capable of handling all tasks learned so far. Two steps: (1) layer-wise rollout to extract attention maps → instance-adaptive thresholding → binary mask; (2) mask-based zeroing of Q/K/V weight gradients.
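As a concrete reference for step (1), below is a minimal PyTorch sketch of layer-wise attention rollout; the function name, the `residual_alpha` mixing weight, and the hook-based collection of per-layer attention maps are illustrative assumptions, not the paper's exact formulation.

```python
import torch

def attention_rollout(attn_maps, residual_alpha=0.5):
    """Aggregate per-layer attention into a single map via rollout.

    attn_maps: list of per-layer attention tensors, each of shape
               (batch, heads, tokens, tokens), e.g. collected with
               forward hooks on the ViT's attention modules.
    Returns:   a (batch, tokens, tokens) rollout matrix U.
    """
    rollout = None
    for attn in attn_maps:
        a = attn.mean(dim=1)  # average over heads -> (B, T, T)
        # Mix in the identity to account for the residual connection.
        eye = torch.eye(a.size(-1), device=a.device).unsqueeze(0)
        a = residual_alpha * a + (1.0 - residual_alpha) * eye
        a = a / a.sum(dim=-1, keepdim=True)  # keep rows normalized
        # Propagate attention through the layers by matrix product.
        rollout = a if rollout is None else torch.bmm(a, rollout)
    return rollout
```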
Key Designs¶
- Attention Mask Generation:
  - Function: Extract the attention regions from the previous task that must be protected.
  - Mechanism: Layer-wise rollout extracts the aggregated attention map \(\mathbf{U}_{t-1}\); instance-adaptive thresholding binarizes it into \(\bar{\mathbf{M}}_{t-1}\) (see the rollout sketch above).
  - Design Motivation: Identify attention regions critical to discriminative features of old tasks.
- Gradient Masking:
  - Function: Protect old attention patterns during new-task training.
  - Mechanism: \(\nabla \mathbf{W}'_{\theta,t} = \nabla \mathbf{W}_{\theta,t} \odot (1 - \bar{\mathbf{M}}_{t-1})\), with Adam-compatible update scaling \(\Delta\mathbf{W}'_{\theta,t} = (\nabla\mathbf{W}'_{\theta,t} / \nabla\mathbf{W}_{\theta,t}) \odot \Delta\mathbf{W}_{\theta,t}\); the element-wise ratio is 1 where the gradient survives and 0 where it was masked, so the Adam step is suppressed exactly on protected coordinates.
  - Design Motivation: Block modifications to critical regions of old tasks directly at the gradient level while remaining compatible with the Adam optimizer (see the sketch after this list).
- Instance-Adaptive Thresholding:
  - Function: Generate sample-specific thresholds for binarizing the rollout map (see the sketch after this list).
  - Design Motivation: Attention distributions vary considerably across tasks and samples, so a single global threshold is too rigid.
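A minimal sketch of what instance-adaptive thresholding could look like, assuming a per-sample quantile rule; `quantile=0.7` and the use of the [CLS] row of the rollout are hypothetical choices, since the paper's exact rule is not reproduced here.

```python
import torch

def adaptive_binary_mask(u_cls, quantile=0.7):
    """Instance-adaptive thresholding: one threshold per sample.

    u_cls: (batch, tokens) attention scores, e.g. the [CLS] row of the
           rollout matrix U for each sample.
    Returns a binary mask (1 = protected old-task region).
    """
    # Per-sample threshold: a quantile of that sample's own scores
    # (hypothetical rule), so masks adapt to each attention distribution.
    thresh = torch.quantile(u_cls, quantile, dim=-1, keepdim=True)  # (B, 1)
    return (u_cls >= thresh).float()
```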
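And a sketch mirroring the two gradient-masking equations, assuming \(\bar{\mathbf{M}}_{t-1}\) has already been broadcast or projected to the weight's shape (how token-level masks map onto Q/K/V weight coordinates is an assumption here).

```python
import torch

def masked_adam_update(grad, mask_bar, adam_update):
    """Mirror the two masking equations from the method description.

    grad:        raw gradient ∇W of a Q/K/V weight
    mask_bar:    binary mask M̄ (1 = protected old-task coordinate),
                 assumed broadcast to the weight's shape
    adam_update: the step ΔW that Adam would otherwise apply
    """
    masked_grad = grad * (1.0 - mask_bar)  # ∇W' = ∇W ⊙ (1 − M̄)
    # The element-wise ratio ∇W'/∇W is 1 where the gradient survives and
    # 0 where it was masked (it reduces to 1 − M̄); guard the 0/0 case.
    ratio = torch.where(grad != 0, masked_grad / grad,
                        torch.zeros_like(grad))
    return ratio * adam_update             # ΔW' = (∇W'/∇W) ⊙ ΔW
```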
Loss & Training¶
Standard cross-entropy loss is used. The gradient mask is applied after backpropagation, requiring no modification to the loss function.
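A hypothetical training step under these conventions; `weight_masks` is an assumed dict mapping Q/K/V parameter names to broadcast masks \(\bar{\mathbf{M}}_{t-1}\), and this simplified version only zeroes raw gradients in place, whereas the full method would additionally rescale the Adam step as sketched above.

```python
import torch.nn.functional as F

def train_step(model, optimizer, x, y, weight_masks):
    """One new-task step: plain cross-entropy, then gradient masking.

    weight_masks: hypothetical dict mapping Q/K/V parameter names to
                  binary masks M̄ broadcast to each weight's shape.
    """
    optimizer.zero_grad()
    loss = F.cross_entropy(model(x), y)  # loss function is unmodified
    loss.backward()
    # Zero gradients on protected coordinates before the optimizer step.
    for name, param in model.named_parameters():
        if name in weight_masks and param.grad is not None:
            param.grad.mul_(1.0 - weight_masks[name])
    optimizer.step()
    return loss.item()
```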
Key Experimental Results¶
Main Results¶
| Method | ImageNet-R (10-task split) | ImageNet-R (20-task split) | CIFAR-100 (10-task split) |
|---|---|---|---|
| CODA-Prompt | 75.45% | - | 86–89% |
| OS-Prompt++ | - | 73.77% | - |
| ARCL-ViT | SOTA | SOTA | ~87% |
Ablation Study¶
| Configuration | Performance | Note |
|---|---|---|
| Full model | Best | Attention mask + gradient mask |
| w/o gradient mask | Severe degradation | Degrades to sequential fine-tuning (Seq-FT) |
| w/o adaptive threshold | Slight drop | Global threshold lacks flexibility |
| Different pre-training schemes | Robust | Insensitive to pre-training choice |
Key Findings¶
- Attention drift is the primary cause of catastrophic forgetting in ViTs, clearly demonstrated through visualization.
- Gradient masking outperforms regularization and expansion methods.
- The approach is robust on long task sequences (20-task splits) and across different pre-training schemes.
Highlights & Insights¶
- Precise Problem Formulation: Attributing catastrophic forgetting to attention drift is well-motivated and supported by compelling visual evidence.
- Biologically Inspired, Elegant Solution: Simple gradient masking achieves attention retention in a concise yet effective manner.
- Adam Compatibility Design: The gradient/update scaling trick is a critical engineering contribution for practical applicability.
Limitations & Future Work¶
- Storing attention masks for previous tasks incurs memory overhead that grows linearly with the number of tasks.
- Binary masks may be overly coarse; soft masks could offer greater flexibility.
- Validation is limited to ViTs; applicability to CNNs remains unexplored.
- Task boundary information is required during training.
Related Work & Insights¶
- vs. EWC: EWC applies regularization in parameter space, whereas ARCL-ViT directly protects the attention space, yielding better effectiveness for ViTs.
- vs. DualPrompt/CODA-Prompt: Expansion methods increase model size; ARCL-ViT introduces no additional parameters.
- vs. PackNet: PackNet freezes pruning-selected weights per task in CNNs; ARCL-ViT transfers this freezing idea to protecting attention regions in ViTs.
Rating¶
- Novelty: ⭐⭐⭐⭐ The attention drift perspective is novel.
- Experimental Thoroughness: ⭐⭐⭐⭐ Multiple datasets, settings, and visualizations.
- Writing Quality: ⭐⭐⭐⭐ Clear and intuitive presentation.
- Value: ⭐⭐⭐⭐ Practical contribution to continual learning with ViTs.
Additional Remarks¶
- The methodology and experimental design offer useful reference for related research areas.
- Future work should validate generalizability and scalability across more diverse scenarios and larger scales.
- Potential research value exists in combining this approach with recent methods (e.g., RL/MCTS or multimodal frameworks).
- Deployment feasibility and computational efficiency should be assessed against practical application requirements.
- The choice of datasets and evaluation metrics may affect the generalizability of conclusions; cross-validation on additional benchmarks is recommended.