CVPR2025 Medical Imaging Continual learning domain-incremental learning prompt-based learning catastrophic forgetting knowledge distillation sparse selection

Residual SODAP: Residual Self-Organizing Domain-Adaptive Prompting with Structural Knowledge Preservation for Continual Learning

Conference: CVPR2025
arXiv: 2603.12816
Code: To be confirmed
Area: Medical Images
Keywords: Continual learning, domain-incremental learning, prompt-based learning, catastrophic forgetting, knowledge distillation, sparse selection

TL;DR

To address domain-incremental learning (DIL) without task IDs or data replay, this paper proposes the Residual SODAP framework. It jointly tackles representation adaptation and classifier forgetting through four components: \(\alpha\)-entmax sparse prompt selection with residual aggregation, pseudo-replay distillation based on feature statistics, prompt usage pattern drift detection, and uncertainty weighting. It reports state-of-the-art (SOTA) performance on the diabetic retinopathy (DR), skin cancer, and CORe50 benchmarks.

Background & Motivation

Core Challenge of Continual Learning: Catastrophic forgetting is particularly severe in domain-incremental learning (DIL), where task IDs are unavailable and historical data cannot be stored.
Two Limitations of Prior Prompt-based CL:
Inadequate Prompt Selection Schemes: Top-\(k\) hard selection limits expressiveness and is non-differentiable; Softmax soft selection allows irrelevant prompts to exert influence, leading to noise accumulation.
Ignoring Classifier Structure: Existing prompt-based continual learning (PCL) methods primarily focus on prompt pool design to adapt representations, but the classifier layer also exhibits instability under domain shifts (as demonstrated by the cross-composition diagnostic experiments).
Key Finding: Feature extractor (backbone) \(\times\) classifier cross-composition analysis (referencing the diagnostic method of Liu et al.) reveals that even if backbone representations are well-maintained via prompt adaptation, the decision boundaries of the classifier layer still degrade significantly during domain-incremental training.
Goal: A unified framework is required to simultaneously address representation adaptation at the prompt layer and knowledge preservation at the classifier layer.

Method

1. \(\alpha\)-Entmax Residual Prompt Selection

Query Enhancement: Fuses the current-layer CLS token, the initial CLS token (global context), and retrieval signals from a learnable memory bank, generating an enhanced query via Multi-Head Attention (MHA) and a bottleneck adapter.
Sparse Selection: Replaces Softmax with \(\alpha\)-entmax (\(\alpha=1.5\)), which assigns exact zero weights to low-scoring prompts, balancing full prompt pool utilization and noise suppression.
Residual Structure: Starting from Stage 2, the prompt pool is split into a frozen set \(F\) and an active set \(A\). Each set undergoes independent sparse routing, and the two outputs are combined in a residual manner: \(p_{out} = p_F + \lambda_r \cdot p_A\) with \(\lambda_r = 0.1\). The frozen set preserves prior knowledge, while the active set performs only residual adaptation.
Auxiliary Loss: Diversity loss (penalizes similarity among highly co-activated prompts) + norm regularization (constrains active prompt values to act purely as residuals).
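
The sparse routing and residual combination described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it computes 1.5-entmax by bisection on the threshold \(\tau\) in \(p_i = [z_i/2 - \tau]_+^2\), and the helper names (`entmax15`, `weighted_sum`, `residual_prompt`) as well as the toy prompt vectors are assumptions made for this example. The query enhancer and the auxiliary losses are omitted.

```python
def entmax15(scores, iters=60):
    """1.5-entmax via bisection: p_i = max(z_i / 2 - tau, 0)^2 with sum(p) = 1."""
    half = [s / 2.0 for s in scores]
    hi = max(half)   # at tau = hi every term is zero, so the sum is 0
    lo = hi - 1.0    # at tau = lo the top term alone equals 1, so the sum is >= 1
    for _ in range(iters):
        tau = (lo + hi) / 2.0
        total = sum(max(h - tau, 0.0) ** 2 for h in half)
        if total >= 1.0:
            lo = tau  # threshold too small, mass is too large
        else:
            hi = tau
    probs = [max(h - lo, 0.0) ** 2 for h in half]
    norm = sum(probs)  # >= 1 by construction, so never zero
    return [p / norm for p in probs]


def weighted_sum(weights, prompts):
    """Aggregate prompt vectors with the given routing weights."""
    dim = len(prompts[0])
    return [sum(w * p[d] for w, p in zip(weights, prompts)) for d in range(dim)]


def residual_prompt(scores_f, frozen, scores_a, active, lam=0.1):
    """p_out = p_F + lam * p_A, with independent sparse routing per set."""
    p_f = weighted_sum(entmax15(scores_f), frozen)
    p_a = weighted_sum(entmax15(scores_a), active)
    return [f + lam * a for f, a in zip(p_f, p_a)]


# Toy usage: low-scoring prompts receive exactly zero weight.
weights = entmax15([2.0, 1.0, -1.0, -3.0])
out = residual_prompt([5.0, -5.0], [[1.0, 0.0], [0.0, 1.0]],
                      [0.3], [[2.0, 2.0]])
```

Unlike Softmax, which would give every prompt a small positive weight, the last two prompts in the toy call are dropped entirely, while the mapping stays differentiable with respect to the scores of the selected prompts.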

2. Statistical Knowledge Preservation (Pseudo-Replay Distillation)

At the end of each stage, class-wise feature statistics (mean and variance) are collected using Welford's online algorithm, and the current classifier head is frozen as a teacher.
During multi-stage training, statistics are cumulatively merged using Welford's formula, which can be completed in a single pass and is memory-efficient.
During the training of the next stage, two types of distillation are performed:
Real Feature Distillation: Aligns current batch features via the KL divergence between the teacher and student heads (temperature \(T=2.0\)).
Pseudo-Feature Replay: Samples pseudo-features from stored class statistics (using the reparameterization trick) and aligns them via the KL divergence between the frozen teacher and trainable student heads.
Classes are uniformly sampled to mitigate under-representation of minority classes, with \(K=B\) pseudo-features sampled per batch.
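
The statistics collection and pseudo-replay steps above can be sketched in plain Python. This is an illustrative sketch under stated assumptions, not the paper's code: the class `RunningStats`, the linear head representation as a `(weights, bias)` pair, and the function names are hypothetical, and the loss is an unscaled KL divergence on temperature-softened outputs (any extra scaling used by the authors is not stated in this summary).

```python
import math
import random


class RunningStats:
    """Per-dimension running mean and variance for one class (Welford update)."""

    def __init__(self, dim):
        self.n = 0
        self.mean = [0.0] * dim
        self.m2 = [0.0] * dim  # running sum of squared deviations from the mean

    def update(self, x):
        """Fold one feature vector into the summary in a single pass."""
        self.n += 1
        for d, v in enumerate(x):
            delta = v - self.mean[d]
            self.mean[d] += delta / self.n
            self.m2[d] += delta * (v - self.mean[d])

    def merge(self, other):
        """Combine two summaries without revisiting any stored sample."""
        n = self.n + other.n
        if n == 0:
            return
        for d in range(len(self.mean)):
            delta = other.mean[d] - self.mean[d]
            self.m2[d] += other.m2[d] + delta * delta * self.n * other.n / n
            self.mean[d] += delta * other.n / n
        self.n = n

    def variance(self):
        return [m / max(self.n - 1, 1) for m in self.m2]

    def sample(self, rng):
        """Reparameterized draw: mean + std * eps, with eps ~ N(0, 1)."""
        return [m + math.sqrt(v) * rng.gauss(0.0, 1.0)
                for m, v in zip(self.mean, self.variance())]


def softmax(logits, temperature):
    top = max(logits)
    exps = [math.exp((z - top) / temperature) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kd_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def linear_head(weights, bias, feature):
    return [sum(w * f for w, f in zip(row, feature)) + b
            for row, b in zip(weights, bias)]


def pseudo_replay_loss(class_stats, teacher, student, k, rng):
    """Mean distillation loss over k pseudo-features from uniformly drawn classes."""
    classes = list(class_stats)
    total = 0.0
    for _ in range(k):
        feature = class_stats[rng.choice(classes)].sample(rng)
        total += kd_loss(linear_head(*teacher, feature),
                         linear_head(*student, feature))
    return total / k
```

Because only the count, mean, and squared-deviation sum are kept per class, statistics from a later stage can be folded in with `merge` without access to earlier images, which is what makes the replay data-free.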

3. Prompt Usage Pattern Drift Detection (PUDD)

Concurrently monitors two signals: (a) entropy changes in the prompt selection distribution (short-term fluctuations reflecting domain changes), and (b) structural shifts of the active prompt set (using Intersection over Union, IoU).
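The two signals are combined into a weighted drift score \(D_t\) with weights \(\alpha=1.0\) and \(\beta=0.5\) (this \(\alpha\) is a combination weight, unrelated to the \(\alpha\) of \(\alpha\)-entmax). The score is averaged across layers and batches to determine the scale of prompt pool expansion.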
The expansion size grows with the drift intensity: weak drifts trigger minimal prompt addition (\(E_{min}=10\)), while strong drifts lead to large-scale expansion (\(E_{max}=80\)).
After expansion, the existing active prompts are moved to the frozen set, and the newly added prompts become the new active set.
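
To make the drift-to-expansion idea concrete, here is a small sketch. The exact formulas are not given in this summary, so everything below is an assumption chosen for illustration: the score is taken as \(\alpha \cdot |\Delta H| + \beta \cdot (1 - \mathrm{IoU})\), and the expansion size is a linear interpolation between \(E_{min}\) and \(E_{max}\) capped at \(D_{max}\). Only the constants (\(\alpha=1.0\), \(\beta=0.5\), \(E_{min}=10\), \(E_{max}=80\), \(D_{max}=5.0\)) come from the text; the function names are hypothetical.

```python
import math


def entropy(weights):
    """Shannon entropy of a prompt selection distribution."""
    return -sum(w * math.log(w) for w in weights if w > 0.0)


def iou(a, b):
    """Intersection over Union of two sets of active prompt indices."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def drift_score(ref_weights, cur_weights, alpha=1.0, beta=0.5):
    """Assumed form: alpha * |entropy change| + beta * (1 - IoU of active sets)."""
    ref_active = [i for i, w in enumerate(ref_weights) if w > 0.0]
    cur_active = [i for i, w in enumerate(cur_weights) if w > 0.0]
    entropy_shift = abs(entropy(cur_weights) - entropy(ref_weights))
    structure_shift = 1.0 - iou(ref_active, cur_active)
    return alpha * entropy_shift + beta * structure_shift


def expansion_size(drift, e_min=10, e_max=80, d_max=5.0):
    """Assumed linear map from drift in [0, d_max] to [e_min, e_max] prompts."""
    frac = min(max(drift / d_max, 0.0), 1.0)
    return round(e_min + (e_max - e_min) * frac)


# Same entropy, but a completely different set of active prompts.
d = drift_score([0.5, 0.5, 0.0, 0.0], [0.0, 0.0, 0.5, 0.5])
n_new = expansion_size(d)
```

In the toy call the entropy signal is unchanged and the whole score comes from the structural term, which shows why monitoring both signals can catch drifts that either one alone would miss.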

4. Uncertainty Weighting

Employs homoscedastic uncertainty weighting following Kendall et al. to learn a log-variance \(s_i\) for each loss term.
Total Loss: \(L_{total} = \sum_i (e^{-s_i} \cdot L_i + s_i)\), where noisy losses are automatically downweighted.
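
The total loss above is simple enough to write out directly. The sketch below evaluates it for given loss values and log-variances; in real training the \(s_i\) would be learnable parameters updated by the optimizer, which is not shown here. The function name is hypothetical, and the clamp range \([-3, 6]\) is the one mentioned in the limitations section of this note.

```python
import math


def uncertainty_weighted_total(losses, log_vars, clamp=(-3.0, 6.0)):
    """Compute sum_i exp(-s_i) * L_i + s_i with each s_i clamped to a fixed range."""
    total = 0.0
    for loss, s in zip(losses, log_vars):
        s = min(max(s, clamp[0]), clamp[1])  # keep the log-variance in range
        total += math.exp(-s) * loss + s
    return total


# Two loss terms; the second is down-weighted by a larger log-variance.
equal = uncertainty_weighted_total([1.0, 2.0], [0.0, 0.0])
tilted = uncertainty_weighted_total([1.0, 2.0], [0.0, math.log(2.0)])
```

For a fixed loss value, setting the derivative \(-e^{-s_i} L_i + 1\) to zero gives \(s_i = \log L_i\), so a term with a persistently large loss drifts toward a larger \(s_i\) and hence a smaller weight \(e^{-s_i}\), while the additive \(s_i\) keeps the weights from collapsing to zero.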

Key Experimental Results

Benchmark Settings

Three DIL scenarios: Diabetic Retinopathy (DR, 3 domains: APTOS \(\rightarrow\) DDR \(\rightarrow\) DRD), Skin Cancer (3 domains: ISIC \(\rightarrow\) HAM \(\rightarrow\) DERM7), and CORe50 (general benchmark).
No data replay, no task IDs, with all results averaged over 3 independent trials.
Evaluation metrics: AvgACC (Average Accuracy) and AvgF (Average Forgetting).
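
For readers unfamiliar with the two metrics, the sketch below uses a common continual-learning definition: `acc[t][j]` is the accuracy on domain `j` after training stage `t`, AvgACC averages the final row, and AvgF averages, over earlier domains, the drop from the best accuracy previously reached to the final accuracy. The paper's exact definitions may differ, and the accuracy matrix here is made up purely for illustration; it is not taken from the reported results.

```python
def avg_acc(acc):
    """Mean accuracy over all domains after the final training stage."""
    final = acc[-1]
    return sum(final) / len(final)


def avg_forgetting(acc):
    """Mean drop from the best earlier accuracy to the final accuracy."""
    last = len(acc) - 1
    drops = []
    for j in range(last):  # the last domain cannot be forgotten yet
        best_earlier = max(acc[t][j] for t in range(j, last))
        drops.append(best_earlier - acc[last][j])
    return sum(drops) / len(drops) if drops else 0.0


# Hypothetical 3-stage run; entries above the diagonal are unused placeholders.
acc = [
    [0.90, 0.00, 0.00],
    [0.85, 0.80, 0.00],
    [0.80, 0.75, 0.70],
]
final_acc = avg_acc(acc)
forgetting = avg_forgetting(acc)
```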

Main Results

Method	DR AvgACC↑	DR AvgF↓	Skin AvgACC↑	Skin AvgF↓	CORe50 AvgACC↑	CORe50 AvgF↓
OS-Prompt++	0.769	0.113	0.725	0.063	0.983	0.014
Coda-Prompt	0.688	0.140	0.713	0.041	0.974	0.056
DER++	0.607	0.288	0.722	0.099	0.994	0.061
Residual SODAP	0.850	0.047	0.760	0.031	0.995	0.003

DR scenario: AvgACC improves by 8.1 percentage points (pp) compared to the second-best method (OS-Prompt++: 0.769), while AvgF drops from 0.113 to 0.047.
Skin cancer scenario: Achieves the best AvgACC of 0.760 with an AvgF of 0.031 (Dual-Prompt has a lower AvgF of 0.012, but its AvgACC is only 0.637, representing a poor accuracy-forgetting trade-off).
CORe50: Performance is near-perfect (AvgACC: 0.995, AvgF: 0.003), demonstrating the effectiveness of the proposed method in general domains.

Ablation Study (DR)

Removing Query Enhancer: AvgACC drops by 4.2 pp (making it the most critical individual component).
Removing Residual (degrading to SODAP): AvgACC drops by 1.9 pp, and AvgF degrades by 2.0 pp.
Removing Pseudo-Replay: AvgACC drops by 1.5 pp.
Removing Distillation: Also leads to performance degradation.
PUDD dynamically regulates prompt pool expansion from 60 \(\rightarrow\) 84 \(\rightarrow\) 94, ensuring no redundant prompts after expansion.

Highlights & Insights

Comprehensive Design: Simultaneously addresses three main dimensions—prompt selection noise, classifier forgetting, and domain drift detection—rather than focusing on a single issue.
\(\alpha\)-Entmax Sparse Selection: Accurately suppresses irrelevant prompts while preserving differentiable optimization across the entire prompt pool, outperforming both Top-\(k\) (non-differentiable) and Softmax (prone to noise accumulation).
Zero-Data Pseudo-Replay: Stores only class-wise means and variances (two feature-length vectors per class, so the storage overhead is extremely low), enabling knowledge preservation via reparameterization sampling.
PUDD Automatic Expansion: Detects domain drift based on actual usage patterns rather than hard-coded rules, allowing adaptive expansion and avoiding capacity waste or insufficiency.
Generality: Achieves SOTA results on both medical imaging (DR, skin cancer) and general computer vision (CORe50), demonstrating that the method is not application-restricted.

Limitations & Future Work

The experimental scenarios are relatively short (only 3 domains); scalability under longer sequences (10+ domains) remains unverified, where continuous expansion of the prompt pool might impose memory and computational burdens.
Hyperparameters such as \(\alpha\) and \(\lambda_r\) are fixed (\(\alpha=1.5, \lambda_r=0.1\)); sensitivity analyses across different scenarios have not been fully explored.
The method relies on a frozen ViT backbone (pretrained on ImageNet-21K); its dependence on the specific backbone architecture and pretraining data has not been investigated.
Pseudo-replay assumes diagonal Gaussian feature distributions, which may be inaccurate for highly complex multimodal distributions, particularly under domain feature overlap.
The clamp range \([-3, 6]\) for uncertainty weighting is empirically chosen, lacking theoretical guidance.
Hyperparameters for PUDD (sliding window \(W=100\), \(D_{max}=5.0\), threshold \(\tau_s=0.01\)) have not undergone sufficient sensitivity analysis.
The fundamental difference from multi-head classifier approaches warrants deeper discussion: although the proposed scheme does not require task IDs, it relies on drift detection.

Rating

Novelty: 4/5 — \(\alpha\)-entmax residual prompt selection and PUDD drift detection represent significant innovations, and classifier knowledge preservation addresses a blind spot in PCL.
Experimental Thoroughness: 4/5 — Covers three benchmarks (2 medical, 1 general), comprehensive ablations, and visualization analyses, though the sequence length is limited (only 3 domains).
Writing Quality: 4/5 — Structurally sound with accurate formulations. The cross-composition diagnostic analysis is highly convincing, though some notations are dense.
Value: 4/5 — Offers a systematic and practical solution for DIL without data replay or Task-IDs, holding direct application value for continual learning in medical imaging.