# RainbowPrompt: Diversity-Enhanced Prompt-Evolving for Continual Learning
Conference: ICCV 2025 · arXiv: 2507.22553 · Code: None · Area: Video Understanding · Keywords: Continual Learning, Prompt Learning, Knowledge Integration, Class-Incremental Learning, Video Action Recognition
## TL;DR
This paper proposes RainbowPrompt, a prompt-evolving mechanism that integrates multiple task-specific prompts into a diversity-enhanced unified prompt via attention-based transformation and task-guided alignment, achieving an average improvement of 8.23% over existing methods on image classification and video action recognition tasks.
## Background & Motivation
Prompt-based continual learning (PCL) keeps the pre-trained model frozen while fine-tuning a small number of prompt parameters, serving as an effective replay-free continual learning paradigm. The core challenge lies in how to effectively integrate task-specific knowledge within prompts.
Limitations of existing methods:

- Fixed prompt methods (e.g., CODA-Prompt): learned prompt representations remain unchanged while new tasks are trained, failing to adapt to new task requirements and resulting in low representational diversity.
- Generative prompt methods (e.g., ConvPrompt): prompts are generated from an entangled task-shared space, suffering from task interference and dominance effects that likewise limit diversity.
Empirical validation (Fig. 2): Nuclear norm analysis demonstrates insufficient prompt representational diversity in existing methods; higher diversity correlates with higher accuracy and lower forgetting rates.
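As an illustration of the metric, the nuclear norm of stacked prompt representations is straightforward to compute in PyTorch (a minimal sketch with dummy values; the paper's exact measurement protocol may differ):

```python
import torch

# Stack t task prompts of shape (prompt_len, dim) into one matrix and
# measure representational diversity via the nuclear norm (the sum of
# singular values): higher values indicate more diverse directions.
prompts = torch.randn(5, 8, 768)                   # (tasks, prompt_len, dim), dummy values
stacked = prompts.reshape(-1, prompts.shape[-1])   # (tasks * prompt_len, dim)
diversity = torch.linalg.matrix_norm(stacked, ord="nuc")
print(f"nuclear norm (diversity proxy): {diversity:.2f}")
```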
Core motivation: to design a prompt-evolving mechanism that continuously evolves knowledge by transforming and aligning all accumulated base prompts (both previously learned and newly introduced), thereby enhancing representational diversity.
## Method

### Overall Architecture
When a new task \(t\) arrives, the method maintains an accumulated prompt set \(\boldsymbol{\mathcal{P}} = \{\boldsymbol{p}^1, ..., \boldsymbol{p}^t\}\). Through attention-based transformation and task-guided alignment, all base prompts are evolved and integrated into a unified RainbowPrompt \(\boldsymbol{p}_l^{\text{rainbow}(t)}\). A learnable probabilistic gate is also introduced to adaptively determine which layers to insert prompts into.
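For orientation, a schematic sketch of the per-task flow (hypothetical class and method names; PyTorch is assumed):

```python
import torch
import torch.nn as nn

class PromptPool(nn.Module):
    """Schematic pool of accumulated base prompts (hypothetical API)."""

    def __init__(self, prompt_len: int = 8, dim: int = 768):
        super().__init__()
        self.base_prompts = nn.ParameterList()  # p^1, ..., p^t
        self.task_embeds = nn.ParameterList()   # e^1, ..., e^t
        self.prompt_len, self.dim = prompt_len, dim

    def new_task(self):
        # Task t adds one base prompt and one task embedding; earlier
        # base prompts stay in the pool and take part in evolution.
        self.base_prompts.append(nn.Parameter(0.02 * torch.randn(self.prompt_len, self.dim)))
        self.task_embeds.append(nn.Parameter(0.02 * torch.randn(self.dim)))

    def rainbow(self, evolve) -> torch.Tensor:
        # Evolve all accumulated prompts into the unified RainbowPrompt
        # for the current task; `evolve` is the module sketched below.
        stacked = torch.stack(list(self.base_prompts))  # (t, prompt_len, dim)
        return evolve(stacked, self.task_embeds[-1])    # p^rainbow(t)
```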
### Key Designs

- Attention-based Transformation (see the evolution sketch after this list):
  - Task Conditioning: Learnable task embeddings \(\boldsymbol{e}^t\) are used to compute attention weights that highlight task-relevant prompt components.
  - Task-Level Transformation (TLT): The new task's prompt serves as Query while the accumulated prompt set serves as Key/Value, yielding a cross-task affinity matrix \(\mathcal{G}\) that reweights each task's contribution to the new task.
  - Feature-Level Transformation (FLT): Transposed Query/Key are employed to capture cross-feature interactions (inspired by bilinear pooling), \(\hat{V} = \mathcal{F} \cdot \tilde{V}^T\), integrating cross-feature contributions at a finer granularity.
- Task-Guided Alignment (TGA): A nonlinear transformation \(\text{LT}(x) = \max(0, xW_l^1)W_l^2\) refines the transformed representations to fit the new task while preserving the intrinsic properties of each prompt. The final RainbowPrompt is obtained via mean aggregation: \(\boldsymbol{p}_l^{\text{rainbow}(t)} = \frac{1}{t}\sum_{i=1}^t \tilde{\boldsymbol{\mathcal{P}}}_l[i]\).
- Adaptive Prompting: Task-specific learnable probabilistic gates \(\boldsymbol{g}_l^t\) (Bernoulli variables), optimized differentiably via a Gumbel-Softmax relaxation, adaptively determine which ViT layers the RainbowPrompt is inserted into, avoiding suboptimal manual layer selection (see the gating sketch below).
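A minimal PyTorch sketch of the evolution step, under simplifying assumptions (single attention head, one shared alignment layer in place of the per-layer weights \(W_l^1, W_l^2\), and a scalar per-task affinity); the paper's exact parameterization of \(\mathcal{G}\) and \(\mathcal{F}\) may differ:

```python
import torch
import torch.nn as nn

class PromptEvolution(nn.Module):
    """Sketch of attention-based transformation + task-guided alignment."""

    def __init__(self, dim: int = 768):
        super().__init__()
        self.w_q = nn.Linear(dim, dim, bias=False)
        self.w_k = nn.Linear(dim, dim, bias=False)
        self.w_v = nn.Linear(dim, dim, bias=False)
        # Task-guided alignment LT(x) = max(0, x W1) W2 (one shared instance
        # shown here for brevity; the paper uses per-layer weights).
        self.align = nn.Sequential(
            nn.Linear(dim, dim, bias=False), nn.ReLU(), nn.Linear(dim, dim, bias=False)
        )

    def forward(self, prompts: torch.Tensor, task_embed: torch.Tensor) -> torch.Tensor:
        # prompts: (t, n, d) accumulated base prompts p^1..p^t
        # task_embed: (d,) learnable embedding e^t of the current task
        t, n, d = prompts.shape

        # Task conditioning: attention weights derived from e^t highlight
        # task-relevant prompt components.
        cond = torch.softmax(prompts @ task_embed / d**0.5, dim=-1)   # (t, n)
        prompts = prompts * cond.unsqueeze(-1)

        # Task-level transformation (TLT): the new task's prompt is the Query,
        # all accumulated prompts are Key/Value; G reweights each task.
        q = self.w_q(prompts[-1])                                     # (n, d)
        k = self.w_k(prompts)                                         # (t, n, d)
        v = self.w_v(prompts)                                         # (t, n, d)
        affinity = torch.einsum("nd,tmd->tnm", q, k) / d**0.5
        g = torch.softmax(affinity.mean(dim=(1, 2)), dim=0)           # (t,) cross-task affinity
        v_tilde = g.view(t, 1, 1) * v                                 # reweighted values

        # Feature-level transformation (FLT): transposed Query/Key give a
        # (d x d) cross-feature map f_mat; V_hat = f_mat @ V^T, transposed back.
        f_mat = torch.softmax(q.T @ k.mean(0) / n**0.5, dim=-1)       # (d, d)
        v_hat = torch.einsum("de,tne->tnd", f_mat, v_tilde)           # (t, n, d)

        # Task-guided alignment + mean aggregation over the t evolved prompts.
        return self.align(v_hat).mean(dim=0)                          # (n, d) RainbowPrompt
```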
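And a sketch of the adaptive prompting gate, using the standard straight-through Gumbel-Softmax (via `F.gumbel_softmax`; representing the Bernoulli parameters \(\alpha_l^t\) as per-layer logits is a simplification):

```python
import torch
import torch.nn.functional as F

num_layers = 12
# Per-layer gate logits for task t; two classes: [skip, insert].
gate_logits = torch.zeros(num_layers, 2, requires_grad=True)

def layer_gates(tau: float = 1.0) -> torch.Tensor:
    """Differentiable Bernoulli-like gates g_l^t for per-layer prompt insertion."""
    # hard=True returns one-hot samples in the forward pass while keeping
    # soft gradients for the backward pass (straight-through estimator).
    samples = F.gumbel_softmax(gate_logits, tau=tau, hard=True)  # (num_layers, 2)
    return samples[:, 1]  # 1.0 where the "insert" class was sampled

gates = layer_gates()
# During the forward pass, layer l receives the RainbowPrompt only when
# gates[l] == 1; at inference the learned gates are fixed.
print(gates)
```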
### Loss & Training
Total loss: \(\min_{\Theta^t} \sum_i \text{CE}(\boldsymbol{z}_i, y_i) + \lambda_s \mathcal{L}_{\text{sparse}} + \lambda_m \mathcal{L}_{\text{match}}\)
- \(\text{CE}\): Cross-entropy classification loss
- \(\mathcal{L}_{\text{sparse}} = \sum_l \log \alpha_l^t\): Sparsity regularization encouraging compact prompt insertion patterns
- \(\mathcal{L}_{\text{match}} = \gamma(q(\boldsymbol{x}), \boldsymbol{e}^t)\): Task embedding matching loss
- \(\lambda_s = \lambda_m = 0.01\)
- Learnable parameters \(\Theta^t = \{\boldsymbol{p}^t, \boldsymbol{e}^t, \boldsymbol{G}^t, W^{\text{evolution}}, \phi\}\)
- Prompt evolution is performed only during training; the stored RainbowPrompt is used directly at inference
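A sketch of how this objective might be assembled (hypothetical tensor names; instantiating \(\gamma\) as a cosine distance is an assumption, and \(\mathcal{L}_{\text{sparse}}\) is taken directly as \(\sum_l \log \alpha_l^t\)):

```python
import torch
import torch.nn.functional as F

lambda_s = lambda_m = 0.01

def total_loss(logits, labels, log_alphas, query_feat, task_embed):
    # Cross-entropy classification loss over the current task's classes.
    ce = F.cross_entropy(logits, labels)
    # Sparsity term sum_l log(alpha_l^t), encouraging few prompted layers.
    sparse = log_alphas.sum()
    # Matching term gamma(q(x), e^t); cosine distance is an assumption here.
    match = 1.0 - F.cosine_similarity(query_feat, task_embed.expand_as(query_feat)).mean()
    return ce + lambda_s * sparse + lambda_m * match
```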
## Key Experimental Results

### Main Results

Image Classification (ImageNet-R / CIFAR-100); Aₙ denotes the final average accuracy on the n-task split (higher is better), Fₙ the forgetting measure (lower is better):
| Method | ImgNet-R A₁₀↑ | ImgNet-R F₁₀↓ | ImgNet-R A₂₀↑ | CIFAR-100 A₁₀↑ | CIFAR-100 A₂₀↑ |
|---|---|---|---|---|---|
| L2P | 63.49 | 6.85 | 59.38 | 82.76 | 77.95 |
| DualPrompt | 68.50 | 5.14 | 63.21 | 85.07 | 80.49 |
| CODA-Prompt | 74.24 | 4.92 | 70.86 | 87.00 | 82.15 |
| ConvPrompt | 77.86 | 4.33 | 75.10 | 88.87 | 87.37 |
| RainbowPrompt | 79.09 | 3.90 | 78.36 | 89.86 | 90.15 |
Video Action Recognition (UCF-101 / ActivityNet):
| Method | UCF A₁₀↑ | UCF A₂₀↑ | ActivityNet A₁₀↑ | ActivityNet A₂₀↑ |
|---|---|---|---|---|
| CODA-Prompt | 84.77 | 75.35 | 66.13 | 58.62 |
| CPrompt | 87.16 | 81.78 | 66.81 | 62.17 |
| ConvPrompt | 85.58 | 78.83 | 67.32 | 60.01 |
| RainbowPrompt | 89.03 | 84.05 | 69.87 | 70.55 |
### Ablation Study
Component ablation (ImageNet-R 10-task):
| Configuration | A₁₀↑ | F₁₀↓ | Note |
|---|---|---|---|
| Full model | 79.09 | 3.90 | - |
| w/o Task Conditioning (TC) | 78.92 | 4.19 | Provides complementary information |
| w/o Task-Level Transformation (TLT) | 78.70 | 4.14 | Captures inter-task dependencies |
| w/o Feature-Level Transformation (FLT) | 78.57 | 4.29 | Fine-grained feature interaction is more critical |
| w/o Task-Guided Alignment (TGA) | 66.31 | 4.84 | Most critical component, -12.78% |
| w/o Adaptive Prompting (AP) | 78.13 | 4.07 | Adaptive selection outperforms manual layer choice |
### Key Findings
- Task-guided alignment is the most critical component; its removal causes a 12.78% accuracy drop.
- Moving from the 10-task to the 20-task setting on CIFAR-100, competing methods lose 1.5–4.85% accuracy, whereas RainbowPrompt gains 0.29%.
- Likewise on ActivityNet, ConvPrompt degrades from 67.32% (10 tasks) to 60.01% (20 tasks), a 7.31% drop, while RainbowPrompt improves from 69.87% to 70.55% (+0.68%).
- At inference, 76.5% of the evolution parameters (6.2M) are discarded; a forward pass then costs only 18.5B MACs.
## Highlights & Insights
- The concept of prompt "evolution" is novel: rather than simple prompt selection or fusion, prompt representations are transformed and aligned to adapt to new tasks.
- The analysis using the nuclear norm as a diversity metric is convincing, and experiments confirm a positive correlation between diversity and performance.
- The performance advantage grows with the number of tasks, indicating that the evolution mechanism effectively leverages accumulated knowledge.
- The adaptive gating mechanism eliminates the need for manual prompt layer selection, with optimal layer assignments varying across tasks and datasets.
## Limitations & Future Work
- As the number of tasks increases, the accumulated prompt set grows, and the cost of the evolution step grows linearly with it.
- Task identity is unknown at test time (class-incremental setting), making the approach reliant on the accuracy of task embedding matching.
- The method is currently evaluated on ViT-B/16 only; its effectiveness on larger pre-trained models remains unverified.
- The video action recognition experiments sample only 3 frames per clip, limiting the model's capacity to capture complex temporal dynamics.
## Related Work & Insights
- The prompt-evolving paradigm is potentially generalizable to other parameter-efficient fine-tuning scenarios (e.g., continual learning with LoRA).
- The dual-level attention transformation (task-level + feature-level) can be transferred to other settings requiring multi-source knowledge integration.
- The adaptive layer selection mechanism offers general reference value for the layer selection problem in prompt tuning.
## Rating
| Dimension | Score |
|---|---|
| Novelty | ⭐⭐⭐⭐ |
| Experimental Thoroughness | ⭐⭐⭐⭐⭐ |
| Value | ⭐⭐⭐⭐ |
| Writing Quality | ⭐⭐⭐⭐ |
| Overall | ⭐⭐⭐⭐ |