Mixture of Noise for Pre-Trained Model-Based Class-Incremental Learning¶
- Conference: NeurIPS 2025
- arXiv: 2509.16738
- Code: https://github.com/ASCIIJK/MiN-NeurIPS2025
- Area: Model Compression / Incremental Learning
- Keywords: Class-incremental learning, pre-trained models, parameter drift, positive noise, catastrophic forgetting suppression
TL;DR¶
This paper proposes learning a beneficial "mixture of noise" to suppress parameter drift in pre-trained models during class-incremental learning. By dynamically mixing task-specific noise with learned weights across tasks, the method achieves state-of-the-art performance, particularly in the challenging 50-step incremental setting.
Background & Motivation¶
Background: Pre-trained models (PTMs) exhibit strong performance when fine-tuned on downstream tasks; however, continual fine-tuning induces parameter drift, which corrupts discriminative features of previous tasks and allows new-task features to interfere with existing decision boundaries.
Limitations of Prior Work: Conventional class-incremental learning methods focus on improving feature utilization efficiency (e.g., prompt learning, prototype networks), while overlooking inter-task feature interference. Parameter drift has been uniformly treated as a purely negative phenomenon.
Core Idea: Noise is not inherently harmful — positive-incentive noise (Pi-Noise) can improve classification by masking inter-class confusion and highlighting discriminative features. Parameter drift constitutes "destructive noise," yet "beneficial noise" can be learned to counteract it.
Goal: Rather than pursuing efficient feature utilization, this work actively learns beneficial noise to suppress inter-task confusion patterns.
Key Insight: From an information-theoretic perspective, noise is modeled as a latent variable, and efficient inference is achieved via the reparameterization trick and dynamic mixing.
Approach: Through noise expansion (learning task-specific noise modules) combined with noise mixture (dynamic weight-based fusion), parameter drift is transformed from "catastrophic forgetting" into a "controllable positive signal."
Method¶
Overall Architecture¶
The MiN (Mixture of Noise) framework is realized through two core strategies:

- Noise Expansion: learning a dedicated noise generation module for each new task.
- Noise Mixture: dynamically learning weights to mix noise from different tasks, enabling single-pass inference.
Key Designs¶
- Noise Expansion Strategy (Section 4.1)
    - Function: Inserts π-noise layers into intermediate layers of the pre-trained backbone.
    - Mechanism: Given an intermediate feature \(r_l\), the noise generation process is defined as \(\varepsilon_t = \varepsilon \cdot \phi_t^{\sigma}(r_l W_{down}) + \phi_t^{\mu}(r_l W_{down})\), where \(\varepsilon\) is sampled from a standard normal distribution, \(\phi_t^{\sigma}\) and \(\phi_t^{\mu}\) are two-layer MLPs generating the variance and mean vectors respectively, and \(W_{down}\) projects the \(d_1\)-dimensional feature into a low-dimensional space of size \(d_2 \ll d_1\) (see the sketch after this list).
    - Design Motivation: The number of trainable parameters is minimal (only two \(d_2 \times d_2\) matrices), keeping the model lightweight.
- Noise Mixture Strategy (Section 4.2)
    - Function: For task \(t\), an accumulated set of noise modules \(\{\varepsilon_1, \ldots, \varepsilon_t\}\) is maintained, and dynamic weights are learned for optimal mixing.
    - Mechanism: Weights are initialized from task prototype similarity \(s_{t,i} = \frac{\mu_t \cdot \mu_i}{\|\mu_t\| \|\mu_i\|}\), and the normalized weighted mixture is computed as \(\varphi(\{\varepsilon_1, \ldots, \varepsilon_t\}) = \sum_{i=1}^{t}\varepsilon_i \omega_i\) (also covered in the sketch after this list).
    - Design Motivation: Applying each noise module independently would make inference complexity grow linearly with the number of tasks; mixing enables a single forward pass.
- Three-Step Training Pipeline (Algorithm 1)
    - Function: Iteratively updates through three steps: analytic learning → noise learning → classifier update.
    - Mechanism: Step 1 updates the classifier \(W_t\) via analytic learning (gradient-free); Step 2 trains task-specific noise using an auxiliary classifier; Step 3 updates the classifier again.
    - Design Motivation: The auxiliary classifier \(W_{aux}\) is initialized to zero to fit residuals, preventing the main classifier from being disturbed during noise training.
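To make the two noise strategies concrete, here is a minimal PyTorch sketch written from the equations summarized above. Module names (`PiNoiseLayer`, `NoiseMixture`), the use of `nn.Linear`/`GELU` for the two-layer MLPs, the up-projection back to the backbone width, and the softmax normalization of the mixing weights are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PiNoiseLayer(nn.Module):
    """One task-specific noise module: eps_t = eps * phi_sigma(r W_down) + phi_mu(r W_down).

    A sketch of the noise-expansion equation; layer sizes and activations are assumptions.
    """
    def __init__(self, d1: int, d2: int):
        super().__init__()
        self.w_down = nn.Linear(d1, d2, bias=False)   # project feature into low-dim space (d2 << d1)
        self.phi_sigma = nn.Sequential(               # two-layer MLP producing the variance vector
            nn.Linear(d2, d2), nn.GELU(), nn.Linear(d2, d2)
        )
        self.phi_mu = nn.Sequential(                  # two-layer MLP producing the mean vector
            nn.Linear(d2, d2), nn.GELU(), nn.Linear(d2, d2)
        )
        self.w_up = nn.Linear(d2, d1, bias=False)     # map the noise back to backbone width (assumed)

    def forward(self, r_l: torch.Tensor) -> torch.Tensor:
        h = self.w_down(r_l)
        eps = torch.randn_like(h)                     # reparameterization trick: eps ~ N(0, I)
        noise = eps * self.phi_sigma(h) + self.phi_mu(h)
        return self.w_up(noise)                       # additive perturbation for the intermediate feature


class NoiseMixture(nn.Module):
    """Mixes the accumulated task noises with learnable weights, so inference stays single-pass."""
    def __init__(self, noise_layers: list, task_prototypes: torch.Tensor):
        super().__init__()
        self.noise_layers = nn.ModuleList(noise_layers)
        # Initialize mixing weights from cosine similarity between each task prototype and the
        # newest task's prototype, then keep them learnable; the exact scheme is an assumption.
        sims = F.cosine_similarity(
            task_prototypes, task_prototypes[-1:].expand_as(task_prototypes), dim=-1
        )
        self.omega = nn.Parameter(sims.clone())

    def forward(self, r_l: torch.Tensor) -> torch.Tensor:
        w = torch.softmax(self.omega, dim=0)          # normalized mixing weights
        mixed = sum(wi * layer(r_l) for wi, layer in zip(w, self.noise_layers))
        return r_l + mixed                            # perturbed feature passed to the next block
```

Under this sketch, each new task adds one `PiNoiseLayer` while the backbone stays frozen, and the mixture keeps inference cost independent of the number of tasks.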
Loss & Training¶
The classifier itself is fitted analytically (gradient-free); during noise training, the zero-initialized auxiliary classifier \(W_{aux}\) learns residuals so the main classifier is not disturbed, and the learned noise ultimately acts as a beneficial perturbation in the feature space. A minimal sketch of the analytic classifier update is given below.
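"Analytic learning" in Steps 1 and 3 is typically a closed-form ridge-regression (recursive least-squares style) fit rather than gradient descent. The sketch below shows that reading; the function name, the regularization term `gamma`, and the one-hot targets are illustrative assumptions, not details taken from the paper.

```python
import torch

@torch.no_grad()
def analytic_classifier_update(features: torch.Tensor,
                               labels: torch.Tensor,
                               num_classes: int,
                               gamma: float = 1.0) -> torch.Tensor:
    """Closed-form (gradient-free) ridge-regression classifier:
        W = (F^T F + gamma * I)^{-1} F^T Y
    features: (N, d) frozen-backbone features after noise mixing
    labels:   (N,)   integer class labels
    Returns W of shape (d, num_classes).
    """
    targets = torch.nn.functional.one_hot(labels, num_classes).float()  # (N, C)
    d = features.shape[1]
    gram = features.T @ features + gamma * torch.eye(d, device=features.device)
    return torch.linalg.solve(gram, features.T @ targets)
```

Under this reading, Step 1 computes \(W_t\) from the current task's features, Step 2 trains the noise module against the zero-initialized \(W_{aux}\) that absorbs the residual error, and Step 3 recomputes the classifier once the noise modules are fixed.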
Key Experimental Results¶
Main Results¶
| Method | CIFAR-100 10-step | CIFAR-100 20-step | CIFAR-100 50-step | CUB-200 10-step | CUB-200 20-step |
|---|---|---|---|---|---|
| L2P | 85.92 | 81.90 | 74.29 | 84.29 | 81.75 |
| DualPrompt | 89.65 | 85.57 | 73.66 | 84.39 | 83.79 |
| CODA-Prompt | 91.05 | 87.51 | 69.54 | 84.15 | 83.89 |
| SLCA | 92.67 | 93.32 | 90.76 | 86.83 | 83.38 |
| FeCAM | 93.23 | 91.86 | 90.92 | 92.73 | 92.89 |
| MiN (Ours) | 94.12 | 93.89 | 92.34 | 93.45 | 93.67 |
Ablation Study¶
| Configuration | CIFAR-100 20-step | Notes |
|---|---|---|
| Noise Expansion only | 90.56 | Basic noise learning |
| + Initial weights | 91.23 | Similarity-based initialization |
| + Weight learning | 93.12 | Dynamic weight optimization |
| + Three-step pipeline | 93.89 | Full pipeline |
Key Findings¶
- Significant improvement over the previous SOTA in the 50-step incremental setting (the most challenging scenario): 90.92 → 92.34, with the smallest standard deviation.
- Even in the 50-step setting, the total number of trainable parameters remains minimal (fewest per task among compared methods).
- Grad-CAM visualizations demonstrate that MiN substantially improves attention to key discriminative regions while suppressing responses to irrelevant areas.
Highlights & Insights¶
- Novel Perspective: Parameter drift is reinterpreted as a controllable noise signal, extracting a positive mechanism from what was previously considered a purely negative phenomenon. This represents a fundamental conceptual shift — from "avoiding drift" to "exploiting drift."
- Lightweight Design: The noise generation modules introduce very few parameters, far fewer than prompt-based methods, and mixing noise at inference incurs no additional overhead.
- Robustness in Extreme Settings: The method maintains stable performance under 50-step incremental learning (2 new classes per step), substantially outperforming competing methods in this regime.
Limitations & Future Work¶
- Experiments are conducted on relatively small benchmarks (CIFAR-100, 60K images; CUB-200, ~12K images); large-scale real-world validation is absent.
- Evaluation is limited to ViT-B/16-IN21K; the effectiveness on other backbones (e.g., Swin, ConvNeXt) remains unexplored.
- The temperature coefficient and the computation of the initial mixing weights leave room for further tuning.
- Applicability to non-visual domains (NLP, multimodal incremental learning) has yet to be verified.
Related Work & Insights¶
- vs. L2P/DualPrompt: These methods adapt to new tasks via prompt learning, but prompts do not share knowledge across tasks; MiN's noise mixture inherently enables cross-task sharing.
- vs. FeCAM: FeCAM relies on prototypes and feature space alignment, whereas MiN injects positive noise into the feature space — the two approaches are complementary.
- Transferable Directions: Noise augmentation in knowledge distillation and contrastive learning; noise feature modeling in cross-domain transfer learning.
Rating¶
- Novelty: ⭐⭐⭐⭐⭐ The parameter drift–beneficial noise framework represents a genuinely novel perspective.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Six datasets with comprehensive ablation studies.
- Writing Quality: ⭐⭐⭐⭐ Logic is clear and methodological exposition is precise.
- Value: ⭐⭐⭐⭐⭐ Provides actionable guidance for effective adaptation of pre-trained models with high practical deployment value.