Decoupling Stability and Plasticity for Multi-Modal Test-Time Adaptation

Conference: CVPR 2026
arXiv: 2603.00574
Code: GitHub
Area: Multimodal VLM
Keywords: Multi-modal test-time adaptation, stability-plasticity decoupling, redundancy score, asymmetric adaptation, catastrophic forgetting

TL;DR

This paper proposes DASP, which diagnoses biased modalities via a redundancy score and applies an asymmetric adaptation strategy to decouple stability and plasticity, addressing negative transfer and catastrophic forgetting in multi-modal test-time adaptation.

Background & Motivation

Vulnerability of multi-modal models to distribution shift: Multi-modal models (e.g., audio-video) are susceptible to distribution shifts in non-stationary environments, such as weather changes and sensor degradation, which significantly degrade the performance of statically pre-trained models.

Rise of Test-Time Adaptation (TTA): TTA enables online parameter updates to adapt to distribution shifts without access to source data; however, existing methods are predominantly designed for unimodal settings.

Negative transfer: Modality-agnostic adaptation strategies indiscriminately update all modalities, potentially causing negative transfer to well-aligned, unbiased modalities.

Catastrophic forgetting: Continuous parameter updates erase source-domain knowledge, and the effect is particularly severe in biased modalities.

Stability-plasticity dilemma: Existing methods struggle to balance the two objectives—biased modalities require plasticity to adapt to the target distribution, while unbiased modalities require stability to preserve source domain knowledge.

Unreliability of conventional diagnostic metrics: Entropy and confidence are unreliable in multi-modal settings—a dominant modality may maintain low entropy and high confidence even under distribution shift, making cross-modal comparison infeasible.

Method

Overall Architecture

DASP follows a diagnose-then-mitigate framework: (1) biased modalities are identified via a redundancy score; (2) an asymmetric adaptation strategy handles biased and unbiased modalities separately.

Key Designs

Redundancy Score for Diagnosis

Inter-dimensional correlations of each modality's features are computed in the shared latent space of the fusion layer. Distribution shift causes feature manifold degeneration, inducing spurious inter-dimensional correlations (all dimensions respond uniformly to domain-specific noise), which markedly increases redundancy. A redundancy score \(R(\mathbf{Z})\) is defined on the features; writing \(r^m\) for the score of modality \(m\), each modality's redundancy is measured relative to the least redundant modality:

\[\Delta^m = r^m - \min_{n \in \mathcal{M}} r^n\]

Modality \(m\) is identified as biased and added to the biased set \(\mathcal{G}\) when \(\Delta^m \geq \delta\), where \(\delta\) is a predefined threshold.
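The diagnosis step can be sketched in a few lines of NumPy. This summary does not give the exact formula for \(R(\mathbf{Z})\), so the sketch assumes one plausible instantiation: the mean absolute off-diagonal entry of the feature correlation matrix. The function names, the synthetic features, and the threshold value are all hypothetical and only illustrate the thresholding rule \(\Delta^m \geq \delta\).

```python
import numpy as np

def redundancy_score(Z: np.ndarray, eps: float = 1e-8) -> float:
    """Mean absolute off-diagonal correlation between feature dimensions.

    ASSUMPTION: a stand-in for R(Z); the paper's exact definition may differ.
    Z has shape (batch, dim) with batch >= 2 and dim >= 2.
    """
    if Z.ndim != 2 or Z.shape[0] < 2 or Z.shape[1] < 2:
        raise ValueError("Z must have shape (batch, dim) with batch, dim >= 2")
    Zc = Z - Z.mean(axis=0, keepdims=True)           # center each dimension
    Zn = Zc / (Zc.std(axis=0, keepdims=True) + eps)  # unit variance per dimension
    corr = (Zn.T @ Zn) / Z.shape[0]                  # (dim, dim) correlation matrix
    d = corr.shape[0]
    off_diag = np.abs(corr).sum() - np.abs(np.diag(corr)).sum()
    return float(off_diag / (d * (d - 1)))

def diagnose_biased(features: dict, delta: float) -> set:
    """Return the set G of modalities whose relative redundancy is >= delta."""
    scores = {m: redundancy_score(Z) for m, Z in features.items()}
    r_min = min(scores.values())
    return {m for m, r in scores.items() if r - r_min >= delta}

# Synthetic example: every "audio" dimension responds to one shared noise source.
rng = np.random.default_rng(0)
video = rng.normal(size=(64, 16))
shared_noise = rng.normal(size=(64, 1))
audio = shared_noise + 0.3 * rng.normal(size=(64, 16))

biased = diagnose_biased({"video": video, "audio": audio}, delta=0.3)
print(biased)
```

In this toy setup the audio features are strongly correlated across dimensions, so only audio ends up in the biased set. Because the score is measured relative to the minimum, the least redundant modality always has \(\Delta^m = 0\).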

Asymmetric Adaptation

Each modality adapter \(\Phi^m\) is decomposed into a stable adapter \(\phi_s^m\) (low-rank, encouraging domain-agnostic generalization) and a plastic adapter \(\phi_p^m\) (high-rank, capturing domain-specific information):

  • Biased modalities (\(m \in \mathcal{G}\)): activate the plastic adapter and freeze the stable adapter → \(\tilde{z}^m = \phi_p^m(\phi_s^m(z^m))\)
  • Unbiased modalities (\(m \notin \mathcal{G}\)): bypass the plastic adapter, update the stable adapter with KL regularization → \(\tilde{z}^m = \phi_s^m(z^m)\)
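The routing rule above can be illustrated with a small NumPy sketch. The residual low-rank form `z + z @ A @ B`, the zero initialization, the rank values, and the `trainable` flag are assumptions made for illustration; the summary only states that the stable adapter is low-rank, the plastic adapter is high-rank, and which one is updated for each modality type.

```python
import numpy as np

class LowRankAdapter:
    """Residual adapter z + z @ A @ B (hypothetical parameterization)."""

    def __init__(self, dim: int, rank: int, rng: np.random.Generator):
        self.A = rng.normal(scale=0.02, size=(dim, rank))
        self.B = np.zeros((rank, dim))  # zero init: adapter starts as the identity
        self.trainable = False          # marks whether an optimizer may update it

    def __call__(self, z: np.ndarray) -> np.ndarray:
        return z + z @ self.A @ self.B

class ModalityAdapter:
    """Pairs a stable (low-rank) and a plastic (higher-rank) adapter."""

    def __init__(self, dim: int, stable_rank: int, plastic_rank: int, seed: int = 0):
        rng = np.random.default_rng(seed)
        self.stable = LowRankAdapter(dim, stable_rank, rng)
        self.plastic = LowRankAdapter(dim, plastic_rank, rng)

    def forward(self, z: np.ndarray, is_biased: bool) -> np.ndarray:
        if is_biased:
            # Biased modality: freeze stable, update plastic, apply both.
            self.stable.trainable, self.plastic.trainable = False, True
            return self.plastic(self.stable(z))
        # Unbiased modality: bypass plastic, update stable (KL-regularized).
        self.stable.trainable, self.plastic.trainable = True, False
        return self.stable(z)

adapter = ModalityAdapter(dim=8, stable_rank=2, plastic_rank=6)
z = np.ones((4, 8))
out_biased = adapter.forward(z, is_biased=True)
flags_biased = (adapter.stable.trainable, adapter.plastic.trainable)
out_unbiased = adapter.forward(z, is_biased=False)
flags_unbiased = (adapter.stable.trainable, adapter.plastic.trainable)
```

The sketch covers only the forward routing and the freeze/update bookkeeping; the gradient updates themselves would come from whatever training framework hosts the adapters.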

Loss & Training

\[\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{div}} + \lambda_{\text{ent}} \mathcal{L}_{\text{ent}} + \lambda_{\text{kl}} \mathcal{L}_{\text{kl}}\]

The objective comprises a diversity regularization term (preventing class collapse), an entropy minimization term (encouraging confident predictions), and a KL divergence penalty (constraining the stable adapters of unbiased modalities from deviating from the source model).
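A minimal sketch of how the three terms might be combined is given below. The individual definitions are assumptions, since the summary names the terms without giving formulas: the diversity term is taken as the negative entropy of the batch-mean prediction, the entropy term as the mean per-sample entropy, and the KL term as \(\mathrm{KL}(p_{\text{src}} \,\|\, p)\) between source-model and adapted predictions (which, per the text, would apply to unbiased modalities). The function name and the default weights are hypothetical.

```python
import numpy as np

def dasp_style_loss(p, p_src, lam_ent=1.0, lam_kl=1.0, eps=1e-8):
    """Combine diversity, entropy, and KL terms (assumed forms, see lead-in).

    p, p_src: arrays of shape (batch, classes) holding probabilities from the
    adapted model and the frozen source model, respectively.
    """
    p = np.clip(p, eps, 1.0)
    p_src = np.clip(p_src, eps, 1.0)

    # Entropy minimization: mean per-sample entropy of adapted predictions.
    l_ent = float(-(p * np.log(p)).sum(axis=1).mean())

    # Diversity: negative entropy of the batch-mean prediction, so minimizing
    # it spreads predictions across classes and discourages class collapse.
    p_bar = p.mean(axis=0)
    l_div = float((p_bar * np.log(p_bar)).sum())

    # KL penalty: keeps adapted predictions close to the source model's.
    l_kl = float((p_src * (np.log(p_src) - np.log(p))).sum(axis=1).mean())

    total = l_div + lam_ent * l_ent + lam_kl * l_kl
    return {"div": l_div, "ent": l_ent, "kl": l_kl, "total": total}

# Uniform predictions over 5 classes, identical to the source model's.
uniform = np.full((4, 5), 0.2)
terms = dasp_style_loss(uniform, uniform)
```

With uniform predictions that match the source model, the KL term vanishes and the diversity and entropy terms cancel for \(\lambda_{\text{ent}} = 1\), which is a convenient sanity check on the signs.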

Key Experimental Results

Main Results: Kinetics50-C Video Corruption (Episodic Adaptation)

| Method | Mean Accuracy (%) ↑ |
|---|---|
| Source (no adaptation) | 59.9 |
| Tent | 59.4 |
| EATA | 60.1 |
| SAR | 59.8 |
| READ | 62.5 |
| TSA | 63.8 |
| DASP (Ours) | 65.2 |
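To make the margins in these results explicit, the short snippet below computes each method's gain over the unadapted source model, using only the mean accuracies listed above. The variable names are illustrative.

```python
# Mean accuracy on Kinetics50-C video corruption, copied from the results above.
mean_acc = {
    "Source (no adaptation)": 59.9,
    "Tent": 59.4,
    "EATA": 60.1,
    "SAR": 59.8,
    "READ": 62.5,
    "TSA": 63.8,
    "DASP (Ours)": 65.2,
}

source = mean_acc["Source (no adaptation)"]
gains = {
    name: round(acc - source, 1)
    for name, acc in mean_acc.items()
    if name != "Source (no adaptation)"
}

for name, gain in gains.items():
    print(f"{name}: {gain:+.1f}")
```

In these numbers, Tent and SAR fall slightly below the unadapted source, while DASP gains 5.3 points over the source and 1.4 points over TSA, the strongest baseline listed.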

Ablation Study

| Component | Effect |
|---|---|
| Redundancy score vs. entropy/confidence | Redundancy score strongly correlates with accuracy; entropy/confidence are unreliable |
| Asymmetric vs. symmetric adaptation | Asymmetric adaptation substantially reduces negative transfer and forgetting |
| KL regularization | Effectively constrains stability of unbiased modalities |
| Low-rank/high-rank design | Aligns with the functional requirements of each modality role |

Key Findings

  • The redundancy score strongly correlates with accuracy on both Kinetics50-C and VGGSound-C
  • Biased modalities exhibit significantly higher redundancy than unbiased ones
  • DASP simultaneously mitigates negative transfer (unbiased modalities) and catastrophic forgetting (biased modalities)
  • Performance gains are particularly pronounced under audio corruption scenarios (large margins over baselines on VGGSound-C)

Highlights & Insights

  • The redundancy score is an elegant non-parametric diagnostic metric that can be applied online without source domain statistics
  • The diagnose-then-mitigate framework is logically coherent—localizing the problem before applying targeted remediation
  • The decoupled design of stable and plastic adapters is intuitively sound—externalizing domain-specific parameters while internalizing domain-agnostic ones
  • The structured low-rank vs. high-rank design naturally aligns with the respective functional roles of each adapter

Limitations & Future Work

  • The threshold \(\delta\) for the redundancy score must be predefined and may require adjustment across different scenarios
  • Validation is limited to audio-video bimodal settings; extension to more modalities (e.g., text + image + audio) remains unexplored
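  • Modality bias detection is a hard, relative decision: because \(\Delta^m\) is measured against the least redundant modality, that modality always has \(\Delta^m = 0\) and is never flagged for \(\delta > 0\), so cases where both modalities are simultaneously biased cannot be handled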
  • Computing the redundancy score requires batch statistics, making it inapplicable to batch size = 1 settings

Related Work & Insights

  • DASP is most closely related to TSA's selective adaptation, though TSA's soft routing is less stable in unsupervised settings
  • Both MDAA and DASP address catastrophic forgetting, but DASP achieves this through architectural decoupling rather than analytical methods
  • The stability-plasticity perspective generalizes to continual learning, federated learning, and related settings
  • The redundancy score can serve as a general-purpose distribution shift detection tool

Rating

  • Novelty: ⭐⭐⭐⭐
  • Experimental Thoroughness: ⭐⭐⭐⭐
  • Writing Quality: ⭐⭐⭐⭐
  • Value: ⭐⭐⭐⭐