BriMA: Bridged Modality Adaptation for Multi-Modal Continual Action Quality Assessment¶
Conference: CVPR2026
arXiv: 2602.19170
Code: github.com/ZhouKanglei/BriMA
Area: Multi-Modal VLM
Keywords: Action Quality Assessment, Continual Learning, Missing Modality, Multi-Modal Fusion, Memory Replay
TL;DR¶
BriMA tackles the non-stationary modality imbalance problem in multi-modal continual action quality assessment (AQA) through memory-guided bridging imputation and modality-aware replay optimization, improving rank correlation by 6–8% and reducing error by 12–15% on average across three benchmarks.
Background & Motivation¶
- Action Quality Assessment (AQA) is widely applied in sports analysis, rehabilitation evaluation, and skill assessment; multi-modal methods leveraging visual and kinematic cues have achieved notable progress.
- In real-world deployments, sensor failures and missing annotations cause non-stationary modality imbalance—modality availability varies over time.
- Existing multi-modal AQA methods assume complete and stable input modalities; any modality absence leads to significant performance degradation.
- Existing continual AQA methods focus solely on task-level forgetting and do not handle modality-level dynamic changes.
- Simple imputation, retrieval-based completion, and generative synthesis all fail to preserve the geometric structure critical for AQA scoring, disrupting ranking consistency.
- The fine-grained score sensitivity of AQA makes it fundamentally different from conventional missing modality reconstruction problems.
Method¶
Overall Architecture¶
At each training session, BriMA: (1) completes missing modality features via the MBI module; (2) fuses all modality features for score prediction; (3) selects informative samples for replay using the MRO module to counteract distribution drift.
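A minimal PyTorch-style sketch of this per-session pipeline is below; the module interfaces (`mbi`, `mro`, `score_head`), tensor shapes, and the concatenation-based fusion are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class BriMASession(nn.Module):
    """Illustrative per-session pipeline: impute -> fuse -> score (replay selection runs after the update)."""

    def __init__(self, mbi, mro, feat_dim=256, n_modalities=2):
        super().__init__()
        self.mbi = mbi                      # memory-guided bridging imputation (placeholder module)
        self.mro = mro                      # modality-aware replay optimization; applied after the
                                            # optimisation step to refresh the buffer (not shown here)
        self.score_head = nn.Sequential(    # fused features -> quality score
            nn.Linear(feat_dim * n_modalities, feat_dim),
            nn.ReLU(),
            nn.Linear(feat_dim, 1),
        )

    def forward(self, feats, miss_mask, task_id):
        # feats: dict {modality_name: (B, D) tensor, possibly imputed placeholders for missing ones}
        # miss_mask: (B, M) binary mask, 1 where the modality is missing
        # 1) complete missing modality features via MBI
        feats = self.mbi(feats, miss_mask, task_id)
        # 2) fuse all modality features (simple concatenation in this sketch)
        fused = torch.cat([feats[m] for m in sorted(feats)], dim=-1)
        # 3) predict the quality score
        return self.score_head(fused).squeeze(-1)
```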
Key Designs¶
MBI (Memory-Guided Bridging Imputation):

1. Candidate Retrieval: For a missing modality \(m\), cosine similarity is used to retrieve \(K\) structurally aligned exemplar features from the memory buffer \(\mathcal{B}_{t-1}\): \(s_{j,t'} = \frac{\langle \mathbf{z}_{i,t}^{\mathcal{O}}, \mathbf{z}_{j,t'}^{\mathcal{O}} \rangle}{\|\mathbf{z}_{i,t}^{\mathcal{O}}\| \|\mathbf{z}_{j,t'}^{\mathcal{O}}\|}\)
2. Task Indicator: A binary mask \(\mathbf{r}_{i,t}\) identifies missing modalities and is coupled with a learnable task embedding \(\mathbf{p}_t^m\) to provide task-specific conditioning.
3. Bridging Residual: Instead of synthesizing complete features, a residual correction is learned: \(\tilde{\mathbf{z}}_{i,t}^m = \bar{\mathbf{z}}_{i,t}^m + \Delta\mathbf{z}_{i,t}^m = \bar{\mathbf{z}}_{i,t}^m + B_\Theta(\mathbf{z}_{i,t}^{\mathcal{O}}, \bar{\mathbf{z}}_{i,t}^m, \mathbf{c}_t^m)\)
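A minimal sketch of this retrieve-then-correct idea, assuming the buffer stores paired features of the observed and the (currently missing) modality; the top-\(K\) similarity-weighted average, the placeholder `bridge_net`, and the concatenated conditioning are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def impute_missing(z_obs, buffer_obs, buffer_miss, task_embed, bridge_net, k=5):
    """Memory-guided bridging imputation (illustrative sketch).

    z_obs:       (B, D) observed-modality features of the current samples
    buffer_obs:  (N, D) observed-modality features stored in the memory buffer
    buffer_miss: (N, D) buffered features of the modality that is now missing
    task_embed:  (B, C) task-specific conditioning (mask + learnable task embedding)
    bridge_net:  network B_Theta that predicts the residual correction
    """
    # 1) candidate retrieval: cosine similarity against the buffer
    sim = F.cosine_similarity(z_obs.unsqueeze(1), buffer_obs.unsqueeze(0), dim=-1)  # (B, N)
    topk = sim.topk(k, dim=-1)

    # coarse estimate: similarity-weighted average of the K retrieved exemplars
    w = torch.softmax(topk.values, dim=-1)                             # (B, K)
    z_bar = (w.unsqueeze(-1) * buffer_miss[topk.indices]).sum(dim=1)   # (B, D)

    # 2)+3) bridging residual: learn a correction rather than generating from scratch
    delta = bridge_net(torch.cat([z_obs, z_bar, task_embed], dim=-1))
    return z_bar + delta
```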
MRO (Modality-Aware Replay Optimization):

- Dynamically prioritizes replay samples based on modality distortion and score drift.
- Maintains a representative sample buffer with reliable modalities and balanced score coverage.
- Counteracts cross-task distribution drift through replay.
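A minimal sketch of a priority rule in the spirit of MRO, assuming a weighted combination of score drift and modality distortion plus per-score-bin quotas for balanced coverage; the paper's exact criterion and buffer policy may differ.

```python
import torch

def select_replay(feat_distortion, score_drift, scores, budget=64, alpha=0.5, n_bins=8):
    """Illustrative modality-aware replay selection.

    feat_distortion: (N,) per-sample modality distortion (e.g. imputation error)
    score_drift:     (N,) per-sample change of the predicted score across tasks
    scores:          (N,) ground-truth quality scores, used for balanced coverage
    Returns the indices of samples kept in the replay buffer.
    """
    # priority: prefer samples with low distortion (reliable modalities)
    # and high score drift (most affected by forgetting)
    priority = alpha * score_drift - (1 - alpha) * feat_distortion

    # balanced score coverage: keep the top-priority samples within each score bin
    edges = torch.quantile(scores, torch.linspace(0, 1, n_bins + 1)[1:-1])
    bins = torch.bucketize(scores, edges)
    keep, per_bin = [], budget // n_bins
    for b in range(n_bins):
        idx = (bins == b).nonzero(as_tuple=True)[0]
        if idx.numel() == 0:
            continue
        top = priority[idx].topk(min(per_bin, idx.numel())).indices
        keep.append(idx[top])
    return torch.cat(keep) if keep else torch.empty(0, dtype=torch.long)
```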
Loss & Training¶
\(\min_{\theta_f, \theta_g} \mathcal{L}_{score} + \lambda_{mem}\mathcal{L}_{mem} + \lambda_{rec}\mathcal{L}_{rec}\)

where \(\mathcal{L}_{score}\) is the MSE scoring loss, \(\mathcal{L}_{mem}\) is the memory replay regularization loss, and \(\mathcal{L}_{rec} = \|\tilde{\mathbf{z}} - \mathbf{z}\|_2^2\) is the feature reconstruction loss.
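A short sketch of this combined objective, assuming the replay term is a plain MSE on replayed samples (the paper's \(\mathcal{L}_{mem}\) may be more involved) and illustrative weight values.

```python
import torch.nn.functional as F

def brima_loss(pred, target, pred_mem, target_mem, z_imputed, z_true,
               lambda_mem=1.0, lambda_rec=0.1):
    """Total objective: scoring loss + replay regularization + feature reconstruction.

    pred / target:         predictions and labels on current-task samples
    pred_mem / target_mem: predictions and labels on replayed memory samples
    z_imputed / z_true:    imputed vs. ground-truth features (when the modality is observed)
    """
    l_score = F.mse_loss(pred, target)          # L_score: MSE scoring loss
    l_mem = F.mse_loss(pred_mem, target_mem)    # L_mem: replay regularization (illustrative form)
    l_rec = F.mse_loss(z_imputed, z_true)       # L_rec: ||z_tilde - z||^2
    return l_score + lambda_mem * l_mem + lambda_rec * l_rec
```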
Key Experimental Results¶
Main Results: RG Dataset Comparison (\(\beta=10\%\) Missing Rate)¶
| Method | Publication | SRCC↑ Avg | MSE↓ Avg | RL2↓ Avg |
|---|---|---|---|---|
| ST-MLAVL | CVPR'25 | 0.599 | 9.94 | 3.558 |
| EWC | PNAS'17 | 0.605 | 10.26 | 3.709 |
| MER | ICLR'19 | 0.722 | 6.77 | — |
| BriMA | Ours | Best (~0.76+) | Lowest | Lowest |
Ablation Study¶
| Component | SRCC Change | MSE Change |
|---|---|---|
| w/o MBI (zero-fill) | Significant drop | Significant rise |
| w/o MRO (random replay) | Drop | Rise |
| w/o residual (direct generation) | Drop | Rise |
| Full BriMA | Best | Best |
Cross-Dataset Performance¶
Across three datasets (RG, Fis-V, and FS1000), BriMA achieves the following average improvements:

- Rank correlation: +6.1%, +8.3%, +1.4%
- Error reduction: −12.7%, −15.3%, −6.4%
- Relative error reduction: −13.9%, −14.1%, −5.2%
Key Findings¶
- The residual learning strategy is more stable than direct feature generation, particularly under limited supervision signals.
- Modality-aware replay selection is substantially more effective than random replay.
- Both MBI and MRO contribute meaningfully to overall performance gains.
Highlights & Insights¶
- This work is the first to systematically define and address the non-stationary modality imbalance problem in multi-modal continual AQA.
- Residual bridging is more conservative and safer than complete reconstruction—particularly important in score-sensitive tasks.
- The memory-guided retrieval combined with residual correction demonstrates strong capability in preserving the structure of the scoring manifold.
Limitations & Future Work¶
- The framework assumes that the missing modality pattern is known (i.e., \(\mathcal{M}_{i,t}\) is observable during training); automatic detection of missing modalities is not explored.
- Experiments are limited to two-modality scenarios; scalability to three or more modalities remains to be verified.
- The impact of memory buffer size on performance is not sufficiently discussed.
Related Work & Insights¶
- Distinction from general missing modality learning: BriMA is specifically designed for AQA score sensitivity, avoiding the scoring manifold corruption introduced by general-purpose methods.
- Distinction from continual AQA methods (e.g., Fs-Aug, MAGR): These methods only address task-level non-stationarity and do not resolve modality-level dynamics.
- Inspiration: The residual bridging idea is transferable to other tasks requiring modality completion under high output precision requirements.
Rating¶
- Novelty: ⭐⭐⭐⭐ (novel problem formulation; MBI design is well-motivated)
- Experimental Thoroughness: ⭐⭐⭐⭐ (3 datasets, multiple missing rates, comprehensive ablations)
- Writing Quality: ⭐⭐⭐⭐ (clear problem formalization; consistent notation)
- Value: ⭐⭐⭐ (relatively niche application scenario, though the methodology has broader generalizability)