Continual Multimodal Contrastive Learning¶
Conference: NeurIPS 2025 arXiv: 2503.14963 Code: https://github.com/Xiaohao-Liu/CMCL Area: Multimodal VLM Keywords: Continual Learning, Multimodal Contrastive Learning, Gradient Projection, Catastrophic Forgetting, Modality Binding
TL;DR¶
This paper is the first to formally define the Continual Multimodal Contrastive Learning (CMCL) problem and proposes Dual-side Null-space gradient projection (DNS), which projects gradients from new data into subspaces that do not interfere with previously acquired knowledge. DNS achieves the best stability–plasticity trade-off across 7 datasets.
Background & Motivation¶
Background: Multimodal contrastive learning (MCL) aligns different modalities (vision, audio, text, etc.) into a unified representation space via contrastive objectives. Representative methods such as CLIP, ImageBind, and LanguageBind have demonstrated strong cross-modal representation capabilities.
Limitations of Prior Work: Existing MCL methods typically assume all modality data can be collected at once and trained jointly. In practice, multimodal data often arrives in batches—new modality-pair data continually emerges—making retraining from scratch prohibitively expensive. Continual fine-tuning on existing models, however, leads to catastrophic forgetting that disrupts previously learned cross-modal alignment.
Key Challenge: Conventional continual learning methods (e.g., EWC, GEM, DER++) are designed for class-incremental or task-incremental settings and cannot handle the unique cross-modal complexity of MCL, where the contrastive objective remains consistent while the involved modality pairs continuously change. These methods suffer from a severe stability–plasticity trade-off in CMCL scenarios.
Goal: (1) Formally define continual multimodal contrastive learning with rigorous mathematical definitions of stability and plasticity; (2) Design a method that preserves previously learned cross-modal alignment while effectively acquiring new modality pairs.
Key Insight: From a gradient update perspective, model parameter updates are essentially modifications to a global parameter matrix. If gradients can be projected into subspaces that do not affect old data representations, both objectives can be simultaneously satisfied. This draws inspiration from null-space projection in single-modal continual learning, but the multimodal setting requires simultaneous projection from two modality sides.
Core Idea: Project gradients from both modality sides simultaneously into the null space of old data features, ensuring that parameter updates do not interfere with previously learned cross-modal alignment.
Method¶
Overall Architecture¶
Given a pre-trained modality-binding model (e.g., ImageBind), trainable linear projection layers are appended on top. When new modality-pair data arrives, contrastive learning optimizes the projection layer parameters. The core of DNS is: at each training step, gradients are projected from both "sides" into specific subspaces, such that the projected gradient updates do not affect previously learned cross-modal alignment scores.
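A minimal sketch of this setup, assuming the pre-trained backbone is exposed as a frozen per-modality feature extractor (class and method names here, e.g. `backbone.encode`, are illustrative assumptions rather than the paper's released code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProjectedBindingModel(nn.Module):
    """Frozen modality-binding backbone (e.g., ImageBind) with trainable linear projection heads."""

    def __init__(self, backbone, feat_dim=1024, proj_dim=512,
                 modalities=("vision", "audio", "text")):
        super().__init__()
        self.backbone = backbone
        for p in self.backbone.parameters():   # keep the pre-trained encoders frozen
            p.requires_grad = False
        # one trainable linear projection layer per modality
        self.proj = nn.ModuleDict({m: nn.Linear(feat_dim, proj_dim, bias=False)
                                   for m in modalities})

    def encode(self, inputs, modality):
        with torch.no_grad():
            feats = self.backbone.encode(inputs, modality)  # frozen backbone features (assumed API)
        return F.normalize(self.proj[modality](feats), dim=-1)
```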
Key Designs¶
- CMCL Problem Formalization and Dual-Objective Definition:
    - Function: Provides a rigorous mathematical framework for continual multimodal contrastive learning.
    - Mechanism: Stability requires that the alignment score \(\mathbf{A}_{t-1;t}^{m_1,m_2}\), computed on old data with the current model, stay consistent with the old model's score \(\mathbf{A}_{t-1;t-1}^{m_1,m_2}\); plasticity requires the model to learn effectively from new modality pairs. The authors prove that if the global parameter update is projected as \(\bar{\mathbf{W}} = \tilde{\mathbf{W}} - \mathbf{P}'\tilde{\mathbf{W}}\mathbf{P}\), where \(\mathbf{P}'\) and \(\mathbf{P}\) are projectors constructed from old data features, the stability condition is automatically satisfied.
    - Design Motivation: The CMCL objective (contrastive loss) remains fixed while the modality pairs change; this necessitates dedicated stability/plasticity definitions distinct from those in classification-based continual learning.
- Dual-side Null-space Gradient Projection (DNS):
    - Function: Decomposes the global stability condition into locally actionable gradient projections.
    - Mechanism: The global parameter update is expanded into three terms associated with the gradients of the two modalities. Theorem 4 proves that the gradient of each modality can be projected independently: the gradient for modality \(m_1\) is projected as \(\Delta\mathbf{W}_t^{m_1} = \nabla\mathbf{W}_t^{m_1} - \tilde{\mathbf{P}}\nabla\mathbf{W}_t^{m_1}\mathbf{P}'\), where \(\tilde{\mathbf{P}}\) and \(\mathbf{P}'\) are constructed from the features of the corresponding old modalities. Replacing the original gradients with the projected ones ensures that parameter updates do not disturb previously learned cross-modal alignment (see the sketch after this list).
    - Design Motivation: Direct projection on global parameters is intractable; decomposing the projection into per-modality gradients substantially reduces implementation complexity.
- Multi-step and Multi-modality-pair Extension:
    - Function: Generalizes the two-step, two-modality theory to arbitrary steps and modality pairs.
    - Mechanism: An incremental centered feature covariance matrix \(\bar{\mathbf{Z}}_{<t}^m\) accumulates feature information from all historical steps; the projector is updated via SVD at the end of each step. For different modality pairs, gradients of modalities not involved in training are simply set to zero.
    - Design Motivation: Real-world training involves multiple steps and modality pairs, requiring an efficient mechanism to maintain historical information without storing old data; the recursive covariance update enables replay-free continual learning.
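A minimal sketch of these pieces put together, assuming the projectors are built by truncated SVD of the accumulated old-feature covariances with a minimum-eigenvalue cut (cf. the SVD threshold in Loss & Training below); all names, shapes, and threshold handling are illustrative, not the authors' implementation:

```python
import torch

def update_covariance(cov_prev, feats):
    """Accumulate the centered feature covariance (\bar{Z}_{<t}^m) without storing old data."""
    centered = feats - feats.mean(dim=0, keepdim=True)
    return cov_prev + centered.T @ centered          # (d, d)

def old_subspace_projector(cov, lambda_min=0.01):
    """Projector onto the subspace occupied by old features: keep the eigen-directions
    whose eigenvalues exceed lambda_min (truncated-SVD approximation)."""
    U, S, _ = torch.linalg.svd(cov)
    U_keep = U[:, S > lambda_min]
    return U_keep @ U_keep.T                         # plays the role of \tilde{P} or P'

def dns_project(grad, P_tilde, P_prime):
    """Dual-side projection from the text above: dW = grad - P_tilde @ grad @ P_prime.
    The removed component is the part of the update that would perturb old alignments."""
    return grad - P_tilde @ grad @ P_prime

# Modalities not involved in the current step simply contribute no gradient:
# their projection layers are left untouched.
```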
Loss & Training¶
Standard CLIP-style InfoNCE contrastive loss is used with the AdamW optimizer (lr=0.0001, weight decay=0.001) and batch size=64. Null-space projection is approximated via truncated SVD, with a minimum eigenvalue threshold \(\lambda_{\min}\) of 0.01 (ImageBind/UniBind) or 0.0001 (LanguageBind).
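A hedged sketch of one training step under this recipe, reusing the illustrative `ProjectedBindingModel` and `dns_project` from the earlier sketches; the data loader, the current modality pair, and the pre-computed projectors are assumed to be in scope, and the temperature value is a common default rather than a figure from the paper:

```python
import torch
import torch.nn.functional as F

def info_nce(z1, z2, temperature=0.07):
    """Symmetric CLIP-style contrastive loss over a batch of paired, normalized embeddings."""
    logits = z1 @ z2.T / temperature
    labels = torch.arange(z1.size(0), device=z1.device)
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.T, labels))

# optimizer settings reported above
head_a, head_b = model.proj["vision"], model.proj["audio"]      # current modality pair (example)
optimizer = torch.optim.AdamW(list(head_a.parameters()) + list(head_b.parameters()),
                              lr=1e-4, weight_decay=1e-3)

for x_a, x_b in loader:                                         # batch size 64
    z_a = model.encode(x_a, "vision")
    z_b = model.encode(x_b, "audio")
    loss = info_nce(z_a, z_b)
    optimizer.zero_grad()
    loss.backward()
    # dual-side null-space projection of each modality's gradient before the optimizer step
    head_a.weight.grad = dns_project(head_a.weight.grad, P_tilde_a, P_prime_a)
    head_b.weight.grad = dns_project(head_b.weight.grad, P_tilde_b, P_prime_b)
    optimizer.step()
```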
Key Experimental Results¶
Main Results¶
Evaluated on 7 datasets (UCF101, ESC50, NYUDv2, VGGSound-S, Clotho, TVL, LLVIP) across 11 training steps, 7 modalities, and 3 backbone models.
| Backbone | Method | Cls. Acc (%) | BWT (Acc) | R@10 (%) | BWT (R@10) |
|---|---|---|---|---|---|
| ImageBind | Vanilla | 47.32 | -5.72 | 38.56 | -3.34 |
| ImageBind | Co2L | 50.13 | -3.74 | 38.66 | -1.77 |
| ImageBind | DNS | 52.52 | -0.02 | 40.89 | -1.07 |
| LanguageBind | Vanilla | 51.86 | -15.71 | 36.02 | -10.19 |
| LanguageBind | CILA | 59.30 | -2.63 | 40.48 | -1.09 |
| LanguageBind | DNS | 64.07 | -0.09 | 42.44 | -3.00 |
| UniBind | C-FLAT | 51.25 | -4.36 | 40.51 | -1.48 |
| UniBind | DNS | 52.86 | +0.31 | 41.44 | -1.19 |
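BWT denotes backward transfer; assuming the paper follows the standard continual-learning convention, it averages how much performance on earlier steps changes after the final step (negative values indicate forgetting):

\[ \mathrm{BWT} = \frac{1}{T-1} \sum_{i=1}^{T-1} \left( a_{T,i} - a_{i,i} \right), \]

where \(a_{t,i}\) is the metric (classification accuracy or R@10) on step \(i\)'s data after training through step \(t\).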
Ablation Study¶
| Analysis Dimension | Result | Notes |
|---|---|---|
| Stability deviation | Per-step alignment score deviation near 0 | Empirically validates the theoretical bound of Theorem 5 |
| Plasticity higher-order term | \(o(\eta)/\eta < 0\) | Satisfies the plasticity condition of Theorem 6 |
| Training loss | Converges normally at each step | Plasticity is preserved under projection constraints |
| Training time overhead | DNS adds <1s | More efficient than replay-based methods |
Key Findings¶
- DNS achieves classification BWT near zero or even positive (+0.31 on UniBind), indicating that learning new knowledge can even benefit old tasks—an extremely rare outcome in continual learning.
- Forgetting is most severe in the LanguageBind setting (Vanilla BWT = -15.71); DNS reduces this to -0.09, a roughly 170× reduction in forgetting.
- DNS is a replay-free method requiring no storage of old data, with negligible additional training time of under one second.
Highlights & Insights¶
- Elegance of Dual-side Projection: Unlike single-modal continual learning, which requires only one-sided projection, CMCL must simultaneously account for the feature spaces of two modalities. The authors cleverly decompose the global stability condition into local per-modality gradient projections, yielding a theoretically principled and practically simple solution.
- Efficiency of the Replay-free Design: Compared to methods requiring a replay buffer, DNS maintains only a recursively updated feature covariance matrix, with minimal storage and computational overhead.
- Flexible Extension to Diverse Modality Pairs: When a new step involves a different modality pair, gradients of non-participating modalities are simply set to zero without additional handling—highly practical in real-world multimodal scenarios.
Limitations & Future Work¶
- Experiments evaluate linear layers appended to pre-trained models; directly fine-tuning entire encoders is not explored, which may limit applicability in large-scale training.
- The SVD truncation threshold requires separate tuning for different backbones (0.01 vs. 0.0001), lacking an adaptive strategy.
- Only pairwise modality alignment is considered; simultaneous training on three or more modalities is not addressed.
- Theoretical analysis assumes linear projection layers; generalization to nonlinear encoders requires further investigation.
Related Work & Insights¶
- vs. Adam-NSCL: Also employs null-space projection but only for single-modal continual learning and cannot handle the cross-modal complexity of multimodal alignment. DNS's dual-side projection is a natural extension of null-space methods to multimodal settings.
- vs. Co2L: Uses contrastive distillation to prevent forgetting but requires old data replay; BWT remains -5.79 in the LanguageBind setting. DNS requires no replay and achieves BWT near zero.
- vs. EWC: Prevents forgetting via parameter importance weights, implicitly assuming parameter independence; performance is limited under cross-modal interactions in multimodal settings.
Rating¶
- Novelty: ⭐⭐⭐⭐ First to formalize the CMCL problem and propose a dedicated method; dual-side projection offers theoretical innovation.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Seven datasets, three backbones, multiple baselines, and detailed stability/plasticity analysis.
- Writing Quality: ⭐⭐⭐⭐ Theoretical derivations are clear with a complete notation system, though readability is somewhat impacted by the density of mathematical symbols.
- Value: ⭐⭐⭐⭐ Fills an important gap at the intersection of multimodal contrastive learning and continual learning.