MMRL: Multi-Modal Representation Learning for Vision-Language Models

Conference: CVPR 2025
arXiv: 2503.08497
Code: https://github.com/yunncheng/MMRL
Area: Multimodal VLM
Keywords: CLIP transfer learning, multi-modal representation, few-shot learning, prompt learning, generalization preservation

TL;DR

MMRL proposes a shared, modality-agnostic learnable representation space that projects representation tokens into high-level layers of image and text encoders (preserving low-level generalization knowledge). Through a decoupled inference strategy (utilizing representation + class features for base classes, and only class features for novel classes), MMRL achieves an optimal balance between few-shot adaptation and generalization across 15 datasets, establishing a new SOTA in base-to-novel generalization.

Background & Motivation

Pre-trained VLMs such as CLIP have strong zero-shot capabilities, but they often overfit during few-shot downstream adaptation, which degrades generalization to novel classes and new datasets. Existing methods fall into two main categories: (1) Prompt learning (e.g., CoOp, MaPLe) adapts the model via learnable prompts, but injecting prompts in shallow layers disrupts CLIP's generalization knowledge, and text-centric designs introduce modality imbalance. (2) Adapters (e.g., CLIP-Adapter, MMA) adjust features via lightweight modules, but they optimize only the class token features, which makes them prone to overfitting when data is scarce.

The key challenge is how to adapt to downstream tasks while maintaining the generalization capability of VLMs. MMRL's key insight is to introduce an independent, unbiased shared representation space that conducts multi-modal interactions in the deeper layers of the encoders. Representation tokens acquire downstream knowledge while the class tokens retain generalization.

Method

Overall Architecture

Based on a frozen CLIP model, MMRL introduces a shared, learnable representation space \(\mathcal{R}\) (initialized via Gaussian distribution). Through a learnable mapping function \(\mathcal{F}\), the space tokens \(R \in \mathbb{R}^{K \times d_r}\) are projected into visual representation tokens \(R^v\) and textual representation tokens \(R^t\). Starting from the \(J\)-th layer, these tokens are inserted into the Transformer layers of the image and text encoders, participating in attention computation alongside original tokens. The entire CLIP model remains frozen; only \(\mathcal{R}\), \(\mathcal{F}\), and the projection layers of the representation tokens are trained.
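The wiring described above can be sketched in plain Python. This is a toy illustration under stated assumptions, not the authors' implementation: the names `SharedRepresentationSpace`, `tokens_for_layer`, and `image_layer_input`, the initialization scale `std=0.02`, the zero biases, and the tiny dimensions are all hypothetical choices made for readability. A real implementation would use a tensor library and CLIP's actual widths, and the layer indexing here is simplified.

```python
import random


def gaussian_matrix(rows, cols, rng, std=0.02):
    """Return a rows x cols matrix of Gaussian samples (std is an assumed value)."""
    return [[rng.gauss(0.0, std) for _ in range(cols)] for _ in range(rows)]


def linear(tokens, weight, bias):
    """Apply y = W x + b to each token; weight has shape (out_dim, in_dim)."""
    return [
        [sum(x * w for x, w in zip(tok, w_row)) + b for w_row, b in zip(weight, bias)]
        for tok in tokens
    ]


class SharedRepresentationSpace:
    """Toy shared space R with one independent linear map per layer and modality."""

    def __init__(self, num_tokens, d_r, d_v, d_t, start_layer, num_layers, seed=0):
        rng = random.Random(seed)
        self.start_layer = start_layer  # J: first layer that receives tokens
        self.num_layers = num_layers    # L: depth of each encoder
        self.R = gaussian_matrix(num_tokens, d_r, rng)  # K x d_r, learnable
        layers = range(start_layer, num_layers + 1)
        # F_i^v and F_i^t: a separate (weight, bias) pair for every injected layer.
        self.visual_maps = {i: (gaussian_matrix(d_v, d_r, rng), [0.0] * d_v) for i in layers}
        self.text_maps = {i: (gaussian_matrix(d_t, d_r, rng), [0.0] * d_t) for i in layers}

    def tokens_for_layer(self, layer):
        """Return (R^v, R^t) for an injected layer, or (None, None) below layer J."""
        if layer < self.start_layer:
            return None, None
        rep_v = linear(self.R, *self.visual_maps[layer])
        rep_t = linear(self.R, *self.text_maps[layer])
        return rep_v, rep_t


def image_layer_input(cls_token, patch_tokens, rep_tokens):
    """Build [c, E] below layer J and [c, R^v, E] from layer J onward."""
    if rep_tokens is None:
        return [cls_token] + patch_tokens
    return [cls_token] + rep_tokens + patch_tokens


# Toy usage: K = 4 tokens, injection from layer 9 of a 12-layer encoder.
space = SharedRepresentationSpace(num_tokens=4, d_r=8, d_v=6, d_t=5, start_layer=9, num_layers=12)
cls_token, patches = [0.0] * 6, [[1.0] * 6 for _ in range(3)]
for layer in (8, 9):
    rep_v, _ = space.tokens_for_layer(layer)
    print(layer, len(image_layer_input(cls_token, patches, rep_v)))
```

With three patch tokens, the sequence has 4 entries at layer 8 and 8 entries at layer 9, since the four visual representation tokens are present only from layer \(J\) onward. Both modalities read from the same `R`, which is the sense in which neither one is derived from the other.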

Key Designs

  1. Shared Modality-Agnostic Representation Space:

    • Function: Establishes a bridge unbiased toward either modality, facilitating balanced multi-modal interactions.
    • Mechanism: The shared space \(\mathcal{R}\) is initialized from a Gaussian distribution, and independent linear mappings \(\mathcal{F}_i^v\) and \(\mathcal{F}_i^t\) generate the visual and textual representation tokens for each layer \(i\), respectively. Crucially, the mapping is independent per layer: the representation tokens at different layers are generated by distinct mapping functions to adapt to layer-specific feature distributions.
    • Design Motivation: Methods like MaPLe map visual prompts from textual prompts, which is inherently text-centric and ignores the independence of the visual modality. The design of a shared space allows both modalities to start from the same origin, achieving truly "unbiased" multi-modal learning.
  2. High-Level Injection Strategy:

    • Function: Protects the generalization knowledge in the lower layers of CLIP from disruption.
    • Mechanism: Representation tokens are injected only from the \(J\)-th layer onward. In the image encoder, the layers before the \(J\)-th layer process \([c_{i-1}, E_{i-1}]\) as usual, while the layers from \(J\) onward process \([c_{i-1}, R_{i-1}^v, E_{i-1}]\). The representation-token outputs of intermediate layers are discarded; only those of the final layer \(L\) are retained. In the text encoder, \(R^t\) is inserted before the text tokens, and the attention mask is adjusted to accommodate the increased sequence length.
    • Design Motivation: MMA discovered that the deeper layers of CLIP encoders encode dataset-specific discriminative features, while the shallower layers encode generalizable features. Injecting learnable parameters into lower layers disrupts generalized representations, causing a sharp drop in performance on novel classes.
  3. Decoupled Inference Strategy:

    • Function: Uses different feature combinations for base and novel classes to maximize their respective performances.
    • Mechanism: The final layer of the image encoder outputs the class token feature \(f_c = P_v^c(c_L)\) (frozen projection) and the representation feature \(f_r = P_v^r(\text{Mean}(R_L^v))\) (learnable projection). During inference, for base classes, the two features are fused with weights: \(p(y=c|x) = \alpha \cdot p(y=c|f_c) + (1-\alpha) \cdot p(y=c|f_r)\); for novel classes, only \(f_c\) is used (preserving generalization knowledge).
    • Design Motivation: The ablation study (w/o DS₂) shows that if novel classes also use both features, representation tokens overfitted to base classes drag down novel-class performance. Decoupled use of the two features is therefore critical.
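The decoupled inference rule can be written down in a few lines. The sketch below is illustrative only: it assumes the per-class logits for the class feature \(f_c\) and the representation feature \(f_r\) have already been computed (for CLIP, typically scaled cosine similarities against the text features), and the function name `decoupled_predict` and the flag `is_base_task` are hypothetical. The default `alpha=0.7` is the value reported in this summary.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def decoupled_predict(logits_c, logits_r, is_base_task, alpha=0.7):
    """Fuse class and representation predictions for base classes only.

    logits_c: per-class logits from the class feature f_c (frozen projection)
    logits_r: per-class logits from the representation feature f_r
    is_base_task: True when evaluating on base classes (assumed to be known)
    """
    p_c = softmax(logits_c)
    if not is_base_task:
        # Novel classes: rely on the class feature alone.
        return p_c
    p_r = softmax(logits_r)
    return [alpha * pc + (1.0 - alpha) * pr for pc, pr in zip(p_c, p_r)]


logits_c = [2.0, 1.0, 0.0]
logits_r = [0.0, 3.0, 0.0]
base_probs = decoupled_predict(logits_c, logits_r, is_base_task=True)
novel_probs = decoupled_predict(logits_c, logits_r, is_base_task=False)
```

Because the fusion is a convex combination of two probability vectors, `base_probs` still sums to one. For novel classes the representation branch is simply never consulted.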

Loss & Training

The total loss is: \(\mathcal{L}_{MMRL} = \alpha \mathcal{L}_{ce}^c + (1-\alpha) \mathcal{L}_{ce}^r + \lambda (\mathcal{L}_{cos}^v + \mathcal{L}_{cos}^t)\)

  • \(\mathcal{L}_{ce}^c\) and \(\mathcal{L}_{ce}^r\): Cross-entropy losses for class token features and representation token features, respectively.
  • \(\mathcal{L}_{cos}^v\) and \(\mathcal{L}_{cos}^t\): Regularization terms that constrain the image/text features of class tokens to maintain cosine similarity with the zero-shot features of the frozen CLIP, preventing deviation from pre-trained knowledge during adaptation.
  • \(\alpha = 0.7\) controls the dual-feature weight, and \(\lambda = 0.5\) controls the regularization strength.
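To make the objective concrete, here is a minimal single-sample sketch of the loss. It is not the authors' code. Two things are assumptions: the regularizers are written as \(1 - \cos(\cdot, \cdot)\), a common way to encourage cosine similarity that this summary does not spell out, and the text-side term is averaged over classes. The names `mmrl_loss`, `f_zero`, and `w_zero` are hypothetical.

```python
import math


def cross_entropy(logits, label):
    """Negative log-softmax probability of the true label."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mmrl_loss(logits_c, logits_r, label, f_c, f_zero, w, w_zero, alpha=0.7, lam=0.5):
    """Single-sample sketch of the MMRL objective.

    f_c, f_zero: adapted and frozen-CLIP image class features
    w, w_zero:   adapted and frozen-CLIP text features, one vector per class
    The 1 - cos form of both regularizers is an assumption of this sketch.
    """
    ce_c = cross_entropy(logits_c, label)
    ce_r = cross_entropy(logits_r, label)
    cos_v = 1.0 - cosine(f_c, f_zero)
    cos_t = sum(1.0 - cosine(a, b) for a, b in zip(w, w_zero)) / len(w)
    return alpha * ce_c + (1.0 - alpha) * ce_r + lam * (cos_v + cos_t)
```

When the adapted features coincide with the frozen ones, both cosine terms vanish and the loss reduces to the \(\alpha\)-weighted sum of the two cross-entropies. Any drift of the class features away from the zero-shot features adds a penalty scaled by \(\lambda\).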

Key Experimental Results

Main Results (Base-to-Novel Generalization, Average of 11 Datasets)

Method | Base (%) | Novel (%) | HM (%) | Description
CLIP | 69.34 | 74.22 | 71.70 | Zero-shot baseline
CoOp | 82.69 | 63.22 | 71.66 | Strong base but poor novel generalization
MaPLe | 82.28 | 75.14 | 78.55 | Multi-modal prompt
PromptSRC | 84.26 | 76.10 | 79.97 | Self-regularized prompt
MMA (prev. SOTA) | 83.20 | 76.80 | 79.87 | Multi-modal adapter
MMRL | 85.68 | 77.16 | 81.20 | Best on Base, Novel, and HM
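HM in this benchmark is the harmonic mean of base and novel accuracy, which penalizes a method that trades one for the other. The short check below recomputes HM from the averaged Base and Novel figures in the table; the helper name `harmonic_mean` is just illustrative.

```python
def harmonic_mean(base_acc, novel_acc):
    """Harmonic mean of base and novel accuracy (both in percent)."""
    return 2.0 * base_acc * novel_acc / (base_acc + novel_acc)


results = {"CLIP": (69.34, 74.22), "MMA": (83.20, 76.80), "MMRL": (85.68, 77.16)}
for method, (base, novel) in results.items():
    print(f"{method}: HM = {harmonic_mean(base, novel):.2f}")
# CLIP: HM = 71.70
# MMA: HM = 79.87
# MMRL: HM = 81.20
```

CoOp illustrates the penalty: its Base accuracy is far above zero-shot CLIP, but its low Novel accuracy pulls the HM slightly below CLIP's.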

Ablation Study

Configuration | Base (%) | Novel (%) | HM (%) | Description
w/o V (remove visual representation) | 82.83 | 75.03 | 78.74 | Visual representation contributes significantly
w/o L (remove textual representation) | 85.05 | 75.65 | 80.08 | Textual representation is also important
w/o DS₁ (base classes use only class features) | 83.59 | 77.16 | 80.25 | Representation features yield substantial gains for base classes
w/o DS₂ (novel classes also use dual features) | 85.68 | 73.80 | 79.30 | Decoupling strategy is key
w/o RS (no shared space) | 85.79 | 75.55 | 80.34 | Shared space is crucial for novel class generalization
MMRL† (biased mapping similar to MaPLe) | 85.60 | 76.02 | 80.55 | Unbiased > Biased
MMRL (Full) | 85.68 | 77.16 | 81.20 | All components work synergistically

Key Findings

  • MMRL improves the average HM over MMA by 1.33 percentage points across 11 datasets, with gains of 2.48 points on Base and 0.36 points on Novel.
  • The optimal representation space dimension is \(d_r = 512\); larger dimensions lead to overfitting.
  • Representation tokens should be injected into deeper layers (starting at the 9th layer of a 12-layer encoder is optimal); injecting into lower layers disrupts generalizable features.
  • \(K = 4\) representation tokens gives the best balance.
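For reference, the hyperparameter values reported in this summary can be gathered into a single configuration object. The sketch below is only a convenience for readers: the class `MMRLConfig` and its field names are hypothetical and do not come from the released code. The defaults simply restate the numbers given in this summary.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MMRLConfig:
    """Hyperparameters as reported in this summary (field names are illustrative)."""

    num_tokens: int = 4    # K: number of representation tokens
    rep_dim: int = 512     # d_r: dimension of the shared space
    start_layer: int = 9   # J: first layer that receives tokens
    num_layers: int = 12   # L: encoder depth of ViT-B/16 CLIP
    alpha: float = 0.7     # weight between class and representation branches
    lam: float = 0.5       # strength of the cosine regularization

    def __post_init__(self):
        if not 1 <= self.start_layer <= self.num_layers:
            raise ValueError("start_layer must lie within the encoder depth")
        if not 0.0 <= self.alpha <= 1.0:
            raise ValueError("alpha must be in [0, 1]")

    def injected_layers(self):
        """Layers J..L that receive representation tokens."""
        return list(range(self.start_layer, self.num_layers + 1))


config = MMRLConfig()
print(config.injected_layers())
```

With the defaults, only the last four of the twelve layers receive representation tokens, which is the "high-level injection" setting discussed above.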

Highlights & Insights

  • Decoupled inference is the most critical design: both features are optimized during training and then selected adaptively at inference, which resolves the tension between adaptation and generalization.
  • The proposed shared, modality-agnostic space outperforms biased mapping schemes (which map from one modality to another), as verified by the ablation study comparing MMRL† to MMRL.
  • The entire method requires training only a small number of parameters (\(\mathcal{R}\), \(\mathcal{F}\), and one projection layer) while CLIP remains completely frozen, ensuring high training efficiency.
  • On fine-grained datasets such as FGVCAircraft, MMRL's Base accuracy is 3.57 points higher than PromptSRC's.
  • The intermediate outputs of representation tokens are discarded and only retained in the final layer, preventing information leakage across intermediate layers.
  • The regularization term \(\mathcal{L}_{cos}\) simultaneously constrains class features on both the visual and textual sides, preventing feature drift from both ends.
  • In few-shot learning, the performance advantage becomes more pronounced as the number of shots increases, indicating a higher upper bound for adaptation capacity.

Limitations & Future Work

  • The method is specifically tailored for dual-encoder architectures like CLIP and is not directly applicable to generative VLMs (e.g., LLaVA).
  • In lower-shot settings (fewer than 16 shots), the improvements are relatively limited.
  • The regularization term requires an extra forward pass of the frozen CLIP to compute zero-shot features, which increases training overhead.
  • There are many hyperparameters (\(\alpha\), \(\lambda\), \(J\), \(K\), \(d_r\)), entailing non-trivial tuning costs.

Comparison with Related Work

  • vs MaPLe: MaPLe injects prompts in shallow layers and employs a text-centric mapping to generate visual prompts; MMRL injects tokens in deeper layers and utilizes an unbiased shared space.
  • vs MMA: MMA is a multi-modal adapter that only optimizes class tokens; MMRL introduces a division of labor with independent representation tokens.
  • vs PromptSRC: PromptSRC uses self-regularization to prevent forgetting; MMRL achieves better results through a decoupled strategy + cosine regularization.
  • vs CoOp/CoCoOp: CoOp completely ignores generalization, leading to a performance drop on novel classes, whereas CoCoOp partially mitigates this via instance-specific prompts.
  • vs ProVP: ProVP utilizes only single-modal visual prompts and lacks cross-modal interaction.

Supplementary Notes

  • Based on the ViT-B/16 CLIP model with a 12-layer encoder, representation tokens are injected starting from the 9th layer.
  • Training uses the SGD optimizer and converges within a small number of iterations in the 16-shot setting.

Rating

  • Novelty: ⭐⭐⭐⭐ The combination of a shared representation space and decoupled inference is creative, though the overall design remains within the prompt/adapter paradigm.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ 15 datasets, 4 evaluation setups, and comprehensive ablation studies across 6 dimensions.
  • Writing Quality: ⭐⭐⭐⭐ Clear mathematical derivations and detailed methodological descriptions.
  • Value: ⭐⭐⭐⭐ Achieves stable SOTA improvements in the highly competitive arena of parameter-efficient CLIP transfer learning.