SeMoBridge: Semantic Modality Bridge for Efficient Few-Shot Adaptation of CLIP

Conference: ICLR 2026 · arXiv: 2509.26036 · Code: https://github.com/christti98/semobridge
Area: Few-Shot Learning / Vision-Language Models
Keywords: CLIP adaptation, modality gap, intra-modal misalignment, few-shot classification, pseudo EOS token

TL;DR

SeMoBridge is a lightweight semantic modality bridge that maps image embeddings into the text modality, converting unreliable intra-modal (image-to-image) comparisons into reliable inter-modal (image-to-text) comparisons. It achieves state-of-the-art few-shot classification performance with minimal training overhead.

Background & Motivation

CLIP aligns image and text representations into a shared embedding space via contrastive learning, demonstrating strong zero-shot performance. However, intra-modal misalignment arises in few-shot classification:

  • CLIP exhibits an inherent modality gap—a systematic separation between image and text embeddings.
  • The contrastive training objective focuses solely on cross-modal alignment, leaving the semantic structure within each modality uncalibrated.
  • Consequently, query images may be incorrectly placed closer to the few-shot centroids of wrong classes.

Limitations of prior work:

  • Methods such as Tip-X and APE operate at the level of logit scores and fail to fully exploit CLIP's inter-modal semantic priors.
  • Cross the Gap addresses the issue through per-sample optimization, but incurs prohibitive computational cost.

Method

Core Idea

Image embeddings are mapped into the text modality while preserving semantic content, thereby transforming unreliable image-to-image comparisons into reliable image-to-text inter-modal comparisons.

Key Design 1: Pseudo EOS Token Derivation

SeMoBridge exploits the directional alignment encouraged by CLIP's contrastive training objective:

\[\frac{\mathbf{f}_{\text{img}}}{\|\mathbf{f}_{\text{img}}\|} \approx \frac{\hat{\mathbf{f}}_{\text{txt}}}{\|\hat{\mathbf{f}}_{\text{txt}}\|}\]

Back-projection is performed via the Moore-Penrose pseudoinverse of the text projection matrix, followed by rescaling:

\[\hat{\mathbf{f}}_{\text{eos}} \approx \frac{\|\mathbf{T}_{\text{eos}}\|}{\|\mathbf{W}_{\text{txt}}^+ \mathbf{f}_{\text{img}}\|} \mathbf{W}_{\text{txt}}^+ \mathbf{f}_{\text{img}}\]

The final bridged embedding is:

\[\hat{\mathbf{f}}_{\text{txt}} = \mathbf{W}_{\text{txt}} \hat{\mathbf{f}}_{\text{eos}} \approx \frac{\|\mathbf{T}_{\text{eos}}\|}{\|\mathbf{W}_{\text{txt}}^+ \mathbf{f}_{\text{img}}\|} \mathbf{f}_{\text{img}}\]

Since \(\mathbf{W}_{\text{txt}}\mathbf{W}_{\text{txt}}^+\) approximates the identity matrix, the transformation reduces to a scaling of the original image embedding.
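
A minimal PyTorch sketch of this derivation, assuming `W_txt` is CLIP's text projection written as a \((d_e \times d_w)\) matrix (so that \(\mathbf{f}_{\text{txt}} = \mathbf{W}_{\text{txt}}\mathbf{t}_{\text{eos}}\)), `f_img` is a single image embedding, and `t_eos_norm` is a reference EOS-token norm (e.g. the mean norm over class-prompt EOS tokens); names are illustrative rather than taken from the official code:

```python
import torch

def bridge_image_embedding(f_img: torch.Tensor,
                           W_txt: torch.Tensor,
                           t_eos_norm: float) -> torch.Tensor:
    """Training-free bridge: map an image embedding into the text modality."""
    # W_txt maps the EOS-token representation (width d_w) to the joint
    # embedding space (dim d_e), i.e. f_txt = W_txt @ t_eos, W_txt: (d_e, d_w).
    W_pinv = torch.linalg.pinv(W_txt)             # (d_w, d_e) Moore-Penrose pseudoinverse
    t_hat = W_pinv @ f_img                        # back-project to a pseudo EOS token
    t_hat = t_hat * (t_eos_norm / t_hat.norm())   # rescale to a typical EOS-token norm
    return W_txt @ t_hat                          # re-project: the bridged embedding
```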

Key Design 2: Triple Logit Score Fusion

\[\mathbf{z}_q = \lambda_1 \mathbf{z}_1 + \lambda_2 \mathbf{z}_2 + \lambda_3 \mathbf{z}_3\]
  • \(\mathbf{z}_1\): zero-shot prior (query image vs. class text prompts)
  • \(\mathbf{z}_2\): original few-shot vs. bridged query (bridged query compared to few-shot images in text space)
  • \(\mathbf{z}_3\): original query vs. bridged few-shot (inverted signal for enhanced robustness)
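
A sketch of this fusion under simple assumptions: cosine-similarity logits, a single query, and illustrative tensor names (`txt_weights` for the class-prompt embeddings, `fs_img`/`fs_bridged` for the few-shot support embeddings and their bridged versions, `fs_labels` for their class indices); the \(\lambda\) weights and the per-class aggregation are design choices here, not necessarily the paper's exact formulation:

```python
import torch
import torch.nn.functional as F

def fused_logits(q_img, q_bridged, txt_weights, fs_img, fs_bridged, fs_labels,
                 num_classes, lambdas=(1.0, 1.0, 1.0)):
    """Combine the three cosine-similarity logit terms z1, z2, z3."""
    q_img = F.normalize(q_img, dim=-1)
    q_bridged = F.normalize(q_bridged, dim=-1)

    # z1: zero-shot prior -- query image vs. class text prompts
    z1 = q_img @ F.normalize(txt_weights, dim=-1).T              # (C,)

    # z2: bridged query vs. original few-shot image embeddings
    s2 = q_bridged @ F.normalize(fs_img, dim=-1).T               # (N_support,)
    # z3: original query vs. bridged few-shot embeddings
    s3 = q_img @ F.normalize(fs_bridged, dim=-1).T               # (N_support,)

    # Aggregate support similarities per class (a plain sum here; the paper
    # may use a different aggregation or affinity function).
    z2 = torch.zeros(num_classes).index_add_(0, fs_labels, s2)
    z3 = torch.zeros(num_classes).index_add_(0, fs_labels, s3)

    l1, l2, l3 = lambdas
    return l1 * z1 + l2 * z2 + l3 * z3
```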

Key Design 3: Multimodal Supervised Training (SeMoBridge-T)

A Class-Specific Bias (CSB) \(\hat{\boldsymbol{\tau}} \in \mathbb{R}^{C \times d_t}\) is added to the back-projection:

\[\hat{\mathbf{F}}_{\text{eos}}^c \approx \frac{\|\mathbf{T}_{\text{eos}}\|}{\|\hat{\mathbf{W}}_{\text{txt}}^+ \mathbf{F}_{\text{img}}^c + \hat{\boldsymbol{\tau}}^c\|} (\hat{\mathbf{W}}_{\text{txt}}^+ \mathbf{F}_{\text{img}}^c + \hat{\boldsymbol{\tau}}^c)\]

The multimodal loss is:

\[\mathcal{L} = \lambda_{\text{it}} \mathcal{L}_{\text{img}} + (1-\lambda_{\text{it}})\frac{\mathcal{L}_{\text{txte}} + \mathcal{L}_{\text{txtp}}}{2} + \lambda_c \mathcal{L}_{\text{cons}} + \lambda_b \mathcal{L}_{\text{bias}}\]
  • \(\mathcal{L}_{\text{img}}\): alignment between bridged and original image embeddings
  • \(\mathcal{L}_{\text{txte}}, \mathcal{L}_{\text{txtp}}\): alignment with class description EOS tokens and projections
  • \(\mathcal{L}_{\text{cons}}\): intra-class few-shot consistency
  • \(\mathcal{L}_{\text{bias}}\): CSB regularization

Only the bridge parameters are updated during training; CLIP is fully frozen.
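
A hypothetical skeleton of the trained variant, illustrating the frozen-CLIP setup: a learnable back-projection initialised from \(\mathbf{W}_{\text{txt}}^+\) and the class-specific bias are the only trainable parameters, while the individual loss terms are left as placeholders for the \(\mathcal{L}\) components above (names and shapes are assumptions, not the released implementation):

```python
import torch

class SeMoBridgeT(torch.nn.Module):
    """Illustrative trainable bridge: a learnable copy of W_txt^+ plus a CSB."""
    def __init__(self, W_txt_pinv: torch.Tensor, num_classes: int, t_eos_norm: float):
        super().__init__()
        self.W_hat = torch.nn.Parameter(W_txt_pinv.clone())     # init from pinv(W_txt), (d_w, d_e)
        self.tau = torch.nn.Parameter(                           # class-specific bias, (C, d_w)
            torch.zeros(num_classes, W_txt_pinv.shape[0]))
        self.t_eos_norm = t_eos_norm

    def forward(self, f_img: torch.Tensor, class_ids: torch.Tensor) -> torch.Tensor:
        # Back-project few-shot image embeddings and add the per-class bias.
        t_hat = f_img @ self.W_hat.T + self.tau[class_ids]       # (B, d_w) pseudo EOS tokens
        return t_hat * (self.t_eos_norm / t_hat.norm(dim=-1, keepdim=True))

# CLIP's encoders stay frozen; only the bridge parameters are optimised, e.g.:
# bridge = SeMoBridgeT(torch.linalg.pinv(W_txt), num_classes, t_eos_norm)
# optimizer = torch.optim.AdamW(bridge.parameters(), lr=1e-3)
# loss = lam_it * L_img + (1 - lam_it) * (L_txte + L_txtp) / 2 \
#        + lam_c * L_cons + lam_b * L_bias
```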

Experiments

Training Efficiency Comparison

| Method | Parameters | Avg. Training Time | Avg. Accuracy |
|---|---|---|---|
| CoOp | 0.01M | 10h 00min | 63.90% |
| PromptSRC | 0.05M | 1h 42min | 77.90% |
| APE-T | 0.51M | 3min 30s | 77.18% |
| LDC | 69M | 2min | 77.17% |
| SeMoBridge-T | 0.77M | 27s | 78.15% |

SeMoBridge-T achieves the highest accuracy with only 27 seconds of training.

Few-Shot Classification Results

  • Training-free SeMoBridge outperforms APE on 7 out of 11 datasets.
  • SeMoBridge-T achieves the best overall performance in low-shot settings (1/2/4-shot).
  • The most significant improvements are observed on datasets with visually similar categories, such as OxfordPets.

Out-of-Distribution Generalization (16-shot ImageNet)

| Method | ImageNet | ImageNet-V2 | ImageNet-Sketch |
|---|---|---|---|
| APE | 71.81 | 64.81 | 49.95 |
| SeMoBridge | 71.86 | 64.90 | 49.55 |
| APE-T | 74.13 | 66.21 | 49.73 |

SeMoBridge-T is reported as competitive with APE-T.

Key Findings

  • Intra-modal misalignment is the primary cause of CLIP's failure in few-shot settings.
  • A simple modality bridge (scaling + projection) is sufficient to effectively address this issue.
  • CSB benefits datasets with large numbers of categories such as ImageNet, but is less critical for smaller datasets.
  • A fixed 1:1 balance between image and text losses in the multimodal objective is adequate.

Highlights & Insights

  • The method is remarkably concise and elegant—the core computation reduces to a pseudoinverse matrix multiplication.
  • Training time is extremely short (27 seconds), several times faster than the next fastest method in the comparison (LDC, 2 minutes) and orders of magnitude faster than prompt-tuning methods.
  • The training-free variant is already competitive, with the trained variant yielding further gains.
  • The theoretical motivation is well-grounded, deriving the bridge directly from CLIP's training objective.
  • The advantage is most pronounced in low-shot regimes (1/2/4-shot).

Limitations & Future Work

  • The approximation that \(\mathbf{W}_{\text{txt}}\mathbf{W}_{\text{txt}}^+\) is close to the identity matrix may not hold for certain CLIP variants.
  • CSB is applied during training but not to queries at inference, potentially introducing a train-inference distribution mismatch.
  • Evaluation is primarily conducted on ViT-B/16; the effectiveness on larger models remains underexplored.
  • The weights for triple logit fusion require validation-set tuning.

Related Work

  • CLIP Adaptation: prompt/adapter-based methods including CoOp, Tip-Adapter, and APE.
  • Modality Gap Research: the modality gap in CLIP's embedding space as identified by Liang et al.
  • Modality Inversion: per-sample optimization approaches (OTI/OVI) and closed-form projection (SD-IPC).

Rating

  • Novelty: ⭐⭐⭐⭐ — The modality bridge concept and its adaptation from SD-IPC are elegant.
  • Simplicity: ⭐⭐⭐⭐⭐ — The method is clean, intuitive, and easy to reproduce.
  • Experimental Thoroughness: ⭐⭐⭐⭐ — Covers 11 datasets, training efficiency comparisons, and OOD generalization.
  • Value: ⭐⭐⭐⭐⭐ — 27-second training with an extremely low barrier to adoption.