# SeMoBridge: Semantic Modality Bridge for Efficient Few-Shot Adaptation of CLIP

- **Conference**: ICLR 2026
- **arXiv**: 2509.26036
- **Code**: https://github.com/christti98/semobridge
- **Area**: Few-Shot Learning / Vision-Language Models
- **Keywords**: CLIP adaptation, modality gap, intra-modal misalignment, few-shot classification, pseudo EOS token
## TL;DR
SeMoBridge is proposed as a lightweight semantic modality bridge that maps image embeddings into the text modality, converting unreliable intra-modal (image-to-image) comparisons into reliable inter-modal (text-to-image) comparisons, achieving state-of-the-art few-shot classification performance with minimal training overhead.
## Background & Motivation
CLIP aligns image and text representations into a shared embedding space via contrastive learning, demonstrating strong zero-shot performance. However, intra-modal misalignment arises in few-shot classification:
- CLIP exhibits an inherent modality gap—a systematic separation between image and text embeddings.
- The contrastive training objective focuses solely on cross-modal alignment, leaving the semantic structure within each modality uncalibrated.
- Consequently, query images may be incorrectly placed closer to the few-shot centroids of wrong classes.
Limitations of prior work:

- Methods such as Tip-X and APE operate at the level of logit scores, failing to fully exploit CLIP's inter-modal semantic priors.
- Cross the Gap addresses the issue through per-sample optimization, but incurs prohibitive computational cost.
## Method

### Core Idea
Image embeddings are mapped into the text modality while preserving semantic content, thereby transforming unreliable image-to-image comparisons into reliable image-to-text inter-modal comparisons.
### Key Design 1: Pseudo EOS Token Derivation

CLIP's contrastive objective aligns each image embedding with the projected EOS token of its matching caption, so an image embedding can stand in for a projected text-side EOS token. Exploiting this directional alignment, the image embedding is back-projected via the Moore-Penrose pseudoinverse of the text projection matrix \(\mathbf{W}_{\text{txt}}\), followed by rescaling to match typical EOS-token norms (schematically, with \(s\) denoting the rescaling factor):

\[
\mathbf{t}_{\text{pseudo}} = s \, \mathbf{W}_{\text{txt}}^{+} \mathbf{f}_{\text{img}}
\]

The final bridged embedding re-projects the pseudo EOS token through \(\mathbf{W}_{\text{txt}}\):

\[
\mathbf{f}_{\text{bridge}} = \mathbf{W}_{\text{txt}} \mathbf{t}_{\text{pseudo}} = s \, \mathbf{W}_{\text{txt}} \mathbf{W}_{\text{txt}}^{+} \mathbf{f}_{\text{img}}
\]

Since \(\mathbf{W}_{\text{txt}}\mathbf{W}_{\text{txt}}^+\) approximates the identity matrix, the transformation reduces to a scaling of the original image embedding.
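A minimal PyTorch sketch of this computation, assuming open_clip-style conventions where the text projection is a `[width, embed_dim]` matrix applied to row vectors; the `scale` argument is a placeholder for the paper's norm-matching factor:

```python
import torch

def build_bridge(text_projection: torch.Tensor):
    """Precompute the Moore-Penrose pseudoinverse of CLIP's text projection.

    text_projection: [width, embed_dim] matrix mapping EOS tokens (row
    vectors) into the shared embedding space, as stored by open_clip.
    """
    return text_projection, torch.linalg.pinv(text_projection)  # pinv: [embed_dim, width]

def bridge_image_embedding(f_img: torch.Tensor, W: torch.Tensor,
                           W_pinv: torch.Tensor, scale: float = 1.0):
    """Map image embeddings into the text modality.

    f_img: [batch, embed_dim] L2-normalized image embeddings.
    scale: placeholder rescaling to match typical EOS-token norms
           (an assumption; the paper derives its own factor).
    """
    pseudo_eos = scale * f_img @ W_pinv  # back-project into the pre-projection text space
    f_bridged = pseudo_eos @ W           # re-project; ~ scale * f_img since W_pinv @ W ~ I
    return pseudo_eos, f_bridged
```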
### Key Design 2: Triple Logit Score Fusion

Three logit scores are fused into the final prediction (the fusion weights are tuned on a validation set; a schematic sketch follows the list):

- \(\mathbf{z}_1\): zero-shot prior (query image vs. class text prompts)
- \(\mathbf{z}_2\): bridged query vs. original few-shot images (an inter-modal comparison in place of the unreliable image-to-image one)
- \(\mathbf{z}_3\): original query vs. bridged few-shot images (the inverted direction, for enhanced robustness)
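A schematic sketch of the fusion under simplifying assumptions: `alpha` and `beta` stand in for the validation-tuned weights, and a per-class mean aggregates similarities over the support set (the paper's exact aggregation may differ):

```python
import torch
import torch.nn.functional as F

def fused_logits(q_img, q_bridged, text_feats, fs_img, fs_bridged, fs_labels,
                 num_classes: int, alpha: float = 1.0, beta: float = 1.0,
                 temp: float = 100.0):
    """All inputs are L2-normalized CLIP features; fs_* hold the few-shot support set."""
    z1 = temp * q_img @ text_feats.t()                   # [B, C] zero-shot prior
    sim2 = temp * q_bridged @ fs_img.t()                 # [B, N] bridged query vs. support images
    sim3 = temp * q_img @ fs_bridged.t()                 # [B, N] query vs. bridged support
    one_hot = F.one_hot(fs_labels, num_classes).float()  # [N, C] class membership
    z2 = sim2 @ one_hot / one_hot.sum(0)                 # per-class mean similarity
    z3 = sim3 @ one_hot / one_hot.sum(0)
    return z1 + alpha * z2 + beta * z3
```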
### Key Design 3: Multimodal Supervised Training (SeMoBridge-T)

A learnable Class-Specific Bias (CSB) \(\hat{\boldsymbol{\tau}} \in \mathbb{R}^{C \times d_t}\) is introduced to refine the bridged few-shot embeddings on a per-class basis. The multimodal loss combines five terms (schematically as an unweighted sum; the image- and text-side terms are balanced 1:1, see Key Findings):

\[
\mathcal{L} = \mathcal{L}_{\text{img}} + \mathcal{L}_{\text{txte}} + \mathcal{L}_{\text{txtp}} + \mathcal{L}_{\text{cons}} + \mathcal{L}_{\text{bias}}
\]
- \(\mathcal{L}_{\text{img}}\): alignment between bridged and original image embeddings
- \(\mathcal{L}_{\text{txte}}, \mathcal{L}_{\text{txtp}}\): alignment with class description EOS tokens and projections
- \(\mathcal{L}_{\text{cons}}\): intra-class few-shot consistency
- \(\mathcal{L}_{\text{bias}}\): CSB regularization
Only the bridge parameters are updated during training; CLIP is fully frozen.
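A minimal sketch of one SeMoBridge-T training step under these constraints. The learnable `scale`, the placement of the CSB on the pseudo EOS token, and the concrete loss implementations are all illustrative assumptions; `W`/`W_pinv` come from the earlier bridge sketch, and features are precomputed with the frozen CLIP:

```python
import torch
import torch.nn.functional as F

def consistency_loss(f_bridged, labels):
    """Schematic L_cons: pull same-class bridged embeddings toward their mean."""
    loss = 0.0
    for c in labels.unique():
        group = f_bridged[labels == c]
        loss = loss + (group - group.mean(0)).pow(2).mean()
    return loss / len(labels.unique())

num_classes, d_t = 1000, 512                              # example sizes (ImageNet, ViT-B/16 text width)
scale = torch.nn.Parameter(torch.ones(1))                 # learnable bridge rescaling (assumption)
csb = torch.nn.Parameter(torch.zeros(num_classes, d_t))   # Class-Specific Bias
optimizer = torch.optim.AdamW([scale, csb], lr=1e-3)

for f_img, eos_tok, labels in support_loader:             # frozen-CLIP features of the support set
    pseudo_eos = scale * f_img @ W_pinv + csb[labels]     # bridge + per-class bias (placement assumed)
    f_bridged = F.normalize(pseudo_eos @ W, dim=-1)
    loss = (F.mse_loss(f_bridged, f_img)                  # L_img: stay close to the image embedding
            + F.mse_loss(pseudo_eos, eos_tok)             # L_txte / L_txtp: match class-description EOS tokens (schematic)
            + consistency_loss(f_bridged, labels)         # L_cons: intra-class consistency
            + csb.pow(2).mean())                          # L_bias: schematic L2 regularizer
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```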
## Experiments

### Training Efficiency Comparison
| Method | Parameters | Avg. Training Time | Avg. Accuracy |
|---|---|---|---|
| CoOp | 0.01M | 10h 00min | 63.90% |
| PromptSRC | 0.05M | 1h 42min | 77.90% |
| APE-T | 0.51M | 3min 30s | 77.18% |
| LDC | 69M | 2min | 77.17% |
| SeMoBridge-T | 0.77M | 27s | 78.15% |
SeMoBridge-T attains the highest average accuracy in the table while also training fastest: 27 seconds, versus 2 minutes for LDC and over an hour for PromptSRC.

### Few-Shot Classification Results
- Training-free SeMoBridge outperforms APE on 7 out of 11 datasets.
- SeMoBridge-T achieves the best overall performance in low-shot settings (1/2/4-shot).
- The most significant improvements are observed on datasets with visually similar categories, such as OxfordPets.
### Out-of-Distribution Generalization (16-shot ImageNet)

| Method | ImageNet | ImageNet-V2 | ImageNet-Sketch |
|---|---|---|---|
| APE | 71.81 | 64.81 | 49.95 |
| SeMoBridge | 71.86 | 64.90 | 49.55 |
| APE-T | 74.13 | 66.21 | 49.73 |

SeMoBridge-T is reported as competitive with APE-T across these benchmarks.
## Key Findings

- Intra-modal misalignment is the primary cause of CLIP's unreliable image-to-image comparisons in few-shot settings.
- A simple modality bridge (scaling + projection) is sufficient to effectively address this issue.
- CSB benefits datasets with large numbers of categories such as ImageNet, but is less critical for smaller datasets.
- A fixed 1:1 balance between image and text losses in the multimodal objective is adequate.
## Highlights & Insights
- The method is remarkably concise and elegant—the core computation reduces to a pseudoinverse matrix multiplication.
- Training is extremely fast (27 seconds), roughly 4x quicker than LDC, the next fastest baseline, and orders of magnitude quicker than prompt-tuning methods such as CoOp.
- The training-free variant is already competitive, with the trained variant yielding further gains.
- The theoretical motivation is well-grounded, deriving the bridge directly from CLIP's training objective.
- The advantage is most pronounced in low-shot regimes (1/2/4-shot).
## Limitations & Future Work

- The approximation that \(\mathbf{W}_{\text{txt}}\mathbf{W}_{\text{txt}}^+\) is close to the identity matrix may not hold for certain CLIP variants (a quick check is sketched after this list).
- CSB is applied during training but not to queries at inference, potentially introducing a train-inference distribution mismatch.
- Evaluation is primarily conducted on ViT-B/16; the effectiveness on larger models remains underexplored.
- The weights for triple logit fusion require validation-set tuning.
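The first point can be probed directly. A quick check, assuming open_clip with the text projection stored as a plain `[width, embed_dim]` parameter (the attribute may be a `Linear` module in other implementations):

```python
import torch
import open_clip

model, _, _ = open_clip.create_model_and_transforms("ViT-B-16", pretrained="openai")
W = model.text_projection.data                    # [width, embed_dim]
# For row-vector features, f @ W_pinv @ W applies pinv(W) @ W in embedding space,
# the same operator the paper's column-vector notation writes as W_txt @ W_txt^+.
residual = torch.linalg.pinv(W) @ W - torch.eye(W.shape[1])
print(f"||pinv(W) @ W - I||_F = {residual.norm():.4e}")  # small => identity approximation holds
```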
## Related Work & Insights
- CLIP Adaptation: prompt/adapter-based methods including CoOp, Tip-Adapter, and APE.
- Modality Gap Research: the modality gap in CLIP's embedding space as identified by Liang et al.
- Modality Inversion: per-sample optimization approaches (OTI/OVI) and closed-form projection (SD-IPC).
## Rating
- Novelty: ⭐⭐⭐⭐ — The modality bridge concept and its adaptation from SD-IPC are elegant.
- Simplicity: ⭐⭐⭐⭐⭐ — The method is clean, intuitive, and easy to reproduce.
- Experimental Thoroughness: ⭐⭐⭐⭐ — Covers 11 datasets, training efficiency comparisons, and OOD generalization.
- Value: ⭐⭐⭐⭐⭐ — 27-second training with an extremely low barrier to adoption.