VaMP: Variational Multi-Modal Prompt Learning for Vision-Language Models
Conference: NeurIPS 2025 | arXiv: 2511.22664 | Code: GitHub | Area: Multimodal VLM | Keywords: Prompt Learning, Variational Inference, CLIP, Few-Shot Learning, Domain Generalization
TL;DR
This paper proposes VaMP, a variational multi-modal prompt learning framework that models text-side prompts as latent variables and performs instance-level uncertainty modeling via variational inference. Combined with a class-aware prior for regularizing the latent space, VaMP significantly improves CLIP's downstream adaptation under few-shot and domain generalization settings.
Background & Motivation
Vision-language models (e.g., CLIP) perform well zero-shot, but adapting them to downstream tasks from only a few labeled examples remains challenging. Prompt learning has been widely adopted as a parameter-efficient adaptation strategy; however, existing methods (e.g., CoOp, MaPLe, MMRL) typically employ fixed, deterministic prompts shared across all samples, giving rise to three core limitations:
- Lack of instance-level adaptability: fixed prompts cannot be tailored to individual inputs, limiting generalization under distribution shift.
- Lack of uncertainty modeling: deterministic parameters cannot capture model uncertainty in few-shot settings.
- Overly simplistic priors: existing variational approaches (e.g., Bayesian Prompt Learning, Any-Shift Prompting) only model the text branch, use standard Gaussian priors, and operate at a global level, precluding fine-grained token-level semantic variation.
The paper frames prompt tuning as a variational inference problem, introducing token-level variational modeling within a multi-modal (text + vision) framework and employing a class-aware prior to provide semantically structured regularization.
Method
Overall Architecture
VaMP consists of three core components: (1) image-conditioned sample-specific prompt generation; (2) a variational inference mechanism that treats text-side prompts as latent variables; and (3) a class-aware prior constructed from class prototypes. Prompts are injected into multiple Transformer layers of both the CLIP text and vision encoders.
Key Designs
- Sample-Specific Prompt Generation
Unlike MaPLe/MMRL, which use fixed shared prompts, VaMP dynamically generates text-side prompts conditioned on the input image. Given input image \(x\), a global representation \(f_x\) is first extracted via the frozen CLIP image encoder. A set of \(H\) layer-specific MLP generators \(\{\Phi_i\}\) then produces prompt tokens for each Transformer layer:
\(z_i = \Phi_i(f_x) \in \mathbb{R}^{M \times d}, \quad i = J, \dots, J+H-1\)
On the vision side, shared learnable prompt tokens \(\tilde{z}_i\) are used and remain invariant across samples. This asymmetric design enables instance-level adaptation on the text side while maintaining stable shared representations on the vision side.
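To make this design concrete, here is a minimal PyTorch sketch of the layer-specific generators; the class name, hidden sizes, and the defaults (`feat_dim`, `prompt_len`, `num_layers`) are illustrative assumptions, not the authors' implementation:

```python
import torch.nn as nn

class SampleSpecificPromptGenerator(nn.Module):
    """Per-layer MLP generators Phi_i that map a frozen CLIP image
    feature f_x to text-side prompt tokens (illustrative sketch)."""

    def __init__(self, feat_dim=512, prompt_len=4, prompt_dim=512, num_layers=3):
        super().__init__()
        self.prompt_len, self.prompt_dim = prompt_len, prompt_dim
        # One generator Phi_i per injected Transformer layer.
        self.generators = nn.ModuleList([
            nn.Sequential(
                nn.Linear(feat_dim, feat_dim),
                nn.ReLU(),
                nn.Linear(feat_dim, prompt_len * prompt_dim),
            )
            for _ in range(num_layers)
        ])

    def forward(self, f_x):
        # f_x: (B, feat_dim) -> list of H prompt tensors, each (B, M, d)
        return [g(f_x).view(-1, self.prompt_len, self.prompt_dim)
                for g in self.generators]
```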
- Variational Multi-Modal Prompt Adaptation
The central innovation is upgrading text-side prompts \(z_i\) from deterministic parameters to latent variables. For each layer, an MLP \(\phi_i\) predicts the posterior distribution parameters:
\([\mu_i, \sigma_i] = \phi_i(f_x), \quad q_\phi(z_i|x) = \mathcal{N}(\mu_i, \text{diag}(\sigma_i^2))\)
The training objective maximizes the variational evidence lower bound (ELBO):
\(\mathcal{L}_{\text{ELBO}} = \mathbb{E}_{q_\phi(z|x)}[\log p(y|x,t,z)] - \text{KL}(q_\phi(z|x) \| p(z))\)
End-to-end training is enabled via the reparameterization trick \(z_i = \mu_i + \sigma_i \odot \epsilon_i\). At inference time, Monte Carlo sampling with \(S=10\) draws followed by prediction averaging yields an uncertainty-aware ensemble.
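A hedged sketch of one layer's posterior head and the Monte Carlo inference step follows. Predicting \(\log\sigma^2\) for numerical stability and the `mc_predict` helper (with `model` as a hypothetical end-to-end classifier returning logits) are my assumptions:

```python
import torch
import torch.nn as nn

class VariationalPromptHead(nn.Module):
    """Predicts a diagonal Gaussian posterior q(z_i|x) over one layer's
    prompt tokens and samples with the reparameterization trick."""

    def __init__(self, feat_dim=512, prompt_len=4, prompt_dim=512):
        super().__init__()
        self.mu = nn.Linear(feat_dim, prompt_len * prompt_dim)
        # Predicting log(sigma^2) rather than sigma is an assumed
        # numerical-stability choice, not stated in the paper.
        self.log_var = nn.Linear(feat_dim, prompt_len * prompt_dim)
        self.shape = (prompt_len, prompt_dim)

    def forward(self, f_x):
        mu, log_var = self.mu(f_x), self.log_var(f_x)
        eps = torch.randn_like(mu)
        z = mu + torch.exp(0.5 * log_var) * eps  # z = mu + sigma ⊙ eps
        return z.view(-1, *self.shape), mu, log_var

@torch.no_grad()
def mc_predict(model, image, S=10):
    """Average class probabilities over S Monte Carlo prompt samples."""
    probs = torch.stack([model(image).softmax(dim=-1) for _ in range(S)])
    return probs.mean(dim=0)  # uncertainty-aware ensemble prediction
```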
- Class-Aware Prior
The standard Gaussian prior \(\mathcal{N}(0, I)\) lacks semantic structure. VaMP constructs a class-aware prior to provide global semantic anchoring. During training, per-class prototypes \(o_y\) are computed as the mean of posterior means for samples in each class, then mapped to prior distribution parameters via layer-specific prior networks \(\psi_i\):
\([\hat{\mu}_i, \hat{\sigma}_i] = \psi_i(o_y), \quad p_\psi(z_i|o_y) = \mathcal{N}(\hat{\mu}_i, \text{diag}(\hat{\sigma}_i^2))\)
The ELBO is summed across layers, encouraging prompt distributions of same-class samples to cluster in the latent space and improving intra-class consistency. At test time, where labels are unavailable, the prior degenerates to a standard Gaussian.
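A sketch of the prototype computation, the prior network \(\psi_i\), and the closed-form KL between diagonal Gaussians that the layer-wise ELBO requires; the single-linear prior network and all shapes are assumptions made for brevity:

```python
import torch
import torch.nn as nn

def class_prototypes(posterior_means, labels, num_classes):
    """o_y: per-class average of posterior means mu over the (few-shot)
    training set; assumes every class appears at least once."""
    return torch.stack([posterior_means[labels == y].mean(dim=0)
                        for y in range(num_classes)])

class ClassAwarePrior(nn.Module):
    """Prior network psi_i mapping a prototype o_y to (mu_hat, log_var_hat)."""

    def __init__(self, dim):
        super().__init__()
        self.net = nn.Linear(dim, 2 * dim)

    def forward(self, o_y):
        return self.net(o_y).chunk(2, dim=-1)  # mu_hat, log_var_hat

def kl_diag_gaussians(mu_q, log_var_q, mu_p, log_var_p):
    """Closed-form KL( N(mu_q, diag sig_q^2) || N(mu_p, diag sig_p^2) )."""
    kl = 0.5 * (log_var_p - log_var_q
                + (log_var_q.exp() + (mu_q - mu_p) ** 2) / log_var_p.exp()
                - 1.0)
    return kl.sum(dim=-1)
```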
Loss & Training
The variational ELBO serves as the training objective, comprising a reconstruction term (classification cross-entropy) and a KL divergence regularization term. All experiments use the 16-shot setting with ViT-B/16 CLIP, trained on a single V100 GPU. Class prototypes are computed offline with the frozen encoder prior to training.
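Putting the terms together, a per-batch loss sketch under the definitions above (it reuses `kl_diag_gaussians` from the previous sketch; the KL weight `beta` is my assumption, not a reported hyperparameter):

```python
import torch.nn.functional as F

def vamp_loss(logits, labels, posteriors, priors, beta=1.0):
    """Negative ELBO: cross-entropy plus KL, summed over the H injected
    layers; posteriors/priors are per-layer (mu, log_var) pairs."""
    ce = F.cross_entropy(logits, labels)  # reconstruction term
    kl = sum(kl_diag_gaussians(mu_q, lv_q, mu_p, lv_p).mean()
             for (mu_q, lv_q), (mu_p, lv_p) in zip(posteriors, priors))
    return ce + beta * kl
```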
Key Experimental Results
Main Results: Base-to-Novel Generalization (Average over 11 Datasets)
| Method | Base | Novel | H (Harmonic Mean) |
|---|---|---|---|
| CLIP | 69.34 | 74.22 | 71.70 |
| MaPLe | 82.28 | 75.14 | 78.55 |
| MMRL | 85.68 | 77.16 | 81.20 |
| VaMP | 86.45 | 78.67 | 82.37 |
VaMP surpasses MMRL on Novel classes by 1.51% and improves the harmonic mean by 1.17%.
Domain Generalization (ImageNet → 4 Variants)
| Method | ImageNet | -V2 | -Sketch | -A | -R |
|---|---|---|---|---|---|
| MMRL | 72.03 | 64.47 | 49.17 | 51.20 | 77.53 |
| VaMP | 72.83 | 64.96 | 49.69 | 51.97 | 78.01 |
Ablation Study
| Configuration | Base | Novel | H | Note |
|---|---|---|---|---|
| Task-specific prompts (MMRL baseline) | 85.68 | 77.16 | 81.20 | Fixed shared prompts |
| + Sample-specific prompts | 85.93 | 78.13 | 81.84 | Novel +0.97% |
| + Variational modeling | 86.11 | 78.45 | 82.10 | Novel +0.32% |
| + Class-aware prior | 86.45 | 78.67 | 82.37 | Novel +0.22% |
Key Findings
- All three components (sample-specificity, variational modeling, class-aware prior) contribute independently and complementarily.
- On fine-grained texture datasets such as DTD, VaMP leads MMRL by 2.67% on Novel accuracy.
- In cross-dataset generalization (average over 10 target datasets), VaMP achieves 67.74%, surpassing MMRL by 0.49%.
- Latent space visualizations reveal that deeper layers yield more compact posterior distributions, reflecting hierarchical uncertainty refinement.
Highlights & Insights
- Rather than stopping at simple global variational modeling, VaMP carries the combination of prompt learning and variational inference through to fine-grained, token-level, multi-layer modeling.
- The class-aware prior substantially outperforms the standard Gaussian prior, providing structured semantic guidance in the latent space.
- Monte Carlo sampling at inference time yields uncertainty-aware ensembling, improving robustness.
- The framework is general and can be adapted to both MaPLe and MMRL as multi-modal prompt baselines.
Limitations & Future Work
- Inference requires 10 Monte Carlo samples, adding computational overhead; using the posterior mean in a single forward pass is a faster approximation.
- The class-aware prior degenerates to a standard Gaussian at test time, leaving test-time class information underutilized.
- Experiments are conducted only on ViT-B/16; performance on larger backbones (e.g., ViT-L/14) remains unknown.
- Variational modeling is restricted to the text side; the vision side still employs deterministic prompts.
Related Work & Insights
- Bayesian Prompt Learning and Any-Shift Prompting are pioneering works in this direction, but are limited to unimodal text modeling and global-level variational inference.
- MMRL's multi-modal shared representation paradigm provides the foundation for VaMP's multi-layer prompt injection.
- Classical applications of variational inference in few-shot learning (e.g., VAE-based meta-learning) provide the theoretical grounding for this work.
- Insight: the variational framework can be extended to other PETL methods such as visual prompts and adapters.
Rating
- Novelty: ⭐⭐⭐⭐ Token-level multi-layer variational prompt modeling combined with a class-aware prior constitutes a meaningful and systematic contribution.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Three generalization settings, 11 datasets, complete ablations, and visualization analyses.
- Writing Quality: ⭐⭐⭐⭐ Clear structure with rigorous mathematical derivations.
- Value: ⭐⭐⭐⭐ Introduces a structured probabilistic modeling paradigm for prompt learning.