VaMP: Variational Multi-Modal Prompt Learning for Vision-Language Models

Conference: NeurIPS 2025 arXiv: 2511.22664 Code: GitHub Area: Multimodal VLM Keywords: Prompt Learning, Variational Inference, CLIP, Few-Shot Learning, Domain Generalization

TL;DR

This paper proposes VaMP, a variational multi-modal prompt learning framework that models text-side prompts as latent variables and performs instance-level uncertainty modeling via variational inference. Combined with a class-aware prior for regularizing the latent space, VaMP significantly improves CLIP's downstream adaptation under few-shot and domain generalization settings.

Background & Motivation

Vision-language models (e.g., CLIP) perform well zero-shot, but adapting them to downstream tasks from only a few labeled samples remains challenging. Prompt learning has been widely adopted as a parameter-efficient adaptation strategy; however, existing methods (e.g., CoOp, MaPLe, MMRL) typically employ fixed, deterministic prompts shared across all samples, giving rise to three core limitations:

Lack of instance-level adaptability: Fixed prompts cannot be personalized for different inputs, limiting generalization under distribution shift.

Lack of uncertainty modeling: Deterministic parameters fail to capture model uncertainty in few-shot settings.

Overly simplistic priors: Existing variational approaches (e.g., Bayesian Prompt Learning, Any-Shift Prompting) only model the text branch, use standard Gaussian priors, and operate at a global level, precluding fine-grained token-level semantic variation.

The paper frames prompt tuning as a variational inference problem, introducing token-level variational modeling within a multi-modal (text + vision) framework and employing a class-aware prior to provide semantically structured regularization.

Method

Overall Architecture

VaMP consists of three core components: (1) image-conditioned sample-specific prompt generation; (2) a variational inference mechanism that treats text-side prompts as latent variables; and (3) a class-aware prior constructed from class prototypes. Prompts are injected into multiple Transformer layers of both the CLIP text and vision encoders.

Key Designs

  1. Sample-Specific Prompt Generation

Unlike MaPLe/MMRL, which use fixed shared prompts, VaMP dynamically generates text-side prompts conditioned on the input image. Given input image \(x\), a global representation \(f_x\) is first extracted via the frozen CLIP image encoder. A set of \(H\) layer-specific MLP generators \(\{\Phi_i\}\) then produces prompt tokens for each Transformer layer:

\(z_i = \Phi_i(f_x) \in \mathbb{R}^{M \times d}, \quad i = J, \dots, J+H-1\)

On the vision side, shared learnable prompt tokens \(\tilde{z}_i\) are used and remain invariant across samples. This asymmetric design enables instance-level adaptation on the text side while maintaining stable shared representations on the vision side.
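A minimal NumPy sketch of this asymmetric design, with tiny random-weight MLPs standing in for the generators \(\Phi_i\); all sizes and the 2-layer MLP shape are illustrative, not from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
d, M, H = 64, 4, 3        # embed dim, prompt length, number of prompted layers (illustrative)

def make_mlp(in_dim, out_dim, hidden=128):
    """A tiny 2-layer ReLU MLP with fixed random weights (stand-in for Phi_i)."""
    W1 = rng.standard_normal((in_dim, hidden)) * 0.02
    W2 = rng.standard_normal((hidden, out_dim)) * 0.02
    return lambda x: np.maximum(x @ W1, 0.0) @ W2

# Text side: one generator Phi_i per prompted layer, conditioned on the image.
generators = [make_mlp(d, M * d) for _ in range(H)]

# Vision side: shared learnable prompts, identical for every sample.
vision_prompts = [rng.standard_normal((M, d)) * 0.02 for _ in range(H)]

f_x = rng.standard_normal(d)                           # frozen-CLIP global image feature
text_prompts = [Phi(f_x).reshape(M, d) for Phi in generators]
```

Each text-side prompt set thus changes with the input image, while the vision-side prompts stay fixed across samples.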

  2. Variational Multi-Modal Prompt Adaptation

The central innovation is upgrading text-side prompts \(z_i\) from deterministic parameters to latent variables. For each layer, an MLP \(\phi_i\) predicts the posterior distribution parameters:

\([\mu_i, \sigma_i] = \phi_i(f_x), \quad q_\phi(z_i|x) = \mathcal{N}(\mu_i, \text{diag}(\sigma_i^2))\)

The training objective maximizes the variational evidence lower bound (ELBO):

\(\mathcal{L}_{\text{ELBO}} = \mathbb{E}_{q_\phi(z|x)}[\log p(y|x,t,z)] - \text{KL}(q_\phi(z|x) \| p(z))\)

End-to-end training is enabled via the reparameterization trick \(z_i = \mu_i + \sigma_i \odot \epsilon_i\). At inference time, Monte Carlo sampling with \(S=10\) draws followed by prediction averaging yields an uncertainty-aware ensemble.
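The inference-time procedure can be sketched as follows, with a random linear head standing in for the frozen CLIP forward pass (the head and all dimensions are hypothetical; only the reparameterization and the \(S=10\) averaging follow the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
d, M, S, C = 64, 4, 10, 10      # embed dim, prompt length, MC draws (S=10 per the paper), classes

# Posterior parameters [mu, sigma] as predicted by phi_i (values here are illustrative).
mu = rng.standard_normal((M, d)) * 0.1
sigma = np.exp(rng.standard_normal((M, d)) * 0.01)

# Stand-in linear head replacing the frozen CLIP forward pass (hypothetical).
W = rng.standard_normal((d, C)) * 0.02

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Reparameterize z = mu + sigma * eps and average S predictions.
probs = np.zeros(C)
for _ in range(S):
    eps = rng.standard_normal((M, d))       # eps ~ N(0, I)
    z = mu + sigma * eps                    # reparameterization trick
    probs += softmax(z.mean(axis=0) @ W)
probs /= S                                  # uncertainty-aware ensemble
```

Averaging the S softmax outputs rather than the logits keeps the result a valid probability distribution.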

  3. Class-Aware Prior

The standard Gaussian prior \(\mathcal{N}(0, I)\) lacks semantic structure. VaMP constructs a class-aware prior to provide global semantic anchoring. During training, per-class prototypes \(o_y\) are computed as the mean of posterior means for samples in each class, then mapped to prior distribution parameters via layer-specific prior networks \(\psi_i\):

\([\hat{\mu}_i, \hat{\sigma}_i] = \psi_i(o_y), \quad p_\psi(z_i|o_y) = \mathcal{N}(\hat{\mu}_i, \text{diag}(\hat{\sigma}_i^2))\)

The ELBO is summed across layers, encouraging prompt distributions of same-class samples to cluster in the latent space and improving intra-class consistency. At test time, where labels are unavailable, the prior degenerates to a standard Gaussian.
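A NumPy sketch of the prototype construction and the resulting KL term; the one-layer prior map is a hypothetical stand-in for \(\psi_i\), while the diagonal-Gaussian KL is the standard closed form:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_per_class, n_classes = 8, 5, 3          # illustrative sizes

# Class prototype o_y: mean of the posterior means mu over training samples of class y.
posterior_mus = {y: rng.standard_normal((n_per_class, d)) for y in range(n_classes)}
prototypes = {y: mus.mean(axis=0) for y, mus in posterior_mus.items()}

# Stand-in for the prior network psi_i (hypothetical 1-layer map).
W_prior = rng.standard_normal((d, 2 * d)) * 0.1

def prior_net(o_y):
    out = o_y @ W_prior
    return out[:d], np.exp(out[d:])          # (mu_hat, sigma_hat), sigma kept positive via exp

def kl_diag_gauss(mu_q, sig_q, mu_p, sig_p):
    """Closed-form KL(N(mu_q, diag sig_q^2) || N(mu_p, diag sig_p^2))."""
    return 0.5 * np.sum(
        (sig_q / sig_p) ** 2
        + ((mu_p - mu_q) / sig_p) ** 2
        - 1.0
        + 2.0 * (np.log(sig_p) - np.log(sig_q))
    )

# KL between one sample's posterior and its class-aware prior.
mu_q, sig_q = posterior_mus[0][0], np.ones(d)
mu_p, sig_p = prior_net(prototypes[0])
kl = kl_diag_gauss(mu_q, sig_q, mu_p, sig_p)
```

Because all same-class samples share the prior \((\hat{\mu}_i, \hat{\sigma}_i)\), minimizing this KL pulls their posteriors toward a common region of the latent space.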

Loss & Training

The variational ELBO serves as the training objective, comprising a reconstruction term (classification cross-entropy) and a KL divergence regularization term. All experiments use the 16-shot setting with ViT-B/16 CLIP, trained on a single V100 GPU. Class prototypes are computed offline with the frozen encoder prior to training.
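For concreteness, the per-sample negative ELBO can be sketched as cross-entropy plus a closed-form KL term; this minimal version uses the standard-Gaussian prior \(\mathcal{N}(0, I)\) (as at test time), whereas training would substitute the class-aware prior parameters:

```python
import numpy as np

def elbo_loss(logits, label, mu, sigma):
    """Negative ELBO for one sample: classification cross-entropy (the
    reconstruction term) plus KL(q_phi(z|x) || N(0, I)) in closed form."""
    e = np.exp(logits - logits.max())
    ce = -np.log(e[label] / e.sum())
    kl = 0.5 * np.sum(sigma ** 2 + mu ** 2 - 1.0 - 2.0 * np.log(sigma))
    return ce + kl

# With mu = 0 and sigma = 1 the KL term vanishes, leaving pure cross-entropy.
loss = elbo_loss(np.array([2.0, 0.5, -1.0]), 0, np.zeros(4), np.ones(4))
```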

Key Experimental Results

Main Results: Base-to-Novel Generalization (Average over 11 Datasets)

| Method | Base  | Novel | H (Harmonic Mean) |
|--------|-------|-------|-------------------|
| CLIP   | 69.34 | 74.22 | 71.70             |
| MaPLe  | 82.28 | 75.14 | 78.55             |
| MMRL   | 85.68 | 77.16 | 81.20             |
| VaMP   | 86.45 | 78.67 | 82.37             |

VaMP surpasses MMRL on Novel classes by 1.51% and improves the harmonic mean by 1.17%.

Domain Generalization (ImageNet → 4 Variants)

| Method | ImageNet | -V2   | -Sketch | -A    | -R    |
|--------|----------|-------|---------|-------|-------|
| MMRL   | 72.03    | 64.47 | 49.17   | 51.20 | 77.53 |
| VaMP   | 72.83    | 64.96 | 49.69   | 51.97 | 78.01 |

Ablation Study

| Configuration                          | Base  | Novel | H     | Note                 |
|----------------------------------------|-------|-------|-------|----------------------|
| Task-specific prompts (MMRL baseline)  | 85.68 | 77.16 | 81.20 | Fixed shared prompts |
| + Sample-specific prompts              | 85.93 | 78.13 | 81.84 | Novel +0.97%         |
| + Variational modeling                 | 86.11 | 78.45 | 82.10 | Novel +0.32%         |
| + Class-aware prior                    | 86.45 | 78.67 | 82.37 | Novel +0.22%         |

Key Findings

  • All three components (sample-specificity, variational modeling, class-aware prior) contribute independently and complementarily.
  • On fine-grained texture datasets such as DTD, VaMP leads MMRL by 2.67% on Novel accuracy.
  • In cross-dataset generalization (average over 10 target datasets), VaMP achieves 67.74%, surpassing MMRL by 0.49%.
  • Latent space visualizations reveal that deeper layers yield more compact posterior distributions, reflecting hierarchical uncertainty refinement.

Highlights & Insights

  • The paper pushes the integration of prompt learning and variational inference further than prior work: instead of a single global variational model, it performs fine-grained, token-level variational modeling across multiple layers.
  • The class-aware prior substantially outperforms the standard Gaussian prior, providing structured semantic guidance in the latent space.
  • Monte Carlo sampling at inference time yields uncertainty-aware ensembling, improving robustness.
  • The framework is general and can be adapted to both MaPLe and MMRL as multi-modal prompt baselines.

Limitations & Future Work

  • Inference requires 10 Monte Carlo samples, adding computational overhead; predicting from the posterior mean offers a faster approximation.
  • The class-aware prior degenerates to a standard Gaussian at test time, leaving test-time class information underutilized.
  • Experiments are conducted only on ViT-B/16; performance on larger backbones (e.g., ViT-L/14) remains unknown.
  • Variational modeling is restricted to the text side; the vision side still employs deterministic prompts.
Related Work & Context

  • Bayesian Prompt Learning and Any-Shift Prompting are pioneering works in this direction, but are limited to unimodal text modeling and global-level variational inference.
  • MMRL's multi-modal shared representation paradigm provides the foundation for VaMP's multi-layer prompt injection.
  • Classical applications of variational inference in few-shot learning (e.g., VAE-based meta-learning) provide the theoretical grounding for this work.
  • Insight: the variational framework can be extended to other PETL methods such as visual prompts and adapters.

Rating

  • Novelty: ⭐⭐⭐⭐ Token-level multi-layer variational prompt modeling combined with a class-aware prior constitutes a meaningful and systematic contribution.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ Three generalization settings, 11 datasets, complete ablations, and visualization analyses.
  • Writing Quality: ⭐⭐⭐⭐ Clear structure with rigorous mathematical derivations.
  • Value: ⭐⭐⭐⭐ Introduces a structured probabilistic modeling paradigm for prompt learning.