Towards Calibrating Prompt Tuning of Vision-Language Models

- Conference: CVPR 2026
- arXiv: 2602.19024
- Code: https://github.com/ashshaksharifdeen/TCPT
- Area: Multimodal VLM
- Keywords: prompt tuning, calibration, CLIP, confidence estimation, pretrained semantic preservation

TL;DR

To address the "dual miscalibration" problem in prompt-tuned CLIP (underconfidence on base classes and overconfidence on novel classes), this paper proposes two complementary regularizers — mean-variance margin regularization and text moment-matching loss — as plug-and-play modules that consistently reduce ECE across 7 prompt tuning methods and 11 datasets.

Background & Motivation

Background: Prompt tuning is the dominant paradigm for adapting CLIP to downstream tasks, achieving parameter-efficient fine-tuning by learning a small number of prompt tokens. It improves accuracy on base classes while preserving zero-shot generalization to novel classes.

Limitations of Prior Work: Existing prompt tuning methods focus almost exclusively on accuracy, neglecting confidence calibration. A mismatch between predicted confidence and actual accuracy leads to unreliable decision-making, which is particularly harmful in safety-critical applications such as autonomous driving and medical imaging.

Key Challenge: Prompt tuning induces "dual miscalibration" — logit margin shrinkage on base classes causes underconfidence, while margin inflation on novel classes causes overconfidence. Existing post-hoc calibration methods (e.g., temperature scaling in DAC) cannot constrain how prompt tuning reshapes the embedding space, and may introduce embedding collapse or clustering artifacts.

Goal: To simultaneously address base-class underconfidence and novel-class overconfidence at training time without sacrificing accuracy.

Key Insight: Analysis reveals a correlation pattern between margin variability and ECE — negative for base classes and positive for novel classes.

Core Idea: Stabilize the logit distribution by maximizing mean margin and minimizing margin variance, while preserving CLIP's semantic geometry by matching the first- and second-order moments of tuned and frozen text embeddings.

Method

Overall Architecture

Two complementary regularizers are added on top of the standard cross-entropy loss for prompt tuning: \(\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{CE}} + \lambda_{\text{Margin}}\mathcal{L}_{\text{Margin}} + \lambda_{\text{mom}}\mathcal{L}_{\text{mom}}\). No additional inference computation is required.

Key Designs

Mean-Variance Margin Regularization:
- Function: Stabilizes the statistical properties of logit margins within a batch.
- Mechanism: The per-sample margin is defined as \(m_i = z_{i,y_i} - \max_{j\neq y_i} z_{i,j}\), and the loss is \(\mathcal{L}_{\text{Margin}} = -\alpha \cdot \frac{1}{B}\sum_i m_i + \beta \cdot \text{Var}(m_1,...,m_B)\).
- Design Motivation: The mean term (weighted by \(\alpha\)) encourages sufficient separation on base classes to address underconfidence; the variance term (weighted by \(\beta\)) prevents margin inconsistency from inducing novel-class overconfidence. Using only the mean term risks enlarging the margin of incorrect classes when the top-1 prediction is wrong, thereby aggravating overconfidence.
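
The margin regularizer defined above can be sketched in a few lines. This is a minimal illustration in plain Python rather than the authors' implementation: the function name `margin_loss`, the default weights `alpha=1.0` and `beta=1.0`, and the use of the population variance are all assumptions, and a real training loop would compute the same quantities with differentiable tensor operations.

```python
from statistics import fmean, pvariance


def margin_loss(logits, labels, alpha=1.0, beta=1.0):
    """Mean-variance margin regularizer over one batch.

    logits: list of B per-sample logit lists (each of length C).
    labels: list of B ground-truth class indices.
    alpha, beta: illustrative weights (assumed, not taken from the paper).
    """
    margins = []
    for z, y in zip(logits, labels):
        # m_i = z_{i,y_i} - max_{j != y_i} z_{i,j}
        best_other = max(v for j, v in enumerate(z) if j != y)
        margins.append(z[y] - best_other)
    mean_margin = fmean(margins)
    # Population variance (divide by B); the normalization is an assumption.
    var_margin = pvariance(margins)
    return -alpha * mean_margin + beta * var_margin, margins


# Toy batch: the third sample is misclassified, so its margin is negative.
logits = [[2.0, 0.5, 0.1], [0.3, 1.5, 1.0], [0.2, 0.9, 0.4]]
labels = [0, 1, 0]
loss, margins = margin_loss(logits, labels)
```

In this toy batch the misclassified sample lowers the mean margin and raises the variance, so both terms push the loss up.
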

Text Moment-Matching Loss:
- Function: Preserves the global statistical properties of text embeddings after prompt tuning to remain consistent with the frozen CLIP encoder.
- Mechanism: Aligns the first-order (mean) and second-order (covariance) moments of tuned and frozen text embeddings: \(\mathcal{L}_{\text{mom}} = \|\mu_{\tilde{c}} - \mu_{c^0}\|_2^2 + \|\Sigma_{\tilde{c}} - \Sigma_{c^0}\|_F^2\).
- Design Motivation: The margin regularizer operates in logit space and does not directly constrain the embedding geometry. Moment matching preserves semantic centroids and dispersion, preventing embedding drift induced by prompt tuning from disrupting inter-class relationships. Unlike direct \(L_2\) alignment, moment matching constrains only global statistics, retaining flexibility for local task adaptation.
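
The moment-matching loss above can be illustrated with a small plain-Python sketch. It assumes the statistics are taken across the set of class text embeddings (one row per class) and that the covariance is normalized by the number of rows; both choices, and all function names, are assumptions made for illustration rather than details confirmed by the source.

```python
def mean_vector(X):
    """Column-wise mean of a list of equal-length embedding rows."""
    n = len(X)
    return [sum(col) / n for col in zip(*X)]


def covariance(X):
    """d x d covariance matrix of the rows of X (normalized by n, an assumption)."""
    n, mu = len(X), mean_vector(X)
    centered = [[v - m for v, m in zip(row, mu)] for row in X]
    d = len(mu)
    return [[sum(r[a] * r[b] for r in centered) / n for b in range(d)]
            for a in range(d)]


def moment_matching_loss(tuned, frozen):
    """Squared L2 distance of the means plus squared Frobenius distance of the covariances."""
    mu_t, mu_f = mean_vector(tuned), mean_vector(frozen)
    cov_t, cov_f = covariance(tuned), covariance(frozen)
    mean_term = sum((a - b) ** 2 for a, b in zip(mu_t, mu_f))
    cov_term = sum((a - b) ** 2
                   for row_t, row_f in zip(cov_t, cov_f)
                   for a, b in zip(row_t, row_f))
    return mean_term + cov_term


frozen = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]   # toy frozen text embeddings
shifted = [[v + 0.5, w] for v, w in frozen]      # centroid drift along the first axis
permuted = frozen[::-1]                          # same set of embeddings, rows reordered

loss_shifted = moment_matching_loss(shifted, frozen)
loss_permuted = moment_matching_loss(permuted, frozen)
```

Shifting every embedding moves the centroid and is penalized, whereas reordering the rows leaves both moments unchanged and costs nothing. A per-row \(L_2\) alignment would penalize the reordered case, which is the sense in which moment matching constrains only global statistics.
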

Complementarity of the Two Regularizers:
- The margin loss enhances discriminability in logit space but may exacerbate novel-class overconfidence when top-1 predictions are incorrect.
- The moment-matching loss stabilizes geometry in embedding space, counteracting the failure mode introduced by the margin loss.
- Experiments confirm that applying the margin loss alone may increase novel-class ECE, whereas combining it with moment matching consistently improves calibration.

Loss & Training

The total loss is the weighted sum of the three terms described above: the cross-entropy loss plus the margin and moment-matching regularizers, scaled by \(\lambda_{\text{Margin}}\) and \(\lambda_{\text{mom}}\) respectively. Because both regularizers depend only on the logits and the text embeddings, the method is agnostic to the underlying prompt tuning technique and can be used as a drop-in plugin.
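
As a rough sketch of how the three terms combine, the snippet below computes a batch cross-entropy from raw logits and adds the two regularizer values, which are passed in as precomputed numbers. The function names and the example weights are hypothetical; the source does not state the values used for \(\lambda_{\text{Margin}}\) and \(\lambda_{\text{mom}}\).

```python
import math


def cross_entropy(logits, labels):
    """Mean cross-entropy over a batch, using a stable log-sum-exp."""
    total = 0.0
    for z, y in zip(logits, labels):
        m = max(z)
        log_sum_exp = m + math.log(sum(math.exp(v - m) for v in z))
        total += log_sum_exp - z[y]
    return total / len(labels)


def total_loss(ce, margin_reg, moment_reg, lam_margin=1.0, lam_mom=1.0):
    """L_total = L_CE + lam_margin * L_Margin + lam_mom * L_mom."""
    return ce + lam_margin * margin_reg + lam_mom * moment_reg


ce = cross_entropy([[0.0, 0.0]], [0])  # uniform two-class prediction
# Regularizer values and weights below are made-up numbers for illustration.
loss = total_loss(ce, margin_reg=0.5, moment_reg=0.25, lam_margin=0.1, lam_mom=2.0)
```

Setting both weights to zero recovers plain cross-entropy training, which is why the regularizers can be attached to an existing prompt tuning recipe without other changes.
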

Key Experimental Results

Main Results (CoOp, averaged over 11 datasets)

| Method | Base Acc | Base ECE ↓ | Novel Acc | Novel ECE ↓ |
|---|---|---|---|---|
| Zero-Shot CLIP | 69.50 | 3.58 | — | — |
| CoOp | 81.00 | 6.35 | 71.64 | 6.56 |
| + Temp. Scaling | 83.06 | 2.96 | 72.10 | 5.84 |
| + DAC | — | — | — | 5.21 |
| + ZS-Norm | 80.50 | 3.44 | 71.80 | 4.85 |
| + Ours | 81.00 | 2.30 | 71.64 | 3.98 |
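
For readers unfamiliar with the ECE columns, the metric can be sketched as follows. This is the common equal-width binning formulation, not necessarily the exact evaluation protocol of the paper: the default of 15 bins and the half-open bin boundaries are assumptions, and the result is a fraction that would be multiplied by 100 if the table reports percentages.

```python
def expected_calibration_error(confidences, correct, n_bins=15):
    """Expected Calibration Error with equal-width confidence bins.

    confidences: top-1 predicted probabilities in [0, 1].
    correct: 1 if the top-1 prediction was right, else 0.
    n_bins: number of bins (15 is a common convention, assumed here).
    """
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 falls in the last bin
        bins[idx].append((conf, hit))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(h for _, h in members) / len(members)
        # Weight each bin's |accuracy - confidence| gap by its share of samples.
        ece += len(members) / n * abs(accuracy - avg_conf)
    return ece


# Overconfident toy model: 90% confidence but only 75% accuracy.
ece_over = expected_calibration_error([0.9, 0.9, 0.9, 0.9], [1, 1, 1, 0])
```

Underconfidence produces a gap of the same size with the opposite sign inside the absolute value, so ECE alone does not distinguish the base-class and novel-class failure modes; the direction has to be read from the per-bin accuracy and confidence.
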

Generalization Across Prompt Tuning Methods

| Prompt Tuning Method | Base ECE (Original) | Base ECE (Ours) | Novel ECE (Original) | Novel ECE (Ours) |
|---|---|---|---|---|
| CoOp | 6.35 | 2.30 | 6.56 | 3.98 |
| CoCoOp | 5.89 | 2.45 | 5.32 | 3.67 |
| MaPLe | 4.78 | 1.98 | 4.85 | 3.21 |
| KgCoOp | 5.12 | 2.15 | 5.01 | 3.45 |

Key Findings

- The proposed method consistently reduces ECE across all 7 prompt tuning methods and 11 datasets.
- Accuracy is largely unaffected (within ±0.5%), demonstrating that calibration improvements do not come at the cost of discriminative performance.
- The moment-matching loss contributes most to novel-class ECE reduction, validating the importance of preserving embedding geometry for generalization calibration.
- Improvements are particularly pronounced on challenging datasets such as DTD and EuroSAT, with ECE reductions exceeding 5 points.

Highlights & Insights

- Plug-and-Play: As training-time regularizers, the proposed terms introduce no inference overhead and are compatible with any prompt tuning method, making the approach highly practical.
- Analysis-Driven Design: Starting from the correlation analysis between margin variability and ECE, the method identifies separate causes for base-class underconfidence and novel-class overconfidence, and designs a targeted regularizer for each. This "analyze-then-design" paradigm is methodologically instructive.
- Moment Matching vs. Direct Alignment: By constraining only global statistics rather than per-sample embeddings, moment matching balances semantic preservation against task adaptation.

Limitations & Future Work

- The hyperparameters \(\alpha\), \(\beta\), \(\lambda_{\text{Margin}}\), and \(\lambda_{\text{mom}}\) require tuning on a validation set and may need different settings across datasets.
- Validation is limited to classification tasks; applicability to structured output tasks such as detection and segmentation remains unexplored.
- Moment matching assumes that class embedding distributions are approximately Gaussian, which may limit effectiveness for highly asymmetric distributions.
- Domain-shift scenarios (e.g., transferring prompts from natural images to medical images) are not considered.

Comparison with related methods:
- vs. DAC: DAC applies post-hoc temperature scaling to handle novel classes but cannot constrain embedding space deformation during training; the proposed method directly preserves embedding structure at training time.
- vs. ZS-Norm: ZS-Norm matches global statistics of logit distributions; the proposed method performs moment matching in embedding space, which constrains inter-class relationships more directly.
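
To make the contrast with post-hoc methods concrete, the sketch below shows generic temperature scaling on a single logit vector. It is not DAC's specific procedure, and the temperature value is arbitrary; the point is only that dividing logits by a positive temperature changes the confidence while leaving the predicted class and the underlying embeddings untouched.

```python
import math


def softmax(z, temperature=1.0):
    """Softmax of z / temperature, computed stably."""
    scaled = [v / temperature for v in z]
    m = max(scaled)
    exps = [math.exp(v - m) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [3.0, 1.0, 0.5]
p_raw = softmax(logits)                      # original confidence
p_soft = softmax(logits, temperature=2.0)    # temperature > 1 lowers confidence

pred_raw = p_raw.index(max(p_raw))
pred_soft = p_soft.index(max(p_soft))
```

Here `pred_raw` and `pred_soft` agree while `max(p_soft)` is smaller than `max(p_raw)`, so accuracy is unchanged and only the confidence moves.
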

Rating

- Novelty: ⭐⭐⭐⭐. Both the dual miscalibration analysis and the dual-regularizer design are original contributions.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐. Extensive validation across 7 methods × 11 datasets with thorough ablation studies.
- Writing Quality: ⭐⭐⭐⭐. The analytical chain is clear, though the notation is somewhat dense.
- Value: ⭐⭐⭐⭐. Calibration is a critical issue for real-world VLM deployment; the method is highly practical.