
Zero-Shot Robustness of Vision Language Models Via Confidence-Aware Weighting

Conference: NeurIPS 2025 | arXiv: 2510.02913 | Code: Not available | Area: Multimodal VLM | Keywords: CLIP, adversarial robustness, zero-shot, adversarial fine-tuning, confidence-aware weighting

TL;DR

This paper proposes CAW (Confidence-Aware Weighting), an adversarial fine-tuning loss for CLIP that focuses training on hard adversarial examples via confidence-aware weighting, combined with a feature alignment regularizer that preserves pre-trained semantic knowledge. CAW achieves state-of-the-art zero-shot robustness under AutoAttack with lower memory overhead than prior adversarial fine-tuning methods.

Background & Motivation

Vision-language models such as CLIP exhibit strong zero-shot generalization, yet remain highly vulnerable to adversarial attacks—small, imperceptible perturbations can cause severe prediction errors. Adversarial training is the most effective approach for improving robustness, but applying it directly to large-scale pre-trained models like CLIP poses two key challenges: (1) overfitting and forgetting of pre-trained knowledge; and (2) difficulty in simultaneously maintaining clean accuracy and adversarial robustness.

Limitations of prior work:

  • TeCoA is the first to study zero-shot robustness of large-scale VLMs, introducing a contrastive adversarial loss with text supervision, but fails to improve clean and robust accuracy simultaneously.
  • PMG-AFT extends TeCoA with additional loss terms to enhance robustness, but incurs substantial memory overhead.
  • TGA-ZSR leverages semantic text supervision to improve robustness and interpretability, but its robust accuracy under strong attacks remains suboptimal.

Core observation: not all adversarial examples are equally important. The model is already highly confident on some samples while highly uncertain on others, so focusing training on the "hard" adversarial examples (i.e., those on which the model is least confident) yields robustness gains more efficiently.

Method

Overall Architecture

CAW pairs a frozen copy of the original CLIP model with a fine-tunable target CLIP model (image encoder only). The target model is optimized with two novel loss terms: (1) a confidence-aware loss that emphasizes hard samples; and (2) a feature alignment regularization that preserves pre-trained knowledge. Training follows the standard adversarial training framework: the inner loop generates adversarial examples via PGD (maximizing the cross-entropy loss), and the outer loop updates the image encoder's parameters with the new loss function.
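
As a rough illustration of this two-model setup (the identifiers are ours, not the paper's; no official code is available), one might freeze a reference copy of the image encoder and keep the other copy trainable:

```python
import copy

import torch.nn as nn


def build_caw_encoders(image_encoder: nn.Module):
    """Split a CLIP-style image encoder into a frozen reference copy
    and a fine-tunable target copy; only the target is updated."""
    frozen = copy.deepcopy(image_encoder)
    for p in frozen.parameters():
        p.requires_grad_(False)  # the original CLIP stays fixed as a teacher
    frozen.eval()
    return frozen, image_encoder
```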

Key Designs

  1. Confidence-Aware Loss: The core idea is to align the prediction distribution \(P^{\text{clean}}\) of the frozen CLIP on clean images with the prediction distribution \(P^{\text{adv}}\) of the fine-tuned CLIP on adversarial images via KL divergence, using \((1 - P^{\text{adv}}_{i,y_i})\) as a weighting factor to amplify the contribution of hard samples:

    \(L_{\text{CA}} = \frac{1}{N}\sum_{i=1}^{N}\left[\text{KL}(P^{\text{adv}}_i \| P^{\text{clean}}_i)(1 - P^{\text{adv}}_{i,y_i})\right]\)

where \(P^{\text{adv}}_{i,y_i}\) is the predicted probability of the ground-truth class under adversarial input—lower probability indicates higher uncertainty, yielding a larger weight and greater training attention on that sample. Unlike ARoW, CAW uses the direction \(\text{KL}(P^{\text{adv}} \| P^{\text{clean}})\) (adversarial distribution first), which empirically yields better performance.
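
A minimal PyTorch sketch of this loss, assuming `probs_adv` and `probs_clean` are the softmax prediction distributions of the target and frozen models and `labels` holds the ground-truth class indices (all names are illustrative):

```python
import torch


def confidence_aware_loss(probs_adv: torch.Tensor,
                          probs_clean: torch.Tensor,
                          labels: torch.Tensor,
                          eps: float = 1e-8) -> torch.Tensor:
    """KL(P_adv || P_clean), weighted per sample by (1 - P_adv[y])."""
    # Per-sample KL divergence, with the adversarial distribution first.
    kl = (probs_adv * (torch.log(probs_adv + eps)
                       - torch.log(probs_clean + eps))).sum(dim=-1)
    # Low confidence on the true class => larger weight => harder sample.
    p_true = probs_adv.gather(1, labels.unsqueeze(1)).squeeze(1)
    return (kl * (1.0 - p_true)).mean()
```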

  2. Feature Alignment Regularization: By minimizing the \(\ell_2\) distance between the features of the fine-tuned and frozen encoders on adversarial inputs, semantic consistency is maintained at the image feature level prior to text alignment:

    \(L_{\text{Reg}} = \frac{1}{N}\sum_{i=1}^{N}\left\|f_{\text{tar}}(x^{\text{adv}}_i) - f_{\text{ori}}(x^{\text{adv}}_i)\right\|_2\)

Performing distillation in feature space rather than at the logit level better preserves the visual-semantic knowledge of pre-trained CLIP and reduces overfitting risk during fine-tuning.
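
Under the same naming assumptions, this regularizer is just a mean per-sample \(\ell_2\) distance between the two encoders' features on the same adversarial images:

```python
import torch


def feature_alignment_loss(feat_target: torch.Tensor,
                           feat_frozen: torch.Tensor) -> torch.Tensor:
    """Mean per-sample L2 distance between the fine-tuned (target) and
    frozen (original) encoders' features on adversarial inputs."""
    return (feat_target - feat_frozen).norm(p=2, dim=-1).mean()
```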

  3. Total Loss:

    \(L_{\text{total}} = L_{\text{CE}} + \alpha \cdot L_{\text{CA}} + \beta \cdot L_{\text{Reg}}\)

where \(L_{\text{CE}}\) is the standard cross-entropy loss, and \(\alpha\) and \(\beta\) control the contribution of each term.
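
Putting the pieces together (a sketch reusing the two helper functions above; the summary does not report the values of \(\alpha\) and \(\beta\), so they are left as required arguments):

```python
import torch.nn.functional as F


def caw_total_loss(logits_adv, probs_clean, labels,
                   feat_target, feat_frozen, alpha, beta):
    """L_total = L_CE + alpha * L_CA + beta * L_Reg (sketch).
    Reuses confidence_aware_loss and feature_alignment_loss from above."""
    probs_adv = logits_adv.softmax(dim=-1)
    loss_ce = F.cross_entropy(logits_adv, labels)  # standard CE on x_adv
    loss_ca = confidence_aware_loss(probs_adv, probs_clean, labels)
    loss_reg = feature_alignment_loss(feat_target, feat_frozen)
    return loss_ce + alpha * loss_ca + beta * loss_reg
```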

Loss & Training

  • Inner loop: adversarial examples are generated with two-step PGD (PGD-2) under a perturbation budget of \(\epsilon = 1/255\); a generic sketch follows this list.
  • Outer loop: Image encoder parameters are updated using \(L_{\text{total}}\).
  • Class descriptions are constructed using CLIP's standard text templates.
  • The model is trained exclusively on TinyImageNet and then zero-shot transferred to 14 other datasets.
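
For concreteness, here is a generic \(\ell_\infty\) PGD sketch in this spirit (a textbook construction, not the authors' exact attack code; `loss_fn` is any callable returning the scalar loss to maximize, e.g., the target model's cross-entropy):

```python
import torch


def pgd_linf(x, labels, loss_fn, eps=1/255, steps=2, step_size=None):
    """Generate adversarial examples with a few steps of L_inf PGD,
    ascending the given loss inside an eps-ball around x."""
    step_size = eps if step_size is None else step_size  # heuristic for few steps
    delta = torch.zeros_like(x, requires_grad=True)
    for _ in range(steps):
        loss = loss_fn(x + delta, labels)        # e.g., cross-entropy
        grad, = torch.autograd.grad(loss, delta)
        with torch.no_grad():
            delta += step_size * grad.sign()     # gradient-ascent step
            delta.clamp_(-eps, eps)              # project into the eps-ball
            # (clamping x + delta to the valid pixel range is omitted here)
    return (x + delta).detach()
```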

Key Experimental Results

Main Results

Zero-shot robust accuracy (%) under AutoAttack (\(\epsilon=1/255\)); six representative datasets shown, with Avg. computed over all 15 datasets.

| Method   | TinyImageNet | CIFAR-10 | CIFAR-100 | STL-10 | SUN397 | Caltech-101 | Avg.  |
|----------|--------------|----------|-----------|--------|--------|-------------|-------|
| CLIP     | 0.02         | 0.01     | 0.08      | 0.03   | 0.04   | 0.43        | 0.09  |
| FT-Adv   | 50.48        | 37.55    | 20.39     | 69.14  | 16.25  | 49.90       | 29.08 |
| TeCoA    | 35.03        | 28.18    | 16.09     | 66.08  | 17.41  | 54.54       | 27.23 |
| PMG-AFT  | 44.26        | 44.12    | 23.66     | 73.90  | 19.63  | 60.57       | 31.55 |
| TGA-ZSR  | 49.45        | 40.53    | 22.38     | 72.06  | 20.36  | 57.16       | 31.63 |
| CAW      | 50.52        | 47.35    | 26.35     | 74.27  | 19.64  | 62.79       | 33.51 |

Ablation Study

| Configuration | Key Metric | Description |
|---------------|------------|-------------|
| \(L_{\text{CE}}\) only | Baseline robust accuracy | Standard adversarial training |
| \(L_{\text{CE}} + L_{\text{CA}}\) | Robust accuracy ↑ | Hard samples better learned via confidence weighting |
| \(L_{\text{CE}} + L_{\text{Reg}}\) | Clean accuracy ↑ | Feature regularization effectively preserves pre-trained knowledge |
| \(L_{\text{CE}} + L_{\text{CA}} + L_{\text{Reg}}\) | Both ↑ | The two loss terms are complementary; best overall performance |
| KL(clean ‖ adv) direction | Robust accuracy ↓ | Placing the adversarial distribution first in the KL is superior |

Memory comparison with baselines

| Method   | Memory Usage (relative) | Avg. Robust Accuracy (%) |
|----------|-------------------------|--------------------------|
| PMG-AFT  | High                    | 31.55                    |
| TGA-ZSR  | High                    | 31.63                    |
| CAW      | Low                     | 33.51                    |

Key Findings

  • Vanilla CLIP achieves near-zero accuracy under AutoAttack (average 0.09%), demonstrating the severity of the zero-shot robustness problem.
  • CAW achieves an average robust accuracy of 33.51% across 15 datasets, about 1.9 percentage points higher than the previous best, TGA-ZSR (31.63%).
  • Fine-tuning solely on TinyImageNet generalizes to 14 datasets from different distributions, indicating that the learned robust features transfer well.
  • On CIFAR-10, CAW outperforms PMG-AFT by 3.23 percentage points (47.35 vs. 44.12).

Highlights & Insights

  • Simple yet effective core idea: Weighting by \((1 - P^{\text{adv}}_{i,y_i})\) naturally focuses training on hard samples without requiring additional sample mining or curriculum learning strategies.
  • Regularization in feature space rather than logit space better preserves CLIP's rich pre-trained semantics.
  • The overall method is lightweight—no additional text generation, attention mechanisms, or complex multi-stage training is needed.
  • Memory efficiency surpasses PMG-AFT and TGA-ZSR, making CAW more suitable for resource-constrained deployment.

Limitations & Future Work

  • Validation is limited to classification tasks; extension to more complex downstream tasks such as detection and segmentation remains unexplored.
  • The paper primarily addresses \(\ell_\infty\)-norm perturbations; robustness to other attack types such as \(\ell_2\) is not thoroughly discussed.
  • As a workshop paper, the experimental scale and depth of analysis are limited.
  • No comparison is made with more recent adversarial training methods, such as those that leverage generative models to augment adversarial training data.
  • The selection strategy for hyperparameters \(\alpha\) and \(\beta\) is not elaborated.

Related Work & Context

  • TeCoA establishes the foundational research paradigm for zero-shot robustness of large-scale VLMs.
  • PMG-AFT introduces a pre-trained-model-guided fine-tuning framework.
  • ARoW proposes prioritizing vulnerable samples; CAW adapts and improves this idea for the vision-language setting.
  • Insight: the confidence-weighting idea is simple and general, and can be extended to other adversarial training and robust learning settings.

Rating

  • Novelty: ⭐⭐⭐ The core mechanism (confidence weighting + feature regularization) is relatively intuitive, but the combination proves effective in the VLM setting.
  • Experimental Thoroughness: ⭐⭐⭐⭐ Evaluation across 15 datasets, multiple attack types, and memory comparison.
  • Writing Quality: ⭐⭐⭐⭐ Concise and clear; method formulations are well-presented.
  • Value: ⭐⭐⭐⭐ Provides a low-resource approach to CLIP robustness enhancement with significant practical deployment value.