BATCLIP: Bimodal Online Test-Time Adaptation for CLIP¶
Conference: ICCV 2025 · arXiv: 2412.02837 · Code: https://github.com/sarthaxxxxx/BATCLIP · Area: LLM Evaluation · Keywords: CLIP, test-time adaptation, bimodal adaptation, robustness to image corruption, vision-language models
TL;DR¶
This paper proposes BATCLIP, a bimodal online test-time adaptation (TTA) method for CLIP that simultaneously adapts the LayerNorm parameters of both the visual and text encoders. By introducing a projection matching loss and an inter-class separability loss to enhance vision-text feature alignment and class discriminability, BATCLIP achieves state-of-the-art performance on CIFAR-10C, CIFAR-100C, and ImageNet-C.
Background & Motivation¶
Despite CLIP's strong zero-shot classification performance as a vision-language model, its accuracy degrades sharply under common image corruptions (e.g., Gaussian noise, fog, snow). The authors observe that with ViT-B/16, CLIP's accuracy on CIFAR-100 under Gaussian noise at severity-5 drops from 66.6% to 10.79%. Such brittleness has serious consequences in safety-critical applications such as autonomous driving.
Existing CLIP TTA methods suffer from three core problems:
Unimodal limitation: TPT optimizes only text prompts while keeping the visual encoder frozen, leaving the adapted prompts unaware of the test image distribution.
Computational expense: TPT requires generating multiple augmented views per test image and performing multiple forward passes.
Prompt template dependency: Methods such as WATT rely on multiple prompt templates, making prompt selection or optimization at test time impractical.
Through systematic experiments, the authors identify two key insights: (1) CLIP is highly sensitive to corruption severity, with noticeable degradation even at severity-1; and (2) different prompt templates have negligible effect on corrupted images — the true bottleneck lies in vision-text feature alignment rather than in text prompts. This motivates a bimodal adaptation approach that adjusts both encoders simultaneously.
Method¶
Overall Architecture¶
BATCLIP operates in an online TTA setting: test batches arrive sequentially, and the model performs a single forward pass and parameter update per batch. Only the LayerNorm parameters of both encoders are adapted (approximately 0.044% of total parameters), and parameters are reset after each task/corruption type.
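To make the scale of the adapted parameter set concrete, here is a minimal sketch (assuming the OpenAI `clip` package and a ViT-B/16 checkpoint; the variable names are ours and the learning rate is taken from the training details below, not from the authors' released code) that freezes CLIP and exposes only the LayerNorm affine parameters of both encoders to the optimizer:

```python
import torch
import clip  # OpenAI CLIP package assumed here; any implementation using nn.LayerNorm works similarly

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/16", device=device)
model = model.float()  # the GPU checkpoint loads in fp16; full precision is steadier for gradient updates

# Freeze everything, then unfreeze only the LayerNorm affine parameters (gamma/beta)
# of both the visual and text encoders -- the ~0.044% of weights adapted at test time.
for p in model.parameters():
    p.requires_grad_(False)

ln_params = []
for name, module in model.named_modules():
    if isinstance(module, torch.nn.LayerNorm):  # CLIP's LayerNorm subclasses nn.LayerNorm
        for p in module.parameters():
            p.requires_grad_(True)
            ln_params.append(p)

optimizer = torch.optim.AdamW(ln_params, lr=5e-4)  # the CIFAR-100C / ImageNet-C setting below
print(f"trainable LayerNorm params: {sum(p.numel() for p in ln_params)}")
```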
Key Designs¶
- Bimodal LayerNorm Adaptation:
  - Function: Simultaneously updates the LayerNorm parameters of the visual encoder \(f_{vis}\) and the text encoder \(f_{txt}\).
  - Mechanism: Unlike existing methods that update only the visual side or only text prompts, BATCLIP jointly updates the normalization-layer parameters \(\phi_v\) and \(\phi_t\) of both encoders, enabling mutually-aware, domain-specific feature representations.
  - Design Motivation: Unimodal adaptation (updating only vision or only text) leads to misalignment between the two modalities. When only the visual encoder is updated, text features remain optimized for the pre-training data; when only text prompts are updated, the resulting text features remain unaware of the test distribution.
- Projection Matching Loss:
  - Function: Maximizes the scalar projection of visual class prototypes onto the corresponding text features.
  - Mechanism: Visual prototypes \(\bar{v}_c\) (the mean of intra-class image features) are first computed using pseudo-labels; the projection of each prototype onto the normalized text feature is then maximized: \(\mathcal{L}_{pm} = \frac{1}{C} \sum_c \bar{v}_c \cdot \hat{z}_c\), where \(\hat{z}_c = z_c / \|z_c\|_2\). Using prototypes rather than individual features mitigates pseudo-label noise under corrupted inputs by smoothing the class distribution.
  - Design Motivation: Geometrically, maximizing the projection aligns visual features with the direction of the text features, directly optimizing vision-text alignment.
- Inter-class Separability Loss:
  - Function: Increases the cosine distance between prototypes of different classes.
  - Mechanism: \(\mathcal{L}_{sp} = \sum_{l \in C} \sum_{c \in C} \mathbb{1}[l \neq c]\,(1 - \cos(\bar{v}_c, \bar{v}_l))\). Maximizing this loss pushes visual prototypes of different classes apart in the feature space.
  - Design Motivation: Image corruptions can cause visual features of different classes to overlap; aligning vision and text alone is insufficient to ensure well-separated classification decision boundaries.
- Overall Optimization Objective: \(\arg\min_{\phi_v, \phi_t} (\mathcal{L}_{ent} - \mathcal{L}_{pm} - \mathcal{L}_{sp})\). The three losses serve distinct roles: (1) \(\mathcal{L}_{ent}\) (entropy minimization) reduces prediction uncertainty; (2) \(-\mathcal{L}_{pm}\) (projection matching) reinforces vision-text alignment; (3) \(-\mathcal{L}_{sp}\) (inter-class separation) enhances feature discriminability. A code sketch of this combined objective follows after this list.
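Below is a minimal sketch of this combined objective under stated assumptions: image and text features are already L2-normalized, pseudo-labels come from the current batch's logits, the logit scale is CLIP's usual ~100, and pairwise prototype distances are averaged rather than summed (a normalization choice of ours). The function name `batclip_style_losses` is ours, not the authors'.

```python
import torch
import torch.nn.functional as F

def batclip_style_losses(img_feats, txt_feats, logit_scale=100.0):
    """Sketch of the three-part objective: L_ent - L_pm - L_sp.

    img_feats: (B, D) L2-normalized visual features of the test batch.
    txt_feats: (C, D) L2-normalized text features, one per class prompt.
    """
    logits = logit_scale * (img_feats @ txt_feats.t())          # (B, C)
    probs = logits.softmax(dim=-1)

    # (1) Entropy minimization over the batch predictions.
    ent = -(probs * probs.clamp_min(1e-12).log()).sum(dim=-1).mean()

    # Pseudo-labels used to build class prototypes (smooths label noise).
    pseudo = logits.argmax(dim=-1)                               # (B,)
    num_classes = txt_feats.shape[0]

    protos, kept = [], []
    for c in range(num_classes):
        mask = pseudo == c
        if mask.any():                                           # only classes present in the batch
            protos.append(img_feats[mask].mean(dim=0))
            kept.append(c)
    protos = torch.stack(protos)                                 # (C', D)

    # (2) Projection matching: scalar projection of each visual prototype
    # onto its unit-norm text feature, averaged over present classes.
    pm = (protos * txt_feats[kept]).sum(dim=-1).mean()

    # (3) Inter-class separability: mean pairwise cosine distance between prototypes.
    proto_n = F.normalize(protos, dim=-1)
    cos = proto_n @ proto_n.t()
    off_diag = ~torch.eye(len(kept), dtype=torch.bool, device=cos.device)
    sp = (1.0 - cos[off_diag]).mean() if off_diag.any() else cos.new_zeros(())

    return ent - pm - sp
```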
Loss & Training¶
- AdamW optimizer; learning rate 1e-3 for CIFAR-10C, 5e-4 for CIFAR-100C/ImageNet-C.
- Single-step online adaptation (one gradient update per batch), suitable for real-time deployment.
- Uses the generic prompt template "a photo of a \<CLS>." without relying on multiple templates.
- Model parameters are reset after each corruption task to prevent catastrophic forgetting.
- Batch size: 200 for CIFAR-10C/100C, 64 for ImageNet-C.
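Putting the pieces together, a hypothetical online loop following this recipe might look as follows (reusing `batclip_style_losses` from the sketch above; `tasks`, `make_text_feats`, and the reset-by-state-dict mechanism are illustrative helpers of ours, not the authors' implementation):

```python
import copy
import torch

def run_online_tta(model, tasks, ln_params, make_text_feats, lr=5e-4):
    """Online TTA loop: one forward pass and one gradient step per test batch.

    tasks: iterable of (task_name, dataloader) pairs, one per corruption type.
    ln_params: the LayerNorm parameters selected earlier.
    make_text_feats: callable returning (C, D) normalized text features built
                     from the single template "a photo of a <CLS>.".
    """
    source_state = copy.deepcopy(model.state_dict())     # pristine weights for per-task reset
    device = next(model.parameters()).device

    for task_name, loader in tasks:
        model.load_state_dict(source_state)               # reset before every corruption task
        optimizer = torch.optim.AdamW(ln_params, lr=lr)

        for images, _ in loader:                           # labels are never used at test time
            images = images.to(device)

            img_feats = model.encode_image(images)
            img_feats = img_feats / img_feats.norm(dim=-1, keepdim=True)
            txt_feats = make_text_feats(model)             # re-encoded each step: the text-side LN is adapted

            loss = batclip_style_losses(img_feats, txt_feats)

            with torch.no_grad():                          # predictions come from the same single forward pass
                preds = (img_feats @ txt_feats.t()).argmax(dim=-1)

            optimizer.zero_grad()
            loss.backward()
            optimizer.step()                               # single gradient step per batch (online setting)
```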
Key Experimental Results¶
Main Results (Average Accuracy %, severity-5, ViT-B/16)¶
| Method | CIFAR-10C | CIFAR-100C | ImageNet-C |
|---|---|---|---|
| Zero-shot CLIP | 61.16 | 35.79 | 24.51 |
| TENT | 62.03 | 37.96 | 25.15 |
| SAR | 67.37 | 41.19 | 29.73 |
| TPT | 63.64 | 36.15 | 24.87 |
| VTE | 64.15 | 35.01 | 25.60 |
| WATT-S* | 72.81 | 36.71 | 24.67 |
| BATCLIP | 73.85 | 42.09 | 30.72 |
Ablation Study¶
| Loss Combination | CIFAR-10C | CIFAR-100C | ImageNet-C | Note |
|---|---|---|---|---|
| \(\mathcal{L}_{ent}\) | 60.65 | 38.17 | 24.03 | Entropy minimization only |
| \(\mathcal{L}_{sp}\) | 73.16 | 41.36 | 30.05 | Inter-class separation contributes most |
| \(\mathcal{L}_{ent}+\mathcal{L}_{pm}\) | 62.60 | 39.32 | 25.21 | Alignment provides modest gain |
| \(\mathcal{L}_{ent}+\mathcal{L}_{sp}\) | 72.69 | 41.84 | 30.08 | Separation yields substantial gain |
| \(\mathcal{L}_{ent}+\mathcal{L}_{pm}+\mathcal{L}_{sp}\) | 73.85 | 42.09 | 30.72 | All three combined is optimal |
Key Findings¶
- \(\mathcal{L}_{sp}\) (inter-class separability) is the primary contributor to performance gains, achieving 73.16% on CIFAR-10C when used alone.
- Multi-step iterative adaptation leads to overfitting; single-step adaptation achieves the best balance between robustness and performance.
- BATCLIP remains competitive with or superior to TPT and VTE at small batch sizes (e.g., 32), as prototype computation adapts to available samples.
- The method generalizes to ViT-L/14, achieving 84.74% on CIFAR-10C (vs. 75.84% zero-shot).
- Computational efficiency is substantially better than WATT — only 0.2 seconds per batch (vs. 2.34 seconds for WATT), enabling real-time deployment.
- Consistent performance gains are achieved by updating only 0.044% of total parameters.
Highlights & Insights¶
- The systematic analysis of CLIP zero-shot performance across backbones, corruption types, and prompt templates provides a valuable reference for the community.
- The core insight behind "bimodal adaptation" — that the text encoder also needs to be aware of the test domain — is the key to overcoming the bottleneck of unimodal TTA.
- The method is simple and practical: no multiple templates or augmented views are required; single-step, single-template adaptation supports real-time deployment.
- t-SNE visualizations clearly demonstrate how BATCLIP forms tighter class clusters and achieves better vision-text alignment.
Limitations & Future Work¶
- Gains on domain generalization benchmarks such as OfficeHome and PACS are limited, possibly because single-step adaptation is insufficient to handle large style shifts.
- Pseudo-label quality depends on batch size and corruption severity, and may degrade under extreme conditions.
- Adaptive learning rate schedules or more sophisticated prototype update strategies remain unexplored.
- Catastrophic forgetting is avoided via inter-task parameter resets rather than being fundamentally addressed.
- Validation is primarily conducted with ViT-B/16; evidence of effectiveness on other visual backbones remains limited.
Related Work & Insights¶
- vs TPT: TPT performs unimodal adaptation (text prompt optimization only) and requires multiple forward passes per image, making it slow while leaving the visual encoder unadapted.
- vs VTE: VTE employs multi-template ensemble without updating model parameters, resulting in insufficient adaptability under severe corruptions.
- vs WATT: WATT relies on multi-template, multi-step adaptation, which is unsuitable for online TTA and incurs more than 10× the computational cost.
- vs TENT: TENT updates only batch normalization layers without considering vision-text alignment; directly applying it to CLIP yields limited benefit.
- vs SAR: SAR performs well by filtering noisy samples in gradient space but remains a unimodal method.
Rating¶
- Novelty: ⭐⭐⭐⭐ The bimodal adaptation idea is well-motivated and effective; the combination of projection matching and inter-class separability is clearly designed.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Three major benchmarks plus domain generalization datasets, detailed ablations, and efficiency analysis.
- Writing Quality: ⭐⭐⭐⭐ Thorough analysis with systematic preliminary studies preceding the method proposal; logically coherent.
- Value: ⭐⭐⭐⭐ High practical utility; the method is simple, efficient, and suitable for real-world deployment scenarios.