Diversifying Counterattacks: Orthogonal Exploration for Robust CLIP Inference¶
Conference: AAAI 2026 · arXiv: 2511.09064 · Code: Available · Area: AI Security · Keywords: Adversarial Robustness, CLIP Defense, Test-Time Defense, Orthogonal Counterattack, Vision-Language Models
TL;DR¶
This paper proposes Directional Orthogonal Counterattack (DOC), a method that expands the search space during counterattack optimization by introducing orthogonal gradient components and momentum updates, and adaptively modulates counterattack intensity via a cosine-similarity-based Directional Sensitivity Score (DSS). DOC significantly improves the test-time adversarial robustness of CLIP across 16 datasets.
Background & Motivation¶
Vision-language pre-trained models such as CLIP exhibit strong zero-shot generalization but are highly vulnerable to adversarial examples. Existing defenses fall into three categories:
- Adversarial fine-tuning (e.g., TeCoA, PMG-AFT, FARE): fine-tunes CLIP on adversarial examples, but incurs high computational cost and may degrade generalization.
- Adversarial prompt tuning: adjusts prompts in the embedding space, but sacrifices semantic interpretability.
- Test-Time Counterattack (TTC): a recent parameter-free defense that generates counterattack perturbations to maximize the embedding distance between adversarial inputs and their variants.
Core issue with TTC: A fundamental objective mismatch exists between adversarial attacks and counterattacks:
- Adversarial attack objective: maximize classification loss
- Counterattack objective: maximize embedding distance
TTC uses PGD to generate counterattacks along the gradient direction, but due to this mismatch, the search space is confined to a narrow region, causing the counterattack to overfit to a limited set of adversarial patterns and lack the diversity needed to neutralize a broad distribution of perturbations.
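To make the counterattack objective concrete, a TTC-style PGD step can be sketched with a toy linear encoder; the matrix `W`, the random initialization, and the \(\ell_\infty\) projection below are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 16))  # toy linear "encoder" standing in for CLIP's image encoder

def ttc_counterattack(x_adv, steps=2, alpha=3/255, eps=4/255):
    """PGD ascent on the embedding distance between the (possibly
    adversarial) input and its counterattacked variant."""
    delta = rng.uniform(-eps, eps, size=x_adv.shape)  # random start
    for _ in range(steps):
        # gradient of ||W(x + delta) - W x||^2 w.r.t. delta is 2 W^T W delta
        g = 2.0 * W.T @ (W @ delta)
        delta = np.clip(delta + alpha * np.sign(g), -eps, eps)
    return x_adv + delta
```

Because every step follows the raw gradient, successive perturbations stay inside a narrow cone around it; this is the limited search space that DOC's orthogonal exploration is designed to widen.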
Method¶
Overall Architecture¶
DOC (Directional Orthogonal Counterattack) comprises two core components:
- Orthogonal Gradient Augmentation (OGA): Adds a random component orthogonal to the primary gradient direction at each counterattack optimization step, combined with momentum updates.
- Directional Sensitivity Score (DSS): Assesses whether an input is adversarial based on cosine similarity, and adaptively modulates counterattack intensity.
Key Designs¶
Orthogonal Gradient Augmentation (OGA):
- Compute the normalized gradient \(g\) (gradient of the counterattack loss w.r.t. the adversarial input, then normalized).
- Sample a random vector \(r\) from the standard normal distribution; apply Gram-Schmidt orthogonalization to obtain a component orthogonal to the gradient: \(r_\perp = (r - \langle r, g \rangle g) / \|r - \langle r, g \rangle g\|\).
- Combine the update direction: \(d = g + \lambda \cdot r_\perp\) (where \(\lambda\) controls orthogonal injection strength).
- Momentum update: \(m_t = \mu \cdot m_{t-1} + (1 - \mu) \cdot d\).
- Counterattack perturbation iteration: \(\delta_{t+1} = \mathrm{Proj}(\delta_t + \alpha \cdot \mathrm{sign}(m_t))\).
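The steps above can be sketched in a few lines of NumPy; the momentum coefficient `mu` and the toy vector shapes are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def oga_direction(g, lam=1.0):
    """Orthogonal Gradient Augmentation: inject a random component
    orthogonal to the (unit-norm) gradient g."""
    r = rng.standard_normal(g.shape)
    r_perp = r - np.dot(r, g) * g          # Gram-Schmidt projection
    r_perp /= np.linalg.norm(r_perp)       # normalize the residual
    return g + lam * r_perp                # combined direction d

def doc_step(delta, m, g, alpha=3/255, mu=0.9, lam=1.0, eps_ca=4/255):
    """One counterattack iteration: momentum update, signed step,
    projection back onto the l_inf ball of radius eps_ca."""
    m = mu * m + (1.0 - mu) * oga_direction(g, lam)
    delta = np.clip(delta + alpha * np.sign(m), -eps_ca, eps_ca)
    return delta, m
```

By construction \(\langle r_\perp, g \rangle = 0\), so the injected noise never cancels gradient progress; \(\lambda\) trades off exploitation (gradient) against exploration (orthogonal noise).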
Design intuition: The orthogonal component enables the counterattack to explore regions beyond the gradient direction, while momentum helps escape narrow local optima, yielding more diverse counterattack perturbations. t-SNE visualizations confirm that DOC produces a more dispersed counterattack distribution than TTC.
Directional Sensitivity Score (DSS):
TTC uses \(\ell_2\) distance to detect adversarial inputs, which suffers from two issues: (a) embeddings with similar directions but different scales can produce spuriously large \(\ell_2\) distances; (b) relying on a single noise sample makes the score unstable.
DOC replaces this with a directional score \(\hat{\tau}\) computed from the cosine similarity between the input's embedding and those of multiple noise-perturbed copies (lower average similarity means higher \(\hat{\tau}\)):
- Low \(\hat{\tau}\): perturbed embeddings maintain consistent directions, indicating a clean sample.
- High \(\hat{\tau}\): directional inconsistency, indicating a likely adversarial sample.
A soft gating function adaptively modulates counterattack intensity:
For clean samples, \(w \approx 0\) (counterattack is nearly suppressed); for adversarial samples, \(w \approx 1\) (full counterattack is applied).
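Under the assumption that \(\hat{\tau}\) is one minus the mean cosine similarity over noisy copies and that the soft gate is a sigmoid, DSS can be sketched as follows; the gate parameters `tau0` (midpoint) and `k` (sharpness) are hypothetical, not the paper's values:

```python
import numpy as np

rng = np.random.default_rng(0)

def dss_weight(embed_fn, x, n_samples=4, sigma=1e-2, tau0=0.1, k=50.0):
    """Directional Sensitivity Score with a soft gate (sketch).
    embed_fn maps an input to an embedding vector."""
    unit = lambda v: v / (np.linalg.norm(v) + 1e-12)
    e0 = unit(embed_fn(x))
    # cosine similarity between the input and several noisy copies
    sims = [np.dot(e0, unit(embed_fn(x + sigma * rng.standard_normal(x.shape))))
            for _ in range(n_samples)]
    tau_hat = 1.0 - float(np.mean(sims))   # directional inconsistency
    return 1.0 / (1.0 + np.exp(-k * (tau_hat - tau0)))  # gate w in (0, 1)
```

A directionally stable embedding yields \(\hat{\tau} \approx 0\) and \(w \approx 0\) (counterattack suppressed); an erratic one yields a large \(\hat{\tau}\) and \(w \approx 1\) (full counterattack).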
Loss & Training¶
DOC is a training-free test-time defense:
- No model parameters are modified; no training data or label supervision is required.
- Counterattack budget: \(\epsilon_{ca} = 4/255\).
- Default: 4 counterattack steps, step size \(\alpha = 3/255\).
- Batch size 256; runs on a single NVIDIA RTX 4090 GPU.
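Putting the pieces together under the defaults above (4 steps, \(\alpha = 3/255\), \(\epsilon_{ca} = 4/255\)), a minimal end-to-end sketch; the toy linear encoder `W`, the sigmoid gate, and its parameters `tau0`/`k` are illustrative stand-ins for CLIP and the paper's exact settings:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 16))  # toy "encoder" (assumption, not CLIP)
embed = lambda v: W @ v

def doc_defend(x, steps=4, alpha=3/255, eps_ca=4/255, mu=0.9, lam=1.0,
               n_noise=4, sigma=1e-2, tau0=0.1, k=50.0):
    unit = lambda v: v / (np.linalg.norm(v) + 1e-12)
    # DSS gate: directional consistency over noisy copies of the input.
    e0 = unit(embed(x))
    sims = [np.dot(e0, unit(embed(x + sigma * rng.standard_normal(x.shape))))
            for _ in range(n_noise)]
    w = 1.0 / (1.0 + np.exp(-k * ((1.0 - np.mean(sims)) - tau0)))
    # OGA counterattack: momentum PGD with an orthogonal exploration term.
    delta = rng.uniform(-eps_ca, eps_ca, size=x.shape)  # random start
    m = np.zeros_like(x)
    for _ in range(steps):
        g = unit(2.0 * W.T @ (W @ delta))   # toy surrogate gradient
        r = rng.standard_normal(x.shape)
        r_perp = unit(r - np.dot(r, g) * g)
        m = mu * m + (1.0 - mu) * (g + lam * r_perp)
        delta = np.clip(delta + alpha * np.sign(m), -eps_ca, eps_ca)
    return x + w * delta                    # gated counterattack
```

Since \(w \in (0, 1)\) and \(\|\delta\|_\infty \le \epsilon_{ca}\), the defended input always stays within the counterattack budget of the original.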
Key Experimental Results¶
Main Results¶
Average results across 16 datasets under PGD-10 attack (\(\epsilon_{atk} = 4/255\)):
| Method | Type | Avg. Robust Acc. | Avg. Clean Acc. |
|---|---|---|---|
| CLIP (original) | — | 0.06% | 61.51% |
| HD | Test-time defense | 0.56% | 54.85% |
| TeCoA4 | Adversarial fine-tuning | 10.95% | 37.58% |
| FARE4 | Adversarial fine-tuning | 1.38% | 56.62% |
| TTC | Test-time defense | 21.22% | 55.63% |
| DOC | Test-time defense | 31.02% | 58.26% |
DOC improves average robust accuracy over TTC by 9.80 percentage points while also achieving higher clean accuracy (+2.63 points).
Per-dataset key results (robust accuracy under PGD-10):
| Dataset | CLIP | TTC | DOC | Gain |
|---|---|---|---|---|
| CIFAR-10 | 0.00% | 30.25% | 38.14% | +7.89% |
| STL-10 | 0.04% | 51.89% | 69.16% | +17.27% |
| ImageNet | 0.00% | 13.07% | 24.64% | +11.57% |
| OxfordPets | 0.00% | 25.89% | 46.52% | +20.63% |
| Caltech-256 | 0.13% | 26.38% | 43.08% | +16.70% |
Ablation Study¶
| DSS | OGA | Clean Acc. | PGD Robust | CW Robust | AutoAttack |
|---|---|---|---|---|---|
| ✗ | ✗ | 55.66% | 21.43% | 20.70% | 21.97% |
| ✓ | ✗ | 58.23% | 23.37% | 22.27% | 22.66% |
| ✗ | ✓ | 55.38% | 31.83% | 29.02% | 26.07% |
| ✓ | ✓ | 58.27% | 31.04% | 28.15% | 25.89% |
- DSS alone: Primarily improves clean accuracy (+2.57%) by suppressing unnecessary perturbations on clean samples.
- OGA alone: Substantially boosts robust accuracy (+10.4%), validating the effectiveness of diversified counterattacks.
- Combined: Achieves favorable trade-offs between robustness and clean accuracy.
Average robust accuracy under CW attack: DOC 28.18% vs. TTC 20.61% (+7.58%). Under AutoAttack, DOC outperforms TTC by approximately 4.1%.
Key Findings¶
- DOC outperforms TTC on nearly all 16 datasets, with EuroSAT as the only exception.
- DOC functions as a plug-and-play module compatible with adversarial fine-tuning: combining with FARE yields average robust accuracy exceeding vanilla CLIP by 18%.
- Counterattack performance saturates at as few as \(N = 3\)–\(4\) steps, incurring minimal computational overhead.
- Clean accuracy remains stable as the number of steps increases; robustness gains do not come at the expense of clean performance.
Highlights & Insights¶
- Precise problem identification: Reveals the fundamental objective mismatch between adversarial attacks and counterattacks.
- Clear design intuition for OGA: Introducing exploration noise via orthogonalization is both mathematically elegant and practically effective.
- Cosine similarity replaces \(\ell_2\) distance for adversarial sample detection, which is more principled in high-dimensional spaces due to scale invariance.
- Completely training-free: Requires no data, no parameter modification, and runs on a single GPU, resulting in an extremely low deployment barrier.
- t-SNE visualizations intuitively demonstrate DOC's ability to push adversarial samples toward the clean distribution.
Limitations & Future Work¶
- The counterattack budget is set equal to the attack budget; in practice, the attack budget is unknown.
- The orthogonal component is randomly sampled, potentially causing inference results to vary across runs (though empirical variance is small).
- Clean accuracy decreases on ImageNet (−3.25%) and fluctuates on some fine-grained classification datasets.
- Validation is limited to CLIP; the approach has not been extended to other VLPs (e.g., BLIP-2, LLaVA).
- Robustness against adaptive attacks is not sufficiently discussed.
Related Work & Insights¶
- TTC (Xing et al. 2025): The pioneering test-time counterattack work that DOC directly improves upon.
- TeCoA (Mao et al.): A representative adversarial fine-tuning method.
- PMG-AFT (Wang et al. 2024): Adversarial fine-tuning augmented with CLIP-guided regularization.
- FARE (Schlarmann et al. 2024): Adversarial fine-tuning under large perturbation budgets.
- Hedge Defense (Wu et al. 2021): A test-time defense that maximizes loss across all classes.
- Insight: In unsupervised test-time defense, diversity matters more than precision. The orthogonal exploration paradigm is generalizable to other robust optimization scenarios.
Rating¶
- Novelty: 4/5 — OGA and DSS represent meaningful and original contributions.
- Technical Depth: 4/5 — The method is grounded in clear theoretical motivation and mathematical derivation.
- Experimental Thoroughness: 5/5 — 16 datasets × 3 attack types × ablations × combination experiments + visualizations.
- Writing Quality: 4/5 — Problem motivation is clearly articulated with rich figures and tables.
- Overall: 4.0/5