VisualAD: Language-Free Zero-Shot Anomaly Detection via Vision Transformer

Conference: CVPR 2026 arXiv: 2603.07952 Code: None Area: Medical Imaging Keywords: Zero-shot anomaly detection, Vision Transformer, language-free, learnable token, industrial + medical

TL;DR

This paper revisits the necessity of the text branch in zero-shot anomaly detection (ZSAD) and proposes VisualAD, a purely vision-based framework. Two learnable tokens (anomaly/normal) are inserted into a frozen ViT, enhanced by Spatial-Aware Cross-Attention (SCA) and a Self-Alignment Function (SAF). Without a text encoder, VisualAD achieves state-of-the-art performance across 13 industrial and medical benchmarks.

Background & Motivation

  • ZSAD challenge: Anomalies must be detected on unseen categories without per-class normal training samples.
  • Mainstream methods rely on CLIP's text branch: Methods such as AnomalyCLIP generate normal/abnormal prototypes via learnable text prompts and perform image-text similarity scoring.
  • Core question: If the final decision depends solely on two prototype vectors (normal and abnormal), is the text modality truly indispensable?
  • Exploratory experiment: Removing the text encoder from AnomalyCLIP and directly optimizing two visual vectors yields:
    • No significant drop in detection performance
    • A reduction of 99%+ in trainable parameters
    • More stable training curves (the text-branch variant exhibits severe oscillation)
  • Conclusion: Text prompts may serve merely as an indirect channel for shaping visual prototypes, rather than being essential.

Method

Overall Architecture

Two learnable tokens are inserted into the token sequence of a frozen ViT:

\[z_0 = [t_a, t_n, t_c, p_1, \ldots, p_N]\]

where \(t_a\) is the anomaly token, \(t_n\) is the normal token, and \(t_c\) is the original class token.

Features are extracted from intermediate layers \(\mathcal{L} = \{6, 12, 18, 24\}\), tokens are enhanced via SCA, patches are calibrated via SAF, and per-layer anomaly maps are computed and fused.
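
As a rough illustration, the token layout above can be sketched in PyTorch as follows; the hidden size, initialization, and the class name `VisualADTokens` are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn

# Minimal sketch of the token layout z_0 = [t_a, t_n, t_c, p_1, ..., p_N];
# hidden size and initialization are assumptions, not the authors' code.
class VisualADTokens(nn.Module):
    def __init__(self, dim: int = 1024):
        super().__init__()
        self.t_a = nn.Parameter(torch.randn(1, 1, dim) * 0.02)  # learnable anomaly token
        self.t_n = nn.Parameter(torch.randn(1, 1, dim) * 0.02)  # learnable normal token

    def forward(self, cls_token: torch.Tensor, patches: torch.Tensor) -> torch.Tensor:
        # cls_token: (B, 1, d) original class token; patches: (B, N, d) patch embeddings
        B = patches.shape[0]
        z0 = torch.cat(
            [self.t_a.expand(B, -1, -1), self.t_n.expand(B, -1, -1), cls_token, patches],
            dim=1,
        )
        return z0  # fed into the frozen ViT blocks
```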

Spatial-Aware Cross-Attention (SCA)

Global tokens lack spatial localization capability. SCA aggregates local spatial evidence through a small set of anchor queries \(Q_{\text{anchor}} \in \mathbb{R}^{m \times d}\) (\(m=4\)):

\[A_\ell = \text{softmax}\left(\frac{Q_{\text{anchor}} (P_\ell^{\text{pos}})^\top}{\sqrt{d}}\right), \quad U_\ell = A_\ell P_\ell\]

A token-guided gating mechanism adaptively modulates the output:

\[g(t) = \sigma(W_g t) \in \mathbb{R}^m, \qquad \tilde{t}_\ell = t + \alpha \sum_{i=1}^{m} g_i(t) \cdot u_i\]

where \(u_i\) denotes the \(i\)-th row of the aggregated evidence \(U_\ell\).

SCA is instantiated independently at each layer, dynamically adjusting the spatial sensitivity of tokens for each input image.
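
A minimal sketch of how the two SCA equations could be realized with standard PyTorch modules; the module name `SCA`, the initialization, and the default \(\alpha\) are hypothetical choices, not taken from the paper.

```python
import torch
import torch.nn as nn

# Hypothetical sketch of Spatial-Aware Cross-Attention (SCA) following the
# equations above; structural details are assumptions.
class SCA(nn.Module):
    def __init__(self, dim: int = 1024, m: int = 4, alpha: float = 0.1):
        super().__init__()
        self.q_anchor = nn.Parameter(torch.randn(m, dim) * 0.02)  # m anchor queries
        self.gate = nn.Linear(dim, m)                             # W_g for g(t)
        self.alpha = alpha
        self.scale = dim ** -0.5

    def forward(self, t: torch.Tensor, patches_pos: torch.Tensor,
                patches: torch.Tensor) -> torch.Tensor:
        # t: (B, d) global token; patches_pos: (B, N, d) position-aware patches; patches: (B, N, d)
        attn = torch.softmax(
            torch.einsum("md,bnd->bmn", self.q_anchor, patches_pos) * self.scale, dim=-1
        )                                                  # A_l = softmax(Q_anchor P^T / sqrt(d))
        u = torch.einsum("bmn,bnd->bmd", attn, patches)    # U_l = A_l P_l, aggregated local evidence
        g = torch.sigmoid(self.gate(t))                    # g(t) in R^m, token-guided gate
        return t + self.alpha * torch.einsum("bm,bmd->bd", g, u)  # gated residual update
```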

Self-Alignment Function (SAF)

A single-hidden-layer MLP at each layer calibrates the patch features: \(\hat{P}_\ell = \mathcal{F}_\ell(P_\ell)\)
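
Read literally, SAF could be as simple as the following per-layer MLP; the hidden width and activation are assumptions.

```python
import torch.nn as nn

# Plausible reading of the Self-Alignment Function: a single-hidden-layer MLP
# that recalibrates patch features at each selected layer (width is an assumption).
def make_saf(dim: int = 1024, hidden: int = 4096) -> nn.Module:
    return nn.Sequential(
        nn.Linear(dim, hidden),
        nn.GELU(),
        nn.Linear(hidden, dim),
    )
```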

Anomaly Scoring

After L2 normalization, the anomaly score of each patch is the difference of its cosine similarities to the anomaly and normal tokens:

\[s_i^{(\ell)} = \langle \bar{\hat{p}}_i^{(\ell)}, \bar{t}_a^{(\ell)} \rangle - \langle \bar{\hat{p}}_i^{(\ell)}, \bar{t}_n^{(\ell)} \rangle\]

Multi-layer fusion: \(H = \sum_{\ell \in \mathcal{L}} H_\ell\); image-level scores are computed as the mean of the top-1% pixel values.
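
In code, the scoring rule and the top-1% image-level aggregation might look like the sketch below; tensor shapes, helper names, and fusion details are assumptions.

```python
import torch
import torch.nn.functional as F

# Sketch of the contrastive cosine scoring and top-1% image-level aggregation.
def anomaly_map(patches: torch.Tensor, t_a: torch.Tensor, t_n: torch.Tensor) -> torch.Tensor:
    # patches: (B, N, d) calibrated patch features; t_a, t_n: (B, d) learned tokens
    p = F.normalize(patches, dim=-1)
    a = F.normalize(t_a, dim=-1).unsqueeze(1)
    n = F.normalize(t_n, dim=-1).unsqueeze(1)
    return (p * a).sum(-1) - (p * n).sum(-1)   # s_i = <p_i, t_a> - <p_i, t_n>, shape (B, N)

def image_score(fused_map: torch.Tensor, top_frac: float = 0.01) -> torch.Tensor:
    # fused_map: (B, H*W), the sum of per-layer maps; image score = mean of top-1% pixels
    k = max(1, int(top_frac * fused_map.shape[-1]))
    return fused_map.topk(k, dim=-1).values.mean(dim=-1)
```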

Loss & Training

\[\mathcal{L} = \mathcal{L}_{\text{cls}} + \mathcal{L}_{\text{seg}} + \mathcal{L}_{\text{ctr}}\]
  • \(\mathcal{L}_{\text{cls}}\): image-level BCE loss
  • \(\mathcal{L}_{\text{seg}}\): per-layer Focal + Dice loss
  • \(\mathcal{L}_{\text{ctr}}\): cosine margin penalty ensuring an angular distance > 120° between \(t_a\) and \(t_n\)

Only \(t_a\), \(t_n\), SCA, and SAF are updated; the ViT backbone remains frozen.
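
The contrastive term \(\mathcal{L}_{\text{ctr}}\) could be implemented as a hinge on the cosine similarity between the two tokens, since an angular distance above 120° corresponds to a cosine below \(\cos 120^\circ = -0.5\); the exact formulation is not given in the summary, so this hinge form is an assumption.

```python
import torch
import torch.nn.functional as F

# Illustrative sketch of the cosine margin penalty between t_a and t_n;
# the hinge formulation is an assumption, not the authors' exact loss.
def token_contrastive_loss(t_a: torch.Tensor, t_n: torch.Tensor,
                           margin_deg: float = 120.0) -> torch.Tensor:
    # Penalize the pair whenever the angle between t_a and t_n drops below 120 degrees,
    # i.e. whenever cos(t_a, t_n) exceeds cos(120 deg) = -0.5.
    cos_sim = F.cosine_similarity(t_a, t_n, dim=-1)
    cos_margin = torch.cos(torch.deg2rad(torch.tensor(margin_deg)))
    return F.relu(cos_sim - cos_margin).mean()
```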

Key Experimental Results

Image-Level AUROC on Industrial Domains

| Method | MVTec-AD | VisA | BTAD | KSDD2 | DAGM |
|---|---|---|---|---|---|
| WinCLIP | 90.4 | 75.6 | 68.2 | 93.5 | 91.8 |
| AnomalyCLIP | 91.6 | 81.0 | 88.7 | 91.9 | 98.0 |
| AdaCLIP | 92.0 | 79.7 | 90.0 | 94.9 | 98.3 |
| VisualAD(CLIP) | 92.2 | 84.7 | 94.9 | 98.0 | 99.5 |

Image-Level AUROC on Medical Domains

| Method | OCT17 | BrainMR1 | Brain_AD | HIS |
|---|---|---|---|---|
| AnomalyCLIP | 63.7 | 96.4 | 69.0 | 55.2 |
| VisualAD(CLIP) | 88.9 | 96.7 | 80.8 | 60.1 |
| VisualAD(DINOv2) | 91.2 | 93.8 | 87.1 | 60.1 |

Gains on medical domains are particularly substantial: AUROC on OCT17 improves from 63.7 to 91.2 (+27.5).

Ablation Study

| Configuration | Image AUROC | Pixel AP |
|---|---|---|
| w/o SCA | 82.3 | 27.4 |
| w/o SAF | 50.5 | 3.5 |
| w/o SCA & SAF | 48.0 | 0.8 |
| Full model | 84.7 | 28.4 |

SAF is the critical component; its removal causes catastrophic performance degradation.

Backbone Flexibility

The same framework seamlessly accommodates both CLIP and DINOv2 backbones. The DINOv2 variant performs better on pixel-level segmentation, while the CLIP variant is superior on image-level classification.

Highlights & Insights

  1. Challenging the necessity of text: Experiments demonstrate that the CLIP text branch in ZSAD may serve merely as an indirect channel for shaping visual prototypes; removing it cuts trainable parameters by over 99%.
  2. Minimalist and elegant design: Only two learnable tokens plus lightweight SCA/SAF modules are required, with a unified training and inference pipeline.
  3. Cross-domain zero-shot generalization: The framework generalizes strongly under the industrial-training → medical-inference setting.
  4. Backbone agnosticism: Compatible with both CLIP and DINOv2, offering strong extensibility.

Limitations & Future Work

  • Gains on certain medical datasets (e.g., HIS) are limited, likely due to weak visual priors for histopathological anomalies.
  • The choice of anchor query count \(m=4\) lacks in-depth analysis.
  • An auxiliary training set (normal and abnormal samples from VisA) is required; the method is not fully training-free.
  • Pixel-level segmentation still lags behind some specialized methods on certain datasets.

Rating

| Dimension | Score |
|---|---|
| Novelty | ⭐⭐⭐⭐⭐ |
| Experimental Thoroughness | ⭐⭐⭐⭐⭐ |
| Writing Quality | ⭐⭐⭐⭐ |
| Value | ⭐⭐⭐⭐⭐ |