AnomalyVFM: Transforming Vision Foundation Models into Zero-Shot Anomaly Detectors

Conference: CVPR 2026
arXiv: 2601.20524
Code: Project Page
Area: Multimodal VLM / Anomaly Detection
Keywords: Zero-shot anomaly detection, vision foundation models, synthetic data, parameter-efficient fine-tuning, LoRA

TL;DR

AnomalyVFM proposes a general framework that transforms arbitrary Vision Foundation Models (VFMs) into strong zero-shot anomaly detectors via a three-stage synthetic data generation pipeline and parameter-efficient LoRA adaptation, achieving 94.1% image-level AUROC across 9 industrial datasets with RADIO as the backbone, surpassing the previous SOTA by 3.3 percentage points.

Background & Motivation

Background: Zero-shot anomaly detection requires detecting anomalies on unseen object categories without any in-domain images. Current SOTA methods (AnomalyCLIP, AdaCLIP, etc.) rely on high-level conceptual knowledge from vision-language models such as CLIP.

Limitations of Prior Work:

  • Pure VFMs (e.g., DINOv2) possess stronger visual representations yet underperform VLM-based methods in zero-shot anomaly detection, a counterintuitive result given that anomaly detection is inherently a visual task.
  • Reason 1: Existing auxiliary anomaly datasets lack sufficient diversity; VLMs compensate through reliable high-level conceptual knowledge, whereas VFMs cannot fall back on such knowledge.
  • Reason 2: Existing VFM adaptation strategies are too shallow (training only the output head) and do not alter the internal visual representations.

Key Challenge: VFMs possess stronger visual representation capacity, but lack diverse training data and effective deep adaptation methods to unlock their potential.

Key Insight: Both bottlenecks must be addressed simultaneously, by synthesizing large-scale diverse training data and by performing deep LoRA-based adaptation of the backbone.

Core Idea: Generative data + parameter-efficient backbone adaptation + confidence-weighted loss = unlocking VFMs' potential for zero-shot anomaly detection.

Method

Overall Architecture

Three-stage data generation → LoRA injection into VFM Transformer → lightweight decoder → confidence-weighted loss training → direct output of anomaly segmentation maps and image-level scores at inference.

Key Designs

  1. Three-Stage Synthetic Data Generation:

    • Stage 1 — Normal Image Generation: A FLUX model generates defect-free object images \(I = G(p)\) from text prompts, covering 100 object categories × 50 background types generated by an LLM.
    • Stage 2 — Anomaly Image Generation: Foreground masks are extracted → anomalous regions \(R\) are randomly sampled → local inpainting is performed with anomaly prompts (e.g., cracked, damaged), where anomaly descriptions are generated by an LLM for each object category.
    • Stage 3 — Data Filtering: DINOv2 features are extracted from normal and anomalous images; cosine distance is computed as a distance score \(D\) to filter samples where anomalies were not successfully generated (\(D < T\)); anomaly masks \(M\) are obtained via thresholding.
    • Design Motivation: Existing datasets (MVTec, VisA) lack diversity. The synthesis pipeline requires no real samples and can be scaled to unlimited object categories and anomaly types. The filtering step ensures data quality.
  2. Feature Adaptation Module:

    • Function: LoRA is injected into every Transformer block of the VFM to adapt its internal representations.
    • Mechanism: LoRA (rank=64) is injected into the Query, Value, and Output projection layers of the attention mechanism.
    • Decoder: Two upsampling blocks (Conv + GroupNorm + ReLU + bilinear upsampling) followed by a final convolutional layer that outputs the anomaly segmentation map \(M_o\) and confidence map \(c\).
    • Image-level Score: Predicted by a linear layer from the [CLS] token.
    • Design Motivation: Training only the output head (as in prior methods) cannot alter VFM internal feature representations, limiting the ability to discriminate normal from anomalous samples. LoRA adapts all layers with fewer than 1% additional parameters.
  3. Confidence-Weighted Loss:

    • Function: Reduces the loss contribution from uncertain regions in synthetic annotations.
    • Mechanism: \(\mathcal{L}_{seg} = \mathcal{L}_{base}(M_o, M_{GT}) \cdot C - \alpha \log(C)\) where \(C = 1 + \exp(c)\) and \(c\) is the confidence map predicted by the decoder.
    • Base loss: \(\mathcal{L}_{base} = \ell_1 + 5 \cdot \ell_{focal}\)
    • Design Motivation: Anomaly masks derived from synthetic data inevitably contain noise. The confidence weighting allows the model to down-weight the loss in uncertain regions, preventing erroneous guidance from noisy annotations.
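The Stage 3 filtering rule can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's code: the DINOv2 feature extractor is stubbed out as a callable argument, and the function names and threshold value are hypothetical.

```python
import numpy as np

def cosine_distance(f_normal: np.ndarray, f_anomaly: np.ndarray) -> float:
    """Distance score D = 1 - cosine similarity between two feature vectors."""
    sim = float(f_normal @ f_anomaly /
                (np.linalg.norm(f_normal) * np.linalg.norm(f_anomaly)))
    return 1.0 - sim

def filter_pairs(pairs, extract, threshold):
    """Keep only (normal, anomaly) image pairs whose feature distance D
    reaches the threshold T; D < T suggests the inpainting step failed to
    actually introduce an anomaly, so the sample is discarded."""
    kept = []
    for normal_img, anomaly_img in pairs:
        D = cosine_distance(extract(normal_img), extract(anomaly_img))
        if D >= threshold:
            kept.append((normal_img, anomaly_img))
    return kept
```

In the real pipeline, `extract` would embed images with a frozen DINOv2 model; here any feature function works, which is what makes the rule model-agnostic.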
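The LoRA mechanism in the Feature Adaptation Module amounts to adding a trainable low-rank update to each frozen projection weight. A minimal NumPy sketch (class name, initialization scale, and scaling convention are illustrative assumptions; the paper injects rank-64 adapters into the query, value, and output projections):

```python
import numpy as np

class LoRALinear:
    """Frozen pretrained weight W plus a trainable low-rank update
    (alpha / rank) * B @ A, as injected into the Q, V, and output
    projections of each attention block."""
    def __init__(self, W, rank=64, alpha=64, rng=None):
        rng = rng or np.random.default_rng(0)
        d_out, d_in = W.shape
        self.W = W                                        # frozen
        self.A = rng.normal(0.0, 0.01, size=(rank, d_in)) # trainable down-proj
        self.B = np.zeros((d_out, rank))                  # trainable up-proj, zero-init
        self.scale = alpha / rank

    def __call__(self, x):
        W_eff = self.W + self.scale * (self.B @ self.A)
        return x @ W_eff.T
```

Because B is zero-initialized, the adapted layer starts out exactly equal to the frozen one, and the adapter adds only rank * (d_in + d_out) parameters per layer, which is how the whole backbone can be adapted with under 1% extra parameters.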

Loss & Training

  • Total loss: \(\mathcal{L} = \mathcal{L}_{seg} + \mathcal{L}_{img}\) (Focal Loss for image-level supervision)
  • Model-agnostic: applicable to any Transformer-based VFM backbone.
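The confidence-weighted segmentation term from the Key Designs section can be sketched directly from its formula. This assumes the per-pixel base loss \(\mathcal{L}_{base}\) is a precomputed array; the function name and the value of \(\alpha\) are illustrative.

```python
import numpy as np

def confidence_weighted_loss(base_loss, c, alpha=1.0):
    """L_seg = mean( L_base * C - alpha * log(C) ), with C = 1 + exp(c).
    Setting the derivative w.r.t. C to zero gives C = alpha / L_base
    (clipped at C >= 1), so pixels with large base loss -- likely noisy
    synthetic labels -- are assigned low confidence and down-weighted."""
    C = 1.0 + np.exp(c)
    return float(np.mean(base_loss * C - alpha * np.log(C)))
```

The second term acts as a regularizer: without \(-\alpha \log(C)\), the model could trivially drive every confidence toward its minimum to shrink the loss.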

Key Experimental Results

Main Results (Zero-shot, Image-level AUROC; five of the nine industrial datasets shown, Average computed over all nine)

| Method      | MVTec AD | VisA | BTAD | RealIAD | DTD  | Average |
|-------------|----------|------|------|---------|------|---------|
| WinCLIP     | 91.8     | 78.1 | 68.2 | 74.7    | 95.1 | 83.2    |
| AnomalyCLIP | 91.6     | 82.0 | 88.2 | 78.7    | 93.9 | 87.6    |
| Bayes-PFL   | 92.3     | 87.0 | 93.2 | 85.2    | 95.1 | 90.8    |
| AnomalyVFM  | 94.9     | 93.6 | 96.0 | 88.0    | 99.4 | 94.1    |

Ablation Study (VFM Generality Validation)

| Backbone | Synthetic Data | LoRA Adaptation | Image AUROC | Pixel AUROC  | Gain         |
|----------|----------------|-----------------|-------------|--------------|--------------|
| DINOv2   | ✗              | ✗               | 83.0        | 80.4         | Baseline     |
| DINOv2   | ✓              | ✓               | 90.2 (+7.2) | 93.4 (+13.0) | Significant  |
| RADIO    | ✗              | ✗               | 89.1        | 84.9         | Baseline     |
| RADIO    | ✓              | ✓               | 94.1 (+5.0) | 96.9 (+12.0) | Best overall |

Key Findings

  • Both synthetic data and LoRA adaptation individually yield significant improvements; their combination achieves the best results.
  • The framework is effective across three VFMs (DINOv2, DINOv3, RADIO), demonstrating its generality.
  • Pixel-level AUROC improvements are particularly notable: RADIO improves from 84.9 to 96.9 (+12.0).
  • Strong performance is also observed on medical anomaly detection datasets without additional fine-tuning.

Highlights & Insights

  • The core finding is insightful: VFMs' underperformance in zero-shot anomaly detection is not a capability issue, but rather a consequence of insufficient data and inadequate adaptation strategies.
  • The synthetic data pipeline is highly scalable and does not depend on any real anomaly samples.
  • The confidence-weighted loss provides an elegant solution to the noise inherent in synthetic annotations.
  • The framework is broadly applicable across different VFM backbones.

Limitations & Future Work

  • Data generation quality depends on the FLUX model's generation fidelity and prompt coverage.
  • The LoRA rank of 64 is relatively large; whether smaller ranks suffice remains underexplored.
  • Pixel-level performance on certain specialized domains (e.g., KSDD steel surface inspection) remains suboptimal.
  • The synthetic anomaly concept is conceptually related to DRÆM, but without requiring real normal samples.
  • The confidence-weighted loss is analogous to uncertainty modeling approaches in NeRF.

Rating

  • Novelty: ⭐⭐⭐⭐ Addresses the key question of why VFMs underperform VLMs in zero-shot anomaly detection.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ Evaluated on 9 industrial and medical datasets with three VFM backbones.
  • Writing Quality: ⭐⭐⭐⭐ Problem analysis is thorough and method motivation is clearly articulated.
  • Value: ⭐⭐⭐⭐⭐ Opens a new pathway for applying VFMs to anomaly detection.