
CoPS: Conditional Prompt Synthesis for Zero-Shot Anomaly Detection

Conference: CVPR 2026
arXiv: 2508.03447
Code: https://github.com/cqylunlun/CoPS
Area: LLM / NLP (Other)
Keywords: zero-shot anomaly detection, conditional prompt synthesis, CLIP, vision-language model, industrial defect

TL;DR

CoPS is a framework that dynamically synthesizes prompts conditioned on visual features through two mechanisms — Explicit State Token Synthesis (ESTS) and Implicit Category Token Sampling (ICTS) — combined with Spatially-Aware Global-local Alignment (SAGA), achieving state-of-the-art zero-shot anomaly detection across 13 industrial and medical datasets.

Background & Motivation

  1. Background: Large-scale pre-trained vision-language models demonstrate strong cross-category generalization for zero-shot anomaly detection (ZSAD). Existing methods typically fine-tune on a single auxiliary dataset and transfer directly to unseen categories.
  2. Limitations of Prior Work: (i) Static learnable tokens fail to capture the continuous and diverse patterns of normal and anomalous states, limiting generalization to unseen categories; (ii) Fixed text labels provide overly sparse category information, causing the model to overfit to specific semantic subspaces.
  3. Key Challenge: While prompt learning eliminates the need for manual prompt design, its static nature and sparsity become generalization bottlenecks — normal/anomalous states are continuously variable, and the category label space itself is highly sparse.
  4. Goal: Design a visual feature-conditioned dynamic prompt synthesis framework enabling prompts to adaptively model the state and category information of input images.
  5. Key Insight: Decompose prompts into three parts — context words, state words, and category words — where context words can be shared while the latter two need to be dynamically generated based on visual features.
  6. Core Idea: Inject normal/anomalous prototypes from local features into state words (explicit), sample from global features via VAE into category words (implicit), achieving visually-conditioned dynamic prompt synthesis.

Method

Overall Architecture

Built on pre-trained CLIP, the input image passes through a frozen visual encoder to extract global features \(\mathbf{g}\) and local features \(\mathbf{F}\). ESTS extracts normal/anomalous prototypes from local features and injects them into state words; ICTS samples diversified category tokens from global features via VAE; finally, a learnable text encoder and SAGA module enable image-level and pixel-level anomaly detection.
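
For readability, here is a minimal PyTorch-style sketch of how the components described above might be wired together. The class name `CoPSSketch`, the submodule interfaces, and the tensor shapes are assumptions for illustration only, not the authors' implementation.

```python
import torch.nn as nn

class CoPSSketch(nn.Module):
    """Illustrative wiring of the CoPS pipeline (names and shapes are assumptions)."""

    def __init__(self, clip_visual, clip_text, ests, icts, saga):
        super().__init__()
        self.visual = clip_visual  # frozen CLIP visual encoder
        self.text = clip_text      # text encoder over synthesized prompts
        self.ests = ests           # Explicit State Token Synthesis
        self.icts = icts           # Implicit Category Token Sampling
        self.saga = saga           # Spatially-Aware Global-local Alignment

    def forward(self, image):
        # Frozen visual encoder yields a global embedding g and local patch features F
        g, F = self.visual(image)              # g: (B, C), F: (B, HW, C)

        # ESTS: normal/anomalous prototypes from local features -> dynamic state tokens
        state_tokens = self.ests(F)            # e.g., (B, 2, M, C)

        # ICTS: VAE sampling from the global feature -> R diverse category tokens
        category_tokens = self.icts(g)         # (B, R, C)

        # Text encoder turns [context | state | category] tokens into prompt embeddings
        text_normal, text_anomalous = self.text(state_tokens, category_tokens)

        # SAGA: distance-aware global-local alignment -> image score and anomaly map
        s_cls, s_seg = self.saga(g, F, text_normal, text_anomalous)
        return s_cls, s_seg
```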

Key Designs

  1. Explicit State Token Synthesis (ESTS):

    • Function: Extracts representative normal and anomalous prototypes from fine-grained local features and explicitly injects them into prompt state words
    • Mechanism: Uses consistency self-attention (V-V attention) to extract fine-grained local features \(\mathbf{F}\) from the frozen visual encoder; a prototype extractor \(\mathcal{P}_\theta\) then generates \(M\) normal prototypes \(\mathbf{P}_n\) and \(M\) anomalous prototypes \(\mathbf{P}_a\) under center constraints, and these are assembled into dynamic state tokens that replace the static learnable tokens (see the sketch after this list)
    • Design Motivation: Fixed state words (e.g., "good"/"damaged") cannot capture continuously diverse normal/anomalous patterns. Extracting prototypes from actual image local features enables adaptive modeling of the current image's state, enhancing generalization
  2. Implicit Category Token Sampling (ICTS):

    • Function: Models semantic global features via VAE and generates diversified category tokens through sampling
    • Mechanism: A variational autoencoder \(\mathcal{E}_\psi\) parameterizes the latent distribution of global features \(\mathbf{g}\), drawing \(R\) decoded samples \(\mathbf{S} \in \mathbb{R}^{R \times C}\) as dense category tokens. Each input image thus generates \(R\) complete sets of normal/anomalous prompts
    • Design Motivation: Fixed text labels are too sparse to provide rich category semantic information. VAE sampling implicitly augments category representation diversity, preventing model overfitting to a single semantic subspace
  3. Spatially-Aware Global-local Alignment (SAGA):

    • Function: Combines distance-aware spatial attention for fine-grained image-text alignment
    • Mechanism: Approximates anomaly state using the distance between query features and the nearest prototypes, introducing a distance-aware spatial attention mechanism to refine pixel-level text-image alignment. Global-local (glocal) similarity interaction enhances image-level alignment. Outputs an image-level anomaly score \(s_{\text{cls}}\) and a pixel-level anomaly map \(\mathcal{S}_{\text{seg}}\) (see the second sketch after this list)
    • Design Motivation: Standard global alignment ignores local spatial information, while anomaly detection inherently requires precise spatial localization
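
To make the two token-synthesis paths concrete, below is a minimal PyTorch sketch of how ESTS and ICTS could be realized. The layer choices, class names (`PrototypeExtractor`, `CategorySampler`), and shapes are assumptions; the center constraint on the prototypes and the VAE's KL regularizer are omitted for brevity.

```python
import torch
import torch.nn as nn

class PrototypeExtractor(nn.Module):
    """ESTS sketch: learnable queries cross-attend to local features to form
    M normal and M anomalous prototypes (layer choices are assumptions)."""

    def __init__(self, dim, num_prototypes):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(2 * num_prototypes, dim))  # normal + anomalous
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)

    def forward(self, local_feats):                      # local_feats: (B, HW, C)
        q = self.queries.unsqueeze(0).expand(local_feats.size(0), -1, -1)
        prototypes, _ = self.attn(q, local_feats, local_feats)   # (B, 2M, C)
        p_normal, p_anom = prototypes.chunk(2, dim=1)            # (B, M, C) each
        return p_normal, p_anom


class CategorySampler(nn.Module):
    """ICTS sketch: a VAE over the global feature g; R reparameterized samples
    are decoded into diverse category tokens."""

    def __init__(self, dim, latent_dim):
        super().__init__()
        self.to_mu = nn.Linear(dim, latent_dim)
        self.to_logvar = nn.Linear(dim, latent_dim)
        self.decode = nn.Linear(latent_dim, dim)

    def forward(self, g, num_samples):                   # g: (B, C)
        mu, logvar = self.to_mu(g), self.to_logvar(g)
        std = torch.exp(0.5 * logvar)
        eps = torch.randn(num_samples, *mu.shape, device=g.device)
        z = mu + eps * std                                # (R, B, latent)
        return self.decode(z).transpose(0, 1)             # (B, R, C) category tokens
```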
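
And one plausible reading of SAGA's distance-aware alignment, again as an illustrative sketch rather than the paper's exact formulation; the temperature, the spatial re-weighting, and the glocal combination below are assumptions.

```python
import torch
import torch.nn.functional as F_nn  # aliased to avoid clashing with local features F

def saga_scores(local_feats, p_normal, p_anom, text_normal, text_anom, tau=0.07):
    """SAGA sketch.

    local_feats: (B, HW, C) patch features
    p_normal, p_anom: (B, M, C) prototypes from ESTS
    text_normal, text_anom: (B, C) prompt embeddings (e.g., averaged over R samples)
    """
    feats = F_nn.normalize(local_feats, dim=-1)

    # Distance to the nearest normal/anomalous prototype approximates the local state
    d_n = torch.cdist(feats, F_nn.normalize(p_normal, dim=-1)).min(dim=-1).values  # (B, HW)
    d_a = torch.cdist(feats, F_nn.normalize(p_anom, dim=-1)).min(dim=-1).values    # (B, HW)
    spatial_weight = torch.softmax((d_n - d_a) / tau, dim=-1)   # anomaly-biased attention

    # Pixel-level text-image alignment, refined by the distance-aware weights
    sim_n = feats @ F_nn.normalize(text_normal, dim=-1).unsqueeze(-1)   # (B, HW, 1)
    sim_a = feats @ F_nn.normalize(text_anom, dim=-1).unsqueeze(-1)     # (B, HW, 1)
    seg_map = torch.softmax(torch.cat([sim_n, sim_a], dim=-1) / tau, dim=-1)[..., 1]
    seg_map = seg_map * (1.0 + spatial_weight)                           # (B, HW)

    # Image-level score from a simple glocal interaction: peak plus mean of the map
    s_cls = 0.5 * (seg_map.max(dim=-1).values + seg_map.mean(dim=-1))
    return s_cls, seg_map
```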

Loss & Training

The model is trained with binary focal loss for image-level classification, and with Dice loss plus binary cross-entropy loss for pixel-level segmentation. It is fine-tuned on a single auxiliary training set (e.g., MVTec AD) and applied directly to unseen categories at test time. A sketch of the composite objective follows.
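
Below is a minimal sketch of how these three terms could be combined, assuming torchvision's `sigmoid_focal_loss` for the focal term and equal loss weights (both assumptions not stated in the paper summary above).

```python
import torch
import torch.nn.functional as F_nn
from torchvision.ops import sigmoid_focal_loss  # binary focal loss

def cops_loss(s_cls, seg_map, label, mask, w_focal=1.0, w_dice=1.0, w_bce=1.0):
    """Sketch of the composite training objective (weights are assumptions).

    s_cls:   (B,) image-level anomaly logits
    seg_map: (B, H, W) pixel-level anomaly logits
    label:   (B,) image labels, 1 = anomalous
    mask:    (B, H, W) ground-truth anomaly masks
    """
    # Image-level: binary focal loss
    l_focal = sigmoid_focal_loss(s_cls, label.float(), reduction="mean")

    # Pixel-level: Dice loss + binary cross-entropy
    probs = torch.sigmoid(seg_map)
    inter = (probs * mask).sum(dim=(1, 2))
    l_dice = 1.0 - (2 * inter + 1.0) / (probs.sum(dim=(1, 2)) + mask.sum(dim=(1, 2)) + 1.0)
    l_bce = F_nn.binary_cross_entropy_with_logits(seg_map, mask.float())

    return w_focal * l_focal + w_dice * l_dice.mean() + w_bce * l_bce
```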

Key Experimental Results

Main Results

| Dataset | Metric | CoPS | Prev. SOTA | Gain |
| --- | --- | --- | --- | --- |
| 13-dataset avg. | Cls AUROC | SOTA | - | +1.4% |
| 13-dataset avg. | Seg AUROC | SOTA | - | +1.9% |
| MVTec AD | Cls AUROC | Best | AnomalyCLIP, etc. | Significant |
| VisA | Seg AUROC | Best | - | Clear advantage |

Ablation Study

| Configuration | Key Metric | Note |
| --- | --- | --- |
| Full CoPS | Best | Complete model |
| w/o ESTS | Decreased | Removing explicit state synthesis has the largest impact |
| w/o ICTS | Decreased | Removing implicit category sampling also has a notable impact |
| w/o SAGA | Decreased | Spatially-aware alignment is especially important for segmentation |
| Static prompt baseline | Significantly below CoPS | Validates the necessity of dynamic prompts |

Key Findings

  • ESTS contributes the most, indicating that adaptive state modeling is the core challenge in zero-shot anomaly detection
  • ICTS's VAE sampling effectively mitigates category label sparsity, especially in cross-domain scenarios (industrial → medical)
  • Distance-aware spatial attention significantly improves pixel-level segmentation quality but has less impact on image-level classification

Highlights & Insights

  • Prompt decomposition design philosophy is elegant: shared context words + explicitly injected state words + implicitly sampled category words, each serving its purpose
  • VAE implicit augmentation is a clever trick: replacing fixed labels with sampling naturally increases category representation diversity
  • The use of consistency self-attention (V-V) avoids introducing additional adaptation modules, preserving CLIP features' original semantics

Limitations & Future Work

  • Relies on CLIP's pre-trained feature space; may have limited effectiveness for visual domains not covered by CLIP (e.g., specialized industrial scenarios)
  • Prototype count M and sampling count R require manual tuning
  • Future work could explore adaptive prototype count determination or replacing CLIP with stronger visual foundation models

Comparison with Prior Methods

  • vs AnomalyCLIP: AnomalyCLIP uses static learnable tokens without visual conditioning; this work overcomes that limitation through explicit/implicit injection
  • vs AdaCLIP: AdaCLIP relies on hand-designed template sets; this work eliminates manual design through end-to-end learning
  • vs VCP-CLIP: VCP-CLIP directly embeds image features into category words; this work provides richer semantic diversity through VAE sampling

Rating

  • Novelty: ⭐⭐⭐⭐ The explicit + implicit dual-pathway dynamic prompt synthesis is a novel combination
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ Comprehensive validation across 13 datasets with complete ablations
  • Writing Quality: ⭐⭐⭐⭐ Clear structure with well-explained methods
  • Value: ⭐⭐⭐⭐ Practical advancement in zero-shot anomaly detection