Bidirectional Multimodal Prompt Learning with Scale-Aware Training for Few-Shot Multi-Class Anomaly Detection

Conference: CVPR 2026 arXiv: 2408.13516 Code: N/A Area: Medical Imaging / Industrial Anomaly Detection Keywords: Few-shot anomaly detection, multi-class unified model, bidirectional prompt learning, scale-aware training, CLIP

TL;DR

This paper proposes AnoPLe, a lightweight multimodal bidirectional prompt learning framework that requires neither hand-crafted anomaly descriptions nor external auxiliary modules. Through bidirectional text–visual prompt interaction and a scale-aware prefix, AnoPLe performs few-shot multi-class anomaly detection, delivering strongly competitive results on MVTec-AD/VisA/Real-IAD while maintaining efficient inference (~28 FPS).

Background & Motivation

Background: Industrial anomaly detection is shifting from "one model per class" toward more practical few-shot, multi-class unified models. Few-shot multi-class anomaly detection (MCAD) requires: (a) only a few normal samples per class; (b) a single model covering multiple product categories.

Limitations of Prior Work:

  • WinCLIP relies on hand-crafted prompt templates, lacking flexibility.
  • PromptAD depends on class-specific anomaly descriptions (e.g., "broken fabric," "missing wire"); under multi-class settings, the description pool suffers semantic conflicts, and t-SNE visualizations reveal collapsed and aliased anomaly features.
  • IIPAD employs a large Q-Former to generate instance prompts, incurring substantial computational overhead.

Core Observations: (a) Normality is shared across classes (intact, uncontaminated, geometrically regular); (b) anomalies are highly class-dependent; (c) class names themselves serve as strong semantic priors (validated in the CCL paper).

Core Idea: Use only class names as textual anchors (without describing anomaly types), and leverage bidirectional text–visual prompt interaction to automatically learn class-aware, anomaly-type-agnostic representations.

Method

Overall Architecture

CLIP backbone → deep learnable text prompts + visual prompts → bidirectional matching interaction → scale-aware prefix (global + local training, global-only inference) → lightweight decoder producing pixel-level anomaly maps → anomaly scoring combined with a memory bank.
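
Below is a minimal sketch of this inference flow as I read it. All module names (encode_image, encode_text, decoder, memory_bank.nn_distance) and the prefix keyword are hypothetical placeholders, not the authors' released API:

```python
import torch

def anople_inference(image, class_name, clip_model, decoder, memory_bank):
    """Illustrative AnoPLe-style inference; names and shapes are assumptions."""
    # Encode the image with learned visual prompts and the *global* scale
    # prefix only (the local-crop prefixes are used during training only).
    patch_tokens, cls_token = clip_model.encode_image(image, prefix="global")

    # Text anchors use only the class name: [class] vs. [abnormal][class].
    t_normal = clip_model.encode_text(class_name)
    t_abnormal = clip_model.encode_text(f"abnormal {class_name}")

    # Lightweight decoder yields a pixel-level anomaly map from patch tokens.
    pixel_map = decoder(patch_tokens, torch.stack([t_normal, t_abnormal]))

    # Fuse decoder evidence with few-shot memory-bank evidence (harmonic mean).
    s_dec = pixel_map.flatten(1).max(dim=1).values   # decoder image-level score
    s_mem = memory_bank.nn_distance(patch_tokens)    # nearest-normal distance
    s_img = 2 * s_dec * s_mem / (s_dec + s_mem + 1e-8)
    return pixel_map, s_img
```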

Key Designs

  1. Text Prompt Learning:

    • Normal: \(\mathbf{e}_0^+ = [\texttt{class}]\); Anomalous: \(\mathbf{e}_0^- = [\texttt{abnormal}][\texttt{class}]\)
    • Shared learnable context vector \(\mathbf{P}_0^t\) with layer-wise deep prompts.
    • Design Motivation: Using only the class name together with a unified "abnormal" token — the normal branch preserves a complete class prototype, while the anomalous branch is initialized from CLIP's implicitly encoded "non-normal" prior and subsequently refined through visual interaction.
    • Key Distinction: No hand-crafted anomaly descriptions such as "broken," "contaminated," or "missing" are required.
  2. Bidirectional Prompt Interaction: At each layer \(j\), learnable linear projections \(f_{v\rightarrow t}\) and \(f_{t\rightarrow v}\) fuse the two modalities (see the first sketch after this list): \(\tilde{\mathbf{P}}_j^t = [\mathbf{P}_j^t, f_{v\rightarrow t}(\mathbf{P}_j^v)], \quad \tilde{\mathbf{P}}_j^v = [\mathbf{P}_j^v, f_{t\rightarrow v}(\mathbf{P}_j^t)]\)

    • Why Bidirectional: Ablations show that unidirectional T→I (84.2% VisA I-AUC) and I→T (82.0%) are both inferior to bidirectional T↔I (86.0%). Text provides categorical structure; vision provides instance-level details; the two modalities are mutually complementary.
  3. Scale-Aware Prefix:

    • Training: a global view \(I_0\) (the image resized to 240×240) plus four non-overlapping local crops \(I_1,\ldots,I_4\) (the image resized to 480×480 and split into four 240×240 tiles).
    • A bank of learnable prefixes \(\mathbf{c} \in \mathbb{R}^{(N+1) \times d_v}\), one row per view; the input at scale \(i\) is prepended with the corresponding \(c_i\).
    • Inference uses the global prefix only: zero additional inference cost.
    • Ablation: single-scale without prefix 89.9% → multi-scale without prefix 91.8% → multi-scale with prefix 94.5%.
  4. Alignment Loss: \(\mathbf{s} = \sum_{(i,j)} \hat{\mathbf{M}}_{ij} \circ D_{ij}(\mathbf{z}), \quad \mathcal{L}_{align} = 1 - \langle \mathbf{z}_0, \mathbf{s} \rangle\). Pixel-level anomaly evidence is aggregated with learned weighting and aligned with the global [CLS] representation, ensuring consistency between local and global decisions (see the second sketch below).
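
A minimal PyTorch sketch of the bidirectional prompt interaction at one encoder layer, under my reading of the equation in item 2. Dimensions, the module name, and prompt counts are assumptions, not the authors' code:

```python
import torch
import torch.nn as nn

class BiPromptCoupler(nn.Module):
    """Couples text and visual prompts at one encoder layer (illustrative)."""
    def __init__(self, d_text: int = 512, d_vis: int = 768):
        super().__init__()
        self.f_v2t = nn.Linear(d_vis, d_text)   # f_{v->t}
        self.f_t2v = nn.Linear(d_text, d_vis)   # f_{t->v}

    def forward(self, P_t: torch.Tensor, P_v: torch.Tensor):
        # P_t: (n_t, d_text) learnable text context at layer j
        # P_v: (n_v, d_vis)  learnable visual prompts at layer j
        P_t_tilde = torch.cat([P_t, self.f_v2t(P_v)], dim=0)  # text sees vision
        P_v_tilde = torch.cat([P_v, self.f_t2v(P_t)], dim=0)  # vision sees text
        return P_t_tilde, P_v_tilde

coupler = BiPromptCoupler()
P_t = nn.Parameter(torch.randn(4, 512))
P_v = nn.Parameter(torch.randn(4, 768))
P_t_tilde, P_v_tilde = coupler(P_t, P_v)  # each modality now carries 8 tokens
```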
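And a sketch of the scale-aware prefix plus the alignment loss, again with assumed shapes (\(N = 4\) crops; M_hat and D_z are stand-ins for the learned weighting \(\hat{\mathbf{M}}\) and the decoder evidence \(D_{ij}(\mathbf{z})\) above; embeddings are assumed L2-normalized so the inner product reduces to a cosine similarity):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

N, d_v = 4, 768
prefix = nn.Parameter(torch.randn(N + 1, d_v))  # c_0: global view, c_1..c_4: crops

def prepend_prefix(tokens: torch.Tensor, view_idx: int) -> torch.Tensor:
    # tokens: (L, d_v) patch embeddings of one view. During training view_idx
    # ranges over 0..N; at inference only view_idx = 0 (global) is used.
    return torch.cat([prefix[view_idx].unsqueeze(0), tokens], dim=0)

def alignment_loss(z0: torch.Tensor, M_hat: torch.Tensor, D_z: torch.Tensor):
    # z0:    (d,)      global [CLS] embedding
    # M_hat: (H, W)    learned per-location weights
    # D_z:   (H, W, d) per-location decoder evidence
    s = (M_hat.unsqueeze(-1) * D_z).sum(dim=(0, 1))   # weighted aggregation
    return 1.0 - F.cosine_similarity(z0, s, dim=0)    # 1 - <z_0, s>
```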

Loss & Training

  • \(\mathcal{L} = \mathcal{L}_{pixel} + \mathcal{L}_{img} + \mathcal{L}_{align}\) (assembled in the sketch after this list).
  • Pixel-level: Dice loss + Focal loss.
  • Image-level: contrastive cross-entropy (with pseudo-anomalies generated via pixel-space and latent-space perturbations).
  • At inference, memory-bank similarity scores and decoder outputs are fused via the harmonic mean.
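
A compact sketch of how the three terms might be assembled per batch. The Dice and Focal implementations are generic textbook versions, not necessarily the paper's exact formulations, and alignment_loss is the helper sketched above:

```python
import torch
import torch.nn.functional as F

def dice_loss(pred, target, eps=1e-6):
    # pred, target: (B, H, W) anomaly probabilities and pseudo-anomaly masks
    inter = (pred * target).sum(dim=(1, 2))
    union = pred.sum(dim=(1, 2)) + target.sum(dim=(1, 2))
    return (1 - (2 * inter + eps) / (union + eps)).mean()

def focal_loss(pred, target, gamma=2.0):
    bce = F.binary_cross_entropy(pred, target, reduction="none")
    p_t = pred * target + (1 - pred) * (1 - target)   # prob of the true class
    return ((1 - p_t) ** gamma * bce).mean()

def total_loss(pred_map, gt_map, img_logits, img_labels, l_align):
    # Pixel level: gt_map comes from pseudo-anomalies generated by pixel-space
    # and latent-space perturbations of normal samples.
    l_pixel = dice_loss(pred_map, gt_map) + focal_loss(pred_map, gt_map)
    # Image level: contrastive cross-entropy against normal/abnormal anchors.
    l_img = F.cross_entropy(img_logits, img_labels)
    return l_pixel + l_img + l_align
```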

Key Experimental Results

Main Results (1-shot Multi-Class AD)

Method        MVTec I-AUC   MVTec P-PRO   VisA I-AUC   VisA P-PRO   Real-IAD I-AUC
PatchCore     66.5          66.9          69.8         70.0         59.3
WinCLIP       77.5          70.8          70.0         61.2         69.4
PromptAD      91.2          86.1          82.4         77.8         52.2
INP-Former    94.7          90.7          84.0         84.0         84.4
AnoPLe        94.5          90.8          86.0         87.5         81.2

Ablation Study

Configuration              MVTec I-AUC   VisA I-AUC               Note
Text prompt only           93.0          81.7                     Lacks instance-level details
Visual prompt only         90.3          82.0                     Lacks categorical structure
T→I unidirectional         93.5          84.2                     Text guides vision
T↔I bidirectional          94.5          86.0                     Bidirectional complementarity is optimal
w/o alignment loss         93.7          86.0 → 73.3 (Real-IAD)   Large impact on Real-IAD
w/o multi-scale / prefix   89.9          82.3                     Both are indispensable

Key Findings

  • AnoPLe leads on VisA (the most challenging cross-class benchmark) with 86.0% I-AUC; its P-PRO of 87.5% substantially surpasses PromptAD's 77.8%.
  • Unseen-class generalization (leave-one-class-out): AnoPLe drops only 6.3% on held-out MVTec classes, versus a 26.2% drop for INP-Former.
  • Unseen anomaly type generalization: removing descriptions from PromptAD causes a large performance drop (−10.0 on screw), whereas AnoPLe is inherently unaffected.
  • Inference speed ~28 FPS, significantly faster than IIPAD (which requires an additional Q-Former forward pass).

Highlights & Insights

  • Elegance through simplicity: Without anomaly descriptions or external large modules, AnoPLe reaches state-of-the-art-level performance using only class names and bidirectional interaction.
  • The scale-aware prefix design — multi-scale during training, global-only at inference — elegantly resolves the efficiency–accuracy trade-off.
  • The paper highlights the often-overlooked asymmetry: normality is shared across classes, whereas anomalies are class-dependent.
  • The alignment loss contributes most significantly on large-scale multi-class data (Real-IAD, 30 categories).

Limitations & Future Work

  • AnoPLe trails INP-Former by ~3% on Real-IAD; purely visual methods may be better suited to certain scenarios.
  • The CLIP ViT-B/16+ backbone caps achievable performance; larger backbones may yield further gains.
  • The pseudo-anomaly generation strategy (pixel-space and latent-space perturbations) may not closely approximate real defects.
  • The zero-shot setting (no normal samples whatsoever) remains unexplored.
  • Aligned with the CCL perspective (class-aware contrastive learning): class semantics serve as a strong prior for organizing multi-class representations.
  • The bidirectional prompt interaction idea is generalizable to other VLM adaptation tasks (e.g., medical image analysis).
  • The scale-aware training strategy offers transferable value to any task requiring multi-scale reasoning under efficiency constraints.
  • Generalizing to medical domains (e.g., fundus images, X-rays) would further validate the framework's universality.

Rating

  • Novelty: ⭐⭐⭐⭐ Bidirectional prompt interaction and scale-aware prefix design are novel, though the overall approach follows the VLM prompt tuning paradigm.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ Three major benchmarks, multiple shot settings, generalization experiments, t-SNE visualizations, and attention analysis are all comprehensive.
  • Writing Quality: ⭐⭐⭐⭐ Motivation is clearly articulated; comparison with prior methods is thorough.
  • Value: ⭐⭐⭐⭐⭐ Highly practical — lightweight, efficient, free of expert knowledge, and well-suited for real-world industrial deployment.