Reasoning-Driven Anomaly Detection and Localization with Image-Level Supervision¶
Conference: CVPR 2026 arXiv: 2603.27179 Code: GitHub Area: Reinforcement Learning Keywords: Anomaly Detection and Localization, Reasoning-Driven, Image-Level Supervision, MLLM Attention, Reinforcement Learning
TL;DR¶
This paper proposes two modules, ReAL and CGRO, which extract anomaly-relevant tokens from the autoregressive reasoning process of an MLLM and aggregate their visual attention maps to generate pixel-level anomaly maps. A consistency-guided reinforcement learning scheme then aligns reasoning tokens with visual evidence, enabling end-to-end anomaly detection, localization, and interpretable reasoning under image-level supervision only.
Background & Motivation¶
Industrial anomaly detection faces multiple challenges:

- Limitations of traditional methods: Training product-specific models requires large collections of normal samples, incurring high deployment costs and poor generalization across product lines.
- Existing MLLM-based approaches: Most methods support only image-level detection and textual reasoning, while pixel-level localization still relies on external visual modules (e.g., AnomalyGPT uses pretrained visual experts; EIAD uses SAM), leading to error propagation, reasoning–localization misalignment, and increased deployment complexity.
- End-to-end approaches (e.g., OmniAD): These depend on dense pixel-level annotations and high-quality reasoning annotations, which are costly to obtain and introduce domain bias.
Core observation (Fig. 1): During MLLM text generation, only a small subset of tokens attend to genuine anomalous regions, and these tokens tend to correspond to anomaly-relevant semantics (e.g., "scratch," "mark"). The attention of most reasoning tokens is diffuse or focused on irrelevant regions, diluting localization precision.
Method¶
Overall Architecture¶
Given an image \(\mathbf{X}_v\) and a fixed text prompt ("Are there any defects or anomalies in the image?"), the MLLM generates an output sequence containing a reasoning chain and a final answer. The framework comprises two core modules:

1. ReAL (Reasoning-Driven Anomaly Localization): Selects anomaly-relevant tokens from the reasoning sequence and aggregates their visual attention maps to produce a pixel-level anomaly map.
2. CGRO (Consistency-Guided Reasoning Optimization): Drives reasoning–localization consistency via reinforcement learning, aligning reasoning tokens with visual attention.
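The whole pipeline hinges on reading out each generated token's attention over the image patches. Below is a minimal sketch of that readout, assuming a Hugging Face-style Qwen2.5-VL interface; `model`, `processor`, the image-token positions `image_slice`, and the patch grid `grid_hw` are placeholders, since the paper's summary does not specify extraction code:

```python
import torch

# Minimal readout sketch. `model`/`processor` stand for a Qwen2.5-VL
# checkpoint loaded via Hugging Face transformers (names illustrative).
PROMPT = "Are there any defects or anomalies in the image?"

@torch.no_grad()
def generate_with_visual_attention(model, inputs, image_slice, grid_hw):
    """Generate a reasoning chain and collect each generated token's
    attention over the image patch tokens (head-averaged, last layer).

    inputs:      processor(text=PROMPT, images=img, return_tensors="pt")
    image_slice: key positions of the image tokens in the input sequence
    grid_hw:     (h, w) patch grid of the visual encoder
    """
    out = model.generate(
        **inputs,
        max_new_tokens=256,
        output_attentions=True,        # return per-step attention tensors
        return_dict_in_generate=True,
    )
    h, w = grid_hw
    maps = []
    for step in out.attentions:        # one entry per generated token
        last_layer = step[-1]          # (batch, heads, q_len, k_len)
        a = last_layer[0, :, -1, image_slice].mean(0)  # average over heads
        maps.append(a.reshape(h, w))
    return out.sequences, torch.stack(maps)   # (T_new, h, w) attention maps
```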
Key Designs¶
- Anomaly-Relevant Reasoning Token Identification (core of ReAL): Each reasoning token is evaluated along two complementary dimensions:
- Cross-modal semantic relevance \(S_T^r\): The sum of attention weights from the reasoning token to anomaly-related words ("defect"/"anomaly"/"abnormal") in the input text, measuring semantic association with anomaly concepts.
- Intra-modal attention concentration \(S_I^r\): The visual attention map is binarized, connected components are extracted, and spatial entropy is computed—low entropy indicates attention focused on a specific region (potentially anomalous), while high entropy indicates diffuse attention.
After dual-threshold filtering (\(\hat{S}_T^r > \tau_t\) and \(\hat{S}_I^r > \tau_i\)), the visual attention maps \(\mathbf{A}_{r,I}\) of retained tokens are aggregated with composite weights \(w_r = \alpha\hat{S}_T^r + \beta\hat{S}_I^r\), yielding the reasoning-driven anomaly map \(\mathbf{A}_{\text{RDAM}}\) (see the ReAL sketch after this list).
- Consistency-Guided Reasoning Optimization (CGRO): Addresses inconsistent reasoning under limited supervision (e.g., the model answers "anomaly present" while the reasoning chain describes the image as normal). A class-conditional consistency reward \(\mathcal{R}_{\text{cons}}\) is designed:
- For anomalous images (\(y=1\)): Encourages high spatial consistency (Jaccard Index \(\mathcal{J} > \delta_1\)) among the attention regions of top-\(t\) reasoning tokens.
- For normal images (\(y=0\)): Encourages low spatial consistency (\(\mathcal{J} < \delta_2\)), suppressing spurious focus on benign regions.
The total reward \(\mathcal{R}_{\text{total}} = \mathcal{R}_{\text{fmt}} + \mathcal{R}_{\text{acc}} + \mathcal{R}_{\text{cons}}\) is optimized via the GRPO framework (a consistency-reward sketch also follows this list).
- End-to-End without External Modules: The entire system relies on a single MLLM with no dependency on external segmentation (SAM) or detection modules, achieving true end-to-end anomaly detection, localization, and interpretable reasoning. Training requires only image-level labels (normal/anomalous).
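The following is a minimal sketch of the ReAL scoring and aggregation steps referenced above, assuming per-token attention maps extracted as in the earlier snippet and a precomputed semantic-relevance score \(S_T^r\); the binarization threshold, min-max normalization, and default hyperparameters are illustrative, not the paper's exact formulation:

```python
import numpy as np
from scipy import ndimage

def spatial_entropy(attn_map, thresh=0.5):
    """Entropy over connected components of the binarized attention map;
    low entropy means attention concentrated in one region."""
    binary = attn_map > thresh * attn_map.max()
    labels, n = ndimage.label(binary)
    if n == 0:
        return np.log(attn_map.size)   # no focused region: maximally diffuse
    mass = np.array([attn_map[labels == k].sum() for k in range(1, n + 1)])
    p = mass / mass.sum()
    return float(-(p * np.log(p + 1e-12)).sum())

def real_anomaly_map(attn_maps, s_text, tau_t=0.5, tau_i=0.5, alpha=0.5, beta=0.5):
    """attn_maps: (T, H, W) per-token visual attention maps.
    s_text: (T,) cross-modal relevance S_T^r, i.e. each token's attention
    mass on the words "defect"/"anomaly"/"abnormal" in the prompt."""
    attn_maps = np.asarray(attn_maps)
    s_img = np.array([-spatial_entropy(a) for a in attn_maps])  # concentration
    norm = lambda s: (s - s.min()) / (s.max() - s.min() + 1e-12)
    s_t, s_i = norm(np.asarray(s_text, dtype=float)), norm(s_img)
    keep = (s_t > tau_t) & (s_i > tau_i)       # dual-threshold filtering
    if not keep.any():
        return np.zeros(attn_maps.shape[1:])   # no anomaly-relevant tokens
    w = alpha * s_t[keep] + beta * s_i[keep]   # composite weights w_r
    return np.tensordot(w / w.sum(), attn_maps[keep], axes=1)  # A_RDAM
```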
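Similarly, a hedged sketch of the class-conditional consistency reward \(\mathcal{R}_{\text{cons}}\); the top-\(t\) selection rule, mask binarization, and binary reward values are assumptions, as the summary does not fully specify them:

```python
import numpy as np
from itertools import combinations

def jaccard(a, b):
    """Jaccard index between two binary attention masks."""
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return inter / union if union > 0 else 0.0

def consistency_reward(attn_maps, scores, label, t=5, delta1=0.5, delta2=0.2):
    """Class-conditional consistency reward over the top-t reasoning tokens.
    label=1 (anomalous): reward high pairwise overlap (J > delta1);
    label=0 (normal):    reward low pairwise overlap (J < delta2)."""
    top = np.argsort(scores)[-t:]                        # top-t tokens by score
    masks = [attn_maps[i] > 0.5 * attn_maps[i].max() for i in top]
    pair_j = [jaccard(a, b) for a, b in combinations(masks, 2)]
    mean_j = float(np.mean(pair_j)) if pair_j else 0.0
    if label == 1:
        return 1.0 if mean_j > delta1 else 0.0
    return 1.0 if mean_j < delta2 else 0.0

# Total GRPO reward (format and accuracy rewards defined elsewhere):
# R_total = R_fmt + R_acc + consistency_reward(...)
```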
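For reference, GRPO scores each of the \(G\) sampled candidates with \(\mathcal{R}_{\text{total}}\) and normalizes rewards within the group; this is the standard DeepSeek-R1 formulation rather than anything specific to this paper:

\[
\hat{A}_i = \frac{\mathcal{R}_{\text{total}}^{(i)} - \operatorname{mean}\big(\{\mathcal{R}_{\text{total}}^{(j)}\}_{j=1}^{G}\big)}{\operatorname{std}\big(\{\mathcal{R}_{\text{total}}^{(j)}\}_{j=1}^{G}\big)}, \qquad G = 8 \text{ here.}
\]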
Loss & Training¶
- Built on Qwen2.5-VL-7B with LoRA adapters applied to language and cross-modal layers; the visual encoder is frozen.
- Training data: 4K industrial images drawn from VisA, GoodsAD, Vision, and PR-REAL, with image-level annotations only.
- Batch size of 16 samples; 8 candidate outputs sampled per input (GRPO).
- Images uniformly resized to 420×420.
- Zero-shot evaluation (no domain overlap between training and test sets).
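Putting the listed hyperparameters into a configuration sketch (the LoRA rank/alpha, target module names, and PEFT usage are assumptions; only the batch size, group size, and image resolution come from the summary above):

```python
from peft import LoraConfig

# LoRA on language and cross-modal layers; the visual encoder stays frozen.
# Rank/alpha and module names are illustrative guesses for Qwen2.5-VL-7B.
lora_cfg = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
)

grpo_cfg = {
    "batch_size": 16,         # samples per training batch (from the paper)
    "num_generations": 8,     # candidate outputs sampled per input (GRPO group)
    "image_size": (420, 420), # uniform resize
}
```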
Key Experimental Results¶
Main Results¶
Averages across four benchmarks (MVTec-AD, WFDD, SDD, DTD): AUROC/ACC at image and pixel level, plus reasoning quality (ROUGE-L, SBERT):
| Method | Params | Supervision (T: text, I: image-level, P: pixel-level) | Image-Level AVG (AUROC, ACC) | Pixel-Level AVG (AUROC, ACC) | Reasoning (ROUGE-L, SBERT) |
|---|---|---|---|---|---|
| GPT-4.1 | — | — | 87.2, 88.4 | N/A | 20.8, 69.9 |
| Qwen2.5-VL+CGRO* | 7B | I | 83.9, 86.9 | 80.7, 97.1 | 27.1, 74.7 |
| Qwen2.5-VL+R1* | 7B | I | 80.0, 82.0 | 78.5, 96.7 | 26.3, 73.8 |
| AnomalyGPT | 7B | T+I+P | 71.1, 53.9 | 77.8, 98.4 | 11.9, 36.7 |
| Triad | 7B | T+I | 85.5, 83.8 | N/A | 8.6, 35.9 |
Highlight: Using only image-level supervision, the proposed method achieves localization performance comparable to AnomalyGPT, which relies on dense pixel-level annotations.
Ablation Study¶
ReAL + CGRO ablation (Qwen2.5-VL-7B, average over four datasets):
| Configuration | Image-Level AUROC | Pixel-Level AUROC | Pixel-Level ACC |
|---|---|---|---|
| Vanilla | 63.4 | 64.7 | 73.0 |
| Vanilla + ReAL | 63.4 | 61.7 | 85.6 |
| Vanilla + CGRO | 83.9 | 72.7 | 92.6 |
| Full (ReAL+CGRO) | 83.9 | 80.7 | 97.1 |
Token selection strategy ablation (pixel-level AUROC):

- \(S_I\) only: 74.1
- \(S_T\) only: 76.7
- \(S_T + S_I\) (full): 80.7
Key Findings¶
- ReAL and CGRO are complementary: CGRO improves image-level detection (+20.5 AUROC), while ReAL improves pixel-level localization precision (+8.0 AUROC).
- The consistency reward eliminates reasoning–answer contradictions: without CGRO, the model frequently produces inconsistent outputs ("anomaly detected" but reasoning describes the image as normal) with diffuse attention.
- CGRO provides consistent gains across model scales from 3B to 7B parameters (image-level +15–20 AUROC).
- Improvements in reasoning quality and localization precision are mutually reinforcing.
Highlights & Insights¶
- Deep core insight: The work reveals that anomaly-aware attention patterns naturally emerge during MLLM reasoning; the key is to correctly identify and leverage them rather than introduce external modules.
- High annotation efficiency: Image-level labels—the least costly form of annotation—are sufficient to match methods trained with dense pixel-level supervision.
- Unified three-dimensional capability: A single model simultaneously performs detection, localization, and interpretable reasoning without external modules.
- Elegant consistency reward design: Class-conditional constraints based on the Jaccard Index align reasoning quality with spatial focus.
Limitations & Future Work¶
- Localization precision leaves room for improvement (pixel-level AUPR of 13.3%, well below dedicated segmentation methods).
- Reasoning token selection depends on threshold hyperparameters \(\tau_t, \tau_i\), which may require tuning for different product categories.
- Training images are sourced from other public AD datasets, potentially introducing domain bias.
- GRPO training is computationally expensive (8 candidate outputs per input).
- While the attention mechanism provides strong interpretability, performance on complex multi-defect scenarios remains unexplored.
Related Work & Insights¶
- Comparison with LISA: LISA employs a [SEG] token combined with SAM for reasoning-based segmentation; the proposed method entirely removes the external segmentation module.
- GRPO/R1 paradigm: Follows the reinforcement-learning-based reasoning optimization of DeepSeek-R1, with the novel introduction of a consistency reward.
- Comparison with OmniAD: OmniAD requires dense annotations for end-to-end training, whereas the proposed method needs only image-level labels.
- Broader implications: The attention aggregation strategy is generalizable to other tasks requiring MLLM-based spatial localization, such as referring segmentation and visual grounding.
Rating¶
- Novelty: ⭐⭐⭐⭐⭐ — The idea of activating the intrinsic reasoning potential of MLLMs for pixel-level localization is highly innovative, and the consistency reward is elegantly designed.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ — Four benchmarks, comparisons against diverse MLLMs (including the GPT-4 family), and detailed ablations provide strong empirical support.
- Writing Quality: ⭐⭐⭐⭐ — Motivation is clearly articulated, though the dense notation requires careful reading.
- Value: ⭐⭐⭐⭐⭐ — Substantially reduces annotation costs for industrial anomaly detection and opens new avenues for deploying MLLMs in industrial quality inspection.