MMR-AD: A Large-Scale Multimodal Dataset for Benchmarking General Anomaly Detection with MLLMs

Conference: CVPR 2026
arXiv: 2604.10971
Code: https://xcyao00.github.io/MMR-AD
Area: Multimodal VLM
Keywords: Anomaly Detection, Multimodal Large Language Models, Reasoning Dataset, Reinforcement Learning, General Anomaly Detection

TL;DR

MMR-AD constructs the largest multimodal reasoning-oriented industrial anomaly detection dataset to date (127K images, 188 product categories, 395 anomaly types) and proposes Anomaly-R1, a baseline model trained with GRPO reinforcement learning that significantly outperforms general-purpose MLLMs.

Background & Motivation

Background: Industrial anomaly detection has progressively evolved from single-class to multi-class to cross-class settings, with General Anomaly Detection (GAD) as the ultimate goal: training a unified model to directly detect anomalies in novel categories without retraining. MLLMs, with their powerful visual understanding and language reasoning capabilities, are regarded as a promising vehicle for achieving GAD.

Limitations of Prior Work: (1) MLLM pretraining data exhibits a significant domain gap with industrial AD scenarios; (2) existing AD datasets are image-only and ill-suited for MLLM post-training; (3) existing multimodal AD datasets (MMAD, Anomaly-Instruct-125K) either provide only multiple-choice questions without reasoning, or contain large amounts of non-industrial web data.

Key Challenge: General-purpose MLLMs fall far short of practical requirements for industrial AD, particularly in precise anomaly localization, and addressing this gap requires large-scale, high-quality multimodal AD training data.

Goal: Construct a large-scale reasoning-oriented multimodal AD dataset suitable for both training and evaluation, and validate a reinforcement learning-based AD baseline model.

Core Idea: Manually curate and filter samples from 14 public AD datasets, annotate bounding boxes, automatically generate reasoning-oriented text using a strong MLLM, and train a reasoning-capable AD model via GRPO reinforcement learning.

Method

Overall Architecture

Dataset construction: 14 public AD datasets → manual curation to remove low-quality samples → bounding box and text label annotation → Qwen2.5-VL-72B automatic reasoning text generation (reference image + input image + visual/text prompts) → text consistency verification.
Baseline model: Qwen2.5-VL + LoRA → SFT cold start → GRPO reinforcement learning + contrastive sampling + domain knowledge injection.
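The paper does not spell out how text consistency verification works; a minimal sketch, assuming the check compares the region described by the generated text against the annotated ground-truth box via IoU (the `keep_sample` helper and the 0.5 threshold are illustrative assumptions, not the paper's stated procedure):

```python
def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def keep_sample(described_box, gt_box, threshold=0.5):
    """Accept a generated annotation only if the region the MLLM
    describes overlaps the ground-truth bounding box sufficiently."""
    return iou(described_box, gt_box) >= threshold
```

Samples whose generated reasoning points at the wrong region would be discarded or regenerated under such a filter.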

Key Designs

  1. Reasoning-Oriented Text Generation Pipeline:

    • Function: Generate text annotations with detailed reasoning processes for each AD sample.
    • Mechanism: A paired normal reference image and a query image are provided to Qwen2.5-VL-72B, supplemented by red bounding box visual prompts and text prompts specifying anomaly type and coordinates. The model is instructed to produce responses in a "reason-then-answer" format. Consistency between predicted and ground-truth regions is used for validation.
    • Design Motivation: Anomalies are intrinsically deviations from normality; reference images allow the model to establish a baseline of normalcy. Reasoning-rich text is more conducive to learning step-by-step comparative analysis than simple answer labels.
  2. Contrastive Sampling + Consistency Penalty in GRPO:

    • Function: Enhance reasoning capability and localization precision through reinforcement learning.
    • Mechanism: An outcome reward (+1 for a correct answer) is combined with a consistency penalty (−0.2 for each ground-truth bounding box the model fails to localize). Contrastive sampling guarantees that every query's response group contains both positive and negative examples: ground-truth texts from MMR-AD serve as guaranteed positives, while adversarial prompts generate negatives whenever all sampled responses are positive.
    • Design Motivation: Rewarding answer correctness alone reinforces the "blindly predicting Yes" pattern; the consistency penalty compels the model to genuinely learn anomaly localization. Contrastive sampling addresses the zero-gradient issue that arises in GRPO when all sampled responses are identical.
  3. Domain Knowledge Injection:

    • Function: Guide the model to focus on known anomaly types for specific product categories.
    • Mechanism: Prompts are augmented with statements such as "This product may exhibit the following anomaly types: broken, deformation, …", directing the model to inspect specific defects rather than treating all visual differences as anomalous.
    • Design Motivation: Domain knowledge is necessary to distinguish normal intra-class variation from genuine anomalies in industrial settings.
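The reward design above can be sketched in a few lines. Only the +1 outcome reward and the −0.2-per-missed-box penalty come from the paper; `iou`, the 0.5 IoU threshold, the response format, and the group-repair logic in `contrastive_group` are illustrative assumptions:

```python
def iou(a, b):
    # Intersection-over-union of (x1, y1, x2, y2) boxes.
    ix = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union else 0.0

def outcome_reward(pred_answer, gt_answer):
    # +1 only for a correct anomaly verdict.
    return 1.0 if pred_answer == gt_answer else 0.0

def consistency_penalty(pred_boxes, gt_boxes, iou_thresh=0.5):
    # -0.2 for every ground-truth box that no predicted box covers,
    # so a correct "Yes" with sloppy localization is still penalized.
    missed = sum(1 for g in gt_boxes
                 if not any(iou(p, g) >= iou_thresh for p in pred_boxes))
    return -0.2 * missed

def total_reward(resp, gt_answer, gt_boxes):
    return (outcome_reward(resp["answer"], gt_answer)
            + consistency_penalty(resp["boxes"], gt_boxes))

def contrastive_group(responses, gt_answer, gt_boxes,
                      gt_positive, adversarial_negative):
    # If all sampled rewards are identical, GRPO advantages are all zero
    # and the policy receives no gradient; swap in a guaranteed positive
    # (a ground-truth MMR-AD text) or a guaranteed negative.
    rewards = [total_reward(r, gt_answer, gt_boxes) for r in responses]
    if len(set(rewards)) == 1:
        filler = adversarial_negative if rewards[0] > 0 else gt_positive
        responses = responses[:-1] + [filler]
    return responses
```

Under this reward, a response that answers "Yes" but misses both annotated defects scores 1.0 − 0.4 = 0.6, below a response that answers correctly and localizes both.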

Loss & Training

Training proceeds in two stages: an SFT cold start followed by GRPO reinforcement learning, whose objective combines a PPO-style clipped surrogate with a KL penalty against the reference policy.
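For reference, a standard form of the GRPO objective with the PPO clip and KL penalty (the paper's specific hyperparameter values for ε and β are not restated here):

```latex
J_{\mathrm{GRPO}}(\theta) =
  \mathbb{E}\!\left[\frac{1}{G}\sum_{i=1}^{G}
    \min\!\big(r_i(\theta)\,\hat{A}_i,\;
               \mathrm{clip}(r_i(\theta),\,1-\epsilon,\,1+\epsilon)\,\hat{A}_i\big)\right]
  - \beta\, D_{\mathrm{KL}}\!\big(\pi_\theta \,\|\, \pi_{\mathrm{ref}}\big),
```

where, for a group of G sampled responses per query,

```latex
r_i(\theta) = \frac{\pi_\theta(o_i \mid q)}{\pi_{\theta_{\mathrm{old}}}(o_i \mid q)},
\qquad
\hat{A}_i = \frac{R_i - \mathrm{mean}(R_{1:G})}{\mathrm{std}(R_{1:G})},
```

so advantages are normalized within each group rather than estimated by a learned value function. This group normalization is also why degenerate groups with identical rewards yield zero gradient, motivating the contrastive sampling described above.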

Key Experimental Results

Main Results

| Model | MVTecAD Detection Acc | MVTecAD Localization Acc | VisA Detection Acc |
|---|---|---|---|
| GPT-4o | ~70% | ~30% | ~65% |
| Gemini-2.5 | ~72% | ~35% | ~68% |
| Anomaly-R1-7B | ~85% | ~60% | ~80% |
| Anomaly-R1-7B† (+ Domain Knowledge) | ~88% | ~65% | ~83% |

Ablation Study

| Configuration | Detection | Localization | Note |
|---|---|---|---|
| Full (SFT+RL) | Best | Best | Complete model |
| SFT only | Second best | Moderate | RL yields notable localization gains |
| Direct RL (w/o SFT) | Poor | Poor | Cold start is necessary |
| w/o Consistency Penalty | Good | Poor | Model learns to blindly predict Yes |

Key Findings

  • The strongest general-purpose MLLMs (GPT-4o, Gemini-2.5) remain far below practical requirements for industrial AD, with particularly poor precise localization.
  • Reasoning-oriented text annotations are more beneficial for learning general AD capabilities than simple answer-only annotations.
  • Reinforcement learning yields the most significant improvement over pure SFT in localization precision.
  • Domain knowledge injection further boosts performance.

Highlights & Insights

  • Dataset extensibility: The raw bounding box annotations are provided, allowing future regeneration of text using stronger MLLMs—a forward-looking design choice worth emulating.
  • Consistency penalty: Elegantly incorporates localization precision into the reward function, avoiding the reinforcement learning trap of rewarding "correct but imprecise" predictions.

Limitations & Future Work

  • Reasoning texts are generated by Qwen2.5-VL-72B, introducing model-specific biases.
  • Despite its scale of 127K images, the dataset remains imbalanced across certain categories.
  • Future work may explore additional RL algorithms and larger-scale models.
  • vs. MMAD: MMAD provides only a multiple-choice format unsuitable for training; MMR-AD includes reasoning text enabling model fine-tuning.
  • vs. AnomalyGPT: AnomalyGPT relies on direct SFT without reasoning processes, resulting in poor generalization.

Rating

  • Novelty: ⭐⭐⭐⭐ First large-scale reasoning-oriented AD dataset; the RL baseline has practical value.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ Comprehensive multi-model comparisons, ablations, and RL technique analyses.
  • Writing Quality: ⭐⭐⭐⭐ Dataset construction and methodology are described clearly.
  • Value: ⭐⭐⭐⭐⭐ The dataset makes a substantial contribution to the AD community.