# Kaputt: A Large-Scale Dataset for Visual Defect Detection
Conference: ICCV 2025 · arXiv: 2510.05903 · Code: Dataset · Area: Other · Keywords: defect detection, anomaly detection, large-scale dataset, retail logistics, benchmark
## TL;DR
Kaputt introduces a large-scale retail logistics defect detection dataset comprising 230,000+ images and 48,000+ unique items — 40× the scale of MVTec-AD — and is the first to incorporate significant pose and appearance variation. State-of-the-art anomaly detection methods achieve no more than 56.96% AUROC on this benchmark, exposing critical shortcomings of existing approaches in real-world retail scenarios.
## Background & Motivation
Automated visual defect detection is a critical component of quality assurance. Existing anomaly detection benchmarks (MVTec-AD, VisA) primarily target manufacturing scenarios, characterized by highly controlled object poses and limited categories (15 and 12, respectively). State-of-the-art methods have reached 99.9% AUROC on these datasets, approaching saturation.
Retail logistics scenarios, however, present fundamentally different challenges:
- Extreme item diversity: physical properties vary widely, from food products to electronics
- Highly variable defect types: ranging from subtle wrinkles to severe damage, many of which are difficult even for human inspectors
- Severe sample scarcity: most items are observed only a few times, with limited normal and defective samples alike
- Significant pose variation: items are placed arbitrarily in logistics containers, making pose uncontrollable
Existing datasets fail to capture these challenges. MVTec-AD contains only 5,354 images (1,258 defective), and VisA only 10,821. Leading anomaly detection methods suffer dramatic performance drops when transferred to logistics settings.
Core Problem: How can generalizable defect detection methods be developed when per-item samples are scarce, both normal and defective examples are limited, and intra-class variation is substantial?
## Method
### Overall Architecture
The primary contribution of this paper is the dataset and its accompanying comprehensive evaluation benchmark, rather than a novel method. The dataset design reflects careful engineering considerations.
### Key Designs
- Dataset Structure (see the first sketch after this list):
  - Query set: 100,267 annotated images containing 29,316 defect instances
  - Reference set: 1–3 unannotated "normal" reference images per item (138,154 images total)
  - Item count: 48,376 unique items, with train/val/test splits strictly partitioned by item ID to prevent leakage
  - Resolution: 12MP RGB camera, cropped to 2048×2048 pixels
  - Train/val/test split: 85% / 5% / 10%
- Multi-level Annotation Scheme (see the second sketch after this list):
  - Defect severity: no defect / minor / severe, determined by majority vote among three independent annotators
  - Defect type (7 categories, multi-label): penetration (holes/tears), deformation (dents/crushes), opened (open box/bag), deconstruction, spillage, surface (dirt/scratches), missing unit
  - Item material: cardboard, plastic bag, hard plastic, bubble wrap, paper, books, etc.
  - Deformation is the most common defect type but tends to be minor; spillage and deconstruction are typically severe
- Data Collection Methodology (see the third sketch after this list):
  - Hardware: 12MP RGB camera with a 12 mm lens, top-down capture, uniform LED panel illumination to reduce reflections from plastic packaging
  - Defect sample collection: two-stage strategy of (1) manually flagging defective items and (2) iteratively mining candidates with trained classifiers for manual annotation
  - Quality control: filtering out low-quality images, capping at 15 images per item, balancing the defect rate to 28.6%, and excluding items with no normal samples
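To make the query/reference structure concrete, here is a minimal loading-and-splitting sketch. The record fields (`item_id`, `query_paths`, `reference_paths`) and the split routine are hypothetical illustrations, not the released format; only the counts and the item-ID-disjoint 85/5/10 protocol come from the paper.
```python
import random
from dataclasses import dataclass

@dataclass
class ItemRecord:
    """One logistics item: annotated query images plus 1-3 clean references."""
    item_id: str
    query_paths: list[str]       # annotated images (normal or defective)
    query_severity: list[str]    # "no_defect" / "minor" / "severe", per query image
    reference_paths: list[str]   # 1-3 unannotated "normal" reference images

def split_by_item(records: list[ItemRecord], seed: int = 0):
    """Item-ID-disjoint 85/5/10 split, mirroring the leakage-free protocol."""
    rng = random.Random(seed)
    items = sorted(records, key=lambda r: r.item_id)  # deterministic base order
    rng.shuffle(items)
    n = len(items)
    n_train, n_val = int(0.85 * n), int(0.05 * n)
    return items[:n_train], items[n_train:n_train + n_val], items[n_train + n_val:]
```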
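Severity labels are decided by majority vote over three annotators. A minimal aggregation sketch; the three-way tie-break (falling back to the most severe label) is an assumption, since the paper summary only states majority voting:
```python
from collections import Counter

SEVERITY_ORDER = ["no_defect", "minor", "severe"]  # assumed label encoding

def aggregate_severity(votes: list[str]) -> str:
    """Majority vote over three annotator labels."""
    label, count = Counter(votes).most_common(1)[0]
    if count >= 2:                                  # two or three annotators agree
        return label
    # all three disagree: assumed fallback to the most severe label
    return max(votes, key=SEVERITY_ORDER.index)

assert aggregate_severity(["minor", "minor", "severe"]) == "minor"
assert aggregate_severity(["no_defect", "minor", "severe"]) == "severe"
```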
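The capping and balancing steps map onto simple subsampling. A sketch under stated assumptions; in particular, balancing by dropping normal images (rather than defective ones) is assumed:
```python
import random

def cap_per_item(images_by_item: dict[str, list[str]], cap: int = 15, seed: int = 0):
    """Keep at most `cap` images per item."""
    rng = random.Random(seed)
    return {item: rng.sample(imgs, min(cap, len(imgs)))
            for item, imgs in images_by_item.items()}

def balance_defect_rate(normals: list[str], defects: list[str],
                        target: float = 0.286, seed: int = 0):
    """Subsample normal images so defects make up ~`target` of the pool."""
    rng = random.Random(seed)
    keep = min(len(normals), int(len(defects) * (1 - target) / target))
    return rng.sample(normals, keep), defects
```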
### Loss & Training
Rather than proposing a new method, this paper systematically evaluates four categories of baselines:
- Training-free, reference-free (zero-shot): CLIP, Claude 3.5, Pixtral-12B
- Training-free, reference-based (few-shot anomaly detection): PatchCore, WinCLIP
- Training-based, reference-free (supervised): ResNet50, ViT-S/DINOv2, AutoGluon
- Training-based, reference-based (hybrid): PatchCore with fine-tuned backbone, AutoGluon + reference
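To ground the training-free, reference-based cell of this 2×2 matrix, here is a minimal PatchCore-style scoring sketch: patch embeddings from the 1–3 reference images form a memory bank, and a query image is scored by its worst patch's nearest-neighbor distance. The random arrays stand in for features from a frozen backbone; this illustrates the general idea, not the authors' exact pipeline.
```python
import numpy as np

def anomaly_score(query_patches: np.ndarray, memory_bank: np.ndarray) -> float:
    """PatchCore-style image score.
    query_patches: (Nq, D) patch embeddings of the query image.
    memory_bank:   (Nm, D) patch embeddings pooled from the reference images."""
    # pairwise squared Euclidean distances, shape (Nq, Nm)
    d2 = ((query_patches[:, None, :] - memory_bank[None, :, :]) ** 2).sum(-1)
    nn_dist = np.sqrt(d2.min(axis=1))   # each query patch's nearest reference patch
    return float(nn_dist.max())         # image-level score = worst patch

# toy usage with random stand-in features
rng = np.random.default_rng(0)
bank = rng.normal(size=(500, 128))      # patches from 1-3 reference images
query = rng.normal(size=(200, 128))
print(anomaly_score(query, bank))
```
Under heavy pose variation, normal patches of a re-posed item can land far from every reference patch, which is exactly the failure mode this benchmark exposes.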
## Key Experimental Results
### Main Results
| Method | Type | AP_any (%) ↑ | AP_major (%) ↑ | AUROC (%) ↑ |
|---|---|---|---|---|
| Random | - | 31.84 | 14.00 | 50.00 |
| CLIP | Zero-shot | 36.20 | 17.15 | 56.05 |
| Claude-icl | Zero-shot + context | 36.57 | 24.76 | 56.96 |
| PatchCore50 | Few-shot AD | 35.86 | 17.80 | 54.69 |
| WinCLIP-few | Few-shot AD | 34.05 | 19.29 | 52.41 |
| ResNet50 | Supervised | 81.06 | 74.93 | 88.36 |
| ViT-S | Supervised | 90.67 | 91.45 | 94.27 |
| PatchCore50-ft | Hybrid | 40.18 | 20.98 | 60.14 |
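One way to read the Random row: the average precision of a random scorer is approximately the positive rate, so 31.84% and 14.00% are simply the base rates of any-defect and major-defect queries. A quick scikit-learn sanity check with simulated labels (rates chosen to match; not drawn from the dataset):
```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

rng = np.random.default_rng(0)
n = 100_000
scores = rng.random(n)                    # random anomaly scores
y_any = rng.random(n) < 0.3184            # ~31.84% of queries show some defect
y_major = y_any & (rng.random(n) < 0.44)  # ~14% show a major defect

print(f"AP_any   ~ {average_precision_score(y_any, scores):.2%}")   # ~31.84%
print(f"AP_major ~ {average_precision_score(y_major, scores):.2%}") # ~14.00%
print(f"AUROC    ~ {roc_auc_score(y_any, scores):.2%}")             # ~50.00%
```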
### Ablation Study
Performance degradation when reducing defective training samples:
| Configuration | AP_any (%) | AP_major (%) | AUROC (%) |
|---|---|---|---|
| ViT-S full training set | 90.67 | 91.45 | 94.27 |
| ViT-S 1% defect rate (Query only) | 57.7 | 40.5 | 74.4 |
| ViT-S 1% defect rate (Query + ref) | 40.4 | 14.9 | 63.2 |
Cross-dataset comparison for anomaly detection methods:
| Dataset | AUROC (%) |
|---|---|
| MVTec-AD (SOTA) | 99.9 |
| VisA (SOTA) | 99.5 |
| Kaputt (best unsupervised) | 56.96 |
## Key Findings
- Anomaly detection methods fail comprehensively: All unsupervised/few-shot methods achieve no more than 56.96% AUROC, barely above random chance.
- VLMs are insufficient: Claude/Pixtral can describe objects but fail to detect subtle defects, consistent with findings by Jiang et al.
- Reference images are counterproductive: naively incorporating reference images (e.g., feature averaging; see the sketch after this list) degrades supervised performance from 96% to 87% AP_any on the training set.
- Ceiling of supervised methods: ViT-S achieves 90.67% AP_any, yet still makes errors on deformable items and "adversarial" packaging designs (e.g., packaging printed with hole-like patterns).
- Pose variation is the core challenge: Anomaly detection methods misidentify normal pose and appearance variation as anomalies.
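A minimal sketch of the naive reference fusion that the third finding warns about: average the reference embeddings and feed the query's offset from that mean alongside the query feature. The fusion function and shapes are illustrative assumptions, not the paper's exact scheme.
```python
import numpy as np

def fuse_with_references(query_feat: np.ndarray, ref_feats: np.ndarray) -> np.ndarray:
    """Naive fusion: concatenate the query embedding with its offset from the
    mean reference embedding. Schemes of this kind are reported to *hurt*
    supervised performance (96% -> 87% AP_any)."""
    ref_mean = ref_feats.mean(axis=0)            # average the 1-3 references
    return np.concatenate([query_feat, query_feat - ref_mean])

q = np.ones(128)
refs = np.zeros((3, 128))
print(fuse_with_references(q, refs).shape)       # (256,)
```
Because pose differs between query and references, the offset mixes genuine defects with nuisance variation, which is one plausible reason such fusion hurts.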
## Highlights & Insights
- Genuinely exposes the bottleneck of anomaly detection: the issue is not method inadequacy but a fundamental shift in problem nature — from controlled manufacturing to open retail environments.
- Rigorous dataset design: item-ID-based splits prevent data leakage; three-annotator majority voting ensures label quality; defect rates are aligned with existing benchmarks.
- Four-scenario evaluation framework: the 2×2 matrix of training vs. no training × reference vs. no reference provides a comprehensive perspective.
- Scale advantage: 48K unique items and 29K defect instances constitute the largest benchmark of its kind.
## Limitations & Future Work
- Only a single top-down viewpoint is captured; multi-view information is not exploited.
- Reference image quality is not guaranteed: a small fraction (<1%) of reference images themselves contain defects, potentially introducing noise.
- Annotation errors remain (e.g., non-observable defects due to occlusion; confusion between design patterns and actual defects).
- No pixel-level segmentation annotations are provided, precluding evaluation of defect localization accuracy.
- All experiments use RGB images; depth, infrared, and other modalities are not explored.
## Related Work & Insights
- MVTec-AD and VisA have saturated; Kaputt represents the next frontier for anomaly detection research.
- ARMBench targets a similar scenario but contains only one-quarter as many defective samples as Kaputt and covers only 2 defect types.
- Adapting anomaly detection methods to large intra-class variation remains a key open problem.
- Effective utilization of reference images is an underexplored research direction — naive feature averaging is clearly insufficient.
## Rating
- Novelty: ⭐⭐⭐⭐ Dataset-driven contribution with precise problem formulation, but no methodological innovation
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Four scenarios × multiple methods + training set reduction experiments + detailed error analysis
- Writing Quality: ⭐⭐⭐⭐ Clear structure with well-documented dataset descriptions
- Value: ⭐⭐⭐⭐⭐ Fills the benchmark gap in retail logistics defect detection and will drive community progress