
Kaputt: A Large-Scale Dataset for Visual Defect Detection

Conference: ICCV 2025 · arXiv: 2510.05903 · Code: Dataset · Area: Other · Keywords: defect detection, anomaly detection, large-scale dataset, retail logistics, benchmark

TL;DR

Kaputt introduces a large-scale retail logistics defect detection dataset comprising 230,000+ images and 48,000+ unique items — 40× the scale of MVTec-AD — and is the first to incorporate significant pose and appearance variation. State-of-the-art anomaly detection methods achieve no more than 56.96% AUROC on this benchmark, exposing critical shortcomings of existing approaches in real-world retail scenarios.

Background & Motivation

Automated visual defect detection is a critical component of quality assurance. Existing anomaly detection benchmarks (MVTec-AD, VisA) primarily target manufacturing scenarios, characterized by highly controlled object poses and limited categories (15 and 12, respectively). State-of-the-art methods have reached 99.9% AUROC on these datasets, approaching saturation.

Retail logistics scenarios, however, present fundamentally different challenges:

Extreme item diversity: physical properties vary widely, from food products to electronics

Highly variable defect types: ranging from subtle wrinkles to severe damage, many of which are hard even for human inspectors to identify

Severe sample scarcity: most items are observed only a few times, with limited normal and defective samples alike

Significant pose variation: items are placed arbitrarily in logistics containers, making pose uncontrollable

Existing datasets fail to capture these challenges. MVTec-AD contains only 5,354 images (1,258 defective), and VisA only 10,821. Leading anomaly detection methods suffer dramatic performance drops when transferred to logistics settings.

Core Problem: How can generalizable defect detection methods be developed when per-item samples are scarce, both normal and defective examples are limited, and intra-class variation is substantial?

Method

Overall Architecture

The primary contribution of this paper is the dataset and its accompanying comprehensive evaluation benchmark, rather than a novel method. The dataset design reflects careful engineering considerations.

Key Designs

  1. Dataset Structure:

    • Query set: 100,267 annotated images containing 29,316 defect instances
    • Reference set: 1–3 unannotated "normal" reference images per item (138,154 images total)
    • Item count: 48,376 unique items, with train/val/test splits strictly partitioned by item ID to prevent leakage (see the split sketch after this list)
    • Resolution: 12MP RGB camera, cropped to 2048×2048 pixels
    • Train/val/test split: 85% / 5% / 10%
  2. Multi-level Annotation Scheme:

    • Defect severity: no defect / minor / severe, determined by majority vote among three independent annotators
    • Defect type (7 categories, multi-label): penetration (holes/tears), deformation (dents/crushes), opened (open box/bag), deconstruction, spillage, surface (dirt/scratches), missing unit
    • Item material: cardboard, plastic bag, hard plastic, bubble wrap, paper, books, etc.
    • Deformation is the most common defect type but tends to be minor; spillage and deconstruction are typically severe
  3. Data Collection Methodology:

    • Hardware: 12MP RGB camera with a 12 mm lens, top-down capture, uniform LED panel illumination to reduce reflections off plastic packaging
    • Defect sample collection: two-stage strategy — (1) manually flagged defective items; (2) iterative mining using trained classifiers to surface candidates for manual annotation
    • Quality control: filtering low-quality images, capping at 15 images per item, balancing defect rate to 28.6%, excluding items with no normal samples
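
The item-ID-disjoint split mentioned in point 1 can be illustrated with a short sketch; the record fields (`image_path`, `item_id`) are hypothetical placeholders, not the dataset's actual schema:

```python
import random
from collections import defaultdict

def split_by_item(records, fractions=(0.85, 0.05, 0.10), seed=0):
    """Partition image records into train/val/test so no item_id crosses splits."""
    by_item = defaultdict(list)
    for rec in records:                      # rec: {"image_path": ..., "item_id": ...}
        by_item[rec["item_id"]].append(rec)

    item_ids = sorted(by_item)
    random.Random(seed).shuffle(item_ids)

    n = len(item_ids)
    n_train = int(fractions[0] * n)
    n_val = int(fractions[1] * n)
    buckets = {
        "train": item_ids[:n_train],
        "val": item_ids[n_train:n_train + n_val],
        "test": item_ids[n_train + n_val:],
    }
    # Every image of an item lands in exactly one split, so no item
    # leaks between training and evaluation.
    return {name: [r for iid in ids for r in by_item[iid]]
            for name, ids in buckets.items()}
```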

Baselines & Training

Rather than proposing a new method, this paper systematically evaluates four categories of baselines:

  • Training-free, reference-free (zero-shot): CLIP, Claude 3.5, Pixtral-12B
  • Training-free, reference-based (few-shot anomaly detection): PatchCore, WinCLIP
  • Training-based, reference-free (supervised): ResNet50, ViT-S/DINOv2, AutoGluon
  • Training-based, reference-based (hybrid): PatchCore with fine-tuned backbone, AutoGluon + reference
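
As a rough illustration of the training-free, reference-based setting, the sketch below scores a query image by its cosine distance to the nearest of the item's 1–3 reference embeddings. The embeddings are assumed to come from any frozen backbone (e.g., CLIP); this illustrates the protocol, not the paper's exact pipeline:

```python
import numpy as np

def anomaly_score(query_emb: np.ndarray, ref_embs: np.ndarray) -> float:
    """Cosine distance from a query embedding to its nearest reference embedding."""
    q = query_emb / np.linalg.norm(query_emb)
    r = ref_embs / np.linalg.norm(ref_embs, axis=1, keepdims=True)
    sims = r @ q                      # cosine similarity to each reference image
    return float(1.0 - sims.max())    # far from all references => likely defective
```

Under Kaputt's pose and appearance variation, this distance is often dominated by viewpoint changes rather than defects, which is exactly the failure mode the results below expose.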

Key Experimental Results

Main Results

| Method | Type | APany (%) ↑ | APmajor (%) ↑ | AUROC ↑ |
|---|---|---|---|---|
| Random | – | 31.84 | 14.00 | 50.00 |
| CLIP | Zero-shot | 36.20 | 17.15 | 56.05 |
| Claude-icl | Zero-shot + context | 36.57 | 24.76 | 56.96 |
| PatchCore50 | Few-shot AD | 35.86 | 17.80 | 54.69 |
| WinCLIP-few | Few-shot AD | 34.05 | 19.29 | 52.41 |
| ResNet50 | Supervised | 81.06 | 74.93 | 88.36 |
| ViT-S | Supervised | 90.67 | 91.45 | 94.27 |
| PatchCore50-ft | Hybrid | 40.18 | 20.98 | 60.14 |
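
The reported metrics can be reproduced with scikit-learn, as in the sketch below; the labels and scores here are synthetic stand-ins. For an uninformative scorer, average precision converges to the positive-class prevalence and AUROC to 0.5, which is what the Random row reflects:

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

rng = np.random.default_rng(0)
# Synthetic labels with ~31.8% positives, mirroring the Random APany row.
y_true = rng.random(100_000) < 0.318
random_scores = rng.random(y_true.size)

print(f"AP    ~ {average_precision_score(y_true, random_scores):.3f}")  # ~ prevalence
print(f"AUROC ~ {roc_auc_score(y_true, random_scores):.3f}")            # ~ 0.50
```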

Ablation Study

Performance degradation when reducing defective training samples:

| Configuration | APany (%) | APmajor (%) | AUROC |
|---|---|---|---|
| ViT-S, full training set | 90.67 | 91.45 | 94.27 |
| ViT-S, 1% defect rate (query only) | 57.7 | 40.5 | 74.4 |
| ViT-S, 1% defect rate (query + reference) | 40.4 | 14.9 | 63.2 |

Key comparison: anomaly detection methods across datasets:

| Dataset | AUROC |
|---|---|
| MVTec-AD (SOTA) | 99.9% |
| VisA (SOTA) | 99.5% |
| Kaputt (best unsupervised) | 56.96% |

Key Findings

  1. Anomaly detection methods fail comprehensively: All unsupervised/few-shot methods achieve no more than 56.96% AUROC, barely above random chance.
  2. VLMs are insufficient: Claude/Pixtral can describe objects but fail to detect subtle defects, consistent with findings by Jiang et al.
  3. Reference images are counterproductive: naively incorporating reference images (e.g., via feature averaging) degrades supervised performance (96%→87% APany on the training set); see the sketch after this list.
  4. Ceiling of supervised methods: ViT-S achieves 90.67% APany, yet still makes errors on deformable items and "adversarial" packaging designs (e.g., packaging printed with hole-like patterns).
  5. Pose variation is the core challenge: Anomaly detection methods misidentify normal pose and appearance variation as anomalies.
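
Finding 3 describes a naive fusion of reference information. Below is a minimal sketch of what such feature averaging might look like, assuming a hypothetical setup in which the query embedding is concatenated with the mean of the item's reference embeddings before classification (the paper's exact fusion may differ):

```python
import numpy as np

def fuse_with_references(query_emb: np.ndarray, ref_embs: np.ndarray) -> np.ndarray:
    """Naive fusion: concatenate the query embedding with the mean of the
    item's 1-3 reference embeddings (illustrative, not the paper's scheme)."""
    ref_mean = ref_embs.mean(axis=0)          # averaging blurs pose-specific detail
    return np.concatenate([query_emb, ref_mean])
```

Averaging collapses exactly the pose and appearance variation the references are meant to convey, which is one plausible reason this fusion hurts rather than helps.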

Highlights & Insights

  • Genuinely exposes the bottleneck of anomaly detection: the issue is not method inadequacy but a fundamental shift in problem nature — from controlled manufacturing to open retail environments.
  • Rigorous dataset design: item-ID-based splits prevent data leakage; three-annotator majority voting ensures label quality; defect rates are aligned with existing benchmarks.
  • Four-scenario evaluation framework: the 2×2 matrix of training vs. no training × reference vs. no reference provides a comprehensive perspective.
  • Scale advantage: 48K unique items and 29K defect instances constitute the largest benchmark of its kind.

Limitations & Future Work

  • Only a single top-down viewpoint is captured; multi-view information is not exploited.
  • Reference image quality is not guaranteed (<1% of reference images contain defects themselves), potentially introducing noise.
  • Annotation errors remain (e.g., non-observable defects due to occlusion; confusion between design patterns and actual defects).
  • No pixel-level segmentation annotations are provided, precluding evaluation of defect localization accuracy.
  • All experiments use RGB images; depth, infrared, and other modalities are not explored.
  • MVTec-AD and VisA have saturated; Kaputt represents the next frontier for anomaly detection research.
  • ARMBench targets a similar scenario but contains only one-quarter as many defective samples as Kaputt and covers only 2 defect types.
  • Adapting anomaly detection methods to large intra-class variation remains a key open problem.
  • Effective utilization of reference images is an underexplored research direction — naive feature averaging is clearly insufficient.

Rating

  • Novelty: ⭐⭐⭐⭐ Dataset-driven contribution with precise problem formulation, but no methodological innovation
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ Four scenarios × multiple methods + training set reduction experiments + detailed error analysis
  • Writing Quality: ⭐⭐⭐⭐ Clear structure with well-documented dataset descriptions
  • Value: ⭐⭐⭐⭐⭐ Fills the benchmark gap in retail logistics defect detection and will drive community progress