
SiM3D: Single-Instance Multiview Multimodal and Multisetup 3D Anomaly Detection Benchmark

Conference: ICCV 2025 · arXiv: 2506.21549 · Code: alex-costanzino/SiM3D · Area: 3D Vision · Keywords: 3D anomaly detection, multiview, multimodal, single-instance, anomaly volume, synthetic-to-real

TL;DR

This paper introduces SiM3D, the first benchmark for multiview multimodal 3D anomaly detection and segmentation targeting single-instance industrial scenarios. It employs industrial-grade sensors to acquire high-resolution data, replaces 2D anomaly maps with voxelized Anomaly Volumes, and is the first benchmark to support cross-domain synthetic-to-real evaluation.

Background & Motivation

Industrial anomaly detection and segmentation (ADS) has experienced rapid growth since the release of MVTec AD (172 top-venue papers within five years), yet existing benchmarks leave three critical industrial pain points unresolved:

2D anomaly maps are insufficient for 3D localization: Existing benchmarks (MVTec AD/3D-AD, etc.) produce 2D anomaly maps from a single viewpoint. Industrial automation, however, requires precise 3D localization of defects for automated repair, necessitating multi-view observation and anomaly score prediction within a 3D voxel grid—i.e., an Anomaly Volume.

Multi-instance training data collection is costly and unnecessary: Existing benchmarks require multiple training samples to capture normal-instance variability. In manufacturing, objects are highly consistent (replicated from the same CAD prototype), so a single normal instance contains all information needed for detection. Moreover, recollecting large datasets after a production line changeover is prohibitively time-consuming.

The synthetic-to-real gap is unexplored: No prior ADS benchmark has investigated cross-domain generalization from CAD-rendered synthetic training data to real-object testing—a setup of high practical value in industry.

Core insight: The paradigm shift from 2D anomaly maps to 3D Anomaly Volumes, combined with single-instance training and synthetic-to-real evaluation, better reflects real industrial requirements. SiM3D addresses this gap through industrial-grade sensors and a carefully designed acquisition pipeline.

Method

Overall Architecture

SiM3D contributes a benchmark and adapts existing methods rather than proposing a new algorithm. The framework comprises four core components:

(1) Data Acquisition

- Sensor: ZEISS Atos Q industrial 3D scanner (a stereo pair of 12 Mpx grayscale cameras plus a light projector) mounted on a high-precision industrial robot arm.
- Acquisition: each object is scanned over 360°, covering 12–36 viewpoints sampled on a hemisphere concentric with the object.
- Output per viewpoint: a grayscale image (\(4096 \times 3000\) px), a point cloud (~7 M points, 0.04–0.15 mm spacing), and the known pose; each scan also provides an integrated mesh.

(2) Synthetic Data Generation

Blender's Python API is used to render synthetic data consistent with the real acquisition:

- The real reference mesh is aligned to the CAD model.
- The Cycles path-tracing renderer renders from the same viewpoints.
- RGB renders are converted to grayscale to match the real sensor.
- Depth maps are back-projected to 3D point clouds (a sketch of this step follows below).
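As a concrete illustration of the last step, here is a minimal back-projection sketch in plain NumPy. It is not the authors' code; the function name, pinhole-camera model, and metre units are assumptions made for illustration.

```python
import numpy as np

def backproject_depth(depth, K, cam_to_world):
    """Back-project a rendered depth map into a world-space point cloud.

    depth:        (H, W) depth along the camera z-axis; 0 marks invalid pixels.
    K:            (3, 3) pinhole intrinsic matrix (assumed camera model).
    cam_to_world: (4, 4) pose of the rendering camera (the known view transform).
    """
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))  # pixel coordinates
    valid = depth > 0
    z = depth[valid]
    # Invert the pinhole projection: pixel -> camera coordinates.
    x = (u[valid] - K[0, 2]) * z / K[0, 0]
    y = (v[valid] - K[1, 2]) * z / K[1, 1]
    pts_cam = np.stack([x, y, z, np.ones_like(z)], axis=1)
    # Camera -> world using the known viewpoint pose.
    return (cam_to_world @ pts_cam.T).T[:, :3]
```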

(3) Anomaly Creation

Three defect categories are manually introduced into purchased objects:

- Appearance anomalies: paint modifications, scratches.
- Geometric anomalies: dents, deformations.
- Mixed anomalies: contaminants (affecting both appearance and geometry).

For each object category, 50% of instances are kept defect-free and 50% are defective.

(4) 3D Annotation Pipeline

A two-step strategy integrating 2D and 3D information:

\[\text{2D Annotation} \xrightarrow{\text{Project onto Mesh}} \text{3D Mesh Annotation} \xrightarrow{\text{Manual Refinement}} \text{Voxelized GT (2 mm)}\]
  • 2D segmentation masks are manually annotated on all viewpoint images where defects are visible.
  • The integrated mesh, intrinsic matrices, and view transforms are used to project 2D annotations onto the 3D mesh.
  • CloudCompare is used for 3D visualization and manual refinement.
  • Annotations are finally converted to voxel grids at 2 mm resolution.
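A minimal sketch of that final voxelization step, assuming labelled points sampled from the annotated mesh in metres (so 2 mm = 0.002); the function name and point-sampling convention are illustrative, not the paper's actual tooling.

```python
import numpy as np

def voxelize_annotation(points, is_anomalous, voxel_size=0.002):
    """Quantize labelled 3D points into a boolean GT voxel grid (2 mm cells).

    points:       (N, 3) world-space points sampled from the annotated mesh.
    is_anomalous: (N,) boolean per-point defect labels.
    """
    origin = points.min(axis=0)                           # grid anchor
    idx = np.floor((points - origin) / voxel_size).astype(int)
    grid = np.zeros(idx.max(axis=0) + 1, dtype=bool)
    # A voxel is anomalous if any point falling inside it is labelled anomalous.
    a = idx[is_anomalous]
    grid[a[:, 0], a[:, 1], a[:, 2]] = True
    return grid, origin
```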

Benchmark Design

Two training setups:

- real2real: train on 1 real normal instance → test on the remaining real instances.
- synth2real: train on synthetic data rendered from the CAD model → test on real instances.

Metrics extended to 3D:

- Detection: I-AUROC (image-level AUROC).
- Segmentation: standard 2D ADS metrics are extended to operate on voxel grids (3D versions); see the sketch below.
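As a rough illustration of how a 2D metric carries over to voxel grids, here is a sketch assuming the voxel-level AUROC is simply the standard AUROC computed over flattened grids; the paper's exact 3D metric definitions may differ, and `voxel_auroc` with its optional mask is illustrative only.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def voxel_auroc(gt_volume, anomaly_volume, mask=None):
    """Voxel-level AUROC: the 2D pixel metric applied to a flattened 3D grid."""
    gt = gt_volume.astype(bool).ravel()
    scores = np.asarray(anomaly_volume, dtype=np.float64).ravel()
    if mask is not None:
        # Optionally restrict evaluation, e.g. to voxels near the object surface.
        keep = np.asarray(mask, dtype=bool).ravel()
        gt, scores = gt[keep], scores[keep]
    return roc_auc_score(gt, scores)
```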

Adapting single-view methods to multiview:

1. Run a single-view ADS method independently per viewpoint to obtain 2D anomaly maps.
2. Project the 2D anomaly scores into 3D space.
3. Aggregate across viewpoints into a voxel grid via per-voxel max pooling (sketched below).
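A minimal sketch of step 3, assuming each viewpoint's per-pixel anomaly scores have already been back-projected to world-space points (all names here are hypothetical):

```python
import numpy as np

def build_anomaly_volume(view_points, view_scores, grid_shape, origin,
                         voxel_size=0.002):
    """Fuse per-view anomaly scores into one Anomaly Volume via max pooling.

    view_points: list of (N_i, 3) world-space points, one array per viewpoint.
    view_scores: list of (N_i,) per-pixel anomaly scores for those points.
    """
    volume = np.zeros(grid_shape, dtype=np.float32)
    dims = np.asarray(grid_shape)
    for pts, scores in zip(view_points, view_scores):
        idx = np.floor((pts - origin) / voxel_size).astype(int)
        inside = np.all((idx >= 0) & (idx < dims), axis=1)
        idx, s = idx[inside], scores[inside].astype(np.float32)
        # Per-voxel max across all pixels of all views (unbuffered ufunc update).
        np.maximum.at(volume, (idx[:, 0], idx[:, 1], idx[:, 2]), s)
    return volume
```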

Dataset Statistics

Eight industrial object categories, 333 instances in total:

| Object | Size (cm) | Total | Train | Normal test | Anomaly test | Views |
|---|---|---|---|---|---|---|
| Plastic stool | 35×35×30 | 22 | 1 (real/synth) | 10 | 10 | 12 |
| Trash bin | 26×21×33 | 42 | 1 | 20 | 20 | 12 |
| Rattan vase | 17×17×15 | 22 | 1 | 10 | 10 | 12 |
| Bathroom furniture | 33×33×50 | 20 | 1 | 8 | 10 | 36 |
| Container | 20×25×10 | 94 | 1 | 46 | 46 | 12 |
| Plastic vase | 12×12×9 | 99 | 1 | 48 | 49 | 12 |
| Wooden stool | 48×42×45 | 15 | 1 | 6 | 7 | 12 |
| Washbasin | 44×25×50 | 19 | 1 | 9 | 8 | 36 |

Key Experimental Results

Main Results: Anomaly Detection (I-AUROC)

| Method | Modality | real2real Mean | synth2real Mean |
|---|---|---|---|
| PatchCore (WRN-101) | RGB | 0.630 | 0.600 |
| PatchCore (DINO-v2) | RGB | 0.671 | 0.596 |
| EfficientAD | RGB | 0.594 | 0.573 |
| AST | RGB | 0.687 | 0.679 |
| BTF | RGB+PC | 0.444 | 0.446 |
| M3DM (DINO-v2+FPFH) | RGB+PC | 0.621 | 0.402 |
| AST | RGB+Depth | 0.636 | 0.495 |

Key Findings:

- AST with RGB-only input achieves the best performance under both setups; multimodal methods are consistently outperformed by pure RGB baselines.
- real2real uniformly outperforms synth2real, indicating that the synthetic-to-real domain gap remains a major challenge.
- Multimodal methods such as BTF and CFM degrade severely under the single-instance setting; memory-bank-based methods are unstable with extremely limited training data.

Anomaly Segmentation (Anomaly Volume)

| Method | Modality | real2real Mean (vAUROC) | synth2real Mean |
|---|---|---|---|
| PatchCore (WRN-101) | RGB | 0.754 | 0.451 |
| PatchCore (DINO-v2) | RGB | 0.678 | 0.540 |
| AST | RGB | 0.584 | 0.544 |
| M3DM (DINO-v2+FPFH) | RGB+PC | 0.621 | 0.402 |
| AST | RGB+Depth | 0.925 | 0.495 |

Key Findings:

- AST (RGB+Depth) achieves a strong segmentation result under real2real (0.925) but drops substantially under synth2real (0.495).
- PatchCore (WRN-101) performs competitively on real2real segmentation (0.754).
- synth2real segmentation performance is generally poor, suggesting that precise 3D defect localization is more sensitive to domain shift than detection.
- Performance varies greatly across object categories; texturally simple objects (e.g., wooden stool, 1.000) vastly outperform complex ones (e.g., washbasin, 0.250).

Highlights & Insights

  1. Paradigm shift from 2D anomaly maps to 3D Anomaly Volumes: The voxelized Anomaly Volume is proposed as the standard output for 3D ADS, more directly supporting industrial automated repair than 2D maps.
  2. Practical value of single-instance + synthetic-to-real settings: This is the first ADS work to explore cross-domain generalization from CAD-rendered training data to real-object testing, directly addressing the pain point of production line changeovers.
  3. Industrial-grade data quality: The ZEISS Atos Q scanner and industrial robot arm yield 12 Mpx imagery and ~7 M-point clouds, far exceeding the precision of existing benchmarks.
  4. Exposing critical deficiencies of existing methods: Multimodal methods paradoxically underperform pure RGB methods in the single-instance setting, and the large performance gap in synthetic-to-real transfer defines a clear direction for future research.

Limitations & Future Work

  • Only eight object categories are included, limiting diversity and suitability for generalization studies.
  • Data acquisition and annotation costs are extremely high (industrial-grade equipment, a four-expert team, and a multi-step labeling pipeline), making large-scale expansion difficult.
  • The 2 mm voxel resolution is a compromise between accuracy and memory footprint and may miss very fine defects.
  • Multi-view information is currently aggregated via simple max pooling; dedicated multi-view fusion methods are absent.
Related Work

  • 2D ADS benchmarks: MVTec AD/LOCO AD, VisA, Real-IAD, etc.
  • Multimodal ADS benchmarks: MVTec 3D-AD (RGB + XYZ maps), Eyecandies (synthetic RGB + depth).
  • Multiview ADS benchmarks: PAD (multiview RGB training, single-view RGB testing), Real3D-AD (point-cloud anomaly detection).
  • ADS methods: PatchCore, EfficientAD, M3DM, BTF, CFM, AST, etc.

Rating

  • Novelty: ⭐⭐⭐⭐ — The 3D Anomaly Volume concept is original; the single-instance + synthetic-to-real setup is insightful; however, no new method is proposed.
  • Technical Quality: ⭐⭐⭐⭐⭐ — Data acquisition is highly rigorous, the annotation pipeline is scientifically sound, and the benchmark design is comprehensive.
  • Experimental Thoroughness: ⭐⭐⭐⭐ — Seven methods are adapted across two setups, but dedicated multiview baselines are lacking.
  • Writing Quality: ⭐⭐⭐⭐ — Structure is clear; comparison tables with existing benchmarks are intuitive.
  • Overall Score: 7.5/10