InvAD: Inversion-based Reconstruction-Free Anomaly Detection with Diffusion Models

Conference: CVPR 2026
arXiv: 2504.05662
Code: Project Page
Area: Medical Imaging
Keywords: Anomaly detection, diffusion models, DDIM inversion, reconstruction-free paradigm, industrial/medical defect detection

TL;DR

This paper proposes InvAD, which shifts diffusion-based anomaly detection from a "denoising-reconstruction in RGB space" paradigm to a "noising-inversion in latent space" paradigm. By applying DDIM inversion to directly infer the terminal latent variable and measuring deviation under the prior distribution, anomalies are detected without reconstruction. Only 3 inversion steps suffice to achieve state-of-the-art performance, with approximately 2× inference speedup.

Background & Motivation

Background: Diffusion-based anomaly detection (AD) methods have achieved strong results but suffer from a fundamental efficiency–accuracy trade-off.
Limitations of Prior Work: (1) Noise sensitivity—excessively strong noise corrupts normal regions causing false positives, while insufficient noise allows anomalous regions to be reconstructed faithfully, leading to missed detections; (2) Expensive multi-step denoising—satisfactory reconstruction requires iterative denoising, and most methods operate at roughly 1 FPS or below (e.g., DiAD at 0.1 FPS, GLAD at 0.2 FPS).
Key Challenge: Since diffusion models are trained solely on normal data distributions, reconstruction is not necessary for anomaly detection. Inversion can directly map an image into the latent space: normal images map to high-density regions of the prior distribution, whereas anomalous images map to low-density regions, entirely bypassing reconstruction and eliminating the need to tune noise strength.
Goal: Propose a reconstruction-free inference paradigm based on DDIM inversion that is both more accurate and significantly faster than existing diffusion-based AD methods.

Method

Overall Architecture

Input image → backbone feature extraction $\mathbf{z} = g_\phi(\mathbf{x})$ → DDIM inversion (3 steps) yielding $\mathbf{z}_T$ → anomaly score computed from the deviation of $\mathbf{z}_T$ under the prior distribution. No decoder or reconstruction is required.

Key Designs

DDIM Inversion Noising (Core): The image is forward-propagated along the PF-ODE trajectory to directly infer $\mathbf{x}_T$ from $\mathbf{x}_0$, using the discrete Euler approximation: $$\mathbf{x}_{\tau_{i+1}} = \sqrt{\alpha_{\tau_{i+1}}} f_\theta(\mathbf{x}_{\tau_i}) + \sqrt{1-\alpha_{\tau_{i+1}}} \epsilon_\theta^{(\tau_i)}(\mathbf{x}_{\tau_i}),$$ where, in the standard DDIM notation, $f_\theta(\mathbf{x}_{\tau_i}) = \big(\mathbf{x}_{\tau_i} - \sqrt{1-\alpha_{\tau_i}}\, \epsilon_\theta^{(\tau_i)}(\mathbf{x}_{\tau_i})\big) / \sqrt{\alpha_{\tau_i}}$ is the predicted clean sample and $\alpha_t$ is the cumulative noise-schedule coefficient. Crucially, only very few inversion steps are required ($S=3$, subset $\tau_3 = [333, 666, 999]$): even under the lower-accuracy Euler approximation, anomalous pixels are still mapped to low-density regions of the prior. Design Motivation: The deterministic nature of the PF-ODE guarantees a one-to-one mapping between normal images and the prior distribution, so anomalous deviations can be directly quantified via distributional typicality.
Feature-Space Diffusion Modeling (FDM): A pretrained EfficientNet-B4 extracts features $\mathbf{z} = g_\phi(\mathbf{x}) \in \mathbb{R}^{C \times h \times w}$, and the diffusion model operates on these features rather than on raw pixels. Advantages: (a) backbone features are invariant to low-level variations such as noise and illumination; (b) lower spatial resolution yields more efficient inference. A DiT-gigant architecture is used as the diffusion backbone.
Hybrid Anomaly Scoring: From the inverted $\mathbf{z}_T$, a pixel-level anomaly map is obtained via the channel-wise Euclidean norm: $\mathbf{z}_T^{\text{normed}}[i,j] = \|\mathbf{z}_T[:,i,j]\|_2$. Writing $A$ for the resulting anomaly map, the image-level score is defined as $s = \max(A) - \min(A) + \sum_{u,v} A[u,v]$. The max-min difference mitigates the inverse scoring problem: since anomalies are typically sparse and localized, the max-min contrast filters out the influence of global outliers.
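The inversion update and the hybrid score described above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `ddim_invert`, `anomaly_map`, `image_score`, and `toy_eps` are hypothetical names, the clean input is assumed to sit at a cumulative $\alpha$ of 1, the anomaly map $A$ is assumed to be the channel-wise norm of $\mathbf{z}_T$, and the toy noise predictor is the closed-form one for data drawn from $\mathcal{N}(0, I)$ rather than a trained DiT.

```python
import numpy as np

def ddim_invert(z0, eps_model, alpha_bar, taus):
    """Map a clean latent z0 to z_T using len(taus) Euler inversion steps."""
    z = np.asarray(z0, dtype=np.float64)
    t_cur, a_cur = 0, 1.0  # assumption: the clean input corresponds to alpha = 1
    for t_next in taus:
        a_next = float(alpha_bar[t_next])
        eps = eps_model(z, t_cur)  # noise predicted at the current state
        # f_theta: predicted clean sample from the current state
        z0_hat = (z - np.sqrt(1.0 - a_cur) * eps) / np.sqrt(a_cur)
        # Inversion update: move to the next (noisier) timestep
        z = np.sqrt(a_next) * z0_hat + np.sqrt(1.0 - a_next) * eps
        t_cur, a_cur = t_next, a_next
    return z

def anomaly_map(z_T):
    """Channel-wise Euclidean norm of a (C, h, w) latent, giving an (h, w) map."""
    return np.linalg.norm(z_T, axis=0)

def image_score(A):
    """Image-level score s = max(A) - min(A) + sum(A)."""
    return float(A.max() - A.min() + A.sum())

# Toy setup (all values below are illustrative, not from the paper).
T = 1000
alpha_bar = np.linspace(1.0, 1e-3, T)  # toy cumulative schedule

def toy_eps(z, t):
    # Exact noise predictor when the "normal" data is N(0, I).
    return np.sqrt(1.0 - alpha_bar[t]) * z

rng = np.random.default_rng(0)
normal = rng.standard_normal((8, 4, 4))
anomalous = normal.copy()
anomalous[:, 1, 2] += 5.0  # localized deviation at one spatial position

taus = [333, 666, 999]
A_normal = anomaly_map(ddim_invert(normal, toy_eps, alpha_bar, taus))
A_anom = anomaly_map(ddim_invert(anomalous, toy_eps, alpha_bar, taus))
```

Because the toy predictor is linear, the inverted latent is only a rescaled copy of the input, so the shifted position keeps the largest norm. A model trained on real normal data would not reduce to this; the example is meant to show the data flow from latent to anomaly map to image-level score.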

Loss & Training

Training: Standard DDPM $\epsilon$-prediction loss, trained on normal data only; AdamW optimizer, 300 epochs, $T=1000$, linear noise schedule.
Inference: $S=3$ inversion steps with uniform subset $\tau_3 = [333, 666, 999]$.
Plug-and-play design: Only the inference stage is modified; InvAD can directly replace the inference procedure of existing diffusion-based AD methods without retraining.
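The training and inference settings listed above can be made concrete with a short NumPy sketch. The schedule endpoints `beta_start=1e-4` and `beta_end=2e-2` are the common DDPM defaults and are an assumption here, since the summary only states that the schedule is linear. The function names are hypothetical, the stride rule is just one way to reproduce the reported subset, and the loss is the generic $\epsilon$-prediction objective rather than code from the paper.

```python
import numpy as np

def linear_alpha_bar(T=1000, beta_start=1e-4, beta_end=2e-2):
    """Cumulative product of (1 - beta_t) for a linear beta schedule."""
    betas = np.linspace(beta_start, beta_end, T)
    return np.cumprod(1.0 - betas)

def uniform_timesteps(T=1000, S=3):
    """Evenly spaced timestep subset with stride T // S."""
    stride = T // S
    return [stride * i for i in range(1, S + 1)]

def eps_prediction_loss(z0, eps_model, alpha_bar, rng):
    """One Monte Carlo sample of the epsilon-prediction MSE loss."""
    t = int(rng.integers(0, len(alpha_bar)))      # random training timestep
    eps = rng.standard_normal(z0.shape)           # target noise
    z_t = np.sqrt(alpha_bar[t]) * z0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return float(np.mean((eps - eps_model(z_t, t)) ** 2))

alpha_bar = linear_alpha_bar()
taus = uniform_timesteps(T=1000, S=3)

# A predictor that always outputs zero, used only to exercise the loss.
zero_model = lambda z, t: np.zeros_like(z)
loss = eps_prediction_loss(np.zeros((8, 4, 4)), zero_model, alpha_bar,
                           np.random.default_rng(0))
```

With $T=1000$ and $S=3$ the stride is 333, which gives the subset $[333, 666, 999]$ quoted in the inference setting.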

Key Experimental Results

Main Results

Dataset	Metric	Ours (InvAD)	OmiAD (ICML'25)	DiAD (AAAI'24)	FPS (InvAD vs. baselines)
MVTecAD	I-AUROC	99.0	98.8	97.2	88.1 vs 39.4 vs 0.1
VisA	I-AUROC	96.9	95.3	86.8	74.1 vs 35.3
MPDD	I-AUROC	96.5	93.7	74.6	120 vs 49.8
BMAD (Medical)	mAD	87.2	—	—	88 vs 20
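As a quick arithmetic check of the "approximately 2×" speedup claim, the FPS pairs in the table can be turned into ratios. The sketch below assumes that the second FPS value in each industrial row belongs to OmiAD, which the column order suggests but the table does not state explicitly; the variable names are illustrative.

```python
# FPS pairs copied from the table; pairing the second value with OmiAD is an assumption.
fps = {
    "MVTecAD": (88.1, 39.4),
    "VisA": (74.1, 35.3),
    "MPDD": (120.0, 49.8),
}

# Speedup of InvAD over the second-listed method, rounded to two decimals.
speedup = {name: round(ours / other, 2) for name, (ours, other) in fps.items()}

for name, ratio in speedup.items():
    print(f"{name}: {ratio}x")
```

Under that pairing, all three industrial datasets land slightly above 2×, which is consistent with the speedup stated in the TL;DR.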

Ablation Study

Configuration	MVTecAD mAD	Notes
FDM only (no inversion)	57.3	Inversion is the essential component
Single-step inversion (pixel space)	44.9	Pixel-space diffusion + single step insufficient
FDM + single-step inversion	71.0	Feature space + single step
FDM + multi-step inversion (full)	83.7	Optimal configuration

Inversion Steps $S$	Reconstruction (best $r$)	Inversion (Ours)
3	64.9	99.0
5	75.0	98.9
10	97.9	98.4
50	98.0	96.0
1000	98.2	95.4

Key Findings

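The inversion approach substantially outperforms reconstruction at very few steps: 99.0 vs. 64.9 at $S=3$ and 98.9 vs. 75.0 at $S=5$. Reconstruction needs far more steps to get close (97.9 at $S=10$, 98.0 at $S=50$) and never reaches the 99.0 that inversion attains at $S=3$, while inversion accuracy declines as $S$ grows (95.4 at $S=1000$).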
Inversion requires no tuning of the perturbation timestep, whereas reconstruction methods are highly sensitive to both $r$ and $S$.
Plug-and-play application: DiAD + InvAD yields +1.0 I-AUROC and +88 FPS; MDM + InvAD yields +6.3 I-AUROC and +60.8 FPS.
The hybrid NLL + Diff scoring (the summation term plus the max-min term of the image-level score) is robust across inversion step counts $S$; using NLL or Diff alone is not.
State-of-the-art performance is also achieved across 6 datasets in the BMAD medical benchmark (mAD = 87.2), demonstrating cross-domain generalizability.

Highlights & Insights

Paradigm Innovation: The conceptual shift from "detect via denoising" to "detect via noising" is the core contribution—elegant in its simplicity and highly effective in practice.
Inversion naturally eliminates both the noise-strength tuning problem and the computational bottleneck of multi-step reconstruction.
The reason $S=3$ suffices to achieve SOTA is that precise reconstruction is unnecessary; only distinguishing distributional typicality between normal and anomalous samples is required.
The plug-and-play design allows InvAD to serve as a universal inference accelerator for existing diffusion-based AD methods.
Feature-space diffusion modeling is an important design choice that jointly improves efficiency and detection performance.

Limitations & Future Work

More than one function evaluation (NFE = 3) is still required; diffusion distillation could potentially compress this to a single step.
Pixel-level localization metrics (AP, F1_max) are inferior to some reconstruction-based methods, as inversion has an inherent disadvantage in precise boundary delineation.
The max-min contrast term in the scoring scheme is empirically motivated and lacks theoretical justification.
DiT-gigant has a large parameter count (1,223M); an MLP backbone achieves comparable detection accuracy but inferior localization.
Task-specific optimization of the inversion mechanism has not been explored.

Related Work & Insights

The deterministic sampling and PF-ODE formulation of DDIM (Song et al., 2020) provide the theoretical foundation for the inversion approach.
The use of score-function norms to measure OOD typicality by Heng et al. (2024) inspired the anomaly scoring design in this work.
OmiAD (Feng et al., 2025) achieves 1-step diffusion via adversarial distillation but incurs higher training complexity.
Compared to non-diffusion methods such as EfficientAD (Batzner et al., 2023), diffusion-based methods still maintain an accuracy advantage.

Rating

Novelty: ⭐⭐⭐⭐⭐ — The "detect via noising" paradigm is a conceptually innovative contribution: concise, elegant, and highly effective.
Experimental Thoroughness: ⭐⭐⭐⭐⭐ — Four industrial and six medical datasets, with comprehensive ablations covering components, backbones, scoring strategies, step counts, and generalizability.
Writing Quality: ⭐⭐⭐⭐ — Problem analysis is clear, paradigm comparison figures are intuitive, and tables are well-organized.
Value: ⭐⭐⭐⭐⭐ — Highly practical; serves as a plug-and-play accelerator for existing methods with significant implications for both industrial and medical AD.