
One-to-More: High-Fidelity Training-Free Anomaly Generation with Attention Control

Conference: CVPR 2026 | arXiv: 2603.18093 | Code: N/A | Area: AI Safety / Anomaly Detection | Keywords: anomaly generation, training-free, self-attention grafting, diffusion models, industrial anomaly detection

TL;DR

The paper proposes O2MAG, a training-free few-shot anomaly generation method that synthesizes diverse, realistic anomalies from a single reference anomaly image via a tri-branch diffusion process with self-attention grafting (TriAG). It incorporates Anomaly Guidance Optimization (AGO) to align textual semantics with real defect appearance and Dual Attention Enhancement (DAE) to ensure the anomaly completely fills the mask region. On MVTec-AD, the method significantly outperforms existing approaches on downstream anomaly detection.

Background & Motivation

  1. Background: Industrial anomaly detection suffers from severe data imbalance—normal images are abundant while anomaly images are scarce. Existing anomaly synthesis methods fall into two categories: training-based (e.g., DreamBooth fine-tuning, textual inversion) and training-free approaches.
  2. Limitations of Prior Work: Training-based methods incur substantial computational and storage overhead and are prone to overfitting under few-shot settings. The only prior training-free method, AnomalyAny, manipulates only cross-attention and fails to precisely control anomaly semantics and spatial layout, resulting in insufficiently realistic generations.
  3. Key Challenge: Industrial defects are extremely rare in Stable Diffusion's training data; simple text prompts cannot accurately describe defect semantics, causing generated images to deviate from the true anomaly distribution.
  4. Goal: Leverage the intrinsic priors of diffusion models to synthesize diverse and realistic anomalies from a single reference anomaly image without any training.
  5. Key Insight: PCA analysis of self-attention maps reveals that anomaly foregrounds and normal backgrounds are naturally separated in attention space, enabling cross-branch information transfer by manipulating self-attention K/V features.
  6. Core Idea: Tri-branch parallel diffusion combined with mask-guided self-attention grafting, extracting foreground defect features from the reference anomaly branch and background features from the normal image branch.

Method

Overall Architecture

O2MAG runs three parallel diffusion processes: a reference anomaly branch (noise obtained via DDIM inversion of the reference anomaly image), a normal image branch (inverted from a normal image), and a target anomaly branch (initialized from the normal branch's inverted noise). Self-attention grafting injects anomaly features into the designated mask region, the AGO module optimizes the text embedding, and the DAE module strengthens attention within the mask region.
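Since the official code is unreleased (Code: N/A), here is a minimal sketch of the deterministic DDIM inversion that seeds the reference and normal branches. The names are assumptions: `eps_model` stands for the frozen SD v1.5 UNet's noise predictor, `alphas_cumprod` for the scheduler's cumulative alphas, and `timesteps` for the sampling steps in ascending order.

```python
import torch

@torch.no_grad()
def ddim_invert(eps_model, x0_latent, text_emb, alphas_cumprod, timesteps):
    """Sketch of deterministic DDIM inversion: run the DDIM update backwards,
    mapping a clean latent to the noise that reconstructs it under sampling.
    `timesteps` is an ascending list of ints (the 50 steps in reverse order)."""
    x = x0_latent
    prev_ts = [0] + list(timesteps[:-1])
    for t_prev, t in zip(prev_ts, timesteps):
        # Treat abar at t=0 as 1 (no noise at the clean latent).
        a_prev = alphas_cumprod[t_prev] if t_prev > 0 else torch.tensor(1.0)
        a_t = alphas_cumprod[t]
        eps = eps_model(x, t_prev, text_emb)  # noise estimate at the cleaner step
        x0_pred = (x - (1 - a_prev).sqrt() * eps) / a_prev.sqrt()
        # Step "up" to noise level t instead of down, as in sampling.
        x = a_t.sqrt() * x0_pred + (1 - a_t).sqrt() * eps
    return x  # approximately x_T: the inverted noise that seeds a branch
```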

Key Designs

  1. Tri-branch Attention Grafting (TriAG):

    • Function: Transfers visual features of the reference anomaly into a specified region of the target image while preserving the normal background.
    • Mechanism: The target branch's Q is kept unchanged; K and V are replaced selectively. Inside the target mask \(M_T\), K/V are sourced from the reference anomaly branch (querying only foreground anomaly features within reference mask \(M_R\)); outside the mask, K/V come from the normal branch (querying only background features outside the mask). The final attention output is \(\text{Attn}^*_{T} = M_T \odot \text{Attn}_{fg} + (1-M_T) \odot \text{Attn}_{bg}\) (see the PyTorch sketch after this list).
    • Design Motivation: PCA visualizations of self-attention confirm natural separation between anomalous and normal regions in attention space, enabling precise content transfer via K/V manipulation.
  2. Anomaly Guidance Optimization (AGO):

    • Function: Bridges the gap between text-encoded anomaly semantics and the true visual appearance of real defects.
    • Mechanism: The diffusion model is frozen; only the text embedding \(\mathbf{e}\) is optimized by minimizing the reconstruction loss \(\mathbf{e}^* = \arg\min_{\mathbf{e}} \mathbb{E}\big[\|\epsilon - \epsilon_\theta(x_t, t, \mathbf{e})\|^2\big]\), shifting the embedding from the normal semantic space toward the anomaly space. Optimization runs for 500 Adam steps with a learning rate of \(3 \times 10^{-3}\) (sketched after this list).
    • Design Motivation: Industrial defects are rarely represented in SD's training data, and text prompts such as "a photo of cable with bent wire" cannot accurately encode defect appearance; data-driven embedding alignment is therefore necessary.
  3. Dual Attention Enhancement (DAE):

    • Function: Ensures that anomalies fully fill the target mask region, preventing weak or incomplete defect generation.
    • Mechanism: During denoising at selected timesteps, both self-attention and cross-attention weights within the mask region are amplified, producing a stronger model response to the defect area (sketched after this list).
    • Design Motivation: Small defect regions tend to be overlooked or attenuated during generation and require active attention reinforcement.
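A minimal sketch of the TriAG grafting step inside one self-attention layer, directly following the blending formula above. Shapes are assumptions: queries/keys/values as (B, N, C), masks flattened to (B, N) as 0/1 integer tensors; the actual method's branch hooking and layer selection are more involved.

```python
import torch

def triag_self_attention(q_tgt, k_ref, v_ref, k_nrm, v_nrm, mask_tgt, mask_ref):
    """TriAG sketch: the target branch keeps its own queries Q, while keys and
    values are grafted from the other two branches. Foreground queries attend
    only to anomaly keys of the reference branch (mask_ref); background queries
    attend only to normal keys outside the target mask. Masks are assumed
    non-empty so every query has at least one valid key."""
    scale = q_tgt.shape[-1] ** -0.5

    def masked_attn(q, k, v, key_mask):
        # Restrict which keys each query may attend to by masking logits.
        logits = torch.einsum("bnc,bmc->bnm", q, k) * scale
        logits = logits.masked_fill(key_mask[:, None, :] == 0, float("-inf"))
        return torch.einsum("bnm,bmc->bnc", logits.softmax(dim=-1), v)

    attn_fg = masked_attn(q_tgt, k_ref, v_ref, mask_ref)      # anomaly content
    attn_bg = masked_attn(q_tgt, k_nrm, v_nrm, 1 - mask_tgt)  # normal background
    m = mask_tgt[..., None].float()
    # Attn*_T = M_T (.) Attn_fg + (1 - M_T) (.) Attn_bg
    return m * attn_fg + (1 - m) * attn_bg
```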
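A sketch of the AGO loop under the same assumed names. Per the paper, only the text embedding is updated (500 Adam steps, learning rate 3e-3) while all model parameters stay frozen; the noising schedule details are assumptions.

```python
import torch
import torch.nn.functional as F

def anomaly_guidance_optimization(eps_model, x0_latent, e_init,
                                  alphas_cumprod, steps=500, lr=3e-3):
    """AGO sketch: minimize the standard diffusion reconstruction loss with
    respect to the text embedding only, shifting it from the normal semantic
    space toward the anomaly's visual semantics. The UNet is frozen."""
    e = e_init.detach().clone().requires_grad_(True)
    opt = torch.optim.Adam([e], lr=lr)
    for _ in range(steps):
        t = torch.randint(0, alphas_cumprod.numel(), (1,), device=x0_latent.device)
        noise = torch.randn_like(x0_latent)
        a_bar = alphas_cumprod[t].view(-1, 1, 1, 1)
        # Forward noising: x_t = sqrt(abar_t) x_0 + sqrt(1 - abar_t) eps
        x_t = a_bar.sqrt() * x0_latent + (1 - a_bar).sqrt() * noise
        loss = F.mse_loss(eps_model(x_t, t, e), noise)  # ||eps - eps_theta(x_t, t, e)||^2
        opt.zero_grad()
        loss.backward()
        opt.step()
    return e.detach()  # e*: the anomaly-aligned embedding used during sampling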
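Finally, a sketch of the DAE amplification step. The paper does not specify the amplification factor or timestep schedule, so `gamma` below is a placeholder assumption; the same idea applies to cross-attention with the mask taken over the query (spatial) axis instead.

```python
def dual_attention_enhance(attn_probs, mask_flat, gamma=1.5):
    """DAE sketch (self-attention case): boost attention weights whose keys
    fall inside the anomaly mask, then renormalize each row, so small defect
    regions keep a strong response instead of being washed out.
    attn_probs: (B, N, M) post-softmax weights; mask_flat: (B, M) in {0, 1}."""
    boosted = attn_probs * (1.0 + (gamma - 1.0) * mask_flat[:, None, :])
    return boosted / boosted.sum(dim=-1, keepdim=True)  # keep rows normalized
```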

Loss & Training

No model training is required; the diffusion backbone stays frozen throughout. AGO performs a lightweight 500-step optimization of the text embedding at inference time only. The full pipeline is built on 50-step DDIM sampling with Stable Diffusion v1.5.

Key Experimental Results

Main Results

| Method | AP-I (Detection) ↑ | AUC-P (Localization) ↑ | F1-P (Localization) ↑ | Accuracy (Classification) ↑ |
| --- | --- | --- | --- | --- |
| DFMGAN | 93.5 | 86.7 | 59.2 | - |
| AnomalyDiffusion | 99.3 | 98.9 | 78.2 | - |
| DualAnoDiff | 99.4 | 99.1 | 82.6 | 78.5 |
| SeaS | 99.3 | 98.7 | 79.1 | - |
| O2MAG | 99.7 | 99.3 | 84.6 | 90.6 |

O2MAG achieves state-of-the-art performance across all metrics; its classification accuracy exceeds the best training-based method (DualAnoDiff) by 12.1 percentage points (90.6 vs. 78.5).

Ablation Study

| Configuration | AP-I | F1-P | Note |
| --- | --- | --- | --- |
| Full O2MAG | 99.7 | 84.6 | Complete model |
| w/o AGO | 98.9 | 81.2 | Contribution of text embedding optimization |
| w/o DAE | 99.3 | 82.4 | Contribution of attention enhancement |
| w/o TriAG mask | 97.5 | 75.8 | Mask guidance is the core component |

Key Findings

  • Mask guidance in TriAG is the most critical component; removing it causes foreground/background confusion.
  • AGO contributes substantially to classification performance (+12.1 points over the best training-based baseline), as accurate anomaly semantics are essential for category discrimination.
  • A training-free method surpasses training-based methods on downstream anomaly detection for the first time, demonstrating the expressive power of SD's self-attention priors.
  • KID and IC-LPIPS metrics confirm that O2MAG achieves superior generation quality and diversity.

Highlights & Insights

  • Anomaly–Normal Separation in Self-Attention: PCA visualizations reveal that SD's self-attention naturally encodes semantic separation between foreground and background, a finding generalizable to other image editing tasks.
  • Training-Free Surpasses Training-Based: The results demonstrate that carefully designed attention manipulation can outperform fine-tuning-based methods, substantially lowering the barrier to anomaly synthesis.
  • Tri-Branch Parallel Design: By decoupling the anomaly source, background source, and generation target into separate branches, the method achieves precise region-level control.

Limitations & Future Work

  • AGO still requires 500 optimization steps, increasing inference latency.
  • The method relies on predefined anomaly masks, which may be difficult to obtain in real-world scenarios.
  • Generation quality is limited for extremely small defects.
  • Future work could integrate segmentation models such as SAM to automate mask generation.

Comparison with Related Work

  • vs. AnomalyAny: AnomalyAny manipulates only cross-attention; O2MAG operates on self-attention K/V, achieving finer-grained control.
  • vs. DualAnoDiff: DualAnoDiff requires training a dedicated defect branch; O2MAG is entirely training-free.
  • vs. MasaCtrl: O2MAG extends MasaCtrl's attention-manipulation paradigm with a tri-branch structure and mask guidance designed specifically for anomaly synthesis.

Rating

  • Novelty: ⭐⭐⭐⭐ Tri-branch attention grafting is novel, though it builds on existing attention manipulation concepts.
  • Experimental Thoroughness: ⭐⭐⭐⭐ Comprehensive evaluation on MVTec-AD with multi-dimensional metrics.
  • Writing Quality: ⭐⭐⭐⭐ Method description is clear and visualizations are thorough.
  • Value: ⭐⭐⭐⭐ Substantially lowers the training barrier for industrial anomaly synthesis.