Anomaly-Preference Image Generation (APO)

Conference: ICML 2026
arXiv: 2605.02439
Code: Not explicitly released
Area: Diffusion Models / Industrial Anomaly Detection / Preference Alignment
Keywords: Few-shot anomaly generation, DPO, implicit preference, time-aware LoRA, hierarchical sampling

TL;DR

The authors reformulate "few-shot anomaly image generation" as a "preference optimization problem without manual annotation": real anomalies serve as positive samples, while the denoising bias of the reference model at the same timestep acts as an implicit negative sample. A DPO-style loss aligns the diffusion model to the anomaly distribution. Time-aware LoRA rank adjustment (TACA) preserves structural diversity, and hierarchical CFG controls text-anomaly alignment strength. On benchmarks like MVTec, both fidelity and diversity are improved.

Background & Motivation

Background: Industrial visual anomaly detection is constrained by "scarcity of defect samples + expensive annotation." The mainstream approach uses diffusion models to synthesize more samples from a few real anomalies to augment detectors. Methods fall into two camps: fine-tuning (e.g., AnomalyDiffusion, DualAnoDiff, SeaS) and training-free (e.g., AnomalyAny).

Limitations of Prior Work: Fine-tuning methods often decouple appearance and location learning, leading to semantic inconsistency; dual-stream architectures suffer from conflicting gradients. Training-free methods shift all computation to inference, causing high latency. Neither approach explicitly constrains the "alignment of generated and real anomaly distributions," resulting in either overfitting (poor diversity) or distribution drift (poor fidelity).

Key Challenge: There is a trade-off between realism and diversity. Directly applying DPO requires paired preference data (both positive and negative samples generated by the model), but in few-shot scenarios, such pairs are unattainable—distribution alignment is needed without manual preference signals.

Goal: (1) Construct a stable optimization objective without manual annotation to directly align the generated and target anomaly distributions; (2) maintain diversity during alignment; (3) enable inference-time control over the balance between "base model coherence" and "anomaly pattern alignment."

Key Insight: The core of DPO is KL-regularized policy optimization, with the key trick being reward reparameterization as \(\beta\log\frac{p_\theta}{p_\text{ref}}\). By using real anomalies as positive samples and treating the "reference model's denoising error at the same \(\mathbf{z}_t\)" as an implicit baseline, preference gradients can be obtained without manual negatives—this is the core idea.

Core Idea: Replace the DPO paired preference log-ratio with the "denoising bias between policy and reference models, \(\Delta = \|\hat\epsilon_\theta-\epsilon\|^2 - \|\hat\epsilon_\text{ref}-\epsilon\|^2\)," turning few-shot anomaly generation into an analytically tractable, stable implicit preference optimization problem.

Method

The APO framework consists of three components: Constrained Policy Optimization (CPO, rewriting DPO to a pair-free form), Time-Aware Capacity Allocation (TACA, adjusting LoRA rank by diffusion timestep), and Hierarchical Sampling (three-way CFG at inference). These address "alignment," "diversity," and "controllability," respectively.

Overall Architecture

  • Input: A small set of \(K\) real anomaly samples \(\{(\mathbf{x}_i, \mathbf{c}_i)\}\) + pretrained Stable Diffusion as reference model \(\epsilon_\text{ref}\).
  • Training Loop: Sample a real anomaly → add noise at a randomly sampled timestep \(t\) to obtain \(\mathbf{z}_t\) → predict the noise with both \(\epsilon_\text{ref}\) and \(\epsilon_\theta\) → compute the bias \(\Delta\) → update the policy with \(\mathcal{L}_\text{APO} = -\log\sigma(-\beta_t\Delta)\), updating only the TACA-style time-aware LoRA parameters. No negative samples or reward model are required.
  • Inference: Decompose denoising into "unconditional + text-conditional + anomaly alignment" components, with \(s_\text{text}\) and \(s_\text{align}\) controlling weights for hierarchical sampling. Aggregate alignment deviation at each step (after upsampling) to form an anomaly localization mask.
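The training-loop data flow above can be sketched on a toy example. This is a minimal illustration, not the authors' code: `toy_predictor` is a hypothetical stand-in for the frozen and LoRA-adapted U-Nets, the text condition is omitted, and the noising step uses the standard DDPM forward formula with an assumed \(\bar\alpha_t = 0.5\).

```python
import math

def add_noise(x0, eps, alpha_bar_t):
    """Standard DDPM forward step: z_t = sqrt(a_bar) * x0 + sqrt(1 - a_bar) * eps."""
    a, b = math.sqrt(alpha_bar_t), math.sqrt(1.0 - alpha_bar_t)
    return [a * x + b * e for x, e in zip(x0, eps)]

def squared_error(pred, target):
    """Squared L2 distance between two flat vectors."""
    return sum((p - q) ** 2 for p, q in zip(pred, target))

def denoising_bias(eps_theta, eps_ref, z_t, t, eps):
    """Delta = ||eps_theta(z_t, t) - eps||^2 - ||eps_ref(z_t, t) - eps||^2."""
    return squared_error(eps_theta(z_t, t), eps) - squared_error(eps_ref(z_t, t), eps)

def toy_predictor(x0_guess, alpha_bar_t):
    """Hypothetical noise predictor: returns the noise implied by a guessed clean sample."""
    a, b = math.sqrt(alpha_bar_t), math.sqrt(1.0 - alpha_bar_t)
    def predict(z_t, t):
        return [(z - a * g) / b for z, g in zip(z_t, x0_guess)]
    return predict

# Toy run: the "policy" knows the real sample, the "reference" guesses zeros.
alpha_bar = 0.5            # assumed noise level for this illustration
x0 = [1.0, 2.0]            # stands in for a real anomaly latent
eps = [0.3, -0.7]          # the noise that was actually added
z_t = add_noise(x0, eps, alpha_bar)
policy = toy_predictor(x0, alpha_bar)
reference = toy_predictor([0.0, 0.0], alpha_bar)
delta = denoising_bias(policy, reference, z_t, t=500, eps=eps)
```

Here `delta` comes out negative because the toy policy recovers the added noise while the toy reference does not, which is the direction the loss rewards.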

Key Designs

  1. Implicit Preference Alignment via CPO (Pair-Free DPO Rewriting):

    • Function: Rewrites DPO's paired preference loss as a "policy vs. reference model denoising bias on the same real sample," removing dependence on manual annotation.
    • Mechanism: Starting from the constrained objective \(\max_\theta \mathbb{E}[\mathcal{R}] - \beta D_\text{KL}(p_\theta \| p_\text{ref})\), the optimal policy is \(p^*_\theta \propto p_\text{ref}\exp(\mathcal{R}/\beta)\), so the reward can be reparameterized as \(\mathcal{R} = \beta\log\frac{p_\theta}{p_\text{ref}}\). Using the variational ELBO, the "KL difference along the diffusion trajectory" simplifies to the closed form \(\delta = \frac{1}{2}\mathbb{E}_t[\lambda'_t\Delta]\) with \(\Delta = \|\hat\epsilon_\theta-\epsilon\|^2 - \|\hat\epsilon_\text{ref}-\epsilon\|^2\), where \(\lambda'_t\) is the time derivative of the log-SNR. The final loss is \(\mathcal{L}_\text{APO} = -\log\sigma(-\beta_t\Delta)\), with \(\beta_t = -\frac{1}{2}\beta\lambda'_t\) providing time-adaptive weighting (positive when the log-SNR decreases with \(t\), as in the usual convention). \(\Delta < 0\) means the policy model fits the real anomaly better than the reference model; the sigmoid converts this margin into a preference probability. Because the gradient with respect to \(\Delta\) is \(\beta_t\sigma(\beta_t\Delta)\), it shrinks toward zero once the policy clearly beats the reference, which is why the monotonic log-sigmoid is more stable and less prone to overfitting than a pure L2 loss.
    • Design Motivation: Traditional DPO requires \((\mathbf{z}^w, \mathbf{z}^l)\) paired manual preferences, which are unavailable in few-shot anomaly settings. Using the reference model's denoising at the same \(\mathbf{z}_t\) as a natural baseline is equivalent to "any real anomaly is preferred over the reference model's prediction at that noise level"—retaining DPO's KL regularization stability without annotation cost.
  2. Time-Aware Capacity Allocation (TACA):

    • Function: Dynamically adjusts LoRA trainable rank by diffusion timestep—small rank at high-noise steps preserves structural diversity; large rank at low-noise steps learns fine-grained anomaly details.
    • Mechanism: Diffusion denoising at high-noise steps determines global structure (layout/composition), while low-noise steps determine texture details. Uniform fine-tuning across all timesteps causes high-noise steps to overfit background patterns, suppressing diversity. TACA writes LoRA weight updates as \(\Delta W_t = B \cdot G_t \cdot A\), where \(G_t\) is a diagonal mask selecting dimensions, with active dimensions \(k(t) = \lfloor k_\text{min} + (k_\text{max}-k_\text{min})(T-t)/T \rfloor\). As \(t\to T\) (high noise), \(k\to k_\text{min}\) (almost frozen); as \(t\to 0\) (low noise), \(k\to k_\text{max}\) (high adaptation).
    • Design Motivation: Treating "model capacity" as a resource to be redistributed along the time axis better respects the physical meaning of the diffusion process than uniform fine-tuning. This preserves both structural diversity and anomaly detail, addressing the root cause of the fidelity-diversity trade-off.
  3. Hierarchical Sampling + Deviation-Guided Localization:

    • Function: At inference, decomposes denoising into three independently tunable components; simultaneously upsamples alignment bias into a pixel-level anomaly mask, providing free anomaly localization.
    • Mechanism: Denoising prediction is written as \(\hat\epsilon = \epsilon_\text{ref}(\mathbf{z}_t, t) + s_\text{text}\Delta_\text{text} + s_\text{align}\Delta_\text{align}\), where \(\Delta_\text{text} = \epsilon_\text{ref}(\mathbf{z}_t, \mathbf{c}, t) - \epsilon_\text{ref}(\mathbf{z}_t, t)\) controls text condition strength, and \(\Delta_\text{align} = \epsilon_\theta(\mathbf{z}_t, \mathbf{c}, t) - \epsilon_\text{ref}(\mathbf{z}_t, \mathbf{c}, t)\) controls anomaly alignment strength. The two scales are independently tunable, equivalent to sampling from \(p_\text{ref} \cdot (p_\text{ref}^{\mathbf{c}}/p_\text{ref})^{s_\text{text}} \cdot (p_\theta^{\mathbf{c}}/p_\text{ref}^{\mathbf{c}})^{s_\text{align}}\). \(\|\Delta_\text{align}\|\) is aggregated with TACA weights \(k(t)\), bilinearly upsampled, and Gaussian smoothed to obtain \(\mathbf{P}_\text{anomaly}\)—the anomaly localization map is this alignment bias itself.
    • Design Motivation: Traditional CFG has only one scale and cannot separate "text semantics" from "anomaly patterns"; the hierarchical form provides fine-grained user control. The localization mask is a free byproduct, as the bias signal itself is the best indicator of anomaly saliency.
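The three-way guidance rule from the hierarchical sampling design can be sketched as a plain function over noise predictions. This is a minimal sketch under stated assumptions: the three inputs are flat lists standing in for the outputs of \(\epsilon_\text{ref}(\mathbf{z}_t, t)\), \(\epsilon_\text{ref}(\mathbf{z}_t, \mathbf{c}, t)\), and \(\epsilon_\theta(\mathbf{z}_t, \mathbf{c}, t)\); the function name and the numeric values are illustrative only.

```python
def hierarchical_eps(eps_ref_uncond, eps_ref_cond, eps_theta_cond, s_text, s_align):
    """eps_hat = eps_ref(z) + s_text * Delta_text + s_align * Delta_align."""
    combined = []
    for u, rc, pc in zip(eps_ref_uncond, eps_ref_cond, eps_theta_cond):
        delta_text = rc - u      # text-condition direction (reference model only)
        delta_align = pc - rc    # anomaly-alignment direction (policy vs. reference)
        combined.append(u + s_text * delta_text + s_align * delta_align)
    return combined

# Illustrative values, not taken from the paper.
u = [0.0, 1.0]
rc = [0.5, 2.0]
pc = [1.0, 1.5]

same_as_policy = hierarchical_eps(u, rc, pc, s_text=1.0, s_align=1.0)
plain_cfg = hierarchical_eps(u, rc, pc, s_text=2.0, s_align=0.0)
```

Two limiting cases follow directly from the formula: with both scales at 1 the result equals the policy's conditional prediction, and with \(s_\text{align} = 0\) it reduces to ordinary classifier-free guidance on the reference model.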

Loss & Training

Only the LoRA adapter is trained; the reference model \(\epsilon_\text{ref}\) is fully frozen. The loss is a single term: \(\mathcal{L}_\text{APO} = -\log\sigma(-\beta_t\Delta)\), with the timestep sampled uniformly, \(t \sim \mathcal{U}(0,T)\), and noise \(\epsilon \sim \mathcal{N}(0,\mathbf{I})\). \(\beta\) controls the KL strength, and \(\beta_t = -\frac{1}{2}\beta\lambda'_t\) is scaled automatically by the time derivative of the log-SNR.
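The single-term loss can be written in a few lines. The sketch below is illustrative rather than the authors' implementation: it treats \(\Delta\) and \(\lambda'_t\) as already-computed scalars, uses a numerically stable log-sigmoid, and includes the analytic gradient \(\partial\mathcal{L}/\partial\Delta = \beta_t\sigma(\beta_t\Delta)\), which follows from differentiating the loss. The function names are hypothetical.

```python
import math

def log_sigmoid(x):
    """Stable log(sigmoid(x)) = min(x, 0) - log(1 + exp(-|x|))."""
    return min(x, 0.0) - math.log1p(math.exp(-abs(x)))

def sigmoid(x):
    """Stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def beta_t(beta, dlog_snr_dt):
    """beta_t = -0.5 * beta * lambda'_t (lambda'_t < 0 if log-SNR decreases in t)."""
    return -0.5 * beta * dlog_snr_dt

def apo_loss(delta, b_t):
    """L_APO = -log sigmoid(-beta_t * Delta)."""
    return -log_sigmoid(-b_t * delta)

def apo_loss_grad(delta, b_t):
    """dL/dDelta = beta_t * sigmoid(beta_t * Delta)."""
    return b_t * sigmoid(b_t * delta)
```

At \(\Delta = 0\) the loss equals \(\log 2\); it falls as the policy beats the reference (\(\Delta < 0\)), and the gradient flattens in that regime instead of growing as a squared-error term would.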

Key Experimental Results

Main Results (MVTec Fidelity + Diversity)

Compared with Crop&Paste, DFMGAN, AnomalyDiffusion, DualAnoDiff, AnomalyAny, SeaS. Two core metrics: IS (Inception Score, higher is more realistic/diverse) and IC-LPIPS (Intra-Class LPIPS, higher is more diverse).

| Category | Metric | AnomalyDiffusion | DualAnoDiff | SeaS | APO (Ours) |
|---|---|---|---|---|---|
| bottle | IS / IC-L | 1.58 / 0.19 | 2.17 / 0.43 | 1.78 / 0.21 | 2.19 / 0.45 |
| cable | IS / IC-L | 2.13 / 0.41 | 2.15 / 0.43 | 2.09 / 0.42 | 2.20 / 0.45 |
| capsule | IS / IC-L | 1.59 / 0.21 | 1.62 / 0.32 | 1.56 / 0.26 | 2.18 / 0.34 |
| carpet | IS / IC-L | 1.16 / 0.24 | 1.36 / 0.29 | 1.13 / 0.25 | 1.39 / 0.36 |

In the four categories shown, APO attains the highest IS and IC-LPIPS among the listed methods, so fidelity and diversity improve together rather than at each other's expense.

Ablation Study

| Configuration | Key Observation | Description |
|---|---|---|
| CPO only (no TACA) | Good fidelity but reduced diversity | Validates TACA as key for diversity |
| TACA only (no CPO) | Good diversity but distribution drift | Validates CPO as key for fidelity |
| Full APO | Both achieved | Modules are complementary and indispensable |
| Adjust \(s_\text{text}, s_\text{align}\) | Smoothly traverses trade-off curve | Hierarchical sampling provides fine-grained controllability |

Key Findings

  • CPO and TACA contribute orthogonally: the former aligns to the target distribution, the latter preserves generative diversity; both are needed for dual SOTA in fidelity and diversity.
  • Anomaly localization mask is a free lunch: no explicit localization supervision during training, but \(\|\Delta_\text{align}\|\) naturally highlights anomaly regions, serving as pseudo-labels for downstream detectors.
  • The reparameterized \(\beta_t\) scales automatically with the time derivative of the log-SNR and is reported to train much more stably than a fixed \(\beta\); this time-adaptive weighting is likely useful for preference learning in diffusion models more generally.
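The localization byproduct mentioned in the findings can be illustrated with a small aggregation routine. This is a sketch only: it covers the weighted aggregation of per-step \(\|\Delta_\text{align}\|\) maps with weights \(k(t)\), and it deliberately leaves out the bilinear upsampling and Gaussian smoothing steps. The min-max normalization and the function names are assumptions added for the illustration.

```python
def aggregate_deviation(dev_maps, weights):
    """Weighted average of per-step deviation maps (lists of rows), weights = k(t)."""
    total = float(sum(weights))
    height, width = len(dev_maps[0]), len(dev_maps[0][0])
    agg = [[0.0] * width for _ in range(height)]
    for dev, wt in zip(dev_maps, weights):
        for i in range(height):
            for j in range(width):
                agg[i][j] += wt * dev[i][j] / total
    return agg

def min_max_normalize(score_map):
    """Rescale a map to [0, 1]; a constant map becomes all zeros (assumed convention)."""
    flat = [v for row in score_map for v in row]
    lo, hi = min(flat), max(flat)
    if hi == lo:
        return [[0.0 for _ in row] for row in score_map]
    return [[(v - lo) / (hi - lo) for v in row] for row in score_map]

# Two sampling steps on a 2x2 latent grid; values are illustrative.
step_maps = [
    [[0.0, 0.0], [0.0, 4.0]],   # deviation magnitudes at a high-noise step
    [[0.0, 0.0], [0.0, 2.0]],   # deviation magnitudes at a low-noise step
]
ranks = [1, 3]                   # k(t) for the two steps
aggregated = aggregate_deviation(step_maps, ranks)
anomaly_map = min_max_normalize(aggregated)
```

Thresholding `anomaly_map` would then give a binary mask; the threshold is a free choice not specified here.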

Highlights & Insights

  • Extending DPO to pair-free diffusion scenarios: Using "reference model denoising error" instead of "model-generated negative samples" is a simple yet powerful preference signal design; applicable to any few-shot scenario where "I have a small target distribution and want the diffusion model to align."
  • Time-step-based model capacity allocation: TACA is a one-line code change (adjusting LoRA mask by \(t\)), but greatly alleviates high-noise-step overfitting during fine-tuning. This "allocate trainable parameters by noise schedule" idea can transfer to style transfer, personalized generation, or any LoRA fine-tuning scenario.
  • Alignment bias as saliency map: No localization supervision during training, yet pixel-level masks are produced at inference—this "distribution alignment → implicit localization" byproduct directly saves a segmentation head in industrial deployment.
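The timestep-dependent LoRA mask described for TACA can be sketched as follows. The schedule \(k(t) = \lfloor k_\text{min} + (k_\text{max}-k_\text{min})(T-t)/T \rfloor\) and the form \(\Delta W_t = B \cdot G_t \cdot A\) come from the method description; the choice to activate the first \(k(t)\) rank dimensions, the function names, and the values \(k_\text{min}=2\), \(k_\text{max}=16\), \(T=1000\) are assumptions for illustration.

```python
import math

def active_rank(t, T, k_min, k_max):
    """k(t) = floor(k_min + (k_max - k_min) * (T - t) / T)."""
    return math.floor(k_min + (k_max - k_min) * (T - t) / T)

def rank_mask(t, T, k_min, k_max, rank):
    """Diagonal of G_t: 1 for active rank dimensions, 0 otherwise.
    Assumption: the first k(t) dimensions are the active ones."""
    k = active_rank(t, T, k_min, k_max)
    return [1.0 if i < k else 0.0 for i in range(rank)]

def masked_lora_delta(B, A, mask):
    """Delta W_t = B @ diag(mask) @ A, with B of shape (d_out, r) and A of shape (r, d_in)."""
    d_out, r, d_in = len(B), len(A), len(A[0])
    return [
        [sum(B[i][k] * mask[k] * A[k][j] for k in range(r)) for j in range(d_in)]
        for i in range(d_out)
    ]

# Illustrative schedule endpoints.
k_high_noise = active_rank(1000, 1000, 2, 16)   # t = T
k_mid = active_rank(500, 1000, 2, 16)
k_low_noise = active_rank(0, 1000, 2, 16)       # t = 0

# Tiny rank-2 example with only the first rank dimension active.
B = [[1.0, 2.0], [3.0, 4.0]]
A = [[1.0, 0.0], [0.0, 1.0]]
delta_w = masked_lora_delta(B, A, [1.0, 0.0])
```

In a real adapter the mask would be applied inside the LoRA forward pass for the sampled timestep, so high-noise steps update only a few rank dimensions while low-noise steps use the full rank.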

Limitations & Future Work

  • Only validated on industrial anomalies (MVTec, VisA, etc.); transferability to medical anomalies or natural scene defects requires further study.
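  • Inference needs extra forward passes: each step evaluates both \(\epsilon_\theta\) and \(\epsilon_\text{ref}\), and the hierarchical rule uses three noise predictions (unconditional reference, text-conditional reference, text-conditional policy) where standard CFG uses two; real-time applications would require model distillation.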
  • TACA's \(k_\text{min}, k_\text{max}\) are manually designed schedules; whether optimal allocation schedules can be learned in a data-driven way is an open question.
  • Stability of CPO in ultra-few-shot scenarios (\(K=1\) or \(2\)) is not specifically explored; experiments use \(K\) in the range of \(5\sim10\).

Comparison with Related Work

  • vs AnomalyDiffusion / DualAnoDiff: Those use textual inversion or dual-stream fine-tuning; this work uses preference learning + LoRA, yielding a lighter and more stable framework.
  • vs SeaS: SeaS takes a unified multi-task approach; this work follows a different philosophy, "align the distribution first, then sample hierarchically." The advantage here is a better fidelity-diversity Pareto trade-off.
  • vs DPO/Diffusion-DPO (Wallace 2024): Classic Diffusion-DPO requires human-ranked image pairs; this work replaces them with real samples + reference model bias, making it the first to apply DPO to diffusion scenarios where no preference pairs are available.
  • vs RLHF for diffusion: Traditional RLHF requires training a reward model and then PPO; APO is one-step, single-loss, and training cost is on par with standard fine-tuning.

Rating

  • Novelty: ⭐⭐⭐⭐⭐ Reformulating DPO as an implicit "pair-free preference" method is highly ingenious
  • Experimental Thoroughness: ⭐⭐⭐⭐ Full MVTec categories + multiple baselines and metrics; lacks sensitivity analysis for ultra-small \(K\)
  • Writing Quality: ⭐⭐⭐⭐ Clear mathematical derivations, complete pseudocode; a few dense symbols
  • Value: ⭐⭐⭐⭐⭐ Strong industrial demand for anomaly generation + defect detection augmentation; method is directly deployable