
DiffDoctor: Diagnosing Image Diffusion Models Before Treating

Conference: ICCV 2025 · arXiv: 2501.12382 · Code: Unavailable · Area: Diffusion Models · Keywords: Diffusion Models, Artifact Detection, Pixel-level Feedback, Image Quality, Model Fine-tuning

TL;DR

This paper proposes DiffDoctor, the first method to fine-tune diffusion models using pixel-level feedback. It first trains a robust artifact detector (on 1M+ samples with a category-balancing strategy), then fine-tunes the diffusion model by backpropagating through the frozen detector to minimize the per-pixel artifact confidence of synthesized images, substantially reducing artifacts even on unseen prompts.

Background & Motivation

Background: Image diffusion models (e.g., FLUX.1, SDXL, Kolors) can generate diverse images but still produce artifacts such as shape distortions (e.g., malformed hands/faces), implausible content (e.g., extra limbs), and watermarks. Existing improvement methods primarily leverage image-level scores or human preference feedback to optimize these models.

Limitations of Prior Work:

  • Coarse feedback granularity: Methods such as ImageReward, DDPO, and DiffDPO rely on image-level quality scores or pairwise comparisons, ignoring the fact that artifacts are sparsely distributed within an image—only local regions may be problematic, so image-level feedback cannot precisely guide correction.
  • Imbalanced artifact data: Existing annotated datasets (RichHF, PAL4VST) suffer from severe class imbalance—hands and faces in images generated by low-quality models nearly always contain artifacts, causing detectors to learn the shortcut that "all hands are artifacts."
  • Underutilization of pixel-level annotations: Although PAL4VST and RichHF provide fine-grained annotations, these are used only for post-processing (batch ranking or inpainting) and have not been directly applied to fine-tuning diffusion models.

Key Challenge: Solving a problem requires first localizing it—fine-tuning a model with image-level scores without knowing where artifacts reside is inefficient and may degrade overall image quality.

Core Idea: Diagnose-then-Treat—first train a pixel-level artifact detector to accurately localize problematic regions, then directly propagate the detector's gradient signal back to the diffusion model for pixel-aware treatment.

Method

Overall Architecture

DiffDoctor is a two-stage pipeline:

  • Diagnosing: Train a robust artifact detector that takes a synthesized image as input and outputs a per-pixel artifact confidence map (values in [0, 1]).
  • Treating: Freeze the detector; after the diffusion model generates an image, run a forward pass through the detector to obtain the artifact map, then minimize the artifact confidence and backpropagate the gradients to the diffusion model parameters.
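To make the treating stage concrete, below is a minimal PyTorch sketch of one optimization step. The detector and generator here are toy stand-ins (the paper uses SegFormer-b5 and FLUX.1 with LoRA); all names are illustrative assumptions, not the authors' code.

```python
import torch
import torch.nn as nn

# Toy stand-ins: any frozen per-pixel detector and any generator that keeps
# gradients through the final denoising steps fit this pattern.
detector = nn.Sequential(nn.Conv2d(3, 1, 3, padding=1), nn.Sigmoid())  # per-pixel confidence
detector.requires_grad_(False)  # "freeze the detector"; gradients still flow to its input

generator = nn.Conv2d(4, 3, 3, padding=1)  # stand-in for the (LoRA-wrapped) diffusion model
optimizer = torch.optim.AdamW(generator.parameters(), lr=1e-4)

z_T = torch.randn(1, 4, 64, 64)            # initial latent noise
image = generator(z_T)                      # stands in for pi_theta(z_T): full denoising pass

conf = detector(image)                      # artifact confidence map C(pi_theta(z_T)) in [0, 1]
mask = (conf > 0.1).float()                 # threshold mask M from the paper
n_aggr = mask.sum().clamp(min=1.0)          # number of selected pixels, N_aggr
loss_pixel = (mask * conf).sum() / n_aggr   # L_pixel: mean confidence over masked pixels

optimizer.zero_grad()
loss_pixel.backward()                       # gradients pass through the frozen detector
optimizer.step()
```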

Key Designs

  1. Class-Balanced Artifact Detector Training:

    • Function: Addresses the class imbalance in existing artifact annotations to train a reliable pixel-level detector.
    • Mechanism: (1) Introduce high-quality real photographs as negative samples to balance the distribution; (2) use an MLLM to assign category labels to images and compute aggregated artifact confidence per category to identify anomalously high or low categories; (3) use an LLM to generate diverse prompts for these categories and synthesize additional balanced samples using SOTA models (which produce higher-quality images); (4) select 2K hard cases from these samples for manual fine-grained annotation.
    • Design Motivation: Targeted correction of detector shortcuts such as the misclassification that "hands/faces are always artifacts."
  2. Pseudo-label Scaling (Human-in-the-Loop):

    • Function: Extend the annotated data to 1M+ samples via a semi-supervised learning approach.
    • Mechanism: Images not selected as hard cases are assigned pseudo-labels by the current detector. A dynamic augmentation strategy is applied: images whose maximum artifact confidence falls below a threshold (i.e., judged artifact-free) are downscaled and padded back to the original resolution, so that complex structures appear at small scales while remaining artifact-free; this is combined with strong augmentation (see the downscale-and-pad sketch after this list).
    • Design Motivation: Smaller regions are more error-prone; downscaled artifact-free images are more likely to contain small but correctly structured complex regions, helping to balance the distribution.
    • Model Backbone: SegFormer-b5; sigmoid is applied to the output logits to produce the confidence map.
  3. Pixel-Aware Treating:

    • Function: Directly optimize the diffusion model using the detector's gradient signal.
    • Mechanism: The diffusion model performs denoising with gradient tracking enabled; the generated image \(\pi_\theta(z_T)\) is passed through the frozen detector to obtain the artifact map \(C(\pi_\theta(z_T))\). The pixel-level loss is \(\mathcal{L}_{\text{pixel}} = \frac{1}{N_{\text{aggr}}}\sum_{i,j} M \circ C(\pi_\theta(z_T))[i,j]\), where \(M\) is a threshold mask selecting only pixels with confidence \(>0.1\) and \(N_{\text{aggr}}\) is the number of selected pixels. To reduce memory usage, gradient tracking is restricted to the last 25% of denoising steps. Only LoRA layers (rank = 16) are trained (cf. the treating sketch above).
    • Design Motivation: Pixel-level penalization is more precise than image-level penalization and avoids quality degradation caused by globally suppressing the output.
  4. Offline Regularization:

    • Function: Prevent model collapse (i.e., generation of blurry images).
    • Mechanism: A standard diffusion loss \(\mathcal{L}_{\text{offline}} = \|(z_T - z_0) - v_\theta(z_t, t)\|\) is added as a regularizer (playing the role of a KL term) that keeps the updated model from deviating from the real image distribution. The final loss is \(\mathcal{L} = \mathcal{L}_{\text{pixel}} + 0.25 \cdot \mathcal{L}_{\text{offline}}\).
    • Design Motivation: Pure pixel-level treatment training over an extended period leads to model collapse (analogous to reward hacking); regularization delays this collapse.
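Below is a minimal sketch of the dynamic downscale-and-pad augmentation from design 2, assuming images are CHW tensors; the threshold and scale values are illustrative assumptions, not the paper's exact settings.

```python
import torch
import torch.nn.functional as F

def downscale_and_pad(image: torch.Tensor, conf_map: torch.Tensor,
                      threshold: float = 0.1, scale: float = 0.5) -> torch.Tensor:
    """If the detector sees no artifact (max confidence below threshold),
    shrink the image and pad it back so complex structures appear at small,
    artifact-free scales. Threshold/scale values are illustrative."""
    if conf_map.max() >= threshold:
        return image  # likely contains artifacts; leave untouched
    c, h, w = image.shape
    new_h, new_w = int(h * scale), int(w * scale)
    small = F.interpolate(image[None], size=(new_h, new_w),
                          mode="bilinear", align_corners=False)[0]
    pad_h, pad_w = h - new_h, w - new_w
    # center-pad back to the original resolution (zero padding)
    return F.pad(small, (pad_w // 2, pad_w - pad_w // 2,
                         pad_h // 2, pad_h - pad_h // 2))
```

In practice the corresponding pseudo-label map would presumably be transformed identically so that labels stay spatially aligned with the augmented image.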

Loss & Training

  • Detector training: MSE loss \(\mathcal{L}_{\text{AD}} = \frac{1}{N}\sum_i \|\hat{C}_\theta(x_i) - C(x_i)\|_2^2\)
  • Diffusion model treatment: \(\mathcal{L} = \mathcal{L}_{\text{pixel}} + 0.25 \cdot \mathcal{L}_{\text{offline}}\)
  • Learning rate: 1e-4; primary experiments are conducted on FLUX.1 Schnell (4-step inference); the method also applies to SDXL and Kolors.
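To make the final weighting concrete, here is a sketch of the combined treating objective under the paper's velocity-prediction formulation; `loss_pixel` is computed as in the treating sketch above, and the argument names are illustrative.

```python
import torch

def offline_regularized_loss(loss_pixel: torch.Tensor,
                             v_pred: torch.Tensor,
                             z_0: torch.Tensor,
                             z_T: torch.Tensor,
                             weight: float = 0.25) -> torch.Tensor:
    """Total treating objective L = L_pixel + 0.25 * L_offline, where the
    offline term is the standard diffusion/flow-matching loss that keeps
    the model close to the real-image distribution."""
    target_v = z_T - z_0                       # velocity target
    loss_offline = (target_v - v_pred).norm()  # ||(z_T - z_0) - v_theta(z_t, t)||
    return loss_pixel + weight * loss_offline
```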

Key Experimental Results

Main Results

Artifact Detector Comparison:

| Method | MSE (Ours)↓ | KL (FN)↓ | KL(1-) (FP)↓ | MSE (Real)↓ | KL(1-) (Real)↓ |
| --- | --- | --- | --- | --- | --- |
| PAL4VST | 0.480 | 5.053 | 2.394 | 0.591 | 5.740 |
| RichHF* | 1.601 | 1.059 | 7.044 | 0.979 | 6.082 |
| +real photos | 1.167 | 1.111 | 4.803 | 0.029 | 1.558 |
| +hard cases | 0.504 | 0.981 | 2.983 | 0.003 | 0.458 |
| +pseudo 1M | 0.337 | 1.004 | 2.231 | 0.002 | 0.371 |

Diffusion Model Treatment Results:

| Method | Artifact Frequency↓ | ImageReward↑ | CLIP-T↑ |
| --- | --- | --- | --- |
| FLUX.1 (original) | 82.66% | 1.179 | 35.463 |
| FLUX.1 + HPSv2 | 80.67% | 1.022 | 35.037 |
| FLUX.1 + DiffDoctor | 22.00% | 1.183 | 35.611 |
| SDXL (original) | 55.33% | 0.974 | 36.211 |
| SDXL + DiffDoctor | 27.50% | 1.008 | 36.217 |
| Kolors (original) | 65.31% | 0.823 | 34.251 |
| Kolors + DiffDoctor | 29.33% | 0.824 | 34.424 |

DiffDoctor reduces the artifact frequency of FLUX.1 from 82.66% to 22.00% (a drop of over 60 percentage points), while ImageReward and CLIP-T are maintained or slightly improved.

Ablation Study

Pixel Selection Strategy (treating FLUX.1 with the best detector):

| Strategy | ImageReward | CLIP-T |
| --- | --- | --- |
| All pixels | 1.161 | 35.278 |
| Max pixel | 1.159 | 35.510 |
| Threshold | 1.183 | 35.611 |

  • Threshold-based selection outperforms both the all-pixel and single-pixel strategies, demonstrating the benefit of finer-grained pixel-level control.
  • Using a naive detector (high false-positive rate) for treatment causes severe model collapse, with ImageReward dropping to −0.9.

Personal Reflections

  • Highlights: This is the first work to apply pixel-level feedback to fine-tune diffusion models, achieving a remarkable reduction in artifact frequency; the category-balancing strategy systematically addresses shortcut learning in the detector.
  • Limitations: The treatment process requires a full forward denoising pass, a detector forward pass, and backpropagation, incurring substantial memory and computational overhead; model collapse still occurs with prolonged training, necessitating early stopping.
  • Insights: Detector quality is the bottleneck of treatment effectiveness; the "diagnose before treat" paradigm is potentially generalizable to quality control in other generative models.


Rating

  • Novelty: Pending
  • Experimental Thoroughness: Pending
  • Writing Quality: Pending
  • Value: Pending