HAODiff: Human-Aware One-Step Diffusion via Dual-Prompt Guidance¶
Conference: NeurIPS 2025 · arXiv: 2505.19742 · Code: Available · Area: Segmentation · Keywords: Human image restoration, motion blur, one-step diffusion, dual-prompt guidance, classifier-free guidance
TL;DR¶
This paper proposes HAODiff, a human-aware one-step diffusion model that generates adaptive positive–negative prompt pairs via a three-branch Dual-Prompt Guidance (DPG) module. Combined with an explicit Human Motion Blur (HMB) degradation pipeline and Classifier-Free Guidance (CFG), HAODiff substantially outperforms existing state-of-the-art methods on human image restoration tasks.
Background & Motivation¶
Complex degradation in human images: Real-world human images suffer simultaneously from generic degradations (noise, compression, downsampling) and Human Motion Blur (HMB). Existing methods typically address only one of these.
HMB absent from degradation pipelines: Mainstream blind image restoration (BIR) models adopt the Real-ESRGAN degradation pipeline (downsampling + compression + noise + low-pass blur), but lack simulation of local human motion blur—one of the most common and challenging degradation types in human imagery.
Inadequate negative prompt design: Existing methods employ fixed negative prompts (e.g., empty text or fixed noise descriptions), which cannot provide adaptive guidance tailored to the specific degradation pattern of each image.
Computational efficiency: Multi-step diffusion models (e.g., SUPIR requiring 50 steps / 26.67 s) incur substantial inference overhead, whereas one-step diffusion models significantly reduce computational cost while maintaining quality.
Method¶
Overall Architecture¶
HAODiff adopts a two-stage training framework:
- Stage 1: Trains the three-branch Dual-Prompt Guidance (DPG) module to predict, from the LQ image, an HQ image (source of the positive prompt), residual noise, and an HMB segmentation mask (sources of the negative prompt).
- Stage 2: Injects the positive and negative prompt embeddings generated by DPG into the one-step diffusion model, guiding single-step LQ→HQ denoising via CFG.
Key Designs¶
1. Degradation Pipeline with HMB
The core innovation is the explicit introduction of human motion blur into the degradation process:
- The Sapiens model is used to perform human body-part segmentation on HQ images, yielding six category masks: head, left/right upper limbs, left/right lower limbs, and full body.
- A body-part category is randomly selected; a spatial weight map is obtained via morphological operations (erosion → dilation → Gaussian blur) and normalization: \(W_s = (\text{Norm} \circ \text{Morph} \circ \text{Seg})(I_H)\)
- A random motion trajectory is simulated via a Markov process, producing a PSF; FFT-based convolution yields a globally motion-blurred image \(I_B\).
- The original and blurred images are blended according to the spatial weight map: \(I_{\text{HMB}} = W_s \odot I_H + (1-W_s) \odot I_B\)
- Key design choice: HMB is applied in the first degradation stage (since motion blur logically occurs at capture time), with generic degradations (noise, compression, etc.) applied in the second stage.
- The first stage has three possible states: no degradation / HMB / generic degradation.
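The HMB synthesis steps above can be sketched in NumPy. This is my reconstruction from the description, not the authors' code: `random_walk_psf`, the morphology iteration counts, and the blur parameters are all illustrative choices.

```python
import numpy as np
from scipy.ndimage import binary_erosion, binary_dilation, gaussian_filter
from scipy.signal import fftconvolve

def spatial_weight(part_mask, sigma=3.0):
    """W_s = (Norm o Morph o Seg)(I_H): erosion -> dilation -> Gaussian blur -> [0, 1]."""
    m = binary_erosion(part_mask, iterations=2)
    m = binary_dilation(m, iterations=4).astype(np.float32)
    w = gaussian_filter(m, sigma)
    return w / max(w.max(), 1e-8)

def random_walk_psf(size=21, steps=64, seed=0):
    """Markov-chain motion trajectory rasterized into a normalized PSF."""
    rng = np.random.default_rng(seed)
    psf = np.zeros((size, size), np.float32)
    x = y = size // 2
    angle = rng.uniform(0, 2 * np.pi)
    for _ in range(steps):
        angle += rng.normal(scale=0.3)  # Markov step: direction drifts over time
        x = int(np.clip(x + np.cos(angle), 0, size - 1))
        y = int(np.clip(y + np.sin(angle), 0, size - 1))
        psf[y, x] += 1.0
    return psf / psf.sum()

def apply_hmb(I_H, part_mask):
    """Blend per the paper's formula: I_HMB = W_s * I_H + (1 - W_s) * I_B."""
    W = spatial_weight(part_mask)
    I_B = fftconvolve(I_H, random_walk_psf(), mode="same")  # FFT-based global blur
    return W * I_H + (1.0 - W) * I_B
```

In a full pipeline this output would then pass through the second-stage generic degradations (noise, compression, downsampling).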
2. Three-Branch Dual-Prompt Guidance (DPG)
Built on a Swin Transformer, the module comprises a shared feature extractor and three independent reconstruction branches:
- Shared backbone \(H_E\): Extracts features after 4× downsampling through 2 RSTBs (each with 6 STLs, 6-head attention).
- Three independent branches \(H_{R_i}\): Each contains 2 RSTBs (3 STLs, 3-head attention), predicting respectively:
- Branch 1: HQ image \(\hat{I}_H^P\) (source of positive prompts)
- Branch 2: Residual noise \(\hat{I}_R = I_L - I_H\) (first source of negative prompts)
- Branch 3: HMB segmentation mask \(\hat{M}_{\text{HMB}}\) (second source of negative prompts; single-channel + sigmoid)
Key insight: The negative prompt should not be the LQ image itself, but rather the residual noise (which contains only degradation information without structural content); otherwise, restored images lose fidelity. The HMB mask additionally provides precise spatial localization of local motion blur.
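The shared-extractor-plus-three-heads layout can be illustrated with a toy PyTorch module. Plain conv blocks stand in for the paper's RSTB/STL Swin blocks, and all channel sizes are illustrative, not the authors' configuration.

```python
import torch
import torch.nn as nn

class ToyDPG(nn.Module):
    """Structural sketch of DPG: shared backbone H_E + three prediction heads."""
    def __init__(self, ch=32):
        super().__init__()
        # Shared backbone: 4x downsampling (stand-in for the 2-RSTB extractor)
        self.shared = nn.Sequential(
            nn.Conv2d(3, ch, 3, stride=2, padding=1), nn.GELU(),
            nn.Conv2d(ch, ch, 3, stride=2, padding=1), nn.GELU(),
        )
        def head(out_ch):
            # Stand-in for one lighter reconstruction branch H_{R_i}
            return nn.Sequential(
                nn.Conv2d(ch, ch, 3, padding=1), nn.GELU(),
                nn.Upsample(scale_factor=4, mode="bilinear"),
                nn.Conv2d(ch, out_ch, 3, padding=1),
            )
        self.hq_head = head(3)    # branch 1: HQ image (positive prompt source)
        self.res_head = head(3)   # branch 2: residual noise I_L - I_H (negative)
        self.mask_head = head(1)  # branch 3: HMB mask, single channel + sigmoid

    def forward(self, lq):
        f = self.shared(lq)
        return self.hq_head(f), self.res_head(f), torch.sigmoid(self.mask_head(f))
```

The key point the sketch preserves is that one forward pass over the LQ image yields all three prompt sources from a single shared feature map.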
3. One-Step Diffusion Model (OSD)
- Based on SD2.1-base, with LoRA (rank=16) fine-tuning of the UNet.
- The final-layer features from each DPG branch are mapped to SD-compatible embedding vectors via a Prompt Embedder (Performer Encoder + Attention Pooling).
- Features from the two negative branches are concatenated and fed into a shared Prompt Embedder to produce \(p_{\text{neg}}\); the positive branch generates \(p_{\text{pos}}\) independently.
- CFG-guided noise prediction: \(z_\varepsilon = z_{\text{neg}} + \lambda_{\text{cfg}} \cdot (z_{\text{pos}} - z_{\text{neg}})\), with \(\lambda_{\text{cfg}}=3.5\).
- The UNet concatenates positive and negative prompts along the batch dimension, obtaining \(z_{\text{pos}}\) and \(z_{\text{neg}}\) in a single forward pass.
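The batch-concatenated CFG step can be written in a few lines. This is a minimal NumPy sketch under the assumption that `unet` is any callable mapping (latents, prompt embeddings) to noise predictions; `one_pass_cfg` and its argument names are illustrative.

```python
import numpy as np

def cfg_combine(z_pos, z_neg, lam=3.5):
    """z_eps = z_neg + lambda_cfg * (z_pos - z_neg), with lambda_cfg = 3.5."""
    return z_neg + lam * (z_pos - z_neg)

def one_pass_cfg(unet, latent, p_pos, p_neg, lam=3.5):
    # Duplicate the latent and stack the two prompts along the batch axis,
    # so a single UNet forward yields both z_pos and z_neg.
    lat2 = np.concatenate([latent, latent], axis=0)
    prompts = np.concatenate([p_pos, p_neg], axis=0)
    out = unet(lat2, prompts)
    z_pos, z_neg = np.split(out, 2, axis=0)
    return cfg_combine(z_pos, z_neg, lam)
```

Note that with lam > 1 the prediction is pushed past the positive branch, away from the degradation described by the negative prompt.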
Loss & Training¶
Stage 1 (DPG training):

\(\mathcal{L} = \mathcal{L}_1(\hat{I}_H^P, I_H) + \mathcal{L}_1(\hat{I}_R, I_L - I_H) + \alpha \cdot \mathcal{L}_{\text{Dice}}(\hat{M}_{\text{HMB}}, M_{\text{HMB}})\)

- \(\alpha = 0.02\); Adam optimizer, lr = 2e-3, batch size 16, 20K iterations on 4× A6000 GPUs.
Stage 2 (OSD training):

\(\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{MSE}}(\hat{I}_H, I_H) + \mathcal{L}_{\text{EA}}(\hat{I}_H, I_H) + \beta \cdot \mathcal{L}_{\mathcal{G}}(\hat{z}_H)\)

- \(\mathcal{L}_{\text{EA}}\): edge-aware DISTS perceptual loss (dual-path DISTS on the original image and its Sobel edge map).
- \(\mathcal{L}_{\mathcal{G}}\): GAN generator loss, using the downsampling modules of a pretrained SDXL UNet as the discriminator.
- \(\beta = 0.01\); AdamW optimizer, lr = 1e-5, batch size 2, 120K iterations on 2× A6000 GPUs.
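The Stage-1 objective can be sketched directly from its formula. These are plain NumPy stand-ins for the L1 and Dice terms, written from the definitions above; only \(\alpha = 0.02\) is taken from the paper.

```python
import numpy as np

def l1(a, b):
    """Mean absolute error, standing in for the L1 reconstruction terms."""
    return np.abs(a - b).mean()

def dice_loss(pred, target, eps=1e-6):
    """Soft Dice loss for the single-channel HMB mask branch."""
    inter = (pred * target).sum()
    return 1.0 - (2.0 * inter + eps) / (pred.sum() + target.sum() + eps)

def stage1_loss(hq_pred, hq, res_pred, lq, mask_pred, mask, alpha=0.02):
    residual_target = lq - hq  # the branch-2 target: I_R = I_L - I_H
    return (l1(hq_pred, hq)
            + l1(res_pred, residual_target)
            + alpha * dice_loss(mask_pred, mask))
```

With perfect predictions on all three branches the loss is zero; a wrong mask contributes at most roughly \(\alpha\), reflecting how lightly the Dice term is weighted.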
Key Experimental Results¶
Results on PERSONA-Val Synthetic Dataset¶
| Method | Steps | Time (s) | DISTS↓ | LPIPS↓ | FID↓ | CLIPIQA↑ | NIQE↓ |
|---|---|---|---|---|---|---|---|
| SUPIR | 50 | 26.67 | 0.1415 | 0.2929 | 13.84 | 0.7908 | 3.777 |
| SeeSR | 50 | 5.05 | 0.1295 | 0.2555 | 12.82 | 0.7620 | 3.574 |
| OSDHuman | 1 | 0.11 | 0.1356 | 0.2384 | 14.41 | 0.7312 | 3.828 |
| HAODiff | 1 | 0.20 | 0.1023 | 0.2046 | 8.36 | 0.7737 | 2.830 |
With single-step inference, HAODiff attains the best DISTS, LPIPS, FID, and NIQE scores in the table; relative to OSDHuman, the other one-step method, DISTS drops by 24.6% and FID by 42.0%.
Results on MPII-Test Real Motion Blur Dataset¶
| Method | CLIPIQA↑ | MANIQA↑ | NIQE↓ | HMB-R↓ |
|---|---|---|---|---|
| SUPIR | 0.6702 | 0.6256 | 4.423 | 0.2776 |
| SeeSR | 0.6478 | 0.6636 | 4.615 | 0.2612 |
| OSDHuman | 0.6726 | 0.6535 | 3.912 | 0.2283 |
| HAODiff | 0.7203 | 0.7057 | 3.065 | 0.1167 |
The HMB-R (motion blur residual ratio) of 0.1167 is half that of OSDHuman, demonstrating significantly stronger motion blur removal capability.
Ablation Study¶
- Three-branch DPG vs. single branch: The three-branch dual-prompt guidance yields notable improvements in DISTS/LPIPS over positive-prompt-only guidance.
- Adaptive vs. fixed negative prompts: Adaptive residual noise negative prompts outperform fixed empty-text or fixed noise descriptions.
- Degradation pipeline with vs. without HMB: Incorporating HMB simulation significantly reduces HMB-R on MPII-Test.
- CFG coefficient \(\lambda_{\text{cfg}}\): 3.5 is the optimal value; lower values provide insufficient guidance while higher values introduce artifacts.
Key Findings¶
- Single-step HAODiff outperforms multi-step diffusion models such as SUPIR (50 steps) and SeeSR (50 steps) across nearly all metrics.
- Residual noise is a more principled negative prompt source than the LQ image itself: it isolates the degradation pattern while excluding the LQ image's structural content, which, if used as a negative prompt, would push the restoration away from the input's structure and hurt fidelity.
- The HMB segmentation mask provides spatial localization of local motion blur, enabling targeted processing rather than uniform full-image denoising.
- Inference time of only 0.20 s (at 512×512) is competitive for practical applications.
Highlights & Insights¶
- Degradation pipeline innovation: This work is the first to explicitly incorporate human motion blur simulation into the BIR degradation pipeline, achieving realistic HMB synthesis via body-part segmentation and spatial weight maps.
- Adaptive dual prompts: Constructing adaptive negative prompts from residual noise and HMB masks enables more effective CFG guidance than fixed negative prompts.
- Efficiency without quality compromise: Single-step diffusion with LoRA fine-tuning achieves 0.2 s inference while comprehensively surpassing 50-step large models.
- New benchmark MPII-Test: Comprising 5,427 real HMB images with YOLO-detector-based quantitative evaluation, this benchmark provides a standardized motion blur assessment protocol for human image restoration.
Limitations & Future Work¶
- The HMB simulation in the degradation pipeline models only rigid-body motion blur (PSF convolution-based), and does not account for non-uniform blur arising from articulated joint flexibility.
- The method depends on the quality of the Sapiens segmentation model—segmentation failures may lead to inaccurate HMB simulation.
- The SD2.1-base backbone is limited to 512×512 resolution; further extension is needed for high-resolution human images.
- The HMB detector on MPII-Test achieves only mAP@0.5 = 0.62, limiting the reliability of the HMB-R metric.
- Extending DPG to video human restoration scenarios remains an avenue for future exploration.
Related Work & Insights¶
- The three-branch DPG design (positive / negative / localization) is generalizable to other spatially-aware restoration tasks (e.g., dehazing, deraining).
- Using residual noise as a negative prompt source is superior to using the LQ image directly, and can be adopted in other diffusion-based restoration models.
- Body-part-level degradation simulation provides a reference paradigm for designing domain-specific degradation pipelines.
Rating¶
- Novelty: ⭐⭐⭐⭐ Substantial innovations in both the degradation pipeline and dual-prompt guidance.
- Experimental Thoroughness: ⭐⭐⭐⭐ Covers synthetic and real data, multiple metrics, and comprehensive ablations.
- Writing Quality: ⭐⭐⭐⭐ Method description is clear with complete derivations.
- Value: ⭐⭐⭐⭐ Strong practical utility for human restoration; the new benchmark is a meaningful contribution.