
HAODiff: Human-Aware One-Step Diffusion via Dual-Prompt Guidance

Conference: NeurIPS 2025 arXiv: 2505.19742 Code: Available Area: Segmentation Keywords: Human image restoration, motion blur, one-step diffusion, dual-prompt guidance, classifier-free guidance

TL;DR

This paper proposes HAODiff, a human-aware one-step diffusion model that generates adaptive positive–negative prompt pairs via a three-branch Dual-Prompt Guidance (DPG) module. Combined with an explicit Human Motion Blur (HMB) degradation pipeline and Classifier-Free Guidance (CFG), HAODiff substantially outperforms existing state-of-the-art methods on human image restoration tasks.

Background & Motivation

Complex degradation in human images: Real-world human images suffer simultaneously from generic degradations (noise, compression, downsampling) and Human Motion Blur (HMB). Existing methods typically address only one of these.

HMB absent from degradation pipelines: Mainstream blind image restoration (BIR) models adopt the Real-ESRGAN degradation pipeline (downsampling + compression + noise + low-pass blur), but lack simulation of local human motion blur—one of the most common and challenging degradation types in human imagery.

Inadequate negative prompt design: Existing methods employ fixed negative prompts (e.g., empty text or fixed noise descriptions), which cannot provide adaptive guidance tailored to the specific degradation pattern of each image.

Computational efficiency: Multi-step diffusion models (e.g., SUPIR requiring 50 steps / 26.67 s) incur substantial inference overhead, whereas one-step diffusion models significantly reduce computational cost while maintaining quality.

Method

Overall Architecture

HAODiff adopts a two-stage training framework:

  • Stage 1: Trains the three-branch Dual-Prompt Guidance (DPG) module to predict, from the LQ image, an HQ image (source of positive prompts), residual noise, and an HMB segmentation mask (sources of negative prompts).
  • Stage 2: Injects the positive and negative prompt embeddings generated by DPG into the one-step diffusion model, guiding single-step LQ→HQ denoising via CFG.

Key Designs

1. Degradation Pipeline with HMB

The core innovation is the explicit introduction of human motion blur into the degradation process:

  • The Sapiens model is used to perform human body-part segmentation on HQ images, yielding six category masks: head, left/right upper limbs, left/right lower limbs, and full body.
  • A body-part category is randomly selected; a spatial weight map is obtained via morphological operations (erosion → dilation → Gaussian blur) and normalization: \(W_s = (\text{Norm} \circ \text{Morph} \circ \text{Seg})(I_H)\)
  • A random motion trajectory is simulated via a Markov process, producing a PSF; FFT-based convolution yields a globally motion-blurred image \(I_B\).
  • The original and blurred images are blended according to the spatial weight map: \(I_{\text{HMB}} = W_s \odot I_H + (1-W_s) \odot I_B\)
  • Key design choice: HMB is applied in the first degradation stage (since motion blur logically occurs at capture time), with generic degradations (noise, compression, etc.) applied in the second stage.
  • The first stage has three possible states: no degradation / HMB / generic degradation.
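The HMB synthesis steps above can be sketched in NumPy. This is an illustrative stand-in, not the authors' implementation: the random-walk trajectory model, kernel size, and weight map here are simplified assumptions in place of the paper's Markov PSF and Sapiens-mask-derived \(W_s\).

```python
import numpy as np

def motion_psf(size=15, steps=40, seed=0):
    """Rasterize a Markov random-walk trajectory into a normalized PSF."""
    rng = np.random.default_rng(seed)
    pos = np.zeros((steps, 2))
    vel = rng.normal(size=2)
    for t in range(1, steps):
        vel += 0.3 * rng.normal(size=2)           # Markov velocity update
        pos[t] = pos[t - 1] + vel
    pos -= pos.min(axis=0)
    pos *= (size - 1) / max(pos.max(), 1e-8)      # fit trajectory into kernel
    psf = np.zeros((size, size))
    for i, j in np.round(pos).astype(int):
        psf[i, j] += 1.0
    return psf / psf.sum()

def apply_hmb(I_H, W_s, psf):
    """Blur globally via FFT convolution, then blend with the weight map:
    I_HMB = W_s * I_H + (1 - W_s) * I_B."""
    k = psf.shape[0]
    pad = np.zeros_like(I_H)
    pad[:k, :k] = psf
    pad = np.roll(pad, (-(k // 2), -(k // 2)), axis=(0, 1))  # center kernel
    I_B = np.real(np.fft.ifft2(np.fft.fft2(I_H) * np.fft.fft2(pad)))
    return W_s * I_H + (1.0 - W_s) * I_B
```

Note that with this blend, regions where \(W_s = 1\) keep the sharp image and regions where \(W_s = 0\) receive the full global blur, which is what confines the motion blur to the selected body part.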

2. Three-Branch Dual-Prompt Guidance (DPG)

Built on a Swin Transformer, the module comprises a shared feature extractor and three independent reconstruction branches:

  • Shared backbone \(H_E\): Extracts features after 4× downsampling through 2 RSTBs (each with 6 STLs, 6-head attention).
  • Three independent branches \(H_{R_i}\): Each contains 2 RSTBs (3 STLs, 3-head attention), predicting respectively:
    • Branch 1: HQ image \(\hat{I}_H^P\) (source of positive prompts)
    • Branch 2: Residual noise \(\hat{I}_R = I_L - I_H\) (first source of negative prompts)
    • Branch 3: HMB segmentation mask \(\hat{M}_{\text{HMB}}\) (second source of negative prompts; single-channel + sigmoid)

Key insight: The negative prompt should not be the LQ image itself, but rather the residual noise (which contains only degradation information without structural content); otherwise, restored images lose fidelity. The HMB mask additionally provides precise spatial localization of local motion blur.
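A minimal shape-level sketch of the three-branch layout, with plain NumPy linear layers standing in for the RSTB/STL Swin blocks; all names and dimensions here are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

class TinyDPG:
    """Toy stand-in for DPG: one shared feature extractor feeding three
    independent heads (HQ image, residual noise, HMB mask)."""
    def __init__(self, d_in=48, d_feat=32):
        self.W_shared = rng.normal(scale=0.1, size=(d_in, d_feat))
        self.heads = {name: rng.normal(scale=0.1, size=(d_feat, d_in))
                      for name in ("hq", "residual", "hmb_mask")}

    def forward(self, x):
        feat = np.maximum(x @ self.W_shared, 0.0)        # shared backbone H_E
        out = {name: feat @ W for name, W in self.heads.items()}
        # the mask branch is single-channel + sigmoid in the paper
        out["hmb_mask"] = 1.0 / (1.0 + np.exp(-out["hmb_mask"]))
        return out
```

In training, the `hq` branch feeds the positive prompt, while the `residual` and `hmb_mask` branches are later combined into the negative prompt source.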

3. One-Step Diffusion Model (OSD)

  • Based on SD2.1-base, with LoRA (rank=16) fine-tuning of the UNet.
  • The final-layer features from each DPG branch are mapped to SD-compatible embedding vectors via a Prompt Embedder (Performer Encoder + Attention Pooling).
  • Features from the two negative branches are concatenated and fed into a shared Prompt Embedder to produce \(p_{\text{neg}}\); the positive branch generates \(p_{\text{pos}}\) independently.
  • CFG-guided noise prediction: \(z_\varepsilon = z_{\text{neg}} + \lambda_{\text{cfg}} \cdot (z_{\text{pos}} - z_{\text{neg}})\), with \(\lambda_{\text{cfg}}=3.5\).
  • The UNet concatenates positive and negative prompts along the batch dimension, obtaining \(z_{\text{pos}}\) and \(z_{\text{neg}}\) in a single forward pass.
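The CFG combination and the batch-concatenation trick can be sketched as follows; `unet_stub` is a hypothetical stand-in for the LoRA-tuned SD2.1 UNet, used only to show the data flow:

```python
import numpy as np

def cfg_noise(z_pos, z_neg, lam_cfg=3.5):
    """Classifier-free guidance: z_eps = z_neg + lam * (z_pos - z_neg)."""
    return z_neg + lam_cfg * (z_pos - z_neg)

def unet_stub(latents, prompts):
    # Illustrative stand-in: a real UNet would cross-attend to `prompts`.
    return latents + 0.01 * prompts

# Batch trick: stack positive/negative prompts along the batch dimension,
# run ONE forward pass, then split the output into z_pos / z_neg.
lat = np.zeros((1, 4, 8, 8))
p_pos, p_neg = np.ones((1, 4, 8, 8)), -np.ones((1, 4, 8, 8))
z = unet_stub(np.concatenate([lat, lat]), np.concatenate([p_pos, p_neg]))
z_pos, z_neg = z[:1], z[1:]
z_eps = cfg_noise(z_pos, z_neg)
```

With \(\lambda_{\text{cfg}} > 1\), the prediction is pushed past the positive-prompt output, away from the degradation described by the adaptive negative prompt.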

Loss & Training

Stage 1 (DPG training): \(\mathcal{L} = \mathcal{L}_1(\hat{I}_H^P, I_H) + \mathcal{L}_1(\hat{I}_R, I_L - I_H) + \alpha \cdot \mathcal{L}_{\text{Dice}}(\hat{M}_{\text{HMB}}, M_{\text{HMB}})\), with \(\alpha = 0.02\). Training uses the Adam optimizer (lr = 2e-3, batch size 16) for 20K iterations on 4× A6000 GPUs.
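The Stage-1 objective can be sketched directly from the formula, assuming a standard soft-Dice formulation for the mask term (the paper's exact Dice variant may differ):

```python
import numpy as np

def dice_loss(pred, target, eps=1e-6):
    """Soft Dice loss for the HMB-mask branch (pred already sigmoid-activated)."""
    inter = (pred * target).sum()
    return 1.0 - (2.0 * inter + eps) / (pred.sum() + target.sum() + eps)

def stage1_loss(pred_hq, pred_res, pred_mask, I_H, I_L, M_hmb, alpha=0.02):
    """L1 on the HQ and residual branches + alpha * Dice on the mask branch."""
    l1 = lambda a, b: np.abs(a - b).mean()
    return (l1(pred_hq, I_H)
            + l1(pred_res, I_L - I_H)
            + alpha * dice_loss(pred_mask, M_hmb))
```

Note the residual branch is supervised against \(I_L - I_H\), so it learns only the degradation signal, consistent with its role as a negative prompt source.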

Stage 2 (OSD training): \(\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{MSE}}(\hat{I}_H, I_H) + \mathcal{L}_{\text{EA}}(\hat{I}_H, I_H) + \beta \cdot \mathcal{L}_{\mathcal{G}}(\hat{z}_H)\), with \(\beta = 0.01\).

  • \(\mathcal{L}_{\text{EA}}\): edge-aware DISTS perceptual loss (dual-path DISTS on the original image and its Sobel edge map).
  • \(\mathcal{L}_{\mathcal{G}}\): GAN generator loss, using the downsampling modules of a pretrained SDXL UNet as the discriminator.
  • Training uses the AdamW optimizer (lr = 1e-5, batch size 2) for 120K iterations on 2× A6000 GPUs.

Key Experimental Results

Results on PERSONA-Val Synthetic Dataset

| Method | Steps | Time (s) | DISTS↓ | LPIPS↓ | FID↓ | CLIPIQA↑ | NIQE↓ |
|---|---|---|---|---|---|---|---|
| SUPIR | 50 | 26.67 | 0.1415 | 0.2929 | 13.84 | 0.7908 | 3.777 |
| SeeSR | 50 | 5.05 | 0.1295 | 0.2555 | 12.82 | 0.7620 | 3.574 |
| OSDHuman | 1 | 0.11 | 0.1356 | 0.2384 | 14.41 | 0.7312 | 3.828 |
| HAODiff | 1 | 0.20 | 0.1023 | 0.2046 | 8.36 | 0.7737 | 2.830 |

With single-step inference, HAODiff reduces DISTS by 24.6% and FID by 42% relative to OSDHuman, leading on every metric except CLIPIQA (where SUPIR remains slightly higher).

Results on MPII-Test Real Motion Blur Dataset

| Method | CLIPIQA↑ | MANIQA↑ | NIQE↓ | HMB-R↓ |
|---|---|---|---|---|
| SUPIR | 0.6702 | 0.6256 | 4.423 | 0.2776 |
| SeeSR | 0.6478 | 0.6636 | 4.615 | 0.2612 |
| OSDHuman | 0.6726 | 0.6535 | 3.912 | 0.2283 |
| HAODiff | 0.7203 | 0.7057 | 3.065 | 0.1167 |

The HMB-R (motion blur residual ratio) of 0.1167 is half that of OSDHuman, demonstrating significantly stronger motion blur removal capability.

Ablation Study

  • Three-branch DPG vs. single branch: The three-branch dual-prompt guidance yields notable improvements in DISTS/LPIPS over positive-prompt-only guidance.
  • Adaptive vs. fixed negative prompts: Adaptive residual noise negative prompts outperform fixed empty-text or fixed noise descriptions.
  • Degradation pipeline with vs. without HMB: Incorporating HMB simulation significantly reduces HMB-R on MPII-Test.
  • CFG coefficient \(\lambda_{\text{cfg}}\): 3.5 is the optimal value; lower values provide insufficient guidance while higher values introduce artifacts.

Key Findings

  1. Single-step HAODiff outperforms multi-step diffusion models such as SUPIR (50 steps) and SeeSR (50 steps) across nearly all metrics.
  2. Residual noise is a more principled negative prompt source than the LQ image itself: it characterizes the degradation while carrying no structural content, so steering away from it does not sacrifice fidelity.
  3. The HMB segmentation mask provides spatial localization of local motion blur, enabling targeted processing rather than uniform full-image denoising.
  4. Inference time of only 0.20 s (at 512×512) is competitive for practical applications.

Highlights & Insights

  1. Degradation pipeline innovation: This work is the first to explicitly incorporate human motion blur simulation into the BIR degradation pipeline, achieving realistic HMB synthesis via body-part segmentation and spatial weight maps.
  2. Adaptive dual prompts: Constructing adaptive negative prompts from residual noise and HMB masks enables more effective CFG guidance than fixed negative prompts.
  3. Efficiency without quality compromise: Single-step diffusion with LoRA fine-tuning achieves 0.2 s inference while comprehensively surpassing 50-step large models.
  4. New benchmark MPII-Test: Comprising 5,427 real HMB images with YOLO-detector-based quantitative evaluation, this benchmark provides a standardized motion blur assessment protocol for human image restoration.

Limitations & Future Work

  1. The HMB simulation in the degradation pipeline models only rigid-body motion blur (PSF convolution-based), and does not account for non-uniform blur arising from articulated joint flexibility.
  2. The method depends on the quality of the Sapiens segmentation model—segmentation failures may lead to inaccurate HMB simulation.
  3. The SD2.1-base backbone is limited to 512×512 resolution; further extension is needed for high-resolution human images.
  4. The HMB detector on MPII-Test achieves only mAP@0.5 = 0.62, limiting the reliability of the HMB-R metric.
  5. Extending DPG to video human restoration scenarios remains an avenue for future exploration.
Transferable takeaways:

  • The three-branch DPG design (positive / negative / localization) is generalizable to other spatially-aware restoration tasks (e.g., dehazing, deraining).
  • Using residual noise as the negative prompt source is superior to using the LQ image directly, and can be adopted in other diffusion-based restoration models.
  • Body-part-level degradation simulation provides a reference paradigm for designing domain-specific degradation pipelines.

Rating

  • Novelty: ⭐⭐⭐⭐ Substantial innovations in both the degradation pipeline and dual-prompt guidance.
  • Experimental Thoroughness: ⭐⭐⭐⭐ Covers synthetic and real data, multiple metrics, and comprehensive ablations.
  • Writing Quality: ⭐⭐⭐⭐ Method description is clear with complete derivations.
  • Value: ⭐⭐⭐⭐ Strong practical utility for human restoration; the new benchmark is a meaningful contribution.