Boost Your Human Image Generation Model via Direct Preference Optimization

Conference: CVPR 2025
arXiv: 2405.20216
Code: To be confirmed
Area: Alignment RLHF / Human Image Generation
Keywords: DPO, Human Image, Curriculum Learning, Statistics Matching, LoRA

TL;DR

This paper proposes HG-DPO, which uses real human images as winning images (instead of generated image pairs) and designs a three-stage curriculum learning strategy (Easy/Normal/Hard) to progressively bridge the distribution gap between generated and real images. Combined with a statistics matching loss that resolves color shift, it reduces FID from 37.34 to 29.41 (about -21.2%), improves CI-Q from 0.906 to 0.934, and outperforms Diffusion-DPO with a 99.97% win rate.

Background & Motivation

Background: Human image generation is a core focus of image synthesis. DPO has been applied to align diffusion models, but existing methods rely on AI-generated winning/losing image pairs.

Limitations of Prior Work: (a) Using generated images as winning images imposes a quality ceiling; (b) Direct application of real images leads to color shifts and training instability.

Key Challenge: Real images serve as better alignment targets, but the distribution gap between generated and real images causes training collapse when used directly (Naive DPO FID skyrockets to 112.67).

Goal: Bridge the distribution gap to enable utilizing real human images as optimization targets in DPO.

Key Insight: Progressively narrow the gap via curriculum learning + eliminate color shifts using statistics matching.

Core Idea: A three-stage curriculum DPO (Easy $\rightarrow$ Normal $\rightarrow$ Hard, progressively introducing real images) combined with a statistics matching loss.

Method

Overall Architecture

Based on SD 1.5 + LoRA fine-tuning (U-Net rank 8, text encoder rank 64), a three-stage curriculum training progressively transitions the winning image from the best generated image to an intermediate domain, and finally to real images. $\beta=2500$. The training dataset contains ~50K real human images (filtered from LAION-Aesthetics) and their corresponding prompts. During inference, the trained LoRA weights are directly used with zero extra computational overhead.
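
As a rough illustration of the staged schedule, the sketch below maps a global training step to a curriculum stage and the source of its winning image. The step budgets (300K/20K/20K) are the ones quoted in these notes; the `Stage` dataclass, the `stage_at` helper, and the use of a single global step counter are assumptions made for illustration, not the authors' training code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    steps: int           # training steps budgeted for this stage
    winning_source: str  # where the winning image comes from


# Step budgets as quoted in these notes; everything else is illustrative.
CURRICULUM = (
    Stage("easy", 300_000, "best generated image"),
    Stage("normal", 20_000, "intermediate-domain image"),
    Stage("hard", 20_000, "real image"),
)


def stage_at(step: int) -> Stage:
    """Return the stage active at a 0-indexed global training step."""
    if step < 0:
        raise ValueError("step must be non-negative")
    start = 0
    for stage in CURRICULUM:
        if step < start + stage.steps:
            return stage
        start += stage.steps
    raise ValueError("step is beyond the end of the curriculum")


total_steps = sum(stage.steps for stage in CURRICULUM)
print(total_steps)               # 340000
print(stage_at(299_999).name)    # easy
print(stage_at(300_000).name)    # normal
print(stage_at(320_000).name)    # hard
```

Keeping the schedule as data makes it easy to check that the stage budgets add up to the total training length before launching a run.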

DPO Loss Review

The loss function for diffusion model DPO is defined as: $$\mathcal{L}_{\text{DPO}} = -\mathbb{E}\left[\log \sigma\left(\beta \left(r(x_w, c) - r(x_l, c)\right)\right)\right]$$ where $r(x, c) = \log \frac{p_\theta(x|c)}{p_{\text{ref}}(x|c)}$ is the implicit reward, and $x_w, x_l$ are the winning and losing images respectively. The core innovation of HG-DPO lies in the selection strategy of $x_w$ — shifting progressively from generated images to real images.
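
The loss above can be sketched in a few lines of plain Python. This is a minimal illustration of the formula as written in these notes, assuming the log-probabilities are already available as numbers; in practice, diffusion DPO variants approximate these log-ratios with denoising errors, which this sketch does not attempt. The function names and the pair layout are hypothetical.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def implicit_reward(logp_theta: float, logp_ref: float) -> float:
    """r(x, c) = log p_theta(x|c) - log p_ref(x|c)."""
    return logp_theta - logp_ref


def dpo_loss(pairs, beta: float = 2500.0) -> float:
    """Mean of -log sigmoid(beta * (r_w - r_l)) over a batch.

    Each pair is ((logp_theta_w, logp_ref_w), (logp_theta_l, logp_ref_l)),
    an assumed layout chosen for this sketch.
    """
    losses = []
    for (lt_w, lr_w), (lt_l, lr_l) in pairs:
        margin = implicit_reward(lt_w, lr_w) - implicit_reward(lt_l, lr_l)
        losses.append(-log_sigmoid(beta * margin))
    return sum(losses) / len(losses)


# When the trained model equals the reference, both rewards are 0,
# so the loss is log(2) regardless of beta.
at_init = [((-10.0, -10.0), (-12.0, -12.0))]
print(dpo_loss(at_init))
```

With $\beta=2500$, even a margin of 0.001 in the implicit rewards moves the sigmoid argument to 2.5, which gives a sense of how strongly this setting amplifies small differences.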

Key Designs

Three-Stage Curriculum Learning: Easy (300K steps, DPO between generated images, selection from a pool of $N>2$ images) $\rightarrow$ Normal (20K, transition phase with intermediate domain) $\rightarrow$ Hard (20K, real images as winning images). Directly using real images causes the FID to surge to 112.67; curriculum learning effectively prevents this training collapse.
- Easy Stage: For each prompt, $N=8$ images are generated and scored using PickScore/HPSv2. The highest and lowest-scoring images form the preference pair. This stage teaches the model to distinguish between high- and low-quality generation, laying the foundation for introducing real images.
- Normal Stage: The winning image is an interpolated mixture of the best generated image from the Easy stage and a real image (weighted by $\alpha=0.5$), while the losing image remains a low-scoring generated image. This progressively reduces the distribution distance between the generated and real domains.
- Hard Stage: Real human images are directly used as winning images, and the losing images are generated by the current model. By this point, the model has adapted to the distribution shift and can undergo stable training.
- Statistics Matching Loss $\mathcal{L}_{stat}$: Matches the channel-wise mean and standard deviation between the generated and real images to eliminate color shifts. The formula is: $\mathcal{L}_{stat} = \sum_c \left( \left|\mu_c^{gen} - \mu_c^{real}\right| + \left|\sigma_c^{gen} - \sigma_c^{real}\right| \right)$, where $\mu_c$ and $\sigma_c$ are the mean and standard deviation of channel $c$ (this $\sigma_c$ is unrelated to the sigmoid $\sigma$ in the DPO loss). This loss is only enabled during the Hard stage. Adding this loss improves CI-Q from 0.888 to 0.906.
- Preference Evaluator: In the Easy stage, a reward model ranks multiple generated images for each prompt to construct preference pairs with the best and worst samples.
- LoRA Configuration: U-Net rank=8 (attention layers), text encoder rank=64 (CLIP text encoder). Text encoder LoRA is particularly critical in the Hard stage — it helps the model adapt to the semantic distribution of real images; removing it causes the FID to increase from 29.41 to 32.18.
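
The statistics matching term in the list above can be illustrated with a small, dependency-free sketch. It follows the formula as given in these notes (sum over channels of the absolute gaps in mean and standard deviation). The data layout (a sequence of channels, each a flat sequence of floats), the use of the population standard deviation, and the function names are assumptions; the notes do not say on which tensors the loss is actually computed.

```python
from statistics import fmean, pstdev


def channel_stats(channel):
    """Mean and population standard deviation of one channel's values."""
    return fmean(channel), pstdev(channel)


def stat_matching_loss(gen, real) -> float:
    """Sum over channels of |mean gap| + |std gap|.

    `gen` and `real` are sequences of channels; each channel is a flat
    sequence of floats (an assumed layout, chosen for simplicity).
    """
    if len(gen) != len(real):
        raise ValueError("images must have the same number of channels")
    loss = 0.0
    for g, r in zip(gen, real):
        mu_g, sd_g = channel_stats(g)
        mu_r, sd_r = channel_stats(r)
        loss += abs(mu_g - mu_r) + abs(sd_g - sd_r)
    return loss


# A "real" image and a copy whose last channel is uniformly brighter,
# mimicking a color tint: only the mean of that channel differs.
real = [[0.25, 0.75], [0.5, 0.5], [0.25, 0.75]]
tinted = [[0.25, 0.75], [0.5, 0.5], [0.375, 0.875]]
print(stat_matching_loss(tinted, real))  # 0.125
print(stat_matching_loss(real, real))    # 0.0
```

A uniform tint shifts a channel's mean without changing its spread, so the loss picks it up purely through the mean term; contrast changes would show up in the standard-deviation term instead.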

Key Experimental Results

Main Results

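| Method | P-Score | FID↓ | CI-Q | CI-S | ATHEC |
| --- | --- | --- | --- | --- | --- |
| Diffusion-DPO | 17.93 | 112.67 | 0.820 | 0.944 | 36.30 |
| AlignProp | 23.02 | 49.92 | 0.860 | 0.966 | 17.05 |
| Curriculum-DPO | 22.44 | 35.35 | 0.889 | 0.956 | 23.36 |
| HG-DPO | 22.60 | 29.41 | 0.934 | 0.986 | 29.41 |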

Win-rate: vs Diffusion-DPO 99.97%, vs Pick-a-Pic v2 86.03%

Ablation Study

| Configuration | FID | CI-Q | Description |
| --- | --- | --- | --- |
| Base | 37.34 | 0.906 | Baseline |
| Naive (Direct Real) | 112.67 | 0.820 | Collapse |
| Easy only | 36.00 | 0.906 | DPO between generated images |
| + Normal + Hard | 28.66 | 0.937 | Curriculum progression |
| + TE LoRA | 29.41 | 0.934 | Full HG-DPO |
| - $\mathcal{L}_{stat}$ | 31.52 | 0.888 | Without statistics matching |
| Easy (using only 2 images) | 38.91 | 0.895 | $N=2$ is insufficient |

Key Findings

The Hard stage contributes most to the FID reduction ($37 \rightarrow 29$), while the Easy stage contributes most to the P-Score.
Utilizing a pool of $N=8$ images is substantially better than $N=2$ — more candidates establish a more pronounced quality gap in the preference pairs.
The statistics matching loss eliminates visible color shift artifacts (e.g., bluish/yellowish tints), improving CI-Q by +0.018.
Support for personalized T2I: HG-DPO + InstantBooth achieves an FID of 29.30 (vs 39.61) while maintaining face similarity.
The Hard stage requires only 20K steps (~6% of total training), yet contributes the largest drop in FID.
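
The finding that a pool of $N=8$ beats $N=2$ can be made concrete with a toy simulation of Easy-stage pair selection: pick the highest- and lowest-scoring candidate from a pool, and measure how the score gap grows with pool size. The scores here are drawn from a standard normal distribution as a stand-in for a reward model such as PickScore; that distribution, the trial count, and the helper names are assumptions for illustration only.

```python
import random


def select_pair(scores):
    """Return (winner_index, loser_index): highest and lowest score."""
    if len(scores) < 2:
        raise ValueError("need at least two candidates")
    winner = max(range(len(scores)), key=scores.__getitem__)
    loser = min(range(len(scores)), key=scores.__getitem__)
    return winner, loser


def mean_gap(pool_size: int, trials: int = 2000, seed: int = 0) -> float:
    """Average best-minus-worst score gap for pools of a given size.

    Scores are sampled from N(0, 1), a placeholder for real reward scores.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        scores = [rng.gauss(0.0, 1.0) for _ in range(pool_size)]
        w, l = select_pair(scores)
        total += scores[w] - scores[l]
    return total / trials


print(select_pair([0.3, 0.9, 0.1, 0.5]))  # (1, 2)
print(mean_gap(2), mean_gap(8))           # the gap is larger for N=8
```

Under this toy model the expected gap grows with the pool size, which matches the intuition in the notes that more candidates yield a more pronounced quality difference within each preference pair.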

Highlights & Insights

Using real images as DPO targets is the core innovation — breaking the upper bound of generated image preference alignment, whereas traditional DPO using AI-generated winning images has an inherent quality ceiling.
Curriculum learning to bridge the distribution gap is elegant and practical — applying domain adaptation concepts to DPO. The three-stage progressive design is much more stable than direct fine-tuning.
Targeted design of the statistics matching loss: Color shifts are a unique issue in generated-to-real image DPO, and channel-wise statistics alignment presents a simple and effective solution.
Compatibility with personalized models: The LoRA trained by HG-DPO can be directly stacked with personalization methods like InstantBooth, showing that the learned preference knowledge is generalizable.

Limitations & Future Work

Highly dependent on high-quality real human image datasets, with non-trivial training costs (340K steps total). The Easy stage demands a significant amount of GPU time to generate preference pairs.
Only validated on human images; could be extended to other domains with abundant real-world data (e.g., landscapes, architecture, animals).
The hyperparameter $\beta=2500$ is exceptionally large, far exceeding typical values in LLMs (0.1-0.5), indicating that the hyperparameter space of diffusion DPO deviates substantially from LLMs, resulting in high tuning costs.
The interpolation strategy ($\alpha=0.5$) in the Normal stage is a heuristic choice; an adaptive $\alpha$ might yield better results.
Larger foundation models (e.g., SDXL, SD3) have not been tested, leaving scalability unverified.
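
To make the point about the Normal-stage interpolation weight concrete, the sketch below mixes a generated and a real image with a fixed $\alpha$, and shows one simple scheduled alternative. Which image receives the weight $\alpha$, the element-wise mixing on flat float sequences, and the linear ramp are all assumptions for illustration; the notes only state that $\alpha=0.5$ is used and suggest that an adaptive value might help.

```python
def interpolate(gen, real, alpha: float = 0.5):
    """Element-wise alpha * real + (1 - alpha) * gen on flat float sequences."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if len(gen) != len(real):
        raise ValueError("inputs must have the same length")
    return [alpha * r + (1.0 - alpha) * g for g, r in zip(gen, real)]


def linear_alpha(step: int, total_steps: int) -> float:
    """Hypothetical schedule: alpha ramps from 0 (generated) to 1 (real)."""
    if total_steps <= 1:
        return 1.0
    return min(max(step / (total_steps - 1), 0.0), 1.0)


gen, real = [0.0, 1.0], [1.0, 0.0]
print(interpolate(gen, real))                     # [0.5, 0.5]
print(interpolate(gen, real, linear_alpha(0, 20_000)))       # [0.0, 1.0]
print(interpolate(gen, real, linear_alpha(19_999, 20_000)))  # [1.0, 0.0]
```

A ramp like this would start the stage at the generated image and end at the real one; whether it improves on a fixed $\alpha=0.5$ is an open question that the notes do not answer.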

vs Diffusion-DPO: Directly applying DPO to the human domain results in an FID of 112.67; HG-DPO avoids this collapse through curriculum learning. The failure of Diffusion-DPO highlights that the generated-to-real distribution gap is a fundamental bottleneck.
vs Curriculum-DPO: Different focus areas. Curriculum-DPO focuses on ranking sample difficulty, whereas HG-DPO focuses on winning image selection (the progressive transition from generated to real images).
vs AlignProp: AlignProp backpropagates differentiable reward signals, achieving an FID of 49.92 (better than Diffusion-DPO but inferior to HG-DPO). AlignProp's advantage is not requiring preference pairs, but it is constrained by reward model quality.
vs RLHF for LLMs: Applying DPO to diffusion models differs from the LLM setting: the winning/losing items in LLMs are text sequences, whereas diffusion models operate on images (in latent space for SD 1.5), where the distribution gap between generated and real samples is much more severe.

Rating

Novelty: ⭐⭐⭐⭐ Real image DPO + curriculum gap bridging, a highly novel idea.
Experimental Thoroughness: ⭐⭐⭐⭐⭐ 9 metrics, compared with 10 methods, multiple seeds, and PT2I extension.
Writing Quality: ⭐⭐⭐⭐ Clear motivation and comprehensive ablations.
Value: ⭐⭐⭐⭐ Important reference for human image DPO.

Supplementary Notes

Classified under llm_alignment in this repo, but the work actually belongs to image generation / diffusion model alignment.
The step ratio (300K:20K:20K) among the three curriculum stages implies that the Easy stage is the most time-consuming — requiring generating multiple images for each prompt and ranking them via a reward model.
The choice of $\beta=2500$ is noteworthy: with a small $\beta$, the preference margin inside the sigmoid may be too small to produce meaningful gradient signals in diffusion DPO, which would explain why such a large value is used.
The method could in principle be extended to larger models such as SDXL/SD3 (for example, by scaling the LoRA rank), but this has not been tested.
The quality and diversity of the real image dataset are crucial for the Hard stage — low-quality real images may conversely degrade generation quality.