Rethinking Direct Preference Optimization in Diffusion Models¶
Conference: SPIGM@NeurIPS 2025 / AAAI 2026 (Oral)
arXiv: 2505.18736
Code: GitHub
Area: LLM Alignment / Diffusion Models
Keywords: Diffusion Models, DPO, Preference Optimization, Reference Model Update, Timestep-Aware
TL;DR¶
To address two core issues in DPO for diffusion models — limited exploration and reward scale imbalance — this paper proposes a stable reference model update strategy and a timestep-aware training strategy, both of which can be integrated into various preference optimization algorithms.
Background & Motivation¶
Aligning text-to-image (T2I) diffusion models with human preferences is an active research direction. While preference optimization techniques such as DPO have been extended from LLMs to diffusion models, several domain-specific challenges remain:
Limited Exploration: A frozen reference model constrains the policy's exploration space, resulting in insufficient generation diversity.
Reward Scale Imbalance: The magnitude of reward signals varies drastically across denoising timesteps, undermining training stability.
Diffusion-Specific Difficulty: Unlike LLMs, the generation process in diffusion models involves multi-step denoising, with different optimization objectives at each step.
Method¶
Overall Architecture¶
Two orthogonal improvement strategies are proposed, which can be combined with existing methods such as Diffusion-DPO, DPOK, and D3PO.
Key Designs¶
1. Stable Reference Model Update
In standard DPO, the reference model \(\pi_{\text{ref}}\) remains fixed, limiting exploration:
- Gradual reference update: \(\pi_{\text{ref}}^{(t+1)} = (1-\alpha)\, \pi_{\text{ref}}^{(t)} + \alpha\, \pi_\theta^{(t)}\)
- Reference model regularization: the updated reference model is constrained from drifting too far from the initial model
- Exploration–stability trade-off: a small \(\alpha\) yields conservative updates; a large \(\alpha\) yields aggressive updates (see the sketch below)
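A minimal PyTorch sketch of such a gradual (EMA-style) reference update with an optional pull back toward the initial reference weights. The function name, the `alpha`/`reg` values, and the parameter-space formulation are illustrative assumptions, not the paper's exact implementation.

```python
import torch

@torch.no_grad()
def update_reference(ref_model, policy, init_ref=None, alpha=0.05, reg=0.1):
    """EMA-style reference update in parameter space:
    pi_ref <- (1 - alpha) * pi_ref + alpha * pi_theta,
    optionally regularized back toward the initial reference weights."""
    param_groups = [ref_model.parameters(), policy.parameters()]
    if init_ref is not None:
        param_groups.append(init_ref.parameters())
    for params in zip(*param_groups):
        p_ref, p_pol = params[0], params[1]
        # Gradual update toward the current policy (small alpha = conservative).
        p_ref.mul_(1.0 - alpha).add_(p_pol, alpha=alpha)
        if init_ref is not None:
            # Regularization: keep the reference close to its initial weights.
            p_ref.mul_(1.0 - reg).add_(params[2], alpha=reg)
```

In a Diffusion-DPO-style loop this would typically be called every fixed number of optimizer steps rather than after every gradient update.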
2. Timestep-Aware Training
The signal-to-noise ratio varies substantially across timesteps in diffusion models:
- Observation: reward signal magnitudes at large timesteps (high noise) are far greater than at small timesteps (low noise)
- Consequence: training is dominated by large timesteps, leaving small timesteps under-optimized
- Solution: apply normalized weighting to the per-timestep losses, \(\mathcal{L}_{\text{timestep-norm}} = \sum_t w(t)\, \mathcal{L}_t\), where \(w(t)\) is a normalized weight based on the variance of the reward signal at timestep \(t\) (a sketch of this weighting follows)
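A minimal sketch of how such timestep-aware weighting could be implemented. The bucketing scheme, the running mean-magnitude scale (used here as a stand-in for the paper's variance-based \(w(t)\)), and all hyperparameters are assumptions for illustration only.

```python
import torch

class TimestepLossNormalizer:
    """Keeps a running estimate of the loss scale per timestep bucket and
    reweights per-sample losses so no noise level dominates the gradient."""

    def __init__(self, num_timesteps=1000, num_buckets=10, momentum=0.99, eps=1e-8):
        self.bucket_size = num_timesteps // num_buckets
        self.num_buckets = num_buckets
        self.momentum = momentum
        self.eps = eps
        self.running_scale = torch.ones(num_buckets)

    def __call__(self, per_sample_loss, timesteps):
        # Map each sampled timestep to a coarse noise-level bucket.
        buckets = (timesteps // self.bucket_size).clamp(max=self.num_buckets - 1)

        # Update the running per-bucket loss scale (stand-in for a variance estimate).
        for b in buckets.unique().tolist():
            scale = per_sample_loss[buckets == b].detach().abs().mean().cpu()
            self.running_scale[b] = (
                self.momentum * self.running_scale[b] + (1.0 - self.momentum) * scale
            )

        # Inverse-scale weights w(t), renormalized so the mean weight is 1.
        w = 1.0 / (self.running_scale.to(per_sample_loss.device)[buckets] + self.eps)
        w = w / w.mean()
        return (w * per_sample_loss).mean()
```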
Loss & Training¶
Full training objective: \(\mathcal{L} = \mathcal{L}_{\text{DPO}} + \lambda_1 \mathcal{L}_{\text{reg}} + \lambda_2 \mathcal{L}_{\text{timestep-norm}}\)
Training pipeline: reference model updates and timestep normalization are incorporated into the standard DPO training loop.
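To show how the pieces slot into the standard loop, here is an illustrative training step reusing `update_reference` and `TimestepLossNormalizer` from the sketches above; `diffusion_dpo_loss`, the `batch["timesteps"]` field, and all hyperparameters are hypothetical placeholders, not the paper's settings.

```python
def train_step(step, batch, policy, ref_model, init_ref, optimizer,
               normalizer, diffusion_dpo_loss,
               alpha=0.05, reg=0.1, ref_update_every=100):
    """One optimization step with both strategies wired into a standard
    Diffusion-DPO loop: timestep-aware loss weighting on every step, and a
    gradual reference-model update every `ref_update_every` steps."""
    # Per-sample preference loss at the sampled denoising timesteps, shape [B].
    per_sample_loss = diffusion_dpo_loss(policy, ref_model, batch)

    # Timestep-aware normalization replaces the plain mean over the batch.
    loss = normalizer(per_sample_loss, batch["timesteps"])

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # Stable reference update: EMA toward the policy, regularized to init_ref.
    if (step + 1) % ref_update_every == 0:
        update_reference(ref_model, policy, init_ref=init_ref, alpha=alpha, reg=reg)

    return loss.detach()
```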
Key Experimental Results¶
Main Results¶
Human preference evaluation on SDXL (Pick-a-Pic, HPSv2):
| Method | HPSv2 ↑ | Pick Score ↑ | Aesthetic ↑ | CLIP Score ↑ |
|---|---|---|---|---|
| SDXL (baseline) | 27.5 | 21.8 | 5.82 | 0.315 |
| Diffusion-DPO | 28.2 | 22.3 | 5.95 | 0.318 |
| Diffusion-DPO + Ours | 28.8 | 22.9 | 6.12 | 0.322 |
| D3PO | 28.0 | 22.1 | 5.90 | 0.316 |
| D3PO + Ours | 28.5 | 22.6 | 6.05 | 0.320 |
Results on SD1.5:
| Method | HPSv2 ↑ | Aesthetic ↑ | Diversity (FID ↓) |
|---|---|---|---|
| SD1.5 | 25.8 | 5.45 | 15.2 |
| Diffusion-DPO | 26.5 | 5.68 | 18.5 |
| Diffusion-DPO + Ours | 27.1 | 5.85 | 16.8 |
Ablation Study¶
Individual contributions of each strategy (SDXL, HPSv2):
| Configuration | HPSv2 | Gain |
|---|---|---|
| Diffusion-DPO (baseline) | 28.2 | - |
| + Reference Model Update | 28.5 | +0.3 |
| + Timestep-Aware | 28.5 | +0.3 |
| + Both Combined | 28.8 | +0.6 |
Key Findings¶
- The two strategies offer complementary contributions, each providing approximately +0.3 HPSv2 improvement.
- Reference model updates become more effective in later training stages, as the policy drifts further from the initial model.
- Timestep-aware training yields the greatest improvement at low-noise steps, which govern fine-grained detail quality.
- The method integrates seamlessly into multiple algorithms, including Diffusion-DPO and D3PO.
Highlights & Insights¶
- Orthogonal Improvements: The two strategies are mutually independent and can be applied separately or jointly.
- Generality: Both components serve as plug-and-play modules compatible with any diffusion-based preference optimization method.
- Practical Insight: Reward imbalance across timesteps is an overlooked yet important issue in diffusion DPO.
Limitations & Future Work¶
- The optimal value of \(\alpha\) is task- and model-dependent.
- Reference model updates increase memory requirements, as additional model parameters must be stored.
- Validation is limited to image generation; video generation settings remain unexplored.
- The design of timestep normalization weights lacks adaptivity.
Related Work & Insights¶
- Diffusion-DPO (Wallace et al.): extends DPO to diffusion models
- D3PO: an alternative diffusion preference optimization method
- Online DPO: related work on reference model updates in LLMs
Rating¶
- ⭐ Novelty: 7/10 — Both improvements are effective but conceptually straightforward
- ⭐ Value: 8/10 — Plug-and-play design with open-source code offers high practical utility
- ⭐ Writing Quality: 8/10 — Ablation analysis is clear and experimental design is well-structured