
Aesthetic Post-Training Diffusion Models from Generic Preferences with Step-by-step Preference Optimization

Conference: CVPR 2025
arXiv: 2406.04314
Code: https://github.com/RockeyCoss/SPO
Area: Alignment RLHF / Diffusion Models
Keywords: Step-by-step preference optimization, aesthetic alignment, diffusion model post-training, step-aware preference model, online learning

TL;DR

This paper proposes Step-by-step Preference Optimization (SPO), which samples multiple candidates from the same noisy latent at each denoising step and employs a step-aware preference model to select win/lose pairs to guide diffusion model fine-tuning. By implicitly distilling aesthetic information from generic preference data, SPO significantly improves aesthetic quality on SD-1.5 and SDXL, while achieving substantially faster convergence than DPO.

Background & Motivation

Background: Aligning text-to-image diffusion models with human preferences has become an active research area. DPO (Direct Preference Optimization) has been applied successfully to diffusion model fine-tuning (e.g., Diffusion-DPO), using preference pairs to steer models toward images that humans prefer.

Limitations of Prior Work: Existing DPO methods suffer from two core limitations. First, generic preference and aesthetic preference are inconsistent: preference labels in public datasets (such as Pick-a-Pic) conflate layout/composition opinions with aesthetic opinions. An image may be labeled "preferred" because of better prompt alignment even though its aesthetic details are worse (e.g., it contains artifacts or blurred details), and such label noise hinders aesthetic improvement. Second, two-trajectory methods struggle to capture subtle differences: Diffusion-DPO compares the final images of two completely different trajectories and propagates the preference label to all intermediate steps. Because the intermediate states of the two trajectories differ drastically (mainly in layout), the model can hardly focus on subtle aesthetic differences.

Key Challenge: Enhancing aesthetic quality requires aesthetic-specific preference data, which is extremely expensive to collect. Generic preference data is cheap but contains labels that conflict with aesthetics. How can one economically extract aesthetic signals from generic preference data?

Goal: (1) Improve image aesthetics without collecting specialized aesthetic preference data. (2) Focus preference comparisons on subtle visual details rather than large-scale layout differences.

Key Insight: The authors observe that if the win/lose pairs originate from the same noisy latent variable and undergo only one or a few denoising steps, the differences between them exist purely at the level of image details (aesthetics like color, texture, clarity) rather than layout. Consequently, even when evaluated by a generic preference model, the comparison naturally focuses on the aesthetic dimension.

Core Idea: At each denoising step, sample multiple candidates from the same noisy latent and select a win/lose pair with a step-aware preference model to fine-tune the diffusion model, so that the preference signal naturally concentrates on aesthetic details.

Method

Overall Architecture

SPO is an online reinforcement learning method. At each training iteration, given an intermediate noisy state \(x_t\), \(k\) denoising candidates \(\{x_{t-1}^1, ..., x_{t-1}^k\}\) are sampled from the conditional distribution \(p_\theta(x_{t-1}|x_t, c, t)\). A Step-Aware Preference Model (SPM) scores these \(k\) candidates; the highest-scoring candidate becomes the win sample and the lowest-scoring one the lose sample, forming a preference pair for a DPO loss update. One candidate is then drawn at random from the pool as the starting point for the next denoising step. Repeating this process across timesteps yields step-by-step optimization.
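
The loop described above can be sketched roughly as follows. This is a toy illustration, not the authors' implementation: `spo_rollout`, `toy_sample_step`, and `toy_score` are hypothetical names, latents are plain floats, and the sketch only collects the preference pairs instead of applying the DPO update at each step.

```python
import random

def spo_rollout(x_T, timesteps, sample_step, score, k=4, rng=None):
    """Collect one (t, x_t, win, lose) tuple per timestep, SPO-style."""
    rng = rng or random.Random(0)
    pairs = []
    x_t = x_T
    for t in timesteps:
        # All k candidates branch from the SAME noisy state x_t.
        candidates = [sample_step(x_t, t, rng) for _ in range(k)]
        scores = [score(c, t) for c in candidates]
        win = candidates[scores.index(max(scores))]
        lose = candidates[scores.index(min(scores))]
        pairs.append((t, x_t, win, lose))
        # Continue from a RANDOM candidate, not from the win sample.
        x_t = rng.choice(candidates)
    return pairs

# Hypothetical stand-ins for the diffusion sampler and the SPM.
def toy_sample_step(x_t, t, rng):
    return 0.9 * x_t + rng.gauss(0.0, 0.1)

def toy_score(x, t):
    return -abs(x)  # pretend that values closer to 0 are "preferred"

pairs = spo_rollout(1.0, timesteps=[3, 2, 1],
                    sample_step=toy_sample_step, score=toy_score)
```

In a real setup, `sample_step` would be one stochastic denoising step of the diffusion model and `score` would be the step-aware preference model conditioned on the prompt.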

Key Designs

Step-Aware Preference Model (SPM):
- Function: Evaluates the image quality of noisy intermediate states at any denoising timestep.
- Mechanism: Standard preference models (such as PickScore) handle only clean images \(x_0\) and cannot evaluate noisy intermediate states \(x_t\). SPM is initialized from PickScore and adds timestep-conditioned adaptive layer normalization to the CLIP vision encoder (a design borrowed from DiT), so the model can adjust its feature extraction to the timestep \(t\). During training, the same noise is added to both images of a preference pair at timestep \(t\), under the assumption that the original preference order is unchanged. The preference is predicted as: \(\hat{p}_w = \frac{\exp(\tau \cdot f_{\text{CLIP-V}}(x_t^w, t) \cdot f_{\text{CLIP-T}}(c))}{\exp(\tau \cdot f_{\text{CLIP-V}}(x_t^w, t) \cdot f_{\text{CLIP-T}}(c)) + \exp(\tau \cdot f_{\text{CLIP-V}}(x_t^l, t) \cdot f_{\text{CLIP-T}}(c))}\). To mitigate the domain gap between noisy images and the pre-trained CLIP, DDIM is used to directly estimate \(\hat{x}_0\) from \(x_t\).
- Design Motivation: The core requirement of SPO is to evaluate image quality at intermediate denoising steps. Preference models without timestep conditioning completely fail when facing high-noise images. SPM serves as the fundamental infrastructure for realizing step-by-step preference optimization.
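
The preference probability in the SPM formula is a two-way softmax over scaled image-text similarities. A minimal sketch, assuming the embeddings are already computed and given as plain lists (`preference_prob` and `dot` are hypothetical helper names, and the value of `tau` is illustrative):

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def preference_prob(img_w, img_l, txt, tau=1.0):
    """p_w = exp(tau*s_w) / (exp(tau*s_w) + exp(tau*s_l))."""
    s_w = tau * dot(img_w, txt)
    s_l = tau * dot(img_l, txt)
    m = max(s_w, s_l)  # subtract the max for numerical stability
    e_w, e_l = math.exp(s_w - m), math.exp(s_l - m)
    return e_w / (e_w + e_l)

# Toy embeddings: the "win" image is aligned with the text, the "lose" one is not.
p = preference_prob([1.0, 0.0], [0.0, 1.0], [1.0, 0.0], tau=2.0)
```

The same quantity equals a sigmoid of the score difference, which is why identical embeddings give a probability of 0.5.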

Random Selection to Initialize the Next Step:
- Function: Ensures the diversity of training trajectories and prevents bias toward specific modes.
- Mechanism: After selecting the win/lose pair at each step, the next step is not initialized with the win sample, but randomly selected from the \(k\) candidates. If only the win sample is used, the training trajectory biases toward high-preference regions, reducing generalization; if only the lose sample is used, the model continuously learns from low-quality regions. Random selection guarantees diversity in the trajectory distribution.
- Design Motivation: Ablation studies confirm that initializing with either win or lose samples leads to a significant drop in performance. Random selection is a simple yet crucial design.

Multi-Step Extension (MSPO) for Strong Models:
- Function: Increases the variance among candidates for strong models such as SDXL.
- Mechanism: For strong diffusion models like SDXL, the candidates produced by a single denoising step differ too little for SPM to distinguish them. MSPO extends the single step to multiple steps: \(k\) candidates \(x_{t-1}\) are sampled from \(x_t\), each candidate is denoised further until it reaches \(x_{t-j}\), and the win/lose pair is chosen at the \(x_{t-j}\) level. \(j=4\) yields the best results. When \(j \to \infty\), MSPO degenerates into standard Diffusion-DPO.
- Design Motivation: Candidate variance must be balanced: if it is too small, SPM struggles to judge; if it is too large, the comparison reverts to being dominated by layout differences.
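
The multi-step variant can be sketched as below. This is a toy under stated assumptions: `mspo_pair` and the toy helpers are hypothetical, latents are floats, and the indexing assumes each candidate takes `j` denoising steps in total starting from `x_t`, so the comparison happens at timestep `t - j`. With `j=1` the function reduces to the single-step case.

```python
import random

def mspo_pair(x_t, t, j, sample_step, score, k=4, rng=None):
    """Branch k trajectories at x_t, run each to x_{t-j}, return (win, lose)."""
    rng = rng or random.Random(0)
    finals = []
    for _ in range(k):
        x, step = x_t, t
        while step > t - j:  # j stochastic steps per candidate
            x = sample_step(x, step, rng)
            step -= 1
        finals.append(x)
    scores = [score(x, t - j) for x in finals]
    win = finals[scores.index(max(scores))]
    lose = finals[scores.index(min(scores))]
    return win, lose

calls = {"n": 0}  # counts sampler invocations for the toy run

def toy_sample_step(x, step, rng):
    calls["n"] += 1
    return 0.9 * x + rng.gauss(0.0, 0.1)

def toy_score(x, step):
    return -abs(x)

win, lose = mspo_pair(1.0, t=10, j=4,
                      sample_step=toy_sample_step, score=toy_score, k=4)
```

The cost grows as `k * j` sampler calls per pair, which is one practical reason to keep `j` small.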

Loss & Training

The loss function of SPO is a step-level application of the standard DPO loss:

\(\mathcal{L}(\theta) = -\mathbb{E}_{t, c, x_{t-1}^w, x_{t-1}^l} \left[ \log\sigma\left(\beta \log\frac{p_\theta(x_{t-1}^w|c,t,x_t)}{p_{\text{ref}}(x_{t-1}^w|c,t,x_t)} - \beta \log\frac{p_\theta(x_{t-1}^l|c,t,x_t)}{p_{\text{ref}}(x_{t-1}^l|c,t,x_t)}\right) \right]\)

where \(\beta=10\) is the regularization strength. LoRA fine-tuning is used for training, with rank=4 for SD-1.5 and rank=64 for SDXL.
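
To make the loss concrete, here is a minimal scalar sketch of the per-step term for a single win/lose pair. It assumes Gaussian transitions with a shared standard deviation (so normalizing constants cancel in the log-ratios); `gaussian_logp` and `step_dpo_loss` are hypothetical helper names and the numbers are illustrative only.

```python
import math

def gaussian_logp(x, mean, sigma):
    """Log-density of N(mean, sigma^2) at x."""
    return (-0.5 * ((x - mean) / sigma) ** 2
            - math.log(sigma) - 0.5 * math.log(2 * math.pi))

def step_dpo_loss(logp_w, logp_ref_w, logp_l, logp_ref_l, beta=10.0):
    """-log sigmoid(beta * [(logp_w - logp_ref_w) - (logp_l - logp_ref_l)])."""
    margin = beta * ((logp_w - logp_ref_w) - (logp_l - logp_ref_l))
    # -log(sigmoid(m)) == softplus(-m), evaluated in a numerically stable form.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

# Toy pair: the reference predicts mean 0.0, the policy has shifted its mean
# slightly toward the win sample (x_w = 1.0) and away from the lose sample.
x_w, x_l, sigma = 1.0, -1.0, 1.0
loss = step_dpo_loss(
    gaussian_logp(x_w, 0.1, sigma), gaussian_logp(x_w, 0.0, sigma),
    gaussian_logp(x_l, 0.1, sigma), gaussian_logp(x_l, 0.0, sigma),
)
```

When the policy equals the reference, the margin is zero and the loss is \(\log 2\); any shift that favors the win sample relative to the reference pushes it below that value.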

Key Experimental Results

Main Results (SDXL)

Method	PickScore ↑	HPSV2 ↑	ImageReward ↑	Aesthetic ↑
SDXL	21.95	26.95	0.538	5.950
Diffusion-DPO	22.64	29.31	0.944	6.015
MAPO	22.11	28.22	0.717	6.096
SPO	23.06	31.80	1.080	6.364

Main Results (SD-1.5)

Method	PickScore ↑	HPSV2 ↑	ImageReward ↑	Aesthetic ↑
SD-1.5	20.53	23.79	-0.163	5.365
DDPO	21.06	24.91	0.082	5.591
Diffusion-DPO	20.98	25.05	0.112	5.505
SPO	21.43	26.45	0.171	5.887

Ablation Study

Ablation Item	PickScore ↑	HPSV2 ↑	ImageReward ↑	Aesthetic ↑
Initialize with lose samples	17.87	11.31	-2.269	3.963
Initialize with win samples	19.36	18.63	-1.374	5.338
Random initialization (SPO)	21.43	26.45	0.171	5.887
SPM without timestep conditioning	21.19	25.84	0.137	5.678
Replace SPM with PickScore	20.28	23.09	-0.298	5.410

Key Findings

SPO outperforms Diffusion-DPO across all four automatic evaluation metrics, with a notable aesthetic score improvement of +0.349 (SDXL).
User studies show that SPO significantly outperforms Diffusion-DPO in visual appeal with a 58.27% win rate.
SPO requires only 4.9% (SDXL) and 20.8% (SD-1.5) of the training computation of Diffusion-DPO, i.e., it converges far faster.
The timestep range [0-750] yields the best performance; including excessively large timesteps ([750-1000]) is harmful, as high-noise regions offer almost no image details for comparison.
On GenEval, SPO shows a slight improvement compared to the SDXL baseline (+1.77%), but the gain in prompt alignment is not as substantial as that of Diffusion-DPO.
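
The metric gaps quoted above can be re-derived from the two main-results tables. The snippet below is only an arithmetic check on the tabulated numbers; the `RESULTS` layout and the `gains` helper are hypothetical.

```python
METRICS = ("PickScore", "HPSV2", "ImageReward", "Aesthetic")

# Values copied from the SDXL and SD-1.5 main-results tables.
RESULTS = {
    "SDXL": {
        "Diffusion-DPO": (22.64, 29.31, 0.944, 6.015),
        "SPO": (23.06, 31.80, 1.080, 6.364),
    },
    "SD-1.5": {
        "Diffusion-DPO": (20.98, 25.05, 0.112, 5.505),
        "SPO": (21.43, 26.45, 0.171, 5.887),
    },
}

def gains(base):
    """SPO minus Diffusion-DPO for each metric, rounded to 3 decimals."""
    rows = RESULTS[base]
    return {m: round(s - d, 3)
            for m, s, d in zip(METRICS, rows["SPO"], rows["Diffusion-DPO"])}

for base in RESULTS:
    print(base, gains(base))
```

Every difference is positive for both base models, and the SDXL aesthetic gap matches the +0.349 stated in the findings.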

Highlights & Insights

The concept of "distilling aesthetics from generic data" is highly elegant: Instead of requiring expensive aesthetic-specific datasets, SPO cleverly forces the generic preference model's judgments to naturally focus on aesthetics by ensuring win/lose pairs differ only in fine details.
Far more efficient than Diffusion-DPO: on SDXL, training needs only about 5% of Diffusion-DPO's GPU hours, thanks to more accurate preference signals that reduce wasteful updates.
SPM is a reusable tool: As a standalone component, the step-aware preference model can be used in other scenarios where intermediate denoising states need to be evaluated.
Simplicity of random selection initialization: In the ablation, the plain random strategy clearly beats always continuing from the win sample or from the lose sample, a reminder against over-engineering.

Limitations & Future Work

Inapplicable to flow-matching models: SPO requires the intermediate steps to be stochastic (DDIM with η=1.0), whereas flow-matching models like SD3 and Flux have deterministic trajectories, so multiple distinct candidates cannot be sampled from the same \(x_t\).
Limited help for prompt alignment: SPO focuses heavily on aesthetic details, showing minor improvements in prompt alignment at the layout/composition level.
On GenEval, SPO (55.20) scores lower than Diffusion-DPO (59.58), indicating a certain trade-off between aesthetics and prompt alignment.
The combination with RL-based methods (such as DDPO) has not been explored.
Training SPM takes an additional 8-29h of GPU time; although this is a one-time cost, it is not negligible.
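
The stochasticity requirement in the first limitation can be illustrated with the standard DDIM update, where η scales the injected noise. The sketch below is scalar and uses made-up values for the cumulative alphas and the predicted noise; `ddim_sigma` and `ddim_step` are hypothetical names. The `x0_hat` line is also the kind of direct \(\hat{x}_0\) estimate mentioned in the SPM design.

```python
import math
import random

def ddim_sigma(abar_t, abar_prev, eta):
    """Noise scale of a DDIM step; eta=0 is deterministic, eta=1 is DDPM-like."""
    return (eta * math.sqrt((1 - abar_prev) / (1 - abar_t))
            * math.sqrt(1 - abar_t / abar_prev))

def ddim_step(x_t, eps_pred, abar_t, abar_prev, eta, rng):
    sigma = ddim_sigma(abar_t, abar_prev, eta)
    # Direct estimate of the clean sample from x_t and the predicted noise.
    x0_hat = (x_t - math.sqrt(1 - abar_t) * eps_pred) / math.sqrt(abar_t)
    direction = math.sqrt(1 - abar_prev - sigma ** 2) * eps_pred
    return math.sqrt(abar_prev) * x0_hat + direction + sigma * rng.gauss(0.0, 1.0)

rng = random.Random(0)
# Four candidates from the same x_t: identical when eta=0, distinct when eta=1.
det = [ddim_step(1.0, 0.3, 0.5, 0.8, 0.0, rng) for _ in range(4)]
sto = [ddim_step(1.0, 0.3, 0.5, 0.8, 1.0, rng) for _ in range(4)]
```

With η=0 all four candidates coincide, so there is no win/lose pair to compare; this mirrors the situation with deterministic samplers.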

Related Work

Diffusion-DPO (Wallace et al., 2023): Pioneering work applying DPO to diffusion models, utilizing a two-trajectory preference propagation strategy.
D3PO (Yang et al., 2023): Similar to DPO but generates preference pairs online; suffers from the same limitation of being dominated by layout differences.
DenseReward (Yang et al., 2024): Employs time discounting to improve DPO, but still relies on the two-trajectory framework.
DDPO / DPOK: Fine-tunes diffusion models using policy gradient, which is computationally more expensive.
SPO's step-by-step comparison concept offers inspiration for fine-grained control of future diffusion models: refining the optimization granularity from the trajectory level down to the step level could be a general improvement paradigm.

Rating

Novelty: ⭐⭐⭐⭐⭐
Experimental Thoroughness: ⭐⭐⭐⭐⭐
Value: ⭐⭐⭐⭐