Diffusion Fine-Tuning via Reparameterized Policy Gradient of the Soft Q-Function¶
Conference: ICLR 2026 | arXiv: 2512.04559 | Code: https://github.com/Shin-woocheol/SQDF | Area: Image Generation | Keywords: Diffusion model fine-tuning, KL-regularized reinforcement learning, soft Q-function, reward over-optimization, text-to-image alignment
TL;DR¶
This paper proposes SQDF (Soft Q-based Diffusion Finetuning), which fine-tunes diffusion models under a KL-regularized RL framework via a training-free differentiable soft Q-function approximation and reparameterized policy gradients. Three complementary components—a discount factor, a consistency model, and an off-policy replay buffer—collectively optimize the target reward while effectively mitigating reward over-optimization, preserving sample naturalness and diversity.
Background & Motivation¶
Diffusion models have become the dominant paradigm for high-quality sample generation, yet practical deployment requires alignment with downstream objectives such as aesthetic quality, text-image correspondence, and human preference. Existing fine-tuning methods suffer severely from reward over-optimization, manifesting as:
Semantic Collapse: High-reward samples progressively lose semantic alignment with the original prompt, degenerating into unrecognizable abstract textures.
Diversity Collapse: Generated outputs converge toward highly homogeneous modes.
Limitations of prior work:
- RL-based methods (DDPO): do not exploit reward gradients, resulting in low optimization efficiency and rapid diversity collapse.
- Direct backpropagation methods (DRaFT, ReFL): leverage reward gradients but are prone to over-optimization.
- KL-regularized methods: require training auxiliary value networks, which is notoriously unstable in the diffusion MDP setting, or rely on high-variance Monte Carlo gradient estimators.
Key Challenge: How can one exploit powerful reward gradient signals while avoiding over-optimization through KL regularization?
Core Idea: The diffusion process is modeled as an MDP; by leveraging the posterior mean approximation from Tweedie's formula, a training-free, differentiable soft Q-function estimate is obtained, enabling direct model updates via reparameterized policy gradients.
Method¶
Overall Architecture¶
The diffusion reverse process is modeled as a finite-horizon MDP: state \(s_t = (x_{T-t}, T-t)\), action \(a_t = x_{T-t-1}\), policy \(\pi_\theta(a_t|s_t) = p_\theta(x_{T-t-1}|x_{T-t})\). A sparse reward \(r(x_0)\) is received at the terminal state \(x_0\). The optimization objective is the KL-regularized expected reward.
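In this notation, the training objective is the usual KL-regularized expected reward; a sketch of the standard form (consistent with the SQDF loss given below, though the paper's per-step decomposition is more detailed):

$$\max_\theta \; \mathbb{E}_{x_{0:T} \sim p_\theta}\big[r(x_0)\big] \;-\; \alpha\, D_{KL}\big(p_\theta(x_{0:T}) \,\|\, p_{\text{pre}}(x_{0:T})\big)$$

where \(p_{\text{pre}}\) is the pretrained diffusion model and \(\alpha\) sets the regularization strength.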
Pipeline: (1) Sample from diffusion model \(p_\theta\) and store in replay buffer → (2) Sample noisy observation \(x_t\) from buffer → (3) Denoise one step with \(p_\theta\) to obtain \(x_{t-1}\) → (4) Predict clean sample \(\hat{x}_0\) via consistency model \(f_\psi\) → (5) Evaluate with reward model \(r_\phi\) → (6) Update \(p_\theta\) via reparameterized policy gradient.
Key Designs¶
- Training-Free Soft Q-Function Approximation: The key insight is that by recursively unrolling the soft Bellman equation and applying the one-step posterior mean approximation (Tweedie's formula), one obtains \(Q_{\text{soft}}^*(x_t, x_{t-1}) \approx r(\hat{x}_0(x_{t-1}))\). This entirely avoids training a value network, a procedure known to be highly unstable in the diffusion MDP setting. Crucially, since the approximate Q-function requires only a single forward pass through the parameterized reward model, it is differentiable and admits direct gradient computation.
- Reparameterized Policy Gradient: Using the reparameterization trick \(x_{t-1} = \mu_\theta(x_t, t) + \sigma_t \epsilon\), the policy gradient is expressed as:

$$\nabla_\theta \mathcal{L} = \mathbb{E}_{x_t, \epsilon}\big[-\nabla_{x_{t-1}} r(\hat{x}_0) \cdot \nabla_\theta \mu_\theta + \alpha \nabla_\theta D_{KL}\big]$$

This yields a low-variance, efficient gradient signal, significantly outperforming the REINFORCE estimator used in DDPO. The KL divergence term keeps the fine-tuned model from deviating excessively from the pretrained distribution (a code sketch of this update appears after this list).
- Discount Factor γ for Improved Credit Assignment (4.2.1): Prior methods implicitly adopt \(\gamma=1\), treating all denoising steps equally. However, early denoising steps (high noise levels) have negligible influence on the final sample. Introducing \(\gamma < 1\) down-weights early steps exponentially, better reflecting the true causal structure. The authors also prove that under the discounted MDP, the Q-function approximation becomes \(Q^* \approx \gamma^{t-1} r(\hat{x}_0)\), with upper and lower bounds coinciding under a first-order approximation.
- Consistency Model for Improved Q Estimation (4.2.2): Tweedie's formula yields highly inaccurate posterior mean estimates at high noise levels (illustrated in Figure 2-b). A consistency model \(f_\psi\), trained by distilling the integral of the probability flow ODE, replaces Tweedie's formula for \(\hat{x}_0\) prediction and provides uniformly accurate \(\hat{x}_0\) estimates across all timesteps (Figure 2-c), substantially improving Q-function approximation quality.
- Off-Policy Replay Buffer (4.2.3): The SQDF loss naturally supports off-policy updates, since \(x_t\) need not originate from the current policy. A replay buffer is introduced to reuse rare, high-reward, diverse samples, improving mode coverage and managing the reward–diversity trade-off.
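To make these designs concrete, below is a minimal PyTorch-style sketch of one SQDF update that combines the training-free Q estimate, the discount factor, the consistency-model \(\hat{x}_0\) prediction, and the reparameterized gradient. All interfaces (`mu_theta`, `mu_ref`, `sigma_t`, `consistency_fn`, `reward_fn`) are hypothetical placeholders, not the authors' released code.

```python
# A minimal sketch of one SQDF update step under a DDPM-style Gaussian transition.
# Every model interface here is an assumed placeholder, not the official implementation.
import torch

def sqdf_step(x_t, t, mu_theta, mu_ref, sigma_t, consistency_fn, reward_fn,
              gamma=0.9, alpha=2.0):
    """One reparameterized policy-gradient step of the (sketched) SQDF loss.

    t is an integer timestep; x_t keeps a leading batch dimension.
    """
    # (1) Reparameterized one-step denoising: x_{t-1} = mu_theta(x_t, t) + sigma_t * eps,
    #     so reward gradients flow back into mu_theta without a REINFORCE estimator.
    eps = torch.randn_like(x_t)
    mean = mu_theta(x_t, t)                      # differentiable w.r.t. theta
    x_prev = mean + sigma_t * eps

    # (2) Training-free soft Q estimate: predict the clean sample and score it.
    #     The consistency model replaces Tweedie's posterior-mean estimate here.
    x0_hat = consistency_fn(x_prev, t - 1)
    q_soft = (gamma ** (t - 1)) * reward_fn(x0_hat)   # Q* ~ gamma^{t-1} * r(x0_hat)

    # (3) KL regularization toward the pretrained model: for Gaussian transitions with
    #     a shared variance, the per-step KL reduces to a scaled squared mean difference.
    with torch.no_grad():
        mean_ref = mu_ref(x_t, t)
    kl = ((mean - mean_ref) ** 2).sum(dim=tuple(range(1, x_t.dim()))) / (2 * sigma_t ** 2)

    # (4) SQDF-style loss: maximize the discounted reward, penalize drift from the prior.
    loss = (-q_soft + alpha * kl).mean()
    return loss
```

Because `x_prev` is produced by the reparameterization trick, a single `loss.backward()` propagates \(\nabla_{x_{t-1}} r(\hat{x}_0)\) directly into \(\mu_\theta\), which is exactly the low-variance gradient discussed above.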
Loss & Training¶
Final SQDF loss:

$$\mathcal{L}_{\text{SQDF}} = \mathbb{E}_{x_t \sim \mathcal{D},\, x_{t-1} \sim p_\theta}\big[-\gamma^{t-1} r(f_\psi(x_{t-1})) + \alpha D_{KL}(p_\theta \,\|\, p')\big]$$
- DDPM 50-step sampling
- Base model: Stable Diffusion v1.5 with LoRA fine-tuning
- Small-scale experiments: \(\gamma=0.9\), \(\alpha=2\), lr=\(1\times10^{-3}\), LoRA rank=4, batch=64, 2000 steps
- Large-scale experiments: \(\gamma=0.93\), \(\alpha=0.05\), lr=\(5\times10^{-4}\), LoRA rank=32, batch=258, 500 steps
- Consistency model: LCM-LoRA
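A hedged sketch of the outer loop with the off-policy replay buffer, using the small-scale hyperparameters above (\(\gamma=0.9\), \(\alpha=2\), lr \(1\times10^{-3}\), 2000 steps); the buffer layout, `sample_trajectories`, and the `models` dictionary are illustrative assumptions, and uniform sampling stands in for whatever priority scheme the paper uses:

```python
# Illustrative outer loop around sqdf_step (sketched earlier); interfaces are assumed.
import random
import torch

def train_sqdf(models, sample_trajectories, num_steps=2000, batch_size=64,
               buffer_capacity=1_000, lr=1e-3):
    """Off-policy SQDF training loop with a simple FIFO replay buffer (a sketch)."""
    buffer = []  # stores (x_t, t) mini-batches harvested from full denoising trajectories
    optimizer = torch.optim.AdamW(models["lora_params"], lr=lr)

    for step in range(num_steps):
        # (1) Periodically roll out the current policy and store noisy intermediate states.
        if step % 10 == 0:
            for x_t, t in sample_trajectories(models, n=batch_size):
                buffer.append((x_t.detach(), t))
            buffer = buffer[-buffer_capacity:]    # FIFO eviction once the buffer is full

        # (2) Off-policy update: the sampled states may come from an older policy.
        x_t, t = random.choice(buffer)
        loss = sqdf_step(x_t, t, models["mu_theta"], models["mu_ref"],
                         models["sigma"](t), models["consistency"], models["reward"],
                         gamma=0.9, alpha=2.0)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```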
Key Experimental Results¶
Main Results¶
Text-to-image fine-tuning (Stable Diffusion v1.5, optimizing aesthetic score / HPS):
From qualitative and quantitative results in Figures 3 and 4:
- ReFL and DRaFT achieve high aesthetic scores but suffer sharp drops in alignment (ImageReward, HPS) and diversity (LPIPS, DreamSim).
- DDPO fails to reach comparable aesthetic scores and exhibits rapid diversity collapse.
- SQDF consistently achieves the highest alignment and diversity at equivalent reward levels.
Comparison with KL-regularized baselines (Figure 4 Pareto curves): SQDF occupies the Pareto frontier on nearly all metric pairs. By tuning \(\alpha\), SQDF flexibly navigates the trade-off between higher reward and better diversity.
Online black-box optimization (Table 1):
| Method | Target (Aesthetic↑) | ImageReward↑ | HPS↑ | LPIPS-Div↑ | DreamSim-Div↑ |
|---|---|---|---|---|---|
| PPO+KL | 6.63 | -1.35 | 0.24 | 0.47 | 0.44 |
| SEIKO-Bootstrap | 7.80 | -1.69 | 0.23 | 0.36 | 0.24 |
| SEIKO-UCB | 7.49 | -1.08 | 0.24 | 0.40 | 0.32 |
| SQDF-Bootstrap | 7.87 | 1.14 | — | — | — |
SQDF dominates all evaluation metrics, most notably lifting ImageReward from negative to positive, demonstrating robustness to inaccurate reward proxies in the black-box optimization setting.
Ablation Study¶
| Configuration | Aesthetic Score | DreamSim-Div | LPIPS-Div |
|---|---|---|---|
| SQDF (full) | 7.87 | 0.58 | 0.56 |
| w/o consistency model | 7.10 | 0.62 | 0.59 |
| w/o replay buffer | 8.06 | 0.56 | 0.55 |
| Discount Factor | Effect |
|---|---|
| \(\gamma=1\) | Higher aesthetic score but severe degradation in alignment and diversity |
| \(\gamma=0.9\) | Balanced optimization speed and sample quality |
| \(\gamma=0.85\) | Slower optimization but best diversity |
Key Findings¶
- The consistency model is critical for accelerating convergence—removing it reduces the target reward from 7.87 to 7.10.
- The replay buffer primarily protects diversity; its removal yields a slightly higher reward (8.06) at the cost of reduced diversity.
- \(\gamma\) provides an explicit control knob for the trade-off between optimization speed and sample quality.
- SQDF transfers effectively to SDXL (2.6B), with relative improvements highly consistent with those on SD 1.5.
Highlights & Insights¶
- The "training-free Q-function" concept is remarkably elegant—Tweedie's formula transforms the intractable value function training problem into a simple reward evaluation.
- The introduction of discount factor \(\gamma\) is straightforward yet well-grounded, supported by both theoretical derivation (first-order coincidence of upper and lower bounds) and thorough empirical validation.
- The consistency model serves as a superior drop-in replacement for Tweedie's formula, outperforming multi-step DDIM (4-step DDIM caused training instability).
- Off-policy update compatibility is a structural advantage of SQDF over DDPO/DRaFT, which are constrained to on-policy samples.
- The experimental design is comprehensive: beyond baseline comparisons, Pareto curve comparisons against KL-augmented baselines confirm that the advantage stems from the framework itself rather than regularization alone.
Limitations & Future Work¶
- The one-step Q-function approximation is mathematically coarse—the first-order approximation of the log moment-generating function may be insufficient when \(r/\alpha\) is large.
- The method depends on consistency model quality; inaccurate LCM-LoRA estimates will propagate bias into Q-function approximations.
- Validation is currently limited to the Stable Diffusion family; applicability to newer architectures such as flow matching remains untested.
- The replay buffer management strategy (priority sampling) may require task-specific tuning.
- Computational cost analysis is insufficient—per-step overhead of 62s (aesthetic) / 401s (HPS) warrants further optimization.
Related Work & Insights¶
- DDPO (Black et al., 2023): A PPO-based method that does not exploit gradients; simple but inefficient.
- DRaFT/ReFL: Direct gradient backpropagation; efficient but severely prone to over-optimization.
- SEIKO (Uehara et al., 2024): KL-regularized direct backpropagation, but relies on truncated backpropagation through the denoising chain.
- The proposed "training-free Q-function + reparameterization" framework may generalize to other generative models requiring RL fine-tuning, such as language model RLHF and protein design.
- The use of consistency models here inspires a broader paradigm of "distilled models as Q-value estimators."
Rating¶
- Novelty: ⭐⭐⭐⭐⭐ — Training-free differentiable Q-function estimation combined with three complementary components is an ingenious design.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ — Two task settings, comprehensive ablations, Pareto curve comparisons, and SDXL extension.
- Writing Quality: ⭐⭐⭐⭐ — The method section is well-structured, though relegating several derivations to the appendix increases reading difficulty.
- Value: ⭐⭐⭐⭐⭐ — Provides a principled solution for diffusion model alignment with open-source code and generalizable methodology.