# Fine-Tuning Diffusion Models via Intermediate Distribution Shaping
Conference: ICLR 2026 | arXiv: 2510.02692 | Code: None | Area: Diffusion Models / Fine-Tuning | Keywords: diffusion model fine-tuning, rejection sampling, KL regularization, intermediate distribution, inverse noise correction
## TL;DR
This work unifies rejection-sampling-based fine-tuning methods under the GRAFT framework, proving that they implicitly perform KL-regularized reward maximization. Building on this, P-GRAFT is proposed to perform distribution shaping at intermediate denoising steps (achieving a better bias–variance trade-off), and Inverse Noise Correction is introduced to improve flow model quality without reward signals, yielding an 8.81% VQAScore improvement on text-to-image generation.
## Background & Motivation
Background: Fine-tuning diffusion models commonly relies on PPO with KL regularization; however, the marginal likelihood of diffusion models is intractable, causing the KL term to be either ignored (leading to instability) or approximated via trajectory-level KL (suboptimal, with value function initialization bias).
Limitations of Prior Work: (1) Intractable marginal KL forces PPO to rely on relaxed approximations; (2) rejection-sampling methods (RAFT/BoN) are practical but lack clear theoretical grounding; (3) distribution shaping is applied only to the final data distribution, leaving the structural information of intermediate denoising steps unexploited.
Key Challenge: Diffusion model fine-tuning requires KL regularization for stability, yet the marginal KL is intractable.
Key Insight: Proving that rejection sampling implicitly enforces the marginal KL constraint (even though the likelihood is intractable), then exploiting the multi-step structure of diffusion models to perform shaping at intermediate distributions.
Core Idea: Rejection sampling = implicit KL regularization → apply rejection sampling at intermediate denoising steps → superior bias–variance trade-off.
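The first link in this chain (rejection sampling = implicit KL regularization) can be illustrated numerically. Below is a minimal sketch of my own (the toy reward, base distribution, and \(\alpha\) are invented for the demo, not taken from the paper): accepting a sample \(x\) with probability \(\exp(r(x)/\alpha)\) tilts the base density toward \(\exp(r(x)/\alpha)\,p(x)\), which is exactly the KL-regularized optimum.

```python
import numpy as np

# Toy illustration (assumed setup, not the paper's code): rejection sampling
# with acceptance probability exp(r(x)/alpha) yields samples distributed as
# p_RL(x) ∝ exp(r(x)/alpha) * p(x), the KL-regularized reward maximizer.
rng = np.random.default_rng(0)
alpha = 0.5                      # KL-regularization strength (chosen for the demo)
reward = lambda x: -np.abs(x)    # toy reward favoring samples near 0

base = rng.normal(size=200_000)  # base model samples, p(x) = N(0, 1)

# max r = 0 here, so exp(r/alpha) is already a valid acceptance probability.
accept_prob = np.exp(reward(base) / alpha)
accepted = base[rng.uniform(size=base.size) < accept_prob]

# Accepted samples concentrate where the reward is high: E|x| drops.
print(abs(accepted).mean() < abs(base).mean())  # → True
```

No likelihood of the base distribution is ever evaluated; the KL constraint is enforced purely through the acceptance rule, which is the point of the GRAFT result.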
## Method

### Overall Architecture
Two main contributions: (1) P-GRAFT: fine-tuning via rejection sampling at an intermediate denoising step \(t\); (2) Inverse Noise Correction: inverting the flow model to learn an improved initial noise distribution without reward signals.
### Key Designs
- GRAFT Unified Framework:
  - Function: Unifies classical rejection sampling, Best-of-N, Top-K, and related methods under a Generalized Rejection Sampling (GRS) framework.
  - Mechanism: Lemma 2.3 proves that the distribution of samples accepted by GRS is the solution to KL-regularized reward maximization, \(p^{\text{RL}}(x) \propto \exp(\hat{r}(x)/\alpha)\,\bar{p}(x)\), with reward reshaping incorporated.
  - Design Motivation: Although the marginal KL of diffusion models is intractable, GRS realizes it implicitly.
- P-GRAFT (Partial-GRAFT):
  - Function: Performs rejection sampling on intermediate denoising states \(X_t\) rather than on final generated samples.
  - Mechanism: Lemma 3.2 proves that P-GRS shapes the intermediate distribution \(\bar{p}_t\) rather than the final distribution. The fine-tuned model handles denoising from \(T \to t\), while the original model handles \(t \to 0\). Bias–variance trade-off: a large \(t\) yields high reward variance but a simpler learning problem (a simpler score function), whereas a small \(t\) yields more accurate reward estimates but a harder learning problem.
  - Design Motivation: Choosing an appropriate intermediate time \(t\) balances both considerations.
- Inverse Noise Correction:
  - Function: Inverts the flow model's mapping from data to noise to learn an improved initial noise distribution.
  - Mechanism: An adapter is trained to apply corrections in the noise space, without requiring an explicit reward function.
  - Design Motivation: The invertibility of flow models enables inference of the initial noise distribution.
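The Inverse Noise Correction mechanism can be sketched in one dimension. Everything below is a toy of my own construction (the linear "flow" and the moment-matching correction stand in for the paper's adapter): invert the flow on real data to infer what the initial noise should have been, then correct the noise distribution — no reward signal involved.

```python
import numpy as np

# Hedged 1-D toy (not the paper's setup): the pretrained "flow" f maps noise
# z to data x = 2z + 1, but the real data is distributed differently. Because
# flows are invertible, we can pull real data back to noise space and fit a
# simple correction of the noise distribution there.
rng = np.random.default_rng(2)

f = lambda z: 2.0 * z + 1.0          # pretrained flow: noise -> data
f_inv = lambda x: (x - 1.0) / 2.0    # its exact inverse: data -> noise

real_data = rng.normal(loc=2.0, scale=2.0, size=100_000)  # target distribution
z_star = f_inv(real_data)            # noise that *would* generate real data

# "Correction": match the first two moments of the inferred noise.
shift, scale = z_star.mean(), z_star.std()
correct = lambda z: shift + scale * z

samples = f(correct(rng.normal(size=100_000)))
print(round(samples.mean(), 1), round(samples.std(), 1))  # → 2.0 2.0
```

In the paper the correction is a learned adapter rather than a moment match, but the pipeline is the same: invertibility turns "fix the outputs" into the easier problem of "fix the inputs".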
### Loss & Training
- P-GRAFT: Generate \(M\) trajectories → apply GRS at the intermediate step to select accepted samples → fine-tune the \(T \to t\) denoising segment via supervised fine-tuning (SFT).
- Inverse Noise Correction: Parameter-efficient adapter fine-tuning.
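The P-GRAFT data-collection step above can be sketched end to end. All names and the toy denoiser below are mine (the paper fine-tunes a real diffusion model; here best-of-N selection stands in for general GRS): roll out \(M\) trajectories, score the final samples, select, and keep the \((X_T, X_t)\) pairs as SFT targets for the \(T \to t\) segment.

```python
import numpy as np

# Hedged sketch of P-GRAFT data collection (names are mine, not the paper's).
rng = np.random.default_rng(1)
M, T, t_sel = 64, 10, 4          # trajectories, total steps, intermediate step

def denoise_step(x, step):
    """Toy denoiser: shrink toward 0 with noise that decays as step falls."""
    return 0.9 * x + 0.1 * rng.normal(size=x.shape) * (step / T)

# Roll out M trajectories from Gaussian noise, recording every state.
states = [rng.normal(size=M)]            # index 0 holds X_T (initial noise)
for step in range(T, 0, -1):
    states.append(denoise_step(states[-1], step))

x_t, x_0 = states[T - t_sel], states[-1]  # intermediate and final states
reward = -np.abs(x_0)                     # toy reward scored on final samples

# Best-of-N acceptance (a special case of GRS): keep the top 25% by reward.
keep = reward >= np.quantile(reward, 0.75)
sft_pairs = list(zip(states[0][keep], x_t[keep]))  # (X_T, X_t) SFT data
print(len(sft_pairs))  # → 16
```

The fine-tuning step would then regress the \(T \to t\) denoiser onto these accepted pairs via SFT, while the original model is kept frozen for \(t \to 0\).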
## Key Experimental Results

### Main Results
Fine-tuning Stable Diffusion v2 for text-to-image generation:
| Method | VQAScore | Relative Gain over Baseline | Notes |
|---|---|---|---|
| SD v2 (baseline) | baseline | — | No fine-tuning |
| Policy Gradient | moderate | moderate | PPO-type method |
| GRAFT (final step) | good | good | Standard rejection sampling |
| P-GRAFT | best | +8.81% | Intermediate-step rejection sampling |
| SDXL-Base | reference | — | Larger model |
### Multi-Task Validation
| Task | Method | Result |
|---|---|---|
| Layout generation | P-GRAFT | Significant improvement |
| Molecule generation | P-GRAFT + deduplication | Improvement + diversity preservation |
| Unconditional image generation | Inverse Noise Correction | Improved FID + reduced FLOPs |
### Key Findings
- P-GRAFT outperforms policy gradient methods (PPO) and standard GRAFT on text-to-image generation.
- Hypothesis testing confirms that intermediate states \(X_t\) at smaller \(t\) carry more information about the final reward (variance analysis).
- In molecule generation, the deduplication variant of GRS effectively prevents mode collapse; the reshaped reward automatically incorporates a diversity term.
- Inverse Noise Correction improves FID without requiring reward signals and reduces per-image FLOPs.
## Highlights & Insights
- GRS = Implicit KL Regularization: This theoretical result addresses a fundamental technical challenge in diffusion model fine-tuning. Since marginal KL is intractable, it need not be computed explicitly — rejection sampling implicitly enforces it.
- Bias–Variance Perspective on Intermediate Distribution Shaping: The choice of intermediate step is not merely an engineering decision but is grounded in clear mathematical principles — selecting \(t\) that minimizes the product of bias and variance.
- Deduplicated GRS for Molecule Generation: The reshaped reward \(\hat{r}\) automatically includes a diversity penalty — \(\log(1/N_{\text{copies}})\) — elegantly preventing mode collapse.
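The deduplication penalty is easy to work through numerically. A hedged example (the raw rewards and molecule names are invented for illustration): with \(\hat{r}(x) = r(x) + \alpha \log(1/N_{\text{copies}})\), a mode that floods the batch loses its reward advantage.

```python
import math
from collections import Counter

# Hedged numeric example of the reshaped reward; raw rewards are invented.
alpha = 1.0
samples = ["mol_A", "mol_A", "mol_A", "mol_B"]   # mol_A dominates the batch
base_reward = {"mol_A": 2.0, "mol_B": 1.5}

counts = Counter(samples)
reshaped = {x: base_reward[x] + alpha * math.log(1.0 / counts[x])
            for x in counts}

# Duplicates pay a -log(N_copies) penalty, so the rarer mol_B now wins:
print(round(reshaped["mol_A"], 2))  # → 0.9  (2.0 - log 3)
print(round(reshaped["mol_B"], 2))  # → 1.5  (no copies, no penalty)
```

Selection under the reshaped reward therefore keeps diverse samples instead of repeatedly accepting the single highest-reward mode, which is how the variant prevents mode collapse.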
## Limitations & Future Work
- The optimal intermediate time \(t\) in P-GRAFT requires empirical search.
- Inverse Noise Correction is applicable only to flow models, which require invertibility.
- Generating \(M\) complete trajectories incurs significant computational overhead.
- The theoretical analysis relies on the assumption of a well-trained denoiser.
## Related Work & Insights
- vs. PPO/DPPO: Avoids the difficulty of KL computation; implicit KL constraint yields greater stability.
- vs. RAFT/RSO: GRAFT provides a unified theoretical perspective, and P-GRAFT further leverages the diffusion structure for additional gains.
- vs. DPO for diffusion: DPO transfers KL via preference data; P-GRAFT employs rejection sampling more directly.
## Rating
- Novelty: ⭐⭐⭐⭐⭐ The GRAFT unified theory and the bias–variance analysis underlying P-GRAFT are both significant contributions.
- Experimental Thoroughness: ⭐⭐⭐⭐ Covers text-to-image, layout, molecule, and unconditional image generation.
- Writing Quality: ⭐⭐⭐⭐⭐ Theory and practice are tightly integrated, with clear mathematical derivations.
- Value: ⭐⭐⭐⭐⭐ Carries important theoretical and practical implications for the paradigm of diffusion model fine-tuning.