# Learning Few-Step Diffusion Models by Trajectory Distribution Matching
Conference: ICCV 2025
arXiv: 2503.06674
Code: Project Page
Area: Diffusion Models / Image Generation
Keywords: Diffusion Distillation, Few-Step Generation, Trajectory Distribution Matching, Score Distillation, Text-to-Image Acceleration
## TL;DR
This paper proposes Trajectory Distribution Matching (TDM), a novel paradigm that unifies trajectory distillation and distribution matching by aligning the marginal distributions of student and teacher ODE trajectories at the distributional level. TDM enables efficient few-step diffusion model distillation, requiring only 2 A800 GPU-hours to distill PixArt-α into a 4-step generator that surpasses the teacher model.
## Background & Motivation
Accelerating diffusion model sampling is critical for efficient deployment of AIGC systems. Existing distillation methods fall into two major categories, each with notable limitations:
Distribution Matching methods (e.g., DMD, SiD): achieve distribution-level alignment via score distillation and perform strongly in one-step generation, but are primarily optimized for single-step inference and lack flexibility for multi-step sampling—additional steps cannot be effectively exploited.
Trajectory Distillation methods (e.g., Progressive Distillation, Consistency Models): simulate teacher ODE trajectories at the instance level and support multi-step sampling, but instance-level trajectory matching imposes high demands on model capacity, and numerical errors from solving the teacher ODE propagate to the student.
Key Challenge: Distribution matching discards intermediate trajectory information, while trajectory distillation suffers from the difficulty of instance-level matching.
Core Idea of TDM: Match trajectories at the distributional level rather than at the instance level. This non-trivially unifies the advantages of both paradigms—leveraging trajectory information for fine-grained knowledge transfer while reducing learning difficulty through distribution-level alignment.
## Method
### Overall Architecture
TDM parameterizes a \(K\)-step student model as a discrete ODE sampler that generates intermediate samples \(x_{t_i}\) along the trajectory. Distribution-level score distillation is then applied to align the marginal distribution \(p_{\theta, t_i}\) at each timestep on the student trajectory with the corresponding teacher distribution \(p_{\phi, t_i}\). The entire process is data-free and does not require solving the teacher ODE.
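To make distribution-level matching concrete, here is a toy 1-D sketch in the spirit of the TDM gradient direction \(s_\psi - s_\phi\), assuming Gaussian marginals with closed-form scores (all names here, `teacher_score`, `fake_score`, `student_mu`, are illustrative stand-ins, not the paper's implementation):

```python
import numpy as np

# Toy 1-D sketch of distribution-level score matching: move a student
# location parameter toward the teacher by following s_psi - s_phi.
# Gaussian marginals are assumed so both scores are known in closed form.
rng = np.random.default_rng(0)

teacher_mu, sigma = 2.0, 1.0   # target marginal p_phi
student_mu = -1.0              # student marginal p_theta (learned parameter)

def teacher_score(x):          # grad_x log p_phi(x)
    return -(x - teacher_mu) / sigma**2

def fake_score(x, mu):         # grad_x log p_theta(x), the "fake score" s_psi
    return -(x - mu) / sigma**2

lr = 0.1
for _ in range(200):
    x = student_mu + sigma * rng.standard_normal(256)   # samples from the student
    # Reverse-KL gradient direction (s_psi - s_phi), pushed back through
    # the sampler; d x / d theta = 1 for a location parameter.
    grad_x = fake_score(x, student_mu) - teacher_score(x)
    student_mu -= lr * grad_x.mean()

print(round(student_mu, 2))    # → 2.0
```

The student parameter converges to the teacher mean; in TDM the same score-difference direction is backpropagated through the sampler at every trajectory timestep \(t_i\), not just at the final sample.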
### Key Designs
- TDM Objective:
- Function: Minimize the reverse KL divergence from the teacher distribution at each timestep along the student trajectory.
- Mechanism: The objective is formulated as \(L(\theta) = \sum_{i=0}^{K-1} \sum_{\tau=t_i}^{t_{i+1}} \lambda_\tau \text{KL}(p_{\theta, \tau|t_i}(x_\tau) \| p_{\phi, \tau}(x_\tau))\), where \(p_{\theta, \tau|t_i}\) is the diffused distribution of student trajectory samples. The gradient is approximated as \(\nabla_\theta L(\theta) \approx \sum_{i,\tau} \lambda_\tau [s_\psi(x_\tau, \tau) - s_\phi(x_\tau, \tau)] \frac{\partial x_{t_i}}{\partial \theta}\), where the fake score \(s_\psi\) approximates the score of the student distribution.
- Design Motivation: (1) Only student samples are needed—no teacher ODE sampling required (data-free and efficient); (2) numerical errors from teacher ODE solving are avoided; (3) distribution-level matching imposes lower model capacity requirements than instance-level matching. A key design ensures that diffusion intervals at different timesteps are non-overlapping, allowing a single fake score model to naturally distinguish between different distributions.
- Sampling-Steps-Aware Objective:
- Function: Enable a single model to support flexible deterministic multi-step sampling.
- Mechanism: The objective is extended to an expectation over the number of sampling steps \(K\): \(\mathbb{E}_K \sum_{i=0}^{K-1} \sum_{\tau=t_i^K}^{t_{i+1}^K} \lambda_\tau \text{KL}(p_{\theta, \tau|t_i^K}(x_\tau|K) \| p_{\phi,\tau}(x_\tau))\) where \(K\) is injected as a conditioning signal into both the student and fake score models. The resulting model is referred to as TDM-unify.
- Design Motivation: Existing deterministic sampling distillation methods bind their learning objectives to a fixed number of steps, making it impossible to flexibly switch to \(M < K\) steps after training. Sharing a non-step-aware fake score introduces theoretical bias (formally derived in the paper), necessitating a steps-aware fake score.
- Pseudo-Huber Surrogate Training Objective:
- Function: Replace the L2 loss with a Pseudo-Huber metric to stabilize training.
- Mechanism: The learning objective becomes: \(L(\theta) = \sum_{i,\tau} \sqrt{\|x_{t_i} - \text{sg}(\tilde{x}_{t_i})\|_2^2 + c^2} - c\) where \(c = 0.00054\sqrt{d}\). This normalizes gradients and yields more stable training.
- Design Motivation: TDM and Consistency Models share a structural similarity in their learning formulations (both minimize the distance between generated and corrected samples), motivating the transfer of the Pseudo-Huber metric from iCT to the score distillation framework.
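The Pseudo-Huber surrogate above can be written in a few lines. A minimal NumPy sketch follows, assuming flattened sample vectors; `pseudo_huber` and the toy inputs are illustrative stand-ins:

```python
import numpy as np

# Pseudo-Huber surrogate with c = 0.00054 * sqrt(d), as in iCT.
# `x` plays the role of x_{t_i}; `x_tilde` the stop-gradient target sg(x~_{t_i}).
def pseudo_huber(x, x_tilde):
    d = x.size
    c = 0.00054 * np.sqrt(d)
    return np.sqrt(np.sum((x - x_tilde) ** 2) + c**2) - c

x = np.zeros(4096)
target = np.full(4096, 1e-3)
small = pseudo_huber(x, target)        # small residual: smooth, near-quadratic regime
large = pseudo_huber(x, target * 1e3)  # large residual: loss approaches the plain L2 norm
```

For large residuals the loss grows like \(\|x_{t_i} - \tilde{x}_{t_i}\|_2\) rather than its square, which bounds the gradient magnitude and is what stabilizes training relative to a raw L2 loss.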
### Loss & Training
- Fake Score Training: Denoising score matching with importance sampling is applied to efficiently learn the score in the neighborhood of student trajectory samples.
- Better Teacher, Better Student: For SD-v1.5, fine-tuning the teacher on high-quality data prior to distillation is recommended.
- Truncated Backpropagation: Gradients are backpropagated through only a single ODE step to reduce GPU memory consumption.
- Training Efficiency: SDXL distillation takes only 2 A800 GPU-days; PixArt-α converges in 500 iterations (2 A800 GPU-hours).
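As a minimal illustration of fitting a fake score by denoising score matching, the sketch below regresses a 1-D linear score model on noised "student" samples. The linear model, fixed noise level, and plain SGD are simplifying assumptions; the paper trains a network with importance-sampled noise levels:

```python
import numpy as np

# Denoising score matching (DSM) for a fake score s_psi(x) = a*x + b.
# Regression target -eps/sigma makes s_psi approximate the score of the
# noised student distribution N(0, 1 + sigma^2).
rng = np.random.default_rng(1)
y = rng.standard_normal(10_000)   # stand-in for fixed student samples ~ N(0, 1)
sigma = 0.5                       # single diffusion noise level (simplification)

a, b, lr = 0.0, 0.0, 0.05
for _ in range(2000):
    idx = rng.integers(0, y.size, 256)
    eps = rng.standard_normal(256)
    x = y[idx] + sigma * eps      # noised student samples
    target = -eps / sigma         # DSM regression target
    err = a * x + b - target
    a -= lr * (err * x).mean()    # SGD on the squared DSM residual
    b -= lr * err.mean()

# The learned slope should approach the true score slope -1/(1 + sigma^2) = -0.8.
```

With the fake score in hand, the TDM gradient only needs the difference \(s_\psi - s_\phi\) evaluated at student trajectory samples; importance sampling concentrates this regression on the noise levels that actually appear along the trajectory.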
## Key Experimental Results
### Main Results
Text-to-image generation quality comparison (SDXL backbone, 4 steps):
| Method | HPS↑ | AeS↑ | CLIP↑ | Training Cost |
|---|---|---|---|---|
| SDXL-Lightning | 32.71 | 6.23 | 34.62 | — |
| Hyper-SD | 34.14 | 6.18 | 34.27 | — |
| DMD2 | 31.46 | 5.88 | 35.51 | 160 A100-days |
| LCM | 29.41 | 5.84 | 34.84 | 32 A100-days |
| TDM (Ours) | 34.88 | 6.28 | 36.08 | 2 A800-days |
PixArt-α backbone (4 steps, 1024 resolution):
| Method | HPS↑ | AeS↑ | CLIP↑ | Data-Free? |
|---|---|---|---|---|
| PixArt-α Teacher (25 steps) | 32.21 | 6.23 | 34.11 | — |
| LCM (4 steps) | 30.55 | 6.17 | 33.49 | ✗ |
| TDM (4 steps) | 33.21 | 6.42 | 33.66 | ✓ |
SD-v1.5 backbone (TDM-unify, 1 step & 4 steps):
| Method | Steps | HPS Avg↑ | AeS↑ | CLIP↑ |
|---|---|---|---|---|
| Hyper-SD | 1 | 28.01 | 5.64 | 30.87 |
| TDM-unify-SFT | 1 | 28.90 | 6.02 | 32.12 |
| DMD2 | 4 | 29.49 | 5.91 | 31.53 |
| TDM-unify-SFT | 4 | 31.31 | 6.08 | 32.77 |
### Ablation Study
| Configuration | HPS Avg↑ | AeS↑ | Note |
|---|---|---|---|
| TDM w/o trajectory (K=1) | 28.54 | 5.97 | Degenerates to pure distribution matching |
| TDM w/ trajectory (K=4) | 30.83 | 6.07 | Trajectory information yields significant gains |
| TDM + Pseudo-Huber | 31.31 | 6.08 | Huber metric further improves performance |
| Shared fake score (non-step-aware) | ↓ | ↓ | Validates necessity of steps-aware objective |
LoRA adaptation to unseen custom models (Realistic, SD-v1.5, 4 steps):
| Method | HPS Avg↑ | FID↓ |
|---|---|---|
| LCM | 27.72 | 26.89 |
| Hyper-SD | 30.36 | 37.83 |
| TDM | 31.22 | 20.23 |
### Key Findings
- TDM outperforms SDXL-Lightning by +2.17 HPS and DMD2 by +3.42 HPS on SDXL, while requiring less than 1/80 of DMD2's training cost.
- The 4-step TDM-distilled PixArt-α surpasses the 25-step teacher model on both HPS and AeS.
- TDM-unify enables flexible 1-step and 4-step sampling within a single model.
- The method generalizes to video diffusion distillation: distilling CogVideoX-2B into a 4-step generator improves VBench total score from 80.91 to 81.65.
## Highlights & Insights
- Unified Paradigm: The first work to non-trivially unify trajectory distillation and distribution matching, with rigorous theoretical derivation of their connection.
- Extreme Training Efficiency: PixArt-α converges in 500 iterations (0.01% of teacher training cost); SDXL requires only 2 A800 GPU-days.
- Fully Data-Free: Requires no real image data, alleviating data acquisition and copyright concerns.
- Deterministic + Flexible Sampling: TDM-unify is the first distillation method that simultaneously supports deterministic sampling and flexible step count adjustment.
- Theoretical Connection to Consistency Models: Reveals a deep structural similarity between TDM and CM learning objectives.
## Limitations & Future Work
- The quality of fake score learning directly affects distillation performance and requires careful tuning of the training strategy.
- Performance is upper-bounded by teacher model quality (a trade-off inherent to data-free approaches); the "Better Teacher, Better Student" strategy partially mitigates this at the cost of additional preprocessing.
- Backpropagating through only a single ODE step is a memory-driven compromise; multi-step backpropagation may yield further gains.
- The steps-aware fake score increases training complexity.
## Related Work & Insights
- Closely related to but fundamentally different from DMD2: DMD2 predicts clean samples at each step and ignores intermediate trajectories, while TDM explicitly simulates the deterministic trajectory.
- The Pseudo-Huber metric from Consistency Models is effectively transferred to the score distillation framework.
- Key insight: Trajectory information is a unique asset of diffusion models that should not be discarded during distillation but instead leveraged at the distributional level.
## Rating
- Novelty: ⭐⭐⭐⭐⭐ — Outstanding theoretical contribution in unifying two distillation paradigms with rigorous derivations.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ — Comprehensive evaluation across multiple backbones (SD-v1.5/SDXL/PixArt-α/CogVideoX), step counts, and metrics.
- Writing Quality: ⭐⭐⭐⭐⭐ — Theoretically clear; the logical flow from trajectory distillation to distribution matching is coherent and well-structured.
- Value: ⭐⭐⭐⭐⭐ — State-of-the-art performance combined with extreme training efficiency and a novel theoretical paradigm; high practical impact.