Adversarial Distribution Matching for Diffusion Distillation Towards Efficient Image and Video Synthesis¶
Conference: ICCV 2025 · arXiv: 2507.18569 · Code: N/A · Area: Video Generation · Keywords: diffusion model distillation, adversarial distribution matching, few-step generation, score distillation, video synthesis acceleration
TL;DR¶
This paper proposes the Adversarial Distribution Matching (ADM) framework, which adversarially aligns the latent predictions of real and fake score estimators via a diffusion-based discriminator, replacing the predefined reverse KL divergence in DMD. Combined with Adversarial Distillation Pretraining (ADP), the resulting DMDX pipeline achieves one-step generation on SDXL that surpasses DMD2, and sets new state-of-the-art multi-step distillation results on SD3 and CogVideoX.
Background & Motivation¶
- Background: Distribution Matching Distillation (DMD) is a mainstream score distillation approach that compresses teacher diffusion models into efficient one-step/few-step student generators by minimizing the reverse KL divergence.
- Limitations of Prior Work: DMD relies on reverse KL minimization, which exhibits zero-forcing behavior (driving probability mass in low-density regions toward zero), causing the model to concentrate on a few dominant modes and making it prone to mode collapse. The additional ODE/GAN regularization introduced in DMD/DMD2 merely compensates for this trade-off without fundamentally resolving the mode-seeking behavior.
- Key Challenge: Predefined explicit divergence measures (reverse KL, Fisher divergence, etc.) struggle to fully capture the multi-faceted alignment requirements of complex high-dimensional text-conditioned image/video distributions. In one-step distillation, insufficient support overlap between student and teacher distributions leads to gradient explosion or vanishing.
- Goal: (1) How to bypass the limitations of predefined divergences and achieve more flexible distribution matching? (2) How to provide better initialization for the highly challenging one-step distillation setting?
- Key Insight: Leveraging the implicit data-driven measure of GANs as a replacement for explicit divergences. Hinge GAN theoretically minimizes the Total Variation Distance (TVD), which is symmetric and bounded, making it more suitable than reverse KL in low-overlap settings.
- Core Idea: Employ a diffusion-based adversarial discriminator to align ODE predictions of real and fake score estimators across different noise levels, achieving implicit, adaptive distribution matching distillation.
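The TVD-vs-reverse-KL argument can be checked numerically. Below is a small sketch on a three-bin toy distribution (the bin values are illustrative, not from the paper): when the student places mass where the teacher has almost none, reverse KL blows up, while TVD stays bounded in \([0,1]\) and symmetric.

```python
import math

def reverse_kl(p_fake, p_real):
    """Reverse KL D_KL(p_fake || p_real); explodes where p_real -> 0."""
    return sum(q * math.log(q / p) for q, p in zip(p_fake, p_real) if q > 0)

def tvd(p_fake, p_real):
    """Total variation distance; symmetric and bounded in [0, 1]."""
    return 0.5 * sum(abs(q - p) for q, p in zip(p_fake, p_real))

# Teacher covers two strong modes; the student puts mass on a bin
# where the teacher has almost none (low support overlap).
p_real = [0.50, 0.499, 0.001]
p_fake = [0.50, 0.001, 0.499]

print(reverse_kl(p_fake, p_real))  # large: the p_real -> 0 bin dominates
print(tvd(p_fake, p_real))         # stays well inside [0, 1]
```

Shrinking the teacher's third bin further drives the reverse KL toward infinity, while the TVD can never exceed 1; this is the low-overlap pathology the paper attributes to one-step distillation.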
Method¶
Overall Architecture¶
DMDX constitutes a unified pipeline: Adversarial Distillation Pretraining (ADP) is first applied to provide the student model with a better initialization, followed by ADM for score distillation fine-tuning. The input is noise \(\boldsymbol{z} \sim \mathcal{N}(\boldsymbol{0}, \boldsymbol{I})\), the generator \(G_\theta\) outputs \(\hat{x}_0\), and the alignment target is the output distribution of the teacher model.
Key Designs¶
- Adversarial Distribution Matching (ADM):
- Function: Replaces the DMD loss by adversarially aligning real and fake score estimator predictions across different noise levels.
- Mechanism: The discriminator \(D_\tau\) consists of a frozen teacher diffusion model augmented with multiple trainable heads. Given \(\hat{x}_0\) generated by the student, re-diffusion produces \(x_t\), which is then stepped along the PF-ODE by \(\Delta t\) via the real and fake score estimators, yielding \(x_{t-\Delta t}^{\text{real}}\) and \(x_{t-\Delta t}^{\text{fake}}\), respectively. The discriminator is trained with a hinge loss to distinguish the two, while the generator minimizes \(\mathcal{L}_{\text{GAN}}(\theta) = -\mathbb{E}[D_\tau(x_{t-\Delta t}^{\text{fake}}, t-\Delta t)]\). The default timestep interval is \(\Delta t = T/64\).
- Design Motivation: Compared to the reverse KL divergence in DMD, Hinge GAN theoretically minimizes TVD, which is symmetric (eliminating mode-seeking behavior) and bounded in \([0,1]\) (avoiding gradient explosion). The discriminator can learn arbitrary nonlinear functions to implicitly measure distributional discrepancy, endowing the framework with data-driven adaptability.
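A minimal PyTorch-style sketch of one ADM step under these definitions; `real_score`, `fake_score`, the discriminator `D`, and the `alphas`/`sigmas` noise schedule are hypothetical stand-ins, and the PF-ODE step is written in DDIM-like form:

```python
import torch

def adm_step(x0_hat, real_score, fake_score, D, t, dt, alphas, sigmas):
    """One ADM step: re-diffuse the student sample, take a Delta-t
    PF-ODE step with each score estimator, and form the hinge losses.

    All arguments besides x0_hat are hypothetical stand-ins: real_score
    and fake_score map (x_t, t) -> predicted noise, D is the
    diffusion-based discriminator, alphas/sigmas the noise schedule.
    """
    eps = torch.randn_like(x0_hat)
    x_t = alphas[t] * x0_hat + sigmas[t] * eps  # re-diffuse to level t

    def ode_step(score_fn):
        # DDIM-style deterministic step from t to t - dt.
        e = score_fn(x_t, t)
        x0_pred = (x_t - sigmas[t] * e) / alphas[t]
        return alphas[t - dt] * x0_pred + sigmas[t - dt] * e

    x_real = ode_step(real_score).detach()  # teacher branch: no student grad
    x_fake = ode_step(fake_score)

    # Hinge discriminator loss separates the two branches ...
    d_loss = (torch.relu(1.0 - D(x_real, t - dt)).mean()
              + torch.relu(1.0 + D(x_fake.detach(), t - dt)).mean())
    # ... while the generator minimizes -E[D(x_fake, t - dt)].
    g_loss = -D(x_fake, t - dt).mean()
    return d_loss, g_loss
```

Because both branches start from the same re-diffused \(x_t\) and are stepped to \(t-\Delta t\), the discriminator always receives inputs at a known noise level, which is how the timestep information is preserved.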
- Adversarial Distillation Pretraining (ADP):
- Function: Provides a better initialization for ADM fine-tuning in one-step distillation, enlarging the support overlap between student and teacher distributions.
- Mechanism: ODE pairs \((x_T, x_0)\) are collected offline from the teacher model, and noisy samples are constructed by linear interpolation under a velocity-prediction parameterization. A dual-space discriminator is employed: a latent-space discriminator \(D_{\tau_1}\) (initialized from the teacher model) and a pixel-space discriminator \(D_{\tau_2}\) (initialized from the SAM visual encoder), weighted by \(\lambda_1=0.85\) and \(\lambda_2=0.15\). A cubic timestep schedule \(t' = [1-(t/T)^3] \cdot T\) biases sampling toward high noise levels, encouraging exploration of new modes.
- Design Motivation: In one-step distillation, the poor quality of student outputs results in minimal support overlap between \(p_{\text{fake}}\) and \(p_{\text{real}}\), causing gradient vanishing as \(p_{\text{fake}} \to 0\) and gradient explosion as \(p_{\text{real}} \to 0\) under reverse KL. Distribution-level adversarial distillation enables the student to capture a broader range of latent modes from the teacher.
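The cubic timestep schedule and the dual-space discriminator weighting can be sketched directly (only the \(\lambda_1/\lambda_2\) weights and the schedule formula come from the paper; everything else is illustrative):

```python
import random

def cubic_timestep(T):
    """Draw t uniformly, then warp it with the cubic schedule
    t' = [1 - (t/T)^3] * T, which concentrates samples at high noise."""
    t = random.uniform(0, T)
    return (1 - (t / T) ** 3) * T

# ADP's dual-space discriminator weighting (weights from the paper).
LAMBDA_LATENT, LAMBDA_PIXEL = 0.85, 0.15

def adp_disc_score(d_latent, d_pixel):
    """Combine latent-space and pixel-space discriminator outputs."""
    return LAMBDA_LATENT * d_latent + LAMBDA_PIXEL * d_pixel
```

With \(t\) uniform, the warped values \(1-(t/T)^3\) have mean \(0.75\,T\), so roughly three quarters of the probability mass sits in the high-noise half of the schedule.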
- Distinction Between ADM and ADP:
- Function: ADM is a score distillation method (supervising the full denoising process across noise levels), whereas ADP is an adversarial distillation method (concerned only with the clean data distribution at \(t=0\)).
- Mechanism: ADM solves the PF-ODE to preserve timestep information for the score estimator input, operating the discriminator in noise space; ADP artificially creates overlapping regions by randomly diffusing generator outputs, making discrimination harder and gradient signals smoother.
- Design Motivation: In ADM, insufficient distributional support overlap makes it easy for the discriminator to separate real from fake, leading to extreme gradient signals; ADP is therefore needed first to bring the two distributions closer together.
Loss & Training¶
- ADM stage: Hinge GAN loss with alternating generator and discriminator updates; the fake score estimator is updated online so that it keeps tracking the student's evolving output distribution.
- ADP stage: Distribution-level Hinge GAN loss based on ODE pairs + velocity MSE pretraining loss.
- ADM requires no additional regularization terms (unlike DMD/DMD2), as GAN training implicitly incorporates the optimization direction of the reverse KL divergence.
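The alternating updates of the ADM stage can be sketched as a training-loop skeleton; `G`, the optimizers, `sample_noise`, and the `adm_losses` helper are all hypothetical stand-ins for the paper's components:

```python
def adm_training_loop(G, opt_g, opt_s, opt_d, sample_noise, adm_losses, steps):
    """Alternating-update skeleton for the ADM stage (all hypothetical).

    adm_losses(x0_hat) is assumed to return (g_loss, score_loss, d_loss):
    the hinge generator loss, the denoising loss that keeps the fake
    score estimator tracking the student, and the hinge discriminator loss.
    """
    for _ in range(steps):
        # (1) Generator update via the hinge generator loss.
        g_loss, _, _ = adm_losses(G(sample_noise()))
        opt_g.zero_grad()
        g_loss.backward()
        opt_g.step()

        # (2) Fake score estimator tracks the student's current outputs.
        _, score_loss, _ = adm_losses(G(sample_noise()).detach())
        opt_s.zero_grad()
        score_loss.backward()
        opt_s.step()

        # (3) Discriminator update with the hinge discriminator loss.
        _, _, d_loss = adm_losses(G(sample_noise()).detach())
        opt_d.zero_grad()
        d_loss.backward()
        opt_d.step()
```

Note that no extra regularization term appears anywhere in the loop, matching the claim above that the GAN objective alone suffices.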
Key Experimental Results¶
Main Results¶
| Model / Dataset | Metric | Ours (DMDX/ADM) | Prev. SOTA | Gain |
|---|---|---|---|---|
| SDXL 1-step | CLIP Score | 35.26 | 35.22 (DMD2) | +0.04 |
| SDXL 1-step | HPSv2 | 27.70 | 27.45 (DMD2) | +0.25 |
| SDXL 1-step | MPS | 11.20 | 10.69 (DMD2) | +0.51 |
| SD3-Medium 4-step | CLIP Score | 34.91 | 34.40 (Flash) | +0.51 |
| SD3-Medium 4-step | Pick Score | 22.55 | 22.09 (Flash) | +0.46 |
| SD3.5-Large 4-step | Pick Score | 22.88 | 22.40 (LADD) | +0.48 |
| CogVideoX 4-step | VBench Quality | 85.25 | 84.73 (Teacher 50-step) | +0.52 (surpasses teacher) |
Ablation Study¶
| Configuration | Result | Notes |
|---|---|---|
| DMD loss (reverse KL) | Higher FID | The DMD loss still trends steadily downward during ADM training, indicating ADM implicitly subsumes reverse KL |
| ADM w/o pretraining | Unstable | Gradient explosion/vanishing issues |
| ADP + ADM (DMDX) | Best | Full pipeline |
| Uniform vs. cubic timestep schedule | — | Cubic schedule biases toward high noise, promoting mode diversity |
Key Findings¶
- Without directly optimizing the DMD loss, its value exhibits a steady downward trend during ADM training, validating that Hinge GAN implicitly encompasses reverse KL divergence optimization.
- When provided with better initialization, TTUR (Two-Timescale Update Rule) has minimal impact on final performance.
- Four-step ADM distillation surpasses the quality of 50-step teacher sampling on both SD3 and SD3.5.
- CogVideoX 4-step distillation outperforms the 50-step teacher model on the VBench quality score.
Highlights & Insights¶
- The paper provides a theoretical justification from the TVD-vs-reverse-KL perspective for the advantage of adversarial methods under low support overlap: the symmetry of TVD avoids mode-seeking, and its boundedness avoids numerical instability.
- The GAN discriminator in score distillation is elegantly designed: it operates on samples obtained by taking a \(\Delta t\)-step along the PF-ODE, which naturally preserves timestep information.
- This is the first successful application of score distillation to a large-scale video model such as CogVideoX, achieving 4-step quality surpassing the 50-step teacher.
- The combination of dual-space discriminators (latent space + pixel space) enhances overall discriminative capacity.
Limitations & Future Work¶
- One-step generation quality still has room for improvement at very high resolutions.
- The stability of adversarial training depends on the quality of the pretraining stage.
- Whether the identification of dominant fully-connected (FC) heads in the discriminator can be automated or adaptively adjusted remains an open question.
- The scalability to larger video models (e.g., HunyuanVideo 13B) is not discussed.
Related Work & Insights¶
- vs. DMD/DMD2: ADM replaces the explicit reverse KL divergence with an implicit GAN measure, eliminating the need for additional regularization; ADP replaces MSE pretraining with distribution-level adversarial distillation.
- vs. SDXL-Lightning: Both share the adversarial distillation idea, but DMDX further introduces ADM for score distillation fine-tuning.
- vs. LADD: ADP is inspired by LADD's synthetic-data adversarial distillation paradigm, but replaces it with ODE pair-based noise construction, a cubic timestep schedule, and dual-space discriminators.
Rating¶
- Novelty: ⭐⭐⭐⭐ Advances score distillation from both theoretical (TVD vs. KL) and practical (implicit vs. explicit measure) perspectives.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Covers SDXL / SD3 / SD3.5 / CogVideoX, one-step and multi-step settings, image and video generation.
- Writing Quality: ⭐⭐⭐⭐ Theoretical discussion is thorough and mathematical derivations are clearly presented.
- Value: ⭐⭐⭐⭐ Provides a unified and efficient distillation framework for large-scale diffusion models.