Adversarial Distribution Matching for Diffusion Distillation Towards Efficient Image and Video Synthesis¶
Conference: ICCV 2025 · arXiv: 2507.18569 · Code: N/A · Area: Video Generation · Keywords: diffusion model distillation, adversarial distribution matching, few-step generation, score distillation, video synthesis acceleration
TL;DR¶
This paper proposes the Adversarial Distribution Matching (ADM) framework, which adversarially aligns the latent predictions of real and fake score estimators via a diffusion-based discriminator, replacing the predefined reverse KL divergence in DMD. Combined with Adversarial Distillation Pretraining (ADP), the resulting DMDX pipeline achieves one-step generation on SDXL that surpasses DMD2, and sets new state-of-the-art multi-step distillation results on SD3 and CogVideoX.
Background & Motivation¶
- Background: Distribution Matching Distillation (DMD) is a mainstream score distillation approach that compresses teacher diffusion models into efficient one-step/few-step student generators by minimizing the reverse KL divergence.
- Limitations of Prior Work: DMD relies on reverse KL minimization, which exhibits zero-forcing behavior (driving probability mass in low-density regions toward zero), causing the model to concentrate on a few dominant modes and making it prone to mode collapse. The additional ODE/GAN regularization introduced in DMD/DMD2 merely compensates for this trade-off without fundamentally resolving the mode-seeking behavior.
- Key Challenge: Predefined explicit divergence measures (reverse KL, Fisher divergence, etc.) struggle to fully capture the multi-faceted alignment requirements of complex high-dimensional text-conditioned image/video distributions. In one-step distillation, insufficient support overlap between student and teacher distributions leads to gradient explosion or vanishing.
- Goal: (1) How to bypass the limitations of predefined divergences and achieve more flexible distribution matching? (2) How to provide better initialization for the highly challenging one-step distillation setting?
- Key Insight: Leveraging the implicit data-driven measure of GANs as a replacement for explicit divergences. Hinge GAN theoretically minimizes the Total Variation Distance (TVD), which is symmetric and bounded, making it more suitable than reverse KL in low-overlap settings.
- Core Idea: Employ a diffusion-based adversarial discriminator to align ODE predictions of real and fake score estimators across different noise levels, achieving implicit, adaptive distribution matching distillation.
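The TVD-vs-reverse-KL argument can be checked numerically. Below is a small sketch on a three-bin toy distribution (the bin values are illustrative, not from the paper): when the student places mass where the teacher has almost none, reverse KL blows up, while TVD stays bounded in \([0,1]\) and symmetric.

```python
import math

def reverse_kl(p_fake, p_real):
    """Reverse KL D_KL(p_fake || p_real); explodes where p_real -> 0."""
    return sum(q * math.log(q / p) for q, p in zip(p_fake, p_real) if q > 0)

def tvd(p_fake, p_real):
    """Total variation distance; symmetric and bounded in [0, 1]."""
    return 0.5 * sum(abs(q - p) for q, p in zip(p_fake, p_real))

# Teacher covers two strong modes; the student puts mass on a bin
# where the teacher has almost none (low support overlap).
p_real = [0.50, 0.499, 0.001]
p_fake = [0.50, 0.001, 0.499]

print(reverse_kl(p_fake, p_real))  # large: the p_real -> 0 bin dominates
print(tvd(p_fake, p_real))         # stays well inside [0, 1]
```

Shrinking the teacher's third bin further drives the reverse KL toward infinity, while the TVD can never exceed 1; this is the low-overlap pathology the paper attributes to one-step distillation.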
Method¶
Overall Architecture¶
DMDX constitutes a unified pipeline: Adversarial Distillation Pretraining (ADP) is first applied to provide the student model with a better initialization, followed by ADM for score distillation fine-tuning. The input is noise \(\boldsymbol{z} \sim \mathcal{N}(\boldsymbol{0}, \boldsymbol{I})\), the generator \(G_\theta\) outputs \(\hat{x}_0\), and the alignment target is the output distribution of the teacher model.
Key Designs¶
- Adversarial Distribution Matching (ADM):
- Function: Replaces the DMD loss by adversarially aligning real and fake score estimator predictions across different noise levels.
- Mechanism: The discriminator \(D_\tau\) consists of a frozen teacher diffusion model augmented with multiple trainable heads. Given \(\hat{x}_0\) generated by the student, re-diffusion produces \(x_t\), which is then stepped along the PF-ODE by \(\Delta t\) via the real and fake score estimators, yielding \(x_{t-\Delta t}^{\text{real}}\) and \(x_{t-\Delta t}^{\text{fake}}\), respectively. The discriminator is trained with a hinge loss to distinguish the two, while the generator minimizes \(\mathcal{L}_{\text{GAN}}(\theta) = -\mathbb{E}[D_\tau(x_{t-\Delta t}^{\text{fake}}, t-\Delta t)]\). The default timestep interval is \(\Delta t = T/64\).
- Design Motivation: Compared to the reverse KL divergence in DMD, Hinge GAN theoretically minimizes TVD, which is symmetric (eliminating mode-seeking behavior) and bounded in \([0,1]\) (avoiding gradient explosion). The discriminator can learn arbitrary nonlinear functions to implicitly measure distributional discrepancy, endowing the framework with data-driven adaptability.
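A minimal PyTorch-style sketch of one ADM step under these definitions; `real_score`, `fake_score`, the discriminator `D`, and the `alphas`/`sigmas` noise schedule are hypothetical stand-ins, and the PF-ODE step is written in DDIM-like form:

```python
import torch

def adm_step(x0_hat, real_score, fake_score, D, t, dt, alphas, sigmas):
    """One ADM step: re-diffuse the student sample, take a Delta-t
    PF-ODE step with each score estimator, and form the hinge losses.

    All arguments besides x0_hat are hypothetical stand-ins: real_score
    and fake_score map (x_t, t) -> predicted noise, D is the
    diffusion-based discriminator, alphas/sigmas the noise schedule.
    """
    eps = torch.randn_like(x0_hat)
    x_t = alphas[t] * x0_hat + sigmas[t] * eps  # re-diffuse to level t

    def ode_step(score_fn):
        # DDIM-style deterministic step from t to t - dt.
        e = score_fn(x_t, t)
        x0_pred = (x_t - sigmas[t] * e) / alphas[t]
        return alphas[t - dt] * x0_pred + sigmas[t - dt] * e

    x_real = ode_step(real_score).detach()  # teacher branch: no student grad
    x_fake = ode_step(fake_score)

    # Hinge discriminator loss separates the two branches ...
    d_loss = (torch.relu(1.0 - D(x_real, t - dt)).mean()
              + torch.relu(1.0 + D(x_fake.detach(), t - dt)).mean())
    # ... while the generator minimizes -E[D(x_fake, t - dt)].
    g_loss = -D(x_fake, t - dt).mean()
    return d_loss, g_loss
```

Because both branches start from the same re-diffused \(x_t\) and are stepped to \(t-\Delta t\), the discriminator always receives inputs at a known noise level, which is how the timestep information is preserved.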
- Adversarial Distillation Pretraining (ADP):
- Function: Provides a better initialization for ADM fine-tuning in one-step distillation, enlarging the support overlap between student and teacher distributions.
- Mechanism: ODE pairs \((x_T, x_0)\) are collected offline from the teacher model, and noisy samples are constructed by linear interpolation under a velocity-prediction parameterization. A dual-space discriminator is employed: a latent-space discriminator \(D_{\tau_1}\) (initialized from the teacher model) and a pixel-space discriminator \(D_{\tau_2}\) (initialized from the SAM visual encoder), weighted by \(\lambda_1=0.85\) and \(\lambda_2=0.15\). A cubic timestep schedule \(t' = [1-(t/T)^3] \cdot T\) biases sampling toward high noise levels, encouraging exploration of new modes.
- Design Motivation: In one-step distillation, the poor quality of student outputs results in minimal support overlap between \(p_{\text{fake}}\) and \(p_{\text{real}}\), causing gradient vanishing as \(p_{\text{fake}} \to 0\) and gradient explosion as \(p_{\text{real}} \to 0\) under reverse KL. Distribution-level adversarial distillation enables the student to capture a broader range of latent modes from the teacher.
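The cubic timestep schedule and the dual-space discriminator weighting can be sketched directly (only the \(\lambda_1/\lambda_2\) weights and the schedule formula come from the paper; everything else is illustrative):

```python
import random

def cubic_timestep(T):
    """Draw t uniformly, then warp it with the cubic schedule
    t' = [1 - (t/T)^3] * T, which concentrates samples at high noise."""
    t = random.uniform(0, T)
    return (1 - (t / T) ** 3) * T

# ADP's dual-space discriminator weighting (weights from the paper).
LAMBDA_LATENT, LAMBDA_PIXEL = 0.85, 0.15

def adp_disc_score(d_latent, d_pixel):
    """Combine latent-space and pixel-space discriminator outputs."""
    return LAMBDA_LATENT * d_latent + LAMBDA_PIXEL * d_pixel
```

With \(t\) uniform, the warped values \(1-(t/T)^3\) have mean \(0.75\,T\), so roughly three quarters of the probability mass sits in the high-noise half of the schedule.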
- Distinction Between ADM and ADP:
- Function: ADM is a score distillation method (supervising the full denoising process across noise levels), whereas ADP is an adversarial distillation method (concerned only with the clean data distribution at \(t=0\)).
- Mechanism: ADM solves the PF-ODE to preserve timestep information for the score estimator input, operating the discriminator in noise space; ADP artificially creates overlapping regions by randomly diffusing generator outputs, making discrimination harder and gradient signals smoother.
- Design Motivation: In ADM, insufficient distributional support overlap makes it easy for the discriminator to separate real from fake, leading to extreme gradient signals; ADP is therefore needed first to bring the two distributions closer together.
Loss & Training¶
- ADM stage: Hinge GAN loss with alternating generator and discriminator updates; the fake score estimator is updated online so that it keeps tracking the student's evolving output distribution.
- ADP stage: Distribution-level Hinge GAN loss based on ODE pairs + velocity MSE pretraining loss.
- ADM requires no additional regularization terms (unlike DMD/DMD2), as GAN training implicitly incorporates the optimization direction of the reverse KL divergence.
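The alternating updates of the ADM stage can be sketched as a training-loop skeleton; `G`, the optimizers, `sample_noise`, and the `adm_losses` helper are all hypothetical stand-ins for the paper's components:

```python
def adm_training_loop(G, opt_g, opt_s, opt_d, sample_noise, adm_losses, steps):
    """Alternating-update skeleton for the ADM stage (all hypothetical).

    adm_losses(x0_hat) is assumed to return (g_loss, score_loss, d_loss):
    the hinge generator loss, the denoising loss that keeps the fake
    score estimator tracking the student, and the hinge discriminator loss.
    """
    for _ in range(steps):
        # (1) Generator update via the hinge generator loss.
        g_loss, _, _ = adm_losses(G(sample_noise()))
        opt_g.zero_grad()
        g_loss.backward()
        opt_g.step()

        # (2) Fake score estimator tracks the student's current outputs.
        _, score_loss, _ = adm_losses(G(sample_noise()).detach())
        opt_s.zero_grad()
        score_loss.backward()
        opt_s.step()

        # (3) Discriminator update with the hinge discriminator loss.
        _, _, d_loss = adm_losses(G(sample_noise()).detach())
        opt_d.zero_grad()
        d_loss.backward()
        opt_d.step()
```

Note that no extra regularization term appears anywhere in the loop, matching the claim above that the GAN objective alone suffices.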
Key Experimental Results¶
Main Results¶
| Model / Dataset | Metric | Ours (DMDX/ADM) | Prev. SOTA | Gain |
|---|---|---|---|---|
| SDXL 1-step | CLIP Score | 35.26 | 35.22 (DMD2) | +0.04 |
| SDXL 1-step | HPSv2 | 27.70 | 27.45 (DMD2) | +0.25 |
| SDXL 1-step | MPS | 11.20 | 10.69 (DMD2) | +0.51 |
| SD3-Medium 4-step | CLIP Score | 34.91 | 34.40 (Flash) | +0.51 |
| SD3-Medium 4-step | Pick Score | 22.55 | 22.09 (Flash) | +0.46 |
| SD3.5-Large 4-step | Pick Score | 22.88 | 22.40 (LADD) | +0.48 |
| CogVideoX 4-step | VBench Quality | 85.25 | 84.73 (Teacher 50-step) | +0.52 (surpasses teacher) |
Ablation Study¶
| Configuration | Result | Notes |
|---|---|---|
| DMD loss (reverse KL) | Higher FID | The DMD loss still trends steadily downward during ADM training, indicating ADM implicitly subsumes reverse KL |
| ADM w/o pretraining | Unstable | Gradient explosion/vanishing issues |
| ADP + ADM (DMDX) | Best | Full pipeline |
| Uniform vs. cubic timestep schedule | — | Cubic schedule biases toward high noise, promoting mode diversity |
Key Findings¶
- Without directly optimizing the DMD loss, its value exhibits a steady downward trend during ADM training, validating that Hinge GAN implicitly encompasses reverse KL divergence optimization.
- When provided with better initialization, TTUR (Two-Timescale Update Rule) has minimal impact on final performance.
- Four-step ADM distillation surpasses the quality of 50-step teacher sampling on both SD3 and SD3.5.
- CogVideoX 4-step distillation outperforms the 50-step teacher model on the VBench quality score.
Highlights & Insights¶
- The paper provides a theoretical justification from the TVD-vs-reverse-KL perspective for the advantage of adversarial methods under low support overlap: the symmetry of TVD avoids mode-seeking, and its boundedness avoids numerical instability.
- The GAN discriminator in score distillation is elegantly designed: it operates on samples obtained by taking a \(\Delta t\)-step along the PF-ODE, which naturally preserves timestep information.
- This is the first successful application of score distillation to a large-scale video model such as CogVideoX, achieving 4-step quality surpassing the 50-step teacher.
- The combination of dual-space discriminators (latent space + pixel space) enhances overall discriminative capacity.
Limitations & Future Work¶
- One-step generation quality still has room for improvement at very high resolutions.
- The stability of adversarial training depends on the quality of the pretraining stage.
- Whether the identification of dominant fully-connected (FC) heads in the discriminator can be automated or adaptively adjusted remains an open question.
- The scalability to larger video models (e.g., HunyuanVideo 13B) is not discussed.
Related Work & Insights¶
- vs. DMD/DMD2: ADM replaces the explicit reverse KL divergence with an implicit GAN measure, eliminating the need for additional regularization; ADP replaces MSE pretraining with distribution-level adversarial distillation.
- vs. SDXL-Lightning: Both share the adversarial distillation idea, but DMDX further introduces ADM for score distillation fine-tuning.
- vs. LADD: ADP is inspired by LADD's synthetic-data adversarial distillation paradigm, but replaces it with ODE pair-based noise construction, a cubic timestep schedule, and dual-space discriminators.
Rating¶
- Novelty: ⭐⭐⭐⭐ Advances score distillation from both theoretical (TVD vs. KL) and practical (implicit vs. explicit measure) perspectives.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Covers SDXL / SD3 / SD3.5 / CogVideoX, one-step and multi-step settings, image and video generation.
- Writing Quality: ⭐⭐⭐⭐ Theoretical discussion is thorough and mathematical derivations are clearly presented.
- Value: ⭐⭐⭐⭐ Provides a unified and efficient distillation framework for large-scale diffusion models.