
Adversarial Distribution Matching for Diffusion Distillation Towards Efficient Image and Video Synthesis

Conference: ICCV 2025 | arXiv: 2507.18569 | Code: N/A | Area: Video Generation | Keywords: diffusion model distillation, adversarial distribution matching, few-step generation, score distillation, video synthesis acceleration

TL;DR

This paper proposes the Adversarial Distribution Matching (ADM) framework, which aligns the latent predictions of real and fake score estimators adversarially via a diffusion-based discriminator, replacing the predefined KL divergence in DMD. Combined with Adversarial Distillation Pretraining (ADP), the proposed DMDX pipeline achieves one-step generation on SDXL surpassing DMD2, and sets new multi-step distillation benchmarks on SD3 and CogVideoX.

Background & Motivation

  1. Background: Distribution Matching Distillation (DMD) is a mainstream score distillation approach that compresses teacher diffusion models into efficient one-step/few-step student generators by minimizing the reverse KL divergence.
  2. Limitations of Prior Work: DMD relies on reverse KL minimization, which exhibits zero-forcing behavior (driving probability mass in low-density regions toward zero), causing the model to concentrate on a few dominant modes and making it prone to mode collapse. The additional ODE/GAN regularization introduced in DMD/DMD2 merely compensates for this trade-off without fundamentally resolving the mode-seeking behavior.
  3. Key Challenge: Predefined explicit divergence measures (reverse KL, Fisher divergence, etc.) struggle to fully capture the multi-faceted alignment requirements of complex high-dimensional text-conditioned image/video distributions. In one-step distillation, insufficient support overlap between student and teacher distributions leads to gradient explosion or vanishing.
  4. Goal: (1) How to bypass the limitations of predefined divergences and achieve more flexible distribution matching? (2) How to provide better initialization for the highly challenging one-step distillation setting?
  5. Key Insight: Leverage the implicit, data-driven measure of a GAN discriminator as a replacement for explicit divergences. Hinge GAN training theoretically minimizes the Total Variation Distance (TVD), which is symmetric and bounded, making it more suitable than reverse KL in low-overlap settings (see the divergence definitions after this list).
  6. Core Idea: Employ a diffusion-based adversarial discriminator to align ODE predictions of real and fake score estimators across different noise levels, achieving implicit, adaptive distribution matching distillation.
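
For reference, a short recap of the two divergences involved. These are the standard definitions, written out here rather than copied from the paper:

```latex
% Reverse KL (used by DMD): zero-forcing and unbounded; it diverges wherever
% p_real -> 0 while p_fake > 0, which is exactly the low-overlap regime of one-step students.
D_{\mathrm{KL}}\!\left(p_{\text{fake}} \,\|\, p_{\text{real}}\right)
  = \int p_{\text{fake}}(x)\,\log\frac{p_{\text{fake}}(x)}{p_{\text{real}}(x)}\,dx

% Total Variation Distance (implicitly minimized by hinge GAN training):
% symmetric in its arguments and bounded in [0, 1].
D_{\mathrm{TV}}\!\left(p_{\text{fake}}, p_{\text{real}}\right)
  = \frac{1}{2}\int \bigl|\,p_{\text{fake}}(x) - p_{\text{real}}(x)\,\bigr|\,dx \;\in\; [0, 1]
```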

Method

Overall Architecture

DMDX constitutes a unified pipeline: Adversarial Distillation Pretraining (ADP) is first applied to provide the student model with a better initialization, followed by ADM for score distillation fine-tuning. The input is noise \(\boldsymbol{z} \sim \mathcal{N}(\boldsymbol{0}, \boldsymbol{I})\), the generator \(G_\theta\) outputs \(\hat{x}_0\), and the alignment target is the output distribution of the teacher model.

Key Designs

  1. Adversarial Distribution Matching (ADM):

    • Function: Replaces the DMD loss by adversarially aligning real and fake score estimator predictions across different noise levels.
    • Mechanism: The discriminator \(D_\tau\) consists of a frozen teacher diffusion model augmented with multiple trainable heads. Given \(\hat{x}_0\) generated by the student, re-diffusion produces \(x_t\), which is then stepped along the PF-ODE by \(\Delta t\) via the real and fake score estimators, yielding \(x_{t-\Delta t}^{\text{real}}\) and \(x_{t-\Delta t}^{\text{fake}}\), respectively. The discriminator is trained with a hinge loss to distinguish the two, while the generator minimizes \(\mathcal{L}_{\text{GAN}}(\theta) = -\mathbb{E}[D_\tau(x_{t-\Delta t}^{\text{fake}}, t-\Delta t)]\) (a sketch of both losses follows this list). The default timestep interval is \(\Delta t = T/64\).
    • Design Motivation: Compared to the reverse KL divergence in DMD, Hinge GAN theoretically minimizes TVD, which is symmetric (eliminating mode-seeking behavior) and bounded in \([0,1]\) (avoiding gradient explosion). The discriminator can learn arbitrary nonlinear functions to implicitly measure distributional discrepancy, endowing the framework with data-driven adaptability.
  2. Adversarial Distillation Pretraining (ADP):

    • Function: Provides a better initialization for ADM fine-tuning in one-step distillation, enlarging the support overlap between student and teacher distributions.
    • Mechanism: ODE pairs \((x_T, x_0)\) are collected offline from the teacher model; noisy samples are constructed by linear interpolation between \(x_0\) and \(x_T\), paired with a velocity-prediction objective. A dual-space discriminator is employed: a latent-space discriminator \(D_{\tau_1}\) (initialized from the teacher model) and a pixel-space discriminator \(D_{\tau_2}\) (initialized from the SAM visual encoder), with weights \(\lambda_1=0.85\) and \(\lambda_2=0.15\). A cubic timestep schedule \([1-(t/T)^3]\cdot T\) is introduced to bias sampling toward high noise levels, encouraging exploration of new modes.
    • Design Motivation: In one-step distillation, the poor quality of student outputs results in minimal support overlap between \(p_{\text{fake}}\) and \(p_{\text{real}}\), causing gradient vanishing as \(p_{\text{fake}} \to 0\) and gradient explosion as \(p_{\text{real}} \to 0\) under reverse KL. Distribution-level adversarial distillation enables the student to capture a broader range of latent modes from the teacher.
  3. Distinction Between ADM and ADP:

    • Function: ADM is a score distillation method (supervising the full denoising process across noise levels), whereas ADP is an adversarial distillation method (concerned only with the clean data distribution at \(t=0\)).
    • Mechanism: ADM solves the PF-ODE to preserve timestep information for the score estimator input, operating the discriminator in noise space; ADP artificially creates overlapping regions by randomly diffusing generator outputs, making discrimination harder and gradient signals smoother.
    • Design Motivation: In ADM, insufficient distributional support overlap makes it easy for the discriminator to separate real from fake, leading to extreme gradient signals; ADP is therefore needed first to bring the two distributions closer together.
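
A minimal PyTorch sketch of the ADM losses described above, assuming epsilon-prediction score estimators and a simplified DDIM-style PF-ODE step; `real_score`, `fake_score`, `disc`, and `ddim_step` are placeholder names for illustration, not the paper's code (the paper additionally trains the fake score estimator with its own denoising loss on student samples, omitted here):

```python
import torch
import torch.nn.functional as F

def ddim_step(x_t, eps_pred, a_t, a_s):
    """One deterministic PF-ODE (DDIM-style) step from noise level t to s < t.
    a_t, a_s are cumulative signal coefficients (alpha_bar); a simplified stand-in
    for whatever solver the paper actually uses."""
    x0_pred = (x_t - (1 - a_t).sqrt() * eps_pred) / a_t.sqrt()
    return a_s.sqrt() * x0_pred + (1 - a_s).sqrt() * eps_pred

def adm_losses(x0_student, real_score, fake_score, disc, alphas_cumprod, T=1000, dt=None):
    """Hinge-GAN losses for Adversarial Distribution Matching (sketch).
    real_score / fake_score: epsilon-prediction networks; disc: diffusion-based discriminator."""
    dt = dt if dt is not None else T // 64            # default interval Delta t = T / 64
    t = torch.randint(dt, T, (x0_student.size(0),), device=x0_student.device)
    a_t = alphas_cumprod[t].view(-1, 1, 1, 1)
    a_s = alphas_cumprod[t - dt].view(-1, 1, 1, 1)

    # Re-diffuse the student sample to noise level t
    noise = torch.randn_like(x0_student)
    x_t = a_t.sqrt() * x0_student + (1 - a_t).sqrt() * noise

    # Take a Delta-t PF-ODE step with the real (teacher) and fake (online) score estimators
    with torch.no_grad():
        x_real = ddim_step(x_t, real_score(x_t, t), a_t, a_s)
    x_fake = ddim_step(x_t, fake_score(x_t, t), a_t, a_s)

    # Discriminator: hinge loss separating the two ODE predictions at level t - dt
    d_loss = (F.relu(1.0 - disc(x_real, t - dt)).mean()
              + F.relu(1.0 + disc(x_fake.detach(), t - dt)).mean())

    # Generator: L_GAN(theta) = -E[D_tau(x_fake_{t-dt}, t-dt)]; gradients flow back
    # through x_fake and x_t to the student G_theta
    g_loss = -disc(x_fake, t - dt).mean()
    return d_loss, g_loss
```

In training, generator and discriminator steps would alternate, with the fake score estimator updated online alongside, as noted in the Loss & Training section below.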

Loss & Training

  • ADM stage: Hinge GAN loss with alternating generator and discriminator updates, while the fake score estimator is learned online alongside.
  • ADP stage: Distribution-level Hinge GAN loss based on ODE pairs, plus a velocity MSE pretraining loss (a sketch follows this list).
  • ADM requires no additional regularization terms (unlike DMD/DMD2), as GAN training implicitly incorporates the optimization direction of the reverse KL divergence.
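
A minimal sketch of the ADP objective under similar assumptions. The exact conditioning of the pixel-space (SAM-initialized) discriminator and the form of the noisy interpolation are guesses for illustration; `disc_latent`, `disc_pixel`, and `decode` are hypothetical names, and the velocity MSE term is omitted:

```python
import torch
import torch.nn.functional as F

def cubic_timesteps(batch, T=1000, device="cpu"):
    """Cubic schedule t' = [1 - (t/T)^3] * T with t ~ U[0, T]: samples concentrate
    near T, i.e. high noise levels, encouraging exploration of new modes."""
    t = torch.rand(batch, device=device) * T
    return (1.0 - (t / T) ** 3) * T

def adp_losses(x0_teacher, x0_student, x_T, disc_latent, disc_pixel, decode,
               T=1000, lam_latent=0.85, lam_pixel=0.15):
    """Distribution-level hinge GAN on noisy interpolations of teacher ODE endpoints
    vs. student outputs; x0_student = G_theta(x_T) for the same ODE-pair noise x_T."""
    t = cubic_timesteps(x0_teacher.size(0), T, x0_teacher.device)
    w = (t / T).view(-1, 1, 1, 1)

    # Linear interpolation between clean samples and the paired noise x_T
    xt_real = (1.0 - w) * x0_teacher + w * x_T
    xt_fake = (1.0 - w) * x0_student + w * x_T

    # Latent-space discriminator (teacher-initialized), conditioned on the timestep
    d_latent = (F.relu(1.0 - disc_latent(xt_real, t)).mean()
                + F.relu(1.0 + disc_latent(xt_fake.detach(), t)).mean())
    # Pixel-space discriminator (SAM-encoder-initialized) on decoded samples
    d_pixel = (F.relu(1.0 - disc_pixel(decode(xt_real))).mean()
               + F.relu(1.0 + disc_pixel(decode(xt_fake.detach()))).mean())
    d_loss = lam_latent * d_latent + lam_pixel * d_pixel

    # Generator hinge objective, weighted 0.85 (latent) / 0.15 (pixel)
    g_loss = -(lam_latent * disc_latent(xt_fake, t).mean()
               + lam_pixel * disc_pixel(decode(xt_fake)).mean())
    return d_loss, g_loss
```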

Key Experimental Results

Main Results

| Model / Dataset | Metric | Ours (DMDX/ADM) | Prev. SOTA | Gain |
| --- | --- | --- | --- | --- |
| SDXL 1-step | CLIP Score | 35.26 | 35.22 (DMD2) | +0.04 |
| SDXL 1-step | HPSv2 | 27.70 | 27.45 (DMD2) | +0.25 |
| SDXL 1-step | MPS | 11.20 | 10.69 (DMD2) | +0.51 |
| SD3-Medium 4-step | CLIP Score | 34.91 | 34.40 (Flash) | +0.51 |
| SD3-Medium 4-step | Pick Score | 22.55 | 22.09 (Flash) | +0.46 |
| SD3.5-Large 4-step | Pick Score | 22.88 | 22.40 (LADD) | +0.48 |
| CogVideoX 4-step | VBench Quality | 85.25 | 84.73 (teacher, 50-step) | +0.52 (surpasses teacher) |

Ablation Study

| Configuration | FID ↓ | Notes |
| --- | --- | --- |
| DMD loss (reverse KL) | Higher | DMD loss exhibits a steady downward trend during ADM training, indicating ADM implicitly subsumes reverse KL |
| ADM w/o pretraining | Unstable | Gradient explosion/vanishing issues |
| ADP + ADM (DMDX) | Best | Full pipeline |
| Uniform vs. cubic timestep schedule | — | Cubic schedule biases toward high noise, promoting mode diversity |

Key Findings

  • Without directly optimizing the DMD loss, its value exhibits a steady downward trend during ADM training, validating that Hinge GAN implicitly encompasses reverse KL divergence optimization.
  • When provided with better initialization, TTUR (Two-Timescale Update Rule) has minimal impact on final performance.
  • Four-step ADM distillation surpasses the quality of 50-step teacher sampling on both SD3 and SD3.5.
  • CogVideoX 4-step distillation outperforms the 50-step teacher model on the VBench quality score.

Highlights & Insights

  • The paper provides a theoretical justification from the TVD-vs-reverse-KL perspective for the advantage of adversarial methods under low support overlap: the symmetry of TVD avoids mode-seeking, and its boundedness avoids numerical instability.
  • The GAN discriminator in score distillation is elegantly designed: it operates on predictions obtained by taking a \(\Delta t\) step along the PF-ODE, naturally preserving timestep information.
  • This is the first successful application of score distillation to a large-scale video model such as CogVideoX, achieving 4-step quality surpassing the 50-step teacher.
  • The combination of dual-space discriminators (latent space + pixel space) enhances overall discriminative capacity.

Limitations & Future Work

  • One-step generation quality still has room for improvement at very high resolutions.
  • The stability of adversarial training depends on the quality of the pretraining stage.
  • Whether dominant FC identification in the discriminator can be automated or adaptively adjusted remains an open question.
  • The scalability to larger video models (e.g., HunyuanVideo 13B) is not discussed.

Comparison with Prior Methods

  • vs. DMD/DMD2: ADM replaces the explicit reverse KL divergence with an implicit GAN measure, eliminating the need for additional regularization; ADP replaces MSE pretraining with distribution-level adversarial distillation.
  • vs. SDXL-Lightning: Both share the adversarial distillation idea, but DMDX further introduces ADM for score distillation fine-tuning.
  • vs. LADD: ADP is inspired by LADD's synthetic-data adversarial distillation paradigm, but differs in its ODE-pair-based noisy-sample construction, cubic timestep schedule, and dual-space discriminators.

Rating

  • Novelty: ⭐⭐⭐⭐ Advances score distillation from both theoretical (TVD vs. KL) and practical (implicit vs. explicit measure) perspectives.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ Covers SDXL / SD3 / SD3.5 / CogVideoX, one-step and multi-step settings, image and video generation.
  • Writing Quality: ⭐⭐⭐⭐ Theoretical discussion is thorough and mathematical derivations are clearly presented.
  • Value: ⭐⭐⭐⭐ Provides a unified and efficient distillation framework for large-scale diffusion models.