# Phased One-Step Adversarial Equilibrium for Video Diffusion Models
- Conference: AAAI 2026
- arXiv: 2508.21019
- Code: https://v-pae.github.io/
- Area: Diffusion Models / Video Generation
- Keywords: Video diffusion distillation, one-step generation, adversarial training, image-to-video, sampling acceleration
## TL;DR
This paper proposes V-PAE (Video Phased Adversarial Equilibrium), a two-phase distillation framework consisting of stability priming followed by unified adversarial equilibrium, which compresses large-scale video diffusion models (e.g., Wan2.1-I2V-14B) to single-step generation, achieving a 100× speedup and surpassing existing acceleration methods by 5.8% in average quality on VBench-I2V.
## Background & Motivation
- Background: Video diffusion models (e.g., Wan2.1-I2V-14B) have achieved remarkable fidelity, but generating a 5-second video requires 50 iterative steps and approximately 15 minutes on 8×H20 GPUs, incurring prohibitive computational costs.
- Limitations of Prior Work: Existing video distillation methods are largely direct adaptations of image distillation techniques (e.g., LCM, DMD2, ADD), and suffer from two fundamental limitations: (a) inability to distill large-scale (>10B) video models to a single step; and (b) poor generalization to conditional tasks, causing semantic degradation and condition-frame collapse in image-to-video (I2V) generation.
- Key Challenge: One-step adversarial distillation requires generating video directly from Gaussian noise, yet the distributional gap between generated and real videos is so large that the discriminator receives trivially easy gradients, leading to training instability. Existing methods (e.g., DMD2, ADD) train only in the high-SNR regime, mismatched with the low-SNR sampling distribution.
- Goal: Achieve high-quality one-step distillation of large-scale video models while maintaining video–image subject consistency in I2V tasks.
- Key Insight: The distillation process is decomposed into two phases—first narrowing the distributional gap via VSD (priming), then performing adversarial distillation on the better-aligned distribution (equilibrium)—while reusing generator parameters as the discriminator backbone to reduce memory overhead.
- Core Idea: The phased optimization strategy ensures that adversarial distillation is conducted after distribution alignment, avoiding the suboptimal optimization directions introduced by simultaneously mixing multiple losses.
## Method
### Overall Architecture
V-PAE consists of two sequential phases: (a) Stability Priming—applying variational score distillation (VSD) to reduce the distributional distance between generated and real videos; and (b) Unified Adversarial Equilibrium—reusing generator parameters as the discriminator backbone to achieve co-evolutionary adversarial equilibrium in Gaussian noise space. For conditional I2V tasks, a semantic discriminator head and a conditional SDS loss are additionally introduced to preserve video–image subject consistency.
### Key Designs
- Stability Priming (Phase I)
- Function: Reduces the distributional distance between the one-step generated video \(\hat{x}_0^\theta\) and real videos \(x \sim p_{\text{data}}\), providing a stable initialization for subsequent adversarial distillation.
- Mechanism: Three models are defined: a priming generator \(\mu^\theta\), a fixed real model \(\mu_{\text{real}}\), and a fake model \(\mu_{\text{fake}}^\phi\) that tracks the generated distribution via lightweight LoRA. The generator produces \(\hat{x}_0^\theta\) directly in the low-SNR interval \(t \in [0.9T, T]\); the noised output is then fed into both the real and fake models, and the score-gradient difference serves as a distribution-matching loss. The fake model uses zero-initialized LoRA adaptation to improve tracking stability on large-scale models (see the sketch after this list item).
- Design Motivation: Without priming, the poor quality of generated videos provides uninformative gradients to the discriminator, making adversarial training extremely unstable. Unlike DMD2, which simultaneously optimizes VSD and adversarial losses, V-PAE avoids suboptimal optimization directions through phased training.
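A minimal PyTorch sketch of the priming update described above, assuming `generator`, `mu_real`, and `mu_fake` are callables returning \(x_0\)-predictions and `add_noise` is a diffusers-style forward-noising helper; all names and shapes are illustrative, not the paper's released code.

```python
import torch
import torch.nn.functional as F

def priming_step(generator, mu_real, mu_fake, add_noise, eps, T):
    """One Phase I (stability priming) update -- a sketch, not the paper's code."""
    # One-step generation from the low-SNR endpoint, t in [0.9T, T].
    t_gen = torch.randint(int(0.9 * T), T + 1, (1,), device=eps.device)
    x0_hat = generator(eps, t_gen)                  # \hat{x}_0^theta

    # Re-noise the output at a random timestep and query both score models.
    t = torch.randint(0, T, (1,), device=eps.device)
    x_t = add_noise(x0_hat, torch.randn_like(x0_hat), t)
    with torch.no_grad():
        x0_real = mu_real(x_t, t)   # frozen pretrained model
        x0_fake = mu_fake(x_t, t)   # LoRA model tracking the generated distribution
    grad = x0_fake - x0_real        # distribution-matching gradient direction

    # Surrogate loss whose gradient w.r.t. x0_hat equals `grad` (up to scale).
    # mu_fake is updated separately with a denoising loss on generated
    # samples (omitted here).
    return F.mse_loss(x0_hat, (x0_hat - grad).detach())
```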
- Unified Adversarial Equilibrium (Phase II)
- Function: Building upon the primed generator, further improves one-step video generation quality via self-discriminator adversarial training.
- Mechanism: The generator \(\mu^\theta\) is reused as the discriminator backbone (self-discriminator), with a lightweight discriminator head \(d^\psi\) computing discrimination logits. The generator samples \(\hat{x}_0^\theta = f^\theta(\epsilon, T)\) directly from the noise endpoint \(\epsilon\); after noising, multi-layer features are extracted from the self-discriminator for discrimination. Adversarial training uses a hinge loss, with EMA weights \(\theta^-\) ensuring equilibrium stability. A spatiotemporal differential R1 regularization \(\mathcal{L}_{\text{STR1}}\) prevents gradient explosion, with spatial perturbation \(\sigma_s=0.01\) and temporal perturbation \(\sigma_t=0.1\) (see the sketch after this list item).
- Design Motivation: Conventional approaches require a separate discriminator backbone: freezing it causes parameter asymmetry, while full-parameter training runs out of memory on 14B models. Reusing the generator as the backbone enables efficient co-evolution within limited memory and is the only feasible strategy at the 14B scale.
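The adversarial losses can be sketched as below, assuming a hypothetical `features` extractor over the reused generator backbone and a `d_head` discriminator head; the exact form of \(\mathcal{L}_{\text{STR1}}\) (rendered here as a finite-difference penalty with per-pixel spatial noise and frame-level temporal noise) is an assumption on our part.

```python
import torch
import torch.nn.functional as F

def uae_d_losses(features, d_head, x_real, x_fake, sigma_s=0.01, sigma_t=0.1):
    """Discriminator-side losses. `features` extracts multi-layer features from
    the reused generator backbone; `d_head` maps them to a scalar logit."""
    logit_real = d_head(features(x_real))
    logit_fake = d_head(features(x_fake.detach()))

    # Hinge adversarial loss (L_UAE-D).
    loss_adv = F.relu(1.0 - logit_real).mean() + F.relu(1.0 + logit_fake).mean()

    # Spatiotemporal differential R1 (L_STR1), sketched as a finite-difference
    # penalty: per-pixel spatial noise (sigma_s) and frame-level temporal noise
    # (sigma_t, one value per frame -- an assumption about the exact form).
    b, c, f, _, _ = x_real.shape  # (batch, channels, frames, height, width)
    x_s = x_real + sigma_s * torch.randn_like(x_real)
    x_t = x_real + sigma_t * torch.randn(b, c, f, 1, 1, device=x_real.device)
    loss_str1 = (d_head(features(x_s)) - logit_real).pow(2).mean() \
              + (d_head(features(x_t)) - logit_real).pow(2).mean()
    return loss_adv, loss_str1

def uae_g_loss(features, d_head, x_fake):
    # Generator-side hinge loss (L_UAE-G): push fake logits up.
    return -d_head(features(x_fake)).mean()
```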
- Video–Image Subject Consistency Preservation
- Function: Addresses semantic degradation and condition-frame collapse in I2V distillation.
- Mechanism: (a) Semantic discriminator head: Learnable queries \(q\) are concatenated with condition-image embeddings and text embeddings, passed through self-attention, then cross-attended with multi-layer backbone features to enhance semantic awareness. (b) Conditional SDS loss: Leveraging the distributional stability of the pretrained model \(\mu_{\text{real}}\), the loss \(\mathcal{L}_{\text{C-SDS}} = \mathbb{E}[\|\hat{x}_0^\theta - f_{\text{real}}(\text{sg}(\hat{x}_{t'}^\theta), t')\|^2]\) minimizes the discrepancy between the condition frame and generated frames, preventing condition-frame collapse (the C-SDS term is sketched after this list item).
- Design Motivation: I2V is a primary application of video generation, but one-step distillation severely disrupts semantic consistency between the generated video and the input image. The multimodal semantic discriminator head compensates for the insufficient semantic awareness of purely adversarial training.
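A sketch of the conditional SDS term, following the \(\mathcal{L}_{\text{C-SDS}}\) formula above; `add_noise` and `mu_real.predict_x0` are assumed interfaces standing in for forward noising and the frozen model's \(x_0\)-prediction \(f_{\text{real}}\).

```python
import torch
import torch.nn.functional as F

def conditional_sds_loss(x0_hat, mu_real, add_noise, t_prime):
    """L_C-SDS sketch: re-noise the one-step output to t', denoise it with the
    frozen pretrained model, and match the generator output to that target."""
    noise = torch.randn_like(x0_hat)
    x_tp = add_noise(x0_hat.detach(), noise, t_prime)   # sg(\hat{x}_{t'}^theta)
    with torch.no_grad():
        x0_teacher = mu_real.predict_x0(x_tp, t_prime)  # f_real(., t')
    # Gradient flows only through x0_hat, matching the stop-gradient in the formula.
    return F.mse_loss(x0_hat, x0_teacher)
```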
### Loss & Training
- Phase I (500 steps): Learning rate \(1 \times 10^{-6}\); trained with VSD distribution-matching loss; fake model tracked via LoRA.
- Phase II (1000 steps): Learning rate \(2 \times 10^{-6}\); generator loss \(\mathcal{L}_G = \mathcal{L}_{\text{UAE-G}} + 10 \cdot \mathcal{L}_{\text{C-SDS}}\); discriminator head loss \(\mathcal{L}_D = \mathcal{L}_{\text{UAE-D}} + \mathcal{L}_{\text{STR1}}\); EMA decay rate 0.995 (see the sketch after this list).
- Data: Synthetic data (generated by Wan2.1-T2V-14B) combined with open-source data (Koala-36M, Intern4K).
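Putting the Phase II objectives together, a sketch of one optimization step with the stated weighting and EMA decay; optimizer setup is assumed, and the loss tensors are taken from the sketches above (in practice each loss would be computed just before its backward pass).

```python
import torch

@torch.no_grad()
def ema_update(ema_model, model, decay=0.995):
    # theta^- <- decay * theta^- + (1 - decay) * theta
    for p_ema, p in zip(ema_model.parameters(), model.parameters()):
        p_ema.mul_(decay).add_(p, alpha=1.0 - decay)

def phase2_step(opt_g, opt_d, generator, ema_generator,
                loss_uae_g, loss_c_sds, loss_uae_d, loss_str1, lam=10.0):
    """One Phase II step (generator lr = 2e-6 per the paper)."""
    # Generator update: L_G = L_UAE-G + 10 * L_C-SDS.
    opt_g.zero_grad()
    (loss_uae_g + lam * loss_c_sds).backward()
    opt_g.step()
    ema_update(ema_generator, generator)  # EMA decay 0.995 stabilizes equilibrium

    # Discriminator-head update: L_D = L_UAE-D + L_STR1.
    opt_d.zero_grad()
    (loss_uae_d + loss_str1).backward()
    opt_d.step()
```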
## Key Experimental Results
### Main Results
Comparison of distillation methods on VBench-I2V (base model: Wan2.1-I2V-14B):
| Method | Type | NFE | Semantic Alignment (SA) | Temporal Coherence (TC) | Frame Quality (FQ) | Latency (s) |
|---|---|---|---|---|---|---|
| Baseline | Euler | 100 | 92.90 | 80.82 | 70.44 | 890 |
| APT | AD | 1 | 84.87 | 75.21 | 64.69 | - |
| DMD2 | VSD | 1 | 83.15 | 71.67 | 62.47 | - |
| MD | VSD | 1 | 84.02 | 74.25 | 64.76 | - |
| V-PAE | AD | 1 | 91.54 | 79.56 | 68.66 | 9.37 |
| V-PAE | AD | 4 | 94.93 | 82.24 | 70.76 | - |
### Ablation Study
| Configuration | SA | TC | FQ | Notes |
|---|---|---|---|---|
| No priming + adversarial | - | - | - | Adversarial training unstable; large distributional bias |
| Consistency distillation priming (APT-style) | 80.05 | 67.96 | 56.55 | Moderate Phase I quality |
| V-PAE Phase I (VSD) | 84.92 | 72.34 | 59.75 | VSD priming superior |
| Frozen backbone discriminator | Lower | Lower | Lower | Parameter asymmetry degrades quality |
| Full-parameter discriminator | OOM | OOM | OOM | Insufficient memory for 14B model |
| Self-discriminator (V-PAE) | Best | Best | Best | Co-evolutionary equilibrium |
### Key Findings
- Stability priming is critical: removing Phase I causes severe adversarial training instability.
- V-PAE at 1-NFE approaches the 100-NFE baseline (gap of only 1.5%); at 4-NFE it surpasses the baseline by 3.3%.
- The self-discriminator is the only feasible and best-performing discriminator design at the 14B parameter scale.
- A conditioning strength of \(\lambda=10\) for the conditional SDS loss is highly effective in preventing condition-frame collapse in I2V; values that are too small result in inconsistency between the first frame and subsequent frames.
## Highlights & Insights
- Phased distillation paradigm: Narrowing the distributional gap before adversarial training is an intuitive and effective strategy, avoiding the suboptimal optimization directions observed in DMD2 when multiple losses are jointly optimized. This idea transfers naturally to other distillation tasks with large distributional gaps.
- Self-discriminator design: Reusing the generator as the discriminator backbone simultaneously reduces memory consumption and enables co-evolution, representing an elegant engineering solution for adversarial distillation at the scale of extremely large models.
- The 100× speedup (from ~15 minutes to ~10 seconds) carries significant practical value for real-time video generation applications.
## Limitations & Future Work
- Validation is limited to I2V tasks; generalization to T2V has not been sufficiently demonstrated.
- Training still requires large volumes of high-quality video data (synthetic + open-source), incurring non-trivial data preparation costs.
- A 1.5% quality gap remains between one-step generation and the 100-step model, which may be more pronounced in challenging scenarios (complex motion, multi-object interactions).
- Comparison with recent Flow Matching acceleration methods has not been conducted.
## Related Work & Insights
- vs. DMD2: DMD2 jointly optimizes VSD and adversarial losses, leading to suboptimal optimization directions; V-PAE avoids this issue through phased training.
- vs. APT: APT is effective only for small-scale models and short videos; V-PAE is the first to achieve one-step distillation of a 14B model.
- vs. ADD/SDXL-Turbo: ADD is restricted to 4 uniformly spaced timesteps and cannot achieve single-step generation; V-PAE's self-discriminator overcomes this limitation.
## Rating
- Novelty: ⭐⭐⭐⭐ The phased distillation and self-discriminator ideas are novel, though the core components (VSD, adversarial distillation) are combinations of existing techniques.
- Experimental Thoroughness: ⭐⭐⭐⭐ VBench-I2V evaluation is comprehensive and ablations are thorough, but T2V and cross-model validation are lacking.
- Writing Quality: ⭐⭐⭐⭐ Well-structured with well-motivated reasoning.
- Value: ⭐⭐⭐⭐⭐ The 100× speedup carries major practical significance for real-time video generation.