Dual-Expert Consistency Model for Efficient and High-Quality Video Generation¶
Conference: ICCV 2025 arXiv: 2506.03123 Code: GitHub Area: Video Generation Keywords: Consistency Distillation, Video Generation Acceleration, Dual-Expert Model, Temporal Coherence Loss, GAN Distillation
TL;DR¶
This paper analyzes the optimization conflict between high- and low-noise levels in consistency model distillation, and proposes a parameter-efficient Dual-Expert Consistency Model (DCM). A semantic expert handles layout and motion while a detail expert handles fine-grained details, complemented by a temporal coherence loss and GAN with feature matching loss. On HunyuanVideo (13B), DCM achieves 4-step sampling quality approaching the 50-step baseline.
Background & Motivation¶
- Background: Consistency Distillation (CD) is a mainstream diffusion model acceleration approach that trains a student model to map any point on the ODE trajectory to the same endpoint. Methods such as LCM and PCM have seen broad adoption.
- Limitations of Prior Work: Directly applying consistency distillation to video diffusion models leads to severe temporal coherence degradation and loss of appearance detail. High-noise samples primarily learn semantic layout and motion, while low-noise samples refine fine details; however, the gradient magnitudes and loss contributions of these two regimes differ substantially, causing joint optimization to converge to suboptimal solutions.
- Key Challenge: A single student model has limited capacity, and simultaneously learning semantic layout synthesis and fine-grained detail generation introduces optimization interference. Visualizations reveal significant differences in loss values and gradient norms between high- and low-noise samples during distillation.
- Goal: Decouple the optimization of the semantic learning stage from the detail learning stage while remaining parameter-efficient.
- Key Insight: Train two expert denoisers to handle two sub-trajectories of the ODE trajectory separately; after validating that the combination outperforms a single model, design a parameter-sharing scheme guided by parameter difference analysis.
- Core Idea: Partition the ODE trajectory into semantic and detail segments, and implement decoupled distillation via a parameter-efficient scheme comprising a semantic expert and a LoRA-based detail expert.
Method¶
Overall Architecture¶
The teacher model's 50-step ODE trajectory is divided at \(t_\kappa\) (\(\kappa=37\)) into two segments: a semantic synthesis phase \(\{x_{t_i}\}_{i=\kappa}^N\) and a detail refinement phase \(\{x_{t_j}\}_{j=0}^\kappa\). Training proceeds in two stages: the semantic expert SemE is first trained on the high-noise sub-trajectory; SemE is then frozen and a LoRA-augmented detail expert DetE is trained on the low-noise sub-trajectory. During inference, experts are switched dynamically based on the current sampling stage.
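The stage-dependent expert switching described above can be sketched as a simple routing rule over the teacher's timestep indices. This is an illustrative sketch, not the paper's code; the 4-step schedule shown is a hypothetical example of evenly spaced steps over the 50-step trajectory.

```python
KAPPA = 37            # boundary index t_kappa from the paper
N_TEACHER_STEPS = 50  # teacher ODE trajectory length

def select_expert(step_index: int) -> str:
    """Route a sampling step to the appropriate expert.

    Steps in the high-noise segment [t_kappa, t_N] use the semantic
    expert SemE; steps below t_kappa use the detail expert DetE.
    """
    return "SemE" if step_index >= KAPPA else "DetE"

# Hypothetical 4-step schedule over the teacher trajectory
steps = [49, 37, 24, 12]
print([select_expert(s) for s in steps])  # ['SemE', 'SemE', 'DetE', 'DetE']
```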
Key Designs¶
- Parameter-Efficient Dual-Expert Distillation:
- Function: Decouple semantic and detail learning with minimal additional parameters.
- Mechanism: Parameter differences between two independently trained experts are analyzed, revealing that the primary differences lie in (1) the timestep-dependent embedding layers \(\Psi\) and (2) the linear layers \(\Lambda\) within attention blocks. Stage 1 therefore trains SemE with full parameters on \([t_\kappa, t_N]\); Stage 2 initializes from SemE, freezes the backbone, and adds only new timestep-dependent embedding layers \(\Psi\) and attention-block LoRA \(\Lambda^\dagger\), training on \([t_0, t_\kappa]\).
- Design Motivation: Training two complete models doubles parameter count and memory; parameter difference analysis shows that the majority of weights can be shared. LoRA (\(\Lambda^\dagger\)) applied solely to attention linear layers is sufficient to capture the variation required for the detail stage.
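The LoRA scheme for the attention linear layers (\(\Lambda^\dagger\)) can be sketched as follows. This is a minimal numpy illustration of the standard LoRA parameterization, assuming the usual zero-initialized up-projection; the rank and scaling values are placeholders, not the paper's hyperparameters.

```python
import numpy as np

class LoRALinear:
    """Frozen SemE linear layer plus a trainable low-rank update.

    Only A and B would be trained in Stage 2; W stays frozen, so the
    detail expert adds just rank * (d_in + d_out) parameters per layer.
    """
    def __init__(self, W: np.ndarray, rank: int = 16, alpha: float = 16.0, seed: int = 0):
        rng = np.random.default_rng(seed)
        d_out, d_in = W.shape
        self.W = W                                    # frozen SemE weight
        self.A = rng.normal(0.0, 0.02, (rank, d_in))  # trainable down-projection
        self.B = np.zeros((d_out, rank))              # trainable up-projection (zero init)
        self.scale = alpha / rank

    def __call__(self, x: np.ndarray) -> np.ndarray:
        # y = W x + (alpha/r) * B A x; at initialization B = 0,
        # so the output exactly matches the frozen SemE layer.
        return x @ self.W.T + self.scale * (x @ self.A.T) @ self.B.T
```

Because `B` starts at zero, DetE initially reproduces SemE exactly and only drifts where the low-noise sub-trajectory requires it.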
- Temporal Coherence Loss:
- Function: Enhance motion consistency in videos generated by the semantic expert SemE.
- Mechanism: SemE is encouraged to attend to inter-frame variations and motion at corresponding spatial positions. Inter-frame difference consistency is defined as: \(\mathcal{L}_{TC} = \|(x_{l:L}^{t_\kappa} - x_{0:L-l}^{t_\kappa}) - (\hat{x}_{l:L}^{t_\kappa} - \hat{x}_{0:L-l}^{t_\kappa})\|_2^2\), where \(x_{l:L}^{t_\kappa}\) denotes frames \(l\) through \(L\) of the target video latent at \(t_\kappa\), \(\hat{x}\) is the corresponding student prediction, and \(l\) is the temporal offset between compared frames.
- Design Motivation: The semantic stage primarily establishes motion and spatial layout; the temporal coherence loss explicitly encourages SemE to maintain consistent inter-frame motion and spatial relationships.
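The \(\mathcal{L}_{TC}\) formula above translates directly into code. A minimal numpy sketch, assuming latents of shape (frames, channels, height, width) and using the mean squared difference as the norm (the paper's reduction is not specified):

```python
import numpy as np

def temporal_coherence_loss(x: np.ndarray, x_hat: np.ndarray, l: int = 1) -> float:
    """L_TC: match inter-frame differences of prediction x_hat to target x.

    x, x_hat: video latents at t_kappa, shape (L, C, H, W).
    l: temporal offset between compared frames.
    """
    diff_target = x[l:] - x[:-l]          # x_{l:L} - x_{0:L-l}
    diff_pred = x_hat[l:] - x_hat[:-l]    # same differences for the prediction
    return float(np.mean((diff_target - diff_pred) ** 2))
```

Note that a prediction shifted by a constant offset incurs zero penalty: only mismatched *motion* (inter-frame change) is punished, which is exactly the semantic-stage signal the loss targets.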
- GAN + Feature Matching Loss:
- Function: Enhance fine-grained synthesis quality of the detail expert DetE.
- Mechanism: The frozen teacher model serves as a feature extraction backbone \(\Omega\). The outputs of the student and EMA student are forward-diffused to obtain fake and real samples, respectively; intermediate features are extracted to compute GAN and feature matching losses: \(\mathcal{L}_{FM} = \|\Omega(x_{fake}) - \Omega(x_{real})\|_2^2\). The discriminator head \(f_D\) and DetE are optimized alternately.
- Design Motivation: GAN losses have been validated for high-quality detail synthesis in distribution-matching distillation. Feature matching loss stabilizes GAN training and provides additional intermediate feature supervision.
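The feature matching term \(\mathcal{L}_{FM}\) and the adversarial terms can be sketched as below. The feature matching loss follows the formula in the text; the hinge discriminator loss and non-saturating generator loss are common choices assumed here for illustration, since the paper excerpt does not specify the exact GAN objective.

```python
import numpy as np

def feature_matching_loss(features_fake, features_real) -> float:
    """L_FM: distance between teacher-backbone (Omega) features of
    fake (student) and real (EMA student) samples, summed over layers."""
    return float(sum(np.mean((f - r) ** 2)
                     for f, r in zip(features_fake, features_real)))

def hinge_d_loss(d_real: np.ndarray, d_fake: np.ndarray) -> float:
    """Hinge loss for the discriminator head f_D (assumed objective)."""
    return float(np.mean(np.maximum(0.0, 1.0 - d_real))
                 + np.mean(np.maximum(0.0, 1.0 + d_fake)))

def g_loss(d_fake: np.ndarray) -> float:
    """Adversarial term L_G for the detail expert DetE (assumed form)."""
    return float(-np.mean(d_fake))
```

In the alternating scheme, `hinge_d_loss` updates only \(f_D\), while `g_loss + feature_matching_loss` joins the consistency loss to update DetE.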
Loss & Training¶
- SemE: consistency loss \(\mathcal{L}_{SemE}\) + temporal coherence loss \(\mathcal{L}_{TC}\)
- DetE: consistency loss \(\mathcal{L}_{DetE}\) + GAN loss \(\mathcal{L}_G\) + feature matching loss \(\mathcal{L}_{FM}\) + discriminator loss \(\mathcal{L}_D\)
- Training is conducted on 24 A100 GPUs; each expert on HunyuanVideo is trained for 1,000 iterations with learning rates of 1e-6 / 5e-6.
Key Experimental Results¶
Main Results¶
| Method | Steps | Latency (s) | VBench Total | Quality | Semantic |
|---|---|---|---|---|---|
| HunyuanVideo (Teacher) | 50 | 1504.5 | 83.87 | 85.00 | 79.34 |
| LCM | 4 | 120.68 | 80.33 | 80.83 | 78.32 |
| PCM | 4 | 120.89 | 80.93 | 81.94 | 76.90 |
| DCM (Ours) | 4 | 121.52 | 83.83 | 85.12 | 78.67 |
| DCM (Ours) | 8 | 244.72 | 83.86 | 85.00 | 79.32 |
User preference study: DCM vs. LCM: 82.67% prefer DCM; DCM vs. PCM: 77.33% prefer DCM.
Ablation Study¶
| Configuration | VBench Total | Quality | Semantic |
|---|---|---|---|
| VCM (baseline consistency model) | 80.30 | 80.74 | 78.36 |
| + OD (trajectory decoupling) | 83.08 | 84.20 | 78.59 |
| + OD + PE (parameter-efficient) | 83.03 | 84.16 | 78.53 |
| + OD + PE + TC (temporal coherence) | 83.42 | 84.63 | — |
| Full DCM | 83.83 | 85.12 | 78.67 |
Key Findings¶
- Trajectory decoupling (OD) accounts for the largest performance gain, confirming the central hypothesis of optimization conflict between high- and low-noise regimes.
- The parameter-efficient scheme (PE) sacrifices only 0.05 VBench points while substantially reducing parameter count.
- The SemE+DetE combination markedly outperforms VCM on both semantic and detail metrics.
- 4-step DCM nearly recovers the 50-step teacher quality (83.83 vs. 83.87) at a 12.4× speedup.
- Consistent effectiveness is also demonstrated on CogVideoX: 4-step DCM (79.99) vs. CogVideoX 50-step (80.59).
Highlights & Insights¶
- In-depth analysis of distillation training dynamics: visualization of loss and gradient norm differences between high- and low-noise samples provides strong motivation for the dual-expert design.
- Precise conclusions from parameter difference analysis—differences concentrated in embedding layers and attention linear layers—motivate the LoRA-based scheme.
- First successful application of consistency distillation to a model at the HunyuanVideo scale (13B parameters).
- Expert-specific optimization objectives (TC Loss for SemE, GAN for DetE) reflect a deep understanding of the distinct learning requirements of each stage.
Limitations & Future Work¶
- The choice of boundary point \(t_\kappa\) (\(\kappa=37\)) is heuristic; adaptive boundary selection warrants exploration.
- Dual-expert inference requires two sets of embedding layers and LoRA, adding inference complexity despite the small parameter overhead.
- Evaluation is limited to 4/8 steps; more aggressive 1–2 step distillation remains unexplored.
- The scalability of GAN loss stability to larger models requires further validation.
Related Work & Insights¶
- vs. LCM: A single-model consistency distillation approach cannot resolve the optimization conflict between high- and low-noise regimes.
- vs. PCM: PCM segments the trajectory but still employs a single model to learn all segments.
- vs. Hyper-SD: Integrates segmented trajectory consistency distillation with DMD but does not decouple the stages for video-specific scenarios.
- vs. Seaweed-APT: Applies one-step adversarial post-training on real data, but is limited to 2-second videos.
Rating¶
- Novelty: ⭐⭐⭐⭐ The dual-expert scheme is motivated by training dynamics analysis, and the parameter-efficient design demonstrates insightful reasoning.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Two backbone models (HunyuanVideo + CogVideoX), VBench evaluation, user studies, and comprehensive ablations.
- Writing Quality: ⭐⭐⭐⭐⭐ Clear logical flow progressing from observation to hypothesis to validation to design.
- Value: ⭐⭐⭐⭐⭐ First demonstration of high-quality 4-step distillation on a 13B-parameter video model; extremely high practical value.