
V.I.P.: Iterative Online Preference Distillation for Efficient Video Diffusion Models

Info
Conference ICCV 2025
arXiv 2508.03254
Code Project Page
Area Video Generation · Diffusion Model Distillation · Preference Learning
Keywords knowledge distillation, DPO, pruning, video diffusion, preference learning

TL;DR

This paper proposes ReDPO, a loss that combines preference learning (DPO) with SFT regularization, and V.I.P., an iterative online preference distillation framework for distilling pruned video diffusion models. The distilled models match or surpass the full model's performance while reducing parameters by 36.2%–67.5%.

Background & Motivation

Efficiency Challenges in Video Diffusion Models

Text-to-video (T2V) models incur extremely high computational costs, making deployment on edge devices difficult. Pruning can reduce parameter counts but leads to performance degradation, necessitating knowledge distillation for recovery.

Limitations of Prior Work

Blind Imitation via SFT: Supervised fine-tuning (SFT) minimizes the L2 distance between student and teacher predictions. However, a student model with insufficient capacity cannot fully replicate the teacher's output, resulting in distribution averaging—the student produces "blurry" samples that do not exist in the teacher's distribution.

Mode Collapse: Capacity-constrained students under SFT tend to generate blurry outputs with weak motion dynamics.

Selective Degradation from Pruning: Certain attributes degrade after pruning while others may be preserved or even improve. SFT cannot exploit this characteristic.

Core Insight

The authors observe that pruned models can outperform the full model on certain dimensions. Consequently, distillation should not blindly imitate the teacher across the board, but instead selectively repair degraded attributes—a setting where preference learning (DPO) holds a natural advantage.

Method

Overall Architecture

V.I.P. = Video diffusion distillation via Iterative Preference learning

Workflow: \(M_0\) (full model) → pruning → \(M_1\) → evaluation + data curation + ReDPO training → \(M_1'\) → pruning again → \(M_2\) → iterate…

1. Pruning Strategy

Modules with the least impact are removed progressively rather than through a single large-scale pruning step:
  • Each block is removed individually and its impact is evaluated using VideoScore.
  • The block with the smallest effect on the overall score is selected for removal.
  • Attributes whose performance degrades upon removal are identified as recovery targets.
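
A minimal sketch of this selection loop, assuming hypothetical helpers `remove_block` (structural model surgery) and `videoscore_eval` (VideoScore over a validation prompt set, returning per-attribute scores):

```python
import copy

def prune_one_block(model, candidate_blocks, videoscore_eval):
    """Remove the single block whose removal hurts the average VideoScore least."""
    base_scores = videoscore_eval(model)          # dict: attribute -> score
    best = None                                   # (avg_drop, block_name, scores)

    for name in candidate_blocks:
        trial = copy.deepcopy(model)
        trial.remove_block(name)                  # hypothetical structural removal
        scores = videoscore_eval(trial)
        avg_drop = sum(base_scores[k] - scores[k] for k in scores) / len(scores)
        if best is None or avg_drop < best[0]:
            best = (avg_drop, name, scores)

    _, block_to_remove, pruned_scores = best
    model.remove_block(block_to_remove)

    # Attributes that degraded after removal become the recovery targets
    # for the subsequent ReDPO stage.
    recovery_targets = [k for k in pruned_scores if pruned_scores[k] < base_scores[k]]
    return model, recovery_targets
```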

2. Data Curation

Prompt Filtering: High-quality prompts of 5–25 words are retained; an LLM filters prompts to ensure relevance to the target attributes.

Video Filtering: Videos are generated by both the full model and the pruned model. VideoScore serves as a reward model to construct preference pairs:

\[S(v_{\text{full}}) > S(v_{\text{pruned}}) > \tau_p\]

This ensures that "winning" samples are of high quality while capturing the actual weaknesses of the pruned model.
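An illustrative construction of the preference pairs, assuming a hypothetical scalar reward function `videoscore(video)` and simple `generate` APIs on both models:

```python
def build_preference_pairs(prompts, full_model, pruned_model, videoscore, tau_p):
    pairs = []
    for prompt in prompts:
        v_full = full_model.generate(prompt)      # hypothetical generation API
        v_pruned = pruned_model.generate(prompt)
        s_full, s_pruned = videoscore(v_full), videoscore(v_pruned)
        # Keep only pairs where the full-model sample wins and the pruned
        # (losing) sample still clears the quality threshold tau_p,
        # per the inequality above.
        if s_full > s_pruned > tau_p:
            pairs.append({"prompt": prompt, "win": v_full, "lose": v_pruned})
    return pairs
```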

3. ReDPO: Regularized Diffusion Preference Optimization

The core contribution is to combine DPO with SFT in order to address DPO's over-optimization issue.

Diffusion DPO Loss:

\[L_{\text{diff-dpo}}(\theta) = -\mathbb{E} \left[ \log \sigma \left( -\beta T \omega(\lambda_t) \left( \|\epsilon^w - \epsilon_\theta(x_t^w, t)\|^2 - \|\epsilon^w - \epsilon_{\text{ref}}(x_t^w, t)\|^2 - (\|\epsilon^l - \epsilon_\theta(x_t^l, t)\|^2 - \|\epsilon^l - \epsilon_{\text{ref}}(x_t^l, t)\|^2) \right) \right) \right]\]

SFT Regularization Term (to prevent over-optimization):

\[L_{\text{SFT}}(\theta) = \|\epsilon_\theta(x_t^w, t) - \epsilon_{\text{ref}}(x_t^w, t)\|^2\]

Final ReDPO Loss:

\[L_{\text{ReDPO}}(\theta) = L_{\text{diff-dpo}}(\theta) + w_{\text{SFT}} \cdot L_{\text{SFT}}(\theta)\]
  • DPO guides the student to selectively repair degraded attributes.
  • SFT constrains preference probabilities, preventing DPO over-optimization from causing a decline in absolute quality.
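
A minimal PyTorch sketch of the ReDPO objective combining the two terms above. Tensor shapes are simplified, and `beta_T` (the product \(\beta T\)) and `omega` (the weight \(\omega(\lambda_t)\)) are passed in as precomputed scalars; this is an illustration under those assumptions, not the authors' reference implementation:

```python
import torch
import torch.nn.functional as F

def redpo_loss(eps_theta_w, eps_theta_l,   # student predictions on x_t^w, x_t^l
               eps_ref_w, eps_ref_l,       # frozen reference (teacher) predictions
               eps_w, eps_l,               # true noise added to win / lose samples
               beta_T, omega, w_sft):
    # Per-sample squared error, summed over all non-batch dimensions.
    def sq_err(a, b):
        return ((a - b) ** 2).flatten(1).sum(dim=1)

    # Diffusion-DPO term: contrast the student-vs-reference error gaps
    # on the winning and losing samples.
    win_gap  = sq_err(eps_w, eps_theta_w) - sq_err(eps_w, eps_ref_w)
    lose_gap = sq_err(eps_l, eps_theta_l) - sq_err(eps_l, eps_ref_l)
    dpo = -F.logsigmoid(-beta_T * omega * (win_gap - lose_gap)).mean()

    # SFT regularizer: keep the student close to the reference on winning
    # samples, curbing DPO over-optimization.
    sft = sq_err(eps_theta_w, eps_ref_w).mean()

    return dpo + w_sft * sft
```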

4. V.I.P. Iterative Online Distillation

Key design choices:
  • Fixed teacher: always the initial full model \(M_0\).
  • Iteratively updated student: after each pruning round, weaknesses are re-evaluated, data is re-curated, and training is repeated.
  • Online advantage: unlike offline DPO, losing samples at each stage are generated by the most recently pruned model, ensuring training data remains aligned with the current policy.
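
Putting the pieces together, a high-level sketch of the V.I.P. loop; it reuses the helpers sketched earlier, and `filter_prompts`, `train_redpo`, and `student.block_names()` are additional hypothetical placeholders:

```python
import copy

def vip_distill(m0, prompts, videoscore, videoscore_eval, tau_p, num_stages):
    teacher = m0                          # fixed teacher: the initial full model M_0
    student = copy.deepcopy(m0)

    for stage in range(num_stages):
        # 1. Prune one block and identify which attributes degraded.
        student, targets = prune_one_block(student, student.block_names(),
                                           videoscore_eval)

        # 2. Curate on-policy data: losing samples come from the *current* student.
        stage_prompts = filter_prompts(prompts, targets)
        pairs = build_preference_pairs(stage_prompts, teacher, student,
                                       videoscore, tau_p)

        # 3. Recover the degraded attributes with ReDPO against the fixed teacher.
        student = train_redpo(student, teacher, pairs)

    return student
```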

Key Experimental Results

Main Results: V.I.P. on Two Baselines (VideoScore Evaluation)

| Model | Visual Quality | Temporal Consist. | Dynamic Degree | Text Align. | Average | Parameters |
|---|---|---|---|---|---|---|
| VideoCrafter2 (full) | 2.627 | 2.602 | 2.728 | 2.491 | 2.613 | 1.413B |
| VC2 pruned (Stage 2) | 2.627 | 2.595 | 2.725 | 2.486 | 2.608 | 0.902B |
| VC2 + ReDPO (Stage 2) | 2.629 | 2.617 | 2.728 | 2.518 | 2.623 (+0.010) | 0.902B |
| AnimateDiff (full) | 2.575 | 2.505 | 2.684 | 2.486 | 2.563 | 0.453B |
| AD pruned (Stage 3) | 2.552 | 2.469 | 2.736 | 2.505 | 2.566 | 0.147B |
| AD + ReDPO (Stage 3) | 2.569 | 2.513 | 2.695 | 2.496 | 2.568 (+0.005) | 0.147B |

Key findings:
  • VideoCrafter2 achieves a 36.2% parameter reduction (1.413B→0.902B) with a 21% FLOPs reduction, matching or exceeding the full model on all metrics.
  • AnimateDiff achieves a 67.5% parameter reduction (0.453B→0.147B) while maintaining performance comparable to the full model.

ReDPO vs. SFT Comparison

| Model | Method | Visual Quality | Temporal Consist. | Dynamic Degree | Text Align. |
|---|---|---|---|---|---|
| VC2 | SFT | 2.628 | 2.613 | 2.724 | 2.505 |
| VC2 | ReDPO | 2.629 | 2.617 | 2.728 | 2.518 |
| AD | SFT | 2.564 | 2.515 | 2.679 | 2.477 |
| AD | ReDPO | 2.569 | 2.513 | 2.695 | 2.496 |

ReDPO outperforms SFT on nearly all metrics, whereas distribution averaging under SFT leads to blurry outputs and weaker text alignment.

Ablation Study

| Ablation | Visual Quality | Temporal Consist. | Dynamic Degree | Text Align. |
|---|---|---|---|---|
| w/o SFT (DPO only) | 2.625 | 2.583 | 2.729 | 2.471 |
| w/o online (offline) | 2.626 | 2.603 | 2.719 | 2.483 |
| V.I.P. (full) | 2.629 | 2.617 | 2.728 | 2.518 |
  • Removing SFT: temporal consistency drops significantly (DPO over-optimization).
  • Removing online iteration: performance declines across the board (one-shot pruning incurs excessive loss).

User Study

V.I.P. is substantially preferred over SFT and the full model in overall preference ratings, demonstrating that ReDPO effectively aligns with human preferences.

Highlights & Insights

  1. First application of preference learning to diffusion model distillation: Moving beyond the SFT paradigm, DPO's contrastive learning enables the student model to selectively allocate its limited capacity.
  2. Elegant use of SFT as a regularizer: The method avoids pure SFT (which causes averaging) and pure DPO (which causes over-optimization), leveraging the complementary strengths of both.
  3. Online iteration vs. one-shot pruning: Progressive pruning allows the model to adapt and recover at each step.
  4. The observation that some attributes improve after pruning forms the theoretical foundation of the method—capacity should not be wasted imitating attributes that are already well-preserved.

Limitations & Future Work

  • Each stage requires video generation and evaluation with VideoScore, making the training pipeline relatively complex.
  • Reliance on a single reward model (VideoScore) may introduce evaluation bias.
  • The dynamic degree score sometimes declines when temporal consistency improves, reflecting an inherent quality–motion trade-off.
  • Experiments are conducted only on AnimateDiff and VideoCrafter2, without validation on larger-scale video generation models.

Related Work

  • Video diffusion distillation: BK-SDM feature distillation, adversarial loss distillation, etc.
  • Preference alignment: DPO, PPO, VideoDPO, etc.
  • Model pruning: Structured pruning methods including channel pruning and block pruning.
  • Online learning: On-policy DPO, self-play, etc.

Rating

| Dimension | Score |
|---|---|
| Novelty | ⭐⭐⭐⭐⭐ |
| Effectiveness | ⭐⭐⭐⭐ |
| Writing Quality | ⭐⭐⭐⭐ |
| Value | ⭐⭐⭐⭐ |
| Overall Recommendation | ⭐⭐⭐⭐ |