
V.I.P.: Iterative Online Preference Distillation for Efficient Video Diffusion Models

Info
Conference ICCV 2025
arXiv 2508.03254
Code Project Page
Area Video Generation · Diffusion Model Distillation · Preference Learning
Keywords knowledge distillation, DPO, pruning, video diffusion, preference learning

TL;DR

This paper proposes ReDPO, a loss that combines preference learning (DPO) with SFT regularization, and V.I.P., an iterative online preference distillation framework for distilling pruned video diffusion models. The distilled models match or surpass the full model's performance while reducing parameters by 36.2%–67.5%.

Background & Motivation

Efficiency Challenges in Video Diffusion Models

Text-to-video (T2V) models incur extremely high computational costs, making deployment on edge devices difficult. Pruning can reduce parameter counts but leads to performance degradation, necessitating knowledge distillation for recovery.

Limitations of Prior Work

Blind Imitation via SFT: Supervised fine-tuning (SFT) minimizes the L2 distance between student and teacher predictions. However, a student model with insufficient capacity cannot fully replicate the teacher's output, resulting in distribution averaging—the student produces "blurry" samples that do not exist in the teacher's distribution.

Mode Collapse: Capacity-constrained students under SFT tend to generate blurry outputs with weak motion dynamics.

Selective Degradation from Pruning: Certain attributes degrade after pruning while others may be preserved or even improve. SFT cannot exploit this characteristic.

Core Insight

The authors observe that pruned models can outperform the full model on certain dimensions. Consequently, distillation should not blindly imitate the teacher across the board, but instead selectively repair degraded attributes—a setting where preference learning (DPO) holds a natural advantage.

Method

Overall Architecture

V.I.P. = Video diffusion distillation via Iterative Preference learning

Workflow: \(M_0\) (full model) → pruning → \(M_1\) → evaluation + data curation + ReDPO training → \(M_1'\) → pruning again → \(M_2\) → iterate…

1. Pruning Strategy

Modules with the least impact are removed progressively rather than through a single large-scale pruning step:
  • Each block is removed individually and its impact is evaluated using VideoScore.
  • The block with the smallest effect on the overall score is selected for removal.
  • Attributes whose performance degrades upon removal are identified as recovery targets.
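
A minimal sketch of this selection loop, assuming hypothetical helpers `remove_block` (structural model surgery) and `videoscore_eval` (VideoScore over a validation prompt set, returning per-attribute scores):

```python
import copy

def prune_one_block(model, candidate_blocks, videoscore_eval):
    """Remove the single block whose removal hurts the average VideoScore least."""
    base_scores = videoscore_eval(model)          # dict: attribute -> score
    best = None                                   # (avg_drop, block_name, scores)

    for name in candidate_blocks:
        trial = copy.deepcopy(model)
        trial.remove_block(name)                  # hypothetical structural removal
        scores = videoscore_eval(trial)
        avg_drop = sum(base_scores[k] - scores[k] for k in scores) / len(scores)
        if best is None or avg_drop < best[0]:
            best = (avg_drop, name, scores)

    _, block_to_remove, pruned_scores = best
    model.remove_block(block_to_remove)

    # Attributes that degraded after removal become the recovery targets
    # for the subsequent ReDPO stage.
    recovery_targets = [k for k in pruned_scores if pruned_scores[k] < base_scores[k]]
    return model, recovery_targets
```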

2. Data Curation

Prompt Filtering: High-quality prompts of 5–25 words are retained; an LLM filters prompts to ensure relevance to the target attributes.

Video Filtering: Videos are generated by both the full model and the pruned model. VideoScore serves as a reward model to construct preference pairs:

\[S(v_{\text{full}}) > S(v_{\text{pruned}}) > \tau_p\]

This ensures that "winning" samples are of high quality while capturing the actual weaknesses of the pruned model.
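An illustrative construction of the preference pairs, assuming a hypothetical scalar reward function `videoscore(video)` and simple `generate` APIs on both models:

```python
def build_preference_pairs(prompts, full_model, pruned_model, videoscore, tau_p):
    pairs = []
    for prompt in prompts:
        v_full = full_model.generate(prompt)      # hypothetical generation API
        v_pruned = pruned_model.generate(prompt)
        s_full, s_pruned = videoscore(v_full), videoscore(v_pruned)
        # Keep only pairs where the full-model sample wins and the pruned
        # (losing) sample still clears the quality threshold tau_p,
        # per the inequality above.
        if s_full > s_pruned > tau_p:
            pairs.append({"prompt": prompt, "win": v_full, "lose": v_pruned})
    return pairs
```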

3. ReDPO: Regularized Diffusion Preference Optimization

The core contribution is to combine DPO with SFT in order to address DPO's over-optimization issue.

Diffusion DPO Loss:

\[L_{\text{diff-dpo}}(\theta) = -\mathbb{E} \left[ \log \sigma \left( -\beta T \omega(\lambda_t) \left( \|\epsilon^w - \epsilon_\theta(x_t^w, t)\|^2 - \|\epsilon^w - \epsilon_{\text{ref}}(x_t^w, t)\|^2 - (\|\epsilon^l - \epsilon_\theta(x_t^l, t)\|^2 - \|\epsilon^l - \epsilon_{\text{ref}}(x_t^l, t)\|^2) \right) \right) \right]\]

SFT Regularization Term (to prevent over-optimization):

\[L_{\text{SFT}}(\theta) = \|\epsilon_\theta(x_t^w, t) - \epsilon_{\text{ref}}(x_t^w, t)\|^2\]

Final ReDPO Loss:

\[L_{\text{ReDPO}}(\theta) = L_{\text{diff-dpo}}(\theta) + w_{\text{SFT}} \cdot L_{\text{SFT}}(\theta)\]
  • DPO guides the student to selectively repair degraded attributes.
  • SFT constrains preference probabilities, preventing DPO over-optimization from causing a decline in absolute quality.
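
A minimal PyTorch sketch of the ReDPO objective combining the two terms above. Tensor shapes are simplified, and `beta_T` (the product \(\beta T\)) and `omega` (the weight \(\omega(\lambda_t)\)) are passed in as precomputed scalars; this is an illustration under those assumptions, not the authors' reference implementation:

```python
import torch
import torch.nn.functional as F

def redpo_loss(eps_theta_w, eps_theta_l,   # student predictions on x_t^w, x_t^l
               eps_ref_w, eps_ref_l,       # frozen reference (teacher) predictions
               eps_w, eps_l,               # true noise added to win / lose samples
               beta_T, omega, w_sft):
    # Per-sample squared error, summed over all non-batch dimensions.
    def sq_err(a, b):
        return ((a - b) ** 2).flatten(1).sum(dim=1)

    # Diffusion-DPO term: contrast the student-vs-reference error gaps
    # on the winning and losing samples.
    win_gap  = sq_err(eps_w, eps_theta_w) - sq_err(eps_w, eps_ref_w)
    lose_gap = sq_err(eps_l, eps_theta_l) - sq_err(eps_l, eps_ref_l)
    dpo = -F.logsigmoid(-beta_T * omega * (win_gap - lose_gap)).mean()

    # SFT regularizer: keep the student close to the reference on winning
    # samples, curbing DPO over-optimization.
    sft = sq_err(eps_theta_w, eps_ref_w).mean()

    return dpo + w_sft * sft
```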

4. V.I.P. Iterative Online Distillation

Key design choices:
  • Fixed teacher: always the initial full model \(M_0\).
  • Iteratively updated student: after each pruning round, weaknesses are re-evaluated, data is re-curated, and training is repeated.
  • Online advantage: unlike offline DPO, losing samples at each stage are generated by the most recently pruned model, ensuring training data remains aligned with the current policy.
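
Putting the pieces together, a high-level sketch of the V.I.P. loop; it reuses the helpers sketched earlier, and `filter_prompts`, `train_redpo`, and `student.block_names()` are additional hypothetical placeholders:

```python
import copy

def vip_distill(m0, prompts, videoscore, videoscore_eval, tau_p, num_stages):
    teacher = m0                          # fixed teacher: the initial full model M_0
    student = copy.deepcopy(m0)

    for stage in range(num_stages):
        # 1. Prune one block and identify which attributes degraded.
        student, targets = prune_one_block(student, student.block_names(),
                                           videoscore_eval)

        # 2. Curate on-policy data: losing samples come from the *current* student.
        stage_prompts = filter_prompts(prompts, targets)
        pairs = build_preference_pairs(stage_prompts, teacher, student,
                                       videoscore, tau_p)

        # 3. Recover the degraded attributes with ReDPO against the fixed teacher.
        student = train_redpo(student, teacher, pairs)

    return student
```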

Key Experimental Results

Main Results: V.I.P. on Two Baselines (VideoScore Evaluation)

| Model | Visual Quality | Temporal Consist. | Dynamic Degree | Text Align. | Average | Parameters |
|---|---|---|---|---|---|---|
| VideoCrafter2 (full) | 2.627 | 2.602 | 2.728 | 2.491 | 2.613 | 1.413B |
| VC2 pruned (Stage 2) | 2.627 | 2.595 | 2.725 | 2.486 | 2.608 | 0.902B |
| VC2 + ReDPO (Stage 2) | 2.629 | 2.617 | 2.728 | 2.518 | 2.623 (+0.010) | 0.902B |
| AnimateDiff (full) | 2.575 | 2.505 | 2.684 | 2.486 | 2.563 | 0.453B |
| AD pruned (Stage 3) | 2.552 | 2.469 | 2.736 | 2.505 | 2.566 | 0.147B |
| AD + ReDPO (Stage 3) | 2.569 | 2.513 | 2.695 | 2.496 | 2.568 (+0.005) | 0.147B |

Key findings:
  • VideoCrafter2 achieves a 36.2% parameter reduction (1.413B→0.902B) with a 21% FLOPs reduction, matching or exceeding the full model on all metrics.
  • AnimateDiff achieves a 67.5% parameter reduction (0.453B→0.147B) while maintaining performance comparable to the full model.

ReDPO vs. SFT Comparison

| Model | Method | Visual Quality | Temporal Consist. | Dynamic Degree | Text Align. |
|---|---|---|---|---|---|
| VC2 | SFT | 2.628 | 2.613 | 2.724 | 2.505 |
| VC2 | ReDPO | 2.629 | 2.617 | 2.728 | 2.518 |
| AD | SFT | 2.564 | 2.515 | 2.679 | 2.477 |
| AD | ReDPO | 2.569 | 2.513 | 2.695 | 2.496 |

ReDPO outperforms SFT on nearly all metrics, whereas distribution averaging under SFT leads to blurry outputs and weaker text alignment.

Ablation Study

| Ablation | Visual Quality | Temporal Consist. | Dynamic Degree | Text Align. |
|---|---|---|---|---|
| w/o SFT (DPO only) | 2.625 | 2.583 | 2.729 | 2.471 |
| w/o online (offline) | 2.626 | 2.603 | 2.719 | 2.483 |
| V.I.P. (full) | 2.629 | 2.617 | 2.728 | 2.518 |
  • Removing SFT: temporal consistency drops significantly (DPO over-optimization).
  • Removing online iteration: performance declines across the board (one-shot pruning incurs excessive loss).

User Study

V.I.P. is substantially preferred over SFT and the full model in overall preference ratings, demonstrating that ReDPO effectively aligns with human preferences.

Highlights & Insights

  1. First application of preference learning to diffusion model distillation: Moving beyond the SFT paradigm, DPO's contrastive learning enables the student model to selectively allocate its limited capacity.
  2. Elegant use of SFT as a regularizer: The method avoids pure SFT (which causes averaging) and pure DPO (which causes over-optimization), leveraging the complementary strengths of both.
  3. Online iteration vs. one-shot pruning: Progressive pruning allows the model to adapt and recover at each step.
  4. The observation that some attributes improve after pruning forms the theoretical foundation of the method—capacity should not be wasted imitating attributes that are already well-preserved.

Limitations & Future Work

  • Each stage requires video generation and evaluation with VideoScore, making the training pipeline relatively complex.
  • Reliance on a single reward model (VideoScore) may introduce evaluation bias.
  • The dynamic degree score sometimes declines when temporal consistency improves, reflecting an inherent quality–motion trade-off.
  • Experiments are conducted only on AnimateDiff and VideoCrafter2, without validation on larger-scale video generation models.

Related Work

  • Video diffusion distillation: BK-SDM feature distillation, adversarial loss distillation, etc.
  • Preference alignment: DPO, PPO, VideoDPO, etc.
  • Model pruning: Structured pruning methods including channel pruning and block pruning.
  • Online learning: On-policy DPO, self-play, etc.

Rating

| Dimension | Score |
|---|---|
| Novelty | ⭐⭐⭐⭐⭐ |
| Effectiveness | ⭐⭐⭐⭐ |
| Writing Quality | ⭐⭐⭐⭐ |
| Value | ⭐⭐⭐⭐ |
| Overall Recommendation | ⭐⭐⭐⭐ |