
Prompt-A-Video: Prompt Your Video Diffusion Model via Preference-Aligned LLM

Metadata

  • Conference: ICCV 2025
  • arXiv: 2412.15156
  • Code: GitHub
  • Area: Video Generation
  • Keywords: Video Generation, Prompt Enhancement, LLM, DPO, Reward-Guided Evolution

TL;DR

Prompt-A-Video automatically constructs training data via a reward-guided prompt evolution pipeline, then optimizes an LLM through two-stage SFT and DPO training so that it generates enhanced prompts aligned with the preferences of a specific video diffusion model.

Background & Motivation

Text-to-video models are highly sensitive to input prompts: they are trained on complex descriptions generated by LVLMs, whereas user inputs tend to be brief and coarse. Existing prompt optimization methods suffer from three major issues:

  1. Modality mismatch: image-oriented prompts emphasize static attributes (composition, color) while neglecting video-specific properties such as motion fluency and narrative coherence.
  2. Cost disparity: the field lacks mature video prompt platforms and community-level experience.
  3. Model unawareness: prompts expanded directly by GPT do not account for the preferences of specific video generation models.

Goal: To design a video-centric, annotation-free, preference-aligned prompt optimization system.

Method

Multi-Dimensional Reward System

Image-level:

  • Aesthetic Predictor
  • MPS (Multi-dimensional Preference Score)

Video-level (VideoScore model):

  • Visual Quality (VQ)
  • Temporal Consistency (TC)
  • Dynamic Degree (DD)
  • Text-Video Alignment (TVA)
  • Factual Consistency (FC)

Together these give a comprehensive evaluation across 7 dimensions.
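
A minimal sketch of how the seven scores might be collected and aggregated. The unweighted mean and the per-dimension threshold check are assumptions for illustration; the paper's exact aggregation weights and threshold values are not restated here, and all field names are illustrative.

```python
from dataclasses import dataclass, fields

@dataclass
class RewardReport:
    """Scores from the 7-dimension reward system for one generated video."""
    aesthetic: float  # image-level: Aesthetic Predictor
    mps: float        # image-level: Multi-dimensional Preference Score
    vq: float         # video-level: Visual Quality
    tc: float         # video-level: Temporal Consistency
    dd: float         # video-level: Dynamic Degree
    tva: float        # video-level: Text-Video Alignment
    fc: float         # video-level: Factual Consistency

    def overall(self) -> float:
        """Aggregate score; an unweighted mean is one plausible choice."""
        vals = [getattr(self, f.name) for f in fields(self)]
        return sum(vals) / len(vals)

    def passes(self, thresholds: "RewardReport") -> bool:
        """True if every dimension clears its predefined threshold."""
        return all(
            getattr(self, f.name) >= getattr(thresholds, f.name)
            for f in fields(self)
        )
```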

Reward-Guided Prompt Evolution (Data Engine)

Drawing on the concept of evolutionary algorithms, GPT-4o is employed as the evolution operator:

  1. Evaluation: Generate video → multi-dimensional reward scoring → scores appended to corresponding prompts.
  2. Selection: Select top-N prompts based on the aggregated scores across all metrics.
  3. Evolution: Feed the original prompt along with high-scoring historical prompts into GPT-4o to generate 3 refined variants at once.

The process iterates for 4 rounds. Prompts that exceed predefined thresholds on all dimensions and achieve the highest overall score are selected to construct (original, target) training pairs.
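
A sketch of this evolution loop under the assumptions above. `generate_video`, `score_video` (returning a `RewardReport` as sketched earlier), and `ask_gpt4o` are hypothetical callables standing in for the video diffusion model, the 7-dimension reward system, and the GPT-4o evolution operator.

```python
import heapq

def evolve_prompt(user_prompt, generate_video, score_video, ask_gpt4o,
                  thresholds, rounds=4, top_n=3, variants=3):
    """Reward-guided prompt evolution (sketch).

    Returns the best evolved prompt, preferring candidates that clear the
    per-dimension thresholds, to pair with `user_prompt` as (original, target).
    """
    pool = [user_prompt]
    history = []  # (overall_score, prompt, report)
    for _ in range(rounds):
        # 1. Evaluation: render each candidate and score it on all 7 dimensions.
        for prompt in pool:
            report = score_video(generate_video(prompt), prompt)
            history.append((report.overall(), prompt, report))
        # 2. Selection: keep the top-N prompts by aggregated score.
        elites = heapq.nlargest(top_n, history, key=lambda t: t[0])
        # 3. Evolution: GPT-4o refines the original prompt given the
        #    high-scoring elites, producing several variants for the next round.
        pool = ask_gpt4o(user_prompt, [p for _, p, _ in elites], n=variants)
    # Prefer candidates exceeding every threshold; fall back to best overall.
    qualified = [t for t in history if t[2].passes(thresholds)]
    best = max(qualified or history, key=lambda t: t[0])
    return best[1]
```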

Two-Stage Optimization

Stage 1: SFT

\[\mathcal{L}_{\text{SFT}} = -\mathbb{E}_{(x,y)} \log p(y \mid s, x)\]

where s is the fixed instruction (system) prompt, x the user input, and y the target enhanced prompt. LLaMA3-Instruct is fine-tuned with LoRA to equip the model with basic prompt enhancement capability.
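
A minimal PyTorch sketch of this masked SFT objective for a HuggingFace-style causal LM. It assumes positions covering s and x are set to -100 in `labels`, so only the target prompt y contributes to the loss, matching the formula above.

```python
import torch.nn.functional as F

def sft_loss(model, input_ids, labels):
    """L_SFT = -E log p(y | s, x): next-token NLL over target tokens only.

    `input_ids` is the concatenation [s; x; y]; `labels` copies it but sets
    positions belonging to s and x to -100 so they are ignored.
    """
    logits = model(input_ids=input_ids).logits
    # Shift so position t predicts token t+1.
    shift_logits = logits[:, :-1, :]
    shift_labels = labels[:, 1:]
    return F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
        ignore_index=-100,  # masked s/x positions contribute no loss
    )
```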

Stage 2: DPO

The SFT model generates 5 candidates per input → videos are generated → reward models score them → the best and worst candidates form triplets → DPO optimization:

\[\mathcal{L}_{\text{DPO}} = -\mathbb{E}_{(x,y_w,y_l) \sim \mathcal{D}} \left[\log \sigma\left(\beta \log \frac{\pi_\theta(y_w|x)}{\pi_{\text{ref}}(y_w|x)} - \beta \log \frac{\pi_\theta(y_l|x)}{\pi_{\text{ref}}(y_l|x)}\right)\right]\]

Two rounds of iterative DPO are performed, with triplet data regenerated from the previous round's model at each iteration.
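
For concreteness, a standard PyTorch rendering of the DPO loss above. Inputs are sequence-summed log-probabilities of the chosen (y_w) and rejected (y_l) enhanced prompts under the trainable policy and the frozen SFT reference; beta=0.1 is an assumed common default, not the paper's reported value.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Direct Preference Optimization loss, matching the formula above.

    Each argument is a batch of summed log-probs log pi(y|x) for the
    winning (y_w) or losing (y_l) enhanced prompt.
    """
    chosen_logratio = policy_chosen_logps - ref_chosen_logps        # log[pi/pi_ref](y_w|x)
    rejected_logratio = policy_rejected_logps - ref_rejected_logps  # log[pi/pi_ref](y_l|x)
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()
```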

Key Experimental Results

Quantitative Comparison on Open-Sora 1.2 + CogVideoX

| Model/Method         | VQ    | TC    | DD    | TVA   | FC    | Avg   |
|----------------------|-------|-------|-------|-------|-------|-------|
| Open-Sora (original) | 3.079 | 3.084 | 3.203 | 3.156 | 3.060 | 3.116 |
| GPT-4o               | 3.014 | 3.082 | 3.103 | 3.167 | 3.031 | 3.079 |
| Ours (DPO-2)         | 3.254 | 3.286 | 3.411 | 3.358 | 3.282 | 3.318 |
| CogVideoX (original) | 2.899 | 2.886 | 3.186 | 3.167 | 2.808 | 2.989 |
| GLM-4                | 2.878 | 2.948 | 3.139 | 3.184 | 2.833 | 2.996 |
| Ours (DPO-2)         | 2.930 | 3.019 | 3.183 | 3.259 | 2.888 | 3.056 |

The average gain on Open-Sora is +0.202, whereas GPT-4o prompting actually lowers the average by 0.037; on CogVideoX the gain is +0.067.

User Study (VBench 100 Prompts)

| Comparison                       | Prompt-A-Video Win Rate |
|----------------------------------|-------------------------|
| vs. original prompts (Open-Sora) | ~65%                    |
| vs. GPT-4o (Open-Sora)           | ~55%                    |
| vs. original prompts (CogVideoX) | ~60%                    |

Key Findings

  1. SFT provides basic capability but limited performance gains; the DPO stage delivers substantive improvements.
  2. Directly applying an image-oriented prompt enhancement model (Promptist) to video actually degrades performance.
  3. Performance converges after two rounds of DPO; additional iterations yield no further improvement.
  4. The method generalizes to text-to-image tasks, outperforming Promptist and PAE on HPSv2.

Highlights & Insights

  1. Closed-loop optimization: The progressive pipeline of evolutionary algorithm → SFT → DPO assigns a clear objective to each stage.
  2. Model specificity: Different video models exhibit different preferences, and the same framework can be trained in a model-specific manner.
  3. Automated data engine: Zero human annotation is required; the process is entirely driven by reward models.
  4. Generalizability: The video-oriented optimization approach also improves image generation quality.

Limitations & Future Work

  • The evolution pipeline relies on GPT-4o, incurring non-trivial data construction costs.
  • DPO training requires extensive video generation and evaluation, resulting in significant computational overhead.
  • Dynamic Degree (DD) may be suppressed by other metrics during optimization.
  • Biases inherent in the reward models propagate into the optimization outcomes.

Related Work

  • Video Generation: Open-Sora, CogVideoX, Sora
  • Prompt Optimization: Promptist, PAE
  • Alignment Methods: DPO, PPO, RLHF

Rating

  • Novelty: ★★★★☆ — First prompt optimization system specifically targeting video diffusion models.
  • Technical Depth: ★★★★☆ — The multi-stage optimization design is elegant and well-motivated.
  • Practicality: ★★★★★ — Directly improves the end-user experience in video generation.