# Prompt-A-Video: Prompt Your Video Diffusion Model via Preference-Aligned LLM
## Metadata
- Conference: ICCV 2025
- arXiv: 2412.15156
- Code: GitHub
- Area: Video Generation
- Keywords: Video Generation, Prompt Enhancement, LLM, DPO, Reward-Guided Evolution
## TL;DR
Prompt-A-Video automatically constructs training data via a reward-guided prompt-evolution pipeline and then optimizes an LLM through two-stage SFT and DPO training, so that it generates enhanced prompts aligned with the preferences of a specific video diffusion model.
## Background & Motivation
Text-to-video models are highly sensitive to input prompts: they are trained on complex descriptions generated by LVLMs, whereas user inputs tend to be brief and coarse. Existing prompt optimization methods suffer from three major issues:
- Modality mismatch: image-oriented prompts emphasize static attributes (composition, color) while neglecting video-specific properties such as motion fluency and narrative coherence.
- Cost disparity: trial-and-error prompt tuning is far more expensive for video than for images, and the field lacks mature video prompt platforms and community-level experience.
- Model unawareness: prompts expanded directly by GPT do not account for the preferences of the specific video generation model.
Goal: To design a video-centric, annotation-free, preference-aligned prompt optimization system.
## Method
### Multi-Dimensional Reward System
- Image-level: Aesthetic Predictor, MPS (Multi-dimensional Preference Score)
- Video-level (the VideoScore model): Visual Quality (VQ), Temporal Consistency (TC), Dynamic Degree (DD), Text-Video Alignment (TVA), Factual Consistency (FC)

Together these form a comprehensive evaluation across 7 dimensions.
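A minimal sketch of how such a 7-dimension reward report might be aggregated. The `RewardReport` class and its `overall`/`passes` helpers are illustrative assumptions, not the paper's implementation; in particular, the summary does not specify the exact weighting, so an unweighted mean is used here.

```python
# Illustrative 7-dimension reward report; field names mirror the metrics above.
from dataclasses import dataclass

@dataclass
class RewardReport:
    aesthetic: float   # image-level: Aesthetic Predictor
    mps: float         # image-level: Multi-dimensional Preference Score
    vq: float          # video-level (VideoScore): Visual Quality
    tc: float          # Temporal Consistency
    dd: float          # Dynamic Degree
    tva: float         # Text-Video Alignment
    fc: float          # Factual Consistency

    def overall(self) -> float:
        """Aggregate score used to rank prompt variants (unweighted mean;
        the paper's actual weighting is not given in this summary)."""
        dims = [self.aesthetic, self.mps, self.vq, self.tc, self.dd, self.tva, self.fc]
        return sum(dims) / len(dims)

    def passes(self, thresholds: "RewardReport") -> bool:
        """True only if every dimension clears its predefined threshold."""
        return all(
            getattr(self, f) >= getattr(thresholds, f)
            for f in self.__dataclass_fields__
        )
```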
### Reward-Guided Prompt Evolution (Data Engine)
Drawing on the concept of evolutionary algorithms, the pipeline employs GPT-4o as the evolution operator:
- Evaluation: Generate video → multi-dimensional reward scoring → scores appended to corresponding prompts.
- Selection: Select top-N prompts based on the aggregated scores across all metrics.
- Evolution: Feed the original prompt along with high-scoring historical prompts into GPT-4o to generate 3 refined variants at once.
The process iterates for 4 rounds. Prompts that exceed predefined thresholds on all dimensions and achieve the highest overall score are selected to construct (original, target) training pairs.
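The loop below sketches this data engine under stated assumptions: `generate_video`, `score_video`, and `gpt4o_evolve` are hypothetical stand-ins for the T2V model, the reward stack (returning a `RewardReport` as sketched above), and a GPT-4o API wrapper; `THRESHOLDS` holds the assumed per-dimension cutoffs, and `top_n` is a placeholder for the paper's unspecified N.

```python
# Sketch of the reward-guided evolution loop (4 rounds, 3 variants per call).
def evolve_prompt(original: str, rounds: int = 4, top_n: int = 2):
    history = []                     # (prompt, RewardReport) pairs across rounds
    pool = [original]
    for _ in range(rounds):
        # Evaluation: one video per prompt, scored on all 7 dimensions.
        scored = [(p, score_video(generate_video(p), p)) for p in pool]
        history.extend(scored)
        # Selection: keep the top-N prompts by aggregated score.
        elites = sorted(history, key=lambda e: e[1].overall(), reverse=True)[:top_n]
        # Evolution: GPT-4o sees the original plus high-scoring history
        # and returns 3 refined variants for the next round.
        pool = gpt4o_evolve(original, elites, num_variants=3)
    # Keep the winner only if it clears every per-dimension threshold.
    best_prompt, best_report = max(history, key=lambda e: e[1].overall())
    if best_report.passes(THRESHOLDS):
        return (original, best_prompt)   # an (original, target) training pair
    return None
```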
### Two-Stage Optimization
Stage 1: SFT
LLaMA3-Instruct is fine-tuned with LoRA on the (original, target) pairs from the data engine, equipping the model with basic prompt enhancement capability; a setup sketch follows.
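A minimal sketch of this stage, assuming Hugging Face `transformers` and `peft`; the model ID and all LoRA hyperparameters are placeholders rather than the paper's reported settings.

```python
# Illustrative LoRA fine-tuning setup for the SFT stage (placeholder values).
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = "meta-llama/Meta-Llama-3-8B-Instruct"   # assumed base checkpoint
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

lora = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,     # illustrative hyperparameters
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()
# Training then proceeds as standard instruction tuning: the coarse user
# prompt is the input, the reward-selected enhanced prompt is the label.
```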
Stage 2: DPO
The SFT model generates 5 candidates per input → videos are generated → reward models score them → the best- and worst-scoring candidates form (input, chosen, rejected) triplets → DPO optimization:
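This summary omits the loss itself; presumably it is the standard DPO objective of Rafailov et al. (2023), with the SFT model (or the previous round's model) as the reference policy:

$$
\mathcal{L}_{\text{DPO}} = -\,\mathbb{E}_{(x,\, y_w,\, y_l)}\!\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\right)\right]
$$

where $x$ is the user input, $y_w$ and $y_l$ are the candidates whose generated videos scored best and worst, $\pi_{\text{ref}}$ is the reference policy, and $\beta$ controls the strength of the implicit KL constraint.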
Two rounds of iterative DPO are performed, with triplet data regenerated from the previous round's model at each iteration.
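A short sketch of one round of triplet construction under the same assumptions as above (`policy_generate` is a hypothetical wrapper around the current model; `generate_video` and `score_video` are the stand-ins from the evolution sketch).

```python
# Build (prompt, chosen, rejected) triplets for one DPO round.
def build_dpo_triplets(user_prompts, policy_generate, k: int = 5):
    triplets = []
    for x in user_prompts:
        # The current policy proposes k enhanced-prompt candidates per input.
        candidates = [policy_generate(x) for _ in range(k)]
        # Each candidate is judged by the video its prompt produces.
        scored = sorted(
            ((y, score_video(generate_video(y), y).overall()) for y in candidates),
            key=lambda t: t[1], reverse=True,
        )
        y_w, y_l = scored[0][0], scored[-1][0]   # best- and worst-scoring prompts
        triplets.append({"prompt": x, "chosen": y_w, "rejected": y_l})
    return triplets
```

Each round, the triplets are rebuilt with the latest model before DPO training, matching the two-round iterative scheme described above.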
## Key Experimental Results
### Quantitative Comparison (Open-Sora 1.2 and CogVideoX)
| Method | VQ | TC | DD | TVA | FC | Avg |
|---|---|---|---|---|---|---|
| Open-Sora 1.2, original prompts | 3.079 | 3.084 | 3.203 | 3.156 | 3.060 | 3.116 |
| Open-Sora 1.2 + GPT-4o | 3.014 | 3.082 | 3.103 | 3.167 | 3.031 | 3.079 |
| Open-Sora 1.2 + Ours (DPO-2) | 3.254 | 3.286 | 3.411 | 3.358 | 3.282 | 3.318 |
| CogVideoX, original prompts | 2.899 | 2.886 | 3.186 | 3.167 | 2.808 | 2.989 |
| CogVideoX + GLM-4 | 2.878 | 2.948 | 3.139 | 3.184 | 2.833 | 2.996 |
| CogVideoX + Ours (DPO-2) | 2.930 | 3.019 | 3.183 | 3.259 | 2.888 | 3.056 |
On Open-Sora the average gain is +0.202, whereas GPT-4o rewriting actually lowers the average by 0.037; on CogVideoX the gain is +0.067.
### User Study (VBench, 100 Prompts)
| Comparison | Prompt-A-Video Win Rate |
|---|---|
| vs. original prompts (Open-Sora) | ~65% |
| vs. GPT-4o (Open-Sora) | ~55% |
| vs. original prompts (CogVideoX) | ~60% |
### Key Findings
- SFT provides basic capability but limited performance gains; the DPO stage delivers substantive improvements.
- Directly applying an image-oriented prompt enhancement model (Promptist) to video actually degrades performance.
- Performance converges after two rounds of DPO; additional iterations yield no further improvement.
- The method generalizes to text-to-image tasks, outperforming Promptist and PAE on HPSv2.
## Highlights & Insights
- Closed-loop optimization: The progressive pipeline of evolutionary algorithm → SFT → DPO assigns a clear objective to each stage.
- Model specificity: Different video models exhibit different preferences, and the same framework can be trained in a model-specific manner.
- Automated data engine: Zero human annotation is required; the process is entirely driven by reward models.
- Generalizability: The video-oriented optimization approach also improves image generation quality.
## Limitations & Future Work
- The evolution pipeline relies on GPT-4o, incurring non-trivial data construction costs.
- DPO training requires extensive video generation and evaluation, resulting in significant computational overhead.
- Dynamic Degree (DD) may be suppressed by other metrics during optimization.
- Biases inherent in the reward models propagate into the optimization outcomes.
## Related Work & Insights
- Video Generation: Open-Sora, CogVideoX, Sora
- Prompt Optimization: Promptist, PAE
- Alignment Methods: DPO, PPO, RLHF
## Rating
- Novelty: ★★★★☆ — First prompt optimization system specifically targeting video diffusion models.
- Technical Depth: ★★★★☆ — The multi-stage optimization design is elegant and well-motivated.
- Practicality: ★★★★★ — Directly improves the end-user experience in video generation.