
Prompt-A-Video: Prompt Your Video Diffusion Model via Preference-Aligned LLM

Metadata

  • Conference: ICCV 2025
  • arXiv: 2412.15156
  • Code: GitHub
  • Area: Video Generation
  • Keywords: Video Generation, Prompt Enhancement, LLM, DPO, Reward-Guided Evolution

TL;DR

Prompt-A-Video automatically constructs training data via a reward-guided prompt evolution pipeline, then optimizes an LLM through two-stage SFT and DPO training so that it generates enhanced prompts aligned with the preferences of a specific video diffusion model.

Background & Motivation

Text-to-video models are highly sensitive to input prompts: they are trained on complex descriptions generated by LVLMs, whereas user inputs tend to be brief and coarse. Existing prompt optimization methods suffer from three major issues:

  1. Modality mismatch: image-oriented prompts emphasize static attributes (composition, color) while neglecting video-specific properties such as motion fluency and narrative coherence.
  2. Cost disparity: the field lacks mature video prompt platforms and community-level experience.
  3. Model unawareness: prompts expanded directly by GPT do not account for the preferences of specific video generation models.

Goal: To design a video-centric, annotation-free, preference-aligned prompt optimization system.

Method

Multi-Dimensional Reward System

Image-level:

  • Aesthetic Predictor
  • MPS (Multi-dimensional Preference Score)

Video-level (VideoScore model):

  • Visual Quality (VQ)
  • Temporal Consistency (TC)
  • Dynamic Degree (DD)
  • Text-Video Alignment (TVA)
  • Factual Consistency (FC)

Together these give a comprehensive evaluation across 7 dimensions.
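
A minimal sketch of how the seven scores might be collected and aggregated. The unweighted mean and the per-dimension threshold check are assumptions for illustration; the paper's exact aggregation weights and threshold values are not restated here, and all field names are illustrative.

```python
from dataclasses import dataclass, fields

@dataclass
class RewardReport:
    """Scores from the 7-dimension reward system for one generated video."""
    aesthetic: float  # image-level: Aesthetic Predictor
    mps: float        # image-level: Multi-dimensional Preference Score
    vq: float         # video-level: Visual Quality
    tc: float         # video-level: Temporal Consistency
    dd: float         # video-level: Dynamic Degree
    tva: float        # video-level: Text-Video Alignment
    fc: float         # video-level: Factual Consistency

    def overall(self) -> float:
        """Aggregate score; an unweighted mean is one plausible choice."""
        vals = [getattr(self, f.name) for f in fields(self)]
        return sum(vals) / len(vals)

    def passes(self, thresholds: "RewardReport") -> bool:
        """True if every dimension clears its predefined threshold."""
        return all(
            getattr(self, f.name) >= getattr(thresholds, f.name)
            for f in fields(self)
        )
```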

Reward-Guided Prompt Evolution (Data Engine)

Drawing on the concept of evolutionary algorithms, GPT-4o is employed as the evolution operator:

  1. Evaluation: Generate video → multi-dimensional reward scoring → scores appended to corresponding prompts.
  2. Selection: Select top-N prompts based on the aggregated scores across all metrics.
  3. Evolution: Feed the original prompt along with high-scoring historical prompts into GPT-4o to generate 3 refined variants at once.

The process iterates for 4 rounds. Prompts that exceed predefined thresholds on all dimensions and achieve the highest overall score are selected to construct (original, target) training pairs.
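
A sketch of this evolution loop under the assumptions above. `generate_video`, `score_video` (returning a `RewardReport` as sketched earlier), and `ask_gpt4o` are hypothetical callables standing in for the video diffusion model, the 7-dimension reward system, and the GPT-4o evolution operator.

```python
import heapq

def evolve_prompt(user_prompt, generate_video, score_video, ask_gpt4o,
                  thresholds, rounds=4, top_n=3, variants=3):
    """Reward-guided prompt evolution (sketch).

    Returns the best evolved prompt, preferring candidates that clear the
    per-dimension thresholds, to pair with `user_prompt` as (original, target).
    """
    pool = [user_prompt]
    history = []  # (overall_score, prompt, report)
    for _ in range(rounds):
        # 1. Evaluation: render each candidate and score it on all 7 dimensions.
        for prompt in pool:
            report = score_video(generate_video(prompt), prompt)
            history.append((report.overall(), prompt, report))
        # 2. Selection: keep the top-N prompts by aggregated score.
        elites = heapq.nlargest(top_n, history, key=lambda t: t[0])
        # 3. Evolution: GPT-4o refines the original prompt given the
        #    high-scoring elites, producing several variants for the next round.
        pool = ask_gpt4o(user_prompt, [p for _, p, _ in elites], n=variants)
    # Prefer candidates exceeding every threshold; fall back to best overall.
    qualified = [t for t in history if t[2].passes(thresholds)]
    best = max(qualified or history, key=lambda t: t[0])
    return best[1]
```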

Two-Stage Optimization

Stage 1: SFT

\[\mathcal{L}_{\text{SFT}} = -\mathbb{E}_{(x,y)} \log p(y \mid s, x)\]

where s is the fixed instruction (system) prompt, x the user input, and y the target enhanced prompt. LLaMA3-Instruct is fine-tuned with LoRA to equip the model with basic prompt enhancement capability.
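
A minimal PyTorch sketch of this masked SFT objective for a HuggingFace-style causal LM. It assumes positions covering s and x are set to -100 in `labels`, so only the target prompt y contributes to the loss, matching the formula above.

```python
import torch.nn.functional as F

def sft_loss(model, input_ids, labels):
    """L_SFT = -E log p(y | s, x): next-token NLL over target tokens only.

    `input_ids` is the concatenation [s; x; y]; `labels` copies it but sets
    positions belonging to s and x to -100 so they are ignored.
    """
    logits = model(input_ids=input_ids).logits
    # Shift so position t predicts token t+1.
    shift_logits = logits[:, :-1, :]
    shift_labels = labels[:, 1:]
    return F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
        ignore_index=-100,  # masked s/x positions contribute no loss
    )
```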

Stage 2: DPO

The SFT model generates 5 candidates per input → videos are generated → reward models score them → the best and worst candidates form triplets → DPO optimization:

\[\mathcal{L}_{\text{DPO}} = -\mathbb{E}_{(x,y_w,y_l) \sim \mathcal{D}} \left[\log \sigma\left(\beta \log \frac{\pi_\theta(y_w|x)}{\pi_{\text{ref}}(y_w|x)} - \beta \log \frac{\pi_\theta(y_l|x)}{\pi_{\text{ref}}(y_l|x)}\right)\right]\]

Two rounds of iterative DPO are performed, with triplet data regenerated from the previous round's model at each iteration.
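
For concreteness, a standard PyTorch rendering of the DPO loss above. Inputs are sequence-summed log-probabilities of the chosen (y_w) and rejected (y_l) enhanced prompts under the trainable policy and the frozen SFT reference; beta=0.1 is an assumed common default, not the paper's reported value.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Direct Preference Optimization loss, matching the formula above.

    Each argument is a batch of summed log-probs log pi(y|x) for the
    winning (y_w) or losing (y_l) enhanced prompt.
    """
    chosen_logratio = policy_chosen_logps - ref_chosen_logps        # log[pi/pi_ref](y_w|x)
    rejected_logratio = policy_rejected_logps - ref_rejected_logps  # log[pi/pi_ref](y_l|x)
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()
```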

Key Experimental Results

Quantitative Comparison on Open-Sora 1.2 + CogVideoX

| Model/Method         | VQ    | TC    | DD    | TVA   | FC    | Avg   |
|----------------------|-------|-------|-------|-------|-------|-------|
| Open-Sora (original) | 3.079 | 3.084 | 3.203 | 3.156 | 3.060 | 3.116 |
| GPT-4o               | 3.014 | 3.082 | 3.103 | 3.167 | 3.031 | 3.079 |
| Ours (DPO-2)         | 3.254 | 3.286 | 3.411 | 3.358 | 3.282 | 3.318 |
| CogVideoX (original) | 2.899 | 2.886 | 3.186 | 3.167 | 2.808 | 2.989 |
| GLM-4                | 2.878 | 2.948 | 3.139 | 3.184 | 2.833 | 2.996 |
| Ours (DPO-2)         | 2.930 | 3.019 | 3.183 | 3.259 | 2.888 | 3.056 |

The average gain on Open-Sora is +0.202, whereas GPT-4o prompting actually lowers the average by 0.037; on CogVideoX the gain is +0.067.

User Study (VBench 100 Prompts)

| Comparison                       | Prompt-A-Video Win Rate |
|----------------------------------|-------------------------|
| vs. original prompts (Open-Sora) | ~65%                    |
| vs. GPT-4o (Open-Sora)           | ~55%                    |
| vs. original prompts (CogVideoX) | ~60%                    |

Key Findings

  1. SFT provides basic capability but limited performance gains; the DPO stage delivers substantive improvements.
  2. Directly applying an image-oriented prompt enhancement model (Promptist) to video actually degrades performance.
  3. Performance converges after two rounds of DPO; additional iterations yield no further improvement.
  4. The method generalizes to text-to-image tasks, outperforming Promptist and PAE on HPSv2.

Highlights & Insights

  1. Closed-loop optimization: The progressive pipeline of evolutionary algorithm → SFT → DPO assigns a clear objective to each stage.
  2. Model specificity: Different video models exhibit different preferences, and the same framework can be trained in a model-specific manner.
  3. Automated data engine: Zero human annotation is required; the process is entirely driven by reward models.
  4. Generalizability: The video-oriented optimization approach also improves image generation quality.

Limitations & Future Work

  • The evolution pipeline relies on GPT-4o, incurring non-trivial data construction costs.
  • DPO training requires extensive video generation and evaluation, resulting in significant computational overhead.
  • Dynamic Degree (DD) may be suppressed by other metrics during optimization.
  • Biases inherent in the reward models propagate into the optimization outcomes.

Related Work

  • Video Generation: Open-Sora, CogVideoX, Sora
  • Prompt Optimization: Promptist, PAE
  • Alignment Methods: DPO, PPO, RLHF

Rating

  • Novelty: ★★★★☆ — First prompt optimization system specifically targeting video diffusion models.
  • Technical Depth: ★★★★☆ — The multi-stage optimization design is elegant and well-motivated.
  • Practicality: ★★★★★ — Directly improves the end-user experience in video generation.