LocoT2V-Bench: Benchmarking Long-form and Complex Text-to-Video Generation
Conference: ICML 2026
arXiv: 2510.26412
Code: TBD
Area: Video Generation / Multimodal VLM / Evaluation Benchmarks
Keywords: Long-form Video Generation Benchmark, Complex Text Alignment, Hierarchical Metadata, Character Consistency
TL;DR
LocoT2V-Bench is a benchmark for long-form, complex-scene text-to-video generation, built from 234 real-world videos across 18 themes with prompts averaging 249 words. It introduces the LoCoT2V-Eval framework, which spans 5 dimensions and 17 sub-dimensions and includes hierarchical VQA with conditional gating and an Auditor-Evaluator dual-agent scheme for HERD (Human Expectation Realization Degree). A systematic evaluation of 17 models reveals a common bottleneck: strong perceptual quality, weak fine-grained alignment, and poor character consistency.
Background & Motivation
Background: Text-to-Video (T2V) has seen significant progress in short videos, but long-form generation (> 10 seconds, multi-scene, complex spatio-temporal dynamics) remains an open challenge. Existing benchmarks (e.g., VBench, EvalCrafter) target short videos with simplified prompts, making them unsuitable for evaluating complex scenes.
Limitations of Prior Work:
- Primarily focus on frame-level visual quality and global prompt consistency, ignoring fine-grained alignment (character attributes, specific actions).
- CLIP-Score and FID are inadequate for long videos and complex multi-scene prompts.
- Insufficient evaluation of character consistency, long-term temporal coherence, and high-level narrative expression.
Key Challenge: The gap between professional-grade control requirements for long videos (precise character settings, camera movements, multi-scene coherence) and current simplified evaluation frameworks.
Goal:
- Construct a long-video benchmark oriented toward professional production workflows (234 real videos, 18 themes, structured multi-scene prompts).
- Design a comprehensive multi-dimensional evaluation framework covering Perceptual Quality, Text Alignment, Temporal Coherence, Dynamic Quality, and Human Expectation fulfillment.
Key Insight: Leveraging real-world videos to derive hierarchical metadata (scene, character, background, camera) and employing multi-round conditional VQA for more precise evaluation.
Core Idea: Hierarchical VQA + Conditional Gating combined with Auditor-Evaluator Dual-Agent HERD to systematically assess fine-grained alignment and high-level goal achievement in long-form generation.
Method
Overall Architecture
The framework consists of two core modules:
- Data Construction: 234 videos collected from YouTube, with complex multi-scene prompts and hierarchical metadata generated via MLLM, LLM, and manual verification.
- Evaluation: The LoCoT2V-Eval framework assesses generated videos across 5 major dimensions and 17 sub-dimensions.
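One way to picture the hierarchical metadata is as a nested record per prompt. The sketch below is illustrative only: the class and field names (`PromptRecord`, `Scene`, `Character`, `locate_description`, and so on) are assumptions made for this example, not the benchmark's actual schema.

```python
from dataclasses import dataclass, field


@dataclass
class Character:
    # Hypothetical fields: a description used to locate the character,
    # plus the attributes and actions the prompt assigns to them.
    locate_description: str
    attributes: list[str] = field(default_factory=list)
    actions: list[str] = field(default_factory=list)


@dataclass
class Scene:
    # One scene of a multi-scene prompt with its four metadata facets.
    summary: str
    characters: list[Character] = field(default_factory=list)
    background: str = ""
    camera: str = ""


@dataclass
class PromptRecord:
    theme: str
    prompt: str
    scenes: list[Scene] = field(default_factory=list)

    def word_count(self) -> int:
        # Whitespace-delimited word count of the full prompt text.
        return len(self.prompt.split())
```

Keeping characters, background, and camera as separate fields per scene is what lets an evaluator ask one targeted question per facet instead of scoring the whole prompt at once.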
Key Designs
- Hierarchical VQA + Conditional Gating (Fine-grained Alignment):
  - Function: Moves text-video alignment from coarse-grained scoring (global CLIP) to fine-grained verification, scene by scene and character by character.
  - Mechanism: A tree-like multi-round QA framework performs hierarchical verification for each scene: Scene Existence Gating → Character Localization & Attribute Verification → Background & Camera Verification. A "Locate Query" ("Is there a man with a red hat?") anchors the character first, followed by a "Judge Query" ("Is this man tall?") to verify attributes, which prevents hallucinated scores. The attribute score for character \(c\) is the mean of the answers \(y_k\) to its \(N_c\) attribute queries, \(f^c_{\text{attr}} = \frac{1}{N_c} \sum_k y_k\), and the action score is the mean over its \(M_c\) action queries, gated by the anchor flag \(a^c_s\): \(f^c_{\text{action}} = a^c_s \cdot \frac{1}{M_c} \sum_q A(q \mid H_{N_c})\).
  - Design Motivation: Complex prompts require multi-level detail verification. Conditional gating prevents a model from receiving action scores when character localization fails, which would otherwise be hallucinated credit. The multi-round dialogue history \(H_k = H_{k-1} \cup \{(q^{c, k}, y_k)\}\) ensures subsequent queries are grounded in verified context.
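The locate-then-judge gating above can be sketched as follows. This is a minimal illustration, not the authors' implementation: `ask` is a hypothetical callable standing in for the VQA model, and returning zero for attributes when localization fails is an assumption of this sketch (the text above only states that action scores are gated).

```python
from typing import Callable

# Hypothetical VQA interface: (question, dialogue history) -> yes/no answer.
History = list[tuple[str, bool]]
AskFn = Callable[[str, History], bool]


def score_character(
    locate_query: str,
    attribute_queries: list[str],
    action_queries: list[str],
    ask: AskFn,
) -> dict[str, float]:
    history: History = []

    # Locate Query: anchor the character before judging anything about it.
    anchor = ask(locate_query, history)
    history.append((locate_query, anchor))
    if not anchor:
        # Gate closed: no further queries are asked (assumption: attributes
        # are also scored zero when the character cannot be located).
        return {"anchor": 0.0, "attr": 0.0, "action": 0.0}

    # Judge Queries: each answer y_k extends the history H_k.
    hits = 0
    for query in attribute_queries:
        answer = ask(query, history)
        history.append((query, answer))
        hits += int(answer)
    f_attr = hits / len(attribute_queries) if attribute_queries else 0.0

    # Action queries are conditioned on the history after all attribute
    # queries (H_{N_c}); the history is not extended between them.
    action_hits = sum(int(ask(query, history)) for query in action_queries)
    f_action = action_hits / len(action_queries) if action_queries else 0.0

    return {"anchor": 1.0, "attr": f_attr, "action": f_action}
```

Because the gate returns early, a video in which the character never appears cannot collect action credit from a model that answers "yes" to a leading question.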
- Multi-dimensional Evaluation Framework:
  - Function: Systematically evaluates Perceptual Quality (PQ), Text-Video Alignment (TVA: global alignment OA + fine-grained alignment FGA), Temporal Quality (TQ: Character Consistency CC, Background Consistency BC, and WE), Dynamic Quality (DQ), and Human Expectation Realization Degree (HERD).
  - Mechanism:
    - PQ: DeQA-Score with multi-scale frame sampling: \(PQ(v) = \frac{1}{|W|} \sum_w \frac{1}{n_\alpha} \sum_{f \in w} \text{DeQA}(f)\).
    - OA: Qwen3-VL-8B replaces CLIP, scoring 0-100 to capture character/scene/interaction consistency.
    - CC: SAM3 tracking → MLLM verification → FG-CLIP2 embedding similarity.
    - BC / WE: Adjacent-frame FG-CLIP2 similarity + optical flow, computed in a streaming fashion to avoid out-of-memory errors on long videos.
    - DQ: Aggregates frame-level measures (motion, smoothness) and high-level measures (segment-level and video-level non-periodicity, plus information flow).
  - Design Motivation: Current benchmarks lack fine-grained alignment and character identity consistency; this framework covers the full chain from low-level pixels to high-level semantics.
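The PQ formula is a mean of per-window means, which can be sketched as below. The sketch assumes that \(n_\alpha\) is the number of frames sampled in a window and uses a hypothetical `score_frame` callable in place of DeQA-Score; neither detail is confirmed by the text above.

```python
from statistics import fmean
from typing import Callable, Sequence, TypeVar

Frame = TypeVar("Frame")


def perceptual_quality(
    windows: Sequence[Sequence[Frame]],
    score_frame: Callable[[Frame], float],
) -> float:
    """Average per-frame quality within each window, then across windows."""
    window_means = [
        fmean([score_frame(frame) for frame in window])
        for window in windows
        if len(window) > 0  # skip empty windows rather than divide by zero
    ]
    if not window_means:
        raise ValueError("at least one non-empty window is required")
    return fmean(window_means)
```

Averaging per window first means a densely sampled window does not outweigh a sparsely sampled one, which is the effect the nested sums in the formula have.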
- Auditor-Evaluator Dual-Agent (HERD Evaluation):
  - Function: Reduces subjective bias in assessing how well the video satisfies the human expectations implied by the prompt.
  - Mechanism: The Auditor agent independently analyzes the video (without seeing the reference expectations) and produces an objective content report. The Evaluator then combines this report with the video to score 6 dimensions (Sentiment, Narrative, Character Development, Visual Style, Thematic Expression, General Impression) on a 1-5 scale. The scores are averaged with equal weight over the dimension set \(D\): \(S_{\text{HERD}} = \frac{1}{|D|} \sum_d s_d\).
  - Design Motivation: Single-agent evaluation is prone to first-impression bias and hallucination; decoupling the roles (reporting vs. evaluating) improves objectivity and auditability.
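The two-stage flow and the equal-weight mean can be sketched as follows. The `auditor` and `evaluator` callables are hypothetical stand-ins for the two agents, and the dimension keys are names chosen for this example. The sketch keeps the 1-5 scale; how that maps to the HERD values reported in the results table is not stated here.

```python
from typing import Any, Callable

HERD_DIMENSIONS = (
    "sentiment",
    "narrative",
    "character_development",
    "visual_style",
    "thematic_expression",
    "general_impression",
)


def herd_score(scores: dict[str, float]) -> float:
    """Equal-weight mean over the six dimensions: S = (1/|D|) * sum(s_d)."""
    missing = [d for d in HERD_DIMENSIONS if d not in scores]
    if missing:
        raise KeyError(f"missing dimension scores: {missing}")
    return sum(scores[d] for d in HERD_DIMENSIONS) / len(HERD_DIMENSIONS)


def evaluate_herd(
    video: Any,
    expectations: str,
    auditor: Callable[[Any], str],
    evaluator: Callable[[Any, str, str], dict[str, float]],
) -> float:
    # Stage 1: the Auditor sees only the video, never the expectations.
    report = auditor(video)
    # Stage 2: the Evaluator scores each dimension using video + report.
    scores = evaluator(video, report, expectations)
    return herd_score(scores)
```

Passing the expectations only to the second callable makes the separation of roles visible in the function signatures themselves.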
Data Construction Comparison
| Benchmark | Samples | Avg. Word Count | Complexity | Features |
|---|---|---|---|---|
| EvalCrafter | 700 | 12.33 | 3.74 | Basic short video |
| VBench-Long | 946 | 7.64 | 2.54 | Simplified long video |
| VBench 2.0 | 90 | 125.46 | 8.13 | Complex single scene |
| LocoT2V-Bench | 234 | 248.85 | 8.70 | Long + Complex + Hierarchical Metadata |
Key Experimental Results
Main Results (Selected from 17 Long-form Video Models)
| Method | PQ | OA | FGA | TVA Mean | CC | BC | TQ Mean | HERD | DQ | Total |
|---|---|---|---|---|---|---|---|---|---|---|
| FreeNoise | 73.89 | 18.12 | 10.38 | 14.25 | 15.38 | 98.77 | 69.85 | 53.65 | 50.55 | 52.44 |
| DiTCtrl | 56.55 | 48.25 | 45.54 | 46.90 | 25.72 | 96.86 | 72.50 | 60.75 | 49.37 | 57.21 |
| LongLive | 80.51 | 55.50 | 36.15 | 45.83 | 54.92 | 99.18 | 83.66 | 81.30 | 61.52 | 70.56 |
| LongCat-Video | 77.75 | 65.59 | 51.01 | 58.30 | 42.08 | 98.31 | 78.45 | 84.80 | 59.29 | 71.72 |
| Sora2 | 66.59 | 69.64 | 54.09 | 61.87 | 45.40 | 99.10 | 80.97 | 86.42 | 64.78 | 72.13 |
| Kling 3.0 | 70.26 | 73.08 | 56.94 | 65.01 | 36.97 | 98.96 | 78.55 | 87.47 | 56.16 | 71.49 |
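The aggregate columns in the table can be checked with simple arithmetic. In the rows shown, TVA Mean matches the average of OA and FGA, and Total matches the average of PQ, TVA Mean, TQ Mean, HERD, and DQ. This relationship is inferred from the numbers above rather than stated explicitly, so the sketch below should be read as a consistency check, not as the official aggregation rule.

```python
# Values copied from the results table above.
ROWS = {
    "FreeNoise":     dict(PQ=73.89, OA=18.12, FGA=10.38, TVA=14.25, TQ=69.85, HERD=53.65, DQ=50.55, Total=52.44),
    "DiTCtrl":       dict(PQ=56.55, OA=48.25, FGA=45.54, TVA=46.90, TQ=72.50, HERD=60.75, DQ=49.37, Total=57.21),
    "LongLive":      dict(PQ=80.51, OA=55.50, FGA=36.15, TVA=45.83, TQ=83.66, HERD=81.30, DQ=61.52, Total=70.56),
    "LongCat-Video": dict(PQ=77.75, OA=65.59, FGA=51.01, TVA=58.30, TQ=78.45, HERD=84.80, DQ=59.29, Total=71.72),
    "Sora2":         dict(PQ=66.59, OA=69.64, FGA=54.09, TVA=61.87, TQ=80.97, HERD=86.42, DQ=64.78, Total=72.13),
    "Kling 3.0":     dict(PQ=70.26, OA=73.08, FGA=56.94, TVA=65.01, TQ=78.55, HERD=87.47, DQ=56.16, Total=71.49),
}


def tva_mean(row: dict) -> float:
    # Inferred: TVA Mean = average of global (OA) and fine-grained (FGA) alignment.
    return (row["OA"] + row["FGA"]) / 2


def total(row: dict) -> float:
    # Inferred: Total = unweighted mean of the five dimension-level scores.
    return (row["PQ"] + row["TVA"] + row["TQ"] + row["HERD"] + row["DQ"]) / 5


def max_deviation() -> float:
    """Largest absolute difference between recomputed and reported values."""
    diffs = []
    for row in ROWS.values():
        diffs.append(abs(tva_mean(row) - row["TVA"]))
        diffs.append(abs(total(row) - row["Total"]))
    return max(diffs)
```

The largest deviation stays within two-decimal rounding for every row, which supports reading Total as an unweighted mean in which each of the five dimensions counts equally.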
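Key Findings
- Strong Perception, Weak Fine-grained Alignment: PQ mostly falls in the 70-84 range (DiTCtrl at 56.55 and Sora2 at 66.59 are exceptions among the rows above), while FGA is only 10-56. The PQ-to-FGA ratio ranges from about 1.2x to about 7x in the rows above (FreeNoise: 73.89 vs. 10.38). Models generate high-quality frames but struggle to follow complex textual constraints precisely.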
- Superior Background Stability, Poor Character Consistency: BC is generally 95-99%, but CC is mostly below 50%. Among the models listed above, only LongLive exceeds that mark (54.92%); CausVid reaches 45.97%. Environmental stability is easier to maintain than character identity.
- Gap between Global and Fine-grained Alignment: OA reaches 50-73% for the stronger models, but FGA is consistently lower, by roughly 3 to 19 points for the models listed above. MLLMs tend to give optimistic global ratings while overlooking missing fine details.
- Kling 3.0 and Sora2 Lead on HERD: Their HERD scores peak at 87.47% and 86.42%, respectively, indicating that proprietary models align more closely with human expectations.
- Multi-prompt vs. Direct Input: Direct input methods (CausVid, SkyReels-V2) generally outperform multi-prompt decomposition methods (FreeNoise, MEVG) in FGA, suggesting end-to-end models handle complex text better.
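The gaps discussed in these findings can be recomputed for the six rows shown in the results table. This is a small arithmetic sketch over the selected rows only; the remaining models from the full set of 17 are not included, so it cannot reproduce averages over all models.

```python
# (PQ, OA, FGA) copied from the selected rows of the results table.
SCORES = {
    "FreeNoise":     (73.89, 18.12, 10.38),
    "DiTCtrl":       (56.55, 48.25, 45.54),
    "LongLive":      (80.51, 55.50, 36.15),
    "LongCat-Video": (77.75, 65.59, 51.01),
    "Sora2":         (66.59, 69.64, 54.09),
    "Kling 3.0":     (70.26, 73.08, 56.94),
}


def alignment_gap(model: str) -> float:
    """Points lost when moving from global (OA) to fine-grained (FGA) alignment."""
    _, oa, fga = SCORES[model]
    return oa - fga


def quality_to_alignment_ratio(model: str) -> float:
    """How many times larger perceptual quality (PQ) is than FGA."""
    pq, _, fga = SCORES[model]
    return pq / fga


if __name__ == "__main__":
    for name in SCORES:
        print(f"{name:14s} OA-FGA = {alignment_gap(name):5.2f}   "
              f"PQ/FGA = {quality_to_alignment_ratio(name):4.2f}")
```

Among these rows, LongLive has the largest OA-to-FGA drop and FreeNoise has the largest PQ-to-FGA ratio.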
Highlights & Insights
- Hierarchical Metadata Design: Unlike previous methods where LLMs directly generate long descriptions, this work reverse-engineers four-part metadata (Scene, Character, Background, Camera) from real videos, providing a ground truth for fine-grained evaluation.
- Conditional Gating VQA: The "Locate → Judge" sequence and the multiplicative gate \(a^c_s\) prevent hallucinated scores; the approach is transferable to other multi-round reasoning evaluation tasks.
- Auditor-Evaluator Decoupling: Mimics the film review process (Content Analysis vs. Quality Rating) to eliminate single-agent hallucination/bias and improve the reliability of subjective metrics like HERD.
- Streaming Evaluation: Converts memory-intensive algorithms from EvalCrafter/VBench into streaming formats (multi-scale sampling, streaming CLIP/Optical Flow), enabling support for ultra-long videos.
- Complex Prompt Library: With prompts averaging 248.85 words and a complexity score of 8.70, this is the most demanding of the benchmarks compared above, reflecting the density of constraints in professional video production.
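The streaming evaluation idea from the highlights above can be illustrated with an adjacent-frame similarity that keeps only one previous embedding in memory. This is a sketch under stated assumptions: `embed` is a hypothetical callable standing in for an image encoder such as FG-CLIP2, and plain cosine similarity with a running mean is used for illustration rather than taken from the paper.

```python
import math
from typing import Callable, Iterable, Optional, Sequence, TypeVar

Frame = TypeVar("Frame")


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    # Treat a zero vector as having no similarity to anything.
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def streaming_adjacent_similarity(
    frames: Iterable[Frame],
    embed: Callable[[Frame], Sequence[float]],
) -> float:
    """Mean cosine similarity of consecutive frames, in O(1) memory."""
    previous: Optional[Sequence[float]] = None
    running_sum = 0.0
    pairs = 0
    for frame in frames:  # frames may be a generator decoding on the fly
        current = embed(frame)
        if previous is not None:
            running_sum += cosine(previous, current)
            pairs += 1
        previous = current  # only the latest embedding is retained
    return running_sum / pairs if pairs else 0.0
```

Because the function consumes any iterable, a video decoder can yield frames lazily and memory use stays flat regardless of video length.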
Limitations & Future Work
- The sample size of 234 is relatively small and may not cover all extreme edge cases.
- The 6 dimensions of HERD are inherently subjective; the reference expectations generated by GPT-5 may differ from what actual users expect.
- Character consistency relies on SAM3 tracking, which may accumulate errors during complex motions, partial occlusions, or long trajectories.
- The evaluation pipeline depends on multiple models, making deployment complex and affecting historical comparability when tools are upgraded.
- Future Work: Expand sample size to 500-1000, incorporate real user evaluation to validate HERD, and improve character tracking robustness.
Related Work & Insights
- vs. VBench / EvalCrafter: These use simplified prompts for short videos; this work uses complex multi-scene prompts, hierarchical metadata, and fine-grained evaluation.
- vs. VBench 2.0: VBench 2.0 uses complex prompts (125 words on average) but only 90 samples; this work uses 234 samples averaging 249 words across 18 themes, with prompts derived from real source videos to reduce hallucination.
- vs. Multi-prompt Methods (Vlogger, StoryAdapter): These decompose long prompts using LLMs; this work shows that direct input methods currently perform better, possibly because decomposition loses context.
- Insight: The fine-grained evaluation framework (Conditional VQA) is transferable to 3D generation and image editing; hierarchical metadata serves as a blueprint for systematically building complex prompt benchmarks.
Rating
- Novelty: ⭐⭐⭐⭐⭐ First to systematically introduce hierarchical metadata, conditional gating VQA, and HERD for long-form video.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Evaluates 17 representative models (both multi-prompt and direct input), exposing common bottlenecks.
- Writing Quality: ⭐⭐⭐⭐⭐ Clear logic, precise methodology, well-organized experiments, and actionable conclusions.
- Value: ⭐⭐⭐⭐⭐ Provides the most comprehensive benchmark for long-form video; guides model improvements; the design logic is highly transferable.