# Video-T1: Test-Time Scaling for Video Generation

- Conference: ICCV 2025
- arXiv: 2503.18942
- Code: liuff19.github.io/Video-T1
- Area: LLM Reasoning
- Keywords: Test-time scaling, video generation, Tree-of-Frames, diffusion models, autoregressive video
## TL;DR
This paper transfers the test-time scaling (TTS) paradigm from LLMs to video generation by reformulating TTS as a search problem over trajectories from Gaussian noise space to the target video distribution. It proposes the Tree-of-Frames (ToF) search algorithm for efficient inference-time compute scaling, achieving consistent quality improvements across diverse video generation models on VBench.
## Background & Motivation
### State of the Field
Video generation has seen remarkable progress through training-time scaling (more data, larger models, more compute), yet the cost of further scaling has become prohibitive. Meanwhile, the LLM community (e.g., DeepSeek-R1, OpenAI o1) has demonstrated that TTS can substantially improve model performance by increasing inference-time computation.
### Limitations of Prior Work
Prohibitive training-time scaling costs: Training video generation models demands enormous data and computational resources, with scaling costs far exceeding those of LLMs.
Unexplored inference-time potential: Virtually no systematic study has investigated how increasing inference-time computation can improve video generation quality.
Domain-specific challenges: Unlike LLM token sequences, video requires simultaneous spatial quality and temporal coherence; the multi-step denoising process of diffusion models further complicates compute scaling.
### Core Problem
To what extent can video generation quality be improved when models are granted additional inference-time computation?
### Starting Point
The paper reformulates video generation TTS as a search problem: finding better trajectories through the space from Gaussian noise to the target video distribution, guided by test-time verifiers and heuristic search algorithms.
## Method
### Overall Architecture
The Video-T1 framework comprises three core components:

- Video Generator \(\mathcal{G}\): a pretrained model that generates video conditioned on text
- Test-Time Verifier \(\mathcal{V}\): a multimodal evaluation model that assesses generated video quality
- Heuristic Search Algorithm \(f\): an optimization method that uses verifier feedback to guide the search
Two search strategies are proposed: Random Linear Search and Tree-of-Frames (ToF) Search.
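Stated compactly (a paraphrase of the paper's search formulation using the notation above, not a formula quoted from it), both strategies optimize the same objective:

\[
\hat{x} = \mathcal{G}(z^{*}, c), \qquad z^{*} = \arg\max_{z \sim \mathcal{N}(0, I)} \mathcal{V}\big(\mathcal{G}(z, c),\, c\big)
\]

where \(c\) is the text prompt; the search algorithm \(f\) determines which noise candidates \(z\) are proposed, expanded, and pruned.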
### Key Designs
#### 1. Random Linear Search
- Function: Samples \(N\) Gaussian noise vectors, performs complete denoising for each to generate \(N\) videos, and selects the one with the highest verifier score (Best-of-N strategy).
- Mechanism: Can be viewed as a forest of \(N\) degenerate trees (each with a single path), from which the optimal path is selected.
- Time Complexity: \(O(TN)\), where \(T\) is the number of frames and \(N\) is the number of noise samples.
- Limitations: The linear structure provides no intermediate feedback; each of the \(N\) independent trees must be fully denoised before the verifier can score it, so computation is wasted on trajectories that are poor from the start.
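A minimal sketch of this strategy in Python, assuming hypothetical `generator` / `verifier` objects standing in for \(\mathcal{G}\) and \(\mathcal{V}\) (not the paper's actual API):

```python
import torch

def random_linear_search(prompt: str, generator, verifier, n: int = 7):
    """Best-of-N: fully denoise N independent noise samples, keep the best.

    `generator.denoise` and `verifier.score` are assumed interfaces for
    the pretrained video model G and the reward model V, respectively.
    """
    best_video, best_score = None, float("-inf")
    for _ in range(n):
        noise = torch.randn(generator.latent_shape)  # independent Gaussian sample
        video = generator.denoise(noise, prompt)     # full T-frame generation: O(T)
        score = verifier.score(video, prompt)        # feedback arrives only at the end
        if score > best_score:
            best_video, best_score = video, score
    return best_video                                # total cost O(T * N)
```

Because the verifier is consulted only after complete denoising, no partial trajectory can be rejected early; ToF search removes exactly this inefficiency.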
#### 2. Tree-of-Frames (ToF) Search
- Function: Exploits the frame-by-frame generation property of autoregressive models to introduce inference-time reasoning along the temporal dimension, adaptively expanding and pruning video branches via a tree structure.
- Mechanism: Generation proceeds in three stages:
- (a) Image-level alignment: The first frame is aligned with the text prompt on core semantics (color, object count, spatial layout), influencing all subsequent frames.
- (b) Hierarchical prompting: Intermediate frames focus on motion stability and physical plausibility, with the verifier's evaluation emphasis adjusted dynamically.
- (c) Holistic evaluation: The final stage assesses overall video quality and text alignment.
- Three Core Techniques:
- Image-level alignment: Incrementally evaluates frame quality during denoising, enabling early rejection of low-quality candidates.
- Hierarchical prompting: Different verifier prompts are used at different stages — the first frame emphasizes semantic consistency, intermediate frames emphasize motion continuity, and the final frames emphasize overall quality.
- Heuristic pruning: Retains the top-\(k_t\) nodes at each timestep with a dynamic branching factor \(b_t\), balancing exploration and convergence.
- Complexity: In practice, \(b_t = 1\) for most timesteps and branching occurs only at stage transitions, reducing complexity to \(O(N+T)\).
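A minimal sketch of ToF search for a frame-by-frame generator; `generator.extend`, `verifier.score`, and `verifier.stage_prompt` are hypothetical interfaces, and the stage boundaries and branching factors are illustrative rather than the paper's settings:

```python
import heapq

def tree_of_frames_search(prompt, generator, verifier,
                          total_frames=49, stage_ends=(0, 40, 48),
                          branch=3, keep=2):
    """Tree search over partial videos with stage-wise verification."""
    beams = [[]]                                        # each beam: frames generated so far
    for t in range(total_frames):
        b_t = branch if t in stage_ends else 1          # branch only at stage transitions,
        candidates = [generator.extend(frames, prompt)  # so b_t = 1 for most timesteps
                      for frames in beams for _ in range(b_t)]
        if t in stage_ends:                             # hierarchical prompting: verifier
            stage_prompt = verifier.stage_prompt(t)     # emphasis depends on the stage
            scored = [(verifier.score(f, stage_prompt), f) for f in candidates]
            candidates = [f for _, f in                 # heuristic pruning: keep top-k_t
                          heapq.nlargest(keep, scored, key=lambda s: s[0])]
        beams = candidates
    # final stage: holistic evaluation over the surviving full-length candidates
    return max(beams, key=lambda f: verifier.score(f, prompt))
```

Since the beam width stays bounded by \(k_t\) and branching occurs at only a constant number of stage transitions, total generator calls grow as \(O(N + T)\) rather than \(O(TN)\).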
#### 3. Multi-Verifier Ensemble
- Function: Aggregates scores from multiple verifiers to select the final video.
- Core Formula: \(\hat{i} = \arg\max_{1 \le i \le N} \frac{1}{|\mathcal{M}|} \sum_{v \in \mathcal{M}} c_v\, \text{Rank}_v(f^{(i)})\), where \(\mathcal{M}\) is the verifier set, \(c_v\) the weight of verifier \(v\), and \(\text{Rank}_v(f^{(i)})\) the rank of candidate video \(f^{(i)}\) under verifier \(v\) (higher rank = better score).
- Design Motivation: Different verifiers capture different evaluation dimensions; ensemble aggregation reduces individual bias and selects the globally optimal video.
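A sketch of this rank aggregation in Python with NumPy; the score matrix and uniform weights below are illustrative, not the paper's values:

```python
import numpy as np

def ensemble_select(scores: np.ndarray, weights: np.ndarray) -> int:
    """Pick the video maximizing the weighted mean rank across verifiers.

    `scores[v, i]` is verifier v's raw score for candidate f^(i) and
    `weights[v]` is the coefficient c_v. Ranking per verifier means raw
    score scales are never compared directly across verifiers. (A sketch
    of the ensemble formula above, not the authors' implementation.)
    """
    # argsort of argsort converts raw scores to 0-based ranks per verifier
    ranks = scores.argsort(axis=1).argsort(axis=1)       # shape (|M|, N)
    mean_rank = (weights[:, None] * ranks).mean(axis=0)  # (1/|M|) sum_v c_v Rank_v
    return int(mean_rank.argmax())                       # selected index i_hat

scores = np.array([[0.2, 0.9, 0.5],    # e.g., VisionReward
                   [0.6, 0.7, 0.1],    # e.g., VideoScore
                   [0.3, 0.8, 0.4]])   # e.g., VideoLLaMA3
print(ensemble_select(scores, np.ones(3)))  # -> 1
```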
### Verifier Selection
Three multimodal reward models are employed: VisionReward (human preference across 29 dimensions), VideoScore (multi-dimensional LMM-based scoring), and VideoLLaMA3 (a state-of-the-art multimodal understanding model). VBench serves as the ground-truth verifier for measuring the performance upper bound.
## Key Experimental Results
### Main Results (VBench Total Score Improvement)
| Model | w/o TTS | +TTS | Gain% | Semantic Gain% |
|---|---|---|---|---|
| CogVideoX-5B | 81.61 | 84.42 | +3.44% | +10.1% |
| CogVideoX-2B | 80.91 | 83.89 | +3.68% | +3.38% |
| OpenSora-v1.2 | 79.76 | 81.65 | +2.37% | +9.87% |
| Pyramid-Flow (FLUX) | 81.61 | 86.51 | +5.86% | +18.6% |
| Pyramid-Flow (SD3) | 81.72 | 85.31 | +4.39% | +13.8% |
| NOVA | 78.56 | 79.80 | +1.58% | +2.43% |
Dimension-level highlights (Pyramid-Flow FLUX):

- Multiple Objects: 61.08 → 88.93 (+45.6%)
- Scene: 47.65 → 56.07 (+17.7%)
- Object Class: 93.49 → 99.69 (+6.6%)
### Ablation Study
| Configuration | Description | Outcome |
|---|---|---|
| ToF vs. Linear Search | Comparison under equal compute budget | ToF achieves comparable performance with ~70% fewer GFLOPs |
| Single vs. Multi-Verifier | VisionReward / VideoScore / VideoLLaMA3 | Ensemble further raises the TTS curve |
| Small model (NOVA 0.6B) vs. Large model (CogVideoX-5B) | Effect of model scale on TTS gains | Larger models benefit substantially more from TTS |
| Different generation dimensions | Scene / Object / Motion, etc. | Explicit semantic dimensions show large gains; Motion Smoothness shows limited improvement |
GFLOPs comparison (\(N=7\)):
| Model | Linear Search | ToF Search | ToF / Linear |
|---|---|---|---|
| Pyramid-Flow (FLUX) | 5.22×10⁷ | 1.62×10⁷ | 31% |
| Pyramid-Flow (SD3) | 3.66×10⁷ | 1.13×10⁷ | 31% |
| NOVA | 4.02×10⁶ | 1.41×10⁶ | 35% |
### Key Findings
- TTS consistently improves quality across all video generation models, eventually converging to an upper bound.
- Stronger base models benefit more from TTS: CogVideoX-5B achieves larger gains and higher efficiency than NOVA.
- ToF search is far more efficient than linear search: it achieves comparable or superior performance at approximately one-third of the computation.
- Semantic alignment dimensions show the most significant gains (Object Class +19.5%, Scene +18.6%), while implicit attributes such as motion smoothness show limited improvement.
- Multi-verifier ensemble further raises the TTS performance ceiling.
## Highlights & Insights
- Paradigm transfer: This is the first systematic adaptation of TTS from LLMs to video generation, establishing a general search framework for the domain.
- Elegant ToF design: Leveraging the frame-by-frame nature of autoregressive models, tree search is constructed along the temporal dimension, reducing complexity from \(O(TN)\) to \(O(N+T)\).
- Hierarchical prompting strategy: Attending to different quality dimensions at different stages mirrors human multi-level judgment of video quality.
- Verifiers as the critical bottleneck: The framework's performance ceiling is determined by verifier quality, underscoring the importance of advancing video evaluation models.
- Implicit conclusion: Inference-time computation may be more cost-effective than additional training — offering a new optimization pathway beyond training-time scaling.
## Limitations & Future Work
- ToF is limited to autoregressive models: Full-frame diffusion models (e.g., CogVideoX) can only use linear search and cannot benefit from ToF's efficiency advantages.
- Verifier capability constrains the upper bound: Current verifiers are insufficient for evaluating motion smoothness, temporal flickering, and related dimensions, limiting gains on these metrics.
- Inference overhead remains substantial: Even with ToF's efficiency improvements, inference costs in large-scale deployment far exceed those of single-pass generation.
- More sophisticated search strategies unexplored: Techniques common in LLM TTS such as MCTS and beam search have not been investigated.
- Evaluation limited to VBench: Human preference studies are absent from the evaluation.
## Related Work & Insights
- OpenAI o1 / DeepSeek-R1 established the immense potential of TTS in LLMs; this paper transfers that insight to visual generation.
- TTS for image generation provides direct inspiration, though the temporal dimension of video introduces new challenges.
- NOVA, Pyramid-Flow, and other autoregressive video models are naturally compatible with ToF's tree-search structure.
- Verifier quality is the decisive factor in TTS effectiveness — improvements to video evaluation models (e.g., VideoLLaMA3) directly translate into stronger TTS performance.
## Rating
- Novelty: ⭐⭐⭐⭐⭐ — First systematic introduction of TTS to video generation; the ToF search design is both novel and efficient.
- Experimental Thoroughness: ⭐⭐⭐⭐ — Six generation models, three verifiers, and multi-dimensional analysis; human evaluation is absent.
- Writing Quality: ⭐⭐⭐⭐ — Concepts are clearly presented, the framework is well-structured, and algorithm pseudocode is rigorous.
- Value: ⭐⭐⭐⭐⭐ — Provides a fundamentally new optimization pathway for video generation beyond training-time scaling, with broad implications.