# Video-T1: Test-Time Scaling for Video Generation

- Conference: ICCV 2025
- arXiv: 2503.18942
- Code: liuff19.github.io/Video-T1
- Area: LLM Reasoning
- Keywords: Test-time scaling, video generation, Tree-of-Frames, diffusion models, autoregressive video
## TL;DR
This paper transfers the test-time scaling (TTS) paradigm from LLMs to video generation by reformulating TTS as a search problem over trajectories from Gaussian noise space to the target video distribution. It proposes the Tree-of-Frames (ToF) search algorithm for efficient inference-time compute scaling, achieving consistent quality improvements across diverse video generation models on VBench.
## Background & Motivation
### State of the Field
Video generation has seen remarkable progress through training-time scaling (more data, larger models, more compute), yet the cost of further scaling has become prohibitive. Meanwhile, the LLM community (e.g., DeepSeek-R1, OpenAI o1) has demonstrated that TTS can substantially improve model performance by increasing inference-time computation.
### Limitations of Prior Work
Prohibitive training-time scaling costs: Training video generation models demands enormous data and computational resources, with scaling costs far exceeding those of LLMs.
Unexplored inference-time potential: Virtually no systematic study has investigated how increasing inference-time computation can improve video generation quality.
Domain-specific challenges: Unlike LLM token sequences, video requires simultaneous spatial quality and temporal coherence; the multi-step denoising process of diffusion models further complicates compute scaling.
### Core Problem
To what extent can video generation quality be improved when models are granted additional inference-time computation?
### Starting Point
The paper reformulates video generation TTS as a search problem: finding better trajectories through the space from Gaussian noise to the target video distribution, guided by test-time verifiers and heuristic search algorithms.
## Method
### Overall Architecture
The Video-T1 framework comprises three core components:

- Video Generator \(\mathcal{G}\): a pretrained model that generates video conditioned on text
- Test-Time Verifier \(\mathcal{V}\): a multimodal evaluation model that assesses generated video quality
- Heuristic Search Algorithm \(f\): an optimization method that uses verifier feedback to guide the search
Two search strategies are proposed: Random Linear Search and Tree-of-Frames (ToF) Search.
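Stated compactly (a paraphrase of the paper's search formulation using the notation above, not a formula quoted from it), both strategies optimize the same objective:

\[
\hat{x} = \mathcal{G}(z^{*}, c), \qquad z^{*} = \arg\max_{z \sim \mathcal{N}(0, I)} \mathcal{V}\big(\mathcal{G}(z, c),\, c\big)
\]

where \(c\) is the text prompt; the search algorithm \(f\) determines which noise candidates \(z\) are proposed, expanded, and pruned.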
### Key Designs
#### 1. Random Linear Search
- Function: Samples \(N\) Gaussian noise vectors, performs complete denoising for each to generate \(N\) videos, and selects the one with the highest verifier score (Best-of-N strategy).
- Mechanism: Can be viewed as a forest of \(N\) degenerate trees (each with a single path), from which the optimal path is selected.
- Time Complexity: \(O(TN)\), where \(T\) is the number of frames and \(N\) is the number of noise samples.
- Limitations: The linear structure provides no intermediate feedback; each of the \(N\) independent trees must be fully denoised before the verifier can score it, so computation is wasted on trajectories that are poor from the start.
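A minimal sketch of this strategy in Python, assuming hypothetical `generator` / `verifier` objects standing in for \(\mathcal{G}\) and \(\mathcal{V}\) (not the paper's actual API):

```python
import torch

def random_linear_search(prompt: str, generator, verifier, n: int = 7):
    """Best-of-N: fully denoise N independent noise samples, keep the best.

    `generator.denoise` and `verifier.score` are assumed interfaces for
    the pretrained video model G and the reward model V, respectively.
    """
    best_video, best_score = None, float("-inf")
    for _ in range(n):
        noise = torch.randn(generator.latent_shape)  # independent Gaussian sample
        video = generator.denoise(noise, prompt)     # full T-frame generation: O(T)
        score = verifier.score(video, prompt)        # feedback arrives only at the end
        if score > best_score:
            best_video, best_score = video, score
    return best_video                                # total cost O(T * N)
```

Because the verifier is consulted only after complete denoising, no partial trajectory can be rejected early; ToF search removes exactly this inefficiency.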
#### 2. Tree-of-Frames (ToF) Search
- Function: Exploits the frame-by-frame generation property of autoregressive models to introduce inference-time reasoning along the temporal dimension, adaptively expanding and pruning video branches via a tree structure.
- Mechanism: Generation proceeds in three stages:
- (a) Image-level alignment: The first frame is aligned with the text prompt on core semantics (color, object count, spatial layout), influencing all subsequent frames.
- (b) Hierarchical prompting: Intermediate frames focus on motion stability and physical plausibility, with the verifier's evaluation emphasis adjusted dynamically.
- (c) Holistic evaluation: The final stage assesses overall video quality and text alignment.
- Three Core Techniques:
- Image-level alignment: Incrementally evaluates frame quality during denoising, enabling early rejection of low-quality candidates.
- Hierarchical prompting: Different verifier prompts are used at different stages — the first frame emphasizes semantic consistency, intermediate frames emphasize motion continuity, and the final frames emphasize overall quality.
- Heuristic pruning: Retains the top-\(k_t\) nodes at each timestep with a dynamic branching factor \(b_t\), balancing exploration and convergence.
- Complexity: In practice, \(b_t = 1\) for most timesteps and branching occurs only at stage transitions, reducing complexity to \(O(N+T)\).
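A minimal sketch of ToF search for a frame-by-frame generator; `generator.extend`, `verifier.score`, and `verifier.stage_prompt` are hypothetical interfaces, and the stage boundaries and branching factors are illustrative rather than the paper's settings:

```python
import heapq

def tree_of_frames_search(prompt, generator, verifier,
                          total_frames=49, stage_ends=(0, 40, 48),
                          branch=3, keep=2):
    """Tree search over partial videos with stage-wise verification."""
    beams = [[]]                                        # each beam: frames generated so far
    for t in range(total_frames):
        b_t = branch if t in stage_ends else 1          # branch only at stage transitions,
        candidates = [generator.extend(frames, prompt)  # so b_t = 1 for most timesteps
                      for frames in beams for _ in range(b_t)]
        if t in stage_ends:                             # hierarchical prompting: verifier
            stage_prompt = verifier.stage_prompt(t)     # emphasis depends on the stage
            scored = [(verifier.score(f, stage_prompt), f) for f in candidates]
            candidates = [f for _, f in                 # heuristic pruning: keep top-k_t
                          heapq.nlargest(keep, scored, key=lambda s: s[0])]
        beams = candidates
    # final stage: holistic evaluation over the surviving full-length candidates
    return max(beams, key=lambda f: verifier.score(f, prompt))
```

Since the beam width stays bounded by \(k_t\) and branching occurs at only a constant number of stage transitions, total generator calls grow as \(O(N + T)\) rather than \(O(TN)\).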
#### 3. Multi-Verifier Ensemble
- Function: Aggregates scores from multiple verifiers to select the final video.
- Core Formula: \(\hat{i} = \arg\max_{1 \le i \le N} \frac{1}{|\mathcal{M}|} \sum_{v \in \mathcal{M}} c_v\, \text{Rank}_v(f^{(i)})\), where \(\mathcal{M}\) is the verifier set, \(c_v\) the weight of verifier \(v\), and \(\text{Rank}_v(f^{(i)})\) the rank of candidate video \(f^{(i)}\) under verifier \(v\) (higher rank = better score).
- Design Motivation: Different verifiers capture different evaluation dimensions; ensemble aggregation reduces individual bias and selects the globally optimal video.
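A sketch of this rank aggregation in Python with NumPy; the score matrix and uniform weights below are illustrative, not the paper's values:

```python
import numpy as np

def ensemble_select(scores: np.ndarray, weights: np.ndarray) -> int:
    """Pick the video maximizing the weighted mean rank across verifiers.

    `scores[v, i]` is verifier v's raw score for candidate f^(i) and
    `weights[v]` is the coefficient c_v. Ranking per verifier means raw
    score scales are never compared directly across verifiers. (A sketch
    of the ensemble formula above, not the authors' implementation.)
    """
    # argsort of argsort converts raw scores to 0-based ranks per verifier
    ranks = scores.argsort(axis=1).argsort(axis=1)       # shape (|M|, N)
    mean_rank = (weights[:, None] * ranks).mean(axis=0)  # (1/|M|) sum_v c_v Rank_v
    return int(mean_rank.argmax())                       # selected index i_hat

scores = np.array([[0.2, 0.9, 0.5],    # e.g., VisionReward
                   [0.6, 0.7, 0.1],    # e.g., VideoScore
                   [0.3, 0.8, 0.4]])   # e.g., VideoLLaMA3
print(ensemble_select(scores, np.ones(3)))  # -> 1
```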
### Verifier Selection
Three multimodal reward models are employed: VisionReward (human preference across 29 dimensions), VideoScore (multi-dimensional LMM-based scoring), and VideoLLaMA3 (a state-of-the-art multimodal understanding model). VBench serves as the ground-truth verifier for measuring the performance upper bound.
## Key Experimental Results
### Main Results (VBench Total Score Improvement)
| Model | w/o TTS | +TTS | Gain% | Semantic Gain% |
|---|---|---|---|---|
| CogVideoX-5B | 81.61 | 84.42 | +3.44% | +10.1% |
| CogVideoX-2B | 80.91 | 83.89 | +3.68% | +3.38% |
| OpenSora-v1.2 | 79.76 | 81.65 | +2.37% | +9.87% |
| Pyramid-Flow (FLUX) | 81.61 | 86.51 | +5.86% | +18.6% |
| Pyramid-Flow (SD3) | 81.72 | 85.31 | +4.39% | +13.8% |
| NOVA | 78.56 | 79.80 | +1.58% | +2.43% |
Dimension-level highlights (Pyramid-Flow FLUX):

- Multiple Objects: 61.08 → 88.93 (+45.6%)
- Scene: 47.65 → 56.07 (+17.7%)
- Object Class: 93.49 → 99.69 (+6.6%)
### Ablation Study
| Configuration | Description | Outcome |
|---|---|---|
| ToF vs. Linear Search | Comparison under equal compute budget | ToF achieves comparable performance with ~70% fewer GFLOPs |
| Single vs. Multi-Verifier | VisionReward / VideoScore / VideoLLaMA3 | Ensemble further raises the TTS curve |
| Small model (NOVA 0.6B) vs. Large model (CogVideoX-5B) | Effect of model scale on TTS gains | Larger models benefit substantially more from TTS |
| Different generation dimensions | Scene / Object / Motion, etc. | Explicit semantic dimensions show large gains; Motion Smoothness shows limited improvement |
GFLOPs comparison (\(N=7\)):
| Model | Linear Search | ToF Search | ToF / Linear |
|---|---|---|---|
| Pyramid-Flow (FLUX) | 5.22×10⁷ | 1.62×10⁷ | 31% |
| Pyramid-Flow (SD3) | 3.66×10⁷ | 1.13×10⁷ | 31% |
| NOVA | 4.02×10⁶ | 1.41×10⁶ | 35% |
### Key Findings
- TTS consistently improves quality across all video generation models, eventually converging to an upper bound.
- Stronger base models benefit more from TTS: CogVideoX-5B achieves larger gains and higher efficiency than NOVA.
- ToF search is far more efficient than linear search: it achieves comparable or superior performance at approximately one-third of the computation.
- Semantic alignment dimensions show the most significant gains (Object Class +19.5%, Scene +18.6%), while implicit attributes such as motion smoothness show limited improvement.
- Multi-verifier ensemble further raises the TTS performance ceiling.
## Highlights & Insights
- Paradigm transfer: This is the first systematic adaptation of TTS from LLMs to video generation, establishing a general search framework for the domain.
- Elegant ToF design: Leveraging the frame-by-frame nature of autoregressive models, tree search is constructed along the temporal dimension, reducing complexity from \(O(TN)\) to \(O(N+T)\).
- Hierarchical prompting strategy: Attending to different quality dimensions at different stages mirrors human multi-level judgment of video quality.
- Verifiers as the critical bottleneck: The framework's performance ceiling is determined by verifier quality, underscoring the importance of advancing video evaluation models.
- Implicit conclusion: Inference-time computation may be more cost-effective than additional training — offering a new optimization pathway beyond training-time scaling.
## Limitations & Future Work
- ToF is limited to autoregressive models: Full-frame diffusion models (e.g., CogVideoX) can only use linear search and cannot benefit from ToF's efficiency advantages.
- Verifier capability constrains the upper bound: Current verifiers are insufficient for evaluating motion smoothness, temporal flickering, and related dimensions, limiting gains on these metrics.
- Inference overhead remains substantial: Even with ToF's efficiency improvements, inference costs in large-scale deployment far exceed those of single-pass generation.
- More sophisticated search strategies unexplored: Techniques common in LLM TTS such as MCTS and beam search have not been investigated.
- Evaluation limited to VBench: Human preference studies are absent from the evaluation.
## Related Work & Insights
- OpenAI o1 / DeepSeek-R1 established the immense potential of TTS in LLMs; this paper transfers that insight to visual generation.
- TTS for image generation provides direct inspiration, though the temporal dimension of video introduces new challenges.
- NOVA, Pyramid-Flow, and other autoregressive video models are naturally compatible with ToF's tree-search structure.
- Verifier quality is the decisive factor in TTS effectiveness — improvements to video evaluation models (e.g., VideoLLaMA3) directly translate into stronger TTS performance.
## Rating
- Novelty: ⭐⭐⭐⭐⭐ — First systematic introduction of TTS to video generation; the ToF search design is both novel and efficient.
- Experimental Thoroughness: ⭐⭐⭐⭐ — Six generation models, three verifiers, and multi-dimensional analysis; human evaluation is absent.
- Writing Quality: ⭐⭐⭐⭐ — Concepts are clearly presented, the framework is well-structured, and algorithm pseudocode is rigorous.
- Value: ⭐⭐⭐⭐⭐ — Provides a fundamentally new optimization pathway for video generation beyond training-time scaling, with broad implications.