Generating, Fast and Slow: Scalable Parallel Video Generation with Video Interface Networks¶
Conference: ICCV 2025 arXiv: 2503.17539 Code: None Area: Video Generation Keywords: video generation, diffusion transformer, parallel inference, temporal consistency, long video
TL;DR¶
This paper proposes Video Interface Networks (VINs), an abstraction module analogous to "fast thinking": at each diffusion step, the VIN encodes the long video into a fixed-size set of global tokens that guide a DiT in denoising multiple video chunks in parallel, enabling efficient and temporally consistent long video generation.
Background & Motivation¶
- Diffusion Transformers (DiTs) can generate high-quality short videos, but extending them to long videos faces a quadratic complexity bottleneck.
- Full-attention approaches on long videos lead to motion stagnation and repetition.
- Autoregressive approaches (chunk-by-chunk generation) suffer from catastrophic forgetting, subject inconsistency, and temporal incoherence.
- Existing parallel methods (e.g., FreeNoise, FreeLong) rely on handcrafted templates (noise rescheduling, frequency-band filtering) as consistency priors, capturing only shallow visual features and lacking deep semantic abstraction.
- The work is inspired by Kahneman's dual-process theory of human cognition: System 1 (fast intuition) + System 2 (slow reasoning). DiTs operate solely as System 2 and lack the global abstraction capability of System 1.
Method¶
Overall Architecture¶
At each diffusion timestep: (1) the VIN encodes global semantics from the noisy input into fixed-size global tokens; (2) the DiT uses these global tokens to denoise each video chunk in parallel; (3) overlapping regions maintain consistency via token fusion. The VIN and DiT are jointly trained end-to-end.
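To make this flow concrete, below is a minimal PyTorch sketch of one denoising step with a toy VIN built from standard attention layers. All module names, shapes, and hyperparameters are illustrative assumptions, not the authors' implementation; for brevity, overlaps are merged by uniform averaging here, while the paper's linear-ramp token fusion is sketched under Key Designs.

```python
import torch
import torch.nn as nn

class ToyVIN(nn.Module):
    """Toy Video Interface Network (illustrative, not the paper's code):
    fixed-size learnable global tokens, written to by cross-attention over
    video tokens, then refined by self-attention blocks that also see the
    text prompt embeddings."""
    def __init__(self, d=256, n_global=512, n_heads=8, n_blocks=4):
        super().__init__()
        self.z_init = nn.Parameter(torch.randn(n_global, d))
        self.encoder = nn.MultiheadAttention(d, n_heads, batch_first=True)
        self.blocks = nn.ModuleList(
            [nn.MultiheadAttention(d, n_heads, batch_first=True)
             for _ in range(n_blocks)])

    def forward(self, video_tokens, text_tokens):
        # Global tokens (queries) read from the video tokens (keys/values).
        z = self.z_init.expand(video_tokens.shape[0], -1, -1)
        z, _ = self.encoder(z, video_tokens, video_tokens)
        # Iterative refinement, with text embeddings mixed into the context.
        for blk in self.blocks:
            ctx = torch.cat([z, text_tokens], dim=1)
            z = blk(ctx, ctx, ctx, need_weights=False)[0][:, : z.shape[1]]
        return z  # [B, n_global, d] -- size independent of video length

def denoise_step(x_t, text_tokens, vin, dit, chunk=20, overlap=8):
    """One diffusion step over a long latent video x_t of shape [B, F, d]
    (frames flattened to tokens for simplicity): encode global tokens once,
    denoise overlapping chunks in parallel, average in the overlaps."""
    z_t = vin(x_t, text_tokens)                    # fixed-size global tokens
    eps = torch.zeros_like(x_t)
    hits = torch.zeros(x_t.shape[1], device=x_t.device)
    for s in range(0, x_t.shape[1] - overlap, chunk - overlap):
        piece = x_t[:, s:s + chunk]                # chunks can run in parallel
        eps[:, s:s + piece.shape[1]] += dit(piece, z_t)
        hits[s:s + piece.shape[1]] += 1
    return eps / hits.clamp(min=1).view(1, -1, 1)  # uniform overlap average
```

For example, `denoise_step(torch.randn(1, 40, 256), torch.randn(1, 16, 256), ToyVIN(), lambda c, z: torch.randn_like(c))` exercises the loop end-to-end with a stand-in DiT.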
Key Designs¶
- Video Interface Network (VIN): The VIN consists of three components:
- Global Tokens: Fixed-size learnable embeddings \(Z_{init} \in \mathbb{R}^{N_{global} \times d}\) (512 tokens, dimension 4096), independent of the input.
- VIN Encoder: Samples keyframes from the input video at \(T_s = 1.0\) second intervals and encodes video information into the global tokens via cross-attention (global tokens as queries, video tokens as key-values).
- VIN Processor: Four self-attention blocks (32 heads) that iteratively refine the global tokens while incorporating text prompt embeddings.
- Core advantage: the number of global tokens is fixed and does not grow with video length, decoupling the VIN's computation from input size and enabling scaling to arbitrarily long videos.
- End-to-End Joint Training Objective: The noise distribution is factored as a product of conditional distributions over the chunks: \(P_\theta(\epsilon_t|X_t,t,Z_t) = \prod_i P_\theta(\epsilon_t^i | X_t^i, t, Z_t)\), giving the loss \(\mathcal{L}_{\alpha,\theta} = \mathbb{E}[\sum_i \|\epsilon_\theta([X_t^i, Z_t], t) - \epsilon_t^i\|^2]\). Each chunk additionally receives a local context of the last \(F_{local}=8\) frames from the preceding chunk, with stop-gradient so that gradients do not cross chunk boundaries (a minimal sketch of this objective follows the Loss & Training list below).
- Token Fusion at Inference: Overlapping regions between adjacent chunks are merged via weighted averaging: \(\hat{\epsilon}_t^{\text{fused}}[k] = \frac{(F_{local} - \mathcal{W}(k))\,\hat{\epsilon}_t^i[k] + \mathcal{W}(k)\,\hat{\epsilon}_t^{i+1}[k]}{F_{local}}\), where \(\mathcal{W}(k)\) denotes frame \(k\)'s relative temporal position within the overlap. An early-fusion strategy (\(t > t_\alpha = 20\)) is adopted, since fusing in the early stages of the sampling chain yields the best results.
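A minimal sketch of this fusion rule, assuming noise predictions shaped `[B, frames, d]` with the frame axis at dimension 1 and \(\mathcal{W}(k) = k\), the frame's index within the overlap; this is an illustrative reading of the formula, not the authors' code:

```python
import torch

def fuse_overlap(eps_i, eps_next, f_local=8):
    """Linearly cross-fade two adjacent chunks' noise predictions over
    their f_local overlapping frames, then stitch them into one sequence."""
    a = eps_i[:, -f_local:]     # tail of chunk i (overlap region)
    b = eps_next[:, :f_local]   # head of chunk i+1 (same frames)
    w = torch.arange(f_local, dtype=a.dtype, device=a.device)
    w = w.view(1, -1, 1)        # broadcast over batch and channels
    fused = ((f_local - w) * a + w * b) / f_local
    return torch.cat([eps_i[:, :-f_local], fused, eps_next[:, f_local:]], dim=1)
```

With `f_local = 8`, the first overlap frame is taken entirely from chunk \(i\) (weight 8/8), and the weight shifts linearly toward chunk \(i+1\) across the overlap.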
Loss & Training¶
- Training data: 840,000 annotated videos, mixed at 64/128/256 frames (20/40/80 latent frames).
- Chunk size \(F_{chunk}=20\) latent frames, local context \(F_{local}=8\) latent frames, 512 global tokens.
- Inference: 50-step reverse diffusion, local context extended to \(F_{local}=12\), early-fusion cutoff \(t_\alpha=20\).
- Backbone: a pretrained latent video DiT based on a modified Open-Sora, with a 3D VAE encoding 16 frames into 5 latent frames, resolution 192×320, 16 FPS.
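A sketch of the factored training objective with stop-gradient local context follows; this is one plausible reading of the paper's description (the detached context slice, the model signature, and the frame-major layout are assumptions for illustration):

```python
import torch

def chunk_loss(model, x_t, eps_true, t, z_t, f_chunk=20, f_local=8):
    """Sum of per-chunk denoising losses. Each chunk after the first is
    prepended with the last f_local frames of its predecessor, detached so
    no gradient crosses the chunk boundary; model(x, t, z) predicts noise
    per frame, with the frame axis at dimension 1."""
    loss = x_t.new_zeros(())
    for s in range(0, x_t.shape[1], f_chunk):
        chunk = x_t[:, s:s + f_chunk]
        n = chunk.shape[1]
        if s > 0:
            ctx = x_t[:, s - f_local:s].detach()  # stop-gradient local context
            chunk = torch.cat([ctx, chunk], dim=1)
        pred = model(chunk, t, z_t)[:, -n:]       # score only this chunk's frames
        loss = loss + ((pred - eps_true[:, s:s + n]) ** 2).mean()
    return loss
```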
Key Experimental Results¶
Main Results¶
VBench Long Evaluation (higher is better; Dynamic Degree requires balance):
| Method | Subject Consistency | Background Consistency | Behavior |
|---|---|---|---|
| Full Attention | Degrades with length | High, but Dynamic Degree drops sharply | Motion stagnation |
| Autoregressive | Lower than VIN | Lower than VIN | Catastrophic forgetting |
| StreamingT2V | Lowest | Lowest | Insufficient memory module |
| FreeNoise | Medium | Medium | Shallow prior |
| Spectral Blending | Medium | Medium | Limited frequency-domain filtering |
| VIN (Ours) | Highest | Highest | Preserves dynamics |
Optical Flow Analysis (MAWE↓):
| Method | 64 frames | 128 frames | 256 frames | 512 frames |
|---|---|---|---|---|
| Autoregressive | ~2.5 | ~3.0 | ~3.5 | ~4.5 |
| FreeNoise | ~2.0 | ~2.5 | ~3.0 | ~4.0 |
| Full Attention | ~1.5 | ~2.0 | ~2.5 | ~3.5 |
| VIN | ~1.0 | ~1.1 | ~1.5 | <2.0 |
Ablation Study¶
| Configuration | MAWE↓ | Scene Cuts↓ |
|---|---|---|
| Full Model | 1.09 | 0.21 |
| w/o Global Tokens | 1.69 | 0.33 |
| w/o Token Fusion | 1.13 | 1.00 |
| Mid Fusion | 1.11 | 0.33 |
| Late Fusion | 1.22 | 0.74 |
| Local context 8 frames | 1.51 | 0.24 |
| Local context 10 frames | 1.17 | 0.22 |
| Keyframe interval 0.5 s | 1.14 | 0.34 |
| Keyframe interval 0.2 s | 1.21 | 0.29 |
Key Findings¶
- Global tokens are the core component: removing them raises MAWE from 1.09 to 1.69, the most severe degradation observed.
- Early fusion is the most effective: consistent with the intuition that diffusion models form object structure in the early stages of sampling.
- Dense keyframe sampling is not beneficial: \(T_s = 0.2s\) actually underperforms \(T_s = 1.0s\), suggesting the VIN's semantic encoding already suppresses temporal redundancy.
- VIN reduces FLOPs by 25–40% and achieves 40–75% speedup compared to full attention, with only a marginal increase in memory.
- In user studies, human evaluators prefer VIN over the baselines for both overall appearance and temporal consistency (VIN loses fewer than 30% of head-to-head comparisons).
- VIN attention heads exhibit semantic specialization: different heads focus on human bodies, architectural structures, objects, etc.
Highlights & Insights¶
- An elegant dual-system analogy: the VIN serves as System 1 for global abstraction and the DiT as System 2 for local refinement, much as a painter first blocks in the overall composition before refining local detail.
- End-to-end training: compared to template-based methods (FreeNoise/FreeLong), the consistency prior is learned rather than manually designed, yielding a more principled formulation.
- Dynamically computed representations: global tokens are recomputed at every diffusion timestep rather than serving as static anchors, so the global abstraction tracks the sample as it is denoised.
- Stop-gradient design: gradients are not propagated across shared chunk boundaries, preventing inter-chunk interference.
Limitations & Future Work¶
- The VIN learns representations solely through the generation objective, without supervision from downstream tasks (e.g., segmentation, depth estimation).
- Modalities beyond the original patch input (e.g., depth, 3D information) remain unexplored.
- Resolution is limited to 192×320; practical applications require scaling to higher resolutions.
- The token fusion mechanism is relatively straightforward, and more effective fusion strategies may exist.
Related Work & Insights¶
- Inspired by Recurrent Interface Networks (RINs), the method decouples semantic encoding from per-pixel denoising.
- Compared to the long-term memory module in StreamingT2V, VIN's global tokens are dynamic and cover the entire video.
- Unlike the shallow priors in FreeNoise/Spectral Blending, VIN learns deep semantic representations.
- The global tokens exhibit potential as general-purpose feature representations that could be integrated with video editing and video understanding tasks.
Rating¶
- Novelty: ⭐⭐⭐⭐⭐ A dual-system paradigm for parallel video generation; the combination of global tokens and DiT is highly original.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Comprehensive evaluation covering VBench, optical flow, scene cuts, user studies, and ablations.
- Writing Quality: ⭐⭐⭐⭐ Clear structure, rich figures, and an intuitive dual-system analogy.
- Value: ⭐⭐⭐⭐⭐ Provides a scalable new paradigm for long video generation with strong practical significance.