Arbitrary Generative Video Interpolation¶
- Conference: ICLR 2026
- arXiv: 2510.00578
- Code: Project Page
- Area: Video Understanding / Video Generation
- Keywords: Video Frame Interpolation, Generative VFI, RoPE, Temporal Conditioning, Any-length Generation
TL;DR¶
ArbInterp proposes a generative video frame interpolation framework supporting arbitrary timestamps and arbitrary sequence lengths. It achieves precise temporal control via Timestamp-aware Rotary Position Embedding (TaRoPE) and enables seamless long-sequence stitching through an appearance-motion decoupled conditioning strategy.
Background & Motivation¶
Video Frame Interpolation (VFI) is a fundamental task in video generation: given a start frame and an end frame, the goal is to synthesize intermediate transition frames. In recent years, diffusion-based generative VFI methods (e.g., DynamiCrafter, TRF, GI) have demonstrated strong capability in producing high-quality intermediate frames.
However, existing generative VFI methods suffer from two critical limitations:
Fixed interpolation count: Current methods can only generate a fixed number of intermediate frames in a single pass (e.g., 7 or 15 frames), offering no flexibility in adjusting frame rate or total sequence duration. A user may need 2× interpolation (inserting 1 frame) or 32× interpolation (inserting 31 frames), yet no unified model handles both.
Incoherence in long sequences: When a large number of intermediate frames is required (e.g., 32× interpolation), direct generation of long sequences faces both memory constraints and quality degradation. Segment-wise generation is a natural remedy, but maintaining spatiotemporal coherence across segments is non-trivial, often resulting in unnatural motion or inconsistent appearance.
ArbInterp aims to build a unified generative VFI framework that simultaneously addresses the challenges of arbitrary timestamps and arbitrary sequence length.
Method¶
Overall Architecture¶
ArbInterp is built upon a video diffusion model. The overall pipeline operates at two levels:
- Single-segment interpolation: Given boundary frames and a target timestamp sequence, TaRoPE precisely controls the temporal position of each generated frame.
- Long-sequence interpolation: The full sequence is decomposed into multiple segments, with an appearance-motion decoupled conditioning strategy ensuring spatiotemporal consistency across segment boundaries.
Key Designs¶
- Timestamp-aware Rotary Position Embedding (TaRoPE)
    - Function: Enables the model to perceive and generate frames at arbitrary continuous timestamp positions.
    - Mechanism: Conventional positional encodings map frame positions to fixed integer indices (0, 1, 2, …). TaRoPE instead modulates the target normalized timestamps (e.g., 0.25, 0.5, 0.75) into the rotation angles of RoPE: along the temporal dimension, discrete position indices are replaced by the target timestamps, aligning rotation angles with continuous time.
    - Design Motivation: Fixed-position paradigms (e.g., integer-indexed RoPE in DynamiCrafter) can only generate uniformly spaced frames and cannot handle non-uniform timestamps (e.g., generating only the frame at \(t=0.3\)). TaRoPE turns the position encoding from discrete jumps into a continuous, controllable representation, allowing a single model to cover any interpolation factor from 2× to 32× (a minimal code sketch follows this list).
- Segment-wise Frame Synthesis
    - Function: Decomposes a long sequence (e.g., the 31 frames of 32× interpolation) into multiple short segments generated sequentially.
    - Mechanism: Each segment generates a subset of frames; the last frame of the preceding segment serves as the initial boundary condition for the next. The core challenge is ensuring smooth transitions between segments.
    - Design Motivation: Directly generating ultra-long sequences is constrained by GPU memory and suffers quality degradation. Segment-wise generation is computationally feasible but requires additional mechanisms to guarantee inter-segment coherence.
- Appearance-Motion Decoupled Conditioning
    - Function: Handles inter-segment appearance consistency and motion continuity separately.
    - Mechanism:
        - Appearance conditioning: The endpoint frames (start and end) of the preceding segment serve as visual conditions, enforcing appearance consistency between the newly generated segment and prior content.
        - Motion conditioning: Normalized timestamps (temporal semantic information) maintain global motion coherence, ensuring that velocity and direction transition naturally across segments.
    - Design Motivation: Using only the last frame of the preceding segment as a condition can cause appearance drift or motion discontinuity. Decoupling appearance and motion signals and injecting them separately affords more precise control over both dimensions of continuity (a scheduling sketch also follows this list).
Loss & Training¶
The paper does not elaborate on loss function details in the abstract or project page. As a diffusion-based approach, the core training strategy includes:
- the standard denoising diffusion loss;
- random sampling of diverse timestamps during training, so the model learns to handle arbitrary temporal positions via TaRoPE;
- training with segments of varying lengths and interpolation factors to improve generalization.
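Since the paper leaves these details open, the following is only a hedged sketch of what one such training step could look like, assuming an epsilon-prediction DDPM-style objective and a hypothetical `model` signature that consumes boundary frames plus TaRoPE timestamps; none of this is confirmed beyond "standard denoising diffusion loss".

```python
import torch
import torch.nn.functional as F

def training_step(model, clip: torch.Tensor, alphas_cumprod: torch.Tensor):
    """One step of the assumed objective.

    clip: (B, T, C, H, W) ground-truth frames; the first and last frames act
    as boundary conditions, and the T-2 frames in between are the targets.
    """
    B, T = clip.shape[:2]
    # Random, possibly non-uniform timestamps in (0, 1) per sample, sorted so
    # the model sees arbitrary temporal positions through TaRoPE.
    timestamps, _ = torch.sort(torch.rand(B, T - 2), dim=1)
    target = clip[:, 1:-1]
    t = torch.randint(0, len(alphas_cumprod), (B,))          # diffusion step
    a = alphas_cumprod[t].view(B, 1, 1, 1, 1)
    noise = torch.randn_like(target)
    noisy = a.sqrt() * target + (1 - a).sqrt() * noise       # forward process
    # Hypothetical signature: noisy frames, boundary frames, timestamps, step.
    pred = model(noisy, clip[:, 0], clip[:, -1], timestamps, t)
    return F.mse_loss(pred, noise)                           # denoising loss
```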
Key Experimental Results¶
Evaluation Benchmarks¶
The authors construct two comprehensive benchmarks:
- MultiInterp Benchmark: Multi-scale frame interpolation evaluation (2×, 4×, 8×, 16×, 32×), assessing model generalization across different interpolation factors.
- StreamInterp Benchmark: Streaming/long-sequence interpolation evaluation, testing spatiotemporal coherence in segment-wise generation.
Main Results¶
| Method | 2× Quality | 8× Quality | 16× Quality | 32× Quality | Method Characteristic |
|---|---|---|---|---|---|
| DynamiCrafter | Baseline | Baseline | Baseline | Baseline | Fixed frame count |
| TRF | Comparison | Comparison | Comparison | Comparison | Fixed positional encoding |
| GI | Comparison | Comparison | Comparison | Comparison | Generative interpolation |
| ArbInterp | Best | Best | Best | Best | Unified model, full coverage |
According to the abstract, ArbInterp outperforms all prior methods across interpolation scenarios, demonstrating higher fidelity and more seamless spatiotemporal continuity.
Ablation Study¶
| Configuration | Key Metric | Description |
|---|---|---|
| w/o TaRoPE (fixed positional encoding) | Quality degradation | Cannot adapt to varying interpolation factors |
| w/o appearance-motion decoupled conditioning | Inter-segment incoherence | Motion discontinuity and appearance drift |
| Varying segment lengths | Efficiency–quality trade-off | Shorter segments offer more flexibility but may accumulate errors |
Key Findings¶
- Continuous controllability of TaRoPE: A single model handles arbitrary interpolation from 2× to 32× without requiring separate training for each factor.
- Necessity of decoupled conditioning: Using only the last frame of the preceding segment as a condition (without separating appearance and motion) leads to progressive quality degradation in long sequences.
- Comprehensive superiority: ArbInterp surpasses competing methods not only quantitatively but also produces more natural and fluid visual results.
Highlights & Insights¶
- Elegant TaRoPE design: Encoding continuous timestamps into RoPE is a concise yet effective design that achieves arbitrary timestamp control with virtually no additional parameters. This idea generalizes to other generation tasks requiring the continuous relaxation of discrete positional indices.
- Decoupling design philosophy: Appearance consistency and motion continuity are orthogonal requirements; handling them separately yields more controllable results than joint treatment. This principle also has implications for video editing and video continuation tasks.
- Strong practical utility: Arbitrary interpolation factor combined with arbitrary sequence length means a single model satisfies all VFI deployment needs, substantially reducing system complexity.
- Benchmark contribution: The construction of MultiInterp and StreamInterp provides a foundation for fair comparison in future work.
Limitations & Future Work¶
- Inference speed of diffusion models: Generative methods based on iterative denoising are significantly slower than traditional optical-flow methods (e.g., RIFE, IFRNet); inference time may become a bottleneck at high interpolation factors.
- Accumulated error in long segments: Despite the decoupled conditioning strategy, it remains unclear whether progressive degradation occurs at extreme sequence lengths (e.g., 64×, 128×).
- Scene diversity: Demonstrations on the project page are primarily focused on driving and motion scenes; performance under challenging conditions such as complex occlusions and scene cuts is unknown.
- Comparison with non-generative methods: The paper mainly compares against generative VFI methods; a comprehensive quantitative comparison with efficient traditional optical-flow VFI methods (e.g., RIFE) would strengthen the evaluation.
- Training data requirements: Generative models typically require large-scale video pretraining; training cost and data provenance merit attention.
Related Work & Insights¶
- Relationship to DynamiCrafter: DynamiCrafter is a representative generative VFI method, but is constrained by fixed frame positional encoding. TaRoPE in ArbInterp directly addresses this fundamental limitation.
- Relationship to TRF and GI: TRF (Time-Reversal Fusion) and GI (Generative Interpolation) also explore generative interpolation but are equally subject to fixed-length constraints.
- Extension of RoPE to the temporal dimension: The original RoPE was designed for sequence position encoding in LLMs. ArbInterp extends it to the temporal dimension of video with support for continuous values—a cross-domain technical transfer worth noting.
- Implications for video generation: Both TaRoPE and the decoupled conditioning strategy may benefit tasks beyond frame interpolation, including video prediction and video continuation.
Rating¶
- Novelty: ⭐⭐⭐⭐
- Experimental Thoroughness: ⭐⭐⭐⭐
- Writing Quality: ⭐⭐⭐⭐
- Value: ⭐⭐⭐⭐