Generative Inbetweening through Frame-wise Conditions-Driven Video Generation¶

Conference: CVPR 2025
arXiv: 2412.11755
Code: https://fcvg-inbetween.github.io/
Area: Video Generation
Keywords: Video Interpolation, Generative Inbetweening, Diffusion Models, Frame-wise Condition Control, Temporal Consistency

TL;DR¶

FCVG extracts matching line segments from the two keyframes and linearly interpolates them frame by frame to form frame-wise conditions. Injecting these conditions into the SVD video generation model substantially reduces the ambiguity between the forward and backward paths in generative inbetweening, yielding temporally stable video interpolation.

Background & Motivation¶

Generative inbetweening aims to generate intermediate frame sequences given the start and end frames. Unlike traditional optical flow-based frame interpolation, generative methods can handle large-motion scenarios but face the core challenge of interpolation path ambiguity:

Limitations of the Time Reversal strategy: Methods like TRF perform forward/backward denoising conditioned on the start and end frames respectively, followed by fusion. However, the motion generated by Image-to-Video (I2V) models exhibits diversity and randomness, leading to severe misalignment between the two paths. Consequently, the fusion results in incoherent or even completely different intermediate content.
Subsequent improvements remain insufficient: GI fine-tunes temporal attention layers, and VIBIDSampler improves fusion strategies, both attempting to align the two paths. However, obvious jitter and incoherence still persist in large-motion scenarios (e.g., rapid human movements).
Additional costs of existing solutions: Mitigation strategies such as noise re-injection significantly increase inference time (1.5-3×) and require manual parameter tuning for each pair of input images.

Core Insight: The root cause of path ambiguity is the lack of explicit conditional guidance for intermediate frames. Only the start and end frames have conditions, leaving the motion of intermediate frames entirely dependent on the model's random sampling. If explicit conditions are provided for every frame, the forward and backward paths will naturally align.

Method¶

Overall Architecture¶

FCVG is based on SVD (Stable Video Diffusion) and adopts the Time Reversal fusion strategy. The core innovation is the introduction of frame-wise conditions: matching line segments are extracted from the two input frames and linearly interpolated frame by frame. The resulting conditions are injected into SVD via the lightweight ControlNeXt module, so only a small number of parameters need to be fine-tuned.

Key Designs¶

Construction of Frame-wise Conditions:
- Function: To provide explicit motion guidance for each frame and eliminate the ambiguity of the interpolation path.
- Mechanism: A pre-trained GlueStick line matching model is used to establish line segment correspondences between the start and end frames, and the matching results are visualized as colored line maps (where the same color represents matching correspondences). For human scenes, DWPose is additionally used to extract pose skeletons. Then, frame-by-frame linear interpolation is performed on the starting condition \(\mathbf{c}_1\) and ending condition \(\mathbf{c}_N\) to obtain \(\mathbf{c}_{1 \rightarrow N}\).
- Design Motivation: Line matching inherently possesses global robustness, enabling it to handle large motions and complex scenes; line maps serve as sparse, structured representations that are easy to linearly interpolate; and matched line segments naturally encode motion direction and magnitude. Although the linearity assumption is not perfectly precise, it has been widely verified in prior video interpolation works to be sufficient for ensuring temporal consistency.
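
The interpolation step described above can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes the matcher has already returned K matched segments as endpoint arrays, and the function name `interpolate_segments` and the toy coordinates are hypothetical. Rendering the interpolated segments into colored line maps is omitted.

```python
import numpy as np

def interpolate_segments(seg_start, seg_end, num_frames):
    """Linearly interpolate matched line segments across num_frames frames.

    seg_start, seg_end: arrays of shape (K, 2, 2) holding K matched segments,
    each as two (x, y) endpoints; row k in both arrays is the same match.
    Returns an array of shape (num_frames, K, 2, 2).
    """
    seg_start = np.asarray(seg_start, dtype=float)
    seg_end = np.asarray(seg_end, dtype=float)
    if seg_start.shape != seg_end.shape:
        raise ValueError("matched segment arrays must have the same shape")
    if num_frames < 2:
        raise ValueError("need at least the two keyframes")
    # t runs from 0 (first keyframe) to 1 (last keyframe).
    t = np.linspace(0.0, 1.0, num_frames).reshape(-1, 1, 1, 1)
    return (1.0 - t) * seg_start + t * seg_end

# Hypothetical toy input: one segment that shifts 8 pixels to the right.
c_1 = [[[0.0, 0.0], [4.0, 0.0]]]
c_N = [[[8.0, 0.0], [12.0, 0.0]]]
conds = interpolate_segments(c_1, c_N, num_frames=5)
```

Here `conds[0]` and `conds[-1]` reproduce the two keyframe conditions, and each intermediate entry moves every endpoint by an equal fraction of its total displacement.
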
Condition Injection Mechanism:
- Function: To integrate frame-wise conditions into SVD without compromising its pre-trained knowledge.
- Mechanism: A lightweight scheme from ControlNeXt is adopted: several ResNet blocks encode the conditions, cross normalization aligns the feature distributions of the condition branch and the SVD branch, and the two are fused by addition with an adjustable weight \(\gamma\): \(\hat{\mathbf{y}}_t = \mathbf{y}_t^{\text{SVD}} + \gamma \mathbf{y}_t^{\text{Con}}\).
- Design Motivation: ControlNeXt is more lightweight than ControlNet and does not significantly increase inference time. It only requires fine-tuning the value/output projection matrices of the attention layers in SVD and the lightweight ResNet blocks, while keeping most of the parameters frozen.
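
The additive fusion above can be illustrated with a small NumPy sketch. The form of cross normalization used here (re-standardizing the condition features to the global mean and standard deviation of the SVD features) is an assumption made for illustration; ControlNeXt's actual statistics and granularity may differ, and the function names are hypothetical.

```python
import numpy as np

def cross_normalize(y_con, y_svd, eps=1e-6):
    """Re-standardize condition features to the SVD features' statistics.

    Assumed form: global mean/std over the whole tensor, for illustration only.
    """
    y_con = np.asarray(y_con, dtype=float)
    y_svd = np.asarray(y_svd, dtype=float)
    standardized = (y_con - y_con.mean()) / (y_con.std() + eps)
    return standardized * y_svd.std() + y_svd.mean()

def fuse(y_svd, y_con, gamma=1.0):
    """Additive fusion: y_hat = y_svd + gamma * (cross-normalized y_con)."""
    return np.asarray(y_svd, dtype=float) + gamma * cross_normalize(y_con, y_svd)

# Synthetic features with deliberately mismatched scales.
rng = np.random.default_rng(0)
y_svd = rng.normal(loc=0.0, scale=1.0, size=(4, 8))
y_con = rng.normal(loc=5.0, scale=10.0, size=(4, 8))
y_con_aligned = cross_normalize(y_con, y_svd)
y_hat = fuse(y_svd, y_con, gamma=1.0)
```

Setting `gamma=0.0` returns the SVD features unchanged, which is why the weight can be lowered when the conditions are unreliable.
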
Time Reversal Fusion Strategy:
- Function: To generate intermediate frames based on bidirectional denoising fusion.
- Mechanism: The forward path is conditioned on \(I_{\text{start}}\) with \(\mathbf{c}_{1 \rightarrow N}\) as the frame-wise condition, while the backward path is conditioned on \(I_{\text{end}}\) with the time-reversed \(\mathbf{c}_{N \rightarrow 1}\) as the frame-wise condition. At each denoising step the two paths are fused with a linear per-frame weight \(\lambda_i = 1 - \frac{i-1}{N-1}\), where \(i\) is the frame index, so that frames near the start rely more on the forward path and frames near the end rely more on the backward path.
- Design Motivation: With the introduction of frame-wise conditions, the two paths are already broadly aligned, meaning simple linear fusion is sufficient. There is no need for noise re-injection, which reduces the number of inference steps from 50 to 25, yielding an approximately 2× speedup.
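
The linear fusion weights can be sketched as below. Which tensors are fused at each step (for example latents or noise predictions) is not specified in this summary, so the arrays here are generic per-frame tensors; the function names, and the convention that the backward path arrives in end-to-start order, are assumptions for illustration.

```python
import numpy as np

def fusion_weights(num_frames):
    """lambda_i = 1 - (i - 1) / (N - 1) for frame indices i = 1..N."""
    i = np.arange(1, num_frames + 1)
    return 1.0 - (i - 1) / (num_frames - 1)

def fuse_paths(forward, backward):
    """Blend forward and backward per-frame tensors.

    forward:  (N, ...) in start-to-end order.
    backward: (N, ...) in end-to-start order, as produced by the reversed path.
    """
    forward = np.asarray(forward, dtype=float)
    backward = np.asarray(backward, dtype=float)
    n = forward.shape[0]
    # Reshape weights so they broadcast over all non-frame dimensions.
    lam = fusion_weights(n).reshape((n,) + (1,) * (forward.ndim - 1))
    backward_aligned = backward[::-1]  # flip back to start-to-end order
    return lam * forward + (1.0 - lam) * backward_aligned

# Toy check: forward is all zeros, backward is all ones.
fused = fuse_paths(np.zeros((5, 2)), np.ones((5, 2)))
```

With these toy inputs the fused values rise linearly from 0 at the first frame to 1 at the last, showing that the first frame is taken entirely from the forward path and the last entirely from the backward path.
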

Loss & Training¶

The original v-prediction objective of SVD is utilized: \(\mathcal{L} = \mathbb{E}[\|\mathbf{v} - f_\theta(\mathbf{z}_t, \mathbf{c}_{\text{image}}, t)\|_2^2]\)
Only the V/O projections of the attention layers and the lightweight ResNet encoding blocks are fine-tuned.
Optimizer: AdamW with a learning rate of \(1 \times 10^{-6}\), trained for 70K iterations.
Training resolution is 512×320, and inference resolution is 1024×576.
Training data: 524 video clips of 25 frames (DAVIS + RealEstate10K + Pexels), split into a 4:1 ratio.

Key Experimental Results¶

Main Results (Frame Gap=23)¶

Method	LPIPS ↓	FID ↓	VBench ↑	FVMD ↓	FVD ↓
FCVG (Ours)	0.1832	24.05	0.8619	5607.2	437.9
GI	0.2155	31.39	0.8606	5682.6	524.0
TRF	0.3687	42.76	0.8438	10458.0	823.4
FILM (Optical Flow)	0.1540	25.00	0.8615	8208.7	543.4

Ablation Study¶

Configuration	LPIPS ↓	FID ↓	FVMD ↓	FVD ↓
Full Model	0.1832	24.05	5607.2	437.9
w/o Control	0.2485	27.55	7217.5	536.5
w/o Matching	0.2124	24.17	6546.8	498.8
w/o Pose	0.1843	24.70	5520.9	446.1

Inference Efficiency Comparison¶

Method	Resolution	Inference Steps	Time (s)
FCVG (Ours)	25×(1024,576)	25	523
GI	25×(1024,576)	50	975
TRF	25×(1024,576)	50	1230
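
As a quick check of the speedups implied by the table, the ratios can be computed directly from the reported times; the variable names are illustrative.

```python
# Inference times in seconds, copied from the table above.
times_s = {"FCVG": 523, "GI": 975, "TRF": 1230}

# Speedup of FCVG relative to each baseline.
speedup = {
    name: t / times_s["FCVG"] for name, t in times_s.items() if name != "FCVG"
}

for name, s in speedup.items():
    print(f"FCVG vs {name}: {s:.2f}x")
# prints "FCVG vs GI: 1.86x" and "FCVG vs TRF: 2.35x"
```
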

Key Findings¶

Frame-wise conditions are critical: Removing all control conditions degrades FVMD from 5607 to 7218 (+28.7%), indicating that frame-wise conditions are the main contributor to temporal stability.
Line matching matters more than pose: Removing matched line segments degrades FVMD from 5607 to 6547, whereas removing pose leaves it essentially unchanged (5521, marginally lower than the full model) while slightly worsening LPIPS, FID, and FVD. Matched line segments control global scene motion, while pose only improves fine-grained human details.
Insensitivity to control weight \(\gamma\): The model performance remains stable within the range \(\gamma \in [0.5, 2.0]\), with the default \(\gamma=1\) being applicable to most scenarios.
The model generalizes zero-shot to animations and line-art videos (data types not present in the training set).
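
The relative changes quoted in the findings above can be reproduced from the ablation table. The sketch below only recomputes percentages from the reported numbers; the helper name `relative_change` is illustrative.

```python
# Metrics copied from the ablation table (lower is better for all four).
full = {"LPIPS": 0.1832, "FID": 24.05, "FVMD": 5607.2, "FVD": 437.9}
ablations = {
    "w/o Control": {"LPIPS": 0.2485, "FID": 27.55, "FVMD": 7217.5, "FVD": 536.5},
    "w/o Matching": {"LPIPS": 0.2124, "FID": 24.17, "FVMD": 6546.8, "FVD": 498.8},
    "w/o Pose": {"LPIPS": 0.1843, "FID": 24.70, "FVMD": 5520.9, "FVD": 446.1},
}

def relative_change(value, baseline):
    """Percentage change of value relative to baseline."""
    return 100.0 * (value - baseline) / baseline

# FVMD change of each ablation relative to the full model.
fvmd_change = {
    name: relative_change(metrics["FVMD"], full["FVMD"])
    for name, metrics in ablations.items()
}
```

Rounded to one decimal place, the FVMD changes are +28.7% without control, +16.8% without matching, and -1.5% without pose.
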

Highlights & Insights¶

Extremely simple core idea: Simply providing an explicit condition for each frame fundamentally resolves path ambiguity. Using matched line segments as a sparse yet structured intermediate representation is a highly elegant design choice.
Non-linear interpolation paths: Although linear interpolation is used during training, the model supports non-linear motion curves (e.g., ease-in/ease-out) during inference, offering creative flexibility to users.
Practical speedup: By eliminating the noise re-injection step and halving the inference steps, inference takes 523 s versus 975 s for GI and 1230 s for TRF, roughly 1.9× and 2.4× faster respectively.
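
To make the non-linear interpolation paths mentioned above concrete, the sketch below replaces the uniform interpolation parameter with an eased one. The smoothstep curve is a common ease-in/ease-out function chosen here for illustration; the summary does not say which curves the authors used, and the function names are hypothetical.

```python
import numpy as np

def smoothstep(t):
    """Ease-in/ease-out curve 3t^2 - 2t^3, mapping [0, 1] onto [0, 1]."""
    t = np.clip(np.asarray(t, dtype=float), 0.0, 1.0)
    return t * t * (3.0 - 2.0 * t)

def interpolation_schedule(num_frames, easing=None):
    """Per-frame interpolation parameter in [0, 1]; linear unless eased."""
    t = np.linspace(0.0, 1.0, num_frames)
    return t if easing is None else easing(t)

linear = interpolation_schedule(5)
eased = interpolation_schedule(5, easing=smoothstep)
# linear -> [0.0, 0.25, 0.5, 0.75, 1.0]
# eased  -> [0.0, 0.15625, 0.5, 0.84375, 1.0]
```

The eased schedule starts and ends at the same keyframe conditions but moves more slowly near both ends, so the conditions built from it describe an accelerating and then decelerating motion.
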

Limitations & Future Work¶

Dependency on line-matching quality: Mismatches may occur when two frames share highly similar features (which can be mitigated by decreasing \(\gamma\)); conversely, when the difference between two frames is extremely large, the matched line segments become too sparse, and simply adjusting the control weight is insufficient.
Limitations of the linearity assumption: Albeit effective for most scenes, it remains an approximation for non-uniform motions (acceleration, deceleration, elastic motion).
Still computationally expensive: Generating 25 frames in 523 seconds is still far from real-time application, with the bottleneck residing in the pre-trained SVD itself.
Future explorations include: combining with drag-based editing (DragDiffusion) or text-guided control to generate richer conditional signals.

Related Work & Insights¶

Complementarity with optical flow-based interpolation (FILM): FILM performs better in small-motion scenarios (lower LPIPS) but suffers from severe artifacts in large-motion scenarios, whereas FCVG exhibits greater stability in large-motion scenarios.
Relationship with GI / TRF / VIBIDSampler: Each of these is based on the Time Reversal strategy. The core improvement of FCVG lies in introducing frame-wise conditions to align the bidirectional paths.
Insight: In addressing control problems within diffusion models, "providing explicit conditions for each generation unit" serves as a universally effective strategy.

Rating¶

Novelty: ⭐⭐⭐⭐ The concept of frame-wise conditions is simple yet cuts directly to the core of the problem, and line segment interpolation is a clever design.
Experimental Thoroughness: ⭐⭐⭐⭐ Comparisons across various scenarios, ablation studies, and generalization experiments are comprehensively covered, although a user study is lacking.
Writing Quality: ⭐⭐⭐⭐ The analysis of the problem is clear, and the method section is well-illustrated.
Value: ⭐⭐⭐⭐ A practical and effective solution is proposed to address the temporal consistency issue in generative video interpolation.