Hierarchical Flow Diffusion for Efficient Frame Interpolation

Conference: CVPR 2025
arXiv: 2504.00380
Code: Project Page
Area: Image Generation / Video Understanding
Keywords: Video Frame Interpolation, Diffusion Model, Hierarchical Optical Flow, Coarse-to-Fine, End-to-End Training

TL;DR

This paper proposes explicitly denoising bidirectional optical flow in a coarse-to-fine manner with a hierarchical diffusion model, instead of denoising directly in the latent space, for video frame interpolation. A flow-guided image synthesizer then generates the final frame. The method achieves SOTA accuracy while running more than 10× faster than other diffusion-based methods.

Background & Motivation

Background: Video frame interpolation (VFI) aims to generate intermediate frames given two consecutive frames. Mainstream methods utilize bidirectional optical flow as an intermediate supervision signal based on the encoder-decoder paradigm. Recent diffusion-based methods model this as a latent space denoising process.

Limitations of Prior Work: (1) Non-diffusion methods (such as SGM-VFI) only produce over-smoothed mean solutions because intermediate optical flow is inherently ill-posed (multiple valid solutions exist). (2) Although diffusion methods (such as LDMVFI, CBBD) can generate sharper results, denoising directly in the latent space involves an excessively large search space, which makes them inefficient and prone to failure on complex motions and large displacements.

Key Challenge: The latent space has far more dimensions than the optical flow space (2 channels per flow direction × spatial resolution). Performing diffusion directly in the latent space is inefficient and poorly suited to modeling motion structure.

Key Insight: Optical flow has only 4 channels (2 channels for each direction), with a search space much smaller than that of the latent space. Coarse-to-fine optical flow estimation can naturally handle large-displacement motions.

Core Idea: Transfer the diffusion process from the latent space to the optical flow space, efficiently denoising the optical flow using a hierarchical coarse-to-fine strategy, and then produce the final frame through a flow-guided synthesizer.

Method

Overall Architecture

Three-stage training pipeline: (1) Stage 1 trains the flow-guided image synthesizer (encoder-decoder); (2) Stage 2 freezes the synthesizer to train the hierarchical flow diffusion model; (3) Stage 3 performs end-to-end joint fine-tuning of both the synthesizer and the diffusion model. During inference: the encoder extracts multi-scale features \(\rightarrow\) hierarchical diffusion denoises multi-scale optical flow from noise \(\rightarrow\) optical flow guides the decoder to synthesize the target frame.

Key Designs

Flow-Guided Image Synthesizer:
- Function: Synthesizing the intermediate frame from two frames given the optical flow.
- Mechanism: Multi-scale encoder-decoder architecture. At each scale, encoder features are warped by optical flow and fused with decoder features. The final output contains a blending mask \(M\) and RGB residual \(\Delta\mathbf{I}\). The synthesis equation is \(\tilde{\mathbf{I}}_t = M \odot w(\mathbf{I}_0, \tilde{f}_0) + (1-M) \odot w(\mathbf{I}_1, \tilde{f}_1) + \Delta\mathbf{I}\).
- Design Motivation: First train the synthesizer using pseudo-GT optical flow generated by a pre-trained optical flow network (UniMatch), enabling it to learn high-quality image synthesis from optical flow, thereby providing strong condition information for the subsequent diffusion model.
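
The synthesis equation above can be sketched in NumPy as follows. This is only an illustration of the blending step, not the authors' implementation: `backward_warp` and `synthesize` are hypothetical names, and the choices of bilinear backward warping, border clamping, and an (H, W, C) array layout are assumptions not stated in the summary. In the actual model the mask \(M\) and residual \(\Delta\mathbf{I}\) come from the decoder; here they are plain inputs.

```python
import numpy as np


def backward_warp(image, flow):
    """Bilinearly sample `image` (H, W, C) at pixel positions displaced by `flow` (H, W, 2).

    flow[..., 0] is the horizontal (x) displacement and flow[..., 1] the vertical (y).
    Out-of-range sample positions are clamped to the border (an assumption).
    """
    h, w, _ = image.shape
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    x = np.clip(xs + flow[..., 0], 0, w - 1)
    y = np.clip(ys + flow[..., 1], 0, h - 1)
    x0 = np.floor(x).astype(int)
    y0 = np.floor(y).astype(int)
    x1 = np.minimum(x0 + 1, w - 1)
    y1 = np.minimum(y0 + 1, h - 1)
    wx = (x - x0)[..., None]  # horizontal interpolation weight
    wy = (y - y0)[..., None]  # vertical interpolation weight
    top = (1 - wx) * image[y0, x0] + wx * image[y0, x1]
    bottom = (1 - wx) * image[y1, x0] + wx * image[y1, x1]
    return (1 - wy) * top + wy * bottom


def synthesize(i0, i1, f0, f1, mask, residual):
    """I_t = M * w(I_0, f_0) + (1 - M) * w(I_1, f_1) + delta_I, with mask of shape (H, W)."""
    m = mask[..., None]
    return m * backward_warp(i0, f0) + (1 - m) * backward_warp(i1, f1) + residual
```

With zero flow, a constant mask of 0.5, and a zero residual, `synthesize` reduces to the plain average of the two input frames, which is a quick sanity check for the blending term.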

Hierarchical Flow Diffusion:
- Function: Progressively denoising multi-scale bidirectional optical flow from Gaussian noise.
- Mechanism: The DDPM denoising steps are distributed uniformly across 3 pyramid levels (\(k_1{=}4\) to \(k_0{=}2\), i.e., 1/16, 1/8, and 1/4 of the original resolution). At each level \(i\), the U-Net denoises the optical flow conditioned on the encoder features \((\mathbf{F}_0^i, \mathbf{F}_1^i)\) of that level. At each cross-level transition, the current flow estimate is upsampled 2× and re-noised with the DDPM forward function to approximate the noisy input of the next level. The U-Net parameters are shared across levels; only the flow projector and feature projector are level-specific.
- Design Motivation: The coarse-to-fine strategy is naturally suited for handling large displacements (coarse levels capture large motions, fine levels supplement details). The optical flow space only has 4 channels, offering a much smaller search space than the latent space, making denoising more efficient.
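
The step allocation and the cross-level transition described above might look roughly like the sketch below. Several details are assumptions made for illustration and are not given in the summary: the linear beta schedule, the evenly spaced sampling timesteps, nearest-neighbour upsampling, and doubling the flow values when the resolution doubles. All function names are hypothetical, and the U-Net denoiser itself is omitted.

```python
import numpy as np


def make_alpha_bar(num_steps=1000, beta_start=1e-4, beta_end=2e-2):
    """Cumulative product of (1 - beta_t) for a linear beta schedule (assumed schedule)."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)


def split_steps(timesteps, num_levels=3):
    """Split a descending list of sampling timesteps evenly over levels, coarsest first."""
    return [chunk.tolist() for chunk in np.array_split(np.asarray(timesteps), num_levels)]


def upsample_flow_2x(flow):
    """Nearest-neighbour 2x upsampling of an (H, W, C) flow field.

    Displacements are doubled because one coarse pixel spans two fine pixels
    (a common convention, assumed here).
    """
    return 2.0 * flow.repeat(2, axis=0).repeat(2, axis=1)


def renoise(flow0, t, alpha_bar, rng):
    """DDPM forward function q(x_t | x_0) = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * eps."""
    eps = rng.standard_normal(flow0.shape)
    return np.sqrt(alpha_bar[t]) * flow0 + np.sqrt(1.0 - alpha_bar[t]) * eps


def transition(coarse_flow_estimate, next_t, alpha_bar, rng):
    """Carry a clean flow estimate to the next (finer) level as a noisy input at step next_t."""
    return renoise(upsample_flow_2x(coarse_flow_estimate), next_t, alpha_bar, rng)


# Six sampling steps spread over three levels: two steps per level.
schedule = split_steps([999, 799, 599, 399, 199, 0])
```

Here `schedule` assigns the two noisiest steps to the 1/16 level and the two cleanest to the 1/4 level. The 4-channel flow tensor (2 channels per direction) is what `transition` upsamples and re-noises between levels.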

End-to-End Joint Fine-tuning:
- Function: Jointly optimize the synthesizer and the diffusion model to eliminate the information gap of two-stage separate training.
- Mechanism: The multi-scale optical flow output by the diffusion model is used directly to warp the encoder features, which are fed into the synthesizer decoder. Training is supervised by the photometric loss on the final synthesized image, and gradients update the synthesizer and the diffusion model simultaneously.
- Design Motivation: During separate training, the synthesizer is optimized for "perfect" pseudo-GT optical flows, but the actual optical flow output by the diffusion model has prediction errors. Joint fine-tuning allows them to adapt to each other.

Loss & Training

Stage 1 (Synthesizer Training): Photometric loss \(\mathcal{L}_{photo} = \mathcal{L}_{pixel} + 0.1 \cdot \mathcal{L}_{LPIPS} + 20 \cdot \mathcal{L}_{style}\), 200 epochs, batch 64.
Stage 2 (Diffusion Training): Multi-scale optical flow L1 loss \(\mathcal{L}_{flow} = \sum_i \left( \|\tilde{f}_0^i - f_0^i\|_1 + \|\tilde{f}_1^i - f_1^i\|_1 \right)\), 200 epochs, 1000 denoising steps.
Stage 3 (Joint Fine-tuning): Photometric loss, 100 epochs, batch 32.
During inference, DDIM (\(\sigma_t{=}0\)) sampling is used with only 6 steps.
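
The two training objectives above can be written down compactly. The sketch below is illustrative only: the function names are hypothetical, the individual pixel, LPIPS, and style terms are assumed to be computed elsewhere and passed in as scalars, and the L1 norm is taken as a plain sum of absolute differences (implementations often average instead, which the summary does not specify).

```python
import numpy as np


def photometric_loss(pixel, lpips, style):
    """L_photo = L_pixel + 0.1 * L_LPIPS + 20 * L_style, using the weights quoted above."""
    return pixel + 0.1 * lpips + 20.0 * style


def multiscale_flow_l1(pred, target):
    """Sum over pyramid levels of the L1 error of both flow directions.

    `pred` and `target` are lists with one (f0, f1) pair of arrays per level.
    """
    total = 0.0
    for (pf0, pf1), (tf0, tf1) in zip(pred, target):
        total += np.abs(pf0 - tf0).sum() + np.abs(pf1 - tf1).sum()
    return float(total)
```

Note that the large weight on the style term does not by itself say which term dominates in practice, since the raw magnitudes of the three terms differ.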

Key Experimental Results

Main Results

SNU-FILM benchmark (LPIPS/FID, ↓ lower is better):

Method	easy LPIPS	hard LPIPS	extreme LPIPS	extreme FID
SGM-VFI	0.0191	0.0611	0.1182	41.078
CBBD (Diffusion)	0.0112	0.0467	0.1040	36.729
Ours	0.0098	0.0405	0.0839	27.032

Xiph-4K (High-resolution challenge):

Method	LPIPS	FID
CBBD	0.0634	24.621
Ours	0.0614	14.132

DAVIS + Vimeo-90k:

Dataset	Method	LPIPS	FID
DAVIS	CBBD	0.0919	9.220
DAVIS	Ours	0.0753	7.237
Vimeo	CBBD	0.0123	1.961
Vimeo	Ours	0.0120	1.712

Ablation Study

Configuration	SNU-FILM hard LPIPS	SNU-FILM extreme LPIPS
Vanilla (Single-scale diffusion)	0.0625	0.1199
Hierarchical diffusion (Ours)	0.0405	0.0839

Key Findings

Comprehensively outperforms the prior SOTA diffusion method CBBD and non-diffusion method SGM-VFI across all four datasets.
The advantages are especially significant in challenging scenarios (hard/extreme): extreme FID of 27.0 vs. CBBD's 36.7 (a 26% improvement).
Inference takes 0.20 s at 1024×1024 resolution, on par with the fastest non-diffusion method SGM-VFI and 10× faster than the diffusion-based CBBD.
The hierarchical strategy improves LPIPS by 35% on the hard subset compared to single-scale diffusion.
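
The two percentages quoted above follow directly from the table values. A small check, using a hypothetical helper `relative_improvement` for lower-is-better metrics:

```python
def relative_improvement(baseline, ours):
    """Percentage reduction of a lower-is-better metric relative to the baseline."""
    return (baseline - ours) / baseline * 100.0


# Extreme FID on SNU-FILM: CBBD 36.729 vs. ours 27.032
fid_gain = round(relative_improvement(36.729, 27.032), 1)   # 26.4

# Hard LPIPS on SNU-FILM: single-scale 0.0625 vs. hierarchical 0.0405
lpips_gain = round(relative_improvement(0.0625, 0.0405), 1)  # 35.2
```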

Highlights & Insights

Clever Transfer of Diffusion Target: Performing diffusion on optical flow instead of the latent space, reducing the search space from the high-dimensional latent space to 4-channel optical flow, which substantially boosts efficiency.
Natural Compatibility Between Coarse-to-Fine Hierarchy and Diffusion: Diffusion itself is a progressive process from noise to signal, which fits perfectly with the coarse-to-fine estimation style of optical flow.
Strong Results in Both Speed and Quality: Surpasses all baselines in accuracy while matching the fastest non-diffusion method in speed, challenging the assumption that diffusion-based methods must trade speed for quality.

Limitations & Future Work

Relies on a pre-trained optical flow network to provide pseudo-GT, meaning the upper bound of optical flow quality is limited by this network.
Only supports single-frame interpolation between two frames; multi-frame or arbitrary-timestep interpolation is not discussed.
Utilizes only 6-step inference sampling; whether more steps can further improve quality is not fully explored.
Future work could explore extending the hierarchical diffusion strategy to video generation or other motion-sensitive tasks.

Related Work & Insights

SGM-VFI: Non-diffusion SOTA, unifying forward/backward optical flow framework, efficient but tends to yield overly smooth results.
CBBD: Diffusion-based frame interpolation method; this work replaces its latent space diffusion with optical flow diffusion.
FlowDiffuser / DDVM: Apply diffusion to optical flow estimation, but focus on supervised settings with GT optical flow.
Insight: The hierarchical diffusion strategy can be extended to other vision tasks requiring multi-scale structured prediction.

Rating

⭐⭐⭐⭐ — The method design is simple yet effective, the motivation is clear, and the experiments are comprehensive and convincing. Shifting diffusion from the latent space to the optical flow space is a key insight, and the dual improvement in speed and quality holds practical application value.