BiM-VFI: Bidirectional Motion Field-Guided Frame Interpolation for Video with Non-uniform Motions

Background & Motivation

Video Frame Interpolation (VFI) aims to synthesize intermediate frames between two given frames, which is widely applied in slow-motion generation, video coding, and frame rate up-conversion. Existing methods are mostly based on optical flow estimation: they first estimate forward and backward optical flows, and then synthesize the intermediate frames through warping.

However, motion fields in real-world videos are often non-uniform, meaning that motion velocities and directions vary significantly across different regions within the same frame. Typical scenarios include:

  • High-speed foreground motion + static background: such as athletes in sports events
  • Multi-object motion with different speeds: such as multiple vehicles in traffic scenes
  • Mixed rotation and translation: such as complex camera motion in handheld shooting

Traditional methods typically assume that bidirectional motions are independent, estimating the forward and backward optical flows separately. This overlooks a key piece of information: there exists an inherent geometric constraint between bidirectional optical flows. For the same 3D scene point, its projected displacements in the two frames satisfy specific mathematical relationships.

This paper proposes the BiM (Bidirectional Motion) descriptor, a compact representation that simultaneously encodes bidirectional motion relationships, along with a lightweight frame interpolation framework based on BiM.

Method

BiM Descriptor

The BiM descriptor \([R, \Phi]\) consists of two components:

Magnitude Ratio \(R\)

\[R = \frac{\|\mathbf{f}_{0 \to 1}\|}{\|\mathbf{f}_{1 \to 0}\|}\]

Here \(\mathbf{f}_{0 \to 1}\) and \(\mathbf{f}_{1 \to 0}\) denote the forward and backward optical flows, respectively, so \(R\) captures the relative speed of the two motions.

Angle Difference \(\Phi\)

\[\Phi = \angle(\mathbf{f}_{0 \to 1}) - \angle(\mathbf{f}_{1 \to 0}) - \pi\]

\(\Phi\) measures the deviation between the directions of the forward and backward optical flows. For strictly linear motion, \(\Phi = 0\); for non-linear motion (such as rotation or acceleration), \(\Phi \neq 0\).

| Motion Type | \(R\) | \(\Phi\) | Description |
|---|---|---|---|
| Uniform Translation | 1.0 | 0 | Same speed and opposite directions between frames |
| Accelerated Motion | >1.0 | 0 | Faster in the second half |
| Decelerated Motion | <1.0 | 0 | Faster in the first half |
| Curved Motion | ≈1.0 | ≠0 | Presence of directional deviation |
| Complex Non-linear | ≠1.0 | ≠0 | Both velocity and direction change |
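The two definitions above can be evaluated per pixel on a pair of dense flow fields. The following is a minimal NumPy sketch, not the authors' implementation: the function name `bim_descriptor`, the `(2, H, W)` channel layout, the small `eps` guard against division by zero, and the wrapping of \(\Phi\) into \([-\pi, \pi)\) are all assumptions made for illustration.

```python
import numpy as np

def bim_descriptor(flow_fwd, flow_bwd, eps=1e-8):
    """Per-pixel [R, Phi] from two flow fields of shape (2, H, W).

    Channel 0 holds the x-displacement and channel 1 the y-displacement
    (an assumed layout). eps is an assumed guard for zero backward flow.
    """
    # Magnitude ratio R = |f_fwd| / |f_bwd|
    mag_fwd = np.hypot(flow_fwd[0], flow_fwd[1])
    mag_bwd = np.hypot(flow_bwd[0], flow_bwd[1])
    ratio = mag_fwd / (mag_bwd + eps)

    # Angle difference Phi = angle(f_fwd) - angle(f_bwd) - pi
    ang_fwd = np.arctan2(flow_fwd[1], flow_fwd[0])
    ang_bwd = np.arctan2(flow_bwd[1], flow_bwd[0])
    phi = ang_fwd - ang_bwd - np.pi
    # Assumed convention: wrap into [-pi, pi) so exactly opposite flows give 0
    phi = np.mod(phi + np.pi, 2 * np.pi) - np.pi
    return ratio, phi

# Uniform translation: equal speed, exactly opposite directions
fwd = np.zeros((2, 4, 4))
fwd[0] = 2.0
R_lin, Phi_lin = bim_descriptor(fwd, -fwd)

# Same speed, but the backward flow deviates from the opposite direction
bwd_curved = np.zeros((2, 4, 4))
bwd_curved[1] = -2.0
R_cur, Phi_cur = bim_descriptor(fwd, bwd_curved)
```

In this sketch the first case reproduces the "Uniform Translation" row (ratio close to 1, angle difference 0), while the second keeps the ratio near 1 and yields a non-zero angle difference, as in the "Curved Motion" row.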

BiM-guided FlowNet

The BiM descriptor is injected into the optical flow estimation network as additional input channels:

\[\mathbf{f}_{t} = \text{FlowNet}(I_0, I_1, t, R, \Phi)\]

Unlike traditional methods that directly estimate the optical flow of intermediate frames, BiM-guided FlowNet conditions the estimate on the bidirectional motion relationship encoded by \(R\) and \(\Phi\), which improves optical flow accuracy in regions with non-uniform motion.
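One simple way to picture "additional input channels" is plain channel concatenation. The sketch below is an illustration under assumptions, not the paper's architecture: the helper name `build_flownet_input`, the channel order, and the broadcasting of the time step \(t\) into a constant map are all hypothetical choices.

```python
import numpy as np

def build_flownet_input(frame0, frame1, t, ratio, phi):
    """Stack both frames, a constant time map, and the BiM maps along channels.

    frame0, frame1: (3, H, W) RGB frames; ratio, phi: (H, W) descriptor maps.
    The channel order used here is an assumption for illustration only.
    """
    _, height, width = frame0.shape
    # Broadcast the scalar time step into a one-channel map
    t_map = np.full((1, height, width), t, dtype=frame0.dtype)
    return np.concatenate(
        [frame0, frame1, t_map, ratio[None], phi[None]], axis=0
    )

H, W = 8, 8
x = build_flownet_input(
    np.zeros((3, H, W)), np.ones((3, H, W)), 0.5,
    np.ones((H, W)), np.zeros((H, W)),
)
# x carries 3 + 3 + 1 + 1 + 1 = 9 channels
```

A real network may inject the descriptor at several scales or through feature modulation instead; concatenation is only the most direct reading of the description above.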

Content-Aware Upsampling Network (CAUN)

Traditional frame interpolation uses bilinear interpolation or separable convolutions to upsample warped features. This paper proposes CAUN, a content-aware upsampling module:

  • Input: low-resolution warped features, high-resolution original frames
  • Core: adaptive sampling kernel generation based on local content
  • Output: high-resolution synthesized frames

CAUN utilizes a finer sampling strategy in edge and texture regions and a larger receptive field in flat regions, achieving a balance between quality and efficiency.
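The idea of a per-pixel adaptive sampling kernel can be sketched generically as follows. This is not CAUN itself: it is a plain content-adaptive upsampling in which every output pixel blends a k×k low-resolution neighbourhood with its own softmax-normalised kernel. The function name `adaptive_upsample` and its arguments are hypothetical, and the kernel logits are supplied by the caller as a stand-in for whatever a network would predict from local content.

```python
import numpy as np

def adaptive_upsample(feat, logits, scale=2, k=3):
    """Upsample feat (C, h, w) with one k*k kernel per output pixel.

    logits: (k*k, h*scale, w*scale) unnormalised kernel weights, assumed
    to come from a content-dependent predictor (not modelled here).
    """
    channels, h, w = feat.shape
    out_h, out_w = h * scale, w * scale
    pad = k // 2
    padded = np.pad(feat, ((0, 0), (pad, pad), (pad, pad)), mode="edge")

    # Softmax over the k*k axis so each kernel sums to 1
    weights = np.exp(logits - logits.max(axis=0, keepdims=True))
    weights /= weights.sum(axis=0, keepdims=True)

    # Low-resolution source coordinate for every output pixel
    ys = np.arange(out_h) // scale
    xs = np.arange(out_w) // scale

    out = np.zeros((channels, out_h, out_w), dtype=np.float64)
    for idx in range(k * k):
        dy, dx = divmod(idx, k)
        # Neighbour at offset (dy - pad, dx - pad) in the unpadded grid
        patch = padded[:, ys[:, None] + dy, xs[None, :] + dx]
        out += weights[idx][None] * patch
    return out

low_res = np.arange(12, dtype=float).reshape(1, 3, 4)
# Kernels peaked on the centre tap behave like nearest-neighbour upsampling
peaked = np.full((9, 6, 8), -50.0)
peaked[4] = 50.0
up_sharp = adaptive_upsample(low_res, peaked)
# Uniform kernels average the 3x3 neighbourhood (a smoother result)
up_smooth = adaptive_upsample(low_res, np.zeros((9, 6, 8)))
```

Sharply peaked kernels preserve detail while flatter kernels average over a wider neighbourhood, which is the kind of trade-off between edge regions and flat regions described above.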

Knowledge Distillation (KDVCF)

To further compress the model, this paper designs the KDVCF (Knowledge Distillation for Video Content-aware Frame interpolation) strategy:

| Component | Teacher Model | Student Model |
|---|---|---|
| Backbone Network | ResNet-50 | MobileNetV3 |
| Parameters | 28.3M | 6.88M |
| Distillation Loss | - | Feature Alignment + Output Matching |
| Inference Speed | | 3.2× |

The distillation strategy includes:

  1. Feature alignment distillation: minimizing the L2 distance of intermediate-layer features
  2. Output matching distillation: matching the perceptual loss of the final synthesized frame
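A two-term distillation objective of this kind might look like the sketch below. It rests on several assumptions: the name `distillation_loss` and the weights `w_feat` and `w_out` are hypothetical, student and teacher features are assumed to have already been brought to matching shapes, the feature term uses the mean squared difference as the L2 measure, and the output term uses a plain L1 difference as a simple stand-in for the perceptual loss, which would require a pretrained feature extractor.

```python
import numpy as np

def distillation_loss(student_feats, teacher_feats, student_out, teacher_out,
                      w_feat=1.0, w_out=1.0):
    """Weighted sum of a feature-alignment term and an output-matching term.

    student_feats / teacher_feats: lists of same-shaped arrays, one per layer.
    student_out / teacher_out: synthesized frames of identical shape.
    """
    # Feature alignment: mean squared difference, summed over layers
    feat_term = sum(
        np.mean((s - t) ** 2) for s, t in zip(student_feats, teacher_feats)
    )
    # Output matching: L1 stand-in for the perceptual loss (an assumption)
    out_term = np.mean(np.abs(student_out - teacher_out))
    return w_feat * feat_term + w_out * out_term

feats_s = [np.zeros((4, 8, 8))]
feats_t = [np.ones((4, 8, 8))]
frame_s = np.zeros((3, 16, 16))
frame_t = np.full((3, 16, 16), 2.0)
loss = distillation_loss(feats_s, feats_t, frame_s, frame_t, w_feat=1.0, w_out=0.5)
```

With these toy inputs the feature term is 1.0 and the output term is 2.0, so the weighted total is 1.0 + 0.5 × 2.0.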

Experimental Results

Main Results

| Method | Parameters | Vimeo90K PSNR↑ | Vimeo90K SSIM↑ | UCF101 PSNR↑ | SNU-FILM Hard↑ |
|---|---|---|---|---|---|
| RIFE | 9.8M | 35.61 | 0.978 | 35.28 | 29.27 |
| IFRNet | 19.7M | 35.80 | 0.979 | 35.36 | 29.51 |
| AMT-S | 12.3M | 35.72 | 0.978 | 35.31 | 29.39 |
| EMA-VFI | 21.5M | 35.86 | 0.979 | 35.40 | 29.56 |
| BiM-VFI | 6.88M | 36.01 | 0.980 | 35.52 | 29.72 |

BiM-VFI achieves the best score in every column of the table while using the fewest parameters (6.88M).
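For readers unfamiliar with the PSNR columns, the metric follows the standard definition \(10 \log_{10}(\text{MAX}^2 / \text{MSE})\) in dB. The sketch below assumes images scaled to `[0, max_val]`; the function name `psnr` is a hypothetical helper, and the exact evaluation protocol behind the reported numbers (colour space, cropping, averaging) is not stated here.

```python
import numpy as np

def psnr(pred, target, max_val=1.0):
    """Peak signal-to-noise ratio in dB for images in [0, max_val]."""
    diff = pred.astype(np.float64) - target.astype(np.float64)
    mse = np.mean(diff ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(max_val ** 2 / mse)

target = np.zeros((3, 8, 8))
pred = np.full((3, 8, 8), 0.1)
score = psnr(pred, target)  # MSE = 0.01, so about 20 dB
```

Because the scale is logarithmic, a gap of a few tenths of a dB, as between the rows above, corresponds to a reduction in mean squared error of only a few percent.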

Non-uniform Motion Scenarios

On the X-TEST and Xiph-4K datasets, which contain a substantial amount of non-linear motion, the advantages of BiM-VFI are even more pronounced:

| Method | X-TEST PSNR↑ | Xiph-4K PSNR↑ |
|---|---|---|
| RIFE | 28.93 | 31.42 |
| EMA-VFI | 29.34 | 31.89 |
| BiM-VFI | 30.12 | 32.47 |

Ablation Study

| Configuration | Vimeo90K PSNR↑ | Parameters |
|---|---|---|
| Full BiM-VFI | 36.01 | 6.88M |
| w/o BiM descriptor | 35.42 | 6.85M |
| w/o CAUN (Bilinear upsampling) | 35.67 | 5.91M |
| w/o KDVCF (Teacher model) | 36.23 | 28.3M |
| Only \(R\) | 35.78 | 6.86M |
| Only \(\Phi\) | 35.71 | 6.86M |

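Both components of the BiM descriptor contribute to the performance. Relative to the variant without the descriptor (35.42 dB), using only \(R\) reaches 35.78 dB (+0.36 dB), using only \(\Phi\) reaches 35.71 dB (+0.29 dB), and the full descriptor reaches 36.01 dB, a +0.59 dB improvement.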

Conclusion & Future Work

By introducing the BiM descriptor \([R, \Phi]\) to explicitly model the internal relationship of bidirectional motion fields, and combining content-aware upsampling with knowledge distillation, BiM-VFI achieves state-of-the-art frame interpolation quality with only 6.88M parameters. The method is particularly suited to non-uniform motion scenarios, and its design philosophy of leveraging the constraint relationship between bidirectional optical flows can be generalized to other video understanding tasks that require motion estimation.