BiM-VFI: Bidirectional Motion Field-Guided Frame Interpolation for Video with Non-uniform Motions
Background & Motivation
Video Frame Interpolation (VFI) synthesizes intermediate frames between two given frames and is widely used in slow-motion generation, video coding, and frame rate up-conversion. Most existing methods are based on optical flow estimation: they first estimate forward and backward optical flows, and then synthesize the intermediate frames through warping.
However, motion fields in real-world videos are often non-uniform, meaning that motion velocities and directions vary significantly across different regions within the same frame. Typical scenarios include:
- High-speed foreground motion over a static background, such as athletes in sports events
- Multiple objects moving at different speeds, such as vehicles in traffic scenes
- Mixed rotation and translation, such as complex camera motion in handheld shooting
Traditional methods typically assume that bidirectional motions are independent, estimating the forward and backward optical flows separately. This overlooks a key piece of information: there exists an inherent geometric constraint between bidirectional optical flows. For the same 3D scene point, its projected displacements in the two frames satisfy specific mathematical relationships.
This paper proposes the BiM (Bidirectional Motion) descriptor, a compact representation that simultaneously encodes bidirectional motion relationships, along with a lightweight frame interpolation framework based on BiM.
Method
BiM Descriptor
The BiM descriptor \([R, \Phi]\) consists of two components:
Magnitude Ratio \(R\)
Here \(\mathbf{f}_{0 \to 1}\) and \(\mathbf{f}_{1 \to 0}\) denote the forward and backward optical flows, respectively. \(R\) compares their magnitudes and so captures the relative velocity information of the motion.
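One way to write the ratio down, as a sketch only: which flow sits in the numerator is an assumption made here for illustration, and the expression applies at pixels where the backward flow is non-zero.

\[
R = \frac{\lVert \mathbf{f}_{0 \to 1} \rVert_2}{\lVert \mathbf{f}_{1 \to 0} \rVert_2}
\]

Under this convention, \(R = 1\) when both flows have equal magnitude, which matches the uniform-translation case.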
Angle Difference \(\Phi\)
\(\Phi\) measures how far the direction of the backward flow deviates from the exact opposite of the forward flow's direction. For strictly linear (straight-line) motion, \(\Phi = 0\), even if the speed changes; for motion that changes direction (such as rotation or a curved trajectory), \(\Phi \neq 0\).
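A possible formalization, assuming \(\Phi\) is the signed angle between the forward flow and the negated backward flow (this convention is an assumption, chosen so that exactly opposite flows give \(\Phi = 0\)):

\[
\Phi = \operatorname{atan2}\!\big(a_x b_y - a_y b_x,\; a_x b_x + a_y b_y\big),
\qquad \mathbf{a} = \mathbf{f}_{0 \to 1},\quad \mathbf{b} = -\mathbf{f}_{1 \to 0}
\]

The first argument is the 2D scalar cross product and the second is the dot product, so \(\Phi \in (-\pi, \pi]\).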
| Motion Type | \(R\) | \(\Phi\) | Description |
|---|---|---|---|
| Uniform Translation | 1.0 | 0 | Same speed and opposite directions between frames |
| Accelerated Motion | >1.0 | 0 | Faster in the second half |
| Decelerated Motion | <1.0 | 0 | Faster in the first half |
| Curved Motion | ≈1.0 | ≠0 | Presence of directional deviation |
| Complex Non-linear | ≠1.0 | ≠0 | Both velocity and direction change |
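The rows of the table can be checked numerically with a small per-pixel sketch. The function name `bim_descriptor`, the choice of forward magnitude over backward magnitude for \(R\), and the signed-angle convention for \(\Phi\) are all assumptions made for illustration, not the paper's implementation.

```python
import math

def bim_descriptor(f01, f10, eps=1e-8):
    """Return (R, Phi) for one pixel.

    f01, f10: (dx, dy) forward and backward flow vectors.
    Assumed conventions (not confirmed by the source):
      R   = |f01| / |f10|
      Phi = signed angle from f01 to -f10, in radians.
    """
    mag01 = math.hypot(f01[0], f01[1])
    mag10 = math.hypot(f10[0], f10[1])
    ratio = mag01 / (mag10 + eps)  # eps guards against a zero backward flow
    # Negate the backward flow so that exactly opposite flows give Phi = 0.
    bx, by = -f10[0], -f10[1]
    cross = f01[0] * by - f01[1] * bx
    dot = f01[0] * bx + f01[1] * by
    phi = math.atan2(cross, dot)
    return ratio, phi

# Uniform translation: equal magnitude, exactly opposite directions.
r_lin, phi_lin = bim_descriptor((2.0, 0.0), (-2.0, 0.0))

# Curved motion: equal magnitude, backward flow rotated by 30 degrees.
theta = math.radians(30.0)
r_cur, phi_cur = bim_descriptor(
    (2.0, 0.0), (-2.0 * math.cos(theta), -2.0 * math.sin(theta))
)
```

The first call corresponds to the "Uniform Translation" row (\(R \approx 1\), \(\Phi = 0\)) and the second to the "Curved Motion" row (\(R \approx 1\), \(\Phi \neq 0\)).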
BiM-guided FlowNet
The BiM descriptor is injected into the optical flow estimation network as additional input channels.
Unlike traditional methods that directly estimate the optical flow of intermediate frames, BiM-guided FlowNet leverages the global constraint information of bidirectional motions to improve optical flow accuracy in regions with non-uniform motion.
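Injecting the descriptor as extra channels can be pictured as a simple concatenation. The sketch below assumes a channels-first layout with two RGB frames followed by the \(R\) and \(\Phi\) maps; the function name `build_flownet_input` and this 8-channel layout are hypothetical, and the actual network may inject the descriptor at a different stage.

```python
import numpy as np

def build_flownet_input(frame0, frame1, ratio_map, angle_map):
    """Stack two RGB frames and the BiM maps along the channel axis.

    frame0, frame1: float arrays of shape (3, H, W).
    ratio_map, angle_map: float arrays of shape (H, W).
    Returns an array of shape (8, H, W).
    """
    bim = np.stack([ratio_map, angle_map], axis=0)  # (2, H, W)
    return np.concatenate([frame0, frame1, bim], axis=0)

h, w = 4, 6
x = build_flownet_input(
    np.zeros((3, h, w), dtype=np.float32),
    np.ones((3, h, w), dtype=np.float32),
    np.full((h, w), 1.0, dtype=np.float32),  # R = 1 everywhere
    np.zeros((h, w), dtype=np.float32),      # Phi = 0 everywhere
)
```

Channels 6 and 7 of `x` then carry \(R\) and \(\Phi\), so the first convolution of the flow network sees the motion prior alongside the pixel data.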
Content-Aware Upsampling Network (CAUN)
Traditional frame interpolation uses bilinear interpolation or separable convolutions to upsample warped features. This paper proposes CAUN, a content-aware upsampling module:
- Input: low-resolution warped features, high-resolution original frames
- Core: adaptive sampling kernel generation based on local content
- Output: high-resolution synthesized frames
CAUN utilizes a finer sampling strategy in edge and texture regions and a larger receptive field in flat regions, achieving a balance between quality and efficiency.
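The idea of content-dependent sampling kernels can be illustrated with a generic kernel-prediction upsampler. This is not the paper's CAUN: the function `adaptive_upsample`, the fixed 3×3 kernel size, and the `logits` input (which a small network would predict from local content in a real model) are assumptions for illustration.

```python
import numpy as np

def adaptive_upsample(feat, logits, scale=2, k=3):
    """Upsample feat (H, W) by `scale` using one kernel per output pixel.

    logits: (scale*H, scale*W, k*k) raw kernel scores. In a real model a
    small network would predict them from local content (hypothetical here).
    """
    h, w = feat.shape
    pad = k // 2
    padded = np.pad(feat, pad, mode="edge")
    # Softmax over the k*k taps so every kernel sums to 1.
    z = logits - logits.max(axis=-1, keepdims=True)
    weights = np.exp(z)
    weights /= weights.sum(axis=-1, keepdims=True)
    out = np.empty((h * scale, w * scale), dtype=np.float64)
    for y in range(h * scale):
        for x in range(w * scale):
            cy, cx = y // scale, x // scale       # source low-res pixel
            patch = padded[cy:cy + k, cx:cx + k]  # k x k neighbourhood
            out[y, x] = np.sum(patch.ravel() * weights[y, x])
    return out

feat = np.arange(12, dtype=np.float64).reshape(3, 4)
# All-zero logits give uniform kernels, i.e. a plain box average.
up = adaptive_upsample(feat, np.zeros((6, 8, 9)))
# A constant feature map stays constant under any normalized kernel.
rng = np.random.default_rng(0)
flat = adaptive_upsample(np.full((3, 4), 5.0), rng.normal(size=(6, 8, 9)))
```

Sharply peaked logits make a kernel behave like nearest-neighbour sampling, which suits edges, while flat logits average over the whole neighbourhood, which suits smooth regions.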
Knowledge Distillation (KDVCF)
To further compress the model, this paper designs the KDVCF (Knowledge Distillation for Video Content-aware Frame interpolation) strategy:
| Component | Teacher Model | Student Model |
|---|---|---|
| Backbone Network | ResNet-50 | MobileNetV3 |
| Parameters | 28.3M | 6.88M |
| Distillation Loss | - | Feature Alignment + Output Matching |
| Inference Speed | 1× | 3.2× |
The distillation strategy includes:
1. Feature alignment distillation: minimizing the L2 distance between the student's and teacher's intermediate-layer features
2. Output matching distillation: matching the student's final synthesized frame to the teacher's under a perceptual loss
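The two terms can be combined as a weighted sum, sketched below. The function name `distillation_loss`, the weights `w_feat` and `w_out`, and the use of a plain L1 distance as a stand-in for the perceptual loss are assumptions; a real perceptual loss would compare features from a pretrained network, and student features would typically need a projection to match teacher shapes.

```python
import numpy as np

def distillation_loss(student_feats, teacher_feats, student_out, teacher_out,
                      w_feat=1.0, w_out=1.0):
    """Weighted sum of a feature-alignment term and an output-matching term."""
    # Feature alignment: mean squared (L2) distance per intermediate layer.
    feat_term = sum(
        np.mean((s - t) ** 2) for s, t in zip(student_feats, teacher_feats)
    )
    # Output matching: L1 stand-in for the perceptual loss (assumption).
    out_term = np.mean(np.abs(student_out - teacher_out))
    return w_feat * feat_term + w_out * out_term

loss = distillation_loss(
    student_feats=[np.zeros((2, 2))],
    teacher_feats=[np.ones((2, 2))],
    student_out=np.zeros((3, 2, 2)),
    teacher_out=np.full((3, 2, 2), 0.5),
)
```

In this toy call the feature term is 1.0 and the output term is 0.5, so `loss` is 1.5 with the default weights.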
Experimental Results
Main Results
| Method | Parameters | Vimeo90K PSNR↑ | SSIM↑ | UCF101 PSNR↑ | SNU-FILM Hard↑ |
|---|---|---|---|---|---|
| RIFE | 9.8M | 35.61 | 0.978 | 35.28 | 29.27 |
| IFRNet | 19.7M | 35.80 | 0.979 | 35.36 | 29.51 |
| AMT-S | 12.3M | 35.72 | 0.978 | 35.31 | 29.39 |
| EMA-VFI | 21.5M | 35.86 | 0.979 | 35.40 | 29.56 |
| BiM-VFI | 6.88M | 36.01 | 0.980 | 35.52 | 29.72 |
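BiM-VFI achieves the best score on every benchmark in the table while using the fewest parameters (6.88M); on Vimeo90K, for example, it reaches 36.01 dB against 35.86 dB for EMA-VFI (21.5M).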
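For reference, the PSNR columns follow the standard peak signal-to-noise ratio, \(10 \log_{10}(\text{peak}^2 / \text{MSE})\). The sketch below assumes pixel values scaled to \([0, 1]\); the peak value and per-frame averaging protocol used in the evaluation are not stated here, so treat both as assumptions.

```python
import math

def psnr(pred, target, peak=1.0):
    """PSNR in dB between two equally sized flat sequences of pixel values."""
    mse = sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(peak ** 2 / mse)

# Toy example: two of four pixels are off by 0.1, so MSE = 0.005.
value = psnr([0.5, 0.5, 0.5, 0.5], [0.6, 0.4, 0.5, 0.5])
```

Because the scale is logarithmic, a gain of a few tenths of a dB at around 36 dB corresponds to a reduction of a few percent in mean squared error.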
Non-uniform Motion Scenarios
On the X-TEST and Xiph-4K datasets, which contain a substantial amount of non-linear motion, the advantage of BiM-VFI is more pronounced: its margin over EMA-VFI grows from 0.15 dB on Vimeo90K to 0.78 dB on X-TEST and 0.58 dB on Xiph-4K.
| Method | X-TEST PSNR↑ | Xiph-4K PSNR↑ |
|---|---|---|
| RIFE | 28.93 | 31.42 |
| EMA-VFI | 29.34 | 31.89 |
| BiM-VFI | 30.12 | 32.47 |
Ablation Study
| Configuration | Vimeo90K PSNR↑ | Parameters |
|---|---|---|
| Full BiM-VFI | 36.01 | 6.88M |
| w/o BiM descriptor | 35.42 | 6.85M |
| w/o CAUN (Bilinear upsampling) | 35.67 | 5.91M |
| w/o KDVCF (Teacher model) | 36.23 | 28.3M |
| Only \(R\) | 35.78 | 6.86M |
| Only \(\Phi\) | 35.71 | 6.86M |
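Both components of the BiM descriptor contribute to the performance: using only \(R\) or only \(\Phi\) falls short of the full model, and the full BiM descriptor brings a +0.59 dB improvement over the variant without it (36.01 dB vs. 35.42 dB on Vimeo90K).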
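The per-component contributions can be read off the ablation table by subtracting each variant from the full model. The snippet below only redoes that arithmetic on the Vimeo90K PSNR values listed above; the teacher row is left out because it is a larger model, not an ablation of the student.

```python
# Vimeo90K PSNR (dB) copied from the ablation table.
ablation = {
    "Full BiM-VFI": 36.01,
    "w/o BiM descriptor": 35.42,
    "w/o CAUN": 35.67,
    "Only R": 35.78,
    "Only Phi": 35.71,
}

full = ablation["Full BiM-VFI"]
# PSNR drop of each variant relative to the full model, rounded to 0.01 dB.
drops = {
    name: round(full - value, 2)
    for name, value in ablation.items()
    if name != "Full BiM-VFI"
}
```

This gives drops of 0.59 dB without the descriptor, 0.34 dB without CAUN, 0.23 dB with only \(R\), and 0.30 dB with only \(\Phi\).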
Conclusion & Future Work
By introducing the BiM descriptor \([R, \Phi]\) to explicitly model the internal relationship of bidirectional motion fields, and combining content-aware upsampling with knowledge distillation, BiM-VFI achieves state-of-the-art frame interpolation quality with only 6.88M parameters. The method is particularly suited to non-uniform motion scenarios, and its design philosophy of leveraging the constraint relationship between bidirectional optical flows may generalize to other video understanding tasks that require motion estimation.