MotionPro: A Precise Motion Controller for Image-to-Video Generation
Conference: CVPR 2025
arXiv: 2505.20287
Code: https://github.com/zhw-zhang/MotionPro-page
Area: Video Generation
Keywords: Image-to-Video Generation, Motion Control, Region-wise Trajectory, Motion Mask, Diffusion Models
TL;DR
MotionPro achieves fine-grained, controllable image-to-video generation and distinguishes object motion from camera motion by conditioning on two signals: region-wise trajectories and a motion mask.
Background & Motivation
Existing controllable I2V methods (such as DragNUWA and DragAnything) mainly rely on large-kernel Gaussian filtering to diffuse sparse trajectories into condition signals. This approach has two key limitations:
1. Inaccurate fine-grained motion: Gaussian filtering (with kernel sizes up to \(99 \times 99\)) spreads trajectory signals into surrounding areas, so fine-grained motion details are lost and subtle movements (e.g., head turning) look unnatural.
2. Motion category ambiguity: Relying solely on trajectory conditions fails to distinguish "object motion" from "camera motion". For instance, a downward trajectory drawn on a planet could mean either that the planet itself moves downward (object motion) or that the camera moves so that the whole scene shifts downward (camera motion); a trajectory alone cannot tell the two apart.
Furthermore, although MOFA-Video introduces a motion region mask, it uses the mask only to post-process the optical flow rather than injecting it into the network as a generative condition, which leads to local distortions in the synthesized videos.
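To make the first limitation concrete, the following is a minimal NumPy sketch of how a large Gaussian kernel spreads a single sparse trajectory value over its neighborhood. It is not the code of DragNUWA or DragAnything: the \(128 \times 128\) map, the 10-pixel displacement, and `sigma=10` are illustrative assumptions (the text above only gives the \(99 \times 99\) kernel size), and real pipelines may rescale the result afterwards.

```python
import numpy as np

def gaussian_kernel1d(size: int, sigma: float) -> np.ndarray:
    """Normalized 1-D Gaussian kernel of odd length `size`."""
    x = np.arange(size) - size // 2
    k = np.exp(-0.5 * (x / sigma) ** 2)
    return k / k.sum()

def gaussian_blur(field: np.ndarray, size: int, sigma: float) -> np.ndarray:
    """Separable Gaussian blur with zero padding.
    Assumes both sides of `field` are longer than `size`."""
    k = gaussian_kernel1d(size, sigma)
    rows = np.apply_along_axis(lambda r: np.convolve(r, k, mode="same"), 1, field)
    return np.apply_along_axis(lambda c: np.convolve(c, k, mode="same"), 0, rows)

# One sparse trajectory point: a 10-pixel horizontal displacement at the center.
flow_x = np.zeros((128, 128))
flow_x[64, 64] = 10.0

# Assumed sigma; only the 99 x 99 kernel size comes from the text.
blurred = gaussian_blur(flow_x, size=99, sigma=10.0)

print("pixels carrying signal before:", np.count_nonzero(flow_x))
print("pixels carrying signal after: ", np.count_nonzero(blurred))
print("peak value after blur:", blurred.max())
```

In this toy setting the single displacement ends up spread over the full kernel footprint while its peak shrinks sharply, which is the kind of detail loss the region-wise design is meant to avoid.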
Method
Overall Architecture
MotionPro is built upon the pre-trained SVD (Stable Video Diffusion). During training, region-wise trajectories and motion masks are extracted from input videos as dual control signals. High-fidelity motion encoders encode these signals into multi-scale features, which are then injected into the multi-scale feature layers of the 3D-UNet via Adaptive Feature Modulation. Concurrently, LoRA fine-tuning is applied to all attention modules to enhance motion-trajectory alignment.
Key Designs
- Region-wise Trajectory: The DOT optical flow tracking model estimates the optical flow maps \(f^i\) and visibility masks \(M^i\) of the video. The global visibility mask is computed as \(M_g = \prod_{i=1}^{L} M^i\). The masked optical flow maps are then divided into local regions of size \(k \times k\) (default \(k=8\)), and trajectories within these regions are sparsely selected via a random region selection mask \(M_{sel}\) (with the mask ratio randomly sampled from \([r_{min}, 1.0]\)). Core Advantage: Compared to Gaussian filtering, this directly preserves the original trajectory information within local regions, maintaining precise motion details.
- Motion Mask: The temporal average of optical flow magnitudes across all frames is computed as \(f_{avg} = \frac{1}{L} \sum_{i=1}^{L} \|f^i\|_2\). Positions where \(f_{avg} > 1\) are marked as motion regions to obtain \(M_{mot}\), which is repeated \(L\) times to form the motion mask sequence \(\mathbf{M}_{mot} \in \{0,1\}^{L \times H \times W \times 1}\). Function: Globally identifies motion regions, clarifies the target motion category (object/camera motion), and removes the ambiguity in trajectory signals.
- Adaptive Feature Modulation: The trajectory and mask are concatenated and encoded by a lightweight Motion Encoder into multi-scale features \(l_s\). At each scale, spatial-temporal convolutional layers predict a scale \(\gamma_s\) and a bias \(\beta_s\) that modulate the video latent features: \(h_s' = GN(h_s) \cdot \gamma_s + \beta_s + h_s\). To guarantee training stability, zero initialization is adopted so that \(\gamma_s = 0, \beta_s = 0\) at the start of training, which makes \(h_s' = h_s\) initially.
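The three designs above can be sketched in NumPy as follows. This is an illustration under stated assumptions, not the authors' implementation: the function names, the default `r_min=0.5`, the reading of "mask ratio" as the fraction of regions dropped, and the plain (non-affine) group normalization are all hypothetical choices. In the actual model, \(\gamma_s\) and \(\beta_s\) are predicted by convolutional layers from \(l_s\); here they are simply passed in.

```python
import numpy as np

def region_wise_trajectory(flow, vis, k=8, r_min=0.5, rng=None):
    """flow: (L, H, W, 2) optical flow; vis: (L, H, W) visibility in {0, 1}.
    Keeps the original flow only inside randomly selected k x k regions.
    Assumes H and W are divisible by k."""
    rng = np.random.default_rng() if rng is None else rng
    _, H, W, _ = flow.shape
    m_g = np.prod(vis, axis=0)                     # M_g = prod_i M^i, shape (H, W)
    masked = flow * m_g[None, :, :, None]
    mask_ratio = rng.uniform(r_min, 1.0)           # assumed: fraction of regions dropped
    keep = rng.random((H // k, W // k)) >= mask_ratio
    m_sel = np.repeat(np.repeat(keep, k, axis=0), k, axis=1)  # one decision per region
    return masked * m_sel[None, :, :, None], m_sel

def motion_mask(flow, threshold=1.0):
    """Binary motion mask sequence of shape (L, H, W, 1)."""
    f_avg = np.linalg.norm(flow, axis=-1).mean(axis=0)        # temporal mean of |f^i|
    m_mot = (f_avg > threshold).astype(np.float32)
    return np.repeat(m_mot[None, :, :, None], flow.shape[0], axis=0)

def group_norm(h, groups, eps=1e-5):
    """Non-affine group normalization for features of shape (C, ...)."""
    g = h.reshape(groups, -1)
    g = (g - g.mean(axis=1, keepdims=True)) / np.sqrt(g.var(axis=1, keepdims=True) + eps)
    return g.reshape(h.shape)

def modulate(h, gamma, beta, groups=4):
    """h' = GN(h) * gamma + beta + h."""
    return group_norm(h, groups) * gamma + beta + h

rng = np.random.default_rng(0)
L, H, W = 4, 16, 24
flow = rng.normal(0.0, 2.0, size=(L, H, W, 2))
vis = (rng.random((L, H, W)) > 0.05).astype(np.float64)

traj, m_sel = region_wise_trajectory(flow, vis, k=8, r_min=0.5, rng=rng)
mask_seq = motion_mask(flow)

h = rng.normal(size=(8, 4, 2, 3))                  # toy latent features (C, T, H, W)
h_init = modulate(h, gamma=0.0, beta=0.0)          # zero-initialized modulation
print(traj.shape, mask_seq.shape, np.allclose(h_init, h))
```

With \(\gamma_s = \beta_s = 0\) the modulated features equal the input features, so the pre-trained backbone starts out unchanged and the control signal is introduced gradually during training.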
Loss & Training
- Denoising score matching (DSM) loss using the EDM training protocol: \(\mathcal{L} = \mathbb{E}[\lambda_\sigma \|\hat{\mathbf{z}}_0 - \mathbf{z}_0\|_2^2]\)
- LoRA rank is set to 32, with only the Motion Encoder and LoRA layers being trained
- Learning rate of \(1 \times 10^{-5}\) using the AdamW optimizer
- Trained on 6 A800 GPUs with a batch size of 48
- Video resolution of \(320 \times 512\), 16 frames at 8 fps
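For readers unfamiliar with LoRA, the sketch below shows the standard low-rank update on a single linear layer with rank 32, as in the setting above. The layer width (320), the `alpha` scaling, and the initialization are illustrative assumptions rather than values from the paper; only the rank comes from the text.

```python
import numpy as np

class LoRALinear:
    """Frozen weight W plus a trainable low-rank update (alpha / rank) * B @ A."""

    def __init__(self, weight, rank=32, alpha=32.0, rng=None):
        rng = np.random.default_rng() if rng is None else rng
        d_out, d_in = weight.shape
        self.weight = weight                               # frozen pre-trained weight
        self.A = rng.normal(0.0, 0.01, size=(rank, d_in))  # trainable
        self.B = np.zeros((d_out, rank))                   # trainable, zero-initialized
        self.scale = alpha / rank

    def __call__(self, x):
        # Base projection plus the low-rank correction.
        return x @ self.weight.T + self.scale * (x @ self.A.T) @ self.B.T

    def trainable_params(self):
        return self.A.size + self.B.size

rng = np.random.default_rng(0)
w = rng.normal(size=(320, 320))        # hypothetical attention projection weight
layer = LoRALinear(w, rank=32, rng=rng)
x = rng.normal(size=(5, 320))

print("full weight params:", w.size)
print("LoRA trainable params:", layer.trainable_params())
print("matches base layer at init:", np.allclose(layer(x), x @ w.T))
```

Because `B` starts at zero, the adapted layer reproduces the frozen layer at initialization, and only the two small matrices need gradients during fine-tuning.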
Key Experimental Results
Main Results: Fine-Grained Motion Control
| Dataset | Metric | MotionPro | MOFA-Video | DragNUWA | Gain (vs. MOFA-Video) |
|---|---|---|---|---|---|
| WebVid-10M | FVD ↓ | 59.88 | 87.70 | 96.65 | -27.82 |
| WebVid-10M | FID ↓ | 10.40 | 12.18 | 13.19 | -1.78 |
| MC-Bench | MD-Img ↓ | 10.56 | 13.94 | - | -3.38 |
| MC-Bench | MD-Vid ↓ | 8.34 | 10.50 | - | -2.16 |
Main Results: Object-Level Motion Control (MC-Bench)
| Method | MD-Img ↓ | MD-Vid ↓ | Frame Consistency ↑ |
|---|---|---|---|
| MOFA-Video | 15.56 | 12.04 | 0.9951 |
| DragAnything | 12.30 | 11.37 | 0.9917 |
| MotionPro | 10.48 | 8.59 | 0.9943 |
Ablation Study
| Configuration | MD-Vid (fine) ↓ | MD-Vid (obj) ↓ | Description |
|---|---|---|---|
| \(k=1\) | Higher | Higher | Region size is too small, insufficient trajectory signals |
| \(k=4\) | Medium | Medium | Performance gradually improves |
| \(k=8\) | Best | Best | Optimal balance between precision and coverage |
| \(k=16\) | Slightly worse | Slightly worse | Excessive region size harms fine-grained control |
| MotionPro\(_C\) (Concatenation Injection) | Higher | Higher | Feature concatenation requires strict spatial-temporal alignment |
| MotionPro\(_+\) (Additive Injection) | Medium | Medium | Additive fusion yields sub-optimal performance |
| MotionPro (Modulation Injection) | Best | Best | Indirect modulation does not require strict alignment |
Key Findings
- The average optical flow magnitude of videos generated by MotionPro is 8.95, which is substantially higher than the 4.95 of MOFA-Video, indicating richer and more dynamic motion variations.
- The motion mask effectively resolves motion category ambiguity typically observed in DragNUWA (e.g., misinterpreting object motion as camera motion).
- The MC-Bench benchmark includes 1.1K user-annotated image-trajectory pairs, aligning closely with real-world scenarios.
Highlights & Insights
- Region-wise Trajectory replacing Gaussian Filtering: A simple yet effective design change that directly uses the original optical flow within local regions instead of diffusing sparse trajectories with a Gaussian filter.
- Conditional Utilization of Motion Mask: Injecting the mask into the generation process as a conditioning signal rather than using it merely for post-processing fundamentally eliminates motion category ambiguity.
- MC-Bench Benchmark: Fills the gap in controllable I2V evaluation caused by the lack of user-annotated benchmarks.
- Feature modulation (FiLM-style) is more suitable for fusing motion control signals compared to concatenation or addition.
Limitations & Future Work
- Built upon the SVD architecture, which limits the resolution to \(320 \times 512\), with no adaptation yet to stronger DiT-based foundation models.
- Training requires pre-computing optical flows using DOT, which increases data preprocessing overhead.
- Only supports a single reference image paired with trajectories/masks as input, and does not support text-driven motion control.
- Scenarios involving independent multiple-object motion control remain unexplored.
Related Work & Insights
- DragNUWA/DragAnything: Representative works of Gaussian trajectory-based methods and the primary baselines for comparison.
- MOFA-Video: A two-stage framework (sparse to dense trajectory) whose performance is limited as the mask is only utilized in post-processing.
- Motion-I2V: Shares a similar sparse-to-dense trajectory estimation pipeline.
- The region-wise design scheme presented in this work can be generalized to other tasks such as video editing and 3D scene motion control.
Rating
- Novelty: ⭐⭐⭐⭐ The dual-signal design utilizing region-wise trajectories and motion masks is simple yet highly effective.
- Experimental Thoroughness: ⭐⭐⭐⭐ Proposes the new MC-Bench benchmark and provides detailed ablation studies.
- Writing Quality: ⭐⭐⭐⭐ Well-structured with clear problem motivation.
- Value: ⭐⭐⭐⭐ Provides practical and valuable insights for controllable video generation.