MOFA-Video: Controllable Image Animation via Generative Motion Field Adaptions in Frozen Image-to-Video Diffusion Model
Conference: ECCV 2024
arXiv: 2405.20222
Area: Image Generation
TL;DR
This paper proposes MOFA-Video, which equips a frozen image-to-video diffusion model (SVD) with controllable motion capabilities by designing multiple domain-specific motion field adapters (MOFA-Adapters). It supports various control signals and their combinations, such as hand-drawn trajectories and facial landmarks, to achieve open-domain controllable image animation.
Background & Motivation
- In-domain animation methods (e.g., SadTalker) offer fine control within a single category such as faces or fluids, but they do not generalize to open-domain images.
- Diffusion-based I2V models (e.g., SVD, AnimateDiff) can animate open-domain images, but the generated content may drift from the input image, and they support only text prompts or simple idle animation, so their control is weak.
- Limitations of existing control methods: DragNUWA models trajectories via adaptive normalization but suffers from poor spatial correspondence; MotionCtrl relies on T2V models and lacks a world coordinate system.
- Core Problem: How to build a unified framework to achieve fine-grained controllable animation from multiple motion domains on open-domain images?
Method
Overall Architecture
MOFA-Video attaches a MOFA-Adapter to a frozen Stable Video Diffusion (SVD) model as a motion control module, in the spirit of ControlNet. The key idea is to convert control signals from different domains into a common sparse motion vector representation, so that a single adapter structure can drive video generation for all of them.
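To make the shared representation concrete, the sketch below rasterizes point trajectories into per-frame sparse motion vectors plus a binary mask marking where a vector is defined. The function name `trajectories_to_sparse_flow`, the array layout, and the choice to store displacements relative to the first frame are illustrative assumptions, not the paper's code.

```python
import numpy as np

def trajectories_to_sparse_flow(trajectories, num_frames, height, width):
    """Rasterize trajectories into sparse flow maps and validity masks.

    trajectories: list of arrays of shape (num_frames, 2); each row is an (x, y) point.
    Returns flow of shape (num_frames - 1, height, width, 2) and
    mask of shape (num_frames - 1, height, width).
    """
    flow = np.zeros((num_frames - 1, height, width, 2), dtype=np.float32)
    mask = np.zeros((num_frames - 1, height, width), dtype=np.float32)
    for traj in trajectories:
        traj = np.asarray(traj, dtype=np.float32)
        col = int(round(float(traj[0, 0])))
        row = int(round(float(traj[0, 1])))
        if not (0 <= row < height and 0 <= col < width):
            continue  # skip trajectories that start outside the image
        for t in range(1, num_frames):
            # Assumption: store the displacement from the first frame at the start pixel.
            flow[t - 1, row, col] = traj[t] - traj[0]
            mask[t - 1, row, col] = 1.0
    return flow, mask
```

Under the same assumption, tracked facial landmarks could be passed through this function unchanged, with one trajectory per landmark, which is the sense in which both control types reduce to sparse motion vectors.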
Key Designs
1. MOFA-Adapter Structure:
   - Sparse-to-Dense (S2D) Motion Generation Network: Receives the first frame image and sparse motion prompts to generate a dense optical flow field, utilizing a CMP network structure.
   - Reference Image Encoder: A multi-scale convolutional feature encoder that extracts multi-scale features of the first frame for warping.
   - Fusion Encoder: A trainable copy of the SVD encoder, fusing the warped features with the features of the SVD decoder.
2. Domain-Aware Motion Control:
   - Open-Domain Trajectory: Trained by sampling sparse motion vectors from video optical flow, accepting hand-drawn trajectories during inference.
   - Facial Landmarks: Converts facial landmark displacements into sparse motion vectors, simplifying the framework with a unified representation.
   - Multi-Adapter Composition: MOFA-Adapters from different domains can be combined in a zero-shot manner, utilizing a mask-aware strategy to fuse control signals from different regions.
3. Long Video Generation: A periodic sampling strategy is proposed. At each denoising step, the frames are split into groups of 14 with a 7-frame overlap between neighboring groups, and the predicted noise on overlapping frames is averaged to keep longer videos temporally consistent.
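The overlap-and-average step in item 3 can be sketched as follows. This is a minimal illustration under stated assumptions: `predict_noise` is a hypothetical stand-in for the frozen model plus adapter, the helper names are made up for this example, and placing the final group flush with the end when the frame count does not divide evenly is my own choice rather than something the summary specifies.

```python
import numpy as np

def window_starts(num_frames, window=14, overlap=7):
    """Start indices of overlapping frame groups that cover every frame."""
    if num_frames <= window:
        return [0]
    stride = window - overlap
    starts = list(range(0, num_frames - window + 1, stride))
    if starts[-1] + window < num_frames:
        starts.append(num_frames - window)  # assumption: last group ends at the last frame
    return starts

def averaged_noise(latents, predict_noise, window=14, overlap=7):
    """One denoising step: predict noise per group, then average where groups overlap.

    latents: array of shape (num_frames, ...).
    predict_noise: hypothetical callable mapping a group of frames to noise of the same shape.
    """
    num_frames = latents.shape[0]
    total = np.zeros_like(latents, dtype=np.float64)
    count = np.zeros((num_frames,) + (1,) * (latents.ndim - 1))
    for start in window_starts(num_frames, window, overlap):
        end = min(start + window, num_frames)
        total[start:end] += predict_noise(latents[start:end])
        count[start:end] += 1
    return total / count
```

For 28 frames this yields groups starting at frames 0, 7, and 14, so frames 7 through 20 receive the mean of two predictions while the first and last 7 frames receive a single prediction.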
Loss & Training
The SVD parameters are frozen, and only the MOFA-Adapter parameters \(\theta_{\mathcal{M}}\) are optimized. In the training objective, \(\mathcal{S}\) denotes the frozen SVD and \(\mathcal{V}\) the video latent representation.
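One plausible form of this objective, assuming a standard noise-prediction diffusion loss, is given below. Only \(\theta_{\mathcal{M}}\), \(\mathcal{S}\), and \(\mathcal{V}\) come from the text above; the noised latent \(\mathcal{V}_t\), the timestep \(t\), the noise \(\epsilon\), the reference image \(I\), the sparse motion input \(F^{s}\), and the adapter symbol \(\mathcal{M}\) are notation assumed for this sketch. SVD is trained in an EDM-style framework, so the paper's actual prediction target and loss weighting may differ.

\[
\min_{\theta_{\mathcal{M}}} \; \mathbb{E}_{\mathcal{V},\, \epsilon,\, t}
\left[ \left\| \epsilon - \mathcal{S}\!\left(\mathcal{V}_t,\, t,\, I,\, \mathcal{M}\!\left(I, F^{s};\, \theta_{\mathcal{M}}\right)\right) \right\|_2^2 \right]
\]

The gradient flows through \(\mathcal{S}\) into the adapter output, but only \(\theta_{\mathcal{M}}\) is updated.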
Key Experimental Results
Main Results
Trajectory Control Comparison (vs. DragNUWA):
| Method | Frame Consistency↑ | LPIPS↓ | FID↓ | FVD↓ | Control Accuracy (User)↑ | Visual Quality (User)↑ |
|---|---|---|---|---|---|---|
| DragNUWA | 0.9302 | 0.2705 | 19.66 | 91.38 | 2.76 | 3.18 |
| MOFA-Video | 0.9390 | 0.2274 | 16.82 | 86.76 | 3.58 | 3.42 |
Portrait Animation Comparison (vs. SadTalker, StyleHEAT):
| Method | CPBD↑ | ID↑ | Fidelity (User)↑ | Naturalness (User)↑ | Visual Quality (User)↑ |
|---|---|---|---|---|---|
| SadTalker | 0.3218 | 0.9188 | 4.15 | 3.12 | 3.97 |
| StyleHEAT | 0.2577 | 0.7993 | 3.26 | 3.65 | 3.70 |
| MOFA-Video | 0.4075 | 0.9293 | 4.80 | 3.97 | 4.52 |
Ablation Study
Network Architecture Ablation (Trajectory Control):
| Variant | LPIPS↓ | FID↓ | FVD↓ |
|---|---|---|---|
| w/o warping (pure sparse conditions) | 0.2619 | 18.80 | 184.27 |
| w/o S2D (sparse warping) | 0.2376 | 16.87 | 81.80 |
| w/o tuning (directly using the reconstruction model) | 0.2163 | 16.97 | 102.17 |
| Full Model | 0.2274 | 16.82 | 86.76 |
Key Findings
- The sparse-condition variant (w/o warping) cannot precisely control the target object's trajectory: without a spatial warping operation the condition is spatially misaligned, and FVD degrades to 184.27 versus 86.76 for the full model.
- The sparse-warping variant (w/o S2D) can follow trajectories but produces severe visual artifacts because it lacks dense optical flow guidance.
- MOFA-Adapters for different domains must be trained separately; directly using the open-domain model for facial animation leads to unnatural expressions.
- The periodic sampling strategy significantly outperforms the naive frame grouping method, effectively resolving error accumulation and temporal inconsistency in long videos.
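Since the first two findings hinge on the warping operation, here is a minimal NumPy sketch of warping a feature map with a dense flow field using bilinear sampling. The function name `warp_features`, the (H, W, C) layout, the border clamping, and the backward-sampling convention (each output pixel reads from its position plus the flow) are assumptions for illustration; the paper's implementation may use a different convention.

```python
import numpy as np

def warp_features(features, flow):
    """Backward-warp a feature map with a dense flow field (bilinear sampling).

    features: array of shape (H, W, C).
    flow: array of shape (H, W, 2) holding (dx, dy); output pixel (x, y)
    samples the input at (x + dx, y + dy), clamped to the image border.
    """
    h, w, _ = features.shape
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    src_x = np.clip(xs + flow[..., 0], 0, w - 1)
    src_y = np.clip(ys + flow[..., 1], 0, h - 1)
    x0 = np.floor(src_x).astype(int)
    y0 = np.floor(src_y).astype(int)
    x1 = np.minimum(x0 + 1, w - 1)
    y1 = np.minimum(y0 + 1, h - 1)
    wx = (src_x - x0)[..., None]  # horizontal interpolation weight
    wy = (src_y - y0)[..., None]  # vertical interpolation weight
    top = features[y0, x0] * (1 - wx) + features[y0, x1] * wx
    bottom = features[y1, x0] * (1 - wx) + features[y1, x1] * wx
    return top * (1 - wy) + bottom * wy
```

With a sparse flow, only the few pixels that carry a vector would move, which gives some intuition for why the variant without S2D densification leaves the rest of the object behind.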
Highlights & Insights
- Unifying multi-domain motion control into a sparse motion vector problem is an elegant and scalable design.
- The explicit sparse-to-dense optical flow generation combined with the feature warping strategy achieves a good balance between control accuracy and generation quality.
- The zero-shot composition capability of multiple MOFA-Adapters makes it possible to simultaneously control both facial expressions and background motion.
- Compared to the implicit trajectory modeling of DragNUWA, the explicit optical flow method better confines the motion regions.
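The mask-aware composition of two adapters can be pictured as a per-pixel blend. The sketch below is an assumption about the general idea only: the summary does not state at which level the signals are fused (flows or features), and `fuse_by_mask` is a hypothetical helper, not the paper's API.

```python
import numpy as np

def fuse_by_mask(face_signal, open_signal, face_mask):
    """Blend two control signals so each one only acts inside its own region.

    face_signal, open_signal: arrays of shape (H, W, C) from two adapters.
    face_mask: array of shape (H, W), 1 inside the face region and 0 elsewhere.
    """
    mask = np.asarray(face_mask, dtype=face_signal.dtype)[..., None]
    # Assumption: a hard per-pixel selection; a soft mask in [0, 1] also works here.
    return mask * face_signal + (1.0 - mask) * open_signal
```

The face region then follows the landmark-driven signal while everything else follows the trajectory-driven one.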
Limitations & Future Work
- Cannot control or generate new content that deviates significantly from the input image (constrained by the short-video training data of SVD).
- Visual artifacts such as blurriness or structure loss may occur under large motion guidance.
- Video length is restricted by SVD's 14-frame window, requiring additional periodic sampling strategies for long videos.
Rating
- Novelty: 7/10 — The adapter concept derives from ControlNet; the core innovation lies in the unified modeling of motion fields and multi-domain composition.
- Technical Depth: 8/10 — Explicit motion modeling with S2D + warping is solidly designed, and the multi-adapter composition scheme is reasonable.
- Experimental Thoroughness: 8/10 — Comparative and ablation experiments are comprehensive, though quantitative evaluations on long videos are lacking.
- Impact: 7/10 — Provides a practical, unified framework for controllable video generation.