MOFA-Video: Controllable Image Animation via Generative Motion Field Adaptions in Frozen Image-to-Video Diffusion Model

Conference: ECCV 2024
arXiv: 2405.20222
Area: Image Generation

TL;DR

This paper proposes MOFA-Video, which equips a frozen image-to-video diffusion model (SVD) with controllable motion capabilities by designing multiple domain-specific motion field adapters (MOFA-Adapters). It supports various control signals and their combinations, such as hand-drawn trajectories and facial landmarks, to achieve open-domain controllable image animation.

Background & Motivation

In-domain animation methods (e.g., SadTalker) can finely control specific categories (faces, fluids) but are restricted to specific domains and cannot generalize to the open domain.
Diffusion-based I2V models (e.g., SVD, AnimateDiff) can handle open-domain image animation, but their outputs may deviate from the input image, and they support only text prompts or simple idle animations, so their control capabilities are weak.
Limitations of existing control methods: DragNUWA models trajectories via adaptive normalization but suffers from poor spatial correspondence; MotionCtrl relies on T2V models and lacks a world coordinate system.
Core Problem: How to build a unified framework to achieve fine-grained controllable animation from multiple motion domains on open-domain images?

Method

Overall Architecture

MOFA-Video appends the MOFA-Adapter as a motion control module to a frozen Stable Video Diffusion (SVD) model, similar to the concept of ControlNet. The key is to unify control signals from different domains into a sparse motion vector representation, which is then used to generate videos through a unified adapter structure.

Key Designs

1. MOFA-Adapter Structure:
   - Sparse-to-Dense (S2D) Motion Generation Network: Receives the first-frame image and sparse motion prompts and generates a dense optical flow field, using the CMP (Conditional Motion Propagation) network structure.
   - Reference Image Encoder: A multi-scale convolutional encoder that extracts features of the first frame at several scales for warping.
   - Fusion Encoder: A trainable copy of the SVD encoder that fuses the warped features with the features of the SVD decoder.
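
As a rough illustration of the warping applied to the first-frame features, the sketch below backward-warps one feature channel with a dense flow field using bilinear sampling. The function name `warp_feature`, the nested-list layout, the zero padding, and the backward-warping convention (the output pixel at (x, y) samples the source at (x + dx, y + dy)) are assumptions made for this example, not details confirmed by the paper; a real implementation would operate on multi-scale tensors.

```python
import math


def warp_feature(feat, flow):
    """Backward-warp a 2D feature map with a dense flow field.

    feat: H x W nested list of floats (one channel of a first-frame feature map).
    flow: H x W nested list of (dx, dy) pairs; the output pixel at (x, y)
          samples feat at (x + dx, y + dy) with bilinear interpolation.
    Samples that fall outside the map contribute zero (assumed padding).
    """
    h, w = len(feat), len(feat[0])

    def at(x, y):
        return feat[y][x] if 0 <= x < w and 0 <= y < h else 0.0

    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            dx, dy = flow[y][x]
            sx, sy = x + dx, y + dy
            x0, y0 = math.floor(sx), math.floor(sy)
            ax, ay = sx - x0, sy - y0  # fractional offsets inside the cell
            out[y][x] = (
                (1 - ax) * (1 - ay) * at(x0, y0)
                + ax * (1 - ay) * at(x0 + 1, y0)
                + (1 - ax) * ay * at(x0, y0 + 1)
                + ax * ay * at(x0 + 1, y0 + 1)
            )
    return out


# A single active feature at column 0 of the middle row.
feat = [[0.0, 0.0, 0.0],
        [1.0, 0.0, 0.0],
        [0.0, 0.0, 0.0]]

# Every output pixel samples its left neighbour, so content moves right by one.
shift_right = [[(-1, 0)] * 3 for _ in range(3)]
warped = warp_feature(feat, shift_right)

# A half-pixel flow blends the two neighbouring source values.
half_shift = [[(-0.5, 0.0)] * 3 for _ in range(3)]
warped_half = warp_feature(feat, half_shift)
```

With the whole-pixel flow the active feature lands at column 1 of the middle row; with the half-pixel flow that position receives half of its value.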

2. Domain-Aware Motion Control:
   - Open-Domain Trajectory: Trained by sampling sparse motion vectors from video optical flow, accepting hand-drawn trajectories during inference.
   - Facial Landmarks: Converts facial landmark displacements into sparse motion vectors, simplifying the framework with a unified representation.
   - Multi-Adapter Composition: MOFA-Adapters from different domains can be combined in a zero-shot manner, utilizing a mask-aware strategy to fuse control signals from different regions.
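
The unified sparse representation can be sketched in a few lines. The helpers below are hypothetical and simplified: `sample_sparse_motion` picks pixels uniformly at random (the paper's actual sampling scheme may differ), `landmarks_to_sparse` writes each landmark's displacement at its source pixel, and `compose_by_region` is one plausible reading of mask-aware fusion, in which one adapter's motion is used inside a region mask and the other's elsewhere. How the paper fuses signals internally is not specified in this summary.

```python
import random


def sample_sparse_motion(flow, num_points, seed=0):
    """Keep the flow vectors at num_points random pixels; zero elsewhere."""
    h, w = len(flow), len(flow[0])
    rng = random.Random(seed)
    coords = [(y, x) for y in range(h) for x in range(w)]
    sparse = [[(0.0, 0.0)] * w for _ in range(h)]
    mask = [[0] * w for _ in range(h)]
    for y, x in rng.sample(coords, num_points):
        sparse[y][x] = flow[y][x]
        mask[y][x] = 1
    return sparse, mask


def landmarks_to_sparse(src_pts, dst_pts, height, width):
    """Store each landmark's displacement (dst - src) at its source pixel."""
    sparse = [[(0.0, 0.0)] * width for _ in range(height)]
    mask = [[0] * width for _ in range(height)]
    for (sx, sy), (tx, ty) in zip(src_pts, dst_pts):
        x, y = int(round(sx)), int(round(sy))
        if 0 <= x < width and 0 <= y < height:
            sparse[y][x] = (tx - sx, ty - sy)
            mask[y][x] = 1
    return sparse, mask


def compose_by_region(motion_open, motion_face, face_region):
    """Use the face motion inside face_region and the open-domain motion elsewhere."""
    return [
        [mf if m else mo for mo, mf, m in zip(row_o, row_f, row_m)]
        for row_o, row_f, row_m in zip(motion_open, motion_face, face_region)
    ]


# Toy 3 x 4 dense flow whose vector at (x, y) is simply (x, y).
dense = [[(float(x), float(y)) for x in range(4)] for y in range(3)]
traj_sparse, traj_mask = sample_sparse_motion(dense, num_points=5)

# One landmark moving from (1, 1) to (3, 2) gives the displacement (2, 1).
face_sparse, face_mask = landmarks_to_sparse([(1, 1)], [(3, 2)], 3, 4)

fused = compose_by_region(traj_sparse, face_sparse, face_mask)
```

Both control types end up in the same format (a sparse vector map plus a binary mask), which is what lets a single adapter structure serve several domains.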

3. Long Video Generation: A periodic sampling strategy is proposed. Within each diffusion step, frames are grouped (14 frames per group, with a 7-frame overlap), and the predicted noise of overlapping frames is averaged to achieve temporal consistency for longer videos.
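
The grouping and averaging can be sketched as below. This is a minimal illustration, not the authors' implementation: noise is reduced to one float per frame, `predict_noise` is a hypothetical stand-in for the denoiser call on one 14-frame group, and the handling of a trailing partial group (adding one last window aligned to the end) is an assumption.

```python
def window_starts(num_frames, window=14, overlap=7):
    """Start indices of overlapping frame groups covering the whole video."""
    stride = window - overlap
    if num_frames <= window:
        return [0]
    starts = list(range(0, num_frames - window + 1, stride))
    if starts[-1] + window < num_frames:
        # Assumed tail handling: add a final window aligned to the last frame.
        starts.append(num_frames - window)
    return starts


def averaged_noise(num_frames, predict_noise, window=14, overlap=7):
    """Average per-frame noise predictions over all groups containing the frame.

    predict_noise(start, end) returns one value per frame in [start, end).
    This would be called once per diffusion step.
    """
    total = [0.0] * num_frames
    count = [0] * num_frames
    for start in window_starts(num_frames, window, overlap):
        end = min(start + window, num_frames)
        for i, eps in enumerate(predict_noise(start, end)):
            total[start + i] += eps
            count[start + i] += 1
    return [t / c for t, c in zip(total, count)]


# Toy predictor: every frame in a group gets the group's start index as "noise",
# which makes the effect of averaging easy to see.
def toy_predictor(start, end):
    return [float(start)] * (end - start)


starts = window_starts(28)
noise = averaged_noise(28, toy_predictor)
```

For 28 frames the groups start at frames 0, 7, and 14. Frames 7 to 20 each fall into two groups and receive the mean of both predictions, while the first and last seven frames are covered by a single group.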

Loss & Training

The SVD parameters are frozen, and only the MOFA-Adapter parameters \(\theta_{\mathcal{M}}\) are optimized:

\[\mathcal{L} = \| \mathcal{S}(\mathcal{V}_t, t, \mathcal{M}(\mathcal{V}_t, t, I, F^s; \theta_{\mathcal{M}})) - \mathcal{V} \|^2\]

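where \(\mathcal{S}\) is the frozen SVD, \(\mathcal{M}\) is the MOFA-Adapter, \(\mathcal{V}\) is the video latent representation, \(\mathcal{V}_t\) is its noised version at diffusion timestep \(t\), \(I\) is the first-frame image, and \(F^s\) is the sparse motion input.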
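
The data flow of this objective can be traced with a toy example. `toy_svd` and `toy_adapter` below are hypothetical stand-ins operating on flat lists, chosen only to show that the loss changes with the adapter parameter while the frozen model stays fixed; they do not reflect the real network interfaces.

```python
def adapter_loss(frozen_svd, adapter, theta, noisy_latent, t, image,
                 sparse_motion, clean_latent):
    """Squared error between the frozen model's output and the clean latent."""
    # M(V_t, t, I, F^s; theta_M): the adapter produces the control signal.
    control = adapter(noisy_latent, t, image, sparse_motion, theta)
    # S(V_t, t, M(...)): the frozen model consumes it; S has no trainable state here.
    prediction = frozen_svd(noisy_latent, t, control)
    return sum((p - v) ** 2 for p, v in zip(prediction, clean_latent))


# Toy stand-ins (assumptions): the "frozen" model adds the control to its input,
# and the "adapter" scales the sparse motion by its single parameter theta.
def toy_svd(noisy_latent, t, control):
    return [v + c for v, c in zip(noisy_latent, control)]


def toy_adapter(noisy_latent, t, image, sparse_motion, theta):
    return [theta * m for m in sparse_motion]


noisy = [1.0, 2.0]
clean = [0.5, 1.0]
motion = [-1.0, -2.0]

loss_untrained = adapter_loss(toy_svd, toy_adapter, 0.0, noisy, 10, None, motion, clean)
loss_fitted = adapter_loss(toy_svd, toy_adapter, 0.5, noisy, 10, None, motion, clean)
```

Only `theta` differs between the two calls, mirroring the setup in which \(\theta_{\mathcal{M}}\) is the sole set of optimized parameters.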

Key Experimental Results

Main Results

Trajectory Control Comparison (vs. DragNUWA):

Method	Frame Consistency↑	LPIPS↓	FID↓	FVD↓	Control Accuracy (User)↑	Visual Quality (User)↑
DragNUWA	0.9302	0.2705	19.66	91.38	2.76	3.18
MOFA-Video	0.9390	0.2274	16.82	86.76	3.58	3.42
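
To put the automatic metrics above in relative terms, the short script below computes the percentage by which MOFA-Video lowers each lower-is-better metric relative to DragNUWA. The numbers are copied from the table; the helper name `relative_reduction` is introduced here only for illustration.

```python
dragnuwa = {"LPIPS": 0.2705, "FID": 19.66, "FVD": 91.38}
mofa_video = {"LPIPS": 0.2274, "FID": 16.82, "FVD": 86.76}


def relative_reduction(baseline, ours):
    """Percentage decrease of each metric relative to the baseline."""
    return {k: 100.0 * (baseline[k] - ours[k]) / baseline[k] for k in baseline}


reductions = relative_reduction(dragnuwa, mofa_video)
for name, pct in reductions.items():
    print(f"{name}: {pct:.1f}% lower than DragNUWA")
```

This gives reductions of roughly 15.9% (LPIPS), 14.4% (FID), and 5.1% (FVD).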

Portrait Animation Comparison (vs. SadTalker, StyleHEAT):

Method	CPBD↑	ID↑	Fidelity (User)↑	Naturalness (User)↑	Visual Quality (User)↑
SadTalker	0.3218	0.9188	4.15	3.12	3.97
StyleHEAT	0.2577	0.7993	3.26	3.65	3.70
MOFA-Video	0.4075	0.9293	4.80	3.97	4.52

Ablation Study

Network Architecture Ablation (Trajectory Control):

Variant	LPIPS↓	FID↓	FVD↓
w/o warping (pure sparse conditions)	0.2619	18.80	184.27
w/o S2D (sparse warping)	0.2376	16.87	81.80
w/o tuning (directly using the reconstruction model)	0.2163	16.97	102.17
Full Model	0.2274	16.82	86.76

Key Findings

Sparse condition models cannot precisely control the trajectory of the target object due to the lack of spatial warping operations, which results in spatial misalignment.
Sparse warping models can control trajectories but produce severe visual artifacts due to the lack of dense optical flow guidance.
MOFA-Adapters for different domains must be trained separately; directly using the open-domain model for facial animation leads to unnatural expressions.
The periodic sampling strategy significantly outperforms the naive frame grouping method, effectively resolving error accumulation and temporal inconsistency in long videos.

Highlights & Insights

Unifying multi-domain motion control into a sparse motion vector problem is an elegant and scalable design.
The explicit sparse-to-dense optical flow generation combined with the feature warping strategy achieves a good balance between control accuracy and generation quality.
The zero-shot composition capability of multiple MOFA-Adapters makes it possible to simultaneously control both facial expressions and background motion.
Compared to the implicit trajectory modeling of DragNUWA, the explicit optical flow method better confines the motion regions.

Limitations & Future Work

Cannot control or generate new content that deviates significantly from the input image (constrained by the short-video training data of SVD).
Visual artifacts such as blurriness or structure loss may occur under large motion guidance.
Video length is restricted by SVD's 14-frame window, requiring additional periodic sampling strategies for long videos.

Rating

Novelty: 7/10 — The adapter concept derives from ControlNet; the core innovation lies in the unified modeling of motion fields and multi-domain composition.
Technical Depth: 8/10 — Explicit motion modeling with S2D + warping is solidly designed, and the multi-adapter composition scheme is reasonable.
Experimental Thoroughness: 8/10 — Comparative and ablation experiments are comprehensive, though quantitative evaluations on long videos are lacking.
Impact: 7/10 — Provides a practical, unified framework for controllable video generation.