MotionFollower: Editing Video Motion via Lightweight Score-Guided Diffusion¶
Conference: ICCV 2025 | arXiv: 2405.20325 | Code: francis-rings/MotionFollower | Area: Model Compression | Keywords: video motion editing, diffusion models, score guidance, lightweight controller, pose transfer
TL;DR¶
This paper proposes MotionFollower, which achieves video motion editing via two lightweight convolutional controllers (pose + appearance) and a consistency guidance mechanism based on score function regularization, surpassing strong baselines such as MotionEditor while reducing GPU memory consumption by approximately 80%.
Background & Motivation¶
- Limitations of existing video editing: Current diffusion model-driven video editing primarily focuses on attribute-level editing (style transfer, appearance/background modification), with very limited attention to motion information—the most distinctive and complex characteristic of video.
- Shortcomings of MotionEditor: As the sole pioneering work capable of motion editing, MotionEditor relies on ControlNet combined with attention injection, suffering from three major issues: (1) poor temporal consistency, (2) prohibitive computational overhead due to the attention-heavy ControlNet, and (3) performance degradation under large camera motion and complex backgrounds.
- Human motion transfer vs. motion editing: Existing human motion transfer methods (MagicAnimate, AnimateAnyone, Champ) aim at generating animations from a single image, which fundamentally differs from video motion editing—the latter must simultaneously preserve camera motion, per-frame background variation, and subject appearance.
- Core motivation: To design a lightweight, efficient motion editing framework that preserves source video details under complex scenarios involving large camera motion and complex backgrounds.
Method¶
Overall Architecture¶
MotionFollower is built upon a T2I diffusion model (LDM/SD1.5), inflating the 2D U-Net into a 3D U-Net by inserting temporal layers. The overall framework comprises two core designs:
- Two lightweight conditional controllers: Pose Controller (PoCtr) + Reference Controller (ReCtr)
- Score-guided consistency regularization at inference: dual-branch architecture (reconstruction branch + editing branch) + multiple loss functions
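The temporal inflation mentioned above follows the common AnimateDiff-style pattern: the pretrained 2D (spatial) blocks are applied frame by frame, and a trainable layer that attends across frames is inserted after each one. A minimal sketch with illustrative module names and an assumed head count (not the paper's code):

```python
import torch
import torch.nn as nn

class TemporalAttention(nn.Module):
    """Self-attention over the frame axis, applied independently at each spatial location."""
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, channels, height, width)
        b, f, c, h, w = x.shape
        tokens = x.permute(0, 3, 4, 1, 2).reshape(b * h * w, f, c)  # one sequence per pixel
        normed = self.norm(tokens)
        out, _ = self.attn(normed, normed, normed)
        out = (tokens + out).reshape(b, h, w, f, c).permute(0, 3, 4, 1, 2)
        return out

class InflatedBlock(nn.Module):
    """Frozen spatial block from the T2I U-Net followed by a trainable temporal layer."""
    def __init__(self, spatial_block: nn.Module, dim: int):
        super().__init__()
        self.spatial = spatial_block            # pretrained 2D block, applied frame-wise
        self.temporal = TemporalAttention(dim)  # inserted layer, trained in Stage 2

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, f, c, h, w = x.shape
        # Assumes the spatial block preserves shape, for simplicity of the sketch.
        x = self.spatial(x.reshape(b * f, c, h, w)).reshape(b, f, c, h, w)
        return self.temporal(x)
```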
Lightweight Conditional Controllers¶
Pose Controller (PoCtr)¶
- Function: Encodes target pose signals to control motion modification.
- Architecture: Consists of only 4 convolutional blocks, each containing 2 convolutional layers, with no attention computation whatsoever.
- Mechanism: Encodes pose signals into a representation of the same dimensionality as the initial latent, which is directly added to the latent prior to denoising.
- Initialization strategy: The final projection layer weights are initialized to zero to avoid introducing excessive perturbation at the early stage of training.
- Compared to ControlNet: the parameter count and computational cost are substantially lower, and adding the pose signal directly to the latent avoids the robustness degradation that arises when modified motion is generated directly from random noise (a minimal sketch follows this list).
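A minimal PyTorch sketch of such a pose controller, assuming the structure described above (4 blocks of 2 conv layers, no attention, zero-initialized output projection); channel widths and strides are guesses chosen so a 512×512 pose map reaches the 64×64 SD latent resolution:

```python
import torch
import torch.nn as nn

class PoseController(nn.Module):
    """PoCtr-style controller: purely convolutional, output added to the noisy latent."""
    def __init__(self, pose_channels: int = 3, latent_channels: int = 4,
                 widths=(16, 32, 64, 128), strides=(2, 2, 2, 1)):
        super().__init__()
        layers, in_ch = [], pose_channels
        for w, s in zip(widths, strides):        # 4 blocks, each with 2 conv layers
            layers += [nn.Conv2d(in_ch, w, 3, stride=s, padding=1), nn.SiLU(),
                       nn.Conv2d(w, w, 3, padding=1), nn.SiLU()]
            in_ch = w
        self.blocks = nn.Sequential(*layers)
        self.proj = nn.Conv2d(in_ch, latent_channels, 1)
        nn.init.zeros_(self.proj.weight)   # zero-initialized projection: no perturbation
        nn.init.zeros_(self.proj.bias)     # at the start of training

    def forward(self, pose: torch.Tensor, latent: torch.Tensor) -> torch.Tensor:
        # pose: (B, 3, 512, 512) skeleton render; latent: (B, 4, 64, 64) noisy latent
        return latent + self.proj(self.blocks(pose))
```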
Reference Controller (ReCtr)¶
- Function: Preserves appearance information from the source video, replacing time-consuming DDIM inversion.
- Architecture: Likewise 4 convolutional blocks (2 convolutional layers each), downsampling source frame features to the same dimensionality as the latent.
- Multi-scale injection: Multi-scale source features are directly added to the features at each encoder block of the U-Net.
- CLIP embedding: Source frames are additionally projected via a CLIPProjector into CLIP embeddings, which interact with the latent through cross-attention to preserve precise low-level details.
- Compared to MagicAnimate/AnimateAnyone: those methods employ a full U-Net as the reference network, with numerous attention operations and channel concatenation, incurring roughly twice the memory overhead of ReCtr (a minimal sketch of ReCtr follows this list).
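A sketch of ReCtr's multi-scale feature extraction; the channel widths are assumed to mirror SD1.5's encoder, and the patchify layer and strides are guesses rather than the paper's configuration:

```python
import torch
import torch.nn as nn

class ReferenceController(nn.Module):
    """ReCtr-style controller: 4 conv blocks (2 conv layers each) producing multi-scale
    source-frame features that are added to the matching U-Net encoder features."""
    def __init__(self, in_channels: int = 3, widths=(320, 640, 1280, 1280)):
        super().__init__()
        self.patchify = nn.Conv2d(in_channels, widths[0], kernel_size=8, stride=8)  # 512 -> 64
        self.blocks = nn.ModuleList()
        in_ch = widths[0]
        for i, w in enumerate(widths):
            stride = 1 if i == 0 else 2          # mirror the encoder's downsampling schedule
            self.blocks.append(nn.Sequential(
                nn.Conv2d(in_ch, w, 3, stride=stride, padding=1), nn.SiLU(),
                nn.Conv2d(w, w, 3, padding=1), nn.SiLU()))
            in_ch = w

    def forward(self, src_frame: torch.Tensor) -> list:
        feats, x = [], self.patchify(src_frame)
        for block in self.blocks:
            x = block(x)
            feats.append(x)                      # one feature map per encoder scale
        return feats

# Injection inside the U-Net encoder: h_i = h_i + feats[i] (element-wise addition rather
# than the channel concatenation used by full reference-U-Net designs).
```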
Loss & Training (Two-Stage)¶
| Stage | Data | Trainable Modules | Objective |
|---|---|---|---|
| Stage 1 (image) | Single frames; source/target randomly sampled from the same video | PoCtr + ReCtr + U-Net (temporal layers removed) | Learn image editing capability given source image and target pose |
| Stage 2 (video) | Video clips + 40% probability of degrading to image training | Temporal layers only (rest frozen) | Enhance temporal consistency; temporal layers initialized from AnimateDiff pretrained weights |
- Training data: 3K internet videos (60–90 seconds), cropped to 512×512.
- Each stage is trained for 100K steps with learning rate 1e-5.
- DWPose is used for pose extraction; a proprietary lightweight segmentation model is used for mask extraction.
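A small sketch of how the two stages could be wired, assuming illustrative module attributes (pose_ctr, ref_ctr, unet, temporal_layers); only the freezing logic, the learning rate, and the 40% image-degradation probability come from the notes above:

```python
import random
import torch

def configure_stage(model, stage: int, lr: float = 1e-5):
    """Freeze/unfreeze parameters per training stage (both stages use lr 1e-5)."""
    for p in model.parameters():
        p.requires_grad_(False)
    if stage == 1:
        # Stage 1 (image): temporal layers unused; train PoCtr + ReCtr + U-Net.
        trainable = (list(model.pose_ctr.parameters())
                     + list(model.ref_ctr.parameters())
                     + list(model.unet.parameters()))
    else:
        # Stage 2 (video): everything frozen except the temporal layers,
        # which are initialized from AnimateDiff's pretrained weights.
        trainable = list(model.temporal_layers.parameters())
    for p in trainable:
        p.requires_grad_(True)
    return torch.optim.AdamW(trainable, lr=lr)

def sample_stage2_batch(video_clip: torch.Tensor) -> torch.Tensor:
    """With 40% probability, degrade a video batch (B, F, C, H, W) to a single frame
    so temporal-layer training does not erode single-frame editing ability."""
    if random.random() < 0.4:
        f = random.randrange(video_clip.shape[1])
        return video_clip[:, f : f + 1]
    return video_clip
```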
Score-Guided Consistency Regularization (Core Inference Design)¶
Core Idea: The denoising process of a diffusion model can be viewed as a continuous process guided by the score function (the gradient of the log probability density with respect to the data). By adding a regularization term to the score function, the model can be forced to preserve semantic details from the source video without updating its weights.
Dual-Branch Architecture:
- Reconstruction branch: random noise + source pose + source frames → reconstruct the source video
- Editing branch: random noise + target pose + source frames → generate the edited video
Score Function Decomposition:
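One plausible way to write this decomposition, in the standard classifier-guidance convention (\(\lambda\) is a guidance scale and \(\sigma_t\) the noise level; the paper's exact notation may differ):

\[
\nabla_{z_t}\log \tilde{p}(z_t \mid c) \;\approx\; \nabla_{z_t}\log p_\theta(z_t \mid c) \;-\; \lambda\,\nabla_{z_t}\big(\mathcal{L}_{fg} + \mathcal{L}_{bg}\big),
\]

which in \(\epsilon\)-prediction form corresponds to \(\tilde{\epsilon}_\theta = \epsilon_\theta + \lambda\,\sigma_t\,\nabla_{z_t}(\mathcal{L}_{fg} + \mathcal{L}_{bg})\): the regularization gradients are added to the model's score at every denoising step, steering the latent toward source-consistent content without touching the weights.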
Region-wise Loss Design: Segmentation masks are used to decompose the loss into foreground and background components.
Foreground Loss \(\mathcal{L}_{fg}\)¶
- Maximizes the similarity between foreground features of the editing branch and the reconstruction branch.
- Employs mask pooling and spatially averaged cosine similarity.
- Ensures subject appearance remains consistent after motion modification.
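One plausible instantiation, assuming mask pooling of the features followed by cosine similarity (\(F^{e}, F^{r}\) are editing/reconstruction features, \(M^{e}, M^{r}\) the corresponding foreground masks, and \(i\) indexes spatial locations; the paper's exact form may differ):

\[
\bar{f}^{\,b} = \frac{\sum_{i} M^{b}_{i}\,F^{b}_{i}}{\sum_{i} M^{b}_{i}},\quad b \in \{e, r\},
\qquad
\mathcal{L}_{fg} = 1 - \cos\!\big(\bar{f}^{\,e},\,\bar{f}^{\,r}\big),
\]

so minimizing \(\mathcal{L}_{fg}\) maximizes the similarity between the two branches' mask-pooled foreground features.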
Background Loss \(\mathcal{L}_{bg}\) (Three Sub-losses)¶
| Loss | Mask Definition | Function |
|---|---|---|
| \(\mathcal{L}_{over}\) (overlapping background) | \((1-M^e) \odot (1-M^r)\) | Enforces consistency between overlapping background regions of the edited and reconstructed outputs |
| \(\mathcal{L}_{body}\) (non-overlapping human body) | \(M^r \odot (1-M^e)\) | Mitigates ghosting/blurring caused by source pose bias |
| \(\mathcal{L}_{com}\) (complementary) | non-overlapping body region vs. source background | Guides the model to inpaint non-overlapping body regions with background information |
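Taken together, a plausible form of the combined background objective (the weights \(\lambda_{*}\) and the squared-error form are assumptions):

\[
\mathcal{L}_{bg} = \lambda_{over}\,\mathcal{L}_{over} + \lambda_{body}\,\mathcal{L}_{body} + \lambda_{com}\,\mathcal{L}_{com},
\qquad \text{e.g.}\;\; \mathcal{L}_{over} = \big\| (1-M^{e}) \odot (1-M^{r}) \odot \big(F^{e} - F^{r}\big) \big\|_{2}^{2}.
\]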
Final Guidance: Foreground and background gradients are combined via masks to update the latent in a region-wise manner:
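A sketch of the resulting region-wise update applied to the editing-branch latent \(z^{e}_{t}\) at each denoising step (the guidance weight \(\lambda\) is an assumption):

\[
z^{e}_{t} \;\leftarrow\; z^{e}_{t} \;-\; \lambda\Big( M^{e} \odot \nabla_{z^{e}_{t}}\mathcal{L}_{fg} \;+\; \big(1 - M^{e}\big) \odot \nabla_{z^{e}_{t}}\mathcal{L}_{bg} \Big).
\]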
Key: only the latent is optimized; model weights are not updated.
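A minimal sketch of the dual-branch, score-guided sampling loop this implies; the model/controller interfaces and helper functions here (model.unet returning features, model.scheduler.step returning the previous latent, fg_loss/bg_loss, masks.fg) are illustrative assumptions, not the authors' implementation:

```python
import torch

@torch.no_grad()
def edit_video(model, src_frames, src_pose, tgt_pose, masks, lam: float = 1.0):
    """Score-guided motion editing sketch: two branches share weights; only the
    editing-branch latent is nudged by loss gradients, never the model parameters."""
    z_rec = torch.randn(model.latent_shape(src_frames), device=src_frames.device)
    z_edit = torch.randn_like(z_rec)
    for t in model.scheduler.timesteps:
        # Reconstruction branch: source pose + source frames -> reconstruct the source video.
        eps_rec, feats_rec = model.unet(z_rec, t, pose=src_pose, ref=src_frames)
        # Editing branch: target pose + source frames -> edited video, with grad w.r.t. the latent.
        with torch.enable_grad():
            z = z_edit.detach().requires_grad_(True)
            eps_edit, feats_edit = model.unet(z, t, pose=tgt_pose, ref=src_frames)
            loss_fg = fg_loss(feats_edit, feats_rec, masks)   # subject appearance consistency
            loss_bg = bg_loss(feats_edit, feats_rec, masks)   # background / inpainting consistency
            grad_fg = torch.autograd.grad(loss_fg, z, retain_graph=True)[0]
            grad_bg = torch.autograd.grad(loss_bg, z)[0]
        # Region-wise guidance: foreground and background gradients combined via masks.
        z_edit = z_edit - lam * (masks.fg * grad_fg + (1 - masks.fg) * grad_bg)
        # Ordinary denoising step for both branches (weights stay frozen throughout).
        z_rec = model.scheduler.step(eps_rec, t, z_rec)
        z_edit = model.scheduler.step(eps_edit, t, z_edit)
    return model.vae_decode(z_edit)
```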
Key Experimental Results¶
Main Results (100 in-the-wild videos)¶
| Model | L1↓ | PSNR↑ | SSIM↑ | LPIPS↓ | FID↓ | FID-VID↓ | FVD↓ | Memory↓ |
|---|---|---|---|---|---|---|---|---|
| MotionEditor | 9.13E-5 | 17.34 | 0.68 | 0.34 | 31.98 | 20.57 | 395.43 | 42.6G |
| MagicAnimate | 1.09E-4 | 16.22 | 0.62 | 0.35 | 33.04 | 26.59 | 477.65 | 21.8G |
| AnimateAnyone | 9.45E-5 | 16.18 | 0.65 | 0.32 | 35.81 | 28.31 | 515.57 | 16.1G |
| Champ | 9.94E-5 | 16.12 | 0.58 | 0.36 | 36.55 | 25.89 | 452.65 | 17.4G |
| MotionFollower | 6.31E-5 | 20.85 | 0.75 | 0.22 | 26.30 | 12.42 | 276.04 | 9.8G |
- Leads across all metrics; FVD improves from 395→276 (↓30%); memory reduces from 42.6G→9.8G (↓77%).
- On a single A100, MotionEditor can only edit 16 frames, while MotionFollower can edit 56 frames.
Human Motion Transfer (TikTok Benchmark)¶
| Model | SSIM↑ | LPIPS↓ | FVD↓ |
|---|---|---|---|
| Champ | 0.773 | 0.235 | 170.20 |
| MotionFollower | 0.793 | 0.230 | 159.88 |
Achieves state-of-the-art results on the simpler motion transfer task as well.
Ablation Study¶
| Configuration | FID↓ | FID-VID↓ | FVD↓ | Memory |
|---|---|---|---|---|
| w/o ReCtr | 64.82 | 47.28 | 545.39 | 8.3G |
| Replace ReCtr→RNet | 26.35 | 13.12 | 288.57 | 17.8G |
| Replace PoCtr→CNet | 30.57 | 23.75 | 381.52 | 15.2G |
| w/o score guidance | 35.91 | 28.10 | 437.69 | 7.1G |
| Full model | 26.30 | 12.42 | 276.04 | 9.8G |
- Removing ReCtr causes a catastrophic performance drop, highlighting the critical role of appearance control.
- Replacing ReCtr with a full reference U-Net (RNet) nearly doubles memory (9.8G → 17.8G) while performing slightly worse, validating the superiority of the pure convolutional design.
- Score guidance contributes significantly to background consistency.
Highlights & Insights¶
- Extreme lightweight design: Replacing ControlNet and Reference Net with purely convolutional controllers reduces GPU memory by 80%—an excellent demonstration of "less is more." Complex attention mechanisms in fact disrupt the distributional alignment between the reference network and the U-Net.
- Elegance of score guidance: Model weights are left untouched; the latent is optimized at inference time solely through loss gradients. This "test-time optimization" paradigm can be seamlessly transferred to other diffusion models.
- Fine-grained region-wise loss design: Distinct constraints are imposed on foreground, background, overlapping, and non-overlapping regions via segmentation masks, effectively addressing ghosting, blurring, and background inconsistency in motion editing.
- Long video and large camera motion: MotionEditor suffers severe degradation on 600-frame long videos and under large camera motion, while MotionFollower handles these scenarios robustly through consistency guidance.
- Two-stage training + mixed training strategy: In the second stage, a 40% probability of degrading to image training effectively prevents temporal layer training from compromising single-frame editing capability.
Limitations & Future Work¶
- Dependence on pose detection quality: PoCtr relies on DWPose for pose extraction; editing quality may degrade when pose detection fails (severe occlusion, unconventional poses).
- Resolution limitation: Training and testing are conducted at 512×512; scalability to high-resolution (1024+) scenarios remains unexplored.
- Based on SD1.5 architecture: The more recent SDXL or DiT architectures are not leveraged, potentially limiting the upper bound of generation quality.
- Inference speed: processing 24 frames takes about 50 seconds on an A100, far from real-time; score guidance requires forward passes through both branches, further increasing inference overhead.
- Human motion only: The current approach is limited to pose-driven human motion editing; object-level or scene-level motion editing is not addressed.
- Dependence on segmentation model: Score guidance requires accurate foreground/background segmentation masks; segmentation errors propagate into the editing results.
Related Work & Insights¶
- MotionEditor (CVPR 2024): The pioneering motion editing work using ControlNet + attention injection; the primary baseline for this paper.
- AnimateDiff (ICLR 2024): Provides pretrained temporal layer weights used for initialization in Stage 2 of MotionFollower's training.
- MagicAnimate / AnimateAnyone / Champ: Human motion transfer methods employing a full U-Net as the reference network; this paper demonstrates that a pure convolutional alternative is superior.
- Score guidance: Conceptually rooted in SDE theory and classifier guidance, but here innovatively applied to region-level consistency constraints rather than global guidance.
- Inspiration: The paradigm of lightweight controllers + test-time latent optimization is generalizable to other conditional video generation tasks (e.g., video inpainting, video super-resolution).
Rating¶
- Novelty: ⭐⭐⭐⭐ — Score-guided consistency regularization and pure convolutional controller design are novel contributions.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ — Comprehensive coverage of quantitative, qualitative, ablation, human evaluation, long video, and camera motion experiments.
- Writing Quality: ⭐⭐⭐⭐ — Clear structure with complete mathematical derivations.
- Value: ⭐⭐⭐⭐ — The 80% memory reduction carries significant practical value; the lightweight design philosophy is broadly transferable.