CVPR 2025 Video Generation Controllable video generation unified optical flow representation camera trajectory control frequency domain stabilization multi-condition fusion

AnimateAnything: Consistent and Controllable Animation for Video Generation

Conference: CVPR 2025
arXiv: 2411.10836
Code: https://github.com/yu-shaonian/Animate_Anything
Area: Video Understanding / Controlled Video Generation
Keywords: Controllable video generation, unified optical flow representation, camera trajectory control, frequency domain stabilization, multi-condition fusion

TL;DR

A two-stage controllable video generation framework is proposed. The first stage unifies different control signals (camera trajectories, user drag-and-drop annotations, reference videos) into a frame-by-frame optical flow representation. The second stage uses the unified optical flow to guide a DiT-based video diffusion model to generate the final video, introducing a frequency-domain stabilization module to suppress flickering under large motions.

Background & Motivation

Background: Controllable video generation is a crucial direction in video generation, encompassing camera trajectory control (CameraCtrl, MotionCtrl) and object motion control (Motion-I2V, MOFA-Video).

Limitations of Prior Work:
- MotionCtrl/CameraCtrl only support camera control and rely on text descriptions for object motion, lacking sufficient precision.
- Motion-I2V/MOFA-Video only support small-scale object motion and cannot handle camera motion.
- When attempting to introduce multiple control signals simultaneously, conflicts arise due to differing modalities (e.g., camera motion is global while object motion is local), confusing the generative model.

Key Challenge: Various control signals (camera parameters, drag arrows, reference videos) essentially describe pixel motion, but their representations are completely different, making them difficult to unify and fuse.

Core Idea: If all control signals can be unified and converted into frame-by-frame optical flow representations, video generation can be guided in a unified manner, naturally resolving signal conflicts.

Method

Overall Architecture

A two-stage pipeline:
- Stage 1 (Unified Optical Flow Generation): Through explicit injection (converting dragging etc. into sparse optical flow) and implicit injection (encoding camera trajectories as reference features), a unified dense optical flow is generated via the synergy of two models (FGM+CRM).
- Stage 2 (Video Generation): The unified optical flow is encoded and fused with the video latents through ViT blocks, generating the final video using the CogVideoX framework combined with text conditions.

Key Designs

Explicit Injection — Handling signals that can be directly converted into optical flow:
- Function: Converts user drag annotations and other inputs into sparse optical flows, which are expanded into dense optical flows via the Flow Generation Model (FGM).
- Mechanism: Point-wise sparse optical flow \(F_{l-1}^s(x_i, y_i) = \hat{\mathcal{T}}_l(x_i, y_i) - \hat{\mathcal{T}}_0(x_i, y_i)\) is generated by extracting sparse control points via bicubic interpolation from user-annotated motion trajectories \(\mathcal{M} \in \mathbb{R}^{P \times 2}\). This flow is enhanced by CMP and fed into FGM (a U-Net LDM based on Controlled SD1.5) to produce dense optical flow.
- Design Motivation: Any signal capable of being converted into sparse optical flow (such as audio, video, keypoints, etc.) can be unified and integrated into FGM.
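The conversion from drag trajectories to sparse optical flow can be sketched as follows. This is a minimal illustration, not the authors' code: the function names and the dense `(L, H, W, 2)` layout are assumptions, and linear interpolation stands in for the bicubic interpolation mentioned above. Each trajectory is resampled to one point per frame, and the displacement relative to the first frame, \(\hat{\mathcal{T}}_l - \hat{\mathcal{T}}_0\), is written at the trajectory's starting pixel.

```python
import numpy as np

def resample_trajectory(traj, num_frames):
    """Resample a drag trajectory of P (x, y) points to num_frames + 1 points.

    Linear interpolation is a simple stand-in for the bicubic interpolation
    described in the text (an assumption of this sketch).
    """
    traj = np.asarray(traj, dtype=np.float64)
    src = np.linspace(0.0, 1.0, len(traj))
    dst = np.linspace(0.0, 1.0, num_frames + 1)
    xs = np.interp(dst, src, traj[:, 0])
    ys = np.interp(dst, src, traj[:, 1])
    return np.stack([xs, ys], axis=1)  # shape (num_frames + 1, 2)

def sparse_flow_from_drags(trajectories, num_frames, height, width):
    """Build sparse flow maps and a validity mask from drag trajectories.

    Returns flow of shape (num_frames, H, W, 2) and mask of shape (H, W).
    flow[l - 1, y0, x0] holds T_l - T_0 for the drag starting at (x0, y0).
    """
    flow = np.zeros((num_frames, height, width, 2), dtype=np.float64)
    mask = np.zeros((height, width), dtype=bool)
    for traj in trajectories:
        pts = resample_trajectory(traj, num_frames)
        x0, y0 = int(round(pts[0, 0])), int(round(pts[0, 1]))
        if not (0 <= x0 < width and 0 <= y0 < height):
            continue  # skip drags that start outside the image
        flow[:, y0, x0] = pts[1:] - pts[0]  # displacement relative to frame 0
        mask[y0, x0] = True
    return flow, mask

# A straight drag of 4 px to the right, spread over 4 frames.
drag = [(2.0, 3.0), (6.0, 3.0)]
flow, mask = sparse_flow_from_drags([drag], num_frames=4, height=8, width=8)
print(flow[:, 3, 2])  # per-frame (dx, dy) at the drag's start pixel
```

In the described pipeline, sparse maps of this kind are then densified by FGM; that step is a learned model and is not reproduced here.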
Implicit Injection — Handling signals that are difficult to directly convert into pixel optical flow:
- Function: Embeds camera trajectories into the optical flow generation process.
- Mechanism: A Camera Reference Model (CRM) is designed using Plücker embeddings to represent camera trajectories \(\ddot{p}_{f,h,w} = (t_f \times \hat{d}_{f,h,w}, \hat{d}_{f,h,w})\). Camera features are fused with the reference image via camera motion attention to generate multi-scale reference features, which are progressively injected into the denoising process of FGM via reference attention.
- Design Motivation: Camera motion is global and affects all foreground and background pixels, making it difficult to directly convert into sparse optical flow. Thus, it needs to be implicitly guided via multi-scale features.
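The Plücker embedding \((t_f \times \hat{d}_{f,h,w}, \hat{d}_{f,h,w})\) for a single frame can be computed as in the sketch below. It assumes a pinhole intrinsic matrix `K` and a camera-to-world pose `c2w` whose translation column is the camera centre \(t_f\); these conventions and the function name are assumptions for illustration, since the note does not specify them.

```python
import numpy as np

def plucker_embedding(K, c2w, height, width):
    """Per-pixel Plücker embedding (t x d, d) for one frame.

    K:   (3, 3) pinhole intrinsics (assumed convention).
    c2w: (4, 4) camera-to-world pose; its translation is the camera centre t.
    Returns an array of shape (height, width, 6).
    """
    # Pixel centres in homogeneous image coordinates.
    u, v = np.meshgrid(np.arange(width) + 0.5, np.arange(height) + 0.5)
    pix = np.stack([u, v, np.ones_like(u)], axis=-1)        # (H, W, 3)

    # Back-project to camera-space rays, rotate into world space, normalise.
    rays_cam = pix @ np.linalg.inv(K).T
    rays_world = rays_cam @ c2w[:3, :3].T
    d = rays_world / np.linalg.norm(rays_world, axis=-1, keepdims=True)

    # Moment vector t x d, with t broadcast to every pixel.
    t = np.broadcast_to(c2w[:3, 3], d.shape)
    moment = np.cross(t, d)
    return np.concatenate([moment, d], axis=-1)

K = np.array([[100.0, 0.0, 32.0],
              [0.0, 100.0, 32.0],
              [0.0, 0.0, 1.0]])
c2w = np.eye(4)
c2w[:3, 3] = [1.0, 0.0, 0.0]  # camera shifted 1 unit along x
emb = plucker_embedding(K, c2w, height=64, width=64)
print(emb.shape)
```

Stacking one such map per frame gives a dense, pixel-aligned description of the trajectory, which is the kind of input a module like CRM can fuse with reference-image features.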
Frequency Stabilization:
- Function: Suppresses video flickering and instability caused by large motions.
- Mechanism: FFT is introduced into the attention mechanism of DiT. A Fast Fourier Transform is applied to the input features to obtain spectral features, which are then multiplied by a learnable weight matrix \(W\). An inverse FFT reconstructs the time-domain features, on which dot-product attention is then computed.
- Design Motivation: Flickering in the temporal domain stems from feature misalignment between frames. Frequency-domain features expose video-level information more directly, so optimizing the frequency components can effectively suppress inter-frame inconsistency.
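The FFT, reweight, inverse FFT, attention sequence can be illustrated with the NumPy sketch below. Several details are assumptions rather than the paper's design: the transform is taken along the frame axis, the weight \(W\) has one entry per frequency bin and channel, and attention uses the filtered features directly as queries, keys, and values without learned projections.

```python
import numpy as np

def frequency_filter(x, w):
    """Reweight the temporal spectrum of x.

    x: (T, N, C) features for T frames, N tokens per frame, C channels.
    w: (T // 2 + 1, C) per-frequency, per-channel weights (learnable in practice).
    """
    num_frames = x.shape[0]
    spec = np.fft.rfft(x, axis=0)                    # (T//2 + 1, N, C), complex
    spec = spec * w[:, None, :]                      # scale each frequency bin
    return np.fft.irfft(spec, n=num_frames, axis=0)  # back to the time domain

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def stabilized_temporal_attention(x, w):
    """Dot-product attention across frames on frequency-filtered features."""
    h = frequency_filter(x, w).transpose(1, 0, 2)    # (N, T, C): attend over time
    scores = h @ h.transpose(0, 2, 1) / np.sqrt(h.shape[-1])
    out = softmax(scores) @ h
    return out.transpose(1, 0, 2)                    # (T, N, C)

# Toy input: constant features plus frame-to-frame alternating "flicker".
T, N, C = 8, 3, 4
base = np.ones((T, N, C))
x = base + 0.5 * ((-1.0) ** np.arange(T))[:, None, None]

w = np.ones((T // 2 + 1, C))
w[-1] = 0.0                                          # suppress the highest frequency
filtered = frequency_filter(x, w)
print(np.abs(filtered - base).max())                 # residual flicker after filtering
```

In this toy case the alternating component sits entirely in the highest frequency bin, so zeroing that bin removes it. A trained \(W\) would instead learn which bins to attenuate.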

Loss & Training

Stage 1 trains FGM/CRM on RealEstate10K and DL3DV-10K, using Unimatch to extract ground-truth optical flow from the training videos.
Stage 2 trains on WebVid10M and OpenVid, utilizing Flow VAE to compress optical flow into the latent space.
In Stage 2, only the optical flow encoder, input transformer blocks, and frequency stabilization module are trained, while other parameters are frozen.
To address the insufficiency of dynamic scene data, approximately 10K static camera videos are filtered from OpenVid for additional training.
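The Stage 2 freezing scheme amounts to partitioning the model's parameters by module. The sketch below shows that partition on plain parameter names; every module name in it (`flow_encoder`, `input_blocks`, `freq_stabilizer`, and the frozen ones) is hypothetical and chosen only for illustration, not taken from the released code.

```python
def split_parameters(param_names, trainable_prefixes):
    """Partition parameter names into trainable and frozen groups by prefix."""
    prefixes = tuple(trainable_prefixes)
    trainable, frozen = [], []
    for name in param_names:
        (trainable if name.startswith(prefixes) else frozen).append(name)
    return trainable, frozen

# Hypothetical module names: the three trained components named in the text.
TRAINABLE_PREFIXES = ("flow_encoder.", "input_blocks.", "freq_stabilizer.")

param_names = [
    "flow_encoder.conv1.weight",
    "input_blocks.0.attn.qkv.weight",
    "freq_stabilizer.W",
    "dit_blocks.12.mlp.fc1.weight",   # backbone weights stay frozen
    "text_encoder.embed.weight",
    "vae.decoder.conv_out.weight",
]

trainable, frozen = split_parameters(param_names, TRAINABLE_PREFIXES)
print("trainable:", trainable)
print("frozen:", frozen)
```

In a deep-learning framework the same split would typically be applied by disabling gradients for the frozen group and passing only the trainable group to the optimizer.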

Key Experimental Results

Main Results: Camera Trajectory Control Accuracy (DUSt3R Evaluation)

Method	Basic T-Err↓	Basic R-Err↓	Difficult T-Err↓	Difficult R-Err↓
CameraCtrl	0.090	0.300	0.082	0.306
MotionCtrl	0.057	0.233	0.060	0.267
AnimateAnything	0.041	0.159	-	-

Ablation Study: VBench Video Quality Evaluation

Method	FID↓	SSIM↑	FVD↓	Subject Consistency (SubC)↑	Motion Smoothness (MoS)↑
DynamiCrafter	-	-	-	Lower	Lower
CogVideoX	-	-	-	Medium	Medium
AnimateAnything	Best	Best	Best	Best	Best

(Note: The specific FID/SSIM/FVD values are reported in Tab. 2 of the original paper and are not reproduced here; the table above records only the qualitative ranking.)

Key Findings

Camera trajectory control accuracy is the best among the compared methods on the Basic set: relative to MotionCtrl, translation error drops by 28% (0.057 → 0.041) and rotation error by 32% (0.233 → 0.159) under DUSt3R evaluation.
Unified optical flow representation is effective: when camera trajectory and drag annotations are input simultaneously, the model successfully merges global and local motions without conflict.
Frequency domain stabilization is critical for large motion scenarios: without the frequency module, significant camera motion scenarios exhibit obvious flickering.
Optical flow guidance provides the most significant improvements for human and animal motion scenarios.
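The relative reductions quoted in the first finding follow directly from the Basic columns of the camera-control table. The short check below recomputes them, and also the gap to CameraCtrl, using only the numbers reported in that table; the helper name is illustrative.

```python
def relative_reduction(baseline, ours):
    """Percentage by which `ours` is lower than `baseline`."""
    return (baseline - ours) / baseline * 100.0

# Basic-set errors from the DUSt3R evaluation table.
baselines = {
    "CameraCtrl": {"T-Err": 0.090, "R-Err": 0.300},
    "MotionCtrl": {"T-Err": 0.057, "R-Err": 0.233},
}
animate_anything = {"T-Err": 0.041, "R-Err": 0.159}

for method, errors in baselines.items():
    for metric, value in errors.items():
        drop = relative_reduction(value, animate_anything[metric])
        print(f"{metric} vs {method}: {drop:.1f}% lower")
```

Against MotionCtrl this gives roughly 28% for translation and 32% for rotation, matching the figures above.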

Highlights & Insights

Optical Flow as a Unified Motion Language: Translating all heterogeneous control signals into a unified optical flow representation is an elegant abstraction—regardless of where the signals originate, they ultimately describe how pixels should move. This paradigm of unified representation can be transferred to other multimodal fusion scenarios.
Complementary Design of Explicit + Implicit Injection: Different injection mechanisms are tailored based on the signal characteristics—using explicit paths for signals directly convertible to optical flow, and implicit reference features for others. This task-specific design is more logical than a forced, simplistic unification.
Frequency Domain Perspective on Video Stabilization: Understanding video flickering from a signal processing perspective and integrating FFT into the attention mechanism to adjust frequency components offers a novel and effective outlook.

Limitations & Future Work

Practical scalability: FGM and CRM are two independent models, leading to high training and inference costs.
Limited training data for camera trajectories: RealEstate10K and DL3DV-10K mainly cover indoor/static scenes, potentially limiting generalization in dynamic scenarios.
Stage 1 optical flow prediction errors propagate to Stage 2, resulting in error accumulation.
Learnable weights of the frequency stabilization module may generalize poorly to unseen motion patterns.

vs Motion-I2V: Both leverage a two-stage approach with an intermediate optical flow representation; however, Motion-I2V generates optical flow solely from text and reference motion without camera control support. The explicit-implicit dual path design in this work is more comprehensive.
vs MOFA-Video: MOFA-Video requires users to specify motion directions for each region, leading to complex interaction. The proposed method avoids this issue through the implicit injection of camera motion.
vs CameraCtrl: CameraCtrl uses ControlNet-style conditional injection but lacks explicit pixel-level motion guidance. This work achieves higher precision through the intermediate optical flow representation.

Rating

Novelty: ⭐⭐⭐⭐ The combined design of unified optical flow representation, explicit/implicit dual-path injection, and frequency-domain stabilization is novel, offering a clear and elegant two-stage framework.
Experimental Thoroughness: ⭐⭐⭐⭐ Utilizes multiple evaluation methods (DUSt3R/VGGSfM/ParticleSfM, VBench, FID, etc.), though the ablation studies are somewhat limited.
Writing Quality: ⭐⭐⭐⭐ The pipeline diagram is clear, and the methodology is presented in a well-organized manner, although some technical details could be more concise.
Value: ⭐⭐⭐⭐ Provides an effective solution for unifying multi-conditional control in video generation, with broad application prospects in film and VR.