
FlowMotion: Training-Free Flow Guidance for Video Motion Transfer

Conference: CVPR 2026
arXiv: 2603.06289
Code: HKUST-LongGroup/FlowMotion
Area: Video Generation
Keywords: Video Motion Transfer, Flow Matching, Training-Free, Latent Prediction, Velocity Regularization

TL;DR

FlowMotion is a training-free video motion transfer framework that builds its motion guidance signal directly from the latent prediction output of flow-based T2V models. Because gradients never backpropagate through the model's internal layers, it maintains motion fidelity while significantly reducing inference time and memory overhead.

Background & Motivation

  1. Video motion transfer demand: Given a source video and a text prompt, the goal is to generate a target video that preserves the motion patterns (object movement, camera trajectories, etc.) of the source while rendering a new scene—widely applicable in virtual reality, filmmaking, and related fields.
  2. High cost of training-based methods: MotionDirector, MotionInversion, and similar methods require fine-tuning temporal attention or LoRA parameters for each reference video, taking 20 minutes to 2+ hours, making them unsuitable for real-time or large-scale scenarios.
  3. Inefficiency of existing training-free methods: MotionClone, SMM, DiTFlow, and others depend on intermediate layer outputs (attention maps / diffusion features), requiring gradient backpropagation through deep internal layers, consuming 51–89 GB of GPU memory and 350–1800+ seconds of inference time.
  4. Internal layer dependency limits flexibility: Existing training-free methods are tied to specific architectures (U-Net / DiT) and are difficult to generalize to new models; some also require additional inversion processes, further increasing time overhead.
  5. Rise of flow-based T2V models: Models such as Wan and HunyuanVideo based on flow matching + DiT have become SOTA, but existing motion transfer methods have not fully exploited the properties of flow-based models.
  6. Key observation—early latent predictions encode rich temporal information: The authors find that in the first few denoising steps of flow-based T2V models, the latent prediction (single-step estimate of the clean latent) already contains coarse motion trajectories and temporal dynamics, with appearance details accumulating subsequently—providing the theoretical basis for constructing motion guidance directly on prediction outputs.

Method

Overall Architecture

FlowMotion is built on top of flow-based T2V models (e.g., Wan2.1/2.2), with the core workflow:

  1. Source video motion representation extraction (no inversion needed): The source video is encoded to clean latent \(z_0^{src}\), forward-noised to \(z_t^{src}\), fed to the T2V model to predict velocity \(v_t^{src}\), and then the latent prediction \(\hat{z}_0^{src}(t) = z_t^{src} - t \cdot v_t^{src}\) is computed as the motion representation.
  2. Flow guidance during target video generation: In the first 10 denoising steps, the latent prediction \(\hat{z}_0(t)\) of the target latent \(z_t\) is computed and aligned with the source video's motion representation via a flow guidance loss. Gradients are backpropagated only to the latent itself, not through internal model layers.
  3. Velocity regularization: The velocity at each step is regularized to suppress over-alignment and directional abrupt changes, ensuring smooth and stable motion evolution.
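A minimal PyTorch-style sketch of step 1, assuming a rectified-flow interface `model(z_t, t, text_emb) -> velocity`; the `vae.encode` and `empty_prompt_emb` names are placeholders, not the paper's API:

```python
import torch

@torch.no_grad()
def source_motion_representation(model, vae, video, t, empty_prompt_emb):
    """Extract the source motion representation without inversion:
    encode -> forward-noise -> one model call -> latent prediction."""
    z0_src = vae.encode(video)                    # clean source latent z_0^src
    noise = torch.randn_like(z0_src)              # z_1 ~ N(0, I)
    z_t_src = (1 - t) * z0_src + t * noise        # rectified-flow forward noising
    v_src = model(z_t_src, t, empty_prompt_emb)   # predicted velocity v_t^src
    return z_t_src - t * v_src                    # latent prediction \hat{z}_0^src(t)
```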

Flow Guidance Design (Two Objectives)

  • Latent Alignment (LA): Directly aligns source and target latent predictions to maintain global motion consistency: \(\mathcal{L}_{LA} = \|\hat{z}_0^{src}(t) - \hat{z}_0(t)\|_2^2\)
  • Difference Alignment (DA): Computes frame-to-frame differences \(\Delta(\hat{z}_0^{src}(t))\) and \(\Delta(\hat{z}_0(t))\) and aligns them, emphasizing temporal changes and suppressing static appearance information: \(\mathcal{L}_{DA} = \|\Delta(\hat{z}_0^{src}(t)) - \Delta(\hat{z}_0(t))\|_2^2\)
  • Total loss: \(\mathcal{L}_{FG} = \alpha \cdot \mathcal{L}_{LA} + \beta \cdot \mathcal{L}_{DA}\), where \(\alpha:\beta = 4:1\)
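A minimal sketch of the two objectives, assuming latents shaped (B, C, F, H, W) with F the frame axis and a mean-squared reduction (the paper writes squared L2; the exact reduction is an assumption):

```python
import torch

def flow_guidance_loss(z0_hat_src, z0_hat_tgt, alpha=4.0, beta=1.0):
    """Flow guidance: latent alignment (LA) + difference alignment (DA),
    combined with the paper's 4:1 weighting."""
    l_la = ((z0_hat_src - z0_hat_tgt) ** 2).mean()          # LA: global consistency
    d_src = z0_hat_src[:, :, 1:] - z0_hat_src[:, :, :-1]    # frame-to-frame differences
    d_tgt = z0_hat_tgt[:, :, 1:] - z0_hat_tgt[:, :, :-1]
    l_da = ((d_src - d_tgt) ** 2).mean()                    # DA: temporal changes only
    return alpha * l_la + beta * l_da
```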

Velocity Regularization

To prevent over-fitting to appearance details and temporal instability from directly optimizing the latent prediction:

  1. Compute the cumulative average velocity \(v_t^{avg} = (z_t - z_1) / (t - 1)\), where \(z_1\) denotes the initial noise latent
  2. Decompose the current velocity into a projection component \(v_t^{proj}\) along \(v_t^{avg}\) and an orthogonal component \(v_t^{orth}\)
  3. Suppress the orthogonal component with a decay factor \(\gamma=0.1\): \(v_t^{reg} = v_t^{proj} + \gamma \cdot v_t^{orth}\)
  4. Compute the latent prediction using the regularized velocity: \(\hat{z}_0(t) = z_t - t \cdot v_t^{reg}\)
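A minimal sketch of these four steps, assuming the projection is taken over the flattened latent tensor (the summary does not specify the granularity of the decomposition):

```python
import torch

def regularize_velocity(v_t, z_t, z_1, t, gamma=0.1, eps=1e-8):
    """Damp the component of v_t orthogonal to the cumulative average
    velocity, then form the regularized latent prediction."""
    v_avg = (z_t - z_1) / (t - 1.0)              # well-defined for t < 1
    unit = v_avg / (v_avg.norm() + eps)          # direction of v_avg
    v_proj = (v_t * unit).sum() * unit           # projection onto v_avg
    v_orth = v_t - v_proj                        # orthogonal remainder
    v_reg = v_proj + gamma * v_orth              # suppress orthogonal part
    return v_reg, z_t - t * v_reg                # regularized \hat{z}_0(t)
```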

Loss Function & Optimization

  • Guidance is applied only during the first 10 of the 50 denoising steps; at each guided step, the Adam optimizer runs 3 iterations on the target latent (see the sketch after this list)
  • Learning rate 0.003, CFG scale = 6
  • Gradients backpropagate only to the latent, not through internal model layers → extremely low memory overhead
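Putting the pieces together, a minimal sketch of one guided denoising step under these settings, reusing the `flow_guidance_loss` and `regularize_velocity` helpers above; the Euler update, the `dt` argument, and where exactly CFG is applied are assumptions:

```python
import torch

def guided_denoise_step(model, z_t, z_1, t, dt, text_emb, z0_hat_src,
                        step_idx, n_guided=10, n_iters=3, lr=3e-3):
    """One denoising step with flow guidance. Only the latent is optimized;
    velocities are computed under no_grad, so gradients never pass through
    the model's internal layers."""
    if step_idx < n_guided:                        # guide the first 10 of 50 steps
        z = z_t.detach().clone().requires_grad_(True)
        opt = torch.optim.Adam([z], lr=lr)
        for _ in range(n_iters):
            with torch.no_grad():
                v = model(z, t, text_emb)          # CFG-combined velocity (scale 6)
                v, _ = regularize_velocity(v, z, z_1, t)  # damp orthogonal part
            z0_hat = z - t * v                     # gradient reaches z only
            loss = flow_guidance_loss(z0_hat_src, z0_hat)
            opt.zero_grad()
            loss.backward()
            opt.step()
        z_t = z.detach()
    with torch.no_grad():
        v = model(z_t, t, text_emb)
    return z_t - dt * v                            # Euler update toward t = 0
```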

Key Experimental Results

Main Results (Table 1)

| Method | Type | Backbone | Text Sim.↑ | Motion Fid.↑ | Temp. Cons.↑ | Train Time (s) | Infer Time (s) | Memory (GB) |
|---|---|---|---|---|---|---|---|---|
| LoRA Tuning | train | Wan2.1-1.3B | 0.327 | 0.782 | 0.977 | 8100 | 135 | 25.0 |
| MotionDirector | train | ZeroScope-0.7B | 0.335 | 0.801 | 0.969 | 1662 | 140 | 28.0 |
| MotionInversion | train | ZeroScope-0.7B | 0.328 | 0.839 | 0.970 | 1170 | 115 | 24.0 |
| DeT | train | CogVideoX-2B | 0.340 | 0.812 | 0.980 | 2760 | 133 | 20.0 |
| MotionClone | free | AnimateDiff-1.3B | 0.332 | 0.786 | 0.940 | - | 804 | 51.5 |
| MOFT | free | AnimateDiff-1.3B | 0.338 | 0.582 | 0.973 | - | 576 | 75.0 |
| SMM | free | ZeroScope-0.7B | 0.322 | 0.762 | 0.958 | - | 1839 | 89.4 |
| DiTFlow | free | CogVideoX-2B | 0.350 | 0.691 | 0.983 | - | 349 | 63.5 |
| FlowMotion | free | Wan2.1-1.3B | 0.347 | 0.850 | 0.986 | - | 213 | 19.3 |

FlowMotion achieves the best Motion Fidelity (0.850) and Temporal Consistency (0.986), with Text Similarity second only to DiTFlow; inference time is only 213s (fastest among training-free methods) and memory is only 19.3 GB (lowest among all methods).

Ablation Study (Table 3)

| Variant | Text Sim.↑ | Motion Fid.↑ | Temp. Cons.↑ |
|---|---|---|---|
| w/o DA (remove difference alignment) | 0.341 | 0.842 | 0.981 |
| w/o VR (remove velocity regularization) | 0.313 | 0.809 | 0.968 |
| Full FlowMotion | 0.347 | 0.850 | 0.986 |

Removing VR causes significant drops across all metrics (especially Text Sim. from 0.347→0.313), demonstrating that velocity regularization is crucial for stable optimization.

Memory Efficiency Analysis (Table 4, same backbone Wan2.1-1.3B)

| Guidance Source | Memory (GB) |
|---|---|
| Pure inference (no guidance) | 17.7 |
| Latent Prediction (this method) | 19.3 |
| Velocity output | 93.1 |
| Attention Map & Feature | OOM |

Latent prediction guidance adds only 1.6 GB over pure inference, while velocity-based guidance requires 93.1 GB and attention-based guidance runs out of memory.

User Study (Table 2, 20 volunteers, 1–5 scale)

| Method | Motion↑ | Temp.↑ | Text↑ | Overall↑ |
|---|---|---|---|---|
| MotionInversion | 3.41 | 3.34 | 2.69 | 2.83 |
| DiTFlow | 2.48 | 3.18 | 3.16 | 2.63 |
| DeT | 3.87 | 3.83 | 3.38 | 3.47 |
| FlowMotion | 4.51 | 4.52 | 4.51 | 4.45 |

Highlights & Insights

  • Extremely simple and efficient: Guidance signals are based directly on model prediction outputs; gradients do not pass through internal model layers, requiring only 19.3 GB memory and 213s inference—the most efficient training-free method
  • No inversion needed: Source video motion representation is extracted via forward noising + empty prompt, skipping the time-consuming inversion process
  • Architecture-agnostic: Does not depend on specific attention structures or U-Net/DiT internal modules; validated on both Wan2.1-1.3B and Wan2.2-5B
  • Elegant velocity regularization design: Decomposing velocity into projection along the cumulative direction and an orthogonal component, then decaying the orthogonal component to suppress over-alignment—a concise and effective approach

Limitations & Future Work

  • The motion representation is still global latent-level alignment, lacking fine-grained control over local/regional motion (e.g., transferring only the foreground motion while leaving the background unconstrained)
  • Using latent prediction as the motion representation couples appearance information to some extent; the authors note that using clean latent \(z_0^{src}\) instead improves accuracy but reduces text alignment and background diversity—adaptive balancing remains an open question
  • Evaluation is conducted only at 480×720, 49 frames; scalability to higher resolutions and longer videos is not verified
  • Baseline methods use different backbones (due to architecture incompatibility), somewhat limiting fairness

Rating

  • Novelty: ⭐⭐⭐⭐ — Approaching motion transfer from the latent prediction perspective of flow matching is novel and the design is concise
  • Experimental Rigor: ⭐⭐⭐⭐ — Covers quantitative/qualitative/ablation/user study/memory analysis with comprehensive baselines
  • Writing Quality: ⭐⭐⭐⭐ — Clear figures and tables, convincing motivation analysis, well-structured
  • Significance: ⭐⭐⭐⭐ — Achieves significant improvements in both efficiency and performance for training-free motion transfer, with practical value