MotionShot: Adaptive Motion Transfer across Arbitrary Objects for Text-to-Video Generation

Conference: ICCV 2025 · arXiv: 2507.16310 · Code: Project Page · Area: Video Generation · Keywords: motion transfer, text-to-video, training-free, TPS warping, temporal attention guidance

TL;DR

This paper proposes MotionShot, a training-free motion transfer framework that achieves high-fidelity motion transfer between arbitrary reference–target object pairs, even under significant appearance and structural differences. It does so via a two-level motion alignment strategy that combines high-level semantic alignment with low-level morphological alignment.

Background & Motivation

Challenges in Motion Transfer:

  • Most existing methods can only handle reference–target pairs with similar appearance (e.g., human→human, animal→same species).
  • When reference and target objects exhibit significant appearance/structural differences (e.g., anime character→Winnie the Pooh), motion transfer quality degrades sharply.

Limitations of Prior Work:

Keypoint sequence methods: Require predefined keypoints for each object category and cannot generalize to arbitrary objects.

Spatiotemporal feature methods: Motion and appearance are entangled in latent representations, causing reference appearance leakage.

Depth/edge/optical flow conditioning: Do not account for region-level semantic correspondence or pixel-level structural correspondence, and fail on object pairs with large differences.

Attention-based methods: Motion and structure are tightly coupled; when the target and reference differ significantly, the transferred motion becomes incompatible.

Core Problem: How to accurately transfer the motion patterns of a reference object while preserving the appearance of the target object?

Method

Overall Architecture

MotionShot builds upon the AnimateDiff video generation framework and consists of three main stages:

  1. Semantic Motion Alignment: Establishes high-level semantic correspondences between reference and target objects.
  2. Morphological Motion Alignment: Achieves low-level structural mapping via TPS transformation.
  3. Attention-guided Generation: Guides video generation using the warped reference frames.

Key Designs

  1. Semantic Motion Alignment:

    • Pseudo-target generation: A segmentation-conditioned ControlNet takes a degraded segmentation map of the reference object (a coarse initial-pose cue) and a text prompt as input, and generates a pseudo-target object whose initial pose is close to that of the reference. The ControlNet conditioning weight is set to 0.6 so that the text prompt remains dominant (a minimal diffusers sketch appears after this list).
    • Structure-aware keypoint sampling: \(m=30\) keypoints are sampled on the reference object, combining uniform contour sampling (interval \(d=200\)) with Poisson disk interior sampling, so that keypoints cover all regions of the object (see the sampling sketch after this list).
    • Semantic feature matching: Features from Stable Diffusion (low-level spatial information) and DINOv2 (high-level semantic information) are fused, and similarity is computed as a negative \(L_2\) distance: \(\text{Sim}(i,j) = -\|f_\text{tar}^s(i) - f_\text{ref}^s(j)\|_2\) (a matching sketch follows this list).
    • Design motivation: SD features provide fine spatial details but are error-prone in ambiguous regions; DINO captures high-level semantics but may miss fine details; fusion is complementary.
  2. Morphological Motion Alignment:

    • Target keypoint sequence construction: CoTracker3 tracks reference keypoints, and motion is transferred to the target space via global motion (elliptical rotation and translation) and local motion (polar-coordinate relative offsets).
    • Global motion: \(K_\text{tar}^t = \mathcal{S}(\mathcal{R}(K_\text{tar}^0, \Delta\Theta^t), \Delta O^t)\)
    • Local motion: Keypoint offsets are decomposed into radial scaling and polar angle offsets.
    • TPS shape warping: A Thin Plate Spline (TPS) transformation warps reference frames into the target shape (a NumPy TPS sketch appears after this list): \(\mathcal{T}^t(p) = A^t\begin{bmatrix}p\\1\end{bmatrix} + \sum_{i=1}^m w^{t,i}\,\mathcal{U}(\|\mathbf{K}_\text{tar}^{t,i}-p\|^2)\)
    • Design motivation: Point-level guidance lacks continuity and disrupts temporal attention; TPS warping provides a continuous shape mapping.
  3. Attention-guided Video Generation:

    • A single-step noise–denoise pass is applied to the warped reference frames to extract temporal attention maps \(A_\text{ref}^\tau\).
    • Top-\(k\) (\(k=1\)) sparse control masks are selected to reduce noise.
    • An energy function is defined: \(g = \|M^\tau \cdot (A_\text{ref}^\tau - A_\text{gen}^t)\|_2^2\)
    • Score-based guidance steers diffusion sampling (sketched after this list): \(\hat{\epsilon}_\theta = \epsilon_\theta(z_t, \text{text}, t) - \lambda\nabla_{z_t}g\)
    • A DDIM sampler is used; guidance is applied during the first 180 of 300 steps.
    • Design motivation: Since reference frames have already been warped into the target shape, motion information in the temporal attention is naturally aligned with the target structure.
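
To make step 1's pseudo-target generation concrete, here is a minimal sketch using Hugging Face diffusers. It assumes the public segmentation-conditioned ControlNet and SD 1.5 checkpoints; the paper's exact checkpoints, prompt, and segmentation-degradation procedure are not specified, so treat every name below as illustrative.

```python
import torch
from PIL import Image
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline

# Segmentation-conditioned ControlNet on top of SD 1.5 (assumed public checkpoints).
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-seg", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
).to("cuda")

# Placeholder path: the degraded segmentation map of the reference object,
# serving only as a coarse initial-pose cue.
degraded_seg = Image.open("degraded_seg.png")

# Weight 0.6 keeps the text prompt dominant over the segmentation condition.
pseudo_target = pipe(
    prompt="Winnie the Pooh standing, full body",
    image=degraded_seg,
    controlnet_conditioning_scale=0.6,
).images[0]
```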
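
The structure-aware keypoint sampling can be sketched as follows, assuming a binary object mask. OpenCV contour extraction and SciPy's Poisson-disk engine stand in for the paper's exact procedure; the counts follow the ablation's best setting (24 contour + 6 interior points).

```python
import cv2
import numpy as np
from scipy.stats import qmc  # PoissonDisk requires SciPy >= 1.10

def sample_keypoints(mask: np.ndarray, n_contour: int = 24, n_interior: int = 6,
                     seed: int = 0) -> np.ndarray:
    """Structure-aware sampling on a binary mask: uniform picks along the
    largest contour plus Poisson-disk samples in the interior."""
    h, w = mask.shape
    contours, _ = cv2.findContours(mask.astype(np.uint8),
                                   cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_NONE)
    contour = max(contours, key=cv2.contourArea).squeeze(1)   # (N, 2) as (x, y)
    idx = np.linspace(0, len(contour), n_contour, endpoint=False).astype(int)
    contour_pts = contour[idx]

    # Draw Poisson-disk candidates in the unit square, rescale to the image,
    # and keep only those that land inside the mask.
    engine = qmc.PoissonDisk(d=2, radius=0.1, seed=seed)
    cand = engine.random(8 * n_interior) * np.array([w - 1, h - 1])
    inside = cand[mask[cand[:, 1].astype(int), cand[:, 0].astype(int)] > 0]
    return np.vstack([contour_pts, inside[:n_interior]])
```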
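
A minimal sketch of the fused-feature correspondence follows. It assumes the SD and DINOv2 feature maps have already been extracted, channel-normalized, resized to a common resolution, and concatenated along the channel axis; `match_keypoints` is a hypothetical helper name, not the paper's API.

```python
import torch

def match_keypoints(f_ref: torch.Tensor, f_tar: torch.Tensor,
                    ref_kpts: torch.Tensor) -> torch.Tensor:
    """Match reference keypoints to target positions by fused-feature similarity.

    f_ref, f_tar: (C, H, W) fused feature maps (normalized SD + DINOv2
    features concatenated along C). ref_kpts: (m, 2) integer (x, y)
    coordinates. Returns (m, 2) best-matching (x, y) on the target map."""
    C, H, W = f_tar.shape
    f_q = f_ref[:, ref_kpts[:, 1], ref_kpts[:, 0]].T   # (m, C) query features
    f_t = f_tar.reshape(C, -1).T                       # (H*W, C) target features
    dist = torch.cdist(f_q, f_t)                       # L2: Sim(i, j) = -dist(i, j)
    best = dist.argmin(dim=1)                          # max similarity = min distance
    return torch.stack([best % W, best // W], dim=1)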
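
For the morphological stage, here is a NumPy sketch of the standard 2D thin-plate-spline fit and the mapping \(\mathcal{T}^t\) given above; the paper's implementation details (regularization, inverse vs. forward warping) may differ.

```python
import numpy as np

def fit_tps(src: np.ndarray, dst: np.ndarray):
    """Fit a 2D thin-plate spline sending m source keypoints `src` (m, 2)
    to their target positions `dst` (m, 2).
    Returns kernel weights w (m, 2) and affine part A (3, 2)."""
    m = len(src)
    d2 = np.sum((src[:, None] - src[None, :]) ** 2, axis=-1)
    K = np.where(d2 > 0, d2 * np.log(d2 + 1e-12), 0.0)   # U(r) = r^2 log r^2
    P = np.hstack([src, np.ones((m, 1))])
    L = np.zeros((m + 3, m + 3))
    L[:m, :m], L[:m, m:], L[m:, :m] = K, P, P.T          # lower-right 3x3 stays zero
    Y = np.vstack([dst, np.zeros((3, 2))])
    sol = np.linalg.solve(L, Y)
    return sol[:m], sol[m:]

def apply_tps(points: np.ndarray, src: np.ndarray,
              w: np.ndarray, A: np.ndarray) -> np.ndarray:
    """Evaluate T(p) = A^T [p; 1] + sum_i w_i U(||K_i - p||) at query points (n, 2)."""
    d2 = np.sum((points[:, None] - src[None, :]) ** 2, axis=-1)
    U = np.where(d2 > 0, d2 * np.log(d2 + 1e-12), 0.0)
    return U @ w + np.hstack([points, np.ones((len(points), 1))]) @ A
```

Warping a reference frame then amounts to evaluating the fitted mapping on a dense pixel grid and resampling the image, e.g., with cv2.remap.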
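
Finally, a hedged sketch of the score-based guidance step. `unet_with_attn` is a hypothetical wrapper that runs the video UNet once and also exposes its temporal attention maps; the energy and update rule follow the formulas above.

```python
import torch

def guided_noise_pred(unet_with_attn, z_t, t, text_emb, A_ref, mask, lam):
    """One guided step: bias the noise prediction with the gradient of the
    temporal-attention energy g = ||M * (A_ref - A_gen)||_2^2."""
    z = z_t.detach().requires_grad_(True)
    eps, A_gen = unet_with_attn(z, t, text_emb)
    g = ((mask * (A_ref - A_gen)) ** 2).sum()
    grad = torch.autograd.grad(g, z)[0]
    # The DDIM update then proceeds with the adjusted prediction; the paper
    # applies this guidance only during the first 180 of 300 steps.
    return (eps - lam * grad).detach()
```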

Loss & Training

MotionShot is a fully training-free framework:

  • No additional training data or fine-tuning is required.
  • It builds on pretrained AnimateDiff, ControlNet, Stable Diffusion, DINOv2, and CoTracker3.
  • All alignment is achieved through feature matching and geometric transformation.

Key Experimental Results

Main Results (Tables)

Quantitative Comparison (CLIP Scores + User Study):

| Method | Text Align↑ | Temporal Consist↑ | Motion Preserv↑ | Appear Diversity↑ | User-Text↑ | User-Temporal↑ |
|---|---|---|---|---|---|---|
| VideoComposer | 26.54 | 95.95 | 3.00 | 2.72 | 2.79 | 2.82 |
| Gen-1 | 22.79 | 97.67 | 2.87 | 2.71 | 2.75 | 2.87 |
| VMC | 26.77 | 97.72 | 2.80 | 2.78 | 2.78 | 2.87 |
| Tune-A-Video | 26.60 | 95.99 | 2.86 | 2.78 | 2.88 | 2.86 |
| Control-A-Video | 24.87 | 95.54 | 2.94 | 2.66 | 2.40 | 2.92 |
| MotionClone | 26.41 | 97.48 | 2.90 | 2.50 | 2.80 | 2.82 |
| MotionShot | 26.95 | 97.81 | 4.95 | 4.95 | 4.94 | 4.90 |
  • In the user study, MotionShot achieves near-perfect scores (on a 5-point scale) across all four dimensions, far surpassing all baselines.

Ablation Study (Tables)

Ablation on Number of Keypoints:

| Keypoints \(m\) | Contour | Interior | Effect |
|---|---|---|---|
| 10 | 8 | 2 | TPS warping fails; cannot match target shape |
| 30 | 24 | 6 | Reasonable warping; best overall performance |
| 60 | 48 | 12 | Overfitting; unnatural warping results |

Comparison of Semantic Feature Matching Methods:

| Method | Effect |
|---|---|
| X-Pose keypoint detector | Only predicts 17 points with uneven distribution; appearance mismatch |
| SD features only | Fine spatial detail but error-prone in ambiguous regions (e.g., tail) |
| DINO features only | Good high-level semantics but misses fine details (e.g., legs) |
| SD + DINO fusion | Balances fine-grained and high-level accuracy; best performance |

Comparison of Shape Retargeting Methods:

| Method | Issue |
|---|---|
| No warping of original sequence | Motion–shape mismatch; generated object deforms |
| Simple scaling | Consistent size but topological distortion (e.g., leg misalignment) |
| Keypoint-based TPS warping | Best motion accuracy and structural consistency |

Key Findings

  • Motion preservation and appearance diversity scores in the user study approach the maximum (4.95/5), far exceeding the second-best method (~3.0/5).
  • The fully training-free approach substantially outperforms training-based methods in motion transfer quality.
  • SD+DINO feature fusion is critical for semantic matching; using either feature alone is insufficient.
  • TPS warping is a key step; point-level guidance or simple scaling both lead to shape artifacts.
  • 30 keypoints is the optimal count: too few yield insufficient warping; too many cause overfitting.

Highlights & Insights

  1. Two-level alignment framework: The first work to explicitly model both high-level semantic alignment and low-level morphological alignment, addressing the core challenge of motion transfer between arbitrary object pairs.
  2. Fully training-free: No additional training or fine-tuning is needed; powerful capabilities are achieved by composing existing pretrained models.
  3. Pseudo-target generation strategy: Degraded segmentation maps guide the generation of a pseudo-target with a consistent initial pose, elegantly resolving the initial pose mismatch problem.
  4. Structure-aware keypoint sampling: The combination of uniform contour sampling and Poisson disk interior sampling ensures even keypoint coverage across all regions of the object.
  5. Motion decomposition: Decomposing motion into global rotation+translation and local polar-coordinate offsets enables fine-grained transfer of complex motions.

Limitations & Future Work

  • The method fails when the reference and target objects share no semantic similarity (e.g., airplane→flower).
  • Semantic correspondence relies on the feature quality of SD and DINOv2, and may be unreliable for objects far from the pretraining distribution.
  • Being built on the AnimateDiff framework, video quality and length are constrained by the underlying model.
  • The 300-step sampling with guidance applied for the first 180 steps leaves room for improving inference efficiency.
  • Only single-object motion transfer is supported; motion transfer involving multi-object interactions remains unexplored.
  • The two-level alignment paradigm (semantic + morphological) is generalizable to other tasks requiring cross-domain correspondence (e.g., style transfer, animation).
  • TPS warping, as a continuous spatial mapping tool, is better suited for attention-guided generation than discrete point-level guidance.
  • Training-free methods can address complex visual generation problems through the judicious combination of pretrained models.

Rating

  • Novelty: ⭐⭐⭐⭐⭐ The two-level motion alignment framework is pioneering and resolves a longstanding challenge of arbitrary object pair motion transfer.
  • Experimental Thoroughness: ⭐⭐⭐⭐ Quantitative evaluation, user study, and detailed ablations are provided, though additional automated motion fidelity metrics would strengthen the evaluation.
  • Writing Quality: ⭐⭐⭐⭐ The methodology is clearly described with rich illustrations and well-motivated design choices.
  • Value: ⭐⭐⭐⭐⭐ Near-perfect user study scores demonstrate strong practical value, and the training-free design lowers the barrier to adoption.