MotionShot: Adaptive Motion Transfer across Arbitrary Objects for Text-to-Video Generation

Conference: ICCV 2025 · arXiv: 2507.16310 · Code: Project Page · Area: Video Generation · Keywords: motion transfer, text-to-video, training-free, TPS warping, temporal attention guidance

TL;DR

This paper proposes MotionShot, a training-free motion transfer framework that achieves high-fidelity motion transfer between arbitrary reference–target object pairs, even under significant appearance and structural differences. It does so via a two-level motion alignment strategy that combines high-level semantic alignment with low-level morphological alignment.

Background & Motivation

Challenges in Motion Transfer:

  • Most existing methods can only handle reference–target pairs with similar appearance (e.g., human→human, animal→same species).
  • When reference and target objects exhibit significant appearance/structural differences (e.g., anime character→Winnie the Pooh), motion transfer quality degrades sharply.

Limitations of Prior Work:

Keypoint sequence methods: Require predefined keypoints for each object category and cannot generalize to arbitrary objects.

Spatiotemporal feature methods: Motion and appearance are entangled in latent representations, causing reference appearance leakage.

Depth/edge/optical flow conditioning: Do not account for region-level semantic correspondence or pixel-level structural correspondence, and fail on object pairs with large differences.

Attention-based methods: Motion and structure are tightly coupled; when the target and reference differ significantly, the transferred motion becomes incompatible.

Core Problem: How to accurately transfer the motion patterns of a reference object while preserving the appearance of the target object?

Method

Overall Architecture

MotionShot builds upon the AnimateDiff video generation framework and consists of three main stages:

  1. Semantic Motion Alignment: Establishes high-level semantic correspondences between reference and target objects.
  2. Morphological Motion Alignment: Achieves low-level structural mapping via TPS transformation.
  3. Attention-guided Generation: Guides video generation using the warped reference frames.

Key Designs

  1. Semantic Motion Alignment:

    • Pseudo-target generation: A segmentation-conditioned ControlNet takes a degraded segmentation map of the reference object (a coarse initial-pose cue) and a text prompt as input, and generates a pseudo-target object whose initial pose is close to that of the reference. The ControlNet conditioning weight is set to 0.6 so that the text prompt remains dominant (a minimal diffusers sketch appears after this list).
    • Structure-aware keypoint sampling: \(m=30\) keypoints are sampled on the reference object, combining uniform contour sampling (interval \(d=200\)) with Poisson disk interior sampling, so that keypoints cover all regions of the object (see the sampling sketch after this list).
    • Semantic feature matching: Features from Stable Diffusion (low-level spatial information) and DINOv2 (high-level semantic information) are fused, and similarity is computed as a negative \(L_2\) distance: \(\text{Sim}(i,j) = -\|f_\text{tar}^s(i) - f_\text{ref}^s(j)\|_2\) (a matching sketch follows this list).
    • Design motivation: SD features provide fine spatial details but are error-prone in ambiguous regions; DINO captures high-level semantics but may miss fine details; fusion is complementary.
  2. Morphological Motion Alignment:

    • Target keypoint sequence construction: CoTracker3 tracks reference keypoints, and motion is transferred to the target space via global motion (elliptical rotation and translation) and local motion (polar-coordinate relative offsets).
    • Global motion: \(K_\text{tar}^t = \mathcal{S}(\mathcal{R}(K_\text{tar}^0, \Delta\Theta^t), \Delta O^t)\)
    • Local motion: Keypoint offsets are decomposed into radial scaling and polar angle offsets.
    • TPS shape warping: A Thin Plate Spline (TPS) transformation warps reference frames into the target shape (a NumPy TPS sketch appears after this list): \(\mathcal{T}^t(p) = A^t\begin{bmatrix}p\\1\end{bmatrix} + \sum_{i=1}^m w^{t,i}\,\mathcal{U}(\|\mathbf{K}_\text{tar}^{t,i}-p\|^2)\)
    • Design motivation: Point-level guidance lacks continuity and disrupts temporal attention; TPS warping provides a continuous shape mapping.
  3. Attention-guided Video Generation:

    • A single-step noise–denoise pass is applied to the warped reference frames to extract temporal attention maps \(A_\text{ref}^\tau\).
    • Top-\(k\) (\(k=1\)) sparse control masks are selected to reduce noise.
    • An energy function is defined: \(g = \|M^\tau \cdot (A_\text{ref}^\tau - A_\text{gen}^t)\|_2^2\)
    • Score-based guidance steers diffusion sampling (sketched after this list): \(\hat{\epsilon}_\theta = \epsilon_\theta(z_t, \text{text}, t) - \lambda\nabla_{z_t}g\)
    • A DDIM sampler is used; guidance is applied during the first 180 of 300 steps.
    • Design motivation: Since reference frames have already been warped into the target shape, motion information in the temporal attention is naturally aligned with the target structure.
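
To make step 1's pseudo-target generation concrete, here is a minimal sketch using Hugging Face diffusers. It assumes the public segmentation-conditioned ControlNet and SD 1.5 checkpoints; the paper's exact checkpoints, prompt, and segmentation-degradation procedure are not specified, so treat every name below as illustrative.

```python
import torch
from PIL import Image
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline

# Segmentation-conditioned ControlNet on top of SD 1.5 (assumed public checkpoints).
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-seg", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
).to("cuda")

# Placeholder path: the degraded segmentation map of the reference object,
# serving only as a coarse initial-pose cue.
degraded_seg = Image.open("degraded_seg.png")

# Weight 0.6 keeps the text prompt dominant over the segmentation condition.
pseudo_target = pipe(
    prompt="Winnie the Pooh standing, full body",
    image=degraded_seg,
    controlnet_conditioning_scale=0.6,
).images[0]
```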
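
The structure-aware keypoint sampling can be sketched as follows, assuming a binary object mask. OpenCV contour extraction and SciPy's Poisson-disk engine stand in for the paper's exact procedure; the counts follow the ablation's best setting (24 contour + 6 interior points).

```python
import cv2
import numpy as np
from scipy.stats import qmc  # PoissonDisk requires SciPy >= 1.10

def sample_keypoints(mask: np.ndarray, n_contour: int = 24, n_interior: int = 6,
                     seed: int = 0) -> np.ndarray:
    """Structure-aware sampling on a binary mask: uniform picks along the
    largest contour plus Poisson-disk samples in the interior."""
    h, w = mask.shape
    contours, _ = cv2.findContours(mask.astype(np.uint8),
                                   cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_NONE)
    contour = max(contours, key=cv2.contourArea).squeeze(1)   # (N, 2) as (x, y)
    idx = np.linspace(0, len(contour), n_contour, endpoint=False).astype(int)
    contour_pts = contour[idx]

    # Draw Poisson-disk candidates in the unit square, rescale to the image,
    # and keep only those that land inside the mask.
    engine = qmc.PoissonDisk(d=2, radius=0.1, seed=seed)
    cand = engine.random(8 * n_interior) * np.array([w - 1, h - 1])
    inside = cand[mask[cand[:, 1].astype(int), cand[:, 0].astype(int)] > 0]
    return np.vstack([contour_pts, inside[:n_interior]])
```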
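
A minimal sketch of the fused-feature correspondence follows. It assumes the SD and DINOv2 feature maps have already been extracted, channel-normalized, resized to a common resolution, and concatenated along the channel axis; `match_keypoints` is a hypothetical helper name, not the paper's API.

```python
import torch

def match_keypoints(f_ref: torch.Tensor, f_tar: torch.Tensor,
                    ref_kpts: torch.Tensor) -> torch.Tensor:
    """Match reference keypoints to target positions by fused-feature similarity.

    f_ref, f_tar: (C, H, W) fused feature maps (normalized SD + DINOv2
    features concatenated along C). ref_kpts: (m, 2) integer (x, y)
    coordinates. Returns (m, 2) best-matching (x, y) on the target map."""
    C, H, W = f_tar.shape
    f_q = f_ref[:, ref_kpts[:, 1], ref_kpts[:, 0]].T   # (m, C) query features
    f_t = f_tar.reshape(C, -1).T                       # (H*W, C) target features
    dist = torch.cdist(f_q, f_t)                       # L2: Sim(i, j) = -dist(i, j)
    best = dist.argmin(dim=1)                          # max similarity = min distance
    return torch.stack([best % W, best // W], dim=1)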
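
For the morphological stage, here is a NumPy sketch of the standard 2D thin-plate-spline fit and the mapping \(\mathcal{T}^t\) given above; the paper's implementation details (regularization, inverse vs. forward warping) may differ.

```python
import numpy as np

def fit_tps(src: np.ndarray, dst: np.ndarray):
    """Fit a 2D thin-plate spline sending m source keypoints `src` (m, 2)
    to their target positions `dst` (m, 2).
    Returns kernel weights w (m, 2) and affine part A (3, 2)."""
    m = len(src)
    d2 = np.sum((src[:, None] - src[None, :]) ** 2, axis=-1)
    K = np.where(d2 > 0, d2 * np.log(d2 + 1e-12), 0.0)   # U(r) = r^2 log r^2
    P = np.hstack([src, np.ones((m, 1))])
    L = np.zeros((m + 3, m + 3))
    L[:m, :m], L[:m, m:], L[m:, :m] = K, P, P.T          # lower-right 3x3 stays zero
    Y = np.vstack([dst, np.zeros((3, 2))])
    sol = np.linalg.solve(L, Y)
    return sol[:m], sol[m:]

def apply_tps(points: np.ndarray, src: np.ndarray,
              w: np.ndarray, A: np.ndarray) -> np.ndarray:
    """Evaluate T(p) = A^T [p; 1] + sum_i w_i U(||K_i - p||) at query points (n, 2)."""
    d2 = np.sum((points[:, None] - src[None, :]) ** 2, axis=-1)
    U = np.where(d2 > 0, d2 * np.log(d2 + 1e-12), 0.0)
    return U @ w + np.hstack([points, np.ones((len(points), 1))]) @ A
```

Warping a reference frame then amounts to evaluating the fitted mapping on a dense pixel grid and resampling the image, e.g., with cv2.remap.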
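
Finally, a hedged sketch of the score-based guidance step. `unet_with_attn` is a hypothetical wrapper that runs the video UNet once and also exposes its temporal attention maps; the energy and update rule follow the formulas above.

```python
import torch

def guided_noise_pred(unet_with_attn, z_t, t, text_emb, A_ref, mask, lam):
    """One guided step: bias the noise prediction with the gradient of the
    temporal-attention energy g = ||M * (A_ref - A_gen)||_2^2."""
    z = z_t.detach().requires_grad_(True)
    eps, A_gen = unet_with_attn(z, t, text_emb)
    g = ((mask * (A_ref - A_gen)) ** 2).sum()
    grad = torch.autograd.grad(g, z)[0]
    # The DDIM update then proceeds with the adjusted prediction; the paper
    # applies this guidance only during the first 180 of 300 steps.
    return (eps - lam * grad).detach()
```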

Loss & Training

MotionShot is a fully training-free framework:

  • No additional training data or fine-tuning is required.
  • It builds on pretrained AnimateDiff, ControlNet, Stable Diffusion, DINOv2, and CoTracker3.
  • All alignment is achieved through feature matching and geometric transformation.

Key Experimental Results

Main Results (Tables)

Quantitative Comparison (CLIP Scores + User Study):

| Method | Text Align↑ | Temporal Consist↑ | Motion Preserv↑ | Appear Diversity↑ | User-Text↑ | User-Temporal↑ |
|---|---|---|---|---|---|---|
| VideoComposer | 26.54 | 95.95 | 3.00 | 2.72 | 2.79 | 2.82 |
| Gen-1 | 22.79 | 97.67 | 2.87 | 2.71 | 2.75 | 2.87 |
| VMC | 26.77 | 97.72 | 2.80 | 2.78 | 2.78 | 2.87 |
| Tune-A-Video | 26.60 | 95.99 | 2.86 | 2.78 | 2.88 | 2.86 |
| Control-A-Video | 24.87 | 95.54 | 2.94 | 2.66 | 2.40 | 2.92 |
| MotionClone | 26.41 | 97.48 | 2.90 | 2.50 | 2.80 | 2.82 |
| MotionShot | 26.95 | 97.81 | 4.95 | 4.95 | 4.94 | 4.90 |
  • In the user study, MotionShot achieves near-perfect scores (on a 5-point scale) across all four dimensions, far surpassing all baselines.

Ablation Study (Tables)

Ablation on Number of Keypoints:

| Keypoints \(m\) | Contour | Interior | Effect |
|---|---|---|---|
| 10 | 8 | 2 | TPS warping fails; cannot match target shape |
| 30 | 24 | 6 | Reasonable warping; best overall performance |
| 60 | 48 | 12 | Overfitting; unnatural warping results |

Comparison of Semantic Feature Matching Methods:

| Method | Effect |
|---|---|
| X-Pose keypoint detector | Only predicts 17 points with uneven distribution; appearance mismatch |
| SD features only | Fine spatial detail but error-prone in ambiguous regions (e.g., tail) |
| DINO features only | Good high-level semantics but misses fine details (e.g., legs) |
| SD + DINO fusion | Balances fine-grained and high-level accuracy; best performance |

Comparison of Shape Retargeting Methods:

| Method | Issue |
|---|---|
| No warping of original sequence | Motion–shape mismatch; generated object deforms |
| Simple scaling | Consistent size but topological distortion (e.g., leg misalignment) |
| Keypoint-based TPS warping | Best motion accuracy and structural consistency |

Key Findings

  • Motion preservation and appearance diversity scores in the user study approach the maximum (4.95/5), far exceeding the second-best method (~3.0/5).
  • The fully training-free approach substantially outperforms training-based methods in motion transfer quality.
  • SD+DINO feature fusion is critical for semantic matching; using either feature alone is insufficient.
  • TPS warping is a key step; point-level guidance or simple scaling both lead to shape artifacts.
  • 30 keypoints is the optimal count: too few yield insufficient warping; too many cause overfitting.

Highlights & Insights

  1. Two-level alignment framework: The first work to explicitly model both high-level semantic alignment and low-level morphological alignment, addressing the core challenge of motion transfer between arbitrary object pairs.
  2. Fully training-free: No additional training or fine-tuning is needed; powerful capabilities are achieved by composing existing pretrained models.
  3. Pseudo-target generation strategy: Degraded segmentation maps guide the generation of a pseudo-target with a consistent initial pose, elegantly resolving the initial pose mismatch problem.
  4. Structure-aware keypoint sampling: The combination of uniform contour sampling and Poisson disk interior sampling ensures even keypoint coverage across all regions of the object.
  5. Motion decomposition: Decomposing motion into global rotation+translation and local polar-coordinate offsets enables fine-grained transfer of complex motions.

Limitations & Future Work

  • The method fails when the reference and target objects share no semantic similarity (e.g., airplane→flower).
  • Semantic correspondence relies on the feature quality of SD and DINOv2, and may be unreliable for objects far from the pretraining distribution.
  • Being built on the AnimateDiff framework, video quality and length are constrained by the underlying model.
  • The 300-step sampling with guidance applied for the first 180 steps leaves room for improving inference efficiency.
  • Only single-object motion transfer is supported; motion transfer involving multi-object interactions remains unexplored.
  • The two-level alignment paradigm (semantic + morphological) is generalizable to other tasks requiring cross-domain correspondence (e.g., style transfer, animation).
  • TPS warping, as a continuous spatial mapping tool, is better suited for attention-guided generation than discrete point-level guidance.
  • Training-free methods can address complex visual generation problems through the judicious combination of pretrained models.

Rating

  • Novelty: ⭐⭐⭐⭐⭐ The two-level motion alignment framework is pioneering and resolves a longstanding challenge of arbitrary object pair motion transfer.
  • Experimental Thoroughness: ⭐⭐⭐⭐ Quantitative evaluation, user study, and detailed ablations are provided, though additional automated motion fidelity metrics would strengthen the evaluation.
  • Writing Quality: ⭐⭐⭐⭐ The methodology is clearly described with rich illustrations and well-motivated design choices.
  • Value: ⭐⭐⭐⭐⭐ Near-perfect user study scores demonstrate strong practical value, and the training-free design lowers the barrier to adoption.