MotionDiff: Training-Free Zero-Shot Interactive Motion Editing via Flow-Assisted Multi-View Diffusion

Conference: ICCV 2025 · arXiv: 2503.17695 · Code: None · Area: Image Editing / Motion Editing · Keywords: Motion Editing, Optical Flow Guidance, Multi-View Consistency, Training-Free Diffusion, 3D Point Cloud

TL;DR

MotionDiff proposes a training-free, zero-shot multi-view motion editing approach that estimates multi-view optical flow from static scenes via a Point Kinematics Model (PKM), and leverages a decoupled motion representation to guide Stable Diffusion in generating high-quality, multi-view-consistent motion editing results.

Background & Motivation

Limitations of Prior Work

Background: Controllable editing via generative models is a prominent research direction, yet motion editing remains challenged by three key issues:

Difficulty Handling Complex Motions: Existing drag-based interactive editing methods (e.g., DragGAN, DragDiffusion) primarily address simple drag/translation operations and struggle with complex motions such as rotation, scaling, and stretching.

Lack of Multi-View Consistency: Most methods focus on single-view editing without cross-view motion constraints, leading to inconsistent results across viewpoints.

Retraining or Special Representations Required: Many methods require retraining diffusion models or rely on specific scene representations such as NeRF or 3DGS, incurring high computational cost and data collection overhead.

The authors' core insight is that optical flow serves as a natural carrier for pixel-level motion. By modeling motion in 3D space—applying user-defined motion patterns to a 3D point cloud—multi-view-consistent optical flow can be obtained naturally to guide the diffusion model for precise motion editing.

Starting Point

Goal: Enable training-free, zero-shot interactive motion editing of static multi-view scenes that supports translation, rotation, scaling, and stretching while remaining consistent across views, by deriving multi-view optical flow from user-specified motion of the 3D point cloud and using it to guide a pretrained Stable Diffusion model.


Method

Overall Architecture

MotionDiff comprises two inference stages:

  1. Multi-view Flow Estimation Stage (MFES): The user selects objects of interest from a static scene and defines motion patterns; multi-view optical flow is then estimated via PKM.
  2. Multi-view Motion Diffusion Stage (MMDS): The estimated optical flow guides Stable Diffusion for motion editing, with multi-view consistency enforced through the decoupled motion representation.

Key Designs

1. Point Kinematics Model (PKM)

PKM is the core contribution of this paper, estimating the 3D post-motion position \(P_m\) from user-defined single-view motion priors. The basic procedure is:

  • Mask Clustering is applied to segment the 3D point cloud; the user interactively selects the object of interest \(P_o\).
  • A GUI allows the user to define single-view optical flow \(f_s\) from a chosen viewpoint.
  • \(f_s\) is back-projected into 3D space to obtain sparse motion points \(P_{sm}\).
  • PKM estimates the complete \(P_m\) for each motion mode and projects it back to each view to obtain multi-view optical flow \(f_m\) (see the projection sketch after this list).
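
To make the projection step concrete, here is a minimal NumPy sketch (not the authors' released code) of how moving a point cloud induces consistent flow in every calibrated view: project the original points \(P_o\) and the moved points \(P_m\) with each camera and take the pixel displacement. The pinhole camera model and the names `project`, `multiview_flow`, and `cameras` are assumptions for illustration.

```python
import numpy as np

def project(points, K, R, t):
    """Project Nx3 world points to Nx2 pixel coordinates with a pinhole camera."""
    cam = points @ R.T + t          # world frame -> camera frame
    uv = cam @ K.T                  # apply intrinsics
    return uv[:, :2] / uv[:, 2:3]   # perspective divide

def multiview_flow(P_o, P_m, cameras):
    """Per-view optical flow induced by moving the points P_o to P_m."""
    flows = {}
    for name, (K, R, t) in cameras.items():
        flows[name] = project(P_m, K, R, t) - project(P_o, K, R, t)
    return flows

# Toy usage: one camera placed 3 units in front of two points translated along x.
K = np.array([[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]])
R, t = np.eye(3), np.array([0.0, 0.0, 3.0])
P_o = np.array([[0.0, 0.0, 0.0], [0.2, 0.1, 0.0]])
P_m = P_o + np.array([0.1, 0.0, 0.0])               # translation motion mode
print(multiview_flow(P_o, P_m, {"view0": (K, R, t)}))
```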

PKM supports four motion modes (all four are sketched in code after this list):

  • Translation: A 3D offset \(p_{off}\) is estimated from sparse point pairs \((P_{so}, P_{sm})\); \(P_m = P_o + p_{off}\).
  • Scaling: A scaling factor \(s_f\) is computed from the flow norm ratio; \(P_m = s_f \cdot P_o\).
  • Rotation: The rotation angle \(\varphi\) is obtained via the GUI; a rotation matrix \(Rot\) is constructed; \(P_m = Rot \, (P_o - p_c) + p_c\), where \(p_c\) is the rotation center.
  • Stretching: A stretching plane is determined, and per-point stretching factors are computed based on signed distances to the plane.
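
Below is a hedged NumPy sketch of the four motion modes applied to an Nx3 point cloud. The translation, scaling, and rotation forms follow the formulas above (rotation built with Rodrigues' formula about a user-chosen axis); the stretching variant, which displaces each point in proportion to its signed distance to the stretching plane, is an illustrative interpretation rather than the paper's exact formulation, and the parameters `axis` and `gain` are assumptions.

```python
import numpy as np

def translate(P_o, p_off):
    """Translation: P_m = P_o + p_off."""
    return P_o + p_off

def scale(P_o, s_f):
    """Scaling: P_m = s_f * P_o (about the world origin, as written above)."""
    return s_f * P_o

def rotate(P_o, phi, axis, p_c):
    """Rotation: P_m = Rot (P_o - p_c) + p_c, with Rot from Rodrigues' formula."""
    k = axis / np.linalg.norm(axis)
    Kx = np.array([[0, -k[2], k[1]], [k[2], 0, -k[0]], [-k[1], k[0], 0]])
    Rot = np.eye(3) + np.sin(phi) * Kx + (1 - np.cos(phi)) * (Kx @ Kx)
    return (P_o - p_c) @ Rot.T + p_c

def stretch(P_o, plane_point, plane_normal, gain):
    """Stretching: move each point along the plane normal in proportion to its
    signed distance to the stretching plane (per-point stretching factor)."""
    n = plane_normal / np.linalg.norm(plane_normal)
    d = (P_o - plane_point) @ n                     # signed distances to the plane
    return P_o + gain * d[:, None] * n
```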

2. Decoupled Motion Representation (Core of MMDS)

The key idea of MMDS is to decompose the motion image into three components:

  • Static Background: The background region exposed after the moving object is displaced.
  • Moving Object: The foreground object after being warped by the optical flow.
  • Occlusion Region: Newly occluded or exposed areas resulting from the motion.

Different guidance strategies are designed for each component:

  • Background region: copied directly from the original image.
  • Moving object: guided by flow-warped features that drive the diffusion generation.
  • Occlusion region: freely inpainted by the diffusion model.
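
The following NumPy sketch (an illustration, not the authors' implementation) shows one way to derive the three regions from a binary object mask and the object's forward flow: the warped mask marks where the object lands, the uncovered part of the original mask is the newly exposed area to inpaint, and everything else is static background to copy. The nearest-pixel forward splat is a simplifying assumption.

```python
import numpy as np

def forward_warp_mask(mask, flow):
    """Splat a boolean HxW mask forward along an HxWx2 flow field (nearest pixel)."""
    H, W = mask.shape
    warped = np.zeros_like(mask, dtype=bool)
    ys, xs = np.nonzero(mask)
    new_x = np.clip(np.round(xs + flow[ys, xs, 0]).astype(int), 0, W - 1)
    new_y = np.clip(np.round(ys + flow[ys, xs, 1]).astype(int), 0, H - 1)
    warped[new_y, new_x] = True
    return warped

def decouple_regions(object_mask, flow):
    """Split the edited image into the three regions used for guidance."""
    object_mask = object_mask.astype(bool)
    moved = forward_warp_mask(object_mask, flow)   # moving object: flow-warped guidance
    occlusion = object_mask & ~moved               # newly exposed area: free inpainting
    background = ~(object_mask | moved)            # static background: copy from input
    return background, moved, occlusion
```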

3. Multi-View Consistency Guarantee

Since the optical flow for all views is derived from the motion of the same 3D point cloud via PKM, geometric consistency across views is inherently guaranteed without any additional cross-view constraints or retraining.

Loss & Training

MotionDiff is completely training-free and operates purely at inference time. The guidance strategy follows the classifier guidance paradigm:

\[\tilde{\varepsilon}_\theta(x_t, t, y) = \varepsilon_\theta(x_t, t, y) + \sigma_t \cdot \nabla_{x_t} \mathcal{L}(x_t)\]

where \(\mathcal{L}(x_t)\) incorporates an optical flow guidance loss and a region decoupling loss. A fixed timestep schedule is used during DDIM sampling.
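A hedged PyTorch sketch of this guidance step is shown below. Here `unet` stands in for the pretrained Stable Diffusion noise predictor and `guidance_loss` for the combined optical flow guidance and region decoupling losses; both names are placeholders, and in practice the losses may be evaluated on a predicted clean image rather than directly on \(x_t\).

```python
import torch

def guided_eps(unet, x_t, t, cond, sigma_t, guidance_loss):
    """One classifier-guidance step: shift the noise prediction by sigma_t * grad L(x_t)."""
    x_t = x_t.detach().requires_grad_(True)
    loss = guidance_loss(x_t)                      # flow-guidance + region-decoupling terms
    grad = torch.autograd.grad(loss, x_t)[0]       # gradient of L(x_t) w.r.t. x_t
    with torch.no_grad():
        eps = unet(x_t, t, cond)                   # eps_theta(x_t, t, y)
    return eps + sigma_t * grad                    # tilde-eps used by the DDIM update
```

The DDIM update then uses this modified \(\tilde{\varepsilon}\) in place of the raw prediction at each scheduled timestep.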

Key Experimental Results

Main Results

Qualitative Comparison with Motion Editing Methods (primarily presented via visualization in the paper):

| Method | Translation | Rotation | Scaling | Stretching | Multi-View Consistent | Training Required |
| --- | --- | --- | --- | --- | --- | --- |
| DragDiffusion | ✓ | ✗ | ✗ | ✗ | ✗ | Yes (per-image LoRA fine-tuning) |
| Motion Guidance | ✓ | ✗ | ✗ | ✗ | ✗ | No |
| DragonDiffusion | ✓ | ✗ | ✗ | ✗ | ✗ | No |
| MotionDiff | ✓ | ✓ | ✓ | ✓ | ✓ | No |

Multi-View Editing Consistency Evaluation:

| Method | Multi-View Consistency | Texture Preservation | Motion Complexity |
| --- | --- | --- | --- |
| Motion Guidance | Low | Low | Translation |
| DragDiffusion | Low | Medium | Translation / Drag |
| MotionDiff | High | High | Multiple Modes |

Ablation Study

Ablation of Decoupled Motion Representation:

| Setting | Background Preservation | Motion Accuracy | Occlusion Completion |
| --- | --- | --- | --- |
| No Decoupling (Global Guidance) | Poor | Moderate | Poor |
| Background + Motion Only | Good | Good | Poor |
| Full Decoupling | Good | Good | Good |

The complete three-component decoupling (background + moving object + occlusion region) is critical to the final output quality.

Key Findings

  1. PKM is essential for multi-view-consistent motion editing: Modeling motion in 3D space fundamentally resolves the multi-view consistency problem.
  2. Optical flow is more expressive than drag points for complex motions: Optical flow naturally encodes complex transformations such as rotation and scaling.
  3. Decoupled representation substantially improves quality: Separately handling background, moving object, and occlusion region avoids texture degradation caused by global guidance.
  4. User-friendly interaction: The interactive GUI enables users to intuitively define motion patterns.

Highlights & Insights

  1. Elegant formulation of 3D motion modeling: Deriving multi-view optical flow from single-view motion priors naturally ensures geometric consistency.
  2. Completely training-free: No dataset preparation or model retraining is required; the method directly leverages pretrained Stable Diffusion.
  3. Rich motion modes: This is the first editing method to simultaneously support translation, rotation, scaling, and stretching.
  4. Clever decoupled representation design: The motion editing problem is decomposed into three well-defined sub-problems: background preservation, object transformation, and occlusion completion.

Limitations & Future Work

  1. Requires 3D point cloud as input: Dependence on pre-reconstructed static scenes limits applicability to arbitrary images.
  2. Rigid-body motion assumption: The four motion modes in PKM are essentially rigid transformations, making them unsuitable for non-rigid deformation (e.g., human body pose changes).
  3. Predominantly qualitative evaluation: Comprehensive quantitative benchmarks (e.g., large-scale FID or LPIPS evaluations) are lacking.
  4. Inference efficiency: The DDIM-based guided inference process may be slow.
  5. Occlusion completion quality depends on SD's generative capacity: Results may be suboptimal in complex occlusion scenarios.

Related Work Comparison

  • DragGAN / DragDiffusion: Interact via drag points, but are limited to simple translation.
  • Motion Guidance: Uses optical flow to guide diffusion models, but lacks multi-view consistency and texture preservation.
  • NeRF-based editing: Requires retraining the NeRF, incurring high computational cost.
  • 3DGS-based editing: Requires a specific 3DGS scene representation.
  • Insight: Lifting motion editing into 3D space is the fundamental solution to multi-view consistency; the decoupled representation paradigm generalizes to a broader range of editing tasks.

Rating

  • Novelty: ⭐⭐⭐⭐
  • Experimental Thoroughness: ⭐⭐⭐
  • Writing Quality: ⭐⭐⭐⭐
  • Value: ⭐⭐⭐⭐