MotionReFit: Dynamic Motion Blending for Versatile Motion Editing¶

Conference: CVPR 2025
arXiv: 2503.20724
Code: https://awfuact.github.io/motionrefit/
Area: Human Understanding / Motion Editing
Keywords: Text-guided Motion Editing, Motion Blending, Diffusion Models, Autoregressive, Style Transfer

TL;DR¶

MotionReFit is presented as the first versatile text-guided motion editing framework that supports both spatial and temporal editing without additional specifications (such as which body parts to edit) or LLMs. It combines three components: MotionCutMix data augmentation, an autoregressive diffusion model, and a motion harmonizer.

Background & Motivation¶

Background: Text-guided motion editing covers both semantic editing (e.g., changing hand movements) and style editing (e.g., applying an angry style). However, existing methods depend on a fixed set of pre-collected training triplets, which limits what they can learn.

Limitations of Prior Work: (1) The number of training triplets is limited, leading to poor generalization; (2) explicit specification of the edited body parts is required; (3) the generated edited motions lack smooth transitions in both spatial and temporal dimensions.

Core Idea: MotionCutMix dynamically generates training triplets online by blending body parts from different motion sequences, combined with an autoregressive diffusion model that generates segment-by-segment to ensure temporal smoothness.

Method¶

Overall Architecture¶

MotionCutMix generates extensive training triplets \(\rightarrow\) MotionReFit (autoregressive conditional diffusion model) generates edited motions segment by segment \(\rightarrow\) Motion Harmonizer (discriminator) corrects coordination mismatches in body parts via classifier guidance.

Key Designs¶

MotionCutMix Data Augmentation:
- Function: Expands limited annotated data into a large number of training triplets.
- Mechanism: For semantic editing, a source motion is sampled at random from a large motion library and blended with a target motion that carries an annotated body-part mask. Blending uses a soft mask with SLERP interpolation and yields a new (original motion, edited motion, instruction) triplet. For style editing, the motions of the non-edited body parts are replaced instead. With \(N_L\) library motions and \(N_S\) annotated samples, the effective data size grows from \(N_S\) to \(N_L \times N_S\).
- Design Motivation: Online augmentation avoids the cost of pre-collecting massive amounts of triplets.
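
The blending step described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: motions are assumed to be stored as per-joint unit quaternions of shape `(T, J, 4)`, the soft mask is assumed to be one weight per joint, and the names `slerp`, `motion_cutmix`, and the instruction string are hypothetical.

```python
import numpy as np

def slerp(q0, q1, w):
    """Spherical linear interpolation between unit quaternions.

    q0, q1: arrays of shape (..., 4); w: weights in [0, 1], broadcastable to (..., 1).
    w = 0 returns q0, w = 1 returns q1 (up to the quaternion sign ambiguity).
    """
    q0 = q0 / np.linalg.norm(q0, axis=-1, keepdims=True)
    q1 = q1 / np.linalg.norm(q1, axis=-1, keepdims=True)
    dot = np.sum(q0 * q1, axis=-1, keepdims=True)
    q1 = np.where(dot < 0, -q1, q1)           # follow the shorter arc
    theta = np.arccos(np.clip(np.abs(dot), 0.0, 1.0))
    sin_theta = np.sin(theta)
    small = sin_theta < 1e-6                  # nearly identical rotations
    safe = np.where(small, 1.0, sin_theta)
    a = np.where(small, 1.0 - w, np.sin((1.0 - w) * theta) / safe)
    b = np.where(small, w, np.sin(w * theta) / safe)
    out = a * q0 + b * q1
    return out / np.linalg.norm(out, axis=-1, keepdims=True)

def motion_cutmix(library_motion, annotated_motion, soft_mask):
    """Blend two motions joint by joint (assumed layout: (T, J, 4) quaternions).

    soft_mask: (J,) weights; 0 keeps the library joint, 1 takes the annotated joint.
    """
    w = soft_mask[None, :, None]              # broadcast over frames and components
    return slerp(library_motion, annotated_motion, w)

rng = np.random.default_rng(0)
T, J = 8, 5
lib = rng.normal(size=(T, J, 4))
lib /= np.linalg.norm(lib, axis=-1, keepdims=True)
ann = rng.normal(size=(T, J, 4))
ann /= np.linalg.norm(ann, axis=-1, keepdims=True)

mask = np.array([0.0, 0.0, 0.5, 1.0, 1.0])    # hypothetical soft mask with a transition joint
edited = motion_cutmix(lib, ann, mask)
triplet = (lib, edited, "raise the right arm")  # hypothetical instruction text
```

Because each annotated sample can be paired with any library motion at training time, drawing the pair online is what gives the \(N_L \times N_S\) effective data size without storing the blended results.
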
Autoregressive Diffusion Model:
- Function: Generates edited motion sequences segment by segment.
- Mechanism: Processes the original motion with a sliding window. Each segment keeps its first two frames noise-free so that they connect to the previous segment; noise is added and removed only from the third frame onward. The conditioning signals are the previous motion segment, the original motion segment, the CLIP-encoded editing instruction, and a progress indicator.
- Design Motivation: Segment-by-segment generation reduces the difficulty of learning long sequences, and the retained frames guarantee temporal smoothness.
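
The segment-by-segment loop can be illustrated with a small sketch. It rests on several assumptions that are not in the summary: the window length of 16, the use of the original motion's first two frames to start the first segment, the progress indicator being the fraction of the sequence already processed, and the callable `denoise_segment`, which stands in for the conditional diffusion sampler and is replaced here by a toy function.

```python
import numpy as np

def edit_autoregressive(original, denoise_segment, window=16, keep=2):
    """Generate an edited motion window by window.

    original: (T, D) array of pose features.
    denoise_segment(prefix, original_segment, progress) -> (len(original_segment), D);
    it must return `prefix` unchanged as its first `keep` frames.
    """
    T = original.shape[0]
    stride = window - keep
    edited = original[:keep].copy()   # assumption: first segment starts from the original frames
    start = 0
    while start + keep < T:
        end = min(start + window, T)
        prefix = edited[start:start + keep]          # noise-free frames from the previous segment
        progress = start / max(T - 1, 1)             # assumed form of the progress indicator
        segment = denoise_segment(prefix, original[start:end], progress)
        edited = np.concatenate([edited, segment[keep:]], axis=0)
        start += stride
    return edited

def toy_denoiser(prefix, original_segment, progress):
    """Stand-in for the diffusion sampler: shifts every generated frame by +1."""
    body = original_segment[len(prefix):] + 1.0
    return np.concatenate([prefix, body], axis=0)

original = np.arange(40 * 3, dtype=float).reshape(40, 3)
edited = edit_autoregressive(original, toy_denoiser)
```

Each new segment overlaps the previous one by the two retained frames, so consecutive segments share their boundary frames by construction.
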
Motion Harmonizer:
- Function: Eliminates coordination mismatches among body parts.
- Mechanism: A binary discriminator is trained to distinguish natural motions from blended motions. During denoising, its gradients are applied as classifier guidance and push the generated results away from the distribution of "blended motions" and toward more natural motion.
- Design Motivation: The blending nature of MotionCutMix introduces unnatural coordination patterns among body parts.
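
The guidance step can be sketched with a toy discriminator. This is an illustration of generic classifier guidance, assuming a logistic discriminator on a flattened motion vector; the functions `p_natural` and `guided_mean`, the weights `w` and `b`, and the guidance scale are all hypothetical and do not reflect the actual harmonizer architecture.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def p_natural(x, w, b):
    """Toy discriminator: probability that motion vector x is natural, not blended."""
    return sigmoid(x @ w + b)

def grad_log_p_natural(x, w, b):
    # d/dx log sigmoid(w.x + b) = (1 - sigmoid(w.x + b)) * w
    return (1.0 - p_natural(x, w, b)) * w

def guided_mean(mean, variance, w, b, scale=1.0):
    """Classifier guidance: shift the denoising mean along the discriminator gradient."""
    return mean + scale * variance * grad_log_p_natural(mean, w, b)

rng = np.random.default_rng(0)
w = rng.normal(size=6)        # hypothetical discriminator weights
b = 0.0
mean = rng.normal(size=6)     # stand-in for the predicted mean at one denoising step
shifted = guided_mean(mean, variance=0.1, w=w, b=b, scale=2.0)
```

The shifted mean scores higher under the discriminator than the unguided one, which is the effect the harmonizer is meant to have at each denoising step.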

Loss & Training¶

The diffusion model is trained with the standard DDPM loss and uses classifier-free guidance. The authors also construct the STANCE dataset, which comprises three subsets: 13K sequences for body-part replacement, 750 sequences for style transfer, and 16K triplets for fine-grained adjustment.
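
For reference, the two ingredients named above are usually written as follows. This uses the common noise-prediction form, which is an assumption here since the summary does not say which parameterization the model uses; the symbols \(x_t\), \(\epsilon_\theta\), \(\bar{\alpha}_t\), the condition \(c\), the null condition \(\varnothing\), and the guidance scale \(s\) are likewise not taken from the summary.

\[
\mathcal{L}_{\text{DDPM}} = \mathbb{E}_{x_0,\,\epsilon,\,t}\left[\left\lVert \epsilon - \epsilon_\theta(x_t, t, c) \right\rVert^2\right],
\qquad x_t = \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon
\]

\[
\hat{\epsilon}_\theta(x_t, t, c) = \epsilon_\theta(x_t, t, \varnothing) + s\,\bigl(\epsilon_\theta(x_t, t, c) - \epsilon_\theta(x_t, t, \varnothing)\bigr)
\]

During training the condition \(c\) is randomly replaced by \(\varnothing\), so a single network provides both terms of the guided prediction at sampling time.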

Key Experimental Results¶

Main Results¶

Achieves SOTA on body-part replacement and style transfer tasks:
- Leads across FID, diversity, and text fidelity.
- Significantly outperforms methods such as TMED and FineMoGen in instruction following.

Ablation Study¶

MotionCutMix significantly improves generalization capability (with more pronounced improvements under limited data).
The Motion Harmonizer effectively reduces unnatural motions.
Autoregressive vs. one-shot generation: Autoregressive generation performs significantly better on long sequences.

Key Findings¶

MotionCutMix does not affect the training convergence speed.
Soft-mask blending produces smoother transitions than hard-mask blending.
The progress indicator is crucial for temporal editing.

Highlights & Insights¶

The idea of MotionCutMix is similar to CutMix in the image domain but extended to the motion domain.
Using the harmonizer as a discriminator to guide denoising is an ingenious design.
Three editing tasks (body-part replacement, style transfer, and fine-grained adjustment) are unified within a single framework.

Limitations & Future Work¶

The SMPL-X representation limits finger details.
Annotated data for style editing remains limited (750 sequences).
The effectiveness on editing completely different semantics (e.g., "walking \(\rightarrow\) dancing") remains to be validated.

Rating¶

Novelty: 8/10 — The MotionCutMix augmentation strategy is novel.
Technical Depth: 8/10 — Complete design of autoregression and the harmonizer.
Experimental Thoroughness: 8/10 — Three tasks + sufficient ablations.
Writing Quality: 8/10 — Well-structured.