SimMotionEdit: Text-Based Human Motion Editing with Motion Similarity Prediction
Conference: CVPR 2025
arXiv: 2503.18211
Code: https://github.com/lzhyu/SimMotionEdit
Area: Human Understanding
Keywords: Motion Editing, Diffusion Model, Auxiliary Task, Motion Similarity, Transformer
TL;DR
This paper proposes SimMotionEdit, which introduces motion similarity prediction as an auxiliary task and pairs it with a dual-module architecture (Condition Transformer + Diffusion Transformer), achieving state-of-the-art text-driven 3D human motion editing on the MotionFix dataset.
Background & Motivation
- Background: Text-driven 3D human motion synthesis has made remarkable progress (e.g., MDM, MotionDiffuse), but fine-grained editing starting from existing motion (rather than generating from scratch) remains a frontier challenge.
- Limitations of Prior Work: Existing methods (such as the attention manipulation in MotionCLR or TMED in MotionFix) suffer from insufficient editing alignment—there is a semantic mismatch between the generated motion and the text instructions or source motion. Training-free methods are constrained by the capacity of pre-trained models, while training-based methods (such as TMED) lack explicit modeling of "which frames need to be edited".
- Key Challenge: Motion editing must satisfy two constraints at once: consistency with the source motion (keeping unchanged parts intact) and alignment with the text instruction (modifying the correct parts). Existing models, however, have no explicit mechanism for locating the regions to edit.
- Goal: Enable the model to first learn to "predict the similarity curve between the source motion and the edited motion," and then utilize this capability to guide the actual editing process.
- Key Insight: The approach mirrors an animator's workflow: first identify the keyframes that need modification, then edit them. The motion similarity curve encodes exactly the information of "which frames have changed."
- Core Idea: Multi-task learning, where the features from the auxiliary task (motion similarity prediction) enhance the conditional representation of the main task (motion editing).
Method
Overall Architecture
The input consists of a source motion sequence \(X\) and a text editing instruction \(L\), and the output is the edited motion sequence \(M\). The model contains two modules: Condition Transformer, which processes source motion and text features and enhances feature interaction through the auxiliary task; and Diffusion Transformer, which receives the enhanced features and noisy edited motion to generate the edited result via DDPM denoising. The two tasks are jointly trained, with the total loss \(\mathcal{L} = \mathcal{L}_{aux} + \mathcal{L}_e\).
Key Designs
- Motion Similarity Prediction Auxiliary Task
  - Function: Enables the model to learn to predict "which frames need to be edited and the degree of change" from the source motion and text instructions.
  - Mechanism: Construct a predictable similarity curve. For each source frame \(i\), take the minimum distance to the edited-motion frames \(j\) within a sliding window \(|i-j| \leq W\) as the raw similarity. This is computed for joint rotations (\(S_i^{Rr}\)) and, analogously, for joint positions (\(S_i^{Rl}\)); the two are fused as \(S_i^R = w_1 S_i^{Rr} + w_2 S_i^{Rl}\) and then min-max normalized to \([0,1]\). MotionSNR (a signal-to-noise ratio) is used to filter noisy samples, and the normalized similarity is finally quantized into \(K\) discrete classes \(\mathfrak{s}_i\), trained with the cross-entropy loss \(\mathcal{L}_{aux} = -\frac{1}{F}\sum_{i=0}^{F-1}\log p_{i, \mathfrak{s}_i}\), where \(F\) is the number of frames and \(p_{i, \mathfrak{s}_i}\) is the predicted probability of the ground-truth class at frame \(i\).
  - Design Motivation: (a) Sliding window matching avoids false discrepancies caused by frame alignment offsets; (b) normalization renders samples with different editing magnitudes comparable; (c) quantizing into a classification task is more robust than regression and allows the model to generalize to a wider range of editing scenarios.
- Condition Transformer
  - Function: Blends source motion features and text features to generate enhanced conditional representations.
  - Mechanism: A standard Transformer encoder whose inputs are the source motion token sequence and CLIP text features. Guided by the auxiliary loss, the Transformer learns to encode information about "which frames will change" into its feature representations. The output is split into enhanced source motion features and enhanced text features, which are fed into the Diffusion Transformer separately.
  - Design Motivation: Decouple the input of the auxiliary task from the diffusion process: noisy variations in the edited motion should not affect the learning of similarity prediction.
- Diffusion Transformer
  - Function: Generates the edited motion based on the enhanced conditions.
  - Mechanism: The input is the concatenated sequence of enhanced source motion features and noisy edited motion. Enhanced text features are injected via the AdaLN-Zero layer (similar to DiT), alongside the diffusion timestep \(t\). The target is to predict the original edited motion signal \(M_0\), with the editing loss formulated as \(\mathcal{L}_e = \mathbb{E}[\|M_0 - \mathcal{E}(M_t, t, L, X)\|_2^2]\). Training and inference utilize the DDPM framework.
  - Design Motivation: Enable the Diffusion Transformer to focus on utilizing the enhanced conditional information for denoising generation, ensuring each module performs its designated role.
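
The label construction for the auxiliary task can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the default values of `window`, `w_rot`, `w_pos`, and `num_classes`, the use of plain Euclidean distance per frame, and the convention that class 0 means "least change" are all hypothetical choices made here. MotionSNR filtering is omitted because its definition is not given in these notes.

```python
import math

def frame_distance(a, b):
    """Euclidean distance between two per-frame feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def windowed_min_distance(source, edited, window):
    """For each source frame i, the smallest distance to any edited frame j
    with |i - j| <= window. Assumes both motions have the same frame count."""
    assert len(source) == len(edited), "sketch assumes equal frame counts"
    out = []
    for i, src in enumerate(source):
        lo = max(0, i - window)
        hi = min(len(edited) - 1, i + window)
        out.append(min(frame_distance(src, edited[j]) for j in range(lo, hi + 1)))
    return out

def min_max_normalize(values):
    """Rescale values to [0, 1]; a constant curve maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]

def similarity_labels(src_rot, edit_rot, src_pos, edit_pos,
                      window=2, w_rot=0.5, w_pos=0.5, num_classes=4):
    """Fuse rotation- and position-based curves, normalize, quantize to K classes."""
    d_rot = windowed_min_distance(src_rot, edit_rot, window)
    d_pos = windowed_min_distance(src_pos, edit_pos, window)
    fused = [w_rot * r + w_pos * p for r, p in zip(d_rot, d_pos)]
    normalized = min_max_normalize(fused)
    # Map [0, 1] onto class indices 0..K-1 (the value 1.0 falls in the last class).
    return [min(int(v * num_classes), num_classes - 1) for v in normalized]

def auxiliary_loss(probs, labels):
    """Cross-entropy: -(1/F) * sum_i log p[i][label_i]."""
    return -sum(math.log(p[s]) for p, s in zip(probs, labels)) / len(labels)
```

With a window of 0 the curve reduces to a frame-by-frame comparison; widening the window lets a frame that was merely shifted in time still count as unchanged, which is the motivation given for sliding-window matching.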
Loss & Training
- Total loss \(\mathcal{L} = \mathcal{L}_{aux} + \mathcal{L}_e\), with equal weights assigned to both terms.
- Standard DDPM training with \(T=1000\) steps, predicting \(M_0\).
- Low-quality training samples are filtered using a MotionSNR threshold (samples with extremely small editing magnitudes are excluded).
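
To make the training objective concrete, here is a small sketch of how one loss evaluation could look. It assumes the standard DDPM forward process with a linear beta schedule from 1e-4 to 0.02 (a common default that these notes do not confirm for this paper), treats a motion as a flat list of floats, and uses a placeholder in place of the Diffusion Transformer. All function names and the dummy auxiliary loss value are hypothetical.

```python
import math
import random

def linear_betas(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Linear noise schedule (assumed; the schedule is not stated in the notes)."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + step * t for t in range(num_steps)]

def cumulative_alphas(betas):
    """alpha_bar_t = prod_{s <= t} (1 - beta_s)."""
    out, prod = [], 1.0
    for b in betas:
        prod *= 1.0 - b
        out.append(prod)
    return out

def q_sample(m0, t, alpha_bars, noise):
    """Forward process: M_t = sqrt(alpha_bar_t) * M_0 + sqrt(1 - alpha_bar_t) * eps."""
    a = math.sqrt(alpha_bars[t])
    s = math.sqrt(1.0 - alpha_bars[t])
    return [a * x + s * e for x, e in zip(m0, noise)]

def edit_loss(m0, m0_pred):
    """Squared L2 distance between the clean motion and its prediction."""
    return sum((x - y) ** 2 for x, y in zip(m0, m0_pred))

def total_loss(aux, edit):
    """Equal-weight sum of the auxiliary and editing losses."""
    return aux + edit

# One illustrative loss evaluation on a toy 3-value "motion".
random.seed(0)
alpha_bars = cumulative_alphas(linear_betas())
m0 = [0.1, -0.3, 0.5]
t = random.randrange(len(alpha_bars))          # uniformly sampled timestep
noise = [random.gauss(0.0, 1.0) for _ in m0]
m_t = q_sample(m0, t, alpha_bars, noise)
m0_pred = m_t                                  # placeholder for the Diffusion Transformer output
loss = total_loss(aux=1.2, edit=edit_loss(m0, m0_pred))  # 1.2 is a dummy auxiliary loss
```

Because the network predicts \(M_0\) directly rather than the noise, the editing loss compares its output against the clean edited motion at every sampled timestep.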
Key Experimental Results
Main Results
| Method | R@1 (Batch)↑ | R@1 (Test)↑ | AvgR (Test)↓ | M-score↑ |
|---|---|---|---|---|
| MDM | 4.03 | 0.10 | - | - |
| MDM-BP | 39.10 | 8.69 | 180.99 | - |
| TMED (MotionFix) | 62.90 | 14.51 | 56.63 | -3.512 |
| SimMotionEdit | 70.62 | 25.49 | 23.49 | -3.210 |
| Ground Truth | 100.0 | 64.36 | 1.74 | -3.175 |
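
For readers unfamiliar with the retrieval metrics in the table, the sketch below shows one common way R@1 and AvgR are computed from a similarity matrix between generated and target motions. It assumes `sim[i][j]` holds the similarity between generated motion `i` and target motion `j`, with the true match on the diagonal; the feature extractor that produces these similarities, the gallery size of the "Batch" setting, and the function names are not specified in these notes and are hypothetical here.

```python
def retrieval_ranks(sim):
    """Rank of the true target (diagonal entry) in each row; 1 is the best rank."""
    ranks = []
    for i, row in enumerate(sim):
        better = sum(1 for j, s in enumerate(row) if j != i and s > row[i])
        ranks.append(1 + better)
    return ranks

def recall_at_k(ranks, k):
    """Percentage of queries whose true target is ranked within the top k."""
    return 100.0 * sum(1 for r in ranks if r <= k) / len(ranks)

def average_rank(ranks):
    """Mean rank of the true target (lower is better)."""
    return sum(ranks) / len(ranks)

# Toy gallery of three motions: the second query ranks a wrong target first.
sim = [
    [0.9, 0.1, 0.2],
    [0.8, 0.5, 0.1],
    [0.2, 0.3, 0.7],
]
ranks = retrieval_ranks(sim)
r_at_1 = recall_at_k(ranks, 1)
avg_r = average_rank(ranks)
```

The same functions apply to both settings in the table; only the gallery changes, which is why scores on the full test set are much lower than in the batch setting.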
Ablation Study
| Configuration | R@1 (Batch)↑ | R@1 (Test)↑ | Description |
|---|---|---|---|
| Full model | 70.62 | 25.49 | Auxiliary task, classification labels, and MotionSNR filtering all enabled |
| w/o Auxiliary Task | ~65 | ~19 | Retrieval metrics drop significantly without similarity prediction |
| Regression Alternative to Classification | ~67 | ~21 | Continuous regression of similarity is inferior to discrete classification |
| w/o MotionSNR Filtering | ~68 | ~22 | Low-quality samples introduce noise |
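Key Findings
- SimMotionEdit improves Generated-to-Target R@1 over TMED from 62.90 to 70.62 (Batch setting) and from 14.51 to 25.49 (full test set). The ablation attributes a large part of this gain to the auxiliary task: removing it lowers R@1 to roughly 65 and 19, respectively.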
- Quantizing into a classification task outperforms continuous regression—regression overfits to specific editing magnitudes, whereas classification permits better generalization.
- The M-score closely approaches Ground Truth (-3.210 vs -3.175), indicating high realism of the edited motions.
- MotionSNR filtering effectively purges noisy samples where text does not match the motion changes.
Highlights & Insights
- The philosophy of "locate-before-edit" is highly intuitive and effective. Analogous to attention editing or region selection in image editing, it achieves frame-level "focused attention" in motion editing through similarity prediction.
- Quantization instead of regression is a design choice worth learning from—edited motion is not unique, and different magnitudes of editing are all plausible; thus, classification tasks support multi-modality.
- The decoupled dual-Transformer architecture is cleanly designed: the Condition Transformer handles the auxiliary task and feature enhancement, while the Diffusion Transformer handles generation, so each module has a single, well-defined role even though the two are trained jointly.
Limitations & Future Work
- Validated only on the MotionFix dataset; dataset scale and diversity are relatively limited.
- The definition of similarity relies on Euclidean distances of joint rotations and positions, lacking a semantic-level similarity metric.
- Edits that change the sequence length are not supported, even though source and edited motions may have different frame counts.
- Hyperparameters such as the sliding window size \(W\), the number of quantization categories \(K\), and the MotionSNR threshold require manual tuning.
Related Work & Insights
- vs TMED (MotionFix): TMED directly trains a diffusion model for editing without explicitly modeling "which frames need to be modified". Ours enhances conditional features through an auxiliary task, raising full-test-set R@1 by about 11 percentage points (14.51 to 25.49).
- vs MotionCLR: MotionCLR achieves editing via attention manipulation but does not support free-text inputs. Ours supports free-text and undergoes supervised training.
- vs MDM-BP: MDM-BP adapts MDM to editing using body-part (BP) information, but its retrieval metrics are far lower than ours.
Rating
- Novelty: ⭐⭐⭐⭐ The concept of auxiliary-task-driven condition enhancement is proposed for the first time in motion editing, and the construction of similarity curves is also creative.
- Experimental Thoroughness: ⭐⭐⭐⭐ Ablation studies cover the key design points, though only on one small dataset.
- Writing Quality: ⭐⭐⭐⭐ The derivation of the method's motivation is natural, and the diagrams are clear.
- Value: ⭐⭐⭐⭐ SOTA achieved in the field of motion editing; the auxiliary task paradigm is transferable.