Less is More: Improving Motion Diffusion Models with Sparse Keyframes¶
Conference: ICCV 2025 arXiv: 2503.13859 Area: Motion Generation · Diffusion Models · Keyframes Keywords: motion diffusion, sparse keyframes, masking-interpolation, Visvalingam-Whyatt, Lipschitz MLP
TL;DR¶
This paper proposes sMDM, a motion diffusion framework centered on sparse keyframes. By introducing a masking-interpolation strategy and the Visvalingam-Whyatt keyframe selection algorithm, sMDM reduces redundant frame processing and consistently outperforms dense-frame baselines in text alignment and motion quality.
Background & Motivation¶
Existing motion diffusion models (e.g., MDM) take dense frame sequences as input and output, and face two core issues:
High complexity in training and inference: The computational cost of self-attention layers grows quadratically with the number of frames, making training on large-scale motion datasets expensive. In practice, motion diffusion models suffer severe quality degradation under low diffusion step settings.
Lack of controllability: Generating all frames simultaneously makes it difficult to interpret or control model behavior. In contrast, professional animators define sparse keyframes first and then interpolate intermediate frames.
Core Motivation: Inspired by professional animation workflows—where animators focus on keyframes rather than every frame—sMDM builds a diffusion framework around sparse, geometrically meaningful keyframes.
Method¶
Overall Architecture (Fig. 2)¶
sMDM introduces three core modifications to the Transformer backbone of MDM:
- Masking: A binary keyframe mask \(\mathbf{M}\) suppresses non-keyframe tokens in self-attention, reducing attention complexity from \(\mathcal{O}(N^2)\) to \(\mathcal{O}(K^2)\) (\(K \ll N\)).
- Interpolation: After self-attention, keyframe features are linearly interpolated to reconstruct non-keyframe features.
- Lipschitz MLP: Input/output linear layers are replaced with Lipschitz MLPs to ensure smooth interpolation.
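The masking-interpolation idea can be sketched as follows. This is a minimal NumPy illustration of the principle (attend only over keyframe tokens, then linearly interpolate back to all frames), not the paper's actual Transformer implementation; the function name and toy single-head attention are mine.

```python
import numpy as np

def masked_attention_interpolate(x, key_idx):
    """Toy sketch: attend only over keyframe tokens, then linearly
    interpolate the attended keyframe features back to all frames.
    x: (N, D) dense feature sequence; key_idx: increasing keyframe indices."""
    k = x[key_idx]                                   # (K, D) keyframe tokens
    # Plain softmax self-attention over the K keyframes only: O(K^2), not O(N^2).
    scores = k @ k.T / np.sqrt(k.shape[1])
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    attended = weights @ k                           # (K, D)
    # Linear interpolation reconstructs non-keyframe features.
    dense = np.stack(
        [np.interp(np.arange(len(x)), key_idx, attended[:, d])
         for d in range(x.shape[1])], axis=1)
    return dense                                     # (N, D)
```

Only the K keyframe tokens enter the quadratic attention; every other frame is recovered afterwards by interpolation, which is what makes the Lipschitz smoothness of the feature space matter.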
Key Design 1: Keyframe Selection¶
During training: The Visvalingam-Whyatt geometric simplification algorithm automatically identifies geometrically significant keyframes. It iteratively removes the frame with the smallest "area" (i.e., lowest geometric importance) until the target keyframe count is reached. A compression rate of 80% is used (retaining 20% of frames as keyframes).
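The Visvalingam-Whyatt step can be sketched as below: repeatedly drop the interior frame spanning the smallest triangle with its surviving neighbours. This is a generic simplification sketch over pose vectors of arbitrary dimension (function names and the Gram-determinant area formula are my choices, not the paper's code).

```python
import numpy as np

def triangle_area(a, b, c):
    # Area of triangle (a, b, c) in arbitrary dimension, via the
    # Gram-determinant form: 0.5 * sqrt(|u|^2 |v|^2 - (u.v)^2).
    u, v = b - a, c - a
    g = np.dot(u, u) * np.dot(v, v) - np.dot(u, v) ** 2
    return 0.5 * np.sqrt(max(g, 0.0))

def visvalingam_whyatt(frames, n_keep):
    """Sketch of Visvalingam-Whyatt keyframe selection: iteratively remove
    the interior frame whose removal changes the trajectory least (smallest
    triangle area) until n_keep frames remain. Endpoints are always kept."""
    idx = list(range(len(frames)))
    while len(idx) > n_keep:
        areas = [triangle_area(frames[idx[i - 1]], frames[idx[i]], frames[idx[i + 1]])
                 for i in range(1, len(idx) - 1)]
        idx.pop(1 + int(np.argmin(areas)))
    return idx
```

On a flat trajectory with one spike, the spike and its neighbours survive while the redundant collinear frames are pruned first, which is the "geometric importance" the paper relies on.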
During inference (two-stage strategy):
- Early stage (\(t > T' = \gamma \cdot T\)): a uniform mask (evenly spaced keyframes) provides a stable denoising baseline.
- Late stage (\(t \leq T'\)): the mask is updated dynamically; Visvalingam-Whyatt is re-applied to the intermediate estimate \(\mathbf{x}_t\) to reselect keyframes, focusing attention on the most critical frames. \(\gamma = 0.1\) yields the best results.
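The schedule above can be sketched as a small dispatch function. The signature, the `reselect` callback, and the default `gamma` are illustrative stand-ins for the paper's actual interfaces.

```python
import numpy as np

def keyframe_mask(t, T, n_frames, n_key, x_t=None, gamma=0.1, reselect=None):
    """Sketch of the two-stage inference masking: early denoising steps
    (t > gamma * T) use an evenly spaced mask; late steps re-select
    keyframes from the current estimate x_t via a geometric criterion."""
    if t > gamma * T or reselect is None:
        # Early stage: uniform keyframe placement for a stable baseline.
        key_idx = np.linspace(0, n_frames - 1, n_key).round().astype(int)
    else:
        # Late stage: re-apply keyframe selection (e.g. Visvalingam-Whyatt) to x_t.
        key_idx = reselect(x_t, n_key)
    mask = np.zeros(n_frames, dtype=bool)
    mask[key_idx] = True
    return mask
```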
Key Design 2: Lipschitz MLP¶
To ensure that linear interpolation in feature space produces smooth motion outputs, a Lipschitz constraint is imposed on the input/output MLPs. In addition, sine activations are used in place of ReLU to better capture high-frequency motion details (e.g., abrupt direction changes).
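One common way to realize such a layer is to normalize each weight row against a learnable per-layer bound; the sketch below follows that construction with a sine activation. The details (softplus bound, row-wise infinity-norm scaling) are my assumption about a typical Lipschitz-MLP formulation, not the paper's exact one.

```python
import numpy as np

def softplus(x):
    return np.log1p(np.exp(x))

def lipschitz_linear(x, W, b, c):
    """Sketch of one Lipschitz-bounded linear layer with a sine activation.
    Rows of W are rescaled so the layer's Lipschitz constant w.r.t. the
    infinity norm is at most softplus(c); sine itself is 1-Lipschitz."""
    row_sums = np.abs(W).sum(axis=1, keepdims=True)
    scale = np.minimum(1.0, softplus(c) / np.maximum(row_sums, 1e-12))
    y = (W * scale) @ x + b
    return np.sin(y)  # sine activation in place of ReLU
```

Because the linear part is bounded by softplus(c) and sine has unit slope, the whole layer satisfies \(\|f(x_1) - f(x_2)\|_\infty \leq \mathrm{softplus}(c)\,\|x_1 - x_2\|_\infty\), which is exactly what keeps linear interpolation between keyframe features from producing jittery outputs.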
Loss & Training¶
Standard diffusion reconstruction loss with Lipschitz regularization:

\[
\mathcal{L} = \mathbb{E}_{\mathbf{x}_0, t}\big[\|\mathbf{x}_0 - \hat{\mathbf{x}}_0\|_2^2\big] + \lambda_{\text{Lip}}\,\mathcal{L}_{\text{Lip}},
\]

where \(\hat{\mathbf{x}}_0 = f_\theta(\mathbf{x}_t, t, c)\). The loss is computed over interpolated dense frames rather than keyframes alone.
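A minimal sketch of this dense-frame loss, assuming keyframe predictions are first interpolated back to all N frames before comparison; the `lam` weight and `lip_terms` (per-layer Lipschitz bounds) are illustrative stand-ins for the paper's regularization details.

```python
import numpy as np

def dense_reconstruction_loss(x0, x0_hat_key, key_idx, lam=1e-6, lip_terms=None):
    """Sketch: reconstruction loss over interpolated dense frames, not
    keyframes alone, plus an optional Lipschitz regularization term.
    x0: (N, D) ground truth; x0_hat_key: (K, D) keyframe predictions."""
    n, d = x0.shape
    # Interpolate keyframe predictions to all N frames before comparing.
    x0_hat = np.stack([np.interp(np.arange(n), key_idx, x0_hat_key[:, j])
                       for j in range(d)], axis=1)
    loss = np.mean((x0 - x0_hat) ** 2)
    if lip_terms is not None:
        # Illustrative regularizer: product of per-layer softplus bounds.
        loss += lam * np.prod([np.log1p(np.exp(c)) for c in lip_terms])
    return loss
```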
Key Experimental Results¶
Main Results: Text-to-Motion Generation (Tab. 1, HumanML3D)¶
| Method | R@1 ↑ | R@3 ↑ | FID ↓ | MM-Dist ↓ |
|---|---|---|---|---|
| MDM | 0.320 | 0.611 | 0.544 | 5.566 |
| MLD | 0.481 | 0.772 | 0.473 | 3.196 |
| T2M-GPT | 0.606 | 0.838 | 12.475 | 16.812 |
| MoMask | 0.621 | 0.846 | 12.232 | 16.138 |
| ReMoDiffuse§ | 0.510 | 0.795 | 0.103 | 2.974 |
| sMDM | 0.494 | 0.776 | 0.130 | 3.051 |
| sMDM-stella† | 0.554 | 0.829 | 0.151 | 2.740 |
Using the same CLIP encoder, sMDM substantially outperforms MDM. With the Stella-1.5B encoder (sMDM-stella), it surpasses MotionGPT, MotionLCM, and MotionBase—all of which use larger encoders—across all metrics.
Ablation Study (Tab. 1, lower half)¶
| Configuration | FID ↓ | R@3 ↑ | MM-Dist ↓ |
|---|---|---|---|
| Random keyframes | 0.249 | 0.777 | 3.086 |
| No interpolation (keyframe loss only) | 0.267 | 0.773 | 3.127 |
| No Lipschitz | 0.329 | 0.768 | 3.149 |
| sMDM (full) | 0.130 | 0.776 | 3.051 |
Key findings:
- Geometric keyframe selection substantially outperforms random selection (FID 0.130 vs. 0.249).
- Computing the loss over interpolated dense frames is more effective than computing it over keyframes alone.
- The Lipschitz constraint contributes significantly to FID improvement (0.130 vs. 0.329 without it).
Stability Across Diffusion Steps (Tab. 2)¶
| Steps | FID (standard) | FID (+ dynamic sampling) |
|---|---|---|
| 1000 | 0.291 | 0.246 |
| 500 | 0.322 | 0.229 |
| 100 | 0.230 | 0.190 |
| 50 | 0.130 | 0.134 |
| 10 | 0.349 | 0.385 |
Key finding: Dynamic mask updates are especially effective at higher step counts (e.g., FID reduced from 0.322 to 0.229 at 500 steps) but provide no benefit at very low step counts, consistent with observations from spatial guidance methods such as OmniControl.
Long-Sequence Generation (Tab. 3, DoubleTake)¶
Applied to PriorMDM's DoubleTake pipeline for long sequences, the resulting sPriorMDM generates more expressive motions (EES 2.338 vs. 1.856) with greater fidelity to text instructions, though transition-region FID is slightly higher, likely because the generated motions are more dynamic and such transition patterns are underrepresented in the training set.
Highlights & Insights¶
- Sparse training paradigm: This work demonstrates that "less is more"—focusing on 20% of keyframes yields better results than processing all frames, consistent with animation production intuition.
- Visvalingam-Whyatt algorithm: Its adoption endows keyframe selection with geometric meaning, surpassing random or uniform sampling strategies.
- Plug-and-play: sMDM does not alter MDM's core architecture; improvements are achieved solely through masking, interpolation, and Lipschitz constraints, making it straightforward to apply to other diffusion-based motion models.
- The method maintains strong quality at very low diffusion steps (10 steps), which is practically valuable for real-time applications.
Limitations & Future Work¶
- The keyframe compression rate is a fixed hyperparameter; different motion types may require different optimal rates.
- Dynamic mask updates are less effective at extremely low step counts (10 steps).
- Linear interpolation may be insufficient for reconstructing complex, high-frequency motion details.
Related Work & Insights¶
- Motion diffusion: MDM, MLD, MotionDiffuse, FlowMDM
- Keyframe-based methods: CondMDI, KEYIN
- Real-time control: DiP, CLoSD, CAMDM
Rating¶
- Novelty: ★★★★☆ — The idea of training motion diffusion models with sparse keyframes is both novel and practical.
- Technical Depth: ★★★★☆ — Each component is concisely designed and effective, with thorough ablations.
- Experimental Thoroughness: ★★★★★ — Validated across three downstream tasks with detailed multi-step stability analysis.
- Writing Quality: ★★★★☆ — Motivation is clearly articulated; the analogy to animation workflows is intuitive.