SMooDi: Stylized Motion Diffusion Model
Conference: ECCV 2024
arXiv: 2407.12783
Area: Human Motion Generation
TL;DR
Introduces SMooDi—the first diffusion model that adapts a pre-trained text-to-motion model for stylized motion generation. Through a style adaptor and dual style guidance (classifier-free guidance + classifier-based guidance), it enables diverse stylized motion generation driven by content text and style motion sequences.
Background & Motivation
- Human motion consists of two dimensions: content (e.g., walking, waving) and style (e.g., elderly, happy, angry).
- Text-driven motion generation (e.g., MDM, MLD) has made significant progress but primarily focuses on content, lacking style control.
- Motion style transfer methods can transfer style from one sequence to another but require an existing content motion sequence as input, which limits flexibility.
- Simply chaining the two (generation followed by transfer) suffers from three issues:
    - Low efficiency: Each sequence must be processed sequentially.
    - Error accumulation: Style transfer models degrade in performance on imperfectly generated motions.
    - Data limitations: Style transfer methods rely on limited motion content in specific style datasets.
Method
Overall Architecture
SMooDi is built upon the pre-trained MLD (Motion Latent Diffusion) model and comprises two core modules:
1. Style Adaptor: Injects style conditions through residual features.
2. Style Guidance: Dual guidance consisting of classifier-free and classifier-based guidance.
At denoising step \(t\), the model takes the content text \(\mathbf{c}\), the style motion \(\mathbf{s}\), and the noisy latent \(\mathbf{z}_t\) as inputs and predicts the noise \(\epsilon_t\).
Key Designs
1. Style Adaptor
- The adaptor is a trainable copy of the Transformer encoder in MLD.
- An independent style encoder extracts embeddings from the style motion sequence.
- The connection between the Style Adaptor and MLD is established through zero-initialized linear layers (inspired by the ControlNet design).
- During training, the adaptor progressively learns style constraints and applies corrective features to the corresponding layers of MLD.
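The zero-initialized connection can be illustrated with a minimal sketch. It is not the authors' implementation: `linear`, `base_layer`, `adaptor_layer`, `layer_with_adaptor`, and the 2-dimensional toy features are hypothetical names and sizes chosen for illustration. The sketch shows that a zero-initialized projection leaves the frozen model's output unchanged at the start of training, and that the adaptor only contributes once those weights move away from zero.

```python
def linear(weights, x):
    """Apply a bias-free linear map given as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def base_layer(x):
    # Stand-in for one frozen layer of the pre-trained model (hypothetical).
    return [2.0 * v for v in x]


def adaptor_layer(x, style):
    # Stand-in for the trainable copy, which also sees the style embedding.
    return [v + s for v, s in zip(x, style)]


def layer_with_adaptor(x, style, proj):
    # The adaptor output passes through a projection and is then added
    # as a residual to the frozen layer's output.
    residual = linear(proj, adaptor_layer(x, style))
    return [b + r for b, r in zip(base_layer(x), residual)]


x = [1.0, -0.5]
style = [0.3, 0.7]

# Zero initialization: the residual vanishes, so the frozen output is preserved.
zero_proj = [[0.0, 0.0], [0.0, 0.0]]
out_init = layer_with_adaptor(x, style, zero_proj)

# After (simulated) training the projection is non-zero and style shifts the output.
trained_proj = [[0.1, 0.0], [0.0, 0.1]]
out_trained = layer_with_adaptor(x, style, trained_proj)
```

Starting from an exact copy of the pre-trained behaviour is the usual motivation for this design: early gradient steps cannot corrupt the content prior, because the style branch is initially silent.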
2. Classifier-Free Style Guidance
Decomposes the conditional guidance into two independent components, one for content and one for style.
The weights \(w_c\) and \(w_s\) control the strength of content and style guidance, respectively, allowing a flexible balance between content preservation and style expression.
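One form consistent with this description is given below. It is a reconstruction that assumes the usual two-condition extension of classifier-free guidance, with \(\varnothing\) denoting a dropped condition and \(\epsilon_\theta\) the denoiser; it is not quoted from the paper.

\[
\hat{\epsilon}_\theta(\mathbf{z}_t, t, \mathbf{c}, \mathbf{s}) = \epsilon_\theta(\mathbf{z}_t, t, \varnothing, \varnothing) + w_c \big( \epsilon_\theta(\mathbf{z}_t, t, \mathbf{c}, \varnothing) - \epsilon_\theta(\mathbf{z}_t, t, \varnothing, \varnothing) \big) + w_s \big( \epsilon_\theta(\mathbf{z}_t, t, \mathbf{c}, \mathbf{s}) - \epsilon_\theta(\mathbf{z}_t, t, \mathbf{c}, \varnothing) \big)
\]

Under this form, setting \(w_c = w_s = 1\) cancels the intermediate terms and recovers the fully conditioned prediction \(\epsilon_\theta(\mathbf{z}_t, t, \mathbf{c}, \mathbf{s})\), while raising \(w_s\) alone amplifies only the difference attributable to the style condition.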
3. Classifier-Based Style Guidance
Defines an analytic function \(G(\mathbf{z}_t, t, \mathbf{s})\) as the L1 distance between the generated motion and the reference style motion in the style embedding space.
The gradient of \(G\) is used to guide the generated motion toward the target style. The style feature extractor is obtained by training a style classifier on the 100STYLE dataset.
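The idea of steering with the gradient of an L1 style distance can be sketched on a toy problem. Everything here is an assumption made for illustration: the feature extractor is a fixed linear map `W` rather than a trained classifier, it is applied directly to the latent rather than to a decoded motion, and the gradient is applied as a plain descent step on the latent with a hypothetical step size `tau`. How the gradient enters the denoising update in the actual method is not specified in these notes.

```python
def style_features(weights, z):
    """Toy linear style feature extractor (stand-in for a trained classifier)."""
    return [sum(w * v for w, v in zip(row, z)) for row in weights]


def l1_style_distance(weights, z, target_feat):
    feats = style_features(weights, z)
    return sum(abs(f - t) for f, t in zip(feats, target_feat))


def sign(v):
    return (v > 0) - (v < 0)


def l1_style_gradient(weights, z, target_feat):
    # For a linear extractor: d/dz sum_i |(Wz)_i - t_i| = W^T sign(Wz - t).
    feats = style_features(weights, z)
    signs = [sign(f - t) for f, t in zip(feats, target_feat)]
    return [sum(weights[i][j] * signs[i] for i in range(len(weights)))
            for j in range(len(z))]


def guided_step(weights, z, target_feat, tau):
    # Move the latent against the gradient so the style distance shrinks.
    grad = l1_style_gradient(weights, z, target_feat)
    return [v - tau * g for v, g in zip(z, grad)]


W = [[1.0, 0.0], [0.0, 2.0]]
style_latent = [0.5, -0.5]          # toy reference style
target = style_features(W, style_latent)
z_t = [2.0, 1.0]                    # toy noisy latent

before = l1_style_distance(W, z_t, target)
z_guided = guided_step(W, z_t, target, tau=0.1)
after = l1_style_distance(W, z_guided, target)
```

In this toy setting the distance drops after one guided step. With a real classifier the gradient would come from automatic differentiation rather than the closed form used here.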
Loss & Training
The total training loss consists of three terms:
- \(\mathcal{L}_{std}\): Standard denoising loss computed on the 100STYLE dataset.
- \(\mathcal{L}_{pr}\): Content prior preservation loss—computed by sampling from HumanML3D to prevent "content forgetting."
- \(\mathcal{L}_{cyc}\): Cycle prior preservation loss—reconstructs the original sequences after swapping the content and style of the two datasets, encouraging style-content disentanglement.
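The three terms listed above are presumably combined as a weighted sum. The balancing weights \(\lambda_{pr}\) and \(\lambda_{cyc}\) in the sketch below are assumed symbols; these notes do not state their names or values.

\[
\mathcal{L} = \mathcal{L}_{std} + \lambda_{pr}\,\mathcal{L}_{pr} + \lambda_{cyc}\,\mathcal{L}_{cyc}
\]

Read this way, \(\mathcal{L}_{std}\) teaches style from 100STYLE, while the two prior preservation terms keep the model anchored to the content distribution of HumanML3D.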
Key Experimental Results
Main Results
Comparison of the stylized text-to-motion generation task (HumanML3D content + 100STYLE style):
| Method | FID↓ | Foot Skating↓ | MM Dist↓ | R-precision↑ | Diversity→ | SRA(%)↑ |
|---|---|---|---|---|---|---|
| MLD+Motion Puzzle | 6.127 | 0.185 | 6.467 | 0.290 | 6.476 | 63.769 |
| MLD+Aberman et al. | 3.309 | 0.347 | 5.983 | 0.406 | 8.816 | 54.367 |
| ChatGPT+MLD | 0.614 | 0.131 | 4.313 | 0.605 | 8.836 | 4.819 |
| SMooDi | 1.609 | 0.124 | 4.477 | 0.571 | 9.235 | 72.418 |
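SMooDi attains the highest SRA (72.42%) and the lowest foot skating (0.124), while ChatGPT+MLD keeps the best FID, MM Dist, and R-precision. However, the SRA of ChatGPT+MLD is only 4.82%, indicating that MLD fails to produce the requested style even when the text prompt includes a style description.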
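For readers unfamiliar with the SRA column, the sketch below assumes it is the percentage of generated motions that a pre-trained style classifier assigns to the intended style label. The function name `style_recognition_accuracy` and the toy labels are hypothetical; the classifier itself is outside the sketch and its predictions are simply given as a list.

```python
def style_recognition_accuracy(predicted_styles, target_styles):
    """Percentage of generated motions whose predicted style matches the target."""
    if len(predicted_styles) != len(target_styles):
        raise ValueError("prediction and target lists must have equal length")
    if not target_styles:
        raise ValueError("at least one sample is required")
    hits = sum(p == t for p, t in zip(predicted_styles, target_styles))
    return 100.0 * hits / len(target_styles)


# Toy labels: three of the four generated motions are recognized as intended.
predicted = ["old", "happy", "angry", "old"]
targets = ["old", "happy", "old", "old"]
sra = style_recognition_accuracy(predicted, targets)
```

Because the score depends entirely on the classifier's judgments, it inherits any bias of that classifier.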
Ablation Study
Contribution of each module to the performance:
| Configuration | FID↓ | Foot Skating↓ | MM Dist↓ | R-precision↑ | Diversity→ | SRA(%)↑ |
|---|---|---|---|---|---|---|
| Full Model | 1.609 | 0.124 | 4.477 | 0.571 | 9.235 | 72.418 |
| w/o \(L_{cyc}\) | 2.046 | 0.136 | 4.465 | 0.569 | 8.869 | 64.866 |
| w/o \(L_{pr}+L_{cyc}\) | 5.996 | 0.166 | 6.098 | 0.335 | 7.456 | 81.841 |
| w/o Classifier Guidance | 1.050 | 0.111 | 4.085 | 0.630 | 9.445 | 20.245 |
| w/o Adaptor | 2.984 | 0.123 | 4.526 | 0.550 | 8.372 | 69.952 |
Key Observations:
- Removing \(L_{pr}+L_{cyc}\) yields an SRA of 81.84% but causes the FID to surge to 5.996 \(\rightarrow\) severe "content forgetting," where motions degrade entirely to locomotion from the style dataset.
- Removing classifier guidance drops the SRA sharply from 72.4% to 20.2% \(\rightarrow\) classifier guidance is crucial for reflecting style.
- The adaptor and classifier guidance are complementary: the adaptor establishes the basic style direction, while classifier guidance performs fine-grained adjustment.
Key Findings
- Complementary Dual Guidance: Classifier-free guidance captures coarse, style-related features, while classifier-based guidance provides precise style control; both are indispensable.
- Crucial Content Preservation Loss: Training without the prior preservation losses leads to severe content forgetting; without them, a single model cannot support diverse content and diverse styles at the same time.
- Motion Style Transfer as a Downstream Task: Content motion sequences can be inverted into noisy latent representations via DDIM-Inversion, enabling style transfer without requiring additional optimization.
- In user studies, SMooDi receives higher user preference across three dimensions: realism, style reflection, and content preservation.
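The DDIM-Inversion route to style transfer mentioned in the findings above can be sketched with the deterministic DDIM update run in both directions. This is a toy illustration under stated assumptions, not the paper's pipeline: `toy_noise_predictor` is a hypothetical constant stand-in for the denoiser (chosen so that the round trip is exact), the `alphas` schedule is invented, and no style condition is actually applied.

```python
import math


def ddim_move(z, eps, alpha_from, alpha_to):
    """Deterministic DDIM update between two noise levels (cumulative alphas)."""
    x0 = [(v - math.sqrt(1.0 - alpha_from) * e) / math.sqrt(alpha_from)
          for v, e in zip(z, eps)]
    return [math.sqrt(alpha_to) * x + math.sqrt(1.0 - alpha_to) * e
            for x, e in zip(x0, eps)]


def toy_noise_predictor(z, t):
    # Hypothetical stand-in for the denoiser; constant, so inversion is exact.
    return [0.2 for _ in z]


alphas = [0.99, 0.9, 0.7, 0.4]      # cumulative alphas, decreasing as noise grows
content_latent = [1.0, -2.0, 0.5]   # toy latent of the content motion

# Inversion: walk from the (nearly) clean latent toward higher noise levels.
z = content_latent
for t in range(len(alphas) - 1):
    z = ddim_move(z, toy_noise_predictor(z, t), alphas[t], alphas[t + 1])
noisy_latent = z

# Sampling: walk back down; a style condition would be injected at this stage.
for t in range(len(alphas) - 1, 0, -1):
    z = ddim_move(z, toy_noise_predictor(z, t), alphas[t], alphas[t - 1])
reconstructed = z
```

With a learned denoiser the inversion is only approximate, since the noise predicted at adjacent steps differs. The point of the sketch is that the inverted latent retains the content, so re-sampling from it under a new style condition needs no per-sequence optimization.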
Highlights & Insights
- Re-adapts pre-trained text-to-motion diffusion models for stylized generation for the first time, offering a clear and effective design approach.
- The cycle prior preservation loss is designed ingeniously, encouraging disentanglement by swapping the style and content of two datasets.
- Decomposing style guidance into classifier-free and classifier-based components allows flexible adjustment of the balance between content and style.
- A single model supports 100 styles \(\times\) diverse content, eliminating the need for per-style fine-tuning required by previous methods.
Limitations & Future Work
- Classifier-based style guidance relies on a style classifier trained on the 100STYLE dataset; performance may degrade when the content text deviates significantly from locomotion.
- The 100STYLE dataset contains only locomotion-related movements, which limits the variety of learnable style types.
- The SRA metric in quantitative evaluation relies on a pre-trained classifier, which may not fully reflect the visually perceived style quality.
- Requires a pre-trained MLD model and additional style datasets, making the training pipeline relatively complex.
Rating
- Novelty: ⭐⭐⭐⭐ — First to adapt a pre-trained motion diffusion model for stylized generation; the dual guidance mechanism is novel.
- Practicality: ⭐⭐⭐⭐ — Multi-style with a single model, supporting motion style transfer as a downstream task.
- Experimental Thoroughness: ⭐⭐⭐⭐ — Complete quantitative, qualitative, user study, and ablation experiments, with in-depth ablation analysis.
- Writing Quality: ⭐⭐⭐⭐ — Clear structure, intuitive diagrams, and fully articulated motivation.