MotionLCM: Real-time Controllable Motion Generation via Latent Consistency Model
Conference: ECCV 2024
arXiv: 2404.19759
Area: Human Motion Generation
TL;DR
MotionLCM is the first work to bring consistency distillation to human motion generation. It runs single-step or few-step inference in the motion latent space, reaching real-time speed (~30 ms per sequence), and adds a Motion ControlNet in the same latent space for real-time controllable generation.
Background & Motivation
- Efficiency Bottlenecks of Diffusion Models: Existing motion diffusion models are slow at inference (MDM ~24 s and MLD ~0.2 s per sequence), which rules out real-time applications.
- High Latency in Spatiotemporal Control: Controllable methods such as OmniControl need roughly 81 s per sequence, far from real-time.
- Difficulty in Latent Space Control: In latent diffusion models, the latent representation carries no explicit motion semantics, so control signals defined on joint positions cannot be applied to it directly.
- Opportunities of Consistency Models: Consistency Models (CM) learn a consistency function that maps any point on a PF-ODE (probability flow ODE) trajectory back to its origin, enabling single-step or few-step generation, which fits the goal of accelerating motion generation.
- Core Problem: How can motion generation be accelerated to real-time speed while preserving generation quality and controllability?
Method
Overall Architecture
Training proceeds in two stages:
1. Motion Latent Consistency Distillation: Distills a consistency model from a pre-trained MLD (Motion Latent Diffusion) model, enabling 1-4 step inference.
2. Latent Space Motion Control: Introduces a Motion ControlNet into the latent space of MotionLCM and uses the VAE decoder to provide explicit control supervision in the motion space.
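To make "1-4 step inference" concrete, the sketch below shows the generic few-step consistency sampling loop: predict a clean latent, re-noise it to a smaller timestep, and predict again. It is a minimal illustration under stated assumptions, not the paper's implementation: `alpha_bar`, `toy_consistency_fn`, the latent size, and the timestep lists are all hypothetical stand-ins.

```python
import math
import random

def alpha_bar(t, T=1000):
    """Toy signal level in (0, 1]; an assumed schedule, not the paper's."""
    return math.cos(0.5 * math.pi * t / (T + 1)) ** 2

def toy_consistency_fn(z_t, t):
    """Hypothetical stand-in for f_Theta: maps a noisy latent to a z_0 estimate."""
    scale = math.sqrt(alpha_bar(t))
    return [scale * v for v in z_t]

def sample(consistency_fn, dim, timesteps, rng):
    """Few-step consistency sampling: predict z_0, re-noise, repeat."""
    z = [rng.gauss(0.0, 1.0) for _ in range(dim)]  # start from pure noise
    z0 = consistency_fn(z, timesteps[0])
    for t in timesteps[1:]:
        a = alpha_bar(t)
        # Re-noise the current z_0 estimate to the next (smaller) timestep.
        z = [math.sqrt(a) * v + math.sqrt(1.0 - a) * rng.gauss(0.0, 1.0) for v in z0]
        z0 = consistency_fn(z, t)
    return z0

# One network call for 1-step inference, four calls for 4-step inference.
one_step = sample(toy_consistency_fn, dim=4, timesteps=[999], rng=random.Random(0))
four_step = sample(toy_consistency_fn, dim=4, timesteps=[999, 759, 499, 259], rng=random.Random(0))
```

The number of network evaluations equals the length of `timesteps`, which is why inference time grows only mildly from 1 to 4 steps.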
Key Designs
Motion Latent Consistency Distillation:
- Using MLD as the teacher model, the consistency function \(\mathbf{f}_\Theta : (\mathbf{z}_t, t, w, \mathbf{c}) \mapsto \mathbf{z}_0\) is learned in the motion latent space.
- Adopts \(k\)-step skip consistency distillation (LCM scheme) instead of step-by-step consistency, significantly reducing convergence time.
- Integrates classifier-free guidance (CFG) into the distillation, where \(w\) is uniformly sampled from \([5, 15]\) during training.
- Uses a DDIM solver (skip interval \(k=20\)) with Huber loss as the distance metric.
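The three ingredients listed above (guidance-scale sampling, CFG-augmented teacher prediction, Huber distance) can be sketched in a few lines. This is an illustrative sketch only: the function names are hypothetical, the Huber threshold `delta` is an assumed value, and the `(1 + w) * cond - w * uncond` combination follows the LCM convention rather than code confirmed by the source.

```python
import random

def sample_guidance_scale(rng, w_min=5.0, w_max=15.0):
    """Draw the guidance scale w uniformly from [w_min, w_max]."""
    return rng.uniform(w_min, w_max)

def cfg_direction(cond, uncond, w):
    """LCM-style guided prediction: (1 + w) * conditional - w * unconditional."""
    return [(1.0 + w) * c - w * u for c, u in zip(cond, uncond)]

def huber(pred, target, delta=1.0):
    """Mean Huber distance; delta is an assumed threshold."""
    total = 0.0
    for p, t in zip(pred, target):
        r = abs(p - t)
        total += 0.5 * r * r if r <= delta else delta * (r - 0.5 * delta)
    return total / len(pred)

def distillation_loss(student_out, ema_target_out):
    """Distance between the online prediction at t_{n+k} and the EMA prediction at t_n."""
    return huber(student_out, ema_target_out)

w = sample_guidance_scale(random.Random(0))
guided = cfg_direction([1.0, 2.0], [0.5, 1.0], w=2.0)   # [2.0, 4.0]
loss = distillation_loss([0.0, 3.0], [0.5, 0.0])        # 1.3125
```

In the loss example, the small residual (0.5) is penalized quadratically and the large one (3.0) linearly, which is the property that makes Huber less sensitive to outliers than L2.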
Motion ControlNet:
- Initialized as a trainable copy of MotionLCM, with a zero-initialized linear layer appended to each layer.
- Control task definition: Given the initial poses of the first \(\tau\) frames (global 3D positions of 6 key joints) and a text description, generate the subsequent motion.
- Trajectory Encoder: Stacks Transformer layers to encode the trajectory signal; the output feature of the [CLS] token is added to the noisy latent.
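Zero-initialized projections matter because they make the control branch a no-op at the start of training, so the pre-trained backbone's behavior is preserved until the branch learns something useful. The sketch below illustrates that property with plain lists; the helper names and the feature values are hypothetical, and the real model operates on Transformer features rather than 3-element vectors.

```python
def linear(weights, bias, x):
    """Apply y = W x + b for a weight matrix given as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]

def zero_linear(out_dim, in_dim):
    """Zero-initialized projection used to attach a control branch."""
    return [[0.0] * in_dim for _ in range(out_dim)], [0.0] * out_dim

def controlled_layer(base_out, control_feat, zero_w, zero_b):
    """Add the control branch's projected feature to the backbone layer output."""
    residual = linear(zero_w, zero_b, control_feat)
    return [b + r for b, r in zip(base_out, residual)]

base_out = [0.3, -1.2, 0.7]       # hypothetical backbone layer output
control_feat = [5.0, -2.0, 1.5]   # hypothetical ControlNet layer output
zero_w, zero_b = zero_linear(3, 3)

# At initialization the control branch leaves the backbone output unchanged.
out_at_init = controlled_layer(base_out, control_feat, zero_w, zero_b)
```

Once training updates `zero_w`, the control features start to shift the backbone output.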
Explicit Supervision in Motion Space (Core Innovation):
- A reconstruction loss in the latent space alone does not provide sufficiently detailed control constraints.
- A frozen VAE decoder decodes the predicted latent \(\hat{\mathbf{z}}_0\) into the motion space, where the position error of the controlled joints is computed.
- Because MotionLCM predicts \(\hat{\mathbf{z}}_0\) in a single step, this decode-and-supervise procedure is cheaper than it would be with MLD.
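The motion-space supervision described above amounts to a masked position error over the controlled joints, combined with the latent reconstruction term. The sketch below assumes a squared-error form normalized by the number of controlled (frame, joint) pairs; the function names and the toy data are hypothetical, and the decoded joint positions are passed in directly rather than produced by a real VAE decoder.

```python
def control_loss(pred_joints, target_joints, mask):
    """Masked mean squared position error over controlled (frame, joint) pairs.

    pred_joints / target_joints: [frame][joint] -> (x, y, z)
    mask: [frame][joint] -> 1 if that joint is controlled in that frame, else 0
    """
    total, count = 0.0, 0
    for f, frame_mask in enumerate(mask):
        for j, m in enumerate(frame_mask):
            if m:
                p, t = pred_joints[f][j], target_joints[f][j]
                total += sum((a - b) ** 2 for a, b in zip(p, t))
                count += 1
    return total / count if count else 0.0

def total_loss(recon, control, lam=1.0):
    """Combine the latent reconstruction term with the motion-space control term."""
    return recon + lam * control

# Two frames, two joints; the first joint of the second frame is not controlled.
pred = [[(0.0, 0.0, 0.0), (1.0, 1.0, 1.0)], [(0.0, 0.0, 0.0), (2.0, 2.0, 2.0)]]
target = [[(0.0, 0.0, 1.0), (1.0, 1.0, 1.0)], [(5.0, 5.0, 5.0), (2.0, 2.0, 4.0)]]
mask = [[1, 1], [0, 1]]

l_control = control_loss(pred, target, mask)   # (1 + 0 + 4) / 3
l_total = total_loss(0.5, l_control)
```

The large error on the unmasked joint is ignored, so only the joints that carry a control signal constrain the generator.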
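Loss & Training
First Stage: Consistency Distillation Loss. In the LCM formulation adopted here, the online network evaluated at \(t_{n+k}\) is matched to the EMA target network \(\Theta^-\) evaluated at the teacher's \(k\)-step estimate \(\hat{\mathbf{z}}_{t_n}\):
\[
\mathcal{L}_{LCD}(\Theta, \Theta^-) = \mathbb{E}\left[ d\left( \mathbf{f}_\Theta(\mathbf{z}_{t_{n+k}}, t_{n+k}, w, \mathbf{c}),\; \mathbf{f}_{\Theta^-}(\hat{\mathbf{z}}_{t_n}, t_n, w, \mathbf{c}) \right) \right]
\]
where \(d(\cdot, \cdot)\) is the Huber loss and \(\hat{\mathbf{z}}_{t_n}\) is obtained from \(\mathbf{z}_{t_{n+k}}\) by one CFG-augmented DDIM solver step of the teacher.
Second Stage: Total Loss for Control Training:
\[
\mathcal{L} = \mathcal{L}_{recon} + \lambda \mathcal{L}_{control}
\]
where \(\mathcal{L}_{recon}\) is the latent space reconstruction loss, \(\mathcal{L}_{control}\) is the control joint position error in the motion space, and \(\lambda=1.0\).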
Key Experimental Results
Main Results
Comparison of Text-to-Motion Generation (HumanML3D); AITS is the average inference time per sentence, in seconds:
| Method | AITS(s)↓ | R-Precision Top3↑ | FID↓ | MM Dist↓ | Diversity→ | MModality↑ |
|---|---|---|---|---|---|---|
| MDM | 24.74 | 0.611 | 0.544 | 5.566 | 9.559 | 2.799 |
| MotionDiffuse | 14.74 | 0.782 | 0.630 | 3.113 | 9.410 | 1.553 |
| MLD | 0.217 | 0.772 | 0.473 | 3.196 | 9.724 | 2.413 |
| MLD* (reprod.) | 0.225 | 0.796 | 0.450 | 3.052 | 9.634 | 2.267 |
| MotionLCM (1-step) | 0.030 | 0.803 | 0.467 | 3.022 | 9.631 | 2.172 |
| MotionLCM (2-step) | 0.035 | 0.805 | 0.368 | 2.986 | 9.640 | 2.187 |
| MotionLCM (4-step) | 0.043 | 0.798 | 0.304 | 3.012 | 9.607 | 2.259 |
Comparison of Controllable Motion Generation (LC and MC denote control supervision in the latent space and in the motion space, respectively):
| Method | AITS(s)↓ | FID↓ | R-Precision Top3↑ | Traj. err.↓ | Loc. err.↓ | Avg. err.↓ |
|---|---|---|---|---|---|---|
| OmniControl | 81.00 | 2.328 | 0.557 | 0.3362 | 0.0322 | 0.0977 |
| MLD (LC&MC) | 0.552 | 0.555 | 0.754 | 0.2722 | 0.0215 | 0.1265 |
| MotionLCM 1-step (LC&MC) | 0.042 | 0.419 | 0.756 | 0.1988 | 0.0147 | 0.1127 |
| MotionLCM 2-step (LC&MC) | 0.047 | 0.397 | 0.759 | 0.1960 | 0.0143 | 0.1092 |
Ablation Study
Influence of Training Guidance Scale Range and Loss Type (1-step inference):
| Setting | R-Precision Top1↑ | FID↓ | MM Dist↓ | Diversity→ |
|---|---|---|---|---|
| \(w \in [5,15]\) (Default) | 0.502 | 0.467 | 3.022 | 9.631 |
| \(w \in [2,18]\) | 0.497 | — | — | — |
| Huber Loss (Default) | 0.502 | 0.467 | — | — |
| L2 Loss | — | 0.592 | — | — |
Key Findings
- Speed: Single-step text-to-motion inference takes only ~30 ms (0.030 s vs. 0.217 s for MLD, about 7 times faster). In the controllable setting, 1-step MotionLCM takes 0.042 s, about 1929 times faster than OmniControl (81.00 s) and about 13 times faster than MLD (0.552 s).
- Quality Improves Rather Than Declines: Single-step inference beats MLD with 50-step DDIM on R-Precision Top3 (0.803 vs. 0.772), and 4-step inference achieves the best FID (0.304).
- Controllability: Under the same LC&MC setting, Motion ControlNet built on MotionLCM reaches lower control error than the one built on MLD (Traj. err. 0.1988 vs. 0.2722; Loc. err. 0.0147 vs. 0.0215), suggesting that MotionLCM's latents are better suited to control training. OmniControl still has the lowest Avg. err. (0.0977).
- Crucial Role of Motion Space Supervision: Adding the explicit control loss in the motion space reduces Traj. err. from 0.2986 to 0.1988 (a 33% reduction).
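The ratios quoted in these findings can be re-derived from the AITS columns of the two result tables. The short check below copies the reported numbers and recomputes them; the variable names are illustrative only.

```python
# Average inference time per sentence (seconds), copied from the tables above.
t2m = {"MLD": 0.217, "MotionLCM 1-step": 0.030}
ctrl = {"OmniControl": 81.00, "MLD": 0.552, "MotionLCM 1-step": 0.042}

def speedup(slow, fast):
    """How many times faster the 'fast' method is than the 'slow' one."""
    return slow / fast

t2m_vs_mld = speedup(t2m["MLD"], t2m["MotionLCM 1-step"])              # about 7.2
ctrl_vs_omni = speedup(ctrl["OmniControl"], ctrl["MotionLCM 1-step"])  # about 1929
ctrl_vs_mld = speedup(ctrl["MLD"], ctrl["MotionLCM 1-step"])           # about 13.1

# Relative drop in trajectory error once motion-space supervision is added.
traj_reduction = (0.2986 - 0.1988) / 0.2986                            # about 0.33
```

Note that the 1929 and 13 figures come from the controllable-generation timings (0.042 s), not from the 0.030 s text-to-motion timing.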
Highlights & Insights
- First to Introduce Consistency Distillation into Motion Generation: Demonstrates the feasibility and effectiveness of the LCM scheme within the motion latent space.
- Real-time Controllable Generation: MotionLCM with Motion ControlNet enables autoregressive real-time generation by using the final frame of the previous motion segment as the initial control signal for the next one.
- Latent-Motion Dual-Space Supervision: Decoding latent predictions through the VAE decoder allows explicit control supervision in the motion space, which compensates for the lack of motion semantics in the latent space.
- Efficiency-Quality Trade-off: On the inference time vs. FID scatter plot (lower is better on both axes), MotionLCM lies closest to the origin, giving the best balance of the two.
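The autoregressive scheme mentioned in the highlights (seed each segment with the last frame of the previous one) can be sketched as a simple loop. This is a toy illustration under stated assumptions: `toy_generate` is a hypothetical stand-in for the controlled generator, the text condition is omitted, and a pose is reduced to three numbers.

```python
def toy_generate(initial_pose, num_frames):
    """Hypothetical controlled generator: drifts each coordinate by 0.1 per frame."""
    return [[v + 0.1 * i for v in initial_pose] for i in range(num_frames)]

def autoregressive_rollout(generate, first_pose, num_segments, frames_per_segment):
    """Chain segments, seeding each one with the last frame of the previous segment."""
    motion, pose = [], first_pose
    for _ in range(num_segments):
        segment = generate(pose, frames_per_segment)
        # Drop the seed frame on later segments so it is not duplicated.
        motion.extend(segment if not motion else segment[1:])
        pose = segment[-1]
    return motion

motion = autoregressive_rollout(
    toy_generate, [0.0, 0.0, 0.0], num_segments=3, frames_per_segment=5
)
```

Because each segment starts exactly where the previous one ended, the concatenated motion has no jump at segment boundaries; with a fast generator, each new segment can be produced while the previous one is still playing.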
Limitations & Future Work
- Depends on the quality of the pre-trained MLD teacher, which bounds what distillation can achieve.
- The control task is limited to initial pose control (the first 25% of frames); more flexible spatiotemporal control patterns remain unexplored.
- Single-step FID (0.467) is worse than 4-step FID (0.304), indicating some quality loss under the most aggressive acceleration.
Rating
⭐⭐⭐⭐⭐ (5/5): Pushes motion generation to real-time speed. The two-stage design of consistency distillation plus ControlNet is simple yet efficient. Explicit supervision in the motion space addresses the core challenge of latent space control, which matters for practical applications.