# GENMO: A GENeralist Model for Human MOtion
Conference: ICCV 2025 | arXiv: 2505.01425 | Code: Project Page | Area: Human Understanding | Keywords: Human motion modeling, motion estimation, motion generation, diffusion models, multimodal conditioning
## TL;DR
This paper proposes GENMO, the first generalist model that unifies human motion estimation (recovering motion from video/2D keypoints) and motion generation (synthesizing motion from text/music/keyframes) within a single framework. Through a dual-mode training paradigm (regression + diffusion), GENMO achieves both precise estimation and diverse generation with one set of weights.
## Background & Motivation
Human motion modeling is a long-standing research topic in computer vision and graphics, with broad applications in gaming, animation, and 3D content creation. Consider a practical creation scenario: a user wants to generate a motion sequence starting from a video clip, transitioning to text-described actions, synchronizing with music beats, and finally aligning with another video—all while maintaining fine-grained keyframe control. This demands a model capable of both faithfully reproducing observed motion and generating diverse plausible motion.
Key Challenge: Motion estimation and motion generation have fundamentally different objectives:
- Estimation requires precise, deterministic output: given the same video, a unique motion sequence should be recovered.
- Generation requires diverse output: given the same text description, multiple plausible motions should be producible.
This tension has led the two tasks to be handled by separate models, limiting cross-task knowledge transfer.
Key Insights:
1. Generative priors can improve estimation: under challenging conditions such as occlusion, the motion distribution learned by generative models provides useful constraints.
2. Diverse video data can enhance generation: large-scale in-the-wild videos (with only 2D annotations) broaden the motion distribution available to generative models.
3. Diffusion models provide a natural framework for unification: estimation can be viewed as "maximum-likelihood generation."
## Method
### Overall Architecture
GENMO is built upon the diffusion model framework and reformulates motion estimation as constrained motion generation: given conditioning signals (video, 2D keypoints, text, music, 3D keyframes), the model generates motion sequences \(x = \{x^i\}_{i=1}^N\) that satisfy the given constraints. Each frame's motion representation is

$$x^i = (\Gamma_{\text{gv}}^i, v_{\text{root}}^i, \theta^i, \beta^i, t_{\text{root}}^i, \pi^i, p^i),$$

corresponding to gravity-view orientation (6D), root velocity (3D), SMPL joint angles (24×6D), shape parameters (10D), root translation (3D), camera pose, and contact labels.
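To make the layout concrete, here is a minimal sketch of that per-frame representation in PyTorch. The dimensions of the camera pose \(\pi^i\) and contact labels \(p^i\) are not spelled out in this summary, so they are left as placeholders:

```python
from dataclasses import dataclass
import torch

@dataclass
class MotionFrame:
    """Per-frame motion representation x^i, following the dims listed above.
    Camera-pose and contact dimensions are assumptions, marked with '?'."""
    gravity_view: torch.Tensor  # (6,)   gravity-view orientation, 6D rotation
    root_vel: torch.Tensor      # (3,)   root velocity
    joint_angles: torch.Tensor  # (144,) 24 SMPL joints x 6D rotation each
    betas: torch.Tensor         # (10,)  SMPL shape parameters
    root_trans: torch.Tensor    # (3,)   root translation
    cam_pose: torch.Tensor      # (?,)   camera pose -- dimension assumed
    contacts: torch.Tensor      # (?,)   contact labels -- dimension assumed
```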
### Key Designs
- Dual-Mode Training Paradigm:
- Function: Enables a single model to achieve both estimation accuracy and generation diversity.
- Mechanism:
- Estimation mode: The model input is set to pure Gaussian noise \(z \sim \mathcal{N}(\mathbf{0}, I)\) with the timestep fixed at the maximum value \(T\), and the model directly regresses the clean motion: $$\mathcal{L}_{\text{est}} = \mathbb{E}_{z \sim \mathcal{N}(\mathbf{0}, I)} \left[ \|x_0 - \mathcal{G}(z, T, \mathcal{C}, \mathcal{M})\|^2 \right]$$ This is equivalent to maximum-likelihood estimation, forcing the model to predict the most probable motion in a single step from pure noise.
- Generation mode: Standard DDPM training, progressively denoising a noised motion: $$\mathcal{L}_{\text{gen}} = \mathbb{E}_{t, x_t} \left[ \|x_0 - \mathcal{G}(x_t, t, \mathcal{C}, \mathcal{M})\|^2 \right]$$
- Design Motivation: The authors observe that video-conditioned diffusion models are highly deterministic (the first-step prediction is already very close to the final result), whereas text-conditioned predictions exhibit high variance. For estimation tasks it is therefore critical to improve the quality of this first-step prediction (the estimation mode) without sacrificing diverse generation capability (the generation mode); a minimal training-step sketch follows below.
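A minimal sketch of a combined training step, assuming an \(x_0\)-predicting model and standard DDPM forward noising; the condition mask \(\mathcal{M}\) is folded into `cond` for brevity, and all names are illustrative rather than the authors' code:

```python
import torch

def dual_mode_loss(model, x0, cond, T, alphas_cumprod):
    """One combined training step, paraphrasing the two losses above.
    `model(x_t, t, cond)` is assumed to predict the clean motion x0."""
    B = x0.shape[0]

    # Estimation mode: pure Gaussian noise at the maximum timestep T;
    # the model regresses the clean motion in a single step (MLE-style).
    z = torch.randn_like(x0)
    t_max = torch.full((B,), T, device=x0.device, dtype=torch.long)
    loss_est = ((x0 - model(z, t_max, cond)) ** 2).mean()

    # Generation mode: standard DDPM x0-prediction at a random timestep.
    t = torch.randint(0, T, (B,), device=x0.device)
    a_bar = alphas_cumprod[t].view(B, 1, 1)              # \bar{alpha}_t
    x_t = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * torch.randn_like(x0)
    loss_gen = ((x0 - model(x_t, t, cond)) ** 2).mean()

    return loss_est + loss_gen
```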
- Multimodal Conditioning Architecture:
- Function: Supports arbitrary combinations of multimodal conditioning inputs including video, music, 2D keypoints, and text.
- Mechanism:
- Frame-aligned conditions (video, music, 2D skeleton): An Additive Fusion Block projects features from each modality via separate MLPs, sums them, and fuses the result with the noisy motion to produce a token sequence.
- Text conditions: A novel Multi-Text Injection Block injects multiple text segments over different temporal windows: $$f_{\text{out}} = \sum_{k=1}^K \text{MaskedMHA}(f_{\text{in}}, c_{\text{text}}^k, \Omega_k)$$ where \(\Omega_k(i,j)\) is a binary mask restricting the \(k\)-th text segment to influence only motion frames within its corresponding temporal window.
- The backbone network uses a RoPE-based Transformer, supporting variable-length sequences and sliding-window attention at inference time.
- Design Motivation: Text and motion frames lack a frame-wise alignment relationship and cannot be naively concatenated (which would introduce positional bias). Multi-text injection elegantly addresses the need for temporally segmented text control via masked attention.
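A minimal single-head sketch of the masked text injection, assuming disjoint temporal windows (the paper uses masked multi-head attention; shapes and names here are illustrative):

```python
import torch

def multi_text_injection(f_in, text_feats, windows):
    """Single-head sketch of the Multi-Text Injection Block.
    f_in:       (B, N, D) motion-frame features
    text_feats: list of K tensors (B, L_k, D), one per text segment
    windows:    list of K (start, end) frame ranges
    Masking the *output* per window is equivalent to the paper's masked
    attention when windows are disjoint; multi-head details are omitted."""
    B, N, D = f_in.shape
    f_out = torch.zeros_like(f_in)
    for c_k, (s, e) in zip(text_feats, windows):
        # Cross-attention: every motion frame attends to the k-th text tokens.
        attn = torch.softmax(f_in @ c_k.transpose(1, 2) / D ** 0.5, dim=-1)
        out = attn @ c_k                                   # (B, N, D)
        # Binary temporal mask Omega_k: only frames in [s, e) are updated.
        window = torch.zeros(B, N, 1, device=f_in.device)
        window[:, s:e] = 1.0
        f_out = f_out + window * out
    return f_out
```

With windows given in frames, e.g. `windows = [(0, 150), (150, 450)]` at an assumed 30 fps, this realizes prompts like "run for 5 seconds, then dance for 10 seconds."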
- Estimation-Guided 2D Training:
- Function: Leverages in-the-wild videos with only 2D annotations to enhance the diversity of the generative model.
- Mechanism: 2D data is exploited in two steps:
- The estimation mode first generates pseudo 3D motion from 2D conditions: \(\hat{x}_0 = \mathcal{G}(z, T, \mathcal{C})\).
- The pseudo motion is then noised and used for generation-mode training, with the loss computed via 2D reprojection: $$\mathcal{L}_{\text{gen-2D}} = \mathbb{E} \left[ \|x_{\text{2d}} - \Pi(\mathcal{G}(\hat{x}_t, t, \mathcal{C}))\|^2 \right]$$
- Design Motivation: 3D motion capture data is scarce and limited in diversity, whereas 2D annotations can be obtained at scale via detectors. Converting in-the-wild 2D video into training data through the estimation capability simultaneously enriches the generative distribution and avoids the noise inherent in 3D pseudo-labels.
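Putting the two steps together, a hedged sketch of the 2D training loop; `reproject` stands in for the camera projection \(\Pi\), the model interface follows the sketch above, and all names are assumptions rather than the official implementation:

```python
import torch

def gen_2d_loss(model, cond, x_2d, motion_dim, T, alphas_cumprod, reproject):
    """Estimation-guided 2D training, paraphrasing the two steps above."""
    B, N, _ = x_2d.shape

    # Step 1: estimation mode produces pseudo 3D motion from 2D conditions.
    with torch.no_grad():
        z = torch.randn(B, N, motion_dim, device=x_2d.device)
        t_max = torch.full((B,), T, device=x_2d.device, dtype=torch.long)
        x0_hat = model(z, t_max, cond)

    # Step 2: noise the pseudo motion and train in generation mode,
    # supervising with the 2D reprojection error instead of a 3D target.
    t = torch.randint(0, T, (B,), device=x_2d.device)
    a_bar = alphas_cumprod[t].view(B, 1, 1)
    x_t = a_bar.sqrt() * x0_hat + (1 - a_bar).sqrt() * torch.randn_like(x0_hat)
    x0_pred = model(x_t, t, cond)
    return ((x_2d - reproject(x0_pred)) ** 2).mean()
```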
### Loss & Training
- Estimation mode: \(\mathcal{L}_{\text{est}} + \mathcal{L}_{\text{geo}}\) (including geometric regularization terms such as 3D joint/vertex constraints and contact constraints).
- Generation mode: \(\mathcal{L}_{\text{gen}} + \mathcal{L}_{\text{geo}}\) (3D data) or \(\mathcal{L}_{\text{gen-2D}} + \mathcal{L}_{\text{geo}}\) (2D data).
- Mode selection strategy: Strong conditions (video/2D skeleton) employ both estimation and generation modes simultaneously; weak conditions (text/music) use generation mode only.
- Inference supports sliding-window attention (window of \(W\) frames) for arbitrarily long sequence generation.
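The exact sliding-window attention scheme is not spelled out here; one plausible reading is a banded attention mask over frames, sketched below under that assumption:

```python
import torch

def sliding_window_mask(n_frames, W, device="cpu"):
    """Builds a banded boolean attention mask so that each frame attends
    only to frames within +/- W/2 of itself -- one simple reading of the
    sliding-window attention used at inference (exact scheme assumed)."""
    idx = torch.arange(n_frames, device=device)
    dist = (idx[None, :] - idx[:, None]).abs()
    return dist <= (W // 2)   # (n_frames, n_frames), True = may attend
```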
## Key Experimental Results
### Main Results
Global Motion Estimation (EMDB-2 dataset):
| Method | WA-MPJPE100↓ | W-MPJPE100↓ | RTE↓ | Foot Sliding↓ |
|---|---|---|---|---|
| WHAM (DPVO) | 135.6 | 354.8 | 6.0 | 4.4 |
| GVHMR (DPVO) | 111.0 | 276.5 | 2.0 | 3.5 |
| TRAM (DROID) | 76.4 | 222.4 | 1.4 | - |
| GENMO (DROID) | 74.3 | 202.1 | 1.2 | 8.8 |
Music-Driven Dance Generation (AIST++):
| Method | FIDk↓ | FIDm↓ | PFC↓ | BAS↑ |
|---|---|---|---|---|
| Bailando | 28.16 | 9.62 | 1.754 | 0.2332 |
| EDGE | 42.16 | 22.12 | 1.5363 | 0.2334 |
| GENMO (music only) | 16.10 | 13.91 | 0.7340 | 0.2282 |
| GENMO (generalist) | 40.91 | 18.51 | 0.3702 | 0.2708 |
### Ablation Study
Contribution of Dual-Mode Training (motion estimation, RICH dataset):
| Configuration | WA-MPJPE100↓ | W-MPJPE100↓ | Note |
|---|---|---|---|
| Diffusion-only | 88.9 | 143.9 | Standard diffusion training only |
| Regression-only | 87.0 | 141.0 | Regression training only |
| Dual-mode | 75.3 | 118.6 | Best performance |
Effect of 2D Training Data (text-to-motion generation, Motion-X):
| Configuration | FID↓ | R@3↑ | MM Dist↓ | Note |
|---|---|---|---|---|
| MDM baseline | 2.389 | 0.313 | 6.745 | Baseline method |
| w/o 2D Training | 0.515 | 0.401 | 5.210 | Without 2D data |
| w/ 2D Training | 0.207 | 0.472 | 4.801 | With 2D data |
### Key Findings
- The unified model outperforms task-specific models: GENMO surpasses the dedicated estimation method TRAM on motion estimation (W-MPJPE 202.1 vs. 222.4), benefiting from the generative prior as a constraint on motion plausibility.
- Both modes in dual-mode training are indispensable: Neither diffusion-only nor regression-only training matches the dual-mode configuration. Diffusion provides generation diversity, while regression ensures estimation precision.
- 2D data training comprehensively improves generation quality: On Motion-X, FID drops from 0.515 to 0.207 and R@3 improves from 0.401 to 0.472, validating the effectiveness of extracting training signal from in-the-wild videos.
- On music-driven dance, the generalist model has a worse (higher) FIDk than the task-specific variant (40.91 vs. 16.10), but exhibits superior diversity, physical plausibility, and music-beat alignment (PFC 0.37 vs. 0.73, BAS 0.27 vs. 0.23), illustrating both the trade-offs and the cross-task benefits of multi-task training.
- Motion interpolation experiments further validate the synergistic benefits of joint estimation and generation training.
## Highlights & Insights
- Paradigm-level innovation: This work is the first to demonstrate that motion estimation and generation can be unified within a single diffusion framework with mutual performance benefits.
- The theoretical basis of dual-mode training is compelling: The estimation mode corresponds to denoising at the maximum timestep in a diffusion model, making it fully compatible with the generation mode.
- The multi-text injection design addresses the practical challenge of aligning text with motion frames, enabling temporally segmented control such as "run for 5 seconds, then dance for 10 seconds."
- Variable-length sequence support is achieved via RoPE combined with sliding-window attention, producing naturally coherent long sequences without post-hoc stitching.
- As an NVIDIA project, the work reflects substantial engineering effort and data scale.
## Limitations & Future Work
- The SMPL parametric representation leads to weaker performance on HumanML3D metrics than methods using that benchmark's task-specific representation (a representation-mismatch issue).
- The generalist model generally underperforms task-specific models on individual tasks (e.g., FIDk on music-driven dance), though its overall performance is superior.
- Training involves multiple datasets and multiple training modes, resulting in considerable implementation complexity.
- The model supports only single-person motion and does not handle multi-person interaction.
- Only the SMPL skeleton representation is supported; facial expressions and hand details are excluded.
## Related Work & Insights
- The gravity-view coordinate system proposed in GVHMR (NeurIPS 2024) is directly adopted by GENMO, demonstrating the general utility of this coordinate formulation.
- MDM (ICLR 2023) provides the foundational diffusion-based motion generation paradigm; GENMO extends it with the estimation mode and a multimodal architecture.
- Unlike large-model approaches such as MotionGPT that focus on language understanding, GENMO prioritizes precise geometric constraints and physical plausibility.
- The dual-mode training concept is generalizable to other tasks requiring unified estimation and generation (e.g., 3D scene understanding, audio).
## Rating
- Novelty: ⭐⭐⭐⭐⭐ Unifying estimation and generation represents a significant paradigm contribution; both dual-mode training and estimation-guided 2D training are highly novel.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Covers multiple tasks spanning estimation (global/local) and generation (text/music/interpolation) with comprehensive ablations.
- Writing Quality: ⭐⭐⭐⭐ Content is rich but lengthy; some architectural details could be presented more concisely.
- Value: ⭐⭐⭐⭐⭐ Substantially advances the field of human motion modeling; the unified framework with bidirectional mutual benefits constitutes a compelling research direction.