Scaling Large Motion Models with Million-Level Human Motions¶

Conference: ICML 2025
arXiv: 2410.03311
Area: Human Understanding

TL;DR¶

This paper introduces MotionLib (the first million-level motion dataset, containing 1.2 million sequences), MotionBook (comprising lossless features and a 2D lookup-free motion tokenizer), and Being-M0 (a large motion model), demonstrating the scaling laws of both data and model size in the motion generation field for the first time.

Background & Motivation¶

Text-to-motion (T2M) generation is an emerging field with widespread applications in games, movies, and robotics. However, current methods are constrained by the scale of available data:

- Enormous Data Quantity Gap: The largest existing motion dataset, Motion-X, contains only ~80K sequences, whereas large-scale vision datasets (e.g., ImageNet) are orders of magnitude larger.
- Defects of Existing VQ Tokenizers:
    - Information Loss: Complex motion states (joint positions, velocities, ground contacts, etc.) are compressed into a single 1D embedding, which discards detail.
    - Limited Codebook Capacity: Small codebooks restrict the diversity of the generated motions.
- Feature Representation Issues: The commonly used H3D format omits raw rotation information, so rotations must be recovered with time-consuming reconstruction methods.

Method¶

MotionLib Dataset¶

MotionLib is the first million-level motion dataset, containing 1.21 million motion sequences and 2.48 million text descriptions:

| Dataset | Sequences | Texts | Hours | Text Type |
|---|---|---|---|---|
| HumanML3D | 29.2K | 89K | 28.6 | Body |
| Motion-X | 81.1K | 142K | 144.2 | Body |
| MotionLib | 1.21M | 2.48M | 1456.4 | Hierarchical |

Construction Pipeline:

1. Collect over 20 million videos from public datasets and YouTube.
2. Extract SMPL parameters in the world coordinate system using WHAM.
3. Generate hierarchical text annotations: body-part-level descriptions (e.g., left arm) plus full-body-level descriptions (1-3 sentences).
4. Refine raw motions using an RL policy \(\pi_{\text{refine}}\) to ensure adherence to physical laws.

MotionBook: Efficient Motion Encoding¶

Lossless Motion Features (SMPL-D135)¶

Each frame is encoded as \(m \in \mathbb{R}^{135}\):

- Root Node (9D): 6D rotation \(\mathbf{r}_{rot} \in \mathbb{R}^6\), 2D XZ-plane velocity, and 1D height.
- Body Joints (126D): 6D rotation vectors of 21 key joints \(\mathbf{j}^r \in \mathbb{R}^{21 \times 6}\).

Compared to the H3D format (263D), SMPL-D135 is more compact and retains complete rotation information.
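The 9D + 126D layout above can be sketched as a per-frame feature builder. This is a minimal illustration, not the authors' code: the function names, the ordering of components inside the vector, and the choice of the common "first two columns of the rotation matrix" convention for the 6D rotation are all assumptions, since the summary does not specify them.

```python
from typing import List, Sequence

Matrix3 = Sequence[Sequence[float]]  # 3x3 rotation matrix, row-major


def rot6d(R: Matrix3) -> List[float]:
    """6D rotation: first two columns of a 3x3 rotation matrix (assumed convention)."""
    return [R[0][0], R[1][0], R[2][0], R[0][1], R[1][1], R[2][1]]


def smpl_d135_frame(
    root_R: Matrix3,
    root_vel_xz: Sequence[float],
    root_height: float,
    joint_Rs: Sequence[Matrix3],
) -> List[float]:
    """Assemble one 135-D frame: root (6 + 2 + 1 = 9) followed by 21 joints x 6 = 126."""
    if len(root_vel_xz) != 2:
        raise ValueError("root velocity must be 2D (XZ plane)")
    if len(joint_Rs) != 21:
        raise ValueError("expected 21 body-joint rotations")
    feat = rot6d(root_R) + list(root_vel_xz) + [root_height]
    for R in joint_Rs:
        feat += rot6d(R)
    return feat


# Hypothetical rest pose: every rotation is the identity, root 0.9 units above ground.
IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
frame = smpl_d135_frame(IDENTITY, (0.0, 0.0), 0.9, [IDENTITY] * 21)
print(len(frame))  # 135
```

Because the joint rotations are stored directly, no inverse-kinematics style reconstruction is needed to get them back, which is the advantage claimed over H3D.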

2D Lookup-Free Quantization (2D-LFQ)¶

Core Innovation:

1. Treat the motion sequence as a single-channel image \(\mathcal{M} \in \mathbb{R}^{T \times D \times 1}\).
2. Divide the feature dimension into \(P\) components, which are encoded separately (e.g., root orientation, joint rotation, foot contact, etc.).
3. Represent the encoder output as \(\mathbb{E}(\mathcal{M}) \in \mathbb{R}^{\lfloor T/\alpha \rfloor \times P \times d}\).

Lookup-Free Quantization: Replace codebooks with an integer set \(\mathbb{C} = \times_{i=1}^d C_i\), where \(C_i = \{-1, 1\}\):

\[Q(z_i) = -\mathbb{1}\{z_i \leq 0\} + \mathbb{1}\{z_i > 0\}\]

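The token index is computed as \(\text{Index}(z) = \sum_{i=1}^d 2^{i-1}\mathbb{1}\{z_i > 0\}\), so a \(d\)-dimensional latent yields an implicit codebook of \(2^d\) entries. This scales the codebook size by at least two orders of magnitude, from 512 to 65,536 or more (for example, \(d = 16\) gives \(2^{16} = 65{,}536\) codes), while avoiding codebook collapse.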
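The sign quantizer and index formula can be written out directly. The sketch below is a plain-Python illustration of those two equations only; the function names and the toy latent values are hypothetical, and the encoder that produces the latents is not modeled.

```python
from typing import List, Sequence


def lfq_quantize(z: Sequence[float]) -> List[int]:
    """Q(z_i) = -1 if z_i <= 0 else +1, applied to each latent channel."""
    return [1 if zi > 0 else -1 for zi in z]


def lfq_index(z: Sequence[float]) -> int:
    """Index(z) = sum_i 2^(i-1) * 1{z_i > 0}.

    enumerate() is 0-based, so position i here corresponds to 2^(i-1)
    in the 1-based formula.
    """
    return sum(1 << i for i, zi in enumerate(z) if zi > 0)


def lfq_decode(index: int, d: int) -> List[int]:
    """Recover the {-1, +1}^d code from a token index, with no stored codebook."""
    return [1 if (index >> i) & 1 else -1 for i in range(d)]


def tokenize_grid(latents: Sequence[Sequence[Sequence[float]]]) -> List[List[int]]:
    """Map a [T'][P][d] latent grid to a [T'][P] grid of token indices."""
    return [[lfq_index(z) for z in row] for row in latents]


z = [0.3, -1.2, 0.0, 2.5]          # toy latent with d = 4
code = lfq_quantize(z)             # [1, -1, -1, 1]
token = lfq_index(z)               # 2^0 + 2^3 = 9
assert lfq_decode(token, len(z)) == code

# Toy 2D grid: 2 time steps x 2 parts x d = 2 channels.
latents = [[[0.3, -1.2], [-0.5, 0.7]], [[1.0, 1.0], [-1.0, -1.0]]]
grid = tokenize_grid(latents)      # [[1, 2], [3, 0]]
print(code, token, grid)
```

Since every bit pattern is a valid code, the vocabulary grows as \(2^d\) without any embedding table to train, which is why there is no codebook to collapse.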

Being-M0: Large Motion Model¶

An autoregressive motion generation model based on a pretrained LLM:

Two-stage Training:

1. Motion-Text Alignment: Pretraining on the entire MotionLib dataset to learn basic motion-text associations.
2. Motion Instruction Tuning: Fine-tuning using over 250 instruction templates and 900K instruction-following data points.

Training Loss:

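\[\mathcal{L}(\Theta) = -\sum_{j=1}^{L}\log P_\Theta(y_j \mid \text{desc}, \hat{y}_{1:j-1})\]

Here \(\text{desc}\) is the input text description, \(y_j\) is the \(j\)-th target motion token, \(\hat{y}_{1:j-1}\) are the preceding tokens, and \(L\) is the length of the target sequence.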
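The loss above is the standard summed negative log-likelihood over motion tokens. As a minimal sketch, assuming the model has already produced a probability distribution over the motion vocabulary at each step (the `step_probs` input and the toy numbers are hypothetical):

```python
import math
from typing import Sequence


def motion_nll(step_probs: Sequence[Sequence[float]], targets: Sequence[int]) -> float:
    """Summed NLL: -sum_j log P(y_j | desc, y_{1:j-1}).

    step_probs[j][k] is the model's probability of token k at step j,
    already conditioned on the description and the previous tokens.
    """
    if len(step_probs) != len(targets):
        raise ValueError("one distribution is needed per target token")
    return -sum(math.log(step_probs[j][y]) for j, y in enumerate(targets))


# Toy example: vocabulary of 2 tokens, sequence of length 2.
step_probs = [[0.5, 0.5], [0.25, 0.75]]
targets = [0, 1]
loss = motion_nll(step_probs, targets)  # -(log 0.5 + log 0.75)
print(round(loss, 4))
```

In practice this is computed from logits with a numerically stable cross-entropy routine; the explicit form here is only meant to mirror the equation term by term.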

Key Experimental Results¶

Scaling Law Experiments¶

| Decoder | Instructions | Parameters | MotionLib-eval FID ↓ |
|---|---|---|---|
| GPT-2 | 0.02M | 355M | 30.612 |
| GPT-2 | 1.2M | 355M | 6.936 |
| LLaMA-3 | 0.02M | 8B | 29.257 |
| LLaMA-3 | 0.08M | 8B | 21.295 |
| LLaMA-3 | 0.5M | 8B | 8.973 |
| LLaMA-3 | 1.2M | 8B | 6.029 |
| LLaMA-2 | 1.2M | 13B | 6.221 |

Key Findings:

- Increasing the instruction data from 0.02M to 1.2M drops FID from roughly 30 to roughly 6-7 for both GPT-2 and LLaMA-3, demonstrating clear data-scaling behavior.
- At 1.2M instructions, LLaMA-3 8B (6.029) outperforms GPT-2 355M (6.936), but the gain from model scale is far smaller than the gain from data scale. The LLaMA-2 13B result (6.221) does not improve on LLaMA-3 8B, so the model-size trend is not strictly monotonic across model families.
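One way to make the data-scaling trend concrete is to fit a straight line in log-log space to the four LLaMA-3 8B rows of the table. This is an illustrative back-of-the-envelope fit on the reported numbers, not an analysis from the paper, and four points are far too few to pin down a reliable exponent.

```python
import math
from typing import List, Tuple

# (instruction count in millions, MotionLib-eval FID) for LLaMA-3 8B, from the table.
LLAMA3_8B: List[Tuple[float, float]] = [
    (0.02, 29.257),
    (0.08, 21.295),
    (0.5, 8.973),
    (1.2, 6.029),
]


def loglog_slope(points: List[Tuple[float, float]]) -> float:
    """Least-squares slope of log10(FID) against log10(data size)."""
    xs = [math.log10(n) for n, _ in points]
    ys = [math.log10(fid) for _, fid in points]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


slope = loglog_slope(LLAMA3_8B)
print(f"log-log slope: {slope:.3f}")
```

A negative slope means FID falls as a power of the data size, i.e. \(\text{FID} \propto N^{\text{slope}}\) under this simple model.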

Motion Tokenizer Comparison¶

2D-LFQ scales the codebook size from 512 in traditional VQ methods to 65536+ while keeping the MPJPE error low, achieving nearly 100% codebook utilization.

Generalization Capability¶

While existing models struggle on out-of-domain concept tests within MotionLib, Being-M0 demonstrates significantly better generalization capabilities on unseen motion categories.

Highlights & Insights¶

- First Million-Level Motion Dataset: 1.21M sequences and 2.48M text annotations, roughly 15 times more sequences than the largest prior dataset (Motion-X, 81.1K).
- First Demonstration of Scaling Laws in Motion Generation: Both data scale and model size are shown empirically to reduce generation error (FID), with data scale having the larger effect.
- Innovative 2D Lookup-Free Quantization: Scales the codebook size by two orders of magnitude, substantially relaxing the codebook-capacity limits of conventional VQ tokenizers.
- Lossless Feature Design: SMPL-D135 is more compact than H3D (135D vs. 263D) while preserving complete rotation information.
- Hierarchical Text Annotations: Body-part-level plus full-body-level descriptions provide finer textual granularity than the body-level text in earlier datasets.

Limitations & Future Work¶

- The quality of motions extracted from millions of web videos is highly uneven, necessitating an additional RL refinement strategy.
- Some fast or intense motions still suffer from foot-sliding issues even after RL refinement.
- The dataset primarily focuses on single-person scenarios, offering limited support for multi-person interactions.
- The accuracy of SMPL parameters extracted from 2D videos remains limited compared to high-end MoCap data.
- The computational overhead and inference cost for large LLM backbones are relatively high.

Rating¶

⭐⭐⭐⭐⭐ (5/5)

This is a landmark work in the field of motion generation. The construction of the million-level dataset, the first demonstration of scaling laws, and the innovative 2D-LFQ tokenizer are all significant contributions. The paper demonstrates immense engineering effort and highly systematic experiments, laying a solid foundation for future research.