Plan, Posture and Go: Towards Open-Vocabulary Text-to-Motion Generation

Conference: ECCV 2024
arXiv: 2312.14828
Code: https://moonsliu.github.io/Pro-Motion
Area: Motion Generation / Multimodal
Keywords: Text-to-motion generation, open-vocabulary, LLM planning, diffusion models, human motion

TL;DR

This paper proposes PRO-Motion, a divide-and-conquer framework that decomposes text-to-motion generation into three stages: LLM-driven motion planning (Plan), script-based posture diffusion generation (Posture), and global translation and rotation estimation (Go). By reducing the complexity of each stage, it achieves high-quality open-vocabulary motion generation.

Background & Motivation

Background: Text-to-motion generation aims to automatically generate 3D human motion sequences from natural language descriptions. Existing methods are usually trained on limited text-motion paired datasets (such as HumanML3D and KIT-ML), which restricts their generation capability to the text distribution of the training set. Some methods attempt to align the motion and text spaces using CLIP, but the generated results are still limited to simple in-place motions.

Limitations of Prior Work: (1) The coverage of text descriptions in training data is limited, making models struggle to comprehend motion descriptions unseen in the training set; (2) Most motions generated by existing methods are "in-place" motions, lacking full-body translation and rotation, which limits dynamism; (3) Direct mapping from text to motion is overly complex—while the expression space of natural language is infinite, the space of possible human poses is highly structured.

Key Challenge: Direct end-to-end generation from open-vocabulary text to complete motion is extremely challenging, as it requires simultaneously understanding arbitrary natural language, generating plausible posture sequences, and predicting global motion trajectories. Neither existing data nor current models can support the generalization of such end-to-end mapping.

Goal: How to achieve true open-vocabulary text-to-motion generation, allowing the model to understand and generate motion descriptions that never appeared in the training set?

Key Insight: The key observation is that although the space of natural language descriptions is infinite (e.g., "happily spinning in place while dancing"), the underlying human postures can be described by a structured "script" template that covers all possible poses. If an LLM can be used to translate arbitrary text into this standardized script sequence, the mapping from script to posture becomes much simpler, as the script space is far smaller than the natural language space.

Core Idea: Divide-and-conquer strategy—planning natural language into structured posture script sequences using an LLM, and then using diffusion models to generate postures and full-body motion trajectories respectively.

Method

Overall Architecture

PRO-Motion consists of three modules in a pipelined fashion: (1) Motion Planner: takes natural language motion descriptions as input and utilizes an LLM to generate structured script descriptions of a sequence of key poses; (2) Posture-Diffuser: converts each script description into its corresponding SMPL posture parameters; (3) Go-Diffuser: predicts the global translation and rotation between motion frames based on the posture sequence to generate complete dynamic motions. The SMPL model is adopted as the human body representation, with a final output motion sequence of dimension \(64 \times 135\).
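
The three-stage flow described above can be sketched as a thin wrapper around three callables. This is a minimal illustration rather than the authors' code: the class `ProMotionPipeline`, the stub stages, and the breakdown of the 135 per-frame values (22 joint rotations in a 6D representation plus a 3D translation) are assumptions made for this example. Only the 64 × 135 output shape comes from the text.

```python
from dataclasses import dataclass
from typing import Callable, List

NUM_FRAMES = 64    # output length stated in the text
FEATURE_DIM = 135  # per-frame feature size stated in the text

# Assumed breakdown of the 135 values (not confirmed by the source):
# 22 joint rotations in a 6D representation plus one 3D translation.
ASSUMED_JOINTS = 22
assert ASSUMED_JOINTS * 6 + 3 == FEATURE_DIM

Pose = List[float]          # one key pose, no global translation
Motion = List[List[float]]  # NUM_FRAMES rows of FEATURE_DIM values


@dataclass
class ProMotionPipeline:
    """Hypothetical wrapper chaining the three stages."""
    plan: Callable[[str], List[str]]    # text -> key-pose scripts
    posture: Callable[[str], Pose]      # script -> pose parameters
    go: Callable[[List[Pose]], Motion]  # key poses -> full motion

    def generate(self, text: str) -> Motion:
        scripts = self.plan(text)
        key_poses = [self.posture(script) for script in scripts]
        motion = self.go(key_poses)
        if len(motion) != NUM_FRAMES or any(len(f) != FEATURE_DIM for f in motion):
            raise ValueError("unexpected motion shape")
        return motion


# Stand-in stages so the sketch runs end to end; real stages would call
# an LLM and two trained diffusion models.
def stub_plan(text: str) -> List[str]:
    return [f"key pose {i} for '{text}'" for i in range(4)]


def stub_posture(script: str) -> Pose:
    # Body joints only; the root rotation is left to the last stage.
    return [0.0] * ((ASSUMED_JOINTS - 1) * 6)


def stub_go(key_poses: List[Pose]) -> Motion:
    return [[0.0] * FEATURE_DIM for _ in range(NUM_FRAMES)]


pipeline = ProMotionPipeline(stub_plan, stub_posture, stub_go)
motion = pipeline.generate("dance freely")
print(len(motion), len(motion[0]))
```

Keeping each stage behind a plain callable mirrors the divide-and-conquer idea: any stage can be swapped or tested in isolation without touching the other two.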

Key Designs

Motion Planner:
- Function: Translates open-vocabulary natural language motion descriptions into structured key pose script sequences.
- Mechanism: An LLM (e.g., GPT-4) serves as the motion planner, and carefully designed prompts guide it to output standardized posture description scripts. The prompt defines five basic categories of body part relationships (bending degree, relative distance, relative position, orientation, and ground contact) and specifies the body parts to which each applies. Following these rules, the LLM decomposes a complex natural language instruction (e.g., "dance freely") into structured posture descriptions for several key frames. The intermediate postures between key frames are interpolated later by the Go-Diffuser.
- Design Motivation: This is the core innovation of the framework. While the natural language space is infinite, the structured script space is enumerable because it involves only a finite set of body parts and relations. The language understanding capability of LLMs is well suited to this "translation" task. Users can also manually edit the scripts to achieve precise control.
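
The planning step can be pictured as prompt construction followed by parsing of the LLM's reply. The sketch below is hypothetical: the five category names come from the text, but the one-line glosses, the prompt wording, the `Pose k:` reply format, and the function names are assumptions for illustration, not the paper's actual prompt.

```python
from typing import Dict, List

# Category names are taken from the text; the glosses are illustrative only.
SCRIPT_CATEGORIES: Dict[str, str] = {
    "bending degree": "how much a joint such as an elbow or knee is bent",
    "relative distance": "how far apart two body parts are",
    "relative position": "where one body part is relative to another",
    "orientation": "which way a body part is pointing",
    "ground contact": "which body parts touch the ground",
}


def build_planner_prompt(description: str, num_key_poses: int = 4) -> str:
    """Assemble a hypothetical planner prompt for an LLM."""
    rules = "\n".join(f"- {name}: {gloss}" for name, gloss in SCRIPT_CATEGORIES.items())
    return (
        f"Describe the motion '{description}' as {num_key_poses} key poses.\n"
        "Use only these kinds of body part relationships:\n"
        f"{rules}\n"
        "Answer with one line per pose, formatted as 'Pose k: <script>'."
    )


def parse_planner_reply(reply: str) -> List[str]:
    """Extract one script per 'Pose k: ...' line; other lines are ignored."""
    scripts = []
    for line in reply.splitlines():
        line = line.strip()
        if line.lower().startswith("pose") and ":" in line:
            scripts.append(line.split(":", 1)[1].strip())
    return scripts


prompt = build_planner_prompt("dance freely")
# A hand-written reply standing in for a real LLM response.
example_reply = (
    "Pose 1: The left elbow is bent.\n"
    "Pose 2: The right foot is on the ground."
)
scripts = parse_planner_reply(example_reply)
print(scripts)
```

Because the parsed scripts are plain strings, a user could edit any entry of `scripts` by hand before passing it on, which is the manual-control path mentioned above.
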
Posture-Diffuser:
- Function: Converts a single posture script description into corresponding SMPL posture parameters.
- Mechanism: It is implemented as a conditional diffusion model conditioned on the textual features of the posture script, generating corresponding SMPL postures (excluding global displacement) via a denoising process. It is trained on the PoseScript dataset, which provides a large volume of automatically generated posture description-posture pairs. The model employs a 3-layer Transformer with a latent dimension of 512, trained for 1000 epochs using a linear noise schedule. In addition, the encoder of a pre-trained text-posture retrieval model is used to extract semantic features of texts and postures.
- Design Motivation: Since script descriptions follow simple text templates (rather than complex natural language), the learning difficulty for Posture-Diffuser is significantly reduced. The compositional nature of scripts enables the model to generalize to posture combinations unseen during training.
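
For readers unfamiliar with the linear noise schedule mentioned above, the following sketch shows the standard DDPM forward (noising) process on a flat pose vector. The step count of 1000 and the beta range of 1e-4 to 0.02 are the common DDPM defaults and are assumptions here, since the text does not state them. The function names are likewise made up for this example.

```python
import math
import random
from typing import List, Tuple


def linear_beta_schedule(num_steps: int, beta_start: float = 1e-4,
                         beta_end: float = 0.02) -> List[float]:
    """Betas spaced evenly between beta_start and beta_end (assumed defaults)."""
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alphas(betas: List[float]) -> List[float]:
    """alpha_bar_t = product of (1 - beta_s) for s up to t."""
    alpha_bars, product = [], 1.0
    for beta in betas:
        product *= 1.0 - beta
        alpha_bars.append(product)
    return alpha_bars


def q_sample(x0: List[float], t: int, alpha_bars: List[float],
             rng: random.Random) -> Tuple[List[float], List[float]]:
    """Draw x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * noise."""
    a = alpha_bars[t]
    noise = [rng.gauss(0.0, 1.0) for _ in x0]
    xt = [math.sqrt(a) * x + math.sqrt(1.0 - a) * n for x, n in zip(x0, noise)]
    return xt, noise


betas = linear_beta_schedule(1000)
alpha_bars = cumulative_alphas(betas)
pose = [0.0] * 6  # toy stand-in for a pose parameter vector
noisy_pose, noise = q_sample(pose, t=500, alpha_bars=alpha_bars, rng=random.Random(0))
```

During training, a denoiser conditioned on the script features would be asked to recover `noise` (or the clean pose) from `noisy_pose` and `t`. That network is omitted from this sketch.
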
Go-Diffuser:
- Function: Adds full-body translation and rotation to the posture sequences to generate complete dynamic motions.
- Mechanism: It is implemented as another diffusion model conditioned on the key posture sequence generated by Posture-Diffuser, predicting the full-body translation (3D) and root joint rotation (6D continuous rotation representation) for each frame. At the same time, it interpolates the intermediate frames between keyframes to a fixed length (64 frames). The model adopts an 8-layer Transformer with 4-head attention, uses a cosine noise schedule, and runs for 100 diffusion steps. The translation is represented as velocity (displacement difference between adjacent frames) for normalization.
- Design Motivation: Separating in-place posture generation from global motion estimation resolves the issue of previous methods generating only in-place motions. Go-Diffuser essentially learns "how to reasonably move in space given a sequence of postures," which is a much simpler task than end-to-end generation.
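
Two details of the Go-Diffuser mechanism lend themselves to a short sketch: the velocity encoding of the translation and the cosine noise schedule. The schedule below follows the widely used cosine formulation with offset `s = 0.008` and a beta clip of 0.999. Both values, the zero velocity assigned to the first frame, and all function names are assumptions, as the text only states "cosine noise schedule", "100 diffusion steps", and "velocity".

```python
import math
from typing import List, Sequence


def cosine_beta_schedule(num_steps: int, s: float = 0.008,
                         max_beta: float = 0.999) -> List[float]:
    """Betas derived from a squared-cosine alpha_bar curve (assumed constants)."""
    def f(t: int) -> float:
        return math.cos((t / num_steps + s) / (1.0 + s) * math.pi / 2.0) ** 2

    return [min(1.0 - f(i + 1) / f(i), max_beta) for i in range(num_steps)]


def to_velocity(positions: Sequence[Sequence[float]]) -> List[List[float]]:
    """Per-frame displacement; the first frame gets zero velocity (assumption)."""
    velocities = [[0.0, 0.0, 0.0]]
    for prev, cur in zip(positions, positions[1:]):
        velocities.append([c - p for p, c in zip(prev, cur)])
    return velocities


def from_velocity(velocities: Sequence[Sequence[float]],
                  start: Sequence[float]) -> List[List[float]]:
    """Integrate velocities back into absolute positions from a start point."""
    positions, current = [], list(start)
    for velocity in velocities:
        current = [c + v for c, v in zip(current, velocity)]
        positions.append(current)
    return positions


go_betas = cosine_beta_schedule(100)  # 100 diffusion steps, as stated in the text

trajectory = [[0.0, 0.0, 0.0], [0.5, 0.0, 0.25], [1.0, 0.0, 0.75]]
velocities = to_velocity(trajectory)
recovered = from_velocity(velocities, start=trajectory[0])
```

Velocities stay in a small, roughly stationary range regardless of how far the character has travelled, which is one plausible reading of why the text describes this representation as helping normalization.
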

Loss & Training

Both diffusion models are trained using the standard DDPM framework. Posture-Diffuser is trained on the PoseScript-A dataset (automatically generated posture descriptions) with a batch size of 512 and a learning rate of 1e-4. Go-Diffuser is trained on the AMASS dataset with a batch size of 64, a learning rate of 1e-4, and a classifier-free guidance masking probability of 0.1. During training, motion sequences are cropped or padded to a unified length of 64 frames.
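
The two data-side training details above, length unification and condition masking for classifier-free guidance, can be sketched as follows. The helper names are hypothetical, and two choices are assumptions not stated in the text: cropping keeps the first 64 frames, and padding repeats the last frame.

```python
import random
from typing import List, Optional, Sequence, TypeVar

Frame = TypeVar("Frame")

TARGET_LENGTH = 64    # unified sequence length stated in the text
COND_MASK_PROB = 0.1  # classifier-free guidance masking probability stated in the text


def crop_or_pad(frames: Sequence[Frame], target_length: int = TARGET_LENGTH) -> List[Frame]:
    """Crop long sequences from the start; pad short ones with the last frame.

    Both strategies are assumptions made for this sketch.
    """
    frames = list(frames)
    if not frames:
        raise ValueError("cannot pad an empty sequence")
    if len(frames) >= target_length:
        return frames[:target_length]
    return frames + [frames[-1]] * (target_length - len(frames))


def maybe_drop_condition(condition: Frame, rng: random.Random,
                         prob: float = COND_MASK_PROB) -> Optional[Frame]:
    """Return None with probability `prob` so the model also learns unconditionally."""
    return None if rng.random() < prob else condition


long_clip = crop_or_pad(list(range(100)))
short_clip = crop_or_pad([1, 2, 3])

rng = random.Random(0)
dropped = sum(maybe_drop_condition("key poses", rng) is None for _ in range(10_000))
drop_rate = dropped / 10_000
```

At sampling time, a model trained this way can blend its conditional and unconditional predictions, which is what makes the masking step worthwhile.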

Key Experimental Results

Main Results

Method	R-Precision ↑	FID ↓	MM-Dist ↓	Test Set
MDM	Lower	Higher	Higher	HumanML3D
T2M-GPT	Moderate	Moderate	Moderate	HumanML3D
MotionDiffuse	Moderate	Moderate	Moderate	HumanML3D
PRO-Motion (Ours)	Highest	Lowest	Lowest	HumanML3D
PRO-Motion (Open-Vocab)	Effective	Reasonable	Reasonable	IDEA-400

Ablation Study

Configuration	APE-Root ↓	APE-Mean ↓	Description
Posture-Diffuser only	No trajectory	In-place only	No Go-Diffuser, in-place postures only
w/o Motion Planner	Limited	Limited	Direct natural language conditioning, poor generalization
Full PRO-Motion	Lowest	Lowest	Complete three modules
Different LLMs	Small variation	Small variation	GPT-4 is optimal, but other LLMs are also applicable
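
The APE columns in the table refer to average position error. The sketch below assumes the common definition: the mean Euclidean distance between generated and reference joint positions over frames, with APE-Root using only the root joint and APE-Mean averaging over all joints. The text does not define these metrics, so this reading, the root index of 0, and the function names are assumptions.

```python
import math
from typing import List, Sequence

Joints = Sequence[Sequence[float]]  # [joints][3] positions for one frame


def ape_per_joint(pred: Sequence[Joints], target: Sequence[Joints]) -> List[float]:
    """Mean L2 position error of each joint, averaged over frames."""
    if len(pred) != len(target) or not pred:
        raise ValueError("sequences must be non-empty and of equal length")
    num_joints = len(pred[0])
    totals = [0.0] * num_joints
    for pred_frame, target_frame in zip(pred, target):
        for j, (p, t) in enumerate(zip(pred_frame, target_frame)):
            totals[j] += math.dist(p, t)
    return [total / len(pred) for total in totals]


def ape_root(pred: Sequence[Joints], target: Sequence[Joints], root_index: int = 0) -> float:
    """Error of the root joint only (root index 0 is an assumption)."""
    return ape_per_joint(pred, target)[root_index]


def ape_mean(pred: Sequence[Joints], target: Sequence[Joints]) -> float:
    """Error averaged over all joints."""
    errors = ape_per_joint(pred, target)
    return sum(errors) / len(errors)


# Toy example: two frames, two joints; only the root is displaced, by (3, 4, 0).
reference = [[[3.0, 4.0, 0.0], [1.0, 1.0, 1.0]]] * 2
generated = [[[0.0, 0.0, 0.0], [1.0, 1.0, 1.0]]] * 2
root_error = ape_root(generated, reference)
mean_error = ape_mean(generated, reference)
```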

Key Findings

The Motion Planner is key to achieving open-vocabulary generalization; without it, the model degenerates into traditional closed-set methods.
Go-Diffuser resolves the issue of in-place postures, rendering the generated movements more dynamic and realistic.
Users can directly edit the scripts generated by the Motion Planner to control specific postures precisely, enabling fine-grained interactive editing.
On the IDEA-400 open-vocabulary test set, PRO-Motion successfully handles complex textual instructions (e.g., "feeling a deep sense of happiness"), while other methods completely fail to generate reasonable motions.

Highlights & Insights

Ingenious application of LLM as a planner: Instead of generating motions directly, the LLM is tasked with "translating" natural language into structured scripts. This fully leverages the LLM's language comprehension capabilities while bypassing the difficulty of generating continuous values. This LLM-as-planner concept can be transferred to other generation tasks.
Divide-and-conquer complexity reduction: A difficult end-to-end problem is decomposed into three simpler sub-problems, each of which can be addressed with sufficient data and a tailored model. This system design approach is highly practical for engineering.
Compositional generalization of scripts: The compositional nature of structured scripts inherently supports generalization. Even if a specific combination (e.g., "raising the left hand over the head while extending the right foot backward") is unseen during training, it can still be composed as long as the descriptions of the individual body parts have been learned.

Limitations & Future Work

Reliance on LLMs (such as GPT-4) for inference increases computational cost and API dependencies.
The fixed motion length of 64 frames limits the model's capability to generate long-sequence motions.
The SMPL representation does not include finger articulation, so the model cannot generate fine-grained hand gestures.
The quality of the Motion Planner depends heavily on prompt design, and different LLMs may require distinct prompt tuning.
Lacking physical constraints, the generated motions may suffer from self-penetration or physically implausible movements.

vs MDM (Motion Diffusion Model): MDM directly generates motions end-to-end from text, which is limited by the text distribution of the training set. PRO-Motion overcomes this limitation via its divide-and-conquer strategy.
vs MotionGPT: MotionGPT directly predicts motion tokens using an LLM, yet remains limited by training data. PRO-Motion tasks the LLM only with planning instead of execution, aligning better with the capability boundary of LLMs.
vs TEMOS: TEMOS operates within the SMPL space but only estimates local movements. PRO-Motion's Go-Diffuser additionally estimates global trajectories, yielding more dynamic results.

Rating

Novelty: ⭐⭐⭐⭐⭐ The divide-and-conquer methodology combining LLM planning + structured scripts + dual diffusion models is highly ingenious and original.
Experimental Thoroughness: ⭐⭐⭐⭐ Features rich closed-set and open-vocabulary experiments, ablations, and user-control examples, though the quantitative open-vocabulary evaluation is somewhat weak.
Writing Quality: ⭐⭐⭐⭐ The narrative is clear and fluent, with a comprehensive elaboration of the divide-and-conquer motivation.
Value: ⭐⭐⭐⭐ Open-vocabulary motion generation is an important direction; the divide-and-conquer approach offers broad reference value.