CoMo: Controllable Motion Generation Through Language Guided Pose Code Editing¶
Conference: ECCV2024
arXiv: 2403.13900
Code: yh2371/CoMo
Area: Human Understanding
Keywords: human motion synthesis, motion editing, pose code, LLM, text-to-motion
TL;DR¶
This paper introduces CoMo, which decomposes motion sequences into semantically explicit pose codes (e.g., "left knee slightly bent") to achieve text-based controllable motion generation and LLM-based zero-shot motion editing.
Background & Motivation¶
Background¶
Background: Existing text-to-motion models (such as T2M-GPT, MDM, and MLD) can generate human motion from text, but they offer little fine-grained control over the generation process.
Limitations of Prior Work¶
Limitations of Prior Work: Three shortcomings stand out:
- Modifying subtle poses at specific moments (such as "bending deeper") is extremely difficult.
- Inserting new actions at specified time points (such as "finally squatting down") is not supported.
- Most methods map text to a single entangled latent code, so the generator must disentangle the information for each body part on its own, which leads to imprecise correspondences between text and motion.
These limitations restrict the applicability of existing methods in scenarios that require fine control, such as animation creation and immersive technologies.
Goal¶
Goal: How can text-driven human motion generation achieve fine-grained controllability along both the spatial dimension (individual body parts) and the temporal dimension (individual frames), while also supporting intuitive natural-language motion editing?
Method¶
CoMo consists of three core components:
1. Motion Encoder-Decoder¶
Encoder: Uses a predefined semantic pose codebook rather than a learned VQ-VAE. Based on the skeleton parser of PoseScript, it encodes each frame of motion into a K-hot vector through heuristic geometric threshold rules:
- The codebook contains 392 pose codes, divided into 70 pose categories.
- Each code describes the state of a body part (e.g., "right arm straight") or the spatial relationship between parts (e.g., "left hand and left foot close").
- Codes within each category are mutually exclusive, with only one code activated per category at any given moment.
- The temporal downsampling rate is \(l=4\), with a maximum code sequence length of 50.
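The rule-based encoding described above can be illustrated with a minimal sketch. The two categories, their code labels, and the angle thresholds below are hypothetical placeholders, not the actual 70 categories or the PoseScript thresholds; the sketch only shows how thresholding a geometric measurement yields one active code per category and hence a K-hot vector.

```python
import math

def joint_angle(a, b, c):
    """Angle in degrees at joint b formed by the 3D points a-b-c."""
    v1 = [a[i] - b[i] for i in range(3)]
    v2 = [c[i] - b[i] for i in range(3)]
    dot = sum(x * y for x, y in zip(v1, v2))
    norm = math.sqrt(sum(x * x for x in v1)) * math.sqrt(sum(x * x for x in v2))
    cos = max(-1.0, min(1.0, dot / norm))
    return math.degrees(math.acos(cos))

# Hypothetical bins: (upper bound in degrees, code label). Values are illustrative only.
KNEE_BINS = [
    (60.0, "completely bent"),
    (120.0, "partially bent"),
    (160.0, "slightly bent"),
    (float("inf"), "straight"),
]

# Hypothetical two-category codebook (the real one has 70 categories, 392 codes).
CATEGORIES = {
    "left_knee_angle": KNEE_BINS,
    "right_knee_angle": KNEE_BINS,
}

def bin_index(value, bins):
    """Index of the first bin whose upper bound exceeds the value."""
    for i, (upper, _) in enumerate(bins):
        if value < upper:
            return i
    return len(bins) - 1

def encode_frame(measurements):
    """Concatenate one one-hot block per category into a K-hot vector."""
    k_hot = []
    for name, bins in CATEGORIES.items():
        one_hot = [0] * len(bins)
        one_hot[bin_index(measurements[name], bins)] = 1
        k_hot.extend(one_hot)
    return k_hot

# A straight left leg and a right knee bent at a right angle.
left = joint_angle((0, 1, 0), (0, 0.5, 0), (0, 0, 0))
right = joint_angle((0, 1, 0), (0, 0, 0), (1, 0, 0))
frame_code = encode_frame({"left_knee_angle": left, "right_knee_angle": right})
```

Because the codes within a category are mutually exclusive, the number of ones in the vector always equals the number of categories (here 2; in CoMo, K = 70).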
Decoder: A 1D convolutional network that reconstructs continuous motion from the latent features of the pose codes (the sum of the embeddings of the activated codes). The training objective uses a smooth L1 loss on both positions and velocities.
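A form consistent with this description is sketched below. It is written as an assumption rather than quoted from the paper: \(m\) denotes the ground-truth motion, \(\hat{m}\) the reconstruction, \(V(\cdot)\) a frame-to-frame velocity, and \(\alpha\) a weighting hyperparameter. None of these symbols is defined in the text above.

\[
\mathcal{L}_{\text{rec}} = \mathcal{L}_1^{\text{smooth}}(m, \hat{m}) + \alpha \, \mathcal{L}_1^{\text{smooth}}\big(V(m), V(\hat{m})\big), \qquad V(m)_t = m_{t+1} - m_t
\]

The first term penalizes position errors and the second penalizes velocity errors, which discourages jitter between consecutive frames.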
2. Motion Generator¶
An autoregressive multi-label prediction model based on a decoder-only Transformer:
- Input: CLIP-encoded text embeddings + 11 fine-grained keywords generated by GPT-4 (10 body parts + 1 emotion).
- Output: Step-by-step prediction of K-hot vectors at each timestep (Bernoulli distribution of each pose code).
- Training Objective: Binary cross-entropy loss, maximizing the average log-likelihood of all Bernoulli distributions.
- Additional: An \<End\> code is used to mark the end of the motion.
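The multi-label training objective listed above can be sketched as follows. This is a minimal pure-Python illustration, assuming the model outputs one probability per pose code at each timestep; the function names are hypothetical and the real model would compute this on tensors.

```python
import math

def bce_loss(probs, targets, eps=1e-7):
    """Mean binary cross-entropy over all pose codes at one timestep."""
    total = 0.0
    for p, y in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(targets)

def sequence_loss(prob_seq, target_seq):
    """Average the per-timestep loss over the whole code sequence."""
    losses = [bce_loss(p, t) for p, t in zip(prob_seq, target_seq)]
    return sum(losses) / len(losses)

# Toy example with a 2-code vocabulary: an uninformed and a confident prediction.
uninformed = bce_loss([0.5, 0.5], [1, 0])
confident = bce_loss([0.9, 0.1], [1, 0])
```

Minimizing this loss is equivalent to maximizing the average log-likelihood of the per-code Bernoulli distributions, so a confident correct prediction scores lower than an uninformed one.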
3. Motion Editor¶
Leverages an LLM (GPT-4) to perform zero-shot editing on pose codes through a three-step sequential prompting process:
- Locate Edit Frames: The LLM determines the start and end frame indices that need editing.
- Locate Edit Parts: The LLM identifies the body parts to be modified and their corresponding pose categories.
- Modify Pose Code: The LLM reviews the codes within the selected categories and adjusts them based on the instructions.
The edited codes are concatenated with the unedited parts and reconstructed into the final motion via the decoder. This approach directly manipulates the encoded representation of the source motion, rather than regenerating it from the updated text as in methods like FineMoGen.
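A minimal sketch of the final splice step, under stated assumptions: each timestep is represented as a hypothetical dictionary from pose category to active code, and the frame range and code changes are hard-coded here where CoMo would obtain them from the three LLM prompts. The LLM calls and the decoder are deliberately left out.

```python
def apply_edit(code_seq, start, end, changes):
    """Return a copy of code_seq with `changes` applied to timesteps start..end (inclusive).

    Timesteps outside the range are copied unchanged, which is what keeps
    the unedited parts of the source motion intact.
    """
    edited = []
    for t, frame in enumerate(code_seq):
        new_frame = dict(frame)
        if start <= t <= end:
            new_frame.update(changes)
        edited.append(new_frame)
    return edited

# Hypothetical source motion: six timesteps, two categories.
source = [{"left_knee": "straight", "right_arm": "straight"} for _ in range(6)]

# Stand-ins for the LLM outputs of steps 1-3 for an instruction like "finally squat down".
start, end = 3, 5                          # step 1: edit frames
changes = {"left_knee": "completely bent"}  # steps 2-3: category and new code

edited = apply_edit(source, start, end, changes)
```

Only the selected categories in the selected range change; every other code is carried over verbatim before the sequence is passed to the decoder.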
Key Experimental Results¶
Motion Generation (HumanML3D Dataset)¶
| Metric | CoMo | T2M-GPT | FineMoGen | GraphMotion |
|---|---|---|---|---|
| R-Precision Top-3↑ | 0.790 | 0.775 | 0.784 | 0.785 |
| FID↓ | 0.262 | 0.116 | 0.151 | 0.116 |
| MM-DIST↓ | 3.032 | 3.118 | 2.998 | 3.070 |
| Diversity↑ | 9.936 | 9.761 | 9.263 | 9.692 |
- The reconstruction quality of pose codes is close to real motion (reconstruction FID = 0.041 vs. real FID = 0.002).
- Achieves the best performance in R-Precision Top-3 and Diversity.
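For readers unfamiliar with the retrieval metric in the table, R-Precision Top-k measures how often the matching text ranks among the k nearest candidates to a generated motion in a shared embedding space. The sketch below assumes a precomputed, hypothetical distance matrix and omits the feature extractors and candidate-pool sampling used in the actual benchmark.

```python
def r_precision(dist_matrix, k):
    """Fraction of motions whose matching text is among the k closest texts.

    dist_matrix[i][j] is the distance between motion i and text j;
    text i is assumed to be the ground-truth description of motion i.
    """
    hits = 0
    for i, row in enumerate(dist_matrix):
        ranked = sorted(range(len(row)), key=lambda j: row[j])
        if i in ranked[:k]:
            hits += 1
    return hits / len(dist_matrix)

# Hypothetical distances for three motion-text pairs.
dist = [
    [0.1, 0.5, 0.9],
    [0.4, 0.6, 0.2],
    [0.3, 0.2, 0.8],
]
top1 = r_precision(dist, 1)
top3 = r_precision(dist, 3)
```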
Motion Editing (54-User Study)¶
- On average, over 70% of the evaluators preferred CoMo's editing results.
- The advantages are particularly significant in scenarios involving body part modifications and action insertions/deletions.
- The advantage is less pronounced in global edits such as emotion or speed (textual descriptions are better suited for such global changes).
Ablation Study¶
- Removing the LLM-generated fine-grained keywords lowers R-Precision Top-1 on HumanML3D from 0.502 to 0.487.
- A codebook size of 392 strikes the best balance between complexity and reconstruction quality.
- A downsampling rate of \(l=4\) is the optimal choice (\(l=2\) provides better quality but results in sequences that are too long).
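The sequence-length trade-off behind the downsampling choice can be checked with simple arithmetic. The sketch assumes one code vector per \(l\) frames with integer division, and uses a 200-frame clip purely as an illustrative input.

```python
def code_sequence_length(num_frames, downsample_rate):
    """Number of pose-code timesteps, assuming one code vector per `downsample_rate` frames."""
    return num_frames // downsample_rate

for rate in (2, 4):
    length = code_sequence_length(200, rate)
    print(f"l={rate}: {length} code timesteps for a 200-frame clip")
```

Under this assumption, \(l=4\) fits a 200-frame clip into the stated maximum of 50 code timesteps, while \(l=2\) doubles the sequence the autoregressive generator has to predict.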
Highlights & Insights¶
- Semantic and Interpretable Motion Representation: Pose codes possess clear natural language semantics (e.g., "left knee slightly bent"), making the motion sequence human-readable and human-intervenable.
- LLM Zero-Shot Editing: Seamlessly leverages the language comprehension capabilities of LLMs to perform motion editing without fine-tuning, employing a simple and effective three-step prompting strategy.
- Direct Manipulation of Source Action: Unlike methods that regenerate motions from updated text, CoMo directly modifies the source motion encoding, thereby better maintaining consistency in the unedited regions.
- Unified Generation and Editing Framework: The identical pose code representation concurrently supports text-driven generation and interactive editing.
Limitations & Future Work¶
- Mainly Local Kinematic Descriptions: The current pose codes focus on local joint states and lack global descriptors such as velocity, style, trajectory, and motion repetition.
- Lack of Physical Constraint: Modifying pose codes with LLMs does not guarantee physically plausible motion sequences, which may lead to unnatural results.
- Suboptimal FID: There is still a performance gap in generation quality (FID = 0.262) compared to T2M-GPT (0.116), as the discretized representation loses some fine details.
- Lower Multimodality: While semantic pose codes enhance text-motion consistency, they may sacrifice diversity under the same text prompt.
- Dependence on GPT-4: Both editing and keyword generation depend on GPT-4, which increases inference cost and latency.
Related Work & Insights¶
| Method | Representation | Editing Capability | Edit Approach |
|---|---|---|---|
| T2M-GPT | VQ-VAE implicit token | No direct editing | Must modify text to regenerate |
| FineMoGen | Diffusion model latent | Optimized via global attention | Generates a new sequence for each edit |
| MDM | Diffusion model | Zero-shot inpainting | Frame/joint inpainting, can be unnatural |
| GraphMotion | Hierarchical semantic graph + diffusion | Limited | Relies on text semantic parsing |
| CoMo | Semantic pose code | LLM zero-shot editing | Directly modifies source motion encoding |
The key differentiator of CoMo is that its interpretable discrete representation allows LLMs to directly "comprehend" and "modify" motions, whereas other methods either require modifying the text for regeneration or perform indirect editing through inpainting.
- The Power of Discretized Semantic Representation: Encoding continuous signals into human-understandable discrete symbols not only facilitates LLM reasoning but also opens up new paths for interactive editing. Similar concepts can be generalized to the controllable generation of other continuous signals (such as speech or music).
- Relationship with PoseScript: The codebook construction relies on the skeleton parser of PoseScript, representing a natural extension of PoseScript from static pose description to dynamic motion generation.
- Conversational Motion Generation: CoMo's iterative editing capability enables an interactive loop of "user description \(\rightarrow\) generation \(\rightarrow\) feedback \(\rightarrow\) editing \(\rightarrow\) satisfaction", which is highly suitable for practical animation production workflows.
Rating¶
- Novelty: ⭐⭐⭐⭐ - Using semantic pose codes as an LLM-manipulable intermediate motion representation is a novel and promising direction.
- Experimental Thoroughness: ⭐⭐⭐⭐ - Generation experiments cover two datasets, editing is validated with a 54-user study, and the ablation is comprehensive.
- Writing Quality: ⭐⭐⭐⭐ - Clear layout, well-explained concepts, and plentiful figures and tables.
- Value: ⭐⭐⭐⭐ - Controllable motion generation is a practical demand, and the paradigm of semantic discrete representation combined with LLM-based editing provides strong reference value.