
tags: - ECCV 2024 - Multimodal VLM - GPT-4V date: 2026-05-08 content_hash: 856baf5bd6994af9

FreeMotion: MoCap-Free Human Motion Synthesis with Multimodal Large Language Models

Conference: ECCV2024
arXiv: 2406.10740
Code: Not open-sourced
Area: Multimodal VLM
Keywords: Human motion synthesis, Multimodal Large Language Models, MoCap-free, Keyframe generation, Physical simulation, GPT-4V

TL;DR

FreeMotion is the first method to achieve open-set human motion synthesis without any motion capture (MoCap) data. It uses an MLLM (GPT-4V) as a keyframe designer and animator, combined with physics-based motion tracking.

Background & Motivation

Core Problem

Traditional human motion synthesis methods rely heavily on motion capture (MoCap) data, which brings three problems:

High Data Acquisition Costs: The largest public MoCap datasets only contain dozens of hours of motion, which is far from covering the diversity of daily human actions.

Poor Generalizability: Data-driven methods are limited to pre-recorded action categories, environments, and styles, lacking open-set generalization capabilities.

Limited Scenarios: Existing methods struggle to adapt to new environments and unseen human behaviors.

Motivation

Multimodal Large Language Models (MLLMs) are trained on internet-scale vision-language data and therefore possess rich world knowledge and reasoning capabilities. The authors decouple the high-level semantic understanding of MLLMs from low-level motion control, proposing a two-stage framework of "keyframe generation + motion in-betweening." This design avoids forcing the MLLM to predict continuous motion states directly, a task beyond its capabilities.

Comparison with Prior Work

Reward-function-based methods (e.g., Eureka, Language2Reward): Can only handle a limited range of actions that can be represented by reward functions.
CLIP-based methods (e.g., MotionCLIP, AvatarCLIP): Offer limited zero-shot capability, fail to understand complex action combinations and sequences, and suffer from poor physical constraints.
Ours: The first to leverage an MLLM for MoCap-free open-set motion synthesis.

Method

Overall Architecture

FreeMotion consists of two stages:

Stage 1: MLLM-driven Sequential Keyframe Generation

Accomplished through the collaboration of two specialized GPT-4V agents.
Keyframe Designer: Decomposes high-level motion instructions into sequences of low-level body part descriptions.
Keyframe Animator: Adjusts human poses via predefined commands based on these descriptions.

Stage 2: Motion In-Betweening

Linear interpolation is performed between keyframes (at 20 fps).
Physically implausible poses are corrected using a CVAE-based motion tracking policy.
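
The interpolation step of Stage 2 can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes each keyframe is an array of 3D joint positions and that the Designer's time intervals are given in seconds. The function name `interpolate_keyframes` and the array layout are hypothetical; only the 20 fps rate comes from the text. The resulting frames would then serve as tracking targets for the physics-based policy.

```python
import numpy as np

FPS = 20  # frame rate stated in the text


def interpolate_keyframes(keyframes, intervals, fps=FPS):
    """Linearly interpolate joint positions between consecutive keyframes.

    keyframes: array-like of shape (K, J, 3), hypothetical joint positions.
    intervals: K-1 durations in seconds between consecutive keyframes.
    Returns an array of shape (N, J, 3) sampled at `fps`.
    """
    keyframes = np.asarray(keyframes, dtype=float)
    if len(intervals) != len(keyframes) - 1:
        raise ValueError("need one interval per pair of consecutive keyframes")

    frames = [keyframes[0]]
    for start, end, duration in zip(keyframes[:-1], keyframes[1:], intervals):
        # At least one in-between step, even for very short intervals.
        steps = max(1, int(round(duration * fps)))
        for k in range(1, steps + 1):
            alpha = k / steps
            frames.append((1.0 - alpha) * start + alpha * end)
    return np.stack(frames)
```

If poses were stored as joint rotations rather than positions, spherical interpolation would be the more usual choice; positions are used here only to keep the sketch short.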

Key Designs

1. Keyframe Designer

Input: Full-body description \(D_i\), rendered image \(p_i\), joint coordinates \(\{x_i\}\), motion instruction \(I\).

Output: Next keyframe representation \(r_{i+1}\) (full-body description + body-part descriptions) and time interval \(t_i\).

Key aspects:

Starts from the initial standing pose \(D_0\) and iteratively generates the entire sequence of keyframes.
Integrates spatial decomposition (body parts) and temporal decomposition (keyframe intervals).
The MLLM automatically determines the end of motion (completion of acyclic actions or one full cycle of cyclic actions).
Rendered images serve as visual feedback to help the MLLM better understand the current state.
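
The iterative Designer loop described above might look like the sketch below. All names (`PoseState`, `DesignerReply`, `generate_keyframes`, `query_designer`, `animate`) are hypothetical; `query_designer` stands in for a GPT-4V call and `animate` for the Keyframe Animator. The `max_keyframes` cap and the convention that the completion signal arrives without a further keyframe are assumptions, since the summary does not say how termination is encoded.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class PoseState:
    description: str          # full-body description D_i
    image: object = None      # rendered image p_i (opaque in this sketch)
    joints: list = field(default_factory=list)  # joint coordinates {x_i}


@dataclass
class DesignerReply:
    full_body: str            # full-body description of the next keyframe
    body_parts: Dict[str, str]  # body-part name -> description
    interval: float           # time interval t_i, assumed to be in seconds
    done: bool = False        # True once the MLLM judges the motion complete


def generate_keyframes(
    instruction: str,
    state: PoseState,
    query_designer: Callable[[str, PoseState], DesignerReply],
    animate: Callable[[PoseState, DesignerReply], PoseState],
    max_keyframes: int = 20,  # safety cap, an assumption of this sketch
) -> Tuple[List[PoseState], List[float]]:
    """Roll out keyframes from the initial pose until the Designer says stop."""
    keyframes, intervals = [state], []
    for _ in range(max_keyframes):
        reply = query_designer(instruction, state)
        if reply.done:
            break
        # The Animator realizes the description and returns the new pose,
        # which is fed back to the Designer on the next iteration.
        state = animate(state, reply)
        keyframes.append(state)
        intervals.append(reply.interval)
    return keyframes, intervals
```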

2. Keyframe Animator

Receives the keyframe descriptions generated by the Designer and adjusts the pose using a predefined set of commands:

Command	Function
Single joint movement	Moves a single joint to the target position
End effector movement	Moves the end effector quickly via a predefined IK chain
Pelvis rotation/movement (with support)	Rotates/moves the pelvis when supported by the ground (IK)
Pelvis rotation/movement (without support)	Rotates/moves the pelvis directly when not supported by the ground
Single joint roll	Rolls a single joint
Camera rotation	Rotates the camera to view specific body parts

Each body part is adjusted at most 5 times, and the total number of adjustments per keyframe is typically fewer than 10.
Visual feedback mechanism: After each command execution, the updated rendered image and joint coordinates are fed back to the Animator.
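
A rough sketch of the Animator's adjust-and-observe loop, under stated assumptions. The `Command` members mirror the rows of the command table; `propose_command` stands in for a GPT-4V call that sees the latest render and joint coordinates, and `apply_command` stands in for the IK solver plus re-rendering. Both callables, the function name, and the choice to count every command against the per-part budget are hypothetical.

```python
from collections import Counter
from enum import Enum


class Command(Enum):
    MOVE_JOINT = "single joint movement"
    MOVE_END_EFFECTOR = "end effector movement"
    PELVIS_SUPPORTED = "pelvis rotation/movement (with support)"
    PELVIS_UNSUPPORTED = "pelvis rotation/movement (without support)"
    ROLL_JOINT = "single joint roll"
    ROTATE_CAMERA = "camera rotation"


MAX_ADJUSTMENTS_PER_PART = 5  # limit stated in the text


def animate_keyframe(body_part_descriptions, propose_command, apply_command, pose):
    """Adjust the pose part by part until each part is accepted or its budget runs out.

    propose_command(part, description, pose) -> (Command, args), or None when
        the current pose already matches the description.
    apply_command(pose, command, args) -> updated pose (the visual feedback).
    """
    counts = Counter()
    for part, description in body_part_descriptions.items():
        while counts[part] < MAX_ADJUSTMENTS_PER_PART:
            proposal = propose_command(part, description, pose)
            if proposal is None:
                break
            command, args = proposal
            pose = apply_command(pose, command, args)
            counts[part] += 1
    return pose, counts
```

Capping the loop per body part bounds the number of MLLM calls per keyframe, which matters given the API cost discussed under limitations.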

3. Environment-Aware Motion Tracking

Extracts a heightmap around the humanoid pelvis, flattened into a vector \(o_t\) as the environmental visual signal.
CVAE Policy: Encoder \(q_\phi(z_t | s_t, \tilde{s}_{t+1}, o_t)\) + Decoder \(p_\theta(a_t | s_t, z_t)\).
MLP World Model: \(\omega(s_{t+1} | s_t, a_t, o_t)\) approximates the true transition probability.
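
To make the observation \(o_t\) concrete, here is one way a pelvis-centered heightmap could be sampled and flattened. The grid extent, the resolution, the alignment with the character's heading, and the `terrain_height` callable are all assumptions for illustration; the summary does not specify these details.

```python
import numpy as np


def local_heightmap(terrain_height, pelvis_xy, heading=0.0, size=1.0, resolution=5):
    """Sample terrain heights on a square grid around the pelvis and flatten them.

    terrain_height: hypothetical callable (x, y) -> height, vectorized over arrays.
    pelvis_xy: (x, y) position of the pelvis in world coordinates.
    heading: yaw angle in radians used to align the grid with the character.
    size: half-width of the sampled square in metres (assumed).
    resolution: number of samples per side (assumed).
    """
    offsets = np.linspace(-size, size, resolution)
    gx, gy = np.meshgrid(offsets, offsets, indexing="ij")
    c, s = np.cos(heading), np.sin(heading)
    # Rotate the local grid by the heading, then translate to the pelvis.
    world_x = pelvis_xy[0] + c * gx - s * gy
    world_y = pelvis_xy[1] + s * gx + c * gy
    heights = np.asarray(terrain_height(world_x, world_y), dtype=float)
    return heights.ravel()  # the flattened vector o_t
```

The flattened vector would then be concatenated with the other inputs of the encoder and the world model.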

Loss & Training

The training process mostly follows ControlVAE. The core losses include:

Reconstruction Loss: Measures the discrepancy between the target interpolated frame and the actual generated frame.
KL Divergence: Regularizes the latent distribution produced by the CVAE encoder toward the prior.
World Model Loss: Measures the prediction error of the next state compared to the actual simulated state.
Physics simulation uses the Open Dynamics Engine (ODE).
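
As a hedged sketch, the three losses listed above could be written in the notation of the tracking section as follows. The squared-error form, the weight \(\beta\), the state-conditioned prior \(p(z_t \mid s_t)\), and the symbols \(\hat{s}_{t+1}\) and \(\bar{s}_{t+1}\) are assumptions made for illustration; this summary does not give the exact formulation.

\[
\mathcal{L}_{\text{rec}} = \sum_t \lVert \tilde{s}_{t+1} - \hat{s}_{t+1} \rVert^2, \qquad
\mathcal{L}_{\text{KL}} = \sum_t D_{\mathrm{KL}}\big( q_\phi(z_t \mid s_t, \tilde{s}_{t+1}, o_t) \,\Vert\, p(z_t \mid s_t) \big)
\]

\[
\mathcal{L}_{\text{policy}} = \mathcal{L}_{\text{rec}} + \beta \, \mathcal{L}_{\text{KL}}, \qquad
\mathcal{L}_{\text{world}} = \sum_t \lVert s_{t+1} - \bar{s}_{t+1} \rVert^2
\]

Here \(\tilde{s}_{t+1}\) is the target interpolated frame, \(\hat{s}_{t+1}\) is the state reached after executing \(a_t \sim p_\theta(a_t \mid s_t, z_t)\), \(s_{t+1}\) is the state returned by the simulator, and \(\bar{s}_{t+1}\) is the prediction of the world model \(\omega(\cdot \mid s_t, a_t, o_t)\).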

Key Experimental Results

Main Results 1: HumanAct12 Motion Synthesis (User Study, 50 Participants)

Action Category	MDM	MLD	Ours
Warm up	26.00%	38.00%	36.00%
Walk	10.00%	22.00%	68.00%
Run	30.00%	32.00%	38.00%
Jump	16.00%	28.00%	56.00%
Drink	14.00%	46.00%	40.00%
Lift_dumbbell	26.00%	32.00%	42.00%
Sit	30.00%	44.00%	26.00%
Eat	22.00%	30.00%	48.00%
Turn_steering_wheel	32.00%	28.00%	40.00%
Phone	30.00%	32.00%	38.00%
Boxing	16.00%	24.00%	60.00%
Throw	20.00%	14.00%	66.00%
Average	22.67%	30.83%	46.50%

Note: MDM and MLD are trained on HumanAct12 data, whereas FreeMotion operates entirely without any MoCap data.
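
The averages in the last row can be recomputed directly from the per-category numbers. The short check below uses only values from the table; the variable names are arbitrary. It also counts the categories in which FreeMotion has the highest preference rate.

```python
# Preference rates (%) per category: (MDM, MLD, Ours), copied from the table.
results = {
    "Warm up": (26.0, 38.0, 36.0),
    "Walk": (10.0, 22.0, 68.0),
    "Run": (30.0, 32.0, 38.0),
    "Jump": (16.0, 28.0, 56.0),
    "Drink": (14.0, 46.0, 40.0),
    "Lift_dumbbell": (26.0, 32.0, 42.0),
    "Sit": (30.0, 44.0, 26.0),
    "Eat": (22.0, 30.0, 48.0),
    "Turn_steering_wheel": (32.0, 28.0, 40.0),
    "Phone": (30.0, 32.0, 38.0),
    "Boxing": (16.0, 24.0, 60.0),
    "Throw": (20.0, 14.0, 66.0),
}
methods = ("MDM", "MLD", "Ours")

# Column-wise mean over the 12 categories, rounded like the table.
averages = {
    name: round(sum(row[i] for row in results.values()) / len(results), 2)
    for i, name in enumerate(methods)
}

# Categories where the last column (Ours) is the highest of the three.
wins = [cat for cat, row in results.items() if row[2] == max(row)]

print(averages)
print(len(wins), "of", len(results), "categories won by Ours")
```

The recomputed means match the table, and FreeMotion ranks first in 9 of the 12 categories; MLD is preferred for Warm up, Drink, and Sit.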

Main Results 2: Olympic Sports Motion Synthesis

Method	Average User Preference
MotionCLIP	~8%
AvatarCLIP	~10%
Ours	~82%

FreeMotion significantly outperforms CLIP-based baseline methods on Olympic sports motion synthesis, demonstrating the superior understanding of complex action sequences by MLLMs.

Main Results 3: Style Transfer (User Study)

Style	MotionCLIP	AvatarCLIP	Ours
Happy	22.67%	25.33%	52.00%
Proud	24.00%	18.00%	58.00%
Angry	14.00%	34.67%	51.33%
Childlike	28.67%	29.33%	42.00%
Depressed	14.67%	17.33%	68.00%
Drunk	11.33%	9.33%	79.33%
Old	17.33%	28.00%	54.67%
Heavy	20.00%	16.00%	64.00%
Average	19.08%	22.25%	58.67%

Main Results 4: Human-Scene Interaction

Method	Sit Success Rate	Lie Down Success Rate	Reach Success Rate	Sit Contact Error ↓	Lie Down Contact Error ↓	Reach Contact Error ↓
InterPhys	93.7%	80.0%	—	0.09	0.30	—
UniHSI	94.3%	81.5%	97.5%	0.032	0.061	0.016
AMP	83.6%	28.3%	96.6%	0.074	0.334	0.041
Ours	95%	60%	95%	0.066	0.224	0.012

Ablation Study

Ablation Item	Setting	User Preference
Body Part Description	W/o description vs Full	26% vs 74%
Visual Feedback	W/o visual feedback vs Full	32% vs 68%

Key Findings

Outperforming Supervised Methods Without MoCap Data: Achieved an average preference rate of 46.50% on HumanAct12, which is higher than MDM (22.67%) and MLD (30.83%).
World Knowledge of MLLMs Is a Key Advantage: In style transfer, the MLLM can reason about common-sense behaviors (e.g., "elderly people walk with a hunched back").
Body Part-Level Spatial Decomposition is Vital: In the pairwise user study, the full model is preferred over the variant without body-part descriptions by 74% to 26%.
Visual Feedback Significantly Improves Pose Accuracy: In the pairwise user study, the full model is preferred over the variant without visual feedback by 68% to 32%.
Physical Constraints Offer a Core Advantage over CLIP Methods: For CLIP-based methods, individual frames may be plausible, but the overall generated motion lacks physical realism.

Highlights & Insights

Paradigm Innovation: Proves for the first time that high-quality, open-set motion synthesis can be achieved without relying on MoCap data, breaking the conventional assumption that "motion synthesis requires motion data."
Clever Boundary Partitioning of Capabilities: Assigning high-level semantic planning (keyframe design) to the MLLM and delegating low-level continuous motion to physics simulation plays to the MLLM's strengths while avoiding its weakness in continuous control.
Dual-Agent Collaborative Design: The division of labor between the Designer (deciding "what to do") and the Animator (deciding "how to do it") resembles the relationship between key animators and in-between animators in traditional animation studios.
Plug-and-Play Environment Awareness: Heightmaps let the motion tracking policy adapt to complex environments, including non-flat terrain.
High Scalability: As the underlying MLLMs improve, the framework's performance should improve with them.

Limitations & Future Work

Inability to Handle Highly Complex Actions: Motions that require fine coordination, such as dancing, suffer from imprecise keyframe decomposition.
Insufficient Support for Long Textual Instructions: Understanding and executing complex, multi-step instructions remains challenging.
Degraded Performance under Contact-Rich Scenarios: The success rate of "Lie Down" is only 60%, significantly lower than UniHSI's 81.5%.
Low Inference Efficiency: Generating each keyframe requires multiple GPT-4V API calls (iteratively by the Designer and Animator), incurring high latency and financial costs.
Limited Evaluation Metrics: The evaluation relies heavily on user preference studies and lacks automated quantitative metrics.
Dependence on Physics Simulators: Still requires training a motion tracking policy for each downstream task.

Category	Representative Methods	Key Difference from Ours
Data-driven	MDM, MLD, MotionDiffuse	Require MoCap data for training, which limits generalizability
CLIP-based	MotionCLIP, AvatarCLIP	Zero-shot, but with poor motion quality and no physical constraints
Reward Design	Eureka, L2R	Rely on LLMs to generate reward functions; applicable only to a few actions
Scene Interaction	UniHSI, InterPhys, AMP	Require MoCap data or meticulously designed reward functions

Insights

Feasible Path for MLLMs as "World Models": Instead of directly outputting low-level control signals, the MLLM guides on a semantic level, which is then translated into physical motion through specialized modules.
Generality of Keyframing Principles: The concept of "discrete semantic anchors + continuous signal in-betweening" may transfer to fields such as video generation and robot planning.
Closed-Loop Visual Feedback: Allowing the MLLM to iteratively adjust based on visual feedback is an effective strategy to compensate for its limited precision in spatial reasoning.

Rating

Novelty: ⭐⭐⭐⭐⭐ - Paradigm-shifting; first to achieve MoCap-free open-set motion synthesis.
Experimental Thoroughness: ⭐⭐⭐⭐ - Covers four downstream tasks, although automated metrics are limited and the evaluation relies heavily on user studies.
Writing Quality: ⭐⭐⭐⭐ - Clear structure, well-motivated, with in-depth explanation of the dual-agent design.
Value: ⭐⭐⭐⭐ - Blazes a new trail for MLLM-driven motion synthesis, but practical application is hindered by API dependency and high inference costs.