
Tags: ECCV 2024, Multimodal VLM, GPT-4V
Date: 2026-05-08
Content hash: 856baf5bd6994af9


FreeMotion: MoCap-Free Human Motion Synthesis with Multimodal Large Language Models

Conference: ECCV2024
arXiv: 2406.10740
Code: Not open-sourced
Area: Multimodal VLM
Keywords: Human motion synthesis, Multimodal Large Language Models, MoCap-free, Keyframe generation, Physical simulation, GPT-4V

TL;DR

This work is the first to achieve open-set human motion synthesis without any motion capture (MoCap) data. It uses an MLLM (GPT-4V) as a keyframe designer and animator, combined with physics-based motion tracking.


Background & Motivation

Core Problem

Traditional human motion synthesis methods rely heavily on motion capture (MoCap) data, which has several drawbacks:

High Data Acquisition Costs: The largest public MoCap datasets only contain dozens of hours of motion, which is far from covering the diversity of daily human actions.

Poor Generalizability: Data-driven methods are limited to pre-recorded action categories, environments, and styles, lacking open-set generalization capabilities.

Limited Scenarios: Existing methods struggle to adapt to new environments and unseen human behaviors.

Motivation

Multimodal Large Language Models (MLLMs) are trained on internet-scale vision-language data and therefore carry rich world knowledge and reasoning capabilities. The authors decouple the high-level semantic understanding of MLLMs from low-level motion control, proposing a two-stage framework of "keyframe generation + motion in-betweening." This design avoids forcing the MLLM to predict continuous motion states directly, a task beyond its current capabilities.

Comparison with Prior Work

  • Reward-function-based methods (e.g., Eureka, Language2Reward): Can only handle a limited range of actions that can be represented by reward functions.
  • CLIP-based methods (e.g., MotionCLIP, AvatarCLIP): Offer limited zero-shot capability, fail to understand complex action combinations and sequences, and suffer from poor physical constraints.
  • Ours: The first to use an MLLM for MoCap-free open-set motion synthesis.

Method

Overall Architecture

FreeMotion consists of two stages:

Stage 1: MLLM-driven Sequential Keyframe Generation

  • Accomplished through the collaboration of two specialized GPT-4V agents.
  • Keyframe Designer: Decomposes high-level motion instructions into sequences of low-level body part descriptions.
  • Keyframe Animator: Adjusts human poses via predefined commands based on these descriptions.

Stage 2: Motion In-Betweening

  • Linear interpolation is performed between keyframes (at 20 fps).
  • Physically implausible poses are corrected using a CVAE-based motion tracking policy.
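
The interpolation step of Stage 2 can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Pose` layout (joint name mapped to a 3D position), the helper names `lerp_pose` and `inbetween`, and the use of positions rather than joint rotations are all hypothetical. Only the 20 fps rate and the use of linear interpolation come from the description above. The resulting frames are the targets that the tracking policy would then correct.

```python
from typing import Dict, List, Tuple

# Hypothetical pose layout: joint name -> (x, y, z) position.
Pose = Dict[str, Tuple[float, ...]]


def lerp_pose(a: Pose, b: Pose, w: float) -> Pose:
    """Blend two poses joint by joint; w=0 returns a, w=1 returns b."""
    return {
        joint: tuple((1.0 - w) * pa + w * pb for pa, pb in zip(a[joint], b[joint]))
        for joint in a
    }


def inbetween(keyframes: List[Pose], intervals: List[float], fps: int = 20) -> List[Pose]:
    """Linearly interpolate consecutive keyframes at a fixed frame rate.

    intervals[i] is the time in seconds between keyframes[i] and keyframes[i + 1].
    """
    if len(intervals) != len(keyframes) - 1:
        raise ValueError("need exactly one interval per consecutive keyframe pair")
    frames: List[Pose] = []
    for start, end, seconds in zip(keyframes, keyframes[1:], intervals):
        steps = max(1, round(seconds * fps))
        # Emit the start of each segment but not its end, so a keyframe shared
        # by two segments is not duplicated.
        frames.extend(lerp_pose(start, end, k / steps) for k in range(steps))
    frames.append(keyframes[-1])
    return frames


# Usage: two keyframes half a second apart give 10 in-between steps plus the final pose.
keyframes = [{"hand": (0.0, 0.0, 0.0)}, {"hand": (1.0, 2.0, 0.0)}]
frames = inbetween(keyframes, [0.5], fps=20)
```

Interpolated frames of this kind ignore balance and contact, which is why a physics-based tracking stage is needed afterwards.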

Key Designs

1. Keyframe Designer

Input: Full-body description \(D_i\), rendered image \(p_i\), joint coordinates \(\{x_i\}\), motion instruction \(I\).

Output: Next keyframe representation \(r_{i+1}\) (full-body description + body-part descriptions) and time interval \(t_i\).

Key aspects:

  • Starts from the initial standing pose \(D_0\) and iteratively generates the entire sequence of keyframes.
  • Integrates spatial decomposition (body parts) and temporal decomposition (keyframe intervals).
  • The MLLM automatically determines the end of motion (completion of acyclic actions or one full cycle of cyclic actions).
  • Rendered images serve as visual feedback to help the MLLM better understand the current state.
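
The Designer's iterative loop can be sketched as follows. This is a rough illustration rather than the authors' code: `Keyframe`, `Designer`, `design_keyframes`, and the `max_keyframes` safety cap are hypothetical names, and the GPT-4V call is replaced by a plain callable. The rendered image and joint coordinates that the real Designer receives are omitted for brevity.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Keyframe:
    full_body: str              # full-body description D_i
    body_parts: Dict[str, str]  # per-body-part descriptions
    interval: float             # seconds since the previous keyframe (t_i)


# Hypothetical designer interface: given the instruction and the keyframes so
# far, return the next keyframe, or None once the motion is judged complete.
Designer = Callable[[str, List[Keyframe]], Optional[Keyframe]]


def design_keyframes(instruction: str, designer: Designer,
                     max_keyframes: int = 16) -> List[Keyframe]:
    """Start from the standing pose D_0 and query the designer until it stops."""
    sequence = [Keyframe("standing upright", {}, 0.0)]
    while len(sequence) < max_keyframes:  # cap is an assumption, not from the paper
        nxt = designer(instruction, sequence)
        if nxt is None:  # designer signals the end of the motion
            break
        sequence.append(nxt)
    return sequence


def scripted_designer(instruction: str, history: List[Keyframe]) -> Optional[Keyframe]:
    """Stand-in for the MLLM: replays a fixed two-keyframe 'wave' script."""
    script = [
        Keyframe("right arm raised", {"right arm": "lifted above the shoulder"}, 0.4),
        Keyframe("right arm waving", {"right arm": "forearm tilted outward"}, 0.4),
    ]
    idx = len(history) - 1
    return script[idx] if idx < len(script) else None


sequence = design_keyframes("wave with the right hand", scripted_designer)
```

Letting the callable return `None` mirrors the point above that the MLLM itself decides when the motion has ended.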

2. Keyframe Animator

Receives the keyframe descriptions generated by the Designer and adjusts the pose using a predefined set of commands:

| Command | Function |
| --- | --- |
| Single joint movement | Moves a single joint to the target position |
| End effector movement | Moves the end effector quickly via a predefined IK chain |
| Pelvis rotation/movement (with support) | Rotates/moves the pelvis when supported by the ground (IK) |
| Pelvis rotation/movement (without support) | Rotates/moves the pelvis directly when not supported by the ground |
| Single joint roll | Rolls a single joint |
| Camera rotation | Rotates the camera to view specific body parts |

  • Each body part is adjusted at most 5 times, and a keyframe typically needs fewer than 10 adjustments in total.
  • Visual feedback mechanism: After each command execution, the updated rendered image and joint coordinates are fed back to the Animator.
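
The Animator's closed loop (issue a command, observe the result, repeat within a per-part budget) might look roughly like the sketch below. Everything here is an assumption made for illustration: the pose is reduced to joint angles, a command is a simple `(body part, joint, delta)` tuple, and `propose` stands in for the GPT-4V call that would see the updated render and joint coordinates. Only the limit of 5 adjustments per body part comes from the text.

```python
from collections import Counter
from typing import Callable, Dict, Optional, Tuple

Pose = Dict[str, float]           # hypothetical: joint name -> angle in degrees
Command = Tuple[str, str, float]  # hypothetical: (body part, joint, angle delta)


def animate_keyframe(pose: Pose,
                     propose: Callable[[Pose], Optional[Command]],
                     max_per_part: int = 5) -> Tuple[Pose, int]:
    """Apply proposed commands until the proposer is satisfied or a budget is spent."""
    pose = dict(pose)  # work on a copy so the caller's pose is untouched
    used: Counter = Counter()
    while True:
        cmd = propose(pose)  # feedback: the proposer always sees the updated pose
        if cmd is None:
            break
        part, joint, delta = cmd
        if used[part] >= max_per_part:
            break  # simplification: stop once a part's budget is exhausted
        pose[joint] = pose.get(joint, 0.0) + delta
        used[part] += 1
    return pose, sum(used.values())


def toward_right_angle(pose: Pose) -> Optional[Command]:
    """Stand-in proposer: bend the right elbow toward 90 degrees in steps of at most 30."""
    gap = 90.0 - pose.get("right_elbow", 0.0)
    if abs(gap) < 1e-6:
        return None
    return ("right_arm", "right_elbow", max(-30.0, min(30.0, gap)))


final_pose, n_adjustments = animate_keyframe({"right_elbow": 0.0}, toward_right_angle)
capped_pose, n_capped = animate_keyframe({}, lambda p: ("leg", "knee", 1.0))
```

The second call shows the budget at work: a proposer that never stops is cut off after five adjustments to the same body part.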

3. Environment-Aware Motion Tracking

  • Extracts a heightmap around the humanoid pelvis, flattened into a vector \(o_t\) as the environmental visual signal.
  • CVAE Policy: Encoder \(q_\phi(z_t | s_t, \tilde{s}_{t+1}, o_t)\) + Decoder \(p_\theta(a_t | s_t, z_t)\).
  • MLP World Model: \(\omega(s_{t+1} | s_t, a_t, o_t)\) approximates the true transition probability.
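
To make the environmental observation \(o_t\) concrete, the sketch below samples terrain height on a square grid centered at the pelvis and flattens it into a vector. The grid size, cell spacing, world-axis alignment, use of absolute heights, and the names `heightmap_observation` and `height_at` are all assumptions for illustration. The source only states that a heightmap around the pelvis is flattened into a vector.

```python
from typing import Callable, List


def heightmap_observation(height_at: Callable[[float, float], float],
                          pelvis_x: float, pelvis_y: float,
                          size: int = 5, cell: float = 0.2) -> List[float]:
    """Sample a size x size height grid centered on the pelvis, flattened row by row."""
    half = (size - 1) / 2.0
    obs: List[float] = []
    for row in range(size):
        for col in range(size):
            x = pelvis_x + (col - half) * cell
            y = pelvis_y + (row - half) * cell
            obs.append(height_at(x, y))
    return obs


def step_terrain(x: float, y: float) -> float:
    """Toy terrain: a 0.3 m step that begins just past x = 0."""
    return 0.3 if x > 0.0 else 0.0


flat_obs = heightmap_observation(lambda x, y: 0.0, 0.0, 0.0)
step_obs = heightmap_observation(step_terrain, 0.0, 0.0)
```

A vector like `step_obs` would be concatenated with the character state before being passed to the encoder and world model, which is how the policy can react to a step or slope ahead.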

Loss & Training

The training process mostly follows ControlVAE. The core losses include:

  • Reconstruction Loss: Measures the discrepancy between the target interpolated frame and the actual generated frame.
  • KL Divergence: Regularizes the distribution of the latent variable \(z_t\) produced by the CVAE encoder.
  • World Model Loss: Measures the prediction error of the next state compared to the actual simulated state.
  • Physics simulation uses the Open Dynamics Engine (ODE).
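
Using the notation of the policy and world model above, the three loss terms could be written as below. This is a hedged reconstruction, not the paper's exact objective: the squared-error form, the state-conditional prior \(p(z_t \mid s_t)\), the weight \(\beta\), and the symbols \(\hat{s}_{t+1}\) (world-model prediction) and \(s^{\text{sim}}_{t+1}\) (simulated next state) are assumptions in the spirit of ControlVAE.

\[
\mathcal{L}_{\text{rec}} = \sum_t \left\lVert s_{t+1} - \tilde{s}_{t+1} \right\rVert^2,
\qquad
\mathcal{L}_{\text{KL}} = \sum_t D_{\text{KL}}\!\left( q_\phi(z_t \mid s_t, \tilde{s}_{t+1}, o_t) \,\middle\Vert\, p(z_t \mid s_t) \right),
\qquad
\mathcal{L}_{\text{wm}} = \sum_t \left\lVert \hat{s}_{t+1} - s^{\text{sim}}_{t+1} \right\rVert^2
\]

Here \(\tilde{s}_{t+1}\) is the interpolated target frame and \(\hat{s}_{t+1}\) is the prediction of \(\omega(s_{t+1} \mid s_t, a_t, o_t)\). Under these assumptions the policy would be trained on \(\mathcal{L}_{\text{rec}} + \beta\,\mathcal{L}_{\text{KL}}\), while the world model is fitted separately with \(\mathcal{L}_{\text{wm}}\).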

Key Experimental Results

Main Results 1: HumanAct12 Motion Synthesis (User Study, 50 Participants)

| Action Category | MDM | MLD | Ours |
| --- | --- | --- | --- |
| Warm up | 26.00% | 38.00% | 36.00% |
| Walk | 10.00% | 22.00% | 68.00% |
| Run | 30.00% | 32.00% | 38.00% |
| Jump | 16.00% | 28.00% | 56.00% |
| Drink | 14.00% | 46.00% | 40.00% |
| Lift_dumbbell | 26.00% | 32.00% | 42.00% |
| Sit | 30.00% | 44.00% | 26.00% |
| Eat | 22.00% | 30.00% | 48.00% |
| Turn_steering_wheel | 32.00% | 28.00% | 40.00% |
| Phone | 30.00% | 32.00% | 38.00% |
| Boxing | 16.00% | 24.00% | 60.00% |
| Throw | 20.00% | 14.00% | 66.00% |
| Average | 22.67% | 30.83% | 46.50% |

Note: MDM and MLD are trained on HumanAct12 data, whereas FreeMotion operates entirely without any MoCap data.
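
As a quick sanity check on the HumanAct12 table above, the reported averages can be recomputed from the per-category rates. The snippet only re-does arithmetic on numbers copied from that table. The variable names are arbitrary.

```python
# Per-category preference rates (%) copied from the table: (MDM, MLD, Ours).
rates = {
    "Warm up": (26, 38, 36),
    "Walk": (10, 22, 68),
    "Run": (30, 32, 38),
    "Jump": (16, 28, 56),
    "Drink": (14, 46, 40),
    "Lift_dumbbell": (26, 32, 42),
    "Sit": (30, 44, 26),
    "Eat": (22, 30, 48),
    "Turn_steering_wheel": (32, 28, 40),
    "Phone": (30, 32, 38),
    "Boxing": (16, 24, 60),
    "Throw": (20, 14, 66),
}
methods = ("MDM", "MLD", "Ours")

# Column means, rounded to two decimals as in the table.
averages = {
    name: round(sum(row[i] for row in rates.values()) / len(rates), 2)
    for i, name in enumerate(methods)
}

# Categories in which FreeMotion ("Ours") has the highest preference rate.
ours_wins = sorted(action for action, row in rates.items() if row[2] == max(row))
mld_wins = sorted(action for action, row in rates.items() if row[1] == max(row))
```

The recomputed means match the Average row, and the per-category view shows that FreeMotion is preferred in 9 of the 12 categories, with MLD ahead on Warm up, Drink, and Sit.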

Main Results 2: Olympic Sports Motion Synthesis

| Method | Average User Preference |
| --- | --- |
| MotionCLIP | ~8% |
| AvatarCLIP | ~10% |
| Ours | ~82% |

FreeMotion significantly outperforms CLIP-based baseline methods on Olympic sports motion synthesis, demonstrating the superior understanding of complex action sequences by MLLMs.

Main Results 3: Style Transfer (User Study)

| Style | MotionCLIP | AvatarCLIP | Ours |
| --- | --- | --- | --- |
| Happy | 22.67% | 25.33% | 52.00% |
| Proud | 24.00% | 18.00% | 58.00% |
| Angry | 14.00% | 34.67% | 51.33% |
| Childlike | 28.67% | 29.33% | 42.00% |
| Depressed | 14.67% | 17.33% | 68.00% |
| Drunk | 11.33% | 9.33% | 79.33% |
| Old | 17.33% | 28.00% | 54.67% |
| Heavy | 20.00% | 16.00% | 64.00% |
| Average | 19.08% | 22.25% | 58.67% |

Main Results 4: Human-Scene Interaction

| Method | Sit Success Rate | Lie Down Success Rate | Reach Success Rate | Sit Contact Error ↓ | Lie Down Contact Error ↓ | Reach Contact Error ↓ |
| --- | --- | --- | --- | --- | --- | --- |
| InterPhys | 93.7% | 80.0% | - | 0.09 | 0.30 | - |
| UniHSI | 94.3% | 81.5% | 97.5% | 0.032 | 0.061 | 0.016 |
| AMP | 83.6% | 28.3% | 96.6% | 0.074 | 0.334 | 0.041 |
| Ours | 95% | 60% | 95% | 0.066 | 0.224 | 0.012 |

Ablation Study

| Ablation Item | Setting | User Preference |
| --- | --- | --- |
| Body Part Description | W/o description vs Full | 26% vs 74% |
| Visual Feedback | W/o visual feedback vs Full | 32% vs 68% |

Key Findings

  1. Outperforming Supervised Methods Without MoCap Data: FreeMotion achieves an average preference rate of 46.50% on HumanAct12, higher than MDM (22.67%) and MLD (30.83%), although MLD is still preferred on Warm up, Drink, and Sit.
  2. World Knowledge of MLLMs holds a Key Advantage: In style transfer, the MLLM can reason about common-sense behaviors (e.g., "elderly people walk with a hunched back").
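  3. Body Part-Level Spatial Decomposition is Vital: In the head-to-head user study, the variant without body-part descriptions is preferred only 26% of the time, versus 74% for the full model.
  4. Visual Feedback Significantly Improves Pose Accuracy: The variant without visual feedback is preferred only 32% of the time, versus 68% for the full model.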
  5. Physical Constraints Offer a Core Advantage over CLIP Methods: For CLIP-based methods, individual frames may be plausible, but the overall generated motion lacks physical realism.

Highlights & Insights

  1. Paradigm Innovation: Proves for the first time that high-quality, open-set motion synthesis can be achieved without relying on MoCap data, breaking the conventional assumption that "motion synthesis requires motion data."
  2. Clever Boundary Partitioning of Capabilities: High-level semantic planning (keyframe design) goes to the MLLM, while low-level continuous motion is delegated to physical simulation, so each component does what it is good at.
  3. Dual-Agent Collaborative Design: The division of labor between the Designer (deciding "what to do") and the Animator (deciding "how to do it") resembles the relationship between main animators and in-between animators in traditional animation studios.
  4. Plug-and-Play Environment Awareness: Heightmaps let the motion tracking policy adapt to complex environments, including non-flat terrain.
  5. High Scalability: As the underlying MLLMs become more capable, the performance of the framework should improve with them.

Limitations & Future Work

  1. Inability to Handle Highly Complex Actions: Motion requiring fine coordination, such as dancing, suffers from imprecise keyframe decomposition.
  2. Insufficient Support for Long Textual Instructions: Understanding and executing complex, multi-step instructions remains challenging.
  3. Degraded Performance under Contact-Rich Scenarios: The success rate of "Lie Down" is only 60%, significantly lower than UniHSI's 81.5%.
  4. Low Inference Efficiency: Generating each keyframe requires multiple GPT-4V API calls (iteratively by the Designer and Animator), incurring high latency and financial costs.
  5. Limited Evaluation Metrics: Heavily relies on user preference studies, with a lack of automated quantitative metrics.
  6. Dependence on Physics Simulators: Still requires training a motion tracking policy for each downstream task.

| Category | Representative Methods | Distinctive Difference from Ours |
| --- | --- | --- |
| Data-driven | MDM, MLD, MotionDiffuse | Requires MoCap data for training, limiting generalizability |
| CLIP-based | MotionCLIP, AvatarCLIP | Zero-shot but poor motion quality, lacking physical constraints |
| Reward Design | Eureka, L2R | Relies on LLMs to generate reward functions, applicable only to few actions |
| Scene Interaction | UniHSI, InterPhys, AMP | Requires MoCap data or meticulously designed reward functions |

Insights

  • Feasible Path for MLLMs as "World Models": Instead of directly outputting low-level control signals, the MLLM guides on a semantic level, which is then translated into physical motion through specialized modules.
  • Generality of Keyframing Principles: The concept of "discrete semantic anchors + continuous signal in-betweening" can be transferred to fields like video generation, robotics planning, etc.
  • Closed-Loop Visual Feedback: Allowing the MLLM to iteratively adjust based on visual feedback is an effective strategy to compensate for its limited precision in spatial reasoning.

Rating

  • Novelty: ⭐⭐⭐⭐⭐ - Paradigm-shifting; first to achieve MoCap-free open-set motion synthesis.
  • Experimental Thoroughness: ⭐⭐⭐⭐ - Covers four downstream tasks, although automated metrics are limited and the evaluation relies heavily on user studies.
  • Writing Quality: ⭐⭐⭐⭐ - Clear structure, well-motivated, with in-depth explanation of the dual-agent design.
  • Value: ⭐⭐⭐⭐ - Blazes a new trail for MLLM-driven motion synthesis, but practical application is hindered by API dependency and high inference costs.