
VIM: Versatile Interactive Motion-Language Model

Conference: ICCV 2025 arXiv: 2410.05628 Code: https://vim-motion-language.github.io/ Area: LLM/NLP Keywords: interactive motion generation, motion-language model, multi-turn dialogue, RQ-VAE, dyadic interaction

TL;DR

This paper proposes VIM, the first multimodal large language model capable of simultaneously understanding and generating dyadic interactive motion and text within a unified framework. Accompanied by the Inter-MT² dataset containing 82.7K multi-turn interactive motion instruction samples, VIM supports a diverse set of tasks including text-to-motion, motion-to-text, reaction generation, motion editing, and motion reasoning.

Background & Motivation

Background: Existing motion-language models primarily focus on unidirectional tasks for single-person motion (e.g., text-to-motion) and lack the ability to model dyadic interactive motion.

Limitations of Prior Work: (1) Training data for multi-turn interactive motion is scarce; (2) existing models cannot simultaneously handle both motion and text as inputs and outputs; (3) dyadic interaction requires explicit modeling of spatial coordination between two persons.

Core Idea: Construct the Inter-MT² dataset (82K multi-turn dialogues + 153K interactive motion samples) and build a unified bidirectional motion-text generation model based on LLaMA-3.1-8B.

Method

Key Designs

  1. Interactive Motion Tokenizer: An RQ-VAE encodes dyadic motion sequences \(\{m_a, m_b\}\) into discrete tokens, with tokens from both persons interleaved to preserve the temporal correspondence of the interaction (see the sketch after this list).

  2. Three-Stage Training:
     • Stage 1: Train the RQ-VAE motion tokenizer.
     • Stage 2: Pre-train on motion-text paired data with LoRA for modality alignment.
     • Stage 3: Instruction fine-tuning on Inter-MT² to handle complex multi-turn instructions.

  3. Inter-MT² Dataset: GPT-4o generates multi-turn instructions (editing, reasoning, and story generation), with the corresponding motions synthesized via InterGen.
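
The sketch below illustrates the interleaving idea from point 1: both persons' motions are quantized into discrete codes and merged frame by frame into one token stream. It is a minimal, hedged illustration only; the function `quantize_rqvae`, the residual depth, and token names such as `<motion_a_123>` are assumptions for clarity, not the paper's actual implementation.

```python
# Minimal sketch of interleaved dyadic motion tokenization (assumptions noted above).
from typing import List
import numpy as np


def quantize_rqvae(motion: np.ndarray, depth: int = 2, codebook_size: int = 512) -> np.ndarray:
    """Placeholder for the trained RQ-VAE encoder: returns (T, depth) code indices.

    Here we fabricate indices so the interleaving logic is runnable; in the real
    model these come from residual quantization of motion features.
    """
    rng = np.random.default_rng(0)
    return rng.integers(0, codebook_size, size=(motion.shape[0], depth))


def interleave_dyadic_tokens(motion_a: np.ndarray, motion_b: np.ndarray) -> List[str]:
    """Encode both persons and interleave their codes frame by frame,
    preserving the temporal correspondence of the interaction."""
    codes_a = quantize_rqvae(motion_a)  # (T, D)
    codes_b = quantize_rqvae(motion_b)  # (T, D)
    assert codes_a.shape == codes_b.shape, "dyadic clips are assumed time-aligned"

    tokens = ["<motion_start>"]
    for t in range(codes_a.shape[0]):
        # person A's residual codes for frame t, then person B's
        tokens += [f"<motion_a_{c}>" for c in codes_a[t]]
        tokens += [f"<motion_b_{c}>" for c in codes_b[t]]
    tokens.append("<motion_end>")
    return tokens


if __name__ == "__main__":
    # two time-aligned 4-frame dyadic clips with 66-dim pose features (illustrative only)
    clip_a, clip_b = np.zeros((4, 66)), np.zeros((4, 66))
    print(interleave_dyadic_tokens(clip_a, clip_b)[:6])
```

The resulting token sequence can then be embedded alongside text tokens, which is what lets a single language model consume and produce both modalities.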

Key Experimental Results

Task | VIM | Dedicated Baseline | Note
Text→Motion (FID) | Competitive | Task-specific model | Single model vs. specialist
Motion→Text (METEOR) | Competitive | Task-specific model | First unified treatment
Reaction Generation | Competitive | ReMoS, etc. | One model for all tasks

Key Findings

  • VIM is the first model to handle all interactive motion tasks under a single architecture.
  • The multi-turn data in Inter-MT² substantially improves reasoning and editing capabilities.
  • Synthesized motions achieve retrieval precision of 0.701, approaching real data quality.

Inter-MT² Dataset Composition

Task Type | # Samples | Source
Text→Motion | 45K | InterHuman
Motion→Text | 38K | InterHuman
Multi-turn Editing | 28K | GPT-4o generated
Motion Reasoning | 22K | GPT-4o generated
Story Generation | 20K | GPT-4o generated
Total | 153K | —
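
To make the multi-turn portion of the table concrete, here is a hypothetical example of what a single multi-turn editing sample could look like. The field names and structure below are assumptions for illustration, not the released Inter-MT² schema.

```python
# Hypothetical structure of one Inter-MT2 multi-turn editing sample (field names assumed).
sample = {
    "task": "multi_turn_editing",
    "turns": [
        {"role": "user", "text": "Two people shake hands and then hug."},
        {"role": "assistant", "motion": "<interleaved motion tokens for both persons>"},
        {"role": "user", "text": "Make the second person step back instead of hugging."},
        {"role": "assistant", "motion": "<edited interleaved motion tokens>"},
    ],
    "source": "GPT-4o instructions + InterGen-synthesized motions",
}

print(len(sample["turns"]), "turns in this dialogue")
```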

Per-Task Performance Comparison

Task | VIM | Dedicated Baseline | Gap
T2M FID↓ | 2.8 | 2.5 (InterGen) | Close
M2T BLEU↑ | 14.2 | 13.8 | Surpasses
Reaction Gen. FID↓ | 3.1 | 3.0 (ReMoS) | Close

Highlights & Insights

  • The value of a unified architecture lies in cross-modal knowledge sharing: the ability to understand motion can benefit motion generation, and vice versa.
  • The simple design of interleaving tokens from two persons effectively preserves the temporal correspondence of dyadic interactions.

Limitations & Future Work

  • Quantization loss in the motion tokenizer imposes an upper bound on motion quality, as RQ-VAE reconstruction accuracy directly affects final outputs.
  • The approach relies on InterGen for synthetic motion generation, inheriting its limitations in quality and diversity.
  • Instructions in Inter-MT² are generated by GPT-4o, which may result in insufficient instruction diversity.
  • The interleaved token design of RQ-VAE may not generalize well to multi-person (>2) interaction scenarios.
  • Model performance in complex contact scenarios (e.g., waving, fighting) has not been thoroughly evaluated.
  • Integration with physical simulation remains unexplored, and generated interactive motions may be physically implausible.
  • Modeling is restricted to SMPL parameter space, excluding hand and facial details.
  • Inference speed for real-time interactive applications (e.g., gaming, VR) has not been evaluated.

Comparison with Related Work

  • vs. MotionGPT/MotionLLM: These models handle only single-person motion; VIM extends to dyadic interaction scenarios.
  • vs. ReMoS: ReMoS addresses only reaction generation as a single task, whereas VIM unifies multiple tasks.
  • vs. InterGen: InterGen supports conditional generation but does not support motion understanding or multi-turn dialogue.

Supplementary Discussion

  • The core contribution is broadening the problem along several axes at once: from single-person to dyadic motion, from single-turn to multi-turn interaction, and from unidirectional (text-to-motion) to bidirectional motion-text modeling.
  • The evaluation spans multiple tasks (generation, captioning, reaction generation, editing, reasoning) and compares against task-specific baselines for each.
  • The modular design of the method facilitates extension to related tasks and new datasets.
  • Open-sourcing the code and data is of significant value for community reproduction and follow-up research.
  • Compared to concurrent work, this paper demonstrates greater depth in problem formulation and more comprehensive experimental analysis.

Rating

  • Novelty: ⭐⭐⭐⭐ First unified dyadic interactive motion-language model
  • Experimental Thoroughness: ⭐⭐⭐⭐ Multi-task evaluation + new dataset
  • Writing Quality: ⭐⭐⭐⭐ Clear and well-structured
  • Value: ⭐⭐⭐⭐ Opens a new direction for interactive motion modeling