VIM: Versatile Interactive Motion-Language Model¶
Conference: ICCV 2025 · arXiv: 2410.05628 · Code: https://vim-motion-language.github.io/ · Area: LLM/NLP · Keywords: interactive motion generation, motion-language model, multi-turn dialogue, RQ-VAE, dyadic interaction
TL;DR¶
This paper proposes VIM, the first multimodal large language model capable of simultaneously understanding and generating dyadic interactive motion and text within a unified framework. Accompanied by the Inter-MT² dataset containing 82.7K multi-turn interactive motion instruction samples, VIM supports a diverse set of tasks including text-to-motion, motion-to-text, reaction generation, motion editing, and motion reasoning.
Background & Motivation¶
Background: Existing motion-language models primarily focus on unidirectional tasks for single-person motion (e.g., text-to-motion) and lack the ability to model dyadic interactive motion.
Limitations of Prior Work: (1) Training data for multi-turn interactive motion is scarce; (2) existing models cannot simultaneously handle both motion and text as inputs and outputs; (3) dyadic interaction requires explicit modeling of spatial coordination between two persons.
Core Idea: Construct the Inter-MT² dataset (82K multi-turn dialogues + 153K interactive motion samples) and build a unified bidirectional motion-text generation model based on LLaMA-3.1-8B.
Method¶
Key Designs¶
- Interactive Motion Tokenizer: An RQ-VAE encodes the dyadic motion sequences \(\{m_a, m_b\}\) into discrete tokens, with tokens from both persons interleaved to preserve the temporal correspondence of the interaction (see the tokenization sketch after this list).
- Three-Stage Training (see the fine-tuning sketch after this list):
  - Stage 1: Train the RQ-VAE motion tokenizer.
  - Stage 2: Pre-train on motion-text paired data using LoRA for modality alignment.
  - Stage 3: Instruction fine-tuning on Inter-MT² to handle complex multi-turn instructions.
- Inter-MT² Dataset: GPT-4o is used to generate multi-turn instructions (editing, reasoning, and story generation), with the corresponding motions synthesized via InterGen (see the illustrative sample format after this list).
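The interleaving idea can be illustrated with a minimal sketch. The snippet below is not the authors' code: the codebook size, residual depth, and feature dimension are assumptions, and random features stand in for the RQ-VAE encoder output; it only shows residual quantization followed by per-frame interleaving of the two persons' tokens.

```python
# Minimal sketch of interleaved dyadic tokenization (illustrative, not the
# released implementation). Codebook size, depth, and dims are assumptions.
import torch

def residual_quantize(z, codebooks):
    """Quantize features z (T, D) with a list of codebooks (each K, D);
    returns token indices of shape (T, depth)."""
    residual = z
    indices = []
    for cb in codebooks:                     # one codebook per residual level
        dists = torch.cdist(residual, cb)    # (T, K) distances to code vectors
        idx = dists.argmin(dim=-1)           # nearest code per frame
        residual = residual - cb[idx]        # pass the residual to the next level
        indices.append(idx)
    return torch.stack(indices, dim=-1)      # (T, depth)

def interleave_dyadic(tokens_a, tokens_b):
    """Interleave person A and person B tokens per time step so the LLM sees
    temporally aligned pairs: [a_1, b_1, a_2, b_2, ...]."""
    assert tokens_a.shape == tokens_b.shape
    T, depth = tokens_a.shape
    merged = torch.stack([tokens_a, tokens_b], dim=1)  # (T, 2, depth)
    return merged.reshape(T * 2, depth)

# Toy usage with random features standing in for encoded motion frames.
torch.manual_seed(0)
depth, K, D, T = 4, 512, 256, 8
codebooks = [torch.randn(K, D) for _ in range(depth)]
z_a, z_b = torch.randn(T, D), torch.randn(T, D)
tok = interleave_dyadic(residual_quantize(z_a, codebooks),
                        residual_quantize(z_b, codebooks))
print(tok.shape)  # torch.Size([16, 4]): alternating A/B frames, 4 residual levels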
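A hedged sketch of how Stages 2 and 3 might look with the LLaMA-3.1-8B backbone and LoRA adapters, using the Hugging Face transformers and peft libraries. The hyperparameters, target modules, and motion-vocabulary size below are illustrative assumptions, not values reported in the paper.

```python
# Illustrative LoRA setup for Stages 2-3 (modality alignment, then instruction
# tuning). All hyperparameters here are assumptions.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = "meta-llama/Llama-3.1-8B"              # backbone named in the paper
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

# Motion tokens become extra vocabulary entries so the LLM can read and emit
# them alongside text. The count is a guess (codebook size x depth + specials).
num_motion_tokens = 512 * 4 + 4
tokenizer.add_tokens([f"<motion_{i}>" for i in range(num_motion_tokens)])
model.resize_token_embeddings(len(tokenizer))

lora_cfg = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()
# Stage 2 would train on motion-text pairs; Stage 3 continues on Inter-MT²
# multi-turn instruction data with the same causal-LM objective.
```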
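For concreteness, a hypothetical multi-turn editing sample in the spirit of Inter-MT² might look like the following; the field names and turn contents are invented for illustration and do not reflect the released schema.

```python
# Hypothetical shape of one Inter-MT² multi-turn editing sample. <motion>...</motion>
# spans stand for interleaved dyadic motion tokens.
sample = {
    "task": "multi-turn editing",
    "turns": [
        {"role": "user",
         "content": "Generate a motion of two people greeting each other."},
        {"role": "assistant",
         "content": "<motion> ... interleaved tokens for persons A and B ... </motion>"},
        {"role": "user",
         "content": "Now make person B respond with a handshake instead of a wave."},
        {"role": "assistant",
         "content": "<motion> ... edited interleaved tokens ... </motion>"},
    ],
}
```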
Key Experimental Results¶
| Task | VIM | Dedicated Baseline | Note |
|---|---|---|---|
| Text→Motion FID | Competitive | Task-specific | Single model vs. specialist |
| Motion→Text METEOR | Competitive | Task-specific | First unified treatment |
| Reaction Generation | Competitive | ReMoS, etc. | One model for all tasks |
Key Findings¶
- VIM is the first model to handle all interactive motion tasks under a single architecture.
- The multi-turn data in Inter-MT² substantially improves reasoning and editing capabilities.
- Synthesized motions achieve retrieval precision of 0.701, approaching real data quality.
Inter-MT² Dataset Composition¶
| Task Type | # Samples | Source |
|---|---|---|
| Text→Motion | 45K | InterHuman |
| Motion→Text | 38K | InterHuman |
| Multi-turn Editing | 28K | GPT-4o generated |
| Motion Reasoning | 22K | GPT-4o generated |
| Story Generation | 20K | GPT-4o generated |
| Total | 153K | — |
Per-Task Performance Comparison¶
| Task | VIM | Dedicated Baseline | Gap |
|---|---|---|---|
| T2M FID↓ | 2.8 | 2.5 (InterGen) | Close |
| M2T BLEU↑ | 14.2 | 13.8 | Surpasses |
| Reaction Gen. FID↓ | 3.1 | 3.0 (ReMoS) | Close |
Highlights & Insights¶
- The value of a unified architecture lies in cross-modal knowledge sharing: the ability to understand motion can benefit motion generation, and vice versa.
- The simple design of interleaving tokens from two persons effectively preserves the temporal correspondence of dyadic interactions.
Limitations & Future Work¶
- Quantization loss in the motion tokenizer imposes an upper bound on motion quality, as RQ-VAE reconstruction accuracy directly affects final outputs.
- The approach relies on InterGen for synthetic motion generation, inheriting its limitations in quality and diversity.
- Instructions in Inter-MT² are generated by GPT-4o, which may result in insufficient instruction diversity.
- The interleaved token design of RQ-VAE may not generalize well to multi-person (>2) interaction scenarios.
- Model performance in complex interactions, whether contact-heavy (e.g., fighting) or gesture-based (e.g., waving), has not been thoroughly evaluated.
- Integration with physical simulation remains unexplored, and generated interactive motions may be physically implausible.
- Modeling is restricted to SMPL parameter space, excluding hand and facial details.
- Inference speed for real-time interactive applications (e.g., gaming, VR) has not been evaluated.
Related Work & Insights¶
- vs. MotionGPT/MotionLLM: These models handle only single-person motion; VIM extends to dyadic interaction scenarios.
- vs. ReMoS: ReMoS addresses only reaction generation as a single task, whereas VIM unifies multiple tasks.
- vs. InterGen: InterGen supports conditional generation but does not support motion understanding or multi-turn dialogue.
Supplementary Discussion¶
- The core contribution lies in extending motion-language modeling from single-person, unidirectional tasks to a unified, bidirectional treatment of dyadic interaction across multiple tasks, offering a more comprehensive view of interactive motion.
- The experimental design covers a diverse set of tasks, scenarios, and baselines.
- The modular design of the method facilitates extension to related tasks and new datasets.
- Open-sourcing the code and data is of significant value for community reproduction and follow-up research.
- Compared to concurrent work, this paper demonstrates greater depth in problem formulation and more comprehensive experimental analysis.
Rating¶
- Novelty: ⭐⭐⭐⭐ First unified dyadic interactive motion-language model
- Experimental Thoroughness: ⭐⭐⭐⭐ Multi-task evaluation + new dataset
- Writing Quality: ⭐⭐⭐⭐ Clear and well-structured
- Value: ⭐⭐⭐⭐ Opens a new direction for interactive motion modeling