VIM: Versatile Interactive Motion-Language Model¶
Conference: ICCV 2025 · arXiv: 2410.05628 · Code: https://vim-motion-language.github.io/ · Area: LLM/NLP · Keywords: interactive motion generation, motion-language model, multi-turn dialogue, RQ-VAE, dyadic interaction
TL;DR¶
This paper proposes VIM, the first multimodal large language model capable of simultaneously understanding and generating dyadic interactive motion and text within a unified framework. Accompanied by the Inter-MT² dataset containing 82.7K multi-turn interactive motion instruction samples, VIM supports a diverse set of tasks including text-to-motion, motion-to-text, reaction generation, motion editing, and motion reasoning.
Background & Motivation¶
Background: Existing motion-language models primarily focus on unidirectional tasks for single-person motion (e.g., text-to-motion) and lack the ability to model dyadic interactive motion.
Limitations of Prior Work: (1) Training data for multi-turn interactive motion is scarce; (2) existing models cannot simultaneously handle both motion and text as inputs and outputs; (3) dyadic interaction requires explicit modeling of spatial coordination between two persons.
Core Idea: Construct the Inter-MT² dataset (82K multi-turn dialogues + 153K interactive motion samples) and build a unified bidirectional motion-text generation model based on LLaMA-3.1-8B.
Method¶
Key Designs¶
- Interactive Motion Tokenizer: An RQ-VAE encodes the dyadic motion sequences \(\{m_a, m_b\}\) into discrete tokens, with tokens from both persons interleaved to preserve the temporal correspondence of the interaction (see the tokenization sketch after this list).
- Three-Stage Training (see the fine-tuning sketch after this list):
  - Stage 1: Train the RQ-VAE motion tokenizer.
  - Stage 2: Pre-train on motion-text paired data using LoRA for modality alignment.
  - Stage 3: Instruction fine-tuning on Inter-MT² to handle complex multi-turn instructions.
- Inter-MT² Dataset: GPT-4o is used to generate multi-turn instructions (editing, reasoning, and story generation), with the corresponding motions synthesized via InterGen (see the illustrative sample format after this list).
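The interleaving idea can be illustrated with a minimal sketch. The snippet below is not the authors' code: the codebook size, residual depth, and feature dimension are assumptions, and random features stand in for the RQ-VAE encoder output; it only shows residual quantization followed by per-frame interleaving of the two persons' tokens.

```python
# Minimal sketch of interleaved dyadic tokenization (illustrative, not the
# released implementation). Codebook size, depth, and dims are assumptions.
import torch

def residual_quantize(z, codebooks):
    """Quantize features z (T, D) with a list of codebooks (each K, D);
    returns token indices of shape (T, depth)."""
    residual = z
    indices = []
    for cb in codebooks:                     # one codebook per residual level
        dists = torch.cdist(residual, cb)    # (T, K) distances to code vectors
        idx = dists.argmin(dim=-1)           # nearest code per frame
        residual = residual - cb[idx]        # pass the residual to the next level
        indices.append(idx)
    return torch.stack(indices, dim=-1)      # (T, depth)

def interleave_dyadic(tokens_a, tokens_b):
    """Interleave person A and person B tokens per time step so the LLM sees
    temporally aligned pairs: [a_1, b_1, a_2, b_2, ...]."""
    assert tokens_a.shape == tokens_b.shape
    T, depth = tokens_a.shape
    merged = torch.stack([tokens_a, tokens_b], dim=1)  # (T, 2, depth)
    return merged.reshape(T * 2, depth)

# Toy usage with random features standing in for encoded motion frames.
torch.manual_seed(0)
depth, K, D, T = 4, 512, 256, 8
codebooks = [torch.randn(K, D) for _ in range(depth)]
z_a, z_b = torch.randn(T, D), torch.randn(T, D)
tok = interleave_dyadic(residual_quantize(z_a, codebooks),
                        residual_quantize(z_b, codebooks))
print(tok.shape)  # torch.Size([16, 4]): alternating A/B frames, 4 residual levels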
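A hedged sketch of how Stages 2 and 3 might look with the LLaMA-3.1-8B backbone and LoRA adapters, using the Hugging Face transformers and peft libraries. The hyperparameters, target modules, and motion-vocabulary size below are illustrative assumptions, not values reported in the paper.

```python
# Illustrative LoRA setup for Stages 2-3 (modality alignment, then instruction
# tuning). All hyperparameters here are assumptions.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = "meta-llama/Llama-3.1-8B"              # backbone named in the paper
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

# Motion tokens become extra vocabulary entries so the LLM can read and emit
# them alongside text. The count is a guess (codebook size x depth + specials).
num_motion_tokens = 512 * 4 + 4
tokenizer.add_tokens([f"<motion_{i}>" for i in range(num_motion_tokens)])
model.resize_token_embeddings(len(tokenizer))

lora_cfg = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()
# Stage 2 would train on motion-text pairs; Stage 3 continues on Inter-MT²
# multi-turn instruction data with the same causal-LM objective.
```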
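For concreteness, a hypothetical multi-turn editing sample in the spirit of Inter-MT² might look like the following; the field names and turn contents are invented for illustration and do not reflect the released schema.

```python
# Hypothetical shape of one Inter-MT² multi-turn editing sample. <motion>...</motion>
# spans stand for interleaved dyadic motion tokens.
sample = {
    "task": "multi-turn editing",
    "turns": [
        {"role": "user",
         "content": "Generate a motion of two people greeting each other."},
        {"role": "assistant",
         "content": "<motion> ... interleaved tokens for persons A and B ... </motion>"},
        {"role": "user",
         "content": "Now make person B respond with a handshake instead of a wave."},
        {"role": "assistant",
         "content": "<motion> ... edited interleaved tokens ... </motion>"},
    ],
}
```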
Key Experimental Results¶
| Task | VIM | Dedicated Baseline | Note |
|---|---|---|---|
| Text→Motion FID | Competitive | Task-specific | Single model vs. specialist |
| Motion→Text METEOR | Competitive | Task-specific | First unified treatment |
| Reaction Generation | Competitive | ReMoS, etc. | One model for all tasks |
Key Findings¶
- VIM is the first model to handle all interactive motion tasks under a single architecture.
- The multi-turn data in Inter-MT² substantially improves reasoning and editing capabilities.
- Synthesized motions achieve retrieval precision of 0.701, approaching real data quality.
Inter-MT² Dataset Composition¶
| Task Type | # Samples | Source |
|---|---|---|
| Text→Motion | 45K | InterHuman |
| Motion→Text | 38K | InterHuman |
| Multi-turn Editing | 28K | GPT-4o generated |
| Motion Reasoning | 22K | GPT-4o generated |
| Story Generation | 20K | GPT-4o generated |
| Total | 153K | — |
Per-Task Performance Comparison¶
| Task | VIM | Dedicated Baseline | Gap |
|---|---|---|---|
| T2M FID↓ | 2.8 | 2.5 (InterGen) | Close |
| M2T BLEU↑ | 14.2 | 13.8 | Surpasses |
| Reaction Gen. FID↓ | 3.1 | 3.0 (ReMoS) | Close |
Highlights & Insights¶
- The value of a unified architecture lies in cross-modal knowledge sharing: the ability to understand motion can benefit motion generation, and vice versa.
- The simple design of interleaving tokens from two persons effectively preserves the temporal correspondence of dyadic interactions.
Limitations & Future Work¶
- Quantization loss in the motion tokenizer imposes an upper bound on motion quality, as RQ-VAE reconstruction accuracy directly affects final outputs.
- The approach relies on InterGen for synthetic motion generation, inheriting its limitations in quality and diversity.
- Instructions in Inter-MT² are generated by GPT-4o, which may result in insufficient instruction diversity.
- The interleaved token design of RQ-VAE may not generalize well to multi-person (>2) interaction scenarios.
- Model performance in complex interactions, whether contact-heavy (e.g., fighting) or gesture-based (e.g., waving), has not been thoroughly evaluated.
- Integration with physical simulation remains unexplored, and generated interactive motions may be physically implausible.
- Modeling is restricted to SMPL parameter space, excluding hand and facial details.
- Inference speed for real-time interactive applications (e.g., gaming, VR) has not been evaluated.
Related Work & Insights¶
- vs. MotionGPT/MotionLLM: These models handle only single-person motion; VIM extends to dyadic interaction scenarios.
- vs. ReMoS: ReMoS addresses only reaction generation as a single task, whereas VIM unifies multiple tasks.
- vs. InterGen: InterGen supports conditional generation but does not support motion understanding or multi-turn dialogue.
Supplementary Discussion¶
- The core contribution lies in extending motion-language modeling from single-person, unidirectional tasks to a unified, bidirectional treatment of dyadic interaction across multiple tasks, offering a more comprehensive view of interactive motion.
- The experimental design covers a diverse set of tasks, scenarios, and baselines.
- The modular design of the method facilitates extension to related tasks and new datasets.
- Open-sourcing the code and data is of significant value for community reproduction and follow-up research.
- Compared to concurrent work, this paper demonstrates greater depth in problem formulation and more comprehensive experimental analysis.
Rating¶
- Novelty: ⭐⭐⭐⭐ First unified dyadic interactive motion-language model
- Experimental Thoroughness: ⭐⭐⭐⭐ Multi-task evaluation + new dataset
- Writing Quality: ⭐⭐⭐⭐ Clear and well-structured
- Value: ⭐⭐⭐⭐ Opens a new direction for interactive motion modeling