Unified Multi-Modal Interactive & Reactive 3D Motion Generation via Rectified Flow¶
Conference: ICLR 2026 arXiv: 2509.24099 Code: https://gprerit96.github.io/dualflow-page Area: 3D Motion Generation Keywords: Dyadic motion generation, Rectified Flow, Retrieval-Augmented Generation, Contrastive learning, Multi-modal conditioning
TL;DR¶
DualFlow proposes the first unified framework for dyadic interactive/reactive 3D motion generation under text+music multi-modal conditions via Rectified Flow and Retrieval-Augmented Generation (RAG). It introduces contrastive flow matching and synchronization loss, achieving 2.5% FID improvement and 76% R-precision improvement on the MDD dataset, with 2.5× faster inference.
Background & Motivation¶
Background: Dyadic motion generation is essential for VR/AR, game AI, and human-robot collaboration. Existing methods treat interactive (simultaneous two-person generation) and reactive (generating person B's motion conditioned on person A's motion) as separate tasks with incompatible architectures.
Limitations of Prior Work: (1) Interactive and reactive models employ different architectures and training objectives, precluding seamless task switching; (2) existing methods support only single-modal conditioning (text or music) and cannot handle joint conditioning; (3) diffusion-based methods require 50+ denoising steps, resulting in slow inference.
Key Challenge: Dyadic motion requires simultaneously modeling mutual responsiveness between two persons, physical plausibility, and multi-modal signal alignment, yet existing methods lack unified modeling capability.
Goal: How to unify interactive and reactive motion generation within a single architecture while supporting text+music multi-modal conditioning?
Key Insight: Leveraging the straight transport paths of Rectified Flow for fast inference, switching between tasks via symmetric/asymmetric masking mechanisms, and providing semantic guidance through RAG.
Core Idea: Unify task switching via a dual-branch architecture with cascaded DualFlow blocks, combined with a contrastive Rectified Flow objective and an LLM-decomposed RAG module for multi-modal semantic alignment.
Method¶
Overall Architecture¶
Inputs consist of text (encoded via CLIP-L/14), music (encoded via Jukebox), and motion sequences. Twenty cascaded DualFlow blocks process dyadic motion latents. Under the interactive setting, both branches are symmetrically activated; under the reactive setting, only the reactor branch is activated and conditioned on the actor's motion via causal cross-attention.
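The reactive branch's causal cross-attention can be pictured as a simple attention mask: each reactor frame attends to all past actor frames plus a short look-ahead window of future frames (the paper uses \(L = 10\)). A minimal sketch, assuming frame-aligned actor/reactor sequences; the function and parameter names are illustrative, not taken from the released code:

```python
import numpy as np

def causal_lookahead_mask(num_frames: int, lookahead: int = 10) -> np.ndarray:
    """Boolean mask where mask[i, j] is True iff reactor frame i may
    attend to actor frame j: all past frames plus a small look-ahead
    window of future frames."""
    idx = np.arange(num_frames)
    # reactor frame i sees actor frames j <= i + lookahead
    return idx[None, :] <= idx[:, None] + lookahead

mask = causal_lookahead_mask(num_frames=5, lookahead=2)
# frame 0 attends to actor frames 0..2 only
print(mask[0])  # [ True  True  True False False]
```

Setting `lookahead=0` recovers a strictly causal mask; the window trades a small amount of latency for smoother anticipation of the actor's motion.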
Key Designs¶
- Multi-Modal Motion Retrieval (RAG):
  - Function: Provides semantic anchors for dyadic motion generation.
  - Mechanism: GPT-4o decomposes text descriptions into three dimensions: spatial relationships, body actions, and rhythm. Separate CLIP retrieval databases \((D^S, D^B, D^R)\) and a music retrieval database \(D^M\) (Jukebox features) are constructed. The similarity score \(s_i^q = \langle f_i^q, f_p^q \rangle \cdot e^{-\lambda \cdot \frac{|l_i - l_p|}{\max\{l_i, l_p\}}}\) accounts for both semantic similarity and temporal compatibility.
  - Design Motivation: Direct retrieval from raw text overlooks the nuanced dimensions of interactive motion; LLM-based decomposition improves retrieval quality.
- Contrastive Rectified Flow:
  - Function: Enhances semantic alignment within the flow matching framework.
  - Mechanism: The standard flow loss \(\mathcal{L}_{\text{flow}} = \mathbb{E}[\|\mathbf{v}_\theta(\mathbf{x}_t, t, c) - (\mathbf{x}_0 - \epsilon)\|_2^2]\) is augmented with a triplet contrastive loss \(\mathcal{L}_{\text{triplet}} = \mathbb{E}[\max(0, d(\hat{\mathbf{v}}, \mathbf{v}^+) - d(\hat{\mathbf{v}}, \mathbf{v}^-) + m)]\).
  - Contrastive sample construction: Leveraging the hierarchical structure of RAG, positive samples are motions with a similar style or text description, while negative samples are motions with large style differences or low text similarity (the paper uses a 0.6 threshold).
- Synchronization Loss:
  - Function: Enforces spatial relationship consistency between the two persons.
  - Mechanism: Weighted inter-person joint distance loss \(\mathcal{L}_{\text{sync}} = \sum_{j_1,j_2} w_d(j_1,j_2) w_j(j_1,j_2) \|d_p(j_1,j_2) - d_{gt}(j_1,j_2)\|^2\).
  - Distance weights \(w_d\) assign higher importance to joint pairs that are naturally closer; anatomical weights \(w_j\) differentiate body regions such as hands, upper limbs, and lower limbs.
- Task Switching Mechanism:
  - Interactive: Both branches are symmetrically activated; motion cross-attention coordinates the two persons' motions.
  - Reactive: The actor branch is masked; motion cross-attention is replaced by causal cross-attention with a look-ahead window of \(L = 10\).
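The retrieval score above weights semantic similarity by a temporal-compatibility factor that decays with the relative length mismatch between query and database entry. A minimal sketch of the formula, assuming unit-normalized features (names are illustrative):

```python
import numpy as np

def retrieval_score(f_query: np.ndarray, f_entry: np.ndarray,
                    len_query: int, len_entry: int,
                    lam: float = 1.0) -> float:
    """s = <f_q, f_e> * exp(-lam * |l_q - l_e| / max(l_q, l_e)):
    semantic similarity down-weighted by relative length mismatch."""
    semantic = float(np.dot(f_query, f_entry))
    length_penalty = float(np.exp(-lam * abs(len_query - len_entry)
                                  / max(len_query, len_entry)))
    return semantic * length_penalty

f = np.array([1.0, 0.0])
# identical features and identical lengths -> maximal score
print(retrieval_score(f, f, 100, 100))  # 1.0
# identical features but mismatched lengths are down-weighted
print(retrieval_score(f, f, 100, 50) < 1.0)  # True
```

Normalizing the mismatch by \(\max\{l_i, l_p\}\) keeps the penalty scale-free, so a 10-frame gap matters far more for short clips than for long ones.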
Loss & Training¶
\(\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{CRF}} + \lambda_{\text{geo}} \mathcal{L}_{\text{geo}} + \lambda_{\text{inter}} \mathcal{L}_{\text{inter}}\), where \(\mathcal{L}_{\text{CRF}} = \mathcal{L}_{\text{flow}} + \lambda_{\text{triplet}} \mathcal{L}_{\text{triplet}}\), \(\mathcal{L}_{\text{geo}}\) comprises foot contact, joint velocity, and bone length losses, and \(\mathcal{L}_{\text{inter}}\) comprises distance map, relative orientation, and synchronization losses.
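The objective above can be sketched term by term. This is a minimal NumPy illustration of the flow, triplet, and synchronization terms only (the geometric and remaining interaction losses are omitted, and the \(\lambda\) weights here are placeholders, not the paper's values):

```python
import numpy as np

def flow_loss(v_pred, x0, eps):
    # Rectified Flow regresses the straight-path velocity x0 - eps
    return float(np.mean((v_pred - (x0 - eps)) ** 2))

def triplet_loss(v_hat, v_pos, v_neg, margin=0.2):
    # pull the predicted velocity toward the positive, push from the negative
    d_pos = np.linalg.norm(v_hat - v_pos)
    d_neg = np.linalg.norm(v_hat - v_neg)
    return float(max(0.0, d_pos - d_neg + margin))

def sync_loss(d_pred, d_gt, w_d, w_j):
    # weighted squared error over inter-person joint-pair distances
    return float(np.sum(w_d * w_j * (d_pred - d_gt) ** 2))

def total_loss(v_pred, x0, eps, v_pos, v_neg, d_pred, d_gt, w_d, w_j,
               lam_triplet=0.1, lam_sync=1.0):
    crf = flow_loss(v_pred, x0, eps) + lam_triplet * triplet_loss(v_pred, v_pos, v_neg)
    return crf + lam_sync * sync_loss(d_pred, d_gt, w_d, w_j)
```

With a perfect velocity prediction, a well-separated triplet, and matching joint distances, all three terms vanish, which is a quick sanity check when wiring up such a loss.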
Key Experimental Results¶
Main Results on MDD Dataset (Duet Task)¶
| Method | R-Prec@3↑ | FID↓ | MMDist↓ | BAS↑ |
|---|---|---|---|---|
| MDM(Both) | 0.163 | 1.739 | 2.244 | 0.190 |
| InterGen(Both) | 0.302 | 0.426 | 1.532 | 0.185 |
| DualFlow(Both) | 0.513 | 0.415 | 0.513 | 0.200 |
| GT | 0.522 | 0.065 | 0.077 | 0.170 |
Reactive Task¶
On the reactive task, DualFlow(Both) achieves an FID of 0.686 and an MMDist of 1.056, and obtains the best R-Precision@3 among the compared methods; numbers for the DuoLando(Both) baseline are not reproduced in this note.
Key Findings¶
- DualFlow requires only 20 inference steps (vs. the 50 DDIM steps typical of diffusion baselines), yielding a 2.5× speedup.
- Interactive task: FID improved by 2.5%, R-Precision by 76%, and MMDist by roughly 3× (1.532 → 0.513 relative to InterGen).
- Reactive task: FID improved by 1.7%, R-precision improved by 2.5×.
- Ablation studies confirm that both the contrastive loss and the RAG module contribute significantly.
- The synchronization loss effectively improves temporal coordination between the two persons.
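The 20-step inference claim follows from Rectified Flow's near-straight transport paths, which make a coarse Euler discretization sufficient. A toy sketch of the sampler; the constant velocity field below is a stand-in for the conditioned network, chosen because a perfectly straight path lets Euler recover the target exactly:

```python
import numpy as np

def euler_sample(v_theta, x_init, num_steps=20):
    """Integrate dx/dt = v_theta(x, t) from t=0 (noise) to t=1 (data)
    with a fixed-step Euler scheme."""
    x = x_init.copy()
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        x = x + dt * v_theta(x, t)
    return x

# If the true field is the constant straight-path velocity x0 - eps,
# Euler integration from eps lands on x0 regardless of step count.
eps = np.zeros(3)
x0 = np.array([1.0, 2.0, 3.0])
result = euler_sample(lambda x, t: x0 - eps, eps, num_steps=20)
print(np.allclose(result, x0))  # True
```

For curved diffusion trajectories the same step budget would accumulate discretization error, which is why diffusion baselines need 50+ steps while rectified trajectories tolerate 20.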
Highlights & Insights¶
- First unified dyadic motion generation framework: Seamless switching between interactive and reactive tasks via a masking mechanism eliminates the need to maintain two separate systems.
- Novel adaptation of RAG to dyadic scenarios: LLM-based decomposition of text into spatial relationships, body actions, and rhythm is an elegant solution for handling interaction descriptions.
- Practical advantage of Rectified Flow: High-quality results are achievable in just 20 inference steps, making the approach suitable for real-time applications.
Limitations & Future Work¶
- Reliance on GPT-4o for text decomposition introduces additional computational cost and API dependency.
- The current framework supports only dyadic scenarios; extension to multi-person (>2) settings requires architectural modifications.
- Motion quality evaluation relies on automated metrics; perceptual quality assessment requires further user studies.
- Extending DualFlow to fine-grained hand motion generation is a promising direction.
Related Work & Insights¶
- vs. InterGen: This diffusion-based dyadic model requires 50 denoising steps, whereas DualFlow requires only 20 steps while achieving superior performance.
- vs. MDM: Direct extension of single-person diffusion models to dyadic scenarios performs poorly due to the absence of interaction modeling.
- First dyadic RAG: Unlike existing single-person motion RAG methods, DualFlow introduces an interaction-aware retrieval mechanism.
Rating¶
- Novelty: ⭐⭐⭐⭐ The combination of unified interactive/reactive generation, multi-modal conditioning, and dyadic RAG is unprecedented.
- Experimental Thoroughness: ⭐⭐⭐⭐ Three datasets, multiple settings, and thorough ablation studies.
- Writing Quality: ⭐⭐⭐⭐ Well-structured, though some sections are overly dense.
- Value: ⭐⭐⭐⭐ Represents a clear advancement for the dyadic motion generation field.