Unified Multi-Modal Interactive & Reactive 3D Motion Generation via Rectified Flow¶
Conference: ICLR 2026 arXiv: 2509.24099 Code: https://gprerit96.github.io/dualflow-page Area: 3D Motion Generation Keywords: Dyadic motion generation, Rectified Flow, Retrieval-Augmented Generation, Contrastive learning, Multi-modal conditioning
TL;DR¶
DualFlow proposes the first unified framework for dyadic interactive/reactive 3D motion generation under text+music multi-modal conditions via Rectified Flow and Retrieval-Augmented Generation (RAG). It introduces contrastive flow matching and synchronization loss, achieving 2.5% FID improvement and 76% R-precision improvement on the MDD dataset, with 2.5× faster inference.
Background & Motivation¶
Background: Dyadic motion generation is essential for VR/AR, game AI, and human-robot collaboration. Existing methods treat interactive (simultaneous two-person generation) and reactive (generating person B's motion conditioned on person A's motion) as separate tasks with incompatible architectures.
Limitations of Prior Work: (1) Interactive and reactive models employ different architectures and training objectives, precluding seamless task switching; (2) existing methods support only single-modal conditioning (text or music) and cannot handle joint conditioning; (3) diffusion-based methods require 50+ denoising steps, resulting in slow inference.
Key Challenge: Dyadic motion requires simultaneously modeling mutual responsiveness between two persons, physical plausibility, and multi-modal signal alignment, yet existing methods lack unified modeling capability.
Goal: How to unify interactive and reactive motion generation within a single architecture while supporting text+music multi-modal conditioning?
Key Insight: Leveraging the straight transport paths of Rectified Flow for fast inference, switching between tasks via symmetric/asymmetric masking mechanisms, and providing semantic guidance through RAG.
Core Idea: Unify task switching via a dual-branch architecture with cascaded DualFlow blocks, combined with a contrastive Rectified Flow objective and an LLM-decomposed RAG module for multi-modal semantic alignment.
Method¶
Overall Architecture¶
Inputs consist of text (encoded via CLIP-L/14), music (encoded via Jukebox), and motion sequences. Twenty cascaded DualFlow blocks process dyadic motion latents. Under the interactive setting, both branches are symmetrically activated; under the reactive setting, only the reactor branch is activated and conditioned on the actor's motion via causal cross-attention.
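The reactive branch's causal cross-attention can be pictured as a simple attention mask: each reactor frame attends to all past actor frames plus a short look-ahead window of future frames (the paper uses \(L = 10\)). A minimal sketch, assuming frame-aligned actor/reactor sequences; the function and parameter names are illustrative, not taken from the released code:

```python
import numpy as np

def causal_lookahead_mask(num_frames: int, lookahead: int = 10) -> np.ndarray:
    """Boolean mask where mask[i, j] is True iff reactor frame i may
    attend to actor frame j: all past frames plus a small look-ahead
    window of future frames."""
    idx = np.arange(num_frames)
    # reactor frame i sees actor frames j <= i + lookahead
    return idx[None, :] <= idx[:, None] + lookahead

mask = causal_lookahead_mask(num_frames=5, lookahead=2)
# frame 0 attends to actor frames 0..2 only
print(mask[0])  # [ True  True  True False False]
```

Setting `lookahead=0` recovers a strictly causal mask; the window trades a small amount of latency for smoother anticipation of the actor's motion.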
Key Designs¶
- Multi-Modal Motion Retrieval (RAG):
  - Function: Provides semantic anchors for dyadic motion generation.
  - Mechanism: GPT-4o decomposes text descriptions into three dimensions: spatial relationships, body actions, and rhythm. Separate CLIP retrieval databases \((D^S, D^B, D^R)\) and a music retrieval database \(D^M\) (Jukebox features) are constructed. The similarity score \(s_i^q = \langle f_i^q, f_p^q \rangle \cdot e^{-\lambda \cdot \frac{|l_i - l_p|}{\max\{l_i, l_p\}}}\) accounts for both semantic similarity and temporal compatibility.
  - Design Motivation: Direct retrieval from raw text overlooks the nuanced dimensions of interactive motion; LLM-based decomposition improves retrieval quality.
- Contrastive Rectified Flow:
  - Function: Enhances semantic alignment within the flow matching framework.
  - Mechanism: The standard flow loss \(\mathcal{L}_{\text{flow}} = \mathbb{E}[\|\mathbf{v}_\theta(\mathbf{x}_t, t, c) - (\mathbf{x}_0 - \epsilon)\|_2^2]\) is augmented with a triplet contrastive loss \(\mathcal{L}_{\text{triplet}} = \mathbb{E}[\max(0, d(\hat{\mathbf{v}}, \mathbf{v}^+) - d(\hat{\mathbf{v}}, \mathbf{v}^-) + m)]\).
  - Contrastive sample construction: Leveraging the hierarchical structure of RAG, positive samples are motions with a similar style or text description, while negative samples are motions with large style differences or low text similarity (the paper uses a 0.6 threshold).
- Synchronization Loss:
  - Function: Enforces spatial relationship consistency between the two persons.
  - Mechanism: Weighted inter-person joint distance loss \(\mathcal{L}_{\text{sync}} = \sum_{j_1,j_2} w_d(j_1,j_2) w_j(j_1,j_2) \|d_p(j_1,j_2) - d_{gt}(j_1,j_2)\|^2\).
  - Distance weights \(w_d\) assign higher importance to joint pairs that are naturally closer; anatomical weights \(w_j\) differentiate body regions such as hands, upper limbs, and lower limbs.
- Task Switching Mechanism:
  - Interactive: Both branches are symmetrically activated; motion cross-attention coordinates the two persons' motions.
  - Reactive: The actor branch is masked; motion cross-attention is replaced by causal cross-attention with a look-ahead window of \(L = 10\).
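The retrieval score above weights semantic similarity by a temporal-compatibility factor that decays with the relative length mismatch between query and database entry. A minimal sketch of the formula, assuming unit-normalized features (names are illustrative):

```python
import numpy as np

def retrieval_score(f_query: np.ndarray, f_entry: np.ndarray,
                    len_query: int, len_entry: int,
                    lam: float = 1.0) -> float:
    """s = <f_q, f_e> * exp(-lam * |l_q - l_e| / max(l_q, l_e)):
    semantic similarity down-weighted by relative length mismatch."""
    semantic = float(np.dot(f_query, f_entry))
    length_penalty = float(np.exp(-lam * abs(len_query - len_entry)
                                  / max(len_query, len_entry)))
    return semantic * length_penalty

f = np.array([1.0, 0.0])
# identical features and identical lengths -> maximal score
print(retrieval_score(f, f, 100, 100))  # 1.0
# identical features but mismatched lengths are down-weighted
print(retrieval_score(f, f, 100, 50) < 1.0)  # True
```

Normalizing the mismatch by \(\max\{l_i, l_p\}\) keeps the penalty scale-free, so a 10-frame gap matters far more for short clips than for long ones.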
Loss & Training¶
\(\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{CRF}} + \lambda_{\text{geo}} \mathcal{L}_{\text{geo}} + \lambda_{\text{inter}} \mathcal{L}_{\text{inter}}\), where \(\mathcal{L}_{\text{CRF}} = \mathcal{L}_{\text{flow}} + \lambda_{\text{triplet}} \mathcal{L}_{\text{triplet}}\), \(\mathcal{L}_{\text{geo}}\) comprises foot contact, joint velocity, and bone length losses, and \(\mathcal{L}_{\text{inter}}\) comprises distance map, relative orientation, and synchronization losses.
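The objective above can be sketched term by term. This is a minimal NumPy illustration of the flow, triplet, and synchronization terms only (the geometric and remaining interaction losses are omitted, and the \(\lambda\) weights here are placeholders, not the paper's values):

```python
import numpy as np

def flow_loss(v_pred, x0, eps):
    # Rectified Flow regresses the straight-path velocity x0 - eps
    return float(np.mean((v_pred - (x0 - eps)) ** 2))

def triplet_loss(v_hat, v_pos, v_neg, margin=0.2):
    # pull the predicted velocity toward the positive, push from the negative
    d_pos = np.linalg.norm(v_hat - v_pos)
    d_neg = np.linalg.norm(v_hat - v_neg)
    return float(max(0.0, d_pos - d_neg + margin))

def sync_loss(d_pred, d_gt, w_d, w_j):
    # weighted squared error over inter-person joint-pair distances
    return float(np.sum(w_d * w_j * (d_pred - d_gt) ** 2))

def total_loss(v_pred, x0, eps, v_pos, v_neg, d_pred, d_gt, w_d, w_j,
               lam_triplet=0.1, lam_sync=1.0):
    crf = flow_loss(v_pred, x0, eps) + lam_triplet * triplet_loss(v_pred, v_pos, v_neg)
    return crf + lam_sync * sync_loss(d_pred, d_gt, w_d, w_j)
```

With a perfect velocity prediction, a well-separated triplet, and matching joint distances, all three terms vanish, which is a quick sanity check when wiring up such a loss.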
Key Experimental Results¶
Main Results on MDD Dataset (Duet Task)¶
| Method | R-Prec@3↑ | FID↓ | MMDist↓ | BAS↑ |
|---|---|---|---|---|
| MDM(Both) | 0.163 | 1.739 | 2.244 | 0.190 |
| InterGen(Both) | 0.302 | 0.426 | 1.532 | 0.185 |
| DualFlow(Both) | 0.513 | 0.415 | 0.513 | 0.200 |
| GT | 0.522 | 0.065 | 0.077 | 0.170 |
Reactive Task¶
On the reactive task, DualFlow(Both) achieves an FID of 0.686 and an MMDist of 1.056, and obtains the best R-Precision@3 among the compared methods; numbers for the DuoLando(Both) baseline are not reproduced in this note.
Key Findings¶
- DualFlow requires only 20 inference steps (vs. the 50 DDIM steps typical of diffusion baselines), yielding a 2.5× speedup.
- Interactive task: FID improved by 2.5%, R-Precision by 76%, and MMDist by roughly 3× (1.532 → 0.513 relative to InterGen).
- Reactive task: FID improved by 1.7%, R-precision improved by 2.5×.
- Ablation studies confirm that both the contrastive loss and the RAG module contribute significantly.
- The synchronization loss effectively improves temporal coordination between the two persons.
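The 20-step inference claim follows from Rectified Flow's near-straight transport paths, which make a coarse Euler discretization sufficient. A toy sketch of the sampler; the constant velocity field below is a stand-in for the conditioned network, chosen because a perfectly straight path lets Euler recover the target exactly:

```python
import numpy as np

def euler_sample(v_theta, x_init, num_steps=20):
    """Integrate dx/dt = v_theta(x, t) from t=0 (noise) to t=1 (data)
    with a fixed-step Euler scheme."""
    x = x_init.copy()
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        x = x + dt * v_theta(x, t)
    return x

# If the true field is the constant straight-path velocity x0 - eps,
# Euler integration from eps lands on x0 regardless of step count.
eps = np.zeros(3)
x0 = np.array([1.0, 2.0, 3.0])
result = euler_sample(lambda x, t: x0 - eps, eps, num_steps=20)
print(np.allclose(result, x0))  # True
```

For curved diffusion trajectories the same step budget would accumulate discretization error, which is why diffusion baselines need 50+ steps while rectified trajectories tolerate 20.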
Highlights & Insights¶
- First unified dyadic motion generation framework: Seamless switching between interactive and reactive tasks via a masking mechanism eliminates the need to maintain two separate systems.
- Novel adaptation of RAG to dyadic scenarios: LLM-based decomposition of text into spatial relationships, body actions, and rhythm is an elegant solution for handling interaction descriptions.
- Practical advantage of Rectified Flow: High-quality results are achievable in just 20 inference steps, making the approach suitable for real-time applications.
Limitations & Future Work¶
- Reliance on GPT-4o for text decomposition introduces additional computational cost and API dependency.
- The current framework supports only dyadic scenarios; extension to multi-person (>2) settings requires architectural modifications.
- Motion quality evaluation relies on automated metrics; perceptual quality assessment requires further user studies.
- Extending DualFlow to fine-grained hand motion generation is a promising direction.
Related Work & Insights¶
- vs. InterGen: This diffusion-based dyadic model requires 50 denoising steps, whereas DualFlow requires only 20 steps while achieving superior performance.
- vs. MDM: Direct extension of single-person diffusion models to dyadic scenarios performs poorly due to the absence of interaction modeling.
- First dyadic RAG: Unlike existing single-person motion RAG methods, DualFlow introduces an interaction-aware retrieval mechanism.
Rating¶
- Novelty: ⭐⭐⭐⭐ The combination of unified interactive/reactive generation, multi-modal conditioning, and dyadic RAG is unprecedented.
- Experimental Thoroughness: ⭐⭐⭐⭐ Three datasets, multiple settings, and thorough ablation studies.
- Writing Quality: ⭐⭐⭐⭐ Well-structured, though some sections are overly dense.
- Value: ⭐⭐⭐⭐ Represents a clear advancement for the dyadic motion generation field.