ReMoGen: Real-time Human Interaction-to-Reaction Generation via Modular Learning from Diverse Data¶
Conference: CVPR 2026 arXiv: 2604.01082 Code: None Area: Human Motion Generation / Human Understanding / Interactive Motion Generation Keywords: Interaction-to-reaction generation, modular learning, motion prior, real-time generation, human-human/human-scene interaction
TL;DR¶
This paper proposes ReMoGen, a modular framework for real-time human interaction-to-reaction motion generation. It learns a general motion prior from large-scale single-person motion data (frozen during downstream training), adapts to different interaction domains (human-human/human-scene) via independently trained Meta-Interaction modules, and achieves per-frame low-latency online updates (0.047 s/frame) through Frame-wise Segment Refinement. ReMoGen comprehensively surpasses state-of-the-art methods on the Inter-X and LINGO benchmarks.
Background & Motivation¶
- Background: Human motion generation has evolved from text-driven single-person synthesis to multi-agent interaction scenarios. Existing approaches include: text-to-motion methods (T2M, MotionDiffuse) that generate isolated motions only; human-scene interaction methods (TRUMANS, LINGO) that introduce spatial awareness but are limited to single agents; and human-human interaction methods (ReGenNet, FreeMotion) that attempt joint generation but operate predominantly in offline mode.
- Limitations of Prior Work:
- Data scarcity and heterogeneity: Single-person motion data is abundant (HumanML3D), whereas human-human interaction (Inter-X) and human-scene interaction (LINGO) datasets are scarce and exhibit large distributional discrepancies, causing end-to-end models trained on a single domain to overfit.
- Real-time responsiveness: Diffusion models yield high quality but incur large latency incompatible with real-time use; autoregressive models are fast but suffer from error accumulation and drift.
  - Online applicability: Most existing methods assume full observation of the counterpart's complete trajectory, which is infeasible in practical online interaction settings.
- Key Challenge: How to simultaneously achieve high-fidelity and low-latency interaction-to-reaction generation under data-scarce conditions.
- Goal: (1) Efficient knowledge transfer across heterogeneous interaction domains; (2) Real-time responsiveness without sacrificing motion quality.
- Key Insight: Decouple general motion prior learning from interaction-specific adaptation — freeze a backbone pretrained on large-scale single-person data and inject interaction awareness via lightweight modules.
- Core Idea: Prior-guided modular learning + frame-level intra-segment refinement; the former addresses data heterogeneity and the latter addresses real-time requirements.
Method¶
Overall Architecture¶
The inputs are textual intent, observed motions of other agents, and scene context. ReMoGen comprises three components: (1) a frozen text-conditioned single-person motion prior (a VAE and latent diffusion model pretrained on HumanML3D); (2) Meta-Interaction modules (independently trained adapters for HHI and HSI, respectively); and (3) Frame-wise Segment Refinement (FWSR), a lightweight per-frame correction module. Motion is generated via segment-level autoregression: conditioned on a history window \(M_h^i \in \mathbb{R}^{H \times D}\) and text \(W\), the model predicts a future segment \(\hat{M}_f^i \in \mathbb{R}^{F \times D}\) (\(H=2\) history frames, \(F=8\) future frames).
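The segment-level autoregression described above can be sketched as a simple rolling-buffer loop. This is a minimal numpy illustration under stated assumptions: `generate_segment` is a placeholder for the frozen prior plus adapters (names and the feature dimension `D` are illustrative, not from the paper's unreleased code).

```python
import numpy as np

H, F, D = 2, 8, 64  # history frames, future frames, feature dim (D is illustrative)

def generate_segment(history, text_emb):
    """Placeholder for the frozen prior + adapters: given an H-frame history
    window and a text embedding, predict an F-frame future segment."""
    return np.zeros((F, D))  # a real model would denoise a latent and decode here

def rollout(num_segments, text_emb=None):
    """Segment-level autoregression: each predicted segment's last H frames
    become the history window conditioning the next segment."""
    history = np.zeros((H, D))  # zero-initialized history before motion starts
    segments = []
    for _ in range(num_segments):
        seg = generate_segment(history, text_emb)
        segments.append(seg)
        history = seg[-H:]  # roll the history buffer forward
    return np.concatenate(segments, axis=0)
```

With 3 segments the rollout yields a 24-frame motion, i.e. the buffer update and concatenation are shape-consistent.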
Key Designs¶
- Universal Motion Prior:
  - Function: Provides a strong generative foundation encoding basic kinematic structure, temporal dynamics, and language-motion correspondences.
  - Mechanism: Built on the DART architecture: a Transformer VAE encoder-decoder compresses motion segments into a latent space, and a conditional diffusion model generates within that latent space. The encoder maps a motion segment to a latent representation \(z\); the denoiser \(G_\psi\) iteratively denoises under the text embedding \(w\), \(\hat{z}_0 = G_\psi(z_t, t, M_h^i, w)\); the decoder reconstructs the motion. Generation uses 10 diffusion steps at 10 FPS.
  - Design Motivation: The motion prior learned from large-scale single-person data is already highly expressive. Joint fine-tuning destroys this knowledge (experiments confirm that it degrades motion quality), making a frozen prior a critical design choice.
- Meta-Interaction Module:
  - Function: Injects interaction awareness into the frozen motion prior.
  - Mechanism: Two independent encoders process interaction cues separately: an Others Encoder (TCN-based) extracts relative velocity, approach direction, and spatial relationships, while a Scene Encoder (ViT-based) summarizes surrounding geometry and functional space. Cues are injected via a Meta-Interaction Block: ego features first undergo self-attention to obtain \(h'\); cross-attention over the interaction cues then extracts interaction signals, which are transformed into FiLM-style affine parameters \((\gamma, \beta)\) and applied as \(h_{mod} = (1 + \tanh\gamma) \odot h' + \tanh\beta\).
  - Design Motivation: Each module is trained independently on its respective domain (HHI on Inter-X, HSI on LINGO) for 65k iterations, avoiding the difficulties of joint training on heterogeneous data. At inference, effects from multiple modules are composited as \(\Delta_{total} = \sum_i \alpha_i \Delta_i\) (with L2-norm clamping), enabling flexible mixing.
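The FiLM-style modulation at the heart of the Meta-Interaction Block is a one-liner. In this sketch the self-attention and cross-attention steps are abstracted away: \((\gamma, \beta)\) are assumed to arrive already computed from the interaction cues.

```python
import numpy as np

def film_modulate(h_prime, gamma, beta):
    """FiLM-style affine modulation from the Meta-Interaction Block:
    h_mod = (1 + tanh(gamma)) * h' + tanh(beta).
    The tanh bounds keep the scale near 1 and the shift near 0."""
    return (1.0 + np.tanh(gamma)) * h_prime + np.tanh(beta)
```

A useful property of this parameterization: with zero-initialized \((\gamma, \beta)\) the block is exactly the identity, so a freshly attached adapter does not perturb the frozen prior's output at the start of training.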
- Frame-wise Segment Refinement (FWSR):
  - Function: Provides per-frame low-latency reactive updates on top of segment-level generation.
  - Mechanism: Standard segment-level autoregression faces a latency-quality trade-off: longer segments improve quality but slow updates, while shorter segments improve responsiveness but introduce jitter. At each frame within a segment, FWSR refines the initial segment latent \(z_0\) with a lightweight Meta-Interaction Block, \(\hat{z}^f = \text{Modulate}(z_0, \text{concat}(M_h^{(f-1)}, X_{dyn}^{(f)}))\), incorporating the latest observed interaction cues. Only the prediction at the corresponding frame position is retained, and the history buffer is updated before the next frame is processed.
  - Design Motivation: The large backbone provides stable long-term dynamics, while the lightweight adapter enables fast fine-grained reactivity. FWSR is trained independently (with the prior and Meta-Interaction modules frozen), ensuring it acts as a stable local adapter without altering the global motion structure.
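The per-frame refinement loop can be sketched as below. This is a structural illustration only: `fwsr_refine` and `decode` are identity/zero placeholders standing in for the lightweight Meta-Interaction Block and the VAE decoder, and all dimensions are assumed.

```python
import numpy as np

F, D = 8, 64  # frames per segment and motion feature dim (illustrative)

def fwsr_refine(z0, history, cues):
    """Placeholder for the lightweight refinement block that adjusts the
    segment latent z0 using the freshest observations; identity here."""
    return z0

def decode(z):
    """Placeholder VAE decoder: latent -> (F, D) motion segment."""
    return np.zeros((F, D))

def fwsr_rollout(z0, get_cues, history):
    """Within one segment: refine z0 every frame with the latest cues,
    keep only the frame at position f, then roll the history buffer."""
    frames = []
    for f in range(F):
        z_f = fwsr_refine(z0, history, get_cues(f))  # inject newest cues
        seg = decode(z_f)
        frames.append(seg[f])                        # retain frame f only
        history = np.vstack([history[1:], seg[f]])   # update history buffer
    return np.stack(frames)
```

The key structural point the sketch captures: the expensive segment latent is computed once, while only the cheap refinement and decode run per frame, which is why the added latency is small.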
Loss & Training¶
- Stage-wise training: Pretraining the prior on HumanML3D → independent training of each Meta-Interaction module for 65k iterations (prior frozen) → training FWSR for 65k steps (prior and Meta-Interaction frozen).
- Training objectives: Reconstruction loss \(L_{rec}\), latent space loss \(L_{latent}\), and auxiliary temporal increment loss \(L_{aux}\).
- Optimizer: AdamW (lr=1e-4), batch size 1024, gradient clipping 1.0, EMA 0.999.
- Training is feasible on a single NVIDIA RTX 3090.
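Two of the listed training details, gradient clipping at 1.0 and EMA with decay 0.999, can be sketched in numpy; the function names are illustrative, not from the paper.

```python
import numpy as np

def clip_grad_norm(grads, max_norm=1.0):
    """Global-norm gradient clipping: rescale all gradients so their joint
    L2 norm does not exceed max_norm (the paper uses 1.0)."""
    total = np.sqrt(sum(float(np.sum(g * g)) for g in grads))
    scale = max_norm / total if total > max_norm else 1.0
    return [g * scale for g in grads]

def ema_update(ema, params, decay=0.999):
    """Exponential moving average of weights with decay 0.999, updated
    once per optimizer step."""
    return [decay * e + (1.0 - decay) * p for e, p in zip(ema, params)]
```

In a real setup these would be the numpy analogues of `torch.nn.utils.clip_grad_norm_` and a standard EMA callback around an AdamW (lr=1e-4) step.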
Key Experimental Results¶
Main Results¶
Human-Human Interaction (Inter-X)
| Method | FID ↓ | R-Prec. (Top3) ↑ | MM Dist. ↓ | Latency (s/frame) ↓ |
|---|---|---|---|---|
| ReGenNet | 11.622 | 0.269 | 6.092 | 0.210 |
| FreeMotion | 3.383 | 0.284 | 5.438 | 0.221 |
| SymBridge | 2.569 | 0.355 | 4.955 | 0.040 |
| Ours | 0.181 | 0.464 | 4.076 | 0.042 |
| Ours+FWSR | 0.166 | 0.462 | 4.076 | 0.047 |
Human-Scene Interaction (LINGO)
| Method | FID ↓ | R-Prec. (Top3) ↑ | MM Dist. ↓ | Latency ↓ |
|---|---|---|---|---|
| TRUMANS | 4.731 | 0.178 | 10.822 | 0.074 |
| LINGO | 3.633 | 0.218 | 9.597 | 0.189 |
| Ours | 1.201 | 0.530 | 3.408 | 0.042 |
Ablation Study¶
Ablation on Motion Prior Usage (Inter-X)
| Configuration | FID ↓ | R-Prec. ↑ | MM Dist. ↓ |
|---|---|---|---|
| Prior Only | 3.735 | 0.231 | 5.736 |
| No Prior (from scratch) | 0.270 | 0.412 | 4.385 |
| Joint-Finetune | 0.298 | 0.439 | 4.188 |
| Ours (Frozen Prior + Module) | 0.181 | 0.464 | 4.076 |
FWSR Ablation
| Configuration | FID ↓ | Latency ↓ |
|---|---|---|
| Segment-level autoregression (Seg.) | 0.181 | 0.042 |
| Frame-level sliding window (Slide) | 4.136 | 0.305 |
| Segment + FWSR | 0.166 | 0.047 |
Key Findings¶
- Frozen prior + modular adaptation substantially outperforms joint fine-tuning: FID 0.181 vs. 0.298; joint fine-tuning erodes the pretrained kinematic knowledge.
- FWSR achieves significant quality improvement (FID 0.181→0.166) at negligible additional latency (0.042→0.047 s/frame).
- Zero-shot compositional generalization: Directly combining HHI and HSI modules on EgoBody (\(\alpha_{HHI}=\alpha_{HSI}=0.5\)) without retraining already outperforms zero-shot single-module inference, though it falls short of fine-tuning.
- Prior initialization followed by only 2k–10k fine-tuning steps surpasses training from scratch for 500k steps (EgoBody), demonstrating the strong transfer efficiency of the pretrained prior.
- The real-time threshold of 0.1 s/frame is comfortably satisfied (0.042–0.047 s/frame), whereas ReGenNet and FreeMotion both fail to meet it.
Highlights & Insights¶
- The modular decoupling philosophy is particularly elegant: the prior supplies fundamental motion capability, the Meta-Interaction module provides interaction awareness, and FWSR provides real-time responsiveness — three orthogonal components that can be optimized independently. This design paradigm is transferable to any generative task requiring adaptation under data-scarce conditions.
- The FiLM-modulated Meta-Interaction Block establishes a sound adapter design pattern for motion generation — injecting conditional signals via feature-level affine transformations without modifying original model parameters.
- Compositional inference (weighted combination of multiple modules) naturally extends the framework to mixed interaction scenarios without retraining.
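The compositional inference rule \(\Delta_{total} = \sum_i \alpha_i \Delta_i\) with L2-norm clamping is simple enough to sketch directly; the clamp threshold `max_norm=1.0` is an assumption, as the paper's value is not stated here.

```python
import numpy as np

def compose_deltas(deltas, alphas, max_norm=1.0):
    """Weighted combination of per-module feature offsets,
    Delta_total = sum_i alpha_i * Delta_i, clamped to an L2 norm budget
    so stacked modules cannot jointly overwhelm the frozen prior."""
    total = sum(a * d for a, d in zip(alphas, deltas))
    norm = float(np.linalg.norm(total))
    if norm > max_norm:
        total = total * (max_norm / norm)
    return total
```

With \(\alpha_{HHI}=\alpha_{HSI}=0.5\) this reduces to averaging the two modules' offsets before clamping, matching the zero-shot EgoBody setting described above.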
Limitations & Future Work¶
- Currently limited to dyadic interaction and simple scene interaction; multi-person group interaction is not addressed.
- The combination weights \(\alpha_i\) for Meta-Interaction modules must be set manually; adaptive learning of these weights warrants exploration.
- Scene encoding uses voxelized 3D occupancy representations, which may be insufficient for fine-grained object interactions (e.g., manipulating objects on a table).
- Fine-grained hand interactions (e.g., handshakes, object handovers) are not handled.
- FWSR's per-frame correction may exhibit insufficient responsiveness under abrupt motion changes.
Related Work & Insights¶
- vs. SymBridge: Both target real-time interaction; SymBridge focuses on human-robot interaction but achieves higher FID (2.569 vs. 0.181). ReMoGen substantially improves quality through a stronger motion prior.
- vs. FreeMotion: The offline version of FreeMotion achieves FID of only 0.492, but its latency is unacceptable. ReMoGen attains superior FID (0.166) under real-time constraints.
- vs. ControlNet/LoRA paradigm: ReMoGen successfully transfers the adapter design pattern from image generation to motion generation; FiLM modulation replaces the additive injection used in ControlNet.
Rating¶
- Novelty: ⭐⭐⭐⭐ The three-tier design of prior freezing + modular adaptation + frame-level refinement is clear and innovative, though individual components (FiLM, segment-level autoregression) are established techniques.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Covers HHI, HSI, and mixed scenarios; provides detailed ablations on prior usage strategy and FWSR effectiveness; includes EgoBody transfer experiments — comprehensive overall.
- Writing Quality: ⭐⭐⭐⭐ Problem formulation is clear; the four research questions effectively organize the experimental section.
- Value: ⭐⭐⭐⭐ Real-time interaction-to-reaction generation is an important yet underexplored problem, and the proposed framework offers a scalable solution.