# HOI-Dyn: Learning Interaction Dynamics for Human-Object Motion Diffusion
Conference: NeurIPS 2025 | arXiv: 2507.01737 | Code: Available | Area: Human Understanding | Keywords: Human-Object Interaction, Motion Diffusion, Interaction Dynamics, Driver-Responder, Transformer
## TL;DR
This paper models human-object interaction (HOI) generation as a Driver-Responder system, employing a lightweight Transformer-based interaction dynamics model to explicitly predict how objects respond to human actions. A residual dynamics loss is introduced during training to enforce causal consistency, while inference efficiency is preserved.
## Background & Motivation
Generating realistic 3D human-object interactions (HOI) is a critical problem in VR/AR, computer animation, and robotics. Prior work exhibits the following limitations:
Independent modeling: Most methods treat human and object motion separately, resulting in physically implausible and causally inconsistent behaviors.
Contact modeling difficulty: Some approaches focus on object affordance or contact point prediction, but accurately modeling contact regions is inherently challenging.
Lack of causal structure: Existing diffusion models can generate globally plausible sequences but fail to capture fine-grained dynamics governing how objects respond to human actions.
The authors identify a key observation: HOI is fundamentally an asymmetric system — human motion follows internally driven dynamics (autonomous motion), whereas object motion is externally driven (cannot occur spontaneously). This asymmetry naturally motivates the Driver-Responder modeling paradigm.
## Method
### Overall Architecture
HOI-Dyn consists of two core components:
- Conditional Motion Diffusion: A Transformer-based conditional diffusion model that jointly encodes human, object, and interaction context.
- Interaction Dynamics: An auxiliary supervision module that enforces fine-grained causal consistency during training.
The central design principle is the Driver-Responder formulation, in which the object state is updated by a dynamics function of the current state, a control signal, and the interaction context:

$$o^{(t+1)} = \mathcal{D}\big(o^{(t)},\, u^{(t)},\, s^{(t)}\big)$$

where \(u^{(t)}\) is the control signal (error feedback based on human intent and object behavior), and \(s^{(t)}\) is the interaction context (contact state, object geometry, etc.).
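The driver-responder asymmetry can be illustrated with a toy loop (the dynamics and names below are ours, not the paper's): the human state advances autonomously each step, while the object moves only when a non-zero control signal arises from contact.

```python
import numpy as np

# Toy driver-responder loop (illustrative, not the paper's code):
# the human (driver) advances autonomously each step; the object
# (responder) moves only under a non-zero control signal u_t.

def responder_step(obj_pos, u_t):
    """Object state changes only under an external control signal."""
    return obj_pos + u_t  # no contact -> u_t == 0 -> object stays put

human = np.zeros(3)
obj = np.array([1.0, 0.0, 0.0])

for t in range(5):
    human = human + np.array([0.3, 0.0, 0.0])       # autonomous driver motion
    in_contact = np.linalg.norm(human - obj) < 0.2  # crude proximity "contact"
    u_t = (human - obj) if in_contact else np.zeros(3)
    obj = responder_step(obj, u_t)

# the object moved only during the single contact step
assert abs(obj[0] - 0.9) < 1e-9
```

Without contact the object never moves, which is exactly the implicit contact handling described below: no explicit contact supervision is needed for the object to stay put.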
### Key Designs
#### 1. Interaction Dynamics Model
Core idea: the relative motion of an object can be predicted from the relative motion of the human body. With \(\mathcal{D}\) denoting the dynamics model, a one-step prediction takes the form

$$\Delta \hat{o}_{t \to t+1} = \mathcal{D}\big(\Delta h_{t \to t+1},\, s^{(t)}\big)$$

To improve sensitivity to interactions of varying magnitude, the prediction horizon is extended from 1 to \(k\), where \(k\) is sampled uniformly from \([1, K]\):

$$\Delta \hat{o}_{t \to t+k} = \mathcal{D}\big(\Delta h_{t \to t+k},\, s^{(t)}\big), \qquad k \sim \mathcal{U}(1, K)$$
Predictions are parameterized as rigid-body transformations (rotation \(\hat{\mathcal{R}} \in SO(3)\) and translation \(\hat{\mathcal{T}} \in \mathbb{R}^3\)), with SVD projection applied to ensure valid rotation matrices.
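The SVD projection step can be sketched as follows; this is the standard orthogonal projection of an arbitrary \(3 \times 3\) matrix onto \(SO(3)\), with our own variable names:

```python
import numpy as np

# Project an arbitrary 3x3 prediction onto SO(3) via SVD so the output
# is a valid rotation matrix (orthonormal, det = +1).

def project_to_so3(m):
    u, _, vt = np.linalg.svd(m)
    d = np.sign(np.linalg.det(u @ vt))   # fix reflections so det(R) = +1
    return u @ np.diag([1.0, 1.0, d]) @ vt

# a slightly noisy "predicted rotation" that is not exactly orthonormal
noisy = np.eye(3) + 0.1 * np.random.default_rng(0).standard_normal((3, 3))
r = project_to_so3(noisy)

assert np.allclose(r @ r.T, np.eye(3))      # orthonormal
assert np.isclose(np.linalg.det(r), 1.0)    # proper rotation, not a reflection
```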
#### 2. Object Dynamics Cost Function
The cost \(\Phi\) is based on keypoint transformation error: the predicted and ground-truth rigid transforms are applied to a set of object keypoints \(\mathcal{P}\), and the per-point discrepancy is averaged:

$$\Phi(\Delta o, \Delta o^*) = \frac{1}{|\mathcal{P}|} \sum_{p \in \mathcal{P}} \big\| (\hat{\mathcal{R}} p + \hat{\mathcal{T}}) - (\mathcal{R}^* p + \mathcal{T}^*) \big\|_2$$
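A keypoint-based transformation error of this kind can be sketched as follows (our notation; a hedged reading of the cost, not the authors' exact implementation):

```python
import numpy as np

# Keypoint transformation error: apply predicted and ground-truth rigid
# transforms (R, T) to a fixed set of object keypoints and average the
# per-point Euclidean distance. Names are illustrative.

def keypoint_cost(R_hat, T_hat, R_star, T_star, keypoints):
    pred = keypoints @ R_hat.T + T_hat    # predicted keypoint positions
    gt = keypoints @ R_star.T + T_star    # ground-truth keypoint positions
    return np.mean(np.linalg.norm(pred - gt, axis=1))

kps = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
I = np.eye(3)

# identical transforms -> zero cost; a pure translation offset -> its norm
assert keypoint_cost(I, np.zeros(3), I, np.zeros(3), kps) == 0.0
assert np.isclose(keypoint_cost(I, np.array([1.0, 0, 0]), I, np.zeros(3), kps), 1.0)
```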
#### 3. Implicit Contact Handling
An elegant design choice: explicit contact modeling is not required. In the absence of contact, no response is generated; when contact occurs, the object response is naturally determined by the interaction dynamics.
#### 4. Network Architecture
The interaction dynamics model employs a lightweight Transformer (only 0.5M parameters), taking as input the current object state, interaction context, and accumulated human motion, and outputting rigid-body transformations for object motion. A coupled design (joint prediction of rotation and translation) outperforms a decoupled design.
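The "D4-F64-H8" style notation used in the ablation table can be captured in a small configuration object; we read it as depth / feed-forward width / attention heads, and the field names here are our guesses, not the authors' code:

```python
from dataclasses import dataclass

# Hypothetical configuration mirroring the ablation table's notation.
# Defaults correspond to the best-performing "Coupled (K=2) D4-F64-H8".

@dataclass
class DynamicsModelConfig:
    depth: int = 4              # Transformer layers (D4)
    ff_dim: int = 64            # feed-forward hidden width (F64)
    n_heads: int = 8            # attention heads (H8)
    horizon_K: int = 2          # max prediction horizon K
    coupled_head: bool = True   # jointly predict rotation and translation

cfg = DynamicsModelConfig()
assert (cfg.depth, cfg.ff_dim, cfg.n_heads) == (4, 64, 8)
```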
## Loss & Training
Two-stage training:

- Stage 1: Pre-train the interaction dynamics model \(\mathcal{D}\) using the dynamics loss:

  $$\mathcal{L} = \mathbb{E}_{t,\, k \sim \mathcal{U}(1,K)} \left[\frac{1}{k} \cdot \Phi(\Delta o_{t \to t+k}, \Delta o^*_{t \to t+k})\right]$$

- Stage 2: Train the diffusion model with the total loss \(\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{hoi}} + \mathcal{L}_{\text{dyn}} + \mathcal{L}_{\text{obj}}\).
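The uniform horizon sampling and the \(1/k\) weighting in the Stage-1 loss can be sketched as follows (illustrative naming; `phi_value` stands in for the keypoint cost \(\Phi\) evaluated at that horizon):

```python
import random

# One Monte-Carlo sample of the Stage-1 objective: draw a horizon
# k ~ U(1, K) and weight the per-sample cost by 1/k so that longer
# horizons (which accumulate larger displacements) do not dominate.

def stage1_loss_sample(phi_value, K, rng):
    k = rng.randint(1, K)        # inclusive on both ends, i.e. k in [1, K]
    return phi_value / k, k

rng = random.Random(0)
loss, k = stage1_loss_sample(phi_value=0.6, K=3, rng=rng)
assert 1 <= k <= 3
assert abs(loss - 0.6 / k) < 1e-12
```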
Residual dynamics loss (core contribution). With \(r(x) = \Delta o(x) - \mathcal{D}\big(\Delta h(x),\, s(x)\big)\) denoting the dynamics residual of a sequence \(x\), the loss compares the residuals of the generated sequence \(\hat{x}\) and the ground truth \(x^*\):

$$\mathcal{L}_{\text{dyn}} = \big\| r(\hat{x}) - r(x^*) \big\|$$
The key insight is that even when \(\mathcal{D}\) is imperfect, taking the difference between residuals computed on generated and ground-truth sequences cancels systematic biases, allowing supervision to focus on genuine generation inconsistencies. This relies on the assumption that \(\mathcal{D}\) is locally smooth and temporally homogeneous.
The dynamics model is not used at inference, preserving runtime efficiency.
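A small numerical example makes the bias-cancellation argument concrete (1-D toy of our own construction): give the dynamics model a constant systematic bias, and the bias appears identically in the residuals of both sequences, so differencing isolates the genuine generation inconsistency.

```python
import numpy as np

# Why differencing residuals cancels a systematic bias in an imperfect
# dynamics model D. Here the true human-to-object map is the identity,
# and D under-predicts by a constant bias b. All names are illustrative.

rng = np.random.default_rng(0)
b = 0.05                                   # systematic bias of D
h = rng.normal(size=100)                   # human-driven displacements

def D(human_disp):
    return human_disp - b                  # biased predictor

eps = 0.02 * rng.normal(size=100)          # genuine generation inconsistency
obj_gt = h                                 # ground truth obeys the dynamics
obj_gen = h + eps                          # generated sequence violates them by eps

r_gt = obj_gt - D(h)                       # residual = b everywhere
r_gen = obj_gen - D(h)                     # residual = eps + b
supervision = r_gen - r_gt                 # = eps: the bias cancels

assert np.allclose(r_gt, b)
assert np.allclose(supervision, eps)
```

Note that the cancellation is exact only because the bias is the same on both sequences, which is precisely where the local smoothness and temporal homogeneity assumptions enter.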
## Key Experimental Results
### Main Results
Results on FullBodyManipulation:

| Method | FID ↓ | \(C_{F1}\) ↑ | C% ↑ | MPJPE ↓ | \(T_{\text{obj}}\) ↓ | \(R_{\text{obj}}\) ↓ |
|---|---|---|---|---|---|---|
| InterDiff | 208.0 | 0.33 | 0.27 | 25.91 | 88.35 | 1.65 |
| MDM | 6.16 | 0.53 | 0.43 | 17.86 | 24.46 | 1.85 |
| CHOIS | 0.87 | 0.66 | 0.54 | 16.01 | 14.29 | 0.99 |
| HOI-Dyn | 0.48 | 0.71 | 0.60 | 15.60 | 12.47 | 0.90 |

On 3D-FUTURE objects:

| Method | FID ↓ | C% ↑ | FS ↓ |
|---|---|---|---|
| CHOIS | 1.67 | 0.47 | 0.42 |
| HOI-Dyn | 1.62 | 0.54 | 0.37 |
### Ablation Study
| Architecture | K | Params | Loss (k=2) |
|---|---|---|---|
| Coupled (K=2) D4-F64-H8 | 2 | 0.483M | 0.462 |
| Coupled (K=1) D4-F64-H8 | 1 | 0.483M | 0.514 |
| Decoupled (K=2) (D1-F64-H8)×2 | 2 | 0.463M | 0.532 |
| Coupled (K=2) D8-F128-H8 | 2 | 0.994M | 0.845 |
### Key Findings
- Prediction horizon K: \(K=2\) or \(K=3\) yields optimal performance. Too small (\(K=1\)) fails to capture large-magnitude motions; too large (\(K=10\)) weakens modeling of subtle interactions.
- Coupled vs. decoupled: The coupled design for joint rotation and translation prediction significantly outperforms the decoupled counterpart at comparable parameter counts, validating the intrinsic coupling between rotation and translation in HOI.
- Model scale: A lightweight 0.5M-parameter model is sufficient; scaling to 1M parameters leads to performance degradation due to overfitting.
- Qualitative comparison: CHOIS produces premature motion artifacts (objects spontaneously moving toward contact points before human action); HOI-Dyn eliminates such artifacts.
## Highlights & Insights
- Novel Driver-Responder perspective: Framing HOI from a control-theoretic standpoint naturally resolves the contact modeling challenge.
- Elegant residual loss design: Differencing residuals cancels systematic biases in the dynamics model, enabling effective supervision from an imperfect auxiliary model.
- Training-inference decoupling: The dynamics model is used only during training, introducing no additional inference overhead.
- Lightweight and efficient: 0.5M-parameter dynamics model with approximately 10 hours of training on a single A4500 GPU.
## Limitations & Future Work
- Training is limited to the FullBodyManipulation dataset, restricting scene diversity.
- \(K\) requires manual selection; adaptive prediction horizon strategies merit exploration.
- The local smoothness and temporal homogeneity assumptions of the dynamics model may not hold for fast or highly dynamic interactions.
- The rigid-body transformation assumption limits applicability to deformable object interactions.
## Related Work & Insights
- CHOIS (SOTA baseline): Sparse waypoint-guided diffusion without explicit causal interaction modeling.
- OMOMO: Generates human poses conditioned on full object trajectories, but lacks bidirectional causal modeling.
- CG-HOI: Uses contact fields on the human mesh as priors, but contact field prediction is itself challenging.
- Insight: The residual loss concept generalizes to other settings involving imperfect auxiliary models.
## Rating
- Novelty: ⭐⭐⭐⭐ — Driver-Responder formalization and residual dynamics loss are conceptually novel.
- Experimental Thoroughness: ⭐⭐⭐⭐ — Comprehensive quantitative and qualitative evaluation with thorough ablations, though limited to a single dataset.
- Writing Quality: ⭐⭐⭐⭐ — Theoretical derivations are clear and motivation is well-articulated.
- Value: ⭐⭐⭐⭐ — Significant contribution to HOI generation with broadly applicable methodology.