HOI-Dyn: Learning Interaction Dynamics for Human-Object Motion Diffusion¶

Conference: NeurIPS 2025 arXiv: 2507.01737 Code: Available Area: Human Understanding Keywords: Human-Object Interaction, Motion Diffusion, Interaction Dynamics, Driver-Responder, Transformer

TL;DR¶

This paper models human-object interaction (HOI) generation as a Driver-Responder system, employing a lightweight Transformer-based interaction dynamics model to explicitly predict how objects respond to human actions. A residual dynamics loss is introduced during training to enforce causal consistency, while inference efficiency is preserved.

Background & Motivation¶

Generating realistic 3D human-object interactions (HOI) is a critical problem in VR/AR, computer animation, and robotics. Prior work exhibits the following limitations:

Independent modeling: Most methods treat human and object motion separately, resulting in physically implausible and causally inconsistent behaviors.

Contact modeling difficulty: Some approaches focus on object affordance or contact point prediction, but accurately modeling contact regions is inherently challenging.

Lack of causal structure: Existing diffusion models can generate globally plausible sequences but fail to capture fine-grained dynamics governing how objects respond to human actions.

The authors identify a key observation: HOI is fundamentally an asymmetric system — human motion follows internally driven dynamics (autonomous motion), whereas object motion is externally driven (cannot occur spontaneously). This asymmetry naturally motivates the Driver-Responder modeling paradigm.

Method¶

Overall Architecture¶

HOI-Dyn consists of two core components:

Conditional Motion Diffusion: A Transformer-based conditional diffusion model that jointly encodes human, object, and interaction context.
Interaction Dynamics: An auxiliary supervision module that enforces fine-grained causal consistency during training.

The central design principle is the Driver-Responder formulation:

\[\text{Driver (Human)}: h^{(t+1)} = h^{(t)} + \Delta t \cdot F_h(h^{(t)})$$ $$\text{Responder (Object)}: o^{(t+1)} = o^{(t)} + \Delta t \cdot F_o(o^{(t)}, s^{(t)}, u^{(t)})\]

where $u^{(t)}$ is the control signal (error feedback based on human intent and object behavior), and $s^{(t)}$ is the interaction context (contact state, object geometry, etc.).

Key Designs¶

1. Interaction Dynamics Model¶

Core idea: the relative motion of an object can be predicted from the relative motion of the human body:

\[\Delta o^{(t)} \approx \mathcal{D}(s^{(t)}, o^{(t)}, \Delta h^{(t)}; \theta_\mathcal{D})\]

To improve sensitivity to interactions of varying magnitude, the prediction horizon is extended from 1 to $k$, where $k$ is sampled uniformly from $[1, K]$:

\[\Delta o_{t \to t+k}^* \approx \mathcal{D}(s^{(t)}, o^{(t)}, \Delta h_{t \to t+k}; \theta_\mathcal{D})\]

Predictions are parameterized as rigid-body transformations (rotation $\hat{\mathcal{R}} \in SO(3)$ and translation $\hat{\mathcal{T}} \in \mathbb{R}^3$), with SVD projection applied to ensure valid rotation matrices.

2. Object Dynamics Cost Function¶

Based on keypoint transformation error:

\[\Phi(\Delta o, \Delta o^*) = \|\mathcal{P}^{(t+k)} - \hat{\mathcal{P}}^{(t+k)}\|_1\]

3. Implicit Contact Handling¶

An elegant design choice: explicit contact modeling is not required. In the absence of contact, no response is generated; when contact occurs, the object response is naturally determined by the interaction dynamics.

4. Network Architecture¶

The interaction dynamics model employs a lightweight Transformer (only 0.5M parameters), taking as input the current object state, interaction context, and accumulated human motion, and outputting rigid-body transformations for object motion. A coupled design (joint prediction of rotation and translation) outperforms a decoupled design.

Loss & Training¶

Two-stage training:

Stage 1: Pre-train the interaction dynamics model $\mathcal{D}$ using the dynamics loss: $$\mathcal{L} = \mathbb{E}_{t, k \sim \mathcal{U}(1,K)} \left[\frac{1}{k} \cdot \Phi(\Delta o_{t \to t+k}, \Delta o^*_{t \to t+k})\right]$$
Stage 2: Train the diffusion model with total loss $= \mathcal{L}_{\text{hoi}} + \mathcal{L}_{\text{dyn}} + \mathcal{L}_{\text{obj}}$

Residual dynamics loss (core contribution):

\[\mathcal{L}_{\text{dyn}} = \mathbb{E}_t \left[\|\Phi(\Delta\hat{o}_t^*, \Delta\hat{o}_t) - \Phi(\Delta o_t^*, \Delta o_t)\|_1\right]\]

The key insight is that even when $\mathcal{D}$ is imperfect, taking the difference between residuals computed on generated and ground-truth sequences cancels systematic biases, allowing supervision to focus on genuine generation inconsistencies. This relies on the assumption that $\mathcal{D}$ is locally smooth and temporally homogeneous.

The dynamics model is not used at inference, preserving runtime efficiency.

Key Experimental Results¶

Main Results¶

Method	FID ↓	$C_{F1}$ ↑	C% ↑	MPJPE ↓	$T_{\text{obj}}$ ↓	$R_{\text{obj}}$ ↓
InterDiff	208.0	0.33	0.27	25.91	88.35	1.65
MDM	6.16	0.53	0.43	17.86	24.46	1.85
CHOIS	0.87	0.66	0.54	16.01	14.29	0.99
HOI-Dyn	0.48	0.71	0.60	15.60	12.47	0.90

Method (3D-FUTURE)	FID ↓	C% ↑	FS ↓
CHOIS	1.67	0.47	0.42
HOI-Dyn	1.62	0.54	0.37

Ablation Study¶

Architecture	K	Params	K=2 Loss
Coupled (K=2) D4-F64-H8	2	0.483M	0.462
Coupled (K=1) D4-F64-H8	1	0.483M	0.514
Decoupled (K=2) (D1-F64-H8)×2	2	0.463M	0.532
Coupled (K=2) D8-F128-H8	2	0.994M	0.845

Key Findings¶

Prediction horizon K: $K=2$ or $K=3$ yields optimal performance. Too small ($K=1$) fails to capture large-magnitude motions; too large ($K=10$) weakens modeling of subtle interactions.
Coupled vs. decoupled: The coupled design for joint rotation and translation prediction significantly outperforms the decoupled counterpart at comparable parameter counts, validating the intrinsic coupling between rotation and translation in HOI.
Model scale: A lightweight 0.5M-parameter model is sufficient; scaling to 1M parameters leads to performance degradation due to overfitting.
Qualitative comparison: CHOIS produces premature motion artifacts (objects spontaneously moving toward contact points before human action); HOI-Dyn eliminates such artifacts.

Highlights & Insights¶

Novel Driver-Responder perspective: Framing HOI from a control-theoretic standpoint naturally resolves the contact modeling challenge.
Elegant residual loss design: Differencing residuals cancels systematic biases in the dynamics model, enabling effective supervision from an imperfect auxiliary model.
Training-inference decoupling: The dynamics model is used only during training, introducing no additional inference overhead.
Lightweight and efficient: 0.5M-parameter dynamics model with approximately 10 hours of training on a single A4500 GPU.

Limitations & Future Work¶

Training is limited to the FullBodyManipulation dataset, restricting scene diversity.
$K$ requires manual selection; adaptive prediction horizon strategies merit exploration.
The local smoothness and temporal homogeneity assumptions of the dynamics model may not hold for fast or highly dynamic interactions.
The rigid-body transformation assumption limits applicability to deformable object interactions.

CHOIS (SOTA baseline): Sparse waypoint-guided diffusion without explicit causal interaction modeling.
OMOMO: Generates human poses conditioned on full object trajectories, but lacks bidirectional causal modeling.
CG-HOI: Uses contact fields on the human mesh as priors, but contact field prediction is itself challenging.
Insight: The residual loss concept generalizes to other settings involving imperfect auxiliary models.

Rating¶

Novelty: ⭐⭐⭐⭐ — Driver-Responder formalization and residual dynamics loss are conceptually novel.
Experimental Thoroughness: ⭐⭐⭐⭐ — Comprehensive quantitative and qualitative evaluation with thorough ablations, though limited to a single dataset.
Writing Quality: ⭐⭐⭐⭐ — Theoretical derivations are clear and motivation is well-articulated.
Value: ⭐⭐⭐⭐ — Significant contribution to HOI generation with broadly applicable methodology.