TriDi: Trilateral Diffusion of 3D Humans, Objects, and Interactions¶
Conference: ICCV 2025 arXiv: 2412.06334 Project page: https://virtualhumans.mpi-inf.mpg.de/tridi/ Area: 3D Vision / Human-Object Interaction Keywords: 3D human-object interaction, joint probability modeling, trilateral diffusion, contact maps, multimodal generation
TL;DR¶
TriDi is proposed as the first unified diffusion model of the joint three-variable distribution over humans (H), objects (O), and interactions (I). A single network covers all 7 operational modes, from fully conditional to unconditional joint generation, and outperforms dedicated unidirectional baselines across all settings.
Background & Motivation¶
Background: Modeling 3D human-object interactions (HOI) is critical for applications such as AR/VR and virtual human generation. Existing methods operate in a unidirectional conditional manner: some recover human pose from objects as \(P(H|O)\), while others recover object pose from humans as \(P(O|H)\), each requiring its own dedicated architecture and training pipeline.
Limitations of Prior Work: Training specialized models for each conditional combination is (1) not scalable, (2) ignores the mutual dependencies among the three modalities, and (3) cannot support unconditional joint generation. Given a human and an object, multiple plausible interaction types exist (sitting, lifting, pushing, etc.), and a comprehensive model should simultaneously capture the relationships among all modalities.
Key Insight: This work shifts HOI modeling from the paradigm of "unidirectional conditional distributions" to "three-variable joint distributions." Inspired by UniDiffuser (bimodal diffusion), it extends the framework to three modalities, modeling \(P(H, O, I)\) within a compact architecture that naturally yields \(2^3 - 1 = 7\) operational modes.
Core Idea: (1) A Transformer-based trilateral diffusion process assigns independent timesteps to each modality and discovers fine-grained cross-modal relationships via token-level self-attention. (2) Textual descriptions and body contact maps are embedded into a shared latent space, balancing user controllability with spatial expressiveness.
Method¶
Overall Architecture¶
TriDi parameterizes the three modalities (H, O, I) as follows:

- Human H = (pose \(\theta \in \mathbb{R}^{51\times3}\), shape \(\beta \in \mathbb{R}^{10}\), global pose \(g_H \in \mathbb{R}^9\)), based on the SMPL+H model
- Object O = (global pose \(g_O \in \mathbb{R}^9\)), with geometry encoded via PointNeXt features and category one-hot encoding
- Interaction I = (latent variable \(z_I \in \mathbb{R}^{128}\)), a compact encoding that jointly embeds contact maps and textual descriptions
The model receives the noisy tokens of all three modalities along with three independent timesteps \((t^H, t^O, t^I)\) and directly predicts the clean samples \((H^0, O^0, I^0)\).
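To make the token/timestep layout concrete, below is a minimal PyTorch sketch of such a trilateral denoiser. The input/output dimensions follow the parameterization above, but the module names, depth, and the single-token-per-modality layout are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

T = 1000  # number of diffusion steps (assumed; not stated in these notes)

class TriDiDenoiser(nn.Module):
    """Illustrative trilateral denoiser: one token per modality, three timesteps."""

    def __init__(self, d_model=256, obj_dim=256):
        super().__init__()
        # Input projections follow the parameterization above.
        self.proj_h = nn.Linear(51 * 3 + 10 + 9, d_model)  # theta, beta, g_H
        self.proj_o = nn.Linear(9, d_model)                # g_O
        self.proj_i = nn.Linear(128, d_model)              # z_I
        self.proj_c = nn.Linear(obj_dim, d_model)          # C_O: PointNeXt feats + one-hot
        self.t_embed = nn.Embedding(T + 1, d_model)        # shared timestep table
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=6)
        self.head_h = nn.Linear(d_model, 51 * 3 + 10 + 9)
        self.head_o = nn.Linear(d_model, 9)
        self.head_i = nn.Linear(d_model, 128)

    def forward(self, h_t, o_t, i_t, t_h, t_o, t_i, c_o):
        # Each modality token is tagged with its *own* timestep embedding,
        # which is what lets one network realize all 7 operational modes.
        tokens = torch.stack(
            [
                self.proj_h(h_t) + self.t_embed(t_h),
                self.proj_o(o_t) + self.t_embed(t_o),
                self.proj_i(i_t) + self.t_embed(t_i),
                self.proj_c(c_o),  # object geometry/category conditioning
            ],
            dim=1,
        )
        out = self.backbone(tokens)
        # x0-prediction heads: regress the clean (H^0, O^0, I^0).
        return self.head_h(out[:, 0]), self.head_o(out[:, 1]), self.head_i(out[:, 2])
```

A mode such as \(P(H, I \mid O)\) then corresponds to feeding the clean object pose with \(t^O = 0\) and denoising only H and I; setting all three timesteps to \(T\) yields unconditional joint generation.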
Key Designs¶
- Contact-Text Interaction Representation: Two encoders are trained, a contact map encoder \(E_\phi\) and a text encoder \(E_T\), both mapping into a shared 128-dimensional latent space, together with a contact map decoder \(D_\phi\). The loss has three terms: contact-map autoencoding BCE, text-to-contact-map BCE, and an L2 alignment between the two latents. This lets users guide generation with either text or contact maps (a minimal sketch of these losses is the first one after this list).
- Trilateral Diffusion Formulation: UniDiffuser is extended to three modalities, with each modality assigned an independent noise timestep \((t^H, t^O, t^I)\). The training objective is:
\(\min_\psi \mathbb{E}_p \mathbb{E}_t \mathbb{E}_q \| \text{TriDi}_\psi(H^{t^H}, O^{t^O}, I^{t^I}; t^H, t^O, t^I; C_O) - (H^0, O^0, I^0) \|_2\)
Any operational mode can be selected by adjusting the per-modality timesteps: \(t = 0\) keeps a modality clean as conditioning, while \(t = T\) marks it as a generation target.
- Reconstruction-Based Guidance Mechanism: During denoising, a guidance function \(F\) weights the predicted human-object distances by the contact map decoded from \(\hat{I}\), and its gradient is used to correct the predictions (see the second sketch after this list):
\(F(\hat{H}, \hat{O}, \hat{I}) = \sum_j |\hat{\phi}_{I_j} \hat{d}_j|\)
where \(\hat{d}_j\) is the distance from each human vertex to the nearest point on the object. This guidance is applied during the final 200 steps with weight \(\lambda=2.0\).
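A minimal sketch of the three-term interaction-latent loss described above, assuming `E_phi`, `E_T`, and `D_phi` are given as modules (their architectures are not specified in these notes):

```python
import torch.nn.functional as F

def interaction_latent_losses(E_phi, E_T, D_phi, contact_map, text_emb):
    """Three-term loss for the shared 128-d interaction latent (sketch).

    contact_map: (B, V) per-vertex contact labels in [0, 1]
    text_emb:    (B, D) embedding of the textual description
    """
    z_c = E_phi(contact_map)  # contact map -> z_I, (B, 128)
    z_t = E_T(text_emb)       # text        -> z_I, (B, 128)
    # 1) contact-map autoencoding BCE
    loss_ae = F.binary_cross_entropy_with_logits(D_phi(z_c), contact_map)
    # 2) text-to-contact-map BCE: the text latent must decode to the same map
    loss_text = F.binary_cross_entropy_with_logits(D_phi(z_t), contact_map)
    # 3) L2 alignment of the two latents
    loss_align = F.mse_loss(z_t, z_c)
    return loss_ae + loss_text + loss_align
```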
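And a sketch of the guidance term \(F\), with illustrative tensor shapes; the commented update shows how the stated weight \(\lambda = 2.0\) would be applied over the final 200 steps:

```python
import torch

def contact_guidance(human_verts, obj_points, contact_probs):
    """F(H, O, I) = sum_j |phi_j * d_j|: contact-weighted human-object distances.

    human_verts:   (V, 3) predicted SMPL+H vertices
    obj_points:    (P, 3) points on the predicted object surface
    contact_probs: (V,)   contact map decoded from the predicted z_I via D_phi
    """
    # d_j: distance from each human vertex to the nearest object point
    d = torch.cdist(human_verts, obj_points).min(dim=1).values  # (V,)
    return (contact_probs * d).abs().sum()

# During the final 200 denoising steps, the prediction is nudged along the
# negative gradient of F with weight lambda = 2.0:
#   grad = torch.autograd.grad(contact_guidance(...), pred_params)[0]
#   pred_params = pred_params - 2.0 * grad
```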
Loss & Training¶
The total loss consists of 6 terms: parameter-space losses (human L1, object L1, interaction L2) and vertex-space losses (human vertices L2, object vertices L2, human-object distances L2), weighted \((2, 1, 1, 6, 2, 4)\) respectively, as sketched below.
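A minimal sketch of the weighted combination (the term names are illustrative):

```python
# Weights follow the order stated above: (2, 1, 1, 6, 2, 4).
LOSS_WEIGHTS = {
    "human_params": 2.0,   # L1 on (theta, beta, g_H)
    "object_params": 1.0,  # L1 on g_O
    "interaction": 1.0,    # L2 on z_I
    "human_verts": 6.0,    # L2 on posed SMPL+H vertices
    "object_verts": 2.0,   # L2 on transformed object points
    "ho_distance": 4.0,    # L2 on human-object distances
}

def total_loss(terms):
    """terms: dict mapping the six keys above to scalar losses."""
    return sum(LOSS_WEIGHTS[k] * terms[k] for k in LOSS_WEIGHTS)
```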
Training details: 15M parameters, batch size 1024, learning rate \(1\times10^{-4}\) with cosine decay, AdamW optimizer, approximately 20 hours of training on an RTX 4090. A ZY-plane mirror augmentation (sketched below) counters the right-hand bias in the training data, effectively increasing training diversity.
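A sketch of the ZY-plane mirror on point data; `lr_vertex_map` is a hypothetical precomputed left/right SMPL+H vertex correspondence, and mirroring the pose parameters themselves (swapping left/right joint rotations and negating the matching axis-angle components) is omitted:

```python
import numpy as np

def mirror_zy(points, contact_map=None, lr_vertex_map=None):
    """Reflect a training sample across the ZY plane by negating x.

    points:        (N, 3) human vertices or object points
    contact_map:   (V,) per-vertex contact labels, optional
    lr_vertex_map: (V,) hypothetical left<->right vertex index table;
                   a mirrored body's contact map must swap sides too.
    """
    mirrored = points * np.array([-1.0, 1.0, 1.0])
    if contact_map is not None and lr_vertex_map is not None:
        contact_map = contact_map[lr_vertex_map]
    return mirrored, contact_map
```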
Key Experimental Results¶
Main Results¶
Distribution quality and geometric consistency are evaluated on the BEHAVE and GRAB datasets (the table below shows BEHAVE):
| Method | Mode | 1-NNA (→50) | COV ↑ | MMD ↓ | MPJPE ↓ | MPJPE-PA ↓ |
|---|---|---|---|---|---|---|
| GNet | H,I \| O | 80.01 | 40.71 | 1.789 | 35.6 | 14.6 |
| ObjPOP | O,I \| H | 81.36 | 35.02 | 0.329 | — | — |
| TriDi | H,I \| O | 67.89 | 47.81 | 1.352 | 20.8 | 12.3 |
| TriDi | O,I \| H | 63.72 | 51.71 | 0.166 | — | — |
TriDi achieves 1-NNA values closer to the ideal of 50 and improves COV by up to 47% relative, while also surpassing the dedicated baselines in geometric consistency.
Ablation Study¶
| Configuration | 1-NNA | COV | MPJPE | Acc_cont |
|---|---|---|---|---|
| w/o augmentation | worse | lower | — | — |
| w/o I modality | — | — | worse | lower |
| w/o guidance | — | — | worse geometry | lower |
| Full model | best | best | best | best |
Augmentation improves distribution quality (1-NNA); guidance and the interaction modality contribute significantly to geometric consistency; joint three-modal modeling outperforms bimodal variants.
Key Findings¶
- Jointly trained TriDi matches or outperforms variants trained specifically for a single mode (s-TriDi-HI, s-TriDi-OI), demonstrating that joint modeling improves generalization.
- In a user study (N=40), TriDi outputs were preferred over baseline outputs roughly 89% of the time, and were chosen over ground truth about 52% of the time, i.e., near parity with real data.
- TriDi generalizes to unseen geometries (e.g., chairs and stools) and supports indirect interaction reconstruction from RGB images.
Highlights & Insights¶
- Unification: A single 15M-parameter model covers 7 operational modes, encompassing all specialized scenarios addressed by prior work while enabling new use cases such as joint generation of H+O+I.
- Contact-Text Representation: Elegantly unifies precise but unintuitive contact maps with user-friendly but coarse textual descriptions within a shared latent space, balancing controllability and spatial precision.
- Left-Right Symmetry Augmentation: Simple yet effective; the right-hand bias present in training data has been largely overlooked by prior HOI methods.
Limitations & Future Work¶
- Due to skewed training data, the model performs well on high-frequency objects but exhibits limited generalization to functionally diverse unseen objects (e.g., wheelchairs, bicycles).
- The current method handles only single-frame static interactions; extension to dynamic sequences is an important future direction.
- The framework addresses single-person, single-object interactions; scaling to multi-person, multi-object scenarios is a natural next step.
Related Work & Insights¶
- The approach is conceptually aligned with UniDiffuser [Bao 2023], extending bimodal to trilateral diffusion and demonstrating the scalability of multimodal joint diffusion.
- CG-HOI [Diller 2024] requires strong text conditioning and trains on one dataset at a time; TriDi is more general.
- The method has direct applicability to scene population, virtual reality content creation, and interaction reconstruction.
Rating¶
- Novelty: ⭐⭐⭐⭐ — Trilateral joint diffusion + unified Contact-Text representation
- Technical Depth: ⭐⭐⭐⭐ — Rigorous probabilistic modeling and guidance mechanism
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ — Multiple datasets, ablations, user study, and downstream applications
- Practical Value: ⭐⭐⭐⭐ — AR/VR scene population and interaction reconstruction