AnchorHOI: Zero-shot Generation of 4D Human-Object Interaction via Anchor-based Prior Distillation¶
Conference: AAAI 2026 arXiv: 2512.14095v1 Code: N/A Area: 3D Vision Keywords: 4D HOI Generation, Zero-shot, Anchor-based Prior Distillation, NeRF, Video Diffusion Model
TL;DR¶
AnchorHOI achieves zero-shot text-driven 4D human-object interaction (HOI) generation by introducing two intermediate bridges — anchor NeRF and anchor keypoints — to distill interaction priors and motion priors from image and video diffusion models, respectively. The method outperforms existing approaches on both static 3D and dynamic 4D HOI generation.
Background & Motivation¶
Text-driven 4D HOI generation has broad applications in AR/VR, gaming, and robotics. Supervised methods rely on scarce and costly motion capture (mocap) paired data, severely limiting scalability. Recent zero-shot methods such as AvatarGO attempt to replace mocap data with pretrained image diffusion models, but suffer from two critical drawbacks: (1) human body pose is fixed to a canonical pose during interaction composition, lacking adaptability; and (2) motion is derived from text-to-body motion models that are agnostic to the object, ignoring interaction-aware motion synthesis. This motivates the need for richer prior sources and more effective prior distillation techniques.
Core Problem¶
How to effectively distill interaction priors from pretrained image and video diffusion models — without relying on mocap paired data — to generate 4D HOI sequences with realistic poses and interaction-aware motion? The core challenges are: (1) optimizing the high-degree-of-freedom SMPL-X body joint parameters under image diffusion model guidance is extremely difficult; and (2) synthetic videos generated by video diffusion models suffer from severe inter-subject occlusion, making it hard to reliably extract interaction motion information.
Method¶
Overall Architecture¶
AnchorHOI adopts a two-stage pipeline: first generating a static 3D HOI instance (interaction composition), then extending it into a dynamic 4D HOI sequence (motion synthesis). The input is a natural language description (covering the person, action, and object), and the output is a multi-frame 3D human-object interaction sequence. The core innovation is the introduction of "anchors" as intermediate bridges, decomposing the originally intractable direct optimization into a manageable two-step process.
Key Designs¶
- Anchor NeRF for Interaction Composition: Directly optimizing SMPL-X parameters via SDS in a high-dimensional, nonlinear parameter space is nearly infeasible. Instead, a coarse entangled human-object NeRF is first generated via SDS from an image diffusion model, and the human portion is extracted as the anchor NeRF through multi-view feature alignment. OpenPose then detects 2D skeletal keypoints from rendered anchor NeRF images, and SMPL-X pose parameters are optimized by minimizing the discrepancy between projected 3D joints and detected 2D keypoints, enabling pose-adaptive interaction composition. The object is initialized from the object portion of the anchor NeRF and refined via SDS.
- Anchor Keypoints for Motion Synthesis: HOI videos generated by video diffusion models exhibit severe occlusion at contact regions, making pixel-level cues insufficient for capturing interaction motion. Two types of anchor keypoints are therefore defined: (a) body keypoints — 18 human-body 2D keypoints detected per frame via OpenPose, providing robust pose cues under occlusion; and (b) contact keypoints — derived from 3D geometric proxies (sampled points on the object mesh surface and candidate contact vertices on the SMPL-X mesh), with valid contact point pairs identified using normal-vector alignment and geometric proximity constraints to capture interaction information in occluded regions.
- Motion Optimization: Using anchor keypoints as tracking cues, per-frame human and object motion parameters are jointly optimized via a combined loss, including: joint projection alignment loss (reprojecting SMPL-X joints to 2D and aligning with detected keypoints), contact constraint loss (minimizing distances between human-object contact pairs), penetration penalty, and regularization terms (rendering consistency, self-penetration penalty, and temporal smoothness).
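The contact-keypoint selection described above (opposing surface normals plus geometric proximity) can be sketched as follows. The distance and normal-alignment thresholds, and the brute-force pairing over sampled points, are illustrative assumptions, not the paper's exact procedure or values:

```python
import numpy as np

def find_contact_pairs(obj_pts, obj_normals, body_pts, body_normals,
                       dist_thresh=0.02, normal_thresh=-0.8):
    """Identify candidate human-object contact pairs: point pairs that are
    close (within dist_thresh meters) and whose surface normals point
    roughly toward each other (cosine below normal_thresh, i.e. opposing).
    Thresholds here (2 cm, cos < -0.8) are illustrative, not the paper's.
    """
    pairs = []
    for i, (p, n) in enumerate(zip(obj_pts, obj_normals)):
        d = np.linalg.norm(body_pts - p, axis=1)           # distances to all body candidates
        cos = body_normals @ n                              # normal alignment (dot products)
        valid = (d < dist_thresh) & (cos < normal_thresh)   # near AND opposing normals
        for j in np.nonzero(valid)[0]:
            pairs.append((i, int(j)))
    return pairs
```

The opposing-normal test is what rules out near-miss geometry (e.g. a hand hovering beside, rather than pressing against, an object surface).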
Loss & Training¶
- Pose alignment loss \(\mathcal{L}_{\text{align}}\): Geman-McClure robust distance between projected SMPL-X 3D joints and OpenPose-detected 2D keypoints across multiple views.
- Total motion loss \(\mathcal{L}_{\text{total}} = \lambda_J \mathcal{L}_J + \lambda_C \mathcal{L}_C + \lambda_{\text{pen}} \mathcal{L}_{\text{pen}} + \lambda_{\text{reg}} \mathcal{L}_{\text{reg}}\), where:
  - \(\mathcal{L}_J\): joint reprojection alignment (confidence-weighted Geman-McClure)
  - \(\mathcal{L}_C\): Euclidean distance between contact keypoint pairs
  - \(\mathcal{L}_{\text{pen}}\): human-object penetration penalty
  - \(\mathcal{L}_{\text{reg}}\): rendering MSE + self-penetration penalty + temporal smoothness
- Interaction composition: 3,000 iterations; motion synthesis: 1,000 iterations; Adam optimizer, lr = 0.01, A6000 GPU.
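The confidence-weighted Geman-McClure term \(\mathcal{L}_J\) above can be sketched as a minimal NumPy function. The saturation scale `sigma` and the per-joint weighting scheme are assumptions for illustration; the paper does not specify them here:

```python
import numpy as np

def geman_mcclure(residual, sigma=100.0):
    """Geman-McClure robust penalty rho(r) = sigma^2 * r^2 / (sigma^2 + r^2).
    Saturates toward sigma^2 for large residuals, so outlier keypoint
    detections cannot dominate the loss."""
    r2 = residual ** 2
    return (sigma ** 2) * r2 / (sigma ** 2 + r2)

def joint_alignment_loss(proj_joints_2d, detected_2d, confidence, sigma=100.0):
    """Confidence-weighted alignment between projected SMPL-X joints (J x 2,
    in pixels) and detected 2D keypoints (J x 2), with per-joint detector
    confidences (J,). sigma is an illustrative robustness scale."""
    res = np.linalg.norm(proj_joints_2d - detected_2d, axis=-1)  # per-joint pixel error
    return float(np.sum(confidence * geman_mcclure(res, sigma)))
```

The robust penalty matters because OpenPose keypoints on rendered or occluded views can be badly wrong for individual joints; a plain squared error would let such joints dominate the pose fit.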
Key Experimental Results¶
| Method | CLIP Score ↑ | GPT-4V Overall ↑ | User Semantic ↑ | User Contact ↑ | User Motion ↑ | User Overall ↑ |
|---|---|---|---|---|---|---|
| DreamGaussian4D | 0.2833 | 25.00% | 2.33 | 2.38 | 2.63 | 3.33 |
| TC4D | 0.3017 | 20.83% | 3.11 | 2.32 | 2.39 | 3.66 |
| AnchorHOI | 0.3149 | 54.17% | 4.79 | 4.75 | 4.87 | 4.83 |
Static 3D HOI comparison (vs. MVDream / InterFusion / AvatarGO): CLIP Score 0.3173 (highest); GPT-4V Overall selection rate 52.63% (vs. InterFusion 26.32%).
Ablation Study¶
- w/o anchor NeRF: GPT-4V selection rate drops to 5.88% (vs. 94.12% for the full model); human pose fails to converge to a reasonable interaction posture.
- w/o body keypoints: 5.89%; motion pose becomes unreasonable.
- w/o contact keypoints: 17.65%; motion appears visually plausible but lacks physical contact.
- Full model: 76.47%; achieves both reasonable pose and contact-aware motion.
Highlights & Insights¶
- Elegance of the anchor strategy: Rather than performing difficult optimization directly in a high-dimensional parameter space, intermediate representations (NeRF and keypoints) are introduced as bridges, decomposing the problem into two manageable steps. This "build the bridge before crossing the river" paradigm has broad generalizability.
- Hybrid prior exploitation: This work is the first to combine image diffusion models (static interaction priors) and video diffusion models (dynamic motion priors) for zero-shot 4D HOI generation.
- Contact keypoint definition: Combining normal vector alignment and geometric proximity constraints to identify valid contacts has a clear physical rationale (opposing normals and small distances at contact regions).
Limitations & Future Work¶
- The method assumes continuous contact between the human and object throughout the sequence, and cannot handle interactions where contact is broken and re-established (e.g., throwing and catching).
- Only rigid objects are supported; articulated or deformable objects (e.g., opening a door, folding clothes) are not handled.
- Generation speed is limited by SDS iterative optimization (3,000 + 1,000 iterations), precluding real-time use.
- The approach depends on the quality of OpenPose 2D keypoint detection, which may become unreliable under heavy occlusion.
Related Work & Insights¶
- vs. AvatarGO: The most direct predecessor; however, AvatarGO fixes the human body in a canonical standing pose during interaction composition, making it unable to adapt to interaction-specific postures such as sitting or crouching, and its 4D component is not publicly available. AnchorHOI achieves pose-adaptive interaction composition through the anchor NeRF.
- vs. DreamGaussian4D: Both use video diffusion model guidance, but DreamGaussian4D relies solely on RGB and mask pixel cues to drive HOI animation and cannot capture interaction motion in occluded regions. AnchorHOI provides more robust interaction motion cues via anchor keypoints.
- vs. InterFusion (3D): InterFusion retrieves fixed poses from a prebuilt pose library, lacking adaptability to specific interaction scenarios. AnchorHOI achieves pose optimization aligned with interaction semantics through the anchor NeRF.
The anchor strategy (generating an intermediate representation first, then distilling to the target representation) is generalizable to other diffusion-model-guided 3D/4D generation tasks, particularly when the target representation (e.g., a parametric model) cannot directly receive diffusion model gradients. The contact keypoint definition approach (normal alignment + geometric proximity) is transferable to robot grasp planning and contact modeling in physical simulation.
Rating¶
- Novelty: ⭐⭐⭐⭐ The anchor strategy converts an intractable direct optimization into a two-step controllable process in a clear and effective manner.
- Experimental Thoroughness: ⭐⭐⭐⭐ Quantitative (CLIP / GPT-4V / user study) + qualitative + ablation coverage is comprehensive, though comparisons with more 4D methods are lacking.
- Writing Quality: ⭐⭐⭐⭐ Problem motivation and method description are clear; anchor illustration figures are intuitive.
- Value: ⭐⭐⭐⭐ Significant progress on the emerging direction of zero-shot 4D HOI, though constrained by the rigid object and continuous contact assumptions.