ArtHOI: Taming Foundation Models for Monocular 4D Reconstruction of Hand-Articulated-Object Interactions

Conference: CVPR 2026
arXiv: 2603.25791
Code: https://arthoi-reconstruction.github.io
Area: 3D Vision / Hand-Object Interaction Reconstruction
Keywords: Hand-Object Interaction, Articulated Object, 4D Reconstruction, Foundation Models, MLLM

TL;DR

ArtHOI presents the first complete pipeline for reconstructing 4D interactions between hands and articulated objects (e.g., scissors, glasses, laptops) from monocular RGB video. It combines Adaptive Sampling Refinement (ASR), which recovers metric scale and pose, with an MLLM-guided hand-object alignment strategy. The method outperforms the RSRD baseline, which requires pre-scanned object geometry, across multiple datasets.

Background & Motivation

Background: Hand-object interaction (HOI) reconstruction is critical for human behavior analysis, robotic manipulation, and augmented reality. Early methods rely on predefined object templates or category-specific knowledge, limiting generalizability.

Limitations of Prior Work:
- Template-free HOI methods improve generalization but apply almost exclusively to rigid objects
- 4D reconstruction methods for articulated objects typically require pre-scanned objects (to obtain canonical shapes) or multi-view video, making them unsuitable for in-the-wild scenarios

Core Problem: Reconstructing hand–articulated-object interactions from monocular video is a severely ill-posed problem due to limited visual cues, frequent occlusions, and internal degrees of freedom in the object.

Key Insight: Drawing inspiration from how humans leverage accumulated knowledge and experience to perceive complex interactions, the paper proposes exploiting rich priors from multiple foundation models (image-to-3D, pose estimation, depth estimation, tracking, MLLM, etc.) to address this ill-posed problem.

Key Challenge: Naively integrating multiple foundation models fails because: (1) meshes generated by image-to-3D models are in normalized coordinate spaces and lack metric scale; (2) independently reconstructed hands and objects suffer from spatial misalignment.

Method

Overall Architecture

A four-stage pipeline:
1. Data Preprocessing: Extract masks, depth, and camera parameters using foundation vision models; inpaint video to remove hand occlusions
2. Canonical Object Mesh Reconstruction: HunYuan3D generates a normalized 3D mesh → ASR recovers metric scale and pose
3. Part Motion Reconstruction: CoTracker performs dense tracking → per-part per-frame SE(3) transformations are optimized
4. MLLM-Guided Hand-Object Alignment: WiLoR reconstructs hand meshes → MLLM infers contact states → joint optimization

Key Designs

Adaptive Sampling Refinement (ASR):
- Problem: Image-to-3D models produce normalized meshes; directly applying FoundationPose fails due to inconsistency between mesh scale and depth
- Mechanism: A coarse scale is first estimated via back-projected depth; candidate scales are then iteratively sampled within an adaptive range, FoundationPose is invoked for each candidate, rendered silhouettes are compared against object masks via IoU, and the optimal scale is selected
- Adaptive Strategy: If no improvement is observed in recent iterations, the sampling range is expanded (\(\delta \leftarrow 2\delta\)); otherwise it remains unchanged
- Design Motivation: By searching the scale space and using rendering feedback as a verification criterion, ASR robustly reconciles normalized meshes, noisy depth, and pose predictions
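
The ASR loop described above can be sketched as a small random search. This is a minimal illustration under stated assumptions, not the authors' implementation: `score_scale` is a hypothetical callable standing in for "run FoundationPose at the candidate scale, render the silhouette, and return its IoU with the object mask", and `samples_per_iter`, `patience`, and the multiplicative perturbation of the scale are choices made here for illustration. Only the 20 iterations, the initial \(\delta = 0.03\), and the doubling rule come from the text.

```python
import random
from typing import Callable, Tuple

import numpy as np


def silhouette_iou(rendered: np.ndarray, mask: np.ndarray) -> float:
    """IoU between two silhouettes of equal shape (nonzero = foreground)."""
    rendered = rendered.astype(bool)
    mask = mask.astype(bool)
    union = np.logical_or(rendered, mask).sum()
    if union == 0:
        return 0.0
    return float(np.logical_and(rendered, mask).sum() / union)


def adaptive_sampling_refinement(
    coarse_scale: float,
    score_scale: Callable[[float], float],  # hypothetical: pose + render + IoU
    iterations: int = 20,       # from the text
    delta: float = 0.03,        # initial sampling range, from the text
    samples_per_iter: int = 8,  # assumption
    patience: int = 3,          # assumption: what counts as "recent iterations"
    seed: int = 0,
) -> Tuple[float, float]:
    """Search for the scale that maximizes the verification score."""
    rng = random.Random(seed)
    best_scale = coarse_scale
    best_score = score_scale(coarse_scale)
    stale = 0  # consecutive iterations without improvement
    for _ in range(iterations):
        improved = False
        for _ in range(samples_per_iter):
            # Assumption: delta is a relative range around the current best.
            candidate = best_scale * (1.0 + rng.uniform(-delta, delta))
            score = score_scale(candidate)
            if score > best_score:
                best_scale, best_score, improved = candidate, score, True
        stale = 0 if improved else stale + 1
        if stale >= patience:
            delta *= 2.0  # expand the range; otherwise it stays unchanged
            stale = 0
    return best_scale, best_score
```

In a real pipeline `score_scale` would wrap the pose estimator and a renderer and call `silhouette_iou` on the result; here it is left abstract so the search logic can be read on its own.
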
Part-wise Motion Reconstruction:
- PartField is used to segment the mesh into parts
- CoTrackerV3 tracks 2D trajectories of each part, which are lifted to 3D using depth
- Optimization objective: tracking consistency loss + motion smoothness regularization \(\mathcal{L}_{motion} = \mathcal{L}_{track} + \lambda_{smooth} \mathcal{L}_{smooth}\)
- \(\mathcal{L}_{smooth}\) enforces temporal smoothness via discrete second-order finite differences
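
The lifting step and the smoothness term can be sketched as follows. This is a hedged illustration: the pinhole back-projection, the squared-norm form of both terms, the restriction of the smoothness term to a translation trajectory (rotations in SE(3) would need separate handling), and the value of `lambda_smooth` are assumptions, since the summary does not specify them.

```python
import numpy as np


def backproject(uv: np.ndarray, depth: np.ndarray, K: np.ndarray) -> np.ndarray:
    """Lift pixels (N, 2) with depths (N,) to camera-space points (N, 3).

    Assumes a pinhole camera with intrinsics K (3x3) and no distortion.
    """
    fx, fy = K[0, 0], K[1, 1]
    cx, cy = K[0, 2], K[1, 2]
    x = (uv[:, 0] - cx) / fx * depth
    y = (uv[:, 1] - cy) / fy * depth
    return np.stack([x, y, depth], axis=1)


def smoothness_loss(traj: np.ndarray) -> float:
    """Mean squared second-order finite difference of a (T, D) trajectory."""
    if traj.shape[0] < 3:
        return 0.0
    accel = traj[2:] - 2.0 * traj[1:-1] + traj[:-2]  # x[t+1] - 2 x[t] + x[t-1]
    return float(np.mean(np.sum(accel ** 2, axis=1)))


def motion_loss(track_residuals: np.ndarray, traj: np.ndarray,
                lambda_smooth: float = 0.1) -> float:
    """L_motion = L_track + lambda_smooth * L_smooth.

    track_residuals: (N, 3) differences between transformed part points and
    lifted track points (assumed form of the tracking term).
    """
    l_track = float(np.mean(np.sum(track_residuals ** 2, axis=1)))
    return l_track + lambda_smooth * smoothness_loss(traj)
```

A trajectory that moves at constant velocity has zero second-order difference, so the regularizer penalizes only changes in velocity, not motion itself.
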
MLLM-Guided Hand-Object Alignment:
- Contact Inference: A structured prompting strategy queries Qwen-VL-Max to infer per-frame contact states (whether contact occurs, which fingers are involved)
- Mitigating MLLM Errors: The camera viewpoint (egocentric vs. exocentric) is queried first to avoid left/right hand confusion; concatenated RGB and colored depth maps of adjacent frames are provided as rich context
- Two-Stage Optimization:
  - Stage 1: Optimize only object scale \(s_c^o\) (using the metric scale prior from the hand for alignment)
  - Stage 2: Fix object scale; jointly optimize hand pose \(\theta_i^h\) and global transformation \(\mathbf{T}_i^h\)
- Contact Loss: \(\mathcal{L}_{contact} = \sum_{i \in \mathbb{C}} \sum_{\mathbf{v}_t \in \mathbb{T}_i} \min_{\mathbf{v}_o \in \mathcal{G}_i^o} \|\mathbf{v}_o - \mathbf{v}_t\|_2\)
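
The contact loss can be evaluated directly with a brute-force nearest-neighbor search. The sketch below reads \(\mathbb{C}\) as the set of frames flagged as in contact, \(\mathbb{T}_i\) as the contacting fingertip vertices in frame \(i\), and \(\mathcal{G}_i^o\) as the object surface vertices in frame \(i\); this reading of the symbols is an interpretation, and the function and argument names are hypothetical.

```python
from typing import Dict, Iterable

import numpy as np


def contact_loss(
    fingertips: Dict[int, np.ndarray],       # frame -> (K_i, 3) fingertip vertices
    object_vertices: Dict[int, np.ndarray],  # frame -> (M_i, 3) object vertices
    contact_frames: Iterable[int],           # frames the MLLM marked as contact
) -> float:
    """Sum over contact frames of each fingertip's distance to the nearest object vertex."""
    total = 0.0
    for i in contact_frames:
        tips = fingertips[i]
        verts = object_vertices[i]
        # Pairwise Euclidean distances, shape (K_i, M_i).
        dists = np.linalg.norm(tips[:, None, :] - verts[None, :, :], axis=2)
        total += float(dists.min(axis=1).sum())
    return total
```

For dense meshes the pairwise distance matrix becomes large; a KD-tree query would be the usual substitute, at the cost of an extra dependency.
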

Loss & Training

Purely optimization-based framework; no neural network training is required
Processing a single video takes approximately 1 hour (100 frames, 960×540) on an A6000 GPU
ASR: 20 iterations, initial range \(\delta=0.03\)
Motion optimization: 500 iterations per frame, Adam optimizer, learning rate linearly decayed from 0.02 to 0.002
HOI alignment: 800 optimization steps
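
The learning-rate schedule for motion optimization can be written as a one-line interpolation. This sketch assumes the decay runs once over the 500 iterations of each frame and reaches 0.002 exactly on the last iteration; the summary does not state either detail, and `linear_lr` is a hypothetical helper name.

```python
def linear_lr(step: int, total_steps: int = 500,
              lr_start: float = 0.02, lr_end: float = 0.002) -> float:
    """Linearly interpolate the learning rate from lr_start to lr_end."""
    if total_steps <= 1:
        return lr_end
    # Clamp so that out-of-range steps return the endpoint values.
    t = min(max(step, 0), total_steps - 1) / (total_steps - 1)
    return lr_start + t * (lr_end - lr_start)
```

The value returned for each step would be written into the Adam optimizer's learning rate before that step is taken.
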

Key Experimental Results

Main Results (ArtHOI-RGBD Dataset)

Object	Method	CD(mm)↓	MSSD(mm)↓	F10↑	F5↑
Headphone	RSRD (pre-scanned)	14.71	41.06	41.67	20.91
Headphone	ArtHOI	8.12	30.43	69.68	42.19
CD Drive	RSRD	282.33	348.59	10.90	6.92
CD Drive	ArtHOI	3.33	9.71	96.01	78.75
Stapler	RSRD	288.70	363.92	0.80	0.34
Stapler	ArtHOI	4.49	20.15	91.63	67.94
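
For readers unfamiliar with the table's metrics, the sketch below computes a Chamfer distance and an F-score between two point sets. It assumes that CD is the symmetric mean of nearest-neighbor distances and that F10 and F5 are F-scores (in percent) at 10 mm and 5 mm thresholds; the summary does not define them, and conventions vary between papers, so the numbers in the table may follow a different variant.

```python
import numpy as np


def nearest_distances(src: np.ndarray, dst: np.ndarray) -> np.ndarray:
    """Distance from each point in src (N, 3) to its nearest point in dst (M, 3)."""
    d = np.linalg.norm(src[:, None, :] - dst[None, :, :], axis=2)
    return d.min(axis=1)


def chamfer_distance(pred: np.ndarray, gt: np.ndarray) -> float:
    """Symmetric Chamfer distance (assumed variant: mean of the two directions)."""
    return float(0.5 * (nearest_distances(pred, gt).mean()
                        + nearest_distances(gt, pred).mean()))


def f_score(pred: np.ndarray, gt: np.ndarray, threshold: float) -> float:
    """F-score in percent: harmonic mean of precision and recall at a distance threshold."""
    precision = float((nearest_distances(pred, gt) < threshold).mean())
    recall = float((nearest_distances(gt, pred) < threshold).mean())
    if precision + recall == 0.0:
        return 0.0
    return 100.0 * 2.0 * precision * recall / (precision + recall)
```

With coordinates in millimeters, `f_score(pred, gt, 10.0)` and `f_score(pred, gt, 5.0)` would correspond to the assumed F10 and F5 columns.
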

Ablation Study (Key Module Contributions Inferred from ARCTIC Dataset Results)

Configuration	Description
w/o ASR	FoundationPose directly estimates scale and pose; unstable due to mesh/depth inconsistency
w/o MLLM Alignment	Hand-object configurations exhibit penetration or separation
Full ArtHOI	Physically plausible 4D HOI reconstruction

Key Findings

ArtHOI requires no pre-scanned objects yet surpasses RSRD, which does require pre-scanning, on most objects (CD Drive: CD drops from 282.33 mm to 3.33 mm, roughly 85× lower)
The method is also competitive on the RSRD dataset, with clear advantages on certain objects (Scissor, Sunglasses)
MLLM contact inference accuracy is sufficient to guide optimization, though errors such as left/right hand confusion remain

Highlights & Insights

"Reconciling multiple imperfect priors" paradigm: Individual foundation model predictions may be erroneous, but ASR and the optimization framework can harmonize them, which makes this an instructive design principle
MLLM as a physical prior provider: Using a multimodal language model to infer contact states that constrain physical optimization is an innovative application of MLLMs in 3D reconstruction
Two new dataset contributions: ArtHOI-RGBD and ArtHOI-Wild provide evaluation benchmarks for articulated object HOI

Limitations & Future Work

Processing a single video takes approximately 1 hour, requiring significant efficiency improvements
The cascaded pipeline of multiple foundation models means errors propagate across stages
The quality of part segmentation by PartField directly affects motion reconstruction accuracy
MLLM contact inference remains error-prone (especially under heavy occlusion); learning-based alternatives could be explored

Related Work & Insights

RSRD is the direct competitor but requires pre-scanned objects; ArtHOI eliminates this requirement through foundation model priors
Compared to EasyHOI (per-frame HOI reconstruction), ArtHOI achieves substantial improvements by exploiting temporal consistency
The "multi-foundation-model collaboration" paradigm is broadly applicable to other complex scene understanding tasks

Rating

Novelty: ⭐⭐⭐⭐⭐ First monocular articulated-object HOI 4D reconstruction; significant methodological contribution
Experimental Thoroughness: ⭐⭐⭐⭐ Evaluation on three datasets, though ablations are limited and quantitative isolation of individual module contributions is insufficient
Writing Quality: ⭐⭐⭐⭐ Clear framework presentation, though the four-stage pipeline involves considerable technical detail
Value: ⭐⭐⭐⭐⭐ Pioneering problem formulation with direct applicability to robotics and AR