HORT: Monocular Hand-held Objects Reconstruction with Transformers¶
Paper Information¶
- Conference: ICCV 2025
- arXiv: 2503.21313
- Project Page: https://zerchen.github.io/projects/hort.html
- Area: 3D Vision / Hand-held Object Reconstruction
- Keywords: Hand-Object Reconstruction, Point Cloud, Transformer, Coarse-to-Fine, Monocular 3D
TL;DR¶
This paper proposes HORT, a coarse-to-fine Transformer-based framework that efficiently reconstructs dense 3D point clouds of hand-held objects from monocular images. By integrating image features with 3D hand geometry, HORT jointly predicts the object point cloud and its pose relative to the hand, achieving state-of-the-art performance in both reconstruction accuracy and inference speed.
Background & Motivation¶
Reconstructing the 3D shape of hand-held objects from monocular images has broad applications in action recognition, human-computer interaction, and robotic manipulation. Existing methods face critical bottlenecks:
Implicit representation methods (SDF, etc.):
- Generate overly smooth 3D surfaces that lose geometric detail
- Require Marching Cubes post-processing to obtain explicit meshes, resulting in slow inference (~2 seconds per image)
- Cannot be flexibly applied to downstream tasks
Explicit representation methods:
- HO uses vertex representations but is limited in resolution
- D-SCO reconstructs high-resolution point clouds with a diffusion model, but multi-step denoising makes inference extremely slow (>13 seconds)
The root cause lies in the trade-off between high-quality reconstruction and efficient inference. Furthermore, hand geometry implicitly encodes cues about object geometry and location, yet existing methods fail to fully exploit this information.
Method¶
Overall Architecture¶
HORT adopts a coarse-to-fine two-stage strategy comprising four key modules:
- Image Encoder: Extracts 257 visual feature tokens (256 patch tokens + 1 [CLS] token) using DINOv2-Large
- Hand Encoder: Encodes MANO hand geometry into rich 3D features
- Sparse Point Cloud Decoder: Jointly predicts a sparse point cloud and hand-relative pose
- Dense Point Cloud Decoder: Upsamples to a high-resolution point cloud using pixel-aligned features
Fine-grained Hand Feature Encoding¶
The key innovation lies in encoding 3D hand geometry:
- The MANO model reconstructs a hand mesh \(v_h \in \mathbb{R}^{778\times3}\), driven by joints \(j_h \in \mathbb{R}^{16\times3}\)
- Hand vertices are transformed into 22 local coordinate systems (16 joints + 5 fingertips + 1 palm center)
- The transformed coordinates and absolute vertex indices are concatenated to obtain \(e_h \in \mathbb{R}^{778\times67}\) (22 frames × 3 coordinates + 1 index per vertex)
- A PointNet encodes this into hand features \(f_h \in \mathbb{R}^{1024}\)
This multi-coordinate-system representation captures pose- and shape-related geometric information of the hand, providing a strong structural prior for object reconstruction.
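The dimensionality bookkeeping above (22 frames × 3 coordinates + 1 vertex index = 67) can be sketched in numpy. The local frames, rotations, and index normalization below are illustrative assumptions, not the paper's exact construction:

```python
import numpy as np

def encode_hand(v_h, frame_origins, frame_rotations):
    """Build the per-vertex hand embedding e_h (778 x 67).

    v_h:             (778, 3) MANO hand vertices.
    frame_origins:   (22, 3) origins of the local frames
                     (16 joints + 5 fingertips + 1 palm center).
    frame_rotations: (22, 3, 3) world-to-local rotation matrices.

    Each vertex is expressed in all 22 local frames (22 * 3 = 66 dims),
    then its vertex index is appended (+1 dim) -> 67 dims per vertex.
    """
    n_verts = v_h.shape[0]
    # (22, 778, 3): rotate each centered vertex into each local frame
    local = np.einsum('kij,knj->kni', frame_rotations,
                      v_h[None] - frame_origins[:, None])
    # (778, 66): concatenate the 22 local coordinates per vertex
    coords = local.transpose(1, 0, 2).reshape(n_verts, -1)
    # (778, 67): append the normalized vertex index as an identity cue
    idx = (np.arange(n_verts) / n_verts)[:, None]
    return np.concatenate([coords, idx], axis=1)

# toy example with random geometry (identity rotations for the sketch)
rng = np.random.default_rng(0)
v_h = rng.normal(size=(778, 3))
origins = rng.normal(size=(22, 3))
rots = np.stack([np.eye(3)] * 22)
e_h = encode_hand(v_h, origins, rots)
print(e_h.shape)  # (778, 67)
```

A PointNet would then map this (778, 67) array to the global hand feature \(f_h \in \mathbb{R}^{1024}\).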
Sparse Point Cloud Decoder¶
The reconstruction task is decomposed into two subtasks:
- Canonical object point cloud generation: generated in the palm coordinate system
- Hand-relative object pose estimation: predicts only the 3D translation \(t_o \in \mathbb{R}^3\) relative to the palm, avoiding the ill-posed rotation prediction caused by object symmetry
The decoder employs a unified multi-layer Transformer:
- Defines \(1 + N_p^s\) learnable tokens (1 pose token + \(N_p^s\) point cloud tokens)
- Applies self-attention, followed by cross-attention separately over image features \(f_v\) and hand features \(f_h\)
- A shared backbone enables mutual reinforcement between pose and point cloud predictions
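The token layout and attention order described above can be sketched in numpy with single-head attention; learned projections, LayerNorm, and MLPs are omitted, and \(N_p^s = 512\), \(d = 64\) are assumed values, not the paper's:

```python
import numpy as np

def attention(q, k, v):
    """Scaled dot-product attention over row vectors."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v

d = 64                                 # assumed token dimension
n_points_sparse = 512                  # assumed N_p^s
rng = np.random.default_rng(1)

# 1 pose token + N_p^s point tokens share one query sequence
tokens = rng.normal(size=(1 + n_points_sparse, d))
f_v = rng.normal(size=(257, d))        # image tokens (256 patches + CLS)
f_h = rng.normal(size=(1, d))          # pooled PointNet hand feature

tokens = attention(tokens, tokens, tokens)   # self-attention
tokens = attention(tokens, f_v, f_v)         # cross-attention: image
tokens = attention(tokens, f_h, f_h)         # cross-attention: hand

pose_token, point_tokens = tokens[0], tokens[1:]
# separate heads would map pose_token -> t_o, point_tokens -> p_o^s
print(pose_token.shape, point_tokens.shape)  # (64,) (512, 64)
```

Because pose and point tokens share one sequence, self-attention lets the translation estimate and the canonical shape inform each other at every layer.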
Dense Point Cloud Decoder¶
Upsampling strategy from sparse to dense:
- Pixel-aligned feature extraction: Projects the sparse point cloud onto the image plane using predicted camera parameters and pose: \(f_o = F(\pi(p_o^s + t_p + t_o, K_{cam}), f_v^r)\)
- Local self-attention: Self-attention within KNN neighborhoods (\(k=16\)) to aggregate spatial and visual context
- Two-stage upsampling: Upsamples at \(2\times\) and \(4\times\) respectively, yielding a final dense point cloud of \(N_p^d = 16384\) points
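The pixel-aligned lookup in the first step can be sketched as follows, assuming pinhole intrinsics, points already shifted by \(t_p + t_o\) into camera space, and nearest-neighbor sampling on a 16×16 feature grid (a real implementation would typically use bilinear sampling; all shapes and intrinsics here are illustrative):

```python
import numpy as np

def project(points, K):
    """Pinhole projection pi(p, K): 3D camera-space points -> pixel coords."""
    uvw = points @ K.T
    return uvw[:, :2] / uvw[:, 2:3]

def pixel_aligned_features(points, K, feat_map):
    """Sample a per-point feature from a (H, W, C) spatial feature map
    by nearest-neighbor lookup at each projected location."""
    h, w, _ = feat_map.shape
    uv = np.round(project(points, K)).astype(int)
    u = np.clip(uv[:, 0], 0, w - 1)
    v = np.clip(uv[:, 1], 0, h - 1)
    return feat_map[v, u]

# assumed intrinsics for a 224x224 input image
K = np.array([[500., 0., 112.],
              [0., 500., 112.],
              [0., 0., 1.]])
rng = np.random.default_rng(2)
p_cam = rng.normal(size=(512, 3)) * 0.05 + np.array([0., 0., 0.5])
feat_map = rng.normal(size=(16, 16, 256))   # spatial map f_v^r

# rescale pixel coordinates to the 16x16 feature grid before lookup
K_feat = K.copy()
K_feat[:2] *= 16 / 224.0
f_o = pixel_aligned_features(p_cam, K_feat, feat_map)
print(f_o.shape)  # (512, 256)
```

Each sparse point thus carries an image-conditioned feature into the KNN self-attention and upsampling stages.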
Loss & Training¶
End-to-end training with the following total loss:
- \(\mathcal{L}_{pose}\): \(\ell_1\) loss supervising the 3D translation of the object relative to the palm
- \(\mathcal{L}_{cd}^s\): Chamfer Distance for the sparse point cloud
- \(\mathcal{L}_{cd}^d\): Chamfer Distance for the dense point cloud
- The weights of both Chamfer Distance terms are set to 2
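The total loss can be sketched as follows; the \(\ell_1\) translation term and the weight of 2 on each Chamfer term follow the bullets above, while the mean-of-minima Chamfer variant is an assumption (sum vs. mean conventions vary across implementations):

```python
import numpy as np

def chamfer_distance(p, q):
    """Symmetric Chamfer distance between two point sets (squared L2,
    averaged over each set's nearest-neighbor distances)."""
    d2 = np.sum((p[:, None, :] - q[None, :, :]) ** 2, axis=-1)  # (|P|, |Q|)
    return d2.min(axis=1).mean() + d2.min(axis=0).mean()

def total_loss(t_pred, t_gt, sparse_pred, sparse_gt, dense_pred, dense_gt,
               w_sparse=2.0, w_dense=2.0):
    """L = L_pose + 2 * L_cd^s + 2 * L_cd^d."""
    l_pose = np.abs(t_pred - t_gt).sum()     # l1 loss on translation t_o
    return (l_pose
            + w_sparse * chamfer_distance(sparse_pred, sparse_gt)
            + w_dense * chamfer_distance(dense_pred, dense_gt))

rng = np.random.default_rng(3)
t = rng.normal(size=3)
sp = rng.normal(size=(128, 3))
dp = rng.normal(size=(512, 3))
loss = total_loss(t, t, sp, sp, dp, dp)
print(loss)  # 0.0 when predictions match ground truth exactly
```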
Key Experimental Results¶
Main Results: Comparison on ObMan Dataset¶
| Method | FS@5 ↑ | FS@10 ↑ | CD ↓ |
|---|---|---|---|
| HO | 0.23 | 0.56 | 6.4 |
| AlignSDF | 0.40 | 0.64 | 9.2 |
| gSDF | 0.44 | 0.66 | 8.8 |
| DDF-HO | 0.55 | 0.67 | 1.4 |
| D-SCO | 0.61 | 0.81 | 1.1 |
| HORT (Ours) | 0.66 | 0.88 | 1.0 |
HORT outperforms D-SCO on all metrics, with FS@5 improving by +0.05 and FS@10 by +0.07.
HO3D and DexYCB Real-world Datasets¶
| Method | HO3D FS@5 ↑ | HO3D CD ↓ | DexYCB FS@5 ↑ | DexYCB CD ↓ |
|---|---|---|---|---|
| D-SCO | 0.38 | 3.2 | 0.48 | 2.9 |
| HORT | 0.41 | 2.5 | 0.52 | 2.5 |
HORT maintains state-of-the-art performance on real-world datasets as well.
Ablation Study: Encoder Design¶
| Config | Palm | Joints | Image Encoder | FS@5 ↑ | CD ↓ |
|---|---|---|---|---|---|
| R1 | × | × | Fine-tune | 0.45 | 3.1 |
| R2 | ✓ | × | Fine-tune | 0.53 | 2.4 |
| R3 | ✓ | ✓ | Fine-tune | 0.60 | 1.8 |
| R4 | ✓ | ✓ | Scratch | 0.51 | 2.6 |
| R5 | ✓ | ✓ | Frozen | 0.48 | 2.9 |
Key findings: (1) Multi-coordinate-system hand encoding is critical (R1→R3: FS@5 improves by 0.15); (2) The fine-tuning strategy for the image encoder has a significant impact.
Inference Speed Comparison¶
HORT processes a single image in approximately 0.08 seconds (including mesh extraction), compared to 13 seconds for D-SCO and ~2 seconds for implicit methods — a 162× speedup.
Highlights & Insights¶
- Optimal speed-quality trade-off: HORT achieves superior reconstruction quality while being two orders of magnitude faster than D-SCO
- Effective exploitation of hand priors: Multi-coordinate-system transformation combined with PointNet-encoded hand geometry provides strong shape constraints for object reconstruction
- Advantages of end-to-end training: Compared to the modular training of D-SCO, end-to-end optimization enables better synergy among components
- Suitable as initialization for optimization-based methods: The feed-forward predictions can accelerate subsequent optimization-based refinement
Limitations & Future Work¶
- Object rotation is not predicted (due to symmetry issues), leaving pose estimation incomplete for asymmetric objects
- Performance depends on the quality of hand pose estimation; failures in hand reconstruction degrade object reconstruction
- Point cloud representations cannot be directly used in applications requiring mesh topology (e.g., physics simulation)
Related Work & Insights¶
- Hand-held object reconstruction: IHOI, AlignSDF, gSDF, D-SCO
- Implicit 3D representations: SDF, DDF, NeRF
- Point cloud generation: PoinTr, SnowflakeNet, Michelangelo
- Hand reconstruction: MANO, HaMeR, WiLoR
Rating¶
- Novelty: ⭐⭐⭐⭐ — The coarse-to-fine Transformer framework combined with hand geometry encoding is novel and practical
- Technical Depth: ⭐⭐⭐⭐ — Multi-coordinate-system hand encoding and joint decoding design are elegant
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ — Four datasets, comprehensive ablations, and inference speed comparisons
- Value: ⭐⭐⭐⭐⭐ — 0.08-second inference enables real-time application potential