EgoHandICL: Egocentric 3D Hand Reconstruction with In-Context Learning

Conference: ICLR 2026
arXiv: 2601.19850
Code: Available
Area: Multimodal VLM
Keywords: Egocentric view, 3D hand reconstruction, in-context learning, vision-language model, MANO

TL;DR

This work is the first to bring the in-context learning (ICL) paradigm to 3D hand reconstruction. Combining VLM-guided template retrieval, a multimodal ICL tokenizer, and an MAE-driven reconstruction pipeline, EgoHandICL outperforms state-of-the-art methods on the ARCTIC and EgoExo4D benchmarks, for example reducing P-MPVPE on ARCTIC from 5.5 (WiLoR, the next-best method) to 3.8.

Background & Motivation

Egocentric 3D hand reconstruction faces three core challenges: depth ambiguity, self-occlusion, and complex hand-object interactions. Existing methods address these by scaling training data or incorporating auxiliary cues, yet they still underperform under severe occlusion and in unseen scenarios.

Current limitations:

- SOTA models such as WiLoR and HaMeR perform well in general settings but tend to miss hands, confuse left and right hands, and distort occluded regions in difficult cases such as crossed-hand occlusion or hands blending into the background.
- Methods like WildHand rely on auxiliary supervision signals that require additional annotations and still cannot resolve severe occlusion.

Humans resolve visual ambiguity by drawing on prior experience and contextual reasoning—a concept that aligns naturally with ICL. ICL adapts to new problems by conditioning on a few relevant examples without updating model parameters. This paper is the first to introduce the ICL paradigm to 3D hand reconstruction.

Method

Overall Architecture

EgoHandICL consists of three core components:

Template Retrieval (Part A): A VLM-guided complementary retrieval strategy that selects contextually relevant example images.
ICL Tokenizer (Part B): Integrates image, structural, and textual modalities to construct unified ICL tokens.
MAE-Style Reconstruction (Part C): Trains a Transformer via masked autoencoding to perform context-driven hand reconstruction.

Key Designs

1. Template Retrieval Strategy

Two complementary strategies are used to retrieve templates relevant to the query image from a database:

Predefined Visual Templates: A VLM (Qwen2.5-VL-72B) classifies each image into one of four hand participation modes (left hand only, right hand only, both hands, no hands), and visually consistent examples of the same mode are retrieved.

Adaptive Text Templates: Semantic descriptions generated by the VLM are used to retrieve templates based on textual similarity:

- Descriptive prompts: describe occlusion and interaction details.
- Reasoning prompts: provide guidance for handling occlusion and complex interactions (used under severe occlusion).

One template image is retrieved per query. The two strategies are complementary, jointly ensuring semantic alignment and visual consistency.
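
The notes do not say how the two strategies are combined. One plausible reading, sketched here purely as an assumption, is to filter the database by hand participation mode and then rank the survivors by text-embedding similarity. The record fields (`mode`, `text_emb`), the toy embeddings, and the function names are hypothetical and are not taken from the released code.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def retrieve_template(query, database):
    """Return the single database entry that best matches the query."""
    # Step 1 (predefined visual templates): keep entries with the same
    # hand participation mode as the query.
    same_mode = [e for e in database if e["mode"] == query["mode"]]
    candidates = same_mode or database  # assumed fallback if nothing matches
    # Step 2 (adaptive text templates): rank by similarity of the
    # VLM-description embeddings and keep the top one.
    return max(candidates, key=lambda e: cosine(e["text_emb"], query["text_emb"]))

# Toy database; real embeddings would come from a text encoder.
database = [
    {"id": "a", "mode": "both", "text_emb": [1.0, 0.0]},
    {"id": "b", "mode": "both", "text_emb": [0.6, 0.8]},
    {"id": "c", "mode": "left", "text_emb": [0.0, 1.0]},
]
query = {"mode": "both", "text_emb": [0.5, 0.9]}
best = retrieve_template(query, database)
print(best["id"])
```

In this toy case entry `c` is textually close to the query but is excluded by the mode filter, which illustrates how the visual and textual criteria can constrain each other.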

2. ICL Tokenizer

Four sets of ICL tokens are constructed, an input set and a target set for each of the template and query images:

\(T_{\text{tpl}}^{\text{in}}\) (template input), \(T_{\text{tpl}}^{\text{tar}}\) (template target)
\(T_{\text{qry}}^{\text{in}}\) (query input), \(T_{\text{qry}}^{\text{tar}}\) (query target)

Three modalities are encoded:

- Image tokens \(F_i\): Extracted by a pretrained ViT encoder (backbone shared with WiLoR) to capture appearance and spatial details.
- Structural tokens \(F_m\): A MANO encoder maps coarse/ground-truth MANO parameters to tokens that preserve 3D hand joint and shape priors.
- Text tokens \(F_t\): A Qwen-7B text encoder embeds VLM-generated semantic descriptions.

The three modality tokens are fused via cross-attention to produce unified ICL tokens.
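
The notes do not specify which modality acts as the query in the cross-attention. As a minimal sketch, assume the image tokens attend to the concatenated structural and text tokens, with a residual connection and no learned projections. All of these choices, along with the toy 4-dimensional tokens, are simplifying assumptions for illustration and not the authors' architecture.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, context):
    """Single-head scaled dot-product cross-attention (no learned weights)."""
    d = len(queries[0])
    fused = []
    for q in queries:
        # Similarity of this query token to every context token.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in context]
        weights = softmax(scores)
        # Attention output: weighted average of the context tokens.
        attended = [sum(w * v[j] for w, v in zip(weights, context))
                    for j in range(d)]
        # Residual connection keeps the original image information.
        fused.append([qi + ai for qi, ai in zip(q, attended)])
    return fused

F_i = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]  # image tokens
F_m = [[0.5, 0.5, 0.0, 0.0]]                        # structural (MANO) tokens
F_t = [[0.0, 0.0, 1.0, 0.0]]                        # text tokens

icl_tokens = cross_attention(F_i, F_m + F_t)
print(len(icl_tokens), len(icl_tokens[0]))
```

The output has one fused token per image token, each carrying a mixture of structural and textual context.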

A key design choice is to use a unified MANO parameterization for both inputs and outputs, ensuring structural consistency between query and template and bridging the modality gap between 2D visual inputs and 3D parametric outputs.

3. MAE-Style Masked Reconstruction

Core challenge: ground-truth annotations for both template and query are available during training, but the query target is unknown at inference.

Solution:

- Training: Target tokens for both template and query (\(T_{\text{tpl}}^{\text{tar}}\) and \(T_{\text{qry}}^{\text{tar}}\)) are randomly partially masked; the optimal masking ratio is 70%.
- Inference: Query target tokens are fully masked, and the Transformer decodes (reconstructs) the query's MANO parameters from the remaining ICL context.

This design exposes the model during training to the same missing-target condition it faces at inference, so it learns to infer missing information from contextual examples.
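
The training and inference masking schemes can be sketched as follows. This is an illustrative approximation only: sampling masked positions jointly over both target sets, keeping the template target fully visible at inference, and using `None` in place of a learned mask token are all assumptions, and `mask_targets` is a hypothetical helper.

```python
import random

MASK = None  # stands in for a learned mask token

def mask_targets(tpl_tar, qry_tar, training, ratio=0.7, rng=None):
    """Return (template_target, query_target) token lists after masking."""
    rng = rng or random.Random(0)  # fixed seed so the demo is repeatable
    if not training:
        # Inference: the query target is unknown, so mask it completely;
        # the template target stays visible as context (assumption).
        return list(tpl_tar), [MASK] * len(qry_tar)
    # Training: randomly mask a fraction `ratio` of all target tokens.
    targets = list(tpl_tar) + list(qry_tar)
    n_mask = round(ratio * len(targets))
    masked_idx = set(rng.sample(range(len(targets)), n_mask))
    masked = [MASK if i in masked_idx else t for i, t in enumerate(targets)]
    return masked[:len(tpl_tar)], masked[len(tpl_tar):]

tpl = ["t0", "t1", "t2", "t3", "t4"]
qry = ["q0", "q1", "q2", "q3", "q4"]

train_tpl, train_qry = mask_targets(tpl, qry, training=True)
infer_tpl, infer_qry = mask_targets(tpl, qry, training=False)
print(train_tpl, train_qry)
print(infer_tpl, infer_qry)
```

With ten target tokens and a 70% ratio, seven positions are hidden during training, while inference hides exactly the five query-target positions.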

Loss & Training

Supervision is applied at three levels (parameter, vertex, and perceptual):

\[\mathcal{L} = \lambda_m \mathcal{L}_{mano} + \lambda_v \mathcal{L}_V + \lambda_{3D} \mathcal{L}_{3D}\]

MANO parameter loss: \(\mathcal{L}_{mano} = \|\Theta - \Theta^{gt}\|_2^2 + \|\beta - \beta^{gt}\|_2^2 + \|\Phi - \Phi^{gt}\|_2^2\)
Vertex loss: \(\mathcal{L}_V = \|V_{3D} - V_{3D}^{gt}\|_1\)
3D perceptual loss (novel contribution): \(\mathcal{L}_{3D} = \|\phi(\mathcal{P}) - \phi(\mathcal{P}^{gt})\|_2^2\), using Uni3D-ti as the 3D feature encoder \(\phi\) to reinforce semantic consistency under occlusion.

For datasets lacking MANO ground truth (e.g., EgoExo4D), constraints on 3D joint keypoints are used as a substitute.

Loss weights: \(\lambda_m = 0.05\), \(\lambda_v = 5.0\), \(\lambda_{3D} = 0.01\). The model is trained for 100 epochs on a single RTX 4090.
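
To make the weighting concrete, the combined objective can be evaluated on toy values as below. This is a sketch under stated assumptions: the dictionary keys are hypothetical, each norm uses a plain sum reduction, vertices are flattened into one list, and the `feat3d` vectors stand in for the outputs of the 3D feature encoder \(\phi\) rather than real Uni3D-ti features.

```python
LAMBDA_M, LAMBDA_V, LAMBDA_3D = 0.05, 5.0, 0.01  # weights quoted in the notes

def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def l1(a, b):
    """L1 distance between two equal-length vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))

def total_loss(pred, gt):
    """Weighted sum of parameter-, vertex-, and perceptual-level terms."""
    l_mano = (squared_l2(pred["theta"], gt["theta"])
              + squared_l2(pred["beta"], gt["beta"])
              + squared_l2(pred["phi"], gt["phi"]))
    l_v = l1(pred["verts"], gt["verts"])
    l_3d = squared_l2(pred["feat3d"], gt["feat3d"])
    return LAMBDA_M * l_mano + LAMBDA_V * l_v + LAMBDA_3D * l_3d

# Toy values chosen only so the arithmetic is easy to follow.
pred = {"theta": [1.0, 0.0], "beta": [0.0], "phi": [2.0],
        "verts": [0.1, 0.2], "feat3d": [1.0, 1.0]}
gt = {"theta": [0.0, 0.0], "beta": [0.0], "phi": [0.0],
      "verts": [0.0, 0.0], "feat3d": [0.0, 0.0]}

loss = total_loss(pred, gt)
print(round(loss, 4))
```

Here the MANO term is 5, the vertex term is 0.3, and the perceptual term is 2, so the weighted total is 0.05·5 + 5.0·0.3 + 0.01·2 = 1.77. The large vertex weight means mesh error dominates the total in this toy example.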

Key Experimental Results

Main Results

ARCTIC dataset (hand mesh reconstruction, 118.2K train / 16.9K test):

Method	P-MPJPE↓	P-MPVPE↓	F@5↑	F@15↑	Two-hand P-MPVPE↓	MRRPE↓
HaMeR	9.9	9.6	0.046	0.911	9.9	10.1
WiLoR	5.5	5.5	0.524	0.994	5.7	9.8
WildHand	5.8	5.6	0.746	0.928	4.9	7.1
EgoHandICL	4.0	3.8	0.801	0.996	3.7	6.2

Compared with the second-best method on each metric, P-MPVPE improves by 31.1% in the general setting and by 24.5% in the two-hand setting, and MRRPE decreases by 12%.
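
The relative gains can be rechecked from the table with a few lines of arithmetic. The sketch below uses only the rounded ARCTIC entries listed above, so its results (about 30.9%, 24.5%, and 12.7%) can differ slightly from the quoted percentages, which were presumably computed from unrounded values. The dictionary layout and function name are illustrative.

```python
def relative_reduction(ours, baseline):
    """Percentage by which `ours` lowers an error metric versus `baseline`."""
    return 100.0 * (baseline - ours) / baseline

# Lower-is-better ARCTIC metrics copied from the table (rounded values).
baselines = {
    "P-MPVPE": {"HaMeR": 9.6, "WiLoR": 5.5, "WildHand": 5.6},
    "Two-hand P-MPVPE": {"HaMeR": 9.9, "WiLoR": 5.7, "WildHand": 4.9},
    "MRRPE": {"HaMeR": 10.1, "WiLoR": 9.8, "WildHand": 7.1},
}
ours = {"P-MPVPE": 3.8, "Two-hand P-MPVPE": 3.7, "MRRPE": 6.2}

gains = {}
for metric, scores in baselines.items():
    second_best = min(scores.values())  # strongest competing method
    gains[metric] = relative_reduction(ours[metric], second_best)

for metric, gain in gains.items():
    print(f"{metric}: {gain:.1f}% lower than second best")
```

Note that the second-best method differs by metric: WiLoR for P-MPVPE, WildHand for the two-hand setting and MRRPE.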

EgoExo4D dataset (joint estimation, 17.3K train / 4.1K test):

Method	MPJPE↓	P-MPJPE↓	F@10↑	F@15↑	Two-hand MRRPE↓
PCIE-EgoHandPose	25.5	8.5	0.544	0.910	130.9
WiLoR	31.1	12.5	0.528	0.905	378.0
EgoHandICL	21.1	7.7	0.789	0.935	110.9

Ablation Study

Backbone generality (ARCTIC dataset):

Configuration	P-MPVPE↓	Gain over backbone
EgoHandICL + HaMeR	8.1	+10.4%
EgoHandICL + WildHand	4.9	+12.5%
EgoHandICL + WiLoR	3.8	+30.9%

ICL consistently yields substantial improvements regardless of the coarse MANO backbone used.

Masking ratio: A 70% masking ratio achieves the best performance (P-MPVPE=3.8, F@5=0.801), consistent with MAE findings—higher masking encourages the model to exploit stronger contextual cues.

Loss function ablation:

Loss combination	P-MPVPE↓	F@5↑
\(\mathcal{L}_V\) only	4.7	0.6
+ \(\mathcal{L}_{mano}\)	4.3	0.6
+ \(\mathcal{L}_{3D}\)	3.9	0.7
+ \(\mathcal{L}_{mano}\) + \(\mathcal{L}_{3D}\)	3.8	0.8

Key Findings

ICL enables EgoHandICL to substantially outperform direct regression methods in occlusion and crossed-hand scenarios.
Contextual reasoning analysis confirms that the model genuinely leverages retrieved templates for inference rather than simply imitating them.
The full model achieves the best performance across all hand participation types, indicating that the gains from ICL generalize across hand configurations.
VLM reasoning prompts are more effective than descriptive prompts, indicating that semantic reasoning ability enhances retrieval quality.
EgoHandICL can be integrated into EgoVLM to improve hand-object interaction reasoning (avg. +3%).

Highlights & Insights

First application of ICL to 3D hand reconstruction: The modality gap between 2D images and 3D meshes is bridged through MANO parameterization of both inputs and outputs.
VLM as a retrieval engine: The semantic understanding capabilities of large models enable more robust context-relevant template selection compared to purely visual retrieval.
Elegant combination of MAE and ICL: Partial masking during training simulates information absence at inference, providing a generalizable paradigm for visual ICL.
Strong practical utility: EgoHandICL can serve as a plug-in to enhance existing hand reconstruction methods (10–31% improvement) and boost EgoVLM reasoning capability.

Limitations & Future Work

Only one template is retrieved per query; whether multi-template ICL yields further gains remains to be investigated.
Template retrieval requires a VLM (72B parameters) for preprocessing; data preparation requires 4× A100 GPUs, making deployment costly.
Validation is limited to laboratory (ARCTIC) and semi-controlled (EgoExo4D) settings; robustness in complex industrial-scale scenarios is yet to be examined.
The expressiveness of the MANO model itself constrains the modeling of extreme hand poses and deformations.
Temporal ICL reasoning over video sequences has not been explored.

Related Work & Insights

HaMeR/WiLoR: Large-scale ViT-based image-to-MANO regression; serve as baseline backbones in this work.
Visual ICL (PIC/HiC): Explores ICL for point cloud recognition and human motion, but does not address the 2D→3D modality gap.
MAE: The masked autoencoding paradigm provides a solution to the training-inference asymmetry inherent in ICL.
Broader implication: The combination of ICL paradigm and VLM-guided retrieval is generalizable to other 3D reconstruction tasks involving occlusion or ambiguity (e.g., body pose estimation, object reconstruction).

Rating

Novelty: ★★★★★ — First introduction of ICL to 3D hand reconstruction; both problem formulation and framework design are original.
Technical Depth: ★★★★☆ — The multimodal tokenizer and MAE-based training strategy are elegantly designed.
Experimental Rigor: ★★★★★ — Multi-metric validation on two datasets with comprehensive ablation studies.
Practical Value: ★★★★☆ — Open-source code available; can serve as a plug-in to enhance existing methods.
Clarity: ★★★★☆ — Illustrations are clear and component logic is well-structured.