Enhancing Hands in 3D Whole-Body Pose Estimation with Conditional Hands Modulator

Conference: CVPR 2026
arXiv: 2603.14726
Code: Available
Area: 3D Vision
Keywords: whole-body pose estimation, hand pose, SMPL-X, feature modulation, modular framework

TL;DR

This paper proposes Hand4Whole++, a modular framework that injects features from a pretrained hand estimator into a frozen whole-body pose estimator through a lightweight module, CHAM (Conditional Hands Modulator). The modulated whole-body estimator predicts accurate wrist orientation, while fine-grained finger poses and hand shape are transferred from the hand model via differentiable rigid alignment.

Background & Motivation

3D whole-body pose estimation faces a fundamental supervision gap:

Whole-body datasets (e.g., AGORA, ARCTIC): provide full-body annotations but with limited hand pose diversity
Hand datasets (e.g., InterHand2.6M): provide fine-grained finger annotations but lack full-body context

This leads to two problems:
1. Whole-body estimators (e.g., SMPLer-X) capture global structure but suffer from insufficient hand accuracy
2. Hand estimators (e.g., WiLoR, HaMeR) achieve high finger accuracy but lack global body awareness

A naïve combination (directly attaching the hand estimator's outputs to the body) makes the wrist orientation inconsistent with the upper-limb kinematic chain, resulting in physically implausible poses.

Core challenge: How to obtain fine-grained hand details while maintaining whole-body consistency?

Method

Overall Architecture

Hand4Whole++ consists of four components:
1. Pretrained whole-body pose estimator (SMPLer-X-L32), frozen
2. Pretrained hand pose estimator (WiLoR), frozen
3. CHAM (Conditional Hands Modulator), the only trainable module
4. Finger joint and shape transfer module

Pipeline: input image → hand estimator extracts hand features → CHAM modulates whole-body feature flow → whole-body estimator predicts SMPL-X parameters (wrist orientation from whole-body, fingers from hand model).

Key Designs

1. CHAM: Conditional Hands Modulator

Architecture:
- Extracts final-layer features for left and right hands from WiLoR's ViT backbone
- Adds 2D positional encodings (preserving spatial location of hands in the full-body image)
- Three-layer cross-attention Transformer encoder (activated only when both hands are detected, modeling bilateral hand relationships)
- Two independent branches (left/right hand), each containing 24 \(1\times1\) convolutional layers (corresponding to SMPLer-X's 24 Transformer blocks)
- All convolutional layers zero-initialized (ControlNet design, ensuring a neutral starting state)
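The zero-initialization point can be illustrated with a minimal NumPy sketch. It assumes the projected hand feature is added residually to each frozen block's feature map, as in ControlNet; the summary does not spell out the injection operator, and the class name `ZeroConv1x1`, the helper `modulate`, and all channel sizes are hypothetical.

```python
import numpy as np

class ZeroConv1x1:
    """A 1x1 convolution, i.e. a per-location linear map over channels."""

    def __init__(self, in_ch, out_ch):
        # Zero initialization: the layer outputs exactly zero before training.
        self.weight = np.zeros((out_ch, in_ch))
        self.bias = np.zeros(out_ch)

    def __call__(self, x):
        # x: (in_ch, H, W) -> (out_ch, H, W)
        return np.einsum("oc,chw->ohw", self.weight, x) + self.bias[:, None, None]

def modulate(body_feat, hand_feat, conv):
    """Inject hand features into a block's feature map (additive form is an assumption)."""
    return body_feat + conv(hand_feat)

rng = np.random.default_rng(0)
num_blocks, hand_ch, body_ch, h, w = 24, 8, 16, 4, 3  # toy sizes, not the paper's
convs = [ZeroConv1x1(hand_ch, body_ch) for _ in range(num_blocks)]  # one per block

body_feat = rng.normal(size=(body_ch, h, w))
hand_feat = rng.normal(size=(hand_ch, h, w))

# Before training, modulation is a no-op: the frozen estimator behaves as usual.
out_initial = modulate(body_feat, hand_feat, convs[0])

# Once the weights move away from zero, the hand features start to matter.
convs[0].weight[:] = 0.1
out_trained = modulate(body_feat, hand_feat, convs[0])
```

Under these assumptions `out_initial` equals `body_feat` exactly, which is the "neutral starting state": training begins from the unmodified whole-body estimator and only gradually departs from it.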

Spatial alignment: Hand features are mapped back to the whole-body feature map space via inverse affine transformation, with zero-padding for non-hand regions; left and right branches are merged via element-wise maximum.
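A simplified sketch of this placement and merge step follows. It assumes the inverse affine transform reduces to an axis-aligned box with nearest-neighbor resampling; the real transform may include rotation and interpolation. The function name `paste_hand_features` and the box format are hypothetical.

```python
import numpy as np

def paste_hand_features(hand_feat, box, body_hw):
    """Place a (C, h, w) hand feature map into a zero-filled (C, H, W) canvas.

    box = (x0, y0, x1, y1) in body-feature-map cells (hypothetical format).
    Cells outside the box stay zero, matching the zero-padding described above.
    """
    c, h, w = hand_feat.shape
    H, W = body_hw
    x0, y0, x1, y1 = box
    canvas = np.zeros((c, H, W), dtype=hand_feat.dtype)
    ys = np.arange(max(y0, 0), min(y1, H))
    xs = np.arange(max(x0, 0), min(x1, W))
    if ys.size == 0 or xs.size == 0:
        return canvas  # hand box lies entirely outside the body feature map
    # Inverse mapping: for each canvas cell inside the box, look up its source cell.
    src_y = np.clip(((ys - y0 + 0.5) * h / (y1 - y0)).astype(int), 0, h - 1)
    src_x = np.clip(((xs - x0 + 0.5) * w / (x1 - x0)).astype(int), 0, w - 1)
    canvas[:, ys[:, None], xs[None, :]] = hand_feat[:, src_y[:, None], src_x[None, :]]
    return canvas

body_hw = (4, 6)
left_feat = np.ones((2, 2, 2))        # toy left-hand features
right_feat = 2.0 * np.ones((2, 2, 2)) # toy right-hand features

left_canvas = paste_hand_features(left_feat, (0, 0, 2, 2), body_hw)
right_canvas = paste_hand_features(right_feat, (4, 2, 6, 4), body_hw)

# Merge the two branches with an element-wise maximum.
merged = np.maximum(left_canvas, right_canvas)
```

Here `merged` carries the left-hand features in the top-left box, the right-hand features in the bottom-right box, and zeros elsewhere.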

Key insight: CHAM improves not only wrist orientation but also the entire upper-limb kinematic chain (shoulder, elbow, wrist) through the whole-body feature flow, indirectly enhancing overall pose quality. The additional overhead is only ~10ms (~10% of total runtime).

2. Finger Joint and Hand Shape Transfer

Step	Operation	Source
Finger pose	Use MANO parameters \(\theta_{rh}, \theta_{lh}\)	Hand estimator
Hand shape	Use MANO parameters \(\beta_{rh}, \beta_{lh}\)	Hand estimator
Wrist orientation	Discard hand estimator prediction; use SMPL-X	Whole-body estimator (CHAM-modulated)
Alignment	Rigid alignment based on wrist and four MCP joints	Differentiable operation
Boundary smoothing	Laplacian smoothing at seams	Post-processing

Design rationale: The alignment step is fully differentiable, allowing gradients to back-propagate through CHAM for wrist orientation optimization. MANO's hand shape space is more expressive than SMPL-X's (which jointly encodes body, hands, and face in a shared latent space).
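One standard way to compute a rigid alignment from a handful of corresponding joints is the Kabsch algorithm; the summary does not name the solver the authors use, so the sketch below is an assumption. The joint coordinates are synthetic, and `rigid_align` is a hypothetical helper name.

```python
import numpy as np

def rigid_align(src, dst):
    """Least-squares rotation R and translation t such that R @ src_i + t ~ dst_i (Kabsch)."""
    src_c, dst_c = src.mean(axis=0), dst.mean(axis=0)
    H = (src - src_c).T @ (dst - dst_c)      # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    # Flip the last axis if needed so that R is a proper rotation (det = +1).
    d = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = dst_c - R @ src_c
    return R, t

rng = np.random.default_rng(0)
mano_joints = rng.normal(size=(5, 3))        # synthetic wrist + four MCP joints (hand model)

# Synthetic "whole-body" joints: the same five points under a known rigid motion.
angle = np.deg2rad(30.0)
R_true = np.array([[np.cos(angle), -np.sin(angle), 0.0],
                   [np.sin(angle),  np.cos(angle), 0.0],
                   [0.0,            0.0,           1.0]])
t_true = np.array([0.1, -0.2, 0.3])
smplx_joints = mano_joints @ R_true.T + t_true

R, t = rigid_align(mano_joints, smplx_joints)

# Apply the recovered transform to the full hand mesh (MANO has 778 vertices).
hand_verts = rng.normal(size=(778, 3))
aligned_verts = hand_verts @ R.T + t
aligned_joints = mano_joints @ R.T + t
```

Every step here is a mean, a matrix product, or an SVD, so the same computation written in an autodiff framework lets gradients flow from the aligned hand back to the wrist pose that produced `smplx_joints`.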

Loss & Training

Both pretrained estimators are frozen during training; only CHAM is optimized:

Pose loss: \(\ell_1\) distance between predicted and GT 3D joint rotations; for hand datasets, the annotated wrist rotation is converted to a global wrist orientation via forward kinematics
Shape loss: \(\ell_1\) for whole-body datasets; \(\ell_2\) regularization for hand datasets
2D/3D keypoint loss: \(\ell_1\) loss, reference frame selected by dataset type (pelvis/right wrist/wrist-relative)
Body root pose regularization: When full-body annotations are absent in hand datasets, the SMPL-X root orientation is regularized to maintain an upright torso
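The way these terms switch with the dataset type can be sketched as below. The loss weights (all 1 here), the dictionary keys, and the exact form of the root regularizer (an \(\ell_1\) penalty toward a reference orientation) are assumptions; the summary only states that such terms exist.

```python
import numpy as np

def l1(a, b):
    return float(np.mean(np.abs(np.asarray(a, dtype=float) - np.asarray(b, dtype=float))))

def training_loss(pred, gt, hand_dataset, upright_root, ref_joint=0):
    """Combine the loss terms listed above; equal weights are an assumption."""
    terms = {"pose": l1(pred["pose"], gt["pose"])}
    if hand_dataset:
        # Hand datasets: only regularize shape (l2), and keep the torso upright.
        terms["shape"] = float(np.mean(np.square(pred["shape"])))
        terms["root_reg"] = l1(pred["root_orient"], upright_root)
    else:
        # Whole-body datasets: supervise shape directly with l1.
        terms["shape"] = l1(pred["shape"], gt["shape"])
    # Keypoints are compared relative to a dataset-specific reference joint.
    pred_rel = np.asarray(pred["joints"]) - np.asarray(pred["joints"])[ref_joint]
    gt_rel = np.asarray(gt["joints"]) - np.asarray(gt["joints"])[ref_joint]
    terms["keypoints"] = l1(pred_rel, gt_rel)
    terms["total"] = sum(terms.values())
    return terms

upright = np.zeros(3)  # placeholder reference orientation, not the paper's value
gt = {"pose": np.zeros((4, 3)), "shape": np.array([1.0, -1.0]),
      "joints": np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])}
pred = {"pose": np.zeros((4, 3)), "shape": np.array([3.0, 4.0]),
        "joints": gt["joints"] + 2.0, "root_orient": np.zeros(3)}

body_terms = training_loss(pred, gt, hand_dataset=False, upright_root=upright)
hand_terms = training_loss(pred, gt, hand_dataset=True, upright_root=upright)
```

With these toy inputs the keypoint term vanishes in both cases because the prediction differs from the ground truth only by a global translation, which the reference-joint subtraction removes.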

Training data: InterHand2.6M, ReInterHand, ARCTIC, AGORA. Training runs for 4 epochs with batch size 32 and takes ~20 hours on a single RTX A6000.

Key Experimental Results

Main Results

Table 1: Comparison with baselines on whole-body datasets (full-body/hands MPVPE, mm) and hand datasets (MPVPE/MRRPE, mm)

Method	AGORA Full/Hands	ARCTIC Full/Hands	EHF Full/Hands	IH26M MPVPE/MRRPE	ReIH MPVPE/MRRPE
Original whole-body model	85.61/52.31	56.06/31.48	63.26/46.21	38.64/119.56	58.86/101.82
Fine-tuned whole-body model	90.77/55.91	67.52/29.03	126.34/57.35	20.00/47.89	24.87/28.32
Hand-only model	-/99.11	-/46.79	-/46.28	11.17/94817	8.09/3094
Hand4Whole++	76.84/49.71	45.95/25.03	61.24/33.43	9.40/32.30	7.98/16.37
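For readers unfamiliar with the two metrics, the sketch below follows their usual definitions: MPVPE is the mean per-vertex position error after root alignment, and MRRPE is the error in the position of the left-hand root relative to the right-hand root. The exact alignment protocol used in the paper is not given in this summary, so treat the details as assumptions.

```python
import numpy as np

def mpvpe(pred_verts, gt_verts, pred_root, gt_root):
    """Mean per-vertex position error after translating both meshes to their root."""
    diff = (pred_verts - pred_root) - (gt_verts - gt_root)
    return float(np.linalg.norm(diff, axis=-1).mean())

def mrrpe(pred_right_root, pred_left_root, gt_right_root, gt_left_root):
    """Error of the left-hand root position expressed relative to the right-hand root."""
    pred_rel = pred_left_root - pred_right_root
    gt_rel = gt_left_root - gt_right_root
    return float(np.linalg.norm(pred_rel - gt_rel))

# Toy example: a mesh predicted with perfect shape but a global offset.
gt_verts = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
offset = np.array([5.0, 5.0, 5.0])
vertex_error = mpvpe(gt_verts + offset, gt_verts, pred_root=offset, gt_root=np.zeros(3))

# Toy example: the left hand is misplaced by (3, 4, 0) relative to the right hand.
gt_right, gt_left = np.zeros(3), np.array([1.0, 0.0, 0.0])
relative_error = mrrpe(gt_right, gt_left + np.array([3.0, 4.0, 0.0]), gt_right, gt_left)
```

Because MPVPE removes the root translation, a hand-only model can score well on it while still placing the two hands arbitrarily with respect to each other, which is what MRRPE exposes.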

Table 4: Comparison with state-of-the-art whole-body methods (full-body/hands MPVPE, mm)

Method	AGORA Full/Hands	ARCTIC Full/Hands	EHF Full/Hands
Hand4Whole	185.18/74.55	151.47/47.79	76.84/39.82
OSX	178.28/76.37	111.42/50.70	70.82/53.73
SMPLer-X	85.61/52.31	56.06/31.48	63.26/46.21
Hand4Whole++	76.84/49.71	45.95/25.03	61.24/33.43
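The gap between the last two rows is easier to judge as a relative error reduction. The short script below computes it from the numbers in the table; the helper name `relative_reduction` is just for illustration.

```python
# (full-body, hands) MPVPE in mm, copied from the table above.
smpler_x = {"AGORA": (85.61, 52.31), "ARCTIC": (56.06, 31.48), "EHF": (63.26, 46.21)}
hand4whole_pp = {"AGORA": (76.84, 49.71), "ARCTIC": (45.95, 25.03), "EHF": (61.24, 33.43)}

def relative_reduction(baseline, new):
    """Percentage by which `new` lowers the error of `baseline`."""
    return 100.0 * (baseline - new) / baseline

reductions = {
    name: tuple(relative_reduction(b, n) for b, n in zip(smpler_x[name], hand4whole_pp[name]))
    for name in smpler_x
}

for name, (full, hands) in reductions.items():
    print(f"{name}: full-body -{full:.1f}%, hands -{hands:.1f}%")
```

The reductions range from roughly 3% (EHF full-body) to roughly 28% (EHF hands), with ARCTIC improving by about 18 to 20% on both measures.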

Ablation Study

Table 2: Ablation of whole-body + hand model combination strategies (AGORA, MPVPE in mm)

Strategy	Full-body error	Hand error
Original whole-body model	84.76	52.31
Direct wrist orientation copy	90.70	100.59
CHAM modulation	76.88	50.56

Table 3: Ablation of finger joint and shape transfer (MPVPE, mm)

Finger transfer	Shape transfer	IH26M	ReIH	HIC
✗	✗	14.69	18.13	21.68
✓	✗	12.26	15.24	19.61
✓	✓	9.40	7.98	17.72

Key Findings

Fine-tuning the whole-body model is counterproductive: it overfits to the hand datasets, and EHF full-body error rises from 63.26 to 126.34 mm
Direct wrist orientation copying is catastrophic: AGORA hand error rises from 52.31 to 100.59 mm, as the hand estimator has no awareness of the upper-limb kinematic chain
CHAM improves both hands and whole body: AGORA full-body error decreases from 84.76 to 76.88 mm because the entire upper-limb kinematic chain is optimized
Shape transfer contributes substantially: the MANO hand shape space (1.34 mm point-to-point error) is significantly more expressive than that of SMPL-X (1.98 mm)
Hand estimator MRRPE is extremely large (WiLoR: 94817 mm on IH26M), demonstrating that independently predicted hands have no whole-body consistency

Highlights & Insights

Freeze-and-modulate design philosophy: preserves pretrained model capabilities and bridges them with a lightweight module, avoiding catastrophic forgetting
ControlNet-style paradigm transferred to pose estimation: zero-initialized convolutions ensure a stable starting point
In-depth analysis of why naïve combination fails: clearly demonstrates failure modes and their underlying causes
Cross-attention activated only when both hands are detected: flexibly handles single-hand and two-hand scenarios

Limitations & Future Work

Hand datasets lack full-body annotations; non-hand joints receive only weak supervision and may misalign with the image
Dependence on two pretrained models increases runtime (~0.1s/frame total, with WiLoR accounting for ~50%)
No formal validation on egocentric scenarios (only preliminary observations)
The cross-attention design in CHAM assumes at most two hands; multi-person interaction scenarios are not covered

Related Work & Insights

Relation to ControlNet: borrows the lightweight modulation design for controlling pretrained models, but applied to pose estimation rather than generation
Distinction from FrankMocap/Hand4Whole: the former directly concatenates hand outputs, the latter fuses features at the joint level; this work achieves deeper information injection via feature modulation
Distinction from HMR-Adapter: HMR-Adapter interpolates hand features from internal whole-body features (low quality), whereas this work injects features from an external hand model (higher information content)
Inspiration: the analogous "expert modulation" paradigm could be extended to other body parts (e.g., feet, facial expressions)

Rating

Novelty: ⭐⭐⭐⭐ — CHAM design is elegant, though the overall idea is a natural extension of ControlNet
Experimental Thoroughness: ⭐⭐⭐⭐⭐ — validated on 6 datasets with comprehensive ablations, including a MANO vs. SMPL-X shape expressiveness comparison
Writing Quality: ⭐⭐⭐⭐⭐ — motivation is clear, comparative analysis is thorough, and failure cases are well articulated
Value: ⭐⭐⭐⭐ — highly practical, real-time at 10fps, modular design facilitates integration