Enhancing Hands in 3D Whole-Body Pose Estimation with Conditional Hands Modulator

Conference: CVPR 2026 arXiv: 2603.14726 Code: Available Area: 3D Vision Keywords: whole-body pose estimation, hand pose, SMPL-X, feature modulation, modular framework

TL;DR

This paper proposes Hand4Whole++, a modular framework that injects features from a pretrained hand estimator into a frozen whole-body pose estimator through a lightweight CHAM module. The modulation enables accurate wrist orientation prediction, while fine-grained finger joints and hand shape are transferred from the hand model via differentiable rigid alignment.

Background & Motivation

3D whole-body pose estimation faces a fundamental supervision gap:

  • Whole-body datasets (e.g., AGORA, ARCTIC): provide full-body annotations but with limited hand pose diversity
  • Hand datasets (e.g., InterHand2.6M): provide fine-grained finger annotations but lack full-body context

This leads to two problems:

  1. Whole-body estimators (e.g., SMPLer-X) capture global structure but suffer from insufficient hand accuracy
  2. Hand estimators (e.g., WiLoR, HaMeR) achieve high finger accuracy but lack global body awareness

Naïve combination (directly appending hand outputs onto the body) causes wrist orientation inconsistency with the upper-limb kinematic chain, resulting in physically implausible poses.

Core challenge: How to obtain fine-grained hand details while maintaining whole-body consistency?

Method

Overall Architecture

Hand4Whole++ consists of four components:

  1. Pretrained whole-body pose estimator (SMPLer-X-L32) — frozen
  2. Pretrained hand pose estimator (WiLoR) — frozen
  3. CHAM (Conditional Hands Modulator) — the only trainable module
  4. Finger joint and shape transfer module

Pipeline: input image → hand estimator extracts hand features → CHAM modulates whole-body feature flow → whole-body estimator predicts SMPL-X parameters (wrist orientation from whole-body, fingers from hand model).
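The pipeline above can be sketched in Python; all names here are illustrative stand-ins, not the authors' API, and toy callables replace the real models just to show the data flow:

```python
def hand4whole_pp(image, body_model, hand_model, cham, transfer):
    """Sketch of the Hand4Whole++ flow (illustrative names only).

    body_model and hand_model stay frozen; only cham carries
    trainable parameters."""
    hand_feats, mano = hand_model(image)   # features + MANO theta/beta
    signals = cham(hand_feats)             # per-block modulation signals
    smplx = body_model(image, signals)     # wrist orientation from here
    return transfer(smplx, mano)           # fingers/shape from hand model

# Toy stand-ins (dicts and lambdas in place of real networks).
out = hand4whole_pp(
    image="img",
    hand_model=lambda img: ("hf", {"theta": "fingers", "beta": "shape"}),
    cham=lambda hf: "signals",
    body_model=lambda img, s: {"wrist": "from-body"},
    transfer=lambda smplx, mano: {**smplx, **mano},
)
assert out == {"wrist": "from-body", "theta": "fingers", "beta": "shape"}
```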

Key Designs

1. CHAM: Conditional Hands Modulator

Architecture:

  • Extracts final-layer features for left and right hands from WiLoR's ViT backbone
  • Adds 2D positional encodings (preserving spatial location of hands in the full-body image)
  • Three-layer cross-attention Transformer encoder (activated only when both hands are detected, modeling bilateral hand relationships)
  • Two independent branches (left/right hand), each containing 24 \(1\times1\) convolutional layers (corresponding to SMPLer-X's 24 Transformer blocks)
  • All convolutional layers zero-initialized (ControlNet design, ensuring a neutral starting state)
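The zero-initialization trick can be illustrated with a minimal numpy sketch (function names are illustrative, not the authors' code): because the 1×1 convolution starts at zero, the modulation branch contributes nothing at the beginning of training, so the frozen whole-body model initially behaves exactly as pretrained.

```python
import numpy as np

def zero_init_conv1x1(channels):
    # Zero weights and bias: the branch is a no-op at initialization,
    # the ControlNet-style "neutral starting state" the paper adopts.
    return {"w": np.zeros((channels, channels)), "b": np.zeros(channels)}

def apply_conv1x1(conv, feat):
    # feat: (H, W, C) feature map; a 1x1 conv is a per-pixel linear map.
    return feat @ conv["w"].T + conv["b"]

def modulated_block(body_feat, hand_feat, conv):
    # The hand signal is projected by the zero-init conv and added to
    # the body feature entering the corresponding transformer block.
    return body_feat + apply_conv1x1(conv, hand_feat)

C = 8
body = np.random.randn(4, 4, C)
hand = np.random.randn(4, 4, C)
conv = zero_init_conv1x1(C)
out = modulated_block(body, hand, conv)
# At initialization the modulated feature equals the original body feature.
assert np.allclose(out, body)
```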

Spatial alignment: Hand features are mapped back to the whole-body feature map space via inverse affine transformation, with zero-padding for non-hand regions; left and right branches are merged via element-wise maximum.
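The scatter-and-merge step might look like the following numpy sketch, where a simple paste stands in for the inverse affine warp (names and shapes are assumptions for illustration):

```python
import numpy as np

def scatter_hand_features(hand_feat, box, full_hw):
    """Place hand-crop features back into the whole-body feature grid.

    hand_feat: (h, w, C) features from the hand crop
    box: (top, left) location of the crop in the full feature map
    (A real implementation would invert the crop's affine transform;
    a direct paste stands in for that here.)"""
    H, W = full_hw
    h, w, C = hand_feat.shape
    canvas = np.zeros((H, W, C))          # zero-padding for non-hand regions
    top, left = box
    canvas[top:top + h, left:left + w] = hand_feat
    return canvas

# Merge the left and right branches by element-wise maximum.
left = scatter_hand_features(np.ones((2, 2, 3)), (0, 0), (4, 4))
right = scatter_hand_features(2 * np.ones((2, 2, 3)), (2, 2), (4, 4))
merged = np.maximum(left, right)
assert merged[0, 0, 0] == 1.0 and merged[2, 2, 0] == 2.0
```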

Key insight: CHAM improves not only wrist orientation but also the entire upper-limb kinematic chain (shoulder, elbow, wrist) through the whole-body feature flow, indirectly enhancing overall pose quality. The additional overhead is only ~10ms (~10% of total runtime).

2. Finger Joint and Hand Shape Transfer

| Step | Operation | Source |
|---|---|---|
| Finger pose | Use MANO parameters \(\theta_{rh}, \theta_{lh}\) | Hand estimator |
| Hand shape | Use MANO parameters \(\beta_{rh}, \beta_{lh}\) | Hand estimator |
| Wrist orientation | Discard hand estimator prediction; use SMPL-X | Whole-body estimator (CHAM-modulated) |
| Alignment | Rigid alignment based on wrist and four MCP joints | Differentiable operation |
| Boundary smoothing | Laplacian smoothing at seams | Post-processing |

Design rationale: The alignment step is fully differentiable, allowing gradients to back-propagate through CHAM for wrist orientation optimization. MANO's hand shape space is more expressive than SMPL-X's (which jointly encodes body, hands, and face in a shared latent space).
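The rigid alignment over corresponding joints (wrist plus four MCPs) can be sketched as a standard Kabsch/Procrustes fit; every step (means, SVD, matrix products) is differentiable when written in an autodiff framework. A numpy version, illustrative rather than the authors' implementation:

```python
import numpy as np

def rigid_align(src, dst):
    """Least-squares rotation + translation mapping src onto dst (Kabsch).

    src, dst: (N, 3) corresponding joints. Differentiability of this
    fit is what lets wrist-orientation gradients flow back into CHAM."""
    mu_s, mu_d = src.mean(0), dst.mean(0)
    H = (src - mu_s).T @ (dst - mu_d)        # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))   # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = mu_d - R @ mu_s
    return R, t

# Sanity check: recover a known rotation about z plus a translation.
np.random.seed(0)
theta = 0.3
R_true = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                   [np.sin(theta),  np.cos(theta), 0.0],
                   [0.0, 0.0, 1.0]])
src = np.random.randn(5, 3)
dst = src @ R_true.T + np.array([0.1, -0.2, 0.05])
R, t = rigid_align(src, dst)
assert np.allclose(R, R_true, atol=1e-6)
```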

Loss & Training

Both pretrained estimators are frozen during training; only CHAM is optimized:

  • Pose loss: \(\ell_1\) distance (predicted vs. GT 3D joint rotations); hand datasets are converted to global wrist orientation via forward kinematics
  • Shape loss: \(\ell_1\) for whole-body datasets; \(\ell_2\) regularization for hand datasets
  • 2D/3D keypoint loss: \(\ell_1\) loss, reference frame selected by dataset type (pelvis/right wrist/wrist-relative)
  • Body root pose regularization: When full-body annotations are absent in hand datasets, the SMPL-X root orientation is regularized to maintain an upright torso
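The reference-frame handling in the keypoint loss can be illustrated with a minimal numpy sketch (function names are assumptions): subtracting a dataset-dependent reference joint (pelvis for whole-body data, wrist for hand-only data) makes the loss invariant to global placement.

```python
import numpy as np

def keypoint_loss(pred_kpt, gt_kpt, ref_idx):
    """l1 keypoint loss after expressing both joint sets relative to a
    dataset-dependent reference joint (e.g. pelvis or wrist)."""
    pred = pred_kpt - pred_kpt[ref_idx]
    gt = gt_kpt - gt_kpt[ref_idx]
    return np.abs(pred - gt).sum()

# A global offset between prediction and GT vanishes once both are
# expressed relative to the reference joint.
gt = np.array([[0.0, 0, 0], [1, 0, 0], [1, 1, 0]])
pred = gt + np.array([0.5, 0.5, 0.5])     # same pose, shifted in space
assert keypoint_loss(pred, gt, ref_idx=0) == 0.0
```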

Training data: InterHand2.6M, ReInterHand, ARCTIC, AGORA. 4 epochs, batch size 32, ~20 hours on a single RTX A6000.

Key Experimental Results

Main Results

Table 1: Comparison with baselines on whole-body/hand datasets (MPVPE/MRRPE, mm)

| Method | AGORA Full/Hands | ARCTIC Full/Hands | EHF Full/Hands | IH26M MPVPE/MRRPE | ReIH MPVPE/MRRPE |
|---|---|---|---|---|---|
| Original whole-body model | 85.61/52.31 | 56.06/31.48 | 63.26/46.21 | 38.64/119.56 | 58.86/101.82 |
| Fine-tuned whole-body model | 90.77/55.91 | 67.52/29.03 | 126.34/57.35 | 20.00/47.89 | 24.87/28.32 |
| Hand-only model | -/99.11 | -/46.79 | -/46.28 | 11.17/94817 | 8.09/3094 |
| Hand4Whole++ | 76.84/49.71 | 45.95/25.03 | 61.24/33.43 | 9.40/32.30 | 7.98/16.37 |

Table 4: Comparison with state-of-the-art whole-body methods

| Method | AGORA Full/Hands | ARCTIC Full/Hands | EHF Full/Hands |
|---|---|---|---|
| Hand4Whole | 185.18/74.55 | 151.47/47.79 | 76.84/39.82 |
| OSX | 178.28/76.37 | 111.42/50.70 | 70.82/53.73 |
| SMPLer-X | 85.61/52.31 | 56.06/31.48 | 63.26/46.21 |
| Hand4Whole++ | 76.84/49.71 | 45.95/25.03 | 61.24/33.43 |

Ablation Study

Table 2: Ablation of whole-body + hand model combination strategies (AGORA, MPVPE)

| Strategy | Full-body error | Hand error |
|---|---|---|
| Original whole-body model | 84.76 | 52.31 |
| Direct wrist orientation copy | 90.70 | 100.59 |
| CHAM modulation | 76.88 | 50.56 |

Table 3: Ablation of finger joint and shape transfer (MPVPE/MRRPE)

| Finger | Shape | IH26M | ReIH | HIC |
|---|---|---|---|---|
| – | – | 14.69 | 18.13 | 21.68 |
| ✓ | – | 12.26 | 15.24 | 19.61 |
| ✓ | ✓ | 9.40 | 7.98 | 17.72 |

Key Findings

  1. Fine-tuning the whole-body model is counterproductive: overfits to hand datasets, causing EHF full-body error to increase from 63 to 126mm
  2. Direct wrist orientation copying is catastrophic: hand error rises from 52 to 101mm, as the hand estimator has no awareness of the upper-limb kinematic chain
  3. CHAM improves both hands and whole body: full-body error decreases from 84.76 to 76.88mm by optimizing the entire upper-limb kinematic chain
  4. Shape transfer contributes substantially: MANO hand shape space (1.34mm point-to-point error) is significantly more expressive than SMPL-X (1.98mm)
  5. Hand estimator MRRPE is extremely large (WiLoR: 94817mm), demonstrating that independently predicted hands have no whole-body consistency

Highlights & Insights

  1. Freeze-and-modulate design philosophy: preserves pretrained model capabilities and bridges them with a lightweight module, avoiding catastrophic forgetting
  2. ControlNet-style paradigm transferred to pose estimation: zero-initialized convolutions ensure a stable starting point
  3. In-depth analysis of why naïve combination fails: clearly demonstrates failure modes and their underlying causes
  4. Cross-attention activated only when both hands are detected: flexibly handles single-hand and two-hand scenarios

Limitations & Future Work

  1. Hand datasets lack full-body annotations; non-hand joints receive only weak supervision and may misalign with the image
  2. Dependence on two pretrained models increases runtime (~0.1s/frame total, with WiLoR accounting for ~50%)
  3. No formal validation on egocentric scenarios (only preliminary observations)
  4. The cross-attention design in CHAM assumes at most two hands; multi-person interaction scenarios are not covered

Comparisons & Connections

  • Relation to ControlNet: borrows the lightweight modulation design for controlling pretrained models, but applies it to pose estimation rather than generation
  • Distinction from FrankMocap/Hand4Whole: the former directly concatenates hand outputs, the latter fuses features at the joint level; this work achieves deeper information injection via feature modulation
  • Distinction from HMR-Adapter: HMR-Adapter interpolates hand features from internal whole-body features (low quality), whereas this work injects features from an external hand model (higher information content)
  • Inspiration: the analogous "expert modulation" paradigm could be extended to other body parts (e.g., feet, facial expressions)

Rating

  • Novelty: ⭐⭐⭐⭐ — CHAM design is elegant, though the overall idea is a natural extension of ControlNet
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ — validated on 6 datasets with comprehensive ablations, including a MANO vs. SMPL-X shape expressiveness comparison
  • Writing Quality: ⭐⭐⭐⭐⭐ — motivation is clear, comparative analysis is thorough, and failure cases are well articulated
  • Value: ⭐⭐⭐⭐ — highly practical, real-time at 10fps, modular design facilitates integration