Enhancing Hands in 3D Whole-Body Pose Estimation with Conditional Hands Modulator¶
Conference: CVPR 2026 arXiv: 2603.14726 Code: Available Area: 3D Vision Keywords: whole-body pose estimation, hand pose, SMPL-X, feature modulation, modular framework
TL;DR¶
This paper proposes Hand4Whole++, a modular framework that injects features from a pretrained hand estimator into a frozen whole-body pose estimator through a lightweight CHAM module. This yields accurate wrist orientation prediction, while fine-grained finger joints and hand shape are transferred from the hand model via differentiable rigid alignment.
Background & Motivation¶
3D whole-body pose estimation faces a fundamental supervision gap:
- Whole-body datasets (e.g., AGORA, ARCTIC): provide full-body annotations but with limited hand pose diversity
- Hand datasets (e.g., InterHand2.6M): provide fine-grained finger annotations but lack full-body context
This leads to two problems:
1. Whole-body estimators (e.g., SMPLer-X) capture global structure but suffer from insufficient hand accuracy
2. Hand estimators (e.g., WiLoR, HaMeR) achieve high finger accuracy but lack global body awareness
Naïve combination (directly appending hand outputs onto the body) causes wrist orientation inconsistency with the upper-limb kinematic chain, resulting in physically implausible poses.
Core challenge: How to obtain fine-grained hand details while maintaining whole-body consistency?
Method¶
Overall Architecture¶
Hand4Whole++ consists of four components:
1. Pretrained whole-body pose estimator (SMPLer-X-L32) — frozen
2. Pretrained hand pose estimator (WiLoR) — frozen
3. CHAM (Conditional Hands Modulator) — the only trainable module
4. Finger joint and shape transfer module
Pipeline: input image → hand estimator extracts hand features → CHAM modulates whole-body feature flow → whole-body estimator predicts SMPL-X parameters (wrist orientation from whole-body, fingers from hand model).
Key Designs¶
1. CHAM: Conditional Hands Modulator¶
Architecture:
- Extracts final-layer features for left and right hands from WiLoR's ViT backbone
- Adds 2D positional encodings (preserving the spatial location of the hands in the full-body image)
- Three-layer cross-attention Transformer encoder (activated only when both hands are detected, modeling bilateral hand relationships)
- Two independent branches (left/right hand), each containing 24 \(1\times1\) convolutional layers (corresponding to SMPLer-X's 24 Transformer blocks)
- All convolutional layers zero-initialized (ControlNet design, ensuring a neutral starting state)
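The paper does not spell out the positional-encoding formula; a common choice that matches the description is a fixed sine-cosine 2D encoding, with half the channels encoding the row and half the column. A minimal sketch under that assumption (`posenc_2d` is a hypothetical name):

```python
import numpy as np

def posenc_2d(h, w, dim):
    """Sine-cosine 2D positional encoding: half the channels encode y,
    half encode x, each split into sin/cos pairs. Returns (dim, h, w)."""
    assert dim % 4 == 0
    d = dim // 4
    freqs = 1.0 / (10000 ** (np.arange(d) / d))           # (d,)
    ys = np.arange(h)[:, None] * freqs[None, :]           # (h, d)
    xs = np.arange(w)[:, None] * freqs[None, :]           # (w, d)
    pe_y = np.concatenate([np.sin(ys), np.cos(ys)], 1)    # (h, 2d)
    pe_x = np.concatenate([np.sin(xs), np.cos(xs)], 1)    # (w, 2d)
    pe = np.concatenate([
        np.repeat(pe_y[:, None, :], w, axis=1),           # broadcast over columns
        np.repeat(pe_x[None, :, :], h, axis=0),           # broadcast over rows
    ], axis=-1)                                           # (h, w, dim)
    return pe.transpose(2, 0, 1)

pe = posenc_2d(4, 6, 8)
assert pe.shape == (8, 4, 6)
```

Adding this encoding to the hand features before the cross-attention encoder lets attention distinguish where each hand crop sits inside the full-body image.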
Spatial alignment: Hand features are mapped back to the whole-body feature map space via inverse affine transformation, with zero-padding for non-hand regions; left and right branches are merged via element-wise maximum.
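The core modulation mechanics above (zero-initialized \(1\times1\) convolutions, element-wise maximum merge of the two hand branches) can be sketched as follows. This is a minimal illustration in numpy, not the paper's implementation; shapes and variable names are invented, and the key point is that with zero initialization the module starts as an identity on the frozen feature flow:

```python
import numpy as np

rng = np.random.default_rng(0)

def conv1x1(x, w, b):
    """A 1x1 convolution is a per-pixel linear map over channels.
    x: (C_in, H, W), w: (C_out, C_in), b: (C_out,)"""
    return np.tensordot(w, x, axes=([1], [0])) + b[:, None, None]

C, H, W = 8, 4, 4
body_feat = rng.normal(size=(C, H, W))    # frozen whole-body feature map
left_feat = rng.normal(size=(C, H, W))    # hand features already mapped back to
right_feat = rng.normal(size=(C, H, W))   # body-feature space (zero-padded elsewhere)

# Merge the left/right hand branches by element-wise maximum
hand_feat = np.maximum(left_feat, right_feat)

# Zero-initialized 1x1 conv (ControlNet-style): contributes nothing at the start,
# so the pretrained whole-body model's behavior is initially unchanged
w0, b0 = np.zeros((C, C)), np.zeros(C)
modulated = body_feat + conv1x1(hand_feat, w0, b0)
assert np.allclose(modulated, body_feat)  # neutral starting state
```

As training proceeds, the conv weights move away from zero and the hand features begin to steer the corresponding SMPLer-X block's input, one such conv per block.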
Key insight: CHAM improves not only wrist orientation but also the entire upper-limb kinematic chain (shoulder, elbow, wrist) through the whole-body feature flow, indirectly enhancing overall pose quality. The additional overhead is only ~10ms (~10% of total runtime).
2. Finger Joint and Hand Shape Transfer¶
| Step | Operation | Source |
|---|---|---|
| Finger pose | Use MANO parameters \(\theta_{rh}, \theta_{lh}\) | Hand estimator |
| Hand shape | Use MANO parameters \(\beta_{rh}, \beta_{lh}\) | Hand estimator |
| Wrist orientation | Discard hand estimator prediction; use SMPL-X | Whole-body estimator (CHAM-modulated) |
| Alignment | Rigid alignment based on wrist and four MCP joints | Differentiable operation |
| Boundary smoothing | Laplacian smoothing at seams | Post-processing |
Design rationale: The alignment step is fully differentiable, allowing gradients to back-propagate through CHAM for wrist orientation optimization. MANO's hand shape space is more expressive than SMPL-X's (which jointly encodes body, hands, and face in a shared latent space).
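The rigid alignment over the wrist and four MCP joints is a standard orthogonal Procrustes / Kabsch fit. A numpy sketch under that assumption is below (in the paper it would run in an autodiff framework, which is what lets gradients flow back through the alignment into CHAM):

```python
import numpy as np

def rigid_align(src, dst):
    """Least-squares rigid transform (R, t) mapping src -> dst.
    src, dst: (N, 3) corresponding joints (e.g. wrist + 4 MCPs).
    Kabsch algorithm via SVD."""
    mu_s, mu_d = src.mean(0), dst.mean(0)
    H = (src - mu_s).T @ (dst - mu_d)          # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))     # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = mu_d - R @ mu_s
    return R, t

rng = np.random.default_rng(0)
mano_anchors = rng.normal(size=(5, 3))         # wrist + 4 MCP joints (MANO side)
true_R, _ = np.linalg.qr(rng.normal(size=(3, 3)))
if np.linalg.det(true_R) < 0:
    true_R[:, 0] *= -1                         # make it a proper rotation
smplx_anchors = mano_anchors @ true_R.T + np.array([0.1, -0.2, 0.3])

R, t = rigid_align(mano_anchors, smplx_anchors)
aligned = mano_anchors @ R.T + t
assert np.allclose(aligned, smplx_anchors, atol=1e-6)
```

Aligning on the wrist and MCP joints (rather than fingertips) pins down the hand's root pose, so the CHAM-modulated wrist orientation stays authoritative while MANO contributes the finger articulation.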
Loss & Training¶
Both pretrained estimators are frozen during training; only CHAM is optimized:
- Pose loss: \(\ell_1\) distance (predicted vs. GT 3D joint rotations); hand datasets are converted to global wrist orientation via forward kinematics
- Shape loss: \(\ell_1\) for whole-body datasets; \(\ell_2\) regularization for hand datasets
- 2D/3D keypoint loss: \(\ell_1\) loss, reference frame selected by dataset type (pelvis/right wrist/wrist-relative)
- Body root pose regularization: When full-body annotations are absent in hand datasets, the SMPL-X root orientation is regularized to maintain an upright torso
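The loss terms above can be sketched as follows. This is an illustrative numpy mock-up with invented shapes and no loss weights (the paper does not report them here), showing the \(\ell_1\) pose loss, \(\ell_2\) shape regularization for hand datasets, and root-relative keypoint loss:

```python
import numpy as np

rng = np.random.default_rng(0)

def l1(pred, gt):
    return np.abs(pred - gt).mean()

# Hypothetical shapes: 52 joints as axis-angle rotations, 10 shape betas, 65 keypoints
pred_rot, gt_rot = rng.normal(size=(52, 3)), rng.normal(size=(52, 3))
pred_beta = rng.normal(size=10)
pred_kp3d, gt_kp3d = rng.normal(size=(65, 3)), rng.normal(size=(65, 3))
ref = 0                                        # reference joint index, dataset-dependent
                                               # (pelvis / right wrist / wrist-relative)

loss_pose = l1(pred_rot, gt_rot)               # l1 on 3D joint rotations
loss_shape = np.sum(pred_beta ** 2)            # l2 regularization (hand datasets)
# Reference-relative keypoints: subtract the reference joint before the l1 loss
loss_kp = l1(pred_kp3d - pred_kp3d[ref], gt_kp3d - gt_kp3d[ref])

total = loss_pose + loss_shape + loss_kp       # weights omitted for brevity
```

Since both estimators are frozen, only CHAM's parameters receive these gradients, which is why training stays cheap (~20 hours on one GPU).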
Training data: InterHand2.6M, ReInterHand, ARCTIC, AGORA. 4 epochs, batch size 32, ~20 hours on a single RTX A6000.
Key Experimental Results¶
Main Results¶
Table 1: Comparison with baselines on whole-body/hand datasets (MPVPE/MRRPE, mm)
| Method | AGORA Full/Hands | ARCTIC Full/Hands | EHF Full/Hands | IH26M MPVPE/MRRPE | ReIH MPVPE/MRRPE |
|---|---|---|---|---|---|
| Original whole-body model | 85.61/52.31 | 56.06/31.48 | 63.26/46.21 | 38.64/119.56 | 58.86/101.82 |
| Fine-tuned whole-body model | 90.77/55.91 | 67.52/29.03 | 126.34/57.35 | 20.00/47.89 | 24.87/28.32 |
| Hand-only model | -/99.11 | -/46.79 | -/46.28 | 11.17/94817 | 8.09/3094 |
| Hand4Whole++ | 76.84/49.71 | 45.95/25.03 | 61.24/33.43 | 9.40/32.30 | 7.98/16.37 |
Table 4: Comparison with state-of-the-art whole-body methods
| Method | AGORA Full/Hands | ARCTIC Full/Hands | EHF Full/Hands |
|---|---|---|---|
| Hand4Whole | 185.18/74.55 | 151.47/47.79 | 76.84/39.82 |
| OSX | 178.28/76.37 | 111.42/50.70 | 70.82/53.73 |
| SMPLer-X | 85.61/52.31 | 56.06/31.48 | 63.26/46.21 |
| Hand4Whole++ | 76.84/49.71 | 45.95/25.03 | 61.24/33.43 |
Ablation Study¶
Table 2: Ablation of whole-body + hand model combination strategies (AGORA, MPVPE)
| Strategy | Full-body error | Hand error |
|---|---|---|
| Original whole-body model | 84.76 | 52.31 |
| Direct wrist orientation copy | 90.70 | 100.59 |
| CHAM modulation | 76.88 | 50.56 |
Table 3: Ablation of finger joint and shape transfer (MPVPE/MRRPE)
| Finger | Shape | IH26M | ReIH | HIC |
|---|---|---|---|---|
| ✗ | ✗ | 14.69 | 18.13 | 21.68 |
| ✓ | ✗ | 12.26 | 15.24 | 19.61 |
| ✓ | ✓ | 9.40 | 7.98 | 17.72 |
Key Findings¶
- Fine-tuning the whole-body model is counterproductive: overfits to hand datasets, causing EHF full-body error to increase from 63 to 126mm
- Direct wrist orientation copying is catastrophic: hand error rises from 52 to 101mm, as the hand estimator has no awareness of the upper-limb kinematic chain
- CHAM improves both hands and whole body: full-body error decreases from 84.76 to 76.88mm by optimizing the entire upper-limb kinematic chain
- Shape transfer contributes substantially: MANO hand shape space (1.34mm point-to-point error) is significantly more expressive than SMPL-X (1.98mm)
- Hand estimator MRRPE is extremely large (WiLoR: 94817mm), demonstrating that independently predicted hands have no whole-body consistency
Highlights & Insights¶
- Freeze-and-modulate design philosophy: preserves pretrained model capabilities and bridges them with a lightweight module, avoiding catastrophic forgetting
- ControlNet-style paradigm transferred to pose estimation: zero-initialized convolutions ensure a stable starting point
- In-depth analysis of why naïve combination fails: clearly demonstrates failure modes and their underlying causes
- Cross-attention activated only when both hands are detected: flexibly handles single-hand and two-hand scenarios
Limitations & Future Work¶
- Hand datasets lack full-body annotations; non-hand joints receive only weak supervision and may misalign with the image
- Dependence on two pretrained models increases runtime (~0.1s/frame total, with WiLoR accounting for ~50%)
- No formal validation on egocentric scenarios (only preliminary observations)
- The cross-attention design in CHAM assumes at most two hands; multi-person interaction scenarios are not covered
Related Work & Insights¶
- Relation to ControlNet: borrows the lightweight modulation design for controlling pretrained models, but applied to pose estimation rather than generation
- Distinction from FrankMocap/Hand4Whole: the former directly concatenates hand outputs, the latter fuses features at the joint level; this work achieves deeper information injection via feature modulation
- Distinction from HMR-Adapter: HMR-Adapter interpolates hand features from internal whole-body features (low quality), whereas this work injects features from an external hand model (higher information content)
- Inspiration: the analogous "expert modulation" paradigm could be extended to other body parts (e.g., feet, facial expressions)
Rating¶
- Novelty: ⭐⭐⭐⭐ — CHAM design is elegant, though the overall idea is a natural extension of ControlNet
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ — validated on 6 datasets with comprehensive ablations, including a MANO vs. SMPL-X shape expressiveness comparison
- Writing Quality: ⭐⭐⭐⭐⭐ — motivation is clear, comparative analysis is thorough, and failure cases are well articulated
- Value: ⭐⭐⭐⭐ — highly practical, real-time at 10fps, modular design facilitates integration