UniHM: Unified Dexterous Hand Manipulation with Vision Language Model

Conference: ICLR 2026
arXiv: 2603.00732
Code: GitHub
Area: Multimodal VLM
Keywords: dexterous hand manipulation, VLM, unified tokenizer, physics-guided dynamic refinement, cross-morphology generalization

TL;DR

This paper proposes UniHM, the first unified language-conditioned dexterous hand manipulation framework. It maps heterogeneous robotic hands into a shared discrete space via a morphology-agnostic VQ codebook, leverages a VLM for instruction-driven manipulation sequence generation, and ensures physical feasibility through physics-guided dynamic refinement.

Background & Motivation

Dexterous hand manipulation requires perceiving, grasping, and reconfiguring objects in complex environments. Generating diverse, long-horizon, and physically feasible manipulation sequences is critical for advancing humanoid robot applications.

Limitations of prior work:
- Object-centric methods (UniDexGrasp, DexGraspNet, etc.) lack open-vocabulary instruction guidance and can only handle fixed sequences.
- Language-guided grasping methods (SemGrasp, AffordDexGrasp, etc.) primarily generate static grasp poses, neglect temporal structure, and cannot produce smooth, continuous manipulation sequences.
- Existing VLM-based manipulation methods (MotionGPT, HOIGPT, etc.) mainly target digital hands or low-DoF grippers, and lack cross-morphology generalization and physical feasibility guarantees.

Goal: Generate dynamic dexterous manipulation sequences directly from images and open-vocabulary instructions, supporting multiple hand morphologies without relying on teleoperation data.

Method

Overall Architecture

A three-stage pipeline: (1) morphology-agnostic motion tokenization; (2) language-guided VLM manipulation sequence generation; (3) physics-aware decoding and dynamic refinement.

Key Designs

Unified Hand-Dexterous Tokenizer: A shared VQ-VAE codebook $\mathcal{Z} = \{\mathbf{e}_k\}_{k=1}^K$ maps five different hand morphologies (MANO, Shadow Hand, Allegro Hand, etc.) into a unified discrete space. Each morphology has a dedicated encoder $E_h$ and decoder $D_h$, with quantization $c = \arg\min_k \|E_h(\mathbf{x}^{(h)}) - \mathbf{e}_k\|_2^2$. New morphologies align their encoders via knowledge distillation: $\mathcal{L}_{\text{distill}} = \|E_{\text{new}}(\mathbf{x}_{\text{new}}) - E_{\text{ref}}(\mathbf{x}_{\text{ref}})\|_2^2$, circumventing the non-differentiability of the quantization step. Cross-morphology translation reduces to encode–quantize–decode: $\hat{\mathbf{x}}^{(j)} = D_j(\mathbf{e}_{Q(E_i(\mathbf{x}^{(i)}))})$.
VLM-driven Manipulation Generation: The architecture is decoupled. A CLIPort perception module infers the target trajectory $\mathcal{T}_{\text{tar}}$ from RGB-D input and instructions, while Point-SAM segments the target object point cloud $\mathcal{P}_{\text{obj}}$. With Qwen3-0.6B as the backbone, the encoded initial hand pose, target trajectory, object point cloud, and text tokens are concatenated and fed into the VLM, which autoregressively generates the manipulation token sequence. Training follows a progressive masking curriculum that gradually increases the masking ratio, moving from full teacher forcing to fully autoregressive generation.
Physics-guided Dynamic Refinement: Frame-by-frame Gauss–Newton optimization with three energy terms:
- Contact energy $\mathcal{E}_{\text{contact}}$: based on signed point-to-plane distances from fingertips to the object surface, using an asymmetric smooth penalty.
- Generation prior $\mathcal{E}_{\text{gen}}$: penalizes deviation from VLM-generated configurations to preserve semantic intent.
- Temporal prior $\mathcal{E}_{\text{time}}$: regularizes first-order (velocity) and second-order (acceleration) temporal differences to ensure smooth and coherent motion.
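The three energy terms can be illustrated with a minimal plain-Python sketch. The exact functional forms are not given in this summary, so everything concrete here is an assumption made for illustration: the piecewise-quadratic asymmetric penalty (heavier on penetration than on separation), the weight values, and all function names.

```python
def contact_energy(signed_dists, w_pen=10.0, w_gap=1.0):
    """Asymmetric penalty on signed fingertip-to-surface distances.

    d < 0 means the fingertip is inside the object (penetration) and is
    weighted by w_pen; d > 0 means a gap and is weighted by w_gap.
    The piecewise quadratic has a continuous first derivative at d = 0.
    Weights and form are assumptions, not values from the paper.
    """
    total = 0.0
    for d in signed_dists:
        w = w_pen if d < 0 else w_gap
        total += w * d * d
    return total


def generation_prior(q, q_gen, w_gen=1.0):
    """Squared deviation of the refined pose q from the VLM-generated pose q_gen."""
    return w_gen * sum((a - b) ** 2 for a, b in zip(q, q_gen))


def temporal_prior(q_prev2, q_prev, q, w_vel=1.0, w_acc=1.0):
    """First-order (velocity) and second-order (acceleration) finite differences."""
    vel = [a - b for a, b in zip(q, q_prev)]
    acc = [a - 2.0 * b + c for a, b, c in zip(q, q_prev, q_prev2)]
    return w_vel * sum(v * v for v in vel) + w_acc * sum(a * a for a in acc)
```

Under these assumptions, a trajectory moving at constant velocity incurs only the velocity cost, since its second difference is zero.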

Loss & Training

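VQ-VAE training: reconstruction loss + codebook loss $\mathcal{L}_{\text{vq}} = \|\text{sg}[\mathbf{z}_e] - \mathbf{z}_q\|_2^2 + \beta\|\mathbf{z}_e - \text{sg}[\mathbf{z}_q]\|_2^2$, where $\mathbf{z}_e$ is the encoder output, $\mathbf{z}_q$ is the selected codebook vector, and $\text{sg}[\cdot]$ denotes the stop-gradient operator.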
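As a rough illustration of the nearest-codebook quantization, the cross-morphology encode–quantize–decode path, and the forward value of the VQ loss, here is a small plain-Python sketch. The toy codebook, the identity and scaling stand-ins for $E_i$ and $D_j$, the function names, and the choice $\beta = 0.25$ are all assumptions for illustration and are not taken from the paper.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(z_e, codebook):
    """Return the index c = argmin_k ||z_e - e_k||^2."""
    return min(range(len(codebook)), key=lambda k: sq_dist(z_e, codebook[k]))


def translate(x_i, encode_i, decode_j, codebook):
    """Encode with morphology i, snap to the shared codebook, decode with morphology j."""
    return decode_j(codebook[quantize(encode_i(x_i), codebook)])


def vq_loss_value(z_e, codebook, beta=0.25):
    """Forward value of the VQ loss.

    Stop-gradient only changes which term receives gradients, so in the
    forward pass both terms equal the same squared distance.
    """
    z_q = codebook[quantize(z_e, codebook)]
    d = sq_dist(z_e, z_q)
    return d + beta * d


# Toy 2-D codebook with three entries (hypothetical values).
codebook = [[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]]
idx = quantize([0.9, 1.2], codebook)  # nearest entry is [1.0, 1.0]
out = translate([0.9, 1.2], lambda x: x, lambda e: [2.0 * v for v in e], codebook)
```

In an actual training setup the stop-gradient would be applied by the autodiff framework; this sketch only shows which quantities are compared.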

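Physical optimization uses Levenberg–Marquardt damped Gauss–Newton iterations, where $J_t$ is the Jacobian of the contact residual $r_{\text{contact}}$ at frame $t$ and $\lambda$ is the damping factor:
$$(J_t^T J_t + \mathbf{W}_{\text{gen}} + \mathbf{W}_{\text{vel}} + \mathbf{W}_{\text{acc}} + \lambda I)\,\Delta q_t = -J_t^T r_{\text{contact}}(q_t) - \tilde{\mathbf{W}}$$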
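A single damped Gauss–Newton update of this shape can be sketched in plain Python as follows. This is a sketch under stated assumptions, not the authors' implementation: the prior weight matrices are taken to be diagonal and summed into one vector `w_prior`, and the term written $\tilde{\mathbf{W}}$, which this summary does not expand, is treated as a prior-gradient vector `g_prior`. The helper names `solve` and `lm_step` are hypothetical.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [list(row) + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def lm_step(J, r, w_prior, g_prior, lam):
    """Return dq solving (J^T J + diag(w_prior) + lam * I) dq = -(J^T r + g_prior).

    J: m x n Jacobian of the contact residuals, r: m residuals,
    w_prior: n diagonal entries standing in for W_gen + W_vel + W_acc,
    g_prior: n-vector standing in for the W-tilde term (an assumption),
    lam: Levenberg-Marquardt damping factor.
    """
    m, n = len(J), len(J[0])
    H = [[sum(J[k][i] * J[k][j] for k in range(m)) for j in range(n)]
         for i in range(n)]
    for i in range(n):
        H[i][i] += w_prior[i] + lam
    rhs = [-(sum(J[k][i] * r[k] for k in range(m)) + g_prior[i])
           for i in range(n)]
    return solve(H, rhs)


J = [[1.0, 0.0], [0.0, 1.0]]
r = [1.0, -2.0]
undamped = lm_step(J, r, [0.0, 0.0], [0.0, 0.0], lam=0.0)  # plain Gauss-Newton step
damped = lm_step(J, r, [0.0, 0.0], [0.0, 0.0], lam=1.0)    # shorter, damped step
```

Increasing `lam` shrinks the step toward a scaled gradient-descent direction, which is the usual reason for adding the damping term.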

Data annotation: GPT-4o generates five open-vocabulary instructions per keyframe; Dex-Retargeting maps MANO poses to five robotic hand morphologies.

Key Experimental Results

Main Results

Method	DexYCB Seen MPJPE↓	DexYCB Seen FID↓	Diversity (GT=125.53)	DexYCB Unseen MPJPE↓	DexYCB Unseen FID↓
TM2T	85.33	54.83	37.12	94.22	55.94
MDM	88.06	52.33	33.95	93.05	55.13
FlowMDM	82.75	48.05	61.25	86.13	51.33
MotionGPT3	74.80	43.35	72.51	77.93	46.14
UniHM	61.40	31.24	39.62	63.56	41.03

Real-World Success Rate	Grab	Pick&Place	Pull&Push	Open&Close
MDM+Retarget (Seen)	20%	10%	0%	5%
MotionGPT3+Retarget (Seen)	30%	15%	25%	25%
UniHM (Seen)	65%	50%	60%	55%
UniHM (Unseen)	60%	35%	55%	45%

Ablation Study

Configuration	DexYCB Seen MPJPE↓	DexYCB Seen FID↓	DexYCB Unseen MPJPE↓	DexYCB Unseen FID↓	Note
w/o Depth Input	85.47	56.36	90.12	77.38	Severe degradation with RGB only
w/o Masked Training	73.41	44.87	74.63	43.09	Progressive masking is important
w/o Physical Refinement	65.78	33.57	65.39	45.06	Physical optimization improves feasibility
Full UniHM	61.40	31.24	63.56	41.03	All modules are indispensable

Key Findings

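UniHM consistently outperforms state-of-the-art methods on DexYCB and OakInk. On DexYCB it reduces MPJPE by about 18% relative to the strongest baseline, MotionGPT3, in both the seen (74.80 → 61.40) and unseen (77.93 → 63.56) settings.
Real-world success rates substantially exceed the baselines (Grab on seen objects: 65% vs. 30% for MotionGPT3+Retarget), and UniHM still reaches 35% to 60% across the four tasks on unseen objects.
Depth input is critical for 3D scene understanding; removing it increases MPJPE by approximately 40% (61.40 → 85.47 seen, 63.56 → 90.12 unseen).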
Physics-guided refinement significantly reduces interpenetration and improves motion stability.
The unified codebook enables plug-and-play transfer across five hand morphologies.
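The percentage figures quoted in these findings can be reproduced from the DexYCB numbers in the tables above. The short script below is only an arithmetic check; the helper name `relative_change` is introduced here for illustration.

```python
def relative_change(baseline, new):
    """Relative change from baseline to new, in percent."""
    return (new - baseline) / baseline * 100.0


# MPJPE values copied from the main results and ablation tables.
seen_vs_motiongpt3 = relative_change(74.80, 61.40)
unseen_vs_motiongpt3 = relative_change(77.93, 63.56)
seen_without_depth = relative_change(61.40, 85.47)
unseen_without_depth = relative_change(63.56, 90.12)

print(f"{seen_vs_motiongpt3:.1f}")    # -17.9
print(f"{unseen_vs_motiongpt3:.1f}")  # -18.4
print(f"{seen_without_depth:.1f}")    # 39.2
print(f"{unseen_without_depth:.1f}")  # 41.8
```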

Highlights & Insights

UniHM is the first fully unified language-conditioned dexterous hand manipulation framework, extending the paradigm from static pose generation to dynamic sequence manipulation.
The morphology-agnostic codebook design is elegant: knowledge distillation circumvents the non-differentiability of VQ, and new morphologies require only training new encoder–decoder pairs.
The model is trained solely on human video data, eliminating the need for costly teleoperation data collection.
Physics-guided optimization unifies the generation prior, temporal prior, and contact constraints within a single coherent framework.

Limitations & Future Work

The method relies on RGB-D input and lacks tactile and force feedback.
The contact and friction energy terms are relatively simplified.
Bimanual coordination and tool-use scenarios are not addressed.
The Qwen3-0.6B backbone is relatively small; larger models may yield further improvements.
CLIPort requires fine-tuning for new scenes; end-to-end unification of perception and generation is a promising future direction.

This work extends the VQ-VAE tokenization paradigm from human motion generation to multi-morphology manipulation; the codebook sharing strategy has broad applicability.
The progressive masking training curriculum is an effective strategy for mitigating exposure bias in autoregressive generation.
Physics-guided post-processing maintains a balance between generation flexibility and physical feasibility.
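One way to picture a progressive masking curriculum is as a schedule that ramps a ratio from 0 (full teacher forcing) to 1 (fully autoregressive). The sketch below is an assumption-heavy illustration rather than the paper's procedure: the linear ramp, the scheduled-sampling-style replacement of ground-truth inputs by the model's own predictions, and the names `masking_ratio` and `mix_inputs` are all hypothetical.

```python
import random


def masking_ratio(step, total_steps):
    """Linear ramp from 0.0 at the first step to 1.0 at the last step (assumed shape)."""
    if total_steps <= 1:
        return 1.0
    return min(1.0, max(0.0, step / (total_steps - 1)))


def mix_inputs(gt_tokens, predicted_tokens, ratio, rng):
    """Swap each ground-truth input token for the model's prediction with probability `ratio`.

    ratio = 0.0 reproduces teacher forcing; ratio = 1.0 feeds only the
    model's own outputs, as in fully autoregressive generation.
    """
    return [p if rng.random() < ratio else g
            for g, p in zip(gt_tokens, predicted_tokens)]


rng = random.Random(0)
gt = [11, 12, 13, 14]
pred = [21, 22, 23, 24]
early = mix_inputs(gt, pred, masking_ratio(0, 11), rng)   # all ground truth
late = mix_inputs(gt, pred, masking_ratio(10, 11), rng)   # all predictions
```

Exposing the model to its own outputs during training in this gradual way is what is meant to narrow the gap between training and inference conditions.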

Rating

Novelty: ⭐⭐⭐⭐⭐ First unified language-conditioned dexterous hand manipulation framework with multiple novel designs.
Experimental Thoroughness: ⭐⭐⭐⭐ Evaluated on DexYCB, OakInk, and real-world settings with comprehensive ablations; quantitative evaluation of cross-morphology generalization is somewhat limited.
Writing Quality: ⭐⭐⭐⭐ Method descriptions are detailed and the physical optimization derivations are clearly presented.
Value: ⭐⭐⭐⭐⭐ Addresses core pain points in dexterous hand manipulation with substantial practical application potential.