UniHM: Unified Dexterous Hand Manipulation with Vision Language Model

Conference: ICLR 2026
arXiv: 2603.00732
Code: GitHub
Area: Multi-modal VLM
Keywords: Dexterous hand manipulation, VLM, unified tokenizer, physical dynamics optimization, cross-morphology generalization

TL;DR

UniHM is proposed as the first unified language-conditioned dexterous hand manipulation framework. It maps heterogeneous robotic hands to a shared discrete space via a morphology-agnostic VQ codebook, combines a VLM for instruction-driven sequence generation, and ensures physical feasibility through physics-guided dynamic optimization.

Background & Motivation

Dexterous manipulation requires perceiving, grasping, and reconfiguring objects in complex environments. Generating diverse, long-horizon, and physically feasible manipulation sequences is key to advancing humanoid robot applications.

Limitations of Prior Work:

  • Object-centric methods (UniDexGrasp, DexGraspNet, etc.): Lack open-vocabulary instruction guidance and only handle fixed sequences.
  • Language-guided grasping methods (SemGrasp, AffordDexGrasp, etc.): Primarily generate static grasp poses, ignoring temporal structure and failing to produce smooth continuous manipulation sequences.
  • Existing VLM manipulation methods (MotionGPT, HOIGPT, etc.): Mainly target digital hands or low-DOF grippers, lacking cross-hand generalization and physical feasibility guarantees.

Goal: Directly generate dynamic dexterous manipulation sequences from images and open-vocabulary instructions, supporting multiple hand types without reliance on teleoperation data.

Method

Overall Architecture

UniHM decomposes the "image + open-vocabulary instruction → dynamic dexterous manipulation sequence" pipeline into three sequential stages. First, a morphology-agnostic VQ-VAE compresses heterogeneous robotic hand poses into unified discrete tokens. Second, a small-scale VLM autoregressively generates these tokens conditioned on perception cues. Finally, physics-guided refinement optimizes the decoded sequence frame by frame to ensure physical feasibility. The stages are trained independently and chained at inference time, leveraging human video data while avoiding dependence on teleoperation data.

%%{init: {'flowchart': {'rankSpacing': 24, 'nodeSpacing': 28, 'padding': 6, 'wrappingWidth': 400}}}%%
flowchart TD
    IN["RGB-D Image<br/>+ Open-vocabulary Instruction"]
    IN --> PERC["Perception Front-end (Scaffolding)<br/>CLIPort for target trajectory<br/>Point-SAM for object segmentation"]
    IN --> TOK["Morphology-Agnostic Unified Tokenizer<br/>Shared VQ codebook<br/>Encodes heterogeneous poses into unified tokens"]
    PERC --> VLM
    TOK --> VLM["Perception-Decoupled VLM Manipulation Generation<br/>Qwen3-0.6B autoregressively outputs tokens<br/>Decoded via codebook into joint sequences"]
    VLM --> REFINE["Physics-Guided Dynamic Optimization<br/>Frame-by-frame Gauss-Newton refinement"]
    REFINE --> OUT["Physically Feasible<br/>Dexterous Manipulation Sequence"]

Key Designs

1. Morphology-Agnostic Unified Tokenizer: Mapping five hand types into a single codebook. Different hands (five types, including MANO, Shadow, and Allegro) vary in degrees of freedom and structure. UniHM gives each hand type \(h\) its own encoder \(E_h\) and decoder \(D_h\), while all hand types share a single VQ-VAE codebook \(\mathcal{Z} = \{\mathbf{e}_k\}_{k=1}^K\). Quantization maps an encoded pose to the index of its nearest codeword, \(c = Q(E_h(\mathbf{x}^{(h)})) = \arg\min_k \|E_h(\mathbf{x}^{(h)}) - \mathbf{e}_k\|_2^2\). Because every hand is projected into the same discrete space, translating a pose from hand \(i\) to hand \(j\) becomes a "plug-and-play" encode, quantize, decode step: \(\hat{\mathbf{x}}^{(j)} = D_j(\mathbf{e}_{Q(E_i(\mathbf{x}^{(i)}))})\). When a new hand type is added, knowledge distillation aligns the new encoder with a reference hand type via \(\mathcal{L}_{\text{distill}} = \|E_{\text{new}}(\mathbf{x}_{\text{new}}) - E_{\text{ref}}(\mathbf{x}_{\text{ref}})\|_2^2\), which bypasses the non-differentiable quantization step.
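The encode, quantize, decode path described in this item can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: random linear maps stand in for the trained networks \(E_h\) and \(D_h\), and the latent width, codebook size, joint counts, and the names `quantize` and `translate` are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

LATENT_DIM = 8       # assumed latent width
CODEBOOK_SIZE = 16   # assumed K
HAND_DOF = {"shadow": 24, "allegro": 16}  # illustrative joint counts

# One shared codebook Z = {e_k}; each row is a codeword.
codebook = rng.normal(size=(CODEBOOK_SIZE, LATENT_DIM))

# Random linear maps stand in for the per-hand networks E_h and D_h.
enc = {h: rng.normal(size=(d, LATENT_DIM)) for h, d in HAND_DOF.items()}
dec = {h: rng.normal(size=(LATENT_DIM, d)) for h, d in HAND_DOF.items()}


def quantize(z):
    """Return the index of the nearest codeword (squared L2 distance)."""
    dists = np.sum((codebook - z) ** 2, axis=1)
    return int(np.argmin(dists))


def translate(x_src, src, dst):
    """Encode a pose of hand `src`, quantize it, decode with hand `dst`."""
    token = quantize(x_src @ enc[src])
    return token, codebook[token] @ dec[dst]


shadow_pose = rng.normal(size=HAND_DOF["shadow"])
token, allegro_pose = translate(shadow_pose, "shadow", "allegro")
print(token, allegro_pose.shape)
```

The only object shared between hands is `codebook`, so under these assumptions supporting another hand amounts to adding one more entry to `enc` and `dec`.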

2. Perception-Decoupled VLM Manipulation Generation: Separating scene understanding from action generation. End-to-end generation from raw RGB-D is data-hungry and hard to converge. UniHM therefore decouples perception: a CLIPort module infers the target trajectory \(\mathcal{T}_{\text{tar}}\) from the RGB-D image and the instruction, and Point-SAM segments the target object point cloud \(\mathcal{P}_{\text{obj}}\). A Qwen3-0.6B base serves as the generator: it takes the initial hand pose encoding, the target trajectory, the object point cloud, and the text tokens as input and autoregressively outputs manipulation tokens. A progressive masking curriculum is used during training to alleviate exposure bias, transitioning from teacher forcing to pure autoregression.
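The summary does not specify how the progressive masking curriculum is scheduled, so the following is only one plausible reading, sketched under stated assumptions: a linear ramp after a warm-up phase, independent per-token masking of the ground-truth history, and a hypothetical `MASK_ID`. The function names `mask_ratio` and `mask_history` are illustrative, not taken from the paper.

```python
import numpy as np

MASK_ID = -1  # hypothetical id for the mask token


def mask_ratio(step, total_steps, warmup_frac=0.2):
    """Linear ramp: 0 during warm-up (teacher forcing), 1 at the end."""
    warmup = warmup_frac * total_steps
    if step <= warmup:
        return 0.0
    return min(1.0, (step - warmup) / (total_steps - warmup))


def mask_history(gt_tokens, ratio, rng):
    """Hide each ground-truth history token independently with prob. `ratio`."""
    gt_tokens = np.asarray(gt_tokens)
    hide = rng.random(gt_tokens.shape) < ratio
    return np.where(hide, MASK_ID, gt_tokens)


rng = np.random.default_rng(0)
tokens = np.arange(10, 20)
for step in (0, 500, 1000):
    r = mask_ratio(step, total_steps=1000)
    print(step, round(r, 3), mask_history(tokens, r, rng))
```

At ratio 0 the model sees the full ground-truth history (teacher forcing); at ratio 1 it sees none of it, which approximates the inference-time condition where no ground-truth tokens are available.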

3. Physics-Guided Dynamic Optimization: Refinement for physical feasibility. Sequences generated by the VLM can exhibit physical flaws such as penetration or jitter. UniHM performs frame-by-frame Gauss-Newton optimization with Levenberg-Marquardt (LM) damping over three energy terms: the contact energy \(\mathcal{E}_{\text{contact}}\) applies a smooth penalty on point-to-plane distances to encourage proper contact without penetration; the generation prior \(\mathcal{E}_{\text{gen}}\) penalizes deviation from the original VLM output; and the temporal prior \(\mathcal{E}_{\text{time}}\) regularizes first-order (velocity) and second-order (acceleration) differences to suppress jitter. The joint-angle update \(\Delta q_t\) for frame \(t\) is obtained by solving:

\[(J_t^T J_t + \mathbf{W}_{\text{gen}} + \mathbf{W}_{\text{vel}} + \mathbf{W}_{\text{acc}} + \lambda I)\Delta q_t = -J_t^T r_{\text{contact}}(q_t) - \tilde{\mathbf{W}}\]

where \(\mathbf{W}_*\) are weight matrices and \(\lambda I\) is the LM damping term. This post-processing step improves both feasibility and accuracy: in the ablation study, removing it increases DexYCB Seen MPJPE from 61.40 to 65.78.
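A single damped Gauss-Newton update of this form can be sketched as follows. This is a toy illustration under several assumptions: the weight matrices are taken to be scalar multiples of the identity, the term \(\tilde{\mathbf{W}}\) is interpreted as the gradient of the quadratic generation, velocity, and acceleration priors (the summary does not define it), and the contact residual is a made-up affine function rather than a real point-to-plane model. The name `lm_step` and all numeric values are hypothetical.

```python
import numpy as np


def lm_step(q, q_gen, q_prev, q_prev2, residual_fn, jacobian_fn,
            w_gen=1.0, w_vel=0.1, w_acc=0.1, lam=1e-3):
    """One damped Gauss-Newton update for the current frame."""
    r = residual_fn(q)   # contact residuals, shape (m,)
    J = jacobian_fn(q)   # d r / d q, shape (m, n)
    n = q.size
    # Gradient of the quadratic priors (assumed scalar weights).
    g_prior = (w_gen * (q - q_gen)
               + w_vel * (q - q_prev)
               + w_acc * (q - 2.0 * q_prev + q_prev2))
    # Gauss-Newton Hessian approximation plus LM damping.
    H = J.T @ J + (w_gen + w_vel + w_acc + lam) * np.eye(n)
    dq = np.linalg.solve(H, -(J.T @ r) - g_prior)
    return q + dq


# Toy contact model: three signed distances that are affine in q.
A = np.array([[1.0, 0.5, 0.0],
              [0.0, 1.0, 0.5],
              [0.5, 0.0, 1.0]])
b = np.array([0.2, -0.1, 0.05])


def residual(q):
    return A @ q + b


def jacobian(q):
    return A


q_gen = np.array([0.30, -0.20, 0.10])    # frame proposed by the generator
q_prev = np.array([0.28, -0.18, 0.10])   # refined frame t-1
q_prev2 = np.array([0.26, -0.16, 0.10])  # refined frame t-2

q = q_gen.copy()
for _ in range(20):
    q = lm_step(q, q_gen, q_prev, q_prev2, residual, jacobian)

print(np.linalg.norm(residual(q_gen)), np.linalg.norm(residual(q)))
```

In this toy setup the refined frame has a smaller contact residual than the generated one while staying close to it, which is the trade-off the three energy terms are meant to express.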

Loss & Training

The VQ-VAE is trained with a reconstruction loss plus the codebook and commitment terms \(\mathcal{L}_{\text{vq}} = \|\text{sg}[\mathbf{z}_e] - \mathbf{z}_q\|_2^2 + \beta\|\mathbf{z}_e - \text{sg}[\mathbf{z}_q]\|_2^2\), where \(\mathbf{z}_e\) is the encoder output, \(\mathbf{z}_q\) is the selected codeword, \(\text{sg}[\cdot]\) is the stop-gradient operator, and \(\beta\) is the commitment weight. Training data is labeled automatically in two steps: GPT-4o generates open-vocabulary instructions for keyframes, and Dex-Retargeting maps MANO poses to five robot hand types.
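As a small worked example of the quantization terms, the snippet below evaluates their forward value for one latent vector. It is a sketch under assumptions: \(\beta = 0.25\) is a commonly used default rather than a value reported here, and the vectors are made up. NumPy has no automatic differentiation, so the stop-gradient appears only in the comments.

```python
import numpy as np


def vq_loss_value(z_e, z_q, beta=0.25):
    """Forward value of the codebook and commitment terms.

    Both terms have the same value; they differ only in where gradients
    flow: the first would update the codebook, the second the encoder.
    """
    codebook_term = np.sum((z_e - z_q) ** 2)    # ||sg[z_e] - z_q||^2
    commitment_term = np.sum((z_e - z_q) ** 2)  # ||z_e - sg[z_q]||^2
    return codebook_term + beta * commitment_term


z_e = np.array([0.9, -0.2, 0.4])  # encoder output (made up)
z_q = np.array([1.0, 0.0, 0.5])   # nearest codeword (made up)
print(vq_loss_value(z_e, z_q))
```

Here the squared distance is 0.06, so the value comes out at about 0.075, that is, \((1 + \beta)\) times the squared distance.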

Key Experimental Results

Main Results

| Method | DexYCB Seen MPJPE↓ | DexYCB Seen FID↓ | Diversity (GT=125.53) | DexYCB Unseen MPJPE↓ | DexYCB Unseen FID↓ |
|---|---|---|---|---|---|
| TM2T | 85.33 | 54.83 | 37.12 | 94.22 | 55.94 |
| MDM | 88.06 | 52.33 | 33.95 | 93.05 | 55.13 |
| FlowMDM | 82.75 | 48.05 | 61.25 | 86.13 | 51.33 |
| MotionGPT3 | 74.80 | 43.35 | 72.51 | 77.93 | 46.14 |
| Ours | 61.40 | 31.24 | 39.62 | 63.56 | 41.03 |

| Real-world Success Rate | Grab | Pick&Place | Pull&Push | Open&Close |
|---|---|---|---|---|
| MDM+Retarget (Seen) | 20% | 10% | 0% | 5% |
| MotionGPT3+Retarget (Seen) | 30% | 15% | 25% | 25% |
| Ours (Seen) | 65% | 50% | 60% | 55% |
| Ours (Unseen) | 60% | 35% | 55% | 45% |

Ablation Study

| Configuration | DexYCB Seen MPJPE↓ | DexYCB Seen FID↓ | DexYCB Unseen MPJPE↓ | DexYCB Unseen FID↓ | Description |
|---|---|---|---|---|---|
| w/o Depth Input | 85.47 | 56.36 | 90.12 | 77.38 | Significant degradation with RGB only |
| w/o Masked Training | 73.41 | 44.87 | 74.63 | 43.09 | Progressive masking is crucial |
| w/o Physical Refinement | 65.78 | 33.57 | 65.39 | 45.06 | Refinement improves feasibility |
| Full UniHM | 61.40 | 31.24 | 63.56 | 41.03 | All modules are indispensable |

Key Findings

  • UniHM outperforms SOTA methods on DexYCB and OakInk, reducing DexYCB MPJPE by about 18% relative to the strongest baseline (MotionGPT3) in both Seen and Unseen scenarios.
  • Real-world success rates significantly exceed baselines (Grab: 65% vs 30%), generalizing well to unseen objects.
  • Depth input is critical for 3D scene understanding; removing it increases MPJPE by approximately 40% (for example, from 61.40 to 85.47 on DexYCB Seen).
  • Physical optimization effectively reduces penetration and enhances stability.
  • The unified codebook enables plug-and-play transfer across five hand types.
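The percentages quoted in these findings can be checked against the reported MPJPE values with a few lines of arithmetic. The helper name `relative_change` is illustrative; the numbers are copied from the main-results and ablation tables.

```python
def relative_change(new, old):
    """Signed relative change of `new` with respect to `old`, in percent."""
    return 100.0 * (new - old) / old


# MPJPE values copied from the main-results and ablation tables.
seen = {"MotionGPT3": 74.80, "UniHM": 61.40, "w/o depth": 85.47}
unseen = {"MotionGPT3": 77.93, "UniHM": 63.56, "w/o depth": 90.12}

for name, split in (("seen", seen), ("unseen", unseen)):
    vs_baseline = relative_change(split["UniHM"], split["MotionGPT3"])
    no_depth = relative_change(split["w/o depth"], split["UniHM"])
    print(f"{name}: vs MotionGPT3 {vs_baseline:+.1f}%, w/o depth {no_depth:+.1f}%")
```

The reductions relative to MotionGPT3 come out at roughly 17.9% (Seen) and 18.4% (Unseen), and dropping depth raises MPJPE by roughly 39.2% and 41.8%, consistent with the rounded figures of 18% and 40%.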

Highlights & Insights

  • First completely unified language-conditioned dexterous manipulation framework, extending from static poses to dynamic sequences.
  • Elegant morphology-agnostic codebook design: Knowledge distillation bypasses VQ non-differentiability, requiring only new encoders/decoders for new hand types.
  • Trained solely on human video data without expensive teleoperation data collection.
  • Physics-guided optimization unifies generative priors, temporal priors, and contact constraints in a single framework.

Limitations & Future Work

  • Relies on RGB-D input, lacking tactile and force feedback.
  • Contact and friction energy terms are simplified.
  • Does not cover bimanual collaboration or tool-use scenarios.
  • Qwen3-0.6B is a small base; larger models may yield further improvements.
  • CLIPort requires fine-tuning for new scenes; end-to-end unified perception and generation is a future direction.

Related Work & Insights

  • Extending VQ-VAE tokenization from human motion generation to multi-hand manipulation demonstrates the broad applicability of codebook sharing strategies.
  • Progressive masking curriculum is an effective solution for exposure bias in autoregressive generation.
  • Physics-guided post-processing balances generative flexibility with physical feasibility.

Rating

  • Novelty: ⭐⭐⭐⭐⭐ First unified language-conditioned dexterous manipulation framework with several pioneering designs.
  • Experimental Thoroughness: ⭐⭐⭐⭐ Extensive evaluation on DexYCB, OakInk, and real-world; complete ablation studies; however, quantitative evaluation of cross-hand generalization is limited.
  • Writing Quality: ⭐⭐⭐⭐ Detailed methodology and clear derivation of physical optimization formulas.
  • Value: ⭐⭐⭐⭐⭐ Addresses core challenges in dexterous hand manipulation with high potential for practical application.