# DexVLG: Dexterous Vision-Language-Grasp Model at Scale
Conference: ICCV 2025 arXiv: 2507.02747 Code: None Area: Robotics / Dexterous Grasping Keywords: Dexterous grasping, vision-language model, Flow Matching, semantic part grasping, large-scale dataset
## TL;DR
This paper presents DexVLG — the first large-scale vision-language-dexterous-grasp model. It introduces DexGraspNet 3.0, a dataset comprising 174K objects and 170M grasp poses with part-level semantic annotations. By combining a VLM encoder with a Flow Matching pose prediction head, DexVLG achieves over 76% zero-shot execution success in simulation and demonstrates semantically aligned dexterous grasping in the real world.
## Background & Motivation
- Background: Vision-Language-Action (VLA) models are advancing rapidly in robotics, but progress has been largely confined to simple parallel-jaw grippers due to the difficulty of data collection.
- Limitations of Prior Work: Functional grasping with human-like dexterous hands — i.e., grasping specific object parts according to semantic instructions — remains severely understudied, lacking both large-scale training data and effective model architectures.
- Key Challenge: Dexterous hands have high degrees of freedom (>20 DoF), resulting in an enormous grasp pose space that makes it difficult for conventional methods to cover semantically aligned grasps across diverse objects and parts.
- Key Insight: A data-driven approach: first generate large-scale, high-quality, semantically aligned dexterous grasp data, then train a large model to learn language-guided grasp pose prediction.
- Core Idea: Construct an ultra-large-scale part-level dexterous grasp dataset (DexGraspNet 3.0), leverage a VLM to interpret natural language instructions and RGBD inputs, and predict dexterous hand grasp poses via Flow Matching.
## Method
### Overall Architecture
DexVLG consists of two major components: (1) the DexGraspNet 3.0 data construction pipeline — filtering objects from Objaverse, semantic segmentation via SAMesh, part name and object size annotation via GPT-4o, and energy-optimization-based grasp pose synthesis; and (2) the DexVLG model — a VLM encoder that processes RGBD images and language instructions, followed by a Flow Matching pose prediction head that generates dexterous hand grasp pose parameters.
### Key Designs
- DexGraspNet 3.0 Dataset Construction:
- Function: Filters and annotates 174K objects from Objaverse's 800K+ collection, synthesizing grasp poses for every semantic part of each object, totaling 170M grasp poses.
- Core Pipeline:
- GPT-4o six-view querying to filter low-quality or unsuitable objects
- Trimesh mesh extraction → ManifoldPlus watertightening → CoACD convex decomposition
- SAMesh semantic segmentation → GPT-4o Set-of-Marks annotation of part names
- GPT-4o estimation of plausible object sizes with rescaling (diagonal 20–50 cm)
- Part-geometry-based initial hand pose alignment → gradient optimization for grasp synthesis
- Design Motivation: Data scale is fundamental to generalization. Part-level annotations enable the model to understand semantic instructions such as "grasp the handle" or "grasp the cap."
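The size-annotation and rescaling step can be sketched as follows, assuming the GPT-4o estimate arrives as a bounding-box diagonal in centimeters; `rescale_to_diagonal` and its interface are our own naming, not from the paper.

```python
import numpy as np

def rescale_to_diagonal(vertices, est_diag_cm, lo=20.0, hi=50.0):
    """Rescale mesh vertices so the bbox diagonal matches the GPT-4o size
    estimate, clipped to the paper's 20-50 cm range.

    vertices: (N, 3) array; est_diag_cm: estimated diagonal in cm.
    """
    diag = np.linalg.norm(vertices.max(axis=0) - vertices.min(axis=0))
    target = np.clip(est_diag_cm, lo, hi)  # enforce the 20-50 cm diagonal
    return vertices * (target / diag)      # uniform scale about the origin
```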
- LP-based Differentiable Force Closure (LP-DFC):
- Function: Improves upon the original DFC energy optimization objective to synthesize more natural grasp poses.
- Mechanism: At each optimization step, the hand pose is held fixed and a linear program solves for the optimal contact force magnitudes: \(\min_{\mathbf{f}} \|G(\mathbf{f} \odot c)\|_2\), s.t. \(\max_i(\mathbf{f})_i = 1\), \((\mathbf{f})_i \geq 0\). The DFC energy is then rescaled based on the net torque \(P\) and the solved contact forces \(\mathbf{f}\).
- Design Motivation: The original DFC assumes equal contact forces, leading to artifacts such as finger tilting during thumb opposition. LP-DFC models variable contact force magnitudes, producing more natural poses that conform better to object geometry.
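The inner solve above can be sketched as follows. The stated objective is an \(\ell_2\) wrench norm, so this toy version uses bounded least squares and handles the \(\max_i f_i = 1\) constraint by pinning each contact in turn to unit force and keeping the best solution; the paper's actual solver and `solve_contact_forces` interface are our assumptions.

```python
import numpy as np
from scipy.optimize import lsq_linear

def solve_contact_forces(G, normals):
    """Toy LP-DFC inner step: min_f ||G (f . c)||_2 s.t. max_i f_i = 1, f_i >= 0.

    G: (6, 3n) or (3, 3n) grasp map; normals: (n, 3) unit contact normals.
    The max-constraint is enforced by trying each contact as the pinned one.
    """
    n = normals.shape[0]
    # Column i of A is the net wrench produced by a unit-magnitude force
    # along normal i (kron places normals[i] into the i-th 3-slot).
    A = np.stack([G @ np.kron(np.eye(n)[i], normals[i]) for i in range(n)], axis=1)
    best = None
    for i in range(n):  # pin contact i's magnitude to 1
        rest = [j for j in range(n) if j != i]
        res = lsq_linear(A[:, rest], -A[:, i], bounds=(0.0, 1.0))
        f = np.zeros(n)
        f[i] = 1.0
        f[rest] = res.x
        residual = np.linalg.norm(A @ f)
        if best is None or residual < best[0]:
            best = (residual, f)
    return best[1], best[0]  # force magnitudes, residual wrench norm
```

With two directly opposing contacts, the solver recovers equal unit forces and a zero net wrench, which is the force-closure condition the energy rewards.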
- Part-Aligned Hand Pose Initialization:
- Function: Semantically aligns the initial dexterous hand pose according to the geometric properties of the target object part.
- Mechanism: Object parts are classified into four categories — lid-like, disk-like, L-shaped, and shaft-like — each with a dedicated palm position and orientation alignment strategy. Two grasp modes are defined: Wrap grasp (7 contact points: 5 fingertips + palm) and Pinch grasp (4 contact points: thumb + index + middle finger + palm).
- Design Motivation: Gradient-based optimization is highly sensitive to initialization (as noted in the DexGraspNet paper). Part-aligned initialization injects strong geometric priors, yielding more natural and semantically distinguishable optimized poses.
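The paper's exact geometric rules for the four categories are not reproduced in the portion we have; a crude stand-in that classifies a part from its sorted bounding-box extents might look like the sketch below (the ratio thresholds are entirely our illustrative guess):

```python
import numpy as np

def classify_part(extents):
    """Map a part's bbox extents to one of the four shape categories.

    extents: iterable of 3 positive lengths. Thresholds are hypothetical,
    not taken from the paper.
    """
    a, b, c = np.sort(np.asarray(extents, dtype=float))[::-1]  # a >= b >= c
    if a / b > 3.0:                 # one dominant axis: elongated handle/rod
        return "shaft-like"
    if b / c > 3.0:                 # two dominant axes: thin and flat
        return "disk-like" if a / b < 1.5 else "lid-like"
    return "L-shaped"               # fallback for bent/compound parts
```

In the pipeline, the predicted category would then select the palm placement strategy and one of the two contact-point sets (Wrap or Pinch) before optimization begins.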
- VLM + Flow Matching Pose Prediction:
- Function: Accepts RGBD images and language instructions to predict dexterous hand grasp pose parameters.
- Mechanism: A VLM encodes visual and language inputs; Flow Matching serves as the denoising module to generate grasp poses, replacing DDPM/DDIM diffusion.
- Design Motivation: The Flow Matching paradigm is more amenable to learning pose generation than conventional diffusion models, as validated by experiments showing significant improvements over DDPM and DDIM.
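The head's training objective can be illustrated with a rectified-flow style conditional flow-matching step, a common formulation; the paper's exact path, schedule, and model signature are assumptions here.

```python
import numpy as np

rng = np.random.default_rng(0)

def flow_matching_step(model, pose_gt, cond):
    """One conditional flow-matching training step on grasp-pose vectors.

    model(x_t, t, cond) predicts a velocity field; pose_gt has shape (B, D).
    Uses the linear (rectified-flow) path x_t = (1-t) x0 + t x1.
    """
    x0 = rng.standard_normal(pose_gt.shape)        # Gaussian source sample
    t = rng.uniform(size=(pose_gt.shape[0], 1))    # per-sample time in (0, 1)
    xt = (1.0 - t) * x0 + t * pose_gt              # point on the linear path
    v_target = pose_gt - x0                        # constant target velocity
    v_pred = model(xt, t, cond)
    return float(np.mean((v_pred - v_target) ** 2))

def sample_pose(model, cond, dim, steps=10):
    """Euler integration of the learned ODE from noise to a pose vector."""
    x = rng.standard_normal((1, dim))
    for i in range(steps):
        t = np.full((1, 1), i / steps)
        x = x + model(x, t, cond) / steps
    return x
```

Unlike DDPM/DDIM, inference is a short deterministic ODE integration rather than a long stochastic denoising chain, which is one reason flow matching is attractive for action heads.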
### Loss & Training
- Regularization energy for grasp synthesis: \(E_{reg} = \omega_{limit}E_{limit} + \omega_{pen}E_{pen} + \omega_{spen}E_{spen} + \omega_{dir}E_{dir}\)
- Components include joint limit energy, hand-object penetration energy, self-penetration energy, and contact direction alignment energy (cosine similarity).
- For each object part, 2×5,000 initial poses are generated (5,000 for the Wrap mode and 5,000 for the Pinch mode).
- Simulated penetration checks further filter low-quality poses.
- Tabletop scene camera setup: 80 cm from the table center, uniformly sampled at a 45° downward angle.
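The regularization energy above is a plain weighted sum; the sketch below shows that sum plus the cosine-similarity form of \(E_{dir}\). The weights and function names are placeholders (the paper's values are not reproduced here).

```python
import numpy as np

def contact_dir_energy(contact_dirs, surface_normals):
    """E_dir: penalize misalignment as 1 - cosine similarity.

    Both inputs are (N, 3) arrays of unit vectors; perfectly aligned
    directions give zero energy.
    """
    cos = np.sum(contact_dirs * surface_normals, axis=-1)
    return float(np.mean(1.0 - cos))

def regularization_energy(E, w):
    """E_reg = w_limit*E_limit + w_pen*E_pen + w_spen*E_spen + w_dir*E_dir."""
    return sum(w[k] * E[k] for k in ("limit", "pen", "spen", "dir"))
```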
## Key Experimental Results
### Main Results (Simulation Benchmark)
| Grasp mode | LVIS-Seen Suc↑ | LVIS-Seen PGA↑ | Unseen Suc↑ | Unseen PGA↑ | SamPart3D Suc↑ | SamPart3D PGA↑ |
|---|---|---|---|---|---|---|
| Wrap grasp | 87.7 | 62.1 | 79.1 | 36.6 | 76.3 | 52.0 |
| Pinch grasp | 71.8 | 20.2 | 54.8 | 15.2 | 50.6 | 21.3 |
DexVLG maintains a 79.1% Wrap grasp success rate on unseen objects and 76.3% zero-shot generalization on SamPart3D.
### Ablation Study
Denoising Paradigm Ablation:
| Method | LVIS-Seen Suc↑ | LVIS-Seen PGA↑ | Unseen Suc↑ | Unseen PGA↑ | SamPart3D Suc↑ | SamPart3D PGA↑ |
|---|---|---|---|---|---|---|
| DDPM | 51.9 | 7.8 | 34.1 | 10.9 | 40.7 | 5.5 |
| DDIM | 57.7 | 12.5 | 39.6 | 10.4 | 35.2 | 8.5 |
| Flow Matching | 75.3 | 39.1 | 54.0 | 18.3 | 53.4 | 27.0 |
Dataset Quality Evaluation:
| Method | Scale↑ | Penetration↓ (mm) | Self-Penetration↓ (mm) | Q1↑ |
|---|---|---|---|---|
| DexGraspNet | 1.32M | 13.5 | 0.93 | 0.114 |
| Multi-GraspLLM | 0.12M | 7.1 | - | 0.091 |
| Ours-Wrap | 103M | 1.75 | 0.19 | 0.085 |
| Ours-Pinch | 67M | 1.42 | 0.22 | 0.067 |
### Key Findings
- Flow Matching substantially outperforms DDPM/DDIM: success-rate gains of 23.4/17.6 percentage points on LVIS-Seen, and PGA gains of 31.3/26.6 points.
- Wrap grasping consistently outperforms Pinch grasping across all datasets (15.9–25.7 percentage points higher success rate), indicating that whole-hand wrapping yields greater stability.
- DexGraspNet 3.0 is two orders of magnitude larger than its predecessor (170M vs. 1.32M poses) while achieving substantially lower penetration depth (1.75 mm vs. 13.5 mm).
- Part-aligned initialization produces more natural and semantically distinguishable grasp poses compared to random initialization.
- The Q1 stability metric is slightly lower than that of pure power grasp datasets, as part-aligned grasping does not optimize for stability as its sole objective.
## Highlights & Insights
- A quintessential demonstration of the data-driven approach: 174K objects, 170M poses, and part-level annotations — the sheer scale of the dataset constitutes a core contribution in its own right.
- The multi-faceted application of GPT-4o throughout the data pipeline (object filtering, part annotation, size estimation) illustrates the potential of LLMs in robotic data construction.
- The LP-DFC improvement over DFC is modest but critical — transitioning from equal-force to variable-force modeling markedly improves pose quality in thumb-opposition scenarios.
- The four-category part geometry classification (lid/disk/L-shaped/shaft), while not exhaustive, covers the majority of functionally relevant part shapes.
- The complete pipeline — from dataset construction to model training to simulation evaluation — carries substantial engineering value.
## Limitations & Future Work
- Only the Appendix was available in the cached version; details of the main method and experiments are missing — the VLM encoder architecture and training strategy remain unknown.
- Real-world experimental descriptions are insufficient, with only qualitative references to "successful part-aligned grasps" and no quantitative results.
- Pinch grasp success rates are relatively low (50–72%), limiting applicability to fine manipulation scenarios.
- Data synthesis is conducted entirely in simulation; the sim-to-real gap may hinder real-world deployment.
- Evaluation is limited to the LEAP Hand; generalization to other dexterous hand designs (e.g., Shadow Hand, Allegro Hand) has not been verified.
- Object size clipping to 20–50 cm restricts applicability to very small or very large objects.
## Related Work & Insights
- The DexGraspNet series (1.0→2.0→3.0) demonstrates an iterative development trajectory for dexterous grasp datasets.
- Part-geometry relational priors from SoFar and OmniSpatial are effectively leveraged.
- The superiority of Flow Matching as an action generation paradigm is validated in dexterous manipulation, potentially establishing it as the standard choice for VLA systems.
- The part-aligned initialization strategy can be generalized to other scenarios requiring semantically guided pose synthesis.
## Rating
- Novelty: ⭐⭐⭐⭐ First large-scale language-guided dexterous grasping model and dataset; however, the methodology primarily combines existing techniques.
- Experimental Thoroughness: ⭐⭐⭐ Only appendix ablations are accessible in the cached version; the main experiments cannot be fully assessed. The simulation benchmark is well-designed, but real-world validation is insufficient.
- Writing Quality: ⭐⭐⭐ The appendix is detailed and thorough, but the absence of the main body limits a complete evaluation.
- Value: ⭐⭐⭐⭐⭐ The scale and quality of the dataset alone represent an extremely high contribution to the community and will drive future research in dexterous grasping.