GraspXL: Generating Grasping Motions for Diverse Objects at Scale

Conference: ECCV 2024
arXiv: 2403.19649
Code: Available
Area: Robotics
Keywords: Grasp motion generation, dexterous manipulation, reinforcement learning, large-scale generalization, multi-objective control

TL;DR

GraspXL is an RL-based grasp motion generation framework that is trained on only 58 objects yet generalizes to over 500,000 unseen objects. It supports simultaneous control of multiple motion objectives (grasp region, heading, wrist rotation, and hand position) and works across multiple dexterous hand platforms.

Background & Motivation

Human hands possess extraordinary dexterity—the ability to grasp objects of arbitrary shapes, target specific regions, approach from specific directions, and do so without object-specific training. Replicating this ability is a core challenge in both computer animation and robotic grasping.

Existing grasp motion synthesis methods face three key bottlenecks:

Dependence on expensive 3D hand-object interaction data: Data-driven methods (e.g., D-Grasp) require precise 3D annotated sequences as training data, which are extremely costly to collect, and the models are constrained by the training distribution, limiting generalization.

Limited generalization: Most methods either handle only objects seen during training (DexVIP is evaluated on no novel objects) or support only a small number of test objects (UniDexGrasp covers about 100), falling far short of real-world requirements.

Lack of multi-objective control: Existing methods typically support only a single objective (e.g., controlling only the heading direction) and cannot simultaneously satisfy multiple motion objectives such as grasp region, heading, wrist rotation, and position.

The core challenge is how to enable a policy to generate stable grasping motions for arbitrary 3D shapes that satisfy multiple motion objectives, without requiring 3D hand-object interaction data. Key difficulties include: (1) highly diverse object shapes require a general shape-perception mechanism; (2) objectives can conflict with grasp stability (the contact made while satisfying an objective may move the object and undermine the grasp); (3) joint configurations differ significantly across dexterous hands.

Method

Overall Architecture

GraspXL operates within a reinforcement learning framework. Given an object and a hand model, the policy network \(\boldsymbol{\pi}\) takes the motion target \(\mathcal{T}\) and the physical state \(\mathbf{s}\) as inputs. These inputs are first transformed by the feature extraction layer \(\Phi\); the policy then outputs PD control targets as actions \(\mathbf{a}\), which a physics simulator (RaiSim) executes. The hand model consists of \(L\) links, and the object is represented as a 3D point cloud.
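
To make the action interface concrete, the sketch below shows how a PD control target is typically turned into joint torques. It uses the standard textbook PD law and is not RaiSim's API; the function name `pd_torque` and the gain values are hypothetical and chosen only for illustration.

```python
def pd_torque(q, q_dot, q_target, kp, kd):
    """Textbook PD law per joint: tau = kp * (q_target - q) - kd * q_dot.

    q, q_dot, q_target are equal-length lists of joint angles, joint
    velocities, and PD targets (the policy's action). kp and kd are
    hypothetical scalar gains shared by all joints.
    """
    return [kp * (t - x) - kd * v for x, v, t in zip(q, q_dot, q_target)]


# Two joints: the first is 1 rad away from its target and at rest,
# the second is on target but still moving.
tau = pd_torque(q=[0.0, 1.0], q_dot=[0.0, 2.0], q_target=[1.0, 1.0],
                kp=10.0, kd=1.0)
print(tau)
```

The per-joint difference `q_target - q` used here is also what the summary refers to as the PD tracking error \(\mathbf{d}\) in the feature inputs.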

Key Designs

Universal Feature Extraction \(\Phi(\mathbf{s}, \mathcal{T})\):
- Inputs include joint angles \(\mathbf{q}\), PD tracking errors \(\mathbf{d}\), hand/object velocities \(\mathbf{u}_h, \mathbf{u}_o\), contact vector \(\mathbf{c}\), and contact force \(\mathbf{f}\).
- Target differences: \(\tilde{\mathbf{v}}, \tilde{\mathbf{m}}, \tilde{\omega}\) represent the current differences from the target heading, position, and rotation, respectively.
- Link distance features \(\mathbf{l}^+ \in \mathbb{R}^{L \times 3}\): The vectors from each link to the closest point in the graspable region; \(\mathbf{l}^- \in \mathbb{R}^{L \times 3}\): The vectors to the closest point in the non-graspable region.
- Design Motivation: Link distance features are generic to object shapes (independent of specific point cloud encoders) and capture the spatial relationships with graspable/non-graspable regions, enabling the policy to generalize to arbitrary shapes.
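
The link distance features above can be sketched in a few lines of plain Python. This is an illustrative brute-force version, not the authors' implementation: the names `closest_offset`, `link_distance_features`, and `graspable_mask` are hypothetical, and it assumes both the graspable and non-graspable regions contain at least one point.

```python
import math


def closest_offset(link, points):
    """Vector from `link` to its nearest point in `points` (Euclidean)."""
    nearest = min(points, key=lambda p: math.dist(link, p))
    return tuple(p - l for p, l in zip(nearest, link))


def link_distance_features(link_positions, cloud, graspable_mask):
    """Return (l_plus, l_minus), each holding one 3-vector per link.

    l_plus points to the closest graspable point, l_minus to the closest
    non-graspable point. Both regions are assumed to be non-empty.
    """
    graspable = [p for p, g in zip(cloud, graspable_mask) if g]
    non_graspable = [p for p, g in zip(cloud, graspable_mask) if not g]
    l_plus = [closest_offset(h, graspable) for h in link_positions]
    l_minus = [closest_offset(h, non_graspable) for h in link_positions]
    return l_plus, l_minus


# Toy example: two links and a four-point object split into two regions.
links = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
cloud = [(0.0, 0.0, 1.0), (1.0, 0.0, 2.0), (0.0, 3.0, 0.0), (1.0, -1.0, 0.0)]
mask = [True, True, False, False]
l_plus, l_minus = link_distance_features(links, cloud, mask)
print(l_plus)   # L x 3 offsets to the graspable region
print(l_minus)  # L x 3 offsets to the non-graspable region
```

Because the features depend only on link positions and nearest surface points, their size is fixed by the number of links \(L\) rather than by the object's point count, which is consistent with the shape-agnostic motivation stated above.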
Multi-Objective Reward Function:
- The total reward is \(r = r_{\text{goal}} + r_{\text{grasp}}\), decoupling target satisfaction and grasping stability.
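- Target reward: \(r_{\text{goal}} = r_{\text{dis}} + r_\mathbf{v} + r_\omega + r_\mathbf{m}\), with one term each for the grasp region (\(r_{\text{dis}}\)), heading (\(r_\mathbf{v}\)), wrist rotation (\(r_\omega\)), and position (\(r_\mathbf{m}\)).
  - \(r_{\text{dis}} = -\sum_i [w_d^+(i)\|\mathbf{h}_i - \mathbf{o}_i^+\|^2 - w_d^-(i)\|\mathbf{h}_i - \mathbf{o}_i^-\|^2]\), where \(\mathbf{h}_i\) is the position of link \(i\) and \(\mathbf{o}_i^+\), \(\mathbf{o}_i^-\) are its closest points in the graspable and non-graspable regions: Encourages approaching the graspable region and staying away from the non-graspable region.
  - \(r_\mathbf{v} = -w_\mathbf{v}\|\mathbf{v} - \bar{\mathbf{v}}\|^2\): Heading alignment; penalizes the squared difference between the current heading \(\mathbf{v}\) and the target heading \(\bar{\mathbf{v}}\).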
- Grasping reward: \(r_{\text{grasp}} = r_\mathbf{c} + r_\mathbf{f} + r_{\text{anatomy}} + r_{\text{reg}}\)
  - Contact reward \(r_\mathbf{c}\) and force reward \(r_\mathbf{f}\) (both distinguishing between graspable and non-graspable regions), anatomical constraints \(r_{\text{anatomy}}\) (for the MANO hand), and regularization \(r_{\text{reg}}\).
- Design Motivation: The reward function is entirely independent of specific hand structures, relying only on the geometric relationships between link positions and the object surface, allowing direct transfer to different dexterous hands.
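
The two target-reward terms spelled out above can be written down directly. The sketch below is a minimal plain-Python transcription of the \(r_{\text{dis}}\) and \(r_\mathbf{v}\) formulas; the function names and all weight values are hypothetical placeholders, not the values used in the paper.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def distance_reward(links, near_plus, near_minus, w_plus, w_minus):
    """r_dis = -sum_i [ w+_i * |h_i - o+_i|^2 - w-_i * |h_i - o-_i|^2 ]."""
    return -sum(
        wp * sq_dist(h, op) - wm * sq_dist(h, om)
        for h, op, om, wp, wm in zip(links, near_plus, near_minus,
                                     w_plus, w_minus)
    )


def heading_reward(v, v_target, w_v):
    """r_v = -w_v * |v - v_target|^2."""
    return -w_v * sq_dist(v, v_target)


# One link at the origin; closest graspable point is 1 unit away,
# closest non-graspable point is 2 units away. Weights are placeholders.
r_dis = distance_reward(
    links=[(0.0, 0.0, 0.0)],
    near_plus=[(0.0, 0.0, 1.0)],
    near_minus=[(2.0, 0.0, 0.0)],
    w_plus=[1.0],
    w_minus=[0.5],
)
r_v = heading_reward(v=(1.0, 0.0, 0.0), v_target=(0.0, 1.0, 0.0), w_v=0.5)
print(r_dis, r_v)
```

Note that neither function refers to a joint layout: both take only link positions, surface points, and direction vectors, which matches the claim that the reward is independent of the specific hand structure.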
Curriculum Learning:
- Stage 1: Train on static objects with higher weight on \(r_{\text{goal}}\) to learn precise finger movements to satisfy the targets.
- Stage 2: Fine-tune on movable objects with higher weight on \(r_{\text{grasp}}\) to learn stable grasping.
- Objective-driven Guidance: The difference between the target and current values is directly added as a bias to the wrist's 6DoF PD controller, accelerating exploration.
- Design Motivation: Directly learning both target alignment and grasping on movable objects leads to local optima—the contact forces generated while satisfying targets can cause the object to flip, resulting in failed grasps. The curriculum design decouples these two learning tasks.
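
The curriculum and the objective-driven guidance can be sketched as follows. This is an illustration under stated assumptions: the summary gives the total reward as an unweighted sum and says only that each stage puts "higher weight" on one term, so the per-stage weights below are hypothetical, as are the names `STAGE_WEIGHTS`, `total_reward`, `guided_wrist_target`, and the `gain` parameter. The guidance function subtracts components naively and ignores angle wrapping for the rotational part.

```python
# Hypothetical per-stage reward weights: stage 1 emphasizes r_goal on
# static objects, stage 2 emphasizes r_grasp on movable objects.
STAGE_WEIGHTS = {
    1: {"goal": 1.0, "grasp": 0.5},
    2: {"goal": 0.5, "grasp": 1.0},
}


def total_reward(r_goal, r_grasp, stage):
    """Weighted sum of the goal and grasp rewards for a curriculum stage."""
    w = STAGE_WEIGHTS[stage]
    return w["goal"] * r_goal + w["grasp"] * r_grasp


def guided_wrist_target(policy_action, target, current, gain=1.0):
    """Bias the 6-DoF wrist PD target by the (target - current) difference.

    All three inputs are length-6 lists (3 translational, 3 rotational).
    Angle wrapping is not handled in this simplified sketch.
    """
    return [a + gain * (t - c)
            for a, t, c in zip(policy_action, target, current)]


print(total_reward(-2.0, -1.0, stage=1))  # goal term dominates
print(total_reward(-2.0, -1.0, stage=2))  # grasp term dominates

biased = guided_wrist_target(
    policy_action=[0.0] * 6,
    target=[1.0, 0.0, 0.0, 0.0, 0.0, 0.5],
    current=[0.5, 0.0, 0.0, 0.0, 0.0, 0.0],
)
print(biased)
```

With a zero policy action, the biased target already moves the wrist toward the goal, which is the sense in which the guidance accelerates exploration.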

Loss & Training

Standard RL training using the PPO algorithm in the RaiSim physics simulator.
The training set contains only 58 objects (26 from ShapeNet + 32 from PartNet).
During training, targets are randomly sampled: a random heading \(\bar{\mathbf{v}}\) and a random rotation \(\bar{\omega} \in [0, 2\pi)\), with the graspable region constrained to a width of at most 12 cm.
Hardware: a single NVIDIA RTX 6000 GPU and 128 CPU cores.
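
The random target sampling described above might look like the following sketch. It assumes the heading is drawn uniformly from the unit sphere, which the summary does not state; the function names are hypothetical, and the graspable-region width constraint is not modeled here.

```python
import math
import random


def sample_heading(rng):
    """Sample a unit 3-vector, uniform on the sphere (an assumption).

    Normalizing a vector of independent Gaussians gives a uniformly
    distributed direction; near-zero draws are rejected to avoid
    dividing by a tiny norm.
    """
    while True:
        v = [rng.gauss(0.0, 1.0) for _ in range(3)]
        norm = math.sqrt(sum(x * x for x in v))
        if norm > 1e-8:
            return [x / norm for x in v]


def sample_rotation(rng):
    """Sample a wrist rotation target in [0, 2*pi)."""
    return rng.random() * 2.0 * math.pi


rng = random.Random(0)  # seeded only to make the example repeatable
v_bar = sample_heading(rng)
omega_bar = sample_rotation(rng)
print(v_bar, omega_bar)
```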

Key Experimental Results

Main Results

PartNet test set:

Method	Success Rate↑	Mid. Error↓	Head. Error↓	Rot. Error↓	Contact Ratio↑
SynH2R-PD	26.5%	4.30cm	0.767rad	0.857rad	13.0%
SynH2R	82.3%	4.06cm	0.522rad	0.568rad	53.4%
GraspXL	95.0%	2.85cm	0.270rad	0.306rad	86.7%

ShapeNet test set (completely unseen objects):

Method	Success Rate↑	Mid. Error↓	Head. Error↓	Rot. Error↓
SynH2R	65.8%	4.49cm	0.642rad	0.688rad
GraspXL	81.0%	3.22cm	0.292rad	0.338rad

Large-scale Objaverse generalization (503k objects):

Object Size	Success Rate↑	Mid. Error↓	Head. Error↓	Rot. Error↓
Small	85.9%	3.20cm	0.311rad	0.362rad
Medium	84.5%	3.16cm	0.274rad	0.315rad
Large	79.0%	3.50cm	0.271rad	0.306rad
Average	82.2%	3.32cm	0.279rad	0.319rad

Ablation Study

Configuration	Suc. Rate↑	Mid. Err (cm)↓	Head. Err (rad)↓	Rot. Err (rad)↓	Description
w/o Guidance	90.0%	3.22	0.394	0.425	Without guidance, exploration efficiency decreases
w/o Distance	81.6%	2.90	0.419	0.475	Without distance features, shape perception degrades
w/o Curriculum	96.2%	4.12	0.381	0.462	Without curriculum, high success rate but poor target accuracy
Full Model	95.0%	2.85	0.270	0.306	Best overall performance

Generalization to different dexterous hands (PartNet):

Hand Model	Suc. Rate↑	Mid. Error↓	Head. Error↓	No. of Joints
MANO	95.0%	2.85cm	0.270rad	45
Allegro	95.3%	4.38cm	0.291rad	16
Shadow	94.0%	3.57cm	0.317rad	22
Faive	95.8%	2.85cm	0.228rad	30

Key Findings

Generalizes to over 500k unseen Objaverse objects after training on only 58 objects, achieving an 82.2% average success rate.
SynH2R requires a week to generate reference poses for the ShapeNet test set, whereas GraspXL achieves real-time inference.
Training with the curriculum yields a slightly lower success rate than training without it (95.0% vs. 96.2%), but target accuracy improves markedly (heading error drops from 0.381 to 0.270 rad), which supports decoupling the two learning tasks.
Link distance features are crucial for generalization: they extract shape information directly from geometric distances without relying on point cloud encoders.
The method is also reported to work well on reconstructed objects (with noisy meshes) and AI-generated objects, suggesting robustness to imperfect geometry.
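
As a worked check on the curriculum trade-off, the relative changes implied by the ablation table (w/o Curriculum vs. Full Model) are approximately:

\[
\frac{0.381 - 0.270}{0.381} \approx 29.1\%\ \text{(heading error)}, \qquad
\frac{4.12 - 2.85}{4.12} \approx 30.8\%\ \text{(Mid. error)}, \qquad
96.2\% - 95.0\% = 1.2\ \text{points (success rate)}.
\]

In other words, the curriculum gives up about one percentage point of success rate in exchange for roughly a 30% reduction in both target errors.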

Highlights & Insights

Breakthrough in Scale: Generalizing from 58 training objects to over 500,000 test objects is an unprecedented scale in grasp motion synthesis.
Zero Data Dependency: It completely eliminates the need for 3D hand-object interaction data, learning solely through RL + physical simulation.
Effective Curriculum Design: Moving from static to movable objects while shifting the reward emphasis from target satisfaction to grasp stability resolves the conflict between the two objectives.
Generality: The same framework supports four different dexterous hands (MANO, Shadow, Allegro, Faive) without any architecture modification.

Limitations & Future Work

Only supports rigid bodies and does not handle deformable objects (e.g., cloth, ropes).
The success rate drops for large objects (79.0% on the Large Objaverse split); handling excessively large or heavy objects would require accounting for torque constraints.
Currently only generates approach-and-grasp motions, excluding subsequent manipulation (e.g., rotation, placement).
Link distance features are sensitive to occlusion and self-occlusion; future work could integrate point cloud encoders to enhance shape understanding.
Inference requires a physics simulator; direct deployment onto real robots necessitates sim-to-real transfer.

Related Work & Insights

SynH2R (Christen et al., ICRA 2024): A two-stage approach that generates reference poses via optimization and follows them using RL. However, the optimization process is time-consuming and the control objectives are limited.
UniDexGrasp/UniDexGrasp++: Also employ RL for grasping but require pre-generated reference poses and generalize to only about 100 objects.
D-Grasp (Christen et al., CVPR 2022): Utilizes reference grasp poses to generate physically plausible grasping motions.
Application of Curriculum Learning in RL: The static-to-movable curriculum design in this work can be extended to other contact-rich RL tasks.

Rating

Novelty: ⭐⭐⭐⭐ — The multi-objective control framework and curriculum learning design are creative, and the large-scale generalization is highly impressive.
Experimental Thoroughness: ⭐⭐⭐⭐⭐ — Extremely comprehensive, with evaluation spanning multiple datasets, multiple hand models, thorough ablations, and a test scale of over 500k objects.
Writing Quality: ⭐⭐⭐⭐ — Clear problem definition, consistent notation, and well-structured presentation.
Value: ⭐⭐⭐⭐⭐ — Demonstrates grasp motion generation for 500k+ objects for the first time. The code and dataset are open-source, offering direct value to the robotics and computer animation communities.