Language-Driven 6-DoF Grasp Detection Using Negative Prompt Guidance

Conference: ECCV 2024
arXiv: 2407.13842
Code: https://airvlab.github.io/grasp-anything
Area: 3D Vision
Keywords: 6-DoF Grasp Detection, Language Guidance, Diffusion Models, Negative Prompt Guidance, Robotic Manipulation

TL;DR

This paper proposes a large-scale language-driven 6-DoF grasp dataset named Grasp-Anything-6D (containing 1M scenes and over 200M grasp poses) and LGrasp6D, a diffusion-based framework. The core novelty lies in the Negative Prompt Guidance (NPG) strategy, which directs the grasp poses away from non-target objects during inference.

Background & Motivation

Background: 6-DoF grasp detection is a fundamental task in robotic vision. Existing methods primarily focus on grasp stability, but neglect the grasping intentions expressed by humans via natural language.

Limitations of Prior Work: Existing language-driven grasping methods suffer from severe limitations: some only support single-object scenes, while others only detect 2D rectangle grasps, making them unable to handle fine-grained language instructions in cluttered 3D scenes.

Key Challenge: Language-driven grasping is inherently a fine-grained task (e.g., "grasp the blue cup" vs. "grasp the black cup"), yet prior works lack precise guidance mechanisms to differentiate between various objects.

Goal: To detect 6-DoF grasp poses for target objects in cluttered 3D point cloud scenes based on natural language instructions.

Key Insight: Drawing inspiration from negative prompts in image generation, this work learns the embedding representations of non-target objects in the scene during training. During inference, it "pushes" the grasp poses away from the non-target objects.

Core Idea: By learning negative prompt embeddings that encode "what not to grasp," the method combines positive guidance and negative repulsion during diffusion sampling to achieve precise grasping.

Method

Overall Architecture

LGrasp6D is an end-to-end diffusion framework:

Input: 3D point cloud scene \(\mathbf{S}\) and text prompt \(\mathbf{t}\) (e.g., "Grasp the red mug")
Forward Process: Incrementally adds noise to the target grasp pose \(\mathbf{g}_0\) using the \(\mathfrak{se}(3)\) Lie algebra representation.
Denoising Network: Simultaneously predicts noise and learns negative prompt embeddings.
Reverse Process: Generates grasp poses using the denoising step guided by the negative prompt.
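
The forward process can be sketched in NumPy as follows. This is a minimal illustration, assuming the standard closed-form DDPM noising step applied to a 6-D \(\mathfrak{se}(3)\) vector; the function name `forward_noise`, the vector layout, and the example pose are hypothetical and not taken from the authors' code. The schedule values are the ones listed in the training setup.

```python
import numpy as np

def forward_noise(g0, t, betas, rng):
    """Sample g_t ~ q(g_t | g_0) in closed form (standard DDPM assumption).

    g0    : (6,) grasp pose as an se(3) vector (layout is an assumption)
    t     : integer diffusion step in [1, T]
    betas : (T,) noise variances beta_1..beta_T
    """
    alpha_bar = np.cumprod(1.0 - betas)[t - 1]   # product of (1 - beta_s) for s <= t
    eps = rng.standard_normal(g0.shape)          # Gaussian noise the network must predict
    g_t = np.sqrt(alpha_bar) * g0 + np.sqrt(1.0 - alpha_bar) * eps
    return g_t, eps

rng = np.random.default_rng(0)
betas = np.linspace(1e-4, 0.02, 200)             # linear schedule, T = 200
alpha_bar_T = np.cumprod(1.0 - betas)[-1]
g0 = np.array([0.1, -0.2, 0.3, 0.0, 0.5, -0.1])  # hypothetical target grasp
g_t, eps = forward_noise(g0, t=200, betas=betas, rng=rng)
```

The pair `(g_t, eps)` is what a noise-prediction network would be trained on: it receives `g_t` and is asked to recover `eps`.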

Key Designs

Grasp-Anything-6D Dataset: Built on the 2D Grasp-Anything dataset. ZoeDepth estimates a depth map for each image, and the 1M 2D scenes are projected into 3D point clouds. Grasp positions are mapped via bilinear interpolation, and 6-DoF rotations are derived from the angles of the 2D rectangle grasps, followed by manual inspection to discard colliding or unstable grasps. The resulting dataset contains 1M point cloud scenes and over 200M 6-DoF grasp poses, each associated with a language description. Design motivation: to fill the gap in training data for language-driven 6-DoF grasping.
Denoising Network Architecture: The input grasp pose \(\mathbf{g}_t\) is encoded via an MLP, the point cloud \(\mathbf{S}\) is encoded into \(n_s\) scene tokens using PointNet++, and the text \(\mathbf{t}\) is encoded using a frozen CLIP ViT-B/32. The fused feature \(\mathbf{f}_{\text{uni}}\) of the time embedding, grasp embedding, and text embedding is used as the query, while the scene tokens serve as the key/value in multi-head cross-attention. Finally, an MLP outputs the predicted noise \(\boldsymbol{\epsilon}_{\boldsymbol{\theta}}(\mathbf{g}_t, \mathbf{S}, \mathbf{t}, t)\). The noise prediction loss is:

\[\mathcal{L}_{\text{noise}} = \mathbb{E}_{\boldsymbol{\epsilon}, \mathbf{g}_0, \mathbf{S}, \mathbf{t}, t}\left[\|\boldsymbol{\epsilon}_{\boldsymbol{\theta}}(\mathbf{g}_t, \mathbf{S}, \mathbf{t}, t) - \boldsymbol{\epsilon}\|^2\right]\]

Negative Prompt Guidance Learning: The network simultaneously outputs a negative prompt embedding \(\tilde{\mathbf{t}}\), which is obtained by subtracting the positive text embedding \(\mathbf{t}\) from the scene tokens, calculating their mean, and then passing it through an MLP. This embedding is trained to approximate the text embeddings of other objects in the scene (representing "what not to grasp"):

\[\mathcal{L}_{\text{negative}} = \min_{i=1}^{m} \|\tilde{\mathbf{t}} - \bar{\mathbf{t}}_i\|_2^2\]

where \(\{\bar{\mathbf{t}}_i\}_{i=1}^m\) is the set of text embeddings of non-target objects in the same scene. During inference, based on the conditional distribution decomposition in Proposition 1:

\[p(\mathbf{g}|\mathbf{S}, \mathbf{t}, \neg\tilde{\mathbf{t}}) \propto p(\mathbf{g}|\mathbf{S}) \frac{p(\mathbf{g}|\mathbf{t}, \mathbf{S})}{p(\mathbf{g}|\tilde{\mathbf{t}}, \mathbf{S})}\]

The combined denoising step is formulated as:

\[\tilde{\boldsymbol{\epsilon}}_{\boldsymbol{\theta}} = \boldsymbol{\epsilon}_{\boldsymbol{\theta}}(\mathbf{g}_t, \mathbf{S}, \varnothing, t) + w\left(\boldsymbol{\epsilon}_{\boldsymbol{\theta}}(\mathbf{g}_t, \mathbf{S}, \mathbf{t}, t) - \boldsymbol{\epsilon}_{\boldsymbol{\theta}}(\mathbf{g}_t, \mathbf{S}, \tilde{\mathbf{t}}, t)\right)\]

where \(w=0.2\) is the negative guidance strength, and \(\varnothing\) represents the unconditional prediction trained by randomly masking the text condition with a probability \(p_{\text{mask}}=0.1\).
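
The negative prompt embedding and the combined denoising step described above can be sketched as follows. This is an illustrative NumPy sketch, not the authors' implementation: the two-layer ReLU MLP, the assumption that scene tokens and the text embedding share one dimension `d`, and all function and variable names are hypothetical.

```python
import numpy as np

def negative_prompt_embedding(scene_tokens, text_emb, W1, b1, W2, b2):
    """Subtract the positive text embedding from every scene token,
    average over tokens, then apply a small MLP (assumed 2 layers, ReLU)."""
    pooled = (scene_tokens - text_emb).mean(axis=0)   # (d,)
    hidden = np.maximum(pooled @ W1 + b1, 0.0)        # ReLU
    return hidden @ W2 + b2                           # (d,)

def guided_noise(eps_uncond, eps_pos, eps_neg, w=0.2):
    """Combined prediction: unconditional term plus w * (positive - negative)."""
    return eps_uncond + w * (eps_pos - eps_neg)

rng = np.random.default_rng(0)
n_s, d = 8, 16                                        # toy sizes, chosen arbitrarily
scene_tokens = rng.standard_normal((n_s, d))
text_emb = rng.standard_normal(d)
W1, b1 = rng.standard_normal((d, d)), np.zeros(d)
W2, b2 = rng.standard_normal((d, d)), np.zeros(d)
t_neg = negative_prompt_embedding(scene_tokens, text_emb, W1, b1, W2, b2)

# Toy noise predictions standing in for three forward passes of the denoiser.
eps_tilde = guided_noise(np.zeros(6), np.ones(6), 0.5 * np.ones(6))
```

When the positive and negative predictions coincide, the guidance term vanishes and the result falls back to the unconditional prediction, which matches the intuition that guidance only acts where the target and non-target conditions disagree.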

Loss & Training

\[\mathcal{L} = 0.9\mathcal{L}_{\text{noise}} + 0.1\mathcal{L}_{\text{negative}}\]
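
A minimal NumPy sketch of this weighted objective is shown below. The function names and the toy inputs are hypothetical; the expectation in the noise loss is approximated by a batch mean, which is an assumption about how it would be estimated in practice.

```python
import numpy as np

def noise_loss(eps_pred, eps):
    """Batch mean of the squared L2 error between predicted and true noise."""
    return np.mean(np.sum((eps_pred - eps) ** 2, axis=-1))

def negative_loss(t_neg, non_target_embs):
    """Squared distance from the learned negative embedding to the closest
    non-target text embedding (the min over i = 1..m)."""
    sq_dists = np.sum((non_target_embs - t_neg) ** 2, axis=-1)
    return sq_dists.min()

def total_loss(eps_pred, eps, t_neg, non_target_embs):
    return 0.9 * noise_loss(eps_pred, eps) + 0.1 * negative_loss(t_neg, non_target_embs)

# Toy example: batch of two 2-D noise vectors and two non-target embeddings.
eps_pred = np.array([[1.0, 0.0], [0.0, 0.0]])
eps = np.zeros((2, 2))
t_neg = np.zeros(2)
non_target_embs = np.array([[3.0, 4.0], [1.0, 0.0]])
loss = total_loss(eps_pred, eps, t_neg, non_target_embs)
```

Taking the minimum rather than the mean means the negative embedding only needs to match one non-target object in the scene, so it is not pulled toward an average of several unrelated descriptions.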

Number of diffusion steps \(T=200\), with variance linearly increasing from \(\beta_1=10^{-4}\) to \(\beta_T=0.02\)
8× A100 GPUs, batch size 128, 200 epochs
Adam optimizer, lr=\(10^{-3}\), weight decay=\(10^{-4}\)
DDIM acceleration during inference can reduce the steps to 50 or even 10
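
The step reduction from DDIM can be illustrated with a short NumPy sketch. It assumes deterministic DDIM (no added noise) with evenly spaced timesteps; `ddim_timesteps`, `ddim_sample`, and the oracle noise model are hypothetical stand-ins for the trained denoiser, not the authors' code.

```python
import numpy as np

def ddim_timesteps(T=200, n_steps=50):
    """Evenly spaced, strictly decreasing subset of the T training steps (0-indexed)."""
    return np.linspace(T - 1, 0, n_steps).round().astype(int)

def ddim_sample(eps_model, g_T, alpha_bar, steps):
    """Deterministic DDIM sampling over the selected timesteps."""
    g = g_T
    for i, t in enumerate(steps):
        eps = eps_model(g, t)
        # Estimate the clean pose implied by the current noise prediction.
        g0_pred = (g - np.sqrt(1.0 - alpha_bar[t]) * eps) / np.sqrt(alpha_bar[t])
        # Jump to the next selected step; the final jump lands on the clean pose.
        ab_prev = alpha_bar[steps[i + 1]] if i + 1 < len(steps) else 1.0
        g = np.sqrt(ab_prev) * g0_pred + np.sqrt(1.0 - ab_prev) * eps
    return g

betas = np.linspace(1e-4, 0.02, 200)
alpha_bar = np.cumprod(1.0 - betas)
steps = ddim_timesteps(T=200, n_steps=50)

# Oracle that returns the exact noise for a known pose, used only to check the sampler.
g0 = np.array([0.1, -0.2, 0.3, 0.0, 0.5, -0.1])
oracle = lambda g, t: (g - np.sqrt(alpha_bar[t]) * g0) / np.sqrt(1.0 - alpha_bar[t])

rng = np.random.default_rng(0)
g_T = np.sqrt(alpha_bar[-1]) * g0 + np.sqrt(1.0 - alpha_bar[-1]) * rng.standard_normal(6)
out = ddim_sample(oracle, g_T, alpha_bar, steps)
```

With a perfect noise predictor the sampler recovers `g0` regardless of how many steps are used; with a learned network, fewer steps trade some accuracy for speed.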

Key Experimental Results

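Main Results - Grasp-Anything-6D Dataset

Metrics: CR (coverage rate), EMD (earth mover's distance), CFR (collision-free rate), and IT (inference time, in seconds). Arrows indicate whether higher (↑) or lower (↓) is better.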

Method	CR↑	EMD↓	CFR↑	IT(s)↓
6-DoF GraspNet	0.3802	0.8035	0.6900	0.4216
SE(3)-DF	0.4290	0.7565	0.7325	1.7233
3DAPNet	0.4777	0.7381	0.7213	3.4274
Ours w/o NPG	0.5459	0.6262	0.7336	1.4328
LGrasp6D (Ours)	0.6694	0.4013	0.7706	1.4832

Real-Robot Experiments

Method	Input	Single-Object Success Rate	Cluttered Scene Success Rate
GG-CNN + CLIP	RGB-D	0.10	0.07
CLIPORT	RGB-D	0.27	0.30
CLIP-Fusion	RGB-D	0.40	0.40
LGD	RGB-D	0.43	0.42
6-DoF GraspNet	Point Cloud	0.31	0.27
Ours w/o NPG	Point Cloud	0.38	0.36
LGrasp6D (Ours)	Point Cloud	0.43	0.42

Key Findings

Negative Prompt Guidance brings a substantial improvement: CR increases from 0.5459 to 0.6694 (+22.6%), and EMD decreases from 0.6262 to 0.4013 (-35.9%).
t-SNE visualization demonstrates that when NPG is utilized, the grasp pose clusters of different objects are clearly separated, whereas they are severely conflated without NPG.
Cross-dataset generalization (on Contact-GraspNet) shows consistent trends with minor performance degradation.
With 50-step DDIM sampling, inference takes only 0.40s, which is faster than 6-DoF GraspNet (0.42s), while all evaluation metrics still exceed those of the other baselines.
Trained on synthetic data but successfully generalizes to real-world tabletop, kitchen, and bathroom scenes.
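
The relative improvements quoted for Negative Prompt Guidance can be reproduced from the main results table with a few lines of Python. The helper name `relative_change` is hypothetical.

```python
def relative_change(before, after):
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0

# Values from the main results table: "Ours w/o NPG" vs. "LGrasp6D (Ours)".
cr_gain = relative_change(0.5459, 0.6694)    # coverage rate, higher is better
emd_drop = relative_change(0.6262, 0.4013)   # EMD, lower is better

print(f"CR: {cr_gain:+.1f}%  EMD: {emd_drop:+.1f}%")
```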

Highlights & Insights

Transferring Negative Prompt Guidance across domains is ingenious: the negative prompt idea ("avoid generating certain content") is adapted from image generation to robotic grasping.
The mathematical derivations are rigorous: Proposition 1 provides a strict decomposition of the conditional distribution, and the combined denoising formulation is clearly structured.
The dataset scale is massive (1M scenes / 200M grasps), providing an essential training resource for this specific sub-field.
End-to-end design: mapping from natural language directly to 6-DoF grasps without requiring auxiliary segmentation or object detection modules.

Limitations & Future Work

There still exist failure cases involving grasping incorrect objects, collisions, and misaligned scoping (as shown in Figure 7 of the paper).
The dataset is derived from depth maps estimated via ZoeDepth, which introduces occasional inaccuracies.
Only object-level grasping instructions are supported, leaving part-level (e.g., "grasp the handle") and task-level (e.g., "grasp the knife to chop vegetables") scenarios unaddressed.
Applicable only to Robotiq 2F parallel-jaw grippers; generalization to multi-fingered dexterous hands requires further investigation.
The inference speed (1.48s) remains relatively slow for real-time interaction, and although DDIM can accelerate the process, it incurs a slight performance trade-off.

Related Work & Insights

Classifier-free Guidance: Negative prompting builds on classifier-free guidance (CFG), which linearly combines unconditional and conditional predictions.
3DAPNet: A representative prior work that uses diffusion models for 6-DoF grasping but lacks support for language.
Grasp-Anything: The 2D predecessor of this work's dataset, offering 2D language-grasp alignments.
Insight: The negative prompt mechanism can be extended to other conditional generation tasks (e.g., obstacle avoidance in language-driven manipulation planning).

Rating

Novelty: ⭐⭐⭐⭐ — The first effective application of negative prompt guidance in grasp detection.
Experimental Thoroughness: ⭐⭐⭐⭐⭐ — Highly comprehensive, featuring baseline comparisons, cross-dataset verification, DDIM acceleration, t-SNE analysis, and physical robot deployment.
Writing Quality: ⭐⭐⭐⭐ — Clearly structured with theoretical derivations, extensive benchmarks, and real-world robot validation.
Value: ⭐⭐⭐⭐ — Offers a large-scale dataset alongside an effective method, substantially advancing the domain of language-driven manipulation.