An Economic Framework for 6-DoF Grasp Detection

Conference: ECCV 2024
arXiv: 2407.08366
Code: https://github.com/iSEE-Laboratory/EconomicGrasp
Area: Robotics
Keywords: 6-DoF Grasp Detection, Economic Supervision, Ambiguity Problem, Focus Representation, Resource-Efficient

TL;DR

This paper proposes the EconomicGrasp framework. It identifies the ambiguity problem of sparse supervision as the root cause of the conflict between performance and resource consumption, then designs an economic supervision paradigm (retaining all views but cropping rotation angles and depths) and a focus representation module (an interactive grasp head with composite scoring). On GraspNet-1Billion it outperforms the SOTA by about 3 AP on average with only about 1/4 of the training time and 1/8 of the memory cost.

Background & Motivation

Background: 6-DoF grasp detection is a fundamental capability in robotic manipulation, aiming to generate feasible grasp poses \(\mathbf{G} = [\mathbf{c}, v, a, d, w, s]\) (center point, view direction, rotation angle, depth, width, and score) from input point clouds. Recently, with the emergence of the large-scale real-world dataset GraspNet-1Billion, methods using dense supervision (14,400 labels per point, i.e., 300 views \(\times\) 12 angles \(\times\) 4 depths) have achieved state-of-the-art (SOTA) performance, exemplified by GSNet.
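To make the label count concrete, the sketch below models one grasp pose and the size of the dense label grid. The class name `GraspPose`, the field types, and the use of integer bin indices for view, angle, and depth are illustrative assumptions, not the dataset's actual storage format.

```python
from dataclasses import dataclass
from typing import Tuple

# Grid sizes quoted in the text for dense supervision.
NUM_VIEWS, NUM_ANGLES, NUM_DEPTHS = 300, 12, 4


@dataclass(frozen=True)
class GraspPose:
    """One grasp G = [c, v, a, d, w, s] (hypothetical container)."""
    center: Tuple[float, float, float]  # c: grasp center point
    view: int                           # v: index of the approach direction
    angle: int                          # a: in-plane rotation bin
    depth: int                          # d: depth bin
    width: float                        # w: gripper opening width
    score: float                        # s: grasp quality score


def dense_labels_per_point() -> int:
    """Number of candidate grasps labeled at a single point."""
    return NUM_VIEWS * NUM_ANGLES * NUM_DEPTHS


example = GraspPose(center=(0.0, 0.0, 0.5), view=0, angle=3, depth=1,
                    width=0.05, score=0.8)
```

Calling `dense_labels_per_point()` reproduces the 14,400 labels per point mentioned above.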

Limitations of Prior Work: Although dense supervision yields excellent performance, it incurs massive resource overheads: (1) Long training time: data processing time is about 9 times that of model training and loss calculation combined; (2) High memory consumption: each batch requires loading nearly 100 million labels, consuming up to 35.4 GB of memory; (3) Low GPU utilization: only around 20%; (4) High storage cost: the label set size reaches 55 GB. In contrast, while early sparse supervision methods are resource-friendly, their performance lags significantly.

Key Challenge: Dense supervision achieves high performance but at the cost of massive resource consumption, whereas sparse supervision is resource-friendly but suffers from limited performance. What fundamentally causes this performance gap? Is it the gap in network module design, or is it an inherent problem of the supervision paradigms themselves?

Goal: 1) Identify the root cause of the performance gap between dense and sparse supervision; 2) Design an "economic" supervision paradigm that is both resource-friendly and performance-preserving; 3) Further improve grasp prediction accuracy under economic supervision.

Key Insight: The authors first progressively "modernize" a simple sparse supervision baseline by adding advanced modules, only to find that network module design is not the primary cause of the performance gap. Through variance analysis, they reveal that the "ambiguity problem" in sparse supervision is the true culprit: multiple good grasps with completely different poses can exist at the same point. If only one is randomly retained, the label directions of neighboring points can contradict each other, confounding network training.
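The variance argument can be illustrated with a toy calculation. The sketch below compares the spread of rotation-angle labels pooled over all views with the spread inside each view. The numbers are made up purely for illustration (they are not the paper's data), and angle bins are treated as plain numbers, ignoring that rotation is circular.

```python
from statistics import mean, pstdev
from typing import Dict, List

# Hypothetical toy labels: view index -> best rotation-angle bins seen at
# nearby points when that view is used.
angles_by_view: Dict[int, List[int]] = {
    0: [2, 2, 3],
    1: [7, 7, 7],
    2: [10, 11, 11],
}


def pooled_std(groups: Dict[int, List[int]]) -> float:
    """Spread of the angle when the view is ignored (sparse-style labels)."""
    values = [a for group in groups.values() for a in group]
    return pstdev(values)


def mean_within_view_std(groups: Dict[int, List[int]]) -> float:
    """Average spread of the angle once the view is fixed."""
    return mean(pstdev(group) for group in groups.values())


pooled = pooled_std(angles_by_view)
within = mean_within_view_std(angles_by_view)
```

In this toy setup the within-view spread is far smaller than the pooled spread, which mirrors the qualitative pattern the authors report: most of the disagreement between labels comes from the choice of view.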

Core Idea: By retaining all grasp views (eliminating ambiguity) but cropping rotation angles and depths (reducing redundancy), the framework finds an "economic" trade-off between dense and sparse supervision, while designing a focus representation module to accurately predict specific grasps.

Method

Overall Architecture

The pipeline of EconomicGrasp is: Input Point Cloud \(\rightarrow\) 3D UNet backbone for feature extraction \(\rightarrow\) Graspness prediction to identify graspable points \(\rightarrow\) Best view selection \(\rightarrow\) Cylindrical region feature grouping \(\rightarrow\) Interactive Grasp Head for predicting angle, depth, width, and score \(\rightarrow\) Composite Score Estimation to output the final score. The core contributions lie in the intermediate supervision paradigm design and the backend focus representation module.

Key Designs

Economic Supervision Paradigm:
- Function: Maintains training effectiveness while substantially reducing label quantity, compressing the label set from 55 GB to 1.6 GB.
- Mechanism: Variance analysis shows that the grasp view is at the core of the ambiguity problem: once the view direction is fixed, the standard deviations of rotation angle and depth drop nearly to zero (from 3.18 and 0.81 to 0.22 and 0.05, respectively). Based on this insight, labels are cropped in three steps: (1) Grasp Pose Cropping: retain all 300 view directions but keep only the best grasp for each view, reducing labels per point from 14,400 to 300; (2) Scene-level Label Construction: pre-merge the labels of all objects in each scene into scene-level labels, avoiding online construction overhead during training; (3) Point Cropping: remove points that have no valid grasp label (every candidate has a friction coefficient \(> 0.8\) or is in collision), which further halves the label volume. In addition, a selective match loss handles points that cannot find a matched label under economic supervision.
- Design Motivation: All views are kept because the view is the most ambiguous parameter. If views are incomplete, adjacent points in the same region may be assigned grasp labels with completely different directions, preventing the network from learning a consistent pattern. Once the view is fixed, by contrast, the angle and depth vary very little, so keeping a single optimal value per view is sufficient.
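The pose cropping and point cropping steps described above can be sketched as follows. This is a minimal illustration on a tiny grid, not the repository's implementation: the function names, the nested-list layout `scores[view][angle][depth]`, and the convention that a score of 0.0 marks an invalid grasp are all assumptions.

```python
from typing import Dict, List, Tuple

# dense[v][a][d] is the score of the grasp at view v, angle bin a, depth bin d.
# A score of 0.0 is assumed here to mark an invalid grasp.
DenseLabel = List[List[List[float]]]
ViewLabel = Tuple[int, int, float]  # (best angle bin, best depth bin, score)


def crop_grasp_poses(dense: DenseLabel) -> List[ViewLabel]:
    """Keep every view, but only the best (angle, depth) pair inside it."""
    cropped = []
    for view_scores in dense:
        candidates = [
            (a, d, score)
            for a, row in enumerate(view_scores)
            for d, score in enumerate(row)
        ]
        # max() returns the first candidate when several share the top score.
        cropped.append(max(candidates, key=lambda c: c[2]))
    return cropped


def crop_points(labels: Dict[int, List[ViewLabel]]) -> Dict[int, List[ViewLabel]]:
    """Drop points whose best grasp is invalid in every view."""
    return {
        point: views
        for point, views in labels.items()
        if any(score > 0.0 for _, _, score in views)
    }


# Toy scene: 2 points, 2 views, 3 angle bins, 2 depth bins
# (the real grid is 300 x 12 x 4).
dense_scene: Dict[int, DenseLabel] = {
    0: [
        [[0.0, 0.2], [0.8, 0.4], [0.0, 0.0]],
        [[0.0, 0.0], [0.0, 0.0], [0.0, 0.6]],
    ],
    1: [
        [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]],
        [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]],
    ],
}
economic_scene = crop_points(
    {point: crop_grasp_poses(dense) for point, dense in dense_scene.items()}
)
```

Point 0 keeps one label per view and point 1 is removed because none of its grasps is valid. On the full grid the same reduction takes each point from 300 × 12 × 4 = 14,400 labels to 300.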

Interactive Grasp Head:
- Function: Learns more discriminative feature representations for specified grasps under economic supervision.
- Mechanism: It combines two kinds of interactive attention, global and local. Global Interactive Attention performs self-attention among the point features within the cylindrical region, compressing them into a unified representation that focuses on the specific good grasp rather than the whole region. Local Interactive Attention performs attention across the features of the four grasp parameters (angle, depth, width, score) to model their dependencies; for example, the depth is sometimes determined once the angle is fixed, and vice versa. The four parameters use independent prediction heads but share information through attention.
- Design Motivation: Under economic supervision, there is only one optimal grasp per view, requiring the network to precisely learn this specific grasp. Traditional methods predict scores for all angle-depth combinations, leading to scattered features; this method utilizes interactive attention to focus features on the single target, enhancing prediction accuracy. Learning the dependencies between parameters is another unique advantage of this design.
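As a rough illustration of how the four parameter branches could exchange information, the sketch below runs generic scaled dot-product self-attention over four feature vectors. It is not the paper's layer: there are no learned query, key, or value projections, and the 3-dimensional features and the name `param_tokens` are hypothetical.

```python
import math
from typing import Dict, List

Vector = List[float]


def softmax(xs: Vector) -> Vector:
    """Numerically stable softmax over a list of logits."""
    top = max(xs)
    exps = [math.exp(x - top) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens: List[Vector]) -> List[Vector]:
    """Scaled dot-product self-attention with identity Q/K/V projections."""
    dim = len(tokens[0])
    mixed = []
    for query in tokens:
        logits = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in tokens
        ]
        weights = softmax(logits)
        # Each output is a weighted average of all token features.
        mixed.append(
            [sum(w * tok[i] for w, tok in zip(weights, tokens)) for i in range(dim)]
        )
    return mixed


# Hypothetical features for the four parameter branches.
param_tokens: Dict[str, Vector] = {
    "angle": [1.0, 0.0, 0.0],
    "depth": [0.9, 0.1, 0.0],
    "width": [0.0, 1.0, 0.0],
    "score": [0.0, 0.0, 1.0],
}
updated = dict(zip(param_tokens, self_attention(list(param_tokens.values()))))
```

After the attention step each branch's feature contains a contribution from the other three, which is the kind of parameter dependency the Local Interactive Attention is meant to capture; each branch would then feed its own prediction head.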

Composite Score Estimation:
- Function: Predicts grasp quality scores more accurately.
- Mechanism: It turns continuous score prediction into a classification problem. The score is computed from the friction coefficient \(\mu\) as \(s = 1.1 - \mu\) and takes 6 discrete values (0, 0.2, 0.4, 0.6, 0.8, 1.0), so a 6-class classifier predicts a probability distribution \(\mathbf{s}_c\) over these levels. During inference, instead of simply taking the score of the most probable class, a composite score is computed as \(s = [0, 0.2, 0.4, 0.6, 0.8, 1.0] \cdot \mathbf{s}_c^T\), i.e., the probability-weighted sum of the score levels. This exploits the full distribution learned by the classifier.
- Design Motivation: Direct regression of scores has limited accuracy, whereas the classification + weighted combination approach captures the distribution characteristics while outputting continuous values. Ablation studies show that not using composite scoring leads to a drop of about 10 AP. Moreover, this design is tailored for economic supervision—introducing classification under dense supervision would expand the angle-depth combinations by 6 times, which is highly uneconomical.
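The composite score formula above is a plain expectation over the six score levels. The sketch below contrasts it with taking the most probable class; the function names and the two example probability vectors are hypothetical.

```python
from typing import List

# The six discrete score levels listed in the text.
SCORE_LEVELS = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]


def composite_score(probs: List[float]) -> float:
    """Probability-weighted sum of the score levels."""
    if len(probs) != len(SCORE_LEVELS):
        raise ValueError("expected one probability per score level")
    return sum(level * p for level, p in zip(SCORE_LEVELS, probs))


def argmax_score(probs: List[float]) -> float:
    """Score level of the most probable class, for comparison."""
    best = max(range(len(probs)), key=probs.__getitem__)
    return SCORE_LEVELS[best]


# Two hypothetical classifier outputs that share the same top class.
probs_a = [0.0, 0.0, 0.1, 0.2, 0.4, 0.3]
probs_b = [0.0, 0.0, 0.0, 0.1, 0.5, 0.4]

tie = argmax_score(probs_a) == argmax_score(probs_b)
ranked_apart = composite_score(probs_b) > composite_score(probs_a)
```

Both grasps get the same score under the argmax rule, so they cannot be ranked against each other, whereas the composite score separates them. This is one plausible reading of why a continuous output helps when candidate grasps have to be sorted.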

Loss & Training

The total loss includes: (1) smooth L1 loss for view prediction; (2) cross-entropy classification loss for rotation angle and depth; (3) smooth L1 regression loss for width; (4) cross-entropy classification loss for scores; (5) auxiliary loss for graspness identification. Training uses the Adam optimizer with an initial learning rate of 1e-3, cosine decay, batch size of 4, and trains for 10 epochs. The backbone is a 14-layer 3D UNet (implemented with Minkowski Engine) with a feature dimension of 512.
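For reference, a cosine-decay schedule with the stated initial learning rate and epoch count can be written as below. The summary does not say whether decay is applied per epoch or per iteration, whether there is a warm-up, or what the final learning rate is; this sketch assumes per-epoch decay to zero with no warm-up.

```python
import math
from typing import List

BASE_LR = 1e-3   # initial learning rate from the text
NUM_EPOCHS = 10  # training length from the text


def cosine_lr(epoch: float, base_lr: float = BASE_LR,
              total_epochs: int = NUM_EPOCHS, min_lr: float = 0.0) -> float:
    """Cosine decay from base_lr at epoch 0 to min_lr at total_epochs."""
    progress = min(max(epoch / total_epochs, 0.0), 1.0)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Learning rate at the start of each epoch, plus the value after the last one.
schedule: List[float] = [cosine_lr(e) for e in range(NUM_EPOCHS + 1)]
```

Under these assumptions the rate starts at 1e-3, reaches half of that at epoch 5, and ends at zero.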

Key Experimental Results

Main Results

| Dataset | Metric | EconomicGrasp | GSNet (SOTA) | Gain |
|---|---|---|---|---|
| Kinect-Seen | AP | 62.59 | 61.19 | +1.40 |
| Kinect-Similar | AP | 51.73 | 47.39 | +4.34 |
| Kinect-Novel | AP | 19.54 | 19.01 | +0.53 |
| RealSense-Seen | AP | 68.21 | 65.70 | +2.51 |
| RealSense-Similar | AP | 61.19 | 53.75 | +7.44 |
| RealSense-Novel | AP | 25.48 | 23.98 | +1.50 |

Resource Cost Comparison

| Method | Training Time (h) | Memory (GB) | Storage (GB) | Mean AP (Kinect/RealSense) |
|---|---|---|---|---|
| GSNet (dense) | 37.8 | 35.4 | 55 | 42.53/47.81 |
| EconomicGrasp | 8.3 | 4.2 | 1.6 | 44.62/51.63 |
| Ratio / Gain | 1/4.5 | 1/8.4 | 1/34 | +2.1/+3.8 |
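The ratio row and the mean AP column can be re-derived from the other numbers in the two tables. The short check below does that arithmetic; the dictionary layout and function names are only a convenience for this sketch.

```python
from typing import Dict, List

# Seen / Similar / Novel AP values copied from the main results table.
kinect: Dict[str, List[float]] = {
    "GSNet": [61.19, 47.39, 19.01],
    "EconomicGrasp": [62.59, 51.73, 19.54],
}
realsense: Dict[str, List[float]] = {
    "GSNet": [65.70, 53.75, 23.98],
    "EconomicGrasp": [68.21, 61.19, 25.48],
}
# Resource figures copied from the resource cost table.
resources: Dict[str, Dict[str, float]] = {
    "GSNet": {"time_h": 37.8, "memory_gb": 35.4, "storage_gb": 55.0},
    "EconomicGrasp": {"time_h": 8.3, "memory_gb": 4.2, "storage_gb": 1.6},
}


def mean_ap(values: List[float]) -> float:
    """Unweighted mean over the Seen, Similar, and Novel splits."""
    return sum(values) / len(values)


def reduction_factor(key: str) -> float:
    """How many times smaller the EconomicGrasp cost is than GSNet's."""
    return resources["GSNet"][key] / resources["EconomicGrasp"][key]


kinect_gain = mean_ap(kinect["EconomicGrasp"]) - mean_ap(kinect["GSNet"])
realsense_gain = mean_ap(realsense["EconomicGrasp"]) - mean_ap(realsense["GSNet"])
factors = {key: reduction_factor(key) for key in resources["GSNet"]}
```

The computed gains are about 2.1 AP (Kinect) and 3.8 AP (RealSense), and the reduction factors are roughly 4.55, 8.4, and 34 for training time, memory, and storage, consistent with the table up to rounding.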

Ablation Study

| Configuration | Seen | Similar | Novel | Mean |
|---|---|---|---|---|
| Vanilla (Sparse Supervision) | 43.59 | 34.09 | 13.36 | 30.34 |
| + Economic Supervision | 60.07 | 48.16 | 18.70 | 42.31 |
| + Interactive Head | 63.08 | 50.61 | 18.74 | 44.14 |
| + Composite Score | 59.81 | 48.45 | 19.01 | 42.42 |
| Full Model | 62.59 | 51.73 | 19.54 | 44.62 |

Key Findings

Economic Supervision is the largest contributor: Moving from Vanilla to +Economic Supervision increases the mean AP from 30.34 to 42.31 (+12 AP gain), validating that the ambiguity problem is the core bottleneck.
Interactive Grasp Head provides consistent gains: It improves the mean AP by +1.8, demonstrating the effectiveness of focused feature learning for predicting specific grasps.
Composite Scoring must collaborate with the Interactive Head: Using composite scoring alone barely changes the mean AP (42.42 vs. 42.31) and slightly lowers Seen AP (59.81 vs. 60.07), but combining it with the Interactive Head achieves the best mean AP of 44.62, showing that accurate scoring depends on precise feature representations.
Real-world robot experiments: In grasp experiments with 6 scenes and 36 objects, EconomicGrasp achieves a success rate of 92.3%, outperforming GSNet's 87.8%; failures drop from 109/7680 to 36/7680.

Highlights & Insights

Brilliant discovery and analysis of the ambiguity problem: Revealing the root cause of the performance gap between dense and sparse supervision through simple variance statistics is highly commendable. This "diagnosis before treatment" research methodology is exemplary. The finding that angle and depth are almost fixed once the view is determined is highly insightful.
Clever design of economic supervision balancing information and redundancy: Rather than simply reducing labels, this approach selectively retains key information based on ambiguity analysis (retaining all views while cropping others). This data-understanding-driven method design is highly convincing.
Substantial improvement in resource efficiency: Training takes about 1/4 of the time and 1/8 of the memory while performance improves, which makes the method highly valuable for practical deployment. Moreover, the method can be plugged into other frameworks.

Limitations & Future Work

The economic supervision paradigm relies on the label structure of GraspNet-1Billion (300 views \(\times\) 12 angles \(\times\) 4 depths). Cropping strategies need to be redesigned for datasets with other annotation formats.
As mentioned in the conclusion, constructing economic supervision from scratch (rather than simplifying from dense annotations) is a meaningful future direction.
The current method targets two-finger parallel grippers; adaptation to complex end-effectors like dexterous hands requires further research.
Is the 6-level discretization of the composite score optimal? Whether finer-grained discretization can bring further improvements remains to be explored.
The real-world experimental scenes are relatively small (6 scenes); larger-scale and more diverse physical testing would better validate generalization.

vs GSNet: GSNet is the current SOTA dense-supervision method and uses all 14,400 labels per point. EconomicGrasp outperforms it with only 300 labels per point while substantially reducing resource consumption, showing that "more labels do not necessarily mean better results."
vs S4G/PointNetGPD: These early sparse supervision methods predict only one grasp per point, and their performance is constrained by the ambiguity problem. EconomicGrasp fundamentally resolves ambiguity by retaining all views.
vs TransGrasp/GraNet: Although some sparse-supervision methods utilize advanced architectures, they do not resolve the ambiguity problem, leading to a persistent performance gap.
The idea of economic supervision can inspire other dense annotation tasks (such as dense prediction, 3D object detection, etc.)—designing more efficient supervision strategies by analyzing label redundancy and ambiguity.

Rating

Novelty: ⭐⭐⭐⭐ The discovery of the ambiguity problem and the design of the economic supervision paradigm are highly original, though the technical components are relatively standard.
Experimental Thoroughness: ⭐⭐⭐⭐⭐ Very comprehensive, covering two camera datasets, detailed ablation studies, resource cost analysis, real-world robot experiments, and failure case analysis.
Writing Quality: ⭐⭐⭐⭐ The logical flow from observation to analysis to solution is clear, figures are well-designed, and the step-by-step reasoning is highly convincing.
Value: ⭐⭐⭐⭐⭐ Substantially reduces resource consumption while improving performance, offering direct engineering value to the field of robotic grasping.