RAGNet: Large-scale Reasoning-based Affordance Segmentation Benchmark towards General Grasping¶
- Conference: ICCV 2025
- arXiv: 2507.23734
- Code: GitHub
- Area: Image Segmentation
- Keywords: affordance segmentation, robotic grasping, reasoning instructions, large-scale benchmark, vision-language model
TL;DR¶
This paper introduces RAGNet, the first large-scale reasoning-based affordance segmentation benchmark (273k images, 180 categories, 26k reasoning instructions), and proposes the AffordanceNet framework, which integrates VLM-pretrained affordance prediction with grasp pose generation, demonstrating strong open-world generalization and reasoning capabilities.
Background & Motivation¶
General-purpose robotic grasping systems must accurately perceive object affordance regions across diverse open-world scenarios in response to human instructions. Prior work suffers from two key bottlenecks:
Insufficient data scale and diversity: Existing affordance datasets are typically confined to specific domains (robotics, egocentric, or in-the-wild), with limited categories (UMD: 17, AGD20k: 50) and homogeneous image sources, resulting in poor open-world generalization.
Lack of reasoning-based instructions: Existing methods rely on fixed-template language prompts (e.g., "Please segment the affordance region of XX"), which cannot handle human-like high-level instructions such as "I need something that can cut bread," lacking complex reasoning capabilities.
Key insight: Instructions received by robots in real-world scenarios often do not directly specify the target object category but instead describe functional requirements. This demands that affordance prediction models possess a complete capability chain: from functional description reasoning to object identification to region segmentation.
Method¶
Overall Architecture¶
The paper makes two primary contributions:
1. RAGNet Benchmark: large-scale multi-source data collection + multi-tool affordance annotation + three-tier reasoning instruction construction.
2. AffordanceNet Model: AffordanceVLM (affordance region prediction) + Pose Generator (grasp pose generation).
Key Design 1: Multi-source Data Collection and Annotation¶
Data sources spanning four domains:
- Wild: HANDAL (17 categories of hardware/kitchen tools, real-world diverse scenes)
- Robot: Open-X (124 categories, robotic manipulation scenes), GraspNet (32 categories)
- Ego-centric: EgoObjects (74 indoor categories, first-person perspective)
- Simulation: RLBench (10 categories, simulated environments)
In total: 273k images across 180 categories, far exceeding prior datasets.
Five-tier affordance annotation pipeline:
- ❶ Raw masks: directly use existing fine-grained annotations (e.g., HANDAL)
- ❷ SAM2: for objects without handles, GT bounding boxes guide SAM2 to generate masks
- ❸ Florence2 + SAM2: language instructions drive Florence2 to produce polygon proposals, refined by SAM2
- ❹ VLPart + SAM2: VLPart's part-level recognition (e.g., knife handle, cup handle) combined with SAM2 for affordance region segmentation
- ❺ Manual annotation (+ SAM2): fallback when the above tools fail
Annotation strategies adapt to object characteristics: soda cans require whole-object annotation (grasping the entire object), while woks require only handle annotation.
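The pipeline reads as a fallback cascade: the cheapest reliable tool is tried first, and manual labeling is reserved for failures. A minimal Python sketch of that control flow is below; the `tools` bundle and all of its callables (`sam2_from_box`, `florence2_polygon`, `sam2_refine`, `vlpart_part`, `manual_annotate`) are hypothetical placeholders, not the paper's actual tooling.

```python
def annotate_affordance(image, obj, tools):
    """Fallback cascade over annotation tools, cheapest/most reliable first.

    `tools` is a hypothetical bundle of callables standing in for the paper's
    actual tooling (SAM2, Florence2, VLPart, human annotators).
    """
    if obj.raw_mask is not None:                    # ❶ reuse existing fine-grained labels (e.g., HANDAL)
        return obj.raw_mask
    if obj.grasp_whole_object:                      # ❷ handle-less objects: GT box prompts SAM2
        return tools.sam2_from_box(image, obj.bbox)
    prompt = f"the {obj.part_name} of the {obj.category}"
    polygon = tools.florence2_polygon(image, prompt)
    if polygon is not None:                         # ❸ language-driven polygon proposal, refined by SAM2
        return tools.sam2_refine(image, polygon)
    part = tools.vlpart_part(image, obj.category, obj.part_name)
    if part is not None:                            # ❹ open-vocabulary part recognition + SAM2
        return tools.sam2_refine(image, part)
    return tools.manual_annotate(image, obj)        # ❺ human fallback
```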
Key Design 2: Three-tier Reasoning Instruction System¶
- Template-based instructions: Fixed templates such as "Please segment the affordance map of \<category> in this image," applicable to all data.
- Easy reasoning instructions: Instructions containing the object name, e.g., "Can you find a mug for tea?"
- Hard reasoning instructions: Instructions describing only functional requirements without mentioning the object name, e.g., "I need something to drink coffee."
GPT-4 is used to generate reasoning instructions, producing 26k in total (HANDAL: 8.5k hard; EgoObjects: 12.7k easy + 4.7k hard).
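The three tiers differ only in how much the instruction reveals about the target object. The illustrative records below show one plausible way to represent this; the field names and exact phrasings are assumptions for illustration, not RAGNet's actual annotation format.

```python
# Illustrative instruction records for one object (a mug/cup).
# `mentions_category` marks whether the model can rely on the object name
# or must infer it from the described function.
instructions = [
    {"tier": "template", "mentions_category": True,
     "text": "Please segment the affordance map of mug in this image."},
    {"tier": "easy", "mentions_category": True,
     "text": "Can you find a mug for tea?"},          # reasoning, but the object name is given
    {"tier": "hard", "mentions_category": False,
     "text": "I need something to drink coffee."},    # object must be inferred from function
]
```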
Key Design 3: AffordanceNet Model¶
AffordanceVLM: VLM-based affordance region prediction module
- Input: RGB image + human instruction (template-based or reasoning-based)
- Output: affordance segmentation mask
- Pretrained on RAGNet's large-scale data to learn mappings from diverse instructions to affordance regions
Pose Generator: grasp pose generation module
- Input: 2D affordance mask + depth image
- Output: 3D gripper pose
- Converts the VLM's 2D predictions into executable robot actions
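At inference time the two modules chain directly: the VLM turns an instruction into a 2D mask, and the pose generator lifts that mask plus depth into a gripper pose. A minimal sketch, assuming hypothetical `AffordanceVLM`/`PoseGenerator` wrappers whose interfaces are not specified at this level of detail in the paper:

```python
import numpy as np

class AffordanceNetPipeline:
    """Sketch of the AffordanceVLM -> Pose Generator chain."""

    def __init__(self, vlm, pose_generator):
        self.vlm = vlm                        # predicts a 2D affordance mask from (image, instruction)
        self.pose_generator = pose_generator  # lifts mask + depth into a 3D gripper pose

    def grasp(self, rgb: np.ndarray, depth: np.ndarray, instruction: str):
        mask = self.vlm.segment(rgb, instruction)         # e.g., "I need something to drink coffee."
        pose = self.pose_generator.propose(mask, depth)   # grasp restricted to the masked region
        return mask, pose
```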
Validation Benchmark Design¶
Four validation splits are constructed:
- Zero-shot category recognition: evaluates generalization to unseen categories
- Cross-domain affordance prediction: evaluates prediction on unseen image domains
- Reasoning instruction validation: tests reasoning capability using hard instructions without category names
- Real-robot closed-loop grasping: fully cross-domain physical manipulation evaluation
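The first three splits are standard segmentation evaluations over predicted versus ground-truth masks. A generic mask-IoU scoring sketch is shown below for orientation; it is not the paper's evaluation code, and the specific metrics RAGNet reports are not restated here.

```python
import numpy as np

def mask_iou(pred: np.ndarray, gt: np.ndarray) -> float:
    """IoU between two binary masks of the same shape."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return float(inter) / float(union) if union > 0 else 1.0

def evaluate_split(samples) -> float:
    """Mean IoU over (predicted_mask, gt_mask) pairs for one validation split."""
    scores = [mask_iou(pred, gt) for pred, gt in samples]
    return float(np.mean(scores)) if scores else 0.0
```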
Key Experimental Results¶
Dataset Scale Comparison¶
| Dataset | Images | Categories | Wild | Robot | Ego | Sim | Reasoning |
|---|---|---|---|---|---|---|---|
| UMD (2015) | 10k | 17 | - | ✓ | - | - | - |
| AGD20k (2020) | 20k | 50 | ✓ | - | - | - | - |
| HANDAL (2023) | 200k | 17 | ✓ | - | - | - | - |
| ManipVQA (2024) | 84k | - | - | ✓ | ✓ | - | ✓ |
| RAGNet | 273k | 180 | ✓ | ✓ | ✓ | ✓ | ✓ |
RAGNet is the first large-scale affordance benchmark to simultaneously cover all four domains and support reasoning instructions.
Annotation Tool Combinations (by Data Source)¶
| Source | Domain | Annotation Tools | Reasoning Instructions | Categories |
|---|---|---|---|---|
| HANDAL | Wild | ❶ | 8.5k (hard) | 17 |
| Open-X | Robot | ❸❹❺ | - | 124 |
| GraspNet | Robot | ❶❺ | - | 32 |
| EgoObjects | Ego | ❷❹❺ | 17.4k | 74 |
| RLBench | Sim | ❺ | - | 10 |
Key Findings¶
- Pretraining on large-scale multi-source data significantly improves open-world generalization of affordance prediction.
- Hard reasoning instructions (without category names), while increasing task difficulty, effectively enhance the model's functional reasoning capability.
- Cross-domain validation demonstrates that the model can generalize from wild data to robot scenarios and vice versa.
- Real-robot closed-loop experiments validate the feasibility of the complete pipeline from VLM affordance prediction to grasp execution.
- The adaptive five-tier annotation pipeline substantially reduces manual annotation cost while maintaining annotation quality.
Highlights & Insights¶
- Benchmark contribution outweighs methodological contribution: The RAGNet dataset itself (273k images, 180 categories, 26k reasoning instructions, four-domain coverage) constitutes critical missing infrastructure for the field and will drive subsequent research.
- Tiered reasoning instruction design: The three-tier system (template-based → easy reasoning → hard reasoning) precisely corresponds to different requirement levels ranging from academic research to real-world deployment.
- Engineering value of the annotation pipeline: The five-tier adaptive annotation strategy (raw masks → SAM2 → Florence2+SAM2 → VLPart+SAM2 → manual) constitutes a reusable large-scale annotation methodology.
- Closed-loop validation approach: The complete chain from VLM affordance prediction to depth-based pose generation to real-robot execution bridges the full perception–planning–execution pipeline.
- Clear data motivation: By systematically comparing existing datasets across three dimensions—domain coverage, category breadth, and reasoning capability—the paper constructs a convincing argument for the proposed dataset.
Limitations & Future Work¶
- The AffordanceNet model architecture is relatively straightforward, essentially VLM fine-tuning followed by a pose-generation module, offering limited methodological novelty.
- Reasoning instructions are generated solely by GPT-4, without validation of the diversity and complexity of real user instructions.
- Annotations for some data sources rely on automated tools (SAM2/Florence2/VLPart), and their quality has not been quantitatively evaluated in a systematic way.
- Simulation data (RLBench) covers only 10 categories, which is modest compared to other data sources.
- Real-robot experiments are limited in scene and object variety, and do not comprehensively cover grasping across all 180 categories.
Related Work & Insights¶
- Affordance datasets: UMD pioneered affordance segmentation with 10k RGB-D images; AGD20k extended coverage to 36 affordance categories from an exocentric perspective; HANDAL provides fine-grained handle annotations; AED and 3DOI introduce more diverse scenes; ManipVQA first incorporated reasoning but remained limited in data scale.
- Affordance algorithms: Progress from supervised learning to transfer learning and self-supervised learning has gradually addressed cross-domain generalization; recent VLM-based methods (AffordanceLLM, ManipVQA) introduce language reasoning but are constrained by data scale.
- Robotic grasping: From classical geometric methods to learning-based approaches, affordance prediction plays an increasingly important role in grasp planning; this paper is the first to unify the complete pipeline from VLM affordance prediction to real-world grasping within a single framework.
Rating¶
- Novelty: ⭐⭐⭐⭐ — The large-scale reasoning-based affordance benchmark fills a critical gap in the field; the hard reasoning instruction evaluation setting (without category names) is creative.
- Technical Depth: ⭐⭐⭐ — The model component (AffordanceNet) is relatively straightforward; the primary contribution lies in data and benchmark construction.
- Experimental Thoroughness: ⭐⭐⭐⭐ — Four validation protocols (zero-shot, cross-domain, reasoning, real-robot) provide comprehensive coverage; some experimental details require consulting supplementary materials.
- Writing Quality: ⭐⭐⭐⭐ — The description of data construction is systematic and complete; comparison tables are clear.
- Recommendation: ⭐⭐⭐⭐ — The dataset resource holds significant value for the affordance and robotic grasping communities.
Highlights & Insights¶
Limitations & Future Work¶
Related Work & Insights¶
Rating¶
- Novelty: TBD
- Experimental Thoroughness: TBD
- Writing Quality: TBD
- Value: TBD