
RAGNet: Large-scale Reasoning-based Affordance Segmentation Benchmark towards General Grasping

Conference: ICCV 2025 | arXiv: 2507.23734 | Code: GitHub | Area: Image Segmentation | Keywords: affordance segmentation, robotic grasping, reasoning instructions, large-scale benchmark, vision-language model

TL;DR

This paper introduces RAGNet, the first large-scale reasoning-based affordance segmentation benchmark (273k images, 180 categories, 26k reasoning instructions), and proposes the AffordanceNet framework, which integrates VLM-pretrained affordance prediction with grasp pose generation, demonstrating strong open-world generalization and reasoning capabilities.

Background & Motivation

General-purpose robotic grasping systems must accurately perceive object affordance regions across diverse open-world scenarios in response to human instructions. Prior work suffers from two key bottlenecks:

Insufficient data scale and diversity: Existing affordance datasets are typically confined to specific domains (robotics, egocentric, or in-the-wild), with limited categories (UMD: 17, AGD20k: 50) and homogeneous image sources, resulting in poor open-world generalization.

Lack of reasoning-based instructions: Existing methods rely on fixed-template language prompts (e.g., "Please segment the affordance region of XX"), which cannot handle human-like high-level instructions such as "I need something that can cut bread," lacking complex reasoning capabilities.

Key insight: Instructions received by robots in real-world scenarios often do not directly specify the target object category but instead describe functional requirements. This demands that affordance prediction models possess a complete capability chain: from functional description reasoning to object identification to region segmentation.

Method

Overall Architecture

The paper makes two primary contributions:

  1. RAGNet Benchmark: Large-scale multi-source data collection + multi-tool affordance annotation + three-tier reasoning instruction construction.
  2. AffordanceNet Model: AffordanceVLM (affordance region prediction) + Pose Generator (grasp pose generation).

Key Design 1: Multi-source Data Collection and Annotation

Data sources spanning four domains:

  • Wild: HANDAL (17 categories of hardware/kitchen tools, real-world diverse scenes)
  • Robot: Open-X (124 categories, robotic manipulation scenes), GraspNet (32 categories)
  • Ego-centric: EgoObjects (74 indoor categories, first-person perspective)
  • Simulation: RLBench (10 categories, simulated environments)

In total: 273k images across 180 categories, far exceeding prior datasets.

Five-tier affordance annotation pipeline:

  • ❶ Raw masks: Directly using existing fine-grained annotations (e.g., HANDAL)
  • ❷ SAM2: For objects without handles, GT bounding boxes guide SAM2 to generate masks
  • ❸ Florence2 + SAM2: Language instructions drive Florence2 to produce polygon proposals, refined by SAM2
  • ❹ VLPart + SAM2: VLPart's part-level recognition (e.g., knife handle, cup handle) combined with SAM2 for affordance region segmentation
  • ❺ Manual annotation (+ SAM2): Fallback when the above tools fail

Annotation strategies adapt to object characteristics: soda cans require whole-object annotation (grasping the entire object), while woks require only handle annotation.
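
To make the tier selection concrete, here is a minimal sketch of how such a fallback dispatcher could be organized. All tool wrappers (`sam2_from_box`, `florence2_propose`, `vlpart_detect`, etc.) are hypothetical callables standing in for the actual models; this illustrates the ❶–❺ logic and the whole-object vs. handle choice, not the authors' annotation code.

```python
from typing import Any, Callable, Optional

Mask = Any  # stand-in type; a real pipeline would likely use a boolean numpy array


def annotate_affordance(
    image: Any,
    category: str,
    raw_mask: Optional[Mask],
    gt_box: Optional[tuple],
    grasp_whole_object: bool,
    sam2_from_box: Callable[[Any, tuple], Mask],
    florence2_propose: Callable[[Any, str], Optional[Any]],
    sam2_refine: Callable[[Any, Any], Mask],
    vlpart_detect: Callable[[Any, str], Optional[tuple]],
    manual_annotate: Callable[[Any, str], Mask],
) -> Mask:
    """Pick the cheapest annotation tier that yields a usable affordance mask."""
    # Target region depends on the category: e.g. a soda can is annotated as a
    # whole object, while a wok is annotated only by its handle.
    target = category if grasp_whole_object else f"{category} handle"

    # ❶ Raw masks: reuse existing fine-grained annotations (e.g. HANDAL).
    if raw_mask is not None:
        return raw_mask

    # ❷ SAM2: for whole-object targets, prompt SAM2 with the ground-truth box.
    if grasp_whole_object and gt_box is not None:
        return sam2_from_box(image, gt_box)

    # ❸ Florence2 + SAM2: language-driven polygon proposal, refined by SAM2.
    polygon = florence2_propose(image, target)
    if polygon is not None:
        return sam2_refine(image, polygon)

    # ❹ VLPart + SAM2: part-level detection (e.g. "knife handle"), segmented by SAM2.
    part_box = vlpart_detect(image, target)
    if part_box is not None:
        return sam2_from_box(image, part_box)

    # ❺ Manual annotation as the final fallback when the automatic tools fail.
    return manual_annotate(image, target)
```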

Key Design 2: Three-tier Reasoning Instruction System

  1. Template-based instructions: Fixed templates such as "Please segment the affordance map of <category> in this image," applicable to all data.
  2. Easy reasoning instructions: Instructions containing the object name, e.g., "Can you find a mug for tea."
  3. Hard reasoning instructions: Instructions describing only functional requirements without mentioning the object name, e.g., "I need something to drink coffee."

GPT-4 is used to generate reasoning instructions, producing 26k in total (HANDAL: 8.5k hard; EgoObjects: 12.7k easy + 4.7k hard).
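
As a concrete illustration, the three tiers for a single "mug" sample might look like the following, reusing the example phrasings quoted above; the dictionary layout is an assumption for exposition, not the released annotation format.

```python
def template_instruction(category: str) -> str:
    # Tier 1: fixed template, applicable to every image in the benchmark.
    return f"Please segment the affordance map of {category} in this image."


# Tiers 2 and 3 for a hypothetical "mug" sample, using the examples quoted above.
mug_sample = {
    "category": "mug",
    "template": template_instruction("mug"),
    "easy_reasoning": "Can you find a mug for tea.",        # object name appears
    "hard_reasoning": "I need something to drink coffee.",  # only the function is described
}
```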

Key Design 3: AffordanceNet Model

AffordanceVLM: VLM-based affordance region prediction module

  • Input: RGB image + human instruction (template-based or reasoning-based)
  • Output: affordance segmentation mask
  • Pretrained on RAGNet's large-scale data to learn mappings from diverse instructions to affordance regions

Pose Generator: Grasp pose generation module

  • Input: 2D affordance mask + depth image
  • Output: 3D gripper pose
  • Converts the VLM's 2D predictions into executable robot actions
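
Putting the two modules together, the inference path can be sketched as below. Here `affordance_vlm` and `pose_generator` stand in for the two trained components, and the mask-to-point-cloud lifting assumes a standard pinhole camera model; all interface details are assumptions for illustration.

```python
import numpy as np


def affordancenet_grasp(rgb, depth, instruction, affordance_vlm, pose_generator, intrinsics):
    """Sketch of the AffordanceNet inference path: instruction -> 2D mask -> 3D gripper pose."""
    # 1) AffordanceVLM: RGB image + (template or reasoning) instruction -> 2D affordance mask.
    mask = affordance_vlm(rgb, instruction)  # assumed to return an HxW boolean array

    # 2) Lift the masked depth pixels into a 3D point cloud using the pinhole intrinsics.
    fx, fy, cx, cy = intrinsics
    v, u = np.nonzero(mask)
    z = depth[v, u]
    points = np.stack([(u - cx) * z / fx, (v - cy) * z / fy, z], axis=1)

    # 3) Pose Generator: affordance-region point cloud -> executable 3D gripper pose.
    return pose_generator(points)
```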

Validation Benchmark Design

Four validation splits are constructed:

  • Zero-shot category recognition: Evaluates generalization to unseen categories
  • Cross-domain affordance prediction: Evaluates prediction on unseen image domains
  • Reasoning instruction validation: Tests reasoning capability using hard instructions without category names
  • Real-robot closed-loop grasping: Fully cross-domain physical manipulation evaluation
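
For the three offline splits, evaluation presumably reduces to comparing predicted and ground-truth affordance masks; a minimal mean-IoU loop is sketched below. The metric and data layout are assumptions, and the real-robot split is instead scored by grasp success, which is not shown here.

```python
import numpy as np


def mask_iou(pred: np.ndarray, gt: np.ndarray) -> float:
    """Intersection-over-union between two binary masks."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return float(inter) / union if union > 0 else 1.0


def evaluate_split(model, samples) -> float:
    """Mean IoU of predicted affordance masks over one validation split."""
    scores = [mask_iou(model(s["image"], s["instruction"]), s["gt_mask"]) for s in samples]
    return float(np.mean(scores))
```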

Key Experimental Results

Dataset Scale Comparison

| Dataset | Images | Categories | Wild | Robot | Ego | Sim | Reasoning |
|---|---|---|---|---|---|---|---|
| UMD (2015) | 10k | 17 | - | ✓ | - | - | - |
| AGD20k (2020) | 20k | 50 | ✓ | - | - | - | - |
| HANDAL (2023) | 200k | 17 | ✓ | - | - | - | - |
| ManipVQA (2024) | 84k | - | - | ✓ | - | - | ✓ |
| RAGNet | 273k | 180 | ✓ | ✓ | ✓ | ✓ | ✓ |

RAGNet is the first large-scale affordance benchmark to simultaneously cover all four domains and support reasoning instructions.

Annotation Tool Combinations (by Data Source)

| Source | Domain | Annotation Tools | Reasoning Instructions | Categories |
|---|---|---|---|---|
| HANDAL | Wild | ❶ | 8.5k (hard) | 17 |
| Open-X | Robot | ❸❹❺ | - | 124 |
| GraspNet | Robot | ❶❺ | - | 32 |
| EgoObjects | Ego | ❷❹❺ | 17.4k (easy + hard) | 74 |
| RLBench | Sim | | - | 10 |

Key Findings

  • Pretraining on large-scale multi-source data significantly improves open-world generalization of affordance prediction.
  • Hard reasoning instructions (without category names), while increasing task difficulty, effectively enhance the model's functional reasoning capability.
  • Cross-domain validation demonstrates that the model can generalize from wild data to robot scenarios and vice versa.
  • Real-robot closed-loop experiments validate the feasibility of the complete pipeline from VLM affordance prediction to grasp execution.
  • The adaptive five-tier annotation pipeline substantially reduces manual annotation cost while maintaining annotation quality.

Highlights & Insights

  1. Benchmark contribution outweighs methodological contribution: The RAGNet dataset itself (273k images, 180 categories, 26k reasoning instructions, four-domain coverage) constitutes critical missing infrastructure for the field and will drive subsequent research.
  2. Tiered reasoning instruction design: The three-tier system (template-based → easy reasoning → hard reasoning) precisely corresponds to different requirement levels ranging from academic research to real-world deployment.
  3. Engineering value of the annotation pipeline: The five-tier adaptive annotation strategy (raw masks → SAM2 → Florence2+SAM2 → VLPart+SAM2 → manual) constitutes a reusable large-scale annotation methodology.
  4. Closed-loop validation approach: The complete chain from VLM affordance prediction to depth-based pose generation to real-robot execution bridges the full perception–planning–execution pipeline.
  5. Clear data motivation: By systematically comparing existing datasets across three dimensions—domain coverage, category breadth, and reasoning capability—the paper constructs a convincing argument for the proposed dataset.

Limitations & Future Work

  • The AffordanceNet model architecture is relatively straightforward, primarily consisting of VLM fine-tuning concatenated with pose generation, offering limited methodological novelty.
  • Reasoning instructions are generated solely by GPT-4 and have not been validated against the diversity and complexity of real user instructions.
  • Annotations for some data sources rely on automated tools (SAM2/Florence2/VLPart), and annotation quality has not been systematically evaluated quantitatively.
  • Simulation data (RLBench) covers only 10 categories, which is modest compared to other data sources.
  • Real-robot experiments are limited in scene and object variety, and do not comprehensively cover grasping across all 180 categories.
Related Work

  • Affordance datasets: UMD pioneered affordance segmentation with 10k RGB-D images; AGD20k extended coverage to 36 affordance categories from an exocentric perspective; HANDAL provides fine-grained handle annotations; AED and 3DOI introduce more diverse scenes; ManipVQA first incorporated reasoning but remained limited in data scale.
  • Affordance algorithms: Progress from supervised learning to transfer learning and self-supervised learning has gradually addressed cross-domain generalization; recent VLM-based methods (AffordanceLLM, ManipVQA) introduce language reasoning but are constrained by data scale.
  • Robotic grasping: From classical geometric methods to learning-based approaches, affordance prediction plays an increasingly important role in grasp planning; this paper is the first to unify the complete pipeline from VLM affordance prediction to real-world grasping within a single framework.

Rating

  • Novelty: ⭐⭐⭐⭐ — The large-scale reasoning-based affordance benchmark fills a critical gap in the field; the hard reasoning instruction evaluation setting (without category names) is creative.
  • Technical Depth: ⭐⭐⭐ — The model component (AffordanceNet) is relatively straightforward; the primary contribution lies in data and benchmark construction.
  • Experimental Thoroughness: ⭐⭐⭐⭐ — Four validation protocols (zero-shot, cross-domain, reasoning, real-robot) provide comprehensive coverage; some experimental details require consulting supplementary materials.
  • Writing Quality: ⭐⭐⭐⭐ — The description of data construction is systematic and complete; comparison tables are clear.
  • Recommendation: ⭐⭐⭐⭐ — The dataset resource holds significant value for the affordance and robotic grasping communities.
