RealVLG-R1: A Large-Scale Real-World Visual-Language Grounding Benchmark for Robotic Perception and Manipulation

Conference: CVPR 2026
arXiv: 2603.14880
Code: lif314/RealVLG-R1
Area: Semantic Segmentation
Keywords: visual-language grounding, robotic grasping, reinforcement learning fine-tuning, multi-granularity annotation, zero-shot generalization, large-scale vision-language models

TL;DR

This paper proposes the RealVLG framework, comprising the RealVLG-11B large-scale real-world multi-granularity annotated dataset and the RealVLG-R1 unified model fine-tuned via reinforcement learning. It is the first work to unify visual-language grounding (VLG) and robotic grasping under a single paradigm, enabling end-to-end prediction of bounding boxes, segmentation masks, grasp poses, and contact points from natural language instructions, while demonstrating zero-shot generalization capability.

Background & Motivation

Disconnect between VLG and grasping: Existing VLG research focuses on coarse-grained object-level localization (bounding boxes / segmentation masks), while traditional robotic grasping methods rely on geometric cues without language semantic guidance, creating a significant gap between the two.

Insufficient quality of synthetic data: Datasets such as Grasp-Anything use diffusion models to generate low-resolution synthetic scenes; grasp annotations are automatically generated by RAGT-3/3 with limited quality, and language descriptions cover only scene- or object-category-level information.

Lack of fine-grained language descriptions: Language annotations in existing grasp datasets are coarse, lacking detailed descriptions of target object attributes and spatial relationships, and thus cannot support language-driven fine-grained manipulation.

SFT struggles with multi-solution problems: Grasp poses inherently admit multiple feasible solutions, yet supervised fine-tuning forces the model to fit a single label, leading to "averaged" predictions that are physically infeasible.

Insufficient scale of real-world datasets: Existing real-world grasp datasets have inconsistent annotations and lack multi-modal aligned annotations such as segmentation, detection, and language descriptions.

Absence of zero-shot capability: Grasping methods trained in closed environments have poor scalability and cannot be directly deployed in unseen real-world scenarios.

Method

Overall Architecture

RealVLG consists of two components: the dataset (RealVLG-11B) and the model (RealVLG-R1):

  • RealVLG-11B Dataset: Integrates real-world grasp datasets including Cornell, VMRD, OCID-Grasp, GraspNet, and GraspClutter6D, with unified extensions covering bounding boxes, segmentation masks, rectangular grasp poses, contact points, and natural language descriptions. It covers approximately 165,000 images, 800+ object instances, 1.3 million annotations, and approximately 11 billion grasp examples.
  • RealVLG-R1 Model: Built on Qwen2.5-VL as the backbone, employing a reinforcement learning with verifiable rewards (RLVR) strategy to drive model learning via verifiable reward signals, with unified prediction across four output types.

Data Annotation Pipeline (Key Design 1)

  1. Language annotation: Render object 3D models from 8 viewpoints → GPT-4o generates Meta Descriptions → combined with images to generate Language Instructions for each target, covering category, color, shape, and spatial relationships.
  2. Localization verification: Qwen-VL-Max performs grounding on image + language to output bounding boxes → SAM2 generates segmentation masks.
  3. Grasp pose unification: Convert 6-DoF grasp poses into a unified rectangular grasp representation, and compute contact points based on segmentation masks.
  4. Human review: Manual cross-validation of consistency across four modalities—Meta Description, Language Instruction, Bbox, and Segmentation Mask—with iterative correction for any failing instances.
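Step 3's contact-point computation can be illustrated with a minimal sketch. Assuming the unified rectangle grasp is parameterized by center \((x, y)\), rotation \(\theta\), and width \(w\), the two contact points fall at the ends of the gripper's closing axis; the mask-based refinement described in step 3 is omitted, and `contact_points` is a hypothetical helper, not the paper's implementation.

```python
import math

def contact_points(x, y, theta, w):
    """Hypothetical simplification: place the two contact points at the
    ends of the closing axis of a rectangle grasp (x, y, theta, w).
    The paper additionally grounds the points in the object's
    segmentation mask; that refinement is omitted here."""
    dx = math.cos(theta) * w / 2
    dy = math.sin(theta) * w / 2
    return (x - dx, y - dy), (x + dx, y + dy)

# Horizontal grasp of width 4 centered at (10, 5):
print(contact_points(10.0, 5.0, 0.0, 4.0))  # ((8.0, 5.0), (12.0, 5.0))
```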

Reinforcement Learning Fine-Tuning (Key Design 2)

  • Adopts the RLVR paradigm, replacing fixed-label supervision with a verifiable reward function \(R(q,o)\).
  • Uses the GRPO algorithm for token-level importance-weighted policy optimization.
  • Further employs the GSPO method, introducing length-normalized importance weights at the sequence level: \(s_i(\theta) = \left(\frac{\pi_\theta(y_i|x)}{\pi_{\theta_{old}}(y_i|x)}\right)^{1/|y_i|}\), reducing variance for long sequences.
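The length-normalized sequence weight \(s_i(\theta)\) can be computed directly from per-token log-probabilities, since the sequence likelihood is the product of token probabilities. A minimal sketch (the function name is ours, not the paper's):

```python
import math

def gspo_importance_weight(logp_new, logp_old):
    """GSPO-style sequence-level importance weight.

    logp_new / logp_old: per-token log-probabilities of the same sampled
    response y_i under the current and old policies. Computes
        s_i = (pi_new(y_i|x) / pi_old(y_i|x)) ** (1 / |y_i|)
            = exp((sum(logp_new) - sum(logp_old)) / |y_i|),
    so the weight is invariant to sequence length when the per-token
    likelihood ratio is constant, reducing variance for long sequences.
    """
    n = len(logp_new)
    assert n == len(logp_old) and n > 0
    return math.exp((sum(logp_new) - sum(logp_old)) / n)

# Identical policies give a weight of exactly 1, regardless of length.
print(gspo_importance_weight([-1.0, -2.0], [-1.0, -2.0]))  # 1.0
```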

Task-Specific Reward Functions (Loss Design)

  • Bbox reward: Binary reward based on IoU threshold: \(R_{Bbox} = \mathbf{1}(\text{IoU}(B_p, B_{gt}) \geq \tau)\)
  • Segmentation reward: Combines a thresholded-IoU term for coarse localization with S-measure fine-grained mask quality: \(R_{Seg} = \mathbf{1}(\text{IoU}(M_p, M_{gt}) \geq \tau) + S_\alpha(M_p, M_{gt})\)
  • Grasp reward: Negative sum of Huber losses computed separately over five components \((x, y, \cos\theta, \sin\theta, w)\)
  • Contact point reward: Binary rectangular alignment IoU reward plus L2 distance penalty for two contact points
  • Format reward: All tasks uniformly require the <think>...</think><answer>...</answer> format
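The Bbox and grasp rewards above can be sketched in a few lines. This is a hedged illustration, not the paper's code: it assumes axis-aligned corner-format boxes, an IoU threshold of \(\tau = 0.5\), and a Huber delta of 1.0, none of which are specified here.

```python
def iou(box_a, box_b):
    # Boxes as (x1, y1, x2, y2); standard intersection-over-union.
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda b: (b[2] - b[0]) * (b[3] - b[1])
    union = area(box_a) + area(box_b) - inter
    return inter / union if union > 0 else 0.0

def bbox_reward(pred, gt, tau=0.5):
    # Binary reward: 1 if IoU clears the threshold, else 0.
    return 1.0 if iou(pred, gt) >= tau else 0.0

def huber(d, delta=1.0):
    # Quadratic near zero, linear in the tails.
    return 0.5 * d * d if abs(d) <= delta else delta * (abs(d) - 0.5 * delta)

def grasp_reward(pred, gt):
    # pred/gt: (x, y, cos_theta, sin_theta, w);
    # negative sum of per-component Huber losses.
    return -sum(huber(p - g) for p, g in zip(pred, gt))
```

A perfect grasp prediction yields a reward of 0, and any deviation pushes the reward negative, so the policy is driven toward any feasible solution rather than the average of several labels.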

Key Experimental Results

Data Quality Evaluation

Dataset            MTLD ↑   CLIP Score ↑   \(R_s\)   \(R_g\)   \(R_c\)
Grasp-Anything     27.45    0.54           0.38      0.69      —
Grasp-Anything++   15.14    0.52           0.31      0.62      —
RealVLG-11B        36.49    0.65           0.99      0.69      0.87

RealVLG-11B comprehensively outperforms synthetic datasets in language diversity (MTLD), vision-language alignment (CLIP Score), and spatial consistency.

Main Results on the RealVLG Benchmark

Model                  Seen Bbox (gIoU)   Seen Grasp (mIoU/gAcc)   Novel Bbox (gIoU)   Novel Grasp (mIoU/gAcc)
Qwen-VL-Max            92.3               16.0/16.7                88.4                8.1/5.4
Qwen2.5VL-3B + SFT     56.4               3.4/1.7                  57.2                4.4/1.5
RealVLG-R1-3B (GRPO)   87.2               34.7/40.3                78.5                16.3/17.1
RealVLG-R1-7B (GSPO)   89.0               33.6/32.8                88.5                16.5/18.3

Ablation Study & Key Findings

  1. SFT vs. RL fine-tuning: SFT improves gIoU by only ~5% over the base model, whereas GRPO/GSPO achieves improvements exceeding 30%, demonstrating the significant advantage of reinforcement learning on multi-solution grasping tasks.
  2. GRPO vs. GSPO: GRPO achieves higher grasp accuracy on smaller models (3B: mIoU 34.7 vs. 29.2), while GSPO offers better stability on larger models and achieves a 100% valid response (Rv) rate.
  3. Zero-shot generalization: In Novel (unseen object) settings, RealVLG-R1-7B (GSPO) still achieves Bbox gIoU of 88.5% and grasp mIoU/gAcc of 16.5/18.3%, demonstrating non-trivial generalization capability.
  4. Valid output rate: The closed-source Qwen-VL-Max achieves only 60–70% Rv, while all RealVLG-R1 configurations reach 96–100%, indicating that RL fine-tuning significantly improves structured output consistency.
  5. Only 10% training data used: RealVLG-R1 and SFT are trained on only 10% of the training set for 10 epochs, demonstrating excellent data efficiency.

Highlights & Insights

  • First framework unifying VLG and grasping: Integrates semantic localization and physical interaction reasoning into a single model, marking the first end-to-end robotic perception model based on LVLMs.
  • High-quality data annotation pipeline: Four-layer quality assurance via GPT-4o automatic generation, Qwen-VL-Max verification, SAM2 segmentation, and human review.
  • 11-billion-scale real-world grasp dataset: The largest real-world perception dataset simultaneously encompassing semantic and visual information.
  • Reinforcement learning addresses multi-solution problems: Elegantly resolves the core challenge of multiple feasible grasp poses by replacing fixed labels with verifiable rewards.
  • Zero-shot deployment capability: Enables perception and manipulation in unseen real-world environments without task-specific fine-tuning.

Limitations & Future Work

  • Currently supports only 2D rectangular grasp poses; extension to 3D space and 6-DoF grasping has not been explored.
  • Grasp accuracy in Novel settings (mIoU ~16%) still has substantial room for improvement, with a notable gap relative to detection performance.
  • Segmentation relies entirely on SAM2 as a frozen module; the model does not directly generate masks.
  • Experiments do not report closed-loop manipulation success rates on real robots.
  • The dataset primarily covers tabletop scenarios; generalization to complex industrial and outdoor environments has not been validated.
  • GRPO/GSPO-style optimization requires sampling \(G\) candidate responses per query to estimate group-relative advantages, which substantially raises training cost.

Related Work

  • Visual-language grounding: Methods such as GLIP, Shikra, and GroundingDINO focus on Bbox/Seg localization without addressing grasp reasoning.
  • Grasp datasets: Cornell and GraspNet-1Billion provide real-world annotations but lack the language modality; Grasp-Anything includes language but relies on low-quality synthetic data.
  • Language-driven grasping: Existing methods largely depend on pre-segmented inputs, suffer from multi-stage error accumulation, and generalize poorly to open-world scenarios.
  • RL fine-tuning for LLMs: DeepSeek-R1 introduced the RLVR paradigm for reasoning tasks; this paper extends it to visual grounding and robotic grasping.

Rating

  • Novelty: ⭐⭐⭐⭐ — First to unify VLG and grasping, transferring the RLVR paradigm from NLP reasoning to embodied perception.
  • Experimental Thoroughness: ⭐⭐⭐⭐ — Comprehensive data quality evaluation, benchmark, and multi-baseline comparisons; real-robot closed-loop experiments are absent.
  • Writing Quality: ⭐⭐⭐⭐ — Clear paper structure, detailed dataset construction pipeline, and complete mathematical derivations.
  • Value: ⭐⭐⭐⭐ — The dataset and benchmark offer long-term community value; the unified framework merits follow-up research.