RealVLG-R1: A Large-Scale Real-World Visual-Language Grounding Benchmark for Robotic Perception and Manipulation
Conference: CVPR2026
arXiv: 2603.14880
Code: lif314/RealVLG-R1
Area: Semantic Segmentation
Keywords: Visual-Language Grounding, Robotic Grasping, Reinforcement Learning Fine-tuning, Multi-granular Annotation, Zero-shot Generalization, Large-scale Vision-Language Models
TL;DR
This paper proposes the RealVLG framework, comprising RealVLG-11B, a real-world dataset with multi-granular annotations and roughly 11 billion grasp examples, and RealVLG-R1, a unified model fine-tuned with Reinforcement Learning (RL). It unifies Visual-Language Grounding (VLG) and robotic grasping into the same paradigm for the first time, predicting bounding boxes, segmentation masks, grasp poses, and contact points end to end from natural language instructions, and it demonstrates zero-shot generalization.
Background & Motivation
VLG and Grasping Disjoint: Existing visual-language grounding research focuses on coarse-grained object-level localization (bounding box / segmentation mask), while traditional robotic grasping methods rely on geometric cues and lack linguistic semantic guidance, resulting in a significant gap between the two.
Insufficient Synthetic Data Quality: Datasets like Grasp-Anything use diffusion models to generate low-resolution synthetic scenes, and grasp annotations are automatically generated by RAGT-3/3 with limited quality. Language descriptions only cover scene or object category levels.
Lack of Fine-grained Language Descriptions: Language annotations in existing grasping datasets are coarse, lacking precise descriptions of target object attributes and spatial relations, which fails to support language-driven fine-grained operations.
SFT Difficulty with Multi-solution Problems: Grasping poses inherently possess multiple feasible solutions. However, Supervised Fine-Tuning (SFT) forces fitting to a single label, leading to "averaged" predictions that are physically infeasible.
Insufficient Real-world Dataset Scale: Annotations in existing real-world grasping datasets are inconsistent and lack multi-modal aligned annotations for segmentation, detection, and language.
Lack of Zero-shot Capability: Grasping methods trained in closed environments have poor scalability and cannot be directly deployed in unseen real-world scenes.
Method
Overall Architecture
RealVLG aims to unify Visual-Language Grounding (VLG) and robotic grasping in a single model, so that a natural language instruction maps directly to bounding boxes, segmentation masks, grasp poses, and contact points. It consists of two components. The first is the RealVLG-11B dataset, which integrates real grasping datasets such as Cornell, VMRD, OCID-Grasp, GraspNet, and GraspClutter6D and unifies them through one annotation pipeline that yields bboxes, segmentation masks, rectangular grasp poses, contact points, and natural language descriptions (~165,000 images, 800+ object instances, 1.3 million annotations, and ~11 billion grasp examples). The second is the RealVLG-R1 model, which uses Qwen2.5-VL as the backbone and is fine-tuned with Reinforcement Learning with Verifiable Rewards (RLVR) to predict all four output types within one model. The following diagram illustrates the "Dataset Construction" and "Model Training" pipelines.
```mermaid
%%{init: {'flowchart': {'rankSpacing': 24, 'nodeSpacing': 28, 'padding': 6, 'wrappingWidth': 400, 'subGraphTitleMargin': {'top': 8, 'bottom': 16}}}}%%
flowchart TD
    A["Collection of Real Grasping Datasets<br/>Cornell / VMRD / OCID-Grasp / GraspNet / GraspClutter6D"]
    subgraph DATA["Multi-granular Data Annotation Pipeline"]
        direction TB
        B["Language Annotation<br/>8-view Rendering + GPT-4o Meta Description & Language Instruction Generation"]
        C["Localization Verification<br/>Qwen-VL-Max outputs bbox, SAM2 outputs Segmentation Mask"]
        D["Grasp Pose Unification<br/>6-DoF → Rectangular Pose → Contact Point Calculation"]
        E["Human Review<br/>Four-modal Cross-verification, Iterative Correction if failed"]
        B --> C --> D --> E
    end
    A --> DATA
    DATA --> F["RealVLG-11B Dataset<br/>bbox / Mask / Grasp Pose / Contact Point / Language Description"]
    subgraph MODEL["Reinforcement Learning Fine-tuning (RLVR)"]
        direction TB
        G["Qwen2.5-VL Backbone<br/>Samples a group of G candidate outputs"]
        H["Verifiable Reward R(q,o)<br/>bbox / Segmentation / Grasp / Contact Point"]
        I["Policy Optimization: GRPO (token-level) or GSPO (sequence-level)<br/>Group Relative Advantage Update"]
        G --> H --> I
    end
    F --> MODEL
    MODEL --> J["Unified Prediction<br/>bbox + Segmentation Mask + Grasp Pose + Contact Point"]
```
Key Designs
1. Multi-granular Data Annotation Pipeline: Producing High-Quality Real-World Annotations via Four-Modal Cross-Verification
Existing grasp data are either low-quality synthetics or limited to scene/class-level language descriptions. This work establishes quality via a four-step pipeline: (1) Language Annotation—rendering 3D models from 8 views for GPT-4o to generate Meta Descriptions, then combining these with images to generate Language Instructions containing category, color, shape, and spatial relations; (2) Localization Verification—Qwen-VL-Max performs grounding on image + language to output bboxes, and SAM2 generates masks; (3) Grasp Pose Unification—converting 6-DoF poses into a unified rectangular representation and calculating contact points based on masks; (4) Human Review—cross-verifying four-modal consistency and iteratively correcting failures.
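Step (3) can be illustrated with a small geometric sketch. It assumes a rectangular grasp is parameterized by its center \((x, y)\), in-plane angle \(\theta\), opening width \(w\), and jaw height \(h\), and that the two contact points sit on the grasp axis at half the opening width on either side of the center. The paper derives contact points with the help of masks, so this purely geometric rule, the parameter `h`, and the function names `contact_points` and `grasp_corners` are illustrative assumptions rather than the authors' procedure.

```python
import math


def contact_points(x, y, theta, w):
    """Return the two contact points of a rectangular grasp.

    Assumption: the contacts lie on the grasp axis, half the opening
    width w away from the center (x, y) on either side.
    """
    dx = 0.5 * w * math.cos(theta)
    dy = 0.5 * w * math.sin(theta)
    return (x - dx, y - dy), (x + dx, y + dy)


def grasp_corners(x, y, theta, w, h):
    """Return the four corners of the oriented grasp rectangle."""
    c, s = math.cos(theta), math.sin(theta)
    corners = []
    for sx, sy in ((-1, -1), (1, -1), (1, 1), (-1, 1)):
        # Offset in the rectangle's local frame, then rotate by theta.
        lx, ly = sx * w / 2, sy * h / 2
        corners.append((x + lx * c - ly * s, y + lx * s + ly * c))
    return corners


if __name__ == "__main__":
    # Horizontal grasp centered at (100, 50) with a 40-pixel opening.
    print(contact_points(100.0, 50.0, 0.0, 40.0))
```

For the horizontal grasp in the usage example, the two contacts land 20 pixels to the left and right of the center. A mask-aware variant could, for instance, move each contact along the same axis until it reaches the object boundary, but that refinement is not specified in this summary.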
2. Reinforcement Learning Fine-tuning (RLVR): Solving Multi-solution Grasping via Verifiable Rewards
Grasp poses inherently admit multiple feasible solutions, so SFT against a single label produces "averaged" predictions that are physically infeasible. The paper adopts the RLVR paradigm, replacing fixed labels with a verifiable reward function \(R(q,o)\): for each query it samples a group of \(G\) candidate outputs from the old policy, computes the group-relative advantage \(\hat{A}_i\) of each candidate, and updates the policy. Two interchangeable algorithms are provided: GRPO, which applies importance weights at the token level, and GSPO, which uses a sequence-level importance weight \(s_i(\theta) = \left(\frac{\pi_\theta(y_i|x)}{\pi_{\theta_{old}}(y_i|x)}\right)^{1/|y_i|}\) whose length normalization reduces variance. In experiments, RL fine-tuning achieved a gain of over 30% in grasping compared to SFT, because the rewards accept multiple correct solutions instead of forcing convergence to their mean.
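The two quantities in this paragraph can be sketched in a few lines. The snippet below assumes the common GRPO formulation in which rewards are standardized within each sampled group (the summary only names \(\hat{A}_i\) without giving its formula), and it computes the GSPO weight \(s_i(\theta)\) from per-token log-probabilities. All function names are hypothetical; this is not the authors' training code.

```python
import math


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one group of G sampled outputs.

    Assumed form: A_i = (r_i - mean(r)) / (std(r) + eps).
    """
    g = len(rewards)
    mean = sum(rewards) / g
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / g)
    return [(r - mean) / (std + eps) for r in rewards]


def grpo_token_ratios(logp_new, logp_old):
    """Token-level importance ratios, one per generated token."""
    return [math.exp(n - o) for n, o in zip(logp_new, logp_old)]


def gspo_sequence_ratio(logp_new, logp_old):
    """Length-normalized sequence-level ratio s_i(theta).

    pi(y|x) is the product of token probabilities, so
    (pi_new / pi_old) ** (1 / |y|) equals exp(mean token log-ratio).
    """
    if not logp_new or len(logp_new) != len(logp_old):
        raise ValueError("need two equal-length, non-empty log-prob lists")
    diffs = [n - o for n, o in zip(logp_new, logp_old)]
    return math.exp(sum(diffs) / len(diffs))


if __name__ == "__main__":
    # Four sampled outputs: two earn the reward, two do not.
    print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
    # The same response scored by the new and old policy (per-token log-probs).
    print(gspo_sequence_ratio([-1.0, -1.0], [-1.5, -1.5]))
```

Because the advantage depends only on how a candidate ranks within its own group, several distinct grasps can all receive positive advantage as long as each earns a high reward, which is the property the paragraph above relies on.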
Loss & Training
Rewards are designed per task, and every response must follow the `<think>...</think><answer>...</answer>` format:
- Bbox Reward: Binary reward based on IoU threshold \(R_{Bbox} = \mathbf{1}(\text{IoU}(B_p, B_{gt}) \geq \tau)\)
- Segmentation Reward: Thresholded IoU for coarse localization plus S-measure for fine-grained mask quality \(R_{Seg} = \mathbf{1}(\text{IoU} \geq \tau) + S_\alpha(M_p, M_{gt})\)
- Grasp Reward: Negative sum of Huber losses for \((x, y, \cos\theta, \sin\theta, w)\) components
- Contact Point Reward: Rectangular alignment IoU binary reward + L2 distance penalty for the two contact points
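A minimal sketch of the format check, the bbox reward, and the grasp reward follows. It assumes axis-aligned boxes given as `(x1, y1, x2, y2)`, grasps given as `(x, y, theta, w)`, a single reference grasp, and placeholder values `tau=0.5` and `delta=1.0`; none of these conventions or values are stated in this summary, and the function names are hypothetical. The segmentation and contact-point rewards are omitted because their exact definitions are not given here.

```python
import math
import re

# Whole response must be <think>...</think> followed by <answer>...</answer>.
FORMAT_RE = re.compile(r"<think>.*?</think>\s*<answer>.*?</answer>", re.DOTALL)


def format_ok(response):
    """Check the required response format."""
    return FORMAT_RE.fullmatch(response.strip()) is not None


def box_iou(a, b):
    """IoU of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def bbox_reward(pred, gt, tau=0.5):
    """Binary reward: 1 if IoU(pred, gt) >= tau, else 0 (tau is a placeholder)."""
    return 1.0 if box_iou(pred, gt) >= tau else 0.0


def huber(err, delta=1.0):
    """Huber loss: quadratic near zero, linear beyond delta."""
    a = abs(err)
    return 0.5 * a * a if a <= delta else delta * (a - 0.5 * delta)


def grasp_reward(pred, gt, delta=1.0):
    """Negative sum of Huber losses over (x, y, cos(theta), sin(theta), w)."""
    px, py, p_theta, pw = pred
    gx, gy, g_theta, gw = gt
    errors = (
        px - gx,
        py - gy,
        math.cos(p_theta) - math.cos(g_theta),
        math.sin(p_theta) - math.sin(g_theta),
        pw - gw,
    )
    return -sum(huber(e, delta) for e in errors)
```

Encoding the angle as \((\cos\theta, \sin\theta)\) keeps the error continuous where \(\theta\) wraps around, which is presumably why the reward uses those components instead of the raw angle.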
Key Experimental Results
Data Quality Assessment
| Dataset | MTLD ↑ | CLIP Score ↑ | \(R_s\) ↑ | \(R_g\) ↑ | \(R_c\) ↑ |
|---|---|---|---|---|---|
| Grasp-Anything | 27.45 | 0.54 | – | 0.38 | 0.69 |
| Grasp-Anything++ | 15.14 | 0.52 | – | 0.31 | 0.62 |
| RealVLG-11B | 36.49 | 0.65 | 0.99 | 0.69 | 0.87 |
RealVLG-11B surpasses synthetic datasets across language diversity (MTLD), visual-language alignment (CLIP Score), and spatial consistency.
Main Results on RealVLG Benchmark
| Model | Seen Bbox (gIoU) | Seen Grasp (mIoU/gAcc) | Novel Bbox (gIoU) | Novel Grasp (mIoU/gAcc) |
|---|---|---|---|---|
| Qwen-VL-Max | 92.3 | 16.0/16.7 | 88.4 | 8.1/5.4 |
| Qwen2.5VL-3B + SFT | 56.4 | 3.4/1.7 | 57.2 | 4.4/1.5 |
| RealVLG-R1-3B (GRPO) | 87.2 | 34.7/40.3 | 78.5 | 16.3/17.1 |
| RealVLG-R1-7B (GSPO) | 89.0 | 33.6/32.8 | 88.5 | 16.5/18.3 |
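To make the size of the RL-over-SFT gap concrete, the short script below recomputes absolute differences in grasp metrics from the rows of the table above. The numbers are copied from the table; the dictionary layout and the helper name `gaps` are illustrative. Note that the 7B row is compared against the 3B SFT baseline because the table lists no 7B SFT row, so that comparison is not like-for-like.

```python
# Grasp metrics copied from the table above, in percentage points:
# (seen mIoU, seen gAcc, novel mIoU, novel gAcc)
results = {
    "Qwen2.5VL-3B + SFT": (3.4, 1.7, 4.4, 1.5),
    "RealVLG-R1-3B (GRPO)": (34.7, 40.3, 16.3, 17.1),
    "RealVLG-R1-7B (GSPO)": (33.6, 32.8, 16.5, 18.3),
}


def gaps(model, baseline="Qwen2.5VL-3B + SFT"):
    """Absolute gap (percentage points) of `model` over `baseline` per metric."""
    return tuple(
        round(m - b, 1) for m, b in zip(results[model], results[baseline])
    )


if __name__ == "__main__":
    for name in ("RealVLG-R1-3B (GRPO)", "RealVLG-R1-7B (GSPO)"):
        print(name, gaps(name))
```

On the seen split both RL models exceed the SFT baseline by more than 30 points in mIoU and gAcc, while on the novel split the gap is roughly 12 to 17 points.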
Ablation Study & Key Findings
- SFT vs RL Fine-tuning: SFT yields only a ~5% gIoU gain over the base model, whereas GRPO/GSPO yield over 30%, indicating a significant advantage of RL on multi-solution grasping tasks.
- GRPO vs GSPO: GRPO achieves higher grasping accuracy on the smaller model (3B: mIoU 34.7 vs 29.2), while GSPO is more stable on the larger model, reaching a 100% output validity rate (Rv).
- Zero-shot Generalization: In Novel (unseen object) scenarios, RealVLG-R1-7B (GSPO) maintains a Bbox gIoU of 88.5% and a grasp mIoU/gAcc of 16.5/18.3%, demonstrating non-trivial generalization.
- Output Validity Rate: The closed-source Qwen-VL-Max has an Rv of only 60-70%, whereas all RealVLG-R1 configurations reach 96-100%, indicating that RL fine-tuning significantly improves structured output consistency.
- Only 10% Training Data: Both RealVLG-R1 and the SFT baseline were trained for 10 epochs on only 10% of the training set, which suggests high data efficiency.
Highlights & Insights
- First Unified VLG + Grasping Framework: Unifies semantic localization and physical interaction reasoning in one model; presented as the first end-to-end robotic perception model built on a Large Vision-Language Model (LVLM).
- High-Quality Data Annotation Pipeline: Fourfold assurance via GPT-4o generation, Qwen-VL-Max verification, SAM2 segmentation, and human review.
- 11B-level Real-world Grasping Dataset: The largest real-world perception dataset containing both semantic and visual information.
- RL for Multi-solution Problems: Elegantly solves the core challenge of multi-feasible grasp poses by replacing fixed labels with verifiable rewards.
- Zero-shot Deployment Capability: Executes perception and manipulation in unseen real-world environments without scene-specific fine-tuning.
Limitations & Future Work
- Currently supports only 2D rectangular grasp poses; not extended to 3D space and 6-DoF grasping.
- Grasping accuracy in Novel scenarios (mIoU ~16%) still has room for improvement compared to detection performance.
- Segmentation depends entirely on SAM2 as a frozen module; the model itself does not directly generate masks.
- Closed-loop operation success rates on real robots were not reported.
- Dataset primarily covers tabletop scenes; generalization to complex industrial and outdoor environments is unverified.
- RL training requires sampling a group of G responses per query for advantage estimation, which increases training cost; inference itself needs only a single response.
Related Work & Insights
- Visual-Language Grounding: GLIP, Shikra, GroundingDINO focus on Bbox/Seg localization without grasp reasoning.
- Grasping Datasets: Cornell and GraspNet-1Billion provide real-world annotations but lack language; Grasp-Anything includes language but consists of low-quality synthetic data.
- Language-driven Grasping: Existing methods often rely on pre-segmented inputs, leading to severe multi-stage error accumulation and poor open-world generalization.
- RL Fine-tuning LLMs: DeepSeek-R1 popularized RL with verifiable, rule-based rewards for reasoning tasks; this work extends the paradigm to visual localization and robotic grasping.
Rating
- Novelty: ⭐⭐⭐⭐ — First to unify VLG and grasping, migrating the RLVR paradigm from NLP reasoning to embodied perception.
- Experimental Thoroughness: ⭐⭐⭐⭐ — Comprehensive data quality assessment, Benchmark, and multi-baseline comparisons, though lacking real-robot closed-loop experiments.
- Writing Quality: ⭐⭐⭐⭐ — Clear structure, detailed dataset construction process, and complete formula derivations.
- Value: ⭐⭐⭐⭐ — The dataset and Benchmark hold long-term value for the community; the unified framework approach is worth following.