# RealVLG-R1: A Large-Scale Real-World Visual-Language Grounding Benchmark for Robotic Perception and Manipulation
Conference: CVPR 2026
arXiv: 2603.14880
Code: lif314/RealVLG-R1
Area: Semantic Segmentation
Keywords: visual-language grounding, robotic grasping, reinforcement learning fine-tuning, multi-granularity annotation, zero-shot generalization, large-scale vision-language models
## TL;DR
This paper proposes the RealVLG framework, comprising the RealVLG-11B large-scale real-world multi-granularity annotated dataset and the RealVLG-R1 unified model fine-tuned via reinforcement learning. It is the first work to unify visual-language grounding (VLG) and robotic grasping under a single paradigm, enabling end-to-end prediction of bounding boxes, segmentation masks, grasp poses, and contact points from natural language instructions, while demonstrating zero-shot generalization capability.
## Background & Motivation
- Disconnect between VLG and grasping: Existing VLG research focuses on coarse-grained object-level localization (bounding boxes / segmentation masks), while traditional robotic grasping methods rely on geometric cues without language semantic guidance, leaving a significant gap between the two.
- Insufficient quality of synthetic data: Datasets such as Grasp-Anything use diffusion models to generate low-resolution synthetic scenes; grasp annotations are automatically generated by RAGT-3/3 with limited quality, and language descriptions cover only scene- or object-category-level information.
- Lack of fine-grained language descriptions: Language annotations in existing grasp datasets are coarse, lacking detailed descriptions of target object attributes and spatial relationships, and thus cannot support language-driven fine-grained manipulation.
- SFT struggles with multi-solution problems: Grasp poses inherently admit multiple feasible solutions, yet supervised fine-tuning forces the model to fit a single label, leading to "averaged" predictions that are physically infeasible.
- Insufficient scale of real-world datasets: Existing real-world grasp datasets have inconsistent annotations and lack multi-modal aligned annotations such as segmentation, detection, and language descriptions.
- Absence of zero-shot capability: Grasping methods trained in closed environments scale poorly and cannot be directly deployed in unseen real-world scenarios.
## Method
### Overall Architecture
RealVLG consists of two components, the RealVLG-11B dataset and the RealVLG-R1 model:
- RealVLG-11B Dataset: Integrates real-world grasp datasets including Cornell, VMRD, OCID-Grasp, GraspNet, and GraspClutter6D, with unified extensions covering bounding boxes, segmentation masks, rectangular grasp poses, contact points, and natural language descriptions. It covers approximately 165,000 images, 800+ object instances, 1.3 million annotations, and approximately 11 billion grasp examples.
- RealVLG-R1 Model: Built on a Qwen2.5-VL backbone and fine-tuned with reinforcement learning with verifiable rewards (RLVR), so that verifiable reward signals rather than fixed labels drive learning, with unified prediction across the four output types; a hypothetical input/output example follows.
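To make the unified interface concrete, here is a hypothetical query/response pair. Every concrete value, the JSON answer schema, and the prompt wording below are invented for illustration and are not taken from the paper; only the `<think>...</think><answer>...</answer>` wrapper is specified by the paper's format reward (described under the reward functions).

```python
# Hypothetical example of the unified interface; all values are made up.
query = (
    "Grasp the red mug to the left of the keyboard. "
    "Return its bounding box and a rectangular grasp pose."
)
response = (
    "<think>The red mug sits left of the keyboard; a grasp across its body "
    "at a slight angle is feasible.</think>"
    '<answer>{"bbox": [132, 88, 241, 205], '
    '"grasp": [186, 146, 0.97, 0.26, 54]}</answer>'  # grasp = (x, y, cos t, sin t, w)
)
```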
### Data Annotation Pipeline (Key Design 1)
- Language annotation: Render object 3D models from 8 viewpoints → GPT-4o generates Meta Descriptions → combined with images to generate Language Instructions for each target, covering category, color, shape, and spatial relationships.
- Localization verification: Qwen-VL-Max performs grounding on image + language to output bounding boxes → SAM2 generates segmentation masks.
- Grasp pose unification: Convert 6-DoF grasp poses into a unified rectangular grasp representation, and compute contact points based on segmentation masks.
- Human review: Manual cross-validation of consistency across the four modalities (Meta Description, Language Instruction, Bbox, Segmentation Mask), with iterative correction of any failing instances; a sketch of the full pipeline follows.
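Below is a minimal orchestration sketch of this pipeline. The model calls (GPT-4o for descriptions, Qwen-VL-Max for grounding, SAM2 for masks) are injected as callables, and all names and signatures are hypothetical placeholders, not the paper's code.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

Rect = Tuple[float, float, float, float, float]  # (x, y, cos t, sin t, w)

@dataclass
class Annotation:
    instruction: str
    bbox: Tuple[int, int, int, int]
    mask: object                     # binary segmentation mask
    grasp_rects: List[Rect]
    contacts: List[Tuple[Tuple[float, float], Tuple[float, float]]]

def annotate(image, rendered_views: Sequence, grasps_6dof: Sequence,
             describe: Callable, instruct: Callable, ground: Callable,
             segment: Callable, to_rect: Callable, contact_pts: Callable) -> Annotation:
    """Stages 1-3 of the pipeline; model callables are injected, not implemented."""
    meta = describe(rendered_views)                   # stage 1a: meta description from 8 views
    instruction = instruct(meta, image)               # stage 1b: per-target language instruction
    bbox = ground(image, instruction)                 # stage 2a: grounding -> bounding box
    mask = segment(image, bbox)                       # stage 2b: bbox-prompted segmentation mask
    rects = [to_rect(g) for g in grasps_6dof]         # stage 3a: 6-DoF -> rectangular grasp
    contacts = [contact_pts(r, mask) for r in rects]  # stage 3b: contact points from the mask
    # Stage 4 (human review) is a manual cross-check of meta/instruction/bbox/mask
    # consistency, with failing instances sent back through the relevant stage.
    return Annotation(instruction, bbox, mask, rects, contacts)
```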
### Reinforcement Learning Fine-Tuning (Key Design 2)
- Adopts the RLVR paradigm, replacing fixed-label supervision with a verifiable reward function \(R(q,o)\).
- Uses the GRPO algorithm for token-level importance-weighted policy optimization.
- Further employs the GSPO method, introducing length-normalized importance weights at the sequence level: \(s_i(\theta) = \left(\frac{\pi_\theta(y_i|x)}{\pi_{\theta_{old}}(y_i|x)}\right)^{1/|y_i|}\), reducing variance for long sequences.
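To make the two update rules concrete, the snippet below contrasts the token-level GRPO importance ratios with the length-normalized sequence-level GSPO ratio defined above. It computes only the ratios, not the full clipped policy-gradient objective, and the toy log-probabilities are invented.

```python
import torch

def grpo_token_ratios(logp_new: torch.Tensor, logp_old: torch.Tensor) -> torch.Tensor:
    """Token-level importance ratios (GRPO-style): one ratio per generated token,
    so long responses contribute many high-variance factors."""
    return torch.exp(logp_new - logp_old)

def gspo_sequence_ratio(logp_new: torch.Tensor, logp_old: torch.Tensor) -> torch.Tensor:
    """Sequence-level, length-normalized ratio from the GSPO formula above,
    s_i(theta) = (pi_theta(y|x) / pi_old(y|x))^(1/|y|), computed in log space."""
    return torch.exp((logp_new.sum() - logp_old.sum()) / logp_new.numel())

# Toy per-token log-probs for one 5-token response under the new and old policies.
logp_new = torch.tensor([-1.0, -0.8, -1.2, -0.9, -1.1])
logp_old = torch.tensor([-1.1, -0.9, -1.0, -1.0, -1.2])
print(grpo_token_ratios(logp_new, logp_old))    # five token-level ratios
print(gspo_sequence_ratio(logp_new, logp_old))  # one length-normalized scalar
```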
### Task-Specific Reward Functions (Loss Design)
- Bbox reward: Binary reward based on IoU threshold: \(R_{Bbox} = \mathbf{1}(\text{IoU}(B_p, B_{gt}) \geq \tau)\)
- Segmentation reward: Combines an IoU-threshold indicator for coarse localization with fine-grained S-measure mask quality: \(R_{Seg} = \mathbf{1}(\text{IoU}(M_p, M_{gt}) \geq \tau) + S_\alpha(M_p, M_{gt})\)
- Grasp reward: Negative sum of Huber losses computed separately over five components \((x, y, \cos\theta, \sin\theta, w)\)
- Contact point reward: Binary rectangular alignment IoU reward plus L2 distance penalty for two contact points
- Format reward: All tasks uniformly require the `<think>...</think><answer>...</answer>` format (see the sketch after this list)
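A minimal sketch of three of these rewards in plain Python. The IoU threshold \(\tau = 0.5\) and Huber \(\delta = 1.0\) are assumptions (the paper's exact values are not reproduced here), and the segmentation reward is omitted because its S-measure term is a full metric in its own right.

```python
import re

def iou(a, b):
    """Axis-aligned IoU for (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def bbox_reward(pred, gt, tau=0.5):
    """Binary IoU-threshold reward: R_Bbox = 1(IoU >= tau)."""
    return float(iou(pred, gt) >= tau)

def huber(x, delta=1.0):
    """Standard Huber loss on a scalar residual."""
    return 0.5 * x * x if abs(x) <= delta else delta * (abs(x) - 0.5 * delta)

def grasp_reward(pred, gt):
    """Negative sum of per-component Huber losses over (x, y, cos t, sin t, w)."""
    return -sum(huber(p - g) for p, g in zip(pred, gt))

def format_reward(text):
    """1.0 iff the response matches <think>...</think><answer>...</answer>."""
    pattern = r"(?s)\s*<think>.*</think>\s*<answer>.*</answer>\s*"
    return float(re.fullmatch(pattern, text) is not None)

# Toy usage
print(bbox_reward((10, 10, 50, 50), (12, 12, 48, 52)))  # 1.0 (IoU ~ 0.82)
print(grasp_reward((100, 80, 0.9, 0.1, 30), (102, 79, 0.88, 0.12, 28)))
print(format_reward("<think>reasoning</think><answer>[100, 80, 0.9, 0.1, 30]</answer>"))
```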
## Key Experimental Results
### Data Quality Evaluation
| Dataset | MTLD ↑ | CLIP Score ↑ | \(R_s\) ↑ | \(R_g\) ↑ | \(R_c\) ↑ |
|---|---|---|---|---|---|
| Grasp-Anything | 27.45 | 0.54 | – | 0.38 | 0.69 |
| Grasp-Anything++ | 15.14 | 0.52 | – | 0.31 | 0.62 |
| RealVLG-11B | 36.49 | 0.65 | 0.99 | 0.69 | 0.87 |
RealVLG-11B comprehensively outperforms synthetic datasets in language diversity (MTLD), vision-language alignment (CLIP Score), and spatial consistency.
### Main Results on the RealVLG Benchmark
| Model | Seen Bbox gIoU (%) | Seen Grasp mIoU/gAcc (%) | Novel Bbox gIoU (%) | Novel Grasp mIoU/gAcc (%) |
|---|---|---|---|---|
| Qwen-VL-Max | 92.3 | 16.0/16.7 | 88.4 | 8.1/5.4 |
| Qwen2.5VL-3B + SFT | 56.4 | 3.4/1.7 | 57.2 | 4.4/1.5 |
| RealVLG-R1-3B (GRPO) | 87.2 | 34.7/40.3 | 78.5 | 16.3/17.1 |
| RealVLG-R1-7B (GSPO) | 89.0 | 33.6/32.8 | 88.5 | 16.5/18.3 |
### Ablation Study & Key Findings
- SFT vs. RL fine-tuning: SFT improves gIoU by only ~5% over the base model, whereas GRPO/GSPO achieves improvements exceeding 30%, demonstrating the significant advantage of reinforcement learning on multi-solution grasping tasks.
- GRPO vs. GSPO: GRPO achieves higher grasp accuracy on smaller models (3B: mIoU 34.7 vs. 29.2), while GSPO offers better stability on larger models and achieves a 100% valid response (Rv) rate.
- Zero-shot generalization: In Novel (unseen object) settings, RealVLG-R1-7B (GSPO) still achieves Bbox gIoU of 88.5% and grasp mIoU/gAcc of 16.5/18.3%, demonstrating non-trivial generalization capability.
- Valid output rate: The closed-source Qwen-VL-Max achieves only 60–70% Rv, while all RealVLG-R1 configurations reach 96–100%, indicating that RL fine-tuning significantly improves structured output consistency.
- Only 10% of training data used: Both RealVLG-R1 and the SFT baseline are trained on just 10% of the training set for 10 epochs, demonstrating strong data efficiency.
## Highlights & Insights
- First framework unifying VLG and grasping: Integrates semantic localization and physical interaction reasoning into a single model, which the authors present as the first end-to-end LVLM-based robotic perception model.
- High-quality data annotation pipeline: Four-layer quality assurance via GPT-4o automatic generation, Qwen-VL-Max verification, SAM2 segmentation, and human review.
- 11-billion-scale real-world grasp dataset: The largest real-world perception dataset simultaneously encompassing semantic and visual information.
- Reinforcement learning addresses multi-solution problems: Elegantly resolves the core challenge of multiple feasible grasp poses by replacing fixed labels with verifiable rewards.
- Zero-shot deployment capability: Enables perception and manipulation in unseen real-world environments without task-specific fine-tuning.
## Limitations & Future Work
- Currently supports only 2D rectangular grasp poses; extension to 3D space and 6-DoF grasping has not been explored.
- Grasp accuracy in Novel settings (mIoU ~16%) still has substantial room for improvement, with a notable gap relative to detection performance.
- Segmentation relies entirely on SAM2 as a frozen module; the model does not directly generate masks.
- Experiments do not report closed-loop manipulation success rates on real robots.
- The dataset primarily covers tabletop scenarios; generalization to complex industrial and outdoor environments has not been validated.
- GRPO/GSPO training requires sampling \(G\) responses per query to estimate group-relative advantages, which adds non-trivial training cost.
## Related Work & Insights
- Visual-language grounding: Methods such as GLIP, Shikra, and GroundingDINO focus on Bbox/Seg localization without addressing grasp reasoning.
- Grasp datasets: Cornell and GraspNet-1Billion provide real-world annotations but lack the language modality; Grasp-Anything includes language but relies on low-quality synthetic data.
- Language-driven grasping: Existing methods largely depend on pre-segmented inputs, suffer from multi-stage error accumulation, and generalize poorly to open-world scenarios.
- RL fine-tuning for LLMs: DeepSeek-R1 introduced the RLVR paradigm for reasoning tasks; this paper extends it to visual grounding and robotic grasping.
## Rating
- Novelty: ⭐⭐⭐⭐ — First to unify VLG and grasping, transferring the RLVR paradigm from NLP reasoning to embodied perception.
- Experimental Thoroughness: ⭐⭐⭐⭐ — Comprehensive data quality evaluation, benchmark, and multi-baseline comparisons; real-robot closed-loop experiments are absent.
- Writing Quality: ⭐⭐⭐⭐ — Clear paper structure, detailed dataset construction pipeline, and complete mathematical derivations.
- Value: ⭐⭐⭐⭐ — The dataset and benchmark offer long-term community value; the unified framework merits follow-up research.