
RealVLG-R1: A Large-Scale Real-World Visual-Language Grounding Benchmark for Robotic Perception and Manipulation¶

Conference: CVPR 2026
arXiv: 2603.14880
Code: lif314/RealVLG-R1
Area: Semantic Segmentation
Keywords: visual-language grounding, robotic grasping, reinforcement learning fine-tuning, multi-granularity annotation, zero-shot generalization, large-scale vision-language models

TL;DR¶

This paper proposes the RealVLG framework, comprising the RealVLG-11B large-scale real-world multi-granularity annotated dataset and the RealVLG-R1 unified model fine-tuned via reinforcement learning. It is the first work to unify visual-language grounding (VLG) and robotic grasping under a single paradigm, enabling end-to-end prediction of bounding boxes, segmentation masks, grasp poses, and contact points from natural language instructions, while demonstrating zero-shot generalization capability.

Background & Motivation¶

Disconnect between VLG and grasping: Existing VLG research focuses on coarse-grained object-level localization (bounding boxes / segmentation masks), while traditional robotic grasping methods rely on geometric cues without language semantic guidance, creating a significant gap between the two.

Insufficient quality of synthetic data: Datasets such as Grasp-Anything use diffusion models to generate low-resolution synthetic scenes. Their grasp annotations are produced automatically by RAGT-3/3 and are of limited quality, and their language descriptions cover only scene-level or object-category-level information.

Lack of fine-grained language descriptions: Language annotations in existing grasp datasets are coarse, lacking detailed descriptions of target object attributes and spatial relationships, and thus cannot support language-driven fine-grained manipulation.

SFT struggles with multi-solution problems: Grasp poses inherently admit multiple feasible solutions, yet supervised fine-tuning forces the model to fit a single label, leading to "averaged" predictions that are physically infeasible.

Insufficient scale of real-world datasets: Existing real-world grasp datasets have inconsistent annotations and lack multi-modal aligned annotations such as segmentation, detection, and language descriptions.

Absence of zero-shot capability: Grasping methods trained in closed environments have poor scalability and cannot be directly deployed in unseen real-world scenarios.

Method¶

Overall Architecture¶

RealVLG consists of two components: the dataset (RealVLG-11B) and the model (RealVLG-R1):

RealVLG-11B Dataset: Integrates real-world grasp datasets including Cornell, VMRD, OCID-Grasp, GraspNet, and GraspClutter6D, with unified extensions covering bounding boxes, segmentation masks, rectangular grasp poses, contact points, and natural language descriptions. It covers approximately 165,000 images, 800+ object instances, 1.3 million annotations, and approximately 11 billion grasp examples.
RealVLG-R1 Model: Built on a Qwen2.5-VL backbone and fine-tuned with reinforcement learning with verifiable rewards (RLVR), so that a single model predicts all four output types: bounding boxes, segmentation masks, grasp poses, and contact points.

Data Annotation Pipeline (Key Design 1)¶

Language annotation: Render object 3D models from 8 viewpoints → GPT-4o generates Meta Descriptions → combined with images to generate Language Instructions for each target, covering category, color, shape, and spatial relationships.
Localization verification: Qwen-VL-Max performs grounding on image + language to output bounding boxes → SAM2 generates segmentation masks.
Grasp pose unification: Convert 6-DoF grasp poses into a unified rectangular grasp representation, and compute contact points based on segmentation masks.
Human review: Manual cross-validation of consistency across four modalities—Meta Description, Language Instruction, Bbox, and Segmentation Mask—with iterative correction for any failing instances.
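
To make the "unified rectangular grasp representation" more concrete, here is a minimal sketch of an oriented grasp rectangle in image coordinates. It assumes the common parameterization (center \(x, y\), angle \(\theta\), opening width \(w\), jaw height \(h\)); the function names are hypothetical, and treating the two jaw endpoints as stand-ins for contact points is an assumption. The notes above do not specify how the paper converts 6-DoF poses or derives contact points from masks, so neither step is reproduced here.

```python
import math


def grasp_rectangle_corners(x, y, theta, w, h):
    """Return the four corners of an oriented grasp rectangle.

    (x, y) is the center, theta the rotation in radians, w the gripper
    opening width and h the jaw height. All names are illustrative.
    """
    c, s = math.cos(theta), math.sin(theta)
    offsets = [(-w / 2, -h / 2), (w / 2, -h / 2), (w / 2, h / 2), (-w / 2, h / 2)]
    # Rotate each local offset by theta, then translate to the center.
    return [(x + dx * c - dy * s, y + dx * s + dy * c) for dx, dy in offsets]


def jaw_endpoints(x, y, theta, w):
    """Return the two points where the opening axis meets the rectangle.

    Assumption: these serve as a simple proxy for the two contact points.
    """
    c, s = math.cos(theta), math.sin(theta)
    return (x - 0.5 * w * c, y - 0.5 * w * s), (x + 0.5 * w * c, y + 0.5 * w * s)


def encode_grasp(x, y, theta, w):
    """Encode a grasp as the five components (x, y, cos(theta), sin(theta), w)."""
    return (x, y, math.cos(theta), math.sin(theta), w)
```

Encoding the angle as a cosine/sine pair avoids the discontinuity at the wrap-around of \(\theta\), which is one plausible reason a five-component form is used for regression-style rewards.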

Reinforcement Learning Fine-Tuning (Key Design 2)¶

Adopts the RLVR paradigm, replacing fixed-label supervision with a verifiable reward function \(R(q,o)\).
Uses the GRPO algorithm for token-level importance-weighted policy optimization.
Further employs the GSPO method, introducing length-normalized importance weights at the sequence level: \(s_i(\theta) = \left(\frac{\pi_\theta(y_i|x)}{\pi_{\theta_{old}}(y_i|x)}\right)^{1/|y_i|}\), where \(y_i\) is the \(i\)-th sampled response to prompt \(x\) and \(|y_i|\) is its length in tokens; this reduces variance for long sequences.
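
The two ingredients above, group-relative advantages and the length-normalized sequence ratio \(s_i(\theta)\), can be sketched in a few lines of plain Python. This is an illustrative sketch rather than the paper's training code: the function names are hypothetical, the use of the population standard deviation is an assumption, and `clip_eps=0.2` is a placeholder value, not a setting reported in these notes.

```python
import math
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize the rewards of one sampled group to zero mean, unit scale."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std within the group
    return [(r - mu) / (sigma + eps) for r in rewards]


def gspo_sequence_ratio(new_token_logps, old_token_logps):
    """Length-normalized sequence-level importance ratio s_i(theta).

    (pi_theta(y|x) / pi_old(y|x)) ** (1 / |y|) is evaluated in log space as
    exp(mean of per-token log-probability differences).
    """
    if not new_token_logps or len(new_token_logps) != len(old_token_logps):
        raise ValueError("log-prob lists must be non-empty and equal length")
    diffs = [n - o for n, o in zip(new_token_logps, old_token_logps)]
    return math.exp(sum(diffs) / len(diffs))


def clipped_objective(ratio, advantage, clip_eps=0.2):
    """PPO-style clipped surrogate for one response (placeholder clip range)."""
    clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
    return min(ratio * advantage, clipped * advantage)
```

Working in log space keeps the ratio numerically stable for long responses, since the raw sequence likelihood ratio is a product of many per-token terms.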

Task-Specific Reward Functions (Loss Design)¶

Bbox reward: Binary reward based on IoU threshold: \(R_{Bbox} = \mathbf{1}(\text{IoU}(B_p, B_{gt}) \geq \tau)\)
Segmentation reward: Combines a thresholded IoU term for coarse localization with the S-measure for fine-grained mask quality: \(R_{Seg} = \mathbf{1}(\text{IoU} \geq \tau) + S_\alpha(M_p, M_{gt})\)
Grasp reward: Negative sum of Huber losses computed separately over five components \((x, y, \cos\theta, \sin\theta, w)\)
Contact point reward: Binary rectangular alignment IoU reward plus L2 distance penalty for two contact points
Format reward: All tasks uniformly require the <think>...</think><answer>...</answer> format
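
Three of the rewards listed above are simple enough to sketch directly. The code below is a hedged illustration, not the authors' implementation: the function names, the threshold `tau=0.5`, the Huber `delta=1.0`, the `(x1, y1, x2, y2)` box layout, and the assumption that grasp coordinates are already normalized to a comparable scale are all placeholders chosen for the example. The segmentation and contact-point rewards are omitted because their exact definitions are not given in these notes.

```python
import math
import re


def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def bbox_reward(pred, target, tau=0.5):
    """Binary reward: 1 if IoU reaches the threshold tau, else 0."""
    return 1.0 if box_iou(pred, target) >= tau else 0.0


def huber(err, delta=1.0):
    """Huber loss of a scalar error."""
    a = abs(err)
    return 0.5 * a * a if a <= delta else delta * (a - 0.5 * delta)


def grasp_reward(pred, target, delta=1.0):
    """Negative sum of Huber losses over (x, y, cos(theta), sin(theta), w).

    pred and target are (x, y, theta, w) with theta in radians.
    """
    def components(g):
        x, y, theta, w = g
        return (x, y, math.cos(theta), math.sin(theta), w)

    pairs = zip(components(pred), components(target))
    return -sum(huber(p - t, delta) for p, t in pairs)


# Whole response must be a <think> block followed by an <answer> block.
_FORMAT = re.compile(r"\s*<think>.*?</think>\s*<answer>.*?</answer>\s*", re.DOTALL)


def format_reward(text):
    """1 if the response follows <think>...</think><answer>...</answer>."""
    return 1.0 if _FORMAT.fullmatch(text) else 0.0
```

Because every reward is computed by a deterministic check against the annotation, any prediction close to a valid label scores well, which is what lets RLVR avoid forcing the model toward a single fixed target string.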

Key Experimental Results¶

Data Quality Evaluation¶

Dataset	MTLD ↑	CLIP Score ↑	\(R_s\) ↑	\(R_g\) ↑	\(R_c\) ↑
Grasp-Anything	27.45	0.54	–	0.38	0.69
Grasp-Anything++	15.14	0.52	–	0.31	0.62
RealVLG-11B	36.49	0.65	0.99	0.69	0.87

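RealVLG-11B outperforms both synthetic datasets on every metric they share: language diversity (MTLD), vision-language alignment (CLIP Score), and the consistency scores \(R_g\) and \(R_c\). No \(R_s\) value is listed for the synthetic datasets, so that column offers no comparison.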

Main Results on the RealVLG Benchmark¶

Model	Seen Bbox (gIoU)	Seen Grasp (mIoU/gAcc)	Novel Bbox (gIoU)	Novel Grasp (mIoU/gAcc)
Qwen-VL-Max	92.3	16.0/16.7	88.4	8.1/5.4
Qwen2.5VL-3B + SFT	56.4	3.4/1.7	57.2	4.4/1.5
RealVLG-R1-3B (GRPO)	87.2	34.7/40.3	78.5	16.3/17.1
RealVLG-R1-7B (GSPO)	89.0	33.6/32.8	88.5	16.5/18.3

Ablation Study & Key Findings¶

SFT vs. RL fine-tuning: SFT improves gIoU by only ~5% over the base model, whereas GRPO/GSPO achieves improvements exceeding 30%, demonstrating the significant advantage of reinforcement learning on multi-solution grasping tasks.
GRPO vs. GSPO: GRPO achieves higher grasp accuracy on smaller models (3B: mIoU 34.7 vs. 29.2), while GSPO offers better stability on larger models and achieves a 100% valid response (Rv) rate.
Zero-shot generalization: In Novel (unseen object) settings, RealVLG-R1-7B (GSPO) still achieves Bbox gIoU of 88.5% and grasp mIoU/gAcc of 16.5/18.3%, demonstrating non-trivial generalization capability.
Valid output rate: The closed-source Qwen-VL-Max achieves only 60–70% Rv, while all RealVLG-R1 configurations reach 96–100%, indicating that RL fine-tuning significantly improves structured output consistency.
Only 10% training data used: Both RealVLG-R1 and the SFT baseline are trained on only 10% of the training set for 10 epochs, which suggests that the RL-based variants are data-efficient.

Highlights & Insights¶

First framework unifying VLG and grasping: Integrates semantic localization and physical interaction reasoning into a single model, which the paper presents as the first end-to-end robotic perception model based on LVLMs.
High-quality data annotation pipeline: Four-layer quality assurance via GPT-4o automatic generation, Qwen-VL-Max verification, SAM2 segmentation, and human review.
11-billion-scale real-world grasp dataset: The largest real-world perception dataset simultaneously encompassing semantic and visual information.
Reinforcement learning addresses multi-solution problems: Elegantly resolves the core challenge of multiple feasible grasp poses by replacing fixed labels with verifiable rewards.
Zero-shot deployment capability: Enables perception and manipulation in unseen real-world environments without task-specific fine-tuning.

Limitations & Future Work¶

Currently supports only 2D rectangular grasp poses; extension to 3D space and 6-DoF grasping has not been explored.
Grasp accuracy in Novel settings (mIoU ~16%) still has substantial room for improvement, with a notable gap relative to detection performance.
Segmentation relies entirely on SAM2 as a frozen module; the model does not directly generate masks.
Experiments do not report closed-loop manipulation success rates on real robots.
The dataset primarily covers tabletop scenarios; generalization to complex industrial and outdoor environments has not been validated.
Group-based advantage estimation in GRPO/GSPO requires sampling \(G\) responses per prompt, which may limit efficiency; note that this sampling is a training-time cost rather than an inference-time one.

Related Work¶

Visual-language grounding: Methods such as GLIP, Shikra, and GroundingDINO focus on Bbox/Seg localization without addressing grasp reasoning.
Grasp datasets: Cornell and GraspNet-1Billion provide real-world annotations but lack the language modality; Grasp-Anything includes language but relies on low-quality synthetic data.
Language-driven grasping: Existing methods largely depend on pre-segmented inputs, suffer from multi-stage error accumulation, and generalize poorly to open-world scenarios.
RL fine-tuning for LLMs: DeepSeek-R1 popularized the RLVR paradigm for reasoning tasks; this paper extends it to visual grounding and robotic grasping.

Rating¶

Novelty: ⭐⭐⭐⭐ — First to unify VLG and grasping, transferring the RLVR paradigm from NLP reasoning to embodied perception.
Experimental Thoroughness: ⭐⭐⭐⭐ — Comprehensive data quality evaluation, benchmark, and multi-baseline comparisons; real-robot closed-loop experiments are absent.
Writing Quality: ⭐⭐⭐⭐ — Clear paper structure, detailed dataset construction pipeline, and complete mathematical derivations.
Value: ⭐⭐⭐⭐ — The dataset and benchmark offer long-term community value; the unified framework merits follow-up research.