Robot-R1: Reinforcement Learning for Enhanced Embodied Reasoning in Robotics¶
Conference: NeurIPS 2025
arXiv: 2506.00070
Code: GitHub
Area: Reinforcement Learning
Keywords: Embodied Reasoning, Reinforcement Learning, Robot Control, GRPO, Large Vision-Language Models
TL;DR¶
Robot-R1 proposes training large vision-language models (LVLMs) via reinforcement learning (GRPO) for embodied reasoning. By casting next keystate prediction as multiple-choice questions and optimizing reasoning paths with RL, a 7B-parameter model surpasses GPT-4o on low-level control reasoning tasks.
Background & Motivation¶
LVLMs have demonstrated considerable potential in robotics, capable of jointly processing visual inputs and natural language instructions for high-level robot control. However, current LVLMs still face challenges in fine-grained embodied reasoning:
- Limitations of SFT datasets: Existing embodied reasoning datasets are largely constructed heuristically; their language descriptions fail to capture the precise quantitative information required for low-level robot control, and they are not optimized for actual robot action prediction.
- Catastrophic forgetting: SFT approaches suffer severe performance degradation on out-of-distribution input-output formats and forget previously acquired general conversational capabilities.
- Insufficient generalization: SFT-trained models show limited improvement in spatial and motion reasoning, both critical for robot control, and struggle to generalize to novel tasks.
Inspired by DeepSeek-R1, the authors argue that reinforcement learning can more effectively elicit and reinforce reasoning paths, offering superior generalization and sample efficiency compared to SFT. The core insight is that if a model autonomously explores how to reason about next keystate prediction through RL, this reasoning capability naturally transfers to other embodied reasoning tasks.
Method¶
Overall Architecture¶
Robot-R1 follows a three-stage pipeline: (1) extracting metadata and keyframes from expert demonstrations to construct training data; (2) reformulating continuous state prediction as multiple-choice question answering (MCQA); and (3) training the LVLM with GRPO to generate and optimize reasoning chains. The framework also includes a purpose-built evaluation benchmark, Robot-R1 Bench.
Key Designs¶
- Metadata-conditioned data generation: Since LVLMs cannot directly infer low-level states from images, Robot-R1 introduces environment metadata \(M\) comprising three components: (a) fixed environmental reference points (e.g., table center); (b) a 3D coordinate system definition with positive axis directions; and (c) end-effector dimensions for scale estimation. This metadata is provided as conditional input in the question prompt, helping the model build a quantitative understanding of the scene.
- Three MCQA task designs:
  - Waypoint prediction QA (primary task): Given current observation \(o_t\), state \(s_t\), and metadata \(M\), predict the next keyframe state \(s_{k^*}\). Each question contains the correct answer and 3 distractors randomly sampled from the valid state space.
  - Current state prediction QA (auxiliary task): Identify the current state \(s_t\) from visual observations, enhancing the model's understanding of its own state.
  - Motion prediction QA (auxiliary task): Predict the motion direction from the current state to the next keyframe (e.g., "move upward," "slightly backward"). Motion labels are extracted from 3D Cartesian displacements via rule-based heuristics.
The core motivation for discretizing continuous state prediction into multiple-choice questions is that the continuous action space is too large for effective RL exploration; discretization narrows the action space, making learning more efficient.
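The waypoint prediction QA described above can be sketched as follows. This is a minimal illustration, not the authors' data pipeline: the function name `make_waypoint_mcqa`, the state layout (an x, y, z tuple), the workspace bounds, and the two-decimal rounding are all assumptions made for the example.

```python
import random
import string


def make_waypoint_mcqa(next_state, low, high, num_distractors=3, rng=None):
    """Build one multiple-choice question for next-keystate prediction.

    next_state: true next keyframe state, assumed here to be an (x, y, z) tuple.
    low, high:  per-axis bounds of the valid state space (hypothetical values).
    """
    rng = rng or random.Random()
    correct = tuple(round(v, 2) for v in next_state)
    options = [correct]
    # Sample distractors uniformly from the valid state space, skipping duplicates.
    while len(options) < num_distractors + 1:
        cand = tuple(round(rng.uniform(lo, hi), 2) for lo, hi in zip(low, high))
        if cand not in options:
            options.append(cand)
    # Shuffle so the correct answer does not always sit in the same slot.
    rng.shuffle(options)
    letters = string.ascii_uppercase[: len(options)]
    choices = dict(zip(letters, options))
    answer = next(k for k, v in choices.items() if v == correct)
    return {"choices": choices, "answer": answer}


question = make_waypoint_mcqa(
    next_state=(0.1, 0.2, 0.3),
    low=(-0.5, -0.5, 0.0),
    high=(0.5, 0.5, 1.0),
    rng=random.Random(0),
)
```

With four options, the model only has to pick a letter, so a rule-based check against `question["answer"]` is enough to score a response.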
- GRPO for reasoning path optimization: During training, the model outputs its reasoning in `<think>...</think>` tags and the final answer in `<answer>...</answer>` tags. The GRPO algorithm generates \(G\) distinct responses per query, using within-group relative advantage \(A_i = \frac{r_i - \text{mean}(\mathbf{r})}{\text{std}(\mathbf{r})}\) to update the policy, with a KL divergence penalty to prevent excessive policy drift.
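The within-group relative advantage can be computed as in the sketch below. The use of the population standard deviation and the small `eps` term (which avoids division by zero when all rewards in a group are equal) are assumptions of this example; the formula above does not specify either.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize the rewards of G responses to one query: (r_i - mean) / std."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumption
    return [(r - mean) / (std + eps) for r in rewards]


# Example: rewards for G = 5 sampled responses to a single question.
advantages = group_relative_advantages([2.0, 1.0, 1.0, 0.0, 1.0])
```

Responses scoring above the group mean receive a positive advantage and those below receive a negative one; if every response earns the same reward, all advantages are zero and the group contributes no policy gradient.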
Loss & Training¶
The reward signal consists of two components:
- Format reward \(r_f\): Encourages the model to adhere to the prescribed output structure (`<think>`/`<answer>` tags).
- Answer correctness reward \(r_a\): A rule-based binary reward granted when the model's answer exactly matches the correct multiple-choice option.
Total reward \(R = r_f + r_a\). Training uses batch size 128, 5 epochs, learning rate \(1.0 \times 10^{-6}\), and 5 sampled responses per prompt (i.e., \(G = 5\)).
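A rule-based reward of this shape might look like the following sketch. The reward magnitudes (1.0 each), the exact regular expression, and the choice to grant the answer reward only when the format is valid are assumptions for illustration; the source specifies only that \(r_a\) is binary and that \(R = r_f + r_a\).

```python
import re

# Expected structure: reasoning in <think> tags, then the chosen option in <answer> tags.
_RESPONSE_PATTERN = re.compile(
    r"\s*<think>(.*?)</think>\s*<answer>(.*?)</answer>\s*", re.DOTALL
)


def total_reward(response, correct_option):
    """Return R = r_f + r_a for one sampled response."""
    match = _RESPONSE_PATTERN.fullmatch(response)
    # Format reward: 1.0 if the whole response follows the prescribed structure.
    r_f = 1.0 if match else 0.0
    # Answer reward: 1.0 only on an exact match with the correct option letter.
    r_a = 1.0 if match and match.group(2).strip() == correct_option else 0.0
    return r_f + r_a


good = total_reward("<think>The gripper must rise.</think><answer>B</answer>", "B")
wrong = total_reward("<think>The gripper must rise.</think><answer>C</answer>", "B")
malformed = total_reward("B", "B")
```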
Robot-R1 Bench¶
A purpose-designed open-ended QA benchmark evaluating four embodied reasoning capabilities: planning, high-level action reasoning, motion reasoning, and spatial reasoning. It comprises 215 questions manually created by experienced researchers, with GPT-4o serving as judge on a 0–3 scoring scale.
Key Experimental Results¶
Main Results: Robot-R1 Bench Low-Level Control Reasoning¶
| Model | Motion(In) | Motion(Out) | Motion(Avg) | Spatial(In) | Spatial(Out) | Spatial(Avg) |
|---|---|---|---|---|---|---|
| GPT-4o | 0.92 | 0.52 | 0.72 | 1.70 | 1.07 | 1.43 |
| Gemini-2.0-Flash | 0.52 | 0.40 | 0.46 | 1.76 | 1.14 | 1.49 |
| Qwen2.5-VL-7B-Ins | 0.64 | 0.52 | 0.58 | 1.62 | 1.11 | 1.40 |
| Direct SFT | 0.12 | 0 | 0.06 | 0.08 | 0.04 | 0.06 |
| CoT SFT | 0.84 | 0.56 | 0.70 | 0.46 | 0.07 | 0.29 |
| Robot-R1 (Ours) | 0.96 | 0.56 | 0.76 | 1.76 | 1.18 | 1.51 |
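Robot-R1 achieves the highest or tied-highest score on every low-level control reasoning metric: it ties CoT SFT on Motion(Out) (0.56) and Gemini-2.0-Flash on Spatial(In) (1.76), and leads on both averages (0.76 for motion, 1.51 for spatial), ahead of both commercial models.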
EmbodiedBench Manipulation Benchmark¶
| Model | Base | Common | Complex | Spatial | Visual | Avg |
|---|---|---|---|---|---|---|
| Qwen2.5-VL-7B-Ins | 6.3 | 6.3 | 6.3 | 14.6 | 11.1 | 8.92 |
| Direct SFT | 0 | 0 | 0 | 0 | 0 | 0 |
| CoT SFT | 0 | 0 | 0 | 0 | 0 | 0 |
| Robot-R1 (Ours) | 12.5 | 8.3 | 6.3 | 14.6 | 16.7 | 11.68 |
Robot-R1 improves approximately 31% over the baseline, while SFT-based methods entirely fail to complete any task.
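The 31% figure is a relative improvement in the average success rate, which can be reproduced from the table above. The helper name `relative_improvement` is introduced only for this check.

```python
def relative_improvement(baseline, improved):
    """Relative gain of `improved` over `baseline`, in percent."""
    return (improved - baseline) / baseline * 100.0


# Per-category scores copied from the EmbodiedBench table
# (Base, Common, Complex, Spatial, Visual).
qwen = [6.3, 6.3, 6.3, 14.6, 11.1]
robot_r1 = [12.5, 8.3, 6.3, 14.6, 16.7]

base_avg = sum(qwen) / len(qwen)          # matches the reported 8.92
r1_avg = sum(robot_r1) / len(robot_r1)    # matches the reported 11.68
gain = relative_improvement(base_avg, r1_avg)
print(f"{base_avg:.2f} -> {r1_avg:.2f} ({gain:.1f}% relative improvement)")
```

The absolute gain is 2.76 points; the gain is concentrated in the Base, Common, and Visual categories, while Complex and Spatial are unchanged.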
Ablation Study¶
| RL Algorithm | Planning | High-Level Action | Motion | Spatial |
|---|---|---|---|---|
| GRPO | 1.44 | 1.30 | 0.76 | 1.51 |
| RLOO | 1.56 | 1.54 | 0.68 | 1.52 |
| REINFORCE++ | 1.40 | 1.08 | 0.62 | 1.38 |
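GRPO and RLOO perform comparably on motion and spatial reasoning, while RLOO scores higher on planning (1.56 vs. 1.44) and high-level action (1.54 vs. 1.30). REINFORCE++ trails both on every metric, which is attributed to the higher variance introduced by batch-level reward normalization.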
Key Findings¶
- Training exclusively on low-level control tasks yields significant gains in high-level action reasoning—a transfer that SFT cannot achieve.
- Robot-R1 improves approximately 40% on quantitative and 60% on qualitative metrics on the SpatialRGPT benchmark.
- RL algorithms with variance-reduction mechanisms (GRPO/RLOO) are better suited for Robot-R1 training.
- Experiments across three random seeds confirm the robustness of the training results.
Highlights & Insights¶
- Elegant continuous-to-discrete reformulation: Casting continuous state prediction as multiple-choice questions directly addresses the core challenge of RL exploration in large action spaces.
- Natural transfer of reasoning capability: Training only on low-level state prediction yields improvements in high-level reasoning—a compelling demonstration of emergent transfer.
- Empirical evidence for RL over SFT: SFT (including CoT-SFT) performs poorly or collapses entirely when transferred to out-of-distribution tasks, whereas RL-trained models exhibit substantially stronger generalization.
Limitations & Future Work¶
- Training data is drawn from only 5 tabletop manipulation tasks in RLBench, limiting scene diversity.
- Planning capability shows a slight decline, as the training objective focuses on next-keyframe prediction rather than long-horizon planning.
- Only 7B-scale models are evaluated; the effectiveness of larger models remains unknown.
- The impact of the number of answer choices and distractor generation strategies on training performance is insufficiently explored.
Related Work & Insights¶
- DeepSeek-R1: The paradigm for RL-based reasoning model training; Robot-R1 extends this approach to the embodied domain.
- ARM (James et al.): The source of the keyframe extraction methodology.
- EmbodiedBench, SpatialRGPT: External evaluation benchmarks used to validate generalization.
- Insight: RL-trained embodied reasoning capability may be a critical pathway toward general-purpose robot intelligence.
Rating¶
- Novelty: ⭐⭐⭐⭐ Transferring the R1-style RL reasoning training paradigm to robotic embodied reasoning is novel and meaningful.
- Experimental Thoroughness: ⭐⭐⭐⭐ Three benchmark evaluations + ablations + seed robustness + human agreement validation.
- Writing Quality: ⭐⭐⭐⭐ Clear structure with well-articulated motivation.
- Value: ⭐⭐⭐⭐ Introduces an effective RL-based reasoning training paradigm for robotics with strong practical implications.