RoboInter: A Holistic Intermediate Representation Suite Towards Robotic Manipulation¶
Conference: ICLR 2026 · arXiv: 2602.09973 · Code: GitHub · Area: Robot Learning / Datasets · Keywords: Intermediate Representation, VLA, Manipulation Dataset, Embodied VQA, Plan-then-Execute
TL;DR¶
This paper presents RoboInter, a unified manipulation suite for intermediate representations, comprising: RoboInter-Tool (a semi-automatic annotation GUI), RoboInter-Data (230K episodes across 571 scenes with dense per-frame annotations spanning 10+ intermediate representation types), RoboInter-VQA (a 29-category embodied VQA benchmark), and RoboInter-VLA (a plan-then-execute framework supporting both modular and end-to-end variants). Together these provide a complete infrastructure for improving VLA generalization through intermediate representations.
Background & Motivation¶
Background: Vision-Language-Action (VLA) systems integrate large-scale pretrained VLMs with robotic manipulation, yet existing manipulation datasets suffer from high annotation cost, embodiment specificity, and insufficient coverage. The plan-then-execute paradigm—generating high-level plans before translating them into low-level actions—has been validated as an effective approach to improving generalization, but critically relies on supervision signals from intermediate representations (subtasks, traces, grounding, etc.).
Limitations of Prior Work:
- Existing datasets provide almost no dense intermediate representation annotations, limiting the development of plan-then-execute methods.
- Prior annotation efforts are small in scale (ShareRobot: only 51K), narrow in annotation types (LLARVA: trace only), or dependent on automatic annotation with uncontrolled quality (ECoT).
- No systematic benchmark exists for evaluating VLMs' spatial and temporal reasoning in embodied settings.
- Comparisons between modular and end-to-end VLA lack a unified framework and data support.
Key Challenge: While the potential of the plan-then-execute paradigm has been demonstrated, large-scale, high-quality, and diverse intermediate representation annotations are lacking to fully unlock it.
Goal: Build a complete intermediate representation ecosystem—from annotation tooling (RoboInter-Tool) to data (RoboInter-Data) to benchmarks (RoboInter-VQA) to model frameworks (RoboInter-VLA)—addressing the three bottlenecks of data, evaluation, and methodology in a unified suite.
Method¶
Overall Architecture¶
The RoboInter suite consists of four major components:
- RoboInter-Tool: A lightweight PyQt5-based GUI supporting semi-automatic per-frame annotation, integrated with SAM2 for object segmentation and tracking.
- RoboInter-Data: Dense per-frame annotation data covering 230K+ episodes, 571 scenes, and 10+ intermediate representation types.
- RoboInter-VQA: An embodied VQA benchmark and dataset with 9 spatial and 20 temporal question categories.
- RoboInter-VLA: A plan-then-execute framework with a VLM Planner and Executor, supporting three architectural variants.
Key Design 1: Multi-Level Intermediate Representation Annotation¶
Annotations span three levels:
- Task level: Tasks decomposed into subtasks → 15 predefined primitive skills (Pick, Place, Push, Pull, etc.) → segment-level and video-level language annotations.
- Object level: Object grounding annotations via SAM2 + human review → 61M frames of object grounding, 190K affordance boxes, and placement proposals.
- Execution level: End-effector 2D trajectories (70M frames of trace annotations), contact points, 6D grasp poses, and gripper bounding boxes.
All annotations are temporally aligned with actions, states, and both third-person and wrist-view observations.
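To make the three-level structure concrete, a per-frame annotation record might be sketched as below. The field names and types are hypothetical illustrations of the representation types listed above, not the dataset's actual schema.

```python
from dataclasses import dataclass, field
from typing import Optional

# Illustrative sketch of one frame's multi-level annotation record.
# Field names are hypothetical -- RoboInter-Data's actual schema may differ.

Box = tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels

@dataclass
class FrameAnnotation:
    # Task level
    subtask: str                          # e.g. "pick up the red mug"
    primitive_skill: str                  # one of 15 skills, e.g. "Pick"
    # Object level
    object_boxes: dict[str, Box] = field(default_factory=dict)
    affordance_box: Optional[Box] = None  # where to grasp / act
    placement_box: Optional[Box] = None   # proposed placement region
    # Execution level
    trace_point: Optional[tuple[float, float]] = None  # 2D end-effector pixel
    contact_point: Optional[tuple[float, float]] = None
    grasp_pose_6d: Optional[tuple[float, ...]] = None  # (x, y, z, rx, ry, rz)
    gripper_box: Optional[Box] = None

frame = FrameAnnotation(
    subtask="pick up the red mug",
    primitive_skill="Pick",
    object_boxes={"red mug": (312.0, 188.0, 398.0, 290.0)},
    trace_point=(355.0, 240.0),
)
print(frame.primitive_skill)  # -> Pick
```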
Key Design 2: F-CoT — Flexible Chain-of-Thought Bridging Planning and Execution¶
The paper introduces Flexible Chain-of-Thought (F-CoT), composed of combinations of intermediate representations:
- Serves as VQA supervision signals for Planner training.
- Serves as action-aligned guidance signals for the Executor.
- Supports text form (Te-Modular) and visual prompt form (Im-Modular).
- Users may flexibly select subsets (e.g., subtask + trace, affordance + skill) according to task requirements.
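The subset-selection idea behind F-CoT can be sketched as a small composition function that renders only the chosen representations into a textual CoT (the Te-Modular form). The keys and output format here are illustrative assumptions, not the paper's exact prompt template.

```python
# Hypothetical sketch: compose a textual F-CoT (Te-Modular form) from a
# user-selected subset of intermediate representations. Keys and string
# formatting are illustrative, not RoboInter's actual prompt format.

def compose_fcot(annotations: dict, selected: list[str]) -> str:
    formatters = {
        "subtask": lambda v: f"Subtask: {v}",
        "skill": lambda v: f"Primitive skill: {v}",
        "trace": lambda v: "Trace: " + " ".join(f"({x},{y})" for x, y in v),
        "affordance": lambda v: f"Affordance box: {v}",
        "object_box": lambda v: f"Object box: {v}",
    }
    lines = [formatters[k](annotations[k]) for k in selected if k in annotations]
    return "\n".join(lines)

ann = {
    "subtask": "pick up the red mug",
    "skill": "Pick",
    "trace": [(355, 240), (350, 231), (344, 220)],
    "affordance": (330, 200, 380, 260),
}
# The "subtask + trace" combination mentioned in the text:
print(compose_fcot(ann, ["subtask", "trace"]))
```

Swapping the `selected` list for, say, `["affordance", "skill"]` yields the other combination named above without touching the annotations themselves.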
Key Design 3: Three Plan-then-Execute VLA Variants¶
| Variant | Architecture | Intermediate Representation Usage |
|---|---|---|
| IC-E2E | VLM as feature extractor for Executor | Implicit (via pretrained weights) |
| EC-E2E | VLM jointly generates intermediate representations and actions | Explicit (joint optimization of CoT + action) |
| Modular | Planner and Executor decoupled | Explicit (Planner output → Executor conditioning) |
The Executor adopts a Qwen2.5-VL backbone + DiT action head + information aggregator (compressing all token hidden states into a controllable-length conditioning feature).
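The information aggregator can be pictured as cross-attention pooling: a fixed set of K learned queries attends over all token hidden states, yielding exactly K conditioning vectors regardless of sequence length. The NumPy sketch below illustrates that idea only; the actual RoboInter aggregator design may differ.

```python
import numpy as np

# Minimal sketch of an "information aggregator": cross-attention pooling
# that compresses a variable number N of VLM token hidden states into a
# fixed, controllable number K of conditioning vectors for the action head.
# Illustrative only -- the paper's actual module may differ.

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def aggregate(hidden, queries):
    """hidden: (N, D) token states; queries: (K, D) learned queries."""
    d = hidden.shape[-1]
    attn = softmax(queries @ hidden.T / np.sqrt(d), axis=-1)  # (K, N)
    return attn @ hidden                                      # (K, D)

rng = np.random.default_rng(0)
tokens = rng.standard_normal((577, 64))   # e.g. 577 VLM tokens, dim 64
queries = rng.standard_normal((16, 64))   # K = 16 conditioning slots
cond = aggregate(tokens, queries)
print(cond.shape)  # -> (16, 64)
```

The key property is that K is a hyperparameter, so the conditioning length stays constant even as image resolution or prompt length changes N.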
Key Experimental Results¶
Main Results: VLM Capability Evaluation on Third-Party Benchmarks¶
| Model | Where2Place ↑ | RoboRefIt ↑ | RoboVQA ↑ | RefCOCOg ↑ |
|---|---|---|---|---|
| Qwen2.5-VL-7B | 18.9% | 75.8% | 38.4 | 87.2% |
| RoboBrain-2.0-7B | 63.6% | 8.8% | 31.6 | 62.9% |
| RoboInter-Qwen-7B | 65.8% | 85.6% | 74.4 | 88.4% |
| RoboInter-LLaVAOV-7B | 66.3% | 89.3% | 74.5 | 87.3% |
RoboInter-VLM substantially outperforms baselines across all embodied benchmarks while maintaining stable general-purpose capabilities (TextVQA 83.0, MME 2281).
Open-Loop Executor Evaluation¶
| Method | OLS@0.1 | OLS@0.05 | OLS@0.01 | mOLS |
|---|---|---|---|---|
| Vanilla | 0.6793 | 0.3608 | 0.0189 | 0.3086 |
| IC-E2E | 0.6984 | 0.3810 | 0.0204 | 0.3218 |
| EC-E2E | 0.7049 | 0.3930 | 0.0314 | 0.3340 |
| Te-Modular | 0.7124 | 0.4133 | 0.0584 | 0.3543 |
| Oracle+Executor | 0.7511 | 0.4640 | 0.0587 | 0.3861 |
Te-Modular (textual F-CoT + modular architecture) achieves the best results among learned methods, demonstrating that decoupling planning and execution allows each component to specialize more effectively.
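The paper's exact OLS definition is not reproduced here; one plausible reading, sketched below under that assumption, is that OLS@τ scores the fraction of predicted end-effector positions within normalized distance τ of the ground truth, with mOLS averaging OLS over a sweep of thresholds.

```python
import numpy as np

# Hedged sketch of a threshold-based open-loop score. Assumption: OLS@tau is
# the fraction of predicted end-effector positions within normalized distance
# tau of ground truth, and mOLS averages OLS over a threshold sweep.
# The paper's actual metric definition may differ.

def ols_at(pred, gt, tau):
    dists = np.linalg.norm(pred - gt, axis=-1)
    return float((dists <= tau).mean())

def mols(pred, gt, taus=np.linspace(0.01, 0.1, 10)):
    return float(np.mean([ols_at(pred, gt, t) for t in taus]))

gt = np.array([[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]])
pred = gt + np.array([[0.0, 0.02], [0.06, 0.0], [0.0, 0.0]])
print(ols_at(pred, gt, 0.05))  # -> 0.6666666666666666
print(mols(pred, gt))
```

Under this reading, a stricter threshold such as OLS@0.01 rewards only near-exact predictions, which is why the table's values drop sharply from OLS@0.1 to OLS@0.01.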
Ablation Study: Contribution of Intermediate Representation Types¶
| Intermediate Representation Combination | mOLS |
|---|---|
| Vanilla (no intermediate representation) | 0.3086 |
| + Subtask | 0.3146 |
| + Subtask + Primitive Skill | 0.3159 |
| + ... + Object Box | 0.3289 |
| + ... + Gripper Box | 0.3391 |
| + ... + Affordance | 0.3435 |
| + ... + Trace | 0.3861 |
Coarse-grained representations (Subtask, Skill) yield marginal gains, spatially precise signals (Object Box, Affordance) contribute significantly, and Trace provides the largest gain (dense temporal information directly constraining action generation).
Real-World Closed-Loop Evaluation¶
| Model | ID Avg. Success | OOD Avg. Success | ID→OOD Drop |
|---|---|---|---|
| OpenVLA | 45.0% | 23.3% | 21.7% |
| π₀ | 63.3% | 45.0% | 18.3% |
| Vanilla | 65.0% | 38.3% | 26.7% |
| IC-E2E | 77.3% | 58.3% | 19.0% |
| EC-E2E | 68.3% | 60.0% | 8.3% |
EC-E2E achieves the best OOD performance with the smallest degradation (only 8.3%), demonstrating that explicit intermediate representation reasoning provides stronger generalization robustness.
Highlights & Insights¶
Strengths¶
- Exceptional systematicity: From annotation tooling to data to benchmarks to model frameworks, the suite provides an all-in-one infrastructure for intermediate representation research.
- Substantial scale: 230K episodes + 571 scenes + 61M frames of grounding annotations, far exceeding prior work.
- Comprehensive experimentation: Full coverage of open-loop, closed-loop, cross-platform, SimplerEnv, and ablation evaluations.
- Deep insights: Systematic validation of the effects of intermediate representation granularity and architectural design choices (modular vs. end-to-end) on performance.
Limitations & Future Work¶
- VQA data is generated from templates and reassembled annotations, limiting diversity to the design space of annotation templates.
- Real-world experiments cover only 4 tasks; generalization to more complex, long-horizon tasks remains unexplored.
- The modular Planner incurs high inference latency (~2.4s), requiring engineering optimizations such as asynchronous inference for practical deployment.
Rating¶
⭐⭐⭐⭐⭐
RoboInter represents a milestone in robotic intermediate representation research. It not only provides the largest and most diverse multi-type intermediate representation dataset to date, but also establishes a complete experimental platform for the plan-then-execute paradigm through systematic VQA benchmark and VLA framework design. The ablation study clearly reveals a value hierarchy of intermediate representations—Trace > Affordance > Object Box > Subtask—an insight that offers concrete guidance for future embodied AI research.