RoboInter: A Holistic Intermediate Representation Suite Towards Robotic Manipulation

Conference: ICLR 2026
arXiv: 2602.09973
Code: GitHub
Area: Robot Learning / Datasets
Keywords: Intermediate Representation, VLA, Manipulation Dataset, Embodied VQA, Plan-then-Execute

TL;DR

This paper presents RoboInter, a unified suite of intermediate representations for robotic manipulation. It comprises four parts: RoboInter-Tool (a semi-automatic annotation GUI), RoboInter-Data (230K+ episodes across 571 scenes, with dense per-frame annotations covering 10+ intermediate representation types), RoboInter-VQA (a 29-category embodied VQA benchmark), and RoboInter-VLA (a plan-then-execute framework with both modular and end-to-end variants). Together they provide infrastructure for improving VLA generalization through intermediate representations.

Background & Motivation

Background: Vision-Language-Action (VLA) systems integrate large-scale pretrained VLMs with robotic manipulation, yet existing manipulation datasets suffer from high annotation cost, embodiment specificity, and insufficient coverage. The plan-then-execute paradigm, which generates a high-level plan before translating it into low-level actions, has been shown to improve generalization, but it depends critically on supervision from intermediate representations (subtasks, traces, grounding, etc.).

Limitations of Prior Work:

Existing datasets provide almost no dense intermediate representation annotations, limiting the development of plan-then-execute methods.
Prior annotation efforts are either small in scale (ShareRobot: only 51K), limited in annotation types (LLARVA: trace only), or rely on automatic annotation with uncontrolled quality (ECoT).
No systematic benchmark exists for evaluating VLMs' spatial and temporal reasoning in embodied settings.
Comparisons between modular and end-to-end VLAs lack a unified framework and shared data.

Key Challenge: The plan-then-execute paradigm has demonstrated its potential, but the large-scale, high-quality, and diverse intermediate representation annotations needed to fully unlock it are missing.

Goal: Build a complete intermediate representation ecosystem, spanning annotation tooling (RoboInter-Tool), data (RoboInter-Data), benchmarks (RoboInter-VQA), and model frameworks (RoboInter-VLA), that addresses the three bottlenecks of data, evaluation, and methodology in a unified suite.

Method

Overall Architecture

The RoboInter suite consists of four major components:

RoboInter-Tool: A lightweight PyQt5-based GUI supporting semi-automatic per-frame annotation, integrated with SAM2 for object segmentation and tracking.
RoboInter-Data: Dense per-frame annotation data covering 230K+ episodes, 571 scenes, and 10+ intermediate representation types.
RoboInter-VQA: An embodied VQA benchmark and dataset with 9 spatial and 20 temporal question categories.
RoboInter-VLA: A plan-then-execute framework with a VLM Planner and Executor, supporting three architectural variants.

Key Design 1: Multi-Level Intermediate Representation Annotation

Annotations span three levels:

Task level: Tasks are decomposed into subtasks, each mapped to one of 15 predefined primitive skills (Pick, Place, Push, Pull, etc.), with language annotations at both the segment level and the video level.
Object level: Object grounding is produced with SAM2 plus human review, yielding 61M frames of object grounding, 190K affordance boxes, and placement proposals.
Execution level: End-effector 2D trajectories (70M frames of trace annotations), contact points, 6D grasp poses, and gripper bounding boxes.

All annotations are temporally aligned with actions, states, and both third-person and wrist-view observations.
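To make the three annotation levels concrete, a per-frame record could be organized roughly as follows. This is a minimal sketch only: the class name, field names, and coordinate conventions are hypothetical and are not taken from the released RoboInter-Data schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max), assumed pixel units
Point = Tuple[float, float]              # (x, y), assumed pixel units


@dataclass
class FrameAnnotation:
    """Hypothetical per-frame record; every field name here is illustrative."""
    frame_index: int
    # Task level
    subtask: str
    primitive_skill: str
    # Object level
    object_box: Optional[Box] = None
    affordance_box: Optional[Box] = None
    placement_proposal: Optional[Box] = None
    # Execution level
    gripper_box: Optional[Box] = None
    contact_point: Optional[Point] = None
    trace: List[Point] = field(default_factory=list)  # 2D end-effector path

    def available_levels(self) -> List[str]:
        """Return which annotation levels carry data for this frame."""
        levels = ["task"]  # subtask and skill are required fields
        object_fields = (self.object_box, self.affordance_box, self.placement_proposal)
        if any(value is not None for value in object_fields):
            levels.append("object")
        if self.gripper_box is not None or self.contact_point is not None or self.trace:
            levels.append("execution")
        return levels


frame = FrameAnnotation(
    frame_index=0,
    subtask="pick up the red cup",
    primitive_skill="Pick",
    object_box=(120.0, 80.0, 200.0, 160.0),
    trace=[(150.0, 100.0), (155.0, 110.0)],
)
print(frame.available_levels())  # ['task', 'object', 'execution']
```

Keeping the optional fields nullable reflects that not every frame needs every representation, which is also what allows downstream consumers to pick subsets.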

Key Design 2: F-CoT (Flexible Chain-of-Thought) Bridging Planning and Execution

The paper introduces Flexible Chain-of-Thought (F-CoT), a reasoning chain assembled from a selectable combination of intermediate representations. F-CoT:

Serves as VQA supervision signals for Planner training.
Serves as action-aligned guidance signals for the Executor.
Supports text form (Te-Modular) and visual prompt form (Im-Modular).
Lets users select subsets (e.g., subtask + trace, affordance + skill) according to task requirements.
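The subset selection described above can be illustrated with a small text-form builder. This is a sketch under stated assumptions: the key names, the serialized value strings, and the `build_fcot` helper are all hypothetical, and the paper's actual prompt format is not reproduced here.

```python
from typing import Dict, Iterable

# Hypothetical intermediate-representation values for one frame, already serialized as text.
annotations: Dict[str, str] = {
    "subtask": "pick up the red cup",
    "skill": "Pick",
    "object_box": "[120, 80, 200, 160]",
    "affordance": "[140, 95, 170, 130]",
    "trace": "[[150, 100], [155, 110], [160, 125]]",
}


def build_fcot(annotations: Dict[str, str], selected: Iterable[str]) -> str:
    """Join the selected representations into one textual chain, in the order given."""
    parts = []
    for key in selected:
        if key not in annotations:
            raise KeyError(f"no annotation available for '{key}'")
        parts.append(f"{key}: {annotations[key]}")
    return "; ".join(parts)


print(build_fcot(annotations, ["subtask", "trace"]))
print(build_fcot(annotations, ["affordance", "skill"]))
```

The same selected string could serve as a VQA target for a Planner or as conditioning text for an Executor, which is the dual role the list above describes.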

Key Design 3: Three Plan-then-Execute VLA Variants

Variant	Architecture	Intermediate Representation Usage
IC-E2E	VLM as feature extractor for Executor	Implicit (via pretrained weights)
EC-E2E	VLM jointly generates intermediate representations and actions	Explicit (joint optimization of CoT + action)
Modular	Planner and Executor decoupled	Explicit (Planner output → Executor conditioning)

The Executor combines a Qwen2.5-VL backbone, a DiT action head, and an information aggregator that compresses all token hidden states into a conditioning feature of controllable length.
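One common way to compress a variable number of token hidden states into a fixed number of conditioning vectors is attention pooling with a small set of query vectors. The sketch below uses that technique as a stand-in; the summary does not specify how the paper's aggregator is built, so the `aggregate` function, the query vectors, and the toy dimensions are all assumptions.

```python
import math
from typing import List

Vector = List[float]


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate(hidden_states: List[Vector], queries: List[Vector]) -> List[Vector]:
    """Compress N token vectors into len(queries) vectors via scaled dot-product attention."""
    dim = len(hidden_states[0])
    scale = 1.0 / math.sqrt(dim)
    outputs = []
    for q in queries:
        scores = [scale * sum(qi * hi for qi, hi in zip(q, h)) for h in hidden_states]
        weights = softmax(scores)
        pooled = [sum(w * h[d] for w, h in zip(weights, hidden_states)) for d in range(dim)]
        outputs.append(pooled)
    return outputs


# Toy input: 5 token hidden states of width 4, compressed to 2 conditioning vectors.
tokens = [[float(i + d) for d in range(4)] for i in range(5)]
queries = [[0.0, 0.0, 0.0, 0.0], [1.0, 0.0, 0.0, 0.0]]  # would be learned parameters in practice
condition = aggregate(tokens, queries)
print(len(condition), len(condition[0]))  # 2 4
```

The output length is set by the number of queries rather than by the number of input tokens, which is what makes the conditioning length controllable in this kind of design.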

Key Experimental Results

Main Results: VLM Capability Evaluation on Third-Party Benchmarks

Model	Where2Place ↑	RoboRefIt ↑	RoboVQA ↑	RefCOCOg ↑
Qwen2.5-VL-7B	18.9%	75.8%	38.4	87.2%
RoboBrain-2.0-7B	63.6%	8.8%	31.6	62.9%
RoboInter-Qwen-7B	65.8%	85.6%	74.4	88.4%
RoboInter-LLaVAOV-7B	66.3%	89.3%	74.5	87.3%

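Both RoboInter-trained VLMs (RoboInter-Qwen-7B and RoboInter-LLaVAOV-7B) score highest on all three embodied benchmarks (Where2Place, RoboRefIt, RoboVQA), with the widest margin on RoboVQA (about 74 vs. 38.4 and 31.6). They remain on par with the Qwen baseline on RefCOCOg and retain general-purpose capabilities (TextVQA 83.0, MME 2281).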

Open-Loop Executor Evaluation

Method	OLS@0.1	OLS@0.05	OLS@0.01	mOLS
Vanilla	0.6793	0.3608	0.0189	0.3086
IC-E2E	0.6984	0.3810	0.0204	0.3218
EC-E2E	0.7049	0.3930	0.0314	0.3340
Te-Modular	0.7124	0.4133	0.0584	0.3543
Oracle+Executor	0.7511	0.4640	0.0587	0.3861

Te-Modular (textual F-CoT with the modular architecture) achieves the best results among the non-oracle methods (mOLS 0.3543 vs. 0.3340 for EC-E2E), which suggests that decoupling planning from execution lets each component specialize more effectively.
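One way to read the mOLS column is as the share of the Vanilla-to-Oracle gap that each method closes. The short calculation below is an interpretive aid rather than a metric from the paper; the `gap_closed` helper is introduced here for illustration, and the numbers are copied from the table above.

```python
# mOLS values copied from the open-loop table above.
mols = {
    "Vanilla": 0.3086,
    "IC-E2E": 0.3218,
    "EC-E2E": 0.3340,
    "Te-Modular": 0.3543,
    "Oracle+Executor": 0.3861,
}


def gap_closed(method: str) -> float:
    """Fraction of the Vanilla-to-Oracle mOLS gap recovered by the given method."""
    floor = mols["Vanilla"]
    ceiling = mols["Oracle+Executor"]
    return (mols[method] - floor) / (ceiling - floor)


for name in ("IC-E2E", "EC-E2E", "Te-Modular"):
    print(f"{name}: {gap_closed(name):.1%}")
```

Under this reading, Te-Modular recovers roughly 59% of the gap, compared with about 33% for EC-E2E and 17% for IC-E2E.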

Ablation Study: Contribution of Intermediate Representation Types

Intermediate Representation Combination	mOLS
Vanilla (no intermediate representation)	0.3086
+ Subtask	0.3146
+ Subtask + Primitive Skill	0.3159
+ ... + Object Box	0.3289
+ ... + Gripper Box	0.3391
+ ... + Affordance	0.3435
+ ... + Trace	0.3861

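Because the rows are cumulative, each representation's contribution is the difference from the row above it. Coarse-grained representations yield marginal gains (Subtask +0.0060, Primitive Skill +0.0013), box-level spatial signals contribute more (Object Box +0.0130, Gripper Box +0.0102), Affordance adds a further +0.0044, and Trace provides by far the largest gain (+0.0426), consistent with dense temporal information directly constraining action generation.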
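The per-step contribution of each representation can be recovered from the cumulative table by differencing adjacent rows. The snippet below does that with the reported mOLS values. Note that these are increments in the order the representations were added, so a different ordering could redistribute the gains.

```python
# Cumulative ablation rows (representation added, resulting mOLS), copied from the table above.
steps = [
    ("Vanilla", 0.3086),
    ("Subtask", 0.3146),
    ("Primitive Skill", 0.3159),
    ("Object Box", 0.3289),
    ("Gripper Box", 0.3391),
    ("Affordance", 0.3435),
    ("Trace", 0.3861),
]

# Marginal gain of each added representation relative to the previous row.
gains = {
    name: round(score - previous_score, 4)
    for (_, previous_score), (name, score) in zip(steps, steps[1:])
}

ranking = sorted(gains, key=gains.get, reverse=True)
for name in ranking:
    print(f"{name:16s} +{gains[name]:.4f}")
```

Sorted by marginal gain, the order is Trace, Object Box, Gripper Box, Subtask, Affordance, Primitive Skill.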

Real-World Closed-Loop Evaluation

Model	ID Avg. Success	OOD Avg. Success	ID→OOD Drop
OpenVLA	45.0%	23.3%	21.7%
π₀	63.3%	45.0%	18.3%
Vanilla	65.0%	38.3%	26.7%
IC-E2E	77.3%	58.3%	19.0%
EC-E2E	68.3%	60.0%	8.3%

EC-E2E achieves the best OOD success rate (60.0%) with the smallest ID→OOD drop (8.3 percentage points), while IC-E2E leads in-distribution (77.3%). This suggests that explicit intermediate representation reasoning yields more robust generalization.
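The drop column is an absolute difference in percentage points. A relative view (the share of in-distribution success retained out of distribution) can be computed from the same numbers. The helpers below are illustrative and use only the values reported in the table.

```python
# (ID success %, OOD success %) copied from the closed-loop table above.
results = {
    "OpenVLA": (45.0, 23.3),
    "π₀": (63.3, 45.0),
    "Vanilla": (65.0, 38.3),
    "IC-E2E": (77.3, 58.3),
    "EC-E2E": (68.3, 60.0),
}


def absolute_drop(name: str) -> float:
    """ID minus OOD success, in percentage points."""
    in_dist, out_dist = results[name]
    return round(in_dist - out_dist, 1)


def retained_fraction(name: str) -> float:
    """Share of ID success that survives the shift to OOD."""
    in_dist, out_dist = results[name]
    return out_dist / in_dist


most_robust = min(results, key=absolute_drop)
for name in results:
    print(f"{name}: drop {absolute_drop(name):.1f} pp, retains {retained_fraction(name):.1%}")
print("smallest drop:", most_robust)
```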

Highlights & Insights

Strengths

Exceptional systematicity: From annotation tooling to data to benchmarks to model frameworks, the suite provides an all-in-one infrastructure for intermediate representation research.
Substantial scale: 230K episodes + 571 scenes + 61M frames of grounding annotations, far exceeding prior work.
Comprehensive experimentation: Full coverage of open-loop, closed-loop, cross-platform, SimplerEnv, and ablation evaluations.
Deep insights: Systematic validation of the effects of intermediate representation granularity and architectural design choices (modular vs. end-to-end) on performance.

Limitations & Future Work

VQA data is generated from templates and reassembled annotations, limiting diversity to the design space of annotation templates.
Real-world experiments cover only 4 tasks; generalization to more complex, long-horizon tasks remains unexplored.
The modular Planner incurs high inference latency (~2.4s), requiring engineering optimizations such as asynchronous inference for practical deployment.
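To see what asynchronous inference implies for the last point, the toy schedule below simulates an Executor that keeps acting on the most recent finished plan while the Planner computes the next one in the background. Only the ~2.4 s latency comes from the text above; the 10 Hz control rate, the function names, and the re-query policy are assumptions made for illustration.

```python
import math
from typing import List, Optional

PLANNER_LATENCY_S = 2.4  # reported Planner latency (approximate)
CONTROL_HZ = 10.0        # hypothetical control rate, not from the paper


def plan_delay_steps(latency_s: float, control_hz: float) -> int:
    """Control steps that elapse while one plan is being computed."""
    return math.ceil(latency_s * control_hz)


def simulate_async(num_steps: int, latency_s: float, control_hz: float) -> List[Optional[int]]:
    """For each control step, return the step at which the plan in use was requested.

    The Planner is re-queried as soon as its previous result arrives, and the
    Executor conditions on the latest finished plan (None until the first arrives).
    """
    delay = plan_delay_steps(latency_s, control_hz)
    plan_in_use: Optional[int] = None  # request step of the latest finished plan
    pending_request = 0                # request step of the plan still being computed
    used: List[Optional[int]] = []
    for step in range(num_steps):
        if step - pending_request >= delay:
            plan_in_use = pending_request  # the in-flight plan has arrived
            pending_request = step         # immediately request a fresh one
        used.append(plan_in_use)
    return used


used = simulate_async(60, PLANNER_LATENCY_S, CONTROL_HZ)
print(plan_delay_steps(PLANNER_LATENCY_S, CONTROL_HZ))  # 24
print(used[23], used[24], used[47], used[48])           # None 0 0 24
```

In this schedule the Executor never blocks after the first plan arrives, but the plan it follows can be up to 47 control steps old (requested at step 0, still in use at step 47), so an asynchronous design trades latency for staleness.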

Rating

⭐⭐⭐⭐⭐

RoboInter represents a milestone in robotic intermediate representation research. It not only provides the largest and most diverse multi-type intermediate representation dataset to date, but also establishes a complete experimental platform for the plan-then-execute paradigm through its VQA benchmark and VLA framework. The cumulative ablation indicates which representations matter most: Trace yields by far the largest gain, followed by the box-level spatial signals (Object Box, Gripper Box), while Subtask, Affordance, and Primitive Skill add comparatively little. This ordering offers useful guidance for future embodied AI research.