RADAR: Closed-Loop Robotic Data Generation via Semantic Planning and Autonomous Causal Environment Reset
Conference: CVPR 2026 · arXiv: 2603.11811 · Code: None · Area: Reinforcement Learning · Keywords: autonomous data collection, closed-loop robotic manipulation, automatic environment reset, in-context imitation learning, VLM task planning
TL;DR
This paper presents RADAR, a fully autonomous closed-loop robotic manipulation data generation engine comprising four modules: VLM-based semantic planning, GNN policy execution, VQA-based success evaluation, and FSM-orchestrated LIFO causal reverse environment reset. Requiring only 2–5 human demonstrations, the system continuously generates high-fidelity manipulation data and achieves a 90% success rate on complex long-horizon tasks in simulation.
Background & Motivation
The scaling of end-to-end embodied intelligence models (e.g., \(\pi_0\), RDT-1B) is severely constrained by the high cost of acquiring large-scale physical interaction data. Existing approaches face a fundamental dilemma: simulation-based methods (e.g., RoboGen, MimicGen) are scalable but suffer from the sim-to-real gap, while teleoperation methods yield high-quality data but are costly and non-scalable. Recent autonomous data collection approaches (e.g., SOAR) have attempted to use VLMs for task proposal and success detection, but exhibit three critical shortcomings: (1) visual prompting relies on brittle 2D pixel-level heuristics lacking 3D kinematic constraints; (2) execution policies are passive and cannot autonomously orchestrate tasks or verify outcomes; and (3) most critically, they lack the ability to automatically reset the environment, requiring repeated human intervention to restore the scene and breaking the closed loop.
Core Problem
How can a truly human-out-of-the-loop data collection pipeline be constructed, in which the robot autonomously plans tasks, executes manipulation, evaluates outcomes, and automatically restores the environment state after task completion, so that data generation runs continuously and without interruption?
Method
Overall Architecture
RADAR elegantly distributes cognitive load via a "cerebrum–cerebellum" collaborative paradigm: a VLM serves as the "cerebrum" responsible for high-level semantic reasoning (task planning and success evaluation), while a GNN policy acts as the "cerebellum" for sub-millimeter physical control. The system is grounded in an Affordance Library constructed from 2–5 human demonstrations as prior knowledge, and operates in a closed loop through four modules: (1) scene-conditioned task generation → (2) in-context imitation learning execution → (3) VQA-based automatic success evaluation → (4) FSM-orchestrated causal reverse environment reset.
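Pulling the four modules together, below is a minimal runnable sketch of one cycle. Every function body is a hypothetical stub (the paper releases no code), and only the dataflow between the modules follows the description above.

```python
# Minimal sketch of one RADAR cycle. All names and stub behaviors here are
# assumptions for illustration; only the module dataflow follows the paper.

def ground_scene(image):
    """(1) VLM semantic grounding: scene image -> structured object list."""
    return [{"name": "lemon", "shape": "elliptical"}]

def plan(objects, affordance_library):
    """(1) Forward skill chain plus its LIFO causal reverse."""
    forward = ["open box", "put lemon in box"]
    reverse = ["take lemon out of box", "close box"]  # inverse skills, reverse order
    return forward, reverse

def retrieve_demo(skill, affordance_library):
    """(1) Dual-dimensional matching against stored 3D demonstrations."""
    return affordance_library.get(skill, "closest stored demo")

def execute(skill, demo):
    """(2) ICIL graph-diffusion policy rollout, stubbed as a no-op."""
    return {"skill": skill, "demo": demo, "waypoints": []}

def evaluate(image, command):
    """(3) Three-stage VQA success check, stubbed as always-true."""
    return True

def one_cycle(affordance_library, image=None):
    objects = ground_scene(image)
    forward, reverse = plan(objects, affordance_library)
    trajectory = [execute(s, retrieve_demo(s, affordance_library)) for s in forward]
    success = evaluate(image, forward[-1])
    return trajectory, success, reverse  # (4) the reverse plan is handed to the FSM

print(one_cycle({"put lemon in box": "pick-and-place demo"}))
```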
Key Designs
- Scene-Conditioned Task Generation: This proceeds in three steps. First, the VLM performs Semantic Object Grounding, extracting a structured object representation (name + geometric attributes such as "elliptical") from the current scene image as hard constraints for subsequent planning. Second, hierarchical task planning adapts to three modes based on scene complexity: simple scenes directly perform Affordance matching (e.g., mapping "fold towel" to a "close box" demonstration); complex scenes apply Selective Attention to actively mask distractors (e.g., ignoring a strawberry and a Rubik's cube to focus on a lemon); long-horizon tasks perform skill-chain orchestration, simultaneously generating a forward execution sequence and a LIFO-constrained reverse reset sequence. Finally, the most compatible 3D demonstration is retrieved from the Affordance Library as an execution prior via dual-dimensional matching (action similarity + geometric/functional similarity); a retrieval sketch follows this list.
- In-Context Imitation Learning Execution (ICIL): Built upon the Instant Policy framework, imitation learning is formulated as a graph-based diffusion generation problem. A heterogeneous graph is constructed comprising contextual demonstrations, the current point cloud observation, and future actions; executable continuous trajectories are generated via iterative denoising through a graph transformer's reverse diffusion process. This enables generalization to novel objects from a single in-context visual demonstration, without fine-tuning. A key component is VLM-driven semantic object masking to filter distractors from the point cloud (see the masking sketch after this list); ablation experiments show that removing this masking causes success rates to collapse from 80–100% to 0–10%.
- Three-Stage VQA Automatic Success Evaluation: To address the unreliability of direct VLM evaluation of imperative commands, a three-stage pipeline is designed (see the VQA sketch after this list). (a) Semantic task-to-VQA conversion: an LLM transforms an action command (e.g., "place the yellow ball on the blue plate") into a state query (e.g., "is the yellow ball on the blue plate or on the table?"); (b) VLM visual assessment: the post-execution image and the VQA query are fed into a VLM (e.g., GPT-4V) to obtain a textual evaluation; (c) Robust Boolean decoding: a parsing LLM distills the verbose VLM response into a strict binary signal (True/False) to drive the downstream state machine. This three-stage design strictly decouples VLM visual reasoning from deterministic logic.
- FSM-Orchestrated Autonomous Environment Reset: The key innovation is the simultaneous generation of both a forward plan and a LIFO causal reverse reset plan during the task planning phase. The FSM explicitly decouples execution states (A: planning, B: forward execution, C: reverse execution) from data routing actions (D: dual storage, E: single storage), supporting three loop types (see the FSM sketch after this list): (a) Continuous success loop (B→C→B): both forward and reverse succeed, triggering immediate re-execution of the same task and dual storage of both trajectories; (b) Asymmetric recovery loop (B→C→A): forward succeeds but reverse fails, so the unrestored scene is treated as a new initial state for re-planning, and only the valid forward trajectory is saved; (c) Forward abort (B→A): forward failure triggers discard and re-planning. This design lets the system continue operating even when reset fails.
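A hedged sketch of the dual-dimensional retrieval step from the task-generation module above. The string-similarity scorer and the equal 0.5/0.5 weighting are assumptions; the paper only states that action similarity and geometric/functional similarity are both considered.

```python
from difflib import SequenceMatcher

def text_sim(a: str, b: str) -> float:
    """Cheap stand-in for a VLM similarity judgment between two phrases."""
    return SequenceMatcher(None, a, b).ratio()

def retrieve(task: str, geometry: str, library: list[dict]) -> dict:
    """Return the demonstration maximizing the combined similarity score."""
    def score(demo: dict) -> float:
        action_sim = text_sim(task, demo["action"])      # action dimension
        geom_sim = text_sim(geometry, demo["geometry"])  # geometric/functional dimension
        return 0.5 * action_sim + 0.5 * geom_sim         # assumed equal weighting
    return max(library, key=score)

library = [
    {"action": "close box", "geometry": "planar hinged flap"},
    {"action": "stack block", "geometry": "rigid cuboid"},
]
# "fold towel" has no exact match; the shared planar geometry steers retrieval
# toward the "close box" demonstration, as in the paper's example.
print(retrieve("fold towel", "planar deformable sheet", library))
```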
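A hedged sketch of the semantic point-cloud masking used inside ICIL. Labels are stubbed with a fixed array; in the system they would come from VLM grounding projected onto the cloud (that projection is assumed here, not described).

```python
import numpy as np

def mask_point_cloud(points: np.ndarray, labels: np.ndarray,
                     targets: set[str]) -> np.ndarray:
    """Keep only points whose semantic label matches a planner target."""
    keep = np.isin(labels, list(targets))
    return points[keep]

points = np.random.rand(6, 3)  # toy cloud: 6 points, xyz
labels = np.array(["lemon", "lemon", "strawberry",
                   "rubiks_cube", "table", "lemon"])
masked = mask_point_cloud(points, labels, targets={"lemon"})
print(masked.shape)  # (3, 3): strawberry / Rubik's cube distractor points removed
```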
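A hedged sketch of the three-stage VQA evaluator. `llm` and `vlm` are hypothetical callables (e.g., thin wrappers around an LLM API and GPT-4V); only the stage structure follows the paper, and the prompts are illustrative.

```python
# Three-stage evaluation: command -> state query -> visual answer -> boolean.
def evaluate_success(command: str, post_image, llm, vlm) -> bool:
    # (a) Semantic task-to-VQA conversion by an LLM.
    query = llm(f"Rewrite this command as a yes/no question about the "
                f"resulting scene state: {command}")
    # (b) VLM visual assessment against the post-execution image (free text).
    answer = vlm(image=post_image, question=query)
    # (c) Robust Boolean decoding by a parsing LLM.
    verdict = llm(f"Answer strictly 'True' or 'False'. Does this response "
                  f"indicate the task succeeded? Response: {answer}")
    return verdict.strip() == "True"

# Toy stand-ins so the sketch runs end-to-end:
fake_llm = lambda p: "True" if "succeeded" in p else "Is the ball on the plate?"
fake_vlm = lambda image, question: "Yes, the yellow ball rests on the blue plate."
print(evaluate_success("place the yellow ball on the blue plate",
                       None, fake_llm, fake_vlm))
```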
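A hedged sketch of the reset FSM using the paper's state labels; the transition probabilities are made-up simulation inputs used to illustrate the three loop types and the compound success rate, not reported numbers.

```python
import random
from enum import Enum, auto

class State(Enum):
    A = auto()  # planning
    B = auto()  # forward execution
    C = auto()  # reverse execution (reset)

def step(state: State, p_fwd: float = 0.8, p_rev: float = 0.9):
    """One FSM transition; returns (next state, data-routing action or None)."""
    if state is State.A:
        return State.B, None                 # new plan -> execute forward
    if state is State.B:
        if random.random() < p_fwd:
            return State.C, None             # forward ok -> attempt reset
        return State.A, "discard"            # (c) forward abort: B -> A
    # state is State.C
    if random.random() < p_rev:
        return State.B, "store_both"         # (a) continuous loop B -> C -> B, action D
    return State.A, "store_forward"          # (b) asymmetric recovery B -> C -> A, action E

random.seed(0)
state, counts = State.A, {"store_both": 0, "store_forward": 0, "discard": 0}
for _ in range(10_000):
    state, action = step(state)
    if action:
        counts[action] += 1
print(counts)  # store_both dominates when p_fwd * p_rev is high (~0.72 here)
```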
Loss & Training
- The ICIL policy employs the standard denoising training objective of graph diffusion models.
- The overall pipeline requires no end-to-end training — the VLM (GPT-4V/CogVLM) and GNN policy (Instant Policy) are both used as pretrained models.
- Experiments adopt 1-shot demonstrations as context (additional demonstrations yield diminishing returns).
- Skill retrieval uses a VLM rather than CLIP, as CLIP embeddings are biased toward nouns and fail to distinguish fine-grained action semantics.
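For reference, the "standard denoising training objective" presumably refers to epsilon-prediction as in DDPM; the conditioning on the heterogeneous graph \(G\) is my notation, not the paper's:

\[
\mathcal{L}_{\text{ICIL}} = \mathbb{E}_{t,\, a_0,\, \epsilon \sim \mathcal{N}(0, I)} \Big[ \big\| \epsilon - \epsilon_\theta(a_t, t, G) \big\|_2^2 \Big], \qquad a_t = \sqrt{\bar{\alpha}_t}\, a_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon,
\]

where \(a_0\) is a ground-truth action sequence from the demonstrations and \(G\) encodes the in-context demonstration and the current point-cloud observation.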
Key Experimental Results
| Task | Metric | RADAR (Ours) | ReKep | MOKA |
|---|---|---|---|---|
| RLBench - Large Container (Cup) | Success Rate | 0.80 | 0.20 | 0.20 |
| RLBench - Push Block | Success Rate | 1.00 | 0.40 | 0.40 |
| RLBench - Stack Block | Success Rate | 0.80 | 0.40 | 0.10 |
| RLBench - Close Box | Success Rate | 1.00 | 0.40 | 0.30 |
| RLBench - Put Laptop & Cup into Tray | Success Rate | 0.80 | 0.10 | 0.00 |
| RLBench - Push & Stack Blocks | Success Rate | 0.40 | 0.00 | 0.00 |
| RLBench - Close then Open Box | Success Rate | 0.90 | 0.20 | 0.10 |
Ablation Study
- Semantic point cloud masking is critical: removing the VLM-driven selective attention mask causes Large Container (Cup) to drop from 0.80→0.10 and Push Block from 1.00→0.00 — distractor objects lead to catastrophic failure of the execution policy.
- VLM-based skill retrieval outperforms CLIP-based retrieval — CLIP lacks sufficient discriminative power for action semantics.
- Long-horizon tasks are nearly fatal for baseline methods (ReKep and MOKA drop to 0–10%), while RADAR maintains 40–90%.
Highlights & Insights
- The "cerebrum–cerebellum" collaborative system design is highly elegant — the VLM handles semantic reasoning while the GNN manages physical precision, with clear separation of concerns.
- Simultaneously generating forward and LIFO reverse plans is the core insight — environment reset is elegantly formulated as a reverse task planning problem.
- The asymmetric recovery mechanism in the FSM is pragmatic — reset failure does not block the pipeline, as the unrestored scene becomes a new starting point.
- The three-stage VQA evaluation is substantially more robust than single-stage VLM judgment — visual reasoning and Boolean logic are strictly decoupled.
- Generalization to new tasks requires only 2–5 human demonstrations with 1-shot in-context learning, yielding exceptional data efficiency.
- Real-world deployment on deformable object manipulation (towel folding, tube insertion) validates practical feasibility.
Limitations & Future Work
- The cumulative failure rate of environment reset is a fundamental bottleneck: a full cycle completes only with probability \(p_{\text{total}} \approx p_{\text{forward}} \times p_{\text{reverse}}\) (e.g., \(0.9 \times 0.9 \approx 0.81\)), so compound error rates grow quickly in complex scenes.
- The current FSM is proof-of-concept level; robust reset in highly unstructured environments remains an open problem.
- Real-world evaluation is limited to qualitative validation (towel folding, grasping), lacking large-scale quantitative experiments.
- Reliance on commercial VLMs such as GPT-4V introduces cost and latency concerns that may challenge large-scale deployment.
- The effectiveness of generated data for training downstream policies is not evaluated — final validation of data quality is absent.
- Simulation experiments use ground-truth environment reset (to isolate forward capability), obscuring the quantitative evaluation of the complete closed loop.
Related Work & Insights
- SOAR: Also employs VLMs for autonomous data collection, but uses the SuSIE image-editing diffusion model to generate visual subgoals, which is prone to geometric hallucinations (e.g., floating objects) and lacks environment reset capability. RADAR replaces pixel-level generation with 3D demonstration priors, sidestepping this class of hallucination.
- MOKA: Uses 2D mark-based visual prompting for grasp reasoning, but 2D pixel space lacks kinematic constraints. RADAR provides 3D priors through the Affordance Library, yielding greater reliability in tasks requiring precise contact (e.g., tight-fit insertion).
- Instant Policy: RADAR directly adopts its graph-diffusion ICIL architecture for low-level execution. The distinction is that Instant Policy is a passive execution engine, whereas RADAR embeds it within a complete cognitive closed loop.
Further Implications
- The idea of simultaneously planning forward actions and reverse resets generalizes to industrial automation — any production line task requiring cyclic execution faces the environment reset problem.
- The "cerebrum–cerebellum" division of labor offers a reference architecture for general-purpose robotic systems — VLMs should handle planning and verification rather than directly outputting control signals, with precise control delegated to specialized policies.
- The three-stage VQA evaluation pattern (command → query → assessment → decoding) is applicable to other scenarios requiring reliable VLM-based judgment.
Rating
- Novelty: ⭐⭐⭐⭐ LIFO causal reverse reset and the FSM asymmetric recovery mechanism are core innovations; the overall system design is elegant.
- Experimental Thoroughness: ⭐⭐⭐ Simulation experiments are thorough, but real-world evaluation is qualitative only, and closed-loop evaluation of downstream policy training on generated data is absent.
- Writing Quality: ⭐⭐⭐⭐ System description is clear and the FSM state transition diagram is intuitive, though some phrasing leans toward marketing rhetoric.
- Value: ⭐⭐⭐⭐ Identifies a critical bottleneck in autonomous data collection (environment reset) and proposes a viable solution; high directional value.