Action-Sketcher: From Reasoning to Action via Visual Sketches for Robotic Manipulation
Conference: CVPR 2026
Paper: CVF Open Access
Code: Project Page (Code status to be confirmed)
Area: Robotics / Embodied AI
Keywords: VLA, Visual Sketch, Long-horizon Manipulation, Human-in-the-loop, See-Think-Sketch-Act
TL;DR
This paper proposes Action-Sketcher, which enables VLA models to operate in a "See-Think-Sketch-Act" loop. Before generating actions, the model draws its spatial intent as a Visual Sketch (composed of points, boxes, and arrows), which serves as a human-readable and editable intermediate representation. It outperforms strong baselines such as π0.5 and OpenVLA-OFT on long-horizon, cluttered, and referentially ambiguous manipulation tasks, with the largest margins on RoboTwin 2.0 and real-world tasks. Furthermore, sketches allow for direct human intervention to further improve success rates.
Background & Motivation
Background: Current mainstream Vision-Language-Action (VLA) models directly map multi-view observations and language instructions to actions. End-to-end policies (OpenVLA, Octo, Diffusion Policy) perform well on short-horizon tasks. Hierarchical VLAs attempt to improve long-horizon behavior using a "planner + controller" setup. Recently, a "think-before-act" paradigm has emerged (e.g., EO-1, OneTwoVLA, ThinkAct), inserting explicit reasoning before action execution.
Limitations of Prior Work: These methods share a common issue: "intent is hidden." ① End-to-end policies compress planning intent into latent representations, which makes task decomposition difficult and leaves actions without a causal explanation. ② Reasoning in hierarchical VLAs is often instantaneous and local, lacking continuous modeling of global intent (changing human goals, accumulating errors, historical states). ③ Even in "think-before-act" models, if the intermediate evidence is only text, spatial references (where to contact, how to approach, relationships between objects) remain implicit: humans cannot verify them, and the controller receives no low-entropy geometric guidance.
Key Challenge: Long-horizon manipulation is difficult for two reasons. Spatially, natural language instructions are inherently ambiguous (which cup is "the cup" when multiple are present?) or under-determined (what exact pose is "left of the cup"?); text-only cues cannot translate linguistic relations into executable constraints. Temporally, human-in-the-loop collaboration is weak, and interpretable planning outputs are rarely exposed, allowing small errors to propagate and accumulate into failures.
Goal: To create an intermediate interface that makes "intent both visible and actionable," resolving spatial reference disambiguation (where/how to act) and temporal error correction (early detection and recovery).
Key Insight: The authors advocate for externalizing spatial intent onto the "language → control" interface—not leaving it in text or latent vectors, but drawing it directly on the robot's current view. A sketch that humans can understand, approve, and modify is essentially a "verifiable contract" between high-level reasoning and low-level control.
Core Idea: Use a Visual Sketch consisting of points/boxes/arrows instead of pure text intermediate representations to anchor spatial intent, and adaptively interlace reasoning and action using a token-gated "See-Think-Sketch-Act" loop.
Method
Overall Architecture
Action-Sketcher models long-horizon manipulation as a sequence modeling problem over a hybrid output space (discrete tokens + continuous actions). It learns a policy \(\pi_\theta\) where the agent autonomously decides whether to "reason" or "act" at each step. The input context is a sequence of tokens: multi-view images (left wrist, right wrist, base camera), task instructions, history of completed sub-tasks, the current sub-task, and the visual sketch image. Using π0 as a backbone, the model auto-regressively generates text within a single model (reasoning chains, sub-task plans, structural descriptions of sketches) and predicts continuous action chunks using flow-matching.
The system runs a dual-mode event-driven loop gated by special tokens:
- Reasoning Mode: When the model decides thinking is needed (completing a sub-task, encountering an error, or receiving human intervention), it generates <BOR> (begin-of-reasoning). It then performs temporal reasoning (inferring the next sub-task from instructions and history) and spatial reasoning (analyzing object layouts and relationships to produce text-based points/boxes/arrows), ending with <EOR>. The text sketch is then rendered onto the current reference view as a visual sketch image and written back into the context.
- Action Mode: When the model deems further reasoning unnecessary (e.g., the scene is consistent and the sub-task is executing normally), it generates <BOA> (begin-of-action), triggering the action expert to generate action chunks via flow-matching.
Initially, history, current sub-tasks, and sketch images are empty, forcing the model to start in Reasoning Mode to fill these fields before fluently switching between modes.
%%{init: {'flowchart': {'rankSpacing': 24, 'nodeSpacing': 28, 'padding': 6, 'wrappingWidth': 400}}}%%
flowchart TD
A["Multi-view Observation<br/>+ Task Instruction"] --> B["Visual Sketch<br/>Explicit spatial intent via points/boxes/arrows"]
B --> C["See-Think-Sketch-Act Loop<br/>Token-gated dual mode"]
C -->|"<BOR> Reasoning needed"| D["Reasoning Mode<br/>Temporal → Spatial → Sketching"]
C -->|"<BOA> Normal execution"| E["Action Mode<br/>Flow-matching action chunks"]
D --> B
E --> F["Action chunks → Execution"]
F -->|"Sub-task boundary/Error/Human intervention"| C
G["Multi-stage Curriculum Training<br/>Spatiotemporal pre-train → Language-Sketch alignment → Sketch-Action adaptation"] -.Training.-> C
Key Designs
1. Visual Sketch: A Verifiable Contract via Points, Boxes, and Arrows
This design addresses the failure of text-only intermediate representations to express spatial intent. A visual sketch at step \(t\) is a tuple of sparse geometric primitives \(S_t = (B_t, P_t, A_t)\), all defined on the robot’s ego-view image plane:
- Boxes \(B_t\): Object-level affordance cues. Each box \(b_i=(x_{1,i},y_{1,i},x_{2,i},y_{2,i})\) uses pixel coordinates to identify manipulable regions. It disambiguates object references in cluttered scenes—"pick the one closest to the cup" can be locked to a target (e.g., an apple) with a box, stripping away appearance details while retaining scale and position.
- Points \(P_t\): Keypoints \(p_i=(x_i,y_i)\) specify precise interaction/reference locations, representing part-level affordances, motion landmarks, or geometric references. For "pouring tea," this includes the teapot spout \(p_{spout}\), cup center \(p_{cup}\), and stable handle contact \(p_{handle}\).
- Arrows \(A_t\): Dynamic links connecting static keypoints to actual actions. The authors factorize complex SE(3) operations into 2D plane projections of translation trajectories and rotation cues, \(A_t = A^{trans}_t \cup A^{rot}_t\). Translation arrows \(a^{trans}_i=(p^{start}_i, p^{end}_i)\) are ordered sequences anchored on keypoints. Rotation arrows \(a^{rot}_i=(p_i, \text{axis}\in\{x,y,z\}, \text{dir}\in\{\circlearrowright,\circlearrowleft\})\) specify rotation around canonical axes (e.g., tilting the teapot around the ego-view x-axis).
Mechanism: Sketches are continuous, human-editable, and generated one sub-task at a time. By not encoding the entire trajectory at once, they express richer and more precise action primitives (contact points, rotation arrows) than coarse trajectories, providing low-entropy geometric guidance to the controller and an entry point for human verification.
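To make the tuple \(S_t = (B_t, P_t, A_t)\) concrete, the following is a minimal Python sketch of how the three primitive types could be held in memory and serialized to text before rendering. The class names, field names, and the text format produced by `to_text` are all illustrative assumptions; the summary above does not specify the paper's actual serialization. The "pouring tea" coordinates are invented pixel values.

```python
from dataclasses import dataclass, field
from typing import List, Literal, Tuple

Point = Tuple[float, float]  # (x, y) in ego-view pixel coordinates


@dataclass
class Box:
    """Object-level cue b = (x1, y1, x2, y2) in pixels."""
    x1: float
    y1: float
    x2: float
    y2: float


@dataclass
class TransArrow:
    """Translation arrow anchored on two keypoints."""
    start: Point
    end: Point


@dataclass
class RotArrow:
    """Rotation cue: anchor point, canonical axis, and direction."""
    anchor: Point
    axis: Literal["x", "y", "z"]
    direction: Literal["cw", "ccw"]


@dataclass
class VisualSketch:
    boxes: List[Box] = field(default_factory=list)
    points: List[Point] = field(default_factory=list)
    trans_arrows: List[TransArrow] = field(default_factory=list)
    rot_arrows: List[RotArrow] = field(default_factory=list)

    def to_text(self) -> str:
        """Serialize to plain text (hypothetical format, one primitive per line)."""
        lines = [f"box {b.x1:.0f} {b.y1:.0f} {b.x2:.0f} {b.y2:.0f}" for b in self.boxes]
        lines += [f"point {x:.0f} {y:.0f}" for x, y in self.points]
        lines += [
            f"arrow_trans {a.start[0]:.0f} {a.start[1]:.0f} -> {a.end[0]:.0f} {a.end[1]:.0f}"
            for a in self.trans_arrows
        ]
        lines += [
            f"arrow_rot {a.anchor[0]:.0f} {a.anchor[1]:.0f} axis={a.axis} dir={a.direction}"
            for a in self.rot_arrows
        ]
        return "\n".join(lines)


# Invented pixel coordinates for the "pouring tea" example.
p_handle, p_spout, p_cup = (120.0, 210.0), (180.0, 190.0), (320.0, 260.0)
pour_tea = VisualSketch(
    boxes=[Box(100, 150, 200, 260)],           # teapot region
    points=[p_handle, p_spout, p_cup],
    trans_arrows=[TransArrow(p_spout, p_cup)],  # move the spout toward the cup
    rot_arrows=[RotArrow(p_spout, "x", "cw")],  # tilt around the ego-view x-axis
)
print(pour_tea.to_text())
```

Because every primitive is a handful of numbers, a human edit (for example, dragging `p_cup` to a different cup) amounts to changing one tuple before the sketch is rendered again.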
2. See-Think-Sketch-Act Loop: Token-Gated Adaptive Switching
To balance "deliberate reasoning vs. real-time execution," the model uses the <BOR>/<BOA> gating tokens instead of a fixed reasoning frequency. The model itself decides when to switch modes based on observations, predicted risks, or user feedback. <BOR> triggers full spatiotemporal reasoning and a sketch refresh when global re-planning is needed; <BOA> allows the action expert to output action chunks via flow-matching during routine execution without redundant reasoning.
Mechanism: This event-driven design ensures "slow thinking" occurs only when necessary (sub-task boundaries, scene changes, errors, human intervention), maintaining low-latency action prediction otherwise. Rendering the sketch back into the view and feeding it into the context closes the reasoning-action loop: if a human or the model detects an incorrect sketch, it can be intercepted and corrected before execution.
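The event-driven switching can be illustrated with a toy control loop. In this sketch the learned gate is replaced by a hand-written rule, the sub-task plan is a fixed list, and the sketch is a placeholder string; all of these are stand-ins chosen for illustration, not the paper's model. Only the overall control flow (empty context forces reasoning first, events trigger `<BOR>`, routine steps emit `<BOA>`) follows the description above.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Sequence

BOR, BOA = "<BOR>", "<BOA>"


@dataclass
class Context:
    instruction: str
    history: List[str] = field(default_factory=list)  # completed sub-tasks
    subtask: Optional[str] = None                      # current sub-task
    sketch: Optional[str] = None                       # stands in for the sketch image


def gate(ctx: Context, event: Optional[str]) -> str:
    """Rule-based stand-in for the learned gate (an assumption, not the paper's policy)."""
    if ctx.subtask is None or ctx.sketch is None or event is not None:
        return BOR
    return BOA


def reason(ctx: Context, plan: Sequence[str], event: Optional[str]) -> None:
    """Reasoning Mode stub: pick the sub-task (temporal), then redraw the sketch (spatial)."""
    if event == "subtask_done" and ctx.subtask is not None:
        ctx.history.append(ctx.subtask)
    idx = len(ctx.history)  # index of the next unfinished sub-task
    if idx < len(plan):
        ctx.subtask = plan[idx]
        ctx.sketch = f"sketch for '{plan[idx]}'"
    else:
        ctx.subtask, ctx.sketch = None, None  # nothing left to do


def run(ctx: Context, plan: Sequence[str], events: Sequence[Optional[str]]) -> List[str]:
    """Return the sequence of gating tokens emitted over the episode."""
    trace = []
    for event in events:
        mode = gate(ctx, event)
        trace.append(mode)
        if mode == BOR:
            reason(ctx, plan, event)
            if ctx.subtask is None:
                break
        # In Action Mode the action expert would emit an action chunk here.
    return trace


ctx = Context(instruction="tidy the table")
plan = ["pick up the cup", "place the cup on the tray"]
events = [None, None, "subtask_done", None, "error", None, "subtask_done"]
trace = run(ctx, plan, events)
print(trace)
```

Note that the `"error"` event re-enters Reasoning Mode without advancing the sub-task, so the sketch is redrawn for the same sub-task; this mirrors the idea that an incorrect sketch can be intercepted and corrected before execution continues.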
3. Multi-stage Curriculum Training + Mode Balanced Sampling + Sketch Perturbation Enhancement
Training a single model to master spatiotemporal reasoning, precise language-to-sketch binding, robust sketch-to-action mapping, and mode switching is difficult. The authors use a three-stage curriculum:
- Stage 1: Basic Spatiotemporal Learning: Uses 3.4M spatial samples (visual grounding for boxes, pointing for keypoints, scene understanding, VQA) and 870k temporal sequences (EgoPlan, ShareRobot, AgiBot-World). GPT-4o is used to label 20% with text reasoning.
- Stage 2: Reasoning-to-Sketch Enhancement: On 21k samples (including 2.6k real-world 2–16 sub-task episodes), the model is trained to complete the "temporal reasoning → spatial reasoning → text sketch generation" pipeline.
- Stage 3: Sketch-to-Action + Mode Adaptation: Jointly trains the action policy and mode switching. It teaches the model when to output <BOR> or <BOA> while training the action expert on labeled action data.
Two further techniques reinforce training. Sketch Perturbation Enhancement simulates reasoning errors (randomly perturbing boxes while keeping IoU≥0.8, resampling points within radius \(c\)) to make the action expert robust to imperfect sketches. Mode Balanced Sampling prevents the model from biasing toward the more frequent <BOA> steps by sampling the reasoning set \(D_R\) and the action set \(D_A\) equally.
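Both techniques are simple enough to sketch in a few lines of Python. The version below is an illustration under stated assumptions: boxes are jittered uniformly and rejected until the IoU constraint holds, points are resampled uniformly inside a disc of radius `c`, and "sampling equally" is read as drawing half of each batch from \(D_R\) and half from \(D_A\). The jitter distribution, the `max_shift` parameter, and the half-and-half batch rule are assumptions rather than details taken from the paper.

```python
import math
import random
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)
Point = Tuple[float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def perturb_box(box: Box, rng: random.Random, min_iou: float = 0.8,
                max_shift: float = 0.1, max_tries: int = 100) -> Box:
    """Rejection-sample a jittered box whose IoU with the original is >= min_iou."""
    w, h = box[2] - box[0], box[3] - box[1]
    for _ in range(max_tries):
        cand = (
            box[0] + rng.uniform(-max_shift, max_shift) * w,
            box[1] + rng.uniform(-max_shift, max_shift) * h,
            box[2] + rng.uniform(-max_shift, max_shift) * w,
            box[3] + rng.uniform(-max_shift, max_shift) * h,
        )
        if iou(box, cand) >= min_iou:
            return cand
    return box  # fall back to the unperturbed box


def perturb_point(p: Point, rng: random.Random, c: float) -> Point:
    """Resample a point uniformly inside a disc of radius c around p."""
    r = c * math.sqrt(rng.random())  # sqrt gives uniform density over the disc area
    theta = rng.uniform(0.0, 2.0 * math.pi)
    return (p[0] + r * math.cos(theta), p[1] + r * math.sin(theta))


def balanced_batch(d_r: Sequence, d_a: Sequence, batch_size: int,
                   rng: random.Random) -> List:
    """Draw half of the batch from D_R and the rest from D_A, then shuffle."""
    half = batch_size // 2
    batch = [rng.choice(d_r) for _ in range(half)]
    batch += [rng.choice(d_a) for _ in range(batch_size - half)]
    rng.shuffle(batch)
    return batch
```

With, say, 2 reasoning samples and 10 action samples available, `balanced_batch(d_r, d_a, 8, rng)` still returns 4 of each, which is the point of the balancing: the rarer reasoning steps are not drowned out by routine action steps.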
Key Experimental Results
Main Results
On the LIBERO benchmark, Action-Sketcher's average success rate is comparable to the strongest baselines (96.9% vs. 97.1% for OpenVLA-OFT), and it leads on the Long subset (96.0% vs. 94.5% for OpenVLA-OFT and 92.4% for π0.5):
| Dataset / Subset | Metric | Ours | π0.5 | OpenVLA-OFT |
|---|---|---|---|---|
| LIBERO-Long | Success % | 96.0 | 92.4 | 94.5 |
| LIBERO-Object | Success % | 99.6 | 98.2 | 98.4 |
| LIBERO-Avg | Success % | 96.9 | 96.8 | 97.1 |
The gap widens on harder long-horizon/spatial tasks (RoboTwin 2.0 simulation + real-world dual-arm platform):
| Task | π0 | π0.5 | OpenVLA-OFT | Ours |
|---|---|---|---|---|
| Stack Blocks (Sim) | 4.0 | 7.0 | 12.4 | 34.5 |
| Place A2B Left (Sim) | 12.0 | 11.0 | 21.0 | 43.0 |
| Tidy Table (Real) | 23.0 | 31.2 | 36.0 | 52.0 |
| Pick & Place (Real) | 30.0 | 34.5 | 52.5 | 67.0 |
| Pour Tea (Real) | 16.0 | 20.0 | 15.0 | 27.6 |
Human-in-the-Loop (RQ2)
Failure analysis shows that 66% of failures originate in Reasoning Mode, and most of those (61% of all failures) occur in spatial reasoning, i.e., in sketch generation itself. Because the sketch is an explicit interface, however, it is also a natural entry point for human intervention. Allowing humans to pause execution and slightly modify sketches raises success on the real-world tasks by 16.4 to 23.0 percentage points:
| Real-world Task | Original % | + Human-in-the-loop % | Gain |
|---|---|---|---|
| Tidy Table | 52.0 | 75.0 | +23.0 |
| Pour Tea | 27.6 | 44.0 | +16.4 |
| Pick & Place | 67.0 | 85.5 | +18.5 |
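The gains in this table can be checked with a few lines of arithmetic. The snippet below recomputes the absolute gain (in percentage points) and the ratio of the human-in-the-loop success rate to the original one, using only the numbers from the table; the function and variable names are illustrative.

```python
# (original %, + human-in-the-loop %) copied from the table above
results = {
    "Tidy Table": (52.0, 75.0),
    "Pour Tea": (27.6, 44.0),
    "Pick & Place": (67.0, 85.5),
}


def gains(before: float, after: float):
    """Return (absolute gain in points, ratio after/before), rounded for display."""
    return round(after - before, 1), round(after / before, 2)


summary = {task: gains(*vals) for task, vals in results.items()}
for task, (abs_gain, ratio) in summary.items():
    print(f"{task}: +{abs_gain} points, {ratio}x the original success rate")
```

The absolute gains match the "Gain" column, and the ratios fall between roughly 1.28x (Pick & Place) and 1.59x (Pour Tea), so the relative benefit is largest on the task with the lowest starting success rate.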
Ablation Study
| Configuration | Success % (Stack Blocks) | Progress % (Tidy Table) | Note |
|---|---|---|---|
| Action-Sketcher (Full) | 34.5 | 52.0 | Full model |
| w/o Spatial Reasoning | 13.8 | 23.9 | Drops to under half of the full model |
| w/o Visual Sketch | 9.8 | 15.0 | Lowest performance |
| w/o Boxes | 31.2 | 49.0 | Disambiguation capability drops |
| w/o Keypoints | 26.6 | 43.6 | Largest drop among the three primitives; coordinate grounding is key |
| w/o Arrows | 29.9 | 48.2 | Dynamic guidance impaired |
| w/o Stage 1 | 29.2 | 39.7 | Skip spatiotemporal pre-training |
| w/o Stage 2 | 18.1 | 21.9 | No reasoning finetuning, inconsistent sketches |
| w/o Stage 3 | 0.0 | 0.0 | Complete failure without adaptation |
Key Findings
- Visual sketches are essential: Removing them causes the Stack Blocks success rate to plummet from 34.5% to 9.8%, indicating that they are a fundamental bridge for grounding language into executable actions, not just a visualization aid.
- Keypoints are the most critical primitive: Removing them results in the largest drop among the three primitives (from 34.5% to 26.6% on Stack Blocks), as they provide precise coordinate grounding.
- Stage 3 is indispensable: Removing it leads to 0% success, identifying the sketch-to-action adaptation phase as the bottleneck for deployment.
- Errors concentrate in sketch generation: 61% of failures stem from inaccurate sketches during spatial reasoning. This disadvantage is mitigated by the explicit interface: human corrections raise real-world success by 16.4 to 23.0 percentage points, reaching 85.5% on Pick & Place.
Highlights & Insights
- The "verifiable contract" concept is clever: Moving intermediate representations from latent vectors to visible points/boxes/arrows provides three benefits: disambiguation, supervisability (easy to label), and debuggability.
- Factorization of SE(3) into 2D translation + rotation arrows: Projecting 6-DOF motion onto 2D planes allows the sketch to remain in the image space while retaining complex semantics.
- Token-gated adaptive switching: Using <BOR>/<BOA> allows the model to learn when to think slowly versus act fast, treating reasoning frequency as a learnable behavior rather than a hyperparameter.
- One sub-task at a time: Generating local sketches instead of full trajectories allows for more precise primitives (contact points, rotation), proving "local over global" can be beneficial.
Limitations & Future Work
- Sketch generation (spatial reasoning) is the primary bottleneck. Current autonomous grounding accuracy is insufficient, often requiring human-in-the-loop fallback.
- Absolute success rates on difficult long-horizon tasks (e.g., Pour Tea at 27.6%) are still low, indicating a gap before practical utility.
- Sketch perturbation uses heuristic values (IoU≥0.8); the distribution of real reasoning errors may not be fully covered.
- Ego-view 2D projection handles occlusion poorly and has limited expressive power for 3D depth-wise fine movements.
- Future Work: Implementing active assistance (model requests human help when sketch confidence is low) or multi-view sketch fusion.
Related Work & Insights
- vs. Text-based think-before-act: Action-Sketcher externalizes spatial intent, which text-based methods (EO-1, ThinkAct) keep implicit or compressed.
- vs. Visual prompts/trajectories (RT-Trajectory, RT-Sketch): These typically either use static prompts or compress trajectories into non-editable forms; Action-Sketcher generates dynamic, editable, sub-task-specific primitives.
- vs. Hierarchical VLA: Traditional planners lack continuous global intent modeling; this model maintains a verifiable intent loop through the See-Think-Sketch-Act process.
Rating
- Novelty: ⭐⭐⭐⭐⭐ An intuitive and persuasive interface for externalizing spatial intent.
- Experimental Thoroughness: ⭐⭐⭐⭐ Strong ablation and human-in-the-loop studies, though absolute real-world success rates remain modest.
- Writing Quality: ⭐⭐⭐⭐ Logical flow with clear definitions of primitives.
- Value: ⭐⭐⭐⭐⭐ High utility for long-horizon tasks and human-robot collaboration.