
Spatially Guided Training for Vision-Language-Action Model

Conference: ICLR 2026
OpenReview: https://openreview.net/forum?id=eKhOrQWAVJ
Code: https://internrobotics.github.io/internvla-m1.github.io
Area: Robotics / VLA Training
Keywords: Vision-Language-Action models, spatial grounding, robotic manipulation, spatial prompting, dual-system strategy

TL;DR

ST4VLA significantly mitigates the issues of "seeing but not moving" or "forgetting how to see after learning to move" in VLA training by first teaching the VLM spatial priors such as points, boxes, and trajectories, and then injecting these priors as implicit planning conditions into a DiT action expert via spatial prompts during the action post-training phase. It achieves stronger generalization in SimplerEnv, LIBERO, large-scale simulated pick-and-place, and real-world long-horizon robotic tasks.

Background & Motivation

Background: Current general-purpose robotic policies roughly follow two paths. One is hierarchical robotic systems: they first use VLMs, detectors, segmenters, or 3D scene graphs for task decomposition and spatial localization, then pass intermediate results to low-level controllers. The other is data-driven VLA: images, language, and robot trajectories are trained end-to-end within the same model to predict actions directly from instructions.

Limitations of Prior Work: Hierarchical systems benefit from clear spatial structures (e.g., knowing which object to grasp and where to place it), but they often rely on manual rules, hand-written planners, or fixed task templates, making them costly to scale to complex tabletop scenes and long-horizon tasks. End-to-end VLAs are easier to scale but tend to "wash out" useful spatial reasoning capabilities inherent in pre-trained VLMs. Supervision from action data primarily comes from low-level control trajectories while text instructions are sparse, causing the model to sacrifice target localization, affordance understanding, and trajectory reasoning to fit action patterns.

Key Challenge: Robot control requires both continuous actions and highly transferable, discrete spatial priors. VLM pre-training learns vast vision-language knowledge, but standard VLA fine-tuning exposes this knowledge directly to action loss, leading to spatial grounding degradation. Simply mixing grounding and action data for joint training causes gradient conflicts between the two objectives, resulting in unstable perception and action.

Goal: The goal is to address the VLA training paradigm rather than a single controller architecture: how to preserve the VLM's spatial capabilities while learning robot actions, how to optimize spatial grounding and action policy objectives in the same direction, and how to make these spatial priors truly serve real-world manipulation and long-horizon tasks.

Key Insight: In robotic tasks, "where to act" and "how to act" should not be completely coupled. Spatial information such as points, boxes, trajectories, and object relationships is general knowledge shared across tasks and embodiments. In contrast, joint increments, end-effector trajectories, and gripper states are embodiment-specific control knowledge. Decoupling the two and connecting them via lightweight conditions during action training is more reasonable than forcing a single model to carry all objectives simultaneously.

Core Idea: Replace standard VLA fine-tuning with "spatial grounding pre-training + spatial prompt-guided action post-training." This allows a VLM Planner to continuously generate transferable spatial implicit plans, which a DiT Actor then converts into specific robot actions.

Method

Overall Architecture

ST4VLA is a dual-system VLA framework: System 2 is a slower but more reliable VLM Planner responsible for extracting semantic and spatial priors from images and instructions; System 1 is an action expert DiT Actor responsible for converting these priors and robot observations into continuous control actions. Training is divided into two steps: first, strengthening the VLM's spatial grounding capabilities; second, activating these capabilities via spatial prompts during action post-training and feeding VLM latent spatial embeddings to the action expert through a querying transformer.

The critical aspect of this framework is that spatial information is not forced into explicit boxes or points for a rule-based controller but enters the action expert as latent planning tokens. This preserves the end-to-end trainability of VLA while ensuring the action model "sees" target objects, spatial relationships, and potential trajectories during control signal generation.

%%{init: {'flowchart': {'rankSpacing': 24, 'nodeSpacing': 28, 'padding': 6, 'wrappingWidth': 400}}}%%
flowchart TD
    A["Image Observation + Task Instruction"] --> B["Spatial Grounding Pre-training<br/>Point/Box/Trajectory Priors"]
    B --> C["Dual-system VLA<br/>Planner and Actor Division of Labor"]
    C --> D["Spatial Prompting Post-training<br/>Latent Spatial Tokens"]
    D --> E["DiT Actor Generates<br/>Continuous Action Chunks"]
    E --> F["Simulated and Real Robot Execution"]

Key Designs

1. Spatial Grounding Pre-training: Embedding robotic spatial common sense into the VLM

Standard vision-language pre-training gives a VLM semantic knowledge but not the specific spatial signals required for robotics: target locations, empty slots, graspable regions, and approximate end-effector trajectories. The first stage of ST4VLA unifies web-scale multimodal grounding data and robot-related data into a QA format, allowing a Qwen2.5-VL-style Planner to learn box, point, and trajectory outputs during supervised fine-tuning. Data sources include RefCOCO, LLaVA-OneVision, RoboRefIt, A0, MolmoAct, and an ST4VLA manipulation dataset constructed by the authors.

The value of this step is decoupling "spatial priors" from specific robot embodiments. For example, bounding box QA teaches the model to find objects based on language, point QA teaches it to point at targets or slots, and trajectory QA teaches it to describe manipulation trends via 2D trajectories. These capabilities determine which object or path the action expert should focus on. Experiments show that while general grounding data provides some improvement, adding robotic grounding data significantly boosts metrics on Where2Place, RoboRefIt, A0 ManiSkill, and SimplerEnv.
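To make the unified QA format concrete, here is a minimal sketch of how box, point, and trajectory supervision could be cast as question-answer pairs. The `SpatialQA` record, the question wording, the coordinate serialization, and the image paths are all hypothetical; the source only states that the three kinds of data share a QA format.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class SpatialQA:
    """One supervised fine-tuning sample (hypothetical schema)."""
    image: str      # path to the observation image (illustrative)
    question: str   # text prompt given to the Planner
    answer: str     # target text the Planner is trained to generate
    kind: str       # "box", "point", or "trajectory"


def box_qa(image: str, phrase: str, box: Tuple[int, int, int, int]) -> SpatialQA:
    # Bounding box QA: find the object referred to by language.
    x1, y1, x2, y2 = box
    return SpatialQA(image, f"Locate {phrase} and give its bounding box.",
                     f"[{x1}, {y1}, {x2}, {y2}]", "box")


def point_qa(image: str, phrase: str, point: Tuple[int, int]) -> SpatialQA:
    # Point QA: point at a target object or a free slot.
    x, y = point
    return SpatialQA(image, f"Point to {phrase}.", f"({x}, {y})", "point")


def trajectory_qa(image: str, task: str,
                  waypoints: List[Tuple[int, int]]) -> SpatialQA:
    # Trajectory QA: describe the manipulation trend as 2D waypoints.
    path = " -> ".join(f"({x}, {y})" for x, y in waypoints)
    return SpatialQA(image, f"Sketch a 2D path to {task}.", path, "trajectory")


samples = [
    box_qa("scene_0.png", "the red mug", (10, 20, 110, 220)),
    point_qa("scene_0.png", "an empty spot on the tray", (64, 128)),
    trajectory_qa("scene_0.png", "move the mug onto the tray",
                  [(60, 120), (80, 100), (64, 128)]),
]

for sample in samples:
    print(sample.kind, "|", sample.question, "|", sample.answer)
```

Because every sample reduces to text in and text out, all three kinds can be trained with the same next-token objective, independent of any robot embodiment.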

2. Dual-system VLA: Planner manages "where and what," Actor manages "how"

ST4VLA does not directly convert the VLM into a monolithic action model but adopts a dual-system structure. The VLM Planner (System 2) reads images and language to provide latent representations of semantics, targets, and spatial relationships. The action expert (System 1) uses a compact diffusion transformer and a DINOv2 visual encoder to predict embodiment-specific actions. The two are connected by a querying transformer (only 8.7 MB), which maps variable-length VLM tokens into a fixed number of learnable query tokens that serve as conditions for the action expert.
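The variable-length to fixed-length mapping can be illustrated with a single cross-attention step in which a fixed set of query vectors attends over the VLM tokens. This is a deliberately reduced sketch: it has one head, no learned key/value projections, and random vectors standing in for trained parameters, so it shows only the shape behavior, not the paper's actual querying transformer. All names and sizes are assumptions.

```python
import math
import random


def cross_attention_pool(queries, tokens):
    """Each query attends over all tokens and returns a weighted average.

    queries: list of K vectors (stand-ins for learnable query tokens)
    tokens:  list of N vectors (stand-ins for VLM output tokens), N may vary
    returns: list of K vectors, independent of N
    """
    dim = len(queries[0])
    pooled = []
    for q in queries:
        # Scaled dot-product scores between this query and every token.
        scores = [sum(qi * ti for qi, ti in zip(q, t)) / math.sqrt(dim)
                  for t in tokens]
        # Numerically stable softmax over the token axis.
        peak = max(scores)
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        weights = [w / total for w in weights]
        # Weighted sum of token vectors.
        pooled.append([sum(w * t[d] for w, t in zip(weights, tokens))
                       for d in range(dim)])
    return pooled


rng = random.Random(0)
DIM, NUM_QUERIES = 4, 3  # illustrative sizes


def random_vectors(count):
    return [[rng.gauss(0.0, 1.0) for _ in range(DIM)] for _ in range(count)]


queries = random_vectors(NUM_QUERIES)
out_short = cross_attention_pool(queries, random_vectors(5))
out_long = cross_attention_pool(queries, random_vectors(12))
print(len(out_short), len(out_long))  # both equal NUM_QUERIES
```

The action expert therefore always receives the same number of conditioning vectors, whether the Planner processed a short instruction or a long prompt with many image tokens.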

This design avoids two extremes: an explicit planning system regressing into brittle task decomposition, and an end-to-end action loss overriding the VLM's spatial representations. The querying transformer acts as a lightweight interface, extracting the spatial latents needed by the action expert. Gradient decay is applied at this interface (e.g., scaling gradients from the Actor back to the VLM by \(0.5\)), allowing joint optimization without causing the VLM to forget multimodal knowledge.
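A scalar toy model can make the gradient decay concrete. The sketch below hand-derives the backward pass for a one-parameter planner and a one-parameter actor, then scales only the gradient that crosses from the actor into the planner. The model, the squared-error loss, and all names are illustrative assumptions; only the decay factor of 0.5 comes from the text.

```python
def actor_loss_and_grads(w_planner, w_actor, x, target, decay=0.5):
    """Forward and manual backward pass for a two-stage scalar model.

    latent = w_planner * x        (stand-in for the Planner's latent token)
    action = w_actor * latent     (stand-in for the Actor's prediction)
    loss   = (action - target)^2
    """
    latent = w_planner * x
    action = w_actor * latent
    error = action - target
    loss = error ** 2

    # Backward pass through the actor: gradient is left untouched.
    dloss_daction = 2.0 * error
    grad_actor = dloss_daction * latent

    # Gradient arriving at the planner/actor interface.
    dloss_dlatent = dloss_daction * w_actor

    # Decay is applied only to what flows back into the planner.
    grad_planner = decay * dloss_dlatent * x
    return loss, grad_actor, grad_planner


full = actor_loss_and_grads(1.0, 2.0, 3.0, 0.0, decay=1.0)
decayed = actor_loss_and_grads(1.0, 2.0, 3.0, 0.0, decay=0.5)
print("no decay :", full)
print("decay 0.5:", decayed)
```

With these values the loss and the actor gradient are identical in both runs, while the planner gradient is halved. In an autograd framework the same effect is commonly obtained by mixing a tensor with a gradient-stopped copy of itself, although the source does not say how the authors implement it.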

3. Spatial Prompting Post-training: Activating spatial priors via prompts instead of coordinate outputs

The core of the second stage is not simple co-training but appending spatial prompts to action data. For instance, an instruction like "store all toys into the toy box" is expanded to "Identify all relevant toys and their spatial relationships to the container." A default unified prompt is "Figure out how to execute it, then locate the key object needed." This does not require the model to output explicit coordinates but induces the Planner to focus on target objects and spatial relations within its latent representations.
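As a minimal sketch of this augmentation step, the helper below appends a spatial prompt to a task instruction before it is passed to the Planner. The function name and the punctuation handling are assumptions; the default prompt string is the unified prompt quoted above.

```python
UNIFIED_SPATIAL_PROMPT = (
    "Figure out how to execute it, then locate the key object needed."
)


def add_spatial_prompt(instruction: str,
                       spatial_prompt: str = UNIFIED_SPATIAL_PROMPT) -> str:
    """Return the instruction followed by an auxiliary spatial prompt."""
    instruction = instruction.strip()
    # Terminate the instruction so the two sentences do not run together.
    if not instruction.endswith((".", "!", "?")):
        instruction += "."
    return f"{instruction} {spatial_prompt}"


planner_input = add_spatial_prompt("store all toys into the toy box")
print(planner_input)
```

Only the Planner's text input changes. The action labels stay as they are, and the Planner is not asked to emit coordinates.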

Ablations show that Random Padding achieves only a 58.5% average success rate, which rules out longer token sequences as the source of the gain. Box, Point, and Trace prompting achieve 76.6%, 74.9%, and 73.9% respectively, and Unified Prompting performs best at 77.9%. This suggests that spatial semantics must be activated but not necessarily compressed into fixed formats, as rigid constraints might limit the Planner's implicit reasoning.

4. Gradient Alignment Diagnosis: Proving objective alignment via PSS

The paper analyzes whether spatial grounding and action learning are consistent at the optimization level using Projection-Space Similarity (PSS). Given a spatial grounding batch and an action batch with gradient matrices \(G_{spat}\) and \(G_{act}\), projection matrices are constructed as \(P_{spat}=G_{spat}G_{spat}^{+}\) and \(P_{act}=G_{act}G_{act}^{+}\), where \(G^{+}\) denotes the Moore-Penrose pseudoinverse. Similarity is defined as \(\mathrm{PSS}(G_{spat},G_{act})=\frac{\operatorname{tr}(P_{spat}P_{act})}{\min(r_{spat},r_{act})}\), where \(r_{spat}\) and \(r_{act}\) are the ranks of the two projections, so the value lies in \([0,1]\). Higher values indicate less conflict when updating shared parameters.
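The PSS definition can be evaluated without forming a pseudoinverse explicitly. If \(Q\) is an orthonormal basis of the column space of \(G\), then \(GG^{+}=QQ^{\top}\), so \(\operatorname{tr}(P_{spat}P_{act})\) equals the squared Frobenius norm of \(Q_{spat}^{\top}Q_{act}\). The sketch below uses that identity with Gram-Schmidt in plain Python. It assumes each gradient matrix is given as a list of column vectors and that \(r\) is the rank of the corresponding projection; the toy inputs are made up for illustration and are not the paper's gradients.

```python
import math


def orthonormal_basis(columns, tol=1e-10):
    """Modified Gram-Schmidt: orthonormal basis of the span of `columns`."""
    basis = []
    for col in columns:
        v = list(col)
        original_norm = math.sqrt(sum(a * a for a in v))
        for q in basis:
            dot = sum(a * b for a, b in zip(v, q))
            v = [a - dot * b for a, b in zip(v, q)]
        norm = math.sqrt(sum(a * a for a in v))
        # Keep the direction only if it is not (numerically) dependent.
        if norm > tol * original_norm:
            basis.append([a / norm for a in v])
    return basis


def pss(columns_a, columns_b):
    """Projection-Space Similarity between two column spaces."""
    q_a = orthonormal_basis(columns_a)
    q_b = orthonormal_basis(columns_b)
    if not q_a or not q_b:
        return 0.0
    # tr(P_a P_b) = sum of squared inner products between basis vectors.
    overlap = sum(sum(x * y for x, y in zip(u, v)) ** 2
                  for u in q_a for v in q_b)
    # Number of basis vectors equals the rank of each projection.
    return overlap / min(len(q_a), len(q_b))


same = pss([[1, 0, 0], [0, 1, 0]], [[1, 1, 0], [1, -1, 0]])
orthogonal = pss([[1, 0, 0]], [[0, 0, 1]])
partial = pss([[1, 0, 0]], [[1, 1, 0]])
print(same, orthogonal, partial)
```

Identical subspaces give a value of 1, orthogonal subspaces give 0, and a 45 degree angle between two one-dimensional subspaces gives about 0.5.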

Vanilla co-training shows a PSS of only 0.25, reflecting inconsistent optimization. ST4VLA increases PSS to 0.42. During training, Vanilla VLA perception performance drops to near-random, whereas ST4VLA retains ~70% of original RefCOCO-g capability while reaching success milestones faster on WidowX.

Loss & Training

The first stage uses standard SFT-style next-token prediction to train the VLM Planner on unified QA formats covering VQA, box QA, point QA, and trajectory QA. No robotic action heads are trained here.

The second stage uses both action data and multimodal spatial grounding data. The action expert predicts continuous action chunks (chunk size 16 in SimplerEnv). The VLM receives primary images, instructions, and auxiliary spatial prompts. The action expert predicts actions conditioned on the Planner's latent tokens. Multimodal data follows the QA format loss, while action data uses robotic action loss. The total optimization is a weighted sum of both.

Loss weighting is crucial. Ablations on grounding vs. action loss weights reveal that \(1:1\) or \(1:5\) biases the model too much toward grounding, while \(1:15\) or \(1:20\) weakens spatial capabilities. The optimal ratio is approximately \(1:10\), yielding success rates of 80.7/76.0 on Google Robot VM/VA and 71.7 on WidowX.
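The weighted sum can be sketched as follows, reading the ratio as grounding weight to action weight so that 1:10 puts ten times more weight on the action term. The function name, the example loss values, and the absence of any normalization are assumptions; the source only states that the total is a weighted sum with a best ratio of roughly 1:10.

```python
def total_loss(grounding_loss: float, action_loss: float,
               grounding_weight: float = 1.0,
               action_weight: float = 10.0) -> float:
    """Weighted sum of the QA-format grounding loss and the action loss."""
    return grounding_weight * grounding_loss + action_weight * action_loss


# Illustrative per-batch loss values (not taken from the paper).
grounding_loss, action_loss = 0.8, 0.05

for ratio in [(1, 1), (1, 5), (1, 10), (1, 20)]:
    combined = total_loss(grounding_loss, action_loss, *ratio)
    action_share = ratio[1] * action_loss / combined
    print(f"{ratio[0]}:{ratio[1]:<2} total={combined:.2f} "
          f"action share={action_share:.0%}")
```

Sweeping the ratio this way shows how quickly the action term's share of the objective grows, which is consistent with the ablation trend: low ratios favor grounding and high ratios erode it.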

Key Experimental Results

Main Results

Experiments cover simulated benchmarks (SimplerEnv, LIBERO), large-scale Isaac-Sim pick-and-place, real Franka pick-and-place, and real long-horizon tasks. ST4VLA shows significant improvements over vanilla VLA and strong baselines in SimplerEnv.

| Benchmark / Track | Metric | Ours (ST4VLA) | Strong Baseline | Gain |
| --- | --- | --- | --- | --- |
| SimplerEnv Google Robot Visual Matching | Avg SR | 84.6 | SpatialVLA 75.1 | +9.5 |
| SimplerEnv Google Robot Variant Aggregation | Avg SR | 75.9 | SpatialVLA 70.7 | +5.2 |
| SimplerEnv WidowX Visual Matching | Avg SR | 73.2 | GR00T N1.5 61.9 | +11.3 |
| LIBERO Average | Avg SR | 95.9 | π0.5-KI 94.3 | +1.6 |
| Real Pick-and-Place | Avg SR | 65 | GR00T N1.5 48 | +17 |

In the SimplerEnv Google Robot Visual Matching track, ST4VLA achieved 97.3, 98.0, 65.3, and 77.8 on Pick Coke Can, Move Near, Open/Close Drawer, and Open Top Drawer & Place Apple respectively, which average to the 84.6 reported above.
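The reported numbers are internally consistent, which a few lines of arithmetic can confirm. The sketch below recomputes the average of the four per-task success rates and the gain column of the main results table, using only values quoted in this section; the variable names are illustrative.

```python
# Per-task success rates quoted for SimplerEnv.
per_task_sr = {
    "Pick Coke Can": 97.3,
    "Move Near": 98.0,
    "Open/Close Drawer": 65.3,
    "Open Top Drawer & Place Apple": 77.8,
}
average_sr = round(sum(per_task_sr.values()) / len(per_task_sr), 1)

# (track, ST4VLA score, strong baseline score) from the main results table.
reported = [
    ("SimplerEnv Google Robot Visual Matching", 84.6, 75.1),
    ("SimplerEnv Google Robot Variant Aggregation", 75.9, 70.7),
    ("SimplerEnv WidowX Visual Matching", 73.2, 61.9),
    ("LIBERO Average", 95.9, 94.3),
    ("Real Pick-and-Place", 65, 48),
]
gains = {track: round(ours - baseline, 1) for track, ours, baseline in reported}

print("average of per-task SR:", average_sr)
for track, gain in gains.items():
    print(f"{track}: +{gain}")
```

The per-task average matches the Google Robot Visual Matching score, and each recomputed difference matches the gain listed in the table.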

| Training Strategy | MME | RefCOCO-g [email protected] | Where2Place point-Acc | Google Robot VM/VA | WidowX VM |
| --- | --- | --- | --- | --- | --- |
| Vanilla VLA | - | - | - | 66.1 / 63.5 | 54.7 |
| Vanilla co-train | 1106 | 47.1 | 21.4 | 70.2 / 66.5 | 61.1 |
| + Spatially Guided | 1374 | 68.1 | 25.5 | 78.8 / 70.0 | 67.4 |
| + Spatially Pretrained | 1411 | 71.2 | 25.5 | 84.6 / 75.9 | 73.2 |

Ablation Study

| Configuration | Key Metrics | Note |
| --- | --- | --- |
| No Additional Pretraining | Google VM/VA 66.1/63.5, WidowX 54.9 | Relying only on base Qwen2.5-VL |
| + General Grounding Data | Google VM/VA 72.6/70.3, WidowX 65.2 | Improves target recognition |
| + Robotic Grounding Data | Google VM/VA 84.3/75.9, WidowX 73.1 | Largest gain from robot-specific spatial data |
| Unified Spatial Prompt | Avg 77.9 | Most stable prompting strategy |

Key Findings

  • Spatial pre-training is the key factor determining the performance ceiling of VLA. Increasing spatial data from 0M to 3.0M improved metrics from 61.4 to 77.9.
  • Simple co-training does not solve the perception-action conflict. Vanilla co-training has low PSS and oscillatory curves; ST4VLA increases PSS to 0.42, aligning the objectives.
  • Real-world generalization is significantly improved. ST4VLA achieved a 65% success rate in real pick-and-place, outperforming GR00T N1.5 (48%) and π0 (31%).
  • In long-horizon tasks (sandwich making, sorting), ST4VLA leads, performing better under physical interference and task replanning scenarios.

Highlights & Insights

  • The ingenuity of ST4VLA lies in treating spatial grounding as a training objective for the VLM Planner and a latent condition for the Actor, rather than an external module. This retains end-to-end scalability while avoiding reliance on fixed detectors.
  • The paper treats the collapse of spatial capabilities during VLA fine-tuning as an observable phenomenon (validated by PSS and RefCOCO-g curves) rather than just reporting success rates.
  • Unified spatial prompting results suggest that robot policies do not necessarily need explicit chain-of-thought outputs; activating the correct latent space via language can provide sufficient information to the controller.

Limitations & Future Work

  • The training pipeline is heavy, requiring >3M spatial entries and large-scale demonstrations, posing a high reproduction barrier.
  • Success is demonstrated mainly in tabletop scenarios. More complex contact-rich manipulation and tool use require further validation.
  • Failure cases indicate that while the model locates targets, it can still fail in grasp pose or transition phases due to control precision limits.
  • Future work could incorporate depth, proprioception, or 3D affordance maps into the Planner-Actor interface to handle uncertainty.

Related Work & Insights

  • vs Monolithic VLAs (OpenVLA, RT-2): ST4VLA emphasizes establishing spatial grounding before action learning, which yields higher stability in unseen contexts.
  • vs Diffusion/Flow Policies (π0): While π0 focuses on action modeling, ST4VLA proves that VLM spatial reasoning as an independent training objective provides additional gains.
  • Insight: Intermediate representations do not have to be explicit text or symbolic structures. "Trainable latent spatial plans" that constrain the action expert without limiting VLM reasoning are a promising direction.

Rating

  • Novelty: ⭐⭐⭐⭐☆ Combines grounding pre-training, spatial prompts, and dual-system VLA comprehensively.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ Extensive coverage across simulation and real-world benchmarks with solid ablation chains.
  • Writing Quality: ⭐⭐⭐⭐☆ Clear main line and figures, though some details are dispersed in the appendix.
  • Value: ⭐⭐⭐⭐⭐ Practical recipe for VLA training, particularly regarding the preservation of spatial priors.