ST4VLA: Spatially Guided Training for Vision-Language-Action Models

Conference: ICLR 2026
arXiv: 2602.10109
Code: https://internrobotics.github.io/internvla-m1.github.io
Area: Robotic Manipulation
Keywords: Vision-Language-Action, Spatially Guided Training, Dual-System Architecture, Diffusion Policy, Robotic Manipulation

TL;DR

This paper proposes ST4VLA, a two-stage spatially guided training framework (spatial grounding pre-training followed by spatially guided action post-training) that explicitly injects VLM spatial priors into VLA policy learning. On SimplerEnv, it lifts the Google Robot visual-matching success rate from 66.1% (vanilla VLA baseline) to 84.6%, and WidowX from 54.7% to 73.2%, achieving state-of-the-art performance.

Background & Motivation

The Core Gap from VLM to VLA: Large-scale vision-language models excel at multimodal understanding, but direct transfer to embodied control requires mapping textual instructions to low-level motor actions — a correspondence that is extremely scarce in standard VLM training data.

Importance of Spatial Priors: Spatial priors such as object recognition, affordance localization, visual trajectory reasoning, and relative position perception are transferable general knowledge for robotic manipulation, yet existing methods fail to exploit them effectively.

Limitations of Hierarchical Systems: Traditional hierarchical approaches (e.g., augmenting foundation models with SAM/DINO) rely on rule-based task decomposition and hand-crafted planning heuristics, making them difficult to scale automatically to complex and diverse tasks.

Overfitting in End-to-End VLAs: Data-driven VLAs (e.g., OpenVLA, π₀) fine-tune directly from pre-trained VLMs for control, but tend to overfit low-level motion patterns without adequately leveraging spatial priors.

Gradient Conflicts in Naive Co-Training: The authors find that directly combining spatial data with action data in vanilla co-training causes inconsistent gradient directions between spatial perception and action learning objectives (PSS of only 0.25), leading to unstable oscillations.

Collapse of Spatial Priors after Fine-Tuning: After directly fine-tuning a VLM into a VLA, RefCOCO-g performance drops sharply to near-random levels within 20k steps, demonstrating that action training severely degrades existing spatial representations.

Method

Overall Architecture: Dual-System VLA

ST4VLA builds a dual-system architecture on top of Qwen2.5-VL:

  • System 2 (VLM Planner): A multimodal encoder that captures spatial and semantic priors, responsible for high-level planning and spatial reasoning (a "slow but reliable" reasoner).
  • System 1 (Action Expert): A lightweight Diffusion Transformer (DiT) combined with a DINOv2 visual encoder, responsible for fast, embodiment-specific motion control.

The two systems are connected via a Querying Transformer (only 8.7 MB): a \(k\)-layer cross-attention module that maps variable-length latent spatial grounding embeddings produced by the VLM Planner into a fixed number of learnable query tokens, which are then passed to the Action Expert.
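As a concrete illustration, here is a minimal PyTorch sketch of such a querying module; the class name, dimensions, and layer count are illustrative assumptions, not the authors' implementation:

```python
import torch
import torch.nn as nn

class QueryingTransformer(nn.Module):
    """Maps variable-length VLM embeddings to a fixed set of query tokens
    via k layers of cross-attention (illustrative sketch, not the paper's code)."""

    def __init__(self, num_queries=64, dim=512, num_layers=2, num_heads=8):
        super().__init__()
        # Fixed number of learnable query tokens.
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.attn_layers = nn.ModuleList([
            nn.MultiheadAttention(dim, num_heads, batch_first=True)
            for _ in range(num_layers)
        ])
        self.norms = nn.ModuleList([nn.LayerNorm(dim) for _ in range(num_layers)])

    def forward(self, vlm_embeddings):
        # vlm_embeddings: (batch, seq_len, dim); seq_len may vary per batch.
        q = self.queries.unsqueeze(0).expand(vlm_embeddings.shape[0], -1, -1)
        for attn, norm in zip(self.attn_layers, self.norms):
            # Queries cross-attend to the VLM's latent grounding embeddings.
            out, _ = attn(query=q, key=vlm_embeddings, value=vlm_embeddings)
            q = norm(q + out)  # residual connection + layer norm
        return q  # (batch, num_queries, dim) -> consumed by the Action Expert
```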

Key Design 1: Spatial Prompting

During the action post-training stage, spatial prompts are appended to the standard task instruction to explicitly activate the spatial awareness acquired by the VLM during spatial pre-training. For example:

  • Original instruction: "store all toys into the toy box"
  • Spatial prompt extension: "Identify all relevant toys and their spatial relationships to the container."
  • Generic prompt: "Figure out how to execute it, then locate the key object needed."

The latent grounding embeddings generated by the VLM provide explicit spatial cues to guide action generation.
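In code, this mechanism is plausibly little more than string composition; a hedged sketch, where the function name is hypothetical and the fallback template is the generic prompt quoted above:

```python
def build_spatial_instruction(task_instruction, spatial_prompt=None):
    """Append a spatial prompt to the task instruction (illustrative sketch)."""
    if spatial_prompt is None:
        # Generic fallback when no task-specific spatial prompt is available.
        spatial_prompt = "Figure out how to execute it, then locate the key object needed."
    return f"{task_instruction}. {spatial_prompt}"

# Example using the paper's running task:
prompt = build_spatial_instruction(
    "store all toys into the toy box",
    "Identify all relevant toys and their spatial relationships to the container.",
)
```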

Key Design 2: Gradient Decay

A gradient decay factor (e.g., 0.5) is introduced in the Querying Transformer to attenuate gradients back-propagated from the Action Expert to the VLM, preventing action learning objectives from corrupting the VLM's semantic reasoning capabilities while still enabling effective joint optimization.
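A natural way to implement this is a pass-through autograd function: identity in the forward pass, gradient scaling in the backward pass. A minimal sketch, assuming the decay is applied where the VLM embeddings enter the Querying Transformer:

```python
import torch

class GradientDecay(torch.autograd.Function):
    """Identity forward; multiplies the incoming gradient by `decay` on the
    way back (illustrative sketch of the described mechanism)."""

    @staticmethod
    def forward(ctx, x, decay):
        ctx.decay = decay
        return x

    @staticmethod
    def backward(ctx, grad_output):
        # Attenuate gradients flowing from the Action Expert back into the VLM;
        # `decay` itself receives no gradient.
        return grad_output * ctx.decay, None

def gradient_decay(x, decay=0.5):
    return GradientDecay.apply(x, decay)

# Usage: vlm_embeddings = gradient_decay(vlm_embeddings, decay=0.5)
# Action-loss gradients reaching the VLM are halved; forward activations
# are untouched, so joint optimization remains possible.
```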

Two-Stage Training Pipeline

Stage 1: Spatial Grounding Pre-training

  • Unifies large-scale internet vision-language grounding data (RefCOCO, LLaVA-OneVision) with robot-specific datasets (RoboRefIt, A0, ST4VLA Data).
  • Training targets include bounding-box detection, affordance recognition, and trajectory prediction.
  • All data is standardized into a QA format (illustrated in the sketch below) and trained under a standard SFT framework.
  • Goal: establish general yet robot-task-relevant spatial representations.
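To make the unified QA format concrete, here is what such records might look like; the field names and answer tags are hypothetical, not the paper's schema:

```python
# Hypothetical Stage-1 SFT records covering the three grounding targets.
grounding_examples = [
    {   # Bounding-box detection (RefCOCO-style referring expression)
        "image": "scene_0001.jpg",
        "question": "Where is the red mug on the table?",
        "answer": "<box>(312, 198), (402, 287)</box>",
    },
    {   # Affordance recognition (robot-specific data, e.g. RoboRefIt / A0)
        "image": "scene_0002.jpg",
        "question": "Which part of the drawer should be grasped to open it?",
        "answer": "<point>(451, 330)</point>",
    },
    {   # Visual trajectory prediction
        "image": "scene_0003.jpg",
        "question": "Trace a path that moves the toy into the box.",
        "answer": "<traj>(120, 340), (200, 300), (310, 260)</traj>",
    },
]
```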

Stage 2: Spatially Guided Action Post-training

  • Joint training on action data and spatial grounding data.
  • The VLM is updated via next-token prediction on image-prompt pairs.
  • Spatial prompts are applied to action data to enhance semantic-motor alignment.
  • The Action Expert learns a diffusion policy via DiT.
  • Dual supervision: spatial supervision for the VLM Planner and action supervision for the Action Expert.

Loss & Training

  • VLM Planner: Standard autoregressive next-token prediction loss (including spatial grounding objectives).
  • Action Expert: Diffusion denoising loss (DiT).
  • Joint optimization with gradient decay to protect VLM parameters; a combined training step is sketched below.
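Putting the pieces together, a hedged sketch of one joint optimization step; the model interfaces, batch keys, and loss weight `lam` are assumptions, and `gradient_decay` / `QueryingTransformer` refer to the sketches above:

```python
import torch.nn.functional as F

def joint_training_step(vlm, q_former, action_expert, batch, lam=1.0, decay=0.5):
    # --- Spatial supervision: autoregressive next-token prediction on QA pairs.
    vlm_out = vlm(batch["grounding_images"], batch["grounding_prompts"])
    loss_spatial = F.cross_entropy(
        vlm_out.logits.flatten(0, 1),         # (batch * seq, vocab)
        batch["grounding_targets"].flatten()  # (batch * seq,)
    )

    # --- Action supervision: diffusion denoising loss through the DiT policy.
    embeds = vlm.encode(batch["robot_images"], batch["spatial_instructions"])
    embeds = gradient_decay(embeds, decay)    # protect the VLM's spatial priors
    queries = q_former(embeds)                # fixed-length query tokens
    noise_pred = action_expert(batch["noised_actions"], batch["timesteps"], queries)
    loss_action = F.mse_loss(noise_pred, batch["noise"])

    # Dual supervision: both objectives are optimized jointly.
    return loss_spatial + lam * loss_action
```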

Key Experimental Results

Main Results: SimplerEnv Benchmark

| Model | Google Robot VM | Google Robot VA | WidowX VM |
|---|---|---|---|
| RT-2-X | 46.3 | 54.4 | – |
| OpenVLA | 34.3 | 39.3 | 4.2 |
| CogACT | 74.8 | 61.3 | 51.3 |
| SpatialVLA | 75.1 | 70.7 | 42.7 |
| π₀ | 58.8 | 54.8 | 27.1 |
| GR00T N1.5 | 35.2 | 44.5 | 61.9 |
| Vanilla VLA | 66.1 | 63.5 | 54.7 |
| ST4VLA | 84.6 | 75.9 | 73.2 |

Success rates in %; VM = Visual Matching, VA = Variant Aggregation.

ST4VLA achieves state-of-the-art results across all three settings:

  • Google Robot VM: +9.8 points over the next best (CogACT)
  • Google Robot VA: +5.2 points over the next best (SpatialVLA)
  • WidowX VM: +11.3 points over the next best (GR00T N1.5)

Ablation Study: Training Strategy Comparison

| Training Strategy | Google Robot VM / VA | WidowX VM | RefCOCO-g IoU@0.5 |
|---|---|---|---|
| Vanilla VLA | 66.1 / 63.5 | 54.7 | Collapsed (near random) |
| Vanilla co-train | 70.2 / 66.5 | 61.1 | 47.1 |
| + Spatially guided | 78.8 / 70.0 | 67.4 | 68.1 |
| + Spatial pre-training (ST4VLA) | 84.6 / 75.9 | 73.2 | 71.2 |

  • The combination of spatial pre-training and spatially guided co-training yields substantial cumulative gains.
  • Gradient PSS improves from 0.25 (vanilla co-train) to 0.42 (spatially guided); one plausible reading of this metric is sketched below.
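The review does not spell out how PSS is computed; one plausible reading is a cosine-similarity score between the gradients of the spatial and action objectives on shared parameters. A sketch under that assumption (not the paper's definition):

```python
import torch

def gradient_alignment(loss_spatial, loss_action, shared_params):
    """Cosine similarity between two objectives' gradients on shared parameters;
    an assumed stand-in for PSS, not the paper's exact metric."""
    g_s = torch.autograd.grad(loss_spatial, shared_params,
                              retain_graph=True, allow_unused=True)
    g_a = torch.autograd.grad(loss_action, shared_params,
                              retain_graph=True, allow_unused=True)
    # Keep only parameters that receive gradients from both objectives.
    pairs = [(gs, ga) for gs, ga in zip(g_s, g_a)
             if gs is not None and ga is not None]
    flat_s = torch.cat([gs.flatten() for gs, _ in pairs])
    flat_a = torch.cat([ga.flatten() for _, ga in pairs])
    return torch.dot(flat_s, flat_a) / (flat_s.norm() * flat_a.norm() + 1e-8)
```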

Real-World Pick-and-Place Generalization

| Model | In-dist. | Novel Instances | Similar Distractors | Novel Background | Unseen Location | Unseen Orientation | Attribute Instruction | Spatial Instruction | Average |
|---|---|---|---|---|---|---|---|---|---|
| π₀ | 45 | 32 | 25 | 27 | 18 | 32 | 37 | 31 | 31 |
| GR00T N1.5 | 78 | 46 | 40 | 47 | 20 | 40 | 59 | 53 | 48 |
| ST4VLA | 92 | 62 | 49 | 63 | 52 | 72 | 73 | 61 | 65 |

Success rates in %.

ST4VLA leads in every one of the eight evaluation settings, with an average success rate of 65%, versus 48% for GR00T N1.5 and 31% for π₀.

Long-Horizon Manipulation

On long-horizon tasks including table tidying, drawer organization, and sandwich assembly, ST4VLA outperforms GR00T N1.5 and π₀ under in-distribution, physical perturbation, and task re-planning settings, demonstrating the task decomposition and dynamic adaptation capabilities of the dual-system framework.

Key Findings

  1. Spatial pre-training is the single most impactful improvement (VM +13.4 points), with spatial prompting providing additional gains on top of it.
  2. Gradient alignment analysis via PSS quantifies the consistency between spatial and action optimization objectives for the first time.
  3. Generalization to unseen object locations and orientations is particularly pronounced (location: 52% vs. 20%/18%), indicating that spatial priors effectively provide a foundation for generalization.

Highlights & Insights

  1. First Systematic Analysis of Spatial Prior Collapse: By tracking RefCOCO-g performance and introducing the PSS gradient alignment metric, the paper clearly illustrates the mechanism by which naive VLA fine-tuning destroys spatial priors, providing a quantitative explanation.
  2. Simplicity and Effectiveness of Spatial Prompting: Appending simple textual prompts to instructions significantly improves spatial-action alignment without requiring complex intermediate representation generation or additional modules.
  3. Elegant Dual-System and Dual-Supervision Design: The System 1/2 analogy is intuitive; the Querying Transformer is extremely lightweight at only 8.7 MB, and the gradient decay mechanism is simple yet critical.
  4. Comprehensive and Challenging Evaluation: The evaluation spans simulation benchmarks, large-scale 200-task pick-and-place, 8-dimensional real-world generalization, and long-horizon manipulation, making the assessment highly convincing.
  5. Scalability of the Unified Framework: Spatial grounding pre-training draws from diverse data sources (web-scale + robot-specific), and the training pipeline is reusable across different embodiment types.

Limitations & Future Work

  1. VLM Backbone Fixed at Qwen2.5-VL-3B: The 3B backbone is relatively small; whether larger models (7B/72B) would yield further gains, or whether performance saturates, remains unverified.
  2. Manual Design of Spatial Prompts: Although effective, the spatial prompt templates are fixed; automated prompt generation or adaptive prompting strategies have not been explored.
  3. Limited Experimental Platforms: Validation is primarily conducted on the Franka single-arm robot, without covering more complex embodiments such as dexterous hands, dual-arm setups, or mobile manipulation.
  4. Inference Efficiency: The inference latency and real-time feasibility of the dual-system architecture (VLM + DiT + DINOv2) are not analyzed, leaving open the question of suitability for practical deployment.
  5. Data Fairness Relative to π₀: π₀ and GR00T utilize larger-scale pre-training action data, making direct comparisons not entirely fair (although the authors attempt alignment in the 200-task experiment).

Related Work

  • Hierarchical Robotic Systems: SayCan, Code as Policies, and similar approaches rely on rule-based task decomposition and lack flexibility.
  • End-to-End VLAs: RT-2, OpenVLA, π₀, and CogACT learn actions directly but neglect spatial priors.
  • Spatially Aware VLAs: SpatialVLA introduces spatial representations but lacks spatially guided prompting; Magma employs spatial pre-training but lacks spatially guided post-training.
  • Explicit Reasoning VLAs: ECOT (textual planning), CoT-VLA (visual chain-of-thought), OneTwoVLA (alternating thinking and execution).
  • Positioning of This Work: ST4VLA unifies spatial pre-training and spatially guided post-training, achieving end-to-end optimization through gradient alignment without relying on explicit intermediate representations.

Rating

  • Novelty: ⭐⭐⭐⭐ (The combination of spatial prompting and gradient alignment is novel; the PSS analysis provides theoretical insight.)
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ (Comprehensive evaluation across simulation, real-world, large-scale, long-horizon, and multi-dimensional generalization settings.)
  • Writing Quality: ⭐⭐⭐⭐ (Well-structured; the logic from motivation to method to experiments flows clearly.)
  • Value: ⭐⭐⭐⭐ (The spatially guided training paradigm offers important reference value for the VLA community, though applicability to broader embodiment types requires further validation.)