
World-Env: Leveraging World Model as a Virtual Environment for VLA Post-Training

Conference: CVPR2026
arXiv: 2509.24948
Code: github.com/amap-cvlab/world-env
Area: Multimodal VLM
Keywords: VLA, World Model, Reinforcement Learning, Post-Training, Robot Manipulation, Few-Shot

TL;DR

This paper proposes the World-Env framework, which leverages a physically consistent world model as a virtual environment in place of real-world interaction to perform RL post-training on VLA models. With only 5 demonstrations per task, the framework achieves significant improvements in manipulation success rates.

Background & Motivation

Core Problem

VLA (Vision-Language-Action) models learn policies from large-scale demonstration data via imitation learning, but face two key bottlenecks:

Data Scarcity: Collecting high-quality human demonstrations is extremely costly, and performance degrades sharply under few-shot conditions.

Constrained RL Post-Training: Although RL can compensate for limited demonstrations through interactive exploration, real-world environments are non-resettable—particularly in high-risk settings such as industrial automation, where state changes induced by interaction can be costly or irreversible.

Limitations of Prior Work

| Approach | Advantages | Limitations |
|---|---|---|
| Real-world RL | Authentic dynamics | Non-resettable, high cost, safety risks |
| Simulator RL | No physical risk | High development cost, large sim-to-real gap, difficult to adapt to novel objects |
| Pure SFT | Simple and direct | Requires large amounts of demonstration data, poor generalization |

Key Insight

Video world models possess action-conditioned future prediction capabilities and persistent scene representations, enabling them to generate visually plausible future frame sequences. This effectively constitutes a low-cost, risk-free virtual simulator that is also more flexible than conventional simulators, since it requires no manual modeling of new objects.

Additional Problem

Existing VLA models lack a task-completion detection mechanism and continue executing redundant actions after task success (e.g., continuing to push an object already placed in position), leading to degraded success rates.

Method

Overall Architecture

World-Env consists of three core modules:

  1. Physically-Consistent World Simulator: A diffusion-based world simulator that predicts future visual observations conditioned on actions.
  2. VLM-Guided Instant Reflector: A VLM-based instant feedback module that provides continuous reward signals and determines task completion.
  3. RL Post-Training Pipeline: A policy optimization pipeline based on RLOO + PPO.

Workflow: The VLA policy \(\pi_\theta\) predicts action \(\mathbf{a}_t\) given the current observation \(\mathbf{o}_t\), proprioceptive state \(\mathbf{s}_t\) (6D end-effector pose + 1D gripper state), and language instruction \(\mathbf{g}\). The next state \(\mathbf{s}_{t+1}\) is computed via forward kinematics, upon which the world simulator predicts the next observation frame \(\mathbf{o}_{t+1}\), forming a closed-loop rollout. The Instant Reflector evaluates the trajectory and decides whether to terminate.
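
A minimal Python sketch of this closed-loop rollout; all function and module names here are illustrative assumptions, not the authors' actual API:

```python
def rollout(policy, world_sim, reflector, forward_kinematics,
            o_t, s_t, goal, max_steps=200, eta=0.5):
    """Imagined rollout inside the world model (hypothetical interfaces)."""
    frames, actions = [o_t], []
    for _ in range(max_steps):
        a_t = policy.predict(o_t, s_t, goal)              # VLA policy predicts an action
        s_t = forward_kinematics(s_t, a_t)                # next proprioceptive state
        o_t = world_sim.predict_next_frame(frames, s_t)   # world model imagines the next observation
        frames.append(o_t)
        actions.append(a_t)
        if reflector.reward(frames, goal) > eta:          # VLM-guided termination check
            break
    return frames, actions
```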

Key Design 1: Geometry-Aware Feature Injection

The world simulator is built on a U-Net denoising diffusion network. The core innovation is Geometry-Aware Feature Injection:

  • Predicted actions are converted to proprioceptive states via forward kinematics and projected onto the image plane to generate an action map (foreground encodes pose; background is all-black to maximize contrast).
  • Historical observations sampled from a memory bank are concatenated with the action map and injected into the U-Net as pixel-level conditions.
  • Complementary features are extracted from two pretrained encoders and injected into the denoising process via multi-resolution cross-attention:
    • VGGT: Preserves fine-grained geometric structure and spatial layout of reference images.
    • CLIP: Captures high-level semantic and contextual information.
  • This dual-path injection strategy simultaneously ensures local geometric fidelity and global semantic consistency.
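
A minimal sketch of the dual-path cross-attention injection at a single U-Net resolution, assuming both token streams have already been projected to the U-Net channel dimension (module names and shapes are assumptions):

```python
import torch.nn as nn

class DualPathInjection(nn.Module):
    """Hypothetical sketch: VGGT tokens carry geometric structure, CLIP tokens
    carry semantics; both condition the U-Net feature map via cross-attention."""

    def __init__(self, dim: int, n_heads: int = 8):
        super().__init__()
        self.attn_geo = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.attn_sem = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x, vggt_tokens, clip_tokens):
        # x: (B, N, dim) flattened U-Net feature map at one resolution
        x = x + self.attn_geo(self.norm(x), vggt_tokens, vggt_tokens, need_weights=False)[0]
        x = x + self.attn_sem(self.norm(x), clip_tokens, clip_tokens, need_weights=False)[0]
        return x
```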

Key Design 2: Training Data Augmentation Strategy

Training the world model solely on expert demonstrations limits generalization to unseen state-action sequences. The solution:

  • A post-SFT OpenVLA-OFT policy is deployed in the LIBERO simulator for autonomous exploration.
  • A scale head is trained to predict the log-scale parameter \(\boldsymbol{\beta}_t\) of a Laplace distribution, with the VLA output \(\boldsymbol{\mu}_t\) as the location parameter: \(\mathbf{a}_t \sim \text{Laplace}(\boldsymbol{\mu}_t, \boldsymbol{\beta}_t)\).
  • Diverse trajectories encompassing both successes and failures are collected via action perturbation, then mixed with original expert trajectories for training.
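
A hedged sketch of the action-perturbation step used to collect exploration trajectories, following the Laplace parameterization above (function names are assumptions):

```python
import torch

def sample_exploration_action(mu_t, log_beta_t):
    """Sample a perturbed action around the VLA output mu_t using the learned scale head."""
    beta_t = log_beta_t.exp()                         # scale head predicts the log-scale
    dist = torch.distributions.Laplace(mu_t, beta_t)  # a_t ~ Laplace(mu_t, beta_t)
    a_t = dist.sample()                               # diverse exploration action
    log_prob = dist.log_prob(a_t).sum(-1)             # kept for later importance ratios in RL
    return a_t, log_prob
```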

Key Design 3: VLM-Guided Instant Reflector

  • A frozen visual encoder \(\mathcal{E}_{\text{vision}}\) extracts patch embeddings from video frames.
  • A frozen LLM \(\mathcal{E}_{\text{LLM}}\) performs cross-modal reasoning.
  • A lightweight reward head \(\mathcal{R}_\theta\) outputs a continuous reward: \(R(\mathbf{o}_{1:t}, \mathbf{g}) = \sigma(\mathcal{R}_\theta(h_t)) \in [0,1]\).
  • A termination signal is triggered when \(R > \eta = 0.5\), preventing redundant actions.
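
A minimal sketch of the reflector's reward and termination call under this design (module names are assumptions, not the authors' API):

```python
import torch

@torch.no_grad()
def instant_reflect(frames, goal, vision_encoder, llm, reward_head, eta=0.5):
    """Continuous reward and termination signal from the VLM-guided reflector."""
    patches = vision_encoder(frames)              # frozen visual encoder: patch embeddings
    h_t = llm(patches, goal)                      # frozen LLM: cross-modal reasoning state
    reward = torch.sigmoid(reward_head(h_t))      # continuous reward in [0, 1]
    return reward, reward > eta                   # terminate the rollout when R > eta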

Loss & Training

Reward Head Training: BCE loss is used, with per-frame binary success labels \(y_t \in \{0,1\}\) as supervision:

\[\mathcal{L} = \text{BCE}(R(\mathbf{o}_{1:t}, \mathbf{g}), y_t)\]
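
A minimal sketch of this objective; the sigmoid is folded into a logits-based BCE here for numerical stability, which is an implementation assumption:

```python
import torch.nn.functional as F

def reward_head_loss(reward_head, h_t, y_t):
    """BCE between the predicted per-frame reward and the binary success label y_t."""
    logits = reward_head(h_t).squeeze(-1)
    return F.binary_cross_entropy_with_logits(logits, y_t.float())
```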

RL Optimization: The LOOP (Leave-One-Out PPO) objective is adopted:

  • \(N=8\) rollouts are generated per initial state.
  • RLOO baseline: \(b_n = \frac{1}{N-1}\sum_{j \neq n} R_j\); advantage: \(A_n = R_n - b_n\).
  • Importance sampling ratios are computed under the Laplace action distribution.
  • PPO clipped objective: \(\mathcal{L}_{\text{PPO}} = -\min(r_{t,n} A_n, \text{clip}(r_{t,n}, 1-\epsilon, 1+\epsilon) A_n)\), with \(\epsilon = 0.1\).
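
A hedged sketch of the leave-one-out advantage and the clipped objective; the tensor layout is an assumption:

```python
import torch

def loop_ppo_loss(rewards, log_probs_new, log_probs_old, eps=0.1):
    """rewards: (N,) one return per rollout; log_probs_*: (N, T) per-step log-probs."""
    N = rewards.shape[0]
    baseline = (rewards.sum() - rewards) / (N - 1)     # leave-one-out baseline b_n
    advantage = (rewards - baseline).unsqueeze(-1)     # A_n = R_n - b_n
    ratio = (log_probs_new - log_probs_old).exp()      # importance ratio r_{t,n}
    unclipped = ratio * advantage
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps) * advantage
    return -torch.min(unclipped, clipped).mean()       # PPO clipped objective
```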

Training Details: 8×H20 GPUs, ~48h. The VLM backbone is fine-tuned with LoRA rank=32 (lr=1e-4); the action/scale heads are trained with full parameters (lr=1e-5); batch size=4.
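
An illustrative sketch of the two learning-rate groups implied by these details; the optimizer choice (AdamW) and module names are assumptions:

```python
import torch

def build_optimizer(lora_params, action_head, scale_head):
    """Two parameter groups: LoRA adapters on the VLM backbone vs. full-parameter heads."""
    return torch.optim.AdamW([
        {"params": lora_params, "lr": 1e-4},               # LoRA rank-32 adapters, lr 1e-4
        {"params": action_head.parameters(), "lr": 1e-5},   # full-parameter action head, lr 1e-5
        {"params": scale_head.parameters(), "lr": 1e-5},    # full-parameter scale head, lr 1e-5
    ])
```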

Key Experimental Results

Main Results: LIBERO Benchmark (Only 5 Demonstrations per Task)

| Method | LIBERO-Goal | LIBERO-Object | LIBERO-Spatial | LIBERO-Long | Avg. |
|---|---|---|---|---|---|
| π₀ | 67.6 | 68.4 | 80.2 | 28.2 | 61.1 |
| π₀+FAST | 59.2 | 76.8 | 59.2 | 24.8 | 55.0 |
| OpenVLA | 73.2 | 55.0 | 82.4 | 32.2 | 60.7 |
| UniVLA | 82.0 | 76.2 | 84.4 | 56.4 | 74.75 |
| OpenVLA-OFT | 84.0 | 74.2 | 84.2 | 57.0 | 74.85 |
| Ours | 86.4 | 86.6 | 87.6 | 57.8 | 79.6 |

Key Finding: Under the extreme low-data regime of only 5 demonstrations per task, World-Env outperforms the strongest SFT baseline (OpenVLA-OFT) by +4.75pp on average, with a gain of up to +12.4pp on the Object subset.

Ablation Study

| Extra Data | Reward Head | Goal | Object | Spatial | Long |
|---|---|---|---|---|---|
| ✗ | ✗ | 68.4 | 75.2 | 73.2 | 42.2 |
| ✓ | ✗ | 79.8 | 81.8 | 78.4 | 44.6 |
| ✗ | ✓ | 68.8 | 76.4 | 74.4 | 43.8 |
| ✓ | ✓ | 86.4 | 86.6 | 87.6 | 57.8 |
  • Extra Data contributes the most: Augmenting world model training with exploration data is the primary driver of performance gains (+6.3pp on average).
  • Synergistic effect is significant: Adding the Reward Head alone yields negligible improvement, but when combined with Extra Data, it provides an additional +13.2pp on the Long subset.

Termination Mechanism Comparison (Without Ground-Truth Termination Signals)

Under fair conditions without ground-truth termination signals, Ours achieves an average of 74.9% vs. OpenVLA-OFT's 63.05% (+11.85pp), validating the necessity of the dynamic termination mechanism.

Real-World Experiments

| Task | OpenVLA-OFT | Ours |
|---|---|---|
| Clean table | 20% | 30% |
| Put green toy | 30% | 50% |
| Put red toy | 30% | 40% |
| Put orange toy | 20% | 50% |

The proposed method consistently outperforms the baseline in real-world settings, validating that policies post-trained inside the world-model environment transfer to the real world.

Key Findings

  1. Only 20 RL training steps suffice to surpass the SFT baseline on multi-goal tasks.
  2. Performance is comparable to the simulator-based RL method RIPT-VLA (79.6 vs. 79.15), yet World-Env can be directly deployed in real environments.
  3. Baseline methods lacking a termination mechanism continue executing redundant actions after task completion, resulting in an average success rate drop of ~10pp.

Highlights & Insights

  1. Paradigm Innovation: This work is the first to propose replacing physical environments and conventional simulators with a world model for VLA RL post-training, opening a third pathway—safer than real environments and more flexible than traditional simulators.
  2. Dual-Path Geometry + Semantics Injection: Combining VGGT's geometry-aware features with CLIP's semantic features ensures physical consistency in generated frames, which is critical to making the world model a reliable RL environment.
  3. Continuous vs. Binary Reward: The VLM-guided instant reflector outputs a continuous reward in \([0,1]\), avoiding the issue of advantage collapsing to zero in all-success or all-failure rollouts, substantially improving training efficiency.
  4. Dynamic Termination Mechanism: This design addresses the "post-success failure" problem that has been overlooked in the VLA field; experiments demonstrate a contribution of ~10pp to the overall success rate.
  5. Extreme Data Efficiency: The framework is effective with only 5 demonstrations per task, and surpasses SFT after just 20 RL training steps.

Limitations & Future Work

  1. World Model Dependency: Both the world simulator and the instant reflector require diverse training data; exploration data currently still relies on simulator collection, meaning the framework has not fully decoupled from simulators.
  2. Training Efficiency: Policy optimization is relatively slow, with the computational overhead of trajectory generation by the simulator being the primary bottleneck (48h on 8×H20).
  3. World Model Fidelity Ceiling: A gap persists between diffusion-generated visual observations and real scenes, and errors may accumulate over long horizons.
  4. Low Real-World Success Rates: Even the best result of Ours in real-world settings reaches only 50%, indicating substantial room for improvement in transferring from the world model to real environments.
  5. Limited Task Complexity: LIBERO is a relatively simple tabletop manipulation benchmark; effectiveness on more complex tasks (e.g., dexterous hands, bimanual coordination) has yet to be validated.
Related Work

  • RIPT-VLA: RL post-training in a conventional simulator with comparable performance, but not directly applicable where no simulator exists; World-Env's use of a world model in place of the simulator is a more general solution.
  • OpenVLA-OFT: A VLA with continuous action representation, used as the backbone for RL post-training in this work, validating the two-stage SFT → RL training paradigm.
  • Genie 3 / V-JEPA 2: General-purpose world models; stronger future world models will directly improve the performance of this framework.
  • DiWA: A concurrent work that uses world models for diffusion policy adaptation, but differs from this paper in not explicitly constructing an RL interactive environment.

Insights: The idea of using world models as RL environments can be generalized to other VLA application domains such as autonomous driving and navigation; the continuous reward + dynamic termination design is also transferable to RL training of LLM agents.

Rating

  • Novelty: ⭐⭐⭐⭐ — The idea of replacing physical environments with a world model for VLA RL post-training is novel; the geometry-aware injection and dynamic termination designs are noteworthy.
  • Experimental Thoroughness: ⭐⭐⭐⭐ — Full coverage across LIBERO's four subsets, ablation studies, and real-world experiments; however, the real-world evaluation covers only 4 simple tasks.
  • Writing Quality: ⭐⭐⭐⭐ — Clear structure, well-articulated motivation, and rich figures and tables.
  • Value: ⭐⭐⭐⭐ — Proposes a practical new paradigm for VLA post-training with open-sourced code; real-world performance requires further validation.