Generative Visual Code Mobile World Models

Conference: ICML 2026
arXiv: 2602.01576
Code: Available (Paper provides Project Page, Code, gWorld 8B/32B weights, and MWMBench benchmark)
Area: LLM Agent / Multimodal VLM / Mobile GUI World Model
Keywords: Mobile GUI World Models, Renderable Code Generation, VLM Post-training, Cross-modal Re-labeling, Look-ahead Reasoning

TL;DR

The authors reframe mobile GUI world modeling as VLM generation of renderable web code. They propose an automated data synthesis pipeline that rewrites policy trajectories into training samples of the form (image state, action) \(\rightarrow\) (reasoning chain, next-state code). The resulting gWorld-8B/32B models achieve state-of-the-art performance across six in- and out-of-distribution benchmarks, improving average instruction accuracy (IAcc.) over their Qwen3-VL base models by 27–46 percentage points while cutting the average rendering failure rate to 1.4% (8B) and 0.6% (32B).

Background & Motivation

Background: Mobile GUI agents have become a prominent research direction. A mainstream approach for enhancement is the introduction of "World Models (WM)": given the current GUI state \(S_t\) and action \(A_t\), the model predicts the next state \(S_{t+1}\) to augment policies during training or perform rollout value estimation during inference. Existing WMs generally fall into two categories: (1) Text-based WMs—compressing states into text descriptions before prediction, which loses critical visual information like icons, layouts, fonts, and colors; (2) Visual-based WMs—directly generating the next GUI screenshot. For example, VIMO utilizes a five-stage pipeline involving "OCR \(\rightarrow\) box mask \(\rightarrow\) GPT-4o filtering \(\rightarrow\) self-trained diffusion model inpainting \(\rightarrow\) dual GPT-4o text backfilling."

Limitations of Prior Work: Text-based WMs sacrifice visual fidelity and cannot interface with mainstream VLM policies. Pure pixel-based visual WMs struggle in text-intensive, discrete-layout GUI scenarios: text generated by diffusion or autoregressive pixel models is often illegible, and layouts are distorted. Pipeline-based approaches such as VIMO are slow and complex, and depend on multiple calls to the closed-source GPT-4o. Furthermore, VIMO released data without weights, making replication and deployment difficult.

Key Challenge: GUI states require both pixel-level fidelity (for direct grounding by screenshots) and symbolic precision (accurate text, buttons, and lists). This dual requirement places the "direct pixel prediction" paradigm in a dilemma: image models excel at vision but fail at text generation, while text models generate accurate text but lose visual structure. Additionally, the authors observe significant visual redundancy in GUI transitions (e.g., most pixels remain unchanged during typing). Pixel models tend to learn degenerate solutions that "approximately copy \(S_t\)," showing high similarity metrics without actually modeling action semantics.

Goal: Build a single self-contained model for visual mobile GUI world modeling that simultaneously achieves: (a) pixel-level correctness for text and layout; (b) end-to-end operation without multi-model pipelines; (c) large-scale synthetic training data; (d) retention of native coordinates for actions to interface directly with real mobile execution.

Key Insight: Modern VLMs have seen vast amounts of structured web code during pre-training and are naturally proficient at generating readable text. If the "next state" is represented as renderable web code (HTML/CSS) and a browser renders it back to pixels, each component does what it is good at: the VLM's language prior ensures high-quality text and semantics, the web code prior ensures structural layout, and the renderer translates the symbolic output back to pixels. A single VLM thus handles both state-change understanding and structured state output, eliminating external dependencies.

Core Idea: Rewrite the world model from \(p_\theta(S_{t+1}^{\text{image}} \mid S_t, A_t)\) to \(p_\theta(R_t, S_{t+1}^{\text{code}} \mid S_t^{\text{image}}, A_t)\)—the VLM directly generates a "reasoning first, then code" description of the next state, with a browser responsible for rendering.
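
Read autoregressively, the "reasoning first, then code" target factorizes by the chain rule, and the predicted screenshot is recovered by rendering. The following is a sketch in our own notation: \(\mathrm{render}(\cdot)\) stands for the browser, \(\hat{S}_{t+1}^{\text{image}}\) for the rendered prediction, and \(y = (R_t, S_{t+1}^{\text{code}})\) for the concatenated token sequence; none of these symbols is taken from the paper.

\[
p_\theta(R_t, S_{t+1}^{\text{code}} \mid S_t^{\text{image}}, A_t)
= p_\theta(R_t \mid S_t^{\text{image}}, A_t)\;
  p_\theta(S_{t+1}^{\text{code}} \mid R_t, S_t^{\text{image}}, A_t),
\qquad
\hat{S}_{t+1}^{\text{image}} = \mathrm{render}\big(S_{t+1}^{\text{code}}\big)
\]

\[
\mathcal{L}(\theta) = -\sum_{i=1}^{|y|} \log p_\theta\big(y_i \mid y_{<i}, S_t^{\text{image}}, A_t\big)
\]

The second line is the usual token-level cross-entropy obtained by maximizing the likelihood of the first; it matches the standard SFT objective the summary describes under Loss & Training.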

Method

Overall Architecture

gWorld is a standard VLM (based on Qwen3-VL 8B / 32B) trained via supervised fine-tuning (SFT) to learn the above mapping. The system consists of three components: (1) Training Data Synthesis Pipeline: automatically converts existing offline policy trajectories of mobile agents into WM training samples; (2) gWorld Training: SFT on 260,000 synthetic samples so that the VLM, given a GUI screenshot and a coordinate action, outputs the reasoning chain \(R_t\) and the next-state web code \(S_{t+1}^{\text{code}}\); (3) Inference & Evaluation: renders the generated code into images with a browser and compares them with ground-truth images, using three frontier VLMs as judges for IAcc. scoring and DINO embeddings for visual similarity. The resulting system is deliberately minimal: one VLM plus one browser renderer.
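
The inference path (VLM call, split into reasoning and code, render) can be sketched as below. This is a minimal illustration, not the authors' implementation: the output convention (free-text reasoning followed by a single HTML document), the names `split_output` and `predict_next_state`, and the stand-in `fake_vlm` / `fake_renderer` are all assumptions. In practice the renderer would wrap a headless browser.

```python
import re
from dataclasses import dataclass
from typing import Callable, Optional, Tuple

# Assumption: the model emits free-text reasoning followed by one HTML document.
HTML_START = re.compile(r"<!DOCTYPE html>|<html\b", re.IGNORECASE)


@dataclass
class Prediction:
    reasoning: str    # R_t
    code: str         # S_{t+1}^code
    renderable: bool  # crude syntactic check, not a real render attempt


def split_output(text: str) -> Prediction:
    """Split raw model text into reasoning and next-state HTML."""
    match = HTML_START.search(text)
    if match is None:
        return Prediction(reasoning=text.strip(), code="", renderable=False)
    code = text[match.start():].strip()
    return Prediction(
        reasoning=text[:match.start()].strip(),
        code=code,
        renderable="</html>" in code.lower(),
    )


def predict_next_state(
    vlm: Callable[[bytes, dict], str],
    renderer: Callable[[str], bytes],
    screenshot: bytes,
    action: dict,
) -> Tuple[Prediction, Optional[bytes]]:
    """One world-model step: VLM call, parse, then render if possible."""
    pred = split_output(vlm(screenshot, action))
    image = renderer(pred.code) if pred.renderable else None
    return pred, image


# Hypothetical stand-ins so the sketch runs without a model or a browser.
def fake_vlm(screenshot: bytes, action: dict) -> str:
    return (
        "The tap opens the settings page.\n"
        "<!DOCTYPE html><html><body><h1>Settings</h1></body></html>"
    )


def fake_renderer(html: str) -> bytes:
    return f"rendered:{len(html)}".encode()


pred, image = predict_next_state(
    fake_vlm, fake_renderer, b"", {"type": "click", "x": 540, "y": 1200}
)
print(pred.reasoning)
```

Returning `None` for the image when the code fails the check keeps render failures visible to the caller, which is what an evaluation that penalizes unrenderable code needs.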

Key Designs

  1. Replacing Pixels with Renderable Web Code for Next-State Representation:

    • Function: Switches the generation target from image \(S_{t+1}^{\text{image}}\) to HTML/CSS code \(S_{t+1}^{\text{code}}\), which is then reconstructed into pixels by a deterministic renderer.
    • Mechanism: Formalizes state prediction as \(p_\theta(S_{t+1}^{\text{code}} \mid S_t^{\text{image}}, A_t)\). Using the VLM's language prior, text is verbatim correct; using web code priors, layouts and component structures benefit from strong inductive bias.
    • Design Motivation: Direct pixel models produce illegible text and distorted layouts in GUIs and often collapse into "copy-input" shortcut solutions (see §4.3: for Emu3.5 34B, output-target similarity correlates with input-target similarity at \(\rho=0.92\), indicating a near-identity mapping, whereas gWorld 32B is at \(\rho \approx 0.4\)). The code space makes structural changes explicit, forcing the VLM to understand the action in order to modify the right nodes.
  2. Cross-modal State Re-labeling: Converting Policy Trajectories to WM Training Data:

    • Function: Automatically rewrites existing mobile agent trajectories \(\{(S_t^{\text{image}}, A_t)\}_{t=1}^{T}\) into WM samples \(\{(S_t^{\text{image}}, A_t, S_{t+1}^{\text{code}})\}_{t=1}^{T-1}\) without manual code labeling.
    • Mechanism: Two steps. Step 1 reuses trajectories from AitW, GUIO, AC, and AMEX, swapping the "action at step \(t\)" supervision for the "state at step \(t+1\)" as the target. Step 2 uses a frontier model \(\pi^*\) (Gemini 3 Flash) for image-to-code re-labeling: \(S_t^{\text{code}} \leftarrow \pi^*(S_t^{\text{image}}, P^{\text{img-to-code}})\), converting ground-truth screenshots into renderable code.
    • Design Motivation: Code-based WM lacks datasets, and manual labeling is prohibitively expensive. Mobile agent trajectories, however, are abundant (over 3.7 million transitions across four sets). This step transfers "policy data" to "world model data" without loss. Scaling analysis (Fig. 5) suggests performance has not yet saturated.
  3. Look-ahead Reasoning Chain Synthesis: Splitting Modeling into "Describe Change, Then Code":

    • Function: Inserts a natural language reasoning chain \(R_t\) before \(S_{t+1}^{\text{code}}\) in SFT labels.
    • Mechanism: During synthesis, the labeling model \(\pi^*\) is allowed to "peek" at the ground-truth next state \(S_{t+1}^{\text{image}}\) to explain "what changed from \(S_t\) to \(S_{t+1}\) under action \(A_t\)," i.e., \(R_t \leftarrow \pi^*(S_t^{\text{image}}, A_t, S_{t+1}^{\text{image}}, P^{\text{look-ahead}})\). This "future-informed" reasoning aligns perfectly with the transition, providing a "high-quality teacher answer."
    • Design Motivation: Generating code in one shot is difficult because it requires understanding action effects, designing the DOM structure, and getting the syntax right simultaneously. Chain-of-thought (CoT) reasoning traces are known to improve performance on such multi-step tasks, and look-ahead keeps the reasoning consistent with the actual transition: the ablation (Fig. 6) shows look-ahead \(R_t\) consistently outperforms non-look-ahead \(R_t^*\).
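
Designs 2 and 3 together amount to a loop over consecutive trajectory steps. The sketch below shows the shape of that loop under stated assumptions: `Step`, `WMSample`, and `build_wm_samples` are hypothetical names, and the two labeler callables are stubs standing in for the frontier model \(\pi^*\) with its image-to-code and look-ahead prompts.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    image: str    # identifier of screenshot S_t^image
    action: dict  # A_t in native screen coordinates


@dataclass
class WMSample:
    image: str      # input: S_t^image
    action: dict    # input: A_t
    reasoning: str  # target part 1: look-ahead R_t
    next_code: str  # target part 2: S_{t+1}^code


def build_wm_samples(
    trajectory: List[Step],
    image_to_code: Callable[[str], str],
    look_ahead_reasoning: Callable[[str, dict, str], str],
) -> List[WMSample]:
    """Turn T policy steps into T-1 world-model training samples."""
    samples = []
    for cur, nxt in zip(trajectory, trajectory[1:]):
        # Cross-modal re-labeling: ground-truth next screenshot -> code.
        next_code = image_to_code(nxt.image)
        # Look-ahead: the labeler sees the true next state when explaining.
        reasoning = look_ahead_reasoning(cur.image, cur.action, nxt.image)
        samples.append(WMSample(cur.image, cur.action, reasoning, next_code))
    return samples


# Hypothetical stub labelers; a real pipeline would call the teacher model.
def stub_image_to_code(image: str) -> str:
    return f"<html><body><!-- relabeled from {image} --></body></html>"


def stub_look_ahead(cur_image: str, action: dict, next_image: str) -> str:
    return f"After {action['type']} on {cur_image}, the screen shows {next_image}."


trajectory = [
    Step("s1.png", {"type": "click", "x": 100, "y": 200}),
    Step("s2.png", {"type": "type", "text": "hello"}),
    Step("s3.png", {"type": "done"}),
]
samples = build_wm_samples(trajectory, stub_image_to_code, stub_look_ahead)
print(len(samples))
```

The action of the final step is never used, since there is no following screenshot to supervise it; this is why \(T\) steps yield \(T-1\) samples.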

Loss & Training

Standard SFT cross-entropy is used on sequences containing \(R_t\) and \(S_{t+1}^{\text{code}}\). Base models are Qwen3-VL 8B/32B. Total data size is 260K samples. Evaluation uses a consensus of three frontier VLM judges (GPT-5 Mini, Claude 4.5 Haiku, Gemini 3 Flash) to reduce single-judge bias, with a rule-based filter to penalize unrenderable code.
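
One plausible way to combine three judges with a render-failure filter is sketched below. The summary does not specify the aggregation rule, so the majority vote, the zero score for unrenderable outputs, and the names `score_sample` and `instruction_accuracy` are assumptions made for illustration; the lambda judges are stubs, not calls to any real model.

```python
from typing import Callable, Optional, Sequence

# A judge returns True when the predicted screen reflects the action correctly.
Judge = Callable[[bytes, bytes, dict], bool]


def score_sample(
    pred_image: Optional[bytes],
    gt_image: bytes,
    action: dict,
    judges: Sequence[Judge],
) -> int:
    """Return 1 if a strict majority of judges accept the prediction, else 0."""
    if pred_image is None:
        # Rule-based filter (assumed form): unrenderable code scores zero
        # without consulting any judge.
        return 0
    votes = sum(1 for judge in judges if judge(pred_image, gt_image, action))
    return 1 if 2 * votes > len(judges) else 0


def instruction_accuracy(scores: Sequence[int]) -> float:
    """Percentage of samples judged correct."""
    return 100.0 * sum(scores) / len(scores)


# Hypothetical stub judges: two lenient, one strict.
lenient = lambda pred, gt, action: True
strict = lambda pred, gt, action: pred == gt
judges = [lenient, lenient, strict]

scores = [
    score_sample(b"a", b"a", {}, judges),                   # all three accept
    score_sample(b"a", b"b", {}, judges),                   # two of three accept
    score_sample(None, b"b", {}, judges),                   # render failure
    score_sample(b"a", b"b", {}, [strict, strict, lenient]) # one of three accepts
]
print(instruction_accuracy(scores))
```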

Key Experimental Results

Main Results

Evaluated on 4 in-distribution (AitW, GUIO, AC, AMEX) and 2 out-of-distribution (AndroidWorld, KApps) benchmarks against 8 baselines, including image edit models and large-scale VLMs.

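| Model | Parameters | Avg. IAcc. ↑ | Avg. Render Fail ↓ | Avg. Similarity ↑ |
| --- | --- | --- | --- | --- |
| Qwen-Image-Edit | 20B | 13.4 | - | 65.2 |
| Emu3.5 | 34B | 25.8 | - | 70.5 |
| Llama 4 | 402B-A17B | 55.7 | 9.2 | 62.4 |
| Qwen3-VL | 32B | 52.5 | 11.0 | 63.3 |
| Qwen3-VL | 235B-A22B | 51.5 | 29.5 | 67.6 |
| GLM-4.6V | 106B | 67.4 | 2.5 | 69.6 |
| gWorld | 8B | 74.9 | 1.4 | 70.3 |
| gWorld | 32B | 79.6 | 0.6 | 71.4 |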

gWorld-8B outperforms Llama 4 402B-A17B (50.25× larger by total parameters) and GLM-4.6V 106B (13.25× larger) in IAcc. Compared with their Qwen3-VL base models, gWorld 8B and 32B gain +45.7 and +27.1 points respectively, and the average rendering failure rate drops to 1.4% (8B) and 0.6% (32B).
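
The size ratios and gains quoted above follow from simple arithmetic on the reported numbers, as the short check below shows. The Qwen3-VL 8B score is not listed in the table; the value computed here is only what the stated +45.7 gain implies, not a number reported in the summary.

```python
# Values copied from the results table and the accompanying text.
params_b = {"gWorld 8B": 8, "Llama 4": 402, "GLM-4.6V": 106}
iacc = {"gWorld 8B": 74.9, "gWorld 32B": 79.6, "Qwen3-VL 32B": 52.5}

# Size ratios relative to gWorld 8B (total parameters, including the MoE model).
ratio_llama = params_b["Llama 4"] / params_b["gWorld 8B"]
ratio_glm = params_b["GLM-4.6V"] / params_b["gWorld 8B"]

# Gain of gWorld 32B over its base model, in percentage points.
gain_32b = round(iacc["gWorld 32B"] - iacc["Qwen3-VL 32B"], 1)

# Base 8B score implied by the stated +45.7 gain (derived, not reported).
implied_base_8b = round(iacc["gWorld 8B"] - 45.7, 1)

print(ratio_llama, ratio_glm, gain_32b, implied_base_8b)
# prints: 50.25 13.25 27.1 29.2
```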

Ablation Study

| Configuration | Key Metric | Description |
| --- | --- | --- |
| Naive \(S_{t+1}^{\text{code}}\) synthesis (\(\pi^*\) direct prediction) | 97% Render, 94.6% IAcc. | No ground-truth pixel access |
| Ours: cross-modal re-labeling (GT) | 100% Render, 100% IAcc. | +5.4 points IAcc. |
| No look-ahead \(R_t^*\) | Lower across all 5 benchmarks | Reasoning without \(S_{t+1}\) info |
| Ours: look-ahead \(R_t\) | Higher across all 5 benchmarks | Teacher peeks at \(S_{t+1}\) |
| gWorld 8B (37K \(\rightarrow\) 240K data) | Power-law growth, \(R^2 \geq 0.94\) | Data scaling far from saturated |
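
A power-law trend of the form \(y = a x^b\) is usually checked by a linear least-squares fit in log-log space. The sketch below shows that procedure with an \(R^2\) computed on the log-transformed values. The data points are synthetic, generated from an exact power law purely to exercise the code; they are not results from the paper, and the paper's exact fitting procedure and fitted metric are not stated in this summary.

```python
import math
from typing import Sequence, Tuple


def fit_power_law(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float, float]:
    """Fit y = a * x**b by least squares on (log x, log y).

    Returns (a, b, r2), where r2 is measured in log space.
    """
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    mean_x = sum(lx) / n
    mean_y = sum(ly) / n
    sxx = sum((x - mean_x) ** 2 for x in lx)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(lx, ly))
    b = sxy / sxx
    log_a = mean_y - b * mean_x
    ss_res = sum((y - (log_a + b * x)) ** 2 for x, y in zip(lx, ly))
    ss_tot = sum((y - mean_y) ** 2 for y in ly)
    return math.exp(log_a), b, 1.0 - ss_res / ss_tot


# Synthetic illustration only: dataset sizes between 37K and 240K with a
# made-up metric following y = 2 * x**0.1 exactly.
sizes = [37_000, 74_000, 148_000, 240_000]
metric = [2.0 * s ** 0.1 for s in sizes]

a, b, r2 = fit_power_law(sizes, metric)
print(round(a, 3), round(b, 3), round(r2, 3))
```

On real measurements the fit would not be exact, and an \(R^2\) near 1 in log space is what supports reading the trend as a power law.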

Key Findings

  • Image generation models are "pseudo-strong" in GUI tasks: Emu3.5 34B outputs essentially copy the input (\(\rho=0.92\)), leading to decent similarity but poor IAcc (25.8%). gWorld actually modifies the state (\(\rho \approx 0.4\)), showing that traditional visual similarity is deceptive for GUI WMs.
  • Both synthesis steps are essential: Cross-modal re-labeling ensures renderable, semantically correct labels; look-ahead reasoning decomposes the problem. Removing either results in a performance drop.
  • Downstream performance gains: Integrating gWorld 8B into an M3A agent for \(K=3\) rollout value estimation improved success rate by +7.6 points over backbone-only and +22.4 points over using a vanilla Qwen3-VL 8B as the WM.
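
The copy-input diagnosis in the first finding can be reproduced in form with a plain correlation between two per-sample similarity scores: output-vs-target and input-vs-target. The sketch below uses Pearson correlation as an assumption, since the summary does not say which coefficient \(\rho\) denotes, and the similarity values are made up for illustration rather than taken from the paper.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Made-up similarity of each input screenshot S_t to its target S_{t+1}.
input_target_sim = [0.9, 0.7, 0.5, 0.8, 0.6]

# A model that returns its input unchanged has, by construction,
# output-target similarity equal to input-target similarity.
copier_output_sim = list(input_target_sim)

# A model that actually edits the state: its output quality depends far less
# on how similar the input already was to the target (made-up values).
editor_output_sim = [0.72, 0.70, 0.69, 0.71, 0.73]

rho_copier = pearson(copier_output_sim, input_target_sim)
rho_editor = pearson(editor_output_sim, input_target_sim)
print(rho_copier > rho_editor)
```

A correlation close to 1 means a model's similarity score is explained almost entirely by how little the screen changed, which is why high similarity alone says little about whether the action was modeled.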

Highlights & Insights

  • Paradigm Transformation: Redefining "Visual World Modeling" as "Structured Code Generation + Deterministic Rendering" leverages VLM language capabilities to bypass the inherent weaknesses of image models in rendering text and discrete layouts.
  • Teacher Peeking = High-Quality Supervision: Providing the "answer" to the labeling model to generate reasoning is significantly better than zero-shot reasoning. The student model learns the reasoning process without "cheating" during inference.
  • Data Leverage: Converting "policy trajectories" into "world modeling data" allows the use of millions of existing samples, providing a cheap scaling path for the field.

Limitations & Future Work

  • Code Space Ceiling: Information loss occurs for photo-realistic GUI states (e.g., camera previews), where gWorld 32B dropped 4.7 points. Future work may explore hybrid representations (code for structure, pixels for photos).
  • Frontier Teacher Dependence: The pipeline relies on Gemini 3 Flash. The impact of error propagation or potential for open-source teacher replacement remains to be explored.
  • Latency: Generating long reasoning and HTML sequences results in higher token counts and latency compared to direct pixel diffusion, posing challenges for on-device deployment.

Comparison with Related Work

  • vs. VIMO: VIMO uses a complex five-stage pipeline built around a diffusion model and GPT-4o calls, and its weights were not released; gWorld uses a single VLM + renderer with open weights and higher IAcc.
  • vs. Pixel Gen WMs: Pure pixel models struggle with symbolic density; this work empirically demonstrates the limitations of the pixel-only paradigm for GUIs.
  • vs. Image-to-Web-Code: While that field focuses on front-end automation, gWorld proves these capabilities can be "folded" into world modeling to predict state changes.

Rating

  • Novelty: ⭐⭐⭐⭐⭐ Reframing visual WM as renderable code is a clean, paradigm-level innovation for the GUI domain.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ 6 benchmarks, 8 frontier baselines, human evaluation, and scaling law analysis.
  • Writing Quality: ⭐⭐⭐⭐ Clear structure; the similarity analysis in Fig. 4 is particularly insightful.
  • Value: ⭐⭐⭐⭐⭐ Open-source weights and benchmarks, combined with a reproducible path for GUI agents, offer high community value.