OS-Genesis: Automating GUI Agent Trajectory Construction via Reverse Task Synthesis

Conference: ACL 2025
arXiv: 2412.19723
Code: OS-Genesis Homepage
Area: GUI Agent / Data Synthesis
Keywords: GUI Agent, Trajectory Synthesis, Reverse Task Synthesis, VLM, Reward Model

TL;DR

This work proposes OS-Genesis, an interaction-driven pipeline for synthesizing GUI agent trajectories. The agent first explores and interacts with the environment, and tasks are then derived in reverse from those interactions (Reverse Task Synthesis). A Trajectory Reward Model (TRM) grades the resulting trajectories for quality. The pipeline produces high-quality, diverse training trajectories and nearly doubles performance on AndroidWorld relative to task-driven synthesis baselines.

Background & Motivation

Key Challenge: VLM-based GUI agents require high-quality trajectory data for training. However, existing data collection methods face severe bottlenecks: manual annotation is expensive and inefficient, while task-driven synthesis from pre-defined tasks is limited in data diversity and quality.

Limitations of Prior Work: (1) Manual collection requires annotators to log complete trajectories and manually pre-define high-level tasks, which is high-cost and limited in scale; (2) task-driven model synthesis heavily relies on pre-defined high-level tasks, restricting scalability and diversity; (3) intermediate step errors or task goal mismatches lead to incomplete or semantically incoherent synthetic trajectories.

Core Idea: Mimic how humans learn GUI interaction: first explore what an application can do (interaction-driven), then synthesize meaningful tasks in reverse from the operations that were actually executed. This paradigm naturally bridges the gap between abstract instructions and the dynamic characteristics of GUIs.

Method

Overall Architecture

OS-Genesis consists of three core phases: (1) Interaction-Driven Capability Discovery—traversing UI elements in an online environment without human intervention; (2) Reverse Task Synthesis—deriving low-level instructions from collected interaction triplets, then synthesizing them into high-level tasks; (3) Trajectory Reward Model (TRM)—conducting graded quality assessment and weighted sampling of synthetic trajectories for training.
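The three phases can be read as a simple data flow. The sketch below is only an illustration of that flow, not the authors' implementation: the `Triplet` and `Trajectory` classes and the four callables (`to_low_level`, `to_high_level`, `execute`, `score`) are hypothetical placeholders standing in for the GPT-4o calls and the reward model described in these notes.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Triplet:
    """One exploration record: <s_pre, a, s_post>."""
    s_pre: str   # identifier of the screenshot before the action
    action: str  # e.g. "CLICK(settings_icon)"
    s_post: str  # identifier of the screenshot after the action


@dataclass
class Trajectory:
    high_level_task: str
    steps: List[str]
    reward: int = 0  # graded 1-5 score, filled in by the reward model


def run_pipeline(
    triplets: List[Triplet],
    to_low_level: Callable[[Triplet], str],        # hypothetical: triplet -> low-level instruction
    to_high_level: Callable[[Triplet, str], str],  # hypothetical: low-level -> high-level task
    execute: Callable[[str], List[str]],           # hypothetical: run the task, return action steps
    score: Callable[[Trajectory], int],            # hypothetical: trajectory reward model
) -> List[Trajectory]:
    """Phase 2 and 3 applied to triplets already collected in phase 1."""
    trajectories = []
    for triplet in triplets:
        low = to_low_level(triplet)                 # reverse task synthesis, low level
        high = to_high_level(triplet, low)          # reverse task synthesis, high level
        traj = Trajectory(high_level_task=high, steps=execute(high))
        traj.reward = score(traj)                   # graded quality assessment
        trajectories.append(traj)
    return trajectories
```

Keeping each stage behind a callable makes it easy to swap a stub for a real model call when experimenting with the flow.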

Key Designs

  1. Interaction-Driven Capability Discovery: Performs rule-based traversal of UI elements (CLICK, TYPE, SCROLL) in Android emulators and Chrome browsers. GPT-4o is used only to generate contextually appropriate text when an input box is encountered. The traversal collects a large number of triplets \(\langle s_{pre}, a, s_{post} \rangle\), each consisting of the screenshots before and after an interaction plus the executed action.

  2. Reverse Task Synthesis: A two-level generation process. (a) Low-level: GPT-4o is employed to derive atomic operation descriptions \(\tau_{low}\) (e.g., "click the dropdown menu to display options") from each triplet; (b) High-level: Low-level tasks are mapped to broader user intentions \(\tau_{high}\) (e.g., "regularize app settings"). These high-level instructions are then used to drive GPT-4o execution in the environment to collect complete trajectories.

  3. Trajectory Reward Model (TRM): Unlike traditional binary labeler functions (keep/discard), TRM outputs a graded reward (1-5) for each trajectory, evaluating both completion and coherence. During training, trajectories are sampled with probability \(P(g_i) = R_i / \sum_{k=1}^N R_k\), permitting incomplete but partially valuable trajectories to contribute to training.
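The sampling rule \(P(g_i) = R_i / \sum_{k=1}^N R_k\) can be sketched in a few lines. This is a minimal illustration, assuming rewards are the 1-5 TRM scores and that sampling is done with replacement (the notes do not say which); the function names are made up for this example.

```python
import random
from typing import List, Sequence


def sampling_probabilities(rewards: Sequence[float]) -> List[float]:
    """Normalize rewards so that P(g_i) = R_i / sum_k R_k."""
    total = sum(rewards)
    if total <= 0:
        raise ValueError("rewards must sum to a positive value")
    return [r / total for r in rewards]


def sample_trajectories(trajectories: Sequence[str],
                        rewards: Sequence[float],
                        k: int,
                        seed: int = 0) -> List[str]:
    """Draw k trajectories with probability proportional to reward.

    Assumption: sampling with replacement, via random.choices.
    """
    rng = random.Random(seed)
    return rng.choices(list(trajectories), weights=list(rewards), k=k)


# Four trajectories with TRM scores 5, 3, 1, 1.
probs = sampling_probabilities([5, 3, 1, 1])
batch = sample_trajectories(["t0", "t1", "t2", "t3"], [5, 3, 1, 1], k=8)
```

With these scores the low-reward trajectories `t2` and `t3` each keep a 10% chance of being drawn, whereas a binary keep/discard filter would likely drop them entirely.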

Loss & Training

Two complementary SFT objectives:

  • Planning Training: \(\mathcal{L}_1 = -\sum \log(p_\theta(\ell | s, h_i, c) \cdot p_\theta(a | s, h_i, c, \ell))\), predicting both low-level instructions and actions.
  • Action Training: \(\mathcal{L}_2 = -\sum \log p_\theta(a | s, c, \ell)\), predicting the action given the low-level instruction.
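To make the two objectives concrete, the sketch below evaluates them on hand-picked probabilities. It assumes the model's per-step probabilities are already available as plain floats; in real training they would come from token-level log-likelihoods, which are not modeled here. The function names are illustrative.

```python
import math
from typing import List, Tuple


def planning_loss(step_probs: List[Tuple[float, float]]) -> float:
    """L1: each tuple holds (p(l | s, h_i, c), p(a | s, h_i, c, l)) for one step."""
    return -sum(math.log(p_instr * p_action) for p_instr, p_action in step_probs)


def action_loss(action_probs: List[float]) -> float:
    """L2: each entry is p(a | s, c, l), the action given the low-level instruction."""
    return -sum(math.log(p) for p in action_probs)


# Two steps of a toy trajectory.
steps = [(0.5, 0.5), (0.8, 0.25)]
l1 = planning_loss(steps)
l2 = action_loss([0.5, 0.25])
```

Because the log of a product is a sum of logs, \(\mathcal{L}_1\) decomposes into an instruction term plus an action term, so it penalizes a wrong low-level instruction and a wrong action separately.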

Key Experimental Results

Main Results (AndroidWorld Success Rate)

| Base Model | Zero-Shot | Task-Driven | Self-Instruct | OS-Genesis |
|---|---|---|---|---|
| GPT-4o (M3A) | 23.70 | - | - | - |
| InternVL2-4B | 0.00 | 4.02 | 7.14 | 15.18 |
| InternVL2-8B | 2.23 | 4.46 | 5.36 | 16.96 |
| Qwen2-VL-7B | 0.89 | 6.25 | 9.82 | 17.41 |

WebArena Success Rate

| Base Model | Zero-Shot | Task-Driven | Self-Instruct | OS-Genesis |
|---|---|---|---|---|
| InternVL2-4B | 0.00 | 4.98 | 5.81 | 7.88 |
| InternVL2-8B | 0.00 | 4.56 | 7.05 | 9.96 |
| Qwen2-VL-7B | 7.47 | 7.05 | 5.39 | 10.79 |

Ablation Study

| Analysis Dimension | Findings |
|---|---|
| Data Diversity | OS-Genesis achieves the highest cosine distance in both instruction and trajectory dimensions, outperforming human data in trajectory diversity. |
| TRM vs Labeler | Graded TRM sampling outperforms binary labeler filtering, particularly for high-level planning tasks. |
| Data Scale | Performance improves as the volume of data grows and begins to saturate around 1K trajectories. |
| Comparison with Human Data | Instructions from OS-Genesis can outperform human-written ones, since predefined tasks may not match the dynamic environment. |
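The diversity finding is stated in terms of cosine distance. As a rough illustration, the sketch below computes the average pairwise cosine distance over a set of embedding vectors. Averaging over all pairs is an assumption of this example, as is the choice of embedding model (not shown); the notes only say that cosine distance was used.

```python
import math
from itertools import combinations
from typing import List, Sequence


def cosine_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """1 - cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine distance is undefined for zero vectors")
    return 1.0 - dot / (norm_u * norm_v)


def average_pairwise_distance(embeddings: List[Sequence[float]]) -> float:
    """Mean cosine distance over all unordered pairs (assumed diversity score)."""
    pairs = list(combinations(embeddings, 2))
    if not pairs:
        raise ValueError("need at least two embeddings")
    return sum(cosine_distance(u, v) for u, v in pairs) / len(pairs)


# Orthogonal vectors are maximally diverse among non-negative embeddings;
# parallel vectors have zero distance.
diverse = average_pairwise_distance([[1.0, 0.0], [0.0, 1.0]])
redundant = average_pairwise_distance([[1.0, 0.0], [2.0, 0.0]])
```

A higher score means the instructions (or trajectories) point in more varied directions in embedding space.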

Key Findings

  • OS-Genesis raises the AndroidWorld success rate of Qwen2-VL-7B from 9.82% (the Self-Instruct baseline) to 17.41%, nearly doubling execution performance while using only 1K trajectories (compared to 1.5K for Self-Instruct).
  • Under OOD evaluations on AndroidControl (where only 20 out of 833 apps were included in synthetic training data), OS-Genesis demonstrates strong generalization capabilities.
  • While human-written instructions exhibit high diversity, their corresponding trajectory diversity is low, as humans tend to reuse familiar operational paths. OS-Genesis achieves high diversity across both dimensions.
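The "nearly doubling" claim can be checked directly against the AndroidWorld table. The short script below recomputes the ratio of OS-Genesis to Self-Instruct success rates using only the numbers reported in these notes; the dictionary layout and function name are just for this example.

```python
# AndroidWorld success rates (%) copied from the main results table.
androidworld = {
    "InternVL2-4B": {"self_instruct": 7.14, "os_genesis": 15.18},
    "InternVL2-8B": {"self_instruct": 5.36, "os_genesis": 16.96},
    "Qwen2-VL-7B": {"self_instruct": 9.82, "os_genesis": 17.41},
}


def relative_gain(baseline: float, new: float) -> float:
    """Ratio of the new success rate to the baseline success rate."""
    return new / baseline


gains = {
    model: round(relative_gain(r["self_instruct"], r["os_genesis"]), 2)
    for model, r in androidworld.items()
}
# gains == {"InternVL2-4B": 2.13, "InternVL2-8B": 3.16, "Qwen2-VL-7B": 1.77}
```

Qwen2-VL-7B improves by about 1.77x over Self-Instruct, while both InternVL2 models more than double.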

Highlights & Insights

  • Shifts the paradigm of GUI trajectory construction from "task-driven" to "interaction-driven", significantly enhancing data diversity and quality.
  • The intuition of reverse task synthesis is clear and elegant—exploring first and reverse-engineering tasks after, which naturally aligns with the dynamic nature of GUI environments.
  • Graded evaluation in TRM avoids data waste caused by simply discarding incomplete trajectories.

Limitations & Future Work

  • The capability discovery phase relies on GPT-4o for text generation in input fields and reverse task synthesis, introducing high execution costs.
  • Absolute performance on WebArena still lags well behind GPT-4o Zero-Shot (16.25%), indicating room for improvement on complex web tasks when training on synthesized data.
  • Primarily supports CLICK, TYPE, and SCROLL actions, failing to cover more complex interactive operations such as dragging and gestures.
  • Gains from scaling up data saturate, bounded by the intrinsic limitations of the base VLM and the quality of the trajectories executed by GPT-4o.

Related Work

  • GUI Agent Data: AndroidControl (Li et al., 2024) provides human-annotated mobile trajectories; AgentTrek (Lai et al., 2024) utilizes predefined tasks to drive trajectory synthesis.
  • GUI Agent Systems: M3A (Rawles et al., 2024) is a GPT-4o-based Android agent; CogAgent (Hong et al., 2024) is fine-tuned based on a VLM.
  • Reverse Task Synthesis Conceptualization: Unlike Self-Instruct (Wang et al., 2023), which generates tasks directly from an LLM, OS-Genesis derives tasks in reverse from real interactions with the environment, so the tasks align more closely with actual GUI capabilities.

Rating

| Dimension | Score (1-10) |
|---|---|
| Novelty | 8 |
| Utility | 8 |
| Experimental Thoroughness | 9 |
| Writing Quality | 8 |
| Overall Rating | 8 |