UIPro: Unleashing Superior Interaction Capability for GUI Agents

Conference: ICCV 2025 · arXiv: 2509.17328 · Code: GitHub · Area: LLM Agent · Keywords: GUI agent, unified action space, GUI grounding, vision-language model, multi-platform interaction

TL;DR

UIPro achieves state-of-the-art GUI interaction performance across mobile, web, and desktop platforms by constructing 20.6M GUI understanding samples for pre-training and by introducing a unified action space that integrates heterogeneous GUI agent task data.

Background & Motivation

Building autonomous GUI agents capable of operating graphical interfaces as humans do is a long-standing vision in AI. The core capabilities for GUI interaction include: (1) visual understanding and grounding of GUI elements; and (2) planning and executing action sequences that fulfill user goals.

Existing approaches face two critical bottlenecks:

Insufficient data scale: Existing GUI interaction datasets typically lack sufficient scale and scenario diversity. The emergent capabilities of large-scale training do not appear at small data scales, yet large-scale datasets such as CogAgent (247M samples) and ScreenAI (421M samples) are not publicly available.

Training pipeline deficiencies: Different GUI trajectory datasets adopt heterogeneous action spaces (e.g., AITW defines swipe as DUAL_POINT(start, end), while AndroidControl uses scroll(direction)), and naively mixing them during training leads to action conflicts and performance degradation.

UIPro addresses these bottlenecks by: (1) constructing the largest open-source GUI understanding dataset (20.6M samples) to establish a strong grounding foundation for agents; and (2) designing a unified action space that integrates heterogeneous data sources and unlocks the potential of multi-source data.

Method

Overall Architecture

UIPro adopts a two-stage training paradigm:

  • Stage 1: GUI Understanding Pre-training — training on 20.6M multi-platform, multi-task GUI understanding samples to acquire strong grounding capabilities.
  • Stage 2: GUI Agent Task Fine-tuning — fine-tuning on agent trajectory data unified under the proposed action space to develop action prediction capabilities.

Two backbone models are used: UIPro-SLiME (3B, trained from scratch) and UIPro-Qwen2VL (7B, fine-tuned from Qwen2-VL).

Key Designs

  1. Large-Scale GUI Understanding Data Construction: GUI data is collected and cleaned from multiple sources (Common Crawl web pages, Android emulators, RICO, MobileViews, etc.), generating \(\langle\)screenshot, referring expression, coordinate\(\rangle\) triplets across 13 task types:

    • Element description (elemgnd/elemref): Describes visual appearance, element type, and location.
    • User intent (intentgnd): Describes how users interact with an element, e.g., "tap the password input field."
    • Contextual functionality (funcgnd/funcref): Describes interaction affordances, e.g., "this element enables the user to share content."
    • Text grounding (textgnd/OCR), icon classification (icongnd/iconref), widget listing, GUI question answering, and GUI summarization.
    • The final 20.6M samples are associated with 2.5M unique screenshots; 67% are newly annotated and 33% are cleaned from open-source data.
  2. Unified Action Space Design: To resolve conflicts arising from heterogeneous action definitions, an action superset is designed:

    • Swipe is unified as swipe(start, direction, distance), compatible with AITW's DUAL_POINT and AndroidControl's scroll(direction); a conversion sketch follows this list.
    • Separate unified action spaces are defined for mobile, web, and desktop platforms (the mobile space includes 12 actions such as tap, long_press, drag, input_text, swipe, and navigate).
    • All outputs are formatted as JSON, e.g., {"action_type": "click", "target": (x, y)}.
    • Action definitions are excluded from prompts (experiments show that omitting them improves training efficiency).
  2. Systematic Denoising Pipeline: Because raw GUI data suffers from severe noise (95.9% of home pages contain accessibility errors, and one data source has a noise rate of 29%), a seven-step denoising procedure is designed, including the following checks (two are sketched in code after this list):

    • Detection of blank elements (color standard deviation \(< 5\)).
    • OCR-based detection of invisible elements.
    • Removal of invalid, oversized, or undersized bounding boxes.
    • Removal of duplicate boxes and mismatched elements.
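
To make the swipe unification concrete, here is a minimal Python sketch of how heterogeneous gesture encodings could be mapped onto the unified swipe(start, direction, distance) action. The function names, defaults, and JSON field names are illustrative assumptions, not the paper's exact schema.

```python
import math

def from_dual_point(x1: float, y1: float, x2: float, y2: float) -> dict:
    """Convert an AITW-style DUAL_POINT gesture (start/end points) into
    the unified swipe(start, direction, distance) action."""
    dx, dy = x2 - x1, y2 - y1
    # Direction here means finger motion; datasets disagree on whether
    # "direction" encodes finger motion or content motion, which is exactly
    # the kind of conflict a unified action space must resolve.
    if abs(dx) >= abs(dy):
        direction = "right" if dx > 0 else "left"
    else:
        direction = "down" if dy > 0 else "up"
    return {"action_type": "swipe", "start": (x1, y1),
            "direction": direction, "distance": round(math.hypot(dx, dy), 1)}

def from_scroll(direction: str, screen_w: int = 1000, screen_h: int = 1000) -> dict:
    """Convert an AndroidControl-style scroll(direction) into the unified swipe.
    No start point or distance is given, so we default to the screen center
    and a medium-length gesture (an assumption, not the paper's choice)."""
    return {"action_type": "swipe", "start": (screen_w // 2, screen_h // 2),
            "direction": direction, "distance": 0.4 * screen_h}

print(from_dual_point(500, 800, 500, 300))  # direction "up", distance 500.0
print(from_scroll("down"))
```

After this mapping, both sources emit the same JSON action schema, so trajectories from AITW and AndroidControl can be mixed without the action conflicts described above.
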
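Likewise, a minimal sketch of two of the denoising checks listed above, assuming PIL/NumPy for image handling. The color-std threshold of 5 comes from the paper summary; the box-size limits are hypothetical placeholders.

```python
import numpy as np
from PIL import Image

def is_blank_element(screenshot: Image.Image, box: tuple) -> bool:
    """Flag near-uniform crops as blank elements (color std < 5)."""
    crop = np.asarray(screenshot.crop(box), dtype=np.float32)
    return crop.std() < 5.0  # std over all pixels and channels

def is_valid_box(box: tuple, img_w: int, img_h: int,
                 min_frac: float = 1e-4, max_frac: float = 0.9) -> bool:
    """Reject invalid, undersized, or oversized bounding boxes.
    min_frac/max_frac are illustrative thresholds, not the paper's values."""
    x1, y1, x2, y2 = box
    if not (0 <= x1 < x2 <= img_w and 0 <= y1 < y2 <= img_h):
        return False  # degenerate or out-of-bounds box
    area_frac = (x2 - x1) * (y2 - y1) / (img_w * img_h)
    return min_frac <= area_frac <= max_frac
```
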

Loss & Training

  • Pre-training stage: Coordinates are normalized to \([0, 1000]\) (a normalization sketch follows this list); UIPro-SLiME is trained for 1 epoch with the ViT frozen and the remaining parameters fully tuned; UIPro-Qwen2VL is fine-tuned via LoRA on a 4.4M subset.
  • Agent fine-tuning stage: Trained for 6 epochs until performance saturates; prompts include task descriptions and action histories; ground-truth actions are formatted as JSON objects.
  • Mobile: 6 data sources are mixed (380K samples); Web: 3 data sources are mixed (145K samples).
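
As a worked example of the \([0, 1000]\) coordinate normalization, the sketch below builds one \(\langle\)screenshot, referring expression, coordinate\(\rangle\)-style grounding target; the JSON field names are assumptions for illustration, not the paper's exact prompt format.

```python
import json

def make_grounding_sample(expression: str, cx_px: float, cy_px: float,
                          img_w: int, img_h: int) -> str:
    """Normalize a pixel-space target center to the [0, 1000] grid and
    serialize it as a JSON training target."""
    x = round(cx_px / img_w * 1000)
    y = round(cy_px / img_h * 1000)
    return json.dumps({"instruction": expression, "target": [x, y]})

# An element centered at (540, 1216) px on a 1080x2400 screenshot -> [500, 507]
print(make_grounding_sample("tap the password input field", 540, 1216, 1080, 2400))
```
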

Key Experimental Results

Main Results (Tables)

AITW Mobile Benchmark (Step SR%):

| Method | Scale | General | Install | GoogleApps | Single | WebShop | Overall |
|---|---|---|---|---|---|---|---|
| GPT-4V-OmniParser | - | 48.3 | 57.8 | 51.6 | 77.4 | 52.9 | 57.7 |
| SeeClick | 10B | 54.0 | 66.4 | 54.9 | 63.5 | 57.6 | 59.3 |
| OS-ATLAS | 7B | 57.9 | 63.4 | 55.5 | 79.1 | 59.7 | 63.1 |
| UIPro-Qwen2VL | 7B | 64.4 | 74.6 | 67.9 | 79.4 | 67.6 | 70.4 |
| UIPro-SLiME | 3B | 67.0 | 71.4 | 65.4 | 73.2 | 62.9 | 68.0 |

Mind2Web Web Benchmark (Step SR%):

| Method | Scale | Cross-Task | Cross-Website | Cross-Domain |
|---|---|---|---|---|
| OmniParser (GPT-4V) | - | 39.4 | 36.5 | 42.0 |
| OS-ATLAS | 7B | 36.7 | 35.7 | 37.2 |
| UIPro-Qwen2VL | 7B | 48.4 | 43.6 | 45.5 |

Ablation Study (Tables)

Effect of GUI Understanding Pre-training Data Scale:

| Pre-training Data Size | Avg. Grounding Accuracy | AITW Step SR | AndroidControl Step SR |
|---|---|---|---|
| 0 | ~30% | ~52% | ~40% |
| 5.9M | ~55% | ~63% | ~55% |
| 20.6M | ~60% | ~68% | ~61% |

Effect of Unified Action Space: Mixing data sources without unifying the action space leads to significant performance degradation across all benchmarks, primarily due to a sharp drop in action type accuracy and inconsistent swipe direction prediction.

Key Findings

  • UIPro-SLiME (3B) outperforms CogAgent (18B) and GPT-4V-OmniParser.
  • Grounding accuracy is positively correlated with downstream agent task performance, confirming that grounding is the foundation of agent capability.
  • The unified action space improves not only the accuracy of shared actions but also that of platform-specific actions (e.g., Wait), indicating cross-task knowledge transfer and a regularization effect from data diversity.
  • Both GUI understanding data and agent task data exhibit clear scaling laws.
  • Performance gains from denoising are consistently significant across all 6 grounding benchmarks.

Highlights & Insights

  • The largest open-source GUI understanding dataset (20.6M samples) covering 13 task types.
  • The unified action space design is elegant and effective — a superset accommodates different definitions, and different platforms share similar interaction principles.
  • The systematic denoising pipeline reveals the severity of GUI data quality issues (29% noise rate in one data source).
  • The introduction of functional grounding tasks (funcgnd) is a notable contribution — enabling the model to understand what an element can do rather than merely what it is.

Limitations & Future Work

  • Desktop training data is far sparser than mobile and web data, limiting UIPro's performance on Windows/macOS.
  • Evaluation is currently limited to offline settings; on-device real-time interaction evaluation remains to be explored.
  • The action space unification is still manually designed; future work could explore automatic learning of cross-platform action alignment.
  • Benchmarks such as AITW do not account for alternative valid solutions, potentially leading to underestimated evaluation scores.
  • Compared to SeeClick (5.3M samples), UIPro uses roughly 4× more data and additionally incorporates functional annotations.
  • Compared to OS-ATLAS (13.6M samples), UIPro uses about 50% more data and achieves superior performance on most benchmarks.
  • The unified action space concept is generalizable to other multi-source mixed training scenarios (e.g., action space unification in robotic manipulation).

Rating

  • Novelty: ⭐⭐⭐⭐ (systematic contributions through unified action space + large-scale data engineering)
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ (5 agent benchmarks + 6 grounding benchmarks + comprehensive ablations + transfer experiments)
  • Writing Quality: ⭐⭐⭐⭐ (clear structure, complete details)
  • Value: ⭐⭐⭐⭐⭐ (significant contributions to both data resources and methodology for the GUI agent community)