UIPro: Unleashing Superior Interaction Capability for GUI Agents

Conference: ICCV 2025 · arXiv: 2509.17328 · Code: GitHub · Area: LLM Agent · Keywords: GUI agent, unified action space, GUI grounding, vision-language model, multi-platform interaction

TL;DR

UIPro achieves state-of-the-art GUI interaction performance across mobile, web, and desktop platforms by constructing 20.6M GUI understanding samples for pre-training and by introducing a unified action space that integrates heterogeneous GUI agent task data.

Background & Motivation

Building autonomous GUI agents capable of operating graphical interfaces as humans do is a long-standing vision in AI. The core capabilities for GUI interaction include: (1) visual understanding and grounding of GUI elements; and (2) planning and executing action sequences that fulfill user goals.

Existing approaches face two critical bottlenecks:

Insufficient data scale: Existing GUI interaction datasets typically lack sufficient scale and scenario diversity. The emergent capabilities of large-scale training do not appear at small data scales, yet large-scale datasets such as CogAgent (247M samples) and ScreenAI (421M samples) are not publicly available.

Training pipeline deficiencies: Different GUI trajectory datasets adopt heterogeneous action spaces (e.g., AITW defines swipe as DUAL_POINT(start, end), while AndroidControl uses scroll(direction)), and naively mixing them during training leads to action conflicts and performance degradation.

UIPro addresses these bottlenecks by: (1) constructing the largest open-source GUI understanding dataset (20.6M samples) to establish a strong grounding foundation for agents; and (2) designing a unified action space that integrates heterogeneous data sources and unlocks the potential of multi-source data.

Method

Overall Architecture

UIPro adopts a two-stage training paradigm:

  • Stage 1: GUI Understanding Pre-training — training on 20.6M multi-platform, multi-task GUI understanding samples to acquire strong grounding capabilities.
  • Stage 2: GUI Agent Task Fine-tuning — fine-tuning on agent trajectory data unified under the proposed action space to develop action prediction capabilities.

Two backbone models are used: UIPro-SLiME (3B, trained from scratch) and UIPro-Qwen2VL (7B, fine-tuned from Qwen2-VL).

Key Designs

  1. Large-Scale GUI Understanding Data Construction: GUI data is collected and cleaned from multiple sources (Common Crawl web pages, Android emulators, RICO, MobileViews, etc.), generating \(\langle\)screenshot, referring expression, coordinate\(\rangle\) triplets across 13 task types:

    • Element description (elemgnd/elemref): Describes visual appearance, element type, and location.
    • User intent (intentgnd): Describes how users interact with an element, e.g., "tap the password input field."
    • Contextual functionality (funcgnd/funcref): Describes interaction affordances, e.g., "this element enables the user to share content."
    • Text grounding (textgnd/OCR), icon classification (icongnd/iconref), widget listing, GUI question answering, and GUI summarization.
    • The final 20.6M samples are associated with 2.5M unique screenshots; 67% are newly annotated and 33% are cleaned from open-source data.
  2. Unified Action Space Design: To resolve conflicts arising from heterogeneous action definitions, an action superset is designed:

    • Swipe is unified as swipe(start, direction, distance), compatible with AITW's DUAL_POINT and AndroidControl's scroll(direction); a conversion sketch follows this list.
    • Separate unified action spaces are defined for mobile, web, and desktop platforms (the mobile space includes 12 actions such as tap, long_press, drag, input_text, swipe, and navigate).
    • All outputs are formatted as JSON, e.g., {"action_type": "click", "target": (x, y)}.
    • Action definitions are excluded from prompts (experiments show that omitting them improves training efficiency).
  2. Systematic Denoising Pipeline: Because raw GUI data suffers from severe noise (95.9% of home pages contain accessibility errors, and one data source has a noise rate of 29%), a seven-step denoising procedure is designed, including the following checks (two are sketched in code after this list):

    • Detection of blank elements (color standard deviation \(< 5\)).
    • OCR-based detection of invisible elements.
    • Removal of invalid, oversized, or undersized bounding boxes.
    • Removal of duplicate boxes and mismatched elements.
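
To make the swipe unification concrete, here is a minimal Python sketch of how heterogeneous gesture encodings could be mapped onto the unified swipe(start, direction, distance) action. The function names, defaults, and JSON field names are illustrative assumptions, not the paper's exact schema.

```python
import math

def from_dual_point(x1: float, y1: float, x2: float, y2: float) -> dict:
    """Convert an AITW-style DUAL_POINT gesture (start/end points) into
    the unified swipe(start, direction, distance) action."""
    dx, dy = x2 - x1, y2 - y1
    # Direction here means finger motion; datasets disagree on whether
    # "direction" encodes finger motion or content motion, which is exactly
    # the kind of conflict a unified action space must resolve.
    if abs(dx) >= abs(dy):
        direction = "right" if dx > 0 else "left"
    else:
        direction = "down" if dy > 0 else "up"
    return {"action_type": "swipe", "start": (x1, y1),
            "direction": direction, "distance": round(math.hypot(dx, dy), 1)}

def from_scroll(direction: str, screen_w: int = 1000, screen_h: int = 1000) -> dict:
    """Convert an AndroidControl-style scroll(direction) into the unified swipe.
    No start point or distance is given, so we default to the screen center
    and a medium-length gesture (an assumption, not the paper's choice)."""
    return {"action_type": "swipe", "start": (screen_w // 2, screen_h // 2),
            "direction": direction, "distance": 0.4 * screen_h}

print(from_dual_point(500, 800, 500, 300))  # direction "up", distance 500.0
print(from_scroll("down"))
```

After this mapping, both sources emit the same JSON action schema, so trajectories from AITW and AndroidControl can be mixed without the action conflicts described above.
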
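Likewise, a minimal sketch of two of the denoising checks listed above, assuming PIL/NumPy for image handling. The color-std threshold of 5 comes from the paper summary; the box-size limits are hypothetical placeholders.

```python
import numpy as np
from PIL import Image

def is_blank_element(screenshot: Image.Image, box: tuple) -> bool:
    """Flag near-uniform crops as blank elements (color std < 5)."""
    crop = np.asarray(screenshot.crop(box), dtype=np.float32)
    return crop.std() < 5.0  # std over all pixels and channels

def is_valid_box(box: tuple, img_w: int, img_h: int,
                 min_frac: float = 1e-4, max_frac: float = 0.9) -> bool:
    """Reject invalid, undersized, or oversized bounding boxes.
    min_frac/max_frac are illustrative thresholds, not the paper's values."""
    x1, y1, x2, y2 = box
    if not (0 <= x1 < x2 <= img_w and 0 <= y1 < y2 <= img_h):
        return False  # degenerate or out-of-bounds box
    area_frac = (x2 - x1) * (y2 - y1) / (img_w * img_h)
    return min_frac <= area_frac <= max_frac
```
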

Loss & Training

  • Pre-training stage: Coordinates are normalized to \([0, 1000]\) (a normalization sketch follows this list); UIPro-SLiME is trained for 1 epoch with the ViT frozen and the remaining parameters fully tuned; UIPro-Qwen2VL is fine-tuned via LoRA on a 4.4M subset.
  • Agent fine-tuning stage: Trained for 6 epochs until performance saturates; prompts include task descriptions and action histories; ground-truth actions are formatted as JSON objects.
  • Mobile: 6 data sources are mixed (380K samples); Web: 3 data sources are mixed (145K samples).
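
As a worked example of the \([0, 1000]\) coordinate normalization, the sketch below builds one \(\langle\)screenshot, referring expression, coordinate\(\rangle\)-style grounding target; the JSON field names are assumptions for illustration, not the paper's exact prompt format.

```python
import json

def make_grounding_sample(expression: str, cx_px: float, cy_px: float,
                          img_w: int, img_h: int) -> str:
    """Normalize a pixel-space target center to the [0, 1000] grid and
    serialize it as a JSON training target."""
    x = round(cx_px / img_w * 1000)
    y = round(cy_px / img_h * 1000)
    return json.dumps({"instruction": expression, "target": [x, y]})

# An element centered at (540, 1216) px on a 1080x2400 screenshot -> [500, 507]
print(make_grounding_sample("tap the password input field", 540, 1216, 1080, 2400))
```
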

Key Experimental Results

Main Results (Tables)

AITW Mobile Benchmark (Step SR%):

| Method | Scale | General | Install | GoogleApps | Single | WebShop | Overall |
|---|---|---|---|---|---|---|---|
| GPT-4V-OmniParser | - | 48.3 | 57.8 | 51.6 | 77.4 | 52.9 | 57.7 |
| SeeClick | 10B | 54.0 | 66.4 | 54.9 | 63.5 | 57.6 | 59.3 |
| OS-ATLAS | 7B | 57.9 | 63.4 | 55.5 | 79.1 | 59.7 | 63.1 |
| UIPro-Qwen2VL | 7B | 64.4 | 74.6 | 67.9 | 79.4 | 67.6 | 70.4 |
| UIPro-SLiME | 3B | 67.0 | 71.4 | 65.4 | 73.2 | 62.9 | 68.0 |

Mind2Web Web Benchmark (Step SR%):

| Method | Scale | Cross-Task | Cross-Website | Cross-Domain |
|---|---|---|---|---|
| OmniParser (GPT-4V) | - | 39.4 | 36.5 | 42.0 |
| OS-ATLAS | 7B | 36.7 | 35.7 | 37.2 |
| UIPro-Qwen2VL | 7B | 48.4 | 43.6 | 45.5 |

Ablation Study (Tables)

Effect of GUI Understanding Pre-training Data Scale:

| Pre-training Data Size | Avg. Grounding Accuracy | AITW Step SR | AndroidControl Step SR |
|---|---|---|---|
| 0 | ~30% | ~52% | ~40% |
| 5.9M | ~55% | ~63% | ~55% |
| 20.6M | ~60% | ~68% | ~61% |

Effect of Unified Action Space: Mixing data sources without unifying the action space leads to significant performance degradation across all benchmarks, primarily due to a sharp drop in action type accuracy and inconsistent swipe direction prediction.

Key Findings

  • UIPro-SLiME (3B) outperforms CogAgent (18B) and GPT-4V-OmniParser.
  • Grounding accuracy is positively correlated with downstream agent task performance, confirming that grounding is the foundation of agent capability.
  • The unified action space improves not only the accuracy of shared actions but also that of platform-specific actions (e.g., Wait), indicating cross-task knowledge transfer and a regularization effect from data diversity.
  • Both GUI understanding data and agent task data exhibit clear scaling laws.
  • Performance gains from denoising are consistently significant across all 6 grounding benchmarks.

Highlights & Insights

  • The largest open-source GUI understanding dataset (20.6M samples) covering 13 task types.
  • The unified action space design is elegant and effective — a superset accommodates different definitions, and different platforms share similar interaction principles.
  • The systematic denoising pipeline reveals the severity of GUI data quality issues (29% noise rate in one data source).
  • The introduction of functional grounding tasks (funcgnd) is a notable contribution — enabling the model to understand what an element can do rather than merely what it is.

Limitations & Future Work

  • Desktop training data is far sparser than mobile and web data, limiting UIPro's performance on Windows/macOS.
  • Evaluation is currently limited to offline settings; on-device real-time interaction evaluation remains to be explored.
  • The action space unification is still manually designed; future work could explore automatic learning of cross-platform action alignment.
  • Benchmarks such as AITW do not account for alternative valid solutions, potentially leading to underestimated evaluation scores.
  • Compared to SeeClick (5.3M samples), UIPro uses roughly 4× more data and additionally incorporates functional annotations.
  • Compared to OS-ATLAS (13.6M samples), UIPro uses about 50% more data and achieves superior performance on most benchmarks.
  • The unified action space concept is generalizable to other multi-source mixed training scenarios (e.g., action space unification in robotic manipulation).

Rating

  • Novelty: ⭐⭐⭐⭐ (systematic contributions through unified action space + large-scale data engineering)
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ (5 agent benchmarks + 6 grounding benchmarks + comprehensive ablations + transfer experiments)
  • Writing Quality: ⭐⭐⭐⭐ (clear structure, complete details)
  • Value: ⭐⭐⭐⭐⭐ (significant contributions to both data resources and methodology for the GUI agent community)