AutoGUI: Scaling GUI Grounding with Automatic Functionality Annotations from LLMs

Conference: ACL 2025
arXiv: 2502.01977
Code: Project Page
Area: GUI Understanding / VLM
Keywords: GUI grounding, functionality annotation, VLM, automatic pipeline, UI understanding

TL;DR

This work proposes AutoGUI, an automatic annotation pipeline. It simulates interactions to compare UI states before and after each action, infers element functionality with LLMs, and applies dual-LLM verification and filtering to construct a high-quality UI functionality dataset of 704K annotations. The annotation accuracy of 96.7% is comparable to that of trained human annotators (95.5%). Fine-tuning VLMs on the dataset substantially improves their UI grounding and shows clear data scaling effects.

Background & Motivation

- Background: VLMs hold great potential for UI understanding, but they are bottlenecked by a shortage of data: UI datasets are far smaller than natural image datasets.
- Limitations of Prior Work: Existing annotations mostly consist of element alt-text or simple user intentions and lack contextualized functional descriptions (e.g., two identical magnifying glass icons could mean "search" in one context and "zoom" in another).
- Key Challenge: High-quality, large-scale, contextualized annotations of UI element functionality are needed, but human annotation is prohibitively expensive.
- Goal: Design a fully automatic annotation pipeline that generates high-quality functional descriptions of UI elements at scale without human intervention.
- Key Insight: Infer functionality by looking at "what happens to the UI after an element is clicked", similar to how humans explore an unfamiliar interface.
- Core Idea: Use LLMs to compare UI states before and after an interaction and infer element functionality automatically, with a dual-LLM quality control mechanism to ensure accuracy.

Method

Overall Architecture

A three-stage pipeline: (1) Automatically crawl Web/Android UI interaction trajectories; (2) Perform LLM functionality inference, LLM-aided rejection, and dual-LLM verification; (3) Convert data into grounding/referring tasks to fine-tune VLMs.
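
Stage (3) can be pictured with a minimal sketch. It assumes each verified annotation pairs a functionality description with a normalized click point. The `Annotation` class and both prompt templates are hypothetical illustrations, not the wording used by AutoGUI.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass
class Annotation:
    """Hypothetical record for one verified functionality annotation."""
    functionality: str
    point: Tuple[int, int]  # element center, assumed already normalized


def to_grounding_sample(a: Annotation) -> Dict[str, str]:
    """Grounding task: functionality description in, element location out."""
    x, y = a.point
    return {
        "prompt": f"Locate the element with this functionality: {a.functionality}",
        "answer": f"({x},{y})",
    }


def to_referring_sample(a: Annotation) -> Dict[str, str]:
    """Referring task: element location in, functionality description out."""
    x, y = a.point
    return {
        "prompt": f"Describe the functionality of the element at ({x},{y}).",
        "answer": a.functionality,
    }
```

Under this assumption, a single annotation yields two training samples, one for each task direction.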

Key Designs

Functionality Inference Based on UI State Changes:
- Function: Use differences in UI AXTree before and after interactions to infer element functionality.
- Mechanism: $f = \text{LLM}(p_{\text{anno}}, s_t, s_{t+1})$, where $f$ is the inferred functionality description and $s_t$, $s_{t+1}$ are the UI states before and after the interaction. difflib generates line-level differences (additions, deletions, shifts, and attribute updates) between the two AXTrees. The LLM analyzes these changes and summarizes the functionality via Chain-of-Thought.
- Design Motivation: Instead of relying solely on the visual appearance of an element, this method prioritizes "what happens after clicking." For instance, if clicking a magnifying glass icon opens a search bar, its function is search; if it displays a zoom slider, its function is zoom.
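
The diffing step above can be sketched with Python's standard `difflib`. This is a minimal illustration under stated assumptions: the AXTree text format, the toy trees, and the prompt wording in `build_annotation_prompt` are hypothetical, and the actual LLM call is omitted.

```python
import difflib


def axtree_diff(before: str, after: str) -> str:
    """Return a line-level unified diff between two AXTree text dumps."""
    lines = difflib.unified_diff(
        before.splitlines(),
        after.splitlines(),
        fromfile="s_t",
        tofile="s_t+1",
        lineterm="",
    )
    return "\n".join(lines)


def build_annotation_prompt(diff_text: str) -> str:
    """Hypothetical prompt asking an LLM to reason step by step about the diff."""
    return (
        "An element was clicked and the UI changed as shown in the diff below.\n"
        "Think step by step about what changed, then summarize the element's "
        "functionality in one sentence.\n\n" + diff_text
    )


# Toy AXTrees (assumed format): clicking a magnifying-glass icon reveals a search box.
BEFORE = "[1] button 'magnifier'\n[2] heading 'Products'"
AFTER = "[1] button 'magnifier'\n[3] textbox 'Search products'\n[2] heading 'Products'"

if __name__ == "__main__":
    print(build_annotation_prompt(axtree_diff(BEFORE, AFTER)))
```

A plain unified diff reports only added and removed lines, so an attribute update appears as a removal followed by an addition. Detecting shifts would need extra handling that this sketch does not attempt.
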
LLM-aided Rejection (Invalid Sample Filtering):
- Function: The LLM evaluates whether the state changes produced by the interaction are sufficient to infer functionality.
- Mechanism: $\text{score} = \text{LLM}(p_{\text{reject}}, e, s_t, s_{t+1})$, scoring based on three criteria (clarity of change, relevance, and predictability) and discarding the bottom 30%.
- Design Motivation: Not all interactions produce meaningful state changes, such as pages that fail to load properly or redirects that require logging in.
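
The filtering rule above can be sketched as follows. This assumes each sample already has a single numeric score from the rejector LLM. How the three criteria are combined into that score is not stated here, so the function takes the scores as given. The name `reject_low_scores` is illustrative.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def reject_low_scores(
    samples: Sequence[T], scores: Sequence[float], drop_fraction: float = 0.3
) -> List[T]:
    """Drop the lowest-scoring fraction of samples, keeping the original order."""
    if len(samples) != len(scores):
        raise ValueError("samples and scores must have the same length")
    n_drop = int(len(samples) * drop_fraction)
    # Indices sorted by ascending score; sorted() is stable, so ties keep input order.
    order = sorted(range(len(samples)), key=lambda i: scores[i])
    dropped = set(order[:n_drop])
    return [s for i, s in enumerate(samples) if i not in dropped]
```

For ten samples and `drop_fraction=0.3`, the three lowest-scoring samples are removed and the other seven are returned in their original order.
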
Dual-LLM Verification (Annotation Quality Control):
- Function: Use two different LLMs to cross-validate whether the functionality annotation is correct.
- Mechanism: Llama-3-70B and Mistral-7B separately score the annotations, retaining only those that receive full marks from both.
- Design Motivation: A single LLM may exhibit systematic biases (such as incorrectly describing dropdown menus). Cross-validation with two different LLMs allows them to correct each other.
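
The retention rule above amounts to a logical AND over the two verifiers. In this minimal sketch the verifiers are passed in as plain callables standing in for the two LLMs. The score scale is not given in these notes, so `full_mark=3` is an assumed placeholder.

```python
from typing import Callable, List

# A verifier maps an annotation to an integer score (stand-in for an LLM call).
Verifier = Callable[[str], int]


def dual_verify(
    annotations: List[str],
    verifier_a: Verifier,
    verifier_b: Verifier,
    full_mark: int = 3,  # assumed maximum score; the real scale may differ
) -> List[str]:
    """Keep only annotations that both verifiers give the full mark."""
    kept = []
    for ann in annotations:
        if verifier_a(ann) == full_mark and verifier_b(ann) == full_mark:
            kept.append(ann)
    return kept
```

Requiring agreement from both verifiers trades recall for precision: an annotation that either model doubts is discarded.
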

Loss & Training

The VLMs are fine-tuned for 1 epoch on 8×A100 GPUs. Qwen-VL and Qwen2-VL are fine-tuned with LoRA, whereas SliME is trained with only its visual encoder frozen. Coordinates are normalized to the range [0, 999].
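
One way to map pixel coordinates into [0, 999] is sketched below. The truncation and clamping rule is an assumption, since these notes do not give the exact rounding. The name `normalize_point` is illustrative.

```python
from typing import Tuple


def normalize_point(
    x: float, y: float, width: int, height: int, bins: int = 1000
) -> Tuple[int, int]:
    """Map a pixel coordinate to integer bins in [0, bins - 1]."""
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")

    def scale(value: float, size: int) -> int:
        # Truncate to an integer bin, then clamp so the far edge maps to bins - 1.
        return min(bins - 1, max(0, int(value / size * bins)))

    return scale(x, width), scale(y, height)
```

Normalizing this way makes the target coordinates independent of screenshot resolution, so the same output format works for Web and Android screens of different sizes.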

Key Experimental Results

Main Results

Fine-tuned UI grounding accuracy (%):

| Model | FuncPred | ScreenSpot | ScreenSpot-v2 | MoTIF | VWB EG |
| --- | --- | --- | --- | --- | --- |
| Qwen-VL (Baseline) | 3.0 | 5.2 | 5.6 | 7.8 | 1.7 |
| Qwen-VL + AutoGUI | 48.7 (+45.7) | 41.2 (+36.0) | 40.2 (+34.6) | 44.0 (+36.2) | 42.1 (+40.4) |
| Qwen2-VL (Baseline) | 38.7 | 66.4 | 66.9 | 71.1 | 55.9 |
| Qwen2-VL + AutoGUI | 65.0 (+26.3) | 80.0 (+13.6) | 83.2 (+16.3) | 72.3 (+1.2) | 90.3 (+34.4) |

Ablation Study

Comparison of annotation quality vs. human annotators:

| Annotator | Rejector | Verifier | Accuracy |
| --- | --- | --- | --- |
| Llama-3-70B | None | None | 64.5% |
| Llama-3-70B | Rule+LLM | Dual-LLM | 96.7% |
| Human Annotator | - | - | 95.5% |

Data scaling effect: As the training data grows from 25k to 125k to 702k samples, the grounding accuracy of all three VLMs (Qwen-VL, SliME, and Qwen2-VL) continues to rise.

Key Findings

Functionality annotations significantly outperform HTML/metadata annotations, scoring up to 4x higher than HTML annotations on FuncPred.
Clear data scaling effect: Increasing the volume of data leads to continuous performance improvements.
Assisting GUI agent tasks: Qwen2-VL+AutoGUI improves over Gemini's native grounding by 9.73% on AITW.
Annotation accuracy of 96.7% is comparable to trained human annotators (95.5%).

Highlights & Insights

The design idea of "inferring functionality via interaction differences" is clever: it mimics the way humans explore unfamiliar interfaces.
Dual-LLM quality control ensures that the fully automatic pipeline achieves quality comparable to human annotation.
Data scaling effects indicate that the pipeline is highly scalable.
The finding that functionality annotations outperform HTML annotations demonstrates the value of contextualized semantic descriptions.

Limitations & Future Work

Only Web and Android platforms are covered, leaving iOS and desktop applications unaddressed.
The current pipeline cannot annotate elements that modify remote internet content (e.g., posting or purchasing).
Lack of goal-oriented interaction trajectories (only random interactions are used).
Some low-traffic websites experience rendering distortion under mobile resolutions.

vs SeeClick (Cheng et al. 2024): SeeClick uses only static pages and HTML annotations, whereas AutoGUI leverages interaction differences and LLM inference.
vs Widget Captioning (Li et al. 2020): Widget Captioning relies on 163K human annotations, whereas AutoGUI is fully automatic and scales to 704K annotations of comparable quality.
vs UGround (Gou et al. 2025): UGround provides only brief functional descriptions, whereas AutoGUI provides contextualized and detailed functionality annotations.
Insight: The data bottleneck in UI understanding can be thoroughly resolved through automatic interaction and LLM inference.

Rating

Novelty: ⭐⭐⭐⭐ The first fully automatic UI functionality annotation pipeline; the concept of inferring functionality via interaction differences is highly novel.
Experimental Thoroughness: ⭐⭐⭐⭐ 704K dataset + multiple VLMs + scalability + agent applications.
Writing Quality: ⭐⭐⭐⭐ The description of the pipeline is clear and well-structured.
Value: ⭐⭐⭐⭐⭐ Resolves the data bottleneck in UI understanding, offering foundational infrastructure value for GUI agents.