AutoGUI: Scaling GUI Grounding with Automatic Functionality Annotations from LLMs
Conference: ACL 2025
arXiv: 2502.01977
Code: Project Page
Area: GUI Understanding / VLM
Keywords: GUI grounding, functionality annotation, VLM, automatic pipeline, UI understanding
TL;DR
This work proposes AutoGUI, an automatic annotation pipeline. It simulates interactions to capture UI state changes, infers element functionality with LLMs, and filters the results with LLM-aided rejection and dual-LLM verification, yielding a UI functionality dataset of 704K annotations. The annotation accuracy of 96.7% is comparable to that of human annotators (95.5%), and fine-tuning on the data substantially improves VLM UI grounding with a clear data scaling effect.
Background & Motivation
- Background: VLMs hold great potential for UI understanding, but they are bottlenecked by a data shortage: UI datasets are far smaller than natural image datasets.
- Limitations of Prior Work: Existing annotations mostly consist of element alt-text or simple user intentions, lacking contextualized functional semantic descriptions (e.g., two identical magnifying glass icons could represent "search" and "zoom" respectively).
- Key Challenge: High-quality, large-scale contextualized UI element functionality annotations are needed, but human annotation is prohibitively expensive.
- Goal: Design a fully automatic annotation pipeline to generate high-quality functional descriptions of UI elements at scale without human intervention.
- Key Insight: Infer functionality by looking at "what happens to the UI after an element is clicked", similar to how humans explore an unfamiliar interface.
- Core Idea: Compare UI state changes before and after interactions using LLMs to automatically infer element functionality, combined with a dual-LLM quality control mechanism to ensure accuracy.
Method
Overall Architecture
A three-stage pipeline: (1) Automatically crawl Web/Android UI interaction trajectories; (2) Perform LLM functionality inference, LLM-aided rejection, and dual-LLM verification; (3) Convert data into grounding/referring tasks to fine-tune VLMs.
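The third stage can be pictured with a small Python sketch that turns one verified annotation into a grounding sample (functionality in, location out) and a referring sample (location in, functionality out). The `FunctionalityAnnotation` class, the prompt wording, and the answer format are illustrative assumptions, not the templates used by AutoGUI.

```python
from dataclasses import dataclass


@dataclass
class FunctionalityAnnotation:
    """Hypothetical record for one verified annotation."""
    functionality: str        # inferred description of what the element does
    point: tuple[int, int]    # element center, assumed already normalized


def _format_point(point: tuple[int, int]) -> str:
    # Assumed textual format for coordinates in prompts and answers.
    return f"({point[0]},{point[1]})"


def to_grounding_sample(ann: FunctionalityAnnotation) -> dict:
    """Grounding: given the functionality, the model must output the location."""
    return {
        "task": "grounding",
        "prompt": f"Locate the element with this functionality: {ann.functionality}",
        "answer": _format_point(ann.point),
    }


def to_referring_sample(ann: FunctionalityAnnotation) -> dict:
    """Referring: given the location, the model must describe the functionality."""
    return {
        "task": "referring",
        "prompt": f"Describe the functionality of the element at {_format_point(ann.point)}",
        "answer": ann.functionality,
    }
```

Generating both directions from the same record means every annotation contributes two training samples without any extra labeling.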
Key Designs
- Functionality Inference Based on UI State Changes:
    - Function: Use differences in the UI AXTree (accessibility tree) before and after an interaction to infer element functionality.
    - Mechanism: \(f = \text{LLM}(p_{\text{anno}}, s_t, s_{t+1})\), where \(f\) denotes the inferred functionality and \(s_t\), \(s_{t+1}\) are the UI states before and after the interaction. `difflib` generates line-level differences (additions, deletions, shifts, and attribute updates) between the two AXTrees; the LLM analyzes these changes and summarizes the functionality via Chain-of-Thought.
    - Design Motivation: Instead of relying solely on the visual appearance of an element, this method prioritizes "what happens after clicking." For instance, if clicking a magnifying glass icon opens a search bar, its function is search; if it displays a zoom slider, its function is zoom.
- LLM-aided Rejection (Invalid Sample Filtering):
    - Function: The LLM evaluates whether the state changes produced by the interaction are sufficient to infer functionality.
    - Mechanism: \(\text{score} = \text{LLM}(p_{\text{reject}}, e, s_t, s_{t+1})\), scoring based on three criteria (clarity of change, relevance, and predictability) and discarding the bottom 30%.
    - Design Motivation: Not all interactions produce meaningful state changes, such as pages that fail to load properly or redirects that require logging in.
- Dual-LLM Verification (Annotation Quality Control):
    - Function: Use two different LLMs to cross-validate whether the functionality annotation is correct.
    - Mechanism: Llama-3-70B and Mistral-7B separately score the annotations, retaining only those that receive full marks from both.
    - Design Motivation: A single LLM may exhibit systematic biases (such as incorrectly describing dropdown menus). Cross-validation with two different LLMs allows them to correct each other.
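The three designs above can be sketched in Python under a few stated assumptions. `axtree_diff` uses the standard-library `difflib.ndiff` on plain-text AXTree dumps; `reject_lowest` implements one possible reading of "discarding the bottom 30%" (dropping the lowest-scoring fraction of samples); and `dual_verify` keeps an annotation only when two verifier callables both return full marks. The function names, the text format of the AXTree, and the `full_marks=3` scale are hypothetical, and the verifier callables merely stand in for real LLM calls.

```python
import difflib
from typing import Callable, Sequence


def axtree_diff(before: str, after: str) -> list[str]:
    """Return added ('+ ') and deleted ('- ') lines between two AXTree text dumps.

    Moved or updated nodes show up as a deletion plus an addition;
    ndiff's '? ' hint lines and unchanged lines are skipped.
    """
    lines = difflib.ndiff(before.splitlines(), after.splitlines())
    return [ln for ln in lines if ln.startswith(("+ ", "- "))]


def reject_lowest(samples: Sequence, scores: Sequence[float],
                  drop_fraction: float = 0.3) -> list:
    """Drop the lowest-scoring fraction of samples (assumed reading of 'bottom 30%')."""
    n_drop = int(len(samples) * drop_fraction)
    order = sorted(range(len(samples)), key=lambda i: scores[i])
    dropped = set(order[:n_drop])
    return [s for i, s in enumerate(samples) if i not in dropped]


def dual_verify(annotation: str,
                verifier_a: Callable[[str], int],
                verifier_b: Callable[[str], int],
                full_marks: int = 3) -> bool:
    """Keep an annotation only if both verifiers give it full marks."""
    return (verifier_a(annotation) == full_marks
            and verifier_b(annotation) == full_marks)
```

In a real pipeline the diff lines would be placed into the annotation prompt together with the interacted element, and the two verifier callables would wrap calls to two different LLMs.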
Loss & Training
Three VLMs are fine-tuned for 1 epoch on 8×A100 GPUs. Qwen-VL and Qwen2-VL are fine-tuned with LoRA, while SliME is trained with only its visual encoder frozen. Coordinates are normalized to [0, 999].
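As a hedged illustration of the coordinate convention, the sketch below maps a pixel location to the [0, 999] range. The function name and the floor-then-clamp rounding are assumptions; the notes only state the target range.

```python
def normalize_point(x: float, y: float, width: int, height: int,
                    scale: int = 1000) -> tuple[int, int]:
    """Map pixel coordinates to integers in [0, scale - 1].

    Assumed convention: scale by the screen size, floor, and clamp so that
    a point on the right or bottom edge maps to scale - 1 rather than scale.
    """
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    nx = min(max(int(x / width * scale), 0), scale - 1)
    ny = min(max(int(y / height * scale), 0), scale - 1)
    return nx, ny
```

Normalizing this way makes targets independent of the screenshot resolution, so Web and Android samples of different sizes share one output vocabulary.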
Key Experimental Results
Main Results
Fine-tuned UI grounding accuracy (%):
| Model | FuncPred | ScreenSpot | ScreenSpot-v2 | MoTIF | VWB EG |
|---|---|---|---|---|---|
| Qwen-VL (Baseline) | 3.0 | 5.2 | 5.6 | 7.8 | 1.7 |
| Qwen-VL + AutoGUI | 48.7(+45.7) | 41.2(+36.0) | 40.2(+34.6) | 44.0(+36.2) | 42.1(+40.4) |
| Qwen2-VL (Baseline) | 38.7 | 66.4 | 66.9 | 71.1 | 55.9 |
| Qwen2-VL + AutoGUI | 65.0(+26.3) | 80.0(+13.6) | 83.2(+16.3) | 72.3(+1.2) | 90.3(+34.4) |
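For reading the table, it helps to recall how grounding accuracy is commonly scored on ScreenSpot-style benchmarks: a prediction counts as correct when the predicted point falls inside the target element's bounding box. The sketch below assumes that criterion; the notes above do not spell out the exact metric for each benchmark, and the function name is illustrative.

```python
from typing import Sequence

Point = tuple[float, float]
Box = tuple[float, float, float, float]  # (x1, y1, x2, y2)


def grounding_accuracy(pred_points: Sequence[Point],
                       gt_boxes: Sequence[Box]) -> float:
    """Percentage of predicted points that land inside their ground-truth box."""
    if len(pred_points) != len(gt_boxes):
        raise ValueError("predictions and ground truth must have equal length")
    if not gt_boxes:
        raise ValueError("at least one sample is required")
    hits = sum(
        1
        for (x, y), (x1, y1, x2, y2) in zip(pred_points, gt_boxes)
        if x1 <= x <= x2 and y1 <= y <= y2
    )
    return 100.0 * hits / len(gt_boxes)
```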
Ablation Study
Comparison of annotation quality vs. human annotators:
| Annotator | Rejector | Verifier | Accuracy |
|---|---|---|---|
| Llama-3-70B | None | None | 64.5% |
| Llama-3-70B | Rule+LLM | Dual-LLM | 96.7% |
| Human Annotator | - | - | 95.5% |
Data scaling effect: As the training set grows from 25k \(\rightarrow\) 125k \(\rightarrow\) 702k samples, the grounding accuracy of all three VLMs continues to rise.
Key Findings
- Functionality annotations significantly outperform HTML/metadata annotations: Scoring up to 4x higher than HTML annotations on FuncPred.
- Clear data scaling effect: Increasing the volume of data leads to continuous performance improvements.
- Assisting GUI agent tasks: Qwen2-VL+AutoGUI improves over Gemini's native grounding by 9.73% on AITW.
- Annotation accuracy of 96.7% is comparable to trained human annotators (95.5%).
Highlights & Insights
- The design idea of "inferring functionality via interaction differences" is highly clever—seamlessly mimicking the way humans explore unfamiliar interfaces.
- Dual-LLM quality control ensures that the fully automatic pipeline achieves quality comparable to human annotation.
- Data scaling effects indicate that the pipeline is highly scalable.
- The finding that functionality annotations > HTML annotations demonstrates the immense value of contextualized semantic descriptions.
Limitations & Future Work
- Only Web and Android platforms are covered, leaving iOS and desktop applications unaddressed.
- The current pipeline cannot annotate elements that modify remote internet content (e.g., posting or purchasing).
- Lack of goal-oriented interaction trajectories (only random interactions are used).
- Some low-traffic websites experience rendering distortion under mobile resolutions.
Related Work & Insights
- vs SeeClick (Cheng et al. 2024): SeeClick uses only static pages and HTML annotations, whereas AutoGUI leverages interaction differences and LLM inference.
- vs Widget Captioning (Li et al. 2020): Widget Captioning relies on 163K human annotations, whereas AutoGUI is fully automatic and scales to 704K annotations with comparable quality.
- vs UGround (Gou et al. 2025): UGround provides only brief functional descriptions, whereas AutoGUI provides contextualized, detailed functionality annotations.
- Insight: Automatic interaction combined with LLM inference can substantially ease the data bottleneck in UI understanding.
Rating
- Novelty: ⭐⭐⭐⭐ The first fully automatic UI functionality annotation pipeline; the concept of inferring functionality via interaction differences is highly novel.
- Experimental Thoroughness: ⭐⭐⭐⭐ 704K dataset + multiple VLMs + scalability + agent applications.
- Writing Quality: ⭐⭐⭐⭐ The description of the pipeline is clear and well-structured.
- Value: ⭐⭐⭐⭐⭐ Resolves the data bottleneck in UI understanding, offering foundational infrastructure value for GUI agents.