
InfiGUI-G1: Advancing GUI Grounding with Adaptive Exploration Policy Optimization

Conference: AAAI 2026
arXiv: 2508.05731
Code: github.com/InfiXAI/InfiGUI-G1
Area: Reinforcement Learning
Keywords: GUI Grounding, Multimodal Large Language Models, Adaptive Exploration, Policy Optimization, Multi-Answer Generation

TL;DR

To address the exploration bottleneck in semantic alignment for GUI grounding, this paper proposes the Adaptive Exploration Policy Optimization (AEPO) framework. AEPO enforces broad exploration via a multi-answer generation strategy, dynamically guides learning through an adaptive exploration reward function, and ensures exploration quality via a collinearity penalty mechanism, significantly improving multimodal large language model performance on complex GUI grounding tasks.

Background & Motivation

GUI Grounding is a core perceptual task for autonomous GUI agents, requiring precise mapping of natural language instructions to specific interactive elements on the screen. This task can be decomposed along two orthogonal dimensions:

Spatial Alignment: Accurately localizing element coordinates — the precision of "pointing"

Semantic Alignment: Identifying the correct interactive element — pointing at the right target

Limitations of Prior Work:

  • SFT methods: Data-intensive and difficult to generalize to unseen UI layouts
  • RLVR methods (e.g., GRPO): Effectively improve spatial alignment by optimizing coordinate generation, but suffer from an exploration bottleneck

The "confidence trap" problem constitutes the core motivation of this paper. As a concrete example: given the instruction "search for objects using the camera," both a "Camera" button and a "Google Lens" icon appear on screen. The model may repeatedly select the "Camera" button with high confidence (a semantic error). Standard RLVR continuously samples from this high-confidence but incorrect choice, rarely encountering the correct "Google Lens" by chance, and thus fails to obtain the learning signal necessary to correct the semantic misunderstanding.

This reveals a fundamental issue: the single-answer paradigm of standard RL leads to low sampling efficiency and an inability to escape the policy's "confidence trap".

Method

Overall Architecture

The AEPO framework comprises three synergistic components:

  1. Multi-Answer Generation: forces the model to generate \(N\) candidate points in a single forward pass
  2. Adaptive Exploration Reward (AER): a nonlinear reward signal derived from the efficiency first principle \(\eta = U/C\)
  3. Collinearity Penalty: prevents degenerate linear-scan strategies

Key Designs

1. Problem Formulation and Multi-Answer Generation

GUI grounding is formalized as a policy optimization problem:

  • Context \(c = (\mathcal{S}, \mathcal{I})\): screenshot + natural language instruction
  • Action \(a\): a coordinate point \(p = (x, y)\)
  • Policy \(\pi_\theta(a \mid c)\): the action probability distribution given the context
  • Objective: \(\theta^* = \arg\max_\theta \mathbb{E}_{c \sim \mathcal{D},\, a \sim \pi_\theta(\cdot \mid c)}[R(a, B)]\)

The key innovation of multi-answer generation is prompting the model to produce \(N\) candidate points \(\mathcal{A} = \{p_1, p_2, \ldots, p_N\}\) within a single forward pass. This compels the model to explore beyond its highest-confidence single prediction, substantially increasing the probability of sampling a correct action from the tail of the policy distribution — particularly critical for semantically challenging instances.
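To make the multi-answer protocol concrete, here is a minimal sketch of how one might prompt for and parse \(N\) candidate points in a single pass; the prompt wording, the `<point>` tag format, and the default \(N\) are illustrative assumptions, not the paper's exact template.

```python
import re

def build_multi_answer_prompt(instruction: str, n_candidates: int = 4) -> str:
    """Ask the model for several distinct candidate click points in one forward pass."""
    return (
        f"Instruction: {instruction}\n"
        f"Propose {n_candidates} distinct candidate click points for the target element, "
        "ordered from most to least likely, each as a <point>x, y</point> tag."
    )

def parse_candidate_points(response: str) -> list[tuple[float, float]]:
    """Extract (x, y) candidates in the order the model proposed them."""
    matches = re.findall(r"<point>\s*([\d.]+)\s*,\s*([\d.]+)\s*</point>", response)
    return [(float(x), float(y)) for x, y in matches]
```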

2. Adaptive Exploration Reward (AER)

AER is derived from the efficiency first principle \(\eta = U/C\):

Utility \(U\):
  • Exploration success (any candidate falls within the ground truth): \(U = +1\)
  • Exploration failure: \(U = -1\)

Cost \(C\): modeled as the geometric mean of two cost components:
  • Proposal cost \(C_p = N\): the cost of generating \(N\) candidates
  • Verification cost \(C_v\): the rank \(k\) of the first correct point on success, or \(N\) on failure
  • \(C = \sqrt{C_p \cdot C_v}\) (the geometric mean captures diminishing marginal returns)

The resulting accuracy reward:

\[R_{\text{accuracy}}(\mathcal{A}, B) = \begin{cases} 1/\sqrt{N \cdot k} & \text{if } \exists\, p_i \in \mathcal{A} \text{ such that } p_i \in B \\ -1/N & \text{otherwise} \end{cases}\]

Dynamic behavior:

  • On failure: the penalty is only \(-1/N\), which shrinks as \(N\) grows, encouraging broader exploration
  • On success: the reward \(1/\sqrt{Nk}\) is larger when the rank \(k\) is small, rewarding efficient, accurate prediction
  • This asymmetric design promotes exploration on failure and convergence on success
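A minimal sketch of the AER accuracy reward under these definitions, assuming the ground truth is an axis-aligned box `(x1, y1, x2, y2)`; the function and variable names are illustrative, not the authors' code.

```python
import math

def aer_accuracy_reward(candidates, gt_box):
    """Adaptive Exploration Reward: +1/sqrt(N*k) on success, -1/N on failure.

    candidates: list of (x, y) points, ordered by the model's preference.
    gt_box: ground-truth box (x1, y1, x2, y2).
    """
    n = len(candidates)
    x1, y1, x2, y2 = gt_box
    for rank, (x, y) in enumerate(candidates, start=1):
        if x1 <= x <= x2 and y1 <= y <= y2:   # first correct point found at rank k
            return 1.0 / math.sqrt(n * rank)   # U = +1, C = sqrt(C_p * C_v) = sqrt(N * k)
    return -1.0 / n                            # U = -1, C = sqrt(N * N) = N
```

For example, with \(N = 4\) candidates and the first hit at rank \(k = 2\), the reward is \(1/\sqrt{8} \approx 0.35\); if all four candidates miss, the penalty is only \(-1/4 = -0.25\).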

3. Collinearity Penalty Mechanism

If the \(N\) generated candidate points are approximately collinear (detected by checking whether every triangle formed by any three of the points has near-zero area), the accuracy reward is overridden with a large negative value, \(R_{\text{accuracy}} = -1\). This prevents the model from adopting a simple but inefficient linear-scan strategy, incentivizing genuinely diverse semantic exploration in geometric space.
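One way to implement this check is sketched below; the area threshold is an illustrative assumption and would depend on the coordinate scale (pixels vs. normalized coordinates).

```python
from itertools import combinations

def is_degenerate_linear_scan(points, area_threshold=1.0):
    """Return True if every triangle formed by three candidate points has
    near-zero area, i.e. the candidates are approximately collinear."""
    if len(points) < 3:
        return False
    for (ax, ay), (bx, by), (cx, cy) in combinations(points, 3):
        # Twice the signed triangle area via the cross product, then halved.
        area = 0.5 * abs((bx - ax) * (cy - ay) - (cx - ax) * (by - ay))
        if area > area_threshold:
            return False  # at least one non-degenerate triangle => not collinear
    return True
```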

Loss & Training

The total reward signal combines format and accuracy rewards:

\[R_{\text{total}} = R_{\text{format}} + R_{\text{accuracy}}\]

The format reward \(R_{\text{format}}\) is +1 when the output format is correct and 0 otherwise, serving as a prerequisite for subsequent reward evaluation. The total reward is used to compute the advantage estimate \(\hat{A}\), which directly guides policy parameter updates. Training employs standard policy gradient algorithms such as GRPO, PPO, or RLOO.
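Reusing the helpers sketched above, the following is a hedged sketch of how the total reward and a GRPO-style group-relative advantage might be computed per rollout; the handling of malformed outputs and the normalization constant are illustrative assumptions rather than the paper's exact implementation.

```python
def total_reward(candidates, gt_box, format_ok):
    """R_total = R_format + R_accuracy, with the collinearity override applied."""
    r_format = 1.0 if format_ok else 0.0
    if not format_ok:
        return r_format                      # assumption: no accuracy reward without valid format
    if is_degenerate_linear_scan(candidates):
        return r_format - 1.0                # collinearity penalty overrides R_accuracy
    return r_format + aer_accuracy_reward(candidates, gt_box)

def group_relative_advantages(rewards):
    """GRPO-style advantage: normalize rewards within a group of rollouts for one context."""
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
    return [(r - mean) / (std + 1e-6) for r in rewards]
```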

Key Experimental Results

Main Results

MMBench-GUI Benchmark (Top-1 Accuracy %; each platform cell reports Basic / Advanced):

| Model | Windows | MacOS | Linux | iOS | Android | Web | Avg |
|---|---|---|---|---|---|---|---|
| UI-TARS-1.5-7B | 68.3 / 39.0 | 69.0 / 44.5 | 64.4 / 37.8 | 88.5 / 69.4 | 90.5 / 69.3 | 81.0 / 56.5 | 64.3 |
| Naive RLVR-7B | 79.3 / 58.1 | 82.3 / 62.7 | 64.4 / 44.9 | 94.9 / 89.1 | 95.5 / 84.2 | 92.9 / 79.5 | 79.3 |
| InfiGUI-G1-7B | 82.7 / 61.8 | 83.8 / 63.9 | 72.3 / 52.0 | 94.9 / 89.4 | 95.2 / 85.6 | 93.5 / 76.3 | 80.8 |
| G1-7B + Exploration Success | 87.1 / 69.1 | 87.2 / 76.3 | 78.5 / 58.2 | 98.1 / 92.4 | 98.0 / 91.8 | 97.1 / 85.7 | 86.4 |

ScreenSpot-Pro Benchmark (Top-1 Accuracy %; each category cell reports Text / Icon):

| Model | CAD | Dev. | Creative | Scientific | Office | OS | Avg |
|---|---|---|---|---|---|---|---|
| Naive RLVR-7B | 53.8 / 17.2 | 71.4 / 15.9 | 60.6 / 11.9 | 76.4 / 26.4 | 74.6 / 34.0 | 54.2 / 20.2 | 47.6 |
| InfiGUI-G1-3B | 50.8 / 25.0 | 64.9 / 20.0 | 51.5 / 16.8 | 68.8 / 32.7 | 70.6 / 32.1 | 49.5 / 15.7 | 45.2 |

Ablation Study

The contribution of each component is assessed via comparison against Naive RLVR:

| Configuration | MMBench Avg | Notes |
|---|---|---|
| Naive RLVR-3B | 70.9 | Baseline single-answer RL |
| InfiGUI-G1-3B (Top-1) | 73.4 | Combined gain from multi-answer + AER + collinearity penalty |
| G1-3B Exploration Success Rate | 81.6 | Upper bound utilizing all multi-answer candidates |
| Naive RLVR-7B | 79.3 | Larger-model baseline |
| InfiGUI-G1-7B (Top-1) | 80.8 | Surpasses RLVR while evaluating only the first answer |
| G1-7B Exploration Success Rate | 86.4 | Upper bound with all multi-answer candidates |

Key insights:

  • For the 3B model, Top-1 accuracy already exceeds Naive RLVR by 2.5 points, and the exploration upper bound improves by 10.7 points
  • For the 7B model, the largest gains over Naive RLVR appear in the Advanced category (semantically challenging items)
  • The average \(N\) at exploration success is 1.6–2.0, indicating the model has learned to explore efficiently

Key Findings

  1. Semantic alignment is the critical bottleneck: Gains from InfiGUI-G1 are substantially larger on Advanced categories than Basic categories, validating that AEPO effectively addresses the semantic alignment problem
  2. Exploration efficiency: InfiGUI-G1-7B achieves an 86.4% exploration success rate with an average of only 1.6 candidates, demonstrating that the model learns not only to explore but to explore efficiently
  3. Cross-platform generalization: Consistent improvements are observed across all platforms — Windows, MacOS, Linux, iOS, Android, and Web
  4. 3B vs. 7B scaling effect: The 7B model outperforms the 3B model in both absolute performance and the incremental gains attributable to AEPO
  5. Training stability: Standard deviations across 5 runs are only σ = 0.11–0.41, indicating highly stable training

Highlights & Insights

  • Precise problem definition: Decomposing GUI grounding into spatial and semantic alignment, and precisely diagnosing the exploration bottleneck in semantic alignment, yields a clear and well-motivated problem formulation
  • Theoretical grounding of AER: Deriving the reward function from the efficiency first principle \(\eta = U/C\) rather than through ad hoc design confers theoretical elegance
  • Generality of the multi-answer paradigm: The framework of multi-answer generation combined with adaptive rewards is not limited to GUI grounding; it may also prove effective for other visual grounding tasks requiring spatial exploration, such as referring expression comprehension
  • Elegance of the collinearity penalty: A simple geometric constraint prevents degenerate strategies with negligible computational overhead yet notable empirical effect

Limitations & Future Work

  • Multi-answer generation increases inference-time computational cost (though average \(N\) remains small), requiring a performance–efficiency trade-off in practical deployment
  • The collinearity penalty relies on a simple triangle-area heuristic, which may need refinement in higher-dimensional or more complex spatial exploration scenarios
  • The current framework assumes ground truth bounding boxes; adaptation may be required for non-rectangular or pixel-level targets
  • Whether AEPO retains its advantages in agent-level (multi-step decision-making) tasks remains unexplored
  • The utility function in AER considers only binary outcomes (success/failure), leaving progressive metrics such as IoU unexploited
  • Shares the RLVR foundation with GUI RL methods such as UI-R1 and GUI-R1, but overcomes the limitations of the single-answer paradigm
  • The multi-answer generation idea bears resemblance to Best-of-N sampling, but AER provides a finer-grained learning signal
  • The adaptive reward design balancing exploration and exploitation is transferable to other RL settings, such as multi-target exploration in robotic manipulation

Rating

  • Novelty: ⭐⭐⭐⭐ — The multi-answer + adaptive reward framework is innovative, though the novelty of individual components is moderate
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ — Multi-benchmark, cross-platform evaluation with reported standard deviations and exploration success rate analysis
  • Writing Quality: ⭐⭐⭐⭐ — Clear structure; motivation figures are highly intuitive
  • Value: ⭐⭐⭐⭐ — Establishes a new state of the art in the rapidly growing field of GUI agents