
Co-EPG: A Framework for Co-Evolution of Planning and Grounding in Autonomous GUI Agents

Conference: AAAI 2026
arXiv: 2511.10705
Code: Not released
Area: Agent / LLM / GUI Automation
Keywords: GUI Agent, Planning-Grounding Co-Evolution, GRPO, Self-Iterative Training, Reward Ensemble

TL;DR

This paper proposes Co-EPG, a framework that decouples a GUI Agent into separate Planning and Grounding models, establishes a positive feedback loop via GRPO co-training and a Confidence-based Dynamic Reward Ensemble Mechanism (C-DREM), enabling both models to co-evolve through self-iteration. Using only benchmark datasets (no external data), Co-EPG achieves state-of-the-art results on Multimodal-Mind2Web (58.4%) and AndroidControl (83.1%).

Background & Motivation

GUI task automation is an important frontier in AI. Current GUI Agents require two core capabilities: Planning—deciding what action to take given the current screen state; and Grounding—locating the precise position of the target element on the interface.

Existing approaches fall into two paradigms:

  1. End-to-end single model: A single VLM is trained to perform both planning and grounding simultaneously. However, such models generalize poorly across diverse GUI environments, with limitations in both perception and interaction.
  2. Modular multi-model: Planning and grounding are decoupled into independent models or realized through multi-agent collaboration. However, existing collaborative architectures suffer from two fundamental problems:

    • Insufficient synergy: Each model is optimized independently, failing to leverage the mutual reinforcement between planning and grounding.
    • Data inefficiency: Heavy reliance on large-scale synthetic data generation, with underutilization of existing data.

Therefore, a new paradigm is needed to enable Planning and Grounding models to mutually reinforce each other and continuously improve.

Core Problem

How can co-evolution of two models be achieved within a decoupled Planning-Grounding architecture? Specifically:

  1. How can the quality of plans generated by the planning model be evaluated and guided using the grounding model?
  2. How can improvements in the planning model in turn produce higher-quality training data to enhance the grounding model?
  3. How can continuous improvement be achieved through self-iteration without relying on external data?

The key challenge lies in the fact that plan quality cannot be directly evaluated (unlike coordinates, which can be compared against ground truth); it must be measured indirectly through the execution outcomes of the grounding model.

Method

Overall Architecture

Co-EPG is a self-iterative training framework whose core is establishing a positive feedback loop between Planning and Grounding:

Input: GUI screenshot + task description + action history
Output: Complete action (coordinates + action type + action value)

The overall pipeline alternates between two phases:

  1. Iterative Training: Both models are initialized with SFT, then the planning model is optimized via GRPO driven by C-DREM, after which the higher-quality data produced by the updated planning model is used to further fine-tune the grounding model via SFT.
  2. Data Enhancement: The updated models are used to expand and improve the quality and diversity of the training data.

Through multiple rounds of iteration (3 rounds in the paper), both models continuously improve in a spiraling fashion.
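
To make the decoupled input/output contract concrete, here is a minimal runnable sketch. The two model functions are trivial stand-ins returning fixed values (in Co-EPG they are fine-tuned Qwen2.5-VL models), and every name and signature is an illustrative assumption rather than the authors' code; the dual-model split is detailed under Key Designs below.

```python
# Illustrative sketch of the decoupled Planning -> Grounding inference flow.
# The two "models" below are trivial stand-ins returning fixed values; in Co-EPG
# they are fine-tuned Qwen2.5-VL models. Names and signatures are assumptions.

def planning_model(observation: str, task: str, history: list[str]) -> dict:
    """Planning model pi: given o_t, Q, h_t, emit a textual plan p_t plus
    action type a_t^type and action value a_t^value."""
    return {
        "plan": "Click the 'Search' button to the right of the query box.",
        "action_type": "click",
        "action_value": "",
    }

def grounding_model(screenshot: str, plan: str) -> tuple[int, int]:
    """Grounding model phi: given o_t^vision and p_t, predict a_t^coor."""
    return (412, 88)

def step(screenshot: str, task: str, history: list[str]) -> dict:
    """One decision step. The plan is the bridge between the two models."""
    p = planning_model(screenshot, task, history)
    x, y = grounding_model(screenshot, p["plan"])
    return {"coordinates": (x, y),
            "action_type": p["action_type"],
            "action_value": p["action_value"]}

if __name__ == "__main__":
    print(step("screenshot.png", "Search for wireless headphones", []))
```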

Key Designs

  1. P-G Dual-Model Architecture: Decision-making is decoupled into two steps:

    • Planning model \(\pi\): Takes the current observation \(o_t\), task description \(Q\), and history \(h_t\) as input; outputs a textual plan \(p_t\), action type \(a_t^{type}\), and action value \(a_t^{value}\).
    • Grounding model \(\phi\): Takes the screenshot \(o_t^{vision}\) and plan \(p_t\) as input; outputs target element coordinates \(a_t^{coor}\).

The plan serves as the interaction medium between the two models—it is simultaneously the output of the planning model and the input of the grounding model, forming the key bridge for co-evolution.

  2. C-DREM (Confidence-based Dynamic Reward Ensemble Mechanism): Addresses the bias introduced by relying on a single reward model during GRPO training (a weight-computation sketch follows this list). The core idea is:

    • Multiple grounding models (including two large open-source VLMs, Qwen2.5-VL-72B and 32B, as well as the iteratively trained grounding model \(\phi_k\)) jointly evaluate plan quality.
    • Each model's weight is determined by two components: a static prior \(\sigma_j\) (with higher weight assigned to self-trained models) and a dynamic confidence \(c_j\) (computed as the mean log-likelihood of predicted coordinate tokens).
    • Weights are normalized via softmax: \(w_j = \frac{\exp(\sigma_j \cdot c_j)}{\sum_n \exp(\sigma_n \cdot c_n)}\)
    • Plan reward \(r_{plan}\): 1 if the grounding model's predicted coordinates fall within the target bounding box, 0 otherwise; final reward is the weighted sum.
  3. Self-Augmented Data Evolution:

    • Initialization (\(k=0\)): An open-source VLM pool (Planner \(\Pi\) and Verifier \(\Phi\)) is used to generate plans and verify them; successful instances form \(D_0\).
    • Subsequent iterations (\(k \geq 1\)): The newly trained \(\pi_k'\) is added to the Planner pool to increase diversity, and \(\phi_k\) is added to the Verifier pool to improve verification reliability.
    • Only the latest two generations of models are retained to balance efficiency.
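
The weighting formula above is simple to implement. Below is a small runnable sketch of one plausible reading of C-DREM: each judge's weight combines its static prior \(\sigma_j\) with its confidence \(c_j\) via a softmax, and the plan reward is the weighted sum of per-judge hit indicators. Here confidence is taken as the exponentiated mean log-likelihood of the coordinate tokens (a value in (0, 1]) so that a larger prior increases a judge's weight; that normalization, the helper names, and all numbers are assumptions, not the paper's implementation.

```python
import math

def cdrem_plan_reward(judges: list[dict]) -> float:
    """judges: one dict per grounding judge with
       - 'prior'     : static prior sigma_j (higher for the self-trained grounder phi_k)
       - 'confidence': c_j; here taken as exp(mean log-likelihood of the predicted
                       coordinate tokens), i.e. a per-token probability in (0, 1]
       - 'hit'       : 1.0 if the judge's predicted point lies inside the
                       ground-truth bounding box, else 0.0
    """
    # w_j = softmax over sigma_j * c_j (shift by the max for numerical stability).
    logits = [j["prior"] * j["confidence"] for j in judges]
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    weights = [e / sum(exps) for e in exps]

    # Plan reward = confidence-weighted average of the hit indicators.
    return sum(w * j["hit"] for w, j in zip(weights, judges))

# Example with the 1 : 1 : 2 prior ratio reported for 72B : 32B : phi_k;
# confidences and hits are made-up numbers.
judges = [
    {"name": "Qwen2.5-VL-72B", "prior": 1.0, "confidence": 0.80, "hit": 1.0},
    {"name": "Qwen2.5-VL-32B", "prior": 1.0, "confidence": 0.70, "hit": 0.0},
    {"name": "phi_k (trained)", "prior": 2.0, "confidence": 0.90, "hit": 1.0},
]
print(round(cdrem_plan_reward(judges), 3))  # ~0.80: the two agreeing judges dominate
```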

Loss & Training

Reward Design (three components):

  • Plan reward \(r_{plan}\): Whether the grounding model's predicted coordinates fall within the target bounding box, integrated via C-DREM weighting.
  • Action type reward \(r_{type}\): Exact match with ground truth; 1 if correct, 0 otherwise.
  • Action value reward \(r_{value}\): 1 if the F1 score between the predicted value and the ground truth exceeds 0.5, 0 otherwise.

Final Reward: If \(r_{type}=0\) or \(r_{value}=0\), then \(r_i=0\); otherwise \(r_i = r_{plan}\). This means action type and value serve as gating conditions, with the plan reward as the core optimization signal.

GRPO Training: \(G=7\) rollouts are generated per group; advantage is computed via within-group normalization: \(A_i = \frac{r_i - \text{mean}(\{r_1,...,r_G\})}{\text{std}(\{r_1,...,r_G\})}\)
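
The reward gating and within-group normalization can be mirrored directly from the formulas above; the sketch below does exactly that for a group of G = 7 rollouts. The small epsilon guarding against a zero standard deviation and the use of the population (rather than sample) standard deviation are assumptions, and the rollout rewards are made up.

```python
from statistics import mean, pstdev

def final_reward(r_plan: float, r_type: float, r_value: float) -> float:
    """Action type and value act as gates; the plan reward carries the learning signal."""
    return 0.0 if (r_type == 0 or r_value == 0) else r_plan

def group_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """GRPO advantage: normalize each rollout's reward within its group."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# A group of G = 7 rollouts: (r_plan, r_type, r_value) per rollout, made-up values.
rollouts = [(0.80, 1, 1), (0.00, 1, 1), (0.65, 1, 1), (0.90, 0, 1),
            (0.70, 1, 0), (1.00, 1, 1), (0.40, 1, 1)]
rewards = [final_reward(*r) for r in rollouts]
print([round(a, 2) for a in group_advantages(rewards)])
```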

Training Details:

  • Backbone: Qwen2.5-VL-3B/7B-Instruct
  • SFT: batch size 96, lr \(1 \times 10^{-6}\), 3 epochs, 8 GPUs
  • GRPO: batch size 294, lr \(5 \times 10^{-7}\), temperature 0.9, KL coefficient \(\beta=0.01\), clip \(\epsilon=0.2\), 7 GPUs, ~48 hours
  • C-DREM static prior weight ratio: Qwen2.5-VL-72B : 32B : \(\phi_k\) = 1 : 1 : 2

Key Experimental Results

Multimodal-Mind2Web (Web Tasks)

| Method | Model Scale | Cross-Task Step SR | Cross-Website Step SR | Cross-Domain Step SR | Avg Step SR |
| --- | --- | --- | --- | --- | --- |
| GPT-4 + Choice | - | 40.2 | 32.4 | 36.8 | 36.5 |
| GPT-4V + OmniParser | - | 39.4 | 36.5 | 42.0 | 39.3 |
| Explorer-7B | 7B | 53.2 | 56.7 | 53.0 | 54.3 |
| AgentTrek-7B | 7B | 55.7 | 51.4 | 52.6 | 53.2 |
| AGUVIS-7B | 7B | 60.4 | 54.6 | 56.6 | 57.2 |
| Co-EPG-Web-3B | 3B | 53.1 | 51.1 | 50.0 | 51.4 |
| Co-EPG-Web-7B | 7B | 61.9 | 58.1 | 55.3 | 58.4 |

AndroidControl (Mobile Tasks)

| Method | High Step Acc | Low Step Acc | Avg Step Acc |
| --- | --- | --- | --- |
| UI-TARS-2B | 68.9 | 89.3 | 79.1 |
| UI-TARS-7B | 72.5 | 90.8 | 81.7 |
| InfiGUI-R1-3B | 73.2 | 90.0 | 81.6 |
| Co-EPG-Mob-3B | 73.4 | 90.2 | 81.8 |
| Co-EPG-Mob-7B | 74.2 | 92.0 | 83.1 |

OmniACT (Desktop + Web Cross-Platform)

| Method | Avg Acc |
| --- | --- |
| GPT-4o + UGround-V1-7B | 34.0 |
| Co-EPG-Des-7B-M2 | 53.2 (+19.2%) |

Ablation Study

P-G Decoupled vs. End-to-End:

| Architecture | Avg Step SR |
| --- | --- |
| End-to-End | 50.1 |
| P-G Dual-Model | 53.5 (+3.4%) |

C-DREM Ablation:

| Variant | Avg Step SR |
| --- | --- |
| w/o C-DREM (single grounding model as reward) | 56.50 |
| w/o Confidence & Prior Weights (uniform weighting) | 57.01 (+0.51) |
| w/o Confidence Weights (static prior only) | 57.67 (+1.17) |
| Full C-DREM | 58.41 (+1.91) |

Iterative Improvement (Co-EPG-Web-7B Avg Step SR):

| Stage | w/o GRPO | w/ GRPO |
| --- | --- | --- |
| Iteration 1 | 52.6 | 53.5 |
| Iteration 2 | 54.5 | 55.0 |
| Iteration 3 | 56.5 | 58.4 |

Each round of iteration yields consistent gains; GRPO consistently outperforms pure SFT, demonstrating that iterative data improvement and GRPO co-training are dual driving forces.

Data Efficiency: Co-EPG-Web-7B uses only 6,862 annotated steps (2.42% of AGUVIS's 283,500) while surpassing AGUVIS-7B's performance.

Highlights & Insights

  • The positive feedback loop design is highly elegant: The plan naturally bridges Planning and Grounding, and the grounding model's execution outcomes serve as a natural reward signal for the planning model, eliminating the need to manually engineer complex reward functions.
  • C-DREM addresses the single-point failure problem of reward models in RL training: Ensembling multiple grounding models with confidence weighting yields more stable training and faster convergence than a single reward model. Confidence is computed via token log-likelihood, incurring minimal computational overhead.
  • Exceptional data efficiency: Co-EPG surpasses AGUVIS, which relies on large-scale synthetic data, using only 2.42% of its data, validating that deeply exploiting data value is more effective than simply accumulating more data.
  • Cross-platform generalization: The approach is effective across Web (Mind2Web), Mobile (AndroidControl), and Desktop (OmniACT) environments.

Limitations & Future Work

  • Step-level evaluation only: The paper evaluates only single-step action accuracy and does not assess end-to-end task success rate at the trajectory level, which is required in real-world applications where multiple consecutive steps must all be correct.
  • Reward evaluation still relies on large models: C-DREM depends on Qwen2.5-VL-72B and 32B as auxiliary reward sources, resulting in substantial computational cost for GRPO training (7 GPUs × 48 hours).
  • Diminishing returns across iterations: The improvement margin gradually decreases across the three iterations (Iter1→2 vs. Iter2→3); the paper does not investigate whether further iterations would saturate.
  • No direct measure of plan semantic quality: Plan quality is entirely assessed indirectly through grounding outcomes; a good plan executed by a weak grounding model may receive a negative reward, potentially introducing noise.
  • Validated on offline data only: No online environment interaction experiments are conducted, leaving the performance of continuous execution in real GUI environments unknown.
  • Strong dependency on Qwen2.5-VL: Both the backbone and auxiliary reward models are from the Qwen series; performance on other VLM families remains unexplored.

Comparison with Related Work

  • vs. AGUVIS (ICLR 2025): AGUVIS employs a two-stage end-to-end single-model approach (grounding pre-training followed by planning fine-tuning), requiring large amounts of synthetic data (283K steps). Co-EPG surpasses it using only 2.42% of that data through decoupling and co-evolution. The essential distinction is that Co-EPG enables mutual improvement between two models, rather than sequential unidirectional training.
  • vs. WebRL / WebEvolver: These methods also apply RL for self-evolution but adopt end-to-end single-model designs that do not exploit the decoupled Planning-Grounding structure. Co-EPG's contribution lies in combining the decoupled architecture with self-evolving training.
  • vs. Agent-SAMA (AAAI 2026): Agent-SAMA employs a state-aware FSM for GUI navigation decisions without training optimization, leaning more toward engineering design. Co-EPG focuses on innovation in the training paradigm.

Transferable Insights:

  • The co-evolution idea is generalizable: Beyond GUI Agents, any task decomposable into "policy + execution" (e.g., high-level planning + low-level control in robotic manipulation) can benefit from this positive feedback loop design.
  • C-DREM has general applicability: Confidence-weighted reward ensembling from multiple models can be extended to other RL-from-AI-feedback scenarios.

Rating

  • Novelty: ⭐⭐⭐⭐ Combining Planning-Grounding decoupling with GRPO co-training is novel, and the confidence-weighted ensemble in C-DREM is also a clever design; however, no individual component is entirely new.
  • Experimental Thoroughness: ⭐⭐⭐⭐ Validated across Web/Mobile/Desktop benchmarks with ablations covering major components, but lacks trajectory-level evaluation and online interaction experiments.
  • Writing Quality: ⭐⭐⭐⭐ Overall logic is clear, architectural diagrams are intuitive, and mathematical derivations are complete; however, the layout of Section 4 relative to Table 1 is somewhat disorganized.
  • Value: ⭐⭐⭐⭐ High data efficiency is a significant advantage, but the GRPO training pipeline requires multiple large models for reward evaluation, creating a relatively high deployment barrier.