Training High-Level Schedulers with Execution-Feedback Reinforcement Learning for Long-Horizon GUI Automation

Conference: CVPR 2026
arXiv: 2511.22235
Code: hehehahi4/CES
Area: Multimodal VLM
Keywords: GUI automation, long-horizon tasks, multi-agent framework, reinforcement learning, state tracking, task scheduling

TL;DR

This paper proposes CES (Coordinator-Executor-State Tracker), a multi-agent framework coupled with a staged execution-feedback reinforcement learning algorithm. By decoupling high-level task planning from low-level execution, and through dedicated training of the Coordinator and State Tracker, CES significantly improves GUI agent planning and state management capabilities on long-horizon tasks.

Background & Motivation

Conflicting capabilities in single-agent systems: Existing end-to-end GUI agents attempt to couple heterogeneous capabilities—task planning, multi-step reasoning, GUI element grounding, and precise action execution—within a single model. With limited parameters, simultaneously mastering both high-level and low-level abilities is difficult, and catastrophic capability collapse tends to occur as task complexity increases.
Lack of task state awareness: In long-horizon tasks, agents primarily rely on screenshots to infer progress. However, screenshots are insufficient and unreliable state representations—recurring home screens, out-of-distribution interfaces, and similar situations make progress estimation difficult.
Limitations of the SFT paradigm: Supervised fine-tuning relies heavily on large-scale, high-quality trajectory annotations, which are costly to acquire and generalize poorly.
Insufficiency of single-step RL: Although existing RL methods achieve reasonable performance on simple tasks, they still train a single policy network and do not address the capability coupling problem.
Lack of optimization in multi-agent frameworks: Existing multi-agent approaches mostly assign roles via general-purpose VLMs and prompt engineering, without deep specialization for each role.
Temporal verification experiments: The paper designs screenshot temporal ordering experiments and finds that accuracy drops sharply as the step gap increases, empirically demonstrating that screenshots cannot adequately represent task state.

Method

Overall Architecture: CES Collaborative Loop

By analogy with operating system design, three specialized agents collaborate in a loop:

Coordinator (CPU / planning core): Integrates the user's high-level instruction \(q\), the state summary \(m^{t-1}\) provided by the State Tracker, and the current screenshot \(s^t\) to decompose and generate a clear atomic instruction \(l^t = \pi_c(q, m^{t-1}, s^t)\).
Executor (I/O device / execution terminal): A frozen pretrained GUI model that produces its output \(u^t = (th^t, a^t) = \pi_e(l^t, s^t)\) from the atomic instruction \(l^t\) and the current screenshot \(s^t\) alone, without needing to understand long-term intent.
State Tracker (dynamic memory): A language model that does not directly perceive the GUI environment. It generates a new high-semantic natural-language state summary \(m^t = \pi_s(q, m^{t-1}, u^t)\) by interpreting the Executor's output \(u^t\), the user intent \(q\), and the previous state \(m^{t-1}\).
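The three update rules above can be read as one step of a loop. The following is a minimal sketch under stated assumptions, not the authors' implementation: `run_ces_loop`, `StepRecord`, `observe`, and `act` are hypothetical names, the three agents are passed in as plain callables, and the stopping rule (a `done` flag from the environment plus a step cap) and the empty initial state are assumptions, since the notes do not say how an episode starts or ends.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Hypothetical placeholder types: a screenshot is represented as a string
# here, and the Executor output u^t is the pair (th^t, a^t) from the text.
Screenshot = str
ExecutorOutput = Tuple[str, str]


@dataclass
class StepRecord:
    instruction: str        # l^t, produced by the Coordinator
    output: ExecutorOutput  # u^t, produced by the Executor
    state: str              # m^t, produced by the State Tracker


def run_ces_loop(
    q: str,
    coordinator: Callable[[str, str, Screenshot], str],
    executor: Callable[[str, Screenshot], ExecutorOutput],
    state_tracker: Callable[[str, str, ExecutorOutput], str],
    observe: Callable[[], Screenshot],
    act: Callable[[str], bool],
    initial_state: str = "",
    max_steps: int = 20,
) -> List[StepRecord]:
    """Run Coordinator -> Executor -> State Tracker until done or max_steps."""
    m_prev = initial_state
    history: List[StepRecord] = []
    for _ in range(max_steps):
        s_t = observe()                       # current screenshot s^t
        l_t = coordinator(q, m_prev, s_t)     # l^t = pi_c(q, m^{t-1}, s^t)
        u_t = executor(l_t, s_t)              # u^t = pi_e(l^t, s^t)
        m_t = state_tracker(q, m_prev, u_t)   # m^t = pi_s(q, m^{t-1}, u^t)
        history.append(StepRecord(l_t, u_t, m_t))
        m_prev = m_t
        # Assumption: the environment applies a^t and reports completion.
        if act(u_t[1]):
            break
    return history
```

Note that the State Tracker never receives `s_t`: in this sketch, as in the description above, it sees only text.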

Staged Execution-Feedback Reinforcement Learning

Warm-up SFT: The Coordinator and State Tracker are first fine-tuned with supervised learning on existing trajectory data, enabling them to learn basic role responsibilities and output formats.

Execution-feedback reward function: Rather than scoring intermediate outputs (atomic instructions or state summaries) directly, the method passes them to the Executor, and a rule-based reward function scores the resulting action:

\[R = \alpha_1 R_{format} + \alpha_2 R_{executor}, \quad R_{executor} = \gamma_1 R_{type} + \gamma_2 R_{param}\]

where \(R_{format}\) rewards format correctness, \(R_{type}\) rewards correct action type, and \(R_{param}\) rewards correct action parameters.
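As a rough illustration, the reward can be written as a small function. This is a sketch under assumptions: the function name `execution_feedback_reward` is hypothetical, each component is treated as a binary 0/1 signal (the notes do not say whether the paper uses graded scores), and the default coefficients are the ones listed under Training Details.

```python
from typing import Tuple


def execution_feedback_reward(
    format_ok: bool,
    type_ok: bool,
    param_ok: bool,
    alpha: Tuple[float, float] = (0.1, 0.9),
    gamma: Tuple[float, float] = (0.2, 0.8),
) -> float:
    """R = alpha_1 * R_format + alpha_2 * (gamma_1 * R_type + gamma_2 * R_param)."""
    # Assumption: each rule-based check yields 1.0 on success and 0.0 otherwise.
    r_format = 1.0 if format_ok else 0.0
    r_type = 1.0 if type_ok else 0.0
    r_param = 1.0 if param_ok else 0.0
    r_executor = gamma[0] * r_type + gamma[1] * r_param
    return alpha[0] * r_format + alpha[1] * r_executor
```

For example, `execution_feedback_reward(True, True, False)` scores a well-formatted output whose executed action has the right type but wrong parameters.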

Stage 1 — Training the Coordinator: The Executor is frozen, and ground-truth states are used as \(m^{t-1}\) input. The Coordinator's planning policy is optimized using the GRPO algorithm with execution-feedback rewards.
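GRPO scores each sampled output relative to the other samples drawn for the same input, so no separate value network is needed. A minimal sketch of that group-relative normalization follows; `group_relative_advantages` is a hypothetical helper, and the use of the population standard deviation and the `eps` constant are assumptions, since implementations differ on both.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize rewards within one sampled group: (r_i - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations use sample std
    return [(r - mu) / (sigma + eps) for r in rewards]
```

In Stage 1, each reward in the group would come from executing one candidate atomic instruction with the frozen Executor. When every candidate earns the same reward, all advantages are zero and that group contributes no learning signal.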

Stage 2 — Training the State Tracker: The trained Coordinator and Executor are frozen. State summaries generated by the State Tracker pass through the full CES loop, and the resulting execution-feedback reward is used as the training signal for the State Tracker, so that it learns to produce the state information most valuable to the Coordinator.
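One way to picture the Stage 2 signal: a candidate summary is judged only by what the frozen downstream agents do with it. The sketch below is an interpretation under assumptions, not the paper's code: `state_tracker_reward` and `score_action` are hypothetical names, and scoring a single downstream step is a simplification of "the full CES loop".

```python
from typing import Callable, Tuple


def state_tracker_reward(
    candidate_state: str,
    q: str,
    next_screenshot: str,
    frozen_coordinator: Callable[[str, str, str], str],
    frozen_executor: Callable[[str, str], Tuple[str, str]],
    score_action: Callable[[Tuple[str, str]], float],
) -> float:
    """Score a State Tracker summary by the action it leads to downstream."""
    # The frozen Coordinator plans from the candidate summary m^t ...
    instruction = frozen_coordinator(q, candidate_state, next_screenshot)
    # ... the frozen Executor turns that plan into (th, a) ...
    output = frozen_executor(instruction, next_screenshot)
    # ... and a rule-based scorer (e.g. the execution-feedback reward)
    # rates the result. Only the State Tracker would be updated from this.
    return score_action(output)
```

A summary that omits a completed sub-goal would tend to make the Coordinator repeat it, lowering the score, which is how the reward can favor summaries that are useful rather than merely fluent.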

Training Details

Coordinator backbone: Qwen2.5-VL-7B; State Tracker backbone: Qwen3-4B
SFT stage: LLaMA Factory, 1 epoch, lr=5e-5
RL stage: Verl framework; Coordinator 10 epochs lr=1e-6; State Tracker 5 epochs
Reward coefficients: \(\alpha_1=0.1, \alpha_2=0.9\); \(\gamma_1=0.2, \gamma_2=0.8\)
Training Executor: GUI-R1-7B (frozen); Hardware: 8×80 GB GPUs
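
Substituting the reported coefficients into the reward definition gives the effective weight of each term (plain algebra on the values listed here, not an additional result from the paper):

\[R = 0.1\,R_{format} + 0.9\,(0.2\,R_{type} + 0.8\,R_{param}) = 0.1\,R_{format} + 0.18\,R_{type} + 0.72\,R_{param}\]

Parameter correctness therefore carries 72% of the total weight, four times the weight given to the action type.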

Key Experimental Results

Main Results: Long-Horizon Task Performance (Table 1)

Evaluated on three benchmarks—AITZ (avg. 7.5 steps), AMEX (avg. 12.8 steps), and GUI-Odyssey (avg. 15.3 steps):

Model	Method	AITZ SR	AMEX SR	GUI-Odyssey SR
Qwen2.5-VL-7B	Zero Shot	18.11	35.10	34.37
GUI-R1-7B	RL	30.59	43.69	38.79
GUI-Owl-7B	RL	32.70	40.48	35.82
+ GPT-5	Multi-Agent	40.55	35.80	42.47
+ CES (Ours)	Multi-Agent	43.05	48.48	53.69

With GUI-R1-7B as the Executor, CES improves average Type accuracy by 10.38% over the GUI-R1-7B baseline and raises GUI-Odyssey SR from 38.79% to 53.69% (+14.90 percentage points).

Generalization (Table 2)

CES functions as a plug-and-play module and yields significant improvements across Executors of different scales:

Executor	Setting	AMEX SR	GUI-Odyssey SR
UI-R1-3B	Baseline → CES	35.81 → 43.38 (+7.57)	32.49 → 38.04 (+5.55)
GUI-Owl-7B	Baseline → CES	40.48 → 47.24 (+6.76)	35.82 → 46.65 (+10.83)
GUI-Owl-32B	Baseline → CES	43.16 → 52.05 (+8.89)	39.60 → 56.75 (+17.15)

Ablation Study (Table 3)

Configuration	AMEX SR	GUI-Odyssey SR
Full CES	48.48	53.69
w/o Coordinator	33.27 (−15.21)	39.15 (−14.54)
w/o State Tracker	42.08 (−6.40)	42.52 (−11.17)
w/o RL (SFT only)	36.54 (−11.94)	42.89 (−10.80)

Removing either agent or the RL stage lowers SR by 6.40 to 15.21 points. Removing the Coordinator costs the most on both benchmarks, which supports the necessity of each component and of the training strategy.
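The parenthesized deltas in Table 3 can be recomputed from the SR columns. The short script below does so as a consistency check; the helper name `sr_drops` and the dictionary layout are illustrative choices, while the numbers are copied from the table.

```python
from typing import Dict

Scores = Dict[str, float]


def sr_drops(full: Scores, ablated: Dict[str, Scores]) -> Dict[str, Scores]:
    """Return full-minus-ablated SR in percentage points, rounded to 2 places."""
    return {
        name: {bench: round(full[bench] - sr, 2) for bench, sr in scores.items()}
        for name, scores in ablated.items()
    }


# Values copied from Table 3.
FULL = {"AMEX": 48.48, "GUI-Odyssey": 53.69}
ABLATED = {
    "w/o Coordinator": {"AMEX": 33.27, "GUI-Odyssey": 39.15},
    "w/o State Tracker": {"AMEX": 42.08, "GUI-Odyssey": 42.52},
    "w/o RL (SFT only)": {"AMEX": 36.54, "GUI-Odyssey": 42.89},
}
DROPS = sr_drops(FULL, ABLATED)

# The configuration whose removal hurts most on each benchmark.
WORST = {bench: max(DROPS, key=lambda name: DROPS[name][bench]) for bench in FULL}
```

`WORST` maps both benchmarks to "w/o Coordinator", and the State Tracker matters noticeably more on GUI-Odyssey (11.17 points) than on AMEX (6.40 points).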

Highlights & Insights

OS-inspired design philosophy: GUI automation is analogized to the CPU-I/O-Memory architecture of an operating system, elegantly decoupling planning, execution, and state management.
State Tracker innovation: A pure language model is used for dynamic context compression and state summarization, shifting state understanding from a high-dimensional visual space to a low-dimensional semantic space, nearly eliminating State Loss errors (14% → 2%).
Execution-feedback reward: This design cleverly addresses the difficulty of directly evaluating abstract tasks (planning/state tracking) by using downstream execution outcomes to guide upstream optimization.
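Plug-and-play generalizability: The lightweight combination of a 7B Coordinator and 4B State Tracker substantially benefits diverse Executors; notably, a 7B Executor paired with these modules matches or exceeds a single 32B model (GUI-Owl-7B with CES reaches 47.24 AMEX SR and 46.65 GUI-Odyssey SR, versus 43.16 and 39.60 for the GUI-Owl-32B baseline).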
Thorough empirical validation: Preliminary temporal ordering experiments, three long-horizon benchmarks, multi-scale Executor generalization, detailed ablations, and failure case analysis are all provided.

Limitations & Future Work

Executor remains a bottleneck: Failure case analysis shows that the performance bottleneck has shifted to the Executor's perceptual limitations (Perception Error and Generalization Failure), which CES cannot address.
Staged rather than joint training: The Coordinator and State Tracker are trained separately; the possibility of joint training or co-evolution has not been explored (noted in the paper's Future Work).
Domain applicability: Validation is limited to mobile GUI scenarios and has not been extended to other GUI environments such as web or desktop.
Dependency on state annotation quality: Stage 1 relies on ground-truth state annotations; the feasibility of obtaining such annotations in real-world settings remains to be verified.
Computational overhead: Three models are executed serially (7B + frozen Executor + 4B), and inference latency may be an obstacle to practical deployment.

vs GUI-R1: Both apply RL to train GUI agents, but GUI-R1 trains a single end-to-end model, whereas CES decouples high-level and low-level components and specifically optimizes high-level scheduling, raising GUI-Odyssey SR from 38.79% to 53.69%.
vs SWIRL: Both are multi-stage workflow methods. SWIRL achieves SR=51.65% on GUI-Odyssey; CES reaches 53.69% and leads on other benchmarks as well.
vs Mobile-Agent-v3 / MobiAgent: Both are multi-agent frameworks, but these methods assign roles via prompt engineering without dedicated optimization. CES trains each role deeply via execution-feedback RL.
vs GPT-5 Multi-Agent: Using GPT-5 as Coordinator and State Tracker produces unstable results (with some metrics declining), whereas CES's dedicated trained models consistently and significantly outperform the prompt-based approach.

Rating

Novelty: ⭐⭐⭐⭐ — The OS-inspired three-role decoupled design combined with the execution-feedback RL training paradigm demonstrates good originality.
Experimental Thoroughness: ⭐⭐⭐⭐⭐ — Three benchmarks, multi-scale generalization, detailed ablations, failure case analysis, and preliminary experiments are all included.
Writing Quality: ⭐⭐⭐⭐ — Structure is clear, motivation is well articulated, and the preliminary experiment design is elegant.
Value: ⭐⭐⭐⭐ — The plug-and-play high-level scheduling module has practical value for the GUI agent community, and the staged training paradigm is broadly transferable.