History-Aware Reasoning for GUI Agents¶
- Conference: AAAI 2026
- arXiv: 2511.09127
- Code: https://github.com/BigTaige/HAR-GUI
- Area: Agent / GUI Automation
- Keywords: GUI Agent, Short-term Memory, Reinforcement Learning, Reflective Learning, History-Aware Reasoning
TL;DR¶
This paper proposes the HAR framework, which transforms the reasoning paradigm of GUI agents from "history-unaware" to "history-aware" by constructing reflective learning scenarios, synthesizing error-correction guidelines, and designing a hybrid RL reward function incorporating a Memory-Augmented Reward (MAR). A 3B model trained under this framework surpasses larger models on multiple benchmarks including AITW, Mind2Web, and GUI-Odyssey.
Background & Motivation¶
State of the Field¶
Background: Existing GUI agents (e.g., UI-R1, GUI-R1, InfiGUI-R1) combine System-2 CoT with RL to enhance reasoning, yet they share a critical, overlooked limitation: their reasoning is history-unaware. Multi-step sequential interactions are reduced to isolated single-screen comprehension, discarding crucial cues embedded in the historical interaction context; in an 11-step long-horizon task, for example, the agent at step 8 reasons without considering anything that transpired in the preceding 7 steps. This stems from the intrinsic CoT patterns of pretrained MLLMs: existing RL training with reasoning-format instructions optimizes only action prediction and leaves the underlying reasoning paradigm untouched.
Solving the Problem¶
Goal: The paper aims to equip GUI agents with stable short-term memory for long-horizon sequential reasoning, that is, to explicitly integrate and analyze historical interaction information within the System-2 CoT. The central challenge is that the history-unaware reasoning pattern is deeply entrenched in the CoT habits acquired during pretraining, and standard RL training cannot alter it (it merely narrows the gap between pass@k and pass@1).
Method¶
Overall Architecture¶
HAR comprises two key training stages: (1) GUI scenario warm-up (SFT for domain knowledge injection); and (2) learning from failures (reflective RL for short-term memory enhancement).
Key Designs¶
- GUI Scenario Warm-up (SFT):
    - Collection of GUI understanding data (captioning, QA, grounding, etc.).
    - Synthesis of Action-to-Summary (Act2Sum) data: a teacher model generates a goal-oriented semantic summary for each action to strengthen action-semantics understanding.
    - System-2 CoT distillation: Qwen2.5-VL-72B synthesizes System-2 reasoning chains for each sample; correctly predicted samples are filtered and used for training.
- Reflective Learning Scenario Construction:
    - The warmed-up model performs inference; erroneous samples are collected as \(\mathbb{D}_{his}\).
    - The teacher model generates up to 3 error-correction guidelines \(\mathbb{G}\) per erroneous sample (analyzing the cause of the error and providing hints without revealing the answer).
    - A reflective format instruction is constructed: the erroneous prediction, the erroneous CoT, and the correction guidelines are provided to the model, which must first articulate the error (statement) and then re-reason.
- Hybrid RL Reward Function (see the sketch after this list):
    - Format Reward \(r^{format}\): whether the output conforms to the reflective format.
    - Action Reward \(r^{action}\): for coordinate-based actions (CLICK), a multi-scale Euclidean distance reward combines normalized and absolute coordinate distances; a correct prediction earns an additional precision bonus (\(r = 1 + F_{abs}\)), while an incorrect prediction receives partial reward based on the absolute distance.
    - Memory-Augmented Reward (MAR) \(r^{memory}\): Qwen3-235B judges whether the CoT contains analysis of historical interactions. This is the key innovation: explicitly rewarding the model for considering prior actions during reasoning.
    - Combined reward: \(r = r^{format} \times (r^{action} + \gamma \times r^{memory})\), with \(\gamma = 0.2\).
    - Design rationale: compared to forcing attention to history via instructions (GRPO*), MAR uses RL signals so that the model learns autonomously when historical context should be consulted.
- Round-2 RL + Task Mixing: after Round-1 RL in the reflective scenario, Round-2 RL switches to reasoning-format instructions (aligned with inference-time usage) and mixes in grounding tasks (TMTS) to prevent degradation of grounding capability.
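Below is a minimal, self-contained sketch of the hybrid reward referenced in the list above. The reflective-format tag schema, the distance thresholds and decay forms, and the keyword-based stand-in for the Qwen3-235B judge are all illustrative assumptions; only the combination rule \(r = r^{format} \times (r^{action} + \gamma \, r^{memory})\) with \(\gamma = 0.2\) and the \(1 + F_{abs}\) precision bonus come from the paper.

```python
import math
import re

GAMMA = 0.2  # weight of the memory-augmented reward (value from the paper)

def format_reward(output: str) -> float:
    """1.0 if the output follows the reflective format, else 0.0.
    The tag schema below is an assumption, not the paper's exact template."""
    pattern = r"<statement>.+?</statement>\s*<think>.+?</think>\s*<answer>.+?</answer>"
    return 1.0 if re.search(pattern, output, re.DOTALL) else 0.0

def action_reward(pred_xy, gold_xy, screen_wh, tau_abs=10.0) -> float:
    """Multi-scale distance reward for CLICK: mixes absolute-pixel and
    resolution-normalized distances. Threshold and decay forms are assumed."""
    (px, py), (gx, gy), (w, h) = pred_xy, gold_xy, screen_wh
    d_abs = math.hypot(px - gx, py - gy)                # absolute pixel distance
    d_norm = math.hypot((px - gx) / w, (py - gy) / h)   # normalized distance
    f_abs = math.exp(-d_abs / tau_abs)                  # fine-grained precision term
    if d_abs <= tau_abs:                                # "correct" click
        return 1.0 + f_abs                              # base reward + precision bonus
    return 0.5 * (f_abs + math.exp(-d_norm))            # partial credit for misses

def memory_reward(cot: str, judge) -> float:
    """MAR: an external judge (Qwen3-235B in the paper) decides whether the
    CoT actually analyzes the historical interaction context."""
    verdict = judge(f"Does this reasoning analyze prior steps? yes/no\n{cot}")
    return 1.0 if verdict.strip().lower().startswith("yes") else 0.0

def hybrid_reward(output, cot, pred_xy, gold_xy, screen_wh, judge) -> float:
    # r = r_format * (r_action + gamma * r_memory); the format term is a gate.
    return format_reward(output) * (
        action_reward(pred_xy, gold_xy, screen_wh) + GAMMA * memory_reward(cot, judge)
    )

if __name__ == "__main__":
    out = ("<statement>Step 3 clicked the wrong tab.</statement>"
           "<think>Steps 1-2 opened Settings, so the Network entry is next.</think>"
           "<answer>CLICK(512, 340)</answer>")
    def keyword_judge(prompt: str) -> str:  # toy stand-in for the LLM judge
        return "yes" if "step" in prompt.lower() else "no"
    print(hybrid_reward(out, out, (520, 335), (512, 340), (1080, 2400), keyword_judge))
```

Note how the multiplicative format term gates everything: a response that violates the reflective format earns zero reward regardless of click accuracy.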
Loss & Training¶
- GRPO algorithm for RL optimization.
- Base model: Qwen2.5-VL-3B-Instruct, LoRA rank=64, alpha=128.
- SFT: 1 epoch, lr=5e-6; RL: 2 epochs, lr=2e-6.
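As a concrete reference, the reported adapter and schedule hyperparameters could be expressed with Hugging Face peft as follows; the `target_modules` and `lora_dropout` values are assumptions, since only rank, alpha, epochs, and learning rates are reported above.

```python
from peft import LoraConfig

# LoRA adapter matching the reported rank/alpha; the adapted projection
# modules and the dropout are assumptions, not stated in the summary above.
lora_cfg = LoraConfig(
    r=64,
    lora_alpha=128,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed
    lora_dropout=0.05,                                        # assumed
    task_type="CAUSAL_LM",
)

# Reported two-phase schedule: SFT warm-up, then GRPO-based RL.
STAGES = {
    "sft": {"epochs": 1, "lr": 5e-6},
    "rl":  {"epochs": 2, "lr": 2e-6},
}
```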
Key Experimental Results¶
| Benchmark | Metric | HAR-GUI-3B | InfiGUI-R1-3B | GUI-R1-3B | UI-R1-3B | Qwen2.5-VL-7B |
|---|---|---|---|---|---|---|
| AITW | SSR (avg) | 70.2 | 67.7 | 65.6 | 59.9 | - |
| Mind2Web | SSR (Cross-Task) | 42.2 | 37.2 | 38.8 | 36.8 | - |
| GUI-Odyssey | SSR (avg) | 62.31 | 50.62 | 48.35 | 46.71 | 58.39 |
| ScreenSpot | Avg | 83.3 | - | - | - | 79.8 |
| ScreenSpot-V2 | Avg | 86.2 | - | - | - | - |
OOD evaluation (Chinese Alipay mini-program): HAR-GUI-3B achieves a step success rate (SSR) of 76.5%, substantially outperforming GUI-R1-3B (69.99%) and trailing Qwen2.5-VL-72B (86.91%) by only about 10 points.
Ablation Study¶
- RL with reasoning-format instructions only (GRPO): reasoning remains history-unaware, with limited performance gains.
- Forcing historical attention via instructions (GRPO*): leads to performance degradation, indicating that rigid constraints are counterproductive—the model should autonomously acquire this behavior.
- HAR's reflective scenario + error-correction guidelines + MAR: the model autonomously develops history-aware reasoning patterns.
- Training RL exclusively on sequential reasoning data degrades grounding capability; TMTS effectively mitigates this.
- In further post-training, initializing from the HAR-GUI checkpoint consistently outperforms initializing from the GRPO-trained checkpoint or from the base Qwen2.5-VL.
Highlights & Insights¶
- Precise problem identification: The paper discovers and systematically analyzes the "history-unaware" reasoning deficiency in existing GUI agents—a problem present even in 72B models.
- Memory-Augmented Reward: Directly rewards whether the CoT incorporates historical information, using RL signals to guide the shift in reasoning paradigm rather than imposing manual constraints.
- Reflective learning paradigm: Constructing "error + correction guideline" reflective scenarios injects external domain reasoning knowledge more effectively than pure RL exploration.
- Multi-scale coordinate reward: Dual-scale reward combining normalized and absolute coordinates enables finer-grained optimization of CLICK actions.
- 3B surpassing 7B+: HAR-GUI-3B outperforms Qwen2.5-VL-7B on GUI-Odyssey (62.31 vs. 58.39).
Limitations & Future Work¶
- Synthesis of correction guidelines and CoT relies on a 72B teacher model; distillation quality is bounded by the teacher's capability.
- MAR uses a model to judge whether the CoT contains historical information, which may introduce false positives or negatives.
- OOD generalization is evaluated only in CLICK-only scenarios; complex actions such as TYPE are not thoroughly validated.
- The training pipeline is relatively complex (SFT + Round-1 RL + Round-2 RL + post-training).
Related Work & Insights¶
- vs. UI-R1/GUI-R1/InfiGUI-R1: These methods apply RL with reasoning-format instructions to enhance reasoning but optimize only action prediction without altering the reasoning paradigm; HAR fundamentally changes the reasoning pattern through reflective scenarios.
- vs. UI-TARS: UI-TARS introduces System-2 reasoning but does not specifically address short-term memory; HAR's 3B model outperforms UI-TARS-2B on ScreenSpot.
- vs. traditional agent frameworks (ReAct/Reflexion): Traditional methods rely on hand-crafted prompts for reflection; HAR internalizes reflective capability through training.
- The "short-term memory deficiency" may be a common weakness of all CoT-based agents—CoT tends to reason independently from the current state while neglecting historical context.
- The design rationale behind MAR (using RL rewards to incentivize specific properties of the reasoning process) is generalizable to other scenarios requiring particular reasoning patterns.
- The reflective learning scenario (providing errors + correction guidelines for RL) constitutes an effective knowledge injection paradigm.
Rating¶
- Novelty: ⭐⭐⭐⭐ Precisely identifies and addresses the history-unaware problem in GUI agents; the MAR design is novel.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Covers three categories of benchmarks (sequential reasoning / grounding / understanding), OOD evaluation, and comprehensive ablations.
- Writing Quality: ⭐⭐⭐⭐ Problem formulation is clear, analysis is convincing, and case studies are rich.
- Value: ⭐⭐⭐⭐ Proposes an effective solution to the short-term memory problem in GUI agents; code is open-sourced.
Additional Notes¶
- The methodology and experimental design offer useful reference points for related research on GUI agents.
- Future work could validate the approach's generalizability and scalability in broader scenarios and at larger model sizes.
- Intersections with recent RL, MCTS, and multimodal methods present potential research opportunities.
- Deployment feasibility and computational efficiency should be assessed against practical application requirements.
- The choice of datasets and evaluation metrics may limit the generality of the conclusions; cross-validation on additional benchmarks is recommended.