Agent-SAMA: State-Aware Mobile Assistant¶
Conference: AAAI 2026 · arXiv: 2505.23596v3 · Code: Zenodo · Area: Other · Keywords: GUI Agent, Finite State Machine, Multi-Agent Collaboration, Error Recovery, Mobile Task Automation
TL;DR¶
This paper proposes Agent-SAMA, which for the first time introduces a finite state machine (FSM) into mobile GUI agents, modeling UI screens as states and user actions as transitions. Four specialized agents collaborate to achieve state-aware task planning, execution verification, and error recovery, improving success rate by up to 12 percentage points and recovery success rate by 13.8 points on cross-app benchmarks.
Background & Motivation¶
Mobile GUI agents leverage MLLMs to interpret UI screenshots and execute actions such as tapping and swiping, with prior work including AppAgent and the Mobile-Agent series. However, existing agents are fundamentally reactive—they determine the next action solely based on the current screen, lacking a structured representation of app navigation flow. This resembles a tourist navigating street by street, aware of visited locations but without a global understanding of the overall route. This leads to three critical limitations: (1) inability to understand execution context (i.e., the current stage within a task); (2) inability to detect whether action outcomes match expectations; and (3) absence of structured support for error recovery, making agents prone to repetitive failure loops.
Core Problem¶
How to provide GUI agents with a structured representation of app navigation that enables them to track execution progress, anticipate action outcomes, and precisely roll back to stable states upon failure? This is particularly critical for long-horizon cross-app tasks, where action chains are lengthy, error probability is high, and reactive agents are fundamentally insufficient.
Method¶
Overall Architecture¶
Agent-SAMA is a multi-agent framework comprising four phases: Planning → Execution → Verification & Recovery → Knowledge Retention. The core innovation lies in modeling app interactions using an FSM \(\mathcal{M} = (S, A, T, s_0, G)\): UI screens are represented as states \(S\), user actions as \(A\), and screen-to-screen transitions as the transition function \(T\), with \(s_0\) the initial screen and \(G\) the set of goal states. The FSM is constructed incrementally in real time during execution.
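The FSM formalism above can be captured in a small data structure. This is a minimal sketch, not the paper's implementation: the class and field names (`State`, `FSM`, `add_transition`, `predict`) are hypothetical, and states are keyed by a concise screen label as the paper later does with its State Beacons.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class State:
    """An FSM node: one UI screen with a concise semantic label."""
    label: str                 # e.g. "Homepage of Walmart"
    description: str = ""      # richer state description d_i


@dataclass
class FSM:
    """M = (S, A, T, s0, G): screens as states, user actions as transitions."""
    initial: State                                          # s0
    goal_labels: frozenset = frozenset()                    # G
    states: dict = field(default_factory=dict)              # S, keyed by label
    transitions: dict = field(default_factory=dict)         # T: (label, action) -> label

    def add_transition(self, src: State, action: str, dst: State) -> None:
        """Incrementally record an observed screen-to-screen transition."""
        for s in (src, dst):
            self.states.setdefault(s.label, s)
        self.transitions[(src.label, action)] = dst.label

    def predict(self, src: State, action: str):
        """Expected next screen label, if this transition was seen before."""
        return self.transitions.get((src.label, action))
```

Because transitions are only added as they are observed, the graph grows incrementally during execution, matching the paper's online-construction setting; `predict` returning `None` simply means the transition has not been seen yet.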
Key Designs¶
- Planner Agent + LLM-as-Judge Planning: High-level tasks are decomposed into a sub-task sequence \(\pi = [(g_1, r_1), ..., (g_k, r_k)]\), where each sub-task is accompanied by a rationale. The key innovation is generating five candidate plans and then using an LLM-as-judge to score them along dimensions such as goal relevance, execution efficiency, and robustness, selecting the best plan rather than committing to a single, possibly suboptimal, path.
- State Agent + Real-Time FSM Construction: The core module of the execution phase. A Screen Parser extracts UI element coordinates and descriptions; the State Agent maps each screen to an FSM node containing three elements: current state description \(d_i\), predicted next state \(d_{i+1}\), and pre/post-conditions. To mitigate state explosion, a State Beacon mechanism is introduced—each state is assigned a concise semantic label (e.g., "Homepage of Walmart"), and newly encountered states are first matched against existing beacons, with matches reusing existing nodes. In cross-app tasks, each app maintains an independent FSM.
- Reflection Agent for Error Recovery: Structured verification and recovery are enabled via the FSM. The FSM-predicted transition (including post-conditions) is compared against the actual screen, yielding one of three verdicts: Success / NoChange / Fail. Upon failure, the FSM is used to identify a previously validated stable state \(s_j\), and a recovery plan is generated to roll back and retry. If recovery fails consecutively (\(n=2\) times), the process escalates to the Planner for re-planning, preventing infinite recovery loops.
- Mentor Agent for Knowledge Retention: Upon task completion, reusable knowledge \(K\) (action sequences, guiding cues, and constructed FSMs) is extracted and stored in long-term memory. At the start of new tasks, relevant knowledge is retrieved as context (e.g., a shopping FSM for Walmart can be transferred to Amazon), improving planning efficiency and robustness.
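The Planner's multi-candidate selection can be illustrated with a minimal sketch. Here `generate_plan` and `judge_score` stand in for the actual MLLM calls, and the function and dimension names are hypothetical; only the overall pattern (sample several plans, score each across judge dimensions, keep the best) follows the paper.

```python
def select_plan(task, generate_plan, judge_score, n_candidates=5):
    """Sample several candidate plans, score each with an LLM-as-judge
    along several dimensions, and keep the highest-scoring plan."""
    candidates = [generate_plan(task) for _ in range(n_candidates)]
    dimensions = ("goal_relevance", "execution_efficiency", "robustness")
    return max(candidates,
               key=lambda plan: sum(judge_score(task, plan, d) for d in dimensions))
```

The design choice worth noting is that generation and evaluation are separated: the judge never produces a plan itself, which makes its scores comparable across candidates.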
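The verification verdicts and hierarchical recovery described above can be sketched as follows. This is an illustrative reduction under stated assumptions: screens are compared via their beacon labels, `attempt_rollback` and `replan` stand in for agent calls, and all names are hypothetical; the three-way verdict (Success / NoChange / Fail) and the escalate-after-\(n=2\)-failures rule follow the paper.

```python
from enum import Enum


class Verdict(Enum):
    SUCCESS = "Success"
    NO_CHANGE = "NoChange"
    FAIL = "Fail"


def verify(prev_label, predicted_label, postconditions, observed_label):
    """Compare the FSM-predicted transition (and its post-conditions)
    against the screen actually observed after acting."""
    if observed_label == prev_label:
        return Verdict.NO_CHANGE               # the action had no visible effect
    if observed_label == predicted_label and all(postconditions):
        return Verdict.SUCCESS
    return Verdict.FAIL


def recover(stable_labels, attempt_rollback, replan, max_failures=2):
    """On Fail, roll back to the most recent previously validated stable
    state and retry; after max_failures consecutive failed recoveries,
    escalate to the Planner for re-planning (avoiding infinite loops)."""
    failures = 0
    for label in reversed(stable_labels):      # most recent stable state first
        if attempt_rollback(label):
            return ("resumed", label)
        failures += 1
        if failures >= max_failures:
            break
    return ("replanned", replan())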
Loss & Training¶
No training is required; the entire framework is driven by prompt engineering over an MLLM (GPT-4o) with temperature set to 0 to reduce variability. The Screen Parser employs DBNet for OCR, GroundingDINO for icon localization, and Qwen-VL-Plus for icon description generation.
Key Experimental Results¶
| Dataset | Metric | Agent-SAMA | Mobile-Agent-E+Evo | Gain (pp) |
|---|---|---|---|---|
| Mobile-Eval-E | Success Rate | 84.0% | 72.0% | +12.0% |
| Mobile-Eval-E | Recovery Success | 71.88% | 67.34% | +4.53% |
| Mobile-Eval-E | Action Accuracy | 83.24% | 76.65% | +6.59% |
| Mobile-Eval-E | Satisfaction Score | 86.15% | 78.97% | +7.18% |
| SPA-Bench | Success Rate | 80.0% | 75.0% | +5.0% |
| SPA-Bench | Recovery Success | 66.67% | 52.86% | +13.81% |
| AndroidWorld | Success Rate | 63.7% | 53.4% | +10.3% |
Ablation Study¶
- Planning module has the greatest impact: Removing the Planner causes SR to drop from 84% to 52% on Mobile-Eval-E, and from 80% to 45% on SPA-Bench.
- Multi-plan selection is effective: Single-path planning vs. 5-candidate + Judge selection yields an SR difference of approximately 12%.
- Pre/Post-conditions: Removing them reduces SR from 84% to 72% on Mobile-Eval-E, with a notable impact on verification and recovery quality.
- Mentor knowledge retention: Removing it reduces SR from 84% to 68% on Mobile-Eval-E, demonstrating the importance of cross-task knowledge transfer.
- All four components are complementary and mutually reinforcing; none is dispensable.
Highlights & Insights¶
- First application of FSM to GUI agents: Modeling mobile app interactions as a state machine is a natural and effective abstraction, providing agents with structured memory and reasoning support.
- State Beacon deduplication mechanism: Concise and practical, it effectively mitigates the state explosion problem.
- Elegant error recovery design: The FSM provides stable anchor points, and a hierarchical recovery strategy (local rollback first, then global re-planning) demonstrably reduces execution errors (32 vs. 49 on Mobile-Eval-E).
- Agent-SAMA remains competitive under weaker MLLMs: When using Claude 3.5, it still outperforms the GPT-4o-based baseline, suggesting that the gains come from the framework itself rather than from the underlying model alone.
- Model-agnostic method: The FSM layer can serve as a lightweight memory module pluggable into any existing GUI agent.
Limitations & Future Work¶
- State Beacon relies on LLM text matching: This may introduce false matches or missed matches; replacing it with visual-semantic embeddings for more efficient matching is worth exploring.
- Flat FSM may suffer state explosion in ultra-long-horizon tasks: Tasks spanning 5+ apps and 20+ steps lack hierarchical abstraction.
- Only three benchmarks are evaluated: Scenarios involving dynamic content (e.g., ad pop-ups) and external interruptions are not assessed.
- Performance varies across runs: Differences exist between the results reported in the paper and reproduced baseline results; uncertainty remains despite averaging over 5 runs.
- The effectiveness of cross-app FSM transfer in knowledge retention has not been quantified in isolation.
Related Work & Insights¶
- vs. Mobile-Agent-E+Evo: Both employ a multi-agent architecture and long-term memory, but Agent-SAMA adds FSM-based structured representation, yielding more precise recovery (recovery success rate 4.5–13.8 points higher) and fewer execution errors.
- vs. GUI-Xplore: Both use graph-based modeling of app navigation, but GUI-Xplore constructs a static graph offline from videos for inference-time reasoning; Agent-SAMA builds the FSM online in real time for execution decision-making and recovery—an "online and practical" counterpart.
- vs. Agent S2 / V-Droid: These are strong baselines on AndroidWorld (54.3% / 59.5%); Agent-SAMA surpasses both at 63.7%, while placing greater emphasis on long-horizon cross-app tasks.
- FSM as a general-purpose agent memory layer: This paper demonstrates that an FSM can function as a "lightweight, model-agnostic memory layer"—a concept generalizable to web agents, desktop agents, and even embodied agents.
- LLM-as-Judge for plan selection: Generating multiple candidate plans and scoring them with a judge is more robust than single-pass generation; this design pattern is transferable to other agent settings such as code generation and research automation.
- Formalization via pre/post-conditions: Rooted in Design by Contract from classical software engineering, this approach regains practical value in the context of LLM-based agents.
Rating¶
- Novelty: ⭐⭐⭐⭐ (The idea of modeling app interactions as FSMs is natural; the genuine contribution lies in the engineering instantiation and multi-agent collaboration design.)
- Experimental Thoroughness: ⭐⭐⭐⭐ (Three benchmarks, ablation studies, and multi-MLLM comparisons are provided, though the effectiveness of cross-app knowledge transfer lacks quantitative analysis.)
- Writing Quality: ⭐⭐⭐⭐ (Structure is clear, figures are intuitive, and the appendix includes complete prompts and case studies; some tables are slightly disorganized.)
- Value: ⭐⭐⭐⭐ (The FSM-as-agent-memory-layer idea is broadly applicable; code is open-sourced and can serve as a foundation for future research.)