MobileRL: Online Agentic Reinforcement Learning for Mobile GUI Agents¶

Conference: ICLR 2026
OpenReview: https://openreview.net/forum?id=C3F0G9nXhl
Code: https://github.com/THUDM/MobileRL
Area: LLM Agent / Mobile GUI Agent / Agentic Reinforcement Learning
Keywords: GUI Agent, Agentic RL, GRPO, Difficulty Adaptation, Experience Replay, Curriculum Filtering, Reward Reshaping

TL;DR¶

MobileRL introduces an online agentic RL framework for mobile GUI agents that combines a two-stage reasoning SFT warm-up with Adaptive GRPO (AdaGRPO). By combining positive sample replay, failure curriculum filtering, and shortest-path rewards, the framework stabilizes multi-step training under sparse rewards. It achieves state-of-the-art (SOTA) results, with a 9B model reaching an 80.2% success rate on AndroidWorld and 53.6% on AndroidLab.

Background & Motivation¶

Background: Vision-Language Models (VLMs) enable GUI agents to operate web and mobile interfaces zero-shot. However, the dominant scaling approach remains Supervised Fine-Tuning (SFT) or offline imitation learning on static expert demonstrations. These methods suffer from narrow behavioral coverage and a lack of self-recovery capabilities—once the model enters a state not present in the demonstrations, it often fails to recover.

Limitations of Prior Work: Reinforcement Learning with Verifiable Rewards (RLVR) is a promising alternative. However, applying "Agentic RL" (which requires multi-step planning and reasoning) to interactive environments such as mobile emulators faces three major hurdles:

Complex Instruction Following under Sparse Positive Signals: Running rollouts in mobile simulators is slow and expensive. Base models struggle to consistently output valid GUI action commands, making successful trajectories extremely rare and early exploration highly inefficient.
Heavy-Tailed and Unstable Task Difficulty Spectrum: Some tasks succeed with minimal sampling, while others remain unsolvable for certain models. Naive uniform sampling wastes compute and fails to utilize rare but highly informative successful trajectories.
Sampling Bottleneck of Large-Scale Mobile Environments: Managing hundreds of concurrent mobile instances is resource-intensive and difficult to replicate. Low sampling throughput further limits the scale and efficiency of RL.

Key Challenge: Rewards in mobile agent RL are sparse and binary (only awarded upon task completion), while task difficulty follows a heavy-tailed distribution and sampling costs are high. Standard GRPO's uniform sampling and uniform reward broadcasting are both unstable and inefficient in this setting.

Goal: To build a framework that efficiently, stably, and reproducibly scales agentic RL in mobile interactive environments, transforming open-source VLMs into SOTA mobile GUI agents.

Core Idea: The authors propose a two-stage SFT warm-up (Reasoning-Free + Reasoning) to provide a strong initial policy for RL. This is followed by Adaptive GRPO (AdaGRPO), which utilizes a triplet of strategies—replaying successful trajectories based on difficulty, filtering unsolvable tasks, and reshaping rewards based on completion length—to address sparse rewards, heavy-tailed difficulty, and expensive sampling simultaneously.

Method¶

Overall Architecture¶

MobileRL divides training into two phases: Reasoning Warm-up (Reasoning-Free SFT + Reasoning SFT, initializing the policy to output actions with intermediate reasoning) and Online Agentic RL (using AdaGRPO for closed-loop interaction, trajectory sampling, and updates). Tasks are modeled as finite-horizon MDPs: the state includes screenshots, parsed UI hierarchy XML, and historical think/action text; actions are atomic operations like Tap, Swipe, Type, Launch, Back, or Finish. Rewards are binary (1 for success, 0 otherwise). The RL process is built on the Verl framework, utilizing hundreds of Dockerized Android Virtual Devices (AVDs) to support reproducible sampling across 1000+ environments.
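The MDP described above can be written down as a few plain data types. This is a minimal sketch, not the authors' code: the class and field names (`State`, `Action`, `Step`, `Trajectory`, `argument`) are hypothetical and only mirror the components named in the text (screenshot, UI hierarchy XML, think/action history, atomic actions, binary terminal reward).

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class ActionType(Enum):
    """Atomic operations listed in the text."""
    TAP = "Tap"
    SWIPE = "Swipe"
    TYPE = "Type"
    LAUNCH = "Launch"
    BACK = "Back"
    FINISH = "Finish"


@dataclass
class Action:
    kind: ActionType
    # Hypothetical free-form payload (coordinates, text, app name, ...).
    argument: Optional[str] = None


@dataclass
class State:
    screenshot: bytes                                  # raw screenshot image
    ui_xml: str                                        # parsed UI hierarchy XML
    history: List[str] = field(default_factory=list)   # earlier think/action text


@dataclass
class Step:
    state: State
    thought: str    # intermediate reasoning emitted before the action
    action: Action


@dataclass
class Trajectory:
    task_id: str
    steps: List[Step]
    success: bool

    @property
    def reward(self) -> float:
        """Binary terminal reward: 1 for success, 0 otherwise."""
        return 1.0 if self.success else 0.0

    def __len__(self) -> int:
        return len(self.steps)
```

Keeping the trajectory length and the binary reward on the same object is convenient because the length-based reward reshaping described later needs both.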

flowchart LR
    A[Expert Demonstrations] --> B[Reasoning-Free SFT<br/>Action Foundation]
    B --> C[Reasoning SFT<br/>Iterative Reasoning Completion]
    C --> D[Warm-up Policy πθ]
    D --> E[Closed-loop Interaction<br/>Sample G Trajectories]
    E --> F[SPA Reward Reshaping<br/>Compute Relative Advantage]
    F --> G[AdaPR Buffer<br/>Store High-Advantage Successes]
    F --> H[FCF<br/>Filter All-Failure Tasks]
    G --> I[Mixed Sampling Policy Update]
    H --> I
    I --> D

Key Designs¶

1. Two-Stage Reasoning Warm-up: Learn to act, then learn to think. Starting online RL directly from a base model is inefficient. First, Reasoning-Free SFT is conducted on expert demonstrations and the AndroidControl dataset to establish action capabilities. Because these datasets often lack reasoning traces, the resulting model acts as a "black box." Reasoning SFT is therefore layered on top: an Instruct model generates "reasoning-action" candidates \((c_k, a_k)\) for each task \(x\), and only candidates whose action \(a_k\) matches the expert action \(a^*\) are kept, as triples \((x, c_k, a^*)\), in dataset \(D_R\). An initial reasoning policy \(\pi_0^R\) is trained on \(D_R\) and then iteratively refined. This yields a reliable action foundation and transparent intermediate reasoning, reducing expensive on-policy trial-and-error during RL.
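The filtering step that builds \(D_R\) can be sketched as a simple rejection filter. The function name `build_reasoning_dataset`, the `propose` callback (standing in for the Instruct model), and the exact string-equality match are all assumptions made for illustration; the source does not specify how action equivalence is checked.

```python
from typing import Callable, Iterable, List, Tuple

Candidate = Tuple[str, str]        # (reasoning c_k, action a_k)
Example = Tuple[str, str, str]     # (task x, reasoning c_k, expert action a*)


def build_reasoning_dataset(
    tasks: Iterable[Tuple[str, str]],            # pairs of (task x, expert action a*)
    propose: Callable[[str], List[Candidate]],   # hypothetical Instruct-model sampler
) -> List[Example]:
    """Keep only candidates whose action agrees with the expert action."""
    dataset: List[Example] = []
    for task, expert_action in tasks:
        for reasoning, action in propose(task):
            # Assumption: actions are compared by exact string equality.
            if action == expert_action:
                dataset.append((task, reasoning, expert_action))
    return dataset
```

A task for which no candidate matches the expert action simply contributes nothing to the dataset, which is one plausible reason for refining the reasoning policy iteratively.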

2. Shortest Path Adjustment (SPA): Preventing long trajectories from gaming the reward. Mobile environments typically provide binary rewards \(r \in \{0, 1\}\) at the terminal state. Standard practice broadcasts this to every step \(R(s_t, a_t) = r\). However, longer successful trajectories contribute more gradient terms, effectively rewarding verbosity. SPA scales success rewards based on trajectory length:

\[R_{\text{SPA}}(s_t,a_t) = r(\tau_i)\left(1 - \alpha\frac{T_i - T_{\min}}{T_i}\right),\quad T_{\min}=\min_{\tau_j\in\mathcal{T}_{\text{succ}}}|\tau_j|,\ \alpha\in(0,1]\]

where \(T_i\) is the length of trajectory \(\tau_i\), \(T_{\min}\) is the shortest successful trajectory length for the current task instance, and \(\alpha\) controls the penalty magnitude. This only applies to successful trajectories; failed attempts still receive 0. Unlike token-level penalties in text RLVR, SPA guides the policy toward "shorter successful paths" without sacrificing the success rate.
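To make the formula concrete, the sketch below computes SPA rewards for one group of rollouts on a single task and then applies the standard GRPO group normalization (reward minus group mean, divided by group standard deviation). The function names and the `eps` stabilizer are assumptions, and whether MobileRL normalizes advantages exactly this way is not stated in this summary.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def spa_rewards(
    lengths: Sequence[int],
    successes: Sequence[bool],
    alpha: float = 1.0,
) -> List[float]:
    """R_SPA = r * (1 - alpha * (T_i - T_min) / T_i) for each trajectory in a group."""
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must lie in (0, 1]")
    successful_lengths = [t for t, ok in zip(lengths, successes) if ok]
    if not successful_lengths:
        # No success in the group: every trajectory keeps reward 0.
        return [0.0] * len(lengths)
    t_min = min(successful_lengths)
    return [
        1.0 - alpha * (t - t_min) / t if ok else 0.0
        for t, ok in zip(lengths, successes)
    ]


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-6) -> List[float]:
    """Standard GRPO-style normalization within one task's rollout group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    rewards = spa_rewards([5, 10, 8], [True, True, False], alpha=0.5)
    print(rewards)
    print(group_relative_advantages(rewards))
```

With lengths 5, 10, and 8, where only the first two succeed and \(\alpha = 0.5\), the rewards are 1.0, 0.75, and 0.0: the 10-step success is discounted by \(0.5 \cdot (10 - 5)/10 = 0.25\), while the failure stays at zero.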

3. Adaptive Positive Replay (AdaPR): Utilizing rare and valuable successes. Under sparse rewards, successful trajectories for difficult tasks are rare but contain high information value. AdaPR maintains a replay buffer \(B\). In each iteration \(t\), the top-\(\kappa\) high-advantage successful trajectories from the current sampling \(\mathcal{T}_t\) are stored. During updates, each mini-batch is drawn from a mixed distribution:

\[q(\tau) = \gamma\, p_B(\tau) + (1-\gamma)\, p_{\text{on}}(\tau)\]

where \(p_{\text{on}}\) is the on-policy distribution and \(p_B\) is the buffer distribution. For a mini-batch of size \(M\), \(\gamma M\) high-advantage samples are taken from \(B\) and the remaining \((1-\gamma)M\) stay on-policy, which preserves exploration. This reinforces rare success signals and stabilizes policy updates.
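A minimal sketch of the replay mechanism follows. Several details are assumptions rather than facts from the source: trajectories are represented by string ids, the buffer has a fixed `capacity` and evicts the lowest-advantage entries, replayed items are drawn uniformly from the buffer, and \(\gamma M\) is rounded down.

```python
import heapq
import random
from typing import List, Sequence, Tuple

Item = Tuple[float, str]  # (advantage, trajectory id); hypothetical representation


class PositiveReplayBuffer:
    """Stores high-advantage successful trajectories across iterations."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.items: List[Item] = []

    def add_top_k(self, rollouts: Sequence[Tuple[str, float, bool]], k: int) -> None:
        """rollouts holds (trajectory id, advantage, success) for the current iteration."""
        successes = [(adv, tid) for tid, adv, ok in rollouts if ok]
        self.items.extend(heapq.nlargest(k, successes))
        # Assumption: when full, keep only the highest-advantage entries.
        self.items = heapq.nlargest(self.capacity, self.items)

    def sample(self, n: int, rng: random.Random) -> List[str]:
        n = min(n, len(self.items))
        return [tid for _, tid in rng.sample(self.items, n)]


def mixed_minibatch(
    buffer: PositiveReplayBuffer,
    on_policy: Sequence[str],
    batch_size: int,
    gamma: float,
    rng: random.Random,
) -> List[str]:
    """Draw about gamma * M items from the buffer and the rest from fresh rollouts."""
    replayed = buffer.sample(int(gamma * batch_size), rng)
    n_fresh = min(batch_size - len(replayed), len(on_policy))
    fresh = rng.sample(list(on_policy), n_fresh)
    return replayed + fresh
```

When the buffer holds fewer than \(\gamma M\) items, this sketch fills the shortfall with on-policy samples, so early training degrades gracefully to plain on-policy updates.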

4. Failure Curriculum Filtering (FCF): Reserving compute for solvable tasks. Given the heavy-tailed difficulty distribution, some tasks consistently yield zero rewards. FCF uses online statistics to dynamically down-weight these: if a task yields all-zero rewards for two consecutive epochs, it enters a three-epoch cooling period where its sampling probability decays as \(w_{\text{task}} = \exp(-f)\) (where \(f\) is the count of failed epochs). If it remains unsolved after cooling, it is permanently removed from the training set. This resource-aware curriculum specifically prunes unsolvable tasks in expensive mobile environments while retaining recoverable failure signals. Crucially, FCF only affects the training sampling distribution; evaluation is always performed on the full test set.
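The schedule above can be sketched as a small per-task state machine. The class name, the `patience` and `cooldown` parameters, and the rule that any success resets the failure counter are assumptions; the source only specifies the two-epoch trigger, the three-epoch cooling period, the \(\exp(-f)\) weight, and permanent removal.

```python
import math
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass
class TaskStats:
    failed_epochs: int = 0   # consecutive all-zero epochs, i.e. f
    removed: bool = False


class FailureCurriculumFilter:
    """Down-weights, then drops, tasks that keep yielding all-zero rewards."""

    def __init__(self, patience: int = 2, cooldown: int = 3):
        self.patience = patience    # all-zero epochs before cooling starts
        self.cooldown = cooldown    # cooling epochs before permanent removal
        self.stats: Dict[str, TaskStats] = {}

    def update(self, task_id: str, rewards: Iterable[float]) -> None:
        """Record one epoch of rollout rewards for a task."""
        stats = self.stats.setdefault(task_id, TaskStats())
        if stats.removed:
            return
        if any(r > 0 for r in rewards):
            stats.failed_epochs = 0   # assumption: a success resets the counter
            return
        stats.failed_epochs += 1
        if stats.failed_epochs >= self.patience + self.cooldown:
            stats.removed = True

    def weight(self, task_id: str) -> float:
        """Unnormalized sampling weight used for the training distribution only."""
        stats = self.stats.get(task_id, TaskStats())
        if stats.removed:
            return 0.0
        if stats.failed_epochs < self.patience:
            return 1.0
        return math.exp(-stats.failed_epochs)
```

Under these assumptions a task that never succeeds is sampled at full weight for two epochs, then at weights \(e^{-2}\), \(e^{-3}\), and \(e^{-4}\) during cooling, and is removed after the fifth all-zero epoch.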

Key Experimental Results¶

Main Results¶

Comparison on AndroidWorld (116 tasks/20 apps) and AndroidLab (138 tasks/9 apps) across closed-source and open-source models (success rate %):

Model	#Params	AndroidWorld	AndroidLab
GPT-4o-2024-11-20	-	34.5	31.2
Claude-Sonnet-4-thinking	-	41.0	40.6
UI-Tars-1.5	-	64.2	38.3
AutoGLM-2024-10	-	–	36.2
Qwen2.5-VL-7B-Instruct	7B	27.6	10.1
GLM-4.1V-9B-Thinking	9B	41.7	24.6
V-Droid	8B	59.5	38.3
UI-Genie-Agent	72B	-	41.2
MobileRL w/ Qwen2.5-VL-7B	7B	72.0	42.5
MobileRL w/ GLM-4.1V-9B	9B	80.2	53.6

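MobileRL-9B improves on the best prior results in the table, 64.2% on AndroidWorld (UI-Tars-1.5) and 41.2% on AndroidLab (UI-Genie-Agent), reaching 80.2% and 53.6%. MobileRL-7B also edges past the 72B UI-Genie-Agent on AndroidLab (42.5% vs. 41.2%); the table reports no AndroidWorld score for UI-Genie-Agent.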

Ablation Study¶

Incremental gains by phase (success rate %; values in parentheses are absolute gains in points over the previous row):

Model	AndroidWorld	AndroidLab
Qwen2.5-VL-7B-Instruct	27.6	10.1
+ Reasoning-Free SFT	50.2 (+22.6)	36.9 (+26.8)
+ Reasoning SFT	56.8 (+6.6)	38.7 (+1.8)
+ AdaGRPO (MobileRL-7B)	72.0 (+15.2)	42.5 (+3.8)
GLM-4.1V-9B-Base	7.7	10.1
+ Reasoning-Free SFT	48.1 (+40.4)	42.7 (+32.6)
+ Reasoning SFT	66.2 (+18.1)	45.0 (+2.3)
+ AdaGRPO (MobileRL-9B)	80.2 (+14.0)	53.6 (+8.6)

AdaGRPO component ablation (AndroidWorld test set, averaged over 3 runs):

Variant	AndroidWorld
MobileRL (Full)	71.1
w/o AdaPR	63.6
w/o SPA	69.1
w/o AdaPR & SPA	58.5
w/o FCF	64.8
w/o AdaGRPO (Reasoning SFT only)	56.8

Key Findings¶

Every component contributes: Removing AdaPR drops performance by 7.5 points, FCF by 6.3 points, and SPA by 2 points. Removing both AdaPR and SPA drops it to 58.5. Replay and curriculum filtering are the primary drivers of training stability.
SFT warm-up is critical: Two-stage SFT improves GLM-4.1V-9B from 7.7% to 66.2% on AndroidWorld, and AdaGRPO then adds a further 14.0 points. This suggests RL requires a sufficiently strong initial policy.
Stable training curves: The reward curve for the full MobileRL remains consistently higher and more stable than the ablation variants.
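The point drops quoted in the findings can be recomputed directly from the component ablation table. The snippet below is only an arithmetic check on the reported AndroidWorld numbers; the dictionary layout and variable names are illustrative.

```python
# AndroidWorld success rates (%) copied from the component ablation table.
ablation = {
    "MobileRL (Full)": 71.1,
    "w/o AdaPR": 63.6,
    "w/o SPA": 69.1,
    "w/o AdaPR & SPA": 58.5,
    "w/o FCF": 64.8,
    "w/o AdaGRPO (Reasoning SFT only)": 56.8,
}

full = ablation["MobileRL (Full)"]

# Absolute drop in points relative to the full system, rounded to one decimal.
drops = {
    name: round(full - score, 1)
    for name, score in ablation.items()
    if name != "MobileRL (Full)"
}

if __name__ == "__main__":
    for name, drop in sorted(drops.items(), key=lambda kv: kv[1], reverse=True):
        print(f"{name}: -{drop} points")
```

Removing AdaPR and SPA together costs 12.6 points, more than the 9.5 points obtained by adding their individual drops (7.5 and 2.0), which hints that the two components may interact rather than contribute independently.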

Highlights & Insights¶

Explicitly modeling "Difficulty" in RL: AdaPR reuses rare successes, FCF prunes dead-end tasks, and SPA favors efficiency. All three target the specific pain points of mobile GUI agents—heavy-tailed difficulty and expensive sampling—rather than blindly applying text-based RLVR.
Engineering as a Contribution: Leveraging Verl to orchestrate hundreds of Dockerized AVDs for concurrent sampling addresses the long-standing reproducibility and throughput issues of mobile simulators.
Scaling Efficiency: A 7B model outperforming 72B competitors suggests that algorithm and data design are more critical than parameter count for GUI agents.

Limitations & Future Work¶

Heavily dependent on SFT initialization: The pipeline requires "Reasoning-Free SFT → Reasoning SFT → RL," where reasoning SFT relies on bootstrapping from an Instruct model. The feasibility of starting RL directly from a base model remains unexplored.
Terminal binary rewards: SPA only scales successful trajectories; it does not replace the need for intermediate reward signals in extremely long or sub-task-heavy missions.
Platform Limitation: The framework and environment are currently tied to Android, leaving cross-platform generalization (iOS, Desktop, Web) for future verification.
Aggressive FCF pruning: The permanent removal of tasks might exclude tasks that appear "unsolvable" early on but could be tackled after the policy improves later in training.

Related Work¶

GRPO and RLVR: This work builds upon GRPO by extending it from single-turn text reasoning to multi-step GUI agents, providing a reusable difficulty-adaptive modification for sparse multi-step settings.
Experience Replay and Curriculum Learning: While AdaPR and FCF borrow from classical RL, their adaptation to the "expensive sampling + sparse reward" constraints of LLM agents demonstrates the necessity of tailored designs when porting RL components to agentic tasks.

Rating¶

Novelty: ⭐⭐⭐⭐ While AdaPR/FCF/SPA have conceptual roots, their integration into a difficulty-adaptive framework tailored for mobile GUI RL is a distinctive and practical innovation.
Experimental Thoroughness: ⭐⭐⭐⭐ Dual benchmarks, dual backbones, step-by-step ablations, and stable training curves provide a solid evidence chain.
Writing Quality: ⭐⭐⭐⭐ Clear mapping between challenges and methods. Design choices are well motivated, and the formulas and tables support the text.
Value: ⭐⭐⭐⭐⭐ Open-sourcing a framework that sets a new SOTA on multiple benchmarks and enables large-scale mobile RL is highly valuable to the community.