MobileIPL: Enhancing Mobile Agents Thinking Process via Iterative Preference Learning¶

Conference: ICLR 2026
OpenReview: https://openreview.net/forum?id=PS8iu4PxKz
Code: MobileIPL-dataset (HuggingFace)
Area: Mobile GUI Agent / Preference Learning
Keywords: Mobile GUI Agent, CoaT, Iterative Preference Learning, Thinking-level DPO, Rule-based Reward, Instruction Evolution

TL;DR¶

To address the lack of CoaT (Chain of Action-Planning Thoughts) reasoning trajectories and the difficulty of step-level annotation for mobile GUI agents, MobileIPL uses MCTS-style iterative sampling to construct a CoaT-tree. It scores leaf nodes based on rule-based rewards and backpropagates values to intermediate thinking steps, constructing "Thinking-level" DPO pairs (T-DPO) to optimize the reasoning process. This approach outperforms large-scale continuously pre-trained models such as OS-ATLAS and UI-TARS on three mobile GUI benchmarks.

Background & Motivation¶

Background: VLM-driven mobile GUI agents need to generate intermediate thoughts between user instructions and current interfaces. The Chain of Action-Planning Thoughts (CoaT, similar to System-2 slow thinking) paradigm proposed by AITZ has been proven to significantly improve reasoning performance in GUI tasks by organizing "Description-Thought-Action-Grounding" into a structured chain of thought.

Limitations of Prior Work: (1) High-quality and diverse CoaT trajectories are scarce; direct SFT on fixed CoaT trajectories easily leads to overfitting, trapping the model in rigid reasoning patterns. (2) While self-training can mitigate data scarcity, using only final answer correctness as a reward ignores the quality of intermediate reasoning steps, leading to reward hacking. (3) Search methods like ReST-MCTS* rely on process reward models (PRM) to score intermediate steps, but PRMs require large-scale manual step-level annotation. In the mobile GUI domain, environments depend on real devices or simulators, making the cost of step-level annotation far higher than in text, code, or math tasks.

Key Challenge: Optimizing the quality of "intermediate thinking steps" (rather than just final action correctness) without paying the high cost of manual step-level PRM annotation.

Goal: Achieve fine-grained preference optimization for the entire thinking process of mobile GUI agents without PRMs or manual step-level labels, while mitigating overfitting during the warm-up SFT stage.

Core Idea: Replace PRMs with rule-based rewards. Each action is expanded into a CoaT-tree where leaf nodes (complete actions) are scored against ground truth via rules. Scores are backpropagated to intermediate steps to automatically construct thinking-level DPO pairs. Simultaneously, GPT-4o three-stage instruction evolution is used to inject diverse Q&A during SFT to prevent overfitting and enhance UI understanding.

Method¶

Overall Architecture¶

MobileIPL consists of three stages: first, warm-up SFT using mixed data augmented by "instruction evolution" to obtain a seed policy; second, iterative sampling for each action to build a CoaT-tree, scoring leaf nodes with rule rewards and backpropagating values to intermediate steps; finally, filtering contrastive pairs from the tree for Thinking-level DPO self-training, using the updated agent as a new base for further iterations until performance plateaus.

flowchart LR
    A[General VLM] --> B[Instruction Evolution<br/>GPT-4o 3-stage Q&A]
    B --> C[Warm-up SFT<br/>Mixed T+Q for Seed Policy πS0]
    C --> D[CoaT-tree Iterative Sampling<br/>K continuations per step]
    D --> E[Leaf Rule Reward v·st·<br/>+ Backpropagation]
    E --> F[Contrastive Data Filtering<br/>α/β/γ Classification for DPO Pairs]
    F --> G[Thinking-level DPO Training]
    G -->|Update agent as new base| D

Key Designs¶

1. Multi-turn Thinking Modeling: Decomposing one action into four dialogue turns. The authors formalize each action $\hat{a}_i$ in CoaT reasoning as a multi-turn dialogue $\hat{a}_i=[s_1,s_2,s_3,s_4]$, corresponding to Description, Thought, Action decision, and Grounding respectively, where $s_1=\text{Description}(P_1,u_i)$, $s_2=\text{Thought}(P_2,u_i,I,\hat{a}_{0:i-1},s_1)$, $s_3=\text{Action}(\cdot)$, and $s_4=\text{Grounding}(\cdot)$. The motivation is practical: when a model decodes a full reasoning segment at once, the image modality $u$ occupies most input tokens, overshadowing text instructions $I$ and action history, which shifts attention away from text details. Multi-turn dialogue forces each step to focus on the current reasoning sub-task, balancing multimodal token ratios and ensuring format-compliant outputs.

2. CoaT-tree Iterative Sampling + Rule Rewards: Using rules instead of PRMs to score intermediate steps. Sampling stepwise following the CoaT paradigm, $K$ continuations $\hat{s}_t=\{\hat{s}_t^{(k)}\mid \hat{s}_{0:t-1}\}_{k=1}^K$ are sampled for each step $t$, forming a tree. Leaf nodes (complete actions) are compared with ground truth $a^*$ for rule rewards: 1 for perfect correctness; $v_{type}+\text{score}_{match}$ if the action type matches; 0 otherwise. $\text{score}_{match}$ uses normalized distance $d(x,y)$ for CLICK and text F1 for INPUT to provide smooth rewards for "near misses." Crucially, leaf values are backpropagated to intermediate steps: $v(s_{t-1})=c\cdot\frac{1}{K}\sum_{k=1}^{K}v(s_t^{(k)})$, where $c$ is a discount factor. This provides each intermediate thinking step with a continuous value without needing PRMs or manual labels.

3. Contrastive Data Filtering: Categorizing trees into α/β/γ to select samples. Based on the value distribution of leaf nodes, trees are classified into three types: $\alpha$ are "perfect trees" where all leaf values are 1 (stable correct outputs, not used for contrastive pairs); $\beta$ are "potentially correct trees" with both correct and incorrect leaves, serving as the main source for DPO pairs; $\gamma$ are "to-be-refined trees" where no leaf value is 1. In $\beta$, samples with value 1 and diverse action types are used as positives, and pairs are constructed as $\beta_{pairs}=\langle \hat{s}_t^{(k)}\uparrow,\hat{s}_t^{(k')}\downarrow\mid v(\hat{s}_t^{(k)})-v(\hat{s}_t^{(k')})>1/K\rangle$ (requiring a value gap $>1/K$ to avoid noise). In $\gamma$, ground truth actions $a^*$ are used as positives against sampled negatives, $\gamma_{pairs}=\langle a^*\uparrow,\hat{s}_t^{(k)}\downarrow\rangle$.

4. Thinking-level DPO (T-DPO) + Iterative Self-training. Given the same prefix $s_{1:t-1}$, positive $s_t^+$ and negative $s_t^-$ continuations for the same thinking step are compared: $$L_{\text{T-DPO}}=-\mathbb{E}\big[\log\sigma(\beta\log\frac{\pi_\theta(s_t^+\mid s_{1:t-1})}{\pi_{ref}(s_t^+\mid s_{1:t-1})}-\beta\log\frac{\pi_\theta(s_t^-\mid s_{1:t-1})}{\pi_{ref}(s_t^-\mid s_{1:t-1})})\big]$$ This is optimized alongside an SFT loss. Unlike TreePO/SPO which segments long sequences into many short fragments, MobileIPL uses the fixed CoaT-tree to model thinking, with values derived directly from rule rewards (avoiding unstable PRMs), making sampling and training more efficient.

5. Three-stage Instruction Evolution: Mitigating SFT overfitting and enhancing UI understanding. Since CoaT patterns become rigid after warm-up SFT, the authors use GPT-4o to generate three levels of Q&A based on real screenshots: Level I General GUI Q&A (grounding/referring/description), Level II Component functions and nested relationships (avoiding misidentifying textviews as buttons), and Level III Advanced FAQ (page structure, navigation expectations). These are manually filtered and mixed with original trajectories $T$ for SFT, preventing static instruction overfitting and improving UI layout understanding via visually grounded Q&A. This expanded the sampling space from 4% to 31% and the correct answer ratio from 72.7% to 77.9%.

Key Experimental Results¶

Main Results¶

Performance on AITZ, AMEX, and AndroidControl benchmarks using Qwen2-VL-7B as the backbone, with Step.Acc as the primary metric:

Benchmark	Key Comparison	MobileIPL	Strongest Baseline
AITZ (Total)	vs Falcon-UI-7B (3M GUI pre-training)	69.15	69.10
AITZ (Total)	vs Seed Model / Qwen2-VL-7B	69.15	55.40 / 60.36
AMEX (Overall)	vs SphAgent-7B (SOTA)	74.29	70.71 (+3.58)
AMEX (Overall)	vs OS-Atlas / UI-Tars	74.29	70.33 / 70.33 (+3.96)
AndroidControl (Step.Acc)	vs UI-Tars-7B / OS-Atlas	72.7	72.5 / 71.2
AndroidControl (Grounding)	vs Qwen2-VL-7B (SFT)	77.0	68.5 (+8.5)

On the AndroidControl OOD subsets (IDD / app-unseen / task-unseen), MobileIPL achieved 73.6 / 70.0 / 72.2, significantly outperforming OS-Atlas (71.2 / 60.7 / 66.2) and Qwen2-VL-GRPO (70.2 / 68.1 / 69.7).

Ablation Study¶

Ablation on AITZ Round 1 (MobileIPL-R1 = 65.4):

Setting	Total	Δ
MobileIPL-R1 (Full)	65.4	—
− IPL (SFT only)	60.4	−5.0
− Instruction Evo	62.9	−2.5
− IPL Negatives	61.4	−4.0
− IPL + Naive DPO (Full Trajectory)	60.3	−5.1
1/2 Training Data (R1)	64.8	−0.6
4/5 Training Data (R2)	60.6	−4.8

Key Findings¶

IPL is core: Removing IPL causes a 5.0% drop; negative samples contribute 4.0%, suggesting they teach "how to reason" rather than simple memorization.
Thinking-level DPO outperforms Naive DPO: Using naive DPO on full trajectories dropped performance to 60.3%, lower than SFT with positive CoaT-tree samples (61.4%), validating the effectiveness of step-wise optimization.
Efficient at low-resource: With half the data, the first round of IPL (64.8) already surpassed the best results of original CoaT-SFT and naive DPO.
Best Efficiency-Performance Trade-off: MobileIPL (69.15, ~27 rollouts) outperforms SPO-Chain (68.03, ~54 rollouts) and GRPO (66.29, 8 rollouts).
K=3 is the sweet spot: Increasing $K$ from 3 to 4 increases tree nodes from 27 to 64 but yields less than 1% improvement.

Highlights & Insights¶

Rule-based Rewards + Backpropagation as PRM Alternative: In mobile GUI scenarios where step-level labeling is extremely expensive, this bypasses the need for unstable PRMs by using smooth signals from CLICK distance and INPUT F1.
α/β/γ Classification is a Clean Abstraction: It clearly defines which data to ignore (perfect), which to mine for contrast (mixed), and which to use for grounding (all wrong), using a value gap threshold to filter noise.
Multi-turn Dialogue Addresses VLM Pain Points: It prevents image tokens from overwhelming text instructions, a critical observation for GUI agents compared to text-only CoT.
Small Models Beating Large Pre-trained Models: By optimizing the thinking process, a 7B backbone with limited data can outperform models with millions of GUI pre-training samples like OS-Atlas/UI-TARS.

Limitations & Future Work¶

Weak PRESS Actions: MobileIPL's PRESS accuracy on AITZ is significantly lower than some baselines (23.5 in R1), attributed to sample distribution issues.
Handcrafted Reward Rules: Smooth rewards for CLICK/INPUT rely on manual $v_{type}/v_{format}$ and distance normalization; other actions (SCROLL/STOP) only use binary 0/1 rewards.
Reliance on GPT-4o for Instruction Evolution: Generating three-stage Q&A still requires closed-source models and manual filtering, creating a barrier for scaling and reproduction.
Empirical Iteration Termination: It lacks a clear convergence criterion, with the number of rounds $R$ requiring manual tuning.

CoaT / AITZ: The base modeling paradigm of Thought-Action-Grounding.
ReST-MCTS* / Xie et al.: Represent the MCTS + PRM approach; this work explicitly avoids their PRM labeling costs.
TreePO / TreeRL / SPO: Segment long sequences for preference optimization; MobileIPL is more efficient by using fixed CoaT-trees and rule-based values.
Insight: In embodied/GUI scenarios with expensive labeling, "Rule Rewards + Tree Backpropagation" is a viable low-cost alternative to PRMs.

Rating¶

Novelty: ⭐⭐⭐⭐ —— The combination of rule-based backpropagation and thinking-level DPO is a practical innovation for GUI agents.
Experimental Thoroughness: ⭐⭐⭐⭐ —— Comprehensive benchmarks, OOD testing, efficiency analysis, and ablation studies.
Writing Quality: ⭐⭐⭐⭐ —— Clear motivation, well-described algorithms, and good use of diagrams.
Value: ⭐⭐⭐⭐ —— Demonstrates that optimizing thinking handles resource constraints better than massive pre-training.