M²-Miner: Multi-Agent Enhanced MCTS for Mobile GUI Agent Data Mining¶
Conference: ICLR 2026 arXiv: 2602.05429 Code: Coming soon Area: LLM Agent Keywords: GUI Agent, MCTS, Data Mining, Multi-Agent Collaboration, Mobile Interaction
TL;DR¶
This paper proposes M²-Miner, the first MCTS-based automatic data mining framework for mobile GUI agents. Through the collaboration of three agents—InferAgent, OrchestraAgent, and JudgeAgent—it achieves a 64× improvement in mining efficiency, enriches intent diversity via an intent recycling strategy, and trains a GUI agent that achieves state-of-the-art performance on multiple benchmarks.
Background & Motivation¶
Background: GUI agents automate software interactions by interpreting user intents and executing action sequences on graphical interfaces, representing a prominent research direction in both academia and industry. High-quality intent–trajectory training data is the core dependency of current GUI agents.
Limitations of Prior Work:
- High cost: Manual annotation (e.g., AITW, AndroidControl) requires hours per sample, with costs as high as $0.36/image.
- Low quality: Both manually annotated and automatically mined data frequently contain redundant steps, ambiguous intent descriptions, and biased action paths.
- Low diversity: Existing datasets adopt an intent-to-flat-trajectory structure, recording only a single successful path per intent, resulting in monotonic intent types.
Key Challenge: Manual annotation is quality-controllable but not scalable, while existing automated mining methods (e.g., AgentQ based on vanilla MCTS, OS-Genesis based on rule-driven exploration) suffer from low efficiency, are limited to web environments, or produce only single trajectories.
Goal: Automatically mine high-quality, high-diversity mobile GUI interaction trajectory data at low cost.
Key Insight: The paper introduces MCTS into mobile GUI data mining, but vanilla MCTS suffers from extremely low efficiency due to random expansion. The authors observe that: (a) the expansion phase requires intelligent guidance rather than random exploration; (b) the simulation phase can replace rollouts with process rewards; (c) non-primary paths in the search tree contain additional valuable intent–trajectory pairs.
Core Idea: MCTS + three-agent collaboration (guided expansion + accelerated ranking + process evaluation) + intent recycling = efficient, high-quality, and high-diversity GUI data mining.
Method¶
Overall Architecture¶
M²-Miner uses MCTS as its backbone. Given an initial intent \(I_0\) and a starting GUI state \(s_0\), it outputs an intent–trajectory tree \(\mathcal{T}=(\mathcal{V},\mathcal{A},\mathcal{P},\mathcal{I})\) containing valid interaction trajectories \(\tau=(s_0,a_0,s_1,\ldots)\). Each node in the tree stores a screenshot, action description, Q-value, visit count, and task completion status. The four MCTS phases (selection → expansion → simulation → backpropagation) iterate until a trajectory matching the intent is mined.
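The four-phase loop with the three agent hooks can be sketched as follows. This is a minimal sketch with assumed interfaces, not the paper's implementation: `infer`, `orchestra`, and `judge` stand in for the three agents, the action executor that captures the post-action screenshot is folded into `orchestra`'s output, and the exploration constant `C_UCT` is an assumption.

```python
import math
from dataclasses import dataclass, field
from typing import Optional

C_UCT = 1.4  # exploration constant (assumed; the paper does not specify one)

@dataclass
class Node:
    screenshot: object = None            # GUI state s_i stored at the node
    action: str = ""                     # action description that led here
    q: float = 0.0                       # running Q-value
    visits: int = 0                      # visit count N
    done: bool = False                   # task completion status
    parent: Optional["Node"] = None
    children: list = field(default_factory=list)

    def uct(self) -> float:
        # Standard UCT score; unvisited children are explored first.
        if self.visits == 0:
            return float("inf")
        return self.q + C_UCT * math.sqrt(math.log(self.parent.visits) / self.visits)

def mine(root, infer, orchestra, judge, budget):
    """One search: selection -> expansion -> process-reward 'simulation'
    -> backpropagation, until a trajectory matching the intent is found."""
    for _ in range(budget):
        # 1. Selection: descend by UCT to a leaf.
        node = root
        while node.children:
            node = max(node.children, key=Node.uct)
        # 2. Expansion: InferAgent proposes candidate actions;
        #    OrchestraAgent deduplicates and ranks them.
        for action, screenshot in orchestra(node, infer(node)):
            node.children.append(Node(screenshot, action, parent=node))
        # 3. Simulation replaced by JudgeAgent's process reward.
        for child in node.children:
            reward, child.done = judge(child)
            # 4. Backpropagation: Q_i = (Q_{i-1}*N_{i-1} + R_i)/(N_{i-1}+1).
            n = child
            while n is not None:
                n.q = (n.q * n.visits + reward) / (n.visits + 1)
                n.visits += 1
                n = n.parent
            if child.done:
                return child  # the root-to-child path is the mined trajectory
    return None
```

Note that step 3 scores every freshly expanded child in one pass, which is what makes the rollout-free simulation cheap relative to vanilla MCTS.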
Key Designs¶
- InferAgent (Expansion Phase — Action Generation):
  - Function: Generates \(K\) candidate actions for the selected node.
  - Mechanism: Reasons about the most likely correct action based on the current GUI screenshot and target intent. Multiple different MLLMs are used for generation to ensure diversity of the action space; previously generated actions are included in the prompt to avoid duplicates.
  - Design Motivation: Replaces the random expansion of vanilla MCTS, substantially increasing the hit rate of correct actions.
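InferAgent's duplicate-avoiding, multi-model generation might look like the sketch below; the prompt wording and the `model(screenshot, prompt)` call signature are assumptions, not details from the paper.

```python
def build_infer_prompt(intent: str, prior_actions: list[str]) -> str:
    """Prompt sketch: the target intent plus actions already proposed,
    so each MLLM call avoids duplicating earlier candidates."""
    avoid = "\n".join(f"- {a}" for a in prior_actions) or "(none)"
    return (
        f"Target intent: {intent}\n"
        "Given the attached GUI screenshot, propose the single most likely "
        "correct next action.\n"
        f"Do NOT repeat any of these previously proposed actions:\n{avoid}"
    )

def infer_candidates(mllms, screenshot, intent: str, k: int) -> list[str]:
    """Round-robin over multiple MLLMs to diversify the K candidates."""
    actions: list[str] = []
    for i in range(k):
        model = mllms[i % len(mllms)]
        prompt = build_infer_prompt(intent, actions)
        action = model(screenshot, prompt)  # MLLM call; interface assumed
        if action not in actions:           # exact-duplicate guard
            actions.append(action)
    return actions
```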
- OrchestraAgent (Expansion Phase — Action Ranking and Deduplication):
  - Function: Merges semantically equivalent actions (e.g., clicks on the same button at different coordinates) and ranks them by their likelihood of achieving the target intent.
  - Mechanism: An MLLM selects the most promising action at each iteration via a multi-choice question format; \(K-1\) queries yield a ranked queue. Ranked actions are assigned decreasing initial UCT values in order.
  - Design Motivation: Avoids redundant expansion and ensures that the search prioritizes the most promising branches.
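The \(K-1\)-query ranking can be sketched as a selection sort driven by one multi-choice MLLM query per round. The `choose` interface and the linear decay of the initial UCT priors are assumptions; the paper only states that ranked actions receive decreasing initial values.

```python
def rank_actions(choose, screenshot, intent, actions, base_uct=1.0, decay=0.1):
    """Selection-sort-style ranking: `choose` (assumed interface) picks the
    index of the most promising action from a multi-choice list. K-1 queries
    produce a fully ranked queue of K actions."""
    pool = list(actions)
    ranked = []
    while len(pool) > 1:
        idx = choose(screenshot, intent, pool)  # one multi-choice query
        ranked.append(pool.pop(idx))
    ranked.extend(pool)                          # last action needs no query
    # Decreasing initial UCT priors so earlier-ranked actions expand first.
    return [(a, base_uct - i * decay) for i, a in enumerate(ranked)]
```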
- JudgeAgent (Simulation Phase — Process Reward Estimation):
  - Function: Analyzes the GUI screenshot of newly expanded nodes, assesses task completion status, and computes rewards.
  - Mechanism: Terminal nodes receive a reward of 1 (success) or 0 (failure). For intermediate nodes, the MLLM head outputs logits for "valid"/"invalid", normalized via softmax into a \([0,1]\) probability used as the reward: \(r_{\text{intermediate}} = \frac{\exp(\text{logits}_{\text{valid}})}{\exp(\text{logits}_{\text{valid}}) + \exp(\text{logits}_{\text{invalid}})}\)
  - Design Motivation: Replaces the full rollout required by vanilla MCTS to evaluate rewards. Node Q-values are updated incrementally as \(Q_i = \frac{Q_{i-1} \times N_{i-1} + R_i}{N_{i-1}+1}\), greatly reducing computational overhead in the simulation phase.
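The two formulas above translate directly into code; this is a straightforward rendering of the paper's equations, with function names chosen here for illustration.

```python
import math

def process_reward(logit_valid: float, logit_invalid: float) -> float:
    """Two-way softmax over the JudgeAgent head's logits: an
    intermediate-node reward in [0, 1], replacing a full rollout."""
    ev, ei = math.exp(logit_valid), math.exp(logit_invalid)
    return ev / (ev + ei)

def update_q(q_prev: float, n_prev: int, reward: float):
    """Incremental-mean backup: Q_i = (Q_{i-1}*N_{i-1} + R_i) / (N_{i-1}+1).
    Returns the new (Q, N) pair for the node."""
    return (q_prev * n_prev + reward) / (n_prev + 1), n_prev + 1
```

The Q update is just a running mean over rewards, so backpropagation never has to re-read earlier rewards along the path.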
- Intent Recycling Strategy:
  - Function: Extracts additional intent–trajectory pairs from non-primary paths in a completed mining tree.
  - Mechanism: All root-to-node paths in the tree are enumerated and evaluated by an intent recycling filter (implemented via an MLLM); paths that pass are assigned new intents generated by an MLLM, which are then verified for consistency with the trajectory by JudgeAgent.
  - Design Motivation: Evolves the structure from "one tree, one intent" to "one tree, multiple intents," yielding more diverse data without re-mining. For example, when mining the intent "query a route," an accidental tap on a "ride-hailing" button may produce a valid ride-hailing trajectory.
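The enumerate-filter-relabel-verify pipeline can be sketched as below. The three callables wrap MLLM calls whose interfaces are assumed here, and `TreeNode` is a minimal stand-in for the mined-tree node.

```python
class TreeNode:
    """Minimal stand-in for a node in a finished mining tree (assumed shape)."""
    def __init__(self, name, children=()):
        self.name = name
        self.children = list(children)

def recycle_intents(root, passes_filter, generate_intent, judge_consistent):
    """Enumerate every root-to-node path; paths accepted by the recycling
    filter get a freshly generated intent, which is kept only when
    JudgeAgent confirms intent-trajectory consistency."""
    pairs = []
    stack = [(root, [root])]
    while stack:
        node, path = stack.pop()
        if len(path) > 1 and passes_filter(path):  # skip the bare root
            intent = generate_intent(path)
            if judge_consistent(intent, path):
                pairs.append((intent, path))
        for child in node.children:
            stack.append((child, path + [child]))
    return pairs
```

Every prefix of every branch is a candidate, which is why a single tree can yield multiple intent–trajectory pairs.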
- Progressive Model-in-the-Loop Training Strategy:
  - Function: Iteratively improves the capabilities of InferAgent and JudgeAgent.
  - Mechanism: Three progressive training stages — Stage 1: basic intents (common services + conditional rewriting) → Stage 2: complex intents (function combinations + retry on failure) → Stage 3: recycled intents (applying intent recycling to historical trees). Data mined at each stage is used for continual training of both agents.
  - Design Motivation: Forms a positive feedback loop in which agent capability and data complexity grow in tandem.
Loss & Training¶
- InferAgent and JudgeAgent are fine-tuned based on Qwen2.5-VL-7B.
- OrchestraAgent and the intent recycling filter use Qwen2.5-VL-72B.
- Training data also includes description information and preference data (constructed from positive/negative paths).
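One plausible way the positive/negative-path preference data could be assembled is sketched below; this construction (pairing a successful child's action against a failed sibling's at each branching point) is an assumption about a detail the summary leaves open, and `TraceNode` is a hypothetical node shape.

```python
class TraceNode:
    """Hypothetical tree node that records whether it lies on a successful path."""
    def __init__(self, action="", success=False, children=()):
        self.action = action
        self.success = success
        self.children = list(children)

def build_preference_pairs(root):
    """At every branching point, pair each action on a successful path
    (chosen) with each failed sibling action (rejected), yielding
    (state, chosen, rejected) triples for preference fine-tuning."""
    pairs, stack = [], [root]
    while stack:
        node = stack.pop()
        chosen = [c for c in node.children if c.success]
        rejected = [c for c in node.children if not c.success]
        for g in chosen:
            for b in rejected:
                pairs.append((node, g.action, b.action))
        stack.extend(node.children)
    return pairs
```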
Key Experimental Results¶
Main Results¶
| Model | AC-Low TP/SR | AC-High TP/SR | AITZ TP/SR | GUI-Odyssey TP/SR | CAGUI TP/SR |
|---|---|---|---|---|---|
| GPT-4o | 74.3/19.4 | 66.3/20.8 | 70.0/35.3 | - | 3.67/3.67 |
| UI-TARS-7B* | 98.0/90.8 | 83.7/72.5 | 80.4/65.8 | 90.1/87.0 | 88.6/70.0 |
| OS-Genesis-7B | 90.7/74.2 | 66.2/44.5 | 20.0/8.5 | 11.7/3.6 | 38.1/14.5 |
| GUI-Owl-7B | 93.8/90.0 | 81.5/72.8 | 78.9/65.1 | 83.4/60.7 | 80.0/59.2 |
| M²-Miner-7B | 97.5/93.5 | 81.8/72.9 | 81.3/69.4 | 90.5/79.3 | 88.8/70.2 |
*UI-TARS-7B uses large-scale private manually annotated data.
Ablation Study¶
| Configuration | TP | SR | Note |
|---|---|---|---|
| Warm-up | 85.0 | 64.2 | Pre-trained on public data only |
| + Stage 1 (Basic Intents) | 86.5 | 67.3 | +3.1% SR |
| + Stage 2 (Complex Intents) | 87.6 | 69.1 | +1.8% SR |
| + Stage 3 (Recycled Intents) | 88.2 | 69.9 | +0.8% SR, cumulative +5.7% |
| Act only | 85.2 | 66.8 | Action labels only |
| Act + Description | 88.2 | 69.9 | +3.1% SR |
| Act + Des + Preference | 88.8 | 70.2 | +3.4% SR |
Key Findings¶
- Exponential efficiency gains: Compared to vanilla MCTS, M²-Miner achieves a 64× efficiency improvement at a task length of 9. OrchestraAgent reduces redundant nodes in the expansion phase; JudgeAgent eliminates rollouts in the simulation phase.
- 18× cost reduction: The M²-Miner-Agent dataset costs only $0.02 per image, versus $0.36 for manually annotated datasets.
- Higher data quality: Manual inspection of 100 randomly sampled records shows that M²-Miner's data quality accuracy (DQA) surpasses that of manually annotated datasets (AC and AITZ).
- Description and preference data are beneficial: Compared to action labels alone, adding description information improves SR by 3.1%; adding preference data yields a further 0.3% gain.
- Strong generalization to unseen scenarios: On CAGUI, where no training data is available, Qwen2.5-VL-7B trained on M²-Miner data improves SR from 55.2% to 70.2%.
Highlights & Insights¶
- The MCTS + multi-agent paradigm is elegantly designed: The three agents each fulfill a distinct role—generation, ranking, and evaluation—precisely addressing the three bottlenecks of vanilla MCTS in the GUI domain (random expansion, redundant nodes, and expensive rollouts).
- Intent Recycling is a highly creative design: It repurposes "failed" exploratory paths as an additional data source, simultaneously addressing intent diversity and mining efficiency. This idea is transferable to other agent data collection settings.
- Process rewards as rollout substitutes: Using MLLM logit probabilities as intermediate rewards preserves the theoretical advantages of MCTS while avoiding the high cost of full simulation. This design has reference value for all MCTS+LLM systems.
Limitations & Future Work¶
- The framework is validated only on mobile platforms; applicability to desktop and web environments remains unexplored.
- OrchestraAgent relies on a 72B model (Qwen2.5-VL-72B), incurring high deployment costs.
- The quality of intent recycling filtering and intent generation depends on MLLM capability and may degrade for complex applications.
- The dataset scale (20k images, 2,565 trajectories) still lags behind the private data used by UI-TARS.
- The three-stage model-in-the-loop training requires multiple mining → training cycles; the practical engineering cost is not reported in detail.
Related Work & Insights¶
- vs. AgentQ: AgentQ pioneered MCTS-based data mining for web environments but is limited to parseable HTML environments and suffers from low efficiency. M²-Miner extends the approach to vision-driven mobile platforms, improving efficiency by orders of magnitude through multi-agent collaboration.
- vs. OS-Genesis: OS-Genesis employs unsupervised rule-based interaction and reverse task synthesis, requiring no predefined tasks but yielding lower data quality (SR of only 8.5% on AITZ). M²-Miner's structured MCTS search produces higher-quality trajectories.
- vs. UI-TARS: UI-TARS relies on large-scale private manually annotated data, yet M²-Miner surpasses it on nearly all SR metrics, suggesting that automated mining frameworks can now match or exceed manual annotation.
Rating¶
- Novelty: ⭐⭐⭐⭐⭐ The combination of MCTS, multi-agent collaboration, and intent recycling is pioneering in GUI data mining, with clear technical contributions from each component.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Five benchmarks, multiple ablation groups (multi-agent design, training strategy, data structure, training data), efficiency analysis, and cost comparison — highly comprehensive.
- Writing Quality: ⭐⭐⭐⭐ Well-structured with rich illustrations, though some content is repetitive and the paper is lengthy.
- Value: ⭐⭐⭐⭐⭐ Provides a highly valuable data production paradigm for the GUI agent community, with empirical evidence that automated mining can surpass manual annotation.