Anticipatory Planning for Multimodal AI Agents¶
Conference: CVPR 2026 | arXiv: 2603.16777 | Code: Not released | Area: Reinforcement Learning | Keywords: Multimodal agents, anticipatory planning, trajectory-level reinforcement learning, GUI interaction, tool use, GRPO
TL;DR¶
This paper proposes TraceR1, a two-stage RL framework in which the first stage employs trajectory-level reward optimization to train agents to perform multi-step look-ahead planning, while the second stage applies grounded fine-tuning via tool execution feedback to improve single-step precision. The approach achieves open-source state-of-the-art results across 7 GUI and tool-use benchmarks.
Background & Motivation¶
Background: Multimodal agents have made significant progress in GUI interaction and tool invocation, yet the vast majority of existing systems remain fundamentally reactive — selecting the next action based solely on the current observation without considering long-term consequences.
Limitations of Prior Work: In multi-step tasks, the effects of actions are often delayed and cumulative. Reactive agents cannot anticipate downstream consequences, leading to progressive goal deviation and poor planning coherence in long-horizon tasks.
Key Challenge: Both mainstream approaches face fundamental obstacles: model-free RL relies on sparse terminal rewards and struggles to capture long-range dependencies, while model-based planning requires constructing a world model, which is exceedingly difficult in visually rich interactive environments.
Goal: To efficiently train multimodal agents capable of adaptive anticipatory reasoning, enabling consistent planning in complex long-horizon tasks.
Key Insight: Rather than building an explicit world model, the paper directly applies RL at the trajectory level, training the model to predict a sequence of future actions while executing only the first step — analogous to the human strategy of "thinking several steps ahead, acting one step at a time."
Core Idea: A two-stage training procedure — trajectory-level alignment for global consistency followed by grounded RL for single-step executability — unifies anticipatory planning and precise execution.
Method¶
Overall Architecture¶
TraceR1 adopts a plan-act loop: given the current observation, the model predicts a multi-step future trajectory \(\hat{\tau}_{t:T}\) but executes only the first action, then replans upon receiving environmental feedback. Training proceeds in two stages:
- Stage 1 (Anticipatory Trajectory Optimization): Trajectory-level RL with a global alignment reward encouraging coherent multi-step planning.
- Stage 2 (Grounded Reinforcement Fine-tuning): Step-level RL using execution feedback from a frozen tool agent to improve single-step precision.
The backbone model is Qwen3-VL-8B-Thinking, trained with the EasyR1 framework.
Key Design 1: Trajectory-Level Alignment Reward¶
- Function: Given the user instruction, current observation, and interaction history, the model predicts an action sequence over \(T\) future steps, which is aligned against a reference trajectory.
- Mechanism: A discounted trajectory reward is defined as \(R(\hat{\tau}, \tau^*) = \sum_{t=1}^{T} \gamma^{t-1} r_t\), where \(r_t = \lambda_{\text{align}} \cdot \text{sim}(\hat{a}_t, a_t^*) - \lambda_{\text{rep}} \cdot \text{rep}(\hat{a}_{1:t})\). The term \(\text{sim}\) measures action alignment and \(\text{rep}\) penalizes repetitive or cyclic actions.
- Design Motivation: SFT under teacher forcing optimizes token-by-token and neglects global coherence. Trajectory-level RL enables the model to learn cross-step dependencies and avoid redundant or unstable rollouts.
Key Design 2: Repetition Penalty and Temporal Discount¶
- Function: \(\lambda_{\text{rep}}\) penalizes repeated actions within a trajectory; \(\gamma\) serves as a temporal discount factor that biases the model toward near-term correctness.
- Mechanism: These components jointly prevent reward hacking — without repetition penalty, the planner may click the same UI element or invoke the same tool repeatedly to inflate rewards; \(\gamma < 1\) prevents overfitting to highly uncertain long-range predictions.
- Design Motivation: Ablation studies confirm that removing either component leads to significant performance degradation (see Ablation Study section).
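The trajectory reward from Key Designs 1 and 2 can be sketched as follows. This is a minimal sketch under stated assumptions: the exact-match similarity, the duplicate-count repetition measure, and the coefficient values are all illustrative, since the paper's precise definitions of \(\text{sim}\) and \(\text{rep}\) are not reproduced here.

```python
def trajectory_reward(pred_actions, ref_actions, sim,
                      gamma=0.95, lam_align=1.0, lam_rep=0.5):
    """Discounted trajectory-level alignment reward (sketch):
    R = sum_t gamma^(t-1) * (lam_align * sim(a_hat_t, a*_t)
                             - lam_rep * rep(a_hat_{1:t})).
    rep() here is a duplicate-fraction proxy over the predicted prefix;
    the paper's exact repetition measure may differ.
    """
    total, seen = 0.0, []
    for t, (a_hat, a_star) in enumerate(zip(pred_actions, ref_actions)):
        align = lam_align * sim(a_hat, a_star)
        # repetition term: fraction of the prefix that duplicates a_hat
        rep = seen.count(a_hat) / (t + 1)
        total += (gamma ** t) * (align - lam_rep * rep)
        seen.append(a_hat)
    return total

# toy exact-match similarity over discrete actions
sim = lambda a, b: 1.0 if a == b else 0.0
r_good = trajectory_reward(["click_A", "type", "submit"],
                           ["click_A", "type", "submit"], sim)
r_loop = trajectory_reward(["click_A", "click_A", "click_A"],
                           ["click_A", "type", "submit"], sim)
assert r_good > r_loop  # repeated clicking is penalized
```

Note how the loop trajectory scores lower on two counts: its misaligned steps earn no similarity reward, and its repetitions accrue an explicit penalty, which is exactly the reward-hacking defense described above.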
Key Design 3: Grounded RL Fine-tuning¶
- Function: The model outputs \((\hat{a}_t, \hat{g}_t)\), which are passed to a frozen tool agent (e.g., UI-TARS-7B) for execution; the execution result is compared against the ground truth to derive a step-level reward.
- Mechanism: For GUI tasks, a coordinate matching reward is used; for tool-use tasks, an answer matching reward is applied: \(r_t^G = \mathbb{1}[\text{coord match}]\) or \(\mathbb{1}[\text{answer match}]\).
- Design Motivation: The trajectory reward in Stage 1 is abstract and provides no signal regarding whether predicted actions are actually executable. Stage 2 supplies concrete execution outcomes as corrective signals, compensating for the "idealized planning" problem.
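A minimal sketch of the grounded step reward \(r_t^G\). The pixel-tolerance test for coordinate matching and the whitespace-stripped string equality for answer matching are illustrative assumptions; the paper's exact matching criteria (e.g., a bounding-box test) are not specified in this summary.

```python
def grounded_step_reward(task_type, prediction, target, tol=5):
    """Binary grounded reward r_t^G for Stage 2 (sketch).

    GUI tasks: 1 if the executed click lands within `tol` pixels of the
    ground-truth coordinate (illustrative rule). Tool tasks: 1 on exact
    answer match after stripping surrounding whitespace.
    """
    if task_type == "gui":
        (x, y), (gx, gy) = prediction, target
        return 1.0 if abs(x - gx) <= tol and abs(y - gy) <= tol else 0.0
    if task_type == "tool":
        return 1.0 if str(prediction).strip() == str(target).strip() else 0.0
    raise ValueError(f"unknown task type: {task_type}")
```

Because the reward is computed on the *executed* outcome produced by the frozen tool agent, a plan that looks coherent but cannot be carried out earns nothing, which is the corrective signal Stage 2 is designed to supply.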
Key Design 4: Plan-Act Loop at Inference¶
- Function: At inference time, the model predicts a multi-step trajectory but executes only the first action, then replans upon observing the updated state.
- Mechanism: This is analogous to Model Predictive Control (MPC): rolling prediction, single-step execution, and continuous correction.
- Design Motivation: Multi-step prediction provides anticipatory context, yet since the environment changes continuously, executing one step and replanning balances look-ahead capability with robustness.
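The MPC-style plan-act loop can be sketched as below. `ToyEnv` and `ToyPlanner` are hypothetical stand-ins for the real environment and the trained model; their interfaces (`observe`, `step`, `predict`) are assumptions made for the sketch, not the paper's API.

```python
class ToyEnv:
    """Hypothetical environment: task completes after open -> type -> submit."""
    def __init__(self):
        self.goal = ["open", "type", "submit"]
        self.t = 0
    def observe(self):
        return self.t  # observation = current step index
    def step(self, action):
        if action == self.goal[self.t]:
            self.t += 1
        return self.t, self.t == len(self.goal)

class ToyPlanner:
    """Hypothetical planner: predicts the remaining reference steps."""
    def __init__(self, goal):
        self.goal = goal
    def predict(self, obs, history):
        return self.goal[obs:] or ["noop"]

def plan_act_loop(env, planner, max_steps=30):
    """Predict a multi-step trajectory, execute only the first action,
    observe the updated state, then replan (rolling-horizon control)."""
    history = []
    obs = env.observe()
    for _ in range(max_steps):
        plan = planner.predict(obs, history)  # multi-step look-ahead
        action = plan[0]                      # commit to the first step only
        obs, done = env.step(action)          # environment feedback
        history.append(action)
        if done:
            break
    return history

trace = plan_act_loop(ToyEnv(), ToyPlanner(["open", "type", "submit"]))
```

The key design choice is visible in the loop body: the full plan is discarded after each step, so look-ahead informs the choice of action while replanning absorbs any drift in the environment.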
Loss & Training¶
Both stages employ GRPO (Group-Relative Policy Optimization) as the optimization objective:
- Stage 1: \(\nabla_\theta J(\theta) = \mathbb{E}_{\hat{\tau}}[\hat{A}(\hat{\tau}, \tau^*) \nabla_\theta \log \pi_\theta(\hat{\tau} | u, s_t, \tau_{1:t-1})]\), where \(\hat{A}\) is the normalized group-relative advantage derived from trajectory rewards.
- Stage 2: The trajectory reward is replaced by the grounded step reward \(r_t^G\); GRPO updates are applied identically.
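A minimal sketch of the group-relative advantage at the heart of GRPO: each sampled rollout's reward is normalized by the mean and standard deviation of its rollout group, with no learned critic. The population-variance normalization and epsilon value are standard-practice assumptions, not details confirmed by this summary.

```python
def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: z-score each rollout's reward within its
    sampling group (mean/std over the group, no value network)."""
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# e.g. four rollouts sampled for the same prompt
adv = group_relative_advantages([0.2, 0.8, 0.5, 0.5])
# above-average rollouts get positive advantage, below-average negative
```

In Stage 1 the group reward is the trajectory alignment reward \(R(\hat{\tau}, \tau^*)\); in Stage 2 it is the grounded step reward \(r_t^G\), with the update otherwise unchanged.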
Training data for GUI tasks includes trajectory datasets from AgentNet, AndroidControl, GUI-Odyssey, Multimodal-Mind2Web, and AgentTrek; tool-use tasks use trajectory data from T3-Agent together with an executable toolbox.
Key Experimental Results¶
Main Results: Online GUI Benchmarks (Table 1 — Success Rate %)¶
| Model | Params | AndroidWorld | OSWorld-Verified |
|---|---|---|---|
| OpenAI CUA-o3 | — | 52.5 | 38.1 |
| UI-TARS-2 | — | 73.3 | 53.1 |
| Claude 4.5 Sonnet | — | — | 62.9 |
| Agent S2.5 w/ o3 | 7B w/ — | — | 56.0 |
| Qwen3-VL-32B-Thinking | 32B | 61.4 | 35.6 |
| TraceR1 (Qwen3-VL-32B w/ Ours) | 32B w/ 8B | 64.8 | 41.2 |
Key Takeaway: TraceR1 improves the OSWorld success rate of Qwen3-VL-32B-Thinking from 35.6% to 41.2% (relative gain of 15.7%) and AndroidWorld from 61.4% to 64.8%, establishing open-source state-of-the-art performance.
Tool-Use Benchmarks (Table 3 — GAIA & GTA)¶
| Model | Params | GAIA AnsAcc | GTA AnsAcc | GTA ToolAcc | GTA CodeExec |
|---|---|---|---|---|---|
| GPT-4o | — | 33.4 | 57.1 | 63.4 | 95.1 |
| GPT-5 | — | 59.3 | 60.9 | 68.3 | 98.7 |
| Qwen3-VL-8B | 8B | 31.5 | 49.2 | 56.8 | 74.2 |
| T3-Agent | 7B | 16.9 | 53.8 | 64.6 | 84.3 |
| TraceR1 | 8B | 40.2 | 56.7 | 65.7 | 87.4 |
Key Takeaway: At the 8B scale, TraceR1 surpasses GPT-4o on GAIA (40.2 vs. 33.4) and outperforms Qwen3-VL-8B of the same scale by +8.7 AnsAcc.
Ablation Study¶
| Configuration | AndroidWorld | OSWorld-Verified | GTA |
|---|---|---|---|
| Full TraceR1 (w/ Stage 2) | 64.8 | 41.2 | 56.7 |
| w/o Stage 2 | 57.2 | 36.3 | 50.2 |
Removing Stage 2 causes an average drop of roughly 6.3 points across the three benchmarks (7.6 on AndroidWorld, 4.9 on OSWorld-Verified, 6.5 on GTA), demonstrating that grounded execution feedback is critical for planning stability.
Additional ablation findings:
- Prediction horizon \(T\): Performance peaks at \(T \approx 10\); larger values accumulate uncertainty and degrade performance.
- \(\lambda_{\text{rep}} = 0\): Removing the repetition penalty induces reward hacking (e.g., repeatedly clicking the same element).
- \(\gamma = 1\): Removing the temporal discount causes the model to overfit to highly uncertain long-range predictions.
Highlights & Insights¶
- The "think several steps, act one step" paradigm is elegant and concise: No explicit world model is required; trajectory-level RL directly instills anticipatory reasoning, making the approach far more tractable than model-based planning.
- The two-stage decoupled design is well-motivated: Stage 1 addresses "seeing far ahead" (global consistency), while Stage 2 addresses "acting accurately" (execution feasibility), with a clear division of responsibilities.
- Strong generality: The same framework applies to both GUI interaction (desktop and mobile) and general tool invocation, with comprehensive validation across 7 benchmarks.
- Open-source 8B model surpasses GPT-4o: TraceR1 at 8B outperforms GPT-4o on GAIA, offering exceptional cost-effectiveness.
- Thorough ablations on repetition penalty and temporal discount: The experiments clearly expose the reward hacking problem and its resolution.
Limitations & Future Work¶
- Limitations of short-horizon updates: The current method provides only local corrections and cannot reshape the agent's understanding of long-term feasibility and task structure. Future work may explore multi-round or hierarchical planning mechanisms.
- Stage 2 depends on a frozen tool agent: The quality of the tool agent directly affects the reliability of grounded rewards; errors in the tool agent introduce noise into the corrective signal.
- Offline training vs. online interaction: The current grounded setup is offline and does not involve real online environment interaction, potentially limiting adaptability to dynamic environment changes.
- Sensitivity to prediction horizon: Performance degrades when \(T > 10\), indicating that the method still has bottlenecks on very long-horizon tasks.
- Absence of memory and state update mechanisms: The current framework lacks cross-episode memory integration and cannot learn from historical failures.
Related Work & Insights¶
| Compared Method | Key Differences |
|---|---|
| GUI-R1 / InfiGUI-R1 | Both adopt R1-style RL training but apply only step-level rewards, lacking trajectory-level global optimization. TraceR1 surpasses them by over 40% on AndroidControl-High, validating the necessity of trajectory-level reasoning. |
| Agent S2 / GTA1 | Rely on closed-source models (o3/GPT-5) as planners with open-source small models for execution. TraceR1 does not depend on closed-source planners and instead directly trains the intrinsic planning capabilities of open-source models, yielding greater autonomy. |
| UI-TARS-1.5/2 | Commercial closed-source systems with strong performance but no reproducibility. TraceR1 with an 8B open-source model paired with a 32B executor approaches the performance level of UI-TARS-1.5. |
Rating¶
- Novelty: ⭐⭐⭐⭐ — The two-stage design combining trajectory-level RL and grounded fine-tuning represents a meaningful advance over existing R1-style methods; the MPC-inspired "predict multiple steps, execute one" strategy is relatively novel in the context of multimodal agent training.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ — Seven benchmarks spanning online/offline GUI and tool-use settings; comprehensive ablations covering Stage 2, prediction horizon, repetition penalty, and temporal discount; results reported as the mean of three independent runs.
- Writing Quality: ⭐⭐⭐⭐ — Clear structure, well-articulated motivation, rigorous mathematical notation, and rich figures and tables; related work is thoroughly categorized.
- Value: ⭐⭐⭐⭐ — Provides a general and practical training paradigm for anticipatory planning in multimodal agents; the result of an 8B model surpassing GPT-4o carries strong practical significance.