Dyna-Mind: Learning to Simulate from Experience for Better AI Agents
Conference: ICLR 2026
OpenReview: https://openreview.net/forum?id=F848aPzCJy
Code: https://github.com/jasonyux/Dyna-Mind
Area: LLM Agent / Reinforcement Learning / World Models
Keywords: Agents, Mental Simulation, Long-horizon Tasks, GRPO, World Models
TL;DR
Dyna-Mind teaches (V)LM agents to perform mental simulations of future states before acting by "compressing real environment search trees into reasoning trajectories" (RESIM). It then employs Dyna-GRPO, which feeds "ground-truth future states" back into online RL to reinforce simulation capabilities, significantly outperforming GRPO/RLOO and Dyna-Think on Sokoban, ALFWorld, and AndroidWorld.
Background & Motivation
Background: Reasoning models (e.g., DeepSeek-R1) have achieved expert-level performance on "one-shot" problems like mathematics and programming by expanding long-chain reasoning before acting. Naturally, researchers seek to apply this to agent tasks requiring multi-step interaction, such as web navigation and mobile/PC operations.
Limitations of Prior Work: The performance of reasoning models drops sharply on long-horizon tasks. Empirical tests show that while DeepSeek-R1 is nearly perfect on the structured Sokoban (96.6% success rate), it falls to 62.5% on the complex ALFWorld. Notably, the "simulation score" (the accuracy of the model's predicted next state) is highly correlated with the success rate (\(r\approx0.7\text{-}0.96\)). In other words, models fail not due to a lack of reasoning capacity, but because their internal world models are inaccurate, leading to divergent simulations.
Key Challenge: Success in long-horizon tasks depends on the agent's ability to construct an accurate "world model"—mentally simulating "what the environment will become if I take this action" without execution, known in neuroscience as "vicarious trial and error." Existing solutions have flaws: Dyna-Think distills synthetic data from DeepSeek-R1 which contains inherent errors, while classical Dyna approaches train independent world models, resulting in modular systems where reasoning and simulation are decoupled.
Goal: Internalize simulation capabilities within the agent's own reasoning process without relying on stronger teacher models or independent world models, enabling continuous online reinforcement of these skills.
Key Insight: Instead of letting strong models "hallucinate" synthetic simulation data, it is more effective to extract reliable future states from real environment interactions. Paths taken during actual rollouts provide ground-truth world dynamics, serving as supervision signals free from fabrication.
Core Idea: A two-stage training process: Stage 1 involves SFT (RESIM) to compress real search trees into "reasoning trajectories with simulation," teaching the model to generate mental rollouts. Stage 2 utilizes Dyna-GRPO to feed actual future states back into RL, refining simulation and decision-making online.
Method
Overall Architecture
Dyna-Mind aims to enable a (V)LM agent to simulate multiple future steps during reasoning and select optimal actions in long-horizon tasks. Training proceeds in two sequential stages. Stage 1 (RESIM) initializes simulation capability: search trees are built from real interactions, a general LLM aggregates each tree into a coherent reasoning response \(a^{\text{ReSim}}\), and these responses serve as SFT targets. Stage 2 (Dyna-GRPO) performs online reinforcement: real future states are fed into the RL rollouts to refine simulations, and a modified GRPO optimizes both decision-making and simulation.
The task is modeled as a Markov Decision Process \((\mathcal{S},\mathcal{A},\mathcal{T},\mathcal{R})\): an agent \(\pi_\theta\) at state \(s_t\) generates an action \(a_t\sim\pi_\theta(\cdot|s_t)\), the environment transitions to \(s_{t+1}\sim\mathcal{T}(s_t,a_t)\), and a reward \(r_T\) is provided upon termination. The key convention is that any text regarding "simulating the future" is part of the action \(a_t\) (simulation is embedded in reasoning), while \(s\) denotes ground-truth states from the environment.
%%{init: {'flowchart': {'rankSpacing': 24, 'nodeSpacing': 28, 'padding': 6, 'wrappingWidth': 400}}}%%
flowchart TD
A["Real Environment Interaction"] --> B["RESIM Data Construction<br/>Search Tree → Single Reasoning Trajectory"]
B --> C["RESIM Distillation<br/>SFT: Learning to Self-Simulate"]
C --> D["SimRollout Simulation Refinement<br/>Injecting Real Future States"]
D --> E["Dyna-GRPO Dual-Phase Optimization<br/>Simulation Improvement ↔ Policy Improvement"]
E -->|Online Iteration| D
E --> F["Long-horizon Agent"]
Key Designs
1. RESIM Data Construction: Compressing Search Trees into "Simulated" Reasoning Trajectories
Addressing the weakness of Dyna-Think—where synthetic data contains errors—RESIM derives supervision signals from real interactions. Given a state \(s\), a rollout model \(\pi_\theta\) performs a Depth-First Search (DFS) to generate \(b\) partial rollouts of depth \(d\). A value function \(V_\nu\) scores each rollout. Finally, a general (V)LM aggregates the tree into a single response \(a^{\text{ReSim}}\) by summarizing each rollout (containing real future states) and chaining them into coherent reasoning to select the best plan. This ensures simulations are faithful to world dynamics.
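The data-construction step above can be illustrated with a toy sketch. Everything here is a hypothetical stand-in rather than the authors' implementation: the 1-D environment, the random rollout policy, the distance-based `value` function, and the template-based `aggregate` (which replaces the (V)LM aggregator). For brevity it samples \(b\) independent rollouts of depth \(d\) instead of running a true DFS.

```python
import random
from dataclasses import dataclass

# Hypothetical toy environment: the state is an integer position on a line
# and the goal is to reach GOAL. It stands in for Sokoban or ALFWorld.
GOAL = 3
ACTIONS = {"left": -1, "right": +1}


def step(state: int, action: str) -> int:
    """Ground-truth transition T(s, a); deterministic here for simplicity."""
    return state + ACTIONS[action]


def value(state: int) -> float:
    """Stand-in for the value function: closer to the goal scores higher."""
    return float(-abs(GOAL - state))


@dataclass
class Rollout:
    actions: list
    states: list  # real future states returned by the environment
    score: float


def collect_rollouts(state: int, b: int, d: int, rng: random.Random) -> list:
    """Sample b partial rollouts of depth d by acting in the real environment."""
    rollouts = []
    for _ in range(b):
        s, actions, states = state, [], []
        for _ in range(d):
            a = rng.choice(sorted(ACTIONS))  # stand-in for the rollout policy
            s = step(s, a)
            actions.append(a)
            states.append(s)
        rollouts.append(Rollout(actions, states, value(s)))
    return rollouts


def aggregate(state: int, rollouts: list) -> str:
    """Stand-in for the aggregator: summarize each rollout, then pick the best."""
    lines = [f"Current state: {state}."]
    for i, r in enumerate(rollouts, 1):
        lines.append(f"Plan {i}: {r.actions} -> states {r.states} (score {r.score}).")
    best = max(rollouts, key=lambda r: r.score)
    lines.append(f"Best plan: {best.actions}. Next action: {best.actions[0]}.")
    return "\n".join(lines)


rollouts = collect_rollouts(state=0, b=4, d=3, rng=random.Random(0))
a_resim = aggregate(0, rollouts)
```

The point of the sketch is that every state quoted in `a_resim` came from `step`, i.e., from the environment rather than from a model's guess, which is what makes the resulting reasoning text a faithful SFT target.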
2. RESIM Distillation: Internalizing Simulation Capabilities
Each \(a^{\text{ReSim}}\) encodes an entire search tree. RESIM uses \(a^{\text{ReSim}}\) as the target for SFT. Given a trajectory \(\tau=\{s_0,a^{\text{ReSim}}_0,s_1,a^{\text{ReSim}}_1,\cdots\}\), the model is trained to directly output \(a^{\text{ReSim}}_t\) when observing state \(s_t\) and history \(h\), eliminating the need for search algorithms during inference. This internalizes expensive multi-module simulation into a single forward pass. The distilled model generates roughly \(1/11\) as many tokens as DISTILL(R1) while outperforming it on ALFWorld.
3. SimRollout: Refining Simulation with Ground-Truth Future States
Online reinforcement is challenging because standard RL (like GRPO) provides only a scalar reward \(R_T\), lacking direct signals for simulation accuracy. SimRollout generates this signal: at each state \(s_t\), the agent samples \(a\sim\pi_\theta(\cdot|s_t)\), extracts the action plan \(\{\hat a_1,\cdots,\hat a_d\}\), and executes them in the environment to obtain real next states \(\{s'_{t+1},\cdots,s'_{t+d}\}\). These are appended back to the prompt \(s_t^{\text{refine}}\equiv\{s_t\oplus a\oplus s'_{t+1}\oplus\hat a_2\oplus\cdots\}\) to let the model generate a refined response \(a^{\text{refine}}\).
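A minimal sketch of how the refinement prompt \(s_t^{\text{refine}}\) could be assembled is shown below. It assumes the action plan has already been extracted from the response; `build_refine_prompt`, `toy_step`, and the `pos=...` state strings are hypothetical names introduced for illustration, and newline joining stands in for the \(\oplus\) concatenation.

```python
from typing import Callable, List


def build_refine_prompt(
    state: str,
    response: str,
    plan: List[str],
    env_step: Callable[[str, str], str],
) -> str:
    """Interleave the planned actions with the real states they lead to.

    Produces: s_t, a, s'_{t+1}, a_hat_2, s'_{t+2}, ...
    The first planned action is already part of the response, so it is
    executed but not repeated in the prompt.
    """
    parts = [state, response]
    current = state
    for i, action in enumerate(plan):
        if i > 0:
            parts.append(action)             # a_hat_2, a_hat_3, ...
        current = env_step(current, action)  # execute in the real environment
        parts.append(current)                # ground-truth s'_{t+i+1}
    return "\n".join(parts)


def toy_step(state: str, action: str) -> str:
    """Hypothetical environment: states look like 'pos=0'."""
    pos = int(state.split("=")[1])
    return f"pos={pos + (1 if action == 'right' else -1)}"


prompt = build_refine_prompt(
    state="pos=0",
    response="I will go right, right, then left.",
    plan=["right", "right", "left"],
    env_step=toy_step,
)
```

The model would then be asked to continue from `prompt` and produce \(a^{\text{refine}}\), now conditioned on what actually happened rather than on its own predicted states.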
4. Dyna-GRPO: Alternating Optimization between "Simulation" and "Policy"
Following Dyna principles, Dyna-GRPO alternates between two phases. In the Simulation Improvement phase, SimRollout is used to obtain refined trajectories \(\tau'_{\text{refine}}\). A special advantage reinforces the refinement: a reward of 1.0 is given only if the refined trajectory is both correct and yields a higher reward than the mean over the standard and refined rollouts; otherwise no refinement reward is given.
In the Policy Improvement phase, standard rollouts without future states are performed, and the model is optimized via standard GRPO to integrate learned simulation capabilities into decision-making.
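The gating rule used in the Simulation Improvement phase can be read as the following sketch. This is one interpretation of the prose description, not the paper's exact formula: it assumes "correct" means the refined episode succeeded, and `refinement_reward` and its arguments are hypothetical names.

```python
from statistics import mean
from typing import List


def refinement_reward(
    succeeded: bool,
    refined_return: float,
    standard_returns: List[float],
    refined_returns: List[float],
) -> float:
    """Return 1.0 only if the refined episode succeeded AND its return
    exceeds the mean return over both standard and refined rollouts."""
    baseline = mean(standard_returns + refined_returns)
    return 1.0 if succeeded and refined_return > baseline else 0.0


# Illustrative returns only (not taken from the paper).
standard = [-0.5, 9.6]
refined = [9.8, -0.4]

good = refinement_reward(True, 9.8, standard, refined)    # succeeded, above mean
failed = refinement_reward(False, -0.4, standard, refined)  # did not succeed
weak = refinement_reward(True, 3.0, standard, refined)    # succeeded, below mean
```

Requiring both conditions means a refinement is rewarded only when seeing the real future states actually helped, which is the behavior the Simulation Improvement phase is meant to reinforce.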
Loss & Training
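The RL backbone is the GRPO objective \(J_{\text{GRPO}}\), using the importance sampling ratio \(\rho_\theta(a)=\frac{\pi_\theta(a|s)}{\pi_{\theta_{\text{ref}}}(a|s)}\) and KL regularization \(\beta D_{KL}(\pi_\theta\|\pi_{\theta_{\text{ref}}})\). Episode-level advantages are normalized within each group of \(G\) rollouts in the standard GRPO manner: each episode's return minus the group mean, divided by the group standard deviation, \(A_i=\frac{R_i-\operatorname{mean}(R_1,\dots,R_G)}{\operatorname{std}(R_1,\dots,R_G)}\).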
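Group normalization of episode-level advantages, as used in standard GRPO, can be sketched as follows. The small `eps` term and the use of the population standard deviation are assumptions made here to keep the example well defined; the source does not specify either.

```python
from statistics import mean, pstdev
from typing import List


def group_advantages(returns: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize episode returns within one group of rollouts.

    Each advantage is (return - group mean) / (group std + eps).
    eps is an assumed stabilizer for groups whose returns are all equal.
    """
    mu = mean(returns)
    sigma = pstdev(returns)
    return [(r - mu) / (sigma + eps) for r in returns]


mixed = group_advantages([10.0, 0.0])          # one success, one failure
uniform = group_advantages([9.6, 9.6, 9.6])    # identical returns
```

When every rollout in a group earns the same return, all advantages are zero and the group contributes no gradient, which is one reason a scalar outcome reward alone gives little signal about simulation quality.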
Reward design: \(-0.1\) for non-terminal steps, \(10.0\) for success at termination, and \(0.0\) for failure. Models used include Qwen2.5-7B-Instruct (text) and Qwen2.5-VL-7B/32B-Instruct (AndroidWorld).
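The reward design above can be turned into a per-episode return as in the sketch below. It assumes an undiscounted sum and that the terminal step receives only the outcome reward (no step penalty); both are assumptions, and the function names are hypothetical.

```python
from typing import List

STEP_PENALTY = -0.1     # each non-terminal step
SUCCESS_REWARD = 10.0   # terminal step, task solved
FAILURE_REWARD = 0.0    # terminal step, task not solved


def step_rewards(num_steps: int, success: bool) -> List[float]:
    """Per-step rewards for an episode with num_steps >= 1 steps."""
    terminal = SUCCESS_REWARD if success else FAILURE_REWARD
    return [STEP_PENALTY] * (num_steps - 1) + [terminal]


def episode_return(num_steps: int, success: bool) -> float:
    """Undiscounted sum of the per-step rewards (assumed aggregation)."""
    return sum(step_rewards(num_steps, success))


short_win = episode_return(5, True)
long_win = episode_return(20, True)
loss = episode_return(5, False)
```

Under these assumptions a shorter successful episode earns a slightly higher return than a longer one, so the step penalty acts as a mild pressure toward efficient plans.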
Key Experimental Results
Main Results
Text-based games (Sokoban / ALFWorld, based on Qwen2.5-7B-Instruct, average of 3 runs):
| Method | Gen.Token | Sokoban AVG | ALFWorld AVG |
|---|---|---|---|
| REACT(DeepSeek-R1) | 14.5x | 96.6 (ID only) | 62.5 (ID only) |
| RESIM (at inference) | 2.0x | 96.4 (ID only) | 87.7 (ID only) |
| Dyna-Think | 24.2x | 65.8 | 58.9 |
| DISTILL(RESIM) (Stage 1) | 2.0x | 63.7 | 74.1 |
| DISTILL(RESIM)+GRPO | 2.1x | 73.1 | 87.0 |
| DISTILL(RESIM)+DYNA-GRPO | 1.9x | 77.1 | 90.8 |
AndroidWorld (Qwen2.5-VL, average of 3 runs):
| Method | ID | OOD | AVG |
|---|---|---|---|
| REACT(Qwen2.5-VL-72B) | 19.5 | - | - |
| RESIM (at inference) | 34.4 | - | - |
| DISTILL-32B(RESIM) | 32.8 | 15.6 | 24.2 |
| DISTILL-32B(RESIM)+GRPO | 35.3 | 20.3 | 27.8 |
| DISTILL-32B(RESIM)+DYNA-GRPO | 40.7 | 22.9 | 31.8 |
Ablation Study / Simulation Capability Analysis
Simulation Score (Sim Score \(\in[0,1]\), judged by Qwen3-235B, which compares the future states the model simulates in its reasoning against the ground-truth future states):
| Method | Sokoban Succ / Sim | ALFWorld Succ / Sim |
|---|---|---|
| REACT(DeepSeek-R1) | 96.6 / 0.93 | 62.5 / 0.36 |
| RESIM | 96.4 / 1.00 | 87.7 / 1.00 |
| DISTILL(RESIM) | 71.9 / 0.62 | 78.9 / 0.37 |
| DISTILL(RESIM)+GRPO | 79.1 / 0.62 | 87.0 / 0.38 |
| DISTILL(RESIM)+DYNA-GRPO | 82.5 / 0.67 | 92.5 / 0.43 |
Key Findings
- Simulation accuracy is the bottleneck for long-horizon tasks: DeepSeek-R1's Sim Score in ALFWorld is only 0.36, which limits its success rate to 62.5%. RESIM, whose future states come from the real environment (Sim Score 1.00), achieves high success in both environments.
- Dyna-GRPO improves both success and simulation accuracy: Unlike GRPO, it raises the Sim Score (e.g., ALFWorld 0.38→0.43), indicating that performance gains stem from better simulation rather than reward hacking.
- Efficiency: DISTILL(RESIM) outputs 1/11 the tokens of R1. Dyna-GRPO keeps generation length at approximately 2x the base model, so its gains do not come at the cost of much longer reasoning.
- AndroidWorld bounded by base model: In real mobile environments, RESIM at inference reaches only 34.4%; errors were attributed to Qwen2.5-VL-72B's inability to fully understand some GUI elements.
Highlights & Insights
- The "Real Search Tree → Single Reasoning" distillation is elegant, compressing expensive multi-module search into a single forward pass while using real data to bypass synthetic biases.
- SimRollout uses future states as trainable supervision signals, addressing the lack of feedback for simulation quality in standard scalar-reward RL.
- Causal Analysis: The authors use the Sim Score to argue that performance improvements are rooted in simulation accuracy, tying the success-rate gains to a measurable intermediate quantity.
Limitations & Future Work
- Base model dependency: Success in AndroidWorld is limited by the GUI comprehension of the rollout model.
- Cost of RESIM data construction: Building search trees is time-intensive (15-20 min per AndroidWorld episode), leading the authors to use prompting instead of training value functions for some tasks.
- Search constraints: Current implementation is restricted to DFS and agent-environment interaction.
- Simulation depth: The depth \(d\) is capped at 5 for training stability, which might be insufficient for very long-range dependencies.
Related Work & Insights
- vs Dyna-Think: Dyna-Think distills synthetic simulations from DeepSeek-R1 (with R1's biases) and uses separate predictors. Dyna-Mind derives supervision from real trees, skips independent world models, and consumes an order of magnitude fewer tokens.
- vs Classical Dyna: Classical Dyna separates the world model and policy. Dyna-Mind internalizes simulation into the agent's reasoning, sharing parameters for simulation and decision-making.
- vs Search/Multi-agent: These methods incur high overhead during inference. Dyna-Mind distills the benefits of search into a single (V)LM, requiring no search algorithms at test time.
Rating
- Novelty: ⭐⭐⭐⭐⭐ Distilling real search trees and feeding future states into RL are both novel and complementary.
- Experimental Thoroughness: ⭐⭐⭐⭐ Extensive analysis in text and real environments; AndroidWorld scale limited by compute.
- Writing Quality: ⭐⭐⭐⭐⭐ Logical flow from neuroscience motivation to dual-stage methodology.
- Value: ⭐⭐⭐⭐⭐ Provides a reusable paradigm for internalizing and reinforcing simulation in long-horizon agents.