Bootstrap Off-policy with World Model¶
Conference: NeurIPS 2025 arXiv: 2511.00423 Code: molumitu/BOOM_MBRL Area: Reinforcement Learning Keywords: Model-based reinforcement learning, online planning, off-policy learning, actor divergence, world model, MPPI
TL;DR¶
This paper proposes the BOOM framework, which distills high-quality actions from an online planner into a policy network via a bootstrap alignment loop. By employing a likelihood-free forward KL divergence and a soft Q-weighting mechanism, BOOM effectively mitigates the actor divergence between the planner and the policy, achieving state-of-the-art performance on high-dimensional continuous control tasks.
Background & Motivation¶
- Online planning improves RL performance: Lookahead search with learned world models (e.g., MPPI) can generate higher-quality actions than a standalone policy network, and is widely adopted in model-based RL.
- Actor divergence is inevitable: Training data is collected by the planner-augmented behavior policy \(\beta\) (the policy \(\pi\) refined by MPPI planning), so a distributional gap inherently exists between the policy network \(\pi\) and the behavior policy \(\beta\).
- Value function learning bias: The Q-function is trained on the distribution of \(\beta\), leading to overestimation in regions frequently visited by \(\pi\) but rarely covered by \(\beta\).
- Unreliable policy updates: Policy gradient updates based on biased Q-values mislead optimization, causing training instability and performance degradation.
- Planner outputs are non-parametric: Sampling-based planners such as MPPI produce action distributions whose likelihood cannot be computed analytically, making conventional KL divergence methods inapplicable.
- Limitations of prior work: TD-MPC2 ignores actor divergence; BMPC imitates the planner without value guidance, limiting effectiveness when historical action quality is inconsistent.
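To make the intractability concrete, here is a minimal NumPy sketch of an MPPI-style weighted-resampling step (function and variable names are illustrative, not the authors' code): the planner's output is an empirical distribution over sampled action candidates, so there is no closed-form density \(\beta(a|s)\) to evaluate.

```python
import numpy as np

def mppi_action_distribution(rollout_returns, sampled_actions, temperature=0.5):
    """Illustrative MPPI-style update: candidate actions are re-weighted
    by the exponentiated returns of their imagined rollouts. The result
    is a particle (empirical) distribution over actions -- sampling from
    it is easy, but log beta(a|s) has no analytic form."""
    returns = np.asarray(rollout_returns, dtype=float)
    weights = np.exp((returns - returns.max()) / temperature)  # max-shift for stability
    weights /= weights.sum()
    # "Sampling from beta" is just resampling the particles by weight.
    idx = np.random.choice(len(weights), p=weights)
    return sampled_actions[idx], weights
```

This is exactly why reverse-KL objectives, which need \(\log \beta(a|s)\), are infeasible here.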
Method¶
Overall Architecture¶
BOOM consists of three tightly coupled components — a policy network, an online planner, and a world model — forming a bootstrap loop: the policy initializes the planner's solution, and the planner refines actions via model-predictive optimization, which in turn guides policy alignment. The world model, following the TD-MPC2 paradigm, jointly trains an encoder \(h\), a dynamics model \(f\), a reward predictor \(R\), and a value function \(Q\), serving as both the planner's trajectory simulator and the policy's value estimator.
Key Design 1: Likelihood-Free Alignment Loss¶
- Function: Aligns the policy \(\pi\) to the planner's action distribution \(\beta\), distilling high-quality planner behavior into the policy network.
- Mechanism: Adopts the forward KL divergence \(\text{KL}(\beta \| \pi)\) rather than the reverse KL. Upon expansion, the entropy term over \(\beta\) is independent of \(\pi\) and can be discarded, yielding the alignment loss \(\mathcal{L}_{\text{align}} = -\mathbb{E}[\log \pi(a|s)]\), which requires only the log-likelihood of \(\pi\) and is entirely free of \(\beta\)'s likelihood.
- Design Motivation: The action distribution produced by MPPI after weighted resampling is non-parametric and its likelihood is intractable. The reverse KL requires \(\beta(a|s)\) and is therefore infeasible; the forward KL naturally circumvents this issue, providing a clean and effective alignment mechanism.
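A minimal sketch of the reduction, assuming a diagonal-Gaussian policy (the parameterization is an assumption for illustration): with actions \(a_i \sim \beta\) drawn from the replay buffer, the forward KL collapses to the average negative log-likelihood under \(\pi\), and \(\beta\)'s density never appears.

```python
import numpy as np

def alignment_loss(actions, mu, log_std):
    """Forward KL(beta || pi) up to a pi-independent constant: the
    average negative log-likelihood of planner actions under a
    diagonal-Gaussian policy pi = N(mu, diag(exp(log_std))^2)."""
    std = np.exp(log_std)
    # Per-dimension Gaussian NLL, summed over action dims, averaged over batch.
    nll = 0.5 * ((actions - mu) / std) ** 2 + log_std + 0.5 * np.log(2 * np.pi)
    return nll.sum(axis=-1).mean()
```

Minimizing this pulls \(\pi\)'s modes toward the planner's sampled actions, which is precisely the distillation effect described above.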
Key Design 2: Soft Q-Weighting Mechanism¶
- Function: Introduces Q-value-based soft weights into the alignment loss to prioritize alignment with high-return actions.
- Mechanism: For each transition in the replay buffer, computes \(w_i = \exp(Q_i/\tau) / \sum_j \exp(Q_j/\tau)\); the weighted alignment loss becomes \(\sum_i w_i [-\log \pi(a_i|s_i)]\), encouraging the policy to focus on high-value experience.
- Design Motivation: The replay buffer stores historical actions generated by the planner at different time steps, with varying quality (earlier planners are weaker). Q-value weighting mirrors the planner's own action selection principle (selecting actions proportional to their value), ensuring the policy learns preferentially from the most informative experience.
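The weighting itself is a temperature-scaled softmax over batch Q-values; a hedged NumPy sketch (names are illustrative):

```python
import numpy as np

def soft_q_weights(q_values, tau=1.0):
    """Softmax over Q-values within a batch: transitions whose planner
    actions earned higher estimated value receive proportionally more
    weight in the alignment loss, mirroring MPPI's exponential
    weighting of rollout returns."""
    q = np.asarray(q_values, dtype=float)
    z = np.exp((q - q.max()) / tau)  # subtract max for numerical stability
    return z / z.sum()
```

Lower \(\tau\) concentrates the weights on the best historical actions; higher \(\tau\) recovers near-uniform weighting.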
Key Design 3: Bootstrap Policy Objective¶
- Function: Combines the alignment loss with the standard Q-maximization loss into a unified policy optimization objective.
- Mechanism: \(\mathcal{L}_{\text{policy}} = -Q(s, \pi(s)) + \lambda_{\text{align}} \cdot \mathcal{L}_{\text{align}}\), where \(\lambda_{\text{align}} = \dim(\mathcal{A})/1000\) (DMC) or \(\dim(\mathcal{A})/50\) (Humanoid-Bench); minimizing this objective jointly maximizes the Q-value and minimizes the alignment loss.
- Design Motivation: Pure Q-maximization suffers from value bias induced by actor divergence; pure planner imitation forgoes value-driven optimization. Their combination leverages the complementary strengths of Q-value guidance and behavioral alignment.
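Putting the pieces together, a toy NumPy sketch of the combined objective (not the authors' implementation; the \(\lambda_{\text{align}}\) schedule follows the values stated in the mechanism above):

```python
import numpy as np

def bootstrap_policy_loss(q_pi, nll, q_buffer, action_dim, tau=1.0, benchmark="dmc"):
    """Combined bootstrap objective: negative mean Q of the policy's own
    actions, plus lambda_align times the soft-Q-weighted negative
    log-likelihood of planner actions from the buffer."""
    lam = action_dim / (1000.0 if benchmark == "dmc" else 50.0)
    w = np.exp((np.asarray(q_buffer) - np.max(q_buffer)) / tau)  # soft Q-weights
    w /= w.sum()
    return -np.mean(q_pi) + lam * np.sum(w * np.asarray(nll))
```

The two terms act as complements: the Q term drives exploitation of the learned value function, while the weighted alignment term anchors the policy to the planner's demonstrated behavior.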
Loss & Training¶
- World model loss: Jointly trains encoder, dynamics, reward, and Q-value heads: \(\mathcal{L}_{\text{model}} = \sum \gamma^t [\|f(z_t, a_t) - \text{sg}(h(s'_t))\|_2^2 + \text{CE}(R_t, r_t) + \text{CE}(Q_t, q_t)]\)
- Policy loss: \(\mathcal{L}_{\text{policy}} = -Q(s, \pi(s)) + \lambda_{\text{align}} \sum_i w_i \, [-\log \pi(a_i \mid s_i)]\)
- Training procedure: A random-action warmup phase pre-trains the world model, followed by iterative cycles of: planner data collection → world model training → bootstrap policy update.
Key Experimental Results¶
Table 1: DMC Suite High-Dimensional Control Tasks (Total Average Return TAR, 3 seeds)¶
| Task | SAC | DreamerV3 (10M) | TD-MPC2 | BMPC | BOOM |
|---|---|---|---|---|---|
| Humanoid-stand | 9.0 | 717.0 | 913.3 | 947.9 | 962.1 |
| Humanoid-walk | 173.8 | 755.6 | 884.8 | 935.1 | 936.1 |
| Humanoid-run | 1.6 | 353.5 | 316.2 | 531.2 | 582.8 |
| Dog-stand | 197.6 | 35.4 | 936.4 | 971.3 | 986.8 |
| Dog-walk | 24.7 | 9.1 | 885.0 | 942.9 | 965.4 |
| Dog-trot | 67.1 | 8.4 | 884.4 | 911.3 | 947.9 |
| Dog-run | 16.5 | 4.3 | 427.0 | 673.7 | 820.7 |
| Average | 58.8 | 269.0 | 745.6 | 835.8 | 877.7 |
BOOM achieves the best performance on all 7 high-dimensional DMC tasks, with an average TAR of 877.7, representing a +5.0% gain over BMPC and +17.7% over TD-MPC2. On Dog-run, BOOM outperforms the runner-up by +21.8%.
Table 2: Humanoid Bench Tasks (Total Average Return TAR, 3 seeds)¶
| Task | SAC | DreamerV3 (10M) | TD-MPC2 | BMPC | BOOM |
|---|---|---|---|---|---|
| H1hand-stand | 74.1 | 845.4 | 728.7 | 780.0 | 926.1 |
| H1hand-walk | 27.0 | 744.0 | 644.2 | 672.6 | 935.4 |
| H1hand-run | 14.1 | 622.4 | 66.1 | 236.0 | 682.2 |
| H1hand-sit | 268.4 | 699.1 | 693.7 | 688.2 | 918.1 |
| H1hand-slide | 19.0 | 367.6 | 141.3 | 440.1 | 926.1 |
| H1hand-pole | 122.5 | 577.4 | 207.5 | 739.9 | 930.5 |
| H1hand-hurdle | 12.9 | 135.7 | 59.0 | 197.1 | 435.6 |
| Average | 68.5 | 555.6 | 338.8 | 511.7 | 820.6 |
BOOM achieves the best performance on all 7 H-Bench tasks, with an average TAR of 820.6, representing a +47.7% gain over DreamerV3 (10M) and +60.5% over BMPC. Gains on individual tasks reach +110.5% on H1hand-slide and +121.0% on H1hand-hurdle.
Ablation Study¶
- Alignment metric: The forward KL (likelihood-free) outperforms the reverse KL (which requires approximating \(\beta\)'s likelihood), validating the necessity of avoiding likelihood estimation.
- Q-weighting mechanism: Replacing Q-weighting with uniform weighting degrades both performance and convergence speed, demonstrating that value-guided weighting is critical for handling inconsistent historical action quality.
- Alignment coefficient \(\lambda_{\text{align}}\): Performance remains stable across a range of 0.1× to 10× the default value, indicating robustness to this hyperparameter.
Highlights & Insights¶
- The paper provides a clear formulation of the two consequences of actor divergence in planning + off-policy RL (value bias and policy misguidance), offering a thorough problem analysis.
- The likelihood-free forward KL alignment elegantly avoids the intractability of computing the planner's non-parametric distribution likelihood.
- Theoretical guarantees are provided: bootstrap alignment controls the return gap (Theorem 1) and the Q-value gap (Theorem 2).
- BOOM achieves comprehensive state-of-the-art results across 14 high-dimensional tasks, with gains exceeding 100% on some tasks, strongly supporting the method's efficacy.
- The implementation is simple: BOOM requires only an additional alignment loss term on top of TD-MPC2.
Limitations & Future Work¶
- The approach still depends on the quality of the MPPI planner; if the world model is poorly learned, the alignment targets produced by the planner are themselves unreliable.
- The forward KL is mode-covering: the policy may spread probability mass across all of the planner's modes, including low-value ones, potentially leading to overly conservative behavior.
- The alignment coefficient \(\lambda_{\text{align}}\) uses different formulas for DMC and H-Bench, requiring task-specific tuning.
- Evaluation is limited to continuous locomotion control tasks; validation on other task types such as manipulation and navigation is absent.
- Comparisons with recent imagination-driven methods (e.g., improved variants of DreamerV3) and offline RL approaches are lacking.
Related Work & Insights¶
- Planning-driven MBRL: TD-MPC/TD-MPC2 jointly learns a world model and policy, using MPPI planning for data collection; BMPC further incorporates imitation of planner actions with relabeling.
- Imagination-driven MBRL: DreamerV3 trains policies via internal world model rollouts without online planning, but with limited sample efficiency.
- Off-policy RL: Methods such as SAC face distributional shift when learning from a replay buffer, which is fundamentally related to the actor divergence discussed in this paper.
- Behavior cloning and policy distillation: The alignment loss in BOOM can be viewed as online distillation from the planner to the policy, analogous to behavior cloning regularization in offline RL.
Rating¶
- Novelty: ⭐⭐⭐⭐ — The problem is clearly defined; the likelihood-free alignment solution is concise and effective, supported by theoretical analysis.
- Experimental Thoroughness: ⭐⭐⭐⭐ — Comprehensive state-of-the-art results across 14 high-dimensional tasks with complete ablations, though task diversity is limited.
- Writing Quality: ⭐⭐⭐⭐ — Problem motivation is logically derived; theoretical analysis and empirical results are mutually reinforcing.
- Value: ⭐⭐⭐⭐ — Provides a practical solution to the actor divergence problem in planning + off-policy RL.