
Bootstrap Off-policy with World Model

Conference: NeurIPS 2025
arXiv: 2511.00423
Code: molumitu/BOOM_MBRL
Area: Reinforcement Learning
Keywords: Model-based reinforcement learning, online planning, off-policy learning, actor divergence, world model, MPPI

TL;DR

This paper proposes the BOOM framework, which distills high-quality actions from an online planner into a policy network via a bootstrap alignment loop. By employing a likelihood-free forward KL divergence and a soft Q-weighting mechanism, BOOM effectively mitigates the actor divergence between the planner and the policy, achieving state-of-the-art performance on high-dimensional continuous control tasks.

Background & Motivation

  • Online planning improves RL performance: Lookahead search with learned world models (e.g., MPPI) can generate higher-quality actions than a standalone policy network, and is widely adopted in model-based RL.
  • Actor divergence is inevitable: Training data is collected by the planner-augmented behavior policy \(\beta = \pi + \text{MPPI}\), yet a distributional gap inherently exists between the policy network \(\pi\) and the behavior policy \(\beta\).
  • Value function learning bias: The Q-function is trained on the distribution of \(\beta\), leading to overestimation in regions frequently visited by \(\pi\) but rarely covered by \(\beta\).
  • Unreliable policy updates: Policy gradient updates based on biased Q-values mislead optimization, causing training instability and performance degradation.
  • Planner outputs are non-parametric: Sampling-based planners such as MPPI produce action distributions whose likelihood cannot be computed analytically, making conventional KL divergence methods inapplicable.
  • Limitations of prior work: TD-MPC2 ignores actor divergence; BMPC imitates the planner without value guidance, limiting effectiveness when historical action quality is inconsistent.

Method

Overall Architecture

BOOM consists of three tightly coupled components — a policy network, an online planner, and a world model — that form a bootstrap loop: the policy initializes the planner's solution, the planner refines actions via model-predictive optimization, and the refined actions in turn guide policy alignment. The world model, following the TD-MPC2 paradigm, jointly trains an encoder \(h\), a dynamics model \(f\), a reward predictor \(R\), and a value function \(Q\), providing trajectory simulation for the planner and value estimation for the policy.
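To make the loop concrete, here is a minimal sketch (not the paper's code) of a policy-initialized MPPI planner scoring candidate action sequences with a learned world model; `policy.sample`, `world_model.imagined_return`, and all hyperparameters are illustrative placeholders.

```python
import torch

def mppi_plan(policy, world_model, z, horizon=3, n_samples=512, n_elites=64,
              temperature=0.5, n_iters=6):
    """Policy-initialized MPPI over latent rollouts (illustrative sketch)."""
    # Warm-start the sampling distribution with actions proposed by the policy network.
    mean = torch.stack([policy.sample(z) for _ in range(horizon)])       # (H, A)
    std = torch.full_like(mean, 0.5)
    for _ in range(n_iters):
        # Sample candidate action sequences around the current mean.
        noise = torch.randn(horizon, n_samples, mean.shape[-1])
        actions = mean.unsqueeze(1) + std.unsqueeze(1) * noise           # (H, N, A)
        # Score each sequence by its imagined return under the world model
        # (sum of predicted rewards plus a terminal Q-value).
        returns = world_model.imagined_return(z, actions)                # (N,)
        # Keep the top elites and refit the sampling distribution with
        # return-weighted moments -- the MPPI update.
        elite_idx = returns.topk(n_elites).indices
        elite_actions, elite_returns = actions[:, elite_idx], returns[elite_idx]
        w = torch.softmax(elite_returns / temperature, dim=0).view(1, -1, 1)
        mean = (w * elite_actions).sum(dim=1)
        std = ((w * (elite_actions - mean.unsqueeze(1)) ** 2).sum(dim=1)).sqrt()
    return mean[0]  # execute only the first action of the refined plan
```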

Key Design 1: Likelihood-Free Alignment Loss

  • Function: Aligns the policy \(\pi\) to the planner's action distribution \(\beta\), distilling high-quality planner behavior into the policy network.
  • Mechanism: Adopts the forward KL divergence \(\text{KL}(\beta \| \pi)\) rather than the reverse KL. Upon expansion, the entropy term over \(\beta\) is independent of \(\pi\) and can be discarded, yielding the alignment loss \(\mathcal{L}_{\text{align}} = -\mathbb{E}_{a \sim \beta}[\log \pi(a|s)]\), which requires only the log-likelihood of \(\pi\) and is entirely free of \(\beta\)'s likelihood.
  • Design Motivation: The action distribution produced by MPPI after weighted resampling is non-parametric and its likelihood is intractable. The reverse KL requires \(\beta(a|s)\) and is therefore infeasible; the forward KL naturally circumvents this issue, providing a clean and effective alignment mechanism (a minimal sketch follows this list).
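Below is a minimal sketch of the alignment loss, assuming a diagonal Gaussian policy head and a batch of planner actions drawn from the replay buffer; the policy parameterization and batching in the paper may differ.

```python
import torch
from torch.distributions import Normal

def alignment_loss(policy_mean, policy_std, planner_actions):
    """Likelihood-free forward-KL alignment: only log pi(a|s) is needed;
    the planner's own (intractable) likelihood never appears."""
    pi = Normal(policy_mean, policy_std)                      # diagonal Gaussian policy head
    # Monte-Carlo estimate of E_beta[-log pi(a|s)] using stored planner actions.
    return -pi.log_prob(planner_actions).sum(dim=-1).mean()
```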

Key Design 2: Soft Q-Weighting Mechanism

  • Function: Introduces Q-value-based soft weights into the alignment loss to prioritize alignment with high-return actions.
  • Mechanism: For each transition in the replay buffer, computes \(w_i = \exp(Q_i/\tau) / \sum_j \exp(Q_j/\tau)\); the weighted alignment loss becomes \(\sum_i w_i [-\log \pi(a_i|s_i)]\), encouraging the policy to focus on high-value experience.
  • Design Motivation: The replay buffer stores historical actions generated by the planner at different time steps, and their quality varies (earlier planners are weaker). Q-value weighting mirrors the planner's own action selection principle (weighting actions by their value), so the policy learns preferentially from the most informative experience (see the sketch below).
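A minimal sketch of the weighted loss, under the assumptions that the softmax is taken over the sampled batch and that the Q-values are detached so they act only as fixed weights:

```python
import torch

def q_weighted_alignment_loss(log_pi, q_values, tau=1.0):
    """log_pi:   (B,) log pi(a_i | s_i) for planner actions in the sampled batch
    q_values: (B,) Q(s_i, a_i) from the value head
    tau:      softmax temperature"""
    # w_i = exp(Q_i / tau) / sum_j exp(Q_j / tau), normalized over the batch.
    w = torch.softmax(q_values.detach() / tau, dim=0)
    # Weighted negative log-likelihood of planner actions under the policy.
    return (w * (-log_pi)).sum()
```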

Key Design 3: Bootstrap Policy Objective

  • Function: Combines the alignment loss with the standard Q-maximization loss into a unified policy optimization objective.
  • Mechanism: \(\mathcal{L}_{\text{policy}} = -\sum Q(s, \pi(s)) + \lambda_{\text{align}} \cdot \mathcal{L}_{\text{align}}\), where \(\lambda_{\text{align}} = \dim(\mathcal{A})/1000\) (DMC) or \(\dim(\mathcal{A})/50\) (Humanoid-Bench); minimizing this loss jointly maximizes the Q-value and the weighted log-likelihood of planner actions.
  • Design Motivation: Pure Q-maximization suffers from value bias induced by actor divergence; pure planner imitation forgoes value-driven optimization. Their combination leverages the complementary strengths of Q-value guidance and behavioral alignment (a combined sketch follows this list).
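A minimal sketch of the combined objective; `policy.sample`, `policy.log_prob`, and `q_fn` are illustrative interfaces, not the paper's code.

```python
import torch

def bootstrap_policy_loss(policy, q_fn, states, planner_actions, tau=1.0, scale=1000.0):
    """Bootstrap objective sketch: minimizing it maximizes Q(s, pi(s)) while
    pulling pi toward high-value planner actions via the Q-weighted alignment term."""
    lam = planner_actions.shape[-1] / scale          # e.g. dim(A)/1000 on DMC, dim(A)/50 on Humanoid-Bench
    pi_actions = policy.sample(states)               # reparameterized actions from the current policy
    q_term = -q_fn(states, pi_actions).mean()        # standard Q-maximization term
    log_pi = policy.log_prob(states, planner_actions)
    w = torch.softmax(q_fn(states, planner_actions).detach() / tau, dim=0)
    align_term = (w * (-log_pi)).sum()               # soft Q-weighted alignment loss
    return q_term + lam * align_term
```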

Loss & Training

  • World model loss: Jointly trains encoder, dynamics, reward, and Q-value heads: \(\mathcal{L}_{\text{model}} = \sum \gamma^t [\|f(z_t, a_t) - \text{sg}(h(s'_t))\|_2^2 + \text{CE}(R_t, r_t) + \text{CE}(Q_t, q_t)]\)
  • Policy loss: \(\mathcal{L}_{\text{policy}} = -\sum Q(s, \pi(s)) + \lambda_{\text{align}} \sum_i w_i \big(-\log \pi(a_i|s_i)\big)\)
  • Training procedure: A random-action warmup phase pre-trains the world model, followed by iterative cycles of planner data collection → world model training → bootstrap policy update (outlined in the sketch below).
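A high-level sketch of this loop, with placeholder schedules, update ratios, and interfaces (`env`, `buffer`, `planner.plan`, etc. are illustrative, not the paper's settings):

```python
def train_boom(env, world_model, policy, planner, buffer,
               warmup_steps=5_000, total_steps=1_000_000):
    """Warmup with random actions, then interleave planner data collection,
    world-model updates, and bootstrap policy updates."""
    obs = env.reset()
    for step in range(total_steps):
        if step < warmup_steps:
            action = env.action_space.sample()        # warmup phase: random exploration
        else:
            z = world_model.encode(obs)
            action = planner.plan(z, prior=policy)    # policy-initialized MPPI behavior action
        next_obs, reward, done, _ = env.step(action)
        buffer.add(obs, action, reward, next_obs, done)
        obs = env.reset() if done else next_obs

        if step >= warmup_steps:
            batch = buffer.sample()
            world_model.update(batch)                 # L_model: dynamics, reward, and Q heads
            policy.update(batch, world_model)         # L_policy: bootstrap objective above
    return world_model, policy
```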

Key Experimental Results

Table 1: DMC Suite High-Dimensional Control Tasks (Total Average Return TAR, 3 seeds)

Task SAC DreamerV3 (10M) TD-MPC2 BMPC BOOM
Humanoid-stand 9.0 717.0 913.3 947.9 962.1
Humanoid-walk 173.8 755.6 884.8 935.1 936.1
Humanoid-run 1.6 353.5 316.2 531.2 582.8
Dog-stand 197.6 35.4 936.4 971.3 986.8
Dog-walk 24.7 9.1 885.0 942.9 965.4
Dog-trot 67.1 8.4 884.4 911.3 947.9
Dog-run 16.5 4.3 427.0 673.7 820.7
Average 58.8 269.0 745.6 835.8 877.7

BOOM achieves the best performance on all 7 high-dimensional DMC tasks, with an average TAR of 877.7, representing a +5.0% gain over BMPC and +17.7% over TD-MPC2. On Dog-run, BOOM outperforms the runner-up by +21.8%.

Table 2: Humanoid Bench Tasks (Total Average Return TAR, 3 seeds)

Task SAC DreamerV3 (10M) TD-MPC2 BMPC BOOM
H1hand-stand 74.1 845.4 728.7 780.0 926.1
H1hand-walk 27.0 744.0 644.2 672.6 935.4
H1hand-run 14.1 622.4 66.1 236.0 682.2
H1hand-sit 268.4 699.1 693.7 688.2 918.1
H1hand-slide 19.0 367.6 141.3 440.1 926.1
H1hand-pole 122.5 577.4 207.5 739.9 930.5
H1hand-hurdle 12.9 135.7 59.0 197.1 435.6
Average 68.5 555.6 338.8 511.7 820.6

BOOM achieves the best performance on all 7 H-Bench tasks, with an average TAR of 820.6, representing a +47.7% gain over DreamerV3 (10M) and +60.5% over BMPC. Gains on individual tasks reach +110.5% on H1hand-slide and +121.0% on H1hand-hurdle.

Ablation Study

  • Alignment metric: The forward KL (likelihood-free) outperforms the reverse KL (which requires approximating \(\beta\)'s likelihood), validating the necessity of avoiding likelihood estimation.
  • Q-weighting mechanism: Replacing Q-weighting with uniform weighting degrades both performance and convergence speed, demonstrating that value-guided weighting is critical for handling inconsistent historical action quality.
  • Alignment coefficient \(\lambda_{\text{align}}\): Performance remains stable across a range of 0.1× to 10× the default value, indicating robustness to this hyperparameter.

Highlights & Insights

  • The paper provides a clear formulation of the two consequences of actor divergence in planning + off-policy RL (biased value estimates and misguided policy updates), offering a thorough problem analysis.
  • The likelihood-free forward KL alignment elegantly avoids the intractability of computing the planner's non-parametric distribution likelihood.
  • Theoretical guarantees are provided: bootstrap alignment controls the return gap (Theorem 1) and the Q-value gap (Theorem 2).
  • BOOM achieves comprehensive state-of-the-art results across 14 high-dimensional tasks, with gains exceeding 100% on some tasks, strongly supporting the method's efficacy.
  • The implementation is simple: BOOM requires only an additional alignment loss term on top of TD-MPC2.

Limitations & Future Work

  • The approach still depends on the quality of the MPPI planner; if the world model is poorly learned, the alignment targets produced by the planner are themselves unreliable.
  • The forward KL is mode-covering, so the policy may spread probability mass over the planner's entire action distribution, potentially leading to overly conservative behavior.
  • The alignment coefficient \(\lambda_{\text{align}}\) uses different formulas for DMC and H-Bench, requiring task-specific tuning.
  • Evaluation is limited to continuous locomotion control tasks; validation on other task types such as manipulation and navigation is absent.
  • Comparisons with recent imagination-driven methods (e.g., improved variants of DreamerV3) and offline RL approaches are lacking.

Related Work

  • Planning-driven MBRL: TD-MPC/TD-MPC2 jointly learn a world model and a policy, using MPPI planning for data collection; BMPC further incorporates imitation of planner actions with relabeling.
  • Imagination-driven MBRL: DreamerV3 trains policies via internal world model rollouts without online planning, but with limited sample efficiency.
  • Off-policy RL: Methods such as SAC face distributional shift when learning from a replay buffer, which is fundamentally related to the actor divergence discussed in this paper.
  • Behavior cloning and policy distillation: The alignment loss in BOOM can be viewed as online distillation from the planner to the policy, analogous to behavior cloning regularization in offline RL.

Rating

  • Novelty: ⭐⭐⭐⭐ — The problem is clearly defined; the likelihood-free alignment solution is concise and effective, supported by theoretical analysis.
  • Experimental Thoroughness: ⭐⭐⭐⭐ — Comprehensive state-of-the-art results across 14 high-dimensional tasks with complete ablations, though task diversity is limited.
  • Writing Quality: ⭐⭐⭐⭐ — Problem motivation is logically derived; theoretical analysis and empirical results are mutually reinforcing.
  • Value: ⭐⭐⭐⭐ — Provides a practical solution to the actor divergence problem in planning + off-policy RL.