Bootstrap Off-policy with World Model¶
Conference: NeurIPS 2025 arXiv: 2511.00423 Code: molumitu/BOOM_MBRL Area: Reinforcement Learning Keywords: Model-based reinforcement learning, online planning, off-policy learning, actor divergence, world model, MPPI
TL;DR¶
This paper proposes the BOOM framework, which distills high-quality actions from an online planner into a policy network via a bootstrap alignment loop. By employing a likelihood-free forward KL divergence and a soft Q-weighting mechanism, BOOM effectively mitigates the actor divergence between the planner and the policy, achieving state-of-the-art performance on high-dimensional continuous control tasks.
Background & Motivation¶
- Online planning improves RL performance: Lookahead search with learned world models (e.g., MPPI) can generate higher-quality actions than a standalone policy network, and is widely adopted in model-based RL.
- Actor divergence is inevitable: Training data is collected by the planner-augmented behavior policy \(\beta\) (the policy \(\pi\) refined by MPPI planning), so a distributional gap inherently exists between the policy network \(\pi\) and the behavior policy \(\beta\).
- Value function learning bias: The Q-function is trained on the distribution of \(\beta\), leading to overestimation in regions frequently visited by \(\pi\) but rarely covered by \(\beta\).
- Unreliable policy updates: Policy gradient updates based on biased Q-values mislead optimization, causing training instability and performance degradation.
- Planner outputs are non-parametric: Sampling-based planners such as MPPI produce action distributions whose likelihood cannot be computed analytically, making conventional KL divergence methods inapplicable.
- Limitations of prior work: TD-MPC2 ignores actor divergence; BMPC imitates the planner without value guidance, limiting effectiveness when historical action quality is inconsistent.
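To make the intractability concrete, here is a minimal NumPy sketch of an MPPI-style weighted-resampling step (function and variable names are illustrative, not the authors' code): the planner's output is an empirical distribution over sampled action candidates, so there is no closed-form density \(\beta(a|s)\) to evaluate.

```python
import numpy as np

def mppi_action_distribution(rollout_returns, sampled_actions, temperature=0.5):
    """Illustrative MPPI-style update: candidate actions are re-weighted
    by the exponentiated returns of their imagined rollouts. The result
    is a particle (empirical) distribution over actions -- sampling from
    it is easy, but log beta(a|s) has no analytic form."""
    returns = np.asarray(rollout_returns, dtype=float)
    weights = np.exp((returns - returns.max()) / temperature)  # max-shift for stability
    weights /= weights.sum()
    # "Sampling from beta" is just resampling the particles by weight.
    idx = np.random.choice(len(weights), p=weights)
    return sampled_actions[idx], weights
```

This is exactly why reverse-KL objectives, which need \(\log \beta(a|s)\), are infeasible here.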
Method¶
Overall Architecture¶
BOOM consists of three tightly coupled components — a policy network, an online planner, and a world model — forming a bootstrap loop: the policy initializes the planner's solution, and the planner refines actions via model-predictive optimization, which in turn guides policy alignment. The world model, following the TD-MPC2 paradigm, jointly trains an encoder \(h\), a dynamics model \(f\), a reward predictor \(R\), and a value function \(Q\), serving as both the planner's trajectory simulator and the policy's value estimator.
Key Design 1: Likelihood-Free Alignment Loss¶
- Function: Aligns the policy \(\pi\) to the planner's action distribution \(\beta\), distilling high-quality planner behavior into the policy network.
- Mechanism: Adopts the forward KL divergence \(\text{KL}(\beta \| \pi)\) rather than the reverse KL. Upon expansion, the entropy term over \(\beta\) is independent of \(\pi\) and can be discarded, yielding the alignment loss \(\mathcal{L}_{\text{align}} = -\mathbb{E}[\log \pi(a|s)]\), which requires only the log-likelihood of \(\pi\) and is entirely free of \(\beta\)'s likelihood.
- Design Motivation: The action distribution produced by MPPI after weighted resampling is non-parametric and its likelihood is intractable. The reverse KL requires \(\beta(a|s)\) and is therefore infeasible; the forward KL naturally circumvents this issue, providing a clean and effective alignment mechanism.
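A minimal sketch of the reduction, assuming a diagonal-Gaussian policy (the parameterization is an assumption for illustration): with actions \(a_i \sim \beta\) drawn from the replay buffer, the forward KL collapses to the average negative log-likelihood under \(\pi\), and \(\beta\)'s density never appears.

```python
import numpy as np

def alignment_loss(actions, mu, log_std):
    """Forward KL(beta || pi) up to a pi-independent constant: the
    average negative log-likelihood of planner actions under a
    diagonal-Gaussian policy pi = N(mu, diag(exp(log_std))^2)."""
    std = np.exp(log_std)
    # Per-dimension Gaussian NLL, summed over action dims, averaged over batch.
    nll = 0.5 * ((actions - mu) / std) ** 2 + log_std + 0.5 * np.log(2 * np.pi)
    return nll.sum(axis=-1).mean()
```

Minimizing this pulls \(\pi\)'s modes toward the planner's sampled actions, which is precisely the distillation effect described above.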
Key Design 2: Soft Q-Weighting Mechanism¶
- Function: Introduces Q-value-based soft weights into the alignment loss to prioritize alignment with high-return actions.
- Mechanism: For each transition in the replay buffer, computes \(w_i = \exp(Q_i/\tau) / \sum_j \exp(Q_j/\tau)\); the weighted alignment loss becomes \(\sum_i w_i [-\log \pi(a_i|s_i)]\), encouraging the policy to focus on high-value experience.
- Design Motivation: The replay buffer stores historical actions generated by the planner at different time steps, with varying quality (earlier planners are weaker). Q-value weighting mirrors the planner's own action selection principle (selecting actions proportional to their value), ensuring the policy learns preferentially from the most informative experience.
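The weighting itself is a temperature-scaled softmax over batch Q-values; a hedged NumPy sketch (names are illustrative):

```python
import numpy as np

def soft_q_weights(q_values, tau=1.0):
    """Softmax over Q-values within a batch: transitions whose planner
    actions earned higher estimated value receive proportionally more
    weight in the alignment loss, mirroring MPPI's exponential
    weighting of rollout returns."""
    q = np.asarray(q_values, dtype=float)
    z = np.exp((q - q.max()) / tau)  # subtract max for numerical stability
    return z / z.sum()
```

Lower \(\tau\) concentrates the weights on the best historical actions; higher \(\tau\) recovers near-uniform weighting.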
Key Design 3: Bootstrap Policy Objective¶
- Function: Combines the alignment loss with the standard Q-maximization loss into a unified policy optimization objective.
- Mechanism: \(\mathcal{L}_{\text{policy}} = -Q(s, \pi(s)) + \lambda_{\text{align}} \cdot \mathcal{L}_{\text{align}}\), where \(\lambda_{\text{align}} = \dim(\mathcal{A})/1000\) (DMC) or \(\dim(\mathcal{A})/50\) (Humanoid-Bench); minimizing this objective jointly maximizes the Q-value and minimizes the alignment loss.
- Design Motivation: Pure Q-maximization suffers from value bias induced by actor divergence; pure planner imitation forgoes value-driven optimization. Their combination leverages the complementary strengths of Q-value guidance and behavioral alignment.
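Putting the pieces together, a toy NumPy sketch of the combined objective (not the authors' implementation; the \(\lambda_{\text{align}}\) schedule follows the values stated in the mechanism above):

```python
import numpy as np

def bootstrap_policy_loss(q_pi, nll, q_buffer, action_dim, tau=1.0, benchmark="dmc"):
    """Combined bootstrap objective: negative mean Q of the policy's own
    actions, plus lambda_align times the soft-Q-weighted negative
    log-likelihood of planner actions from the buffer."""
    lam = action_dim / (1000.0 if benchmark == "dmc" else 50.0)
    w = np.exp((np.asarray(q_buffer) - np.max(q_buffer)) / tau)  # soft Q-weights
    w /= w.sum()
    return -np.mean(q_pi) + lam * np.sum(w * np.asarray(nll))
```

The two terms act as complements: the Q term drives exploitation of the learned value function, while the weighted alignment term anchors the policy to the planner's demonstrated behavior.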
Loss & Training¶
- World model loss: Jointly trains encoder, dynamics, reward, and Q-value heads: \(\mathcal{L}_{\text{model}} = \sum \gamma^t [\|f(z_t, a_t) - \text{sg}(h(s'_t))\|_2^2 + \text{CE}(R_t, r_t) + \text{CE}(Q_t, q_t)]\)
- Policy loss: \(\mathcal{L}_{\text{policy}} = -Q(s, \pi(s)) + \lambda_{\text{align}} \sum_i w_i \, [-\log \pi(a_i \mid s_i)]\)
- Training procedure: A random-action warmup phase pre-trains the world model, followed by iterative cycles of: planner data collection → world model training → bootstrap policy update.
Key Experimental Results¶
Table 1: DMC Suite High-Dimensional Control Tasks (Total Average Return TAR, 3 seeds)¶
| Task | SAC | DreamerV3 (10M) | TD-MPC2 | BMPC | BOOM |
|---|---|---|---|---|---|
| Humanoid-stand | 9.0 | 717.0 | 913.3 | 947.9 | 962.1 |
| Humanoid-walk | 173.8 | 755.6 | 884.8 | 935.1 | 936.1 |
| Humanoid-run | 1.6 | 353.5 | 316.2 | 531.2 | 582.8 |
| Dog-stand | 197.6 | 35.4 | 936.4 | 971.3 | 986.8 |
| Dog-walk | 24.7 | 9.1 | 885.0 | 942.9 | 965.4 |
| Dog-trot | 67.1 | 8.4 | 884.4 | 911.3 | 947.9 |
| Dog-run | 16.5 | 4.3 | 427.0 | 673.7 | 820.7 |
| Average | 58.8 | 269.0 | 745.6 | 835.8 | 877.7 |
BOOM achieves the best performance on all 7 high-dimensional DMC tasks, with an average TAR of 877.7, representing a +5.0% gain over BMPC and +17.7% over TD-MPC2. On Dog-run, BOOM outperforms the runner-up by +21.8%.
Table 2: Humanoid Bench Tasks (Total Average Return TAR, 3 seeds)¶
| Task | SAC | DreamerV3 (10M) | TD-MPC2 | BMPC | BOOM |
|---|---|---|---|---|---|
| H1hand-stand | 74.1 | 845.4 | 728.7 | 780.0 | 926.1 |
| H1hand-walk | 27.0 | 744.0 | 644.2 | 672.6 | 935.4 |
| H1hand-run | 14.1 | 622.4 | 66.1 | 236.0 | 682.2 |
| H1hand-sit | 268.4 | 699.1 | 693.7 | 688.2 | 918.1 |
| H1hand-slide | 19.0 | 367.6 | 141.3 | 440.1 | 926.1 |
| H1hand-pole | 122.5 | 577.4 | 207.5 | 739.9 | 930.5 |
| H1hand-hurdle | 12.9 | 135.7 | 59.0 | 197.1 | 435.6 |
| Average | 68.5 | 555.6 | 338.8 | 511.7 | 820.6 |
BOOM achieves the best performance on all 7 H-Bench tasks, with an average TAR of 820.6, representing a +47.7% gain over DreamerV3 (10M) and +60.5% over BMPC. Gains on individual tasks reach +110.5% on H1hand-slide and +121.0% on H1hand-hurdle.
Ablation Study¶
- Alignment metric: The forward KL (likelihood-free) outperforms the reverse KL (which requires approximating \(\beta\)'s likelihood), validating the necessity of avoiding likelihood estimation.
- Q-weighting mechanism: Replacing Q-weighting with uniform weighting degrades both performance and convergence speed, demonstrating that value-guided weighting is critical for handling inconsistent historical action quality.
- Alignment coefficient \(\lambda_{\text{align}}\): Performance remains stable across a range of 0.1× to 10× the default value, indicating robustness to this hyperparameter.
Highlights & Insights¶
- The paper provides a clear formulation of the two consequences of actor divergence in planning + off-policy RL (value bias and policy misguidance), offering a thorough problem analysis.
- The likelihood-free forward KL alignment elegantly avoids the intractability of computing the planner's non-parametric distribution likelihood.
- Theoretical guarantees are provided: bootstrap alignment controls the return gap (Theorem 1) and the Q-value gap (Theorem 2).
- BOOM achieves comprehensive state-of-the-art results across 14 high-dimensional tasks, with gains exceeding 100% on some tasks, strongly supporting the method's efficacy.
- The implementation is simple: BOOM requires only an additional alignment loss term on top of TD-MPC2.
Limitations & Future Work¶
- The approach still depends on the quality of the MPPI planner; if the world model is poorly learned, the alignment targets produced by the planner are themselves unreliable.
- The forward KL is mode-covering: the policy may spread probability mass across all of the planner's modes, including low-value ones, potentially leading to overly conservative behavior.
- The alignment coefficient \(\lambda_{\text{align}}\) uses different formulas for DMC and H-Bench, requiring task-specific tuning.
- Evaluation is limited to continuous locomotion control tasks; validation on other task types such as manipulation and navigation is absent.
- Comparisons with recent imagination-driven methods (e.g., improved variants of DreamerV3) and offline RL approaches are lacking.
Related Work & Insights¶
- Planning-driven MBRL: TD-MPC/TD-MPC2 jointly learns a world model and policy, using MPPI planning for data collection; BMPC further incorporates imitation of planner actions with relabeling.
- Imagination-driven MBRL: DreamerV3 trains policies via internal world model rollouts without online planning, but with limited sample efficiency.
- Off-policy RL: Methods such as SAC face distributional shift when learning from a replay buffer, which is fundamentally related to the actor divergence discussed in this paper.
- Behavior cloning and policy distillation: The alignment loss in BOOM can be viewed as online distillation from the planner to the policy, analogous to behavior cloning regularization in offline RL.
Rating¶
- Novelty: ⭐⭐⭐⭐ — The problem is clearly defined; the likelihood-free alignment solution is concise and effective, supported by theoretical analysis.
- Experimental Thoroughness: ⭐⭐⭐⭐ — Comprehensive state-of-the-art results across 14 high-dimensional tasks with complete ablations, though task diversity is limited.
- Writing Quality: ⭐⭐⭐⭐ — Problem motivation is logically derived; theoretical analysis and empirical results are mutually reinforcing.
- Value: ⭐⭐⭐⭐ — Provides a practical solution to the actor divergence problem in planning + off-policy RL.