Bootstrap Off-policy with World Model (BOOM)¶
Conference: NeurIPS 2025
arXiv: 2511.00423
Code: molumitu/BOOM_MBRL
Area: Reinforcement Learning
Keywords: Model-based reinforcement learning, online planning, off-policy learning, world model, actor divergence, behavior alignment
TL;DR¶
This paper proposes BOOM, a framework that tightly couples an online planner (MPPI) with off-policy policy learning via a bootstrap loop: the policy initializes the planner, and the planner in turn guides policy improvement through a likelihood-free alignment loss, supplemented by a soft Q-weighted mechanism that prioritizes high-return behaviors. The method achieves state-of-the-art performance on high-dimensional continuous control tasks.
Background & Motivation¶
Advantages of Online Planning: In model-based RL, online planning methods such as MPPI simulate future trajectories to produce higher-quality actions than a standalone policy network, substantially improving sample efficiency and final performance.
Actor Divergence Problem: When the planner is used to collect data, an inherent distributional shift arises between the behavior policy \(\beta\) (the policy \(\pi\) refined by MPPI planning) and the policy network \(\pi\) itself: data in the replay buffer is collected under \(\beta\) rather than \(\pi\), violating the distributional consistency assumption of off-policy learning.
Distributional Shift in Value Learning: The value function is trained on the state-action distribution of \(\beta\), leading to overestimation in regions visited by \(\pi\) but rarely covered by \(\beta\), resulting in inaccurate value estimates.
Unreliable Policy Updates: Policy updates driven by biased Q-values may be misguided, a problem that becomes increasingly severe in high-dimensional, complex environments.
Non-parameterizable Planner Distribution: The output action distribution of sampling-based planners such as MPPI is non-parametric (produced via importance reweighting and resampling), precluding the computation of exact likelihoods and rendering metrics such as KL divergence inapplicable in their standard form.
Limitations of Prior Work: TD-MPC2 combines planning with off-policy learning but does not address actor divergence; BMPC employs simple behavioral cloning for alignment but lacks value-guided selection; DreamerV3 exhibits limited sample efficiency on high-dimensional tasks.
Method¶
Overall Architecture¶
BOOM comprises three tightly coupled components: a policy network \(\pi\), an online planner MPPI, and a world model (encoder \(h\), dynamics model \(f\), reward predictor \(R\), and value function \(Q\)). The core is a bootstrap loop: the policy initializes the planner, the planner produces higher-quality actions through model-predictive optimization, and these actions are used to guide policy improvement via behavior alignment. The world model is jointly trained in a TD-MPC2 style, simultaneously supporting trajectory simulation for the planner and providing value estimates for the policy.
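As a concrete illustration of the planner half of this loop, here is a minimal sketch (an assumption-laden illustration, not the authors' implementation) of an MPPI-style planner warm-started by the policy in latent space; all module names, signatures, and hyperparameters below are illustrative.

```python
# Minimal sketch (assumptions, not the paper's code): an MPPI-style planner
# warm-started by the policy in latent space. `encoder`, `dynamics`, `reward_fn`,
# `value_fn`, and `policy` are assumed callables with the shapes noted below.
import torch

@torch.no_grad()
def plan_with_policy_prior(obs, encoder, dynamics, reward_fn, value_fn, policy,
                           horizon=8, num_samples=256, temperature=0.5,
                           noise_std=0.3, gamma=0.99):
    z0 = encoder(obs)                                    # latent state, shape (latent_dim,)

    # Policy prior: roll pi through the world model to get a mean action sequence.
    z = z0.unsqueeze(0)
    mean_seq = []
    for _ in range(horizon):
        a = policy(z)                                    # (1, action_dim)
        mean_seq.append(a.squeeze(0))
        z = dynamics(z, a)
    mean_seq = torch.stack(mean_seq)                     # (horizon, action_dim)

    # Sample candidate action sequences around the policy prior.
    noise = noise_std * torch.randn(num_samples, *mean_seq.shape)
    candidates = (mean_seq.unsqueeze(0) + noise).clamp(-1.0, 1.0)

    # Evaluate candidates with the world model: discounted rewards + terminal value.
    returns = torch.zeros(num_samples)
    z = z0.unsqueeze(0).expand(num_samples, -1)
    for t in range(horizon):
        a_t = candidates[:, t]
        returns += gamma ** t * reward_fn(z, a_t).squeeze(-1)
        z = dynamics(z, a_t)
    returns += gamma ** horizon * value_fn(z, policy(z)).squeeze(-1)

    # MPPI-style importance reweighting of candidates by their estimated returns.
    weights = torch.softmax(returns / temperature, dim=0)
    plan = (weights[:, None, None] * candidates).sum(dim=0)
    return plan[0]                                       # first action is executed in the env
```

The first action of the reweighted plan is executed in the environment and stored in the replay buffer; these stored planner actions are exactly the samples that the alignment loss described next imitates.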
Key Design 1: Likelihood-Free Alignment Loss¶
- Function: Aligns policy \(\pi\) to the actions generated by planner \(\beta\), reducing the distributional gap between the two.
- Mechanism: Employs the forward KL divergence \(\text{KL}(\beta \| \pi)\) rather than the reverse KL. When the forward KL is expanded, the entropy term of \(\beta\) does not depend on \(\pi\) and can be dropped, yielding \(\mathcal{L}_{\text{align}} = \mathbb{E}_{a \sim \beta}[-\log \pi(a|s)]\), which requires only the log-likelihood of \(\pi\) evaluated on planner actions and entirely avoids computing \(\beta(a|s)\) (see the sketch below).
- Design Motivation: The output distribution of MPPI is non-parametric (no longer a simple Gaussian after weighted averaging), making its likelihood intractable. The reverse KL requires the likelihood of \(\beta\), whereas the forward KL requires only samples from \(\beta\) (already available in the replay buffer) and evaluation of \(\pi\)'s likelihood, making it naturally suited to this setting.
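A minimal sketch of this loss, assuming a Gaussian policy head; the names (`policy_mean`, `policy_std`, `planner_actions`) are illustrative, and the planner actions come straight from the replay buffer, so \(\beta\)'s density is never needed.

```python
# Minimal sketch (assumed Gaussian policy head, illustrative names): the forward-KL
# alignment loss reduces to the negative log-likelihood of planner actions under pi.
import torch
from torch.distributions import Normal

def alignment_nll(policy_mean, policy_std, planner_actions):
    """Per-sample -log pi(a|s) for actions a ~ beta drawn from the replay buffer."""
    dist = Normal(policy_mean, policy_std)
    return -dist.log_prob(planner_actions).sum(dim=-1)   # (batch,)
```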
Key Design 2: Soft Q-Weighted Mechanism¶
- Function: Weights alignment samples from the replay buffer by their Q-values, prioritizing alignment to high-return actions.
- Mechanism: Defines softmax weights over the Q-values of the sampled planner actions within a batch, \(w_i = \exp(Q_i/\tau) / \sum_j \exp(Q_j/\tau)\), and computes the weighted alignment loss \(\mathcal{L}_{\text{align}} = \sum_i w_i [-\log \pi(a_i|s_i)]\) (see the sketch below).
- Design Motivation: Historical actions in the replay buffer vary considerably in quality, as the planner performs poorly in early training; uniform alignment would introduce low-quality behaviors. The Q-weighted mechanism draws inspiration from MPPI's own value-guided selection principle, directing the policy to focus on high-value experiences and accelerating learning.
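A minimal sketch of the weighting, reusing the per-sample NLL from the previous sketch; `tau` and the tensor names are illustrative assumptions.

```python
# Minimal sketch: softmax-over-Q weights within a batch, applied to the
# per-sample alignment NLL (names and shapes are illustrative assumptions).
import torch

def q_weighted_alignment_loss(nll, q_values, tau=1.0):
    """nll: (batch,) of -log pi(a_i|s_i); q_values: (batch,) of Q(s_i, a_i)."""
    weights = torch.softmax(q_values / tau, dim=0)        # w_i = exp(Q_i/tau) / sum_j exp(Q_j/tau)
    return (weights * nll).sum()                          # sum_i w_i * (-log pi(a_i|s_i))
```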
Key Design 3: Bootstrap Policy Objective¶
- Function: Combines the alignment loss with the standard Q-value maximization objective.
- Mechanism: \(\mathcal{L}_{\text{policy}} = -Q(s, \pi(s)) + \lambda_{\text{align}} \cdot \mathcal{L}_{\text{align}}\); minimizing it simultaneously maximizes the policy's own Q-value and enforces behavioral alignment with the planner (see the sketch below).
- Design Motivation: Pure Q-value maximization is unstable under actor divergence, while pure imitation of the planner prevents autonomous policy improvement. The combination retains the exploratory advantages of off-policy RL while using the alignment constraint to prevent the policy from deviating from the data distribution.
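Putting the two terms together, a minimal sketch of the combined objective; `q_fn` and the default `lambda_align` (the paper's dim(A)/1000 rule, e.g. 38/1000 for a 38-dimensional action space) are assumptions for illustration.

```python
# Minimal sketch of the bootstrap policy objective: minimize -Q(s, pi(s)) plus the
# Q-weighted alignment term. All names are illustrative; lambda_align follows the
# paper's dim(A)/1000 rule (e.g. 38/1000 = 0.038 for the Dog tasks).
import torch

def bootstrap_policy_loss(q_fn, latent_states, policy_actions, align_nll, q_values,
                          lambda_align=0.038, tau=1.0):
    q_pi = q_fn(latent_states, policy_actions).mean()     # Q-value of the policy's own actions
    weights = torch.softmax(q_values / tau, dim=0)
    align = (weights * align_nll).sum()                    # Q-weighted forward-KL surrogate
    return -q_pi + lambda_align * align                    # descent maximizes Q, aligns pi to beta
```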
Key Design 4: Indirect Improvement of the World Model¶
- Function: Bootstrap alignment indirectly improves world model quality by reducing the distributional mismatch between the policy and the data.
- Mechanism: Reduced mismatch yields more accurate value learning and more informative TD-loss gradients; because the encoder, dynamics model, and reward predictor are trained jointly with the value function, these better gradients also improve the world model itself.
- Design Motivation: World model quality directly affects planner performance, creating a virtuous cycle: better alignment → more accurate values → better world model → better planning.
Loss & Training¶
- World Model Loss: \(\mathcal{L}_{\text{model}} = \sum_{t=0}^{H} \gamma^t [\|f(z_t,a_t) - \text{sg}(h(s'_t))\|^2 + \text{CE}(R_t, r_t) + \text{CE}(Q_t, q_t)]\), where \(\text{sg}\) is the stop-gradient operator and \(H\) the rollout horizon; this jointly trains the encoder, dynamics model, reward predictor, and value function (see the sketch after this list).
- Policy Loss: Q-value maximization + \(\lambda_{\text{align}}\) × Q-weighted forward KL alignment loss.
- Training Procedure: A warmup phase collects data with random actions to pretrain the world model; the main loop then alternates between collecting data with the planner, sampling from the replay buffer to update the world model, and updating the policy.
- Hyperparameters: Alignment coefficient \(\lambda_{\text{align}} = \dim(A)/1000\) (DMC) or \(\dim(A)/50\) (Humanoid-Bench); temperature \(\tau = 1\); results are robust to hyperparameter variation within a 10× range.
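A simplified sketch of the world-model objective under the assumptions noted in the comments: MSE stands in for TD-MPC2's discrete-regression cross-entropy on the reward and value terms, and the module names, shapes, and precomputed TD targets are illustrative.

```python
# Simplified sketch of the TD-MPC2-style world-model loss over a horizon-H latent
# rollout: latent consistency + reward + value terms, discounted by gamma^t.
# MSE is used here in place of the paper's discrete-regression cross-entropy.
import torch
import torch.nn.functional as F

def world_model_loss(encoder, dynamics, reward_head, q_head,
                     obs_seq, act_seq, rew_seq, td_targets, gamma=0.99):
    """obs_seq: (H+1, B, obs_dim); act_seq: (H, B, act_dim); rew_seq, td_targets: (H, B, 1)."""
    H = act_seq.shape[0]
    z = encoder(obs_seq[0])                                # z_0 = h(s_0)
    loss = 0.0
    for t in range(H):
        reward_loss = F.mse_loss(reward_head(z, act_seq[t]), rew_seq[t])
        value_loss = F.mse_loss(q_head(z, act_seq[t]), td_targets[t])
        z = dynamics(z, act_seq[t])                        # f(z_t, a_t)
        with torch.no_grad():
            z_next = encoder(obs_seq[t + 1])               # sg(h(s'_t))
        consistency = F.mse_loss(z, z_next)
        loss = loss + gamma ** t * (consistency + reward_loss + value_loss)
    return loss
```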
Key Experimental Results¶
Table 1: DMC Suite High-Dimensional Locomotion Tasks — Total Average Return (3 seeds)¶
| Task | SAC | DreamerV3 (10M) | TD-MPC2 | BMPC | BOOM |
|---|---|---|---|---|---|
| Humanoid-stand | 9.0 | 717.0 | 913.3 | 947.9 | 962.1 |
| Humanoid-walk | 173.8 | 755.6 | 884.8 | 935.1 | 936.1 |
| Humanoid-run | 1.6 | 353.5 | 316.2 | 531.2 | 582.8 |
| Dog-stand | 197.6 | 35.4 | 936.4 | 971.3 | 986.8 |
| Dog-walk | 24.7 | 9.1 | 885.0 | 942.9 | 965.4 |
| Dog-trot | 67.1 | 8.4 | 884.4 | 911.3 | 947.9 |
| Dog-run | 16.5 | 4.3 | 427.0 | 673.7 | 820.7 |
| DMC Average | 58.8 | 269.0 | 745.6 | 835.8 | 877.7 (+5.0%) |
Table 2: Humanoid-Bench High-Dimensional Tasks — Total Average Return (3 seeds)¶
| Task | SAC | DreamerV3 (10M) | TD-MPC2 | BMPC | BOOM |
|---|---|---|---|---|---|
| H1hand-stand | 74.1 | 845.4 | 728.7 | 780.0 | 926.1 |
| H1hand-walk | 27.0 | 744.0 | 644.2 | 672.6 | 935.4 |
| H1hand-run | 14.1 | 622.4 | 66.1 | 236.0 | 682.2 |
| H1hand-sit | 268.4 | 699.1 | 693.7 | 688.2 | 918.1 |
| H1hand-slide | 19.0 | 367.6 | 141.3 | 440.1 | 926.1 (+110.5%) |
| H1hand-pole | 122.5 | 577.4 | 207.5 | 739.9 | 930.5 |
| H1hand-hurdle | 12.9 | 135.7 | 59.0 | 197.1 | 435.6 (+121.0%) |
| H-Bench Average | 68.5 | 555.6 | 338.8 | 511.7 | 820.6 (+47.7%) |
Ablation Study Key Findings (Dog-run: 223-dim observations, 38-dim actions)¶
- Forward KL substantially outperforms reverse KL (the latter requires likelihood approximation, and inaccurate approximation is harmful).
- Q-weighting accelerates training and improves final performance relative to uniform weighting.
- Performance remains stable across a 0.1×–10× range of the alignment coefficient, demonstrating strong robustness.
Highlights & Insights¶
- Clear Problem Formulation: The paper precisely characterizes the two consequences of actor divergence in the online planning + off-policy RL paradigm—value bias and unreliable policy updates—and provides a theoretical analysis.
- Judicious Choice of Forward KL: Given that the planner distribution is non-parametric, the forward KL is essentially the only natural choice, and it yields a concise formulation that requires no additional approximations.
- Complete Theoretical Guarantees: Theorem 1 proves that alignment controls an upper bound on the return gap; Theorem 2 proves that alignment controls an upper bound on Q-value bias; theoretical results are consistent with empirical findings.
- Significant Performance Gains: Particularly large improvements on the most challenging tasks—Dog-run (+21.8%) and H1hand-hurdle (+121.0%)—demonstrate the method's superiority in high-dimensional, complex settings.
- Implementation Simplicity: The core modification amounts to adding a single Q-weighted log-likelihood term to the policy loss, incurring virtually no additional computational overhead.
Limitations & Future Work¶
- Validation Limited to Continuous Control: All experiments are confined to locomotion tasks in DMC and Humanoid-Bench; the method has not been evaluated on manipulation, navigation, or other task types, nor in discrete action spaces.
- Dependence on World Model Quality: BOOM builds on a TD-MPC2-style world model; inaccurate models degrade both planner quality and alignment effectiveness.
- Mode-Covering Tendency of Forward KL: Forward KL encourages the policy to cover all modes of the planner distribution, potentially leading to an overly diffuse policy that may be suboptimal when the planner distribution is highly multimodal.
- Degradation of Replay Buffer Action Quality: Although the Q-weighted mechanism mitigates this issue, early low-quality data continues to occupy buffer space as training progresses.
- Absence of Real-Robot Experiments: Validation is conducted exclusively in simulation; the sim-to-real gap is not discussed.
Related Work & Insights¶
- TD-MPC2: The direct baseline for BOOM, sharing the world model architecture but not addressing actor divergence.
- BMPC: Also attempts to align the policy with the planner, but relies on simple behavioral cloning with action relabeling and lacks value-guided selection.
- DreamerV3: An imagination-driven MBRL approach that does not employ online planning; its sample efficiency is limited on high-dimensional tasks.
- Offline RL: Conservative estimation and implicit policy approaches (e.g., CQL, IQL) for addressing distributional shift share conceptual similarities with the alignment strategy proposed in this work.
- Broader Inspiration: The bootstrap loop paradigm can be generalized to other settings where a strong executor and a weak learner are mismatched, such as search-guided LLM training.
Rating¶
- Novelty: ⭐⭐⭐⭐ — While actor divergence is a known issue, the combination of likelihood-free forward KL and Q-weighted alignment is concise and effective.
- Experimental Thoroughness: ⭐⭐⭐⭐ — Coverage across 14 high-dimensional tasks with complete ablations, though real-world scenarios and broader task types are absent.
- Writing Quality: ⭐⭐⭐⭐⭐ — The motivation–method–theory–experiment logical chain is clear and coherent, with rigorous mathematical derivations.
- Value: ⭐⭐⭐⭐ — Provides a practical solution to a core pain point of the planning + off-policy RL paradigm; open-source code enhances practical impact.