Optimizing the Unknown: Black Box Bayesian Optimization with Energy-Based Model and Reinforcement Learning

Conference: NeurIPS 2025
arXiv: 2510.19530
Code: Unavailable (no explicit link provided)
Area: Reinforcement Learning
Keywords: Bayesian Optimization, Energy-Based Model, PPO, Black-Box Optimization, Multi-Step Look-Ahead

TL;DR

This paper proposes REBMBO, a framework that unifies Gaussian Processes (local modeling), Energy-Based Models (EBM, global exploration), and PPO-based reinforcement learning (multi-step look-ahead) into a closed-loop Bayesian optimization system, achieving significant improvements over conventional BO methods on high-dimensional and multi-modal black-box optimization tasks.

Background & Motivation

Background: Bayesian Optimization (BO) is the dominant approach for optimizing expensive black-box functions, centered on a GP surrogate model paired with an acquisition function (e.g., UCB, EI). Notable extensions include TuRBO (local trust regions), BALLET-ICI (alternating global/local GPs), and EARL-BO (RL-assisted multi-step BO).

Limitations of Prior Work: Standard BO suffers from severe one-step myopia—optimizing only the expected gain of the current step while neglecting long-term exploration strategies. In high-dimensional or multi-modal settings, this leads to rapid convergence to local optima.

Key Challenge: GPs excel at local uncertainty modeling but lack global structural information; multi-step look-ahead methods (e.g., 2-step EI, Knowledge Gradient) are computationally expensive yet still limited in horizon; RL-integrated approaches (e.g., EARL-BO) rely on local posteriors and lack global signals.

Goal: Simultaneously address insufficient global exploration and one-step myopia by integrating global structural information and multi-step planning capability into the BO framework.

Key Insight: Introduce an EBM that learns a global energy landscape to supplement the GP's local modeling, and formulate the sequence of BO queries as an MDP solved via PPO for adaptive multi-step look-ahead.

Core Idea: The EBM provides information on "which regions are globally promising," the GP provides "how certain local estimates are," and PPO performs multi-step planning to leverage both.

Method

Overall Architecture

REBMBO consists of three tightly coupled modules (Figure 1):

- Module A: GP surrogate model, providing the local posterior mean $\mu_{f,t}(\mathbf{x})$ and predictive uncertainty $\sigma_{f,t}(\mathbf{x})$
- Module B: EBM global energy landscape $E_\theta(\mathbf{x})$, trained via short-run MCMC
- Module C: PPO multi-step planning policy $\pi_{\phi_{ppo}}(\mathbf{a}_t | \mathbf{s}_t)$

All three modules are updated synchronously after each evaluation, forming an adaptive closed loop.

Key Designs

Module 1: GP Variants (Module A)

- REBMBO-C: Classic GP with exact inference, $\mathcal{O}(n^3)$
- REBMBO-S: Sparse GP with $m \ll n$ inducing points, $\mathcal{O}(nm^2)$
- REBMBO-D: Deep kernel GP, with inputs mapped to a latent space via $h(\mathbf{x}) = W_2 \psi(W_1 \mathbf{x})$
- Composite kernel: $k_f = \sigma_f^2[w_{\text{RBF}} k_{\text{RBF}} + w_{\text{Matérn}} k_{\text{Matérn}}]$, with weights learned automatically via marginal likelihood
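
To make the composite kernel concrete, here is a minimal pure-Python sketch of $k_f$ for a single pair of points. The Matérn smoothness ($\nu = 5/2$), the shared length scale, and the fixed weights are assumptions chosen for illustration; the summary only states that the weights are learned via marginal likelihood, and the function names are hypothetical.

```python
import math


def rbf_kernel(x, y, lengthscale=1.0):
    """Squared-exponential (RBF) kernel between two points given as sequences."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-0.5 * sq_dist / lengthscale ** 2)


def matern52_kernel(x, y, lengthscale=1.0):
    """Matern kernel with smoothness nu = 5/2 (an assumed choice, not from the paper)."""
    r = math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))
    z = math.sqrt(5.0) * r / lengthscale
    return (1.0 + z + z * z / 3.0) * math.exp(-z)


def composite_kernel(x, y, sigma_f=1.0, w_rbf=0.5, w_matern=0.5, lengthscale=1.0):
    """k_f = sigma_f^2 * (w_RBF * k_RBF + w_Matern * k_Matern).

    The weights are fixed here for illustration; in REBMBO they are
    described as being learned through the GP marginal likelihood.
    """
    return sigma_f ** 2 * (
        w_rbf * rbf_kernel(x, y, lengthscale)
        + w_matern * matern52_kernel(x, y, lengthscale)
    )


if __name__ == "__main__":
    # Similarity between two 2-D points; identical points give sigma_f^2 * (w_rbf + w_matern).
    print(composite_kernel([0.0, 0.0], [1.0, 1.0]))
```

Mixing a smooth RBF term with a rougher Matérn term lets a single surrogate cover both smooth and less regular regions of the objective, which is presumably why the weights are left to the marginal likelihood rather than fixed by hand.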

Module 2: EBM Global Exploration (Module B)

- Function: Learns a global energy landscape where low-energy regions correspond to high-probability/high-promise areas
- Training: MLE via short-run MCMC
    - Positive phase: Lower energy $E_\theta(\mathbf{x}_i)$ at observed data points
    - Negative phase: Generate negative samples via Langevin dynamics and raise their energy
- EBM-UCB acquisition function: $\alpha_{\text{EBM-UCB}}(\mathbf{x}) = \mu_{f,t}(\mathbf{x}) + \beta \sigma_{f,t}(\mathbf{x}) - \gamma E_\theta(\mathbf{x})$
- Design Motivation: The $-\gamma E_\theta(\mathbf{x})$ term biases search toward globally promising regions identified by the EBM, preventing wasted evaluations in uncertain but unpromising areas
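
The effect of the energy term in EBM-UCB can be seen on a toy example. The sketch below scores three hypothetical candidates whose mean, uncertainty, and energy values are made up for illustration, as are the settings $\beta = 2$ and $\gamma = 0.5$. Setting $\gamma = 0$ gives plain UCB for comparison.

```python
def ebm_ucb(mu, sigma, energy, beta=2.0, gamma=0.5):
    """EBM-UCB score: mu + beta * sigma - gamma * E_theta."""
    return mu + beta * sigma - gamma * energy


# Hypothetical candidates: (GP mean, GP uncertainty, EBM energy).
candidates = {
    "A": (0.80, 0.10, 0.2),
    "B": (0.60, 0.30, 1.5),  # very uncertain, but high energy (unpromising)
    "C": (0.70, 0.20, 0.1),
}

ucb_scores = {k: ebm_ucb(*v, gamma=0.0) for k, v in candidates.items()}
ebm_ucb_scores = {k: ebm_ucb(*v) for k, v in candidates.items()}

best_ucb = max(ucb_scores, key=ucb_scores.get)
best_ebm_ucb = max(ebm_ucb_scores, key=ebm_ucb_scores.get)

if __name__ == "__main__":
    print("plain UCB picks:", best_ucb)    # B: the uncertainty bonus dominates
    print("EBM-UCB picks:", best_ebm_ucb)  # C: B is penalized for its high energy
```

In this toy setting plain UCB spends the evaluation on the uncertain but high-energy candidate B, while the energy penalty redirects the query to C. This is the behavior the design motivation describes.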

Module 3: PPO Multi-Step Planning (Module C)

- MDP state: $\mathbf{s}_t = (\mu_{f,t}(\mathbf{x}), \sigma_{f,t}(\mathbf{x}), E_\theta(\mathbf{x}))$
- MDP action: Next query point $\mathbf{a}_t \in \mathcal{X}$
- Reward: $r_t(\mathbf{s}_t, \mathbf{a}_t) = f(\mathbf{a}_t) - \lambda E_\theta(\mathbf{a}_t)$
- PPO objective: $\mathcal{L}^{\text{CLIP}} = \mathbb{E}_t[\min(r_t(\phi_{ppo}) \hat{A}_t, \text{clip}(r_t(\phi_{ppo}), 1-\varepsilon, 1+\varepsilon)\hat{A}_t)]$, where $r_t(\phi_{ppo})$ is the probability ratio between the new and old policies (distinct from the reward $r_t(\mathbf{s}_t, \mathbf{a}_t)$ above) and $\hat{A}_t$ is the advantage estimate
- Design Motivation: Generalizes BO from a static single-step selection rule to multi-step planning over an MDP
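
The reward and the clipped surrogate can be written out in a few lines. This is a minimal sketch, assuming the reward has the form $f(\mathbf{a}_t) - \lambda E_\theta(\mathbf{a}_t)$ and that probability ratios and advantage estimates are already available as plain lists. The function names and default values are illustrative, not taken from the paper.

```python
def energy_shaped_reward(f_value, energy, lam=0.3):
    """Reward for querying a_t: observed objective minus lambda times EBM energy."""
    return f_value - lam * energy


def ppo_clip_objective(ratios, advantages, eps=0.2):
    """Batch average of min(ratio * A, clip(ratio, 1 - eps, 1 + eps) * A).

    `ratios` are new-policy / old-policy probabilities of the taken actions;
    `advantages` are the matching advantage estimates.
    """
    terms = []
    for ratio, adv in zip(ratios, advantages):
        clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)
        terms.append(min(ratio * adv, clipped * adv))
    return sum(terms) / len(terms)


if __name__ == "__main__":
    # A large ratio with a positive advantage is capped at (1 + eps) * A.
    print(ppo_clip_objective([1.5], [1.0]))
    # A small ratio with a negative advantage is floored at (1 - eps) * A.
    print(ppo_clip_objective([0.5], [-1.0]))
    print(energy_shaped_reward(f_value=1.0, energy=2.0))
```

The clipping removes any incentive to move the ratio outside $[1-\varepsilon, 1+\varepsilon]$ in the direction that would increase the objective, which keeps each policy update close to the policy that collected the data.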

Evaluation Metric: Landscape-Aware Regret (LAR)

$$R_t^{LAR} = [f(\mathbf{x}^*) - f(\mathbf{x}_t)] + \alpha[E_\theta(\mathbf{x}^*) - E_\theta(\mathbf{x}_t)]$$

Setting $\alpha = 0$ recovers standard regret.
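
As a small worked example, the LAR formula can be evaluated directly. The function below transcribes the expression term by term; the numeric inputs are made up for illustration.

```python
def landscape_aware_regret(f_opt, f_x, e_opt, e_x, alpha=0.1):
    """R_t^LAR = [f(x*) - f(x_t)] + alpha * [E(x*) - E(x_t)]."""
    return (f_opt - f_x) + alpha * (e_opt - e_x)


if __name__ == "__main__":
    # Hypothetical values: objective and energy at the optimum and at the query.
    f_opt, f_x = 1.0, 0.7
    e_opt, e_x = -2.0, -1.5
    print(landscape_aware_regret(f_opt, f_x, e_opt, e_x, alpha=0.1))
    # With alpha = 0 the energy term vanishes and only f(x*) - f(x_t) remains.
    print(landscape_aware_regret(f_opt, f_x, e_opt, e_x, alpha=0.0))
```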

Loss & Training

- GP: Composite RBF+Matérn kernel; hyperparameters optimized via Type-II marginal likelihood
- EBM: Short-run MCMC (10–20 SGLD steps per iteration)
- PPO: 2-layer policy network with 64–256 hidden units; the clipping parameter $\varepsilon$ bounds each policy update to prevent policy collapse
- Reward weight: $\lambda$ is kept within $[0.2, 0.5]$, which serves as a safety band
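
The short-run MCMC step used to draw negative samples can be sketched with a plain Langevin update, $x_{k+1} = x_k - \tfrac{\eta}{2}\nabla E(x_k) + \sqrt{\eta}\,\epsilon_k$. The example below is only a sketch: it substitutes a hand-written 1-D double-well energy for the learned $E_\theta$, and the step size, step count, and `noise_scale` switch are assumptions for illustration.

```python
import math
import random


def energy(x):
    """Toy 1-D double-well energy standing in for the learned E_theta (hypothetical)."""
    return (x * x - 1.0) ** 2


def energy_grad(x):
    """Analytic gradient of the toy energy."""
    return 4.0 * x * (x * x - 1.0)


def short_run_langevin(x0, n_steps=20, step_size=0.01, noise_scale=1.0, rng=None):
    """Run a short Langevin chain from x0 and return the final sample.

    Each step moves downhill on the energy and adds Gaussian noise.
    noise_scale=0.0 reduces the chain to plain gradient descent.
    """
    rng = rng or random.Random(0)  # fixed seed so the sketch is reproducible
    x = x0
    for _ in range(n_steps):
        noise = rng.gauss(0.0, 1.0)
        x = x - 0.5 * step_size * energy_grad(x) + noise_scale * math.sqrt(step_size) * noise
    return x


if __name__ == "__main__":
    start = 2.0
    sample = short_run_langevin(start)
    print("start energy:", energy(start))
    print("sample:", sample, "energy:", energy(sample))
```

In the negative phase, samples produced this way would have their energy raised, while the positive phase lowers the energy at observed points. Keeping the chain to 10–20 steps trades sampling accuracy for a cheap per-iteration update.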

Key Experimental Results

Main Results (Table 1, Synthetic Benchmarks)

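| Model | Branin 2D (T=50) | Ackley 5D (T=50) | Rosenbrock 8D (T=50) | HDBO 200D (T=100) | Mean |
| --- | --- | --- | --- | --- | --- |
| BALLET-ICI | 90.44 | 87.78 | 90.76 | 85.85 | 83.80 |
| EARL-BO | 88.76 | 87.22 | 88.47 | 83.74 | 81.57 |
| TuRBO | 88.63 | 83.79 | 85.74 | 80.69 | 78.56 |
| KG | 91.53 | 90.23 | 90.29 | 85.17 | 87.52 |
| REBMBO-C | 97.37 | 94.46 | 96.77 | 90.95 | 89.40 |
| REBMBO-D | 95.21 | 91.53 | 96.98 | 94.42 | 89.17 |

Note: the Mean column does not equal the average of the four benchmark columns shown here, so it presumably averages over all six benchmarks in the paper's table.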
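
The margin of the best REBMBO variant over the strongest baseline on each benchmark can be recomputed from the scores in the table above. This short script uses only the four columns quoted in this summary.

```python
benchmarks = ["Branin 2D", "Ackley 5D", "Rosenbrock 8D", "HDBO 200D"]

# Scores copied from the table above, one list entry per benchmark column.
baselines = {
    "BALLET-ICI": [90.44, 87.78, 90.76, 85.85],
    "EARL-BO": [88.76, 87.22, 88.47, 83.74],
    "TuRBO": [88.63, 83.79, 85.74, 80.69],
    "KG": [91.53, 90.23, 90.29, 85.17],
}
rebmbo = {
    "REBMBO-C": [97.37, 94.46, 96.77, 90.95],
    "REBMBO-D": [95.21, 91.53, 96.98, 94.42],
}


def best_entry(rows, col):
    """Return (name, score) of the highest-scoring row in the given column."""
    name = max(rows, key=lambda n: rows[n][col])
    return name, rows[name][col]


margins = {}
strongest_baseline = {}
for col, bench in enumerate(benchmarks):
    base_name, base_score = best_entry(baselines, col)
    ours_name, ours_score = best_entry(rebmbo, col)
    margins[bench] = round(ours_score - base_score, 2)
    strongest_baseline[bench] = base_name
    print(f"{bench}: {ours_name} {ours_score} vs. {base_name} {base_score} "
          f"(margin {margins[bench]})")
```

On these four columns the largest margin is on HDBO 200D, where the strongest baseline is BALLET-ICI (85.85) rather than KG (85.17).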

Ablation Study (Appendix)

| Ablated Component | Effect |
| --- | --- |
| Remove EBM | Performance degrades; reduces to standard GP-BO |
| Remove PPO multi-step | Degrades to single-step acquisition; significant performance drop in high dimensions |
| Remove short-run MCMC | EBM quality degrades |
| Composite vs. single kernel | RBF+Matérn composite is optimal across all tasks |
| $\lambda \in [0.2, 0.5]$ | Performance is stable; too large or too small $\lambda$ causes degradation |

Key Findings

- REBMBO outperforms all baselines on all 6 benchmarks, with the largest advantage on HDBO 200D (REBMBO-D: 94.42 vs. 85.85 for BALLET-ICI, the strongest baseline on that task in Table 1)
- REBMBO-D performs best on high-dimensional tasks (deep kernel captures complex latent structure)
- REBMBO-C performs best on low-dimensional tasks (benefits of exact GP inference)
- On the Nanophotonic 3D real-world task, convergence is approximately 30% faster
- Computational overhead is only a small constant factor compared to TuRBO

Highlights & Insights

- Three-Module Synergy: GP, EBM, and PPO are not simply stacked but tightly coupled: all three are updated synchronously after each evaluation, with the RL policy co-evolving with the latest GP posterior and EBM energy landscape
- EBM-UCB: Directly embedding EBM energy into the UCB acquisition function is elegant and effective; the $-\gamma E_\theta(\mathbf{x})$ term provides a principled global exploration bias
- LAR Metric: Incorporating global exploration quality into regret evaluation yields a more comprehensive measure than standard regret ($\alpha=0$ recovers the standard version, ensuring backward compatibility)
- Three GP Variants: Provide flexible choices for problems of varying scale and complexity

Limitations & Future Work

- EBM training may introduce unavoidable errors; the learned energy landscape may be inaccurate when the number of evaluations is very small
- PPO multi-step planning is sensitive to RL hyperparameters; theoretical convergence rate analysis is left for future work
- Scale mismatch between the EBM and $f$ may degrade performance (though experiments suggest normalization and adaptive $\lambda$ can mitigate this)
- Each iteration requires EBM training and PPO updates, incurring higher computational cost than conventional BO (negligible when function evaluations dominate)
- The sub-linear LAR guarantee in the theoretical section (Appendix E) relies on "mild alignment and regularity assumptions" whose specific conditions are not clearly articulated

Related Work & Insights

- vs. EARL-BO: EARL-BO integrates RL into BO but lacks a global energy signal; REBMBO adds EBM-based global guidance
- vs. GLASSES: GLASSES approximates the multi-step loss via forward simulation; REBMBO directly learns a multi-step policy via PPO
- vs. TuRBO: TuRBO excels at local search but lacks long-range jumps; REBMBO achieves global jumps through the EBM
- Insights: The idea of using EBMs as global structural priors in BO is transferable to other sequential decision-making scenarios

Rating

⭐⭐⭐⭐ (4/5)

The method is strongly innovative in design—the three-module synergy of GP, EBM, and PPO represents a new paradigm in BO. The experiments are comprehensive and convincing (6 benchmarks + ablations + real-world task). Weaknesses include theoretical guarantees that rely on strong assumptions, questionable EBM training reliability in low-data regimes, and an overall framework of considerable complexity.