Optimizing the Unknown: Black Box Bayesian Optimization with Energy-Based Model and Reinforcement Learning
Conference: NeurIPS 2025
arXiv: 2510.19530
Code: Unavailable (no explicit link provided)
Area: Reinforcement Learning
Keywords: Bayesian Optimization, Energy-Based Model, PPO, Black-Box Optimization, Multi-Step Look-Ahead
TL;DR
This paper proposes REBMBO, a framework that unifies Gaussian Processes (local modeling), Energy-Based Models (EBM, global exploration), and PPO-based reinforcement learning (multi-step look-ahead) into a closed-loop Bayesian optimization system, achieving significant improvements over conventional BO methods on high-dimensional and multi-modal black-box optimization tasks.
Background & Motivation
Background: Bayesian Optimization (BO) is the dominant approach for optimizing expensive black-box functions, centered on a GP surrogate model paired with an acquisition function (e.g., UCB, EI). Notable extensions include TuRBO (local trust regions), BALLET-ICI (alternating global/local GPs), and EARL-BO (RL-assisted multi-step BO).
Limitations of Prior Work: Standard BO suffers from severe one-step myopia: it optimizes only the expected gain of the current step and neglects long-term exploration strategies. In high-dimensional or multi-modal settings, this leads to rapid convergence to local optima.
Key Challenge: GPs excel at local uncertainty modeling but lack global structural information; multi-step look-ahead methods (e.g., 2-step EI, Knowledge Gradient) are computationally expensive yet still limited in horizon; RL-integrated approaches (e.g., EARL-BO) rely on local posteriors and lack global signals.
Goal: Simultaneously address insufficient global exploration and one-step myopia by integrating global structural information and multi-step planning capability into the BO framework.
Key Insight: Introduce an EBM to learn a global energy landscape supplementing GP's local modeling, and formulate each BO step as an MDP solved via PPO for adaptive multi-step look-ahead.
Core Idea: The EBM provides information on "which regions are globally promising," the GP provides "how certain local estimates are," and PPO performs multi-step planning to leverage both.
Method
Overall Architecture
REBMBO consists of three tightly coupled modules (Figure 1):
- Module A: GP surrogate model (providing the local posterior mean \(\mu_{f,t}(\mathbf{x})\) and predictive uncertainty \(\sigma_{f,t}(\mathbf{x})\))
- Module B: EBM global energy landscape \(E_\theta(\mathbf{x})\) (trained via short-run MCMC)
- Module C: PPO multi-step planning policy \(\pi_{\phi_{ppo}}(\mathbf{a}_t | \mathbf{s}_t)\)
All three modules are updated synchronously after each evaluation, forming an adaptive closed loop.
Key Designs
Module 1: GP Variants (Module A)
- REBMBO-C: Classic GP, \(\mathcal{O}(n^3)\) exact inference
- REBMBO-S: Sparse GP, \(m \ll n\) inducing points, \(\mathcal{O}(nm^2)\)
- REBMBO-D: Deep kernel GP; inputs are mapped to a latent space via \(h(\mathbf{x}) = W_2 \psi(W_1 \mathbf{x})\)
- Composite kernel: \(k_f = \sigma_f^2[w_{\text{RBF}} k_{\text{RBF}} + w_{\text{Matérn}} k_{\text{Matérn}}]\), with weights learned automatically via marginal likelihood
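The composite kernel above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the Matérn smoothness (\(\nu = 5/2\)), the shared lengthscale, and the fixed weights are assumptions made for illustration, whereas the paper learns the weights via marginal likelihood. All function names are hypothetical.

```python
import numpy as np

def pairwise_sq_dists(X1, X2):
    """Squared Euclidean distances between rows of X1 (n, d) and X2 (m, d)."""
    d2 = (np.sum(X1**2, axis=1)[:, None]
          + np.sum(X2**2, axis=1)[None, :]
          - 2.0 * X1 @ X2.T)
    return np.maximum(d2, 0.0)  # clip tiny negatives caused by round-off

def rbf_kernel(X1, X2, lengthscale=1.0):
    """Unit-variance RBF (squared-exponential) kernel."""
    return np.exp(-0.5 * pairwise_sq_dists(X1, X2) / lengthscale**2)

def matern52_kernel(X1, X2, lengthscale=1.0):
    """Unit-variance Matern kernel with nu = 5/2 (assumed; nu is not stated in the summary)."""
    s = np.sqrt(5.0 * pairwise_sq_dists(X1, X2)) / lengthscale
    return (1.0 + s + s**2 / 3.0) * np.exp(-s)

def composite_kernel(X1, X2, sigma_f=1.0, w_rbf=0.5, w_matern=0.5, lengthscale=1.0):
    """k_f = sigma_f^2 * (w_rbf * k_RBF + w_matern * k_Matern), with fixed weights here."""
    return sigma_f**2 * (w_rbf * rbf_kernel(X1, X2, lengthscale)
                         + w_matern * matern52_kernel(X1, X2, lengthscale))

# Usage: Gram matrix for five random 3-D inputs.
X = np.random.default_rng(0).normal(size=(5, 3))
K = composite_kernel(X, X, sigma_f=2.0, w_rbf=0.3, w_matern=0.7)
```

Because both base kernels equal 1 at zero distance, the diagonal of `K` is \(\sigma_f^2 (w_{\text{RBF}} + w_{\text{Matérn}})\), which is a quick sanity check when the weights are being fitted.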
Module 2: EBM Global Exploration (Module B)
- Function: Learns a global energy landscape where low-energy regions correspond to high-probability/high-promise areas
- Training: MLE via short-run MCMC
  - Positive phase: Lower energy \(E_\theta(\mathbf{x}_i)\) at observed data points
  - Negative phase: Generate negative samples via Langevin dynamics and raise their energy
- EBM-UCB acquisition function: \(\alpha_{\text{EBM-UCB}}(\mathbf{x}) = \mu_{f,t}(\mathbf{x}) + \beta \sigma_{f,t}(\mathbf{x}) - \gamma E_\theta(\mathbf{x})\)
- Design Motivation: The \(-\gamma E_\theta(\mathbf{x})\) term biases search toward globally promising regions identified by the EBM, preventing wasted evaluations in uncertain but unpromising areas
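To make the effect of the energy term concrete, the following sketch scores three candidate points with the EBM-UCB formula. The candidate values and the defaults for `beta` and `gamma` are made up for illustration and do not come from the paper.

```python
import numpy as np

def ebm_ucb(mu, sigma, energy, beta=2.0, gamma=0.5):
    """EBM-UCB score per candidate: mu + beta * sigma - gamma * energy."""
    mu, sigma, energy = (np.asarray(a, dtype=float) for a in (mu, sigma, energy))
    return mu + beta * sigma - gamma * energy

# Toy GP posterior and EBM energies at three candidate points (illustrative numbers).
mu = np.array([0.2, 0.5, 0.5])
sigma = np.array([0.3, 0.1, 0.1])
energy = np.array([0.0, 2.0, -1.0])

scores = ebm_ucb(mu, sigma, energy)             # with the energy term
plain_ucb = ebm_ucb(mu, sigma, energy, gamma=0.0)  # gamma = 0 gives ordinary UCB

best = int(np.argmax(scores))
best_plain = int(np.argmax(plain_ucb))
```

In this toy setting plain UCB prefers the most uncertain candidate (index 0), while the energy term shifts the choice to the low-energy candidate (index 2), which is the bias described in the design motivation.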
Module 3: PPO Multi-Step Planning (Module C)
- MDP state: \(\mathbf{s}_t = (\mu_{f,t}(\mathbf{x}), \sigma_{f,t}(\mathbf{x}), E_\theta(\mathbf{x}))\)
- MDP action: Next query point \(\mathbf{a}_t \in \mathcal{X}\)
- Reward: \(r_t(\mathbf{s}_t, \mathbf{a}_t) = f(\mathbf{a}_t) - \lambda E_\theta(\mathbf{a}_t)\)
- PPO objective: \(\mathcal{L}^{\text{CLIP}} = \mathbb{E}_t[\min(\rho_t \hat{A}_t, \text{clip}(\rho_t, 1-\varepsilon, 1+\varepsilon)\hat{A}_t)]\), where \(\rho_t\) is the probability ratio between the new and old policies (written \(r_t\) in standard PPO notation, and distinct from the reward above) and \(\hat{A}_t\) is the advantage estimate
- Design Motivation: Generalizes BO from a static single-step selection rule to multi-step planning over an MDP
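The reward and the clipped surrogate can be sketched in a few lines of NumPy. This is a hedged illustration rather than the paper's training code: it assumes the reward is the observed objective value minus \(\lambda\) times the energy, takes log-probabilities and advantages as given arrays, and uses `eps=0.2` and `lam=0.3` as illustrative defaults.

```python
import numpy as np

def reward(f_value, energy, lam=0.3):
    """Per-step reward: objective value minus lam times the EBM energy (assumed form)."""
    return f_value - lam * energy

def ppo_clip_objective(logp_new, logp_old, advantages, eps=0.2):
    """Clipped PPO surrogate, averaged over a batch of (state, action) samples."""
    # Probability ratio between the new and old policies.
    ratio = np.exp(np.asarray(logp_new, dtype=float) - np.asarray(logp_old, dtype=float))
    adv = np.asarray(advantages, dtype=float)
    unclipped = ratio * adv
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * adv
    # Elementwise minimum makes the objective a pessimistic bound on the update.
    return float(np.mean(np.minimum(unclipped, clipped)))

# Two samples: ratio 1.5 with positive advantage, ratio 0.5 with negative advantage.
logp_new = np.log([1.5, 0.5])
logp_old = np.zeros(2)
objective = ppo_clip_objective(logp_new, logp_old, advantages=[1.0, -1.0])
```

With these inputs the first sample is capped at \(1 + \varepsilon\) and the second at \(1 - \varepsilon\), so neither sample can push the policy further than the clip range allows in a single update.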
Evaluation Metric: Landscape-Aware Regret (LAR)
\[R_t^{\text{LAR}} = [f(\mathbf{x}^*) - f(\mathbf{x}_t)] + \alpha[E_\theta(\mathbf{x}^*) - E_\theta(\mathbf{x}_t)]\]
Setting \(\alpha = 0\) recovers standard regret.
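As a small worked example, LAR can be computed directly from the formula as written above. The function name, the default `alpha`, and the numeric inputs are illustrative assumptions.

```python
import numpy as np

def landscape_aware_regret(f_opt, f_t, e_opt, e_t, alpha=0.1):
    """LAR per step: [f(x*) - f(x_t)] + alpha * [E(x*) - E(x_t)]."""
    f_t = np.asarray(f_t, dtype=float)
    e_t = np.asarray(e_t, dtype=float)
    return (f_opt - f_t) + alpha * (e_opt - e_t)

# Illustrative objective values and energies at two query points.
f_t = [0.2, 0.9]
e_t = [0.5, -0.8]

standard = landscape_aware_regret(1.0, f_t, -1.0, e_t, alpha=0.0)  # plain regret
lar = landscape_aware_regret(1.0, f_t, -1.0, e_t, alpha=0.5)
```

The `alpha=0.0` call reduces to \(f(\mathbf{x}^*) - f(\mathbf{x}_t)\), matching the statement that standard regret is the special case.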
Loss & Training
- GP: Composite RBF+Matérn kernel; hyperparameters optimized via Type-II marginal likelihood
- EBM: Short-run MCMC (10–20 SGLD steps per iteration)
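- PPO: 2-layer policy network with 64–256 hidden units; the clipping parameter \(\varepsilon\) bounds each policy update to prevent policy collapse
- The energy weight \(\lambda\) in the reward is kept within \([0.2, 0.5]\), which serves as a safety band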
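The short-run MCMC step used to draw negative samples for the EBM can be sketched as follows. This is a simplified illustration, not the paper's sampler: it runs plain Langevin dynamics with exact gradients on a toy quadratic energy (instead of a neural \(E_\theta\)), and the step size, chain count, and initialization range are assumptions.

```python
import numpy as np

def short_run_langevin(grad_energy, x0, n_steps=20, step_size=0.05, rng=None):
    """Run a fixed, small number of Langevin steps from x0 and return the final samples."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.array(x0, dtype=float)
    for _ in range(n_steps):
        noise = rng.standard_normal(x.shape)
        # Gradient step toward low energy plus Gaussian noise.
        x = x - step_size * grad_energy(x) + np.sqrt(2.0 * step_size) * noise
    return x

# Toy energy with a single minimum at `center` (stand-in for a learned E_theta).
center = np.array([1.0, -1.0])

def toy_energy(x):
    return 0.5 * np.sum((x - center) ** 2, axis=-1)

def toy_grad(x):
    return x - center

rng = np.random.default_rng(0)
x0 = rng.uniform(-3.0, 3.0, size=(256, 2))   # chains start from uniform noise
negatives = short_run_langevin(toy_grad, x0, n_steps=20, step_size=0.05, rng=rng)
```

After 20 steps the chains have drifted toward the low-energy region without fully mixing, which is the intent of short-run sampling: the samples are cheap and serve as the negative phase whose energy is raised during training.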
Key Experimental Results
Main Results (Table 1, Synthetic Benchmarks)
| Model | Branin 2D (T=50) | Ackley 5D (T=50) | Rosenbrock 8D (T=50) | HDBO 200D (T=100) | Mean |
|---|---|---|---|---|---|
| BALLET-ICI | 90.44 | 87.78 | 90.76 | 85.85 | 83.80 |
| EARL-BO | 88.76 | 87.22 | 88.47 | 83.74 | 81.57 |
| TuRBO | 88.63 | 83.79 | 85.74 | 80.69 | 78.56 |
| KG | 91.53 | 90.23 | 90.29 | 85.17 | 87.52 |
| REBMBO-C | 97.37 | 94.46 | 96.77 | 90.95 | 89.40 |
| REBMBO-D | 95.21 | 91.53 | 96.98 | 94.42 | 89.17 |
Ablation Study (Appendix)
| Ablated Component | Effect |
|---|---|
| Remove EBM | Performance degrades; reduces to standard GP-BO |
| Remove PPO multi-step | Degrades to single-step acquisition; significant performance drop in high dimensions |
| Remove short-run MCMC | EBM quality degrades |
| Composite vs. single kernel | RBF+Matérn composite is optimal across all tasks |
| \(\lambda \in [0.2, 0.5]\) | Performance is stable; too large or too small \(\lambda\) causes degradation |
Key Findings
- REBMBO outperforms all baselines on all 6 benchmarks; among the synthetic tasks in Table 1, the margin over the strongest baseline is largest on HDBO 200D (REBMBO-D: 94.42 vs. BALLET-ICI: 85.85)
- REBMBO-D performs best on high-dimensional tasks (deep kernel captures complex latent structure)
- REBMBO-C performs best on low-dimensional tasks (benefits of exact GP inference)
- On the Nanophotonic 3D real-world task, convergence is approximately 30% faster
- Computational overhead is only a small constant factor compared to TuRBO
Highlights & Insights
- Three-Module Synergy: GP, EBM, and PPO are not simply stacked but tightly coupled: all three are updated synchronously after each evaluation, so the RL policy co-evolves with the latest GP posterior and EBM energy landscape
- EBM-UCB: Directly embedding EBM energy into the UCB acquisition function is elegant and effective; the \(-\gamma E_\theta(\mathbf{x})\) term provides a principled global exploration bias
- LAR Metric: Incorporating global exploration quality into regret evaluation yields a more comprehensive measure than standard regret (\(\alpha=0\) recovers the standard version, ensuring backward compatibility)
- Three GP Variants: Provide flexible choices for problems of varying scale and complexity
Limitations & Future Work
- EBM training may introduce unavoidable errors; the learned energy landscape may be inaccurate when the number of evaluations is very small
- PPO multi-step planning is sensitive to RL hyperparameters; theoretical convergence rate analysis is left for future work
- Scale mismatch between the EBM and \(f\) may degrade performance (though experiments suggest normalization and adaptive \(\lambda\) can mitigate this)
- Each iteration requires EBM training and PPO updates, incurring higher computational cost than conventional BO (negligible when function evaluations dominate)
- The sub-linear LAR guarantee in the theoretical section (Appendix E) relies on "mild alignment and regularity assumptions" whose specific conditions are not clearly articulated
Related Work & Insights
- vs. EARL-BO: The latter integrates RL into BO but lacks a global energy signal; REBMBO adds EBM-based global guidance
- vs. GLASSES: The latter approximates multi-step loss via forward simulation; REBMBO directly learns a multi-step policy via PPO
- vs. TuRBO: The latter excels at local search but lacks long-range jumps; REBMBO achieves global jumps through the EBM
- Insights: The idea of using EBMs as global structural priors in BO is transferable to other sequential decision-making scenarios
Rating
⭐⭐⭐⭐ (4/5)
The design is highly innovative: the three-module synergy of GP, EBM, and PPO represents a new paradigm in BO. The experiments are comprehensive and convincing (6 benchmarks, ablations, and a real-world task). Weaknesses include theoretical guarantees that rely on strong assumptions, questionable EBM training reliability in low-data regimes, and considerable overall framework complexity.