MetaBox-v2: A Unified Benchmark Platform for Meta-Black-Box Optimization¶
Conference: NeurIPS 2025 | arXiv: 2505.17745 | Code: GitHub | Area: Reinforcement Learning | Keywords: Meta-Black-Box Optimization, benchmark platform, parallelization, RL-based optimization, generalization
TL;DR¶
MetaBox-v2 is a milestone upgrade to the Meta-Black-Box Optimization (MetaBBO) benchmark platform. It provides unified support for four learning paradigms (RL/SL/NE/ICL), reproduces 23 baseline algorithms, integrates 18 test suites (1900+ problem instances), and achieves 10–40× speedup via vectorized environments and distributed evaluation.
Background & Motivation¶
Meta-Black-Box Optimization (MetaBBO) automates algorithm design through meta-learning—after training, a meta-level policy can generate efficient algorithm configurations for unseen black-box optimization problems. Its bilevel structure consists of a lower-level BBO optimizer that optimizes sampled problems, and a meta-level policy that outputs algorithm design decisions based on optimization state features: \(\omega_i^t = \pi_\theta(s_i^t)\). The meta-training objective is to maximize cumulative performance gain \(J(\theta) = \mathbb{E}_{p \in \mathcal{P}}[\sum_{t=1}^T r_t]\).
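The bilevel structure can be made concrete with a toy sketch; `ToyProblem`, `MetaPolicy`, and every method below are hypothetical stand-ins for illustration, not MetaBox-v2's actual API:

```python
import random

class ToyProblem:
    """Hypothetical 1-D sphere problem standing in for a sampled BBO task p."""
    def __init__(self, seed=0):
        self.rng = random.Random(seed)

    def reset(self):
        self.x = self.rng.uniform(-5, 5)
        self.best = self.x ** 2
        return self.best  # state feature s_t: current best objective value

    def step(self, step_size):
        # lower-level BBO optimizer: a random perturbation scaled by omega_t
        cand = self.x + self.rng.gauss(0, step_size)
        reward = max(0.0, self.best - cand ** 2)  # performance gain r_t
        if cand ** 2 < self.best:
            self.x, self.best = cand, cand ** 2
        return self.best, reward

class MetaPolicy:
    """Hypothetical meta-level policy: maps state features to a step size."""
    def decide(self, state):
        return 0.1 + 0.1 * min(state, 5.0)  # omega_t = pi_theta(s_t)

def meta_rollout(policy, problem, horizon=50):
    """One episode; its return contributes to J(theta) = E_p[sum_t r_t]."""
    state, total = problem.reset(), 0.0
    for _ in range(horizon):
        omega = policy.decide(state)      # meta-level algorithm design decision
        state, r = problem.step(omega)    # lower-level optimization step
        total += r
    return total

total = meta_rollout(MetaPolicy(), ToyProblem(seed=1))
```

Meta-training would then adjust the policy's parameters to maximize this return in expectation over the problem distribution.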
The original MetaBox, released in 2023, was the first open-source MetaBBO benchmark, but it only supported single-objective optimization and the RL paradigm (8 baselines, 3 test sets), and has since fallen behind the field's rapid development:
Diversification of learning paradigms: Beyond MetaBBO-RL, new paradigms have emerged, including supervised learning (MetaBBO-SL), neuroevolution (MetaBBO-NE), and large-model in-context learning (MetaBBO-ICL), which are incompatible with the original RL-specific interface.
Expansion of optimization scenarios: MetaBBO has been applied to multi-objective, multimodal, large-scale global, and multi-task optimization, whereas the original MetaBox only supported single-objective problems.
Efficiency bottleneck: The bilevel nested structure makes training and evaluation extremely time-consuming. The original MetaBox relied on sequential environment evaluation, making large-scale testing infeasible.
Method¶
Overall Architecture¶
MetaBox-v2 achieves its upgrade through four synergistic enhancements: (1) a unified MetaBBO template interface; (2) efficient training/testing parallelization; (3) a rich, multi-type benchmark suite; and (4) flexible, extensible analysis and visualization interfaces. All baselines share a Basic_Agent base class with universal train and rollout interfaces, and wrapper functions convert heterogeneous learning objectives into a unified data object.
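A minimal sketch of this design, assuming hypothetical wrapper names and signal keys beyond the `Basic_Agent`, `train`, and `rollout` identifiers named above:

```python
class Basic_Agent:
    """Shared base class: every paradigm implements the same two entry points."""
    def train(self, envs):
        raise NotImplementedError

    def rollout(self, env):
        raise NotImplementedError

def to_unified_record(paradigm, **signals):
    """Hypothetical wrapper: packs paradigm-specific learning signals
    (RL rewards, SL gradients, NE fitness, ICL context) into one dict
    that the shared training loop can consume uniformly."""
    required = {"RL": "reward", "SL": "gradient", "NE": "fitness", "ICL": "context"}
    key = required[paradigm]
    if key not in signals:
        raise ValueError(f"{paradigm} agents must supply a '{key}' signal")
    return {"paradigm": paradigm, key: signals[key]}

record = to_unified_record("RL", reward=1.5)
```

The point of the wrapper is that downstream code only sees the unified record, so adding a new paradigm means adding one entry to the mapping rather than a new training loop.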
Key Designs¶
- Unified MetaBBO Interface: The core innovation replaces the original RL-specific agent class with a `Basic_Agent` base class. Wrapper functions enable compatibility across four paradigms at the unified data-object level: RL requires reward signals, SL requires gradients, NE requires fitness values, and ICL requires context. Similarly, the single-objective `Problem` class is abstracted into an inheritable `Basic_Problem` parent class, supporting multi-objective, multi-task, and other problem types via polymorphic overriding of the `eval()` interface. Based on this design, 23 MetaBBO baselines (including the original 8) and 13 traditional BBO baselines are reproduced.
- Efficient Parallelization:
- Training acceleration (vectorized environments): A batch of lower-level optimization environments is constructed simultaneously and encapsulated as a Tianshou-based vectorized environment. The meta-level agent executes batched algorithm design decisions in parallel via multiprocessing, and learning signals are aggregated into mini-batch updates. This represents the first implementation of training-phase parallelization for MetaBBO, achieving approximately 10× speedup.
- Evaluation acceleration (Ray distributed): Four parallel modes are provided, ranging from mode-1 (distributed over \(N\) problem instances) to mode-4 (fully parallel over \(N \times B \times R\)), with maximum speedup exceeding 40×. Parallelism is decomposed orthogonally along the problem dimension and the independent-run dimension.
- Enriched Benchmark Suite: Test suites are expanded from 3 to 18 (1900+ instances), covering single-objective optimization (bbob series, hpo-b, uav, protein), multi-objective optimization (ZDT, DTLZ, WFG, UF), large-scale optimization (LSGO, neuroevolution), multimodal optimization (MMO), and multi-task optimization (CEC2017MTO, WCCI2020). Deep integration with open-source ecosystems including EvoX, DEAP, and PyCMA is also provided.
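The `Basic_Problem` polymorphism described above can be sketched as follows; the subclass names and objective functions are hypothetical, chosen only to show that one `eval()` interface covers multiple problem types:

```python
class Basic_Problem:
    """Parent class: subclasses override eval() for their problem type."""
    def eval(self, x):
        raise NotImplementedError

class SphereProblem(Basic_Problem):
    # single-objective: eval() returns one scalar
    def eval(self, x):
        return sum(v * v for v in x)

class BiObjectiveProblem(Basic_Problem):
    # multi-objective: same interface, eval() returns a vector of objectives
    def eval(self, x):
        return [sum(v * v for v in x), sum((v - 1) ** 2 for v in x)]

# the benchmark loop stays problem-type agnostic:
results = [prob.eval([0.5, 0.5]) for prob in (SphereProblem(), BiObjectiveProblem())]
```

Because callers dispatch through the parent interface, adding a multi-task or multimodal suite only requires a new subclass, not changes to the evaluation pipeline.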
Evaluation Metric Innovations¶
- Metadata System: Complete procedural data, including per-generation populations, objective values, and runtime, are saved for each algorithm–test-set evaluation. A standardized performance metric is defined as: \(\text{Perf}(\mathcal{A}, \mathbb{D}) = \frac{1}{N \times K}\sum_{i=1}^N \sum_{j=1}^K \frac{Y_{i,j}^* - p_i^*}{Y_{i,j}^0 - p_i^*}\).
- Learning Efficiency Metric: Multiple snapshots are saved during training, and the ratio \(\frac{\text{Perf}(\mathcal{A}^{(g)}, \mathbb{D})}{T^{(g)}}\) (performance per unit training time) is computed at each checkpoint, enabling fair comparison of training efficiency across algorithms at different stages.
- Anti-NFL Metric: Measures generalization consistency across test sets: \(\text{Anti-NFL} = \exp\left(\frac{1}{B}\sum_{b=1}^B \frac{\text{Perf}(\mathcal{A}, \mathbb{D}_{\text{test}}^{(b)}) - \text{Perf}(\mathcal{A}, \mathbb{D}_{\text{train}})}{\text{Perf}(\mathcal{A}, \mathbb{D}_{\text{train}})}\right)\). Higher values indicate greater robustness under problem distribution shift.
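Both metrics reduce to a few lines of code under the definitions above; a minimal sketch, assuming plain nested lists for the per-run records (the actual platform stores these in its metadata system):

```python
import math

def perf(Y_star, Y_zero, p_star):
    """Perf(A, D): mean normalized remaining error over N problems x K runs.
    Y_star[i][j] = best objective found, Y_zero[i][j] = initial objective,
    p_star[i] = known optimum of problem i."""
    total, count = 0.0, 0
    for i, opt in enumerate(p_star):
        for j in range(len(Y_star[i])):
            total += (Y_star[i][j] - opt) / (Y_zero[i][j] - opt)
            count += 1
    return total / count

def anti_nfl(perf_tests, perf_train):
    """exp of the mean relative Perf change from the training set
    to each of the B test sets."""
    rel = [(pt - perf_train) / perf_train for pt in perf_tests]
    return math.exp(sum(rel) / len(rel))
```

With a single problem, a single run, and optimum 0, `perf([[1.0]], [[2.0]], [0.0])` gives 0.5; when test-set Perf equals training-set Perf, `anti_nfl` returns exp(0) = 1.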
Key Experimental Results¶
Main Results¶
In-distribution evaluation: bbob-10D test set (16 problems, 51 independent runs, 8 training problems)
| Algorithm | Type | Sharp Ridge | Different Powers | Schaffers HC | Schwefel | Avg. Rank |
|---|---|---|---|---|---|---|
| PSO | Traditional BBO | 1.91E+02 | 6.80E-01 | 5.60E+00 | 2.56E+00 | Poor |
| DE | Traditional BBO | 8.59E-01 | 8.18E-04 | 9.45E-02 | 9.16E-01 | Medium |
| DEDDQN | MetaBBO-RL | 1.84E-03 | 4.22E-09 | 1.08E-02 | 1.72E+00 | 1st |
| LDE | MetaBBO-RL | 5.96E-01 | 5.16E-05 | 2.16E-01 | 1.07E+00 | 2nd–3rd |
| SHADE | Traditional BBO | 1.44E+00 | 2.72E-04 | 2.65E-01 | 1.34E+00 | Medium |
| RNNOPT | MetaBBO-SL | 1.82E+03 | 2.30E+01 | 4.65E+01 | 9.30E+03 | Worst |
Ablation Study¶
Training acceleration comparison (vectorized environment, batch_size=16)
| Baseline | MetaBox Training Time | MetaBox-v2 Training Time | Speedup |
|---|---|---|---|
| Representative baseline | Reference (1×) | ≈1/10 of reference | ~10× |
Evaluation acceleration comparison (4 Ray parallel modes)
| Mode | Parallelism Dimension | Cores | Speedup |
|---|---|---|---|
| Mode-1 | \(N\) problem instances | \(N\) | ~5× |
| Mode-2 | \(R\) independent runs | \(R\) | ~10× |
| Mode-3 | \(N \times B\) instances × baselines | \(N \times B\) | ~20× |
| Mode-4 | \(N \times B \times R\) full parallel | \(N \times B \times R\) | ≥40× |
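The decomposition behind these modes can be sketched with the standard library; a thread pool stands in for Ray workers here, and the task payload and worker count are illustrative:

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product

def evaluate(task):
    """Hypothetical unit of work: one (problem, baseline, run) evaluation."""
    problem, baseline, run = task
    return (problem, baseline, run, 0.0)  # placeholder result

def mode4_schedule(n_problems, n_baselines, n_runs):
    # mode-4: full N x B x R parallelism, every cell an independent task;
    # modes 1-3 correspond to iterating over subsets of these axes
    return list(product(range(n_problems), range(n_baselines), range(n_runs)))

tasks = mode4_schedule(2, 3, 4)  # 24 independent evaluations
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(evaluate, tasks))
```

Because the three axes are orthogonal and the tasks share no state, speedup scales with worker count until the scheduler or hardware saturates, which is consistent with the ≥40× figure for mode-4.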
Key Findings¶
- MetaBBO-RL achieves overall best performance: MetaBBO baselines outperform traditional BBO on 14 of 16 bbob-10D test problems, and the RL paradigm consistently leads over SL, NE, and ICL paradigms.
- DEDDQN (2019) still ranks first: This notable finding suggests that more complex newer methods do not necessarily outperform older ones, and implies a trade-off between learning efficiency and model complexity.
- Large generalization gaps: Different baselines exhibit substantial variation in cross-test-set generalization; even algorithms that perform well in-distribution may degrade significantly on real-world problems such as protein or UAV. The Anti-NFL metric reveals that out-of-distribution generalization is a core challenge in MetaBBO.
Highlights & Insights¶
- As a benchmark platform paper, the architectural design is highly mature: the unified interface accommodates multiple paradigms via a wrapper pattern, and vectorized environments combined with Ray distributed evaluation address both training and testing dimensions.
- The Anti-NFL metric is a thoughtful design choice: it directly quantifies how well an algorithm withstands the No-Free-Lunch effect under problem distribution shift, which is highly instructive for the MetaBBO community.
- Comprehensive metadata preservation lowers the barrier to custom analysis, making the platform accessible to researchers new to the field.
Limitations & Future Work¶
- The paper contains an abundance of tables but lacks sufficient analytical depth; the comprehensive comparison of 23 baselines results in relatively coarse characterization of each method's strengths and weaknesses.
- MetaBBO-ICL includes only a single baseline (OPRO), providing insufficient coverage of LLMs as optimizers.
- In-depth evaluation on GPU-accelerated continuous optimization problems is absent, despite partial adoption of problems from EvoX.
- A unified theoretical analysis of different MetaBBO paradigms is lacking; comparisons are purely empirical.
Related Work & Insights¶
- Compared to traditional BBO benchmarks such as COCO and CEC, MetaBox-v2 is the only platform supporting the full MetaBBO bilevel framework, establishing a clearly differentiated positioning.
- EvoX's GPU acceleration approach warrants further integration with MetaBox-v2—the current vectorized environment relies on CPU multiprocessing; migrating to JAX/GPU would yield substantially greater speedups.
- The Anti-NFL metric's design principle generalizes naturally to other settings involving generalization from a training task distribution to new tasks, such as multi-task RL and meta-learning.
Rating¶
- Novelty: ⭐⭐⭐ Primarily an engineering upgrade; the unified interface and Anti-NFL metric exhibit moderate design originality
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ 23 baselines, 18 test suites, 51 independent runs; extremely broad coverage
- Writing Quality: ⭐⭐⭐⭐ Well-structured with rich tables, though high information density
- Value: ⭐⭐⭐⭐ Strong practical value and catalytic impact for the MetaBBO community