Meta-Black-Box-Optimization through Offline Q-function Learning (Q-Mamba)¶
Conference: ICML2025
arXiv: 2505.02010
Code: Open-sourced
Area: Meta-Learning / Black-Box Optimization
Keywords: Meta-BBO, Offline Reinforcement Learning, Q-function Decomposition, Mamba, Dynamic Algorithm Configuration, Conservative Q-Learning
TL;DR¶
This paper proposes Q-Mamba, the first offline MetaBBO framework. By integrating Q-function decomposition, conservative Q-learning, and the Mamba architecture, it achieves comparable or superior BBO algorithm configuration performance with less than half the training budget of online methods.
Background & Motivation¶
Meta-Black-Box-Optimization (MetaBBO) is a bi-level "learning to optimize" paradigm: a meta-level neural network policy \(\pi_\theta\) dynamically configures the hyperparameters of a bottom-level BBO algorithm (such as Differential Evolution, DE) to achieve optimal performance over a problem distribution.
Existing MetaBBO methods face two core bottlenecks:
Learning Effectiveness: Advanced BBO algorithms have vast configuration spaces (e.g., MadDE has more than 10 hyperparameters). The joint action space grows exponentially with dimension (\(M^K\)), making it difficult for RL to learn effective policies.
Training Efficiency: Current methods all adopt an online RL paradigm, where bottom-level optimization requires hundreds of simulation steps each time, making trajectory sampling extremely time-consuming (training typically takes 25-28 hours).
Core Problem: Can offline RL be used to learn DAC policies from pre-collected trajectory data while guaranteeing learning performance?
Method¶
Overall Architecture¶
Q-Mamba consists of three core designs:
- Offline E&E Dataset Collection: Mixing exploration and exploitation data.
- Conservative Q-Learning Loss: Decomposed Bellman update + conservative regularization.
- Mamba-based Q-Learner: Long-sequence Q-function learning based on State Space Models (SSMs).
1. Q-function Decomposition Mechanism¶
The joint action space \(A = (A_1, \ldots, A_K)\) is decomposed into dimension-wise sequential decisions. For the decomposed Q-function of the \(i\)-th dimension:
- Each hyperparameter \(A_i\) is discretized into \(M=16\) bins.
- A 5-bit binary code is used to represent the action token (
00000to01111), with11111serving as the start token. - This converts the exponential space \(M^K\) to a linear scale \(K \times M\).
2. Offline E&E Dataset (Exploration & Exploitation)¶
A dataset of \(D = 10\text{K}\) trajectories is collected with a mixing ratio of \(\mu = 0.5\):
- Exploitation Data (\(\mu \cdot D\)): High-quality trajectories obtained by rolling out pre-trained MetaBBO methods (RLPSO, LDE, GLEET) on the problem distribution.
- Exploration Data (\((1-\mu) \cdot D\)): Exploration trajectories obtained by controlling hyperparameters via a random policy.
Mixing these two types of data simultaneously provides high-quality experience and diversity, which reduces offline learning bias.
3. Conservative Q-Learning Loss¶
The training target is a three-branch combined loss (computed for each timestep \(t\) in trajectory \(\tau\), each action dimension \(i\), and each bin \(j\)):
- First two branches: Standard decomposed Bellman targets (TD error), where the last dimension is weighted by \(\beta=10\) to anchor the backpropagation accuracy.
- Third branch: CQL conservative regularization, which pushes the Q-values of bins not selected in the trajectory towards 0 (lower bound), mitigating overestimation caused by distribution shift.
- \(\lambda=1\) balances the conservative term.
4. Mamba-based Q-Learner Architecture¶
\(Q_i^t\) is output dimension-by-dimension in an autoregressive manner:
- Input: The current optimization state \(s^t \in \mathbb{R}^9\) (6-dimensional population statistical features + 3-dimensional temporal features) concatenated with the action token from the previous step.
- Mamba Block: Processed through a Selective State Space Model (SSM), where the parameters \(\bar{B}\) and \(C\) are functions of the input (time-varying), enabling flexible learning of long-term and short-term dependencies.
- Q-value Head: A linear layer + Leaky ReLU, outputting a 16-dimensional Q-value vector.
- Choosing \(a_i^t = \arg\max_j Q_{i,j}^t\), which is then tokenized and fed into the next step.
Reasons for choosing Mamba over Transformer:
- MetaBBO involves \(T \times K\) decision steps (long sequences of several thousand steps). Mamba flexibly processes long-term and short-term dependencies through selective SSM.
- Hardware-friendly parallel scan (PrefixSum) algorithm, offering memory efficiency equivalent to FlashAttention.
- Faster training speed compared to Transformer (13h vs. 16h).
Training utilizes AdamW, with a learning rate of \(5 \times 10^{-3}\), 300 epochs, and a batch size of 64.
Key Experimental Results¶
In-Distribution Generalization Performance (8 unseen test problems, 19 independent runs, cumulative performance metric Perf)¶
| Method | Type | Alg0 (K=3) | Alg1 (K=10) | Alg2 (K=16) | Training/Inference Time |
|---|---|---|---|---|---|
| RLPSO | Online | 0.9855 | 0.9953 | 0.9914 | 28h / 11s |
| LDE | Online | 0.9563 | 0.9877 | 0.9904 | 28h / 12s |
| GLEET | Online | 0.9616 | 0.9938 | 0.9910 | 25h / 13s |
| DT | Offline | 0.9325 | 0.6764 | 0.8706 | 13h / 10s |
| DeMa | Offline | 0.9492 | 0.9015 | 0.9159 | 12h / 10s |
| QDT | Offline | 0.9683 | 0.9917 | 0.9919 | 20h / 12s |
| QT | Offline | 0.9729 | 0.9955 | 0.9926 | 20h / 12s |
| Q-Transformer | Offline | 0.9666 | 0.9951 | 0.9895 | 16h / 11s |
| Q-Mamba | Offline | 0.9889 | 0.9973 | 0.9950 | 13h / 10s |
Key Findings: Q-Mamba achieves optimal performance across three algorithms with different configuration space scales, while requiring less than half the training time of online methods.
Out-of-Distribution Generalization (Neuroevolution / MuJoCo Continuous Control)¶
- Trained solely on synthetic BBOB problems (\(\le 50\) dimensions), Q-Mamba achieves zero-shot transfer to MLP policy evolution tasks with thousands of parameters.
- It achieves comparable performance to online MetaBBO baselines on MuJoCo tasks such as Ant, HalfCheetah, and Hopper.
Ablation Study¶
| Setting | \(\lambda=0\) | \(\lambda=1\) | \(\lambda=10\) |
|---|---|---|---|
| \(\beta=1\) | 0.9756 | 0.9828 | 0.9855 |
| \(\beta=10\) | 0.9833 | 0.9889 | 0.9857 |
- Removing conservative regularization (\(\lambda=0\)) leads to a remarkable performance drop \(\rightarrow\) The conservative term is critical for mitigating distribution shift.
- \(\beta=10\) weights the Q-value of the final dimension \(\rightarrow\) Improving overall accuracy by anchoring backpropagation.
- A data mixing ratio of \(\mu=0.5\) (equal proportion mixing) achieves the best performance.
Highlights & Insights¶
- First Offline MetaBBO Framework: Breaks the paradigm of sole reliance on online RL in this field, improving training efficiency by \(\ge 2\times\).
- Q-function Decomposition: Elegantly transforms the exponential joint action space into linear sequential decisions, lowering the learning difficulty.
- Conservative Q-learning + Decomposed Bellman: The combined loss is exquisitely designed, and the trick of weighting the last dimension with \(\beta\) is theoretically supported (as a backpropagation anchor).
- Mamba vs. Transformer: Empirically demonstrates that Mamba outperforms Transformer in long-sequence Q-learning, leveraging the selective memory advantages of time-varying SSM.
- Strong Generalization: Achieves zero-shot transfer from synthetic problems of \(\le 50\) dimensions to Neuroevolution scenarios with thousands of dimensions.
Limitations & Future Work¶
- Offline Data Reliance on Pre-trained Baselines: The E&E dataset requires training online methods like RLPSO, LDE, or GLEET beforehand; this initial cost is not fully discussed.
- Limited to Evolutionary Algorithms: Validation has not been conducted on other BBO paradigms like gradient-based optimizers or Bayesian optimization.
- Information Loss from Action Discretization: Uniform discretization into 16 bins might discard the fine-grained details of continuous hyperparameters.
- Limited OOD Generalization Experiments: Only one type of OOD scenario (Neuroevolution) was tested; more real-world scenarios remain to be validated.
- Mamba Architecture's Reliance on CUDA: Hardware acceleration is restricted to specific GPUs, limiting deployment flexibility.
Related Work & Insights¶
- ConfigX (Guo et al., 2024b): Constructs a large-scale algorithm space; this work selects bottom-level algorithms based on it.
- Q-Transformer (Chebotar et al., 2023): Q-decomposition + Transformer; this work gets improvements by replacing it with Mamba.
- CQL (Kumar et al., 2020): Its conservative Q-learning ideology is directly absorbed in this paper.
- Mamba (Gu & Dao, 2023): Selective State Space Model; this work is the first to apply it to MetaBBO.
- Decision Transformer (Chen et al., 2021): Serves as a baseline for the conditional imitation learning paradigm.
Rating¶
- Novelty: ⭐⭐⭐⭐ — The first combination of offline MetaBBO + Q-decomposition + Mamba exhibits significant innovation.
- Experimental Thoroughness: ⭐⭐⭐⭐ — Relatively comprehensive, featuring comparisons with 9 baselines, ablation studies, and OOD generalization.
- Writing Quality: ⭐⭐⭐⭐ — Clear mathematical derivations, intuitive framework diagrams, and smooth writing style.
- Value: ⭐⭐⭐⭐ — Introduces a new offline RL paradigm to the MetaBBO community, demonstrating high practical value.