Cost-Sensitive Freeze-thaw Bayesian Optimization for Efficient Hyperparameter Tuning¶
Conference: NeurIPS 2025 arXiv: 2510.21379 Code: GitHub Area: LLM Evaluation Keywords: Hyperparameter Optimization, Bayesian Optimization, Freeze-thaw, Cost-Sensitive, Learning Curve Extrapolation, Transfer Learning
TL;DR¶
CFBO incorporates user-defined utility functions capturing cost–performance trade-offs into the freeze-thaw Bayesian optimization framework, and combines an adaptive stopping criterion with learning-curve (LC) mixup-based transfer learning to reach near-optimal cost–performance trade-offs on multi-fidelity HPO benchmarks.
Background & Motivation¶
Background: Hyperparameter optimization (HPO) for deep learning is computationally expensive. Multi-fidelity methods (Hyperband, DyHPO, ifBO) significantly improve efficiency by pruning configurations early via learning curve extrapolation.
Limitations of Prior Work: Traditional multi-fidelity methods assume sufficient budgets and target final performance maximization, without accounting for user preferences regarding cost–performance trade-offs — for instance, cloud computing users may prefer early stopping when credits are limited rather than running configurations to completion.
Key Challenge: How can the HPO process automatically terminate near the optimal utility point according to user preferences?
Key Insight: Define a utility function \(U(b, \tilde{y}_b)\) to characterize the cost–performance trade-off, and design a matching acquisition function and stopping criterion.
Core Idea: Maximize user utility rather than asymptotic performance within the freeze-thaw BO framework, while using LC mixup transfer learning to improve early extrapolation accuracy.
Method¶
Utility Function¶
The user utility function \(U(b, \tilde{y}_b): [B] \times [0,1] \to [0,1]\) is decreasing in budget \(b\) and increasing in the best cumulative performance \(\tilde{y}_b\). A typical form is

$$U(b, \tilde{y}_b) = \tilde{y}_b - \alpha \left(\frac{b}{B}\right)^{c},$$

where \(\alpha \in [0,1]\) controls the penalty strength and \(c \in \{0.5, 1, 2\}\) yields square-root, linear, and quadratic penalties, respectively. Setting \(\alpha=0\) recovers the conventional penalty-free setting.
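A minimal sketch of one utility satisfying these properties, using the additive-penalty form \(U = \tilde{y}_b - \alpha (b/B)^c\) implied by the description (the default values for `alpha` and `c` are illustrative, not the paper's):

```python
def utility(b, y_best, B, alpha=0.25, c=1.0):
    """Cost-sensitive utility: best performance minus a budget penalty.

    Decreasing in budget b, increasing in best cumulative performance
    y_best. alpha=0 recovers the penalty-free setting; c in {0.5, 1, 2}
    gives square-root, linear, and quadratic penalties.
    """
    return y_best - alpha * (b / B) ** c
```

For example, with \(\alpha=0.25\) and a linear penalty, spending half the budget costs 0.125 utility regardless of performance, so a configuration must keep improving to justify further epochs.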
For users uncertain about their own preferences, the utility can be estimated from pairwise preference data via the Bradley-Terry model; roughly 30 comparisons suffice to recover it accurately.
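A hedged sketch of how such preference-based estimation could look, using plain gradient ascent on the Bradley-Terry log-likelihood. The feature parameterization (performance and budget terms per outcome) and all names are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def fit_bt_utility(pairs, feats, n_iter=500, lr=0.5):
    """Fit linear utility weights w from pairwise preferences via the
    Bradley-Terry model: P(i preferred over j) = sigmoid(w @ (f_i - f_j)).

    pairs : list of (winner_idx, loser_idx) comparisons.
    feats : (n_outcomes, d) array; feats[k] describes outcome k,
            e.g. (performance, budget-penalty) terms.
    """
    w = np.zeros(feats.shape[1])
    for _ in range(n_iter):
        grad = np.zeros_like(w)
        for i, j in pairs:
            d = feats[i] - feats[j]
            p = 1.0 / (1.0 + np.exp(-w @ d))   # P(winner beats loser)
            grad += (1.0 - p) * d              # log-likelihood gradient
        w += lr * grad / len(pairs)
    return w
```

The fitted scores `feats @ w` then rank (budget, performance) outcomes in the order implied by the user's comparisons.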
EI-type Acquisition Function¶
Configuration \(x_{n^*}\) and a target horizon \(\Delta t\) are selected jointly to maximize the expected improvement in utility:

$$(n^*, \Delta t^*) = \operatorname*{arg\,max}_{n,\,\Delta t}\ \mathbb{E}\left[\max\left(U\big(b + \Delta t,\ \tilde{y}_{b+\Delta t}\big) - U_p,\ 0\right)\right]$$
- The remaining learning curve of configuration \(x_n\) is extrapolated via a Prior-Fitted Network (PFN).
- The optimal target epoch \(\Delta t\) is selected dynamically rather than fixed to the final epoch.
- The reference value \(U_p\) is the utility at the most recent step (not the historical best), since budget expenditure is irreversible.
A key behavioral property: early in the search the acquisition favors long-horizon, high-performance configurations (exploration); as performance saturates and cost dominates, it shifts toward short-term greedy selection (exploitation).
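A rough Monte-Carlo sketch of this acquisition for a single candidate configuration. The surrogate sampler `sample_future`, the additive-penalty utility form, and all parameter values are assumptions standing in for the PFN surrogate, not the paper's implementation:

```python
import numpy as np

def utility(b, y, B, alpha=0.25, c=1.0):
    # assumed additive-penalty utility: performance minus budget penalty
    return y - alpha * (b / B) ** c

def ei_utility_acquisition(sample_future, y_best, b_now, B,
                           n_samples=256, rng=None):
    """Cost-sensitive EI for one candidate configuration.

    sample_future(n, rng) -> (n, H) array of sampled future curve values,
    a hypothetical stand-in for the surrogate's extrapolations.
    Returns (best expected utility improvement, its horizon dt).
    """
    rng = np.random.default_rng(rng)
    u_ref = utility(b_now, y_best, B)             # U_p at the latest step
    curves = sample_future(n_samples, rng)        # sampled continuations
    # best cumulative performance along each sampled continuation
    run_best = np.maximum.accumulate(np.maximum(curves, y_best), axis=1)
    dts = np.arange(1, curves.shape[1] + 1)
    u_future = utility(b_now + dts, run_best, B)  # broadcast over samples
    ei_per_dt = np.maximum(u_future - u_ref, 0.0).mean(axis=0)
    best = int(np.argmax(ei_per_dt))
    return float(ei_per_dt[best]), best + 1       # (EI value, Δt*)
```

In the full method this quantity is also maximized over all candidate configurations \(n\); the configuration/horizon pair with the largest value is thawed next. Note how a larger budget penalty shrinks the optimal \(\Delta t\), producing the exploration-to-exploitation shift described above.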
Adaptive Stopping Criterion¶
Stopping is based on an estimated regret: the search halts once the expected future utility improvement of the selected configuration falls below an adaptive threshold. The threshold incorporates the probability of improvement (PI), i.e. the probability \(p_b\) that the selected configuration improves on \(U_p\) in the future. Two hyperparameters shape it: \(\gamma\) sets the threshold when PI = 0.5, and \(\beta\) controls how strongly PI is incorporated (\(\beta \to 0\): PI ignored; \(\beta \to \infty\): a pure PI decision). Defaults are \(\beta=e^{-1}\) and \(\gamma=\log_{0.5}0.2\).
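The control flow of such a stopping rule can be sketched as follows; `step_fn` and `threshold_fn` are hypothetical stand-ins (the paper's threshold is the \(\beta\)/\(\gamma\)-parameterized, PI-dependent form, not the constant used in the example):

```python
def run_until_stop(step_fn, threshold_fn, max_steps=1000):
    """Freeze-thaw loop with an adaptive stopping check (sketch only).

    step_fn() -> (regret_est, p_b): one BO step, returning the estimated
        regret and the probability of improvement for the chosen config.
    threshold_fn(p_b) -> float: adaptive threshold; in the paper this
        depends on p_b via the beta and gamma hyperparameters.
    """
    for t in range(1, max_steps + 1):
        regret_est, p_b = step_fn()
        if regret_est < threshold_fn(p_b):
            return t   # stop: expected future utility gain is too small
    return max_steps
```

Because the threshold rises as \(p_b\) falls, the loop terminates earlier when further improvement becomes unlikely, which is exactly the behavior the criterion is designed to produce.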
LC Mixup Transfer Learning¶
To improve early learning curve extrapolation accuracy, two-level mixup data augmentation is applied to the PFN surrogate model:
- Across datasets: \(L' = \lambda_1 L^{(m)} + (1-\lambda_1)L^{(m')}\)
- Across configurations: \(l' = \lambda_2 l_n + (1-\lambda_2) l_{n'}\), \(x' = \lambda_2 x_n + (1-\lambda_2) x_{n'}\)
\(\lambda_1, \lambda_2 \sim \text{Uniform}(0,1)\). This allows unlimited sampling of training data and reduces overfitting.
Key Experimental Results¶
Datasets¶
| Dataset | \(d_x\) | \(|\mathcal{X}|\) | \(T\) | Train Tasks | Test Tasks |
|---|---|---|---|---|---|
| LCBench | 7 | 2000 | 51 | 20 | 15 |
| TaskSet | 8 | 1000 | 50 | 21 | 9 |
| PD1 | 4 | 240 | 50 | 16 | 7 |
Main Results (Normalized Regret ×100, PD1 Benchmark)¶
| Method | α=0 | α=2⁻⁶ | α=2⁻⁴ | α=2⁻² |
|---|---|---|---|---|
| ifBO | 0.8±0.1 | 2.3±0.1 | 6.0±0.6 | 15.2±2.0 |
| Quick-Tune† | — | — | — | — |
| CFBO-NT | 0.2±0.0 | 1.5±0.0 | 4.5±0.0 | 8.5±0.0 |
| CFBO | 0.2±0.0 | 1.0±0.0 | 0.9±0.0 | 1.7±0.0 |
Ablation Study (PD1)¶
| Adaptive Threshold | Acq. | Transfer | α=0 | α=2⁻⁴ | α=2⁻² |
|---|---|---|---|---|---|
| ✗ | ✗ | ✗ | 0.8 | 6.0 | 15.2 |
| ✗ | ✗ | ✓ | 0.2 | 5.9 | 11.7 |
| ✗ | ✓ | ✓ | 0.2 | 4.5 | 8.5 |
| ✓ | ✓ | ✓ | 0.2 | 0.9 | 1.7 |
- Transfer learning yields the largest gain at α=0; the cost-sensitive acquisition function contributes most when α>0.
- The adaptive threshold delivers the greatest improvement under strong penalty (8.5→1.7).
Runtime (Seconds per BO Step, A100)¶
| Method | LCBench | TaskSet | PD1 |
|---|---|---|---|
| ifBO | 0.58 | 0.30 | 0.08 |
| CFBO | 1.52 | 0.78 | 0.23 |
Although CFBO is roughly 2–3× slower per step, the overhead is negligible in practice, since network training dominates the wall-clock cost of each BO step.
Highlights & Insights¶
- Elegant problem formulation: Formalizing "when should the user stop" as utility maximization is considerably more practical than fixed-budget settings.
- Self-adaptive acquisition behavior: The method naturally transitions from exploration to exploitation, with \(\Delta t\) decreasing as BO progresses.
- Adaptive stopping criterion substantially outperforms fixed thresholds and approaches the theoretically optimal stopping point.
- LC mixup data augmentation is simple yet effective, and generalizes to other PFN training scenarios.
Limitations & Future Work¶
- The utility function form requires user priors (although Bradley-Terry estimation is provided, a functional family must still be specified).
- The configuration space is restricted to a finite set \(\mathcal{X}\), precluding continuous HPO.
- Validation is limited to tabular, NLP, and visual classification learning curves; large-scale LLM training scenarios are not covered.
- Runtime is approximately 2–3× that of ifBO (though overall overhead remains negligible).
Rating¶
- Novelty: ⭐⭐⭐⭐ The combination of cost-sensitive utility and adaptive stopping constitutes a novel HPO formalization.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Three benchmarks × multiple utility functions × complete ablations × real-dataset validation.
- Writing Quality: ⭐⭐⭐⭐ Mathematical derivations are clear and figures are intuitive.
- Value: ⭐⭐⭐⭐ Directly applicable to cloud-based and time-constrained HPO scenarios.