Efficient Curvature-Aware Hypergradient Approximation for Bilevel Optimization¶
Conference: ICML2025
arXiv: 2505.02101
Code: To be confirmed
Area: Optimization
Keywords: bilevel optimization, hypergradient, inexact Newton, curvature information, meta-learning
TL;DR¶
Proposes the NBO framework, which exploits the inherent structure of hypergradients in bilevel optimization (where solving the lower-level problem and the Hessian-inverse-vector product share the same Hessian). By efficiently integrating curvature information via inexact Newton methods to approximate the hypergradient, it improves the gradient computational complexity by a factor of \(\kappa \log \kappa\) compared to the SOTA in deterministic settings.
Background & Motivation¶
Bilevel optimization is widely applied in machine learning tasks such as hyperparameter optimization, meta-learning, adversarial training, and neural architecture search. Its core challenge lies in estimating the hypergradient (implicit gradient):
where \(u^*(x) = [\nabla_{22}^2 g(x,y^*(x))]^{-1} \nabla_2 f(x,y^*(x))\). Existing methods typically treat solving the lower-level problem and computing the Hessian-inverse-vector product as independent tasks and process them separately. However, they share the same Hessian \(\nabla_{22}^2 g\), and this inherent structure has not been fully utilized.
SHINE (Ramzi et al., 2022) made the first attempt to exploit this structure by using quasi-Newton matrices to handle both sub-problems simultaneously, but only provided asymptotic convergence analysis, lacking a precise characterization of the convergence rate and computational complexity. This paper aims to systematically explore and utilize the benefits of the hypergradient structure.
Method¶
Framework: NBO (Newton-based Bilevel Optimization)¶
Core Idea: Utilize inexact Newton methods to simultaneously solve the lower-level optimal solution \(y^*(x)\) and the Hessian-inverse-vector product \(u^*(x,y)\), where both share the same Hessian \(H_k = \nabla_{22}^2 g(x^k, y^k)\).
Step 1: Curvature-Aware Hypergradient Approximation. Given \((x, y)\), simultaneously solve the linear system:
to obtain the inexact Newton direction \(v\) (approximating the lower-level Newton step) and \(u\) (approximating the Hessian-inverse-vector product).
Step 2: Compute Approximate Hypergradient:
Key Formula — Quadratic Decay of Single-Step Newton: The classical Newton step \(y^+ = y - [\nabla_{22}^2 g]^{-1} \nabla_2 g\) can improve the hypergradient approximation error from linear decay to quadratic decay:
Transforming Linear Systems into Quadratic Programming: The sub-problems (8) and (10) are strongly convex quadratic programming problems, which are solved approximately using GD or CG.
Algorithmic Instances¶
| Algorithm | Setting | Sub-problem Solver | Characteristics |
|---|---|---|---|
| NBO-GD | Deterministic | Gradient Descent (T+1 steps) | Degenerates to the SOBA/Dagréou framework when T=0 |
| NBO-CG | Deterministic | Conjugate Gradient | Better Hessian-vector product complexity |
| NSBO-SGD | Stochastic | Stochastic Gradient Descent | Suitable for large-scale scenarios |
Convergence and Complexity (Deterministic Setting)¶
Under standard assumptions (lower-level strong convexity parameter \(\mu\), gradient Lipschitz continuity, etc.), the convergence guarantee of NBO-GD is:
| Method | Gradient Computation | Hessian-Vector Product |
|---|---|---|
| AmIGO-GD | \(O(\kappa^4 \log\kappa \cdot \epsilon^{-1})\) | \(O(\kappa^4 \log\kappa \cdot \epsilon^{-1})\) |
| AID-BiO | \(O(\kappa^4 \log\kappa \cdot \epsilon^{-1})\) | \(O(\kappa^{3.5} \log\kappa \cdot \epsilon^{-1})\) |
| Ours (NBO-GD) | \(O(\kappa^3 \cdot \epsilon^{-1})\) | \(O(\kappa^4 \cdot \epsilon^{-1})\) |
| Ours (NBO-CG) | \(O(\kappa^3 \cdot \epsilon^{-1})\) | \(O(\kappa^{3.5} \log\kappa \cdot \epsilon^{-1})\) |
The gradient complexity is improved by a factor of \(\kappa \log\kappa\).
Stochastic Setting (NSBO-SGD)¶
The total sample complexity is \(O(\kappa^9 \epsilon^{-2})\), which improves upon AmIGO's \(O(\kappa^9 \log\kappa \cdot \epsilon^{-2})\) by a factor of \(\log\kappa\).
Lyapunov Analysis¶
Defines the Lyapunov function \(V_k = \Phi(x^k) - \Phi^* + b_y \|y^k - y^*(x^k)\|^2 + b_u \|u^k - u^*(x^k)\|^2\). It proves that under the inexact Newton step, \(\|y^{k+1} - y^*(x^{k+1})\|\) satisfies a descent condition of quadratic decay + linear perturbation, ensuring by induction that the iterates always lie within the local convergence region of Newton's method.
Key Experimental Results¶
Synthetic Experiments (Hyperparameter Optimization + Logistic Regression)¶
| Comparison | Result |
|---|---|
| NBO-GD (T=1) vs AmIGO (Q=10) | Comparable descent speed of the objective function |
| NBO-GD (T=1) vs AmIGO (Q=1) | NBO-GD is significantly superior |
| Evaluation by runtime | NBO-GD (T=1) is the fastest |
Key Findings: Even with only T=1 step to approximate the Newton direction, NBO-GD matches the performance of AmIGO with Q=10 steps of GD, indicating that curvature information is much more efficient than multi-step GD.
Hyperparameter Optimization (IJCNN1 + Covtype)¶
- IJCNN1: The convergence speed of NSBO-SGD is significantly superior to all baselines including SOBA, SABA, StoBiO, AmIGO, SHINE, F2SA, and MA-SABA.
- Covtype: NSBO-SGD also converges the fastest, achieving the lowest test error.
Data Cleaning (MNIST + FashionMNIST, 50% Noisy Labels)¶
- NSBO-SGD and AmIGO achieve the lowest test error.
- NSBO-SGD converges the fastest, demonstrating the efficiency advantage of integrating curvature information.
Variance Reduction and Momentum Extensions¶
- NSBO-SAGA outperforms SABA.
- MA-NSBO-SAGA outperforms MA-SABA.
- The NBO framework is compatible with variance reduction techniques (such as SAGA, STORM, PAGE) and momentum techniques.
Highlights & Insights¶
- Elegant Structural Insight: The structure where the lower-level problem and the Hessian-inverse-vector product in the hypergradient share the same Hessian is systematically utilized. This is a highly natural yet overlooked observation.
- Newton Step Brings Quadratic Decay: A single Newton step can improve the hypergradient estimation error from \(O(\|y-y^*\|)\) to \(O(\|y-y^*\|^2)\).
- Consistency Between Theory and Experiments: The gradient complexity in the deterministic setting is improved by \(\kappa \log\kappa\) times, which is also validated by the significant acceleration in experiments.
- T=1 Already Effective: The inner loop requires only 1 step of approximating the Newton direction to outperform 10 steps of GD, leading to low practical deployment costs.
- Framework Generality: NBO can be easily integrated with techniques such as CG, variance reduction (SAGA), and momentum (MA).
Limitations & Future Work¶
- Requires the initial point to be within the local Newton convergence region (BOX 1/BOX 2 conditions), which may necessitate an additional warm-up step in practice.
- Limited Improvement in Stochastic Settings: NSBO-SGD only achieves a \(\log\kappa\) improvement, which is less significant than the improvement in the deterministic setting.
- Lower-level Problem Must Be Strongly Convex: Although extension to the convex case via aggregate functions (BAMM+NBO) is discussed, the theoretical guarantees are limited to the strongly convex case.
- Non-smooth/Constrained Scenarios Unconsidered: Extensions to lower-level problems containing constraints or non-smooth terms have not yet been explored.
- High Cost of Hessian-Vector Products in Deep Networks: For large-scale neural networks, the practical overhead of computing exact Hessian-vector products remains substantial.
Related Work & Insights¶
- SHINE (Ramzi et al., 2022): Also utilizes the hypergradient structure, but with quasi-Newton methods and lacks non-asymptotic analysis.
- AmIGO (Arbel & Mairal, 2022a): A representative framework that solves sub-problems using multi-step GD.
- SOBA/SABA (Dagréou et al., 2022): Single-loop methods, to which NBO-GD degenerates when T=0.
- F2SA (Kwon et al., 2023): A fully first-order method that avoids Hessians but has higher complexity.
- Insight: The idea of sharing the Hessian can be generalized to multi-level optimization or scenarios with multiple coupled sub-problems.
Rating¶
- Novelty: ⭐⭐⭐⭐ — Simple and elegant structural insight. Incorporating inexact Newton into bilevel optimization is a natural yet novel combination.
- Experimental Thoroughness: ⭐⭐⭐⭐ — Comprehensive baselines, covering synthetic experiments, hyperparameter optimization, data cleaning, and various extensions.
- Writing Quality: ⭐⭐⭐⭐⭐ — Clear motivation, rigorous theories, and clear tabular comparisons.
- Value: ⭐⭐⭐⭐ — Provides a more efficient hypergradient estimation method for bilevel optimization, with contributions in both theory and practice.