
Differentially Private Bilevel Optimization: Efficient Algorithms with Near-Optimal Rates

Conference: NeurIPS 2025 arXiv: 2506.12994 Code: None Area: AI Safety / Differential Privacy Keywords: Bilevel Optimization, Differential Privacy, Exponential Mechanism, Log-Concave Sampling, Non-Convex Optimization

TL;DR

This paper systematically studies bilevel optimization under differential privacy (DP). For the convex setting, it establishes near-tight upper and lower bounds via the exponential mechanism and regularized exponential mechanism, matching the optimal rate of single-level DP-ERM. For the non-convex setting, it proposes a second-order DP method achieving state-of-the-art convergence rates that are independent of the inner-level dimension.

Background & Motivation

Background: Bilevel optimization (BLO) is widely applied in meta-learning, hyperparameter optimization, and adversarial training, where the outer objective depends on the optimal solution of an inner optimization problem — one problem is nested within another.

Limitations of Prior Work: Machine learning models can leak private information about their training data, and differential privacy is the standard defense. While DP optimization is well studied for single-level problems, DP theory for bilevel optimization remains severely underdeveloped: only two prior works exist, and they either consider only local DP or achieve non-convex rates that depend on the inner-level dimension.

Key Challenge: The nested structure of BLO prevents direct evaluation of the outer objective (the inner problem must first be solved), making classical DP mechanisms inapplicable. How function evaluation error affects privacy and utility is the central challenge.

Goal: (a) What is the optimal rate for DP-ERM in convex BLO? (b) Can the existing inner-dimension-dependent convergence rates be improved in non-convex BLO?

Key Insight: The exponential mechanism combined with log-concave sampling is employed for the convex setting; second-order approximate gradients with non-private inner solutions and sensitivity analysis are used for the non-convex setting.

Core Idea: Rather than privatizing the inner solution \(y\), the inner problem is solved non-privately to high precision; Gaussian noise, calibrated via a sensitivity analysis, is then added to the outer approximate gradient, thereby eliminating any dependence on the inner-level dimension.
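To make this recipe concrete, here is a minimal sketch in Python: a toy quadratic inner problem is solved non-privately, the second-order (implicit-differentiation) hypergradient is formed, and Gaussian noise is added only to the \(d_x\)-dimensional outer gradient. The objectives, step sizes, sensitivity value, and function names are illustrative assumptions, per-iteration privacy accounting is omitted, and this is not the paper's Algorithm 1.

```python
# Hedged sketch of "do not privatize the inner solution": non-private inner
# solve + second-order hypergradient + Gaussian noise on the outer gradient.
import numpy as np

rng = np.random.default_rng(0)
d_x, d_y, n = 4, 30, 1000
H = 2.0 * np.eye(d_y)                        # inner Hessian (strongly convex)
B = rng.normal(size=(d_y, d_x)) / np.sqrt(d_y)
b = rng.normal(size=d_y)                     # stand-in for a data-dependent term
t = rng.normal(size=d_y)                     # stand-in for a data-dependent target
lam = 0.5

def solve_inner(x, steps=200, lr=0.4):
    """Non-private inner solve of g(x, y) = 0.5*y'Hy - y'(Bx + b)."""
    y = np.zeros(d_y)
    for _ in range(steps):
        y -= lr * (H @ y - (B @ x + b))
    return y

def hypergradient(x, y):
    """Second-order hypergradient of Phi(x) = f(x, y*(x)) with
    f(x, y) = 0.5*||y - t||^2 + 0.5*lam*||x||^2, via implicit differentiation:
    grad Phi = grad_x f - grad_xy g @ inv(grad_yy g) @ grad_y f."""
    # In this toy problem grad_xy g = -B^T, so the correction is +B^T H^{-1}(y - t).
    return lam * x + B.T @ np.linalg.solve(H, y - t)

def noisy_hypergradient(x, eps, delta, sensitivity):
    """Only the d_x-dimensional outer gradient is privatized, so the noise
    magnitude has no dependence on the inner dimension d_y."""
    sigma = sensitivity * np.sqrt(2.0 * np.log(1.25 / delta)) / eps
    return hypergradient(x, solve_inner(x)) + rng.normal(scale=sigma, size=d_x)

x = np.zeros(d_x)
for _ in range(100):
    x -= 0.1 * noisy_hypergradient(x, eps=1.0, delta=1e-5, sensitivity=1.0 / n)
```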

Method

Overall Architecture

The input is a sensitive dataset, and the goal is to solve bilevel ERM or stochastic optimization under DP guarantees. Two major settings are considered: (1) convex outer level — a private solution is obtained by sampling via the exponential mechanism or regularized exponential mechanism; (2) non-convex outer level — an iterative second-order method combined with Gaussian noise.

Key Designs

  1. Convex Setting — Pure DP Exponential Mechanism:

    • Function: Draws the outer variable from a sampling distribution with probability proportional to \(\exp(-\varepsilon/(2s) \cdot \Phi_Z(x))\).
    • Mechanism: The key is bounding the sensitivity \(s\); the deviation of the optimal inner solution between neighboring datasets is controlled by the strong convexity parameter.
    • Achieves excess risk \(O(d_x / (\varepsilon \cdot n))\), matching the optimal rate of single-level DP-ERM.
  2. Convex Setting — Approximate DP Regularized Exponential Mechanism:

    • Function: Samples the outer variable with probability proportional to \(\exp(-k(\Phi_Z(x) + \mu\|x\|^2))\).
    • Mechanism: Proves that the difference in outer objectives between neighboring datasets is \(G\)-Lipschitz, then compares against the privacy curves of two Gaussian distributions.
    • Achieves excess risk \(O(\sqrt{d_x \log(1/\delta)} / (\varepsilon \cdot n))\).
  3. Efficient Implementation — Log-Concave Sampling with Approximate Function Evaluations:

    • Function: Since the optimal inner solution cannot be computed exactly, the analysis quantifies how function evaluation error affects the Grid-Walk sampling algorithm (a simplified sketch of sampling with approximate evaluations appears after this list).
    • Core Results: The conductance of the perturbed Markov chain satisfies \(\phi' \geq e^{-6\zeta} \cdot \phi\); mixing time increases by a factor of \(e^{12\zeta}\); the distributional distance satisfies \(\text{Dist}_\infty(\pi', \pi) \leq 2\zeta\).
    • Design Motivation: Ensures efficient sampling in polynomial time; this is the core technical contribution addressing imprecise function evaluation in BLO.
  4. Non-Convex Setting — Second-Order DP BLO (Algorithm 1):

    • Function: Uses approximate hypergradients in place of exact hypergradients for iterative optimization.
    • Mechanism: The inner problem is solved non-privately to high precision (e.g., via Katyusha) to obtain an approximate optimal solution, after which Gaussian noise is added to the outer approximate gradient; the sensitivity analysis relies on operator-norm perturbation inequalities.
    • Achieves gradient norm \(\tilde{O}((\sqrt{d_x}/(\varepsilon n))^{1/2})\), with no dependence on the inner-level dimension.
  5. Warm-Start Improvement (Algorithm 2):

    • Function: First obtains an initial point via the exponential mechanism, then runs Algorithm 1 initialized from this point.
    • Achieves \(\tilde{O}(\sqrt{d_x}/(\varepsilon n)^{3/4})\), which is superior in certain parameter regimes.
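Below is a minimal sketch of the sampling side of the convex setting, assuming a simplified Metropolis-style walk on a one-dimensional grid rather than the paper's exact Grid-Walk: the target is proportional to \(\exp(-\varepsilon/(2s) \cdot \tilde{\Phi}_Z(x))\), where each evaluation of \(\tilde{\Phi}_Z\) carries a small error from the inexact inner solve. The toy objective, grid, sensitivity value, and tolerance are illustrative assumptions, not the paper's construction.

```python
# Hedged sketch: exponential-mechanism sampling with approximate function
# evaluations, via a simplified Metropolis walk on a discretized outer domain.
import numpy as np

rng = np.random.default_rng(1)
eps, sensitivity = 1.0, 0.05
grid = np.linspace(-1.0, 1.0, 201)          # discretized outer domain

def approx_outer_objective(x, tol=1e-3):
    """Stand-in for Phi_Z(x): the inner problem is 'solved' to accuracy tol,
    so each evaluation carries a small error bounded by tol (the zeta above)."""
    y_star = 0.5 * x                         # pretend (exact) inner solve
    exact = (x - 0.3) ** 2 + y_star ** 2
    return exact + rng.uniform(-tol, tol)    # evaluation error from inexact solve

def score(x):
    # Log of the unnormalized exponential-mechanism density exp(-eps/(2s) * Phi).
    return -eps / (2.0 * sensitivity) * approx_outer_objective(x)

def grid_walk(steps=20_000):
    i = len(grid) // 2
    for _ in range(steps):
        j = min(max(i + rng.choice([-1, 1]), 0), len(grid) - 1)  # grid neighbor
        if np.log(rng.uniform()) < score(grid[j]) - score(grid[i]):
            i = j                            # Metropolis accept/reject on noisy scores
    return grid[i]

print(grid_walk())
```

Because the scores are themselves noisy, the chain's stationary behavior is only approximately the intended exponential-mechanism distribution; the paper's conductance and mixing-time bounds quantify exactly how much this approximation can cost.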

Loss & Training

  • Convex setting: No gradient descent is required; solutions are obtained directly via MCMC sampling.
  • Non-convex setting: DP-SGD-style iterations; overall DP is guaranteed via the composition theorem.
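For concreteness, if each of the \(T\) noisy gradient releases is \((\varepsilon_0, \delta_0)\)-DP, one standard way to obtain the overall guarantee is advanced composition (the paper may use a different accountant; this is only an illustration of the composition step):

\[
(\varepsilon, \delta)\text{-DP with}\quad \varepsilon = \sqrt{2T\ln(1/\delta')}\,\varepsilon_0 + T\varepsilon_0\left(e^{\varepsilon_0}-1\right), \qquad \delta = T\delta_0 + \delta', \quad \text{for any } \delta' > 0.
\]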

Key Experimental Results

Main Results: Convex DP BLO Rate Comparison

| Setting | Upper Bound (Ours) | Lower Bound (Ours) | Single-Level Optimal Rate |
| --- | --- | --- | --- |
| Pure DP ERM | \(O(d_x/(\varepsilon n))\) | \(\Omega((LD) \cdot d_x/(n\varepsilon))\) | \(\Theta(d_x/(\varepsilon n))\) |
| Approximate DP ERM | \(O(\sqrt{d_x \log(1/\delta)}/(\varepsilon n))\) | \(\Omega(\sqrt{d_x \log(1/\delta)}/(n\varepsilon))\) | \(\Theta(\cdots)\) |

Non-Convex DP BLO Rate Comparison

| Method | Gradient Norm Upper Bound | Inner Dimension Dependent? |
| --- | --- | --- |
| Kornowski et al. (2024) | \(\tilde{O}((\sqrt{d_x}/(\varepsilon n))^{1/2} + (\sqrt{d_y}/(\varepsilon n))^{1/3})\) | Yes |
| Algorithm 1 (Ours) | \(\tilde{O}((\sqrt{d_x}/(\varepsilon n))^{1/2})\) | No |
| Algorithm 2 (Ours) | \(\tilde{O}(\sqrt{d_x}/(\varepsilon n)^{3/4})\) | No |

Key Findings

  • When parameters are constants, the DP rate for convex BLO coincides exactly with that of single-level problems — the bilevel structure incurs no additional privacy cost.
  • The lower bounds establish the necessity of the \(L_{f,y} \cdot D_y\) term: DP BLO is strictly harder than single-level DP optimization.
  • In the non-convex setting, the key insight is "do not privatize the inner solution," which avoids the error introduced by high-dimensional noise.

Highlights & Insights

  • The analysis of log-concave sampling with approximate function evaluations has independent research value, applicable to any sampling problem where functions cannot be evaluated exactly.
  • The lower bound construction is elegant: a linear outer objective is paired with a quadratic inner objective that links the two variables, reducing the problem to private mean estimation.
  • The sensitivity analysis methodology in the second-order method can be generalized to other DP multi-level or nested optimization problems.
  • The perturbation analysis framework for Grid-Walk is directly applicable to other MCMC settings requiring approximate function evaluations, such as variational inference.

Limitations & Future Work

  • A gap remains between the upper and lower bounds for population risk in convex stochastic optimization.
  • The exponential mechanism component of Algorithm 2 is computationally inefficient (exponential complexity).
  • The work is entirely theoretical, with no empirical validation.
  • Incorporating variance reduction techniques into DP BLO remains an important and challenging open problem.
  • vs. Kornowski et al. (2024): Their non-convex rate depends on the inner-level dimension; this paper eliminates that dependence entirely. The key distinction lies in whether the inner solution is privatized.
  • vs. BST14 (single-level DP-ERM): This paper shows that convex BLO achieves the same rate under constant parameters, though in the general case an additional bilevel complexity term appears.
  • vs. Chen et al. (2024, local DP): That work considers the local DP setting and provides guarantees only when \(\varepsilon\) is large; this paper delivers complete rates in the important privacy regime of \(\varepsilon = O(1)\).
  • The approximate-evaluation sampling analysis in this paper is directly applicable to DP variational inference, DP MCMC, and other scenarios requiring approximate function evaluations.

Rating

  • Novelty: ⭐⭐⭐⭐ — First near-tight upper and lower bounds for convex DP BLO; inner-dimension dependence eliminated in the non-convex setting.
  • Experimental Thoroughness: ⭐⭐ — Purely theoretical; no experiments.
  • Writing Quality: ⭐⭐⭐⭐ — Clear structure; technical overview is well written.
  • Value: ⭐⭐⭐⭐ — Establishes a comprehensive theoretical foundation for the DP BLO field.