Private Zeroth-Order Optimization with Public Data¶
Conference: NeurIPS 2025
arXiv: 2511.10859
Code: None
Area: AI Security / Differential Privacy
Keywords: Differential Privacy, Zeroth-Order Optimization, Public Data, DP-SGD, Privacy-Utility Tradeoff
TL;DR¶
This paper proposes the PAZO framework, which leverages public data to guide gradient approximation in private zeroth-order optimization. PAZO achieves a superior privacy-utility tradeoff compared to DP-SGD on both vision and text tasks, while delivering up to a 16× speedup.
Background & Motivation¶
Limitations of Prior Work¶
Background: Differentially private (DP) machine learning algorithms such as DP-SGD are the standard approach to protecting training-data privacy, but they face several practical issues.
High computational and memory overhead: DP-SGD requires computing per-sample gradients, followed by clipping and noise addition.
Potential of zeroth-order methods: Zeroth-order (ZO) methods approximate gradients via function value queries, making them naturally amenable to privatization, yet they typically yield lower utility.
Underutilization of public data: Public data (e.g., pre-training corpora) is often available in practice, but existing ZO methods do not exploit this resource.
Core Problem: Can public data be leveraged to improve the utility of private zeroth-order optimization while preserving its computational efficiency advantages?
Method¶
Overall Architecture¶
The PAZO (Public-data-Assisted Zeroth-Order) framework incorporates public-data guidance into the standard zeroth-order optimization pipeline to improve the quality of gradient approximation.
Key Designs¶
1. Public-Data-Guided Direction Selection
- Standard ZO methods probe along random directions, resulting in high variance.
- PAZO constructs an importance sampling distribution using gradient information derived from public data.
- Probing is concentrated along directions indicated by public-data gradients.
- Core formulation: \(\hat{g} = \frac{1}{q}\sum_{i=1}^{q} \frac{f(x + \mu u_i) - f(x)}{\mu}\, u_i\), where the sampling of the probe directions \(u_i\) is guided by public-data gradients (see the sketch after this list).
2. PAZO Variants
- PAZO-Subspace: Performs zeroth-order optimization within the subspace spanned by public-data gradients.
- PAZO-Projection: Projects zeroth-order estimates onto the signal directions identified by public data.
- PAZO-Precondition: Preconditions zeroth-order gradients using the Fisher information matrix estimated from public data.
3. Privacy Analysis
- Only function value queries involve private data; all public-data operations consume no privacy budget.
- Standard privacy accounting tools for zeroth-order DP methods are directly applicable.
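To make the public-data guidance concrete, below is a minimal NumPy sketch of a forward-difference estimator whose probe directions are biased toward a public-data gradient, plus projection and preconditioning helpers in the spirit of the PAZO-Projection and PAZO-Precondition variants. The paper releases no code, so the function names, the convex mixing scheme, and the `bias` weight are illustrative assumptions, not the authors' exact construction.

```python
import numpy as np

def pazo_grad_estimate(private_loss, x, public_grad, q=10, mu=1e-3, bias=0.7):
    """Forward-difference ZO estimate with probes biased toward the
    public-data gradient. Calls to `private_loss` are the only accesses
    to private data; the mixing scheme here is illustrative."""
    d = x.shape[0]
    g_pub = public_grad / (np.linalg.norm(public_grad) + 1e-12)
    f0 = private_loss(x)
    g_hat = np.zeros(d)
    for _ in range(q):
        # Convex mix of the public direction and an isotropic Gaussian probe.
        u = bias * g_pub + (1.0 - bias) * np.random.randn(d) / np.sqrt(d)
        u /= np.linalg.norm(u) + 1e-12
        g_hat += (private_loss(x + mu * u) - f0) / mu * u
    return g_hat / q

def project_onto_public(g_hat, public_basis):
    """PAZO-Projection-style step: keep only the component of the estimate
    inside the span of public-data gradients. `public_basis` is a (d, k)
    matrix with orthonormal columns."""
    return public_basis @ (public_basis.T @ g_hat)

def precondition_with_fisher(g_hat, fisher_diag, eps=1e-8):
    """PAZO-Precondition-style step: rescale by the inverse of a diagonal
    Fisher estimate computed on public data."""
    return g_hat / (fisher_diag + eps)
```

PAZO-Subspace would instead draw the probes `u` directly inside the span of `public_basis`, rather than projecting after the fact.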
Loss & Training¶
- Privacy budget: Rényi DP or zero-concentrated DP is used for tight privacy accounting.
- Noise calibration: The Gaussian noise standard deviation is set from the sensitivity and the target \(\varepsilon\) (a simplified calibration is sketched below).
- Learning rate schedule: Cosine annealing, consistent with non-private training.
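As a rough illustration of the calibration step, the textbook \((\varepsilon, \delta)\)-DP Gaussian-mechanism formula below conveys the idea; it is a simplified stand-in for the tighter Rényi-DP/zCDP accounting the paper uses, and the standard analysis behind it assumes \(\varepsilon \le 1\).

```python
import math

def gaussian_noise_std(sensitivity: float, epsilon: float, delta: float) -> float:
    """Textbook (epsilon, delta)-DP Gaussian-mechanism calibration; a
    simplified stand-in for the tighter RDP/zCDP accounting the paper uses."""
    return sensitivity * math.sqrt(2.0 * math.log(1.25 / delta)) / epsilon

# In ZO-DP the privatized quantity is a clipped scalar loss difference,
# so the sensitivity is set by the clipping threshold on that scalar.
sigma = gaussian_noise_std(sensitivity=1.0, epsilon=3.0, delta=1e-5)
```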
Key Experimental Results¶
Main Results¶
Vision task (CIFAR-10 fine-tuning, \(\varepsilon=3\)):
| Method | Accuracy | Training Time | Peak Memory |
|---|---|---|---|
| DP-SGD | 82.1% | 48 min | 12.3 GB |
| DP-SGD + Public | 84.5% | 52 min | 13.1 GB |
| ZO-DP | 76.3% | 8 min | 3.2 GB |
| PAZO (Ours) | 85.2% | 3 min | 2.8 GB |
Text task (SST-2 fine-tuning, \(\varepsilon=8\)):
| Method | Accuracy | Training Time | Speedup |
|---|---|---|---|
| DP-SGD | 90.1% | 120 min | 1× |
| DP-SGD + Public | 91.3% | 135 min | 0.9× |
| ZO-DP | 85.7% | 12 min | 10× |
| PAZO (Ours) | 91.8% | 7.5 min | 16× |
Ablation Study¶
Performance under varying privacy budgets (CIFAR-10):
| \(\varepsilon\) | DP-SGD | DP-SGD+Public | ZO-DP | PAZO |
|---|---|---|---|---|
| 1 | 71.2% | 75.8% | 62.3% | 77.5% |
| 3 | 82.1% | 84.5% | 76.3% | 85.2% |
| 8 | 88.5% | 89.2% | 83.1% | 89.8% |
| ∞ (no privacy) | 94.2% | 94.2% | 91.5% | 93.8% |
Key Findings¶
- PAZO achieves the most pronounced advantage under high privacy protection (low \(\varepsilon\)), outperforming even first-order methods augmented with public data.
- Runtime speedup reaches 16×, primarily due to the elimination of per-sample gradient computation.
- Greater similarity between public and private data yields larger improvements; however, significant gains are observed even with limited similarity.
- PAZO-Subspace is optimal in low-dimensional settings, while PAZO-Precondition is optimal in high-dimensional settings.
Highlights & Insights¶
- Surpassing the first-order ceiling: This is the first demonstration in private learning where a zeroth-order method outperforms a first-order method that also uses public data.
- High practical value: A 16× speedup combined with low memory usage makes privacy-preserving training on edge devices feasible.
- Theoretical grounding: Convergence analysis is provided under assumptions on the similarity between public and private data distributions; a typical form of such an assumption is shown below.
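This summary does not reproduce the paper's exact condition; a common way such public/private similarity assumptions are formalized in public-data-assisted DP analyses is a uniform bound on the gradient gap, stated here purely for illustration:

\[
\sup_{x} \left\| \nabla f_{\mathrm{pub}}(x) - \nabla f_{\mathrm{priv}}(x) \right\| \le \delta_{\mathrm{sim}},
\]

with smaller \(\delta_{\mathrm{sim}}\) yielding tighter convergence guarantees.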
Limitations & Future Work¶
- The method requires a degree of similarity between public and private data; performance degrades when cross-domain discrepancy is large.
- Theoretical analysis relies on convexity or the PL condition, offering limited guarantees for non-convex deep learning models.
- The framework is currently validated in fine-tuning settings; its effectiveness for pre-training from scratch remains unverified.
- A systematic investigation of how public data quality and quantity affect final performance is lacking.
Related Work & Insights¶
- DP-SGD (Abadi et al.): The standard approach for differentially private stochastic gradient descent.
- MeZO (Malladi et al.): A zeroth-order optimization framework for large language models.
- Public-data-assisted DP: First-order methods proposed by Yu et al. and De et al.
Rating¶
- ⭐ Novelty: 8/10 — The idea of guiding zeroth-order methods with public data is both elegant and effective.
- ⭐ Value: 9/10 — Dual advantages in speed and privacy-utility tradeoff yield high practical deployment value.
- ⭐ Writing Quality: 8/10 — Comprehensive comparison across multiple variants with well-designed experiments.