Private Zeroth-Order Optimization with Public Data¶
Conference: NeurIPS 2025
arXiv: 2511.10859
Code: None
Area: AI Security / Differential Privacy
Keywords: Differential Privacy, Zeroth-Order Optimization, Public Data, DP-SGD, Privacy-Utility Tradeoff
TL;DR¶
This paper proposes the PAZO framework, which leverages public data to guide gradient approximation in private zeroth-order optimization. PAZO achieves a superior privacy-utility tradeoff compared to DP-SGD on both vision and text tasks, while delivering up to a 16× speedup.
Background & Motivation¶
Limitations of Prior Work¶
Background: Differentially private (DP) machine learning algorithms such as DP-SGD are the standard approach to protecting training-data privacy, but they face several practical issues.
High computational and memory overhead: DP-SGD requires computing per-sample gradients, followed by clipping and noise addition.
Potential of zeroth-order methods: Zeroth-order (ZO) methods approximate gradients via function value queries, making them naturally amenable to privatization, yet they typically yield lower utility.
Underutilization of public data: Public data (e.g., pre-training corpora) is often available in practice, but existing ZO methods do not exploit this resource.
Core Problem: Can public data be leveraged to improve the utility of private zeroth-order optimization while preserving its computational efficiency advantages?
Method¶
Overall Architecture¶
The PAZO (Public-data-Assisted Zeroth-Order) framework incorporates public-data guidance into the standard zeroth-order optimization pipeline to improve the quality of gradient approximation.
Key Designs¶
1. Public-Data-Guided Direction Selection
- Standard ZO methods probe along random directions, resulting in high variance.
- PAZO constructs an importance sampling distribution using gradient information derived from public data.
- Probing is concentrated along directions indicated by public-data gradients.
- Core formulation: \(\hat{g} = \frac{1}{q}\sum_{i=1}^{q} \frac{f(x + \mu u_i) - f(x)}{\mu}\, u_i\), where the sampling of the probe directions \(u_i\) is guided by public-data gradients (see the sketch after this list).
2. PAZO Variants
- PAZO-Subspace: Performs zeroth-order optimization within the subspace spanned by public-data gradients.
- PAZO-Projection: Projects zeroth-order estimates onto the signal directions identified by public data.
- PAZO-Precondition: Preconditions zeroth-order gradients using the Fisher information matrix estimated from public data.
3. Privacy Analysis
- Only function value queries involve private data; all public-data operations consume no privacy budget.
- Standard privacy accounting tools for zeroth-order DP methods are directly applicable.
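To make the public-data guidance concrete, below is a minimal NumPy sketch of a forward-difference estimator whose probe directions are biased toward a public-data gradient, plus projection and preconditioning helpers in the spirit of the PAZO-Projection and PAZO-Precondition variants. The paper releases no code, so the function names, the convex mixing scheme, and the `bias` weight are illustrative assumptions, not the authors' exact construction.

```python
import numpy as np

def pazo_grad_estimate(private_loss, x, public_grad, q=10, mu=1e-3, bias=0.7):
    """Forward-difference ZO estimate with probes biased toward the
    public-data gradient. Calls to `private_loss` are the only accesses
    to private data; the mixing scheme here is illustrative."""
    d = x.shape[0]
    g_pub = public_grad / (np.linalg.norm(public_grad) + 1e-12)
    f0 = private_loss(x)
    g_hat = np.zeros(d)
    for _ in range(q):
        # Convex mix of the public direction and an isotropic Gaussian probe.
        u = bias * g_pub + (1.0 - bias) * np.random.randn(d) / np.sqrt(d)
        u /= np.linalg.norm(u) + 1e-12
        g_hat += (private_loss(x + mu * u) - f0) / mu * u
    return g_hat / q

def project_onto_public(g_hat, public_basis):
    """PAZO-Projection-style step: keep only the component of the estimate
    inside the span of public-data gradients. `public_basis` is a (d, k)
    matrix with orthonormal columns."""
    return public_basis @ (public_basis.T @ g_hat)

def precondition_with_fisher(g_hat, fisher_diag, eps=1e-8):
    """PAZO-Precondition-style step: rescale by the inverse of a diagonal
    Fisher estimate computed on public data."""
    return g_hat / (fisher_diag + eps)
```

PAZO-Subspace would instead draw the probes `u` directly inside the span of `public_basis`, rather than projecting after the fact.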
Loss & Training¶
- Privacy budget: Rényi DP or zero-concentrated DP is used for tight privacy accounting.
- Noise calibration: The Gaussian noise standard deviation is set from the sensitivity and the target \(\varepsilon\) (a simplified calibration is sketched below).
- Learning rate schedule: Cosine annealing, consistent with non-private training.
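As a rough illustration of the calibration step, the textbook \((\varepsilon, \delta)\)-DP Gaussian-mechanism formula below conveys the idea; it is a simplified stand-in for the tighter Rényi-DP/zCDP accounting the paper uses, and the standard analysis behind it assumes \(\varepsilon \le 1\).

```python
import math

def gaussian_noise_std(sensitivity: float, epsilon: float, delta: float) -> float:
    """Textbook (epsilon, delta)-DP Gaussian-mechanism calibration; a
    simplified stand-in for the tighter RDP/zCDP accounting the paper uses."""
    return sensitivity * math.sqrt(2.0 * math.log(1.25 / delta)) / epsilon

# In ZO-DP the privatized quantity is a clipped scalar loss difference,
# so the sensitivity is set by the clipping threshold on that scalar.
sigma = gaussian_noise_std(sensitivity=1.0, epsilon=3.0, delta=1e-5)
```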
Key Experimental Results¶
Main Results¶
Vision task (CIFAR-10 fine-tuning, \(\varepsilon=3\)):
| Method | Accuracy | Training Time | Peak Memory |
|---|---|---|---|
| DP-SGD | 82.1% | 48 min | 12.3 GB |
| DP-SGD + Public | 84.5% | 52 min | 13.1 GB |
| ZO-DP | 76.3% | 8 min | 3.2 GB |
| PAZO (Ours) | 85.2% | 3 min | 2.8 GB |
Text task (SST-2 fine-tuning, \(\varepsilon=8\)):
| Method | Accuracy | Training Time | Speedup |
|---|---|---|---|
| DP-SGD | 90.1% | 120 min | 1× |
| DP-SGD + Public | 91.3% | 135 min | 0.9× |
| ZO-DP | 85.7% | 12 min | 10× |
| PAZO (Ours) | 91.8% | 7.5 min | 16× |
Ablation Study¶
Performance under varying privacy budgets (CIFAR-10):
| \(\varepsilon\) | DP-SGD | DP-SGD+Public | ZO-DP | PAZO |
|---|---|---|---|---|
| 1 | 71.2% | 75.8% | 62.3% | 77.5% |
| 3 | 82.1% | 84.5% | 76.3% | 85.2% |
| 8 | 88.5% | 89.2% | 83.1% | 89.8% |
| ∞ (no privacy) | 94.2% | 94.2% | 91.5% | 93.8% |
Key Findings¶
- PAZO achieves the most pronounced advantage under high privacy protection (low \(\varepsilon\)), outperforming even first-order methods augmented with public data.
- Runtime speedup reaches 16×, primarily due to the elimination of per-sample gradient computation.
- Greater similarity between public and private data yields larger improvements; however, significant gains are observed even with limited similarity.
- PAZO-Subspace is optimal in low-dimensional settings, while PAZO-Precondition is optimal in high-dimensional settings.
Highlights & Insights¶
- Surpassing the first-order ceiling: This is the first demonstration in private learning where a zeroth-order method outperforms a first-order method that also uses public data.
- High practical value: A 16× speedup combined with low memory usage makes privacy-preserving training on edge devices feasible.
- Theoretical grounding: Convergence analysis is provided under assumptions on the similarity between public and private data distributions; a typical form of such an assumption is shown below.
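This summary does not reproduce the paper's exact condition; a common way such public/private similarity assumptions are formalized in public-data-assisted DP analyses is a uniform bound on the gradient gap, stated here purely for illustration:

\[
\sup_{x} \left\| \nabla f_{\mathrm{pub}}(x) - \nabla f_{\mathrm{priv}}(x) \right\| \le \delta_{\mathrm{sim}},
\]

with smaller \(\delta_{\mathrm{sim}}\) yielding tighter convergence guarantees.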
Limitations & Future Work¶
- The method requires a degree of similarity between public and private data; performance degrades when cross-domain discrepancy is large.
- Theoretical analysis relies on convexity or the PL condition, offering limited guarantees for non-convex deep learning models.
- The framework is currently validated in fine-tuning settings; its effectiveness for pre-training from scratch remains unverified.
- A systematic investigation of how public data quality and quantity affect final performance is lacking.
Related Work & Insights¶
- DP-SGD (Abadi et al.): The standard approach for differentially private stochastic gradient descent.
- MeZO (Malladi et al.): A zeroth-order optimization framework for large language models.
- Public-data-assisted DP: First-order methods proposed by Yu et al. and De et al.
Rating¶
- ⭐ Novelty: 8/10 — The idea of guiding zeroth-order methods with public data is both elegant and effective.
- ⭐ Value: 9/10 — Dual advantages in speed and privacy-utility tradeoff yield high practical deployment value.
- ⭐ Writing Quality: 8/10 — Comprehensive comparison across multiple variants with well-designed experiments.