# Adaptive Methods Are Preferable in High Privacy Settings: An SDE Perspective
- **Conference:** ICLR 2026
- **arXiv:** 2603.03226
- **Code:** None (uses Google's open-source DP² repository)
- **Area:** AI Safety / Differentially Private Optimization
- **Keywords:** Differential Privacy, SDE Analysis, DP-SGD, DP-SignSGD, Privacy-Utility Tradeoff
## TL;DR
This work is the first to analyze differentially private optimizers through a stochastic differential equation (SDE) framework, revealing fundamental behavioral differences between DP-SGD and DP-SignSGD under privacy noise: adaptive methods achieve a superior privacy-utility tradeoff of \(\mathcal{O}(1/\varepsilon)\) vs. \(\mathcal{O}(1/\varepsilon^2)\) in high-privacy regimes, and their hyperparameters transfer across privacy budgets.
## Background & Motivation
Background: Differential privacy (DP) has become the standard for large-scale private training. DP-SGD protects privacy via per-example gradient clipping and Gaussian noise injection. Adaptive DP optimizers (e.g., DP-Adam) are widely used in practice but remain theoretically underexplored. Prior work suggests DP-SGD and DP-Adam perform comparably under careful tuning, and which is superior remains an open question.
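The clipping-plus-noise mechanism at the heart of DP-SGD can be sketched in a few lines of NumPy. This is an illustrative sketch, not the paper's or the DP² repository's code; `clip_norm` and `noise_mult` are placeholder hyperparameter names:

```python
import numpy as np

def dp_sgd_step(params, per_example_grads, lr=0.1, clip_norm=1.0,
                noise_mult=1.0, rng=None):
    """One DP-SGD step: clip each example's gradient to clip_norm,
    average, then add Gaussian noise calibrated to the clip norm."""
    rng = rng or np.random.default_rng(0)
    clipped = []
    for g in per_example_grads:
        norm = np.linalg.norm(g)
        # Per-example clipping: rescale only if the norm exceeds clip_norm
        clipped.append(g * min(1.0, clip_norm / (norm + 1e-12)))
    mean_g = np.mean(clipped, axis=0)
    # Gaussian mechanism: noise std scales as clip_norm * noise_mult / batch_size
    noise = rng.normal(0.0, clip_norm * noise_mult / len(per_example_grads),
                       size=mean_g.shape)
    return params - lr * (mean_g + noise)
```

Because the noise multiplier is what privacy accounting ties to \(\varepsilon\), a stricter budget (smaller \(\varepsilon\)) forces a larger `noise_mult` and thus noisier updates.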
Limitations of Prior Work: (1) The interaction between DP noise and adaptivity lacks theoretical characterization; (2) hyperparameters must be re-searched for different privacy budgets \(\varepsilon\), consuming additional privacy budget; (3) there is no consensus on whether adaptive methods confer advantages under DP.
Key Challenge: DP noise acts through fundamentally different mechanisms in non-adaptive versus adaptive methods, yet existing analyses cannot capture this distinction.
Goal: (1) Establish SDE models for DP optimizers; (2) precisely characterize the effect of \(\varepsilon\) on convergence rate and asymptotic neighborhood; (3) compare performance under fixed-hyperparameter and optimally-tuned protocols.
Key Insight: The SDE weak approximation framework can capture the effect of DP noise on continuous dynamics; SignSGD serves as a tractable theoretical proxy for Adam.
Core Idea: Although DP-SignSGD's convergence rate depends on \(\varepsilon\), its privacy-utility tradeoff is only \(\mathcal{O}(1/\varepsilon)\), whereas DP-SGD converges at a rate independent of \(\varepsilon\) but incurs an \(\mathcal{O}(1/\varepsilon^2)\) tradeoff. Adaptive methods are therefore preferable under strict privacy constraints.
## Method
### Overall Architecture
The paper derives continuous-time SDE models for DP-SGD and DP-SignSGD using SDE weak approximation theory. Two phases induced by per-example clipping are considered (Phase 1: all gradients clipped; Phase 2: no clipping), and convergence bounds are derived for each. Two analytical protocols are designed: Protocol A (fixed hyperparameters, varying \(\varepsilon\)) and Protocol B (independent hyperparameter tuning per \(\varepsilon\)).
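To fix notation, here is a generic version of the objects being modeled. This is a sketch of the standard first-order weak-approximation form, assuming Phase 2 (no clipping active); the paper's exact drift and diffusion coefficients may differ. The discrete DP-SGD update with batch size \(B\), clip norm \(C\), and noise multiplier \(z\) is

\[ x_{k+1} = x_k - \eta\left(\frac{1}{B}\sum_{i=1}^{B} g_k^{(i)} + \xi_k\right), \qquad \xi_k \sim \mathcal{N}\!\left(0, \frac{z^2 C^2}{B^2} I\right), \]

and its first-order weak-approximation SDE takes the form

\[ dX_t = -\nabla f(X_t)\,dt + \sqrt{\eta}\,\left(\Sigma(X_t) + \sigma_{\mathrm{DP}}^2 I\right)^{1/2} dW_t, \qquad \sigma_{\mathrm{DP}}^2 = \frac{z^2 C^2}{B^2}, \]

where \(\Sigma\) is the minibatch gradient covariance. Privacy accounting for the Gaussian mechanism gives \(z \propto 1/\varepsilon\) at fixed \(\delta\), which is how the privacy budget enters the continuous dynamics.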
### Key Designs
- **SDE Analysis of DP-SGD (Protocol A):**
  - Function: Characterizes the effect of privacy budget \(\varepsilon\) on DP-SGD under fixed hyperparameters.
  - Mechanism: Under \(\mu\)-PL and \(L\)-smoothness conditions, it is proved that the DP-SGD loss satisfies \(\mathbb{E}[f(X_t)] \lesssim f(X_0)e^{-\mu t} + (1-e^{-\mu t}) \cdot \mathcal{O}(1/\varepsilon^2)\). The decay term (convergence rate) is independent of \(\varepsilon\), while the asymptotic neighborhood (privacy-utility term) scales as \(1/\varepsilon^2\).
  - Design Motivation: Separating the convergence rate from the asymptotic neighborhood reveals precisely that \(\varepsilon\) affects only the latter.
- **SDE Analysis of DP-SignSGD (Protocol A):**
  - Function: Reveals fundamentally different behavior of adaptive methods under DP.
  - Mechanism: It is proved that the DP-SignSGD loss satisfies \(\mathbb{E}[f(X_t)] \lesssim f(X_0)e^{-c\varepsilon t} + (1-e^{-c\varepsilon t}) \cdot \mathcal{O}(1/\varepsilon)\). The key distinction is that the decay term depends linearly on \(\varepsilon\) (slow convergence at small \(\varepsilon\)), but the asymptotic neighborhood is only \(\mathcal{O}(1/\varepsilon)\). This exploits the noise-compression effect of the sign operation: \(\mathbb{E}[\text{sign}(g_k)] \approx \nabla f(x)/(\sigma_\gamma\sqrt{d})\).
  - Design Motivation: The sign operator inherently compresses noise magnitude, reducing the impact of DP noise from quadratic to linear.
- **Hyperparameter Transfer Across Privacy Budgets (Protocol B):**
  - Function: Compares asymptotic performance and hyperparameter sensitivity of both methods under optimal tuning.
  - Mechanism: Optimal learning rates are derived: DP-SGD requires \(\eta^\star \propto \varepsilon\) (privacy-budget-dependent), whereas DP-SignSGD's \(\eta^\star\) is independent of \(\varepsilon\). Under optimal learning rates, asymptotic performance is comparable, but DP-SignSGD requires no re-tuning across different \(\varepsilon\) values.
  - Design Motivation: In practice, hyperparameter search consumes additional privacy budget; methods insensitive to \(\varepsilon\) are therefore more practical.
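The noise-compression effect of the sign operation can be checked numerically. For a single coordinate with true gradient \(\mu\) buried in Gaussian noise of scale \(\sigma \gg |\mu|\), the identity \(\mathbb{E}[\operatorname{sign}(\mu + \sigma Z)] = \operatorname{erf}(\mu/(\sigma\sqrt{2})) \approx \sqrt{2/\pi}\,\mu/\sigma\) shows the signal surviving with attenuation that is linear, not quadratic, in the noise scale. This is a per-coordinate sketch under an assumed Gaussian noise model; the paper's constant \(\sigma_\gamma\sqrt{d}\) aggregates the effect over \(d\) dimensions:

```python
import math
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 0.05, 1.0                      # weak true gradient, strong DP noise
z = mu + sigma * rng.standard_normal(2_000_000)

empirical = np.sign(z).mean()              # Monte-Carlo estimate of E[sign(mu + sigma*Z)]
exact = math.erf(mu / (sigma * math.sqrt(2)))
linearized = math.sqrt(2 / math.pi) * mu / sigma   # small-signal approximation
```

All three values should agree to a few decimal places, confirming that the sign operator passes through a scaled copy of the gradient direction rather than the raw noisy gradient.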
### Loss & Training
The theoretical analysis assumes \(\mu\)-PL or \(L\)-smooth loss functions. Experiments validate the theory on quadratic convex functions and logistic regression on IMDB and StackOverflow. Standard DP training with per-example clipping and Gaussian noise injection is used. Theoretical insights from DP-SignSGD are empirically extended to DP-Adam.
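The two Protocol A scalings can be reproduced on the simplest test case, the 1-D quadratic \(f(x) = x^2/2\) with additive gradient noise of scale \(\sigma \propto 1/\varepsilon\). This is an illustrative sketch with arbitrary constants, not the paper's experimental setup:

```python
import numpy as np

def stationary_loss(update, sigma, eta=0.05, steps=150_000, burn=15_000, seed=0):
    """Average f(x) = x^2/2 once the iterate x <- x - eta*update(x + sigma*Z)
    has settled into its stationary neighborhood."""
    rng = np.random.default_rng(seed)
    x, total = 0.0, 0.0
    for k in range(steps):
        g = x + sigma * rng.standard_normal()   # noisy gradient of f(x) = x^2/2
        x -= eta * update(g)
        if k >= burn:
            total += 0.5 * x * x
    return total / (steps - burn)

eps_grid = np.array([0.5, 1.0, 2.0, 4.0])
loss_sgd = [stationary_loss(lambda g: g, sigma=1.0 / e) for e in eps_grid]
loss_sign = [stationary_loss(np.sign, sigma=1.0 / e) for e in eps_grid]

# Log-log slope of asymptotic loss vs. epsilon:
# close to -2 for DP-SGD, close to -1 for DP-SignSGD (up to Monte-Carlo error)
slope_sgd = np.polyfit(np.log(eps_grid), np.log(loss_sgd), 1)[0]
slope_sign = np.polyfit(np.log(eps_grid), np.log(loss_sign), 1)[0]
```

The fitted slopes recover the \(1/\varepsilon^2\) versus \(1/\varepsilon\) neighborhoods from the theory tables below, using nothing but the raw update rules.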
## Key Experimental Results
### Main Results (Privacy-Utility Tradeoff Verification)
| Method | Privacy-Utility Scaling | Convergence Rate vs. \(\varepsilon\) | \(\eta^\star\) vs. \(\varepsilon\) |
|---|---|---|---|
| DP-SGD | \(\mathcal{O}(1/\varepsilon^2)\) | Independent of \(\varepsilon\) | \(\eta^\star \propto \varepsilon\) |
| DP-SignSGD | \(\mathcal{O}(1/\varepsilon)\) | Linear in \(\varepsilon\) | Independent of \(\varepsilon\) |
| DP-Adam | \(\approx \mathcal{O}(1/\varepsilon)\) | Consistent with DP-SignSGD | Consistent with DP-SignSGD |
### Ablation Study (Effect of Batch Noise, IMDB Dataset)
| Batch Size \(B\) | Threshold \(\varepsilon^\star\) for DP-SignSGD Advantage | Notes |
|---|---|---|
| 48 | Large | High batch noise; DP-SignSGD dominates throughout |
| 64 | Moderate | Transition regime |
| 80 | Small | Low batch noise; DP-SignSGD superior only under strict privacy |
### Key Findings
- On quadratic functions, theoretical predictions match experimental values exactly, validating the precision of the SDE analysis.
- On IMDB and StackOverflow, the \(1/\varepsilon^2\) and \(1/\varepsilon\) scalings for DP-SGD and DP-SignSGD respectively hold for both training and test loss.
- When batch noise is sufficiently large, DP-SignSGD outperforms DP-SGD across all \(\varepsilon\); when batch noise is small, a critical threshold \(\varepsilon^\star\) exists.
- DP-Adam behaves qualitatively consistently with DP-SignSGD, validating the use of SignSGD as a proxy for Adam.
## Highlights & Insights
- This work is the first to bring SDE tools into DP optimization analysis, revealing how privacy noise interacts with adaptivity in ways that prior discrete-time analyses cannot see.
- The practical implications are clear: DP-Adam/DP-SignSGD should be preferred under strict privacy constraints — not only for superior asymptotic performance, but also because their hyperparameters transfer across \(\varepsilon\) values, reducing the privacy cost of hyperparameter tuning.
## Limitations & Future Work
- The theory covers only DP-SGD and DP-SignSGD; DP-Adam is not directly analyzed (its treatment relies on empirical extension via the SignSGD proxy).
- Experiments are limited to logistic regression and simple convex problems; validation on deep networks is insufficient.
- The analysis assumes gradient noise follows Gaussian or Student-t distributions; the noise structure in practical deep learning may be more complex.
## Related Work & Insights
- vs. Li et al. (2022b): That work finds comparable performance between DP-SGD and DP-Adam in LLM fine-tuning (Protocol B); the present paper's Protocol B theory is consistent, but additionally identifies a fundamental practical advantage of DP-Adam in terms of hyperparameter usability.
- vs. Jin & Dai (2025): That work analyzes Noisy SignSGD from a privacy amplification perspective without accounting for clipping; this paper fully handles per-example clipping.
## Rating
- Novelty: ⭐⭐⭐⭐⭐ First SDE analysis of DP optimizers; solid theoretical contribution.
- Experimental Thoroughness: ⭐⭐⭐ Experiments are relatively simple (logistic regression); deep network validation is insufficient.
- Writing Quality: ⭐⭐⭐⭐ Theoretical derivations are rigorous, notation is consistent, and figures are highly informative.
- Value: ⭐⭐⭐⭐ Provides theoretical justification for DP optimizer selection; offers practical guidance for privacy-aware ML.