Learning Differentially Private Diffusion Models via Stochastic Adversarial Distillation
Conference: ECCV 2024
arXiv: 2408.14738
Area: Image Generation
TL;DR
DP-SAD trains differentially private diffusion models via stochastic adversarial distillation. It uses the diffusion model's timesteps to dilute the impact of DP noise, adds a discriminator to accelerate convergence, and combines the gradient chain rule with DP's post-processing property so that less randomness is injected. Without any pre-training, it achieves state-of-the-art quality for privacy-preserving image generation.
Background & Motivation
In privacy-sensitive areas (e.g., healthcare, finance), data sharing is restricted. Differentially private (DP) generative models offer a solution by releasing the trained generative model rather than the raw data. Existing approaches face three major challenges:
- Difficulties in GAN Training: Privacy constraints aggravate the already unstable GAN training process.
- Amplification of High-Dimensional Noise: As data and model dimensions increase, more DP noise is required to maintain the same level of privacy.
- Severe Damage from Direct Noise Addition: DPSGD adds noise to all gradients, introducing excessive randomness.
Existing DP diffusion models (e.g., DP-DM, DP-LDM) require pre-training on large datasets, and directly training diffusion models with DPSGD leads to excessive privacy budget consumption.
Method
Overall Architecture
DP-SAD consists of three components:
1. Teacher Model \(\epsilon_\psi\): Trained directly on private data without protection, used only to guide student training and is not released.
2. Student Model \(\epsilon_\theta\): Learns from the teacher via distillation, achieving DP by applying clipping and noise addition to the gradients.
3. Discriminator \(\epsilon_\phi\): Distinguishes between the teacher and student outputs, forming adversarial training to accelerate convergence.
Key Designs
Timestep Dilution of DP Noise: The core innovation is to use the \(T\) timesteps of the diffusion model to dilute the impact of DP noise. When the gradient is averaged over a batch of size \(B\) and over \(T\) timesteps, the DP noise \(\mathcal{N}(0, \sigma^2 C^2 \mathbf{I})\) is divided by \(B \cdot T\).
Increasing \(T\) does not increase the privacy budget but reduces the impact of noise on gradients.
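The dilution effect can be illustrated numerically. The following is a minimal sketch, assuming that noise \(\mathcal{N}(0, \sigma^2 C^2 \mathbf{I})\) is added once to the summed clipped gradient and the result is then divided by \(B \cdot T\), so that the per-coordinate noise standard deviation becomes \(\sigma C / (B T)\). The function names and example values are illustrative and do not come from the paper.

```python
import math
import random


def effective_noise_std(sigma: float, clip_norm: float, batch_size: int, timesteps: int) -> float:
    """Per-coordinate std of the DP noise after averaging over B * T terms.

    Assumes noise N(0, sigma^2 * C^2) is added once to the summed gradient,
    which is then divided by B * T (an assumption of this sketch).
    """
    return sigma * clip_norm / (batch_size * timesteps)


def empirical_noise_std(sigma, clip_norm, batch_size, timesteps, trials=20000, seed=0):
    """Monte Carlo estimate of the same quantity for a single coordinate."""
    rng = random.Random(seed)
    scale = batch_size * timesteps
    samples = [rng.gauss(0.0, sigma * clip_norm) / scale for _ in range(trials)]
    mean = sum(samples) / trials
    var = sum((s - mean) ** 2 for s in samples) / (trials - 1)
    return math.sqrt(var)


if __name__ == "__main__":
    # Larger T shrinks the noise that reaches the averaged gradient.
    for T in (1, 10, 100, 500):
        print(T, effective_noise_std(sigma=1.0, clip_norm=1.0, batch_size=16, timesteps=T))
```

Under this assumption only the product \(B \cdot T\) matters, so a batch of 16 with \(T=500\) yields the same effective noise as a batch of 64 with \(T=125\).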
Stochastic Timestep Sampling: Replacing the average gradient over \(T\) steps with the gradient of a single randomly selected timestep \(r\) drastically reduces computational overhead (no need to run the entire \(T\)-step diffusion chain for each sample). The intermediate state \(x_r\) is obtained directly via the forward process instead of reverse inference from noise.
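The sampling step above can be sketched as follows. This is a minimal illustration that assumes the standard DDPM closed-form forward process \(x_r = \sqrt{\bar{\alpha}_r}\, x_0 + \sqrt{1 - \bar{\alpha}_r}\, \epsilon\) and a linear \(\beta\) schedule; the schedule, its endpoints, and all function names are assumptions rather than details taken from the paper.

```python
import math
import random


def make_alpha_bars(num_steps: int, beta_start: float = 1e-4, beta_end: float = 0.02) -> list:
    """Cumulative products of (1 - beta_t) for a linear beta schedule (assumed)."""
    alpha_bars, running = [], 1.0
    for t in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * t / max(num_steps - 1, 1)
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def sample_noisy_state(x0, alpha_bars, rng):
    """Draw one timestep r uniformly and jump to x_r in closed form.

    No reverse chain is run: x_r is computed from x0 and fresh Gaussian noise.
    """
    r = rng.randrange(len(alpha_bars))  # 0-based index of the sampled timestep
    noise = [rng.gauss(0.0, 1.0) for _ in x0]
    a = alpha_bars[r]
    x_r = [math.sqrt(a) * xi + math.sqrt(1.0 - a) * ni for xi, ni in zip(x0, noise)]
    return r, x_r, noise


if __name__ == "__main__":
    rng = random.Random(0)
    alpha_bars = make_alpha_bars(500)
    r, x_r, _ = sample_noisy_state([0.5, -0.2, 0.1], alpha_bars, rng)
    print(r, x_r)
```

Each training sample then costs one forward-process evaluation and one student pass at timestep \(r\), instead of a pass for every one of the \(T\) steps.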
Chain Rule + Post-Processing Property: Utilizing the gradient chain rule \(\frac{\partial \mathcal{L}}{\partial \theta} = \frac{\partial \mathcal{L}}{\partial x_{\theta}} \cdot \frac{\partial x_\theta}{\partial \theta}\), clipping and noise addition are only applied to \(\frac{\partial \mathcal{L}}{\partial x_\theta}\). The term \(\frac{\partial x_\theta}{\partial \theta}\) requires no noise injection due to the post-processing property of DP, thereby reducing the introduction of randomness.
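A toy version of this split is sketched below. It assumes a hypothetical linear student \(x_\theta = W z\), for which \(\frac{\partial \mathcal{L}}{\partial W}\) is the outer product of \(\frac{\partial \mathcal{L}}{\partial x_\theta}\) and \(z\); the real student is a diffusion network, and the helper names here are illustrative only.

```python
import math
import random


def clip_by_l2_norm(vec, clip_norm):
    """Scale vec so that its L2 norm is at most clip_norm."""
    norm = math.sqrt(sum(v * v for v in vec))
    scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
    return [v * scale for v in vec]


def privatize_output_grad(grad_x, clip_norm, sigma, rng):
    """Clip dL/dx_theta, then add N(0, sigma^2 * C^2) noise to each coordinate."""
    clipped = clip_by_l2_norm(grad_x, clip_norm)
    return [g + rng.gauss(0.0, sigma * clip_norm) for g in clipped]


def backprop_linear(noisy_grad_x, z):
    """For the toy student x_theta = W z, dL/dW = outer(dL/dx_theta, z).

    Following the text, no further noise is added at this step: it only
    consumes the already-sanitized gradient with respect to the output.
    """
    return [[g * zj for zj in z] for g in noisy_grad_x]


if __name__ == "__main__":
    rng = random.Random(0)
    grad_x = [3.0, 4.0]      # toy dL/dx_theta with L2 norm 5
    z = [1.0, 2.0, -1.0]     # toy student input
    noisy = privatize_output_grad(grad_x, clip_norm=1.0, sigma=0.5, rng=rng)
    print(backprop_linear(noisy, z))
```

The noise vector has the dimension of the student output rather than of the parameter vector, which is where the reduction in injected randomness comes from in this sketch.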
Adversarial Discriminator: Concatenates teacher and student outputs as the discriminator input, with teacher labels as \([1,0]\) and student labels as \([0,1]\), driving the student output to closely match the teacher.
Loss & Training
- \(\mathcal{L}_{dis}\): MSE distillation loss = MSE(teacher output, student output) + MSE(real data, student output)
- \(\mathcal{L}_{adv}\): adversarial loss = \(\log(1 - \epsilon_\phi(\mathcal{C}(x_{\psi}, x_{\theta})))\)
- \(\lambda = 1\) (the optimal trade-off coefficient, determined via ablation)
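The loss terms listed above can be sketched in a few lines. This example assumes the student objective has the form \(\mathcal{L} = \mathcal{L}_{dis} + \lambda \mathcal{L}_{adv}\), which is inferred from \(\lambda\) being described as a trade-off coefficient, and it treats the discriminator output as a scalar score in \((0, 1)\). The logistic `toy_discriminator` is a hypothetical stand-in for \(\epsilon_\phi\), not the paper's architecture.

```python
import math


def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def distillation_loss(teacher_out, student_out, real_data):
    """L_dis = MSE(teacher output, student output) + MSE(real data, student output)."""
    return mse(teacher_out, student_out) + mse(real_data, student_out)


def toy_discriminator(teacher_out, student_out, weights):
    """Hypothetical discriminator: sigmoid of a linear score on the concatenation."""
    concat = list(teacher_out) + list(student_out)  # C(x_psi, x_theta)
    score = sum(w * v for w, v in zip(weights, concat))
    return 1.0 / (1.0 + math.exp(-score))


def adversarial_loss(disc_prob):
    """L_adv = log(1 - D(C(x_psi, x_theta))) for a score disc_prob in (0, 1)."""
    return math.log(1.0 - disc_prob)


def total_student_loss(teacher_out, student_out, real_data, disc_prob, lam=1.0):
    """Assumed combination: L_dis + lambda * L_adv."""
    return distillation_loss(teacher_out, student_out, real_data) + lam * adversarial_loss(disc_prob)


if __name__ == "__main__":
    teacher, student, real = [1.0, 0.0], [0.0, 0.0], [0.0, 2.0]
    p = toy_discriminator(teacher, student, weights=[0.0, 0.0, 0.0, 0.0])
    print(total_student_loss(teacher, student, real, p, lam=1.0))
```

With `lam=0.0` the objective reduces to pure distillation, which corresponds to the \(\lambda = 0\) setting in the ablation.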
Key Experimental Results
Main Results
CelebA 64×64 perceptual quality comparison (\(\varepsilon=10\), while methods not requiring pre-training use \(\varepsilon=10^4\)):
| Method | Requires Pre-training | IS ↑ | FID ↓ |
|---|---|---|---|
| DP-GAN | No | 1.00 | 403.94 |
| PATE-GAN | No | 1.00 | 397.62 |
| GS-WGAN | No | 1.00 | 384.78 |
| DP-MERF | No | 1.36 | 327.24 |
| DPGEN | No | 1.48 | 55.91 |
| DP-LDM | Yes | N/A | 14.30 |
| DP-SAD | No | 2.37 | 11.26 |
Classification accuracy comparison (\(\varepsilon=1\) / \(\varepsilon=10\)):
| Method | MNIST | FMNIST | CelebA-H | CelebA-G |
|---|---|---|---|---|
| DP-GAN | 0.404/0.801 | 0.105/0.610 | 0.533/0.521 | 0.345/0.392 |
| GS-WGAN | 0.143/0.808 | 0.166/0.658 | 0.590/0.614 | 0.420/0.523 |
| DPGEN | 0.905/0.936 | 0.828/0.878 | 0.700/0.884 | 0.661/0.815 |
| DP-DM (Pre-trained) | 0.952/0.981 | 0.794/0.862 | N/A | N/A |
| DP-SAD | 0.962/0.976 | 0.844/0.896 | 0.915/0.928 | 0.826/0.841 |
Ablation Study
Impact of the trade-off coefficient \(\lambda\) (CelebA, \(\varepsilon=10\)):
| \(\lambda\) | 0.0 | 0.2 | 0.5 | 1.0 | 2.0 | 4.0 |
|---|---|---|---|---|---|---|
| FID ↓ | 14.63 | 13.41 | 12.38 | 11.68 | 12.11 | 13.55 |
| IS ↑ | 2.12 | 2.18 | 2.26 | 2.37 | 2.35 | 2.31 |
Impact of the number of timesteps \(T\): As \(T\) increases, IS rises steadily and FID decreases consistently, which supports the claim that timesteps dilute DP noise. The experiments use \(T=500\) to balance efficiency and quality.
Ablation on model conditioning: The combination of student conditioning + discriminator conditioning yields the best performance.
Key Findings
- No pre-training required while still outperforming the pre-trained DP-LDM (FID: 11.26 vs 14.30), demonstrating the strength of the proposed method itself.
- Outperforms other training-from-scratch methods by at least 16 percentage points on complex tasks (CelebA, \(\varepsilon=1\)).
- Increasing \(T\) acts as a "free" way to improve the privacy-utility trade-off: generation quality improves without consuming more privacy budget.
- The method permits a small batch size + large \(T\) instead of a large batch size, enabling training in resource-constrained scenarios.
Highlights & Insights
- A New Utility for Diffusion Timesteps: This is the first work to discover and utilize diffusion timesteps to dilute the impact of DP noise, establishing a novel connection between diffusion model architecture and privacy protection.
- Rigorous Privacy Proofs: Provides a complete privacy bound analysis using Rényi DP and the Gaussian mechanism.
- High Practicality: Does not require large-scale pre-training datasets and supports resource-constrained scenarios.
- Gradient Chain Rule + Post-Processing: Cleverly exploits the post-processing property of DP to reduce noise injection into model parameter gradients.
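For readers unfamiliar with the Rényi DP accounting mentioned in the highlights above, the following is a minimal sketch of the textbook pieces: the Gaussian mechanism with noise multiplier \(\sigma\) satisfies \((\alpha, \alpha / (2\sigma^2))\)-RDP, RDP costs add up under composition, and \((\alpha, \varepsilon_{\mathrm{RDP}})\)-RDP implies \((\varepsilon_{\mathrm{RDP}} + \log(1/\delta)/(\alpha - 1), \delta)\)-DP. The sketch ignores subsampling amplification and is not the paper's exact bound; the function names and the grid of orders are illustrative.

```python
import math


def rdp_gaussian(alpha: float, sigma: float) -> float:
    """RDP cost of order alpha for the Gaussian mechanism.

    With L2 sensitivity C and noise std sigma * C, the cost is alpha / (2 * sigma^2).
    """
    return alpha / (2.0 * sigma ** 2)


def rdp_to_dp(rdp_eps: float, alpha: float, delta: float) -> float:
    """Standard conversion from (alpha, rdp_eps)-RDP to (eps, delta)-DP."""
    return rdp_eps + math.log(1.0 / delta) / (alpha - 1.0)


def epsilon_after_steps(sigma: float, steps: int, delta: float, orders=None) -> float:
    """Compose `steps` Gaussian releases and return the best eps over a grid of orders.

    Subsampling amplification is ignored, so this is looser than accountants
    that exploit it.
    """
    orders = orders or [1.0 + x / 10.0 for x in range(1, 1000)]
    return min(rdp_to_dp(steps * rdp_gaussian(a, sigma), a, delta) for a in orders)


if __name__ == "__main__":
    for sigma in (1.0, 2.0, 4.0):
        print(sigma, epsilon_after_steps(sigma, steps=100, delta=1e-5))
```

Note that \(T\) does not appear anywhere in this accounting, only \(\sigma\), the number of noisy releases, and \(\delta\).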
Limitations & Future Work
- The teacher model is trained on private data without protection—although not released, it poses a potential risk.
- Validated only on relatively low resolutions (32×32, 64×64); the performance on high resolutions remains unknown.
- Requires an additional clustering step (MoCo + k-means) for unlabeled data.
- A gap may exist between the theoretical privacy bound and the actual level of privacy protection.
Rating
- Novelty: ⭐⭐⭐⭐ — The idea of timestep dilution of DP noise is novel, and the adversarial distillation framework is well-designed.
- Practicality: ⭐⭐⭐⭐ — Non-reliance on pre-training, support for resource-constrained scenarios, SOTA performance.
- Experimental Thoroughness: ⭐⭐⭐⭐ — 11 baselines, 3 datasets, and thorough ablation analyses.
- Writing Quality: ⭐⭐⭐⭐ — Rigorous theoretical derivation and complete privacy analysis, though the writing could be more concise.