Differentially Private Preference Data Synthesis for Large Language Model Alignment
Conference: ICML 2026
arXiv: 2605.30808
Code: https://github.com/gfengyu/Differentially-Private-Preference-Data-Synthesis
Area: LLM Security / Differential Privacy / Preference Alignment
Keywords: Differential Privacy, Preference Data Synthesis, Bradley-Terry, DP-PCA, DPO/RLHF
TL;DR
DPPrefSyn replaces DP fine-tuning on private preference data with a two-stage alternative: learn a distribution over DP preference reward models, then synthesize preference data on public prompts. It exploits the geometric structure of Bradley-Terry linear rewards, using DP-PCA and DP-KMeans clustering to capture heterogeneity in user preferences. At \(\varepsilon=2\) it reaches a 56.48% GPT-4o win-rate on Anthropic-HH, outperforming both DP-FT at the same budget (37.02%) and non-private fine-tuning (38.72%).
Background & Motivation
Background: LLM preference alignment (RLHF / DPO) relies on triplet data consisting of a prompt, a pair of responses, and human preference labels. These datasets (e.g., Anthropic-HH, OpenAssistant, TL;DR) often contain sensitive information such as health, identity, and political orientation in prompts, and annotations themselves may leak annotator preferences.
Limitations of Prior Work: Existing DP alignment works fall into three categories: (1) Label-DP (Chowdhury 2024, Zhang 2025), which only protects labels while prompts remain exposed; (2) Private fine-tuning for specific algorithms like DP-PPO (Wu 2023a), which is incompatible with DPO; (3) DP synthesis for instructions (Yu 2024) that does not target preference pairs. All three offer "partial protection" or are "algorithm-specific" and suffer from limited private data volume—human preference annotation is extremely expensive.
Key Challenge: Preference data exhibits strong heterogeneity (different users value different aspects: accuracy, politeness, creativity), but DP-SGD has extremely low sample efficiency in high-dimensional embedding spaces. Furthermore, it is desirable for the DP-protected output to be reusable for DPO / RLHF / various downstream LLMs without further budget consumption.
Goal: (1) Protect all private signals including prompt + response + label; (2) Maintain compatibility with any alignment algorithm like DPO or RLHF; (3) Surpass the utility baseline of DP fine-tuning on private data.
Key Insight: Transform the task from "privately fine-tuning an alignment model" to "learning a distribution of preference reward models under DP, then using it to construct synthetic preference pairs on public prompts." Public prompts do not consume budget, so the entire budget goes into building the preference models. By the DP post-processing property, the synthetic data can then be reused arbitrarily.
Core Idea: Bradley-Terry + Linear Reward \(\rightarrow\) preference = sign of \(\langle \theta, \phi(x, a^+) - \phi(x, a^-) \rangle\); cluster based on \(\phi\) difference vectors to capture heterogeneous preferences; use DP-PCA for dimension reduction to save samples, DP-KMeans for clustering, and DP-SGD to learn linear rewards for each cluster; finally, sample across the cluster distribution on public prompts and use the corresponding reward models to select best/worst pairs.
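To make the geometric picture concrete, the Bradley-Terry likelihood with a linear reward can be written out for a single preference pair. The shorthand \(d\) for the difference vector and the loss symbol \(\ell\) are notation introduced here for illustration; the paper's exact formulation may differ.

\[
d = \phi(x, a^+) - \phi(x, a^-), \qquad \mathbb{P}[a^+ \succ a^-] = \sigma(\langle \theta, d \rangle) = \frac{1}{1 + e^{-\langle \theta, d \rangle}}
\]

\[
\ell(\theta; d) = -\log \sigma(\langle \theta, d \rangle), \qquad \nabla_\theta \ell(\theta; d) = -\bigl(1 - \sigma(\langle \theta, d \rangle)\bigr)\, d
\]

The model predicts \(a^+\) as preferred exactly when \(\langle \theta, d \rangle > 0\), and every per-example gradient is a scaled copy of \(d\) with norm at most \(\|d\|\). Both the reward and the DP sensitivity therefore depend on the data only through the difference vectors, which is what makes clustering on them natural.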
Method
Overall Architecture
DPPrefSyn consists of three steps:
1. Preference Representation + Clustering: For each \((x_i, a_i^+, a_i^-)\), compute the difference vector \(d_i = \phi(x_i, a_i^+) - \phi(x_i, a_i^-)\); apply DP-PCA to reduce it to \(p=20\) dimensions (consumes \(\varepsilon_0\)); then use DP-KMeans to partition the projected vectors into \(K=5\) clusters (consumes \(\varepsilon_1\)).
2. DP Reward Model Training: Learn a linear \(\theta_k \in \mathbb{R}^p\) for each cluster using DP-SGD. Because the clusters are disjoint, parallel composition applies, so training all \(K\) models costs the remaining \(\varepsilon - \varepsilon_0 - \varepsilon_1\) in total rather than \(K\) times that amount.
3. Synthetic Generation: Normalize the DP cluster histogram into sampling probabilities, \(\bm p \leftarrow \bm h / |\mathcal{D}_{\text{priv}}|\). For each public prompt \(\tilde x_j\), the LLM generates \(L=5\) candidates; sample a cluster \(k \sim \bm p\), score the candidates with \(\theta_k\), and take the highest- and lowest-reward responses as \((\tilde a^+, \tilde a^-)\). Pairs whose reward difference is below 0.5 are discarded to ensure quality.
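The dimension-reduction part of step 1 can be sketched as follows. This is a minimal illustration, assuming a simple "Gaussian noise on the covariance matrix" variant of DP-PCA; the paper's actual DP-PCA mechanism may differ. The function name `dp_pca_project` and the parameter `noise_std` are hypothetical, and `noise_std` is assumed to be calibrated elsewhere to the budget \(\varepsilon_0\).

```python
import numpy as np

def dp_pca_project(diffs, p, noise_std, rng):
    """Project difference vectors onto the top-p eigenvectors of a noisy covariance.

    Assumption: noise_std has been calibrated elsewhere to the PCA budget.
    """
    # Clip each row to L2 norm <= 1 so a single example has bounded influence.
    norms = np.linalg.norm(diffs, axis=1, keepdims=True)
    clipped = diffs / np.maximum(1.0, norms)

    # Unnormalized second-moment matrix of the clipped vectors.
    cov = clipped.T @ clipped

    # Symmetric Gaussian noise: draw a full matrix, mirror the upper triangle.
    raw = rng.normal(0.0, noise_std, size=cov.shape)
    noise = np.triu(raw) + np.triu(raw, k=1).T

    # eigh returns eigenvalues in ascending order, so reverse and keep p columns.
    _, eigvecs = np.linalg.eigh(cov + noise)
    basis = eigvecs[:, ::-1][:, :p]
    return clipped @ basis, basis

rng = np.random.default_rng(0)
diffs = rng.normal(size=(500, 64))  # toy stand-in for 1024-D difference vectors
reduced, basis = dp_pca_project(diffs, p=20, noise_std=1.0, rng=rng)
print(reduced.shape)
```

The projected vectors are still private data; only the basis is a DP release in this sketch, and the later clustering and training steps must each add their own noise.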
A PRV accountant is used for tight composition of DP-SGD steps; the post-processing property allows the synthetic data to be reused indefinitely.
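Step 2 can be illustrated with a bare-bones DP-SGD loop for one cluster's linear Bradley-Terry reward. This is a sketch under stated assumptions, not the authors' implementation: the function name, hyperparameters, and toy data are hypothetical, batches are drawn at a fixed size for simplicity (privacy accountants usually assume Poisson sampling), and no accounting is performed here.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def dp_sgd_linear_reward(diffs, steps, lr, clip, noise_mult, batch_size, rng):
    """Fit theta for P[a+ > a-] = sigmoid(<theta, d>) with clipped, noised gradients."""
    n, p = diffs.shape
    theta = np.zeros(p)
    for _ in range(steps):
        idx = rng.choice(n, size=batch_size, replace=False)
        d = diffs[idx]
        # Per-example gradient of -log sigmoid(<theta, d>).
        grads = -(1.0 - sigmoid(d @ theta))[:, None] * d
        # Clip each per-example gradient to L2 norm <= clip.
        norms = np.linalg.norm(grads, axis=1, keepdims=True)
        grads = grads / np.maximum(1.0, norms / clip)
        # Sum, add Gaussian noise scaled to the clipping bound, then average.
        noisy_sum = grads.sum(axis=0) + rng.normal(0.0, noise_mult * clip, size=p)
        theta -= lr * noisy_sum / batch_size
    return theta

# Toy cluster: every difference vector is oriented to agree with a hidden direction.
rng = np.random.default_rng(0)
true_theta = np.zeros(20)
true_theta[0] = 1.0
raw = rng.normal(size=(2000, 20))
diffs = raw * np.sign(raw @ true_theta)[:, None]

theta = dp_sgd_linear_reward(diffs, steps=200, lr=0.5, clip=1.0,
                             noise_mult=1.0, batch_size=256, rng=rng)
accuracy = float(np.mean(diffs @ theta > 0))
print(f"training-pair agreement: {accuracy:.3f}")
```

With only \(p=20\) parameters, the noise added per step is small relative to the summed gradient signal, which is the intuition behind reducing dimension before running DP-SGD.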
Key Designs
- Bradley-Terry Linear Reward \(\rightarrow\) Geometric Clustering:
  - Function: Capture heterogeneous user preferences in a low-dimensional space, replacing a single global reward with a family of cluster rewards.
  - Mechanism: Under the BT model, \(\mathbb{P}[a^+ \succ a^-] = \sigma(\langle \theta, \phi(x,a^+) - \phi(x,a^-)\rangle)\). Users with similar preferences have aligned \(\phi\) difference vectors, allowing clustering by difference vectors to approximate cluster-specific \(\theta_k\). Discovered clusters are interpretable (e.g., "focus on factuality" vs. "focus on politeness").
  - Design Motivation: A single reward cannot represent heterogeneous preferences; clustering avoids high-dimensional multi-model issues; the linear structure balances expressivity and DP-friendliness (DP-SGD sample efficiency is much higher for linear models than deep models).
- Budget Configuration for DP-PCA + DP-KMeans + DP-SGD:
  - Function: Improve DP-SGD sample efficiency through dimension reduction and phased budget consumption.
  - Mechanism: The original embeddings are 1024D; training a reward model directly via DP-SGD at this dimension requires far more samples than private preference data provides. DP-PCA projects the difference vectors to \(p=20\) dimensions to retain the primary signals. The budget is split as \(\varepsilon_0\) (PCA) + \(\varepsilon_1\) (KMeans) + the remainder for DP-SGD. Because the clusters are disjoint, parallel composition applies: the DP-SGD stage costs the maximum of the per-cluster budgets rather than their sum, and utility is limited mainly by the smallest cluster, where noise weighs most relative to signal.
  - Design Motivation: Dimension reduction is a standard technique for high-dimensional DP training. PCA is more targeted than random projection; KMeans ensures intra-cluster preference homogeneity so linear models suffice.
- Public Prompts to Save Budget + Candidate Scoring for Pair Construction:
  - Function: Spend all DP budget on "building preferences" rather than "synthesizing prompts."
  - Mechanism: Use public prompt sets (Alpaca / SafeRLHF / XSum); for each prompt, an LLM generates 5 candidates at high temperature. Draw a cluster \(k\) according to the DP histogram, compute rewards with \(\theta_k\), and form a preference pair from the highest- and lowest-reward candidates. Discard pairs whose reward difference is below 0.5 to avoid noisy pairs.
  - Design Motivation: Synthesizing prompts consumes budget and yields poor quality. Using public prompts saves this budget; distribution shift is mitigated on the assumption that \(\theta_k\) captures preferences that stay largely invariant across prompts.
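The pair-construction step described above can be sketched in a few lines. The helper names, the toy feature vectors, and the clip-and-renormalize treatment of the noisy histogram are assumptions made for illustration (the source only specifies dividing \(\bm h\) by \(|\mathcal{D}_{\text{priv}}|\)); candidate features are assumed to be already projected into the same \(p\)-dimensional space as \(\theta_k\).

```python
import numpy as np

def histogram_to_probs(noisy_counts):
    """Turn a noisy cluster histogram into valid sampling probabilities.

    Assumption: negative noisy counts are clipped to zero before normalizing.
    """
    counts = np.clip(np.asarray(noisy_counts, dtype=float), 0.0, None)
    return counts / counts.sum()

def build_pair(candidate_feats, thetas, cluster_probs, rng, min_margin=0.5):
    """Return (best_idx, worst_idx, cluster), or None if the reward gap is too small."""
    # Sample which cluster's reward model judges this prompt.
    k = rng.choice(len(thetas), p=cluster_probs)
    # Linear reward for every candidate response.
    rewards = candidate_feats @ thetas[k]
    best, worst = int(np.argmax(rewards)), int(np.argmin(rewards))
    # Drop pairs the reward model cannot separate clearly.
    if rewards[best] - rewards[worst] < min_margin:
        return None
    return best, worst, int(k)

rng = np.random.default_rng(0)
probs = histogram_to_probs([120.3, -2.1, 80.0])

# Two toy cluster rewards in a 2-D feature space, five candidates for one prompt.
thetas = np.array([[1.0, 0.0], [0.0, 1.0]])
feats = np.array([[0.0, 0.0], [1.0, 0.0], [-1.0, 0.0], [0.2, 0.0], [0.1, 0.0]])

# Force cluster 0 so the outcome is deterministic.
pair = build_pair(feats, thetas, np.array([1.0, 0.0]), rng)
weak = build_pair(0.1 * feats, thetas, np.array([1.0, 0.0]), rng)
print(pair, weak)
```

Everything in this step operates on DP outputs (\(\theta_k\), the histogram) and public prompts, so by post-processing it spends no additional privacy budget.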
Key Experimental Results
Main Results: GPT-4o Win-rate (Pythia-2.8B + SFT+DPO)
| Task | \(\varepsilon=0\) (base) | DP-FT \(\varepsilon=2\) | DPPrefSyn \(\varepsilon=2\) | DP-FT \(\varepsilon=\infty\) (Non-private) |
|---|---|---|---|---|
| OpenAssistant | 2.11 | 6.18 | 11.04 | 8.20 |
| Anthropic-HH | 12.14 | 37.02 | 56.48 | 38.72 |
| TL;DR | 11.64 | 35.2 | 53.8 | 39.5 |
Under strong privacy (\(\varepsilon = 2\)), DPPrefSyn significantly outperforms DP-FT and even exceeds non-private fine-tuning (DP-FT at \(\varepsilon = \infty\)) on all three tasks, suggesting that DP here is not purely a utility cost and may act as a regularizer.
Privacy vs. Performance (Anthropic-HH)
| \(\varepsilon\) | DP-FT win-rate | DPPrefSyn win-rate |
|---|---|---|
| 0.5 | 35.00 | 55.08 |
| 1 | 36.27 | 55.96 |
| 2 | 37.02 | 56.48 |
| 4 | 36.74 | 56.51 |
| 8 | 36.94 | 56.86 |
| ∞ | 38.72 | 57.53 |
DPPrefSyn stays above 55% at every tested \(\varepsilon\), while DP-FT stays at 35-37% for all finite budgets and reaches only 38.72% without privacy. DPPrefSyn is insensitive to the budget because the budget is spent only on low-dimensional linear reward models, which is far more sample-efficient than privately training an entire LLM.
Ablation Study (OpenAssistant, \(\varepsilon = 2\))
| Configuration | win-rate |
|---|---|
| Full DPPrefSyn | 11.04 |
| w/o DP-PCA (Direct 1024D DP-SGD) | 6.32 |
| w/o KMeans (Single global reward) | 8.41 |
| DP synthesized prompts instead of public | 7.95 |
| GPT-2 fine-tuned reward instead of linear | 11.21 |
DP-PCA contributes the most (removing it costs 4.7 points), followed by public prompts (3.1 points) and clustering for heterogeneity (2.6 points). The linear reward performs on par with a fine-tuned GPT-2 reward (11.04 vs. 11.21), suggesting the linear structure is sufficient here.
Key Findings
- DP Synthetic Data > Direct DP Fine-tuning: DPPrefSyn beats DP-FT at all \(\varepsilon\), challenging the common belief that synthetic data loses information.
- Dimension Reduction is Vital for High-D DP: Dropping DP-PCA costs 4.7 points on OpenAssistant (11.04 to 6.32), bringing DPPrefSyn down to roughly the DP-FT level (6.18); DP-SGD in 1024D appears to learn little at this sample size.
- Heterogeneous Preference Modeling works: Clustering adds 2.6 points over a single global reward, consistent with human preferences being multimodal.
- Post-processing Reusability: Once the synthetic dataset is generated, it can be used for different models/algorithms (SFT, DPO, RLHF) with zero additional budget.
Highlights & Insights
- Elegant "Trio" of DP-PCA + Linear Reward + Clustering: Each component addresses a specific pain point of high-dimensional DP training (sample efficiency / expressivity / heterogeneity). Together they let a private pipeline match or exceed the non-private fine-tuning baseline.
- Maximizing Post-processing Property: Once released, the synthetic data carries the DP guarantee with it, so it can be reused arbitrarily at no extra budget. This is a fundamental advantage of DP synthesis over DP fine-tuning, and the paper exploits it fully.
- "DP as Regularization" Phenomenon: DPPrefSyn at \(\varepsilon=2\) outperforms the non-private DP-FT baseline, suggesting DP noise may act as a regularizer on heterogeneous data and mitigate overfitting to specific annotator biases. Since DPPrefSyn at \(\varepsilon=\infty\) (57.53) still edges out \(\varepsilon=2\) (56.48) on Anthropic-HH, part of the gain likely comes from the synthesis pipeline itself.
- Insight on Public vs. Private Prompts: The authors argue that "user preference decouples from prompt distribution," allowing public prompts to carry private preferences—a logic applicable to other preference tasks (recommendation, advertising).
Limitations & Future Work
- The linear reward assumption might be too strong; it lacks expressivity for non-linear preferences (e.g., compositional or long-range dependency judgments).
- The choice of \(K=5\) clusters lacks a principled method and relies on heuristics; too many or too few clusters hurt performance.
- If public prompt distributions deviate severely from private ones, coverage may be insufficient; there is a lack of quantitative analysis on distribution shift.
- Validation is limited to Pythia-2.8B; biases from DP-PCA dimension reduction might be amplified in larger models (e.g., 13B+).
Related Work & Insights
- vs DP-FT / DP-PPO / DP-RLHF: These fine-tune the alignment model directly; budgets must be re-spent for every algorithm or model change. DPPrefSyn: DP once, reuse many times.
- vs label-DP (Chowdhury / Zhang): Protections are label-only; prompts are still leaked. DPPrefSyn protects everything.
- vs DP synthetic instructions (Yu 2024): General instruction synthesis isn't optimized for preference. DPPrefSyn synthesizes preference pairs directly for better alignment.
- vs Aug-PE (Xie 2024): Generic text synthesis via LLM API iterations. DPPrefSyn is more specialized for preference pairs using the BT geometric structure.
- Insight: Replacing "DP direct training" with "DP learn abstraction \(\rightarrow\) Synthetic data \(\rightarrow\) Reuse" can be generalized to many supervised tasks requiring label protection (medical, legal, recommendation).
Rating
- Novelty: ⭐⭐⭐⭐⭐ First dedicated DP preference pair synthesis; systematic strategy combining BT geometry + DP-PCA + clustering.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ 3 tasks × 5 \(\varepsilon\) values × multiple models × detailed ablation; comprehensive head-to-head with DP-FT.
- Writing Quality: ⭐⭐⭐⭐ Clear three-step algorithm; budget allocation well-explained; Fig 1 is intuitive; geometric arguments for BT-clustering could be deeper.
- Value: ⭐⭐⭐⭐⭐ DP alignment is a compliance necessity for enterprise LLM deployment, and the paper provides a pipeline ready for industry adoption.