DeepPersona: A Generative Engine for Scaling Deep Synthetic Personas¶
Conference: NeurIPS 2025 · arXiv: 2511.07338 · Code: https://deeppersona-ai.github.io/ · Area: AI Safety · Keywords: Synthetic Personas, Persona Simulation, LLM Personalization, Social Simulation, Attribute Taxonomy
TL;DR¶
This paper presents DeepPersona, a two-stage taxonomy-guided synthetic persona generation engine. Stage 1 mines a human attribute taxonomy with 8,000+ nodes from real user–ChatGPT conversations; Stage 2 generates narratively coherent personas averaging 200+ structured attributes via progressive attribute sampling. The approach achieves an 11.6% improvement in personalized QA accuracy and a 31.7% reduction in social survey simulation bias.
Background & Motivation¶
Background: Using LLMs to generate synthetic personas has been widely adopted in personalized assistants, social behavior simulation, role-playing agents, and alignment research. PersonaHub can generate up to one billion brief persona descriptions.
Limitations of Prior Work: Existing synthetic personas are extremely shallow—typically containing fewer than 30 manually defined attributes or a few lines of templated description, lacking depth, diversity, and authenticity. Directly scaling with LLMs leads to insufficient diversity, stereotypical biases, and overly optimistic tendencies.
Key Challenge: Depth is the critical bottleneck for narratively coherent personas. Existing methods can scale in quantity and diversity, but attribute depth consistently remains in the single-to-double-digit range, failing to capture the rich complexity of real human individuals.
Goal: To build a scalable, data-driven method that simultaneously achieves (a) broad attribute coverage (\(k > 10^2\) attributes); (b) diversity (free of stereotypes); and (c) internal consistency.
Key Insight: Extract attributes from real user self-disclosure conversations to construct a taxonomy, then use the taxonomy to guide progressive sampling rather than directly prompting LLMs to generate personas.
Core Idea: A data-driven 8,000+ node attribute taxonomy guides progressive sampling, transforming LLMs from "free generation" to "structured slot-filling," achieving both depth and diversity in synthetic personas.
Method¶
Overall Architecture¶
A two-stage pipeline: Stage 1 constructs a human attribute taxonomy \(T\) from real conversations; Stage 2 takes a small set of anchor attributes \(S\) and produces a narratively complete synthetic persona \(P = \{\langle a_i, v_i \rangle\}\) via progressive attribute sampling and LLM-based value generation.
Key Designs¶
- Human Attribute Taxonomy Construction (Stage 1):
- Function: Extract, organize, and merge attribute nodes from 62,224 high-quality personalized QA pairs.
- Mechanism: GPT-4.1-mini first classifies each QA pair as "non-personalizable / partially personalizable / personalizable" to filter personalized dialogues; hierarchical attributes (e.g., Lifestyle → Food Preference → Vegan) are then recursively extracted; multiple candidate hierarchies are merged using a semantic similarity threshold.
- Result: A taxonomy with 12 top-level categories and 8,496 unique nodes. Most attributes are no deeper than 3 levels.
- Semantic validation and filtering: Two stages—before merging, personalizability, semantic coherence, and appropriate abstraction level are verified; after merging, duplicates are removed and erroneous parent–child relationships are corrected.
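The merge step above can be sketched in code. This is an illustrative assumption, not the paper's implementation: the embedding function, the greedy first-match strategy, and the 0.85 threshold are all placeholders standing in for whatever semantic-similarity machinery the authors actually use.

```python
# Sketch of the Stage-1 merge step: candidate attribute nodes mined from
# different conversations are merged when their embeddings exceed a
# similarity threshold. Embedding model, threshold value, and the greedy
# strategy are illustrative assumptions.
import numpy as np

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def merge_nodes(candidates, embed, threshold=0.85):
    """Greedily merge candidate attribute names whose embeddings are more
    similar than `threshold`; the first-seen name becomes canonical.
    Returns a mapping from each candidate to its canonical node."""
    merged = []   # list of (canonical_name, embedding)
    mapping = {}  # candidate name -> canonical name
    for name in candidates:
        e = embed(name)
        for canon, ce in merged:
            if cosine(e, ce) >= threshold:
                mapping[name] = canon
                break
        else:
            merged.append((name, e))
            mapping[name] = name
    return mapping
```

In the actual pipeline this would run per level of the hierarchy, followed by the post-merge validation pass that removes duplicates and fixes erroneous parent-child links.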
- Progressive Attribute Sampling (Stage 2):
- Function: Iteratively select attributes from the taxonomy and fill in values with an LLM until the target depth \(k\) is reached.
- Sampling is modeled as a structured distribution: \(P \sim \mathcal{F}_{\theta,T}(\cdot|S,k) = \prod_{i=1}^{k} \Pr(a_i|S,P_{<i},T) \cdot \Pr_\theta(v_i|a_i,S,P_{<i})\)
- Four key design choices:
- Anchored stable core: Core attributes such as age, location, career, and values are fixed first to prevent sampling drift.
- Unbiased value assignment: Demographic attributes (age, gender, occupation, etc.) are sampled from predefined statistical distribution tables rather than generated by the LLM, avoiding majority-culture default biases.
- Balanced attribute diversification: Candidate attributes are divided into near/mid/far tiers by cosine similarity to core attributes and sampled at a 5:3:2 ratio, balancing coherence with novelty.
- Progressive LLM slot-filling: A random breadth-first traversal of the taxonomy prioritizes long-tail branches; each selected attribute's value is generated by the LLM conditioned on the existing profile \(P_{<i}\).
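The four design choices above can be combined into a short sketch. The tier cut-offs, the `sim_to_core` similarity function, and the `fill_value` stub (standing in for the LLM conditioned on \(P_{<i}\)) are assumptions for demonstration; only the anchored core, the 5:3:2 near/mid/far ratio, and the iterate-until-depth-\(k\) loop come from the paper.

```python
# Illustrative sketch of Stage-2 progressive sampling: anchor a stable core,
# tier the remaining candidates by similarity to that core, then draw
# near/mid/far attributes at a 5:3:2 ratio until depth k is reached.
# Tier cut-offs and the fill_value LLM stub are assumptions.
import random

def tier_attributes(candidates, sim_to_core, near=0.7, far=0.4):
    """Split candidate attributes into near/mid/far tiers by cosine
    similarity to the anchored core attributes."""
    tiers = {"near": [], "mid": [], "far": []}
    for a in candidates:
        s = sim_to_core(a)
        tiers["near" if s >= near else "mid" if s >= far else "far"].append(a)
    return tiers

def sample_profile(core, candidates, sim_to_core, fill_value, k, seed=0):
    """Progressively extend an anchored core profile to depth k."""
    rng = random.Random(seed)
    profile = dict(core)  # anchored stable core: fixed first, never resampled
    tiers = tier_attributes(candidates, sim_to_core)
    weights = {"near": 5, "mid": 3, "far": 2}  # the paper's 5:3:2 ratio
    while len(profile) < k:
        pool = [t for t in ("near", "mid", "far") if tiers[t]]
        if not pool:
            break  # taxonomy exhausted before reaching depth k
        t = rng.choices(pool, weights=[weights[p] for p in pool])[0]
        attr = tiers[t].pop(rng.randrange(len(tiers[t])))
        profile[attr] = fill_value(attr, profile)  # LLM slot-filling on P_<i
    return profile
```

Passing the growing `profile` back into `fill_value` is what makes each new value conditioned on everything sampled so far, mirroring the \(\Pr_\theta(v_i \mid a_i, S, P_{<i})\) factor in the structured distribution.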
- Life-Story-Driven Core Attribute Inference:
- Function: For core attributes without predefined categories (e.g., hobbies and interests), inference is grounded through a life-story narrative.
- Mechanism: Fix demographics → LLM infers core values → expand to life attitudes → fabricate 1–3 life-story vignettes → derive interests and hobbies from the stories.
- Design Motivation: Produces greater depth and internal consistency than direct generation.
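The inference chain (demographics → values → attitudes → vignettes → hobbies) can be sketched as a sequence of conditioned calls. The `ask` callable stands in for an LLM call and the prompt strings are hypothetical; the point is that each step is conditioned on everything inferred before it.

```python
# Minimal sketch of the life-story-driven inference chain. `ask(prompt,
# profile)` is a stand-in for an LLM call conditioned on the partial
# profile; prompts and field names are illustrative assumptions.
def infer_core(demographics, ask, n_stories=3):
    profile = {"demographics": demographics}          # fixed first
    profile["values"] = ask("Infer core values.", profile)
    profile["attitudes"] = ask("Expand values into life attitudes.", profile)
    # 1-3 fabricated life-story vignettes ground the later inferences
    profile["stories"] = [ask(f"Write life-story vignette {i + 1}.", profile)
                          for i in range(n_stories)]
    profile["hobbies"] = ask("Derive interests and hobbies from the stories.",
                             profile)
    return profile
```

Because the vignettes are generated before the hobbies, the hobbies are grounded in concrete narrative events rather than sampled in isolation, which is the source of the consistency gain the paper reports over direct generation.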
Loss & Training¶
DeepPersona is a generation framework rather than a trained model. GPT-4.1 and GPT-4.1-mini serve as the underlying LLM \(\theta\).
Key Experimental Results¶
Intrinsic Evaluation¶
| Metric | PersonaHub | OpenCharacter | DeepPersona |
|---|---|---|---|
| Avg. # Attributes | 3.98 | 38.50 | 50.92 |
| Uniqueness (1–5) | 2.50 | 2.86 | 4.12 (+44%) |
| Actionability (1–5) | 3.60 | 4.78 | 5.00 |
Social Survey Simulation (World Values Survey, 6-country average)¶
| Method | KS Stat↓ | Wasserstein↓ | JS Div↓ | Mean Diff↓ |
|---|---|---|---|---|
| Cultural Prompting | 0.570 | 1.059 | 0.601 | 0.713 |
| OpenCharacter | 0.374 | 0.827 | 0.434 | 0.666 |
| DeepPersona | 0.325 | 0.721 | 0.425 | 0.451 |
Personalized QA (GPT-4.1 as Responder)¶
| Method | Avg. Score (10 dims) |
|---|---|
| PersonaHub | baseline |
| OpenCharacter | +4% vs. PersonaHub |
| DeepPersona | +5.58% vs. OpenCharacter, +14.66% vs. PersonaHub (leads on all 10 dimensions) |

Largest per-dimension gains: Attribute Coverage +10.6%, Justification +10.2%.
Key Findings¶
- A depth of 200–250 attributes is optimal; beyond 300 attributes, added detail acts as noise and performance degrades.
- DeepPersona yields especially notable improvements for underrepresented cultures (e.g., Kenya, India).
- Cross-model validation (DeepSeek-v3, GPT-4o-mini, Gemini-2.5-flash) demonstrates that the framework is model-agnostic.
- In Big Five personality tests, "average citizen" personas generated by DeepPersona deviate 17% less from real personality distributions than LLM direct simulation.
Highlights & Insights¶
- Taxonomy-driven sampling is broadly applicable to any scenario requiring structured generation—transforming "free generation" into "constrained exploration" is an effective strategy for controlling LLM generation diversity and quality.
- Unbiased value assignment is noteworthy: bypassing the LLM for demographic attributes and sampling directly from statistical tables is a simple yet effective way to mitigate training-data bias.
- The 5:3:2 near/mid/far attribute sampling ratio is a practical trick for balancing "coherent but not stereotypical" persona generation.
- Human evaluation corroborates automatic evaluation findings (81–87% win rate), strengthening the credibility of the conclusions.
Limitations & Future Work¶
- The number of attributes recovered by LLM-as-judge (~50) is far lower than the number actually generated (~200), indicating that many implicit attributes cannot be effectively recovered from narrative text.
- The taxonomy is derived from English-language conversations, potentially resulting in insufficient attribute coverage for non-English-speaking cultures.
- Social simulation is validated only on WVS and Big Five, which represents a narrow set of scenarios.
- Generation cost is relatively high—each deep persona requires multiple rounds of LLM calls.
- Privacy risks of generated personas are not examined; although personas are claimed to be privacy-free, deeply detailed persona combinations may indirectly correspond to real individuals.
Related Work & Insights¶
- vs. PersonaHub: Produces personas at the billion scale, but each contains only ~5 lines (~4 attributes)—"broad but shallow." DeepPersona achieves "broad and deep."
- vs. OpenCharacter: Offers 38.5 attributes plus stylistic dialogue, representing moderate depth but insufficient uniqueness and diversity.
- vs. Cultural Prompting (Tao et al.): Drives LLM simulation with nationality prompts alone; DeepPersona reduces WVS simulation bias by 31.7% relative to this baseline.
Rating¶
- Novelty: ⭐⭐⭐⭐ The combination of attribute taxonomy and progressive sampling is novel, though individual components are relatively standard.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Intrinsic and extrinsic evaluation, multiple tasks and metrics, cross-model validation, and human evaluation—comprehensive coverage.
- Writing Quality: ⭐⭐⭐⭐ Problem definition is clear, but methodological details are scattered across the appendix and the main text is somewhat lengthy.
- Value: ⭐⭐⭐⭐ As a generation engine rather than a one-off dataset, it supports continuous scaling and customization.