DeepPersona: A Generative Engine for Scaling Deep Synthetic Personas¶
Conference: NeurIPS 2025 · arXiv: 2511.07338 · Code: https://deeppersona-ai.github.io/ · Area: AI Safety · Keywords: Synthetic Personas, Persona Simulation, LLM Personalization, Social Simulation, Attribute Taxonomy
TL;DR¶
This paper presents DeepPersona, a two-stage taxonomy-guided synthetic persona generation engine. Stage 1 mines a human attribute taxonomy with 8,000+ nodes from real user–ChatGPT conversations; Stage 2 generates narratively coherent personas averaging 200+ structured attributes via progressive attribute sampling. The approach achieves an 11.6% improvement in personalized QA accuracy and a 31.7% reduction in social survey simulation bias.
Background & Motivation¶
Background: Using LLMs to generate synthetic personas has been widely adopted in personalized assistants, social behavior simulation, role-playing agents, and alignment research. PersonaHub can generate up to one billion brief persona descriptions.
Limitations of Prior Work: Existing synthetic personas are extremely shallow—typically containing fewer than 30 manually defined attributes or a few lines of templated description, lacking depth, diversity, and authenticity. Directly scaling with LLMs leads to insufficient diversity, stereotypical biases, and overly optimistic tendencies.
Key Challenge: Depth is the critical bottleneck for narratively coherent personas. Existing methods can scale in quantity and diversity, but attribute depth consistently remains in the single-to-double-digit range, failing to capture the rich complexity of real human individuals.
Goal: To build a scalable, data-driven method that simultaneously achieves (a) broad attribute coverage (\(k > 10^2\) attributes); (b) diversity (free of stereotypes); and (c) internal consistency.
Key Insight: Extract attributes from real user self-disclosure conversations to construct a taxonomy, then use the taxonomy to guide progressive sampling rather than directly prompting LLMs to generate personas.
Core Idea: A data-driven 8,000+ node attribute taxonomy guides progressive sampling, transforming LLMs from "free generation" to "structured slot-filling," achieving both depth and diversity in synthetic personas.
Method¶
Overall Architecture¶
A two-stage pipeline: Stage 1 constructs a human attribute taxonomy \(T\) from real conversations; Stage 2 takes a small set of anchor attributes \(S\) and produces a narratively complete synthetic persona \(P = \{\langle a_i, v_i \rangle\}\) via progressive attribute sampling and LLM-based value generation.
Key Designs¶
- Human Attribute Taxonomy Construction (Stage 1):
- Function: Extract, organize, and merge attribute nodes from 62,224 high-quality personalized QA pairs.
- Mechanism: GPT-4.1-mini first classifies each QA pair as "non-personalizable / partially personalizable / personalizable" to filter personalized dialogues; hierarchical attributes (e.g., Lifestyle → Food Preference → Vegan) are then recursively extracted; multiple candidate hierarchies are merged using a semantic similarity threshold.
- Result: A taxonomy with 12 top-level categories and 8,496 unique nodes. Most attributes are no deeper than 3 levels.
- Semantic validation and filtering: Two stages—before merging, personalizability, semantic coherence, and appropriate abstraction level are verified; after merging, duplicates are removed and erroneous parent–child relationships are corrected.
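The merge step above can be sketched in code. This is an illustrative assumption, not the paper's implementation: the embedding function, the greedy first-match strategy, and the 0.85 threshold are all placeholders standing in for whatever semantic-similarity machinery the authors actually use.

```python
# Sketch of the Stage-1 merge step: candidate attribute nodes mined from
# different conversations are merged when their embeddings exceed a
# similarity threshold. Embedding model, threshold value, and the greedy
# strategy are illustrative assumptions.
import numpy as np

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def merge_nodes(candidates, embed, threshold=0.85):
    """Greedily merge candidate attribute names whose embeddings are more
    similar than `threshold`; the first-seen name becomes canonical.
    Returns a mapping from each candidate to its canonical node."""
    merged = []   # list of (canonical_name, embedding)
    mapping = {}  # candidate name -> canonical name
    for name in candidates:
        e = embed(name)
        for canon, ce in merged:
            if cosine(e, ce) >= threshold:
                mapping[name] = canon
                break
        else:
            merged.append((name, e))
            mapping[name] = name
    return mapping
```

In the actual pipeline this would run per level of the hierarchy, followed by the post-merge validation pass that removes duplicates and fixes erroneous parent-child links.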
- Progressive Attribute Sampling (Stage 2):
- Function: Iteratively select attributes from the taxonomy and fill in values with an LLM until the target depth \(k\) is reached.
- Sampling is modeled as a structured distribution: \(P \sim \mathcal{F}_{\theta,T}(\cdot|S,k) = \prod_{i=1}^{k} \Pr(a_i|S,P_{<i},T) \cdot \Pr_\theta(v_i|a_i,S,P_{<i})\)
- Four key design choices:
- Anchored stable core: Core attributes such as age, location, career, and values are fixed first to prevent sampling drift.
- Unbiased value assignment: Demographic attributes (age, gender, occupation, etc.) are sampled from predefined statistical distribution tables rather than generated by the LLM, avoiding majority-culture default biases.
- Balanced attribute diversification: Candidate attributes are divided into near/mid/far tiers by cosine similarity to core attributes and sampled at a 5:3:2 ratio, balancing coherence with novelty.
- Progressive LLM slot-filling: A random breadth-first traversal of the taxonomy prioritizes long-tail branches; each selected attribute's value is generated by the LLM conditioned on the existing profile \(P_{<i}\).
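The four design choices above can be combined into a short sketch. The tier cut-offs, the `sim_to_core` similarity function, and the `fill_value` stub (standing in for the LLM conditioned on \(P_{<i}\)) are assumptions for demonstration; only the anchored core, the 5:3:2 near/mid/far ratio, and the iterate-until-depth-\(k\) loop come from the paper.

```python
# Illustrative sketch of Stage-2 progressive sampling: anchor a stable core,
# tier the remaining candidates by similarity to that core, then draw
# near/mid/far attributes at a 5:3:2 ratio until depth k is reached.
# Tier cut-offs and the fill_value LLM stub are assumptions.
import random

def tier_attributes(candidates, sim_to_core, near=0.7, far=0.4):
    """Split candidate attributes into near/mid/far tiers by cosine
    similarity to the anchored core attributes."""
    tiers = {"near": [], "mid": [], "far": []}
    for a in candidates:
        s = sim_to_core(a)
        tiers["near" if s >= near else "mid" if s >= far else "far"].append(a)
    return tiers

def sample_profile(core, candidates, sim_to_core, fill_value, k, seed=0):
    """Progressively extend an anchored core profile to depth k."""
    rng = random.Random(seed)
    profile = dict(core)  # anchored stable core: fixed first, never resampled
    tiers = tier_attributes(candidates, sim_to_core)
    weights = {"near": 5, "mid": 3, "far": 2}  # the paper's 5:3:2 ratio
    while len(profile) < k:
        pool = [t for t in ("near", "mid", "far") if tiers[t]]
        if not pool:
            break  # taxonomy exhausted before reaching depth k
        t = rng.choices(pool, weights=[weights[p] for p in pool])[0]
        attr = tiers[t].pop(rng.randrange(len(tiers[t])))
        profile[attr] = fill_value(attr, profile)  # LLM slot-filling on P_<i
    return profile
```

Passing the growing `profile` back into `fill_value` is what makes each new value conditioned on everything sampled so far, mirroring the \(\Pr_\theta(v_i \mid a_i, S, P_{<i})\) factor in the structured distribution.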
- Life-Story-Driven Core Attribute Inference:
- Function: For core attributes without predefined categories (e.g., hobbies and interests), inference is grounded through a life-story narrative.
- Mechanism: Fix demographics → LLM infers core values → expand to life attitudes → fabricate 1–3 life-story vignettes → derive interests and hobbies from the stories.
- Design Motivation: Produces greater depth and internal consistency than direct generation.
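The inference chain (demographics → values → attitudes → vignettes → hobbies) can be sketched as a sequence of conditioned calls. The `ask` callable stands in for an LLM call and the prompt strings are hypothetical; the point is that each step is conditioned on everything inferred before it.

```python
# Minimal sketch of the life-story-driven inference chain. `ask(prompt,
# profile)` is a stand-in for an LLM call conditioned on the partial
# profile; prompts and field names are illustrative assumptions.
def infer_core(demographics, ask, n_stories=3):
    profile = {"demographics": demographics}          # fixed first
    profile["values"] = ask("Infer core values.", profile)
    profile["attitudes"] = ask("Expand values into life attitudes.", profile)
    # 1-3 fabricated life-story vignettes ground the later inferences
    profile["stories"] = [ask(f"Write life-story vignette {i + 1}.", profile)
                          for i in range(n_stories)]
    profile["hobbies"] = ask("Derive interests and hobbies from the stories.",
                             profile)
    return profile
```

Because the vignettes are generated before the hobbies, the hobbies are grounded in concrete narrative events rather than sampled in isolation, which is the source of the consistency gain the paper reports over direct generation.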
Loss & Training¶
DeepPersona is a generation framework rather than a trained model. GPT-4.1 and GPT-4.1-mini serve as the underlying LLM \(\theta\).
Key Experimental Results¶
Intrinsic Evaluation¶
| Metric | PersonaHub | OpenCharacter | DeepPersona |
|---|---|---|---|
| Avg. # Attributes | 3.98 | 38.50 | 50.92 |
| Uniqueness (1–5) | 2.50 | 2.86 | 4.12 (+44%) |
| Actionability (1–5) | 3.60 | 4.78 | 5.00 |
Social Survey Simulation (World Values Survey, 6-country average)¶
| Method | KS Stat↓ | Wasserstein↓ | JS Div↓ | Mean Diff↓ |
|---|---|---|---|---|
| Cultural Prompting | 0.570 | 1.059 | 0.601 | 0.713 |
| OpenCharacter | 0.374 | 0.827 | 0.434 | 0.666 |
| DeepPersona | 0.325 | 0.721 | 0.425 | 0.451 |
Personalized QA (GPT-4.1 as Responder)¶
| Method | Avg. Score (10 dims) |
|---|---|
| PersonaHub | baseline |
| OpenCharacter | +4% vs. PersonaHub |
| DeepPersona | +5.58% vs. OpenCharacter, +14.66% vs. PersonaHub (leads on all 10 dimensions) |

Largest per-dimension gains: Attribute Coverage +10.6%, Justification +10.2%.
Key Findings¶
- A depth of 200–250 attributes is optimal; beyond 300 attributes, added detail acts as noise and performance degrades.
- DeepPersona yields especially notable improvements for underrepresented cultures (e.g., Kenya, India).
- Cross-model validation (DeepSeek-v3, GPT-4o-mini, Gemini-2.5-flash) demonstrates that the framework is model-agnostic.
- In Big Five personality tests, "average citizen" personas generated by DeepPersona deviate 17% less from real personality distributions than LLM direct simulation.
Highlights & Insights¶
- Taxonomy-driven sampling is broadly applicable to any scenario requiring structured generation—transforming "free generation" into "constrained exploration" is an effective strategy for controlling LLM generation diversity and quality.
- Unbiased value assignment is noteworthy: bypassing the LLM for demographic attributes and sampling directly from statistical tables is a simple yet effective way to mitigate training-data bias.
- The 5:3:2 near/mid/far attribute sampling ratio is a practical trick for balancing "coherent but not stereotypical" persona generation.
- Human evaluation corroborates automatic evaluation findings (81–87% win rate), strengthening the credibility of the conclusions.
Limitations & Future Work¶
- The number of attributes recovered by LLM-as-judge (~50) is far lower than the number actually generated (~200), indicating that many implicit attributes cannot be effectively recovered from narrative text.
- The taxonomy is derived from English-language conversations, potentially resulting in insufficient attribute coverage for non-English-speaking cultures.
- Social simulation is validated only on WVS and Big Five, which represents a narrow set of scenarios.
- Generation cost is relatively high—each deep persona requires multiple rounds of LLM calls.
- Privacy risks of generated personas are not examined; although personas are claimed to be privacy-free, deeply detailed persona combinations may indirectly correspond to real individuals.
Related Work & Insights¶
- vs. PersonaHub: Produces personas at the billion scale, but each contains only ~5 lines (~4 attributes)—"broad but shallow." DeepPersona achieves "broad and deep."
- vs. OpenCharacter: Offers 38.5 attributes plus stylistic dialogue, representing moderate depth but insufficient uniqueness and diversity.
- vs. Cultural Prompting (Tao et al.): Drives LLM simulation with nationality prompts alone; DeepPersona reduces WVS simulation bias by 31.7% relative to this baseline.
Rating¶
- Novelty: ⭐⭐⭐⭐ The combination of attribute taxonomy and progressive sampling is novel, though individual components are relatively standard.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Intrinsic and extrinsic evaluation, multiple tasks and metrics, cross-model validation, and human evaluation—comprehensive coverage.
- Writing Quality: ⭐⭐⭐⭐ Problem definition is clear, but methodological details are scattered across the appendix and the main text is somewhat lengthy.
- Value: ⭐⭐⭐⭐ As a generation engine rather than a one-off dataset, it supports continuous scaling and customization.