
DeepPersona: A Generative Engine for Scaling Deep Synthetic Personas

Conference: NeurIPS 2025 arXiv: 2511.07338 Code: https://deeppersona-ai.github.io/ Area: AI Safety Keywords: Synthetic Personas, Persona Simulation, LLM Personalization, Social Simulation, Attribute Taxonomy

TL;DR

This paper presents DeepPersona, a two-stage taxonomy-guided synthetic persona generation engine. Stage 1 mines a human attribute taxonomy with 8,000+ nodes from real user–ChatGPT conversations; Stage 2 generates narratively coherent personas averaging 200+ structured attributes via progressive attribute sampling. The approach achieves an 11.6% improvement in personalized QA accuracy and a 31.7% reduction in social survey simulation bias.

Background & Motivation

Background: Using LLMs to generate synthetic personas has been widely adopted in personalized assistants, social behavior simulation, role-playing agents, and alignment research. PersonaHub can generate up to one billion brief persona descriptions.

Limitations of Prior Work: Existing synthetic personas are extremely shallow—typically containing fewer than 30 manually defined attributes or a few lines of templated description, lacking depth, diversity, and authenticity. Directly scaling with LLMs leads to insufficient diversity, stereotypical biases, and overly optimistic tendencies.

Key Challenge: Depth is the critical bottleneck for narratively coherent personas. Existing methods can scale in quantity and diversity, but attribute depth consistently remains in the single-to-double-digit range, failing to capture the rich complexity of real human individuals.

Goal: To build a scalable, data-driven method that simultaneously achieves (a) broad attribute coverage (\(k > 10^2\) attributes); (b) diversity (free of stereotypes); and (c) internal consistency.

Key Insight: Extract attributes from real user self-disclosure conversations to construct a taxonomy, then use the taxonomy to guide progressive sampling rather than directly prompting LLMs to generate personas.

Core Idea: A data-driven 8,000+ node attribute taxonomy guides progressive sampling, transforming LLMs from "free generation" to "structured slot-filling," achieving both depth and diversity in synthetic personas.

Method

Overall Architecture

A two-stage pipeline: Stage 1 constructs a human attribute taxonomy \(T\) from real conversations; Stage 2 takes a small set of anchor attributes \(S\) and produces a narratively complete synthetic persona \(P = \{\langle a_i, v_i \rangle\}\) via progressive attribute sampling and LLM-based value generation.
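
To make the pipeline concrete, here is a minimal Python sketch of the data structures and the two-stage interface described above. The names (TaxonomyNode, build_taxonomy, generate_persona) are illustrative assumptions for exposition, not the authors' implementation.

```python
from dataclasses import dataclass, field

@dataclass
class TaxonomyNode:
    """One node of the human attribute taxonomy T,
    e.g. Lifestyle -> Food Preference -> Vegan."""
    name: str
    children: list["TaxonomyNode"] = field(default_factory=list)

# A persona P is a set of <attribute, value> pairs; here keyed by attribute path.
Persona = dict[str, str]

def build_taxonomy(conversations: list[str]) -> TaxonomyNode:
    """Stage 1 (sketch): extract, validate, and merge attribute hierarchies
    mined from personalized user-LLM conversations."""
    raise NotImplementedError

def generate_persona(taxonomy: TaxonomyNode, anchors: Persona, k: int = 200) -> Persona:
    """Stage 2 (sketch): progressively sample k attributes from the taxonomy
    and fill their values with an LLM, conditioned on the anchor set S
    and the partial profile P_<i."""
    raise NotImplementedError
```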

Key Designs

  1. Human Attribute Taxonomy Construction (Stage 1):

    • Function: Extract, organize, and merge attribute nodes from 62,224 high-quality personalized QA pairs.
    • Mechanism: GPT-4.1-mini first classifies each QA pair as "non-personalizable / partially personalizable / personalizable" to filter personalized dialogues; hierarchical attributes (e.g., Lifestyle → Food Preference → Vegan) are then recursively extracted; multiple candidate hierarchies are merged using a semantic similarity threshold.
    • Result: A taxonomy with 12 top-level categories and 8,496 unique nodes. Most attributes are no deeper than 3 levels.
    • Semantic validation and filtering: Performed in two stages: before merging, personalizability, semantic coherence, and appropriate abstraction level are verified; after merging, duplicates are removed and erroneous parent–child relationships are corrected.
  2. Progressive Attribute Sampling (Stage 2):

    • Function: Iteratively select attributes from the taxonomy and fill in values with an LLM until the target depth \(k\) is reached.
    • Sampling is modeled as a structured distribution: \(P \sim \mathcal{F}_{\theta,T}(\cdot|S,k) = \prod_{i=1}^{k} \Pr(a_i|S,P_{<i},T) \cdot \Pr_\theta(v_i|a_i,S,P_{<i})\)
    • Four key design choices (see the sketch after this list):
      • Anchored stable core: Core attributes such as age, location, career, and values are fixed first to prevent sampling drift.
      • Unbiased value assignment: Demographic attributes (age, gender, occupation, etc.) are sampled from predefined statistical distribution tables rather than generated by the LLM, avoiding majority-culture default biases.
      • Balanced attribute diversification: Candidate attributes are divided into near/mid/far tiers by cosine similarity to core attributes and sampled at a 5:3:2 ratio, balancing coherence with novelty.
      • Progressive LLM slot-filling: A random breadth-first traversal of the taxonomy prioritizes long-tail branches; each selected attribute's value is generated by the LLM conditioned on the existing profile \(P_{<i}\).
  3. Life-Story-Driven Core Attribute Inference:

    • Function: For core attributes without predefined categories (e.g., hobbies and interests), inference is grounded through a life-story narrative.
    • Mechanism: Fix demographics → LLM infers core values → expand to life attitudes → generate 1–3 life-story vignettes → derive interests and hobbies from the stories.
    • Design Motivation: Produces greater depth and internal consistency than direct generation.
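
The four design choices above combine into one sampling loop. Below is a minimal, hedged sketch of what such a loop might look like, assuming a flat list of taxonomy attribute paths, an `embed` function returning vector embeddings, a demographic distribution table, and an `llm_fill_value` callable wrapping the underlying chat model. All function names and the exact tiering are illustrative simplifications (only the 5:3:2 ratio comes from the paper), not the paper's actual implementation.

```python
import random
import numpy as np

NEAR, MID, FAR = 0.5, 0.3, 0.2  # near/mid/far sampling ratio (5:3:2)

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def sample_demographics(tables: dict[str, dict[str, float]]) -> dict[str, str]:
    """Unbiased value assignment: draw demographic attributes from
    predefined statistical distribution tables instead of the LLM."""
    return {attr: random.choices(list(dist), weights=list(dist.values()))[0]
            for attr, dist in tables.items()}

def tier_candidates(candidates, core_vec, embed):
    """Split candidates into near/mid/far thirds by cosine similarity
    to the mean embedding of the anchored core attributes."""
    ranked = sorted(candidates, key=lambda a: cosine(embed(a), core_vec), reverse=True)
    n = len(ranked)
    return ranked[: n // 3], ranked[n // 3 : 2 * n // 3], ranked[2 * n // 3 :]

def generate_persona(attribute_paths, demo_tables, embed, llm_fill_value, k=200, seed=0):
    rng = random.Random(seed)
    # 1) Anchored stable core: fix demographics and other core slots first.
    persona = sample_demographics(demo_tables)
    core_vec = np.mean([embed(a) for a in persona], axis=0)

    remaining = [a for a in attribute_paths if a not in persona]
    while len(persona) < k and remaining:
        # 2) Balanced diversification: choose the next tier at a 5:3:2 ratio.
        near, mid, far = tier_candidates(remaining, core_vec, embed)
        tier = rng.choices([near, mid, far], weights=[NEAR, MID, FAR])[0] or remaining
        attr = rng.choice(tier)
        remaining.remove(attr)
        # 3) Progressive slot-filling: the value is generated by the LLM
        #    conditioned on the partial profile built so far (P_<i).
        persona[attr] = llm_fill_value(attribute=attr, profile=dict(persona))
    return persona
```

The life-story-driven inference described in item 3 would slot in between the demographic anchors and the main loop: core values, life attitudes, and short life-story vignettes are expanded first, and interests are then derived from those stories before the remaining attributes are sampled.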

Loss & Training

DeepPersona is a generation framework rather than a trained model. GPT-4.1 and GPT-4.1-mini serve as the underlying LLM \(\theta\).

Key Experimental Results

Intrinsic Evaluation

| Metric | PersonaHub | OpenCharacter | DeepPersona |
| --- | --- | --- | --- |
| Avg. # attributes (recovered by LLM judge) | 3.98 | 38.50 | 50.92 |
| Uniqueness (1–5) | 2.50 | 2.86 | 4.12 (+44%) |
| Actionability (1–5) | 3.60 | 4.78 | 5.00 |

Social Survey Simulation (World Values Survey, 6-country average)

| Method | KS Stat ↓ | Wasserstein ↓ | JS Div ↓ | Mean Diff ↓ |
| --- | --- | --- | --- | --- |
| Cultural Prompting | 0.570 | 1.059 | 0.601 | 0.713 |
| OpenCharacter | 0.374 | 0.827 | 0.434 | 0.666 |
| DeepPersona | 0.325 | 0.721 | 0.425 | 0.451 |
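
For reference, the four distribution-distance metrics reported above can be computed with standard SciPy utilities. The sketch below assumes `simulated` and `real` are 1-D arrays of survey responses on the same numeric scale (the function name and binning choice are illustrative, not from the paper).

```python
import numpy as np
from scipy.stats import ks_2samp, wasserstein_distance
from scipy.spatial.distance import jensenshannon

def survey_bias(simulated: np.ndarray, real: np.ndarray, bins: int = 10) -> dict:
    """Distance between simulated and real survey response distributions."""
    ks_stat = ks_2samp(simulated, real).statistic
    wass = wasserstein_distance(simulated, real)

    # JS divergence is defined on discrete distributions, so bin both samples
    # over a shared range; scipy's jensenshannon returns the JS *distance*
    # (the square root of the divergence), hence the squaring.
    lo = min(simulated.min(), real.min())
    hi = max(simulated.max(), real.max())
    p, _ = np.histogram(simulated, bins=bins, range=(lo, hi), density=True)
    q, _ = np.histogram(real, bins=bins, range=(lo, hi), density=True)
    js_div = jensenshannon(p, q) ** 2

    mean_diff = abs(simulated.mean() - real.mean())
    return {"ks": ks_stat, "wasserstein": wass, "js_div": js_div, "mean_diff": mean_diff}
```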

Personalized QA (GPT-4.1 as Responder)

| Method | vs. PersonaHub | vs. OpenCharacter |
| --- | --- | --- |
| PersonaHub | baseline | – |
| OpenCharacter | +4% | baseline |
| DeepPersona | +14.66% | +5.58% (leads on all 10 dimensions) |

Scores are averaged over 10 evaluation dimensions; the largest per-dimension gains are Attribute Coverage (+10.6%) and Justification (+10.2%).

Key Findings

  • A depth of 200–250 attributes is optimal; beyond roughly 300 attributes, additional attributes introduce noise and degrade performance.
  • DeepPersona yields especially notable improvements for underrepresented cultures (e.g., Kenya, India).
  • Cross-model validation (DeepSeek-v3, GPT-4o-mini, Gemini-2.5-flash) demonstrates that the framework is model-agnostic.
  • In Big Five personality tests, "average citizen" personas generated by DeepPersona deviate 17% less from real personality distributions than direct LLM simulation.

Highlights & Insights

  • Taxonomy-driven sampling is broadly applicable to any scenario requiring structured generation—transforming "free generation" into "constrained exploration" is an effective strategy for controlling LLM generation diversity and quality.
  • Unbiased value assignment is noteworthy: bypassing the LLM for demographic attributes and sampling directly from statistical tables is a simple yet effective way to mitigate training-data bias.
  • The 5:3:2 near/mid/far attribute sampling ratio is a practical trick for balancing "coherent but not stereotypical" persona generation.
  • Human evaluation corroborates automatic evaluation findings (81–87% win rate), strengthening the credibility of the conclusions.

Limitations & Future Work

  • The number of attributes recovered by LLM-as-judge (~50) is far lower than the number actually generated (~200), indicating that many implicit attributes cannot be effectively recovered from narrative text.
  • The taxonomy is derived from English-language conversations, potentially resulting in insufficient attribute coverage for non-English-speaking cultures.
  • Social simulation is validated only on WVS and Big Five, which represents a narrow set of scenarios.
  • Generation cost is relatively high—each deep persona requires multiple rounds of LLM calls.
  • Privacy risks of the generated personas are not examined; although the personas are described as carrying no privacy risk, richly detailed attribute combinations could still indirectly correspond to real individuals.

Comparison with Related Work

  • vs. PersonaHub: Produces personas at the billion scale, but each contains only ~5 lines (~4 attributes): "broad but shallow." DeepPersona achieves "broad and deep."
  • vs. OpenCharacter: Offers ~38.5 attributes per persona plus stylistic dialogue, giving moderate depth but insufficient uniqueness and diversity.
  • vs. Cultural Prompting (Tao et al.): Drives LLM simulation with nationality prompts alone; DeepPersona reduces WVS simulation bias by 31.7% relative to this baseline.

Rating

  • Novelty: ⭐⭐⭐⭐ The combination of attribute taxonomy and progressive sampling is novel, though individual components are relatively standard.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ Intrinsic and extrinsic evaluation, multiple tasks and metrics, cross-model validation, and human evaluation—comprehensive coverage.
  • Writing Quality: ⭐⭐⭐⭐ Problem definition is clear, but methodological details are scattered across the appendix and the main text is somewhat lengthy.
  • Value: ⭐⭐⭐⭐ As a generation engine rather than a one-off dataset, it supports continuous scaling and customization.