Synthia: Scalable Grounded Persona Generation from Social Media Data

Conference: ACL 2026
arXiv: 2507.14922
Code: None
Area: Computational Social Science / Persona Modeling
Keywords: Persona generation, synthetic population, social media, social survey simulation, fairness analysis

TL;DR

This paper proposes Synthia, a framework that generates grounded LLM persona narratives from real social media posts (Bluesky), achieving up to 11.6% improvement over the state of the art on social survey alignment while using smaller models, and preserving social network topology to support network-aware analysis.

Background & Motivation

Background: Persona-driven LLM simulation is increasingly adopted in computational social science to model population-level attitudes and behaviors. Persona construction approaches range from simple demographic descriptions to rich life narratives.

Limitations of Prior Work: Constructing synthetic populations that are both realistic and scalable remains a core challenge. Interview-based methods (e.g., Park et al. 2024) yield high realism but are resource-intensive and difficult to scale. Fully synthetic methods (e.g., Anthology) are scalable but often introduce systematic artifacts that degrade realism, and the resulting narratives frequently contain internal contradictions (63% of Anthology personas exhibit inconsistencies).

Key Challenge: The trade-off between realism and scalability. Unconstrained LLM generation is scalable, but the lack of real-world grounding leads to hallucinations and narrative inconsistencies.

Goal: To design a persona generation framework that uses real social media content as grounding while relying on LLMs for narrative construction, balancing realism, scalability, and fairness.

Key Insight: Public posts from the Bluesky platform serve as a real-world data source; LLMs synthesize each user's posts into a first-person life narrative while preserving the original social network graph structure.

Core Idea: Persona narratives should be grounded in real user-generated content rather than synthesized from scratch. Grounding in authentic data substantially reduces intra-narrative contradictions, thereby improving the alignment of population-level opinion distributions.

Method

Overall Architecture

The pipeline consists of three stages: (1) collecting and filtering a pool of user posts from Bluesky (~170 million posts, 650K users) and sampling 3K users; (2) using an LLM (Gemma-3-27B) to synthesize each user's posts into a first-person persona narrative; (3) aligning the synthetic population with real survey respondents via demographic matching and comparing simulated opinion distributions against ground-truth distributions.

Key Designs

Real-Data-Grounded Persona Generation:
- Function: Generate rich persona narratives with real-world grounding.
- Mechanism: Only users with 100–1,000 posts are kept (too few posts provide insufficient context; too many exceed the context window). Social identifiers such as @mentions, URLs, and email addresses are removed, and replies and reposts are excluded. An LLM then synthesizes the remaining posts into a first-person life background story. High-quality personas can be generated with Gemma-3-27B or even with the much smaller Phi-4-mini (4B).
- Design Motivation: Real posts provide anchor points that constrain the LLM from fabricating content, substantially reducing internal contradictions (the proportion of contradictory personas drops from 63% to 18%).
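
The filtering and scrubbing step described above can be sketched as follows. This is a minimal illustration, not the authors' code: the record fields `text`, `is_reply`, and `is_repost` are hypothetical, the regular expressions are simple approximations, and applying the 100–1,000 window after dropping replies and reposts is an assumption.

```python
import re
from typing import Optional

# Illustrative patterns for the social identifiers named above (assumed, not the paper's rules).
URL_RE = re.compile(r"https?://\S+|www\.\S+")
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b")
MENTION_RE = re.compile(r"@[\w.-]+")

MIN_POSTS, MAX_POSTS = 100, 1000  # post-count window from the text


def scrub(text: str) -> str:
    """Remove URLs, email addresses, and @mentions, then collapse whitespace."""
    text = URL_RE.sub("", text)
    text = EMAIL_RE.sub("", text)    # emails first, so "a@b.com" is removed whole
    text = MENTION_RE.sub("", text)
    return " ".join(text.split())


def prepare_user_posts(posts: list[dict]) -> Optional[list[str]]:
    """Return cleaned original posts, or None if the user is outside the window."""
    # Hypothetical flags: keep only posts that are neither replies nor reposts.
    originals = [p for p in posts if not p.get("is_reply") and not p.get("is_repost")]
    if not (MIN_POSTS <= len(originals) <= MAX_POSTS):
        return None
    cleaned = [scrub(p["text"]) for p in originals]
    return [c for c in cleaned if c]  # drop posts that are empty after scrubbing
```

The cleaned list would then be placed in the LLM prompt that asks for a first-person narrative; that prompt is not specified here.
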

Demographic Matching and Opinion Survey:
- Function: Align the synthetic population with real survey respondents on demographic distributions.
- Mechanism: Demographic attributes (age, gender, ethnicity, etc.) are inferred for each persona by an LLM. A greedy matching algorithm pairs each survey respondent with the most compatible persona. The LLM is then conditioned on the persona narrative to answer survey questions, and the simulated and real opinion distributions are compared using earth mover's distance (EMD), Frobenius norm, and Cronbach's \(\alpha\).
- Design Motivation: Evaluation is anchored to human survey data rather than LLM judgment, ensuring reliability.
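
One way the greedy matching step could look is sketched below. The compatibility score (a count of agreeing attributes), the one-to-one constraint, and the tie-break by lowest persona index are all assumptions made for illustration; the source only states that matching is greedy.

```python
def compatibility(respondent: dict, persona: dict, attrs: tuple[str, ...]) -> int:
    """Count demographic attributes on which the respondent and persona agree."""
    return sum(a in respondent and respondent[a] == persona.get(a) for a in attrs)


def greedy_match(respondents, personas, attrs=("age", "gender", "ethnicity")):
    """Pair each respondent, in order, with the best still-unused persona.

    Returns a dict mapping respondent index -> persona index.
    """
    available = set(range(len(personas)))
    pairs = {}
    for r_idx, resp in enumerate(respondents):
        if not available:
            break  # more respondents than personas: the rest stay unmatched
        # Highest score wins; ties go to the lowest persona index for determinism.
        best = max(available, key=lambda p: (compatibility(resp, personas[p], attrs), -p))
        pairs[r_idx] = best
        available.remove(best)
    return pairs
```

Because the pass is greedy, the result depends on respondent order; an optimal assignment solver would be an alternative when that matters.
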

Social Network Graph Preservation:
- Function: Support network-aware downstream analysis.
- Mechanism: Synthia personas directly inherit the directed follow-graph of the original Bluesky users, linking social topology to persona content and enabling social network analyses such as homophily studies.
- Design Motivation: Combining persona narratives with network structure is distinctive to Synthia and fills a gap left by existing methods, which provide text only.
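
As an example of the network-aware analysis this enables, the sketch below computes a simple edge-level homophily score on a directed follow graph. The edge-list representation and the `label` mapping (for instance an inferred demographic attribute per persona) are illustrative assumptions, not the paper's analysis code.

```python
def edge_homophily(edges, label):
    """Fraction of directed follow edges whose two endpoints share a label.

    edges: iterable of (follower, followee) pairs
    label: dict mapping node -> attribute value; edges touching unlabelled nodes are skipped
    """
    labelled = [(u, v) for u, v in edges if u in label and v in label]
    if not labelled:
        return float("nan")
    return sum(label[u] == label[v] for u, v in labelled) / len(labelled)


# Toy graph: a follows b and d, b follows c, c follows a; d has no label.
follows = [("a", "b"), ("b", "c"), ("c", "a"), ("a", "d")]
groups = {"a": "x", "b": "x", "c": "y"}
score = edge_homophily(follows, groups)  # one of three labelled edges is same-group
```

A score well above the value expected under random mixing would indicate homophily for that attribute.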

Loss & Training

Synthia requires no training; it directly employs pretrained LLMs for persona generation and survey response. During the opinion survey stage, non-instruction-tuned models are used, as prior work has shown they outperform instruction-tuned models in survey simulation.
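
A completion-style prompt is the usual way to query a non-instruction-tuned model for a multiple-choice answer. The sketch below is a hypothetical illustration: the prompt layout and the idea of renormalizing the model's log-probabilities over option letters are assumptions, not details taken from the paper, and no model is called.

```python
from math import exp


def build_prompt(narrative: str, question: str, options: list[str]) -> str:
    """Format a prompt whose natural continuation is a single option letter."""
    letters = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    lines = [narrative.strip(), "", f"Question: {question}"]
    lines += [f"({letters[i]}) {opt}" for i, opt in enumerate(options)]
    lines.append("Answer: (")  # the base model is expected to continue with a letter
    return "\n".join(lines)


def option_distribution(logprobs: dict[str, float]) -> dict[str, float]:
    """Renormalize log-probabilities of the option letters into a distribution."""
    top = max(logprobs.values())  # subtract the max for numerical stability
    weights = {k: exp(v - top) for k, v in logprobs.items()}
    z = sum(weights.values())
    return {k: w / z for k, w in weights.items()}
```

Averaging such per-persona distributions over the matched population would give the simulated opinion distribution that is compared against the survey.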

Key Experimental Results

Main Results

| Method | EMD↓ | Frob.↓ | Cron. \(\alpha\)↑ | Notes |
|---|---|---|---|---|
| Synthia (Gemma-27B) | 0.35 | 2.30 | 0.39 | Best on W34 |
| Anthology (LLaMA-70B) | 0.35 | 2.46 | 0.34 | Uses a model 2.6× larger |
| Anthology (Gemma-27B) | 0.34 | 2.65 | 0.32 | Same model: Synthia is better on Frob. and \(\alpha\), with EMD within 0.01 |
| PChat (human-written) | 0.35 | 2.76 | 0.29 | Human-annotated but high variance |
| Synthia (Phi-4B) | 0.38 | 2.43 | 0.38 | Competitive with a model 6× smaller |
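
For readers unfamiliar with the three metrics, the sketch below gives plain-Python versions. These are assumptions about the standard definitions rather than the paper's evaluation code: EMD is taken over ordered answer options with unit spacing and no normalization, and the Frobenius distance is applied to whatever pair of matrices is being compared (a common choice, assumed here, is the inter-question correlation matrices of human and simulated responses).

```python
from math import sqrt
from statistics import pvariance


def emd_ordinal(p, q):
    """1-D earth mover's distance between two distributions over the same ordered options."""
    cdf_gap, total = 0.0, 0.0
    for pi, qi in zip(p, q):
        cdf_gap += pi - qi      # running difference of the two CDFs
        total += abs(cdf_gap)   # mass that must cross to the next option
    return total


def frobenius_distance(a, b):
    """Frobenius norm of the elementwise difference of two equally sized matrices."""
    return sqrt(sum((x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb)))


def cronbach_alpha(scores):
    """Cronbach's alpha for scores[respondent][item] with numeric item responses."""
    k = len(scores[0])
    item_vars = [pvariance([row[j] for row in scores]) for j in range(k)]
    total_var = pvariance([sum(row) for row in scores])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)
```

Lower EMD and Frobenius distance mean the simulated population is closer to the human one, while a higher \(\alpha\) means more internally consistent responses, which matches the arrows in the table header.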

Ablation Study

| Dimension | Synthia | Anthology | Notes |
|---|---|---|---|
| Contradictory persona ratio | 18% | 63% | Grounding substantially reduces internal contradictions |
| Avg. errors per persona | 0.221 | 0.959 | 77% reduction in narrative contradictions |
| Cross-wave Frob. variance | 0.04 | 0.20 | Synthia is more stable |
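
The 77% figure follows directly from the average error counts in the table:

\[
\frac{0.959 - 0.221}{0.959} = \frac{0.738}{0.959} \approx 0.770
\]

By the same calculation, the drop in the contradictory persona ratio from 63% to 18% is a reduction of 45 percentage points, or \((63 - 18)/63 \approx 71\%\) in relative terms.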

Key Findings

- Intra-narrative consistency is a critical factor for aligning population-level opinions: Synthia reduces contradictions by 77% through real-data grounding.
- Even with a 4B model (Phi-4-mini), Synthia remains competitive with Anthology personas generated by a 70B model.
- Fairness analysis shows that the accuracy gap between the best- and worst-performing demographic subgroups is reduced by up to 25% with Synthia.
- Link prediction accuracy improves by 8.3% and embedding-space separability improves by 46%, validating the utility of the preserved network structure.

Highlights & Insights

- Grounding persona generation in real social media posts is a simple yet effective idea. The core insight is that intra-narrative consistency matters more than narrative richness. Anthology employs large models with high-temperature sampling to produce rich narratives, but the absence of grounding leads to frequent contradictions that ultimately degrade downstream task quality.
- Preserving social network topology is a distinctive contribution. It turns the synthetic population from a collection of isolated individuals into a community with social relationships, which opens new possibilities for social network simulation.
- Achieving comparable or superior performance with smaller models suggests that, in this setting, data quality (grounding in real content) can matter more than model scale.

Limitations & Future Work

- Only English-language Bluesky data are used; the user base may not be representative of the general population.
- Removing social identifiers may discard useful contextual information.
- Demographic inference relies on LLM accuracy and may introduce biases.
- Evaluation is limited to U.S. social surveys (ATP); cross-cultural generalizability remains to be verified.
- Future work could explore multilingual and multi-platform persona generation.

- vs. Anthology (Moon et al. 2024): Anthology uses unconstrained high-temperature sampling, which is scalable but produces frequent contradictions; Synthia grounds generation in real posts, yielding better consistency.
- vs. Park et al. 2024: Interview-based data yield high realism but are not scalable; Synthia offers a scalable alternative using social media posts.
- vs. PChat (Zhang et al. 2018): Human-written personas vary in quality and are not scalable.

Rating

- Novelty: ⭐⭐⭐⭐ Real-data grounding combined with network topology preservation constitutes a meaningful contribution.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ 54 experimental configurations, multi-dimensional evaluation, fairness analysis, and network case studies.
- Writing Quality: ⭐⭐⭐⭐ Well-structured with in-depth analysis.
- Value: ⭐⭐⭐⭐ Directly applicable to population simulation in computational social science.