Synthia: Scalable Grounded Persona Generation from Social Media Data
Conference: ACL 2026
arXiv: 2507.14922
Code: None
Area: Computational Social Science / Persona Modeling
Keywords: Persona Generation, Synthetic Populations, Social Media, Social Survey Simulation, Fairness Analysis
TL;DR
Synthia generates grounded LLM persona narratives from real social media posts (Bluesky). It improves social survey alignment by up to 11.6% over the state of the art while using smaller models, and it preserves the social network topology to support network-aware analysis.
Background & Motivation
Background: Persona-driven LLM simulations are increasingly used in computational social science to simulate population-level attitudes and behaviors. Persona construction methods range from simple demographic descriptions to rich life narratives.
Limitations of Prior Work: Building virtual populations that are both realistic and scalable remains a core challenge. Interview-based methods (e.g., Park et al. 2024) offer high authenticity but are resource-intensive and hard to scale. Fully synthetic methods (e.g., Anthology) scale well but often introduce systematic artifacts that reduce realism, and their narratives frequently contradict themselves (63% of Anthology personas contain contradictions).
Key Challenge: The trade-off between authenticity and scalability. Unconstrained LLM generation, though scalable, lacks real-world anchoring, leading to hallucinations and narrative inconsistencies.
Goal: Design a persona generation framework that uses real social media content as an anchor while utilizing LLMs for narrative construction, balancing authenticity, scalability, and fairness.
Key Insight: Utilize public posts from the Bluesky platform as a real data source, using LLMs to synthesize user posts into first-person life narratives while preserving the original social network graph structure.
Core Idea: Persona narratives should be anchored in real user-generated content rather than synthesized from scratch. Anchoring in real data significantly reduces internal narrative contradictions, thereby improving alignment with population opinion distributions.
Method
Overall Architecture
A three-stage pipeline: (1) Collect and filter a user post pool from Bluesky (approx. 170 million posts, 650k users), sampling 3K users; (2) Use an LLM (Gemma-3-27B) to synthesize each user's posts into a first-person persona narrative; (3) Align the synthetic population with real survey respondents through demographic matching and compare simulated opinion distributions with real distributions.
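Stage 1 can be illustrated with a minimal sketch that applies the filtering rules listed under Key Designs (100 to 1,000 posts per user, no replies or reposts, social identifiers stripped). The post schema (`text`, `is_reply`, `is_repost`), the regular expressions, and the choice to truncate rather than discard users with more than 1,000 posts are assumptions made for illustration, not details taken from the paper.

```python
import re
from typing import Optional

MIN_POSTS, MAX_POSTS = 100, 1000  # per-user bounds quoted in the summary

# Illustrative patterns (assumed). E-mail addresses are removed before
# @mentions so the "@domain" part of an address is not treated as a mention.
URL = re.compile(r"https?://\S+")
EMAIL = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")
MENTION = re.compile(r"@[\w.-]+")


def clean_post(text: str) -> str:
    """Strip URLs, e-mail addresses and @mentions, then collapse whitespace."""
    for pattern in (URL, EMAIL, MENTION):
        text = pattern.sub("", text)
    return " ".join(text.split())


def build_user_context(posts: list) -> Optional[list]:
    """Return cleaned original posts, or None if the user has too few.

    Each post is a hypothetical dict:
    {"text": str, "is_reply": bool, "is_repost": bool}.
    """
    originals = [p["text"] for p in posts
                 if not (p["is_reply"] or p["is_repost"])]
    cleaned = [c for c in map(clean_post, originals) if c]
    if len(cleaned) < MIN_POSTS:
        return None  # too little context for a narrative
    return cleaned[:MAX_POSTS]  # assumption: truncate to fit the context window
```

The returned list is what would be placed in the LLM prompt for narrative synthesis in stage 2.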
Key Designs
- Real-Data Anchored Persona Generation:
    - Function: Generate rich persona narratives with real-world grounding.
    - Mechanism: Collect 100 to 1,000 posts per user (fewer gives too little context; more exceeds the context window), exclude replies and reposts, and strip social identifiers such as @mentions, URLs, and email addresses. An LLM then synthesizes the remaining posts into a first-person life narrative. Gemma-3-27B (27B) or even Phi-4-mini (4B) is sufficient to generate high-quality personas.
    - Design Motivation: Real posts act as anchors that constrain the LLM from fabricating information, which significantly reduces internal contradictions (18% of Synthia personas contain contradictions, versus 63% for Anthology).
- Demographic Matching and Opinion Surveying:
    - Function: Align the synthetic population with the demographic distribution of a real survey population.
    - Mechanism: Infer demographic attributes (age, gender, race, etc.) for each persona with an LLM, then use a greedy matching algorithm to pair each survey respondent with the best-matching persona. Condition the LLM on the persona narrative to answer survey questions, and compare simulated and real opinion distributions using Earth Mover's Distance (EMD), Frobenius norm, and Cronbach's \(\alpha\).
    - Design Motivation: Evaluation is anchored on human survey data rather than LLM judgments to ensure reliability.
- Social Network Graph Retention:
    - Function: Support network-aware downstream analysis.
    - Mechanism: Synthia personas directly inherit the directed graph of following relationships among the original Bluesky users. This links social topology with persona content and supports social network research such as homophily analysis.
    - Design Motivation: This is a unique feature of Synthia: it provides both persona narratives and network structure, filling the gap left by existing methods that provide text only.
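The greedy matching step in the second design can be sketched as follows. The summary does not specify the scoring function or whether a persona may be reused, so this sketch assumes (hypothetically) that the score is the number of agreeing attributes, that each persona is assigned at most once, and that ties go to the lowest persona index.

```python
ATTRIBUTES = ("age", "gender", "race")  # subset named in the summary


def overlap(respondent: dict, persona: dict) -> int:
    """Count demographic attributes on which the two records agree."""
    return sum(respondent[a] == persona[a] for a in ATTRIBUTES)


def greedy_match(respondents: list, personas: list) -> dict:
    """Map respondent index -> persona index, using each persona at most once.

    Respondents are processed in order; each takes the still-available
    persona with the highest overlap (assumed scoring rule).
    """
    available = set(range(len(personas)))
    pairs = {}
    for r_idx, respondent in enumerate(respondents):
        if not available:
            break  # more respondents than personas
        best = max(
            available,
            key=lambda p: (overlap(respondent, personas[p]), -p),
        )
        pairs[r_idx] = best
        available.remove(best)
    return pairs
```

Because the procedure is greedy, the result depends on the order in which respondents are processed; an optimal assignment would require a different algorithm.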
Loss & Training
Synthia requires no training; it uses pre-trained LLMs directly for persona generation and survey answering. Base (non-instruction-tuned) models are used in the opinion survey phase, since prior research indicates they outperform instruction-tuned models in survey simulation.
Key Experimental Results
Main Results
| Method | EMD↓ | Frob.↓ | Cron. \(\alpha\)↑ | Description |
|---|---|---|---|---|
| Synthia (Gemma-27B) | 0.35 | 2.30 | 0.39 | Best on W34 |
| Anthology (LLaMA-70B) | 0.35 | 2.46 | 0.34 | Used 2.6x larger model |
| Anthology (Gemma-27B) | 0.34 | 2.65 | 0.32 | Same model: Synthia is better on Frob. and \(\alpha\); EMD is comparable (0.35 vs. 0.34) |
| PChat (Human) | 0.35 | 2.76 | 0.29 | Human-annotated but highly volatile |
| Synthia (Phi-4B) | 0.38 | 2.43 | 0.38 | Comparable even with 6x smaller model |
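For readers unfamiliar with the three columns, the sketch below shows one plausible way to compute them with the Python standard library. The summary does not state which matrices the Frobenius norm compares; a common choice in survey simulation is the inter-question correlation matrices of human and simulated responses, which is an assumption here. All function names are hypothetical.

```python
from itertools import accumulate
from math import sqrt
from statistics import variance


def emd_ordinal(p, q):
    """1-D earth mover's distance between two distributions over the same
    ordered answer options (unit spacing): sum of absolute CDF gaps."""
    gaps = accumulate(pi - qi for pi, qi in zip(p, q))
    return sum(abs(g) for g in gaps)


def frobenius_gap(a, b):
    """Frobenius norm of the difference between two equally shaped matrices
    (e.g. human vs. simulated correlation matrices, an assumption)."""
    return sqrt(sum((x - y) ** 2
                    for row_a, row_b in zip(a, b)
                    for x, y in zip(row_a, row_b)))


def cronbach_alpha(rows):
    """Cronbach's alpha for responses given as rows (respondents) of
    numeric answers to k questions, using sample variances."""
    k = len(rows[0])
    item_var = sum(variance(col) for col in zip(*rows))
    total_var = variance([sum(r) for r in rows])
    return k / (k - 1) * (1 - item_var / total_var)
```

Lower is better for the two distance measures, while a higher \(\alpha\) indicates more internally consistent simulated responses.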
Ablation Study
| Analysis Dimension | Synthia | Anthology | Description |
|---|---|---|---|
| Contradictory Persona % | 18% | 63% | Anchoring significantly reduces internal contradictions |
| Avg. Errors per Persona | 0.221 | 0.959 | Narrative contradictions reduced by 77% |
| Frob. Fluctuation Across Waves | 0.04 | 0.20 | Synthia is more stable |
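As a quick check, the 77% figure follows from the two per-persona averages in the table. Applying the same relative-reduction formula to the share of contradictory personas gives a second number, which is derived here rather than reported in the summary:

\[
\frac{0.959 - 0.221}{0.959} = \frac{0.738}{0.959} \approx 0.770,
\qquad
\frac{63 - 18}{63} = \frac{45}{63} \approx 0.714 .
\]

So the average number of contradictions per persona falls by about 77%, while the share of personas containing contradictions falls by 45 percentage points (about 71% in relative terms).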
Key Findings
- Internal narrative consistency is a key factor in aligning population opinions: real-data anchoring cuts the average number of contradictions per persona by 77% (0.959 to 0.221).
- Even with a 4B model (Phi-4-mini), Synthia can compete with Anthology generated by a 70B model.
- Fairness analysis shows that Synthia reduces the accuracy gap between the best and worst demographic subgroups by up to 25%.
- Link prediction accuracy increased by 8.3% and embedding space separability increased by 46%, indicating that the retained network structure carries useful signal.
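One simple way to probe whether an inherited follow graph carries signal is an edge-level homophily check: compare the fraction of follow edges whose endpoints share a label with the rate expected under random mixing. This is an illustrative sketch only, not the paper's link-prediction or embedding evaluation; the edge list, the label dictionary, and both function names are hypothetical.

```python
from collections import Counter


def edge_homophily(edges, label):
    """Fraction of directed follow edges (u, v) whose endpoints share a label.

    Edges with an unlabelled endpoint are ignored.
    """
    labelled = [(u, v) for u, v in edges if u in label and v in label]
    if not labelled:
        return float("nan")
    same = sum(label[u] == label[v] for u, v in labelled)
    return same / len(labelled)


def random_mixing_baseline(label):
    """Probability that two personas drawn independently (with replacement)
    share a label: the sum of squared label proportions."""
    counts = Counter(label.values())
    n = sum(counts.values())
    return sum((c / n) ** 2 for c in counts.values())
```

An observed homophily well above the baseline would suggest that connected personas resemble each other more than chance predicts, with the label being any persona attribute of interest.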
Highlights & Insights
- The approach of anchoring persona generation with real social media posts is both simple and effective. The core insight is that internal consistency of the persona narrative is more important than narrative richness. Anthology uses high-temperature sampling with large models to generate rich narratives, but the lack of anchoring leads to frequent contradictions, which degrades the quality of downstream tasks.
- Preserving social network topology is a unique contribution: the synthetic population becomes a community with social relationships rather than a collection of isolated individuals. This opens new possibilities for social network simulation.
- Matching or exceeding larger models with smaller ones suggests that data quality (anchoring in real content) can matter more than model scale.
Limitations & Future Work
- Only English Bluesky data is used; the user base may not be representative of the general population.
- Removing social identifiers might result in the loss of some useful context.
- Demographic inference relies on the accuracy of the LLM, which may introduce bias.
- Evaluation was limited to U.S. social surveys (Pew Research Center's American Trends Panel, ATP); cross-cultural generalization remains to be verified.
- Future work could explore multilingual and multi-platform persona generation.
Related Work & Insights
- vs Anthology (Moon et al. 2024): Unanchored high-temperature sampling; scalable but prone to contradictions. Synthia uses real post anchoring for better consistency.
- vs Park et al. 2024: Based on interview data; authentic but not scalable. Synthia uses social media posts as a scalable alternative.
- vs PChat (Zhang et al. 2018): Human-written personas; inconsistent quality and not scalable.
Rating
- Novelty: ⭐⭐⭐⭐ Real data anchoring + network topology retention is a meaningful innovation.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ 54 experimental configurations, multi-dimensional evaluation, fairness analysis, and network case studies.
- Writing Quality: ⭐⭐⭐⭐ Clear structure and in-depth analysis.
- Value: ⭐⭐⭐⭐ Direct application value for population simulation in computational social science.