Sleepless Nights, Sugary Days: Creating Synthetic Users with Health Conditions for Realistic Coaching Agent Interactions
Conference: ACL 2025
arXiv: 2502.13135
Code: Yes
Area: Other
Keywords: Synthetic users, health condition modeling, coaching conversational agents, user simulation, LLM evaluation
TL;DR
Proposes an end-to-end framework for generating synthetic users with health conditions (sleep and diabetes management), grounded in real demographic, health/lifestyle, and behavioral/psychological profile data. The synthetic users are used to evaluate the interaction quality of health coaching agents, and in blinded human expert evaluation they are significantly preferred over generic synthetic users.
Background & Motivation
Evaluating interactive health coaching agents requires interaction with users, but collecting and assessing diverse, long-term human interactions is expensive and time-consuming. LLM-generated synthetic users offer a route to automated evaluation, but prior methods have key limitations:
Lack of Grounding in Real Health Conditions: Generic synthetic users cannot accurately reflect the needs and challenges of users under specific health conditions.
Demographic Bias: LLM training data is biased toward English-speaking cultures and highly active online populations, which does not represent the actual patient distribution.
Lack of Contextualized Knowledge: LLMs can mention phenomena such as difficulty sleeping, but such mentions do not substitute for contextualized knowledge rooted in lived experience.
Risk of Causal Implication: Presenting LLMs with specific advice might unintentionally alter other implicit characteristics of the synthetic user.
Core Idea: Synthetic users should be grounded in real data, that is, constructed from actual demographic, health, and behavioral/psychological profiles rather than relying entirely on free-form LLM generation.
Method
Overall Architecture
Two-stage construction of synthetic users:
- Structured Data Generation: Generate structured attributes based on real demographic, health/lifestyle, and behavioral/psychological data.
- Complete Profile Generation: Generate complete user "vignettes" using LLMs based on the structured data.
Then, simulated interactions are conducted between the synthetic users and the coaching agent using either the Concordia system or direct LLM calls.
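The hand-off between the two stages can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `StructuredProfile` fields, the `build_vignette_prompt` helper, and the prompt wording are all assumptions, and the actual LLM call is left out.

```python
from dataclasses import dataclass, asdict


@dataclass
class StructuredProfile:
    """Stage 1 output: structured attributes (field names are illustrative)."""
    age: int
    gender: str
    bmi: float
    sleep_hours: float
    primary_concern: str
    barriers: list[str]


def build_vignette_prompt(profile: StructuredProfile) -> str:
    """Stage 2 input: turn the structured attributes into a vignette prompt."""
    attribute_lines = [f"- {key}: {value}" for key, value in asdict(profile).items()]
    return (
        "Write a short first-person vignette for a person with these attributes. "
        "Stay consistent with every attribute and do not invent new conditions.\n"
        + "\n".join(attribute_lines)
    )


# Hypothetical example values, not taken from any dataset.
profile = StructuredProfile(
    age=34,
    gender="female",
    bmi=24.1,
    sleep_hours=5.8,
    primary_concern="difficulty falling asleep",
    barriers=["late-night screen use"],
)
prompt = build_vignette_prompt(profile)
print(prompt)
```

Keeping the structured profile as a separate object means the same ground truth can later be compared against whatever the coaching agent infers about the user.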
Key Designs
- Attribute Grounding Based on Real Data
  - Sleep Scenario: Uses the LifeSnaps public dataset (68 participants, including demographics, sleep data, Big Five personality, etc.)
  - Diabetes Scenario: Uses the PBHS longitudinal cohort (345 patients with Type 2 diabetes, containing detailed demographic, socioeconomic, and clinical data)
  - Design Motivation: Directly sample the distribution of real data to avoid LLM distributional biases.
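One plausible reading of "directly sample the distribution of real data" is to draw whole records from the cohort so that correlations between attributes are preserved. The sketch below assumes that reading; the `records` rows are made-up stand-ins, not LifeSnaps or PBHS data, and `sample_grounded_attributes` is a hypothetical helper.

```python
import random

# Hypothetical rows standing in for records from a real cohort.
records = [
    {"age": 29, "gender": "male", "bmi": 22.4, "sleep_hours": 6.9},
    {"age": 41, "gender": "female", "bmi": 27.8, "sleep_hours": 5.6},
    {"age": 35, "gender": "female", "bmi": 24.0, "sleep_hours": 7.3},
    {"age": 52, "gender": "male", "bmi": 30.1, "sleep_hours": 6.1},
]


def sample_grounded_attributes(rows, n, seed=0):
    """Draw n whole records with replacement.

    Sampling entire rows (rather than each column independently) keeps the
    joint structure of the data, e.g. how BMI co-varies with sleep duration.
    """
    rng = random.Random(seed)
    return [dict(rng.choice(rows)) for _ in range(n)]


synthetic = sample_grounded_attributes(records, n=5)
for row in synthetic:
    print(row)
```

The alternative of asking an LLM to invent ages or BMI values would inherit whatever distribution the model has learned, which is the bias this design is meant to avoid.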
- Multi-level User Modeling
  - For the sleep scenario:
    - Basic Attributes: Age, gender, BMI, sleep duration and efficiency, Big Five personality.
    - LLM-generated Sleep Profile: Primary sleep concerns, sleep goals, reasons for goals, barriers.
    - Optional Extensions: Challenges from the COM-B behavior model framework, rich backstory.
  - For the diabetes scenario:
    - Sample barriers from 246 real challenges according to the COM-B model distribution.
    - Build vignettes based on patients' demographic, socioeconomic, and clinical data.
    - Generate communication styles (tone, verbosity, level of confidence).
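Sampling barriers "according to the COM-B model distribution" can be illustrated with a two-step draw: pick a COM-B component by weight, then pick a barrier within it. Everything concrete here is an assumption for illustration: the barrier texts, the `category_weights` values, and the `sample_barriers` helper are hypothetical and do not come from the paper's pool of 246 challenges.

```python
import random

# Hypothetical barrier pool keyed by COM-B component.
barriers_by_category = {
    "capability": ["unsure how to count carbohydrates", "forgets to check glucose"],
    "opportunity": ["no time to cook on workdays", "limited access to fresh food"],
    "motivation": ["feels discouraged by slow progress", "does not see diet as urgent"],
}

# Hypothetical category proportions; in practice these would be estimated
# from how often each component appears in the real challenge data.
category_weights = {"capability": 0.3, "opportunity": 0.3, "motivation": 0.4}


def sample_barriers(pool, weights, k, seed=0):
    """Sample k distinct (category, barrier) pairs, category first by weight."""
    reachable = sum(len(pool[c]) for c, w in weights.items() if w > 0)
    if k > reachable:
        raise ValueError("k exceeds the number of barriers that can be drawn")
    rng = random.Random(seed)
    categories = list(weights)
    probs = [weights[c] for c in categories]
    chosen = []
    while len(chosen) < k:
        category = rng.choices(categories, weights=probs, k=1)[0]
        barrier = rng.choice(pool[category])
        if (category, barrier) not in chosen:  # skip duplicates and redraw
            chosen.append((category, barrier))
    return chosen


user_barriers = sample_barriers(barriers_by_category, category_weights, k=2)
print(user_barriers)
```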
- Interaction Simulation
  - Instantiate synthetic users using the Concordia generative agent framework.
  - Concordia provides associative memory, chain-of-thought reasoning, and modular architecture.
  - The Sleep Agent employs a "Talker-Reasoner" dual-agent architecture (System 1 + System 2).
  - Uses Gemini 1.5 Pro as the underlying LLM.
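The simulation itself reduces to an alternating loop between two message-producing components. The sketch below deliberately avoids the Concordia API and any real LLM client: `simulate_dialogue`, `scripted_user`, and `scripted_coach` are hypothetical stand-ins, and treating one "turn" as a user message plus a coach reply is an assumption.

```python
from typing import Callable

Turn = tuple[str, str]  # (speaker, text)


def simulate_dialogue(
    user_reply: Callable[[list[Turn]], str],
    coach_reply: Callable[[list[Turn]], str],
    opening: str,
    num_turns: int = 10,
) -> list[Turn]:
    """Alternate user and coach messages, passing the full transcript each time."""
    transcript: list[Turn] = [("coach", opening)]
    for _ in range(num_turns):
        transcript.append(("user", user_reply(transcript)))
        transcript.append(("coach", coach_reply(transcript)))
    return transcript


# Scripted stand-ins; a real setup would call an LLM conditioned on the
# synthetic user's vignette (user side) and the coaching agent (coach side).
def scripted_user(transcript: list[Turn]) -> str:
    n = sum(1 for speaker, _ in transcript if speaker == "user") + 1
    return f"(user message {n})"


def scripted_coach(transcript: list[Turn]) -> str:
    return "Thanks for sharing. What gets in the way most often?"


transcript = simulate_dialogue(
    scripted_user, scripted_coach, opening="Hi, what brings you here today?"
)
print(len(transcript))  # 21: one opening plus 10 user/coach pairs
```

Passing the whole transcript to each side keeps the loop agnostic to how memory is handled, so a memory-equipped agent framework or a plain LLM call can be swapped in behind the same callable.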
- Multi-dimensional Evaluation Strategy
  - Automated Evaluation: Compare the coaching agent's internal user model with the ground-truth user profile.
  - Expert Evaluation: Trained human evaluators blindly assess interaction quality.
  - Comparative Evaluation: Full synthetic users vs. demographic-only baseline users.
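The automated comparison between the agent's inferred user model and the ground-truth profile can be scored with precision and recall over sets of items such as barriers. The sketch below uses exact matching after lowercasing, which is a simplifying assumption; matching free-text barriers in practice likely needs a softer (for example, LLM- or embedding-based) comparison. The example barriers are invented.

```python
def precision_recall(predicted, truth):
    """Set-based precision and recall with case-insensitive exact matching."""
    pred = {item.strip().lower() for item in predicted}
    gold = {item.strip().lower() for item in truth}
    hits = len(pred & gold)
    precision = hits / len(pred) if pred else 0.0
    recall = hits / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical ground-truth barriers assigned to a synthetic user.
true_barriers = ["late-night screen use", "irregular work shifts", "caffeine in the evening"]
# Hypothetical barriers the coaching agent recorded after the conversation.
inferred_barriers = ["Late-night screen use", "caffeine in the evening", "noisy bedroom", "stress"]

precision, recall = precision_recall(inferred_barriers, true_barriers)
print(f"precision={precision:.2f} recall={recall:.2f}")  # precision=0.50 recall=0.67
```

Here two of the four inferred barriers are correct (precision 0.50) and two of the three true barriers were recovered (recall 0.67).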
Loss & Training
This is a framework-oriented work that does not involve model training. The core lies in the design of synthetic data generation and evaluation pipelines.
Key Experimental Results
Sleep Coaching Experiment (68 synthetic users, 10-turn interactions)
| Evaluation Dimension | Value |
|---|---|
| Accuracy of sleep concern identification | 89.7% |
| Recall of barriers | 71.4% |
| Precision of barriers | 72.5% |
| Recall of sleep goals | 66.4% |
| Precision of sleep goals | 84.2% |
Human Expert Evaluation (Sleep Scenario)
| Evaluation Item | Result |
|---|---|
| Overall preference (full user vs. baseline) | Full user preferred (significant) |
| p-value | \(3.7 \times 10^{-12}\) |
| Inter-annotator agreement | Fleiss' \(\kappa = 0.67\) |
| 5/5 perfect agreement rate | 64% |
| \(\ge 4/5\) agreement rate | 91% |
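For readers unfamiliar with the agreement statistic, Fleiss' kappa can be computed from a table of per-item category counts using the standard formula. The ratings below are toy numbers for five raters choosing between "full" and "baseline"; they are not the paper's annotation data, and `fleiss_kappa` is a hand-written helper rather than a library call.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for a list of items, each a list of per-category rater counts.

    Every item must be rated by the same number of raters.
    """
    num_items = len(counts)
    num_raters = sum(counts[0])
    if any(sum(row) != num_raters for row in counts):
        raise ValueError("every item needs the same number of ratings")

    # Observed agreement: share of agreeing rater pairs per item, averaged.
    per_item = [
        (sum(c * c for c in row) - num_raters) / (num_raters * (num_raters - 1))
        for row in counts
    ]
    observed = sum(per_item) / num_items

    # Chance agreement from the overall category proportions.
    num_categories = len(counts[0])
    proportions = [
        sum(row[j] for row in counts) / (num_items * num_raters)
        for j in range(num_categories)
    ]
    expected = sum(p * p for p in proportions)
    if expected == 1.0:
        raise ValueError("kappa is undefined when only one category is ever used")
    return (observed - expected) / (1 - expected)


# Toy data: each row is [votes for full user, votes for baseline] from 5 raters.
toy_counts = [[5, 0], [4, 1], [5, 0], [0, 5], [3, 2], [1, 4]]
kappa = fleiss_kappa(toy_counts)
print(round(kappa, 3))  # 0.514
```

In this toy table the observed agreement is about 0.767 and the chance agreement is 0.52, giving a kappa of roughly 0.51.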
Diabetes Coaching Experiment (200 synthetic users)
| Evaluation Dimension | Expert Rating |
|---|---|
| User Consistency | 92% |
| Fidelity of Barrier Representation | 100% |
Key Findings
- Coaching agents identify synthetic users' primary sleep concerns with 89.7% accuracy, providing evidence that synthetic users effectively convey their assigned health attributes during interactions.
- Synthetic users built from complete health/behavioral attributes are significantly preferred over baseline users based solely on demographics (\(p = 3.7 \times 10^{-12}\)).
- Inter-rater agreement is substantial (Fleiss' \(\kappa = 0.67\)), indicating that the quality differences are distinct and easy to judge.
- The framework is validated across two independently developed agents and scenarios, demonstrating its generalizability.
Highlights & Insights
- End-to-End Framework Design: A complete workflow covering real data sampling \(\rightarrow\) attribute generation \(\rightarrow\) vignette construction \(\rightarrow\) interaction simulation \(\rightarrow\) multi-dimensional evaluation.
- Crucial Role of Grounding in Real Data: The experiments show that demographic information alone is far from sufficient; health conditions and behavioral profiles are critical to generating realistic synthetic users.
- Independent Validation in Two Scenarios: The sleep and diabetes scenarios were developed independently by different teams, enhancing the reliability of the conclusions.
- Systematic Reflection on LLM Biases: Explicitly identifies and mitigates multiple sources of bias when using LLMs as synthetic users.
- Privacy Preservation Support: The framework generates novel synthetic individuals, reducing direct reliance on real patients' private data.
Limitations & Future Work
- Synthetic users may still lack the depth and nuance of real-life lived experiences.
- Only the elicitation of goals and barriers was evaluated, without assessing the subsequent behavior change process.
- Reliance on the Gemini family of models: different LLMs may yield varying qualities of synthetic users.
- Performance declines when Gemini is replaced with open-source models (e.g., Gemma 2-27B).
- The effectiveness of long-term interactions (beyond 10 turns) has not been verified.
Related Work & Insights
- AMIE (Tu et al., 2024): A conversational agent for medical diagnosis, but its synthetic patient design suffers from limitations such as demographic bias.
- Yu et al. (2024): A knowledge graph-based patient LLM, suitable for clinical but not health coaching scenarios.
- Castricato et al. (2024): Synthetic users based on US census statistics, but considering only demographics without health conditions.
- Concordia (Vezhnevets et al., 2023): The generative agent framework used in this work.
Rating
- Novelty: ⭐⭐⭐⭐ First to systematically integrate real health data into synthetic user generation for coaching agent evaluation.
- Experimental Thoroughness: ⭐⭐⭐⭐ Well-designed dual-scenario validation, automated + expert evaluation, and comparative experiments.
- Writing Quality: ⭐⭐⭐⭐ Clear description of the framework, comprehensive background review.
- Value: ⭐⭐⭐⭐ Provides a practical methodology for agent evaluation in the health AI domain.