Spotting Out-of-Character Behavior: Atomic-Level Evaluation of Persona Fidelity in Open-Ended Generation

Conference: ACL 2025
arXiv: 2506.19352
Code: None
Area: Others
Keywords: persona fidelity, LLM evaluation, personality, out-of-character, atomic evaluation

TL;DR

An atomic-level (sentence-level) evaluation framework is proposed to detect fine-grained Out-of-Character (OOC) behaviors of Large Language Models (LLMs) in open-ended generation through three metrics (ACC_atom, IC_atom, RC_atom). This addresses the issue where traditional holistic scoring approaches fail to capture subtle personality inconsistencies in long texts.

Background & Motivation

Background

Assigning a persona to LLMs is a core requirement for scenarios such as role-playing, social simulation, and dialogue systems.
LLMs often exhibit Out-of-Character (OOC) behaviors during long-text generation: the generated content deviates from the assigned persona, leading to inconsistencies.
For example: when a model is assigned a persona that is "neither extroverted nor introverted", it might act extroverted at times and introverted at other times within the same essay.

Limitations of Prior Work

Existing evaluation methods typically assign a single response-level score, failing to capture subtle personality deviations inside long texts.
When the overall average alignment score is correct but the sentence-level behavior fluctuates wildly, traditional methods mistakenly judge the output as well aligned with the persona.
Prior studies have mostly focused on multiple-choice questions or closed-ended QA, paying insufficient attention to open-ended generation scenarios.

Motivation

An atomic-level evaluation method is required to decompose text into minimal semantic units (sentences) and evaluate persona alignment sentence by sentence.
The approach borrows from FActScore, transferring its fine-grained factual verification method to persona fidelity evaluation.

Method

Overall Architecture

Atomic Unit Segmentation: The generated long text G is split into sentence-level atomic units using NLTK's sent_tokenize.
Trait Scoring: GPT-4o is employed as the grader model to assign a personality trait score in \([1, 5]\) for each atomic sentence.
Invalid Sentence Filtering: Sentences without personality indicators (e.g., purely factual statements) are filtered out.
Metric Calculation: Three complementary metrics are calculated based on the atomic-level scores.
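
The four steps above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `split_sentences` is a naive regex stand-in for NLTK's `sent_tokenize`, and `toy_extraversion_grader` is a hypothetical keyword rule standing in for the GPT-4o grader. A grader returning `None` models the filtering of sentences without personality indicators.

```python
import re
from typing import Callable, List, Optional

# A grader maps one sentence to a trait score in [1, 5], or None if the
# sentence carries no personality signal (assumed interface, not from the paper).
Grader = Callable[[str], Optional[int]]


def split_sentences(text: str) -> List[str]:
    """Naive stand-in for nltk.sent_tokenize: split after ., ! or ? plus whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def score_atoms(text: str, grade: Grader,
                segment: Callable[[str], List[str]] = split_sentences) -> List[int]:
    """Segment the text, grade each sentence, and drop sentences graded None."""
    scores = []
    for sentence in segment(text):
        score = grade(sentence)
        if score is not None:  # invalid-sentence filtering
            scores.append(score)
    return scores


def toy_extraversion_grader(sentence: str) -> Optional[int]:
    """Hypothetical keyword rule used only to make the example runnable."""
    lowered = sentence.lower()
    if "party" in lowered or "friends" in lowered:
        return 5
    if "alone" in lowered or "quiet" in lowered:
        return 1
    return None  # purely factual sentence


essay = ("I love a big party with friends. The museum opens at nine. "
         "Later I sat alone in a quiet room.")
print(score_atoms(essay, toy_extraversion_grader))  # [5, 1]
```

The resulting list of per-sentence scores is the input that the three metrics operate on.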

Key Designs

Metric	Definition	Computation	Value Range
ACC_atom	Atomic Accuracy	The average rate of atomic units matching the target personality interval	\([0, 1]\)
IC_atom	Internal Consistency	An inverse function of the standard deviation of trait scores within a single generation (less spread gives a higher value)	\([0, 1]\)
RC_atom	Test-Retest Consistency	The normalized average Earth Mover's Distance of distributions across multiple generations	\([-1, 1]\)

ACC_atom (Atomic Accuracy): Partitions the scale \([1, 5]\) into three equal intervals (low/medium/high) and checks whether the score of each atomic sentence falls into the target interval, taking the average. This detects the proportion of "out-of-character" sentences.
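
A minimal sketch of this computation for a single generation is shown below. The interval boundaries (equal widths of 4/3, with each boundary belonging to the upper interval) are an assumption; under it, integer scores 1 and 2 count as low, 3 as medium, and 4 and 5 as high. The function names are illustrative.

```python
from typing import List


def level_of(score: float, lo: float = 1.0, hi: float = 5.0) -> str:
    """Map a trait score to low/medium/high using three equal-width intervals."""
    width = (hi - lo) / 3
    if score < lo + width:
        return "low"
    if score < lo + 2 * width:
        return "medium"
    return "high"


def acc_atom(scores: List[float], target: str) -> float:
    """Share of atomic sentences whose score falls in the target interval."""
    if not scores:
        raise ValueError("no valid atomic sentences to evaluate")
    return sum(level_of(s) == target for s in scores) / len(scores)


# Four graded sentences from one generation with a "high" target persona;
# the third sentence is out of character.
print(acc_atom([5, 4, 2, 5], "high"))  # 0.75
```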

IC_atom (Internal Consistency): Measures how consistently the persona is expressed within a single generation. It is derived from the standard deviation of the sentence-level trait scores and inverted, so that a lower standard deviation yields a higher IC_atom. This detects cases where the overall average score is correct but the sentence-level scores fluctuate widely.
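
The exact inversion is not spelled out here, so the sketch below uses one plausible choice as an assumption: one minus the population standard deviation divided by its maximum on the \([1, 5]\) scale (2.0, reached when half the sentences score 1 and half score 5). This keeps the value in \([0, 1]\), matching the range listed in the table, but it may differ from the paper's formula.

```python
from statistics import pstdev
from typing import List

# Largest possible population std on [1, 5]: half the scores at 1, half at 5.
MAX_STD = 2.0


def ic_atom(scores: List[float]) -> float:
    """Assumed normalization: 1 - std / max_std, so 1.0 means no fluctuation."""
    if not scores:
        raise ValueError("no valid atomic sentences to evaluate")
    return 1.0 - pstdev(scores) / MAX_STD


print(ic_atom([3, 3, 3, 3]))  # 1.0, perfectly steady
print(ic_atom([1, 5, 1, 5]))  # 0.0, same mean but maximal fluctuation
```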

RC_atom (Test-Retest Consistency): Employs Earth Mover's Distance (EMD) to measure the distribution difference of scores across multiple generations. This is better at capturing distribution-level differences compared to traditional standard deviation metrics.
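
As an illustration of the idea, the sketch below computes the average pairwise EMD between the score histograms of repeated generations. It assumes integer scores on 1 to 5 and normalizes by the largest possible EMD (4), which yields values in \([0, 1]\); the paper's normalization evidently differs, since the table lists a range of \([-1, 1]\), so treat this only as a sketch of the distribution-distance idea.

```python
from itertools import combinations
from typing import List, Sequence

LEVELS = (1, 2, 3, 4, 5)  # assumed integer grading scale


def histogram(scores: Sequence[int]) -> List[float]:
    """Relative frequency of each score level in one generation."""
    n = len(scores)
    return [sum(1 for s in scores if s == k) / n for k in LEVELS]


def emd_1d(p: Sequence[float], q: Sequence[float]) -> float:
    """EMD on equally spaced points: sum of absolute differences of the CDFs."""
    distance, cumulative = 0.0, 0.0
    for pk, qk in zip(p, q):
        cumulative += pk - qk
        distance += abs(cumulative)
    return distance


def rc_atom(generations: Sequence[Sequence[int]]) -> float:
    """Assumed normalization: 1 - mean pairwise EMD / maximum EMD."""
    hists = [histogram(g) for g in generations]
    pairs = list(combinations(hists, 2))
    if not pairs:
        raise ValueError("need at least two generations")
    mean_emd = sum(emd_1d(p, q) for p, q in pairs) / len(pairs)
    return 1.0 - mean_emd / (LEVELS[-1] - LEVELS[0])


print(rc_atom([[3, 4], [4, 3]]))  # 1.0, identical distributions
print(rc_atom([[1, 1], [5, 5]]))  # 0.0, maximally different distributions
```

Unlike a standard deviation over per-generation means, this comparison is sensitive to the shape of each score distribution, not only to its center.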

Persona Setting

Based on the Big Five Personality Model (OCEAN): Openness, Conscientiousness, Extraversion, Agreeableness, and Neuroticism.
High, medium, and low levels are set for each dimension, resulting in 15 persona settings in total (5 dimensions \(\times\) 3 levels).
Evaluates the model's performance in three open-ended tasks: questionnaire interviews, essay writing, and social media posting.
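
The persona grid described above can be enumerated in a few lines. The string labels below are illustrative; the actual persona prompts are not given here.

```python
from itertools import product

TRAITS = ["Openness", "Conscientiousness", "Extraversion",
          "Agreeableness", "Neuroticism"]
LEVELS = ["high", "medium", "low"]
TASKS = ["questionnaire interview", "essay writing", "social media post"]

# One persona per (trait, level) pair; one condition per (persona, task) pair.
personas = list(product(TRAITS, LEVELS))
conditions = list(product(personas, TASKS))

print(len(personas))    # 15
print(len(conditions))  # 45
```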

Key Experimental Results

Experimental Setup

Models: 12 LLMs, including 3 base models and 9 instruction-tuned/RLHF models.
Tasks: 3 types of open-ended generation tasks (questionnaire interviews, essays, social media posts).
Evaluation Scale: Each model \(\times\) each persona \(\times\) each task is run 30 times.
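
As a rough check on the scale, and assuming every one of the 12 models is run on all 15 personas and all 3 tasks, the total number of generations works out to:

\[
12 \times 15 \times 3 \times 30 = 16{,}200
\]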

Human Validation Results

Personality Dimension	Kendall's tau	Fleiss' kappa
O (Openness)	\(0.69^{***}\)	0.90
C (Conscientiousness)	\(0.76^{***}\)	0.96
E (Extraversion)	\(0.67^{***}\)	0.80
A (Agreeableness)	\(0.72^{***}\)	0.84
N (Neuroticism)	\(0.69^{***}\)	0.74

The scores from GPT-4o show high agreement with human judgments (all \(p < .001\)), validating the reliability of automatic evaluation.
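
For readers unfamiliar with the statistic, the sketch below computes Kendall's tau (the tau-b variant, which corrects for ties) between two rankings in plain Python. Which variant the paper uses is not stated here, and the ratings in the example are made up for illustration; they are not data from the paper.

```python
from itertools import combinations
from math import sqrt
from typing import Sequence


def kendall_tau_b(x: Sequence[float], y: Sequence[float]) -> float:
    """Kendall's tau-b: (concordant - discordant) pairs, corrected for ties."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equally long sequences with at least 2 items")
    concordant = discordant = tied_x_only = tied_y_only = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue  # tied in both rankings: excluded from both denominator terms
        if dx == 0:
            tied_x_only += 1
        elif dy == 0:
            tied_y_only += 1
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    untied = concordant + discordant
    denom = sqrt((untied + tied_x_only) * (untied + tied_y_only))
    if denom == 0:
        raise ValueError("tau is undefined when one ranking is constant")
    return (concordant - discordant) / denom


# Hypothetical ratings for five sentences (illustrative only).
human = [1, 2, 3, 4, 5]
model = [1, 3, 2, 4, 5]
print(kendall_tau_b(human, model))  # 0.8: nine concordant pairs, one discordant
```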

Correlation with Traditional Metrics

Traditional Metric	ACC_atom	RC_atom	IC_atom
ACC	0.91	0.51	0.40
RC	0.48	0.98	0.37

ACC is highly correlated with ACC_atom (0.91), yet critical differences remain: for some models with low-level personas, ACC is high while ACC_atom is low, indicating that the overall score masks internal inconsistency. IC_atom correlates only weakly with the traditional metrics (\(r\) between 0.37 and 0.40), suggesting that it captures a dimension the response-level metrics miss.
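
A toy comparison shows how a response-level view can hide sentence-level inconsistency. Two assumptions are made here: the response-level judgment is approximated by the mean of the sentence scores, and the medium interval is taken as \([7/3, 11/3)\) on the 1 to 5 scale. Neither is taken from the paper, and the scores are invented.

```python
from statistics import mean, pstdev
from typing import Dict, List

LOW_EDGE, HIGH_EDGE = 7 / 3, 11 / 3  # assumed medium interval on the 1-5 scale


def in_medium_band(score: float) -> bool:
    return LOW_EDGE <= score < HIGH_EDGE


def summarize(scores: List[int]) -> Dict[str, float]:
    """Contrast a mean-based verdict with sentence-level statistics."""
    return {
        "mean_in_band": in_medium_band(mean(scores)),  # response-level proxy
        "share_in_band": sum(in_medium_band(s) for s in scores) / len(scores),
        "std": pstdev(scores),
    }


# Both generations target a medium persona and both average exactly 3.
steady = summarize([3, 3, 3, 3])
swinging = summarize([1, 5, 1, 5])
print(steady)    # {'mean_in_band': True, 'share_in_band': 1.0, 'std': 0.0}
print(swinging)  # {'mean_in_band': True, 'share_in_band': 0.0, 'std': 2.0}
```

Both generations pass the mean-based check, but only the sentence-level share and spread reveal that the second one never actually behaves as a medium persona.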

Overall Model Performance Comparison

Model	Instruction-tuned	RLHF	ACC_atom	IC_atom	RC_atom
Davinci-002	Yes		0.39	0.64	0.56
GPT-3.5-turbo	Yes	Yes	0.60	0.75	0.79
GPT-4o	Yes	Yes	0.61	0.74	0.78
LLaMA-3-8B			0.41	0.60	0.64
LLaMA-3-8B-Instruct	Yes	Yes	0.65	0.70	0.82
Mistral-7B			0.41	0.59	0.67
Mistral-7B-Instruct	Yes		0.58	0.69	0.80
Claude-3-haiku	Yes	Yes	0.59	0.71	0.69

Key Findings

Structured tasks perform better: ACC_atom for questionnaire tasks (0.73) > essays (0.58) > social posts (0.52).
High-level personas are the easiest to maintain: they reach an average ACC_atom of 0.95, compared with 0.62 for low-level personas and only 0.27 for mid-level personas.
Social Desirability Bias: Models show the lowest fidelity towards socially undesirable personalities (e.g., "closed-mindedness" or "carelessness"), presumably because RLHF alignment training favors socially desirable behavior.
Tuned models systematically outperform base models: Models with instruction tuning and RLHF improve substantially on all three metrics.

Highlights & Insights

Innovative Atomic-Level Evaluation Paradigm: The FActScore-style fine-grained verification is transferred to persona fidelity evaluation for the first time, characterizing alignment accuracy, internal consistency, and cross-generation consistency through three complementary metrics.
Discovery of Social Desirability Bias: Reveals that RLHF training implicitly biases models towards socially favorable personas, resulting in systematic fidelity degradation for less favorable personas.
Clever Application of Earth Mover's Distance: Uses distribution distance instead of traditional standard deviation to measure test-retest consistency, capturing distributional shifts missed by traditional metrics.
Rigorous Human Validation: Human evaluation on 250 sentence pairs confirms the reliability of the automatic scoring (Kendall's tau between 0.67 and 0.76).

Limitations & Future Work

Limited to Personality Domain: Only validated on the Big Five personality dimensions, leaving other dimensions like social values and political stances unexplored.
Dependence on GPT-4o for Scoring: The accuracy of the automated evaluation is inherently limited by the bias of the grading model itself.
Sentence-level Granularity May Lack Precision: Intra-sentence inconsistencies (such as self-contradictions within a single sentence) cannot be captured by the current framework.
No Mitigation Method Proposed: It only serves as an evaluation framework, leaving concrete strategies to alleviate OOC behaviors unaddressed.
Limited Generation Length: Each generation in the experiments is only 100-300 words. Whether persona drift becomes worse in longer texts remains uninvestigated.

Related Work

Persona Assignment to LLMs: Zhang et al. (2018) on conversational personas, Safdari et al. (2023) on personality measurements, and Park et al. (2023) on role-playing.
Persona Fidelity Evaluation: Response-level ACC/RC by Wang et al. (2024), and prompt format sensitivity analysis by Shu et al. (2024).
Open-ended Generation Evaluation: Atomic factual verification derived from FActScore (Min et al., 2023).
Long-text Consistency: Research on the decay of Transformer long-text coherence by Sun et al. (2021) and Krishna et al. (2022).

Rating

Dimension	Rating (1-5)	Description
Novelty	4	Atomic-level persona evaluation is an innovative angle, and the discovery of social desirability bias is insightful.
Technical Depth	3	The methodology is straightforward (sentence splitting + grading + statistics), but the metric design is solid.
Experimental Thoroughness	5	Scaled across 12 models \(\times\) 15 personas \(\times\) 3 tasks \(\times\) 30 runs, complemented by rigorous human validation.
Writing Quality	4	The paper is clearly structured, with intuitive case studies and thorough quantitative analysis.
Value	4	Establishes a new standard for LLM persona fidelity evaluation, and the findings can guide future alignment studies.