Supernova Event Dataset: Interpreting Large Language Models' Personality through Critical Event Analysis¶

Conference: ICML 2025
arXiv: 2506.12189
Code: None
Area: Interpretability
Keywords: LLM Personality Analysis, Event Extraction and Ranking, Interpretability, LLM-as-Judge, Subjective Reasoning

TL;DR¶

This paper proposes the Supernova Event Dataset, which covers biographies, historical events, news events, and scientific discoveries, drawn mainly from English Wikipedia. Target LLMs are instructed to extract and rank key events from long texts, and a second LLM acts as a judge to infer each target model's "personality traits." The results reveal consistent, model-specific behavioral patterns in subjective decision-making.

Background & Motivation¶

Modern LLM benchmarks primarily focus on tasks with objective ground-truth answers (e.g., question answering, reasoning). However, as LLMs are increasingly deployed in high-stakes fields such as healthcare, law, and finance, evaluating factual accuracy alone is no longer sufficient; understanding the models' subjective judgments and value inclinations has become crucial.

Prior work has shown that LLMs can simulate personality traits when explicitly prompted with specific personas. However, the core finding of this work is that even without role-playing prompts, LLMs exhibit consistent behavioral patterns when handling complex subjective tasks, and these patterns can be interpreted as "personalities."

Critical event identification and ranking is inherently a subjective task:
- It requires reasoning across long contexts.
- It requires modeling causal chains and non-linear interactions between events.
- Different individuals (and models) make distinct choices due to variations in underlying values.

This makes the task an ideal tool for probing the latent decision-making tendencies of LLMs.

Method¶

Overall Architecture¶

The framework consists of three stages:

Dataset Construction: Building the Supernova Event Dataset, which includes four categories of articles (biographies, historical events, news events, and scientific discoveries); the first three are drawn from English Wikipedia.
Event Extraction and Ranking: Each target LLM receives an article through a RAG pipeline, then extracts and ranks the five most critical events.
Personality Judgment: Another LLM (Judge) analyzes the target model's event selection and ranking to infer its personality type.

Key Designs¶

Dataset Construction (Supernova Event Dataset)¶

Category	Source	Min Word Count	Min Views	Additional Filtering	Article Count
Biography	English Wikipedia	3000	50000	Infobox template filtering	150
Historical Events	English Wikipedia	500	5000	ORES \(\ge\) B + LLM verification + Year < 2000	150
News Events	English Wikipedia	500	5000	ORES \(\ge\) B + LLM verification + Year > 2000	150
Scientific Discoveries	Gemini Deep Research	-	-	Nobel Prize API + Gemini expansion	25

Highlights of dataset design:
- Biography: Requires \(\ge 3000\) words to ensure coverage of the subject's entire life, and uses standardized infobox templates for filtering.
- Historical/News Events: Two-stage filtering: heuristic rules first remove ambiguous pages, then a local LLaMA-3-8B performs semantic validation (confidence \(> 0.9\)).
- Scientific Discoveries: 384 award records (1901-2024) were extracted from the Nobel Prize REST API and expanded into encyclopedic articles using Gemini 2.5 Pro Deep Research.
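The Wikipedia thresholds in the table can be read as a simple filter predicate. The sketch below is only an illustration under stated assumptions: the `Article` record, its field names, and the `passes_filter` helper are hypothetical, and the ORES quality check and infobox filtering are left out.

```python
from dataclasses import dataclass
from typing import Optional

# Word-count and view thresholds copied from the table above.
THRESHOLDS = {
    "biography": {"min_words": 3000, "min_views": 50000},
    "historical": {"min_words": 500, "min_views": 5000},
    "news": {"min_words": 500, "min_views": 5000},
}


@dataclass
class Article:
    # Hypothetical record; the field names are assumptions for this sketch.
    title: str
    category: str
    word_count: int
    views: int
    year: Optional[int] = None
    llm_confidence: float = 0.0  # assumed score from the LLM verification step


def passes_filter(article: Article, min_confidence: float = 0.9) -> bool:
    """Apply the size, popularity, confidence, and year rules for one article."""
    limits = THRESHOLDS[article.category]
    if article.word_count < limits["min_words"] or article.views < limits["min_views"]:
        return False
    if article.category == "biography":
        return True
    # Event pages additionally need a confident LLM verdict and a known year.
    if article.year is None or article.llm_confidence <= min_confidence:
        return False
    if article.category == "historical":
        return article.year < 2000
    return article.year > 2000
```

Under this reading, an event page dated exactly 2000 falls into neither category, which mirrors the strict inequalities in the table.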

RAG Pipeline and Event Extraction¶

Document processing pipeline:
1. Chunking: Segmenting documents into semantic chunks of 1000 tokens (with a 100-token overlap).
2. Embedding: Using the nomic-embed-text-v1 model to generate high-dimensional vectors.
3. Indexing: Storing vectors in a FAISS vector database.
4. Retrieval: MultiQueryRetriever rewrites queries into multiple search queries to improve retrieval recall.
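The chunking step can be pictured as a sliding window over the token sequence. The following is a minimal sketch, assuming a plain list of tokens and a fixed-size window; the `chunk_tokens` helper is hypothetical and does not reproduce the semantic splitting or the tokenizer used in the paper.

```python
def chunk_tokens(tokens, chunk_size=1000, overlap=100):
    """Split a token list into windows of chunk_size that share `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap  # each new window starts `step` tokens later
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reaches the end of the document
    return chunks


# Toy document of 2500 placeholder tokens.
tokens = [f"tok{i}" for i in range(2500)]
chunks = chunk_tokens(tokens)
print([len(c) for c in chunks])
```

With these defaults, consecutive chunks share their boundary tokens, so an event described across a chunk boundary still appears intact in at least one chunk as long as it fits within the overlap.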

Two-stage prompt strategy:
- First-stage prompt: Directs the retriever to focus on critical event characteristics such as "turning points" and "cascading effects," rather than merely fetching topically relevant content.
- Second-stage prompt: Guides the LLM to perform structural analysis, requiring it to identify and rank the five most critical events.

For scientific discoveries, counterfactual testing ("would the result change if this event did not occur?") is additionally incorporated as a selection criterion.
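To make the second-stage prompt concrete, here is a hedged sketch of how such a prompt might be assembled. The wording of the instructions, the `build_ranking_prompt` helper, and the `"scientific_discovery"` category label are all assumptions for illustration; the paper's actual prompts are not reproduced here.

```python
def build_ranking_prompt(context: str, category: str, top_k: int = 5) -> str:
    """Assemble an illustrative ranking prompt (hypothetical wording)."""
    criteria = [
        "marks a turning point",
        "triggers cascading effects on later events",
    ]
    if category == "scientific_discovery":
        # Extra counterfactual criterion, used only for scientific discoveries.
        criteria.append(
            "passes a counterfactual test: would the result change "
            "if this event did not occur?"
        )
    bullet_list = "\n".join(f"- {c}" for c in criteria)
    return (
        f"Read the context and identify the {top_k} most critical events.\n"
        "Rank them from most to least critical. An event is critical if it:\n"
        f"{bullet_list}\n\n"
        f"Context:\n{context}"
    )


print(build_ranking_prompt("(retrieved chunks go here)", "scientific_discovery"))
```

Keeping the criteria in a list makes it easy to see that the counterfactual test is the only category-specific addition in this sketch.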

Personality Judgment Framework¶

Judge Model: Uses Qwen 2.5 (14B) as an external evaluator.
Evaluation Method: The Judge receives the target LLM's complete event selection and ranking output to analyze its decision-making patterns.
Personality Encoding: Employs sentence-transformers (all-MiniLM-L6-v2) to perform semantic embedding of the identified personality traits.
Visualization: Applies PCA dimensionality reduction to the aggregated embeddings to visualize each model's position in a 2D personality space.
Similarity Measure: Calculates cosine similarity to quantify personality similarity across different models.
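The aggregation and similarity steps can be sketched in a few lines. This is an illustration under stated assumptions, not the paper's implementation: the 3-dimensional `TOY_VECTORS` stand in for real sentence embeddings, the trait labels and `model_a`/`model_b` are invented, the helper names are hypothetical, and the PCA step is omitted.

```python
import math
from collections import Counter


def weighted_profile(trait_labels, trait_vectors):
    """Average trait vectors, weighting each trait by how often it was assigned."""
    counts = Counter(trait_labels)
    total = sum(counts.values())
    dim = len(next(iter(trait_vectors.values())))
    profile = [0.0] * dim
    for trait, count in counts.items():
        weight = count / total
        for i, value in enumerate(trait_vectors[trait]):
            profile[i] += weight * value
    return profile


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Toy stand-ins for embedding vectors (assumption, not real embeddings).
TOY_VECTORS = {
    "strategic": [1.0, 0.0, 0.0],
    "emotional": [0.0, 1.0, 0.0],
    "creative": [0.0, 0.0, 1.0],
}

# Invented judge outputs for two hypothetical models.
model_a = weighted_profile(["strategic", "strategic", "creative"], TOY_VECTORS)
model_b = weighted_profile(["emotional", "emotional", "strategic"], TOY_VECTORS)
print(round(cosine_similarity(model_a, model_b), 3))
```

Frequency weighting means a trait the judge assigns repeatedly pulls the profile toward its embedding, so two models are similar only if they share their dominant traits.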

Loss & Training¶

This work does not involve model training; instead, it is an evaluation framework. Its core components include:

Inference-time strategy: Structured prompt guidance combined with RAG retrieval augmentation.
Personality quantification: Aggregation of trait embeddings weighted by frequency.
Analysis of scientific discoveries: Combines keyword counting and open coding to converge into three types of decision-making principles:
- Causality-centric: Focuses on mechanisms and causal pathways.
- Enablement-centric: Focuses on foundations, barrier removal, and validation.
- Synthesis-centric: Emphasizes conceptual integration and paradigm-level connections.
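The keyword-counting half of this analysis can be sketched as follows. The keyword lists and the `principle_counts` and `dominant_principle` helpers are hypothetical; the paper's coding scheme is not given here, and the open-coding step is a manual process that this sketch does not cover.

```python
import re
from collections import Counter

# Hypothetical keyword lists, loosely based on the three descriptions above.
PRINCIPLE_KEYWORDS = {
    "causality": {"cause", "mechanism", "pathway", "led"},
    "enablement": {"foundation", "enabled", "barrier", "validated"},
    "synthesis": {"integrated", "unified", "paradigm", "connected"},
}


def principle_counts(rationale: str) -> Counter:
    """Count keyword hits per principle in a model's ranking rationale."""
    words = re.findall(r"[a-z]+", rationale.lower())
    counts = Counter()
    for principle, keywords in PRINCIPLE_KEYWORDS.items():
        counts[principle] = sum(1 for word in words if word in keywords)
    return counts


def dominant_principle(rationale: str):
    """Return the principle with the most hits, or None if nothing matched."""
    principle, hits = principle_counts(rationale).most_common(1)[0]
    return principle if hits > 0 else None


print(dominant_principle("This mechanism opened the pathway that led to the discovery."))
```

Exact-word matching is deliberately crude; in practice, stemming or a manual review pass would be needed before treating the counts as evidence of a reasoning style.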

Key Experimental Results¶

Main Results¶

Evaluated Models:
- Small Models: Phi-4, Orca 2 (13B), Qwen 2.5 (14B)
- Large Models (for Scientific Discoveries): Claude 3.7 Sonnet, Gemini 2.5 Pro, OpenAI o3

Distribution of Personality Categories (seven personality dimensions):

Model	Strategic Achiever	Creative Innovator	Emotional	Community Support	Ideological	Observational	Influencer
Phi-4	Highest	High	Moderate	Low	Low	Low	Low
Orca 2	Moderate	Low	Highest	Moderate	Low	Low	Low
Qwen 2.5	Highest	High	Moderate	Moderate	Moderate	Moderate	Moderate

Distribution of Decision-Making Principles in Scientific Discoveries:

Model	Causality-centric	Enablement-centric	Synthesis-centric
o3	Dominant	Moderate	Low
Gemini 2.5 Pro	Moderate	Dominant	Low
Claude 3.7 Sonnet	Low	Prominent	Dominant

Ablation Study¶

Configuration	Key Metrics	Description
Movie Script Dataset (1,172 scripts)	Consistent personality patterns	Verifies personality stability across different domains
Phi-4's performance in movies	Strategic/plot-oriented	Prefers "Jafar's schemes", "Zuckerberg's decision to launch Facebook"
Orca 2's performance in movies	Emotional/relationship-oriented	Prefers "Aladdin meeting Jasmine", "Mark and Eduardo's fallout"
Qwen 2.5's performance in movies	Milestone-oriented	Prefers "The creation of Facemash", "The final performance at Lincoln Center"

Key Findings¶

Reproducible Model Personality: Models exhibit consistent behavioral preferences across different domains (biographies, financial crises, movie scripts, and scientific discoveries).
Significant Differences in Small Models: Phi-4 leans toward "strategic achievement," Orca 2 leans toward "emotional reasoning," and Qwen 2.5 is the most balanced.
Divergent Reasoning Styles in Large Models: o3 favors step-by-step causal reasoning, Gemini 2.5 Pro focuses on enabling foundations and empirical validation, and Claude 3.7 Sonnet emphasizes conceptual integration.
Clear Separation in Semantic Space: PCA visualization shows that the three small models occupy distinctly separate personality regions.
No Role-Playing Required: Personality traits naturally emerge without explicit personality prompting.

Highlights & Insights¶

Elegant Task Design: Critical event ranking is an inherently subjective task with no single correct answer, so a model's choices directly reflect its value preferences. This probes model behavior at a deeper level than traditional benchmarks.
Prompt-agnostic: The personality identification methodology in this work does not rely on specific prompt engineering; the behavioral patterns of the models remain consistent across different prompts.
Insightful Scientific Discovery Analysis: The three types of reasoning principles (causality, enablement, and synthesis) provide a practical reference for selecting LLMs: use o3 when causal analysis is needed, Gemini for methodological foundation evaluation, and Claude for cross-domain conceptual integration.
Counterfactual Testing: Using "would the outcome have changed if this event had not occurred?" to filter key events is methodologically rigorous.
Significance for AI-Assisted Research: Understanding the reasoning personality of LLMs assists in designing better human-AI collaborative research workflows.

Limitations & Future Work¶

Data Bias: Wikipedia carries editorial biases and a Western-centric perspective, which may influence the inferred personality labels.
LLM-as-Judge Bias: The judge model has its own stylistic biases, and its verdicts were not validated against human judgments.
Non-standardized Personality Framework: The personality categories are empirically derived rather than being grounded in established psychological frameworks like the Big Five.
Small Sample Size for Scientific Discoveries: With only 25 articles, the scientific-discovery subset supports only limited statistical conclusions.
Lack of Adversarial Testing: Whether model personalities remain stable under adversarial prompting has not been verified.
Exclusion of Inference Parameters: The impact of inference parameters such as temperature was not analyzed; different sampling strategies might affect event selection.
Single Evaluator: Only Qwen 2.5 was used as the judge, without cross-validation from a multi-judge committee.

Related Work & Insights¶

LLM Personality Research: Jiang et al. (2023) and Bodroža et al. (2024) utilize psychometric tools like the Big Five to evaluate the behavioral traits of LLMs. This study extends this exploration to scenarios without explicit personality prompts.
Event Extraction: DDEE (Liu & Luo, 2024), ULTRA (Zhang et al., 2024), and EventRL (Gao et al., 2024) focus on the accuracy of event extraction; this paper shifts the focus to the subjective dimension of event importance ranking.
Long-Context Reasoning: NoLiMa (Modarressi et al., 2025) and BABILong (Kuratov et al., 2024) evaluate long-context capabilities; this work complements these evaluations from a personality perspective.
Inspirations for Future Ideas: The personality analysis framework can be extended to more models and domains; combining it with mechanistic interpretability can probe how personality-related features are internally represented in models.

Rating¶

Dimension	Score (1-5)	Description
Novelty	4	Novel task design mapping event ranking to personality inference
Technical Depth	3	The method itself is relatively straightforward (RAG + prompt + judge) without complex model designs
Experimental Thoroughness	3	Sufficient cross-domain validation, but sample sizes are limited and human evaluation is lacking
Value	4	Offers practical guidance for model selection and human-AI collaboration
Writing Quality	4	Clear structure, rich cases, and strong readability
Overall	3.5	Valuable concept, but requires a more rigorous validation framework