Personality-Guided Code Generation Using Large Language Models¶

Conference: ACL 2025
arXiv: 2411.00006
Code: GitHub
Area: Code Intelligence
Keywords: code generation, personality, MBTI, prompt engineering, role-playing

TL;DR¶

This work uses GPT-4o to dynamically generate a matching MBTI personality type and detailed description for each programming task, then guides the target LLM to generate code by role-playing a programmer with that personality. Across 28 combinations of 7 LLMs and 4 datasets, pass rates improve in 23 cases (by up to 12.9 percentage points). The key factor is personality diversity rather than any single specific personality.

Background & Motivation¶

Background: LLM-driven code generation has become mainstream. A common practice is to prompt LLMs to role-play as a "programmer" to generate code. Meanwhile, software engineering research has reported that matching tasks with developer personality can improve development quality, and that team personality diversity is positively correlated with high-quality software delivery.

Limitations of Prior Work: Existing role-playing prompts only broadly ask the LLM to act as a "programmer" without refining specific personality traits. Using the same persona setting for all programming tasks overlooks the fact that different tasks may suit different thinking styles.

Key Challenge: Can the quality improvement brought by "task-personality matching" in human software development be transferred to LLM code generation? If so, what factors affect the performance?

Key Insight: Drawing on the MBTI personality framework, this study automatically generates a matched personality type for each programming task, so that the LLM generates code under diverse personas. Large-scale experiments verify the effect and the factors behind it.

Method¶

Overall Architecture¶

A two-stage pipeline:
1. Personality Generation: GPT-4o analyzes the programming task description and outputs the most suitable MBTI type (one of 16 types) along with its detailed personality description for that task.
2. Code Generation: The generated MBTI type and detailed description are embedded into the prompt, guiding the target LLM to act as a programmer with that personality to generate code.
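The two stages above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the prompt wording, the function names, and the convention that the personality model answers with the four-letter type on its first line are all assumptions. The two LLMs are passed in as plain callables (`personality_llm`, `code_llm`) so the sketch stays independent of any particular API client.

```python
from typing import Callable

# The 16 MBTI types, built from the four dichotomies.
MBTI_TYPES = {a + b + c + d for a in "EI" for b in "SN" for c in "TF" for d in "JP"}


def generate_personality(task: str, personality_llm: Callable[[str], str]) -> tuple[str, str]:
    """Stage 1: ask the personality model for an MBTI type and a description.

    Assumes (hypothetically) that the reply puts the four-letter type on the
    first line and the task-specific description on the following lines.
    """
    prompt = (
        "Pick the MBTI type best suited to solving this programming task.\n"
        "Answer with the four-letter type on the first line, "
        "then a detailed personality description.\n\n"
        f"Task:\n{task}"
    )
    reply = personality_llm(prompt).strip()
    first_line, _, description = reply.partition("\n")
    mbti = first_line.strip().upper()
    if mbti not in MBTI_TYPES:
        raise ValueError(f"unexpected MBTI type: {first_line!r}")
    return mbti, description.strip()


def generate_code(task: str, mbti: str, description: str,
                  code_llm: Callable[[str], str]) -> str:
    """Stage 2: embed the type and description into the code-generation prompt."""
    prompt = (
        f"You are a programmer with the MBTI personality type {mbti}.\n"
        f"{description}\n\n"
        f"Solve the following task in Python:\n{task}"
    )
    return code_llm(prompt)


def personality_guided_generation(task: str,
                                  personality_llm: Callable[[str], str],
                                  code_llm: Callable[[str], str]) -> str:
    """Run both stages for a single task."""
    mbti, description = generate_personality(task, personality_llm)
    return generate_code(task, mbti, description, code_llm)
```

In practice `personality_llm` would wrap a GPT-4o call and `code_llm` a call to the target model; passing stub functions instead makes the pipeline easy to test offline.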

Evaluation method: The generated code is considered "passed" only if it passes all test cases. Each LLM is run 3 times on each dataset to obtain the average pass rate.
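The scoring rule can be illustrated with a short sketch. It assumes, for simplicity, that each generated solution is already available as a Python callable and that each test case is an `(args, expected)` pair; real benchmark harnesses execute generated source code in a sandbox, which is omitted here. All names are hypothetical.

```python
from statistics import mean
from typing import Callable, Sequence


def passes_all(candidate: Callable, test_cases: Sequence[tuple[tuple, object]]) -> bool:
    """A solution counts as passed only if every test case succeeds."""
    for args, expected in test_cases:
        try:
            if candidate(*args) != expected:
                return False
        except Exception:
            # A crash on any test case is treated as a failure.
            return False
    return True


def pass_rate(candidates: Sequence[Callable],
              tests_per_task: Sequence[Sequence[tuple[tuple, object]]]) -> float:
    """Percentage of tasks whose candidate passes all of its tests."""
    passed = sum(passes_all(c, t) for c, t in zip(candidates, tests_per_task))
    return 100.0 * passed / len(candidates)


def average_pass_rate(runs: Sequence[Sequence[Callable]],
                      tests_per_task: Sequence[Sequence[tuple[tuple, object]]]) -> float:
    """Average the pass rate over repeated runs (three runs in this setup)."""
    return mean(pass_rate(run, tests_per_task) for run in runs)
```

Requiring all test cases to pass makes the metric strict: a solution that fails a single edge case scores the same as one that fails everything.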

Key Designs¶

Task-Adapted Dynamic Personality Generation:
- Function: GPT-4o dynamically selects the most suitable MBTI type and generates a customized description based on the specific characteristics of each programming task.
- Mechanism: Different programming tasks are assumed to suit different cognitive styles: logic-intensive tasks suit the Thinking dimension, algorithm design suits the Intuition dimension, and debugging tasks suit the Sensing dimension.
- Design Motivation: Utilizing a fixed MBTI type yields almost no difference compared to not using personality guidance (e.g., the pass rate of 16 fixed MBTI types on Qwen-Long ranges from 65.7% to 68.4%, while diverse adaptation reaches 80.8%), indicating that "diversity" is the key.
Full Prompt Design (Detailed Personality Description):
- Function: The prompt not only specifies the four-letter MBTI type but also includes a detailed personality description generated by GPT-4o tailored to the specific task.
- Mechanism: Helps the LLM understand and role-play the persona more deeply.
- Design Motivation: The Full Prompt outperforms the Short Prompt (only labelling the MBTI type) by an average of 3.94% across 7 LLMs. Using a generic template description (instead of a task-customized description) on Qwen-Long yields a pass rate of only 65.5%, which is far below the 80.8% achieved by customized descriptions.
Orthogonal Combination with Other Prompting Strategies:
- Function: Personality guidance can be combined with other strategies like CoT and few-shot prompting.
- Mechanism: Personality guidance affects "persona perception" while CoT affects the "reasoning process"; the two are orthogonal and complementary.
- Design Motivation: CoT + Personality outperforms either strategy alone on 5 out of 7 LLMs, with a maximum gain of 13.8%.
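The prompt variants compared in these designs (direct role-play, Short Prompt, Full Prompt, and the CoT combination) can be sketched with a single builder. The exact wording used in the paper is not reproduced here; the strings and the `build_prompt` helper below are illustrative assumptions.

```python
from typing import Optional


def build_prompt(task: str,
                 mbti: Optional[str] = None,
                 description: Optional[str] = None,
                 use_cot: bool = False) -> str:
    """Assemble a code-generation prompt (hypothetical wording).

    - mbti=None                -> direct role-play ("You are a programmer.")
    - mbti only                -> Short Prompt (four-letter type only)
    - mbti + description       -> Full Prompt (type plus task-specific description)
    - use_cot=True             -> appends a chain-of-thought instruction
    """
    parts = []
    if mbti is None:
        parts.append("You are a programmer.")
    else:
        parts.append(f"You are a programmer with the MBTI personality type {mbti}.")
        if description:
            parts.append(description)
    parts.append(f"Task:\n{task}")
    if use_cot:
        # The CoT instruction is independent of the persona section.
        parts.append("Let's think step by step before writing the final code.")
    return "\n\n".join(parts)


task = "Write a function that returns the n-th Fibonacci number."
direct = build_prompt(task)
short = build_prompt(task, mbti="INTJ")
full = build_prompt(task, mbti="INTJ",
                    description="You plan carefully and favor systematic solutions.")
full_cot = build_prompt(task, mbti="INTJ",
                        description="You plan carefully and favor systematic solutions.",
                        use_cot=True)
```

Keeping the persona section and the reasoning instruction as separate prompt parts mirrors the claim that the two strategies act on different aspects of the prompt and can be combined freely.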

Loss & Training¶

This work is a zero-shot, training-free prompt engineering method and involves no model training. GPT-4o is used for personality generation, and each target LLM is run with its default settings for code generation.

Key Experimental Results¶

Main Results¶

Pass rate comparison (Direct vs. MBTI-guided) across 7 LLMs and 4 datasets:

LLM	MBPP Sanitized	MBPP+	HumanEval+	APPS	Average Gain
GPT-4o	78.2→84.3 (+6.1%)	71.2→72.7 (+1.5%)	84.8→82.9 (-1.9%)	46.2→45.2 (-1.0%)	+1.2%
GPT-4o mini	69.3→82.2 (+12.9%)	69.4→71.7 (+2.3%)	80.5→82.3 (+1.8%)	34.6→37.2 (+2.6%)	+4.9%
Llama3.1 (70B)	69.8→81.0 (+11.2%)	66.7→69.2 (+2.5%)	72.0→72.6 (+0.6%)	18.4→25.2 (+6.8%)	+5.3%
Qwen-Long	68.4→80.8 (+12.4%)	67.7→71.2 (+3.5%)	76.8→78.7 (+1.9%)	10.2→18.2 (+8.0%)	+6.5%
DeepSeek-Coder V2	74.9→85.7 (+10.8%)	71.4→72.2 (+0.8%)	80.5→76.2 (-4.3%)	39.4→34.4 (-5.0%)	+0.6%
Codestral (22B)	64.2→73.8 (+9.6%)	61.2→64.9 (+3.7%)	75.6→76.8 (+1.2%)	15.8→22.6 (+6.8%)	+5.3%
CodeLlama (13B)	43.3→46.8 (+3.5%)	42.4→52.4 (+10.0%)	32.9→29.9 (-3.0%)	1.4→6.4 (+5.0%)	+3.9%

Overall: Improvements are observed in 23 out of 28 combinations, with 11 gaining at least 5 percentage points and 5 gaining at least 10.
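The summary counts and the Average Gain column can be rechecked directly from the per-cell gains in the table. The sketch below copies those gains verbatim and uses `Decimal` with half-up rounding so that values such as 6.45 round as they would by hand; counting gains of at least 5 and at least 10 points reproduces the 11 and 5 figures.

```python
from decimal import Decimal, ROUND_HALF_UP

# Per-dataset gains (MBPP Sanitized, MBPP+, HumanEval+, APPS), copied from the table.
GAINS = {
    "GPT-4o":            ["6.1", "1.5", "-1.9", "-1.0"],
    "GPT-4o mini":       ["12.9", "2.3", "1.8", "2.6"],
    "Llama3.1 (70B)":    ["11.2", "2.5", "0.6", "6.8"],
    "Qwen-Long":         ["12.4", "3.5", "1.9", "8.0"],
    "DeepSeek-Coder V2": ["10.8", "0.8", "-4.3", "-5.0"],
    "Codestral (22B)":   ["9.6", "3.7", "1.2", "6.8"],
    "CodeLlama (13B)":   ["3.5", "10.0", "-3.0", "5.0"],
}

all_gains = [Decimal(g) for gains in GAINS.values() for g in gains]

improved = sum(g > 0 for g in all_gains)        # combinations with a higher pass rate
at_least_5 = sum(g >= 5 for g in all_gains)     # gains of 5 points or more
at_least_10 = sum(g >= 10 for g in all_gains)   # gains of 10 points or more

# Mean gain per model, rounded half-up to one decimal place.
average_gain = {
    model: (sum(Decimal(g) for g in gains) / len(gains)).quantize(
        Decimal("0.1"), rounding=ROUND_HALF_UP
    )
    for model, gains in GAINS.items()
}

print(improved, at_least_5, at_least_10)
for model, gain in average_gain.items():
    print(f"{model}: {gain:+}")
```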

Ablation Study¶

Configuration	MBPP Sanitized Pass Rate (Qwen-Long unless noted)	Explanation
Direct (No Personality)	68.4%	Baseline
16 Fixed MBTI Types	65.7%-68.4%	Almost no difference from baseline
Diverse MBTI (Full Prompt)	80.8%	+12.4%, diversity is key
Short Prompt (MBTI 4-letter only)	73.3%	7.5% lower than Full Prompt
Generic Template Description	65.5%	Far inferior to task-customized descriptions
MBTI vs. Big Five Personality	80.8% vs 71.4%	MBTI significantly outperforms Big Five
Personality alone	84.3% (GPT-4o)	Outperforms 3-shot (77.3%) and CoT (77.0%)
CoT + Personality	85.7% (GPT-4o)	Best combination
CoT + Personality (Qwen-Long)	82.2%	13.8% improvement over Direct

Key Findings¶

Medium-performing models benefit the most: GPT-4o (too strong) and CodeLlama (too weak) show limited benefits, while Qwen-Long, Llama3.1, and Codestral show the most significant gains.
Medium-difficulty datasets show the largest improvement: gains are largest on MBPP Sanitized for 6 of the 7 models, while performance on HumanEval+ (where 78% of tasks are assigned INTJ, resulting in low diversity) and APPS (90.6% assigned INTJ) is unstable, with some models regressing.
Personality diversity is the core factor: Any fixed single personality ≈ no personality guidance. Different personalities show complementarity in solving tasks (e.g., INTJ and ISTJ each have 4.5% uniquely solved problems).
MBTI outperforms Big Five: The four dimensions of MBTI (S/N, T/F, E/I, J/P) align better with the cognitive demands of programming; in Big Five, only conscientiousness is strongly related to programming.
Full Prompt consistently outperforms Short Prompt: Detailed descriptions customized to tasks yield an average improvement of 3.94% over simply labeling the MBTI type.
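The complementarity finding can be made concrete with a set-based sketch: given the tasks each persona solves, the uniquely solved tasks of a persona are those that no other persona in the comparison solves. The data layout, function names, and toy task IDs below are hypothetical, and the paper's exact procedure for the 4.5% figure may differ.

```python
def uniquely_solved(solved_by: dict[str, set[str]], persona: str) -> set[str]:
    """Tasks solved by `persona` and by no other persona in `solved_by`."""
    others = set().union(*(s for p, s in solved_by.items() if p != persona))
    return solved_by[persona] - others


def unique_share(solved_by: dict[str, set[str]], persona: str, n_tasks: int) -> float:
    """Uniquely solved tasks as a percentage of all tasks."""
    return 100.0 * len(uniquely_solved(solved_by, persona)) / n_tasks


# Toy example with made-up task IDs (not data from the paper).
solved = {
    "INTJ": {"t1", "t2", "t3"},
    "ISTJ": {"t2", "t3", "t4"},
}
intj_only = uniquely_solved(solved, "INTJ")
istj_only = uniquely_solved(solved, "ISTJ")
```

When each persona has a non-empty set of uniquely solved tasks, choosing the persona per task can in principle solve more tasks than any single fixed persona, which is consistent with diversity being the driving factor.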

Highlights & Insights¶

Transferring personality-task matching theory from software engineering to LLM prompt design is an insightful cross-domain idea.
The key finding is that "diversity" rather than "a specific personality" brings improvements, implying that role-playing prompts need to be dynamically matched with task characteristics rather than statically set.
The method is extremely lightweight: zero training, simply prepending a personality description to the prompt, making it plug-and-play.
The CoT + Personality combined strategy demonstrates the possibility of orthogonal complementarity across different dimensions of prompt engineering.

Limitations & Future Work¶

Personality generation relies on GPT-4o, which adds one extra LLM call, and its API cost, per task.
MBTI is controversial in the psychology community and is not universally accepted as the most scientific personality framework.
It has only been validated on Python function-level code generation, without covering complex scenarios such as repository-level or multi-file environments.
The underlying mechanism of why personality guidance is effective remains unclear; it might simply increase role-playing contextual information through richer prompts.
The theoretical upper bound cannot be calculated due to the lack of ground truth for personality-task matching.

vs. Direct role-playing: Simply using "You are a programmer" is inferior to personality guidance on all 7 LLMs, indicating that role-playing needs fine-grained refinement.
vs. CoT: Personality guidance alone outperforms CoT and 3-shot across all 7 LLMs, and can be combined with CoT for additional improvements.
vs. Team diversity studies (Pieterse et al., 2018): This work validates that personality diversity is equally important in LLM scenarios.

Rating¶

Novelty: ⭐⭐⭐ Introducing MBTI to code generation is interesting but essentially a variant of prompt engineering.
Experimental Thoroughness: ⭐⭐⭐⭐⭐ Extremely comprehensive with 7 LLMs, 4 datasets, 5 RQs, comparison of 16 fixed MBTI types, and comparison of two personality frameworks.
Writing Quality: ⭐⭐⭐⭐ The RQ-driven empirical research paradigm is clear, with definitive conclusions for each RQ.
Value: ⭐⭐⭐ Practicality is average (performance is unstable across different scenarios), but the finding that "diversity is key" is a useful insight.