Personality-Guided Code Generation Using Large Language Models
Conference: ACL 2025
arXiv: 2411.00006
Code: GitHub
Area: Code Intelligence
Keywords: code generation, personality, MBTI, prompt engineering, role-playing
TL;DR
This work uses GPT-4o to generate a matching MBTI personality type and a detailed personality description for each programming task, then has the target LLM generate code while role-playing a programmer with that personality. Across 28 combinations of 7 LLMs and 4 datasets, pass rates improve in 23 cases (by up to 12.9%). The key factor is personality diversity across tasks rather than any single specific personality.
Background & Motivation
Background: LLM-driven code generation has become mainstream. A common practice is to prompt LLMs to role-play as a "programmer" to generate code. Meanwhile, software engineering research reports that matching tasks to developer personality can improve development quality, and that team personality diversity is positively correlated with high-quality software delivery.
Limitations of Prior Work: Existing role-playing prompts only broadly ask the LLM to act as a "programmer" without refining specific personality traits. Using the same persona setting for all programming tasks overlooks the fact that different tasks may suit different thinking styles.
Key Challenge: Can the quality improvement brought by "task-personality matching" in human software development be transferred to LLM code generation? If so, what factors affect the performance?
Key Insight: Drawing on the MBTI personality framework, this study automatically generates a matched personality type for each programming task, so that the LLM generates code under diverse personas rather than a single fixed one. Large-scale experiments verify the effect and the factors behind it.
Method
Overall Architecture
A two-stage pipeline:
1. Personality Generation: GPT-4o analyzes the programming task description and outputs the most suitable MBTI type (one of 16 types) along with its detailed personality description for that task.
2. Code Generation: The generated MBTI type and detailed description are embedded into the prompt, guiding the target LLM to act as a programmer with that personality to generate code.
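
The two stages can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `LLM` callable type, the function names, and the prompt wording are all assumptions made here, and any real client (for example a GPT-4o wrapper) would be plugged in as a function from prompt string to reply string.

```python
from typing import Callable

# Hypothetical type: any function mapping a prompt string to a reply string.
LLM = Callable[[str], str]

# The 16 MBTI types, built from the four dichotomies.
MBTI_TYPES = {a + b + c + d for a in "EI" for b in "SN" for c in "TF" for d in "JP"}


def generate_personality(task: str, personality_llm: LLM) -> tuple[str, str]:
    """Stage 1: ask the personality model for an MBTI type and a description."""
    # Prompt wording is an assumption, not taken from the paper.
    prompt = (
        "Read the programming task below. On the first line, output the single "
        "MBTI type best suited to solving it. On the following lines, describe "
        "how a programmer with that personality would approach the task.\n\n"
        f"Task:\n{task}"
    )
    reply = personality_llm(prompt).strip()
    first_line, _, description = reply.partition("\n")
    mbti = first_line.strip().upper()
    if mbti not in MBTI_TYPES:
        raise ValueError(f"unexpected MBTI type: {first_line!r}")
    return mbti, description.strip()


def generate_code(task: str, mbti: str, description: str, code_llm: LLM) -> str:
    """Stage 2: embed the type and description in the code-generation prompt."""
    prompt = (
        f"You are a programmer with the {mbti} personality type.\n"
        f"{description}\n\n"
        f"Write Python code that solves the following task:\n{task}"
    )
    return code_llm(prompt)
```

Keeping the two models behind the same callable type makes it easy to use one model for personality generation and a different one for code generation, as the pipeline describes.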
Evaluation method: The generated code is considered "passed" only if it passes all test cases. Each LLM is run 3 times on each dataset to obtain the average pass rate.
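
The evaluation rule above can be illustrated with a small sketch. The helper names and the representation of test cases as `(args, expected)` pairs are assumptions for illustration; the benchmarks' actual harnesses differ.

```python
from statistics import mean


def passes_all(candidate, test_cases) -> bool:
    """A task counts as passed only if every (args, expected) pair matches."""
    for args, expected in test_cases:
        try:
            if candidate(*args) != expected:
                return False
        except Exception:
            # A crash on any test case counts as a failure.
            return False
    return True


def pass_rate(results: list[bool]) -> float:
    """Percentage of tasks passed in a single run."""
    return 100.0 * sum(results) / len(results)


def average_pass_rate(runs: list[list[bool]]) -> float:
    """Mean pass rate over repeated runs (the summary above reports 3 runs)."""
    return mean(pass_rate(run) for run in runs)
```

For example, three runs with per-run pass rates of 50%, 100%, and 0% average to 50%.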
Key Designs
- Task-Adapted Dynamic Personality Generation:
    - Function: GPT-4o dynamically selects the most suitable MBTI type and generates a customized description based on the specific characteristics of each programming task.
    - Mechanism: Different programming tasks suit different cognitive styles: logic-intensive tasks suit the Thinking dimension, algorithm design suits the Intuition dimension, and debugging tasks suit the Sensing dimension.
    - Design Motivation: Using a fixed MBTI type yields almost no difference from using no personality guidance (e.g., on Qwen-Long the 16 fixed MBTI types score between 65.7% and 68.4%, while diverse adaptation reaches 80.8%), indicating that "diversity" is the key.
- Full Prompt Design (Detailed Personality Description):
    - Function: The prompt not only specifies the four-letter MBTI type but also includes a detailed personality description generated by GPT-4o tailored to the specific task.
    - Mechanism: Helps the LLM understand and role-play the persona more deeply.
    - Design Motivation: The Full Prompt outperforms the Short Prompt (only labelling the MBTI type) by an average of 3.94% across 7 LLMs. Using a generic template description (instead of a task-customized description) on Qwen-Long yields a pass rate of only 65.5%, which is far below the 80.8% achieved by customized descriptions.
- Orthogonal Combination with Other Prompting Strategies:
    - Function: Personality guidance can be combined with other strategies like CoT and few-shot prompting.
    - Mechanism: Personality guidance affects "persona perception" while CoT affects the "reasoning process"; the two are orthogonal and complementary.
    - Design Motivation: CoT + Personality outperforms either strategy alone on 5 out of 7 LLMs, with a maximum gain of 13.8%.
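
The prompt variants discussed in the designs above (Short Prompt, Full Prompt, and CoT + Personality) differ only in which pieces are concatenated. A minimal sketch, assuming a hypothetical `build_prompt` helper and illustrative wording that is not taken from the paper:

```python
def build_prompt(task: str, mbti: str, description: str = "", use_cot: bool = False) -> str:
    """Compose a personality-guided prompt from optional pieces."""
    # Short Prompt: only the four-letter MBTI type.
    parts = [f"You are a programmer with the {mbti} personality type."]
    if description:
        # Full Prompt: add the task-specific personality description.
        parts.append(description)
    parts.append(f"Task:\n{task}")
    if use_cot:
        # CoT acts on the reasoning process, independently of the persona.
        parts.append("Think through the solution step by step before writing the final code.")
    return "\n\n".join(parts)


short_prompt = build_prompt("Reverse a string.", "INTJ")
full_prompt = build_prompt("Reverse a string.", "INTJ", "Plans the approach before coding.")
cot_full_prompt = build_prompt(
    "Reverse a string.", "INTJ", "Plans the approach before coding.", use_cot=True
)
```

Because the persona text and the CoT instruction occupy separate parts of the prompt, either can be toggled without touching the other, which is the sense in which the two strategies are treated as orthogonal.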
Loss & Training
This work is a zero-shot training-free prompt engineering method and does not involve model training. GPT-4o is used for personality generation, and default settings of each target LLM are used for code generation.
Key Experimental Results
Main Results
Pass rate comparison (Direct vs. MBTI-guided) across 7 LLMs and 4 datasets:
| LLM | MBPP Sanitized | MBPP+ | HumanEval+ | APPS | Average Gain |
|---|---|---|---|---|---|
| GPT-4o | 78.2→84.3 (+6.1%) | 71.2→72.7 (+1.5%) | 84.8→82.9 (-1.9%) | 46.2→45.2 (-1.0%) | +1.2% |
| GPT-4o mini | 69.3→82.2 (+12.9%) | 69.4→71.7 (+2.3%) | 80.5→82.3 (+1.8%) | 34.6→37.2 (+2.6%) | +4.9% |
| Llama3.1 (70B) | 69.8→81.0 (+11.2%) | 66.7→69.2 (+2.5%) | 72.0→72.6 (+0.6%) | 18.4→25.2 (+6.8%) | +5.3% |
| Qwen-Long | 68.4→80.8 (+12.4%) | 67.7→71.2 (+3.5%) | 76.8→78.7 (+1.9%) | 10.2→18.2 (+8.0%) | +6.5% |
| DeepSeek-Coder V2 | 74.9→85.7 (+10.8%) | 71.4→72.2 (+0.8%) | 80.5→76.2 (-4.3%) | 39.4→34.4 (-5.0%) | +0.6% |
| Codestral (22B) | 64.2→73.8 (+9.6%) | 61.2→64.9 (+3.7%) | 75.6→76.8 (+1.2%) | 15.8→22.6 (+6.8%) | +5.3% |
| CodeLlama (13B) | 43.3→46.8 (+3.5%) | 42.4→52.4 (+10.0%) | 32.9→29.9 (-3.0%) | 1.4→6.4 (+5.0%) | +3.9% |
Overall: Improvements are observed in 23 out of 28 combinations, with 11 gaining at least 5 percentage points and 5 gaining at least 10.
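
The summary counts and the Average Gain column can be re-derived from the per-cell gains in the table. The sketch below copies those gains verbatim and uses `decimal` so that half-way values round the same way as the table; the inclusive thresholds (`>=`) are an assumption chosen because they reproduce the reported counts.

```python
from decimal import Decimal, ROUND_HALF_UP

# Gains in percentage points, copied from the table above
# (column order: MBPP Sanitized, MBPP+, HumanEval+, APPS).
GAINS = {
    "GPT-4o": ["6.1", "1.5", "-1.9", "-1.0"],
    "GPT-4o mini": ["12.9", "2.3", "1.8", "2.6"],
    "Llama3.1 (70B)": ["11.2", "2.5", "0.6", "6.8"],
    "Qwen-Long": ["12.4", "3.5", "1.9", "8.0"],
    "DeepSeek-Coder V2": ["10.8", "0.8", "-4.3", "-5.0"],
    "Codestral (22B)": ["9.6", "3.7", "1.2", "6.8"],
    "CodeLlama (13B)": ["3.5", "10.0", "-3.0", "5.0"],
}

gains = {model: [Decimal(v) for v in values] for model, values in GAINS.items()}
all_gains = [g for values in gains.values() for g in values]

improved = sum(g > 0 for g in all_gains)        # combinations with a positive gain
at_least_5 = sum(g >= 5 for g in all_gains)     # inclusive threshold (assumption)
at_least_10 = sum(g >= 10 for g in all_gains)   # inclusive threshold (assumption)


def average_gain(values):
    """Mean gain rounded half-up to one decimal place."""
    return (sum(values) / len(values)).quantize(Decimal("0.1"), rounding=ROUND_HALF_UP)


averages = {model: average_gain(values) for model, values in gains.items()}
```

With inclusive thresholds this yields 23 improved combinations out of 28, 11 gains of at least 5 points, and 5 gains of at least 10 points; with strict inequalities the last two counts drop to 10 and 4, because CodeLlama has cells at exactly +5.0 and +10.0.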
Ablation Study
| Configuration | Pass Rate on MBPP Sanitized (Qwen-Long unless noted) | Explanation |
|---|---|---|
| Direct (No Personality) | 68.4% | Baseline |
| 16 Fixed MBTI Types | 65.7%-68.4% | Almost no difference from baseline |
| Diverse MBTI (Full Prompt) | 80.8% | +12.4%, diversity is key |
| Short Prompt (MBTI 4-letter only) | 73.3% | 7.5% lower than Full Prompt |
| Generic Template Description | 65.5% | Far inferior to task-customized descriptions |
| MBTI vs. Big Five Personality | 80.8% vs 71.4% | MBTI significantly outperforms Big Five |
| Personality alone | 84.3% (GPT-4o) | Outperforms 3-shot (77.3%) and CoT (77.0%) |
| CoT + Personality | 85.7% (GPT-4o) | Best combination |
| CoT + Personality (Qwen-Long) | 82.2% | 13.8% improvement over Direct |
Key Findings
- Medium-performing models benefit the most: GPT-4o (too strong) and CodeLlama (too weak) show limited benefits, while Qwen-Long, Llama3.1, and Codestral show the most significant gains.
- Medium-difficulty datasets show the largest improvement: MBPP Sanitized improves on all 7 models (by 3.5% to 12.9%), whereas performance on HumanEval+ (where 78% of tasks are assigned INTJ, resulting in low diversity) and APPS (90.6% assigned INTJ) is unstable.
- Personality diversity is the core factor: Any fixed single personality ≈ no personality guidance. Different personalities are complementary in which tasks they solve (e.g., INTJ and ISTJ each uniquely solve 4.5% of problems).
- MBTI outperforms Big Five: The four dimensions of MBTI (S/N, T/F, E/I, J/P) align better with the cognitive demands of programming; in Big Five, only conscientiousness is strongly related to programming.
- Full Prompt consistently outperforms Short Prompt: Detailed descriptions customized to tasks yield an average improvement of 3.94% over simply labeling the MBTI type.
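
The complementarity finding above (each personality solving some problems that others do not) amounts to a set-difference computation over solved task IDs. The sketch below uses made-up toy data purely for illustration; the function names and the numbers are not from the paper.

```python
def uniquely_solved_share(solved: dict[str, set[str]], personality: str, total_tasks: int) -> float:
    """Percentage of all tasks solved by `personality` and by no other personality."""
    others = set().union(*(ids for name, ids in solved.items() if name != personality))
    return 100.0 * len(solved[personality] - others) / total_tasks


def union_pass_rate(solved: dict[str, set[str]], total_tasks: int) -> float:
    """Percentage of tasks solved by at least one personality."""
    return 100.0 * len(set().union(*solved.values())) / total_tasks


# Toy data (hypothetical): 10 tasks, two fixed personalities.
toy_solved = {
    "INTJ": {"t1", "t2", "t3", "t4", "t5", "t6"},
    "ISTJ": {"t4", "t5", "t6", "t7", "t8"},
}

intj_unique = uniquely_solved_share(toy_solved, "INTJ", total_tasks=10)
istj_unique = uniquely_solved_share(toy_solved, "ISTJ", total_tasks=10)
combined = union_pass_rate(toy_solved, total_tasks=10)
```

In this toy case each personality alone solves at most 60% of tasks, while the union covers 80%, which is the kind of headroom that per-task personality selection can in principle exploit.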
Highlights & Insights
- Transferring the personality-task matching theory from software engineering to LLM prompt design is a productive cross-domain idea.
- The key finding is that "diversity" rather than "a specific personality" brings improvements, implying that role-playing prompts need to be dynamically matched with task characteristics rather than statically set.
- The method is extremely lightweight: zero training, simply prepending a personality description to the prompt, making it plug-and-play.
- The CoT + Personality combination suggests that prompting strategies acting on different dimensions (persona versus reasoning process) can be complementary.
Limitations & Future Work
- Personality generation relies on GPT-4o, which adds one extra LLM call (and its API cost) per task.
- MBTI is controversial in the psychology community and is not universally accepted as the most scientific personality framework.
- It has only been validated on Python function-level code generation, without covering complex scenarios such as repository-level or multi-file environments.
- The underlying mechanism of why personality guidance is effective remains unclear; the gain might simply come from richer prompts that supply more role-playing context.
- The theoretical upper bound cannot be calculated due to the lack of ground truth for personality-task matching.
Related Work & Insights
- vs. Direct role-playing: Simply using "You are a programmer" is inferior to personality guidance on all 7 LLMs, indicating that role-playing needs fine-grained refinement.
- vs. CoT: Personality guidance alone outperforms CoT and 3-shot across all 7 LLMs, and can be combined with CoT for additional improvements.
- vs. Team diversity studies (Pieterse et al., 2018): This work validates that personality diversity is equally important in LLM scenarios.
Rating
- Novelty: ⭐⭐⭐ Introducing MBTI to code generation is interesting but essentially a variant of prompt engineering.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Extremely comprehensive with 7 LLMs, 4 datasets, 5 RQs, comparison of 16 fixed MBTI types, and comparison of two personality frameworks.
- Writing Quality: ⭐⭐⭐⭐ The RQ-driven empirical research paradigm is clear, with definitive conclusions for each RQ.
- Value: ⭐⭐⭐ Practicality is average (performance is unstable across different scenarios), but the finding that "diversity is key" holds inspiring value.