
Personality-Guided Code Generation Using Large Language Models

Conference: ACL 2025
arXiv: 2411.00006
Code: GitHub
Area: Code Intelligence
Keywords: code generation, personality, MBTI, prompt engineering, role-playing

TL;DR

This work uses GPT-4o to generate, for each programming task, a matching MBTI personality type with a detailed description, then guides the target LLM to write code while role-playing a programmer with that personality. Across 28 combinations of 7 LLMs and 4 datasets, pass rates improve in 23 cases (by up to 12.9 percentage points). The key factor is personality diversity rather than any single personality type.

Background & Motivation

Background: LLM-driven code generation has become mainstream. A common practice is to prompt LLMs to role-play as a "programmer" to generate code. Meanwhile, software engineering research has shown that matching tasks to developer personality can improve development quality, and that team personality diversity is positively correlated with high-quality software delivery.

Limitations of Prior Work: Existing role-playing prompts only broadly ask the LLM to act as a "programmer" without refining specific personality traits. Using the same persona setting for all programming tasks overlooks the fact that different tasks may suit different thinking styles.

Key Challenge: Can the quality improvement brought by "task-personality matching" in human software development be transferred to LLM code generation? If so, what factors affect the performance?

Key Insight: Drawing on the MBTI personality framework, this study automatically generates a matched personality type for each programming task, so that the LLM writes code under diverse, task-specific personas. Large-scale experiments verify the effect and identify the factors behind it.

Method

Overall Architecture

A two-stage pipeline:

  1. Personality Generation: GPT-4o analyzes the programming task description and outputs the most suitable MBTI type (one of 16 types) along with its detailed personality description for that task.
  2. Code Generation: The generated MBTI type and detailed description are embedded into the prompt, guiding the target LLM to act as a programmer with that personality to generate code.
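The two stages can be sketched as follows. This is a minimal illustration, not the authors' implementation: the prompt wording, the reply format (MBTI type on the first line), and the `personality_llm` / `target_llm` callables are all assumptions, standing in for whatever API client sends a prompt and returns text.

```python
from typing import Callable, Tuple

# Hypothetical prompt wording; the paper's exact prompts are not reproduced here.
PERSONALITY_PROMPT = (
    "Analyze the programming task below. On the first line, name the MBTI type "
    "best suited to solving it. Then describe that personality in detail.\n\n"
    "Task:\n{task}"
)
CODE_PROMPT = (
    "You are a programmer with the MBTI personality type {mbti}.\n"
    "Personality description: {description}\n\n"
    "Write Python code for the following task:\n{task}"
)

# The 16 MBTI types, built from the four dichotomies.
MBTI_TYPES = {a + b + c + d for a in "EI" for b in "SN" for c in "TF" for d in "JP"}

LLM = Callable[[str], str]  # assumed interface: prompt in, completion text out


def generate_personality(task: str, personality_llm: LLM) -> Tuple[str, str]:
    """Stage 1: ask the personality model for an MBTI type and a description."""
    reply = personality_llm(PERSONALITY_PROMPT.format(task=task))
    first_line, _, rest = reply.strip().partition("\n")
    mbti = first_line.strip().upper()
    if mbti not in MBTI_TYPES:
        raise ValueError(f"unexpected MBTI type in reply: {first_line!r}")
    return mbti, rest.strip()


def generate_code(task: str, personality_llm: LLM, target_llm: LLM) -> str:
    """Stage 2: embed the type and description in the code-generation prompt."""
    mbti, description = generate_personality(task, personality_llm)
    prompt = CODE_PROMPT.format(mbti=mbti, description=description, task=task)
    return target_llm(prompt)
```

In the setup described here, `personality_llm` would wrap GPT-4o and `target_llm` would wrap one of the seven evaluated models; keeping both as plain callables makes the sketch independent of any particular client library.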

Evaluation method: The generated code is considered "passed" only if it passes all test cases. Each LLM is run 3 times on each dataset to obtain the average pass rate.
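The scoring rule can be illustrated with a short sketch. The data layout (one dict per run, mapping a task id to a list of per-test booleans) is an assumption made for the example, not the paper's evaluation harness.

```python
from statistics import mean
from typing import Dict, List

Run = Dict[str, List[bool]]  # assumed layout: task id -> per-test pass/fail results


def solved(test_results: List[bool]) -> bool:
    """A task counts as passed only if every one of its test cases passed."""
    return len(test_results) > 0 and all(test_results)


def pass_rate(run: Run) -> float:
    """Percentage of tasks in one run whose code passed all test cases."""
    return 100.0 * sum(solved(results) for results in run.values()) / len(run)


def average_pass_rate(runs: List[Run]) -> float:
    """Mean pass rate over repeated runs (three per LLM and dataset here)."""
    return mean(pass_rate(run) for run in runs)


# Toy data: two tasks, three runs.
runs = [
    {"t1": [True, True], "t2": [True, False]},
    {"t1": [True, True], "t2": [True, True]},
    {"t1": [False, True], "t2": [True, True]},
]
print(round(average_pass_rate(runs), 1))
```

A single failing test case marks the whole task as failed, so the per-run rates in the toy data are 50%, 100%, and 50%.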

Key Designs

  1. Task-Adapted Dynamic Personality Generation:

    • Function: GPT-4o dynamically selects the most suitable MBTI type and generates a customized description based on the specific characteristics of each programming task.
    • Mechanism: Different programming tasks suit different cognitive styles—logic-intensive tasks suit the Thinking dimension, algorithm design suits the Intuition dimension, and debugging tasks suit the Sensing dimension.
    • Design Motivation: A fixed MBTI type performs almost the same as no personality guidance (e.g., on Qwen-Long the 16 fixed MBTI types score between 65.7% and 68.4%, against 68.4% for Direct, while task-adapted diverse personalities reach 80.8%), indicating that "diversity" is the key.
  2. Full Prompt Design (Detailed Personality Description):

    • Function: The prompt not only specifies the four-letter MBTI type but also includes a detailed personality description generated by GPT-4o tailored to the specific task.
    • Mechanism: Helps the LLM understand and role-play the persona more deeply.
    • Design Motivation: The Full Prompt outperforms the Short Prompt (only labelling the MBTI type) by an average of 3.94% across 7 LLMs. Using a generic template description (instead of a task-customized description) on Qwen-Long yields a pass rate of only 65.5%, which is far below the 80.8% achieved by customized descriptions.
  3. Orthogonal Combination with Other Prompting Strategies:

    • Function: Personality guidance can be combined with other strategies like CoT and few-shot prompting.
    • Mechanism: Personality guidance affects "persona perception" while CoT affects the "reasoning process"; the two are orthogonal and complementary.
    • Design Motivation: CoT + Personality outperforms either strategy alone on 5 out of 7 LLMs, with a maximum gain of 13.8% over Direct prompting (Qwen-Long, 68.4% to 82.2%).
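The prompt variants discussed in the three designs above (Short Prompt, Full Prompt, and the combination with CoT) differ only in which pieces are concatenated. The sketch below shows one way to assemble them; the wording of each piece and the `build_prompt` helper are hypothetical, not taken from the paper.

```python
from typing import Optional

# Hypothetical chain-of-thought instruction appended when CoT is enabled.
COT_SUFFIX = "Think through the problem step by step before writing the final code."


def build_prompt(task: str, mbti: str,
                 description: Optional[str] = None,
                 use_cot: bool = False) -> str:
    """Assemble a personality-guided prompt.

    description=None -> Short Prompt (four-letter MBTI type only)
    description=str  -> Full Prompt (type plus task-tailored description)
    use_cot=True     -> additionally append a chain-of-thought instruction
    """
    parts = [f"You are a programmer with the MBTI personality type {mbti}."]
    if description:
        parts.append(f"Personality description: {description}")
    parts.append(f"Task:\n{task}")
    if use_cot:
        parts.append(COT_SUFFIX)
    return "\n\n".join(parts)


short_prompt = build_prompt("Sum a list of integers.", "ISTJ")
full_prompt = build_prompt("Sum a list of integers.", "ISTJ",
                           description="Methodical and detail-oriented.")
cot_prompt = build_prompt("Sum a list of integers.", "ISTJ",
                          description="Methodical and detail-oriented.",
                          use_cot=True)
```

Because the persona text and the reasoning instruction occupy separate parts of the prompt, either can be switched on or off without touching the other, which is what makes the two strategies easy to combine.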

Loss & Training

This work is a zero-shot training-free prompt engineering method and does not involve model training. GPT-4o is used for personality generation, and default settings of each target LLM are used for code generation.

Key Experimental Results

Main Results

Pass rate comparison (Direct vs. MBTI-guided) across 7 LLMs and 4 datasets:

| LLM | MBPP Sanitized | MBPP+ | HumanEval+ | APPS | Average Gain |
| --- | --- | --- | --- | --- | --- |
| GPT-4o | 78.2→84.3 (+6.1%) | 71.2→72.7 (+1.5%) | 84.8→82.9 (-1.9%) | 46.2→45.2 (-1.0%) | +1.2% |
| GPT-4o mini | 69.3→82.2 (+12.9%) | 69.4→71.7 (+2.3%) | 80.5→82.3 (+1.8%) | 34.6→37.2 (+2.6%) | +4.9% |
| Llama3.1 (70B) | 69.8→81.0 (+11.2%) | 66.7→69.2 (+2.5%) | 72.0→72.6 (+0.6%) | 18.4→25.2 (+6.8%) | +5.3% |
| Qwen-Long | 68.4→80.8 (+12.4%) | 67.7→71.2 (+3.5%) | 76.8→78.7 (+1.9%) | 10.2→18.2 (+8.0%) | +6.5% |
| DeepSeek-Coder V2 | 74.9→85.7 (+10.8%) | 71.4→72.2 (+0.8%) | 80.5→76.2 (-4.3%) | 39.4→34.4 (-5.0%) | +0.6% |
| Codestral (22B) | 64.2→73.8 (+9.6%) | 61.2→64.9 (+3.7%) | 75.6→76.8 (+1.2%) | 15.8→22.6 (+6.8%) | +5.3% |
| CodeLlama (13B) | 43.3→46.8 (+3.5%) | 42.4→52.4 (+10.0%) | 32.9→29.9 (-3.0%) | 1.4→6.4 (+5.0%) | +3.9% |

Overall: Improvements are observed in 23 out of 28 combinations, with 11 gaining at least 5 percentage points and 5 gaining at least 10.
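The summary counts can be rechecked from the table. The short script below copies the before/after pass rates from the main results and recomputes the per-cell gains, the number of improved combinations, and the per-model average gain. The counts of 11 and 5 are reproduced when the 5-point and 10-point thresholds are treated as inclusive (CodeLlama's +5.0 on APPS and +10.0 on MBPP+ sit exactly on the thresholds).

```python
# (before, after) pass rates in percent, copied from the main results table.
# Column order: MBPP Sanitized, MBPP+, HumanEval+, APPS.
RESULTS = {
    "GPT-4o":            [(78.2, 84.3), (71.2, 72.7), (84.8, 82.9), (46.2, 45.2)],
    "GPT-4o mini":       [(69.3, 82.2), (69.4, 71.7), (80.5, 82.3), (34.6, 37.2)],
    "Llama3.1 (70B)":    [(69.8, 81.0), (66.7, 69.2), (72.0, 72.6), (18.4, 25.2)],
    "Qwen-Long":         [(68.4, 80.8), (67.7, 71.2), (76.8, 78.7), (10.2, 18.2)],
    "DeepSeek-Coder V2": [(74.9, 85.7), (71.4, 72.2), (80.5, 76.2), (39.4, 34.4)],
    "Codestral (22B)":   [(64.2, 73.8), (61.2, 64.9), (75.6, 76.8), (15.8, 22.6)],
    "CodeLlama (13B)":   [(43.3, 46.8), (42.4, 52.4), (32.9, 29.9), (1.4, 6.4)],
}


def gains(pairs):
    """Gain in percentage points per dataset, rounded to absorb float noise."""
    return [round(after - before, 1) for before, after in pairs]


all_gains = [g for pairs in RESULTS.values() for g in gains(pairs)]
improved = sum(g > 0 for g in all_gains)
at_least_5 = sum(g >= 5.0 for g in all_gains)
at_least_10 = sum(g >= 10.0 for g in all_gains)
average_gain = {llm: sum(gains(pairs)) / len(pairs) for llm, pairs in RESULTS.items()}

print(improved, at_least_5, at_least_10)  # prints: 23 11 5
```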

Ablation Study

| Configuration | Qwen-Long MBPP Pass Rate | Explanation |
| --- | --- | --- |
| Direct (No Personality) | 68.4% | Baseline |
| 16 Fixed MBTI Types | 65.7%-68.4% | Almost no difference from baseline |
| Diverse MBTI (Full Prompt) | 80.8% | +12.4%, diversity is key |
| Short Prompt (MBTI 4-letter only) | 73.3% | 7.5% lower than Full Prompt |
| Generic Template Description | 65.5% | Far inferior to task-customized descriptions |
| MBTI vs. Big Five Personality | 80.8% vs 71.4% | MBTI significantly outperforms Big Five |
| Personality alone | 84.3% (GPT-4o) | Outperforms 3-shot (77.3%) and CoT (77.0%) |
| CoT + Personality | 85.7% (GPT-4o) | Best combination |
| CoT + Personality (Qwen-Long) | 82.2% | 13.8% improvement over Direct |

Key Findings

  • Medium-performing models benefit the most: GPT-4o (too strong) and CodeLlama (too weak) show limited benefits, while Qwen-Long, Llama3.1, and Codestral show the most significant gains.
  • Medium-difficulty datasets show the largest improvement: Gains are largest on MBPP Sanitized for 6 of the 7 LLMs, while performance on HumanEval+ (where 78% of tasks are assigned INTJ, resulting in low diversity) and APPS (90.6% assigned INTJ) is unstable.
  • Personality diversity is the core factor: Any fixed single personality ≈ no personality guidance. Different personalities show complementarity in solving tasks (e.g., INTJ and ISTJ each have 4.5% uniquely solved problems).
  • MBTI outperforms Big Five: The four dimensions of MBTI (S/N, T/F, E/I, J/P) align better with the cognitive demands of programming; in Big Five, only conscientiousness is strongly related to programming.
  • Full Prompt consistently outperforms Short Prompt: Detailed descriptions customized to tasks yield an average improvement of 3.94% over simply labeling the MBTI type.
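Two of the quantities behind these findings, the share of tasks assigned to the most common MBTI type and the set of tasks that only one personality solves, are simple to compute. The sketch below uses made-up toy data; the function names and data layout are assumptions for illustration.

```python
from collections import Counter
from typing import Dict, List, Set, Tuple


def dominant_share(assigned_types: List[str]) -> Tuple[str, float]:
    """Most frequently assigned MBTI type and its share of tasks in percent.

    A high share (such as the INTJ shares reported for HumanEval+ and APPS)
    means low personality diversity.
    """
    mbti, count = Counter(assigned_types).most_common(1)[0]
    return mbti, 100.0 * count / len(assigned_types)


def uniquely_solved(solved_by: Dict[str, Set[int]]) -> Dict[str, Set[int]]:
    """For each personality, the tasks that no other personality solved."""
    unique = {}
    for name, tasks in solved_by.items():
        others = set().union(
            *(t for other, t in solved_by.items() if other != name)
        )
        unique[name] = tasks - others
    return unique


# Toy data (not from the paper): task ids solved under two fixed personalities.
solved_by = {"INTJ": {1, 2, 3}, "ISTJ": {2, 3, 4}}
print(uniquely_solved(solved_by))
print(dominant_share(["INTJ", "INTJ", "INTJ", "ISTJ"]))
```

When the unique sets are non-empty for several personalities, the union of their solved tasks is larger than what any single fixed personality achieves, which is the complementarity that the diversity finding relies on.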

Highlights & Insights

  • Transferring personality-task matching theory from software engineering to LLM prompt design is an instructive cross-domain idea.
  • The key finding is that "diversity" rather than "a specific personality" brings improvements, implying that role-playing prompts need to be dynamically matched with task characteristics rather than statically set.
  • The method is extremely lightweight: zero training, simply prepending a personality description to the prompt, making it plug-and-play.
  • The CoT + Personality combined strategy demonstrates the possibility of orthogonal complementarity across different dimensions of prompt engineering.

Limitations & Future Work

  • Personality generation relies on GPT-4o, adding one extra LLM call, and its API cost, per task.
  • MBTI is controversial in the psychology community and is not universally accepted as the most scientific personality framework.
  • It has only been validated on Python function-level code generation, without covering complex scenarios such as repository-level or multi-file environments.
  • The underlying mechanism of why personality guidance is effective remains unclear; it might simply increase role-playing contextual information through richer prompts.
  • The theoretical upper bound cannot be calculated due to the lack of ground truth for personality-task matching.

Related Work & Insights

  • vs. Direct role-playing: Simply using "You are a programmer" is inferior to personality guidance on all 7 LLMs, indicating that role-playing needs fine-grained refinement.
  • vs. CoT: Personality guidance alone outperforms CoT and 3-shot across all 7 LLMs, and can be combined with CoT for additional improvements.
  • vs. Team diversity studies (Pieterse et al., 2018): This work validates that personality diversity is equally important in LLM scenarios.

Rating

  • Novelty: ⭐⭐⭐ Introducing MBTI to code generation is interesting but essentially a variant of prompt engineering.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ Extremely comprehensive with 7 LLMs, 4 datasets, 5 RQs, comparison of 16 fixed MBTI types, and comparison of two personality frameworks.
  • Writing Quality: ⭐⭐⭐⭐ The RQ-driven empirical research paradigm is clear, with definitive conclusions for each RQ.
  • Value: ⭐⭐⭐ Practicality is average (performance is unstable across different scenarios), but the finding that "diversity is key" holds inspiring value.