A Study of LLMs' Preferences for Libraries and Programming Languages
Conference: ACL 2026 (Findings)
arXiv: 2503.17181
Code: GitHub
Area: LLM/NLP
Keywords: Code generation preferences, library choice bias, programming language preference, LLM behavior analysis, technical diversity
TL;DR
This study presents the first systematic investigation into the preferences of 8 LLMs regarding libraries and programming languages during code generation. It finds that LLMs are strongly biased toward popular libraries such as NumPy (45% unnecessary usage) and toward Python (chosen in 58% of high-performance tasks), and that their natural language recommendations often diverge from what they actually select when writing code.
Background & Motivation
Background: LLMs have made significant strides in code generation. However, existing evaluations primarily focus on functional correctness and syntactic validity, overlooking critical design decisions made by LLMs during generation—specifically, the choice of libraries and programming languages.
Limitations of Prior Work: Developers often do not name specific libraries in their prompts, and many end-users lack the expertise to judge whether an LLM's language choice is appropriate. As a result, the technical preferences of LLMs may profoundly influence the diversity of the software ecosystem.
Key Challenge: LLMs should ideally select the technical stack best suited to the task requirements. However, the frequency distribution in training data may cause them to systematically favor popular technologies, even when they are not optimal.
Goal: To quantify the preference patterns of LLMs in library and programming language selection and to evaluate the rationality and potential risks of these biases.
Key Insight: The authors design three sets of experiments: library selection for benchmark tasks, library/language selection for project initialization, and consistency checks between natural language recommendations and actual coding behavior.
Core Idea: LLMs exhibit a significant "familiarity bias" in code generation, prioritizing popular technologies over those most technically appropriate for the task.
Method
Overall Architecture
This is an empirical study that measures the implicit technical choices LLMs make during code generation. The design consists of three experiments. The first two cover two dimensions (libraries and languages) across two scenarios (benchmark tasks and project initialization) and measure the distribution of technologies actually generated when no library or language is specified. The third is a consistency check that compares this actual behavior with the "best" recommendations the LLMs give in natural language. Inputs consist of code generated by 8 diverse LLMs (GPT-4o-mini, GPT-3.5-turbo, Claude-3.5 Sonnet, Claude-3.5 Haiku, Llama-3.2-3B, Mistral-7B, Qwen-2.5-Coder, DeepSeek-LLM) on standard tasks. Each task was sampled 3–100 times to reduce the effect of sampling randomness, using new sessions and default API parameters without system prompts to reflect baseline model behavior.
Key Designs
1. Library Preference (Experiment 1): Quantifying Python library selection tendencies in unconstrained scenarios
Developers frequently request code without specifying libraries, so this scenario reflects realistic usage. The authors used 525 tasks from BigCodeBench, filtering out prompts that explicitly named the ground-truth library to avoid leakage. LLMs were then required to generate code using external libraries. Usage frequencies were computed and compared against the ground-truth solutions to determine whether LLMs systematically default to a few popular libraries.
2. Language Preference (Experiment 2): Testing whether LLMs select languages based on task characteristics or default to Python
Beyond libraries, language-level inertia has deeper implications. This experiment used 6 language-agnostic datasets (Multi-HumanEval, MBXP, AixBench, CoNaLa, APPS, CodeContests) to test language selection for benchmark tasks. In addition, the authors designed 5 project initialization tasks for high-performance scenarios, such as concurrent web servers, cross-platform GUIs, and low-latency trading platforms, where Python is typically not the optimal choice. If LLMs still default to Python for tasks that demand high performance, they are failing to adapt language selection to task requirements.
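One simple way to tally language choices is to read the tag on the first labelled code fence of each response. The sketch below assumes that models label their fences, which the summary does not state; the helper names, the alias table, and the example responses are all hypothetical, and the authors may have identified languages differently.

```python
import re
from collections import Counter

TICKS = "`" * 3  # a Markdown fence marker, built this way to keep the sketch readable

# A fence line that carries a language tag, e.g. a fence followed by "python".
FENCE = re.compile(r"^`{3}[ \t]*([A-Za-z0-9_+#.-]+)[ \t]*$", re.MULTILINE)

# Assumed alias table mapping common short tags to one canonical name.
ALIASES = {"py": "python", "python3": "python", "js": "javascript",
           "ts": "typescript", "c++": "cpp", "rs": "rust", "golang": "go"}


def response_language(response: str):
    """Return the normalized tag of the first labelled fence, or None."""
    match = FENCE.search(response)
    if match is None:
        return None
    tag = match.group(1).lower()
    return ALIASES.get(tag, tag)


def language_shares(responses: list[str]) -> dict[str, float]:
    """Fraction of labelled responses written in each language."""
    langs = [response_language(r) for r in responses]
    counts = Counter(lang for lang in langs if lang is not None)
    total = sum(counts.values())
    return {lang: n / total for lang, n in counts.items()} if total else {}


# Hypothetical model responses, for illustration only.
responses = [
    f"Here is a server:\n{TICKS}python\nprint(1)\n{TICKS}",
    f"{TICKS}rs\nfn main() {{}}\n{TICKS}",
    f"{TICKS}py\nx = 1\n{TICKS}",
    "I would suggest discussing requirements first.",
]
shares = language_shares(responses)
print(shares)
```

Responses without a labelled fence are left out of the denominator here; whether to count them separately is a design choice this sketch does not settle.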
3. Recommendation Consistency (Experiment 3): Verifying the consistency between "recommendations" and "implementations"
If an LLM identifies a superior choice but fails to implement it in code, the preference is rooted in generation behavior rather than a lack of knowledge. The authors asked LLMs to rank the "best" libraries/languages in natural language and compared these rankings against the actual usage frequencies from Experiments 1 and 2. Agreement between the two rankings was measured with Kendall's \(\tau_b\) coefficient, which ranges from \(-1\) (reversed order) to \(1\) (identical order). A low \(\tau_b\) indicates a "knowledge-action gap," where preferences are embedded as implicit biases in code generation.
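For reference, Kendall's \(\tau_b\) can be computed from pair counts as \(\tau_b = (P - Q) / \sqrt{(P + Q + T)(P + Q + U)}\), where \(P\) and \(Q\) are the numbers of concordant and discordant pairs, and \(T\) and \(U\) count pairs tied only in the first or only in the second ranking. The sketch below is a plain-Python version of that formula; the function name `kendall_tau_b` and the example ranks are hypothetical and are not data from the paper.

```python
from itertools import combinations
from math import sqrt


def kendall_tau_b(x, y):
    """Kendall's tau-b between two equal-length sequences of ranks or scores."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length, at least 2")
    concordant = discordant = ties_x = ties_y = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue  # tied in both sequences: counted in neither tie term
        elif dx == 0:
            ties_x += 1
        elif dy == 0:
            ties_y += 1
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    denom = sqrt((concordant + discordant + ties_x)
                 * (concordant + discordant + ties_y))
    return (concordant - discordant) / denom if denom else float("nan")


# Hypothetical ranks for five libraries: position in the model's stated
# recommendation versus position by observed usage frequency.
recommended_rank = [1, 2, 3, 4, 5]
usage_rank = [4, 1, 5, 2, 3]
tau = kendall_tau_b(recommended_rank, usage_rank)
print(round(tau, 3))
```

The quadratic pair loop is adequate for short rankings like these; for long lists, `scipy.stats.kendalltau` computes the same tau-b variant by default.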
Key Experimental Results
Main Results
| Finding | Specific Data | Impact |
|---|---|---|
| NumPy Overuse | Used in 192 (63%) of 305 tasks where NumPy was not required | Severe Preference |
| Lack of Library Diversity | Each LLM used only 32-39 distinct libraries | Ecosystem Monopolization |
| Python Preference | Python chosen in 58% of high-performance tasks | Technical Mismatch |
| Absence of Rust | Rust usage was 0% in high-performance projects | Extreme Bias |
| NL-Code Inconsistency | Extremely low Kendall's \(\tau_b\) | Knowledge-Action Gap |
Ablation Study
| Configuration | Key Metrics | Description |
|---|---|---|
| Prompt Sensitivity | Stable preference patterns | Similar results across varying levels of prompt strictness |
| Cross-LLM Consistency | Identical Top-3 libraries | The top three libraries (NumPy, pandas, Matplotlib) were identical across all 8 LLMs |
Key Findings
- Library usage distributions are highly similar across all LLMs; the top three libraries (NumPy > pandas > Matplotlib) are consistent regardless of model size or open-source status.
- Even when tasks explicitly require high performance (e.g., low-latency trading, parallel execution), Python remains dominant, while Rust is entirely absent.
- There is very low consistency between the tech stacks LLMs "recommend" and those they actually use, suggesting that preferences are rooted in generation behavior rather than factual knowledge.
Highlights & Insights
- The finding that "LLMs know what is better but do not necessarily do it" is significant, indicating that code generation preferences may stem from training data distribution rather than reasoning capabilities.
- This serves as a warning for the software ecosystem: large-scale LLM deployment may create a positive feedback loop, in which a preference for popular libraries leads to more generated code using those libraries, which in turn becomes training data and further strengthens the bias.
- The experimental design is concise and effective, with three experiments forming a complete chain of evidence.
Limitations & Future Work
- Only 8 LLMs were tested, and the study did not cover the latest reasoning-enhanced models (e.g., o1, DeepSeek-R1).
- Library analysis focused heavily on Python; preferences in other languages remain unexplored.
- No specific debiasing methods were proposed; the study remains primarily descriptive of the phenomenon.
- Future work could investigate how fine-tuning and RLHF influence technical preferences.
Related Work & Insights
- vs LLM Social Bias: Expands bias analysis from social dimensions to technical dimensions, opening a new research direction.
- vs Code Generation Evaluation: Proposes "quality of design decisions" as a neglected but critical evaluation dimension.
- vs Tool Recommendation Systems: Reveals the limitations of LLMs acting as "implicit recommendation systems."
Rating
- Novelty: ⭐⭐⭐⭐⭐ First systematic study of LLM technical preferences; opens a new direction.
- Experimental Thoroughness: ⭐⭐⭐⭐ Comprehensive design involving 8 models, multiple scenarios, and consistency testing.
- Writing Quality: ⭐⭐⭐⭐⭐ Clear problem statement, concise experiments, and impactful findings.
- Value: ⭐⭐⭐⭐ Provides an important warning for both LLM developers and users.