Evaluating the Promise and Pitfalls of LLMs in Hiring Decisions
Conference: NeurIPS 2025 | arXiv: 2507.02087 | Code: None | Area: AI Safety / Fairness & Bias | Keywords: LLM bias, hiring fairness, algorithmic auditing, disparate impact, EEOC four-fifths rule
TL;DR
This paper systematically evaluates the hiring-match performance of mainstream LLMs—including GPT-4o/4.1, Claude 3.5, Gemini 2.5, Llama 3.1/4, and DeepSeek R1—on approximately 10,000 real-world candidate–job pairs. Results show that a domain-specialized model (Match Score) comprehensively outperforms general-purpose LLMs in both accuracy (AUC 0.85 vs. 0.77) and fairness (Race IR 0.957 vs. ≤0.809).
Background & Motivation
Background: Over 98% of Fortune 500 companies employ some form of automated tool in their hiring processes. LLMs, owing to their broad language understanding capabilities, are increasingly considered for résumé screening and candidate matching.
Limitations of Prior Work: LLMs trained on massive internet corpora inevitably inherit and amplify societal biases related to gender and race. Amazon's 2018 AI recruiting tool—exposed for discriminating against women—remains a landmark case. Even after alignment procedures by LLM providers, biases may still manifest in subtle ways.
Key Challenge: A fundamental tension exists between the general-purpose capabilities of LLMs and the strict fairness requirements imposed by high-stakes domains. Hiring is classified as a high-risk AI application under the EU AI Act, and New York City has enacted legislation mandating bias audits of AI hiring systems.
Goal: To systematically quantify the accuracy and fairness of mainstream LLMs in realistic hiring scenarios and compare them against a domain-specialized model.
Key Insight: The study employs real hiring data (including self-reported gender and race information), a unified evaluation framework (PII-stripped résumés → standardized prompts → median-threshold binarization → EEOC four-fifths rule), and assesses both accuracy and fairness simultaneously.
Core Idea: General-purpose LLMs are both less accurate and more biased than domain-specialized models on hiring tasks; accuracy and fairness need not be mutually exclusive.
Method
Overall Architecture
- Input: Résumés (parsed and de-identified to remove PII such as names, addresses, and phone numbers) + job descriptions
- Process: Unified input fed to all models (Match Score + 8 LLMs) to obtain matching scores
- Output: Binarized as "selected/not selected" using a median threshold; accuracy and fairness are then evaluated
- Ground truth: Whether the candidate progressed (interview invitation, offer, or hire)
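A minimal sketch of the scoring-and-binarization step of this pipeline, assuming hypothetical arrays of model scores and progression labels (the function name `evaluate_model` is illustrative, not from the paper):

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score, f1_score

def evaluate_model(scores: np.ndarray, progressed: np.ndarray) -> dict:
    """Binarize match scores at the median and compute the accuracy metrics.

    scores:     a model's match score for each candidate-job pair
    progressed: 1 if the candidate advanced (interview, offer, or hire), else 0
    """
    selected = (scores > np.median(scores)).astype(int)  # median-threshold binarization
    return {
        "roc_auc": roc_auc_score(progressed, scores),           # threshold-free ranking quality
        "pr_auc": average_precision_score(progressed, scores),  # standard stand-in for PR AUC
        "f1": f1_score(progressed, selected),                   # computed on binarized decisions
    }
```

Here `average_precision_score` serves as the usual proxy for PR AUC; ROC AUC and PR AUC use the raw scores, while F1 depends on the median-threshold decisions.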
Key Designs
- Data De-identification and Standardization:
- Function: All résumés are processed through a unified parser, stripped of PII, and normalized into structured text segments (skills, experience, education, etc.)
- Mechanism: De-identified résumés are identical across all models, eliminating input variability
- Design Motivation: Ensures a fair comparison and prevents models from directly inferring protected attributes from the input
- Prompt Design (a reconstruction sketch follows this list):
- Function: Standardized evaluation prompts are designed for LLMs, specifying six evaluation criteria
- Mechanism: The system message defines sequential assessment across six dimensions (experience relevance, industry fit, skill match, seniority match, job title match, and educational background) and explicitly instructs the model not to make judgments based on protected attributes
- All LLMs are evaluated zero-shot without fine-tuning
- Fairness Evaluation Framework:
- Function: Fairness is assessed using the EEOC "four-fifths rule"
- Core Metric: \(\text{IR} = \frac{\min_g(\text{SR}_g)}{\max_g(\text{SR}_g)}\), where SR denotes the selection rate for each demographic group
- An IR < 0.8 indicates potential disparate impact
- Evaluation is conducted along three dimensions: gender, race, and intersectional groups (e.g., "Asian Female")
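As noted in the Prompt Design item above, every LLM receives the same standardized zero-shot prompt. A hedged reconstruction of its shape, built only from the six criteria and the non-discrimination instruction described in the paper (the exact wording is an assumption, not a quotation):

```python
SYSTEM_PROMPT = """You evaluate how well a candidate's resume matches a job description.
Assess, in order: (1) experience relevance, (2) industry fit, (3) skill match,
(4) seniority match, (5) job title match, (6) educational background.
Do not base any judgment on gender, race, or other protected attributes.
Return a single numeric match score."""

def build_messages(resume: str, job_description: str) -> list[dict]:
    """Zero-shot chat messages; the de-identified inputs are identical for every model."""
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": f"Job description:\n{job_description}\n\nResume:\n{resume}"},
    ]
```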
Evaluation Metrics
- Accuracy: ROC AUC, PR AUC, F1
- Fairness: Gender IR, Race IR, Intersectional IR
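The fairness metrics follow directly from the IR formula above. A minimal implementation, assuming a pandas DataFrame `df` with a binary `selected` column and self-reported attribute columns (column names are illustrative):

```python
import pandas as pd

def impact_ratio(selected: pd.Series, group: pd.Series) -> float:
    """EEOC four-fifths rule: min group selection rate over max group selection rate."""
    sr = selected.groupby(group).mean()  # SR_g: selection rate per demographic group
    return sr.min() / sr.max()           # IR < 0.8 flags potential disparate impact

# gender_ir = impact_ratio(df["selected"], df["gender"])
# race_ir   = impact_ratio(df["selected"], df["race"])
# inter_ir  = impact_ratio(df["selected"], df["race"] + " " + df["gender"])  # e.g. "Asian Female"
```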
Key Experimental Results
Main Results: Comprehensive Accuracy and Fairness Evaluation
| Model | ROC AUC | PR AUC | F1 | Gender IR | Race IR | Inter. IR |
|---|---|---|---|---|---|---|
| Match Score | 0.85 | 0.83 | 0.753 | 0.933 | 0.957 | 0.906 |
| GPT-4o | 0.76 | 0.79 | 0.746 | 0.997 | 0.774 | 0.773 |
| GPT-4.1 | 0.77 | 0.80 | 0.749 | 0.873 | 0.718 | 0.603 |
| o3-mini | 0.76 | 0.78 | 0.705 | 0.938 | 0.640 | 0.647 |
| Claude 3.5 v2 | 0.77 | 0.79 | 0.740 | 0.919 | 0.684 | 0.624 |
| Gemini 2.5 Flash | 0.76 | 0.78 | 0.714 | 0.851 | 0.773 | 0.616 |
| Llama 3.1-405B | 0.74 | 0.77 | 0.705 | 0.907 | 0.667 | 0.666 |
| Llama 4-Maverick | 0.76 | 0.78 | 0.719 | 0.928 | 0.689 | 0.673 |
| DeepSeek R1 | 0.75 | 0.77 | 0.710 | 0.850 | 0.809 | 0.620 |
Race-Dimension Breakdown (Match Score vs. GPT-4o vs. Llama 4-Maverick)
| Group | Match Score SR/IR | GPT-4o SR/IR | Llama 4 SR/IR |
|---|---|---|---|
| Asian | 64.3% / 0.957 | 76.6% / 1.000 | 66.2% / 1.000 |
| Black | 66.3% / 0.988 | 65.9% / 0.860 | 53.7% / 0.810 |
| Hispanic | 66.9% / 0.996 | 71.7% / 0.936 | 46.7% / 0.705 |
| White | 66.4% / 0.989 | 68.5% / 0.895 | 56.9% / 0.859 |
| Native American | 66.9% / 0.996 | 59.3% / 0.774 | 46.2% / 0.698 |
Key Findings
- Accuracy: Match Score's AUC of 0.85 exceeds the best-performing LLM (GPT-4.1, 0.77) by 0.08 AUC, demonstrating that domain-specific training outweighs model scale
- Fairness: All LLMs yield intersectional IR values below 0.8 (violating the four-fifths rule), with a minimum of 0.603 (GPT-4.1); Match Score maintains 0.906
- Gender vs. Race: LLMs exhibit relatively mild gender bias (GPT-4o approaches 1.0) but severe racial bias, indicating that single-attribute debiasing is insufficient
- Open-source vs. Closed-source: Open-source models (Llama, DeepSeek) exhibit worse racial fairness; Llama 3.1-405B achieves a Race IR of only 0.667
- Match Score simultaneously achieves the highest accuracy and the highest fairness, demonstrating that the two objectives are not mutually exclusive
Highlights & Insights
- Real data + full model coverage: 10K real hiring pairs evaluated across 9 models (spanning OpenAI, Anthropic, Google, Meta, and DeepSeek), representing the most comprehensive LLM hiring bias benchmark to date
- Intersectional analysis reveals problems invisible in single-dimension evaluations: GPT-4o's gender IR approaches 1.0, yet its intersectional IR is only 0.773, indicating severe race × gender interaction effects
- Practical implication: Off-the-shelf LLMs should not be directly deployed in high-stakes hiring decisions; explicit prompt instructions against discrimination are insufficient
- The paper explicitly argues that the accuracy–fairness trade-off is not zero-sum—a well-designed domain-specific model can achieve optimality on both dimensions simultaneously
Limitations & Future Work
- Match Score is a proprietary model and cannot be reproduced; the paper effectively endorses Eightfold.ai's commercial product, representing a potential conflict of interest
- Fairness evaluation relies on median-threshold binarization; different threshold choices may alter conclusions (see the sketch after this list)
- Only zero-shot LLMs are evaluated; it remains untested whether few-shot prompting or fine-tuning can close the performance gap
- Although real-world, the dataset originates from a single platform, potentially introducing selection bias in industry, regional, and role distribution
- The paper does not explore whether prompt engineering or post-processing can improve LLM fairness
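On the threshold-sensitivity point flagged in the list above, a quick robustness check is easy to sketch; `q` is the selection quantile and all names are illustrative assumptions:

```python
import numpy as np

def ir_at_quantile(scores: np.ndarray, group: np.ndarray, q: float) -> float:
    """Impact ratio when selecting everyone above the q-th score quantile."""
    selected = scores > np.quantile(scores, q)
    rates = [selected[group == g].mean() for g in np.unique(group)]
    return min(rates) / max(rates)

# Sweep q (0.5 reproduces the paper's median threshold) to test conclusion stability:
# for q in (0.3, 0.5, 0.7): print(q, ir_at_quantile(scores, group, q))
```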
Related Work & Insights
- vs. Bertrand & Mullainathan (2004): The classic résumé audit study revealed racial discrimination in human hiring; the present paper demonstrates that LLMs exhibit analogous patterns
- vs. Gaebler et al. (2024): Prior work found no significant gender or race disparities in GPT-3.5/Claude 1.3 résumé evaluations; the present paper draws different conclusions using larger-scale real data and a broader set of models
- vs. NYC Local Law 144: New York City has mandated bias audits for AI hiring tools; the evaluation framework proposed here can serve as a methodological reference for such audits
Rating
- Novelty: ⭐⭐⭐ The evaluation methodology is not entirely novel, but the coverage is the broadest of its kind; the intersectional analysis contributes meaningful value
- Experimental Thoroughness: ⭐⭐⭐⭐ 10K real-world data points, 9 models, and multi-dimensional fairness metrics constitute a thorough evaluation
- Writing Quality: ⭐⭐⭐⭐ Well-structured, results are presented clearly, and the discussion is constructive
- Value: ⭐⭐⭐⭐ Directly relevant to AI hiring regulation and practice; demonstrates the necessity of domain-specialized models combined with bias auditing