Measuring What Matters: Construct Validity in Large Language Model Benchmarks¶
Conference: NeurIPS 2025 arXiv: 2511.04703 Code: None Area: Recommender Systems Keywords: LLM evaluation, benchmark, construct validity, systematic review, evaluation methodology
TL;DR¶
This paper presents a systematic review of 445 LLM benchmark papers conducted by 29 experts, examining existing LLM evaluation benchmarks through the lens of construct validity across four dimensions — phenomenon definition, task design, scoring metrics, and conclusion claims — and proposes 8 actionable recommendations for improvement.
Background & Motivation¶
Background: LLM evaluation is one of the most active research directions in AI, with a large number of benchmark papers published annually. The reliability of evaluation results directly determines the accuracy of assessments of model capabilities and the effectiveness of pre-deployment safety evaluations. In recent years, the number of benchmarks has grown exponentially, but quality varies considerably.
Limitations of Prior Work: Many benchmarks attempt to measure abstract concepts such as "safety" and "robustness," yet their task designs and scoring mechanisms often fail to genuinely reflect the target phenomena. In the reviewed sample, 47.8% of benchmarks target definitions that are contested or lack consensus, and 27% rely on convenience sampling to obtain test instances.
Key Challenge: The explosive growth in the number of LLM benchmarks has far outpaced quality control: benchmarks are increasingly abundant, yet individual papers rarely provide adequate justification for why a given benchmark validly measures the intended capability. Only 53.4% of papers discuss construct validity.
Goal: (1) Systematically identify common construct validity issues in LLM benchmark papers published at top NLP/ML venues; (2) quantify the prevalence of each issue; (3) propose actionable recommendations and a checklist for improvement.
Key Insight: The authors draw on the construct validity framework from psychometrics, treating a benchmark as a chain of "phenomenon → task → metric → claim" and systematically examining the validity issues that may arise at each stage.
Core Idea: Apply the well-established construct validity framework from psychometrics to systematically audit the quality of LLM benchmarks, identify widespread methodological deficiencies, and distill 8 actionable recommendations for improvement.
Method¶
Overall Architecture¶
This paper adopts a systematic review methodology. From a corpus of 46,114 papers published at ICML/ICLR/NeurIPS (2018–2024) and ACL/NAACL/EMNLP (2020–2024), keyword-based filtering ("benchmark" combined with "LLM"/"language model") yielded 2,189 candidate papers. These were subsequently processed through automated pre-screening by GPT-4o mini (F1 = 84%) and manual review by 29 experts, resulting in 445 papers selected for in-depth coding analysis.
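To make the filtering stage concrete, here is a minimal sketch of the keyword-based selection step, assuming each paper is represented as a dictionary with title and abstract fields; the authors' exact keyword rules, venue handling, and the GPT-4o mini pre-screening prompt are not reproduced here.

```python
# Illustrative sketch of the keyword-filtering stage (not the authors' code).
# Assumes each paper is a dict with "title" and "abstract" fields.
import re
from typing import Iterable

BENCHMARK_RE = re.compile(r"\bbenchmark", re.IGNORECASE)
LLM_RE = re.compile(r"\b(?:LLMs?|large language models?|language models?)\b", re.IGNORECASE)

def keyword_filter(papers: Iterable[dict]) -> list[dict]:
    """Keep papers whose title or abstract mentions both 'benchmark' and an LLM term."""
    kept = []
    for paper in papers:
        text = f"{paper.get('title', '')} {paper.get('abstract', '')}"
        if BENCHMARK_RE.search(text) and LLM_RE.search(text):
            kept.append(paper)
    return kept

# In the paper, this keyword stage reduced 46,114 papers to 2,189 candidates,
# which GPT-4o mini pre-screening (F1 = 84%) and manual review by 29 experts
# then narrowed to the 445 papers that were coded in depth.
```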
Key Designs¶
- Codebook:
- Function: Standardizes the coding of phenomenon definition, task design, metric selection, and conclusion claims for each paper.
- Mechanism: Designed 21 question items based on multiple dimensions of construct validity (face validity, predictive validity, content validity, ecological validity, convergent validity, and discriminant validity).
- Design Motivation: Transforms the subjective question of "is this benchmark good?" into a quantifiable multi-dimensional assessment that supports statistical analysis.
- Two-Round Review Process:
- Function: Each paper is first evaluated item-by-item by a primary reviewer using the codebook, then a secondary reviewer maps the responses to simplified options and confirms with the primary reviewer.
- Mechanism: 46 papers were randomly sampled for double review; inter-rater reliability was assessed using the Brennan-Prediger Kappa coefficient (mean = 0.524). A minimal computation sketch appears after this list.
- Design Motivation: Balances review scale (445 papers) with quality control to ensure the reliability of coding outcomes.
- Inductive Recommendation Generation:
- Function: The first author read a subset of 50 papers and reviewed all 445 annotations, deriving preliminary recommendations through inductive open coding.
- Mechanism: Through 5 rounds of iterative multi-author discussion, findings were consolidated into 8 primary recommendations.
- Design Motivation: Ensures that recommendations are both empirically grounded and operationally practical.
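For reference, the Brennan-Prediger kappa mentioned above corrects observed agreement by a uniform chance baseline of 1/q, where q is the number of answer categories. Below is a minimal sketch of this computation for two raters; the labels and category count are illustrative, not taken from the paper's codebook.

```python
# Minimal sketch of the Brennan-Prediger kappa (illustrative, not the authors' code).
# Assumes two raters each assign one categorical label per double-reviewed item.

def brennan_prediger_kappa(rater_a: list[str], rater_b: list[str], n_categories: int) -> float:
    """Chance-corrected agreement with a uniform chance baseline of 1/q."""
    assert len(rater_a) == len(rater_b) and rater_a, "ratings must be paired and non-empty"
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / len(rater_a)
    chance = 1.0 / n_categories
    return (observed - chance) / (1.0 - chance)

# Toy example with 3 hypothetical answer options:
a = ["yes", "no", "partial", "yes", "yes"]
b = ["yes", "no", "yes", "yes", "partial"]
print(brennan_prediger_kappa(a, b, n_categories=3))  # 0.4 on this toy example
```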
Overview of 8 Recommendations¶
- Define the phenomenon: Clearly define the target phenomenon; sub-components should be evaluated separately.
- Measure the phenomenon and only the phenomenon: Control for confounding factors such as output format and instruction complexity.
- Construct representative datasets: Use random or stratified sampling instead of convenience sampling.
- Exercise caution when reusing datasets: Document differences between old and new versions and assess changes in construct validity.
- Guard against data contamination: Check for contamination at dataset creation time and consider dynamic benchmarks.
- Use statistical methods to compare models: Currently only 16% of papers employ statistical testing; a minimal sketch of one suitable test appears after this list.
- Conduct error analysis: Verify whether failure patterns correspond to the target phenomenon.
- Justify construct validity: Explicitly articulate the reasoning chain from phenomenon to task to metric to claim.
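As one way to act on the statistical-comparison recommendation, a paired bootstrap test over per-item scores can check whether an observed accuracy gap between two models is statistically meaningful. The sketch below is illustrative and not the authors' procedure; the function name, score arrays, and resampling count are assumptions.

```python
# Illustrative paired bootstrap test for comparing two models on the same
# benchmark items (not the authors' procedure); names and defaults are assumptions.
import numpy as np

def paired_bootstrap_pvalue(scores_a: np.ndarray, scores_b: np.ndarray,
                            n_resamples: int = 10_000, seed: int = 0) -> float:
    """Two-sided p-value for the null that models A and B perform equally,
    given per-item scores (e.g., 0/1 correctness) on identical items."""
    rng = np.random.default_rng(seed)
    diffs = scores_a - scores_b
    observed = diffs.mean()
    n = len(diffs)
    # Resample item indices with replacement and recompute the mean difference.
    idx = rng.integers(0, n, size=(n_resamples, n))
    boot_means = diffs[idx].mean(axis=1)
    # Center the bootstrap distribution on the null (no difference) and count
    # resamples at least as extreme as the observed gap.
    shifted = boot_means - observed
    return float(np.mean(np.abs(shifted) >= abs(observed)))
```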
Key Experimental Results¶
Core Statistical Findings¶
| Dimension | Finding | Proportion |
|---|---|---|
| Phenomenon definition | Definition provided | 78.2% |
| Phenomenon definition | Definition is contested | 47.8% |
| Phenomenon type | Composite phenomena (with sub-capabilities) | 61.2% |
| Task source | Manually constructed tasks | 43.3% |
| Task source | Reused existing benchmarks | 42.6% |
| Task source | LLM-generated | 31.2% |
| Sampling method | Convenience sampling (at least in part) | 27.0% |
| Scoring metric | Exact match (at least in part) | 81.3% |
| Statistical methods | Used statistical testing | 16.0% |
| Validity justification | Discussed construct validity | 53.4% |
Distribution of Benchmark Phenomena¶
| Phenomenon Category | Proportion | Notes |
|---|---|---|
| Reasoning | 18.5% | Most common category |
| Alignment | 8.1% | Among the most contested in definition |
| Code Generation | 5.7% | Relatively well-defined |
| Other general capabilities | ~30% | Includes knowledge, comprehension, etc. |
| Domain-specific applications | ~38% | Medical, legal, etc. |
Key Findings¶
- Fewer than 10% of benchmarks use fully real-world tasks; 40.7% rely on manually constructed tasks.
- The most common scoring metric is exact match (81.3%); LLM-as-a-judge is used in only 17.1% of papers.
- The number of benchmark papers has grown significantly year over year, but the proportion discussing construct validity has not increased correspondingly.
- Approximately half of all papers exhibit validity weaknesses in at least one dimension.
Highlights & Insights¶
- Systematic scale: This large-scale systematic review — encompassing 445 papers and 29 experts — is unprecedented in the field of LLM evaluation methodology, providing first-hand empirical data for quantifying "benchmark quality." The codebook design skillfully adapts psychometric theory to AI evaluation contexts.
- High practical value: The 8 recommendations are accompanied by an operational checklist that can guide the design of new benchmarks and serve as a reviewing standard for assessing existing ones. The checklist is released as an appendix, and researchers are encouraged to justify any items they choose to omit; a hypothetical rendering of such a checklist is sketched below.
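As a rough illustration of how a benchmark team might operationalize the checklist (the authoritative version is the one released in the paper's appendix), the sketch below tracks the 8 recommendations together with a justification field for items an author deliberately omits; all field names are hypothetical.

```python
# Hypothetical rendering of the recommendation checklist (the authoritative
# version is the appendix of the paper); field names are illustrative only.
from dataclasses import dataclass

@dataclass
class ChecklistItem:
    recommendation: str
    satisfied: bool = False
    justification: str = ""  # authors are asked to justify any omitted item

CHECKLIST = [ChecklistItem(r) for r in (
    "Define the phenomenon",
    "Measure the phenomenon and only the phenomenon",
    "Construct a representative dataset",
    "Exercise caution when reusing datasets",
    "Guard against data contamination",
    "Use statistical methods to compare models",
    "Conduct error analysis",
    "Justify construct validity",
)]

def unresolved(items: list[ChecklistItem]) -> list[str]:
    """Recommendations that are neither satisfied nor explicitly justified as omitted."""
    return [i.recommendation for i in items if not i.satisfied and not i.justification]
```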
Limitations & Future Work¶
- Coverage is limited to top-venue papers, excluding important benchmarks released by industry (e.g., industrial iterations of MMLU, HumanEval, etc.).
- The use of GPT-4o mini for pre-screening may introduce systematic false-negative bias.
- Each paper was coded by only 1–2 reviewers; the double-review Kappa of 0.524 reflects only moderate inter-rater agreement.
- The paper does not deeply analyze whether construct validity issues differ systematically across domains (reasoning vs. safety vs. code generation).
Related Work & Insights¶
- vs. BetterBench (Reuel et al., 2024): BetterBench also assesses benchmark quality, but the present work covers a larger corpus (445 papers) and applies a more systematic theoretical framework (multi-dimensional construct validity).
- vs. Bowman & Dahl (2021): That work proposed directional recommendations for fixing NLU benchmarks; the present paper builds upon it by quantifying the prevalence of these issues through large-scale empirical data.
- vs. tinyBenchmarks (Polo et al., 2024): Focuses on efficient evaluation with fewer samples, complementing this paper's emphasis on sample representativeness.
- vs. Dynabench (Kiela et al., 2021): Proposes dynamic benchmarks to address data contamination, consistent with Recommendation 5 (guarding against contamination) in the present work.
- Implications: The proposed checklist can be directly applied during peer review to assess the methodological quality of benchmark papers.
Rating¶
- Novelty: ⭐⭐⭐⭐ First systematic application of the construct validity framework to large-scale auditing of LLM benchmarks.
- Experimental Thoroughness: ⭐⭐⭐⭐ Coding analysis of 445 papers with substantial data coverage, though inter-rater Kappa is moderate.
- Writing Quality: ⭐⭐⭐⭐⭐ Clear structure, highly actionable recommendations, and rich figures and tables.
- Value: ⭐⭐⭐⭐⭐ Carries significant guidance value for the entire LLM evaluation community and has the potential to reshape benchmark design paradigms.