Measuring What Matters: Construct Validity in Large Language Model Benchmarks¶
Conference: NeurIPS 2025 arXiv: 2511.04703 Code: None Area: Recommender Systems Keywords: LLM evaluation, benchmark, construct validity, systematic review, evaluation methodology
TL;DR¶
This paper presents a systematic review of 445 LLM benchmark papers conducted by 29 experts, examining existing LLM evaluation benchmarks through the lens of construct validity across four dimensions — phenomenon definition, task design, scoring metrics, and conclusion claims — and proposes 8 actionable recommendations for improvement.
Background & Motivation¶
Background: LLM evaluation is one of the most active research directions in AI, with a large number of benchmark papers published annually. The reliability of evaluation results directly determines the accuracy of assessments of model capabilities and the effectiveness of pre-deployment safety evaluations. In recent years, the number of benchmarks has grown exponentially, but quality varies considerably.
Limitations of Prior Work: Many benchmarks attempt to measure abstract concepts such as "safety" and "robustness," yet their task designs and scoring mechanisms often fail to genuinely reflect the target phenomena. In the reviewed sample, 47.8% of benchmarks target definitions that are contested or lack consensus, and 27% rely on convenience sampling to obtain test instances.
Key Challenge: The explosive growth in the number of LLM benchmarks has far outpaced quality control: benchmarks are increasingly abundant, yet individual papers rarely provide adequate justification for why a given benchmark validly measures the intended capability. Only 53.4% of papers discuss construct validity.
Goal: (1) Systematically identify common construct validity issues in LLM benchmark papers published at top NLP/ML venues; (2) quantify the prevalence of each issue; (3) propose actionable recommendations and a checklist for improvement.
Key Insight: The authors draw on the construct validity framework from psychometrics, treating a benchmark as a chain of "phenomenon → task → metric → claim" and systematically examining the validity issues that may arise at each stage.
Core Idea: Apply the well-established construct validity framework from psychometrics to systematically audit the quality of LLM benchmarks, identify widespread methodological deficiencies, and distill 8 actionable recommendations for improvement.
Method¶
Overall Architecture¶
This paper adopts a systematic review methodology. From a corpus of 46,114 papers published at ICML/ICLR/NeurIPS (2018–2024) and ACL/NAACL/EMNLP (2020–2024), keyword-based filtering ("benchmark" combined with "LLM"/"language model") yielded 2,189 candidate papers. These were subsequently processed through automated pre-screening by GPT-4o mini (F1 = 84%) and manual review by 29 experts, resulting in 445 papers selected for in-depth coding analysis.
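To make the filtering stage concrete, here is a minimal sketch of the keyword-based selection step, assuming each paper is represented as a dictionary with title and abstract fields; the authors' exact keyword rules, venue handling, and the GPT-4o mini pre-screening prompt are not reproduced here.

```python
# Illustrative sketch of the keyword-filtering stage (not the authors' code).
# Assumes each paper is a dict with "title" and "abstract" fields.
import re
from typing import Iterable

BENCHMARK_RE = re.compile(r"\bbenchmark", re.IGNORECASE)
LLM_RE = re.compile(r"\b(?:LLMs?|large language models?|language models?)\b", re.IGNORECASE)

def keyword_filter(papers: Iterable[dict]) -> list[dict]:
    """Keep papers whose title or abstract mentions both 'benchmark' and an LLM term."""
    kept = []
    for paper in papers:
        text = f"{paper.get('title', '')} {paper.get('abstract', '')}"
        if BENCHMARK_RE.search(text) and LLM_RE.search(text):
            kept.append(paper)
    return kept

# In the paper, this keyword stage reduced 46,114 papers to 2,189 candidates,
# which GPT-4o mini pre-screening (F1 = 84%) and manual review by 29 experts
# then narrowed to the 445 papers that were coded in depth.
```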
Key Designs¶
- Codebook:
- Function: Standardizes the coding of phenomenon definition, task design, metric selection, and conclusion claims for each paper.
- Mechanism: Designed 21 question items based on multiple dimensions of construct validity (face validity, predictive validity, content validity, ecological validity, convergent validity, and discriminant validity).
- Design Motivation: Transforms the subjective question of "is this benchmark good?" into a quantifiable multi-dimensional assessment that supports statistical analysis.
- Two-Round Review Process:
- Function: Each paper is first evaluated item-by-item by a primary reviewer using the codebook, then a secondary reviewer maps the responses to simplified options and confirms with the primary reviewer.
- Mechanism: 46 papers were randomly sampled for double review; inter-rater reliability was assessed using the Brennan-Prediger Kappa coefficient (mean = 0.524). A minimal computation sketch appears after this list.
- Design Motivation: Balances review scale (445 papers) with quality control to ensure the reliability of coding outcomes.
- Inductive Recommendation Generation:
- Function: The first author read a subset of 50 papers and reviewed all 445 annotations, deriving preliminary recommendations through inductive open coding.
- Mechanism: Through 5 rounds of iterative multi-author discussion, findings were consolidated into 8 primary recommendations.
- Design Motivation: Ensures that recommendations are both empirically grounded and operationally practical.
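For reference, the Brennan-Prediger kappa mentioned above corrects observed agreement by a uniform chance baseline of 1/q, where q is the number of answer categories. Below is a minimal sketch of this computation for two raters; the labels and category count are illustrative, not taken from the paper's codebook.

```python
# Minimal sketch of the Brennan-Prediger kappa (illustrative, not the authors' code).
# Assumes two raters each assign one categorical label per double-reviewed item.

def brennan_prediger_kappa(rater_a: list[str], rater_b: list[str], n_categories: int) -> float:
    """Chance-corrected agreement with a uniform chance baseline of 1/q."""
    assert len(rater_a) == len(rater_b) and rater_a, "ratings must be paired and non-empty"
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / len(rater_a)
    chance = 1.0 / n_categories
    return (observed - chance) / (1.0 - chance)

# Toy example with 3 hypothetical answer options:
a = ["yes", "no", "partial", "yes", "yes"]
b = ["yes", "no", "yes", "yes", "partial"]
print(brennan_prediger_kappa(a, b, n_categories=3))  # 0.4 on this toy example
```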
Overview of 8 Recommendations¶
- Define the phenomenon: Clearly define the target phenomenon; sub-components should be evaluated separately.
- Measure the phenomenon and only the phenomenon: Control for confounding factors such as output format and instruction complexity.
- Construct representative datasets: Use random or stratified sampling instead of convenience sampling.
- Exercise caution when reusing datasets: Document differences between old and new versions and assess changes in construct validity.
- Guard against data contamination: Check for contamination at dataset creation time and consider dynamic benchmarks.
- Use statistical methods to compare models: Currently only 16% of papers employ statistical testing; a minimal sketch of one suitable test appears after this list.
- Conduct error analysis: Verify whether failure patterns correspond to the target phenomenon.
- Justify construct validity: Explicitly articulate the reasoning chain from phenomenon to task to metric to claim.
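As one way to act on the statistical-comparison recommendation, a paired bootstrap test over per-item scores can check whether an observed accuracy gap between two models is statistically meaningful. The sketch below is illustrative and not the authors' procedure; the function name, score arrays, and resampling count are assumptions.

```python
# Illustrative paired bootstrap test for comparing two models on the same
# benchmark items (not the authors' procedure); names and defaults are assumptions.
import numpy as np

def paired_bootstrap_pvalue(scores_a: np.ndarray, scores_b: np.ndarray,
                            n_resamples: int = 10_000, seed: int = 0) -> float:
    """Two-sided p-value for the null that models A and B perform equally,
    given per-item scores (e.g., 0/1 correctness) on identical items."""
    rng = np.random.default_rng(seed)
    diffs = scores_a - scores_b
    observed = diffs.mean()
    n = len(diffs)
    # Resample item indices with replacement and recompute the mean difference.
    idx = rng.integers(0, n, size=(n_resamples, n))
    boot_means = diffs[idx].mean(axis=1)
    # Center the bootstrap distribution on the null (no difference) and count
    # resamples at least as extreme as the observed gap.
    shifted = boot_means - observed
    return float(np.mean(np.abs(shifted) >= abs(observed)))
```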
Key Experimental Results¶
Core Statistical Findings¶
| Dimension | Finding | Proportion |
|---|---|---|
| Phenomenon definition | Definition provided | 78.2% |
| Phenomenon definition | Definition is contested | 47.8% |
| Phenomenon type | Composite phenomena (with sub-capabilities) | 61.2% |
| Task source | Manually constructed tasks | 43.3% |
| Task source | Reused existing benchmarks | 42.6% |
| Task source | LLM-generated | 31.2% |
| Sampling method | Convenience sampling (at least in part) | 27.0% |
| Scoring metric | Exact match (at least in part) | 81.3% |
| Statistical methods | Used statistical testing | 16.0% |
| Validity justification | Discussed construct validity | 53.4% |
Distribution of Benchmark Phenomena¶
| Phenomenon Category | Proportion | Notes |
|---|---|---|
| Reasoning | 18.5% | Most common category |
| Alignment | 8.1% | Among the most contested in definition |
| Code Generation | 5.7% | Relatively well-defined |
| Other general capabilities | ~30% | Includes knowledge, comprehension, etc. |
| Domain-specific applications | ~38% | Medical, legal, etc. |
Key Findings¶
- Fewer than 10% of benchmarks use fully real-world tasks; 40.7% rely on manually constructed tasks.
- The most common scoring metric is exact match (81.3%); LLM-as-a-judge is used in only 17.1% of papers.
- The number of benchmark papers has grown significantly year over year, but the proportion discussing construct validity has not increased correspondingly.
- Approximately half of all papers exhibit validity weaknesses in at least one dimension.
Highlights & Insights¶
- Systematic scale: This large-scale systematic review — encompassing 445 papers and 29 experts — is unprecedented in the field of LLM evaluation methodology, providing first-hand empirical data for quantifying "benchmark quality." The codebook design skillfully adapts psychometric theory to AI evaluation contexts.
- High practical value: The 8 recommendations are accompanied by an operational checklist that can guide the design of new benchmarks and serve as a reviewing standard for assessing existing ones. The checklist is released as an appendix, and researchers are encouraged to justify any items they choose to omit; a hypothetical rendering of such a checklist is sketched below.
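As a rough illustration of how a benchmark team might operationalize the checklist (the authoritative version is the one released in the paper's appendix), the sketch below tracks the 8 recommendations together with a justification field for items an author deliberately omits; all field names are hypothetical.

```python
# Hypothetical rendering of the recommendation checklist (the authoritative
# version is the appendix of the paper); field names are illustrative only.
from dataclasses import dataclass

@dataclass
class ChecklistItem:
    recommendation: str
    satisfied: bool = False
    justification: str = ""  # authors are asked to justify any omitted item

CHECKLIST = [ChecklistItem(r) for r in (
    "Define the phenomenon",
    "Measure the phenomenon and only the phenomenon",
    "Construct a representative dataset",
    "Exercise caution when reusing datasets",
    "Guard against data contamination",
    "Use statistical methods to compare models",
    "Conduct error analysis",
    "Justify construct validity",
)]

def unresolved(items: list[ChecklistItem]) -> list[str]:
    """Recommendations that are neither satisfied nor explicitly justified as omitted."""
    return [i.recommendation for i in items if not i.satisfied and not i.justification]
```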
Limitations & Future Work¶
- Coverage is limited to top-venue papers, excluding important benchmarks released by industry (e.g., industrial iterations of MMLU, HumanEval, etc.).
- The use of GPT-4o mini for pre-screening may introduce systematic false-negative bias.
- Each paper was coded by only 1–2 reviewers; the double-review Kappa of 0.524 reflects only moderate inter-rater agreement.
- The paper does not deeply analyze whether construct validity issues differ systematically across domains (reasoning vs. safety vs. code generation).
Related Work & Insights¶
- vs. BetterBench (Reuel et al., 2024): BetterBench also assesses benchmark quality, but the present work covers a larger corpus (445 papers) and applies a more systematic theoretical framework (multi-dimensional construct validity).
- vs. Bowman & Dahl (2021): That work proposed directional recommendations for fixing NLU benchmarks; the present paper builds upon it by quantifying the prevalence of these issues through large-scale empirical data.
- vs. tinyBenchmarks (Polo et al., 2024): Focuses on efficient evaluation with fewer samples, complementing this paper's emphasis on sample representativeness.
- vs. Dynabench (Kiela et al., 2021): Proposes dynamic benchmarks to address data contamination, consistent with Recommendation 5 (guarding against contamination) in the present work.
- Implications: The proposed checklist can be directly applied during peer review to assess the methodological quality of benchmark papers.
Rating¶
- Novelty: ⭐⭐⭐⭐ First systematic application of the construct validity framework to large-scale auditing of LLM benchmarks.
- Experimental Thoroughness: ⭐⭐⭐⭐ Coding analysis of 445 papers with substantial data coverage, though inter-rater Kappa is moderate.
- Writing Quality: ⭐⭐⭐⭐⭐ Clear structure, highly actionable recommendations, and rich figures and tables.
- Value: ⭐⭐⭐⭐⭐ Carries significant guidance value for the entire LLM evaluation community and has the potential to reshape benchmark design paradigms.