# Lost in Benchmarks? Rethinking Large Language Model Benchmarking with Item Response Theory
Conference: AAAI 2026 · arXiv: 2505.15055 · Code: https://github.com/Joe-Hall-Lee/PSN-IRT · Area: LLM Evaluation · Keywords: IRT, benchmark evaluation, condition number, item quality, PSN-IRT
## TL;DR
This paper proposes PSN-IRT (Pseudo-Siamese Network for IRT), an enhanced Item Response Theory framework that jointly estimates LLM ability parameters and four item characteristics (difficulty / discrimination / guessing / feasibility). Applied to 41,871 items across 11 benchmarks, the framework reveals systemic issues including widespread saturation, insufficient difficulty ceilings, and data contamination. Item subsets selected by PSN-IRT achieve perfect ranking consistency with human preference rankings (Kendall \(\tau = 1.00\)).
## Background & Motivation
Background: LLM evaluation predominantly relies on average-score rankings on benchmarks (e.g., MMLU, HumanEval), yet different leaderboards produce inconsistent rankings, and performance gaps among top models are too small to be meaningfully distinguished.
Limitations of Prior Work:

- Different benchmarks can yield entirely divergent rankings for the same set of models, raising the question of whether such signals reflect genuine capability differences or merely noise.
- Current benchmarks weight all items equally, ignoring item quality: easy and hard items contribute identically to the final score.
- No systematic tooling exists to diagnose the quality of benchmarks themselves (saturation, discrimination, contamination).
Key Challenge: Benchmarks are supposed to serve as objective measuring instruments, yet the quality of the instruments themselves has never been systematically audited.
Goal: To audit the item quality of LLM benchmarks using IRT, and to establish more reliable ability estimation and model ranking.
Key Insight: Deeply customizing Item Response Theory (IRT) from educational measurement for LLM evaluation — learning model ability and item parameters end-to-end via a deep pseudo-siamese network.
Core Idea: PSN-IRT = dual network (model ability + item parameters) \(\times\) 4PL IRT formula \(\rightarrow\) benchmark quality auditing + reliable ranking.
## Method
### Overall Architecture
Input: binary response matrices of 12 LLMs \(\times\) 11 benchmarks \(\rightarrow\) PSN-IRT dual-branch network \(\rightarrow\) Output: per-model ability \(\theta\) + four parameters per item (difficulty \(b\), discrimination \(a\), guessing \(c\), feasibility upper bound \(d\)) \(\rightarrow\) benchmark quality analysis + model ranking.
### Key Designs
- PSN-IRT Architecture:
    - Function: End-to-end joint estimation of model ability and item parameters.
    - Mechanism: Two independent MLP branches; one estimates \(\theta\) from model response patterns, the other estimates \((a, b, c, d)\) from item response patterns. Both are jointly optimized via the 4PL IRT formula \(P(\theta) = c + \frac{d-c}{1+e^{-a(\theta-b)}}\).
    - Design Motivation: Traditional IRT relies on MLE or MCMC for iterative inference; PSN-IRT trains a neural network end to end for greater efficiency and scalability to large datasets.
- Four-Parameter IRT Model (4PL):
    - Function: Models LLM response behavior more accurately than standard IRT.
    - Four parameters: difficulty \(b\) (minimum \(\theta\) required to answer correctly), discrimination \(a\) (how well the item differentiates high- from low-ability models), guessing \(c\) (probability that a low-ability model answers correctly by chance), and feasibility upper bound \(d\) (probability ceiling even for the strongest models).
    - Design Motivation: LLMs may answer easy items correctly by "guessing" (\(c > 0\)), and some items may be infeasible for all models (\(d < 1\)), phenomena that standard IRT cannot capture.
- Benchmark Quality Diagnostics:
    - Function: Diagnoses systemic benchmark issues via item parameters.
    - Diagnostic dimensions: saturation (proportion of items with excessively low discrimination \(a\)); difficulty ceiling (whether the maximum \(b\) suffices to differentiate top models); data contamination (an anomalously high guessing rate \(c\) may indicate answer leakage into training data).
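The 4PL response function that ties the two branches together can be sketched in a few lines. This is a minimal illustration of the formula itself, not the paper's network; the parameter values are illustrative, not estimated:

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def p_correct(theta: float, a: float, b: float, c: float, d: float) -> float:
    """4PL item response function: P(theta) = c + (d - c) / (1 + exp(-a(theta - b)))."""
    return c + (d - c) * sigmoid(a * (theta - b))

# Illustrative item: moderately hard (b = 1.0), discriminating (a = 2.0),
# with a 10% guessing floor and a 95% feasibility ceiling.
a, b, c, d = 2.0, 1.0, 0.10, 0.95

weak = p_correct(-1.0, a, b, c, d)    # low-ability model: stays near the floor c
strong = p_correct(3.0, a, b, c, d)   # strong model: approaches the ceiling d, never 1.0
```

Note how \(c\) and \(d\) bound the probability from both sides: even a very weak model scores at least the guessing rate, and even a very strong one cannot exceed the feasibility ceiling, which is exactly what the diagnostics above exploit.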
### Loss & Training
- Binary cross-entropy loss (response prediction).
- Evaluated on 12 models (GPT-4, DeepSeek-V3, Qwen-Plus, etc.) \(\times\) 11 benchmarks.
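A minimal sketch of the training objective: binary cross-entropy between 4PL-predicted probabilities and the observed 0/1 responses, averaged over the (model, item) matrix. The MLP branches are omitted, and the ability/item values below are placeholders rather than learned estimates:

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def p_4pl(theta, a, b, c, d):
    """4PL response probability for one (model, item) pair."""
    return c + (d - c) * sigmoid(a * (theta - b))

def bce(p, y, eps=1e-7):
    """Binary cross-entropy for one response y in {0, 1}, with clipping for stability."""
    p = min(max(p, eps), 1.0 - eps)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

# Toy response matrix: 3 "models" x 2 "items" (1 = correct answer).
responses = [[1, 0], [1, 1], [0, 0]]
thetas = [0.5, 1.5, -0.5]                  # per-model ability (placeholder)
items = [(2.0, 0.0, 0.05, 0.98),           # per-item (a, b, c, d) (placeholder)
         (1.5, 1.0, 0.10, 0.95)]

loss = sum(bce(p_4pl(th, *it), y)
           for th, row in zip(thetas, responses)
           for it, y in zip(items, row)) / (len(thetas) * len(items))
```

In the actual framework this loss is backpropagated through both MLP branches, so ability and item parameters are fit jointly rather than alternated as in classical MLE.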
## Key Experimental Results
### Main Results
| Metric | PSN-IRT | Deep-IRT (1PL) | Traditional IRT (4PL MLE) |
|---|---|---|---|
| ACC | 0.7998 | 0.7974 | 0.7211 |
| F1 | 0.8538 | 0.8516 | 0.8034 |
| AUC | 0.8485 | 0.8519 | 0.7012 |
| Kendall \(\tau\) | 1.0000 | 0.9697 | 0.9697 |
### Ablation Study: Benchmark Quality Diagnostics
| Benchmark | Primary Issue | Description |
|---|---|---|
| MMLU | High saturation | Most items fail to differentiate top-tier models |
| HumanEval+ | Insufficient difficulty ceiling | Hardest items are not challenging enough for GPT-4 |
| GSM8K | Suspected contamination | Some items exhibit anomalously high guessing rate \(c\) |
| MATH | Good discrimination | The only benchmark performing well across multiple quality dimensions |
## Key Findings
- No single benchmark excels across all quality dimensions — every benchmark exhibits systemic weaknesses.
- PSN-IRT rankings align with human preferences (\(\tau = 1.00\)), substantially outperforming traditional methods (\(\tau = 0.97\)).
- Item subsets selected by PSN-IRT can substitute entire benchmarks — a small set of high-quality items suffices for reliable ranking.
- Model scale is not the sole determinant of capability — IRT-estimated \(\theta\) values sometimes diverge from rankings based on parameter count.
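The subset-substitution finding can be illustrated with a toy simulation: keep only the items with the highest estimated discrimination \(a\), re-score the models on that subset, and check rank agreement with Kendall's \(\tau\). The data, the top-25% cutoff, and the tau-a formula are illustrative assumptions, not the paper's procedure:

```python
import math
import random
from itertools import combinations

def kendall_tau(x, y):
    """Kendall rank correlation (tau-a) between two equal-length score lists."""
    pairs = list(combinations(range(len(x)), 2))
    s = sum(1 if (x[i] - x[j]) * (y[i] - y[j]) > 0
            else -1 if (x[i] - x[j]) * (y[i] - y[j]) < 0
            else 0
            for i, j in pairs)
    return s / len(pairs)

def p_correct(theta, a_j, b_j):
    # 2PL probability is enough for this toy simulation.
    return 1.0 / (1.0 + math.exp(-a_j * (theta - b_j)))

random.seed(0)
n_models, n_items = 12, 200
ability = [i * 0.25 for i in range(n_models)]            # evenly spaced abilities
disc = [random.uniform(0.1, 3.0) for _ in range(n_items)]  # item discrimination a
diff = [random.gauss(0.0, 1.0) for _ in range(n_items)]    # item difficulty b

responses = [[1 if random.random() < p_correct(th, disc[j], diff[j]) else 0
              for j in range(n_items)] for th in ability]

# Rank models on the full benchmark vs. the top-25% most discriminating items.
full_scores = [sum(row) / n_items for row in responses]
top_items = sorted(range(n_items), key=lambda j: disc[j])[-n_items // 4:]
subset_scores = [sum(row[j] for j in top_items) / len(top_items) for row in responses]

tau = kendall_tau(full_scores, subset_scores)
```

In this toy setting the high-discrimination quarter already reproduces the full-benchmark ordering almost exactly, which mirrors the paper's claim that a small set of high-quality items suffices for reliable ranking.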
## Highlights & Insights
- "Auditing the benchmark itself" constitutes meta-evaluation — applying psychometric tools to assess the quality of measurement instruments, which is conceptually significant.
- Using the 4PL guessing parameter \(c\) as a contamination detector is a clever application — if answers appear in training data, models can respond correctly even without genuine understanding.
- PSN-IRT can serve as a quality-gating tool for any AI benchmark.
## Limitations & Future Work
- Assumes binary responses (correct/incorrect), making it inapplicable to generative evaluation settings.
- 12 models may be insufficient to yield stable IRT parameter estimates.
- Item interdependencies are not modeled (standard IRT assumes local independence).
## Related Work & Insights
- vs. Chatbot Arena (LMSYS): Human preference-based ranking. PSN-IRT ranks models via items; the two approaches are complementary.
- vs. BenchmarkCards: Descriptive diagnostics. PSN-IRT provides quantitative item-level parameters.
- vs. DynaBench: DynaBench addresses data leakage via dynamic datasets but does not resolve item quality issues; PSN-IRT quantifies each item's discriminative power from a statistical perspective.
- The application of IRT to AI evaluation is generalizable to domain-specific benchmarks in coding, reasoning, and beyond.
- Insight: New benchmarks should undergo IRT analysis prior to release to filter out low-discrimination items.
## Rating
- Novelty: ⭐⭐⭐⭐ — A systematic framework for auditing LLM benchmarks with IRT, introducing mature psychometric tools into AI evaluation.
- Experimental Thoroughness: ⭐⭐⭐⭐ — Covers 11 benchmarks, 12 models, and 41K items; analysis is conducted at sufficient scale.
- Writing Quality: ⭐⭐⭐⭐ — Theory and empirical findings are well integrated; visualizations clearly illustrate item quality issues.
- Value: ⭐⭐⭐⭐⭐ — Makes an important foundational contribution to LLM evaluation methodology by exposing the pervasive presence of low-quality items in existing benchmarks.