
Lost in Benchmarks? Rethinking Large Language Model Benchmarking with Item Response Theory

Conference: AAAI 2026
arXiv: 2505.15055
Code: https://github.com/Joe-Hall-Lee/PSN-IRT
Area: LLM Evaluation
Keywords: IRT, benchmark evaluation, condition number, item quality, PSN-IRT

TL;DR

This paper proposes PSN-IRT (Pseudo-Siamese Network for IRT), an enhanced Item Response Theory framework that jointly estimates LLM ability parameters and four-parameter item characteristics (difficulty / discrimination / guessing / feasibility). Applied to 41,871 items across 11 benchmarks, the framework reveals systemic issues including widespread saturation, insufficient difficulty ceilings, and data contamination. Item subsets selected by PSN-IRT achieve perfect ranking agreement with human preference rankings (Kendall \(\tau = 1.00\)).

Background & Motivation

Background: LLM evaluation predominantly relies on average-score rankings on benchmarks (e.g., MMLU, HumanEval), yet different leaderboards produce inconsistent rankings, and the score gaps among top models are often too small to distinguish them meaningfully.

Limitations of Prior Work:

  • Different benchmarks can yield entirely divergent rankings for the same set of models, raising the question of whether such signals reflect genuine capability differences or merely noise.
  • Current benchmarks treat all items with equal weight, ignoring item quality: easy and hard items contribute identically to the final score.
  • No systematic tooling exists to diagnose the quality of benchmarks themselves (saturation, discrimination, contamination).

Key Challenge: Benchmarks are supposed to serve as objective measuring instruments, yet the quality of the instruments themselves has never been systematically audited.

Goal: To audit the item quality of LLM benchmarks using IRT, and to establish more reliable ability estimation and model ranking.

Key Insight: Deeply customizing Item Response Theory (IRT) from educational measurement for LLM evaluation — learning model ability and item parameters end-to-end via a deep pseudo-siamese network.

Core Idea: PSN-IRT = dual network (model ability + item parameters) \(\times\) 4PL IRT formula \(\rightarrow\) benchmark quality auditing + reliable ranking.

Method

Overall Architecture

Input: binary response matrices of 12 LLMs \(\times\) 11 benchmarks \(\rightarrow\) PSN-IRT dual-branch network \(\rightarrow\) Output: per-model ability \(\theta\) + four parameters per item (difficulty \(b\), discrimination \(a\), guessing \(c\), feasibility upper bound \(d\)) \(\rightarrow\) benchmark quality analysis + model ranking.

Key Designs

  1. PSN-IRT Architecture:

    • Function: End-to-end joint estimation of model ability and item parameters.
    • Mechanism: Two independent MLP branches: one estimates \(\theta\) from model response patterns, the other estimates \((a, b, c, d)\) from item response patterns. Both are jointly optimized via the 4PL IRT formula \(P(\theta) = c + \frac{d-c}{1+e^{-a(\theta-b)}}\) (see the sketch after this list).
    • Design Motivation: Traditional IRT relies on iterative MLE or MCMC inference; PSN-IRT instead trains a neural network end to end, which is more efficient and scales to large item pools.
  2. Four-Parameter IRT Model (4PL):

    • Function: More accurately models LLM response behavior compared to standard IRT.
    • Four parameters: difficulty \(b\) (minimum \(\theta\) required to answer correctly), discrimination \(a\) (effectiveness of the item in differentiating high- vs. low-ability models), guessing \(c\) (probability that a low-ability model answers correctly by chance), and feasibility upper bound \(d\) (probability ceiling even for the strongest models).
    • Design Motivation: LLMs may answer easy items correctly by "guessing" (\(c > 0\)), and some items may be infeasible for all models (\(d < 1\)) — phenomena that standard IRT cannot capture.
  3. Benchmark Quality Diagnostics:

    • Function: Diagnosing systemic benchmark issues via item parameters.
    • Diagnostic dimensions: Saturation (proportion of items with excessively low discrimination \(a\)); difficulty ceiling (whether the maximum \(b\) suffices to differentiate top models); data contamination (anomalously high guessing rate \(c\) may indicate answer leakage into training data).
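
The dual-branch mechanism lends itself to a short sketch. Below is a minimal PyTorch version of the idea; the class name `PSNIRTSketch`, the layer sizes, and the parameter-constraint choices (softplus/sigmoid) are illustrative assumptions, not details taken from the paper or its released code.

```python
import torch
import torch.nn as nn


class PSNIRTSketch(nn.Module):
    """Pseudo-siamese sketch: one MLP branch maps a model's response pattern
    to an ability theta, the other maps an item's response pattern to the
    4PL parameters (a, b, c, d)."""

    def __init__(self, n_items: int, n_models: int, hidden: int = 64):
        super().__init__()
        # Ability branch: input is one model's row of the binary response matrix.
        self.ability_net = nn.Sequential(
            nn.Linear(n_items, hidden), nn.ReLU(), nn.Linear(hidden, 1)
        )
        # Item branch: input is one item's column of the binary response matrix.
        self.item_net = nn.Sequential(
            nn.Linear(n_models, hidden), nn.ReLU(), nn.Linear(hidden, 4)
        )

    def forward(self, model_rows: torch.Tensor, item_cols: torch.Tensor) -> torch.Tensor:
        theta = self.ability_net(model_rows)                # (n_models, 1)
        raw = self.item_net(item_cols)                      # (n_items, 4)
        a = torch.nn.functional.softplus(raw[:, 0])         # discrimination, kept positive
        b = raw[:, 1]                                       # difficulty, unconstrained
        c = torch.sigmoid(raw[:, 2])                        # guessing floor in (0, 1)
        d = torch.sigmoid(raw[:, 3])                        # feasibility ceiling in (0, 1)
        # 4PL link: P(correct) = c + (d - c) / (1 + exp(-a * (theta - b)))
        return c + (d - c) * torch.sigmoid(a * (theta - b))  # (n_models, n_items)
```

Note that the sigmoid parameterization keeps \(c\) and \(d\) in \((0, 1)\) but does not by itself enforce \(c < d\); the actual implementation may constrain the parameters differently.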

Loss & Training

  • Binary cross-entropy loss (response prediction).
  • Evaluated on 12 models (GPT-4, DeepSeek-V3, Qwen-Plus, etc.) \(\times\) 11 benchmarks.
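
Training, per the bullets above, reduces to predicting the binary response matrix with cross-entropy. A hedged continuation of the earlier sketch (it assumes the `PSNIRTSketch` module defined above, and the data here is random, used only to illustrate shapes):

```python
import torch

# The paper fits a real 12 x 41,871 response matrix; this toy tensor is random.
n_models, n_items = 12, 41871
responses = torch.randint(0, 2, (n_models, n_items)).float()

model = PSNIRTSketch(n_items=n_items, n_models=n_models)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = torch.nn.BCELoss()

for step in range(100):
    optimizer.zero_grad()
    # Rows feed the ability branch; the transposed matrix feeds the item branch.
    probs = model(responses, responses.t())
    loss = loss_fn(probs, responses)   # binary cross-entropy on response prediction
    loss.backward()
    optimizer.step()
```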

Key Experimental Results

Main Results

| Metric | PSN-IRT | Deep-IRT (1PL) | Traditional IRT (4PL MLE) |
|---|---|---|---|
| ACC | 0.7998 | 0.7974 | 0.7211 |
| F1 | 0.8538 | 0.8516 | 0.8034 |
| AUC | 0.8485 | 0.8519 | 0.7012 |
| Kendall \(\tau\) | 1.0000 | 0.9697 | 0.9697 |
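
The Kendall \(\tau\) row measures rank agreement between two orderings of the same models. For reference, such agreement can be checked with SciPy; the scores below are made up for illustration:

```python
from scipy.stats import kendalltau

# Hypothetical ability estimates and human-preference scores for the same five models.
irt_theta = [1.8, 1.2, 0.9, 0.3, -0.4]
human_pref = [1250, 1210, 1180, 1100, 1020]

tau, p_value = kendalltau(irt_theta, human_pref)
print(f"Kendall tau = {tau:.2f}")  # 1.00 when the two rankings agree exactly
```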

Benchmark Quality Diagnostics

| Benchmark | Primary Issue | Description |
|---|---|---|
| MMLU | High saturation | Most items fail to differentiate top-tier models |
| HumanEval+ | Insufficient difficulty ceiling | Hardest items are not challenging enough for GPT-4 |
| GSM8K | Suspected contamination | Some items exhibit anomalously high guessing rate \(c\) |
| MATH | Good discrimination | The only benchmark performing well across multiple quality dimensions |
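
As a rough illustration of how such benchmark-level diagnostics could be derived from the fitted item parameters, here is a sketch; the thresholds (discrimination below 0.5, guessing above 0.35) are arbitrary placeholders, not values reported in the paper.

```python
import numpy as np


def diagnose_benchmark(a, b, c, d, low_disc=0.5, high_guess=0.35):
    """Summarize per-item 4PL parameter arrays for one benchmark."""
    a, b, c, d = map(np.asarray, (a, b, c, d))
    return {
        "saturation_rate": float(np.mean(a < low_disc)),            # share of weakly discriminating items
        "difficulty_ceiling": float(np.max(b)),                      # difficulty of the hardest item
        "suspected_contamination": float(np.mean(c > high_guess)),   # share of high-guessing items
        "mean_feasibility": float(np.mean(d)),                       # average performance ceiling
    }


# Toy usage with random parameters for a 1,000-item benchmark.
rng = np.random.default_rng(0)
print(diagnose_benchmark(
    a=rng.gamma(2.0, 0.5, 1000), b=rng.normal(0.0, 1.0, 1000),
    c=rng.beta(2, 8, 1000), d=rng.beta(8, 2, 1000),
))
```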

Key Findings

  • No single benchmark excels across all quality dimensions — every benchmark exhibits systemic weaknesses.
  • PSN-IRT rankings align with human preferences (\(\tau = 1.00\)), outperforming traditional methods (\(\tau = 0.97\)).
  • Item subsets selected by PSN-IRT can substitute for entire benchmarks: a small set of high-quality items suffices for reliable ranking.
  • Model scale is not the sole determinant of capability — IRT-estimated \(\theta\) values sometimes diverge from rankings based on parameter count.

Highlights & Insights

  • "Auditing the benchmark itself" constitutes meta-evaluation — applying psychometric tools to assess the quality of measurement instruments, which is conceptually significant.
  • Using the 4PL guessing parameter \(c\) as a contamination detector is a clever application — if answers appear in training data, models can respond correctly even without genuine understanding.
  • PSN-IRT can serve as a quality-gating tool for any AI benchmark.

Limitations & Future Work

  • Assumes binary responses (correct/incorrect), making it inapplicable to generative evaluation settings.
  • 12 models may be insufficient to yield stable IRT parameter estimates.
  • Item interdependencies are not modeled (standard IRT assumes local independence).
  • vs. Chatbot Arena (LMSYS): Human preference-based ranking. PSN-IRT ranks models via items; the two approaches are complementary.
  • vs. BenchmarkCards: Descriptive diagnostics. PSN-IRT provides quantitative item-level parameters.
  • vs. DynaBench: DynaBench addresses data leakage via dynamic datasets but does not resolve item quality issues; PSN-IRT quantifies each item's discriminative power from a statistical perspective.
  • The application of IRT to AI evaluation is generalizable to domain-specific benchmarks in coding, reasoning, and beyond.
  • Insight: New benchmarks should undergo IRT analysis prior to release to filter out low-discrimination items.

Rating

  • Novelty: ⭐⭐⭐⭐ — A systematic framework for auditing LLM benchmarks with IRT, introducing mature psychometric tools into AI evaluation.
  • Experimental Thoroughness: ⭐⭐⭐⭐ — Covers 11 benchmarks, 12 models, and 41K items; analysis is conducted at sufficient scale.
  • Writing Quality: ⭐⭐⭐⭐ — Theory and empirical findings are well integrated; visualizations clearly illustrate item quality issues.
  • Value: ⭐⭐⭐⭐⭐ — Makes an important foundational contribution to LLM evaluation methodology by exposing the pervasive presence of low-quality items in existing benchmarks.