Skip to content

Resolution Diagnostics for Paired LLM Evaluation

Conference: ICML 2026
arXiv: 2605.30315
Code: https://github.com/akotawala10/llm-power
Area: LLM Evaluation / Hypothesis Testing / Leaderboard Statistics
Keywords: Paired testing, McNemar, Resolution Ratio, Minimum Detectable Effect, Leaderboard Multiple Comparisons

TL;DR

This paper treats the ranking difference of "A vs B" on LLM leaderboards as a paired hypothesis testing problem. By inverting level-\(\alpha\) / power-\((1-\beta)\) tests, it defines the "Resolution Ratio" \(q=N/N^\star\). It proves that the common calculator shortcut of multiplying the single-arm Cohen-\(h\) formula by \((1-\rho)\) systematically underestimates the required sample size by half under small effects. Empirical findings show that 11/40 pairs on Open LLM Leaderboard v1 and 4/9 adjacent pairs in the MMLU-Pro top-10 are entirely "unresolvable" at \((\alpha,1-\beta)=(0.05,0.8)\), a number that increases to 6/9 after accounting for multiple comparisons, subject clustering, and anytime-validity constraints.

Background & Motivation

Background: Modern LLM leaderboards translate tiny differences, such as "Model A scores 78.3% and Model B scores 77.5%" (a 0.8 pp gap), directly into headlines and product decisions. The relevant statistical tools for this are classic: binary accuracy uses the McNemar paired test with the Connor (1987) sample size formula, while continuous scoring uses paired \(t\)-tests or paired bootstrap.

Limitations of Prior Work: The closed-form required-\(N\) formula provided by Miller (2024) is based on unpaired Gaussian assumptions. NLP/LLM works often either adopt this directly or use unpaired sample sizes (from Cohen 1988, G*Power, or R's pwr package) and multiply by \((1-\rho)\) as a paired correction. Consequently, many adjacent rankings claimed as "significant" fail to reach the target 0.8 power resolution under a true paired design.

Key Challenge: The true variance of a paired design is \(\sigma_D^2 = p_A q_A + p_B q_B - 2\rho\sqrt{p_A q_A p_B q_B}\), which is the variance of the difference \(X^A-X^B\) involving both arms. The "universal paired adjustment" of \((1-\rho)\) only scales the single-arm variance \(p(1-p)\). Multiplying the single-arm result by \((1-\rho)\) misses a factor of 2 introduced by the summation \(\text{Var}(X^A)+\text{Var}(X^B)\).

Goal: (1) Provide a standardized "Resolution Report" protocol for leaderboard evaluation; (2) Strictly characterize the bias of the aforementioned shortcut; (3) Audit real public leaderboards to identify unsupported rankings.

Key Insight: Treat "benchmark size + observed gap" as a hypothesis testing design parameter. Invert the classic Wald/McNemar-Connor formulas to derive three metrics: MDE (Minimum Detectable Effect at current \(N\)), \(N^\star\) (required paired sample size for a target effect), and Resolution Ratio \(q=N/N^\star\). A conclusion is supported only if \(q\ge 1\).

Core Idea: Use \(q=N/N^\star\) as a one-sentence diagnostic for every pair comparison on a leaderboard. Provide a sharp expansion lemma for small effects and a pip package llm-power to transform "paired sample size estimation" from empirical jargon into a verifiable standard.

Method

Overall Architecture

The framework accepts a per-question score matrix \(\{(X_i^A, X_i^B)\}_{i=1}^N\) of two models on any paired benchmark. It outputs: (i) the minimum detectable effect \(\delta_{\mathrm{MDE}}\) for the current \(N\); (ii) the minimum paired sample size \(N^\star\) for a target effect \(\delta\); (iii) the ratio \(q=N/N^\star(\hat\delta)\) for the observed gap \(\hat\delta\). All quantities are derived by inverting two-sided level-\(\alpha\), power-\((1-\beta)\) paired Wald tests (McNemar-Connor for binary, paired bootstrap for scores). The pipeline also incorporates Bonferroni/Holm/BH corrections, design effect clustering adjustments, and anytime-valid e-processes as stress tests to provide an end-to-end verdict table for leaderboards.

Key Designs

  1. Resolution Ratio \(q=N/N^\star\) as a Unified Diagnostic:

    • Function: Compresses the reliability of a ranking into a dimensionless number. \(q\ge 1\) indicates the benchmark is large enough to achieve \((\alpha,1-\beta)\) resolution, while \(q<1\) indicates it is not.
    • Mechanism: Derived by inverting the Wald formula \(N^\star(\delta;\alpha,\beta) = ((z_{1-\alpha/2}+z_{1-\beta})\sigma_D/|\delta|)^2\). For a single pair, \(q\) corresponds directly to the squared Wald statistic (where \(q\ge 1 \Leftrightarrow |T_N| \ge z_{1-\alpha/2}+z_{1-\beta}\approx 2.80\)). The value-add lies in the aggregation layer—multiple comparisons, clustering, and anytime-validity are more naturally combined on the \(q\) scale than the \(p\)-value scale.
    • Design Motivation: \(q\) explicitly states "the benchmark lacks statistical power to resolve this gap," avoiding the "power at observed effect" trap criticized by Hoenig & Heisey (2001). It provides maintainers with a one-liner for each row.
  2. Small Effect Expansion Lemma (Lemma 1) — Quantifying Shortcut Bias:

    • Function: Provides an explicit second-order constant \(C(p,\rho)\) for the ratio \(n_h/N^\star - 1/2\) near \(p\), quantifying when the "single-arm \(\times (1-\rho)\)" shortcut is too inaccurate to use.
    • Mechanism: Taylor expands \(h^2\) (Cohen's \(h\) arcsine difference) and \(\sigma_D^2\) at the midpoint \(p_A=p+\delta/2, p_B=p-\delta/2\). Linear terms cancel due to symmetry, and combining \(O(\delta^2)\) corrections yields \(n_h/N^\star = 1/2 - (\delta^2/2)[(1+\rho)(1-2p)^2/(16(1-\rho)u^2) - 1/(6u)] + O(\delta^4)\), where \(u=p(1-p)\). At \(p=1/2\), the \((1-2p)^2\) term vanishes, leaving \(C(1/2,\rho)=1/3\) independent of \(\rho\). Corollary 1 provides a threshold \(\delta^\star(p,\rho,\epsilon)=\sqrt{\epsilon/C(p,\rho)}\) for valid use.
    • Design Motivation: Industrial software (Cohen 1988, G*Power 3.1, R pwr) often returns single-arm \(K/h^2\) sample sizes. This lemma formalizes the "doubling" factor (\(N^\star \approx 2 \times n_{\text{shortcut}}\)) as an \(O(\delta^2)\) bias. For close comparisons (\(|\hat\delta| \le 5\) pp), the shortcut underestimates \(N^\star\) by half, potentially flipping a verdict from "unresolvable" to "significant."
  3. Triple Stress Tests: Multiple Comparison / Clustering / Anytime-Valid:

    • Function: Simultaneously reports "unresolved pairs" under fixed-\(n\), Bonferroni/Holm, clustering, and anytime-valid paradigms to see how verdicts tighten.
    • Mechanism: Multiple comparisons scale \(N^\star\) by \(((z_{1-\alpha'/2}+z_{1-\beta})/(z_{1-\alpha/2}+z_{1-\beta}))^2\); clustering scales IID \(N^\star\) by the design effect \(\mathrm{DE}=1+(\bar m-1)\mathrm{ICC}(D)\); anytime-validity replaces \(z_{1-\alpha/2}\) with time-uniform boundaries \(u(n)\) (Howard et al., 2021).
    • Design Motivation: Fixed-\(n\) is a necessary but insufficient condition. Real leaderboards face family-wise error rates, subject clustering, and sequential updates. Presenting these side-by-side demonstrates the true statistical vulnerability of LLM rankings.

Loss & Training

No training is involved. Values are derived from lm-evaluation-harness dumps and OLL detail repositories. Evaluation is performed on OLL v1 (40 pairs) and MMLU-Pro top-10 (\(N=12{,}032\), 9 adjacent pairs). Calibration experiments were conducted on synthetic paired Bernoulli data (\(p\in\{0.5,0.7,0.9\}\times \rho_z\in\{0,0.4,0.8\}\)) with \(M=1500\) trials/cell.

Key Experimental Results

Main Results

Unresolved proportion on OLL v1 (40 pairs) binned by \(|\delta|\):

\(\|\delta\|\) Bin Pairs Unresolved Median \(r=N^\star/N\) Worst
\(\le 1\%\) 3 3 (100%) 94 1,892
1–2% 4 4 (100%) 4.2 6.8
2–5% 10 4 (40%) 0.75 2.8
5–15% 17 0 (0%) 0.15 0.65
>15% 6 0 (0%) 0.03 0.07
all 40 11 (28%) 0.16 1,892

The resolution boundary falls near \(|\delta|\approx 5\%\). Paired McNemar shows a median efficiency gain of 2.15× over the unpaired formula (Miller, 2024), matching the \(1/(1-\rho)\) prediction. Prospective validation on 3 pairs with \(|\delta|\in[6.3, 10.1]\) pp shows empirical power of 0.796–0.827, hitting the 0.80 target within \(\pm 2.7\) pp.

Ablation Study

Unresolved count for 9 adjacent pairs on MMLU-Pro top-10 under 4 paradigms:

Paradigm OLL v1 (40 pairs) MMLU-Pro (9 pairs) Description
Fixed-\(n\) 11/40 4/9 Baseline verdict, IID assumption
Bonferroni / Holm 14/40 4/9 Family-wise error, \(N^\star\) inflation ~2.11×
Anytime-valid 14/40 5/9 Uniform e-process, threshold inflation ~2.15×
Clustering n/a 6/9 DE calculated via 14 MMLU-Pro subjects

In the clustering column, the rank 4 vs 5 comparison jumps from IID \(N^\star=432\) to cluster \(N^\star=13{,}621\) (exceeding \(N\)), as model differences associate with specific subjects (\(\mathrm{ICC}(D)=0.036\), \(\mathrm{DE}=31.5\)).

Key Findings

  • Lemma 1 accuracy: On OLL v1, the \(n_h/N^\star\) median is 0.5002 (IQR [0.4999, 0.5035]). For close comparisons (\(|\hat\delta|\le 2\%\)), it hits 0.5000 precisely. The shortcut consistently underestimates \(N^\star\) by half.
  • HellaSwag boundary: gemma-7B vs Llama-3-8B shows \(\hat\delta=+0.46\) pp (\(N=10{,}042\)). The asymptotic \(\chi^2_1\) yields \(p=0.049\) (significant), but the exact conditional binomial yields \(p=0.054\) (not significant); \(q\approx 1/2\). This highlights that "significant" is not the same as "resolvable."
  • Clustering robustness: The clustering verdict is robust to different granularities. Leave-one-subject-out (LOSO) maintains 6/9 unresolved in 11/14 cases.

Highlights & Insights

  • Formalizing shortcut bias: What was previously "common knowledge" is written as a strict Lemma. The explicit \(C(p,\rho)\) and the \(p=1/2\) independence are elegant additions to statistical reporting.
  • Engineering value of \(q=N/N^\star\): \(q\) provides a superior abstraction for aggregation compared to \(p\)-values. Family-wise, design effect, and anytime-valid adjustments all translate directly into \(N^\star\) multipliers, making the audit more intuitive.
  • Synergy of stress tests: The side-by-side display of stress test results provides a powerful "reporting protocol" for reproducibility and benchmark design, showing how easily tiny gaps vanish under realistic statistical assumptions.

Limitations & Future Work

  • Authors acknowledge the empirical focus on binary accuracy; continuous scoring and pairwise preferences (Chatbot Arena style) were treated via paired bootstrap but lack the same depth of empirical audit.
  • MMLU-Pro clustering is limited to \(K=14\) subjects; finer granularity (task templates) might further inflate the design effect.
  • The work separates "resolution" from "construct validity"; thus, \(q\ge 1\) is a necessary but not sufficient condition for a meaningful benchmark comparison.
  • Future work: Generalize Lemma 1 to multi-arm power and integrate Bradley–Terry e-processes for Elo-style leaderboards.
  • vs Miller (2024): Miller uses unpaired Gaussian formulas; this work quantifies the paired efficiency gain at 2.15× and provides a per-pair reporting protocol.
  • vs Card et al. (2020) / Dror et al. (2018): While previous work identified underpowered NLP tests, this paper provides the first systematic audit of paired \(N^\star\) across multiple modern leaderboards.
  • vs Madaan et al. (2024) / Jo & Wilson (2025): These works estimate benchmark variance; this paper uses those estimates to derive testing power, showing calibration and design effect corrections are complementary.

Rating

  • Novelty: ⭐⭐⭐⭐ Lemma 1 provides a new explicit constant; the \(q\) framework is a systematic integration of known tools.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ Broad coverage of public leaderboards, multiple statistical calculators, and robust calibration.
  • Writing Quality: ⭐⭐⭐⭐ Transparent explanation of subtle statistical points; excellent comparative tables.
  • Value: ⭐⭐⭐⭐⭐ Provides a practical pip package and protocol for the increasingly crowded LLM evaluation field.