
How Reliable is Language Model Micro-Benchmarking?¶

Conference: ICLR 2026
arXiv: 2510.08730
Code: dill-lab/micro-benchmarking-reliability
Area: LLM Evaluation
Keywords: micro-benchmarking, evaluation reliability, MDAD, pairwise ranking, random sampling, MMLU-Pro, BIG-bench Hard

TL;DR¶

This paper proposes Minimum Detectable Ability Difference (MDAD) as a meta-evaluation metric, systematically demonstrating that micro-benchmarks at extremely small scales cannot reliably distinguish model pairs with small performance gaps, and that random sampling becomes competitive with carefully designed micro-benchmark methods once the sample size reaches ~250.

Background & Motivation¶

Efficiency demands: Evaluating on full benchmarks (e.g., MMLU-Pro with roughly 12K examples, BBH with roughly 5.8K examples) is costly, motivating micro-benchmarking approaches that aim to predict full-benchmark model rankings using as few as 10–100 samples.

Existing methods: Anchor Points selects cluster centers based on source model confidence correlations; tinyBenchmarks uses Item Response Theory (IRT) embedding space clustering; additional methods include stratified sampling by confidence and diversity-based sampling.

Limitations of prior meta-evaluation: Previous work assessed micro-benchmark quality solely via (i) per-model mean estimation error and (ii) global Kendall's \(\tau\) rank correlation. Neither addresses the question: "When two models differ by only 2–3 accuracy points on the full benchmark, can the micro-benchmark still rank them correctly?"

Core insight: A high Kendall's \(\tau\) does not imply reliable pairwise comparisons across the board — it may merely reflect the fact that model pairs with large performance gaps are easy to distinguish, masking systematic failures on closely matched pairs.

Practical pain point: When comparing models of similar capability (e.g., a set of 8B instruction-tuned models), performance differences tend to be small, making micro-benchmark reliability a critical concern.

Neglected baseline: Prior work has not adequately examined under what conditions micro-benchmark methods genuinely outperform simple uniform random sampling.

Method¶

Overall Architecture: Agreement and MDAD¶

The central idea is to evaluate micro-benchmark reliability from a pairwise model ranking perspective.

Agreement function: Given that model \(M_1\) outperforms \(M_2\) on the full benchmark \(D_{\text{full}}\), the probability that micro-benchmark \(D_{\text{micro}}\) agrees with this ordering is:

\[\text{agreement}(D_{\text{micro}}, D_{\text{full}}, B) = \Pr_{M_1, M_2 \in \mathcal{T}}\left(\Delta_{D_{\text{micro}}}(M_1, M_2) > 0 \mid \Delta_{D_{\text{full}}}(M_1, M_2) \in B\right)\]

where \(\Delta_D(M_1, M_2) = \text{perf}_D(M_1) - \text{perf}_D(M_2)\) and \(B\) is a binned interval of performance differences.
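
The agreement function can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name, the dictionary inputs, and the choice to count micro-benchmark ties as disagreement (following the strict `> 0` in the formula) are assumptions made here.

```python
from itertools import combinations

def agreement(full_acc, micro_acc, lo, hi):
    """Fraction of model pairs whose full-benchmark gap lies in [lo, hi)
    and whose ordering is preserved on the micro-benchmark.

    full_acc, micro_acc: dicts mapping model name -> accuracy in points.
    Returns None when no pair falls into the bin.
    """
    agree = total = 0
    for a, b in combinations(full_acc, 2):
        # Orient the pair so that m1 is the better model on the full benchmark.
        m1, m2 = (a, b) if full_acc[a] >= full_acc[b] else (b, a)
        gap = full_acc[m1] - full_acc[m2]
        if lo <= gap < hi:
            total += 1
            # Assumption: a tie on the micro-benchmark counts as disagreement.
            agree += micro_acc[m1] - micro_acc[m2] > 0
    return agree / total if total else None

# Hypothetical accuracies for three target models.
full = {"A": 70.0, "B": 68.0, "C": 60.0}
micro = {"A": 60.0, "B": 70.0, "C": 50.0}
print(agreement(full, micro, 1.75, 2.25))  # 0.0: the close pair A/B is flipped
print(agreement(full, micro, 5.0, 15.0))   # 1.0: both large-gap pairs are preserved
```

The toy data already shows the pattern the paper is concerned with: the large-gap pairs are ranked correctly while the 2-point pair is not.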

MDAD (Minimum Detectable Ability Difference): The smallest performance difference a micro-benchmark can reliably distinguish, subject to an agreement threshold of ≥ 0.8:

\[\text{MDAD}(D_{\text{micro}}, D_{\text{full}}) = \min_{B \in \mathcal{B}} \text{centroid}(B) \quad \text{s.t.} \quad \text{agreement}(D_{\text{micro}}, D_{\text{full}}, B) \geq 0.8\]

Lower MDAD is better: an MDAD of 2 means the micro-benchmark can reliably distinguish model pairs whose full-benchmark performance differs by ≥ 2 accuracy points.
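
Read literally, MDAD is the smallest bin centroid whose agreement reaches the threshold. The sketch below assumes the per-bin agreement values have already been computed and stored in a dictionary keyed by centroid; the function name and input layout are hypothetical, and the code does not check that agreement stays above the threshold for all larger bins.

```python
def mdad(agreement_by_bin, threshold=0.8):
    """Smallest bin centroid (in accuracy points) whose agreement meets the threshold.

    agreement_by_bin: dict mapping bin centroid -> agreement, or None for empty bins.
    Returns None if no bin reaches the threshold.
    """
    passing = [
        centroid
        for centroid, value in agreement_by_bin.items()
        if value is not None and value >= threshold
    ]
    return min(passing) if passing else None

# Hypothetical agreement curve for one micro-benchmark.
curve = {0.5: 0.55, 1.0: 0.70, 1.5: 0.82, 2.0: 0.90, 2.5: None}
print(mdad(curve))  # 1.5
```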

Key Designs¶

Binning strategy: Accuracy differences are binned at 0.5-point resolution, i.e., \(\mathcal{B} = \{[0, 0.25), [0.25, 0.75), [0.75, 1.25), \ldots\}\)
Data split: Each benchmark is split in half — a train half used to select the micro-benchmark, and a held-out half used for generalization testing
Model split: 470 models are randomly partitioned into source models (used to construct micro-benchmarks) and target models (used for evaluation)
50-trial averaging: Randomness from data and model splits is mitigated by averaging over 50 independent trials
Multiple micro-benchmark sizes: \(k \in \{10, 25, 50, 100, 250, 500, 1000\}\)
Ablation over source model count: \(\{10, 50, 100, 150, 200, 250, 300\}\)
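
The binning and splitting steps above can be sketched as follows. This is an illustration under stated assumptions: `bin_label` and `partition` are hypothetical helper names, the first bin \([0, 0.25)\) is given the nominal label 0.0, and the source/target split size is passed in explicitly because the notes above only give the ablated source counts.

```python
import math
import random

def bin_label(gap, width=0.5):
    """Map a non-negative accuracy gap to the label of its bin.

    With width 0.5, gaps in [0.25, 0.75) map to 0.5, [0.75, 1.25) to 1.0, etc.
    Assumption: gaps in the first bin [0, 0.25) are labeled 0.0.
    """
    return math.floor(gap / width + 0.5) * width

def partition(items, n_first, rng):
    """Randomly split items into a group of n_first elements and the remainder."""
    shuffled = list(items)
    rng.shuffle(shuffled)
    return shuffled[:n_first], shuffled[n_first:]

rng = random.Random(0)                       # one seed per trial
example_ids = range(1000)                    # placeholder example ids
train_half, held_out_half = partition(example_ids, 500, rng)
model_ids = range(470)
source_models, target_models = partition(model_ids, 100, rng)  # 100 is one ablated count
```

Repeating the last four lines with 50 different seeds and averaging the resulting MDAD values would mirror the 50-trial averaging described above.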

Compared Methods¶

Method	Strategy	Model-Dependent
Anchor Points	\(k\)-medoids cluster centers based on source model confidence correlations	Yes
tinyBenchmarks (IRT)	\(k\)-means cluster centers in IRT embedding space	Yes
Stratified (Confidence)	Stratified random sampling by model confidence	Yes
Diversity	Uniform spread sampling in source model correlation space	Yes
Uniform Random	Uniform random sampling	No
Subtask-Stratified Random	Equal random sampling per subtask	No
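
The two model-independent baselines are simple enough to sketch directly. The code below is an illustration, not the authors' code: function names are hypothetical, and the way leftover examples are distributed when \(k\) is not divisible by the number of subtasks is an assumption.

```python
import random
from collections import defaultdict

def uniform_random(example_ids, k, rng):
    """Draw k examples uniformly at random without replacement."""
    return rng.sample(list(example_ids), k)

def subtask_stratified(subtask_of, k, rng):
    """Draw roughly k / num_subtasks examples from each subtask.

    subtask_of: dict mapping example id -> subtask name.
    Assumption: any remainder goes to the first subtasks in sorted order.
    """
    by_task = defaultdict(list)
    for example, task in subtask_of.items():
        by_task[task].append(example)
    tasks = sorted(by_task)
    base, extra = divmod(k, len(tasks))
    chosen = []
    for i, task in enumerate(tasks):
        quota = base + (1 if i < extra else 0)
        # Never request more examples than the subtask contains.
        chosen.extend(rng.sample(by_task[task], min(quota, len(by_task[task]))))
    return chosen

# Hypothetical benchmark with three subtasks of ten examples each.
subtask_of = {f"{task}{i}": task for task in "abc" for i in range(10)}
picked = subtask_stratified(subtask_of, 7, random.Random(0))
```

Neither function looks at any model output, which is why these baselines need no source models at all.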

Key Experimental Results¶

Main Results: MDAD Across Methods and Benchmarks¶

Table 1: MMLU-Pro (12,032 examples) — MDAD (lower is better)

Method	10 ex.	25 ex.	50 ex.	100 ex.	250 ex.	500 ex.	1000 ex.
Anchor Points	3.5	2.5	2.0	2.0	1.5	1.5	1.5
tinyBenchmarks	7.0	4.0	3.0	2.0	1.0	1.0	1.0
Stratified (Conf.)	9.0	5.0	3.5	2.5	1.5	1.0	1.0
Diversity	8.0	4.5	3.0	2.0	1.5	1.0	1.0
Uniform Random	10.0	6.0	4.0	3.0	2.0	1.0	1.0
Subtask-Stratified	9.5	5.5	3.5	2.5	1.5	1.0	1.0

Table 2: BBH (5,761 examples) — MDAD

Method	10 ex.	25 ex.	50 ex.	100 ex.	250 ex.	500 ex.	1000 ex.
Anchor Points	6	4	3	2	2	2	2
tinyBenchmarks	16	8	5	4	2	2	1
Stratified (Conf.)	15	8	5	3	2	2	1
Diversity	14	7	4	3	2	1	1
Uniform Random	16	9	6	4	2	2	1
Subtask-Stratified	15	8	5	3	2	1	1

Ablation: MDAD and Pairwise Comparison Reliability for 8B Instruction-Tuned Models¶

Micro-benchmark Size	MDAD	Fraction of Unreliable Pairs (gap ≤ MDAD)
10 examples	≥ 5	> 51%
25 examples	≥ 5	51%
100 examples	~3	~35%
1000 examples	~2	21%
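
The right-hand column can be computed from full-benchmark accuracies alone once an MDAD value is known. A minimal sketch, assuming accuracies are given in points and that "unreliable" means a pairwise gap of at most the MDAD, as in the column header; the function name and the example numbers are hypothetical.

```python
from itertools import combinations

def unreliable_fraction(full_acc, mdad_points):
    """Fraction of model pairs whose full-benchmark gap is at most mdad_points.

    full_acc: dict mapping model name -> full-benchmark accuracy in points.
    """
    gaps = [abs(a - b) for a, b in combinations(full_acc.values(), 2)]
    return sum(gap <= mdad_points for gap in gaps) / len(gaps)

# Hypothetical accuracies for four similar models.
full_acc = {"m1": 60.0, "m2": 62.0, "m3": 70.0, "m4": 71.0}
fraction = unreliable_fraction(full_acc, 3.0)  # 2 of the 6 pairs have a gap <= 3
```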

Key Findings¶

Reliability bounds of very small micro-benchmarks: With only 10 samples selected, no method can reliably distinguish model pairs whose gap is < 3.5 points on MMLU-Pro, < 6 points on BBH, or < 6.5 points on GPQA.
Anchor Points leads at small scales but stagnates at large scales: Anchor Points achieves the lowest MDAD at 10–50 examples, but at 1000 examples its MDAD is the highest among all methods due to severe \(k\)-medoids clustering imbalance (47% singleton clusters).
Random sampling is competitive at ≥ 250 examples: Across all benchmarks, uniform random sampling achieves MDAD values on par with carefully designed methods when 250 or more examples are selected.
MDAD correlates with but is more informative than Kendall's \(\tau\): Across settings, MDAD and rank correlation are strongly negatively correlated with each other (Kendall's \(\tau\) of −0.787 between the two metrics), yet identical rank correlation values can correspond to different MDAD values and vice versa.
Micro-benchmarks generalize to new data: Micro-benchmarks selected at the overall benchmark level show virtually no change in MDAD on held-out data; per-subtask selection exhibits slightly reduced generalization.
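
A toy example illustrates how a high rank correlation can coexist with a failure on a closely matched pair. The numbers are invented for illustration, and the pure-Python Kendall's \(\tau\) below (the tau-a variant, with no tie correction) is a simplification rather than the exact statistic used in the paper.

```python
from itertools import combinations

def kendall_tau_a(x, y):
    """Kendall's tau-a: (concordant - discordant) / number of pairs."""
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        sign = (x[i] - x[j]) * (y[i] - y[j])
        concordant += sign > 0
        discordant += sign < 0
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs

# Hypothetical full-benchmark and micro-benchmark accuracies for five models.
full = [80.0, 70.0, 60.0, 50.5, 50.0]
micro = [82.0, 68.0, 61.0, 49.0, 52.0]

tau = kendall_tau_a(full, micro)
# Full-benchmark gaps of the pairs whose order the micro-benchmark flips.
flipped = [
    full[i] - full[j]
    for i, j in combinations(range(len(full)), 2)
    if (full[i] - full[j]) * (micro[i] - micro[j]) < 0
]
print(tau)      # 0.8
print(flipped)  # [0.5]
```

Here \(\tau = 0.8\) looks healthy, yet the only pair separated by less than one point is ranked incorrectly, which is exactly the kind of failure a per-bin agreement analysis surfaces.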

Highlights & Insights¶

Practical utility of MDAD: The metric translates micro-benchmark reliability from the vague claim of "rank correlation = 0.74" into the actionable statement "can distinguish model pairs differing by ≥ X accuracy points" — enabling practitioners to select an appropriate micro-benchmark size based on their specific needs (coarse screening vs. fine-grained ranking).
Exposing the illusion of high rank correlation: A Kendall's \(\tau\) of 0.74 may appear satisfactory, yet it can arise primarily from correctly ordering model pairs with large performance gaps, obscuring systematic failures on fine-grained, closely matched comparisons.
An Occam's razor conclusion: When the evaluation budget permits 250 or more samples, complex micro-benchmark construction methods offer no meaningful advantage over simple random sampling — eliminating the overhead of training IRT models or computing source model confidences.
MDAD explains the stability of top-model rankings: Top-performing models differ from most others by margins exceeding the MDAD, so even small micro-benchmarks can correctly identify them; mid-tier models, however, are clustered within the MDAD range and thus exhibit unstable rankings.
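
As a rough intuition for why sample size dominates (a back-of-the-envelope calculation, not taken from the paper), treat each of the \(k\) uniformly sampled examples as an independent Bernoulli trial with success probability \(p\), ignoring the finite size of the benchmark. The standard error of the estimated accuracy is then:

\[\text{SE}(\hat{p}) = \sqrt{\frac{p(1-p)}{k}}, \qquad p = 0.5:\; k = 10 \Rightarrow 15.8 \text{ points},\quad k = 250 \Rightarrow 3.2 \text{ points},\quad k = 1000 \Rightarrow 1.6 \text{ points}\]

This is the noise on a single model's score. The noise on the difference between two models scored on the same examples depends on how correlated their errors are, so these figures indicate only an order of magnitude and are not MDAD values themselves.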

Limitations & Future Work¶

Restricted to classification/accuracy tasks: Experiments cover only multiple-choice accuracy and do not address open-ended generation or preference-based evaluation scenarios (the authors note extensibility in the Discussion but provide no empirical validation).
The 0.8 agreement threshold is arbitrary: While the appendix shows that conclusions remain qualitatively consistent across different thresholds, the optimal threshold may vary by application context.
MDAD is not used to guide data selection: Currently MDAD serves only as a post-hoc evaluation tool; optimizing MDAD during the micro-benchmark construction process remains unexplored.
Source model selection effects are underanalyzed: Although the number of source models is ablated, the influence of their diversity and representativeness on the results warrants further investigation.
Temporal validity is not addressed: As new models continuously emerge, micro-benchmarks constructed from a fixed set of source models may progressively lose validity.

Related Work & Insights¶

Anchor Points (Vivek et al., 2024): One of the primary baselines; performs best at very small scales but is degraded at larger scales by clustering imbalance.
tinyBenchmarks (Polo et al., 2024): An IRT-based method whose performance alternates with Anchor Points at moderate scales.
Card et al. (2020): A pioneering work on statistical power analysis in NLP; MDAD directly draws on the concept of "minimum detectable effect size."
Perlitz et al. (2024): The Flash-HELM efficient evaluation framework; this paper uses its observation that top-model rankings are stable and provides a theoretical explanation via MDAD.
Broader inspiration: The MDAD framework can be extended to other evaluation settings — such as reliability analysis of Elo ratings in Chatbot Arena or performance comparisons between checkpoints during training.

Rating¶

Novelty: ⭐⭐⭐⭐ — MDAD is a natural transfer of statistical power analysis to this domain rather than an entirely new framework, but represents the first systematic proposal and validation within micro-benchmarking
Experimental Thoroughness: ⭐⭐⭐⭐⭐ — 4 benchmarks, 6 methods, 7 scale settings, 7 source model count conditions, 50-trial averaging; comprehensive coverage with detailed appendices
Writing Quality: ⭐⭐⭐⭐⭐ — The overview figure in Figure 1 is elegantly designed; the visualization tracing agreement curves to MDAD is exceptionally clear; the overall narrative logic is rigorous
Value: ⭐⭐⭐⭐ — Provides highly actionable practical guidance (use random sampling for ≥ 250 examples), though the primarily negative nature of the conclusions offers greater inspiration to method developers than direct utility to general practitioners