ONEBench to Test Them All: Sample-Level Benchmarking Over Open-Ended Capabilities¶

Conference: ACL 2025
arXiv: 2412.06745
Code: GitHub
Area: LLM Evaluation
Keywords: Benchmark Evaluation, Model Ranking, Plackett-Luce, Sample-level Evaluation, Personalized Evaluation

TL;DR¶

ONEBench proposes a new benchmarking paradigm: pooling samples from multiple evaluation datasets into a unified pool, and performing sample-level model comparisons using the Plackett-Luce rank aggregation algorithm. This paradigm supports heterogeneous metric aggregation, incomplete data handling, and personalized capability probing.

Background & Motivation¶

Deep learning has entered the "post-dataset era"—the zero-shot capabilities of foundation models continue to expand, making traditional evaluation methods based on fixed test sets increasingly inadequate. Static benchmarks face the following challenges:

Inadequate capability coverage: A single dataset can only test specific capabilities, failing to comprehensively evaluate the open-ended capabilities of models.

Dataset bias: Each dataset possesses its own collection bias, potentially leading to unfair evaluation.

Overfitting risks: Models may be optimized for specific benchmarks, producing inflated scores that overstate their actual capabilities.

Lack of evaluation democratization: Traditional benchmarks are created by specific teams with single standards, preventing different user groups from defining their own evaluation dimensions.

The key challenge is how to build a unified evaluation framework that is dynamic, operates at the sample level, and supports both heterogeneous metrics and incomplete data.

Method¶

Overall Architecture¶

ONEBench consists of four core components:
- Data Pool D: A collection of evaluation samples from multiple benchmarks, where each sample contains inputs, reference answers, and metadata.
- Model Set M: Contains a baseline model and all models to be evaluated.
- Sample-level Ranking S: For each sample, the evaluated models are ranked based on their performance on that sample.
- Capability Tags: Categorized into tasks (e.g., QA, summarization) and concepts (e.g., immunology, geography), supporting structured and semantic retrieval.

Workflow: Users retrieve relevant samples via queries (e.g., "antibody research") → Aggregate the rankings on these retrieved samples → Obtain model rankings tailored to specific capabilities.

Key Designs¶

Sample-Level Rank Conversion: Heterogeneous metrics from different benchmarks (binary correctness, numerical BLEU scores, pairwise preference rankings, etc.) are converted into ordinal rankings. This loss of information is intentional—ordinal comparison is more robust and offers greater external validity than absolute scores. Recht et al. (2019) observed that model rankings remain stable across different test sets, even when absolute accuracy varies significantly.
Plackett-Luce Rank Aggregation: This is the core algorithm of ONEBench. Each model m_k is assumed to have a latent utility parameter γ_k, and the per-sample rankings are assumed to be generated from these utilities under the Plackett-Luce probability model. The utility parameters are recovered via Maximum Likelihood Estimation (MLE), and the global ranking is obtained by sorting models by estimated utility.
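The sample-level rank conversion described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name `scores_to_ranking`, the tier-based handling of ties, and the toy scores are all assumptions made for the example.

```python
from collections import defaultdict

def scores_to_ranking(scores, higher_is_better=True):
    """Convert per-model scores on ONE sample into an ordinal ranking.

    Returns a list of tiers, best tier first. Models with identical
    scores share a tier (a tie), which is common for binary metrics.
    """
    tiers = defaultdict(list)
    for model, score in scores.items():
        tiers[score].append(model)
    ordered_scores = sorted(tiers, reverse=higher_is_better)
    return [sorted(tiers[s]) for s in ordered_scores]

# Hypothetical measurements on two samples with different metric types.
binary = {"model_a": 1, "model_b": 0, "model_c": 1}          # exact-match correctness
bleu = {"model_a": 0.31, "model_b": 0.47, "model_c": 0.12}   # numeric score

print(scores_to_ranking(binary))  # [['model_a', 'model_c'], ['model_b']]
print(scores_to_ranking(bleu))    # [['model_b'], ['model_a'], ['model_c']]
```

Once every sample is expressed this way, the metric's scale no longer matters: a binary benchmark and a BLEU benchmark contribute rankings of the same type.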

Key advantages of the Plackett-Luce model:
- Identifiability: The utility distribution can be uniquely recovered (up to a constant shift) under the condition that the comparison graph is connected.
- Sample-Efficient Convergence: Accurate ranking recovery requires only Ω(|M| log |M|)/k samples.
- Social Choice Properties: It satisfies anonymity, neutrality, and independence of irrelevant alternatives (IIA).
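To make the estimation step concrete, here is a small sketch that fits Plackett-Luce utilities to a handful of partial rankings. It uses a plain minorization-maximization (MM) iteration, which is a standard way to compute the Plackett-Luce MLE but is not claimed to be what ONEBench runs. The function name, the toy rankings, and the iteration count are assumptions, and the sketch assumes every model is ranked above some other model at least once (otherwise its strength collapses to zero).

```python
import math

def fit_plackett_luce(rankings, models, iters=200):
    """Estimate Plackett-Luce log-utilities from rankings (best model first).

    Rankings may cover different subsets of models. The first entry of
    `models` is treated as the baseline and pinned to utility 0.
    """
    strength = {m: 1.0 for m in models}
    # wins[m]: number of times m is picked while other models are still unranked
    wins = {m: 0 for m in models}
    for r in rankings:
        for m in r[:-1]:
            wins[m] += 1
    for _ in range(iters):
        denom = {m: 0.0 for m in models}
        for r in rankings:
            remaining = sum(strength[m] for m in r)
            for j in range(len(r) - 1):
                # Stage j: r[j] is chosen from the still-unranked set r[j:]
                inv = 1.0 / remaining
                for m in r[j:]:
                    denom[m] += inv
                remaining -= strength[r[j]]
        strength = {m: wins[m] / denom[m] for m in models}
        total = sum(strength.values())
        strength = {m: v / total for m, v in strength.items()}
    base = models[0]
    return {m: math.log(strength[m]) - math.log(strength[base]) for m in models}

# Hypothetical per-sample rankings; some samples only cover two models.
rankings = [
    ["a", "b", "c"], ["a", "c", "b"], ["b", "a", "c"],
    ["c", "a"], ["b", "c"], ["a", "b"],
]
utilities = fit_plackett_luce(rankings, ["a", "b", "c"])
global_ranking = sorted(utilities, key=utilities.get, reverse=True)
print(global_ranking)  # ['a', 'b', 'c']
```

Because each ranking only needs to cover the models that were actually measured on that sample, the same routine handles incomplete data without imputation, provided the comparison graph stays connected.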

Capability Probing: Combines two retrieval methods:
- Semantic Search: Uses the embedding space of all-MiniLM-L6-v2 (text) or SigLIP-B16 (vision-language) for kNN retrieval.
- Metadata Search: Filters based on structured metadata (such as task type and domain category).
Lifelong Expansion: The data pool, model set, and ranking data are stored in a relational database, supporting incremental insertion of new samples, new models, and new rankings.
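The two retrieval modes used for capability probing can be combined in a few lines. The sketch below is illustrative only: the three-dimensional vectors stand in for real encoder embeddings (for example those of all-MiniLM-L6-v2), and the `probe` function, the record layout, and the `task` field are assumptions rather than the ONEBench API.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def probe(pool, query_vec, k=2, task=None):
    """Return ids of the k samples nearest to query_vec.

    If `task` is given, the metadata filter is applied before the kNN search.
    """
    candidates = [s for s in pool if task is None or s["task"] == task]
    candidates.sort(key=lambda s: cosine(s["embedding"], query_vec), reverse=True)
    return [s["id"] for s in candidates[:k]]

# Toy data pool; in practice embeddings come from a text or vision-language encoder.
pool = [
    {"id": "s1", "task": "qa", "embedding": [0.9, 0.1, 0.0]},
    {"id": "s2", "task": "qa", "embedding": [0.1, 0.9, 0.0]},
    {"id": "s3", "task": "summarization", "embedding": [0.8, 0.2, 0.1]},
    {"id": "s4", "task": "qa", "embedding": [0.7, 0.3, 0.0]},
]
query = [1.0, 0.0, 0.0]  # stands in for the embedded query text

print(probe(pool, query, k=2))             # ['s1', 's3']
print(probe(pool, query, k=2, task="qa"))  # ['s1', 's4']
```

The rankings attached to the returned sample ids are then the input to rank aggregation, which yields a model ranking specific to the queried capability.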

Loss & Training¶

The Plackett-Luce model estimates parameters by maximizing the log-likelihood:

γ̂ = argmax_γ log p(s|γ)

Since the log-likelihood is strictly concave, the MLE is unique. In practice, rank-breaking techniques are used to accelerate computation. The baseline model's utility is fixed at 0 to remove the constant-shift ambiguity.
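For reference, the standard Plackett-Luce likelihood that this objective refers to can be written out as follows. The notation is an assumption made here for illustration: σ = (σ_1 ≻ … ≻ σ_n) denotes the ranking on a single sample, s denotes the collection of all per-sample rankings, and samples are treated as independent.

```latex
p(\sigma \mid \gamma) = \prod_{i=1}^{n-1}
  \frac{\exp(\gamma_{\sigma_i})}{\sum_{j=i}^{n} \exp(\gamma_{\sigma_j})},
\qquad
\log p(s \mid \gamma) = \sum_{\sigma \in s} \log p(\sigma \mid \gamma),
\qquad
\hat{\gamma} = \arg\max_{\gamma} \log p(s \mid \gamma)
```

Adding the same constant to every γ_k leaves each factor unchanged, which is why one utility (here the baseline's) has to be pinned to a fixed value.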

Key Experimental Results¶

Main Results¶

Kendall τ correlation with the ground-truth ranking for Plackett-Luce and other ranking methods across four mainstream benchmarks:

Dataset	Elo	LMArena(BT)	ONEBench(PL)
HELM	0.35±0.13	0.85±0.00	0.88±0.00
Open LLM Leaderboard	0.21±0.07	0.97±0.00	0.99±0.00
VHELM	0.63±0.02	0.69±0.00	0.80±0.00
LMMs-Eval	0.33±0.11	0.42±0.00	0.64±0.00

Dataset	Borda Count	Dowdall	ONEBench(PL)
HELM	0.81	0.83	0.88
Open LLM Leaderboard	0.95	0.99	0.99
VHELM	0.35	0.21	0.79
LMMs-Eval	0.08	0.18	0.64
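As a reminder of what the numbers in these tables measure, the sketch below computes Kendall τ between two orderings of the same models. It implements the simple tie-free variant (tau-a); which variant the paper uses is not stated here, and the model names are placeholders.

```python
from itertools import combinations

def kendall_tau(rank_a, rank_b):
    """Kendall tau-a between two orderings of the same items (no ties)."""
    pos_a = {m: i for i, m in enumerate(rank_a)}
    pos_b = {m: i for i, m in enumerate(rank_b)}
    concordant = discordant = 0
    for x, y in combinations(rank_a, 2):
        # Positive product: both orderings agree on the relative order of x and y
        agreement = (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y])
        if agreement > 0:
            concordant += 1
        elif agreement < 0:
            discordant += 1
    n_pairs = len(rank_a) * (len(rank_a) - 1) / 2
    return (concordant - discordant) / n_pairs

ground_truth = ["m1", "m2", "m3", "m4"]
aggregated = ["m1", "m3", "m2", "m4"]   # one adjacent pair swapped

tau = kendall_tau(ground_truth, aggregated)
print(round(tau, 3))  # 0.667
```

A value of 1 means the aggregated ranking reproduces the reference ordering exactly, 0 means no association, and -1 means the ordering is fully reversed.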

Ablation Study¶

Configuration	Key Metric	Description
95% data missing	Ranking remains stable	Evaluation cost reduced by 20x
95% model measurements missing	Kendall τ remains high	Applicable to incomplete evaluation scenarios
Top-10 model retention rate	PL method is optimal	Reliably recovers top rankings

Key Findings¶

Plackett-Luce achieves the highest Kendall τ on all four benchmarks, ahead of both Elo and Bradley-Terry. The margin is small on HELM and the Open LLM Leaderboard and largest on the highly heterogeneous benchmarks (VHELM, LMMs-Eval).
Even with 95% of the data missing, rankings remain accurate—implying that evaluation costs can be reduced by up to 20 times.
In capability probing experiments, retrieval accuracy for 50 selected concepts reached Cohen's κ = 0.79 (LLM) / 0.91 (LMM) and CMC@1 = 0.95 / 0.94.
Elo ratings exhibit extremely high variance (depending on matchmaking order) and are unsuitable for large-scale benchmarks.
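The order dependence of Elo noted above is easy to reproduce. The sketch below applies the textbook Elo update to the same four match outcomes in two different orders; the K-factor of 32, the starting rating of 1000, and the matches themselves are illustrative assumptions.

```python
def elo(matches, k=32, start=1000.0):
    """Sequential Elo ratings from a list of (winner, loser) pairs."""
    ratings = {}
    for winner, loser in matches:
        r_w = ratings.setdefault(winner, start)
        r_l = ratings.setdefault(loser, start)
        # Probability the winner was expected to win, given current ratings
        expected_w = 1.0 / (1.0 + 10 ** ((r_l - r_w) / 400))
        delta = k * (1.0 - expected_w)
        ratings[winner] = r_w + delta
        ratings[loser] = r_l - delta
    return ratings

matches = [("a", "b"), ("b", "c"), ("c", "a"), ("a", "b")]
forward = elo(matches)
backward = elo(list(reversed(matches)))

# Same outcomes, different processing order, different ratings.
for model in sorted(forward):
    print(model, round(forward[model], 2), round(backward[model], 2))
```

A likelihood-based estimator such as Plackett-Luce or Bradley-Terry treats the outcomes as an unordered set, so its estimate does not change when the comparisons are shuffled.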

Highlights & Insights¶

Paradigm Shift: The transition from "one benchmark, one score" to "sample-level dynamic evaluation" represents an important methodological advancement.
Theoretical Rigor: Unlike many empirically oriented evaluation works, ONEBench is grounded in social choice theory and random utility models, which provide theoretical guarantees for identifiability and convergence.
High Practicality: It supports personalized queries, lifelong expansion, and handling of incomplete data, aligning well with practical evaluation needs.
Deep Insight into Ranking vs. Score: The trade-off analysis from information loss to external validity is compelling—rankings are more robust and more transferable than scores.

Limitations & Future Work¶

Limitations of the Plackett-Luce Assumption: The model assumes independence among samples and a single constant utility per model, so it cannot capture capability profiles that differ across domains (e.g., "a model is good at math but poor at language").
Violation of Separability and Pairwise Majority Consistency: The paper itself acknowledges that Plackett-Luce violates these two social choice properties.
Semantic Retrieval Quality: Concept-level retrieval relies on the quality of embedding models, which may introduce false positives.
Circularity of Ground Truth: Using the mean of raw leaderboard scores as Ground Truth to evaluate the aggregation algorithm involves a certain degree of circular reasoning.
Lack of Discussion on Evaluation Gaming: Dynamic evaluations can also be gamed by model developers.

Related Work¶

Chatbot Arena (Chiang et al., 2024): Uses the Bradley-Terry model to aggregate human preference pairs. ONEBench generalizes this to automatic evaluation scenarios.
Plackett-Luce Model (Maystre and Grossglauser, 2015): An efficient rank aggregation algorithm, which this paper systematically applies to LLM/LMM evaluation for the first time.
Recht et al. (2019): Found that model rankings are more robust across datasets than absolute scores, providing empirical justification for ONEBench's use of ordinal comparisons.
Zhang and Hardt (2024): Analyzed rank aggregation from the perspective of classical voting theory, proposing trade-offs between different notions of fairness.

Rating¶

Novelty: ⭐⭐⭐⭐ The evaluation paradigm is innovative, but the underlying methods stem from existing social choice theory.
Experimental Thoroughness: ⭐⭐⭐⭐⭐ Covers both LLM and LMM domains with comprehensive ablation studies and rich baseline methods.
Writing Quality: ⭐⭐⭐⭐ Clearly structured, but some parts are verbose and the mathematical notation is sometimes excessive.
Value: ⭐⭐⭐⭐⭐ Directly instructive for AI evaluation practices, and the open-source framework is ready to use.