tags: - ACL 2025 - LLM (Other) - LLM-as-a-Judge - Judge-Bench date: 2026-05-08 content_hash: 3670ded2ec82d258
LLMs instead of Human Judges? A Large Scale Empirical Study across 20 NLP Evaluation Tasks
Conference: ACL 2025
arXiv: 2406.18403
Code: github.com/dmg-illc/JUDGE-BENCH
Area: LLM/NLP
Keywords: LLM-as-a-Judge, human evaluation, evaluation benchmark, NLP evaluation, Judge-Bench
TL;DR
Establishes Judge-Bench, a collection of 20 datasets (70k+ instances), to systematically evaluate 11 LLM judges against human annotations. Results vary widely across tasks, evaluated attributes, and annotator expertise, indicating that task-specific validation against human judgments remains necessary before deploying LLM judges.
Background & Motivation
Rise of LLM-as-a-Judge: Using LLMs to evaluate NLP models is increasingly popular because of its low cost, yet its reliability has not been verified extensively or systematically.
Conflicting Findings: Prior publications present contradictory claims on whether LLM evaluations align with human grades, often suffering from constrained dataset and model selection.
Reproducibility Threats: Closed-source LLMs function as black-boxes that undergo silent updates, posing a severe threat to scientific reproducibility.
Diverse Biases: LLM judges exhibit biases that differ from those of human evaluators, such as a preference for their own outputs or overly sensitive safety refusals.
Limited Coverage: Prior work mostly evaluates a narrow set of dimensions, leaving cross-analyses of annotator expertise, task targets, and dataset properties largely unexplored.
Core Problem: Under what specific situations and boundaries can LLMs evaluate output text reliably in place of humans?
Method
Overall Architecture
The paper constructs Judge-Bench, a standardized suite of 20 NLP datasets with human annotations (70k+ instances) covering both categorical and graded judgments. After converting inputs to a unified schema, it benchmarks 11 LLM judges and measures their agreement with the human annotations.
Key Designs
Data Construction & Categorization
- Source Dichotomy: Classifies texts into human-authored versus machine-generated to systematically check whether LLMs are biased towards machine outputs.
- Annotation Schemas: Categorical targets utilize Cohen's \(\kappa\) metric; graded evaluations use Spearman's \(\rho\) correlation.
- Covered Attributes: Spans fluency, coherence, factual consistency, acceptability, verbosity, engagingness, toxicity, and safety labels.
- Evaluator Expertise: Differentiates annotations between expert panels and non-expert crowdworkers to analyze task-wise sensitivity.
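For reference, the two agreement statistics named above have the following standard textbook definitions. The symbols \(p_o\), \(p_e\), \(d_i\), and \(n\) are notation introduced here for illustration, not taken from the paper.

\[
\kappa = \frac{p_o - p_e}{1 - p_e},
\qquad
\rho = 1 - \frac{6 \sum_{i=1}^{n} d_i^{2}}{n\,(n^{2} - 1)}
\]

Here \(p_o\) is the observed rate of agreement between judge and human labels, \(p_e\) is the agreement expected by chance from the two marginal label distributions, \(d_i\) is the difference between the two ranks assigned to item \(i\), and \(n\) is the number of items. The closed form for \(\rho\) holds only when there are no tied ranks; with ties, \(\rho\) is computed as the Pearson correlation of the rank values.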
Model Selection & Prompt Design
- 11 Benchmark Models: Includes closed-source models (GPT-4o, Gemini-1.5) and open models (LLaMA-3.1-8B/70B, Mixtral-8x7B/8x22B, Command R/R+, OLMo, Starling-7B, Mistral).
- Prompt Strategy: Reuses the original annotation guidelines and appends formatting constraints such as "Answer with one of {}. Do not explain your answer." Alternative setups (CoT, few-shot) yielded inconsistent gains.
Evaluation Protocol
- Refusal Handling: Refusals triggered by safety guardrails and empty outputs are replaced with randomly drawn labels, keeping the samples valid and matched in size for comparison.
- Target Comparisons: Computes Cohen's \(\kappa\) (categorical agreement) or Spearman's \(\rho\) (graded agreement) against humans.
- Human Upper Bound: Estimated via bootstrap correlation between single annotators and aggregated consensus labels.
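The refusal handling and the categorical agreement score from the protocol above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names `fill_invalid` and `cohen_kappa`, the toy labels, and the choice to treat any out-of-vocabulary output as invalid are all assumptions made for this example.

```python
import random
from collections import Counter


def fill_invalid(predictions, labels, rng):
    """Replace any output outside the label set (e.g. None for an empty
    output, or a refusal string) with a randomly drawn valid label."""
    return [p if p in labels else rng.choice(labels) for p in predictions]


def cohen_kappa(a, b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n           # observed agreement
    count_a, count_b = Counter(a), Counter(b)
    p_e = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)  # chance agreement
    if p_e == 1:
        # Degenerate case (both raters used a single identical label);
        # returning 1.0 is a choice made for this sketch.
        return 1.0
    return (p_o - p_e) / (1 - p_e)


labels = ["yes", "no"]                                    # hypothetical label set
human = ["yes", "no", "yes", "yes", "no", "no"]           # hypothetical human labels
judge_raw = ["yes", "no", None, "yes", "I cannot answer that.", "no"]

judge = fill_invalid(judge_raw, labels, random.Random(0))
print(len(judge) == len(human), round(cohen_kappa(human, judge), 3))
```

Because invalid outputs are filled rather than dropped, every judge is scored on the same number of instances, at the cost of adding chance-level noise in proportion to the refusal rate.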
Loss & Training
This paper performs evaluation only; no model training or weight adjustments are conducted. All LLMs perform inference in zero-shot or few-shot formats.
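As a sketch of the zero-shot setup, a judge prompt can be assembled from the annotation guidelines plus the formatting constraint quoted under Prompt Strategy. The helper name `build_judge_prompt`, the example guideline and item text, and the way labels are joined into the `{}` placeholder are assumptions for illustration; the actual templates live in the Judge-Bench repository.

```python
def build_judge_prompt(guidelines: str, item_text: str, labels: list[str]) -> str:
    """Concatenate the guidelines, the item to judge, and the format constraint."""
    # Constraint wording follows the template quoted in the text; joining the
    # labels with ", " is an assumption of this sketch.
    constraint = "Answer with one of {}. Do not explain your answer.".format(
        ", ".join(labels)
    )
    return f"{guidelines.strip()}\n\n{item_text.strip()}\n\n{constraint}"


prompt = build_judge_prompt(
    guidelines="Rate the fluency of the summary.",   # hypothetical guideline text
    item_text="Summary: The cat sat on the mat.",    # hypothetical item
    labels=["1", "2", "3", "4", "5"],
)
print(prompt)
```

Restricting the answer to a closed label set keeps the model output easy to parse and makes invalid outputs (refusals, free-form explanations) easy to detect.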
Key Experimental Results
Main Results
Table 1: Main Results on Categorical & Graded Annotations
| Model | Categorical Avg \(\kappa\) | Graded Avg \(\rho\) |
|---|---|---|
| GPT-4o | \(0.28 \pm 0.32\) | \(0.50 \pm 0.21\) |
| LLaMA-3.1-70B | \(0.28 \pm 0.30\) | \(0.43 \pm 0.22\) |
| Mixtral-8x22B | \(0.24 \pm 0.30\) | \(0.44 \pm 0.19\) |
| Gemini-1.5 | \(0.22 \pm 0.28\) | \(0.43 \pm 0.21\) |
| Mixtral-8x7B | \(0.21 \pm 0.28\) | \(0.38 \pm 0.22\) |
| Command R+ | \(0.10 \pm 0.18\) | \(0.30 \pm 0.17\) |
- GPT-4o leads overall, but the strongest open models, LLaMA-3.1-70B and Mixtral-8x22B, are close behind.
- Open models occasionally outperform closed models on specific tasks (such as CoLA and SummEval).
- Large standard deviations (up to 0.32 for \(\kappa\) and 0.22 for \(\rho\) in Table 1) reflect wide differences in difficulty across the evaluated tasks.
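The "average ± spread" entries in Table 1 can be read as the mean and standard deviation of a judge's per-dataset scores. The sketch below illustrates that aggregation with made-up numbers; the dataset names and scores are hypothetical, and whether the paper reports a population or sample standard deviation is not stated here, so the choice of `pstdev` is an assumption.

```python
from statistics import mean, pstdev

# Hypothetical per-dataset kappa scores for one judge (not values from the paper).
per_dataset_kappa = {
    "dataset_a": 0.70,
    "dataset_b": 0.10,
    "dataset_c": 0.25,
    "dataset_d": -0.05,
}

scores = list(per_dataset_kappa.values())
avg = mean(scores)
# Population standard deviation; statistics.stdev (sample) is an equally plausible choice.
spread = pstdev(scores)
print(f"{avg:.2f} ± {spread:.2f}")
```

A spread that is as large as the mean, as in this toy case, signals that a single average says little about how the judge will behave on any particular task.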
Table 2: Cross-Analysis of Key Dimensions
| Analytic Dimension | Key Finding |
|---|---|
| Expert vs Non-expert | Models exhibit higher correlation with non-expert evaluators, likely due to shared reliance on surface-level context cues. |
| Human vs Machine Text | LLMs correlate significantly better when evaluating human-authored texts compared to machine-generated alternatives. |
| Attribute Variation | Closed models lead on acceptability; Mixtral leads on coherence and consistency. Every model struggles on engagingness. |
| Safety & Toxicity | Models show negative correlations on DICES and Medical-safety, attributed to overactive guardrails that produce invalid outputs. |
Key Findings
- No Absolute Winner: Attribute-level strengths vary across model families, cautioning against defaulting to GPT models.
- CoT is Inconsistent: Chain-of-Thought prompting does not consistently improve judge performance across tasks.
- Safety Filters Impede Judges: System guardrails trigger high refusal rates on sensitive datasets, rendering safety judgments inaccurate.
- Gap to Human Ceilings: Performance largely stays far beneath human upper bounds, with sparse exceptions.
- Logic and Structure Remain Safest: Evaluation tasks emphasizing mathematical rules and clear formatting constraints present highly reliable correlations.
Highlights & Insights
- Benchmark Scale: Covers 20 datasets, 11 models, and 70k+ instances, providing broad empirical evidence.
- Multi-Dimensional Metrics: Cross-references source types, human expertise levels, and task metrics to pinpoint exact alignment patterns.
- Unified Testing Schema: Judge-Bench provides a standardized data format, making it easier to integrate new domains.
- Sobering Real-world Advice: Dispenses detailed domain advice outlining when LLM judges are viable and when they require human checks.
Limitations & Future Work
- Assumes Perfect Agreement: Relies on correlation and kappa values, which can mask instances of shared systematic bias.
- Excludes Pairwise Setups: Benchmarks primarily direct-scoring systems, omitting pairwise comparison modes (e.g., PairEval).
- Language Limitation: Focuses heavily on English tasks, leaving multilingual capacities untested.
- Potential Data Leakage: Open evaluation datasets might already have been crawled within proprietary training runs.
- Newer Models Missing: Does not cover model releases after GPT-4o (such as Claude 3.5 or the o1 series).
Related Work & Insights
- LLM Judges: Explores G-Eval (Liu et al., 2023) and MT-Bench (Zheng et al., 2024) highlighting cost decreases.
- Evaluation Biases: Evaluates target model bias (Wang et al., 2024; Xu et al., 2024).
- Alternative Setups: Explores pairwise alignment models (Park et al., 2024; Kim et al., 2024).
Rating
| Dimension | Rating | Description |
|---|---|---|
| Novelty | ⭐⭐⭐ | Standard evaluation tooling; value stems from comprehensive analysis. |
| Experimental Thoroughness | ⭐⭐⭐⭐ | Broad testing across a variety of models, tasks, and attributes. |
| Practical Value | ⭐⭐⭐⭐ | Offers clear verification pipelines for building LLM judges. |
| Writing Quality | ⭐⭐⭐ | Clear structural writing with informative figures. |