Contrastive Decoding Mitigates Score Range Bias in LLM-as-a-Judge

Conference: ACL 2026 · arXiv: 2510.18196 · Code: N/A · Area: LLM Evaluation · Keywords: LLM-as-a-Judge, contrastive decoding, score range bias, direct assessment, model family bias

TL;DR

This paper identifies score range bias in LLM judges under the direct assessment setting, i.e., judge outputs are highly sensitive to the predefined score range, and proposes contrastive decoding as a mitigation strategy: models within the same family encode similar biases, which mutually cancel when one model's log-probabilities are subtracted from the other's, yielding an average relative improvement of up to 11.3% in Spearman correlation.

Background & Motivation

Background: LLM-as-a-Judge has become an indispensable component of evaluation ecosystems, widely applied to two tasks: direct assessment (assigning a score to a single output) and pairwise comparison (choosing the better of two outputs).

Limitations of Prior Work: Known LLM-judge biases include self-enhancement bias (favoring the model's own outputs) and family enhancement bias (favoring outputs from models in the same family), but whether additional hidden biases exist has not been thoroughly investigated. In direct assessment tasks, the correlation between LLM judges and human annotations has consistently lagged behind that of pairwise comparison.

Key Challenge: When different score ranges are used (e.g., 0–4, 1–5, 2–6, 3–7), the correlation of LLM-judge scores with human annotations shifts substantially, yielding unstable evaluation results that preclude a reliable search for the optimal score range.

Goal: To identify and quantify score range bias in LLM judges and propose effective mitigation strategies.

Key Insight: The observation that models of different sizes within the same family encode similar score range biases (e.g., Qwen2.5 family models of 3B/7B/14B all tend to output Score 2) motivates the use of contrastive decoding to cancel these shared biases.

Core Idea: Apply contrastive decoding to the LLM-as-a-Judge setting by using a smaller model from the same family as an "assistant model," subtracting its logits from those of the main model to eliminate the shared score range bias.

Method

Overall Architecture

The proposed method is built on the Contrastive Decoding framework, employing two models from the same model family: a main model and an assistant model. The final output is generated by subtracting the weighted log-probabilities of the assistant model from those of the main model, thereby canceling the score range bias shared by both.

Key Designs

1. Contrastive Decoding with a Scaling Factor

  • Function: Aligns the logit distributions between the main and assistant models.
  • Mechanism: For each candidate score token \(i\), the final score is \(\log p_{\text{main},i} - \lambda \log p_{\text{asst},i}\), where the assistant distribution is a temperature-scaled softmax over its logits \(z_j\): \(p_{\text{asst},i} = e^{z_i/t} / \sum_j e^{z_j/t}\).
  • Design Motivation: The logit ranges differ substantially across model sizes (3B max logit ≈ 25, 7B ≈ 30, 14B ≈ 34). Introducing scaling factor \(\lambda\) and temperature \(t\) to align logit distributions across models constitutes an improvement over the original contrastive decoding formulation.
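The mechanism above can be sketched directly on the judge's score-token logits. This is a minimal illustration, not the authors' code; the logit dictionaries and function names are hypothetical:

```python
import math

def softmax(logits, t=1.0):
    """Temperature-scaled softmax over a dict of score token -> logit."""
    m = max(logits.values())  # subtract the max for numerical stability
    exp = {k: math.exp((v - m) / t) for k, v in logits.items()}
    z = sum(exp.values())
    return {k: e / z for k, e in exp.items()}

def contrastive_scores(main_logits, asst_logits, lam, t):
    """log p_main - lam * log p_asst per candidate score token;
    only the assistant distribution is temperature-scaled, matching
    the paper's formulation."""
    p_main = softmax(main_logits)
    p_asst = softmax(asst_logits, t=t)
    return {k: math.log(p_main[k]) - lam * math.log(p_asst[k])
            for k in main_logits}

def judge_score(main_logits, asst_logits, lam=0.1, t=2.0):
    """Pick the score token with the highest contrastive score."""
    s = contrastive_scores(main_logits, asst_logits, lam, t)
    return max(s, key=s.get)
```

With \(\lambda = 0\) this reduces to greedy selection on the main model; a larger \(\lambda\) more aggressively cancels score preferences that the assistant shares with the main model.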

2. Same-Family Model Pairing Strategy

  • Function: Selects model pairs that encode similar biases to enable effective bias cancellation.
  • Mechanism: The Llama-3 family uses 8B (main) + 1B/3B (assistant); the Qwen-2.5 family uses 7B/14B (main) + 3B (assistant).
  • Design Motivation: Models within the same family share similar score range bias patterns (e.g., the Llama family tends to output Score 4, while the Qwen family tends to output Score 2). This shared structure enables contrastive decoding to effectively cancel the bias.

3. Hyperparameter Grid Search

  • Function: Identifies the optimal \(\lambda\) and \(t\) combination for each model pair and score range.
  • Mechanism: A grid search is conducted over \(\lambda \in \{0.01, 0.1, 0.5, 1.0\}\) and \(t \in \{0.5, 1.0, 2.0, 3.0, 4.0, 5.0\}\), using 10% of the data as a development set.
  • Design Motivation: Different score ranges and model combinations require different hyperparameters to optimally align logit distributions.
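The search above can be sketched as follows, assuming a caller-supplied `evaluate(lam, t)` that runs the contrastive judge on the 10% development split and returns its correlation with human annotations (the callback is hypothetical; the grids are those reported in the paper):

```python
from itertools import product

# Hyperparameter grids reported in the paper.
LAMBDAS = [0.01, 0.1, 0.5, 1.0]
TEMPS = [0.5, 1.0, 2.0, 3.0, 4.0, 5.0]

def grid_search(evaluate, lambdas=LAMBDAS, temps=TEMPS):
    """Return the (lambda, t) pair that maximizes the dev-set score.

    `evaluate` maps a hyperparameter pair to a correlation value,
    e.g. Spearman against human annotations on the held-out split.
    """
    return max(product(lambdas, temps), key=lambda pair: evaluate(*pair))
```

The 24-point grid is small enough that exhaustive search per model pair and score range is cheap relative to the judging runs themselves.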

Key Experimental Results

Main Results

| Model / Method | Score Range | Pearson | Spearman | Kendall |
|---|---|---|---|---|
| Llama 3.1-8B (greedy) | Average | 0.346 | 0.334 | 0.290 |
| Contrastive (8B–1B) | Average | 0.361 | 0.352 | 0.306 |
| Qwen2.5-14B (greedy) | Average | 0.383 | 0.384 | 0.334 |
| Contrastive (14B–3B) | Average | 0.424 | 0.433 | 0.376 |

Ablation Study

| Analysis Dimension | Key Findings |
|---|---|
| Different score ranges (0–4 / 1–5 / 2–6 / 3–7) | Greedy decoding exhibits large correlation fluctuations across ranges (Llama 8B: 0.257–0.372); contrastive decoding is more stable (0.298–0.378) |
| Assistant model selection (1B vs. 3B) | Marginal difference; 1B slightly outperforms 3B (Spearman 0.352 vs. 0.343) |
| Multi-dimensional evaluation (coherence / relevance / consistency) | Score range bias is present across all dimensions; contrastive decoding improves most of them |

Key Findings

  1. Score range bias is pervasive: Models across different families (Llama-3, Qwen-2.5) and scales (1B–14B) all exhibit score range bias, showing preference for specific score values.
  2. Same-family models encode similar biases: Qwen family models at 3B/7B/14B all tend to output Score 2; the bias weakens as model size increases but persists.
  3. Contrastive decoding yields the largest gains in the 2–6 range: This is the range where greedy decoding performs worst; contrastive decoding improves it most substantially (Llama: Spearman 0.257 → 0.302).
  4. Qwen-14B with contrastive decoding achieves the best overall performance: Average Spearman improves from 0.384 to 0.433, a relative gain of approximately 12.8%.

Highlights & Insights

  1. High value in problem identification: This is the first systematic revelation of score range bias in LLM judges — a previously overlooked issue with broad implications.
  2. Simple yet effective method: Contrastive decoding requires no additional training; it only requires running two same-family models concurrently, and can share computational overhead with speculative decoding.
  3. Clear bias visualization: Logit distribution plots intuitively demonstrate each model's preference for specific scores and how contrastive decoding shifts the score distribution closer to human annotations.
  4. Expanded evaluation search space: Contrastive decoding makes it feasible to search for optimal score ranges beyond the conventional 1–5 scale.

Limitations & Future Work

  1. Limited model scale: Experiments only cover models up to 14B parameters; bias patterns and contrastive decoding effectiveness for larger models (e.g., 70B+) remain unknown.
  2. Narrow task coverage: Validation is limited to summarization evaluation; applicability to other evaluation tasks (e.g., code evaluation, dialogue evaluation) has not been verified.
  3. English only: No testing in multilingual settings.
  4. Increased inference cost: Two forward passes must be run simultaneously; although computational overhead can be shared via speculative decoding, deployment complexity increases.
  5. Hyperparameter sensitivity: Each model pair and score range requires separate tuning, and generalization in practical applications warrants further investigation.

Related Work

  1. G-Eval (Liu et al., 2023): A seminal work on NLG evaluation using GPT-4; the present paper identifies score range bias as a limitation building on this framework.
  2. Family Enhancement Bias (Goel et al., 2025): Documents the tendency of same-family models to favor each other's outputs; this paper repurposes such "family similarity" for bias cancellation.
  3. Contrastive Decoding (Li et al., 2023): The original contrastive decoding method for open-ended text generation; this paper transfers it to the LLM-as-a-Judge setting.
  4. Prometheus 2 (Kim et al., 2024): A dedicated trained evaluation model, providing a contrast to the training-free approach proposed here.

Rating

  • Novelty: ⭐⭐⭐⭐ — First to reveal score range bias and propose contrastive decoding as a remedy; the problem identification itself is of significant value.
  • Experimental Thoroughness: ⭐⭐⭐ — Covers two model families, four score ranges, and three evaluation dimensions, but task coverage is narrow (summarization only).
  • Writing Quality: ⭐⭐⭐⭐ — Clear structure, effective visualizations, and thorough bias analysis.
  • Value: ⭐⭐⭐⭐ — Carries important cautionary implications for the LLM-as-a-Judge community; the method is practical and compatible with existing inference acceleration techniques.
