Reasoning Model Is Superior LLM-Judge, Yet Suffers from Biases
Conference: ACL 2026
arXiv: 2601.03630
Code: https://github.com/HuihuiChyan/LRM-Judge
Area: LLM Safety / LLM Evaluation / LLM-as-a-Judge
Keywords: Reasoning Model Evaluation, Evaluation Bias, PlanJudge, RewardBench, BiasBench
TL;DR
The paper systematically compares the performance of reasoning models versus standard LLMs as judges. It finds that while reasoning models exhibit stronger accuracy, evaluation instruction following, and attack robustness, they remain susceptible to surface quality biases. The authors propose a prompt-only strategy, PlanJudge, to mitigate these biases.
Background & Motivation
Background: With the increase in open-ended generation tasks, traditional metrics like BLEU and ROUGE struggle to capture LLM output quality. LLM-as-a-Judge has become a mainstream evaluation solution, where researchers use strong models to compare two responses or score them, replacing expensive human review.
Limitations of Prior Work: Judge models are themselves error-prone. Existing research shows that LLM-as-a-Judge is affected by position, length, style, specificity, and format biases. Meanwhile, although large reasoning models (LRMs) perform better in math and coding thanks to long-form reasoning and self-checking, no systematic study has established whether they also make better judges.
Key Challenge: Reasoning models might excel at complex judgments, but they might also be more easily misled by surface features because of "overthinking" or excessive reliance on explicit criteria. Deciding whether they are superior requires looking beyond a single reward benchmark and covering accuracy, instruction following, attack robustness, and bias robustness.
Goal: In a controlled setting where reasoning is the only variable, the study compares four pairs of reasoning and non-reasoning models as judges and designs a lightweight strategy to reduce bias.
Key Insight: The authors pair reasoning and instruct variants within the same model family: DeepSeek-V3 vs DeepSeek-R1, Qwen2.5-32B-Instruct vs QwQ-32B, and the instruct vs thinking variants of Qwen3-30B-A3B and Qwen3-Next-80B-A3B. This isolates the effect of the reasoning process itself as far as possible.
Core Idea: Reasoning models are indeed stronger judges, but to reduce over-reliance on surface signals like length and specificity, they must first be required to draft a clear evaluation plan and then execute the judgment according to that plan.
Method
The paper consists of two parts. The first is a systematic empirical comparison examining four dimensions of LRM-as-a-Judge: general evaluation accuracy, evaluation instruction following, robustness to prompt injection, and robustness to evaluation biases. The second part introduces PlanJudge: a method that does not train new models but instead prompts the judge to generate or receive a fine-grained evaluation plan before performing the comparison.
Overall Architecture
The experiments first select four model pairs and compare them across RewardBench, JudgeBench, Helpsteer2-trivial, RobustJudge, BiasBench, and LLMBar. Helpsteer2-trivial is a new dataset constructed by the authors: Response A is overall better, but Response B is superior in a single specific dimension. A competent judge should select A under an overall prompt but switch to B under a specific prompt; thus, the authors use the Reversal Rate to measure evaluation instruction following.
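The two quantities reported for Helpsteer2-trivial can be sketched as follows. This is a minimal illustration, not the authors' code: the `PairVerdict` record and both function names are hypothetical, and the choice to compute the Reversal Rate only over samples the judge got right under the overall prompt is an assumption; the paper may normalize differently.

```python
from dataclasses import dataclass


@dataclass
class PairVerdict:
    overall_choice: str   # judge's pick under the overall prompt: "A" or "B"
    specific_choice: str  # judge's pick under the dimension-specific prompt


def ori_acc(verdicts):
    """Share of samples where the judge picks A, the overall-better response."""
    return sum(v.overall_choice == "A" for v in verdicts) / len(verdicts)


def reversal_rate(verdicts):
    """Assumed definition: among samples judged correctly under the overall
    prompt, the share that switch to B under the specific prompt."""
    correct = [v for v in verdicts if v.overall_choice == "A"]
    if not correct:
        return 0.0
    return sum(v.specific_choice == "B" for v in correct) / len(correct)


# Toy verdicts, invented for illustration only.
verdicts = [
    PairVerdict("A", "B"),  # correct overall, reverses as instructed
    PairVerdict("A", "B"),
    PairVerdict("A", "A"),  # correct overall, fails to reverse
    PairVerdict("B", "B"),  # wrong overall, excluded from the RR denominator
]
print(f"OriACC={ori_acc(verdicts):.2f}  RR={reversal_rate(verdicts):.2f}")
```

Under this reading, a judge can score a high Reversal Rate with a modest overall accuracy, since the two numbers use different denominators.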
Subsequently, PlanJudge is integrated into the same set of judges. PlanJudge splits the evaluation into "planning" and "execution" steps: first writing out evaluation dimensions, priorities, and precautions based on the task, and then having the model make a judgment based on that plan. Plans can be derived from human heuristics, self-generated by the model, or a combination of both.
Key Designs
- Controlled reasoning vs non-reasoning comparison:
  - Function: To determine whether the reasoning process itself enhances judge capabilities.
  - Mechanism: Selecting instruct/thinking variants within the same model family to avoid misattributing gains from parameter scale, pre-training corpora, or architectural differences to reasoning.
  - Design Motivation: If only arbitrary strong and weak models were compared, the conclusions would be confounded by differences in model size and training data. Controlled model pairs make the conclusions more credible.
- Helpsteer2-trivial and Reversal Rate:
  - Function: Specifically testing whether a judge can switch preferences according to specified evaluation dimensions.
  - Mechanism: Constructing samples where A is overall superior but B is better in a single dimension. If the model selects A for the overall prompt and B for the specific prompt, it demonstrates that it understands and executes the evaluation dimension instructions.
  - Design Motivation: Instruction following in evaluation tasks differs from general chat. A judge might know which response is better overall but fail to ignore other dimensions when instructed to "evaluate only helpfulness."
- PlanJudge Two-stage Evaluation:
  - Function: Reducing the judge's sensitivity to surface quality biases.
  - Mechanism: Generating a specific evaluation plan first, then executing the evaluation. Plans can be heuristic-based, self-synthesized, or combined. Combined plans utilize both human rules and the model's understanding of the current sample.
  - Design Motivation: Reasoning models inherently check criteria item by item, but if the criteria are ambiguous, they may mistake length, detail, and tone for quality. Explicit planning pulls attention back to the task requirements.
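The plan-then-execute flow described above might look roughly like the sketch below. Everything here is an assumption made for illustration: the prompt wording, the `HEURISTIC_PLAN` text, and the `make_plan` / `plan_judge` helpers are hypothetical and do not reproduce the paper's prompts, and `fake_llm` is an offline stand-in for a real model call.

```python
from typing import Callable, Optional

# Hypothetical heuristic plan; the paper's own wording is not reproduced here.
HEURISTIC_PLAN = (
    "- Check that each response actually follows the instruction.\n"
    "- Check factual and logical correctness.\n"
    "- Do not reward length, detail, or formatting for their own sake."
)


def make_plan(instruction: str, llm: Callable[[str], str],
              heuristic: Optional[str] = None, self_plan: bool = True) -> str:
    """Stage 1: assemble a heuristic, self-synthesized, or combined plan."""
    parts = [heuristic] if heuristic else []
    if self_plan:
        prompt = (
            "Write an evaluation plan (dimensions, priorities, precautions) "
            f"for judging responses to this task:\n{instruction}"
        )
        parts.append(llm(prompt).strip())
    return "\n".join(parts)


def plan_judge(instruction: str, response_a: str, response_b: str,
               llm: Callable[[str], str], heuristic: Optional[str] = None,
               self_plan: bool = True) -> str:
    """Stage 2: judge the pair according to the plan; returns the verdict text."""
    plan = make_plan(instruction, llm, heuristic, self_plan)
    prompt = (
        f"Evaluation plan:\n{plan}\n\n"
        f"Task:\n{instruction}\n\n"
        f"Response A:\n{response_a}\n\nResponse B:\n{response_b}\n\n"
        "Follow the plan and answer with a single letter, A or B."
    )
    return llm(prompt).strip()


def fake_llm(prompt: str) -> str:
    """Offline stand-in for a real model call, so the sketch runs as-is."""
    if prompt.startswith("Write an evaluation plan"):
        return "- Verify the final answer against the question."
    return "B"


print(make_plan("Solve 2 + 2.", fake_llm, HEURISTIC_PLAN))
print(plan_judge("Solve 2 + 2.", "A long essay ending in 5", "4", fake_llm, HEURISTIC_PLAN))
```

Passing only `heuristic` with `self_plan=False` gives the heuristic variant, omitting `heuristic` gives the self-synthesized variant, and supplying both gives the combined variant. Here the plan is drafted from the task alone; whether the paper's planning stage also sees the candidate responses is not stated in this summary.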
Loss & Training
PlanJudge is a prompt-only method that requires no additional fine-tuning, reward models, or external resources. Its cost comes from longer evaluation prompts and longer reasoning, but the barrier to deployment is low. Compared with methods that train a judge, PlanJudge is closer to a change of evaluation protocol: "direct judgment" becomes "plan evaluation dimensions first, then execute judgment."
Key Experimental Results
Main Results
In terms of general evaluation accuracy, reasoning variants generally outperformed instruct variants, with the Qwen series showing significant gains on JudgeBench. DeepSeek-R1 scored higher than V3 on RewardBench but lower on JudgeBench, which the authors attribute to hallucination issues in knowledge-based tasks.
| Model | RewardBench | JudgeBench | Conclusion |
|---|---|---|---|
| DeepSeek-V3 | 89.74 | 84.19 | Standard model is more stable on JudgeBench |
| DeepSeek-R1 | 91.18 | 80.48 | Stronger on RewardBench, exceptions in knowledge judgment |
| Qwen2.5-32B-Instruct | 89.31 | 60.40 | Non-reasoning version is significantly weaker on JudgeBench |
| QwQ-32B | 91.05 | 79.75 | Reasoning brings a substantial gain |
| Qwen3-30B-A3B-Instruct | 89.88 | 74.00 | Strong instruct baseline |
| Qwen3-30B-A3B-Thinking | 92.01 | 83.87 | Reasoning superior across both metrics |
| Qwen3-Next-80B-A3B-Instruct | 88.96 | 79.45 | Instruct version is stable |
| Qwen3-Next-80B-A3B-Thinking | 92.90 | 82.42 | Reasoning continues to provide a gain |
Regarding evaluation instruction following, the Reversal Rate of reasoning versions is generally higher, indicating that long-form reasoning does not weaken instruction following in evaluation contexts; instead, it encourages the model to repeatedly verify evaluation dimensions.
| Model | OriACC | RR (Reversal Rate) | Observation |
|---|---|---|---|
| DeepSeek-V3 | 78.22 | 87.80 | Accurate overall judgment, but dimension switching is slightly weaker |
| DeepSeek-R1 | 73.61 | 95.24 | RR is significantly higher |
| Qwen2.5-32B-Instruct | 71.13 | 83.19 | Insufficient dimension switching in instruct version |
| QwQ-32B | 76.49 | 91.11 | Reasoning improves instruction following |
| Qwen3-30B-A3B-Instruct | 72.78 | 95.67 | Inherently strong |
| Qwen3-30B-A3B-Thinking | 78.14 | 97.44 | One of the best performers across both metrics |
| Qwen3-Next-80B-A3B-Instruct | 75.88 | 82.50 | Lower RR |
| Qwen3-Next-80B-A3B-Thinking | 77.94 | 91.18 | Significant improvement after reasoning |
Ablation Study
Regarding bias robustness, the results are more complex. LRMs are generally stronger on LLMBar because they better identify explicit instruction misalignment; however, they can be more influenced by surface quality features like length and specificity on BiasBench.
| Model | BiasBench | LLMBar | Explanation |
|---|---|---|---|
| DeepSeek-V3 | 81.25 | 76.49 | Balanced bias robustness |
| DeepSeek-R1 | 65.00 | 79.00 | Stronger surface quality bias, but identifies obvious mismatch |
| Qwen2.5-32B-Instruct | 82.50 | 67.71 | High BiasBench, weaker LLMBar |
| QwQ-32B | 67.50 | 79.31 | Reasoning improves LLMBar, reduces BiasBench |
| Qwen3-30B-A3B-Instruct | 81.25 | 59.25 | Weaker LLMBar |
| Qwen3-30B-A3B-Thinking | 77.50 | 83.07 | Reasoning significantly improves LLMBar |
| Qwen3-Next-80B-A3B-Instruct | 80.00 | 64.55 | Lower bias in instruct version |
| Qwen3-Next-80B-A3B-Thinking | 75.00 | 77.55 | LLMBar improves with reasoning |
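The opposite directions on the two benchmarks are easier to see as per-family differences. The short script below only subtracts the scores in the table above (reasoning minus instruct); the family labels and variable names are chosen here for illustration.

```python
# (BiasBench, LLMBar) scores copied from the table above.
scores = {
    "DeepSeek (V3 -> R1)":       {"instruct": (81.25, 76.49), "reasoning": (65.00, 79.00)},
    "Qwen2.5-32B -> QwQ-32B":    {"instruct": (82.50, 67.71), "reasoning": (67.50, 79.31)},
    "Qwen3-30B-A3B":             {"instruct": (81.25, 59.25), "reasoning": (77.50, 83.07)},
    "Qwen3-Next-80B-A3B":        {"instruct": (80.00, 64.55), "reasoning": (75.00, 77.55)},
}

deltas = {}
for family, s in scores.items():
    # Positive means the reasoning variant scores higher than its instruct sibling.
    bias_delta = round(s["reasoning"][0] - s["instruct"][0], 2)
    llmbar_delta = round(s["reasoning"][1] - s["instruct"][1], 2)
    deltas[family] = (bias_delta, llmbar_delta)
    print(f"{family:26s} BiasBench {bias_delta:+6.2f}   LLMBar {llmbar_delta:+6.2f}")
```

In all four pairs the BiasBench difference is negative and the LLMBar difference is positive, which is the mixed picture the paragraph above describes.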
PlanJudge's combined strategy raises BiasBench to between 93.59 and 98.75 for the four judges shown below, and in these rows it also improves RewardBench and LLMBar.
| Model | Method | RewardBench | BiasBench | LLMBar |
|---|---|---|---|---|
| DeepSeek-V3 | Original | 89.74 | 81.25 | 76.49 |
| DeepSeek-V3 | Combined | 93.07 | 98.75 | 86.83 |
| DeepSeek-R1 | Original | 91.18 | 65.00 | 79.00 |
| DeepSeek-R1 | Combined | 92.47 | 97.50 | 86.21 |
| Qwen2.5-32B | Original | 89.31 | 82.50 | 67.71 |
| Qwen2.5-32B | Combined | 89.68 | 93.59 | 75.55 |
| QwQ-32B | Original | 91.05 | 67.50 | 79.31 |
| QwQ-32B | Combined | 93.13 | 95.00 | 83.07 |
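One way to read this table is to ask how much of the instruct-versus-reasoning gap on BiasBench remains once the combined plan is applied. The snippet below is plain arithmetic on the BiasBench column; the variable names are illustrative.

```python
# (original, combined) BiasBench scores copied from the table above.
bias_bench = {
    "DeepSeek-V3": (81.25, 98.75),
    "DeepSeek-R1": (65.00, 97.50),
    "Qwen2.5-32B": (82.50, 93.59),
    "QwQ-32B":     (67.50, 95.00),
}
pairs = [("DeepSeek-V3", "DeepSeek-R1"), ("Qwen2.5-32B", "QwQ-32B")]

# Gain of each judge from adding the combined plan.
gains = {model: round(comb - orig, 2) for model, (orig, comb) in bias_bench.items()}

# Instruct-minus-reasoning gap, before and after PlanJudge.
gaps = {}
for instruct, reasoning in pairs:
    before = round(bias_bench[instruct][0] - bias_bench[reasoning][0], 2)
    after = round(bias_bench[instruct][1] - bias_bench[reasoning][1], 2)
    gaps[(instruct, reasoning)] = (before, after)
    print(f"{instruct} vs {reasoning}: gap {before:+.2f} -> {after:+.2f}")

print(gains)
```

The reasoning variants gain the most (32.50 and 27.50 points, against 17.50 and 11.09 for the instruct variants), so the gap that reasoning opened on BiasBench largely closes in these two families.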
Key Findings
- LRM-as-a-Judge is overall superior to standard LLM judges, particularly in reasoning-intensive coding, math, and complex judgment tasks.
- Reasoning models are stronger at following evaluation instructions. This differs from some general instruction-following studies suggesting that "reasoning models are more stubborn," which indicates that evaluation scenarios have their own characteristics.
- Reasoning models are more stable against prompt injection attacks because they verify task boundaries and evaluation requirements during the reasoning process.
- However, reasoning models still prefer responses that appear more specific, longer, or more organized, even if these surface features do not represent true quality.
- The core value of PlanJudge is not to make the model "think more," but to clarify "what criteria it should think with."
Highlights & Insights
- The paper breaks down judge capability into four dimensions, avoiding single-metric conclusions like "better because of a high benchmark score."
- Helpsteer2-trivial is practical, converting evaluation instruction following into a measurable Reversal Rate to test if the judge truly evaluates based on specified dimensions.
- PlanJudge is simple but effective. Many judge biases stem from ambiguous evaluation criteria; making the evaluation plan explicit significantly reduces such issues.
- The study serves as a reminder that reasoning chains are not inherently reliable and can amplify existing evaluation preferences; the reasoning process therefore needs to be constrained by structured criteria.
Limitations & Future Work
- Model coverage was intentionally limited to families with clear reasoning/non-reasoning pairs; conclusions for LLaMA-based families or closed-source o-series models require further verification.
- Each evaluation dimension used only one or two benchmarks, which may still be influenced by dataset design bias.
- PlanJudge increases reasoning cost and latency, requiring a trade-off with throughput in large-scale automated evaluations.
- Bias types mainly come from existing BiasBench/LLMBar; domain, cultural, and linguistic biases in real-world evaluations are not yet fully covered.
- Future work could investigate how to automatically verify plan quality and whether PlanJudge can be combined with calibration and human few-shot auditing.
Related Work & Insights
- vs LLM-as-a-Judge Empirical Studies: Existing research mostly discusses the human alignment of GPT-4-like judges; this paper further compares whether reasoning modes are suitable as judges.
- vs BiasBench / LLMBar: These benchmarks provide bias diagnosis; this paper finds that reasoning does not shift all biases in the same direction, so one cannot simply claim that reasoning makes a judge more robust or less robust.
- vs Training-based Judge Improvements: Some methods improve judge capability through fine-tuning; PlanJudge does not train models and is lighter to deploy, though it relies on the model's ability to execute a plan.
- Insight: When using LLM judges in research experiments, writing evaluation rubrics as explicit plans and reporting whether planning was used can improve evaluation reproducibility.
Rating
- Novelty: ⭐⭐⭐⭐☆ The systematic comparison of reasoning judges is timely, and PlanJudge is simple but effective.
- Experimental Thoroughness: ⭐⭐⭐⭐☆ Covers four categories of judge capabilities across multiple model pairs, though benchmark dimensions could be further expanded.
- Writing Quality: ⭐⭐⭐⭐☆ Clear conclusions and information-dense tables.
- Value: ⭐⭐⭐⭐⭐ Direct guidance for model selection, bias control, and evaluation protocol design in LLM-as-a-Judge applications.