Incentivizing Agentic Reasoning in LLM Judges via Tool-Integrated Reinforcement Learning¶
Conference: ICLR 2026 arXiv: 2510.23038 Code: None Area: LLM Evaluation Keywords: LLM-as-a-Judge, tool-integrated reasoning, reinforcement learning, code execution, evaluation
TL;DR¶
This paper proposes TIR-Judge, an end-to-end RL framework that trains LLM judge models to interleave reasoning and code execution during evaluation. With only 8B parameters, TIR-Judge surpasses 32B reasoning reward models across 7 public benchmarks; its distillation-free variant, TIR-Judge-Zero, achieves further self-bootstrapped improvement.
Background & Motivation¶
LLM-as-a-Judge has become increasingly critical in the LLM ecosystem—providing preference signals during training, performing best-of-N selection at inference, and replacing human evaluators during assessment. However, existing judge models face two major challenges:
Ceiling of pure-text reasoning: Current reasoning-enhanced judge models (e.g., JudgeLRM, J1-Judge) rely solely on textual chain-of-thought and struggle in scenarios requiring precise computation or symbolic reasoning (e.g., verifying code outputs, checking instruction constraints).
Limitations of tool use: The few methods that incorporate tools suffer from (i) applying tools only at inference time without training-time optimization, and (ii) being restricted to specific tasks or domains.
Core Idea: Train judge models end-to-end via reinforcement learning to learn when to invoke a code interpreter and how to iteratively refine reasoning based on execution results, achieving deep integration of reasoning and tool use.
Method¶
Overall Architecture¶
TIR-Judge constructs judging trajectories via multi-turn tool-integrated reasoning (TIR): \(s_k = \{r_1,c_1,o_1,...,r_k,c_k,o_k\}\), where \(r_i\) denotes a reasoning step, \(c_i\) the generated code, and \(o_i = \mathcal{I}(c_i)\) the execution output. RL training is performed using DAPO (an improved variant of GRPO). Three judging formats are supported: Pointwise, Pairwise, and Listwise.
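To make this loop concrete, here is a minimal Python sketch of one rollout. The `generate` and `run_interpreter` helpers and the `<output>` tag format are illustrative assumptions standing in for the judge policy and the sandboxed interpreter, not the paper's exact interfaces:

```python
import re

MAX_TOOL_CALLS = 3  # the paper caps tool invocations at three

def extract_code_block(text: str):
    """Return the last fenced python block in `text`, or None if there is no tool call."""
    blocks = re.findall(r"```python\n(.*?)```", text, flags=re.DOTALL)
    return blocks[-1] if blocks else None

def tir_rollout(prompt: str, generate, run_interpreter) -> str:
    """Roll out one judging trajectory s_k = {r_1, c_1, o_1, ..., r_k, c_k, o_k}.

    `generate(context)` queries the judge policy; `run_interpreter(code)` executes
    code in a sandbox and returns its output. Both are caller-supplied stand-ins.
    """
    trajectory = prompt
    for _ in range(MAX_TOOL_CALLS):
        step = generate(trajectory)        # reasoning r_k, optionally ending in code c_k
        trajectory += step
        code = extract_code_block(step)
        if code is None:                   # no tool call: the step carries the final verdict
            break
        output = run_interpreter(code)     # o_k = I(c_k)
        trajectory += f"\n<output>\n{output}\n</output>\n"
    return trajectory
```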
Key Designs¶
- Diverse Training Data Construction:
- Function: Balance training data between verifiable domains (math, coding) and non-verifiable domains (dialogue, safety, general chat).
- Mechanism: Real preference pairs are collected from HelpSteer3, UltraInteract, CodeRM, etc.; synthetic preference pairs are generated by sampling from multiple models (e.g., Qwen3-8B/14B) and automatically verified. The dataset comprises approximately 26K preference pairs spanning multiple domains and formats.
- Design Motivation: Enable the model to learn when tool invocation is beneficial (verifiable scenarios) and when pure reasoning suffices (non-verifiable scenarios).
- Three-Dimensional Reward Design:
- Function: Guide the model to simultaneously optimize correctness, format compliance, and tool usage quality.
- Mechanism: \(R = R_c \times (0.1 + 0.9 \cdot \mathbb{I}[R_t = 1 \wedge R_f = 1])\)
- Correctness reward \(R_c\): whether the predicted preference matches the ground truth.
- Format reward \(R_f\): whether the output adheres to the structured format (e.g., `<score>` and `<preference>` tags); for safety/general scenarios, a positive reward requires that no tools be used.
- Tool reward \(R_t\): whether code executes without errors and within at most 3 invocations.
- Design Motivation: Full credit is awarded only when all three criteria are satisfied; correctness alone yields just 10% of the reward, discouraging outputs that are coincidentally correct but unstructured (see the reward sketch after this list).
- Iterative Self-Bootstrapping Training Strategy (TIR-Judge-Zero):
- Function: Achieve self-improvement via pure RL without teacher distillation.
- Mechanism: Alternate RL → rejection sampling → SFT → RL cycles: \(\mathcal{T}_{t+1} \leftarrow \text{RS}(\pi_{\theta_t});\; \pi_{\theta_{t+1}} \leftarrow \text{SFT}(\pi_{\theta_0}, \mathcal{T}_{t+1});\; \pi_{\theta_{t+1}} \leftarrow \text{RL}(\pi_{\theta_{t+1}})\). For each prompt, only the shortest correct trajectory with the fewest tool invocations is retained, which keeps trajectories efficient.
- Design Motivation: Demonstrate that TIR judge models can self-evolve without distillation, reducing dependence on strong teacher models (a loop sketch follows this list).
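For reference, the multiplicative reward above transcribes directly into code. The three boolean checkers are assumed to be computed upstream (label matching, tag parsing, interpreter error status); this is a sketch of the formula, not the paper's implementation:

```python
def judge_reward(correct: bool, format_ok: bool, tool_ok: bool) -> float:
    """R = R_c * (0.1 + 0.9 * 1[R_t = 1 and R_f = 1]).

    correct:   predicted preference matches the ground truth (R_c)
    format_ok: output carries well-formed <score>/<preference> tags (R_f)
    tool_ok:   all code ran without errors within <= 3 invocations (R_t)
    """
    r_c = 1.0 if correct else 0.0
    gate = 1.0 if (tool_ok and format_ok) else 0.0
    return r_c * (0.1 + 0.9 * gate)

# Correct but non-compliant -> 0.1; fully compliant and correct -> 1.0;
# incorrect -> 0.0 regardless of format or tool quality.
assert judge_reward(True, False, True) == 0.1
assert judge_reward(True, True, True) == 1.0
assert judge_reward(False, True, True) == 0.0
```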
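And a schematic of the self-bootstrapping cycle. `rl`, `sft`, and `sample` are hypothetical stand-ins for the paper's training and sampling stages; the rejection-sampling filter (shortest correct trajectory, fewest tool calls) is made explicit:

```python
from __future__ import annotations
from dataclasses import dataclass

@dataclass
class Trajectory:
    text: str
    correct: bool
    num_tool_calls: int

def select_best(candidates: list[Trajectory]) -> Trajectory | None:
    """Rejection-sampling filter: shortest correct trajectory with fewest tool calls."""
    correct = [t for t in candidates if t.correct]
    if not correct:
        return None
    return min(correct, key=lambda t: (t.num_tool_calls, len(t.text)))

def tir_judge_zero(pi_0, prompts, rl, sft, sample, iterations=2):
    """Alternate RL -> rejection sampling -> SFT -> RL, per the update rule above.

    `rl`, `sft`, and `sample` are caller-supplied routines; note that SFT restarts
    from the base policy pi_0, as in the paper's equation.
    """
    pi = rl(pi_0, prompts)                            # initial RL phase
    for _ in range(iterations):
        dataset = [best for p in prompts
                   if (best := select_best(sample(pi, p))) is not None]
        pi = sft(pi_0, dataset)                       # SFT(pi_0, T_{t+1})
        pi = rl(pi, prompts)                          # RL from the SFT checkpoint
    return pi
```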
Additional Training Details¶
- Backbone models: Qwen3-8B and Qwen3-4B. Error messages are truncated to their last line to keep context length bounded, and interpreter outputs are masked out of the loss so the model does not overfit to execution results (see the masking sketch after this list).
- The distilled variant uses Gemini-2.5-Flash as the teacher, collecting approximately 10K high-quality trajectories.
- Training is conducted on 8× H100 80G GPUs.
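A possible reading of the loss-masking detail, assuming `<output>` spans are tracked as token ranges when trajectories are assembled and that training uses a PyTorch-style cross-entropy with the standard -100 ignore index:

```python
import torch

IGNORE_INDEX = -100  # tokens with this label are skipped by torch's cross-entropy

def mask_tool_outputs(input_ids: torch.Tensor,
                      output_spans: list[tuple[int, int]]) -> torch.Tensor:
    """Build training labels in which interpreter outputs contribute no gradient.

    output_spans: (start, end) token indices of each <output>...</output> segment,
    assumed to be recorded when the trajectory is tokenized.
    """
    labels = input_ids.clone()
    for start, end in output_spans:
        labels[start:end] = IGNORE_INDEX  # execution results are context, not targets
    return labels
```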
Key Experimental Results¶
Main Results (Pointwise + Pairwise)¶
| Model | PPE Avg | IFBench | CJBench | RWBench | RMBench | JGBench |
|---|---|---|---|---|---|---|
| Qwen3-8B Pointwise | 60.6 | 56.2 | 16.6 | 76.5 | 66.9 | 50.8 |
| Qwen3-8B Pairwise | 65.5 | 61.3 | 60.8 | 87.0 | 77.9 | 67.5 |
| Gemini-2.5-Flash Pairwise | 74.8 | 69.3 | 66.5 | 93.4 | 81.9 | 75.4 |
| TIR-Judge (approx., inferred) | ~70+ | ~66+ | ~63+ | ~90+ | — | — |
Ablation Study: Zero vs. Distill¶
| Configuration | Scale | Description |
|---|---|---|
| TIR-Judge-Zero (4B) | 4B | Pure RL bootstrapping; outperforms distilled variant by 1.2% |
| TIR-Judge-Distill (4B) | 4B | RL after distillation cold-start |
| TIR-Judge-Zero (8B) | 8B | Surpasses 32B reasoning reward models |
Key Findings¶
- TIR-Judge achieves up to 6.4% improvement on Pointwise and 7.7% on Pairwise over pure-reasoning judge baselines.
- The 8B TIR-Judge surpasses 32B reasoning reward models on the PPE benchmark.
- TIR-Judge-Zero outperforms the distilled variant by 1.2% at the 4B scale, demonstrating that pure RL bootstrapping is a viable and superior strategy.
- In the Listwise setting, TIR-Judge achieves 96% of Claude Opus 4's performance.
Highlights & Insights¶
- Extending RL with tool use from mathematical reasoning to judging tasks is a natural yet highly effective direction.
- The multiplicative structure of the three-dimensional reward (correctness × format × tool quality) is elegant, circumventing the tuning difficulties associated with simple weighted sums.
- TIR-Judge-Zero's distillation-free self-bootstrapping challenges the common assumption that a strong teacher model is required for cold-start initialization.
Limitations & Future Work¶
- Enforcing no tool use in safety/general domains may be overly simplistic; certain safety evaluation scenarios could also benefit from tool assistance.
- Capping multi-turn tool invocations at 3 may limit performance on complex evaluation tasks.
- The strongest gains are observed on reasoning-related benchmarks; the advantage on open-ended dialogue evaluation warrants further validation.
Related Work & Insights¶
- vs. JudgeLRM/J1-Judge: These methods enhance only the textual reasoning chain, whereas TIR-Judge additionally incorporates code execution to enable precise verification.
- vs. AgentRM: AgentRM employs tools at inference time but does not optimize tool use during training; TIR-Judge performs end-to-end joint training.
Rating¶
- Novelty: ⭐⭐⭐⭐ First systematic application of TIR to judging tasks.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ 7 benchmarks, 3 judging formats, Zero/Distill ablations.
- Writing Quality: ⭐⭐⭐⭐ Clear framework with sufficient technical detail.
- Value: ⭐⭐⭐⭐⭐ An 8B model surpassing 32B counterparts demonstrates exceptional practical utility.