Incentivizing Agentic Reasoning in LLM Judges via Tool-Integrated Reinforcement Learning

Conference: ICLR 2026 arXiv: 2510.23038 Code: None Area: Model Compression Keywords: LLM-as-a-Judge, tool-integrated reasoning, reinforcement learning, code execution, evaluation

TL;DR

This paper proposes TIR-Judge, an end-to-end RL framework that trains LLM judge models to interleave reasoning and code execution during evaluation. With only 8B parameters, TIR-Judge surpasses 32B reasoning reward models across 7 public benchmarks; its distillation-free variant, TIR-Judge-Zero, achieves further self-bootstrapped improvement.

Background & Motivation

LLM-as-a-Judge has become increasingly critical in the LLM ecosystem—providing preference signals during training, performing best-of-N selection at inference, and replacing human evaluators during assessment. However, existing judge models face two major challenges:

Ceiling of pure-text reasoning: Current reasoning-enhanced judge models (e.g., JudgeLRM, J1-Judge) rely solely on textual chain-of-thought and struggle in scenarios requiring precise computation or symbolic reasoning (e.g., verifying code outputs, checking instruction constraints).

Limitations of tool use: The few methods that incorporate tools suffer from (i) applying tools only at inference time without training-time optimization, and (ii) being restricted to specific tasks or domains.

Core Idea: Train judge models end-to-end via reinforcement learning to learn when to invoke a code interpreter and how to iteratively refine reasoning based on execution results, achieving deep integration of reasoning and tool use.

Method

Overall Architecture

TIR-Judge constructs judging trajectories via multi-turn tool-integrated reasoning (TIR): \(s_k = \{r_1,c_1,o_1,...,r_k,c_k,o_k\}\), where \(r_i\) denotes a reasoning step, \(c_i\) the generated code, and \(o_i = \mathcal{I}(c_i)\) the execution output. RL training is performed using DAPO (an improved variant of GRPO). Three judging formats are supported: Pointwise, Pairwise, and Listwise.
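The trajectory construction above can be illustrated as a bounded tool loop. This is a minimal sketch, not the paper's implementation: `judge_step` stands in for the policy model's per-turn output \((r_i, c_i)\), and `sandbox_exec` is a toy stand-in for the interpreter \(\mathcal{I}\) (a real system would use an isolated sandbox).

```python
import contextlib
import io

MAX_TOOL_CALLS = 3  # the paper caps tool invocations at 3

def sandbox_exec(code: str) -> str:
    """Toy interpreter I(c_i): run code, capture stdout or the error."""
    buf = io.StringIO()
    try:
        with contextlib.redirect_stdout(buf):
            exec(code, {})  # a real system would execute in a sandbox
        return buf.getvalue().strip()
    except Exception as e:
        # as in the paper, keep only the last line of the error message
        msg = str(e).strip() or type(e).__name__
        return msg.splitlines()[-1]

def tir_trajectory(judge_step, prompt: str):
    """Build s_k = {r_1, c_1, o_1, ..., r_k, c_k, o_k}."""
    state = []
    for _ in range(MAX_TOOL_CALLS):
        reasoning, code, done = judge_step(prompt, state)
        if code is None:
            # the model chose pure-text reasoning for this turn
            state.append((reasoning, None, None))
            break
        output = sandbox_exec(code)  # o_i = I(c_i)
        state.append((reasoning, code, output))
        if done:
            break
    return state
```

The loop terminates either when the model stops requesting tool calls or when the invocation cap is hit, matching the tool-reward constraint described below.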

Key Designs

  1. Diverse Training Data Construction:

    • Function: Balance training data between verifiable domains (math, coding) and non-verifiable domains (dialogue, safety, general code).
    • Mechanism: Real preference pairs are collected from HelpSteer3, UltraInteract, CodeRM, etc.; synthetic preference pairs are generated by sampling from multiple models (e.g., Qwen3-8B/14B) and automatically verified. The dataset comprises approximately 26K preference pairs spanning multiple domains and formats.
    • Design Motivation: Enable the model to learn when tool invocation is beneficial (verifiable scenarios) and when pure reasoning suffices (non-verifiable scenarios).
  2. Three-Dimensional Reward Design:

    • Function: Guide the model to simultaneously optimize correctness, format compliance, and tool usage quality.
    • Mechanism: \(R = R_c \times (0.1 + 0.9 \cdot \mathbb{I}[R_t = 1 \wedge R_f = 1])\)
      • Correctness reward \(R_c\): whether the predicted preference matches the ground truth.
      • Format reward \(R_f\): whether the output adheres to the structured format (e.g., `<score>` and `<preference>` tags); for safety/general scenarios, a positive reward requires that no tools be used.
      • Tool reward \(R_t\): whether code executes without errors and within at most 3 invocations.
    • Design Motivation: Full credit is awarded only when all three criteria are satisfied; correctness alone yields only 10% of the reward, discouraging unstructured but coincidentally correct outputs.
  3. Iterative Self-Bootstrapping Training Strategy (TIR-Judge-Zero):

    • Function: Achieve self-improvement via pure RL without teacher distillation.
    • Mechanism: Alternately execute RL → rejection sampling → SFT → RL cycles: \(\mathcal{T}_{t+1} \leftarrow \text{RS}(\pi_{\theta_t}), \pi_{\theta_{t+1}} \leftarrow \text{SFT}(\pi_{\theta_0}, \mathcal{T}_{t+1}), \pi_{\theta_{t+1}} \leftarrow \text{RL}(\pi_{\theta_{t+1}})\). For each prompt, only the shortest correct trajectory with the fewest tool invocations is retained to improve efficiency.
    • Design Motivation: Demonstrate that TIR judge models can self-evolve without distillation, reducing dependence on strong teacher models.
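As a sanity check on the multiplicative reward structure, the gating formula can be written out directly. This is a minimal sketch: the three binary indicators are assumed to be produced elsewhere (by the preference verifier, the format checker, and the sandbox, respectively).

```python
# R = R_c * (0.1 + 0.9 * 1[R_t = 1 and R_f = 1])
# Correctness alone earns only 10% credit; full credit additionally
# requires a compliant format and error-free tool use (<= 3 calls).
def judge_reward(correct: bool, format_ok: bool, tool_ok: bool) -> float:
    r_c = 1.0 if correct else 0.0
    gate = 1.0 if (format_ok and tool_ok) else 0.0
    return r_c * (0.1 + 0.9 * gate)
```

Because \(R_c\) multiplies the whole expression, a wrong preference yields zero reward regardless of format or tool quality, which is what makes the scheme robust without per-term weight tuning.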

Additional Training Details

  • Backbone models: Qwen3-8B and Qwen3-4B; error messages are truncated to the last line to prevent excessive context length; execution outputs are masked in loss computation to prevent overfitting.
  • The distilled variant uses Gemini-2.5-Flash as the teacher, collecting approximately 10K high-quality trajectories.
  • Training is conducted on 8× H100 80G GPUs.
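The rejection-sampling step of TIR-Judge-Zero (keep, per prompt, only the correct trajectory with the fewest tool invocations, breaking ties by length) can be sketched as below. The `Trajectory` fields are hypothetical names for illustration, not the paper's data schema.

```python
from dataclasses import dataclass

@dataclass
class Trajectory:
    prompt_id: int       # which training prompt produced this rollout
    correct: bool        # did the judgment match the ground truth?
    num_tool_calls: int  # interpreter invocations used
    num_tokens: int      # trajectory length

def select_sft_data(trajectories):
    """RS(pi_theta_t): keep one best correct rollout per prompt."""
    best = {}
    for t in trajectories:
        if not t.correct:
            continue  # incorrect rollouts are rejected outright
        rank = (t.num_tool_calls, t.num_tokens)
        prev = best.get(t.prompt_id)
        if prev is None or rank < (prev.num_tool_calls, prev.num_tokens):
            best[t.prompt_id] = t
    return list(best.values())
```

The retained set \(\mathcal{T}_{t+1}\) then seeds the next SFT-from-\(\pi_{\theta_0}\) plus RL round, biasing each iteration toward concise, tool-efficient judging.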

Key Experimental Results

Main Results (Pointwise + Pairwise)

Model                        PPE Avg  IFBench  CJBench  RWBench  RMBench  JGBench
Qwen3-8B (Pointwise)         60.6     56.2     16.6     76.5     66.9     50.8
Qwen3-8B (Pairwise)          65.5     61.3     60.8     87.0     77.9     67.5
Gemini-2.5-Flash (Pairwise)  74.8     69.3     66.5     93.4     81.9     75.4
TIR-Judge (inferred)         ~70+     ~66+     ~63+     ~90+     n/a      n/a

Ablation Study: Zero vs. Distill

Configuration           Scale  Description
TIR-Judge-Zero (4B)     4B     Pure RL bootstrapping; outperforms the distilled variant by 1.2%
TIR-Judge-Distill (4B)  4B     RL after a distillation cold start
TIR-Judge-Zero (8B)     8B     Surpasses 32B reasoning reward models

Key Findings

  • TIR-Judge achieves up to 6.4% improvement on Pointwise and 7.7% on Pairwise over pure-reasoning judge baselines.
  • The 8B TIR-Judge surpasses 32B reasoning reward models on the PPE benchmark.
  • TIR-Judge-Zero outperforms the distilled variant by 1.2% at the 4B scale, demonstrating that pure RL bootstrapping is a viable and superior strategy.
  • In the Listwise setting, TIR-Judge achieves 96% of Claude-Opus-4's performance.

Highlights & Insights

  • Extending RL with tool use from mathematical reasoning to judging tasks is a natural yet highly effective direction.
  • The multiplicative structure of the three-dimensional reward (correctness × format × tool quality) is elegant, circumventing the tuning difficulties associated with simple weighted sums.
  • TIR-Judge-Zero's distillation-free self-bootstrapping challenges the common assumption that a strong teacher model is required for cold-start initialization.

Limitations & Future Work

  • Enforcing no tool use in safety/general domains may be overly simplistic; certain safety evaluation scenarios could also benefit from tool assistance.
  • Capping multi-turn tool invocations at 3 may limit performance on complex evaluation tasks.
  • The strongest gains are observed on reasoning-related benchmarks; the advantage on open-ended dialogue evaluation warrants further validation.

Comparison with Related Work

  • vs. JudgeLRM/J1-Judge: These methods enhance only the textual reasoning chain, whereas TIR-Judge additionally incorporates code execution to enable precise verification.
  • vs. AgentRM: AgentRM employs tools at inference time but does not optimize tool use during training; TIR-Judge performs end-to-end joint training.

Rating

  • Novelty: ⭐⭐⭐⭐ First systematic application of TIR to judging tasks.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ 7 benchmarks, 3 judging formats, Zero/Distill ablations.
  • Writing Quality: ⭐⭐⭐⭐ Clear framework with sufficient technical detail.
  • Value: ⭐⭐⭐⭐⭐ An 8B model surpassing 32B counterparts demonstrates exceptional practical utility.