Incentivizing Agentic Reasoning in LLM Judges via Tool-Integrated Reinforcement Learning

Conference: ICLR 2026
arXiv: 2510.23038
Code: None
Area: Model Compression
Keywords: LLM-as-a-Judge, tool-integrated reasoning, reinforcement learning, code execution, evaluation

TL;DR

This paper proposes TIR-Judge, an end-to-end RL framework that trains LLM judge models to interleave reasoning and code execution during evaluation. With only 8B parameters, TIR-Judge surpasses 32B reasoning reward models across 7 public benchmarks; its distillation-free variant, TIR-Judge-Zero, improves further through iterative self-bootstrapping without a teacher model.

Background & Motivation

LLM-as-a-Judge has become increasingly critical in the LLM ecosystem: it provides preference signals during training, performs best-of-N selection at inference, and replaces human evaluators during assessment. However, existing judge models face two major challenges:

Ceiling of pure-text reasoning: Current reasoning-enhanced judge models (e.g., JudgeLRM, J1-Judge) rely solely on textual chain-of-thought and struggle in scenarios requiring precise computation or symbolic reasoning (e.g., verifying code outputs, checking instruction constraints).

Limitations of tool use: The few methods that incorporate tools suffer from (i) applying tools only at inference time without training-time optimization, and (ii) being restricted to specific tasks or domains.

Core Idea: Train judge models end-to-end via reinforcement learning to learn when to invoke a code interpreter and how to iteratively refine reasoning based on execution results, achieving deep integration of reasoning and tool use.

Method

Overall Architecture

TIR-Judge constructs judging trajectories via multi-turn tool-integrated reasoning (TIR): \(s_k = \{r_1,c_1,o_1,...,r_k,c_k,o_k\}\), where \(r_i\) denotes a reasoning step, \(c_i\) the generated code, and \(o_i = \mathcal{I}(c_i)\) the execution output. RL training is performed using DAPO (an improved variant of GRPO). Three judging formats are supported: Pointwise, Pairwise, and Listwise.
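
The trajectory structure above can be illustrated with a minimal sketch. It is not the paper's implementation: `policy`, `run_code`, and `Step` are hypothetical names, the interpreter is a toy `exec` wrapper rather than a sandbox, and hard-stopping the rollout after 3 tool calls is an assumption borrowed from the limit used in the tool reward.

```python
import contextlib
import io
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Step:
    reasoning: str           # r_i
    code: Optional[str]      # c_i (None on the final, tool-free turn)
    output: Optional[str]    # o_i = I(c_i)


def run_code(code: str) -> str:
    """Toy stand-in for the interpreter I: run the code and capture stdout."""
    buf = io.StringIO()
    try:
        with contextlib.redirect_stdout(buf):
            exec(code, {})
    except Exception as exc:
        return f"{type(exc).__name__}: {exc}"
    return buf.getvalue().strip()


# Hypothetical policy signature: (prompt, trajectory so far) -> (reasoning, code or None).
Policy = Callable[[str, List[Step]], Tuple[str, Optional[str]]]


def rollout(policy: Policy, prompt: str, max_tool_calls: int = 3) -> List[Step]:
    """Interleave reasoning, code, and execution output until the policy stops."""
    trajectory: List[Step] = []
    tool_calls = 0
    while True:
        reasoning, code = policy(prompt, trajectory)
        if code is None or tool_calls >= max_tool_calls:
            trajectory.append(Step(reasoning, None, None))
            return trajectory
        trajectory.append(Step(reasoning, code, run_code(code)))
        tool_calls += 1


def scripted_policy(prompt: str, trajectory: List[Step]) -> Tuple[str, Optional[str]]:
    """Scripted stand-in for the judge model, used only for this demo."""
    if not trajectory:
        return "Verify the product claimed in response A.", "print(17 * 23)"
    return f"The tool printed {trajectory[-1].output}, so response A is correct.", None


demo = rollout(scripted_policy, "Which response computes 17 * 23 correctly?")
```

Here `demo` holds two steps: one tool-using turn whose output feeds the next reasoning step, followed by a final tool-free verdict.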

Key Designs

Diverse Training Data Construction:
- Function: Balance training data between verifiable domains (math, coding) and non-verifiable domains (dialogue, safety, general code).
- Mechanism: Real preference pairs are collected from HelpSteer3, UltraInteract, CodeRM, etc.; synthetic preference pairs are generated by sampling from multiple models (e.g., Qwen3-8B/14B) and automatically verified. The dataset comprises approximately 26K preference pairs spanning multiple domains and formats.
- Design Motivation: Enable the model to learn when tool invocation is beneficial (verifiable scenarios) and when pure reasoning suffices (non-verifiable scenarios).
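
One way the synthetic pairs for verifiable domains could be assembled is sketched below. The source states only that sampled responses are automatically verified; pairing every verified-correct response with every incorrect one, and the names `build_pairs` and `is_correct`, are assumptions made for illustration.

```python
from itertools import product
from typing import Callable, Dict, List


def build_pairs(
    prompt: str,
    responses: List[str],
    is_correct: Callable[[str], bool],  # hypothetical automatic verifier
) -> List[Dict[str, str]]:
    """Pair each verified-correct response with each incorrect one."""
    correct = [r for r in responses if is_correct(r)]
    wrong = [r for r in responses if not is_correct(r)]
    return [
        {"prompt": prompt, "chosen": c, "rejected": w}
        for c, w in product(correct, wrong)
    ]


# Toy usage: the verifier checks for the known answer to 12 * 12.
samples = ["144", "124", "The answer is 144", "142"]
pairs = build_pairs("What is 12 * 12?", samples, lambda r: "144" in r)
```

Prompts where all sampled responses are correct, or all are incorrect, yield no pairs under this scheme.
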
Three-Dimensional Reward Design:
- Function: Guide the model to simultaneously optimize correctness, format compliance, and tool usage quality.
- Mechanism: \(R = R_c \times (0.1 + 0.9 \cdot \mathbb{I}[R_t = 1 \wedge R_f = 1])\)
  - Correctness reward \(R_c\): whether the predicted preference matches the ground truth.
  - Format reward \(R_f\): whether the output adheres to the structured format (e.g., \<score> and \<preference> tags); for safety/general scenarios, a positive reward requires that no tools be used.
  - Tool reward \(R_t\): whether code executes without errors and within at most 3 invocations.
- Design Motivation: Full credit is awarded only when all three criteria are satisfied; correctness alone yields only 10% of the reward, discouraging unstructured but coincidentally correct outputs.
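
The reward formula above can be written out directly. This is a minimal sketch assuming all three components are binary, as the descriptions suggest; the function and parameter names are illustrative, not taken from the paper.

```python
def tool_reward_ok(num_calls: int, had_error: bool, max_calls: int = 3) -> bool:
    """R_t: code ran without errors and used at most `max_calls` invocations."""
    return (not had_error) and num_calls <= max_calls


def tir_reward(correct: bool, format_ok: bool, tool_ok: bool) -> float:
    """R = R_c * (0.1 + 0.9 * 1[R_t = 1 and R_f = 1])."""
    r_c = 1.0 if correct else 0.0
    gate = 1.0 if (format_ok and tool_ok) else 0.0
    return r_c * (0.1 + 0.9 * gate)


full = tir_reward(True, True, tool_reward_ok(num_calls=2, had_error=False))
partial = tir_reward(True, False, True)   # correct but badly formatted
wrong = tir_reward(False, True, True)     # well formatted but incorrect
```

A correct, well-formed judgment with clean tool use receives the full reward, a correct but malformed one receives a tenth of it, and an incorrect judgment receives nothing regardless of format.
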
Iterative Self-Bootstrapping Training Strategy (TIR-Judge-Zero):
- Function: Achieve self-improvement via pure RL without teacher distillation.
- Mechanism: Alternate RL with rejection sampling (RS) and SFT: \(\mathcal{T}_{t+1} \leftarrow \text{RS}(\pi_{\theta_t}), \pi_{\theta_{t+1}} \leftarrow \text{SFT}(\pi_{\theta_0}, \mathcal{T}_{t+1}), \pi_{\theta_{t+1}} \leftarrow \text{RL}(\pi_{\theta_{t+1}})\). In each round, trajectories \(\mathcal{T}_{t+1}\) are sampled from the current policy and filtered, the base model \(\pi_{\theta_0}\) is fine-tuned on them, and the result is trained with RL again. For each prompt, only the shortest correct trajectory with the fewest tool invocations is retained to improve efficiency.
- Design Motivation: Demonstrate that TIR judge models can self-evolve without distillation, reducing dependence on strong teacher models.
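
The rejection-sampling filter in the self-bootstrapping loop can be sketched as follows. The source does not say how "shortest" and "fewest tool invocations" are combined, so ranking by tool calls first and then by text length is an assumption; `Trajectory` and `select_for_sft` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass
class Trajectory:
    prompt_id: str
    text: str
    num_tool_calls: int
    correct: bool


def select_for_sft(samples: Iterable[Trajectory]) -> Dict[str, Trajectory]:
    """Keep one correct trajectory per prompt, preferring fewer tool calls, then shorter text."""
    best: Dict[str, Trajectory] = {}
    for t in samples:
        if not t.correct:
            continue  # rejection step: discard wrong judgments
        current = best.get(t.prompt_id)
        key = (t.num_tool_calls, len(t.text))
        if current is None or key < (current.num_tool_calls, len(current.text)):
            best[t.prompt_id] = t
    return best


sampled = [
    Trajectory("p1", "long reasoning with two checks", 2, True),
    Trajectory("p1", "short check", 1, True),
    Trajectory("p1", "ok", 0, False),
    Trajectory("p2", "wrong verdict", 1, False),
]
kept = select_for_sft(sampled)
```

Prompts with no correct sample (such as `p2` here) contribute nothing to the SFT set for that round.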

Additional Training Details

Backbone models: Qwen3-8B and Qwen3-4B. Error messages are truncated to their last line to limit context length, and execution outputs are masked in the loss computation to prevent overfitting.
The distilled variant uses Gemini-2.5-Flash as the teacher, collecting approximately 10K high-quality trajectories.
Training is conducted on 8× H100 80G GPUs.
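
The error truncation and output masking mentioned above could look roughly like this. It is a sketch under assumptions: the segment representation (token list plus a flag for tool output) and the function names are invented for illustration, and real training code would operate on tokenizer IDs rather than strings.

```python
from typing import List, Sequence, Tuple


def truncate_error(stderr: str) -> str:
    """Keep only the last non-empty line of an error message."""
    lines = [ln for ln in stderr.strip().splitlines() if ln.strip()]
    return lines[-1] if lines else ""


def build_loss_mask(segments: Sequence[Tuple[List[str], bool]]) -> List[int]:
    """Return 1 for model-generated tokens and 0 for execution-output tokens."""
    mask: List[int] = []
    for tokens, is_tool_output in segments:
        mask.extend([0 if is_tool_output else 1] * len(tokens))
    return mask


traceback_text = (
    "Traceback (most recent call last):\n"
    '  File "<string>", line 1, in <module>\n'
    "ZeroDivisionError: division by zero\n"
)
short_error = truncate_error(traceback_text)

mask = build_loss_mask([
    (["Let", "me", "check"], False),   # reasoning tokens: trained on
    (["391"], True),                   # interpreter output: excluded from the loss
    (["A", "wins"], False),            # final verdict tokens: trained on
])
```

Masking keeps the loss focused on tokens the judge itself produced, so the model is not trained to predict interpreter output.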

Key Experimental Results

Main Results (Pointwise + Pairwise)

Model	PPE Avg	IFBench	CJBench	RWBench	RMBench	JGBench
Qwen3-8B Pointwise	60.6	56.2	16.6	76.5	66.9	50.8
Qwen3-8B Pairwise	65.5	61.3	60.8	87.0	77.9	67.5
Gemini-2.5-Flash Pairwise	74.8	69.3	66.5	93.4	81.9	75.4
TIR-Judge (inferred)	~70+	~66+	~63+	~90+	—	—

Ablation Study: Zero vs. Distill

Configuration	Scale	Description
TIR-Judge-Zero (4B)	4B	Pure RL bootstrapping; outperforms distilled variant by 1.2%
TIR-Judge-Distill (4B)	4B	RL after distillation cold-start
TIR-Judge-Zero (8B)	8B	Surpasses 32B reasoning reward models

Key Findings

TIR-Judge achieves up to 6.4% improvement on Pointwise and 7.7% on Pairwise over pure-reasoning judge baselines.
The 8B TIR-Judge surpasses 32B reasoning reward models on the PPE benchmark.
TIR-Judge-Zero outperforms the distilled variant by 1.2% at the 4B scale, suggesting that pure RL bootstrapping is a viable strategy that can match or exceed distillation.
In the Listwise setting, TIR-Judge achieves 96% of Claude-Opus-4's performance.

Highlights & Insights

Extending RL with tool use from mathematical reasoning to judging tasks is a natural yet highly effective direction.
The gated multiplicative structure of the three-part reward (correctness scaled by a joint format and tool-quality indicator) is elegant, avoiding the weight tuning that a simple weighted sum would require.
TIR-Judge-Zero's distillation-free self-bootstrapping challenges the common assumption that a strong teacher model is required for cold-start initialization.

Limitations & Future Work

Enforcing no tool use in safety/general domains may be overly simplistic; certain safety evaluation scenarios could also benefit from tool assistance.
Capping multi-turn tool invocations at 3 may limit performance on complex evaluation tasks.
The strongest gains are observed on reasoning-related benchmarks; the advantage on open-ended dialogue evaluation warrants further validation.

Related Work Comparison

vs. JudgeLRM/J1-Judge: These methods enhance only the textual reasoning chain, whereas TIR-Judge additionally incorporates code execution to enable precise verification.
vs. AgentRM: AgentRM employs tools at inference time but does not optimize tool use during training; TIR-Judge performs end-to-end joint training.

Rating

Novelty: ⭐⭐⭐⭐ First systematic application of TIR to judging tasks.
Experimental Thoroughness: ⭐⭐⭐⭐⭐ 7 benchmarks, 3 judging formats, Zero/Distill ablations.
Writing Quality: ⭐⭐⭐⭐ Clear framework with sufficient technical detail.
Value: ⭐⭐⭐⭐⭐ An 8B model surpassing 32B counterparts demonstrates exceptional practical utility.