
AJ-Bench: Benchmarking Agent-as-a-Judge for Environment-Aware Evaluation

Conference: ACL 2026 | arXiv: 2604.18240 | Code: https://aj-bench.github.io/ | Area: Reinforcement Learning | Keywords: Agent Evaluation, Agent-as-a-Judge, Environment Interaction Verification, Trajectory Evaluation, Benchmarking

TL;DR

This paper introduces AJ-Bench, the first benchmark systematically evaluating Agent-as-a-Judge capabilities, covering 155 tasks and 516 annotated trajectories across three domains—search, data systems, and GUI. Experiments demonstrate that Agent-as-a-Judge improves average F1 by approximately 13 percentage points over LLM-as-a-Judge.

Background & Motivation

Background: As reinforcement learning (RL) is increasingly used to scale LLM agent training, reliably verifying agent behavior in complex environments becomes ever more critical. Mainstream verification approaches include rule-based verifiers and LLM-as-a-Judge: the former relies on predefined rules, while the latter makes judgments based on surface-level textual signals.

Limitations of Prior Work: Rule-based verifiers struggle to generalize to complex, open-ended scenarios (e.g., scientific hypothesis verification, long-document fact-checking). LLM-as-a-Judge lacks access to environment state information and can only make surface-level judgments from trajectory text, leading to erroneous evaluations. For instance, determining whether an agent correctly queried a database requires inspecting the actual database state, not merely reading the trajectory text.

Key Challenge: Verifying agent behavior requires understanding environment state changes, yet existing evaluation frameworks confine the judge to a passive "observer" role, preventing it from interacting with the environment to gather verification evidence.

Goal: To construct the first benchmark that systematically evaluates Agent-as-a-Judge capabilities, quantifying an agent evaluator's abilities in information acquisition, state verification, and process verification.

Key Insight: Endowing the evaluator with "agentic" capabilities—allowing the judge to interact with the environment and use tools to gather evidence beyond trajectory text—enables more reliable judgments.

Core Idea: Construct verification tasks that require environment interaction, systematically compare Agent-as-a-Judge against LLM-as-a-Judge, and reveal the critical role of environment interaction in evaluation reliability.

Method

Overall Architecture

AJ-Bench is constructed in three steps. First, 155 tasks are designed across three domains: search, data systems (DS; file system + Postgres), and GUI (PPT, Word, Excel). Second, 516 trajectories are generated with multiple models and annotated manually. Third, interactive environment replicas are built for the DS and GUI domains. During evaluation, the Judge Agent receives a task description and a candidate trajectory, can interact with the environment through a suite of 60 tools to gather evidence, and outputs a binary success/failure judgment.
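
As a rough illustration, the evaluation loop might look like the sketch below, assuming a generic tool-calling LLM interface; the `call_llm` helper, the tool-schema format, the prompt wording, and the verdict parsing are illustrative assumptions rather than the benchmark's actual API.

```python
# Minimal sketch of the Agent-as-a-Judge evaluation loop, assuming a generic
# tool-calling LLM interface. The `call_llm` helper, tool-schema format, prompt
# wording, and verdict parsing are illustrative assumptions, not the
# benchmark's actual API.

import json

MAX_TURNS = 20  # the ablation below varies this interaction budget (5 vs. 20)

def judge_trajectory(task: str, trajectory: str, tools: dict, call_llm) -> bool:
    """Return True if the judge deems the candidate trajectory successful."""
    messages = [
        {"role": "system", "content": (
            "You are a judge. Decide whether the agent completed the task. "
            "You may call tools to inspect the final environment state.")},
        {"role": "user", "content": f"Task:\n{task}\n\nTrajectory:\n{trajectory}"},
    ]
    for _ in range(MAX_TURNS):
        reply = call_llm(messages, tool_schemas=[t["schema"] for t in tools.values()])
        if reply.tool_call is None:
            # No more evidence requested: parse the final binary verdict.
            return "success" in reply.content.lower()
        # Execute the requested tool against the reconstructed environment
        # (file operations, Postgres queries, GUI inspection, web search, ...).
        result = tools[reply.tool_call.name]["fn"](**reply.tool_call.arguments)
        messages.append({"role": "assistant", "tool_call": reply.tool_call})
        messages.append({"role": "tool", "content": json.dumps(result)})
    return False  # budget exhausted without a verdict: count as failure
```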

Key Designs

  1. Three-Dimensional Evaluation Capability Design:

    • Function: Comprehensively covers the core verification capabilities required by Agent-as-a-Judge.
    • Mechanism: (a) Information Acquisition—verifying factual claims in trajectories via external search (search domain); (b) State Verification—using tools to check whether the current environment state matches expectations (DS domain, e.g., verifying file existence or database record correctness); (c) Process Verification—examining the correctness of key actions and execution steps (GUI domain, e.g., verifying whether a PowerPoint slide was correctly modified).
    • Design Motivation: These three dimensions correspond to the core advantages of Agent-as-a-Judge over LLM-as-a-Judge.
  2. Multi-Source Trajectory Collection and Annotation:

    • Function: Provides high-quality, diverse positive and negative trajectory samples.
    • Mechanism: Trajectories in the search domain are generated by Gemini, Grok, and Perplexity; DS domain trajectories are sourced from MCPMark across multiple models with standardized formatting; GUI domain trajectories are collected from OSWorld. Samples are deliberately selected such that successful trajectories have more steps and failed trajectories fewer, decoupling trajectory length from success rate. Labels are guaranteed by rule-based verification combined with manual review.
    • Design Motivation: Multi-model sourcing avoids single-model stylistic bias; length decoupling prevents judges from using trajectory length as a shortcut.
  3. Environment Reconstruction and Interactive Evaluation:

    • Function: Provides the Judge Agent with a real, interactive environment.
    • Mechanism: In the DS domain, the final environment state is replayed locally; in the GUI domain, environments are reconstructed on isolated AWS instances. The Judge Agent begins from the final environment state and actively gathers evidence through tool calls (file operations, database queries, GUI inspection, etc.); a minimal sketch of such checks follows this list.
    • Design Motivation: Static trajectory text is insufficient for reliably determining task completion; the interactive environment is the key infrastructure distinguishing Agent-as-a-Judge from LLM-as-a-Judge.
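
The sketch below illustrates the kind of state-verification checks a Judge Agent might run in the DS domain, assuming a locally reachable Postgres replica; the file path, connection string, table name, and expected values are hypothetical and only show how tool calls turn environment state into verification evidence.

```python
# Illustrative state-verification checks a Judge Agent might run in the DS
# domain. The file path, connection string, table name, and expected values
# are hypothetical; psycopg2 usage assumes the reconstructed Postgres replica
# is reachable locally.

import os
import psycopg2

def file_exists(path: str) -> bool:
    """Check that a file the trajectory claims to have created actually exists."""
    return os.path.isfile(path)

def record_matches(dsn: str, query: str, expected: tuple) -> bool:
    """Check that the final database state contains the expected record."""
    with psycopg2.connect(dsn) as conn:
        with conn.cursor() as cur:
            cur.execute(query)
            row = cur.fetchone()
    return row == expected

# Hypothetical task: "archive report.csv and register it in the documents table".
ok_file = file_exists("/workspace/archive/report.csv")
ok_db = record_matches(
    "dbname=replica user=judge",
    "SELECT filename, status FROM documents WHERE filename = 'report.csv'",
    ("report.csv", "archived"),
)
print("state verified:", ok_file and ok_db)
```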

Loss & Training

AJ-Bench is an evaluation benchmark rather than a training framework, so no loss is involved; F1 score is the primary evaluation metric. In the search domain, F1 is computed at the level of individual entries and then aggregated; in the DS and GUI domains, F1 is computed at the trajectory level. All results are averaged over three runs.
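
For concreteness, the sketch below shows a minimal version of the metric computation, assuming binary success labels and simple per-run averaging; the toy labels are illustrative, and the search domain's entry-level aggregation is not reproduced here.

```python
# Minimal sketch of the metric, assuming binary labels (1 = trajectory judged
# successful). The toy labels are illustrative; the search domain's
# entry-level aggregation is not reproduced here.

from statistics import mean

def f1(gold: list[int], pred: list[int]) -> float:
    tp = sum(g == 1 and p == 1 for g, p in zip(gold, pred))
    fp = sum(g == 0 and p == 1 for g, p in zip(gold, pred))
    fn = sum(g == 1 and p == 0 for g, p in zip(gold, pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

# Gold labels come from manual annotation; each run yields one set of judge
# predictions over the same trajectories, and the reported score is the mean.
gold = [1, 0, 1, 1, 0]
runs = [[1, 0, 1, 0, 0], [1, 0, 1, 1, 1], [1, 1, 1, 1, 0]]
print("mean F1 over runs:", round(mean(f1(gold, p) for p in runs), 3))
```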

Key Experimental Results

Main Results

| Model | Agentic Search F1 | DS F1 | GUI F1 | Overall F1 |
|---|---|---|---|---|
| gemini-3-pro | 77.0 | 74.5 | 74.2 | 75.1 |
| gpt-5 | 73.4 | 60.9 | 52.8 | 61.0 |
| deepseek-v3.2 | 63.3 | 63.3 | 66.1 | 64.5 |
| gpt-5-mini | 70.8 | 67.4 | 76.8 | 72.4 |
| deepseek-v3.2 | 77.3 | 72.7 | 80.5 | 77.3 |

Ablation Study

| Configuration | Key Metric | Note |
|---|---|---|
| Interaction turns = 5 vs. 20 | F1: ~65 vs. ~77 | More interaction turns consistently improve performance |
| Accessibility tree only | GUI F1 varies | Sufficient for PPT tasks, insufficient for Word |
| Screenshot only | Word F1 best | Optimal modality differs by task |
| Mixed modality | Excel F1 best | Multimodal input is not always superior |

Key Findings

  • Agent-as-a-Judge improves average F1 by approximately 13 percentage points over LLM-as-a-Judge using the same model, with the largest gain in the GUI domain (up to 31 percentage points).
  • A weaker model with tool use outperforms a stronger model without: gpt-5-mini (agentic) achieves an overall F1 of 72.4, surpassing gpt-5 (non-agentic) at 61.0.
  • Increased inference effort does not necessarily improve Agent-as-a-Judge performance: the thinking mode of deepseek-v3.2 underperforms its non-thinking counterpart by 0.23 F1.
  • Multimodal input is not always beneficial: mixed inputs may introduce noise, and the optimal modality varies across sub-tasks.

Highlights & Insights

  • The finding that "weaker model + tools > stronger model without tools" is significant, indicating that the value of Agent-as-a-Judge lies not in model capability per se, but in the information gain provided by environment interaction.
  • The observation that increased inference effort can degrade performance suggests that Agent-as-a-Judge requires better tool-use proficiency rather than deeper reasoning.
  • The domain design (search / DS / GUI) maps cleanly onto three core capabilities (information acquisition / state verification / process verification), providing a clear capability taxonomy for future research.

Limitations & Future Work

  • Most tasks are adapted from existing benchmarks rather than constructed from scratch, limiting coverage.
  • The search domain depends on external network conditions, and network instability may affect evaluation consistency.
  • Absolute performance still has substantial room for improvement (best F1 ≈ 0.77), indicating that Agent-as-a-Judge is far from saturated.
  • Future work may extend to additional domains (e.g., scientific verification, code review) and scale up data for training purposes.
Comparison with Related Work

  • vs. RewardBench/RM-Bench: These benchmarks evaluate LLM-as-a-Judge without involving environment interaction; AJ-Bench is the first to systematically evaluate environment-aware Agent-as-a-Judge.
  • vs. DevAI (Agent-as-a-Judge): DevAI covers only code verification as a single domain; AJ-Bench spans three domains and supports multimodal evaluation.
  • vs. AgentRewardBench: Evaluates judge decisions on agent trajectories but does not provide environment interaction capabilities.

Rating

  • Novelty: ⭐⭐⭐⭐ First benchmark to systematically evaluate Agent-as-a-Judge.
  • Experimental Thoroughness: ⭐⭐⭐⭐ Comprehensive multi-model comparisons and ablations, though the number of agentic models evaluated is limited.
  • Writing Quality: ⭐⭐⭐⭐ Clear structure with well-articulated benchmark design motivation.
  • Value: ⭐⭐⭐⭐⭐ Fills an important gap in agent evaluation infrastructure.