Talk, Evaluate, Diagnose: User-aware Agent Evaluation with Automated Error Analysis

Conference: ICLR 2026
arXiv: 2603.15483
Code: GitHub
Area: LLM Evaluation
Keywords: Agent Evaluation, User Awareness, LLM-as-Judge, Error Analysis, Efficiency Metrics

TL;DR

This paper proposes TED (Talk, Evaluate, Diagnose), a framework that achieves user-aware dynamic agent evaluation via general, reusable expert/non-expert persona templates; enables fine-grained efficiency assessment through grading notes, LLM-as-judge scoring, and novel metrics such as MaxProgressRate@k; and provides actionable improvement feedback via automated error discovery and clustering. Experiments on τ²-bench and ToolSandbox reveal new insights into agent performance.

Background & Motivation

Background: LLM agents are increasingly deployed to automate diverse workflows, yet evaluation frameworks remain fragmented—each domain relies on its own methodology (database queries, regex matching, etc.) to determine task success.
Limitations of Prior Work: (1) No unified cross-domain evaluation methodology exists; (2) the effect of user personas on agent behavior is not systematically considered; (3) evaluation stops at metric reporting, lacking diagnosis and actionable improvement guidance.
Key Challenge: Agent behavior is heavily shaped by user interaction, yet user personas are left uncontrolled during evaluation.
Goal: Construct a unified, user-aware, and diagnosable agent evaluation framework.
Key Insight: A three-stage unification of Talk (user simulation) + Evaluate (assessment) + Diagnose (diagnosis).
Core Idea: Effective agent evaluation requires not only correctness, but also conversation quality, efficiency, and systematic error diagnosis.

Method

Overall Architecture

Talk → Simulate expert/non-expert user interactions with the agent via reusable persona templates.
Evaluate → Convert sub-goals into grading notes, score with LLM-as-judge, and compute metrics such as MaxProgressRate@k.
Diagnose → Analyze judge–agent inconsistencies, then automatically discover and cluster error patterns.

Key Designs

Design 1: General Reusable Persona Templates
- Function: Decouple the user persona from task instructions by providing general expert/non-expert templates that are independent of specific tasks and agents.
- Mechanism: \(u = f(p, i)\), which combines a persona prompt \(p\) with a task instruction \(i\). Swapping the persona while holding the task fixed isolates the effect of user behavior. The template also includes a two-step reflect-then-respond process.
- Design Motivation: Existing methods tightly couple persona with task, making it impossible to isolate the independent effect of user behavior.
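The composition \(u = f(p, i)\) can be illustrated with a minimal sketch. The persona wording, the function name `build_user_prompt`, and the example task below are all hypothetical and are not taken from the paper; the sketch only shows that the persona and the task instruction are kept separate and joined at the last moment.

```python
# Hypothetical persona prompts; the paper's actual template wording is not reproduced here.
EXPERT_PERSONA = (
    "You are an expert user. You know the domain vocabulary, state requests "
    "precisely, and provide all required details without being asked."
)
NON_EXPERT_PERSONA = (
    "You are a non-expert user. You describe your goal in everyday language, "
    "may omit details, and only provide them when the agent asks."
)

# Assumed phrasing for the reflect-then-respond step described in the text.
REFLECT_THEN_RESPOND = (
    "Before each reply, first reflect on what the agent just said and what is "
    "still missing, then write your response."
)


def build_user_prompt(persona: str, instruction: str) -> str:
    """u = f(p, i): join a task-independent persona p with a task instruction i."""
    return (
        f"{persona.strip()}\n\n"
        f"Task instruction:\n{instruction.strip()}\n\n"
        f"{REFLECT_THEN_RESPOND}"
    )


# Same task, two personas: only the user behavior differs between the runs.
task = "Move the flight on reservation ABC123 to the following day."
expert_prompt = build_user_prompt(EXPERT_PERSONA, task)
non_expert_prompt = build_user_prompt(NON_EXPERT_PERSONA, task)
```

Because the task text is identical in both prompts, any difference in agent behavior between the two runs can be attributed to the persona rather than to the instruction.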

Design 2: Grading Notes + Efficiency Metrics
- Function: Unify all sub-goals (tool calls, response content, etc.) into natural-language checklist items; introduce metrics including MaxProgressRate@k, MaxAUC@k, and MaxPPT@k.
- Mechanism: \(\text{progress}(i) = \text{fraction of grading notes achieved}\); MaxProgressRate@k is the expected maximum progress across \(k\) trials. AUC measures early-stage efficiency, and PPT measures per-turn progress rate.
- Design Motivation: Success rate is too coarse-grained; partial progress and conversational turn efficiency must be captured.
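The following sketch shows one way these quantities could be computed. It rests on assumptions that the summary does not confirm: the expected maximum is estimated over all size-\(k\) subsets of \(n\) recorded trials (analogous to the usual unbiased pass@k estimator), per-trial AUC is taken as the mean of the cumulative progress curve over turns, and PPT as final progress divided by the number of turns. All function names are hypothetical.

```python
from math import comb
from typing import Sequence


def progress(notes_achieved: Sequence[bool]) -> float:
    """Fraction of grading notes judged as achieved in one trial."""
    return sum(notes_achieved) / len(notes_achieved)


def expected_max_at_k(scores: Sequence[float], k: int) -> float:
    """Expected maximum score over k trials drawn without replacement from n.

    Assumed estimator: after sorting ascending, the j-th smallest score
    (1-indexed) is the maximum of exactly C(j-1, k-1) of the C(n, k) subsets.
    """
    n = len(scores)
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    ordered = sorted(scores)
    weighted = sum(comb(j - 1, k - 1) * s for j, s in enumerate(ordered, start=1))
    return weighted / comb(n, k)


def progress_auc(curve: Sequence[float]) -> float:
    """Assumed per-trial AUC: mean cumulative progress over turns.

    A trial that makes progress early scores higher than one that
    reaches the same final progress late.
    """
    return sum(curve) / len(curve)


def progress_per_turn(curve: Sequence[float]) -> float:
    """Assumed per-trial PPT: final progress divided by number of turns."""
    return curve[-1] / len(curve)


# Four trials of the same task, each with four grading notes.
trials = [
    [True, True, False, False],   # progress 0.50
    [True, True, True, True],     # progress 1.00
    [True, True, True, False],    # progress 0.75
    [True, False, False, False],  # progress 0.25
]
per_trial = [progress(t) for t in trials]
max_progress_at_2 = expected_max_at_k(per_trial, k=2)
```

With \(k = 1\) the estimator reduces to the mean over trials, and with \(k = n\) it reduces to the best trial. Under the same assumptions, MaxAUC@k and MaxPPT@k would apply `expected_max_at_k` to the per-trial AUC and PPT values.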

Design 3: Automated Error Discovery
- Function: Two-stage error analysis: low-level error identification followed by semantic clustering.
- Mechanism: For sub-goals where judge and agent disagree, an LLM extracts specific low-level error descriptions; these are then semantically clustered into high-level error categories. Judge variance and agent variance reflect judge unreliability and agent instability, respectively.
- Design Motivation: Close the loop from metric reporting → error discovery → improvement recommendations.
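The split between judge variance and agent variance can be sketched as follows. The definitions are assumptions made for illustration and may differ from the paper's: for a single sub-goal, the same trajectory is judged several times per trial; judge variance is the average within-trial variance of those verdicts, and agent variance is the variance across trials of the majority verdict. The LLM-based error extraction and clustering steps are omitted because they depend on model calls.

```python
from statistics import mean, pvariance
from typing import Sequence


def majority(votes: Sequence[bool]) -> bool:
    """Majority vote over repeated judge runs; a tie counts as not achieved."""
    return sum(votes) * 2 > len(votes)


def judge_variance(verdicts: Sequence[Sequence[bool]]) -> float:
    """Assumed definition: mean within-trial variance of repeated judge verdicts.

    verdicts[t][r] is the verdict of judge run r on the trajectory of trial t.
    """
    return mean(pvariance([float(v) for v in runs]) for runs in verdicts)


def agent_variance(verdicts: Sequence[Sequence[bool]]) -> float:
    """Assumed definition: variance across trials of the majority verdict."""
    return pvariance([float(majority(runs)) for runs in verdicts])


# The judge is consistent within each trial, but the agent succeeds in only one trial.
unstable_agent = [[True, True, True], [False, False, False]]

# The agent outcome is the same by majority, but individual judge runs disagree.
noisy_judge = [[True, False, True], [True, True, False]]
```

Under these definitions, `unstable_agent` has zero judge variance and non-zero agent variance, while `noisy_judge` shows the opposite pattern, which is the distinction the text draws between agent instability and judge unreliability.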

Loss & Training

No training is involved; TED is purely an evaluation framework. The LLM-as-judge is run multiple times and the verdicts are aggregated by majority voting. GPT-4.1 serves as both the judge and the user proxy.

Key Experimental Results

Main Results

τ²-bench Airline Easy (each cell reports Expert \| Non-expert)

| Agent Model | MeanProg@k | MaxProg@k | pass@k |
| --- | --- | --- | --- |
| gpt-4.1 | 0.95 \| 0.82 | 1.00 \| 1.00 | 1.00 \| 1.00 |
| gpt-4o | 0.79 \| 0.86 | 1.00 \| 1.00 | 1.00 \| 1.00 |
| gpt-4o-mini | 0.70 \| 0.61 | 0.90 \| 0.90 | 0.80 \| 0.80 |
| gpt-5 | 0.92 \| 0.92 | 1.00 \| 1.00 | 1.00 \| 1.00 |

Ablation Study

| Finding | Description |
| --- | --- |
| Expert vs. Non-expert | Non-expert users systematically reduce agent MeanProg across most models |
| Performance gain after error fixing | 8–10% improvement in MaxProgressRate |
| Judge variance analysis | High-variance sub-goals are predominantly associated with ambiguously described grading notes |

Key Findings

User expertise systematically affects agent performance — non-expert users lead to more turns and lower average progress.
MaxProgressRate@k provides finer-grained evaluation than pass@k, distinguishing "near success" from "complete failure."
Common error patterns identified through automated error analysis can be directly used to improve agent prompts, yielding 8–10% gains.
GPT-5 underperforms GPT-4o on certain ToolSandbox baselines, demonstrating that model upgrades do not necessarily translate to improved agent capabilities.

Highlights & Insights

The Talk–Evaluate–Diagnose three-stage closed-loop design is both comprehensive and practically oriented.
The persona decoupling idea is concise yet consequential — isolating the user factor is a prerequisite for fair evaluation.
The complete loop from evaluation to diagnosis to improvement goes beyond merely "reporting scores."

Limitations & Future Work

Constructing grading notes still requires manual effort, limiting the degree of automation.
Only two persona types (expert/non-expert) are considered; finer-grained user modeling remains unexplored.
The reliability of the judge itself is a systemic risk that requires further validation.

Related Work & Insights

AgentBoard first introduced progress rate, but in an environment-interaction setting; TED extends it to multi-turn dialogue.
τ²-bench employs domain-specific personas that do not generalize; TED's persona templates are independent of specific tasks and agents.
Insight: Agent evaluation should be an integral part of the engineering feedback loop, rather than a standalone academic exercise.

Rating

| Dimension | Score |
| --- | --- |
| Novelty | ★★★★☆ |
| Practicality | ★★★★★ |
| Experimental Thoroughness | ★★★★☆ |
| Writing Quality | ★★★★★ |
