Talk, Evaluate, Diagnose: User-aware Agent Evaluation with Automated Error Analysis¶
Conference: ICLR 2026 · arXiv: 2603.15483 · Code: GitHub · Area: LLM Evaluation · Keywords: Agent Evaluation, User Awareness, LLM-as-Judge, Error Analysis, Efficiency Metrics
TL;DR¶
This paper proposes TED (Talk, Evaluate, Diagnose), a framework for user-aware dynamic agent evaluation. TED simulates users with general, reusable expert/non-expert persona templates; enables fine-grained efficiency assessment through grading notes, LLM-as-judge scoring, and novel metrics such as MaxProgressRate@k; and produces actionable improvement feedback via automated error discovery and clustering. Experiments on τ²-bench and ToolSandbox surface new insights into agent performance.
Background & Motivation¶
- Background: LLM agents are increasingly deployed to automate diverse workflows, yet evaluation frameworks remain fragmented—each domain relies on its own methodology (database queries, regex matching, etc.) to determine task success.
- Limitations of Prior Work: (1) No unified cross-domain evaluation methodology exists; (2) the effect of user personas on agent behavior is not systematically considered; (3) evaluation stops at metric reporting, lacking diagnosis and actionable improvement guidance.
- Key Challenge: Agent behavior is heavily shaped by user interaction, yet user personas are left uncontrolled during evaluation.
- Goal: Construct a unified, user-aware, and diagnosable agent evaluation framework.
- Key Insight: A three-stage unification of Talk (user simulation), Evaluate (assessment), and Diagnose (error analysis).
- Core Idea: Effective agent evaluation requires not only correctness, but also conversation quality, efficiency, and systematic error diagnosis.
Method¶
Overall Architecture¶
- Talk → Simulate expert/non-expert user interactions with the agent via reusable persona templates.
- Evaluate → Convert sub-goals into grading notes, score with LLM-as-judge, and compute metrics such as MaxProgressRate@k.
- Diagnose → Analyze judge–agent inconsistencies, then automatically discover and cluster error patterns.
Key Designs¶
Design 1: General, Reusable Persona Templates
- Function: Decouple the user persona from task instructions, providing general expert/non-expert templates that are independent of specific tasks and agents.
- Mechanism: \(u = f(p, i)\), combining a persona prompt \(p\) with a task instruction \(i\). Swapping the persona on the same task isolates the effect of user behavior. A reflect-then-respond two-step process is included; see the sketch below.
- Design Motivation: Existing methods tightly couple persona with task, making it impossible to isolate the independent effect of user behavior.
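To make the decoupling concrete, here is a minimal Python sketch of \(u = f(p, i)\) with the reflect-then-respond step folded into the user prompt. The template wording and the `build_user_prompt` helper are illustrative assumptions, not the paper's exact prompts.

```python
# Hypothetical persona templates; the paper's actual wording may differ.
EXPERT_PERSONA = (
    "You are a domain-savvy user. State your goal precisely, "
    "use correct terminology, and answer the agent's questions directly."
)

NON_EXPERT_PERSONA = (
    "You are unfamiliar with this domain. Describe your goal loosely, "
    "ask for clarification often, and avoid technical terms."
)

def build_user_prompt(persona: str, instruction: str) -> str:
    """Compose the simulated-user prompt u = f(p, i).

    Swapping `persona` while holding `instruction` fixed isolates
    the effect of user behavior on the agent.
    """
    return (
        f"{persona}\n\n"
        f"Your underlying task: {instruction}\n\n"
        # Reflect-then-respond: reason about the agent's last message
        # first, then produce the actual user reply.
        "Before replying, briefly reflect on the agent's last message, "
        "then write your response as the user."
    )
```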
Design 2: Grading Notes + Efficiency Metrics
- Function: Unify all sub-goals (tool calls, response content, etc.) into natural-language checklist items; introduce metrics including MaxProgressRate@k, MaxAUC@k, and MaxPPT@k.
- Mechanism: \(\text{progress}(i)\) is the fraction of grading notes achieved in trial \(i\); MaxProgressRate@k is the expected maximum progress across \(k\) trials (see the estimator sketch below). AUC measures early-stage efficiency, and PPT measures per-turn progress.
- Design Motivation: Success rate alone is too coarse-grained; partial progress and conversational turn efficiency must also be captured.
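A minimal sketch of the progress and MaxProgressRate@k computations, assuming an unbiased order-statistic estimator analogous to the standard pass@k estimator; the paper's exact formulation may differ.

```python
from math import comb

def progress(achieved_notes: list[bool]) -> float:
    """progress(i): fraction of grading-note checklist items satisfied."""
    return sum(achieved_notes) / len(achieved_notes)

def max_progress_rate_at_k(trial_progress: list[float], k: int) -> float:
    """Expected maximum progress over k trials sampled from n runs."""
    n = len(trial_progress)
    assert 1 <= k <= n
    s = sorted(trial_progress)  # ascending
    # s[i] is the maximum of a random size-k subset iff the other k-1
    # elements come from the i values below it: C(i, k-1) of C(n, k) subsets.
    return sum(comb(i, k - 1) * s[i] for i in range(k - 1, n)) / comb(n, k)
```

For example, `max_progress_rate_at_k([0.6, 1.0, 0.7, 0.8], k=2)` returns about 0.88, the average maximum progress over all size-2 subsets of the four trials; with \(k = 1\) the estimator reduces to mean progress, and with \(k = n\) to the best single trial.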
Design 3: Automated Error Discovery
- Function: Two-stage error analysis: low-level error identification followed by semantic clustering.
- Mechanism: For sub-goals where the judge and the agent disagree, an LLM extracts specific low-level error descriptions, which are then semantically clustered into high-level error categories (see the sketch below). Judge variance and agent variance reflect judge unreliability and agent instability, respectively.
- Design Motivation: Close the loop from metric reporting → error discovery → improvement recommendations.
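A hedged sketch of the two-stage pipeline. `llm` is a hypothetical completion function, and both the prompts and the LLM-based clustering call are illustrative choices rather than the paper's implementation.

```python
def discover_errors(transcripts, judge_verdicts, llm):
    """Stage 1: extract low-level error descriptions for failed sub-goals;
    Stage 2: cluster them semantically into high-level categories."""
    low_level = []
    for transcript, verdict in zip(transcripts, judge_verdicts):
        for note, passed in verdict.items():
            if not passed:  # judge marked this grading note as unmet
                low_level.append(llm(
                    "Read the conversation and explain in one sentence why "
                    f"the agent failed this sub-goal: {note}\n\n{transcript}"
                ))
    # Semantic clustering via the LLM itself (embedding-based clustering
    # would be an equally plausible implementation).
    return llm(
        "Group these error descriptions into named high-level categories:\n"
        + "\n".join(f"- {d}" for d in low_level)
    )
```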
Loss & Training¶
No training is involved; TED is purely an evaluation framework. The LLM-as-judge is run multiple times with majority voting (sketched below), and GPT-4.1 serves as both the judge and the simulated-user proxy.
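A minimal sketch of the majority-voting judge, assuming a boolean per-note verdict interface; the paper uses GPT-4.1 as the judge but does not prescribe this exact API.

```python
from collections import Counter

def majority_vote_judge(judge_fn, transcript, note, runs: int = 5) -> bool:
    """Run the LLM judge several times on one grading note and return the
    majority verdict, damping single-sample judge noise. `judge_fn` is a
    hypothetical callable mapping (transcript, note) -> bool."""
    votes = Counter(judge_fn(transcript, note) for _ in range(runs))
    return votes.most_common(1)[0][0]
```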
Key Experimental Results¶
Main Results¶
τ²-bench Airline Easy (each metric reported as Expert / Non-expert)

| Agent Model | MeanProg@k | MaxProg@k | pass@k |
|---|---|---|---|
| gpt-4.1 | 0.95 / 0.82 | 1.00 / 1.00 | 1.00 / 1.00 |
| gpt-4o | 0.79 / 0.86 | 1.00 / 1.00 | 1.00 / 1.00 |
| gpt-4o-mini | 0.70 / 0.61 | 0.90 / 0.90 | 0.80 / 0.80 |
| gpt-5 | 0.92 / 0.92 | 1.00 / 1.00 | 1.00 / 1.00 |
Ablation Study¶
| Finding | Description |
|---|---|
| Expert vs. Non-expert | Non-expert users systematically reduce agent MeanProg across most models |
| Performance gain after error fixing | 8–10% improvement in MaxProgressRate |
| Judge variance analysis | High-variance sub-goals are predominantly associated with ambiguously described grading notes |
Key Findings¶
- User expertise systematically affects agent performance — non-expert users lead to more turns and lower average progress.
- MaxProgressRate@k provides finer-grained evaluation than pass@k, distinguishing "near success" from "complete failure."
- Common error patterns identified through automated error analysis can be directly used to improve agent prompts, yielding 8–10% gains.
- GPT-5 underperforms GPT-4o on certain ToolSandbox baselines, demonstrating that model upgrades do not necessarily translate to improved agent capabilities.
Highlights & Insights¶
- The Talk–Evaluate–Diagnose three-stage closed-loop design is both comprehensive and practically oriented.
- The persona decoupling idea is concise yet consequential — isolating the user factor is a prerequisite for fair evaluation.
- The complete loop from evaluation to diagnosis to improvement goes beyond merely "reporting scores."
Limitations & Future Work¶
- Constructing grading notes still requires manual effort, limiting the degree of automation.
- Only two persona types (expert/non-expert) are considered; finer-grained user modeling remains unexplored.
- The reliability of the judge itself is a systemic risk that requires further validation.
Related Work & Insights¶
- AgentBoard first introduced the progress-rate metric, but in an environment-interaction setting; TED extends it to multi-turn dialogue.
- τ²-bench employs domain-specific personas that do not generalize; TED generalizes them via task-independent templates.
- Insight: Agent evaluation should be an integral part of the engineering feedback loop, rather than a standalone academic exercise.
Rating¶
| Dimension | Score |
|---|---|
| Novelty | ★★★★☆ |
| Practicality | ★★★★★ |
| Experimental Thoroughness | ★★★★☆ |
| Writing Quality | ★★★★★ |