
AgencyBench: Benchmarking the Frontiers of Autonomous Agents in 1M-Token Real-World Contexts

Conference: ACL 2026 · arXiv: 2601.11044 · Code: GitHub · Area: LLM Agent / Benchmark · Keywords: autonomous agents, long-horizon tasks, real-world benchmark, user simulation, Docker sandbox evaluation

TL;DR

This paper proposes AgencyBench, a comprehensive benchmark of 138 real-world tasks that evaluates six core agent capabilities. Each scenario requires an average of 90 tool calls and 1M tokens. Fully automated evaluation is achieved via a user simulation agent and a Docker sandbox.

Background & Motivation

Background: LLM-based autonomous agents are increasingly deployed across software development, scientific research, and everyday use, yet evaluation benchmarks have lagged significantly behind the growth in agent capabilities.

Limitations of Prior Work: (1) Existing benchmarks focus on isolated capabilities (e.g., tool use or software engineering) and fail to capture the multi-dimensional, long-horizon nature of real-world tasks; (2) real-task evaluation relies on human-in-the-loop feedback, becoming a bottleneck for automated assessment; (3) task complexity is insufficient — most benchmarks require only tens of tool calls.

Key Challenge: The capabilities of frontier agents have far surpassed the scope of existing benchmarks, necessitating substantially more challenging evaluations.

Goal: Construct a high-complexity, multi-dimensional, fully automated benchmark for evaluating real-world agents.

Key Insight: Twenty human experts (AI researchers and developers) collect tasks from authentic work scenarios, organized into a hierarchical capability–scenario–task taxonomy.

Core Idea: Replace human feedback with a user simulation agent and employ Docker sandbox execution with visual evaluation to enable fully automated rollout collection and scoring for long-horizon, complex tasks.

Method

Overall Architecture

A hierarchical design is adopted: six core capabilities (game development, frontend, backend, code generation, research, MCP tools) → 32 real-world scenarios → 138 specific tasks. Each scenario contains 1–5 sequentially ordered tasks of increasing difficulty, where the results of earlier tasks affect subsequent ones. Evaluation is conducted across three isolated spaces — workspace, sandbox, and evalspace — to ensure environmental separation.
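
To make the hierarchy concrete, the sketch below models the capability → scenario → task structure and the three isolated spaces with plain Python dataclasses. All class names, field names, and rubric strings are illustrative assumptions, not the paper's actual data format.

```python
# Minimal sketch of the benchmark's hierarchy; all names are hypothetical.
from dataclasses import dataclass, field
from enum import Enum


class Space(Enum):
    WORKSPACE = "workspace"   # where the agent edits files and runs tools
    SANDBOX = "sandbox"       # Docker container that executes deliverables
    EVALSPACE = "evalspace"   # holds rubrics and scoring scripts, isolated from the agent


@dataclass
class Task:
    description: str
    rubric: list[str]        # rubric items, scored on a 0-10 scale
    difficulty: int          # position within the scenario's 1-5 task sequence


@dataclass
class Scenario:
    name: str
    tasks: list[Task] = field(default_factory=list)  # ordered, increasing difficulty


@dataclass
class Capability:
    name: str                # e.g., "game development", "MCP tools"
    scenarios: list[Scenario] = field(default_factory=list)


# Example instance based on the Gomoku scenario described below; rubric strings
# are placeholders, not the paper's actual rubric items.
gomoku = Scenario(
    name="Gomoku game",
    tasks=[
        Task("Implement a basic playable board", ["board renders", "moves register"], 1),
        Task("Add an AI opponent", ["AI makes legal moves"], 2),
        Task("Add undo functionality and theme switching", ["undo works", "themes toggle"], 3),
    ],
)
```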

Key Designs

  1. User Simulation Agent:

    • Function: Provides iterative feedback in place of a human during multi-turn interactions.
    • Mechanism: Simulates authentic user behavior — when the agent submits an intermediate result, the simulation agent provides revision suggestions or confirmations based on the task description and rubric.
    • Design Motivation: Eliminates the human-in-the-loop bottleneck, allowing rollouts that take hours to complete to run fully automatically (see the rollout sketch after this list).
  2. Docker Sandbox Evaluation:

    • Function: Performs visual and functional evaluation of code and files produced by the agent.
    • Mechanism: Deliverables are synchronized into a Docker container, where human-computer interactions are simulated (UI rendering, mouse clicks, screen recording) to generate visual artifacts, which are then scored by evaluation scripts and an LLM judge against the rubric.
    • Design Motivation: Many real-world task outputs (e.g., games, web pages) cannot be assessed through text alone and require actual execution and visual inspection.
  3. Hierarchical Task Design:

    • Function: Simulates the progressive complexity of real-world workflows.
    • Mechanism: Each scenario's 1–5 tasks increase in difficulty, with earlier completions affecting later ones — for example, a "Gomoku game" scenario progresses from a basic board to adding an AI opponent, undo functionality, and theme switching.
    • Design Motivation: Real-world tasks are never completed in a single step; this design tests the agent's ability to maintain context and engage in long-horizon planning.
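
Taken together, the user simulation agent and the Docker sandbox form a single automated rollout loop: the agent iterates on a deliverable under simulated user feedback, and the final result is executed and scored in the sandbox. The sketch below outlines that loop as a minimal illustration under assumed interfaces; `agent.solve`, `user_sim.review`, `sandbox.evaluate`, and the `max_rounds` cutoff are hypothetical, not the benchmark's actual API.

```python
# Hypothetical rollout loop; method names and the round cutoff are assumptions
# for illustration, not details specified by the paper.
def run_rollout(agent, task, user_sim, sandbox, max_rounds=5):
    feedback = None
    for _ in range(max_rounds):
        # The agent plans, calls tools, and submits an intermediate deliverable.
        deliverable = agent.solve(task.description, feedback=feedback)

        # The user simulation agent stands in for a human reviewer: it reads the
        # task description and rubric, then confirms or requests revisions.
        feedback = user_sim.review(task, deliverable)
        if feedback.is_confirmation:
            break

    # Deliverables are synced into a Docker container, human-computer interaction
    # is simulated (UI rendering, clicks, screen recording), and the resulting
    # artifacts are scored by scripts plus an LLM judge against the rubric.
    return sandbox.evaluate(task, deliverable)
```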

Evaluation & Scoring

Evaluation uses rubric-based scoring on a 0–10 scale, combining rule-based evaluation scripts with an LLM-based judge. A unanimous agreement strategy is applied for data quality — all four expert reviewers must approve a task before it is included.
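
The paper does not spell out how the rule-based script results and the LLM-judge score are merged, so the snippet below simply averages the two signals on the 0-10 scale; the function name and the averaging rule are assumptions, not the benchmark's actual aggregation.

```python
# Illustrative aggregation only: the exact combination of script checks and the
# LLM judge is an assumption here, not taken from the paper.
def score_task(rule_checks: list[bool], judge_score: float) -> float:
    rule_score = 10.0 * sum(rule_checks) / max(len(rule_checks), 1)
    judge_score = min(max(judge_score, 0.0), 10.0)  # clamp the LLM judge to 0-10
    return (rule_score + judge_score) / 2.0
```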

Key Experimental Results

Main Results

| Model Type    | Avg. Score | Highest          | Lowest                 |
|---------------|------------|------------------|------------------------|
| Closed-source | 48.4%      | GPT-5.2 (56.5%)  | Grok-4.1-Fast (44.3%)  |
| Open-source   | 32.1%      | GLM-4.6 (38.6%)  | Qwen-3-235B (27.0%)    |

Key Behavioral Differences

| Model           | Characteristic                   | Description                                            |
|-----------------|----------------------------------|--------------------------------------------------------|
| GPT-5.2         | Strong feedback self-correction  | Most effective at incorporating user feedback          |
| Grok-4.1-Fast   | High token efficiency            | Completes tasks with fewer tokens                      |
| Claude-4.5-Opus | Preference for shell tools       | More frequent use of command-line operations           |
| Gemini-3-Pro    | Preference for file management   | More frequent use of file and memory management tools  |

Key Findings

  • Closed-source models substantially outperform open-source models (48.4% vs. 32.1%), with a larger gap than observed on short-task benchmarks.
  • A pronounced "home advantage" effect is observed — models perform best within their native frameworks (e.g., Claude + Claude-Agent-SDK).
  • Even the strongest current model reaches only 56.5%, indicating that long-horizon real-world tasks remain a formidable challenge.
  • Distinct tool-use preferences across models suggest influences from architectural differences and training data.

Highlights & Insights

  • Task complexity far exceeds existing benchmarks — an average of 90 tool calls and 1M tokens represents a qualitative leap.
  • The combination of user simulation agent and Docker sandbox addresses the core challenge of automated evaluation for long-horizon tasks.
  • The "home advantage" finding has important implications for agent framework design — general-purpose frameworks may be outperformed by specialized ones.

Limitations & Future Work

  • 138 tasks may still be insufficient to comprehensively cover real-world scenarios.
  • The quality of the user simulation agent constitutes an upper bound on evaluation reliability.
  • The complexity of Docker sandbox configuration may hinder community adoption.
  • Future work could extend coverage to additional domains such as data analysis, design, and writing.

Comparison with Related Benchmarks

  • vs. SWE-bench: SWE-bench focuses on a single software engineering capability, whereas AgencyBench covers six capability dimensions.
  • vs. GAIA: GAIA averages only 10K tokens; AgencyBench operates at 100× the complexity.
  • vs. ToolLLM: ToolLLM targets tool-call correctness, while AgencyBench focuses on end-to-end task completion.

Rating

  • Novelty: ⭐⭐⭐⭐⭐ Substantially surpasses existing benchmarks in scale and real-world fidelity.
  • Experimental Thoroughness: ⭐⭐⭐⭐ Multi-model comparison, behavioral analysis, and framework-level evaluation.
  • Writing Quality: ⭐⭐⭐⭐ Clear structure with rich illustrative examples.
  • Value: ⭐⭐⭐⭐⭐ Sets a new standard for next-generation agent evaluation.