E2EDev: Benchmarking Large Language Models in End-to-End Software Development Task¶
Conference: ACL 2026
arXiv: 2510.14509
Code: https://github.com/SCUNLP/E2EDev
Area: LLM Evaluation
Keywords: end-to-end software development, behavior-driven development, benchmark, multi-agent coding, requirements verification
TL;DR¶
This paper proposes E2EDev, an end-to-end software development benchmark grounded in Behavior-Driven Development (BDD) principles. It comprises 46 real-world web projects, 244 fine-grained requirements, and 703 executable BDD tests. Evaluation reveals that even the strongest LLMs (Claude series) achieve no more than 60% requirement accuracy, and that the interaction overhead of multi-agent frameworks is disproportionate to their performance gains.
Background & Motivation¶
Background: LLM-driven end-to-end software development (E2ESD) is evolving from function-level code generation toward full-project automation. Existing frameworks span multi-agent approaches (ChatDev, MetaGPT) and single-agent approaches (GPT-Engineer), yet evaluation methodology lags significantly behind framework development.
Limitations of Prior Work: (1) Existing benchmarks (SoftwareDev, SRDD) rely on coarse-grained requirement descriptions as input — vague phrasings such as "manage words" fail to specify whether users need editing, bookmarking, or deletion functionality; (2) Evaluation depends on subjective human judgment or heuristic metrics, lacking a systematic methodology grounded in software engineering standards, rendering cross-framework comparisons inconsistent and unreliable.
Key Challenge: E2ESD tasks require simultaneously performing high-level planning (deciding what to build) and fine-grained functional implementation (precisely satisfying requirement details). Ambiguous requirements and unreliable evaluation in existing benchmarks prevent a genuine understanding of where framework capabilities break down.
Goal: (1) Construct an E2ESD benchmark with fine-grained requirement specifications; (2) Design an automated BDD-based evaluation pipeline; (3) Systematically analyze the actual capabilities and failure modes of various frameworks and LLMs on E2ESD tasks.
Key Insight: Drawing on Behavior-Driven Development (BDD) principles from software engineering, this work employs Given-When-Then Gherkin scenarios to simulate real user interactions, enabling requirement verification from a user perspective.
Core Idea: Transform E2ESD evaluation from subjective human scoring to executable BDD tests grounded in fine-grained requirements, deterministically verifying the requirement conformance of generated code by simulating authentic user interactions.
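To make the Given-When-Then idea concrete, here is a minimal, self-contained Python sketch of how a Gherkin-style scenario maps onto executable step functions. The toy `TodoApp` and step names are hypothetical; the actual benchmark drives real web projects via the Behave framework and browser automation.

```python
# Minimal BDD-style sketch: Given-When-Then steps exercised against a toy
# to-do app. TodoApp and the step functions are illustrative stand-ins,
# not the benchmark's actual harness.

class TodoApp:
    """Stand-in for the application under test."""
    def __init__(self):
        self.items = []

    def add(self, text):
        self.items.append({"text": text, "done": False})

# --- Step implementations (Given / When / Then) ---

def given_an_empty_todo_list():
    return TodoApp()

def when_the_user_adds_an_item(app, text):
    app.add(text)

def then_the_list_shows_the_item(app, text):
    # A Then-step verifies observable behavior, i.e. the requirement
    # from the user's perspective, not internal implementation details.
    assert any(i["text"] == text for i in app.items), f"'{text}' not found"

# Scenario: "Given an empty to-do list, When the user adds 'buy milk',
# Then the list shows 'buy milk'."
app = given_an_empty_todo_list()
when_the_user_adds_an_item(app, "buy milk")
then_the_list_shows_the_item(app, "buy milk")
print("scenario passed")
```

Because each step either completes or raises, a scenario yields a deterministic pass/fail signal rather than a subjective score.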
Method¶
Overall Architecture¶
E2EDev consists of three components: (1) a fine-grained user requirement list for each software project; (2) multiple BDD test scenarios with corresponding Python step implementations for each requirement; and (3) a fully automated testing pipeline based on the Behave framework. The dataset is constructed from 46 real-world GitHub web projects via HITL-MAA (Human-In-The-Loop Multi-Agent Annotation framework).
Key Designs¶
- HITL-MAA Annotation Framework:
- Function: Semi-automatically extracts fine-grained requirements and executable tests from source code.
- Mechanism: A three-stage pipeline — (a) a Code Analyzer Agent analyzes core functionalities and UI element interactions, a Requirement Extractor Agent generates candidate requirements, and human reviewers ensure accuracy; (b) a Test Case Generation Agent produces Gherkin-format BDD scenarios for each requirement, reviewed collaboratively by five software testing experts; (c) a Test Automation Engineer Agent generates Python step implementations, with iterative self-correction via a Dry Run Verifier and Test Runner resolving over 80% of logical errors without human intervention.
- Design Motivation: Pure manual annotation is prohibitively expensive, while pure LLM generation yields unstable quality; human-machine collaboration balances efficiency and quality.
- Test ID Anchor System:
- Function: Assigns unique test IDs to UI components as structurally invariant DOM anchors.
- Mechanism: Prior to requirement and test generation, GPT-4o assigns unique test IDs to key UI components, ensuring consistent cross-project reference regardless of DOM structural variation.
- Design Motivation: Projects generated by different frameworks exhibit substantial HTML structural differences, necessitating stable anchors to execute tests consistently across projects.
- Multi-Level Evaluation Metric Suite:
- Function: Assesses code validity at both the requirement level and test level, while measuring generation efficiency.
- Mechanism: Req. Acc measures the proportion of fully satisfied requirements (all test cases pass); Test Acc measures the proportion of passing tests; Balanced Score weights both to eliminate test granularity bias. Efficiency metrics include API cost, carbon emissions, and wall-clock time.
- Design Motivation: Test pass rates alone may be biased by uneven test counts per requirement; requirement-level metrics more faithfully reflect real user experience.
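The metric definitions above can be sketched in a few lines of Python. The input format and the Balanced Score weighting (here, a simple mean of Req. Acc and Test Acc) are assumptions for illustration, not the paper's exact formulas:

```python
# Illustrative metric computation over per-requirement BDD results.
# The Balanced Score weighting (mean of the two accuracies) is an
# assumed, simplified stand-in for the paper's definition.

def evaluate(results):
    """results: one inner list per requirement,
    one boolean per test case (True = pass)."""
    n_req = len(results)
    n_tests = sum(len(r) for r in results)
    # Req. Acc: a requirement counts only if ALL of its tests pass.
    req_acc = sum(all(r) for r in results) / n_req
    # Test Acc: fraction of individual tests passing overall.
    test_acc = sum(sum(r) for r in results) / n_tests
    # Balanced Score: weights both to damp test-granularity bias.
    balanced = (req_acc + test_acc) / 2
    return req_acc, test_acc, balanced

# One requirement with 5 tests (4 pass) and two with 1 passing test each:
req_acc, test_acc, balanced = evaluate([[True] * 4 + [False], [True], [True]])
print(req_acc, test_acc, balanced)  # ~0.667, ~0.857, ~0.762
```

The example shows the granularity bias the paper guards against: a single failing edge-case test drags Req. Acc well below Test Acc, because the requirement-level metric demands that every scenario for a requirement pass.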
Loss & Training¶
E2EDev is an evaluation benchmark and does not involve model training. The evaluation pipeline automatically executes Python steps corresponding to Gherkin scenarios via the Behave framework, performing deterministic pass/fail verification on each generated project.
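The test-ID anchor idea can be sketched with the standard library alone: locate an element by a `data-testid` attribute regardless of where it sits in the DOM. The real pipeline resolves anchors inside a live browser (e.g. via a CSS selector such as `[data-testid=...]`); the attribute name and markup below are illustrative assumptions:

```python
from html.parser import HTMLParser

# Stdlib sketch of the test-ID anchor idea: find an element by its
# data-testid attribute independently of the surrounding DOM structure.
# Attribute name and HTML are illustrative, not the benchmark's own.

class TestIdFinder(HTMLParser):
    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.found_tag = None

    def handle_starttag(self, tag, attrs):
        if dict(attrs).get("data-testid") == self.target_id:
            self.found_tag = tag

def find_by_testid(html, testid):
    finder = TestIdFinder(testid)
    finder.feed(html)
    return finder.found_tag  # tag name, or None if absent

# Two structurally different pages exposing the same anchor:
page_a = '<div><button data-testid="submit-btn">Go</button></div>'
page_b = '<form><section><input data-testid="submit-btn"></section></form>'
print(find_by_testid(page_a, "submit-btn"))  # button
print(find_by_testid(page_b, "submit-btn"))  # input
```

This is why the same BDD step implementation can run unchanged against projects generated by different frameworks: the anchor survives arbitrary changes in nesting and tag choice.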
Key Experimental Results¶
Main Results¶
Requirement Accuracy (Req. Acc %) across Frameworks and LLM Backbones
| LLM Backbone | Vanilla LLM | GPT-Engineer | Self-Collab. | MapCoder | ChatDev | MetaGPT |
|---|---|---|---|---|---|---|
| Claude-Haiku 4.5 | 48.69 | 53.75 | 49.01 | 49.61 | 44.73 | 5.39 |
| GPT-4o | 45.95 | 50.83 | 46.83 | 47.70 | 42.71 | 0.00 |
| GPT-4o-mini | 44.82 | 42.13 | 37.90 | 41.30 | 33.16 | 0.00 |
| Qwen-Max | 43.33 | 49.61 | 42.30 | 48.83 | 43.93 | 1.65 |
| Qwen-7B | 22.37 | 24.03 | 20.65 | 11.90 | 10.96 | 0.00 |
Ablation Study¶
Failure Mode Analysis (Manual Evaluation of 360 Projects)
| Failure Type | Description | Primarily Affected Frameworks |
|---|---|---|
| Code Inconsistency | Missing/conflicting/empty functions | MetaGPT (44% attributable) |
| Requirement Omission | Required functionality not implemented | Vanilla LLM, ChatDev |
| Requirement Deviation | Implementation logic diverges from requirements | All frameworks (notably reduced in multi-agent settings) |
| Detail Mismatch | Mostly correct but edge-case errors | Most severe in Self-Collaboration |
Key Findings¶
- Even the strongest combination of Claude-Haiku 4.5 + GPT-Engineer achieves only 53.75% Req. Acc, indicating that E2ESD remains a formidable challenge.
- MetaGPT approaches 0% success across nearly all LLM backbones, rooted in inter-agent communication breakdown — programmers ignore architects' file structures, and product managers overwrite and compress original requirements.
- Multi-agent frameworks incur substantial interaction overhead (ChatDev averages 15.72 dialogue turns) with limited performance gains, sometimes underperforming Vanilla LLM.
- The gap between Soft Req. Acc (a relaxed requirement-level variant) and strict Req. Acc exceeds 25%, indicating that models can implement basic functionality but fail to handle complex edge cases.
- Framework performance is heavily dependent on backbone capability; weaker models can be further degraded by framework overhead.
Highlights & Insights¶
- Introducing BDD methodology into LLM evaluation represents an insightful cross-domain transfer — applying the mature software engineering practice of Given-When-Then to the verification of AI-generated code.
- The iterative self-correction mechanism in HITL-MAA (Dry Run + Test Runner) resolves 80% of logical errors, demonstrating the practical utility of LLMs within annotation pipelines.
- Failure mode analysis exposes a fundamental issue in multi-agent architectures: information is progressively diluted as it passes between agents, preserving high-level functionality while losing fine-grained details.
Limitations & Future Work¶
- Coverage is limited to web applications; while the authors argue this constitutes a "lower-bound test," challenges in desktop, mobile, and backend applications may differ substantially.
- The benchmark scale of 46 projects is constrained by the high cost of repository-level benchmark construction.
- CI/CD integration and deep backend verification are excluded in favor of browser-automation-based black-box testing.
- Future work may extend the benchmark into a continuously updated public leaderboard supporting longitudinal evaluation.
Related Work & Insights¶
- vs. rSDE-Bench: rSDE-Bench uses function-level unit tests to verify outputs; E2EDev uses BDD tests to verify behavior from a user perspective, at a granularity more aligned with real usage scenarios.
- vs. SoftwareDev/SRDD: These rely on vague descriptions and human evaluation; E2EDev provides fine-grained requirements and automated deterministic assessment.
- vs. Mle-Bench/GitTaskBench: These focus on ML pipelines and repository operations, respectively; E2EDev targets the complete pipeline from requirements to executable projects.
Rating¶
- Novelty: ⭐⭐⭐⭐ Introducing BDD into LLM E2ESD evaluation is a meaningful contribution, though the benchmark construction methodology itself is relatively straightforward.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Covers 6 LLM backbones × 6 frameworks with additional manual failure mode analysis — highly comprehensive.
- Writing Quality: ⭐⭐⭐⭐ Well-structured with intuitive figures and in-depth analysis.
- Value: ⭐⭐⭐⭐ Fills a gap in reliable E2ESD evaluation; failure mode analysis provides direct guidance for framework design.