UTBoost: Rigorous Evaluation of Coding Agents on SWE-Bench
Conference: ACL 2025
arXiv: 2506.09289
Code: github.com/CUHK-Shenzhen-SE/UTBoost
Area: Other (Software Engineering/Code Generation Evaluation)
Keywords: SWE-Bench, Test Case Augmentation, Code Generation Evaluation, Metamorphic Testing, LLM Coding Agents
TL;DR
This paper proposes the UTBoost framework, which enhances test case coverage of SWE-Bench through an LLM-based test case generator (UTGenerator) and an improved parser. It identifies 36 inadequately tested instances and 345 patches falsely flagged as passed, leading to ranking changes of 40.9% on SWE-Bench Lite and 24.4% on SWE-Bench Verified.
Background & Motivation
SWE-Bench is a standard benchmark for evaluating the capabilities of code generation agents on real-world Python projects. It is constructed based on GitHub issues and corresponding pull requests, using human-written test cases to verify whether the generated patches resolve the issues.
However, human-written test cases are often inadequate: generated patches may pass the tests but fail to truly resolve the issues. For instance, in the mwaskom__seaborn-3010 instance, the issue requires the PolyFit function to handle missing data, but the original test only considers cases where both x and y are missing, failing to cover the boundary condition where only x is missing. The patch by IBM SWE-1.0 passes the original test but errors out when only x is missing.
Furthermore, the test log parser of SWE-Bench is flawed: when using regular expressions to extract test results, it fails to handle test case logs that span multiple lines, leading to incorrect test annotations. For example, in django__django-13710, the test name test_immutable_content_type spans two lines, and the parser mistakenly extracts "Regression for #9362" as the test name.
Method
Overall Architecture
The UTBoost system consists of three key components:
- UTGenerator: An LLM-based test case generator
- Intramorphic Testing: Constructing test oracles
- Improved Parser: Fixing flaws in the original SWE-Bench parser
Workflow:
- Step 1: Keep only the generated patches that pass the original tests, i.e. those satisfying \(P(T_{orig}) = P'(T_{orig})\).
- Step 2: Generate augmented test cases \(T_{aug}\) with UTGenerator.
- Step 3: Check whether the oracle relation \(P(T_{aug}) = P'(T_{aug})\) still holds on the augmented tests.
- Step 4: If it does not hold, mark the instance as suspicious.
Key Designs
Intramorphic Testing:
- A white-box automated testing technique that establishes a test oracle by comparing the outputs of the original system and a modified system on the same input.
- \(P\) is the program with the gold patch applied; \(P'\) is the program with the generated patch applied.
- Oracle relation: if the two patches resolve the issue equivalently, then \(P(T) = P'(T)\) should hold.
- If an augmented test breaks this relation, it indicates that the original test is inadequate.
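The oracle check can be sketched in a few lines of Python. This is a minimal illustration, assuming test outcomes are already available as name-to-status dictionaries; the type alias, function names, and toy test names are hypothetical and not taken from the UTBoost code.

```python
from typing import Dict, List

# Hypothetical outcome map: test name -> "PASSED" / "FAILED" / "ERROR".
TestOutcomes = Dict[str, str]


def diverging_tests(gold: TestOutcomes, generated: TestOutcomes) -> List[str]:
    """Return tests whose outcome differs between P (gold) and P' (generated)."""
    names = sorted(set(gold) | set(generated))
    return [n for n in names if gold.get(n, "MISSING") != generated.get(n, "MISSING")]


def is_suspicious(gold_orig: TestOutcomes, gen_orig: TestOutcomes,
                  gold_aug: TestOutcomes, gen_aug: TestOutcomes) -> bool:
    """Flag a patch that agrees with the gold patch on T_orig but not on T_aug."""
    agrees_on_original = not diverging_tests(gold_orig, gen_orig)
    return agrees_on_original and bool(diverging_tests(gold_aug, gen_aug))


# Toy outcomes loosely modeled on the seaborn example; test names are invented.
gold_orig = {"test_missing_data": "PASSED"}
gen_orig = {"test_missing_data": "PASSED"}
gold_aug = {"test_missing_x_only": "PASSED"}
gen_aug = {"test_missing_x_only": "ERROR"}

flag = is_suspicious(gold_orig, gen_orig, gold_aug, gen_aug)
print(flag)  # True
```

Comparing outcomes rather than raw program output keeps the check independent of how each project formats its results.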
UTGenerator: Three-Stage Localization + Test Generation
- File-level Localization:
  - Build a tree-structure representation of the codebase.
  - The LLM receives the issue description, original test patches, and the tree structure.
  - Outputs the Top-\(N\) files most likely requiring test addition.
- Function/Class-level Localization:
  - Compress code files, retaining only class and function headers.
  - The LLM analyzes the compressed format to locate the functions or classes most likely to receive the tests.
- Line-level Localization:
  - Extract specific code snippets.
  - The LLM identifies the exact line range for adding augmented test cases.
- Test Case Generation:
  - Extend the localized lines using a context window of \(x\) lines.
  - The LLM generates the augmented test cases and their dependencies.
Improved Parser:
- Use a queue to track adjacent log data.
- Match test case names precisely using regular expressions.
- Iteratively search until the correct test name is found when tests span multiple lines.
- Fixes various edge cases that the original parser failed to handle.
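The queue-based lookback can be illustrated with a small Python sketch. It assumes a Django-style verbose log in which a test identifier of the form `name (module.Class)` may sit on the line before the status; the regular expressions, the lookback size, and the module path in the sample log are assumptions for illustration, not the actual UTBoost parser.

```python
import re
from collections import deque
from typing import Deque, Dict

# Assumed log shape; these patterns are illustrative, not from the paper.
TEST_ID = re.compile(r"^(\w+) \(([\w.]+)\)")
STATUS = re.compile(r"\.\.\. (ok|FAIL|ERROR)$")


def parse_log(log: str, lookback: int = 5) -> Dict[str, str]:
    """Map each test identifier to its status, tolerating multi-line entries."""
    results: Dict[str, str] = {}
    recent: Deque[str] = deque(maxlen=lookback)  # queue of adjacent log lines
    for raw in log.splitlines():
        line = raw.strip()
        recent.append(line)
        status = STATUS.search(line)
        if status is None:
            continue
        # Walk back from the newest line until one carries a test identifier.
        for candidate in reversed(recent):
            match = TEST_ID.match(candidate)
            if match:
                results[f"{match.group(1)} ({match.group(2)})"] = status.group(1)
                break
        recent.clear()  # never reuse an identifier for a later status line
    return results


SAMPLE_LOG = """\
test_immutable_content_type (admin_inlines.tests.TestInline)
Regression for #9362 ... ok
test_inline_primary (admin_inlines.tests.TestInline) ... ok
test_bad (admin_inlines.tests.TestInline) ... FAIL
"""

parsed = parse_log(SAMPLE_LOG)
print(parsed)
```

In this sample the docstring line "Regression for #9362" carries the status, but the result is attributed to `test_immutable_content_type` from the preceding line.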
Loss & Training
UTGenerator uses GPT-4o as the LLM backend. A multi-temperature sampling strategy is adopted to increase the diversity of test cases:
- Temperature 0: 1 deterministic patch
- Temperature 0.8: 20 patches
- Temperature 0.9: 20 patches
- Temperature 0.99: 20 patches
File-level localization uses Top-3, the localization stage uses a temperature of 0.8, and the context window is 10 lines of code. The average API cost per SWE-Bench instance is $1.6.
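As a quick check on the sampling budget, the schedule above can be written as data and expanded into one temperature per LLM call. The variable and function names are hypothetical; only the temperatures and sample counts come from the text.

```python
from typing import List, Tuple

# (temperature, number of samples) pairs as listed above.
SAMPLING_SCHEDULE: List[Tuple[float, int]] = [(0.0, 1), (0.8, 20), (0.9, 20), (0.99, 20)]


def sampling_plan(schedule: List[Tuple[float, int]]) -> List[float]:
    """Expand the schedule into one temperature per generation call."""
    return [temp for temp, count in schedule for _ in range(count)]


plan = sampling_plan(SAMPLING_SCHEDULE)
print(len(plan))  # 61 candidate test patches per instance
```

Under this schedule each instance yields 1 + 20 + 20 + 20 = 61 candidate test patches.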
Key Experimental Results
Main Results
Inadequately Tested Instances Found:
- SWE-Bench Lite: 23 inadequately tested instances (out of 300)
- SWE-Bench Verified: 26 inadequately tested instances (out of 500)
- 36 distinct instances in total (with overlap between the two sets)

Incorrect Patches Identified:
- SWE-Bench Lite: 170 out of 599 patches (28.4%) that passed the original tests were actually incorrect
- SWE-Bench Verified: 92 out of 584 passed patches (15.7%) were actually incorrect

Impact of Parser Flaws:
- SWE-Bench Lite: 54.7% (164/300) of instance annotations were affected
- SWE-Bench Verified: 54.2% (271/500) of instance annotations were affected
- After correction, an additional 64 (Lite) and 79 (Verified) incorrect patches were discovered

Total Incorrect Patches Identified:
- SWE-Bench Lite: 176 (augmented tests + improved parser)
- SWE-Bench Verified: 169
Ablation Study
Project Distribution Analysis:
- django and sympy account for the majority of errors: 84.1% of incorrect patches in SWE-Bench Lite, and 82.6% in Verified
- Inadequately tested instances are distributed across 9 out of 12 projects

Leaderboard Impact:
- SWE-Bench Lite: 40.9% (18/44) of ranks changed
- SWE-Bench Verified: 24.4% (11/45) of ranks changed
- Typical case: Amazon-Q-Developer-Agent dropped from 1st place (55%) to tied 1st with devlo (53.6%) due to having 7 incorrect patches
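The leaderboard arithmetic can be reproduced with a short Python sketch. It assumes the Amazon-Q-Developer-Agent figures refer to the 500-instance SWE-Bench Verified set, so that 55% corresponds to 275 resolved instances; the helper names and the three toy agents are hypothetical and do not come from the paper.

```python
from typing import Dict


def corrected_rate(resolved: int, incorrect: int, total: int) -> float:
    """Resolve rate in percent after discarding incorrect patches."""
    return round(100.0 * (resolved - incorrect) / total, 1)


def ranks(resolved: Dict[str, int]) -> Dict[str, int]:
    """Competition ranking: tied agents share the best rank."""
    ordered = sorted(resolved.values(), reverse=True)
    return {name: ordered.index(count) + 1 for name, count in resolved.items()}


def changed_ranks(before: Dict[str, int], incorrect: Dict[str, int]) -> int:
    """Count agents whose rank differs once incorrect patches are removed."""
    after = {name: n - incorrect.get(name, 0) for name, n in before.items()}
    old, new = ranks(before), ranks(after)
    return sum(old[name] != new[name] for name in before)


# 55% of an assumed 500 instances is 275 resolved; 7 incorrect patches leave 268.
print(corrected_rate(275, 7, 500))  # 53.6

# Hypothetical agents and counts, for illustration only.
before = {"agent_a": 275, "agent_b": 268, "agent_c": 260}
incorrect = {"agent_a": 7, "agent_b": 0, "agent_c": 1}
print(changed_ranks(before, incorrect))  # 1
```

In the toy data only `agent_b` changes rank, moving from 2nd to tied 1st, which mirrors how a small number of removed patches can reshuffle closely spaced entries.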
Key Findings
- Even manual review by 93 professional developers missed inadequate test issues: UTBoost identified 26 problematic instances in SWE-Bench Verified.
- Parser flaws have an extremely broad impact: over 54% of instance annotations contained errors.
- The false pass rate caused by inadequate testing reaches 28.4% on SWE-Bench Lite (15.7% on Verified): more than a quarter of the "passed" Lite patches are actually incorrect.
- django and sympy are the weakest projects: together they account for the vast majority of incorrect patches.
Highlights & Insights
- First systematic resolution of inadequate testing in SWE-Bench: prior works (Aleithan et al., Chen and Jiang) only manually identified issues, whereas UTBoost provides an automated solution.
- First application of metamorphic testing to evaluate open-source software systems: elegantly leveraging the equivalence of the Gold Patch and generated patches to establish test oracles.
- Three-stage localization strategy effectively handles large-scale codebases: progressively narrowing down from file \(\to\) function/class \(\to\) line.
- Improved parser fixes a long-neglected infrastructure flaw: affecting more than half of the instances.
Limitations & Future Work
- Requires at least one agent-generated patch that passes the original tests: UTBoost cross-validates the Gold Patch against generated patches, so it cannot process instances that no agent has resolved (it currently covers 74.6% of Lite and 81.6% of Verified).
- Reliance on GPT-4o only: integrating other LLMs might increase test diversity.
- Simplified architecture: based on a simplified version of Agentless; employing more complex coding agent frameworks could generate more diverse tests.
- Average cost of $1.6 per instance: making large-scale usage relatively expensive.
- Covers only Python projects: has not yet been extended to other programming languages.
Related Work & Insights
- Comparison with EvalPlus (Liu et al., 2024): EvalPlus adds tests to HumanEval/MBPP via type-aware mutation, but cannot handle the multi-file/multi-dependency complexities of SWE-Bench.
- Insights for code generation benchmark design: test coverage and parser correctness are fundamental to benchmark credibility and cannot rely solely on manual review.
- Provides plug-and-play augmented test cases for future SWE-Bench submissions.
Rating
- Novelty: ★★★★☆ (novel metamorphic testing application, but the overall approach is somewhat engineering-oriented)
- Experimental Thoroughness: ★★★★★ (comprehensive coverage of both benchmarks, with results confirmed by manual review)
- Value: ★★★★★ (directly impacts the SWE-Bench leaderboard, with code and data released)
- Writing Quality: ★★★★☆ (clear structure, detailed cases, but some content is repetitive)