LLM-as-a-Judge for Scalable Test Coverage Evaluation¶
Conference: AAAI 2026 arXiv: 2512.01232 Code: None Area: LLM Evaluation Keywords: LLM-as-Judge, Test Coverage, Software Testing, Automated Evaluation, Gherkin
TL;DR¶
This paper applies the LLM-as-Judge paradigm to Gherkin acceptance-test coverage evaluation, systematically quantifying accuracy, reliability, and cost trade-offs across 20 model configurations with 500 evaluations each (10,000 in total). It finds that GPT-4o Mini offers the best production balance: a MAAE of 6.07, an ECR@1 of 96.6%, and a cost of $1.01 per 1K evaluations, roughly 1/78th the cost of GPT-5 at high reasoning effort.
Background & Motivation¶
- Background: Software test coverage evaluation has traditionally relied on code instrumentation tools (JaCoCo, coverage.py, etc.), which measure only structural coverage (line coverage, branch coverage) and cannot assess semantic completeness—i.e., whether tests adequately cover business requirements, edge cases, and error conditions.
- Limitations of Prior Work: Manual expert review is not scalable at the pace of modern CI/CD pipelines. Although LLM-as-Judge has been successfully applied to text generation evaluation, systematic assessment of its use in software testing remains absent—particularly regarding operational reliability (e.g., first-attempt success rate) and cost-effectiveness beyond raw accuracy.
- Key Challenge: Larger models may be more accurate but are costly and subject to API instability, whereas smaller models are cheaper but potentially less precise. The key challenge lies in identifying the optimal balance among all three dimensions.
- Key Insight: The paper constructs a benchmark of 100 expert-annotated Gherkin test scripts, introduces novel metrics—ECR@1 (Evaluation Completion Rate at first attempt) and reliability-adjusted cost—and conducts a three-dimensional systematic evaluation across 20 model configurations.
Method¶
Overall Architecture¶
The LLM-as-a-Judge (LAJ) framework receives a Jira requirement together with a Gherkin test script, processes them through a rubric-driven evaluation prompt, and outputs a structured JSON object containing a coverage percentage, coverage analysis, gap identification, and improvement suggestions. Benchmark construction proceeds in three stages: requirements authoring → script generation → expert annotation.
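As a rough illustration of the judge's output contract, below is a minimal sketch of the structured verdict and its parsing step; the field names (`coverage_percentage`, `gaps`, etc.) are assumptions for illustration, since the paper does not publish its exact JSON schema.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical shape of the judge's structured verdict; field names are
# illustrative assumptions, not the paper's published schema.
@dataclass
class CoverageVerdict:
    coverage_percentage: float          # 0-100 weighted rubric score
    coverage_analysis: str              # per-dimension reasoning from the judge
    gaps: List[str]                     # requirements or edge cases not exercised
    improvement_suggestions: List[str]  # concrete scenarios to add

def parse_verdict(raw: dict) -> CoverageVerdict:
    """Convert the judge's raw JSON into a typed verdict.
    A KeyError or ValueError here counts as a reliability failure (see ECR@1 below)."""
    return CoverageVerdict(
        coverage_percentage=float(raw["coverage_percentage"]),
        coverage_analysis=str(raw["coverage_analysis"]),
        gaps=list(raw["gaps"]),
        improvement_suggestions=list(raw["improvement_suggestions"]),
    )
```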
Key Designs¶
- Three-Stage Benchmark Dataset Construction
  - Stage 1: Experienced product managers manually author 100 Jira tickets (covering GET 50% / POST 21% / DELETE 15% / PUT 14%) targeting the Kill Bill subscription billing platform.
  - Stage 2: Development and QA teams collaborate to automatically generate Gherkin test scripts using GPT-4.1.
  - Stage 3: Three senior QA engineers (8+ years of API testing experience) independently annotate coverage scores using a four-dimensional weighted rubric.
- Four-Dimensional Weighted Evaluation Rubric
  - Scenario Completeness (40%): Coverage of happy paths, error conditions, and edge cases.
  - Acceptance Criteria Alignment (30%): Whether specified requirements are explicitly validated.
  - HTTP Method-Specific Considerations (20%): Appropriate handling of idempotency, caching, and state changes.
  - Assertion Quality (10%): Depth and specificity of verification steps.
  - Scores are computed as a weighted sum mapped to 0–100%, and the rubric is embedded in the LAJ prompt to ensure human–model alignment.
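For concreteness, here is a minimal sketch of the weighted aggregation implied by the rubric above; the 40/30/20/10 weights come from the paper, while scoring each dimension on a 0–1 scale is an assumption made for illustration.

```python
# Weighted-rubric aggregation sketch. The weights are from the paper; the
# per-dimension 0-1 scoring scale is an illustrative assumption.
RUBRIC_WEIGHTS = {
    "scenario_completeness": 0.40,
    "acceptance_criteria_alignment": 0.30,
    "http_method_considerations": 0.20,
    "assertion_quality": 0.10,
}

def rubric_score(dimension_scores: dict) -> float:
    """Map per-dimension scores in [0, 1] to an overall coverage percentage."""
    total = sum(RUBRIC_WEIGHTS[name] * score for name, score in dimension_scores.items())
    return round(100 * total, 1)

# Example: strong scenario coverage but shallow assertions.
print(rubric_score({
    "scenario_completeness": 0.8,
    "acceptance_criteria_alignment": 0.9,
    "http_method_considerations": 0.7,
    "assertion_quality": 0.4,
}))  # -> 77.0
```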
- Operational Reliability Metric (ECR@1)
  - ECR@1 = first-attempt success rate, where success is defined as producing a valid, parseable JSON output.
  - Reliability failures include API timeouts, malformed JSON, and schema violations.
  - Reliability directly affects production cost—low ECR@1 implies more retries.
  - A "reliability-adjusted cost" metric is introduced to incorporate retry overhead into the cost calculation.
- Large-Scale Systematic Evaluation
  - 20 model configurations: GPT-4 series (5) + GPT-5 series (9, spanning high/medium/low reasoning effort) + open-source models (6, with varying reasoning effort).
  - Each configuration × 100 scripts × 5 independent runs = 10,000 total evaluations.
  - A unified prompt is used across all configurations to ensure fair comparison.
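Assuming MAAE denotes the mean absolute difference between judge and expert coverage scores (in percentage points) and CMR the share of judgments within a ±5-point tolerance, per-configuration aggregation over the 500 runs might look like the following sketch; neither definition is quoted verbatim from the paper.

```python
from statistics import mean

def maae(judge_scores, expert_scores):
    """Mean absolute difference between judge and expert coverage scores,
    in percentage points (assumed reading of the paper's MAAE metric)."""
    return mean(abs(j - e) for j, e in zip(judge_scores, expert_scores))

def close_match_rate(judge_scores, expert_scores, tolerance=5.0):
    """Share of evaluations where the judge lands within +/- tolerance points
    of the expert score (assumed reading of the paper's CMR metric)."""
    hits = sum(abs(j - e) <= tolerance for j, e in zip(judge_scores, expert_scores))
    return hits / len(judge_scores)
```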
Loss & Training¶
- No training is required—the framework relies entirely on prompt engineering.
- A two-part prompt design is adopted: a System Prompt (role definition + rubric) and a User Prompt (Jira story + Gherkin script + testing guidelines + output format specification).
- Each HTTP method is accompanied by targeted testing guidelines (GET: caching / pagination / authorization; POST: validation / duplication / large payloads, etc.).
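A minimal sketch of the two-part prompt assembly described above; the guideline snippets, prompt wording, and function names are illustrative placeholders rather than the paper's actual prompt text.

```python
# Illustrative prompt assembly; all wording below is a placeholder, not the
# paper's actual System/User prompts.
HTTP_METHOD_GUIDELINES = {
    "GET": "Check caching, pagination, and authorization behaviour.",
    "POST": "Check input validation, duplicate submissions, and large payloads.",
    "PUT": "Check idempotency and full-resource replacement semantics.",
    "DELETE": "Check idempotency and behaviour for already-deleted resources.",
}

SYSTEM_PROMPT = (
    "You are a senior QA engineer. Score how completely the Gherkin script covers "
    "the requirement using this weighted rubric: scenario completeness 40%, "
    "acceptance-criteria alignment 30%, HTTP-method considerations 20%, "
    "assertion quality 10%. Respond with JSON only."
)

def build_user_prompt(jira_story: str, gherkin_script: str, http_method: str) -> str:
    """Combine the requirement, the script, method-specific guidelines, and the
    required output format into a single user message."""
    return (
        f"## Jira story\n{jira_story}\n\n"
        f"## Gherkin script\n{gherkin_script}\n\n"
        f"## Testing guidelines ({http_method})\n{HTTP_METHOD_GUIDELINES[http_method]}\n\n"
        "## Output format\nReturn JSON with keys: coverage_percentage, "
        "coverage_analysis, gaps, improvement_suggestions."
    )
```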
Key Experimental Results¶
Main Results: Accuracy–Reliability–Cost Comparison¶
| Model | MAAE (↓) | ECR@1 (↑) | Cost / 1K Evaluations |
|---|---|---|---|
| GPT-4o Mini | 6.07 | 96.6% | $1.01 |
| GPT-4o | 7.23 | 98.0% | $3.51 |
| GPT-4.1 | 6.93 | 100.0% | $5.12 |
| GPT-5 (High Reasoning) | 7.71 | 85.4% | $78.96 |
| GPT-5 Mini (Medium) | 6.88 | 98.0% | $4.23 |
| GPT-OSS 20B (Low Reasoning) | Higher (worse) | Lower (worse) | $0.45 |
Ablation Study: Effect of Reasoning Effort Across Model Families¶
| Model Family | Effect of Increasing Reasoning Effort |
|---|---|
| GPT-5 Series | Accuracy improves with predictable cost increases — positive trade-off |
| Open-Source Models | Simultaneous degradation in accuracy, reliability, and cost — negative effect |
Key Findings¶
- Smaller models can outperform larger ones: GPT-4o Mini (6.07 MAAE) is more accurate than GPT-5 at high reasoning effort (7.71 MAAE) at 1/78th the cost.
- The effect of reasoning effort is model family-specific: GPT-5 benefits from increased reasoning effort, whereas open-source models degrade—no universal rule applies.
- ECR@1 ranges from 85.4% to 100%: GPT-5 at high reasoning effort exhibits the lowest reliability (85.4%), indicating that complex reasoning increases the risk of unstable output formatting.
- A 175× cost spread: $0.45 to $78.96 per 1K evaluations—model selection has substantial implications for engineering budgets.
- Close Match Rate (CMR) reveals consistency: Most models match expert judgments within ±5 percentage points.
Highlights & Insights¶
- Practical utility of the three-dimensional evaluation framework: This work is the first to simultaneously quantify accuracy, operational reliability, and cost, providing a complete basis for real-world deployment decisions.
- General value of the ECR@1 metric: This metric is transferable to any LLM production system, as first-attempt success rate directly affects user experience and operational cost.
- Counter-intuitive finding — smaller models outperform larger ones: On structured evaluation tasks, excessive reasoning capacity can introduce noise rather than improve performance.
Limitations & Future Work¶
- The benchmark targets only API testing on the Kill Bill platform; domain generalizability remains unvalidated.
- Only Gherkin-format acceptance tests are evaluated; unit tests and integration tests are not covered.
- Inter-annotator agreement among expert annotators is not reported.
- Embedding the rubric in the prompt may lead models to "align with the rubric" rather than to genuinely "understand the tests."
Related Work & Insights¶
- vs. Traditional Coverage Tools (JaCoCo, etc.): Traditional tools measure structural coverage (which code is executed), whereas LAJ evaluates semantic coverage (whether tests address requirements)—the two approaches are complementary.
- vs. LLM-based Code Evaluation (TestGen-LLM, etc.): Code evaluation focuses on the correctness of generated tests, while LAJ assesses the coverage of existing tests—the two operate at different stages of the testing lifecycle.
Rating¶
| Dimension | Score | Rationale |
|---|---|---|
| Novelty | ⭐⭐⭐ | The LLM-as-Judge framework itself is not new, but its application to test coverage evaluation and the introduction of ECR@1 represent original contributions. |
| Technical Depth | ⭐⭐⭐ | The methodology centers primarily on prompt design, with a relatively low technical barrier. |
| Experimental Thoroughness | ⭐⭐⭐⭐⭐ | 20 models × 500 evaluations, three-dimensional systematic analysis, and a detailed cost model make this exceptionally comprehensive. |
| Value | ⭐⭐⭐⭐ | Directly applicable as a reference for engineering teams selecting LLMs for production deployment. |