Revisit Self-Debugging with Self-Generated Tests for Code Generation

Conference: ACL 2025
Code: -
Area: Code Intelligence
Keywords: code generation, self-debugging, self-generated tests, execution feedback, LLM

TL;DR

This paper systematically investigates the effectiveness of self-debugging with self-generated tests using LLMs. It finds that post-execution-based self-debugging degrades performance on basic programming problems due to self-generated test bias. Conversely, in-execution self-debugging successfully avoids this bias, achieving consistent improvements on both basic and competitive programming tasks.

Background & Motivation

Self-debugging has become a popular method to improve LLM code generation quality in recent years: generate code \(\rightarrow\) execute tests \(\rightarrow\) obtain feedback \(\rightarrow\) repair code.
However, existing methods (e.g., Self-Debugging, Reflexion, AlphaCodium) largely depend on predefined oracle tests, whereas high-quality test cases are often unavailable in practice.
Dilemma of Self-Generated Tests: Utilizing model-generated tests for self-debugging is a natural solution, but its effectiveness remains under-explored.
- Though Reflexion utilizes self-generated tests for feedback, it evaluates pre-repair code using oracle tests.
- AlphaCodium iterates on oracle tests before iterating on self-generated tests.
Core Problem: The quality of self-generated tests is limited (test output accuracy is only about 85-90%, depending on the model). What impact does this have on self-debugging?
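The generate, execute, feedback, repair loop described in this section can be sketched as follows. This is a minimal illustration, not the paper's implementation: `self_debug`, `get_feedback`, and `repair` are hypothetical names, and `repair` stands in for an LLM call.

```python
from typing import Callable, Optional


def self_debug(
    program: str,
    get_feedback: Callable[[str], Optional[str]],
    repair: Callable[[str, str], str],
    max_rounds: int = 2,
) -> str:
    """Iterate execute -> feedback -> repair until no feedback remains or rounds run out."""
    for _ in range(max_rounds):
        feedback = get_feedback(program)      # execute tests and summarize failures
        if feedback is None:                  # the tests report nothing to fix
            break
        program = repair(program, feedback)   # assumption: an LLM call in practice
    return program
```

Note that the loop trusts `get_feedback` completely. If the tests behind it are wrong, a correct program still receives feedback and gets modified, which is the failure mode the paper examines.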

Method

Overall Architecture

A unified framework for self-debugging is proposed, distinguishing between two paradigms:

Post-Execution Self-Debugging
- Compares actual execution outputs with expected outputs after running the code.
- If a mismatch occurs, the failed test cases, actual outputs, and error messages are provided as feedback for the model to fix the program.
- Two feedback granularities: label (only correct/incorrect status) and detail (includes test inputs, expected outputs, and actual outputs).
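As a rough sketch of the post-execution step, the function below runs a candidate on each test, compares actual and expected outputs, and formats feedback at either granularity. The names `SelfTest` and `build_feedback` and the feedback wording are assumptions made for illustration; the paper's prompts are not reproduced here.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Optional


@dataclass
class SelfTest:
    """A self-generated test: input arguments and the model's expected output (which may be wrong)."""
    args: tuple
    expected: Any


def build_feedback(func: Callable, tests: List[SelfTest], granularity: str = "detail") -> Optional[str]:
    """Run `func` on every test; return feedback text, or None if all tests pass."""
    failures = []
    for t in tests:
        try:
            actual, error = func(*t.args), None
        except Exception as exc:  # runtime errors also count as failures
            actual, error = None, f"{type(exc).__name__}: {exc}"
        if error is not None or actual != t.expected:
            failures.append((t, actual, error))

    if not failures:
        return None
    if granularity == "label":
        # Label feedback: only the pass/fail status is revealed.
        return "The program failed some tests."
    # Detail feedback: test input, expected output, and actual output or error.
    lines = []
    for t, actual, error in failures:
        got = error if error is not None else repr(actual)
        lines.append(f"input={t.args!r} expected={t.expected!r} actual={got}")
    return "\n".join(lines)
```

The comparison `actual != t.expected` is where test bias enters: when `t.expected` is itself wrong, a correct program is reported as failing.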
In-Execution Self-Debugging
- Decomposes the program into basic blocks based on the Control Flow Graph (CFG).
- Collects intermediate variable states (execution trace) before and after the execution of each basic block.
- The model assesses program correctness and debugs purely based on the test inputs and intermediate states.
- Does not utilize post-execution information (unaware of whether the final output matches the expected output).
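The trace collection can be approximated in Python with `sys.settrace`. The sketch below is a simplification under stated assumptions: it records local variables before each executed line rather than per CFG basic block as in the paper, and `collect_trace` is a hypothetical helper name.

```python
import sys
from typing import Any, Callable, Dict, List, Tuple


def collect_trace(func: Callable, *args: Any) -> Tuple[Any, List[Tuple[int, Dict[str, Any]]]]:
    """Run func(*args); record (line offset within func, local variables) before each line executes."""
    code = func.__code__
    trace: List[Tuple[int, Dict[str, Any]]] = []

    def tracer(frame, event, arg):
        if frame.f_code is not code:
            return None  # do not trace calls into other functions
        if event == "line":
            # Shallow snapshot: mutable values may still change after recording.
            trace.append((frame.f_lineno - code.co_firstlineno, dict(frame.f_locals)))
        return tracer

    previous = sys.gettrace()
    sys.settrace(tracer)
    try:
        result = func(*args)
    finally:
        sys.settrace(previous)  # always restore the earlier trace function
    return result, trace
```

The returned trace, together with the test input, is what would be shown to the model. The expected output never appears, so an incorrect test label cannot mislead the repair step.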

Key Differences

Post-execution relies on the output labels of self-generated tests, which can be erroneous.
In-execution focuses solely on the intermediate states of execution, thereby bypassing the label bias issue.

Bias Analysis Framework

Four cases are defined:
- True Positive (TP): Correct program passes the test.
- True Negative (TN): Incorrect program fails the test.
- False Positive (FP): Incorrect program passes a buggy test.
- False Negative (FN): Correct program fails due to an erroneous test.
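The four cases map directly onto two booleans. In the sketch below, `program_correct` is assumed to come from oracle tests and `passes_self_tests` from the self-generated suite; the function names are illustrative.

```python
from collections import Counter
from typing import Iterable, Tuple


def classify(program_correct: bool, passes_self_tests: bool) -> str:
    """Map (ground-truth correctness, self-test verdict) to TP / FN / FP / TN."""
    if program_correct:
        return "TP" if passes_self_tests else "FN"
    return "FP" if passes_self_tests else "TN"


def tally(outcomes: Iterable[Tuple[bool, bool]]) -> Counter:
    """Count each case over (program_correct, passes_self_tests) pairs."""
    return Counter(classify(correct, passed) for correct, passed in outcomes)
```

Only the FN and TN cases trigger a repair attempt, so their relative size determines whether self-debugging mostly fixes wrong programs or mostly disturbs correct ones.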

Experiments

Experimental Setup

Models: GPT-4o, Claude-3.5-Sonnet, LLaMA-3-70B-Instruct, Qwen2.5-Coder-7B-Instruct
Benchmarks: HumanEval(+), MBPP(+) (basic), LiveCodeBench (competition-level, 450 problems)
Greedy decoding, with 10 test cases generated per problem.
Iterations: 1-2 rounds

Post-Execution Self-Debugging + Oracle Tests (Control Baseline)

Self-debugging brings consistent improvements when using oracle tests, for example:
- GPT-4o on HumanEval: 92.1 \(\rightarrow\) 95.1 (+3.0)
- Claude-3.5 on MBPP+: 77.0 \(\rightarrow\) 86.0 (+9.0)

Post-Execution Self-Debugging + Self-Generated Tests

Performs poorly on basic problems:
- Claude-3.5 on HumanEval: 94.5 \(\rightarrow\) 87.2 (-7.3)
- LLaMA-3-70B on HumanEval: 79.9 \(\rightarrow\) 73.8 (-6.1)
- All models show performance drops on HumanEval.

Shows potential on competitive programming tasks:
- GPT-4o on LiveCodeBench: 46.0 \(\rightarrow\) 49.3 (+3.3) (with label feedback)
- However, detail feedback leads to performance drops on easy problems.

Self-Generated Test Quality Analysis

Model	Test Input Accuracy	Test Output Accuracy	Test Suite Effectiveness
GPT-4o	97.63%	89.77%	59.15%
Claude-3.5	97.68%	89.14%	56.71%
LLaMA-3-70B	94.53%	84.69%	49.39%
Qwen2.5-Coder-7B	97.19%	84.85%	44.50%

Generating test inputs is relatively easy (about 95-98% accuracy), while generating correct outputs is harder (about 85-90%).
The effectiveness of the complete test suite is only about 45-60%.

Key Findings from Bias Analysis

On HumanEval/MBPP, False Negatives outnumber True Negatives: correct programs are flagged as failing by erroneous tests, and the resulting unnecessary repairs introduce new bugs.
On LiveCodeBench, True Negatives make up a larger share. Competitive programming tasks have lower base pass rates, so a failing self-generated test more often reflects a genuinely incorrect program.
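The base-rate argument can be made explicit with a small calculation. The symbols here are illustrative assumptions rather than quantities reported in the paper: \(p\) is the probability that the initial program is correct, \(q\) the probability that a correct program fails at least one self-generated test, and \(r\) the probability that an incorrect program fails at least one.

\[
\frac{\mathrm{FN}}{\mathrm{TN}} = \frac{p\,q}{(1-p)\,r}
\]

Taking hypothetical values \(q = 0.4\) and \(r = 0.9\), a basic benchmark with \(p = 0.9\) gives \(0.36 / 0.09 = 4\), so failing verdicts are mostly false alarms. A competitive benchmark with \(p = 0.46\) gives \(0.184 / 0.486 \approx 0.38\), so most failing verdicts point at real bugs. The test quality is the same in both cases; only the base pass rate changes.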

In-Execution Self-Debugging

Yields gains on basic problems:
- GPT-4o on HumanEval/MBPP+: 87.8/76.5 \(\rightarrow\) 89.0/79.1 (+1.2/+2.6, 2 iteration rounds)
- Qwen2.5-Coder-7B on MBPP+: 70.6 \(\rightarrow\) 72.0 (+1.4)
- Most models maintain or improve performance.

On competitive programming tasks:
- GPT-4o on LiveCodeBench: 46.0 \(\rightarrow\) 47.6 (+1.6)

Synthesis of Comparison

Paradigm	Basic Problems	Competitive Problems
Post-execution + self-tests	❌ Decreases in general	⚠️ Label improves performance
In-execution + self-tests	✅ Mild gain	✅ Consistent gain

Highlights & Insights

First to systematically reveal the failure mode of self-debugging with self-generated tests: Post-execution self-debugging is counterproductive on basic problems, which is an important and counterintuitive finding.
The bias analysis framework (TP/TN/FP/FN) precisely explains the root cause of the inconsistency: high FN rates on basic problems versus more reasonable TN rates on competitive problems.
The In-execution paradigm serves as a practical solution: By analyzing intermediate execution states rather than relying on unreliable test labels, it effectively circumvents test bias.
Insight: The effectiveness of self-debugging depends not only on the model's code-repair capability but also on its ability to recognize erroneous feedback.
Comprehensive experimental coverage: 4 models \(\times\) 3 benchmarks \(\times\) 2 paradigms \(\times\) 2 feedback granularities.

Limitations & Future Work

Restricted to Python programming tasks; multilingual scenarios are not validated.
In-execution self-debugging requires full execution traces. For complex programs (deep loops/recursions), this may yield excessively long traces, and truncation or compression strategies are not discussed.
Self-generated tests are limited to 10 instances; whether more tests can alleviate bias remains under-explored.
Hybrid approaches combining oracle tests and self-generated tests are not explored.
Evaluated strictly using greedy decoding, without considering candidate sampling strategies.

Related Work

Code Generation: CodeT (Chen et al., 2023) utilizes dual execution consistency; Self-Edit (Zhang et al., 2023a) incorporates sample tests as execution feedback.
Self-debugging: Self-Debugging (Chen et al., 2024b) focuses on iterative debugging; LDB (Zhong et al., 2024) leverages runtime trace information; Reflexion (Shinn et al., 2023) uses self-generated test feedback but evaluates with oracle tests.
Code Evaluation: EvalPlus (Liu et al., 2023) scales up evaluation test suites; LiveCodeBench (Jain et al., 2024) continuously collects competitive coding questions.

Rating ⭐⭐⭐⭐

The paper poses a clear research problem, backs it with deep analysis (the bias framework), reports a valuable finding (post-execution self-debugging with self-generated tests is harmful on basic problems), and proposes a viable alternative (in-execution self-debugging). However, the absolute gains of in-execution self-debugging are modest, so its practical utility needs further verification.