AceCoder: Acing Coder RL via Automated Test-Case Synthesis
Conference: ACL 2025
arXiv: 2502.01718
Code: https://tiger-ai-lab.github.io/AceCoder
Area: Other
Keywords: Code Generation, Reinforcement Learning, Reward Model, test case synthesis, R1-style training
TL;DR
This work builds AceCode-87K (87K coding problems with 1.38M automatically synthesized test cases) and uses it to train code-specific reward models (RMs); the 7B RM outperforms the 340B Nemotron reward model. Best-of-N sampling with the RM improves Llama-3.1-8B-Instruct by 8.9 points on average. R1-style RL directly from a base model for only 80 steps improves HumanEval+ by 22.5 points.
Background & Motivation
Background: Code generation models primarily rely on SFT for improvements, while the potential of RL remains under-explored. The success of RL in mathematics (such as DeepSeek-R1) suggests that the coding domain can also benefit.
Limitations of Prior Work: (1) Code evaluation requires executing test cases (unlike mathematics, which can rely on direct string matching), leading to a lack of reliable reward signals; (2) Large-scale coding datasets with test cases are scarce—APPS/TACO rely on manual annotation, and previous RL works were limited to <5K samples.
Key Challenge: Existing general reward models (such as Skywork) fail to generalize to code evaluation, whereas manual annotation of test cases is too costly to scale.
Goal: Establish an automated pipeline to generate large-scale, reliable test cases from code datasets → train a code-specific RM → utilize the RM for Best-of-N sampling or RL.
Key Insight: Batch "imagining" test cases using GPT-4o-mini combined with execution filtering by a strong code model yields low-cost, high-quality, and large-scale test data.
Core Idea: Fully automated test case synthesis + execution filtering → code RM training → RL/Best-of-N, scaling RL for code generation to 87K samples.
Method
Overall Architecture
A five-phase pipeline: (1) GPT-4o-mini generates LeetCode-style problems and ~20 test cases each from a seed code dataset; (2) Qwen2.5-Coder-32B generates solutions, and test cases that those solutions fail are filtered out; (3) programs are sampled and preference pairs are built from their pass rates on the filtered tests; (4) AceCode-RM (7B/32B) is trained on the pairs with a Bradley-Terry loss; (5) the RM is used for Best-of-N sampling or Reinforcement++ RL.
Key Designs
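Step (2) can be illustrated with a minimal sketch. It assumes each test case is a standalone `assert` statement stored as a string and that the reference solution is Python source; both the data format and the helper names `run_test` and `filter_problem` are hypothetical, not the authors' implementation. The threshold of 5 surviving tests is the one stated under Key Designs.

```python
from typing import List, Optional

MIN_TESTS = 5  # from the text: discard problems with fewer than 5 surviving tests


def run_test(solution_src: str, test_src: str) -> bool:
    """Return True if an assert-style test passes against the solution.

    exec() keeps this sketch short; a real pipeline would run untrusted
    code in a sandboxed subprocess with a timeout.
    """
    namespace: dict = {}
    try:
        exec(solution_src, namespace)  # define the candidate function(s)
        exec(test_src, namespace)      # raises AssertionError on failure
        return True
    except Exception:
        return False


def filter_problem(solution_src: str, tests: List[str]) -> Optional[List[str]]:
    """Keep only tests the reference solution passes; drop sparse problems."""
    kept = [t for t in tests if run_test(solution_src, t)]
    return kept if len(kept) >= MIN_TESTS else None
```

A problem whose synthesized tests mostly disagree with the reference solution returns `None` and would be dropped from the dataset.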
- Automated Test-Case Synthesis + Filtering (AceCode-87K):
  - Function: Starting from 124K Python functions (Magicoder-Evol/OSS/StackPy), GPT-4o-mini rewrites each into a LeetCode-style problem and generates ~20 test cases.
  - Mechanism: Qwen2.5-Coder-32B generates a solution for each problem, and the test cases are executed against it. Test cases that the solution fails are removed, and problems with fewer than 5 remaining test cases are discarded.
  - Final Scale: 87,149 problems and about 1.38M cleaned test cases (an average of 15.87 per problem).
  - Quality Verification: Only 3 invalid cases out of 200 manually inspected samples (1.5% error rate).
- Preference Pair Construction and RM Training:
  - Function: 16 programs are sampled per problem and ranked by test pass rate; preference pairs are built from these scores.
  - Mechanism: Selective pairing. A pair (chosen \(i\), rejected \(j\)) is kept only when \(s_i > s_j + 0.4\), \(s_i > 0.8\), and \(s_j > 0\), where \(s\) is a program's test pass rate, to ensure high-quality preference signals.
  - Training: Bradley-Terry loss with Qwen2.5-Coder-7B-Instruct as the backbone, 1 epoch, running on 8×A100 GPUs for 24 hours.
  - Results: 307,509 valid preference pairs generated from 46,618 problems.
- RL Training (Reinforcement++):
  - Function: RL with either AceCode-RM rewards or binary rule-based rewards (1 if all tests pass, otherwise 0).
  - Mechanism: The Reinforcement++ algorithm (more efficient than PPO because it requires no value model), trained exclusively on the hardest 25% of problems in AceCode-87K.
  - R1-style Experiment: Direct RL starting from Qwen2.5-Coder-7B-base (without SFT) for only 80 steps.
  - Key Findings: Rule-based rewards are more effective than RM rewards, because the RM is susceptible to reward hacking during RL.
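The selective pairing rule from the preference-pair design can be sketched as follows. The function name `build_pairs` and the list-based inputs are illustrative assumptions; only the three thresholds come from the text.

```python
from itertools import permutations
from typing import List, Tuple


def build_pairs(
    programs: List[str],
    pass_rates: List[float],
    margin: float = 0.4,      # s_i must exceed s_j by more than this
    min_chosen: float = 0.8,  # the chosen program must pass more than 80% of tests
) -> List[Tuple[str, str]]:
    """Return (chosen, rejected) pairs that satisfy the selective pairing rule."""
    pairs = []
    for i, j in permutations(range(len(programs)), 2):
        s_i, s_j = pass_rates[i], pass_rates[j]
        # s_j > 0 excludes rejected programs that fail every test
        if s_i > s_j + margin and s_i > min_chosen and s_j > 0:
            pairs.append((programs[i], programs[j]))
    return pairs
```

With 16 sampled programs per problem, many problems yield no pair at all under these thresholds, which is consistent with only 46,618 of the 87,149 problems contributing pairs.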
Loss & Training
RM Training: Standard Bradley-Terry loss. RL Training: Reinforcement++ algorithm, rollout batch=256, 8 samples/question, lr=5e-7, 1 episode, 6 hours on 8×H100 GPUs.
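For reference, the standard Bradley-Terry objective on one preference pair is \(-\log \sigma(r_w - r_l)\), where \(r_w\) and \(r_l\) are the RM scores of the chosen and rejected programs. A minimal scalar sketch follows; the function name is hypothetical, and actual training would compute this over batches in a deep learning framework.

```python
import math


def bradley_terry_loss(r_chosen: float, r_rejected: float) -> float:
    """Compute -log(sigmoid(r_chosen - r_rejected)) in a numerically stable way."""
    diff = r_chosen - r_rejected
    # -log(sigmoid(diff)) = log(1 + exp(-diff)); branch to avoid overflow in exp
    if diff >= 0:
        return math.log1p(math.exp(-diff))
    return -diff + math.log1p(math.exp(diff))
```

The loss is small when the chosen program is scored well above the rejected one and grows roughly linearly when the ordering is reversed.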
Key Experimental Results
Best-of-N Sampling Results (Llama-3.1-8B-Instruct)
| Method | HumanEval | MBPP | BigCodeBench-C | LiveCodeBench | Average |
|---|---|---|---|---|---|
| Greedy | 68.9 | 67.2 | 38.5 | 18.0 | 40.9 |
| AceCode-RM-7B (Best-of-64) | 81.7 | 74.6 | 47.8 | 27.6 | 49.3 |
| AceCode-RM-32B (Best-of-64) | 85.4 | 72.0 | 48.5 | 31.0 | 49.8 |
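Best-of-N selection itself is simple: sample N candidate programs and return the one the reward model scores highest. The sketch below assumes a hypothetical `score_fn(prompt, program)` callable standing in for AceCode-RM; it does not reflect the model's actual inference API.

```python
from typing import Callable, List


def best_of_n(
    prompt: str,
    candidates: List[str],
    score_fn: Callable[[str, str], float],  # hypothetical stand-in for the RM
) -> str:
    """Return the candidate with the highest reward-model score."""
    if not candidates:
        raise ValueError("need at least one candidate")
    return max(candidates, key=lambda c: score_fn(prompt, c))
```

No tests are executed at selection time; the RM score is the only signal, which is why RM quality directly bounds the gain from larger N.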
R1-style RL Training (Direct RL from Base Model, 80 Steps)
| Configuration | HumanEval+ | MBPP+ | BigCodeBench-I |
|---|---|---|---|
| Qwen2.5-Coder-7B-Base | 61.6 | 76.9 | 40.2 |
| + AceCoder-Rule (80 Steps) | 84.1 (+22.5) | 82.3 (+5.4) | 43.2 (+3.0) |
| + AceCoder-RM (80 Steps) | 83.5 (+21.9) | 80.2 (+3.3) | 36.8 (-3.4) |
Ablation Study
| Configuration | HumanEval | MBPP | BigCodeBench-H | Average |
|---|---|---|---|---|
| RM w/o Filter | 73.8 | 73.3 | 17.6 | 45.2 |
| RM w/ Filter | 77.4 | 76.5 | 20.3 | 47.7 |
Key Findings
- AceCode-RM-7B outperforms the 340B Nemotron-Reward by 7.5 points (66.9% vs. 59.4%) on the RM Bench coding category: a 7B code-specific model surpasses a roughly 50× larger general-purpose RM.
- Best-of-N sampling provides a massive boost for weaker models (Mistral +13 points) but shows limited improvements on stronger models (Qwen-Coder +4 points).
- R1-style training is highly efficient: HumanEval+ rises by 22.5 points (61.6 → 84.1) in just 80 steps (48 H100-GPU-hours).
- Rule-based rewards outperform RM rewards for RL training, because the RM is susceptible to reward hacking during RL.
- Test case filtering yields an average improvement of 2.5 points, especially on hard problems (BigCodeBench-Hard +2.7).
- The choice of RM backbone is crucial: the Qwen2.5-Coder backbone outperforms the Llama-3.1 backbone by 11.6 points (HumanEval).
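The binary rule-based reward used in the RL experiments can be contrasted with a graded pass-rate score in a few lines. Both function names are hypothetical; treating an empty test list as reward 0 is an assumption made here for safety, not something the source specifies.

```python
from typing import List


def rule_reward(test_results: List[bool]) -> float:
    """Binary reward: 1.0 only if every test passes, otherwise 0.0."""
    # Assumption: a program with no test results earns no reward.
    return 1.0 if test_results and all(test_results) else 0.0


def pass_rate(test_results: List[bool]) -> float:
    """Fraction of tests passed, as used to rank programs for preference pairs."""
    return sum(test_results) / len(test_results) if test_results else 0.0
```

The binary signal gives no partial credit, so a policy cannot raise its reward by satisfying only the easy tests.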
Highlights & Insights
- The fully automated test case synthesis pipeline is the core contribution: GPT-4o-mini generation + strong model execution filtering = low cost, large scale, and high quality (with only a 1.5% error rate). This pipeline can be transferred to any code RL scenario that requires test execution.
- R1-style training is remarkably efficient: direct RL from a base model (bypassing SFT) reaches a level close to SFT + RL in only 80 steps, challenging the conventional view that SFT is a prerequisite for RL.
- The finding that rule-based rewards outperform RM rewards is thought-provoking: it indicates that RMs are easily exploited (reward hacking) in RL loops, making simple yet precise reward signals (binary pass/fail) more effective.
Limitations & Future Work
- Synthetic test cases still contain noise: passing all tests does not guarantee program correctness, as edge cases might be missed.
- RL shows limited average gain (+0.7 points) on already strong models (e.g., Qwen2.5-Coder-7B-Instruct).
- Test synthesis relies solely on GPT-4o-mini; using stronger models (such as GPT-4) could further improve quality.
- Only one training episode was conducted, leaving further RL scaling strategies unexplored.
Related Work & Insights
- vs. Skywork-Reward: General-purpose RMs fall significantly short on code tasks, making code-specific RMs highly necessary and effective.
- vs. DeepSeek-R1: Adopts its "direct RL from base" approach and shows that it is viable in the coding domain.
- vs. APPS/TACO: While predecessor works rely on manual test-case annotations, AceCoder utilizes fully automated synthesis to scale the dataset size by over 10×.
- Insight: The "synthesis + filtering" paradigm of test cases can be extended to any task with executable validation (e.g., SQL, data analysis).
Rating
- Novelty: ⭐⭐⭐⭐ Valuable contribution with the automated test synthesis pipeline and R1-style code RL.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Multi-model × Multi-benchmark + Best-of-N + RL + R1 + detailed ablation.
- Writing Quality: ⭐⭐⭐⭐ Clear description of the pipeline and comprehensive empirical data.
- Value: ⭐⭐⭐⭐⭐ First to scale code RL to 87K samples, making it highly practical.
Highlights & Insights
- The fully automated test synthesis pipeline resolves the reward signal bottleneck in code RL—using test pass rates as preference signals is more objective than manual annotations.
- The success of direct R1-style RL from base suggests that coding is an ideal domain for RL due to its natural execution-based verifiers.
- The AceCode-87K dataset itself is a significant contribution—comprising 87K problems and 1.38M cleaned test cases.
Limitations & Future Work
- Covers only Python.
- Residual issues remain after test case filtering.
- RL training is highly sensitive to hyperparameters.
Rating
- Novelty: ⭐⭐⭐⭐ First fully automated test synthesis → RM → RL pipeline for code.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Best-of-N + RL + R1-style + multi-benchmark.
- Writing Quality: ⭐⭐⭐⭐ Complete pipeline description and rigorous experimental design.
- Value: ⭐⭐⭐⭐⭐ Unlocks the immense potential of RL for code generation.