AceCoder: Acing Coder RL via Automated Test-Case Synthesis¶

Conference: ACL 2025
arXiv: 2502.01718
Code: https://tiger-ai-lab.github.io/AceCoder
Area: Other
Keywords: Code Generation, Reinforcement Learning, Reward Model, test case synthesis, R1-style training

TL;DR¶

The study constructs AceCode-87K (87K coding problems with 1.38M automatically synthesized test cases) and uses it to train a code-specific reward model (RM); the 7B RM outperforms the 340B Nemotron reward model on code. Best-of-N sampling with the RM improves Llama-3.1-8B-Instruct by up to 8.9 points on average. R1-style RL applied directly to a base model for only 80 steps improves HumanEval+ by 22.5 points (61.6 → 84.1).

Background & Motivation¶

Background: Code generation models primarily rely on SFT for improvements, while the potential of RL remains under-explored. The success of RL in mathematics (such as DeepSeek-R1) suggests that the coding domain can also benefit.

Limitations of Prior Work: (1) Code evaluation requires executing test cases (unlike mathematics, which can rely on direct string matching), leading to a lack of reliable reward signals; (2) Large-scale coding datasets with test cases are scarce—APPS/TACO rely on manual annotation, and previous RL works were limited to <5K samples.

Key Challenge: Existing general reward models (such as Skywork) fail to generalize to code evaluation, whereas manual annotation of test cases is too costly to scale.

Goal: Establish an automated pipeline to generate large-scale, reliable test cases from code datasets → train a code-specific RM → utilize the RM for Best-of-N sampling or RL.

Key Insight: Batch "imagining" test cases using GPT-4o-mini combined with execution filtering by a strong code model yields low-cost, high-quality, and large-scale test data.

Core Idea: Fully automated test case synthesis + execution filtering → code RM training → RL/Best-of-N, scaling RL for code generation to 87K samples.

Method¶

Overall Architecture¶

A five-phase pipeline: (1) GPT-4o-mini generates LeetCode-style problems, each with ~20 test cases, from a seed code dataset; (2) Qwen2.5-Coder-32B generates solutions, and test cases that those solutions fail are filtered out; (3) preference pairs are constructed from the filtered data; (4) AceCode-RM (7B/32B) is trained on those pairs with a Bradley-Terry loss; (5) the RM is used for Best-of-N sampling or Reinforcement++ RL.

Key Designs¶

Automated Test-Case Synthesis + Filtering (AceCode-87K):
- Function: Starting from 124K Python functions (Magicoder-Evol/OSS/StackPy), GPT-4o-mini rewrites them into LeetCode-style problems and generates ~20 test cases.
- Mechanism: Qwen2.5-Coder-32B generates a solution for each problem, and the synthesized test cases are executed against it. Test cases that the solution fails are removed, and problems left with fewer than 5 test cases are discarded.
- Final Scale: 87,149 problems and approximately 1.38M cleaned test cases (an average of 15.87 per problem).
- Quality Verification: Only 3 invalid cases out of 200 manually inspected samples (1.5% error rate).
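
The filtering step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes each test case is a standalone `assert` statement, and the helper names `run_test` and `filter_problem` are hypothetical. It runs code in-process with `exec()` for brevity; a real pipeline would need a sandboxed subprocess with a timeout.

```python
from typing import List, Optional

MIN_TESTS = 5  # from the text: problems with fewer than 5 surviving tests are dropped


def run_test(solution_src: str, test_src: str) -> bool:
    """Return True if the assert-style test passes against the solution.

    Hypothetical helper. exec() is used only to keep the sketch short;
    it offers no isolation and no protection against infinite loops.
    """
    namespace: dict = {}
    try:
        exec(solution_src, namespace)  # define the candidate function(s)
        exec(test_src, namespace)      # run one assert against them
        return True
    except Exception:
        return False


def filter_problem(solution_src: str, tests: List[str]) -> Optional[List[str]]:
    """Keep only tests the reference solution passes; drop the problem if too few remain."""
    kept = [t for t in tests if run_test(solution_src, t)]
    return kept if len(kept) >= MIN_TESTS else None
```

Note that this treats the strong model's solution as a proxy oracle: a correct test is discarded whenever that solution happens to be wrong, which trades some recall for cleaner tests.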
Preference Pair Construction and RM Training:
- Function: Sampling 16 programs per problem, sorting them by test pass rate, and constructing preference pairs.
- Mechanism: Selective pairing. Program \(i\) is paired as preferred over program \(j\) only when \(s_i > s_j + 0.4\), \(s_i > 0.8\), and \(s_j > 0\), where \(s_i\) is the test pass rate of program \(i\). This keeps only pairs with a clear quality gap.
- Training: Bradley-Terry loss with Qwen2.5-Coder-7B-Instruct as the backbone, 1 epoch, running on 8×A100 GPUs for 24 hours.
- Results: 307,509 valid preference pairs generated from 46,618 problems.
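
The selective pairing rule can be illustrated with a short sketch. The function name `build_pairs` and the list-of-pass-rates input are assumptions made for illustration; only the three thresholds come from the text.

```python
from itertools import permutations
from typing import List, Tuple


def build_pairs(pass_rates: List[float]) -> List[Tuple[int, int]]:
    """Return (chosen_idx, rejected_idx) pairs under the selective pairing rule.

    pass_rates[k] is the fraction of test cases passed by sampled program k.
    """
    pairs = []
    for i, j in permutations(range(len(pass_rates)), 2):
        s_i, s_j = pass_rates[i], pass_rates[j]
        # Thresholds from the text: strong winner, non-trivial loser, clear margin.
        if s_i > 0.8 and s_j > 0 and s_i > s_j + 0.4:
            pairs.append((i, j))
    return pairs
```

For example, with pass rates `[1.0, 0.9, 0.4, 0.2, 0.0]` the two strong programs are each paired against the two mid-range programs, while the program that passes no tests is never used as a rejected sample.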
RL Training (Reinforcement++):
- Function: Conducting RL with either AceCode-RM or binary rule-based rewards (all tests passed = 1, otherwise 0).
- Mechanism: Reinforcement++ algorithm (more efficient than PPO, requiring no value model); training exclusively on the top 25% hardest problems in AceCode-87K.
- R1-style Experiment: Direct RL starting from Qwen2.5-Coder-7B-base (without SFT) for only 80 steps.
- Key Findings: Rule-based rewards are more effective than RM rewards, because the policy learns to exploit the RM (reward hacking) during RL.
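
The binary rule-based reward is simple enough to state in a few lines. This is a sketch under assumptions: the function name is hypothetical, and returning 0.0 for a problem with no test results is a choice made here, not something the source specifies.

```python
from typing import List


def rule_based_reward(test_results: List[bool]) -> float:
    """Binary reward: 1.0 only if every test case passed, otherwise 0.0.

    An empty result list yields 0.0 (an assumption of this sketch).
    """
    return 1.0 if test_results and all(test_results) else 0.0
```

Unlike a learned RM score, this signal cannot be raised without actually passing the tests, which is consistent with the reward-hacking finding above.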

Loss & Training¶

RM Training: Standard Bradley-Terry loss. RL Training: Reinforcement++ algorithm, rollout batch=256, 8 samples/question, lr=5e-7, 1 episode, 6 hours on 8×H100 GPUs.
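
For reference, the standard Bradley-Terry objective for a reward model can be written as below. The notation is an assumption introduced here, not taken from the source: \(r_\theta\) is the reward model, \(x\) the problem, \(y_w\) and \(y_l\) the preferred and rejected programs of a pair, \(\mathcal{D}\) the set of preference pairs, and \(\sigma\) the logistic sigmoid.

\[
\mathcal{L}_{\mathrm{BT}}(\theta) = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}} \left[ \log \sigma\!\left( r_\theta(x, y_w) - r_\theta(x, y_l) \right) \right]
\]

Minimizing this loss pushes the score of the preferred program above that of the rejected one; only the score difference matters, so the absolute scale of \(r_\theta\) is not pinned down.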

Key Experimental Results¶

Best-of-N Sampling Results (Llama-3.1-8B-Instruct)¶

Method	HumanEval	MBPP	BigCodeBench-C	LiveCodeBench	Average
Greedy	68.9	67.2	38.5	18.0	40.9
AceCode-RM-7B (Best-of-64)	81.7	74.6	47.8	27.6	49.3
AceCode-RM-32B (Best-of-64)	85.4	72.0	48.5	31.0	49.8
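
The Best-of-N procedure behind the table above amounts to scoring every sampled program with the RM and keeping the top one. The sketch below assumes a generic `score(prompt, candidate)` callable standing in for the reward model; both the function name `best_of_n` and that interface are hypothetical.

```python
from typing import Callable, List


def best_of_n(
    prompt: str,
    candidates: List[str],
    score: Callable[[str, str], float],
) -> str:
    """Return the candidate program with the highest reward-model score."""
    if not candidates:
        raise ValueError("need at least one candidate")
    return max(candidates, key=lambda c: score(prompt, c))
```

With N = 64, `candidates` would hold 64 programs sampled from the policy model for the same prompt; no test execution is needed at selection time, only RM inference.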

R1-style RL Training (Direct RL from Base Model, 80 Steps)¶

Configuration	HumanEval+	MBPP+	BigCodeBench-I
Qwen2.5-Coder-7B-Base	61.6	76.9	40.2
+ AceCoder-Rule (80 Steps)	84.1 (+22.5)	82.3 (+5.4)	43.2 (+3.0)
+ AceCoder-RM (80 Steps)	83.5 (+21.9)	80.2 (+3.3)	36.8 (-3.4)

Ablation Study¶

Configuration	HumanEval	MBPP	BigCodeBench-H	Average
RM w/o Filter	73.8	73.3	17.6	45.2
RM w/ Filter	77.4	76.5	20.3	47.7

Key Findings¶

AceCode-RM-7B outperforms the 340B Nemotron-Reward by 7.5 points (66.9% vs. 59.4%) on the RM Bench coding category, so a 7B code-specific RM surpasses a general-purpose RM roughly 50× its size.
Best-of-N sampling provides a massive boost for weaker models (Mistral +13 points) but shows limited improvements on stronger models (Qwen-Coder +4 points).
R1-style training is highly efficient, raising HumanEval+ by 22.5 points in just 80 steps (6 hours on 8×H100, i.e., 48 H100-GPU-hours).
Rule-based rewards > RM rewards for RL training, because RM suffers from reward hacking during the RL process.
Test case filtering yields an average improvement of 2.5 points (45.2 → 47.7), with gains on every benchmark shown in the ablation: HumanEval +3.6, MBPP +3.2, and BigCodeBench-Hard +2.7.
The choice of RM backbone is crucial: the Qwen2.5-Coder backbone outperforms the Llama-3.1 backbone by 11.6 points (HumanEval).

Highlights & Insights¶

The fully automated test case synthesis pipeline is the core contribution: GPT-4o-mini generation + strong model execution filtering = low cost, large scale, and high quality (with only a 1.5% error rate). This pipeline can be transferred to any code RL scenario that requires test execution.
R1-style training is remarkably efficient: direct RL from a base model (bypassing SFT) reaches a level close to SFT + RL in only 80 steps, challenging the conventional view that SFT is a prerequisite for RL.
The finding that rule-based rewards outperform RM rewards is thought-provoking: it indicates that RMs are easily exploited (reward hacking) in RL loops, making simple yet precise reward signals (binary pass/fail) more effective.

Limitations & Future Work¶

Synthetic test cases still contain noise: passing all tests does not guarantee program correctness, as edge cases might be missed.
RL shows limited average gain (+0.7 points) on already strong models (e.g., Qwen2.5-Coder-7B-Instruct).
Test synthesis relies solely on GPT-4o-mini; using stronger models (such as GPT-4) could further improve quality.
Only one training episode was conducted, leaving further RL scaling strategies unexplored.

Related Work & Insights¶

vs. Skywork-Reward: General-purpose RMs fall significantly short on code tasks, making code-specific RMs highly necessary and effective.
vs. DeepSeek-R1: Adapts its "direct RL from base" philosophy and proves its viability in the coding domain.
vs. APPS/TACO: While predecessor works rely on manual test-case annotations, AceCoder utilizes fully automated synthesis to scale the dataset size by over 10×.
Insight: The "synthesis + filtering" paradigm of test cases can be extended to any task with executable validation (e.g., SQL, data analysis).

Rating¶

Novelty: ⭐⭐⭐⭐ Valuable contribution with the automated test synthesis pipeline and R1-style code RL.
Experimental Thoroughness: ⭐⭐⭐⭐⭐ Multi-model × Multi-benchmark + Best-of-N + RL + R1 + detailed ablation.
Writing Quality: ⭐⭐⭐⭐ Clear description of the pipeline and comprehensive empirical data.
Value: ⭐⭐⭐⭐⭐ First to scale code RL to 87K samples, making it highly practical.

Highlights & Insights¶

The fully automated test synthesis pipeline resolves the reward signal bottleneck in code RL—using test pass rates as preference signals is more objective than manual annotations.
The success of direct R1-style RL from base suggests that coding is an ideal domain for RL due to its natural execution-based verifiers.
The AceCode-87K dataset itself is a significant contribution—comprising 87K problems and 1.38M cleaned test cases.

Limitations & Future Work¶

Covers only Python. Residual issues remain after test case filtering. RL training is highly sensitive to hyperparameters.

Rating¶

Novelty: ⭐⭐⭐⭐ First fully automated test synthesis → RM → RL pipeline for code.
Experimental Thoroughness: ⭐⭐⭐⭐⭐ Best-of-N + RL + R1-style + multi-benchmark.
Writing Quality: ⭐⭐⭐⭐ Complete pipeline description and rigorous experimental design.
Value: ⭐⭐⭐⭐⭐ Unlocks the immense potential of RL for code generation.
