KodCode: A Diverse, Challenging, and Verifiable Synthetic Dataset for Coding
Conference: ACL 2025
arXiv: 2503.02951
Code: Yes / Models
Area: Other
Keywords: Synthetic Dataset, Code Generation, Self-Verification, Reinforcement Learning, Reasoning Model
TL;DR
KodCode proposes a three-stage synthetic data pipeline (synthesizing programming problems \(\rightarrow\) solution + unit test self-verification \(\rightarrow\) post-training data synthesis) to construct 447K verified question-solution-test triplets. Models fine-tuned on KodCode outperform Qwen2.5-Coder-32B-Instruct and DeepSeek-R1-Distill-Llama-70B on benchmarks such as HumanEval, MBPP, BigCodeBench, and LiveCodeBench.
Background & Motivation
Training high-performance code LLMs requires high-quality, verifiable training data, but existing resources suffer from three major limitations:
- Limited scale of human-annotated datasets: Although TACO (26K), APPS (10K), and CodeContests (13K) are of high quality, their scales are limited.
- Insufficient quality of synthetic datasets:
  - Code Alpaca (20K): Low diversity, low difficulty, no unit tests.
  - Evol Instruct (111K): Low diversity, no test verification.
  - OSS Instruct (75K): Moderate diversity, no test verification.
- Lack of a unified large-scale, multi-difficulty, verifiable dataset: No existing dataset simultaneously satisfies high diversity, mixed difficulty, and validation via unit tests.
The core goal of KodCode is to construct a large-scale (447K), diverse (12 subsets), challenging (from easy to competitive level), and verifiable (comes with unit tests) synthetic coding dataset.
Method
Overall Architecture
A three-stage pipeline: Step 1 Synthesizing diverse programming problems \(\rightarrow\) Step 2 Generating solutions and unit tests with self-verification \(\rightarrow\) Step 3 Post-training data synthesis (style conversion + CoT response generation by reasoning models).
Key Designs
- Step 1: Programming Problem Synthesis (12 subsets, 5 methods):
  - Magpie-Prefill: Leverages the prefilled suffix ("Write a Python function that") + Qwen2.5-Coder-7B completion to efficiently generate simple problems.
  - Evaluation Task Expansion: Uses GPT-4o as a teacher LLM to generate new problems after analyzing seed problem structures (seeds sourced from LeetCode, Codeforces, APPS, TACO, and CodeContests).
  - DSA Knowledge to Problems: Generates data structure and algorithm problems from Python DSA code snippets.
  - Technical Documentation to Problems: Converts documentation from libraries such as Flask, Pandas, and PyTorch into programming tasks with built-in quality control.
  - Additional Problems: Generated using Magpie + 7 open-source LLMs, with high-quality problems retained through filtering with an LLM classifier.
  - Deduplication: Uses `all-mpnet-base-v2` embeddings + FAISS nearest-neighbor distance filtering.
  - Design Motivation: Combining multiple sources and methods ensures diversity and difficulty coverage across the problems.
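The deduplication step can be illustrated with a minimal sketch. It makes several assumptions that are not from the source: hand-written toy vectors stand in for `all-mpnet-base-v2` embeddings, a brute-force cosine-similarity scan stands in for FAISS nearest-neighbor search, and the `threshold` value and the function names are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def deduplicate(embeddings, threshold=0.9):
    """Greedy filter: keep an item only if it is not too close to any kept item.

    `threshold` is an illustrative value, not one reported in the source.
    Returns the indices of the retained items, in input order.
    """
    kept = []
    for i, emb in enumerate(embeddings):
        if all(cosine_similarity(emb, embeddings[j]) < threshold for j in kept):
            kept.append(i)
    return kept

# Toy 3-d "embeddings"; in practice these would come from a sentence encoder.
toy_embeddings = [
    [1.0, 0.0, 0.0],    # problem A
    [0.99, 0.05, 0.0],  # near-duplicate of problem A
    [0.0, 1.0, 0.0],    # problem B
    [0.0, 0.0, 1.0],    # problem C
]
kept_indices = deduplicate(toy_embeddings)
print(kept_indices)
```

The brute-force scan is quadratic in the number of problems, which is why an approximate nearest-neighbor index is the more plausible choice at the scale described here.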
- Step 2: Solution and Test Generation (Self-Verification Mechanism):
  - Uses GPT-4o to simultaneously generate both the solution and unit tests.
  - Executes unit tests to verify the correctness of the solution.
  - Uses pytest-cov to conduct branch coverage analysis, retaining only triplets with 100% branch coverage.
  - Key Innovation: Allocating Extra Attempts for Hard Problems:
    - Up to \(n=10\) attempts per problem.
    - Generates the solution and tests from scratch in each attempt (to avoid cascading failures caused by initial faulty tests).
    - Retains new versions only if their branch coverage is no lower than prior attempts.
    - Problems that still fail after \(n\) attempts are discarded (as they may be inherently flawed).
  - Naturally generates difficulty labels: categorized into easy/medium/hard based on pass rates.
  - Yields 279K verified triplets in the end.
  - Design Motivation: Prevents easy-problem bias introduced by simply discarding hard problems.
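The retry-based self-verification described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: `fake_generator` is a hypothetical stand-in for the GPT-4o call, tests are plain `test_*` functions run with `exec` rather than pytest, branch coverage is omitted, all attempts are run so that a pass rate can be computed, and the difficulty cutoffs are invented for the example.

```python
def run_tests(solution_src, test_src):
    """Return True if every `test_*` function passes against the solution.

    WARNING: exec() on generated code is unsafe; a real pipeline would run
    this inside a sandbox.
    """
    namespace = {}
    try:
        exec(solution_src, namespace)
        exec(test_src, namespace)
        tests = [f for name, f in namespace.items()
                 if name.startswith("test_") and callable(f)]
        if not tests:
            return False  # no tests means nothing was verified
        for test in tests:
            test()
    except Exception:
        return False
    return True

def self_verify(generate_attempt, max_attempts=10):
    """Generate a fresh (solution, tests) pair on every attempt.

    Returns (first_passing_pair_or_None, pass_rate). A problem whose pair
    is None after all attempts would be discarded.
    """
    passing_pair = None
    passes = 0
    for attempt in range(max_attempts):
        solution_src, test_src = generate_attempt(attempt)
        if run_tests(solution_src, test_src):
            passes += 1
            if passing_pair is None:
                passing_pair = (solution_src, test_src)
    return passing_pair, passes / max_attempts

def difficulty_label(pass_rate):
    """Map a pass rate to a label; the cutoffs are illustrative assumptions."""
    if pass_rate > 2 / 3:
        return "easy"
    if pass_rate >= 1 / 3:
        return "medium"
    return "hard"

# Hypothetical generator: even-numbered attempts return a buggy solution.
BUGGY = "def add(a, b):\n    return a - b\n"
CORRECT = "def add(a, b):\n    return a + b\n"
TESTS = "def test_add():\n    assert add(2, 3) == 5\n"

def fake_generator(attempt):
    return (CORRECT if attempt % 2 else BUGGY), TESTS

pair, rate = self_verify(fake_generator, max_attempts=10)
label = difficulty_label(rate)
print(rate, label)
```

Regenerating both the solution and the tests on each attempt means a faulty first test suite cannot block every later solution, which is the cascading failure the text refers to.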
- Step 3: Post-Training Data Synthesis:
  - Style Converter: Rewrites natural language problems into Python function signature format, paired with the solution and test inputs.
    - Generates an additional 168K triplets (totaling 447K), which are directly applicable for RL training.
  - SFT Data Generation: Uses DeepSeek R1 to generate CoT responses with 3 attempts per problem + test-based rejection sampling.
  - Design Motivation: Bridges the gap between programming problem formats and training data formats.
Loss & Training
- SFT: Qwen2.5-Coder-32B-Instruct, learning rate 1e-5, maximum sequence length 16384.
- RL (GRPO): Qwen2.5-7B-Instruct-1M / Qwen2.5-Coder-7B-Instruct, 256 steps, 16 rollouts per problem, binary reward (1 if all tests pass, 0 otherwise).
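The binary reward used in the RL setup can be sketched in a few lines. The reward rule and the 16 rollouts per problem come from the text above; the group-relative advantage shown here is the standard GRPO normalization (reward minus group mean, divided by group standard deviation) and is an assumption about this training run, as are the function names and the `eps` value.

```python
import statistics

def binary_reward(test_results):
    """1.0 if there is at least one test and all tests passed, else 0.0."""
    return 1.0 if test_results and all(test_results) else 0.0

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of rollouts for the same problem.

    `eps` (an assumed value) avoids division by zero when every rollout
    receives the same reward, in which case all advantages are 0.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Toy group of 16 rollouts: the first 4 pass every test, the other 12 fail one.
rollout_test_results = [[True, True, True]] * 4 + [[True, False, True]] * 12
rewards = [binary_reward(results) for results in rollout_test_results]
advantages = group_relative_advantages(rewards)
print(rewards[:5], advantages[0], advantages[-1])
```

With a binary reward, a group in which every rollout passes or every rollout fails carries no learning signal, which is one reason a dataset with mixed difficulty is useful for this kind of training.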
Key Experimental Results
Main Results
| Model | HumanEval | MBPP | BCB-C Full | BCB-C Hard | BCB-I Full | BCB-I Hard | Average |
|---|---|---|---|---|---|---|---|
| Qwen2.5-Coder-32B-Inst | 90.9 | 90.2 | 57.6 | 31.1 | 49.4 | 25.7 | 59.25 |
| DeepSeek-R1-Distill-70B | 89.0 | 81.7 | 53.5 | 25.7 | 43.9 | 25.7 | 57.79 |
| Bespoke-Stratos-32B | 88.4 | 88.1 | 56.2 | 33.1 | 47.3 | 27.0 | 59.64 |
| KodCode-32B-SFT-50K | 92.7 | 89.9 | 59.8 | 37.8 | 51.1 | 32.4 | 61.22 |
| KodCode-32B-SFT-Hard-18K | 90.9 | 89.2 | 59.7 | 37.2 | 50.5 | 31.1 | 61.26 |
Ablation Study
| Data Selection | BCB-C Hard | BCB-I Hard | LCB Hard |
|---|---|---|---|
| KodCode-SFT-Hard-10K | 39.9 | 31.8 | 6.3 |
| KodCode-SFT-10K (Random) | 38.5 | 27.7 | 4.8 |
| KodCode-SFT-NoConvert-10K | 35.1 | 28.4 | 5.6 |
RL Experiments
| Model | Steps | BCB-C Full | BCB-I Full | HumanEval | Average |
|---|---|---|---|---|---|
| Qwen2.5-Coder-7B-Inst (Baseline) | - | 52.0 | 41.8 | 91.5 | 52.32 |
| + GRPO KodCode | 128 | 52.5 | 42.2 | 90.9 | 53.56 |
| + GRPO KodCode | 256 | 53.7 | 42.9 | 90.2 | 53.99 |
Key Pipeline Analysis Metrics
| Validation Metric | MBPP | LiveCodeBench-V5 |
|---|---|---|
| Self-verification pass rate | 88.9% (80/90) | 49.9% (190/381) |
| Pass rate on human tests | 97.5% (78/80) | 99.47% (189/190) |
| Pass@1 \(\rightarrow\) Pass@5 | Average gain of 20%+ | - |
| Potential contamination count | 94/447K (0.02%) | - |
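The percentages in the table follow directly from the reported counts. The short check below recomputes them; the final quantity, the share of all MBPP problems whose self-verified solution also passes the human-written tests, is derived here from the same counts and is not a number reported in the table.

```python
def percent(numerator, denominator):
    """Ratio expressed as a percentage."""
    return 100.0 * numerator / denominator

mbpp_self_verified = percent(80, 90)      # self-verification pass rate, MBPP
lcb_self_verified = percent(190, 381)     # self-verification pass rate, LiveCodeBench-V5
mbpp_human_pass = percent(78, 80)         # retained MBPP solutions passing human tests
lcb_human_pass = percent(189, 190)        # retained LiveCodeBench-V5 solutions passing human tests
contamination = percent(94, 447_000)      # potentially contaminated items
mbpp_end_to_end = percent(78, 90)         # derived: verified and human-test correct

for name, value in [
    ("MBPP self-verified", mbpp_self_verified),
    ("LCB-V5 self-verified", lcb_self_verified),
    ("MBPP human-test pass", mbpp_human_pass),
    ("LCB-V5 human-test pass", lcb_human_pass),
    ("Contamination", contamination),
    ("MBPP end-to-end (derived)", mbpp_end_to_end),
]:
    print(f"{name}: {value:.2f}%")
```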
Key Findings
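- KodCode SFT models achieve the best average score among the compared models: KodCode-32B-SFT-50K beats the strongest baseline on BigCodeBench Hard by 4.7 points (Complete, 37.8 vs 33.1) and 5.4 points (Instruct, 32.4 vs 27.0), although Qwen2.5-Coder-32B-Inst remains slightly ahead on MBPP (90.2 vs 89.9).
- Key value of hard problems: Hard-10K scores 4.1 points higher than random 10K on BCB-I Hard (31.8 vs 27.7).
- Style converter effectiveness: Removing the style converter drops BCB-C Hard from 38.5 (random 10K) to 35.1 (NoConvert-10K), a 3.4-point decrease.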
- Value of extra attempts: Pass@1 \(\rightarrow\) Pass@5 improves by 20%+, and Pass@5 \(\rightarrow\) Pass@10 further gains 4%.
- Reliable self-verification: The retained solutions achieve a pass rate of 97.5%-99.5% on human-annotated tests.
- Reinforcement learning is effective: GRPO raises the average score from 52.32 to 53.56 (128 steps) and 53.99 (256 steps) and improves both BigCodeBench Full scores, although HumanEval dips slightly (91.5 to 90.2).
Highlights & Insights
- Core innovation lies in pipeline design rather than model architecture: Each step of the three-stage pipeline features clear design considerations.
- Clever strategy for handling difficult problems: Utilizing extra attempts instead of discarding them both retains hard problems and naturally generates difficulty labels.
- Self-verification + rejection sampling provides low-cost, highly reliable data quality assurance.
- Dual applicability of the dataset for SFT and RL: The solution-test pairs are naturally suited for designing RL reward signals.
- t-SNE visualization clearly demonstrates the diversity advantage of KodCode: covering the entire space rather than clustering in a single corner.
Limitations & Future Work
- Limited performance on LiveCodeBench-Hard: Competitive-level programming problems remain a weakness, requiring more high-difficulty problems.
- Data synthesis relies on GPT-4o and DeepSeek R1, incurring high costs.
- Limited to the Python language, lacking multilingual support.
- Optimal active selection strategies for post-training data remain unexplored.
- Lack of synthetic data for repository-level code.
Related Work & Insights
- Shares a similar concept of teacher-model test generation with OpenCoder (Huang et al., 2024), but KodCode incorporates hard problem retention and style conversion.
- The mutation-based test expansion of EvalPlus (Liu et al., 2024) can complement the self-verification of KodCode.
- The pipeline design can be generalized to other domains that require verifiable data, such as mathematical reasoning.
- Substantiates the perspective that synthetic data can enable smaller models to outperform larger ones.
Rating
- Novelty: ⭐⭐⭐⭐ — The pipeline design is novel (particularly the hard problem retention strategy), but the core components are based on existing technologies.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ — Extremely comprehensive pipeline analysis (self-verification, Pass@k, contamination, diversity) + performance evaluations (SFT/RL/ablation).
- Writing Quality: ⭐⭐⭐⭐⭐ — Clear logic, rich visualizations (t-SNE, Sankey diagram, difficulty distribution), and highly detailed pipeline descriptions.
- Value: ⭐⭐⭐⭐⭐ — Open-sourced dataset, open-sourced models, and reproducible methods; delivers significant infrastructural value to the code LLM community.