Dynamic Scaling of Unit Tests for Code Reward Modeling¶

Attribute	Content
Title	Dynamic Scaling of Unit Tests for Code Reward Modeling
Conference	ACL 2025
arXiv	2501.01054
Code	code-reward-model.github.io
Area	LLM Alignment / Code Generation
Keywords	Code Generation, Unit Test, Reward Model, Dynamic Scaling, Best-of-N

TL;DR¶

This paper finds that scaling up the number of LLM-generated unit tests consistently improves the quality of code reward signals, especially for difficult problems. Building on this finding, the authors train a lightweight unit test generator, CodeRM-8B, and add a dynamic scaling strategy that adapts the number of unit tests to problem difficulty, yielding significant improvements across multiple code generation benchmarks.

Background & Motivation¶

Current LLMs often fail to produce a correct solution to a coding problem in a single attempt. A common remedy is repeated sampling: generate multiple candidate solutions, then use unit tests to verify them and select the best one.
Unit tests generated by LLMs themselves are unreliable (as LLMs often make confident errors), leading to degraded quality in reward signals.
Inspired by the concept that "scaling test-time compute can improve performance," the authors raise the key question: Can generating more unit tests improve the quality of code reward signals?
Preliminary experiments confirm a positive correlation between the number of unit tests and reward quality, with difficult problems benefiting the most.

Method¶

Overall Architecture¶

A majority voting framework for unit tests based on the Best-of-N strategy:
1. The policy model generates \(N\) candidate code solutions \(\{C_1, C_2, \ldots, C_N\}\).
2. The reward model (unit test generator) generates \(M\) unit tests \(\{T_1, T_2, \ldots, T_M\}\) for each problem.
3. Each candidate solution is executed against all unit tests, producing binary execution results \(r_{i,j} \in \{0, 1\}\).
4. The optimal solution is selected via majority voting: \(C_{opt} = \arg\max_{C_i} \sum_{j=1}^{M} r_{i,j}\).
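The selection rule above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: candidates and unit tests are modeled as plain callables (an assumption made here for brevity; the paper executes generated code against generated tests), and the names `run_test` and `select_best` are hypothetical.

```python
from typing import Callable, Sequence

# Assumption for illustration: a candidate is a Python function and a unit
# test is a callable that returns True when the candidate passes it.
Candidate = Callable[[int], int]
UnitTest = Callable[[Candidate], bool]


def run_test(candidate: Candidate, test: UnitTest) -> int:
    """Return r_ij: 1 if the candidate passes the test, 0 otherwise.

    Any exception raised during execution counts as a failure.
    """
    try:
        return 1 if test(candidate) else 0
    except Exception:
        return 0


def select_best(candidates: Sequence[Candidate], tests: Sequence[UnitTest]) -> int:
    """Return the index of the candidate that passes the most tests.

    Ties are broken in favor of the lowest index.
    """
    scores = [sum(run_test(c, t) for t in tests) for c in candidates]
    return max(range(len(candidates)), key=lambda i: scores[i])


# Toy problem: absolute value. Only the second candidate is correct.
candidates = [lambda x: x, lambda x: abs(x), lambda x: -x]

# Generated tests may themselves be wrong; the fourth test below is incorrect.
tests = [
    lambda f: f(3) == 3,
    lambda f: f(-2) == 2,
    lambda f: f(0) == 0,
    lambda f: f(-1) == -1,  # faulty test: expects the wrong output
    lambda f: f(-5) == 5,
]

best = select_best(candidates, tests)
```

Even though one test is faulty, the correct candidate still collects the most votes, which is the intuition behind scaling the number of tests.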

Key Findings¶

Positive Correlation: Increasing the number of unit tests consistently improves Best-of-N performance (across different policy models and reward models).
Weaker Models Benefit More: Llama3-8B achieves an 11% improvement compared to a 5% improvement for Llama3-70B (when GPT-4o is used to generate tests).
Diversity Advantage of Smaller Models: While Gemma-2-27B is significantly inferior to Llama3.1-70B when using a single test, its performance matches Llama3.1-70B when scaling to 100 tests, likely because smaller models generate more diverse tests.
Difficult Problems Benefit More: When categorized into five difficulty tiers based on pass rate, the most difficult problems gain the largest performance improvement from scaling unit tests.

CodeRM-8B: Lightweight Unit Test Generator¶

Data Synthesis Pipeline:
1. Dataset Preprocessing: Based on the CodeFeedback-Filtered-Instruction and TACO datasets, Llama3.1-70B is used to filter out problems unsuitable for unit testing (e.g., tasks containing randomness) and restructure the code solutions into a functional format.
2. Unit Test Generation: Llama3.1-70B is used with repeated sampling to generate diverse unit tests, which are verified for correctness by executing the ground truth code.
3. Execution Feedback Repair: Execution feedback from the Python interpreter is used to fix erroneous unit tests, which is more efficient than pure resampling.
4. Quality Control: High-quality tests should accept correct solutions and reject incorrect ones as much as possible. Incorrect solutions are generated with a weaker model and used to filter out false-positive tests, i.e., tests that incorrect solutions also pass.
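Steps 2 and 4 of the pipeline can be illustrated with a small filter. This is a sketch under stated assumptions rather than the paper's procedure: tests and solutions are modeled as callables, the helper names `passes` and `filter_tests` are hypothetical, and the `max_false_positive_rate` threshold is an illustrative parameter that the source does not specify.

```python
from typing import Callable, List, Sequence

Solution = Callable[[int], int]
UnitTest = Callable[[Solution], bool]


def passes(solution: Solution, test: UnitTest) -> bool:
    """Run one test against one solution; exceptions count as failures."""
    try:
        return bool(test(solution))
    except Exception:
        return False


def filter_tests(
    tests: Sequence[UnitTest],
    ground_truth: Solution,
    wrong_solutions: Sequence[Solution],
    max_false_positive_rate: float = 0.5,  # assumed threshold, not from the paper
) -> List[UnitTest]:
    """Keep tests that accept the ground truth and reject enough wrong solutions."""
    kept = []
    for test in tests:
        # Step 2: a test that the ground-truth solution fails is itself wrong.
        if not passes(ground_truth, test):
            continue
        # Step 4: drop tests that too many known-wrong solutions also pass.
        if wrong_solutions:
            accepted = sum(passes(w, test) for w in wrong_solutions)
            if accepted / len(wrong_solutions) > max_false_positive_rate:
                continue
        kept.append(test)
    return kept


# Toy problem: absolute value, with two deliberately wrong solutions.
ground_truth = lambda x: abs(x)
wrong_solutions = [lambda x: x, lambda x: -x]

t_pos = lambda f: f(3) == 3     # passes ground truth, rejects one wrong solution
t_zero = lambda f: f(0) == 0    # passes ground truth but rejects nothing
t_bad = lambda f: f(-1) == -1   # contradicts the ground truth
t_neg = lambda f: f(-4) == 4    # passes ground truth, rejects one wrong solution

kept = filter_tests([t_pos, t_zero, t_bad, t_neg], ground_truth, wrong_solutions)
```

Here the uninformative test `t_zero` and the incorrect test `t_bad` are discarded, while the two discriminative tests are kept.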

Model Training: Supervised fine-tuning (SFT) is performed with Llama3.1-8B as the base model. The problem together with its code solution serves as the instruction, and the high-quality unit tests serve as the targets.

Dynamic Unit Test Scaling¶

Problem Difficulty Estimation:
- A lightweight difficulty classifier is trained using language model probing (extracting implicit information from the intermediate representations of LLMs).
- The classifier consists of a two-layer feed-forward network that outputs a scalar difficulty value.
- The training loss is cross-entropy loss, aiming to predict the pass rate \(\lambda\).
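One way to read the description above is a two-layer network with a sigmoid output trained against the pass rate as a soft label. The sketch below follows that reading in dependency-free Python. The ReLU activation, the sigmoid output, the binary form of the cross-entropy, and all function names are assumptions made for illustration; the source states only that the probe is a two-layer feed-forward network with a scalar output and a cross-entropy loss.

```python
import math
from typing import List


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def predict_pass_rate(
    hidden: List[float],    # intermediate LLM representation of the problem
    w1: List[List[float]],  # first-layer weights, shape (d_hidden, d_in)
    b1: List[float],        # first-layer bias, shape (d_hidden,)
    w2: List[float],        # second-layer weights, shape (d_hidden,)
    b2: float,              # second-layer bias
) -> float:
    """Two-layer feed-forward probe returning a scalar in (0, 1)."""
    # Layer 1: affine map followed by ReLU (activation choice is an assumption).
    z = [
        max(0.0, sum(w * h for w, h in zip(row, hidden)) + b)
        for row, b in zip(w1, b1)
    ]
    # Layer 2: affine map to a single logit, squashed to a probability.
    logit = sum(w * v for w, v in zip(w2, z)) + b2
    return sigmoid(logit)


def soft_cross_entropy(pred: float, target: float, eps: float = 1e-12) -> float:
    """Binary cross-entropy against a soft label such as an empirical pass rate."""
    pred = min(max(pred, eps), 1.0 - eps)
    return -(target * math.log(pred) + (1.0 - target) * math.log(1.0 - pred))


# With all-zero weights the probe is maximally uncertain.
hidden = [0.2, -1.0, 0.5]
w1 = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
b1 = [0.0, 0.0]
w2 = [0.0, 0.0]
b2 = 0.0
pred = predict_pass_rate(hidden, w1, b1, w2, b2)
loss = soft_cross_entropy(pred, target=0.3)
```

For a fixed soft target, this loss is minimized when the prediction equals the target, which is why it can be used to regress a pass rate.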

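Dynamic Computation Allocation:
- For problem \(x\) (with pass rate \(\lambda\)), the gain of allocating a computation budget of \(b\) is \(q(x,b) = 1 - (1-\lambda)^b\), i.e., the probability that at least one of \(b\) independent attempts succeeds when each succeeds with probability \(\lambda\).
- A greedy algorithm is used to prioritize resource allocation to more difficult problems.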
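A greedy allocator over this gain function can be sketched as follows. Each step gives one more unit of budget to the problem with the largest marginal gain \(q(x, b+1) - q(x, b) = \lambda(1-\lambda)^b\). This is an illustrative reading of the greedy step, not the authors' code; the function names and the use of a heap are choices made here, and in practice \(\lambda\) would come from the difficulty estimator rather than being known exactly.

```python
import heapq
from typing import List, Sequence


def gain(pass_rate: float, budget: int) -> float:
    """q(x, b) = 1 - (1 - lambda)^b from the text."""
    return 1.0 - (1.0 - pass_rate) ** budget


def greedy_allocate(pass_rates: Sequence[float], total_budget: int) -> List[int]:
    """Distribute total_budget units, one at a time, by largest marginal gain."""
    allocation = [0] * len(pass_rates)
    # heapq is a min-heap, so marginal gains are stored negated.
    heap = [(-(gain(p, 1) - gain(p, 0)), i) for i, p in enumerate(pass_rates)]
    heapq.heapify(heap)
    for _ in range(total_budget):
        if not heap:
            break  # no problems to allocate to
        _, i = heapq.heappop(heap)
        allocation[i] += 1
        b = allocation[i]
        p = pass_rates[i]
        # Re-insert the problem with the marginal gain of its next unit.
        heapq.heappush(heap, (-(gain(p, b + 1) - gain(p, b)), i))
    return allocation


# Three problems: easy, medium, hard (pass rates are illustrative values).
pass_rates = [0.8, 0.3, 0.05]
total_budget = 9

allocation = greedy_allocate(pass_rates, total_budget)
greedy_total = sum(gain(p, b) for p, b in zip(pass_rates, allocation))
uniform_total = sum(gain(p, total_budget // len(pass_rates)) for p in pass_rates)
```

Because the marginal gain on an easy problem decays quickly, the easy problem saturates after a few units and the remaining budget shifts toward harder problems. Since \(q\) is concave in \(b\), this unit-by-unit greedy rule does at least as well as a uniform split of the same budget.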

Key Experimental Results¶

Main Results¶

Method	Llama3-8B	Llama3-70B	GPT-3.5	GPT-4o-mini
HumanEval Plus
Vanilla	53.58	73.74	67.83	82.96
CodeRM-8B	72.01 (+18.43)	78.69 (+4.95)	78.01 (+10.18)	86.38 (+3.42)
MBPP Plus
Vanilla	49.20	69.33	70.53	71.59
CodeRM-8B	66.71 (+17.51)	72.44 (+3.11)	75.96 (+5.43)	75.20 (+3.61)

Key Findings:
- CodeRM-8B (8B parameters) achieves performance comparable to Llama3.1-70B while using only ~1/9 of its parameters.
- The improvement is more pronounced for weaker policy models: Llama3-8B gains 18.43 points on HumanEval Plus.
- Even for a proprietary model such as GPT-4o-mini, there is a gain of 3.42 points on HumanEval Plus.

Unit Test Quality Analysis¶

Metric	Llama3.1-8B	Llama3.1-70B	CodeRM-8B
Single Test Accuracy	60.02	73.65	69.64
100 Tests Accuracy	74.21	78.30	80.46
100 Tests F1	74.35	78.76	81.27

Insight: Though the single-test quality of CodeRM-8B is inferior to that of the 70B model, the combined quality over multiple tests is superior. This suggests it generates more diverse tests, providing multi-aspect validation.

Ablation Study¶

Quality Control: Filtering out false-positive tests yields relative performance improvements of approximately 45% (on HE+) and 80% (on MBPP+).
Data Volume: Model performance steadily and consistently improves as the training data scales up.
Dynamic Scaling: Under a fixed computational budget, dynamic allocation provides an additional improvement of approximately 0.5% over uniform allocation (especially pronounced on MBPP Plus).

Highlights & Insights¶

Discovery of Unit Test Scaling Laws: This work is the first to systematically demonstrate that scaling up the number of unit tests correlates positively with the quality of code reward signals, with difficult problems benefiting the most.
Compensation via Smaller Model Diversity: Although smaller models produce less accurate individual tests, they offer better coverage and diversity. When multiple tests are combined, they can outperform larger models, which is useful practical guidance when choosing a test generator.
Efficient Data Synthesis Pipeline: Combines execution feedback repair and false-positive filtering to enable the automated production of high-quality synthetic data.
An 8B Model Rivalling 70B: CodeRM-8B reaches the level of Llama3.1-70B in test generation quality, significantly reducing inference costs.
Novel Dynamic Scaling Concept: Introduces difficulty-aware computational allocation to the domain of unit test generation.

Limitations & Future Work¶

Limitation of Dynamic Scaling: Directly adopting the method from Damani et al. may not fully suit reward modeling scenarios, leaving significant room for improvement.
Lack of In-depth Study on Diversity and Coverage: The underlying mechanism of why smaller models generate more diverse tests is not yet fully understood.
Limited to Function-Level Python: Experiments are confined to Python programming problems that can be converted into a functional format, so applicability to class-level or system-level code remains unknown.

Related Work¶

Code Solution Reranking: Execution-based (MBR-Exec, CodeT, MPSC) and non-execution-based (neural rerankers).
Automated Unit Test Generation: Traditional methods (search/constraint/probabilistic) and LLM-based approaches.
Test-Time Compute Scaling: Research on scaling up test-time inference compute to boost LLM performance.

Rating ⭐⭐⭐⭐¶

Pros: The preliminary experiments are rigorously designed and yield convincing, practical findings. CodeRM-8B achieves large-model-level performance in unit test generation at a much lower inference cost. Difficulty-aware dynamic scaling is a promising direction for future work.

Cons: The actual performance improvement from dynamic scaling is relatively small (~0.5%). There is a lack of deep analysis regarding the diversity and coverage of the unit tests.