SCAN: Self-Denoising Monte Carlo Annotation for Robust Process Reward Learning

Conference: NeurIPS 2025
arXiv: 2509.16548
Code: Available
Area: LLM Reasoning
Keywords: Process Reward Model, Monte Carlo Estimation, Noisy Labels, Self-Denoising, Mathematical Reasoning

TL;DR

This paper proposes the SCAN framework, which analyzes the noise distribution in Monte Carlo annotations and uses that analysis to design a self-denoising sampling strategy and a robust learning loss. A PRM trained on only 101K samples generated by a 1.5B model rivals one trained on the human-annotated PRM800K dataset (69.1 vs. 69.3 average Best-of-8 accuracy in the main results).

Background & Motivation

Process Reward Models (PRMs) guide the reasoning process of LLMs through step-level evaluation, demonstrating strong performance on complex tasks such as mathematical reasoning. However, PRMs face a data annotation dilemma:

Human annotation is prohibitively expensive: Datasets such as PRM800K are of high quality but costly to annotate and difficult to scale.

Monte Carlo (MC) estimation suffers from high noise: Using a model to perform multiple rollouts to estimate step correctness is a promising alternative, but the resulting labels have a high noise ratio and PRMs trained on them tend to overfit that noise.

Existing denoising methods rely on strong model distillation: For example, using a 72B critic model to filter data essentially distills the capability of a large model into a smaller one.

The central question of this paper is: Can high-quality PRMs be trained without relying on external strong supervision, by exploiting the self-denoising potential of MC estimation itself and designing robust learning strategies?

The authors first conduct a systematic study of the noise distribution in MC annotations. They define a self-confidence metric \(SC_\pi(q)\) to quantify the confidence of the completer model \(\pi\) on a given problem \(q\), and identify two primary noise types, where \(t_{pred}\) and \(t_{true}\) denote the predicted and true positions of the first erroneous step:

Under-Estimation (\(t_{pred} < t_{true}\)): Insufficient model capability causes the completer to fail at generating correct rollouts even from correct prefixes, leading to premature incorrect judgments. This is concentrated in low self-confidence regions.
Over-Estimation (\(t_{pred} > t_{true}\)): The model's error-correction capability allows it to generate correct rollouts even after erroneous steps, delaying the detection of error positions.
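
The two noise types can be made concrete with a minimal sketch. It assumes that self-confidence is estimated as the fraction of correct rollouts sampled for a question (the summary does not spell out the estimator) and that positions are step indices of the first error; the function names are hypothetical.

```python
from typing import Sequence


def self_confidence(rollout_correct: Sequence[bool]) -> float:
    """Estimate SC(q) as the fraction of correct rollouts for one question.

    Assumption: rollouts are sampled by the completer from the question
    alone, and correctness is judged on the final answer.
    """
    if not rollout_correct:
        raise ValueError("need at least one rollout")
    return sum(rollout_correct) / len(rollout_correct)


def noise_type(t_pred: int, t_true: int) -> str:
    """Classify MC annotation noise by comparing first-error positions."""
    if t_pred < t_true:
        return "under-estimation"  # error flagged earlier than it occurs
    if t_pred > t_true:
        return "over-estimation"   # error flagged later than it occurs
    return "clean"


sc = self_confidence([True, False, True, True])
print(sc, noise_type(t_pred=2, t_true=4), noise_type(t_pred=5, t_true=4))
```

With few rollouts the estimate of `self_confidence` is coarse, which is one reason the metric depends on adequate sampling.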

Method

Overall Architecture

SCAN comprises two core modules: (1) an efficient data synthesis framework that reduces inference cost through selective sampling; and (2) a robust learning strategy that resists noise via noise-tolerant labels and confidence-based reweighting.

Key Designs

Selective MC Annotation (Efficient Data Synthesis):
- Annotate only negative samples: Responses whose final answer is correct (positive samples) are used for training as-is, without step-level MC annotation. Because positive samples in high self-confidence regions exhibit extremely low noise (Observation 4), this saves the cost of 80 rollouts per sample.
- Perform step-level annotation only on high-confidence negative samples: Negative samples satisfying \(SC_\pi(q_i) > \epsilon\) are selected for step-level MC estimation, ensuring that 100% of MC-annotated samples are incorporated into the training set.
Noise-tolerant Labeling: Over-estimation means \(t_{pred} > t_{true}\), and Observation 5 shows that the true error typically lies shortly before the predicted error position. The steps within \(d\) steps before the predicted error position therefore receive soft labels \(\hat{y}_t = \min(c_t / SC_\pi(q), 1)\) instead of hard labels. This lets the model learn from potentially noisy positions without overfitting to them.
Confidence-wise Reweighting: The correctness probability \(c_t\) from MC annotation is biased relative to the true correctness \(c_t^*\) because it depends on the completer model's capability. This bias is corrected via self-confidence: \(\hat{c}_t^* = \min(c_t / SC_\pi(q), 1)\). The core idea is that, after correction, scores annotated by a strong model and a weak model on the same sample should be consistent; normalizing by self-confidence removes the model capability bias.
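
The three designs above can be sketched together as follows. This is an illustrative sketch, not the authors' implementation: it assumes that the predicted error step is the first step whose MC correctness estimate is zero, that steps more than \(d\) positions before it keep the hard label 1, and that labeling stops at the predicted error step. The threshold `eps` and all function names are hypothetical.

```python
from typing import List


def needs_step_annotation(final_correct: bool, sc: float, eps: float) -> bool:
    """Selective MC annotation: only confident negatives get step-level rollouts."""
    return (not final_correct) and sc > eps


def calibrated(c_t: float, sc: float) -> float:
    """Confidence-wise correction min(c_t / SC(q), 1); requires sc > 0."""
    return min(c_t / sc, 1.0)


def tolerant_labels(c: List[float], sc: float, d: int) -> List[float]:
    """Label the steps of a negative sample up to the predicted first error.

    c[t] is the MC correctness estimate for the prefix ending at step t.
    Assumption: the predicted error step is the first t with c[t] == 0.
    """
    t_pred = next((t for t, v in enumerate(c) if v == 0.0), None)
    if t_pred is None:
        raise ValueError("expected a negative sample with a predicted error step")
    labels = []
    for t in range(t_pred + 1):
        if t == t_pred:
            labels.append(0.0)                    # predicted error step
        elif t >= t_pred - d:
            labels.append(calibrated(c[t], sc))   # soft label inside the tolerance window
        else:
            labels.append(1.0)                    # hard positive label
    return labels


if needs_step_annotation(final_correct=False, sc=0.5, eps=0.25):
    print(tolerant_labels([1.0, 0.75, 0.5, 0.25, 0.0], sc=0.5, d=2))
```

Because only samples with `sc > eps` reach `tolerant_labels`, the division in `calibrated` never sees a zero self-confidence in this sketch.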

Loss & Training

The modified BCE loss is:

\[\mathcal{L}_{\text{SCAN}}(\theta) = -\mathbb{E}_{(q, \mathbf{x}_{\leq t}, \hat{y}_t) \sim D_{\text{final}}} \left[\hat{y}_t \log P_\theta(y_t = 1 \mid q, \mathbf{x}_{\leq t}) + (1-\hat{y}_t) \log\left(1 - P_\theta(y_t = 1 \mid q, \mathbf{x}_{\leq t})\right)\right]\]

where the label \(\hat{y}_t\) is a soft label near the predicted error position, a hard label elsewhere, and is calibrated by the completer's self-confidence.
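
As a minimal sketch of how binary cross-entropy behaves with soft labels, the function below averages the per-step loss over a batch of step predictions. It uses only the standard library; the name `soft_bce` and the clipping constant are assumptions for illustration, not details from the paper.

```python
import math
from typing import Sequence


def soft_bce(probs: Sequence[float], labels: Sequence[float], eps: float = 1e-12) -> float:
    """Mean binary cross-entropy where labels may be any value in [0, 1]."""
    if len(probs) != len(labels) or not probs:
        raise ValueError("probs and labels must be non-empty and equally long")
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total -= y * math.log(p) + (1.0 - y) * math.log(1.0 - p)
    return total / len(probs)


# A soft label of 0.5 is best matched by predicting 0.5, not by a confident guess.
print(soft_bce([0.5], [0.5]), soft_bce([0.9], [0.5]))
```

With a soft target, the loss is minimized when the predicted probability equals the label, so the PRM is not pushed toward a confident 0 or 1 at positions where the annotation is uncertain.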

Key Experimental Results

Main Results (Best-of-8, Policy: Qwen2.5-Math-7B-Instruct)

Model	Training Samples	Annotation	GSM8K	MATH	College Math	Olympiad	Avg
Majority Vote@8	—	—	96.9	87.3	47.4	43.0	68.7
RLHFlow-PRM-8B	253K	MC	96.8	87.3	47.9	43.9	69.0
Qwen2.5-Math-PRM-7B	1500K	MC+KD	96.8	88.1	47.7	47.6	70.1
PRM800K	264K	Human	97.0	87.6	47.7	45.0	69.3
Scan-Base	101K	MC	97.1	86.9	47.8	44.4	69.1
Scan-Pro	197K	MC	97.1	87.3	48.1	47.7	70.1
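
For readers unfamiliar with the Best-of-8 protocol behind this table, here is a rough sketch of PRM-guided selection: the policy samples several responses, the PRM scores every step, and the response with the best aggregate score is returned. Aggregating by the minimum step score is an assumption made here for illustration; the summary does not state which aggregation the paper uses.

```python
from typing import List, Sequence, Tuple


def best_of_n(candidates: Sequence[Tuple[str, List[float]]]) -> str:
    """Pick the answer whose PRM step scores have the highest minimum.

    Each candidate is (final_answer, step_scores), with non-empty step_scores.
    """
    if not candidates:
        raise ValueError("need at least one candidate")
    answer, _ = max(candidates, key=lambda cand: min(cand[1]))
    return answer


samples = [
    ("12", [0.9, 0.2, 0.8]),  # one weak step drags the response down
    ("15", [0.7, 0.6, 0.7]),
]
print(best_of_n(samples))
```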

Ablation Study (ProcessBench F1)

Configuration	GSM8K F1	MATH F1	Olympiad F1	Avg F1	Note
Baseline (no denoising)	—	—	—	~35	Rapid overfitting
+ Selective Sampling	—	—	—	~45	Reduces positive sample noise
+ Tolerance Labeling	—	—	—	~52	Resists over-estimation noise
+ Confidence Reweight	—	—	—	59.1	Eliminates model capability bias
Qwen2.5-7B-Ins (critic)	26.8	25.7	14.2	19.9	Base model
Scan-Pro	80.9	65.3	45.9	59.1	After self-training
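
As a reminder of what the F1 column measures, ProcessBench reports the harmonic mean of the accuracy on erroneous samples (locating the first wrong step) and the accuracy on fully correct samples. The helper below is a small sketch of that formula; the function name and example accuracies are hypothetical and not taken from the paper.

```python
def processbench_f1(acc_erroneous: float, acc_correct: float) -> float:
    """Harmonic mean of accuracy on erroneous samples and on correct samples."""
    if acc_erroneous + acc_correct == 0:
        return 0.0
    return 2 * acc_erroneous * acc_correct / (acc_erroneous + acc_correct)


# A model that accepts every solution scores well on correct samples only.
print(processbench_f1(0.0, 1.0), processbench_f1(0.5, 0.5))
```

The harmonic mean penalizes a PRM that is lopsided, for example one that rarely flags errors.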

Key Findings

High-quality data can be generated with only a 1.5B model: Scan-Base uses Qwen2.5-Math-1.5B to generate 101K samples, and the resulting PRM performance approaches that of PRM800K with 264K human annotations.
Substantial self-improvement: The ProcessBench F1 of Qwen2.5-7B-Ins improves from 19.9 to 59.1 (+39.2), surpassing 70B-scale critic models.
Tolerance distance \(d=2\) is optimal: \(d=0\) (hard labels) causes severe overfitting, while \(d=n\) (full soft labels) introduces excessive noise.
Baselines without denoising overfit rapidly: This validates the severe impact of MC noise on PRM training.
Data source diversity is beneficial: Scan-Pro, which integrates data from three models, outperforms any single-source variant.

Highlights & Insights

Systematic noise distribution analysis is the core contribution: This work is the first to reveal the sources and distributional patterns of under-estimation and over-estimation noise in MC annotations from a self-confidence perspective.
The self-denoising strategy is highly efficient: it requires no external strong model and relies solely on the completer's own self-confidence.
Confidence-wise reweighting elegantly addresses the consistency problem in multi-model mixed annotation.
Only 101K samples generated by a 1.5B model suffice to rival human annotation quality, validating the feasibility of the "small model + good strategy" paradigm.

Limitations & Future Work

The tolerance distance \(d\) requires manual selection; adaptive determination warrants further exploration.
The self-confidence metric depends on sufficient sampling (16 rollouts); estimates are unreliable with insufficient sampling.
Validation is currently limited to mathematical reasoning; the noise distribution may differ for code reasoning or general reasoning tasks.
Directly bypassing MC annotation for positive samples may occasionally miss a small number of latent errors.

Related Work & Insights

Math-Shepherd pioneered MC methods for PRM data synthesis but did not investigate noise issues in depth.
PRM800K serves as the benchmark for human annotation; this paper demonstrates that synthetic data can rival it under the right strategy.
Additional techniques from the noisy label learning literature (e.g., MixUp, label smoothing) may offer further complementary benefits.
The proposed framework has reference value for other scenarios requiring process supervision, such as code generation and multi-step reasoning.

Rating

Novelty: ⭐⭐⭐⭐ — Approaching PRM data synthesis from a noise distribution perspective is a novel angle.
Experimental Thoroughness: ⭐⭐⭐⭐⭐ — Dual evaluation via BoN and ProcessBench, comprehensive ablations, and multi-model extensions.
Writing Quality: ⭐⭐⭐⭐⭐ — The logical chain from preliminary analysis → motivation → method → experiments is exceptionally coherent.
Value: ⭐⭐⭐⭐⭐ — A low-cost PRM training solution with direct practical value for reasoning enhancement.