SCAN: Self-Denoising Monte Carlo Annotation for Robust Process Reward Learning
Conference: NeurIPS 2025
arXiv: 2509.16548
Code: Available
Area: LLM Reasoning
Keywords: Process Reward Model, Monte Carlo Estimation, Noisy Labels, Self-Denoising, Mathematical Reasoning
TL;DR
This paper proposes SCAN, a framework that analyzes the noise distribution in Monte Carlo annotations and uses it to design a self-denoising sampling strategy and a robust learning loss. A PRM trained on only 101K samples generated by a 1.5B model performs on par with one trained on the human-annotated PRM800K dataset (69.1 vs. 69.3 average Best-of-8 accuracy).
Background & Motivation
Process Reward Models (PRMs) guide the reasoning process of LLMs through step-level evaluation, demonstrating strong performance on complex tasks such as mathematical reasoning. However, PRMs face a data annotation dilemma:
- Human annotation is prohibitively expensive: Datasets such as PRM800K are of high quality but costly to annotate and difficult to scale.
- Monte Carlo (MC) estimation suffers from high noise: Using a model to perform multiple rollouts to estimate step correctness is a promising alternative, but the resulting labels are noisy and PRMs trained on them tend to overfit that noise.
- Existing denoising methods rely on strong model distillation: For example, using a 72B critic model to filter data essentially distills the capability of a large model into a smaller one.
The central question of this paper is: Can high-quality PRMs be trained without relying on external strong supervision, by exploiting the self-denoising potential of MC estimation itself and designing robust learning strategies?
The authors first conduct a systematic study of the noise distribution in MC annotations. They define a self-confidence metric \(SC_\pi(q)\) to quantify the completer model's confidence on a given problem \(q\), and identify two primary noise types, where \(t_{pred}\) is the error position predicted by MC annotation and \(t_{true}\) is the true error position:
- Under-Estimation (\(t_{pred} < t_{true}\)): Insufficient model capability causes the completer to fail at generating correct rollouts even from correct prefixes, leading to premature incorrect judgments. This is concentrated in low self-confidence regions.
- Over-Estimation (\(t_{pred} > t_{true}\)): The model's error-correction capability allows it to generate correct rollouts even after erroneous steps, delaying the detection of error positions.
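The two noise types can be made concrete with a small sketch. The source does not spell out how \(SC_\pi(q)\) is computed, so the version below assumes it is the fraction of sampled responses to a question that reach the correct final answer; the function names are hypothetical.

```python
from typing import List


def self_confidence(rollout_correct: List[bool]) -> float:
    """Assumed definition: share of sampled responses with a correct final answer."""
    if not rollout_correct:
        raise ValueError("need at least one rollout")
    return sum(rollout_correct) / len(rollout_correct)


def noise_type(t_pred: int, t_true: int) -> str:
    """Compare the MC-predicted first error step with the true first error step."""
    if t_pred < t_true:
        return "under-estimation"  # error flagged too early
    if t_pred > t_true:
        return "over-estimation"   # error flagged too late
    return "clean"


sc = self_confidence([True] * 12 + [False] * 4)  # 12 of 16 rollouts correct
print(sc, noise_type(t_pred=2, t_true=4), noise_type(t_pred=5, t_true=4))
```

With 12 of 16 rollouts correct the assumed self-confidence is 0.75; the two calls to `noise_type` illustrate an early and a late error prediction respectively.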
Method
Overall Architecture
SCAN comprises two core modules: (1) an efficient data synthesis framework that reduces inference cost through selective sampling; and (2) a robust learning strategy that resists noise via noise-tolerant labels and confidence-based reweighting.
Key Designs
- Selective MC Annotation (Efficient Data Synthesis):
    - Annotate only negative samples: After generating responses, those whose final answer is correct (positive samples) are used for training directly, without step-level MC annotation. Since positive samples in high self-confidence regions exhibit extremely low noise (Observation 4), this saves the cost of 80 rollouts/sample.
    - Perform step-level annotation only on high-confidence negative samples: Negative samples satisfying \(SC_\pi(q_i) > \epsilon\) are selected for step-level MC estimation, ensuring that 100% of MC-annotated samples are incorporated into the training set.
- Noise-tolerant Labeling: To address over-estimation (\(t_{pred} > t_{true}\)), where Observation 5 shows that the predicted error position typically lies close to the true one, soft labels \(\hat{y}_t = \min(c_t / SC_\pi(q), 1)\) are applied to the \(d\) steps immediately before the predicted error position instead of hard labels. This allows the model to learn from noisy positions without overfitting.
- Confidence-wise Reweighting: The correctness probability \(c_t\) from MC annotation is biased relative to the true correctness \(c_t^*\) because it depends on the completer model's capability. This is corrected via self-confidence: \(\hat{c}_t^* = \min(c_t / SC_\pi(q), 1)\). The core idea is that, after correction, scores annotated by strong and weak models on the same sample should be consistent, because normalizing by self-confidence removes the model capability bias.
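The three designs above can be sketched together in a few lines. This is an illustrative reading, not the authors' implementation: it assumes the predicted error position is the first step whose MC correctness is 0, that steps after it are dropped, and that steps further than \(d\) before it keep a hard label of 1. All function names are hypothetical.

```python
from typing import List


def needs_step_annotation(answer_correct: bool, sc: float, eps: float) -> bool:
    """Only incorrect responses to high self-confidence questions get step-level rollouts."""
    return (not answer_correct) and sc > eps


def corrected_score(c_t: float, sc: float) -> float:
    """Normalize the MC correctness c_t by self-confidence, capped at 1."""
    return min(c_t / sc, 1.0)


def tolerant_labels(step_scores: List[float], sc: float, d: int) -> List[float]:
    """Build labels up to and including the predicted error step.

    Assumption: the predicted error is the first step with c_t == 0.
    """
    t_pred = next((t for t, c in enumerate(step_scores) if c == 0.0), None)
    if t_pred is None:
        raise ValueError("expected at least one step with zero MC correctness")
    labels = []
    for t in range(t_pred + 1):
        if t == t_pred:
            labels.append(0.0)  # predicted error step
        elif t >= t_pred - d:
            labels.append(corrected_score(step_scores[t], sc))  # soft label
        else:
            labels.append(1.0)  # far from the error: hard positive
    return labels


print(tolerant_labels([1.0, 0.75, 0.5, 0.25, 0.0, 0.0], sc=0.5, d=2))
# [1.0, 1.0, 1.0, 0.5, 0.0]
```

In the example, the step two positions before the predicted error has \(c_t = 0.5\), which matches the self-confidence and is therefore corrected to 1.0, while the step directly before it keeps a soft label of 0.5.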
Loss & Training
The PRM is trained with a modified BCE loss in which the label \(\hat{y}_t\) adopts soft labels near error positions and is reweighted via confidence calibration.
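As a rough illustration of training against soft targets, the sketch below computes a plain binary cross-entropy of the form \(-\frac{1}{T}\sum_t [\hat{y}_t \log p_t + (1-\hat{y}_t)\log(1-p_t)]\), where \(p_t\) is the PRM's predicted correctness probability for step \(t\). This form is an assumption based on the description here; the exact loss in the paper may differ, and `soft_bce` is a hypothetical name.

```python
import math
from typing import List


def soft_bce(probs: List[float], labels: List[float], eps: float = 1e-12) -> float:
    """Mean binary cross-entropy between step probabilities and soft targets in [0, 1]."""
    if not probs or len(probs) != len(labels):
        raise ValueError("probs and labels must be non-empty and of equal length")
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total -= y * math.log(p) + (1.0 - y) * math.log(1.0 - p)
    return total / len(probs)


# A soft target of 0.5 is fit best by a prediction of 0.5, not by a confident one.
print(soft_bce([0.5], [0.5]), soft_bce([0.9], [0.5]))
```

With a soft target, the loss is minimized when the prediction equals the target, so the model is not pushed toward a confident 0 or 1 on steps whose label is uncertain.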
Key Experimental Results
Main Results (Best-of-8, Policy: Qwen2.5-Math-7B-Instruct)
| Model | Training Samples | Annotation | GSM8K | MATH | College Math | Olympiad | Avg |
|---|---|---|---|---|---|---|---|
| Majority Vote@8 | — | — | 96.9 | 87.3 | 47.4 | 43.0 | 68.7 |
| RLHFlow-PRM-8B | 253K | MC | 96.8 | 87.3 | 47.9 | 43.9 | 69.0 |
| Qwen2.5-Math-PRM-7B | 1500K | MC+KD | 96.8 | 88.1 | 47.7 | 47.6 | 70.1 |
| PRM800K | 264K | Human | 97.0 | 87.6 | 47.7 | 45.0 | 69.3 |
| Scan-Base | 101K | MC | 97.1 | 86.9 | 47.8 | 44.4 | 69.1 |
| Scan-Pro | 197K | MC | 97.1 | 87.3 | 48.1 | 47.7 | 70.1 |
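For readers unfamiliar with the Best-of-8 setting, the sketch below shows one common way a PRM selects among sampled responses: each response's step scores are aggregated into a single value and the highest-scoring response is kept. The aggregation rule (minimum step score by default) and the function name are assumptions; the source does not state which rule was used.

```python
from typing import Callable, List, Sequence


def best_of_n(
    candidates: Sequence[List[float]],
    aggregate: Callable[[List[float]], float] = min,
) -> int:
    """Return the index of the candidate with the highest aggregated PRM score."""
    if not candidates:
        raise ValueError("need at least one candidate")
    return max(range(len(candidates)), key=lambda i: aggregate(candidates[i]))


# Hypothetical PRM step scores for three sampled responses.
step_scores = [
    [0.9, 0.2, 0.8],
    [0.7, 0.6, 0.65],
    [0.95, 0.9, 0.1],
]
print(best_of_n(step_scores))                        # min aggregation
print(best_of_n(step_scores, lambda s: s[-1]))       # last-step aggregation
```

Under minimum aggregation the second response wins because it has no weak step; under last-step aggregation the first response wins, which shows how the choice of rule can change the selection.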
Ablation Study (ProcessBench F1)
| Configuration | GSM8K F1 | MATH F1 | Olympiad F1 | Avg F1 (all ProcessBench subsets) | Note |
|---|---|---|---|---|---|
| Baseline (no denoising) | — | — | — | ~35 | Rapid overfitting |
| + Selective Sampling | — | — | — | ~45 | Reduces positive sample noise |
| + Tolerance Labeling | — | — | — | ~52 | Resists over-estimation noise |
| + Confidence Reweight | — | — | — | 59.1 | Eliminates model capability bias |
| Qwen2.5-7B-Ins (critic) | 26.8 | 25.7 | 14.2 | 19.9 | Base model |
| Scan-Pro | 80.9 | 65.3 | 45.9 | 59.1 | After self-training |
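The F1 values above are assumed to follow the ProcessBench convention, where F1 is the harmonic mean of the accuracy on samples that contain an error and the accuracy on fully correct samples. A minimal sketch with made-up accuracies (the function name is hypothetical):

```python
def processbench_f1(acc_erroneous: float, acc_correct: float) -> float:
    """Harmonic mean of accuracy on erroneous samples and on correct samples."""
    if acc_erroneous + acc_correct == 0:
        return 0.0
    return 2 * acc_erroneous * acc_correct / (acc_erroneous + acc_correct)


# Illustrative accuracies, not taken from the paper.
print(processbench_f1(60.0, 90.0))
```

Because the harmonic mean is dominated by the smaller value, a model that labels every step as correct scores well on correct samples but near zero on erroneous ones, and its F1 collapses accordingly.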
Key Findings
- High-quality data can be generated with only a 1.5B model: Scan-Base uses Qwen2.5-Math-1.5B to generate 101K samples, and the resulting PRM performance approaches that of PRM800K with 264K human annotations.
- Substantial self-improvement: The ProcessBench F1 of Qwen2.5-7B-Ins improves from 19.9 to 59.1 (+39.2), surpassing 70B-scale critic models.
- Tolerance distance \(d=2\) is optimal: \(d=0\) (hard labels) causes severe overfitting, while \(d=n\) (full soft labels) introduces excessive noise.
- Baselines without denoising overfit rapidly: This validates the severe impact of MC noise on PRM training.
- Data source diversity is beneficial: Scan-Pro, which integrates data from three models, outperforms any single-source variant.
Highlights & Insights
- Systematic noise distribution analysis is the core contribution: This work is the first to reveal the sources and distributional patterns of under-estimation and over-estimation noise in MC annotations from a self-confidence perspective.
- The self-denoising strategy is highly efficient—it requires no external strong model and relies solely on the completer's own self-confidence.
- Confidence-wise reweighting elegantly addresses the consistency problem in multi-model mixed annotation.
- Only 101K samples and a 1.5B model suffice to roughly match human annotation quality, validating the feasibility of the "small model + good strategy" paradigm.
Limitations & Future Work
- The tolerance distance \(d\) requires manual selection; adaptive determination warrants further exploration.
- The self-confidence metric depends on sufficient sampling (16 rollouts); estimates are unreliable with insufficient sampling.
- Validation is currently limited to mathematical reasoning; the noise distribution may differ for code reasoning or general reasoning tasks.
- Directly bypassing MC annotation for positive samples may occasionally miss a small number of latent errors.
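The sensitivity of self-confidence to the number of rollouts can be illustrated with the standard error of a proportion. This sketch assumes rollouts are independent Bernoulli trials with success probability \(p\), which is a simplification; the function name is hypothetical.

```python
import math


def sc_standard_error(p: float, n: int) -> float:
    """Standard error of a proportion estimated from n independent rollouts."""
    if n <= 0 or not 0.0 <= p <= 1.0:
        raise ValueError("require n > 0 and 0 <= p <= 1")
    return math.sqrt(p * (1.0 - p) / n)


for n in (4, 16, 64):
    print(n, sc_standard_error(0.5, n))
```

At \(p = 0.5\), 16 rollouts give a standard error of 0.125, so an estimated self-confidence can easily land on the wrong side of a threshold \(\epsilon\) near the true value; quadrupling the rollouts only halves that error.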
Related Work & Insights
- Math-Shepherd pioneered MC methods for PRM data synthesis but did not investigate noise issues in depth.
- PRM800K serves as the benchmark for human annotation; this paper demonstrates that synthetic data can rival it under the right strategy.
- Additional techniques from the noisy label learning literature (e.g., MixUp, label smoothing) may offer further complementary benefits.
- The proposed framework has reference value for other scenarios requiring process supervision, such as code generation and multi-step reasoning.
Rating
- Novelty: ⭐⭐⭐⭐ — Approaching PRM data synthesis from a noise distribution perspective is a novel angle.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ — Dual evaluation via BoN and ProcessBench, comprehensive ablations, and multi-model extensions.
- Writing Quality: ⭐⭐⭐⭐⭐ — The logical chain from preliminary analysis → motivation → method → experiments is exceptionally coherent.
- Value: ⭐⭐⭐⭐⭐ — A low-cost PRM training solution with direct practical value for reasoning enhancement.