
SCAN: Self-Denoising Monte Carlo Annotation for Robust Process Reward Learning

Conference: NeurIPS 2025
arXiv: 2509.16548
Code: Available
Area: LLM Reasoning
Keywords: Process Reward Model, Monte Carlo Estimation, Noisy Labels, Self-Denoising, Mathematical Reasoning

TL;DR

This paper proposes SCAN, a framework that analyzes the noise distribution in Monte Carlo annotations and uses the findings to design a self-denoising sampling strategy and a robust learning loss. A PRM trained on only 101K samples generated by a 1.5B model surpasses one trained on the human-annotated PRM800K dataset.

Background & Motivation

Process Reward Models (PRMs) guide the reasoning process of LLMs through step-level evaluation, demonstrating strong performance on complex tasks such as mathematical reasoning. However, PRMs face a data annotation dilemma:

Human annotation is prohibitively expensive: Datasets such as PRM800K are of high quality but costly to annotate and difficult to scale.

Monte Carlo (MC) estimation suffers from high noise: Using a model to perform multiple rollouts to estimate step correctness is a promising alternative, but the noise ratio is high and models tend to overfit.

Existing denoising methods rely on strong model distillation: For example, using a 72B critic model to filter data essentially distills the capability of a large model into a smaller one.

The central question of this paper is: Can high-quality PRMs be trained without relying on external strong supervision, by exploiting the self-denoising potential of MC estimation itself and designing robust learning strategies?

The authors first conduct a systematic study of the noise distribution in MC annotations. They define a self-confidence metric \(SC_\pi(q)\) that quantifies the completer model's confidence on a given problem \(q\), and identify two primary noise types:

  • Under-Estimation (\(t_{pred} < t_{true}\)): Insufficient model capability causes the completer to fail at generating correct rollouts even from correct prefixes, leading to premature incorrect judgments. This is concentrated in low self-confidence regions.
  • Over-Estimation (\(t_{pred} > t_{true}\)): The model's error-correction capability allows it to generate correct rollouts even after erroneous steps, delaying the detection of error positions.
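
The two noise types can be illustrated with a minimal sketch. It assumes that self-confidence is estimated as the fraction of the completer's rollouts from the bare question that reach the correct final answer, and that error positions are step indices; the function names are hypothetical and not taken from the paper's code.

```python
from typing import List


def self_confidence(rollout_correct: List[bool]) -> float:
    """Estimate SC(q) as the share of correct rollouts (assumed definition)."""
    if not rollout_correct:
        raise ValueError("need at least one rollout")
    return sum(rollout_correct) / len(rollout_correct)


def noise_type(t_pred: int, t_true: int) -> str:
    """Compare the MC-predicted error step with the true error step."""
    if t_pred < t_true:
        return "under-estimation"  # error flagged too early
    if t_pred > t_true:
        return "over-estimation"   # error flagged too late
    return "exact"


# Three of four rollouts solve the problem, so SC(q) is 0.75.
print(self_confidence([True, True, False, True]))
# MC flags step 5, but the first real error is at step 3.
print(noise_type(t_pred=5, t_true=3))
```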

Method

Overall Architecture

SCAN comprises two core modules: (1) an efficient data synthesis framework that reduces inference cost through selective sampling; and (2) a robust learning strategy that resists noise via noise-tolerant labels and confidence-based reweighting.

Key Designs

  1. Selective MC Annotation (Efficient Data Synthesis):

    • Annotate only negative samples: Responses whose final answer is correct (positive samples) go straight into the training set without step-level MC annotation. Because positive samples in high self-confidence regions exhibit extremely low noise (Observation 4), this saves the 80 rollouts per sample that annotation would otherwise cost.
    • Perform step-level annotation only on high-confidence negative samples: Negative samples satisfying \(SC_\pi(q_i) > \epsilon\) are selected for step-level MC estimation, ensuring that 100% of MC-annotated samples are incorporated into the training set.
  2. Noise-tolerant Labeling: This targets over-estimation, where \(t_{pred} > t_{true}\); Observation 5 shows that the predicted error position typically lies close to the true one. Instead of hard labels, the steps within \(d\) steps before the predicted error position receive soft labels \(\hat{y}_t = \min(c_t / SC_\pi(q), 1)\). This lets the model learn from noisy positions without overfitting to them.

  3. Confidence-wise Reweighting: The correctness probability \(c_t\) from MC annotation is biased relative to the true correctness \(c_t^*\) because it depends on the completer model's capability. Normalizing by self-confidence corrects this: \(\hat{c}_t^* = \min(c_t / SC_\pi(q), 1)\). The core idea is that, after correction, scores assigned to the same sample by strong and weak models should agree, so the normalization removes the model capability bias.
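
The three designs above can be sketched together as follows. This is an illustration under stated assumptions, not the authors' implementation: the predicted error position is taken to be the first step whose MC score is zero, steps before the tolerance window keep a hard label of 1, steps after the predicted error are dropped, and low-confidence negatives are simply skipped. All function names are hypothetical.

```python
from typing import List


def route_sample(final_correct: bool, sc: float, eps: float) -> str:
    """Decide how a sampled response is used (hypothetical helper)."""
    if final_correct:
        return "positive"  # used directly, no step-level MC annotation
    if sc > eps:
        return "annotate"  # high-confidence negative: run step-level MC
    return "discard"       # assumption: low-confidence negatives are skipped


def calibrate(c_t: float, sc: float) -> float:
    """min(c_t / SC(q), 1): MC score normalized by self-confidence."""
    if not 0.0 < sc <= 1.0:
        raise ValueError("self-confidence must be in (0, 1]")
    return min(c_t / sc, 1.0)


def noise_tolerant_labels(step_scores: List[float], sc: float, d: int) -> List[float]:
    """Hard labels far from the predicted error, soft labels within d steps of it."""
    # Assumption: the predicted error is the first step with zero MC score.
    t_pred = next((t for t, c in enumerate(step_scores) if c == 0.0), None)
    if t_pred is None:
        raise ValueError("no step with zero MC score, so no predicted error")
    labels = []
    for t in range(t_pred + 1):
        if t == t_pred:
            labels.append(0.0)                            # predicted error step
        elif t >= t_pred - d:
            labels.append(calibrate(step_scores[t], sc))  # soft label in window
        else:
            labels.append(1.0)                            # hard positive label
    return labels


print(route_sample(final_correct=False, sc=0.5, eps=0.3))
# annotate
print(noise_tolerant_labels([1.0, 0.75, 0.5, 0.25, 0.0], sc=0.5, d=2))
# [1.0, 1.0, 1.0, 0.5, 0.0]
```

In the example, step 3 has an MC score of 0.25 under a completer whose self-confidence is 0.5, so its soft label becomes 0.5 rather than a hard 1.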

Loss & Training

The modified BCE loss is:

\[\mathcal{L}_{\text{SCAN}}(\theta) = -\mathbb{E}_{(x_{\leq t}, y_t) \sim D_{\text{final}}} [y_t \log P_\theta(y_t|q, \mathbf{x}_{\leq t}) + (1-y_t) \log(1 - P_\theta(y_t|q, \mathbf{x}_{\leq t}))]\]

where the target \(y_t\) takes the soft value \(\hat{y}_t\) at steps near the predicted error position and is reweighted via confidence calibration.
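
As a rough illustration of how binary cross-entropy behaves with soft targets, the sketch below averages the per-step loss in plain Python. The function name and the clipping constant are assumptions made for the example; a real training loop would use a deep learning framework's BCE instead.

```python
import math
from typing import List


def soft_bce(probs: List[float], targets: List[float], eps: float = 1e-12) -> float:
    """Mean binary cross-entropy where targets may be soft values in [0, 1]."""
    if len(probs) != len(targets) or not probs:
        raise ValueError("probs and targets must be non-empty and equal in length")
    total = 0.0
    for p, y in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total -= y * math.log(p) + (1.0 - y) * math.log(1.0 - p)
    return total / len(probs)


# With a soft target of 0.5, predicting 0.5 costs less than predicting 0.9.
print(soft_bce([0.5], [0.5]) < soft_bce([0.9], [0.5]))
```

With a soft target the loss is minimized when the prediction equals the target, so the model is not pushed toward an extreme score at steps whose label is uncertain.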

Key Experimental Results

Main Results (Best-of-8, Policy: Qwen2.5-Math-7B-Instruct)

| Model | Training Samples | Annotation | GSM8K | MATH | College Math | Olympiad | Avg |
|---|---|---|---|---|---|---|---|
| Majority Vote@8 | - | - | 96.9 | 87.3 | 47.4 | 43.0 | 68.7 |
| RLHFlow-PRM-8B | 253K | MC | 96.8 | 87.3 | 47.9 | 43.9 | 69.0 |
| Qwen2.5-Math-PRM-7B | 1500K | MC+KD | 96.8 | 88.1 | 47.7 | 47.6 | 70.1 |
| PRM800K | 264K | Human | 97.0 | 87.6 | 47.7 | 45.0 | 69.3 |
| Scan-Base | 101K | MC | 97.1 | 86.9 | 47.8 | 44.4 | 69.1 |
| Scan-Pro | 197K | MC | 97.1 | 87.3 | 48.1 | 47.7 | 70.1 |
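
For readers unfamiliar with the Best-of-8 protocol, a minimal selection sketch follows. The policy samples several candidate solutions, the PRM scores each step, and the candidate with the best aggregate score is kept. Aggregating by the minimum step score is an assumption made here for illustration; this summary does not state which aggregation the paper uses, and the function name is hypothetical.

```python
from typing import Sequence


def best_of_n(candidates: Sequence[str], step_scores: Sequence[Sequence[float]]) -> str:
    """Return the candidate whose weakest step has the highest PRM score."""
    if not candidates or len(candidates) != len(step_scores):
        raise ValueError("need one non-empty score list per candidate")
    aggregate = [min(scores) for scores in step_scores]  # assumed: min over steps
    best = max(range(len(candidates)), key=aggregate.__getitem__)
    return candidates[best]


# Candidate "b" has no weak step, so it wins despite a lower peak score.
print(best_of_n(["a", "b"], [[0.9, 0.2], [0.6, 0.7]]))
# b
```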

Ablation Study (ProcessBench F1)

| Configuration | GSM8K F1 | MATH F1 | Olympiad F1 | Avg F1 | Note |
|---|---|---|---|---|---|
| Baseline (no denoising) | - | - | - | ~35 | Rapid overfitting |
| + Selective Sampling | - | - | - | ~45 | Reduces positive sample noise |
| + Tolerance Labeling | - | - | - | ~52 | Resists over-estimation noise |
| + Confidence Reweight | - | - | - | 59.1 | Eliminates model capability bias |
| Qwen2.5-7B-Ins (critic) | 26.8 | 25.7 | 14.2 | 19.9 | Base model |
| Scan-Pro | 80.9 | 65.3 | 45.9 | 59.1 | After self-training |
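
The F1 values above can be read with the following sketch, which assumes the usual ProcessBench definition: the harmonic mean of the accuracy on samples that contain an error and the accuracy on fully correct samples. The input numbers in the example are made up for illustration and are not results from the paper.

```python
def processbench_f1(acc_error: float, acc_correct: float) -> float:
    """Harmonic mean of accuracy on erroneous and on correct samples."""
    if acc_error + acc_correct == 0:
        return 0.0
    return 2 * acc_error * acc_correct / (acc_error + acc_correct)


# A model that flags everything as wrong scores 100 on erroneous samples
# but 0 on correct ones, and the harmonic mean penalizes that imbalance.
print(processbench_f1(100.0, 0.0))
# 0.0
print(processbench_f1(50.0, 50.0))
# 50.0
```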

Key Findings

  • High-quality data can be generated with only a 1.5B model: Scan-Base uses Qwen2.5-Math-1.5B to generate 101K samples, and the resulting PRM performance approaches that of PRM800K with 264K human annotations.
  • Substantial self-improvement: The ProcessBench F1 of Qwen2.5-7B-Ins improves from 19.9 to 59.1 (+39.2), surpassing 70B-scale critic models.
  • Tolerance distance \(d=2\) is optimal: \(d=0\) (hard labels) causes severe overfitting, while \(d=n\) (full soft labels) introduces excessive noise.
  • Baselines without denoising overfit rapidly: This validates the severe impact of MC noise on PRM training.
  • Data source diversity is beneficial: Scan-Pro, which integrates data from three models, outperforms any single-source variant.

Highlights & Insights

  • Systematic noise distribution analysis is the core contribution: This work is the first to reveal the sources and distributional patterns of under-estimation and over-estimation noise in MC annotations from a self-confidence perspective.
  • The self-denoising strategy is highly efficient—it requires no external strong model and relies solely on the completer's own self-confidence.
  • Confidence-wise reweighting elegantly addresses the consistency problem in multi-model mixed annotation.
  • Just 101K samples and a 1.5B model suffice to match human annotation quality, validating the feasibility of the "small model + good strategy" paradigm.

Limitations & Future Work

  • The tolerance distance \(d\) requires manual selection; adaptive determination warrants further exploration.
  • The self-confidence metric depends on sufficient sampling (16 rollouts); estimates are unreliable with insufficient sampling.
  • Validation is currently limited to mathematical reasoning; the noise distribution may differ for code reasoning or general reasoning tasks.
  • Directly bypassing MC annotation for positive samples may occasionally miss a small number of latent errors.

Related Work & Insights

  • Math-Shepherd pioneered MC methods for PRM data synthesis but did not investigate noise issues in depth.
  • PRM800K serves as the benchmark for human annotation; this paper demonstrates that synthetic data can rival it under the right strategy.
  • Additional techniques from the noisy label learning literature (e.g., MixUp, label smoothing) may offer further complementary benefits.
  • The proposed framework has reference value for other scenarios requiring process supervision, such as code generation and multi-step reasoning.

Rating

  • Novelty: ⭐⭐⭐⭐ — Approaching PRM data synthesis from a noise distribution perspective is a novel angle.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ — Dual evaluation via BoN and ProcessBench, comprehensive ablations, and multi-model extensions.
  • Writing Quality: ⭐⭐⭐⭐⭐ — The logical chain from preliminary analysis → motivation → method → experiments is exceptionally coherent.
  • Value: ⭐⭐⭐⭐⭐ — A low-cost PRM training solution with direct practical value for reasoning enhancement.