AdaptiveStep: Automatically Dividing Reasoning Step through Model Confidence

Conference: ICML 2025
arXiv: 2502.13943
Code: https://github.com/Lux0926/ASPRM
Area: LLM Reasoning / Process Reward Model
Keywords: Process Reward Model, Reasoning Step Division, Model Confidence, Token-level Value-guided Decoding, Mathematical Reasoning

TL;DR

Proposes AdaptiveStep, a method that automatically divides reasoning steps based on model prediction confidence to train a more precise Process Reward Model (ASPRM). On mathematical reasoning and code generation tasks, it surpasses existing open-source PRMs at less than 70% of the data construction cost, and further enhances reasoning performance through token-level value-guided decoding.

Background & Motivation

Process Reward Models (PRMs) provide finer-grained feedback than Outcome Reward Models (ORMs) by giving reward signals for each step in the reasoning process, thereby guiding LLMs to generate higher-quality reasoning responses. However, existing PRMs face a core problem: the division of reasoning steps is too coarse.

The current mainstream practice relies on rule-based step division, such as splitting by newline characters or a fixed number of tokens. However, this approach has two key limitations: (1) Model confidence at newline characters is often very high, meaning these positions are not real "decision points" and carry low information; (2) In fields like code generation, it is difficult to define universal splitting rules. Although manual annotation can produce high-quality step divisions, it is highly costly and heavily dependent on expert knowledge.

The authors draw inspiration from cognitive science: Kahneman pointed out that deep thinking accounts for only about 2% of total human thinking, with critical reasoning decisions concentrated at a few nodes. They therefore let the model itself indicate where the key decision points are: when the model's prediction confidence for the next token is low, that position is a decision point where an important choice must be made, and it should serve as the boundary between steps.

Method

Overall Architecture

The overall workflow of AdaptiveStep consists of three steps: (1) sampling responses and collecting the confidence distribution of each token; (2) dividing reasoning steps based on a confidence threshold and annotating the reward of each step through rollouts; (3) training the PRM using the annotated data, and optionally applying the PRM for Token-level Value-guided Decoding (TVD) to enhance reasoning.

Key Designs

  1. Confidence-based Step Division (AdaptiveStep):

    • Function: Automatically segments reasoning responses into multiple highly informative reasoning steps.
    • Mechanism: For the \(i\)-th token in the generated response \(s^n\), its confidence is defined as \(c_{s_i^n} = p(s_i^n | \pi, q, s_{<i}^n)\), i.e. the probability the model \(\pi\) assigns to this token given the question \(q\) and the preceding tokens. After collecting the confidence distribution over all sampled responses, a threshold \(\tau\) is chosen so that a fixed percentage of all tokens (2% in the paper) fall below it. Token positions with confidence below \(\tau\) are treated as step division points. Thus, the response \(s^n\) is divided into \(K\) reasoning steps \(\{r_1, r_2, ..., r_K\}\).
    • Design Motivation: Low-confidence positions represent difficult decision points faced by the model—such as calculations in mathematical expressions, semantic vocabulary choices, or the determination of final answers. Statistical analysis shows that 3.85% of tokens in mathematical expressions contribute 21.03% of decision tokens, and only 2.7% of decision tokens appear at newline characters, confirming the inefficiency of rule-based division.
  2. Rollout-based Step Reward Estimation:

    • Function: Estimates the target reward value for each divided reasoning step.
    • Mechanism: Perform \(J\) rollouts starting from each step \(r_k\), and use Hard Estimation (HE) to judge whether any rollout path can reach the correct answer. The target reward is: \[r_k^e = \begin{cases} 1, & \exists j \in [J], \{r_1,...,r_k,t_j\} \text{ is correct} \\ 0, & \text{otherwise} \end{cases}\]
    • Design Motivation: By performing rollouts at decision points, the reward signal for each step is more accurate because the end of the step is precisely where the decision occurs.
  3. Token-level Value-guided Decoding (TVD):

    • Function: Leverages the PRM during the reasoning phase to guide token selection in real-time, without requiring additional sampling.
    • Mechanism: During decoding, when the model encounters a low-confidence position (\(c_p < \tau\)), it takes the top \(M\) candidate tokens with the highest probabilities, scores each candidate with the PRM, and selects the token with the highest score: \[s_i = \arg\max_{s_i^m \in s_i^*} R^\theta(p, s_{<i}, s_i^m)\]
    • Design Motivation: Traditional PRMs are only used for post-hoc evaluation in Best-of-N. TVD embeds the PRM into the generation process to achieve fine-grained, real-time guidance. Since it only intervenes at low-confidence positions, the computational overhead is controllable.
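The three mechanisms above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the function names, the `score` callback (a stand-in for the PRM \(R^\theta\) with the prompt and prefix already bound), and the choice to let the low-confidence token close the current step are all assumptions made for the example.

```python
from typing import Callable, List, Sequence, Tuple


def split_steps(tokens: Sequence[str], confidences: Sequence[float],
                tau: float) -> List[List[str]]:
    """Divide a response into steps at tokens whose confidence is below tau.

    Assumption: the low-confidence token is the last token of its step.
    """
    steps: List[List[str]] = []
    current: List[str] = []
    for token, conf in zip(tokens, confidences):
        current.append(token)
        if conf < tau:  # decision point: close the step here
            steps.append(current)
            current = []
    if current:  # trailing tokens after the last decision point
        steps.append(current)
    return steps


def hard_estimate(rollout_is_correct: Sequence[bool]) -> int:
    """Hard Estimation: reward 1 if any of the J rollouts is correct, else 0."""
    return 1 if any(rollout_is_correct) else 0


def tvd_select(candidates: Sequence[Tuple[str, float]], tau: float,
               score: Callable[[str], float], top_m: int) -> str:
    """Pick the next token from (token, probability) candidates.

    Assumption: the confidence of a position is the probability of the
    greedy (most likely) token. `score` is a hypothetical PRM wrapper.
    """
    ranked = sorted(candidates, key=lambda c: c[1], reverse=True)
    best_token, best_prob = ranked[0]
    if best_prob >= tau:
        return best_token  # confident position: plain greedy decoding
    # Low-confidence position: let the PRM choose among the top-M tokens.
    return max((tok for tok, _ in ranked[:top_m]), key=score)


tokens = ["2", "+", "3", "=", "5", ".", "So", "x", "=", "5"]
confs = [0.99, 0.98, 0.97, 0.99, 0.40, 0.99, 0.30, 0.95, 0.99, 0.97]
steps = split_steps(tokens, confs, tau=0.5)
```

Because the PRM is consulted only when the greedy token's probability falls below \(\tau\), the number of extra PRM calls in this sketch scales with the number of decision points rather than with the response length.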

Loss & Training

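The PRM is trained using binary cross-entropy loss, where \(r_k^\theta\) is the PRM's predicted reward for step \(k\) and \(r_k^e\) is the rollout-estimated target: \[\mathcal{L}_{PRM}^\theta = -\sum_{k=1}^{K} \left( r_k^e \log r_k^\theta + (1 - r_k^e) \log(1 - r_k^\theta) \right)\]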
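As a worked illustration of the binary cross-entropy objective above, the following sketch computes the loss for one response in plain Python. The function name and the `eps` clamp (added only to avoid `log(0)`) are assumptions for the example; an actual training setup would more likely use a deep learning framework's built-in loss.

```python
import math
from typing import Sequence


def prm_bce_loss(targets: Sequence[int], predictions: Sequence[float],
                 eps: float = 1e-12) -> float:
    """Sum of per-step binary cross-entropy terms over the K steps."""
    total = 0.0
    for target, pred in zip(targets, predictions):
        # Clamp the prediction so that log() never receives 0.
        pred = min(max(pred, eps), 1.0 - eps)
        total -= target * math.log(pred) + (1 - target) * math.log(1.0 - pred)
    return total


# Two steps, one labeled correct and one incorrect, both predicted at 0.5.
loss = prm_bce_loss([1, 0], [0.5, 0.5])
```

With both predictions at 0.5, each step contributes \(\log 2\), so the loss for this toy response is \(2 \log 2\).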

Training data construction: each question is sampled 30 times and the responses are deduplicated, with 8 rollouts per step, yielding approximately 388k mathematical PRM training samples and 49k code PRM samples. The threshold is set to 2%, meaning that approximately 2% of the tokens act as step boundaries.
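One plausible way to turn the "2% of tokens" rule into a concrete threshold is to take a low quantile of the pooled token confidences. The sketch below assumes exactly that; the function name and the quantile convention are illustrative choices, not details confirmed by the source.

```python
from typing import Iterable, List


def confidence_threshold(confidences: Iterable[float],
                         fraction: float = 0.02) -> float:
    """Return tau such that roughly `fraction` of tokens fall strictly below it."""
    ordered: List[float] = sorted(confidences)
    if not ordered:
        raise ValueError("need at least one confidence value")
    # Index of the first token that should NOT become a step boundary.
    cut = min(int(len(ordered) * fraction), len(ordered) - 1)
    return ordered[cut]


# Toy pool of 100 token confidences: 0.01, 0.02, ..., 1.00.
pool = [i / 100 for i in range(1, 101)]
tau = confidence_threshold(pool, fraction=0.02)
num_boundaries = sum(c < tau for c in pool)
```

In this toy pool the two least confident tokens fall below `tau`, which matches the 2% target.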

Key Experimental Results

Main Results

| Dataset | Metric | ASPRM | Baseline | Gain |
|---|---|---|---|---|
| GSM8k (BoN, N=64) | Accuracy | 90.45 (ASPRM-L) | 88.70 (ER-PRM) | +1.75 |
| MATH500 (TVD) | Accuracy | 42.00 (ASPRM-L) | 38.80 (Greedy) | +3.20 |
| GSM8k (TVD) | Accuracy | 83.47 (ASPRM-L) | 81.80 (Greedy) | +1.67 |
| LeetCodeDataset (TVD) | Pass@1 | 28.00 | 26.28 (Greedy) | +1.72 |
| LiveCodeBench (TVD) | Pass@1 | 19.92 | 19.21 (Greedy) | +0.71 |

Note: Under TVD, Math-Shepherd and ER-PRM caused performance degradation on GSM8k (lower than Greedy), whereas ASPRM consistently brings improvements.

Ablation Study

| Configuration | Key Metric | Description |
|---|---|---|
| Threshold 0.5% | BoN GSM8k is lower | Too few division points, insufficient information |
| Threshold 1.0% | Performance increases incrementally | Discriminative power is enhanced under more decision points |
| Threshold 2.0% | Best | Matches the 2% deep thinking ratio in cognitive science |
| L→M Transfer | Bo64 drops, but TVD can improve | Training data across different models shows some transferability, but it is limited |
| Mixed Math + Code | Math Bo64 86.35↑, MATH500 TVD 29.00↑ | Cross-domain data can mutually enhance each other |

Key Findings

  • AdaptiveStep divisions are far more informative than rule-based divisions: In mathematical tasks, only 2.7% of decision tokens are newline characters, whereas 29% are at conjunctions, and 21% are within mathematical expressions.
  • In code tasks, 80% of decision points are in code comments, of which 91% are of the "planning the next step" type, demonstrating that the model is most uncertain when "thinking".
  • Significant cost advantage in data construction: ASPRM uses only a single model, 30 samples, and 8 rollouts, at less than 70% of the data construction cost of Math-Shepherd and ER-PRM.
  • Cross-domain generalization: Mathematical PRMs can provide effective guidance on code tasks (LeetCodeDataset BoN 34.29↑), and vice versa.
  • Generalization over scoring positions: The performance of ASPRM barely drops under random scoring positions, whereas models trained on newline splitting vary greatly under different settings.

Highlights & Insights

  • It uses the model's own confidence as the step division signal, which is a simple and elegant idea supported by cognitive science (Kahneman's 2% deep thinking).
  • The TVD strategy upgrades PRM from "post-hoc evaluation" to "real-time guidance". It only intervenes at low-confidence positions, resulting in minimal computational overhead and significant performance gains.
  • It open-sources a function-level LeetCode dataset (including test cases and a sandbox), filling the gap in code PRM training data.
  • Mixed training with cross-domain data is a practical, low-cost trick to enhance PRMs.

Limitations & Future Work

  • The 2% threshold is not optimal for all models; stronger models may require less training data (the paper observes this but does not explore adaptive threshold selection deeply).
  • Generating training data with a single model limits transferability; ASPRM-M's performance on MATH500 is inferior to baseline models built using multiple models.
  • PRM training data for code tasks is harder to obtain (49k vs 388k); scaling up the data could further improve performance.
  • Although TVD only intervenes at low-confidence positions, it still requires additional PRM inference, which may introduce latency in extremely long generation scenarios.

Comparison with Related Work

  • vs Math-Shepherd: Math-Shepherd also uses rollout annotation, but divides steps at newline characters and builds its data with multiple models, making it more expensive and its steps less informative.
  • vs ER-PRM: Uses 16 rollouts (ASPRM only uses 8), incurring higher construction costs, but performs worse than ASPRM on GSM8k.
  • vs Token-level PRM (OmegaPRM): Scores at every token or a fixed number of tokens, which is extremely costly to annotate; ASPRM only scores at decision points, showing superior efficiency.
  • vs MCTS-based decoding: TVD is more lightweight and does not require full tree search.

Rating

  • Novelty: ⭐⭐⭐⭐ The idea of using confidence to divide steps is natural and effective, though the core technical components (rollouts, PRM training) are standard.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ Covers both mathematical and code domains, evaluates on both BoN and TVD, and includes transferability, generalization, threshold, and feature analysis.
  • Writing Quality: ⭐⭐⭐⭐ The structure is clear, figures/tables are rich and intuitive, and the analysis is in-depth.
  • Value: ⭐⭐⭐⭐ High practical value; it reduces PRM construction costs while boosting performance, offering a significant reference for PRM research.