
SAPO: Self-Adaptive Process Optimization Makes Small Reasoners Stronger

Conference: AAAI 2026 arXiv: 2601.20312 Code: N/A Area: LLM Reasoning Keywords: process supervision, self-evolution, first error detection, small language models, reasoner-verifier gap

TL;DR

Inspired by Error-Related Negativity (ERN) in neuroscience, this paper proposes SAPO, a self-adaptive process optimization method that replaces costly step-wise Monte Carlo rollouts with first error detection and local posterior estimation. SAPO reduces computational cost by 2–3× while enabling joint optimization of the reasoner and verifier, allowing small language models (≤2B) to outperform most self-evolution methods on mathematical and code reasoning tasks.

Background & Motivation

Background: Large language models exhibit strong reasoning capabilities but incur high computational costs. Small language models (SLMs, ≤2B) are viable candidates for deployment on mobile devices. Self-evolution methods enhance SLM reasoning through a reasoner–verifier interaction framework.

Limitations of Prior Work:

  • Most self-evolution methods rely solely on outcome rewards, neglecting fine-grained step-level feedback, which leads to reward hacking.
  • Monte Carlo process supervision is more precise but computationally prohibitive (10k problems × 8 steps × 10 trajectories = 800k rollouts).
  • Across multiple iterations, the gap between the reasoner and verifier widens (Figure 1), degrading the quality of verifier evaluations.

Key Challenge: A fundamental trade-off exists between the granularity of process supervision and computational efficiency—step-level feedback is necessary to reduce the reasoner-verifier gap, yet full MC estimation is too costly.

Goal: To achieve joint optimization of the reasoner and verifier without significantly increasing supervision cost.

Key Insight: Inspired by Error-Related Negativity (ERN)—humans can rapidly localize the point of error after making a mistake and adjust their behavior accordingly—only the first error needs to be located to provide effective process supervision.

Core Idea: Replace step-wise MC rollouts with verifier-based first error localization followed by local posterior verification by the reasoner, introducing process supervision signals at minimal cost.

Method

Overall Architecture

SAPO adopts an iterative explore-exploit paradigm. In each iteration: the verifier pre-scores trajectories → first error detection → reasoner posterior verification → verifier update → preference data construction → reasoner alignment. The reasoner and verifier co-evolve throughout this process.

Key Designs

  1. First Error Detection
     • Function: The verifier \(V\) scores each step of a trajectory, and the position with the largest score drop is identified as the candidate first error location.
     • Mechanism: Compute score differences between adjacent steps, \(\Delta_j = \hat{c}_{j-1} - \hat{c}_j\), and select \(\hat{t} = \arg\max_j \Delta_j\).
     • Design Motivation: The first error position alone suffices to provide effective process supervision (Uesato et al. 2022), eliminating the need to annotate every step.

  2. Self-Verification
     • Function: The reasoner performs posterior estimation at the verifier-predicted first error location to correct the pre-annotated labels.
     • Mechanism: Rollout verification is conducted only at positions \(\hat{t}-1\) and \(\hat{t}\), covering three cases:
       • Case (a): \(c_{\hat{t}-1}=1, c_{\hat{t}}=0\) → the prediction is correct.
       • Case (b): \(c_{\hat{t}-1}=1, c_{\hat{t}}=1\) → the true first error lies later in the trajectory (\(\hat{t}<t\)); extend the correct labels forward.
       • Case (c): \(c_{\hat{t}-1}=0, c_{\hat{t}}=0\) → the true first error occurred earlier (\(\hat{t}>t\)); shift the error label backward.
     • Design Motivation: Only 2 rollouts are required for verification, versus full step-wise rollouts under MC estimation.

  3. Expansion
     • Function: Leverages trajectories generated during rollout to improve the generalization of verification.
     • Mechanism: If \(c_{\hat{t}}=1\), all steps of the correct rollout trajectory prefixed by \(s_{(0:\hat{t})}\) are labeled correct; if \(c_{\hat{t}}=0\), all suffixes starting from \(s_{\hat{t}}\) are labeled incorrect.
     • Design Motivation: Rollouts naturally yield diverse samples that can be reused to augment PRM training data.
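The detect-and-verify logic above can be sketched in a few lines. This is an illustrative reconstruction, not the authors' implementation (no code is released): function names are mine, `step_scores` stands in for the verifier's per-step estimates \(\hat{c}_j\), and the two rollout outcomes at \(\hat{t}-1\) and \(\hat{t}\) are passed in as booleans.

```python
def detect_first_error(step_scores):
    """Candidate first-error index: the largest drop between adjacent
    verifier scores, i.e. t_hat = argmax_j (c_{j-1} - c_j)."""
    drops = [step_scores[j - 1] - step_scores[j] for j in range(1, len(step_scores))]
    return 1 + max(range(len(drops)), key=drops.__getitem__)

def verify_and_correct(labels, t_hat, c_prev, c_at):
    """Correct pre-annotated step labels using the two posterior rollouts:
    c_prev = rollout outcome at t_hat - 1, c_at = rollout outcome at t_hat."""
    labels = list(labels)
    if c_prev == 1 and c_at == 0:
        pass                                   # case (a): prediction confirmed
    elif c_prev == 1 and c_at == 1:
        for j in range(t_hat + 1):             # case (b): true error is later;
            labels[j] = 1                      # extend correct labels forward
    elif c_prev == 0 and c_at == 0:
        for j in range(t_hat - 1, len(labels)):  # case (c): true error is earlier;
            labels[j] = 0                        # shift the error label backward
    return labels
```

For example, scores `[0.9, 0.85, 0.3, 0.2]` give the largest drop between steps 1 and 2, so \(\hat{t}=2\); the two rollouts then either confirm the label split there or shift it one step in the indicated direction.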

Loss & Training

  • Verifier Training: A classification PRM trained with MSE loss — \(\mathcal{L}_{PRM} = \frac{1}{n}\sum_i\sum_j\big(f(s_{(0:j)}^i; q^i) - c_j^i\big)^2\)
  • Reasoner Alignment: The reasoner is aligned on preference pairs with the ORPO objective — \(\mathcal{L}_{ORPO} = \mathbb{E}\big[\mathcal{L}_{SFT}(q,\tau^w) - \beta\log\sigma\big(\log\frac{\text{odds}(\tau^w\mid q)}{\text{odds}(\tau^l\mid q)}\big)\big]\)
  • Preference Data Construction: A threshold \(\eta\) is set; a positive–negative sample pair is constructed when \(r(\tau_i^w) - r(\tau_i^l) \geq \eta\).
  • Iteration Strategy: 3 iterations in total; the verifier is updated before the reasoner is optimized in each round.
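The threshold-gated pair construction can be sketched as follows. This is a minimal sketch under my own assumptions: the paper only states the gating condition \(r(\tau^w) - r(\tau^l) \geq \eta\), so the pairing loop, the default `eta`, and the function name are illustrative.

```python
def build_preference_pairs(trajectories, rewards, eta=0.3):
    """Pair trajectories as (winner, loser) whenever the reward gap
    r(tau_w) - r(tau_l) is at least the threshold eta."""
    pairs = []
    for i, (t_i, r_i) in enumerate(zip(trajectories, rewards)):
        for t_j, r_j in zip(trajectories[i + 1:], rewards[i + 1:]):
            if r_i - r_j >= eta:
                pairs.append((t_i, t_j))   # t_i wins over t_j
            elif r_j - r_i >= eta:
                pairs.append((t_j, t_i))   # t_j wins over t_i
    return pairs
```

Trajectories whose reward gap falls below \(\eta\) are simply not paired, which filters out ambiguous comparisons before ORPO alignment.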

Key Experimental Results

Main Results — Mathematical Reasoning and Code Generation

| Method | Qwen-2.5-0.5B GSM8K | Qwen-2.5-0.5B MATH (OOD) | Llama-3.2-1B GSM8K | Gemma-2-2B GSM8K |
| --- | --- | --- | --- | --- |
| CoT | 28.51 | 25.97 | 5.31 | 19.86 |
| SFT | 34.19 | 24.84 | 22.14 | 39.12 |
| RFT | 37.83 | 27.35 | 26.31 | 45.34 |
| Online-RFT | 40.79 | 28.85 | 29.03 | 48.67 |
| SFT+GRPO | 46.24 | 34.53 | 26.46 | 44.65 |
| SAPO-iter3 | 41.62 | 31.72 | 34.19 | 49.73 |

| Method | Qwen-2.5-0.5B MBPP | Llama-3.2-1B MBPP | Gemma-2-2B MBPP |
| --- | --- | --- | --- |
| SFT+GRPO | 35.20 | 25.55 | 33.40 |
| SAPO-iter3 | 36.67 | 28.92 | 35.43 |

Ablation Study — GSM8K

| Method | Qwen-2.5-0.5B | Llama-3.2-1B |
| --- | --- | --- |
| SAPO (Full) | 41.62 | 34.19 |
| w/o Process Feedback (PF) | 39.65 | 32.37 |
| w/o Detection & Verification (DV) | 40.86 | 31.69 |
| w/o Reward Model (RM) | 40.71 | 32.75 |
| w/o Expansion (EP) | 40.33 | 33.73 |

Key Findings

  • Model Dependency of GRPO: GRPO performs best on Qwen-2.5-0.5B but underperforms SAPO on Llama and Gemma, indicating that GRPO's effectiveness is highly dependent on the base model's capacity.
  • Advantage on Code Tasks: SAPO consistently outperforms GRPO on code generation, suggesting that process supervision is more effective for structured reasoning.
  • Efficiency Gains: Compared to Shepherd (step-wise rollout), SAPO reduces FLOPs and wall-clock time by 2–3×.
  • Verifier Performance Improvement: SAPRM continually improves process verification performance across iterations and outperforms Online ORM on code verification in particular.
  • Importance of Synchronized Alignment: Reasoner–verifier pairs from the same iteration (e.g., V3-R3) exhibit the lowest error rates; cross-iteration pairings yield higher error rates.

Highlights & Insights

  • Neuroscience Inspiration: The ERN analogy provides an elegant intuition—humans do not need to examine every step, but can rapidly localize the first error and adjust accordingly.
  • Adaptive Efficiency: The first error localization cleverly reduces O(n) MC estimation to O(1) local verification, striking an effective balance between accuracy and efficiency.
  • Two New Benchmarks: GSM_Process (3,786 instances) and MBPP_Process (1,499 instances) are introduced to evaluate process verifiers, addressing the absence of step-level verification benchmarks.
  • Iterative Convergence Analysis: Self-verification error rate experiments (Table 2) directly quantify changes in the reasoner–verifier gap across iterations.

Limitations & Future Work

  • SAPO still underperforms SFT+GRPO on mathematical tasks with Qwen-2.5-0.5B (41.62 vs. 46.24), indicating that GRPO remains more effective in certain settings.
  • Experiments are limited to models ≤2B; applicability to larger models has not been validated.
  • First error detection depends on verifier quality; a weak initial verifier may introduce cascading errors.
  • The expansion strategy assumes all steps before the first error are correct and all subsequent steps are incorrect, whereas in practice some later steps may be partially correct.

Comparison with Related Work
  • GRPO (Shao et al. 2024): Group Relative Policy Optimization is effective for large models but not necessarily optimal for SLMs → SAPO offers a more stable alternative for SLMs.
  • V-STaR (Hosseini et al. 2024): Uses verifiers to internalize alignment signals → SAPO further advances reasoner–verifier co-optimization.
  • OmegaPRM (Luo et al. 2024): Employs binary search to localize the first error → SAPO's detect-and-verify strategy provides a more comprehensive solution.

Rating

  • Novelty: ⭐⭐⭐⭐ The ERN-inspired first error detection combined with self-verification is novel, and the efficiency optimization approach is elegant.
  • Experimental Thoroughness: ⭐⭐⭐⭐ Three backbone models × two task types × multiple baselines, with comprehensive ablation and efficiency analysis.
  • Writing Quality: ⭐⭐⭐⭐ Motivation is clearly articulated; the logic from problem to method to experiment is coherent.
  • Value: ⭐⭐⭐⭐ Offers clear practical contributions to SLM self-evolution and efficient process supervision.