SAPO: Self-Adaptive Process Optimization Makes Small Reasoners Stronger¶
Conference: AAAI 2026 | arXiv: 2601.20312 | Code: N/A | Area: LLM Reasoning | Keywords: process supervision, self-evolution, first error detection, small language models, reasoner-verifier gap
TL;DR¶
Inspired by Error-Related Negativity (ERN) in neuroscience, this paper proposes SAPO, a self-adaptive process optimization method that replaces costly step-wise Monte Carlo rollouts with first error detection and local posterior estimation. SAPO reduces computational cost by 2–3× while enabling joint optimization of the reasoner and verifier, allowing small language models (≤2B) to outperform most self-evolution methods on mathematical and code reasoning tasks.
Background & Motivation¶
Background: Large language models exhibit strong reasoning capabilities but incur high computational costs. Small language models (SLMs, ≤2B) are viable candidates for deployment on mobile devices. Self-evolution methods enhance SLM reasoning through a reasoner–verifier interaction framework.
Limitations of Prior Work:
- Most self-evolution methods rely solely on outcome rewards, neglecting fine-grained step-level feedback, which leads to reward hacking.
- Monte Carlo process supervision is more precise but computationally prohibitive (10k problems × 8 steps × 10 trajectories = 800k rollouts).
- Across multiple iterations, the gap between the reasoner and verifier widens (Figure 1), degrading the quality of verifier evaluations.
Key Challenge: A fundamental trade-off exists between the granularity of process supervision and computational efficiency—step-level feedback is necessary to reduce the reasoner-verifier gap, yet full MC estimation is too costly.
Goal: To achieve joint optimization of the reasoner and verifier without significantly increasing supervision cost.
Key Insight: Inspired by Error-Related Negativity (ERN)—humans can rapidly localize the point of error after making a mistake and adjust their behavior accordingly—only the first error needs to be located to provide effective process supervision.
Core Idea: Replace step-wise MC rollouts with verifier-based first error localization followed by local posterior verification by the reasoner, introducing process supervision signals at minimal cost.
Method¶
Overall Architecture¶
SAPO adopts an iterative explore-exploit paradigm. Each iteration proceeds as: verifier pre-scoring of trajectories → first error detection → reasoner posterior verification → verifier update → preference data construction → reasoner alignment. The reasoner and verifier co-evolve throughout this process, as the sketch below illustrates.
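A structural sketch of the loop in Python, to make the ordering concrete. Every object and method name here (`reasoner.sample`, `verifier.score_steps`, `verifier.fit`, `reasoner.align_orpo`) is an illustrative placeholder rather than the paper's API; `detect_first_error` and `build_preference_pairs` are sketched in the sections below, and `eta=0.5` is an assumed value.

```python
def sapo_loop(reasoner, verifier, problems, iters=3):
    """Sketch of SAPO's explore-exploit loop; all method names are placeholders."""
    for _ in range(iters):
        trajs = reasoner.sample(problems)                    # explore: draw reasoning traces
        for traj in trajs:
            scores = verifier.score_steps(traj)              # verifier pre-scores each step
            t_hat = detect_first_error(scores)               # candidate first error (see below)
            traj.labels = reasoner.self_verify(traj, t_hat)  # local posterior correction
        verifier.fit(trajs)                                  # update the verifier first...
        pairs = build_preference_pairs(trajs, verifier.reward, eta=0.5)  # eta illustrative
        reasoner.align_orpo(pairs)                           # ...then align the reasoner
    return reasoner, verifier
```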
Key Designs¶
1. First Error Detection
    - Function: The verifier \(V\) scores each step in a trajectory, and the position with the largest score drop is identified as the candidate first error location.
    - Mechanism: Compute score differences between adjacent steps, \(\Delta_j = \hat{c}_{j-1} - \hat{c}_j\), and select \(\hat{t} = \arg\max_j \Delta_j\).
    - Design Motivation: The first error position alone is sufficient to provide effective process supervision (Uesato et al. 2022), eliminating the need to annotate every step.
2. Self-Verification
    - Function: The reasoner performs posterior estimation at the verifier-predicted first error location to correct the pre-annotated labels.
    - Mechanism: Rollout verification is conducted only at positions \(\hat{t}-1\) and \(\hat{t}\), covering three cases:
        - Case (a): \(c_{\hat{t}-1}=1,\ c_{\hat{t}}=0\) → the prediction is correct.
        - Case (b): \(c_{\hat{t}-1}=1,\ c_{\hat{t}}=1\) → the true first error lies later in the trajectory (\(\hat{t}<t\)); extend the correct labels forward.
        - Case (c): \(c_{\hat{t}-1}=0,\ c_{\hat{t}}=0\) → the true first error occurred earlier (\(\hat{t}>t\)); move the error label back.
    - Design Motivation: Only 2 rollouts per trajectory are required for verification, compared to full step-wise rollouts under MC estimation (a code sketch of detection and self-verification follows this list).
3. Expansion
    - Function: Leverages the trajectories generated during rollout to improve the generalization of verification.
    - Mechanism: If \(c_{\hat{t}}=1\), all steps of correct rollout trajectories sharing the prefix \(s_{(0:\hat{t})}\) are labeled correct; if \(c_{\hat{t}}=0\), all suffixes starting from \(s_{\hat{t}}\) are labeled incorrect.
    - Design Motivation: Rollouts naturally yield diverse samples that can be reused to augment PRM training data.
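To ground the two core mechanisms, here is a minimal Python sketch. The indexing follows the formulas above (step scores \(\hat{c}_0,\dots,\hat{c}_{n-1}\), drops \(\Delta_j\) for \(j\ge 1\)); `prefix_is_correct` is a hypothetical helper standing in for the reasoner's local posterior rollouts, not the paper's API.

```python
import numpy as np

def detect_first_error(step_scores):
    """First error detection: pick the step with the largest drop between
    adjacent verifier scores, Delta_j = c_{j-1} - c_j, t_hat = argmax_j Delta_j."""
    c = np.asarray(step_scores, dtype=float)
    drops = c[:-1] - c[1:]               # Delta_j for j = 1 .. n-1
    return int(np.argmax(drops)) + 1     # +1 because drops[0] corresponds to j = 1

def self_verify(t_hat, prefix_is_correct):
    """Self-verification with only two local rollouts, at t_hat - 1 and t_hat.
    prefix_is_correct(j) returns 1 if rollouts from the prefix s_(0:j) still
    reach a correct answer, else 0 (hypothetical helper)."""
    c_prev, c_here = prefix_is_correct(t_hat - 1), prefix_is_correct(t_hat)
    if c_prev == 1 and c_here == 0:
        return "confirmed"        # case (a): t_hat is indeed the first error
    if c_prev == 1 and c_here == 1:
        return "error_is_later"   # case (b): t_hat < t, extend correct labels forward
    return "error_is_earlier"     # case (c): t_hat > t, move the error label back

# Example: the sharpest score drop is between steps 1 and 2, so t_hat = 2.
print(detect_first_error([0.9, 0.85, 0.3, 0.2]))  # -> 2
```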
Loss & Training¶
- Verifier Training: A classification PRM trained with MSE loss — \(\mathcal{L}_{PRM} = \frac{1}{n}\sum_i\sum_j\big(f(s_{(0:j)}^i;q) - c_j^i\big)^2\)
- Reasoner Alignment: The reasoner is aligned on preference pairs with the ORPO objective — \(\mathcal{L}_{ORPO} = \mathbb{E}\big[\mathcal{L}_{SFT}(q,\tau^w) - \beta\log\sigma\big(\log\frac{\text{odds}(\tau^w|q)}{\text{odds}(\tau^l|q)}\big)\big]\)
- Preference Data Construction: A threshold \(\eta\) is set; a positive–negative sample pair is constructed when \(r(\tau_i^w) - r(\tau_i^l) \geq \eta\) (sketched after this list).
- Iteration Strategy: 3 iterations in total; the verifier is updated before the reasoner is optimized in each round.
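A sketch of the two training objectives above, under assumed names (`reward_fn` is a per-trajectory reward, `eta > 0`); this illustrates the shape of the computation, not the paper's exact implementation:

```python
def prm_mse_loss(step_preds, step_labels):
    """Verifier objective (per trajectory): MSE between the PRM's per-step
    scores f(s_(0:j); q) and the binary step labels c_j."""
    return sum((p - c) ** 2 for p, c in zip(step_preds, step_labels)) / len(step_labels)

def build_preference_pairs(trajectories, reward_fn, eta):
    """Preference data construction: keep a (winner, loser) pair only when
    the reward gap r(tau_w) - r(tau_l) >= eta. With eta > 0, a trajectory
    is never paired with itself."""
    scored = [(reward_fn(t), t) for t in trajectories]
    return [(tau_w, tau_l)
            for r_w, tau_w in scored
            for r_l, tau_l in scored
            if r_w - r_l >= eta]

# Example: with rewards 0.9 / 0.2 / 0.1 and eta = 0.5, only t1 wins pairs.
rewards = {"t1": 0.9, "t2": 0.2, "t3": 0.1}
print(build_preference_pairs(["t1", "t2", "t3"], rewards.get, eta=0.5))
# -> [('t1', 't2'), ('t1', 't3')]
```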
Key Experimental Results¶
Main Results — Mathematical Reasoning and Code Generation¶
| Method | Qwen-2.5-0.5B GSM8K | Qwen-2.5-0.5B MATH (OOD) | Llama-3.2-1B GSM8K | Gemma-2-2B GSM8K |
|---|---|---|---|---|
| CoT | 28.51 | 25.97 | 5.31 | 19.86 |
| SFT | 34.19 | 24.84 | 22.14 | 39.12 |
| RFT | 37.83 | 27.35 | 26.31 | 45.34 |
| Online-RFT | 40.79 | 28.85 | 29.03 | 48.67 |
| SFT+GRPO | 46.24 | 34.53 | 26.46 | 44.65 |
| SAPO-iter3 | 41.62 | 31.72 | 34.19 | 49.73 |
| Method | Qwen-2.5-0.5B MBPP | Llama-3.2-1B MBPP | Gemma-2-2B MBPP |
|---|---|---|---|
| SFT+GRPO | 35.20 | 25.55 | 33.40 |
| SAPO-iter3 | 36.67 | 28.92 | 35.43 |
Ablation Study — GSM8K¶
| Method | Qwen-2.5-0.5B | Llama-3.2-1B |
|---|---|---|
| SAPO (Full) | 41.62 | 34.19 |
| w/o Process Feedback (PF) | 39.65 | 32.37 |
| w/o Detection & Verification (DV) | 40.86 | 31.69 |
| w/o Reward Model (RM) | 40.71 | 32.75 |
| w/o Expansion (EP) | 40.33 | 33.73 |
Key Findings¶
- Model Dependency of GRPO: GRPO performs best on Qwen-2.5-0.5B but underperforms SAPO on Llama and Gemma, indicating that GRPO's effectiveness is highly dependent on the base model's capacity.
- Advantage on Code Tasks: SAPO consistently outperforms GRPO on code generation, suggesting that process supervision is more effective for structured reasoning.
- Efficiency Gains: Compared to Shepherd (step-wise rollout), SAPO reduces FLOPs and wall-clock time by 2–3×.
- Verifier Performance Improvement: SAPRM (the paper's process verifier) continually improves across iterations and, on code verification in particular, outperforms the Online ORM baseline.
- Importance of Synchronized Alignment: Reasoner–verifier pairs from the same iteration (e.g., V3-R3) exhibit the lowest error rates; cross-iteration pairings yield higher error rates.
Highlights & Insights¶
- Neuroscience Inspiration: The ERN analogy provides an elegant intuition—humans do not need to examine every step, but can rapidly localize the first error and adjust accordingly.
- Adaptive Efficiency: The first error localization cleverly reduces O(n) MC estimation to O(1) local verification, striking an effective balance between accuracy and efficiency.
- Two New Benchmarks: GSM_Process (3,786 instances) and MBPP_Process (1,499 instances) are introduced to evaluate process verifiers, addressing the absence of step-level verification benchmarks.
- Iterative Convergence Analysis: Self-verification error rate experiments (Table 2) directly quantify changes in the reasoner–verifier gap across iterations.
Limitations & Future Work¶
- SAPO still underperforms SFT+GRPO on mathematical tasks with Qwen-2.5-0.5B (41.62 vs. 46.24), indicating that GRPO remains more effective in certain settings.
- Experiments are limited to models ≤2B; applicability to larger models has not been validated.
- First error detection depends on verifier quality; a weak initial verifier may introduce cascading errors.
- The expansion strategy assumes all steps before the first error are correct and all subsequent steps are incorrect, whereas in practice some later steps may be partially correct.
Related Work & Insights¶
- GRPO (Shao et al. 2024): Group Relative Policy Optimization is effective for large models but not necessarily optimal for SLMs → SAPO offers a more stable alternative for SLMs.
- V-STaR (Hosseini et al. 2024): Uses verifiers to internalize alignment signals → SAPO further advances reasoner–verifier co-optimization.
- OmegaPRM (Luo et al. 2024): Employs binary search to localize the first error → SAPO's detect-and-verify strategy provides a more comprehensive solution.
Rating¶
- Novelty: ⭐⭐⭐⭐ The ERN-inspired first error detection combined with self-verification is novel, and the efficiency optimization approach is elegant.
- Experimental Thoroughness: ⭐⭐⭐⭐ Three backbone models × two task types × multiple baselines, with comprehensive ablation and efficiency analysis.
- Writing Quality: ⭐⭐⭐⭐ Motivation is clearly articulated; the logic from problem to method to experiment is coherent.
- Value: ⭐⭐⭐⭐ Offers clear practical contributions to SLM self-evolution and efficient process supervision.