
SPARE: Single-Pass Annotation with Reference-Guided Evaluation for Automatic Process Supervision

Conference: AAAI 2026 arXiv: 2506.15498 Code: https://github.com/UKPLab/aaai2026-spare-prm Area: LLM Reasoning Keywords: Process Reward Model, Automatic Annotation, Reference-Guided, Single-Pass Generation, Data Efficiency

TL;DR

This paper proposes SPARE, a unified single-pass evaluation framework that simultaneously performs step-to-reference alignment and correctness judgment (with explicit reasoning) in a single structured generation, requiring no additional training data. SPARE achieves a 2.3× speedup over MCTS-based annotation and matches its OOD generalization on ProcessBench using only ~16% of the training samples.

Background & Motivation

Process Reward Models (PRMs) provide step-level supervision signals to guide multi-step reasoning in LLMs, outperforming Outcome Reward Models (ORMs) that only evaluate final answers. However, the core bottleneck of PRMs lies in acquiring step-level annotation data—determining the correctness of each reasoning step.

Limitations of existing automatic annotation approaches:

Manual annotation (PRM800K): Requires expert mathematicians to evaluate each step, making it prohibitively costly and unscalable.

MCTS-based methods: Perform multiple forward rollouts from each intermediate step and infer step quality from final answer accuracy. The computational overhead is enormous—each step requires dozens of complete rollouts.

Existing reference-guided methods: Either rely on stronger teacher models to generate synthetic reasoning traces (GenRM, ThinkPRM), or require human step labels for filtering, limiting generalizability.

Key oversight: all existing methods ignore the reference solutions (ground-truth reasoning traces) already present in SFT datasets; these high-quality step-level traces are left underutilized.

Core insight of SPARE: Reference solutions available during SFT training encode rich step-level information. Rather than discarding this information and running MCTS from scratch, SPARE directly prompts an LLM to align and evaluate each step of a candidate output against the corresponding steps of the reference solution—all within a single generation.

Method

Overall Architecture

SPARE is a unified single-stage evaluation framework. Given a context \(\mathcal{C}\), a reference reasoning path \(\mathcal{R}\) (with \(m\) steps), and a model-generated output \(\mathcal{O}\) (with \(n\) steps), SPARE produces an evaluation sequence \(\mathcal{E}\) in a single LLM generation, containing alignment information and correctness labels for each step.

Input: \((\mathcal{C}, \mathcal{R}, \mathcal{O}) \rightarrow \mathcal{E}\)

For each step \(o_i\), the evaluation tuple \(\varepsilon = (e, c^+, o^+, r^+, \epsilon, y_i)\) consists of:

  • \(e\): natural language explanation (why the step is correct/incorrect)
  • \(c^+\): relevant contextual sentences
  • \(o^+\): related output steps
  • \(r^+\): aligned reference solution steps
  • \(\epsilon\): list of error categories (e.g., calculation error, logical leap)
  • \(y_i \in \{-1, +1\}\): correctness label
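For concreteness, the per-step evaluation record could be represented roughly as follows. This is a sketch: the paper specifies the tuple's contents, not a concrete schema, so all field names and types here are illustrative.

```python
from dataclasses import dataclass, field

@dataclass
class StepEvaluation:
    """One evaluation tuple ε = (e, c+, o+, r+, ϵ, y_i) for candidate step o_i.

    Field names are illustrative, not from the paper.
    """
    explanation: str                      # e: why the step is correct/incorrect
    context_sentences: list                # c+: relevant contextual sentences
    related_output_steps: list             # o+: indices of related output steps
    aligned_reference_steps: list          # r+: aligned reference-solution steps
    error_categories: list = field(default_factory=list)  # ϵ: e.g. "calculation error"
    label: int = 1                         # y_i ∈ {-1, +1}

# A single-pass evaluation of an n-step output is then a list of n StepEvaluation records.
```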

Key Designs

  1. Joint Alignment + Evaluation via ICL

    • Step evaluation is framed analogously to Natural Language Inference (NLI) with evidence localization: not only determining step correctness, but also identifying the supporting reference evidence.
    • The system prompt encodes instance-agnostic alignment and evaluation criteria.
    • In-context examples demonstrate how to apply these criteria to concrete instances.
    • All steps are evaluated in a single generation pass; the computational cost scales only additively with the token lengths of the response and the reference.
  2. Explicit Reasoning Annotation

    • Rather than producing binary labels only, the LLM is required to explain the reasoning behind each step's judgment.
    • An error taxonomy is defined: calculation errors, logical leaps, misalignment with reference, faulty premises, etc.
    • This improves interpretability and debuggability—annotations are no longer a black box.
  3. Two Downstream Applications

    • PRM Training (ranking/aggregation): SPARE-annotated data is used to train a process reward model, which at inference time performs Best-of-N selection or self-consistency voting over \(N\) candidate outputs.
    • Offline RL Fine-tuning: Step-level signals from SPARE are used for DPO/offline RL to improve greedy decoding quality.
    • Both applications yield consistent gains across 4 datasets.
  4. Zero Additional Data Cost

    • Reference solutions are directly reused from reasoning traces already present in standard SFT datasets, requiring no additional generation.
    • The entire pipeline requires only a single LLM—no stronger teacher model is needed.
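The single-pass evaluation call described above can be sketched as a prompt-assembly step: one prompt asks the LLM to align and judge all candidate steps at once, rather than issuing one call (or many rollouts) per step. The template strings and function names below are placeholders, not the paper's actual prompts.

```python
def build_spare_prompt(system_criteria, icl_examples, context,
                       reference_steps, output_steps):
    """Assemble ONE prompt that evaluates every candidate step against the
    reference in a single generation. All templates here are illustrative;
    the paper's actual system prompt and few-shot demos differ."""
    ref = "\n".join(f"R{j + 1}. {s}" for j, s in enumerate(reference_steps))
    out = "\n".join(f"O{i + 1}. {s}" for i, s in enumerate(output_steps))
    demos = "\n\n".join(icl_examples)
    return (
        f"{system_criteria}\n\n"          # instance-agnostic criteria
        f"{demos}\n\n"                    # in-context examples
        f"Context:\n{context}\n\n"
        f"Reference solution:\n{ref}\n\n"
        f"Candidate solution:\n{out}\n\n"
        "For every candidate step O_i, output the tuple (explanation, "
        "relevant context, related output steps, aligned reference steps, "
        "error categories, label in {-1, +1})."
    )
```

Note that the prompt length, and hence the cost of the single LLM call, grows additively with the reference and candidate lengths, whereas MCTS-style annotation multiplies the candidate length by the number of rollouts per step.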

Loss & Training

PRM training uses a standard binary classification loss over step labels. Offline RL fine-tuning uses the DPO objective. SPARE itself is an inference-time ICL framework and requires no training of its own.
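Written out in their usual forms (the paper's exact formulation may differ in details), the two objectives are the per-step binary cross-entropy for the PRM, with \(\hat{p}_i\) the predicted probability that step \(i\) is correct and \(y_i \in \{0, 1\}\) the (remapped) step label, and the standard DPO loss over preference pairs \((y_w, y_l)\):

```latex
\mathcal{L}_{\text{PRM}} = -\sum_{i=1}^{n} \Big[ y_i \log \hat{p}_i + (1 - y_i) \log (1 - \hat{p}_i) \Big]

\mathcal{L}_{\text{DPO}} = -\mathbb{E}_{(x,\, y_w,\, y_l)} \left[ \log \sigma\!\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)} \right) \right]
```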

Key Experimental Results

Main Results (Llama-3 8B Instruct, aggregation/ranking with N=20)

| Method | GSM8K | MATH-500 | MuSiQue | SpaRP |
| --- | --- | --- | --- | --- |
| Self-Consistency | 74.9 | 23.4 | 19.7 / 25.2 | 25.4 / 34.4 |
| ORM (BoN) | 79.7 | 20.2 | 33.4 / 45.4 | 41.7 / 49.8 |
| ORM + SC | 79.8 | 23.8 | 34.8 / 44.5 | 41.7 / 49.8 |
| SPARE (BoN) | 80.0 | 20.9 | 34.9 / 45.5 | 43.7 / 50.0 |
| SPARE + SC | 80.3 | 24.1 | 32.1 / 40.4 | 39.6 / 46.9 |
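The two aggregation strategies in the table (Best-of-N and reward-weighted self-consistency) can be sketched as below. The step-score aggregator inside `prm_score` (e.g., minimum or product over step probabilities) is a design choice not pinned down here, so it is passed in as a function.

```python
from collections import defaultdict

def best_of_n(candidates, prm_score):
    """BoN: return the single candidate with the highest aggregated PRM score.
    prm_score(candidate) -> float is assumed to aggregate step-level scores."""
    return max(candidates, key=prm_score)

def reward_weighted_sc(candidates, final_answers, prm_score):
    """PRM + SC: sum reward scores per distinct final answer, pick the argmax,
    so many moderately scored agreeing chains can beat one high-scoring chain."""
    totals = defaultdict(float)
    for cand, ans in zip(candidates, final_answers):
        totals[ans] += prm_score(cand)
    return max(totals, key=totals.get)
```

This also illustrates why BoN and PRM+SC can disagree (as they do on MuSiQue and SpaRP above): BoN trusts the single best-scored chain, while PRM+SC trusts the best-supported answer.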

Data Efficiency (ProcessBench OOD Generalization)

| Method | Training Data Size | ProcessBench Performance |
| --- | --- | --- |
| Human annotation baseline | 100% | Baseline |
| MCTS baseline | 100% | Competitive |
| SPARE | ~16% | Competitive |

Efficiency Comparison

| Method | Annotation Strategy | Relative Speed |
| --- | --- | --- |
| MCTS | Multiple rollouts per step | 1× (baseline) |
| SPARE | Single-pass generation | 2.3× |

Key Findings

  • SPARE and MCTS are complementary: SPARE yields higher precision but slightly lower recall; MCTS yields higher recall but slightly lower precision—ensembling the two is a natural extension.
  • OOD generalization with only 16% of training data: Reference-guided alignment substantially reduces data requirements, suggesting that step alignment is more important than annotation volume.
  • Cross-task generalization: Consistent effectiveness across four datasets spanning three reasoning types—mathematical reasoning (GSM8K, MATH), multi-hop QA (MuSiQue), and spatial reasoning (SpaRP).
  • Explicit reasoning improves annotation quality: Requiring the LLM to explain each step's judgment yields more reliable and interpretable annotations.

Highlights & Insights

  • Single-pass efficiency paradigm: Merging step alignment and evaluation into a single LLM call, rather than evaluating each step independently, yields substantial computational gains.
  • Secondary reuse of SFT data: Reference solutions are used for training during SFT and again for evaluation during annotation—an elegant zero-cost data reuse design.
  • Analogy to NLI: Step evaluation is framed as natural language inference with evidence localization, offering a new theoretical perspective on process supervision.
  • Practical value of precision–recall complementarity: Ensembling SPARE with MCTS may represent the most effective combined solution.
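The precision–recall complementarity noted above suggests simple per-step label-ensembling rules as a starting point. The rules below are illustrative, not from the paper: a "strict" rule flags a step as incorrect if either annotator does (favoring error recall), while a "lenient" rule requires both to agree (favoring precision).

```python
def ensemble_labels(spare_label, mcts_label, mode="strict"):
    """Combine per-step labels (+1 correct / -1 incorrect) from SPARE and MCTS.
    These combination rules are hypothetical, sketching the ensemble idea only."""
    if mode == "strict":
        # flag incorrect if EITHER annotator flags it
        return -1 if (spare_label == -1 or mcts_label == -1) else 1
    # lenient: flag incorrect only if BOTH annotators agree
    return -1 if (spare_label == -1 and mcts_label == -1) else 1
```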

Limitations & Future Work

  • Dependent on reference solution quality—errors in reference solutions (e.g., known annotation noise in MATH) propagate into the annotations.
  • Step granularity mismatch may hinder alignment—a single candidate step may correspond to multiple reference steps, complicating alignment.
  • ICL is constrained by context window length—very long reasoning chains may not be fully processed in a single generation.
  • Evaluation is limited to mathematical, QA, and spatial reasoning tasks; other multi-step reasoning domains such as code generation remain unexplored.
  • Ensemble strategies combining SPARE and MCTS warrant further investigation.

Comparison with Related Methods

  • vs. MCTS-based methods (Math-Shepherd, OmegaPRM): MCTS infers step quality from final outcomes via multiple rollouts per step—computationally expensive but with high recall. SPARE uses direct reference alignment for evaluation, achieving 2.3× speedup at the cost of slightly lower recall. The two approaches are complementary.
  • vs. GenRM (Zhang et al. 2025): GenRM uses a stronger model to generate synthetic reasoning traces as reference; it functions as an ORM rather than a true PRM. SPARE provides genuine step-level process supervision without requiring a stronger teacher model.
  • vs. ThinkPRM (Khalifa 2025): ThinkPRM generates verification reasoning with a stronger model and additionally requires human step labels from PRM800K for filtering. SPARE requires no human step labels whatsoever.

Rating

  • Novelty: ⭐⭐⭐⭐ — Efficient single-pass joint alignment and evaluation design; novel approach to reusing reference solutions.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ — Four diverse reasoning benchmarks, two downstream applications (PRM + RL), efficiency analysis, and complementarity analysis.
  • Writing Quality: ⭐⭐⭐⭐ — Clear architectural diagrams; the NLI analogy aids comprehension.
  • Value: ⭐⭐⭐⭐⭐ — Substantially reduces the cost of acquiring step-level process supervision, with direct practical utility for PRM training.