Note 6: Self-Evaluating LLMs - Step-Level Confidence Estimation for Multi-Step Tasks

Conference: NeurIPS 2025 | arXiv: 2505.17373 | Code: None (research) | Area: LLM Reliability, Multi-Step Reasoning, Confidence Calibration | Keywords: Failure Detection, Step-Level Evaluation, Self-Teaching, Multi-Hop Reasoning, Confidence Estimation

TL;DR

This paper extends confidence estimation to multi-step tasks, demonstrating that step-level evaluation detects reasoning failures more effectively than response-level evaluation, with relative AUC-ROC improvements of 15% to 62% over holistic evaluation on CoQA depending on the method, and providing a practical framework for trustworthy deployment of multi-step reasoning systems.

Background & Motivation

Limitations of Prior Work

Background: Single-step research is saturated: a large body of work on confidence estimation exists, but almost all of it focuses on single-turn outputs, while failure detection in multi-step reasoning remains understudied.

Complexity of Multi-Step Reasoning: Reasoning chains can be arbitrarily long, errors may arise at any step, and early mistakes are amplified by subsequent steps, causing direct application of single-step methods to fail.

Empirical Gap: Directly applying self-certainty to CoQA yields only 0.523 AUC-ROC, whereas a simple step-level extension achieves 0.849 (+62%), a substantial difference.

Core Problem: At what granularity should confidence estimation be performed for multi-step tasks? Which is superior: per-step or holistic evaluation?

Method

Overall Architecture

A systematic comparison of two evaluation granularities:

1. Response-Level Evaluation (Holistic): \(p_{whole} = \mathcal{S}_{whole}(R_{[1:n]} \mid C, Q_{[1:n]})\). A single score evaluates the logical consistency of the entire reasoning chain.

2. Step-Level Evaluation (Fine-Grained): \(p_i = \mathcal{F}_{step}(R_i \mid C, Q_{[1:i]}, R_{[1:i-1]})\). Each step \(i\) is scored individually; the final confidence is \(p = \min(\{p_i\}_{i=1}^n)\), since failure at any step constitutes overall failure (see the sketch below).
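
A minimal sketch of the min-aggregation scheme in Python; `score_step` is a hypothetical placeholder for whichever instantiation of \(\mathcal{F}_{step}\) is used (self-certainty, an activation probe, an LLM judge):

```python
from typing import Callable, Sequence

def stepwise_confidence(
    context: str,
    questions: Sequence[str],
    responses: Sequence[str],
    score_step: Callable[[str, Sequence[str], Sequence[str], str], float],
) -> float:
    """Score step i given the context, the questions Q_[1:i], and the
    prior responses R_[1:i-1]; aggregate with min, since one bad step
    fails the whole chain."""
    scores = [
        score_step(context, questions[: i + 1], responses[:i], r_i)
        for i, r_i in enumerate(responses)
    ]
    return min(scores)
```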

Key Designs

Evaluation of Five Categories of Confidence Estimation Methods:

| Method | White/Black Box | Core Idea | Model Requirements |
| --- | --- | --- | --- |
| Self-Verbalized | Black-box | LLM self-reports confidence | Any LLM |
| LLM Evaluator | Black-box | GPT-4 / Llama as judge | Additional evaluator model |
| Regression Model | White-box | Hidden activations → confidence score | Hidden-layer access + fine-tuning |
| Preference Reward Model | White-box | Binary classification training | PRM data + fine-tuning |
| Self-Certainty | White-box | Log-prob calibration | Token-level probabilities |
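
The paper's exact self-certainty formulation is not reproduced here; as a generic log-prob-based stand-in, a sketch using the length-normalized token log-probability of a step (the per-token log-probs would come from the decoder, e.g. via `output_scores` in Hugging Face transformers or `logprobs` in an OpenAI-style API):

```python
import math
from typing import Sequence

def mean_logprob_confidence(token_logprobs: Sequence[float]) -> float:
    """Geometric-mean token probability of one reasoning step, in (0, 1];
    higher means the model was more certain while generating the step."""
    avg = sum(token_logprobs) / len(token_logprobs)
    return math.exp(avg)

# e.g. per-token log-probs for a short step
print(mean_logprob_confidence([-0.10, -0.30, -0.05]))  # ≈ 0.861
```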

Training Detail (Teacher Forcing): \(\mathcal{F}(R_i \mid C, Q_{[1:i]}, \hat{R}_{[1:i-1]}) \rightarrow \mathbb{I}\{R_i \neq \hat{R}_i\}\). During training, the evaluator is conditioned on the gold history \(\hat{R}\) rather than on the model's own outputs, reducing error propagation into the training signal.

At inference time, gold references are unavailable; the method relies entirely on the model's own generated history.
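
A sketch of how teacher-forced training pairs could be assembled under this definition; every name here is illustrative rather than the paper's code, and `disagree` stands in for whatever step-equivalence check implements \(R_i \neq \hat{R}_i\):

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class StepExample:
    context: str
    questions: List[str]      # Q_[1:i]
    gold_history: List[str]   # R_hat_[1:i-1], the teacher-forced history
    model_step: str           # R_i, the model's output at step i
    label: int                # 1 if R_i disagrees with the gold step

def build_teacher_forced_examples(
    context: str,
    questions: List[str],
    gold_steps: List[str],
    model_steps: List[str],
    disagree: Callable[[str, str], bool],  # e.g. exact match or answer equivalence
) -> List[StepExample]:
    """Condition every example on the *gold* history, so a mistake at
    step i cannot contaminate the examples built for step i+1."""
    examples = []
    for i, (r_i, r_hat_i) in enumerate(zip(model_steps, gold_steps)):
        examples.append(StepExample(
            context=context,
            questions=questions[: i + 1],
            gold_history=gold_steps[:i],
            model_step=r_i,
            label=int(disagree(r_i, r_hat_i)),
        ))
    return examples
```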

Key Experimental Results

GSM8K (Mathematical Reasoning) — AUC-ROC & FPR@0.9 Recall

Main Results

| Method | Granularity | AUC-ROC | FPR@0.9 recall | Key Finding |
| --- | --- | --- | --- | --- |
| Self-Certainty | Holistic | 0.649 | 0.812 | Weak baseline |
| Self-Certainty | Step-level | 0.849 | 0.374 | +31% relative gain |
| Regression Model | Holistic | 0.843 | 0.441 | Strong baseline |
| Regression Model | Step-level | 0.907 | 0.314 | +7.6% further improvement |
| GPT-4.1-mini | Holistic | 0.880 | 1.0 (max recall 0.81) | Strong closed-source judge |
| GPT-4.1-mini | Step-level | 0.670 | 1.0 (max recall 0.48) | Performance degrades |

CoQA (Conversational QA) — Performance Comparison

Ablation Study

| Method | Granularity | AUC-ROC | FPR@0.9 recall | Change vs. Holistic |
| --- | --- | --- | --- | --- |
| Self-Certainty | Holistic | 0.523 | 0.950 | Baseline |
| Self-Certainty | Step-level | 0.849 | 0.374 | +62.1% relative |
| Llama-3.2-11B | Holistic | 0.586 | 1.0 (max recall 0.52) | Baseline |
| Llama-3.2-11B | Step-level | 0.676 | 0.81 | +15.3% relative |
| Activation Regression | Holistic | 0.750 | 0.647 | Baseline (+28% over Llama holistic) |
| Activation Regression | Step-level | 0.919 | 0.169 | +22.5% relative |
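
For reference, both reported columns can be computed from failure labels and detector scores with scikit-learn; a sketch under assumed conventions (label 1 marks a chain that actually failed; higher score means more suspicious, e.g. \(1 - p\) for the min-aggregated confidence \(p\)):

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

def auc_and_fpr_at_recall(labels, scores, target_recall=0.9):
    """AUC-ROC plus the false-positive rate at the smallest threshold
    whose true-positive rate (failure recall) reaches the target."""
    auc = roc_auc_score(labels, scores)
    fpr, tpr, _ = roc_curve(labels, scores)
    reachable = tpr >= target_recall
    # If no threshold reaches the target recall, report FPR = 1.0,
    # as the tables above do for the "max recall" cases.
    fpr_at = float(fpr[reachable].min()) if reachable.any() else 1.0
    return auc, fpr_at

labels = np.array([1, 0, 1, 0, 1, 0])
scores = np.array([0.9, 0.2, 0.7, 0.4, 0.8, 0.1])
print(auc_and_fpr_at_recall(labels, scores))  # (1.0, 0.0)
```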

Key Findings

  1. Task Dependency: Step-level evaluation outperforms holistic evaluation for 4 of 5 methods on CoQA, while the differences are smaller on GSM8K and GPT-4.1-mini actually degrades; task characteristics are critical.
  2. Spurious Reasoning in Math: In GSM8K, 60/879 (6.8%) samples exhibit incorrect reasoning yet arrive at correct answers; step-level evaluation detects such defects while holistic evaluation misses them.
  3. Activations Are Most Robust: The regression model based on hidden activations performs best on both tasks (unlike logit-based scores, activations are not contaminated by tool interactions), with notable step-level advantages; see the probe sketch after this list.
  4. Real-World Validation: Step-level advantages also hold on clinical data (medical record QA), with AUC=0.940 and FPR=0.152, demonstrating the generality and effectiveness of the method.
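
Finding 3 suggests a simple recipe: train a lightweight probe on hidden activations to predict step failure. A minimal sketch with scikit-learn on synthetic data; the shapes, layer choice, and classifier are assumptions, not the paper's implementation:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical data: one hidden-state vector per reasoning step (e.g. the
# last-token activation of a mid/late layer), labeled 1 when the step
# disagreed with the gold step (the teacher-forced labels above).
rng = np.random.default_rng(0)
X_train = rng.normal(size=(2000, 512))   # (num_steps, hidden_dim)
y_train = rng.integers(0, 2, size=2000)  # 1 = step failed

probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Inference: per-step failure probabilities for a new 5-step chain;
# chain confidence = min over (1 - p_fail), matching the min rule.
X_chain = rng.normal(size=(5, 512))
p_fail = probe.predict_proba(X_chain)[:, 1]
chain_confidence = float(np.min(1.0 - p_fail))
print(chain_confidence)
```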

Highlights & Insights

  1. In-Depth Granularity Trade-off: The first systematic comparison of step-level vs. holistic evaluation, revealing the complex interaction between task type and method choice.
  2. Practical Framework: Provides a deployable step-level evaluation scheme that requires no model reconstruction.
  3. Failure Mode Analysis: Identifies spurious reasoning (incorrect steps → correct answer), with step-level evaluation showing a 39.3% relative advantage in fault detection rate.
  4. Medical Validation: Validation on real clinical data strengthens applicability in high-stakes domains such as healthcare.

Limitations & Future Work

  1. The cost of step-level annotation (requiring gold answers at each step) limits data scale, and cross-domain transfer requires re-annotation.
  2. Text generation differs from classification; the definition of step boundaries remains ambiguous (what constitutes a "step"?).
  3. The PRM baseline cannot be applied at the step level on GSM8K (due to multiple valid reasoning paths); this methodological limitation is not discussed in depth.

Related directions:

  • Confidence estimation and calibration (log-prob, activations, preference learning)
  • Trustworthiness evaluation in multi-step reasoning and RAG
  • Error detection in dialogue systems and mathematical reasoning

Rating

⭐⭐⭐⭐⭐