Models That Prove Their Own Correctness¶
- Conference: NeurIPS 2025
- arXiv: 2405.15722
- Code: https://github.com/orrp/self-proving-models
- Area: Reinforcement Learning
- Keywords: self-proving models, interactive proofs, reinforcement learning from verifier feedback, formal verification, LLM reliability
TL;DR¶
This paper proposes the Self-Proving Models framework, in which a model proves the correctness of its outputs to a verifier algorithm via an interactive proof system. Two learning algorithms are introduced—Transcript Learning (TL) and Reinforcement Learning from Verifier Feedback (RLVF)—and experiments on the GCD computation task demonstrate that Annotated TL achieves 96% Verifiability.
Background & Motivation¶
Model accuracy is typically measured as average performance over an input distribution, which means that for any specific input, the correctness of a model's output cannot be guaranteed. A student using an LLM to solve an algebra problem cannot be certain whether the answer to a particular question is correct, even if the LLM achieves 90% accuracy on some benchmark. This reflects the fundamental tension between average-case guarantees and worst-case guarantees in machine learning.
Requiring models to produce natural-language explanations is insufficient—a model may convince users to accept a wrong answer through plausible-sounding reasoning, and even when the answer is correct, the model may fail to generate a convincing proof. What is truly needed is a formal proof that allows a verification algorithm—rather than human judgment—to confirm correctness.
The core idea draws from Interactive Proof (IP) theory: the model plays the role of a "prover" and proves the correctness of each output to a hand-crafted efficient verifier algorithm. The verifier's completeness guarantees that correct outputs can always be proven, while its soundness guarantees that incorrect outputs cannot be accepted—regardless of the prover's computational power. This elevates ML's average-case guarantees to per-input worst-case guarantees.
Method¶
Overall Architecture¶
The Self-Proving Model \(P_\theta\) operates as follows: given input \(x\), the model first generates output \(y \sim P_\theta(x)\), then engages in \(R\) rounds of interaction with the verifier \(V\). In each round, \(V\) sends a query \(q_i\) and \(P_\theta\) replies with answer \(a_i\); the verifier then decides to accept or reject. The training objective is to maximize Verifiability, the probability over inputs that the verifier accepts the interaction: \(\text{ver}_V(\theta) := \Pr_{x \sim \mu}[V \text{ accepts its interaction with } P_\theta \text{ on } x]\). A model is \(\beta\)-Verifiable if \(\text{ver}_V(\theta) \geq \beta\).
The key insight is that the verifier \(V\) is hand-crafted (not learned), providing formal soundness guarantees. Verifiability implies Correctness: if \(P_\theta\) is \(\beta\)-Verifiable and \(V\) has soundness error \(s\), then \(P_\theta\) is \((\beta - s)\)-correct.
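To make the protocol concrete, below is a minimal sketch of a hand-crafted verifier for the GCD task together with an empirical Verifiability estimate. The single-round Bézout-certificate check and all names (`verifier`, `honest_prover`, `estimate_verifiability`) are illustrative assumptions, not the paper's exact implementation.

```python
import random

def verifier(x, y, answer):
    """Illustrative hand-crafted verifier for the GCD task: accept y as
    gcd(a, b) iff y divides both inputs and the prover's answer supplies
    Bezout coefficients u, v with u*a + v*b == y. Both checks cannot hold
    simultaneously for an incorrect y, no matter how the prover was trained."""
    a, b = x
    u, v = answer
    return y > 0 and a % y == 0 and b % y == 0 and u * a + v * b == y

def estimate_verifiability(prover, n_samples=10_000, max_val=10_000):
    """Empirical Verifiability: fraction of sampled inputs on which the
    verifier accepts its interaction with the prover."""
    accepted = 0
    for _ in range(n_samples):
        x = (random.randint(1, max_val), random.randint(1, max_val))
        y, answer = prover(x)  # output, then the prover's reply to V's query
        accepted += verifier(x, y, answer)
    return accepted / n_samples

def honest_prover(x):
    """Reference prover using the extended Euclidean algorithm; a learned
    P_theta would instead produce these tokens autoregressively."""
    a, b = x
    old_r, r, old_u, u, old_v, v = a, b, 1, 0, 0, 1
    while r:
        q = old_r // r
        old_r, r = r, old_r - q * r
        old_u, u = u, old_u - q * u
        old_v, v = v, old_v - q * v
    return old_r, (old_u, old_v)

print(estimate_verifiability(honest_prover))  # ~1.0 for the honest prover
```

Verifiability is then simply the acceptance rate of this fixed verifier under the input distribution, which is what the learning algorithms below try to maximize.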
Key Designs¶
- Transcript Learning (TL):
    - Function: Supervised learning based on accepting transcripts
    - Mechanism: The model is trained not only as \(x \mapsto y^*\) (correct output) but as \(x \mapsto y^* \pi^*\) (correct output + accepting transcript), enabling the model to autoregressively learn to generate complete interaction sequences accepted by the verifier
    - Gradient estimation: TL estimates the gradient of a lower-bound function \(A(\theta) \leq \text{ver}_V(\theta)\), where \(A(\theta) := \Pr[\pi = \pi^*]\) is the probability that the current model's transcript matches the honest transcript
    - Update rule: \(\theta_{i+1} := \theta_i + \lambda \cdot \prod_{s} \alpha_s(\theta_i) \cdot \sum_s \vec{d}_s(\theta_i)\), where \(\alpha_s\) is the token-level selection probability and \(\vec{d}_s\) is the corresponding log-probability gradient (a code sketch of this update follows the list)
    - Design Motivation: Analogous to process supervision in chain-of-thought, providing intermediate-step supervision signals to accelerate learning
- Reinforcement Learning from Verifier Feedback (RLVF):
    - Function: Learns without accepting transcripts, relying solely on the verifier's accept/reject feedback
    - Mechanism: The model generates a transcript \(\pi \sim P_\theta\), the verifier judges acceptance, and parameters are updated only upon acceptance
    - Key gradient formula: \(\nabla_\theta \text{ver}(\theta) = \mathbb{E}[\text{Acc}_V(\cdot) \cdot \sum_s \vec{d}_s(\theta)]\), where \(\text{Acc}_V\) is a binary indicator (see the same sketch after this list)
    - Characteristics: An on-policy algorithm with simpler update steps than TL (no need to track \(\alpha_s\)), but requires initial Verifiability \(> 0\) to sample accepting transcripts
    - Design Motivation: Analogous to RLHF, but with an algorithmic verifier replacing human feedback; the verifier can be efficiently simulated without additional annotation cost
- Annotated Transcript Learning (ATL):
    - Function: Augments TL with "annotations"—intermediate computation steps in the proof process
    - Mechanism: The proof \(\pi\) is extended to \(\tilde{\pi} = A(x, \pi)\), containing intermediate derivation steps for the first \(T\) rounds; at inference time, an extractor \(E\) strips the annotations before sending the actual proof (an illustrative annotation example also follows the list)
    - Design Motivation: Annotations serve as chain-of-thought, substantially reducing the learning difficulty. In experiments, ATL raises Verifiability from 60.3% (TL) to 96%
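As referenced in the TL and RLVF items above, here is a minimal PyTorch sketch of one update step for each algorithm. It assumes a causal model that maps a token sequence to per-position next-token logits and that `target_ids`/`sampled_ids` are already aligned with those positions (the teacher-forcing shift is omitted); `model` and `optimizer` are hypothetical stand-ins, not the paper's code.

```python
import torch

def tl_step(model, optimizer, input_ids, target_ids):
    """One Transcript Learning update on an honest example x -> y* pi*.
    Realizes theta <- theta + lr * prod_s alpha_s * sum_s d_s by weighting the
    summed log-prob gradient with the (detached) transcript probability."""
    logits = model(input_ids)                              # (seq_len, vocab)
    logp = torch.log_softmax(logits, dim=-1)
    token_logp = logp.gather(-1, target_ids.unsqueeze(-1)).squeeze(-1)
    transcript_prob = token_logp.sum().detach().exp()      # prod_s alpha_s(theta)
    loss = -(transcript_prob * token_logp.sum())            # grad = -prod alpha_s * sum d_s
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

def rlvf_step(model, optimizer, input_ids, sampled_ids, accepted):
    """One RLVF update: REINFORCE with the verifier's accept bit as the reward,
    i.e. grad ver(theta) = E[Acc_V * sum_s d_s(theta)]. Rejected transcripts
    contribute nothing, which is why RLVF needs initial Verifiability > 0."""
    if not accepted:
        return
    logits = model(input_ids)
    logp = torch.log_softmax(logits, dim=-1)
    token_logp = logp.gather(-1, sampled_ids.unsqueeze(-1)).squeeze(-1)
    loss = -token_logp.sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

Dropping the `transcript_prob` factor in `tl_step` recovers plain teacher-forcing log-likelihood training; keeping it matches the lower-bound gradient \(\nabla_\theta A(\theta)\) stated above.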
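Below is a purely illustrative sketch of ATL-style annotations for the GCD task: the honest transcript is augmented with the first \(T\) intermediate steps of the extended Euclidean computation, and an extractor strips them before anything reaches the verifier. The annotation format here (quotient, remainder pairs) is an assumption; only the truncate-and-extract mechanism mirrors the description above.

```python
def euclid_with_trace(a, b):
    """Extended Euclid returning gcd, Bezout coefficients, and the list of
    intermediate (quotient, remainder) steps used here as annotation material."""
    steps = []
    old_r, r, old_u, u, old_v, v = a, b, 1, 0, 0, 1
    while r:
        q = old_r // r
        steps.append((q, r))
        old_r, r = r, old_r - q * r
        old_u, u = u, old_u - q * u
        old_v, v = v, old_v - q * v
    return old_r, (old_u, old_v), steps

def annotated_transcript(a, b, T):
    """Hypothetical annotated transcript: output and proof plus the first T trace steps."""
    g, bezout, steps = euclid_with_trace(a, b)
    return {"output": g, "annotation": steps[:T], "proof": bezout}

def extract(annotated):
    """Extractor E: drop annotations; only the output and proof reach the verifier."""
    return annotated["output"], annotated["proof"]

print(annotated_transcript(56, 48, T=2))
# {'output': 8, 'annotation': [(1, 48), (6, 8)], 'proof': (1, -1)}
```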
Loss & Training¶
- TL: Maximizes the log-likelihood of accepting transcripts, effectively performing gradient ascent on \(A(\theta)\)
- RLVF: REINFORCE-style policy gradient with the verifier's binary decision as reward
- Recommended combination: First apply TL to obtain a base Self-Proving model with nonzero Verifiability \(\delta > 0\), then apply RLVF to amplify Verifiability
- Convergence guarantee: Under convexity and Lipschitz conditions, TL outputs a \((1-\varepsilon)\)-Self-Proving model after \(O(C^2 B_\text{Norm}^2 B_\text{Lip}^2 / \varepsilon^2)\) samples
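For completeness, the link between "maximizing the log-likelihood of accepting transcripts" and "gradient ascent on \(A(\theta)\)" is the standard log-derivative identity, written here in the section's own notation (a one-line check, not quoted from the paper):

\[
\nabla_\theta A(\theta)
= \nabla_\theta \prod_{s} \alpha_s(\theta)
= \Big(\prod_{s} \alpha_s(\theta)\Big) \sum_{s} \nabla_\theta \log \alpha_s(\theta)
= \Big(\prod_{s} \alpha_s(\theta)\Big) \sum_{s} \vec{d}_s(\theta)
\]

That is, the TL ascent direction coincides with the log-likelihood gradient \(\sum_s \vec{d}_s\), rescaled by the transcript probability \(\prod_s \alpha_s\), matching the update rule given under Key Designs.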
Key Experimental Results¶
Main Results¶
| Learning Method | Correctness | Verifiability | Notes |
|---|---|---|---|
| GPT (baseline) | 99.8% | — | Computes GCD but cannot prove it |
| GPT+TL | 98.8% | 60.3% | 100K iterations |
| GPT+TL+RLVF | 98.9% | 78.3% | 4M RLVF iterations |
| GPT+ATL | 98.6% | 96.0% | 100K iterations; annotations yield substantial gains |
Ablation Study¶
| Configuration (annotation steps T) | Verifiability | Notes |
|---|---|---|
| T=0 (no annotation, TL) | ~60% | Baseline |
| T=1 | ~82% | Annotations begin to take effect |
| T=3 | ~91% | Continued improvement |
| T=5 | ~96% | Near saturation |
| T=7 | ~96% | Diminishing returns with further annotations |
Key Findings¶
- Even when Correctness approaches 100%, Verifiability requires dedicated training methods to improve—"being able to compute" does not imply "being able to prove"
- Subsequent implementations of RLVF (e.g., RLVR with KL regularization) have achieved broad success on full-scale LLMs, validating the practical value of this paper's theoretical framework
- Annotations (chain-of-thought) have a decisive impact on Verifiability, and models generalize to cases beyond the annotated length
Highlights & Insights¶
- Bridging theory and practice: This work applies interactive proof systems—a classical concept from theoretical computer science—to the problem of ML model trustworthiness, formally defining Verifiability and providing actionable training algorithms. RLVF has since become a core technique in LLM post-training.
- From average-case to per-input guarantees: Soundness holds for all inputs (worst-case), enabling users to trust model outputs on a per-instance basis—by running the verifier rather than relying on benchmark statistics.
Limitations & Future Work¶
- Experiments are conducted only on the small-scale GCD task (6.3M parameter GPT); comprehensive validation on large-scale LLMs and complex reasoning tasks is lacking
- Convergence theory for RLVF remains incomplete, depending on open problems in policy gradient convergence
- A hand-crafted efficient verifier must be predefined—designing such a verifier is itself non-trivial for many practical tasks
- The soundness error \(s\) provides meaningful guarantees only when \(\text{ver} > s\); reliability guarantees are limited in extremely low-Verifiability settings
- TL is sensitive to the quality of the transcript generator; low-quality transcripts substantially degrade learning efficiency
- Formally defining verifiers and proofs for natural language reasoning tasks remains an open problem
Related Work & Insights¶
- vs Prover-Verifier Games (PVGs): PVGs employ a learned verifier, whereas this work uses a hand-crafted verifier—the latter provides formal soundness guarantees immune to adversarial attacks
- vs RLHF: RLVF can be viewed as an "idealized" version of RLHF—replacing noisy human preferences with a deterministic mathematical verifier, yielding perfectly unambiguous reward signals
- vs Chain-of-Thought: TL can be interpreted as verifier-induced CoT training, but unlike standard CoT, the intermediate steps here carry formal correctness semantics
- vs IP-PAC: IP-PAC lets a learner prove that "the model is approximately correct over a distribution," whereas this work enables per-input correctness proofs, offering stronger guarantees
Rating¶
- Novelty: ⭐⭐⭐⭐⭐ Pioneering integration of interactive proofs into the ML training framework; RLVF has since been widely adopted
- Experimental Thoroughness: ⭐⭐⭐ Validation is limited to a single small-scale GCD task
- Writing Quality: ⭐⭐⭐⭐⭐ Mathematically rigorous, conceptually clear, with well-structured theoretical development
- Value: ⭐⭐⭐⭐⭐ RLVF has been widely adopted as a standard LLM post-training method (e.g., RLVR), with far-reaching impact