ProofSketch: Efficient Verified Reasoning for Large Language Models¶
Conference: NeurIPS 2025 | arXiv: 2510.24811 | Code: https://github.com/tanishka66/ProofSketch | Area: LLM Reasoning | Keywords: verified reasoning, symbolic closure, sketch generation, token efficiency, logical reasoning
TL;DR¶
ProofSketch is a framework that combines symbolic closure-based forward reasoning, compact sketch generation, and formal verification in a multi-stage pipeline, achieving formal correctness guarantees for logical reasoning while reducing token consumption.
Background & Motivation¶
Background: Reasoning methods such as CoT prompting and self-consistency improve LLM accuracy but require lengthy reasoning chains, resulting in high token costs and latency.
Limitations of Prior Work: (a) LLMs frequently generate excessively long reasoning chains for simple problems (overthinking), wasting computational resources; (b) intermediate reasoning steps are unverified and may contain logical errors despite fluent outputs.
Key Challenge: Jointly optimizing for accuracy and trustworthiness stands in fundamental tension with computational budget constraints.
Goal: Design a reasoning framework that reduces token usage while guaranteeing reasoning correctness through formal verification.
Key Insight: Integrate LLM generation with symbolic reasoning — first compute the theoretical closure via forward chaining, then have the LLM generate compact sketches, and finally verify the atomic claims in each sketch against the closure.
Core Idea: Use the symbolic logic closure as a verification oracle to select, from multiple short reasoning sketches generated by the LLM, the one with the highest verification coverage.
Method¶
Overall Architecture¶
The input consists of a logical theory \(\mathcal{T}\) (facts and rules) and a True/False/Unknown classification query \(Q\). The pipeline proceeds in four steps: (1) compute the symbolic closure \(C(\mathcal{T})\) via forward chaining; (2) return the answer directly if the closure suffices; (3) otherwise generate \(K=4\) compact sketches, each comprising an answer and a set of atomic claims; (4) verify the atomic claims of each sketch against the closure and select the best sketch via lexicographic scoring.
Key Designs¶
- Symbolic Closure Computation:
- Function: Parse the theory into positive facts \(F_+\), negative facts \(F_-\), and rules \(R\), then derive all provable facts via forward chaining.
- Mechanism: Standard first-order logic forward inference, providing ground truth for subsequent verification.
- Design Motivation: Problems answerable by pure logic are resolved at zero LLM cost; otherwise, the closure serves as the verification basis for sketches.
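The closure step amounts to a standard fixed-point iteration; a minimal sketch follows, where the fact/rule encodings are illustrative rather than the paper's exact representation:

```python
def forward_chain(facts, rules):
    """Compute the deductive closure of `facts` under `rules`.

    facts: set of ground atoms; rules: list of (premises, conclusion)
    pairs, where premises is a list of atoms. Iterates until no rule
    can fire, i.e. until a fixed point is reached.
    """
    closure = set(facts)
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            if conclusion not in closure and all(p in closure for p in premises):
                closure.add(conclusion)  # new fact derived: keep iterating
                changed = True
    return closure
```

A query is answered at zero LLM cost whenever the queried atom (or its negation) already appears in the returned closure; otherwise the closure serves as the verification oracle for the sketches.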
- Adaptive Sketch Generation:
- Function: Generate \(K=4\) candidate compact sketches, each within an adaptive token budget.
- Mechanism: If the query entity already appears in the closure, the budget is 120 tokens; otherwise 160 tokens. Temperature \(\tau=0.3\). Each sketch outputs an answer along with atomic claims in the format "e is a" or "e is not a."
- Design Motivation: Short sketches with multiple samples are more efficient than long CoT with a single sample, and enable parallel verification.
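The adaptive budget and K-sample loop might look like the following; `llm_generate` is a placeholder for any completion API, while `K = 4`, `temperature = 0.3`, and the 120/160-token budgets follow the paper:

```python
K, TEMPERATURE = 4, 0.3  # values reported in the paper

def token_budget(query_entity, closure):
    # 120 tokens if the query entity already appears in the closure, else 160
    return 120 if any(query_entity in fact for fact in closure) else 160

def generate_sketches(llm_generate, prompt, query_entity, closure):
    """Sample K independent compact sketches under the adaptive budget."""
    budget = token_budget(query_entity, closure)
    return [
        llm_generate(prompt, max_tokens=budget, temperature=TEMPERATURE)
        for _ in range(K)
    ]
```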
- Lexicographic Verification and Selection:
- Function: Verify the atomic claims of each sketch and select the best one via multi-objective lexicographic scoring.
- Mechanism: Priority order is (1) full certification > (2) partial verification coverage > (3) token efficiency > (4) consistency with the closure. Early stopping is applied when a fully certified sketch is found.
- Design Motivation: Ensures that the selected result carries the maximum degree of formal guarantee.
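Lexicographic scoring maps naturally onto tuple comparison: score tuples are compared left to right, so full certification dominates coverage, which dominates token efficiency, which dominates closure consistency. The sketch/claim representation below is an assumption for illustration:

```python
def negate(claim):
    # a claim is (entity, polarity, attribute), e.g. ("Anne", True, "kind")
    entity, polarity, attr = claim
    return (entity, not polarity, attr)

def score(sketch, closure):
    """Score tuple for lexicographic (left-to-right) comparison."""
    claims = sketch["claims"]
    verified = [c for c in claims if c in closure]
    contradicted = [c for c in claims if negate(c) in closure]
    fully_certified = claims and len(verified) == len(claims)
    return (
        1 if fully_certified else 0,        # (1) full certification
        len(verified) / max(len(claims), 1),  # (2) verification coverage
        -sketch["tokens"],                   # (3) fewer tokens is better
        -len(contradicted),                  # (4) consistency with closure
    )

def select_best(sketches, closure):
    """Return the best sketch and whether it is fully certified."""
    best = max(sketches, key=lambda s: score(s, closure))
    return best, score(best, closure)[0] == 1
```

In the full pipeline, early stopping would check each sketch as it is verified and return immediately on the first fully certified one, rather than scoring all K candidates.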
Key Experimental Results¶
Main Results¶
Evaluated on 300 ProofWriter samples across three models:
| Method | R1-Llama-8B Acc | R1-Llama Tokens | Mistral-7B Acc | Mistral Tokens | R1-Qwen-7B Acc | R1-Qwen Tokens |
|---|---|---|---|---|---|---|
| Zero-shot | 0.37 | 122 | 0.33 | 7 | 0.39 | 10 |
| Short-CoT | 0.52 | 215 | 0.48 | 53 | 0.44 | 49 |
| Long-CoT | 0.52 | 219 | 0.41 | 102 | 0.47 | 101 |
| ProofSketch | 0.68 | 138 | 0.52 | 28 | 0.54 | 30 |
Ablation Study¶
| Configuration | Certification Rate | Notes |
|---|---|---|
| ProofSketch (R1-Llama) | 42% | Highest accuracy but moderate certification rate |
| ProofSketch (Mistral) | 84% | Highest certification rate, fewest tokens |
| ProofSketch (R1-Qwen) | 42% | Moderate accuracy and certification rate |
| All baselines | 0% | No verification capability |
Key Findings¶
- ProofSketch achieves Pareto optimality across all models on the accuracy–token trade-off curve.
- On Mistral-7B, token usage is only 28, representing 53% of Short-CoT's cost, with a certification rate as high as 84%.
- On R1-Llama, accuracy improves by 16 percentage points (0.52 → 0.68).
- Latency results are mixed: substantially reduced on Mistral, but doubled on R1-Llama due to the overhead of multiple generation and verification rounds.
Highlights & Insights¶
- A practical example of neuro-symbolic hybrid reasoning: The symbolic closure serves as a "free" verifier without requiring a separately trained verifier model.
- Normalization of atomic claims: Constraining LLM outputs to standardized atomic claim formats ("e is a") makes symbolic verification tractable.
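A minimal parser for this normalized claim format could look as follows; the regex and the tuple encoding are assumptions for illustration, not the paper's implementation:

```python
import re

# Matches the normalized forms "e is a" and "e is not a" (optional period).
CLAIM_RE = re.compile(r"^(\w+) is (not )?(\w+)\.?$")

def parse_claim(text):
    """Parse a normalized atomic claim into (entity, polarity, attribute).

    Returns None for text outside the normalized format, which the
    verifier would then treat as unverifiable.
    """
    m = CLAIM_RE.match(text.strip())
    if m is None:
        return None
    entity, negation, attribute = m.groups()
    return (entity, negation is None, attribute)
```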
Limitations & Future Work¶
- Small experimental scale: Only 300 ProofWriter samples with models up to 8B parameters limit the persuasiveness of the results.
- Restricted to simple logical reasoning: Reliance on first-order logic forward chaining precludes applicability to complex scenarios such as mathematical proofs or commonsense reasoning.
- Latency issues: Multiple generation and verification rounds introduce significant latency increases on certain models.
- Unstable certification rates: The certification rate ranges widely from 42% to 84% depending on the model and problem type.
- Comparisons with additional verification approaches (e.g., LLM-as-verifier, tree search) are absent.
Related Work & Insights¶
- vs. CoT/Self-Consistency: ProofSketch pursues shorter reasoning with formal verification rather than longer reasoning with statistical consistency.
- vs. Sketch-of-Thought: Both generate compact sketches, but ProofSketch adds a symbolic verification layer.
- vs. atomic reasoning (zhang2025): Inspired by atomic claim decomposition, but augmented with closure-based verification.
Rating¶
- Novelty: ⭐⭐⭐ The combination of symbolic closure and sketch generation is an interesting idea, though technical contributions are limited.
- Experimental Thoroughness: ⭐⭐ The experimental scale is too small (300 samples, single dataset), undermining the reliability of conclusions.
- Writing Quality: ⭐⭐⭐ Structure is clear but lacks depth; the work is at workshop level.
- Value: ⭐⭐⭐ The direction is interesting but insufficiently validated; larger-scale experiments are needed.