ProofSketch: Efficient Verified Reasoning for Large Language Models¶
Conference: NeurIPS 2025 | arXiv: 2510.24811 | Code: https://github.com/tanishka66/ProofSketch | Area: LLM Reasoning | Keywords: verified reasoning, symbolic closure, sketch generation, token efficiency, logical reasoning
TL;DR¶
ProofSketch is a framework that combines symbolic closure-based forward reasoning, compact sketch generation, and formal verification in a multi-stage pipeline, achieving formal correctness guarantees for logical reasoning while reducing token consumption.
Background & Motivation¶
Background: Reasoning methods such as CoT prompting and self-consistency improve LLM accuracy but require lengthy reasoning chains, resulting in high token costs and latency.
Limitations of Prior Work: (a) LLMs frequently generate excessively long reasoning chains for simple problems (overthinking), wasting computational resources; (b) intermediate reasoning steps are unverified and may contain logical errors despite fluent outputs.
Key Challenge: Jointly optimizing for accuracy and trustworthiness stands in fundamental tension with computational budget constraints.
Goal: Design a reasoning framework that reduces token usage while guaranteeing reasoning correctness through formal verification.
Key Insight: Integrate LLM generation with symbolic reasoning — first compute the theoretical closure via forward chaining, then have the LLM generate compact sketches, and finally verify the atomic claims in each sketch against the closure.
Core Idea: Use the symbolic logic closure as a verification oracle to select, from multiple short reasoning sketches generated by the LLM, the one with the highest verification coverage.
Method¶
Overall Architecture¶
The input consists of a logical theory \(\mathcal{T}\) (facts and rules) and a True/False/Unknown classification query \(Q\). The pipeline proceeds in four steps: (1) compute the symbolic closure \(C(\mathcal{T})\) via forward chaining; (2) return the answer directly if the closure suffices; (3) otherwise generate \(K=4\) compact sketches, each comprising an answer and a set of atomic claims; (4) verify the atomic claims of each sketch against the closure and select the best sketch via lexicographic scoring.
Key Designs¶
- Symbolic Closure Computation:
- Function: Parse the theory into positive facts \(F_+\), negative facts \(F_-\), and rules \(R\), then derive all provable facts via forward chaining.
- Mechanism: Standard first-order logic forward inference, providing ground truth for subsequent verification.
- Design Motivation: Problems answerable by pure logic are resolved at zero LLM cost; otherwise, the closure serves as the verification basis for sketches.
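The closure step amounts to a standard fixed-point iteration; a minimal sketch follows, where the fact/rule encodings are illustrative rather than the paper's exact representation:

```python
def forward_chain(facts, rules):
    """Compute the deductive closure of `facts` under `rules`.

    facts: set of ground atoms; rules: list of (premises, conclusion)
    pairs, where premises is a list of atoms. Iterates until no rule
    can fire, i.e. until a fixed point is reached.
    """
    closure = set(facts)
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            if conclusion not in closure and all(p in closure for p in premises):
                closure.add(conclusion)  # new fact derived: keep iterating
                changed = True
    return closure
```

A query is answered at zero LLM cost whenever the queried atom (or its negation) already appears in the returned closure; otherwise the closure serves as the verification oracle for the sketches.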
- Adaptive Sketch Generation:
- Function: Generate \(K=4\) candidate compact sketches, each within an adaptive token budget.
- Mechanism: If the query entity already appears in the closure, the budget is 120 tokens; otherwise 160 tokens. Temperature \(\tau=0.3\). Each sketch outputs an answer along with atomic claims in the format "e is a" or "e is not a."
- Design Motivation: Short sketches with multiple samples are more efficient than long CoT with a single sample, and enable parallel verification.
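The adaptive budget and K-sample loop might look like the following; `llm_generate` is a placeholder for any completion API, while `K = 4`, `temperature = 0.3`, and the 120/160-token budgets follow the paper:

```python
K, TEMPERATURE = 4, 0.3  # values reported in the paper

def token_budget(query_entity, closure):
    # 120 tokens if the query entity already appears in the closure, else 160
    return 120 if any(query_entity in fact for fact in closure) else 160

def generate_sketches(llm_generate, prompt, query_entity, closure):
    """Sample K independent compact sketches under the adaptive budget."""
    budget = token_budget(query_entity, closure)
    return [
        llm_generate(prompt, max_tokens=budget, temperature=TEMPERATURE)
        for _ in range(K)
    ]
```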
- Lexicographic Verification and Selection:
- Function: Verify the atomic claims of each sketch and select the best one via multi-objective lexicographic scoring.
- Mechanism: Priority order is (1) full certification > (2) partial verification coverage > (3) token efficiency > (4) consistency with the closure. Early stopping is applied when a fully certified sketch is found.
- Design Motivation: Ensures that the selected result carries the maximum degree of formal guarantee.
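Lexicographic scoring maps naturally onto tuple comparison: score tuples are compared left to right, so full certification dominates coverage, which dominates token efficiency, which dominates closure consistency. The sketch/claim representation below is an assumption for illustration:

```python
def negate(claim):
    # a claim is (entity, polarity, attribute), e.g. ("Anne", True, "kind")
    entity, polarity, attr = claim
    return (entity, not polarity, attr)

def score(sketch, closure):
    """Score tuple for lexicographic (left-to-right) comparison."""
    claims = sketch["claims"]
    verified = [c for c in claims if c in closure]
    contradicted = [c for c in claims if negate(c) in closure]
    fully_certified = claims and len(verified) == len(claims)
    return (
        1 if fully_certified else 0,        # (1) full certification
        len(verified) / max(len(claims), 1),  # (2) verification coverage
        -sketch["tokens"],                   # (3) fewer tokens is better
        -len(contradicted),                  # (4) consistency with closure
    )

def select_best(sketches, closure):
    """Return the best sketch and whether it is fully certified."""
    best = max(sketches, key=lambda s: score(s, closure))
    return best, score(best, closure)[0] == 1
```

In the full pipeline, early stopping would check each sketch as it is verified and return immediately on the first fully certified one, rather than scoring all K candidates.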
Key Experimental Results¶
Main Results¶
Evaluated on 300 ProofWriter samples across three models:
| Method | R1-Llama-8B Acc | R1-Llama Tokens | Mistral-7B Acc | Mistral Tokens | R1-Qwen-7B Acc | R1-Qwen Tokens |
|---|---|---|---|---|---|---|
| Zero-shot | 0.37 | 122 | 0.33 | 7 | 0.39 | 10 |
| Short-CoT | 0.52 | 215 | 0.48 | 53 | 0.44 | 49 |
| Long-CoT | 0.52 | 219 | 0.41 | 102 | 0.47 | 101 |
| ProofSketch | 0.68 | 138 | 0.52 | 28 | 0.54 | 30 |
Ablation Study¶
| Configuration | Certification Rate | Notes |
|---|---|---|
| ProofSketch (R1-Llama) | 42% | Highest accuracy but moderate certification rate |
| ProofSketch (Mistral) | 84% | Highest certification rate, fewest tokens |
| ProofSketch (R1-Qwen) | 42% | Moderate accuracy and certification rate |
| All baselines | 0% | No verification capability |
Key Findings¶
- ProofSketch achieves Pareto optimality across all models on the accuracy–token trade-off curve.
- On Mistral-7B, token usage is only 28, representing 53% of Short-CoT's cost, with a certification rate as high as 84%.
- On R1-Llama, accuracy improves by 16 percentage points (0.52 → 0.68).
- Latency results are mixed: substantially reduced on Mistral, but doubled on R1-Llama due to the overhead of multiple generation and verification rounds.
Highlights & Insights¶
- A practical example of neuro-symbolic hybrid reasoning: The symbolic closure serves as a "free" verifier without requiring a separately trained verifier model.
- Normalization of atomic claims: Constraining LLM outputs to standardized atomic claim formats ("e is a") makes symbolic verification tractable.
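A minimal parser for this normalized claim format could look as follows; the regex and the tuple encoding are assumptions for illustration, not the paper's implementation:

```python
import re

# Matches the normalized forms "e is a" and "e is not a" (optional period).
CLAIM_RE = re.compile(r"^(\w+) is (not )?(\w+)\.?$")

def parse_claim(text):
    """Parse a normalized atomic claim into (entity, polarity, attribute).

    Returns None for text outside the normalized format, which the
    verifier would then treat as unverifiable.
    """
    m = CLAIM_RE.match(text.strip())
    if m is None:
        return None
    entity, negation, attribute = m.groups()
    return (entity, negation is None, attribute)
```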
Limitations & Future Work¶
- Small experimental scale: Only 300 ProofWriter samples with models up to 8B parameters limit the persuasiveness of the results.
- Restricted to simple logical reasoning: Reliance on first-order logic forward chaining precludes applicability to complex scenarios such as mathematical proofs or commonsense reasoning.
- Latency issues: Multiple generation and verification rounds introduce significant latency increases on certain models.
- Unstable certification rates: The certification rate ranges widely from 42% to 84% depending on the model and problem type.
- Comparisons with additional verification approaches (e.g., LLM-as-verifier, tree search) are absent.
Related Work & Insights¶
- vs. CoT/Self-Consistency: ProofSketch pursues shorter reasoning with formal verification rather than longer reasoning with statistical consistency.
- vs. Sketch-of-Thought: Both generate compact sketches, but ProofSketch adds a symbolic verification layer.
- vs. atomic reasoning (zhang2025): Inspired by atomic claim decomposition, but augmented with closure-based verification.
Rating¶
- Novelty: ⭐⭐⭐ The combination of symbolic closure and sketch generation is an interesting idea, though technical contributions are limited.
- Experimental Thoroughness: ⭐⭐ The experimental scale is too small (300 samples, single dataset), undermining the reliability of conclusions.
- Writing Quality: ⭐⭐⭐ Structure is clear but lacks depth; the work is at workshop level.
- Value: ⭐⭐⭐ The direction is interesting but insufficiently validated; larger-scale experiments are needed.