ProofSketch: Efficient Verified Reasoning for Large Language Models

Conference: NeurIPS 2025 · arXiv: 2510.24811 · Code: https://github.com/tanishka66/ProofSketch · Area: LLM Reasoning
Keywords: verified reasoning, symbolic closure, sketch generation, token efficiency, logical reasoning

TL;DR

ProofSketch is a framework that combines symbolic closure-based forward reasoning, compact sketch generation, and formal verification in a multi-stage pipeline, achieving formal correctness guarantees for logical reasoning while reducing token consumption.

Background & Motivation

Background: Reasoning methods such as CoT prompting and self-consistency improve LLM accuracy but require lengthy reasoning chains, resulting in high token costs and latency.

Limitations of Prior Work: (a) LLMs frequently generate excessively long reasoning chains for simple problems (overthinking), wasting computational resources; (b) intermediate reasoning steps are unverified and may contain logical errors despite fluent outputs.

Key Challenge: There is a fundamental tension between optimizing accuracy and trustworthiness under computational budget constraints.

Goal: Design a reasoning framework that reduces token usage while guaranteeing reasoning correctness through formal verification.

Key Insight: Integrate LLM generation with symbolic reasoning — first compute the theoretical closure via forward chaining, then have the LLM generate compact sketches, and finally verify the atomic claims in each sketch against the closure.

Core Idea: Use the symbolic logic closure as a verification oracle to select, from multiple short reasoning sketches generated by the LLM, the one with the highest verification coverage.

Method

Overall Architecture

The input consists of a logical theory \(\mathcal{T}\) (facts and rules) and a True/False/Unknown classification query \(Q\). The pipeline proceeds in four steps: (1) compute the symbolic closure \(C(\mathcal{T})\) via forward chaining; (2) return the answer directly if the closure suffices; (3) otherwise generate \(K=4\) compact sketches, each comprising an answer and a set of atomic claims; (4) verify the atomic claims of each sketch against the closure and select the best sketch via lexicographic scoring.
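Steps 1 and 2 can be illustrated with a toy forward-chaining routine. This is our own minimal sketch, not the released implementation: the fact/rule encoding, function names, and the restriction to positive grounded Horn rules are all simplifying assumptions (the paper additionally tracks negative facts \(F_-\)).

```python
# Minimal sketch of steps 1-2: compute the symbolic closure by forward
# chaining, then answer directly when the closure already settles the query.
# Encodings and names are illustrative, not from the ProofSketch code.

def forward_chain(facts, rules):
    """Closure via forward chaining: fire Horn rules (body -> head)
    until no new fact can be derived."""
    closure = set(facts)
    changed = True
    while changed:
        changed = False
        for body, head in rules:
            if head not in closure and all(b in closure for b in body):
                closure.add(head)
                changed = True
    return closure

def answer_from_closure(closure, query):
    """Return True/False if the closure decides the query, else None
    (meaning the pipeline must fall back to LLM sketch generation)."""
    atom, negated = query
    if atom in closure:
        return not negated
    return None

# Toy theory: "Anne is kind" plus the rule "if Anne is kind, Anne is nice".
facts = {("Anne", "kind")}
rules = [([("Anne", "kind")], ("Anne", "nice"))]
closure = forward_chain(facts, rules)
print(answer_from_closure(closure, (("Anne", "nice"), False)))  # True
```

When the query is decided here, the pipeline terminates at zero LLM cost, which is where the method's token savings on easy problems come from.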

Key Designs

  1. Symbolic Closure Computation:

    • Function: Parse the theory into positive facts \(F_+\), negative facts \(F_-\), and rules \(R\), then derive all provable facts via forward chaining.
    • Mechanism: Standard first-order logic forward inference, providing ground truth for subsequent verification.
    • Design Motivation: Problems answerable by pure logic are resolved at zero LLM cost; otherwise, the closure serves as the verification basis for sketches.
  2. Adaptive Sketch Generation:

    • Function: Generate \(K=4\) candidate compact sketches, each within an adaptive token budget.
    • Mechanism: If the query entity already appears in the closure, the budget is 120 tokens; otherwise 160 tokens. Temperature \(\tau=0.3\). Each sketch outputs an answer along with atomic claims in the format "e is a" or "e is not a."
    • Design Motivation: Short sketches with multiple samples are more efficient than long CoT with a single sample, and enable parallel verification.
  3. Lexicographic Verification and Selection:

    • Function: Verify the atomic claims of each sketch and select the best one via multi-objective lexicographic scoring.
    • Mechanism: Priority order is (1) full certification > (2) partial verification coverage > (3) token efficiency > (4) consistency with the closure. Early stopping is applied when a fully certified sketch is found.
    • Design Motivation: Ensures that the selected result carries the maximum degree of formal guarantee.
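The verification-and-selection step (design 3) can be sketched as follows. The tuple layout for sketches and the exact tie-breaking are our assumptions; the lexicographic priority (full certification, then coverage, then token efficiency) follows the paper's description.

```python
# Hedged sketch of lexicographic verification and selection: check each
# sketch's atomic claims against the closure, then pick the best sketch.
# The (answer, claims, tokens) layout is an illustrative assumption.

def coverage(claims, closure):
    """Fraction of a sketch's atomic claims that appear in the closure."""
    if not claims:
        return 0.0
    return sum(c in closure for c in claims) / len(claims)

def select_sketch(sketches, closure):
    """sketches: list of (answer, claims, token_count) tuples."""
    def key(sketch):
        _, claims, tokens = sketch
        cov = coverage(claims, closure)
        certified = cov == 1.0
        # Lexicographic order: certification > coverage > fewer tokens.
        # (The paper also applies early stopping once a sketch is
        # fully certified; max() over the scores is equivalent here.)
        return (certified, cov, -tokens)
    best = max(sketches, key=key)
    return best[0], coverage(best[1], closure)

closure = {("Anne", "kind"), ("Anne", "nice")}
sketches = [
    ("True", [("Anne", "kind"), ("Anne", "nice")], 40),  # fully certified
    ("True", [("Anne", "kind"), ("Anne", "rich")], 25),  # partial coverage
]
answer, cov = select_sketch(sketches, closure)
print(answer, cov)  # True 1.0
```

Note how the cheaper 25-token sketch loses: token efficiency only breaks ties after certification and coverage, which is what "maximum degree of formal guarantee" means operationally.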

Key Experimental Results

Main Results

Evaluated on 300 ProofWriter samples across three models:

| Method | R1-Llama-8B Acc | R1-Llama Tokens | Mistral-7B Acc | Mistral Tokens | R1-Qwen-7B Acc | R1-Qwen Tokens |
|---|---|---|---|---|---|---|
| Zero-shot | 0.37 | 122 | 0.33 | 7 | 0.39 | 10 |
| Short-CoT | 0.52 | 215 | 0.48 | 53 | 0.44 | 49 |
| Long-CoT | 0.52 | 219 | 0.41 | 102 | 0.47 | 101 |
| ProofSketch | 0.68 | 138 | 0.52 | 28 | 0.54 | 30 |

Ablation Study

| Configuration | Certification Rate | Notes |
|---|---|---|
| ProofSketch (R1-Llama) | 42% | Highest accuracy but moderate certification rate |
| ProofSketch (Mistral) | 84% | Highest certification rate, fewest tokens |
| ProofSketch (R1-Qwen) | 42% | Moderate accuracy and certification rate |
| All baselines | 0% | No verification capability |

Key Findings

  • ProofSketch achieves Pareto optimality across all models on the accuracy–token trade-off curve.
  • On Mistral-7B, token usage is only 28, roughly half of Short-CoT's 53 tokens, while reaching an 84% certification rate.
  • On R1-Llama, accuracy improves by 16 percentage points (0.52 → 0.68).
  • Latency results are mixed: substantially reduced on Mistral, but doubled on R1-Llama due to the overhead of multiple generation and verification rounds.

Highlights & Insights

  • A practical example of neuro-symbolic hybrid reasoning: The symbolic closure serves as a "free" verifier without requiring a separately trained verifier model.
  • Normalization of atomic claims: Constraining LLM outputs to standardized atomic claim formats ("e is a") makes symbolic verification tractable.
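The normalization point is easy to make concrete: once claims are restricted to "e is a" / "e is not a", checking them against the closure reduces to string parsing plus set membership. The regex and triple encoding below are our own illustrative choices, not the paper's exact grammar.

```python
import re

# Illustrative parser for normalized atomic claims ("e is a" / "e is not a").
# Mapping a claim to an (entity, attribute, polarity) triple makes symbolic
# verification a simple set-membership test against the closure.
CLAIM_RE = re.compile(r"^(\w+) is (not )?(\w+)\.?$")

def parse_claim(text):
    """Map 'Bob is not round.' -> ('Bob', 'round', False); None if malformed."""
    m = CLAIM_RE.match(text.strip())
    if m is None:
        return None  # free-form output falls outside the verifiable fragment
    entity, neg, attr = m.groups()
    return (entity, attr, neg is None)

print(parse_claim("Anne is kind"))       # ('Anne', 'kind', True)
print(parse_claim("Bob is not round."))  # ('Bob', 'round', False)
```

Anything the LLM emits outside this fragment simply fails to parse and cannot be certified, which is the price paid for making verification tractable.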

Limitations & Future Work

  • Small experimental scale: Only 300 ProofWriter samples with models up to 8B parameters, which weakens the generalizability of the results.
  • Restricted to simple logical reasoning: Reliance on first-order logic forward chaining precludes applicability to complex scenarios such as mathematical proofs or commonsense reasoning.
  • Latency issues: Multiple generation and verification rounds introduce significant latency increases on certain models.
  • Unstable certification rates: The certification rate ranges widely from 42% to 84% depending on the model and problem type.
  • Comparisons with additional verification approaches (e.g., LLM-as-verifier, tree search) are absent.

Comparison with Related Work

  • vs. CoT/Self-Consistency: ProofSketch pursues shorter reasoning with formal verification rather than longer reasoning with statistical consistency.
  • vs. Sketch-of-Thought: Both generate compact sketches, but ProofSketch adds a symbolic verification layer.
  • vs. atomic reasoning (zhang2025): Inspired by atomic claim decomposition, but augmented with closure-based verification.

Rating

  • Novelty: ⭐⭐⭐ The combination of symbolic closure and sketch generation is an interesting idea, though technical contributions are limited.
  • Experimental Thoroughness: ⭐⭐ The experimental scale is too small (300 samples, single dataset), undermining the reliability of conclusions.
  • Writing Quality: ⭐⭐⭐ Structure is clear but lacks depth; the work is at workshop level.
  • Value: ⭐⭐⭐ The direction is interesting but insufficiently validated; larger-scale experiments are needed.