Skip to content

ProofCompass: Enhancing Specialized Provers with LLM Guidance

Conference: ICML 2025 (AI for MATH Workshop)
arXiv: 2507.14335
Code: —
Area: LLM Reasoning / Formal Theorem Proving / Mathematical Reasoning
Keywords: Formal Theorem Proving, Hybrid Methodology, DeepSeek-Prover, LLM Guidance, Lemma Decomposition, miniF2F

TL;DR

ProofCompass proposes a training-free hybrid approach that uses general LLMs to provide natural language proof strategies and intermediate lemma selections for specialized theorem provers (such as DeepSeek-Prover-v1.5-RL). It outperforms the baseline on miniF2F (54.9% → 55.3%) with 25 times fewer attempts.

Background & Motivation

The field of formal theorem proving consists of two mainstream approaches, each with its own limitations:

Method Type Representative Advantages Disadvantages
Large General LLMs GPT-4, Claude Strong high-level reasoning capabilities Imprecise formal syntax, high number of attempts
Small Specialized Provers DSP-v1.5-RL Precise formal syntax Lack of high-level strategic planning

Training a large specialized model that combines the advantages of both requires massive computational resources. Core Idea: Can the two be systematically combined through engineering—allowing the general LLM to provide "macro-strategies" while the specialized prover executes "micro-tactics"?

Method

Overall Architecture

The core of ProofCompass is a two-tier loop system:

  • Outer Loop (LLM Strategist): A general LLM analyzes the problem, providing natural language proof strategies and lemma decompositions.
  • Inner Loop (Specialized Executor): DSP-v1.5-RL executes Lean code generation and proof search guided by the strategy.

1. Strategy Generation Phase

The LLM receives a natural language mathematics problem description and generates proof strategies:

\[\text{Strategy} = \text{LLM}(\text{Problem Description}, \text{History of Failed Attempts})\]

The strategy includes: - Proof Method Recommendations: e.g., "try induction", "use proof by contradiction" - Key Intermediate Steps: Decomposing complex problems into several sub-problems - Failure Analysis: Based on previous failed attempts, pointing out which directions are unfeasible

2. Lemma Decomposition

The LLM decomposes complete theorems into multiple intermediate lemmas:

\[\text{Theorem} \xrightarrow{\text{LLM}} \{\text{Lemma}_1, \text{Lemma}_2, \ldots, \text{Lemma}_m\}\]

Each lemma is simpler than the original theorem, making it easier for the specialized prover to find a proof. The LLM simultaneously generates natural language descriptions of the lemmas and their initial Lean formalizations.

3. Failure Feedback Loop

When the specialized prover fails to prove a certain lemma, the failure information (including the attempted tactics and error messages) is sent back to the LLM:

\[\text{New Strategy} = \text{LLM}(\text{Problem}, \text{Previous Strategy}, \text{Failed Tactics}, \text{Error Messages})\]

The LLM adjusts the strategy based on this: it may modify the lemma decomposition, suggest different proof methods, or provide more specific tactical hints.

4. Resource Allocation Strategy

ProofCompass intelligently allocates the limited number of attempts:

\[\text{Budget} = 128 \text{ attempts total}\]
  • Allocating fewer attempts to sub-problems considered "easy" by the LLM
  • Allocating more attempts to "hard" sub-problems
  • Dynamically adjusting the allocation based on the LLM's confidence evaluation

Key Experimental Results

Main Results: miniF2F Benchmark

Method Pass Rate Attempts Efficiency (Pass Rate/Attempt)
DSP-v1.5-RL 54.9% 3,200 0.017%/attempt
ProofCompass 55.3% 128 0.432%/attempt
GPT-4 (direct) ~35% 128 0.273%/attempt
LLaMA-3-70B (direct) ~28% 128 0.219%/attempt

ProofCompass outperforms the DSP-v1.5-RL baseline while using 25 times fewer attempts.

Ablation Study

Component Pass Rate after Removal Impact
Full ProofCompass 55.3% Baseline
W/O Strategy Generation 50.1% -5.2%
W/O Lemma Decomposition 51.7% -3.6%
W/O Failure Feedback 52.8% -2.5%
W/O Resource Allocation 53.4% -1.9%

Strategy generation and lemma decomposition are the two most critical components.

Resource Efficiency Analysis

Attempts DSP-v1.5-RL ProofCompass
32 38.2% 49.1%
64 45.6% 53.2%
128 49.8% 55.3%
256 52.1% 56.1%
3,200 54.9%

ProofCompass consistently outperforms DSP-v1.5-RL across all computational budgets.

Highlights & Insights

  • Extreme Efficiency Improvements: 25x less computation for equivalent or better performance, offering exceptionally high practical value.
  • Training-Free: Fully leverages the combination of existing model capabilities without requiring additional model training.
  • Ingenious Role Division: The LLM is responsible for "thinking" (strategy and decomposition), while the prover is responsible for "acting" (formal proof).
  • Reveals that the bottleneck of current specialized provers is not their lack of formalizing ability, but rather their lack of high-level strategic planning.

Limitations & Future Work

  • Only tested on the miniF2F benchmark, without scaling to larger formal benchmarks (e.g., FMA/FMC, ProofNet).
  • The absolute pass rate improvement is relatively small (0.4%), requiring more experiments to verify statistical significance.
  • LLM API calls introduce additional costs (although GPU computational costs are dramatically reduced).
  • The effectiveness of the method may depend heavily on the depth of the LLM's mathematical understanding, and weaker LLMs could degrade performance.
  • The quality of lemma decomposition is difficult to guarantee; incorrect decomposition can waste the computational budget.
  • DeepSeek-Prover-v1.5 (Xin et al., 2024): RL-trained specialized Lean prover.
  • Subgoal-based Provers (Zhao et al., 2024): Decomposes proof targets into subgoals.
  • LLM for Math (Trinh et al., 2024): LLM-assisted mathematical reasoning such as AlphaGeometry.
  • This work's innovation lies in its "capability combination" paradigm—instead of training new models, it maximizes the complementary advantages of existing models through systematic engineering.

Rating

⭐⭐⭐⭐ — The engineering approach is simple, highly efficient, and the 25x reduction in computational cost holds strong practical value, opening new directions for the hybrid paradigm of formal proof.