ProofCompass: Enhancing Specialized Provers with LLM Guidance¶
Conference: ICML 2025 (AI for MATH Workshop)
arXiv: 2507.14335
Code: —
Area: LLM Reasoning / Formal Theorem Proving / Mathematical Reasoning
Keywords: Formal Theorem Proving, Hybrid Methodology, DeepSeek-Prover, LLM Guidance, Lemma Decomposition, miniF2F
TL;DR¶
ProofCompass proposes a training-free hybrid approach that uses general LLMs to provide natural language proof strategies and intermediate lemma selections for specialized theorem provers (such as DeepSeek-Prover-v1.5-RL). It outperforms the baseline on miniF2F (54.9% → 55.3%) with 25 times fewer attempts.
Background & Motivation¶
The field of formal theorem proving consists of two mainstream approaches, each with its own limitations:
| Method Type | Representative | Advantages | Disadvantages |
|---|---|---|---|
| Large General LLMs | GPT-4, Claude | Strong high-level reasoning capabilities | Imprecise formal syntax, high number of attempts |
| Small Specialized Provers | DSP-v1.5-RL | Precise formal syntax | Lack of high-level strategic planning |
Training a large specialized model that combines the advantages of both requires massive computational resources. Core Idea: Can the two be systematically combined through engineering—allowing the general LLM to provide "macro-strategies" while the specialized prover executes "micro-tactics"?
Method¶
Overall Architecture¶
The core of ProofCompass is a two-tier loop system:
- Outer Loop (LLM Strategist): A general LLM analyzes the problem, providing natural language proof strategies and lemma decompositions.
- Inner Loop (Specialized Executor): DSP-v1.5-RL executes Lean code generation and proof search guided by the strategy.
1. Strategy Generation Phase¶
The LLM receives a natural language mathematics problem description and generates proof strategies:
The strategy includes: - Proof Method Recommendations: e.g., "try induction", "use proof by contradiction" - Key Intermediate Steps: Decomposing complex problems into several sub-problems - Failure Analysis: Based on previous failed attempts, pointing out which directions are unfeasible
2. Lemma Decomposition¶
The LLM decomposes complete theorems into multiple intermediate lemmas:
Each lemma is simpler than the original theorem, making it easier for the specialized prover to find a proof. The LLM simultaneously generates natural language descriptions of the lemmas and their initial Lean formalizations.
3. Failure Feedback Loop¶
When the specialized prover fails to prove a certain lemma, the failure information (including the attempted tactics and error messages) is sent back to the LLM:
The LLM adjusts the strategy based on this: it may modify the lemma decomposition, suggest different proof methods, or provide more specific tactical hints.
4. Resource Allocation Strategy¶
ProofCompass intelligently allocates the limited number of attempts:
- Allocating fewer attempts to sub-problems considered "easy" by the LLM
- Allocating more attempts to "hard" sub-problems
- Dynamically adjusting the allocation based on the LLM's confidence evaluation
Key Experimental Results¶
Main Results: miniF2F Benchmark¶
| Method | Pass Rate | Attempts | Efficiency (Pass Rate/Attempt) |
|---|---|---|---|
| DSP-v1.5-RL | 54.9% | 3,200 | 0.017%/attempt |
| ProofCompass | 55.3% | 128 | 0.432%/attempt |
| GPT-4 (direct) | ~35% | 128 | 0.273%/attempt |
| LLaMA-3-70B (direct) | ~28% | 128 | 0.219%/attempt |
ProofCompass outperforms the DSP-v1.5-RL baseline while using 25 times fewer attempts.
Ablation Study¶
| Component | Pass Rate after Removal | Impact |
|---|---|---|
| Full ProofCompass | 55.3% | Baseline |
| W/O Strategy Generation | 50.1% | -5.2% |
| W/O Lemma Decomposition | 51.7% | -3.6% |
| W/O Failure Feedback | 52.8% | -2.5% |
| W/O Resource Allocation | 53.4% | -1.9% |
Strategy generation and lemma decomposition are the two most critical components.
Resource Efficiency Analysis¶
| Attempts | DSP-v1.5-RL | ProofCompass |
|---|---|---|
| 32 | 38.2% | 49.1% |
| 64 | 45.6% | 53.2% |
| 128 | 49.8% | 55.3% |
| 256 | 52.1% | 56.1% |
| 3,200 | 54.9% | — |
ProofCompass consistently outperforms DSP-v1.5-RL across all computational budgets.
Highlights & Insights¶
- Extreme Efficiency Improvements: 25x less computation for equivalent or better performance, offering exceptionally high practical value.
- Training-Free: Fully leverages the combination of existing model capabilities without requiring additional model training.
- Ingenious Role Division: The LLM is responsible for "thinking" (strategy and decomposition), while the prover is responsible for "acting" (formal proof).
- Reveals that the bottleneck of current specialized provers is not their lack of formalizing ability, but rather their lack of high-level strategic planning.
Limitations & Future Work¶
- Only tested on the miniF2F benchmark, without scaling to larger formal benchmarks (e.g., FMA/FMC, ProofNet).
- The absolute pass rate improvement is relatively small (0.4%), requiring more experiments to verify statistical significance.
- LLM API calls introduce additional costs (although GPU computational costs are dramatically reduced).
- The effectiveness of the method may depend heavily on the depth of the LLM's mathematical understanding, and weaker LLMs could degrade performance.
- The quality of lemma decomposition is difficult to guarantee; incorrect decomposition can waste the computational budget.
Related Work & Insights¶
- DeepSeek-Prover-v1.5 (Xin et al., 2024): RL-trained specialized Lean prover.
- Subgoal-based Provers (Zhao et al., 2024): Decomposes proof targets into subgoals.
- LLM for Math (Trinh et al., 2024): LLM-assisted mathematical reasoning such as AlphaGeometry.
- This work's innovation lies in its "capability combination" paradigm—instead of training new models, it maximizes the complementary advantages of existing models through systematic engineering.
Rating¶
⭐⭐⭐⭐ — The engineering approach is simple, highly efficient, and the 25x reduction in computational cost holds strong practical value, opening new directions for the hybrid paradigm of formal proof.