ICML 2025 (AI for MATH Workshop) Reasoning Formal Theorem Proving Hybrid Methodology DeepSeek-Prover LLM Guidance Lemma Decomposition miniF2F

ProofCompass: Enhancing Specialized Provers with LLM Guidance¶

Conference: ICML 2025 (AI for MATH Workshop)
arXiv: 2507.14335
Code: —
Area: LLM Reasoning / Formal Theorem Proving / Mathematical Reasoning
Keywords: Formal Theorem Proving, Hybrid Methodology, DeepSeek-Prover, LLM Guidance, Lemma Decomposition, miniF2F

TL;DR¶

ProofCompass proposes a training-free hybrid approach that uses general LLMs to provide natural language proof strategies and intermediate lemma selections for specialized theorem provers (such as DeepSeek-Prover-v1.5-RL). It outperforms the baseline on miniF2F (54.9% → 55.3%) with 25 times fewer attempts.

Background & Motivation¶

The field of formal theorem proving consists of two mainstream approaches, each with its own limitations:

Method Type	Representative	Advantages	Disadvantages
Large General LLMs	GPT-4, Claude	Strong high-level reasoning capabilities	Imprecise formal syntax, high number of attempts
Small Specialized Provers	DSP-v1.5-RL	Precise formal syntax	Lack of high-level strategic planning

Training a large specialized model that combines the advantages of both requires massive computational resources. Core Idea: Can the two be systematically combined through engineering—allowing the general LLM to provide "macro-strategies" while the specialized prover executes "micro-tactics"?

Method¶

Overall Architecture¶

The core of ProofCompass is a two-tier loop system:

Outer Loop (LLM Strategist): A general LLM analyzes the problem, providing natural language proof strategies and lemma decompositions.
Inner Loop (Specialized Executor): DSP-v1.5-RL executes Lean code generation and proof search guided by the strategy.

1. Strategy Generation Phase¶

The LLM receives a natural language mathematics problem description and generates proof strategies:

\[\text{Strategy} = \text{LLM}(\text{Problem Description}, \text{History of Failed Attempts})\]

The strategy includes: - Proof Method Recommendations: e.g., "try induction", "use proof by contradiction" - Key Intermediate Steps: Decomposing complex problems into several sub-problems - Failure Analysis: Based on previous failed attempts, pointing out which directions are unfeasible

2. Lemma Decomposition¶

The LLM decomposes complete theorems into multiple intermediate lemmas:

\[\text{Theorem} \xrightarrow{\text{LLM}} \{\text{Lemma}_1, \text{Lemma}_2, \ldots, \text{Lemma}_m\}\]

Each lemma is simpler than the original theorem, making it easier for the specialized prover to find a proof. The LLM simultaneously generates natural language descriptions of the lemmas and their initial Lean formalizations.

3. Failure Feedback Loop¶

When the specialized prover fails to prove a certain lemma, the failure information (including the attempted tactics and error messages) is sent back to the LLM:

\[\text{New Strategy} = \text{LLM}(\text{Problem}, \text{Previous Strategy}, \text{Failed Tactics}, \text{Error Messages})\]

The LLM adjusts the strategy based on this: it may modify the lemma decomposition, suggest different proof methods, or provide more specific tactical hints.

4. Resource Allocation Strategy¶

ProofCompass intelligently allocates the limited number of attempts:

\[\text{Budget} = 128 \text{ attempts total}\]

Allocating fewer attempts to sub-problems considered "easy" by the LLM
Allocating more attempts to "hard" sub-problems
Dynamically adjusting the allocation based on the LLM's confidence evaluation

Key Experimental Results¶

Main Results: miniF2F Benchmark¶

Method	Pass Rate	Attempts	Efficiency (Pass Rate/Attempt)
DSP-v1.5-RL	54.9%	3,200	0.017%/attempt
ProofCompass	55.3%	128	0.432%/attempt
GPT-4 (direct)	~35%	128	0.273%/attempt
LLaMA-3-70B (direct)	~28%	128	0.219%/attempt

ProofCompass outperforms the DSP-v1.5-RL baseline while using 25 times fewer attempts.

Ablation Study¶

Component	Pass Rate after Removal	Impact
Full ProofCompass	55.3%	Baseline
W/O Strategy Generation	50.1%	-5.2%
W/O Lemma Decomposition	51.7%	-3.6%
W/O Failure Feedback	52.8%	-2.5%
W/O Resource Allocation	53.4%	-1.9%

Strategy generation and lemma decomposition are the two most critical components.

Resource Efficiency Analysis¶

Attempts	DSP-v1.5-RL	ProofCompass
32	38.2%	49.1%
64	45.6%	53.2%
128	49.8%	55.3%
256	52.1%	56.1%
3,200	54.9%	—

ProofCompass consistently outperforms DSP-v1.5-RL across all computational budgets.

Highlights & Insights¶

Extreme Efficiency Improvements: 25x less computation for equivalent or better performance, offering exceptionally high practical value.
Training-Free: Fully leverages the combination of existing model capabilities without requiring additional model training.
Ingenious Role Division: The LLM is responsible for "thinking" (strategy and decomposition), while the prover is responsible for "acting" (formal proof).
Reveals that the bottleneck of current specialized provers is not their lack of formalizing ability, but rather their lack of high-level strategic planning.

Limitations & Future Work¶

Only tested on the miniF2F benchmark, without scaling to larger formal benchmarks (e.g., FMA/FMC, ProofNet).
The absolute pass rate improvement is relatively small (0.4%), requiring more experiments to verify statistical significance.
LLM API calls introduce additional costs (although GPU computational costs are dramatically reduced).
The effectiveness of the method may depend heavily on the depth of the LLM's mathematical understanding, and weaker LLMs could degrade performance.
The quality of lemma decomposition is difficult to guarantee; incorrect decomposition can waste the computational budget.

DeepSeek-Prover-v1.5 (Xin et al., 2024): RL-trained specialized Lean prover.
Subgoal-based Provers (Zhao et al., 2024): Decomposes proof targets into subgoals.
LLM for Math (Trinh et al., 2024): LLM-assisted mathematical reasoning such as AlphaGeometry.
This work's innovation lies in its "capability combination" paradigm—instead of training new models, it maximizes the complementary advantages of existing models through systematic engineering.

Rating¶

⭐⭐⭐⭐ — The engineering approach is simple, highly efficient, and the 25x reduction in computational cost holds strong practical value, opening new directions for the hybrid paradigm of formal proof.