Theorem Prover as a Judge for Synthetic Data Generation¶
Conference: ACL 2025
arXiv: 2502.13137
Code: GitHub
Area: Others
Keywords: Theorem prover, synthetic data, auto-formalization, reinforcement learning, mathematical reasoning
TL;DR¶
This paper proposes the TP-as-a-Judge framework, which leverages the Lean theorem prover to verify intermediate reasoning steps generated by LLMs. Combined with iterative auto-formalization and Reinforcement Learning from Theorem Prover Feedback (RLTPF), this work achieves significant improvements on multiple mathematical reasoning benchmarks using only 3,508 samples.
Background & Motivation¶
Synthetic data holds great potential for enhancing the mathematical reasoning capabilities of LLMs, but ensuring the correctness of intermediate reasoning steps (rather than just the final answer) remains a key challenge. Existing methods suffer from the following limitations:
- Bias of LLM-as-a-Judge: Relying on LLMs themselves to evaluate reasoning quality tends to introduce bias and fails to rigorously verify logical correctness.
- High Human Annotation Cost: Annotating step-by-step reasoning for training data requires substantial manual effort, while automated annotation methods are limited by noisy reward signals.
- High Error Rate of Auto-Formalization: Although theorem provers can effectively verify reasoning, translating natural language proofs into formal representations (such as Lean) is highly error-prone, with an initial execution rate of only around 60%.
- High Computational Overhead of Methods like MCTS: Although Monte Carlo Tree Search can improve reasoning quality, its extensive rollout process limits scalability.
The core motivation of this paper is to replace human annotators with a theorem prover to provide rigorous logical verification for synthetic data, while resolving execution failures through iterative formalization.
Method¶
Overall Architecture¶
TP-as-a-Judge consists of three key phases: (1) LLM data generation, generating mathematical problems and answers via a Reverse Question-Answering method; (2) theorem prover verification, verifying the formal representations of problems and answers using the Lean prover; (3) RLTPF, performing SFT and DPO training utilizing the verification results.
Key Designs¶
-
Problem Formalization (Four-Stage Verification): The original problem \(s_0\) undergoes four processing stages—(\(s_1\)) translation to a symbolic representation using CoMAT, (\(s_2\)) auto-formalization validated by a theorem prover, (\(s_3\)) Auto-Informalisation translating the formal result back to natural language, and (\(s_4\)) Alignment Check verifying consistency between the original problem and the translated natural language via gpt-4o. The Design Motivation is to address the challenge that auto-formalizing problems lacks correctness evaluation metrics.
-
Answer Formalization and Iterative Auto-Formalization: Each reasoning step of the answer is formalized and verified by the theorem prover, yielding three outcomes: verified, false, or error. For error outcomes, an iterative auto-formalization method is proposed, which feeds error messages back to the model for correction, with up to 5 iterations. This method improves the Lean prover's execution rate from 60% to 87%. The Core Idea is to simulate the process where human experts make multiple attempts to formalize complex problems.
-
RLTPF (Reinforcement Learning from Theorem Prover Feedback): The theorem prover is used to replace human annotators, allocating data based on the verification results of two LLMs. Data verified by both is used for SFT; data correct on one side and incorrect on the other is used for DPO; and data where both fail is discarded. The Design Motivation is to allow the model to learn simultaneously from correct reasoning patterns and cases requiring improvement.
Loss & Training¶
- The SFT phase uses standard negative log-likelihood loss to train on verified data.
- The DPO phase utilizes Direct Preference Optimization, with verified answers as positive examples and failed ones as negative examples.
- Training uses LoRA fine-tuning to improve computational efficiency.
- Sequential training is implemented by conducting SFT first, followed by DPO.
Key Experimental Results¶
Main Results¶
| Dataset | Metric | Ours (RLTPF) | CoT Baseline | Gain |
|---|---|---|---|---|
| MultiArith | Acc (Llama-3.1-8B) | 97.78% | 97.78% | +0.00% |
| SVAMP | Acc (Llama-2-7B) | 44.00% | 38.00% | +6.00% |
| GSM8K | Acc (Llama-3.1-8B) | 83.75% | 82.38% | +1.37% |
| GSM-Symbolic | Acc (Llama-3.2-3B) | 59.60% | 55.80% | +3.80% |
| AQUA | Acc (Llama-3.1-8B) | 57.09% | 53.54% | +3.55% |
| MultiArith | Acc (Mistral-7B) | 67.78% | 62.22% | +5.56% |
| AIME 2024 | Acc (Llama-3.2-3B) | 13.33% | 10.00% | +3.33% |
Ablation Study¶
| Configuration | GSM8K Acc | Description |
|---|---|---|
| SFT (All Instances) | 82.23% | Using all instances |
| SFT (Only Rejected) | 80.87% | Using only rejected instances |
| SFT (Only Verified) | 81.55% | Using only verified instances |
| SFT (All) + RLTPF | 79.60% | All instances SFT + DPO conflict |
| SFT (Verified) + RLTPF | 83.75% | Best configuration |
Key Findings¶
- TP-as-a-Judge achieves an F1 of 0.87 and a Recall of 0.91, whereas o1-mini acting as an LLM-as-a-judge scores an F1 of only 0.72, with twice the false positive rate of the former.
- Approximately 40% of the samples in iterative auto-formalization require iterative correction, with most succeeding within the 3rd iteration.
- The number of iterations is positively correlated with the token sequence length; more complex reasoning requires more iterations.
- Using only 3,508 samples, the performance on GSM8K and AIME is competitive with models trained on tens of thousands to millions of samples (e.g., OpenMath-2).
- The accuracy of gpt-4o on self-generated synthetic problems is only 51.85%, demonstrating the challenging nature of these self-generated problems.
Highlights & Insights¶
- Extremely High Data Efficiency: 3,508 samples achieved competitive results compared to large-scale fine-tuning, proving that the quality of intermediate reasoning steps is far more important than the volume of data.
- Advantages of Formal Verification: Theorem provers provide rigorous logical-level verification, which is more reliable than LLM self-evaluation and halves the false positives.
- Naturalness of Iterative Correction: Simulating the process of human experts attempting formalization multiple times, over 90% of the corrections are completed within 3 iterations.
- Data Allocation Strategy of RLTPF: The combined utilization of SFT + DPO is ingenious, making best use of both verified and failed data.
Limitations & Future Work¶
- Currently only applicable to mathematical reasoning; expanding to other domains (such as code verification, logical reasoning) remains an open problem.
- Mathematical coverage is limited to algebra, counting & probability, and problem-solving, lacking geometry and calculus.
- Dataset complexity is limited by the difficulty of the problems that the LLM can reliably solve.
- Iterative formalization carries high computational overhead, and multi-round corrections increase the cost.
- Experiments were only conducted on models with \(\le 8\text{B}\) parameters, leaving the scalability to larger models unknown.
Related Work & Insights¶
- Compared to NuminaMath (860k samples) and OpenMath-2 (14M samples), this work achieves competitive results with only 3,508 samples, demonstrating that data quality \(\gg\) data volume.
- Compared to rStar-Math (MCTS), TP-as-a-Judge avoids the massive computational overhead of extensive rollouts.
- The idea of iterative auto-formalization can inspire iterative error correction based on compiler feedback in code generation.
Rating¶
| Dimension | Score (1-5) | Description |
|---|---|---|
| Novelty | 4 | Novel integration of theorem provers, synthetic data, and RLHF |
| Practicality | 3 | Limited to the mathematics domain and the Lean prover ecosystem |
| Experimental Thoroughness | 4 | Multiple models, multiple benchmarks, detailed ablation, and analysis |
| Writing Quality | 4 | Clear structure, detailed method description |
| Overall Score | 4 | Meaningful work, demonstrating the value of formal verification in quality control for synthetic data |