Theorem Prover as a Judge for Synthetic Data Generation¶

Conference: ACL 2025
arXiv: 2502.13137
Code: GitHub
Area: Others
Keywords: Theorem prover, synthetic data, auto-formalization, reinforcement learning, mathematical reasoning

TL;DR¶

This paper proposes the TP-as-a-Judge framework, which leverages the Lean theorem prover to verify intermediate reasoning steps generated by LLMs. Combined with iterative auto-formalization and Reinforcement Learning from Theorem Prover Feedback (RLTPF), this work achieves significant improvements on multiple mathematical reasoning benchmarks using only 3,508 samples.

Background & Motivation¶

Synthetic data holds great potential for enhancing the mathematical reasoning capabilities of LLMs, but ensuring the correctness of intermediate reasoning steps (rather than just the final answer) remains a key challenge. Existing methods suffer from the following limitations:

Bias of LLM-as-a-Judge: Relying on LLMs themselves to evaluate reasoning quality tends to introduce bias and fails to rigorously verify logical correctness.
High Human Annotation Cost: Annotating step-by-step reasoning for training data requires substantial manual effort, while automated annotation methods are limited by noisy reward signals.
High Error Rate of Auto-Formalization: Although theorem provers can effectively verify reasoning, translating natural language proofs into formal representations (such as Lean) is highly error-prone, with an initial execution rate of only around 60%.
High Computational Overhead of Methods like MCTS: Although Monte Carlo Tree Search can improve reasoning quality, its extensive rollout process limits scalability.

The core motivation of this paper is to replace human annotators with a theorem prover to provide rigorous logical verification for synthetic data, while resolving execution failures through iterative formalization.

Method¶

Overall Architecture¶

TP-as-a-Judge consists of three key phases: (1) LLM data generation, generating mathematical problems and answers via a Reverse Question-Answering method; (2) theorem prover verification, verifying the formal representations of problems and answers using the Lean prover; (3) RLTPF, performing SFT and DPO training utilizing the verification results.

Key Designs¶

Problem Formalization (Four-Stage Verification): The original problem \(s_0\) undergoes four processing stages—(\(s_1\)) translation to a symbolic representation using CoMAT, (\(s_2\)) auto-formalization validated by a theorem prover, (\(s_3\)) Auto-Informalisation translating the formal result back to natural language, and (\(s_4\)) Alignment Check verifying consistency between the original problem and the translated natural language via gpt-4o. The Design Motivation is to address the challenge that auto-formalizing problems lacks correctness evaluation metrics.
Answer Formalization and Iterative Auto-Formalization: Each reasoning step of the answer is formalized and verified by the theorem prover, yielding three outcomes: verified, false, or error. For error outcomes, an iterative auto-formalization method is proposed, which feeds error messages back to the model for correction, with up to 5 iterations. This method improves the Lean prover's execution rate from 60% to 87%. The Core Idea is to simulate the process where human experts make multiple attempts to formalize complex problems.
RLTPF (Reinforcement Learning from Theorem Prover Feedback): The theorem prover is used to replace human annotators, allocating data based on the verification results of two LLMs. Data verified by both is used for SFT; data correct on one side and incorrect on the other is used for DPO; and data where both fail is discarded. The Design Motivation is to allow the model to learn simultaneously from correct reasoning patterns and cases requiring improvement.

Loss & Training¶

The SFT phase uses standard negative log-likelihood loss to train on verified data.
The DPO phase utilizes Direct Preference Optimization, with verified answers as positive examples and failed ones as negative examples.
Training uses LoRA fine-tuning to improve computational efficiency.
Sequential training is implemented by conducting SFT first, followed by DPO.

Key Experimental Results¶

Main Results¶

Dataset	Metric	Ours (RLTPF)	CoT Baseline	Gain
MultiArith	Acc (Llama-3.1-8B)	97.78%	97.78%	+0.00%
SVAMP	Acc (Llama-2-7B)	44.00%	38.00%	+6.00%
GSM8K	Acc (Llama-3.1-8B)	83.75%	82.38%	+1.37%
GSM-Symbolic	Acc (Llama-3.2-3B)	59.60%	55.80%	+3.80%
AQUA	Acc (Llama-3.1-8B)	57.09%	53.54%	+3.55%
MultiArith	Acc (Mistral-7B)	67.78%	62.22%	+5.56%
AIME 2024	Acc (Llama-3.2-3B)	13.33%	10.00%	+3.33%

Ablation Study¶

Configuration	GSM8K Acc	Description
SFT (All Instances)	82.23%	Using all instances
SFT (Only Rejected)	80.87%	Using only rejected instances
SFT (Only Verified)	81.55%	Using only verified instances
SFT (All) + RLTPF	79.60%	All instances SFT + DPO conflict
SFT (Verified) + RLTPF	83.75%	Best configuration

Key Findings¶

TP-as-a-Judge achieves an F1 of 0.87 and a Recall of 0.91, whereas o1-mini acting as an LLM-as-a-judge scores an F1 of only 0.72, with twice the false positive rate of the former.
Approximately 40% of the samples in iterative auto-formalization require iterative correction, with most succeeding within the 3rd iteration.
The number of iterations is positively correlated with the token sequence length; more complex reasoning requires more iterations.
Using only 3,508 samples, the performance on GSM8K and AIME is competitive with models trained on tens of thousands to millions of samples (e.g., OpenMath-2).
The accuracy of gpt-4o on self-generated synthetic problems is only 51.85%, demonstrating the challenging nature of these self-generated problems.

Highlights & Insights¶

Extremely High Data Efficiency: 3,508 samples achieved competitive results compared to large-scale fine-tuning, proving that the quality of intermediate reasoning steps is far more important than the volume of data.
Advantages of Formal Verification: Theorem provers provide rigorous logical-level verification, which is more reliable than LLM self-evaluation and halves the false positives.
Naturalness of Iterative Correction: Simulating the process of human experts attempting formalization multiple times, over 90% of the corrections are completed within 3 iterations.
Data Allocation Strategy of RLTPF: The combined utilization of SFT + DPO is ingenious, making best use of both verified and failed data.

Limitations & Future Work¶

Currently only applicable to mathematical reasoning; expanding to other domains (such as code verification, logical reasoning) remains an open problem.
Mathematical coverage is limited to algebra, counting & probability, and problem-solving, lacking geometry and calculus.
Dataset complexity is limited by the difficulty of the problems that the LLM can reliably solve.
Iterative formalization carries high computational overhead, and multi-round corrections increase the cost.
Experiments were only conducted on models with \(\le 8\text{B}\) parameters, leaving the scalability to larger models unknown.

Compared to NuminaMath (860k samples) and OpenMath-2 (14M samples), this work achieves competitive results with only 3,508 samples, demonstrating that data quality \(\gg\) data volume.
Compared to rStar-Math (MCTS), TP-as-a-Judge avoids the massive computational overhead of extensive rollouts.
The idea of iterative auto-formalization can inspire iterative error correction based on compiler feedback in code generation.

Rating¶

Dimension	Score (1-5)	Description
Novelty	4	Novel integration of theorem provers, synthetic data, and RLHF
Practicality	3	Limited to the mathematics domain and the Lean prover ecosystem
Experimental Thoroughness	4	Multiple models, multiple benchmarks, detailed ablation, and analysis
Writing Quality	4	Clear structure, detailed method description
Overall Score	4	Meaningful work, demonstrating the value of formal verification in quality control for synthetic data