Skip to content

Theorem Prover as a Judge for Synthetic Data Generation

Conference: ACL 2025
arXiv: 2502.13137
Code: GitHub
Area: Others
Keywords: Theorem prover, synthetic data, auto-formalization, reinforcement learning, mathematical reasoning

TL;DR

This paper proposes the TP-as-a-Judge framework, which leverages the Lean theorem prover to verify intermediate reasoning steps generated by LLMs. Combined with iterative auto-formalization and Reinforcement Learning from Theorem Prover Feedback (RLTPF), this work achieves significant improvements on multiple mathematical reasoning benchmarks using only 3,508 samples.

Background & Motivation

Synthetic data holds great potential for enhancing the mathematical reasoning capabilities of LLMs, but ensuring the correctness of intermediate reasoning steps (rather than just the final answer) remains a key challenge. Existing methods suffer from the following limitations:

  1. Bias of LLM-as-a-Judge: Relying on LLMs themselves to evaluate reasoning quality tends to introduce bias and fails to rigorously verify logical correctness.
  2. High Human Annotation Cost: Annotating step-by-step reasoning for training data requires substantial manual effort, while automated annotation methods are limited by noisy reward signals.
  3. High Error Rate of Auto-Formalization: Although theorem provers can effectively verify reasoning, translating natural language proofs into formal representations (such as Lean) is highly error-prone, with an initial execution rate of only around 60%.
  4. High Computational Overhead of Methods like MCTS: Although Monte Carlo Tree Search can improve reasoning quality, its extensive rollout process limits scalability.

The core motivation of this paper is to replace human annotators with a theorem prover to provide rigorous logical verification for synthetic data, while resolving execution failures through iterative formalization.

Method

Overall Architecture

TP-as-a-Judge consists of three key phases: (1) LLM data generation, generating mathematical problems and answers via a Reverse Question-Answering method; (2) theorem prover verification, verifying the formal representations of problems and answers using the Lean prover; (3) RLTPF, performing SFT and DPO training utilizing the verification results.

Key Designs

  1. Problem Formalization (Four-Stage Verification): The original problem \(s_0\) undergoes four processing stages—(\(s_1\)) translation to a symbolic representation using CoMAT, (\(s_2\)) auto-formalization validated by a theorem prover, (\(s_3\)) Auto-Informalisation translating the formal result back to natural language, and (\(s_4\)) Alignment Check verifying consistency between the original problem and the translated natural language via gpt-4o. The Design Motivation is to address the challenge that auto-formalizing problems lacks correctness evaluation metrics.

  2. Answer Formalization and Iterative Auto-Formalization: Each reasoning step of the answer is formalized and verified by the theorem prover, yielding three outcomes: verified, false, or error. For error outcomes, an iterative auto-formalization method is proposed, which feeds error messages back to the model for correction, with up to 5 iterations. This method improves the Lean prover's execution rate from 60% to 87%. The Core Idea is to simulate the process where human experts make multiple attempts to formalize complex problems.

  3. RLTPF (Reinforcement Learning from Theorem Prover Feedback): The theorem prover is used to replace human annotators, allocating data based on the verification results of two LLMs. Data verified by both is used for SFT; data correct on one side and incorrect on the other is used for DPO; and data where both fail is discarded. The Design Motivation is to allow the model to learn simultaneously from correct reasoning patterns and cases requiring improvement.

Loss & Training

  • The SFT phase uses standard negative log-likelihood loss to train on verified data.
  • The DPO phase utilizes Direct Preference Optimization, with verified answers as positive examples and failed ones as negative examples.
  • Training uses LoRA fine-tuning to improve computational efficiency.
  • Sequential training is implemented by conducting SFT first, followed by DPO.

Key Experimental Results

Main Results

Dataset Metric Ours (RLTPF) CoT Baseline Gain
MultiArith Acc (Llama-3.1-8B) 97.78% 97.78% +0.00%
SVAMP Acc (Llama-2-7B) 44.00% 38.00% +6.00%
GSM8K Acc (Llama-3.1-8B) 83.75% 82.38% +1.37%
GSM-Symbolic Acc (Llama-3.2-3B) 59.60% 55.80% +3.80%
AQUA Acc (Llama-3.1-8B) 57.09% 53.54% +3.55%
MultiArith Acc (Mistral-7B) 67.78% 62.22% +5.56%
AIME 2024 Acc (Llama-3.2-3B) 13.33% 10.00% +3.33%

Ablation Study

Configuration GSM8K Acc Description
SFT (All Instances) 82.23% Using all instances
SFT (Only Rejected) 80.87% Using only rejected instances
SFT (Only Verified) 81.55% Using only verified instances
SFT (All) + RLTPF 79.60% All instances SFT + DPO conflict
SFT (Verified) + RLTPF 83.75% Best configuration

Key Findings

  • TP-as-a-Judge achieves an F1 of 0.87 and a Recall of 0.91, whereas o1-mini acting as an LLM-as-a-judge scores an F1 of only 0.72, with twice the false positive rate of the former.
  • Approximately 40% of the samples in iterative auto-formalization require iterative correction, with most succeeding within the 3rd iteration.
  • The number of iterations is positively correlated with the token sequence length; more complex reasoning requires more iterations.
  • Using only 3,508 samples, the performance on GSM8K and AIME is competitive with models trained on tens of thousands to millions of samples (e.g., OpenMath-2).
  • The accuracy of gpt-4o on self-generated synthetic problems is only 51.85%, demonstrating the challenging nature of these self-generated problems.

Highlights & Insights

  • Extremely High Data Efficiency: 3,508 samples achieved competitive results compared to large-scale fine-tuning, proving that the quality of intermediate reasoning steps is far more important than the volume of data.
  • Advantages of Formal Verification: Theorem provers provide rigorous logical-level verification, which is more reliable than LLM self-evaluation and halves the false positives.
  • Naturalness of Iterative Correction: Simulating the process of human experts attempting formalization multiple times, over 90% of the corrections are completed within 3 iterations.
  • Data Allocation Strategy of RLTPF: The combined utilization of SFT + DPO is ingenious, making best use of both verified and failed data.

Limitations & Future Work

  • Currently only applicable to mathematical reasoning; expanding to other domains (such as code verification, logical reasoning) remains an open problem.
  • Mathematical coverage is limited to algebra, counting & probability, and problem-solving, lacking geometry and calculus.
  • Dataset complexity is limited by the difficulty of the problems that the LLM can reliably solve.
  • Iterative formalization carries high computational overhead, and multi-round corrections increase the cost.
  • Experiments were only conducted on models with \(\le 8\text{B}\) parameters, leaving the scalability to larger models unknown.
  • Compared to NuminaMath (860k samples) and OpenMath-2 (14M samples), this work achieves competitive results with only 3,508 samples, demonstrating that data quality \(\gg\) data volume.
  • Compared to rStar-Math (MCTS), TP-as-a-Judge avoids the massive computational overhead of extensive rollouts.
  • The idea of iterative auto-formalization can inspire iterative error correction based on compiler feedback in code generation.

Rating

Dimension Score (1-5) Description
Novelty 4 Novel integration of theorem provers, synthetic data, and RLHF
Practicality 3 Limited to the mathematics domain and the Lean prover ecosystem
Experimental Thoroughness 4 Multiple models, multiple benchmarks, detailed ablation, and analysis
Writing Quality 4 Clear structure, detailed method description
Overall Score 4 Meaningful work, demonstrating the value of formal verification in quality control for synthetic data