StepFun-Formalizer: Unlocking the Autoformalization Potential of LLMs Through Knowledge-Reasoning Fusion¶
- Conference: AAAI 2026
- arXiv: 2508.04440
- Code: stepfun-ai/StepFun-Formalizer
- Area: Autoformalization / Mathematical Reasoning
- Keywords: Autoformalization, Lean 4, Knowledge Distillation, Reasoning Templates, RLVR, SFT, Formal Verification
TL;DR¶
This paper proposes the ThinkingF pipeline, which enhances LLMs' formal language domain knowledge via large-scale knowledge distillation and their informal-to-formal reasoning ability via template-guided reasoning trajectory synthesis. These two capabilities are then integrated through a two-stage SFT followed by RLVR. The resulting 7B/32B models achieve state-of-the-art performance on FormalMATH-Lite and ProverBench.
Background & Motivation¶
Autoformalization aims to translate natural language mathematical statements into verifiable formal language expressions such as Lean 4. Existing approaches suffer from two typical deficiencies:
- General-purpose LLMs lack formal language knowledge: For example, Claude4-thinking is unfamiliar with Lean 4/Mathlib-specific definitions (e.g., the formal expression of Euler's totient function), leading to incorrect implementations.
- Specialized models lack reasoning ability: For example, Kimina-Autoformalizer fails to correctly interpret context and align informal-to-formal representations when handling complex natural language problems.
The authors used GPT-4o to categorize errors across approximately 10K outputs, finding that Kimina-Autoformalizer exhibits a disproportionately high error rate in "problem misunderstanding" and "informal-to-formal alignment errors," confirming that insufficient reasoning capability is the core bottleneck.
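For concreteness, the knowledge gap can be illustrated with the totient example: Mathlib already provides Euler's totient as `Nat.totient`, so a model that knows the library can map φ directly instead of re-implementing the count. A toy sketch (ours, not from the paper):

```lean
-- Informal statement: "φ(10) = 4" (Euler's totient function).
-- A faithful Lean 4 formalization reuses Mathlib's `Nat.totient`
-- rather than re-deriving the coprime count by hand:
import Mathlib

example : Nat.totient 10 = 4 := by native_decide
```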
Method¶
Overall Architecture: The ThinkingF Pipeline¶
ThinkingF consists of four stages: knowledge distillation and filtering → reasoning data synthesis → two-stage SFT → RLVR.
Key Design 1: Knowledge Distillation with Three-Level Quality Filtering¶
Goal: Construct large-scale, high-quality informal–formal data pairs to supplement the model's formal language knowledge.
- Informal problem preparation: Approximately 256K informal mathematics problems are filtered from the NuminaMath-1.5 dataset.
- Formal statement generation: Kimina-Autoformalizer generates 16 candidate formal statements \(\{y_{ij}\}_{j=1}^{16}\) for each problem.
- Three-level filtering:
- Syntax checking: Lean4 REPL is used to filter syntactically invalid statements, retaining correct ones \(\{y_{ij}^*\}_{j=1}^{m_i}\).
- Majority voting: Statements are grouped into equivalence classes via BEq verification, and the best formalization is selected from the largest equivalence class.
- LLM validity assessment: DeepSeek-V3 is used to filter statements that are trivially simple or internally contradictory.
Approximately 183K high-quality knowledge data pairs are retained after this process.
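The three-level filter above can be sketched as follows; `lean_compiles`, `beq`, and `llm_is_valid` are hypothetical stand-ins for the Lean 4 REPL syntax check, BEq equivalence verification, and the DeepSeek-V3 judge, not the paper's actual implementation:

```python
# Sketch of the three-level quality filter over 16 candidates per problem.
def select_best(candidates, lean_compiles, beq, llm_is_valid):
    # Level 1: syntax checking -- keep statements the Lean 4 REPL accepts.
    valid = [y for y in candidates if lean_compiles(y)]
    if not valid:
        return None
    # Level 2: majority voting -- group candidates into equivalence
    # classes via pairwise BEq checks, then take the largest class.
    classes: list[list[str]] = []
    for y in valid:
        for cls in classes:
            if beq(y, cls[0]):
                cls.append(y)
                break
        else:
            classes.append([y])
    best = max(classes, key=len)[0]
    # Level 3: LLM validity assessment -- drop trivial or
    # self-contradictory formalizations.
    return best if llm_is_valid(best) else None
```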
Key Design 2: Template-Guided Reasoning Trajectory Synthesis¶
Goal: Inject informal-to-formal reasoning capability into the model, compensating for the poor effectiveness of directly distilling reasoning processes.
The reasoning template consists of two core components:
- Informal problem comprehension: Restate the original problem → analyze high-level logical structure → decompose mathematical concepts and objects.
- Informal-to-formal analysis: Anticipate potential pitfalls in formalization → apply a divide-and-conquer paradigm to map each natural language mathematical object to its corresponding Lean 4 formal expression.
Based on human-annotated informal–formal pairs \(\{(\hat{x}_i, \hat{y}_i)\}\) (sourced from MiniF2F, ProofNet, PutnamBench, etc., totaling approximately 5.8K entries), Claude 3.7 Sonnet is used to generate reasoning trajectories \(\hat{c}_i\) following the template.
Key finding: Directly distilling reasoning trajectories from general-purpose reasoning models (e.g., Claude4-thinking) yields significantly inferior results, as such models tend to "solve" rather than "translate" during the formalization process.
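The two-part template can be pictured as a synthesis prompt whose section headings mirror the paper's components; the wording below is illustrative, not the paper's exact prompt:

```python
# Illustrative reasoning-template prompt for trajectory synthesis
# (structure follows the paper's two components; phrasing is ours).
TEMPLATE = """\
You are translating a math problem into a Lean 4 statement.
Think step by step inside <think>...</think>:

1. Informal problem comprehension
   - Restate the original problem in your own words.
   - Analyze its high-level logical structure (quantifiers, implications).
   - Decompose the mathematical concepts and objects involved.

2. Informal-to-formal analysis
   - Anticipate pitfalls (implicit domains, edge cases, coercions).
   - Map each object to its Lean 4 counterpart (divide and conquer).

Then output only the final Lean 4 statement.

Problem: {problem}
"""

def build_prompt(problem: str) -> str:
    return TEMPLATE.format(problem=problem)
```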
Key Design 3: Two-Stage SFT¶
- Stage 1: Fine-tune on knowledge data \(\{(x_i, y_i)\}\), with empty `<think></think>` tags inserted before the output to maintain format consistency.
- Stage 2: Fine-tune on reasoning data \(\{(\hat{x}_i, \hat{c}_i, \hat{y}_i)\}\), with reasoning trajectories wrapped in `<think>...</think>`.
The base models are DeepSeek-R1-Distill-Qwen (7B / 32B), selected for their strong performance on informal mathematics and coding tasks.
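The shared output format across the two stages can be sketched as below; the `<think>` wrapping follows the paper, while the field names and helper functions are our own:

```python
# Sketch of target formatting for the two SFT stages.
def format_stage1(problem: str, formal_stmt: str) -> dict:
    # Stage 1 (knowledge data): no trajectory, but empty think tags
    # keep the output format identical to Stage 2.
    return {"prompt": problem,
            "completion": f"<think></think>\n{formal_stmt}"}

def format_stage2(problem: str, trajectory: str, formal_stmt: str) -> dict:
    # Stage 2 (reasoning data): trajectory wrapped in think tags,
    # followed by the final Lean 4 statement.
    return {"prompt": problem,
            "completion": f"<think>{trajectory}</think>\n{formal_stmt}"}
```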
Key Design 4: RLVR Reinforcement Learning¶
BEq equivalence verification is used as a verifiable reward signal, with the reward defined as a binary indicator: \(r(x, y) = 1\) if the generated statement \(y\) compiles and is BEq-equivalent to the reference formalization, and \(r(x, y) = 0\) otherwise.
The GRPO algorithm is adopted, incorporating dynamic sampling from DAPO and token-level loss improvements. After training the 7B model for 450 steps and the 32B model for 350 steps, the reward increases from 0.232 to 0.347, and the mean BEq@1 improves from 0.258 to 0.303.
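A minimal sketch of the group-relative advantage at the heart of GRPO, paired with a binary BEq reward; variable names are illustrative and `beq` is a hypothetical verifier callback:

```python
# Binary verifiable reward: 1 if the candidate is BEq-equivalent
# to the reference formalization, else 0.
def beq_reward(candidate: str, reference: str, beq) -> float:
    return 1.0 if beq(candidate, reference) else 0.0

# GRPO scores each sampled candidate relative to its group:
# advantage = (reward - group mean) / group standard deviation.
def grpo_advantages(rewards: list[float]) -> list[float]:
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    std = var ** 0.5
    return [(r - mean) / (std + 1e-6) for r in rewards]
```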
Key Experimental Results¶
Main Results Comparison (BEq@1 / BEq@16, %)¶
| Model | FormalMATH-Lite | ProverBench | CombiBench |
|---|---|---|---|
| OpenAI o3-pro | 22.6 / 35.5 | 24.7 / 36.2 | 9.0 / 16.0 |
| Claude4-thinking | 20.8 / 32.2 | 24.4 / 35.6 | 9.7 / 18.0 |
| DeepSeek-R1-671B | 18.4 / 31.3 | 23.5 / 34.5 | 8.1 / 20.0 |
| Kimina-Autoformalizer-7B | 35.1 / 60.2 | 13.3 / 25.3 | 2.6 / 6.0 |
| StepFun-Formalizer-7B | 38.3 / 61.2 | 25.1 / 37.9 | 5.2 / 11.0 |
| StepFun-Formalizer-32B | 40.5 / 59.3 | 26.7 / 38.5 | 6.9 / 14.0 |
Ablation Study (OOD Benchmarks, BEq@1 / BEq@16, %)¶
| Configuration | ProverBench | CombiBench |
|---|---|---|
| ThinkingF (full) | 25.1 / 37.9 | 5.2 / 11.0 |
| w/o knowledge data | 24.5 / 37.4 | 3.9 / 10.0 |
| w/o reasoning data | 21.8 / 25.3 | 5.3 / 6.0 |
Key Findings¶
- Reasoning data is the primary contributor: Removing reasoning data causes a substantial drop in BEq@16 (37.9→25.3), indicating that reasoning capability is critical to the performance ceiling.
- Knowledge data plays a complementary role: While its individual contribution is modest, its combination with reasoning data yields significant gains.
- Template-guided synthesis outperforms direct distillation: Directly distilling reasoning trajectories from Claude4-thinking reduces BEq@1 on ProverBench from 25.1 to 21.8.
- RL remains effective: Continued RL training on the same 5.8K data already used in SFT still yields consistent improvements.
- End-to-end theorem proving: When evaluated with Kimina-Prover on 10K problems, StepFun-Formalizer-7B produces 4,940 provable statements, surpassing Kimina-Autoformalizer-7B's 4,549.
Highlights & Insights¶
- The first autoformalization training pipeline that jointly addresses domain knowledge and informal-to-formal reasoning, bridging the capability gaps of both general-purpose and specialized models.
- The template-guided reasoning synthesis is an elegant design that avoids the "solving rather than translating" pitfall of general-purpose reasoning models.
- A 7B model outperforms a 671B general-purpose model (DeepSeek-R1) on ProverBench, demonstrating remarkable parameter efficiency.
- Using BEq equivalence verification as the RL reward signal enables a reliable closed-loop verifiable feedback mechanism.
Limitations & Future Work¶
- Limited data scale: Only 5.8K reasoning data pairs and 183K knowledge pairs are used; the authors note that the 32B model requires more data to further surpass the 7B model.
- Restricted to Lean 4: Generalizability to other formal languages such as Coq and Isabelle has not been validated.
- CombiBench performance remains low: Combinatorial mathematics problems involving complex real-world scenarios and long contexts remain challenging (7B achieves only 5.2%).
- Dependence on external models: Knowledge distillation relies on Kimina-Autoformalizer and reasoning synthesis relies on Claude 3.7 Sonnet, limiting the reproducibility of the pipeline.
- Evaluation methodology limitations: BEq uses the `exact?` tactic for equivalence verification, which is the most stringent approach but may miss semantically equivalent cases.
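BEq can be pictured as checking whether each candidate statement's goal closes from the other; in practice `exact?` searches for such one-step proofs automatically. A toy Lean 4 sketch (ours, much simplified from the real bidirectional check):

```lean
-- Two syntactically different but BEq-equivalent formalizations:
theorem ver_a (n : Nat) : n + 1 > n := Nat.lt_succ_self n
theorem ver_b (n : Nat) : n < n + 1 := Nat.lt_succ_self n

-- A BEq-style check proves each statement from the other;
-- `exact?` would find `ver_a n` / `ver_b n` here.
example (n : Nat) : n < n + 1 := ver_a n
example (n : Nat) : n + 1 > n := ver_b n
```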
Related Work & Insights¶
- Autoformalization: LeanFormalizer (SFT/PPO), Kimina-Autoformalizer (expert iteration), Goedel-Formalizer (human annotation scaling), Herald-Translator, Mathesis (first to incorporate RL but without a reasoning process).
- Reasoning-augmented LLMs: DeepSeek-R1, domain specializations under the OpenAI o1 paradigm — Fin-R1 (finance), Table-R1 (tables), CodeV-R1 (code verification), R1-Code-Interpreter.
- Theorem proving: DeepSeek-Prover-V2, Kimina-Prover, Goedel-Prover.
Rating¶
⭐⭐⭐⭐ — The method is systematically designed and effective, with a clear dual-data pipeline for knowledge and reasoning; experiments are thorough. The 7B model outperforming a 671B general-purpose model is impressive. Limitations include the relatively small data scale and exclusive support for Lean 4.