Faithful and Robust LLM-Driven Theorem Proving for NLI Explanations¶

Conference: ACL 2025
arXiv: 2505.24264
Code: None
Area: LLM Safety
Keywords: Natural Language Inference, Theorem Proving, Autoformalisation, Explanation Generation, Faithfulness

TL;DR¶

This paper investigates the interaction architecture between LLMs and theorem provers (TPs), and proposes four strategies to mitigate issues such as semantic information loss, syntactic errors, insufficient proof construction, and difficulties in feedback interpretation during autoformalisation. It achieves significant improvements in formalisation accuracy by +18.46%/+34.2%/+39.77% and explanation quality by +29.5%/+51.5%/+41.25% on the e-SNLI, QASC, and WorldTree datasets, respectively.

Background & Motivation¶

Background: Natural Language Inference (NLI) requires models to determine the logical relation (entailment/contradiction/neutral) between a premise and a hypothesis, and generate explanations to clarify the reasoning process. Recently, combining LLMs with theorem provers (TPs) has been utilized to verify and improve the logical validity of NLI explanations.

Limitations of Prior Work: LLM-TP (Theorem Prover) hybrid architectures face multiple challenges: (1) semantic information loss occurs when converting natural language into machine-verifiable formal representations (autoformalisation), as LLMs struggle to precisely capture key logical structures; (2) formal representations generated by LLMs often contain syntactic errors; (3) LLMs lack sufficient ability to construct rigorous proofs within formal verification frameworks; (4) LLMs struggle to effectively interpret feedback returned by TPs for iterative refinement.

Key Challenge: Formal verification requires precise logical representations, yet natural language is inherently ambiguous and polysemic. Although LLMs excel at understanding natural language, they are prone to losing semantic details or introducing logical errors when generating rigorous formal representations, resulting in a dual deficiency in both faithfulness and robustness.

Goal: To address the four limitations of the LLM-TP architecture by proposing targeted strategies respectively, improving the quality of autoformalisation and the effectiveness of explanation refinement for NLI.

Key Insight: The authors start from the interaction pipeline of the hybrid architecture, analyze the bottlenecks in the information flows of both directions—LLM \(\rightarrow\) TP (formalisation) and TP \(\rightarrow\) LLM (feedback)—and design intervention strategies one by one.

Core Idea: Systematically improve the faithfulness and robustness of LLM-TP collaboration through semantic-preserving prompting, automated feedback-driven syntax error correction, logic-guided proof sketch generation, and enhanced feedback interpretation mechanisms.

Method¶

Overall Architecture¶

The pipeline is completed through alternating collaboration between the LLM and the TP: given an NLI triplet (premise, hypothesis, label), the LLM first generates formal logical representations (first-order logic / propositional logic) for the explanatory sentences in the premise and hypothesis. The TP then attempts to construct a proof based on these formal representations to verify the logical validity of the explanations. If verification fails, the TP's feedback is sent back to the LLM for iterative correction. The final output is a verified, faithful NLI explanation.

Key Designs¶

Semantic-Preserving Autoformalisation:
- Function: Mitigate semantic loss during the translation from natural language to formal representations.
- Mechanism: Introduce additional semantic constraint prompts when the LLM performs autoformalisation. Specifically, this includes: (a) requiring the LLM to explicitly list the key semantic components of the original natural language and then inspect one by one whether the formal representation covers these components; (b) annotating the corresponding natural language meaning alongside the formal representation to match them in a "bilingual comparison" style for LLM self-checking. Through this approach, the accuracy of autoformalisation improves by +18.46%, +34.2%, and +39.77% on the three datasets, respectively.
- Design Motivation: Information loss occurs because the LLM tends to simplify or omit details during translation; an explicit semantic checklist forces the LLM to focus on completeness.
Efficient Syntax Error Correction:
- Function: Rapidly identify and correct syntax errors in the formal representations generated by the LLM.
- Mechanism: When parsing the formal representation, the TP reports the specific location and type of syntax errors. When feeding these errors back to the LLM, structured error classification information is attached (such as "unmatched parentheses", "undefined predicates", "parameter type errors", etc.) to help the LLM precisely locate and repair the errors. Simultaneously, a repair template library for common syntax errors is maintained to automatically fix high-frequency errors using pattern matching, thereby reducing LLM call frequency.
- Design Motivation: Syntax errors are the most superficial yet highly iterative-consuming types of issues; automated repair can substantially reduce meaningless rounds of iteration.
Logic-Guided Proof Sketch Generation:
- Function: Use the structural information of logical expressions to guide the LLM to generate more structured proof sketches.
- Mechanism: Before requiring the LLM to generate proofs, the LLM is first prompted to analyze the logical structure of existing formal representations (such as implication chains, conditional branches, etc.). Based on this structural information, it generates a proof "skeleton" (proof sketch), which is then filled with specific proof steps by the LLM. This step-by-step generation from structure to detail is more reliable than directly generating complete proofs.
- Design Motivation: LLMs easily get lost in details when directly generating complete formal proofs; creating a skeleton before filling in the content is a strategy commonly used by human mathematicians.

Loss & Training¶

Ours is a pure inference-time method without model fine-tuning. The improvements are entirely reflected in the prompt design and the LLM-TP interaction protocol. The key training strategy improvement lies in the iterative refinement loop—enhancing the LLM's capability to interpret TP feedback to make the correction at each iteration more effective, thereby significantly reducing the number of iteration rounds needed to achieve successful verification.

Key Experimental Results¶

Main Results¶

Performance comparison on three NLI explanation datasets:

Dataset	Metric	Ours	Prev. SOTA	Gain
e-SNLI	Formalisation Accuracy	Significant Improvement	Baseline	+18.46%
QASC	Formalisation Accuracy	Significant Improvement	Baseline	+34.2%
WorldTree	Formalisation Accuracy	Significant Improvement	Baseline	+39.77%
e-SNLI	Explanation Refinement Quality	Significant Improvement	Baseline	+29.5%
QASC	Explanation Refinement Quality	Significant Improvement	Baseline	+51.5%
WorldTree	Explanation Refinement Quality	Significant Improvement	Baseline	+41.25%

Ablation Study¶

Configuration	Formalisation Accuracy Change	Explanation
Full Method	Optimal	All four strategies enabled
w/o Semantic-Preserving Strategy	Significant drop	Formalisation quality degrades significantly
w/o Syntactic Fix	Decrease	Iteration counts increase, efficiency decreases
w/o Proof Sketch Guidance	Decrease	Proof construction success rate decreases
w/o Enhanced Feedback Interpretation	Decrease	Quality of iterative correction deteriorates

Key Findings¶

The contributions of the four strategies vary: the semantic-preserving strategy brings the largest improvement in formalisation accuracy, while the proof sketch guidance contributes most to the verification success rate.
The gains vary across different LLMs (GPT-4, GPT-3.5, etc.), indicating that the effectiveness of the strategies correlates with the foundational logical reasoning capabilities of the LLM.
Specific architectural interventions can "dramatically" reduce the iteration rounds required for successful verification, significantly boosting efficiency.
The improvement is most pronounced on the WorldTree dataset (formalisation +39.77%, refinement +41.25%), likely because this dataset contains longer and more complex explanation chains, where the advantages of the strategies become more prominent.

Highlights & Insights¶

Systematic failure analysis is highly valuable—decomposing the issues of the LLM-TP architecture into four independent yet related dimensions (semantic loss, syntactic errors, proof construction, and feedback interpretation), with targeted solutions for each dimension.
The structure-to-detail proof sketch strategy reflects an understanding of the human reasoning process—good proofs are not written in a single step but are gradually refined with an overall outline first. This approach can be generalized to other formal reasoning scenarios of LLMs.
The numerical results are exceptionally strong—with substantial improvements across all three datasets, particularly in the quality of explanation refinement (+29.5% to +51.5%), demonstrating that the strategies successfully address the key bottlenecks.

Limitations & Future Work¶

The method relies on specific theorem provers (possibly Isabelle or Coq). Since feedback formats differ across TPs, adaptation costs exist.
Although formalisation accuracy is significantly improved, the absolute values might still be insufficient, indicating that the gap between natural language and formal logic remains substantial.
The paper has not open-sourced its code, leaving reproducibility to be verified.
Future work could explore using reinforcement learning (RL) to train the formalisation capability of LLMs, employing the prompting strategies of this paper as reward signals.

vs NaturalProver: Early NLI proving work directly prompted LLMs to generate proofs, whereas ours introduces the TP verification stage and iterative corrections.
vs Autoformalisation Series (e.g., ProofNet): These works focus on the formalisation of mathematical theorems, whereas this paper targets NLI explanations; although the domains differ, the underlying techniques are shared.
vs LLM+Tool Frameworks (e.g., ToolFormer, ReAct): The LLM-TP interaction in this paper can be viewed as a specialized LLM+Tool paradigm, but it is deeply optimized specifically for formal verification.

Rating¶

Novelty: ⭐⭐⭐⭐ Systematic improvements targeted at LLM-TP interaction; the four strategies are cleverly designed.
Experimental Thoroughness: ⭐⭐⭐⭐⭐ Three datasets \(\times\) multiple LLMs \(\times\) detailed ablation, delivering significant and credible improvements.
Writing Quality: ⭐⭐⭐⭐ In-depth analysis of problems and clear descriptions of strategies.
Value: ⭐⭐⭐⭐ Direct guiding value for the practical applications of formal reasoning in LLMs.