Entailment-Preserving First-order Logic Representations in Natural Language Entailment¶

Conference: ACL 2025
arXiv: 2502.16757
Code: None
Area: LLM Efficiency
Keywords: First-order logic, natural language entailment, semantic parsing, Learning-to-rank, theorem proving

TL;DR¶

This paper formally defines the Entailment-Preserving First-Order Logic Representation (EPF) task and a set of reference-free evaluation metrics (EPR family). It proposes an iterative learning-to-rank training method that optimizes the NL→FOL translation of T5 models via the BRIO loss, enabling the generated FOL representations to be verified for entailment relations by automated theorem provers. It improves EPR by 1.8–2.7% and EPR@16 by 17.4–20.6% across three datasets.

Background & Motivation¶

Background: First-Order Logic (FOL) is a classic representation of natural language semantics, allowing automated theorem provers to determine entailment relations in theory. NL→FOL translation has recently attracted attention due to the code generation capabilities of LLMs.
Limitations of Prior Work:
- Classic methods (such as translating via CCG/AMR to FOL) almost completely fail on RTE tasks (Bos 2014 reported only 1.9% recall).
- LLMs (such as GPT-4o) perform exceptionally well on synthetic logical entailment tasks but poorly on natural language entailment.
- The core problem is arbitrariness: LLMs generate inconsistent predicate names and arities for synonymous concepts, leading to theorem proving failures.
Key Challenge: Natural language entailment is much broader than strict logical entailment ("a human reading P would infer that h is likely true"), and traditional FOL cannot capture this non-strict entailment.
Key Insight: Do not rely on reference FOL representations; instead, directly use the execution results of theorem provers as optimization signals.
Core Idea: Use learning-to-rank to teach the model to favor generating FOL representations that preserve entailment in combinations, thereby indirectly reducing arbitrariness.

Method¶

Overall Architecture¶

Base Model Training: Fine-tune T5-base with standard cross-entropy on MALLS (an NL-FOL parallel corpus) to obtain \(S_0\).
Sampling: For each sentence in the training set, sample K FOL representations using beam search with \(S_t\).
Evaluation: Use the Vampire theorem prover to check all possible premise-hypothesis FOL combinations and calculate the score for each output.
Training: Train the model using the BRIO loss for ranking outputs to obtain \(S_{t+1}\).
Iterate 5 times to obtain \(S_5\).

Key Designs¶

EPR Metric Family (Entailment-Preserving Rate):
- EPR: For each premise-hypothesis pair, select the highest-probability FOL translation combination and check if entailment is preserved.
- EPR@K: Allow K translations per sentence; success is achieved if any combination preserves entailment (a relaxed version).
- EPR@K-Oracle: Choose one translation per sentence to globally optimize EPR (NP-complete, solved using Answer Set Programming / ASP).
- Inequality relation: EPR = EPR@1 ≤ EPR@K-Oracle ≤ EPR@K
- Completely reference-free: No gold FOL representations are needed; only NL entailment labels are required.
- Spurious entailment prevention: Verification steps ensure the hypothesis does not introduce new predicates/constants, and the proof must utilize all premises.
Scoring Function:
- Since the global EPR metric cannot be directly used for sentence-level training, a sentence-level score is defined.
- The score of an FOL output \(S(p)_j\) is equal to the number of entailment-preserving combinations that include it.
- Syntax-invalid outputs receive a score of -1.
- If a sentence appears in multiple premise-hypothesis pairs, its scores are accumulated.
- Design Motivation: Maximizing the score naturally improves the global EPR.
BRIO Learning-to-Rank Loss:
- For each input, its K outputs are sorted in descending order of their scores.
- Goal: Ensure that the average token log-probability of higher-scoring outputs is greater than that of lower-scoring outputs.
- \(\mathcal{L}_{BRIO} = \sum_i \sum_j max(\hat{p}(y_j|x) - \hat{p}(y_i|x) + \Delta(j-i), 0)\)
- Incorporate cross-entropy regularization: \(\mathcal{L} = \mathcal{L}_{CE} + \lambda \mathcal{L}_{BRIO}\)
- \(\Delta=0.01\), \(\lambda=10\)
Iterative Training:
- Re-evaluate and train in each iteration using the current model's sampled results.
- Analogous to online iterative strategies in RLHF (e.g., Iterative DPO).
- 5 iterations, 20 epochs each.

Loss & Training¶

Initial Training: Fine-tune T5-base with standard cross-entropy on MALLS (34k NL-FOL pairs).
Iterative Training: \(\mathcal{L} = \mathcal{L}_{CE} + 10 \cdot \mathcal{L}_{BRIO}\)
Sample \(K=16\) outputs using beam search.
Perform fully automated evaluation using the Vampire theorem prover.

Key Experimental Results¶

Main Results¶

Dataset	Metric	T5-Iter5	GPT-4o	T5-Iter0	CCG2Lambda
EntailmentBank	EPR	7.4	2.9	5.6	0.0
eQASC	EPR	4.9	1.1	2.6	0.0
e-SNLI	EPR	4.3	1.5	0.1	0.0
EntailmentBank	EPR@16	32.8	13.2	15.4	-
eQASC	EPR@16	33.1	11.4	12.5	-
e-SNLI	EPR@16	36.1	8.3	3.4	-

Ablation Study¶

Configuration	Key Metric	Description
Number of Iterations (0→5)	Continuous EPR increase	EPR, EPR@16, and EPR@16-Oracle monotonically increase across all datasets.
Unique Predicate Names/Sent.	Significant decrease after Iter1	Synonymous concepts are mapped to fewer predicate names, reducing arbitrariness.
Arity Entropy	Continuous decrease	Predicate arities tend to be consistent (e.g., CauseCycles converges from a mix of 2 and 3 arguments to 2 arguments consistently).
Cross-dataset Transfer	EPR@16 increase	Iter5 trained on dataset A also outperforms Iter0 on dataset B, showing transferability.

Key Findings¶

Classic NL→FOL methods (e.g., CCG2Lambda, AMR2FOL) have EPRs close to 0 on multi-premise RTE, resulting in complete failure.
Although GPT-4o performs strongly on synthetic logical reasoning, its EPR for FOL generation on natural language entailment is only 1-3%.
Core Mechanism of Iterative Training: The BRIO loss only provides gradient signals when different outputs have different scores, yet the model generalizes well to unseen combinations.
Arbitrariness analysis reveals a direct causal relationship with EPR improvement: rising predicate name consistency \(\uparrow \rightarrow\) success rate \(\uparrow\).
EPR@16-Oracle is close to EPR@16 (with a gap of only 1.7 percentage points), suggesting the existence of a near-optimal single-selection strategy.
Models trained on e-SNLI show the best cross-domain generalization, likely due to the large scale of the data and its broad semantic coverage.

Highlights & Insights¶

Ingenious design of the EPR metric: completely reference-free, allows arbitrary FOL structures, and indirectly evaluates semantic quality through execution results (theorem proving).
Modeling NL→FOL translation as an execution-guided sequence generation problem, which is closely aligned with execution-guided methods in semantic parsing.
The iterative learning-to-rank framework is highly generalizable and can be extended to other execution-guided generation tasks.
Profound analysis of arbitrariness—which represents the fundamental barrier for LLMs in FOL generation—providing both quantitative metrics and solutions in this paper.

Limitations & Future Work¶

Even for the best T5-Iter5, the EPR is only 7.4% (on EntailmentBank), which is still far from the theoretical upper bound (EPR@16-Oracle \(\approx 31\%\)).
The study uses T5-base (220M) and does not explore the potential of larger models.
Datasets lack linguistically controlled minimal pairs, which might miss fine-grained semantic differences.
EPR@K-Oracle evaluation is NP-complete, requiring approximations in practical computation.
Future Work: (1) Utilize larger pre-trained models (e.g., T5-XXL or LLMs) as backbones; (2) Integrate classical linguistic knowledge to constrain the structures of generated FOL.

The negative conclusion of Bos (2014) regarding the failure of FOL on single-premise RTE is extended to multi-premise scenarios.
BRIO was originally designed for ranking training in summarization tasks; this paper innovatively applies it to execution-guided semantic parsing.
The approach is conceptually similar to online iterative training strategies in RLHF (such as Iterative DPO and Self-Play).
Pipelines like Logic-LM (LLM + FOL + Prover) are effective on synthetic data but fail to generalize; this paper provides quantitative evidence for this phenomenon.

Rating¶

Novelty: ⭐⭐⭐⭐⭐ Formally defining the EPF task and EPR metrics is a community-level contribution, and the iterative learning-to-rank method is novel.
Experimental Thoroughness: ⭐⭐⭐⭐ Three datasets, multiple baselines, reasoning type analyses, and cross-domain analyses, though the absolute EPR values remain relatively low.
Writing Quality: ⭐⭐⭐⭐⭐ Rigorous problem definition, clear metric definitions, and step-by-step progressive analysis.
Value: ⭐⭐⭐⭐ Provides a new research framework and benchmark for the long-standing challenge of "using FOL for natural language inference".