Entailment-Preserving First-order Logic Representations in Natural Language Entailment¶
Conference: ACL 2025
arXiv: 2502.16757
Code: None
Area: LLM Efficiency
Keywords: First-order logic, natural language entailment, semantic parsing, Learning-to-rank, theorem proving
TL;DR¶
This paper formally defines the Entailment-Preserving First-Order Logic Representation (EPF) task and a set of reference-free evaluation metrics (EPR family). It proposes an iterative learning-to-rank training method that optimizes the NL→FOL translation of T5 models via the BRIO loss, enabling the generated FOL representations to be verified for entailment relations by automated theorem provers. It improves EPR by 1.8–2.7% and EPR@16 by 17.4–20.6% across three datasets.
Background & Motivation¶
- Background: First-Order Logic (FOL) is a classic representation of natural language semantics, allowing automated theorem provers to determine entailment relations in theory. NL→FOL translation has recently attracted attention due to the code generation capabilities of LLMs.
- Limitations of Prior Work:
- Classic methods (such as translating via CCG/AMR to FOL) almost completely fail on RTE tasks (Bos 2014 reported only 1.9% recall).
- LLMs (such as GPT-4o) perform exceptionally well on synthetic logical entailment tasks but poorly on natural language entailment.
- The core problem is arbitrariness: LLMs generate inconsistent predicate names and arities for synonymous concepts, leading to theorem proving failures.
- Key Challenge: Natural language entailment is much broader than strict logical entailment ("a human reading P would infer that h is likely true"), and traditional FOL cannot capture this non-strict entailment.
- Key Insight: Do not rely on reference FOL representations; instead, directly use the execution results of theorem provers as optimization signals.
- Core Idea: Use learning-to-rank to teach the model to favor generating FOL representations that preserve entailment in combinations, thereby indirectly reducing arbitrariness.
Method¶
Overall Architecture¶
- Base Model Training: Fine-tune T5-base with standard cross-entropy on MALLS (an NL-FOL parallel corpus) to obtain \(S_0\).
- Sampling: For each sentence in the training set, sample K FOL representations using beam search with \(S_t\).
- Evaluation: Use the Vampire theorem prover to check all possible premise-hypothesis FOL combinations and calculate the score for each output.
- Training: Train the model using the BRIO loss for ranking outputs to obtain \(S_{t+1}\).
- Iterate 5 times to obtain \(S_5\).
Key Designs¶
-
EPR Metric Family (Entailment-Preserving Rate):
- EPR: For each premise-hypothesis pair, select the highest-probability FOL translation combination and check if entailment is preserved.
- EPR@K: Allow K translations per sentence; success is achieved if any combination preserves entailment (a relaxed version).
- EPR@K-Oracle: Choose one translation per sentence to globally optimize EPR (NP-complete, solved using Answer Set Programming / ASP).
- Inequality relation: EPR = EPR@1 ≤ EPR@K-Oracle ≤ EPR@K
- Completely reference-free: No gold FOL representations are needed; only NL entailment labels are required.
- Spurious entailment prevention: Verification steps ensure the hypothesis does not introduce new predicates/constants, and the proof must utilize all premises.
-
Scoring Function:
- Since the global EPR metric cannot be directly used for sentence-level training, a sentence-level score is defined.
- The score of an FOL output \(S(p)_j\) is equal to the number of entailment-preserving combinations that include it.
- Syntax-invalid outputs receive a score of -1.
- If a sentence appears in multiple premise-hypothesis pairs, its scores are accumulated.
- Design Motivation: Maximizing the score naturally improves the global EPR.
-
BRIO Learning-to-Rank Loss:
- For each input, its K outputs are sorted in descending order of their scores.
- Goal: Ensure that the average token log-probability of higher-scoring outputs is greater than that of lower-scoring outputs.
- \(\mathcal{L}_{BRIO} = \sum_i \sum_j max(\hat{p}(y_j|x) - \hat{p}(y_i|x) + \Delta(j-i), 0)\)
- Incorporate cross-entropy regularization: \(\mathcal{L} = \mathcal{L}_{CE} + \lambda \mathcal{L}_{BRIO}\)
- \(\Delta=0.01\), \(\lambda=10\)
-
Iterative Training:
- Re-evaluate and train in each iteration using the current model's sampled results.
- Analogous to online iterative strategies in RLHF (e.g., Iterative DPO).
- 5 iterations, 20 epochs each.
Loss & Training¶
- Initial Training: Fine-tune T5-base with standard cross-entropy on MALLS (34k NL-FOL pairs).
- Iterative Training: \(\mathcal{L} = \mathcal{L}_{CE} + 10 \cdot \mathcal{L}_{BRIO}\)
- Sample \(K=16\) outputs using beam search.
- Perform fully automated evaluation using the Vampire theorem prover.
Key Experimental Results¶
Main Results¶
| Dataset | Metric | T5-Iter5 | GPT-4o | T5-Iter0 | CCG2Lambda |
|---|---|---|---|---|---|
| EntailmentBank | EPR | 7.4 | 2.9 | 5.6 | 0.0 |
| eQASC | EPR | 4.9 | 1.1 | 2.6 | 0.0 |
| e-SNLI | EPR | 4.3 | 1.5 | 0.1 | 0.0 |
| EntailmentBank | EPR@16 | 32.8 | 13.2 | 15.4 | - |
| eQASC | EPR@16 | 33.1 | 11.4 | 12.5 | - |
| e-SNLI | EPR@16 | 36.1 | 8.3 | 3.4 | - |
Ablation Study¶
| Configuration | Key Metric | Description |
|---|---|---|
| Number of Iterations (0→5) | Continuous EPR increase | EPR, EPR@16, and EPR@16-Oracle monotonically increase across all datasets. |
| Unique Predicate Names/Sent. | Significant decrease after Iter1 | Synonymous concepts are mapped to fewer predicate names, reducing arbitrariness. |
| Arity Entropy | Continuous decrease | Predicate arities tend to be consistent (e.g., CauseCycles converges from a mix of 2 and 3 arguments to 2 arguments consistently). |
| Cross-dataset Transfer | EPR@16 increase | Iter5 trained on dataset A also outperforms Iter0 on dataset B, showing transferability. |
Key Findings¶
- Classic NL→FOL methods (e.g., CCG2Lambda, AMR2FOL) have EPRs close to 0 on multi-premise RTE, resulting in complete failure.
- Although GPT-4o performs strongly on synthetic logical reasoning, its EPR for FOL generation on natural language entailment is only 1-3%.
- Core Mechanism of Iterative Training: The BRIO loss only provides gradient signals when different outputs have different scores, yet the model generalizes well to unseen combinations.
- Arbitrariness analysis reveals a direct causal relationship with EPR improvement: rising predicate name consistency \(\uparrow \rightarrow\) success rate \(\uparrow\).
- EPR@16-Oracle is close to EPR@16 (with a gap of only 1.7 percentage points), suggesting the existence of a near-optimal single-selection strategy.
- Models trained on e-SNLI show the best cross-domain generalization, likely due to the large scale of the data and its broad semantic coverage.
Highlights & Insights¶
- Ingenious design of the EPR metric: completely reference-free, allows arbitrary FOL structures, and indirectly evaluates semantic quality through execution results (theorem proving).
- Modeling NL→FOL translation as an execution-guided sequence generation problem, which is closely aligned with execution-guided methods in semantic parsing.
- The iterative learning-to-rank framework is highly generalizable and can be extended to other execution-guided generation tasks.
- Profound analysis of arbitrariness—which represents the fundamental barrier for LLMs in FOL generation—providing both quantitative metrics and solutions in this paper.
Limitations & Future Work¶
- Even for the best T5-Iter5, the EPR is only 7.4% (on EntailmentBank), which is still far from the theoretical upper bound (EPR@16-Oracle \(\approx 31\%\)).
- The study uses T5-base (220M) and does not explore the potential of larger models.
- Datasets lack linguistically controlled minimal pairs, which might miss fine-grained semantic differences.
- EPR@K-Oracle evaluation is NP-complete, requiring approximations in practical computation.
- Future Work: (1) Utilize larger pre-trained models (e.g., T5-XXL or LLMs) as backbones; (2) Integrate classical linguistic knowledge to constrain the structures of generated FOL.
Related Work & Insights¶
- The negative conclusion of Bos (2014) regarding the failure of FOL on single-premise RTE is extended to multi-premise scenarios.
- BRIO was originally designed for ranking training in summarization tasks; this paper innovatively applies it to execution-guided semantic parsing.
- The approach is conceptually similar to online iterative training strategies in RLHF (such as Iterative DPO and Self-Play).
- Pipelines like Logic-LM (LLM + FOL + Prover) are effective on synthetic data but fail to generalize; this paper provides quantitative evidence for this phenomenon.
Rating¶
- Novelty: ⭐⭐⭐⭐⭐ Formally defining the EPF task and EPR metrics is a community-level contribution, and the iterative learning-to-rank method is novel.
- Experimental Thoroughness: ⭐⭐⭐⭐ Three datasets, multiple baselines, reasoning type analyses, and cross-domain analyses, though the absolute EPR values remain relatively low.
- Writing Quality: ⭐⭐⭐⭐⭐ Rigorous problem definition, clear metric definitions, and step-by-step progressive analysis.
- Value: ⭐⭐⭐⭐ Provides a new research framework and benchmark for the long-standing challenge of "using FOL for natural language inference".