Enhancing Retrieval Systems with Inference-Time Logical Reasoning¶
Conference: ACL 2025
arXiv: 2503.17860
Code: Undisclosed
Authors: Felix Faltings, Wei Wei, Yujia Bao
Affiliations: MIT, Accenture Center for Advanced AI
Area: LLM Inference / Information Retrieval
Keywords: Logical Reasoning, Inference-Time Reasoning, Fuzzy Logic, Compositional Queries, Dense Retrieval, Negation Handling
TL;DR¶
Proposes an Inference-Time Logical Reasoning (ITLR) framework, which leverages LLMs to decompose natural language queries into logical expressions (AND/OR/NOT), and then composes the cosine similarity scores of individual sub-terms based on fuzzy logic. It consistently outperforms traditional dense retrieval and the BRIGHT baseline on synthetic data and three real-world datasets (NFCorpus/SciFact/ArguAna), showing significant improvements particularly on complex queries containing negations.
Background & Motivation¶
Background: Modern retrieval systems map user queries into a vector space and retrieve documents using cosine similarity. Although efficient and scalable, they perform poorly when handling logical constructs (negations, conjunctions, disjunctions).
Core Problem: - Traditional dense retrieval cannot effectively represent negation semantics—for example, when querying "Vitamin D Benefits, but not including bone health," the system still tends to retrieve documents related to bone health. - As the number of logical terms in a query increases, the performance of traditional methods degrades significantly. - Existing inference-enhanced retrieval works (such as BRIGHT) are mainly used for knowledge question answering, and inference-time reasoning for general retrieval systems has not been fully explored.
Design Motivation: To explicitly introduce inference-time logical reasoning into the retrieval workflow, decomposing queries into logical expressions and composing sub-term scores, thus allowing retrieval systems to handle complex logical queries while maintaining computational efficiency.
Method¶
Overall Architecture¶
The ITLR method consists of three key steps:
- Logical Query Transformation: Utilizes LLMs (Llama3-70b) to parse natural language queries into logical forms, such as
"Vitamin D Benefits" AND NOT "Bone Health". - Term Embedding and Similarity Computation: Separately encodes each sub-term in the logical expression into vectors, and computes their respective cosine similarity with the documents.
- Score Composition Based on Logical Relations: Composes sub-term scores based on logical relations to obtain the final document ranking score.
Query Syntax¶
Defines a concise query syntax supporting three operators: AND, OR, NOT, as well as bracket precedence:
- Precedence rules: NOT > AND > OR
- Strictly defined using a Context-Free Grammar (CFG):
- Each query is parsed into a syntax tree, then directly converted into a score operation tree.
Score Operations¶
Defines various candidate operations for AND, OR, NOT respectively:
| Operator | Candidate Implementations |
|---|---|
| OP_AND(x,y) | \(x \times y\), \(x+y\), \(\min(x,y)\) |
| OP_OR(x,y) | \(x+y\), \(\max(x,y)\) |
| OP_NOT(x) | \(1-x\), \(1/x\) |
Default Choice (Experimental verification shows this is optimal):
- \(\text{OP\_AND}(x,y) = x \times y\)
- \(\text{OP\_OR}(x,y) = x + y\)
- \(\text{OP\_NOT}(x) = 1 - x\)
Key Designs: - The commonly used \(\text{AND}=\min\), \(\text{OR}=\max\) compositions in traditional fuzzy logic perform poorly in this scenario (nDCG@10 of only 0.86 vs 0.97 for the default). - Multiplication operations can better penalize documents with low scores on any sub-terms, which aligns with the AND semantics.
Computational Efficiency¶
- Embedding computations for each sub-term can be executed in parallel.
- Introduces only minimal extra overhead compared to the baseline (vector encoding scaled by the number of sub-terms).
- Operator compositions are simple numerical operations, which do not affect retrieval latency.
Experiments¶
Experimental Setup¶
- Embedding Models: NV-Embed-V1, nv-embedqa-mistral-7b-v2, text-embedding-v3-large, text-embedding-v3-small
- LLMs: Llama3-70b (used for query transformation and data generation)
- Evaluation Metrics: nDCG@10
- Compared Methods: Baseline (Direct Query), BRIGHT (Inference-then-Query), ITLR (Ours)
Synthetic Data Experiments¶
Three-term Query Experiments: Tested on 32 possible three-term logical query templates, with 100 queries per template:
| Number of Negations | Base | ITLR |
|---|---|---|
| 0 | 0.95 | 0.99 |
| 1 | 0.77 | 0.97 |
| 2 | 0.65 | 0.96 |
| 3 | 0.52 | 1.00 |
Key Findings: The more negations there are, the larger the relative improvement of ITLR. With 3 negations, the baseline achieves only 0.52 (close to random at 0.7), while ITLR reaches 1.00.
Scaling to More Terms: As the number of query terms increases (from 2 upwards), the performance gap between ITLR and the baseline continually widens, with a more significant gap observed on AND queries.
Ablation on Operator Combinations (Table 2, NV-Embed-V1)¶
| NOT | AND | OR | nDCG@10 |
|---|---|---|---|
| \(1-x\) | \(x \times y\) | \(x+y\) | 0.97 |
| \(1-x\) | \(x+y\) | \(x+y\) | 0.97 |
| \(1/x\) | \(x \times y\) | \(x+y\) | 0.90 |
| \(1-x\) | \(\min\) | \(\max\) | 0.86 |
| \(1/x\) | \(\min\) | \(\max\) | 0.86 |
Multiplication AND + Addition OR + Inversion NOT is the optimal combination.
Main Results (Table 3)¶
Generated 960 compositional reasoning queries using Llama3-70b on NFCorpus, SciFact, and ArguAna:
| Method | NFCorpus | SciFact | ArguAna |
|---|---|---|---|
| NV-Embed-V1 | |||
| Baseline | 0.56 | 0.51 | 0.51 |
| BRIGHT | 0.67 | 0.59 | 0.58 |
| ITLR | 0.74 | 0.64 | 0.64 |
| text-embedding-v3-large | |||
| Baseline | 0.63 | 0.59 | 0.63 |
| BRIGHT | 0.70 | 0.63 | 0.66 |
| ITLR | 0.73 | 0.64 | 0.67 |
| nv-embedqa-mistral-7b-v2 | |||
| Baseline | 0.54 | 0.50 | 0.40 |
| BRIGHT | 0.48 | 0.39 | 0.29 |
| ITLR | 0.67 | 0.61 | 0.59 |
ITLR achieves the best performance in 11 out of 12 (Model × Dataset) combinations.
Breakdown by Number of Negations (NFCorpus, NV-Embed-V1, Table 4)¶
| Number of Negations | Base | Reasoning | Logical |
|---|---|---|---|
| 0 | 0.81 | 0.81 | 0.76 |
| 1 | 0.60 | 0.68 | 0.71 |
| 2 | 0.51 | 0.64 | 0.76 |
| 3 | 0.36 | 0.56 | 0.73 |
Findings: Without negations, ITLR is slightly lower than the baseline (0.76 vs 0.81), but its advantage significantly expands when negations are present. ITLR maintains stable performance on queries with 2+ negations (0.76/0.73), whereas the baseline drops drastically.
Highlights & Insights¶
- Elegant Problem Decomposition: Reformulates the retrieval problem of complex queries into a logical expression solving problem, bridging discrete logic and continuous similarity scores via fuzzy logic.
- Plug-and-Play: The method does not modify the underlying embedding model and can be used on top of any dense retrieval model.
- Computationally Efficient: Sub-term embedding calculations can be parallelized, and score composition involves only simple arithmetic operations, introducing negligible actual latency.
- Systematic Findings on Negation Handling: Negation is the Achilles' heel of traditional retrieval—the base model's performance drops to 0.52 with 3 negations, while ITLR perfectly maintains performance; this finding has direct practical implications for RAG system design.
- Counter-intuitive Findings on Operator Selection: The min/max combination of traditional fuzzy logic underperforms in this scenario, while the multiplication/addition combination is superior.
Limitations & Future Work¶
- Slight Performance Drop on Queries Without Negations: On simple queries without negations, ITLR is about 5 percentage points lower than the baseline (0.76 vs 0.81), indicating that logical decomposition introduces a slight loss of information.
- Dependency on LLM Query Transformation Quality: The query transformation is implemented using only simple prompts without fine-tuning a dedicated model, which may result in unstable transformation quality for complex natural language queries.
- Score Calibration Problem: Different sub-terms may have different score distributions, leading the AND operation to favor documents that score high on certain sub-terms; the paper admits that no simple calibration method has been found yet.
- Evaluation Limitations: Since both queries and labels in real-world data experiments are generated by Llama3-70b, there is a risk of evaluation circular dependency.
- Evaluation Restricted to Three-Term Queries: Only three-term logical query templates were tested on real-world data, and the effectiveness on more complex nested logical structures remains unknown.
Related Work & Insights¶
- Dense Retrieval: Sentence-BERT (Reimers, 2019), NV-Embed (Lee et al., 2024), text-embedding-v3 (Wang et al., 2023)
- Reasoning-Enhanced Retrieval: BRIGHT (Su et al., 2024) optimizes queries using LLM reasoning trajectories, StructGPT (Jiang et al., 2023) structured reasoning, ChatKBQA (Luo et al., 2023) integrates generation and retrieval.
- Logical Retrieval: Term logic-based retrieval model by Meghini et al. (1993), RELIEF (Ounis & Paşca, 1998).
- LLM Inference-Time Reasoning: Chain-of-Thought (Wei et al., 2022), Tree of Thoughts (Yao et al., 2024).
Rating¶
⭐⭐⭐⭐ — The method is simple and elegant, showing remarkable effectiveness in handling negations and high practical value (with direct guidance for RAG/search systems). However, the slight performance degradation on queries without negations and the reliance on LLM-generated data for evaluation are minor drawbacks.