ACL 2025 Reasoning Logical Reasoning Inference-Time Reasoning Fuzzy Logic Compositional Queries Dense Retrieval Negation Handling

Enhancing Retrieval Systems with Inference-Time Logical Reasoning¶

Conference: ACL 2025
arXiv: 2503.17860
Code: Undisclosed
Authors: Felix Faltings, Wei Wei, Yujia Bao
Affiliations: MIT, Accenture Center for Advanced AI
Area: LLM Inference / Information Retrieval
Keywords: Logical Reasoning, Inference-Time Reasoning, Fuzzy Logic, Compositional Queries, Dense Retrieval, Negation Handling

TL;DR¶

Proposes an Inference-Time Logical Reasoning (ITLR) framework, which leverages LLMs to decompose natural language queries into logical expressions (AND/OR/NOT), and then composes the cosine similarity scores of individual sub-terms based on fuzzy logic. It consistently outperforms traditional dense retrieval and the BRIGHT baseline on synthetic data and three real-world datasets (NFCorpus/SciFact/ArguAna), showing significant improvements particularly on complex queries containing negations.

Background & Motivation¶

Background: Modern retrieval systems map user queries into a vector space and retrieve documents using cosine similarity. Although efficient and scalable, they perform poorly when handling logical constructs (negations, conjunctions, disjunctions).

Core Problem: - Traditional dense retrieval cannot effectively represent negation semantics—for example, when querying "Vitamin D Benefits, but not including bone health," the system still tends to retrieve documents related to bone health. - As the number of logical terms in a query increases, the performance of traditional methods degrades significantly. - Existing inference-enhanced retrieval works (such as BRIGHT) are mainly used for knowledge question answering, and inference-time reasoning for general retrieval systems has not been fully explored.

Design Motivation: To explicitly introduce inference-time logical reasoning into the retrieval workflow, decomposing queries into logical expressions and composing sub-term scores, thus allowing retrieval systems to handle complex logical queries while maintaining computational efficiency.

Method¶

Overall Architecture¶

The ITLR method consists of three key steps:

Logical Query Transformation: Utilizes LLMs (Llama3-70b) to parse natural language queries into logical forms, such as "Vitamin D Benefits" AND NOT "Bone Health".
Term Embedding and Similarity Computation: Separately encodes each sub-term in the logical expression into vectors, and computes their respective cosine similarity with the documents.
Score Composition Based on Logical Relations: Composes sub-term scores based on logical relations to obtain the final document ranking score.

Query Syntax¶

Defines a concise query syntax supporting three operators: AND, OR, NOT, as well as bracket precedence:

Precedence rules: NOT > AND > OR

Strictly defined using a Context-Free Grammar (CFG):

T → U OR U | U; U → V AND V | V; V → NOT W | W; W → string | (T)

Each query is parsed into a syntax tree, then directly converted into a score operation tree.

Score Operations¶

Defines various candidate operations for AND, OR, NOT respectively:

Operator	Candidate Implementations
OP_AND(x,y)	\(x \times y\), \(x+y\), \(\min(x,y)\)
OP_OR(x,y)	\(x+y\), \(\max(x,y)\)
OP_NOT(x)	\(1-x\), \(1/x\)

Default Choice (Experimental verification shows this is optimal): - \(\text{OP\_AND}(x,y) = x \times y\) - \(\text{OP\_OR}(x,y) = x + y\)
- \(\text{OP\_NOT}(x) = 1 - x\)

Key Designs: - The commonly used \(\text{AND}=\min\), \(\text{OR}=\max\) compositions in traditional fuzzy logic perform poorly in this scenario (nDCG@10 of only 0.86 vs 0.97 for the default). - Multiplication operations can better penalize documents with low scores on any sub-terms, which aligns with the AND semantics.

Computational Efficiency¶

Embedding computations for each sub-term can be executed in parallel.
Introduces only minimal extra overhead compared to the baseline (vector encoding scaled by the number of sub-terms).
Operator compositions are simple numerical operations, which do not affect retrieval latency.

Experiments¶

Experimental Setup¶

Embedding Models: NV-Embed-V1, nv-embedqa-mistral-7b-v2, text-embedding-v3-large, text-embedding-v3-small
LLMs: Llama3-70b (used for query transformation and data generation)
Evaluation Metrics: nDCG@10
Compared Methods: Baseline (Direct Query), BRIGHT (Inference-then-Query), ITLR (Ours)

Synthetic Data Experiments¶

Three-term Query Experiments: Tested on 32 possible three-term logical query templates, with 100 queries per template:

Number of Negations	Base	ITLR
0	0.95	0.99
1	0.77	0.97
2	0.65	0.96
3	0.52	1.00

Key Findings: The more negations there are, the larger the relative improvement of ITLR. With 3 negations, the baseline achieves only 0.52 (close to random at 0.7), while ITLR reaches 1.00.

Scaling to More Terms: As the number of query terms increases (from 2 upwards), the performance gap between ITLR and the baseline continually widens, with a more significant gap observed on AND queries.

Ablation on Operator Combinations (Table 2, NV-Embed-V1)¶

NOT	AND	OR	nDCG@10
\(1-x\)	\(x \times y\)	\(x+y\)	0.97
\(1-x\)	\(x+y\)	\(x+y\)	0.97
\(1/x\)	\(x \times y\)	\(x+y\)	0.90
\(1-x\)	\(\min\)	\(\max\)	0.86
\(1/x\)	\(\min\)	\(\max\)	0.86

Multiplication AND + Addition OR + Inversion NOT is the optimal combination.

Main Results (Table 3)¶

Generated 960 compositional reasoning queries using Llama3-70b on NFCorpus, SciFact, and ArguAna:

Method	NFCorpus	SciFact	ArguAna
NV-Embed-V1
Baseline	0.56	0.51	0.51
BRIGHT	0.67	0.59	0.58
ITLR	0.74	0.64	0.64
text-embedding-v3-large
Baseline	0.63	0.59	0.63
BRIGHT	0.70	0.63	0.66
ITLR	0.73	0.64	0.67
nv-embedqa-mistral-7b-v2
Baseline	0.54	0.50	0.40
BRIGHT	0.48	0.39	0.29
ITLR	0.67	0.61	0.59

ITLR achieves the best performance in 11 out of 12 (Model × Dataset) combinations.

Breakdown by Number of Negations (NFCorpus, NV-Embed-V1, Table 4)¶

Number of Negations	Base	Reasoning	Logical
0	0.81	0.81	0.76
1	0.60	0.68	0.71
2	0.51	0.64	0.76
3	0.36	0.56	0.73

Findings: Without negations, ITLR is slightly lower than the baseline (0.76 vs 0.81), but its advantage significantly expands when negations are present. ITLR maintains stable performance on queries with 2+ negations (0.76/0.73), whereas the baseline drops drastically.

Highlights & Insights¶

Elegant Problem Decomposition: Reformulates the retrieval problem of complex queries into a logical expression solving problem, bridging discrete logic and continuous similarity scores via fuzzy logic.
Plug-and-Play: The method does not modify the underlying embedding model and can be used on top of any dense retrieval model.
Computationally Efficient: Sub-term embedding calculations can be parallelized, and score composition involves only simple arithmetic operations, introducing negligible actual latency.
Systematic Findings on Negation Handling: Negation is the Achilles' heel of traditional retrieval—the base model's performance drops to 0.52 with 3 negations, while ITLR perfectly maintains performance; this finding has direct practical implications for RAG system design.
Counter-intuitive Findings on Operator Selection: The min/max combination of traditional fuzzy logic underperforms in this scenario, while the multiplication/addition combination is superior.

Limitations & Future Work¶

Slight Performance Drop on Queries Without Negations: On simple queries without negations, ITLR is about 5 percentage points lower than the baseline (0.76 vs 0.81), indicating that logical decomposition introduces a slight loss of information.
Dependency on LLM Query Transformation Quality: The query transformation is implemented using only simple prompts without fine-tuning a dedicated model, which may result in unstable transformation quality for complex natural language queries.
Score Calibration Problem: Different sub-terms may have different score distributions, leading the AND operation to favor documents that score high on certain sub-terms; the paper admits that no simple calibration method has been found yet.
Evaluation Limitations: Since both queries and labels in real-world data experiments are generated by Llama3-70b, there is a risk of evaluation circular dependency.
Evaluation Restricted to Three-Term Queries: Only three-term logical query templates were tested on real-world data, and the effectiveness on more complex nested logical structures remains unknown.

Dense Retrieval: Sentence-BERT (Reimers, 2019), NV-Embed (Lee et al., 2024), text-embedding-v3 (Wang et al., 2023)
Reasoning-Enhanced Retrieval: BRIGHT (Su et al., 2024) optimizes queries using LLM reasoning trajectories, StructGPT (Jiang et al., 2023) structured reasoning, ChatKBQA (Luo et al., 2023) integrates generation and retrieval.
Logical Retrieval: Term logic-based retrieval model by Meghini et al. (1993), RELIEF (Ounis & Paşca, 1998).
LLM Inference-Time Reasoning: Chain-of-Thought (Wei et al., 2022), Tree of Thoughts (Yao et al., 2024).

Rating¶

⭐⭐⭐⭐ — The method is simple and elegant, showing remarkable effectiveness in handling negations and high practical value (with direct guidance for RAG/search systems). However, the slight performance degradation on queries without negations and the reliance on LLM-generated data for evaluation are minor drawbacks.