Hypothetical Documents or Knowledge Leakage? Rethinking LLM-based Query Expansion
Conference: ACL 2025
arXiv: 2504.14175
Code: To be confirmed
Area: Information Retrieval
Keywords: Query Expansion, Knowledge Leakage, Fact Verification, HyDE, Query2doc, Zero-Shot Retrieval
TL;DR
The authors question whether the performance gains of LLM-based query expansion (HyDE/Query2doc) truly stem from "hypothetical document generation." They find that performance improvements consistently occur only when the LLM-generated documents contain sentences semantically consistent with the gold evidence, revealing potential knowledge leakage issues in benchmarks.
Background & Motivation
Background: Zero-shot retrieval is a core component of knowledge-intensive applications. In recent years, LLM-based query expansion (QE) methods such as HyDE and Query2doc have achieved significant performance improvements on multiple benchmarks and have been widely adopted.
Limitations of Prior Work: The core assumption of these methods is that "although LLM-generated hypothetical documents may be inaccurate, they can bridge the semantic distance between the query and the target document"—yet this assumption has never been rigorously validated.
Key Challenge: LLMs are pre-trained on massive corpora, which very likely include the knowledge sources of the benchmarks (e.g., Wikipedia). Thus, are LLMs generating "hypothetical documents" or merely "reciting memorized knowledge"? If it is the latter, retrieval degenerates into a trivial task that is close to exact matching.
Goal: To investigate how much of the performance gains of LLM-based QE methods can be attributed to knowledge leakage rather than genuine hypothetical reasoning capabilities.
Key Insight: Fact verification is selected as the evaluation testbed—this task provides explicit gold evidence for comparison and is a classification task, making it easy to clearly evaluate the impact of QE on the downstream task.
Core Idea: Use NLI to detect whether the LLM-generated document "entails" the gold evidence sentences, partition the samples into matched and unmatched groups for performance comparison, and find that the effectiveness of QE only holds for the matched group.
Method
Overall Architecture
This paper is an empirical analysis study. The core workflow is as follows:
1. Run two mainstream QE methods (Query2doc and HyDE) on three fact-verification benchmarks.
2. Detect whether the LLM-generated documents contain gold evidence using an NLI-based matching algorithm.
3. Split the data into matched and unmatched groups and evaluate retrieval and verification performance on each group separately.
4. Check whether the trends are consistent for seven LLMs across the three datasets.
Key Design 1: NLI-based Matching Algorithm
- Function: Determine whether the expanded document \(d\) generated by the LLM for a claim contains sentences semantically equivalent to the gold evidence.
- Design Motivation: If the LLM "recites" the gold evidence in the expanded document, the improvement in retrieval performance might simply be due to the target answer already being embedded within the query vector.
- Mechanism:
    - Sentence Segmentation: Segment the generated document \(d\) into a set of sentences \(S\) using spaCy, and remove sentences that largely repeat the claim (ROUGE-2 > 0.95).
    - NLI Labeling: Run NLI inference with GPT-4o-mini on every pair \((e_i, s_j) \in E \times S\), where \(E\) is the set of gold evidence sentences, labeling each pair as entailment, contradiction, or neutral.
    - Label Aggregation: If any pair \((e_i, s_j)\) is labeled as entailment, the claim is marked as matched (M); otherwise, it is marked as unmatched (¬M).
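The three steps above can be sketched as follows. This is a minimal illustration, not the authors' implementation: a naive regex splitter stands in for spaCy, `rouge2_f1` is a simplified bigram-overlap F1 without stemming, and `nli` is a hypothetical callable (in the paper this role is played by GPT-4o-mini). The argument order `nli(evidence, sentence)` simply mirrors the pair \((e_i, s_j)\); which side serves as premise is an assumption left to the caller.

```python
import re
from collections import Counter
from typing import Callable, List, Sequence

# Hypothetical interface: returns "entailment", "contradiction", or "neutral".
NLIFn = Callable[[str, str], str]


def split_sentences(text: str) -> List[str]:
    """Naive regex splitter, used here only as a stand-in for spaCy."""
    return [p for p in re.split(r"(?<=[.!?])\s+", text.strip()) if p]


def _bigrams(text: str) -> Counter:
    tokens = re.findall(r"\w+", text.lower())
    return Counter(zip(tokens, tokens[1:]))


def rouge2_f1(candidate: str, reference: str) -> float:
    """Simplified ROUGE-2: F1 over overlapping bigrams (no stemming)."""
    cand, ref = _bigrams(candidate), _bigrams(reference)
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def is_matched(
    claim: str,
    generated_doc: str,
    gold_evidence: Sequence[str],
    nli: NLIFn,
    rouge_threshold: float = 0.95,
) -> bool:
    """Return True (M) if any (evidence, sentence) pair is labeled entailment."""
    # Step 1: segment the document and drop sentences that merely repeat the claim.
    sentences = [
        s for s in split_sentences(generated_doc)
        if rouge2_f1(s, claim) <= rouge_threshold
    ]
    # Steps 2 and 3: label every pair in E x S, then aggregate with "any".
    return any(
        nli(e, s) == "entailment" for e in gold_evidence for s in sentences
    )
```

The ROUGE-2 filter matters because a generated document that simply restates the claim would otherwise be compared against the evidence as if it were new content.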
Key Design 2: Experimental Settings of Two QE Methods
- Query2doc: Generates a pseudo-document \(d\), concatenates \(n\) copies of the query with \(d\) to form the expanded query \(q^+\), and retrieves using BM25; here \(n=5\).
- HyDE: Generates \(N\) hypothetical documents \(d\) (here \(N=1\)), encodes \(q\) and \(d\) separately using Contriever, and averages their vector representations for retrieval.
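The two query constructions can be sketched in a few lines. The function names are hypothetical, whitespace is assumed as the concatenation separator, and plain Python lists stand in for Contriever embeddings; the BM25 and dense retrieval steps themselves are out of scope here.

```python
from typing import List, Sequence


def query2doc_expand(query: str, pseudo_doc: str, n: int = 5) -> str:
    """Build q+ by repeating the query n times and appending the pseudo-document."""
    return " ".join([query] * n + [pseudo_doc])


def hyde_query_vector(
    query_vec: Sequence[float],
    doc_vecs: Sequence[Sequence[float]],
) -> List[float]:
    """Average the query embedding with the embeddings of the generated documents."""
    vectors = [list(query_vec)] + [list(v) for v in doc_vecs]
    dim = len(query_vec)
    if any(len(v) != dim for v in vectors):
        raise ValueError("all embeddings must have the same dimension")
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


if __name__ == "__main__":
    print(query2doc_expand("who wrote Hamlet", "Hamlet is a tragedy by ...", n=2))
    print(hyde_query_vector([1.0, 0.0], [[0.0, 1.0]]))
```

Repeating the query keeps its terms from being drowned out by the much longer pseudo-document in a term-frequency model such as BM25. In both constructions the generated text contributes directly to the final query, which is why a recited evidence sentence can make retrieval nearly trivial.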
Key Design 3: Evaluation Strategy
- Retrieval Evaluation: FEVER/SciFact use Recall@5 and NDCG@5; AVeriTeC uses METEOR and BERTScore (as the gold evidence is human-rewritten).
- Verification Evaluation: Use GPT-4o-mini to perform verdict prediction on the top-5 retrieved evidence and evaluate Macro F1.
- Statistical Significance: Each LLM generation experiment is repeated 8 times, and results are reported as mean \(\pm\) standard error.
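For reference, the metrics named above can be computed as in the sketch below. It uses the standard textbook definitions with binary relevance and unique document IDs; the paper's exact variants (for example, how FEVER evidence sets are counted) may differ, and all function names are illustrative.

```python
import math
import statistics
from typing import Hashable, Iterable, Sequence, Tuple


def recall_at_k(ranked_ids: Sequence[Hashable], relevant_ids: Iterable[Hashable], k: int = 5) -> float:
    """Fraction of relevant items that appear in the top-k results."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    return len(relevant & set(ranked_ids[:k])) / len(relevant)


def ndcg_at_k(ranked_ids: Sequence[Hashable], relevant_ids: Iterable[Hashable], k: int = 5) -> float:
    """NDCG@k with binary gains: DCG of the ranking divided by the ideal DCG."""
    relevant = set(relevant_ids)
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked_ids[:k], start=1)
        if doc_id in relevant
    )
    ideal_hits = min(k, len(relevant))
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0


def macro_f1(gold: Sequence[str], pred: Sequence[str]) -> float:
    """Unweighted mean of per-label F1 over all labels seen in gold or pred."""
    scores = []
    for label in sorted(set(gold) | set(pred)):
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores) if scores else 0.0


def mean_and_stderr(values: Sequence[float]) -> Tuple[float, float]:
    """Mean and standard error (sample std / sqrt(n)) over replicated runs."""
    mean = statistics.mean(values)
    stderr = statistics.stdev(values) / math.sqrt(len(values))
    return mean, stderr
```

With 8 replications, `mean_and_stderr` would be applied to the 8 per-run scores of each metric.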
Loss & Training
This work does not involve model training and is a purely analytical study. All LLMs are utilized in a zero-shot prompting manner.
Key Experimental Results
Main Results: Overall Performance of QE Methods (Query2doc + GPT-4o-mini, k=5)
| Metric | FEVER | SciFact | AVeriTeC |
|---|---|---|---|
| BM25 baseline Recall@5 | 31.0 | 51.2 | 17.8 (METEOR) |
| Query2doc Recall@5 | 36.4 | 55.1 | 19.1 (METEOR) |
| Query2doc F1 | 55.6 | 52.5 | 32.6 |
QE significantly outperforms the baseline across all three datasets (\(p < 0.001\)), with consistent trends observed across all seven LLMs.
Core Analysis: Performance Comparison of Matched vs. Unmatched (GPT-4o-mini)
| Condition | FEVER Recall@5 | SciFact Recall@5 | AVeriTeC METEOR |
|---|---|---|---|
| Query2doc ALL | 36.4 | 55.1 | 19.1 |
| Matched (M) | 40.5 | 63.3 | 21.6 |
| Unmatched (¬M) | 23.8 | 45.9 | 17.4 |
| BM25 baseline | 31.0 | 51.2 | 17.8 |

| Condition | FEVER Recall@5 | SciFact Recall@5 | AVeriTeC METEOR |
|---|---|---|---|
| HyDE ALL | 37.3 | 61.2 | 18.7 |
| Matched (M) | 40.0 | 68.4 | 19.8 |
| Unmatched (¬M) | 23.4 | 50.8 | 16.4 |
| Contriever baseline | 26.8 | 55.1 | 17.6 |
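
To make the comparison in the two tables above explicit, the short script below computes each group's gain over its no-QE baseline. The numbers are copied from the tables; the dictionary layout and function name are only illustrative.

```python
# (method, dataset) -> scores copied from the matched/unmatched tables above.
RESULTS = {
    ("Query2doc", "FEVER"): {"baseline": 31.0, "matched": 40.5, "unmatched": 23.8},
    ("Query2doc", "SciFact"): {"baseline": 51.2, "matched": 63.3, "unmatched": 45.9},
    ("Query2doc", "AVeriTeC"): {"baseline": 17.8, "matched": 21.6, "unmatched": 17.4},
    ("HyDE", "FEVER"): {"baseline": 26.8, "matched": 40.0, "unmatched": 23.4},
    ("HyDE", "SciFact"): {"baseline": 55.1, "matched": 68.4, "unmatched": 50.8},
    ("HyDE", "AVeriTeC"): {"baseline": 17.6, "matched": 19.8, "unmatched": 16.4},
}


def gain_over_baseline(row: dict) -> dict:
    """Difference between each group's score and the no-QE baseline."""
    return {
        group: round(row[group] - row["baseline"], 1)
        for group in ("matched", "unmatched")
    }


GAINS = {key: gain_over_baseline(row) for key, row in RESULTS.items()}

if __name__ == "__main__":
    for (method, dataset), gain in GAINS.items():
        print(f"{method:9s} {dataset:9s} M: {gain['matched']:+.1f}  notM: {gain['unmatched']:+.1f}")
```

In all six method and dataset combinations the matched group is above the baseline and the unmatched group is below it, for example +9.5 versus -7.2 for Query2doc on FEVER.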
Knowledge Leakage Ratio (Proportion of Matches, Table 3 Summary)
| LLM | FEVER (Q2d / HyDE) | SciFact (Q2d / HyDE) | AVeriTeC (Q2d / HyDE) |
|---|---|---|---|
| GPT-4o-mini | 75.8% / 83.5% | 52.8% / 59.1% | 40.4% / 68.0% |
| Llama-3.1-70b | 78.3% / 71.7% | 57.5% / 55.0% | 48.1% / 47.0% |
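
The match ratios also tie the three rows of each performance table together. If ALL, M, and ¬M are per-claim averages over the same data, then ALL should be close to \(r \cdot \text{M} + (1 - r) \cdot \text{¬M}\), where \(r\) is the match ratio. The sketch below checks this for GPT-4o-mini using the reported numbers; the 0.1 tolerance is an assumption meant to absorb rounding and averaging over replications.

```python
# (method, dataset) -> (match ratio r, matched score, unmatched score, reported ALL),
# all taken from the GPT-4o-mini rows of the tables above.
ROWS = {
    ("Query2doc", "FEVER"): (0.758, 40.5, 23.8, 36.4),
    ("Query2doc", "SciFact"): (0.528, 63.3, 45.9, 55.1),
    ("Query2doc", "AVeriTeC"): (0.404, 21.6, 17.4, 19.1),
    ("HyDE", "FEVER"): (0.835, 40.0, 23.4, 37.3),
    ("HyDE", "SciFact"): (0.591, 68.4, 50.8, 61.2),
    ("HyDE", "AVeriTeC"): (0.680, 19.8, 16.4, 18.7),
}


def weighted_overall(ratio: float, matched: float, unmatched: float) -> float:
    """Overall score implied by the group scores and the match ratio."""
    return ratio * matched + (1.0 - ratio) * unmatched


def max_deviation(rows: dict) -> float:
    """Largest absolute gap between the implied and the reported ALL score."""
    return max(
        abs(weighted_overall(r, m, u) - reported)
        for (r, m, u, reported) in rows.values()
    )


if __name__ == "__main__":
    for key, (r, m, u, reported) in ROWS.items():
        print(key, round(weighted_overall(r, m, u), 2), "reported:", reported)
    print("max deviation:", round(max_deviation(ROWS), 3))
```

The implied and reported values agree to within 0.1 point in every case, so the overall gain of QE is driven by how large the matched share is: the higher the leakage ratio, the more the ALL score is pulled toward the matched group.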
Key Findings
- Widespread Knowledge Leakage: In most cases, more than 40% of the claims have expanded documents containing sentences semantically consistent with the gold evidence, reaching up to 83.5% on FEVER.
- Performance Gains Stem from Matched Samples: The matched group scores significantly higher than both the overall set and the unmatched group (\(p < 0.001\)), while in most cases the unmatched group performs on par with or worse than the baseline without QE.
- Consistent Trends Across Models and Datasets: The conclusions are highly consistent across seven LLMs \(\times\) three datasets \(\times\) two QE methods.
- Warnings for Practical Applications: For claims involving emerging or niche knowledge, QE methods may be not only ineffective but even detrimental.
Highlights & Insights
- Poses a valuable, counter-intuitive question: The paper challenges the widely accepted core assumption behind HyDE and Query2doc, a commendable act of academic courage.
- Simple and Effective Methodology: The NLI-based matching algorithm is simple and intuitive but accurately quantifies the degree of knowledge leakage.
- Rigorous Experimental Design: The study provides comprehensive coverage with 7 LLMs \(\times\) 3 benchmarks \(\times\) 2 QEs \(\times\) 8 replications \(\times\) matched/unmatched stratified analysis.
- Significance as a Warning to the Community: It reminds researchers to consider the impact of data contamination/knowledge leakage when evaluating LLM-based retrieval methods, promoting more equitable benchmark design.
Limitations & Future Work
- Causal Relationship Not Established: Only a correlation between LLM behavior and leakage is observed; the causal chain of "training data \(\rightarrow\) generation" has not been established.
- NLI Judgment Quality: Relying on GPT-4o-mini for NLI labeling may introduce bias; although manual verification was conducted, its scale was limited.
- Limited Task Scope: Validated only on fact verification tasks; whether this generalizes to other retrieval-intensive tasks such as QA or conversational retrieval remains unknown.
- Lack of Solutions: This work is primarily analytical and does not propose concrete methods to mitigate knowledge leakage.
- Unexplored Potential of QE on "Truly New Knowledge": Whether QE can regain effectiveness if integrated with external knowledge sources is only briefly mentioned in the Discussion.
Related Work & Insights
vs. HyDE (Gao et al., 2023)
HyDE assumes that generated hypothetical documents can facilitate retrieval even if they contain factual errors. Through NLI analysis, this study demonstrates that the performance gains of HyDE largely depend on the LLM's recitation of memorized gold evidence rather than hypothetical reasoning, undermining the theoretical foundation of HyDE.
vs. Data Contamination Studies (Deng et al., 2023; Xu et al., 2024)
Existing research on data contamination detects whether LLMs have "seen" test data using methods like perplexity and token prediction. This work is the first to introduce the knowledge leakage issue to the query expansion field, utilizing NLI matching as a detection mechanism—a natural extension of data contamination research in the IR direction.
vs. Query Expansion + External Knowledge (Lei et al., 2024)
Some recent works have begun introducing external knowledge sources to enhance QE. The findings of this paper provide strong motivation for such approaches: if the gains from an LLM's internal knowledge are largely a product of leakage, grounding QE in external information is the more promising direction.
Rating
- Novelty: ⭐⭐⭐⭐⭐ — Systematically challenges widely accepted assumptions in the QE domain, with highly insightful research questions.
- Experimental Thoroughness: ⭐⭐⭐⭐ — Comprehensive coverage with 7 LLMs \(\times\) 3 datasets \(\times\) 2 methods, backed by rigorous statistical testing.
- Writing Quality: ⭐⭐⭐⭐ — Clear formulation of questions, rigorous experimental logic, and deep discussion.
- Value: ⭐⭐⭐⭐ — Holds significant warning value for the IR community, driving more rigorous benchmark evaluation standards.