Adaptive Retrieval without Self-Knowledge? Bringing Uncertainty Back Home
Conference: ACL 2025
arXiv: 2501.12835
Code: https://github.com/s-nlp/AdaRAGUE
Area: Other
Keywords: Adaptive Retrieval, RAG, Uncertainty Estimation, Self-knowledge, Question Answering System
TL;DR
This work evaluates 35 adaptive retrieval methods (8 state-of-the-art adaptive retrieval pipelines and 27 uncertainty estimation methods) and finds that classic uncertainty estimation techniques often outperform complex, specialized pipelines in efficiency and self-knowledge while maintaining comparable QA performance.
Background & Motivation
Retrieval-Augmented Generation (RAG) can improve the accuracy of LLM responses and alleviate hallucinations, but it faces two key challenges:
- Retrieval is not always necessary: When the LLM already possesses the relevant knowledge, retrieval may introduce irrelevant information, leading to error propagation and external hallucinations.
- High computational cost: Each retrieval call requires invoking the retriever and performing additional LM inference.
To address this, adaptive retrieval methods have emerged, which leverage the LLM's self-knowledge (i.e., the model's ability to recognize its own knowledge boundaries) to decide when retrieval is needed. However, existing research has three blind spots:
- Neglected efficiency evaluation: Prior work counts only retrieval calls and does not track the number of LM calls, which can be more expensive.
- No comparison with uncertainty estimation methods: Mature methods such as Mean Entropy have never been systematically evaluated in adaptive retrieval scenarios.
- No evaluation of self-knowledge capability: Prior work does not measure how well a method itself predicts whether retrieval is needed.
This paper bridges these gaps through a comprehensive and unified evaluation framework.
Method
Overall Architecture
A unified evaluation of 35 methods across 6 datasets is conducted, using 10 metrics covering three dimensions: QA performance, self-knowledge, and efficiency. Uncertainty estimation methods are integrated into the adaptive retrieval framework of AdaptiveRAG to ensure comparability.
Key Designs
- End-to-end adaptive retrieval methods (8 methods):
  - IRCoT: Dynamically adds retrieved passages during Chain-of-Thought reasoning until the answer is generated.
  - AdaptiveRAG: Uses a T5-large classifier to predict three outcomes (no retrieval/single retrieval/multiple retrieval).
  - FLARE: Triggers retrieval and regenerates responses when token probabilities fall below a threshold.
  - DRAGIN: Similar to FLARE but filters stop words and reconstructs queries using attention weights.
  - Rowen: Determines whether to retrieve based on consistency (cross-lingual/cross-model).
  - SeaKR: Monitors internal states using an uncertainty module to trigger retrieval and rerank candidate passages.
- Uncertainty estimation methods (27 methods, highlighting 5):
  - Lexical Similarity: Based on the average similarity among sampled responses.
  - Max/Mean Entropy: Computes the entropy for each token and aggregates them using maximum/mean values.
  - SAR: Entropy aggregation weighted by token relevance.
  - EigValLaplacian: Constructs a weighted graph of sampled responses and computes the sum of Laplacian eigenvalues.
  - Uncertainty scores are converted into retrieval decisions using a classifier.
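As a rough illustration of the entropy-based scores listed above, the sketch below computes per-token Shannon entropy and aggregates it with the mean or the maximum. It assumes the next-token probability distributions are already available as plain lists; the function names and the toy distributions are hypothetical and are not taken from the paper's code.

```python
import math
from typing import List, Sequence


def token_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of a single next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def mean_entropy(step_probs: List[Sequence[float]]) -> float:
    """Average the per-token entropies over the generated sequence."""
    if not step_probs:
        raise ValueError("need at least one generation step")
    entropies = [token_entropy(p) for p in step_probs]
    return sum(entropies) / len(entropies)


def max_entropy(step_probs: List[Sequence[float]]) -> float:
    """Take the largest per-token entropy over the generated sequence."""
    if not step_probs:
        raise ValueError("need at least one generation step")
    return max(token_entropy(p) for p in step_probs)


# Toy distributions over a 4-token vocabulary (hypothetical values).
confident = [[0.97, 0.01, 0.01, 0.01], [0.90, 0.05, 0.03, 0.02]]
uncertain = [[0.25, 0.25, 0.25, 0.25], [0.90, 0.05, 0.03, 0.02]]

print("mean entropy:", mean_entropy(confident), mean_entropy(uncertain))
print("max entropy: ", max_entropy(confident), max_entropy(uncertain))
```

A higher score signals a less certain generation, which is the signal a downstream classifier would turn into a retrieve-or-not decision.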
- Evaluation framework:
  - QA metrics: In-Accuracy (InAcc), Exact Match (EM), F1.
  - Efficiency metrics: Retrieval Calls (RC), LM Calls (LMC).
  - Self-knowledge metrics: ROC-AUC, Spearman correlation coefficient, accuracy, overconfidence rate, underconfidence rate.
  - Cross-dataset ranking aggregation: Uses reciprocal rank fusion for fair comparison.
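Reciprocal rank fusion can be sketched as follows. Each method is ranked within every dataset, and its fused score is the sum of 1 / (k + rank) across datasets. The constant k = 60 is the conventional default and is an assumption here, since this summary does not state the value the paper used. The inputs are the NQ and HotPot InAcc figures from the results table in this summary, so the output is only a two-dataset toy illustration and not the paper's aggregate ranking.

```python
from typing import Dict, List


def rank_methods(scores: Dict[str, float]) -> Dict[str, int]:
    """Rank methods within one dataset; rank 1 is the highest score."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {method: rank for rank, method in enumerate(ordered, start=1)}


def reciprocal_rank_fusion(
    per_dataset_scores: List[Dict[str, float]], k: int = 60
) -> Dict[str, float]:
    """Sum 1 / (k + rank) over datasets; k = 60 is an assumed default."""
    fused: Dict[str, float] = {}
    for scores in per_dataset_scores:
        for method, rank in rank_methods(scores).items():
            fused[method] = fused.get(method, 0.0) + 1.0 / (k + rank)
    return fused


# InAcc values copied from the results table in this summary.
nq = {"IRCoT": 0.478, "DRAGIN": 0.480, "RowenHybrid": 0.494, "Mean Entropy": 0.498}
hotpot = {"IRCoT": 0.438, "DRAGIN": 0.430, "RowenHybrid": 0.354, "Mean Entropy": 0.410}

fused = reciprocal_rank_fusion([nq, hotpot])
for method in sorted(fused, key=fused.get, reverse=True):
    print(f"{method:13s} {fused[method]:.5f}")
```

Because only ranks enter the fused score, datasets with very different score ranges contribute equally, which is presumably why a rank-based aggregate suits a cross-dataset comparison.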
Loss & Training
- All methods uniformly use the LLaMA 3.1-8b-instruct model.
- BM25 + Elasticsearch is uniformly used as the retriever.
- Classifiers for uncertainty methods are trained on the training set, and the best classifier is selected to report test set results.
- Baseline methods retain their respective original settings.
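The step of training a classifier on uncertainty scores can be illustrated with the simplest possible model, a single threshold chosen on training data. This is a minimal sketch under stated assumptions: the summary does not say which classifier families the paper compared, and the function names, scores, and labels below are hypothetical (label 1 means the model answered incorrectly without retrieval, so retrieval is needed).

```python
from typing import List, Tuple


def fit_threshold(scores: List[float], needs_retrieval: List[int]) -> Tuple[float, float]:
    """Choose the cut-off that maximises training accuracy for the rule
    'retrieve if and only if score >= threshold'."""
    # Every observed score is a candidate; +inf encodes "never retrieve".
    candidates = sorted(set(scores)) + [float("inf")]
    best_t, best_acc = candidates[0], -1.0
    for t in candidates:
        preds = [int(s >= t) for s in scores]
        acc = sum(p == y for p, y in zip(preds, needs_retrieval)) / len(scores)
        if acc > best_acc:  # strict: ties keep the lower threshold
            best_t, best_acc = t, acc
    return best_t, best_acc


def should_retrieve(score: float, threshold: float) -> bool:
    """Apply the fitted rule to a new question's uncertainty score."""
    return score >= threshold


# Hypothetical training data: uncertainty scores and retrieval-needed labels.
train_scores = [0.10, 0.35, 0.40, 0.80, 0.95, 1.20]
train_labels = [0, 0, 1, 0, 1, 1]

threshold, train_acc = fit_threshold(train_scores, train_labels)
print(f"threshold={threshold:.2f}, training accuracy={train_acc:.3f}")
print("retrieve for score 0.90?", should_retrieve(0.90, threshold))
```

In this setup the threshold is fitted once on the training split and then applied unchanged to test questions, so the only per-question cost beyond the initial generation is the retrieval-augmented call when the rule fires.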
Key Experimental Results
Main Results
QA performance and efficiency (selected key methods):
| Method | NQ InAcc | NQ LMC | NQ RC | HotPot InAcc | HotPot LMC | HotPot RC |
|---|---|---|---|---|---|---|
| Never RAG | 0.446 | 1.0 | 0.00 | 0.286 | 1.0 | 0.00 |
| Always RAG | 0.496 | 1.0 | 1.00 | 0.410 | 1.0 | 1.00 |
| IRCoT | 0.478 | 2.7 | 2.70 | 0.438 | 4.4 | 4.38 |
| DRAGIN | 0.480 | 4.5 | 2.24 | 0.430 | 5.1 | 2.56 |
| RowenHybrid | 0.494 | 55.0 | 7.27 | 0.354 | 59.8 | 7.63 |
| Mean Entropy | 0.498 | 1.9 | 0.88 | 0.410 | 2.0 | 0.99 |
| Best UE | 0.512 | 1.8 | 0.81 | 0.414 | 2.0 | 0.99 |
| Ideal Oracle | 0.608 | 1.6 | 0.55 | 0.460 | 1.7 | 0.71 |
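To make the efficiency gap in the table concrete, the following sketch recomputes two simple quantities from the NQ columns: the InAcc gain over Never RAG and the LM-call ratio relative to Mean Entropy. The helper names are hypothetical, and these derived quantities are illustrative conveniences rather than metrics defined in the paper.

```python
# NQ columns copied from the table above: (InAcc, LM calls, retrieval calls).
NQ_RESULTS = {
    "Never RAG": (0.446, 1.0, 0.00),
    "Always RAG": (0.496, 1.0, 1.00),
    "IRCoT": (0.478, 2.7, 2.70),
    "DRAGIN": (0.480, 4.5, 2.24),
    "RowenHybrid": (0.494, 55.0, 7.27),
    "Mean Entropy": (0.498, 1.9, 0.88),
}


def inacc_gain(method: str, baseline: str = "Never RAG") -> float:
    """InAcc improvement of a method over the no-retrieval baseline."""
    return NQ_RESULTS[method][0] - NQ_RESULTS[baseline][0]


def lm_call_ratio(method: str, reference: str = "Mean Entropy") -> float:
    """How many times more LM calls a method uses than the reference."""
    return NQ_RESULTS[method][1] / NQ_RESULTS[reference][1]


for name in NQ_RESULTS:
    print(f"{name:13s} gain={inacc_gain(name):+.3f}  "
          f"LM calls vs Mean Entropy={lm_call_ratio(name):.1f}x")
```

On these NQ numbers, RowenHybrid uses roughly 29 times as many LM calls as Mean Entropy while reaching a slightly lower InAcc.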
Ablation Study
Self-knowledge evaluation (ROC-AUC):
| Method | NQ | SQuAD | TQA | 2Wiki | HotPot | MuSiQue |
|---|---|---|---|---|---|---|
| AdaptiveRAG | 0.54 | 0.58 | 0.49 | 0.71 | 0.62 | 0.64 |
| FLARE | 0.59 | 0.58 | 0.57 | 0.62 | 0.54 | 0.51 |
| SeaKR | 0.64 | 0.77 | 0.78 | 0.37 | 0.55 | 0.56 |
| Max Entropy | 0.73 | 0.72 | 0.72 | 0.73 | 0.66 | 0.68 |
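For readers unfamiliar with how a ROC-AUC like those in the table is obtained, the sketch below computes it as the probability that a randomly chosen "retrieval needed" question receives a higher uncertainty score than a randomly chosen "retrieval not needed" question, counting ties as one half. The choice of positive class, the function name, and the toy data are assumptions made for illustration; the summary does not spell out the paper's exact labelling.

```python
from typing import List


def roc_auc(scores: List[float], labels: List[int]) -> float:
    """Pairwise (Mann-Whitney) formulation of ROC-AUC.

    labels: 1 if the model was wrong without retrieval (assumed positive class),
            0 if it was already correct.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Hypothetical uncertainty scores and retrieval-needed labels.
uncertainty = [0.2, 0.9, 0.4, 0.7, 0.1, 0.6]
wrong_without_retrieval = [0, 1, 1, 0, 0, 1]

auc = roc_auc(uncertainty, wrong_without_retrieval)
print(f"ROC-AUC = {auc:.3f}")
```

A value of 0.5 corresponds to chance-level ranking, which puts table entries near or below 0.5 in context.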
Key Findings
- Uncertainty estimation methods are far more efficient: Each question requires only ~2 LM calls and fewer than 1 retrieval call on average, whereas RowenHybrid requires 55-80 LM calls.
- Uncertainty methods achieve comparable QA performance to complex pipelines: Best UE even outperforms most end-to-end methods on single-hop datasets.
- Uncertainty methods are generally stronger in self-knowledge capability: Among the methods in the table above, Max Entropy has the highest ROC-AUC on four of six datasets (NQ, 2Wiki, HotPot, MuSiQue), while SeaKR scores higher on SQuAD and TQA.
- No single method dominates across all dimensions: Efficiency \(\neq\) Performance \(\neq\) Self-knowledge; selection should depend on the application scenario.
- A clear gap remains between current methods and optimal performance: The Ideal Oracle demonstrates substantial room for improvement.
Highlights & Insights
- Significant framework contribution: This study systematically introduces 27 uncertainty estimation methods to the adaptive retrieval scenario for the first time, filling an important comparative gap.
- Challenging the "complex is better" assumption: The simple Mean Entropy method outperforms carefully designed multi-step pipelines in many scenarios.
- LM calls as an overlooked critical cost: The Rowen series requires 30-80 LM calls per question, which is cost-prohibitive when using commercial APIs.
- Multi-dimensional evaluation perspective: It unifies three dimensions (QA performance, efficiency, and self-knowledge capability) in a single evaluation for the first time, providing a comprehensive reference for researchers.
- OOD Analysis: It further evaluates the performance of uncertainty methods under out-of-distribution scenarios, adding practical value.
Limitations & Future Work
- Only a single model (LLaMA 3.1-8b-instruct) is evaluated; different models may exhibit different optimal methods.
- Only BM25 is utilized as the retriever; stronger retrievers (such as dense retrievers) might alter the relative ranking of methods.
- Uncertainty methods require training a classifier to determine thresholds, introducing additional data and hyperparameter tuning demands.
- The definition of self-knowledge capacity relies on binary In-Accuracy labels, which may be too coarse.
- The quality of retrieved documents is not factored into final answer evaluation; poor retrieval might cause "Always RAG" to be underestimated.
Related Work & Insights
- Unifies two previously isolated research domains: Adaptive RAG (Su et al., 2024; Jeong et al., 2024) and Uncertainty Estimation (Fadeeva et al., 2023; Duan et al., 2023).
- Provides a practical evaluation framework for LLM self-knowledge capacity research (Yin et al., 2023, 2024).
- Practical takeaway: In resource-constrained scenarios, instead of building complex adaptive retrieval pipelines, it is often better to directly employ simple methods like Mean Entropy.
Rating
- Novelty: ⭐⭐⭐ The method itself has limited innovation; the core contribution lies in the comprehensive comparative framework and counterintuitive findings.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ 35 methods, 6 datasets, 10 metrics, OOD analysis, and classifier complexity analysis.
- Writing Quality: ⭐⭐⭐⭐ Well-organized paper with highly intuitive multi-dimensional visualization in Figure 1.
- Value: ⭐⭐⭐⭐⭐ Highly valuable systematic evaluation for the adaptive retrieval field; the findings have direct practical implications.