Do Retrieval Augmented Language Models Know When They Don't Know?¶
Conference: AAAI 2026 · arXiv: 2509.01476 · Code: GitHub · Area: Information Retrieval · Keywords: RAG, Calibration, Over-Refusal, Uncertainty Estimation, Retrieval Augmentation
TL;DR¶
This paper systematically analyzes refusal calibration in retrieval-augmented language models (RALMs). It finds that RALMs over-refuse on more than 55% of queries when all retrieved documents are irrelevant, even when the model's internal knowledge suffices to answer, and it proposes a mechanism combining uncertainty estimation with refusal-aware fine-tuning to balance refusal behavior and answer quality.
Background & Motivation¶
- Background: RAG systems augment LLMs by retrieving external documents, while refusal post-training teaches models to proactively decline answering under uncertainty. Both techniques are widely adopted but have rarely been studied jointly.
- Limitations of Prior Work: When all retrieved documents are irrelevant, RALMs with refusal training tend to reject all queries—even when the model's parametric knowledge (internal knowledge) is sufficient for a correct answer. This "over-refusal" phenomenon has been largely overlooked.
- Key Challenge: Improving refusal rates does not equate to improving calibration quality. Existing refusal training methods (e.g., R-tuning) may in fact exacerbate over-refusal. The fundamental issue is a calibration imbalance between the model's internal knowledge and externally retrieved knowledge.
- Key Insight: The paper categorizes RALM knowledge sources into four quadrants (internal knowledge present/absent × external knowledge present/absent) and systematically investigates refusal behavior and uncertainty calibration across different knowledge states.
Method¶
Overall Architecture¶
The study is organized around three progressive research questions: (RQ1) How well are RALMs calibrated under different knowledge states? (RQ2) What is the relationship between refusal ability and calibration quality? (RQ3) Can uncertainty estimation be leveraged to mitigate over-refusal?
Key Designs¶
- Four-Quadrant Knowledge State Analysis (RQ1)
  - Each question is classified into one of four knowledge states: highly known, maybe known, weakly known, or unknown
  - Classification is based on consistency between temperature-sampled and greedy-decoded outputs
  - Refusal behavior under different retrieval configurations (0 positive/10 negative, 1 positive/9 negative, 5 positive/5 negative, etc.) is analyzed for each knowledge state
  - Finding: Even highly known questions are subject to over-refusal under fully negative retrieval conditions
- Comparative Analysis of Refusal Training Methods (RQ2)
  - R-tuning: Detects questions the model cannot answer and trains it to output "I don't know"—found to exacerbate over-refusal
  - In-Context Fine-Tuning (ICFT): Inserts both positive and negative contexts into fine-tuning data simultaneously, with training targets determined by the knowledge quadrant—effectively mitigates over-refusal
  - Key finding: ICFT improves refusal behavior but does not necessarily improve calibration or accuracy, due to changes in robustness and context faithfulness
- Uncertainty-Based Refusal Mechanism (RQ3)
  - Uses uncertainty estimates and their changes to infer the RALM's knowledge state
  - Based on the inferred state, decides to: answer using retrieved context / answer without retrieved context / refuse to answer
  - Three categories of uncertainty estimation methods: Verbalization-based (model self-reported confidence, 4 prompt variants), Consistency-based (Agreement/Entropy/FSD metrics), and Similarity Matrix-based (eigenvalue/degree measures)
- Experimental Rigor
  - High-quality negative (irrelevant) retrieval samples are constructed via Milvus hybrid search with re-ranking
  - Strict answer evaluation: LLM-as-judge + exact match + refusal keyword filtering
  - Qwen evaluated on Chinese datasets (CRUD, RGB_zh); LLaMA evaluated on English datasets (NQ, RGB_en)
  - Generation temperature of 0.5 with 16 sampling runs
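The consistency-based classification above can be sketched as follows. This is an illustrative reading, not the paper's exact definition: the function name and the thresholds separating the four states are assumptions, though the 16 samples at temperature 0.5 match the paper's setup.

```python
def classify_knowledge_state(greedy_answer: str, sampled_answers: list[str],
                             gold: str) -> str:
    """Assign a question to one of four knowledge states from the agreement
    between greedy decoding and temperature sampling.
    Thresholds here are illustrative, not the paper's precise cutoffs."""
    greedy_ok = greedy_answer == gold
    sample_acc = sum(a == gold for a in sampled_answers) / len(sampled_answers)
    if greedy_ok and sample_acc == 1.0:
        return "highly_known"   # correct under greedy and every sample
    if greedy_ok and sample_acc > 0.0:
        return "maybe_known"    # greedy correct, sampling sometimes wrong
    if sample_acc > 0.0:
        return "weakly_known"   # correct only occasionally under sampling
    return "unknown"            # never correct

# e.g., 16 samples at temperature 0.5, as in the paper's setup
samples = ["Paris"] * 12 + ["Lyon"] * 4
state = classify_knowledge_state("Paris", samples, "Paris")  # → "maybe_known"
```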
Loss & Training¶
- R-tuning: Two-stage pipeline—first detecting questions outside the model's knowledge boundary, then fine-tuning with "should refuse" labels
- ICFT: Both positive and negative contexts are inserted into each training sample simultaneously; training targets are set as correct answers or refusal expressions based on the RALM's knowledge quadrant
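The ICFT sample construction can be sketched roughly as below. The target rule (answer when a positive context is present or when internal knowledge suffices, refuse otherwise) is a plausible reading of the description above, not the paper's exact recipe; the function name and prompt template are assumptions.

```python
REFUSAL = "I don't know."

def build_icft_example(question: str, answer: str, state: str,
                       pos_ctxs: list[str], neg_ctxs: list[str]) -> dict:
    """Sketch of an ICFT training sample: positive and negative contexts are
    inserted together, and the target depends on the knowledge quadrant.
    Illustrative only -- not the paper's exact construction."""
    contexts = pos_ctxs + neg_ctxs
    # Answer when a positive context is available or internal knowledge
    # is reliable; otherwise train the model to refuse.
    answerable = bool(pos_ctxs) or state in ("highly_known", "maybe_known")
    return {
        "prompt": "\n".join(contexts) + f"\nQuestion: {question}",
        "target": answer if answerable else REFUSAL,
    }

# Fully negative retrieval + unknown internal knowledge -> refusal target
ex = build_icft_example("Capital of France?", "Paris", "unknown",
                        pos_ctxs=[], neg_ctxs=["Berlin is the capital of Germany."])
```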
Key Experimental Results¶
Main Results (RQ1 Calibration Analysis, Lower Brier Score Is Better)¶
| Uncertainty Estimation Method | RGB_en (0 pos/10 neg) | RGB_en (5 pos/5 neg) | RGB_zh (0 pos/10 neg) |
|---|---|---|---|
| Verb-1s-1 | 0.139 | 0.023 | 0.441 |
| Entropy | 0.305 | 0.009 | 0.256 |
| Agree | 0.192 | 0.010 | 0.261 |
| Eigv (Similarity Matrix) | 0.232 | 0.271 | 0.282 |
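For reference, the Brier score reported above is the mean squared error between an uncertainty method's predicted confidence and the 0/1 correctness outcome, so 0 is perfect calibration. A minimal sketch:

```python
def brier_score(confidences: list[float], outcomes: list[int]) -> float:
    """Mean squared difference between predicted confidence (0..1) and the
    binary correctness outcome; lower is better, 0 is perfect calibration."""
    assert len(confidences) == len(outcomes)
    return sum((c - o) ** 2 for c, o in zip(confidences, outcomes)) / len(confidences)

# High confidence on correct answers, low on a wrong one -> small score
score = brier_score([0.9, 0.8, 0.2], [1, 1, 0])  # → 0.03
```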
Ablation Study (RQ2 Refusal Training Effectiveness)¶
| Method | Over-Refusal Rate | Overall Accuracy | Calibration Quality |
|---|---|---|---|
| Original RALM | >55% | Baseline | Baseline |
| R-tuning | Worsened | Degraded | Not improved |
| ICFT | Significantly reduced | Maintained/slightly improved | Not necessarily improved |
| ICFT + Uncertainty Refusal | Best | Significantly improved | Improved |
Key Findings¶
- Over-refusal rate exceeds 55% under fully irrelevant retrieval: models possess sufficient internal knowledge to answer correctly, yet refuse because the retrieved documents are poor
- R-tuning is counterproductive: Refusal training causes models to refuse in more cases, exacerbating over-refusal
- ICFT is effective but limited: It mitigates over-refusal but does not automatically improve calibration quality, suggesting that refusal behavior and calibration are two independent dimensions
- Consistency-based methods achieve the best calibration when positive documents are present (Brier = 0.009), but degrade significantly under fully negative retrieval settings
- Changes in uncertainty (the difference between with/without retrieval) serve as effective indicators of knowledge state
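The uncertainty-change indicator above suggests a simple decision rule: compare uncertainty with and without retrieval, then answer with context, answer from parametric knowledge, or refuse. The sketch below is an illustration of that idea; the thresholds `tau` and `delta` are assumed static values (the paper likewise uses static thresholds), and the function name is ours.

```python
def refusal_decision(u_with: float, u_without: float,
                     tau: float = 0.5, delta: float = 0.1) -> str:
    """Choose a response strategy from uncertainty measured with and without
    retrieved context. Thresholds are illustrative static values."""
    if u_with <= tau and u_with <= u_without - delta:
        return "answer_with_context"     # retrieval clearly reduces uncertainty
    if u_without <= tau:
        return "answer_without_context"  # internal knowledge suffices
    return "refuse"                      # uncertain either way

# Fully negative retrieval inflates u_with, but low u_without shows the
# model knows the answer -- avoiding an over-refusal.
decision = refusal_decision(u_with=0.8, u_without=0.3)  # → "answer_without_context"
```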
Highlights & Insights¶
- Systematic characterization of "over-refusal": This paper is among the first to situate RAG refusal ability within an uncertainty calibration framework, revealing a widely overlooked but practically significant problem
- Key insight that refusal ≠ calibration: Improving a model's refusal behavior does not equate to improving its calibration quality; the two dimensions require independent optimization
- Practical utility of the four-quadrant knowledge state framework: Provides a clear analytical structure for dynamic strategy selection in RAG systems
Limitations & Future Work¶
- Experiments are primarily based on simple factual QA (single-hop); over-refusal patterns on complex reasoning tasks may differ
- The choice of uncertainty estimation method substantially impacts results, and no mechanism for automatically selecting the optimal method is provided
- Dynamic retrieval still relies on static thresholds; adaptive threshold mechanisms warrant further exploration
- Experiments are limited to models of approximately 7B parameters; over-refusal behavior in larger models remains to be verified
Related Work & Insights¶
- vs. Dynamic RAG methods (Self-RAG, FLARE, etc.): These methods determine when to retrieve based on uncertainty, but assume the model is well-calibrated; this paper demonstrates that assumption does not hold
- vs. LLM knowledge boundary research (Li et al. 2025): Prior work focuses primarily on internal knowledge; this paper extends the analysis to the joint interplay of internal and external knowledge
Rating¶
- Novelty: ⭐⭐⭐⭐ Systematic characterization of over-refusal and the four-quadrant analysis framework constitute original contributions
- Experimental Thoroughness: ⭐⭐⭐⭐ Three progressive RQs, multiple UE methods, bilingual evaluation (Chinese and English), and diverse retrieval configurations
- Writing Quality: ⭐⭐⭐⭐ Problem definition is clear; research questions are developed with strong logical progression
- Value: ⭐⭐⭐⭐ Offers direct guidance for improving RAG system reliability and designing intelligent refusal strategies