Reliable Decision-Making via Calibration-Oriented Retrieval-Augmented Generation¶
Conference: NeurIPS 2025 · arXiv: 2411.08891 · Code: To be confirmed · Area: Information Retrieval · Keywords: RAG, calibration, decision-making, retrieval augmented generation, confidence estimation
TL;DR¶
This paper proposes CalibRAG, a framework that trains a temperature-conditioned forecasting function to re-rank retrieved documents by the probability that they lead the user to a correct decision, improving both confidence calibration and downstream accuracy in RAG-assisted decision-making.
Background & Motivation¶
- LLMs are increasingly used to assist human decision-making, yet they frequently produce incorrect information with high confidence (hallucination), leading users to make suboptimal decisions.
- Studies show that users tend to over-rely on LLM outputs, and the degree of reliance is proportional to the model's expressed confidence.
- RAG mitigates hallucination by incorporating external documents, but RAG retrievers may return irrelevant documents, and LLMs tend to be overconfident about retrieved content.
- Existing RAG methods focus solely on retrieval relevance, without considering whether the resulting user decisions are well-calibrated.
- Traditional temperature scaling is not directly applicable to long-form text generation, where there is no single categorical output distribution to rescale.
- Prior decision-calibration methods require fine-tuning three LLMs via PPO, which is costly and unstable.
Method¶
Overall Architecture¶
The core idea of CalibRAG is to train a forecasting function \(f(t, q, d)\) that predicts "the probability that a user makes a correct decision given temperature \(t\), query \(q\), and retrieved document \(d\)." At inference time, this function re-ranks retrieved documents and selects those most likely to lead to a correct decision.
The four-stage inference pipeline is as follows:

1. Stage 1 – Initial Retrieval: Given query \(q^*\), retrieve the Top-K candidate documents.
2. Stage 2 – Scoring and Selection: Score and re-rank each document \(d_i^*\) using \(f(t, q^*, d_i^*)\).
3. Stage 3 – Query Reformulation (Optional): If the highest confidence score falls below the threshold \(\epsilon = 0.5\), reformulate the query and retrieve again.
4. Stage 4 – Final Decision: Generate guidance and a confidence score for the user to make a decision.
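A minimal sketch of this inference loop, where `retrieve`, `forecast`, `reformulate`, and `generate_guidance` are hypothetical placeholders for the paper's retriever, forecasting function \(f\), query reformulator, and guidance generator:

```python
EPSILON = 0.5  # confidence threshold for triggering query reformulation

def calibrag_inference(query, retrieve, forecast, reformulate,
                       generate_guidance, t=1.0, k=20, max_retries=1):
    """Four-stage CalibRAG-style loop: retrieve, re-rank by the forecasting
    function, optionally reformulate, then emit guidance plus confidence."""
    for _ in range(max_retries + 1):
        docs = retrieve(query, k)                              # Stage 1
        docs = sorted(docs, key=lambda d: forecast(t, query, d),
                      reverse=True)                            # Stage 2
        best = docs[0]
        if forecast(t, query, best) >= EPSILON:
            break
        query = reformulate(query)                             # Stage 3
    # Stage 4: guidance and the forecasted decision-correctness probability
    return generate_guidance(query, best), forecast(t, query, best)
```

The confidence returned alongside the guidance is exactly the forecaster's score for the selected document, which is what keeps the user-facing confidence tied to the calibrated prediction.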
Key Designs¶
Forecasting Function Modeling:
A frozen LLM \(\mathcal{M}\) serves as the feature extractor \(f_{\text{feat}}\). The temperature parameter is encoded using Fourier positional encoding:

\[\gamma(t) = \big[\sin(\omega_0 t), \cos(\omega_0 t), \ldots, \sin(\omega_{N-1} t), \cos(\omega_{N-1} t)\big],\]

where \(\omega_n = 2^n \cdot \frac{2\pi}{t_{\max} - t_{\min}}\). The final model passes the frozen features concatenated with the temperature encoding through a prediction head \(h\) with a sigmoid output:

\[f(t, q, d) = \sigma\big(h\big([f_{\text{feat}}(q, d);\, \gamma(t)]\big)\big)\]
Only the LoRA adapters and a lightweight prediction head are trained; the LLM backbone remains frozen.
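The Fourier temperature encoding can be sketched as follows; the number of frequency bands and the \([t_{\min}, t_{\max}]\) range are illustrative assumptions, not values from the paper:

```python
import numpy as np

def fourier_encode(t, t_min=0.0, t_max=2.0, n_bands=4):
    """Encode a scalar temperature t as [sin(w_n t) ..., cos(w_n t) ...]
    with frequencies w_n = 2^n * 2*pi / (t_max - t_min)."""
    n = np.arange(n_bands)
    omega = (2.0 ** n) * (2.0 * np.pi / (t_max - t_min))
    return np.concatenate([np.sin(omega * t), np.cos(omega * t)])

# The resulting vector would be concatenated with the frozen-LLM feature
# vector f_feat(q, d) and fed to the trained prediction head.
```

The geometric frequency schedule lets the head resolve both coarse and fine temperature differences from a single scalar input.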
Synthetic Supervision Data Generation:
- \((x, y)\) pairs are extracted from TriviaQA, SQuAD2.0, and WikiQA.
- Top-20 documents (rather than only Top-1) are retrieved per query, for two reasons: (1) lower-ranked documents may also contribute to correct decisions; (2) this avoids biasing the training data toward negative samples.
- A proxy user model \(U\) is used to sample \(R=10\) responses at varying temperatures \(t\).
- Soft labels \(b \in [0,1]\) represent the proportion of correct responses.
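The soft-label construction above can be sketched as a short sampling loop, with `user_model` and `is_correct` as hypothetical stand-ins for the proxy user \(U\) and the correctness judge:

```python
def soft_label(query, doc, answer, user_model, is_correct, t=1.0, R=10):
    """Sample R responses from the proxy user at temperature t and return
    the fraction judged correct, i.e. the soft label b in [0, 1]."""
    responses = [user_model(query, doc, temperature=t) for _ in range(R)]
    return sum(is_correct(r, answer) for r in responses) / R
```

Sampling at varying temperatures is what produces the \((t, q, d, b)\) tuples that the temperature-conditioned forecaster is trained on.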
Loss & Training¶
A log-likelihood loss (the logarithmic score, a strictly proper scoring rule) is used to fit the soft labels:

\[\mathcal{L} = -\,\mathbb{E}\big[\, b \log f(t, q, d) + (1 - b) \log\big(1 - f(t, q, d)\big) \,\big]\]

Because the logarithmic score is strictly proper, the unique minimizer of this loss is the true correctness probability \(p\), thereby guaranteeing calibration convergence.
A multi-class variant, CalibRAG-multi, is also explored, which discretizes the correctness distribution into 11 histogram bins (0–10 correct responses out of \(R = 10\)).
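Minimal sketches of the two objectives, under the assumption that the binary form is a soft-label cross-entropy and the multi-class form a cross-entropy over the 11 histogram bins:

```python
import numpy as np

def binary_log_loss(b, p, eps=1e-12):
    """Negative logarithmic score for soft label b and predicted
    probability p (soft-label binary cross-entropy)."""
    p = np.clip(p, eps, 1.0 - eps)
    return -(b * np.log(p) + (1 - b) * np.log(1 - p))

def multi_class_loss(b, logits, R=10):
    """Cross-entropy against the histogram bin index round(b * R),
    i.e. the CalibRAG-multi target over R + 1 bins."""
    target = int(round(b * R))                            # bin 0..R
    log_probs = logits - np.log(np.sum(np.exp(logits)))   # log-softmax
    return -log_probs[target]
```

The strict propriety of the binary loss is easy to check numerically: for a soft label \(b\), predicting \(p = b\) gives a strictly lower loss than any other \(p\).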
Key Experimental Results¶
Main Results: General Domain (NQ, WebQA)¶
Llama-3.1-8B is used as both the RAG model and the decision model, with BM25 and Contriever as retrievers.
| Method | NQ ECE↓ (BM25) | WebQA ECE↓ (BM25) |
|---|---|---|
| CT-probe | ~0.35 | ~0.38 |
| Number-LoRA | ~0.30 | ~0.33 |
| CalibRAG | ~0.15 | ~0.18 |
| CalibRAG-multi | ~0.14 | ~0.17 |
CalibRAG outperforms all baselines across all four reported metrics (1−AUROC, 1−ACC, ECE, BS; all lower-is-better).
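For reference, the two calibration metrics in these tables can be computed as follows; equal-width confidence binning for ECE is an assumption (bin counts vary across papers):

```python
import numpy as np

def brier_score(conf, correct):
    """Brier score: mean squared gap between confidence and correctness."""
    conf = np.asarray(conf, dtype=float)
    correct = np.asarray(correct, dtype=float)
    return np.mean((conf - correct) ** 2)

def ece(conf, correct, n_bins=10):
    """Expected Calibration Error with equal-width confidence bins:
    weighted mean |avg confidence - avg accuracy| per non-empty bin."""
    conf = np.asarray(conf, dtype=float)
    correct = np.asarray(correct, dtype=float)
    bins = np.minimum((conf * n_bins).astype(int), n_bins - 1)
    err = 0.0
    for k in range(n_bins):
        mask = bins == k
        if mask.any():
            err += mask.mean() * abs(conf[mask].mean() - correct[mask].mean())
    return err
```

Both metrics reward confidence estimates that match empirical correctness rates, which is exactly the property the forecasting function is trained to provide.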
Medical Domain (MedCPT Retriever)¶
| Metric | BioASQ-Y/N | MMLU-Med | PubMedQA |
|---|---|---|---|
| CalibRAG ECE↓ | Best | Best | Best |
| CalibRAG ACC↑ | Best | Best | Best |
CalibRAG is trained on general-domain data yet achieves state-of-the-art performance on medical benchmarks with unseen retrievers and out-of-distribution datasets.
Comparison with Re-ranking / Robust RAG Methods¶
| Dataset | Method | AUROC↑ | ACC↑ | ECE↓ | BS↓ |
|---|---|---|---|---|---|
| HotpotQA | Cross-encoder | 60.74 | 34.98 | 0.477 | 0.477 |
| HotpotQA | LLM-rerank | 60.57 | 38.52 | 0.248 | 0.297 |
| HotpotQA | CalibRAG | 72.47 | 42.37 | 0.106 | 0.206 |
| NQ | SelfRAG | 48.4 | 36.2 | 0.522 | 0.545 |
| NQ | CalibRAG | 63.5 | 37.4 | 0.258 | 0.287 |
Ablation Study¶
- Temperature conditioning: Removing temperature conditioning leads to a significant increase in ECE, particularly under high-temperature sampling, validating the necessity of temperature modeling.
- Number of retrieved documents: Performance is optimal at \(K=20\); increasing to 40 yields diminishing returns.
- Query reformulation: Stage 3 consistently improves performance across all settings, albeit at additional computational cost.
Key Findings¶
- The Top-1 retrieved document is not always optimal—lower-ranked documents can sometimes yield better decisions.
- While adding retrieved documents improves accuracy, it also increases ECE (overconfidence), necessitating additional calibration.
- Although CalibRAG is primarily designed for calibration, it also improves accuracy by selecting documents more likely to lead to correct decisions.
Highlights & Insights¶
- Novel problem formulation: The objective of RAG is extended from "retrieving relevant documents" to "ensuring well-calibrated decisions," representing a distinctly new perspective.
- Elegant temperature-conditioned design: Fourier encoding is used to model variability in user behavior, enabling a single model to accommodate users with different risk preferences.
- Strong cross-domain generalization: A model trained on general-domain data can be directly applied to unseen medical retrievers and datasets.
- Lightweight solution: Only LoRA adapters and a classification head require training, avoiding unstable procedures such as PPO.
- Rigorous theoretical guarantee: The use of a strictly proper scoring rule as the loss function provides convergence guarantees for calibration.
Limitations & Future Work¶
- Synthetic data generation and forecasting function training introduce additional overhead.
- The evaluation model \(\mathcal{G}\) relies on GPT-4o-mini, which may introduce evaluation bias.
- The proxy user model \(U\) may not fully capture real human decision-making behavior.
- A gap remains between the temperature parameter \(t\) and actual user behavior.
- The effectiveness of the forecasting function across more LLM backbones has not been explored.
- The triggering mechanism for query reformulation in Stage 3 could be made more fine-grained.
Related Work & Insights¶
- Extending confidence calibration from classification tasks to long-form RAG generation is broadly instructive.
- Document re-ranking is not equivalent to calibration—re-ranking optimizes ranking metrics, whereas CalibRAG optimizes decision correctness.
- CalibRAG can be used in a complementary fashion alongside other robust RAG methods such as SelfRAG.
- The framework has direct practical value for the reliable deployment of LLMs in high-stakes decision-making scenarios (e.g., healthcare, law).
Rating¶
- Novelty: ⭐⭐⭐⭐ Introducing calibration into RAG-based decision-making is a novel perspective, though the technical approach is relatively straightforward.
- Experimental Thoroughness: ⭐⭐⭐⭐ Covers multiple datasets, retrievers, and domains with comprehensive ablation studies.
- Writing Quality: ⭐⭐⭐⭐ Problem motivation is clearly articulated and mathematical formalization is rigorous, though notation is occasionally dense.
- Value: ⭐⭐⭐⭐ Offers practical value for improving the reliability of RAG systems, particularly in high-stakes decision-making contexts.