
Reliable Decision Making via Calibration Oriented Retrieval Augmented Generation

Conference: NeurIPS 2025 arXiv: 2411.08891 Code: To be confirmed Area: Information Retrieval Keywords: RAG, calibration, decision-making, retrieval augmented generation, confidence estimation

TL;DR

This paper proposes CalibRAG, a framework that trains a temperature-conditioned forecasting function to predict whether a retrieved document will lead the user to a correct decision, and uses it to re-rank documents in RAG-assisted decision-making, improving both calibration quality and accuracy.

Background & Motivation

  • LLMs are increasingly used to assist human decision-making, yet they frequently produce incorrect information with high confidence (hallucination), leading users to make suboptimal decisions.
  • Studies show that users tend to over-rely on LLM outputs, and the degree of reliance is proportional to the model's expressed confidence.
  • RAG mitigates hallucination by incorporating external documents, but RAG retrievers may return irrelevant documents, and LLMs tend to be overconfident about retrieved content.
  • Existing RAG methods focus solely on retrieval relevance, without considering whether the resulting user decisions are well-calibrated.
  • Traditional temperature scaling is not applicable to calibration in long-form text generation.
  • Prior decision calibration methods require fine-tuning three separate LLMs with PPO, which is costly and unstable.

Method

Overall Architecture

The core idea of CalibRAG is to train a forecasting function \(f(t, q, d)\) that predicts "the probability that a user makes a correct decision given temperature \(t\), query \(q\), and retrieved document \(d\)." At inference time, this function re-ranks retrieved documents and selects those most likely to lead to a correct decision.

The four-stage inference pipeline is as follows:

  1. Stage 1 – Initial Retrieval: Given query \(q^*\), retrieve the Top-K candidate documents.
  2. Stage 2 – Scoring and Selection: Score and re-rank each document \(d_i^*\) using \(f(t, q^*, d_i^*)\).
  3. Stage 3 – Query Reformulation (Optional): If the highest confidence score falls below the threshold \(\epsilon = 0.5\), reformulate the query and retrieve again.
  4. Stage 4 – Final Decision: Generate guidance and a confidence score for the user to make a decision.
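The control flow of the four stages can be sketched as below. All component functions here (`retrieve`, `forecast`, `reformulate`, `generate_guidance`) and the parameter names are hypothetical stand-ins; only the loop structure follows the pipeline described in the paper.

```python
def calibrag_infer(query, retrieve, forecast, reformulate, generate_guidance,
                   k=20, t=1.0, eps=0.5, max_reformulations=1):
    for attempt in range(max_reformulations + 1):
        docs = retrieve(query, top_k=k)                  # Stage 1: initial retrieval
        # Stage 2: re-rank candidates by predicted decision-correctness probability.
        docs = sorted(docs, key=lambda d: forecast(t, query, d), reverse=True)
        best, best_score = docs[0], forecast(t, query, docs[0])
        if best_score >= eps or attempt == max_reformulations:
            break
        query = reformulate(query)                       # Stage 3: reformulate and retry
    guidance = generate_guidance(query, best)            # Stage 4: guidance + confidence
    return guidance, best_score
```

Note that the confidence returned to the user is the forecasting function's score for the selected document, not the LLM's own verbalized confidence.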

Key Designs

Forecasting Function Modeling:

A frozen LLM \(\mathcal{M}\) serves as the feature extractor \(f_{\text{feat}}\). The temperature parameter is encoded using Fourier positional encoding:

\[\text{PE}(t) = [\sin(\omega_1 t), \cos(\omega_1 t), \ldots, \sin(\omega_N t), \cos(\omega_N t)]\]

where \(\omega_n = 2^n \cdot \frac{2\pi}{t_{\max} - t_{\min}}\). The final model is:

\[f(t, q, d) = \sigma\left(W_{\text{head}}^\top \left(f_{\text{feat}}(\text{concat}[q, d]; W_{\text{LoRA}}) + W_p \cdot \text{PE}(t)\right) + b_{\text{head}}\right)\]

Only the LoRA adapters and a lightweight prediction head are trained; the LLM backbone remains frozen.
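A minimal numerical sketch of the forecasting head is shown below. The frozen-LLM feature of \(\text{concat}[q, d]\) is abstracted as a fixed vector `feat`; the dimensions, number of frequencies, and the \(t_{\min}/t_{\max}\) range are illustrative assumptions, not values from the paper.

```python
import numpy as np

def fourier_pe(t, n_freqs=4, t_min=0.0, t_max=2.0):
    # PE(t) = [sin(w_1 t), cos(w_1 t), ..., sin(w_N t), cos(w_N t)],
    # with w_n = 2^n * 2*pi / (t_max - t_min).
    w = 2.0 ** np.arange(1, n_freqs + 1) * 2 * np.pi / (t_max - t_min)
    return np.stack([np.sin(w * t), np.cos(w * t)], axis=1).ravel()

def forecast(feat, t, W_p, W_head, b_head):
    # f(t, q, d) = sigmoid(W_head^T (feat + W_p @ PE(t)) + b_head),
    # where feat stands in for f_feat(concat[q, d]; W_LoRA).
    z = feat + W_p @ fourier_pe(t)
    return float(1.0 / (1.0 + np.exp(-(W_head @ z + b_head))))
```

In training, gradients would flow only into the LoRA weights inside the feature extractor and into `W_p`, `W_head`, `b_head`.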

Synthetic Supervision Data Generation:

  • \((x, y)\) pairs are extracted from TriviaQA, SQuAD2.0, and WikiQA.
  • Top-20 documents (rather than only Top-1) are retrieved per query, for two reasons: (1) lower-ranked documents may also contribute to correct decisions; (2) this avoids biasing the training data toward negative samples.
  • A proxy user model \(U\) is used to sample \(R=10\) responses at varying temperatures \(t\).
  • Soft labels \(b \in [0,1]\) represent the proportion of correct responses.
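The label-construction step above reduces to a short sampling loop. `user_model` and `judge` below are hypothetical stand-ins for the proxy user model \(U\) and the correctness check against the reference answer \(y\).

```python
def soft_label(user_model, judge, query, doc, t, R=10):
    # Sample R responses from the proxy user at temperature t and
    # return the fraction judged correct, i.e. the soft label b in [0, 1].
    responses = [user_model(query, doc, temperature=t) for _ in range(R)]
    return sum(judge(r) for r in responses) / R
```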

Loss & Training

A binary cross-entropy (negative log-likelihood) loss is used:

\[\mathcal{L} = -\frac{1}{|\mathcal{S}|} \sum_{(t,q,d,b) \in \mathcal{S}} \left[b \log f(t,q,d) + (1-b)\log(1-f(t,q,d))\right]\]

The underlying logarithmic score is a strictly proper scoring rule: its expectation under the true correctness distribution is uniquely minimized when \(f(t,q,d)\) equals the true probability \(p\), so the optimum of this loss is a calibrated forecaster.
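A minimal sketch of the objective, assuming predictions and soft labels have been gathered into arrays (the clipping constant is a standard numerical safeguard, not from the paper):

```python
import numpy as np

def calib_loss(preds, labels, eps=1e-7):
    # Binary cross-entropy over (t, q, d, b) tuples:
    # preds[i] = f(t, q, d) and labels[i] = b for the i-th training tuple.
    p = np.clip(np.asarray(preds, dtype=float), eps, 1 - eps)
    b = np.asarray(labels, dtype=float)
    return float(np.mean(-(b * np.log(p) + (1 - b) * np.log(1 - p))))
```

Because soft labels \(b\) are fractions rather than hard 0/1 targets, the loss is minimized in expectation when the model outputs the true correctness rate rather than an overconfident extreme.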

A multi-class variant, CalibRAG-multi, is also explored; it discretizes the number of correct responses (out of \(R = 10\)) into 11 histogram bins (0–10).
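A hypothetical sketch of the multi-class labels: with \(R = 10\) sampled responses, the correct-response count falls into one of 11 bins, and a scalar confidence can be recovered from a predicted bin distribution by taking its expectation.

```python
import numpy as np

def histogram_label(b, R=10):
    # Map a soft label b = (#correct)/R to a one-hot vector over R+1 bins.
    one_hot = np.zeros(R + 1)
    one_hot[round(b * R)] = 1.0
    return one_hot

def expected_confidence(probs, R=10):
    # Collapse a predicted bin distribution back to a scalar confidence.
    return float(np.dot(probs, np.arange(R + 1) / R))
```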

Key Experimental Results

Main Results: General Domain (NQ, WebQA)

Llama-3.1-8B is used as both the RAG model and the decision model, with BM25 and Contriever as retrievers.

Method          Metric  NQ (BM25)  WebQA (BM25)
CT-probe        ECE↓    ~0.35      ~0.38
Number-LoRA     ECE↓    ~0.30      ~0.33
CalibRAG        ECE↓    ~0.15      ~0.18
CalibRAG-multi  ECE↓    ~0.14      ~0.17

CalibRAG outperforms all baselines across all reported metrics (AUROC, accuracy, ECE, and Brier score).
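For reference, the headline calibration metric in these tables, ECE, can be computed with a standard equal-width-bin sketch (bin count and binning scheme are conventional choices, not specified here from the paper):

```python
import numpy as np

def ece(confidences, correct, n_bins=10):
    # Expected Calibration Error: bin predictions by confidence, then take the
    # bin-size-weighted average of |empirical accuracy - mean confidence|.
    conf = np.asarray(confidences, dtype=float)
    corr = np.asarray(correct, dtype=float)
    # Bin index in {0, ..., n_bins-1}; confidence 1.0 falls in the last bin.
    idx = np.minimum((conf * n_bins).astype(int), n_bins - 1)
    total = 0.0
    for b in range(n_bins):
        mask = idx == b
        if mask.any():
            total += mask.mean() * abs(corr[mask].mean() - conf[mask].mean())
    return total
```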

Medical Domain (MedCPT Retriever)

Method    Metric  BioASQ-Y/N  MMLU-Med  PubMedQA
CalibRAG  ECE↓    Best        Best      Best
CalibRAG  ACC↑    Best        Best      Best

CalibRAG is trained on general-domain data yet achieves state-of-the-art performance on medical benchmarks with unseen retrievers and out-of-distribution datasets.

Comparison with Re-ranking / Robust RAG Methods

Dataset   Method         AUROC↑  ACC↑   ECE↓   BS↓
HotpotQA  Cross-encoder  60.74   34.98  0.477  0.477
HotpotQA  LLM-rerank     60.57   38.52  0.248  0.297
HotpotQA  CalibRAG       72.47   42.37  0.106  0.206
NQ        SelfRAG        48.4    36.2   0.522  0.545
NQ        CalibRAG       63.5    37.4   0.258  0.287

Ablation Study

  • Temperature conditioning: Removing temperature conditioning leads to a significant increase in ECE, particularly under high-temperature sampling, validating the necessity of temperature modeling.
  • Number of retrieved documents: Performance is optimal at \(K=20\); increasing to 40 yields diminishing returns.
  • Query reformulation: Stage 3 consistently improves performance across all settings, albeit at additional computational cost.

Key Findings

  1. The Top-1 retrieved document is not always optimal—lower-ranked documents can sometimes yield better decisions.
  2. While adding retrieved documents improves accuracy, it also increases ECE (overconfidence), necessitating additional calibration.
  3. Although CalibRAG is primarily designed for calibration, it also improves accuracy by selecting documents more likely to lead to correct decisions.

Highlights & Insights

  1. Novel problem formulation: The objective of RAG is extended from "retrieving relevant documents" to "ensuring well-calibrated decisions," representing a distinctly new perspective.
  2. Elegant temperature-conditioned design: Fourier encoding is used to model variability in user behavior, enabling a single model to accommodate users with different risk preferences.
  3. Strong cross-domain generalization: A model trained on general-domain data can be directly applied to unseen medical retrievers and datasets.
  4. Lightweight solution: Only LoRA adapters and a classification head require training, avoiding unstable procedures such as PPO.
  5. Rigorous theoretical guarantee: The use of a strictly proper scoring rule as the loss function provides convergence guarantees for calibration.

Limitations & Future Work

  • Synthetic data generation and forecasting function training introduce additional overhead.
  • The evaluation model \(\mathcal{G}\) relies on GPT-4o-mini, which may introduce evaluation bias.
  • The proxy user model \(U\) may not fully capture real human decision-making behavior.
  • A gap remains between the temperature parameter \(t\) and actual user behavior.
  • The effectiveness of the forecasting function across more LLM backbones has not been explored.
  • The triggering mechanism for query reformulation in Stage 3 could be made more fine-grained.

Discussion

  • Extending confidence calibration from classification tasks to long-form RAG generation is broadly instructive.
  • Document re-ranking is not equivalent to calibration—re-ranking optimizes ranking metrics, whereas CalibRAG optimizes decision correctness.
  • CalibRAG can be used in a complementary fashion alongside other robust RAG methods such as SelfRAG.
  • The framework has direct practical value for the reliable deployment of LLMs in high-stakes decision-making scenarios (e.g., healthcare, law).

Rating

  • Novelty: ⭐⭐⭐⭐ Introducing calibration into RAG-based decision-making is a novel perspective, though the technical approach is relatively straightforward.
  • Experimental Thoroughness: ⭐⭐⭐⭐ Covers multiple datasets, retrievers, and domains with comprehensive ablation studies.
  • Writing Quality: ⭐⭐⭐⭐ Problem motivation is clearly articulated and mathematical formalization is rigorous, though notation is occasionally dense.
  • Value: ⭐⭐⭐⭐ Offers practical value for improving the reliability of RAG systems, particularly in high-stakes decision-making contexts.