REFIND at SemEval-2025 Task 3: Retrieval-Augmented Factuality Hallucination Detection in Large Language Models¶

Conference: ACL 2025
arXiv: 2502.13622
Code: https://github.com/oneonlee/REFIND
Area: Hallucination Detection
Keywords: Hallucination detection, retrieval augmentation, context sensitivity, multilingual, token-level analysis

TL;DR¶

REFIND detects hallucinated spans in LLM outputs by computing a Context Sensitivity Ratio (CSR) for each token: the ratio of the token's log-probability with retrieved documents to its log-probability without them. In SemEval-2025 Task 3 it outperforms the baselines on average across 9 languages (average IoU 0.363 vs. 0.279 for FAVA), although FAVA scores higher on German, Spanish, and Italian.

Background & Motivation¶

Hallucinatory content generated by LLMs (i.e., outputs inconsistent with facts) severely limits their reliability in knowledge-intensive tasks. Existing hallucination detection methods possess notable limitations:

Token-level Classifiers (e.g., XLM-RoBERTa-based): Rely solely on the model's internal knowledge for binary classification without utilizing external evidence, performing poorly on low-resource languages.

FAVA (Retrieval-Augmented Editing Method): Although external knowledge is introduced, it adopts a multi-step pipeline (retrieval → comparison → editing). The alignment between steps is prone to errors, and the workflow is complex.

The core problem: how can retrieved external documents be used more directly and efficiently to locate hallucinated spans in LLM outputs?

The key insight of REFIND is: if a token is hallucinated (fabricated), the model's generation probability for this token should change significantly when correct external evidence is provided. Conversely, if a token is factual, external evidence will not dramatically alter its generation probability.

Method¶

Overall Architecture¶

REFIND is a three-step pipeline: (1) given a query q, a retriever R retrieves a set of relevant documents D; (2) a frozen LLM computes the generation probability of each response token twice, once with and once without the retrieved context; (3) the CSR is computed for each token, and tokens whose CSR is at least a threshold δ are flagged as hallucinated.
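Step (1) is described under Key Designs as a sparse-then-dense hybrid: BM25 selects candidates, and an embedding model reranks them. The sketch below shows only that two-stage shape, using the Python standard library. The whitespace tokenizer, the BM25 constants `k1` and `b`, and `toy_embed` (a letter-count vector standing in for a real encoder such as multilingual-e5-large) are all assumptions made for illustration, not details taken from the paper.

```python
import math
from collections import Counter


def tokenize(text):
    # Lowercased whitespace split: a simplification, not language-aware.
    return text.lower().split()


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Okapi BM25 score of every document against the query."""
    if not docs:
        return []
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs
    doc_freq = Counter(term for t in tokenized for term in set(t))
    query_terms = tokenize(query)
    scores = []
    for terms in tokenized:
        tf = Counter(terms)
        score = 0.0
        for term in query_terms:
            if tf[term] == 0:
                continue
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(terms) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores


def toy_embed(text):
    # Hypothetical stand-in for a dense encoder: a 26-dim letter-count vector.
    counts = Counter(ch for ch in text.lower() if "a" <= ch <= "z")
    return [counts[chr(ord("a") + i)] for i in range(26)]


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def hybrid_retrieve(query, docs, embed, n_sparse=10, n_final=5):
    """Keep the n_sparse best BM25 candidates, then rerank them by embedding similarity."""
    sparse = bm25_scores(query, docs)
    candidates = sorted(range(len(docs)), key=lambda i: sparse[i], reverse=True)[:n_sparse]
    q_vec = embed(query)
    reranked = sorted(candidates, key=lambda i: cosine(q_vec, embed(docs[i])), reverse=True)
    return [docs[i] for i in reranked[:n_final]]


# Toy corpus: one relevant document and eleven fillers.
docs = ["chance the rapper released his debut mixtape"] + [
    f"unrelated filler text number {i}" for i in range(11)
]
top = hybrid_retrieve("chance the rapper debut", docs, toy_embed)
```

Taking the encoder as an `embed` argument keeps the sketch independent of any particular model; a real setup would pass a function that wraps the chosen multilingual encoder.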

Key Designs¶

Context Sensitivity Ratio (CSR): The core metric, defined as:

\[CSR(t_i) = \frac{\log p_\theta(t_i | D, q, t_{<i})}{\log p_\theta(t_i | q, t_{<i}) + \varepsilon}\]
- Numerator: Log-probability conditioned on the query q, historical tokens \(t_{<i}\), and retrieved documents D.
- Denominator: Log-probability conditioned only on the query and historical tokens (\(+ \varepsilon\) to avoid division by zero).
- A high CSR indicates that the retrieved context has a strong influence on the generation of the token, implying that the token is likely a hallucination.
- Design Motivation: Rather than training another model to "judge" whether a text segment is a hallucination, it is better to directly observe the probability shift of the original LLM when provided with correct evidence—serving as a more intrinsic and direct signal.
Hybrid Retrieval Strategy: A sparse and dense hybrid retrieval strategy is adopted. First, BM25 is used to retrieve the Top-10 documents from a preprocessed multilingual Wikipedia corpus, which are then reranked using multilingual-e5-large to select the final 5 documents. To maintain multilingual consistency, a multilingual embedding model is uniformly used.
Threshold Decision: A token is flagged as hallucinated if CSR ≥ δ, where δ is a tunable hyperparameter that trades off precision against recall. Experiments show that performance remains stable for δ between 0.1 and 0.4 for most languages.
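The CSR formula and the threshold rule above can be sketched in a few lines. This is a minimal illustration of the arithmetic as stated in this summary, not the authors' implementation. The log-probabilities are invented to exercise the formula, and the function names, the `eps` value, and the assumption that the tokens concatenate exactly to the response string are all illustrative choices.

```python
import math


def context_sensitivity_ratio(logp_with_ctx, logp_without_ctx, eps=1e-8):
    """CSR(t_i) = log p(t_i | D, q, t_<i) / (log p(t_i | q, t_<i) + eps), per token."""
    ratios = []
    for with_ctx, without_ctx in zip(logp_with_ctx, logp_without_ctx):
        denom = without_ctx + eps
        # Guard the one case eps cannot cover (denominator exactly zero).
        ratios.append(with_ctx / denom if denom != 0 else math.inf)
    return ratios


def flag_spans(tokens, csr, delta):
    """Merge consecutive tokens with CSR >= delta into half-open character spans."""
    spans, start, pos = [], None, 0
    for tok, score in zip(tokens, csr):
        if score >= delta:
            if start is None:
                start = pos
        elif start is not None:
            spans.append((start, pos))
            start = None
        pos += len(tok)
    if start is not None:
        spans.append((start, pos))
    return spans


# Invented values: the retrieved context makes every token except " 2011" more likely.
tokens = ["Chance", " debuted", " in", " 2011", "."]
logp_with = [-0.02, -0.05, -0.01, -4.0, -0.03]
logp_without = [-0.5, -0.6, -0.2, -1.0, -0.3]

csr = context_sensitivity_ratio(logp_with, logp_without)
spans = flag_spans(tokens, csr, delta=0.2)
flagged_text = ["".join(tokens)[s:e] for s, e in spans]
```

Because log-probabilities are negative, a token that the evidence supports moves toward log p = 0 in the numerator and gets a CSR near 0, while a token the evidence contradicts gets a CSR of 1 or more. In this toy input only " 2011" is flagged at δ = 0.2.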

Technical Details¶

The context-free probability \(p(t_i|q, t_{<i})\) directly utilizes the token probabilities provided by the Mu-SHROOM dataset.
The context-conditioned probability \(p(t_i|D, q, t_{<i})\) is computed via PyTorch 2.
No training is required, making it a zero-shot detection method.
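Scoring a fixed response under a frozen model amounts to reading off the log-probability of each response token from the model's logits, once per prompt variant. The sketch below shows that step on plain Python lists so that it stays dependency-free. In PyTorch the same operation would typically be a `log_softmax` over the vocabulary dimension followed by a `gather` on the token ids. The function names and the toy two-word vocabulary are assumptions for illustration.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary-sized logit vector."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def token_logprobs(step_logits, token_ids):
    """step_logits[i] is the logit vector used to predict token_ids[i]."""
    return [log_softmax(logits)[tok] for logits, tok in zip(step_logits, token_ids)]


# Toy vocabulary of size 2; the response consists of token id 0 twice.
step_logits = [[0.0, 0.0], [math.log(3.0), 0.0]]
logps = token_logprobs(step_logits, [0, 0])
probs = [math.exp(lp) for lp in logps]  # 0.5 and 0.75, up to floating-point rounding
```

Running this once with the documents in the prompt and once without would yield the two log-probability sequences that the CSR compares.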

Key Experimental Results¶

Main Results (IoU metric, higher is better)¶

Method	AR	CS	DE	EN	ES	EU	FI	FR	IT	Average
XLM-R	0.042	0.096	0.032	0.031	0.072	0.021	0.004	0.002	0.010	0.035
FAVA	0.217	0.235	0.386	0.281	0.235	0.387	0.230	0.212	0.326	0.279
REFIND	0.374	0.276	0.352	0.353	0.215	0.407	0.506	0.473	0.313	0.363
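IoU here measures the overlap between predicted and annotated hallucination spans. The sketch below shows one plausible reading of the metric and rests on two assumptions this summary does not confirm: that spans are half-open `(start, end)` character offsets, and that a response with no predicted and no annotated hallucination scores 1.0.

```python
def span_iou(pred_spans, gold_spans):
    """Character-level intersection over union of two sets of (start, end) spans."""
    pred = {i for start, end in pred_spans for i in range(start, end)}
    gold = {i for start, end in gold_spans for i in range(start, end)}
    if not pred and not gold:
        # Assumed convention: agreeing that nothing is hallucinated is a perfect score.
        return 1.0
    return len(pred & gold) / len(pred | gold)


# The prediction covers " 2011" (5 chars); the annotation covers "2011" (4 chars).
example = span_iou([(17, 22)], [(18, 22)])  # 4 shared / 5 total characters
```

Under this reading, a prediction that includes one extra leading space already drops from 1.0 to 0.8, which helps explain why the absolute IoU values in the table are modest.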

Ablation Study (Threshold Sensitivity Analysis)¶

Threshold δ	Overall Performance	Description
0.1	High	Stable for high-resource languages, slightly larger fluctuations for low-resource languages
0.2	High	Overall most stable interval
0.3	Medium-High	English and German maintain ~0.35
0.4	Medium	Performance declines for some low-resource languages

Key Findings¶

REFIND achieves an average IoU of 0.363, a relative gain of about 30% over FAVA (0.279) and roughly 10 times the XLM-R average (0.035).
The largest gains over FAVA are on Finnish (0.230 to 0.506), French (0.212 to 0.473), and Arabic (0.217 to 0.374). FAVA remains ahead on German, Spanish, and Italian.
XLM-R almost completely fails on Finnish (0.004) and French (0.002) and stays below 0.1 on every language, suggesting that a classifier relying solely on internal model knowledge is insufficient here, and not only for low-resource languages.
REFIND is insensitive to threshold selection across most languages, demonstrating the robustness of the method.
Retrieval quality directly impacts the reliability of the CSR computation.
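The headline comparisons can be re-derived from the Main Results table. The snippet below recomputes the per-method averages, the relative gain over FAVA, and the languages where FAVA stays ahead. It uses only the rounded values printed in the table, so the recomputed averages can differ from the reported ones in the third decimal.

```python
langs = ["AR", "CS", "DE", "EN", "ES", "EU", "FI", "FR", "IT"]
iou = {
    "XLM-R": [0.042, 0.096, 0.032, 0.031, 0.072, 0.021, 0.004, 0.002, 0.010],
    "FAVA": [0.217, 0.235, 0.386, 0.281, 0.235, 0.387, 0.230, 0.212, 0.326],
    "REFIND": [0.374, 0.276, 0.352, 0.353, 0.215, 0.407, 0.506, 0.473, 0.313],
}

# Unweighted mean over the nine languages for each method.
mean = {method: sum(vals) / len(vals) for method, vals in iou.items()}

# Relative improvement of REFIND over FAVA, in percent.
gain_over_fava_pct = (mean["REFIND"] / mean["FAVA"] - 1) * 100

# Ratio to the reported XLM-R average of 0.035.
ratio_to_xlmr = mean["REFIND"] / 0.035

# Languages where FAVA scores higher than REFIND.
fava_ahead = [l for l, f, r in zip(langs, iou["FAVA"], iou["REFIND"]) if f > r]
```

Here `fava_ahead` evaluates to German, Spanish, and Italian, so the average advantage coexists with per-language losses.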

Highlights & Insights¶

The core idea of CSR is simple yet effective: Instead of training additional models, it directly utilizes variations in the LLM's own probability distribution to detect hallucinations. The computational cost is significantly lower than fine-tuning-based methods.
Outstanding multilingual zero-shot capability: Thanks to the multilingual retriever and the language-agnostic nature of CSR, it performs well across 9 languages (including low-resource ones).
More direct than FAVA: By avoiding alignment errors inherent in multi-step pipelines, CSR provides clear hallucination signals at the token level.
Clear case study: For example, in the response to "When did Chance the Rapper debut?", "2011" was correctly identified as a hallucination because the retrieved documents provided a different date.

Limitations & Future Work¶

Dependence on retrieval quality: If the retriever fails to return high-quality documents, the CSR calculation will be inaccurate, potentially leading to false positives or negatives.
Computational overhead: Calculating token probabilities with and without context separately represents a potential bottleneck in low-latency scenarios.
Focus only on factual hallucinations: Performance on non-factual scenarios (such as opinion-based or creative queries) remains untested.
Fixed hyperparameter for threshold: Although experiments show good robustness, an adaptive threshold mechanism (e.g., based on language or domain) could further enhance performance.
Below FAVA on three languages: REFIND trails FAVA on Spanish (0.215 vs. 0.235), German (0.352 vs. 0.386), and Italian (0.313 vs. 0.326); the reasons have not been analyzed in depth.

Related Work & Insights¶

Similar to SelfCheckGPT, REFIND leverages the internal information of the LLM to detect hallucinations, but overcomes the limitations of purely internal methods by incorporating external retrieval evidence.
Semantic entropy methods (Farquhar et al., 2024) determine whether an entire response is hallucinated, whereas REFIND localizes specific spans, offering a finer granularity.
The concept of CSR can be extended to other scenarios requiring validation of information source reliability, such as automated fact-checking.

Rating¶

Novelty: ⭐⭐⭐⭐ — The design of the CSR metric is novel and intuitive. Directly quantifying context sensitivity provides an elegant solution.
Experimental Thoroughness: ⭐⭐⭐ — While evaluation across 9 languages offers good coverage, the ablation study is limited (restricted to threshold analysis) and lacks ablations over variables such as retrievers and LLM selection.
Writing Quality: ⭐⭐⭐⭐ — The description of the methodology is clear and concise, and the illustrations are intuitive.
Value: ⭐⭐⭐⭐ — It provides a lightweight, training-free paradigm for hallucination detection, demonstrating strong practicality.