Fine-grained Hallucination Detection and Mitigation in Long-form Question Answering

Conference: ACL 2025 (Findings)
arXiv: 2407.11930
Code: Yes (provided in paper)
Area: Hallucination Detection
Keywords: Long-form Question Answering, Hallucination Detection, Fine-grained Annotation, Feedback Model, Answer Refinement

TL;DR

This paper constructs HaluQuestQA (698 QA pairs, 4.7k error annotations, 5 error types), the first LFQA hallucination dataset with span-level error annotations. It then trains an automatic feedback model that detects spans with incomplete information and generates an explanation for each. Finally, it proposes Error-informed Refinement, which uses these feedback signals to refine answers. The method reduces hallucinations by approximately 3%, and in human evaluations 84% of users prefer the refined answers.

Background & Motivation

Background: Long-form Question Answering (LFQA) aims to provide comprehensive, in-depth answers to complex questions. LLMs excel at this task but are prone to hallucinations and factual inconsistencies. Surface-overlap metrics such as BLEU and ROUGE cannot effectively evaluate the quality of long-form answers.

Limitations of Prior Work: (1) Existing hallucination detection works mainly focus on factual errors, neglecting the multi-dimensional evaluation needs such as answer completeness and relevance; (2) Existing evaluations either rely on coarse-grained overall preference judgments or require reference texts; (3) Span-level error annotations for long-form answers are almost non-existent in the LFQA field. Current feedback models either fail to provide fine-grained error feedback or rely on gold-standard texts.

Key Challenge: Long-form answers generated by LLMs may appear fluent and convincing on the surface, but contain hidden flaws in information completeness and reference quality. These defects require expert-level annotations to detect, which are highly expensive.

Goal: (1) Establish a fine-grained LFQA error classification system and annotated dataset; (2) Train a reference-free automatic feedback model; (3) Automatically refine answers using feedback information to reduce hallucinations.

Key Insight: Starting from how human experts review long-form answers, the authors define five key error types and observe that the primary issue with answers is not factual error, but rather a lack of comprehensiveness and poor reference quality.

Core Idea: Formulating a closed-loop pipeline of "error detection \(\rightarrow\) fine-grained feedback generation \(\rightarrow\) feedback-driven answer refinement" to systematically mitigate hallucinations in LFQA.

Method

Overall Architecture

The pipeline consists of three stages: (1) Constructing the HaluQuestQA dataset, where domain experts perform span-level error annotations on human- and model-generated answers; (2) Training a feedback model on this dataset to predict the error status and explanation for each sentence; (3) Using the output of the feedback model to drive a refinement model that improves answer quality.

Key Designs

HaluQuestQA Dataset Construction:
- Function: Provides the first span-level, multi-dimensional error annotation dataset in the LFQA domain.
- Mechanism: Crawls new questions (Nov 2022 - Mar 2023) from the Reddit ELI5 forum to avoid data leakage, covering 7 domains: physics, chemistry, biology, technology, economics, history, and law. Domain experts (aged 22-32, US/UK undergraduate/postgraduate) were recruited through the Prolific platform, with 3 experts per domain annotating 35-50 QA pairs. Human answers come from Reddit, while model answers are zero-shot generated by GPT-4. Five error types are defined: question misconception, factuality, completeness, relevance, and references.
- Design Motivation: Existing datasets lack fine-grained, span-level annotations tailored for LFQA, which are essential for evaluation and improvement.
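To make the idea of a span-level, multi-dimensional annotation concrete, the following is a minimal sketch of how one annotated QA pair could be represented. The class and field names (`SpanAnnotation`, `target`, `justification`, and so on) are hypothetical and are not taken from the released dataset; only the five error types come from the description above.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class ErrorType(Enum):
    # The five error types listed in the dataset description.
    QUESTION_MISCONCEPTION = "question misconception"
    FACTUALITY = "factuality"
    COMPLETENESS = "completeness"
    RELEVANCE = "relevance"
    REFERENCES = "references"


@dataclass
class SpanAnnotation:
    # Hypothetical schema: character offsets into either the question or the answer.
    target: str          # "question" or "answer"
    start: int           # inclusive character offset
    end: int             # exclusive character offset
    error_type: ErrorType
    justification: str   # free-text reason given by the annotator (assumed field)


@dataclass
class AnnotatedQAPair:
    question: str
    answer: str
    answer_source: str   # "human" (Reddit) or "model" (GPT-4)
    annotations: List[SpanAnnotation] = field(default_factory=list)

    def spans_of(self, error_type: ErrorType) -> List[str]:
        """Return the marked text for every annotation of the given error type."""
        spans = []
        for a in self.annotations:
            if a.error_type is error_type:
                text = self.question if a.target == "question" else self.answer
                spans.append(text[a.start:a.end])
        return spans


# Toy example (invented text, not from the dataset).
answer = "Air scatters light. Blue wins."
marked = "Air scatters light."
pair = AnnotatedQAPair(
    question="Why is the sky blue?",
    answer=answer,
    answer_source="model",
    annotations=[
        SpanAnnotation(
            target="answer",
            start=answer.index(marked),
            end=answer.index(marked) + len(marked),
            error_type=ErrorType.COMPLETENESS,
            justification="Does not say why shorter wavelengths scatter more.",
        )
    ],
)
```

Storing offsets rather than copied text keeps overlapping spans of different error types unambiguous, which matters when one sentence is flagged on several dimensions.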
Error Feedback Model (EFM):
- Function: Given a QA pair, predicts the [Complete]/[Incomplete] label and the error explanation for each sentence.
- Mechanism: Fine-tunes LLaMA2-13B on the completeness annotations (509 samples) from HaluQuestQA (abbreviated HQ2A). At inference time, nucleus sampling (\(p=0.9\)) generates \(n=20\) candidate outputs, and a two-stage consistency filter selects the most reliable one. Stage 1 scores label consistency, \(\mathcal{S}_{TC} = \frac{1}{n}\sum_{s=1}^{n} \mathbf{1}_{t_i = t_s}\), the proportion of samples whose label sequence \(t_s\) is identical to the label sequence \(t_i\) of candidate \(i\). Stage 2 scores rationale consistency, \(\mathcal{S}_{RC} = \frac{1}{m_i}\sum_{k=1}^{m_i}\sum_{s=1}^{n} \mathbf{1}_{w_{ik} \in j_s}\), which measures how often each of the \(m_i\) tokens \(w_{ik}\) in candidate \(i\)'s explanation also appears in the sampled explanations \(j_s\).
- Design Motivation: The feedback model itself can hallucinate (approximately 20% of outputs contain fabricated web links). Sampling + consistency filtering reduces the hallucination rate to 5-10%.
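The two consistency scores can be sketched in a few lines. This is an illustrative reading of the formulas above, not the authors' implementation: whitespace tokenization, lowercasing, and the rule for combining the stages (keep the candidates with the highest label consistency, then pick the one with the highest rationale consistency) are assumptions, and all function names are hypothetical.

```python
from collections import Counter
from typing import List, Sequence, Tuple

# One sampled output: (per-sentence label sequence, explanation text).
Candidate = Tuple[Tuple[str, ...], str]


def label_consistency(cands: Sequence[Candidate]) -> List[float]:
    """S_TC for each candidate: share of samples with an identical label sequence."""
    counts = Counter(labels for labels, _ in cands)
    n = len(cands)
    return [counts[labels] / n for labels, _ in cands]


def rationale_consistency(cands: Sequence[Candidate], i: int) -> float:
    """S_RC for candidate i, following the formula as written.

    The sum runs over all n samples (including candidate i itself) and is
    divided only by the token count m_i, so the score can exceed 1.
    """
    tokens = cands[i][1].lower().split()  # assumed tokenization
    if not tokens:
        return 0.0
    token_sets = [set(expl.lower().split()) for _, expl in cands]
    hits = sum(1 for w in tokens for ts in token_sets if w in ts)
    return hits / len(tokens)


def select_feedback(cands: Sequence[Candidate]) -> Tuple[int, float]:
    """Return (index of the chosen candidate, its label-consistency score)."""
    tc = label_consistency(cands)
    best_tc = max(tc)
    # Stage 1: keep candidates whose label sequence is the most frequent one.
    survivors = [i for i, score in enumerate(tc) if score == best_tc]
    # Stage 2: among them, prefer the explanation that overlaps most with the samples.
    best = max(survivors, key=lambda i: rationale_consistency(cands, i))
    return best, tc[best]


# Toy example with three samples instead of twenty.
samples = [
    (("[Complete]", "[Incomplete]"), "misses the role of gravity"),
    (("[Complete]", "[Incomplete]"), "misses gravity entirely"),
    (("[Complete]", "[Complete]"), "answer is fine"),
]
chosen, confidence = select_feedback(samples)
```

The intuition is that fabricated details such as invented links rarely recur across independent samples, so explanations built from frequently repeated tokens are less likely to contain them.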
Error-Informed Refinement (EIR):
- Function: Refines original answers using fine-grained feedback.
- Mechanism: Uses LLaMA2-13B-chat as the refinement model. The inputs are the original QA pair and the fine-grained error feedback (error location, cause, and confidence score) provided by the feedback model, and the model generates the improved answer zero-shot. To verify the importance of feedback granularity, EIR (fine-grained feedback) is compared against two baselines: Improve (direct improvement without feedback) and Generic (generic feedback).
- Design Motivation: Fine-grained feedback provides more precise guidance than coarse-grained feedback to help the model correct specific deficiencies in the answer.
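The difference between the three refinement settings comes down to what is placed in the prompt. The sketch below assembles a prompt for each setting; the template wording, the `GENERIC_FEEDBACK` string, the function name, and the choice to list only sentences labeled `[Incomplete]` are all hypothetical, since the summary does not reproduce the paper's prompts.

```python
from typing import Optional, Sequence, Tuple

# (sentence, label, reason); label is "[Complete]" or "[Incomplete]".
SentenceFeedback = Tuple[str, str, str]

# Hypothetical wording for the coarse-grained baseline.
GENERIC_FEEDBACK = "The answer is incomplete. Make it more comprehensive."


def build_refinement_prompt(
    question: str,
    answer: str,
    mode: str,
    feedback: Sequence[SentenceFeedback] = (),
    confidence: Optional[float] = None,
) -> str:
    """Build a zero-shot refinement prompt for mode 'improve', 'generic', or 'eir'."""
    parts = [f"Question: {question}", f"Answer: {answer}"]
    if mode == "generic":
        parts.append(f"Feedback: {GENERIC_FEEDBACK}")
    elif mode == "eir":
        # Assumption: only flagged sentences are shown, each with its reason.
        lines = [
            f"- {label} {sentence} Reason: {reason}"
            for sentence, label, reason in feedback
            if label == "[Incomplete]"
        ]
        parts.append("Feedback:\n" + "\n".join(lines))
        if confidence is not None:
            parts.append(f"Feedback confidence: {confidence:.2f}")
    elif mode != "improve":
        raise ValueError(f"unknown mode: {mode!r}")
    parts.append("Rewrite the answer so that it is complete and accurate.")
    return "\n\n".join(parts)


eir_prompt = build_refinement_prompt(
    "Why is the sky blue?",
    "Air scatters light. Blue wins.",
    mode="eir",
    feedback=[
        ("Air scatters light.", "[Incomplete]", "Does not explain wavelength dependence."),
        ("Blue wins.", "[Complete]", ""),
    ],
    confidence=0.85,
)
improve_prompt = build_refinement_prompt(
    "Why is the sky blue?", "Air scatters light. Blue wins.", mode="improve"
)
```

Keeping the question, answer, and final instruction identical across modes isolates feedback granularity as the only variable, which is the comparison the baselines are meant to support.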

Loss & Training

The feedback model is fine-tuned with a standard sequence-to-sequence loss for 5 epochs, using a batch size of 4, a learning rate of 2e-5, and a maximum sequence length of 1024. In addition, the LLaMA2-13B-chat refinement model is optimized with Direct Preference Optimization (DPO) using LoRA (\(r=256\), \(\alpha=128\)).
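The reported hyperparameters can be collected into a small configuration sketch. The class names are hypothetical and no training library is assumed. Two derived quantities are shown: the LoRA scaling factor \(\alpha / r\) used in standard LoRA, and the number of optimizer steps that would result if all 509 samples were used for training with no gradient accumulation on a single device, which the summary does not state.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class FeedbackSFTConfig:
    # Values reported for fine-tuning the feedback model.
    batch_size: int = 4
    learning_rate: float = 2e-5
    max_seq_length: int = 1024
    num_epochs: int = 5

    def optimizer_steps(self, num_examples: int) -> int:
        """Total steps, assuming one update per batch and no gradient accumulation."""
        return math.ceil(num_examples / self.batch_size) * self.num_epochs


@dataclass(frozen=True)
class LoraSettings:
    # Values reported for the DPO-optimized refinement model.
    r: int = 256
    alpha: int = 128

    @property
    def scaling(self) -> float:
        """Standard LoRA scales the low-rank update by alpha / r."""
        return self.alpha / self.r


sft = FeedbackSFTConfig()
lora = LoraSettings()
steps_if_all_509_used = sft.optimizer_steps(509)  # assumption: full set used for training
```

With \(\alpha\) at half of \(r\), the low-rank update is scaled by 0.5.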

Key Experimental Results

Main Results

Dataset	Method	TigerScore(↓)	Hallucination Sample Ratio(↓)	F1(↑)
HQ2A	Baseline	19.61	0.63	-
HQ2A	Improve	1.31	0.05	0.97
HQ2A	Generic	1.31	0.05	0.97
HQ2A	EIR	0.65	0.03	0.97
HQ2A	Human Feedback	2.61	0.09	0.94

Human Evaluation

Dataset	Type	Comprehensiveness	Preference Rate
HQ2A	Baseline	0%	7.84%
HQ2A	Refined	100%	92.16%
ASQA	Baseline	82%	40%
ASQA	Refined	100%	60%
ELI5	Baseline	38%	0%
ELI5	Refined	100%	100%
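The 84% preference figure quoted in the TL;DR appears to be the unweighted mean of the three refined-answer preference rates in this table. That reading is an assumption; the summary does not say how the figure was aggregated. A quick check:

```python
# Preference rates (%) for refined answers, copied from the table above.
preference_refined = {"HQ2A": 92.16, "ASQA": 60.0, "ELI5": 100.0}

# Unweighted mean across datasets (assumed aggregation).
mean_preference = sum(preference_refined.values()) / len(preference_refined)
print(f"{mean_preference:.2f}")  # 84.05
```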

Key Findings

EIR not only outperforms coarse-grained feedback but also surpasses expert human feedback (TigerScore 0.65 vs 2.61 on HQ2A), indicating that model-generated fine-grained feedback is more stable than humans in guiding refinement.
The feedback model aligns closely with human annotations when its consistency score exceeds 0.80; below 0.80, its predictions become markedly less certain.
The major issues with the answers are completeness and reference quality, rather than factual errors—GPT-4 answers score well in factuality and relevance but still exhibit obvious deficiencies in comprehensiveness.
The DPO-optimized refinement model achieves better hallucination scores on ASQA and ELI5, but does not reduce the number of hallucinating samples. This indicates that DPO helps correct severe errors but is less effective in improving coverage.
Expert annotations indicate that ~40% of technology and economics questions inherently contain misconceptions, suggesting that LFQA evaluation needs to assess question quality first.

Highlights & Insights

The feedback model surpassing human feedback is a key discovery: Automated, stable, fine-grained feedback is more suitable for guiding answer refinement than one-time critiques from human experts. This indicates that in the "evaluation-improvement" cycle, consistency is more critical than professional depth.
Sampling + consistency filtering is a practical technique for handling the feedback model's own hallucinations, reducing the ratio of fabricated references from 20% to 5-10%, which is generalizable to any scenario requiring reliable generation.
"Completeness" and "references" in the five-dimensional error taxonomy offer new insights; prior work has overly focused on factuality, neglecting other dimensions of answer utility.

Limitations & Future Work

The study focuses only on LFQA tasks and has yet to verify effects on other long-context generation tasks like summarization and translation.
The feedback model is built on LLaMA2-13B; newer and larger models may achieve better error detection accuracy.
Experiments were conducted only with LLaMA2-13B-chat as the refinement model; models with different instruction-following capabilities may yield different results.
The data source is limited to English content on Reddit; generalization to other languages and domains remains to be verified.
It is worth considering iterative refinement—detecting errors again in refined answers to improve them further.

vs TigerScore: TigerScore is a reference-free evaluation metric but does not provide structured feedback suitable for refinement; the feedback model in this paper serves as both an evaluator and a guide for improvement.
vs Fine-grained RLHF (Wu et al., 2023): That work trains multiple reward models corresponding to different error types, which is complex and computationally intensive; this paper employs a single feedback model to cover multi-dimensional evaluation.
vs Self-Refine (Madaan et al., 2023): Self-Refine relies on a model's self-feedback, which is prone to "self-hallucination"; this paper uses an independently trained feedback model to provide external supervision.

Rating

Novelty: ⭐⭐⭐⭐ First LFQA span-level error annotation dataset and systematic integration of the "detection-feedback-refinement" pipeline.
Experimental Thoroughness: ⭐⭐⭐⭐⭐ Extremely comprehensive: three datasets, multiple baselines, human evaluations, and DPO ablation studies.
Writing Quality: ⭐⭐⭐⭐ Well-structured with clear illustrations and intuitive examples of error types.
Value: ⭐⭐⭐⭐ The dataset and methodology provide significant reference value to the LFQA community, and the feedback-refinement paradigm can be widely applied.