Improving Contextual Faithfulness of Large Language Models via Retrieval Heads-Induced Optimization¶

Conference: ACL 2025
arXiv: 2501.13573
Code: github
Area: LLM/NLP
Keywords: Retrieval-Augmented Generation, Contextual Faithfulness, Attention Heads, Contrastive Decoding, Long-Form Question Answering

TL;DR¶

This paper finds that "retrieval heads" in LLMs are highly correlated with contextual faithfulness in long-form question answering. Based on this, the Rhio framework is proposed: it generates unfaithful samples by masking retrieval heads, introduces control tokens for faithfulness-aware tuning, and uses contrastive decoding to enhance faithfulness. With Llama-2-7B and 13B backbones, Rhio matches or surpasses GPT-4o in average faithfulness.

Background & Motivation¶

Retrieval-Augmented Generation (RAG) has become a crucial technique for improving LLM performance in information retrieval tasks. However, in long-form question answering (LFQA) scenarios, model-generated responses often lack faithfulness to the provided context, resulting in "faithfulness-related hallucinations." These hallucinations take two forms: the generated answer contains information not present in the context (fabrication), or it synthesizes context information inaccurately (for example, by incorrectly linking content across different documents).

Existing mitigation methods mainly improve faithfulness indirectly through "compensatory" means, such as enhancing context quality (explicit denoising), self-reflection, or context-aware decoding, without fundamentally teaching the model to distinguish between faithful and unfaithful outputs.

The key insight of this paper comes from the study of "retrieval heads." Retrieval heads are specialized attention heads in Transformers responsible for retrieving relevant information from the context and performing "copy-paste" operations. Through preliminary experiments, the authors found that an increase in the number of masked retrieval heads directly correlates with a decrease in faithfulness, and the distribution of error types produced by masking retrieval heads is highly similar to the model's own natural error patterns. This discovery inspired the design of the entire methodology.

Method¶

Overall Architecture¶

Rhio (Retrieval Heads-Induced Optimization) consists of three core components:
1. Unfaithful data augmentation (via masking retrieval heads)
2. Faithfulness-Aware Tuning (FAT, using control tokens to distinguish faithful/unfaithful)
3. Self-Induced Decoding (SID, utilizing contrastive decoding to enhance inference-time faithfulness)

Key Designs¶

Unfaithful Data Augmentation: Leveraging findings from preliminary experiments, unfaithful samples are generated by masking the top \(N=100\) retrieval heads in the LLM. Specifically, the attention weight matrices corresponding to the retrieval heads are set to zero. This method offers two major advantages: (a) the generated unfaithful sample patterns are highly similar to actual model errors, including incomplete hallucinations (most common) and complete fabrication hallucinations; (b) the method is simple and efficient, requiring no complex entity replacement or relation perturbation. Compared to traditional methods like entity replacement, masking retrieval heads produces more realistic and diverse error patterns.
Faithfulness-Aware Tuning (FAT): Two special control tokens, [POS] and [NEG], are introduced. [POS] guides the model to generate faithful answers, while [NEG] guides it to generate unfaithful answers. The training objective consists of two parts: (a) learning to generate faithful outputs \(y^+\) conditioned on [POS]; (b) learning to generate unfaithful outputs \(y^-\) conditioned on [NEG]. This bidirectional learning enables the model to understand not only "what to do" but also "what not to do," thereby enhancing its capacity for faithfulness awareness.
Self-Induced Decoding (SID): At inference time, the trained control tokens are used to induce a contrastive pair of predictions. The logits of the faithful prediction conditioned on [POS] are scaled by \(1+\alpha\), and the logits of the unfaithful prediction conditioned on [NEG], scaled by \(\alpha\), are subtracted from them. This contrastive decoding further strengthens faithfulness. The best performance is reported at \(\alpha=0.2\). Compared to Context-Aware Decoding (CAD), SID exploits a more diverse range of the model's internal error types.
GroundBench Benchmark: A comprehensive LFQA faithfulness evaluation benchmark is designed, containing 5 datasets (ELI5-WebGPT, ExpertQA, HAGRID, CLAPNQ, QuoteSum) to cover different types of queries and retrieval sources. The key design is to ensure that the provided documents contain sufficient information to answer the questions, thereby providing a controlled evaluation setup.
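
The masking step described under Unfaithful Data Augmentation can be illustrated with a toy example. This is a minimal sketch under stated assumptions, not the authors' implementation: the retrieval scores, the helper names (`top_n_heads`, `head_output`, `layer_output`), and the single-query attention are all hypothetical, and a real setup would hook into the attention modules of the actual LLM.

```python
import math

def top_n_heads(retrieval_scores, n):
    """Return the n (layer, head) pairs with the highest retrieval score."""
    ranked = sorted(retrieval_scores, key=retrieval_scores.get, reverse=True)
    return set(ranked[:n])

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def head_output(query, keys, values, masked=False):
    """Single-query attention for one head.

    A masked head has its attention weights set to zero, so it mixes in
    no value vectors and contributes an all-zero output.
    """
    if masked:
        weights = [0.0] * len(keys)
    else:
        scale = math.sqrt(len(query))
        scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
        weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]

def layer_output(layer, heads, masked_heads=frozenset()):
    """Concatenate the per-head outputs of one layer, zeroing masked heads."""
    out = []
    for h, (query, keys, values) in enumerate(heads):
        is_masked = (layer, h) in masked_heads
        out.extend(head_output(query, keys, values, masked=is_masked))
    return out

# Toy usage: made-up retrieval scores for a 1-layer, 2-head model.
scores = {(0, 0): 0.05, (0, 1): 0.80}
masked = top_n_heads(scores, n=1)  # the paper uses N=100 on the full model
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[1.0, 2.0], [3.0, 4.0]]
heads = [([1.0, 0.0], keys, values), ([0.0, 1.0], keys, values)]
full = layer_output(0, heads)
ablated = layer_output(0, heads, masked)
```

In this toy setting the unmasked head's slice of the output is unchanged, while the masked head's slice becomes all zeros. Generating with such an ablated model is what would yield the unfaithful sample \(y^-\).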

Loss & Training¶

The loss is a standard cross-entropy objective summed over two terms, one for each control token:

\[\mathcal{L}(\theta) = -\mathbb{E}_{(\mathbf{x},\mathbf{c},y^+)}[\log p_\theta(y^+ | [\text{POS}] \oplus \mathbf{x}, \mathbf{c})] - \mathbb{E}_{(\mathbf{x},\mathbf{c},y^-)}[\log p_\theta(y^- | [\text{NEG}] \oplus \mathbf{x}, \mathbf{c})]\]
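
To make the two terms concrete, here is a minimal sketch of the loss for a single training pair. It assumes the per-token probabilities have already been obtained from the model. The prompt template in `build_prompt` and the function names are illustrative assumptions, since the summary does not specify the exact input format.

```python
import math

POS, NEG = "[POS]", "[NEG]"

def build_prompt(control, question, context):
    """Prepend the control token to the input (hypothetical template)."""
    return f"{control} {question}\n{context}"

def sequence_nll(token_probs):
    """Negative log-likelihood of a target sequence.

    token_probs[t] is the probability the model assigns to the t-th
    target token given the prompt and the preceding target tokens.
    """
    return -sum(math.log(p) for p in token_probs)

def fat_loss(pos_token_probs, neg_token_probs):
    """Sum of the two cross-entropy terms for one (y+, y-) pair.

    pos_token_probs: probabilities of the faithful answer y+ under the [POS] prompt.
    neg_token_probs: probabilities of the unfaithful answer y- under the [NEG] prompt.
    """
    return sequence_nll(pos_token_probs) + sequence_nll(neg_token_probs)

# Toy usage with made-up probabilities.
pos_prompt = build_prompt(POS, "Why is the sky blue?", "Doc 1: ...")
neg_prompt = build_prompt(NEG, "Why is the sky blue?", "Doc 1: ...")
loss = fat_loss([0.5, 0.5], [0.25])
```

Both terms are minimized: the model is trained to reproduce the unfaithful answer when it sees [NEG], which is what later makes the [NEG]-conditioned prediction usable as a contrastive signal.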

Training uses the long-context split of the FRONT dataset, with Llama-2-7B and 13B as backbones. Decoding utilizes a sampling strategy (temperature=1, top-p=0.95).
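
One decoding step of Self-Induced Decoding, combined with the sampling settings above, might look like the following sketch. It assumes the contrast is applied directly to next-token logits, as the description of SID states. The function names and the toy logits are hypothetical, and a real implementation would obtain the two logit vectors from forward passes with the [POS] and [NEG] prefixes.

```python
import math
import random

def sid_logits(pos_logits, neg_logits, alpha=0.2):
    """Amplify [POS]-conditioned logits and subtract [NEG]-conditioned ones."""
    return [(1 + alpha) * p - alpha * n for p, n in zip(pos_logits, neg_logits)]

def softmax(logits, temperature=1.0):
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_p_sample(probs, top_p=0.95, rng=random):
    """Sample a token id from the smallest high-probability set with mass >= top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]

def sid_step(pos_logits, neg_logits, alpha=0.2, temperature=1.0, top_p=0.95, rng=random):
    """One decoding step: contrast the two logit vectors, then sample."""
    probs = softmax(sid_logits(pos_logits, neg_logits, alpha), temperature)
    return top_p_sample(probs, top_p, rng)

# Toy vocabulary of 3 tokens; the logits are made-up numbers.
pos = [2.0, 1.0, 0.0]  # next-token logits with the [POS] prefix
neg = [0.5, 2.0, 0.0]  # next-token logits with the [NEG] prefix
contrasted = sid_logits(pos, neg)
token = sid_step(pos, neg, rng=random.Random(0))
```

In the toy example, token 1 is favored by the [NEG] prediction, so its contrasted logit drops below its [POS] logit, while token 0 is pushed up. Setting `alpha=0` recovers plain decoding with the [POS] prefix.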

Key Experimental Results¶

Main Results¶

Model	Method	Average Faithfulness (%)
GPT-4o	Prompting	82.33
GPT-4o-mini	Prompting	80.97
Llama-3.1-70B	Prompting	75.87
Llama-2-7B	SFT	72.98
Llama-2-7B	Self-RAG	68.60
Llama-2-7B	RECOMP	56.52
Llama-2-7B	Rhio	82.35
Llama-2-13B	SFT	74.40
Llama-2-13B	Rhio	83.77

Rhio-7B scores 9.37 points above SFT-7B (a 12.84% relative improvement), and Rhio-13B exceeds GPT-4o by 1.44 points (about 1.7% relative).
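
The percentages quoted for the main results appear to be relative gains over the baseline score rather than differences in percentage points. The short sketch below makes the distinction explicit using the averages from the table. The helper names are illustrative.

```python
def absolute_gain(new, base):
    """Difference in percentage points."""
    return new - base

def relative_gain(new, base):
    """Improvement as a percentage of the baseline score."""
    return 100.0 * (new - base) / base

# Average faithfulness (%) taken from the main results table.
avg = {
    "GPT-4o": 82.33,
    "SFT-7B": 72.98,
    "Rhio-7B": 82.35,
    "SFT-13B": 74.40,
    "Rhio-13B": 83.77,
}

rhio7_vs_sft = relative_gain(avg["Rhio-7B"], avg["SFT-7B"])
rhio13_vs_sft = relative_gain(avg["Rhio-13B"], avg["SFT-13B"])
rhio13_vs_gpt4o_points = absolute_gain(avg["Rhio-13B"], avg["GPT-4o"])
rhio7_vs_gpt4o_points = absolute_gain(avg["Rhio-7B"], avg["GPT-4o"])
```

Read this way, the 7B and 13B gains over SFT are roughly 12.8% and 12.6% relative. The margin of Rhio-7B over GPT-4o is only about 0.02 points.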

Ablation Study¶

Configuration	Average Faithfulness (7B)	Average Faithfulness (13B)	Description
Rhio (Full)	82.35	83.77	Full method
w/o SID	80.03	80.42	Without contrastive decoding; full Rhio is 2.90% / 4.17% higher (relative)
w/o FAT	72.98	74.40	Without faithfulness-aware tuning (scores match the SFT baseline); full Rhio is 12.84% / 12.59% higher (relative)

Key Findings¶

FAT is the most critical component: on its own it lifts average faithfulness from 72.98 to 80.03 (7B) and from 74.40 to 80.42 (13B). SID adds a further 2.32 (7B) and 3.35 (13B) points on top of FAT.
Negative sample augmentation via masking retrieval heads outperforms alternative strategies such as entity replacement, relation perturbation, and direct prompting.
Self-induced negative samples (generated by masking the retrieval heads of the same model) yield better results than using negative samples generated by other models.
\(\alpha=0.2\) is the optimal hyperparameter for SID; values too large lead to performance degradation.
SID slightly outperforms Context-Aware Decoding (CAD) by leveraging a more diverse range of error types.
In human evaluation, Rhio-13B reaches 87.5% Full Support versus 86.5% for GPT-4o, which is consistent with the automatic results.

Highlights & Insights¶

Correlation between Retrieval Heads and Faithfulness: Linking the model's internal attention mechanism (retrieval heads) to an external task attribute (faithfulness) through masking experiments provides a new perspective for understanding how RAG operates.
Training using Model's Own Weaknesses: The error patterns produced by masking retrieval heads are exactly the errors the model itself is prone to make. Training the model to overcome its weaknesses using its own "weaknesses" is a highly inspiring, bootstrap-style methodology.
Dual Use of Control Tokens: [POS] and [NEG] are used for discriminative learning during training, and for contrastive decoding during inference. A single design serves two purposes, which is extremely elegant.
Small Models Outperforming Large Models: The 7B model significantly outperforms baselines of the same scale and is on par with GPT-4o (82.35 vs. 82.33), while the 13B model exceeds it (83.77). This suggests that targeted training strategies can compensate for limitations in model capacity.

Limitations & Future Work¶

The detection algorithm for retrieval heads is adopted from existing work; the set of retrieval heads itself may not be perfect.
Validated only on the Llama-2 series, lacking evaluation on newer models (e.g., Llama-3, Qwen-2.5).
The number of masked heads is fixed at \(N=100\); the optimal setting may vary for different models.
SID requires two forward passes (one for [POS] and one for [NEG]), which increases inference costs.
Although GroundBench is comprehensive, the scale of each sub-dataset is relatively limited.
The method focuses on LFQA scenarios; its generalizability to short-form QA or other RAG tasks remains to be validated.

Related Work¶

Complementary to Self-RAG (which uses reflection tokens for self-evaluation): Self-RAG evaluates retrieval quality through reflection tokens, while Rhio distinguishes generation quality through control tokens.
Inspired by Context-Aware Decoding (CAD), but SID obtains a more effective contrastive signal through training.
The concept of retrieval heads originated from model interpretability research; this paper successfully applies it to practical model improvement.
Provides a new direction for RLHF/alignment techniques: utilizing internal model mechanisms (such as specific attention heads) to construct training signals.

Rating¶

Novelty: ⭐⭐⭐⭐⭐ The discovery and utilization of the correlation between retrieval heads and faithfulness are novel, with an overall ingenious method design.
Experimental Thoroughness: ⭐⭐⭐⭐⭐ Proposes a new benchmark, GroundBench, with exhaustive ablation studies and human evaluations.
Writing Quality: ⭐⭐⭐⭐ Clearly structured, progressing step-by-step from preliminary experiments to the full method.
Value: ⭐⭐⭐⭐⭐ A small model surpassing the faithfulness of GPT-4o; the method is highly practical, and the code is open-source.