Aligning Large Language Models to Follow Instructions and Hallucinate Less via Effective Data Filtering¶

Conference: ACL 2025
arXiv: 2502.07340
Code: GitHub
Area: Hallucination Detection
Keywords: Hallucination Mitigation, Data Filtering, Instruction Tuning, Knowledge Alignment, Internal Consistency

TL;DR¶

This paper proposes the NOVA framework, which filters knowledge-aligned, high-quality instruction data by measuring the LLM's familiarity with instructions via Internal Consistency Probing (ICP) and familiarity with target responses via Semantic Equivalence Identification (SEI). Fine-tuning LLaMA-3-8B with only 5% of selected data achieves an 8.6-point improvement on BioGEN and a 7.2-point improvement on FollowRAG, while preserving instruction-following capability.

Background & Motivation¶

Background: Instruction tuning is a crucial step for LLM alignment. However, studies show that fine-tuning LLMs on data containing unfamiliar knowledge encourages overconfidence and hallucination.

Limitations of Prior Work: (a) RL-based methods (e.g., FLAME-DPO) reduce hallucination through preference learning after instruction tuning, but deteriorate instruction-following capabilities and incur extra data and API costs; (b) Existing data filtering methods (e.g., IFD, CaR, Nuggets) focus solely on quality, selecting high-quality data that often contains expert-level knowledge unfamiliar to the LLM, which paradoxically exacerbates hallucination.

Key Challenge: High-quality instruction data often contains deeper expert knowledge (correctness \(\uparrow\)), which may not have been learned by the LLM during pre-training (familiarity \(\downarrow\)), thereby intensifying hallucination.

Goal: To simultaneously achieve "instruction following" and "hallucination reduction" during the instruction tuning stage by filtering data that is both high-quality and knowledge-aligned.

Method¶

Overall Architecture¶

NOVA = ICP (measuring instruction familiarity) + SEI (measuring response familiarity) + Quality RM (ensuring data quality). The final rank is computed as \((\text{familiarity\_rank} + \text{quality\_rank}) / 2\), and the top-\(k\%\) of the data is selected for fine-tuning.

Key Designs¶

Internal Consistency Probing (ICP):
- Function: Measures the LLM's understanding of a given instruction \(q\).
- Mechanism: Generates \(K\) responses for instruction \(q\) and extracts the internal state of the last token of each response as a sentence embedding, giving \(E=[e_1,...,e_K]\). Assuming \(E \sim \mathcal{N}(\mu, \Sigma)\), it calculates the differential entropy \(F_{ins}(q) = \frac{1}{2}\sum_{i=1}^d \log\lambda_i + G\), where \(\lambda_i\) are the eigenvalues of the covariance matrix \(\Sigma\), \(d\) is the embedding dimension, and \(G\) is a constant that depends only on \(d\). Low entropy \(\rightarrow\) consistent responses \(\rightarrow\) high LLM familiarity with the instruction.
- Design Motivation: Compared to surface-level metrics like perplexity or Rouge-L, the differential entropy of internal states captures more fine-grained semantic consistency information.
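
The entropy computation in ICP can be sketched as follows. This is a minimal illustration, not the authors' implementation: it uses the standard Gaussian differential entropy \(\frac{1}{2}\log\det\Sigma + \frac{d}{2}(\log 2\pi + 1)\), where \(\log\det\Sigma\) equals the sum of the log eigenvalues of \(\Sigma\). The function names, the toy 2-dimensional embeddings, and the small `ridge` term (added so the covariance stays invertible when \(K\) is small relative to \(d\)) are all assumptions made for the example. A real pipeline would take the embeddings from the LLM's hidden states and would typically use a numerical library.

```python
import math


def covariance(embeddings):
    """Sample covariance matrix (d x d) of K embedding vectors of dimension d."""
    k = len(embeddings)
    if k < 2:
        raise ValueError("need at least two embeddings")
    d = len(embeddings[0])
    mean = [sum(e[j] for e in embeddings) / k for j in range(d)]
    centered = [[e[j] - mean[j] for j in range(d)] for e in embeddings]
    return [
        [sum(c[a] * c[b] for c in centered) / (k - 1) for b in range(d)]
        for a in range(d)
    ]


def log_det_spd(matrix):
    """log det of a symmetric positive-definite matrix via Cholesky.

    This equals the sum of the log eigenvalues of the matrix.
    """
    d = len(matrix)
    lower = [[0.0] * d for _ in range(d)]
    for i in range(d):
        for j in range(i + 1):
            s = sum(lower[i][t] * lower[j][t] for t in range(j))
            if i == j:
                lower[i][j] = math.sqrt(matrix[i][i] - s)
            else:
                lower[i][j] = (matrix[i][j] - s) / lower[j][j]
    return 2.0 * sum(math.log(lower[i][i]) for i in range(d))


def instruction_entropy(embeddings, ridge=1e-6):
    """Gaussian differential entropy of the K response embeddings.

    Lower values mean more consistent responses. `ridge` is an assumed
    regularizer, not something specified in the summary above.
    """
    d = len(embeddings[0])
    sigma = covariance(embeddings)
    for i in range(d):
        sigma[i][i] += ridge
    constant = 0.5 * d * (math.log(2.0 * math.pi) + 1.0)
    return 0.5 * log_det_spd(sigma) + constant


# Toy embeddings: a tight cluster (consistent) and a scattered one (inconsistent).
consistent = [[1.0, 0.0], [1.01, 0.01], [0.99, -0.01], [1.0, 0.02]]
inconsistent = [[1.0, 0.0], [-0.5, 2.0], [3.0, -1.0], [0.0, 1.5]]
print(instruction_entropy(consistent), instruction_entropy(inconsistent))
```

In this toy setup the tightly clustered embeddings yield the lower entropy, which corresponds to the instruction being treated as more familiar.
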
Semantic Equivalence Identification (SEI):
- Function: Measures the LLM's familiarity with the knowledge in the target response \(r\).
- Mechanism: (1) Employs an NLI model to perform bidirectional entailment detection on the \(K\) generated responses, clustering semantically equivalent responses into \([c_1,...,c_M]\); (2) Applies a voting strategy to determine which cluster the target response \(r\) belongs to; (3) Calculates \(F_{res}(r) = k_{target}/\sum_{m=1}^{M} k_m\), where \(k_m\) is the number of responses in cluster \(c_m\) and \(k_{target}\) is the size of the cluster assigned to \(r\). A higher share of the target cluster among all responses indicates greater familiarity of the LLM with the target response content.
- Design Motivation: Since target responses originate from human annotations or GPT-4, the LLM's internal states cannot effectively represent these external inputs. Thus, NLI-based semantic clustering and voting are utilized as an alternative.
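
The three SEI steps can be sketched as below. This is an illustrative sketch under stated assumptions, not the paper's code. The `entails(a, b)` callable stands in for an NLI model and is hypothetical. The voting rule is also an assumption, since the summary does not spell it out: here the target joins the cluster with the most members that it is bidirectionally equivalent to. The case-insensitive string match used as a toy entailment check is only there to keep the example self-contained.

```python
from typing import Callable, List


def cluster_by_equivalence(
    responses: List[str], entails: Callable[[str, str], bool]
) -> List[List[str]]:
    """Greedily group responses that entail each other in both directions."""
    clusters: List[List[str]] = []
    for resp in responses:
        for cluster in clusters:
            representative = cluster[0]
            if entails(resp, representative) and entails(representative, resp):
                cluster.append(resp)
                break
        else:
            # No existing cluster matched, so start a new one.
            clusters.append([resp])
    return clusters


def response_familiarity(
    target: str, responses: List[str], entails: Callable[[str, str], bool]
) -> float:
    """Share of sampled responses in the cluster the target is voted into."""
    if not responses:
        raise ValueError("need at least one sampled response")
    clusters = cluster_by_equivalence(responses, entails)
    # Each cluster member that is equivalent to the target casts one vote.
    votes = [
        sum(1 for m in c if entails(target, m) and entails(m, target))
        for c in clusters
    ]
    best = max(range(len(clusters)), key=lambda i: votes[i])
    if votes[best] == 0:
        return 0.0  # the target matches none of the model's own answers
    return len(clusters[best]) / len(responses)


def toy_entails(a: str, b: str) -> bool:
    """Stand-in for an NLI model: case-insensitive exact match."""
    return a.strip().lower() == b.strip().lower()


samples = ["Paris", "paris", "Lyon", "PARIS"]
print(response_familiarity("Paris", samples, toy_entails))
print(response_familiarity("Rome", samples, toy_entails))
```

With a real NLI model, `entails` would wrap a classifier call and return whether the predicted label is entailment.
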
Expert-Aligned Quality Reward Model:
- Function: Trains a reward model using 3,751 expert-annotated preference data points to evaluate data quality.
- Mechanism: The final score is computed as \(R_{final}^{(i)} = \frac{1}{2}(R_{familiarity}^{(i)} + R_{quality}^{(i)})\), balancing familiarity and quality.
- Design Motivation: When only considering familiarity (w/o Quality RM), the selected data significantly reduces hallucinations but drastically degrades instruction-following capabilities (MT-Bench drops from 64.6 to 48.6).
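
The rank averaging and top-\(k\%\) selection can be sketched as follows. This is a minimal sketch, not the released implementation. It assumes each sample already has one familiarity score and one quality score where higher is better. How ICP and SEI are merged into a single familiarity score is not covered here. The function names and the tie-breaking by sample index are choices made for the example.

```python
from typing import List


def ranks_descending(scores: List[float]) -> List[int]:
    """Rank samples so that the highest score gets rank 1."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    ranks = [0] * len(scores)
    for position, index in enumerate(order, start=1):
        ranks[index] = position
    return ranks


def select_top_fraction(
    familiarity: List[float], quality: List[float], fraction: float
) -> List[int]:
    """Return indices of the samples kept after averaging the two ranks."""
    if len(familiarity) != len(quality):
        raise ValueError("score lists must have the same length")
    fam_rank = ranks_descending(familiarity)
    qual_rank = ranks_descending(quality)
    # Final rank = mean of the familiarity rank and the quality rank.
    final_rank = [(f + q) / 2 for f, q in zip(fam_rank, qual_rank)]
    n_keep = max(1, int(len(final_rank) * fraction))
    # Smaller averaged rank is better; sorted() is stable, so ties keep index order.
    order = sorted(range(len(final_rank)), key=lambda i: final_rank[i])
    return sorted(order[:n_keep])


familiarity_scores = [0.9, 0.2, 0.7, 0.4]
quality_scores = [0.3, 0.95, 0.8, 0.1]
print(select_top_fraction(familiarity_scores, quality_scores, fraction=0.5))
```

Averaging ranks rather than raw scores avoids having to put the familiarity and quality scores on a common scale.
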

Loss & Training¶

Experiments are conducted with LLaMA-3-8B and LLaMA-3-70B on the Alpaca (52K) and Alpaca-GPT4 datasets. The top 5%/10%/15% of the data is selected for SFT. The NLI model is DeBERTa-v3.

Key Experimental Results¶

Main Results¶

LLaMA-3-8B, Alpaca-GPT4, 5% data selection:

Method	BioGEN(FactScore)↑	LongFact-Obj↑	FollowRAG-Avg↑	MT-Bench↑
Vanilla-100%	41.9	84.7	38.1	64.3
IFD-5%	46.7	84.4	42.3	65.0
Nuggets-5%	47.2	87.0	41.5	66.2
FLAME-DPO	46.3	87.3	41.5	56.2
NOVA-5%	50.5	90.1	45.3	64.6

Improvements of NOVA-5% relative to Vanilla-100%: BioGEN +8.6, LongFact-Obj +5.4, FollowRAG +7.2, MT-Bench +0.3.

Ablation Study¶

Contributions of each component (LLaMA-3-8B, Alpaca-GPT4, 5%):

Configuration	BioGEN↑	MT-Bench↑
Full NOVA	50.5	64.6
-w/o ICP	47.6	64.1
-w/o SEI	48.3	63.8
-w/o Quality RM	55.6	48.6
-w/o ICP & SEI	43.7	65.2

Comparison of alternative ICP methods:

ICP Alternatives	BioGEN↑	MT-Bench↑
Internal States (NOVA)	50.5	64.6
Perplexities	48.4	62.2
Rouge-L	47.9	61.5
External Embedding Models	49.8	63.9

Key Findings¶

Surpassing 100% full-data training with only 5% of the data: NOVA-5% beats Vanilla-100% on every metric in the main results table, although the MT-Bench margin is small (64.6 vs 64.3).
RL-based methods (FLAME-DPO, SELF-EVAL) severely compromise instruction-following while reducing hallucinations: MT-Bench scores drop by 8.1 and 11.2, respectively.
Pure quality-based data filtering can exacerbate hallucinations: For instance, IFD increases the number of generated facts on LongFact (39.2 vs 32.0).
Quality RM is key to maintaining instruction-following capabilities: Without it, BioGEN achieves higher scores (55.6 vs 50.5) but MT-Bench collapses (48.6 vs 64.6).
Internal states are more effective than external embeddings: Internal states capture fine-grained information that might otherwise be lost during the decoding phase.
Scalability to 70B: NOVA-5%-70B achieves 60.9 (+7.2 Gain) on BioGEN.

Highlights & Insights¶

Resolves a fundamental trade-off: Co-optimizes two potentially conflicting objectives through data filtering without introducing extra RL stages.
Novelty of ICP: Employs the differential entropy of LLM internal states to measure consistency, capturing minor semantic differences more effectively than surface-level metrics.
NLI+Voting design of SEI: Cleverly addresses the challenge where target responses from external models cannot be effectively represented by internal states of the local LLM.
Quality RM as a balancer: Pure familiarity-based filtering significantly mitigates hallucinations but hurts instruction-following capability, so the reward model's role as a balancer is crucial.

Limitations & Future Work¶

Requires generating \(K\) responses for each instruction, which increases the offline cost of data filtering (though it has zero impact on inference time).
Applicable only to single-turn instruction data; multi-turn dialog scenarios remain unexplored.
The Quality RM requires training on 3,751 expert preference data points, introducing additional data requirements.
NLI models may yield inaccurate matches of semantic equivalence on long texts or domain-specific professional fields.

Related Work & Insights¶

Gekhman et al. (2024): Provides theoretical foundations showing that fine-tuning LLMs on new knowledge encourages hallucinations.
FLAME (NeurIPS 2024): Offers RL-based hallucination mitigation, but compromises instruction-following capabilities.
Insight: 'Data selection' can be more efficient than 'extra training stages (RL)' in solving alignment issues, as it prevents the problems at their root.

Rating¶

Novelty: ⭐⭐⭐⭐ The designs of ICP and SEI are novel, taking a unique perspective to resolve hallucinations through knowledge alignment.
Experimental Thoroughness: ⭐⭐⭐⭐⭐ Includes 3 hallucination benchmarks, 2 instruction benchmarks, comprehensive ablation studies, alternative comparisons, and human evaluations.
Writing Quality: ⭐⭐⭐⭐ Clear presentation and well-justified motivations, though the notation is heavy.
Value: ⭐⭐⭐⭐⭐ Highly instructive for LLM alignment research, providing a simple yet effective methodology.