On-Policy Self-Alignment with Fine-grained Knowledge Feedback for Hallucination Mitigation¶
Conference: ACL 2025
arXiv: 2406.12221
Code: Yes (GitHub)
Area: Hallucination Detection
Keywords: Hallucination Mitigation, Reinforcement Learning, Self-Alignment, Fine-grained Feedback, On-Policy Learning
TL;DR¶
Proposes RLFH (Reinforcement Learning for Hallucination), an on-policy self-alignment method in which the LLM acts as its own judge: it decomposes each response into atomic facts, evaluates their truthfulness and informativeness, converts the evaluations into token-level dense rewards, and optimizes the policy with online PPO to mitigate hallucination.
Background & Motivation¶
Hallucination remains one of the most critical challenges for Large Language Models (LLMs). Here, hallucination means generating content that is inconsistent with the model's knowledge boundary: stating incorrect facts, answering recklessly on questions beyond its knowledge, or evading questions it could actually answer.
Existing hallucination mitigation methods face three main limitations:
Off-policy sampling: Existing learning methods train on data generated by other models or older versions of the same model, leading to distribution shift. The training data does not reflect the behavior of the current model, which compromises optimization performance.
Coarse-grained feedback: Existing methods often assign a single score (good/bad) to the entire response. However, a response may contain both correct and incorrect facts, making coarse-grained feedback unable to pin down the issues precisely.
Inaccurate knowledge boundary detection: Existing methods detect the model's knowledge boundary through explicit prompting or internal state probing, but the results are often inconsistent.
Another class of approaches (editing-based methods) generates content first and then corrects it using external knowledge. However, this merely patches the output without improving the model's intrinsic knowledge utilization capability, and the coverage of external knowledge sources is limited.
The core idea of RLFH is to let the model explore its own knowledge boundary and self-correct its generation behavior through fine-grained, on-policy feedback.
Method¶
Overall Architecture¶
RLFH consists of a three-step loop:
- Response Generation: The current policy model \(\pi\) generates a response to the input prompt.
- Self-Evaluation: The policy model serves as its own judge to conduct a fine-grained evaluation of the response.
- Online Reinforcement Learning: The evaluation results are converted into token-level dense rewards to update the policy using PPO.
Key Designs¶
1. Hierarchical Atomic Fact Extraction¶
Function: Decomposes model responses into verifiable, minimal factual units.
Mechanism: Two-level decomposition—first splitting the response into sentences \(\{s_i\}_{i=1}^M\), and then extracting atomic facts \(\{e_{ij}\}_{j=1}^{N_i}\) from each sentence.
Design Motivation:
- Extracting statements after sentence-level splitting yields a finer granularity.
- The sentence-statement hierarchical structure facilitates mapping the evaluation results back to the original token positions later.
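The two-level decomposition can be sketched as follows. This is a minimal illustration, not the paper's implementation: the regex sentence splitter and the `extract_facts` callable are assumptions, and in RLFH the atomic-fact extraction would be a prompt to the policy model rather than the toy string split used here.

```python
import re
from typing import Callable, List, Tuple


def split_sentences(response: str) -> List[str]:
    """Naive sentence splitter (assumption): break after '.', '!' or '?'."""
    parts = re.split(r"(?<=[.!?])\s+", response.strip())
    return [p for p in parts if p]


def extract_hierarchy(
    response: str,
    extract_facts: Callable[[str], List[str]],
) -> List[Tuple[str, List[str]]]:
    """Return [(sentence s_i, [atomic facts e_i1, e_i2, ...]), ...]."""
    return [(s, extract_facts(s)) for s in split_sentences(response)]


def toy_extractor(sentence: str) -> List[str]:
    """Hypothetical stand-in for the LLM call: split on ' and '."""
    return [part.strip(" .") for part in sentence.split(" and ")]


hierarchy = extract_hierarchy(
    "Marie Curie was born in Warsaw and won two Nobel Prizes. She died in 1934.",
    toy_extractor,
)
for sentence, facts in hierarchy:
    print(sentence, "->", facts)
```

Keeping each fact attached to its parent sentence is what later narrows the search when an evaluation has to be located in the original response.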
2. Truthfulness Verification¶
The policy model itself retrieves relevant context from reference documents to verify each atomic fact. Each fact is classified into one of five levels:
- Correct (supported by evidence)
- Hedged Correct (correct, with uncertainty expressions)
- Vague (truthfulness cannot be determined)
- Hedged Wrong (wrong, with uncertainty expressions)
- Wrong (contradicts evidence)
The "Vague" category is introduced to handle statements that cannot be verified due to insufficient reference documents.
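A minimal sketch of the verification step with the five labels is given below. The `judge` callable is a hypothetical stand-in for prompting the policy model with a fact and its retrieved passages. Falling back to Vague when the judge output cannot be parsed is an assumption of this sketch, not something the paper states.

```python
from enum import Enum
from typing import Callable, List


class Truth(Enum):
    CORRECT = "Correct"
    HEDGED_CORRECT = "Hedged Correct"
    VAGUE = "Vague"
    HEDGED_WRONG = "Hedged Wrong"
    WRONG = "Wrong"


def verify_fact(
    fact: str,
    passages: List[str],
    judge: Callable[[str, List[str]], str],
) -> Truth:
    """Classify one atomic fact against retrieved reference passages."""
    if not passages:
        # No usable reference context: the fact cannot be verified.
        return Truth.VAGUE
    raw = judge(fact, passages).strip().lower()
    for label in Truth:
        if label.value.lower() == raw:
            return label
    # Unparseable judge output is treated as unverifiable (assumption).
    return Truth.VAGUE


def toy_judge(fact: str, passages: List[str]) -> str:
    """Hypothetical judge: 'Correct' if the fact appears verbatim in a passage."""
    return "Correct" if any(fact in p for p in passages) else "Wrong"


print(verify_fact("Paris is in France", ["Paris is in France."], toy_judge))
print(verify_fact("Paris is in Spain", [], toy_judge))
```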
3. Informativeness Assessment¶
Each statement's informativeness is rated on a 1-5 scale. Unlike truthfulness verification, this assessment takes the original question \(x\) and the complete response \(y\) as context, because informativeness requires a global judgment.
This design prevents the model from taking shortcuts: if only truthfulness rewards are provided, the model might learn to refuse most questions or provide extremely minimal answers to avoid errors. Informativeness rewards force the model to find a balance between accuracy and informativeness.
Loss & Training¶
Token-level Dense Reward¶
Truthfulness Reward:
- \(f\) maps the truthfulness labels to scalars (correct \(\rightarrow\) positive, wrong \(\rightarrow\) negative).
- \(|g(k_{\text{info}})|\) scales the reward by the statement's informativeness, so more important statements receive a larger reward or penalty. This reflects the hallucination snowball effect, in which a key mistake triggers a chain of further hallucinations.
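Read together, the two terms suggest a product form such as \(r_{\text{truth}} = f(k_{\text{truth}}) \cdot |g(k_{\text{info}})|\), where \(k_{\text{truth}}\) is a name assumed here for the statement's truthfulness label. The sketch below illustrates that reading. All numeric values in `F_TRUTH` and `G_INFO` are made up for illustration; only the signs for Correct and Wrong come from the description above.

```python
# Hypothetical scalar mappings; the real values are not given in this summary.
F_TRUTH = {
    "Correct": 1.0,
    "Hedged Correct": 0.5,
    "Vague": -0.25,
    "Hedged Wrong": -0.5,
    "Wrong": -1.0,
}
# g may be negative for low-informativeness statements (assumption),
# which is why only its magnitude is used as the weight.
G_INFO = {1: -0.125, 2: 0.25, 3: 0.5, 4: 0.75, 5: 1.0}


def truthfulness_reward(truth_label: str, info_score: int) -> float:
    """Return f(truth label) scaled by |g(informativeness score)|."""
    return F_TRUTH[truth_label] * abs(G_INFO[info_score])


# A wrong key statement is penalized more than a wrong minor one.
print(truthfulness_reward("Wrong", 5))    # -1.0
print(truthfulness_reward("Wrong", 2))    # -0.25
print(truthfulness_reward("Correct", 3))  # 0.5
```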
Informativeness Reward:
The reward is logarithmic in informativeness: gains saturate quickly while the penalty for low informativeness grows rapidly, which keeps the model from over-pursuing informativeness.
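The exact formula is not reproduced in these notes, so the following is only a sketch of why a logarithm has this asymmetric shape. The function name, the `reference` level, and the `eps` scale are all hypothetical.

```python
import math


def informativeness_reward(
    total_info: float, reference: float = 1.0, eps: float = 1.0
) -> float:
    """Hypothetical log-shaped reward: eps * log(total_info / reference)."""
    if total_info <= 0:
        raise ValueError("total_info must be positive")
    return eps * math.log(total_info / reference)


# Gains above the reference level flatten out ...
gain_1_to_2 = informativeness_reward(2.0) - informativeness_reward(1.0)
gain_3_to_4 = informativeness_reward(4.0) - informativeness_reward(3.0)
# ... while falling far below it is penalized steeply.
penalty = informativeness_reward(0.1)
print(gain_1_to_2, gain_3_to_4, penalty)
```

Under this shape, each extra unit of informativeness is worth less than the previous one, whereas a nearly empty answer incurs a large negative reward.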
Mapping to Token Positions: The Longest Common Subsequence (LCS) algorithm aligns each statement with the original response, so that statement-level evaluations can be mapped back to token positions and used as token-level dense rewards.
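The alignment can be sketched with a standard dynamic-programming LCS. Two simplifications here are assumptions of this sketch: whitespace tokens stand in for the model tokenizer, and each statement's reward is placed on its last matched token (the summary does not say where the reward is placed).

```python
from typing import List, Sequence, Tuple


def lcs_token_indices(response: Sequence[str], statement: Sequence[str]) -> List[int]:
    """Indices of response tokens that form an LCS with the statement tokens."""
    n, m = len(response), len(statement)
    # dp[i][j] = LCS length of response[i:] and statement[j:]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if response[i] == statement[j]:
                dp[i][j] = dp[i + 1][j + 1] + 1
            else:
                dp[i][j] = max(dp[i + 1][j], dp[i][j + 1])
    # Walk forward through the table, collecting matched response positions.
    indices, i, j = [], 0, 0
    while i < n and j < m:
        if response[i] == statement[j]:
            indices.append(i)
            i, j = i + 1, j + 1
        elif dp[i + 1][j] >= dp[i][j + 1]:
            i += 1
        else:
            j += 1
    return indices


def dense_rewards(response: Sequence[str], scored: List[Tuple[str, float]]) -> List[float]:
    """Place each statement's reward on its last matched response token."""
    rewards = [0.0] * len(response)
    for statement, reward in scored:
        matched = lcs_token_indices(response, statement.split())
        if matched:
            rewards[matched[-1]] += reward
    return rewards


tokens = "Paris is the capital of France and it has 2 million people".split()
rewards = dense_rewards(
    tokens,
    [("Paris is the capital of France", 1.0), ("Paris has 2 million people", -0.5)],
)
print(rewards)
```

In this example the first statement's reward lands on `France` (index 5) and the second on `people` (index 11), with every other position left at zero.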
PPO Optimization: Standard Proximal Policy Optimization (PPO) is then run online on the policy, using these token-level dense rewards.
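For reference, the clipped surrogate objective of standard PPO (Schulman et al., 2017) is shown below. The summary does not state which PPO variant or hyperparameters are used, so \(\rho_t\), \(\hat{A}_t\), and \(\epsilon\) are the generic symbols rather than names taken from the paper. In this setting \(a_t\) is the \(t\)-th response token, and \(\hat{A}_t\) would be estimated from the token-level dense rewards rather than from a single end-of-sequence reward.

\[
L^{\text{CLIP}}(\theta) = \mathbb{E}_t\left[\min\left(\rho_t(\theta)\,\hat{A}_t,\ \operatorname{clip}\left(\rho_t(\theta),\,1-\epsilon,\,1+\epsilon\right)\hat{A}_t\right)\right],
\qquad
\rho_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}
\]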
Key Experimental Results¶
Main Results (FactScore Evaluation)¶
| Model | Average Score | HotpotQA | SQuADv2 | Biography |
|---|---|---|---|---|
| Llama3.1-8B (Baseline) | 0.639 | 0.653 | 0.777 | 0.487 |
| DoLa | 0.546 | 0.524 | 0.713 | 0.399 |
| ITI | 0.646 | 0.649 | 0.776 | 0.512 |
| FACT_DPO | 0.645 | 0.652 | 0.778 | 0.506 |
| FACT_SFT | 0.653 | 0.635 | 0.783 | 0.541 |
| RLFH (Llama3.1-8B) | 0.686 | 0.714 | 0.786 | 0.558 |
| Qwen2.5-7B (Baseline) | 0.638 | 0.634 | 0.813 | 0.467 |
| RLFH (Qwen2.5-7B) | 0.668 | 0.651 | 0.830 | 0.523 |
Ablation Study: Impact of Reward Granularity¶
| Model (Qwen2.5-7B) | Average Score | HotpotQA | SQuADv2 | Biography |
|---|---|---|---|---|
| Baseline | 0.638 | 0.634 | 0.813 | 0.467 |
| Response-level | 0.651 | 0.639 | 0.819 | 0.493 |
| Sentence-level | 0.655 | 0.637 | 0.821 | 0.506 |
| Statement-level | 0.668 | 0.651 | 0.830 | 0.523 |
Ablation Study: Impact of Judge Model¶
| Judge Model → Qwen2.5-7B | Average Score |
|---|---|
| DeepSeekV2-Lite | 0.643 |
| Llama3.1-8B | 0.666 |
| Qwen2.5-7B (Fixed) | 0.668 |
| On-Policy (Self) | 0.668 |
Key Findings¶
- RLFH achieves the highest FactScore on every dataset: On Llama3.1-8B, the average score improves from 0.639 to 0.686 (+0.047 absolute, +7.4% relative), and on Qwen2.5-7B, it increases from 0.638 to 0.668 (+0.030 absolute, +4.7% relative).
- Cross-dataset generalization: Trained only on HotpotQA, RLFH still improves on the two out-of-distribution datasets. On Llama3.1-8B, SQuADv2 rises from 0.777 to 0.786 and Biography from 0.487 to 0.558; on Qwen2.5-7B, SQuADv2 rises from 0.813 to 0.830 and Biography from 0.467 to 0.523.
- Finer granularity is better: Statement-level rewards (0.668 average) outperform both sentence-level (0.655) and response-level (0.651) rewards on every dataset, validating the value of fine-grained feedback.
- On-policy self-evaluation advantage: Letting the model act as its own judge performs as well as or better than an external judge of comparable scale. In the Qwen2.5-7B ablation, the on-policy judge (0.668) matches the fixed Qwen2.5-7B judge (0.668) and exceeds the Llama3.1-8B (0.666) and DeepSeekV2-Lite (0.643) judges; the on-policy setting is also reported as the best on Llama3.1-8B.
- Accuracy-informativeness trade-off: After training, the model's response rate decreases slightly (becoming more conservative), but the information provided is more accurate—with a substantial increase in the proportion of highly accurate responses.
- Significant reduction in errors and unverifiable content: Distribution analysis shows that RLFH effectively suppresses erroneous and vague statements.
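The percentage gains quoted for the average FactScore are relative improvements over the baseline, which can be checked directly from the table values. The helper name `gains` is only illustrative.

```python
def gains(baseline: float, tuned: float):
    """Return (absolute gain, relative gain in percent)."""
    absolute = tuned - baseline
    relative_pct = 100.0 * absolute / baseline
    return round(absolute, 3), round(relative_pct, 1)


# Average FactScore (baseline, RLFH) taken from the main results table.
results = {"Llama3.1-8B": (0.639, 0.686), "Qwen2.5-7B": (0.638, 0.668)}
for model, (base, tuned) in results.items():
    print(model, gains(base, tuned))
# Llama3.1-8B (0.047, 7.4)
# Qwen2.5-7B (0.03, 4.7)
```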
Highlights & Insights¶
- "Policy as Judge" paradigm: Letting the optimized model serve as its own judge not only eliminates the dependency on external reward models but also ensures consistency between evaluation and the current policy distribution—an elegant design.
- Hierarchical fact decomposition + LCS mapping: Mapping statement-level, natural-language evaluations back to token positions turns natural-language feedback into numerical rewards.
- Accounting for the hallucination snowball effect: Weighting the truthfulness reward by informativeness in the reward design ensures that errors in key statements are penalized more heavily—which is closer to reality than simple uniform rewards.
- Informativeness preventing degeneration: Prevents the model from learning degenerate strategies like "avoiding errors by remaining silent".
Limitations & Future Work¶
- The method targets factual knowledge; it has not yet been validated on other types of hallucination (e.g., reasoning hallucinations).
- Existing evaluation benchmarks are limited in scope and may not fully capture the complexity of hallucinations.
- Automatic truthfulness verification itself may contain errors, which can affect the quality of the training signal.
- Model self-evaluation may suffer from "self-reinforcement bias"—the model might simultaneously generate an error and verify it as correct.
- Currently only validated on 7-8B models; the performance on larger-scale models remains to be observed.
Related Work & Insights¶
- FactScore (Min et al., 2023) provides a pipeline for statement-level factuality evaluation.
- RLHF (Ouyang et al., 2022) is the foundational framework for LLM alignment.
- DoLa (Chuang et al., 2023) is a training-free method that improves factuality by contrasting the output distributions of different layers during decoding.
- ITI (Li et al., 2023) improves model truthfulness through inference-time intervention.
- This work combines fine-grained evaluation and online reinforcement learning, presenting a more systematic hallucination mitigation solution.
Rating¶
- Novelty: ⭐⭐⭐⭐ — The combination of "self-judgment + fine-grained dense rewards + online RL" is novel, though individual components have precedents.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ — Very comprehensive, with 3 datasets, multiple baseline models, granularity ablations, judge model ablations, and distribution analysis.
- Writing Quality: ⭐⭐⭐⭐ — Clear description of the methodology with rich diagrams, although there are many mathematical symbols.
- Value: ⭐⭐⭐⭐⭐ — Hallucination mitigation is a highly active research direction, and the proposed method is practical and delivers consistent improvements.