
Monitoring Decoding: Mitigating Hallucination via Evaluating the Factuality of Partial Response during Generation

Conference: ACL 2025
arXiv: 2503.03106
Code: None
Area: Hallucination Detection
Keywords: Hallucination Mitigation, Decoding Strategy, Monitor Function, Tree Search, Partial Response Evaluation

TL;DR

A Monitoring Decoding (MD) framework is proposed to dynamically monitor the factuality of partial responses during generation. It identifies hallucination-prone tokens via a monitor function and selectively revises these key tokens using a tree-search strategy, significantly improving factual accuracy while maintaining efficiency.

Background & Motivation

Large Language Models (LLMs) perform well on tasks such as question answering, summarization, and reasoning, but remain highly prone to hallucination: generating content that is plausible but factually incorrect. Existing hallucination mitigation methods face the following challenges:

Incompatibility of Full-length Sampling: Best-of-N (BoN) strategies and self-consistency methods require generating multiple complete responses, introducing substantial latency overhead.

Overconfidence Issue: Models may display extremely high confidence in hallucinated tokens, leading multiple samplings to repeatedly generate the same erroneous outputs. High self-consistency does not equate to factual correctness.

Key Findings: The authors observe that usually only a few key tokens trigger hallucinations. Replacing these specific tokens (e.g., replacing "24" with "It") can transform an erroneous response into a correct one. This implies that resampling the entire response is unnecessary.

Core Problem: Is it necessary to resample multiple highly similar full-length responses to improve factuality? The answer is no; targeted token-level intervention is sufficient.

Method

Overall Architecture

The MD framework consists of two core components:

  • In-process Detection: Continuously monitors the factuality of every \(m\) newly generated tokens during the generation process.
  • Tree-based Revision: Performs tree-search style resampling and pruning on tokens flagged as suspicious.

Pipeline: Input prompt \(\rightarrow\) Model generates \(m\) tokens at a time \(\rightarrow\) Monitor function evaluation \(\rightarrow\) If passed, keep and continue generating \(\rightarrow\) If hallucination risk is detected, trigger tree-search revision \(\rightarrow\) Select the best path to continue.
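
The pipeline above can be sketched as a simple control loop. This is a minimal illustration, not the authors' implementation: the callables `generate_block`, `accept`, and `revise`, the `eos` marker, and the scripted toy tokens are all hypothetical stand-ins for the target model, the monitor check, and the tree-search revision.

```python
from typing import Callable, List

def monitoring_decode(
    prompt: List[str],
    generate_block: Callable[[List[str], int], List[str]],  # hypothetical: next m tokens
    accept: Callable[[List[str], List[str]], bool],         # hypothetical: monitor check
    revise: Callable[[List[str], int], List[str]],          # hypothetical: tree-search revision
    m: int,
    max_new_tokens: int,
    eos: str = "</s>",                                       # assumed end-of-sequence marker
) -> List[str]:
    out = list(prompt)
    produced = 0
    while produced < max_new_tokens:
        block = generate_block(out, m)       # model generates m tokens at a time
        if not block:
            break
        if not accept(out, block):           # hallucination risk detected
            block = revise(out, len(block))  # replace the rejected block
        out.extend(block)                    # keep the block and continue
        produced += len(block)
        if eos in block:
            break
    return out[len(prompt):]

# Toy run: a scripted "model" whose last block contains a wrong year.
script = iter([["The", "ship"], ["sank", "in"], ["1915", "</s>"]])
result = monitoring_decode(
    prompt=["Q:"],
    generate_block=lambda prefix, m: next(script, []),
    accept=lambda prefix, block: "1915" not in block,
    revise=lambda prefix, m: ["1912", "</s>"],
    m=2,
    max_new_tokens=10,
)
print(result)
```

Only the rejected block is regenerated; the two earlier blocks pass the check and are kept unchanged.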

Key Designs

  1. Monitor Function:

    • Core Idea: Utilize a larger reference model \(f^*\) to evaluate the credibility of the tokens generated by the target model \(f_\theta\).
    • Computation: Weighted ratio \(r_\beta = \sum_{s=1}^{m} w_s^t \cdot \frac{p^*(y_s^t | \mathbf{y}^{<t}, y_{<s}^t)}{p_\theta(y_s^t | \mathbf{y}^{<t}, y_{<s}^t)}\)
    • Design Motivation: Hallucinated but overconfident tokens have high probabilities in the target model but low probabilities in the reference model, resulting in a low ratio.
    • Weight Design: \(w_s^t = 1/|(\mathbf{y}^{<t}, y_{<s}^t)|\), where earlier tokens are assigned larger weights because they lay the foundation for subsequent generation.
  2. Generation with Rejection:

    • Acceptance Probability: \(p(\text{accept } \mathbf{y}^t) = \min\{1, r_\beta(\mathbf{y}^t)\}\)
    • Adaptive Threshold: \(\gamma^t = \gamma_0 \sum_{s=1}^{m} w_s^t\), where \(\gamma_0 \in [0,1]\)
    • If the acceptance probability exceeds the threshold, the block of \(m\) tokens is retained; otherwise, revision is triggered.
  3. Tree-based Revision:

    • Function: Performs tree-search style regeneration token-by-token for the \(m\) rejected tokens.
    • Core Idea: Samples Top-\(N\) candidate tokens at each step, prunes and retains Top-\(K\) paths using the monitor function, and expands layer-by-layer up to \(m\) steps.
    • Design Motivation: Balances the exploration space and computational efficiency—avoiding the redundancy of full-length sampling and the singularity of greedy decoding.
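
The three designs above can be illustrated together on a toy example. The sketch below is written under stated assumptions and is not the paper's code: the probability tables `TGT` and `REF`, the helper `top_n`, the guard against an empty prefix, and the value `gamma0 = 0.6` are all invented for illustration. The toy models condition only on the previous token.

```python
from typing import Callable, List, Sequence

# Hypothetical interface: prob(context, token) -> P(token | context).
ProbFn = Callable[[Sequence[str], str], float]

def weights(prefix_len: int, m: int) -> List[float]:
    # w_s = 1 / (number of tokens preceding the s-th token of the block).
    # max(1, ...) guards against an empty prefix; that guard is an assumption.
    return [1.0 / max(1, prefix_len + s) for s in range(m)]

def monitor_ratio(prefix, block, p_ref: ProbFn, p_tgt: ProbFn) -> float:
    # r_beta = sum_s w_s * p_ref(y_s | context) / p_tgt(y_s | context)
    ctx, total = list(prefix), 0.0
    for w, tok in zip(weights(len(prefix), len(block)), block):
        total += w * p_ref(ctx, tok) / p_tgt(ctx, tok)
        ctx.append(tok)
    return total

def is_accepted(prefix, block, p_ref, p_tgt, gamma0: float) -> bool:
    # Keep the block when min(1, r_beta) exceeds gamma0 * sum of weights.
    threshold = gamma0 * sum(weights(len(prefix), len(block)))
    return min(1.0, monitor_ratio(prefix, block, p_ref, p_tgt)) > threshold

def tree_revise(prefix, m, top_n, p_ref, p_tgt, n: int, k: int) -> List[str]:
    # Expand every kept path with its top-n tokens, keep the k paths with
    # the highest monitor score, and repeat for m levels.
    paths: List[List[str]] = [[]]
    for _ in range(m):
        cands = [path + [tok] for path in paths
                 for tok in top_n(list(prefix) + path, n)]
        cands.sort(key=lambda c: monitor_ratio(prefix, c, p_ref, p_tgt),
                   reverse=True)
        paths = cands[:k]
    return paths[0]

# Toy distributions (invented): the target model is overconfident in "1915".
TGT = {"in": {"1915": 0.7, "1912": 0.3},
       "1915": {".": 0.9, "!": 0.1}, "1912": {".": 0.9, "!": 0.1}}
REF = {"in": {"1915": 0.05, "1912": 0.95},
       "1915": {".": 0.95, "!": 0.05}, "1912": {".": 0.95, "!": 0.05}}

def p_tgt(ctx, tok): return TGT[ctx[-1]][tok]
def p_ref(ctx, tok): return REF[ctx[-1]][tok]

def top_n(ctx, n):
    dist = TGT[ctx[-1]]
    return sorted(dist, key=dist.get, reverse=True)[:n]

prefix = ["The", "Titanic", "sank", "in"]
greedy_block = ["1915", "."]
revised = tree_revise(prefix, 2, top_n, p_ref, p_tgt, n=2, k=3)
print(is_accepted(prefix, greedy_block, p_ref, p_tgt, 0.6), revised)
```

In this toy setting the greedy block is rejected because the reference model assigns "1915" a much lower probability than the target model does, and the tree search recovers the alternative that the reference model supports.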

Loss & Training

MD is a training-free, inference-time framework and involves no additional training:

  • The target model is used directly (e.g., Llama-2-7B-Chat).
  • The reference model is a larger model of the same architecture (e.g., Llama-2-70B-Chat).
  • Hyperparameters: sample size \(N=2\), number of retained paths \(K=3\).

Key Experimental Results

Main Results

| Model | Method | TruthfulQA (T×I%) | TriviaQA (EM) | NQ-Open (EM) | GSM8K (Acc) |
|---|---|---|---|---|---|
| Llama-2 | Greedy | 37.9 | 64.8 | 36.6 | 24.2 |
| Llama-2 | USC | 39.4 | 66.8 | 38.6 | 23.4 |
| Llama-2 | MD | 44.1 (+6.2) | 72.1 (+7.6) | 40.5 (+3.7) | 27.5 (+3.3) |
| Llama-3 | Greedy | 42.4 | 72.4 | 39.6 | 81.4 |
| Llama-3 | MD | 46.1 (+3.7) | 80.8 (+8.4) | 47.4 (+6.8) | 85.2 (+3.8) |
| Gemma-2 | Greedy | 43.6 | 54.0 | 23.0 | 60.9 |
| Gemma-2 | MD | 50.2 (+6.6) | 64.6 (+10.6) | 31.0 (+8.0) | 79.9 (+19.0) |

Efficiency Comparison

| Method | Latency (ms/token) | Throughput (token/s) |
|---|---|---|
| Greedy | 19.94 (×1.00) | 50.68 (×1.00) |
| USC | 245.76 (×12.32) | 4.06 (×0.08) |
| FSC | 316.72 (×15.88) | 3.15 (×0.06) |
| MD | 113.78 (×5.70) | 18.99 (×0.37) |
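
The multipliers in the efficiency comparison are ratios against a baseline, and the same arithmetic gives the MD-versus-USC comparison directly. The short sketch below recomputes them from the reported figures; the helper name `relative` is made up for this example.

```python
# Figures copied from the efficiency comparison.
latency_ms = {"Greedy": 19.94, "USC": 245.76, "FSC": 316.72, "MD": 113.78}
throughput = {"Greedy": 50.68, "USC": 4.06, "FSC": 3.15, "MD": 18.99}

def relative(table, baseline):
    # Express every entry as a multiple of the chosen baseline entry.
    return {name: value / table[baseline] for name, value in table.items()}

lat_vs_greedy = relative(latency_ms, "Greedy")  # MD: about 5.7x greedy latency
lat_vs_usc = relative(latency_ms, "USC")        # MD: about 0.46x USC latency
thr_vs_usc = relative(throughput, "USC")        # MD: about 4.7x USC throughput

for name in ("USC", "FSC", "MD"):
    print(f"{name}: latency x{lat_vs_greedy[name]:.2f} vs Greedy")
print(f"MD vs USC: latency x{lat_vs_usc['MD']:.2f}, throughput x{thr_vs_usc['MD']:.2f}")
```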

Ablation Study

| Sample Size \(N\) | TriviaQA (EM) |
|---|---|
| 1 (= Greedy) | 64.8 |
| 2 | ~70 |
| 4 | ~71 |
| 6+ | Stabilized (~72) |

  • Raising the threshold \(\gamma_0\) from 0 to positive values yields stable improvements, showing that the method is robust to this parameter.

Key Findings

  1. MD achieves its largest improvements on Gemma-2: TriviaQA increases by 10.6 points and GSM8K by 19.0 points, suggesting greater effectiveness on smaller models.
  2. Performance of baseline methods is unstable: DoLa even degrades performance on reasoning tasks (GSM8K -7.7%), and ID lowers GSM8K by 13.3% on Llama-2.
  3. The latency of MD is only about half of USC, with a throughput 4.7 times that of USC, demonstrating a clear efficiency advantage.
  4. Case studies show that MD can accurately locate critical hallucinated tokens, correcting the overall response by modifying only a few tokens.

Highlights & Insights

  1. Granularity Insight: Not all tokens require revision—most "easy" tokens remain consistent across samplings, and only a few "hard" tokens trigger hallucinations. This finding refines hallucination mitigation from the response level to the token level.
  2. Nature of Overconfidence: The model's overconfidence in hallucinated tokens causes self-consistency strategies to fail. MD bypasses this issue by introducing an external reference model.
  3. Balance of Efficiency and Effectiveness: Selective token resampling combined with tree-search pruning achieves better performance with lower overhead compared to full-length sampling.

Limitations & Future Work

  1. Reliance on Reference Models: It requires a larger reference model with the same architecture (e.g., 70B), increasing resource requirements in practical deployment.
  2. Knowledge Coverage: If the factual information is absent in the training data, the monitor function cannot detect it. This limitation could be alleviated by introducing external knowledge bases.
  3. Fixed Window \(m\): Monitoring is performed every \(m\) tokens, and the choice of \(m\) may affect performance. The paper does not fully explore the optimal setting for \(m\).
  4. Scalability: It remains unclear how the parameters \(N\) and \(K\) of the tree search should be adjusted according to task complexity.
  • Relation to DoLa (contrastive decoding across layers): the two are complementary. DoLa uses information from the model's internal layers, whereas MD uses information from an external reference model.
  • Relation to SelfCheckGPT: its post-hoc detection contrasts with MD's in-process detection, which highlights the importance of the stage at which intervention occurs.
  • Relation to Speculative Decoding: MD shares the "small model generates, large model verifies" paradigm, but with a completely different motivation.

Rating

  • Novelty: ⭐⭐⭐⭐ — The combination of token-level monitoring and tree-search revision is a natural yet effective innovation. The paradigm shift from "full-length sampling" to "selective token revision" is meaningful.
  • Experimental Thoroughness: ⭐⭐⭐⭐ — Comprehensive, covering 3 models, 4 datasets, efficiency analyses, ablation studies, and case studies.
  • Writing Quality: ⭐⭐⭐⭐ — Clearly articulated motivation (especially the comparison in Figure 1), with well-structured descriptions of the method.
  • Value: ⭐⭐⭐⭐ — Highly practical as an inference-time hallucination mitigation solution, though requiring an extra large model poses a deployment barrier.