
Token-Guard: Towards Token-Level Hallucination Control via Self-Checking Decoding

Conference: ICLR 2026
arXiv: 2601.21969
Code: https://github.com/rhq945/Token-Guard
Area: Information Retrieval
Keywords: LLM hallucination control, token-level decoding, self-checking, segment-level scoring, iterative refinement

TL;DR

This paper proposes Token-Guard, a token-level hallucination control method built on self-checking decoding. It detects and suppresses hallucinations during decoding through token- and segment-level scoring in the hidden space combined with an iterative refinement mechanism, achieving an average F1 improvement of 16.3%.

Background & Motivation

  • LLM hallucination problem: Large models frequently generate content inconsistent with the input, which is especially severe in knowledge-intensive scenarios.
  • Limitations of Prior Work:
    • RAG and RLHF require expensive external retrieval or large-scale fine-tuning.
    • Existing decoding methods (CoT, ToT, etc.) lack explicit token-level hallucination checking mechanisms.
    • Hallucination risk is not explicitly quantified, and token selection lacks directional guidance.
    • Most methods support only single-pass generation and lack dynamic refinement capability.
  • Core Problem: How to achieve fine-grained hallucination control during decoding with low overhead?

Method

Overall Architecture

Token-Guard comprises three levels of hallucination control:

  1. Token-level self-checking
  2. Segment-level representation and scoring
  3. Global iterative refinement

Key Design 1: Token-Level Hallucination Self-Checking

A composite hallucination score is computed for each candidate token:

\[F_{\text{halu}}^{\text{token}}(a_t^{(i)} \mid s_t) = \lambda \cdot \frac{h_t^{(i)} \cdot \bar{h}_{<t}}{|h_t^{(i)}| |\bar{h}_{<t}|} + (1-\lambda) \cdot P(a_t^{(i)} \mid a_{<t}, x)\]
  • First term: cosine similarity between the candidate token's hidden state and the mean hidden state of accepted tokens (semantic consistency).
  • Second term: model-assigned conditional probability (token probability).
  • \(\lambda = 0.6\) balances the two terms.
  • Threshold \(\tau_{\text{token}} = 0.4\); tokens below the threshold are discarded.

Hidden states are taken from the model's second-to-last layer \(\text{LLM}_{\text{hidden}}^{(L-1)}\); the mean hidden state of the input context serves as the anchor for the first token.
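The token-level check can be sketched in a few lines. This is a minimal illustration under assumed names (not the authors' code): a candidate token's hidden state is compared against the running mean of accepted hidden states, blended with its conditional probability, and thresholded.

```python
import math

# Minimal sketch of the token-level self-check (Key Design 1).
# Names and data layout are assumptions for illustration.
LAMBDA = 0.6      # balances semantic consistency against model probability
TAU_TOKEN = 0.4   # candidates scoring below this are discarded

def cosine(u, v):
    """Cosine similarity between two hidden-state vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-9)

def token_halu_score(h_cand, h_mean, prob):
    """lambda * cos(candidate state, mean of accepted states) + (1 - lambda) * P."""
    return LAMBDA * cosine(h_cand, h_mean) + (1 - LAMBDA) * prob

def accept(h_cand, h_mean, prob):
    """Keep the candidate token only if its composite score clears the threshold."""
    return token_halu_score(h_cand, h_mean, prob) >= TAU_TOKEN
```

Note that because the score mixes a similarity in [-1, 1] with a probability in [0, 1], a high-probability token that points away from the accepted context can still fall below the threshold.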

Key Design 2: Segment-Level Candidate Representation and Scoring

Candidate segments \(C_k\) are formed from consecutive tokens, and segment representations are computed via weighted averaging:

\[H_k = \sum_{i=1}^{n} w_i h_{t_i}, \quad w_i = \frac{\exp(F_{\text{halu}}^{\text{token}}(a_{t_i} \mid s_{t_i}))}{\sum_{j=1}^{n} \exp(F_{\text{halu}}^{\text{token}}(a_{t_j} \mid s_{t_j}))}\]
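The weighted averaging above is a softmax over token-level scores. A small sketch (assumed layout: each token in the segment carries a hidden state and its token-level score):

```python
import math

# Sketch of the softmax-weighted segment representation H_k.
# hidden_states: list of per-token hidden-state vectors in the segment.
# token_scores: the corresponding token-level hallucination scores.
def segment_representation(hidden_states, token_scores):
    """H_k = sum_i w_i * h_i, with w_i = softmax(token-level scores)."""
    m = max(token_scores)                        # subtract max for numerical stability
    exps = [math.exp(s - m) for s in token_scores]
    z = sum(exps)
    weights = [e / z for e in exps]
    dim = len(hidden_states[0])
    return [sum(w * h[d] for w, h in zip(weights, hidden_states)) for d in range(dim)]
```

Higher-scoring (more reliable) tokens thus dominate the segment vector, so a single low-confidence token cannot drag the representation far on its own.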

The segment-level score integrates three dimensions:

\[F_{\text{halu}}^{\text{seg}}(C_k) = \alpha F_{\text{halu}}^{\text{token}}(C_k) + \beta \text{Consistency}(C_k) + \gamma \text{Alignment}(C_k)\]
  • Token aggregation: weighted token reliability (\(\alpha = 0.5\))
  • Local consistency: smoothness of adjacent token hidden states (\(\beta = 0.3\))
  • Global alignment: semantic alignment with the input context (\(\gamma = 0.2\))

Segment-level thresholds: \(\tau_{\text{seg}}^{\text{low}} = 0.55\) (discard), \(\tau_{\text{seg}}^{\text{high}} = 0.75\) (accept); segments in between undergo local refinement.
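The three-way gate can be sketched directly from the constants above; the component scorers (token aggregation, local consistency, global alignment) are placeholders outside this sketch:

```python
# Sketch of the segment-level score and its three-way routing rule.
# Component scores are assumed to be precomputed elsewhere.
ALPHA, BETA, GAMMA = 0.5, 0.3, 0.2   # weights of the three dimensions
TAU_LOW, TAU_HIGH = 0.55, 0.75       # discard / accept thresholds

def segment_score(token_agg, consistency, alignment):
    """Weighted combination of the three segment-level dimensions."""
    return ALPHA * token_agg + BETA * consistency + GAMMA * alignment

def route_segment(score):
    """Three-way gate: discard, accept, or send to local refinement."""
    if score < TAU_LOW:
        return "discard"
    if score >= TAU_HIGH:
        return "accept"
    return "refine"
```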

Key Design 3: Local Refinement and Global Iteration

Local refinement: The lowest-scoring token and its neighboring window \(W_k^{(l)}\) within a segment are identified, and the LLM regenerates replacement tokens conditioned on the surrounding context:

\[W_k^{(l)'} = \text{LLM\_refine}(W_k^{(l)} \mid a_{<i-1}, a_{>i+1}, H_k)\]
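The window-selection step can be sketched as follows; the \(\text{LLM\_refine}\) call itself is outside this sketch, and `radius` is an assumed hyperparameter:

```python
# Sketch of window selection for local refinement (Key Design 3): locate the
# lowest-scoring token in a segment and return the index window around it
# that would be regenerated conditioned on the surrounding context.
def refine_window(token_scores, radius=1):
    """Return the [start, end) index window around the weakest token."""
    weakest = min(range(len(token_scores)), key=token_scores.__getitem__)
    return max(0, weakest - radius), min(len(token_scores), weakest + radius + 1)
```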

Global iteration: Reliable segments are assembled into a reasoning chain \(R\), and a global score is computed:

\[F_{\text{global}}(R) = \frac{F_{\text{fact}}(R) \cdot F_{\text{logic}}(R)}{F_{\text{fact}}(R) + F_{\text{logic}}(R) - F_{\text{fact}}(R) \cdot F_{\text{logic}}(R)}\]

If \(F_{\text{global}} < 0.7\), global regeneration is triggered; if both \(F_{\text{fact}}\) and \(F_{\text{logic}}\) fall below 0.5, the system outputs "unable to answer."
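The global gate combines the two scores with a harmonic-mean-style formula and applies the two decision thresholds; the factuality and logic scorers themselves are outside this sketch:

```python
# Sketch of the global decision rule. f_fact and f_logic are assumed to be
# scores in [0, 1] produced by the paper's factuality and logic checks.
TAU_GLOBAL = 0.7    # below this, trigger global regeneration
TAU_ABSTAIN = 0.5   # if both scores fall below this, abstain

def global_score(f_fact, f_logic):
    """F_global = (F_fact * F_logic) / (F_fact + F_logic - F_fact * F_logic)."""
    denom = f_fact + f_logic - f_fact * f_logic
    return (f_fact * f_logic) / denom if denom > 0 else 0.0

def decide(f_fact, f_logic):
    if f_fact < TAU_ABSTAIN and f_logic < TAU_ABSTAIN:
        return "unable to answer"
    if global_score(f_fact, f_logic) < TAU_GLOBAL:
        return "regenerate"
    return "accept"
```

Note the combiner is conservative: it only approaches 1 when both factuality and logic are high, so one strong score cannot mask a weak one.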

Memory Efficiency

  • Token level: only the running mean \(\bar{h}_{<t}\) is maintained; complexity \(\mathcal{O}(L_{\max} \cdot K_{\text{active}} \cdot d)\).
  • Segment level: temporary hidden states are released after segment formation; only compact segment vectors are retained.
  • Global level: only segment vectors \(\{H_k\}\) are operated on; complexity \(\mathcal{O}(K \cdot d)\).
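The constant-memory bookkeeping at the token level amounts to an incremental mean update. A minimal sketch (class name is illustrative):

```python
# Sketch of O(d)-memory state: only the running mean of accepted hidden
# states and a counter are stored, so token-level memory does not grow
# with generation length.
class RunningMean:
    def __init__(self, dim):
        self.mean = [0.0] * dim   # \bar{h}_{<t}
        self.n = 0                # number of accepted tokens so far

    def update(self, h):
        """Incremental mean update: m_t = m_{t-1} + (h - m_{t-1}) / t."""
        self.n += 1
        self.mean = [m + (x - m) / self.n for m, x in zip(self.mean, h)]
```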

Key Experimental Results

Main Results (Meta-Llama-3.1-8B-Instruct)

| Method | FinanceBench F1 | DROP_hist F1 | DROP_nfl F1 | HaluEval F1 | Avg F1 |
| --- | --- | --- | --- | --- | --- |
| BaseModel | 16.00 | 44.21 | 39.10 | 42.16 | 28.29 |
| Guided Decoding | 16.44 | 55.95 | 36.71 | 57.41 | 34.73 |
| Chain-of-Thought | 11.01 | 49.26 | 49.21 | 55.32 | 34.63 |
| Tree-of-Thoughts | 14.44 | 47.73 | 37.69 | 56.02 | 33.33 |
| Token-Guard | 30.80 | 68.52 | 58.10 | 78.54 | 51.03 |

(Avg F1 appears to be computed over the paper's full benchmark suite, which includes datasets beyond the four columns shown.)

Qwen3-8B Results

| Method | Avg EM | Avg F1 |
| --- | --- | --- |
| BaseModel | 0.22 | 44.25 |
| CoT | 0.23 | 45.10 |
| Token-Guard | 0.35 | 53.98 |

Ablation Study

| Variant | DROP_hist F1 | RAGTruth F1 | Avg BLEU |
| --- | --- | --- | --- |
| Full Token-Guard | 68.52 | 43.94 | 51.74 |
| w/o Token-Level | 47.51 | 27.10 | 34.97 |
| w/o Segment-Level | 60.10 | 39.20 | 46.32 |
| w/o Global Iteration | 63.05 | 41.05 | 36.26 |
| w/o Prompt | 55.23 | 32.50 | 39.70 |

Key Findings

  • Token-level scoring contributes most to performance (removing it causes the largest F1 drop).
  • Global iteration primarily improves BLEU (linguistic fluency), with additional contributions to EM/F1.
  • The advantage is greatest on tasks requiring multi-step reasoning (DROP_nfl).
  • Improvements are limited on knowledge-intensive tasks (PubMedQA), as the method cannot compensate for missing domain knowledge.
  • The approach is effective across both backbone models (Llama3.1-8B and Qwen3-8B).

Highlights & Insights

  • Multi-level hallucination control: A three-tier token→segment→global hierarchy that balances precision and efficiency.
  • No external resources required: No retrieval system or additional training is needed; the method operates purely at the decoding stage.
  • Modular design: Can be integrated as a plug-in into any LLM decoding pipeline.
  • Memory-friendly: Careful state management keeps memory usage independent of generation length.

Limitations & Future Work

  • Multi-level scoring introduces additional computational overhead (each token requires multiple hidden state computations and cosine similarity calculations).
  • The method involves numerous hyperparameters (\(\lambda\), \(\tau_{\text{token}}\), \(\alpha/\beta/\gamma\), \(\tau_{\text{seg}}\), \(\tau_{\text{global}}\), etc.), making tuning complex.
  • The hidden-state-similarity-based hallucination detection assumes "consistency with context = factual correctness," which may fail when the model itself contains erroneous knowledge.
  • Validation is limited to 8B-scale models; applicability to larger or smaller models remains unknown.
  • Global iteration relies on TF-IDF and KMeans clustering, introducing additional dependencies on classical NLP methods.

Comparison with Related Work

  • RAG methods: external retrieval augmentation; computationally intensive and domain-dependent.
  • RLHF/alignment methods: require large-scale fine-tuning with high resource consumption.
  • Decoding methods: DoLa (inter-layer contrastive decoding), KCTS (knowledge-constrained tree search), Phi-Decoding (look-ahead sampling).
  • Token-Guard: the first unified hallucination control framework integrating token-level self-checking, segment-level scoring, and global iteration.

Rating

Dimension Score
Novelty ★★★★☆
Theoretical Depth ★★★☆☆
Experimental Thoroughness ★★★★☆
Value ★★★★☆
Writing Quality ★★★☆☆