Generate, but Verify: Reducing Hallucination in Vision-Language Models with Retrospective Resampling

Conference: NeurIPS 2025 · arXiv: 2504.13169 · Code: GitHub · Area: Multimodal Large Language Models / Hallucination Mitigation · Keywords: VLM, Visual Hallucination, Self-Correction, Retrospective Resampling, Confidence Token

TL;DR

This paper proposes REVERSE, the first framework to unify generation adjustment and post-hoc verification within a single VLM. Through hallucination-aware training on 1.3M semi-synthetic samples combined with inference-time retrospective resampling, REVERSE enables a VLM to automatically detect and correct hallucinations during generation, achieving a 12% reduction on CHAIR-MSCOCO and a 34% improvement on HaloQuest.

Background & Motivation

  • Core Problem: VLMs frequently produce hallucinations in visual understanding (e.g., describing non-existent objects or actions), posing significant risks in safety-critical scenarios such as autonomous driving and assistive technologies.
  • Limitations of Prior Work: Generation adjustment methods (VCD, OPERA, DoLA, etc.) modify decoding behavior but cannot correct erroneously generated tokens once produced; post-hoc verification methods (Woodpecker, LURE) rely on external models, involve complex pipelines, and tend to reject outputs rather than correct them.
  • Key Challenge: No existing method can simultaneously generate, verify, and correct within a single model.
  • Key Insight: Introduce explicit confidence tokens to enable the VLM to self-annotate phrase-level hallucinations, combined with retrospective resampling for runtime self-correction.

Method

Confidence Token Design

Three special tokens are added to the VLM vocabulary:

  • <SPAN>: marks the beginning of a key phrase
  • </CN>: marks the end of a confident, grounded phrase
  • </UN>: marks the end of an unconfident, hallucinated phrase
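
As a concrete illustration, the following minimal sketch shows how such tokens could be registered with a Hugging Face tokenizer and model. The checkpoint name stands in for LLaVA-v1.5's language backbone; this is an illustrative sketch, not the authors' released code.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative backbone: LLaVA-v1.5 7B builds on Vicuna-7B.
checkpoint = "lmsys/vicuna-7b-v1.5"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint)

# Register the three confidence tokens as atomic vocabulary entries
# so they are never split by the subword tokenizer.
tokenizer.add_special_tokens(
    {"additional_special_tokens": ["<SPAN>", "</CN>", "</UN>"]}
)

# Grow the embedding matrix to cover the newly added token ids.
model.resize_token_embeddings(len(tokenizer))
```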

1.3M Semi-Synthetic Hallucination-Aware Training Data

Expanded from LLaVA-v1.5-665k, the dataset comprises 6.7M QA pairs (3.8M correct answers + 2.9M hallucinated answers):

  • Positive phrases are enclosed with <SPAN> and </CN>
  • Negative phrases are enclosed with <SPAN> and </UN>, with the target sequence truncated immediately after </UN>
  • Binary Yes/No and counting questions use rule-based negative generation; open-ended answers are generated with GPT-4o-mini
  • 20% of the data incorporates query-rewriting prompts to support retrospective correction
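
To make the annotation format concrete, here is a hypothetical pair of training samples in the style described above; the question, image content, and wording are invented for illustration.

```python
# Positive sample: the key phrase is grounded in the image, so it is
# closed with </CN> and the answer continues normally.
positive = {
    "question": "What is the man holding?",
    "answer": "The man is holding <SPAN>a red umbrella</CN> over his head.",
}

# Negative sample: the key phrase is a hallucination, so it is closed
# with </UN> and the target sequence is truncated right after the tag.
negative = {
    "question": "What is the man holding?",
    "answer": "The man is holding <SPAN>a baseball bat</UN>",
}
```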

Hallucination-Aware Training Loss

A modified cross-entropy loss applies target masking (weight set to 0) to tokens between <SPAN> and </UN>, preventing the model from reinforcing language priors on hallucinated content:

\[\mathcal{L}(\theta) = -\sum_{y_i \in Y} m_i \cdot \log P(y_i \mid X, y_1, \ldots, y_{i-1}; \theta)\]

where the mask \(m_i = 0\) when token \(y_i\) lies between <SPAN> and </UN>, and \(m_i = 1\) otherwise.
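
A minimal PyTorch sketch of this loss follows; `hall_mask` is an assumed pre-computed boolean mask marking tokens inside <SPAN>…</UN> spans, and the function mirrors the formula above rather than the authors' implementation.

```python
import torch.nn.functional as F

def hallucination_aware_loss(logits, targets, hall_mask):
    """Cross-entropy with hallucinated-span tokens masked out.

    logits:    (batch, seq_len, vocab) next-token predictions
    targets:   (batch, seq_len) gold token ids
    hall_mask: (batch, seq_len) bool, True between <SPAN> and </UN>
    """
    # F.cross_entropy expects the class dim second: (batch, vocab, seq_len).
    per_token = F.cross_entropy(logits.transpose(1, 2), targets, reduction="none")
    mask = (~hall_mask).float()  # m_i = 0 on hallucinated spans, 1 elsewhere
    return (per_token * mask).sum() / mask.sum().clamp(min=1.0)
```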

Retrospective Resampling

During inference, the generation probability \(P(\text{</UN>})\) is continuously monitored. When it exceeds a threshold \(\tau\), a hierarchical rollback strategy is triggered:

  1. Local Rollback: Roll back to the most recent </CN> (confidence checkpoint) and attempt local correction.
  2. Sentence-Level Rollback: If local correction fails \(K\) times (\(K=10\)), roll back to the previous sentence boundary.
  3. Query Rewriting with Hint: If sentence-level correction also fails, rewrite the query by appending a hint to the input: "Hint: potential incorrect phrases → <placeholder>".
  4. Termination: If correction still fails after \(N\) attempts (\(N=50\)), return the current output flagged as potentially hallucinated.

Temperature is gradually increased during rejection sampling (step \(\Delta T=0.1\), upper bound \(T_0+0.5\)) to encourage exploration of alternative expressions.
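
The control flow can be sketched as below. Everything here is an illustrative reconstruction: `sample_next` is an assumed wrapper around the VLM's decoding step that also returns \(P(\text{</UN>})\), the sentence boundary is simplified to a period token, the hint-rewriting step is only indicated by a comment, and the starting temperature `t0` is an assumption.

```python
def reverse_decode(sample_next, prompt, tau=0.003, K=10, N=50,
                   t0=1.0, dt=0.1, max_len=512):
    """Retrospective resampling sketch.

    sample_next(tokens, temperature) -> (next_token, p_un), where p_un is
    the model's probability of emitting </UN> at the current position.
    """
    tokens, temperature = list(prompt), t0
    attempts = local_fails = 0
    while len(tokens) < max_len:
        token, p_un = sample_next(tokens, temperature)
        if p_un <= tau:                        # confident: commit the token
            tokens.append(token)
            if token == "</s>":                # end of sequence, all clear
                break
            continue
        attempts += 1
        if attempts >= N:                      # 4) termination: flag output
            return tokens, "potentially hallucinated"
        if local_fails < K:                    # 1) local rollback to </CN>
            tokens = rollback_to(tokens, "</CN>")
            local_fails += 1
        else:                                  # 2) sentence-level rollback;
            tokens = rollback_to(tokens, ".")  # 3) a hint would be appended
            local_fails = 0                    #    to the query here
        temperature = min(temperature + dt, t0 + 0.5)  # widen exploration
    return tokens, "ok"

def rollback_to(tokens, marker):
    """Keep tokens up to and including the last occurrence of `marker`."""
    for i in range(len(tokens) - 1, -1, -1):
        if tokens[i] == marker:
            return tokens[: i + 1]
    return tokens  # no checkpoint found: leave the sequence unchanged
```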

Key Experimental Results

CHAIR-MSCOCO Image Captioning (Lower is Better)

| Method | CHAIRi↓ | CHAIRs↓ |
| --- | --- | --- |
| LLaVA-v1.5 7B | 15.4 | 50.0 |
| HA-DPO | 11.0 | 38.2 |
| HALVA | 11.7 | 41.4 |
| REVERSE (τ=0.003) | 10.3 | 37.0 |
| REVERSE (τ=0.0003) | 6.1 | 13.6 |

HaloQuest Open-Ended QA (Accuracy↑)

| Method | Avg Acc↑ | FP↑ | VC↑ | IC↑ |
| --- | --- | --- | --- | --- |
| LLaVA-v1.5 | 22.6 | 17.1 | 39.5 | 10.7 |
| HALVA | 23.9 | 21.1 | 37.4 | 10.7 |
| REVERSE (τ=0.003) | 30.7 | 31.8 | 31.5 | 26.9 |
| REVERSE (τ=0.0003) | 32.3 | 29.4 | 18.7 | 58.8 |

(FP = False Premise, VC = Visually Challenging, IC = Insufficient Context)

MMHal-Bench (Score↑ / Hall Rate↓)

| Method | Score↑ | Hall. Rate↓ |
| --- | --- | --- |
| LLaVA-v1.5 | 2.11 | 0.54 |
| HALVA | 2.25 | 0.54 |
| REVERSE (τ=0.003) | 2.56 | 0.47 |
| REVERSE (τ=0.0003) | 3.28 | 0.30 |

Ablation Study (AMBER-G)

| Component | CHAIR↓ | Cover↑ | Hall↓ | Cog↓ |
| --- | --- | --- | --- | --- |
| LLaVA-v1.5 Baseline | 7.8 | 51.0 | 36.4 | 4.2 |
| + Hallucination-Aware Training | 7.2 | 53.2 | 36.3 | 3.4 |
| + Rejection Sampling | 6.0 | 51.0 | 30.5 | 3.0 |
| + Query Rewriting | 6.0 | 52.2 | 30.4 | 3.0 |

Efficiency

  • 37% of samples require no rollback; among the remainder, more than half require only one correction round.
  • At \(N=50\), total token generation is approximately 3.05× that of the baseline.
  • Verification overhead is negligible (inline token-level confidence estimation), far lower than the external model overhead of Woodpecker.

Highlights & Insights

  1. First Unified Generate + Verify + Correct Framework: A single VLM serves as both generator and verifier without external models, performing retrospective correction rather than simple rejection.
  2. Tunable Threshold τ for Expressiveness–Trustworthiness Trade-off: τ can be continuously adjusted from 0.01 to 0.0001; at τ=0.0001, hallucination control even surpasses GPT-4V, making REVERSE the first method to offer such a user-controllable parameter.
  3. Hallucination-Aware Training Alone Yields Gains: Even without inference-time rollback, the training stage alone already outperforms existing VLM baselines, plausibly because the paired positive and negative phrases provide an implicit contrastive (DPO-like) signal.
  4. Robust to Temperature Variation: While other methods suffer from simultaneous degradation in hallucination rate and coverage at high temperatures, REVERSE reduces hallucinations while improving coverage even under high temperature.

Limitations & Future Work

  1. Increased Inference Overhead: Worst-case token generation reaches 3×; KV-cache reuse could mitigate this but is not yet implemented.
  2. Ineffective for Discriminative VQA: Rollback provides no additional reasoning benefit in binary Yes/No tasks.
  3. Training Data Dependence on GPT-4o-mini: May introduce biases and limited coverage.
  4. Threshold τ Requires Per-Model Tuning: LLaVA uses 0.003, Qwen uses 0.01; confidence scores are not calibrated across models.
  5. Accuracy Drop on VC (Visually Challenging) Subset: The more conservative generation strategy causes the model to refuse some answerable questions.

Takeaways

  • Generation Adjustment vs. Post-Hoc Verification: This paper is the first to demonstrate that the two paradigms can be unified; the confidence tokens serve simultaneously as hallucination classifiers and rollback triggers.
  • Implicit Connection to DPO: The contrastive training on positive and negative phrase pairs in hallucination-aware training may produce a DPO-like effect, warranting further investigation.
  • Broader Inspiration: The retrospective resampling idea generalizes to LLM factuality checking (e.g., having an LLM self-annotate key claims requiring citation during generation and verify them on the fly).

Rating

⭐⭐⭐⭐ — The unified framework is elegantly designed, experiments are comprehensive (3 VLM backbones × multiple benchmarks), and the tunable threshold offers practical value. Limitations include inference overhead and reliance on an external model for training data quality.