Generate, but Verify: Reducing Hallucination in Vision-Language Models with Retrospective Resampling¶
Conference: NeurIPS 2025 arXiv: 2504.13169 Code: GitHub Area: Multimodal Large Language Models / Hallucination Mitigation Keywords: VLM, Visual Hallucination, Self-Correction, Retrospective Resampling, Confidence Token
TL;DR¶
This paper proposes REVERSE, the first framework to unify generation adjustment and post-hoc verification within a single VLM. Through hallucination-aware training on 1.3M semi-synthetic samples combined with inference-time retrospective resampling, REVERSE enables a VLM to automatically detect and correct hallucinations during generation, achieving a 12% reduction on CHAIR-MSCOCO and a 34% improvement on HaloQuest.
Background & Motivation¶
- Core Problem: VLMs frequently produce hallucinations in visual understanding (e.g., describing non-existent objects or actions), posing significant risks in safety-critical scenarios such as autonomous driving and assistive technologies.
- Limitations of Prior Work: Generation adjustment methods (VCD, OPERA, DoLA, etc.) modify decoding behavior but cannot correct erroneously generated tokens once produced; post-hoc verification methods (Woodpecker, LURE) rely on external models, involve complex pipelines, and tend to reject outputs rather than correct them.
- Key Challenge: No existing method can simultaneously generate, verify, and correct within a single model.
- Key Insight: Introduce explicit confidence tokens to enable the VLM to self-annotate phrase-level hallucinations, combined with retrospective resampling for runtime self-correction.
Method¶
Confidence Token Design¶
Three special tokens are added to the VLM vocabulary:
- <SPAN>: marks the beginning of a key phrase
- </CN>: marks the end of a confident, grounded phrase
- </UN>: marks the end of an unconfident, hallucinated phrase
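A minimal plain-Python illustration of how the three tokens delimit key phrases in an annotated answer (the strings and helper name are ours, not the paper's): grounded phrases close with </CN>, hallucinated ones with </UN>.

```python
# Illustrative only: shows the annotation format the confidence tokens induce.
def wrap(phrase: str, grounded: bool) -> str:
    """Wrap a key phrase in <SPAN> ... </CN> (grounded) or <SPAN> ... </UN>."""
    return f"<SPAN>{phrase}{'</CN>' if grounded else '</UN>'}"

example = "A " + wrap("dog", True) + " chases a " + wrap("unicorn", False)
# example == "A <SPAN>dog</CN> chases a <SPAN>unicorn</UN>"
```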
1.3M Semi-Synthetic Hallucination-Aware Training Data¶
Expanded from LLaVA-v1.5-665k, comprising 6.8M QA pairs (3.8M correct answers + 2.9M hallucinated answers):
- Positive phrases are enclosed with <SPAN> and </CN>
- Negative phrases are enclosed with <SPAN> and </UN>, with truncation immediately after </UN>
- Binary Yes/No and counting questions use rule-based negative generation; open-ended answers use GPT-4o-mini
- 20% of data incorporates query rewriting prompts to support retrospective correction
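A hedged sketch of the two negative-generation routes above: rule-based flipping for binary questions, and truncation immediately after </UN> for open-ended answers (the GPT-4o-mini rewriting step is stubbed out; all strings and helper names are illustrative assumptions).

```python
# Illustrative stubs for negative-sample construction; not the paper's code.
def flip_binary(answer: str) -> str:
    """Rule-based negative for Yes/No questions: flip the answer."""
    return "No" if answer.strip().lower().startswith("yes") else "Yes"

def truncate_after_un(prefix: str, bad_phrase: str) -> str:
    """Open-ended negative target: the hallucinated phrase is wrapped and the
    target ends immediately after </UN>, so the model never trains on
    continuations of a hallucination."""
    return prefix + f"<SPAN>{bad_phrase}</UN>"

neg = truncate_after_un("There is a ", "second cat")
# neg == "There is a <SPAN>second cat</UN>"
```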
Hallucination-Aware Training Loss¶
A modified cross-entropy loss that applies target masking (weight set to 0) on tokens between <SPAN> and </UN>, preventing reinforcement of language priors on hallucinated content:

\[
\mathcal{L} = -\sum_{i} \mathbb{1}_{Hall(i)} \, \log P_\theta\left(y_i \mid y_{<i}, x\right),
\]

where \(\mathbb{1}_{Hall(i)}=0\) only when token \(y_i\) lies between <SPAN> and </UN>, and \(\mathbb{1}_{Hall(i)}=1\) otherwise.
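A runnable PyTorch sketch of this masked loss. Token ids and the exact delimiter treatment are assumptions: here only the span's content is masked, while the delimiter tokens stay supervised so the model still learns to emit </UN>.

```python
# Sketch of hallucination-aware target masking; ids are placeholders.
import torch
import torch.nn.functional as F

def hallucination_mask(targets, span_id, un_id, cn_id):
    """Weight 0 for tokens strictly between <SPAN> and </UN>;
    confident spans (<SPAN> ... </CN>) keep full supervision."""
    toks = targets.tolist()
    w = [1.0] * len(toks)
    start = None
    for i, t in enumerate(toks):
        if t == span_id:
            start = i
        elif t == un_id and start is not None:
            for j in range(start + 1, i):  # mask the hallucinated content
                w[j] = 0.0
            start = None
        elif t == cn_id:  # span closed confidently: nothing to mask
            start = None
    return torch.tensor(w)

def hallucination_aware_loss(logits, targets, span_id, un_id, cn_id):
    # logits: (T, V); targets: (T,) token ids
    w = hallucination_mask(targets, span_id, un_id, cn_id)
    per_tok = F.cross_entropy(logits, targets, reduction="none")
    return (w * per_tok).sum() / w.sum().clamp(min=1.0)
```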
Retrospective Resampling¶
During inference, the generation probability \(P(\text{</UN>})\) is continuously monitored. When it exceeds a threshold \(\tau\), a hierarchical rollback strategy is triggered:
- Local Rollback: Roll back to the most recent </CN> (confidence checkpoint) and attempt local correction.
- Sentence-Level Rollback: If local correction fails \(K\) times (\(K=10\)), roll back to the previous sentence boundary.
- Query Rewriting with Hint: Append a hint to the input: "Hint: potential incorrect phrases → \<placeholder>"
- Termination: If correction still fails after \(N\) attempts (\(N=50\)), return the current output flagged as potentially hallucinated.
Temperature is gradually increased during rejection sampling (step \(\Delta T=0.1\), upper bound \(T_0+0.5\)) to encourage exploration of alternative expressions.
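The control flow above can be sketched as a runnable toy. The "decoder" is a scripted stub emitting (token, \(P(\text{</UN>})\)) pairs, and a rollback simply advances to the next scripted candidate; only the decision hierarchy and the temperature schedule (\(\tau\), \(K\), \(N\), \(\Delta T\) as quoted above) are faithful to the text.

```python
# Toy illustration of retrospective resampling's control flow; not a decoder.
def resample_loop(steps, tau=0.003, K=10, N=50, T0=1.0, dT=0.1):
    """steps: list of (token, p_un) pairs from a stub decoder.
    Returns (output_tokens, rollback_actions, flagged)."""
    out, actions = [], []
    attempts = local_fails = 0
    T = T0
    i = 0
    while i < len(steps):
        tok, p_un = steps[i]
        if p_un <= tau:          # confident: commit the token
            out.append(tok)
            i += 1
            continue
        attempts += 1
        if attempts >= N:        # give up: return flagged output
            return out, actions, True
        if local_fails < K:      # local rollback to last </CN>
            actions.append("local_rollback")
            local_fails += 1
        else:                    # escalate: sentence rollback + hint
            actions.append("sentence_rollback+hint")
            local_fails = 0
        T = min(T + dT, T0 + 0.5)  # widen exploration on each retry
        i += 1                     # stub: "resample" = next scripted candidate
    return out, actions, False
```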
Key Experimental Results¶
CHAIR-MSCOCO Image Captioning (Lower is Better)¶
| Method | CHAIRi↓ | CHAIRs↓ |
|---|---|---|
| LLaVA-v1.5 7B | 15.4 | 50.0 |
| HA-DPO | 11.0 | 38.2 |
| HALVA | 11.7 | 41.4 |
| REVERSE (τ=0.003) | 10.3 | 37.0 |
| REVERSE (τ=0.0003) | 6.1 | 13.6 |
HaloQuest Open-Ended QA (Accuracy↑)¶
| Method | Avg Acc↑ | FP | VC | IC |
|---|---|---|---|---|
| LLaVA-v1.5 | 22.6 | 17.1 | 39.5 | 10.7 |
| HALVA | 23.9 | 21.1 | 37.4 | 10.7 |
| REVERSE (τ=0.003) | 30.7 | 31.8 | 31.5 | 26.9 |
| REVERSE (τ=0.0003) | 32.3 | 29.4 | 18.7 | 58.8 |
MMHal-Bench (Score↑ / Hall Rate↓)¶
| Method | Score↑ | Hall. Rate↓ |
|---|---|---|
| LLaVA-v1.5 | 2.11 | 0.54 |
| HALVA | 2.25 | 0.54 |
| REVERSE (τ=0.003) | 2.56 | 0.47 |
| REVERSE (τ=0.0003) | 3.28 | 0.30 |
Ablation Study (AMBER-G)¶
| Component | CHAIR↓ | Cover↑ | Hall↓ | Cog↓ |
|---|---|---|---|---|
| LLaVA-v1.5 Baseline | 7.8 | 51.0 | 36.4 | 4.2 |
| + Hallucination-Aware Training | 7.2 | 53.2 | 36.3 | 3.4 |
| + Rejection Sampling | 6.0 | 51.0 | 30.5 | 3.0 |
| + Query Rewriting | 6.0 | 52.2 | 30.4 | 3.0 |
Efficiency¶
- 37% of samples require no rollback; among the remainder, more than half require only one correction round.
- At \(N=50\), total token generation is approximately 3.05× that of the baseline.
- Verification overhead is negligible (inline token-level confidence estimation), far lower than the external model overhead of Woodpecker.
Highlights & Insights¶
- First Unified Generate + Verify + Correct Framework: A single VLM serves as both generator and verifier without external models, performing retrospective correction rather than simple rejection.
- Tunable Threshold τ for Expressiveness–Trustworthiness Trade-off: τ can be continuously adjusted from 0.01 to 0.0001; at τ=0.0001, hallucination control even surpasses GPT-4V, making REVERSE the first method to offer such a user-controllable parameter.
- Hallucination-Aware Training Alone Yields Gains: Even without inference-time rollback, the contrastive learning effect from the training stage already outperforms existing VLMs (analogous to a DPO effect).
- Robust to Temperature Variation: While other methods suffer from simultaneous degradation in hallucination rate and coverage at high temperatures, REVERSE reduces hallucinations while improving coverage even under high temperature.
Limitations & Future Work¶
- Increased Inference Overhead: Worst-case token generation reaches 3×; KV-cache reuse could mitigate this but is not yet implemented.
- Ineffective for Discriminative VQA: Rollback provides no additional reasoning benefit in binary Yes/No tasks.
- Training Data Dependence on GPT-4o-mini: May introduce biases and limited coverage.
- Threshold τ Requires Per-Model Tuning: LLaVA uses 0.003, Qwen uses 0.01; confidence scores are not calibrated across models.
- Accuracy Drop on VC (Visually Challenging) Subset: The more conservative generation strategy causes the model to refuse some answerable questions.
Related Work & Insights¶
- Generation Adjustment vs. Post-Hoc Verification: This paper is the first to demonstrate that the two paradigms can be unified; confidence tokens serve simultaneously as classifiers and rollback triggers.
- Implicit Connection to DPO: The contrastive training on positive and negative phrase pairs in hallucination-aware training may produce a DPO-like effect, warranting further investigation.
- Broader Inspiration: The retrospective resampling idea is generalizable to LLM factuality checking (e.g., having LLMs self-annotate key claims requiring citation during generation and verifying them on the fly).
Rating¶
⭐⭐⭐⭐ — The unified framework is elegantly designed, experiments are comprehensive (3 VLM backbones × multiple benchmarks), and the tunable threshold offers practical value. Limitations include inference overhead and reliance on an external model for training data quality.