Antidote: A Unified Framework for Mitigating LVLM Hallucinations in Counterfactual Presupposition and Object Perception
Conference: CVPR 2025
arXiv: 2504.20468
Code: None
Area: Hallucination Detection
Keywords: LVLM Hallucination, Counterfactual Presupposition, Preference Optimization, CP-Bench, Self-Correction
TL;DR
This paper proposes Antidote, a unified post-training framework driven by synthetic data. It injects factual priors into prompts so that the model can correct its own responses, and recasts hallucination mitigation as a preference optimization problem. On the LLaVA series, it improves CP-Bench F1 by over 50 points, raises POPE by 1.8-3.3%, and reduces CHAIR/SHR by 30-50%, without catastrophic forgetting.
Background & Motivation
Background: LVLM hallucination mitigation primarily focuses on "object perception"—including object existence (evaluated by POPE) and image description accuracy (evaluated by CHAIR/SHR). Existing methods include instruction tuning (LRV), contrastive decoding (VCD), and post-training (HA-DPO), among others.
Limitations of Prior Work: Existing methods target hallucinations that arise during response generation but overlook that the question itself may contain a false premise. For example, when a user asks "What brand is that car in the picture?" about an image that contains no car, even recent models such as InternVL-2 and Qwen2-VL accept the false presupposition and fabricate an answer. Such "counterfactual presupposition questions" (CPQs) represent a deeper level of hallucination, against which existing mitigation methods are almost completely ineffective.
Key Challenge: During instruction tuning, LVLMs overfit to instruction-following patterns, so they tend to "cooperate" and answer even when the premise of the question contradicts the image content. Meanwhile, statistical biases in object co-occurrence lead models to mention objects that commonly appear in similar scenes but are absent from the current image.
Goal: To design a unified framework to simultaneously mitigate two types of hallucinations: (1) counterfactual presupposition hallucinations—detecting and rejecting false premises in questions; (2) object perception hallucinations—reducing descriptions of non-existent objects.
Key Insight: The authors argue that co-occurring objects can be decoupled and controlled via synthetic data: generate images in which objects that should statistically co-occur with the scene are actually missing, and then construct CPQs targeting these missing objects. Because the synthesis process already records which objects exist and which do not, these factual priors can be injected into prompts at no extra cost to enable model self-correction, which converts the problem into preference optimization.
Core Idea: Synthesize images with decoupled co-occurring objects along with corresponding CPQ/existence/description queries; use factual priors to prompt the model to self-correct to obtain "positive samples," use the original hallucinated responses as "negative samples," and train the model via DPO to learn to distinguish between them.
Method
Overall Architecture
The input consists of a baseline LVLM and preference training pairs generated by a synthetic data pipeline. The pipeline consists of three steps: (1) building a caption pool from CC3M and filtering/denoising it using DeepSeek-V2; (2) scene understanding—identifying objects in the scene and candidate objects for co-occurrence hallucination; (3) generating decoupled images using Stable Diffusion 3 and validating facts using Grounding-DINO. During the self-correction phase, factual priors are injected into prompts to generate positive samples, which are paired with the original hallucinated responses to construct preference pairs for DPO post-training.
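The fact-validation step at the end of the pipeline can be sketched as a simple predicate. This is a minimal illustration, not the authors' code: the `SyntheticSample` class, the `passes_fact_check` function, the detector output format (a label-to-confidence dictionary), and the `0.35` threshold are all assumptions made for the example, and no real Grounding-DINO call is shown.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class SyntheticSample:
    """One generated image with its factual priors (hypothetical container)."""
    caption: str
    present_objects: List[str]  # O_pre: objects that must appear in the image
    absent_objects: List[str]   # O_hallu: co-occurring objects that must NOT appear


def passes_fact_check(sample: SyntheticSample,
                      detections: Dict[str, float],
                      threshold: float = 0.35) -> bool:
    """Keep a sample only if every O_pre object is detected and no O_hallu object is.

    `detections` maps an object label to the best confidence reported by an
    open-vocabulary detector; the format and threshold are assumed here.
    """
    def found(label: str) -> bool:
        return detections.get(label, 0.0) >= threshold

    all_present = all(found(obj) for obj in sample.present_objects)
    none_absent = not any(found(obj) for obj in sample.absent_objects)
    return all_present and none_absent


sample = SyntheticSample(
    caption="a train waiting at a platform",
    present_objects=["train", "platform"],
    absent_objects=["railway"],
)

# The detector sees the scene objects and no railway: the sample is kept.
kept = passes_fact_check(sample, {"train": 0.82, "platform": 0.61})
# The detector also finds a railway: the decoupling failed, so the sample is dropped.
dropped = passes_fact_check(sample, {"train": 0.82, "platform": 0.61, "railway": 0.55})
```

Samples that pass this check carry reliable "exists / does not exist" labels, which is what makes the later self-correction prompts possible without manual annotation.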
Key Designs
- Synthetic Data Pipeline:
  - Function: Automatically generate images containing decoupled co-occurring objects and corresponding queries.
  - Mechanism: (a) Collect captions from CC3M, rewrite and filter them using DeepSeek-V2, and deduplicate using MinHash+LSH; (b) Extract scene objects \(\mathcal{O}_{pre}\) and hallucination candidate objects \(\mathcal{O}_{hallu}\) from captions using DeepSeek-V2; (c) Use SD3 to generate images with captions as positive prompts and \(\mathcal{O}_{hallu}\) as negative prompts, then verify the facts using Grounding-DINO, checking that \(\mathcal{O}_{pre}\) is present and \(\mathcal{O}_{hallu}\) is absent.
  - Design Motivation: Decouple co-occurring objects by controlling the generation process to obtain images where "the scene is normal but specific objects are missing," providing precise factual annotations for constructing targeted hallucination training data without manual intervention.
- Self-Correction via Preference Alignment:
  - Function: Construct high-quality preference training pairs using the model's own capability without external expert models.
  - Mechanism: Construct preference pairs for three query types. For CPQs: when the query asks about the attributes of a non-existent object, the original (hallucinated) response is the negative sample, and the response self-corrected with factual priors (e.g., "This object is not in the image") is the positive sample. True Presupposition Questions (TPQs) are also constructed to prevent the model from becoming overly cautious. Object existence queries and image description queries are processed similarly. BGE-m3 is used to compute the cosine similarity between positive and negative responses in order to filter out pairs in which the original response was not actually hallucinated. A key-value memory bank is also maintained to avoid duplicate queries.
  - Design Motivation: Compared to relying on GPT-4V to generate preference samples, self-correction fully utilizes the factual information in the synthetic data at zero extra cost. Compared to SFT, which only increases the probability of positive samples, the contrastive learning of DPO can better distinguish between hallucinated and truthful responses.
- CP-Bench Benchmark:
  - Function: The first benchmark that specifically evaluates how LVLMs handle counterfactual presuppositions.
  - Mechanism: It includes a dev set (automatically synthesized) and a test set (manually annotated), each with 1,000 samples (500 CPQ + 500 TPQ). The CPQs in the test set cover four types of everyday scenes: objects, knowledge, scenes, and activities. Hallucination candidate objects are chosen from highly co-occurring objects with semantic relevance (e.g., "railway" in a train scene) to increase the challenge. GPT-4o is used to convert open-ended responses into binary classifications for evaluation.
  - Design Motivation: Existing benchmarks (POPE, CHAIR) only evaluate hallucinations during response generation and do not evaluate whether a model can judge the presuppositions of a question.
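The pair-construction step described under self-correction (similarity filtering plus a memory bank for duplicate queries) can be sketched as follows. This is an illustrative sketch under several assumptions: `build_pairs`, the candidate dictionary fields, the `(image_id, normalized query)` memory key, and the `0.9` similarity cutoff are hypothetical, and a small lookup table stands in for a real BGE-m3 encoder.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def build_pairs(candidates, embed, max_similarity=0.9):
    """Turn (original, self-corrected) responses into DPO preference pairs.

    A pair is dropped when the two responses are nearly identical, which
    suggests the original response was not hallucinated. The memory bank
    (a plain dict here) skips queries that were already seen for an image.
    """
    memory = {}
    for cand in candidates:
        key = (cand["image_id"], cand["query"].strip().lower())
        if key in memory:
            continue  # duplicate query for this image
        similarity = cosine(embed(cand["original"]), embed(cand["corrected"]))
        if similarity >= max_similarity:
            continue  # no meaningful contrast between the two responses
        memory[key] = {
            "image_id": cand["image_id"],
            "prompt": cand["query"],
            "chosen": cand["corrected"],   # positive sample
            "rejected": cand["original"],  # negative (hallucinated) sample
        }
    return list(memory.values())


# Toy embeddings standing in for a sentence encoder (assumption for the demo).
TOY_EMBEDDINGS = {
    "The car is a Toyota.": [1.0, 0.0, 0.2],
    "There is no car in the image.": [0.0, 1.0, 0.2],
    "No, there is no dog.": [0.9, 0.1, 0.0],
    "No, there is no dog in the image.": [0.9, 0.12, 0.0],
}


def toy_embed(text):
    return TOY_EMBEDDINGS[text]


candidates = [
    {"image_id": "img1", "query": "What brand is the car?",
     "original": "The car is a Toyota.", "corrected": "There is no car in the image."},
    {"image_id": "img1", "query": "what brand is the car?",
     "original": "The car is a Toyota.", "corrected": "There is no car in the image."},
    {"image_id": "img2", "query": "Is there a dog?",
     "original": "No, there is no dog.", "corrected": "No, there is no dog in the image."},
]

pairs = build_pairs(candidates, toy_embed)
```

In this toy run only the first candidate survives: the second is a duplicate query for the same image, and the third has nearly identical responses, so it carries no preference signal.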
Loss & Training
Training uses the DPO loss: \(\mathcal{L}_{dpo} = -\mathbb{E}_\mathcal{D}\left[\log\sigma\left(\beta\log\frac{\pi_\theta(y_{pos}|[x_T,x_I])}{\pi_{ref}(y_{pos}|[x_T,x_I])} - \beta\log\frac{\pi_\theta(y_{neg}|[x_T,x_I])}{\pi_{ref}(y_{neg}|[x_T,x_I])}\right)\right]\), where \(y_{pos}\) is the self-corrected response, \(y_{neg}\) is the original hallucinated response, and \([x_T, x_I]\) is the text query paired with the image. Training data: 5,000 CPQs + 5,000 TPQs + 2,000 object existence + 8,000 image descriptions, totaling 20K samples. The model is fine-tuned with LoRA (\(r=64, \alpha=128\)) and \(\beta=0.1\).
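To make the loss concrete, the per-pair term can be computed from four sequence log-probabilities. The sketch below is a plain-Python illustration with made-up log-probability values; the function names are hypothetical, and a real training loop would compute these quantities in batches with an autodiff framework.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_pair_loss(pos_logp, neg_logp, ref_pos_logp, ref_neg_logp, beta=0.1):
    """DPO loss for one preference pair.

    pos_logp / neg_logp:         log pi_theta(y_pos | x), log pi_theta(y_neg | x)
    ref_pos_logp / ref_neg_logp: the same quantities under the frozen reference model
    """
    margin = beta * ((pos_logp - ref_pos_logp) - (neg_logp - ref_neg_logp))
    # -log(sigmoid(margin)) is identical to softplus(-margin).
    return softplus(-margin)


# Policy identical to the reference: margin is 0, so the loss equals log(2).
loss_at_init = dpo_pair_loss(-10.0, -12.0, -10.0, -12.0)

# Policy raises the corrected response and lowers the hallucinated one: loss drops.
loss_after_update = dpo_pair_loss(-8.0, -14.0, -10.0, -12.0)
```

The loss starts at `log(2)` when the policy equals the reference and decreases as the policy assigns relatively more probability to the self-corrected response than to the hallucinated one.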
Key Experimental Results
Main Results (CP-Bench Test F1-Score)
| Model | Original | + Antidote | Gain |
|---|---|---|---|
| LLaVA-1.5-7B | 5.7 | 78.4 | +72.7 |
| LLaVA-1.5-13B | 17.3 | 83.5 | +66.2 |
| LLaVA-Next-Mistral-7B | 26.7 | 76.8 | +50.1 |
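For intuition about why baseline F1 scores can be so low, the snippet below computes F1 from binary judgments. It assumes that "rejecting a false premise" is the positive class; the exact scoring protocol of CP-Bench is not spelled out in this summary, so treat the class definition and the toy data as assumptions.

```python
def f1_score(labels, preds):
    """F1 for binary judgments.

    labels[i] is True when question i is a CPQ (contains a false premise).
    preds[i]  is True when the model rejected the premise of question i.
    """
    tp = sum(1 for y, p in zip(labels, preds) if y and p)
    fp = sum(1 for y, p in zip(labels, preds) if not y and p)
    fn = sum(1 for y, p in zip(labels, preds) if y and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Four CPQs followed by four TPQs.
labels = [True, True, True, True, False, False, False, False]

# A model that always accepts the premise never rejects anything.
always_accept = [False] * 8
f1_always_accept = f1_score(labels, always_accept)

# A model that rejects three of four CPQs and wrongly rejects one TPQ.
mostly_correct = [True, True, True, False, True, False, False, False]
f1_mostly_correct = f1_score(labels, mostly_correct)
```

Under this definition, a model that cooperates with every question scores an F1 of zero regardless of how well it answers TPQs, while wrongly rejecting true premises lowers precision, which is why TPQs are needed alongside CPQs.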
Object Perception Hallucination (CHAIR_s↓ / SHR↓)
| Model | CHAIR_s (Original) | CHAIR_s (+ Antidote) | SHR (Original) | SHR (+ Antidote) |
|---|---|---|---|---|
| LLaVA-1.5-7B | 19.4 | 9.4 (-51%) | 36.7 | 18.1 (-51%) |
| LLaVA-1.5-13B | 30.0 | 12.6 (-58%) | 37.2 | 21.3 (-43%) |
Ablation Study
| Configuration | CP-Bench | POPE | SHR↓ | MMBench |
|---|---|---|---|---|
| Baseline | 12.4 | 86.07 | 36.7 | 64.3 |
| SFT instead of DPO | 67.8 | 85.14 | 25.9 | 59.6 |
| Antidote (DPO) | 82.9 | 87.89 | 18.1 | 65.4 |
Key Findings
- Antidote brings open-source models close to closed-source performance on CP-Bench (78.4 F1 for LLaVA-1.5-7B + Antidote vs. 86.0 for Claude-3.5).
- DPO significantly outperforms SFT: SFT degrades POPE (86.07 → 85.14) and MMBench (64.3 → 59.6), indicating catastrophic forgetting, whereas DPO avoids the degradation and slightly improves general capability (MMBench 64.3 → 65.4).
- Model size is not the decisive factor in handling counterfactual presuppositions: MiniCPM-V2.5 (8B) achieves 8.5% higher recall than Qwen2-VL (72B) on CP-Bench.
- Attention visualization shows that after Antidote training, the model attends more accurately to the corresponding areas in the image when answering object-related words.
Highlights & Insights
- Definition and Evaluation of New Hallucination Type: Systematically defines and evaluates CPQ hallucinations for the first time, revealing that even the strongest open-source LVLMs fail severely on this task (InternVL-2-Pro has an F1 of only 64.2%).
- Zero-Cost Design of Self-Correction Preference Optimization: Cleverly utilizes the factual information already available in the synthetic data to drive self-correction without relying on external experts like GPT-4V, presenting a paradigm that can be extended to other post-training scenarios.
- Controlled Co-occurrence Decoupling in Synthetic Data: Controls the presence/absence of objects in generated images using negative prompts to provide precise annotations for training data—this idea of "prior-controlled synthetic training data" can be extended to other tasks requiring precise annotations.
Limitations & Future Work
- The diversity of synthetic data is limited by the CC3M caption pool and SD3 generation capabilities.
- GPT-4o evaluation on CP-Bench introduces additional bias.
- The choice of LoRA rank has a significant impact on effectiveness: catastrophic forgetting starts at \(r=128\), and the model collapses at \(r=256\), so the rank requires careful tuning.
- The effectiveness of Antidote has not been validated on stronger baselines like InternVL-2/Qwen2-VL.
Related Work & Insights
- vs HA-DPO: HA-DPO relies on GPT-4V to generate preference samples and is ineffective for CPQs (F1 of only 4.7%). Antidote avoids external dependencies through synthetic data self-correction.
- vs decoding strategies such as VCD/OPERA: Decoding strategies only adjust behavior at inference time and leave the model itself unchanged. Antidote improves the model's ability to judge presuppositions through post-training.
- vs SeVa: SeVa uses self-supervised negative data, showing mediocre performance on CPQs (F1 = 24.1). The positive-negative contrast in Antidote is more effective.
Rating
- Novelty: ⭐⭐⭐⭐ Systematically solves the CPQ hallucination for the first time; the combination of synthetic data + self-correction + DPO is effective and elegant.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Evaluated on CP-Bench + POPE + CHAIR + SHR + 4 general benchmarks, with comprehensive ablation studies and many comparison methods.
- Writing Quality: ⭐⭐⭐⭐ Clearly defined problems, well-motivated methods, and in-depth experimental analysis.
- Value: ⭐⭐⭐⭐⭐ CPQ is a core safety concern for LVLMs in practical applications; Antidote provides a practical solution.