Benchmarking Deflection and Hallucination in Large Vision-Language Models

Conference: ACL 2026 · arXiv: 2604.12033 · Code: to be released upon publication · Area: Multimodal VLM
Keywords: vision-language models, hallucination detection, deflection evaluation, knowledge-based VQA, retrieval-augmented generation

TL;DR

This paper proposes VLM-DeflectionBench, a 2,775-sample multimodal benchmark that systematically evaluates whether large vision-language models (LVLMs) deflect or hallucinate under insufficient or misleading evidence, using four evaluation scenarios (Parametric / Oracle / Realistic / Adversarial). Experiments on 20 state-of-the-art LVLMs reveal that virtually no model deflects reliably when the retrieved evidence is noisy.

Background & Motivation

Background: Large vision-language models increasingly rely on retrieval augmentation to answer knowledge-intensive multimodal questions. Existing KB-VQA benchmarks (e.g., OK-VQA, InfoSeek, E-VQA) primarily evaluate accuracy when correct evidence is retrieved.

Limitations of Prior Work: (1) Evidence conflicts are ignored: Existing benchmarks do not account for contradictions between visual and textual evidence, nor do they address what a model should do when retrieved knowledge is incomplete. (2) Rapid obsolescence: As LVLM training sets expand, many questions that previously required retrieval can now be answered directly via parametric knowledge, rendering benchmarks less discriminative. (3) Failure modes are conflated: These benchmarks only measure correctness without distinguishing between erroneous answers (hallucination) and refusals to answer (deflection)—yet deflection is a more desirable failure mode when evidence is insufficient.

Key Challenge: Reliable RAG systems should deflect rather than fabricate when evidence is insufficient, yet no benchmark systematically evaluates this behavior.

Goal: Construct a dynamically updatable benchmark specifically designed to evaluate LVLM hallucination vs. deflection behavior under varying knowledge conditions.

Key Insight: Four complementary scenarios are designed to disentangle parametric memory from retrieval robustness—ranging from no evidence to perfect evidence to mixed evidence to pure distractors.

Core Idea: A dynamic filtering pipeline maintains benchmark difficulty by excluding parametrically answerable samples, while a four-scenario evaluation protocol separately assesses what models know and how they behave when they do not know.

Method

Overall Architecture

A three-stage construction pipeline: Stage I applies multiple gating models to filter out parametrically answerable samples → Stage II mines textual and visual distractors for the retained samples → Stage III performs quality control (ensuring answerability and distractor validity). The final benchmark contains 2,775 samples, each accompanied by gold-standard evidence and distractors.

Key Designs

  1. Dynamic Parametric Filtering (Stage I):

    • Function: Ensures that questions in the benchmark genuinely require external retrieval.
    • Mechanism: Four strong gating models (Gemma3-27B, Qwen-2.5-VL-32B, InternVL3-38B, VL-Rethinker-72B) attempt each question without external knowledge, with GPT-4o judging correctness; a sample is retained only if all four models fail (see the gating sketch after this list).
    • Design Motivation: As model capabilities improve, questions previously requiring retrieval may become parametrically answerable. Dynamic filtering allows the benchmark to be updated over time by substituting stronger gating models, preserving evaluation validity.
  2. Four-Scenario Evaluation Protocol:

    • Function: Disentangles parametric knowledge from retrieval robustness and explicitly evaluates deflection behavior.
    • Mechanism: The same questions are posed under four knowledge conditions:
        ◦ Parametric (no external knowledge): validates filtering effectiveness; near-zero accuracy is expected.
        ◦ Oracle (gold-standard evidence only): measures maximum capability given perfect evidence.
        ◦ Realistic (gold-standard evidence mixed with distractors): simulates real-world retrieval results.
        ◦ Adversarial (distractors only): models are expected to deflect rather than hallucinate.
      Each scenario reports three metrics: accuracy, deflection rate, and hallucination rate (see the scoring sketch after this list).
    • Design Motivation: A single scenario cannot reveal a complete picture of model behavior. The four scenarios span the spectrum from ideal to worst-case knowledge conditions.
  3. Multimodal Distractor Mining (Stage II):

    • Function: Provides high-quality textual and visual distractors for each sample.
    • Mechanism: Textual distractors are retrieved from a Wikipedia index using EVA-CLIP to obtain the top-10 relevant pages, which are then chunked and reranked with Contriever. Visual distractors are retrieved from an image index as the top-10 most similar non-gold images. Each sample is guaranteed at least five distractors (see the mining sketch after this list).
    • Design Motivation: Noise is inevitable in real-world retrieval settings; high-quality distractors test a model's ability to distinguish relevant from irrelevant evidence.
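To make Stage I concrete, here is a minimal sketch of the gating loop, assuming every model (including the GPT-4o judge) is served behind an OpenAI-compatible chat API; the model identifiers, prompts, and helper names below are illustrative assumptions, not the paper's released implementation.

```python
# Hypothetical sketch of Stage I (dynamic parametric filtering).
# Assumes all models sit behind an OpenAI-compatible endpoint;
# model names and prompts are illustrative, not from the paper's code.
import base64
from openai import OpenAI

client = OpenAI()
GATING_MODELS = ["gemma3-27b", "qwen2.5-vl-32b", "internvl3-38b", "vl-rethinker-72b"]

def ask_vlm(model: str, question: str, image_path: str) -> str:
    """Query a gating VLM with the question and image but NO retrieved evidence."""
    with open(image_path, "rb") as f:
        img = base64.b64encode(f.read()).decode()
    resp = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": question},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{img}"}},
            ],
        }],
    )
    return resp.choices[0].message.content

def judged_correct(question: str, prediction: str, gold: str) -> bool:
    """GPT-4o acts as the correctness judge for a single prediction."""
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content":
                   f"Question: {question}\nGold answer: {gold}\n"
                   f"Prediction: {prediction}\n"
                   "Does the prediction convey the gold answer? Answer yes or no."}],
    )
    return resp.choices[0].message.content.strip().lower().startswith("yes")

def passes_parametric_filter(question: str, image_path: str, gold: str) -> bool:
    """Keep a sample only if NO gating model answers it from parametric memory."""
    return not any(
        judged_correct(question, ask_vlm(m, question, image_path), gold)
        for m in GATING_MODELS
    )
```

Because the gate is just a list of model identifiers, refreshing the benchmark later amounts to swapping stronger models into `GATING_MODELS` and re-running the filter; this is what keeps the benchmark dynamically updatable.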
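The four scenarios differ only in which passages accompany the question, and every response is scored into exactly one of three buckets under the strict-RAG convention (any wrong, non-deflecting answer counts as a hallucination). The sketch below is a hypothetical rendering of that protocol; in particular, the phrase-matching deflection detector is a naive stand-in for whatever detector the authors actually use.

```python
# Illustrative sketch of the four-scenario protocol and strict-RAG scoring.
# The deflection detector below is a naive keyword match, used only for clarity.
import random

def build_context(scenario: str, gold: str, distractors: list[str]) -> list[str]:
    """Choose which evidence passages accompany the question."""
    if scenario == "parametric":
        return []                              # no external knowledge at all
    if scenario == "oracle":
        return [gold]                          # perfect evidence only
    if scenario == "realistic":
        mixed = [gold] + list(distractors)     # gold evidence buried in noise
        random.shuffle(mixed)
        return mixed
    if scenario == "adversarial":
        return list(distractors)               # distractors only: deflect!
    raise ValueError(f"unknown scenario: {scenario}")

DEFLECTION_MARKERS = ("i don't know", "cannot be determined",
                      "not enough information", "unable to answer")

def classify(prediction: str, is_correct: bool) -> str:
    """Strict-RAG scoring: each response is exactly one of three outcomes."""
    if is_correct:
        return "accurate"
    if any(m in prediction.lower() for m in DEFLECTION_MARKERS):
        return "deflection"
    return "hallucination"                     # every other wrong answer

def rates(outcomes: list[str]) -> dict[str, float]:
    """Per-scenario accuracy / deflection / hallucination rates."""
    n = max(len(outcomes), 1)
    return {k: outcomes.count(k) / n
            for k in ("accurate", "deflection", "hallucination")}
```

If the three outcomes partition the responses this way, the three rates sum to 100% within each scenario; the Adversarial Defl and Hall columns in the results table below are consistent with this, with the small remainder presumably being answers judged correct despite the gold evidence being withheld.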
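Stage II can be pictured as two nearest-neighbor lookups plus a rerank. The sketch below assumes precomputed EVA-CLIP embeddings stored in FAISS indexes and the public facebook/contriever checkpoint; the chunking granularity, index layout, and function names are assumptions rather than the paper's actual pipeline.

```python
# Illustrative Stage II distractor mining: EVA-CLIP retrieval + Contriever rerank.
import faiss
import numpy as np
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("facebook/contriever")
contriever = AutoModel.from_pretrained("facebook/contriever").eval()

def embed(texts: list[str]) -> torch.Tensor:
    """Mean-pooled Contriever embeddings (the standard recipe for this checkpoint)."""
    batch = tok(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = contriever(**batch).last_hidden_state
    mask = batch["attention_mask"].unsqueeze(-1)
    return (hidden * mask).sum(1) / mask.sum(1)

def mine_text_distractors(query_clip_vec: np.ndarray, question: str,
                          wiki_index: faiss.Index, pages: list[str],
                          gold_page_id: int, k: int = 5) -> list[str]:
    # 1) EVA-CLIP retrieval: top-10 Wikipedia pages, excluding the gold page.
    _, ids = wiki_index.search(query_clip_vec[None, :], 10)
    # 2) Chunk the retrieved pages (fixed 512-character chunks, for simplicity).
    chunks = [p[i:i + 512]
              for pid in ids[0] if pid != gold_page_id
              for p in [pages[pid]]
              for i in range(0, len(p), 512)]
    if not chunks:
        return []
    # 3) Contriever rerank: keep the k chunks most similar to the question.
    sims = embed([question]) @ embed(chunks).T
    top = sims[0].topk(min(k, len(chunks))).indices
    return [chunks[int(i)] for i in top]

def mine_image_distractors(query_clip_vec: np.ndarray, image_index: faiss.Index,
                           gold_image_id: int, k: int = 10) -> list[int]:
    # Top-k most similar non-gold images serve as visual distractors.
    _, ids = image_index.search(query_clip_vec[None, :], k + 1)
    return [int(i) for i in ids[0] if i != gold_image_id][:k]
```

Excluding the gold page and gold image at retrieval time is what makes these hard negatives: they are topically close to the question yet cannot support the correct answer.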

Key Experimental Results

Main Results (Four-scenario evaluation of 20 LVLMs; representative models shown)

All values are percentages; ↑ marks metrics where higher is better, ↓ where lower is better.

| Model | Oracle Acc ↑ | Oracle Hall ↓ | Realistic Acc ↑ | Realistic Hall ↓ | Adversarial Defl ↑ | Adversarial Hall ↓ |
|---|---|---|---|---|---|---|
| Ovis2-34B | 66.5 | 27.8 | 49.1 | 43.3 | 38.7 | 58.1 |
| GPT-5 | 73.1 | 12.6 | 59.5 | 25.5 | 61.2 | 34.7 |
| Claude-Opus-4 | 49.1 | 9.2 | 32.1 | 8.5 | 88.3 | 11.1 |
| Gemini-2.5-Pro | 59.8 | 13.9 | 51.0 | 20.5 | 76.1 | 22.2 |
| Qwen-2.5-VL-32B | 61.0 | 33.9 | 45.2 | 49.5 | 13.7 | 83.9 |
| Mistral-Small-3.1 | 42.6 | 10.3 | 23.5 | 14.9 | 83.8 | 15.6 |

Key Findings

  • No model performs consistently across all scenarios: Claude over-deflects (Oracle accuracy only 49.1%), Qwen is overconfident (adversarial hallucination rate 83.9%), and Mistral also over-deflects.
  • Hallucination remains severe even with gold-standard evidence: LLaVA-OneVision still exhibits a 41.6% hallucination rate in the Oracle scenario, indicating that grounding rather than retrieval is the primary bottleneck.
  • Accuracy drops by 10–20 percentage points from the Oracle to the Realistic scenario: distractors frequently mislead models even when the gold evidence is present.
  • GPT-5 exhibits elevated accuracy in the Parametric scenario (23.7%): This may reflect training data contamination.
  • Open-source models rarely deflect in the Adversarial scenario: Most deflection rates fall below 35%, with models tending to fabricate answers.
  • A fundamental trade-off exists between deflection and accuracy: Models with high deflection rates (e.g., Claude) sacrifice Oracle accuracy.

Highlights & Insights

  • The dynamic filtering philosophy is critically important—benchmarks should evolve alongside models; otherwise they quickly become obsolete. VLM-DeflectionBench's pipeline can maintain timeliness by replacing gating models as stronger ones become available.
  • The four-scenario evaluation protocol reveals behavioral patterns invisible to a single accuracy metric. For instance, Claude may score low on traditional benchmarks due to its high deflection rate, yet it may be the most suitable choice in safety-critical deployment contexts.
  • The distinction between hallucination and deflection is essential for RAG system deployment—in high-stakes domains such as medicine and law, generating unsupported answers is far more dangerous than deflecting.

Limitations & Future Work

  • The benchmark relies solely on GPT-4o as the judge, which may introduce evaluation bias.
  • The scale of 2,775 samples is relatively modest, with limited coverage of certain modality combinations.
  • The "strict RAG" assumption—treating all incorrect answers as hallucinations—is an oversimplification; in practice, errors may stem from misreading evidence rather than confabulation.
  • Training models to improve deflection capability is not explored; the work is purely evaluative.
  • Deflection behavior in multi-turn interactions is not considered.
  • Distractor difficulty is not stratified; distractors of varying difficulty may elicit qualitatively different behaviors.

Comparison with Related Benchmarks

  • vs. MRAG-Bench: MRAG-Bench incorporates visual evidence but does not evaluate deflection or hallucination. VLM-DeflectionBench is the first to systematically assess both behaviors in the KB-VQA setting.
  • vs. HaloQuest/AMBER: These benchmarks focus exclusively on visual hallucination and do not involve retrieval-augmented scenarios. VLM-DeflectionBench evaluates within a RAG framework.
  • vs. SimpleQA/GaRaGe: These are text-only hallucination evaluations that cannot capture visual-textual evidence conflicts.

Rating

  • Novelty: ⭐⭐⭐⭐⭐ First benchmark to systematically evaluate deflection behavior in multimodal RAG; the four-scenario design is highly original.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ Covers 20 models (open- and closed-source) with human validation at κ=0.91.
  • Writing Quality: ⭐⭐⭐⭐⭐ Motivation is clear, experimental design is rigorous, and findings are insightful.
  • Value: ⭐⭐⭐⭐⭐ Introduces a new paradigm for evaluating the reliability of RAG systems with direct implications for deployment decisions.