
FINER: MLLMs Hallucinate under Fine-grained Negative Queries

Conference: CVPR 2026
arXiv: 2603.17662
Code: https://explainableml.github.io/finer-project/
Area: Multimodal VLM
Keywords: MLLM hallucination, fine-grained negative queries, DPO, scene graph, hallucination benchmark

TL;DR

This paper shows that MLLM hallucination rates increase dramatically under fine-grained negative queries (queries involving multiple objects/attributes/relations with only one subtle error). It proposes the FINER benchmark and FINER-Tuning (a DPO-based method), achieving up to a 24.2-point improvement on InternVL3.5-14B.

Background & Motivation

Background: Hallucination in MLLMs has been extensively studied; existing benchmarks (POPE, DASH, AMBER) primarily focus on coarse-grained queries, such as whether a single object exists.

Limitations of Prior Work: Queries in real-world scenarios are often fine-grained—involving multiple objects, attributes, and relations. The more fine-grained the query, the more easily the model is misled by "mostly correct" content into answering "yes."

Key Challenge: There is a strong positive correlation between query granularity and hallucination rate. InternVL3.5-14B achieves approximately 80% accuracy at granularity level 1, which drops sharply to approximately 20% at levels 5–7.

Goal: (a) Systematically investigate hallucination behavior under fine-grained negative queries; (b) propose a training method that effectively mitigates fine-grained hallucinations.

Key Insight: Simulating the human sentence construction process (object → attribute → relation), the paper constructs progressively fine-grained negative queries to systematically expose hallucinations.

Core Idea: A scene-graph-driven approach is used to construct a fine-grained negative query benchmark, combined with DPO training to teach the model to detect subtle errors in queries.

Method

Overall Architecture

Benchmark Construction: Starting from the scene graph of an image (objects + attributes + relations), negative queries are generated by substituting one element with a negative version, forming paired positive/negative multiple-choice questions.

Training Method: FINER-style preference data is generated from the Pixmo dataset, and DPO is used to train the model.
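The benchmark-construction step can be sketched as follows. This is a minimal illustration, not the paper's pipeline: exactly one scene-graph element is swapped for a plausible but image-absent substitute to form a fine-grained negative query. The element names and candidate pools are made-up examples (the "door frame" → "pillar" swap echoes the paper's example).

```python
import random

def make_negative_query(scene_graph, candidates, seed=0):
    """Replace exactly one element (object/attribute/relation) of a
    positive scene graph with a semantically plausible negative substitute."""
    rng = random.Random(seed)
    # Pick one element type to corrupt.
    key, positive_value = rng.choice(list(scene_graph.items()))
    # Choose a substitute that differs from the positive value.
    negative_value = rng.choice(
        [c for c in candidates[key] if c != positive_value]
    )
    negative_graph = dict(scene_graph)
    negative_graph[key] = negative_value
    return negative_graph, (key, positive_value, negative_value)

# Illustrative positive scene graph and hypothetical negative candidates.
scene_graph = {"object": "door frame", "attribute": "wooden",
               "relation": "next to the window"}
candidates = {"object": ["pillar", "bookshelf"],
              "attribute": ["metal", "painted"],
              "relation": ["under the window", "behind the sofa"]}
neg, swap = make_negative_query(scene_graph, candidates)
```

In the actual pipeline the substitutes come from an LLM and are filtered for image absence; templates then render both graphs into a positive/negative question pair.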

Key Designs

  1. FINER Benchmark (FINER-CompreCap + FINER-DOCCI):

    • Function: Constructs a fine-grained benchmark covering four settings—multi-object (Multi-obj), multi-attribute (Multi-attr), multi-relation (Multi-rel), and Wh-questions.
    • Mechanism: Starting from a positive scene graph, an LLM generates four semantically plausible but image-absent negative substitutions for each element (e.g., "door frame" → "pillar"), which are then combined via templates into positive/negative multiple-choice questions.
    • Design Motivation: Multiple-choice questions replace simple Yes/No to avoid model response bias; paired positive/negative queries require both to be answered correctly (paired accuracy).
  2. Negative Sample Quality Validation:

    • Function: Ensures that generated negative elements are genuinely absent from the image.
    • Mechanism: Qwen2.5-VL-72B serves as a discriminator: positive elements are mixed into the negative candidates, and if the discriminator cannot single out the positive element, the negative candidates are deemed too ambiguous and regenerated.
    • Design Motivation: The quality of negative samples directly determines the reliability of the benchmark.
  3. FINER-Tuning (DPO Training):

    • Function: Constructs preference data from fine-grained positive/negative query pairs for DPO training.
    • Mechanism: Object/attribute/relation phrases are extracted from Pixmo long descriptions; Phi-4-14B generates negative versions; correct answers (accepted) and incorrect answers (rejected) are constructed; the model is trained with the DPO loss \(\mathcal{L}_{DPO}(\theta) = -\mathbb{E}[\log\sigma(\beta(\Delta_\theta - \Delta_{ref}))]\), where \(\Delta = \log\pi(y_{acc}\mid x) - \log\pi(y_{rej}\mid x)\) is the accepted-minus-rejected log-probability margin under the policy (\(\pi_\theta\)) and reference (\(\pi_{ref}\)) models.
    • Design Motivation: Unlike methods that only reduce hallucinations in model-generated responses, FINER-Tuning teaches the model to detect subtle errors within the query itself.
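The paired-accuracy protocol from design 1 above can be sketched in a few lines. This is an illustrative metric implementation (variable names are mine, not the paper's): a positive/negative query pair counts only if the model answers both members correctly, so a "yes"-biased model gains nothing.

```python
def paired_accuracy(pairs):
    """pairs: list of (pos_correct, neg_correct) booleans per query pair.
    A pair counts as correct only if BOTH answers are correct."""
    if not pairs:
        return 0.0
    return sum(p and n for p, n in pairs) / len(pairs)

# A model that always agrees: perfect on positives, zero on negatives.
always_yes = [(True, False)] * 4
assert paired_accuracy(always_yes) == 0.0  # response bias is not rewarded

balanced = [(True, True), (True, False), (False, True), (True, True)]
score = paired_accuracy(balanced)  # -> 0.5
```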

Loss & Training

  • Pixmo-caption is used as the data source to avoid training set leakage with the benchmark.
  • Phi-4-14B (different from the LLM used for benchmark construction) generates training data.
  • DPO hyperparameter \(\beta = 0.1\).
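For concreteness, the per-pair DPO objective with the reported \(\beta = 0.1\) can be written out numerically. This is a hedged sketch: `delta_theta` and `delta_ref` stand for the accepted-minus-rejected log-probability margins under the policy and reference models, and the numeric values below are invented for illustration.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dpo_loss(delta_theta, delta_ref, beta=0.1):
    """-log sigma(beta * (Delta_theta - Delta_ref)) for one preference pair."""
    return -math.log(sigmoid(beta * (delta_theta - delta_ref)))

# When the policy favors the accepted answer more strongly than the
# reference does, the margin is positive and the loss falls below log(2),
# the value at a zero margin.
loss = dpo_loss(delta_theta=3.0, delta_ref=1.0)
```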

Key Experimental Results

Main Results (FINER-CompreCap, Paired Accuracy)

| Model | Multi-obj | Multi-attr | Multi-rel | Wh |
|---|---|---|---|---|
| Random Guess | 4.0 | 4.0 | 4.0 | 4.0 |
| LLaVA-1.6-7B | 25.3 | 13.0 | 7.6 | 15.3 |
| +FINER-Tuning | 48.4 (+23.1) | 38.4 (+25.4) | 24.2 (+16.6) | 22.1 (+6.8) |
| InternVL3.5-8B | 75.0 | 72.5 | 49.8 | 23.5 |
| +FINER-Tuning | 77.1 (+2.1) | 78.9 (+6.4) | 64.1 (+14.3) | 34.2 (+10.7) |
| InternVL3.5-14B | 74.5 | 68.1 | 47.0 | 21.8 |
| +FINER-Tuning | 80.0 (+5.5) | 78.9 (+10.8) | 71.2 (+24.2) | 30.1 (+8.3) |

Granularity–Accuracy Relationship

| Query Granularity | InternVL3.5-14B Baseline | +FINER-Tuning |
|---|---|---|
| Level 1 | ~80% | ~85% |
| Level 3 | ~50% | ~65% |
| Level 5 | ~25% | ~50% |
| Level 7 | ~20% | ~45% |

Key Findings

  • Hallucination is strongly correlated with query granularity: higher granularity leads to lower accuracy, confirming that fine-grained queries represent a systematic weakness of MLLMs.
  • Multi-rel is the most challenging setting; even strong model baselines score below 50%.
  • FINER-Tuning yields larger gains for weaker models (LLaVA-1.6-7B) than for stronger ones.
  • FINER-Tuning not only improves performance on the FINER benchmark, but also consistently improves results across 8 existing hallucination benchmarks without degrading general capabilities (6 benchmarks).

Highlights & Insights

  • The discovery of the granularity–hallucination correlation is highly insightful, revealing the mechanism by which MLLMs are misled by "mostly correct" information.
  • The paired positive/negative query evaluation design ensures that models cannot exploit a preference for "No" to game the benchmark.
  • FINER-Tuning teaches models to detect "errors in the query" rather than "hallucinations in the response," representing a novel perspective.
  • The data construction pipeline is transferable to other VQA robustness evaluation settings.

Limitations & Future Work

  • Negative element generation relies on LLMs, which may introduce systematic biases.
  • The templates for converting scene graphs to queries are relatively fixed and do not cover all natural language expressions.
  • The benchmark focuses exclusively on negative queries; fine-grained understanding of affirmative queries also warrants investigation.
  • Scene graphs for DOCCI are extracted from long descriptions and may contain extraction noise.
Comparison with Related Work

  • vs. POPE: POPE tests only single-object existence; FINER extends evaluation to multi-element fine-grained negation.
  • vs. AMBER: AMBER covers single object/attribute/relation; FINER pushes granularity to multi-element combinations.
  • vs. RLAIF-V/OPA-DPO: These methods apply DPO to reduce hallucinations in model-generated outputs; FINER-Tuning specifically targets subtle errors embedded within queries.

Rating

  • Novelty: ⭐⭐⭐⭐⭐ The systematic study of the granularity–hallucination relationship opens a new research direction.
  • Experimental Thoroughness: ⭐⭐⭐⭐ Four models + 2 benchmarks + 8 existing hallucination benchmarks + 6 general capability benchmarks.
  • Writing Quality: ⭐⭐⭐⭐ Motivation is clearly articulated; the data construction pipeline is described in detail.
  • Value: ⭐⭐⭐⭐⭐ Both the benchmark and the method provide significant contributions to understanding and mitigating MLLM hallucinations.