
FINER: MLLMs Hallucinate under Fine-grained Negative Queries

Conference: CVPR 2026
arXiv: 2603.17662
Code: https://explainableml.github.io/finer-project/
Area: Multimodal VLM
Keywords: MLLM hallucination, fine-grained negative queries, DPO, scene graph, hallucination benchmark

TL;DR

This paper shows that MLLM hallucination rates increase dramatically under fine-grained negative queries (queries involving multiple objects/attributes/relations with only one subtle error). It proposes the FINER benchmark and FINER-Tuning (a DPO-based method), achieving up to a 24.2-point improvement on InternVL3.5-14B.

Background & Motivation

Background: Hallucination in MLLMs has been extensively studied; existing benchmarks (POPE, DASH, AMBER) primarily focus on coarse-grained queries, such as whether a single object exists.

Limitations of Prior Work: Queries in real-world scenarios are often fine-grained—involving multiple objects, attributes, and relations. The more fine-grained the query, the more easily the model is misled by "mostly correct" content into answering "yes."

Key Challenge: There is a strong positive correlation between query granularity and hallucination rate. InternVL3.5-14B achieves approximately 80% accuracy at granularity level 1, which drops sharply to approximately 20% at levels 5–7.

Goal: (a) Systematically investigate hallucination behavior under fine-grained negative queries; (b) propose a training method that effectively mitigates fine-grained hallucinations.

Key Insight: Simulating the human sentence construction process (object → attribute → relation), the paper constructs progressively fine-grained negative queries to systematically expose hallucinations.

Core Idea: A scene-graph-driven approach is used to construct a fine-grained negative query benchmark, combined with DPO training to teach the model to detect subtle errors in queries.

Method

Overall Architecture

Benchmark Construction: Starting from the scene graph of an image (objects + attributes + relations), negative queries are generated by substituting one element with a negative version, forming paired positive/negative multiple-choice questions.

Training Method: FINER-style preference data is generated from the Pixmo dataset, and DPO is used to train the model.
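The benchmark-construction step can be sketched as follows. This is a minimal illustration, not the paper's pipeline: exactly one scene-graph element is swapped for a plausible but image-absent substitute to form a fine-grained negative query. The element names and candidate pools are made-up examples (the "door frame" → "pillar" swap echoes the paper's example).

```python
import random

def make_negative_query(scene_graph, candidates, seed=0):
    """Replace exactly one element (object/attribute/relation) of a
    positive scene graph with a semantically plausible negative substitute."""
    rng = random.Random(seed)
    # Pick one element type to corrupt.
    key, positive_value = rng.choice(list(scene_graph.items()))
    # Choose a substitute that differs from the positive value.
    negative_value = rng.choice(
        [c for c in candidates[key] if c != positive_value]
    )
    negative_graph = dict(scene_graph)
    negative_graph[key] = negative_value
    return negative_graph, (key, positive_value, negative_value)

# Illustrative positive scene graph and hypothetical negative candidates.
scene_graph = {"object": "door frame", "attribute": "wooden",
               "relation": "next to the window"}
candidates = {"object": ["pillar", "bookshelf"],
              "attribute": ["metal", "painted"],
              "relation": ["under the window", "behind the sofa"]}
neg, swap = make_negative_query(scene_graph, candidates)
```

In the actual pipeline the substitutes come from an LLM and are filtered for image absence; templates then render both graphs into a positive/negative question pair.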

Key Designs

  1. FINER Benchmark (FINER-CompreCap + FINER-DOCCI):

    • Function: Constructs a fine-grained benchmark covering four settings—multi-object (Multi-obj), multi-attribute (Multi-attr), multi-relation (Multi-rel), and Wh-questions.
    • Mechanism: Starting from a positive scene graph, an LLM generates four semantically plausible but image-absent negative substitutions for each element (e.g., "door frame" → "pillar"), which are then combined via templates into positive/negative multiple-choice questions.
    • Design Motivation: Multiple-choice questions replace simple Yes/No to avoid model response bias; paired positive/negative queries require both to be answered correctly (paired accuracy).
  2. Negative Sample Quality Validation:

    • Function: Ensures that generated negative elements are genuinely absent from the image.
    • Mechanism: Qwen2.5-VL-72B serves as a discriminator: positive elements are mixed into the negative candidates, and if the discriminator cannot single out the positive element, the negative candidates are deemed too ambiguous and regenerated.
    • Design Motivation: The quality of negative samples directly determines the reliability of the benchmark.
  3. FINER-Tuning (DPO Training):

    • Function: Constructs preference data from fine-grained positive/negative query pairs for DPO training.
    • Mechanism: Object/attribute/relation phrases are extracted from Pixmo long descriptions; Phi-4-14B generates negative versions; correct answers (accepted) and incorrect answers (rejected) are constructed; the model is trained with the DPO loss \(\mathcal{L}_{DPO}(\theta) = -\mathbb{E}[\log\sigma(\beta(\Delta_\theta - \Delta_{ref}))]\), where \(\Delta = \log\pi(y_{acc}\mid x) - \log\pi(y_{rej}\mid x)\) is the accepted-minus-rejected log-probability margin under the policy (\(\pi_\theta\)) and reference (\(\pi_{ref}\)) models.
    • Design Motivation: Unlike methods that only reduce hallucinations in model-generated responses, FINER-Tuning teaches the model to detect subtle errors within the query itself.
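The paired-accuracy protocol from design 1 above can be sketched in a few lines. This is an illustrative metric implementation (variable names are mine, not the paper's): a positive/negative query pair counts only if the model answers both members correctly, so a "yes"-biased model gains nothing.

```python
def paired_accuracy(pairs):
    """pairs: list of (pos_correct, neg_correct) booleans per query pair.
    A pair counts as correct only if BOTH answers are correct."""
    if not pairs:
        return 0.0
    return sum(p and n for p, n in pairs) / len(pairs)

# A model that always agrees: perfect on positives, zero on negatives.
always_yes = [(True, False)] * 4
assert paired_accuracy(always_yes) == 0.0  # response bias is not rewarded

balanced = [(True, True), (True, False), (False, True), (True, True)]
score = paired_accuracy(balanced)  # -> 0.5
```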

Loss & Training

  • Pixmo-caption is used as the data source to avoid training set leakage with the benchmark.
  • Phi-4-14B (different from the LLM used for benchmark construction) generates training data.
  • DPO hyperparameter \(\beta = 0.1\).
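For concreteness, the per-pair DPO objective with the reported \(\beta = 0.1\) can be written out numerically. This is a hedged sketch: `delta_theta` and `delta_ref` stand for the accepted-minus-rejected log-probability margins under the policy and reference models, and the numeric values below are invented for illustration.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dpo_loss(delta_theta, delta_ref, beta=0.1):
    """-log sigma(beta * (Delta_theta - Delta_ref)) for one preference pair."""
    return -math.log(sigmoid(beta * (delta_theta - delta_ref)))

# When the policy favors the accepted answer more strongly than the
# reference does, the margin is positive and the loss falls below log(2),
# the value at a zero margin.
loss = dpo_loss(delta_theta=3.0, delta_ref=1.0)
```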

Key Experimental Results

Main Results (FINER-CompreCap, Paired Accuracy)

| Model | Multi-obj | Multi-attr | Multi-rel | Wh |
|---|---|---|---|---|
| Random Guess | 4.0 | 4.0 | 4.0 | 4.0 |
| LLaVA-1.6-7B | 25.3 | 13.0 | 7.6 | 15.3 |
| +FINER-Tuning | 48.4 (+23.1) | 38.4 (+25.4) | 24.2 (+16.6) | 22.1 (+6.8) |
| InternVL3.5-8B | 75.0 | 72.5 | 49.8 | 23.5 |
| +FINER-Tuning | 77.1 (+2.1) | 78.9 (+6.4) | 64.1 (+14.3) | 34.2 (+10.7) |
| InternVL3.5-14B | 74.5 | 68.1 | 47.0 | 21.8 |
| +FINER-Tuning | 80.0 (+5.5) | 78.9 (+10.8) | 71.2 (+24.2) | 30.1 (+8.3) |

Granularity–Accuracy Relationship

| Query Granularity | InternVL3.5-14B Baseline | +FINER-Tuning |
|---|---|---|
| Level 1 | ~80% | ~85% |
| Level 3 | ~50% | ~65% |
| Level 5 | ~25% | ~50% |
| Level 7 | ~20% | ~45% |

Key Findings

  • Hallucination is strongly correlated with query granularity: higher granularity leads to lower accuracy, confirming that fine-grained queries represent a systematic weakness of MLLMs.
  • Multi-rel is the most challenging setting; even strong model baselines score below 50%.
  • FINER-Tuning yields larger gains for weaker models (LLaVA-1.6-7B) than for stronger ones.
  • FINER-Tuning not only improves performance on the FINER benchmark, but also consistently improves results across 8 existing hallucination benchmarks without degrading general capabilities (6 benchmarks).

Highlights & Insights

  • The discovery of the granularity–hallucination correlation is highly insightful, revealing the mechanism by which MLLMs are misled by "mostly correct" information.
  • The paired positive/negative query evaluation design ensures that models cannot exploit a preference for "No" to game the benchmark.
  • FINER-Tuning teaches models to detect "errors in the query" rather than "hallucinations in the response," representing a novel perspective.
  • The data construction pipeline is transferable to other VQA robustness evaluation settings.

Limitations & Future Work

  • Negative element generation relies on LLMs, which may introduce systematic biases.
  • The templates for converting scene graphs to queries are relatively fixed and do not cover all natural language expressions.
  • The benchmark focuses exclusively on negative queries; fine-grained understanding of affirmative queries also warrants investigation.
  • Scene graphs for DOCCI are extracted from long descriptions and may contain extraction noise.
Comparison with Related Work

  • vs. POPE: POPE tests only single-object existence; FINER extends evaluation to multi-element fine-grained negation.
  • vs. AMBER: AMBER covers single object/attribute/relation; FINER pushes granularity to multi-element combinations.
  • vs. RLAIF-V/OPA-DPO: These methods apply DPO to reduce hallucinations in model-generated outputs; FINER-Tuning specifically targets subtle errors embedded within queries.

Rating

  • Novelty: ⭐⭐⭐⭐⭐ The systematic study of the granularity–hallucination relationship opens a new research direction.
  • Experimental Thoroughness: ⭐⭐⭐⭐ Four models + 2 benchmarks + 8 existing hallucination benchmarks + 6 general capability benchmarks.
  • Writing Quality: ⭐⭐⭐⭐ Motivation is clearly articulated; the data construction pipeline is described in detail.
  • Value: ⭐⭐⭐⭐⭐ Both the benchmark and the method provide significant contributions to understanding and mitigating MLLM hallucinations.