FINER: MLLMs Hallucinate under Fine-grained Negative Queries¶
Conference: CVPR 2026 | arXiv: 2603.17662 | Code: https://explainableml.github.io/finer-project/ | Area: Multimodal VLM | Keywords: MLLM hallucination, fine-grained negative queries, DPO, scene graph, hallucination benchmark
TL;DR¶
This paper shows that MLLM hallucination rates rise dramatically under fine-grained negative queries (queries involving multiple objects, attributes, and relations with only one subtle error). It proposes the FINER benchmark and FINER-Tuning (a DPO-based training method), achieving up to a 24.2-point improvement in paired accuracy on InternVL3.5-14B.
Background & Motivation¶
Background: Hallucination in MLLMs has been extensively studied; existing benchmarks (POPE, DASH, AMBER) primarily focus on coarse-grained queries, such as whether a single object exists.
Limitations of Prior Work: Queries in real-world scenarios are often fine-grained—involving multiple objects, attributes, and relations. The more fine-grained the query, the more easily the model is misled by "mostly correct" content into answering "yes."
Key Observation: There is a strong positive correlation between query granularity and hallucination rate: InternVL3.5-14B achieves approximately 80% accuracy at granularity level 1, which drops sharply to approximately 20% at levels 5–7.
Goal: (a) Systematically investigate hallucination behavior under fine-grained negative queries; (b) propose a training method that effectively mitigates fine-grained hallucinations.
Key Insight: Simulating the human sentence construction process (object → attribute → relation), the paper constructs progressively fine-grained negative queries to systematically expose hallucinations.
Core Idea: A scene-graph-driven approach is used to construct a fine-grained negative query benchmark, combined with DPO training to teach the model to detect subtle errors in queries.
Method¶
Overall Architecture¶
- Benchmark Construction: Starting from the scene graph of an image (objects + attributes + relations), negative queries are generated by substituting one element with a negative version, forming paired positive/negative multiple-choice questions.
- Training Method: FINER-style preference data is generated from the Pixmo dataset, and DPO is used to train the model.
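The element-substitution step can be sketched as below. The scene-graph representation, the substitution table, and the rendering template are illustrative assumptions, not the paper's actual pipeline; only the "door frame" → "pillar" example comes from the paper.

```python
# A minimal scene-graph entry: object + attribute + relation phrase.
# These entries and the substitution table are illustrative, not from the paper.
scene_graph = [
    {"object": "door frame", "attribute": "wooden", "relation": "next to the window"},
    {"object": "cat", "attribute": "black", "relation": "on the sofa"},
]

# LLM-proposed negatives: semantically plausible but absent from the image.
negatives = {"door frame": "pillar", "wooden": "metal", "on the sofa": "under the table"}

def make_negative_graph(graph, slot, index):
    """Copy the positive scene graph and corrupt exactly one element."""
    corrupted = [dict(entry) for entry in graph]
    corrupted[index][slot] = negatives[corrupted[index][slot]]
    return corrupted

def render(graph):
    """Template-render a fine-grained query from the (possibly corrupted) graph."""
    parts = [f"a {e['attribute']} {e['object']} {e['relation']}" for e in graph]
    return "Does the image show " + " and ".join(parts) + "?"

positive_query = render(scene_graph)
negative_query = render(make_negative_graph(scene_graph, "object", 0))
```

Because only one element is substituted, the negative query remains "mostly correct," which is exactly the condition under which the paper observes hallucinations spike.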
Key Designs¶
- FINER Benchmark (FINER-CompreCap + FINER-DOCCI):
- Function: Constructs a fine-grained benchmark covering four settings—multi-object (Multi-obj), multi-attribute (Multi-attr), multi-relation (Multi-rel), and Wh-questions.
- Mechanism: Starting from a positive scene graph, an LLM generates four semantically plausible but image-absent negative substitutions for each element (e.g., "door frame" → "pillar"), which are then combined via templates into positive/negative multiple-choice questions.
- Design Motivation: Multiple-choice questions replace simple Yes/No to avoid model response bias; paired positive/negative queries require both to be answered correctly (paired accuracy).
- Negative Sample Quality Validation:
- Function: Ensures that generated negative elements are genuinely absent from the image.
- Mechanism: Qwen2.5-VL-72B is used as a discriminator; the positive element is mixed in among the negative candidates, and if the discriminator cannot single out the positive element, the negative candidates are deemed ambiguous (i.e., possibly present in the image) and regenerated.
- Design Motivation: The quality of negative samples directly determines the reliability of the benchmark.
- FINER-Tuning (DPO Training):
- Function: Constructs preference data from fine-grained positive/negative query pairs for DPO training.
- Mechanism: Object/attribute/relation phrases are extracted from Pixmo long descriptions; Phi-4-14B generates negative versions; correct answers (chosen) and incorrect answers (rejected) are paired into preference data; the model is trained with the DPO loss \(\mathcal{L}_{DPO}(\theta) = -\mathbb{E}[\log\sigma(\beta(\Delta_\theta - \Delta_{ref}))]\), where \(\Delta = \log\pi(y_w \mid x) - \log\pi(y_l \mid x)\) is the log-probability margin between the chosen response \(y_w\) and the rejected response \(y_l\), computed under the policy (\(\Delta_\theta\)) and the frozen reference model (\(\Delta_{ref}\)).
- Design Motivation: Unlike methods that only reduce hallucinations in model-generated responses, FINER-Tuning teaches the model to detect subtle errors within the query itself.
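The paired-accuracy metric used throughout the benchmark (a pair counts only if both the positive and the negative query are answered correctly) can be sketched as follows; the record format is an assumption for illustration:

```python
def paired_accuracy(results):
    """results: list of (pos_correct, neg_correct) booleans per query pair.
    A pair scores only if BOTH queries are answered correctly, so a model
    with a fixed Yes (or No) bias cannot game the metric."""
    if not results:
        return 0.0
    return sum(pos and neg for pos, neg in results) / len(results)

# A model with a "yes" bias gets every positive right but every negative wrong:
biased = [(True, False)] * 10
biased_score = paired_accuracy(biased)  # 0.0 despite 50% per-question accuracy
```

This is why the tables below report paired accuracy, and why random guessing on 5-way multiple choice scores only 4.0 (0.2 × 0.2 = 4%) rather than 20%.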
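The negative-sample validation loop can be sketched as below. The `discriminator` and `regenerate` callables are stand-ins for Qwen2.5-VL-72B and the generating LLM; the function names and the retry budget are assumptions, not the paper's implementation.

```python
import random

def validate_negatives(positive, candidates, discriminator, regenerate=None, max_rounds=3):
    """Quality-check sketch: shuffle the positive element in among the negative
    candidates and ask the discriminator to pick the element actually present
    in the image. If it fails, the negatives are deemed ambiguous and regenerated.
    `discriminator` and `regenerate` stand in for Qwen2.5-VL-72B and the LLM."""
    for _ in range(max_rounds):
        options = candidates + [positive]
        random.shuffle(options)
        if discriminator(options) == positive:
            return candidates              # negatives are cleanly absent from the image
        candidates = regenerate(positive)  # ambiguous: ask the LLM for new negatives
    return None                            # give up on this element after max_rounds
```

For example, `validate_negatives("cat", ["dog", "fox"], lambda opts: "cat")` accepts the candidates immediately, because the discriminator can still single out the element that is really in the image.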
Loss & Training¶
- Pixmo-caption is used as the training data source to avoid leakage between the training set and the benchmark.
- Phi-4-14B (different from the LLM used for benchmark construction) generates training data.
- DPO hyperparameter \(\beta = 0.1\).
Key Experimental Results¶
Main Results (FINER-CompreCap, Paired Accuracy)¶
| Model | Multi-obj | Multi-attr | Multi-rel | Wh |
|---|---|---|---|---|
| Random Guess | 4.0 | 4.0 | 4.0 | 4.0 |
| LLaVA-1.6-7B | 25.3 | 13.0 | 7.6 | 15.3 |
| +FINER-Tuning | 48.4 (+23.1) | 38.4 (+25.4) | 24.2 (+16.6) | 22.1 (+6.8) |
| InternVL-3.5-8B | 75.0 | 72.5 | 49.8 | 23.5 |
| +FINER-Tuning | 77.1 (+2.1) | 78.9 (+6.4) | 64.1 (+14.3) | 34.2 (+10.7) |
| InternVL-3.5-14B | 74.5 | 68.1 | 47.0 | 21.8 |
| +FINER-Tuning | 80.0 (+5.5) | 78.9 (+10.8) | 71.2 (+24.2) | 30.1 (+8.3) |
Granularity–Accuracy Relationship¶
| Query Granularity | InternVL3.5-14B Baseline | +FINER-Tuning |
|---|---|---|
| Level 1 | ~80% | ~85% |
| Level 3 | ~50% | ~65% |
| Level 5 | ~25% | ~50% |
| Level 7 | ~20% | ~45% |
Key Findings¶
- Hallucination is strongly correlated with query granularity: higher granularity leads to lower accuracy, confirming that fine-grained queries represent a systematic weakness of MLLMs.
- Multi-rel is the most challenging setting; even strong model baselines score below 50%.
- FINER-Tuning yields larger absolute gains for weaker models (e.g., LLaVA-1.6-7B on Multi-obj and Multi-attr) than for stronger ones.
- FINER-Tuning not only improves performance on the FINER benchmark, but also consistently improves results across 8 existing hallucination benchmarks without degrading general capabilities (6 benchmarks).
Highlights & Insights¶
- The discovery of the granularity–hallucination correlation is highly insightful, revealing the mechanism by which MLLMs are misled by "mostly correct" information.
- The paired positive/negative query evaluation design ensures that models cannot exploit a preference for "No" to game the benchmark.
- FINER-Tuning teaches models to detect "errors in the query" rather than "hallucinations in the response," representing a novel perspective.
- The data construction pipeline is transferable to other VQA robustness evaluation settings.
Limitations & Future Work¶
- Negative element generation relies on LLMs, which may introduce systematic biases.
- The templates for converting scene graphs to queries are relatively fixed and do not cover all natural language expressions.
- The benchmark focuses exclusively on negative queries; fine-grained understanding of affirmative queries also warrants investigation.
- Scene graphs for DOCCI are extracted from long descriptions and may contain extraction noise.
Related Work & Insights¶
- vs. POPE: POPE tests only single-object existence; FINER extends evaluation to multi-element fine-grained negation.
- vs. AMBER: AMBER covers single object/attribute/relation; FINER pushes granularity to multi-element combinations.
- vs. RLAIF-V/OPA-DPO: These methods apply DPO to reduce hallucinations in model-generated outputs; FINER-Tuning specifically targets subtle errors embedded within queries.
Rating¶
- Novelty: ⭐⭐⭐⭐⭐ The systematic study of the granularity–hallucination relationship opens a new research direction.
- Experimental Thoroughness: ⭐⭐⭐⭐ Four models + 2 benchmarks + 8 existing hallucination benchmarks + 6 general capability benchmarks.
- Writing Quality: ⭐⭐⭐⭐ Motivation is clearly articulated; the data construction pipeline is described in detail.
- Value: ⭐⭐⭐⭐⭐ Both the benchmark and the method provide significant contributions to understanding and mitigating MLLM hallucinations.