GLSim: Detecting Object Hallucinations in LVLMs via Global-Local Similarity¶
Conference: NeurIPS 2025
arXiv: 2508.19972
Code: https://github.com/deeplearning-wisc/glsim
Area: Multimodal VLM / Hallucination Detection
Keywords: object hallucination, hallucination detection, global-local similarity, visual logit lens, training-free
TL;DR¶
GLSim is a training-free object hallucination detection method for LVLMs that combines a global scene similarity score (cosine similarity between the object token and the last instruction token) and a local visual grounding similarity score (cosine similarity between the object token and the Top-K image patch embeddings localized via Visual Logit Lens). It achieves 83.7% AUROC on MSCOCO, surpassing SVAR by 9% and Internal Confidence by 10.8%.
Background & Motivation¶
- Background: Large vision-language models (LVLMs) are prone to object hallucinations—generating descriptions of objects that do not exist in the image—which severely undermines reliable deployment in high-stakes domains such as medical imaging and autonomous driving.
- Limitations of Prior Work: Existing hallucination detection methods either rely on external annotated data (e.g., CHAIR), require external LLM judges (e.g., FaithScore), or exploit only a single-perspective signal. Token-probability-based methods (NLL) fail because LLMs favor linguistic fluency; attention-based methods (SVAR) are susceptible to attention sinks; and Internal Confidence, which directly uses the maximum probability from Visual Logit Lens, can be overconfident.
- Key Challenge: Global and local signals each have blind spots when used alone. Global methods may falsely accept contextually plausible but visually absent objects (e.g., "dining table" in a birthday scene); local methods may be confused by visually similar objects (e.g., a motorcycle seat mistaken for a "handbag").
- Key Insight: This work is the first to unify global and local embedding similarity signals within a single framework, leveraging their complementary strengths.
Method¶
Overall Architecture¶
GLSim is a training-free, object-level hallucination detection framework. For each object \(o\) mentioned in the LVLM-generated text, two scores are computed: (1) a global similarity score—cosine similarity between the object embedding and the scene embedding; and (2) a local similarity score—average cosine similarity between the object embedding and the Top-K image patch embeddings localized via Visual Logit Lens. The final GLSim score is a weighted combination of the two.
Key Designs¶
- Unsupervised Object Localization via Visual Logit Lens
  - Function: Localizes the image regions most relevant to a given object without relying on external annotations or detectors.
  - Mechanism: The hidden representation \(h_l(v_i)\) of each visual token \(v_i\) at decoder layer \(l\) is projected into the vocabulary space via the unembedding matrix \(W_U\), yielding the probability \(\text{softmax}(\text{VLL}_l(v_i))[o]\) that each visual patch predicts object word \(o\). The Top-K patches with the highest probabilities are selected as the localization region \(I(o)\).
  - Design Motivation: Visual Logit Lens localizes objects more accurately than attention weights (a 12.5% AUROC improvement in the localization ablation) and requires no external detectors.
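The localization step above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function name `logit_lens_topk`, the toy array shapes, and the dense matrix multiply are all assumptions for exposition.

```python
import numpy as np

def logit_lens_topk(hidden_visual, W_U, object_id, k=32):
    """Hypothetical sketch of Visual Logit Lens localization.

    hidden_visual: (num_patches, d_model) hidden states of visual tokens
    W_U:           (d_model, vocab_size) unembedding matrix
    Returns the indices of the Top-K patches most likely to predict
    the object word, i.e. the localization region I(o).
    """
    logits = hidden_visual @ W_U                     # (num_patches, vocab)
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    probs = np.exp(logits)
    probs /= probs.sum(axis=-1, keepdims=True)       # softmax over vocabulary
    obj_probs = probs[:, object_id]                  # P(patch predicts o)
    topk_idx = np.argsort(-obj_probs)[:k]            # highest-probability patches
    return topk_idx, obj_probs
```

With, say, 576 image tokens and \(K=32\), this selects roughly 6% of the patches, matching the paper's reported sweet spot.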
- Local Similarity Score
  - Function: Verifies whether the object has genuine visual evidence in a specific region of the image.
  - Mechanism: Computes the average cosine similarity between the object token embedding \(h_{l'}(o)\) and the hidden representations \(h_l(v_i)\) of the Top-K localized patches: \(s_\text{local} = \frac{1}{K}\sum_{v_i \in I(o)} \text{sim}(h_l(v_i),\, h_{l'}(o))\). Regions corresponding to real objects yield high similarity, while hallucinated objects map to irrelevant regions with low similarity.
  - Design Motivation: Using embedding similarity is more stable than using raw Logit Lens probability values, which can be overconfident (as observed with Internal Confidence). The embedding space provides a finer-grained signal.
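The \(s_\text{local}\) formula maps directly to code. A minimal sketch, assuming NumPy arrays for the hidden states; `cosine_sim` and `local_score` are illustrative names, not from the paper's codebase.

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def local_score(object_emb, patch_hiddens, topk_idx):
    """s_local: mean cosine similarity between the object-token embedding
    h_{l'}(o) and the hidden states h_l(v_i) of the Top-K patches in I(o)."""
    return float(np.mean([cosine_sim(patch_hiddens[i], object_emb)
                          for i in topk_idx]))
```

A real object's localized patches point in roughly the same direction as its token embedding (score near 1); a hallucinated object's "best" patches are essentially unrelated, driving the average down.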
- Global Similarity Score
  - Function: Assesses whether the object is semantically consistent with the overall scene.
  - Mechanism: Computes the cosine similarity between the object token embedding and the hidden representation of the last token of the instruction prompt: \(s_\text{global} = \text{sim}(h_l(v, t),\, h_{l'}(o))\). The last instruction token encodes the model's integrated understanding of both the image and textual context.
  - Design Motivation: The last instruction token captures scene semantics more effectively than the "last image token" or the "average of all image tokens" (ablation shows an 8% AUROC improvement), providing a high-level judgment of whether an object is plausible in the scene.
Loss & Training¶
GLSim is entirely training-free and directly exploits the internal representations of the LVLM. The final score is \(\text{GLSim} = w \cdot s_\text{global} + (1-w) \cdot s_\text{local}\), where \(w = 0.6\) consistently achieves the best performance across settings. Layer indices \(l\) and \(l'\) are selected via ablation (LLaVA: \(l=32\), \(l'=31\); Shikra: \(l=30\), \(l'=27\)).
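Putting the two scores together, the full detector reduces to one weighted sum. The sketch below is an assumed end-to-end shape, not the released implementation: in practice `last_instr_emb` would be the hidden state of the final instruction token at layer \(l\), `object_emb` the object token's hidden state at layer \(l'\), and `topk_idx` the output of the Visual Logit Lens localization.

```python
import numpy as np

def cosine_sim(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def glsim_score(object_emb, last_instr_emb, patch_hiddens, topk_idx, w=0.6):
    """GLSim = w * s_global + (1 - w) * s_local.

    w = 0.6 is the weight the paper reports as robust across settings.
    A higher score indicates a grounded object; a low score flags a
    likely hallucination.
    """
    # Global: is the object semantically consistent with the whole scene?
    s_global = cosine_sim(last_instr_emb, object_emb)
    # Local: does the object have visual evidence in its localized patches?
    s_local = np.mean([cosine_sim(patch_hiddens[i], object_emb)
                       for i in topk_idx])
    return w * s_global + (1 - w) * s_local
```

Thresholding this score (or ranking objects by it, as AUROC evaluation does) yields the final hallucinated-vs-grounded decision.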
Key Experimental Results¶
Main Results¶
| Dataset / Model | Metric | GLSim | SVAR | Internal Conf. | Contextual Lens | NLL |
|---|---|---|---|---|---|---|
| MSCOCO / LLaVA-7B | AUROC | 83.7 | 74.7 | 72.9 | 75.4 | 63.7 |
| MSCOCO / LLaVA-13B | AUROC | 84.8 | 75.2 | 71.0 | 78.7 | 63.1 |
| MSCOCO / MiniGPT-4 | AUROC | 87.0 | 83.6 | 75.7 | 84.9 | 59.4 |
| MSCOCO / Shikra | AUROC | 83.0 | 70.7 | 69.1 | 69.5 | 60.4 |
| Objects365 / LLaVA-7B | AUROC | 72.6 | 64.9 | 68.7 | 63.2 | 62.9 |
| Objects365 / MiniGPT-4 | AUROC | 74.8 | 71.0 | 68.5 | 70.2 | 56.7 |
Ablation Study¶
| Configuration | LLaVA AUROC | Shikra AUROC | Notes |
|---|---|---|---|
| Global only (\(s_\text{global}\)) | 79.3 | 78.9 | Already outperforms all baselines |
| Local Top-K only (\(s_\text{local}\)) | 78.8 | 76.8 | Complementary to global |
| GLSim (global + local Top-K) | 83.7 | 83.0 | Gain over local-only: +4.9 / +6.2 |
| Localization: Attention | 66.3 (local) | 65.0 | Attention weights unreliable |
| Localization: Cosine Sim | 76.2 (local) | 70.1 | Sub-optimal |
| Localization: Logit Lens | 78.8 (local) | 76.8 | Best localization |
| \(w = 0.4\) | 82.5 | — | Biased toward local |
| \(w = 0.6\) | 83.7 | — | Optimal balance |
| \(w = 0.8\) | 82.0 | — | Biased toward global |
| \(K = 8/16/32/64\) | 82→83→83.7→82 | — | \(K=32\) optimal (~6% of image tokens) |
Key Findings¶
- GLSim consistently outperforms all baselines across all LVLM and dataset combinations, with particularly large gains on Shikra (+12.3 AUROC points vs. SVAR, per the main table).
- Global and local signals are genuinely complementary: each individually already surpasses all prior methods, and their fusion yields further improvement.
- Internal Confidence can be overconfident for hallucinated objects, as Visual Logit Lens probabilities may assign high scores to incorrect regions.
- Visual Logit Lens outperforms attention weights by 12.5% and cosine similarity by 2.6% as a localization method.
- Optimal layer selection falls in the late intermediate layers rather than the final layer, supporting the observation that the best task-specific representations do not necessarily reside in the last layer.
Highlights & Insights¶
- The global-local complementarity idea is intuitive yet highly effective, and this work is the first to demonstrate its value in object hallucination detection.
- The training-free, plug-and-play design requires no additional training or external models, relying entirely on the LVLM's internal representations.
- The paper provides a comprehensive benchmark, systematically comparing five object-level hallucination detection methods for the first time and filling a notable gap in the field.
- The qualitative analysis is highly intuitive: Figure 2 clearly illustrates complementary failure cases where the global signal fails but the local succeeds, and vice versa.
Limitations & Future Work¶
- The method only addresses object existence hallucinations and does not handle attribute-level (color, size) or relation-level (spatial position) hallucinations.
- Although \(K\) and \(w\) are empirically robust, their optimal values may vary with input resolution.
- Selecting appropriate layer indices \(l\) and \(l'\) requires per-model ablation.
- How to leverage GLSim scores to correct or mitigate hallucinations after detection remains an open direction.
Related Work & Insights¶
- CHAIR: A hallucination evaluation metric based on ground-truth matching; GLSim requires no annotations.
- Internal Confidence: Uses Visual Logit Lens probabilities; GLSim improves upon this by adopting embedding similarity with Top-K aggregation.
- SVAR: An attention-weight-based method susceptible to attention sinks and text-token bias.
- The global-local complementarity paradigm is transferable to other detection tasks; GLSim can also be combined in series with training-time hallucination mitigation methods (e.g., Causal-LLaVA).
Rating¶
- Novelty: ⭐⭐⭐⭐ — First systematic integration of global and local signals for hallucination detection, with a novel application of Visual Logit Lens.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ — 5 LVLMs × 2 datasets × 5 baselines × extensive ablations over \(K\), \(w\), layer selection, localization method, and global design; highly comprehensive.
- Writing Quality: ⭐⭐⭐⭐ — Motivation and methodology are clearly presented; qualitative analysis is intuitive and effective.
- Value: ⭐⭐⭐⭐ — The training-free, plug-and-play nature makes it highly practical; code is open-sourced.