GLSim: Detecting Object Hallucinations in LVLMs via Global-Local Similarity

Conference: NeurIPS 2025
arXiv: 2508.19972
Code: https://github.com/deeplearning-wisc/glsim
Area: Multimodal VLM / Hallucination Detection
Keywords: object hallucination, hallucination detection, global-local similarity, visual logit lens, training-free

TL;DR

GLSim is a training-free object hallucination detection method for LVLMs that combines a global scene similarity score (cosine similarity between the object token and the last instruction token) and a local visual grounding similarity score (cosine similarity between the object token and the Top-K image patch embeddings localized via Visual Logit Lens). It achieves 83.7% AUROC on MSCOCO, surpassing SVAR by 9% and Internal Confidence by 10.8%.

Background & Motivation

  • Background: Large vision-language models (LVLMs) are prone to object hallucinations—generating descriptions of objects that do not exist in the image—which severely undermines reliable deployment in high-stakes domains such as medical imaging and autonomous driving.
  • Limitations of Prior Work: Existing hallucination detection methods either rely on external annotated data (e.g., CHAIR), require external LLM judges (e.g., FaithScore), or exploit only a single-perspective signal. Token-probability-based methods (NLL) fail because token probabilities track linguistic fluency rather than visual evidence; attention-based methods (SVAR) are susceptible to attention sinks; and Internal Confidence, which directly uses the maximum probability from the Visual Logit Lens, can be overconfident.
  • Key Challenge: A single global or local signal each has blind spots. Global methods may falsely accept contextually plausible but visually absent objects (e.g., "dining table" in a birthday scene); local methods may be confused by visually similar objects (e.g., a motorcycle seat mistaken for a "handbag").
  • Key Insight: This work is the first to unify global and local embedding similarity signals within a single framework, leveraging their complementary strengths.

Method

Overall Architecture

GLSim is a training-free, object-level hallucination detection framework. For each object \(o\) mentioned in the LVLM-generated text, two scores are computed: (1) a global similarity score—cosine similarity between the object embedding and the scene embedding; and (2) a local similarity score—average cosine similarity between the object embedding and the Top-K image patch embeddings localized via Visual Logit Lens. The final GLSim score is a weighted combination of the two.
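
Both scores need only the model's hidden states, which can be read off a single forward pass. Below is a minimal extraction sketch, assuming a HuggingFace-style decoder whose forward accepts `output_hidden_states=True`; the function name `collect_states` and the position bookkeeping (`visual_slice`, `instr_last`, `obj_pos`) are illustrative assumptions, not the authors' implementation (see the linked repo for that).

```python
import torch

@torch.no_grad()
def collect_states(model, inputs, l, l_prime, visual_slice, instr_last, obj_pos):
    """Run one forward pass over [image tokens | instruction | generated text]
    and read the hidden states GLSim needs at layers l and l'.
    visual_slice / instr_last / obj_pos are the sequence positions of the
    visual tokens, the last instruction token, and the object token."""
    out = model(**inputs, output_hidden_states=True)
    hs = out.hidden_states              # tuple; hs[0] = embeddings, hs[l] = layer l
    h_visual = hs[l][0, visual_slice]   # h_l(v_i): (num_patches, d)
    h_scene = hs[l][0, instr_last]      # h_l(v, t): last instruction token
    h_object = hs[l_prime][0, obj_pos]  # h_{l'}(o): object token representation
    return h_visual, h_scene, h_object
```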

Key Designs

  1. Unsupervised Object Localization via Visual Logit Lens
     • Function: Localizes the image regions most relevant to a given object without relying on external annotations or detectors.
     • Mechanism: The hidden representation \(h_l(v_i)\) of each visual token \(v_i\) at decoder layer \(l\) is projected into the vocabulary space via the unembedding matrix \(W_U\), i.e., \(\text{VLL}_l(v_i) = W_U\, h_l(v_i)\), yielding the probability \(\text{softmax}(\text{VLL}_l(v_i))[o]\) that each visual patch predicts the object word \(o\). The Top-K patches with the highest probabilities form the localization region \(I(o)\). (All three computations are sketched in code after this list.)
     • Design Motivation: Visual Logit Lens localizes objects more accurately than attention weights (a 12.5% AUROC improvement in experiments) and requires no external detectors.

  2. Local Similarity Score
     • Function: Verifies whether the object has genuine visual evidence in a specific region of the image.
     • Mechanism: Computes the average cosine similarity between the object token embedding \(h_{l'}(o)\) and the hidden representations \(h_l(v_i)\) of the Top-K localized patches: \(s_\text{local} = \frac{1}{K}\sum_{v_i \in I(o)} \text{sim}(h_l(v_i),\, h_{l'}(o))\). Regions corresponding to real objects yield high similarity, while hallucinated objects map to irrelevant regions with low similarity.
     • Design Motivation: Embedding similarity is more stable than raw Logit Lens probabilities, which can be overconfident (as observed with Internal Confidence); the embedding space provides a finer-grained signal.

  3. Global Similarity Score
     • Function: Assesses whether the object is semantically consistent with the overall scene.
     • Mechanism: Computes the cosine similarity between the object token embedding and the hidden representation of the last token of the instruction prompt: \(s_\text{global} = \text{sim}(h_l(v, t),\, h_{l'}(o))\). The last instruction token encodes the model's integrated understanding of both the image and textual context.
     • Design Motivation: The last instruction token captures scene semantics more effectively than the last image token or the average of all image tokens (an 8% AUROC improvement in ablation), providing a high-level judgment of whether an object is plausible in the scene.
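
A minimal PyTorch sketch of the three computations above, assuming the hidden states \(h_l(v_i)\), \(h_{l'}(o)\), and \(h_l(v, t)\) have been extracted as in the earlier snippet; function names are ours, a multi-token object word would additionally need pooling, and the official repo is the authoritative implementation.

```python
import torch
import torch.nn.functional as F

def localize_topk(h_visual: torch.Tensor, W_U: torch.Tensor,
                  object_token_id: int, k: int = 32) -> torch.Tensor:
    """Visual Logit Lens localization: the K patches most predictive of o."""
    # h_visual: (num_patches, d) hidden states of visual tokens at layer l
    # W_U: (d, vocab_size) unembedding matrix of the decoder
    # object_token_id: vocabulary id of the object word (assumes one token)
    logits = h_visual @ W_U                           # VLL_l(v_i) per patch
    probs = logits.softmax(dim=-1)[:, object_token_id]
    return probs.topk(k).indices                      # localization region I(o)

def local_score(h_visual: torch.Tensor, h_object: torch.Tensor,
                topk_idx: torch.Tensor) -> torch.Tensor:
    """s_local: mean cosine similarity of h_{l'}(o) to the Top-K patches."""
    patches = h_visual[topk_idx]                      # (K, d)
    return F.cosine_similarity(patches, h_object.unsqueeze(0), dim=-1).mean()

def global_score(h_scene: torch.Tensor, h_object: torch.Tensor) -> torch.Tensor:
    """s_global: cosine similarity with the last instruction token h_l(v, t)."""
    return F.cosine_similarity(h_scene, h_object, dim=0)
```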

Loss & Training

GLSim is entirely training-free and directly exploits the internal representations of the LVLM. The final score is \(\text{GLSim} = w \cdot s_\text{global} + (1-w) \cdot s_\text{local}\), where \(w = 0.6\) consistently achieves the best performance across settings. Layer indices \(l\) and \(l'\) are selected via ablation (LLaVA: \(l=32\), \(l'=31\); Shikra: \(l=30\), \(l'=27\)).
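
Putting the pieces together with the reported hyperparameters (a sketch: \(w\) and the layer pairs come from the paper's ablations, while the decision threshold `tau` is a hypothetical addition, since the paper evaluates with threshold-free AUROC):

```python
GLOBAL_WEIGHT = 0.6                  # w from the paper's ablation
LAYER_PAIRS = {"llava": (32, 31),    # (l, l') per model, from the ablations
               "shikra": (30, 27)}

def glsim(s_global: float, s_local: float, w: float = GLOBAL_WEIGHT) -> float:
    """Final GLSim score; lower values indicate a likely hallucination."""
    return w * s_global + (1 - w) * s_local

def is_hallucinated(s_global: float, s_local: float, tau: float = 0.5) -> bool:
    return glsim(s_global, s_local) < tau   # tau: hypothetical threshold
```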

Key Experimental Results

Main Results

Dataset / Model          GLSim   SVAR   Internal Conf.   Contextual Lens   NLL
MSCOCO / LLaVA-7B        83.7    74.7   72.9             75.4              63.7
MSCOCO / LLaVA-13B       84.8    75.2   71.0             78.7              63.1
MSCOCO / MiniGPT-4       87.0    83.6   75.7             84.9              59.4
MSCOCO / Shikra          83.0    70.7   69.1             69.5              60.4
Objects365 / LLaVA-7B    72.6    64.9   68.7             63.2              62.9
Objects365 / MiniGPT-4   74.8    71.0   68.5             70.2              56.7

(All values are AUROC.)

Ablation Study

Configuration                           LLaVA AUROC           Shikra AUROC   Notes
Global only (\(s_\text{global}\))       79.3                  78.9           Already outperforms all baselines
Local Top-K only (\(s_\text{local}\))   78.8                  76.8           Complementary to global
GLSim (global + local Top-K)            83.7                  83.0           Fusion gain: +4.9 / +6.2
Localization: Attention                 66.3                  65.0           Attention weights unreliable
Localization: Cosine Sim                76.2                  70.1           Suboptimal
Localization: Logit Lens                78.8                  76.8           Best localization
\(w = 0.4\)                             82.5                  —              Biased toward local
\(w = 0.6\)                             83.7                  —              Optimal balance
\(w = 0.8\)                             82.0                  —              Biased toward global
\(K = 8/16/32/64\)                      82 → 83 → 83.7 → 82   —              \(K = 32\) optimal (~6% of image tokens)

(Localization rows report the local score \(s_\text{local}\) alone; the \(w\) and \(K\) sweeps report LLaVA only.)

Key Findings

  • GLSim consistently outperforms all baselines across all LVLM and dataset combinations, with particularly large gains on Shikra (83.0 vs. 70.7 AUROC for SVAR, a 12.3-point gain).
  • Global and local signals are genuinely complementary: each individually already surpasses all prior methods, and their fusion yields further improvement.
  • Internal Confidence can be overconfident for hallucinated objects, as Visual Logit Lens probabilities may assign high scores to incorrect regions.
  • Visual Logit Lens outperforms attention weights by 12.5% and cosine similarity by 2.6% as a localization method.
  • Optimal layer selection falls in the late intermediate layers rather than the final layer, supporting the observation that the best task-specific representations do not necessarily reside in the last layer.

Highlights & Insights

  • The global-local complementarity idea is intuitive yet highly effective, and this work is the first to demonstrate its value in object hallucination detection.
  • The training-free, plug-and-play design requires no additional training or external models, relying entirely on the LVLM's internal representations.
  • The paper provides a comprehensive benchmark, systematically comparing five object-level hallucination detection methods for the first time and filling a notable gap in the field.
  • The qualitative analysis is highly intuitive: Figure 2 clearly illustrates complementary failure cases where the global signal fails but the local succeeds, and vice versa.

Limitations & Future Work

  • The method only addresses object existence hallucinations and does not handle attribute-level (color, size) or relation-level (spatial position) hallucinations.
  • Although \(K\) and \(w\) are empirically robust, their optimal values may vary with input resolution.
  • Selecting appropriate layer indices \(l\) and \(l'\) requires per-model ablation.
  • How to leverage GLSim scores to correct or mitigate hallucinations after detection remains an open direction.
  • The global-local complementarity paradigm is transferable to other detection tasks, and GLSim can be combined in series with training-time hallucination mitigation methods (e.g., Causal-LLaVA).

Comparison with Baselines

  • CHAIR: A hallucination evaluation metric based on ground-truth matching; GLSim requires no annotations.
  • Internal Confidence: Uses Visual Logit Lens probabilities directly; GLSim improves upon this by adopting embedding similarity with Top-K aggregation.
  • SVAR: An attention-weight-based method susceptible to attention sinks and text-token bias.

Rating

  • Novelty: ⭐⭐⭐⭐ — First systematic integration of global and local signals for hallucination detection, with a novel application of Visual Logit Lens.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ — 5 LVLMs × 2 datasets × 5 baselines × extensive ablations over \(K\), \(w\), layer selection, localization method, and global design; highly comprehensive.
  • Writing Quality: ⭐⭐⭐⭐ — Motivation and methodology are clearly presented; qualitative analysis is intuitive and effective.
  • Value: ⭐⭐⭐⭐ — The training-free, plug-and-play nature makes it highly practical; code is open-sourced.