Understanding Language Prior of LVLMs by Contrasting Chain-of-Embedding

Conference: ICLR 2026
arXiv: 2509.23050
Code: None
Area: Dialogue Systems
Keywords: Language Prior, Visual Integration Point, Large Vision Language Models, Representation Analysis, Interpretability

TL;DR

By comparing layer-wise hidden representations (chain-of-embedding) with and without visual input, this study identifies a "Visual Integration Point" (VIP) layer in LVLMs and proposes the Total Visual Integration (TVI) metric to quantify the strength of the language prior.

Background & Motivation

Large Vision Language Models (LVLMs) exhibit excellent performance in multimodal tasks but often rely excessively on language priors (LP)—textual statistical patterns memorized during pre-training—while ignoring actual visual evidence. For example, when an image shows a green banana, the model might still answer "yellow."

Existing LP analysis methods rely primarily on input-output probing, which has two main limitations: (1) it ignores the rich information in the model's internal hidden representations; (2) it cannot reveal at which layer LP begins to interfere with visual integration. This paper instead analyzes LP through internal representation dynamics, using a contrastive chain-of-embedding to locate the critical layers where visual information begins to genuinely influence reasoning.

Method

Overall Architecture

This paper aims to clarify at which layer and to what extent an LVLM uses visual information, thereby quantifying its reliance on language priors. The core technique is to feed the same textual question to the model twice, once with the image and once with the image removed, and then compare the two hidden representation trajectories (chain-of-embedding) layer by layer. Given the input \((x_v, x_t)\), the trajectory with the image is denoted \(z_{\text{vis}}^l = f_l(x_v, x_t)\) and the "blind" trajectory \(z_{\text{blind}}^l = f_l(\varnothing, x_t)\). The distance \(\mathbf{D}_l\) between the two, measured on the final-token embedding at layer \(l\), characterizes how strongly visual information perturbs the representation at that layer. Based on this, a lightweight proxy first partitions samples into a visual-dependent group \(\mathcal{D}_{VT}\) and a visual-independent group \(\mathcal{D}_T\). The "Visual Integration Point" is then located by observing where the layer-wise distances of the two groups begin to diverge. Finally, the distances from this layer onward are aggregated into a scalar that quantifies the strength of the language prior; the same scalar can also serve as a training regularization term to mitigate the language prior.
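
The layer-wise comparison described above can be sketched in a few lines. This is a minimal illustration, not the authors' code: it assumes the final-token hidden state of every layer has already been extracted for both runs, and the names `cosine_distance` and `layerwise_distances` as well as the toy vectors are hypothetical.

```python
import math

def cosine_distance(u, v):
    """Return 1 - cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_u * norm_v)

def layerwise_distances(z_vis, z_blind):
    """Per-layer distance between the two chains of embeddings.

    z_vis[l] and z_blind[l] are assumed to be the final-token hidden states
    at layer l for the run with the image and the run without it.
    """
    if len(z_vis) != len(z_blind):
        raise ValueError("both trajectories must cover the same layers")
    return [cosine_distance(a, b) for a, b in zip(z_vis, z_blind)]

# Toy 3-layer trajectories with 2-dimensional hidden states (made-up numbers).
z_vis = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
z_blind = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
distances = layerwise_distances(z_vis, z_blind)
print(distances)
```

In this toy example the two trajectories coincide at the first layer and drift apart afterwards, which is the qualitative pattern the method looks for.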

%%{init: {'flowchart': {'rankSpacing': 24, 'nodeSpacing': 28, 'padding': 6, 'wrappingWidth': 400}}}%%
flowchart TD
    A["Input: Text Question + Image"] --> B["Contrastive Dual-path Forward<br/>Chain-of-embedding (With/Without Image)<br/>+ Layer-wise Representation Distance D_l"]
    B --> C["Prediction Consistency Data Partitioning Proxy<br/>Prediction changes after removal -> Visual-Dependent Group<br/>No change -> Visual-Independent Group"]
    C --> D["Visual Integration Point (VIP)<br/>Locating Layer l* for Distance Divergence"]
    D --> E["Total Visual Integration (TVI)<br/>Aggr. layer-wise distance post-VIP"]
    E -->|Diagnosis| F["Language Prior Strength<br/>Lower TVI indicates stronger prior"]
    E -->|Training| G["TVI-regularized Fine-tuning to Relieve Prior"]

Key Designs

1. Prediction Consistency Data Partitioning Proxy: Separating visual-dependent and independent groups without labels

To determine the layer at which visual information takes effect, one needs "visual-dependent" and "visual-independent" samples to compare. However, existing datasets do not label whether a question depends on the image. The authors bypass labeling with a lightweight proxy: if the prediction changes after removing the image, i.e., \(F_\theta(x_v, x_t) \neq F_\theta(\varnothing, x_t)\), the sample is assigned to \(\mathcal{D}_{VT}\); otherwise, it is assigned to \(\mathcal{D}_T\). Layer-wise distance is measured with cosine distance by default. Ablations show that both cosine and L2 distance are effective, whereas projecting representations to the output space via logit-lens and computing KL/JS divergence fails. This indicates that the signal distinguishing visual integration resides primarily in the geometry of the hidden space rather than in the output distribution.
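
The partitioning proxy amounts to running the model twice per sample and checking whether the answer changes. The sketch below assumes a hypothetical callable `predict(image, question)` in which `image=None` stands for the image-removed run; `toy_predict` is a made-up stand-in for a real LVLM.

```python
def partition_by_prediction_change(samples, predict):
    """Split samples into (visual_dependent, visual_independent).

    `predict(image, question)` is a hypothetical callable returning the
    model's answer; image=None represents the run without the image.
    """
    visual_dependent, visual_independent = [], []
    for image, question in samples:
        with_image = predict(image, question)
        without_image = predict(None, question)
        if with_image != without_image:
            visual_dependent.append((image, question))
        else:
            visual_independent.append((image, question))
    return visual_dependent, visual_independent

def toy_predict(image, question):
    """Toy model: it consults the image only for color questions."""
    if "color" in question:
        return image["color"] if image is not None else "yellow"
    return "fruit"

samples = [
    ({"color": "green"}, "What color is the banana?"),
    ({"color": "green"}, "Is a banana a fruit or a vegetable?"),
]
d_vt, d_t = partition_by_prediction_change(samples, toy_predict)
print(len(d_vt), len(d_t))
```

The color question lands in the visual-dependent group because the blind run falls back to the prior answer "yellow", while the second question is answered identically with or without the image.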

2. Visual Integration Point (VIP): Locating where visual information starts to function

Looking directly at the distance of the entire trajectory makes it difficult to distinguish which perturbations are meaningful visual integration and which are merely general computational noise, as both paths perform similar syntactic/semantic encoding in shallow layers—visual features are "seen" but not yet "used." The authors found a critical layer \(l^*\), after which the representation distances for visual-dependent and independent samples begin to diverge stably, formalized as \(\mathbf{D}_l(\mathcal{P}_{VT}) - \mathbf{D}_l(\mathcal{P}_T) > \tau\) for all \(l \geq l^*\). This implies that prior to the VIP, the model performs general information processing independent of the visual input, and only after the VIP does it incorporate visual evidence for task-specific reasoning. A key observation is that the VIP position is nearly consistent across different datasets, suggesting it is an inherent property of the model rather than a data artifact. However, the VIP varies across models (typically occurring at approximately 60% depth in experiments, independent of model size), providing a basis for measuring visual integration only within "effective layers."
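
One way to read the criterion above is as a search for the start of the longest suffix of layers in which the gap between the two group-level distance curves stays above \(\tau\). The following sketch makes that reading concrete; the function name `find_vip`, the 0-indexed layer convention, and the toy curves are assumptions, and the paper's own selection procedure may differ in detail.

```python
def find_vip(mean_dist_vt, mean_dist_t, tau):
    """Return the smallest 0-indexed layer l* whose gap exceeds tau at every l >= l*.

    mean_dist_vt[l] and mean_dist_t[l] are the group-averaged distances at
    layer l for the visual-dependent and visual-independent groups.
    Returns None if the gap at the last layer does not exceed tau.
    """
    if len(mean_dist_vt) != len(mean_dist_t):
        raise ValueError("both curves must cover the same layers")
    vip = None
    # Scan backwards so that a transient early spike is not mistaken for the VIP.
    for layer in range(len(mean_dist_vt) - 1, -1, -1):
        if mean_dist_vt[layer] - mean_dist_t[layer] > tau:
            vip = layer
        else:
            break
    return vip

# Toy group-level curves over 5 layers (made-up numbers).
curve_vt = [0.10, 0.12, 0.11, 0.30, 0.35]
curve_t = [0.10, 0.11, 0.12, 0.15, 0.14]
vip = find_vip(curve_vt, curve_t, tau=0.05)
print(vip)
```

With these toy curves the gap first stays above the threshold from layer index 3 onward, so `find_vip` returns 3.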

3. Total Visual Integration (TVI): Aggregating post-VIP distances into a language prior strength scalar

With the VIP identified, attention is restricted to the layers from the VIP onward, which carry visual integration. Averaging the layer-wise distances in this segment yields a comparable metric: \(\text{TVI}(l^*; x, F_\theta) = \frac{1}{L - l^* + 1} \sum_{l=l^*}^{L} d(z_{\text{vis}}^l, z_{\text{blind}}^l)\). A higher TVI indicates that the trajectories with and without the image diverge further, suggesting more thorough use of visual information and a weaker language prior. Conversely, a lower TVI indicates minimal change regardless of the image, implying that the model is dominated by textual statistical patterns, i.e., a stronger language prior (thus, TVI is inversely related to the language prior). In this way, "prior reliance", previously only inferable via input-output probing, is compressed into a continuous value directly comparable with accuracy. In experiments, the Spearman correlation between TVI and accuracy reaches about 0.72 in the post-VIP segment (0.7241 for Qwen2.5-VL-7B and 0.7174 for Gemma3-4B), well above the pre-VIP segment (0.1489 and 0.4659, respectively).
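
As a worked instance of the TVI formula, the sketch below averages the per-layer distances from the VIP to the last layer. It assumes a 0-indexed list of distances, so `vip` corresponds to \(l^* - 1\) in the formula's 1-indexed notation; the function name and the numbers are illustrative only.

```python
def total_visual_integration(distances, vip):
    """Mean of the layer-wise distances from layer `vip` (0-indexed) to the last layer."""
    if not 0 <= vip < len(distances):
        raise ValueError("vip must be a valid layer index")
    tail = distances[vip:]
    return sum(tail) / len(tail)

# Toy per-layer distances for one sample (made-up numbers), with the VIP at index 2.
sample_distances = [0.0, 0.1, 0.2, 0.4]
tvi = total_visual_integration(sample_distances, vip=2)
print(round(tvi, 4))
```

Here only the last two layers contribute, giving a TVI of roughly 0.3; the pre-VIP layers are ignored by construction.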

Loss & Training

TVI is not only an analytical tool but can also serve as a training regularization term to boost visual integration. The authors add a reward term to the LLaVA instruction fine-tuning objective to increase TVI:

\[\mathcal{L}(x, y; \theta) = -\log F_\theta(y|x) - \lambda \cdot \text{TVI}(l^*; x, F_\theta)\]

With \(\lambda = 0.03\), this encourages the model to pull the trajectories with and without the image further apart—i.e., stronger integration of visual evidence—alongside the standard next-token prediction loss, thereby mitigating the language prior without modifying the architecture.
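
The sign convention of the objective can be checked with a scalar sketch. This is only an illustration under simplifying assumptions: `prob_of_target` stands for \(F_\theta(y|x)\) as a plain float, and `tvi` is a precomputed number. In actual fine-tuning, TVI would have to be computed differentiably from the hidden states inside an autograd framework.

```python
import math

def tvi_regularized_loss(prob_of_target, tvi, lam=0.03):
    """Scalar form of the objective: -log F(y|x) - lam * TVI."""
    if not 0.0 < prob_of_target <= 1.0:
        raise ValueError("prob_of_target must lie in (0, 1]")
    return -math.log(prob_of_target) - lam * tvi

# Same likelihood, different amounts of visual integration (made-up numbers).
loss_low_tvi = tvi_regularized_loss(0.5, tvi=0.1)
loss_high_tvi = tvi_regularized_loss(0.5, tvi=0.4)
print(loss_low_tvi > loss_high_tvi)
```

At equal likelihood, the sample with the higher TVI receives the lower loss, which is the direction in which the regularizer pushes the model.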

Key Experimental Results

Main Results

| Model (segment) | Spearman Correlation (TVI vs Accuracy) | p-value |
| --- | --- | --- |
| Qwen2.5-VL-7B (post-VIP) | 0.7241 | <0.001 |
| Gemma3-4B (post-VIP) | 0.7174 | <0.001 |
| Qwen2.5-VL-7B (pre-VIP) | 0.1489 | 0.002 |
| Gemma3-4B (pre-VIP) | 0.4659 | <0.001 |

| Metric | Qwen2.5-VL-7B VLind | Qwen2.5-VL-7B ViLP | InternVL-3-8B VLind | InternVL-3-8B ViLP |
| --- | --- | --- | --- | --- |
| TVI | 0.7155 | 0.6335 | 0.6727 | 0.5709 |
| Visual Attention | 0.0871 | -0.0364 | 0.4967 | 0.0746 |
| Output Divergence | 0.2978 | 0.5084 | 0.1627 | 0.5615 |
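
For readers who want to reproduce this kind of analysis on their own measurements, Spearman correlation is the Pearson correlation of the ranks. The sketch below is a dependency-free version with average ranks for ties; the helper names and the toy TVI and accuracy values are made up and are not taken from the paper.

```python
import math

def average_ranks(values):
    """1-based ranks, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0.0 or var_y == 0.0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    if len(x) != len(y):
        raise ValueError("sequences must have the same length")
    return pearson(average_ranks(x), average_ranks(y))

# Toy per-group TVI and accuracy values (made-up numbers).
toy_tvi = [0.05, 0.10, 0.20, 0.30]
toy_acc = [0.50, 0.55, 0.54, 0.70]
rho = spearman(toy_tvi, toy_acc)
print(round(rho, 4))
```

In the toy data one adjacent pair is out of order, so the correlation comes out at about 0.8 rather than 1.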

Ablation Study

| Configuration | VLind Correlation | ViLP Correlation | Description |
| --- | --- | --- | --- |
| Cosine Distance | 0.7155 | 0.6335 | Default; highest on VLind |
| L2 Distance | 0.7123 | 0.6578 | Close to cosine; remains effective |
| KL Divergence (logit-lens) | -0.1693 | 0.2901 | Fails after projecting to output space |
| JS Divergence (logit-lens) | -0.2261 | 0.2942 | Same as above |

| TVI Regularization | Perception | Reasoning |
| --- | --- | --- |
| LLaVA-v1.5 | 1369.75 | 298.21 |
| LLaVA-v1.5 w/ TVI | 1400.44 | 321.43 |

Key Findings

  • The VIP consistently appears across 60 combinations of 10 LVLMs and 6 datasets.
  • The VIP typically occurs at approximately 60% of the model depth, independent of model scale.
  • Larger models (e.g., Gemma-3-27B) exhibit higher normalized TVI, indicating stronger utilization of visual information.
  • Datasets with strong LP (e.g., ViLP) show significantly lower TVI than those with weak LP (e.g., MMBench).
  • Intervention experiment: Using PAI attention correction increased TVI from 0.038 to 0.144 and accuracy from 50% to 52.33%.

Highlights & Insights

  • Systematically analyzes the language prior of LVLMs from the perspective of internal representation dynamics for the first time, providing more granularity than input-output probing.
  • The discovery of the VIP as an inherent model property is significant, suggesting a fixed "starting point" for visual integration within model architectures.
  • TVI consistently outperforms visual attention and output divergence across all models and datasets.
  • Theoretical analysis connects representation divergence with KL divergence, providing an information-theoretic explanation.

Limitations & Future Work

  • Requires white-box access to internal states, making it inapplicable to closed-source APIs.
  • The selection of VIP depends on a manually set threshold \(\tau\) (though the appendix provides an automated selection method).
  • Analyzes only the language prior, without considering other biases like distribution shift.
  • TVI regularization experiments were conducted on a 60K subset; large-scale validation is still required.
  • The work is related to mechanistic interpretability but focuses on multimodal integration rather than a single modality.
  • Inspires hierarchical intervention strategies based on TVI, such as applying stronger visual constraints to layers following the VIP.
  • Directly guides LVLM hallucination mitigation: samples with low TVI may require additional visual attention correction.

Rating

  • Novelty: ⭐⭐⭐⭐⭐ Offers a fresh perspective by analyzing language priors via internal representation dynamics; the VIP and TVI concepts are highly original.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ Extensive testing (10 models × 6 datasets = 60 settings), comprehensive ablations, including intervention validation and theoretical analysis.
  • Writing Quality: ⭐⭐⭐⭐ Clear argumentation, rigorous mathematical derivations, and information-rich charts.
  • Value: ⭐⭐⭐⭐ Provides a practical analytical tool for understanding and improving LVLMs; TVI regularization demonstrates clear potential for practical applications.