A Comprehensive Information-Decomposition Analysis of Large Vision-Language Models

Conference: ICLR 2026
OpenReview: https://openreview.net/forum?id=6WsBGk4Iag
Code: https://github.com/RiiShin/pid-lvlm-analysis
Area: Interpretability / Multimodal Large Model Analysis
Keywords: Partial Information Decomposition (PID), Large Vision-Language Models (LVLM), Synergy, Multimodal Fusion, Information Theory, logit lens

TL;DR

This work presents the first systematic application of Partial Information Decomposition (PID) to LVLMs, decomposing their "decision-relevant information" into four non-negative atoms: redundant, unique visual, unique language, and synergistic. It builds a model-agnostic estimation pipeline and uses it to quantify whether LVLMs rely on genuine cross-modal fusion or on language priors, across 26 models and 4 datasets and along three dimensions: breadth, depth, and time.

Background & Motivation

Background: LVLMs perform impressively in VQA, image captioning, and open-ended reasoning, but their internal decision-making remains opaque. Aggregate metrics such as accuracy show only whether a model is correct, not why: they cannot tell whether the model genuinely integrates visual evidence or simply exploits language priors.
Limitations of Prior Work: Existing LVLM interpretability work often analyzes individual modalities in a "microscopic" manner (attention maps, attribution heatmaps, linear probes, multimodal neurons) or introduces ad-hoc metrics that lack theoretical grounding. Mutual Information (MI) measures "total information" but cannot decompose the interactions between multiple inputs.
Key Challenge: Answering whether predictions are driven by vision, language, or their interaction requires a quantitative tool that is both grounded in information theory and applicable to high-dimensional LVLM embeddings. Such a tool is currently missing.
Goal: To provide a process-level analysis framework applicable across models, tasks, layers, and training stages, moving beyond the "accuracy-only" evaluation paradigm.
Key Insight (Information Spectrum): Treating visual features \(X_1\) and language features \(X_2\) as two inputs and the model prediction \(Y\) as the target, PID decomposes the total mutual information \(I(X_1,X_2;Y)\) into four non-negative atoms: Redundancy \(R\), Unique Visual \(U_1\), Unique Language \(U_2\), and Synergy \(S\). This "Information Spectrum" is used to diagnose the intrinsic information-processing strategies of LVLMs.
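
For reference, the four atoms obey the standard PID consistency relations. The block below restates these well-known identities in the notation used here; it is not quoted from the paper.

```latex
\begin{aligned}
I(X_1,X_2;Y) &= R + U_1 + U_2 + S,\\
I(X_1;Y)     &= R + U_1,\\
I(X_2;Y)     &= R + U_2,\\
I(X_1;X_2;Y) &= I(X_1;Y) + I(X_2;Y) - I(X_1,X_2;Y) = R - S.
\end{aligned}
```

The last line shows why the three-way interaction term turns negative whenever synergy exceeds redundancy, while each individual atom stays non-negative.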

Method

Overall Architecture

Given an image-text pair, the framework first extracts mean-pooled token embeddings of the image and the text as source features \(X_1\) and \(X_2\). A standard multimodal forward pass yields \(P(Y|X_1,X_2)\); two unimodal forward passes, in which the other modality is replaced by noise, yield \(P(Y|X_1)\) and \(P(Y|X_2)\). These three predictive distributions and the source features are then fed into the BATCH estimator to solve for \(\{R, U_1, U_2, S\}\). The pipeline acts as a process-level descriptor of model behavior and requires no architectural changes, retraining, or ground-truth labels.

flowchart LR
    A[Image-Text Pair] --> B[ViT/Projector<br/>Extract Visual X1]
    A --> C[Embedding Layer<br/>Extract Language X2]
    B --> D[Multimodal Forward<br/>P Y given X1,X2]
    B --> E[Language Noise<br/>Visual Unimodal P Y given X1]
    C --> F[Visual Noise<br/>Language Unimodal P Y given X2]
    C --> D
    D & E & F --> G[BATCH Estimator<br/>Sinkhorn Constraints]
    G --> H[Info Spectrum<br/>R / U1 / U2 / S]
    H --> I1[Dim 1: Cross-Model/Task]
    H --> I2[Dim 2: Layer Dynamics/Logit Lens]
    H --> I3[Dim 3: Training Dynamics]

Key Designs

1. PID Spectrum and BATCH Estimator Implementation: Computing information atoms on high-dimensional embeddings. PID is preferred over plain mutual information because the interaction information \(I(X_1;X_2;Y)\) of a three-variable system can be negative and is hard to interpret, whereas PID decomposes \(I(X_1,X_2;Y)\) into four non-negative atoms. Following the definition of Bertschinger et al., the atoms are obtained by optimizing over the set \(\Delta_P\) of distributions \(Q\) that preserve the source-target marginals: Redundancy \(R=\max_{Q\in\Delta_P} I_Q(X_1;X_2;Y)\), Unique Visual \(U_1=\min_{Q\in\Delta_P} I_Q(X_1;Y|X_2)\), Unique Language \(U_2=\min_{Q\in\Delta_P} I_Q(X_2;Y|X_1)\), and Synergy \(S=I(X_1,X_2;Y)-\min_{Q\in\Delta_P} I_Q(X_1,X_2;Y)\). For estimation on continuous, high-dimensional LVLM embeddings, this work adopts the BATCH estimator proposed by Liang et al., which parameterizes the distributions with neural networks, optimizes the information-theoretic objectives on minibatches, and uses a variant of the Sinkhorn algorithm to enforce the marginal constraints.

2. Focusing Analysis on Multiple-choice VQA + Noise Masking Unimodal Conditions: Ensuring "unimodal prediction" is clean without extra components. BATCH requires \(Y\) to be a finite set, so multiple-choice VQA (e.g., {A,B,C,D}) is used; this avoids the noise introduced by clustering open-ended answers and the need for extra projection heads. To estimate \(P(Y|X_1)\) and \(P(Y|X_2)\), a corruption scheme inspired by Meng et al. replaces the other modality's sequence at the embedding layer with noise: each noise vector is sampled i.i.d. from \(\mathcal{N}(\mu, \mathrm{diag}(\sigma^2))\), where \(\mu,\sigma\) are per-dimension means and standard deviations computed over the dataset. This "calibrated noise" erases the content of one modality while keeping the replaced embeddings statistically in-distribution.

3. Confidence Threshold Re-normalization + Soft Aggregation for Output Marginals: Preventing measurement artifacts from restricted candidates. Re-normalizing over a restricted candidate set can inflate confidence: when a model assigns low scores to all candidates relative to the full vocabulary, forced normalization creates artificial structure. Thus, token-length normalized scores \(S_{\mathrm{orig}}\) are computed first, and a threshold is applied: if \(\sum_{y\in\mathcal{Y}} S_{\mathrm{orig}}(Y{=}y|\cdot)\geq\tau\), the normalized \(P(Y|\cdot)\) is used; otherwise, it reverts to a uniform distribution \(U(K)\) over \(K=|\mathcal{Y}|\) candidates, so that low-confidence guessing does not pollute the PID calculation. For the marginal \(P(Y)\), soft aggregation is used: the normalized distributions of the \(N\) samples are averaged, \(P(Y)=\frac{1}{N}\sum_{i=1}^N \hat{P}_i(Y)\), which preserves the model's actual output statistics.

4. Three-Dimensional Analysis Protocol: Breadth × Depth × Time. The framework provides four-atom spectra, but insights come from three complementary dimensions. The breadth dimension performs large-scale comparisons across 26 models and 4 datasets, cross-validated by a behavioral intervention (accuracy drop \(D_{\mathrm{vision}}\) after removing images). The depth dimension uses logit lens on representative models (InternVL3, Qwen2.5-VL, LLaVA-1.5) to project hidden states to the LM head for per-layer PID. The time dimension replicates the two-stage training of LLaVA-1.5, saving checkpoints to characterize the emergence of fusion.
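
Design 1 notes that interaction information can be negative. A minimal sketch of this on a toy XOR target, using plain NumPy rather than the BATCH estimator (the helper `mutual_information` and the toy distribution are illustrative assumptions, not part of the paper's code):

```python
import numpy as np

def mutual_information(joint):
    """I(A;B) in bits from a 2-D joint probability table p(a, b)."""
    joint = np.asarray(joint, dtype=float)
    pa = joint.sum(axis=1, keepdims=True)   # marginal p(a), shape (A, 1)
    pb = joint.sum(axis=0, keepdims=True)   # marginal p(b), shape (1, B)
    mask = joint > 0                        # skip zero cells (0 * log 0 = 0)
    return float(np.sum(joint[mask] * np.log2(joint[mask] / (pa @ pb)[mask])))

# Toy joint p(x1, x2, y) for Y = X1 XOR X2 with uniform binary inputs.
p = np.zeros((2, 2, 2))
for x1 in (0, 1):
    for x2 in (0, 1):
        p[x1, x2, x1 ^ x2] = 0.25

i_x1_y = mutual_information(p.sum(axis=1))     # I(X1;Y): marginalize out X2
i_x2_y = mutual_information(p.sum(axis=0))     # I(X2;Y): marginalize out X1
i_joint = mutual_information(p.reshape(4, 2))  # I(X1,X2;Y): treat (X1,X2) as one variable
co_information = i_x1_y + i_x2_y - i_joint     # I(X1;X2;Y)

print(i_x1_y, i_x2_y, i_joint, co_information)
```

Here each input alone carries zero information about \(Y\), the pair carries one bit, and the interaction term is negative. Since \(I(X_1;Y)=R+U_1\) and \(I(X_2;Y)=R+U_2\) with non-negative atoms, the whole bit must be synergy, which is the case PID can express and a signed interaction term cannot.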
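
The calibrated-noise masking of design 2 can be sketched as follows. This is an illustration under stated assumptions: the function name `calibrated_noise`, the array shapes, and the synthetic stand-in embeddings are hypothetical, and a real pipeline would collect the statistics from the model's own embedding layer.

```python
import numpy as np

def calibrated_noise(dataset_embeddings, seq_len, rng):
    """Sample seq_len i.i.d. vectors from N(mu, diag(sigma^2)).

    dataset_embeddings: (num_tokens, dim) array of one modality's token
    embeddings gathered over the dataset (assumed layout).
    """
    mu = dataset_embeddings.mean(axis=0)    # per-dimension mean
    sigma = dataset_embeddings.std(axis=0)  # per-dimension standard deviation
    return rng.normal(loc=mu, scale=sigma, size=(seq_len, mu.shape[0]))

rng = np.random.default_rng(0)
# Synthetic stand-in for real text-token embeddings (not model data).
text_embeddings = rng.normal(loc=2.0, scale=0.5, size=(10_000, 16))
noise = calibrated_noise(text_embeddings, seq_len=32, rng=rng)
print(noise.shape)
```

The sampled block would replace the masked modality's sequence before the forward pass, so the model sees inputs with matching first and second moments but no content.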
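
The thresholded re-normalization and soft aggregation of design 3 might look like the sketch below. The function names and the value of `tau` are assumptions for illustration; the source does not state the threshold used.

```python
import numpy as np

def restricted_distribution(scores, tau):
    """Renormalize candidate scores, or fall back to uniform if their total mass is below tau."""
    scores = np.asarray(scores, dtype=float)
    total = scores.sum()
    if total >= tau:
        return scores / total
    return np.full(scores.shape, 1.0 / scores.size)  # U(K) fallback

def soft_marginal(distributions):
    """Estimate P(Y) as the average of per-sample normalized distributions."""
    return np.mean(np.stack(distributions), axis=0)

tau = 0.5  # hypothetical threshold
confident = restricted_distribution([0.70, 0.10, 0.05, 0.05], tau)  # mass 0.90 >= tau
unsure = restricted_distribution([0.02, 0.01, 0.01, 0.01], tau)     # mass 0.05 < tau
p_y = soft_marginal([confident, unsure])
print(confident, unsure, p_y)
```

Without the fallback, the second sample would be renormalized to put 40% of its mass on the first option despite the model assigning it almost no probability.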
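
For the depth dimension in design 4, a logit-lens readout can be sketched as projecting each layer's hidden state through the LM head and reading off the candidate tokens. The sketch uses random matrices in place of a real model; all names and sizes are hypothetical, and practical implementations often apply the model's final normalization layer before the projection, which is omitted here.

```python
import numpy as np

def logit_lens_candidate_scores(hidden_state, unembedding, candidate_ids):
    """Project one layer's hidden state to vocabulary space and return candidate probabilities."""
    logits = unembedding @ hidden_state   # (vocab,)
    logits = logits - logits.max()        # subtract max for numerical stability
    exp_logits = np.exp(logits)
    probs = exp_logits / exp_logits.sum() # softmax over the full vocabulary
    return probs[candidate_ids]

rng = np.random.default_rng(0)
vocab, dim, layers = 100, 8, 4
W = rng.normal(size=(vocab, dim))                 # stand-in LM head
hidden_per_layer = rng.normal(size=(layers, dim)) # stand-in hidden states at the answer position
candidate_ids = np.array([3, 17, 42, 64])         # hypothetical token ids for A/B/C/D

per_layer = np.stack(
    [logit_lens_candidate_scores(h, W, candidate_ids) for h in hidden_per_layer]
)
print(per_layer.shape)
```

The per-layer candidate scores would then go through the same thresholding and PID estimation as the final-layer outputs.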

Key Experimental Results

Main Results: Correlation between Accuracy and PID Atoms (Spearman ρ across 26 models)

Dataset	Type	\(S\) (ρ)	\(U_2\) (ρ)	\(I(X_1,X_2;Y)\) (ρ)	\(I(X_1;X_2;Y)\) (ρ)
MMBench	Synergy-driven	0.750 (p<0.001)	0.194	0.632	-0.757
POPE	Synergy-driven	0.742 (p<0.001)	-0.009	0.157	-0.701
Reefknot	Knowledge-driven	0.357 (p=0.073)	0.313	0.266	-0.348
PMC-VQA	Knowledge-driven	0.432 (p=0.027)	0.406 (p=0.040)	0.559	-0.587

On synergy-driven tasks, \(S\) is the strongest positive correlate of accuracy (ρ≈0.75), while the interaction term \(I(X_1;X_2;Y)\) is strongly negatively correlated. On knowledge-driven tasks, \(U_2\) becomes more predictive (significant on PMC-VQA), while \(S\) remains beneficial but not dominant.
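
The ρ values in the table are Spearman rank correlations computed across models. A minimal NumPy sketch of that statistic follows; the numbers are made-up values for five hypothetical models, not results from the paper, and this simple version assumes no ties.

```python
import numpy as np

def spearman_rho(x, y):
    """Spearman rank correlation for tie-free data: Pearson correlation of the ranks."""
    rx = np.argsort(np.argsort(x))  # 0-based rank of each element of x
    ry = np.argsort(np.argsort(y))  # 0-based rank of each element of y
    return float(np.corrcoef(rx, ry)[0, 1])

# Made-up per-model values (illustrative only).
synergy = np.array([0.10, 0.25, 0.18, 0.40, 0.33])
accuracy = np.array([61.0, 70.5, 72.0, 83.2, 79.9])
rho = spearman_rho(synergy, accuracy)
print(rho)
```

Because it depends only on ranks, the statistic is insensitive to the different scales of the atoms and of accuracy across datasets.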

Intervention Validation: Correlation between Image Removal Accuracy Drop \(D_{\mathrm{vision}}\) and \(S\)

Dataset	\(D_{\mathrm{vision}}\) vs \(S\) (ρ)	p-value
MMBench	0.809	<0.001
POPE	0.744	<0.001
Reefknot	0.459	0.018
PMC-VQA	0.400	0.043

Models with higher \(S\) are more sensitive to visual ablation, confirming that \(S\) captures decision-relevant visual dependence.

Ablation Study: Scaling Effects on Synergy-driven Tasks (ΔAcc vs ΔS, ΔU₂)

Family	Scale (B)	S→M ΔAcc	S→M ΔS	S→M ΔU₂	M→VL ΔAcc	M→VL ΔS
LLaVA-OneVision	0.5→7→72	11.9	11.9	-6.5	3.1	14.8
InternVL2.5	2→8→78	7.3	36.8	-55.6	3.6	10.6
InternVL3	2→8→78	2.7	2.5	-6.2	6.4	4.6

The proportion of unique language \(U_2\) does not systematically increase with scale (it often decreases), and accuracy gains align more closely with the growth of \(S\). This runs against the common expectation that larger models rely more on language priors.

Key Findings

Finding 1-2 (Task Level): Benchmarks fall into two mechanisms: Synergy-driven (MMBench/POPE, high \(S\)) and Knowledge-driven (Reefknot/PMC-VQA, low \(S\), high \(U_2\)). Synergy-driven accuracy is primarily determined by \(S\), while \(U_2\) is more predictive in knowledge-driven tasks.
Finding 3-4 (Model Level): Model families exhibit stable, opposing strategies: Fusion-centric (InternVL2.5/3, Qwen2.5-VL; high \(S\), low \(U_2\)) vs. Language-centric (Gemma3, Cambrian; low \(S\), high \(U_2\)). Scaling primarily strengthens \(S\) rather than \(U_2\).
Finding 5 (Layer-wise): A consistent three-phase information flow exists: information emerges in the middle-to-late layers, language-based representation (\(U_2\)) peaks in the penultimate layer, and decisive synergistic fusion (\(S\)) occurs in the final layer. \(R\) and \(U_1\) remain minimal throughout.
Finding 6 (Training): In LLaVA-1.5 training, \(S\) is negligible during alignment pre-training and primarily emerges during visual instruction tuning.

Highlights & Insights

Quantifying "Fusion" as a Non-negative Scalar: Previously a qualitative concept, "multimodal fusion" is given a theoretically grounded measure via synergy \(S\), which explains accuracy better than total information \(I(X_1,X_2;Y)\). Superior models better convert overlapping cues into effective cross-modal synergy.
Mutual Reinforcement of Three Evidence Chains: Correlation (ρ≈0.75), behavioral intervention (\(D_{\mathrm{vision}}\) vs \(S\)), and stable family strategies collectively support that \(S\) captures true visual dependence.
Pinpointing the "Birth" of Fusion: The vague notion of "instruction tuning importance" is refined to "\(S\) is almost entirely unlocked during visual instruction tuning," providing a diagnostic signal for future training design.

Limitations & Future Work

Discrete Target Space Constraint: PID estimation requires a finite \(Y\), limiting the framework to multiple-choice VQA rather than open-ended generation.
Approximation of Unimodal Probes: Using calibrated noise to mask a modality is a stable proxy, but \(U_1, U_2, S\) are measured under these probes rather than natural unimodal inputs.
Correlation vs. Causality: PID atoms are derived from predictions; their relationship with accuracy/interventions is correlational, not a complete causal mechanism.
Future Work: Develop PID estimators for generative settings; use \((U_1, U_2, S)\) as diagnostic signals or auxiliary objectives for scaling and instruction tuning.

Related Work & Insights

VLM Interpretability: Methods like attribution heatmaps, attention analysis, and logit lens are often single-modality "microscopic" views. This work unifies these into an information-theoretic framework.
Information Theory × Multimodal Learning: While MI and the Information Bottleneck (IB) are used for representation learning, they cannot decompose multi-input interactions. This work is the first to apply PID to analyze the composition, flow, and evolution of information in modern LVLMs.
Insights: The "process-level descriptor" approach can be transferred to any multi-input system where source features and finite targets can be defined.

Rating

Novelty: ⭐⭐⭐⭐⭐ First systematic, large-scale application of PID for internal LVLM analysis. The "Information Spectrum" perspective is theoretically grounded and provides actionable insights.
Experimental Thoroughness: ⭐⭐⭐⭐⭐ 26 models × 4 datasets across breadth/depth/time dimensions with multiple cross-validations.
Writing Quality: ⭐⭐⭐⭐ The argument progresses logically through the task, family, layer, and training questions; however, the PID estimation details may be steep for readers without an information-theory background.
Value: ⭐⭐⭐⭐⭐ Provides a diagnostic tool beyond accuracy, directly aiding the understanding and design of the next generation of LVLMs.