
Vision Function Layer in Multimodal LLMs

Conference: NeurIPS 2025 · arXiv: 2509.24791 · Code: GitHub · Area: Multimodal Large Language Models / Interpretability · Keywords: MLLM Internal Mechanism, Vision Function Layer, Token Swapping, LoRA, Data Selection

TL;DR

This paper finds that vision-related functional decoding in MLLMs is concentrated in narrow, specific layer blocks (Vision Function Layers, VFLs), with a consistent hierarchical order across model families (recognition → counting → grounding → OCR). Building on this finding, the authors propose VFL-LoRA (matching full-LoRA performance with roughly half the trainable parameters) and VFL-select (achieving 98% of full-data performance with 20% of the data).

Background & Motivation

  • Core Problem: Despite significant advances in visual understanding, how MLLMs internally process and reason over visual tokens remains a "black box."
  • Limitations of Prior Work: Existing interpretability studies have primarily focused on token importance and cross-modal interactions, neglecting how different visual functions are internally represented and coordinated across layers.
  • Key Gap: A diagnostic framework capable of isolating individual visual functions is absent — most general-purpose tasks simultaneously require multiple capabilities, yielding only coarse conclusions (e.g., "shallow layers extract features, deep layers perform reasoning").
  • Additional Challenge: Different MLLMs employ different visual encoders and connector modules, further complicating the analysis of internal mechanisms.

Method

Vision Token Swapping Analysis Framework

Mechanism: At decoding layer \(k\), the KV-cache entries of the visual tokens from the original image are replaced with those computed from a different image, and the resulting change in output is observed. Carefully designed minimal-difference image pairs isolate individual visual functions (a code sketch follows the list below):

  • OCR: Different words rendered on a blank canvas.
  • Recognition: COCO images vs. blank canvas, querying the presence of a specific object.
  • Counting: CLEVR dataset images differing only in the number of objects.
  • Grounding: Identical objects placed at different positions.
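
The following is a minimal PyTorch sketch of the swapping idea, assuming an HF-style decoder that returns the legacy tuple-format past_key_values of shape (batch, heads, seq_len, head_dim) and a known, contiguous range of visual tokens. swap_visual_kv and vis_slice are illustrative names, and the single-layer swap scope is one possible reading of the analysis, not the paper's released code.

```python
import torch

@torch.no_grad()
def swap_visual_kv(model, inputs_a, inputs_b, swap_layer, vis_slice):
    """Splice image B's visual-token KV entries into image A's cache at one layer."""
    cache_a = model(**inputs_a, use_cache=True).past_key_values
    cache_b = model(**inputs_b, use_cache=True).past_key_values

    patched = []
    for layer, (k, v) in enumerate(cache_a):
        k, v = k.clone(), v.clone()
        if layer == swap_layer:
            k_b, v_b = cache_b[layer]
            k[:, :, vis_slice] = k_b[:, :, vis_slice]  # swap keys at visual positions
            v[:, :, vis_slice] = v_b[:, :, vis_slice]  # swap values at visual positions
        patched.append((k, v))
    # Continue decoding with the patched cache and compare outputs against
    # the unpatched run to measure the change in the probed visual function.
    return patched
```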

Key Findings: Vision Function Layer

Taking Qwen-2.5-VL-7B (28 layers) as an example:

| Visual Function | Peak Layer | Peak Change Rate | Characteristics |
|---|---|---|---|
| Recognition | Layers 0–10 | Distributed | Established early, sustained influence |
| Counting | Layer 12 | 87.4% | Concentrated in middle layers |
| Grounding | Layer 18 | 100.0% | Concentrated in mid-deep layers |
| OCR | Layer 22 | 92.8% | Concentrated in deep layers |

Alignment with Human Cognition: The hierarchy mirrors human visual processing — recognition → counting → grounding → reading — and is consistent across LLaVA and Qwen model families.

Vision Token Dropping Validation

To validate the VFL findings, visual tokens are progressively removed from the deeper layers on general VQA benchmarks:

  • Dropping the last 4 layers: OCR/TextVQA drops sharply (Qwen-7B: 82.8→74.1); other tasks are largely unaffected.
  • Dropping the last 8 layers: OCR-type tasks nearly collapse (82.8→15.3); recognition/spatial tasks begin to degrade.
  • Dropping the last 12 layers: all visual tasks degrade significantly.
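
A minimal sketch of this dropping probe, assuming PyTorch decoder layers that receive hidden states as their first positional argument. Removal is approximated here by zeroing the visual hidden states via forward pre-hooks (the paper removes the tokens outright); make_drop_hook and vis_slice are illustrative names, not the authors' implementation.

```python
def make_drop_hook(vis_slice):
    def hook(module, args):
        hidden_states = args[0].clone()
        hidden_states[:, vis_slice] = 0.0   # blank out visual-token positions
        return (hidden_states,) + args[1:]
    return hook

def drop_visual_from_last_n(decoder_layers, n, vis_slice):
    """Install the hook on the last n decoder layers."""
    handles = [layer.register_forward_pre_hook(make_drop_hook(vis_slice))
               for layer in decoder_layers[-n:]]
    return handles  # call h.remove() on each handle to restore the model
```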

Application 1: VFL-LoRA

LoRA is applied only to the layers corresponding to the target visual function rather than to all layers. Taking spatial reasoning as the example, LoRA is trained on Qwen2.5-VL-7B targeting the counting function layers (layers 10–17 and 20–23); a configuration sketch follows the results below:

  • Trainable parameters: 155M vs. 309M for full-LoRA (50% reduction).
  • In-domain average: 85.0% vs. 84.4% for full-LoRA (on par or slightly better).
  • Out-of-domain average: 75.0% vs. 74.3% for full-LoRA (better generalization, reduced catastrophic forgetting).
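
A minimal configuration sketch of layer-restricted LoRA using Hugging Face peft, assuming a transformers version with Qwen2.5-VL support. The rank, alpha, and target modules are illustrative defaults, not the paper's exact recipe.

```python
from transformers import Qwen2_5_VLForConditionalGeneration
from peft import LoraConfig, get_peft_model

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-7B-Instruct"
)

# Counting-function layers reported above: 10-17 and 20-23.
vfl_layers = list(range(10, 18)) + list(range(20, 24))

config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    layers_to_transform=vfl_layers,  # adapt only the VFL block
    layers_pattern="layers",         # match decoder layers named "<prefix>.layers.<idx>"
)

model = get_peft_model(model, config)
model.print_trainable_parameters()   # roughly half of full-LoRA's trainable params
```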

Application 2: VFL-select Data Selection

By measuring how performance on training samples changes when specific VFLs are ablated, data samples are automatically categorized by the visual function they exercise (see the sketch below). Using 20% of the data achieves 98% of full-data performance, surpassing manually curated expert data selection.
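
A minimal sketch of function-aware data selection in the spirit of VFL-select. loss_fn and ablate_layers are hypothetical stand-ins injected by the caller: loss_fn(sample) returns the model's loss on one sample, and ablate_layers(layer_idxs) is a context manager that disables visual tokens in the given decoder layers (e.g., via the dropping hooks sketched earlier).

```python
import heapq

def vfl_select(loss_fn, ablate_layers, dataset, vfl_layers, budget_frac=0.2):
    """Keep the budget_frac of samples that most depend on the target VFL."""
    scored = []
    for sample in dataset:
        base = loss_fn(sample)                 # loss with the intact model
        with ablate_layers(vfl_layers):        # loss with the VFL disabled
            ablated = loss_fn(sample)
        # Samples whose loss rises most under ablation exercise the target
        # visual function hardest, so they are the most informative to keep.
        scored.append((ablated - base, sample))
    k = int(budget_frac * len(scored))
    return [s for _, s in heapq.nlargest(k, scored, key=lambda t: t[0])]
```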

Key Experimental Results

Effect of Vision Token Dropping on Each Task (Qwen2.5-VL-7B)

| Layers Dropped | SQA-I | POPE | TextVQA | OCR | ChartQA |
|---|---|---|---|---|---|
| 0 (baseline) | 87.2 | 86.1 | 82.8 | 82.2 | 83.2 |
| drop 4 | 87.4 | 86.3 | 74.1↓ | 76.3↓ | 82.7↓ |
| drop 8 | 87.4 | 86.2 | 15.3↓↓ | 5.5↓↓ | 20.5↓↓ |
| drop 12 | 87.2 | 79.5↓ | 13.8 | 3.7 | 17.4 |

VFL-LoRA vs. Full-LoRA (Qwen2.5-VL-7B)

| Method | Params (%) | CV-Count | CV-Avg | ChartQA | MMMU | POPE |
|---|---|---|---|---|---|---|
| Baseline | – | 68.0 | 82.1 | 83.2 | 50.7 | 86.1 |
| Full-LoRA | 1.9% | 70.9 | 84.4 | 86.2 | 50.1 | 86.6 |
| VFL-LoRA | 0.9% | 72.6 | 85.0 | 86.4 | 51.7 | 86.9 |
| Reversed-VFL | 0.9% | 69.0 | 82.7 | 85.9 | 51.2 | 84.9 |

VFL-select Data Selection

  • 20% of data achieves 98% of full-data performance.
  • Outperforms human expert data selection under the same budget constraint.

Highlights & Insights

  1. Cross-model consistent functional hierarchy: From LLaVA to Qwen, from 3B to 13B parameters, the hierarchical order of vision function layers (recognition → counting → grounding → OCR) is remarkably consistent, suggesting that MLLMs may develop human-like hierarchical visual processing strategies.
  2. Token Swapping is more precise than traditional probing: Minimal-difference image pairs enable function-level causal analysis rather than mere correlation analysis.
  3. Significant practical value: VFL-LoRA surpasses full-LoRA with half the parameters and reduced forgetting; VFL-select achieves 98% performance with 1/5 of the data — both applications offer clear engineering utility.
  4. Reversed-VFL provides strong counter-evidence: Applying LoRA to non-functional layers yields significantly worse performance than VFL-LoRA, confirming the validity of the function layer localization.

Limitations & Future Work

  1. Limited functional granularity: Only four visual functions are analyzed (recognition, counting, grounding, OCR); more complex reasoning and causal understanding are not covered.
  2. Dependence on carefully designed image pairs: Token Swapping relies on the construction of minimal-difference image pairs, which poses a bottleneck for analyzing novel functions.
  3. Variable localization clarity across functions: Recognition exhibits a distributed rather than localized pattern, indicating that not all functions have well-defined VFLs.
  4. Layer selection for VFL-LoRA requires prior knowledge: Function layers must first be identified via Token Swapping analysis, increasing the barrier to adoption.
  5. Performance drop on CV-Distance subtask: This subtask relies more on language priors than visual information, and VFL-LoRA offers limited benefit for such tasks.

Related Context

  • LLM Interpretability: The hierarchical functional division observed in text-only LLMs (e.g., shallow layers for syntax, deep layers for semantics) finds its visual counterpart in MLLMs through this work.
  • LoRA Variants: VFL-LoRA represents a mechanism-informed layer selection strategy rather than an empirical search, offering stronger theoretical grounding than random or gradient-based layer selection.
  • Implications: The VFL findings can guide MLLM pruning — if a downstream scenario requires only recognition and counting, the OCR function layers can be safely skipped to accelerate inference.

Rating

⭐⭐⭐⭐⭐ — The scientific findings are profound (cross-model consistency of functional hierarchy), the methodology is elegant (Token Swapping), the practical applications are compelling (VFL-LoRA and data selection), and the experiments are comprehensive (multiple models × multiple tasks × ablations + applications). This represents an important contribution to the field of MLLM interpretability.