Vision Function Layer in Multimodal LLMs¶
Conference: NeurIPS 2025 arXiv: 2509.24791 Code: GitHub Area: Multimodal Large Language Models / Interpretability Keywords: MLLM Internal Mechanism, Vision Function Layer, Token Swapping, LoRA, Data Selection
TL;DR¶
This paper identifies that vision-related functional decoding in MLLMs is concentrated in specific narrow layer blocks (Vision Function Layers), exhibiting a consistent hierarchical order across model families (recognition → counting → grounding → OCR). Building on this finding, the authors propose VFL-LoRA (matching or exceeding full-LoRA performance with roughly half the trainable parameters) and VFL-select (achieving 98% of full-data performance with 20% of the data).
Background & Motivation¶
- Core Problem: Despite significant advances in visual understanding, how MLLMs internally process and reason over visual tokens remains a "black box."
- Limitations of Prior Work: Existing interpretability studies have primarily focused on token importance and cross-modal interactions, neglecting how different visual functions are internally represented and coordinated across layers.
- Key Gap: A diagnostic framework capable of isolating individual visual functions is absent — most general-purpose tasks simultaneously require multiple capabilities, yielding only coarse conclusions (e.g., "shallow layers extract features, deep layers perform reasoning").
- Additional Challenge: Different MLLMs employ different visual encoders and connector modules, further complicating the analysis of internal mechanisms.
Method¶
Vision Token Swapping Analysis Framework¶
Mechanism: At decoding layer \(k\), the visual-token entries in the KV cache from the original image are replaced with those from a different image, and the resulting change in the output is observed (see the sketch after this list). Carefully designed minimal-difference image pairs are used to isolate individual visual functions:
- OCR: Different words rendered on a blank canvas.
- Recognition: COCO images vs. blank canvas, querying the presence of a specific object.
- Counting: CLEVR dataset images differing only in the number of objects.
- Grounding: Identical objects placed at different positions.
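The sketch below shows one way such a swap could be realized, assuming a HuggingFace-style MLLM whose prefill pass returns legacy-format per-layer (key, value) caches; the contiguous visual-token span indices and the helper itself are assumptions for illustration, not the authors' released code.

```python
# Minimal sketch of vision-token KV swapping at a single decoder layer.
import torch

@torch.no_grad()
def swap_visual_kv(model, inputs_a, inputs_b, layer_k, vis_start, vis_end):
    """Prefill on both images, then splice image B's visual-token keys/values
    into image A's cache at decoder layer `layer_k`."""
    out_a = model(**inputs_a, use_cache=True)   # prefill with the original image
    out_b = model(**inputs_b, use_cache=True)   # prefill with the paired image (same question)

    cache_a = list(out_a.past_key_values)       # per-layer (key, value) pairs
    cache_b = out_b.past_key_values

    k_a, v_a = (t.clone() for t in cache_a[layer_k])
    k_b, v_b = cache_b[layer_k]
    # Replace only the visual-token positions; text-token entries stay intact.
    k_a[:, :, vis_start:vis_end] = k_b[:, :, vis_start:vis_end]
    v_a[:, :, vis_start:vis_end] = v_b[:, :, vis_start:vis_end]
    cache_a[layer_k] = (k_a, v_a)

    return tuple(cache_a)   # pass back as past_key_values when decoding
```

Decoding then continues from the spliced cache, and the change in the answer relative to the unmodified run indicates whether layer \(k\) carries the probed visual function.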
Key Findings: Vision Function Layer¶
Taking Qwen-2.5-VL-7B (28 layers) as an example:
| Visual Function | Peak Layer | Peak Change Rate | Characteristics |
|---|---|---|---|
| Recognition | Layers 0–10 | Distributed | Established early, sustained influence |
| Counting | Layer 12 | 87.4% | Concentrated in middle layers |
| Grounding | Layer 18 | 100.0% | Concentrated in mid-deep layers |
| OCR | Layer 22 | 92.8% | Concentrated in deep layers |
Alignment with Human Cognition: The hierarchy mirrors human visual processing — recognition → counting → grounding → reading — and is consistent across LLaVA and Qwen model families.
Vision Token Dropping Validation¶
Visual tokens are progressively removed from the deeper layers on general VQA benchmarks to validate the VFL findings (see the sketch below):
- Dropping the last 4 layers: OCR/TextVQA drops sharply (Qwen-7B: 82.8→74.1); other tasks are largely unaffected.
- Dropping the last 8 layers: OCR-type tasks nearly collapse (82.8→15.3); Recognition/Spatial tasks begin to degrade.
- Dropping the last 12 layers: all visual tasks degrade significantly.
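A minimal sketch of this dropping intervention, under the same legacy-cache assumption as the swapping sketch above; note that a practical decoding loop would also have to handle the now-differing per-layer cache lengths.

```python
# Remove visual-token entries from the KV cache of the last n_drop layers,
# so subsequent text tokens can no longer attend to visual information there.
import torch

@torch.no_grad()
def drop_visual_kv(past_key_values, n_drop, vis_start, vis_end):
    """past_key_values: per-layer (key, value) tensors,
    each shaped [batch, num_heads, seq_len, head_dim]."""
    num_layers = len(past_key_values)
    cut = lambda t: torch.cat([t[:, :, :vis_start], t[:, :, vis_end:]], dim=2)

    new_cache = []
    for idx, (k, v) in enumerate(past_key_values):
        if idx >= num_layers - n_drop:      # only the deepest n_drop layers
            new_cache.append((cut(k), cut(v)))
        else:
            new_cache.append((k, v))
    return tuple(new_cache)
```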
Application 1: VFL-LoRA¶
LoRA is applied only to the layers corresponding to the target visual function rather than to all layers. Using spatial reasoning as an example, LoRA is trained on Qwen2.5-VL-7B targeting the counting function layers (layers 10–17 and 20–23); a configuration sketch follows the list below:
- Trainable parameters: 155M vs. 309M for full-LoRA (50% reduction).
- In-domain average: 85.0% vs. 84.4% for full-LoRA (on par or slightly better).
- Out-of-domain average: 75.0% vs. 74.3% for full-LoRA (better generalization, reduced catastrophic forgetting).
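A configuration sketch of this layer-restricted LoRA, assuming a recent transformers release that ships Qwen2.5-VL and PEFT's `layers_to_transform` option; the rank, alpha, and target module names are illustrative assumptions, not values reported in the paper.

```python
from peft import LoraConfig, get_peft_model
from transformers import Qwen2_5_VLForConditionalGeneration

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-7B-Instruct"
)

# Counting function layers identified above: 10-17 and 20-23.
vfl_layers = list(range(10, 18)) + list(range(20, 24))

lora_cfg = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    layers_to_transform=vfl_layers,   # restrict LoRA to the VFL block
    layers_pattern="layers",          # name of the decoder layer list
    task_type="CAUSAL_LM",
)

peft_model = get_peft_model(model, lora_cfg)
peft_model.print_trainable_parameters()  # roughly half of full-LoRA's budget
```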
Application 2: VFL-select Data Selection¶
By analyzing performance differences on training data when specific VFLs are ablated, data samples are automatically categorized by function. Using 20% of the data achieves 98% of full-data performance, surpassing manually curated expert data selection.
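A minimal sketch of this selection logic, assuming a model wrapper that can ablate visual tokens in a given set of decoder layers; the `ablate_visual_layers` keyword is a hypothetical hook, and the per-function layer ranges are illustrative placeholders keyed to the peak layers reported above.

```python
# Score each training sample by how much its loss rises when a function's
# layers are ablated, then tag it with the function whose ablation hurts most.
import torch

VFL_LAYERS = {
    "counting": range(10, 14),
    "grounding": range(16, 20),
    "ocr": range(20, 24),
}

@torch.no_grad()
def loss_with_ablation(model, sample, ablate_layers):
    out = model(**sample, ablate_visual_layers=list(ablate_layers))  # hypothetical kwarg
    return out.loss.item()

def tag_sample(model, sample):
    """Assign the sample to the visual function it depends on most."""
    base = loss_with_ablation(model, sample, [])
    gaps = {fn: loss_with_ablation(model, sample, layers) - base
            for fn, layers in VFL_LAYERS.items()}
    return max(gaps, key=gaps.get)
```

A fixed budget (e.g., 20% of the data) can then be drawn across the tagged pools so that every visual function remains covered.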
Key Experimental Results¶
Effect of Vision Token Dropping on Each Task (Qwen2.5-VL-7B)¶
| Layers Dropped | SQA-I | POPE | TextVQA | OCR | ChartQA |
|---|---|---|---|---|---|
| 0 (baseline) | 87.2 | 86.1 | 82.8 | 82.2 | 83.2 |
| drop 4 | 87.4 | 86.3 | 74.1↓ | 76.3↓ | 82.7↓ |
| drop 8 | 87.4 | 86.2 | 15.3↓↓ | 5.5↓↓ | 20.5↓↓ |
| drop 12 | 87.2 | 79.5↓ | 13.8↓↓ | 3.7↓↓ | 17.4↓↓ |
VFL-LoRA vs. Full-LoRA (Qwen2.5-VL-7B)¶
| Method | Params (%) | CV-Count | CV-Avg | ChartQA | MMMU | POPE |
|---|---|---|---|---|---|---|
| Baseline | - | 68.0 | 82.1 | 83.2 | 50.7 | 86.1 |
| Full-LoRA | 1.9% | 70.9 | 84.4 | 86.2 | 50.1 | 86.6 |
| VFL-LoRA | 0.9% | 72.6 | 85.0 | 86.4 | 51.7 | 86.9 |
| Reversed-VFL | 0.9% | 69.0 | 82.7 | 85.9 | 51.2 | 84.9 |
VFL-select Data Selection¶
- 20% of data achieves 98% of full-data performance.
- Outperforms human expert data selection under the same budget constraint.
Highlights & Insights¶
- Cross-model consistent functional hierarchy: From LLaVA to Qwen, from 3B to 13B parameters, the hierarchical order of vision function layers (recognition → counting → grounding → OCR) is remarkably consistent, suggesting that MLLMs may develop human-like hierarchical visual processing strategies.
- Token Swapping is more precise than traditional probing: Minimal-difference image pairs enable function-level causal analysis rather than mere correlation analysis.
- Significant practical value: VFL-LoRA surpasses full-LoRA with half the parameters and reduced forgetting; VFL-select achieves 98% performance with 1/5 of the data — both applications offer clear engineering utility.
- Reversed-VFL provides strong counter-evidence: Applying LoRA to non-functional layers yields significantly worse performance than VFL-LoRA, confirming the validity of the function layer localization.
Limitations & Future Work¶
- Limited functional granularity: Only four visual functions are analyzed (recognition, counting, grounding, OCR); more complex reasoning and causal understanding are not covered.
- Dependence on carefully designed image pairs: Token Swapping relies on the construction of minimal-difference image pairs, which poses a bottleneck for analyzing novel functions.
- Variable localization clarity across functions: Recognition exhibits a distributed rather than localized pattern, indicating that not all functions have well-defined VFLs.
- Layer selection for VFL-LoRA requires prior knowledge: Function layers must first be identified via Token Swapping analysis, increasing the barrier to adoption.
- Performance drop on CV-Distance subtask: This subtask relies more on language priors than visual information, and VFL-LoRA offers limited benefit for such tasks.
Related Work & Insights¶
- LLM Interpretability: The hierarchical functional division observed in text-only LLMs (e.g., shallow layers for syntax, deep layers for semantics) finds its visual counterpart in MLLMs through this work.
- LoRA Variants: VFL-LoRA represents a mechanism-informed layer selection strategy rather than an empirical search, offering stronger theoretical grounding than random or gradient-based layer selection.
- Implications: The VFL findings can guide MLLM pruning — if a downstream scenario requires only recognition and counting, the OCR function layers can be safely skipped to accelerate inference.
Rating¶
⭐⭐⭐⭐⭐ — The scientific findings are profound (cross-model consistency of functional hierarchy), the methodology is elegant (Token Swapping), the practical applications are compelling (VFL-LoRA and data selection), and the experiments are comprehensive (multiple models × multiple tasks × ablations + applications). This represents an important contribution to the field of MLLM interpretability.