VisionZip: Longer is Better but Not Necessary in Vision Language Models

Conference: CVPR 2025
arXiv: 2412.04467
Code: https://github.com/dvlab-research/VisionZip
Area: Multimodal VLMs
Keywords: Vision Token Compression, Attention Redundancy, Token Selection and Merging, Efficient Inference, Multi-turn Dialogue

TL;DR

VisionZip shows that the visual tokens produced by vision encoders (CLIP/SigLIP) are highly redundant: a small fraction of tokens receives the vast majority of attention and carries most of the information. Based on this observation, the paper proposes a text-independent token selection and merging method that retains over 95% of model performance with roughly 10% of the tokens and speeds up prefill by about 8×.

Background & Motivation

VLM performance scaling has relied heavily on longer visual token sequences: LLaVA-1.5 uses 576 visual tokens and LLaVA-NeXT generates 2,880 tokens from a 672×672 image, whereas the text prompt typically contains only dozens to hundreds of tokens. The resulting computational and memory overhead severely constrains deployment in settings such as edge computing and autonomous driving. Key Challenge: Image information is naturally much sparser than text, yet VLMs use far more visual tokens than text tokens.

Limitations of Prior Work: Existing methods (e.g., FastV, SparseVLM) rely on text-visual attention within LLM layers to progressively prune visual tokens. These methods share a fundamental drawback: the vision encoder has already aggregated information into a few "surrogate tokens" whose locations often do not coincide with the salient regions of the image (they may sit in the background or at the borders). Selection based on text alignment therefore tends to miss the tokens that actually carry the information. Key Insight of VisionZip: Select the most informative tokens directly at the output of the vision encoder, in a text-independent manner.

Method

Overall Architecture

VisionZip inserts a lightweight token compression module between the vision encoder and the LLM. First, \(K\) "Dominant Tokens" are selected based on the internal attention scores of the vision encoder. Then, the remaining tokens are grouped and merged into \(M\) "Contextual Tokens" through similarity matching. Ultimately, only the compressed \(K+M\) tokens are fed into the LLM. The entire pipeline is training-free, and optionally, the projector can be fine-tuned with a small amount of data to further boost performance.

Key Designs

Dominant Token Selection:
- Function: Selects the core tokens that aggregate the most information at the output of the vision encoder.
- Mechanism: Evaluates each token's importance using the attention scores from the second-to-last layer of the vision encoder. For models with a CLS token (e.g., CLIP), it selects the top-\(K\) tokens most attended to by the CLS token; for models without a CLS token (e.g., SigLIP), it computes the average attention each token receives from all other tokens and selects the top-\(K\).
- Design Motivation: Visualization analysis reveals that after the middle layers of the vision encoder, attention sharply converges to a few tokens. By the second-to-last layer (the layer typically selected for VLMs), attention and information are highly concentrated on an extremely small number of "dominant tokens." These tokens naturally preserve the vast majority of image information.
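
The selection rule described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the function name `select_dominant_tokens`, the nested-list input format (a head-averaged attention matrix), and the choice to return the kept indices in their original order are all assumptions made for this example.

```python
from typing import List, Optional


def select_dominant_tokens(
    attn: List[List[float]], k: int, cls_index: Optional[int] = 0
) -> List[int]:
    """Return the indices of the k highest-scoring tokens, in original order.

    attn[i][j] is the attention that query token i pays to key token j
    (assumed to be already averaged over heads).
    """
    n = len(attn)
    if cls_index is not None:
        # CLIP-style encoder: score each patch by the attention CLS pays to it.
        candidates = [j for j in range(n) if j != cls_index]
        scores = {j: attn[cls_index][j] for j in candidates}
    else:
        # SigLIP-style encoder (no CLS): score each token by the mean
        # attention it receives from all other tokens.
        candidates = list(range(n))
        scores = {
            j: sum(attn[i][j] for i in range(n) if i != j) / max(n - 1, 1)
            for j in candidates
        }
    top = sorted(candidates, key=lambda j: scores[j], reverse=True)[:k]
    return sorted(top)  # assumption of this sketch: keep spatial order


# Toy 4-token example; token 0 plays the role of CLS in the first call.
toy_attn = [
    [0.10, 0.60, 0.05, 0.25],
    [0.25, 0.25, 0.25, 0.25],
    [0.10, 0.70, 0.10, 0.10],
    [0.10, 0.20, 0.30, 0.40],
]
print(select_dominant_tokens(toy_attn, k=2, cls_index=0))     # [1, 3]
print(select_dominant_tokens(toy_attn, k=1, cls_index=None))  # [1]
```

In a real model the attention matrix would come from the second-to-last encoder layer, and the selected indices would be used to gather the corresponding hidden states before the projector.
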
Contextual Token Merging:
- Function: Recovers small but essential details carried by the tokens that dominant-token selection would otherwise discard.
- Mechanism: Evenly splits the non-dominant tokens into two groups: "target tokens" and "to-be-merged tokens". It uses a similarity matrix computed from the tokens' key vectors to assign each to-be-merged token to its most similar target token, then averages each group to produce the contextual tokens.
- Design Motivation: Selecting only dominant tokens may omit tiny but critical details (e.g., small objects, text) in the image. The merging strategy retains semantic similarity information at an extremely low cost, compensating for information loss.
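
The merging step above can be illustrated with a small, dependency-free sketch. Several details are assumptions made for this example rather than facts from the paper summary: targets are picked at a uniform stride so that exactly `m` contextual tokens result, similarity is cosine similarity between key vectors, and each target is averaged together with the tokens assigned to it. The names `cosine` and `merge_contextual_tokens` are hypothetical.

```python
import math
from typing import Dict, List, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def merge_contextual_tokens(
    hidden: List[Vector], keys: List[Vector], m: int
) -> List[List[float]]:
    """Merge the non-dominant tokens into m contextual tokens.

    hidden[i] and keys[i] are the hidden state and key vector of the i-th
    remaining (non-dominant) token.
    """
    n = len(hidden)
    if not 0 < m <= n:
        raise ValueError("m must be between 1 and the number of tokens")
    # Assumption: target tokens are sampled at a uniform stride.
    target_ids = [(i * n) // m for i in range(m)]
    groups: Dict[int, List[int]] = {t: [t] for t in target_ids}
    for i in range(n):
        if i in groups:
            continue  # targets are not reassigned
        # Assign each to-be-merged token to its most similar target.
        best = max(target_ids, key=lambda t: cosine(keys[i], keys[t]))
        groups[best].append(i)
    dim = len(hidden[0])
    # Average the hidden states inside each group.
    return [
        [sum(hidden[i][d] for i in groups[t]) / len(groups[t]) for d in range(dim)]
        for t in target_ids
    ]


keys = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
hidden = [[1.0, 1.0], [3.0, 3.0], [10.0, 0.0], [20.0, 2.0]]
print(merge_contextual_tokens(hidden, keys, m=2))  # [[2.0, 2.0], [15.0, 1.0]]
```

Here tokens 0 and 2 act as targets; token 1 is closest to token 0 and token 3 to token 2, so each pair is averaged into one contextual token.
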
Efficient Tuning:
- Function: Resolves the slight mismatch between the vision and language spaces caused by the drastic reduction in token count.
- Mechanism: Fine-tunes the multimodal projector using only 1/10 of the LLaVA-1.5 training data, while keeping other components frozen. This completes in 30 minutes on 8x A800 GPUs, and can even be run on a single RTX 3090.
- Design Motivation: VLMs are originally trained on full visual token sequences. Suddenly dropping this to 1/10 leads to a minor alignment drift between the vision and LLM spaces, which can be easily repaired by fine-tuning the projector.

Loss & Training

VisionZip itself is training-free and does not introduce new loss functions. The optional Efficient Tuning phase utilizes standard instruction-tuning loss, updating only the parameters of the projector.

Key Experimental Results

Main Results (LLaVA-1.5, 576 tokens baseline)

Token Count	Method	Avg. Retention across 11 Benchmarks	vs FastV	vs SparseVLM
192 (↓66.7%)	VisionZip	98.5%	+10.3%	+2.1%
192 (↓66.7%)	VisionZip‡	99.1%	+10.9%	+2.7%
128 (↓77.8%)	VisionZip	97.6%	+14.1%	+4.2%
64 (↓88.9%)	VisionZip	94.0%	+18.4%	+8.2%
64 (↓88.9%)	VisionZip‡	95.2%	+19.6%	+9.4%

Main Results (LLaVA-NeXT, 2880 tokens baseline)

Token Count	Method	Avg. Retention across 7 Benchmarks	Description
640 (↓77.8%)	VisionZip	97.6%	No extra training required
640 (↓77.8%)	VisionZip‡	98.9%	Only 30 minutes fine-tuning
320 (↓88.9%)	VisionZip‡	97.9%	Maintains high accuracy even with ~90% token reduction
160 (↓94.4%)	VisionZip‡	95.5%	Uses only about 5.6% of the tokens
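
The reduction percentages in the two tables follow directly from the token counts. The short check below recomputes them; the helper name `token_reduction` is introduced only for this illustration.

```python
def token_reduction(baseline: int, kept: int) -> float:
    """Percentage of visual tokens removed, rounded to one decimal place."""
    return round(100.0 * (baseline - kept) / baseline, 1)


# LLaVA-1.5 (576-token baseline) and LLaVA-NeXT (2880-token baseline).
for baseline, kept_counts in ((576, (192, 128, 64)), (2880, (640, 320, 160))):
    for kept in kept_counts:
        print(baseline, kept, token_reduction(baseline, kept))
```

For example, keeping 160 of 2,880 tokens removes 94.4% of them, which leaves about 5.6% of the original sequence.
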

Efficiency Analysis

Method	Token Count	Total Inference Time	Speedup	Prefill Latency	Prefill Speedup
Baseline	2880	2293s	1.0×	218ms	1.0×
FastV	160	1792s	1.3×	119ms	1.8×
SparseVLM	160	1895s	1.2×	128ms	1.7×
VisionZip	160	756s	3.0×	27.8ms	7.8×
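
As a sanity check, the speedup columns can be recomputed from the raw timings in the table (total inference time in seconds, prefill latency in milliseconds). The dictionary layout and the function name `speedups` are choices made for this sketch.

```python
from typing import Dict, Tuple

# (total inference time in s, prefill latency in ms), copied from the table.
TIMINGS: Dict[str, Tuple[float, float]] = {
    "Baseline": (2293.0, 218.0),
    "FastV": (1792.0, 119.0),
    "SparseVLM": (1895.0, 128.0),
    "VisionZip": (756.0, 27.8),
}


def speedups(
    timings: Dict[str, Tuple[float, float]], baseline: str = "Baseline"
) -> Dict[str, Tuple[float, float]]:
    """Return (total speedup, prefill speedup) relative to the baseline row."""
    base_total, base_prefill = timings[baseline]
    return {
        name: (round(base_total / total, 1), round(base_prefill / prefill, 1))
        for name, (total, prefill) in timings.items()
    }


for name, (total_x, prefill_x) in speedups(TIMINGS).items():
    print(f"{name}: {total_x}x total, {prefill_x}x prefill")
```

The recomputed ratios match the table: at the same 160-token budget, VisionZip reaches about 3.0× total and 7.8× prefill speedup, while the in-LLM pruning methods stay below 2× on both measures.
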

Ablation Study

Configuration	TextVQA (64 tokens)	Description
SparseVLM Baseline	51.1	Select 64 from 576 tokens using LLM attention
Remove the top 50 vision-encoder attention tokens first	46.4 (−9.2%)	Indicates that the tokens with the highest vision-encoder attention carry information SparseVLM needs
Provide only Top 128 VisionZip tokens to SparseVLM	52.5 (+2.7%)	Performance improves after pre-filtering

Key Findings

On benchmarks like MMMU and MMVet, reducing token count actually improves accuracy, indicating that redundant visual tokens can act as noise.
The attention concentration of the vision encoder is attributed to the gradient characteristics of the Softmax function: \(\frac{\partial \text{softmax}(z_i)}{\partial z_i} = \text{softmax}(z_i) \cdot (1-\text{softmax}(z_i))\). While an attention weight is below 0.5, a token with a higher weight also receives a larger gradient, so already-attended tokens attract even more attention during training, a "Matthew effect."
Text-dependent methods (FastV/SparseVLM) select tokens that do not align with where the vision encoder aggregates information, as the latter tends to appear in background regions rather than on the main subjects.
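
For the Softmax gradient finding above, it may help to write out the standard Softmax Jacobian explicitly. The shorthand \(s_i = \text{softmax}(z_i)\) is introduced here only for readability.

\[
s_i = \frac{e^{z_i}}{\sum_j e^{z_j}}, \qquad
\frac{\partial s_i}{\partial z_i} = s_i\,(1 - s_i), \qquad
\frac{\partial s_i}{\partial z_k} = -\,s_i\,s_k \quad (k \neq i).
\]

Since \(\frac{d}{ds}\,[s(1-s)] = 1 - 2s > 0\) for \(s < \tfrac{1}{2}\), the self-gradient grows with the attention weight throughout the regime where no single token holds more than half of the attention mass. For instance, \(s = 0.01\) gives \(s(1-s) = 0.0099\), whereas \(s = 0.1\) gives \(0.09\), roughly nine times larger. With hundreds of visual tokens, most weights sit far below \(\tfrac{1}{2}\), which is consistent with the "rich get richer" dynamic described above.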

Highlights & Insights

Deep exploration into the root causes of redundancy: Beyond proposing a method, the paper provides directional insights for vision encoder design by explaining why vision encoders generate token redundancy through the lens of Softmax gradient properties and the "Attention Sink" phenomenon.
Text-independent advantage: Unlike FastV/SparseVLM which depend on internal LLM attention, VisionZip performs compression on the vision encoder side. This natively supports multi-turn dialogue scenarios, as the tokens stored in the KV Cache do not become invalid when the conversation topic shifts.
13B outperforming 7B: VisionZip enables LLaVA-NeXT 13B to run faster than native 7B with higher accuracy, which is a highly practical finding.
Minimalist design philosophy: Consisting of only attention selection and similarity merging without any extra modules, it significantly outperforms methods requiring LLM forward propagation.

Limitations & Future Work

Dominant token selection relies solely on attention from a single late layer of the vision encoder (the second-to-last), potentially missing details that are only informative in shallower layers.
The contextual token merging strategy (uniform splitting + average merging) is relatively coarse. Finer merging schemes could be explored.
The paper primarily evaluates on LLaVA series and Mini-Gemini, without sufficient validation on other mainstream models like Qwen2-VL or InternVL.
In the training-free mode, average performance still drops by about 6% at extremely low token counts (94.0% retention at 64 tokens on LLaVA-1.5), so strict application scenarios may require the additional fine-tuning step.

vs FastV: FastV prunes tokens layer by layer using text attention in shallow LLM layers, but it cannot avoid the full computation cost of those shallow layers, and the selected tokens may lack complete information. VisionZip finishes compression before the tokens enter the LLM, which yields a much larger end-to-end speedup (3.0× vs 1.3× total inference time at 160 tokens against the 2,880-token baseline).
vs SparseVLM: SparseVLM also relies on text-visual attention within LLM layers. It degrades sharply on video tasks (MSRVTT), preserving only 54.7% of performance, while VisionZip preserves 91.9%.
vs NVILA (2412.04468): NVILA's STC spatial compression is integrated during training, whereas VisionZip is a plug-and-play inference acceleration solution, making them complementary.

Rating

Novelty: ⭐⭐⭐⭐ The core observation (visual token redundancy) is not entirely new, but the analysis from the perspective of Softmax gradients and the text-independent selection strategy offers unique insights.
Experimental Thoroughness: ⭐⭐⭐⭐⭐ 11 image benchmarks, 4 video benchmarks, 3 VLM architectures, detailed efficiency analyses, and ablation studies.
Writing Quality: ⭐⭐⭐⭐ Computation details and visualizations (attention distributions, feature misalignment) are rich and clear, although some tables are highly dense.
Value: ⭐⭐⭐⭐ High practical value as a plug-and-play inference acceleration solution, though it requires validation on more mainstream models.