
VISion On Request: Enhanced VLLM Efficiency with Sparse, Dynamically Selected, Vision-Language Interactions

Conference: CVPR 2026
arXiv: 2603.23495
Code: None (based on LLaVA-OV)
Area: Multimodal VLM
Keywords: Large vision-language model efficiency, visual token sparsification, dynamic computation allocation, cross-attention, self-attention selection

TL;DR

VISOR proposes an efficiency paradigm distinct from visual token compression: it sparsifies the vision-language interaction layers within the LLM (a small number of cross-attention layers plus dynamically selected self-attention layers). This yields 8.6–18× FLOPs savings while retaining all high-resolution visual tokens, and it substantially outperforms token compression methods on challenging tasks that require fine-grained understanding.

Background & Motivation

Background: Large vision-language models (LVLMs) typically concatenate visual tokens generated by a vision encoder (e.g., CLIP/SigLIP) with text tokens and feed them into an LLM. High-resolution images produce large numbers of visual tokens, causing computation costs to grow quadratically with token count. Existing efficiency methods almost universally revolve around token compression or pruning.
Limitations of Prior Work: Token compression methods (e.g., VisionZip, PyramidDrop, HiRED) perform adequately on simple tasks requiring coarse-grained understanding, but suffer significant performance degradation on challenging tasks requiring fine-grained reasoning (e.g., DocVQA, ChartQA, InfoVQA). This is because compressing visual tokens inevitably creates an information bottleneck, discarding critical detail.
Key Challenge: The tension between efficiency and fidelity — token compression improves efficiency by reducing token count, but permanently discards visual information. The question is whether efficiency can be improved without discarding tokens at all.
Goal: (1) Substantially reduce LVLM inference cost without compressing or discarding visual tokens; (2) Enable task- and sample-adaptive computation allocation — less computation for simple inputs, more for difficult ones.
Key Insight: Through in-depth analysis of LLaVA-OV, the paper identifies three key observations: (1) vision-language interactions across layers are sparse, exhibiting a sawtooth pattern; (2) in simple tasks, visual features remain nearly unchanged (CKA > 0.9), whereas in complex tasks they are significantly refined (CKA dropping to 0.6); (3) different tasks vary greatly in their demand for visual processing.
Core Idea: Rather than compressing visual tokens, VISOR sparsifies the interaction between LLM layers and visual tokens — using a small number of cross-attention layers to efficiently provide visual context, and a small number of dynamically selected self-attention layers to refine visual representations when needed.

Method

Overall Architecture

VISOR builds on the LLaVA-OV architecture and decouples the full-sequence self-attention in standard LLM layers into three layer types, where \(N_t\) and \(N_v\) denote the numbers of text and visual tokens and \(d\) the hidden dimension:
- Text-only layers (the majority): process only text tokens without touching visual tokens, incurring minimal computation.
- Cross-attention layers: text tokens query visual tokens without updating visual representations, with cost \(O(N_t N_v d)\), far below full attention.
- Self-attention layers: process the complete vision-plus-text sequence and update visual tokens; these have the highest cost but enable visual feature refinement.

Cross-attention layers are uniformly distributed throughout the model; the number and positions of self-attention layers are determined dynamically based on the task.
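One rough way to see why this decomposition is cheap is to count only the attention-score and attention-value products for each layer type. The sketch below is an illustrative estimate, not the paper's FLOPs accounting: the function names, the uniform-placement rule in `build_schedule`, and all sizes in the usage lines are assumptions, and projection and MLP costs are ignored.

```python
def attention_cost(layer_type: str, n_text: int, n_vis: int, d: int) -> int:
    """Approximate multiply-adds for the QK^T and attention-times-V products."""
    if layer_type == "text":    # text tokens attend to text tokens only
        return 2 * n_text * n_text * d
    if layer_type == "cross":   # text queries over visual keys/values
        return 2 * n_text * n_vis * d
    if layer_type == "self":    # full sequence attends to full sequence
        n = n_text + n_vis
        return 2 * n * n * d
    raise ValueError(f"unknown layer type: {layer_type}")


def build_schedule(num_layers: int, num_cross: int, self_layers: set) -> list:
    """Label each layer; cross-attention layers are spread uniformly (assumed rule)."""
    stride = num_layers / num_cross if num_cross else 0
    cross_layers = {int(i * stride) for i in range(num_cross)}
    schedule = []
    for idx in range(num_layers):
        if idx in self_layers:
            schedule.append("self")
        elif idx in cross_layers:
            schedule.append("cross")
        else:
            schedule.append("text")
    return schedule


def total_cost(schedule, n_text: int, n_vis: int, d: int) -> int:
    """Sum the attention-only cost over all layers in a schedule."""
    return sum(attention_cost(t, n_text, n_vis, d) for t in schedule)


# Hypothetical sizes, chosen only for illustration.
schedule = build_schedule(num_layers=24, num_cross=6, self_layers={10, 22})
dense = total_cost(["self"] * 24, n_text=64, n_vis=4096, d=1024)
sparse = total_cost(schedule, n_text=64, n_vis=4096, d=1024)
print(schedule.count("text"), schedule.count("cross"), schedule.count("self"))
print(f"attention-only cost ratio: {dense / sparse:.1f}x")
```

Because the estimate leaves out the per-token MLP and projection work that text-only and cross-attention layers also skip for visual tokens, the printed ratio should not be read as a reproduction of the reported FLOPs reductions.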

Key Designs

Efficient Visual Context: Cross-Attention
- Function: Allows text tokens to efficiently query static visual features without updating visual tokens.
- Mechanism: A small set of uniformly distributed layers \(\mathcal{L}_{CA}\) is selected; in these layers, text tokens serve as queries and visual tokens as keys/values for cross-attention, with the result added back to the text stream via residual connection. Crucially, visual tokens remain fixed at their initial value \(\mathbf{V}^{(0)}\) throughout these layers. To preserve positional information, a 1D depthwise separable convolution (kernel size 7) is introduced for conditional positional encoding.
- Design Motivation: Analysis reveals that vision-language interactions are sparse across most layers, and simple tasks only require querying visual information at a few critical points. Cross-attention FLOPs scale as \(O(N_t N_v d)\) versus \(O((N_t + N_v)^2 d)\) for full self-attention, yielding substantial savings when \(N_v \gg N_t\).
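The cross-attention mechanism described above can be sketched in a few lines of NumPy. This is a minimal single-head illustration under stated assumptions, not the paper's implementation: the weight names, shapes, and random inputs are hypothetical, and multi-head splitting, normalization, and the depthwise-convolution positional encoding are omitted.

```python
import numpy as np


def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(text, visual, w_q, w_k, w_v, w_o):
    """Single-head cross-attention: text queries, visual keys/values.

    text:   (n_text, d) text hidden states
    visual: (n_vis, d) visual tokens, read-only in this layer
    Returns updated text states; `visual` is not modified.
    """
    q = text @ w_q                             # (n_text, d)
    k = visual @ w_k                           # (n_vis, d)
    v = visual @ w_v                           # (n_vis, d)
    scores = q @ k.T / np.sqrt(q.shape[-1])    # (n_text, n_vis)
    attn = softmax(scores, axis=-1)
    return text + (attn @ v) @ w_o             # residual added to the text stream


# Hypothetical toy sizes and random weights, for illustration only.
rng = np.random.default_rng(0)
d, n_text, n_vis = 16, 4, 50
text = rng.normal(size=(n_text, d))
visual = rng.normal(size=(n_vis, d))
w_q, w_k, w_v, w_o = (rng.normal(size=(d, d)) * 0.1 for _ in range(4))

visual_before = visual.copy()
out = cross_attention(text, visual, w_q, w_k, w_v, w_o)
print(out.shape)  # (4, 16)
```

The score matrix has shape `(n_text, n_vis)` rather than `(n_text + n_vis, n_text + n_vis)`, which is where the savings for \(N_v \gg N_t\) come from.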
Selective Self-Attention Refinement
- Function: Updates and refines visual token representations when needed, supporting fine-grained reasoning on complex tasks.
- Mechanism: At a small set of selected layers \(\mathcal{L}_{SA}\), standard full-sequence self-attention is performed over both visual and text tokens, updating visual tokens \(\mathbf{V}^{(l-1)} \to \mathbf{V}^{(l)}\). Subsequent cross-attention layers then query these refined visual tokens, enabling progressive refinement from low-level to high-level visual features.
- Design Motivation: CKA analysis shows that visual features are significantly refined in difficult tasks (forming clusters in stages), while cross-attention alone cannot update visual tokens, limiting fine-grained understanding. Self-attention layers provide this necessary capacity for visual feature updates.
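The interplay between the two layer types can be illustrated with a toy layer stack in which visual tokens change only at self-attention layers, and later cross-attention layers read whatever the most recent refinement produced. This is a sketch under simplifying assumptions: the `attend` helper is parameter-free, causal masking and MLP blocks are omitted, and the schedule and sizes are hypothetical.

```python
import numpy as np


def attend(queries, keys_values):
    """Parameter-free single-head attention, used only to keep the toy short."""
    scores = queries @ keys_values.T / np.sqrt(queries.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ keys_values


def forward(text, visual, schedule):
    """Run a toy layer stack; return final text, final visual, and update count."""
    visual_updates = 0
    for layer_type in schedule:
        if layer_type == "self":
            # Full-sequence attention: both streams are updated.
            seq = np.concatenate([visual, text], axis=0)
            seq = seq + attend(seq, seq)
            visual, text = seq[: len(visual)], seq[len(visual):]
            visual_updates += 1
        elif layer_type == "cross":
            # Text reads the current (possibly refined) visual tokens.
            text = text + attend(text, visual)
        else:
            # Text-only layer: visual tokens are untouched.
            text = text + attend(text, text)
    return text, visual, visual_updates


# Hypothetical toy inputs and schedule.
rng = np.random.default_rng(0)
text0 = rng.normal(size=(4, 8))
visual0 = rng.normal(size=(20, 8))
schedule = ["cross", "text", "self", "text", "cross", "text"]
text_out, visual_out, n_updates = forward(text0, visual0, schedule)
print(n_updates, visual_out.shape)
```

With a schedule that contains no `"self"` entries, `forward` returns the visual tokens unchanged, which mirrors the cross-attention-only configuration in the ablation.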
Universal Model Training + Adaptive Inference
- Function: A single model supports multiple computational budgets and allocates computation per sample at runtime based on input complexity.
- Mechanism: Three steps. (1) Establish bounds: set \(|\mathcal{L}_{CA}| = |\mathcal{L}_{SA}| = L/3\) as the upper limit, where \(L\) is the number of LLM layers, and pretrain the maximum-configuration model; (2) Identify viable sub-networks: systematically evaluate different subsets of self-attention layers from the pretrained model; (3) Universal fine-tuning: randomly sample a feasible configuration at each training step, making the model robust across all configurations. At inference time, an MLP routing layer placed before the first optional self-attention block processes routing tokens and predicts the optimal configuration for the current sample. The routing policy is trained via offline pseudo-labeling: all configurations are run on a training subset, and the most efficient configuration that reaches 99% of full-model accuracy is selected as the pseudo-label.
- Design Motivation: Different tasks — and even different samples within the same task — require varying amounts of visual processing. A universal model avoids training and storing multiple models, while the routing mechanism enables genuine sample-level adaptive computation.
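The pseudo-labeling rule and the per-step configuration sampling can be sketched as follows. This is an assumption-laden illustration: the function names, the per-sample `(cost, score)` layout, the uniform sampling, and every number in the usage lines are hypothetical and do not come from the paper.

```python
import random


def pick_pseudo_label(results, full_config, threshold=0.99):
    """Return the cheapest configuration scoring >= threshold * full-model score.

    results: dict mapping config name -> (cost, score).
    The full configuration always qualifies when scores are non-negative.
    """
    target = threshold * results[full_config][1]
    viable = [(cost, name) for name, (cost, score) in results.items()
              if score >= target]
    return min(viable)[1]  # lowest cost wins; ties broken by name


def sample_config(feasible_configs, rng):
    """Universal fine-tuning step: draw one feasible configuration (uniform, assumed)."""
    return rng.choice(feasible_configs)


# Hypothetical costs and scores for three self-attention budgets.
results = {
    "sa0": (1.0, 0.70),
    "sa2": (1.6, 0.86),
    "sa9": (3.0, 0.865),
}
label = pick_pseudo_label(results, full_config="sa9")
print(label)  # sa2

rng = random.Random(0)
step_config = sample_config(list(results), rng)
```

In this toy case `sa2` is selected because it is the cheapest option within 1% of the full configuration's score; the router would then be trained to predict that label from the input.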

Loss & Training

Two-stage training: (1) freeze the original model and fine-tune newly added attention layers on 4M knowledge data; (2) full model fine-tuning on 3.2M high-quality data.
AdamW optimizer, no weight decay, batch size 128.
Routing network trained with standard cross-entropy loss.
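A router trained with cross-entropy against configuration pseudo-labels could look like the sketch below. The architecture details are assumptions made for illustration only: mean pooling of the routing tokens, a two-layer ReLU MLP, the hidden size, and the number of candidate configurations are all hypothetical.

```python
import numpy as np


def router_logits(routing_tokens, w1, b1, w2, b2):
    """Two-layer MLP over mean-pooled routing tokens (pooling is an assumption)."""
    pooled = routing_tokens.mean(axis=0)           # (d,)
    hidden = np.maximum(pooled @ w1 + b1, 0.0)     # ReLU, (hidden,)
    return hidden @ w2 + b2                        # (num_configs,)


def cross_entropy(logits, label):
    """Cross-entropy of one sample against an integer pseudo-label."""
    shifted = logits - logits.max()                # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum())
    return -log_probs[label]


# Hypothetical sizes: 4 routing tokens of width 16, 3 candidate configurations.
rng = np.random.default_rng(0)
d, hidden_dim, num_configs = 16, 32, 3
tokens = rng.normal(size=(4, d))
w1 = rng.normal(size=(d, hidden_dim)) * 0.1
b1 = np.zeros(hidden_dim)
w2 = rng.normal(size=(hidden_dim, num_configs)) * 0.1
b2 = np.zeros(num_configs)

logits = router_logits(tokens, w1, b1, w2, b2)
loss = cross_entropy(logits, label=1)              # label index is a pseudo-label
predicted_config = int(np.argmax(logits))
print(logits.shape, float(loss) > 0.0)
```

At inference time only the `argmax` over the logits is needed, so the router adds little overhead relative to a single LLM layer.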

Key Experimental Results

Main Results

| Method | Simple Tasks (Mean) | Hard Tasks (Mean) | FLOPs Reduction |
|---|---|---|---|
| LLaVA-OV (Baseline) | 61.5 | 57.1 | 1.0× |
| VisionZip† | 59.3 | 43.1 | 5.7× |
| M3 | 64.0 | 56.6 | 8.0× |
| HiRED | 59.3 | 39.0 | 5.0× |
| VISOR | 63.6 | 58.4 | 8.6× |
| VISOR-TR | 63.3 | 57.8 | 18× |

VISOR outperforms all token compression methods on hard tasks while achieving greater efficiency gains.

Ablation Study

| # SA Layers | # CA Layers | Simple | Hard | Note |
|---|---|---|---|---|
| 0 | 6 | 63.3 | 51.8 | Cross-attention only |
| 2 | 8 | 63.5 | 56.2 | Few self-attention layers |
| 9 (L/3) | 9 (L/3) | 63.6 | 58.4 | Full configuration |

| Method Combination | FLOPs Reduction | Simple | Hard |
|---|---|---|---|
| VISOR | 8.9× | 63.6 | 58.4 |
| VISOR-TR [2×] | 17.8× | 63.3 | 57.8 |
| VISOR-TR [4×] | 35.0× | 63.1 | 56.2 |
| VISOR + VisionZip | 37.0× | 63.3 | 55.3 |
| VISOR + VisPruner | 39.0× | 63.5 | 55.9 |

Key Findings

Cross-attention alone (0 SA layers) already surpasses most token compression methods on simple tasks (63.3, above both VisionZip and HiRED in the main results), but falls notably short on hard tasks (51.8), validating the necessity of visual feature refinement.
Hard-task accuracy is highly sensitive to the number of SA layers: adding just 2 SA layers improves hard-task performance from 51.8 to 56.2, demonstrating that even a small number of self-attention layers is critical for complex reasoning.
Orthogonal and composable with token compression: VISOR + VisPruner achieves 39× FLOPs reduction, with only a 0.1-point drop on simple tasks (63.6 to 63.5) and a 2.5-point drop on hard tasks (58.4 to 55.9).
Adaptive routing is effective: the universal model with routing achieves performance comparable to the best fixed configuration across all benchmarks.

Highlights & Insights

Paradigm innovation: shifting from "compressing visual tokens" to "sparsifying vision-language interaction layers" entirely avoids the information bottleneck problem. The idea is elegant — rather than making the input smaller, it makes the processing sparser.
Analysis-driven design: three complementary analyses — CKA similarity, attention patterns, and layer-dropping experiments — precisely motivate the architectural choices, with every design decision grounded in empirical evidence.
The combination of universal model training and offline pseudo-label routing enables a single model to support multiple computational budgets, which is highly practical for real-world deployment.
Orthogonality with token compression means both approaches can be combined, achieving up to 39× FLOPs reduction in extreme efficiency scenarios.

Limitations & Future Work

Validation is currently limited to LLaVA-OV 0.5B and 1.5B; larger models (7B+) have not been tested.
The routing strategy relies on offline pseudo-labels and cannot perform true online adaptation — end-to-end routing via reinforcement learning is a natural extension.
Cross-attention layer positions are fixed to uniform distribution; whether alternative layer placement strategies yield better performance remains unexplored.
Scenarios such as video understanding that require substantial inter-frame reasoning have not been evaluated.

Comparison with Related Work

vs. VisionZip / PyramidDrop: These token compression methods suffer severe performance degradation on hard tasks (DocVQA drops from 68.7 to 36.7), whereas VISOR maintains or exceeds baseline performance.
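vs. M3: M3 is a training-aware token compression method that performs better on hard tasks than other compression methods but still creates an information bottleneck; VISOR achieves higher hard-task accuracy (58.4 vs. 56.6) at a larger FLOPs reduction (8.6× vs. 8.0×), while M3 remains slightly ahead on simple tasks (64.0 vs. 63.6).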
vs. SparseVLM: SparseVLM uses text token scores to dynamically prune visual tokens, but is fundamentally still a token-reduction approach; VISOR retains all visual tokens and follows an entirely different technical path.

Rating

Novelty: ⭐⭐⭐⭐⭐ Proposes a fundamentally new paradigm for LVLM efficiency optimization, breaking free from the limitations of token compression.
Experimental Thoroughness: ⭐⭐⭐⭐⭐ 13 benchmarks, comparisons with 8+ SOTA methods, extensive ablations and analyses.
Writing Quality: ⭐⭐⭐⭐⭐ Analysis-driven narrative is exceptionally clear, with well-motivated justification for every design choice.
Value: ⭐⭐⭐⭐⭐ Has the potential for paradigm-level impact on the field of LVLM efficiency optimization.