Efficient Inference for Large Vision-Language Models: Bottlenecks, Techniques, and Prospects

Conference: ACL 2026
arXiv: 2604.05546
Code: https://github.com/SuDIS-ZJU/Efficient-LVLMs-Inference
Area: Multimodal VLM / LLM Efficiency
Keywords: vision-language models, inference efficiency, visual token dominance, KV cache, token compression

TL;DR

This paper proposes a systematic taxonomy for efficient inference in large vision-language models (LVLMs), analyzing bottlenecks across an encode–prefill–decode three-stage pipeline. It identifies a systemic efficiency barrier caused by visual token dominance and presents a comprehensive map of optimization techniques spanning information density shaping, long-context attention management, and memory bandwidth breakthroughs.

Background & Motivation

Background: Large vision-language models (e.g., Qwen2.5-VL-72B) have become foundational infrastructure for complex multimodal reasoning, capable of processing high-resolution images and long videos. However, as model scale and input resolution grow, inference efficiency has emerged as the central deployment bottleneck.

Limitations of Prior Work: Visual inputs generate far more tokens than text (typically 576 to 4,000+ visual tokens, vastly exceeding the text prompt), giving rise to visual token dominance. This not only inflates the quadratic cost of attention but also creates a visual memory wall: static visual KV caches consume substantial memory bandwidth. Existing surveys focus on isolated optimization techniques (e.g., token compression or modality-specific efficient architectures) and neglect the systemic interconnections across the inference pipeline.

Key Challenge: LVLM inference is not a monolithic workload but a dynamic pipeline spanning three distinct hardware regimes. Optimizing a single stage in isolation tends to shift bottlenecks elsewhere rather than improving end-to-end latency. Upstream decisions (e.g., encoder resolution) directly determine downstream bottlenecks (e.g., decoding bandwidth), yet this global perspective is absent from existing literature.

Goal: To construct a unified, stage-aware taxonomy for efficient LVLM inference that analyzes the physical nature of bottlenecks at each stage and the combinatorial effects of optimization techniques.

Key Insight: Each stage is analyzed with the Roofline model, which classifies a workload by its arithmetic intensity (FLOPs per byte of memory traffic) and so exposes the physical nature of its bottleneck: encoding is compute-bound (high arithmetic intensity), prefilling is mixed-bound, and decoding is memory-bound (low arithmetic intensity).
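
The Roofline classification above can be sketched in a few lines of Python. This is a minimal illustration rather than the paper's own formulation: the function name `roofline_latency` and the accelerator figures (`PEAK`, `BW`) are hypothetical values chosen only to make the two regimes visible.

```python
from __future__ import annotations


def roofline_latency(flops: float, bytes_moved: float,
                     peak_flops: float, mem_bandwidth: float) -> tuple[float, str]:
    """Return (estimated seconds, binding resource) under a simple Roofline model."""
    compute_time = flops / peak_flops          # time if limited only by compute
    memory_time = bytes_moved / mem_bandwidth  # time if limited only by memory traffic
    if compute_time >= memory_time:
        return compute_time, "compute-bound"
    return memory_time, "memory-bound"


# Hypothetical accelerator (assumed numbers, not taken from the paper).
PEAK = 1e15  # peak throughput in FLOP/s
BW = 2e12    # memory bandwidth in bytes/s

# Workload with high arithmetic intensity (1e4 FLOPs per byte), encoder-like.
enc = roofline_latency(flops=1e13, bytes_moved=1e9, peak_flops=PEAK, mem_bandwidth=BW)

# Workload with low arithmetic intensity (1 FLOP per byte), decode-like.
dec = roofline_latency(flops=1e11, bytes_moved=1e11, peak_flops=PEAK, mem_bandwidth=BW)

print(enc)
print(dec)
```

Taking the maximum of the two times is the same rule that governs a mixed-bound stage: whichever resource takes longer sets the latency.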

Core Idea: Decouple efficiency optimization into three axes—information density shaping (encoding), long-context attention management (prefilling), and memory bandwidth breakthroughs (decoding)—and analyze how isolated optimizations compose to navigate the tradeoff between visual fidelity and system efficiency.

Method

Overall Architecture

The survey is organized around the three-stage LVLM inference pipeline: (1) Encoding—the visual encoder extracts patch embeddings, and the modality adapter aligns them to the LLM token space, producing \(N_v\) visual tokens; (2) Prefilling—the concatenated visual and text context is processed to generate the initial KV cache; (3) Decoding—output tokens are generated autoregressively, with each step requiring loading of model weights and the accumulated KV cache.
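
To make \(N_v\) concrete, the sketch below counts the tokens produced by a ViT-style encoder that splits an image into square patches. The function name and the `merge_factor` parameter (standing in for an adapter that merges neighboring patches) are illustrative assumptions, not an interface defined by the paper.

```python
def visual_token_count(height: int, width: int, patch_size: int,
                       merge_factor: int = 1) -> int:
    """Count visual tokens for a patch-based encoder.

    merge_factor is a hypothetical knob: the number of patches an adapter
    folds into one output token (1 means no merging).
    """
    patches = (height // patch_size) * (width // patch_size)
    return patches // merge_factor


# A 336x336 image with 14-pixel patches gives a 24x24 grid.
low_res = visual_token_count(336, 336, patch_size=14)

# A 1344x1344 image gives a 96x96 grid; merging 2x2 patches divides it by 4.
high_res = visual_token_count(1344, 1344, patch_size=14, merge_factor=4)

print(low_res, high_res)
```

The first case reproduces the 576-token figure at the low end of the range quoted earlier; the second shows how quickly resolution pushes \(N_v\) into the thousands even with patch merging.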

Key Designs

Encoding Stage Optimization (Compute-Bound):
- Function: Minimize encoding latency \(\tau_{\text{ENC}}\) and reduce the number of output visual tokens \(N_v\).
- Mechanism: Two strategic axes. (a) Architectural optimization: efficient visual encoders (FastViT structural re-parameterization, EfficientViT distillation) and efficient modality adapters (ranging from simple MLPs to token-compressing adapters such as Q-Former). (b) Input reduction: keyframe selection (for video scenarios), adaptive resolution (adjusted to content complexity), and encoder-side token compression. Reducing \(N_v\) yields cascading benefits: prefill attention cost, which scales as \(O((N_v+N_t)^2)\) for \(N_t\) text tokens, falls quadratically, and KV cache size decreases linearly.
- Design Motivation: Encoding is a compute-bound stage (\(\tau_{\text{ENC}} \approx \text{FLOPs}/\pi_{\text{peak}}\), where \(\pi_{\text{peak}}\) is peak compute throughput); although its per-request cost is fixed, reducing \(N_v\) provides multiplicative downstream gains.
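
The cascading effect of a smaller \(N_v\) can be checked with simple arithmetic. The sketch below assumes attention cost proportional to \((N_v+N_t)^2\) and KV cache size proportional to \(N_v+N_t\); the token counts and the 25% keep ratio are hypothetical inputs, and the linear-layer FLOPs (which scale only linearly) are deliberately left out.

```python
def downstream_savings(n_v: int, n_t: int, keep_ratio: float) -> dict:
    """Compare prefill attention cost and KV cache size before and after
    keeping only a fraction of the visual tokens."""
    n_v_new = int(n_v * keep_ratio)
    attn_before = (n_v + n_t) ** 2      # quadratic in total sequence length
    attn_after = (n_v_new + n_t) ** 2
    kv_before = n_v + n_t               # linear in total sequence length
    kv_after = n_v_new + n_t
    return {
        "visual_tokens": n_v_new,
        "attention_cost_ratio": attn_after / attn_before,
        "kv_cache_ratio": kv_after / kv_before,
    }


# Hypothetical request: 4,000 visual tokens, 100 text tokens, keep 25%.
result = downstream_savings(n_v=4000, n_t=100, keep_ratio=0.25)
print(result)
```

With these inputs the KV cache shrinks to about 27% of its original size while the attention cost falls to about 7%, which is the asymmetry the quadratic term produces.
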

Prefilling Stage Optimization (Mixed-Bound):
- Function: Alleviate the quadratic attention computation and the large memory writes incurred by KV cache construction.
- Mechanism: (a) Token compression: attention-guided pruning (FastV, SparseVLM), similarity-driven merging (ToMe), and learned abstraction (Q-Former); (b) Sparse attention: window attention, sparse patterns, and linear attention approximations. Latency is governed by whichever resource binds: \(\tau_{\text{PFL}} \approx \max(\text{FLOPs}_{\text{attn}}/\pi_{\text{peak}}, |\mathcal{KV}|_{\text{PFL}}/\beta_{\text{mem}})\), where \(\beta_{\text{mem}}\) is memory bandwidth.
- Design Motivation: Large \(N_v\) exposes prefilling to simultaneous compute and memory pressure. Unlike pure-text prefilling, visual token dominance can push this stage against the memory wall.
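
Attention-guided pruning can be illustrated with a simplified top-k selection. The sketch below is written in the spirit of such methods and is not the actual algorithm of FastV or SparseVLM: it assumes each visual token already has a single importance score (for example, the average attention it receives), and the function name and sample scores are hypothetical.

```python
from __future__ import annotations


def prune_visual_tokens(scores: list[float], keep_ratio: float) -> list[int]:
    """Return the indices of the visual tokens to keep, in original order.

    scores[i] is an assumed per-token importance, e.g. mean attention received.
    """
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    k = max(1, round(len(scores) * keep_ratio))
    # Rank token indices from most to least important.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    # Restore positional order so the kept tokens stay spatially consistent.
    return sorted(ranked[:k])


# Hypothetical attention mass received by eight visual tokens.
scores = [0.02, 0.30, 0.01, 0.25, 0.05, 0.20, 0.03, 0.14]
kept = prune_visual_tokens(scores, keep_ratio=0.5)
print(kept)
```

Re-sorting the surviving indices keeps the tokens in their original sequence order, which matters when positional information is tied to that order.
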

Decoding Stage Optimization (Memory-Bound):
- Function: Overcome the visual memory wall—static visual KV caches must be loaded from HBM to SRAM at every generation step.
- Mechanism: (a) KV cache optimization: cache eviction (identifying and discarding unimportant visual KV entries), quantization (compressing KV cache storage), and merging (reducing the number of KV entries); (b) Speculative decoding: a smaller draft model proposes multiple tokens that are verified in parallel by the larger target model; (c) Efficient reasoning (e.g., chain-of-thought optimization). Per-step latency is \(\tau_{\text{DEC}}^{(i)} \approx (|\psi| + |\mathcal{KV}|_i) / \beta_{\text{mem}}\), where \(|\psi|\) is the size of the model weights and \(|\mathcal{KV}|_i\) is the KV cache at step \(i\). The visual portion, \(|\mathcal{KV}|_v \propto N_v \cdot L \cdot D_{\mathcal{L}}\), is reloaded at every generation step.
- Design Motivation: Decoding is strictly memory-bound (arithmetic intensity far below unity), and the visual KV cache is static—once generated it is never updated, yet it is loaded at every step, causing significant bandwidth waste.
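
The per-step decode estimate can be turned into a small calculator to see what KV cache eviction buys. This is a sketch under stated assumptions: the weight size, visual KV size, and bandwidth below are hypothetical round numbers, and the model ignores compute time entirely, as the memory-bound approximation does.

```python
def decode_step_latency(weight_bytes: float, kv_bytes: float,
                        mem_bandwidth: float) -> float:
    """Approximate one decode step as the time to stream weights plus KV cache."""
    return (weight_bytes + kv_bytes) / mem_bandwidth


# Hypothetical numbers, chosen only for illustration.
WEIGHT_BYTES = 16e9     # e.g. an 8B-parameter model in 16-bit precision
VISUAL_KV_BYTES = 4e9   # static visual KV cache
BANDWIDTH = 2e12        # 2 TB/s memory bandwidth

baseline = decode_step_latency(WEIGHT_BYTES, VISUAL_KV_BYTES, BANDWIDTH)

# Evicting half of the visual KV entries halves only the KV term.
evicted = decode_step_latency(WEIGHT_BYTES, 0.5 * VISUAL_KV_BYTES, BANDWIDTH)

saving = 1.0 - evicted / baseline
print(baseline, evicted, saving)
```

In this example, halving the visual KV cache trims each step by about 10%, because the weights still dominate the traffic; the gain grows as the visual cache becomes a larger share of the bytes moved per step.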

Loss & Training

As a survey, the paper introduces no training procedure. It instead outlines four frontier directions: (1) mixed compression guided by functional-unit sensitivity; (2) modality-aware decoding with relaxed verification; (3) progressive state management for streaming continuity; and (4) hardware–algorithm co-design for stage-decoupled serving.

Key Experimental Results

Main Results (Efficiency Analysis)

| Inference Stage | Bottleneck Type | Arithmetic Intensity | Primary Optimization Directions |
|---|---|---|---|
| Encoding | Compute-bound | High (>>1) | Efficient encoders, reducing patch count |
| Prefilling | Mixed-bound | Medium | Token compression, sparse attention |
| Decoding | Memory-bound | Low (<<1) | KV cache optimization, speculative decoding |

Ablation Study (Quantitative Analysis Examples)

| Scenario | Visual Token Count | KV Cache Size | Notes |
|---|---|---|---|
| Qwen2.5-VL-72B processing 20 images | >40K | >13 GB | Severe memory pressure |
| 5-second 720p video | >50K | >16 GB | Visual memory wall |
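
The KV cache figures in this table can be sanity-checked with a back-of-the-envelope calculation. The sketch below assumes a 72B-class backbone with 80 layers, 8 key-value heads (grouped-query attention), a head dimension of 128, and 16-bit cache entries; these configuration values are assumptions made here and are not stated in the summary.

```python
def kv_cache_bytes(num_tokens: int, num_layers: int, num_kv_heads: int,
                   head_dim: int, bytes_per_elem: int = 2) -> int:
    """Size of the KV cache: keys and values for every token in every layer."""
    return 2 * num_tokens * num_layers * num_kv_heads * head_dim * bytes_per_elem


# Assumed 72B-class configuration (not given in the source).
LAYERS, KV_HEADS, HEAD_DIM = 80, 8, 128

per_token = kv_cache_bytes(1, LAYERS, KV_HEADS, HEAD_DIM)
size_40k_gb = kv_cache_bytes(40_000, LAYERS, KV_HEADS, HEAD_DIM) / 1e9
size_50k_gb = kv_cache_bytes(50_000, LAYERS, KV_HEADS, HEAD_DIM) / 1e9

print(per_token, size_40k_gb, size_50k_gb)
```

Under these assumptions each token costs 327,680 bytes, so 40K tokens come to roughly 13.1 GB and 50K tokens to roughly 16.4 GB, in line with the ">13 GB" and ">16 GB" entries above.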

Key Findings

- Visual token dominance is the fundamental efficiency bottleneck of LVLMs and distinguishes them from text-only LLMs.
- Reducing \(N_v\) at the encoding stage yields cascading benefits: a quadratic reduction in prefill attention cost and a linear reduction in KV cache size.
- Single-stage optimization may shift rather than eliminate bottlenecks, so an end-to-end perspective is required.
- The visual memory wall at the decoding stage is the most overlooked yet most impactful bottleneck.

Highlights & Insights

- Systematic three-stage bottleneck analysis: The Roofline model is used to formalize the bottleneck type (compute- vs. memory-bound) of each stage, providing theoretical guidance for selecting appropriate optimization techniques and avoiding ad hoc trial-and-error.
- Quantification of cascading gains: The multiplicative downstream benefits of reducing \(N_v\) at the encoding stage are explicitly established, offering a principled basis for optimization prioritization.
- Formalization of the visual memory wall: This concept is introduced and formally characterized, identifying the repeated loading of static visual KV caches during decoding as a bandwidth-waste problem unique to LVLMs.

Limitations & Future Work

- As a survey, the paper does not propose new methods or provide unified empirical comparisons.
- The four frontier directions are discussed conceptually, without experimental validation to support them.
- The scope is limited to inference efficiency; training-side factors (e.g., how parameter-efficient fine-tuning affects inference) are not addressed.
- Multi-device and distributed inference are discussed only briefly.

- vs. Prior surveys (Shao et al. 2025b): Prior surveys focus on token compression techniques; this paper provides a full-pipeline perspective.
- vs. LLM efficiency surveys: LLM efficiency research does not address the unique challenge of visual token dominance.
- vs. Individual technique papers: This paper reveals the interdependencies and combinatorial effects among individual techniques.

Rating

- Novelty: ⭐⭐⭐⭐ The stage-aware taxonomy and the formalization of the visual memory wall are valuable contributions.
- Experimental Thoroughness: ⭐⭐⭐ Preliminary quantitative analysis is provided, but large-scale unified comparisons are absent.
- Writing Quality: ⭐⭐⭐⭐⭐ Well-organized, analytically rigorous, with excellent figure and table design.
- Value: ⭐⭐⭐⭐⭐ Provides a systematic conceptual framework for LVLM inference efficiency optimization.