Efficient Inference for Large Vision-Language Models: Bottlenecks, Techniques, and Prospects
Conference: ACL 2026 Findings
arXiv: 2604.05546
Code: https://github.com/SuDIS-ZJU/Efficient-LVLMs-Inference
Area: Multimodal VLM / LLM Efficiency
Keywords: Large Vision-Language Models, Inference Efficiency, Vision Token Dominance, KV Cache, Token Compression
TL;DR
This survey proposes a systematic taxonomy of LVLM inference efficiency, analyzing bottlenecks along the encoding-prefilling-decoding pipeline. It identifies "vision token dominance" as a system-wide efficiency barrier and maps the available techniques onto three axes: information density shaping, long-context attention management, and memory bandwidth breakthroughs.
Background & Motivation
Background: Large Vision-Language Models (e.g., Qwen2.5-VL-72B) have become essential infrastructure for complex multimodal reasoning, capable of processing high-resolution images and long videos. However, as model scales and input resolutions grow, inference efficiency has become the core bottleneck for deployment.
Limitations of Prior Work: Visual inputs produce far more tokens than text (typically 576–4000+, well above the length of a typical text prompt), a phenomenon the paper calls "vision token dominance". This inflates the quadratic cost of attention and also creates a "Visual Memory Wall", in which the static visual KV cache consumes a large share of memory bandwidth. Existing surveys focus on isolated optimization techniques (e.g., token compression or specific modality-efficient architectures) and overlook how the stages of the inference pipeline interact.
Key Challenge: LVLM inference is not a single workload but a dynamic pipeline spanning three different hardware regimes. Optimizing one stage in isolation often shifts the bottleneck elsewhere, failing to improve end-to-end latency. Upstream decisions (e.g., encoder resolution) directly determine downstream bottlenecks (e.g., decoding bandwidth), yet existing literature lacks this global perspective.
Goal: To construct a unified, phase-aware taxonomy for efficient LVLM inference and analyze the physical nature of bottlenecks at each stage along with the combined effects of optimization techniques.
Key Insight: Apply the Roofline model, from a "computational physics" perspective, to classify the bottleneck of each stage: encoding is compute-bound (high arithmetic intensity), prefilling is hybrid-bound, and decoding is memory-bound (low arithmetic intensity).
Core Idea: Decouple efficiency optimization into three axes: information density shaping (encoding), long-context attention management (prefilling), and memory bandwidth breakthroughs (decoding), analyzing how isolated optimizations combine to trade off visual fidelity and system efficiency.
Method
Overall Architecture
The survey is organized around the three-stage inference pipeline of LVLMs: (1) Encoding stage—visual encoders extract patch embeddings, and modality adapters align them to the LLM space, producing \(N_v\) vision tokens; (2) Prefilling stage—processing the concatenated vision + text context to generate initial KV caches; (3) Decoding stage—autoregressively generating output tokens, loading model weights and accumulated KV caches at each step.
Key Designs
The paper uses a single metric—the arithmetic intensity of the Roofline model—to link the three stages into a causal chain: upstream encoding determines how many vision tokens the downstream must carry, making isolated optimization insufficient.
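To make the arithmetic-intensity argument concrete, the following is a minimal Python sketch of a Roofline-style classification. The function names, the hardware figures (`PEAK_FLOPS`, `MEM_BANDWIDTH`), and the workload numbers are illustrative assumptions, not values from the paper. The sketch distinguishes only two regimes, so a "hybrid-bound" stage would show up as a workload sitting close to the ridge point.

```python
def arithmetic_intensity(flops: float, bytes_moved: float) -> float:
    """FLOPs performed per byte moved to or from memory."""
    return flops / bytes_moved


def roofline_regime(flops: float, bytes_moved: float,
                    peak_flops: float, mem_bandwidth: float) -> str:
    """Classify a workload by comparing its intensity with the ridge point."""
    # Ridge point: the intensity (FLOPs/byte) at which the compute roof
    # and the memory-bandwidth roof meet on this hardware.
    ridge = peak_flops / mem_bandwidth
    intensity = arithmetic_intensity(flops, bytes_moved)
    return "compute-bound" if intensity >= ridge else "memory-bound"


# Hypothetical accelerator (assumed figures, not from the paper).
PEAK_FLOPS = 1e15      # FLOP/s
MEM_BANDWIDTH = 3e12   # bytes/s

# Made-up workloads: one encoder-like, one decode-like.
print(roofline_regime(4e12, 2e9, PEAK_FLOPS, MEM_BANDWIDTH))
print(roofline_regime(1.4e11, 1.5e11, PEAK_FLOPS, MEM_BANDWIDTH))
```

The comparison against the ridge point, rather than against a fixed threshold, is what ties the classification to a specific device.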
1. Encoding Stage: Reducing the vision token count \(N_v\) early under compute-bound conditions
Latency in the encoding stage is approximated as \(\tau_{\text{ENC}} \approx \text{FLOPs}/\pi_{\text{peak}}\), where \(\pi_{\text{peak}}\) is peak compute throughput, so the stage is typically compute-bound (high arithmetic intensity). Because this cost is paid once per request, squeezing efficiency here might seem low-yield. The paper argues, however, that the real leverage lies not in shortening \(\tau_{\text{ENC}}\) itself but in reducing the output vision token count \(N_v\), because \(N_v\) propagates downstream: prefilling attention complexity is \(O((N_v+N_t)^2)\), where \(N_t\) is the text token count, and KV cache size grows linearly with \(N_v\). Cutting one vision token therefore saves resources across all three stages simultaneously.
Techniques follow two axes: Architectural (efficient encoders like FastViT's structural reparameterization or EfficientViT's distillation, and compact adapters like Q-Former) and Input-level (keyframe selection for video, adaptive resolution based on content complexity, and direct token compression on the encoder side). Reducing tokens here yields the highest return due to the multiplicative cascading benefits at the top of the pipeline.
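The cascading effect of reducing \(N_v\) can be illustrated with a short Python sketch. It uses only the scaling relations stated above (attention cost proportional to \((N_v+N_t)^2\), KV cache size proportional to \(N_v+N_t\)). The function names and the token counts in the example are hypothetical.

```python
def prefill_attention_cost(n_vision: int, n_text: int) -> int:
    """Relative attention cost, proportional to (N_v + N_t)^2."""
    return (n_vision + n_text) ** 2


def kv_cache_entries(n_vision: int, n_text: int) -> int:
    """Relative KV cache size, proportional to N_v + N_t."""
    return n_vision + n_text


def savings(n_vision: int, n_text: int, keep_ratio: float) -> dict:
    """Fractional savings downstream when only keep_ratio of N_v survives."""
    kept = int(n_vision * keep_ratio)
    attn_saving = 1 - (prefill_attention_cost(kept, n_text)
                       / prefill_attention_cost(n_vision, n_text))
    kv_saving = 1 - (kv_cache_entries(kept, n_text)
                     / kv_cache_entries(n_vision, n_text))
    return {"kept_vision_tokens": kept,
            "attention_saving": attn_saving,
            "kv_saving": kv_saving}


# Example with assumed counts: 4000 vision tokens, 100 text tokens, keep half.
result = savings(n_vision=4000, n_text=100, keep_ratio=0.5)
print(result)
```

With these assumed counts, halving the vision tokens removes roughly 74% of the relative attention cost but only about 49% of the KV cache entries, which shows why the quadratic and linear terms respond differently to the same reduction.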
2. Prefilling Stage: Balancing quadratic attention and massive KV writes in a hybrid-bound regime
When \(N_v\) is large, prefilling is squeezed from two sides: quadratic attention computation and a one-time, massive KV cache write. Latency is set by the slower of the two: \(\tau_{\text{PFL}} \approx \max(\text{FLOPs}_{\text{attn}}/\pi_{\text{peak}},\, |\mathcal{KV}|_{\text{PFL}}/\beta_{\text{mem}})\), where \(\beta_{\text{mem}}\) is memory bandwidth. This is where LVLMs diverge from text-only LLMs: vision token dominance can push prefilling, which is normally compute-bound, into the memory-bound regime.
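The prefilling latency estimate can be written as a few lines of Python. This is a sketch of the \(\max\) formula only; the function name and every numeric input are assumptions chosen for illustration.

```python
def prefill_latency(attn_flops: float, kv_bytes: float,
                    peak_flops: float, mem_bandwidth: float):
    """Return (latency_seconds, limiter) for the max-of-two-terms estimate."""
    compute_time = attn_flops / peak_flops     # FLOPs_attn / pi_peak
    memory_time = kv_bytes / mem_bandwidth     # |KV|_PFL / beta_mem
    latency = max(compute_time, memory_time)
    limiter = "compute" if compute_time >= memory_time else "memory"
    return latency, limiter


# Assumed hardware and workload figures (not taken from the paper).
PEAK_FLOPS = 1e15
MEM_BANDWIDTH = 3e12

baseline = prefill_latency(2e14, 1.3e10, PEAK_FLOPS, MEM_BANDWIDTH)
# Halving only the KV write leaves the estimate unchanged,
# because the compute term still dominates.
smaller_kv = prefill_latency(2e14, 6.5e9, PEAK_FLOPS, MEM_BANDWIDTH)
print(baseline, smaller_kv)
```

Returning the limiter alongside the latency makes it easy to see which term an optimization would need to target.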
Two categories of methods target these two sides: reducing computation via sparse attention (window attention, sparse patterns, linear attention approximations), and reducing token volume via compression, including attention-guided pruning (FastV, SparseVLM), similarity-driven merging (ToMe), and learned abstractions (Q-Former). Because the latency is a \(\max\) rather than a sum, shrinking only the smaller term leaves latency unchanged, and shrinking only the larger term helps only until the other term takes over.
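As an illustration of the attention-guided pruning idea, here is a generic top-k sketch in plain Python. It is not the exact procedure of FastV or SparseVLM: the function name, the source of the scores, and the `keep_ratio` parameter are all assumptions made for this example.

```python
def prune_vision_tokens(tokens, attention_scores, keep_ratio):
    """Keep the highest-scoring vision tokens, preserving their original order.

    tokens:           list of vision token embeddings (any objects)
    attention_scores: one importance score per token, e.g. the attention
                      a token receives from text tokens (assumed given)
    keep_ratio:       fraction of tokens to keep, in (0, 1]
    """
    if len(tokens) != len(attention_scores):
        raise ValueError("tokens and attention_scores must have equal length")
    n_keep = max(1, int(len(tokens) * keep_ratio))
    # Rank token indices from most to least attended.
    ranked = sorted(range(len(tokens)),
                    key=lambda i: attention_scores[i], reverse=True)
    # Restore positional order so the spatial sequence stays intact.
    keep = sorted(ranked[:n_keep])
    return [tokens[i] for i in keep]


tokens = ["v0", "v1", "v2", "v3", "v4", "v5"]
scores = [0.05, 0.30, 0.02, 0.25, 0.08, 0.28]
print(prune_vision_tokens(tokens, scores, keep_ratio=0.5))
```

Sorting the kept indices back into positional order is a design choice in this sketch; it keeps position-dependent components working on the shortened sequence.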
3. Decoding Stage: Breaking the "Visual Memory Wall"—preventing static vision KV from reloading every step
Decoding is the most overlooked stage, yet a critical one. It is strictly memory-bound (arithmetic intensity \(\ll 1\)), with step latency approximated as \(\tau_{\text{DEC}}^{(i)} \approx (|\psi| + |\mathcal{KV}|_i)/\beta_{\text{mem}}\), where \(|\psi|\) is the size of the model weights and \(|\mathcal{KV}|_i\) is the KV cache size at step \(i\). The vision part of the cache scales as \(|\mathcal{KV}|_v \propto N_v \cdot L \cdot D_{\mathcal{L}}\), with \(L\) and \(D_{\mathcal{L}}\) presumably the LLM's layer count and hidden dimension. Crucially, this visual KV is static: once generated during prefilling it never changes, yet it must be reloaded from HBM to SRAM at every generation step. The paper defines this repeated transfer as the "Visual Memory Wall": pure bandwidth waste.
Strategies to break this wall follow three paths: (1) Shrinking the KV cache directly (cache eviction of unimportant entries, quantization, or merging); (2) Speculative decoding (using small models to draft tokens for parallel verification by large models to amortize memory movement); (3) Optimizing generated content (e.g., Chain-of-Thought optimization to reduce invalid decoding steps). The shared logic is to reduce the total bytes moved from memory during generation, either by moving fewer bytes per step or by taking fewer steps.
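The per-step decoding estimate and the effect of shrinking the KV cache (path 1) can be sketched in Python as follows. The function name and all byte counts are assumptions for illustration; in particular, the weight size assumes a 72B-parameter model stored at 2 bytes per parameter on a single hypothetical device.

```python
def decode_step_latency(weight_bytes: float, kv_bytes: float,
                        mem_bandwidth: float) -> float:
    """Memory-bound estimate: (|psi| + |KV|_i) / beta_mem, in seconds."""
    return (weight_bytes + kv_bytes) / mem_bandwidth


# Assumed figures (not measurements).
WEIGHT_BYTES = 144e9    # 72e9 parameters * 2 bytes, an assumption
VISION_KV_BYTES = 13e9  # order of magnitude of a large static vision KV cache
MEM_BANDWIDTH = 3e12    # bytes/s on a hypothetical accelerator

baseline = decode_step_latency(WEIGHT_BYTES, VISION_KV_BYTES, MEM_BANDWIDTH)
# Path (1): halve the KV cache, e.g. by quantizing it to half the bit width.
halved = decode_step_latency(WEIGHT_BYTES, VISION_KV_BYTES / 2, MEM_BANDWIDTH)

print(f"baseline step: {baseline * 1e3:.2f} ms")
print(f"halved KV:     {halved * 1e3:.2f} ms")
```

The estimate ignores compute time and any overlap between transfers, which is consistent with treating decoding as purely bandwidth-limited.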
Loss & Training
As a survey, the paper does not propose a specific training method. Instead it outlines four frontier directions: (1) Mixed compression based on functional unit sensitivity; (2) Modality-aware decoding with relaxed verification; (3) Progressive state management for streaming continuity; (4) Hardware-algorithm co-design for phase-decoupled services.
Key Experimental Results
Main Results (Efficiency Analysis)
| Inference Phase | Bottleneck Type | Arithmetic Intensity | Primary Optimization Direction |
|---|---|---|---|
| Encoding | Compute-bound | High (>>1) | Efficient encoders, reduced patch count |
| Prefilling | Hybrid-bound | Medium | Token compression, sparse attention |
| Decoding | Memory-bound | Low (<<1) | KV cache optimization, speculative decoding |
Ablation Study (Quantitative Analysis Examples)
| Scenario | Vision Token Count | KV Cache Size | Description |
|---|---|---|---|
| Qwen2.5-VL-72B (20 images) | >40K | >13GB | Severe memory pressure |
| 5s 720p Video | >50K | >16GB | Visual Memory Wall |
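The KV cache sizes in the table can be sanity-checked with a back-of-the-envelope calculation. The sketch below assumes a Qwen2.5-VL-72B-like language model with 80 layers, 8 key-value heads (grouped-query attention), a head dimension of 128, and 16-bit cache values. None of these figures is stated in this summary, so they should be verified against the model's published configuration before relying on the result.

```python
def kv_cache_bytes(n_tokens: int, n_layers: int, n_kv_heads: int,
                   head_dim: int, bytes_per_value: int = 2) -> int:
    """Bytes needed to cache keys and values for n_tokens across all layers."""
    # Leading factor 2: one key tensor and one value tensor per layer.
    return 2 * n_tokens * n_layers * n_kv_heads * head_dim * bytes_per_value


# Assumed architecture (not given in the source; check the model config).
ASSUMED_CONFIG = dict(n_layers=80, n_kv_heads=8, head_dim=128)

for n_tokens in (40_000, 50_000):
    size_gb = kv_cache_bytes(n_tokens, **ASSUMED_CONFIG) / 1e9
    print(f"{n_tokens:,} tokens -> {size_gb:.1f} GB")
```

Under these assumptions the results come to about 13.1 GB and 16.4 GB, in line with the lower bounds reported in the table.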
Key Findings
- Vision token dominance is the fundamental efficiency bottleneck for LVLMs, distinct from LLM efficiency issues.
- Reducing \(N_v\) in the encoding stage yields cascading benefits (reduced prefilling complexity + linear KV cache reduction).
- Single-stage optimization may shift bottlenecks rather than eliminate them—an end-to-end optimization perspective is required.
- The "Visual Memory Wall" in the decoding stage is the most neglected of the high-impact bottlenecks.
Highlights & Insights
- Systemic Three-Phase Analysis: Formalizes bottleneck types (compute/memory-bound) for each stage using the Roofline model, providing theoretical guidance for selecting optimization techniques and avoiding blind trial-and-error.
- Quantification of Cascade Gains: Explicitly points out the multiplicative downstream benefits of reducing \(N_v\) at the encoding stage, providing a basis for prioritizing optimizations.
- Visual Memory Wall Concept: Proposes and formalizes this concept, highlighting that the bandwidth waste caused by repeatedly loading static visual KV caches during decoding is a unique challenge for LVLMs.
Limitations & Future Work
- As a survey, it proposes no new methods and offers no unified experimental comparison.
- The four frontier directions remain largely conceptual and lack experimental validation.
- It focuses on inference efficiency and excludes training efficiency (e.g., the inference impact of parameter-efficient fine-tuning).
- The discussion of multi-device and distributed inference is shallow.
Related Work & Insights
- vs. Prior Surveys (Shao et al. 2025b): Previous surveys focused on token compression; this paper provides a full-pipeline perspective.
- vs. LLM Efficiency Surveys: LLM research does not address the unique challenge of vision token dominance.
- vs. Specific Technical Papers: This paper reveals the interactions and combined effects among individual techniques.
Rating
- Novelty: ⭐⭐⭐⭐ The phase-aware taxonomy and Visual Memory Wall concept are valuable contributions.
- Experimental Thoroughness: ⭐⭐⭐ Provides an initial experimental analysis but lacks large-scale unified comparisons.
- Writing Quality: ⭐⭐⭐⭐⭐ Well organized, with deep analysis and excellent chart design.
- Value: ⭐⭐⭐⭐⭐ Provides a systematic framework for LVLM efficiency optimization.