⚡ VLM Efficiency

📹 ICCV2025 · 11 paper notes

🔥 Top topics: Multimodal/VLM ×10 · Compression ×3 · Model Compression ×2 · LLM ×2

AirCache: Activating Inter-Modal Relevancy KV Cache Compression for Efficient Large Vision-Language Model Inference: This paper proposes AirCache, a KV cache compression method for LVLMs. An Elite Observation Window uses text self-attention to select the critical text tokens against which visual token importance is evaluated, and an adaptive layer-wise budget allocation sets each layer's budget from the intensity and skewness of its importance score distribution. With only 10% of the visual KV cache retained, performance degradation remains within 1%, while decoding latency is reduced by 29%–66%.
Dynamic-VLM: Simple Dynamic Visual Token Compression for VideoLLM: This paper proposes Dynamic-VLM, which employs a dynamic visual token compressor to flexibly adjust the number of tokens per frame according to video length. Combined with a 2-million-scale high-quality synthetic video QA dataset, the method achieves a 2.7% improvement over LLaVA-OneVision on VideoMME and a 10.7% improvement on MuirBench.
Feather the Throttle: Revisiting Visual Token Pruning for Vision-Language Model Acceleration: This paper identifies a systematic positional bias in early visual token pruning for VLMs: because of RoPE, the pruning criterion tends to retain tokens from the bottom of the image. The proposed FEATHER addresses this with RoPE-free attention scoring, uniform sampling, and multi-stage pruning, achieving over 5× performance improvement on visual grounding tasks compared with the original early-pruning approach.
FOLDER: Accelerating Multi-modal Large Language Models with Enhanced Performance: This paper proposes FOLDER, a plug-and-play visual token compression module. After systematically analyzing three key factors of information loss (reduction impact, propagation effect, and aggregation method), it performs aggressive token merging in the last few layers of the visual encoder, achieving up to 70% token reduction while maintaining or even improving model performance.
Growing a Twig to Accelerate Large Vision-Language Models: This paper proposes TwigVLM, which attaches a lightweight twig module to the early layers of a VLM to enable both twig-guided visual token pruning (TTP, for prefilling acceleration) and self-speculative decoding (SSD, for decoding acceleration). On LLaVA-1.5-7B, TwigVLM retains 96% of the original accuracy after pruning 88.9% of visual tokens and achieves a 154% speedup in long-answer generation, substantially outperforming existing methods in both accuracy and speed.
LLaVA-PruMerge: Adaptive Token Reduction for Efficient Large Multimodal Models: By exploiting the sparsity of attention scores between the CLS token and spatial tokens in the visual encoder, this work adaptively prunes and merges visual tokens, maintaining comparable LMM performance while retaining only 5.5% of visual tokens.
MaTVLM: Hybrid Mamba-Transformer for Efficient Vision-Language Modeling: This paper proposes MaTVLM, which replaces a portion of Transformer layers in a pretrained VLM with Mamba-2 layers and trains the resulting model via single-stage knowledge distillation, achieving 3.6× inference speedup and 27.5% memory reduction while maintaining competitive performance.
METEOR: Multi-Encoder Collaborative Token Pruning for Efficient Vision Language Models: METEOR proposes the first three-stage progressive token pruning framework for multi-encoder MLLMs: at the encoding stage, feature rank is used to allocate sparsity ratios across encoders; at the fusion stage, collaborative pruning eliminates cross-encoder redundancy; at the decoding stage, pruning ratios are adaptively adjusted based on text prompts. The framework reduces visual tokens by 76% with only a 0.3% performance drop.
ShortV: Efficient Multimodal Large Language Models by Freezing Visual Tokens in Ineffective Layers: This work identifies significant layer-level redundancy in MLLMs—most layers contribute minimally to the transformation of visual tokens—and proposes ShortV: freezing visual tokens (skipping their attention and FFN computations) in approximately 60% of layers. On LLaVA-NeXT-13B, this achieves a 50% reduction in FLOPs with negligible performance degradation. The method is training-free and orthogonal to token pruning approaches, allowing them to be combined.
SparseVILA: Decoupling Visual Sparsity for Efficient VLM Inference: This paper proposes SparseVILA—the first VLM inference acceleration framework that decouples visual sparsity between the prefill and decode stages: query-agnostic redundant token pruning during prefill, and query-aware relevant token retrieval during decode. The approach achieves up to 4.0× prefill speedup, 2.5× decode throughput improvement, and 2.6× end-to-end acceleration, while maintaining accuracy in multi-turn conversation settings where existing methods suffer severe degradation due to permanent token deletion.
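
Several of the notes above (for example LLaVA-PruMerge, FEATHER, and TwigVLM) share one basic step: score each visual token, then keep only a small fraction. The sketch below illustrates that step in plain Python. It is not taken from any of the listed papers: the function name `select_visual_tokens`, its parameters, and the example scores are all hypothetical, and the optional strided sampling is only loosely inspired by the uniform sampling mentioned in the FEATHER note.

```python
import math
from typing import List, Optional, Sequence


def select_visual_tokens(
    scores: Sequence[float],
    keep_ratio: float,
    uniform_stride: Optional[int] = None,
) -> List[int]:
    """Return the sorted indices of the visual tokens to keep.

    scores: one importance score per visual token (assumed to come from
        some attention-based criterion; how it is computed is out of scope).
    keep_ratio: fraction of tokens kept by score, in (0, 1].
    uniform_stride: if given, every `uniform_stride`-th token is kept as
        well, so that spatial coverage does not depend on the scores alone.
    """
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    if uniform_stride is not None and uniform_stride < 1:
        raise ValueError("uniform_stride must be a positive integer")

    n = len(scores)
    if n == 0:
        return []

    # Number of tokens kept by score; always keep at least one.
    k = max(1, math.ceil(keep_ratio * n))

    # Rank indices by score, highest first; ties go to the lower index.
    ranked = sorted(range(n), key=lambda i: (-scores[i], i))
    kept = set(ranked[:k])

    # Optionally add a uniformly strided subset of token positions.
    if uniform_stride is not None:
        kept.update(range(0, n, uniform_stride))

    # Sorted output preserves the original token order.
    return sorted(kept)


if __name__ == "__main__":
    example_scores = [0.02, 0.30, 0.01, 0.05, 0.40, 0.03, 0.01, 0.18]
    print(select_visual_tokens(example_scores, keep_ratio=0.25))
    print(select_visual_tokens(example_scores, keep_ratio=0.25, uniform_stride=4))
```

Returning sorted indices keeps the surviving tokens in their original sequence order, which matters when positional encodings are applied after pruning. The methods in the notes differ mainly in where the scores come from and at which layer or stage the selection is applied.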