⚡ VLM Efficiency

🧪 ICML2025 · 3 paper notes

🔥 Top topics: Multimodal/VLM ×2

CoreMatching: A Co-adaptive Sparse Inference Framework with Token and Neuron Pruning for Comprehensive Acceleration of Vision-Language Models: The paper identifies an intrinsic correlation between token sparsity and neuron sparsity in VLMs: core neurons and core tokens mutually determine and reinforce each other. The authors describe this as the first work to reveal that link. Building on it, they propose CoreMatching, a co-adaptive sparse inference framework that accelerates both the pre-filling and decoding stages, with a reported 5× reduction in FLOPs and a 10× overall speedup.
MMInference: Accelerating Pre-filling for Long-Context VLMs via Modality-Aware Permutation Sparse Attention: MMInference accelerates the pre-filling stage of long-context VLMs by up to \(8.3\times\) at a context length of \(1\text{M}\) tokens while maintaining task accuracy, and it requires neither weight modification nor fine-tuning. It combines four components: modality-aware permutation sparse attention, head-level offline pattern search, online dynamic indexing, and customized GPU kernels.
SparseVLM: Visual Token Sparsification for Efficient Vision-Language Model Inference: SparseVLM is presented as the first training-free, text-guided visual token sparsification framework. It selects vision-related text tokens as "raters" that score the importance of each visual token, and combines this scoring with an adaptive pruning ratio and a token recycling mechanism. On LLaVA it preserves 99.1% of the original performance while retaining only 192 visual tokens (a 66.7% reduction).
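
To make the "rater" idea in the SparseVLM note concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the function name `prune_visual_tokens`, the toy vectors, and the fixed `keep` budget are all assumptions for illustration. It only shows text tokens scoring visual tokens through softmax attention and keeping the top scorers; the adaptive pruning ratio and token recycling mentioned above are left out. The last lines check the 66.7% figure, assuming the 576 visual tokens that LLaVA-1.5 produces per image.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def prune_visual_tokens(text_tokens, visual_tokens, keep):
    """Hypothetical helper: rank visual tokens by the mean attention they
    receive from the text 'rater' tokens and keep the top `keep` of them.

    Returns the kept indices (in original order) and the per-token scores.
    """
    scale = math.sqrt(len(visual_tokens[0]))
    scores = [0.0] * len(visual_tokens)
    for rater in text_tokens:
        # Each rater distributes one unit of attention over the visual tokens.
        attn = softmax([dot(rater, v) / scale for v in visual_tokens])
        for i, a in enumerate(attn):
            scores[i] += a / len(text_tokens)
    ranked = sorted(range(len(visual_tokens)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:keep]), scores


# Toy data (assumed): 2 text raters and 6 visual tokens in a 2-d embedding space.
text = [[1.0, 0.0], [0.9, 0.1]]
visual = [[2.0, 0.0], [0.0, 2.0], [1.5, 0.0], [-1.0, 0.0], [0.0, -1.0], [1.0, 1.0]]

kept, scores = prune_visual_tokens(text, visual, keep=2)
print(kept)  # [0, 2]

# Arithmetic behind the reported reduction: 192 kept out of an assumed 576.
reduction_pct = round((1 - 192 / 576) * 100, 1)
print(reduction_pct)  # 66.7
```

Visual tokens 0 and 2 survive because they align most closely with both raters. In a real model the scores would presumably come from attention already computed inside a decoder layer rather than from a separate pass like this one.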