⚡ VLM Efficiency

💬 ACL2025 · 8 paper notes

🔥 Top topics: Multimodal/VLM ×7 · Model Compression ×2 · LLM ×2

EffiVLM-Bench: A Comprehensive Benchmark for Evaluating Training-Free Acceleration in Large Vision-Language Models: Proposes EffiVLM-Bench, a unified framework for systematically evaluating training-free acceleration methods (token compression and parameter compression) for LVLMs along four dimensions: performance, generalization, faithfulness, and efficiency. Spanning 3 cutting-edge models and 17 benchmark tasks, it reveals the Pareto-optimal trade-offs of the methods at different compression rates.
Hierarchical Safety Realignment: Lightweight Restoration of Safety in Pruned Large Vision-Language Models: This paper proposes Hierarchical Safety Realignment (HSR), a method that first identifies safety-critical attention heads and then locates and restores safety-critical pruned neurons within these heads. With minimal parameter overhead (on the order of ten-thousandths), HSR significantly recovers the safety performance lost in pruned LVLMs.
HotelMatch-LLM: Joint Multi-Task Training of Small and Large Language Models for Efficient Multimodal Hotel Retrieval: This paper proposes HotelMatch-LLM, an asymmetric architecture that employs an SLM to encode queries and an LLM to encode hotel documents. Combined with a tri-objective multi-task optimization (retrieval alignment + MLM geographic prediction + visual facility recognition) and patch-level mean pooling for multi-image processing, it significantly outperforms SOTA methods like MARVEL and VISTA on travel-domain multimodal retrieval tasks.
MadaKV: Adaptive Modality-Perception KV Cache Eviction for Efficient Multimodal Long-Context Inference: This paper proposes MadaKV, a modality-aware KV cache eviction strategy. With two core components, Modality Preference Adaptation (MPA) and Hierarchical Compression Compensation (HCC), MadaKV cuts KV cache memory consumption by 80-95% and speeds up decoding by 1.3x to 1.5x while maintaining performance on multimodal long-context tasks.
OMGM: Orchestrate Multiple Granularities and Modalities for Efficient Multimodal Retrieval: The authors propose OMGM, a multimodal RAG system for knowledge-based visual question answering (KB-VQA). By orchestrating the matching between the query and the knowledge base across multiple granularities and modalities via a coarse-to-fine three-step retrieval strategy, OMGM achieves state-of-the-art retrieval performance and highly competitive VQA results on the InfoSeek and E-VQA datasets.
RedundancyLens: Revealing and Exploiting Visual Token Processing Redundancy for Efficient Decoder-Only MLLMs: This work proposes the RedundancyLens framework to systematically reveal the extensive structured and clustered redundancy in self-attention and FFN operations for visual tokens within decoder-only MLLMs. Leveraging this finding, training-free inference acceleration is achieved, which is orthogonal to and combinable with existing token compression methods.
Sharper and Faster mean Better: Towards More Efficient Vision-Language Model for Hour-scale Long Video Understanding: Proposes the Sophia model for hour-scale long videos. It accurately selects query-relevant frames via Shot-adaptive Frame Pruning (a two-stage frame pruning based on shot segmentation) and replaces full attention with Hierarchical Attention of \(O(N)\) complexity. It achieves state-of-the-art (SOTA) performance on 6 of 8 long-video benchmarks while requiring only 1/8.5 of the attention FLOPs of InternVL2.
Token Pruning in Multimodal Large Language Models: Are We Solving the Right Problem?: Large-scale benchmark experiments reveal several fundamental issues with current visual token pruning methods for MLLMs: elaborately designed pruning strategies (such as FastV and SparseVLM) underperform even naive methods like random selection and pooling on most benchmarks. This is due to positional bias in attention scores, misuse of language information, imbalance between importance and redundancy, and unreliable evaluation metrics.
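
The two naive baselines named in the last note, random selection and pooling, are simple enough to sketch directly. The following is a minimal illustration only, not the paper's implementation: the list-of-floats token representation, the function names `random_prune` and `pool_prune`, and the 1D pooling window are all assumptions made for this example.

```python
import random
from typing import List

Token = List[float]  # one visual token embedding (hypothetical toy representation)


def random_prune(tokens: List[Token], keep: int, seed: int = 0) -> List[Token]:
    """Keep `keep` tokens chosen uniformly at random, preserving their original order."""
    if not 0 < keep <= len(tokens):
        raise ValueError("keep must be in 1..len(tokens)")
    rng = random.Random(seed)
    kept_idx = sorted(rng.sample(range(len(tokens)), keep))
    return [tokens[i] for i in kept_idx]


def pool_prune(tokens: List[Token], window: int) -> List[Token]:
    """Average each run of `window` consecutive tokens into one token (1D mean pooling)."""
    if window < 1:
        raise ValueError("window must be >= 1")
    pooled = []
    for start in range(0, len(tokens), window):
        chunk = tokens[start:start + window]  # the last chunk may be shorter
        dim = len(chunk[0])
        pooled.append([sum(t[d] for t in chunk) / len(chunk) for d in range(dim)])
    return pooled


if __name__ == "__main__":
    # 8 toy "visual tokens" with 2-dimensional embeddings.
    toks = [[float(i), float(2 * i)] for i in range(8)]
    print(len(random_prune(toks, keep=4)))  # 4
    print(pool_prune(toks, window=4))       # [[1.5, 3.0], [5.5, 11.0]]
```

Neither baseline looks at attention scores or the text prompt, which is what makes them a useful reference point when judging attention-based pruning strategies.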