⚡ VLM Efficiency

💬 ACL2026 · 6 paper notes

📌 Same area in other venues: 📷 CVPR2026 (63) · 🔬 ICLR2026 (18) · 🧪 ICML2026 (4) · 🤖 AAAI2026 (5) · 🧠 NeurIPS2025 (8) · 📹 ICCV2025 (11)

🔥 Top topics: Multimodal/VLM ×2

APB-V: Accelerating Long-Video Understanding via Sequence-Parallelism-aware Approximate Attention: APB-V accelerates long-video LMM inference using sequence-parallelism-aware approximate attention and system-level load balancing. While preserving full visual embeddings, it achieves speedups of 12.72×, 1.70×, and 1.18× compared to FlashAttn, ZigZagRing, and APB, respectively, under a 64-frame 1440p setting without significant performance loss.
From Inheritance to Saturation: Disentangling the Evolution of Visual Redundancy for Architecture-Aware MLLM Inference Acceleration: This work identifies two sources of visual redundancy in MLLM inference: Inherited Visual Redundancy (IVR), caused by dense ViT tokenization, and Secondary Saturation Redundancy (SSR), caused by deep semantic saturation. The two manifest differently across backbone architectures. The proposed HalfV framework handles each type of redundancy separately, achieving a 4.1× FLOPs acceleration on Qwen2.5-VL while preserving 96.8% of the performance.
HERMES: KV Cache as Hierarchical Memory for Efficient Streaming Video Understanding: This paper proposes HERMES, which treats the KV cache as a hierarchical memory (shallow layers = sensory memory, middle layers = working memory, deep layers = long-term memory), based on a mechanistic analysis of the layer-wise attention preferences of MLLM decoders. It provides training-free, efficient streaming video understanding, maintaining or improving accuracy while reducing video tokens by 68%. Time-to-first-token (TTFT) latency is under 30 ms, 10× faster than the previous SOTA.
HiPrune: Hierarchical Attention for Efficient Token Pruning in Vision-Language Models: This paper identifies a hierarchical attention pattern in vision encoders: middle layers focus on primary objects, while deep layers capture global information. Based on this, it proposes HiPrune, a training-free and model-agnostic vision token pruning method. By selecting three types of tokens (Anchor/Buffer/Register) to preserve multi-level visual information, it retains 99.3% of the performance using only 1/3 of the tokens, reducing FLOPs by 58.7%.
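The three-way token selection described for HiPrune can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes that Anchor tokens are the highest-attention tokens in a middle layer, Buffer tokens are grid neighbours of the anchors, and Register tokens are the highest-attention tokens in a deep layer. The function name `hiprune_like_select`, the budget split (`n_anchor`, `n_buffer`), and the 4-neighbour rule are all hypothetical.

```python
def hiprune_like_select(mid_attn, deep_attn, grid_w, budget, n_anchor, n_buffer):
    """Pick `budget` token indices from a row-major patch grid (illustrative only).

    mid_attn / deep_attn: per-token attention scores from a middle and a deep
    encoder layer (assumed inputs). grid_w: number of patches per row.
    """
    n = len(mid_attn)
    assert len(deep_attn) == n and 0 < n_anchor <= budget <= n

    # Anchor tokens: strongest middle-layer attention (assumed: primary objects).
    by_mid = sorted(range(n), key=lambda i: mid_attn[i], reverse=True)
    anchors = by_mid[:n_anchor]
    kept = set(anchors)

    # Buffer tokens: up/down/left/right neighbours of each anchor,
    # capped at n_buffer and at the overall budget (hypothetical rule).
    added = 0
    for a in anchors:
        r, c = divmod(a, grid_w)
        for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            idx = nr * grid_w + nc
            inside = nr >= 0 and 0 <= nc < grid_w and idx < n
            if inside and idx not in kept and added < n_buffer and len(kept) < budget:
                kept.add(idx)
                added += 1

    # Register tokens: fill the remaining budget with the strongest
    # deep-layer attention (assumed: global information).
    for i in sorted(range(n), key=lambda i: deep_attn[i], reverse=True):
        if len(kept) >= budget:
            break
        kept.add(i)

    return sorted(kept)


# Toy 4x4 grid: one salient object at index 5, global cues at indices 15 and 0.
mid = [0.0] * 16
mid[5] = 1.0
deep = [0.0] * 16
deep[15], deep[0] = 0.9, 0.2
selected = hiprune_like_select(mid, deep, grid_w=4, budget=5, n_anchor=1, n_buffer=2)
```

With a budget of 5 out of 16 tokens (roughly the 1/3 ratio quoted above), the sketch keeps the anchor, two of its neighbours, and the two strongest deep-layer tokens.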
MACS: Modality-Aware Capacity Scaling for Efficient Multimodal MoE Inference: To address the "straggler" problem where Multimodal MoE models are bottlenecked by the "slowest expert" during Expert Parallelism (EP) inference, MACS re-estimates expert load using the Shannon entropy of visual tokens as semantic importance weights. It dynamically scales expert capacity based on the real-time modality composition of the batch. MACS is a training-free inference framework that maintains nearly identical performance (averaging 99.7% of vanilla MoE) across 12 multimodal benchmarks, significantly outperforming token-counting methods like CAI-MoE.
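The contrast between token counting and entropy-weighted load estimation can be sketched as follows. This is a hedged illustration, not the MACS algorithm: which distribution the entropy is computed over, the normalisation by the maximum entropy, and the choice to give text tokens a weight of 1 are all assumptions, and the function names are hypothetical.

```python
import math


def shannon_entropy(probs):
    """Shannon entropy in bits: H = -sum(p * log2(p)), skipping zero entries."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def weighted_expert_load(assignments, is_visual, token_probs, num_experts):
    """Estimate per-expert load with entropy-weighted visual tokens.

    assignments: expert index chosen for each token.
    is_visual:   True for visual tokens, False for text tokens.
    token_probs: a probability distribution per token (assumed input; the
                 summary does not say which distribution is used).
    """
    load = [0.0] * num_experts
    for expert, visual, probs in zip(assignments, is_visual, token_probs):
        if visual:
            # Normalise by the maximum possible entropy so weights lie in [0, 1].
            max_h = math.log2(len(probs)) if len(probs) > 1 else 1.0
            weight = shannon_entropy(probs) / max_h
        else:
            weight = 1.0  # assumption: text tokens always count fully
        load[expert] += weight
    return load


# Expert 0 receives two visual tokens, expert 1 receives one text token.
load = weighted_expert_load(
    assignments=[0, 0, 1],
    is_visual=[True, True, False],
    token_probs=[[0.5, 0.5], [1.0, 0.0], [1.0]],
    num_experts=2,
)
```

Plain token counting would report loads of 2 and 1 here, whereas the weighted estimate treats the zero-entropy visual token as contributing nothing, so both experts end up with the same estimated load.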
ReGATE: Learning Faster and Better with Fewer Tokens in MLLMs: ReGATE uses a frozen text-only teacher to estimate which output tokens require visual information, and combines this estimate with the student's historical learning difficulty to dynamically select training tokens. This allows MLLMs to train faster with fewer tokens without changing the architecture or adding parameters, matching or exceeding standard fine-tuning performance on multiple image and video benchmarks.
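One way to picture the token selection described for ReGATE is the sketch below. It is an illustration under stated assumptions rather than the paper's method: it assumes the text-only teacher's per-token loss serves as the "needs vision" signal, that historical difficulty is an exponential moving average (EMA) of the student's per-token loss, and that the two are combined by multiplication. The function name, `keep_ratio`, and `decay` are hypothetical.

```python
def select_training_tokens(teacher_loss, student_loss, ema, keep_ratio=0.5, decay=0.9):
    """Return (mask, new_ema) where mask[i] is True if token i is trained on.

    teacher_loss: per-token loss of a frozen text-only teacher (assumed signal:
                  high loss without the image suggests the token needs vision).
    student_loss: per-token loss of the student at the current step.
    ema:          running per-token difficulty from earlier steps.
    """
    # Update the student's historical difficulty with an EMA (assumption).
    new_ema = [decay * e + (1 - decay) * s for e, s in zip(ema, student_loss)]

    # Combine both signals; a simple product is used here for illustration.
    scores = [t * d for t, d in zip(teacher_loss, new_ema)]

    # Keep the top keep_ratio fraction of tokens by score.
    k = max(1, round(keep_ratio * len(scores)))
    top = set(sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k])
    mask = [i in top for i in range(len(scores))]
    return mask, new_ema


# Tokens 1 and 3 are hard for the text-only teacher, so they are kept.
mask, new_ema = select_training_tokens(
    teacher_loss=[0.1, 2.0, 0.2, 3.0],
    student_loss=[1.0, 1.0, 1.0, 1.0],
    ema=[1.0, 1.0, 1.0, 1.0],
)
```

In a training loop, the mask would zero out the loss on unselected tokens, which is how fewer tokens could be trained on without changing the architecture or adding parameters.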