⚡ VLM Efficiency
💬 ACL2026 · 5 paper notes
📌 Same area in other venues: 📷 CVPR2026 (62) · 🧪 ICML2026 (4) · 🔬 ICLR2026 (5) · 🤖 AAAI2026 (5) · 🧠 NeurIPS2025 (8) · 📹 ICCV2025 (11)
- APB-V: Accelerating Long-Video Understanding via Sequence-Parallelism-aware Approximate Attention
APB-V accelerates long-video LMM inference using sequence-parallelism-aware approximate attention and system-level load balancing. While retaining the full visual embeddings, it achieves speedups of 12.72\(\times\), 1.70\(\times\), and 1.18\(\times\) over FlashAttn, ZigZagRing, and APB, respectively, in the 64-frame 1440p setting, without significant performance loss.
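The three speedups can be put on a common scale with a little arithmetic. The sketch below assumes all three figures were measured on the same workload and hardware, so that the ratios compose; that assumption is not stated in the summary, and the helper name `implied_speedup_over_flashattn` is hypothetical.

```python
# Speedups of APB-V over each baseline, as quoted in the summary.
APBV_SPEEDUP_OVER = {"FlashAttn": 12.72, "ZigZagRing": 1.70, "APB": 1.18}


def implied_speedup_over_flashattn(baseline: str) -> float:
    """Estimate how much faster `baseline` is than FlashAttn.

    Assumption: the quoted speedups share one setup, so
    (APB-V / FlashAttn) / (APB-V / baseline) = baseline / FlashAttn.
    """
    return APBV_SPEEDUP_OVER["FlashAttn"] / APBV_SPEEDUP_OVER[baseline]


if __name__ == "__main__":
    for name in ("ZigZagRing", "APB"):
        ratio = implied_speedup_over_flashattn(name)
        print(f"{name} vs FlashAttn (implied): {ratio:.2f}x")
```

Under that assumption, most of the gap to FlashAttn is already closed by the sequence-parallel baselines, and APB-V adds a smaller gain on top of them.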
- From Inheritance to Saturation: Disentangling the Evolution of Visual Redundancy for Architecture-Aware MLLM Inference Acceleration
This work identifies two sources of visual redundancy in MLLM inference: Inherited Visual Redundancy (IVR), caused by dense tokenization in the ViT, and Secondary Saturation Redundancy (SSR), which results from deep semantic saturation and manifests differently across backbone architectures. The proposed HalfV framework addresses the two separately, achieving a 4.1\(\times\) FLOPs speedup on Qwen2.5-VL while retaining 96.8% of the performance.
- HERMES: KV Cache as Hierarchical Memory for Efficient Streaming Video Understanding
This paper proposes HERMES, which treats the KV cache as a hierarchical memory (shallow layers as sensory memory, middle layers as working memory, deep layers as long-term memory), based on a mechanistic analysis of the hierarchical attention preferences in MLLM decoders. HERMES is training-free and enables efficient streaming video understanding: it maintains or improves accuracy while reducing video tokens by 68%, with a time-to-first-token (TTFT) latency under 30 ms, 10 times faster than the previous SOTA.
- HiPrune: Hierarchical Attention for Efficient Token Pruning in Vision-Language Models
This paper identifies a hierarchical attention pattern in vision encoders: middle layers focus on main objects, while deep layers focus on global information. Building on this, it proposes HiPrune, a training-free, model-agnostic visual token pruning method. By selecting three types of tokens (anchor, buffer, and register) to preserve visual information at different levels, it retains 99.3% of performance using only 1/3 of the tokens, reducing FLOPs by 58.7%.
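To make the three-way selection concrete, here is a minimal sketch in plain Python. It is not the paper's implementation. It assumes anchors are the tokens with the highest middle-layer attention, buffers are their 4-connected spatial neighbours on the patch grid, and registers are the highest deep-layer-attention tokens that fill the remaining budget. The function name `select_tokens` and the `anchor_frac` parameter are hypothetical.

```python
def select_tokens(mid_attn, deep_attn, grid_hw, budget, anchor_frac=0.2):
    """Return sorted indices of kept visual tokens (illustrative only).

    mid_attn / deep_attn: per-token attention scores from a middle and a
    deep encoder layer, flattened row-major over an (h, w) patch grid.
    The anchor/buffer/register roles below are assumptions for this sketch.
    """
    h, w = grid_hw
    n = h * w
    if len(mid_attn) != n or len(deep_attn) != n:
        raise ValueError("attention vectors must have h * w entries")
    if not 0 < budget <= n:
        raise ValueError("budget must be in 1..h*w")

    kept = set()

    def add(idx):
        # Never exceed the token budget.
        if len(kept) < budget:
            kept.add(idx)

    # Anchors: tokens the middle layer attends to most (object-centric).
    by_mid = sorted(range(n), key=lambda i: -mid_attn[i])
    anchors = by_mid[: max(1, int(budget * anchor_frac))]
    for a in anchors:
        add(a)

    # Buffers: 4-connected grid neighbours of each anchor (local context).
    for a in anchors:
        r, c = divmod(a, w)
        for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            rr, cc = r + dr, c + dc
            if 0 <= rr < h and 0 <= cc < w:
                add(rr * w + cc)

    # Registers: fill what is left with top deep-layer tokens (global info).
    for i in sorted(range(n), key=lambda i: -deep_attn[i]):
        if len(kept) >= budget:
            break
        add(i)

    return sorted(kept)


if __name__ == "__main__":
    mid = [0.0] * 36
    mid[14], mid[15] = 1.0, 0.9
    deep = [0.0] * 36
    deep[0], deep[35], deep[5], deep[30] = 1.0, 0.9, 0.8, 0.7
    # Keep 12 of 36 tokens, i.e. one third of a 6x6 grid.
    print(select_tokens(mid, deep, (6, 6), budget=12))
```

Because the scores come from attention maps the encoder already computes, a selection of this kind needs no training, which is consistent with the training-free claim above.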
- ReGATE: Learning Faster and Better with Fewer Tokens in MLLMs
ReGATE uses a frozen text-only teacher to estimate which output tokens require visual information and combines this with the student's historical learning difficulty to dynamically select training tokens. This lets MLLMs train faster with fewer tokens, without changing the architecture or adding parameters, while matching or exceeding standard fine-tuning performance on multiple image and video benchmarks.
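The token selection idea can be sketched as follows. This is an assumption-laden illustration, not the paper's scoring rule. It treats a high per-token loss from the text-only teacher as a sign that the token needs visual information, tracks the student's difficulty as an exponential moving average (EMA) of its per-token loss, and keeps the top-scoring fraction of tokens for the training loss. All names here (`update_ema`, `select_training_tokens`, `masked_mean_loss`, `alpha`, `keep_ratio`) are hypothetical.

```python
def update_ema(prev, current, momentum=0.9):
    """EMA of per-token student loss, used as 'historical difficulty'."""
    return [momentum * p + (1.0 - momentum) * c for p, c in zip(prev, current)]


def select_training_tokens(teacher_loss, student_ema_loss,
                           keep_ratio=0.5, alpha=0.5):
    """Return a boolean mask over output tokens (illustrative only).

    teacher_loss: per-token loss of a frozen text-only teacher; a high
        value is assumed to mean the token cannot be predicted from text.
    student_ema_loss: running per-token difficulty for the student.
    alpha: assumed mixing weight between the two signals.
    """
    n = len(teacher_loss)
    if n == 0 or len(student_ema_loss) != n:
        raise ValueError("loss vectors must be non-empty and equal length")
    scores = [alpha * t + (1.0 - alpha) * s
              for t, s in zip(teacher_loss, student_ema_loss)]
    k = max(1, int(keep_ratio * n))
    top = set(sorted(range(n), key=lambda i: -scores[i])[:k])
    return [i in top for i in range(n)]


def masked_mean_loss(student_loss, mask):
    """Average the student loss over the selected tokens only."""
    picked = [loss for loss, keep in zip(student_loss, mask) if keep]
    return sum(picked) / len(picked)


if __name__ == "__main__":
    teacher = [0.1, 2.0, 0.2, 1.5]
    ema = [0.3, 0.4, 1.8, 0.1]
    mask = select_training_tokens(teacher, ema, keep_ratio=0.5)
    print(mask, masked_mean_loss([1.0, 3.0, 5.0, 7.0], mask))
```

A selection like this only changes which tokens contribute to the loss, which fits the summary's point that no architectural change or extra parameters are needed.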