⚡ VLM Efficiency

🔬 ICLR2026 · 5 paper notes

📌 Same area in other venues: 📷 CVPR2026 (62) · 🧪 ICML2026 (4) · 💬 ACL2026 (5) · 🤖 AAAI2026 (5) · 🧠 NeurIPS2025 (8) · 📹 ICCV2025 (11)

🔥 Top topics: Model Compression ×3 · Compression ×2 · Multimodal/VLM ×2

HiDrop: Hierarchical Vision Token Reduction in MLLMs via Late Injection, Concave Pyramid Pruning, and Early Exit

This paper analyzes the layer-wise behavior of MLLMs, characterizing shallow layers as propagators, middle layers as fusion hubs, and deep layers as language reasoners. Building on that analysis, it proposes HiDrop, a three-stage framework: Late Injection (visual tokens skip the shallow layers), Concave Pyramid Pruning (aggressive pruning in the middle layers), and Early Exit (visual tokens are discarded in the deep layers). HiDrop compresses approximately 90% of visual tokens with negligible performance degradation and achieves a 1.72× training speedup.
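To make the three stages concrete, the sketch below computes how many visual tokens each decoder layer would process under such a schedule. Everything in it is an assumption for illustration: the function name `visual_token_schedule`, its parameters, and in particular the geometric decay, which merely stands in for the paper's concave pyramid schedule since the summary does not give its exact form.

```python
def visual_token_schedule(num_layers, num_tokens, inject_layer, exit_layer,
                          keep_ratio=0.5, min_tokens=1):
    """Return the number of visual tokens seen by each decoder layer.

    Hypothetical illustration, not the HiDrop implementation:
      - layers before `inject_layer` see no visual tokens (late injection),
      - layers in [inject_layer, exit_layer) see a shrinking token set
        (placeholder geometric decay for the pruning stage),
      - layers from `exit_layer` onward see no visual tokens (early exit).
    """
    if not 0 <= inject_layer < exit_layer <= num_layers:
        raise ValueError("need 0 <= inject_layer < exit_layer <= num_layers")
    counts = []
    current = num_tokens
    for layer in range(num_layers):
        if layer < inject_layer or layer >= exit_layer:
            counts.append(0)
        else:
            counts.append(current)
            # Shrink the token set before the next layer, never below min_tokens.
            current = max(min_tokens, int(current * keep_ratio))
    return counts


if __name__ == "__main__":
    schedule = visual_token_schedule(num_layers=8, num_tokens=576,
                                     inject_layer=2, exit_layer=6)
    baseline = 8 * 576
    print(schedule)
    print(f"fraction of baseline token-layer work: {sum(schedule) / baseline:.3f}")
```

The toy numbers (8 layers, 576 tokens) are chosen only to keep the output readable and are not meant to reproduce the reported 90% compression.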

Index-Preserving Lightweight Token Pruning for Efficient Document Understanding

A binary patch classifier with only 203K parameters is placed before the VLM's visual encoder to remove background tokens from document images. A \(3 \times 3\) max-pooling operation is then applied to recover fragmented text regions, and the surviving patches keep their original spatial indices. On Qwen2.5-VL, this yields a 40–60% FLOPs reduction with an accuracy drop of at most about 5 percentage points.
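The max-pooling step can be sketched as follows, assuming it is applied to the classifier's binary keep mask over the patch grid with stride 1 and zero padding (which makes it a binary dilation). The function names and the toy grid are hypothetical; the key point shown is that kept patches are reported by their original flat indices rather than being renumbered.

```python
import numpy as np


def dilate_keep_mask(mask: np.ndarray) -> np.ndarray:
    """3x3 max-pooling with stride 1 and zero padding on a 2-D boolean mask."""
    h, w = mask.shape
    padded = np.pad(mask.astype(bool), 1, constant_values=False)
    out = np.zeros((h, w), dtype=bool)
    # OR together the nine shifted views, which equals a 3x3 max over booleans.
    for dy in range(3):
        for dx in range(3):
            out |= padded[dy:dy + h, dx:dx + w]
    return out


def kept_patch_indices(mask: np.ndarray) -> np.ndarray:
    """Row-major indices of kept patches in the ORIGINAL grid."""
    return np.flatnonzero(dilate_keep_mask(mask))


if __name__ == "__main__":
    grid = np.zeros((4, 4), dtype=bool)
    grid[1, 1] = True  # a single patch classified as text
    print(kept_patch_indices(grid))
```

Because the indices still refer to positions in the full grid, downstream positional encodings can be computed as if no patch had been removed.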

IVC-Prune: Revealing the Implicit Visual Coordinates in LVLMs for Vision Token Pruning

This paper reveals an implicit visual coordinate (IVC) system that RoPE positional encoding establishes within LVLMs. It proposes a training-free, prompt-aware vision token pruning strategy that preserves IVC tokens and semantic foreground tokens, pruning approximately 50% of visual tokens while retaining ≥99% of the original performance.
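The selection rule described above (always keep IVC tokens, then fill the rest of the budget with prompt-relevant tokens) can be sketched as below. How IVC tokens are identified is not stated in this summary, so `ivc_indices` is simply taken as an input, and `relevance` is a hypothetical per-token prompt-relevance score; neither reflects the paper's actual criteria.

```python
import numpy as np


def select_tokens(relevance, ivc_indices, keep_ratio=0.5):
    """Return sorted indices of visual tokens to keep.

    Hypothetical sketch: reserved (IVC) tokens are always kept, and the
    remaining budget goes to the tokens with the highest relevance scores.
    """
    relevance = np.asarray(relevance, dtype=float)
    budget = max(1, int(round(relevance.shape[0] * keep_ratio)))
    ivc = np.unique(np.asarray(ivc_indices, dtype=int))
    remaining = max(0, budget - ivc.size)
    scores = relevance.copy()
    scores[ivc] = -np.inf  # already kept, so exclude from the top-k search
    top = np.argsort(-scores, kind="stable")[:remaining]
    # Sorting restores the original token order for the kept subset.
    return np.sort(np.concatenate([ivc, top]))


if __name__ == "__main__":
    rel = [0.1, 0.9, 0.2, 0.8, 0.3, 0.7, 0.0, 0.4]
    print(select_tokens(rel, ivc_indices=[0, 6], keep_ratio=0.5))
```

Note that tokens 0 and 6 survive despite having the lowest relevance, which is the behavior the summary attributes to preserving coordinate-bearing tokens.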

Mixing Importance with Diversity: Joint Optimization for KV Cache Compression in Large Vision-Language Models

This paper identifies modality-specific and attention-head-specific semantic redundancy in the KV Cache of LVLMs, demonstrating that importance-only selection fails to preserve semantic coverage. The proposed MixKV adaptively mixes importance and diversity scores per attention head for KV Cache compression, achieving an average improvement of 5.1% under extreme compression ratios.
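A minimal sketch of per-head importance/diversity mixing is given below. The specific choices are assumptions made for illustration and are not taken from the paper: diversity is measured as dissimilarity to the head's mean key direction, and the per-head mixing weight is the head's average similarity to that mean, so more redundant heads lean more on diversity.

```python
import numpy as np


def _minmax(x):
    """Scale each row to [0, 1]; constant rows map to all zeros."""
    lo = x.min(axis=-1, keepdims=True)
    span = x.max(axis=-1, keepdims=True) - lo
    return (x - lo) / np.where(span > 0, span, 1.0)


def mix_select(keys, importance, budget):
    """Pick `budget` cache entries per head.

    keys:       (heads, tokens, dim) cached key vectors
    importance: (heads, tokens) importance scores, e.g. attention mass
    Returns an array of shape (heads, budget) with sorted token indices.
    """
    keys = np.asarray(keys, dtype=float)
    importance = np.asarray(importance, dtype=float)
    unit = keys / (np.linalg.norm(keys, axis=-1, keepdims=True) + 1e-8)
    centroid = unit.mean(axis=1, keepdims=True)
    sim = (unit * centroid).sum(axis=-1)  # similarity to the head's mean key
    # Assumed heuristic: redundant heads (high mean similarity) favor diversity.
    weight = np.clip(sim.mean(axis=-1, keepdims=True), 0.0, 1.0)
    score = (1.0 - weight) * _minmax(importance) + weight * _minmax(1.0 - sim)
    top = np.argsort(-score, axis=-1, kind="stable")[:, :budget]
    return np.sort(top, axis=-1)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    k = rng.normal(size=(2, 16, 8))
    s = rng.random((2, 16))
    print(mix_select(k, s, budget=4))
```

With a weight of zero the rule reduces to importance-only top-k, which is the baseline the summary says fails to preserve semantic coverage.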

PPE: Positional Preservation Embedding for Token Compression in Multimodal Large Language Models

This paper proposes PPE (Positional Preservation Embedding), which exploits the fact that RoPE rotates each dimension pair independently: the original position IDs of the tokens merged into one compressed token are encoded in distinct dimension segments, so a single compressed token carries multiple spatial/temporal positional cues. PPE is a zero-parameter, plug-and-play operator that incurs an average performance drop of only 3.6% on image tasks at 55% compression, and maintains comparable performance at 90% compression via cascaded compression.
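The idea of assigning different position IDs to different dimension segments can be sketched with plain 1-D RoPE. This is an illustration under stated assumptions, not the paper's operator: the helper names are hypothetical, rotary pairs are split into contiguous equal segments, pairs are interleaved as (0, 1), (2, 3), and so on, and multi-axis position schemes used by real MLLMs are ignored.

```python
import numpy as np


def rope_angles(position, dim, base=10000.0):
    """Standard RoPE angles: one rotation angle per dimension pair."""
    inv_freq = base ** (-np.arange(0, dim, 2) / dim)
    return position * inv_freq


def ppe_angles(positions, dim, base=10000.0):
    """Angles for a merged token that carries several source positions.

    Rotary pairs are divided into len(positions) contiguous segments, and
    each segment is rotated according to one of the source position IDs.
    """
    positions = np.asarray(positions, dtype=float)
    n_pairs = dim // 2
    if not 1 <= len(positions) <= n_pairs:
        raise ValueError("need between 1 and dim // 2 source positions")
    inv_freq = base ** (-np.arange(0, dim, 2) / dim)
    segment = np.arange(n_pairs) * len(positions) // n_pairs
    return positions[segment] * inv_freq


def apply_rotary(x, angles):
    """Rotate interleaved pairs (x[0], x[1]), (x[2], x[3]), ... by `angles`."""
    x = np.asarray(x, dtype=float)
    x1, x2 = x[0::2], x[1::2]
    cos, sin = np.cos(angles), np.sin(angles)
    out = np.empty_like(x)
    out[0::2] = x1 * cos - x2 * sin
    out[1::2] = x1 * sin + x2 * cos
    return out


if __name__ == "__main__":
    merged = np.arange(8.0)  # toy feature of a token merged from positions 3 and 7
    print(apply_rotary(merged, ppe_angles([3, 7], dim=8)))
```

When all source positions coincide, `ppe_angles` reduces to `rope_angles`, so the sketch adds no parameters and changes nothing for uncompressed tokens, in line with the zero-parameter, plug-and-play description.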