⚡ VLM Efficiency
🔬 ICLR2026 · 5 paper notes
📌 Same area in other venues: 📷 CVPR2026 (62) · 🧪 ICML2026 (4) · 💬 ACL2026 (5) · 🤖 AAAI2026 (5) · 🧠 NeurIPS2025 (8) · 📹 ICCV2025 (11)
🔥 Top topics: Model Compression ×3 · Compression ×2 · Multimodal/VLM ×2
- HiDrop: Hierarchical Vision Token Reduction in MLLMs via Late Injection, Concave Pyramid Pruning, and Early Exit
This paper conducts a systematic layer-wise behavioral analysis of MLLMs (shallow layers = propagators, middle layers = fusion hubs, deep layers = language reasoners) and, based on it, proposes the HiDrop framework, a three-stage strategy: Late Injection (skipping shallow layers) + Concave Pyramid Pruning (aggressive pruning in middle layers) + Early Exit (discarding tokens in deep layers). The framework compresses approximately 90% of visual tokens with negligible performance degradation and achieves a 1.72× training speedup.
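To make the three stages concrete, here is a minimal sketch of a per-layer visual-token schedule. It is an illustration under assumptions, not the paper's schedule: the function name, the parameters `inject_layer`, `exit_layer`, `final_keep`, and `power`, and the front-loaded power-law decay that stands in for the concave pyramid are all hypothetical.

```python
def visual_token_schedule(num_layers, num_tokens, inject_layer, exit_layer,
                          final_keep, power=2.0):
    """Number of visual tokens processed at each decoder layer.

    Layers before inject_layer see no visual tokens (late injection),
    layers from exit_layer onward see none either (early exit), and the
    layers in between follow an assumed power-law decay from num_tokens
    down to final_keep.
    """
    counts = []
    span = max(exit_layer - inject_layer - 1, 1)
    for layer in range(num_layers):
        if layer < inject_layer or layer >= exit_layer:
            counts.append(0)
        else:
            t = (layer - inject_layer) / span  # 0 at injection, 1 just before exit
            keep = final_keep + (num_tokens - final_keep) * (1.0 - t) ** power
            counts.append(round(keep))
    return counts


# Toy configuration: 8 layers, 100 visual tokens, inject at layer 2, exit at layer 6.
schedule = visual_token_schedule(8, 100, inject_layer=2, exit_layer=6, final_keep=10)
# Fraction of (layer, visual token) slots that are actually computed.
compute_fraction = sum(schedule) / (8 * 100)
```

In this toy configuration most of the saving comes from the layers that process no visual tokens at all, which is the intuition behind combining the three stages rather than pruning alone.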
- Index-Preserving Lightweight Token Pruning for Efficient Document Understanding
A binary patch classifier with only 203K parameters is inserted before the VLM's visual encoder to remove background tokens from document images. A \(3 \times 3\) max-pooling operation is then applied to the predicted mask to recover fragmented text regions while preserving the original spatial indices of the kept patches. This achieves a 40–60% FLOPs reduction on Qwen2.5-VL with an accuracy drop of at most about 5 percentage points.
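The mask-recovery step can be sketched as follows. This is a minimal illustration, assuming stride 1 and zero padding for the \(3 \times 3\) max-pooling (the summary does not state these), with a made-up classifier output; the function names are hypothetical.

```python
def dilate_mask_3x3(mask):
    """Apply 3x3 max-pooling (stride 1, zero padding) to a binary patch mask."""
    rows, cols = len(mask), len(mask[0])
    out = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            out[r][c] = max(
                mask[rr][cc]
                for rr in range(max(0, r - 1), min(rows, r + 2))
                for cc in range(max(0, c - 1), min(cols, c + 2))
            )
    return out


def kept_patch_indices(mask):
    """Original row-major indices of the patches that survive pruning."""
    cols = len(mask[0])
    return [r * cols + c
            for r, row in enumerate(mask)
            for c, keep in enumerate(row) if keep]


# Hypothetical classifier output on a 4x5 patch grid: 1 = text, 0 = background.
predicted = [
    [0, 0, 0, 0, 0],
    [0, 1, 0, 1, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
]
recovered = dilate_mask_3x3(predicted)
kept = kept_patch_indices(recovered)
```

Because `kept` holds positions in the original grid rather than positions in a re-packed sequence, the surviving patches can keep their original position IDs downstream.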
- IVC-Prune: Revealing the Implicit Visual Coordinates in LVLMs for Vision Token Pruning
This paper reveals an implicit visual coordinate (IVC) system that RoPE positional encoding establishes within LVLMs. Building on this, it proposes a training-free, prompt-aware vision token pruning strategy that preserves IVC tokens and semantic foreground tokens, pruning approximately 50% of visual tokens while retaining ≥99% of the original performance.
- Mixing Importance with Diversity: Joint Optimization for KV Cache Compression in Large Vision-Language Models
This paper identifies modality-specific and attention-head-specific semantic redundancy in the KV Cache of LVLMs, demonstrating that importance-only selection fails to preserve semantic coverage. The proposed MixKV adaptively mixes importance and diversity scores per attention head for KV Cache compression, achieving an average improvement of 5.1% under extreme compression ratios.
- PPE: Positional Preservation Embedding for Token Compression in Multimodal Large Language Models
This paper proposes PPE (Positional Preservation Embedding), which exploits the dimensional independence of rotations in RoPE to encode multiple original position IDs from merged tokens into distinct dimension segments, enabling a single compressed token to carry multiple spatial/temporal positional cues. PPE is a zero-parameter, plug-and-play operator that achieves an average performance drop of only 3.6% on image tasks at 55% compression, and maintains comparable performance at 90% compression via cascaded compression.
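The dimension-segment idea can be sketched with plain 1D RoPE. This is a simplified illustration under assumptions: it pairs adjacent dimensions and uses a single position axis, the contiguous segment layout is chosen for illustration, and the function names are hypothetical; multimodal RoPE variants in real models differ in detail.

```python
import math


def segment_positions(source_positions, n_pairs):
    """Give each RoPE rotation pair the position ID of one merged source token.

    The n_pairs rotation pairs are split into len(source_positions)
    contiguous segments (an assumed layout).
    """
    k = len(source_positions)
    return [source_positions[i * k // n_pairs] for i in range(n_pairs)]


def apply_rope(vec, pair_positions, base=10000.0):
    """Rotate each pair (vec[2i], vec[2i+1]) by pair_positions[i] * base**(-i / n_pairs)."""
    n_pairs = len(vec) // 2
    out = []
    for i in range(n_pairs):
        angle = pair_positions[i] * base ** (-i / n_pairs)
        x, y = vec[2 * i], vec[2 * i + 1]
        out.append(x * math.cos(angle) - y * math.sin(angle))
        out.append(x * math.sin(angle) + y * math.cos(angle))
    return out


# A compressed token merged from two source tokens at positions 3 and 7.
query = [1.0, 0.0, 0.5, 0.5, 0.0, 1.0, 0.2, 0.8]
pair_positions = segment_positions([3, 7], n_pairs=4)  # [3, 3, 7, 7]
rotated = apply_rope(query, pair_positions)
```

Since each rotation pair is rotated independently, the first half of `rotated` matches standard RoPE at position 3 and the second half matches standard RoPE at position 7, and no parameters are added.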