⚡ VLM Efficiency
🧪 ICML2026 · 4 paper notes
🔥 Top topics: Multimodal/VLM ×2 · Model Compression ×2
- CLIP Tricks You: Training-free Token Pruning for Efficient Pixel Grounding in Large Vision-Language Models
This work identifies a counter-intuitive "similarity reversal" phenomenon in CLIP: visual tokens inside the referred region show low similarity to the text [EOS] token. Building on this, it proposes LiteLVLM, a training-free, text-guided visual token pruning method. LiteLVLM retains 90.3% of the original pixel grounding performance while pruning 66.7% of visual tokens, yielding a 22% inference speedup and 2.3× memory savings.
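As a rough illustration of the "similarity reversal" idea in the LiteLVLM note above, the sketch below keeps the visual tokens that are least similar to the text [EOS] embedding. It is a toy under stated assumptions, not the paper's selection rule: the function names `cosine` and `prune_tokens`, the plain cosine score, and the fixed one-third keep ratio (mirroring the 66.7% pruning rate) are all hypothetical choices made for this example.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def prune_tokens(visual_tokens, text_eos, keep_ratio=1 / 3):
    """Return sorted indices of the tokens LEAST similar to the [EOS] embedding.

    Assumption: low similarity marks the referred region, so those tokens
    are the ones worth keeping. The real method may score tokens differently.
    """
    n_keep = max(1, round(len(visual_tokens) * keep_ratio))
    sims = [cosine(tok, text_eos) for tok in visual_tokens]
    ranked = sorted(range(len(sims)), key=lambda i: sims[i])  # ascending similarity
    return sorted(ranked[:n_keep])

# Toy 2-D "embeddings"; real CLIP features have hundreds of dimensions.
tokens = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.2], [0.7, 0.7], [0.1, -1.0]]
eos = [1.0, 0.0]
kept = prune_tokens(tokens, eos)
print(kept)  # indices of the surviving third of the tokens
```

A conventional text-guided pruner would sort in the opposite direction and keep the most similar tokens; flipping the sort order is the only change this sketch makes to reflect the reported phenomenon.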
- Gated Relational Alignment via Confidence-based Distillation for Efficient VLMs
This paper unifies Quantization-Aware Training (QAT) and Knowledge Distillation (KD) through the lens of the Information Bottleneck (IB) principle. It proposes the GRACE framework (Gated Decoupled Distillation + Relational Centered Kernel Alignment + Adaptive IB Controller), enabling INT4-quantized LLaVA / Qwen-VL to not only avoid degradation but also exceed BF16 baselines on multiple benchmarks, while achieving 3× throughput and 54% memory savings in practice.
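For readers unfamiliar with Centered Kernel Alignment, the sketch below computes standard linear CKA between two feature matrices, for example teacher and student activations on the same batch. It is an assumption that GRACE's relational term resembles this textbook form; the gating, the quantizer, and the IB controller are not modeled, and the helper names are hypothetical.

```python
import math

def center_columns(X):
    """Subtract each column's mean (rows are samples, columns are features)."""
    n = len(X)
    means = [sum(col) / n for col in zip(*X)]
    return [[x - m for x, m in zip(row, means)] for row in X]

def cross_frob_sq(X, Y):
    """Squared Frobenius norm of X^T Y."""
    total = 0.0
    for x_col in zip(*X):
        for y_col in zip(*Y):
            total += sum(a * b for a, b in zip(x_col, y_col)) ** 2
    return total

def linear_cka(X, Y):
    """Linear CKA: ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F) on centered features."""
    Xc, Yc = center_columns(X), center_columns(Y)
    denom = math.sqrt(cross_frob_sq(Xc, Xc) * cross_frob_sq(Yc, Yc))
    return cross_frob_sq(Xc, Yc) / denom

teacher = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
student_scaled = [[2.0 * v for v in row] for row in teacher]  # same geometry, rescaled
student_unrelated = [[1.0], [-1.0], [1.0], [-1.0]]            # uncorrelated with teacher

print(linear_cka(teacher, student_scaled))
print(linear_cka(teacher, student_unrelated))
```

Because CKA compares the relational structure between samples rather than raw values, it is unchanged by rescaling and works even when teacher and student feature widths differ, which is one reason it is a natural fit for distilling into a quantized model.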
- Less Precise Can Be More Reliable: A Systematic Evaluation of Quantization's Impact on VLMs Beyond Accuracy
This paper runs 700,000 experiments spanning 16 quantization methods × 10 VLMs × multiple reliability metrics. It finds that quantization is not purely disruptive: it improves calibration, OOD detection, and noise robustness by suppressing high-rank, low-variance spectral components, while simultaneously amplifying reliance on covariate shifts and spurious correlations.
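To make "improves calibration" concrete, the sketch below computes Expected Calibration Error (ECE), a common calibration metric. The summary does not say which metric the paper reports, so using ECE here is an assumption, and the function name and toy inputs are hypothetical.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: bin-size-weighted average of |mean confidence - accuracy| per bin."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        bins[idx].append((conf, ok))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += len(members) / n * abs(avg_conf - accuracy)
    return ece

# Overconfident model: claims 95% confidence but is right half the time.
overconfident = expected_calibration_error([0.95] * 4, [True, True, False, False])
# Well-calibrated model: claims 75% confidence and is right 3 times out of 4.
calibrated = expected_calibration_error([0.75] * 4, [True, True, True, False])
print(overconfident, calibrated)
```

A lower ECE means stated confidence tracks actual accuracy more closely, so a quantized model can score better on this metric than its full-precision counterpart even when its accuracy is unchanged or slightly lower.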
- On the Adversarial Robustness of Large Vision-Language Models under Visual Token Compression
This paper provides the first systematic study of the adversarial robustness of Large Vision-Language Models (LVLMs) with visual token compression. It identifies the "optimization-inference space mismatch" in existing encoder attacks and proposes the CAGE attack. By utilizing Expected Feature Disturbance (EFD) and Ranking-Disturbance Alignment (RDA), CAGE significantly reduces the robust accuracy of compressed LVLMs under conditions where the compression mechanism and token budget are unknown.