
⚡ VLM Efficiency

🧪 ICML2026 · 4 paper notes


🔥 Top topics: Multimodal/VLM ×2 · Model Compression ×2

CLIP Tricks You: Training-free Token Pruning for Efficient Pixel Grounding in Large Vision-Language Models

This work identifies a counter-intuitive "similarity reversal" phenomenon in CLIP: visual tokens in referred regions exhibit low similarity with the text [EOS] token. Based on this, it proposes LiteLVLM, a training-free, text-guided visual token pruning method. It retains 90.3% of the original pixel grounding performance while pruning 66.7% of tokens, achieving 22% inference acceleration and 2.3× memory savings.
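
To make the pruning idea concrete, here is a minimal sketch, not the paper's implementation. It assumes the selection rule is simply "keep the visual tokens with the lowest cosine similarity to the text [EOS] embedding", which is one plausible reading of the similarity-reversal observation. The function name `prune_visual_tokens`, the 576-token grid, and the 64-dimensional random features are all hypothetical.

```python
import numpy as np

def prune_visual_tokens(visual_tokens, text_eos, keep_ratio=1 / 3):
    """Keep the visual tokens LEAST similar to the text [EOS] embedding.

    Assumption for illustration: under "similarity reversal", tokens in the
    referred region score low, so the lowest-similarity tokens are retained.
    """
    v = visual_tokens / np.linalg.norm(visual_tokens, axis=1, keepdims=True)
    t = text_eos / np.linalg.norm(text_eos)
    sim = v @ t  # cosine similarity per visual token, shape (num_tokens,)
    num_keep = max(1, int(round(keep_ratio * len(sim))))
    # Pick the lowest-similarity tokens, then restore their spatial order.
    keep_idx = np.sort(np.argsort(sim, kind="stable")[:num_keep])
    return visual_tokens[keep_idx], keep_idx, sim

rng = np.random.default_rng(0)
tokens = rng.normal(size=(576, 64))  # hypothetical 24x24 patch grid
eos = rng.normal(size=64)            # hypothetical text [EOS] embedding
kept, idx, sim = prune_visual_tokens(tokens, eos)
print(kept.shape)  # one third of the tokens survive, i.e. 66.7% are pruned
```

Sorting the kept indices preserves the original token order, which matters if the downstream model relies on positional information.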

Gated Relational Alignment via Confidence-based Distillation for Efficient VLMs

This paper unifies Quantization-Aware Training (QAT) and Knowledge Distillation (KD) through the lens of the Information Bottleneck (IB) principle. It proposes the GRACE framework (Gated Decoupled Distillation + Relational Centered Kernel Alignment + Adaptive IB Controller), enabling INT4-quantized LLaVA / Qwen-VL to not only avoid degradation but also exceed BF16 baselines on multiple benchmarks, while achieving 3× throughput and 54% memory savings in practice.
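
As a rough illustration of two ingredients named above, the sketch below computes standard linear Centered Kernel Alignment (CKA) between features produced by full-precision weights and by INT4 fake-quantized weights. It is a generic sketch, not the GRACE objective: the per-tensor symmetric quantizer, the random weight matrix, and the names `fake_quant_int4` and `linear_cka` are assumptions made for this example.

```python
import numpy as np

def fake_quant_int4(w):
    """Symmetric per-tensor INT4 fake quantization (quantize, then dequantize)."""
    scale = np.abs(w).max() / 7.0
    return np.clip(np.round(w / scale), -8, 7) * scale

def linear_cka(x, y):
    """Linear CKA between feature matrices of shape (n_samples, dim)."""
    x = x - x.mean(axis=0, keepdims=True)  # center each feature column
    y = y - y.mean(axis=0, keepdims=True)
    cross = np.linalg.norm(y.T @ x, ord="fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, ord="fro")
    norm_y = np.linalg.norm(y.T @ y, ord="fro")
    return cross / (norm_x * norm_y)

rng = np.random.default_rng(0)
inputs = rng.normal(size=(128, 32))  # hypothetical batch of activations
w = rng.normal(size=(32, 16))        # hypothetical teacher weight matrix
w_q = fake_quant_int4(w)             # hypothetical INT4 student weights

teacher_feats = inputs @ w
student_feats = inputs @ w_q
cka = linear_cka(teacher_feats, student_feats)
print(f"CKA(teacher, student) = {cka:.4f}")
```

A distillation loss could penalize `1 - cka` to pull the student's relational structure toward the teacher's. Whether GRACE uses this exact form is not stated here.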

Less Precise Can Be More Reliable: A Systematic Evaluation of Quantization's Impact on VLMs Beyond Accuracy

This paper runs 700,000 experiments across 16 quantization methods × 10 VLMs × multiple reliability metrics. It finds that quantization is not a simple disruptor: it improves calibration, OOD detection, and noise robustness by suppressing high-rank, low-variance spectral components, while simultaneously amplifying reliance on covariate shifts and spurious correlations.
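
Calibration is one of the reliability axes mentioned above. A common way to measure it is Expected Calibration Error (ECE). The sketch below is the standard equal-width binned estimator, offered as an assumption about how such a metric could be computed rather than as the paper's evaluation code. The name `expected_calibration_error` and the toy inputs are hypothetical.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: sample-weighted mean of |accuracy - confidence| per bin."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Assign each sample to a bin (lo, hi]; confidence 1.0 lands in the last bin.
    bins = np.digitize(confidences, edges[1:-1], right=True)
    ece = 0.0
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap  # weight by the fraction of samples in the bin
    return ece

# Toy example: a model that is always 100% confident but right 1 time in 4.
overconfident = expected_calibration_error([1.0, 1.0, 1.0, 1.0], [1, 0, 0, 0])
# Toy example: 75% confident and right 3 times in 4, i.e. well calibrated.
calibrated = expected_calibration_error([0.75, 0.75, 0.75, 0.75], [1, 1, 1, 0])
print(overconfident, calibrated)
```

Comparing this number before and after quantization is one way to check a claim like "quantization improves calibration" on a given model and dataset.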

On the Adversarial Robustness of Large Vision-Language Models under Visual Token Compression

This paper provides the first systematic study of the adversarial robustness of Large Vision-Language Models (LVLMs) with visual token compression. It identifies the "optimization-inference space mismatch" in existing encoder attacks and proposes the CAGE attack. By utilizing Expected Feature Disturbance (EFD) and Ranking-Disturbance Alignment (RDA), CAGE significantly reduces the robust accuracy of compressed LVLMs under conditions where the compression mechanism and token budget are unknown.