
⚡ VLM Efficiency

🧪 ICML2026 · 4 paper notes

📌 Same area in other venues: 📷 CVPR2026 (63) · 🔬 ICLR2026 (18) · 💬 ACL2026 (6) · 🤖 AAAI2026 (5) · 🧠 NeurIPS2025 (8) · 📹 ICCV2025 (11)

🔥 Top topics: Multimodal/VLM ×2 · Model Compression ×2

CLIP Tricks You: Training-free Token Pruning for Efficient Pixel Grounding in Large Vision-Language Models: This work identifies a counterintuitive "similarity reversal" phenomenon in CLIP: the visual tokens of referring regions exhibit the lowest similarity with [EOS] text tokens. Building on this observation, the authors propose LiteLVLM, a training-free, text-guided visual token pruning method. It retains 90.3% of the original pixel grounding performance after discarding 66.7% of visual tokens, while achieving a 22% inference speedup and 2.3× VRAM savings.
Gated Relational Alignment via Confidence-based Distillation for Efficient VLMs: This paper unifies Quantization-Aware Training (QAT) and Knowledge Distillation (KD) from an Information Bottleneck (IB) perspective and proposes the GRACE framework, which combines Gated Decoupled Distillation, Relational Centered Kernel Alignment, and an Adaptive IB Controller. With GRACE, INT4-quantized LLaVA / Qwen-VL models not only avoid performance degradation but also outperform their BF16 baselines across multiple benchmarks, while achieving 3× throughput and 54% memory savings.
Less Precise Can Be More Reliable: A Systematic Evaluation of Quantization's Impact on VLMs Beyond Accuracy: This study evaluates 16 quantization methods across 10 VLMs and multiple reliability metrics through 700,000 experiments. It finds that quantization is not purely disruptive: it improves calibration, OOD detection, and noise robustness by suppressing high-rank low-variance spectral components, while simultaneously amplifying reliance on covariate shifts and spurious correlations.
On the Adversarial Robustness of Large Vision-Language Models under Visual Token Compression: This paper presents the first systematic study of the adversarial robustness of Large Vision-Language Models (LVLMs) under visual token compression. It identifies an "optimization-inference space mismatch" in existing encoder attacks and proposes the CAGE attack. By utilizing Expected Feature Distortion (EFD) and Ranking-Distortion Alignment (RDA), CAGE significantly reduces the robust accuracy of compressed LVLMs under conditions where the compression mechanism and token budget are unknown.
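
The "similarity reversal" observation in the first note (CLIP Tricks You) suggests a simple selection rule: rank visual tokens by cosine similarity to the [EOS] text embedding and keep the least similar ones. The sketch below illustrates only that rule and is not the authors' LiteLVLM procedure. The function names, the 576-token grid, the 64-dimensional embeddings, and the random inputs are all assumptions made for illustration; the 1/3 keep ratio mirrors the 66.7% discard rate quoted in the summary.

```python
import numpy as np


def cosine_to_text(visual_tokens, eos_text_embedding):
    """Cosine similarity of each visual token to one text embedding."""
    v = visual_tokens / np.linalg.norm(visual_tokens, axis=1, keepdims=True)
    t = eos_text_embedding / np.linalg.norm(eos_text_embedding)
    return v @ t


def prune_visual_tokens(visual_tokens, eos_text_embedding, keep_ratio=1 / 3):
    """Keep the tokens LEAST similar to the [EOS] embedding (hypothetical rule).

    Returns the kept tokens and their original indices in spatial order.
    """
    sim = cosine_to_text(visual_tokens, eos_text_embedding)
    n_keep = max(1, int(round(keep_ratio * len(sim))))
    # argsort is ascending, so the first n_keep entries are the lowest scores.
    keep = np.argsort(sim, kind="stable")[:n_keep]
    keep = np.sort(keep)  # restore the original token order
    return visual_tokens[keep], keep


# Toy inputs (assumed shapes): 576 visual tokens, 64-dim shared embedding space.
rng = np.random.default_rng(0)
tokens = rng.normal(size=(576, 64))
eos = rng.normal(size=64)

kept, idx = prune_visual_tokens(tokens, eos)
print(kept.shape)  # 192 of 576 tokens survive, i.e. 66.7% are discarded
```

Sorting the kept indices preserves the spatial order of the surviving tokens, which matters if downstream layers rely on positional information.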