⚡ VLM Efficiency

🎞️ ECCV2024 · 4 paper notes

📌 Same area in other venues: 📷 CVPR2026 (63) · 🔬 ICLR2026 (18) · 💬 ACL2026 (6) · 🧪 ICML2026 (4) · 🤖 AAAI2026 (5) · 🧠 NeurIPS2025 (8)

🔥 Top topics: Multimodal/VLM ×3 · Model Compression ×2

Bad Students Make Great Teachers: Active Learning Accelerates Large-Scale Visual Understanding: This work proposes ClassAct/ActiveCLIP, which uses small, low-cost proxy models to compute "learnability" scores for data points and prioritizes training data accordingly. This reduces the number of training updates for large-scale visual classifiers and multimodal models by 46% and 51% respectively, achieves up to 25% total compute savings, and is the first active learning method to achieve net positive compute savings in large-scale pre-training.
Groma: Localized Visual Tokenization for Grounding Multimodal Large Language Models: Groma proposes a new paradigm that embeds localization capabilities directly into the visual tokenization process. By discovering regions of interest (ROIs) via a region proposer and encoding them into region tokens, Groma enables MLLMs to perform high-accuracy referring and grounding without relying on LLM-generated coordinates or external modules. It also leverages GPT-4V with visual prompting to construct Groma Instruct, the first grounded chat dataset featuring dual visual-textual prompts.
IVTP: Instruction-Guided Visual Token Pruning for Large Vision-Language Models: IVTP uses the textual instruction to dynamically assess the importance of each visual token and prunes redundant tokens during inference in Large Vision-Language Models (LVLMs). This yields task-adaptive compression of visual information, significantly reducing computational overhead while maintaining or even improving model performance.
Quantized Prompt for Efficient Generalization of Vision-Language Models: By treating quantization error as a form of regularization noise, this work applies ultra-low-bit quantization (down to 1-bit) to the learnable prompts of VLMs. This significantly reduces storage overhead (up to \(16\times\) compression) while markedly improving the model's generalization capability to unseen classes. QCoOp achieves superior performance over various state-of-the-art (SOTA) methods using only 0.26KB of storage.
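
To make the 1-bit prompt quantization idea in the last note more concrete, here is a minimal sketch. It is not the paper's implementation: the sign-plus-scale scheme, the function names (`quantize_1bit`, `dequantize_1bit`, `storage_bits`), the prompt shape (4 tokens of 512 dimensions), and the 16-bit baseline used for the storage ratio are all assumptions chosen for illustration.

```python
import random


def quantize_1bit(values):
    """Quantize a flat list of floats to signs (+1/-1) plus one shared scale.

    Assumption: the scale is the mean absolute value, which minimizes the
    squared reconstruction error for a sign-only code.
    """
    scale = sum(abs(v) for v in values) / len(values)
    signs = [1 if v >= 0 else -1 for v in values]
    return signs, scale


def dequantize_1bit(signs, scale):
    """Rebuild approximate values from the signs and the shared scale."""
    return [s * scale for s in signs]


def storage_bits(num_values, bits_per_value):
    """Bits needed for the values alone (the shared scale is ignored)."""
    return num_values * bits_per_value


# Hypothetical learnable prompt: 4 tokens x 512 dims, small random init.
rng = random.Random(0)
prompt = [rng.gauss(0.0, 0.02) for _ in range(4 * 512)]

signs, scale = quantize_1bit(prompt)
restored = dequantize_1bit(signs, scale)

# Quantization error acts like noise added to the prompt.
mse = sum((p - r) ** 2 for p, r in zip(prompt, restored)) / len(prompt)

# Assumed 16-bit baseline versus 1 bit per value.
ratio = storage_bits(len(prompt), 16) / storage_bits(len(prompt), 1)
print(f"scale={scale:.5f} mse={mse:.3e} compression={ratio:.0f}x")
```

Under these assumptions, the gap between `prompt` and `restored` is the quantization error that the note describes as regularization noise, and the ratio simply reflects 16 bits versus 1 bit per value.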