⚡ VLM Efficiency

📷 CVPR2025 · 3 paper notes

🔥 Top topics: Model Compression ×2

MBQ: Modality-Balanced Quantization for Large Vision-Language Models: This work finds that vision tokens and language tokens in large VLMs differ by more than tenfold in their sensitivity to quantization error. It proposes MBQ, a post-training quantization method that applies a gradient-based modality-balancing factor during calibration so that the two modalities are weighted by their sensitivity. Under the W3A16 and W4A8 configurations, MBQ improves accuracy by up to 4.4% and 11.6%, respectively, and achieves a 1.4× end-to-end speedup.
Quantization without Tears: This paper proposes QwT, which compensates for the information lost to quantization by adding a lightweight linear compensation layer after each block of the quantized network. The layer's parameters have a closed-form solution that can be computed in under 2 minutes, and the method significantly improves post-training quantization (PTQ) accuracy across vision, language, and multimodal tasks.
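
To make the MBQ idea of a gradient-based modality-balancing factor more concrete, here is a minimal NumPy sketch. It is not the paper's implementation. The function names, the use of mean absolute gradient as the sensitivity measure, the toy data, and the grid search over a weight-clipping ratio are all assumptions chosen for illustration; the toy gradients simply make one modality ten times larger than the other.

```python
import numpy as np

def balancing_factors(grad, is_vision):
    """Hypothetical sensitivity weights: mean |gradient| per modality, normalized to sum to 1."""
    g_vis = np.abs(grad[is_vision]).mean()
    g_lang = np.abs(grad[~is_vision]).mean()
    total = g_vis + g_lang
    return g_vis / total, g_lang / total

def balanced_error(y_fp, y_q, is_vision, gamma_vis, gamma_lang):
    """Reconstruction error where each modality's tokens are weighted by its factor."""
    per_token = ((y_fp - y_q) ** 2).mean(axis=1)
    return gamma_vis * per_token[is_vision].mean() + gamma_lang * per_token[~is_vision].mean()

def quantize_weights(w, bits, clip):
    """Symmetric per-tensor fake quantization with a clipping ratio (toy stand-in)."""
    qmax = 2 ** (bits - 1) - 1
    scale = clip * np.abs(w).max() / qmax
    return np.clip(np.round(w / scale), -qmax - 1, qmax) * scale

def search_clip(x, w, is_vision, gamma_vis, gamma_lang, bits=3):
    """Pick the clipping ratio that minimizes the modality-balanced error."""
    y_fp = x @ w
    candidates = np.linspace(0.5, 1.0, 11)
    errors = [
        balanced_error(y_fp, x @ quantize_weights(w, bits, c), is_vision, gamma_vis, gamma_lang)
        for c in candidates
    ]
    return float(candidates[int(np.argmin(errors))])

# Toy calibration batch: 192 "vision" tokens followed by 64 "language" tokens.
rng = np.random.default_rng(0)
x = rng.normal(size=(256, 32))
is_vision = np.arange(256) < 192
grad = rng.normal(size=(256, 32))
grad[~is_vision] *= 10.0  # assumption: one modality has ~10x larger gradients
w = rng.normal(size=(32, 32))

gamma_vis, gamma_lang = balancing_factors(grad, is_vision)
best_clip = search_clip(x, w, is_vision, gamma_vis, gamma_lang)
```

The point of the weighting is that a calibration objective which treats all tokens equally would be dominated by the more numerous vision tokens, even when they are the less sensitive modality.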
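
The closed-form compensation described for QwT can be illustrated with an ordinary least-squares fit. The sketch below is a toy under stated assumptions, not the authors' code: it assumes the compensation layer reads the block input and that its output is added to the quantized block's output, and the helper names, the small ridge term, and the `tanh` toy block are all hypothetical.

```python
import numpy as np

def fit_linear_compensation(x, y_fp, y_q, ridge=1e-6):
    """Closed-form least-squares fit of (W, b) so that y_q + x @ W + b approximates y_fp."""
    residual = y_fp - y_q                                  # what quantization lost
    x_aug = np.hstack([x, np.ones((x.shape[0], 1))])       # extra column for the bias
    gram = x_aug.T @ x_aug + ridge * np.eye(x_aug.shape[1])  # small ridge for stability
    params = np.linalg.solve(gram, x_aug.T @ residual)
    return params[:-1], params[-1]

def fake_quantize(w, bits=3):
    """Symmetric per-tensor fake quantization (toy stand-in for a real PTQ method)."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max() / qmax
    return np.clip(np.round(w / scale), -qmax - 1, qmax) * scale

# Toy "block": a single nonlinear layer, evaluated in full precision and quantized.
rng = np.random.default_rng(0)
x = rng.normal(size=(512, 16))
w_fp = rng.normal(size=(16, 16)) * 0.3
y_fp = np.tanh(x @ w_fp)
y_q = np.tanh(x @ fake_quantize(w_fp))

W, b = fit_linear_compensation(x, y_fp, y_q)
err_before = np.mean((y_fp - y_q) ** 2)
err_after = np.mean((y_fp - (y_q + x @ W + b)) ** 2)
```

Because a zero compensation layer is always a valid candidate, the fitted layer cannot increase the mean squared error on the calibration set; how much it helps on held-out data depends on the block and the calibration samples.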