MASQuant: Modality-Aware Smoothing Quantization for Multimodal Large Language Models¶
Conference: CVPR 2026 arXiv: 2603.04800 Code: https://github.com/alibaba/EfficientAI Area: Multimodal VLM Keywords: Post-training quantization, multimodal LLM, smooth quantization, cross-modal compensation, low-rank decomposition
TL;DR¶
This paper identifies a "smoothing misalignment" problem that arises when channel-wise smooth quantization methods (e.g., SmoothQuant) are directly applied to MLLMs—the large discrepancy in activation magnitudes across modalities causes non-dominant modalities to be over-smoothed. MASQuant is proposed to address this via modality-aware smoothing factors and SVD whitening-based cross-modal low-rank compensation.
Background & Motivation¶
Background: Post-training quantization (PTQ) is a critical technique for deploying large models. Channel-wise smoothing methods based on mathematical invariance (SmoothQuant, AWQ, etc.) perform well on text-only LLMs by redistributing activation outliers through channel scaling factors.
Limitations of Prior Work: When channel-wise smoothing is directly applied to MLLMs, visual token activations are typically 10–100× larger in magnitude than text token activations. A unified smoothing factor is dominated by the dominant modality (usually vision), causing non-dominant modalities (text, audio) to be over-smoothed, compressing their signals and introducing severe quantization errors—referred to as "Smoothing Misalignment."
Key Challenge: Learning independent smoothing factors per modality resolves the misalignment, but requires storing separate quantized weights for each modality, fundamentally defeating the purpose of quantization compression.
Goal: Can modality-aware smooth quantization be achieved while maintaining a single set of quantized weights?
Key Insight: The authors observe that the weight differences across modalities after smoothing are provably low-rank, enabling lightweight low-rank matrices to serve as compensation.
Core Idea: Learn modality-specific smoothing factors + store a single set of quantized weights anchored to the text modality + apply SVD whitening-based low-rank compensation for other modalities.
Method¶
Overall Architecture¶
MASQuant consists of two core modules: (1) Modality-Aware Smoothing (MAS), which learns independently optimized smoothing factors per modality; and (2) Cross-Modal Compensation (CMC), which compresses cross-modal weight differences into a low-rank form via SVD whitening, storing only one set of quantized weights plus lightweight compensation matrices.
Key Designs¶
- Modality-Aware Smoothing (MAS):
- Function: Learn an independent optimized smoothing factor \(\mathbf{S}_m\) for each modality \(m\).
- Mechanism: Initialize \(s_i^m = \sqrt{\max_t|x_{t,i}^m| / \max_j|w_{j,i}|}\), then directly optimize the smoothing factors by minimizing a modality-specific MAE loss \(\sum_{m} \lambda_m \cdot \mathcal{L}_{MAE}(\mathbf{S}_m, \mathbf{X}_m, \mathbf{W})\).
- SQNR Theoretical Analysis: Proves that unified smoothing degrades SQNR for non-dominant modalities by \(\Delta = 10\log_{10}(\frac{d(\min_i \alpha_i^2)}{\sum_i 1/\alpha_i^2})\), where \(\alpha_i\) denotes the activation range ratio between modalities.
- Design Motivation: Rather than searching for a hyperparameter \(\beta\), the smoothing factors themselves are directly optimized, reaching the optimization limit of channel-wise smoothing.
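The initialization rule above can be illustrated with a small numpy sketch. This is a toy reconstruction, not the paper's code: the activation magnitudes, dimensions, and the 50× vision/text gap are made-up stand-ins chosen only to exhibit the misalignment; the subsequent MAE-loss optimization step is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # toy channel dimension

# Toy activations: vision tokens much larger than text tokens, mimicking
# the 10-100x magnitude gap behind "smoothing misalignment".
X_text = rng.normal(0, 1.0, size=(32, d))
X_vis = rng.normal(0, 50.0, size=(32, d))
W = rng.normal(0, 0.5, size=(d, d))  # (in_channels, out_channels)

def smooth_init(X, W):
    """Per-channel init s_i = sqrt(max_t |x_{t,i}| / max_j |w_{j,i}|)."""
    act_max = np.abs(X).max(axis=0)   # max over tokens, per input channel
    w_max = np.abs(W).max(axis=1)     # max over output dim, per input channel
    return np.sqrt(act_max / w_max)

s_text = smooth_init(X_text, W)                        # modality-specific
s_vis = smooth_init(X_vis, W)                          # modality-specific
s_joint = smooth_init(np.vstack([X_text, X_vis]), W)   # unified (SmoothQuant-style)

# The unified factor is dominated by vision: text activations divided by
# s_joint are compressed far below their modality-specific optimum.
print(np.abs(X_text / s_text).max(), np.abs(X_text / s_joint).max())
```

Because the unified factor takes the channel max over both modalities, it is element-wise at least as large as the text-only factor, which is exactly the over-smoothing of the non-dominant modality described above.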
- Cross-Modal Compensation (CMC):
- Function: Compensate for quantization error in non-text modalities while using a single set of quantized weights.
- Mechanism: Using the text-modality smoothed weight \(Q(\mathbf{S}_t \mathbf{W})\) as the anchor, the visual modality produces a residual \(\Delta\mathbf{W} = \mathbf{S}_v \mathbf{W} - Q(\mathbf{S}_t \mathbf{W})\). Directly applying SVD to \(\Delta\mathbf{W}\) yields poor results due to lack of low-rank structure, but after a whitening transform \(\mathbf{T} = (\mathbf{P}\Lambda^{1/2})^\top\), the transformed residual \(\mathbf{T}(\Delta\mathbf{W})\) exhibits strong low-rank structure.
- After truncated SVD: \(\Delta\mathbf{W} \approx \mathbf{L}_1 \mathbf{L}_2\), where \(\mathbf{L}_1 = \mathbf{T}^{-1}\mathbf{U}_r\) and \(\mathbf{L}_2 = \Sigma_r \mathbf{V}_r^\top\).
- Theoretical Guarantee: Proved that this scheme minimizes the output reconstruction error \(\|\mathbf{X}_v \mathbf{S}_v^{-1}(\Delta\mathbf{W} - \mathbf{L})\|_F^2\).
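The whitening-then-truncated-SVD construction can be sketched in numpy. This is our reading of the construction, not the released code: we assume the whitening transform comes from the eigendecomposition of the input Gram matrix \(\mathbf{X}^\top\mathbf{X} = \mathbf{P}\Lambda\mathbf{P}^\top\), and the residual \(\Delta\mathbf{W}\) and anisotropic inputs are synthetic toy data.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d_in, d_out, r = 256, 16, 16, 4

# Anisotropic toy inputs standing in for the smoothed vision activations
# X_v S_v^{-1}; channel scales vary over two orders of magnitude.
Xv = rng.normal(size=(n, d_in)) * np.logspace(0, 2, d_in)
dW = rng.normal(0, 0.1, size=(d_in, d_out))  # toy residual S_v W - Q(S_t W)

# Whitening transform T = (P Lambda^{1/2})^T from the eigendecomposition
# of the input Gram matrix X^T X = P Lambda P^T (assumed construction).
lam, P = np.linalg.eigh(Xv.T @ Xv)
T = (P * np.sqrt(lam)).T
T_inv = np.linalg.inv(T)

# Truncated SVD of the whitened residual, then fold T back in:
U, S, Vt = np.linalg.svd(T @ dW, full_matrices=False)
L1 = T_inv @ U[:, :r]          # L1 = T^{-1} U_r
L2 = S[:r, None] * Vt[:r]      # L2 = Sigma_r V_r^T

# Output-space reconstruction errors: whitened rank-r compensation vs.
# naive truncated SVD applied directly to dW.
err_white = np.linalg.norm(Xv @ (dW - L1 @ L2))
Un, Sn, Vtn = np.linalg.svd(dW, full_matrices=False)
err_naive = np.linalg.norm(Xv @ (dW - Un[:, :r] @ (Sn[:r, None] * Vtn[:r])))
print(err_white, err_naive)
```

The key identity is \(\|\mathbf{X}(\Delta\mathbf{W} - \mathbf{L})\|_F = \|\mathbf{T}(\Delta\mathbf{W} - \mathbf{L})\|_F\), so the truncated SVD of the whitened residual is the global rank-\(r\) minimizer of the output error, which direct SVD on \(\Delta\mathbf{W}\) is not.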
- Inference Pipeline:
- Text modality: \(\mathbf{Y} = Q(\mathbf{X}_t \mathbf{S}_t^{-1}) \cdot Q(\mathbf{S}_t \mathbf{W})\)
- Non-text modalities: \(\mathbf{Y} = Q(\mathbf{X}_m \mathbf{S}_m^{-1}) \cdot Q(\mathbf{S}_t \mathbf{W}) + \mathbf{X}_m \mathbf{S}_m^{-1} \cdot \mathbf{L}_1^m \mathbf{L}_2^m\)
- Only lightweight low-rank matrices need to be stored additionally; the primary weights remain a single quantized copy.
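The two inference paths above can be put together in a toy end-to-end sketch. Everything here is a stand-in: `quantize` is a generic symmetric fake-quantizer rather than a real INT8 kernel, the smoothing factors are random toy values rather than learned ones, and the compensation uses a plain truncated SVD of the residual for brevity (the paper's CMC additionally whitens it first).

```python
import numpy as np

rng = np.random.default_rng(2)
d_in, d_out, r = 16, 16, 8

def quantize(A, bits=8):
    """Symmetric per-tensor fake-quantization (round-to-nearest)."""
    scale = np.abs(A).max() / (2 ** (bits - 1) - 1)
    return np.round(A / scale) * scale

W = rng.normal(0, 0.5, size=(d_in, d_out))
s_t = np.exp(rng.normal(0, 0.3, size=d_in))        # toy text factors
s_v = np.exp(rng.normal(0, 0.3, size=d_in)) * 5.0  # toy vision factors

Wq = quantize(s_t[:, None] * W)  # the single stored weight: Q(S_t W)

# Vision residual dW = S_v W - Q(S_t W), compensated at rank r.
dW = s_v[:, None] * W - Wq
U, S, Vt = np.linalg.svd(dW, full_matrices=False)
L1, L2 = U[:, :r], S[:r, None] * Vt[:r]

X_v = rng.normal(0, 10.0, size=(8, d_in))
X_s = X_v / s_v                       # smoothed input X_v S_v^{-1}
Y_ref = X_v @ W                       # full-precision reference
Y_hat = quantize(X_s) @ Wq + X_s @ L1 @ L2  # quantized path + low-rank path
Y_no = quantize(X_s) @ Wq             # text-anchored weights, no compensation
print(np.abs(Y_hat - Y_ref).mean(), np.abs(Y_no - Y_ref).mean())
```

The low-rank path is a dense matmul against \(\mathbf{L}_1^m \mathbf{L}_2^m\) only, so the stored overhead and extra compute scale with \(r\), not with the weight matrix.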
Key Experimental Results¶
Main Results (Qwen2.5-VL Series)¶
| Method | Bits | MMMU | OCRBench | ScienceQA | TextVQA | Avg |
|---|---|---|---|---|---|---|
| FP16 | W16A16 | Baseline | Baseline | Baseline | Baseline | 100% |
| SmoothQuant | W8A8 | Notable drop | Drop | Drop | Drop | — |
| MASQuant | W8A8 | Best | Best | Best | Best | SOTA |
Cross-Architecture Validation¶
| Model Type | Description |
|---|---|
| Dual-modal VLM | Consistently outperforms SmoothQuant and AWQ on Qwen2.5-VL-3B/7B |
| Three-modal Omni | Equally effective on Qwen2.5-Omni-3B; audio modality also benefits |
Ablation Study¶
- MAS alone already significantly improves SQNR (Figure 2 validates Theorem 1).
- The low-rank approximation quality of CMC converges rapidly as rank increases.
- The low-rank property of the whitened residual is substantially stronger than that of the direct SVD baseline.
Highlights & Insights¶
- First formal definition of the "smoothing misalignment" problem in MLLM quantization, with a theoretical SQNR analysis (Theorem 1).
- Mathematical proof of the low-rank structure of cross-modal activation differences, providing theoretical guarantees for CMC (Theorem 2).
- The framework applies to both dual-modal (vision–text) and tri-modal (vision–text–audio) MLLMs.
- Maintains a single set of quantized weights with negligible additional storage overhead (low-rank matrices only).
- Consistently outperforms existing channel-wise smoothing PTQ methods on Qwen2.5-VL and Qwen2.5-Omni.
Detailed Ablation¶
- MAS only (without CMC): Requires storing separate quantized weights per modality, but achieves optimal quantization accuracy.
- CMC only (without smoothing correction): Limited remediation, as the underlying smoothing misalignment remains unresolved.
- MAS + CMC (full method): Approaches the accuracy upper bound of MAS under the single-weight constraint.
- CMC low-rank compensation: Rank 16–32 is generally sufficient to recover 90%+ of the accuracy gap.
- Singular value decay of whitened \(\mathbf{T}(\Delta\mathbf{W})\) is far faster than that of direct SVD, validating the low-rank assumption.
Limitations & Future Work¶
- The calibration stage requires collecting per-modality data to separately optimize smoothing factors, increasing preprocessing complexity.
- The rank \(r\) selection for low-rank compensation requires a trade-off between accuracy and additional storage.
- Only W8A8 and W4A8 settings are validated; performance under more aggressive low-bit configurations (e.g., W2A4) remains unknown.
- Inference on non-text modalities requires an additional matrix multiplication \(\mathbf{X}_m \mathbf{S}_m^{-1} \cdot \mathbf{L}_1^m \mathbf{L}_2^m\), introducing marginal latency overhead.
- The modality-aware idea could potentially be extended to rotation-based methods (e.g., QuaRot, SpinQuant).
- In tri-modal and higher settings, the number of low-rank compensation matrices grows linearly, requiring memory management optimization.
Implementation Details¶
- MAS optimization uses Adam and typically converges within 100–200 iterations.
- CMC low-rank matrices are stored in FP16; their memory footprint is negligible compared to the full weight matrices.