MASQuant: Modality-Aware Smoothing Quantization for Multimodal Large Language Models¶
Conference: CVPR 2026 arXiv: 2603.04800 Code: https://github.com/alibaba/EfficientAI Area: Multimodal VLM Keywords: Post-training quantization, multimodal LLM, smooth quantization, cross-modal compensation, low-rank decomposition
TL;DR¶
This paper identifies a "smoothing misalignment" problem that arises when channel-wise smooth quantization methods (e.g., SmoothQuant) are directly applied to MLLMs—the large discrepancy in activation magnitudes across modalities causes non-dominant modalities to be over-smoothed. MASQuant is proposed to address this via modality-aware smoothing factors and SVD whitening-based cross-modal low-rank compensation.
Background & Motivation¶
Background: Post-training quantization (PTQ) is a critical technique for deploying large models. Channel-wise smoothing methods based on mathematical invariance (SmoothQuant, AWQ, etc.) perform well on text-only LLMs by redistributing activation outliers through channel scaling factors.
Limitations of Prior Work: When channel-wise smoothing is directly applied to MLLMs, visual token activations are typically 10–100× larger in magnitude than text token activations. A unified smoothing factor is dominated by the dominant modality (usually vision), causing non-dominant modalities (text, audio) to be over-smoothed, compressing their signals and introducing severe quantization errors—referred to as "Smoothing Misalignment."
Key Challenge: Learning independent smoothing factors per modality resolves the misalignment, but requires storing separate quantized weights for each modality, fundamentally defeating the purpose of quantization compression.
Goal: Can modality-aware smooth quantization be achieved while maintaining a single set of quantized weights?
Key Insight: The authors observe that the weight differences across modalities after smoothing are provably low-rank, enabling lightweight low-rank matrices to serve as compensation.
Core Idea: Learn modality-specific smoothing factors + store a single set of quantized weights anchored to the text modality + apply SVD whitening-based low-rank compensation for other modalities.
Method¶
Overall Architecture¶
MASQuant consists of two core modules: (1) Modality-Aware Smoothing (MAS), which learns independently optimized smoothing factors per modality; and (2) Cross-Modal Compensation (CMC), which compresses cross-modal weight differences into a low-rank form via SVD whitening, storing only one set of quantized weights plus lightweight compensation matrices.
Key Designs¶
- Modality-Aware Smoothing (MAS):
- Function: Learn an independent optimized smoothing factor \(\mathbf{S}_m\) for each modality \(m\).
- Mechanism: Initialize \(s_i^m = \sqrt{\max_t|x_{t,i}^m| / \max_j|w_{j,i}|}\), then directly optimize the smoothing factors by minimizing a modality-specific MAE loss \(\sum_{m} \lambda_m \cdot \mathcal{L}_{MAE}(\mathbf{S}_m, \mathbf{X}_m, \mathbf{W})\).
- SQNR Theoretical Analysis: Proves that unified smoothing degrades SQNR for non-dominant modalities by \(\Delta = 10\log_{10}(\frac{d(\min_i \alpha_i^2)}{\sum_i 1/\alpha_i^2})\), where \(\alpha_i\) denotes the activation range ratio between modalities.
- Design Motivation: Rather than searching for a hyperparameter \(\beta\), the smoothing factors themselves are directly optimized, reaching the optimization limit of channel-wise smoothing.
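The initialization rule above can be illustrated with a small numpy sketch. This is a toy reconstruction, not the paper's code: the activation magnitudes, dimensions, and the 50× vision/text gap are made-up stand-ins chosen only to exhibit the misalignment; the subsequent MAE-loss optimization step is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # toy channel dimension

# Toy activations: vision tokens much larger than text tokens, mimicking
# the 10-100x magnitude gap behind "smoothing misalignment".
X_text = rng.normal(0, 1.0, size=(32, d))
X_vis = rng.normal(0, 50.0, size=(32, d))
W = rng.normal(0, 0.5, size=(d, d))  # (in_channels, out_channels)

def smooth_init(X, W):
    """Per-channel init s_i = sqrt(max_t |x_{t,i}| / max_j |w_{j,i}|)."""
    act_max = np.abs(X).max(axis=0)   # max over tokens, per input channel
    w_max = np.abs(W).max(axis=1)     # max over output dim, per input channel
    return np.sqrt(act_max / w_max)

s_text = smooth_init(X_text, W)                        # modality-specific
s_vis = smooth_init(X_vis, W)                          # modality-specific
s_joint = smooth_init(np.vstack([X_text, X_vis]), W)   # unified (SmoothQuant-style)

# The unified factor is dominated by vision: text activations divided by
# s_joint are compressed far below their modality-specific optimum.
print(np.abs(X_text / s_text).max(), np.abs(X_text / s_joint).max())
```

Because the unified factor takes the channel max over both modalities, it is element-wise at least as large as the text-only factor, which is exactly the over-smoothing of the non-dominant modality described above.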
- Cross-Modal Compensation (CMC):
- Function: Compensate for quantization error in non-text modalities while using a single set of quantized weights.
- Mechanism: Using the text-modality smoothed weight \(Q(\mathbf{S}_t \mathbf{W})\) as the anchor, the visual modality produces a residual \(\Delta\mathbf{W} = \mathbf{S}_v \mathbf{W} - Q(\mathbf{S}_t \mathbf{W})\). Directly applying SVD to \(\Delta\mathbf{W}\) yields poor results due to lack of low-rank structure, but after a whitening transform \(\mathbf{T} = (\mathbf{P}\Lambda^{1/2})^\top\), the transformed residual \(\mathbf{T}(\Delta\mathbf{W})\) exhibits strong low-rank structure.
- After truncated SVD: \(\Delta\mathbf{W} \approx \mathbf{L}_1 \mathbf{L}_2\), where \(\mathbf{L}_1 = \mathbf{T}^{-1}\mathbf{U}_r\) and \(\mathbf{L}_2 = \Sigma_r \mathbf{V}_r^\top\).
- Theoretical Guarantee: Proved that this scheme minimizes the output reconstruction error \(\|\mathbf{X}_v \mathbf{S}_v^{-1}(\Delta\mathbf{W} - \mathbf{L})\|_F^2\).
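The whitening-then-truncated-SVD construction can be sketched in numpy. This is our reading of the construction, not the released code: we assume the whitening transform comes from the eigendecomposition of the input Gram matrix \(\mathbf{X}^\top\mathbf{X} = \mathbf{P}\Lambda\mathbf{P}^\top\), and the residual \(\Delta\mathbf{W}\) and anisotropic inputs are synthetic toy data.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d_in, d_out, r = 256, 16, 16, 4

# Anisotropic toy inputs standing in for the smoothed vision activations
# X_v S_v^{-1}; channel scales vary over two orders of magnitude.
Xv = rng.normal(size=(n, d_in)) * np.logspace(0, 2, d_in)
dW = rng.normal(0, 0.1, size=(d_in, d_out))  # toy residual S_v W - Q(S_t W)

# Whitening transform T = (P Lambda^{1/2})^T from the eigendecomposition
# of the input Gram matrix X^T X = P Lambda P^T (assumed construction).
lam, P = np.linalg.eigh(Xv.T @ Xv)
T = (P * np.sqrt(lam)).T
T_inv = np.linalg.inv(T)

# Truncated SVD of the whitened residual, then fold T back in:
U, S, Vt = np.linalg.svd(T @ dW, full_matrices=False)
L1 = T_inv @ U[:, :r]          # L1 = T^{-1} U_r
L2 = S[:r, None] * Vt[:r]      # L2 = Sigma_r V_r^T

# Output-space reconstruction errors: whitened rank-r compensation vs.
# naive truncated SVD applied directly to dW.
err_white = np.linalg.norm(Xv @ (dW - L1 @ L2))
Un, Sn, Vtn = np.linalg.svd(dW, full_matrices=False)
err_naive = np.linalg.norm(Xv @ (dW - Un[:, :r] @ (Sn[:r, None] * Vtn[:r])))
print(err_white, err_naive)
```

The key identity is \(\|\mathbf{X}(\Delta\mathbf{W} - \mathbf{L})\|_F = \|\mathbf{T}(\Delta\mathbf{W} - \mathbf{L})\|_F\), so the truncated SVD of the whitened residual is the global rank-\(r\) minimizer of the output error, which direct SVD on \(\Delta\mathbf{W}\) is not.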
- Inference Pipeline:
- Text modality: \(\mathbf{Y} = Q(\mathbf{X}_t \mathbf{S}_t^{-1}) \cdot Q(\mathbf{S}_t \mathbf{W})\)
- Non-text modalities: \(\mathbf{Y} = Q(\mathbf{X}_m \mathbf{S}_m^{-1}) \cdot Q(\mathbf{S}_t \mathbf{W}) + \mathbf{X}_m \mathbf{S}_m^{-1} \cdot \mathbf{L}_1^m \mathbf{L}_2^m\)
- Only lightweight low-rank matrices need to be stored additionally; the primary weights remain a single quantized copy.
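The two inference paths above can be put together in a toy end-to-end sketch. Everything here is a stand-in: `quantize` is a generic symmetric fake-quantizer rather than a real INT8 kernel, the smoothing factors are random toy values rather than learned ones, and the compensation uses a plain truncated SVD of the residual for brevity (the paper's CMC additionally whitens it first).

```python
import numpy as np

rng = np.random.default_rng(2)
d_in, d_out, r = 16, 16, 8

def quantize(A, bits=8):
    """Symmetric per-tensor fake-quantization (round-to-nearest)."""
    scale = np.abs(A).max() / (2 ** (bits - 1) - 1)
    return np.round(A / scale) * scale

W = rng.normal(0, 0.5, size=(d_in, d_out))
s_t = np.exp(rng.normal(0, 0.3, size=d_in))        # toy text factors
s_v = np.exp(rng.normal(0, 0.3, size=d_in)) * 5.0  # toy vision factors

Wq = quantize(s_t[:, None] * W)  # the single stored weight: Q(S_t W)

# Vision residual dW = S_v W - Q(S_t W), compensated at rank r.
dW = s_v[:, None] * W - Wq
U, S, Vt = np.linalg.svd(dW, full_matrices=False)
L1, L2 = U[:, :r], S[:r, None] * Vt[:r]

X_v = rng.normal(0, 10.0, size=(8, d_in))
X_s = X_v / s_v                       # smoothed input X_v S_v^{-1}
Y_ref = X_v @ W                       # full-precision reference
Y_hat = quantize(X_s) @ Wq + X_s @ L1 @ L2  # quantized path + low-rank path
Y_no = quantize(X_s) @ Wq             # text-anchored weights, no compensation
print(np.abs(Y_hat - Y_ref).mean(), np.abs(Y_no - Y_ref).mean())
```

The low-rank path is a dense matmul against \(\mathbf{L}_1^m \mathbf{L}_2^m\) only, so the stored overhead and extra compute scale with \(r\), not with the weight matrix.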
Key Experimental Results¶
Main Results (Qwen2.5-VL Series)¶
| Method | Bits | MMMU | OCRBench | ScienceQA | TextVQA | Avg |
|---|---|---|---|---|---|---|
| FP16 | W16A16 | Baseline | Baseline | Baseline | Baseline | 100% |
| SmoothQuant | W8A8 | Notable drop | Drop | Drop | Drop | — |
| MASQuant | W8A8 | Best | Best | Best | Best | SOTA |
Cross-Architecture Validation¶
| Model Type | Description |
|---|---|
| Dual-modal VLM | Consistently outperforms SmoothQuant and AWQ on Qwen2.5-VL-3B/7B |
| Three-modal Omni | Equally effective on Qwen2.5-Omni-3B; audio modality also benefits |
Ablation Study¶
- MAS alone already significantly improves SQNR (Figure 2 validates Theorem 1).
- The low-rank approximation quality of CMC converges rapidly as rank increases.
- The low-rank property of the whitened residual is substantially stronger than that of the direct SVD baseline.
Highlights & Insights¶
- First formal definition of the "smoothing misalignment" problem in MLLM quantization, with a theoretical SQNR analysis (Theorem 1).
- Mathematical proof of the low-rank structure of cross-modal activation differences, providing theoretical guarantees for CMC (Theorem 2).
- The framework applies to both dual-modal (vision–text) and tri-modal (vision–text–audio) MLLMs.
- Maintains a single set of quantized weights with negligible additional storage overhead (low-rank matrices only).
- Consistently outperforms existing channel-wise smoothing PTQ methods on Qwen2.5-VL and Qwen2.5-Omni.
Detailed Ablation¶
- MAS only (without CMC): Requires storing separate quantized weights per modality, but achieves optimal quantization accuracy.
- CMC only (without smoothing correction): Limited remediation, as the underlying smoothing misalignment remains unresolved.
- MAS + CMC (full method): Approaches the accuracy upper bound of MAS under the single-weight constraint.
- CMC low-rank compensation: Rank 16–32 is generally sufficient to recover 90%+ of the accuracy gap.
- Singular value decay of whitened \(\mathbf{T}(\Delta\mathbf{W})\) is far faster than that of direct SVD, validating the low-rank assumption.
Limitations & Future Work¶
- The calibration stage requires collecting per-modality data to separately optimize smoothing factors, increasing preprocessing complexity.
- The rank \(r\) selection for low-rank compensation requires a trade-off between accuracy and additional storage.
- Only W8A8 and W4A8 settings are validated; performance under more aggressive low-bit configurations (e.g., W2A4) remains unknown.
- Inference on non-text modalities requires an additional matrix multiplication \(\mathbf{X}_m \mathbf{S}_m^{-1} \cdot \mathbf{L}_1^m \mathbf{L}_2^m\), introducing marginal latency overhead.
- The modality-aware idea could potentially be extended to rotation-based methods (e.g., QuaRot, SpinQuant).
- In tri-modal and higher settings, the number of low-rank compensation matrices grows linearly, requiring memory management optimization.
Implementation Details¶
- MAS optimization uses Adam and typically converges within 100–200 iterations.
- CMC low-rank matrices are stored in FP16; their memory footprint is negligible compared to the full weight matrices.