MSQ: Memory-Efficient Bit Sparsification Quantization¶
Conference: ICCV 2025 arXiv: 2507.22349 Institution: Sungkyunkwan University, University of Arizona Area: Model Compression / Quantization / Mixed-Precision Quantization Keywords: mixed-precision quantization, bit-level sparsity, quantization-aware training, memory-efficient, Hessian, model compression
TL;DR¶
MSQ achieves mixed-precision quantization by computing the least significant bit (LSB) of each weight directly from the floating-point weights via a RoundClamp quantizer and imposing L1 regularization to induce bit-level sparsity, without creating explicit bit-level trainable parameters. This reduces trainable parameters by 8× and training time by 86% while maintaining a competitive accuracy–compression trade-off.
Background & Motivation¶
Deploying DNNs on mobile/edge devices requires quantization. Uniform quantization suffers from accumulated noise in sensitive layers, while mixed-precision quantization yields better results but entails an enormous search space.
Limitations of existing methods:

1. Search-based methods (HAQ): high computational cost; they do not account for sensitivity changes during training.
2. Sensitivity-analysis methods (HAWQ): analyze only the pre-trained model, ignoring dynamic sensitivity changes during training.
3. Bit-level methods (BSQ/CSQ): treat each bit as an independent trainable variable, multiplying trainable parameters by \(n\) and causing GPU memory and training time to surge.
Core Problem¶
How to achieve effective bit-level sparsity without incurring the overhead of bit-level trainable parameters?
Method¶
Core Idea¶
Key observation: it is unnecessary to train each bit independently. The LSB is computed directly from the floating-point weights \(W\), and L1 regularization is applied to drive the LSB toward zero; once zeroed, the bit can be safely pruned, reducing precision.
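To see why zeroing the LSB drops a bit, note a standard identity (my notation, not the paper's): any \(n\)-bit integer code \(q\) splits into an \((n{-}1)\)-bit code and a single bit,

\[ q = 2m + b, \qquad m \in \{0, \dots, 2^{n-1} - 1\},\ b \in \{0, 1\}, \]

so once L1 regularization drives \(b \to 0\) for every weight in a layer, each code equals \(2m\) and the layer is exactly representable with \(n-1\) bits.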
Key Designs¶
- RoundClamp Quantizer (a code sketch follows this list):
- Standard STE quantization: \(W_n = \text{Round}[(2^n - 1) \cdot W] / (2^n - 1)\)
- MSQ bipartite bit-slicing: decomposes an \(n\)-bit value into an MSB part and an LSB part.
- RoundClamp: first rounds to \(n\)-bit, then clamps to the \((n{-}1)\)-bit range.
- The difference between Round and Clamp equals the contribution of the LSB.
- Gradients of the LSB are back-propagated to the floating-point weights \(W\) via STE.
- LSB Sparsification Regularization:
- \(\mathcal{L}_{\text{reg}} = \lambda \cdot \sum |\text{LSB}|\)
- L1 regularization drives the LSB toward zero.
- When the LSBs of all weights in a given layer approach zero, the layer can safely be reduced from \(n\)-bit to \((n{-}1)\)-bit.
- No independent bit variables need to be created.
- Hessian-Aware Aggressive Pruning:
- Hessian trace is used to estimate per-layer sensitivity.
- Insensitive layers are assigned faster bit-pruning rates.
- Multiple LSBs can be pruned simultaneously (e.g., directly from 4-bit to 2-bit).
- This substantially accelerates training.
- Complete Training Pipeline:
- Initialization: all layers start from high precision.
- Forward pass: RoundClamp quantization is applied.
- Loss: task loss plus the regularizer \(\mathcal{L}_{\text{reg}} = \lambda \cdot \sum |\text{LSB}|\).
- Periodic detection of LSB sparsity: the lowest bit is pruned when the sparsity exceeds a threshold.
- Hessian guidance enables aggressive pruning of insensitive layers.
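Below is a minimal PyTorch sketch of this pipeline, based on one plausible reading of the notes rather than the authors' code: `ste_round`, `roundclamp`, `lsb_l1`, `train_step`, and `maybe_drop_bit` are illustrative names, weights are assumed normalized to \([0, 1]\), and the \((n{-}1)\)-bit grid is taken to be the even codes of the \(n\)-bit grid.

```python
import torch

def ste_round(x: torch.Tensor) -> torch.Tensor:
    """Round in the forward pass; identity gradient in the backward pass (STE)."""
    return x + (torch.round(x) - x).detach()

def roundclamp(w: torch.Tensor, n: int):
    """Quantize w (assumed in [0, 1]) to n bits and extract the LSB as the
    residual against the (n-1)-bit sub-grid (the even codes 0, 2, ..., 2^n - 2)."""
    scale = 2 ** n - 1
    q = ste_round(w * scale)  # n-bit integer code; STE gradients flow to w
    # Nearest even code, clamped to the (n-1)-bit range; detached so that the
    # L1 term below pulls the float weight toward this coarser grid point.
    m = torch.clamp(torch.round(q / 2) * 2, 0.0, float(scale - 1)).detach()
    lsb = q - m               # LSB contribution, one of {-1, 0, +1}
    return q / scale, lsb     # dequantized weight, LSB residual

def lsb_l1(lsbs, lam: float = 1e-4):
    """L_reg = lambda * sum |LSB|; the value of lam is illustrative."""
    return lam * sum(l.abs().sum() for l in lsbs)

def train_step(model, layers, bits, x, y, task_loss, opt, lam: float = 1e-4):
    """One QAT step: quantize every layer at its current bit-width, add the
    LSB L1 term, and update the underlying float weights. `layers` and `bits`
    pair each quantized layer with its bit-width (a hypothetical interface)."""
    lsbs = []
    for layer, n in zip(layers, bits):
        layer.w_q, lsb = roundclamp(layer.weight, n)  # layer forward uses w_q
        lsbs.append(lsb)
    loss = task_loss(model(x), y) + lsb_l1(lsbs, lam)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss

@torch.no_grad()
def maybe_drop_bit(weight: torch.Tensor, n: int, thresh: float = 0.99) -> int:
    """Periodic check: if (almost) all LSBs are zero, reduce the layer to n-1 bits."""
    _, lsb = roundclamp(weight, n)
    return n - 1 if (lsb == 0).float().mean() >= thresh else n
```

With the \((n{-}1)\)-bit projection detached, the gradient of the penalty with respect to \(W\) is \(\lambda \cdot \mathrm{sign}(\text{LSB}) \cdot (2^n - 1)\), so each float weight is pulled toward its nearest coarser grid point, which is exactly the "drive the LSB to zero" behavior described above. Under Hessian guidance, an insensitive layer would simply take larger bit-reduction steps than `maybe_drop_bit`'s single-bit decrement.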
Core Differences from BSQ/CSQ¶
| Aspect | BSQ/CSQ | MSQ |
|---|---|---|
| Trainable Parameters | Independent variable per bit | Original floating-point weights only |
| Parameter Count | \(n\times\) original | \(1\times\) original |
| Multi-bit Pruning | 1 bit per step | Hessian-guided multi-bit pruning |
Key Experimental Results¶
Training Efficiency¶
| Method | Trainable Parameters | Training Time |
|---|---|---|
| BSQ | \(8\times\) original | Baseline |
| MSQ | \(1\times\) original | \(-86\%\) |
ResNet Accuracy–Compression Trade-off¶
| Method | Model | Top-1 (%) | Compression Ratio |
|---|---|---|---|
| BSQ | ResNet-18 | 69.2 | 8.0× |
| CSQ | ResNet-18 | 69.4 | 8.0× |
| MSQ | ResNet-18 | 69.1 | 8.0× |
- Accuracy is on par with BSQ/CSQ while training cost is substantially reduced.
Scalability¶
- First extension of bit-level quantization to ViT architectures (ViT-S/B).
- Extended to heterogeneous CNNs such as MobileNetV3.
Ablation Study¶
- RoundClamp vs. standard STE: RoundClamp provides a more accurate gradient direction for the LSB.
- Hessian-guided aggressive pruning: reduces training epochs by 30–40% under equivalent compression.
- L1 regularization strength: too large leads to accuracy degradation; too small slows convergence.
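For context, the per-layer Hessian trace used for sensitivity ranking is typically estimated matrix-free with Hutchinson's method (as in HAWQ); the sketch below is a generic estimator, not the paper's implementation.

```python
import torch

def hessian_trace(loss: torch.Tensor, params, n_samples: int = 8) -> float:
    """Hutchinson estimator: E[v^T H v] = tr(H) for Rademacher vectors v.
    `loss` must be differentiable w.r.t. `params` (a list of tensors)."""
    grads = torch.autograd.grad(loss, params, create_graph=True)
    est = 0.0
    for _ in range(n_samples):
        # Rademacher probe: each entry is +1 or -1 with equal probability
        vs = [torch.randint_like(p, high=2) * 2.0 - 1.0 for p in params]
        # Hessian-vector product via double backprop
        hv = torch.autograd.grad(grads, params, grad_outputs=vs, retain_graph=True)
        est += sum((v * h).sum().item() for v, h in zip(vs, hv))
    return est / n_samples
```

Layers with a small estimated trace are the "insensitive" ones that qualify for the aggressive multi-bit pruning above.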
Highlights & Insights¶
- Eliminating bit-level parameter overhead is the core contribution: it fundamentally resolves BSQ's critical limitation of \(n\)-fold parameter inflation.
- Elegant RoundClamp quantizer design: the LSB arises as the difference between Round and Clamp, which is mathematically natural.
- Hessian-guided multi-bit pruning: accelerates training while leveraging per-layer sensitivity information.
- Strong scalability: first validation of bit-level quantization on ViT and MobileNetV3.
- Practical engineering value: the substantial improvement in training efficiency makes mixed-precision quantization genuinely viable in resource-constrained scenarios.
Limitations & Future Work¶
- Accuracy is marginally lower than BSQ/CSQ (0.1–0.3%), indicating that fine-grained control via bit-level parameters retains some value.
- L1 may not be the optimal sparsification objective; Group Lasso warrants exploration.
- Hessian computation incurs overhead, albeit far smaller than the time saved.
- Validation is limited to ImageNet classification; performance on detection and segmentation tasks remains unknown.
- Only weight quantization is addressed; activation quantization is not considered.
Related Work & Insights¶
- vs. HAQ: search-based with high computational cost and a static view of layer sensitivity; MSQ discovers the precision assignment dynamically during training.
- vs. BSQ: Pioneering bit-level approach but suffers from severe parameter inflation; MSQ preserves the advantages while eliminating the overhead.
- vs. CSQ: Smooths BSQ but does not address the parameter inflation problem.
The insight that "each bit need not be trained independently" suggests that many problems seemingly requiring fine-grained parameters admit more efficient indirect solutions. The RoundClamp construction (difference between two quantization operations as a useful signal) offers inspiration for other quantization methods. The work has direct engineering value for on-device quantization-aware training.
Rating¶
- Novelty: ⭐⭐⭐⭐ RoundClamp and LSB regularization are novel, though the overall approach is a natural extension of BSQ.
- Experimental Thoroughness: ⭐⭐⭐⭐ Thorough efficiency comparisons and multi-architecture validation, but downstream task evaluation is absent.
- Writing Quality: ⭐⭐⭐⭐ Comparison with BSQ/CSQ is clear; Figures 1 and 2 are informative.
- Value: ⭐⭐⭐⭐⭐ An 8× parameter reduction and 86% training time reduction make bit-level quantization genuinely practical.