QuartDepth: Post-Training Quantization for Real-Time Depth Estimation on the Edge

Conference: CVPR 2025
arXiv: 2503.16709
Code: GitHub
Area: 3D Vision
Keywords: Post-training quantization, depth estimation, edge deployment, ASIC accelerator, 4-bit quantization

TL;DR

This paper proposes QuartDepth, a post-training quantization framework that quantizes foundation depth estimation models to W4A4/W4A8 for ASIC edge devices. It combines LogNP activation polishing (transforming outlier-heavy, non-normal activation distributions into quantization-friendly ones), activation quantization compensation (updating weights to offset activation quantization error), and Fisher information-guided weight reconstruction. The authors also design a programmable hardware accelerator that achieves real-time inference.

Background & Motivation

Foundation depth estimation models (such as Metric3D and DepthAnything) achieve excellent performance but require massive computation, making deployment on edge devices challenging.
ASICs are ideal platforms for edge deployment, but they require low bit-width quantization (e.g., 4-bit) to fully utilize the hardware bandwidth.
Training datasets for large foundation monocular depth estimation (MDE) models are massive, making full-model retraining impractical; post-training quantization (PTQ) is therefore required.
The decoders of depth estimation models exhibit severe activation outliers: outlier magnitudes differ greatly across channels, and the distributions deviate significantly from normality.
Per-tensor quantization cannot cope with this cross-channel variability of outliers, and even with per-channel quantization, outliers still cause significant quantization error.
Existing PTQ methods primarily target classification or language models and do not address the distribution characteristics specific to depth estimation models.
Matrix multiplications and convolutions account for the vast majority of inference time (roughly 90% or more), so quantizing these operations is critical.
There is a lack of specialized hardware accelerator designs tailored for quantized depth estimation models.
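The two quantization-granularity claims above can be illustrated on synthetic data. This is an illustration only, assuming symmetric max-abs 4-bit quantization and Gaussian activations stored as (channels, samples); the helper names (`fake_quant`, `per_tensor`, `per_channel`) are hypothetical and not from the paper.

```python
import numpy as np

def fake_quant(x, scale, bits=4):
    """Symmetric uniform quantize-dequantize with a broadcastable scale."""
    qmax = 2 ** (bits - 1) - 1                      # 7 for signed 4-bit
    return np.clip(np.round(x / scale), -qmax - 1, qmax) * scale

def per_tensor(x, bits=4):
    """One max-abs scale shared by the whole tensor."""
    qmax = 2 ** (bits - 1) - 1
    return fake_quant(x, np.abs(x).max() / qmax, bits)

def per_channel(x, bits=4):
    """One max-abs scale per row (channel) of a (channels, samples) array."""
    qmax = 2 ** (bits - 1) - 1
    return fake_quant(x, np.abs(x).max(axis=1, keepdims=True) / qmax, bits)

def mse(a, b):
    return float(np.mean((a - b) ** 2))

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 4096))
x[0] *= 100.0                                       # one channel with a much larger range

# Error on the small-range channels 1..7 under each granularity
err_tensor = mse(x[1:], per_tensor(x)[1:])
err_channel = mse(x[1:], per_channel(x)[1:])

# Per-channel scaling still struggles when a single outlier sits inside a channel
y = rng.standard_normal((1, 4096))
y_out = y.copy()
y_out[0, 0] = 50.0
err_clean = mse(y, per_channel(y))
err_outlier = mse(y_out[:, 1:], per_channel(y_out)[:, 1:])   # exclude the outlier itself

print(f"per-tensor {err_tensor:.4f}  per-channel {err_channel:.4f}")
print(f"per-channel, no outlier {err_clean:.4f}  with one outlier {err_outlier:.4f}")
```

With these settings, per-tensor scaling rounds almost every value in the small-range channels to zero, and a single outlier does the same to the rest of its own channel under per-channel scaling.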

Method

Overall Architecture

The QuartDepth pipeline consists of three steps: (1) First, apply LogNP polishing to transform the distribution of activation values, followed by activation quantization; (2) Update weights to compensate for the error introduced by activation quantization; (3) Quantize the updated weights using Fisher information-guided AdaRound. Concurrently, a flexible hardware accelerator supporting kernel fusion and custom instruction programmability is designed, containing dedicated compute cores for W4A4/W4A8 and a programmable vector computation array.

Key Designs

Design 1: LogNP Activation Polishing
- Function: Transforms the anomalous, outlier-heavy activation distributions in the depth estimation decoder into quantization-friendly, approximately normal distributions.
- Mechanism: For the activations \(\mathbf{x}_i\) of each channel \(i\), a signed logarithmic transform \(\Phi(x, \alpha) = \text{sign}(x) \cdot [\log_2(|x| + \alpha) - \log_2(\alpha)]\) is applied, where the polishing factor \(\alpha_i = P_\epsilon(\mathbf{x}_i)\) is the \(\epsilon\)-th percentile of the channel's activations, with \(\epsilon = 95\). The polished values are quantized and then recovered with the inverse transform \(\Phi^{-1}(y, \alpha) = \text{sign}(y) \cdot \alpha\,(2^{|y|} - 1)\). LogNP compresses outliers while preserving resolution for the bulk of the distribution.
- Design Motivation: Directly quantizing activations that contain outliers loses substantial information, because the outliers stretch the quantization range and leave few levels for typical values. A logarithmic transform compresses large magnitudes and expands small ones, making the distribution more concentrated and symmetric.
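The polishing step can be sketched in NumPy as follows. This is a minimal sketch, not the authors' implementation: it assumes activations laid out as (channels, samples), takes the percentile over absolute values so that \(\alpha_i > 0\), and quantizes the polished tensor with a per-tensor symmetric 4-bit quantizer. `lognp_forward`, `lognp_inverse`, and `lognp_quantize` are hypothetical names.

```python
import numpy as np

def lognp_forward(x, alpha):
    """Polish: sign(x) * (log2(|x| + alpha) - log2(alpha))."""
    return np.sign(x) * (np.log2(np.abs(x) + alpha) - np.log2(alpha))

def lognp_inverse(y, alpha):
    """Exact inverse of lognp_forward: sign(y) * alpha * (2**|y| - 1)."""
    return np.sign(y) * alpha * (np.exp2(np.abs(y)) - 1.0)

def quantize_symmetric(x, bits=4):
    """Per-tensor symmetric uniform quantize-dequantize (illustrative choice)."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(x).max() / qmax
    return np.clip(np.round(x / scale), -qmax - 1, qmax) * scale

def lognp_quantize(x, eps=95.0, bits=4):
    """x: (channels, samples). Polish each channel, quantize, map back."""
    # Assumption: percentile of |x| per channel, floored to keep alpha > 0
    alpha = np.maximum(np.percentile(np.abs(x), eps, axis=1, keepdims=True), 1e-8)
    y = lognp_forward(x, alpha)
    return lognp_inverse(quantize_symmetric(y, bits), alpha)

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 4096))
x[0] *= 100.0                           # one channel with far larger magnitudes

direct = quantize_symmetric(x)          # plain per-tensor quantization
polished = lognp_quantize(x)            # LogNP polish -> quantize -> inverse
# Compare reconstruction error on the small-range channels 1..7
mse_direct = float(np.mean((x[1:] - direct[1:]) ** 2))
mse_lognp = float(np.mean((x[1:] - polished[1:]) ** 2))
print(f"direct {mse_direct:.4f}  LogNP {mse_lognp:.4f}")
```

Because polishing maps every channel into a similar log-domain range, a single per-tensor scale serves all channels, whereas direct per-tensor quantization of the same data zeroes out the small-range channels.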

Design 2: Activation Quantization Error Compensation
- Function: Updates the weights to minimize the output error introduced by activation quantization.
- Mechanism: For each layer, with \(\mathbf{x}\) and \(\hat{\mathbf{x}}\) collecting the full-precision and quantized calibration activations (one sample per column), the problem \(\min_{\Delta\mathbf{W}} \|\mathbf{W}\mathbf{x} - (\mathbf{W} + \Delta\mathbf{W})\hat{\mathbf{x}}\|_2^2\) is solved in closed form: \(\Delta\mathbf{W}^* = \mathbf{W}(\mathbf{x} - \hat{\mathbf{x}})\hat{\mathbf{x}}^T(\hat{\mathbf{x}}\hat{\mathbf{x}}^T)^{-1}\). When \(\hat{\mathbf{x}}\hat{\mathbf{x}}^T\) is not full rank, a damping technique (typically adding a small value to its diagonal) is used before inversion.
- Design Motivation: By handling activation quantization and weight quantization separately, this design compensates for the activation error before the weights are quantized, reducing the error accumulated over the two stages.
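Setting the gradient of \(\|\mathbf{W}(\mathbf{x} - \hat{\mathbf{x}}) - \Delta\mathbf{W}\hat{\mathbf{x}}\|^2\) to zero gives the normal equation \(\Delta\mathbf{W}\,\hat{\mathbf{x}}\hat{\mathbf{x}}^T = \mathbf{W}(\mathbf{x} - \hat{\mathbf{x}})\hat{\mathbf{x}}^T\), which the sketch below solves with a linear solve rather than an explicit inverse. Assumptions: the relative damping rule (`damp` times the mean diagonal of \(\hat{\mathbf{x}}\hat{\mathbf{x}}^T\)) and the clipped rounding used to fake activation quantization are illustrative choices, and `compensate_weights` is a hypothetical name.

```python
import numpy as np

def compensate_weights(W, X, X_hat, damp=0.01):
    """Closed-form dW for min ||W X - (W + dW) X_hat||_F^2.

    W: (d_out, d_in); X, X_hat: (d_in, N) full-precision / quantized activations.
    damp: relative diagonal damping (assumed rule; keeps the system invertible).
    """
    H = X_hat @ X_hat.T
    H = H + damp * np.mean(np.diag(H)) * np.eye(H.shape[0])
    R = W @ (X - X_hat) @ X_hat.T
    # dW H = R  <=>  H dW^T = R^T  (H is symmetric); solve instead of inverting
    return np.linalg.solve(H, R.T).T

rng = np.random.default_rng(0)
W = rng.standard_normal((16, 32))
X = rng.standard_normal((32, 512))
X_hat = np.clip(np.round(X * 2) / 2, -1.5, 1.5)   # crude stand-in for activation quantization

dW = compensate_weights(W, X, X_hat)
err_before = np.linalg.norm(W @ X - W @ X_hat)
err_after = np.linalg.norm(W @ X - (W + dW) @ X_hat)
print(f"output error: {err_before:.2f} -> {err_after:.2f}")
```

With damping, the update is a ridge-regression solution, so the compensated output error can never exceed the uncompensated one.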

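Design 3: Fisher Information-Guided Weight Reconstruction
- Function: Uses second-order information to minimize the loss degradation caused by weight quantization.
- Mechanism: Assuming the pretrained model is converged, the first-order term of a Taylor expansion is approximately zero, so the impact of the weight quantization error \(\Delta\mathbf{w}\) on the loss is approximated as \(\frac{1}{2}\Delta\mathbf{w}^T\mathbf{H}_\mathbf{w}\Delta\mathbf{w}\). The Hessian is replaced by the Fisher information matrix, whose layer-wise block is approximated with KFAC as \(\mathbf{F}_l = \mathbf{G}_l \otimes \mathbf{A}_l\), where (as in standard KFAC) \(\mathbf{G}_l\) and \(\mathbf{A}_l\) are second-moment matrices of the layer's output gradients and input activations. This serves as the AdaRound objective: the rounding parameters \(\mathbf{v}\) are learned by minimizing \(\sum_l (\mathbf{w}^{(l)} - \hat{\mathbf{w}}^{(l)})^T\mathbf{F}_l(\mathbf{w}^{(l)} - \hat{\mathbf{w}}^{(l)}) + \lambda h(\mathbf{v})\), where \(h(\mathbf{v})\) is a regularizer that pushes each weight toward a definite round-up or round-down decision.
- Design Motivation: Round-to-nearest ignores that different weights affect the loss to different degrees. The Fisher matrix is positive semi-definite and can be estimated from first-order gradients, and its Kronecker factorization makes it far cheaper to compute and store than the full Hessian.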
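The Kronecker structure is what keeps the Fisher term cheap: with \(\Delta\mathbf{W}\) flattened row by row, \(\text{vec}(\Delta\mathbf{W})^T(\mathbf{G}_l \otimes \mathbf{A}_l)\,\text{vec}(\Delta\mathbf{W}) = \text{tr}(\mathbf{G}_l\,\Delta\mathbf{W}\,\mathbf{A}_l\,\Delta\mathbf{W}^T)\) for symmetric \(\mathbf{G}_l\) and \(\mathbf{A}_l\), so the \((d_\text{out} d_\text{in})^2\) matrix never has to be formed. The sketch below checks this identity on random data; the random factors stand in for real gradient and activation statistics, and `kfac_weight_error` and `random_spd` are hypothetical helpers.

```python
import numpy as np

def kfac_weight_error(W, W_hat, G, A):
    """Fisher-weighted error vec(dW)^T (G kron A) vec(dW), computed as trace(G dW A dW^T).

    W, W_hat: (d_out, d_in); G: (d_out, d_out) output-gradient second moment;
    A: (d_in, d_in) input-activation second moment. Both assumed symmetric.
    """
    D = W - W_hat
    return float(np.trace(G @ D @ A @ D.T))

def random_spd(n, rng):
    """Random symmetric positive-definite matrix standing in for a KFAC factor."""
    M = rng.standard_normal((n, n))
    return M @ M.T / n + 1e-3 * np.eye(n)

rng = np.random.default_rng(0)
d_out, d_in = 6, 5
W = rng.standard_normal((d_out, d_in))
scale = np.abs(W).max() / 7
W_rtn = np.round(W / scale) * scale            # round-to-nearest on a 4-bit grid
G, A = random_spd(d_out, rng), random_spd(d_in, rng)

fast = kfac_weight_error(W, W_rtn, G, A)
d = (W - W_rtn).reshape(-1)                    # row-major (C-order) flattening
explicit = float(d @ np.kron(G, A) @ d)        # forms the full Kronecker product
print(fast, explicit)
```

The ordering \(\mathbf{G}_l \otimes \mathbf{A}_l\) matches row-major flattening; with column-major flattening the same quadratic form is written \(\mathbf{A}_l \otimes \mathbf{G}_l\).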

Loss & Training

Activation compensation: minimize \(\|\mathbf{W}\mathbf{x} - (\mathbf{W} + \Delta\mathbf{W})\hat{\mathbf{x}}\|_2^2\), solved layer by layer in closed form. Weight reconstruction: minimize the Fisher-weighted AdaRound objective plus a regularization term, learning each weight's rounding direction by gradient-based optimization.
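A single-layer version of the weight-reconstruction step can be sketched with hand-derived gradients. This is a minimal sketch under stated assumptions, not the paper's code: the Fisher term uses the KFAC form \(\text{tr}(\mathbf{G}\,\mathbf{D}\,\mathbf{A}\,\mathbf{D}^T)\) with \(\mathbf{D} = \mathbf{W} - \hat{\mathbf{W}}\); the λ-weighted term is taken to be AdaRound's regularizer \(\sum(1 - |2h - 1|^\beta)\) over a rectified sigmoid (ζ = 1.1, γ = -0.1); β is held fixed instead of annealed; and `fisher_adaround` and its hyperparameters (`lam`, `beta`, `lr`, `steps`) are hypothetical placeholders.

```python
import numpy as np

ZETA, GAMMA = 1.1, -0.1   # rectified-sigmoid stretch used in the AdaRound paper

def sigmoid(v):
    return 0.5 * (1.0 + np.tanh(0.5 * v))       # numerically stable logistic

def rectified_sigmoid(v):
    return np.clip(sigmoid(v) * (ZETA - GAMMA) + GAMMA, 0.0, 1.0)

def fisher_adaround(W, G, A, scale, lam=0.01, beta=2.0, lr=0.05, steps=300, bits=4):
    """Learn offsets h(V) in [0, 1] minimizing
    tr(G D A D^T) + lam * sum(1 - |2h - 1|**beta),  D = W - scale * (floor(W/scale) + h).
    """
    w_floor = np.floor(W / scale)
    frac = W / scale - w_floor
    # Initialize V so that the rectified sigmoid reproduces the fractional part
    p = np.clip((frac - GAMMA) / (ZETA - GAMMA), 1e-4, 1 - 1e-4)
    V = np.log(p / (1.0 - p))
    for _ in range(steps):
        s = sigmoid(V)
        raw = s * (ZETA - GAMMA) + GAMMA
        h = np.clip(raw, 0.0, 1.0)
        D = W - scale * (w_floor + h)
        grad_h = -2.0 * scale * (G @ D @ A)      # d tr(G D A D^T) / dh for symmetric G, A
        t = 2.0 * h - 1.0
        grad_h += lam * (-2.0 * beta * np.abs(t) ** (beta - 1.0) * np.sign(t))
        active = (raw > 0.0) & (raw < 1.0)       # the clip passes no gradient
        V -= lr * grad_h * (ZETA - GAMMA) * s * (1.0 - s) * active
    qmax = 2 ** (bits - 1) - 1
    codes = np.clip(w_floor + (rectified_sigmoid(V) >= 0.5), -qmax - 1, qmax)
    return codes * scale

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 16))
M1, M2 = rng.standard_normal((8, 8)), rng.standard_normal((16, 16))
G, A = M1 @ M1.T / 8, M2 @ M2.T / 16             # stand-ins for KFAC factors
scale = np.abs(W).max() / 7
W_q = fisher_adaround(W, G, A, scale)
```

Rounding decisions are finalized by thresholding \(h\) at 0.5, so every weight ends up rounded either down or up from its floor value, and codes are clipped to the signed 4-bit range [-8, 7].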

Key Experimental Results

Main Results: Quantization Comparison of Depth Estimation on KITTI/NYUv2

Model	Quantization Configuration	NYUv2 \(\delta_1\)↑	NYUv2 AbsRel↓	KITTI \(\delta_1\)↑	KITTI AbsRel↓
Metric3D (ViT-L)	W32A32 (FP32)	0.977	0.064	0.975	0.052
Metric3D (ViT-L)	W8A8	0.975	0.065	0.974	0.053
Metric3D (ViT-L)	W4A8	0.970	0.069	0.970	0.056
Metric3D (ViT-L)	W4A4	0.960	0.076	0.963	0.061
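The table reports standard MDE metrics. The sketch below uses their conventional definitions (AbsRel as mean absolute relative error; \(\delta_1\) as the fraction of pixels whose prediction-to-ground-truth ratio is within 1.25); the paper's evaluation details, such as depth caps, crops, or scale alignment, are not described here and are omitted. The function names are hypothetical.

```python
import numpy as np

def abs_rel(pred, gt):
    """Mean absolute relative error: mean(|pred - gt| / gt)."""
    return float(np.mean(np.abs(pred - gt) / gt))

def delta1(pred, gt, thresh=1.25):
    """Fraction of pixels with max(pred/gt, gt/pred) < thresh."""
    ratio = np.maximum(pred / gt, gt / pred)
    return float(np.mean(ratio < thresh))

def evaluate(pred, gt):
    """Evaluate only pixels that have valid (positive) ground-truth depth."""
    valid = gt > 0
    return delta1(pred[valid], gt[valid]), abs_rel(pred[valid], gt[valid])

pred = np.array([1.1, 2.0, 6.0, 3.0])
gt = np.array([1.0, 2.0, 4.0, 0.0])   # last pixel has no ground truth
d1, rel = evaluate(pred, gt)
print(d1, rel)
```

Here the three valid pixels have ratios 1.1, 1.0, and 1.5, giving \(\delta_1 = 2/3\), and AbsRel = (0.1 + 0 + 0.5) / 3 = 0.2.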

Ablation Study: Contribution of Each Component (Metric3D ViT-L, W4A4, NYUv2 \(\delta_1\)↑)

Method	\(\delta_1\)↑	AbsRel↓
Baseline (Direct Quantization)	0.891	0.118
+ LogNP polishing	0.938	0.088
+ Activation Compensation	0.949	0.081
+ Fisher Weight Reconstruction	0.960	0.076

Key Findings

In the W4A8 configuration, NYUv2 \(\delta_1\) drops by only 0.007 (0.977 \(\rightarrow\) 0.970), and in W4A4 by 0.017 (0.977 \(\rightarrow\) 0.960).
LogNP polishing is the most critical component: it raises \(\delta_1\) from 0.891 with direct quantization to 0.938 (+0.047), the largest single gain in the ablation.
The ASIC hardware achieves real-time inference of 30+ FPS on the ViT-L model.
Compared to AdaRound using only the MSE objective, Fisher-guided weight reconstruction delivers more accurate quantization.

Highlights & Insights

Intuitiveness of the LogNP Transform: The logarithmic transform is naturally suited to long-tailed distributions, and the percentile-based adaptive polishing factor is simple and effective.
Step-by-Step Decoupled Quantization Pipeline: Polishing \(\rightarrow\) activation compensation \(\rightarrow\) weight quantization are applied in sequence, each step backed by a clear mathematical derivation.
Hardware-Software Co-design: The computational overhead of LogNP polishing is completely hidden by the parallel execution of the programmable vector computation array.
Generality: Applicable to multiple ViT-based depth estimation models (such as Metric3D and DepthAnything).

Limitations & Future Work

W4A4 still incurs a ~2% \(\delta_1\) loss in certain scenarios, indicating that precision-sensitive applications may require W4A8.
Currently, only linear and convolutional layers are quantized, while non-linear operations such as softmax in attention layers remain in floating-point format.
The ASIC design is tailored to specific models, which limits its general applicability.
Future work could explore mixed-precision quantization strategies or integrate knowledge distillation to further reduce accuracy degradation.

Related Work & Insights

Unlike SmoothQuant, which migrates quantization difficulty from activations to weights through per-channel scaling, LogNP directly transforms the activation distribution.
Fisher-guided weight reconstruction incorporates ideas from OBQ/GPTQ while avoiding the high overhead of column-by-column solving.
The closed-form solution design for activation compensation can be generalized to the PTQ workflows of other models.

Rating

⭐⭐⭐⭐ — A systematic quantization framework that forms a complete closed loop from problem analysis to solution and hardware design; LogNP polishing is a valuable technical contribution.