# Quantized Visual Geometry Grounded Transformer
- Conference: ICLR 2026
- arXiv: 2509.21302
- Code: https://github.com/wlfeng0509/QuantVGGT
- Area: 3D Vision / Model Compression
- Keywords: VGGT, post-training quantization, 3D reconstruction, Hadamard rotation, calibration
## TL;DR
To address the deployment demands of the billion-scale 3D reconstruction model VGGT, this paper proposes QuantVGGT, the first dedicated post-training quantization (PTQ) framework for VGGT. It tames the heavy-tailed activation distributions caused by special tokens via dual-smoothed fine-grained quantization (Hadamard rotation + channel-wise smoothing), and addresses calibration instability via noise-filtered diverse sampling. At 4-bit, the method achieves 3.7× memory compression and a 2.5× inference speedup while retaining over 98% of full-precision accuracy.
## Background & Motivation
Background: VGGT is a unified 3D reconstruction model with 1.2B parameters that performs depth estimation, point map regression, camera pose prediction, and point tracking in a single forward pass. Despite its outstanding performance, its computational and memory overhead is substantial, limiting practical deployment.
Limitations of Prior Work: While PTQ is well-established for LLMs and 2D vision models, applying it to VGGT introduces two unique challenges: (1) data-independent special tokens (camera/register tokens) induce extreme heavy-tailed activation distributions; (2) the semantic complexity of 3D multi-view data makes calibration sample selection highly unstable.
Key Challenge: Special tokens are a critical design element for VGGT's multi-task reasoning, yet the distributional gap between these tokens and regular image tokens causes quantization bits to be wasted on outliers.
Goal: Design a VGGT-specific PTQ scheme that preserves reconstruction accuracy under low-bit quantization.
Key Insight: Distribution analysis reveals that special tokens are the root cause of heavy tails, and inter-frame relationships in multi-view data are the key structural factor for calibration.
Core Idea: Global Hadamard rotation disperses the spikes from special tokens, followed by local channel-wise smoothing to reduce residual variance after rotation, combined with frame-aware diverse sampling to construct a robust calibration set.
## Method
### Overall Architecture
QuantVGGT comprises two core components: (1) DSFQ (Dual-Smoothed Fine-Grained Quantization) — applies global Hadamard rotation to smooth heavy-tailed distributions, followed by local channel-wise scaling to reduce inter-channel variance, with fine-grained quantization granularity; (2) NFDS (Noise-Filtered Diverse Sampling) — filters anomalous samples using deep-layer activation statistics, and constructs a diverse calibration set via frame-aware correlation clustering.
### Key Designs
- Pre-Global Rotation (Global Hadamard Rotation), sketched in code after this list:
- Function: Disperses activation spikes caused by special tokens.
- Mechanism: Simultaneously right-multiplies both the activation \(\mathbf{X}\) and the weight \(\mathbf{W}\) by a random orthogonal Hadamard matrix \(\mathbf{H}\); since \(\mathbf{H}\mathbf{H}^\top = \mathbf{I}\), the output is unchanged, \(\mathbf{XW}^\top = (\mathbf{XH})(\mathbf{WH})^\top\), while the mixing leverages a central-limit effect to make the heavy-tailed distribution approximately Gaussian.
- Design Motivation: The Hadamard transform uniformly redistributes extreme values concentrated in a few channels across all channels.
- Post-Local Smooth (Local Channel-wise Smoothing):
- Function: Reduces residual inter-channel variance after rotation.
- Mechanism: Computes per-channel scaling factors in the rotated space, \(\hat{c}_i = \frac{\max(|(\mathbf{XH})_i|)^\alpha}{\max(|(\mathbf{WH})_i|)^{1-\alpha}}\), where \(i\) indexes channels of the rotated tensors and \(\alpha=0.5\).
- Design Motivation: Rotation only disperses global spikes without eliminating local inter-channel discrepancies. Applying smoothing after rotation is more stable than the reverse order, as prior smoothing would be disrupted by the subsequent rotation.
- Fine-Grained Quantization Granularity:
- Function: Reduces quantization error by using finer quantization groups.
- Mechanism: Weights are quantized along the \(d_{out}\) dimension; activations are quantized along the token dimension, exploiting the fact that matrix multiplication accumulates only along \(d_{in}\).
- Design Motivation: \(\mu\)-coherence theory shows that finer-grained quantization significantly reduces quantization difficulty.
- Noise-Filtered Diverse Sampling (NFDS), sketched in the second code example after this list:
- Function: Constructs a robust calibration dataset.
- Mechanism: A two-stage pipeline — (a) computes a noise score for each sample from deep-layer activation statistics (L2 norm of z-scores of mean and variance), filtering out high-scoring anomalous samples; (b) leverages VGGT's inter-frame correlations (normalized similarity vectors \(c_t^i\) between the first frame and subsequent frames) for K-means clustering, then uniformly samples from clusters to form the calibration set.
- Design Motivation: Theorem 3.2 proves that calibration sets should sample sub-regions of the data space proportionally; inter-frame relationships are central to VGGT's inductive bias.
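The following is a minimal, self-contained sketch of the DSFQ pipeline on toy tensors. It assumes plain PyTorch, illustrative shapes, a deterministic Sylvester Hadamard construction (the paper uses a random Hadamard matrix), and a simple symmetric fake-quantizer; it is an instructional sketch, not the released implementation.

```python
import torch

def hadamard(n: int) -> torch.Tensor:
    """Sylvester construction of an n x n Hadamard matrix (n a power of
    two), scaled by 1/sqrt(n) so that H @ H.T == I (orthogonal)."""
    assert n & (n - 1) == 0, "n must be a power of two"
    H = torch.ones(1, 1)
    while H.shape[0] < n:
        H = torch.cat([torch.cat([H, H], dim=1),
                       torch.cat([H, -H], dim=1)], dim=0)
    return H / n ** 0.5

def fake_quant(x: torch.Tensor, bits: int, dim: int) -> torch.Tensor:
    """Symmetric fake quantization with one scale per slice along `dim`:
    per-token for activations, per-output-channel for weights."""
    qmax = 2 ** (bits - 1) - 1
    scale = x.abs().amax(dim=dim, keepdim=True).clamp(min=1e-8) / qmax
    return (x / scale).round().clamp(-qmax - 1, qmax) * scale

# Toy shapes: T tokens, d_in input channels, d_out output channels.
T, d_in, d_out = 16, 64, 32
X = torch.randn(T, d_in)
X[:5] *= 10.0                      # stand-in for heavy-tailed special tokens
W = torch.randn(d_out, d_in)

# (1) Pre-global rotation: right-multiply X and W by the orthogonal H.
# Because H @ H.T == I, the layer's output is mathematically unchanged.
H = hadamard(d_in)
Xr, Wr = X @ H, W @ H
assert torch.allclose(Xr @ Wr.T, X @ W.T, atol=1e-3)

# (2) Post-local smoothing in the rotated space with alpha = 0.5:
# c_i = max|XH|_i^alpha / max|WH|_i^(1 - alpha), folded inversely into
# activations and weights so the product is again unchanged.
alpha = 0.5
c = Xr.abs().amax(dim=0) ** alpha / Wr.abs().amax(dim=0) ** (1 - alpha)
Xs, Ws = Xr / c, Wr * c

# (3) Fine-grained fake quantization: per-token rows for activations,
# per-output-channel rows for weights (accumulation runs only over d_in).
Xq = fake_quant(Xs, bits=4, dim=1)
Wq = fake_quant(Ws, bits=4, dim=1)
rel_err = (Xq @ Wq.T - X @ W.T).norm() / (X @ W.T).norm()
print(f"relative output error at W4A4: {rel_err:.3f}")
```

Note that steps (1) and (2) are exact rewrites of the same linear layer; only step (3) introduces error, which the per-token/per-channel granularity keeps small.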
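Similarly, a sketch of the two NFDS stages under assumed shapes and placeholder hyperparameters (the 20% filter ratio, k = 8 clusters, 4 samples per cluster, and the mean/variance pooling of deep-layer features are illustrative, not the paper's settings):

```python
import torch

def noise_scores(feats: torch.Tensor) -> torch.Tensor:
    """Noise score per sample from deep-layer activation statistics:
    z-score each sample's activation mean and variance across the pool,
    then take the L2 norm of the two z-scores (larger = more anomalous)."""
    mu, var = feats.mean(dim=1), feats.var(dim=1)
    z_mu = (mu - mu.mean()) / (mu.std() + 1e-8)
    z_var = (var - var.mean()) / (var.std() + 1e-8)
    return torch.stack([z_mu, z_var], dim=1).norm(dim=1)

def kmeans(x: torch.Tensor, k: int, iters: int = 20) -> torch.Tensor:
    """Plain k-means; returns a cluster id per row of x."""
    centers = x[torch.randperm(x.shape[0])[:k]]
    for _ in range(iters):
        assign = torch.cdist(x, centers).argmin(dim=1)
        for j in range(k):
            if (assign == j).any():
                centers[j] = x[assign == j].mean(dim=0)
    return assign

# feats: pooled deep-layer activations, one row per candidate sequence.
# sims:  per-sequence vector of normalized first-frame-to-frame-t
#        similarities (the frame-aware descriptor used for clustering).
N, d, t = 256, 32, 8
feats, sims = torch.randn(N, d), torch.rand(N, t)

# Stage (a): keep the 80% of samples with the lowest noise scores.
keep = noise_scores(feats).argsort()[: int(0.8 * N)]

# Stage (b): cluster the kept samples by their frame-similarity vectors,
# then sample uniformly across clusters to form the calibration set.
k, per_cluster = 8, 4
assign = kmeans(sims[keep], k)
calib_ids = torch.cat([keep[assign == j][:per_cluster] for j in range(k)])
print(f"calibration set size: {calib_ids.numel()}")
```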
## Key Experimental Results
### Main Results (Camera Pose Estimation on CO3Dv2)
| Configuration | W/A bit | Accuracy Retention | Memory Compression | Speedup |
|---|---|---|---|---|
| Full FP16 | 16/16 | 100% | 1× | 1× |
| W8A8 QuantVGGT | 8/8 | ~99% | 2× | 1.5× |
| W4A4 QuantVGGT | 4/4 | ~98% | 3.7× | 2.5× |
| W4A4 SmoothQuant | 4/4 | ~85% | 3.7× | 2.5× |
| W4A4 QuaRot | 4/4 | ~90% | 3.7× | 2.5× |
### Ablation Study
| Component | Accuracy Change | Notes |
|---|---|---|
| Hadamard rotation only | +5% vs. naive | Disperses spikes |
| + Channel-wise smoothing | +3% | Reduces residual variance |
| + Fine-grained quantization | +2% | Finer quantization granularity |
| + NFDS | +2% | Robust calibration |
| Full QuantVGGT | 98% of FP | All components combined |
### Key Findings
- Special tokens are the primary obstacle to quantization: The activations of the first 5 tokens (camera + register) are more than 10× larger in magnitude than regular patch tokens (a quick diagnostic is sketched after this list).
- The rotation → smoothing order matters: Applying smoothing before rotation undermines the benefits of smoothing; rotating first to homogenize the distribution and then smoothing yields greater stability.
- Frame-aware clustering outperforms label-based clustering: t-SNE visualizations show that semantic labels for 3D scenes fail to effectively distinguish calibration sub-domains, whereas inter-frame relationships can.
- 4-bit quantization is practically viable: Empirical measurements on an RTX 4090 confirm a 2.5× inference speedup.
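As a quick sanity check of the first finding, one could compare peak per-token magnitudes. The token layout and the injected 12× tail below are synthetic stand-ins; on the real model, `acts` would be captured with a forward hook on a VGGT linear layer.

```python
import torch

def token_magnitude_report(acts: torch.Tensor, n_special: int = 5) -> None:
    """Compare peak per-token magnitudes of special vs. patch tokens.
    `acts` is one layer's input activations, shape (tokens, channels),
    with the camera/register tokens assumed to come first."""
    peak = acts.abs().amax(dim=1)
    s, p = peak[:n_special].max(), peak[n_special:].max()
    print(f"special: {s:.2f}  patch: {p:.2f}  ratio: {s / p:.1f}x")

# Synthetic stand-in for one layer's activations with a heavy tail.
acts = torch.randn(1029, 1024)
acts[:5] *= 12.0
token_magnitude_report(acts)
```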
## Highlights & Insights
- First quantization work for a billion-scale 3D model: Fills a gap in quantization research within the 3D reconstruction domain.
- "Global-then-local" dual-smoothing design: Elegantly resolves the heavy-tail problem in two steps with no additional runtime overhead, as scaling factors can be folded into LayerNorm.
- Frame-aware calibration in NFDS: Exploits VGGT's unique inductive bias of "first frame vs. subsequent frames," embodying the principle that deeper model understanding enables better compression.
- Theoretical contribution of Theorem 3.2: Provides a formal guiding principle for calibration set construction.
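A minimal sketch of the LayerNorm folding mentioned above (the SmoothQuant-style trick; layer sizes and factors here are arbitrary): dividing activations by the per-channel factors \(c\) is absorbed into the preceding LayerNorm's affine parameters, and multiplying by \(c\) into the next Linear's input channels, so the smoothed model costs no extra ops at inference.

```python
import torch
from torch import nn

# Fold the per-channel smoothing factors c into the preceding LayerNorm's
# affine parameters and the following Linear's input channels, so the
# smoothed network computes the same function with no extra runtime ops.
d = 64
ln, fc = nn.LayerNorm(d), nn.Linear(d, 128)
c = torch.rand(d) + 0.5            # assumed smoothing factors (> 0)

x = torch.randn(8, d)
ref = fc(ln(x))                    # output before folding

with torch.no_grad():
    ln.weight /= c                 # LayerNorm now emits ln(x) / c ...
    ln.bias /= c
    fc.weight *= c                 # ... and the Linear absorbs the c back

assert torch.allclose(fc(ln(x)), ref, atol=1e-4)
```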
## Limitations & Future Work
- Evaluated only on VGGT; generalizability to other 3D models such as DUSt3R/MASt3R remains unverified.
- The roughly 2% accuracy drop at 4-bit may still be too large for high-precision applications.
- The noise threshold and cluster count in NFDS require manual tuning.
- Extremely low-bit regimes such as INT2/INT3 are not explored.
## Related Work & Insights
- vs. SmoothQuant: Applies only global smoothing without accounting for the influence of VGGT's special tokens.
- vs. QuaRot: Applies only Hadamard rotation without subsequent local channel-wise smoothing.
- The proposed approach also offers insights for quantizing other large models containing special tokens, such as [CLS] tokens in VLMs.
## Rating
- Novelty: ⭐⭐⭐⭐ Individual components are not entirely new (Hadamard rotation, SmoothQuant), but their combination and analysis tailored to VGGT are original.
- Experimental Thoroughness: ⭐⭐⭐⭐ Multi-benchmark, multi-bit-width evaluation with complete ablations.
- Writing Quality: ⭐⭐⭐⭐ Motivation analysis is thorough and visualizations are clear.
- Value: ⭐⭐⭐⭐ First quantization work for a large-scale 3D model with strong practical significance.