
SpecQuant: Spectral Decomposition and Adaptive Truncation for Ultra-Low-Bit LLMs Quantization

Conference: AAAI 2026 arXiv: 2511.11663 Code: None Area: Model Compression Keywords: Quantization, Frequency-Domain Decomposition, Outlier Mitigation, Ultra-Low-Bit, Fourier Truncation

TL;DR

SpecQuant proposes a two-stage quantization framework based on adaptive Fourier-domain decomposition: it first smoothly migrates activation outliers into the weights, then suppresses high-frequency noise in the weights via channel-wise low-frequency Fourier truncation. On LLaMA-3 8B, W4A4 quantization incurs only 1.5% accuracy degradation while delivering a 2× speedup and 3× memory savings.

Background & Motivation

State of the Field

LLM deployment faces significant memory and computational pressure. Quantization addresses this by reducing the precision of weights and activations. Numerous post-training quantization (PTQ) methods have emerged in recent years, targeting performance preservation at 4-bit or even lower precision.

Limitations of Prior Work

Activation Outlier Dilemma: Extreme values (outliers) in LLM activations expand the quantization dynamic range, causing severe accuracy degradation.

Intrinsic Limitation of Smoothing Methods: Methods such as SmoothQuant transfer quantization difficulty from activations to weights via scaling factors, but this is merely robbing Peter to pay Paul — new outliers and larger dynamic ranges emerge in the weights.

Overhead of Rotation Methods: SpinQuant and QuaRot introduce rotation layers to align distributions, incurring non-negligible runtime overhead.

Limitation of SVD Methods: SVDQuant absorbs outliers via global low-rank approximation but fails to capture channel-level outlier structures.

Root Cause

Smoothing makes activation quantization easier but weight quantization harder. How can one resolve the new quantization challenges introduced into the weights when transferring outliers?

Starting Point

The problem is revisited from the Fourier frequency-domain perspective: weight energy is concentrated in low-frequency components, while outliers correspond to high-frequency components. Low-frequency truncation can precisely absorb the transferred outliers while retaining the vast majority of signal energy.
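The energy-concentration claim is easy to check numerically on a synthetic smooth signal. The sketch below (plain NumPy, not the paper's code; the test signal is an invented stand-in for a weight channel) measures the fraction of one-sided spectral energy in the lowest 20% of FFT bins and verifies Parseval's theorem.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "weight channel": a smooth base signal plus small noise,
# standing in for the low-frequency-dominated weight vectors described above.
n = 4096
t = np.linspace(0.0, 1.0, n, endpoint=False)
w = np.sin(2 * np.pi * 3 * t) + 0.5 * np.cos(2 * np.pi * 7 * t) \
    + 0.01 * rng.standard_normal(n)

spec = np.fft.rfft(w)
energy = np.abs(spec) ** 2

# Fraction of (one-sided) spectral energy in the lowest 20% of frequency bins.
k = int(0.2 * len(spec))
low_frac = energy[:k].sum() / energy.sum()
print(f"low-frequency energy fraction: {low_frac:.4f}")

# Parseval's theorem for the DFT: sum |x[n]|^2 == (1/N) * sum |X[k]|^2.
full = np.fft.fft(w)
assert np.isclose((np.abs(w) ** 2).sum(), (np.abs(full) ** 2).sum() / n)
```

For a genuinely smooth signal the printed fraction is close to 1, which is the property the truncation step exploits.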

Method

Overall Architecture

SpecQuant is a two-stage framework:

  1. Stage 1: Activation Smoothing — migrates activation outliers into weights.
  2. Stage 2: Channel-Wise Low-Frequency Fourier Truncation — suppresses high-frequency components in the weights while preserving the essential signal.

Key Designs

1. Outlier Migration from Activations to Weights

  • Input \(\mathbf{X}\) is scaled per channel: \(\hat{\mathbf{X}} = \mathbf{X} \cdot \text{diag}(\boldsymbol{\lambda})^{-1}\)
  • Weight compensation: \(\hat{\mathbf{W}} = \mathbf{W} \cdot \text{diag}(\boldsymbol{\lambda})\)
  • \(\boldsymbol{\lambda}\) denotes per-channel smoothing factors.
  • Issue: Post-scaling weight dynamic range and magnitude increase, introducing new quantization challenges.
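Stage 1 is essentially SmoothQuant-style migration. A minimal NumPy sketch is below; the factor formula \(\lambda_j = \max|X_j|^\alpha / \max|W_j|^{1-\alpha}\) follows SmoothQuant's convention and is an assumption here — the paper searches its own per-layer smoothing strength.

```python
import numpy as np

def smooth_migrate(X, W, alpha=0.5, eps=1e-8):
    """Per-channel outlier migration: X_hat @ W_hat == X @ W up to float error.

    X: (tokens, C_in) activations; W: (C_in, C_out) weights.
    alpha balances how much quantization difficulty moves into the weights.
    """
    act_max = np.abs(X).max(axis=0) + eps   # per input channel
    wgt_max = np.abs(W).max(axis=1) + eps   # per input channel (row of W)
    lam = act_max ** alpha / wgt_max ** (1 - alpha)
    X_hat = X / lam                         # outliers shrink in activations...
    W_hat = W * lam[:, None]                # ...and reappear in the weights
    return X_hat, W_hat, lam

# Toy example: input channel 0 carries large activation outliers.
rng = np.random.default_rng(0)
X = rng.standard_normal((32, 8))
X[:, 0] *= 50.0
W = rng.standard_normal((8, 4))
X_hat, W_hat, lam = smooth_migrate(X, W)
```

The product is preserved exactly, but the scaled weights now have a larger dynamic range — the issue the truncation stage addresses.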

2. Channel-Wise Low-Frequency Fourier Truncation

  • Core Assumption: The weight vector \(\mathbf{W}[:,j] \in \mathbb{R}^{C_{in}}\) of each output channel is treated as an independent stationary signal whose energy is concentrated at low frequencies.
  • Empirical Support: In the attention layers of LLaMA-2 7B, the low-frequency components (the lowest 20% of frequency bins) account for an average of 92.3% of the energy across 1,000 random channel vectors, with a standard deviation of only 3.7%.
  • Theoretical Basis — Parseval's Theorem: \(\sum_{n=0}^{N-1} |x[n]|^2 = \frac{1}{N} \sum_{k=0}^{N-1} |X[k]|^2\). For smooth signals, Fourier coefficients decay at a polynomial rate, \(|X[k]| \leq C/|k|^r\), so truncating high frequencies discards only a small fraction of the energy.
  • Procedure:
  • Apply FFT to each channel vector.
  • Adaptively allocate frequency budgets based on activation-aware importance scores.
  • Truncate high-frequency components, retaining low frequencies.
  • Reconstruct the compressed weights via inverse FFT.
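The four-step procedure above can be sketched per channel with NumPy's real FFT. This is a simplified illustration: the paper allocates per-channel budgets adaptively and works with grouped frequencies, while here the budget is just passed in.

```python
import numpy as np

def truncate_channel(w, k):
    """Keep the k lowest-frequency rfft coefficients of one channel vector,
    zero the rest, and reconstruct via inverse FFT."""
    spec = np.fft.rfft(w)
    spec[k:] = 0.0
    return np.fft.irfft(spec, n=len(w))

def truncate_weights(W, budgets):
    """Channel-wise truncation. W: (C_in, C_out); budgets[j] = number of
    low-frequency coefficients kept for output channel j."""
    W_low = np.empty_like(W)
    for j in range(W.shape[1]):
        W_low[:, j] = truncate_channel(W[:, j], budgets[j])
    return W_low

# A smooth channel vector (energy in bins 2 and 5) survives aggressive truncation.
t = np.linspace(0.0, 1.0, 256, endpoint=False)
w = np.sin(2 * np.pi * 2 * t) + 0.3 * np.cos(2 * np.pi * 5 * t)
w_low = truncate_channel(w, k=26)   # keep ~20% of the 129 rfft bins
```

Because the signal's energy sits entirely below bin 26, the reconstruction here is essentially exact; for real weight channels a small high-frequency residual remains, which the dual-branch design quantizes separately.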

3. Activation-Aware Adaptive Frequency Budget Allocation

  • Importance Score: \(\text{Score}(j) = |\bar{\mathbf{X}}_{:,j} \cdot \bar{\hat{\mathbf{W}}}_{:,j}|\)
    • Measures the contribution magnitude of each channel in the activation–weight interaction.
  • Softmax-normalized budget allocation: \(\rho_j = \frac{\exp(\alpha \cdot \text{Score}(j))}{\sum_{l=1}^{C_{out}} \exp(\alpha \cdot \text{Score}(l))}\)
  • Each channel retains \(k_j = \lfloor \rho_j \cdot C_{in} \rfloor\) low-frequency components.
  • Design Motivation: Channels with greater activation influence receive larger budgets, preserving critical spectral information.
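A sketch of the allocation rule, under one plausible reading of the score notation (mean activation vector dotted with each weight column); `alpha` here is the softmax temperature from the formula above, and the exact reduction over tokens is an assumption.

```python
import numpy as np

def allocate_budgets(X, W_hat, alpha=1.0):
    """Softmax-normalized, activation-aware frequency budgets.

    Score(j): |mean activation vector . weight column j| (one reading of the
    paper's notation); rho_j = softmax(alpha * Score); k_j = floor(rho_j * C_in).
    """
    C_in, C_out = W_hat.shape
    x_bar = X.mean(axis=0)           # (C_in,) mean activation per input channel
    scores = np.abs(x_bar @ W_hat)   # (C_out,) interaction strength per output channel
    z = alpha * scores
    z = z - z.max()                  # stabilize the softmax
    rho = np.exp(z) / np.exp(z).sum()
    return np.floor(rho * C_in).astype(int)

rng = np.random.default_rng(0)
X = rng.standard_normal((64, 128))
W_hat = rng.standard_normal((128, 16))
budgets = allocate_budgets(X, W_hat, alpha=2.0)
```

Since the \(\rho_j\) sum to 1, the total kept coefficients are bounded by \(C_{in}\); raising `alpha` concentrates the budget on the highest-scoring channels.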

4. Dual-Branch Computation Architecture

  • Low-Frequency Branch (16-bit high precision): \(\hat{\mathbf{X}} \mathbf{W}'\)
  • Residual Branch (4-bit low precision): \(Q(\hat{\mathbf{X}}) Q(\mathbf{R})\), where \(\mathbf{R} = \hat{\mathbf{W}} - \mathbf{W}'\)
  • Overall approximation: \(\mathbf{X}\mathbf{W} \approx \hat{\mathbf{X}}\mathbf{W}' + Q(\hat{\mathbf{X}})Q(\mathbf{R})\)
  • The low-frequency branch incurs minimal overhead: only 16 or 32 low-frequency groups are retained per channel, with additional cost of \(2k/m\) (where \(m\) is the number of input channels).
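The dual-branch approximation can be illustrated with a toy per-tensor symmetric fake-quantizer standing in for the INT4 kernels (`quantize_sym` and the synthetic weights are simplifications for demonstration, not the paper's implementation).

```python
import numpy as np

def quantize_sym(A, bits=4):
    """Toy per-tensor symmetric fake-quantization (stand-in for INT4 kernels)."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(A).max() / qmax + 1e-12
    return np.clip(np.round(A / scale), -qmax - 1, qmax) * scale

def dual_branch_matmul(X_hat, W_hat, W_low):
    """XW ~= X_hat @ W_low (high precision) + Q(X_hat) @ Q(R), R = W_hat - W_low."""
    R = W_hat - W_low
    return X_hat @ W_low + quantize_sym(X_hat) @ quantize_sym(R)

# Smooth weight columns plus small high-frequency noise.
rng = np.random.default_rng(0)
t = np.linspace(0.0, 1.0, 64, endpoint=False)
base = np.sin(2 * np.pi * np.arange(1, 5)[None, :] * t[:, None])   # (64, 4)
W_hat = base + 0.05 * rng.standard_normal((64, 4))

# Low-frequency part of each column (keep 8 of the 33 rfft bins).
W_low = np.empty_like(W_hat)
for j in range(W_hat.shape[1]):
    s = np.fft.rfft(W_hat[:, j])
    s[8:] = 0.0
    W_low[:, j] = np.fft.irfft(s, n=64)

X_hat = rng.standard_normal((16, 64))
Y = dual_branch_matmul(X_hat, W_hat, W_low)
```

Because the residual \(\mathbf{R}\) is small, its quantization error is far cheaper than quantizing \(\hat{\mathbf{W}}\) wholesale, which is the point of splitting the computation.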

Loss & Training

  • Post-training quantization (PTQ), no fine-tuning required.
  • Calibration set: 256 samples randomly drawn from WikiText2.
  • Optimal smoothing strength \(\alpha\) is searched per layer (minimizing MSE).
  • Weight quantization employs GPTQ column-wise error compensation.
  • Activation quantization uses per-token asymmetric quantization.
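Per-token asymmetric quantization gives each token (row) its own scale and zero-point, so an outlier in one token does not stretch the range for all others. A minimal fake-quantization sketch (the 4-bit rounding scheme is a standard formulation, not taken from the paper's code):

```python
import numpy as np

def quantize_per_token_asym(X, bits=4):
    """Per-token asymmetric fake-quantization: each row gets its own
    scale and zero-point covering [row min, row max]."""
    qmax = 2 ** bits - 1
    x_min = X.min(axis=1, keepdims=True)
    x_max = X.max(axis=1, keepdims=True)
    scale = (x_max - x_min) / qmax + 1e-12
    zero = np.round(-x_min / scale)
    q = np.clip(np.round(X / scale) + zero, 0, qmax)
    return (q - zero) * scale   # dequantized values

rng = np.random.default_rng(0)
X = rng.standard_normal((8, 32))
X_dq = quantize_per_token_asym(X, bits=4)
```

Round-to-nearest guarantees the per-element error stays within half of each row's scale.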

Key Experimental Results

Main Results

| Model | Quant Config (W-A-KV) | Method | WikiText2 PPL | Zero-Shot 9-Task Accuracy |
|---|---|---|---|---|
| LLaMA-3 8B | 16-16-16 | FP16 | 6.14 | 68.09% |
| LLaMA-3 8B | 4-16-16 | SpinQuant | 6.49 | 66.54% |
| LLaMA-3 8B | 4-16-16 | SpecQuant | 6.48 | 66.88% |
| LLaMA-3 8B | 4-4-16 | SpinQuant | 7.28 | 64.11% |
| LLaMA-3 8B | 4-4-16 | SpecQuant | 7.25 | 64.75% |
| LLaMA-3 8B | 4-4-4 | SpinQuant | 7.35 | 64.10% |
| LLaMA-3 8B | 4-4-4 | SpecQuant | 7.33 | 64.75% |
| LLaMA-3 70B | 4-4-16 | SpinQuant | 6.10 | 66.99% |
| LLaMA-3 70B | 4-4-16 | SpecQuant | 5.12 | 69.75% |
| LLaMA-2 7B | 4-4-16 | SpinQuant | 6.78 | 57.37% |
| LLaMA-2 7B | 4-4-16 | SpecQuant | 5.88 | 62.88% |

Ablation Study

| Quant | Smooth | Trunc. | LLaMA-7B Wiki PPL | LLaMA-7B 0-shot9 | Note |
|---|---|---|---|---|---|
| ✓ | | | 9e3 | 25.34% | Direct quantization collapses |
| ✓ | ✓ | | 3e2 | 34.42% | Smoothing alone yields only marginal improvement |
| ✓ | | ✓ | 24.57 | 54.72% | Truncation alone is insufficient |
| ✓ | ✓ | ✓ | 6.05 | 61.85% | Both components are necessary |

Effect of Number of Truncation Groups

| Truncation Groups | Latency Overhead | LLaMA-7B Wiki PPL | LLaMA-7B 0-shot9 |
|---|---|---|---|
| 16 | 2.7% | 6.04 | 61.89% |
| 32 | 5.5% | 6.03 | 62.01% |
| 64 | 11.2% | 5.99 | 62.88% |

Speedup and Memory Savings (LLaMA-3 8B)

| Sequence Length | FP16 Prefill | INT4 Prefill | Speedup | FP16 Memory | INT4 Memory | Savings |
|---|---|---|---|---|---|---|
| 256 | 8.05 ms | 3.51 ms | 2.29× | 0.43 GB | 0.13 GB | 3.41× |
| 2048 | 57.47 ms | 26.31 ms | 2.19× | 0.51 GB | 0.19 GB | 2.73× |
| 8192 | 256.39 ms | 119.20 ms | 2.15× | 0.80 GB | 0.40 GB | 1.99× |

Key Findings

  1. W4A4 quantization incurs only 1.5% accuracy loss (LLaMA-3 8B), significantly outperforming SmoothQuant, GPTQ, and related methods.
  2. Frequency-domain truncation absorbs outliers more effectively than rotation: Under LLaMA-3 70B W4A4, SpecQuant achieves 16% lower PPL than SpinQuant (5.12 vs. 6.10).
  3. Smoothing and truncation are both indispensable: Either component alone yields limited benefit; their combination reduces PPL from 9000+ to 6.05.
  4. Spectral Entropy as an importance measure is optimal: It better captures channel structure than Abs Mean, Max, or L2 Norm.
  5. 16 truncation groups already achieve a favorable accuracy–efficiency trade-off.
  6. Practical deployment achieves over 2× speedup and over 3× memory savings.

Highlights & Insights

  1. First work to establish a connection between frequency-domain compression and quantization robustness: The Fourier energy decay property provides theoretical guarantees for accuracy preservation.
  2. Addresses the fundamental rob-Peter-to-pay-Paul limitation of smoothing methods: Rather than transferring difficulty, SpecQuant eliminates it in the frequency domain.
  3. Advantage of channel-independent processing: Better captures channel-level outlier patterns than global SVD.
  4. Activation-aware adaptive budget allocation: Relies on activation–weight interaction strength rather than weight statistics alone.
  5. Negligible additional overhead: 16 truncation groups introduce only 2.7% latency overhead.

Limitations & Future Work

  1. Per-layer search for optimal smoothing strength incurs additional calibration cost.
  2. The number of low-frequency truncation groups must be manually set according to the target bit-width.
  3. Validation is limited to the LLaMA family; effectiveness on other architectures (e.g., Mixture-of-Experts) remains unknown.
  4. Performance under extreme quantization (e.g., 2-bit) is not demonstrated.
  5. The frequency-domain decomposition assumption (weight signal smoothness) may not hold for certain models.

Related Work Context

  • SmoothQuant pioneered the activation-to-weight migration paradigm; SpecQuant builds upon this by resolving the post-migration challenges via frequency-domain processing.
  • The rotation strategies of QuaRot/SpinQuant are effective but incur runtime overhead; frequency-domain truncation offers a more lightweight alternative.
  • FourierFT and related works apply frequency-domain methods to parameter-efficient fine-tuning; this paper extends the paradigm to quantization.
  • Parseval's theorem provides a solid theoretical foundation for frequency-domain compression.

Rating

  • Novelty: ⭐⭐⭐⭐⭐ — Addressing quantization outliers from a frequency-domain perspective constitutes a genuinely new paradigm.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ — Covers 8 models, multiple quantization configurations, detailed ablations, and efficiency benchmarks.
  • Writing Quality: ⭐⭐⭐⭐ — Theoretical derivations are rigorous, though the overall framework is somewhat complex.
  • Value: ⭐⭐⭐⭐⭐ — W4A4 practical utility is strong; speedup and memory savings results are compelling.