ParoQuant: Pairwise Rotation Quantization for Efficient Reasoning LLM Inference¶
Conference: ICLR 2026 arXiv: 2511.10645 Code: Project Page Area: Model Compression Keywords: Post-training quantization, Givens rotation, Reasoning LLM, Quantization efficiency, Algorithm-system co-design
TL;DR¶
ParoQuant is proposed to eliminate weight outliers via hardware-efficient and optimizable independent Givens rotations combined with channel scaling, achieving high-accuracy, low-overhead 4-bit weight quantization for reasoning LLMs.
Background & Motivation¶
LLM quantization faces a fundamental accuracy-efficiency trade-off:

- AWQ: fast, but incurs a significant accuracy loss (e.g., a 2.8% drop on MMLU-Pro for Qwen3-4B); the long chain-of-thought in reasoning LLMs causes quantization errors to accumulate progressively.
- QTIP: accurate, but approximately 30% slower than AWQ due to the substantial overhead of its Hadamard transforms.
- Reasoning models generate tens of thousands of tokens, imposing stringent requirements on both quantization accuracy and efficiency.
Core observations:

- Rotation effectively suppresses outliers, but full rotation matrices are computationally expensive.
- Sparsely parameterized rotations are equally effective: retaining only the top 10% of channel pairs suffices to match full-rotation performance.
Method¶
Overall Architecture¶
ParoQuant designs a Scaled Pairwise Rotation transform composed of multiple independent rotations and channel scaling, paired with layer-wise optimization and efficient inference kernels for end-to-end acceleration.
Key Designs¶
- Givens Rotation Decomposition:
- A small set of channel pairs is selected: \(\mathcal{P} = \{(i_1,j_1), \ldots, (i_m,j_m)\}\)
- Each pair undergoes a planar rotation: \(\mathbf{W}^{(k)}[i,:] = \cos\theta_k \cdot \mathbf{W}^{(k-1)}[i,:] - \sin\theta_k \cdot \mathbf{W}^{(k-1)}[j,:]\), \(\mathbf{W}^{(k)}[j,:] = \sin\theta_k \cdot \mathbf{W}^{(k-1)}[i,:] + \cos\theta_k \cdot \mathbf{W}^{(k-1)}[j,:]\)
- Only a small number of vectorized multiply-add operations are required, avoiding full matrix multiplication.
- Independent Rotation:
- Each channel appears in at most one rotation pair (\(\{i_k, j_k\} \cap \{i_l, j_l\} = \emptyset\) for \(k \neq l\)).
- All Givens rotations within a pass are mutually independent and can be applied in parallel, fully exploiting GPU parallelism.
- Naturally compatible with group quantization: independent rotations within each quantization group.
- Sequential Independent Rotations + Channel Scaling (a code sketch of the combined transform follows this list):
- A single independent rotation has only \(n/2\) parameters, limiting its expressiveness.
- \(K\) sequential independent rotations (default \(K=8\)) are applied to enhance fitting capacity.
- Channel scaling \(\text{diag}(\boldsymbol{\alpha})\) directly equalizes channel magnitudes.
- Final transform: \(T_{\mathcal{P},\Theta,\boldsymbol{\alpha}}(\mathbf{W}) = (\prod_{t=1}^K R(\mathcal{P}_t, \Theta_t)) \cdot \text{diag}(\boldsymbol{\alpha}) \cdot \mathbf{W}\)
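A minimal PyTorch sketch of the combined transform, assuming it acts on the rows (channels) of \(\mathbf{W}\) as in the formulas above; all names (`givens_pass`, `scaled_pairwise_rotation`, `pair_sets`) are illustrative and not taken from the paper's code.

```python
import torch

def givens_pass(W: torch.Tensor, pairs: torch.Tensor, theta: torch.Tensor) -> torch.Tensor:
    """Apply one independent Givens rotation pass to the rows of W.

    pairs: (m, 2) disjoint channel-index pairs (each channel in at most one pair).
    theta: (m,) rotation angles. Because the pairs are disjoint, all m rotations
    commute and reduce to a few vectorized multiply-adds.
    """
    W = W.clone()
    i, j = pairs[:, 0], pairs[:, 1]
    c = torch.cos(theta).unsqueeze(1)
    s = torch.sin(theta).unsqueeze(1)
    Wi, Wj = W[i], W[j]
    W[i] = c * Wi - s * Wj          # row i: cosθ·W[i,:] − sinθ·W[j,:]
    W[j] = s * Wi + c * Wj          # row j: sinθ·W[i,:] + cosθ·W[j,:]
    return W

def scaled_pairwise_rotation(W, alpha, pair_sets, thetas):
    """T(W) = (Π_t R(P_t, Θ_t)) · diag(α) · W: channel scaling, then K independent passes."""
    W = alpha.unsqueeze(1) * W                      # diag(α)·W scales each channel (row)
    for pairs, theta in zip(pair_sets, thetas):     # K sequential passes (default K = 8)
        W = givens_pass(W, pairs, theta)
    return W
```

Because every pass uses disjoint channel pairs, each pass is a single vectorized multiply-add over all its pairs rather than a full matrix multiplication, which is what makes the transform GPU-friendly.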
Loss & Training¶
- Layer-wise optimization with a reconstruction loss: \(\mathcal{L}(Q) = \|Q(D)(\mathbf{X}') - D(\mathbf{X})\|\), comparing the quantized layer \(Q(D)\) applied to \(\mathbf{X}'\) against the full-precision layer \(D\) applied to \(\mathbf{X}\).
- Two-stage optimization (sketched after this list): rotation angles and scaling factors are optimized first, followed by QAT-like fine-tuning of the weights and quantization parameters \(s, z\).
- Each layer is optimized for 10 epochs using AdamW, with uniform sampling from three datasets (WikiText2, C4, RedPajama).
- Inference kernels exploit three-level parallelism: token dimension, channel group dimension, and rotation pair dimension.
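A hedged sketch of the Stage-1 calibration loop described above, reusing `givens_pass` and `scaled_pairwise_rotation` from the previous sketch. `fake_quant`, `inverse_transform`, `calibrate_layer`, `calib_loader`, and the assumed weight layout (rows are the channels being rotated) are illustrative stand-ins; the paper's quantizer is asymmetric group quantization with parameters \(s, z\), and its data pipeline and gradient details may differ.

```python
import torch

def fake_quant(W, group=128, bits=4):
    """Symmetric per-group fake quantization with a straight-through estimator
    (simplified; the paper uses group quantization with scale s and zero-point z)."""
    Wg = W.reshape(-1, group)
    qmax = 2 ** (bits - 1) - 1
    s = Wg.abs().amax(dim=1, keepdim=True).clamp_min(1e-8) / qmax
    Wq = (torch.round(Wg / s).clamp(-qmax - 1, qmax) * s).reshape(W.shape)
    return W + (Wq - W).detach()        # STE: forward value Wq, identity gradient

def inverse_transform(W_t, alpha, pair_sets, thetas):
    """Undo T: apply the K rotation passes in reverse order with negated angles, then un-scale."""
    for pairs, theta in zip(reversed(pair_sets), reversed(thetas)):
        W_t = givens_pass(W_t, pairs, -theta)
    return W_t / alpha.unsqueeze(1)

def calibrate_layer(W, calib_loader, pair_sets, epochs=10, lr=1e-3):
    """Stage 1: optimize rotation angles Θ and channel scales α with AdamW
    against the layer-wise reconstruction loss ||Q(D)(X') − D(X)||.
    W: (in_features, out_features) so that rows are the rotated channels (an assumption)."""
    thetas = [torch.zeros(len(p), requires_grad=True) for p in pair_sets]
    alpha = torch.ones(W.shape[0], requires_grad=True)
    opt = torch.optim.AdamW(thetas + [alpha], lr=lr)
    for _ in range(epochs):
        for X, X_q in calib_loader:      # X: full-precision input; X_q: input from the quantized prefix
            W_t = scaled_pairwise_rotation(W, alpha, pair_sets, thetas)           # T(W)
            W_eff = inverse_transform(fake_quant(W_t), alpha, pair_sets, thetas)  # T⁻¹(Q(T(W)))
            loss = (X_q @ W_eff - X @ W).pow(2).mean()                            # Q(D)(X') vs. D(X)
            opt.zero_grad()
            loss.backward()
            opt.step()
    # Stage 2 (not shown): freeze Θ and α, then fine-tune W and the quantization
    # parameters (s, z) in a QAT-like manner on the same objective.
    return thetas, alpha
```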
Key Experimental Results¶
Main Results (Perplexity — W4G128 Quantization)¶
| Model | Method | WikiText2 PPL | C4 PPL | Inference Speedup |
|---|---|---|---|---|
| LLaMA-3-8B | FP16 | 5.54 | 7.10 | 1.0× |
| | AWQ | 5.92 | 7.42 | 2.4× |
| | QTIP | 5.69 | 7.22 | 1.7× |
| | ParoQuant | 5.68 | 7.17 | 2.2× |
| Qwen3-4B | AWQ | 7.36 | 7.89 | 2.4× |
| | QTIP | 7.09 | 7.68 | 1.7× |
| | ParoQuant | 7.03 | 7.63 | 2.2× |
Reasoning Task Accuracy (DeepSeek-R1-distilled LLaMA-3.1-8B)¶
| Method | MMLU-Pro | GPQA Diamond | AIME-24 | AIME-25 | Avg. |
|---|---|---|---|---|---|
| FP16 | 52.4 | 43.9 | 56.7 | 40.0 | 48.3 |
| AWQ | 49.3 | 40.4 | 46.7 | 26.7 | 40.8 |
| ParoQuant | 52.5 | 41.4 | 53.3 | 36.7 | 46.0 |
Key Findings¶
- ParoQuant outperforms AWQ by an average of 2.4% on reasoning tasks with less than 10% additional overhead.
- Accuracy matches QTIP (vector quantization SOTA) while being approximately 25% faster.
- Gains are especially pronounced on the Qwen3 series (1.7B–14B), where smaller models pose greater quantization challenges.
Highlights & Insights¶
- Algorithm-system co-design: the independence constraint on rotations simultaneously preserves the mathematical optimization space and naturally suits GPU parallelism.
- Incisive analysis: only 10% of channel pairs are needed to match full rotation performance, revealing redundancy in orthogonal transforms.
- Particular attention is paid to reasoning LLMs, with thorough analysis of quantization error accumulation under long chain-of-thought generation.
- Online rotation kernels leverage shared memory and registers; multiple independent rotations can be fused into a single kernel call.
Limitations & Future Work¶
- Validation is primarily limited to 4-bit linear quantization; 2–3 bit scenarios remain unexplored.
- The channel pair selection strategy for independent rotations (random sampling with deduplication) may be suboptimal.
- The number of rotations \(K=8\) is empirically determined and may require tuning for different models.
- Lack of open-sourced code may limit community adoption.
Related Work & Insights¶
- Distinction from QuaRot/SpinQuant: ParoQuant employs optimizable independent Givens rotations rather than fixed Hadamard transforms.
- Distinction from AWQ: the addition of rotation transforms on top of channel scaling substantially improves outlier suppression.
- Implication: in the era of reasoning LLMs, quantization methods must recalibrate the trade-off between accuracy and efficiency.
Rating¶
- Novelty: ⭐⭐⭐⭐ The design of independent Givens rotations is novel and practical.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Comprehensive validation across multiple models, tasks, and metrics.
- Writing Quality: ⭐⭐⭐⭐ Motivation is clearly articulated, though the paper is notation-heavy.
- Value: ⭐⭐⭐⭐⭐ A practical solution for reasoning LLM quantization.