CommVQ: Commutative Vector Quantization for KV Cache Compression
Conference: ICML 2025
arXiv: 2506.18879
Code: https://github.com/UMass-Embodied-AGI/CommVQ
Area: Robotics / Model Compression
Keywords: KV Cache Compression, Vector Quantization, RoPE Commutativity, Long-Context Inference, 1-bit Quantization
TL;DR
This paper proposes CommVQ, which compresses the KV cache with Additive Vector Quantization (AVQ). Its codebook is designed to commute with RoPE and is trained with an EM algorithm. CommVQ achieves near-lossless accuracy at 2-bit and retains usable accuracy at 1-bit, enabling LLaMA-3.1 8B to run a 128K context on a single RTX 4090 GPU.
Background & Motivation
Background: LLM context lengths continue to grow (reaching 128K+), causing the KV cache to become the primary bottleneck for GPU memory. For instance, LLaMA-3.1 8B requires 88GB of KV cache for a 128K context with a batch size of 2.
Limitations of Prior Work: Existing KV cache quantization methods such as KVQuant quantize each scalar independently, which causes severe accuracy degradation below 2-bit. They also do not account for the Rotary Position Embedding (RoPE) applied to keys.
Key Challenge: Scalar-wise quantization incurs too much information loss at ultra-low bit-widths; vector-level quantization is required to preserve more information.
Goal: Compress the KV cache at the vector level, down to 2-bit and 1-bit, while keeping decoding cheap enough to fold into attention.
Key Insight: Treat the key/value vectors of each token as a whole for additive vector quantization, reducing quantization errors.
Core Idea: Design a codebook that commutes with the RoPE matrix so that decoding can be folded into the attention computation: intermediate products with the codebook are computed once and then reused.
Method
Overall Architecture
- Use Additive Vector Quantization (AVQ) to encode each key/value vector as the sum of multiple codewords.
- Design the codebook to commute with RoPE (\(C \cdot R = R \cdot C\)), allowing \(Q \cdot C\) to be precomputed and reused across all tokens.
- Train the codebook using the EM algorithm.
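
As a rough illustration of the first step, the sketch below encodes a vector as a sum of one codeword per codebook and decodes it back. The greedy residual encoder, the random codebooks, and the names `aq_encode`, `aq_decode`, and `bits_per_vector` are assumptions made for illustration; this summary does not describe how CommVQ actually selects codewords.

```python
import math
import numpy as np

def aq_encode(v, codebooks):
    """Greedy residual encoding (an assumed, simple encoder):
    for each codebook, pick the codeword nearest to the current residual."""
    residual = np.asarray(v, dtype=np.float64).copy()
    indices = []
    for cb in codebooks:                      # cb has shape (K, d)
        dists = np.sum((cb - residual) ** 2, axis=1)
        i = int(np.argmin(dists))
        indices.append(i)
        residual -= cb[i]                     # quantize what is left
    return indices

def aq_decode(indices, codebooks):
    """Reconstruct the vector as the sum of the selected codewords."""
    return sum(cb[i] for cb, i in zip(codebooks, indices))

def bits_per_vector(num_codebooks, codebook_size):
    """Storage cost: one index of log2(K) bits per codebook."""
    return num_codebooks * int(math.log2(codebook_size))

rng = np.random.default_rng(0)
d, M, K = 8, 4, 16                            # hypothetical sizes
codebooks = [rng.normal(size=(K, d)) for _ in range(M)]
v = rng.normal(size=d)

idx = aq_encode(v, codebooks)                 # M small integers
v_hat = aq_decode(idx, codebooks)             # approximate reconstruction
print(idx, bits_per_vector(M, K))
```

Only the `M` indices need to be cached per token; the codebooks are shared across all tokens.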
Key Designs
- Additive Vector Quantization:
  - Function: Quantize each KV vector into a sum of codewords, one drawn from each of \(M\) codebooks.
  - Mechanism: \(v \approx c_{i_1} + c_{i_2} + \ldots + c_{i_M}\), where each codeword index requires only \(\log_2 K\) bits, for a total of \(M \log_2 K\) bits per vector.
  - Design Motivation: Vector-level quantization yields smaller errors than scalar-level quantization at the same bit-width.
- RoPE-Commutative Codebook:
  - Function: Design the codebook such that \(\text{Decode}(\text{RoPE}(\text{Encode}(k))) = \text{RoPE}(\text{Decode}(\text{Encode}(k)))\), i.e., applying RoPE commutes with decoding.
  - Mechanism: The codebook is structured so that it commutes with the RoPE rotation (\(C \cdot R = R \cdot C\)). Consequently \(Q \cdot R \cdot C = Q \cdot C \cdot R\), and the product \(Q \cdot C\) can be computed once and reused across all tokens.
  - Design Motivation: Avoid the \(O(N \cdot d)\) overhead of token-by-token decoding and RoPE application, reducing it to \(O(K \cdot d)\), where \(N\) is the number of cached tokens, \(K\) is the codebook size, and \(d\) is the vector dimension.
- EM Algorithm Codebook Training:
  - Function: Alternately perform the E-step (assigning codewords) and M-step (updating codebook centroids).
  - Mechanism: Minimize the quantization reconstruction error under the RoPE commutativity constraint.
  - Design Motivation: EM is a classic approach to training vector quantizers. Each iteration does not increase the objective, so training converges, although typically only to a local optimum.
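
The commutativity condition \(C \cdot R = R \cdot C\) can be checked numerically. The sketch below assumes one structure that is known to satisfy it: RoPE is a block-diagonal matrix of 2×2 rotations, and any block-diagonal matrix whose 2×2 blocks have the form \([[x, y], [-y, x]]\) commutes with it. This summary does not spell out the structure CommVQ uses, so the block form, the frequency schedule, and the helper names here are illustrative assumptions.

```python
import numpy as np

def rope_matrix(angles):
    """Block-diagonal RoPE matrix: one 2x2 rotation per angle."""
    d = 2 * len(angles)
    R = np.zeros((d, d))
    for j, t in enumerate(angles):
        c, s = np.cos(t), np.sin(t)
        R[2 * j:2 * j + 2, 2 * j:2 * j + 2] = [[c, -s], [s, c]]
    return R

def commutative_block_matrix(xs, ys):
    """Block-diagonal matrix with 2x2 blocks [[x, y], [-y, x]] (assumed structure)."""
    d = 2 * len(xs)
    C = np.zeros((d, d))
    for j, (x, y) in enumerate(zip(xs, ys)):
        C[2 * j:2 * j + 2, 2 * j:2 * j + 2] = [[x, y], [-y, x]]
    return C

def max_commutator(A, positions, base_freqs):
    """Largest entry of |R_p A - A R_p| over the given token positions."""
    worst = 0.0
    for p in positions:
        R = rope_matrix(p * base_freqs)
        worst = max(worst, float(np.abs(R @ A - A @ R).max()))
    return worst

rng = np.random.default_rng(0)
n_pairs = 4                                            # head dimension d = 8
base_freqs = 10000.0 ** (-np.arange(n_pairs) / n_pairs)
positions = [1, 17, 4096]

C = commutative_block_matrix(rng.normal(size=n_pairs), rng.normal(size=n_pairs))
G = rng.normal(size=(2 * n_pairs, 2 * n_pairs))        # unstructured matrix for contrast

print(max_commutator(C, positions, base_freqs))        # numerically zero
print(max_commutator(G, positions, base_freqs))        # clearly nonzero

# Because R and C commute, q @ C does not depend on the token position
# and can be computed once, outside any per-token loop.
q = rng.normal(size=2 * n_pairs)
s = rng.normal(size=2 * n_pairs)                       # stand-in for a per-token code
R = rope_matrix(17 * base_freqs)
lhs = q @ (R @ (C @ s))
rhs = (q @ C) @ (R @ s)
```

The last two lines show the regrouping that commutativity allows: the position-dependent rotation moves past the codebook, leaving a position-independent product `q @ C`.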
Loss & Training
- Objective: quantization reconstruction error, minimized subject to the RoPE commutativity constraint.
- Implementation: Triton kernels, so that the compression translates into actual GPU memory savings.
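
To make the alternating E-step/M-step concrete, here is a minimal hard-assignment EM loop (Lloyd's algorithm, i.e., k-means) for a single unconstrained codebook. It is a simplified stand-in, not the paper's training procedure: it omits the RoPE commutativity constraint and the additive multi-codebook structure, and the function name, sizes, and random calibration data are hypothetical.

```python
import numpy as np

def train_codebook_em(data, num_codewords, num_iters=10, seed=0):
    """Hard-assignment EM (k-means) on calibration vectors of shape (n, d)."""
    rng = np.random.default_rng(seed)
    init = rng.choice(len(data), size=num_codewords, replace=False)
    codebook = data[init].copy()
    errors = []
    for _ in range(num_iters):
        # E-step: assign every vector to its nearest codeword.
        dists = ((data[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
        assign = dists.argmin(axis=1)
        errors.append(float(dists[np.arange(len(data)), assign].mean()))
        # M-step: move each codeword to the mean of its assigned vectors.
        for k in range(num_codewords):
            members = data[assign == k]
            if len(members) > 0:              # keep the old codeword if unused
                codebook[k] = members.mean(axis=0)
    return codebook, errors

rng = np.random.default_rng(1)
calib = rng.normal(size=(512, 8))             # stand-in for calibration KV vectors
codebook, errors = train_codebook_em(calib, num_codewords=16)
print(errors[0], errors[-1])
```

The recorded reconstruction error never increases from one iteration to the next, which is the convergence property usually cited for this family of methods.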
Key Experimental Results
Main Results
LLaMA-3.1 8B long-context benchmarks:
| Method | Bit-width | LongBench | InfiniteBench | Memory Saving |
|---|---|---|---|---|
| FP16 | 16-bit | 42.1 | 22.8 | 1× |
| KVQuant | 2-bit | 38.5 | 18.2 | 8× |
| CommVQ | 2-bit | 41.8 | 22.1 | 8× |
| KVQuant | 1-bit | 28.3 | 11.5 | 16× |
| CommVQ | 1-bit | 36.2 | 17.8 | 16× |
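
The Memory Saving column follows directly from the bit-widths. The small sketch below reproduces that arithmetic; it assumes a 16-bit baseline and ignores codebook storage and any other metadata overhead, and the function name is hypothetical.

```python
def kv_compression_ratio(bits_per_scalar, baseline_bits=16):
    """Ratio of baseline KV cache size to quantized size, ignoring overheads."""
    return baseline_bits / bits_per_scalar

for bits in (16, 2, 1):
    print(f"{bits}-bit: {kv_compression_ratio(bits):.0f}x")
```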
Ablation Study
| Configuration | LongBench | Description |
|---|---|---|
| Scalar-wise 2-bit | 38.5 | Baseline |
| Vector quantization 2-bit (w/o RoPE commutation) | 40.9 | Gain from vector quantization alone |
| Vector quantization 2-bit (+RoPE commutation) | 41.8 | Full method |
Key Findings
- Near-lossless performance at 2-bit on LongBench (42.1 \(\rightarrow\) 41.8), outperforming all baselines.
- First to achieve usable accuracy at 1-bit (LongBench 36.2 vs. 42.1 for FP16).
- LLaMA-3.1 8B runs a 128K context directly on a single RTX 4090 (24GB).
Highlights & Insights
- The RoPE-commutative design is the core innovation: it builds the mathematical structure of the positional embedding into the quantization scheme.
- Moving from scalar-wise to vector-level quantization matters most at ultra-low bit-widths.
- A 1-bit KV cache makes long-context LLMs practical on consumer-grade GPUs.
Limitations & Future Work
- Codebook training requires calibration data.
- RoPE commutativity requires a specific codebook structure, which may limit representation capacity.
- Evaluated only on the LLaMA model family.
Related Work & Insights
- vs KVQuant: KVQuant quantizes scalars independently; CommVQ's vector-level quantization is more accurate at the same bit-width.
- vs any4 (weight quantization): any4 uses lookup tables (LUTs) for weight quantization, while CommVQ uses vector quantization for KV cache compression, so the two are complementary.
- The approach suggests inference optimizations for any Transformer that uses RoPE.
Rating
- Novelty: ⭐⭐⭐⭐⭐ The RoPE-commutative codebook design is highly creative.
- Experimental Thoroughness: ⭐⭐⭐⭐ Multiple benchmarks, 1-bit/2-bit, and memory analysis.
- Writing Quality: ⭐⭐⭐⭐ Clear mathematical derivation.
- Value: ⭐⭐⭐⭐⭐ Enables long-context LLMs on consumer-grade GPUs.