CommVQ: Commutative Vector Quantization for KV Cache Compression

Conference: ICML 2025
arXiv: 2506.18879
Code: https://github.com/UMass-Embodied-AGI/CommVQ
Area: Robotics / Model Compression
Keywords: KV Cache Compression, Vector Quantization, RoPE Commutativity, Long-Context Inference, 1-bit Quantization

TL;DR

This paper proposes CommVQ, which compresses the KV cache with additive vector quantization. Its codebook is designed to commute with RoPE and is trained with an EM algorithm. CommVQ reports near-lossless accuracy at 2-bit and usable accuracy at 1-bit, enabling LLaMA-3.1 8B to run a 128K context on a single RTX 4090 GPU.

Background & Motivation

Background: LLM context lengths continue to grow (reaching 128K+), causing the KV cache to become the primary bottleneck for GPU memory. For instance, LLaMA-3.1 8B requires 88GB of KV cache for a 128K context with a batch size of 2.

Limitations of Prior Work: Existing KV cache quantization methods (such as KVQuant) quantize each scalar independently, which leads to severe accuracy degradation below 2-bit, and they do not exploit the structure of the Rotary Position Embedding (RoPE) applied to keys.

Key Challenge: Scalar-wise quantization incurs too much information loss at ultra-low bit-widths; vector-level quantization is required to preserve more information.

Goal: Efficient vector-level KV cache compression.

Key Insight: Treat the key/value vectors of each token as a whole for additive vector quantization, reducing quantization errors.

Core Idea: Design a codebook that commutes with the RoPE matrix, so that decoding can be folded into the attention computation: intermediate results are precomputed on the codebook and reused.

Method

Overall Architecture

Use Additive Quantization (AQ) to encode key/value vectors as the sum of multiple codewords.
Design the codebook \(C\) to commute with the RoPE rotation matrix \(R\) (\(C \cdot R = R \cdot C\)), so that the query-codebook product \(Q \cdot C\) can be precomputed and reused across all tokens.
Train the codebook using the EM algorithm.
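
The commutativity condition \(C \cdot R = R \cdot C\) can be checked on a single 2D pair. The sketch below is an illustration under stated assumptions, not the paper's implementation: it assumes the codebook block has the form `[[a, -b], [b, a]]` (a matrix shape that is known to commute with 2D rotations), treats `q` as an already-rotated query, and uses a hypothetical `code` vector as a stand-in for one token's quantized representation. All function and variable names are made up for this example.

```python
import math

def rot(theta):
    """2x2 rotation by angle theta, the per-pair building block of RoPE."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s], [s, c]]

def comm_block(a, b):
    """2x2 block [[a, -b], [b, a]]; assumed codebook shape for this sketch."""
    return [[a, -b], [b, a]]

def matmul(x, y):
    """Product of two 2x2 matrices stored as nested lists."""
    return [[sum(x[i][k] * y[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def matvec(x, v):
    """2x2 matrix times column vector."""
    return [x[0][0] * v[0] + x[0][1] * v[1], x[1][0] * v[0] + x[1][1] * v[1]]

def vecmat(q, x):
    """Row vector times 2x2 matrix."""
    return [q[0] * x[0][0] + q[1] * x[1][0], q[0] * x[0][1] + q[1] * x[1][1]]

def dot(u, v):
    return u[0] * v[0] + u[1] * v[1]

C = comm_block(0.7, -1.3)   # illustrative values
q = [0.5, 2.0]              # illustrative query pair
code = [1.0, -1.0]          # hypothetical per-token code vector

# Computed once, independent of the token position.
q_times_C = vecmat(q, C)

def score_direct(theta):
    """Decode first, then rotate: q . (R C code)."""
    return dot(q, matvec(rot(theta), matvec(C, code)))

def score_reused(theta):
    """Reuse q C and rotate only the code: (q C) . (R code)."""
    return dot(q_times_C, matvec(rot(theta), code))

for position in range(4):
    theta = 0.3 * position
    print(position, round(score_direct(theta), 6), round(score_reused(theta), 6))
```

The two score columns agree because the rotation can be moved past `C`. A generic 2x2 matrix such as `[[1, 2], [3, 4]]` does not have this property, which is why the codebook needs a constrained structure.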

Key Designs

Additive Vector Quantization:
- Function: Quantize each KV vector as the sum of codewords, one drawn from each of \(M\) codebooks.
- Mechanism: \(v \approx c_{i_1} + c_{i_2} + \ldots + c_{i_M}\), where each codebook holds \(K\) codewords, so each index \(i_m\) costs only \(\log_2 K\) bits and a full vector costs \(M \log_2 K\) bits.
- Design Motivation: Vector-level quantization yields smaller errors than scalar-level quantization at the same bit-width.
RoPE-Commutative Codebook:
- Function: Design the codebook such that \(\text{Decode}(\text{RoPE}(\text{Encode}(k))) = \text{RoPE}(\text{Decode}(\text{Encode}(k)))\).
- Mechanism: Because the codebook commutes with the RoPE rotation, the rotation can be moved past the codebook at decode time, so the query-codebook product can be precomputed once and reused across tokens.
- Design Motivation: Avoid the \(O(N \cdot d)\) overhead of decoding each token and applying RoPE to it, reducing the cost to \(O(K \cdot d)\) (where \(N\) is the number of cached tokens, \(d\) is the vector dimension, and \(K\) is the codebook size).
EM Algorithm Codebook Training:
- Function: Alternately perform the E-step (assigning codewords) and M-step (updating codebook centroids).
- Mechanism: Minimize the quantization reconstruction error under the RoPE commutativity constraint.
- Design Motivation: EM is a classic approach to training vector quantizers and is guaranteed to converge (to a local optimum).
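
The additive-quantization design above can be sketched in a few lines. This is a minimal illustration, not the paper's method: it uses random codebooks, a greedy residual encoder as a stand-in for whatever encoder CommVQ actually uses, and it ignores RoPE. It shows two points from the list: a vector costs \(M \log_2 K\) bits, and attention scores can be read from a table of query-codeword dot products whose size depends on the codebooks rather than on the number of tokens. All names and sizes are hypothetical.

```python
import math
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def encode_greedy(vec, codebooks):
    """Greedy residual encoding (stand-in encoder): one index per codebook."""
    residual = list(vec)
    indices = []
    for book in codebooks:
        best = min(range(len(book)),
                   key=lambda i: sum((r - c) ** 2 for r, c in zip(residual, book[i])))
        indices.append(best)
        residual = [r - c for r, c in zip(residual, book[best])]
    return indices

def decode(indices, codebooks):
    """Reconstruct a vector as the sum of the selected codewords."""
    out = [0.0] * len(codebooks[0][0])
    for book, i in zip(codebooks, indices):
        out = [o + c for o, c in zip(out, book[i])]
    return out

def bits_per_vector(codebooks):
    """M * log2(K) when all M codebooks hold K codewords."""
    return sum(math.log2(len(book)) for book in codebooks)

def scores_via_table(query, all_indices, codebooks):
    """Precompute query-codeword dot products once, then look them up per token."""
    table = [[dot(query, c) for c in book] for book in codebooks]
    return [sum(table[m][i] for m, i in enumerate(idx)) for idx in all_indices]

rng = random.Random(0)
dim, num_books, book_size, num_tokens = 8, 4, 16, 32   # illustrative sizes
codebooks = [[[rng.gauss(0, 1) for _ in range(dim)] for _ in range(book_size)]
             for _ in range(num_books)]
keys = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_tokens)]
query = [rng.gauss(0, 1) for _ in range(dim)]

codes = [encode_greedy(k, codebooks) for k in keys]
fast = scores_via_table(query, codes, codebooks)
slow = [dot(query, decode(c, codebooks)) for c in codes]

print("bits per vector:", bits_per_vector(codebooks))
print("max score difference:", max(abs(a - b) for a, b in zip(fast, slow)))
```

With 4 codebooks of 16 codewords over 8 dimensions, each vector is stored in 16 bits, that is, 2 bits per dimension. The table lookup and the decode-then-dot path give the same scores up to floating-point rounding.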

Loss & Training

Objective: quantization reconstruction error, minimized subject to the RoPE commutativity constraint.
Triton kernels implement the method so that the memory savings are realized in practice.
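
The alternating E-step/M-step structure can be illustrated with plain k-means, which is a hard-assignment form of EM on the reconstruction error. This is a swapped-in simplification: it trains a single unconstrained codebook and does not enforce the RoPE commutativity constraint that CommVQ adds. The function name, data, and sizes are hypothetical.

```python
import random

def sq_dist(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

def train_codebook(vectors, num_codewords, num_iters=10, seed=0):
    """Plain k-means: returns the codebook and the per-iteration mean error."""
    rng = random.Random(seed)
    # Initialize codewords from randomly chosen training vectors.
    codebook = [list(v) for v in rng.sample(vectors, num_codewords)]
    errors = []
    for _ in range(num_iters):
        # E-step: assign every vector to its nearest codeword.
        assign = [min(range(num_codewords), key=lambda k: sq_dist(v, codebook[k]))
                  for v in vectors]
        # Mean reconstruction error under the current assignment.
        errors.append(sum(sq_dist(v, codebook[a])
                          for v, a in zip(vectors, assign)) / len(vectors))
        # M-step: move each codeword to the mean of its assigned vectors.
        for k in range(num_codewords):
            members = [v for v, a in zip(vectors, assign) if a == k]
            if members:  # keep the old codeword if nothing was assigned to it
                codebook[k] = [sum(col) / len(members) for col in zip(*members)]
    return codebook, errors

data_rng = random.Random(1)
calibration = [[data_rng.gauss(0, 1) for _ in range(4)] for _ in range(200)]
codebook, errors = train_codebook(calibration, num_codewords=8, num_iters=5)
print([round(e, 4) for e in errors])
```

The recorded error never increases from one iteration to the next, which is the sense in which this style of training converges to a local optimum.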

Key Experimental Results

Main Results

LLaMA-3.1 8B long-context benchmarks:

Method	Bit-width	LongBench	InfiniteBench	Memory Saving
FP16	16-bit	42.1	22.8	1×
KVQuant	2-bit	38.5	18.2	8×
CommVQ	2-bit	41.8	22.1	8×
KVQuant	1-bit	28.3	11.5	16×
CommVQ	1-bit	36.2	17.8	16×
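
The Memory Saving column follows from the bit-width alone: storing each cached value in \(b\) bits instead of 16 shrinks the cache by \(16 / b\). The sketch below works this out for a made-up model configuration (the layer, head, and token counts are illustrative and not taken from the source), and it ignores codebook storage and any other metadata overhead.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, num_tokens, bits_per_value):
    """Bytes needed for keys and values at a given average bit-width."""
    num_values = 2 * num_layers * num_kv_heads * head_dim * num_tokens  # 2 = K and V
    return num_values * bits_per_value / 8

# Hypothetical configuration, chosen only to make the arithmetic concrete.
config = dict(num_layers=4, num_kv_heads=2, head_dim=64, num_tokens=1024)

fp16 = kv_cache_bytes(bits_per_value=16, **config)
for bits in (2, 1):
    compressed = kv_cache_bytes(bits_per_value=bits, **config)
    print(f"{bits}-bit: {compressed:.0f} bytes, saving {fp16 / compressed:.0f}x")
```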

Ablation Study

Configuration	LongBench	Description
Scalar-wise 2-bit	38.5	Baseline
Vector quantization 2-bit (w/o RoPE commutation)	40.9	Advantages of vector quantization
Vector quantization 2-bit (+RoPE commutation)	41.8	Full method

Key Findings

Near-lossless performance at 2-bit (LongBench 42.1 \(\rightarrow\) 41.8), outperforming all baselines.
First to achieve usable accuracy at 1-bit (LongBench 36.2 vs. 42.1 for FP16).
LLaMA-3.1 8B runs a 128K context directly on a single RTX 4090 (24GB).

Highlights & Insights

The RoPE-commutative codebook is the core innovation: it builds the mathematical properties of positional embeddings into the quantization scheme.
The shift from scalar-wise to vector-level quantization matters most at ultra-low bit-widths.
A 1-bit KV cache makes long-context LLMs practical on consumer-grade GPUs.

Limitations & Future Work

Codebook training requires calibration data.
RoPE commutativity requires a specific codebook structure, which may limit representation capacity.
Evaluated only on the LLaMA model family.

Related Work & Insights

vs KVQuant: KVQuant uses scalar-wise quantization; CommVQ's vector-level quantization retains more accuracy at the same bit-width (LongBench 41.8 vs. 38.5 at 2-bit).
vs any4 (weight quantization): any4 uses LUTs for weight quantization, while CommVQ uses vector quantization for KV cache compression, so the two are complementary.
The approach may inspire inference optimizations for other Transformers that use RoPE.

Rating

Novelty: ⭐⭐⭐⭐⭐ The RoPE-commutative codebook design is highly creative.
Experimental Thoroughness: ⭐⭐⭐⭐ Multiple benchmarks, 1-bit/2-bit, and memory analysis.
Writing Quality: ⭐⭐⭐⭐ Clear mathematical derivation.
Value: ⭐⭐⭐⭐⭐ Enables long-context LLMs on consumer-grade GPUs.