ChunkKV: Semantic-Preserving KV Cache Compression for Efficient Long-Context LLM Inference

Conference: NeurIPS 2025 arXiv: 2502.00299 Code: https://github.com/NVIDIA/kvpress Area: Model Compression Keywords: KV cache compression, semantic chunk, layer-wise index reuse, long-context inference, memory efficiency

TL;DR

ChunkKV elevates the basic unit of KV cache compression from discrete tokens to semantic chunks (groups of contiguous tokens). By aggregating attention scores at the chunk level, it selects semantically intact segments for retention, and leverages the high cross-layer index similarity induced by chunking to enable layer-wise index reuse. At a 10% compression ratio, ChunkKV improves over SnapKV/PyramidKV by up to 8.7% and achieves a 26.5% throughput gain.

Background & Motivation

Background: In long-context LLM inference, the KV cache consumes up to 70% of GPU memory. For a 7B model, the KV cache for a single token occupies approximately 0.5 MB, meaning a 10K-token prompt alone requires roughly 5 GB of VRAM. Existing compression methods (H2O, SnapKV, PyramidKV) assess token importance based on token-level attention scores and selectively discard low-scoring tokens.
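
The per-token figure is consistent with a back-of-the-envelope estimate. Assuming a LLaMA-2-7B-style configuration (32 layers, 32 KV heads, head dimension 128, fp16 storage; exact numbers vary with architecture):

\[
\underbrace{2}_{K,\,V} \times \underbrace{32}_{\text{layers}} \times \underbrace{32}_{\text{heads}} \times \underbrace{128}_{\text{head dim}} \times \underbrace{2\,\text{bytes}}_{\text{fp16}} = 524{,}288\ \text{bytes} \approx 0.5\ \text{MB per token},
\]

which matches the ~0.5 MB per token and ~5 GB for a 10K-token prompt quoted above.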

Limitations of Prior Work: Token-level importance estimation ignores semantic dependencies between tokens. The example in Figure 1 clearly illustrates the problem — for the query "what does a turaco eat?", token-level methods retain individually high-scoring words (e.g., "turaco", "eat", "bamboo") while discarding their subject/predicate/object context, resulting in semantic fragmentation.

Key Challenge: Complete semantic units in natural language typically manifest as contiguous sequences (subject-verb-object structures, clauses, phrases), and token-level compression disrupts this continuity.

Goal: Preserve complete semantic information in KV cache compression without increasing — and ideally reducing — computational overhead.

Key Insight: Contiguous tokens are grouped into chunks (default chunk size = 10), and importance scoring, retention, and eviction all operate at chunk granularity. Retained chunks contain complete subject-verb-object structures, eliminating semantic fragmentation. Further analysis reveals that the cross-layer Jaccard similarity of retained chunk indices is substantially higher than that of token-level indices (57.74% vs. 27.95%), naturally motivating layer-wise index reuse for acceleration.

Core Idea: Replace discrete tokens with semantic chunks as the atomic unit of KV cache compression, preserving complete semantic structures and exploiting cross-layer index consistency for lossless acceleration.

Method

Overall Architecture

ChunkKV compresses the KV cache during the prefilling stage: (1) an observe window (queries from the last \(w\) tokens) is used to compute attention scores; (2) all tokens are partitioned into \(C = \lceil T_k / c \rceil\) chunks; (3) attention scores within each chunk are aggregated into a chunk score; (4) the top-\(k\) chunks are retained and the rest are discarded; (5) the retained chunk indices are reused across adjacent layers (layer-wise index reuse).
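
The per-layer selection (steps 1–4 plus the suffix) can be written as a short sketch. This is a minimal, illustrative implementation, not the official kvpress code; the helper name `chunkkv_select` and the inputs `attn`, `keys`, `values` (attention already averaged over heads) are assumptions.

```python
import torch

def chunkkv_select(attn, keys, values, chunk_size=10, window=32, keep_ratio=0.1):
    """Chunk-level KV cache selection for a single layer (illustrative sketch).

    attn:   [T_q, T_k] prefill attention weights, averaged over heads
    keys:   [T_k, d]   key cache for this layer
    values: [T_k, d]   value cache for this layer
    """
    T_k = keys.shape[0]
    prefix_len = T_k - window

    # (1) Observe window: attention the last `window` queries pay to each prefix key.
    obs = attn[-window:, :prefix_len].mean(dim=0)                      # [prefix_len]

    # (2) Partition the prefix into contiguous chunks of size `chunk_size`.
    num_chunks = (prefix_len + chunk_size - 1) // chunk_size
    pad = num_chunks * chunk_size - prefix_len
    padded = torch.nn.functional.pad(obs, (0, pad))

    # (3) Aggregate token scores into one score per chunk.
    chunk_scores = padded.view(num_chunks, chunk_size).sum(dim=1)      # [num_chunks]

    # (4) Keep the top-k chunks and expand back to token indices (original order).
    k = max(1, int(keep_ratio * T_k) // chunk_size)
    top_chunks = chunk_scores.topk(min(k, num_chunks)).indices
    offsets = torch.arange(chunk_size, device=obs.device)
    token_idx = (top_chunks[:, None] * chunk_size + offsets).flatten()
    token_idx = token_idx[token_idx < prefix_len].sort().values

    # (5) Always keep the observe-window suffix (most recent tokens).
    suffix = torch.arange(prefix_len, T_k, device=obs.device)
    keep = torch.cat([token_idx, suffix])
    return keys[keep], values[keep], keep
```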

Key Designs

  1. Chunk-Based Attention Score Aggregation:

    • Function: Evaluate KV cache importance at the chunk granularity.
    • Mechanism: The KV cache of length \(T_k\) is partitioned into \(C = \lceil T_k / c \rceil\) contiguous chunks of size \(c\). The score of chunk \(i\) is the sum of attention scores of all tokens within it: \(A_{\text{chunk}}^i = \sum_{j \in \text{chunk}_i} A_j\). The top-\(k\) chunks are selected for retention.
    • Design Motivation: Retaining or discarding entire chunks ensures that the compressed KV cache preserves complete subject-verb-object structures and phrases. Compared to token-level methods at the same compression ratio, this yields lower L1 loss (~2%) and higher attention cosine similarity (~1.5%).
  2. Layer-Wise Index Reuse:

    • Function: Reuse compressed KV cache indices across adjacent Transformer layers.
    • Mechanism: The Jaccard similarity of indices retained by ChunkKV across adjacent layers is substantially higher than that of token-level methods (LLaMA-3-8B: 57.74% vs. SnapKV's 27.95%). Accordingly, ChunkKV compression is computed only at the first layer every \(N_{\text{reuse}}\) layers, with subsequent layers directly reusing the indices.
    • Design Motivation: Chunk-level selection is inherently more stable — the importance of semantic chunks varies less across layers than that of individual tokens. Experiments show that reuse = 2 incurs only a 0.5% performance drop while reducing compression time by ~20%. A minimal reuse loop is sketched after this list.
  3. Observe Window + Suffix Concatenation:

    • The last \(w\) tokens of the KV cache (recent context) are always retained and concatenated with the selected top-\(k\) chunks to form the final compressed KV cache.
    • The default chunk size is 10; performance is robust for values in the range of 5–20.
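
Layer-wise index reuse then amounts to running the selection only at the first layer of every group of \(N_{\text{reuse}}\) layers and copying the retained indices to the remaining layers. A minimal sketch, reusing the hypothetical `chunkkv_select` from the previous snippet (the per-layer inputs `layer_attns`, `layer_keys`, `layer_values` are assumptions):

```python
def compress_all_layers(layer_attns, layer_keys, layer_values, n_reuse=2, **kwargs):
    """Compress every layer's KV cache, recomputing chunk indices only once
    per group of `n_reuse` consecutive layers (illustrative sketch)."""
    compressed, keep = [], None
    for i, (attn, k, v) in enumerate(zip(layer_attns, layer_keys, layer_values)):
        if i % n_reuse == 0:
            # First layer of the group: run chunk-level selection.
            _, _, keep = chunkkv_select(attn, k, v, **kwargs)
        # Remaining layers in the group reuse the same retained indices.
        compressed.append((k[keep], v[keep]))
    return compressed
```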

Loss & Training

ChunkKV is entirely training-free, modifying only the KV cache management strategy at inference time. The implementation is open-sourced on top of the NVIDIA kvpress library and is compatible with Flash Attention 2.

Key Experimental Results

Main Results

| Task | Model | Compression Ratio | StreamingLLM | H2O | SnapKV | PyramidKV | ChunkKV |
|---|---|---|---|---|---|---|---|
| GSM8K | LLaMA-3.1-8B | 10% | 47.8 | 45.0 | 50.3 | 48.2 | 65.7 |
| Many-Shot GSM8K | LLaMA-3.1-8B | 10% | 74.3 | 51.2 | 68.2 | 70.3 | 79.3 |
| NIAH | LLaMA-3.1-8B | 128 KV | 23.7 | 47.9 | 58.9 | 65.1 | 73.8 |
| LongBench | LLaMA-3-8B | 10% | -13.8% | -10.6% | -3.2% | -3.3% | -2.3% |
| LongBench | Qwen2-7B | 10% | -5.3% | -0.6% | -0.4% | -1.0% | +0.4% |
| JailbreakV | LLaMA-3.1-8B | 10% | 53.1 | 65.4 | 84.3 | 85.5 | 87.9 |

ChunkKV's advantage is most pronounced at extreme compression ratios (10% retained), outperforming the second-best method on GSM8K by 15.4 points (65.7 vs. SnapKV's 50.3).

Ablation Study

| Configuration | Result | Notes |
|---|---|---|
| Chunk size 3 → 5 → 10 → 20 → 30 | LongBench: 40.49 / 40.47 / 40.51 / 40.05 / 39.57 | Size 10 is optimal; robust in the 5–20 range |
| Layer reuse = 1 / 2 / 3 | LongBench: 40.51 / 40.27 / 39.45 | Reuse = 2 drops only 0.59%; recommended default |
| vs. KIVI 2-bit quantization | Total generation time: 164.66 s vs. 226.52 s | ChunkKV is 27.3% faster than KIVI-2bit |
| Hybrid (chunk at lower layers + token at upper layers) | Avg 39.80 vs. pure ChunkKV 40.51 | Pure ChunkKV is superior overall; hybrid slightly better on summarization/few-shot |

Key Findings

  • Cross-layer index similarity is a byproduct of chunking: Chunk-level selection is inherently more stable, making index reuse feasible; token-level methods (SnapKV: 27.95%) cannot exploit this effectively. A minimal similarity check is sketched after this list.
  • Advantages are pronounced under extreme compression: At 30% compression, differences across methods are small; at 10%, ChunkKV pulls 3–15% ahead of the second-best method, demonstrating that semantic preservation is critical under aggressive compression.
  • Effective for reasoning models (DeepSeek-R1): On R1-Distill-Llama-8B at 10% compression, ChunkKV achieves 65.7% vs. PyramidKV's 62.6%.
  • Effective for both Chinese and English: On the Chinese LongBench with Qwen2, ChunkKV even surpasses FullKV (39.45 vs. 38.60), possibly because chunk-level filtering removes noisy tokens.
  • Orthogonal to quantization: ChunkKV reduces KV cache size while KIVI reduces precision; the two are composable. ChunkKV achieves faster inference than KIVI-2bit at comparable compression ratios (164.66s vs. 226.52s), as eviction is completed prior to prefilling.
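
The cross-layer similarity figures quoted above are Jaccard indices over the sets of indices retained at adjacent layers; a minimal way to measure it (function name and inputs are illustrative, e.g. the `keep` tensors from the earlier sketch):

```python
def index_jaccard(keep_a, keep_b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| between the index sets retained
    at two adjacent layers (illustrative)."""
    a, b = set(keep_a.tolist()), set(keep_b.tolist())
    return len(a & b) / len(a | b)
```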

Highlights & Insights

  • Minimal design, substantial gains: The core modification replaces token-level top-\(k\) selection with chunk-level top-\(k\) selection, requiring minimal code changes while yielding significant improvements — demonstrating the power of selecting the correct level of abstraction.
  • Theoretical grounding in ICL: A theoretical justification is provided from an in-context learning perspective — chunks preserve complete demonstrations intact, reducing the noise term \(\xi_\theta(r)\) in the distinguishability condition, thereby lowering the 0-1 risk bound for ICL.
  • Index reuse as an emergent discovery: The high cross-layer consistency naturally induced by chunk-level selection opens an entirely new acceleration dimension without any additional computation.

Limitations & Future Work

  • A fixed chunk size may not be appropriate for all languages and tasks. Adaptive boundary detection (e.g., chunking based on sentence boundaries) could yield further improvements.
  • In scenarios requiring verbatim retention (e.g., legal documents, biomedical analysis), discarding any chunk may result in the loss of critical information.
  • The hybrid strategy (chunk at lower layers + token at upper layers) performs better on summarization tasks, suggesting that task-adaptive compression strategies may be warranted in future work.

Comparison with Related Work

  • vs. SnapKV: SnapKV performs token-level attention selection within an observe window. ChunkKV uses the same observe window but selects at the chunk level, achieving 73.8% vs. 58.9% on NIAH (KV = 128).
  • vs. PyramidKV: PyramidKV finds that different layers require different compression ratios (pyramid shape); ChunkKV finds that adjacent layers can share indices. The two observations are complementary.
  • vs. KIVI (quantization): Quantization reduces precision (2-bit); eviction reduces quantity (10%). ChunkKV is even faster than KIVI-2bit (27.3%) because eviction is completed before prefilling begins.
  • vs. ARM / Controlling Thinking Speed: These works optimize token efficiency at the reasoning chain level (selecting reasoning formats / controlling inference speed), while ChunkKV optimizes memory efficiency at the operator level; the two are orthogonal and can be combined.

Rating

  • Novelty: ⭐⭐⭐⭐ The chunk idea is concise and effective, and index reuse is an elegant emergent finding, though the core idea is relatively intuitive.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ Four models, four benchmark categories, multiple compression ratios, chunk size ablation, quantization comparison, hybrid analysis, efficiency analysis, and theoretical justification.
  • Writing Quality: ⭐⭐⭐⭐ Figure 1's comparison is intuitive and Table 1's method comparison is clear.
  • Value: ⭐⭐⭐⭐⭐ Training-free, already integrated into the NVIDIA kvpress library, and orthogonal to quantization — extremely high practical value.