KVzip: Query-Agnostic KV Cache Compression with Context Reconstruction¶
Conference: NeurIPS 2025 (Oral)
arXiv: 2505.23416
Code: Available
Area: Model Compression / LLM Inference Optimization
Keywords: KV Cache Compression, Query-Agnostic, Context Reconstruction, Long-Context Inference, Cache Reuse
TL;DR¶
This paper proposes KVzip, a query-agnostic KV cache eviction method that quantifies the importance of each KV pair by leveraging the LLM itself to reconstruct the original context from the cached KV pairs. KVzip achieves 3–4× KV cache compression and approximately 2× reduction in FlashAttention decoding latency, while significantly outperforming existing query-aware methods in multi-query scenarios.
Background & Motivation¶
As the context lengths processed by LLMs continue to grow, the memory overhead of KV caches and attention computation latency have become deployment bottlenecks:
Memory Challenge: In long-context scenarios (e.g., 170K tokens), KV caches consume substantial GPU memory.
Latency Challenge: Attention computation latency scales linearly with KV cache length.
Existing KV cache compression methods are predominantly query-aware, determining which KV pairs to retain based on attention scores of the current query. However, this approach has fundamental limitations:
- Cache non-reusability: Each query requires recompression; the same compressed cache cannot be reused across multiple queries.
- Performance degradation: Even at 90% cache budget ratio, query-aware methods suffer performance degradation in multi-query scenarios.
- Query dependency: Decisions about which KV pairs are important can only be made at query time.
The core insight of KVzip is that the importance of a KV pair should be measured by its contribution to context reconstruction, rather than by the attention scores of a specific query.
Method¶
Overall Architecture¶
The KVzip pipeline proceeds as follows (a minimal code sketch follows the list):
- Prefill Stage: The LLM processes the context normally, generating a complete KV cache.
- Importance Estimation: The LLM itself is used to evaluate the importance of each KV pair with respect to reconstructing the original context.
- KV Eviction: Low-importance KV pairs are removed, yielding a compressed KV cache.
- Multi-Query Inference: The compressed KV cache is reused across multiple distinct queries.
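To make the pipeline concrete, here is a minimal, self-contained PyTorch sketch. The tensor shapes and helper names (prefill, estimate_importance, evict) are assumptions chosen for illustration, not the released API; the scoring step is a random stand-in for the reconstruction-based procedure described below.

```python
import torch

def prefill(context_ids, num_layers=4, num_heads=8, head_dim=64):
    """Prefill stand-in: returns a dummy full KV cache of shape (L, H, T, D)."""
    T = context_ids.shape[-1]
    keys = torch.randn(num_layers, num_heads, T, head_dim)
    values = torch.randn(num_layers, num_heads, T, head_dim)
    return keys, values

def estimate_importance(keys, values, context_ids):
    """Scoring stand-in: in KVzip this would come from context reconstruction;
    here we return random scores, one per (layer, head, position)."""
    num_layers, num_heads, T, _ = keys.shape
    return torch.rand(num_layers, num_heads, T)

def evict(keys, values, scores, keep_ratio=0.25):
    """Keep the top-`keep_ratio` fraction of KV pairs per (layer, head)."""
    T = keys.shape[2]
    k = max(1, int(T * keep_ratio))
    idx = scores.topk(k, dim=-1).indices.sort(dim=-1).values           # (L, H, k)
    gather_idx = idx.unsqueeze(-1).expand(-1, -1, -1, keys.shape[-1])  # (L, H, k, D)
    return keys.gather(2, gather_idx), values.gather(2, gather_idx)

context_ids = torch.randint(0, 32000, (1, 1024))
keys, values = prefill(context_ids)                       # 1) prefill once
scores = estimate_importance(keys, values, context_ids)   # 2) score every KV pair
small_k, small_v = evict(keys, values, scores, 0.25)      # 3) evict once
# 4) small_k / small_v are now reused, unchanged, for every subsequent query.
```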
Key Designs¶
1. Importance Estimation via Context Reconstruction
KVzip uses the LLM itself as an evaluator: given a candidate set of retained KV pairs, it measures whether the LLM can reconstruct the original input context from those pairs. Stronger reconstruction ability indicates that the retained KV pairs preserve more critical information.
Concretely, importance scores are computed as follows (a naive sketch appears below):
- The marginal increase in context reconstruction loss incurred by removing each KV pair is measured.
- A larger increase implies greater importance, and the corresponding KV pair is retained.
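The sketch below is a deliberately literal reading of this rule: `recon_loss_fn` is a hypothetical hook that runs the LLM with only the KV pairs marked in a keep-mask and returns the context-reconstruction loss, and the leave-one-out loop makes the "marginal increase" definition explicit. An efficient implementation would batch or approximate this rather than run one evaluation per KV pair.

```python
import torch

def leave_one_out_importance(recon_loss_fn, num_kv):
    """Score each KV pair by how much the context-reconstruction loss rises
    when that single pair is dropped (a larger rise means higher importance)."""
    full_mask = torch.ones(num_kv, dtype=torch.bool)
    base_loss = recon_loss_fn(full_mask)
    scores = torch.empty(num_kv)
    for i in range(num_kv):
        mask = full_mask.clone()
        mask[i] = False                           # evict exactly one KV pair
        scores[i] = recon_loss_fn(mask) - base_loss
    return scores

# Toy stand-in for the LLM hook: pretend later positions carry more information,
# so dropping them hurts reconstruction more.
weights = torch.linspace(0.0, 1.0, 8)
toy_loss = lambda mask: 1.0 + float((~mask).float() @ weights)

print(leave_one_out_importance(toy_loss, num_kv=8))  # scores grow with position
```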
2. Query-Agnosticism
Importance estimation is entirely based on the context itself, independent of any future queries. This means:
- Compression is performed only once.
- The compressed KV cache can serve arbitrary subsequent queries.
- The approach is particularly well-suited for prefill-once, decode-many deployment scenarios (e.g., document QA, multi-turn dialogue), as illustrated in the sketch below.
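Continuing the pipeline sketch above (it reuses the hypothetical small_k / small_v cache), the usage pattern looks like this; decode_with_cache is a made-up helper standing in for autoregressive decoding against the compressed cache.

```python
def decode_with_cache(query, keys, values):
    """Hypothetical decode helper: a real system would run autoregressive
    decoding with the query attending to the compressed context KV cache."""
    return f"<answer to {query!r} using {keys.shape[2]} cached positions per head>"

queries = ["What is the main claim?", "List the key dates.", "Summarize section 2."]
for q in queries:
    # Every query reuses the SAME compressed cache; no per-query recompression.
    print(decode_with_cache(q, small_k, small_v))
```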
3. Layer-Wise Eviction Strategy
Different attention layers and attention heads may assign different importance to KV pairs. KVzip independently estimates importance and performs eviction at each layer, enabling finer-grained compression.
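One way to realize non-uniform, per-layer and per-head budgets is to rank scores globally and keep everything above a single threshold, so heads whose pairs score higher retain more entries. This is an illustrative choice under assumed score shapes, not a statement of the paper's exact eviction rule.

```python
import torch

def global_threshold_eviction_mask(scores, keep_ratio=0.25):
    """scores: (num_layers, num_heads, seq_len), higher = more important.
    Returns a boolean keep-mask where the number of retained pairs can differ
    across layers and heads, while the overall budget stays at `keep_ratio`."""
    flat = scores.flatten()
    k = max(1, int(flat.numel() * keep_ratio))
    threshold = flat.topk(k).values.min()
    return scores >= threshold

# Give heads different score scales so the non-uniform budgets are visible.
head_scale = torch.linspace(0.5, 1.5, 8).view(1, 8, 1)
scores = torch.rand(4, 8, 1024) * head_scale
mask = global_threshold_eviction_mask(scores, keep_ratio=0.25)
print(mask.float().mean(dim=-1))  # per-(layer, head) keep ratios vary across heads
```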
Loss & Training¶
KVzip is a training-free method that directly leverages the capabilities of existing LLMs for importance estimation. The context reconstruction loss is the standard cross-entropy loss, measuring the ability to reconstruct the original token sequence from the compressed KV cache.
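As a minimal illustration of this loss, the snippet below computes token-level cross-entropy over the context. The random logits are placeholders for what the model would produce when re-decoding the context while attending only to the compressed KV cache; the vocabulary size is an arbitrary assumption.

```python
import torch
import torch.nn.functional as F

vocab_size, T = 32000, 1024
context_ids = torch.randint(0, vocab_size, (T,))
# Placeholder for the logits the LLM would emit when asked to reproduce the
# context from the compressed cache; a real run would take these from the model.
logits = torch.randn(T, vocab_size)
recon_loss = F.cross_entropy(logits, context_ids)  # lower = better reconstruction
print(float(recon_loss))
```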
Key Experimental Results¶
Main Results¶
Table 1: Performance in Multi-Query Scenarios (25% KV Cache Budget)
| Method | Type | QA Accuracy | Retrieval Accuracy | Reasoning Accuracy | Code Understanding Accuracy | Average |
|---|---|---|---|---|---|---|
| Full Cache (100%) | — | Baseline | Baseline | Baseline | Baseline | 100% |
| H2O | Query-Aware | Significant Drop | Significant Drop | Significant Drop | Significant Drop | <80% |
| SnapKV | Query-Aware | Drop | Drop | Drop | Drop | <85% |
| KVzip | Query-Agnostic | Near Baseline | Near Baseline | Near Baseline | Near Baseline | ~97% |
Table 2: Compression Results Across Models
| Model | KV Cache Compression Ratio | FlashAttention Latency Reduction | Performance Loss |
|---|---|---|---|
| LLaMA3.1-8B | 3–4× | ~2× | Negligible |
| Qwen2.5-14B | 3–4× | ~2× | Negligible |
| Gemma3 | 3–4× | ~2× | Negligible |
Table 3: Comparison at Different Cache Budget Ratios (Multi-Query Scenario)
| Cache Budget Ratio | H2O (Query-Aware) | SnapKV (Query-Aware) | KVzip (Query-Agnostic) |
|---|---|---|---|
| 90% | Degradation | Degradation | Near Full Cache |
| 50% | Severe Degradation | Noticeable Degradation | Minor Impact |
| 25% | Unusable | Large Drop | Still Acceptable |
Ablation Study¶
Context Reconstruction vs. Other Importance Metrics
| Importance Metric | Average Performance at 25% Budget |
|---|---|
| Random Eviction | Very Poor |
| Attention Score (first query) | Moderate but Unstable |
| Accumulated Attention Score | Good but Degrades |
| Context Reconstruction (KVzip) | Best and Stable |
Long-Context Scaling Test
| Context Length | KVzip Compression Ratio | Performance Retention |
|---|---|---|
| 8K tokens | 3× | >98% |
| 32K tokens | 3× | >97% |
| 128K tokens | 4× | >95% |
| 170K tokens | 4× | >94% |
Key Findings¶
- Query-aware methods fail in multi-query scenarios: Even at a 90% budget ratio, methods such as H2O degrade on queries other than the one used for compression, because their eviction decisions are optimized for that single query.
- Context reconstruction is a superior importance signal: It better captures the global importance of KV pairs compared to attention scores.
- Cross-model consistency: The method is effective across LLaMA3.1, Qwen2.5, and Gemma3.
- Feasibility for extremely long contexts: Effective compression is maintained at 170K token context lengths.
- Practical speedup: FlashAttention decoding latency is reduced by approximately 2×, with 3–4× memory reduction.
Highlights & Insights¶
- Query-agnostic paradigm: KVzip reframes the KV cache compression problem — importance should be determined by the information density of the context, not by the query.
- Self-supervised signal: The method cleverly repurposes the LLM itself as a "judge" to assess KV pair importance, requiring no additional annotation or training.
- Prefill-once deployment: The approach naturally fits real-world scenarios requiring multiple queries over the same context, such as long-document QA and agentic reasoning.
- NeurIPS 2025 Oral: Reflects strong recognition from reviewers for this concise and effective approach.
- Plug-and-play: Requires no modification to model architecture and can be directly applied to any Transformer-based LLM.
Limitations & Future Work¶
- Overhead of importance estimation: Additional forward computation is required to assess KV pair importance (a one-time cost).
- Static compression: Once compressed, the retained KV pairs cannot be dynamically adjusted for new queries.
- Combination with quantization: KV cache eviction and quantization are orthogonal techniques; combining them could yield greater compression ratios.
- Attention head heterogeneity: Different attention heads may have different optimal compression ratios; adaptive strategies warrant further exploration.
- Streaming scenarios: In settings with continuously growing context (e.g., dialogue), incremental importance updates are needed.
Related Work & Insights¶
- H2O (Zhang et al., 2023): KV cache eviction based on accumulated attention scores.
- SnapKV (Li et al., 2024): Compression via observation-window attention patterns.
- StreamingLLM (Xiao et al., 2023): Streaming inference that retains attention sink tokens.
- GQA / MQA: Reducing KV cache size through shared KV heads.
- Key insight: The critical question in KV cache compression is not "which tokens are important to the current query," but rather "which tokens carry contextual information."
Rating¶
| Dimension | Score (1–5) | Note |
|---|---|---|
| Novelty | 5 | Query-agnostic compression paradigm; novel perspective via context reconstruction |
| Technical Quality | 4 | Simple yet effective method; NeurIPS Oral |
| Experimental Thoroughness | 5 | Multi-model, multi-task, multi-length, multi-baseline evaluation |
| Practicality | 5 | Plug-and-play; addresses real deployment pain points |
| Writing Quality | 4 | Clear problem formulation; thorough presentation of results |