Inference-Time Hyper-Scaling with KV Cache Compression

  • Conference: NeurIPS 2025
  • arXiv: 2506.05345
  • Code: HuggingFace Models (NVIDIA Open Source)
  • Area: Model Compression / Inference Acceleration
  • Keywords: KV Cache Compression, Inference-Time Scaling, Sparse Attention, Dynamic Memory Sparsification, Long-Sequence Reasoning

TL;DR

This paper proposes the Inference-Time Hyper-Scaling paradigm: by efficiently compressing the KV cache, more or longer parallel reasoning sequences can be generated under the same compute/memory budget, substantially improving the accuracy of reasoning models on tasks such as mathematics, code, and scientific reasoning.

Background & Motivation

Inference-time scaling is currently a primary approach for enhancing the reasoning capability of LLMs, trading computation for accuracy by generating longer reasoning chains (e.g., Chain-of-Thought) or multiple parallel paths (e.g., Best-of-N). However, the true bottleneck of generation cost in Transformer LLMs is not the number of generated tokens but rather the size of the KV cache:

Memory bottleneck: The KV cache grows linearly with sequence length; even an 8K context requires substantial GPU memory for a 32B-parameter model (a back-of-envelope estimate follows below).

Latency bottleneck: Autoregressive generation is memory-bandwidth-bound; each step requires reading the entire KV cache, and larger caches incur higher latency.

Scaling ceiling: Under a fixed memory budget, the KV cache size limits the number of sequences that can be generated in parallel or the maximum length of a single sequence.
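
To make the memory bottleneck concrete, here is a back-of-envelope estimate of per-sequence KV cache size. The configuration below (64 layers, 8 grouped-query KV heads, head dimension 128, fp16) is an illustrative assumption for a 32B-class model, not a figure from the paper:

```python
# Back-of-envelope KV cache size: 2 (K and V) * layers * KV heads * head dim
# * sequence length * bytes per element.
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, dtype_bytes=2):
    return 2 * layers * kv_heads * head_dim * seq_len * dtype_bytes

# Hypothetical 32B-class config with grouped-query attention (8 KV heads).
gib = kv_cache_bytes(layers=64, kv_heads=8, head_dim=128, seq_len=8192) / 2**30
print(f"KV cache per 8K-token sequence: {gib:.1f} GiB")  # -> 2.0 GiB in fp16
```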

The authors identify a key insight: if the KV cache can be compressed at a high compression ratio while preserving accuracy, more tokens can be generated within the same compute budget, enabling further gains in reasoning accuracy—this is the core idea of "hyper-scaling."

Method

Overall Architecture

The core pipeline of Inference-Time Hyper-Scaling:

  1. Compress the KV cache to \(1/C\) of its original size (e.g., \(C=8\)) using a KV cache compression method.
  2. Under the same memory budget, run \(C\) times as many parallel inference paths (width scaling \(W\)), generate sequences \(C\) times longer (depth scaling \(D\)), or mix the two.
  3. Aggregate parallel results via Majority Voting, or directly use the answer from the longest sequence.

A critical prerequisite is that the compression method must maintain generation quality at high compression ratios.
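
As a concrete illustration of the width-scaling path of this pipeline, here is a minimal sketch; `generate` is an assumed user-supplied sampling function, not part of the paper's released code:

```python
from collections import Counter

def hyper_scale_width(generate, prompt, base_width, compression_ratio):
    """Width-scaling sketch: with the KV cache compressed by a factor C,
    the same memory budget fits C times as many parallel reasoning paths.

    `generate(prompt, n)` is an assumed user-supplied function that runs
    the compressed model and returns n final answers, one per sampled path.
    """
    width = base_width * compression_ratio        # W = C x baseline paths
    answers = generate(prompt, width)             # parallel reasoning paths
    # Aggregate the parallel results by majority voting.
    return Counter(answers).most_common(1)[0][0]
```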

Key Designs: Dynamic Memory Sparsification (DMS)

DMS is the KV cache sparsification method proposed in this paper, with three core innovations:

1. Delayed Eviction

  • Unlike conventional methods (e.g., H2O, StreamingLLM) that make eviction decisions at write time, DMS retains the full cache until the capacity limit is reached.
  • Upon reaching the limit, a single batch sparsification is performed to select the most important \(B\) tokens for retention.
  • Advantage: Before eviction, all tokens have implicitly "merged" information across multiple attention layers; the retained tokens thus encode the semantics of the evicted ones (see the sketch below).
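
A minimal sketch of this delayed, batched eviction, assuming per-token importance scores are already available (shapes and the scoring source are illustrative, not the paper's implementation):

```python
import torch

def delayed_evict(keys, values, scores, capacity):
    """Delayed-eviction sketch: retain the full cache until `capacity` is
    reached, then drop to the `capacity` highest-scoring tokens in a single
    batch step.

    keys/values: [seq_len, head_dim]; scores: [seq_len] per-token importance.
    """
    if keys.shape[0] <= capacity:
        return keys, values                       # below the limit: keep all
    keep = torch.topk(scores, capacity).indices   # most important B tokens
    keep = keep.sort().values                     # preserve temporal order
    return keys[keep], values[keep]
```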

2. Learnable Sparsification Thresholds

  • A scoring function is learned per layer and per head to determine which KV pairs to retain, based on attention scores.
  • Differentiable token selection is achieved via Gumbel-Sigmoid (a generic sketch follows this list).
  • Training requires only standard language modeling loss (next-token prediction).
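
The Gumbel-Sigmoid trick itself is standard; a generic PyTorch sketch with a straight-through estimator looks like this (this shows the general technique, not NVIDIA's released code):

```python
import torch

def gumbel_sigmoid(logits, tau=1.0, hard=True):
    """Differentiable keep/evict gate via Gumbel-Sigmoid with a
    straight-through estimator."""
    # Logistic noise (the difference of two Gumbel samples).
    u = torch.rand_like(logits).clamp(1e-6, 1 - 1e-6)
    noise = torch.log(u) - torch.log1p(-u)
    soft = torch.sigmoid((logits + noise) / tau)  # relaxed keep-probability
    if not hard:
        return soft
    hard_gate = (soft > 0.5).float()
    # Forward pass emits hard 0/1 decisions; gradients flow through `soft`.
    return hard_gate + soft - soft.detach()
```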

3. Minimal Training

  • Only 1K training steps (on the order of hours) with a small amount of text data suffice to equip a pretrained model with DMS capability.
  • The trained DMS module generalizes directly to sequences of varying lengths and across different tasks.

Loss & Training

  • Loss function: standard next-token prediction cross-entropy loss.
  • Sparsification is triggered at random positions during training, enabling the model to maintain generation quality under varying compression timings.
  • Only the sparsification-related scoring heads are trained; the original model parameters are frozen (see the sketch below).
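
A sketch of this training setup; `dms_gate` is a hypothetical name prefix for the scoring-head parameters, not an identifier from the paper's code:

```python
import torch

def configure_dms_training(model, gate_prefix="dms_gate"):
    """Freeze the backbone; train only the sparsification scoring heads.
    `dms_gate` is a hypothetical name prefix for the gate parameters."""
    trainable = []
    for name, param in model.named_parameters():
        param.requires_grad = name.startswith(gate_prefix)
        if param.requires_grad:
            trainable.append(param)
    return torch.optim.AdamW(trainable, lr=1e-4)

# Per training step, the position at which eviction kicks in is drawn at
# random (e.g., torch.randint(min_len, max_len, (1,))), so the gates see many
# compression timings; the loss stays plain next-token cross-entropy.
```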

Key Experimental Results

Main Results

Hyper-Scaling is evaluated on multiple reasoning models with 8× compression via DMS:

| Model | Task | Baseline (Full Cache) | DMS 8× + Hyper-Scale | Gain |
|---|---|---|---|---|
| Qwen-R1 32B | AIME 24 | 72.0 | 84.0 | +12.0 |
| Qwen-R1 32B | GPQA | 61.6 | 70.2 | +8.6 |
| Qwen-R1 32B | LiveCodeBench | 57.3 | 67.0 | +9.7 |
| Qwen3-8B | AIME 24 | 62.7 | 73.3 | +10.6 |
| Qwen3-8B | GPQA | 54.8 | 62.1 | +7.3 |
| Qwen3-8B | LiveCodeBench | 48.5 | 58.6 | +10.1 |

Core finding: under an equivalent compute budget (same number of memory reads), the combination of DMS compression and increased parallelism consistently outperforms the uncompressed baseline.

Comparison with Other Compression Methods

| Method | Type | Training Required | AIME 24 @ 8× | GPQA @ 8× |
|---|---|---|---|---|
| Full Cache | — | — | 72.0 | 61.6 |
| StreamingLLM | Training-free | No | 30.2 | 38.1 |
| H2O | Training-free | No | 41.5 | 42.8 |
| SnapKV | Training-free | No | 55.3 | 50.7 |
| DMS (Ours) | Learned | 1K steps | 68.7 | 59.4 |

DMS loses only 2–3 points at 8× compression (3.3 on AIME 24, 2.2 on GPQA), far outperforming all training-free methods.

Ablation Study

Importance of Delayed Eviction:

| Strategy | AIME 24 (8×) | GPQA (8×) |
|---|---|---|
| Eager Eviction | 52.1 | 47.3 |
| Delayed Eviction | 68.7 | 59.4 |

Delayed eviction yields gains of 16.6 points on AIME 24 and 12.1 points on GPQA, validating the critical role of implicit information merging.

Scaling Strategy Comparison (equivalent compute budget):

| Strategy | Width W | Depth D | AIME 24 |
|---|---|---|---|
| Baseline | 1 | — | 72.0 |
| Width scaling only | 8 | — | 80.3 |
| Depth scaling only | 1 | — | 76.5 |
| Mixed scaling | 4 | — | 84.0 |

Mixed scaling (simultaneously increasing parallelism and sequence length) achieves the best performance.
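
One simple way to see why these combinations are budget-equivalent: if the compressed per-token cache is \(1/C\) of full size, the total cache footprint scales as \(W \cdot D / C\) of the baseline, so any split with \(W \cdot D = C\) fits the same memory. A small enumeration, under this simplified accounting (ours, not a formula from the paper):

```python
def feasible_scalings(C):
    """Integer (width W, depth D) pairs with W * D = C, i.e. combinations
    that fit the baseline memory budget when each sequence's cache is 1/C
    of full size (total footprint ~ W * D / C of the uncompressed baseline).
    """
    return [(w, C // w) for w in range(1, C + 1) if C % w == 0]

print(feasible_scalings(8))  # [(1, 8), (2, 4), (4, 2), (8, 1)]
```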

Key Findings

  1. Hyper-scaling is effective: The chain of KV cache compression → more tokens → higher accuracy holds consistently across multiple models and tasks.
  2. Compression quality is critical: Only high-quality compression methods (e.g., DMS) enable hyper-scaling to yield positive returns; low-quality compression is detrimental.
  3. DMS training is highly efficient: 1K training steps suffice to achieve quality close to Full Cache, with good cross-task generalization.
  4. Mixed width-and-depth scaling is optimal: Increasing parallelism or sequence length alone is inferior to combining both.

Highlights & Insights

  1. Paradigm innovation: KV cache compression is repositioned from an "efficiency optimization" tool to a "reasoning capability enhancement" tool, opening an entirely new research direction.
  2. Implicit information merging: The delayed eviction mechanism leverages the inherent information aggregation of attention; retained tokens naturally absorb the information of neighboring evicted tokens.
  3. High practicality: 1K training steps, 8× compression, and significant accuracy gains make this directly applicable to deployment optimization of existing reasoning models.
  4. Reliability of majority voting: In parallel inference settings, majority voting is a more robust answer aggregation strategy than simple selection.

Limitations & Future Work

  1. DMS still requires a small amount of training, which incurs additional cost for frequently updated models.
  2. The benefits of hyper-scaling diminish on simple tasks where long reasoning chains are unnecessary.
  3. The scoring function relies solely on attention scores and does not account for the semantic importance of token content.
  4. Majority voting as an aggregation strategy is relatively simple; more sophisticated verification mechanisms may yield further improvements.

Related Work

  • StreamingLLM / H2O / SnapKV: Training-free KV cache compression methods; this paper demonstrates their insufficient quality at high compression ratios.
  • Best-of-N / Majority Voting: Foundational strategies for inference-time scaling; DMS provides a larger scaling headroom for these approaches.
  • KVQuant / KIVI: Quantization-based KV cache compression, orthogonal to the sparsification approach of DMS and combinable with it.
  • Insight: Model efficiency optimization can not only reduce costs but also enhance capability—this principle is generalizable to compression of other model components.

Rating

  • Novelty: ⭐⭐⭐⭐⭐ (paradigm-level innovation)
  • Technical Depth: ⭐⭐⭐⭐
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐
  • Practicality: ⭐⭐⭐⭐⭐
  • Writing Quality: ⭐⭐⭐⭐