
SparK: Query-Aware Unstructured Sparsity with Recoverable KV Cache Channel Pruning

Conference: AAAI 2026
arXiv: 2508.15212
Code: To be released
Area: Interpretability
Keywords: KV Cache Compression, Channel Pruning, Unstructured Sparsity, Long-Context Inference, Attention Mechanism

TL;DR

This paper proposes SparK — a training-free, token-wise unstructured channel pruning method for KV cache. It selects salient channels via query-aware saliency scoring and recovers the contribution of pruned channels through a recovery mechanism. At an 80% pruning ratio, performance degradation remains below 5%. The method is orthogonal to token eviction approaches and can reduce KV cache storage by an additional 30%+.

Background & Motivation

Background: The primary bottleneck for long-context LLM inference is the KV cache — 100K tokens in LLaMA3.1-8B require >50 GB of storage, and attention computation scales quadratically with sequence length. Existing compression methods operate along three axes: temporal (token eviction/merging), spatial (layer/head sharing), and channel (low-rank decomposition/structured pruning).
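
For intuition on that footprint, the sketch below computes KV cache size from the model's shape parameters (32 layers and head dimension 128 are LLaMA3.1-8B's published configuration); the >50 GB figure is consistent with counting all 32 attention heads per layer in fp16, while grouped-query attention (8 KV heads) shrinks the per-sequence cache proportionally.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Total bytes for the K and V caches of one sequence (factor 2 = K and V)."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

# LLaMA3.1-8B shapes, fp16, 100K-token context:
print(kv_cache_bytes(100_000, 32, 32, 128) / 1e9)  # ~52.4 GB counting all 32 heads
print(kv_cache_bytes(100_000, 32, 8, 128) / 1e9)   # ~13.1 GB with 8 KV heads (GQA)
```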

Limitations of Prior Work:
  • Token eviction methods (SnapKV, PyramidKV, etc.) only reduce the sequence length \(S\), leaving the head dimension \(D\) untouched.
  • Structured channel pruning (ThinK) applies the same pruning mask to all tokens, assuming globally uniform channel importance — yet empirical observation shows that channel importance is token-dependent.
  • Structured pruning suffers severe performance degradation at high pruning ratios (≥70%): ThinK incurs a 47.6% performance loss at 80% pruning.

Key Challenge: Channel importance is highly token-specific (CV > 1.1), but existing structured pruning methods apply a single fixed mask to all tokens, failing to capture this dynamic variation.
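
The coefficient-of-variation claim can be checked with a short diagnostic: for each channel, measure how much its importance fluctuates across tokens. A minimal sketch, using key-channel magnitude as a simple importance proxy (the paper's exact measurement protocol may differ):

```python
import torch

def channel_importance_cv(k):
    """Mean coefficient of variation (std/mean) of per-channel importance
    across tokens, using |k| as a simple importance proxy.

    k: (S, D) key cache for one head. A mean CV above 1 indicates that a
    channel's importance swings strongly from token to token, which is the
    observation motivating token-wise (unstructured) pruning masks.
    """
    importance = k.abs()                                   # (S, D)
    cv = importance.std(dim=0) / importance.mean(dim=0).clamp_min(1e-8)
    return cv.mean().item()

# Run on real cached keys from a model to reproduce the observation;
# random data will not show the reported CV > 1.1.
print(channel_importance_cv(torch.randn(1024, 128)))
```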

Goal: Perform fine-grained, token-aware unstructured sparsity along the channel dimension of the KV cache, while preserving or recovering information from pruned channels.

Key Insight: Two key observations — (1) channel importance varies drastically across tokens, making unstructured pruning far superior to structured pruning; (2) replacing pruned channels with a small constant (rather than zeroing them out) substantially reduces information loss.

Core Idea: Apply token-wise unstructured channel pruning to the KV cache (retaining the most salient channels) combined with a lightweight recovery function to restore the contribution of pruned channels.

Method

Overall Architecture

SparK operates in two inference phases:
  • Prefill phase: Compute per-token, per-channel saliency scores → apply unstructured pruning, retaining only the top-\(T\) channels → store distributional statistics of pruned channels.
  • Decode phase: For each new query, use the recovery function \(\mathcal{F}\) to reconstruct pruned channel values from cached statistics → execute standard full attention.

Key Designs

  1. Unstructured Channel Pruning (Saliency-Based Selection):

    • Function: Select the \(T\) most important channels for each token within each head.
    • Mechanism: Channel pruning is formulated as a selection problem that maximizes the saliency of retained channels. The objective minimizes the Frobenius-norm error of the attention score after pruning: \(\min_{\mathcal{S}_{i,t}} \|\mathbf{q}_{i,t}\mathbf{k}_{i,t}^\top - (\mathbf{q}_{i,t}\mathcal{S}_{i,t})(\mathbf{k}_{i,t}\mathcal{S}_{i,t})^\top\|_F\). Since \(\mathbf{q}_{i,t}\mathbf{k}_{i,t}^\top = \sum_j q_{i,t}^j k_{i,t}^j\), discarding channel \(j\) removes exactly one term, and assuming approximate inter-channel decorrelation the objective decomposes into a sum of independent per-channel contributions. The saliency score is therefore defined as \(w_{i,t}^j = \|q_{i,t}^j\|_2 \|k_{i,t}^j\|_2\), and the top-\(T\) channels by score are greedily retained (see the code sketch following this list). In practice, a mean query computed over an observation window replaces per-token queries to reduce computational overhead.
    • Design Motivation: Unstructured pruning assigns each token its own pruning mask, perfectly adapting to token-dependent channel importance. The saliency metric jointly accounts for both query and key channel magnitudes, yielding more accurate importance estimates than key-only approaches.
  2. Recovery Mechanism:

    • Function: Restore the contribution of pruned channels during attention computation.
    • Mechanism: Rather than discarding pruned channels outright, a recovery function \(\mathcal{F}\) samples approximate values from distributional parameters (mean and variance) stored for each pruned channel; these reconstructed values participate in the attention score computation at decode time (also covered in the sketch following this list).
    • Design Motivation: Empirical observation shows that even replacing pruned channels with a crude constant (0.01) substantially reduces the damage, cutting accuracy loss by 32.4% on average. Distribution-based sampling preserves the statistical properties of attention scores better than a fixed constant.
  3. Ratio-Free Variants:

    • SparK-g (Group-Based): Channels are partitioned into groups; top channels are independently selected within each group, eliminating the need to specify a global pruning ratio.
    • SparK-p (Top-p): Analogous to top-p sampling — channels are retained until cumulative saliency reaches a threshold \(p\).
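
To make the two key designs concrete, here is a minimal PyTorch sketch of token-wise saliency pruning and decode-time recovery for a single attention head. It is an illustration under stated assumptions, not the authors' implementation (the official code is not yet released): tensor shapes, function names, and the per-token Gaussian model for pruned channels are choices made for this sketch, and only the key cache is handled (the paper applies a norm-based rule to the value cache).

```python
import torch

def prune_head_kv(q_win, k, T):
    """Token-wise unstructured channel pruning for one head's key cache.

    q_win: (W, D) queries from an observation window; k: (S, D) cached keys.
    Keeps the top-T channels per token (saliency w_j = |q_j| * |k_j|) and
    stores per-token mean/std of the pruned channels for later recovery.
    """
    q_mean = q_win.mean(dim=0)                          # mean query over the window
    saliency = q_mean.abs() * k.abs()                   # (S, D) per-token channel saliency
    order = saliency.argsort(dim=-1, descending=True)
    keep_idx, drop_idx = order[:, :T], order[:, T:]     # token-wise channel split

    k_kept = torch.gather(k, -1, keep_idx)              # (S, T) channels actually stored
    k_drop = torch.gather(k, -1, drop_idx)              # (S, D-T) channels to discard
    drop_mean = k_drop.mean(dim=-1, keepdim=True)       # cached distributional statistics
    drop_std = k_drop.std(dim=-1, keepdim=True)
    return k_kept, keep_idx, drop_mean, drop_std

def recover_keys(k_kept, keep_idx, drop_mean, drop_std, D):
    """Decode-time recovery: fill pruned channels with values drawn from the
    cached per-token statistics, then run standard attention on the result."""
    S = k_kept.shape[0]
    k_full = drop_mean + drop_std * torch.randn(S, D)   # sampled stand-ins for pruned channels
    return k_full.scatter(-1, keep_idx, k_kept)         # overwrite the retained channels

# Toy usage: one head, 16 cached tokens, head dim 128, ~80% pruning (keep 26 channels).
k = torch.randn(16, 128)
q_win = torch.randn(4, 128)
k_kept, idx, mu, sigma = prune_head_kv(q_win, k, T=26)
k_approx = recover_keys(k_kept, idx, mu, sigma, D=128)
```

Replacing the sampled values with a small constant (e.g. 0.01) in `recover_keys` gives the constant-substitution baseline discussed above; the distribution-based draw is the stronger variant.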

Loss & Training

  • Fully training-free: No modification of model parameters is required; the method is plug-and-play.
  • Orthogonal to existing methods: Can be stacked on top of SnapKV/PyramidKV (temporal-axis compression) for further reduction (a composition sketch follows this list).
  • Compatible with KV cache quantization methods.
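
The orthogonality is mechanical: token eviction shortens the cache along the sequence axis first, and SparK then prunes channels of whatever tokens survive. A rough sketch of that composition, reusing prune_head_kv from the earlier sketch and a crude attention-sum eviction rule as a stand-in for SnapKV's actual scoring:

```python
import torch

def compress_head_kv(q_win, k, v, token_budget, channel_T):
    """Compose temporal and channel compression for one attention head."""
    # 1) Token eviction: keep the token_budget tokens that receive the most
    #    attention from the observation-window queries (stand-in for SnapKV).
    attn = (q_win @ k.T / k.shape[-1] ** 0.5).softmax(dim=-1).sum(dim=0)  # (S,)
    keep_tok = attn.topk(token_budget).indices
    k_small, v_small = k[keep_tok], v[keep_tok]
    # 2) Channel pruning: token-wise unstructured pruning on the survivors only.
    return prune_head_kv(q_win, k_small, channel_T), v_small
```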

Key Experimental Results

Main Results

LongBench results on LLaMA3.1-8B (KV-size 128, combined with SnapKV):

| Method | NrtvQA | HotpotQA | SAMSum | TriviaQA | Avg. |
|---|---|---|---|---|---|
| Vanilla (full KV) | 22.48 | 48.49 | 42.28 | 88.35 | 44.27 |
| SnapKV | 15.29 | 39.92 | 36.64 | 68.64 | 32.38 |
| +ThinK (50%) | 13.60 | 36.24 | 32.80 | 50.73 | 28.20 |
| +ThinK (80%) | 7.60 | 19.98 | 11.22 | 21.36 | 16.97 |
| +SparK (50%) | 13.52 | 38.77 | 36.66 | 68.95 | 32.04 |
| +SparK (80%) | 13.82 | 40.84 | 35.41 | 57.20 | 31.16 |

At 80% pruning, SparK achieves Avg = 31.16 vs. ThinK's 16.97 — a substantial margin.

Ablation Study

| Configuration | Effect | Notes |
|---|---|---|
| Structured pruning (50%) | −4.2% | Global fixed mask |
| Unstructured pruning (50%) | −1.2% | Token-wise mask |
| Unstructured + constant replacement (80%) | Large improvement | SAMSum: 55.7% → 12.2% loss |
| Unstructured + recovery (80%) | <5% degradation | Distribution sampling achieves best results |
| ThinK (80%) | −47.6% | Structured pruning collapses at high ratio |
| SparK (80%) | <5% degradation | Unstructured + recovery is highly robust |

Key Findings

  • Channel importance is highly token-dependent: Mean CV > 1.1, indicating that the importance of a given channel varies drastically across tokens — directly invalidating the core assumption of structured pruning.
  • The gap between unstructured and structured pruning widens sharply with pruning ratio: ~3% at 50%; ~27.4% at 80%.
  • The recovery mechanism is critical at high pruning ratios: At 80% pruning, performance collapses without recovery, but drops less than 5% with recovery enabled.
  • Orthogonal complementarity with token eviction: Stacking SparK on top of SnapKV yields an additional 30%+ storage reduction with negligible performance loss.

Highlights & Insights

  • The insight that "the channel axis is an overlooked compression dimension" is particularly valuable. The vast majority of KV cache compression work targets sequence length \(S\) (token eviction); SparK opens an orthogonal channel-dimension compression direction that can be composed with existing methods.
  • The recovery mechanism is elegantly designed: Rather than simple pruning, the approach prunes and then recovers. The observation that constant substitution alone substantially reduces information loss is counterintuitive yet empirically compelling, and directly motivates the distribution-based recovery design.
  • The derivation from problem formulation to greedy solution is clear: Channel pruning is formulated as a max-saliency selection problem, and the approximate inter-channel decorrelation property simplifies it to a linearly decomposable problem amenable to greedy solving.

Limitations & Future Work

  • Observation window selection: Approximating all queries with the window mean query introduces errors, particularly in long-context settings where query patterns vary substantially across positions.
  • The design space of recovery functions is not fully explored: The current approach uses distribution sampling, but more expressive recovery mechanisms (e.g., small learned networks) may yield further gains.
  • Value cache pruning relies on a simple norm-based heuristic: the key cache selection benefits from a principled theoretical derivation, while the value cache strategy has room for further refinement.
  • Evaluation is limited to LongBench and RULER: Assessment on reasoning-intensive tasks (math, code) is absent.

Comparison with Related Methods

  • vs. ThinK: Both perform channel pruning, but ThinK is structured (all tokens share a mask) while SparK is unstructured (each token has an independent mask). At 80% pruning, SparK's Avg = 31.16 substantially outperforms ThinK's 16.97.
  • vs. SnapKV/PyramidKV: These methods compress along the temporal axis (reducing sequence length), while SparK compresses along the channel axis. The two are orthogonal and can be composed.
  • vs. KV quantization (KIVI, etc.): Quantization reduces the bit-width of each stored value; SparK reduces the number of channels participating in computation. The two approaches are complementary.

Rating

  • Novelty: ⭐⭐⭐⭐ The combination of unstructured channel pruning and recovery is novel in the KV cache compression literature, establishing channel-dimension compression as a new direction.
  • Experimental Thoroughness: ⭐⭐⭐⭐ Multiple benchmarks, diverse baselines, multi-model evaluation, and detailed ablation and visualization analyses.
  • Writing Quality: ⭐⭐⭐⭐ Mathematical derivations are rigorous, and the observation–motivation–method logical chain is well articulated.
  • Value: ⭐⭐⭐⭐⭐ Highly practical — training-free, plug-and-play, orthogonally complementary to existing methods, and directly applicable to long-context LLM inference acceleration.