# SparK: Query-Aware Unstructured Sparsity with Recoverable KV Cache Channel Pruning
- Conference: AAAI 2026
- arXiv: 2508.15212
- Code: To be released
- Area: Interpretability
- Keywords: KV Cache Compression, Channel Pruning, Unstructured Sparsity, Long-Context Inference, Attention Mechanism
## TL;DR
This paper proposes SparK, a training-free, token-wise unstructured channel-pruning method for the KV cache. It selects salient channels via query-aware saliency scoring and restores the contribution of pruned channels through a lightweight recovery mechanism. At an 80% pruning ratio, performance degradation remains below 5%. The method is orthogonal to token-eviction approaches and can cut KV cache storage by an additional 30%+.
## Background & Motivation
Background: The primary bottleneck for long-context LLM inference is the KV cache — 100K tokens in LLaMA3.1-8B require >50 GB of storage, and attention computation scales quadratically with sequence length. Existing compression methods operate along three axes: temporal (token eviction/merging), spatial (layer/head sharing), and channel (low-rank decomposition/structured pruning).
Limitations of Prior Work:
- Token eviction methods (SnapKV, PyramidKV, etc.) only reduce the sequence length \(S\), leaving the head dimension \(D\) untouched.
- Structured channel pruning (ThinK) applies the same pruning mask to all tokens, assuming globally uniform channel importance; empirically, channel importance is token-dependent.
- Structured pruning suffers severe performance degradation at high pruning ratios (≥70%): ThinK incurs a 47.6% performance loss at 80% pruning.
Key Challenge: Channel importance is highly token-specific (CV > 1.1), but existing structured-pruning methods apply a single fixed mask, failing to capture this dynamic variation.
Goal: Perform fine-grained, token-aware unstructured sparsity along the channel dimension of the KV cache, while preserving or recovering information from pruned channels.
Key Insight: Two key observations — (1) channel importance varies drastically across tokens, making unstructured pruning far superior to structured pruning; (2) replacing pruned channels with a small constant (rather than zeroing them out) substantially reduces information loss.
Core Idea: Apply token-wise unstructured channel pruning to the KV cache (retaining the most salient channels) combined with a lightweight recovery function to restore the contribution of pruned channels.
## Method
### Overall Architecture
SparK operates in two inference phases:
- Prefill phase: Compute per-token, per-channel saliency scores → apply unstructured pruning, retaining only the top-\(T\) channels → store distributional statistics of the pruned channels.
- Decode phase: For each new query, use the recovery function \(\mathcal{F}\) to reconstruct pruned-channel values from the cached statistics → execute standard full attention.
### Key Designs
- Unstructured Channel Pruning (Saliency-Based Selection):
    - Function: Select the \(T\) most important channels for each token within each head.
    - Mechanism: Channel pruning is formulated as a selection problem that maximizes the saliency of retained channels. The objective minimizes the Frobenius-norm error of the attention score after pruning: \(\min_{\mathcal{S}_{i,t}} \|\mathbf{q}_{i,t}\mathbf{k}_{i,t}^\top - (\mathbf{q}_{i,t}\mathcal{S}_{i,t})(\mathbf{k}_{i,t}\mathcal{S}_{i,t})^\top\|_F\). Expanding this expression and assuming approximate inter-channel decorrelation reduces it to a sum of independent per-channel contributions. The saliency score is defined as \(w_{i,t}^j = \|q_{i,t}^j\|_2 \|k_{i,t}^j\|_2\), and the top-\(T\) channels by score are selected greedily. In practice, a mean query computed over an observation window replaces per-token queries to reduce computational overhead.
    - Design Motivation: Unstructured pruning assigns each token its own pruning mask, adapting directly to token-dependent channel importance. The saliency metric accounts for both query and key channel magnitudes, yielding more accurate importance estimates than key-only approaches (see the sketch below).
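The intuition behind the score: since \(\mathbf{q}\mathbf{k}^\top = \sum_j q^j k^j\), dropping channel \(j\) removes the term \(q^j k^j\), whose magnitude is bounded by \(|q^j||k^j|\). Below is a minimal PyTorch sketch of the selection step under the paper's mean-query simplification; the single-head setting, tensor shapes, and names are illustrative assumptions, not the authors' implementation (code is not yet released).

```python
import torch

def select_channels(K: torch.Tensor, Q_obs: torch.Tensor, T: int) -> torch.Tensor:
    """Token-wise unstructured channel selection for one head.

    K:     (S, D) cached keys.
    Q_obs: (W, D) queries from the observation window.
    T:     channels to keep per token.
    Returns a boolean keep-mask of shape (S, D) with T True entries per row.
    """
    q_bar = Q_obs.mean(dim=0)                       # mean query stands in for per-token queries
    saliency = q_bar.abs().unsqueeze(0) * K.abs()   # w_t^j = |q^j| * |k_t^j|, shape (S, D)
    top_idx = saliency.topk(T, dim=-1).indices      # each token picks its own T channels
    keep = torch.zeros_like(K, dtype=torch.bool)
    keep.scatter_(-1, top_idx, True)
    return keep

# Toy usage: 6 cached tokens, head dim 16, keep 4 channels per token.
keep = select_channels(torch.randn(6, 16), torch.randn(8, 16), T=4)
assert keep.sum(dim=-1).eq(4).all()                 # a different mask for every token
```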
- Recovery Mechanism:
    - Function: Restore the contribution of pruned channels during attention computation.
    - Mechanism: Rather than discarding pruned channels outright, a recovery function \(\mathcal{F}\) samples approximate values from distributional parameters (mean and variance) stored for each pruned channel. These reconstructed values participate in the attention-score computation at decode time.
    - Design Motivation: Empirically, even replacing pruned channels with a crude constant (0.01) substantially reduces performance loss (an average of 32.4% less accuracy loss). Distribution-based sampling preserves the statistical properties of attention scores better than a fixed constant (see the sketch below).
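A minimal sketch of the store-then-recover flow (prefill summarizes, decode samples), assuming Gaussian per-channel statistics taken over the tokens where a channel was pruned; the paper's exact parameterization of \(\mathcal{F}\) may differ, and all names here are illustrative.

```python
import torch

def prune_and_summarize(K, keep):
    """Prefill: zero pruned entries, but keep per-channel mean/variance of the
    pruned values (one plausible reading of the cached statistics)."""
    pruned = ~keep
    cnt = pruned.sum(dim=0).clamp(min=1)              # pruned-token count per channel
    mean = (K * pruned).sum(dim=0) / cnt              # (D,)
    var = ((K - mean) * pruned).pow(2).sum(dim=0) / cnt
    return K * keep, pruned, mean, var                # sparse keys + statistics

def recover(K_sparse, pruned, mean, var):
    """Decode: fill pruned slots with samples from N(mean_j, var_j)."""
    samples = mean + var.sqrt() * torch.randn_like(K_sparse)
    return torch.where(pruned, samples, K_sparse)

# Toy check: recovery typically tracks dense attention scores better than zero-fill.
torch.manual_seed(0)
bias = 3.0 * torch.randn(16)                          # emulate per-channel bias of real keys
K, q = torch.randn(64, 16) + bias, torch.randn(16)
keep = torch.zeros_like(K, dtype=torch.bool)
keep.scatter_(-1, K.abs().topk(4, dim=-1).indices, True)   # keep 4 of 16 (75% pruned)
K_sp, pruned, mean, var = prune_and_summarize(K, keep)
dense = K @ q
print("zero-fill error:", (dense - K_sp @ q).abs().mean().item())
print("recovered error:", (dense - recover(K_sp, pruned, mean, var) @ q).abs().mean().item())
```

The toy comparison mirrors the paper's observation: because key channels have consistent nonzero statistics, filling pruned slots from a cached distribution loses much less attention-score mass than zeroing them.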
- Ratio-Free Variants:
    - SparK-g (Group-Based): Channels are partitioned into groups, and the top channels are selected independently within each group, eliminating the need to specify a global pruning ratio.
    - SparK-p (Top-p): Analogous to top-p sampling, channels are retained until their cumulative saliency reaches a threshold \(p\) (see the sketches below).
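Illustrative sketches of the two ratio-free rules as described above; the grouping granularity and threshold handling are assumptions, since the paper's exact formulation is not reproduced here.

```python
import torch

def spark_g(saliency: torch.Tensor, n_groups: int, per_group: int = 1):
    """SparK-g: partition the D channels into n_groups buckets and keep the
    top per_group channels inside each bucket; no global ratio needed."""
    S, D = saliency.shape
    grouped = saliency.view(S, n_groups, D // n_groups)
    idx = grouped.topk(per_group, dim=-1).indices
    keep = torch.zeros_like(grouped, dtype=torch.bool)
    keep.scatter_(-1, idx, True)
    return keep.view(S, D)

def spark_p(saliency: torch.Tensor, p: float):
    """SparK-p: keep the highest-saliency channels until their cumulative
    share of total saliency reaches p (analogous to top-p sampling)."""
    vals, order = saliency.sort(dim=-1, descending=True)
    share = vals.cumsum(dim=-1) / vals.sum(dim=-1, keepdim=True)
    keep_sorted = share <= p
    keep_sorted[..., 0] = True                     # always keep the best channel
    keep = torch.zeros_like(keep_sorted)
    keep.scatter_(-1, order, keep_sorted)          # undo the sort
    return keep

saliency = torch.rand(6, 16)
print(spark_g(saliency, n_groups=4).sum(dim=-1))   # fixed 4 per token (1 per group)
print(spark_p(saliency, p=0.5).sum(dim=-1))        # token-dependent channel budget
```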
## Loss & Training
- Fully training-free: No modification of model parameters is required; the method is plug-and-play.
- Orthogonal to existing methods: Can be stacked on top of SnapKV/PyramidKV (temporal-axis compression) for further reduction (see the toy composition sketch after this list).
- Compatible with KV cache quantization methods.
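A toy sketch of stacking the two axes. The token-eviction step here is a much-simplified stand-in for SnapKV (scoring tokens by attention mass from an observation window), not its actual algorithm; shapes and names are illustrative.

```python
import torch

def evict_tokens(K: torch.Tensor, Q_obs: torch.Tensor, keep_tokens: int):
    """Temporal axis: keep the tokens that draw the most attention from the
    observation-window queries (a crude SnapKV-style proxy)."""
    attn = (Q_obs @ K.T / K.shape[-1] ** 0.5).softmax(dim=-1)   # (W, S)
    idx = attn.sum(dim=0).topk(keep_tokens).indices
    return K[idx]

S, D, W = 128, 16, 8
K, Q_obs = torch.randn(S, D), torch.randn(W, D)
K_small = evict_tokens(K, Q_obs, keep_tokens=32)       # temporal: 128 -> 32 tokens
q_bar = Q_obs.mean(dim=0)                              # channel axis, as above
saliency = q_bar.abs() * K_small.abs()
keep = torch.zeros_like(K_small, dtype=torch.bool)
keep.scatter_(-1, saliency.topk(4, dim=-1).indices, True)   # 16 -> 4 channels/token
print(f"stored entries: {keep.sum().item()} of {S * D}")    # 128 of 2048
```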
## Key Experimental Results
### Main Results
LongBench results on LLaMA3.1-8B (KV-size 128, combined with SnapKV):
| Method | NrtvQA | HotpotQA | SAMSum | TriviaQA | Avg. |
|---|---|---|---|---|---|
| Vanilla (full KV) | 22.48 | 48.49 | 42.28 | 88.35 | 44.27 |
| SnapKV | 15.29 | 39.92 | 36.64 | 68.64 | 32.38 |
| +ThinK (50%) | 13.60 | 36.24 | 32.80 | 50.73 | 28.20 |
| +ThinK (80%) | 7.60 | 19.98 | 11.22 | 21.36 | 16.97 |
| +SparK (50%) | 13.52 | 38.77 | 36.66 | 68.95 | 32.04 |
| +SparK (80%) | 13.82 | 40.84 | 35.41 | 57.20 | 31.16 |
At 80% pruning, SparK achieves Avg = 31.16 vs. ThinK's 16.97 — a substantial margin.
### Ablation Study
| Configuration | Effect | Notes |
|---|---|---|
| Structured pruning (50%) | −4.2% | Global fixed mask |
| Unstructured pruning (50%) | −1.2% | Token-wise mask |
| Unstructured + constant replacement (80%) | Large improvement | SAMSum: 55.7%→12.2% loss |
| Unstructured + Recovery (80%) | <5% degradation | Distribution sampling achieves best results |
| ThinK (80%) | −47.6% | Structured pruning collapses at high ratio |
| SparK (80%) | <5% degradation | Unstructured + recovery is highly robust |
### Key Findings
- Channel importance is highly token-dependent: Mean CV > 1.1, indicating that the importance of a given channel varies drastically across tokens, directly invalidating the core assumption of structured pruning (a diagnostic sketch follows this list).
- The gap between unstructured and structured pruning widens sharply with pruning ratio: ~3% at 50%; ~27.4% at 80%.
- The recovery mechanism is critical at high pruning ratios: At 80% pruning, performance collapses without recovery, but drops less than 5% with recovery enabled.
- Orthogonal complementarity with token eviction: Stacking SparK on top of SnapKV yields an additional 30%+ storage reduction with negligible performance loss.
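A small sketch of the coefficient-of-variation diagnostic behind the first finding. The data here is a synthetic heavy-tailed stand-in; the real statistic would be computed on measured saliency scores.

```python
import torch

# CV of channel j across tokens: cv_j = std_t(w_t^j) / mean_t(w_t^j).
# CV > 1 means a channel's importance swings more than its average level,
# so no single global (structured) mask can fit every token.
torch.manual_seed(0)
w = torch.randn(1024, 128).exp()        # synthetic saliency, shape (tokens, channels)
cv = w.std(dim=0) / w.mean(dim=0)
print(f"mean CV = {cv.mean():.2f}")     # ~1.3 on this stand-in; the paper measures > 1.1
```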
## Highlights & Insights
- The insight that "the channel axis is an overlooked compression dimension" is particularly valuable. The vast majority of KV cache compression work targets sequence length \(S\) (token eviction); SparK opens an orthogonal channel-dimension compression direction that can be composed with existing methods.
- The recovery mechanism is elegantly designed: Rather than simple pruning, the approach prunes and then recovers. The observation that constant substitution alone substantially reduces information loss is counterintuitive yet empirically compelling, and directly motivates the distribution-based recovery design.
- The derivation from problem formulation to greedy solution is clear: Channel pruning is formulated as a max-saliency selection problem, and the approximate inter-channel decorrelation property simplifies it to a linearly decomposable problem amenable to greedy solving.
## Limitations & Future Work
- Observation window selection: Approximating all queries with the window mean query introduces errors, particularly in long-context settings where query patterns vary substantially across positions.
- The design space of recovery functions is not fully explored: The current approach uses distribution sampling, but more expressive recovery mechanisms (e.g., small learned networks) may yield further gains.
- Value cache pruning relies on a simple norm-based heuristic: The key cache benefits from a principled theoretical derivation, while the value-cache pruning strategy has room for further refinement.
- Evaluation is limited to LongBench and RULER: Assessment on reasoning-intensive tasks (math, code) is absent.
## Related Work & Insights
- vs. ThinK: Both perform channel pruning, but ThinK is structured (all tokens share a mask) while SparK is unstructured (each token has an independent mask). At 80% pruning, SparK's Avg = 31.16 substantially outperforms ThinK's 16.97.
- vs. SnapKV/PyramidKV: These methods compress along the temporal axis (reducing sequence length), while SparK compresses along the channel axis. The two are orthogonal and can be composed.
- vs. KV quantization (KIVI, etc.): Quantization reduces the bit-width of each stored value; SparK reduces the number of channels participating in computation. The two approaches are complementary.
## Rating
- Novelty: ⭐⭐⭐⭐ The combination of unstructured channel pruning and recovery is novel in the KV cache compression literature, establishing channel-dimension compression as a new direction.
- Experimental Thoroughness: ⭐⭐⭐⭐ Multiple benchmarks, diverse baselines, multi-model evaluation, and detailed ablation and visualization analyses.
- Writing Quality: ⭐⭐⭐⭐ Mathematical derivations are rigorous, and the observation–motivation–method logical chain is well articulated.
- Value: ⭐⭐⭐⭐⭐ Highly practical — training-free, plug-and-play, orthogonally complementary to existing methods, and directly applicable to long-context LLM inference acceleration.