# SpecAttn: Speculating Sparse Attention
**Conference:** NeurIPS 2025 · **arXiv:** [2510.27641](https://arxiv.org/abs/2510.27641) · **Area:** Model Compression · **Keywords:** Sparse Attention, Speculative Decoding, KV Cache Pruning, KL Divergence Layer Mapping, Sorting-Free Top-p Selection, Triton Kernel
## TL;DR
SpecAttn proposes a training-free method that leverages attention weights already computed by the draft model in speculative decoding to predict important tokens for the verification model. Through KL divergence layer mapping, sorting-free top-p nucleus selection, and dynamic KV cache pruning, it achieves a 78.4% reduction in KV cache accesses with only a 15.29% increase in perplexity (at \(p=0.95\)), significantly outperforming existing sparse attention methods.
## Background & Motivation
Background: The \(O(L^2 d)\) complexity of the Transformer self-attention mechanism is a core bottleneck in LLM inference, especially in long-context scenarios. System-level optimizations such as vLLM and FlashAttention provide significant speedups but still perform full dense attention, computing every query-key pair.
Limitations of Prior Work: Methods such as Longformer and BigBird adopt predefined sparse patterns (sliding windows, global tokens) to achieve linear complexity, but require retraining and cannot adapt to varying input content. Dynamic methods such as MInference and SpargeAttn select top-k keys at inference time, but rely on preset head patterns or introduce additional prediction overhead.
Blind Spot in Speculative Decoding: Speculative decoding uses a lightweight draft model to generate candidate tokens in parallel, which are then verified by a large model to reduce the number of large-model calls. However, it does not reduce the internal attention computation cost of the large model—full dense attention is still performed during verification.
Core Insight: The draft model already computes attention weights during speculative decoding, and these weights contain rich token importance information. Prior work treats speculative decoding and sparse attention as orthogonal optimization strategies, overlooking the opportunity to combine them.
Goal: Without modifying model weights or requiring any training, leverage the draft model's attention distribution in speculative decoding to guide dynamic KV cache pruning in the verification model, thereby achieving content-aware sparse attention while preserving output quality.
Core Idea: Draft model attention weights → KL divergence mapping to verification model layers → sorting-free top-p selection of important tokens → dynamic pruning of the verification model's KV cache = content-aware sparse attention at zero training cost.
## Method

### Overall Architecture
SpecAttn integrates seamlessly into existing speculative decoding pipelines and consists of three core steps: (1) an offline phase that establishes layer mappings between the draft and verification models using KL divergence; (2) a runtime phase that selects important tokens from the draft model's attention distribution using a sorting-free algorithm; and (3) construction of a sparse mask from the selected tokens, so that the verification model computes attention only over these tokens.
### Key Designs
- KL Divergence Layer Mapping (a code sketch follows this list):
- Function: Establishes a mapping \(f: [m] \to [n]\) that assigns to each of the verification model's \(m\) layers a source layer among the draft model's \(n\) layers.
- Mechanism: The similarity between draft model layer \(i\) and verification model layer \(j\) is defined as \(S_{i,j} = -D_{KL}(A_j^v \| A_i^d)\), where \(A_i^d, A_j^v \in \mathbb{R}^L\) denote the attention distributions of the two models computed on a representative dataset (WikiText). For each verification model layer \(j\), the draft model layer with the highest similarity is selected, subject to a monotonically increasing mapping constraint.
- Design Motivation: Layers at different depths learn hierarchically corresponding attention patterns—shallow layers attend to local patterns while deep layers capture global dependencies. The monotonicity constraint reflects this hierarchical structure. KL divergence is better suited than cosine similarity for measuring differences between probability distributions, and the optimization problem can be solved efficiently via dynamic programming.
- A single draft model layer may be mapped to multiple verification model layers, since the verification model is typically much deeper than the draft model.
- Sorting-Free Nucleus Selection (sketched in code after this list):
- Function: Efficiently selects from the draft model's attention weights the minimal token subset \(\mathcal{T}\) whose cumulative attention mass reaches a threshold \(p\).
- Mechanism: Binary search is used in place of sorting. Given attention weights \(\mathbf{a} \in \mathbb{R}^L\) and a target mass \(M_{target} = p \cdot \sum_i a_i\), binary search is performed over \([\theta_{low}=0, \theta_{high}=\max(\mathbf{a})]\) to find threshold \(\theta_{mid}\), computing \(M_{current} = \sum_{i: a_i \geq \theta_{mid}} a_i\) and adjusting boundaries based on comparison with the target; convergence is achieved in a fixed 10 iterations.
- Design Motivation: Conventional top-p requires sorting \(L\) attention weights, an operation that is inefficient on GPUs due to \(O(L \log L)\) complexity and severe branch divergence. Ten iterations of binary search instead cost only 10 parallel summations of \(O(L)\) each, narrowing the threshold interval to \(2^{-10}\) of its initial range, and fully exploit the SIMD parallelism of GPUs.
- Implemented as a Triton kernel, achieving at least 4× speedup over PyTorch sorting for KV cache sizes \(\leq 8192\).
- Tokens are selected separately for each of the \(\gamma\) speculative steps, and the union is taken: \(\mathcal{T} = \bigcup_{s=1}^{\gamma} \mathcal{T}_s\), ensuring coverage of important tokens across all speculative steps.
- Sparse Attention Computation (a dense reference sketch follows this list):
- Function: Constructs a sparse mask from the selected tokens so that the verification model computes attention only over these tokens.
- Mechanism: The selected token indices \(\mathcal{I}\) define a diagonal mask matrix \(\Lambda_\mathcal{I}\), and the sparse attention output is \(\hat{O} = \text{softmax}\left(\frac{QK^T}{\sqrt{d}}\right) \Lambda_\mathcal{I} V\).
- Implementation Details: The mask is converted to CSR (Compressed Sparse Row) format and computed using FlashInfer's BlockSparseAttention kernel. The first 2 layers use full attention, as attention distributions in early layers tend to be more diffuse (attention sink phenomenon).
- Design Motivation: Although the CSR format introduces format conversion overhead, it enables efficient computation via well-established sparse attention kernels.
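
To make the offline mapping concrete, below is a minimal NumPy sketch of the KL divergence layer mapping. It assumes each layer's attention distribution has already been averaged over the calibration set (e.g. WikiText) into a single length-\(L\) vector, and the prefix-max dynamic program is one reasonable reading of "solved efficiently via dynamic programming"; the function names are illustrative, not the authors' code.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-8):
    """KL(p || q) between two attention distributions over L positions."""
    p, q = p + eps, q + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

def map_layers(draft_attn, verify_attn):
    """Monotone mapping f: verification layer j -> draft layer f(j).

    draft_attn / verify_attn: lists of length-L attention distributions,
    one per layer, averaged over the calibration set. Maximizes
    sum_j S[f(j), j] with S[i, j] = -KL(A_j^v || A_i^d), subject to f
    being monotonically non-decreasing, via dynamic programming.
    """
    n, m = len(draft_attn), len(verify_attn)
    S = np.array([[-kl_divergence(verify_attn[j], draft_attn[i])
                   for j in range(m)] for i in range(n)])        # (n, m)

    dp = np.empty((m, n))             # dp[j, i]: best score with f(j) = i
    back = np.zeros((m, n), dtype=int)
    dp[0] = S[:, 0]
    for j in range(1, m):
        best = np.maximum.accumulate(dp[j - 1])   # prefix max over i' <= i
        arg = np.zeros(n, dtype=int)
        for i in range(1, n):
            arg[i] = arg[i - 1] if dp[j - 1, i] <= best[i - 1] else i
        dp[j] = S[:, j] + best
        back[j] = arg

    f = np.zeros(m, dtype=int)        # backtrack the optimal monotone path
    f[-1] = int(np.argmax(dp[-1]))
    for j in range(m - 1, 0, -1):
        f[j - 1] = back[j, f[j]]
    return f
```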
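
The sorting-free selection is equally compact. The paper fuses this loop into a single Triton kernel; the plain PyTorch sketch below only illustrates the binary-search logic and the fixed iteration budget.

```python
import torch

def sorting_free_top_p(a: torch.Tensor, p: float = 0.95, iters: int = 10):
    """Indices of the smallest token set whose attention mass reaches ~p.

    a: non-negative attention weights over the L cached tokens, shape (L,).
    Binary-searches a threshold theta in [0, max(a)] so that the mass of
    {i : a_i >= theta} approximates p * sum(a) -- no sort needed, just
    `iters` parallel reductions.
    """
    target = p * a.sum()
    lo = torch.zeros((), dtype=a.dtype, device=a.device)
    hi = a.max()
    for _ in range(iters):            # fixed 10 iterations in the paper
        mid = (lo + hi) / 2
        kept = a[a >= mid].sum()
        if kept >= target:            # still enough mass: push threshold up
            lo = mid
        else:                         # cut too deep: pull threshold down
            hi = mid
    return (a >= lo).nonzero(as_tuple=True)[0]
```

Per the last bullet above, this runs once per speculative step, and the \(\gamma\) resulting index sets are unioned.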
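
For the sparse attention step, a dense reference of \(\hat{O} = \text{softmax}(QK^T/\sqrt{d})\,\Lambda_\mathcal{I} V\) pins down the semantics of the diagonal mask; an actual deployment would instead convert the mask to CSR and call FlashInfer's block-sparse kernel, as described above.

```python
import math
import torch

def masked_attention(Q, K, V, keep_idx):
    """Dense reference for O_hat = softmax(Q K^T / sqrt(d)) @ Lambda_I @ V.

    Q: (q, d) queries for the candidate tokens; K, V: (L, d) KV cache;
    keep_idx: token indices chosen by the sorting-free top-p step.
    """
    L, d = K.shape
    lam = torch.zeros(L, dtype=V.dtype, device=V.device)
    lam[keep_idx] = 1.0               # diagonal of the selection matrix
    w = torch.softmax(Q @ K.T / math.sqrt(d), dim=-1)
    return (w * lam) @ V              # pruned tokens contribute nothing
```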
### Algorithm Pipeline
- Initialization: Compute layer mapping \(f\) offline; prefill the input sequence with both models to initialize their respective KV caches.
- Speculative Generation: The draft model autoregressively generates \(\gamma\) candidate tokens while collecting per-layer attention weights \(\mathcal{A}\).
- Mask Creation: For each verification model layer \(j\), locate the corresponding draft layer via \(f(j)\), apply the sorting-free top-p algorithm to select important tokens, and generate a sparse mask.
- Sparse Verification: The verification model uses the sparse mask to verify \(\gamma\) candidate tokens in parallel and produces predictions.
- Accept/Reject: Draft tokens are accepted or rejected sequentially, and the KV caches of both models are updated accordingly.
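
Stitching the five steps together, one decode iteration might look like the skeleton below. The model-facing callables (`draft_step`, `verify_sparse`) are hypothetical stand-ins for a real speculative-decoding pipeline, not the authors' API, and `sorting_free_top_p` is the sketch from Key Designs.

```python
import torch

def specattn_decode_step(draft_step, verify_sparse, layer_map, p=0.95):
    """One SpecAttn iteration over hypothetical pipeline interfaces.

    draft_step() -> (candidates, attn): gamma candidate tokens plus the
        draft model's attention weights; attn[i] has shape (gamma, L),
        one row over the KV cache per speculative step at draft layer i.
    layer_map[j]: draft layer assigned to verification layer j by the
        offline KL-divergence mapping.
    verify_sparse(candidates, masks) -> accepted tokens; the verification
        model attends only to masks[j] at layer j (the paper keeps the
        first two layers fully dense).
    """
    candidates, attn = draft_step()                  # speculative generation
    masks = []
    for j in range(len(layer_map)):                  # per-layer mask creation
        a = attn[layer_map[j]]                       # (gamma, L)
        steps = [sorting_free_top_p(a[s], p) for s in range(a.shape[0])]
        masks.append(torch.cat(steps).unique())      # union over gamma steps
    return verify_sparse(candidates, masks)          # sparse verification
```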
## Key Experimental Results

### Experimental Setup
- Hardware: Single NVIDIA RTX 4090 (24 GB VRAM)
- Draft model: TinyLlama-1.1B
- Verification model: Llama-2-7b-hf
- Perplexity evaluation: PG-19 dataset, truncated to 2048 tokens (10% prefill + 90% decoding)
- Latency evaluation: LongBench gov_report task
### Main Results: Perplexity Comparison
| Method | Perplexity | PPL Diff. | Relative Increase | KV Cache Reduction |
|---|---|---|---|---|
| Full Attention | 6.435 | - | - | - |
| StreamingLLM | 186.242 | +179.807 | +2794.32% | 77.4% |
| Quest | 7.823 | +1.389 | +21.58% | 77.4% |
| SpecAttn (p=0.95) | 7.419 | +0.984 | +15.29% | 78.4% |
| SpecAttn (p=0.97) | 6.720 | +0.285 | +4.43% | 68.8% |
| SpecAttn (p=0.99) | 6.471 | +0.036 | +0.56% | 44.3% |
### Throughput Results
| Method | Tokens/sec (↑) | KV Cache Reduction (↑) |
|---|---|---|
| No Speculative Decoding (FlashAttn) | 42.00 | - |
| Speculative Decoding (Full Attention) | 68.26 | - |
| SpecAttn (p=0.97) | 59.95 | 71.89% |
### Key Findings
- SpecAttn substantially outperforms Quest in quality at comparable sparsity: at \(p=0.95\), KV cache reduction is 78.4% (vs. Quest's 77.4%), yet perplexity increase is only 15.29% (vs. Quest's 21.58%), representing a relative improvement of approximately 30%.
- StreamingLLM completely fails during decoding: perplexity surges to 186, demonstrating that the attention sink strategy is unsuitable for dynamic decoding scenarios.
- The parameter \(p\) offers flexible quality–efficiency trade-offs: at \(p=0.99\), perplexity is nearly lossless (+0.56%); at \(p=0.95\), KV cache reduction approaches 80%.
- End-to-end speedup has not yet been achieved: at \(p=0.97\), throughput of 59.95 tokens/sec falls below the 68.26 tokens/sec of full-attention speculative decoding, primarily due to the overhead of mask generation (Algorithm 2). However, the attention computation speedup grows with prompt length (>4× at 2048 tokens), suggesting advantages at longer contexts.
- Sorting-free Triton kernel speedup: at least 4× over PyTorch sorting, with the speedup becoming more pronounced as KV cache size increases.
## Highlights & Insights
- First integration of speculative decoding and sparse attention: prior work treats the two as independent acceleration strategies. This paper identifies the draft model's attention distribution as a natural signal for predicting important tokens in the verification model, cleverly reusing already-available computation.
- Practical value of sorting-free top-p selection: replacing sorting with binary search is an engineering-friendly design that converges in 10 iterations and is naturally suited to GPU parallelism. This technique can be applied independently to any scenario requiring top-p sampling.
- Monotonicity constraint in layer mapping: the monotonic mapping constraint, inspired by Dynamic Time Warping (DTW), both reduces the search space and aligns with the intuition of hierarchical learning across model layers, serving as a well-motivated inductive bias.
- Zero training cost: the entire method requires no fine-tuning or additional training and can be plugged directly into existing speculative decoding pipelines, lowering the barrier to deployment.
## Limitations & Future Work
- No end-to-end speedup achieved: at a context length of 2048 tokens, the overhead of mask generation offsets the gains from sparse attention, resulting in lower actual throughput than full-attention speculative decoding. Longer contexts (10K+ tokens) are needed for the advantage to materialize.
- Limited model pairs evaluated: validation is performed only on the TinyLlama-1.1B / Llama-2-7b pair; other draft–verification model combinations (e.g., cross-architecture families) are not explored.
- CSR format conversion overhead: converting sparse masks to CSR format introduces additional latency, which may require more efficient sparse formats or custom kernels to mitigate.
- Perplexity-only evaluation: downstream task evaluation (e.g., question answering, summarization) is absent; improvements in perplexity do not necessarily translate directly to preserved task quality.
- Single-GPU experiments: scalability and communication overhead in multi-GPU distributed settings are not explored.
- The authors suggest exploring alternative similarity metrics (Jaccard similarity, other distributional distances) as replacements for KL divergence.
## Related Work & Insights
- vs. Quest: Quest performs query-aware KV page selection at the chunk level with fixed granularity; SpecAttn uses the draft model's attention distribution for token-level dynamic selection, achieving finer granularity and lower perplexity at comparable sparsity.
- vs. StreamingLLM: StreamingLLM's attention sink strategy retains a small number of tokens after prefill, suited to streaming scenarios but suffering catastrophic quality degradation during decoding; SpecAttn dynamically selects different tokens at each step, offering stronger adaptability.
- vs. MInference / SpargeAttn: these inference-time dynamic sparse methods require additional head pattern precomputation or two-stage filtering; SpecAttn directly reuses computation already present in speculative decoding with no additional prediction overhead.
- vs. Twilight: the sorting-free top-p algorithm is inspired by Twilight's hierarchical top-p pruning, but applied in a different context—Twilight prunes within a model's own attention, whereas SpecAttn uses cross-model attention prediction.
## Rating
- Novelty: ⭐⭐⭐⭐ — First use of speculative decoding attention signals for sparse attention, a novel angle, though the individual sub-techniques (KL mapping, top-p selection) are relatively mature.
- Experimental Thoroughness: ⭐⭐⭐ — Limited to a single model pair and single-dataset perplexity evaluation; downstream task and longer-context validation are lacking.
- Writing Quality: ⭐⭐⭐⭐ — Method description is clear with complete algorithmic pseudocode, though the failure to achieve an end-to-end speedup is somewhat downplayed in the presentation.
- Value: ⭐⭐⭐⭐ — Introduces a new paradigm combining speculative decoding and sparse attention with promising direction, though current experiments do not fully validate end-to-end gains.