ParallelComp: Parallel Long-Context Compressor for Length Extrapolation¶
Conference: ICML 2025
arXiv: 2502.14317
Code: GitHub
Area: Model Compression
Keywords: length extrapolation, KV cache eviction, parallel attention, attention sink, training-free
TL;DR¶
This paper proposes ParallelComp, a training-free parallel long-context compression method. By leveraging parallel KV cache eviction and attention calibration strategies, it enables an 8B LLM to extrapolate from an 8K context window up to 128K tokens on a single A100 GPU.
Background & Motivation¶
Limitations of Prior Work¶
Limitations of Prior Work: Ultra-long context extrapolation (>128K) remains a major challenge for LLMs:
Background¶
Background: NTK-based methods and text chunking techniques are limited by the attention sink phenomenon.
Key Challenge¶
Key Challenge: The attention bias in chunked parallel attention is fundamentally different from that in classical attention mechanisms, yet it has not been fully investigated before.
Core Idea¶
Core Idea: Memory bottlenecks limit the feasibility of long-sequence inference.
Method¶
1. Parallel Attention¶
The input sequence \(X \in \mathbb{R}^{N \times d}\) is partitioned into \(C = \lceil N/w \rceil\) chunks, with each chunk containing at most \(w\) tokens:
Local attention is computed independently within each chunk, reusing positional encodings to achieve training-free extrapolation.
2. Chunk Eviction¶
The most relevant chunks are selected based on the self-information score of the query token:
A fixed-size priority queue is utilized to retain the chunks with the lowest scores (i.e., most relevant), thereby controlling memory usage during the pre-filling phase.
3. Parallel KV Cache Eviction¶
Prior to local attention computation, cumulative attention scores are leveraged to rapidly identify and evict tokens of low importance:
By retaining the KV cache of high-scoring tokens, the computational overhead of subsequent global attention is significantly reduced.
4. Attention Calibration¶
Key Novelty: Evicting tokens with abnormally high attention scores (rather than merely retaining high-scoring tokens) to mitigate the attention bias exacerbated by parallel KV cache eviction:
where \(R^h_H\) represents the set of tokens whose attention scores exceed the threshold \(\lambda\).
5. Theoretical Analysis¶
Theorem 3.1: This theorem formalizes the inevitability of attention collapse in parallel attention. As the input length increases, the number of effective entries in the local attention matrix decreases:
6. Three Attention Patterns¶
Three attention distributions are empirically identified: - U-shape: Attention is concentrated on the first and last tokens (attention sink + recency bias). - Mountain-shape: Attention is concentrated on a small number of tokens in the middle (middle bias). - Uniform-shape: Attention is uniformly distributed.
Experimental Results¶
Main Results: Long-Context Benchmarks¶
- Performance: The 8B model (trained on 8K context length) achieves 91.17% of GPT-4's performance, surpassing Claude-2 and Kimi-Chat.
- Efficiency: Chunk throughput is increased by 1.76×, and the pre-filling phase is accelerated by 23.50×.
- Performs better on average than baselines like InfLLM across 16 sub-tasks in LongBench.
- Supports processing 128K+ context length on a single A100 80GB GPU.
Ablation Study¶
| Component | Performance Change Upon Removal |
|---|---|
| Attention Calibration | Performance drops by 2-5% across multiple tasks |
| Chunk Eviction | Out of Memory (OOM) / Execution Failure |
| Parallel KV Eviction | Throughput drops significantly |
Highlights & Insights¶
- Provides the first systematic analysis of the unique attention bias patterns in parallel attention.
- The counter-intuitive attention calibration strategy (evicting high-scoring tokens) yields significant improvements.
- Combines theoretical and empirical analysis to establish a formalized bound on attention collapse.
- Highly practical, achieving 128K context extrapolation on a single GPU without requiring training.
Limitations & Future Work¶
- The attention calibration threshold \(\lambda\) requires manual setting and lacks an adaptive mechanism.
- Chunk eviction may lose critical information, particularly in scenarios where information is highly dispersed.
- The theoretical analysis assumes a fixed number of chunks \(C\), which may not hold in practical applications.
- Compared to training-based methods (such as Yarn and LongRoPE), there remains a gap in the extrapolation range.
Rating¶
⭐⭐⭐⭐ — The addressed problem is significant and the proposed solution is highly practical. The analysis of parallel attention bias is novel and deep. It successfully achieves significant length extrapolation without requiring training.