Efficient Many-Shot In-Context Learning with Dynamic Block-Sparse Attention¶

Conference: ACL 2025
arXiv: 2503.08640
Code: millix19/dbsa
Area: LLM Efficiency / In-Context Learning
Keywords: Many-Shot ICL, Block-Sparse Attention, KV Cache, Retrieval ICL, Inference Efficiency

TL;DR¶

Proposes Dynamic Block-Sparse Attention (DBSA), a training-free inference framework for many-shot in-context learning. It combines structured block-sparse attention encoding with a dynamic, retrieval-based KV cache, bringing inference latency close to that of a fine-tuned model while retaining more than 95% of the accuracy of the best-performing method.

Background & Motivation¶

Background¶

Many-Shot In-Context Learning (Many-Shot ICL) conditions the model on thousands of demonstration examples. It can match or even surpass fine-tuning without any parameter updates, adapts easily to new tasks, and removes the need to deploy a dedicated model per task. However, scaling ICL to long contexts introduces new computational trade-offs:

Fixed ICL: Uses a fixed set of demonstrations for all queries. It can cache KV states, but its performance is inferior to Retrieval ICL.
Retrieval ICL: Dynamically retrieves relevant demonstrations for each query, offering better performance but requiring context re-encoding for every inference, which is highly expensive.
Fine-tuning: Has the highest setup cost (requires training) but the lowest inference overhead.

Key Challenge¶

Many-Shot ICL shifts the computational burden from training to inference. The inference cost of processing thousands of demonstrations is orders of magnitude higher than zero-shot inference, rendering it impractical for high-throughput applications.

Goal¶

Design a framework to simultaneously achieve the high accuracy of Retrieval ICL and the low latency of Fixed (cached) ICL, making Many-Shot ICL feasible for actual deployment.

Method¶

Overall Architecture¶

DBSA formulates Many-Shot ICL as a two-stage process:

1. Stage 1 (One-time Pre-encoding): Encodes the entire demonstration pool using block-sparse attention.
2. Stage 2 (Dynamic Inference): Retrieves relevant demonstration blocks for each test query and reuses the pre-encoded KV cache.

Key Designs¶

Stage 1: Block-Sparse Streaming Attention Encoding¶

Given a demonstration pool \(D = \{d_1, d_2, \dots, d_n\}\):

1. Divide \(D\) into \(n/k\) blocks: \(D = [b_1, b_2, \dots, b_{n/k}]\), where each block contains \(k\) demonstrations.
2. Each block \(b_i\) only attends to three regions:
    - Anchor block \(b_1\) (attention sink)
    - Previous \(j\) blocks \(\{b_{i-j}, \dots, b_{i-1}\}\) (local context)
    - Itself (standard causal attention)
3. Use sequential position encoding \([0, \dots, n-1]\), but cache the KV states before applying Rotary Position Embeddings (RoPE).
4. Key advantage: New demonstrations can be incrementally appended at any time, requiring the encoding of only one additional block.

The implementation uses PyTorch's FlexAttention, which skips computation for fully masked blocks and thereby achieves a speedup proportional to the sparsity.
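The attention pattern described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: blocks are indexed from 0 (so block 0 is the anchor), every block has the same token length, and the function names `attended_blocks`, `token_mask`, and `block_sparsity` are hypothetical.

```python
def attended_blocks(i, local_window):
    """Blocks that block i may attend to (0-based; block 0 is the anchor/sink)."""
    # The previous `local_window` blocks plus block i itself.
    local = range(max(0, i - local_window), i + 1)
    return sorted({0, *local})


def token_mask(q_tok, kv_tok, block_len, local_window):
    """Token-level predicate: may the query token attend to the key token?"""
    q_blk, kv_blk = q_tok // block_len, kv_tok // block_len
    causal = kv_tok <= q_tok                    # never look ahead
    is_anchor = kv_blk == 0                     # attention sink block
    is_local = q_blk - kv_blk <= local_window   # own block or recent blocks
    return causal and (is_anchor or is_local)


def block_sparsity(num_blocks, local_window):
    """Fraction of causal block pairs that the pattern skips entirely."""
    causal_pairs = num_blocks * (num_blocks + 1) // 2
    kept = sum(len(attended_blocks(i, local_window)) for i in range(num_blocks))
    return 1 - kept / causal_pairs


# Block 5 with a local window of 2 sees the anchor, blocks 3 and 4, and itself.
assert attended_blocks(5, 2) == [0, 3, 4, 5]
```

Because each block keeps at most `local_window + 2` block pairs, the kept work grows linearly with the number of blocks while full causal attention grows quadratically, so the skipped fraction increases as the pool gets larger.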

Stage 2: Dynamic Demonstration Selection and Answer Generation¶

Given a test query \(q^*\):

1. Use a retrieval method (such as BM25) to select a subset of demonstration blocks \(D' = \{b'_1, b'_2, \dots, b'_m\}\), where \(m < n/k\).
2. The anchor block \(b_1\) is always included (serving as an attention sink).
3. Concatenate the KV caches of the selected blocks, and re-apply Rotary Position Embeddings using new sequential position IDs \([0, |D'|-1]\).
4. The test query attends to the selected KV cache via full attention and generates the answer autoregressively.
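Caching keys before RoPE works because RoPE is a position-dependent rotation, and the resulting attention score depends only on the offset between query and key positions. The sketch below illustrates this for a single 2-D feature pair; the frequency `theta`, the example vectors, and the helper names are assumptions chosen for illustration.

```python
import math


def rope_rotate(pair, position, theta=1.0):
    """Rotate one 2-D feature pair by the angle position * theta."""
    x, y = pair
    angle = position * theta
    c, s = math.cos(angle), math.sin(angle)
    return (x * c - y * s, x * s + y * c)


def score(q, k):
    """Dot-product attention score for one 2-D pair."""
    return q[0] * k[0] + q[1] * k[1]


raw_key = (0.3, -1.2)    # cached before any rotation is applied
raw_query = (0.7, 0.5)

# After retrieval the key is placed at position 20 and the query at 25.
compact = score(rope_rotate(raw_query, 25), rope_rotate(raw_key, 20))
# The same offset of 5 at much larger positions gives the same score.
spread = score(rope_rotate(raw_query, 125), rope_rotate(raw_key, 120))
assert abs(compact - spread) < 1e-9
```

Storing the unrotated key means any new position can be assigned at inference time with one rotation, without re-encoding the block.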

DBSA supports integrating plug-and-play retrieval methods (text similarity, cosine similarity, diversity retrievers, etc.); this paper utilizes BM25.
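The selection and re-indexing steps can be sketched as follows. This is a hedged illustration: the per-block relevance scores are assumed to come from some retriever such as BM25 (not implemented here), the anchor block is assumed to count toward the retrieval budget, selected blocks are kept in their original encoding order, and `select_blocks` and `new_position_ids` are hypothetical names.

```python
def select_blocks(block_scores, retrieval_ratio=0.3):
    """Pick the top-scoring blocks, always keeping anchor block 0."""
    num_blocks = len(block_scores)
    budget = max(1, round(retrieval_ratio * num_blocks))
    # Rank every non-anchor block by relevance, highest first.
    ranked = sorted(range(1, num_blocks), key=lambda i: block_scores[i], reverse=True)
    chosen = {0, *ranked[: budget - 1]}
    return sorted(chosen)  # original encoding order


def new_position_ids(selected, block_lengths):
    """Assign consecutive token positions to the concatenated blocks."""
    positions, offset = {}, 0
    for b in selected:
        positions[b] = list(range(offset, offset + block_lengths[b]))
        offset += block_lengths[b]
    return positions, offset  # offset is where the query tokens start


scores = [0.1, 0.9, 0.2, 0.8, 0.3, 0.7, 0.0, 0.5, 0.4, 0.6]
selected = select_blocks(scores)
positions, query_start = new_position_ids(selected, [4] * len(scores))
assert selected == [0, 1, 3]
assert query_start == 12
```

Swapping in a different retriever only changes how `block_scores` is produced; the cache concatenation and position re-assignment stay the same.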

Configuration Parameters¶

50 demonstrations per block
Random grouping
Each block attends to the previous 2 blocks as local context
Retrieval ratio of 30%
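These settings can be captured in a small configuration object together with the random grouping step. The sketch below is illustrative only: the class and field names, the `seed` field, and the handling of a final partial block are assumptions that the source does not specify.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class DBSAConfig:
    demos_per_block: int = 50      # demonstrations per block
    local_window: int = 2          # previous blocks each block attends to
    retrieval_ratio: float = 0.3   # fraction of blocks retrieved per query
    seed: int = 0                  # hypothetical: makes the shuffle repeatable


def random_blocks(demos, cfg):
    """Shuffle the pool and cut it into consecutive blocks."""
    shuffled = list(demos)
    random.Random(cfg.seed).shuffle(shuffled)
    k = cfg.demos_per_block
    # The last block may be shorter when the pool size is not a multiple of k.
    return [shuffled[i:i + k] for i in range(0, len(shuffled), k)]


blocks = random_blocks(range(120), DBSAConfig())
assert [len(b) for b in blocks] == [50, 50, 20]
```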

Experiments¶

Experimental Setup¶

Models: Llama-2-7B (32k), Llama-3.1-8B (128k)
Datasets: 5 classification datasets — TREC (6 classes), TREC-Fine (50 classes), NLU (68 classes), Banking-77 (77 classes), Clinic-150 (151 classes)
Context Length: 30k and 90k tokens
Baselines: Fixed ICL (cached KV), Retrieval ICL (re-encoded every time), LoRA fine-tuning
Hardware: L40S (48GB) for 30k, A100 (80GB) for 90k

Efficiency Comparison¶

Relative latency, normalized to the Retrieval ICL baseline (RetICL = 1x; lower is better):

Method	30k Setup	30k Inference	90k Setup	90k Inference
RetICL	1x	1x	1x	1x
Fixed ICL	5x	0.11x	6.5x	0.06x
LoRA Fine-tuning	>600x	0.08x	>1500x	0.046x
DBSA	3x	0.10x	4x	0.053x

(Using Llama-3.1-8B as an example)

Key Observations:

- DBSA's inference latency is close to that of the fine-tuned model (0.053x vs. 0.046x of RetICL at 90k), while its setup time is less than 1/375 of fine-tuning's at 90k (4x vs. >1500x) and less than 1/200 at 30k (3x vs. >600x).
- In scenarios with >100,000 requests, DBSA remains the most efficient solution.
- Grouped-Query Attention (GQA) significantly reduces cached ICL latency; at 90k, the inference latency of DBSA on Llama-3.1-8B is only 5.3% of RetICL's.
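The setup-time ratio follows directly from the table. A short check, using only the tabulated relative values (the LoRA entries are lower bounds, so the resulting ratios are lower bounds too; the dictionary layout is just for illustration):

```python
# Relative setup cost from the table (RetICL = 1x); LoRA values are lower bounds.
setup = {
    "30k": {"LoRA": 600, "DBSA": 3},
    "90k": {"LoRA": 1500, "DBSA": 4},
}

# How many times cheaper DBSA setup is than LoRA fine-tuning, per context length.
setup_ratio = {ctx: row["LoRA"] / row["DBSA"] for ctx, row in setup.items()}

assert setup_ratio == {"30k": 200.0, "90k": 375.0}
```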

Accuracy Comparison¶

90k context, Llama-3.1-8B:

Dataset	Fixed ICL	Ret ICL	Fine-tuning	DBSA
TREC	0.96	0.95	0.96	0.95 (99%)
TREC Fine	0.88	0.89	0.83	0.88 (99%)
Banking77	0.91	0.90	0.81	0.89 (98%)
Clinic	0.92	0.91	0.74	0.90 (98%)
NLU	0.89	0.90	0.82	0.88 (98%)
Average	0.91	0.91	0.83	0.90 (99%)

Under a 90k context, DBSA reaches 98-99% of the accuracy of the best-performing method on every dataset (99% on average). Fine-tuning matches the best ICL result only on TREC and trails Many-Shot ICL on the other four datasets.
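The percentages in the DBSA column are the ratio of DBSA's accuracy to the best score among the other three methods. A small sketch that recomputes them from the table (the helper name `relative_to_best` is hypothetical):

```python
results = {
    "TREC":      {"Fixed ICL": 0.96, "Ret ICL": 0.95, "Fine-tuning": 0.96, "DBSA": 0.95},
    "TREC Fine": {"Fixed ICL": 0.88, "Ret ICL": 0.89, "Fine-tuning": 0.83, "DBSA": 0.88},
    "Banking77": {"Fixed ICL": 0.91, "Ret ICL": 0.90, "Fine-tuning": 0.81, "DBSA": 0.89},
    "Clinic":    {"Fixed ICL": 0.92, "Ret ICL": 0.91, "Fine-tuning": 0.74, "DBSA": 0.90},
    "NLU":       {"Fixed ICL": 0.89, "Ret ICL": 0.90, "Fine-tuning": 0.82, "DBSA": 0.88},
    "Average":   {"Fixed ICL": 0.91, "Ret ICL": 0.91, "Fine-tuning": 0.83, "DBSA": 0.90},
}


def relative_to_best(row):
    """DBSA accuracy divided by the best accuracy among the other methods."""
    best = max(score for method, score in row.items() if method != "DBSA")
    return row["DBSA"] / best


for dataset, row in results.items():
    print(f"{dataset}: {relative_to_best(row):.0%}")
```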

Ablation Study¶

Sparse Attention Patterns¶

Pattern	Average Accuracy
Full Attention	0.84
Sink + Prev + Self (DBSA)	0.82
Sink + Self	0.27
Self Only	0.09

Both attention sinks and local context connections are essential. Dropping the local context (Sink + Self) reduces average accuracy from 0.82 to 0.27, indicating that the token eviction strategy of StreamingLLM is unsuitable for Many-Shot ICL scenarios; dropping the sink as well (Self Only) reduces it further to 0.09.

Block-level vs. Example-level Retrieval¶

Setting	Standard Inference	DBSA	Gap
Example-level (90k)	0.90	0.86	0.04
Block-level (90k)	0.89	0.88	0.01

Block-level selection better preserves the contextual relationships established during encoding, while being faster and more memory-efficient during retrieval.

Block Grouping Strategies¶

Strategy	Average Accuracy
Random Grouping	0.827
K-means Clustering	0.764
Clustering + 10% Diversity	0.810

Surprisingly, random grouping performs the best, as it naturally introduces intra-block diversity.

Dynamic Block Sorting¶

Maintaining original encoding order or sorting from lowest to highest relevance yields comparable performance.
Reversing the order significantly degrades performance due to local attention dependencies.

Storage Overhead¶

KV Cache of Llama-3.1-8B: 0.125 MiB/token, ~3.7 GB for 30k, ~11.1 GB for 90k.
LoRA Fine-tuning: ~0.01 GB per task, but cumulative storage costs become significant in multi-task scenarios (hundreds of tasks).
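The per-token figure can be reproduced from the model shape. The sketch below assumes the public Llama-3.1-8B configuration (32 layers, 8 key/value heads under GQA, head dimension 128) and 16-bit cache entries; it also reads "30k" and "90k" as 30,000 and 90,000 tokens, which gives roughly 3.7 and 11.0 GiB, close to the figures quoted above. The exact token counts are not stated here, so small differences are expected.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Bytes of KV cache per token: one key and one value vector per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


# Assumed Llama-3.1-8B shape; 2 bytes per value corresponds to fp16/bf16.
LLAMA_3_1_8B = dict(num_layers=32, num_kv_heads=8, head_dim=128)

per_token = kv_bytes_per_token(**LLAMA_3_1_8B)
per_token_mib = per_token / 2**20


def cache_gib(num_tokens):
    """Total KV cache size in GiB for a context of num_tokens tokens."""
    return num_tokens * per_token / 2**30


assert per_token_mib == 0.125
print(f"30k tokens: {cache_gib(30_000):.2f} GiB, 90k tokens: {cache_gib(90_000):.2f} GiB")
```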

Highlights & Insights¶

Training-Free Inference Framework: DBSA requires no auxiliary training or parameter tuning, functioning as a plug-and-play solution.
Overcoming ICL Efficiency Bottlenecks: At 90k context it cuts setup time by more than 375x relative to fine-tuning (more than 200x at 30k) while achieving near-fine-tuning inference latency, making Many-Shot ICL viable in practice.
Incremental Demonstration Insertion: The computational cost to encode new demonstrations remains constant regardless of the total pool size, making it highly suitable for dynamically changing data scenarios.
Fine-Tuning Underperforms ICL on These Datasets: Even when trained on the full target dataset, fine-tuning only matches Many-Shot ICL on TREC and trails it on the other four datasets, challenging the conventional belief that "fine-tuning is always superior."
Efficacy of Local Attention & Sink: The Sink + Prev + Self pattern exhibits robust performance in a training-free setting, maintaining strong capability even with >90% sparsity.
Comprehensive Ablation Analyses: Thorough investigations are presented across various aspects (sparsity patterns, grouping strategies, sorting).

Limitations & Future Work¶

The efficacy of the method relies heavily on the quality of Retrieval ICL, making it unsuitable for tasks requiring synthesis over the entire demonstration set (e.g., statistical tasks).
Evaluation is restricted to classification datasets, without covering generative tasks (e.g., translation, summarization).
The accuracy gap between DBSA and Ret ICL is more pronounced at the shorter 30k context length.
The retrieval ratio is fixed at 30% and has not been optimized.
The memory footprint for storing the KV Cache may become a bottleneck for extremely large demonstration pools.

Related Work¶

Many-Shot ICL: Bertsch et al. 2024, Agarwal et al. 2024 (thousand-shot ICL matching fine-tuning)
Sparse Attention: StreamingLLM (Xiao et al. 2024), Star Attention (Acharya et al. 2024)
KV Cache Compression: Token eviction (Xiao et al. 2024), quantization, and low-rank approximation
ICL Demonstration Selection: BM25 retrieval (Luo et al. 2024), Parallel Context Windows (Ratner et al. 2023)
Sparse Attention in RAG: Lu et al. 2024, Sun et al. 2024

Rating ⭐⭐⭐⭐¶

This work addresses a practical and important problem: the deployment efficiency of Many-Shot ICL. The methodology is elegant (two-stage pipeline with a block-sparse design), and the empirical evaluation is solid. Although it only evaluates classification tasks with relatively low task diversity, its core contribution is highly valuable: showing that training-free sparse attention can bring ICL inference latency close to that of fine-tuning.