BOSCH: Black-Box Binary Optimization for Short-Context Attention-Head Selection in LLMs¶
Conference: ACL 2026
arXiv: 2604.05942
Code: None
Area: LLM Efficiency / Attention Optimization
Keywords: Sliding Window Attention, Attention Head Selection, Black-Box Optimization, Large Neighborhood Search, KV-Cache
TL;DR¶
The authors propose BOSCH, a training-free mixture-of-SWA method at the attention-head level. It models the SWA head selection as a Large Neighborhood Search (LNS) problem and decomposes it into a three-stage optimization (Layer Importance Probing → Adaptive Rate Assignment → Grouped Head Selection). It systematically outperforms layer-level heuristics and six static head-level methods across four models and four ratio settings.
Background & Motivation¶
Background: Post-training hybridization reduces KV-Cache usage and improves latency by replacing a portion of self-attention with Sliding Window Attention (SWA). Existing schemes primarily operate at the layer level (e.g., alternating or BME patterns) or the head level based on static rankings.
Limitations of Prior Work: Layer-level schemes ignore the reality that different heads within the same layer route local and global dependencies separately—switching an entire layer may remove critical global information. Static head-level methods (which rank heads by their degree of locality/globality first, then convert the most local heads) suffer from the "entanglement problem": a head's estimated behavior before hybridization may change after hybridization, leading to suboptimal selection.
Key Challenge: The head-level search space is massive (modern LLMs have hundreds to thousands of heads), making direct black-box optimization algorithms infeasible. Each evaluation is expensive, and the probability of improvement via single-bit flips decreases at a rate of \(\sim 1/N\) as dimensions grow. Methods like MADS experience a sharp drop in efficiency when exceeding approximately 50 variables.
Goal: To find an SWA head selection scheme that outperforms both layer-level heuristics and static head-level methods within a practically feasible evaluation budget.
Key Insight: Model the problem as Large Neighborhood Search (LNS) and decompose the high-dimensional search space into three low-dimensional sub-problems.
Core Idea: Instead of searching all heads directly, the method first probes layer importance, then assigns differentiated SWA ratios to each layer, and finally optimizes head selection jointly within layer groups sharing the same ratio. Each sub-problem's variable count is controlled within the range manageable by black-box optimization.
Method¶
Overall Architecture¶
BOSCH addresses a high-dimensional constrained binary optimization problem: selecting a subset of \(N\) attention heads to replace with SWA to minimize performance loss while meeting a total SWA ratio target \(\rho\), formalized as \(\min_{z \in \{0,1\}^N} \mathcal{L}(\mathcal{M}, z, \mathcal{D})\). Since direct black-box search on thousands of heads is infeasible, BOSCH adopts the Large Neighborhood Search (LNS) strategy to decompose this massive space into three low-dimensional sub-problems solved sequentially: probing layer sensitivity to localization, assigning differentiated SWA ratios, and optimizing specific head selection within groups. The input is a pre-trained model and a target ratio; the output is a global head selection mask that satisfies the budget with optimal performance.
%%{init: {'flowchart': {'rankSpacing': 24, 'nodeSpacing': 28, 'padding': 6, 'wrappingWidth': 400}}}%%
flowchart TD
A["Input: Pre-trained Model + Target SWA Ratio ρ"] --> B["Layer Importance Probing<br/>Top-to-bottom cascaded search, convert ⌈ρH⌉ heads per layer"]
B --> C["Result: Per-layer best score vector s_best ∈ ℝ^L"]
C --> D["Adaptive Rate Assignment<br/>Calculate drop δ → weight w_ℓ, map to ratio r_ℓ via binning"]
D --> E["Shift layers between adjacent bins to precisely meet global budget ρ"]
E --> F["Multi-layer Head Selection<br/>Group layers by ratio, concatenate head indices for joint optimization"]
F -->|Driven by Loss L = −Ŝ + α(ρ(z)−ρ)²| G["Commit to Global Head Selection Mask z ∈ {0,1}^N"]
Key Designs¶
1. Stage 1: Layer Importance Probing — Measuring sensitivity via cascaded search
The first step answers "which layers are sensitive to localization." BOSCH iterates from the top layer to the bottom. For each layer, it uses a small-budget black-box search to convert \(\lceil \rho H \rceil\) heads to SWA and records the best score. Since the layers above it have already been localized when searching a specific layer, this forms a cascaded evaluation. The resulting scores reflect the true cost of localizing a layer given the existing context of localization. This provides a data-driven basis for differentiated budget allocation.
2. Stage 2: Adaptive Rate Assignment — Skewing budget toward resilient layers
Performance tolerance for localization varies significantly across layers. A uniform ratio either wastes the margin of easy layers or overwhelms difficult ones. BOSCH calculates the performance drop \(\delta\) relative to the original model, converts this to a weight \(w_\ell \in [0,1]\) (lower values indicate easier localization), and sorts layers into bins mapped to coarse ratio levels. By shifting layers between adjacent bins, it precisely satisfies the global budget \(\rho\), ensuring "easier" layers carry more SWA load.
3. Stage 3: Multi-layer Head Selection — Joint optimization to capture inter-layer interactions
Finally, the method focuses on specific heads. Layers sharing the same ratio are grouped and processed from easiest to hardest. Within a group, instead of independent selection, BOSCH concatenates the head indices of all layers in the group for joint binary decision optimization. This approach keeps the variable count within the effective range of black-box optimization while capturing layer-to-layer interactions rather than treating them as independent.
Loss & Training¶
The search is driven by a normalized loss function: \(\mathcal{L} = -\hat{\mathcal{S}} + \alpha(\rho(z) - \rho)^2\). The first term normalizes the score using the performance of all-SWA and all-attention models as anchors, while the second term is a quadratic penalty for deviating from the target ratio. Additionally, for GQA models, BOSCH enforces identical decisions for heads within the same group to ensure actual KV-Cache savings.
Key Experimental Results¶
Main Results (NIAH Benchmark, 4 Qwen3 Models)¶
| Method | ρ=0.25 | ρ=0.5 | ρ=0.75 | ρ=0.875 |
|---|---|---|---|---|
| Ours (BOSCH, 8B) | 98.9 | 90.3 | 72.7 | 42.5 |
| Fisher (Prev. SOTA, 8B) | 94.2 | 89.3 | 63.4 | 29.0 |
| RAND (Layer-level, 8B) | 45.9 | 15.4 | 12.8 | 13.2 |
| BME (Layer-level, 8B) | 30.8 | 12.4 | 12.2 | 12.7 |
Ablation Study¶
| Configuration | Description |
|---|---|
| BOSCH-single | Uses only the single-layer search results from Stage 1 |
| BOSCH-multi | Uses only Stage 3 multi-layer search (no adaptive ratio) |
| BOSCH-layer | Layer-level optimization instead of head-level |
| Full BOSCH | Complete three-stage pipeline, consistently optimal |
Key Findings¶
- BOSCH is optimal or near-optimal across all 16 settings (4 models × 4 ratios), with its advantage becoming more pronounced at high SWA ratios.
- At \(\rho=0.875\) (87.5% of heads using SWA), BOSCH maintains performance between 26.9–47.2, while most baselines approach random performance.
- Significant "turnover" exists in the set of selected heads under different SWA ratios, proving the existence of the "entanglement problem" and showing that fixed rankings cannot handle varying ratio requirements.
Highlights & Insights¶
- The decomposition strategy for Large Neighborhood Search is highly effective: breaking an N-dimensional binary optimization into three low-dimensional problems keeps each within the effective range of black-box optimization.
- Discovery and validation of the "Entanglement Problem": The variation of optimal head sets across SWA ratios explains why static ranking methods are insufficient.
- Training-free nature: The method can be directly applied to post-training optimization of already deployed models.
Limitations & Future Work¶
- The three-stage search still requires a computational budget (multiple model forward passes).
- Evaluation was limited to the Qwen3 series; effectiveness on other architectures like Llama or Mistral remains to be confirmed.
- While NIAH and LongBench were used, real-world long-context applications are more diverse.
Related Work & Insights¶
- vs. Layer-level Heuristics (INTR/BME): These ignore routing differences at the head level, causing performance to collapse at high SWA ratios.
- vs. Fisher/Razor (Static Head-level): These suffer from the "entanglement problem," where head behavior changes post-hybridization, leading to suboptimal selection.
Rating¶
- Novelty: ⭐⭐⭐⭐ The LNS decomposition is innovative, and the entanglement analysis is deep.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Comprehensive coverage across 4 models, 4 ratios, and 9+ baselines.
- Writing Quality: ⭐⭐⭐⭐ Clear problem formalization and algorithm description.
- Value: ⭐⭐⭐⭐ Practical for KV-Cache optimization in long-context LLMs.
Related Papers¶
- [ICML 2025] MoH: Multi-Head Attention as Mixture-of-Head Attention
- [ICML 2025] Long-Short Alignment for Effective Long-Context Modeling in LLMs
- [ACL 2025] LADM: Long-context Training Data Selection with Attention-based Dependency Measurement for LLMs
- [NeurIPS 2025] From Shortcut to Induction Head: How Data Diversity Shapes Algorithm Selection in Transformers
- [ACL 2026] Native Hybrid Attention for Efficient Sequence Modeling