BOSCH: Black-Box Binary Optimization for Short-Context Attention-Head Selection in LLMs¶
Conference: ACL 2026 arXiv: 2604.05942 Code: N/A Area: LLM Efficiency / Attention Optimization Keywords: Sliding Window Attention, Attention Head Selection, Black-Box Optimization, Large Neighborhood Search, KV-Cache
TL;DR¶
BOSCH is a training-free method for mixing sliding window attention (SWA) at the head level. It casts SWA head selection as a large neighborhood search problem decomposed into three stages (layer importance probing → adaptive ratio allocation → grouped head selection), and it consistently outperforms layer-level heuristics and six static head-level methods across four models and four SWA-ratio settings.
Method¶
Key Designs¶
- Stage 1: Layer Importance Probing: evaluates each layer's sensitivity to head localization (switching heads to SWA) via cascaded small-budget black-box search.
- Stage 2: Adaptive Ratio Allocation: differentially allocates per-layer SWA ratios based on sensitivity, mapping layers into coarse-grained buckets.
- Stage 3: Multi-Layer Head Selection: groups layers sharing the same ratio and jointly optimizes head binary decisions within each group.
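The three stages above can be sketched as a toy black-box search. Since the paper's code is unreleased (Code: N/A), every name, the bucket scheme, and the swap-based local-search move below are illustrative assumptions, not the authors' implementation:

```python
import random

def probe_layer_importance(score_fn, n_layers, n_heads):
    """Stage 1 (sketch): sensitivity of each layer = score drop when
    all of its heads are converted to SWA, measured via the black box."""
    base = score_fn({})  # empty assignment = full attention everywhere
    return {l: base - score_fn({l: set(range(n_heads))}) for l in range(n_layers)}

def allocate_ratios(sensitivity, buckets=(0.25, 0.5, 0.75)):
    """Stage 2 (sketch): map layers into coarse ratio buckets, giving
    the most sensitive layers the smallest SWA ratio."""
    order = sorted(sensitivity, key=sensitivity.get, reverse=True)  # most sensitive first
    per_bucket = -(-len(order) // len(buckets))  # ceil division
    return {layer: buckets[min(i // per_bucket, len(buckets) - 1)]
            for i, layer in enumerate(order)}

def select_heads(score_fn, ratios, n_heads, iters=50, seed=0):
    """Stage 3 (sketch): group layers by ratio, then run a simple
    swap-one-head local search jointly within each group."""
    rng = random.Random(seed)
    groups = {}
    for layer, r in ratios.items():
        groups.setdefault(r, []).append(layer)
    assignment = {l: set(rng.sample(range(n_heads), round(r * n_heads)))
                  for l, r in ratios.items()}
    for r, layers in groups.items():
        best = score_fn(assignment)
        for _ in range(iters):
            layer = rng.choice(layers)
            chosen = assignment[layer]
            out = rng.choice(sorted(chosen))
            inn = rng.choice([h for h in range(n_heads) if h not in chosen])
            chosen.remove(out); chosen.add(inn)  # propose a swap
            cand = score_fn(assignment)
            if cand >= best:
                best = cand                       # keep the swap
            else:
                chosen.remove(inn); chosen.add(out)  # revert
    return assignment

# Toy black-box objective: full-attention score 100, minus a per-head
# penalty when that head is switched to SWA (head 0 matters most,
# earlier layers matter more).
N_LAYERS, N_HEADS = 4, 8
PENALTY = {l: {h: (N_LAYERS - l) + (2 if h == 0 else 0) for h in range(N_HEADS)}
           for l in range(N_LAYERS)}

def toy_score(assignment):
    return 100 - sum(PENALTY[l][h] for l, hs in assignment.items() for h in hs)

sens = probe_layer_importance(toy_score, N_LAYERS, N_HEADS)
ratios = allocate_ratios(sens)
plan = select_heads(toy_score, ratios, N_HEADS)
```

The decomposition keeps each black-box call cheap: Stage 1 needs one probe per layer, and Stage 3 searches only within small same-ratio groups instead of over all N heads at once.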
Key Experimental Results¶
Average scores at four SWA ratios ρ (higher is better; ρ is the fraction of heads converted to SWA):
| Method | ρ=0.25 | ρ=0.5 | ρ=0.75 | ρ=0.875 |
|---|---|---|---|---|
| BOSCH (8B) | 98.9 | 90.3 | 72.7 | 42.5 |
| Fisher (best baseline) | 94.2 | 89.3 | 63.4 | 29.0 |
Highlights & Insights¶
- The large neighborhood search decomposition breaks an N-dimensional binary optimization into three tractable subproblems
- "Entanglement problem" discovery: optimal head sets differ significantly across SWA ratios, showing that static ranking methods are insufficient
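Concretely, head-level SWA mixing means some heads in a layer keep the full causal mask while selected heads see only a local window. A minimal sketch (window size, function names, and the boolean-mask representation are all hypothetical, for illustration only):

```python
def head_mask(seq_len, window, use_swa):
    """Boolean causal mask for one head: True where query i may attend
    to key j. SWA heads are additionally restricted to the last
    `window` positions, which is what shrinks their KV-cache."""
    return [[(j <= i) and (not use_swa or i - j < window)
             for j in range(seq_len)] for i in range(seq_len)]

def layer_masks(seq_len, window, n_heads, swa_heads):
    """Per-head masks for one layer: full causal attention for heads
    outside `swa_heads`, sliding-window attention for those inside."""
    return [head_mask(seq_len, window, h in swa_heads) for h in range(n_heads)]

masks = layer_masks(seq_len=5, window=2, n_heads=4, swa_heads={1, 3})
```

Under this view, the entanglement finding says the best choice of `swa_heads` at ρ=0.25 is not simply a subset of the best choice at ρ=0.5, so any fixed head ranking must miss some configurations.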
Rating¶
- Novelty: ⭐⭐⭐⭐
- Experimental Thoroughness: ⭐⭐⭐⭐⭐
- Writing Quality: ⭐⭐⭐⭐
- Value: ⭐⭐⭐⭐