BOSCH: Black-Box Binary Optimization for Short-Context Attention-Head Selection in LLMs

Conference: ACL 2026
arXiv: 2604.05942
Code: N/A
Area: LLM Efficiency / Attention Optimization
Keywords: Sliding Window Attention, Attention Head Selection, Black-Box Optimization, Large Neighborhood Search, KV-Cache

TL;DR

BOSCH is a training-free method for mixing sliding window attention (SWA) and full attention at the head level. It formulates SWA head selection as a large neighborhood search problem and decomposes it into three stages: layer importance probing, adaptive ratio allocation, and grouped head selection. It consistently outperforms layer-level heuristics and 6 static head-level methods across 4 models and 4 SWA ratio settings.
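
To make "head-level SWA mixing" concrete, here is a minimal pure-Python sketch of per-head attention masks, where some heads use a causal sliding window and the rest keep full causal attention. The function names, the window size, and the example selection are all hypothetical and are not taken from the paper. The cache count also assumes one KV head per query head, which ignores grouped-query attention.

```python
from typing import List

Mask = List[List[bool]]  # mask[i][j] is True when query i may attend to key j


def causal_mask(seq_len: int) -> Mask:
    """Full causal attention: query i sees every key j <= i."""
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]


def sliding_window_mask(seq_len: int, window: int) -> Mask:
    """Causal sliding window: query i sees only the last `window` keys."""
    return [[i - window < j <= i for j in range(seq_len)] for i in range(seq_len)]


def per_head_masks(is_swa: List[bool], seq_len: int, window: int) -> List[Mask]:
    """One mask per head: SWA heads get the local mask, the others stay global."""
    local = sliding_window_mask(seq_len, window)
    full = causal_mask(seq_len)
    return [local if flag else full for flag in is_swa]


def swa_ratio(is_swa: List[bool]) -> float:
    """Fraction of heads converted to SWA."""
    return sum(is_swa) / len(is_swa)


def cached_positions(is_swa: List[bool], seq_len: int, window: int) -> int:
    """Key/value positions kept across heads (assumes one KV head per query head)."""
    return sum(min(seq_len, window) if flag else seq_len for flag in is_swa)


# Hypothetical selection for a single 4-head layer.
is_swa = [True, False, True, False]
masks = per_head_masks(is_swa, seq_len=6, window=2)
print(swa_ratio(is_swa), cached_positions(is_swa, seq_len=6, window=2))
```

Under these assumptions, an SWA head only needs to keep the last `window` positions in its KV cache, which is where the memory saving comes from as the SWA ratio grows.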

Method

Key Designs

Stage 1: Layer Importance Probing: Evaluates each layer's sensitivity to head localization (converting its heads to SWA) via cascaded small-budget black-box search.
Stage 2: Adaptive Ratio Allocation: Assigns each layer its own SWA ratio according to its sensitivity, mapping layers into coarse-grained ratio buckets.
Stage 3: Multi-Layer Head Selection: Groups layers that share the same ratio and jointly optimizes the binary per-head decisions within each group.
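
The three stages can be sketched roughly as follows. This is a toy illustration under stated assumptions, not the authors' implementation: the score function is a synthetic stand-in for a real model evaluation, the search is a simple swap-based local search, and every name, bucket size, and budget is hypothetical. Stage 1 here probes each layer independently and does not reproduce the cascaded scheme.

```python
import random
from typing import Callable, Dict, List

Config = List[List[int]]           # config[layer][head] == 1 means the head uses SWA
Score = Callable[[Config], float]  # black-box quality score, higher is better


def swap_search(score: Score, config: Config, layers: List[int],
                iters: int, rng: random.Random) -> Config:
    """Swap one SWA head with one global head inside a layer from `layers`.
    Per-layer SWA counts never change; a swap is kept only if the score does not drop."""
    best = [row[:] for row in config]
    best_val = score(best)
    movable = [l for l in layers if 0 < sum(best[l]) < len(best[l])]
    for _ in range(iters if movable else 0):
        l = rng.choice(movable)
        on = rng.choice([h for h, b in enumerate(best[l]) if b])
        off = rng.choice([h for h, b in enumerate(best[l]) if not b])
        best[l][on], best[l][off] = 0, 1
        val = score(best)
        if val >= best_val:
            best_val = val                    # keep the swap
        else:
            best[l][on], best[l][off] = 1, 0  # revert the swap
    return best


def probe_layers(score: Score, n_layers: int, n_heads: int, k_probe: int,
                 iters: int, rng: random.Random) -> List[float]:
    """Stage 1 (simplified): score drop when only one layer gets k_probe SWA heads."""
    base = score([[0] * n_heads for _ in range(n_layers)])
    sens = []
    for l in range(n_layers):
        cfg = [[0] * n_heads for _ in range(n_layers)]
        cfg[l] = [1] * k_probe + [0] * (n_heads - k_probe)
        cfg = swap_search(score, cfg, [l], iters, rng)
        sens.append(base - score(cfg))
    return sens


def allocate(sens: List[float], buckets: List[int], total_swa: int) -> List[int]:
    """Stage 2 (simplified): the least sensitive layers take the largest
    bucket (an SWA head count) that still fits the remaining budget."""
    counts = [0] * len(sens)
    remaining = total_swa
    for l in sorted(range(len(sens)), key=lambda i: sens[i]):
        counts[l] = max(b for b in buckets if b <= remaining)
        remaining -= counts[l]
    return counts


def select_heads(score: Score, counts: List[int], n_heads: int,
                 iters: int, rng: random.Random) -> Config:
    """Stage 3 (simplified): search jointly over layers sharing the same SWA count."""
    cfg = [[1] * k + [0] * (n_heads - k) for k in counts]
    groups: Dict[int, List[int]] = {}
    for l, k in enumerate(counts):
        groups.setdefault(k, []).append(l)
    for layers in groups.values():
        cfg = swap_search(score, cfg, layers, iters, rng)
    return cfg


# Toy demo: a synthetic additive score replaces a real model evaluation.
rng = random.Random(0)
L, H, rho = 4, 8, 0.5
cost = [[rng.random() * (l + 1) for _ in range(H)] for l in range(L)]
score = lambda cfg: -sum(c * b for cr, br in zip(cost, cfg) for c, b in zip(cr, br))

sens = probe_layers(score, L, H, k_probe=4, iters=50, rng=rng)
counts = allocate(sens, buckets=[0, 2, 4, 6], total_swa=int(rho * L * H))
config = select_heads(score, counts, H, iters=200, rng=rng)
print(counts, score(config))
```

The swap move keeps each layer's SWA head count fixed, so the ratios chosen in Stage 2 are preserved while Stage 3 only decides which heads fill them. A real setup would replace the synthetic `score` with an evaluation of the model under the candidate configuration.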

Key Experimental Results

Method	ρ=0.25	ρ=0.5	ρ=0.75	ρ=0.875
BOSCH (8B)	98.9	90.3	72.7	42.5
Fisher (best baseline)	94.2	89.3	63.4	29.0
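
To read the margins off the table directly, the short snippet below subtracts the best-baseline row from the BOSCH row at each ratio. The values are copied from the table above; the variable names are illustrative only.

```python
ratios = [0.25, 0.5, 0.75, 0.875]
bosch = [98.9, 90.3, 72.7, 42.5]   # BOSCH (8B) row
fisher = [94.2, 89.3, 63.4, 29.0]  # Fisher (best baseline) row

# Absolute gap between BOSCH and the best baseline at each ratio.
gaps = [round(b - f, 1) for b, f in zip(bosch, fisher)]
for rho, gap in zip(ratios, gaps):
    print(f"rho={rho}: +{gap}")
```

The gap is smallest at ρ=0.5 and largest at ρ=0.875, the most aggressive setting in the table.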

Highlights & Insights

The large neighborhood search decomposition breaks an N-dimensional binary optimization problem into three tractable subproblems.
"Entanglement problem" finding: the optimal head sets differ significantly across SWA ratios, which indicates that static ranking methods are insufficient.

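One way to read the entanglement observation, offered here as an interpretation rather than the paper's definition: a static ranking selects the top-k heads for every ratio, so its head sets are always nested as k grows. If the best sets at different ratios are not nested, no single ranking can reproduce them. The sketch below checks nestedness and overlap; the importance scores and the "searched" sets are made-up illustrative data.

```python
from typing import Dict, List, Set


def top_k_static(importance: List[float], k: int) -> Set[int]:
    """Static ranking: convert the k least important heads to SWA."""
    order = sorted(range(len(importance)), key=lambda h: importance[h])
    return set(order[:k])


def is_nested(sets_by_k: Dict[int, Set[int]]) -> bool:
    """True if every smaller set is contained in the next larger one."""
    ordered = [sets_by_k[k] for k in sorted(sets_by_k)]
    return all(a <= b for a, b in zip(ordered, ordered[1:]))


def jaccard(a: Set[int], b: Set[int]) -> float:
    """Overlap between two head sets (1.0 means identical)."""
    return len(a & b) / len(a | b)


# Hypothetical importance scores for 8 heads.
importance = [0.9, 0.1, 0.5, 0.3, 0.8, 0.2, 0.7, 0.4]
static_sets = {k: top_k_static(importance, k) for k in (2, 4, 6)}

# Hypothetical per-ratio search results that are NOT nested.
searched_sets = {2: {1, 5}, 4: {1, 3, 6, 7}, 6: {0, 2, 3, 5, 6, 7}}

print(is_nested(static_sets), is_nested(searched_sets))
print(jaccard(searched_sets[2], searched_sets[4]))
```
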
Rating

Novelty: ⭐⭐⭐⭐
Experimental Thoroughness: ⭐⭐⭐⭐⭐
Writing Quality: ⭐⭐⭐⭐
Value: ⭐⭐⭐⭐