
BOSCH: Black-Box Binary Optimization for Short-Context Attention-Head Selection in LLMs

Conference: ACL 2026 arXiv: 2604.05942 Code: N/A Area: LLM Efficiency / Attention Optimization Keywords: Sliding Window Attention, Attention Head Selection, Black-Box Optimization, Large Neighborhood Search, KV-Cache

TL;DR

BOSCH is a training-free method for mixing sliding-window attention (SWA) at the head level. It casts SWA head selection as a large neighborhood search problem decomposed into three stages (layer importance probing → adaptive ratio allocation → grouped head selection), and systematically outperforms layer-level heuristics and 6 static head-level methods across 4 models and 4 ratio settings.

Method

Key Designs

  1. Stage 1: Layer Importance Probing: Evaluates each layer's sensitivity to having its heads converted to SWA via a cascaded, small-budget black-box search.

  2. Stage 2: Adaptive Ratio Allocation: Allocates per-layer SWA ratios differentially according to the probed sensitivity, mapping layers into coarse-grained ratio buckets.

  3. Stage 3: Multi-Layer Head Selection: Groups layers that share the same ratio and jointly optimizes the binary SWA/full-attention decision for each head within a group.
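The three stages above can be sketched end-to-end. Everything below is illustrative, not the paper's actual algorithm: the scorer is a mock (a real run would evaluate model quality with the chosen heads switched to SWA), the budgets and bucket values are made up, and the Stage-3 search is a plain swap-based local search standing in for the paper's large neighborhood search.

```python
import random

random.seed(0)
NUM_LAYERS, NUM_HEADS = 8, 16

# Mock black-box scorer: each head gets a hidden importance, and
# converting it to SWA costs that much quality. (Hypothetical stand-in
# for a held-out short-context quality metric.)
_importance = {(l, h): random.random() * (1 + l % 3)
               for l in range(NUM_LAYERS) for h in range(NUM_HEADS)}

def score(swa_heads):
    """swa_heads: dict layer -> set of head indices converted to SWA."""
    return 100.0 - sum(_importance[l, h]
                       for l, hs in swa_heads.items() for h in hs)

# Stage 1: probe each layer's sensitivity with a few small-budget
# random conversion trials, averaging the observed score drop.
def probe_layers(budget=4, probe_ratio=0.5):
    k = int(NUM_HEADS * probe_ratio)
    base = score({})
    return {l: sum(base - score({l: set(random.sample(range(NUM_HEADS), k))})
                   for _ in range(budget)) / budget
            for l in range(NUM_LAYERS)}

# Stage 2: bucket layers into coarse per-layer SWA ratios;
# the least sensitive layers get the largest SWA ratio.
def allocate(sens, buckets=(0.75, 0.5, 0.25)):
    order = sorted(range(NUM_LAYERS), key=lambda l: sens[l])
    per_bucket = NUM_LAYERS // len(buckets)
    return {l: buckets[min(i // per_bucket, len(buckets) - 1)]
            for i, l in enumerate(order)}

# Stage 3: swap-based local search over head choices under the fixed
# per-layer ratios (simplified stand-in for grouped joint optimization).
def select_heads(ratios, iters=300):
    choice = {l: set(random.sample(range(NUM_HEADS), int(r * NUM_HEADS)))
              for l, r in ratios.items()}
    best = score(choice)
    for _ in range(iters):
        l = random.randrange(NUM_LAYERS)
        swap_in = random.choice(sorted(set(range(NUM_HEADS)) - choice[l]))
        swap_out = random.choice(sorted(choice[l]))
        choice[l] = choice[l] - {swap_out} | {swap_in}
        trial = score(choice)
        if trial >= best:
            best = trial
        else:  # revert a worsening swap
            choice[l] = choice[l] - {swap_in} | {swap_out}
    return choice, best
```

Chaining `probe_layers() → allocate() → select_heads()` mirrors the paper's pipeline: each stage shrinks the search space for the next, so the final binary search runs per group rather than over all N head decisions at once.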

Key Experimental Results

Method                   ρ=0.25   ρ=0.5   ρ=0.75   ρ=0.875
BOSCH (8B)                 98.9    90.3     72.7      42.5
Fisher (best baseline)     94.2    89.3     63.4      29.0

Highlights & Insights

  • The large-neighborhood-search decomposition breaks an N-dimensional binary optimization into three tractable subproblems
  • "Entanglement problem" discovery: the optimal head set differs significantly across SWA ratios, showing that static ranking methods are insufficient
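The entanglement problem can be made concrete with a toy example (all costs invented for illustration): when two heads are redundant backups for each other, converting either alone is cheap but converting both is catastrophic, so a static per-head ranking picks the wrong pair.

```python
from itertools import combinations

# Toy scorer over 3 heads. A and B are redundant copies: converting
# either alone costs little, but converting BOTH triggers a large
# interaction penalty. C carries its own moderate standalone cost.
COST = {"A": 1.0, "B": 1.0, "C": 3.0}

def quality(swa):
    c = sum(COST[h] for h in swa)
    if "A" in swa and "B" in swa:
        c += 10.0  # interaction: A and B back each other up
    return 100.0 - c

# Static ranking by standalone cost picks the two "cheapest" heads...
static_pick = set(sorted(COST, key=COST.get)[:2])   # {'A', 'B'} -> quality 88.0
# ...while brute-force joint search avoids converting the redundant pair.
best_pick = max((set(s) for s in combinations("ABC", 2)), key=quality)
```

Here `best_pick` is `{'A', 'C'}` with quality 96.0, beating the static pick's 88.0; and since the best single conversion is A (or B) alone, the optimal sets at different ratios are not nested, which is exactly why a fixed head ranking cannot serve all SWA ratios.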

Rating

  • Novelty: ⭐⭐⭐⭐
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐
  • Writing Quality: ⭐⭐⭐⭐
  • Value: ⭐⭐⭐⭐