SpecBound: Adaptive Bounded Self-Speculation with Layer-wise Confidence Calibration
Conference: ACL 2026 | arXiv: 2604.12247 | Code: GitHub | Area: LLM Efficiency
Keywords: Speculative Decoding, Self-Draft, Early Exit, Confidence Calibration, Inference Acceleration
TL;DR
SpecBound suppresses false high-confidence predictions in shallow layers via layer-wise temperature annealing, and its bounded speculation algorithm adaptively controls draft depth and width, achieving up to 2.33× inference speedup while keeping outputs lossless.
Method
Key Designs
- Annealed Confidence Threshold (ACT): temperature \(T_\ell = 1 + \alpha(1 - \ell/L)\). Shallow layers get a higher temperature, flattening the softmax and suppressing false high confidence; deep layers approach \(T = 1\). Overhead is negligible: a single scalar multiplication per layer.
- Bounded Speculation with Cached States (BSCS): caps draft depth at \(d_{\max}\) and draft width at \(w_{\max}\). Any token that reaches \(d_{\max}\) without exiting immediately interrupts speculation and triggers parallel verification.
- Hidden State Cache Manager: caches hidden states for tokens that exited at different layer depths, enabling parallel verification through the remaining deep layers.
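A minimal sketch of the ACT schedule under stated assumptions: the \(\alpha\) value and the use of the max softmax probability as the early-exit score are illustrative choices, not the paper's exact settings.

```python
import numpy as np

def annealed_temperature(layer: int, num_layers: int, alpha: float = 2.0) -> float:
    """ACT schedule T_l = 1 + alpha * (1 - l / L): shallow layers get a
    higher temperature (flatter softmax), deep layers approach T = 1."""
    return 1.0 + alpha * (1.0 - layer / num_layers)

def exit_confidence(logits: np.ndarray, layer: int, num_layers: int,
                    alpha: float = 2.0) -> float:
    """Max softmax probability after temperature annealing; an early exit
    would be taken when this value clears a fixed threshold."""
    z = logits / annealed_temperature(layer, num_layers, alpha)
    z = z - z.max()                      # numerical stability
    p = np.exp(z) / np.exp(z).sum()
    return float(p.max())
```

The same logits yield a lower confidence score at shallow layers than at deep ones, so a fixed exit threshold automatically becomes harder to clear early in the network.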
Key Experimental Results
| Model | Method | Avg CR | Overall Speedup |
|---|---|---|---|
| Vicuna-7B | Medusa | - | 1.71× |
| Vicuna-7B | SpecBound | 3.78+ | 2.15× |
| CodeLlama-13B | SpecBound | 3.49+ | 2.33× |
Highlights & Insights
- Temperature annealing is elegantly simple: a linear schedule effectively calibrates shallow-layer confidence at near-zero computational cost
- Bounded speculation embodies a "better to guess less than guess wrong" philosophy: stopping at difficult tokens is more efficient than forcing the draft through them
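The "guess less" idea can be illustrated as a control loop. This is a hypothetical sketch, not the paper's implementation: `draft_step`, `tau`, and the stop condition are assumptions that mirror the described \(d_{\max}\)/\(w_{\max}\) bounds.

```python
from typing import Callable, List, Tuple

def bounded_draft(draft_step: Callable[[], Tuple[str, float, int]],
                  w_max: int = 8, d_max: int = 6, tau: float = 0.8) -> List[str]:
    """Bounded-speculation control loop: draft at most w_max tokens.
    draft_step returns (token, annealed confidence, exit depth); a token
    whose confidence never clears tau by layer d_max is treated as too
    hard, so drafting stops and the prefix goes to parallel verification."""
    drafted = []
    for _ in range(w_max):
        token, conf, exit_depth = draft_step()   # one self-drafted token
        if conf < tau and exit_depth >= d_max:
            break                # guess less rather than guess wrong
        drafted.append(token)
    return drafted
```

The width bound caps wasted draft work even on easy text, while the depth bound converts a hard token from a likely misprediction into an early hand-off to the verifier.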
Rating
- Novelty: ⭐⭐⭐⭐
- Experimental Thoroughness: ⭐⭐⭐⭐
- Writing Quality: ⭐⭐⭐⭐
- Value: ⭐⭐⭐⭐