SpecBound: Adaptive Bounded Self-Speculation with Layer-wise Confidence Calibration

Conference: ACL 2026
arXiv: 2604.12247
Code: GitHub
Area: LLM Efficiency
Keywords: Speculative Decoding, Self-Draft, Early Exit, Confidence Calibration, Inference Acceleration

TL;DR

SpecBound suppresses falsely high-confidence predictions from shallow layers via layer-wise temperature annealing, and uses a bounded speculation algorithm that adaptively controls draft depth and width. It achieves up to 2.33× inference speedup while keeping the output lossless.

Method

Key Designs

Annealed Confidence Threshold (ACT): Each layer \(\ell\) of an \(L\)-layer model uses the temperature \(T_\ell = 1 + \alpha(1 - \ell/L)\), where \(\alpha\) sets the annealing strength. Shallow layers get a higher temperature, which flattens their softmax and lowers their confidence, while the temperature approaches 1 at the deepest layers. The overhead is negligible: only a scalar scaling of the logits.
Bounded Speculation with Cached States (BSCS): Drafting is bounded by a maximum depth \(d_{\max}\) and a maximum width \(w_{\max}\). Any token that reaches \(d_{\max}\) without exiting immediately interrupts speculation and triggers parallel verification.
Hidden State Cache Manager: Tracks the differing layer depths reached by individual tokens, so that all of them can be verified in parallel through the remaining deep layers.
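The ACT schedule can be illustrated with a minimal Python sketch. It assumes the temperature divides the early-exit logits before the softmax and that a layer exits when its top probability reaches a threshold; the function names, the value of `alpha`, and the threshold are hypothetical choices for illustration, not values taken from the paper.

```python
import math


def layer_temperature(layer, num_layers, alpha):
    """T_l = 1 + alpha * (1 - l / L); equals 1 at the final layer."""
    return 1.0 + alpha * (1.0 - layer / num_layers)


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def calibrated_confidence(logits, layer, num_layers, alpha):
    """Top-1 probability after layer-dependent temperature scaling."""
    temperature = layer_temperature(layer, num_layers, alpha)
    return max(softmax(logits, temperature))


def should_exit(logits, layer, num_layers, alpha, threshold):
    """Assumed exit rule: calibrated confidence must reach the threshold."""
    return calibrated_confidence(logits, layer, num_layers, alpha) >= threshold


# Same logits, evaluated at a shallow and at the final layer (illustrative values).
logits = [4.0, 1.0, 0.0]
shallow = calibrated_confidence(logits, layer=4, num_layers=32, alpha=1.0)
deep = calibrated_confidence(logits, layer=32, num_layers=32, alpha=1.0)
```

With these illustrative numbers, the same logits yield a confidence of about 0.76 at layer 4 (temperature 1.875) and about 0.94 at layer 32 (temperature 1), so a threshold of 0.9 would block the shallow exit while allowing the deep one.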

Key Experimental Results

Model	Method	Avg CR	Overall Speedup
Vicuna-7B	Medusa	-	1.71×
Vicuna-7B	SpecBound	3.78+	2.15×
CodeLlama-13B	SpecBound	3.49+	2.33×

Highlights & Insights

Temperature annealing is elegantly simple: a linear schedule effectively calibrates shallow-layer confidence at near-zero computational cost.
Bounded speculation philosophy: "better to guess less than guess wrong". Stopping at difficult tokens is more efficient than forcing through them.
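The stopping rule behind bounded speculation can be sketched as a toy simulation. This is not the paper's implementation: it assumes depth is counted in layers and width in draft tokens, it takes a precomputed list of the layer at which each draft token would exit (`None` if it never becomes confident), and it assumes the interrupting token is cached at depth `d_max`. The names `bounded_draft` and `remaining_layers` are hypothetical.

```python
def bounded_draft(exit_layers, d_max, w_max):
    """Simulate one drafting round.

    Returns (cached_depths, stop_reason), where cached_depths[i] is the
    layer depth at which draft token i's hidden state is cached.
    """
    cached_depths = []
    for exit_layer in exit_layers:
        if len(cached_depths) >= w_max:
            # Width bound: the draft window is full.
            return cached_depths, "w_max"
        if exit_layer is None or exit_layer > d_max:
            # Depth bound: this token did not exit by d_max, so drafting
            # stops here and verification starts (assumed caching at d_max).
            cached_depths.append(d_max)
            return cached_depths, "d_max"
        cached_depths.append(exit_layer)
    return cached_depths, "exhausted"


def remaining_layers(cached_depths, num_layers):
    """Layers each cached token still needs during parallel verification."""
    return [num_layers - depth for depth in cached_depths]


# A hard third token interrupts drafting before the fourth is attempted.
depths, reason = bounded_draft([4, 6, None, 5], d_max=8, w_max=4)
todo = remaining_layers(depths, num_layers=32)
```

Here drafting stops at the third token, giving cached depths `[4, 6, 8]` and remaining work `[28, 26, 24]`. The uneven depths are the situation the Hidden State Cache Manager is described as handling.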

Rating

Novelty: ⭐⭐⭐⭐
Experimental Thoroughness: ⭐⭐⭐⭐
Writing Quality: ⭐⭐⭐⭐
Value: ⭐⭐⭐⭐