SpecBound: Adaptive Bounded Self-Speculation with Layer-wise Confidence Calibration

Conference: ACL 2026 · arXiv: 2604.12247 · Code: GitHub · Area: LLM Efficiency · Keywords: Speculative Decoding, Self-Draft, Early Exit, Confidence Calibration, Inference Acceleration

TL;DR

SpecBound suppresses spurious high-confidence predictions in shallow layers via layer-wise temperature annealing, and uses a bounded speculation algorithm to adaptively control draft depth and width, achieving up to 2.33× inference speedup while keeping the output lossless.

Method

Key Designs

  1. Annealed Confidence Threshold (ACT): Temperature \(T_\ell = 1 + \alpha(1 - \ell/L)\), where \(\ell\) is the layer index and \(L\) the total depth. Shallow layers get a higher temperature that flattens the softmax and suppresses spuriously confident early exits; deep layers approach \(T = 1\). The overhead is essentially zero: a single scalar multiplication per exit check (see the sketch after this list).

  2. Bounded Speculation with Cached States (BSCS): Drafting is capped by a maximum depth \(d_{\max}\) and a maximum width \(w_{\max}\). Any token that reaches \(d_{\max}\) without exiting immediately interrupts speculation and triggers parallel verification.

  3. Hidden State Cache Manager: Tracks each drafted token's exit depth, which varies across tokens, and caches its hidden state so that verification can resume in parallel through the remaining deep layers without recomputing the shallow ones.
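The three designs compose into a single draft loop. Below is a minimal Python sketch of how they might fit together, assuming a shared early-exit head, a max-probability confidence rule against a threshold `tau`, and illustrative hyperparameter values. The names `exit_head`, `embed`, and `tau` and all default values are my assumptions for illustration, not the paper's implementation, and attention/KV-cache plumbing is omitted.

```python
import torch
import torch.nn.functional as F

def annealed_temperature(l: int, L: int, alpha: float = 0.5) -> float:
    """ACT schedule: T_l = 1 + alpha * (1 - l / L).
    Shallow layers (small l) get T close to 1 + alpha, which flattens the
    softmax and suppresses spurious high confidence; deep layers approach 1."""
    return 1.0 + alpha * (1.0 - l / L)

@torch.no_grad()
def draft_one_token(hidden, layers, exit_head, tau=0.8, alpha=0.5, d_max=16):
    """Run layers until the ACT-calibrated confidence clears tau, or until
    the depth bound d_max is hit. Returns (token | None, exit_depth, hidden)."""
    L = len(layers)
    for l, layer in enumerate(layers[:d_max], start=1):
        hidden = layer(hidden)
        probs = F.softmax(exit_head(hidden) / annealed_temperature(l, L, alpha), dim=-1)
        conf, token = probs.max(dim=-1)
        if conf.item() >= tau:             # calibrated confidence is high: exit early
            return token, l, hidden
    return None, d_max, hidden             # reached d_max without exiting: stop drafting

@torch.no_grad()
def bounded_speculate(last_token, embed, layers, exit_head, w_max=8, **kw):
    """Draft at most w_max tokens (BSCS). Stop the first time a token fails
    to exit before d_max. The cache records each token's exit depth and
    hidden state so the verifier can resume through the remaining deep
    layers of all drafted tokens in one parallel pass."""
    drafts, cache = [], []
    token = last_token
    for _ in range(w_max):
        hidden = embed(token)              # each draft step starts from the new token's embedding
        token, depth, hidden = draft_one_token(hidden, layers, exit_head, **kw)
        if token is None:                  # "better to guess less than guess wrong"
            break
        drafts.append(token)
        cache.append((depth, hidden))      # handed to the Hidden State Cache Manager
    return drafts, cache
```

Intuitively, the two bounds play complementary roles: \(d_{\max}\) caps the per-token draft cost on hard tokens, while \(w_{\max}\) caps how much drafted work is discarded when the verifier rejects.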

Key Experimental Results

Model           Method     Avg. CR   Overall Speedup
Vicuna-7B       Medusa     -         1.71×
Vicuna-7B       SpecBound  3.78+     2.15×
CodeLlama-13B   SpecBound  3.49+     2.33×

Highlights & Insights

  • Temperature annealing is elegantly simple: a linear schedule effectively calibrates shallow-layer confidence at near-zero computational cost
  • Bounded speculation philosophy: "better to guess less than guess wrong"; stopping at difficult tokens is more efficient than forcing extra draft steps that are likely to be rejected

Rating

  • Novelty: ⭐⭐⭐⭐
  • Experimental Thoroughness: ⭐⭐⭐⭐
  • Writing Quality: ⭐⭐⭐⭐
  • Value: ⭐⭐⭐⭐