CAS-Spec: Cascade Adaptive Self-Speculative Decoding for On-the-Fly Lossless Inference Acceleration of LLMs
Conference: NeurIPS 2025 | arXiv: 2510.26843 | Code: Submitted (open-source) | Area: Model Compression | Keywords: speculative decoding, self-speculative, cascade, layer sparsity, training-free, DyTC
TL;DR
CAS-Spec constructs a multi-level draft model hierarchy from the target model itself via Dynamically Switchable Inference Acceleration (DSIA) strategies (e.g., layer sparsity at varying degrees), and employs the Dynamic Tree Cascade (DyTC) algorithm to adaptively route among draft models and allocate draft lengths based on online acceptance rates and latency predictions. The approach achieves lossless inference acceleration of 1.1×–2.3× in a fully training-free manner, with DyTC yielding gains of 47% and 48% over cascade and tree baselines, respectively.
Background & Motivation
Background: Speculative decoding accelerates LLM inference via the draft-then-verify paradigm. Self-speculative decoding methods (e.g., SWIFT with layer skipping) require no additional draft model training but offer limited speedup. Cascade speculative decoding (e.g., CS-Drafting) achieves greater acceleration through multi-level draft hierarchies but requires training multiple models.
Limitations of Prior Work: (1) Training-free self-speculative methods (e.g., SWIFT, Lookahead) achieve speedups that trail even simple Prompt Lookup Decoding (PLD); (2) cascade methods must maintain multiple trained draft models, which is impractical in deployment; (3) even a naïve cascade of self-speculative methods with PLD fails to help: theoretical analysis shows that most SWIFT acceptance-rate/cost trade-off points fall outside the effective cascade boundary, so naïve cascading cannot guarantee acceleration.
Key Challenge: Self-speculative drafts have acceptance rates that are too low and cost coefficients that are too high, so naïve vertical or horizontal cascades can end up slower than using PLD directly. A smarter scheduling algorithm is needed to realize the potential of multi-level drafting.
Goal: How can effective cascade speculative decoding be constructed without training any draft model, while designing adaptive scheduling that genuinely outperforms single-level methods?
Key Insight: (1) Define the DSIA strategy framework: create multiple virtual draft models from a single acceleration strategy applied at different parameter settings (e.g., layer skipping at different sparsity rates); (2) design the DyTC algorithm, inspired by A* search, to optimize not only the local speedup of the current step but also a lower bound on the speedup of subsequent steps (the expected walltime improvement factor, EWIF, of the fastest bottom draft).
Core Idea: Reinterpret different compression levels of self-speculative decoding as multi-level draft models within a cascade, and employ dynamic tree search based on EMA-tracked acceptance rates and latency prediction to adaptively schedule the entire hierarchy.
Method
Overall Architecture
CAS-Spec = DSIA (draft hierarchy construction) + DyTC (dynamic scheduling) + PLD (bottom draft). During inference, the target model generates draft tokens at varying quality–speed trade-offs via different layer sparsity configurations; DyTC selects the optimal draft model, draft length, and tree routing based on online statistics.
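Below is a toy, runnable sketch of this loop under greedy decoding. The stand-in model, the noise-based "virtual drafts", and all helper names are illustrative assumptions rather than the paper's implementation; DyTC's adaptive choice is stubbed with a comment.

```python
# Toy sketch of the CAS-Spec decode loop (pure Python, greedy decoding).
# target_next stands in for the full target model; the "virtual drafts" are
# cheaper, noisier approximations of it, mimicking layer-sparsity levels.
import random

def target_next(ctx):
    # Deterministic stand-in for the target model's greedy next token.
    return (ctx[-1] * 31 + 7) % 100

def make_draft(noise):
    # Higher noise plays the role of higher layer sparsity: cheaper to run
    # (in a real model) but less faithful to the target distribution.
    def draft_next(ctx):
        tok = target_next(ctx)
        return tok if random.random() > noise else (tok + 1) % 100
    return draft_next

drafts = [make_draft(0.1), make_draft(0.3), make_draft(0.6)]

def verify(ctx, proposal):
    # Accept the longest prefix the target model agrees with (lossless).
    accepted = []
    for tok in proposal:
        if tok != target_next(ctx + accepted):
            break
        accepted.append(tok)
    return accepted

random.seed(0)
ctx, produced = [1], 0
while produced < 32:
    draft = drafts[0]            # DyTC would pick the level and length here
    proposal = []
    for _ in range(4):           # draft k=4 tokens autoregressively
        proposal.append(draft(ctx + proposal))
    accepted = verify(ctx, proposal)
    bonus = target_next(ctx + accepted)  # free token from the verify pass
    ctx += accepted + [bonus]
    produced += len(accepted) + 1
print(produced, ctx[-8:])
```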
Key Designs
- Dynamically Switchable Inference Acceleration (DSIA) (see the first sketch after this list):
- Function: Constructs a multi-level virtual draft model hierarchy from the target model itself.
- Mechanism: Defines DSIA strategies as inference-time acceleration techniques that can be dynamically toggled (layer sparsity, early exiting, activation quantization, etc.). Different parameter configurations yield different virtual draft models: \(\mathcal{M}_{d1}\) (0.4 layer sparsity, high quality, slow), \(\mathcal{M}_{d2}\) (0.6 layer sparsity, lower quality, fast), \(\mathcal{M}_{dn}\) (PLD, near-zero cost).
- Design Motivation: Scaling-DSIA cascades (same strategy, different parameters) and Mixing-DSIA cascades (different strategies combined) enable flexible construction of arbitrary hierarchies. The approach is fully training-free, as it only modifies the inference path.
- Dynamic Tree Cascade (DyTC) Algorithm (see the second sketch after this list):
- Function: Adaptively selects draft models and draft lengths to construct the optimal draft tree.
- Mechanism: Maintains an EMA estimate of the acceptance rate for each draft configuration: \(\hat{\alpha}_{new} = \lambda \cdot \hat{\alpha}_{prev} + (1-\lambda) \cdot \hat{\alpha}_{recent}\) (\(\lambda=0.7\), window \(H=20\)). At each step it maximizes the A*-inspired objective \(\mathcal{T}_s = \left( \frac{\hat{\alpha}(1-\hat{\alpha}^{k_s})}{1-\hat{\alpha}} + \hat{\alpha}^{k_s} \cdot \hat{\alpha}_{dn} \right) / (\hat{c} \cdot k_s + \hat{c}_{dn})\): the expected number of tokens accepted from a draft of length \(k_s\), plus the bottom draft's expected contribution as an admissible heuristic for subsequent steps, divided by the predicted total cost.
- Design Motivation: Greedy selection does not satisfy the greedy choice property (local optimum ≠ global optimum). The A*-style heuristic term ensures that the minimum benefit of subsequent steps is considered.
- Implementation: For already-generated but unverified tokens, logit/n-gram match lengths serve as token-level acceptance rate estimates; hardware latency is predicted via a Bayesian linear regression roofline model.
- Tree-based Parallel Draft Generation (see the third sketch after this list):
- Function: Generates multiple draft paths in parallel.
- Mechanism: After selecting the best leaf node, draft tokens are generated simultaneously for its top-p sibling nodes (a modestly longer input barely affects latency in memory-bound decoding).
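The first sketch illustrates Scaling-DSIA: the same weights serve as several virtual drafts by toggling layer sparsity at inference time. The uniform skip-every-k-layers policy and `TinyModel` are illustrative assumptions (SWIFT-style methods instead search for a good per-layer mask).

```python
# Scaling-DSIA sketch: one inference-time knob (layer sparsity) yields
# several "virtual" draft models from a single set of weights.

class TinyModel:
    # Illustrative stand-in for a stack of transformer blocks.
    def __init__(self, n_layers=32):
        self.n_layers = n_layers

    def forward(self, x, skip_mask=None):
        for i in range(self.n_layers):
            if skip_mask is not None and skip_mask[i]:
                continue          # skipped block: residual pass-through
            x = x + 1             # stand-in for the block's computation
        return x

def sparsity_mask(n_layers, sparsity):
    """Naive uniform policy: skip roughly every k-th layer until the
    requested fraction of layers is bypassed."""
    n_skip = int(n_layers * sparsity)
    stride = max(1, n_layers // max(1, n_skip))
    mask, skipped = [False] * n_layers, 0
    for i in range(0, n_layers, stride):
        if skipped == n_skip:
            break
        mask[i], skipped = True, skipped + 1
    return mask

model = TinyModel()
m_d1 = lambda x: model.forward(x, sparsity_mask(32, 0.4))  # higher quality, slower
m_d2 = lambda x: model.forward(x, sparsity_mask(32, 0.6))  # lower quality, faster
print(m_d1(0), m_d2(0))  # fewer effective layers => cheaper virtual draft
```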
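The second sketch follows DyTC's online bookkeeping as summarized above: EMA-tracked acceptance rates (\(\lambda=0.7\), window \(H=20\)) and joint selection of draft configuration and length by the A*-style objective. The exact grouping of the paper's objective, the config names, and the cost numbers here are assumptions for illustration.

```python
# DyTC bookkeeping sketch: EMA acceptance tracking plus joint selection of
# draft config and draft length by estimated tokens-per-cost.
from collections import deque

class AcceptanceTracker:
    """EMA acceptance-rate estimates per draft configuration."""
    def __init__(self, lam=0.7, window=20):
        self.lam, self.window = lam, window
        self.alpha_hat = {}          # config -> EMA acceptance rate
        self.recent = {}             # config -> sliding window of outcomes

    def update(self, config, n_accepted, n_drafted):
        w = self.recent.setdefault(config, deque(maxlen=self.window))
        w.append(n_accepted / max(1, n_drafted))
        recent = sum(w) / len(w)
        prev = self.alpha_hat.get(config, recent)
        self.alpha_hat[config] = self.lam * prev + (1 - self.lam) * recent

def dytc_objective(alpha, k, c, alpha_dn, c_dn):
    """Estimated tokens-per-cost for a draft of length k with acceptance
    rate alpha and cost coefficient c; the bottom draft (alpha_dn, c_dn)
    supplies the admissible heuristic for subsequent steps."""
    alpha = min(alpha, 0.99)                       # guard the geometric sum
    expected = alpha * (1 - alpha**k) / (1 - alpha)
    heuristic = alpha**k * alpha_dn
    return (expected + heuristic) / (c * k + c_dn)

def cfg_cost(cfg):
    # Illustrative cost coefficients: sparser drafts are cheaper per token.
    return {"s0.4": 0.35, "s0.6": 0.20}.get(cfg, 0.10)

def dytc_select(configs, tracker, alpha_dn=0.5, c_dn=0.01, max_k=8):
    # Jointly choose the draft config and draft length maximizing T_s.
    return max(
        ((cfg, k) for cfg in configs for k in range(1, max_k + 1)),
        key=lambda p: dytc_objective(
            tracker.alpha_hat.get(p[0], 0.5), p[1],
            cfg_cost(p[0]), alpha_dn, c_dn),
    )

t = AcceptanceTracker()
t.update("s0.4", n_accepted=3, n_drafted=4)   # alpha_hat ~ 0.75
t.update("s0.6", n_accepted=2, n_drafted=4)   # alpha_hat ~ 0.50
print(dytc_select(["s0.4", "s0.6"], t))
```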
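The third sketch shows the parallel tree expansion step; `top_p_siblings` and the example probabilities are hypothetical.

```python
# After choosing the best leaf, its high-probability siblings are drafted in
# the same forward pass: in memory-bound decoding, a modestly longer batch
# adds little latency, so the extra paths are nearly free.

def top_p_siblings(children, p=0.9):
    """children: (token, prob) pairs under one node. Returns the smallest
    highest-probability prefix whose cumulative mass reaches p."""
    kept, mass = [], 0.0
    for tok, prob in sorted(children, key=lambda c: -c[1]):
        kept.append(tok)
        mass += prob
        if mass >= p:
            break
    return kept

children = [("the", 0.55), ("a", 0.30), ("its", 0.10), ("an", 0.05)]
print(top_p_siblings(children))  # ['the', 'a', 'its'], drafted in one batch
```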
Loss & Training
Fully training-free. All DSIA strategies (layer sparsity + PLD) require no training. DyTC's online estimation requires a brief warm-up period (~20 steps).
Key Experimental Results
Main Results (Spec-Bench, H100 GPU)
| Model | Method | Training-Free | Overall Speedup |
|---|---|---|---|
| Vicuna-7B | Lade | ✓ | 1.274× |
| | PLD | ✓ | 1.539× |
| | SWIFT | ✓ | 1.064× |
| | CAS-Spec | ✓ | 1.578× |
| | Kangaroo | ✗ | 1.534× |
| | CAS-Spec† (w/ Kangaroo) | ✗ | 1.696× |
| Vicuna-13B | SWIFT | ✓ | 1.119× |
| | CAS-Spec | ✓ | 1.524× |
| | CAS-Spec† | ✗ | 1.673× |
| Vicuna-33B | SWIFT | ✓ | 1.206× |
| | CAS-Spec | ✓ | 1.481× |
The training-free CAS-Spec surpasses Kangaroo, which requires training. Incorporating Kangaroo as a DSIA component yields further improvement.
Ablation Study (Vicuna-7B)
| Scheduling Algorithm | Avg. Speedup | vs CS-Drafting | vs SWIFT Tree |
|---|---|---|---|
| VC+HC (CS-Drafting vertical + horizontal cascade) | ~1.07× | — | — |
| Tree (SWIFT) | ~1.07× | — | — |
| DyTC | ~1.58× | +47% | +48% |
The adaptive routing of DyTC is the primary source of performance gains.
Key Findings
- SWIFT alone cannot form an effective cascade: Theoretical analysis (Figure 1b,c) shows that most SWIFT data points fall outside the effective cascade boundary in the acceptance rate/cost coefficient plane.
- DyTC's A*-style heuristic greatly outperforms greedy: The objective function accounting for minimum future-step gains avoids locally optimal but globally suboptimal scheduling decisions.
- Online acceptance rate estimation is effective: The EMA mechanism adapts to dynamic difficulty changes during generation (e.g., large acceptance rate differences between translation and summarization tasks).
- Training-free CAS-Spec already surpasses trained Kangaroo: This demonstrates that system-level optimization via cascade and intelligent scheduling can compensate for lower single-model draft quality.
- Incompatibility with EAGLE3: The EAGLE series depends on hidden states from the target model; DSIA-modified hidden states are mismatched, which is an important limitation.
Highlights & Insights
- "Virtual draft model" abstraction: Treating different acceleration configurations of the same model as distinct draft models is an elegant abstraction that reduces the multi-model requirement of cascade decoding to single-model multi-configuration switching.
- First application of A* heuristics in speculative decoding: Using the EWIF of the bottom draft as an admissible heuristic is a clever design—it guarantees an underestimate of future-step gains (since actual draft models are at least as good as PLD), thereby preventing excessive greediness.
- Practical utility of the theoretical effective boundary: The theoretical boundary in Figure 1b,c can serve as a prescreening tool to determine whether any self-speculative method is suitable for cascading.
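As a rough illustration of that prescreening idea, the check below uses the standard single-level EWIF formula from the speculative decoding literature (Leviathan et al.), not the paper's exact cascade boundary, so treat it as an approximation; the sample (alpha, c) values are made up.

```python
# Prescreen sketch: given a method's measured acceptance rate alpha and cost
# coefficient c (draft cost / target-model cost), check whether any draft
# length k yields a walltime improvement at all.

def ewif(alpha, c, k):
    # Expected tokens per verify cycle divided by the cycle's cost
    # (k draft forwards plus one target forward, target cost = 1).
    return (1 - alpha**(k + 1)) / ((1 - alpha) * (c * k + 1))

def worth_cascading(alpha, c, max_k=16):
    return any(ewif(alpha, c, k) > 1.0 for k in range(1, max_k + 1))

print(worth_cascading(alpha=0.8, c=0.3))  # True: strong, cheap draft
print(worth_cascading(alpha=0.5, c=0.6))  # False: weak, costly draft
```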
Limitations & Future Work
- Incompatible with state-of-the-art methods such as EAGLE3 that rely on hidden states.
- Online estimation in DyTC introduces slight overhead; the advantage of dynamic scheduling diminishes at large batch sizes.
- Current experiments are primarily conducted on the Vicuna series; evaluation on newer models such as Llama3 and Qwen2.5 is lacking.
- DSIA strategies currently rely mainly on layer sparsity; activation quantization and activation sparsity remain underexplored due to hardware constraints.
Related Work & Insights
- vs SWIFT: SWIFT is one of the DSIA components in CAS-Spec. Used in isolation, SWIFT achieves only 1.06× speedup, whereas CAS-Spec's cascade + DyTC brings this to 1.58×.
- vs CS-Drafting: CS-Drafting requires multiple trained draft models (e.g., the FLAN-T5 series). CAS-Spec eliminates this requirement via DSIA.
- vs ChunkKV (2502.00299): ChunkKV optimizes KV cache memory, while CAS-Spec reduces serial inference dependencies. The two approaches are orthogonal—KV cache compression can be incorporated as a DSIA strategy within CAS-Spec.
- vs SambaY (Decoder-Hybrid-Decoder): SambaY reduces decoding memory I/O through architectural changes, while CAS-Spec reduces serial steps via speculative decoding without architectural modification. The two are complementary.
Rating
- Novelty: ⭐⭐⭐⭐ The DSIA framework and A*-inspired DyTC are meaningful contributions, though core components (layer sparsity, PLD) are drawn from existing work.
- Experimental Thoroughness: ⭐⭐⭐⭐ Comprehensive Spec-Bench evaluation with clear DyTC ablations, though experiments on newer models and broader DSIA strategies are lacking.
- Writing Quality: ⭐⭐⭐⭐ The theoretical analysis (effective boundary in Figure 1) is intuitive, and the DyTC algorithm is described clearly.
- Value: ⭐⭐⭐⭐ The combination of training-free, plug-and-play deployment, and performance exceeding trained baselines gives the method high practical value.