CLaSp: In-Context Layer Skip for Self-Speculative Decoding

Conference: ACL 2025
arXiv: 2505.24196
Code: None
Area: Model Compression / LLM Efficiency
Keywords: speculative decoding, layer skipping, self-speculative, dynamic programming, lossless acceleration

TL;DR

CLaSp is a training-free self-speculative decoding method that re-selects which layers to skip after every verification step, using a dynamic programming algorithm driven by the current context. It takes the full-model hidden states from the previous verification step as the target, picks the set of skipped layers that best approximates them, and achieves a \(1.3\times\) to \(1.7\times\) speedup on the LLaMA3 series without altering the generation distribution.

Background & Motivation

Background: Speculative decoding (SD) is a primary method for lossless LLM acceleration: a fast draft model proposes several tokens, and the full model verifies them in parallel. Self-speculative decoding (Self-SD) builds the draft model by skipping some layers of the full model itself, which avoids the compatibility issues of pairing it with a separate model.

Limitations of Prior Work: (1) Traditional SD requires finding or training a draft model, which is infeasible for specialized LLMs; (2) Self-SD searches for a fixed set of skipped layers using Bayesian optimization on a training set, which is time-consuming and generalizes poorly, since different tasks or contexts call for different skipping strategies; (3) SWIFT optimizes the skipped set progressively, but it relies on accumulating enough user requests, which leads to a slow cold start.

Key Challenge: The layer skipping strategy should adapt to changes in context (dynamic), but existing methods either use a statically fixed set of skipped layers or rely on offline optimization.

Goal: Design a plug-and-play dynamic layer skipping strategy that is training-free, requires no pre-optimization, and adjusts in real-time according to the context.

Key Insight: Utilize the full hidden states of the previous verification step as the "ground truth" target and apply dynamic programming to select the optimal layer-skipping combination in real-time.

Core Idea: Use the full hidden states from the previous verification as a reference, and select the layer-skipping set that minimizes the approximation error using dynamic programming before each speculation step.

Method

Overall Architecture

Verification phase (full \(L\) layers) \(\rightarrow\) Save the hidden state of each layer, \(X = \{x_0, ..., x_{L-1}\}\) \(\rightarrow\) Before the speculation phase, run the DP algorithm to select the set \(\mathcal{S}\) of \(M\) layers to skip \(\rightarrow\) Generate draft tokens with the layer-skipped sub-model \(\rightarrow\) Accept or reject the draft tokens in the next verification phase \(\rightarrow\) Update the saved hidden states \(\rightarrow\) Repeat.
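The loop above can be sketched as follows for the greedy-decoding case. This is a minimal illustration rather than the authors' implementation: `self_speculative_generate`, `make_draft`, and the toy next-token functions are hypothetical names, the DP-based layer selection is abstracted into `make_draft`, and verification is written as a Python loop although in practice it is a single batched forward pass.

```python
from typing import Callable, List, Sequence

Token = int
NextToken = Callable[[Sequence[Token]], Token]


def self_speculative_generate(
    full_next: NextToken,
    make_draft: Callable[[Sequence[Token]], NextToken],
    prompt: Sequence[Token],
    max_new_tokens: int,
    draft_len: int = 4,
) -> List[Token]:
    """Greedy draft-then-verify loop; output matches plain greedy decoding."""
    tokens = list(prompt)
    target_len = len(tokens) + max_new_tokens
    while len(tokens) < target_len:
        # Re-select the draft model from the current context
        # (hypothetical stand-in for the DP layer-selection step).
        draft_next = make_draft(tokens)

        # Speculation: the cheap draft proposes draft_len tokens one by one.
        draft: List[Token] = []
        for _ in range(draft_len):
            draft.append(draft_next(tokens + draft))

        # Verification: compare each draft token with the full model's choice.
        accepted: List[Token] = []
        for tok in draft:
            expected = full_next(tokens + accepted)
            accepted.append(expected)  # the verified token is always kept
            if expected != tok:        # first mismatch: drop the rest of the draft
                break
        else:
            # Every draft token matched, so verification yields one extra token.
            accepted.append(full_next(tokens + accepted))
        tokens.extend(accepted)
    return tokens[:target_len]


# Toy stand-ins (assumptions for illustration only).
def toy_full_next(ctx: Sequence[Token]) -> Token:
    return (ctx[-1] * 3 + 1) % 7


def toy_draft_next(ctx: Sequence[Token]) -> Token:
    # Agrees with the full model except at every third context length.
    return toy_full_next(ctx) if len(ctx) % 3 else 0


generated = self_speculative_generate(
    toy_full_next, lambda ctx: toy_draft_next, [1], max_new_tokens=10
)
```

Because every kept token is the full model's own choice, a poor draft only reduces the number of tokens accepted per step; it never changes the output.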

Key Designs

  1. Dynamic Programming for Selecting Layer-Skipping Set:

    • Function: Given an \(L\)-layer model and a budget of skipping \(M\) layers, find the layer-skipping combination that minimizes the discrepancy between the skipped and full hidden states.
    • Mechanism: Define the state \(g[l, m]\) as the hidden state entering layer \(l\) after \(m\) of the preceding layers have been skipped, with the objective of minimizing \(\|g[L, M] - x_{L-1}\|\). At layer \(l\) there are two transitions: execute (\(g[l+1, m] = f_l(g[l, m])\)) or skip (\(g[l+1, m+1] = g[l, m]\)). When both transitions reach the same state, the DP keeps the candidate closer to the full-model hidden state at that depth.
    • Time Complexity: \(O(L \times M)\) forward passes (where each pass only computes a single layer); leveraging GPU parallelism makes the additional latency almost negligible.
    • Design Motivation: Hidden states change slowly between adjacent layers (the "slowly changing inter-layer embeddings" observation), so the error introduced by skipping a layer stays controllable.
  2. Context Adaptivity:

    • Function: Re-run DP after each verification to update the layer skipping strategy based on the new full hidden states.
    • Design Motivation: Different generation stages (e.g., beginning vs. middle vs. end) may require different critical layers. Static strategies fail to adapt.
    • Implementation: The verification phase inherently requires forward propagation through all layers to obtain hidden states, incurring no additional storage overhead.
  3. GPU Parallel Optimization:

    • Function: Multiple candidate paths in the DP can be computed in parallel.
    • Mechanism: The single-layer forward passes for different DP states at the same layer can be executed as one batch.
    • Effect: The additional latency of DP selection stays well below the speedup gained from layer skipping.
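The DP selection in design 1 can be sketched as below. It is an illustrative reading of the recurrence, not the paper's code: `select_skipped_layers` and the toy layers are hypothetical, hidden states are plain Python lists, Euclidean distance stands in for whatever similarity measure the paper uses, and the rule "keep the candidate closest to the full hidden state at the same depth" is an assumption about how competing candidates for one state are resolved. The sketch runs sequentially, so it does not show the batched execution of design 3.

```python
import math
from typing import Callable, List, Optional, Sequence, Tuple

Vector = List[float]
Layer = Callable[[Vector], Vector]
State = Optional[Tuple[Vector, Tuple[int, ...]]]


def distance(a: Vector, b: Vector) -> float:
    """Euclidean distance between two hidden states."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def select_skipped_layers(
    layers: Sequence[Layer],
    h_in: Vector,
    targets: Sequence[Vector],
    num_skip: int,
) -> Tuple[List[int], float]:
    """Return (skipped layer indices, final distance to the last target).

    targets[l] is the full-model output of layer l from the last verification.
    """
    if not 0 <= num_skip <= len(layers):
        raise ValueError("num_skip must be between 0 and the number of layers")
    # g[m]: best (hidden state, skipped indices) at the current depth
    # with exactly m layers skipped; None if that state is unreachable.
    g: List[State] = [None] * (num_skip + 1)
    g[0] = (h_in, ())
    for l, layer in enumerate(layers):
        nxt: List[State] = [None] * (num_skip + 1)
        for m in range(num_skip + 1):
            candidates = []
            if g[m] is not None:                 # execute layer l
                h, skipped = g[m]
                candidates.append((layer(h), skipped))
            if m > 0 and g[m - 1] is not None:   # skip layer l
                h, skipped = g[m - 1]
                candidates.append((h, skipped + (l,)))
            if candidates:
                # Assumed selection rule: keep the candidate closest to targets[l].
                nxt[m] = min(candidates, key=lambda c: distance(c[0], targets[l]))
        g = nxt
    h_final, skipped = g[num_skip]
    return list(skipped), distance(h_final, targets[-1])


# Toy model: each "layer" adds a constant, so small constants are cheap to skip.
deltas = [1.0, 0.01, 2.0, 0.02]
toy_layers = [lambda h, d=d: [h[0] + d] for d in deltas]

# Full forward pass, as the verification phase would do, to record the targets.
h, toy_targets = [0.0], []
for layer in toy_layers:
    h = layer(h)
    toy_targets.append(h)

skipped, error = select_skipped_layers(toy_layers, [0.0], toy_targets, num_skip=2)
```

Each pair \((l, m)\) triggers at most one single-layer evaluation, which matches the \(O(L \times M)\) count given above. In the toy model the selection skips the two layers that change the hidden state least.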

Comparison with Traditional Self-SD

  • Self-SD: Offline Bayesian optimization \(\rightarrow\) Fixed layer-skipping set \(\rightarrow\) Static during inference.
  • CLaSp: Online DP optimization \(\rightarrow\) Dynamic layer-skipping set \(\rightarrow\) Updated after each verification step.

Key Experimental Results

Main Results: Spec-Bench (LLaMA3 Series)

| Method | LLaMA3-8B Speedup | LLaMA3-70B Speedup | Training / Pre-optimization |
| --- | --- | --- | --- |
| Autoregressive Baseline | \(1.0\times\) | \(1.0\times\) | - |
| Self-SD (Static) | \(\sim 1.2\times\) | \(\sim 1.3\times\) | Requires BO |
| SWIFT | \(\sim 1.3\times\) | \(\sim 1.4\times\) | Dynamic but requires accumulation |
| CLaSp | \(1.3\times\) to \(1.5\times\) | \(1.5\times\) to \(1.7\times\) | No |

Key Findings

  • Lossless Acceleration: The generation distribution of CLaSp is mathematically identical to that of the original model, which is guaranteed by the verification step of speculative decoding.
  • Dynamic Strategy Outperforms Static: The layer-skipping set adapts contextually, consistently outperforming fixed layer skipping across diverse tasks.
  • Negligible DP Overhead: With GPU parallelism, DP selection only introduces \(<5\%\) latency overhead.
  • Plug-and-Play: Training-free, requires no calibration data, and does not modify model parameters.
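The losslessness claim rests on the standard speculative sampling rule: a draft token \(x\) drawn from the draft distribution \(q\) is accepted with probability \(\min(1, p(x)/q(x))\), and on rejection a token is resampled from the normalized residual \(\max(p - q, 0)\). The sketch below computes the resulting output distribution exactly for one position and shows that it equals the target distribution \(p\). The function names and the example distributions are illustrative assumptions, and the summary does not state whether CLaSp's experiments use this sampling rule or greedy matching.

```python
from typing import List


def residual_distribution(p: List[float], q: List[float]) -> List[float]:
    """Normalized max(p - q, 0), used to resample after a rejection."""
    diff = [max(pi - qi, 0.0) for pi, qi in zip(p, q)]
    total = sum(diff)
    # total is zero only when p equals q, in which case nothing is ever rejected.
    return [d / total for d in diff] if total > 0 else list(p)


def output_distribution(p: List[float], q: List[float]) -> List[float]:
    """Exact distribution of the emitted token under accept-or-resample."""
    # Probability that token i is drafted and then accepted.
    accept = [qi * min(1.0, pi / qi) if qi > 0 else 0.0 for pi, qi in zip(p, q)]
    reject_mass = 1.0 - sum(accept)
    resid = residual_distribution(p, q)
    return [a + reject_mass * r for a, r in zip(accept, resid)]


p = [0.5, 0.3, 0.2]  # full-model distribution (made-up numbers)
q = [0.2, 0.6, 0.2]  # layer-skipped draft distribution (made-up numbers)
emitted = output_distribution(p, q)
```

A worse draft lowers the acceptance mass and therefore the speedup, but the emitted distribution stays equal to `p`.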

Highlights & Insights

  • Draft model design as combinatorial optimization: Choosing which layers to skip is framed as a combinatorial selection problem, for which DP is a natural solver.
  • Exploiting "free" information from the verification phase: The full hidden states from the verification phase are computed anyway; using them as the target for DP introduces no additional computational overhead.
  • Plug-and-play practicality: Highly valuable for proprietary/private LLMs—eliminating the need to search for a draft model or train additional modules.

Limitations & Future Work

  • The speedup ratio is limited (\(1.3\times\) to \(1.7\times\)) and lags behind trained speculative decoding methods (e.g., EAGLE achieves over \(2\times\)).
  • The DP assumes that layer-skipping errors can be tracked layer by layer, but the approximation may become inaccurate as errors accumulate when a large number of layers are skipped.
  • The combination with tree attention to further improve acceptance rates remains unexplored.
  • Not validated on Mixture-of-Experts (MoE) models.

Comparison with Related Work

  • vs. Self-SD (Zhang et al., 2024): Self-SD uses Bayesian optimization for a fixed layer-skipping set, whereas CLaSp dynamically adjusts it using DP, eliminating offline optimization.
  • vs. EAGLE (Li et al., 2024): EAGLE trains specialized draft heads (achieving greater speedup at the cost of training). CLaSp requires no training whatsoever.
  • vs. LayerSkip (Elhoushi et al., 2024): LayerSkip requires introducing layer-skipping regularization during training. CLaSp operates entirely during inference.

Rating

  • Novelty: ⭐⭐⭐⭐ The concept of using dynamic programming for layer-skipping selection is novel, and using verification hidden states as the target is ingenious.
  • Experimental Thoroughness: ⭐⭐⭐⭐ Evaluated on multiple tasks via Spec-Bench, with comprehensive hyperparameter analyses.
  • Writing Quality: ⭐⭐⭐⭐ Clear motivation and detailed methodological descriptions.
  • Value: ⭐⭐⭐⭐ Plug-and-play lossless acceleration provides practical value for deployment.