Steering Pretrained Drafters during Speculative Decoding¶
Conference: AAAI 2026 arXiv: 2511.09844 Code: github.com/ETH-DISCO/SD-square Area: Model Compression Keywords: Speculative Decoding, Dynamic Alignment, Steering Vector, LLM Inference Acceleration, Pretrained Drafter
TL;DR¶
This paper proposes SD², which extracts steering vectors from verifier hidden states and injects them into the MLP layers of a pretrained drafter, achieving dynamic drafter–verifier alignment in speculative decoding. Under standard sampling, the number of accepted tokens increases by up to 35% with negligible computational overhead.
Background & Motivation¶
Speculative decoding accelerates LLM inference via a "fast draft + parallel verify" paradigm. Its core bottleneck is drafter–verifier misalignment: the more of the drafter's candidate tokens the verifier rejects, the smaller the speedup.
Drafters fall into two major categories:
Dependent drafters (e.g., EAGLE, Medusa): lightweight speculative heads attached directly to the verifier. While fast, their generation quality is limited, and efficiency degrades when verification latency dominates total latency.
Independent drafters (e.g., pretrained small models): benefit from strong generation capability due to independent training, yielding higher acceptance rates, but lack dynamic alignment mechanisms with the verifier.
Existing offline alignment methods (e.g., distillation) improve acceptance rates to some extent but degrade on out-of-distribution (OOD) data. The authors observe that intermediate representations in LLMs implicitly encode information about tokens several steps into the future, so predictive signals can be extracted from verifier hidden states to dynamically guide drafter generation, improving alignment quality while preserving the generalization capability of the pretrained drafter.
Method¶
Overall Architecture¶
SD² follows the standard speculative decoding pipeline: the drafter drafts \(k\) candidate tokens → the verifier verifies in parallel → tokens are accepted or rejected. The key innovation is that during the verification step, an additional steering vector is generated and injected into the drafter to guide the next drafting round.
The method consists of three stages: verification (generating steering vectors), drafting (autoregressive generation guided by steering vectors), and training (learning the steering mechanism parameters).
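As a rough illustration of how these stages could interlock at inference time, here is a minimal Python sketch. The `drafter.draft`, `verifier.verify`, and `make_steering_vector` interfaces are hypothetical wrappers for readability, not the authors' actual API.

```python
import torch

def sd2_decode(drafter, verifier, prompt_ids, k=8, max_new_tokens=256):
    """Minimal sketch of the SD^2 loop; all interfaces are hypothetical."""
    ids = prompt_ids
    g = None  # no steering vector exists before the first verification
    while ids.shape[-1] - prompt_ids.shape[-1] < max_new_tokens:
        # Drafting: k candidates, each MLP biased by W_s @ g (when present).
        draft_ids = drafter.draft(ids, k=k, steering_vector=g)
        # Verification: accept a prefix of the draft (plus the verifier's
        # correction token) and expose hidden states from three layers.
        accepted, (h, m, l) = verifier.verify(ids, draft_ids)
        g = verifier.make_steering_vector(h, m, l)  # g_t = W_hml [h; m; l]
        ids = torch.cat([ids, accepted], dim=-1)
    return ids
```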
Key Designs¶
- Steering Vector Extraction
During verification, hidden states \(h_t, m_t, l_t\) are extracted from three different layers of the verifier (high, middle, and low), and a steering vector is generated via linear projection:
\(g_t = W_{hml}[h_t, m_t, l_t]^\top\)
Layer selection follows EAGLE-3: layers \(L-2\), \(L/2\), and 3 provide the high-, mid-, and low-level states, respectively. This multi-layer fusion lets the steering vector simultaneously capture high-level semantics, mid-level features, and low-level patterns.
- Bias Injection in MLP
The steering vector is transformed into a bias via linear mapping \(W_s\) and injected into the up-projection of each MLP layer in the drafter, modifying the SwiGLU activation as:
\(a^{(l)}_{t+i}, g_t \mapsto W_d((W_u a^{(l)}_{t+i} + W_s g_t) \odot \sigma(W_g a^{(l)}_{t+i}))\)
This design offers three advantages: \(W_s g_t\) is invariant to the drafting position \(i\) and need only be computed once at the start of each drafting round; KV-cache compatibility is preserved; and the gating mechanism naturally regulates the steering intensity.
- Initialization Strategy
\(W_s\) is initialized to the zero matrix (so steering leaves drafter behavior unchanged at initialization), and \(W_{hml}\) is initialized such that its output equals \(h_t + m_t + l_t\). A LayerNorm is applied to the steering vector to stabilize training. The three designs are sketched together below.
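A minimal PyTorch sketch of the combined mechanism follows; dimensions, module names, and the exact placement of the LayerNorm are my assumptions from the description above, not the released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SteeringExtractor(nn.Module):
    """g_t = W_hml [h_t; m_t; l_t], initialized to output h_t + m_t + l_t."""
    def __init__(self, d_model):
        super().__init__()
        self.proj = nn.Linear(3 * d_model, d_model, bias=False)
        self.norm = nn.LayerNorm(d_model)  # stabilizes training
        with torch.no_grad():
            eye = torch.eye(d_model)
            # Three identity blocks side by side => output = h + m + l.
            self.proj.weight.copy_(torch.cat([eye, eye, eye], dim=1))

    def forward(self, h, m, l):
        return self.norm(self.proj(torch.cat([h, m, l], dim=-1)))

class SteeredSwiGLU(nn.Module):
    """Drafter MLP with the steering bias W_s g_t added to the up-projection."""
    def __init__(self, d_model, d_ff):
        super().__init__()
        self.w_gate = nn.Linear(d_model, d_ff, bias=False)
        self.w_up = nn.Linear(d_model, d_ff, bias=False)
        self.w_down = nn.Linear(d_ff, d_model, bias=False)
        self.w_s = nn.Linear(d_model, d_ff, bias=False)
        nn.init.zeros_(self.w_s.weight)  # zero init: no steering at start

    def forward(self, a, g=None):
        # a: (batch, seq, d_model); g: (batch, 1, d_model) or None.
        up = self.w_up(a)
        if g is not None:
            # W_s g is independent of the drafting position i, so it can be
            # computed once per draft round and broadcast over positions;
            # the KV-cache is untouched.
            up = up + self.w_s(g)
        return self.w_down(up * F.silu(self.w_gate(a)))
```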
Loss & Training¶
Training uses probability distributions generated by the verifier \(\pi_V\) on synthetic data as targets (synthetic data better reflects verifier behavior than real data).
- A random offset \(\delta \in [1, k]\) is sampled to simulate drafting the \(\delta\)-th token.
- KL divergence is used as the training loss: \(D_{KL}(\pi_V(\cdot|x_{1:t-1}) \| \pi_D(\cdot|x_{1:t-1}, g_{t-\delta}))\)
- Both drafter parameters and steering mechanism parameters are fine-tuned jointly; the verifier is frozen throughout.
- Training consists of 6 epochs followed by 1 epoch of fine-tuning on ShareGPT data.
- The AdamW optimizer is used with a cosine learning-rate schedule; a minimal sketch of the resulting training step follows this list.
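Under these assumptions, one training step could look as follows. The `hidden_at` method, the `steering_extractor` attribute, and the `steering_vector` argument are hypothetical names, and the verifier outputs are assumed to be precomputed on the synthetic data with the frozen verifier.

```python
import torch
import torch.nn.functional as F

def sd2_loss(drafter, verifier_out, input_ids, k=8):
    """KL(pi_V || pi_D) with a random draft offset (illustrative only)."""
    delta = int(torch.randint(1, k + 1, (1,)))  # random offset in [1, k]
    # Build the steering vector from the verification step delta tokens
    # back, so the drafter learns to work with "stale" guidance exactly
    # as it must during decoding.
    h, m, l = verifier_out.hidden_at(offset=-delta)
    g = drafter.steering_extractor(h, m, l)
    log_pi_d = F.log_softmax(drafter(input_ids, steering_vector=g).logits, dim=-1)
    pi_v = F.softmax(verifier_out.logits, dim=-1)  # frozen verifier targets
    # F.kl_div expects log-probabilities as input and probabilities as target.
    return F.kl_div(log_pi_d, pi_v, reduction="batchmean")
```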
Key Experimental Results¶
Main Results¶
Evaluation is conducted on 4 verifier–drafter pairs across 5 datasets with \(k=8\). Metrics are Block Efficiency (\(\tau\), accepted tokens per block) and Speedup (\(\alpha\), relative to the pretrained drafter).
| Verifier & Drafter | Method | τ (T=1 avg) | α (T=1 avg) | τ (T=0 avg) | α (T=0 avg) |
|---|---|---|---|---|---|
| Vicuna 13B + Llama 160M | Pretrained | 1.88 | 1.00 | 2.36 | 1.00 |
| | Distilled | 2.45 | 1.32 | 2.86 | 1.24 |
| | SD² | 2.96 | 1.61 | 3.27 | 1.43 |
| Qwen3 14B + Qwen3 0.6B | Pretrained | 3.86 | 1.00 | 4.03 | 1.00 |
| | Distilled | 3.97 | 1.05 | 4.21 | 1.06 |
| | SD² | 4.26 | 1.11 | 4.49 | 1.12 |
| Llama 3.1 8B + Llama 3.2 1B | Pretrained | 4.91 | 1.00 | 5.12 | 1.00 |
| | Distilled | 4.78 | 0.97 | — | — |
| | SD² | 5.00 | 1.00 | 5.31 | 1.02 |
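For concreteness, here is my reading of the two metrics as simple computations (not the authors' evaluation code):

```python
def block_efficiency(accepted_per_block):
    # tau: average number of accepted draft tokens per verification block.
    return sum(accepted_per_block) / len(accepted_per_block)

def speedup(method_tokens_per_sec, pretrained_tokens_per_sec):
    # alpha: wall-clock throughput relative to the pretrained drafter.
    return method_tokens_per_sec / pretrained_tokens_per_sec
```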
Ablation Study¶
| Configuration | Block Efficiency Change | Notes |
|---|---|---|
| SD² (in-MLP bias + unfrozen drafter) | Best | Final adopted design |
| Post-MLP bias | Slightly lower | Insufficient control |
| Conditional bias (based on \(g_t\) and \(f^{(l-1)}\)) | Slightly lower | More parameters, no additional gain |
| SD² + frozen drafter | +100% over pretrained | Steering vectors alone are valuable |
| SD² + unfrozen drafter | Best | Steering and fine-tuning synergize optimally |
| Random offset vs. block offset | Negligible difference | Random offset marginally better |
Key Findings¶
- For pairs with a large capacity gap (Vicuna 13B + Llama 160M), SD² achieves average improvements of 61% in throughput and 57% in Block Efficiency under standard sampling.
- Even on already well-aligned pairs (Llama 3.1 8B + Llama 3.2 1B), SD² does not degrade performance.
- Distillation frequently degrades on OOD data (e.g., below pretrained on GSM8K and HumanEval), while SD² consistently matches or exceeds the pretrained baseline.
- Steering vectors alone (with frozen drafter) improve accepted token count by 100%, demonstrating that verifier hidden states contain rich future predictive information.
- SD² maintains stable advantages in long-sequence generation without degradation as position increases.
Highlights & Insights¶
- Elegantly lightweight intervention: steering is achieved solely by adding a bias to the up-projection of the existing MLPs; it requires no additional self-attention computation and adds negligible overhead.
- Clever initialization: \(W_s = 0\) ensures the original drafter behavior is undisturbed at the start of training.
- Plug-and-play compatibility: requires no changes to the pretraining process and can be appended to any pretrained drafter using SwiGLU activations.
- Reveals an important phenomenon: intermediate layers of LLMs implicitly encode multi-step future token information, a finding with broad research implications.
Limitations & Future Work¶
- Performance depends on the quality of the training data distribution; OOD performance, while superior to distillation, still exhibits some degradation.
- Gains are limited for already well-aligned drafters (e.g., the Llama 3.1 family).
- Access to verifier hidden states is required, making the approach inapplicable to black-box remote verifier settings.
- Validation is limited to the SwiGLU activation function; applicability to other activation functions remains unexplored.
- No exploration of novel drafter training schemes specifically designed for dynamic steering.
- Extension to more complex speculative decoding paradigms such as tree-based verification has not been investigated.
Related Work & Insights¶
- EAGLE-3 is the primary dependent drafter baseline, guiding a lightweight head by directly concatenating verifier hidden states; SD²'s steering approach is more modular and plug-and-play.
- DistillSpec proposes the offline distillation baseline used in this work; SD²'s dynamic steering demonstrates clearly superior robustness on OOD scenarios compared to distillation.
- The activation steering literature inspired the design of this work; however, prior work is static and oriented toward interpretability, whereas SD² is the first to apply this technique to the dynamic setting of speculative decoding.
Rating¶
- Novelty: ⭐⭐⭐⭐ — Introducing activation steering into speculative decoding is a novel integration.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ — 4 model pairs, 5 datasets, multiple sampling modes, detailed ablations.
- Writing Quality: ⭐⭐⭐⭐ — Clear structure and rigorous formulation.
- Value: ⭐⭐⭐⭐ — Highly practical; plug-and-play with existing systems, delivering significant speedups.