OmniDraft: A Cross-Vocabulary Online Adaptive Drafter for On-Device Speculative Decoding¶
Conference: NeurIPS 2025 arXiv: 2507.02659 Code: Not available Area: LLM Efficiency Keywords: Speculative decoding, cross-vocabulary, online distillation, adaptive drafting, on-device inference
TL;DR¶
This paper proposes the OmniDraft framework, which achieves cross-vocabulary speculative decoding via an online n-gram cache, aligns the draft model with the target model through a hybrid distillation loss, and dynamically adjusts the proposal length with an adaptive drafting head. A single lightweight Llama-68M model thereby provides speculative decoding acceleration (1.5–2×) for diverse target models such as Vicuna-7B, Qwen2-7B, and Llama3-8B.
Background & Motivation¶
Speculative Decoding accelerates LLM inference by having a small draft model predict multiple future tokens, which a large target model then verifies in a single forward pass. Two fundamental challenges remain:
Tight coupling between draft and target models: Existing methods require the draft and target models to belong to the same model family (e.g., both from the Llama series), sharing the same tokenizer and vocabulary. Switching to a different family (e.g., Qwen) renders the draft model unusable.
Dynamic demands in on-device deployment: Users may switch between different target models at runtime and expect inference latency to improve over time.
Prior work UAG proposes a vocabulary-intersection mapping but handles only directly mapped tokens and cannot resolve "false rejections"—cases where multiple sub-tokens proposed by the draft model correspond to a single merged token in the target vocabulary and are therefore incorrectly rejected. Offline distillation-based alignment further assumes a fixed target model and cannot adapt to dynamic switching scenarios.
Core Idea: Establish a "one drafter for all" paradigm—resolving cross-vocabulary mapping via an n-gram cache, achieving dynamic alignment via online distillation, and controlling efficiency via adaptive drafting.
Method¶
Overall Architecture¶
OmniDraft consists of three core components: (1) a cross-vocabulary n-gram cache for token translation between draft and target vocabularies; (2) a hybrid distillation loss for online alignment of the draft model; and (3) an adaptive drafting head for dynamically adjusting the proposal length. The overall pipeline is: the draft model generates tokens → the n-gram cache translates them into the target vocabulary space → the target model verifies them → accept/reject outcomes are fed back to update the cache and distill the draft model.
Key Designs¶
- Cross-Vocabulary N-gram Cache: The core mechanism maintains a cache \(\mathcal{C} = \{(t_i, [d_j^i]_{j=1:n})\}\) recording mappings between target tokens \(t_i\) and draft token sequences \([d_1^i, d_2^i, \cdots, d_n^i]\). During the proposal phase, the draft token sequence is scanned and the n-gram cache is queried to perform merge mappings, with probabilities computed as:
\(q'(t_i) = \begin{cases} q(d_i), & \text{direct mapping} \\ \prod_j q(d_j^i), & \text{n-gram mapping} \end{cases}\)
For the residual distribution in the correction phase, \(q'\) must be defined over the entire target vocabulary, requiring probability adjustment for prefix sub-tokens: \(q'(d_1^i) = q(d_1^i) - \prod_j q(d_j^i)\), ensuring correct allocation of probability mass. The cache is updated online during inference whenever a previously unseen mapping instance is encountered. Design Motivation: Unlike UAG, which handles only vocabulary-intersection tokens, the n-gram cache accommodates merged tokens, thereby avoiding false rejections and improving the acceptance rate.
- Cross-Vocabulary Hybrid Distillation Loss: Online distillation is decomposed into two parts: reverse KL divergence is applied to directly mapped tokens, where the target's full distribution provides rich supervision, while negative log-likelihood (NLL) is applied to n-gram tokens, for which only a point probability estimate (rather than a full distribution) is available. The total loss is:
\(\mathcal{L}_{\text{cross\_vocab\_distill}}(\theta) = \mathcal{L}_{\text{DM}}(\theta) + \lambda \mathcal{L}_{\text{N-gram}}(\theta)\)
where \(\mathcal{L}_{\text{DM}}\) computes KL divergence over directly mapped tokens and \(\mathcal{L}_{\text{N-gram}}\) computes NLL over n-gram tokens. \(\lambda\) can be set as a fixed hyperparameter or as a dynamic weight (e.g., the target model's verification probability for the n-gram). This design enables the draft model to continuously align with a potentially changing target model during online inference.
- Online Adaptive Drafting: A lightweight head network \(f_\phi\) predicts the acceptance rate of each proposed token. The proposal is terminated early once the cumulative rejection probability crosses a threshold:
\(P(\exists 1 \leq i \leq k, \text{s.t. } y_i \text{ rejected}) > \gamma \Rightarrow \text{exit}\)
Two training variants are proposed: joint training (distillation and adaptive head updated simultaneously) and interleaved training (the adaptive head is updated multiple times per single distillation update, with a larger buffer to mitigate distribution shift).
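The merge-mapping lookup in the n-gram cache can be sketched in a few lines of Python. This is an illustrative sketch, not the paper's implementation: the greedy longest-match scan, the `direct_map` fallback, and the string-valued tokens are all assumptions made here for clarity.

```python
from math import prod

def translate(draft_tokens, draft_probs, ngram_cache, direct_map, max_n=3):
    """Map a draft-token proposal into target-vocabulary tokens.

    ngram_cache: {(d_1, ..., d_n): t} maps a draft sub-token sequence to
                 the single merged target token t (n >= 2).
    direct_map:  {d: t} for tokens mapped 1:1 between the vocabularies.
    Returns a list of (target_token, q') pairs, where q' is the product of
    the constituent draft probabilities for n-gram mappings.
    """
    out, i = [], 0
    while i < len(draft_tokens):
        # Try the longest cached n-gram starting at position i first.
        for n in range(min(max_n, len(draft_tokens) - i), 1, -1):
            key = tuple(draft_tokens[i:i + n])
            if key in ngram_cache:
                q_merged = prod(draft_probs[i:i + n])  # q'(t) = prod_j q(d_j)
                out.append((ngram_cache[key], q_merged))
                i += n
                break
        else:  # no n-gram matched: fall back to a direct 1:1 mapping
            out.append((direct_map[draft_tokens[i]], draft_probs[i]))
            i += 1
    return out

# Example: the draft tokenizer splits "tokenizer" into "token" + "izer",
# which the target vocabulary stores as one merged token.
cache = {("token", "izer"): "tokenizer"}
direct = {"the": "the"}
print(translate(["the", "token", "izer"], [0.9, 0.5, 0.8], cache, direct))
# -> [('the', 0.9), ('tokenizer', 0.4)]
```

Without the n-gram entry, "token" and "izer" would each be checked against the target's single token "tokenizer" and rejected; the merge mapping is what eliminates such false rejections.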
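The hybrid loss decomposition can be illustrated with a small numeric sketch. This is plain Python over toy probability dicts, an assumption for readability; a real trainer would compute both terms over logits with a differentiable framework.

```python
from math import log

def hybrid_distill_loss(direct_pairs, ngram_draft_probs, lam=0.2):
    """Numeric sketch of L = L_DM + lambda * L_Ngram (illustrative only).

    direct_pairs: per direct-mapped position, a pair (q, p) of dicts
    token -> probability for the draft (q) and target (p) distributions.
    ngram_draft_probs: per n-gram token, the draft probabilities of its
    constituent sub-tokens d_1..d_n."""
    # Reverse KL on directly mapped tokens: sum_t q(t) * log(q(t) / p(t))
    l_dm = sum(sum(q[t] * log(q[t] / p[t]) for t in q) for q, p in direct_pairs)
    # NLL on n-gram tokens: -log prod_j q(d_j) = -sum_j log q(d_j)
    l_ngram = sum(-sum(log(qj) for qj in qs) for qs in ngram_draft_probs)
    return l_dm + lam * l_ngram

# A perfectly aligned draft model makes both terms vanish.
print(hybrid_distill_loss([({"a": 0.5, "b": 0.5}, {"a": 0.5, "b": 0.5})], [[1.0]]))
# -> 0.0
```

The split reflects the asymmetry of supervision: direct-mapped positions expose the target's full distribution, so a divergence term is meaningful; an n-gram token only yields the product probability of its sub-tokens, for which NLL is the natural objective.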
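The early-exit rule for adaptive drafting can be sketched as follows; feeding the head's per-token acceptance estimates as a plain list is an assumption made here for illustration.

```python
def proposal_length(accept_probs, gamma=0.5):
    """Early-exit rule for adaptive drafting (illustrative sketch).

    accept_probs: the adaptive head's acceptance estimates a_1..a_K for
    successive candidate draft tokens. Stop as soon as the cumulative
    rejection probability P(exists i <= k: y_i rejected) = 1 - prod_i a_i
    exceeds the threshold gamma; return the number of tokens to propose."""
    p_all_accepted = 1.0
    for k, a in enumerate(accept_probs):
        p_all_accepted *= a
        if 1.0 - p_all_accepted > gamma:
            return k  # this token pushes the risk over gamma: exit early
    return len(accept_probs)

# Confident tokens pass; accumulated risk triggers an exit at the 3rd token.
print(proposal_length([0.9, 0.8, 0.5, 0.5], gamma=0.5))  # -> 2
```

Truncating the proposal once a rejection becomes likely avoids wasting draft forward passes on tokens the target model would discard anyway.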
Loss & Training¶
- Distillation uses on-policy data generated by the draft model itself.
- \(\lambda = 0.2\) is fixed across all tasks and experiments.
- LoRA fine-tuning is supported as a lightweight alternative enabling dynamic adapter switching.
- The adaptive head is trained with a weighted BCE loss, using acceptance rate \(\min(1, p/q)\) as labels.
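The last bullet can be made concrete with a short sketch: the soft label for each drafted token is the standard speculative-decoding acceptance probability \(\min(1, p/q)\), fed into a weighted binary cross-entropy. The function shapes and the scalar weight `w` are assumptions of this sketch, not the paper's exact formulation.

```python
from math import log

def acceptance_label(p, q):
    """Soft label for the adaptive head: the probability that the target
    model (prob p) accepts a token drafted with probability q."""
    return min(1.0, p / q)

def weighted_bce(y, y_hat, w=1.0):
    """Weighted binary cross-entropy of prediction y_hat against a soft
    label y in [0, 1] (the weighting scheme w is an assumption here)."""
    return -w * (y * log(y_hat) + (1.0 - y) * log(1.0 - y_hat))

print(acceptance_label(0.3, 0.6))  # target half as confident -> label 0.5
print(acceptance_label(0.9, 0.3))  # target more confident -> capped at 1.0
```

Using \(\min(1, p/q)\) as the label means the head is trained directly on the quantity the verification step actually uses, rather than on a hard accept/reject bit.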
Key Experimental Results¶
Main Results: Cross-Vocabulary Online Distillation¶
| Target Model | Method | GSM8K Accept / Speedup | MBPP+HE Accept / Speedup | Alpaca Accept / Speedup | XSum Accept / Speedup |
|---|---|---|---|---|---|
| Llama3-8B | SpD_DM (baseline) | 0.10 / 0.94× | 0.09 / 1.03× | 0.09 / 0.96× | 0.11 / 0.91× |
| Llama3-8B | \(\mathcal{L}_{\text{DM}}\) + \(\lambda\mathcal{L}_{\text{N-gram}}\) | 0.42 / 1.70× | 0.27 / 1.33× | 0.20 / 1.30× | 0.24 / 1.24× |
| Qwen2-7B | SpD_DM (baseline) | 0.14 / 1.04× | 0.09 / 0.91× | 0.13 / 1.01× | 0.12 / 0.96× |
| Qwen2-7B | \(\mathcal{L}_{\text{DM}}\) + \(\lambda\mathcal{L}_{\text{N-gram}}\) | 0.37 / 1.61× | 0.26 / 1.36× | 0.20 / 1.30× | 0.22 / 1.22× |
Ablation Study: Adaptive Drafting (Vicuna-7B target)¶
| Method | GSM8K Accept / Speedup | MBPP+HE Accept / Speedup | Alpaca Accept / Speedup | XSum Accept / Speedup |
|---|---|---|---|---|
| SpD (vanilla) | 0.21 / 1.44× | 0.14 / 1.22× | 0.20 / 1.44× | 0.20 / 1.42× |
| Distill Only | 0.42 / 2.20× | 0.35 / 1.92× | 0.25 / 1.57× | 0.23 / 1.53× |
| Joint Distill+Adapt | 0.61 / 2.08× | 0.51 / 1.91× | 0.44 / 1.61× | 0.42 / 1.59× |
| Interleaved Distill+Adapt | 0.52 / 2.15× | 0.48 / 1.94× | 0.41 / 1.60× | 0.38 / 1.58× |
Key Findings¶
- The n-gram cache yields significant gains even without training (cache hit rate 0.87), and combining it with distillation produces the best results.
- The cache footprint is small (1–5 MB), suitable for on-device deployment.
- The framework scales to larger target models (Qwen2.5-32B), achieving up to 2.05× speedup.
- The interleaved training variant achieves speedups comparable to joint training (slightly higher on GSM8K and MBPP+HE), while joint training attains a higher acceptance rate on all four tasks.
- LoRA fine-tuning performance is close to full-parameter fine-tuning, supporting dynamic switching across multiple target models.
Highlights & Insights¶
- The "one drafter for all" paradigm is of significant practical value—only a single 68M draft model needs to be deployed on-device to serve all target LLMs.
- The n-gram cache is an elegant engineering design that reformulates cross-vocabulary mapping as an online cache lookup problem.
- The hybrid distillation loss applies different loss functions to directly mapped and n-gram tokens, reflecting a deep understanding of the problem structure.
Limitations & Future Work¶
- Online adaptation performs only a single pass over the data stream, which may be unstable for entirely unseen data distributions.
- Cross-vocabulary mapping for special tokens (e.g., multimodal tokens) has not yet been addressed.
- Online training of the adaptive drafting head is insufficiently stable and may underestimate the optimal proposal length.
- The cache lacks an eviction policy; memory-constrained devices would require further optimization.
Related Work & Insights¶
- Compared to UAG, the n-gram cache extends from vocabulary-intersection mapping to many-to-one mappings, resolving false rejections.
- Compared to OSD (Online Speculative Decoding), the proposed framework adds cross-vocabulary capability.
- Insight: In on-device inference scenarios, lightweight, general-purpose, and adaptable designs are more valuable than heavy but specialized solutions.
Rating¶
- Novelty: ⭐⭐⭐⭐ The n-gram cache for cross-vocabulary mapping is a novel contribution, though the adaptive drafting component draws on SpecDec++.
- Experimental Thoroughness: ⭐⭐⭐⭐ Covers multiple tasks and target models with complete ablations, though comparisons with additional baselines are lacking.
- Writing Quality: ⭐⭐⭐⭐ The framework is clearly presented, with detailed mathematical derivations and intuitive illustrations.
- Value: ⭐⭐⭐⭐⭐ A universal on-device draft model addresses an important practical need; the framework is complete and deployment-ready.