FastDraft: How to Train Your Draft
Conference: ACL 2025
arXiv: 2411.11055
Code: None (Internal implementation by Intel Labs)
Area: Others
Keywords: speculative decoding, Draft Model, knowledge distillation, LLM Inference, Edge Deployment
TL;DR
This work proposes FastDraft, an efficient pre-training and alignment pipeline for draft models. It trains a draft model of roughly 50M parameters within 24 hours on a single node with 8 Intel Gaudi 2 accelerators. Paired with speculative decoding, the resulting drafts reach up to a 3x memory-bound speedup (MBSU) and up to a 2x measured inference speedup.
Background & Motivation
Speculative Decoding (SD) is a lossless acceleration technique for LLM inference: a small draft model quickly proposes a sequence of candidate tokens, and the large target model then verifies them in parallel in a single forward pass. This approach achieves a 2-3x speedup without sacrificing generation quality.
However, the practical deployment of SD faces a key bottleneck: the lack of high-quality draft models. The reason is that the draft model must share the exact same vocabulary as the target model, while most popular open-source LLMs (such as Phi-3 and Llama-3.1) do not have readily available, vocabulary-compatible small models. Furthermore, unlike training general-purpose LLMs, the objective of a draft model is not to generate high-quality responses, but rather to produce token sequences that are easily accepted by the target model—a training objective that has lacked systematic study.
The core motivation of FastDraft is to fill this gap: it proposes an end-to-end, resource-friendly draft model training pipeline, enabling any given target LLM to quickly obtain a paired draft model.
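To make the draft-then-verify loop concrete, here is a minimal sketch of one verification step using the standard speculative sampling acceptance rule (accept a drafted token with probability min(1, p/q), otherwise resample from the normalized residual max(0, p - q)). The function names and the toy list-based distributions are illustrative assumptions and are not taken from the FastDraft implementation.

```python
import random


def sample(dist, rng):
    """Draw an index from a list of probabilities."""
    r = rng.random()
    acc = 0.0
    for i, p in enumerate(dist):
        acc += p
        if r < acc:
            return i
    return len(dist) - 1  # guard against floating-point round-off


def verify(draft_tokens, q_dists, p_dists, rng):
    """One verification step of speculative sampling.

    draft_tokens: the gamma tokens proposed by the draft model
    q_dists[i]:   draft distribution that draft_tokens[i] was sampled from
    p_dists[i]:   target distribution at the same position (gamma + 1 entries)
    Returns the emitted tokens: between 1 and gamma + 1 of them.
    """
    out = []
    for i, tok in enumerate(draft_tokens):
        p, q = p_dists[i], q_dists[i]
        # Accept the drafted token with probability min(1, p(tok) / q(tok)).
        if rng.random() < min(1.0, p[tok] / q[tok]):
            out.append(tok)
            continue
        # Rejected: resample from the residual distribution and stop.
        residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
        z = sum(residual)
        out.append(sample([r / z for r in residual], rng))
        return out
    # Every drafted token was accepted: the target adds one bonus token.
    out.append(sample(p_dists[len(draft_tokens)], rng))
    return out


rng = random.Random(0)
target = [[0.5, 0.5], [0.2, 0.8], [0.9, 0.1]]
# A draft that matches the target exactly is always accepted.
print(verify([0, 1], target[:2], target, rng))
```

The closer the draft distribution q is to the target distribution p, the more often the ratio test passes, which is why the draft is trained to imitate the target rather than to be a strong model in its own right.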
Method
Overall Architecture
FastDraft adopts a three-stage pipeline:
- Pre-Training (PT): Training the draft model from scratch on natural language data.
- Continued Pre-Training (CP): Continuing training with a mixture of code and natural language data.
- Fine-Tuning (FT): Performing sequence-level knowledge distillation using synthetic data generated by the target model.
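The three stages can be written down as a small schedule. The sketch below is only an illustration: the `Stage` class and its field names are hypothetical, the token budgets are the ones this summary reports for the key designs (5BT of text for PT, 5BT of code plus 2.5BT of text for CP), and the FT data size is left empty because it is not stated here.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass
class Stage:
    name: str
    init_from: Optional[str]      # checkpoint to start from; None = random init
    tokens_bt: Dict[str, float]   # data source -> billions of tokens
    objective: str = "next-token cross-entropy"
    # Decoding temperatures for target-generated data; 0.0 stands in for greedy.
    sampling_temperatures: Tuple[float, ...] = ()

    def mix(self) -> Dict[str, float]:
        """Fraction of the stage's tokens drawn from each source."""
        total = sum(self.tokens_bt.values())
        return {k: v / total for k, v in self.tokens_bt.items()} if total else {}


PIPELINE = [
    Stage("PT", None, {"natural_language": 5.0}),
    Stage("CP", "PT", {"code": 5.0, "natural_language": 2.5}),
    # FT token count is not given in this summary, so it is left empty.
    Stage(
        "FT",
        "CP",
        {},
        objective="sequence-level KD (cross-entropy on target outputs)",
        sampling_temperatures=(0.0, 0.6, 0.8, 1.0),
    ),
]

for stage in PIPELINE:
    print(stage.name, "<-", stage.init_from, stage.mix())
```

Each stage starts from the previous stage's checkpoint, so the code-heavy CP mix (two thirds code under these budgets) is applied to a model that already has natural language ability.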
Key Designs
- Draft Architecture Selection: The only rigid constraint on the draft model is that it must output a probability distribution over the same vocabulary as the target model. The authors explored two scales: 50M and 120M parameters (approximately 76x and 32x smaller than Phi-3-mini, respectively). The architecture directly inherits the Transformer design of the target model, only scaling down the hidden dimensions and the number of layers.
- Pre-training Data Scale Experiments: Through a systematic ablation study over {0.1, 0.5, 1, 2, 5, 10} BT (billion tokens), 5BT was found to be an excellent balance point: more data yields diminishing returns in acceptance rate, and even performance degradation on certain tasks. This finding challenges the intuition that "more data is always better," indicating that draft model training has its unique patterns.
- Continued Pre-training Strategy: Directly training on code data fails to maintain natural language capabilities. FastDraft adopts a mixed CP strategy: starting from the text-pretrained model, it continues training with a mixture of 5BT code and 2.5BT text data. This outperforms training from scratch on mixed data or reverse CP (code initialization -> text CP).
- Sequence-level Knowledge Distillation (KD): The target model (Phi-3-mini / Llama-3.1-8B) generates responses for multiple instruction datasets, including Alpaca, OIG, and Evol-Instruct. Generation uses sampling at several temperatures (0.6, 0.8, 1.0) as well as greedy decoding to increase diversity. The Magpie method is also applied: the target model generates the instructions themselves and then the responses to them.
- Hardware-aware Architecture Design: Under a fixed parameter budget, the impacts of different depths and widths on latency were analyzed. It was found that increasing the number of layers has the greatest impact on latency, whereas increasing the hidden dimension has a smaller effect. The acceleration curve exhibits an inverted U-shape: the optimal draft architecture should be neither too deep nor too shallow.
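The depth-versus-width trade-off under a fixed parameter budget can be illustrated with a rough parameter count. This is a sketch under stated assumptions: the counting formula is a simplified Llama/Phi-3-style decoder (no grouped-query attention, norm weights ignored), the Phi-3-mini shape is taken from its public configuration as I recall it, and the two draft shapes are hypothetical examples, not the configurations used in the paper.

```python
def decoder_params(vocab, hidden, layers, intermediate, tied_embeddings=False):
    """Approximate parameter count of a Llama/Phi-3-style decoder.

    Counts the embedding table, the output head, four hidden x hidden
    attention projections, and a gated MLP with three hidden x intermediate
    matrices per layer. Norm weights are ignored.
    """
    embed = vocab * hidden * (1 if tied_embeddings else 2)
    per_layer = 4 * hidden * hidden + 3 * hidden * intermediate
    return embed + layers * per_layer


VOCAB = 32064  # Phi-3-mini vocabulary size (assumed from its public config)

# Target shape (assumed): hidden 3072, 32 layers, MLP width 8192.
target = decoder_params(VOCAB, hidden=3072, layers=32, intermediate=8192)

# Two hypothetical drafts with a similar budget but different depth.
shallow_wide = decoder_params(VOCAB, hidden=512, layers=4, intermediate=1536)
deep_narrow = decoder_params(VOCAB, hidden=384, layers=12, intermediate=1152)

for name, n in [
    ("target", target),
    ("shallow/wide", shallow_wide),
    ("deep/narrow", deep_narrow),
]:
    print(f"{name:>13}: {n / 1e6:8.1f}M params, target is {target / n:5.1f}x larger")
```

Both hypothetical drafts land just under 50M parameters, yet one runs three times as many sequential layers per token. Because the draft must share the target's vocabulary, the embedding table and output head account for roughly 70% of the shallow draft's parameters in this count, which leaves little budget for the Transformer layers themselves.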
Loss & Training
The fine-tuning phase compared two knowledge distillation strategies:
- Sequence-level KD: Training the draft model directly on sequences generated by the target model using cross-entropy loss, without using the target's logits.
- Token-level KD: Using sparse logits from the target (keeping only the most significant ones) to compute KL divergence or Total Variation Distance (TVD).
Experiments revealed that sequence-level KD is already sufficiently effective, and the additional gains from token-level KD are unclear. This is a pragmatic conclusion—token-level KD requires pre-computing and storing logits, which incurs much higher overhead.
The loss function combinations compared are:
- \(\mathcal{L}_{CE}\) (cross-entropy)
- \(\frac{1}{2}\mathcal{L}_{CE} + \frac{1}{2}\mathcal{L}_{KL}\)
- \(\frac{1}{2}\mathcal{L}_{CE} + \frac{1}{2}\mathcal{L}_{TVD}\)
- \(\mathcal{L}_{KL}\)
- \(\mathcal{L}_{TVD}\)
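For a single token position, the three loss terms can be sketched in plain Python as follows. Several details are assumptions made for illustration: the KL direction (target to draft), the top-k rule used to sparsify the target distribution, and renormalizing the kept probabilities are not specified in this summary, and all function names are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def top_k_sparsify(probs, k):
    """Keep the k largest target probabilities and renormalize (assumed scheme)."""
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    z = sum(probs[i] for i in keep)
    kept = set(keep)
    return [probs[i] / z if i in kept else 0.0 for i in range(len(probs))]


def cross_entropy(q, token):
    """L_CE: negative log-likelihood of the target-generated token under the draft."""
    return -math.log(q[token])


def kl_divergence(p, q):
    """L_KL: KL(p || q), target p against draft q (direction is an assumption)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def tvd(p, q):
    """L_TVD: total variation distance, half the L1 distance."""
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))


# Toy example over a 4-token vocabulary.
p = top_k_sparsify(softmax([2.0, 1.0, 0.5, -1.0]), k=2)  # sparse target
q = softmax([1.5, 1.2, 0.3, -0.5])                       # draft
token = 0                                                # token the target emitted

losses = {
    "CE": cross_entropy(q, token),
    "CE+KL": 0.5 * cross_entropy(q, token) + 0.5 * kl_divergence(p, q),
    "CE+TVD": 0.5 * cross_entropy(q, token) + 0.5 * tvd(p, q),
    "KL": kl_divergence(p, q),
    "TVD": tvd(p, q),
}
print(losses)
```

Under the standard speculative sampling rule, the probability of accepting a drafted token at a given position equals \(1 - \mathrm{TVD}(p, q)\), which is why TVD is a natural candidate loss. The sequence-level variant needs only `cross_entropy` on stored target outputs, with no target logits to precompute.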
Key Experimental Results
Main Results
| Model | Stage | CNN-DM | TinyStories | Dolly | HumanEval |
|---|---|---|---|---|---|
| Phi3-mini 50M | PT | 0.311 | 0.277 | 0.245 | 0.229 |
| Phi3-mini 50M | PT→CP | 0.304 | 0.287 | 0.226 | 0.561 |
| Phi3-mini 50M | PT→CP→FT | 0.369 | 0.306 | 0.370 | 0.562 |
| Llama3.1 150M | PT | 0.280 | 0.227 | 0.247 | 0.248 |
| Llama3.1 150M | PT→CP | 0.280 | 0.235 | 0.273 | 0.606 |
| Llama3.1 150M | PT→CP→FT | 0.307 | 0.266 | 0.334 | 0.649 |
*Acceptance rate results (\(\gamma=3\), multinomial sampling \(T=0.6\))
Ablation Study
| Draft | Data Volume | PPL | CNN-DM AR | TinyStories AR | Dolly AR |
|---|---|---|---|---|---|
| 50M | 2BT | 297.4 | 0.323 | 0.264 | 0.241 |
| 50M | 5BT | 256.6 | 0.311 | 0.277 | 0.245 |
| 50M | 10BT | 240.9 | 0.312 | 0.283 | 0.234 |
| 120M | 2BT | 199.6 | 0.362 | 0.297 | 0.284 |
| 120M | 5BT | 167.7 | 0.366 | 0.327 | 0.281 |
| 120M | 10BT | 147.4 | 0.351 | 0.331 | 0.251 |
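*Perplexity decreases monotonically with data volume, but the acceptance rate (AR) does not follow. From 5BT to 10BT, Dolly AR drops for both draft sizes (0.245 to 0.234 and 0.281 to 0.251) and CNN-DM AR drops for the 120M draft (0.366 to 0.351); for the 50M draft, CNN-DM AR is already highest at 2BT (0.323).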
Key Findings
- Target-generated Data vs. Original Data for fine-tuning: the former achieves consistently better performance across all tasks (Table 2), with an improvement of approximately 6 percentage points on the Dolly task.
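- Continued Pre-training on Code brings a large boost to HumanEval (about 33 percentage points for the Phi-3-mini draft, 0.229 to 0.561, and about 36 for the Llama-3.1 draft, 0.248 to 0.606), while mixing in natural language data keeps acceptance rates on natural language tasks roughly level (within about 3 points in either direction).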
- Empirical Speedup of 50M Draft on Intel Core Ultra: An average of 1.5x speedup for natural language tasks and 2x for code completion tasks.
- Memory-Bound Speedup (MBSU): ~2x for natural language, ~3x for code.
- Entire Training Takes Only 24 Hours on a single node with 8 Intel Gaudi 2 accelerators.
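A back-of-the-envelope calculation shows how acceptance rates of this size relate to memory-bound speedups. The sketch rests on three assumptions that are not confirmed by this summary: the acceptance rate is read as the average fraction of the \(\gamma\) drafted tokens accepted per verification step, the cost of a forward pass is taken as proportional to parameter count, and MBSU is written as \(c\tau / (c + \gamma)\) with \(c\) the target-to-draft parameter ratio and \(\tau\) the tokens emitted per step.

```python
def tokens_per_step(acceptance_rate, gamma):
    """Expected tokens emitted per verification step.

    Assumes acceptance_rate is the mean fraction of the gamma drafted tokens
    that are accepted; the target always contributes one extra token.
    """
    return 1.0 + gamma * acceptance_rate


def mbsu(acceptance_rate, gamma, size_ratio):
    """Memory-bound speedup under a cost-proportional-to-parameters assumption.

    size_ratio is target parameters / draft parameters. One step costs one
    target pass plus gamma draft passes, i.e. (size_ratio + gamma) / size_ratio
    in units of a target pass.
    """
    tau = tokens_per_step(acceptance_rate, gamma)
    return size_ratio * tau / (size_ratio + gamma)


GAMMA = 3          # draft length used for the acceptance-rate table
SIZE_RATIO = 76.0  # Phi-3-mini vs. the 50M draft, as stated in the summary

# Acceptance rates of the fully trained 50M draft (PT -> CP -> FT row).
for task, ar in [("Dolly", 0.370), ("HumanEval", 0.562)]:
    print(f"{task:>9}: tau = {tokens_per_step(ar, GAMMA):.2f}, "
          f"MBSU = {mbsu(ar, GAMMA, SIZE_RATIO):.2f}")
```

Under these assumptions the estimate comes out at roughly 2.0 for Dolly and 2.6 for HumanEval, in the same range as the reported MBSU figures. Because the draft is so small, the \(\gamma\) draft passes add little cost, so the speedup is governed almost entirely by how many drafted tokens are accepted.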
Highlights & Insights
- "Good Enough" Data Philosophy: Pre-training on 5BT data combined with sequence-level KD is sufficient to train an excellent draft model, without necessitating trillion-token datasets or complex distillation losses.
- Differences in Training Objectives: Draft model training behaves differently from general LLM training. More data does not necessarily raise the acceptance rate, because the goal of a draft model is to "resemble the target model" rather than to "become smarter."
- Effectiveness of Extremely Small Drafts: A draft model with only 50M parameters (76x smaller than the target) can achieve meaningful speedup, indicating that speculative decoding does not have overly strict quality requirements for the draft model.
- Extremely Low End-to-End Training Costs: Because the entire pipeline completes within 24 hours, training a matching draft model for a newly released LLM becomes practical almost immediately.
Limitations & Future Work
- Validated Only on English: syntactic structure differences in other languages might affect performance.
- Only Single-Sequence Speculative Decoding Explored: advanced strategies such as multi-candidate or tree-based speculation were not investigated.
- Identical Draft and Target Architectures: the potential of heterogeneous architecture combinations remains unexplored.
- Limited Evaluation Tasks: evaluated mostly on summarization, text completion, and code, lacking a diverse range of scenarios such as conversation or translation.
Related Work & Insights
- Methods like MEDUSA and EAGLE use plugin prediction heads or hidden representations of the target model, limiting flexibility.
- DistillSpec (Zhou et al., 2023) studies the relationship between TVD loss and acceptance rate.
- TVD++ (Goel et al., 2024) fine-tunes on a draft model pre-trained on 600B tokens.
- FastDraft highlights the advantages of an independent draft model: higher flexibility, ability to pair with over 1000 compatible models, and no requirement for additional target model inference.
Rating
- Novelty: ⭐⭐⭐ While the methodology components are not entirely brand new, systematically studying the various dimensions of draft training (data scale, CP, KD strategies, hardware-aware architectures) is a valuable contribution.
- Experimental Thoroughness: ⭐⭐⭐⭐ The ablation studies are comprehensive and solid, covering dimensions such as data scale, CP strategies, KD losses, architecture width/depth, and hardware deployment.
- Writing Quality: ⭐⭐⭐⭐ Well-structured with rich charts and logically organized experimental design.
- Value: ⭐⭐⭐⭐ It provides a practical and low-cost draft training scheme for industry and edge deployment scenarios.