
VocabTrim: Vocabulary Pruning for Efficient Speculative Decoding in LLMs

Conference: ICML 2025
arXiv: 2506.22694
Code: -
Area: Model Compression
Keywords: speculative decoding, vocabulary pruning, LM head, EAGLE, inference acceleration

TL;DR

Proposes VocabTrim, a training-free method that reduces drafting latency in speculative decoding by pruning the vocabulary of the draft model's LM head, improving memory-bound speedup (MBSU) by up to 16% on Llama-3.2-3B-Instruct.

Background & Motivation

Speculative decoding (SpD) uses a small drafter model to predict tokens that the target LLM will generate, which are then validated by the target model. However, an overlooked efficiency bottleneck exists:

  • Modern LLM vocabularies are very large (e.g., 128K tokens for Llama-3).
  • In a 314M-parameter drafter, the LM head accounts for over 30% of the total parameters.
  • In actual downstream tasks, most vocabulary tokens are never sampled (e.g., more than 120,000 tokens remain unused in function calling tasks).
  • Generation is typically memory-bound, so loading the full LM head weight matrix at every draft step wastes valuable memory bandwidth.

Method

Core Method

Prune the LM head parameter \(W\) and vocabulary \(\mathbb{V}\) of the drafter:

\[\mathbb{V}^\text{Trim} = \mathbb{V}[\text{Top-K}(c, k)]\]

\[W^\text{Trim} = W[\text{Top-K}(c, k), :]\]

where \(c\) is the vector of per-token frequencies counted on the calibration dataset \(\mathcal{D}\), and \(k\) is the target vocabulary size.
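The two equations above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the function name `build_trimmed_head`, the list-of-rows representation of \(W\), and the toy data are all assumptions made for the example.

```python
from collections import Counter


def build_trimmed_head(calibration_ids, lm_head, k):
    """Return (keep_ids, trimmed_head) for the k most frequent tokens.

    calibration_ids: iterable of token-id sequences (the calibration set D).
    lm_head: list of rows, one weight row per vocabulary token (W).
    Hypothetical helper for illustration only.
    """
    counts = Counter()  # the frequency counter c
    for seq in calibration_ids:
        counts.update(seq)
    # Top-K(c, k): ids of the k most frequent tokens, sorted by id
    # so the trimmed vocabulary keeps the original relative order.
    keep_ids = sorted(tok for tok, _ in counts.most_common(k))
    # W[Top-K(c, k), :]: keep only the rows of the retained tokens.
    trimmed_head = [lm_head[tok] for tok in keep_ids]
    return keep_ids, trimmed_head


# Toy setup: vocabulary of 6 tokens, hidden size 2.
W = [[0.1 * t, 1.0 - 0.1 * t] for t in range(6)]
D = [[0, 3, 3, 5], [3, 5, 5, 1], [5, 3, 0]]
keep_ids, W_trim = build_trimmed_head(D, W, k=3)
```

Here `keep_ids` plays the role of \(\mathbb{V}^\text{Trim}\) (as indices into the original vocabulary) and `W_trim` the role of \(W^\text{Trim}\). In a real drafter the slicing would be done on a tensor rather than nested lists.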

Calibration Dataset Selection

Three strategies, in order of increasing effectiveness:

  1. Raw text data: Directly available but sub-optimal.
  2. Drafter-generated data: A by-product of fine-tuning the drafter.
  3. Target-model-generated data: The best choice (minimal acceptance rate drop, maximum speedup gain).

Integration with SpD Pipeline

  • Applicable to any drafter-based SpD method (EAGLE, independent drafter, etc.).
  • No architectural constraints: Only replaces the weight matrix of the LM head.
  • No training overhead: Only requires counting token frequencies and slicing the matrix.
  • The target model is completely unaffected, keeping generation lossless.
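One practical detail implied by the points above is that the drafter's output index now refers to the trimmed vocabulary, so it has to be mapped back to an original token id before the target model verifies it. The sketch below shows a greedy draft step under that assumption; `draft_token`, the toy weights, and the use of greedy selection are illustrative choices, not details taken from the paper.

```python
def draft_token(hidden, trimmed_head, keep_ids):
    """Greedy draft step over the trimmed vocabulary (illustrative).

    hidden: the drafter's final hidden state, as a list of floats.
    trimmed_head: rows of the pruned LM head, one per retained token.
    keep_ids: original-vocabulary id of each retained row.
    Returns a token id in the ORIGINAL vocabulary.
    """
    # Logits are computed only for the retained tokens.
    logits = [sum(h * w for h, w in zip(hidden, row)) for row in trimmed_head]
    best = max(range(len(logits)), key=logits.__getitem__)
    # Map the trimmed index back to the full-vocabulary id.
    return keep_ids[best]


keep_ids = [0, 3, 5]
trimmed_head = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
token = draft_token([0.2, 0.9], trimmed_head, keep_ids)
```

Because the returned id lives in the original vocabulary, the target model's verification step needs no changes, which is consistent with the claim that generation stays lossless.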

Trade-off Analysis

  • After pruning, the drafter can only propose retained tokens, which slightly lowers the acceptance rate and therefore the block efficiency.
  • However, shrinking the LM head significantly reduces the memory traffic, and hence the latency, of each draft step.
  • In memory-bound settings, the net effect on memory-bound speedup (MBSU) is positive:
\[\text{MBSU}(x) = \frac{\tau(x)}{c\gamma + 1}\]

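where \(\tau(x)\) is the block efficiency, \(\gamma\) is the number of tokens drafted per block (the draft length), and \(c\) is the drafter-to-target parameter ratio (not the frequency counter \(c\) used in the pruning equations). Trimming the LM head lowers \(c\), which can outweigh the accompanying drop in \(\tau(x)\).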
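To make the trade-off concrete, the formula can be evaluated before and after trimming. The numbers below are hypothetical and chosen only to show the direction of the effect; they are not measurements from the paper.

```python
def mbsu(tau, c, gamma):
    """Memory-bound speedup: tau / (c * gamma + 1)."""
    return tau / (c * gamma + 1)


# Hypothetical values: trimming lowers block efficiency slightly
# (3.6 -> 3.4) but also lowers the drafter/target ratio (0.20 -> 0.15).
before = mbsu(tau=3.6, c=0.20, gamma=5)
after = mbsu(tau=3.4, c=0.15, gamma=5)
gain = after / before - 1.0
```

With these assumed values the smaller denominator more than compensates for the lower block efficiency, so `after` exceeds `before`.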

Experimental Results

Llama-3.2-3B-Instruct (EAGLE drafter)

| Configuration | LM Head (M params) | Writing MBSU | Math MBSU | Coding MBSU | Average MBSU |
|---|---|---|---|---|---|
| Original EAGLE | 394.0 | 1.475 | 1.640 | 1.708 | ~1.55 |
| + Target-generated (32K) | 101.3 | 1.745 | 1.950 | 1.945 | ~1.84 |

  • LM head parameters reduced from 394M to 101M (roughly a 75% reduction).
  • Memory-bound speedup improved by approximately 16%.

Independent drafter (314M)

| Configuration | LM Head (M params) | Average MBSU |
|---|---|---|
| Original | 131.3 | ~2.91 |
| + Target-generated (32K) | 33.8 | ~3.10 |

LM head reduced from 131M to 34M, further improving speedup.
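The LM head figures in the two tables above can be checked with simple arithmetic. The snippet uses only numbers quoted in this summary; the helper name `reduction` is an illustrative choice.

```python
def reduction(before, after):
    """Fractional reduction when going from `before` to `after`."""
    return 1.0 - after / before


# LM head sizes in millions of parameters, taken from the tables above.
eagle_cut = reduction(394.0, 101.3)   # EAGLE drafter for Llama-3.2-3B-Instruct
indep_cut = reduction(131.3, 33.8)    # independent 314M drafter

# Share of the untrimmed LM head in the 314M independent drafter.
head_share = 131.3 / 314.0
```

Both drafters lose about 74% of their LM head parameters at the 32K setting, and the untrimmed head makes up roughly 42% of the 314M drafter, in line with the "over 30%" figure cited in the motivation.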

Ablation: Vocabulary Size vs. Performance

| Top-K Size | Block Efficiency | MBSU |
|---|---|---|
| 128K (Original) | 3.63 | 1.70 |
| 64K | 3.54 | 1.83 |
| 32K | 3.43 | 1.95 |
| 16K | 3.25 | 1.90 |

Among the sizes tested, 32K is the sweet spot: the drop in block efficiency is still small, and MBSU peaks.
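One way to read this ablation is to back out the denominator of the MBSU formula from each row. This assumes the reported values follow \(\text{MBSU} = \tau / (c\gamma + 1)\) exactly, so that \(\tau / \text{MBSU}\) recovers \(c\gamma + 1\); rounding in the table makes the results approximate.

```python
# (vocabulary size, block efficiency tau, MBSU) from the ablation table.
rows = [
    ("128K", 3.63, 1.70),
    ("64K", 3.54, 1.83),
    ("32K", 3.43, 1.95),
    ("16K", 3.25, 1.90),
]

# Implied denominator c * gamma + 1 for each vocabulary size.
implied = {name: tau / m for name, tau, m in rows}

# Vocabulary size with the highest reported MBSU.
best = max(rows, key=lambda r: r[2])[0]

for name, tau, m in rows:
    print(f"{name:>5}: tau={tau:.2f}  MBSU={m:.2f}  c*gamma+1 ~ {implied[name]:.3f}")
```

Under that assumption, the implied drafting cost shrinks steadily as the vocabulary is trimmed, but the step from 32K to 16K saves little extra cost while block efficiency falls from 3.43 to 3.25, which is why MBSU turns down at 16K.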

Ablation: Calibration Data Type

| Calibration Data | MBSU |
|---|---|
| Raw dataset | 1.685 |
| Draft-generated | 1.732 |
| Target-generated | 1.745 |

Data generated by the target model yields the best performance, as it most accurately reflects the actual required token distribution.

Highlights & Insights

  • Simple and highly effective: the change amounts to a few lines of code (matrix slicing).
  • Identifies an overlooked efficiency bottleneck in SpD: an excessively large drafter LM head.
  • Training-free, plug-and-play, and maintains lossless generation.
  • Particularly valuable for memory-bound scenarios such as edge devices.
  • Highly generalizable method, supporting various pruning strategies like Top-K, Top-P, or lowest frequency.
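As an illustration of a selection rule other than Top-K, a Top-P style variant could keep the smallest set of most frequent tokens whose counts cover a fraction \(p\) of the calibration tokens. This is one plausible reading of "Top-P" in this context rather than a definition taken from the paper, and `top_p_ids` and the toy counts are hypothetical.

```python
def top_p_ids(counts, p):
    """Keep the most frequent tokens until they cover a fraction p of all counts.

    counts: dict mapping token id -> frequency on the calibration set.
    Illustrative only; ties are broken by token id.
    """
    total = sum(counts.values())
    kept, mass = [], 0
    # Visit tokens from most to least frequent.
    for tok, n in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        if mass >= p * total:
            break
        kept.append(tok)
        mass += n
    return sorted(kept)


counts = {0: 50, 1: 30, 2: 15, 3: 4, 4: 1}
ids = top_p_ids(counts, p=0.9)
```

Unlike a fixed \(k\), this rule lets the retained vocabulary size vary with how concentrated the token distribution is.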

Limitations & Future Work

  • The speedup depends on decoding being memory-bound (gains are limited when compute-bound).
  • A fixed \(k\) cannot adapt to dynamically changing task requirements.
  • May substantially hurt tasks that require broad vocabulary coverage (e.g., multilingual translation).
  • The choice of calibration dataset introduces task dependency.
  • More fine-grained token selection strategies (e.g., based on token importance rather than frequency) remain unexplored.

Rating

⭐⭐⭐⭐: Although simple, the method directly addresses a core pain point and offers practical engineering value for speculative decoding on edge devices. A training-free gain of about 16% in memory-bound speedup is noteworthy.