SubSpec: Speculate Deep and Accurate — Lossless and Training-Free Acceleration for Offloaded LLMs

Conference: NeurIPS 2025 | arXiv: 2509.18344 | Code: None | Area: LLM/NLP | Keywords: Speculative Decoding, Parameter Offloading, Training-Free Acceleration, Quantization, KV-Cache Sharing

TL;DR

This paper proposes SubSpec, a plug-and-play lossless and training-free acceleration method for offloaded LLMs. The core idea is to construct a highly aligned quantized substitute draft model directly from the offloaded target model itself, and to maximize alignment by sharing GPU-resident layers and KV-Cache. SubSpec achieves a 9.1× speedup for Qwen2.5 7B under an 8GB VRAM budget and a 12.5× speedup for Qwen2.5 32B under 24GB VRAM.

Background & Motivation

Background: As LLM scale continues to grow, deploying these models on consumer-grade GPUs faces severe memory constraints. Two main strategies exist: model compression (quantization/pruning) and parameter offloading (storing some weights on CPU/disk).

Limitations of Prior Work: Model compression may degrade generation quality; parameter offloading preserves quality but results in extremely slow inference — each forward pass requires transferring large amounts of weights from the CPU, making I/O bandwidth a serious bottleneck. Speculative decoding has been seen as a promising direction for accelerating offloaded inference: a fast small draft model generates multiple candidate tokens, which are then verified in parallel by the target LLM in a single forward pass, thereby reducing the number of forward passes that involve offloaded weight transfers.

Key Challenge: Existing speculative decoding methods face two fundamental limitations: (1) they rely on pretrained small-model weights from the same family, requiring additional alignment training for custom-trained models; (2) insufficient alignment between the draft and target models leads to limited token acceptance lengths (typically only 3–5 tokens), resulting in suboptimal speedups.

Goal: Construct a draft model that is highly aligned with the target LLM without any training, so that speculative decoding achieves far greater speedups in the parameter-offloading setting than existing methods.

Key Insight: Since high alignment is required, the best source for a draft model is the target model itself. By applying low-bit quantization to offloaded layers to generate "substitute layers," and by sharing non-offloaded layers and KV-Cache, the prediction consistency between draft and target can be maximized.

Core Idea: Use a quantized version of the target LLM itself as the draft model rather than an independent small model, fundamentally resolving the alignment problem.

Method

Overall Architecture

The SubSpec workflow: partition the target LLM into GPU-resident layers and CPU-offloaded layers according to the memory budget → apply low-bit quantization to the offloaded layers to generate substitute layers → substitute layers + shared resident layers form the draft model → the draft model autoregressively generates multiple candidate tokens on the GPU → the target model loads the offloaded layers and performs one complete forward pass for parallel verification → accept tokens that pass verification and reject inconsistent tokens.
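
To make the cycle concrete, here is a minimal sketch of the draft-then-verify loop under greedy decoding (where "lossless" reduces to exact token agreement). The functions `draft_step` and `target_forward` are hypothetical stand-ins rather than the paper's API: in SubSpec the draft runs entirely on the GPU (quantized substitute layers plus shared resident layers), while the target forward pass streams the offloaded layers from the CPU and verifies all candidates in one pass.

```python
from typing import Callable, List

def speculative_decode(
    prefix: List[int],
    draft_step: Callable[[List[int]], int],            # draft model: greedy next token for a sequence
    target_forward: Callable[[List[int]], List[int]],  # target model: greedy next token at every position
    num_draft_tokens: int = 10,
    max_new_tokens: int = 64,
) -> List[int]:
    tokens = list(prefix)
    while len(tokens) - len(prefix) < max_new_tokens:
        # 1) Draft phase: cheap, GPU-only autoregressive speculation.
        draft = []
        for _ in range(num_draft_tokens):
            draft.append(draft_step(tokens + draft))

        # 2) Verify phase: ONE expensive target pass over prefix + draft.
        #    preds[i] is the target's next token given the first i+1 tokens.
        preds = target_forward(tokens + draft)

        # 3) Accept the longest draft prefix that matches the target,
        #    then append the target's own next token as the correction.
        base = len(tokens)
        accepted = 0
        while accepted < len(draft) and draft[accepted] == preds[base - 1 + accepted]:
            accepted += 1
        tokens += draft[:accepted]
        tokens.append(preds[base - 1 + accepted])
    return tokens[: len(prefix) + max_new_tokens]
```

The expensive, I/O-bound `target_forward` runs once per cycle no matter how many draft tokens are accepted, which is why acceptance length maps almost directly onto speedup in the offloaded setting.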

Key Designs

  1. Substitute Layer Generation:

    • Function: Transform the CPU-offloaded layers of the target LLM into lightweight substitutes that can be efficiently inferred on the GPU.
    • Mechanism: Apply extremely low-bit quantization (2-bit or 3-bit) to the offloaded layers, so that the memory footprint of substitute layers is far smaller than the originals. Standard uniform quantization or GPTQ is used without calibration data or fine-tuning. The substitute layers are structurally identical to the original layers (attention layers, FFN layers), with only reduced weight precision.
    • Design Motivation: Since substitute layers are derived from approximations of the target model's own weights, they are inherently highly aligned — a sharp contrast to using an independent small model as the draft, whose weight distribution may differ significantly from the target even within the same model family (see the construction sketch after this list).
  2. GPU-Resident Layer Sharing:

    • Function: Layers of the target model that remain on the GPU are fully shared between the draft and the target.
    • Mechanism: Draft model = substitute layers (quantized offloaded layers) + shared resident layers (original precision). This means the draft model's computation on the shared layers is exactly identical to the target model's; discrepancies arise only from quantization error in the substitute layers.
    • Design Motivation: This simultaneously addresses two issues — reducing additional VRAM overhead (the draft model does not need to separately store the shared layers) and substantially improving alignment (shared layer outputs are 100% consistent). Ablation experiments show that removing layer sharing reduces speedup by more than 40%.
  3. KV-Cache Sharing:

    • Function: The draft model and the target model share a single KV-Cache.
    • Mechanism: Since the shared layers are identical to those of the target model, the corresponding KV-Cache entries are also identical and can be reused directly. The KV values produced by the substitute layers carry quantization error, but their impact is kept controllable by the "anchoring" effect of the shared layers.
    • Design Motivation: Eliminates the need to maintain a separate KV-Cache for the draft model, further saving VRAM, while leveraging the target model's existing context information to improve draft prediction accuracy.
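
The following is a minimal PyTorch sketch of the draft-model construction described in this list, under simplifying assumptions: `quantize_module`, `build_draft_layers`, and `num_resident` are illustrative names rather than the paper's code, and quantization is simulated in full precision (a real deployment would store packed 2-/3-bit weights with dequantizing kernels to actually save VRAM).

```python
import copy
import torch.nn as nn

def quantize_module(layer: nn.Module, bits: int = 2) -> nn.Module:
    """Build a 'substitute layer': a copy of an offloaded transformer layer whose
    Linear weights are rounded to a low-bit grid (quantization simulated in fp32)."""
    sub = copy.deepcopy(layer)
    qmax = 2 ** (bits - 1) - 1
    for m in sub.modules():
        if isinstance(m, nn.Linear):
            w = m.weight.data
            scale = w.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / qmax
            m.weight.data = (w / scale).round().clamp(-qmax, qmax) * scale
    return sub

def build_draft_layers(target_layers: nn.ModuleList,
                       num_resident: int,
                       bits: int = 2) -> nn.ModuleList:
    """Draft = GPU-resident layers shared by reference + quantized substitutes of the
    CPU-offloaded layers. Because the resident layers are literally the same modules,
    their outputs (and hence their KV-cache entries) match the target exactly, so a
    single KV-cache can serve both draft and target."""
    draft = nn.ModuleList()
    for i, layer in enumerate(target_layers):
        if i < num_resident:
            draft.append(layer)                          # shared: no extra VRAM, no mismatch
        else:
            draft.append(quantize_module(layer, bits))   # substitute: small and GPU-resident
    return draft
```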

Loss & Training

SubSpec requires no training whatsoever — this is its most fundamental practical advantage. All steps are deterministic operations: the quantization process uses standard methods without calibration data or fine-tuning, enabling plug-and-play support for arbitrarily custom-trained models.

Key Experimental Results

Main Results

Qwen2.5 7B (8GB VRAM limit):

| Method | MT-Bench Speedup | Quality | Extra VRAM | Training Required |
| --- | --- | --- | --- | --- |
| Vanilla offloading (no acceleration) | 1.0× | Lossless | 0 | No |
| Same-family small model draft | 3.2× | Lossless | +1.2 GB | No |
| Training-aligned draft model | 4.8× | Lossless | +0.8 GB | Yes |
| SubSpec | 9.1× | Lossless | +0.3 GB | No |

Qwen2.5 32B (24GB VRAM limit):

| Benchmark | SubSpec Speedup | Avg. Acceptance Length |
| --- | --- | --- |
| MT-Bench | 11.3× | 8.7 |
| HumanEval | 13.8× | 10.2 |
| GSM8K | 12.1× | 9.4 |
| MMLU | 12.8× | 9.8 |
| Average | 12.5× | 9.5 |

Ablation Study

| Ablation Setting | Speedup Change | Acceptance Length Change |
| --- | --- | --- |
| No GPU-resident layer sharing | −4.2× | −3.8 |
| No KV-Cache sharing | −1.8× | −1.5 |
| 4-bit instead of 2-bit quantization | −0.5× | +0.3 |
| Independent small model instead of substitute layers | −5.3× | −4.2 |

Key Findings

  • High acceptance length is the key to speedup: SubSpec achieves an average acceptance length of 9.5, far exceeding the 3–5 of existing methods; accepting more tokens per verification step directly translates into higher speedup (a back-of-envelope illustration follows after this list).
  • Layer sharing is central to alignment: Removing GPU-resident layer sharing causes a dramatic drop in speedup, indicating that the 100% consistency guaranteed by shared layers is the foundation of high acceptance length.
  • Quantization bit-width of substitute layers is not a critical bottleneck: The difference between 2-bit and 4-bit is only 0.5×, suggesting that quantization error is effectively controlled by the "anchoring" effect of shared layers.
  • Plug-and-play applicability: The method is effective across different model families including Qwen2.5 and LLaMA.
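
As a rough back-of-envelope illustration of the first finding (the cost constants below are invented assumptions, not measurements from the paper): when one verification pass is dominated by streaming offloaded weights over PCIe and the GPU-only draft steps are comparatively cheap, the cost of a cycle is nearly fixed, so the number of tokens generated per cycle (the acceptance length) translates almost linearly into speedup.

```python
# Hypothetical cost model; all constants are illustrative assumptions.
def estimated_speedup(accept_len: float,
                      t_verify: float = 1.0,     # one target pass, dominated by CPU->GPU weight transfer
                      t_draft: float = 0.01,     # one GPU-only draft step on the quantized substitute model
                      draft_tokens: int = 10) -> float:
    baseline = accept_len * t_verify                 # vanilla offloading: one full pass per token
    per_cycle = t_verify + draft_tokens * t_draft    # speculative: one verify pass + draft steps per cycle
    return baseline / per_cycle

print(round(estimated_speedup(4.0), 1))   # acceptance length typical of prior drafts -> ~3.6
print(round(estimated_speedup(9.5), 1))   # SubSpec-like acceptance length -> ~8.6
```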

Highlights & Insights

  • Elegance of the "let the model draft for itself" design philosophy: This approach sidesteps the challenges of draft model selection and alignment training, reducing the problem to a purely engineering-level quantization operation. The idea is extensible to other scenarios requiring model approximation.
  • Systematic triple-sharing strategy: Layer sharing, KV-Cache sharing, and shared weight origin work in concert to ensure draft quality from multiple dimensions, making it more effective than any single-point optimization.
  • Deployment friendliness: The combination of zero training cost, minimal additional VRAM, and lossless quality makes this method highly applicable in real production environments.

Limitations & Future Work

  • Dependency on layer-wise offloading assumption: SubSpec assumes that the model is offloaded layer by layer; finer-grained offloading strategies (per attention head or per channel) would require a redesign of the substitute layer construction.
  • Precision ceiling of quantized substitute layers: For high-precision tasks such as long-context mathematical reasoning, the cumulative effects of quantization error warrant further investigation.
  • Combinations with other acceleration techniques unexplored: Integration with techniques such as speculative sampling tree search and multi-draft models may yield further improvements.
  • Only autoregressive generation is evaluated: Applicability to non-autoregressive or semi-autoregressive generation paradigms has not been validated.

Comparison with Related Methods

  • vs. SpecInfer/Medusa: These methods require training draft heads or draft models; SubSpec requires no training and achieves higher alignment.
  • vs. GPTQ/AWQ: Directly quantizing the target model degrades quality; SubSpec applies quantization only to the draft, leaving the target model quality unaffected.
  • vs. LayerSkip: Layer-skipping acceleration is limited when certain layers are highly important; SubSpec leverages information from all layers.

Rating

  • Novelty: ⭐⭐⭐⭐ The core idea is intuitive yet effective; using the model's own quantized version as the draft is simple but had not been systematically implemented before.
  • Experimental Thoroughness: ⭐⭐⭐⭐ Multiple models and benchmarks with complete ablations, though coverage of model families could be broader.
  • Writing Quality: ⭐⭐⭐⭐ Motivation is clear and the method is presented concisely.
  • Value: ⭐⭐⭐⭐⭐ Extremely high practical deployment value; the combination of zero training, lossless quality, and high speedup is very attractive.