DSSD: Efficient Edge-Device LLM Deployment and Collaborative Inference via Distributed Split Speculative Decoding

Conference: ICML2025
arXiv: 2507.12000
Authors: Jiahong Ning, Ce Zheng, Tingting Yang
Code: JasonNing96/DSSD-Efficient-Edge-Computing
Area: LLM Efficiency
Keywords: Speculative Decoding, Edge Computing, LLM Deployment, Device-Edge Collaboration, Communication Optimization, Distributed Inference

TL;DR

This paper proposes the Distributed Split Speculative Decoding (DSSD) framework, which splits the verification stage of speculative decoding between the device and the edge. By replacing multiple uplink transmissions (the SLM's \(\gamma\) vocabulary distributions) with a single downlink transmission (a single vocabulary distribution of the LLM), DSSD significantly reduces communication latency while maintaining identical inference quality.

Background & Motivation

Problem Background

Large Language Models (LLMs) perform exceptionally well in areas such as conversational agents, machine translation, and code generation, but face a dilemma in practical deployment:

  • Device-side: Memory, battery, and computing power are severely limited, making it impossible to run the full LLM.
  • Cloud-side: Network latency is unpredictable, and user mobility leads to connection disruptions, affecting service continuity.
  • Device-Edge Collaboration: A middle ground deploys a Small Language Model (SLM) on the device and an LLM on the base station/edge server, and achieves efficient collaborative inference through speculative decoding.

Limitations of Prior Work

Ding et al. (2024): A query-difficulty routing scheme that reduces costs by assigning simple queries to the SLM, but sacrifices LLM inference accuracy.

Hao et al. (2024): A draft-verification method based on token probability thresholds to trade off performance and cost, which also fails to guarantee LLM-level inference quality.

Distributed Speculative Decoding (DSD) by Zhao et al. (2024): The first work to propose a distributed architecture that places the draft model on the device and the target model on the edge. However, each draft token requires the uplink transmission of the full vocabulary probability distribution to the base station for verification. The communication payload grows linearly with the vocabulary size \(|\mathcal{V}|\), which becomes a severe bottleneck.

Oh et al. (2024): Proposed skipping uplink transmission and LLM inference for tokens with high acceptance probability to improve token throughput, but still at the expense of inference accuracy.

Core Motivation

The fundamental issue with existing distributed speculative decoding is that the verification stage is executed entirely on the edge. The device must transmit all \(\gamma\) vocabulary distributions (each of dimension \(|\mathcal{V}|\)) generated by the SLM to the base station over the uplink. Since uplink bandwidth is typically much smaller than downlink bandwidth, this becomes the primary source of latency. Can the computation allocation of the verification stage be redesigned to shift communication from "multiple uplinks" to "a single downlink"?

Method

Overall Architecture

DSSD retains the basic architecture of deploying an SLM (\(M_q\)) on the device and an LLM (\(M_p\)) on the edge, but performs a key split and reconstruction of the verification stage in speculative decoding. The overall workflow is divided into the following stages:

  1. Drafting Stage (Device-side): The SLM autoregressively generates \(\gamma\) candidate tokens \(x_1, x_2, \ldots, x_\gamma\), while keeping the corresponding vocabulary probability distribution \(Q_i(x)\) for each token on the device.
  2. Uplink Transmission (Reduced): The device only sends the vocabulary indices of the \(\gamma\) tokens to the base station (rather than the full probability distributions), reducing the uplink payload from \(\gamma \times |\mathcal{V}|\) probability values to \(\gamma \times \lceil\log_2|\mathcal{V}|\rceil\) bits.
  3. Partial Verification on Edge: The LLM computes \(\gamma+1\) distributions \(P_1(x), \ldots, P_{\gamma+1}(x)\) based on the prefix and the received tokens, but does not perform full verification.
  4. Downlink Transmission: The edge server only needs to send one critical vocabulary distribution to the device.
  5. Verification Completion on Device: The device completes the token acceptance/rejection judgment and resampling locally, utilizing the locally retained SLM distributions \(Q_i(x)\) and the received LLM distribution.

Key Designs: Splitting the Verification Stage

In standard speculative decoding, the verification process requires simultaneous access to both the SLM distribution \(Q_i(x)\) and the LLM distribution \(P_i(x)\) to perform accept-reject sampling:

  • For each draft token \(x_i\), accept it with probability \(\min\left(1, \frac{P_i(x_i)}{Q_i(x_i)}\right)\).
  • If rejected, resample from the modified distribution \(\text{norm}\left(\max(0, P_i(x) - Q_i(x))\right)\).
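The accept-reject rule above can be written in a few lines of Python. This is a minimal sketch over plain probability lists, not code from the paper's repository; the function names `residual_distribution` and `verify_token` are hypothetical, and the fallback for \(P_i = Q_i\) is an assumption added only to avoid a division by zero.

```python
import random

def residual_distribution(p, q):
    """Return norm(max(0, P - Q)) as a list of probabilities."""
    diff = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    total = sum(diff)
    if total == 0.0:
        # P == Q everywhere: a rejection cannot happen, so fall back to P.
        return list(p)
    return [d / total for d in diff]

def verify_token(x, p, q, rng):
    """Accept draft token x with probability min(1, P(x)/Q(x)).

    Returns (accepted, token). On rejection the token is resampled
    from the residual distribution norm(max(0, P - Q)).
    """
    if rng.random() < min(1.0, p[x] / q[x]):
        return True, x
    weights = residual_distribution(p, q)
    resampled = rng.choices(range(len(p)), weights=weights)[0]
    return False, resampled

# Toy 3-token vocabulary: P is the LLM distribution, Q the SLM distribution.
p = [0.5, 0.3, 0.2]
q = [0.2, 0.5, 0.3]
accepted, token = verify_token(1, p, q, random.Random(0))
```

In this toy example token 1 is accepted with probability 0.3/0.5 = 0.6; if it is rejected, the only token with positive residual mass is token 0, so the resampled token is always 0.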

In DSD, \(Q_i(x)\) is generated on the device and \(P_i(x)\) is generated on the edge, necessitating the upload of \(Q_i(x)\), which leads to huge uplink communication overhead.

The core innovation of DSSD lies in redistributing the verification computation:

  • The SLM distributions \(Q_i(x)\) always remain on the device and do not need to be uploaded.
  • Only the LLM distribution information needs to be transmitted downlink from the edge to the device.
  • The device completes the verification locally using its local \(Q_i(x)\) and the received \(P_i(x)\).

This design leverages a crucial asymmetry: uplink bandwidth is much smaller than downlink bandwidth. DSSD reverses the communication direction from uplink to downlink and reduces the transmission volume from \(\gamma\) distributions to 1 distribution, achieving a dual optimization.
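One possible way to realize this split is sketched below in Python. It is an illustration under stated assumptions, not the authors' implementation: the distributions are toy lists instead of model outputs, the function names `edge_partial_verify` and `device_finish` are hypothetical, and the sketch assumes the device uploads the scalar draft probability \(Q_i(x_i)\) next to each token index so that the edge can locate the first rejected position. This summary does not state how that position is determined.

```python
import random

def edge_partial_verify(tokens, q_scalars, p_dists, rng):
    """Edge side (assumed protocol): find the first rejected draft position
    and choose the single LLM distribution to send downlink.
    p_dists holds gamma + 1 LLM distributions."""
    for i, (x, q_x) in enumerate(zip(tokens, q_scalars)):
        accept_prob = min(1.0, p_dists[i][x] / q_x)
        if rng.random() >= accept_prob:
            return i, p_dists[i]          # rejected at position i
    return len(tokens), p_dists[-1]       # all accepted: send P_{gamma+1}

def device_finish(tokens, q_dists, n_accepted, p_dist, rng):
    """Device side: keep the accepted prefix and draw one more token from
    the locally retained Q_i and the one received LLM distribution."""
    if n_accepted < len(tokens):
        q_dist = q_dists[n_accepted]
        # Unnormalized residual max(0, P - Q); choices() normalizes weights.
        weights = [max(0.0, p - q) for p, q in zip(p_dist, q_dist)]
    else:
        weights = list(p_dist)            # extra token sampled from P_{gamma+1}
    next_token = rng.choices(range(len(p_dist)), weights=weights)[0]
    return tokens[:n_accepted] + [next_token]

rng = random.Random(0)
q_dists = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]                    # stay on device
p_dists = [[0.8, 0.1, 0.1], [0.6, 0.2, 0.2], [0.3, 0.3, 0.4]]   # gamma + 1 on edge
tokens = [0, 1]                                                  # draft tokens
q_scalars = [q_dists[i][x] for i, x in enumerate(tokens)]        # assumed uplink extra
n_accepted, p_sent = edge_partial_verify(tokens, q_scalars, p_dists, rng)
output = device_finish(tokens, q_dists, n_accepted, p_sent, rng)
```

In this sketch the uplink carries only \(\gamma\) indices and \(\gamma\) scalars, and the downlink carries one \(|\mathcal{V}|\)-dimensional distribution per round, matching the single-downlink property described above.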

Communication Overhead Analysis

| Scheme | Uplink Transmission Volume | Downlink Transmission Volume | Total Transmission Complexity |
|---|---|---|---|
| DSD (Zhao et al.) | \(\gamma \times \|\mathcal{V}\|\) (SLM distribution) + token indices | Verification results | \(O(\gamma \cdot \|\mathcal{V}\|)\) |
| Oh et al. (2024) | Partial SLM distributions (skipping some transmissions) | Verification results | \(O(k \cdot \|\mathcal{V}\|), k \leq \gamma\) |
| DSSD (Ours) | Only token indices \(\gamma \times \lceil\log_2\|\mathcal{V}\|\rceil\) | Single LLM distribution \(\|\mathcal{V}\|\) | \(O(\|\mathcal{V}\|)\) |

DSSD reduces the dominant communication term from \(O(\gamma \cdot |\mathcal{V}|)\) to \(O(|\mathcal{V}|)\), a reduction by roughly a factor of \(\gamma\) (the draft length, typically 3-10). The \(\gamma\) token indices still sent uplink add only \(\gamma \times \lceil\log_2|\mathcal{V}|\rceil\) bits.
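To make the comparison concrete, the following Python sketch counts bits per speculative round for both schemes. The 16-bit encoding per probability value (`bits_per_prob`) and the function names are assumptions made for illustration; the summary does not state the numeric format used in the paper, and DSD's downlink of verification results is ignored here.

```python
import math

def dsd_payload_bits(gamma, vocab_size, bits_per_prob=16):
    """DSD uplink: gamma full SLM distributions plus gamma token indices."""
    index_bits = math.ceil(math.log2(vocab_size))
    return gamma * vocab_size * bits_per_prob + gamma * index_bits

def dssd_payload_bits(gamma, vocab_size, bits_per_prob=16):
    """DSSD: gamma token indices uplink, one LLM distribution downlink."""
    index_bits = math.ceil(math.log2(vocab_size))
    uplink = gamma * index_bits
    downlink = vocab_size * bits_per_prob
    return uplink, downlink

gamma, vocab_size = 5, 32_000
dsd_total = dsd_payload_bits(gamma, vocab_size)
dssd_up, dssd_down = dssd_payload_bits(gamma, vocab_size)
ratio = dsd_total / (dssd_up + dssd_down)   # close to gamma
```

With these assumed values, the DSSD uplink shrinks to 75 bits per round, and the total payload ratio comes out just under \(\gamma = 5\), in line with the factor-of-\(\gamma\) claim.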

Inference Quality Guarantee

The output distribution of DSSD is identical to that of standard speculative decoding, which is equivalent to directly using LLM autoregressive decoding. This is because the accept-reject mechanism of the verification stage remains mathematically unchanged, with only the computation location shifted from the edge to the device. Therefore, DSSD does not sacrifice any inference accuracy while reducing communication costs.
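The equivalence can be checked in one step, following the standard argument from the speculative decoding literature (Leviathan et al., 2023). The symbol \(\beta_i\) for the overall acceptance probability is notation introduced here for illustration. For a draft token drawn from \(Q_i\) with \(P_i \neq Q_i\), the probability that position \(i\) finally outputs token \(x\) is

\[
\Pr[\text{output} = x] = Q_i(x)\,\min\left(1, \frac{P_i(x)}{Q_i(x)}\right) + (1 - \beta_i)\,\frac{\max(0, P_i(x) - Q_i(x))}{\sum_{y}\max(0, P_i(y) - Q_i(y))}, \qquad \beta_i = \sum_{y}\min(P_i(y), Q_i(y)).
\]

Because \(\sum_{y}\max(0, P_i(y) - Q_i(y)) = \sum_{y}\left(P_i(y) - \min(P_i(y), Q_i(y))\right) = 1 - \beta_i\), the expression simplifies to

\[
\Pr[\text{output} = x] = \min(P_i(x), Q_i(x)) + \max(0, P_i(x) - Q_i(x)) = P_i(x).
\]

When \(P_i = Q_i\) the draft is always accepted and the result holds trivially. Nothing in this argument depends on where the accept-reject step is executed, which is why moving it from the edge to the device leaves the output distribution unchanged.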

Integration with Draft Length Optimization

DSSD is orthogonal to the optimization of draft length \(\gamma\). It can be further integrated with the adaptive \(\gamma\) optimization strategy proposed by Zhao et al. (2024) to adapt to different network conditions and task scenarios while minimizing end-to-end latency.

Key Experimental Results

End-to-End Latency Comparison

| Method | Inference Quality | Uplink Communication Overhead | End-to-End Latency | Key Characteristics |
|---|---|---|---|---|
| DSD (Zhao et al., 2024) | Maintains LLM quality | High (\(\gamma\) vocabulary distributions) | Baseline | Full SD verification on the edge |
| Oh et al. (2024) | Lossy | Medium (skipping partial transmissions) | Lower | Sacrifices accuracy for latency |
| Ding et al. (2024) | Lossy | Low | Low | Query routing, partially using SLM |
| DSSD (Ours) | Maintains LLM quality | Extremely low (only token indices) | Lowest | Split verification, downlink replacing uplink |

Communication Efficiency Improvement Analysis

| Vocabulary Size | Draft Length \(\gamma\) | DSD Uplink Payload | DSSD Uplink Payload | DSSD Downlink Payload | Communication Reduction Ratio |
|---|---|---|---|---|---|
| 32,000 | 3 | 96,000 floats | 3 indices | 32,000 floats | ~3x |
| 32,000 | 5 | 160,000 floats | 5 indices | 32,000 floats | ~5x |
| 32,000 | 10 | 320,000 floats | 10 indices | 32,000 floats | ~10x |
| 128,000 | 5 | 640,000 floats | 5 indices | 128,000 floats | ~5x |

When considering the asymmetry of uplink and downlink bandwidth (uplink is typically 1/3 to 1/10 of downlink), the actual latency reduction is even more significant.
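A rough Python estimate illustrates how bandwidth asymmetry amplifies the gain. The link rates (10 Mbps uplink, 50 Mbps downlink) and the 16-bit encoding per probability value are hypothetical numbers chosen for illustration, not measurements from the paper. The sketch counts transmission time only and ignores propagation delay, computation time, and the downlink of DSD's verification results.

```python
import math

def tx_time_ms(bits, rate_mbps):
    """Time in milliseconds to send `bits` over a link of `rate_mbps` Mbit/s."""
    return bits / (rate_mbps * 1e6) * 1e3

def round_comm_latency_ms(gamma, vocab_size, up_mbps, down_mbps, bits_per_prob=16):
    """Per-round transmission time for DSD and DSSD under assumed link rates."""
    index_bits = gamma * math.ceil(math.log2(vocab_size))
    dist_bits = vocab_size * bits_per_prob
    # DSD: gamma distributions and the token indices all go uplink.
    dsd = tx_time_ms(gamma * dist_bits + index_bits, up_mbps)
    # DSSD: indices uplink, a single distribution downlink.
    dssd = tx_time_ms(index_bits, up_mbps) + tx_time_ms(dist_bits, down_mbps)
    return dsd, dssd

dsd_ms, dssd_ms = round_comm_latency_ms(
    gamma=5, vocab_size=32_000, up_mbps=10, down_mbps=50
)
speedup = dsd_ms / dssd_ms
```

Under these assumptions the transmission time drops from about 256 ms to about 10 ms per round, a factor of roughly 25, which is approximately \(\gamma\) multiplied by the downlink-to-uplink rate ratio.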

Highlights & Insights

  • Ingenious Design of Reversing Communication Direction: Reversing the flow of information required for verification from "uplink transmission of SLM distributions" to "downlink transmission of LLM distributions" cleverly exploits the physical characteristics of asymmetric uplink/downlink bandwidth in mobile networks.
  • Lossless Inference Quality: Unlike other low-latency solutions (such as verification skipping and query routing), DSSD strictly maintains the mathematical equivalence of speculative decoding, where the output distribution matches the LLM autoregressive decoding exactly.
  • Transmission Volume Reduced from Linear to Constant in \(\gamma\): Reducing the number of transmitted distributions from \(\gamma\) to 1 makes the dominant communication cost independent of the draft length (only the \(\gamma\) token indices still scale with it), enabling longer draft sequences with a negligible increase in communication burden.
  • Framework Generality: DSSD can be orthogonally combined with existing strategies such as draft length optimization and adaptive speculative decoding.
  • High Practicality: The scheme requires no modification to the model architectures of either the SLM or the LLM, changing only the communication protocol and verification computation allocation.

Limitations & Future Work

  • On-Device Verification Computational Overhead: Shifting part of the verification to the device increases the device's computational burden, which may not be suitable for extremely resource-constrained end-devices (such as IoT sensors).
  • Assumption of Single Downlink Transmission: The paper assumes only a single LLM distribution transmission is needed to complete verification; in practice, certain verification strategies might require additional information exchange.
  • Unexplored Vocabulary Compression: The downloaded \(|\mathcal{V}|\)-dimensional distribution remains relatively large, which could be further compressed in combination with vocabulary pruning or Top-K sparsification.
  • Multi-Round Communication Not Considered: Only the communication efficiency of single-round speculative decoding is analyzed, leaving KV cache synchronization and state management in multi-round conversation scenarios unaddressed.
  • Limited Experimental Scale: The model scale and scenario coverage in the paper need to be expanded, and there is a lack of deployment validation in real-world mobile network environments.
  • Security & Privacy: Retaining SLM distributions on the device might involve intellectual property protection issues for the model, which warrants further analysis.

Related Work

  • Leviathan et al. (2023), Chen et al. (2023): Foundational work on speculative decoding establishing the draft-verification paradigm. This paper inherits its core accept-reject mechanism in a distributed scenario.
  • Zhao et al. (2024) DSD: The first distributed speculative decoding architecture, deploying the draft and target models on the device and edge, respectively, but leaving the communication bottleneck unresolved.
  • Oh et al. (2024): Reduced communication overhead by skipping the transmission of high-probability tokens, but at the cost of inference accuracy.
  • Ding et al. (2024): Collaborative inference based on query routing, where simple queries are handled by the SLM and complex ones by the LLM, which does not guarantee LLM-grade inference quality.
  • Hao et al. (2024): Introduced a cost-aware draft-verification method, though a performance-cost trade-off is inevitable.
  • Shao & Li (2025): Communication optimization for device-edge collaborative inference, sharing the focus on communication efficiency with this paper but taking a different technical route.

Insights for Future Research

  • The concept of splitting the verification stage can be extended to other distributed inference scenarios (e.g., federated inference, multi-device collaboration).
  • Asymmetry in uplink/downlink bandwidth is a crucial design consideration in mobile scenarios; future work could optimize further by exploiting 5G/6G network characteristics.
  • Synthesizing speculative decoding with edge computing is an important direction for ubiquitous LLM deployment.

Rating

  • Novelty: ⭐⭐⭐⭐ — The design concepts of verification splitting and reversing communication direction are novel and highly practical.
  • Experimental Thoroughness: ⭐⭐⭐ — The experiments validate the core claims, but lack large-scale deployment verification in real-world environments.
  • Writing Quality: ⭐⭐⭐⭐ — Clear problem formulation, detailed description of system architecture, and motivations are well-articulated.
  • Value: ⭐⭐⭐⭐ — Addresses a real-world communication bottleneck in distributed LLM inference, offering direct reference value for edge AI deployment.