DiSCo: Device-Server Collaborative LLM-Based Text Streaming Services¶

Conference: ACL 2025
arXiv: 2502.11417
Code: Not open-sourced
Area: LLM Inference Systems / Device-Server Collaboration
Keywords: Device-Server Collaboration, QoE, TTFT, TBT, Token Migration, LLM Serving

TL;DR¶

This work proposes DiSCo, a device-server collaborative LLM inference scheduler that optimizes user-perceived Time-to-First-Token (TTFT) and Time-Between-Tokens (TBT) under cost constraints through cost-aware request dispatching and token-level migration mechanisms.

Background & Motivation¶

Core Problem: LLM-based text streaming services face severe challenges in Quality of Experience (QoE) and operational cost. TTFT (Time-to-First-Token) and TBT (Time-Between-Tokens) are critical metrics for real-time interaction, yet existing deployment paradigms fail to optimize both simultaneously.

Limitations of Prior Work: (1) Server-only deployment is expensive, and its TTFT fluctuates severely due to request queuing, batching contention, and network latency (e.g., GPT-4o-mini's TTFT spikes from 0.3s to several seconds under high load); (2) Device-only deployment is constrained by limited hardware resources, suffering from slow prefill for long prompts and high energy consumption (e.g., an iPhone running a 7B model lasts less than 2 hours).

Design Motivation: Server-side TTFT is unpredictable and only weakly correlated with prompt length, whereas device-side TTFT is predictable and scales linearly with prompt length. Furthermore, the token generation speed of both deployment paradigms exceeds the human consumption rate. These complementary characteristics motivate a device-server collaborative scheduling approach.

Method¶

Overall Architecture¶

Operating as a middleware, DiSCo consists of two core controllers: the Dispatch Controller, which determines the initial execution endpoint of a request, and the Migration Controller, which dynamically switches the execution endpoint during the generation process. Together, they optimize TTFT and TBT under cost constraints.

Key Designs¶

Unified Cost Model and Cost-Aware Dispatching Strategy: A dynamic exchange rate \(\lambda\) is introduced to unify server monetary costs and device energy costs. In device-constrained scenarios, a waiting-time strategy is adopted: execution is first attempted on the server, and device-side inference starts only if no response has arrived after a waiting time \(w(l)\). The device budget is allocated in two stages (tail protection + average optimization). In server-constrained scenarios, requests are routed based on a prompt length threshold \(l_{th}\): short prompts are sent to the device to save server budget, while long prompts are processed in parallel on both ends and the faster result is used.
Token-Level Migration Framework: A token buffer of size \(B = r_c \times t_m\) is built up by exploiting the gap between the token generation rate \(r_g\) and the human consumption rate \(r_c\), where \(t_m\) is the time the migration takes. Migration is triggered once the buffer holds enough tokens to cover the migration overhead; the source endpoint then ceases generation, and the target endpoint takes over seamlessly. Migration is only executed when the expected cost savings exceed the migration overhead: \(C_{migration} = \Delta c_{decode} \cdot l_{remaining} > \text{Overhead}_{migration}\), where \(\Delta c_{decode}\) is the per-token decoding cost difference between the two endpoints and \(l_{remaining}\) is the number of tokens still to be generated.
Efficient Token Transmission: When endpoints share the same vocabulary, token IDs are transmitted instead of the full text, reducing data volume by 35-54%. For different vocabularies, the data is converted to text first and then re-tokenized. Transmitting intermediate states (e.g., KV cache) is avoided because endpoints often utilize different model architectures.
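
The migration rule from the Token-Level Migration Framework item can be sketched in a few lines. This is a minimal illustration under stated assumptions, not DiSCo's implementation: the helper `plan_migration`, the `MigrationPlan` fields, and all numbers are hypothetical, and the fill-time estimate assumes the user consumes tokens at \(r_c\) from the start of generation.

```python
from dataclasses import dataclass


@dataclass
class MigrationPlan:
    buffer_tokens: float   # B = r_c * t_m
    time_to_fill_s: float  # time for the surplus rate (r_g - r_c) to build B
    worthwhile: bool       # expected savings exceed the migration overhead


def plan_migration(r_g, r_c, t_m, delta_c_decode, l_remaining, overhead):
    """Hypothetical helper: decide whether and when to migrate.

    r_g: generation rate of the source endpoint (tokens/s)
    r_c: human consumption rate (tokens/s)
    t_m: assumed migration time (s)
    delta_c_decode: per-token decode cost difference between endpoints
    l_remaining: estimated number of tokens still to generate
    overhead: migration cost, in the same unit as delta_c_decode
    """
    if r_g <= r_c:
        raise ValueError("no buffer can accumulate unless r_g > r_c")
    buffer_tokens = r_c * t_m
    # Assumption: the buffer grows at the surplus rate r_g - r_c.
    time_to_fill = buffer_tokens / (r_g - r_c)
    savings = delta_c_decode * l_remaining
    return MigrationPlan(buffer_tokens, time_to_fill, savings > overhead)


# Illustrative numbers only.
plan = plan_migration(r_g=20.0, r_c=5.0, t_m=2.0,
                      delta_c_decode=0.5, l_remaining=300, overhead=100.0)
print(plan)
```

With these made-up numbers the buffer must hold 10 tokens before the handover, and the migration pays off because the estimated savings (150) exceed the overhead (100).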

Loss & Training¶

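The optimization goal is to minimize average and tail TTFT while maintaining stable TBT, under the cost constraint \(\mathbb{E}[I_d(l)l] \leq b \cdot \mathbb{E}[l]\) in device-constrained scenarios or \(\mathbb{E}[I_s(l)l] \leq b \cdot \mathbb{E}[l]\) in server-constrained scenarios. Here \(l\) is the prompt length, \(I_d(l)\) and \(I_s(l)\) indicate whether a prompt of length \(l\) is executed on the device or the server, respectively, and \(b\) is the budget ratio.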
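
To make the server-side budget constraint concrete, the sketch below picks a length threshold from a sample of prompt lengths. It assumes \(I_s(l) = 1\) when a prompt of length \(l\) uses the server and 0 otherwise, treats \(b\) as the fraction of total prompt tokens the server may process, and replaces expectations with sample averages. The function names and sample data are hypothetical, and the paper's actual threshold computation may differ.

```python
def pick_length_threshold(lengths, b):
    """Return (l_th, used_fraction), where l_th is the smallest threshold
    such that sending every prompt with length >= l_th to the server keeps
    sum(I_s(l) * l) <= b * sum(l). l_th is None if no prompt fits."""
    if not lengths:
        raise ValueError("need at least one prompt length")
    total = sum(lengths)
    budget = b * total
    used = 0
    threshold = None
    # Lower the threshold one distinct length at a time, longest first.
    for l in sorted(set(lengths), reverse=True):
        cost = l * lengths.count(l)
        if used + cost > budget:
            break
        used += cost
        threshold = l
    return threshold, used / total


def route(l, l_th):
    """Short prompts go to the device; long prompts run on both ends."""
    if l_th is None or l < l_th:
        return "device"
    return "device+server"


sample = [10, 20, 30, 40, 100]  # made-up prompt lengths
l_th, used = pick_length_threshold(sample, b=0.75)
print(l_th, used, [route(l, l_th) for l in sample])
```

On this toy sample the threshold lands at 40 tokens, so the two longest prompts run on both ends and the server processes 70% of all prompt tokens, within the 75% budget.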

Experimental Results¶

Main Results¶

DiSCo is evaluated on four commercial LLM services (GPT-4o-mini, DeepSeek-V2.5, Cohere Command, LLaMA-3-70b) and three device configurations:

Platform/Model	Constraint	Tail TTFT Reduction (Pixel 7 Pro B-1.1B)	Tail TTFT Reduction (Xiaomi 14 Q-0.5B)
GPT	Server	23.85%	44.04%
GPT	Device	26.39%	16.32%
LLaMA	Server	11.08%	26.29%
LLaMA	Device	35.67%	21.29%
Command	Server	47.93%	52.23%
Command	Device	34.78%	24.42%

Ablation Study¶

Aspect	Conclusion
Migration Mechanism	Up to 72.7% cost reduction in device-constrained scenarios, and up to 83.6% in server-constrained scenarios
Request Arrival Interval	Benefits persist under real-world DiffusionDB workload patterns
Migration Impact on Generation Quality	Evaluations by three LLM judges (GPT-4o, Gemini, Qwen) show consistent quality retention
Scalability	DiSCo-S requires only 9.08ms on 100K samples, and DiSCo-D requires 14.86ms

Key Findings¶

DiSCo reduces average TTFT by 6-78% and tail TTFT by 11-52%. The migration mechanism saves up to 84% in serving costs while preserving comparable QoE.
Only a few tokens (3-17 on average) are delayed during migration, which is negligible compared to generation lengths of hundreds or thousands of tokens, leaving the P99 TBT unaffected.
The stability of device-side TTFT and TBT is significantly superior to that of the server side, providing a reliable predictive foundation for the collaborative strategy.

Highlights & Insights¶

Presents the first device-server collaborative LLM inference scheduling paradigm that goes beyond basic routing and dispatching.
Elegantly designs a token-level migration mechanism that leverages the delta between generation and consumption speeds to enable imperceptible switching.
Backed by extensive empirical data from real-world commercial LLM services (GPT, DeepSeek, etc.).

Limitations & Future Work¶

Focuses on application scenarios where device-side LLMs have achieved sufficient accuracy (e.g., chat, translation), but is not applicable to complex reasoning tasks.
Uses a FLOPs-based linear model for device energy consumption, whereas real-world energy consumption is more complex and influenced by factors such as battery health and temperature.
Only considers single-device scenarios, leaving coordination overheads and resource allocation problems in multi-device collaboration unexplored.
Does not consider privacy preservation issues, assuming users accept the transmission of data between the device and the server.

Related Work¶

Device-Server Collaborative LLMs: EdgeShard (Zhang et al., 2024) and WDMoE (Xue et al., 2024) deploy model shards across endpoints; LLMCad (Xu et al., 2023) leverages device-side models to reduce server costs.
LLM Routing: Ong et al. (2024) and Ding et al. (2024) route to different models based on query complexity, but do not optimize token delivery metrics.
LLM Inference Optimization: vLLM (Kwon et al., 2023) and Sarathi-Serve (Agrawal et al., 2024) optimize server-side throughput-latency trade-offs.

Rating¶

Dimension	Score (1-10)
Novelty	8
Practicality	9
Experimental Thoroughness	8
Writing Quality	7
Overall Rating	8