Semantic Parallelism: Redefining Efficient MoE Inference via Model-Data Co-Scheduling¶

Conference: ICLR 2026
arXiv: 2503.04398
Code: Implemented on SGLang (~5000 lines of Python + Triton kernels)
Area: LLM Efficiency
Keywords: Mixture-of-Experts, Expert Parallelism, all-to-all communication, model-data co-scheduling, token-expert affinity

TL;DR¶

This paper proposes the Semantic Parallelism (SP) paradigm, which predicts token-expert routing paths and co-schedules model placement with data dispatch to substantially reduce all-to-all communication overhead in MoE inference under expert parallelism. It achieves up to 2.78× throughput improvement in Attention-DP settings and up to 24.9% latency reduction in Attention-TP settings.

Background & Motivation¶

MoE inference is bottlenecked by all-to-all communication: Expert Parallelism (EP) distributes experts across multiple GPUs but requires two all-to-all collective communications to route tokens to remote experts and back. Even on 400 GB/s high-speed interconnects, this accounts for 59.2% of MoE layer forward latency.

Existing approaches decouple model placement from data scheduling: Expert placement and token dispatch are treated as independent problems, resulting in substantial unnecessary cross-device communication.

Tokens exhibit context-independent expert affinity: Profiling reveals that tokens consistently activate a concentrated and stable set of top-k experts across different contexts (median cumulative activation probability of top-k experts: 0.833–0.976), providing a basis for predictive routing.

Widespread deployment of MoE models such as DeepSeek-V3/R1 and Qwen3 makes EP communication optimization a critical industrial need.

Method¶

Core Insight: Context-Independent Token-Expert Affinity¶

Profiling DeepSeek-V2-Lite on ShareGPT reveals that each token consistently routes to the same top-k expert subset across different contexts.
The median F1-score across all layers reaches 0.833–1.000, and the maximum activation frequency of non-top-k experts is only ~0.05.
Based on this, a token-expert activation frequency table \(T^{(L)} \in \mathbb{N}^{t \times N}\) is constructed to estimate routing probabilities.
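A minimal sketch of how such a frequency table could be accumulated from profiled routing traces is given here. The function names, the sparse dictionary layout, and the normalization by total activations are assumptions made for illustration; the summary only states that a count table \(T^{(L)}\) is built per layer and used to estimate routing probabilities.

```python
from collections import Counter, defaultdict

def build_affinity_table(token_ids, expert_ids):
    """Count how often each token id activates each expert in one layer.

    token_ids:  vocabulary ids, one per profiled token occurrence
    expert_ids: top-k expert lists chosen by the router, aligned with token_ids
    Returns {token_id: Counter({expert_id: count})}, a sparse view of T.
    """
    table = defaultdict(Counter)
    for tok, experts in zip(token_ids, expert_ids):
        table[tok].update(experts)
    return table

def routing_probabilities(table, token_id):
    """Empirical share of token_id's activations that went to each expert."""
    counts = table.get(token_id)
    if not counts:
        return {}  # token never profiled: no estimate available
    total = sum(counts.values())
    return {expert: c / total for expert, c in counts.items()}

# Toy trace: token 0 appears twice, token 1 once, with top-2 routing.
table = build_affinity_table([0, 0, 1], [[0, 1], [0, 2], [2, 3]])
probs = routing_probabilities(table, 0)
```

An unseen token returns an empty estimate, which mirrors the cold-start limitation noted later in this summary.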

Offline Model Scheduling: Expert Clustering and Placement¶

Model-data co-scheduling is formulated as a 0-1 integer linear programming (ILP) problem.
Objective: \(L = \theta \cdot \text{load balance term} + (1-\theta) \cdot \text{remote activation minimization term}\), where \(\theta\) weights load balance against remote-activation minimization.
Constraints: each token/expert belongs to exactly one cluster; each cluster contains an equal number of experts.
An alternating optimization algorithm is used for efficient solving, placing co-activated experts on the same GPU.
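The alternating idea can be illustrated with a small heuristic: fix the expert placement and assign tokens, then fix the token assignment and re-place experts, and repeat until the placement stops changing. This is a simplified sketch, not the paper's solver. It assumes an affinity matrix `A` (tokens by experts), keeps the equal-cluster-size constraint, and drops the explicit load-balance term of the objective; all function names are hypothetical.

```python
def assign_tokens(A, expert_cluster, num_clusters):
    """Fix expert placement; send each token to the cluster holding most of its affinity."""
    token_cluster = []
    for row in A:
        mass = [0.0] * num_clusters
        for e, a in enumerate(row):
            mass[expert_cluster[e]] += a
        token_cluster.append(max(range(num_clusters), key=mass.__getitem__))
    return token_cluster

def assign_experts(A, token_cluster, num_clusters):
    """Fix token assignment; greedily place experts, at most N / C experts per cluster."""
    num_experts = len(A[0])
    cap = num_experts // num_clusters
    score = [[0.0] * num_clusters for _ in range(num_experts)]
    for row, c in zip(A, token_cluster):
        for e, a in enumerate(row):
            score[e][c] += a
    # Visit (expert, cluster) pairs from highest to lowest accumulated affinity.
    pairs = sorted(
        ((score[e][c], e, c) for e in range(num_experts) for c in range(num_clusters)),
        key=lambda p: -p[0],
    )
    expert_cluster = [None] * num_experts
    load = [0] * num_clusters
    for _, e, c in pairs:
        if expert_cluster[e] is None and load[c] < cap:
            expert_cluster[e] = c
            load[c] += 1
    return expert_cluster

def co_schedule(A, num_clusters, iters=10):
    num_experts = len(A[0])
    assert num_experts % num_clusters == 0, "clusters must hold equal numbers of experts"
    expert_cluster = [e % num_clusters for e in range(num_experts)]  # round-robin start
    for _ in range(iters):
        token_cluster = assign_tokens(A, expert_cluster, num_clusters)
        updated = assign_experts(A, token_cluster, num_clusters)
        if updated == expert_cluster:
            break
        expert_cluster = updated
    return expert_cluster, assign_tokens(A, expert_cluster, num_clusters)

# Tokens 0-1 favour experts 0-1; tokens 2-3 favour experts 2-3.
A = [[0.5, 0.5, 0.0, 0.0],
     [0.6, 0.4, 0.0, 0.0],
     [0.0, 0.0, 0.5, 0.5],
     [0.0, 0.0, 0.3, 0.7]]
expert_cluster, token_cluster = co_schedule(A, num_clusters=2)
```

On this toy input the round-robin start splits each co-activated pair across clusters, and one alternation step brings experts 0 and 1 (and 2 and 3) back together.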

Online Data Scheduling (Two Settings)¶

Attention-DP Setting (Inter-request Scheduling):
- Predicts the target DP rank for an entire request based on token-expert affinity.
- \(S_r = \arg\max_j \sum_{i \in r} R_{ij}\), where \(R_{ij}\) scores the affinity of token \(i\) to device \(j\); each request \(r\) is assigned to the device hosting the experts most activated by its tokens.
- Combined with workload-aware balanced scheduling to ensure load balance across ranks.
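The request-level rule can be sketched as follows. The `device_affinity` helper derives \(R\) from per-token routing probabilities and an expert-to-device map, and `schedule_request` applies the argmax. The `slack` threshold used for load balancing is a hypothetical stand-in: the summary does not describe how the workload-aware step actually works.

```python
def device_affinity(probs, expert_device, num_devices):
    """R[i][j]: probability mass of token i's experts hosted on device j."""
    R = []
    for row in probs:
        mass = [0.0] * num_devices
        for e, p in enumerate(row):
            mass[expert_device[e]] += p
        R.append(mass)
    return R

def schedule_request(request_tokens, R, load, slack=1.5):
    """Pick a DP rank for one request and update `load` (queued tokens per rank).

    slack (assumed knob, must be >= 1): a rank is skipped when taking the
    request would push it above slack * (lightest load + request length).
    """
    n = len(request_tokens)
    num_devices = len(load)
    # S_r = argmax_j sum_{i in r} R_ij, restricted to ranks that are not overloaded.
    scores = [sum(R[t][j] for t in request_tokens) for j in range(num_devices)]
    limit = slack * (min(load) + n)
    eligible = [j for j in range(num_devices) if load[j] + n <= limit]
    rank = max(eligible, key=scores.__getitem__)
    load[rank] += n
    return rank

probs = [[0.9, 0.1, 0.0, 0.0],   # token 0: experts on device 0
         [0.0, 0.0, 0.5, 0.5],   # token 1: experts on device 1
         [0.5, 0.0, 0.5, 0.0]]   # token 2: split
R = device_affinity(probs, expert_device=[0, 0, 1, 1], num_devices=2)
load = [0, 0]
first = schedule_request([0, 0, 2], R, load)   # affinity picks rank 0
second = schedule_request([0, 0, 0], R, load)  # rank 0 is now too loaded
```

The second request would prefer rank 0 on affinity alone, but the load check diverts it to rank 1, which is the trade-off the balanced scheduling step has to manage.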

Attention-TP Setting (Intra-request Token Scheduling):
- Leverages Markov dependencies in inter-layer expert selection (2-gram device transition model) to enhance prediction.
- Designs Shuffled-Reduce-Scatter (SRS) and Shuffled-AllGather (SAG) fused communication primitives.
- Embeds speculative token reordering seamlessly into existing TP communication phases with only ~1% overhead.
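A rough sketch of the two ingredients named above follows: a 2-gram device transition model, and the token permutation that a shuffled collective would apply. Everything here is illustrative. The add-one smoothing, the way the token prior is multiplied in, and the function names are assumptions, and the real SRS/SAG primitives fuse the reordering into GPU collectives rather than running it on Python lists.

```python
def fit_transitions(device_paths, num_devices):
    """counts[layer][a][b]: times device a at `layer` was followed by device b at layer + 1."""
    num_layers = len(device_paths[0])
    counts = [[[0] * num_devices for _ in range(num_devices)]
              for _ in range(num_layers - 1)]
    for path in device_paths:
        for layer in range(num_layers - 1):
            counts[layer][path[layer]][path[layer + 1]] += 1
    return counts

def predict_next_device(counts, layer, current_device, token_prior=None):
    """Most likely device at layer + 1, optionally weighted by a per-token prior."""
    row = counts[layer][current_device]
    num_devices = len(row)
    total = sum(row)
    probs = [(c + 1) / (total + num_devices) for c in row]  # add-one smoothing (assumed)
    if token_prior is not None:
        probs = [p * q for p, q in zip(probs, token_prior)]
    return max(range(num_devices), key=probs.__getitem__)

def group_by_device(tokens, predicted_device):
    """Stable reordering that makes tokens bound for the same device contiguous."""
    perm = sorted(range(len(tokens)), key=predicted_device.__getitem__)  # stable argsort
    inverse = [0] * len(perm)
    for new_pos, old_pos in enumerate(perm):
        inverse[old_pos] = new_pos
    return [tokens[i] for i in perm], perm, inverse

# Each row: the device a profiled token visited at layers 0, 1, 2.
paths = [[0, 1, 1], [0, 1, 0], [1, 0, 0], [0, 1, 1]]
counts = fit_transitions(paths, num_devices=2)
nxt = predict_next_device(counts, layer=0, current_device=0)

shuffled, perm, inverse = group_by_device(["a", "b", "c", "d"], [1, 0, 1, 0])
restored = [shuffled[i] for i in inverse]
```

Keeping the inverse permutation around is what lets the original token order be restored after the expert computation.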

System Implementation¶

Built on SGLang with ~5000 lines of Python and custom Triton kernels.
The optimized argsort kernel is 25% faster than PyTorch's native implementation.
Integrates the DeepEP communication library for efficient all-to-all operations.

Key Experimental Results¶

Attention-DP Setting (Throughput under SLO Constraints)¶

Model	vs SGLang (TTFT SLO)	vs SGLang (E2E SLO)	vs MoETuner (TTFT)	vs MoETuner (E2E)
DeepSeek-V2-Lite	+31%	+221%	+32%	+278%
Qwen3-30B-A3B	+98%	+11%	+35%	+32%

Attention-TP Setting (Latency Reduction)¶

Model (p99 TTFT)	Input Length 256	Input Length 512	Input Length 1024
DeepSeek-V2-Lite	-12.21%	-10.60%	-18.89%
Qwen3-30B-A3B	-17.16%	-24.90%	-3.80%

Key Findings¶

Local Activation Rate (LAR) improves by 37–43% over vanilla EP, corresponding to a 41.8–46.6% reduction in EP layer latency.
The co-scheduling algorithm achieves 15.4% higher LAR than the best baseline (MoETuner) with lower load imbalance.
Zero-shot cross-dataset transfer results validate the generalizability of the scheduling strategy.
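The summary does not define LAR formally. One plausible reading, assumed here, is the fraction of token-expert activations whose expert lives on the same device as the token, so that no all-to-all transfer is needed. The helper below is a hypothetical illustration of that reading.

```python
def local_activation_rate(token_device, expert_ids, expert_device):
    """Share of (token, expert) activations served without leaving the token's device.

    token_device:  device holding each token
    expert_ids:    top-k experts activated by each token
    expert_device: device hosting each expert
    """
    local = total = 0
    for dev, experts in zip(token_device, expert_ids):
        for e in experts:
            total += 1
            local += expert_device[e] == dev
    return local / total if total else 0.0

# Two tokens with top-2 routing; one of each token's two experts is local.
lar = local_activation_rate(
    token_device=[0, 1],
    expert_ids=[[0, 2], [1, 3]],
    expert_device=[0, 0, 1, 1],
)
```

Under this reading, both better expert placement and better token or request placement raise the same metric, which is why it suits a co-scheduling evaluation.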

Highlights & Insights¶

Introduces the "Semantic Parallelism" paradigm, shifting communication optimization from reactive mitigation to proactive prevention.
Reveals the context-independent nature of token-expert affinity, providing a theoretical foundation for predictive scheduling.
Jointly optimizes both model placement and data scheduling, offering a systematic rather than a localized solution.
The SRS/SAG fused primitive design is elegant, embedding token reordering into existing communication flows with only ~1% additional overhead.

Limitations & Future Work¶

Validation is limited to single-node 8-GPU setups; effectiveness in cross-node or low-bandwidth interconnect settings remains to be verified.
The prediction model requires offline profiling data, making it ineffective during cold starts.
MoE variants with highly dynamic routing mechanisms require re-profiling after changes to the gating function.
Joint evaluation with KV cache optimization or quantization techniques has not been conducted.

Related Work¶

Expert placement: MoETuner (ILP-based placement), ExFlow (inter-layer expert affinity), EPLB (DeepSeek's load balancing).
MoE inference systems: DeepSpeed-MoE, Tutel, vLLM, SGLang.
Prefetching/offloading: Pre-gated MoE (architecture modification for next-layer expert prediction); by contrast, Sem-MoE, the system presented in this paper, requires no architectural changes.
This work is the first to jointly optimize both model scheduling and data scheduling.

Rating ⭐⭐⭐⭐⭐¶

Novelty: 5/5 — The Semantic Parallelism paradigm and co-scheduling concept are highly original.
Experimental Thoroughness: 4/5 — Covers two models and two settings, but limited to single-node evaluation.
Writing Quality: 4/5 — System description is clear with high-quality figures.
Value: 5/5 — Addresses a core bottleneck in MoE inference with significant industrial relevance.