⚡ LLM Efficiency
🧪 ICML2026 · 11 paper notes
📌 Same area in other venues: 💬 ACL2026 (13) · 📷 CVPR2026 (4) · 🔬 ICLR2026 (18) · 🤖 AAAI2026 (9) · 🧠 NeurIPS2025 (35) · 📹 ICCV2025 (1)
🔥 Top topics: LLM ×5
- A Queueing-Theoretic Framework for Stability Analysis of LLM Inference with KV Cache Memory Constraints
This work establishes the first queueing model for LLM inference that explicitly incorporates the dynamic behavior of KV cache memory. It derives a closed-form stability condition \(\lambda < \mu(1-\delta)\) that lets operators directly compute the required number of GPUs. Validation on a single GPU, on 8-GPU clusters, and on real LongBench data shows prediction error within \(10\%\).
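As a rough illustration of how the stability condition in the queueing paper above could translate into a GPU count, the sketch below assumes that \(\lambda\) is the request arrival rate, \(\mu\) is the per-GPU service rate, \(\delta\) is the fraction of capacity lost to KV cache memory pressure, and that capacity scales linearly with the number of GPUs. The summary does not define these symbols, so these readings and the helper name `min_gpus` are assumptions for illustration, and the paper's own sizing rule may differ.

```python
import math


def min_gpus(arrival_rate: float, service_rate: float, delta: float) -> int:
    """Smallest n such that arrival_rate < n * service_rate * (1 - delta).

    Assumed meanings (not taken from the paper): arrival_rate is lambda,
    service_rate is the per-GPU mu, delta is the capacity fraction lost to
    KV cache constraints, and GPUs add capacity linearly.
    """
    if not 0.0 <= delta < 1.0:
        raise ValueError("delta must lie in [0, 1)")
    if service_rate <= 0.0 or arrival_rate < 0.0:
        raise ValueError("service_rate must be positive, arrival_rate non-negative")
    effective_rate = service_rate * (1.0 - delta)
    # The inequality is strict, so a load that exactly matches capacity
    # still needs one more GPU: floor(ratio) + 1.
    return max(1, math.floor(arrival_rate / effective_rate) + 1)


if __name__ == "__main__":
    # 10 req/s offered, 4 req/s per GPU, 25% of capacity lost to KV pressure.
    print(min_gpus(10.0, 4.0, 0.25))
```

Under these assumptions each GPU contributes an effective rate of 3 req/s, so four GPUs are needed to keep the offered load strictly below capacity.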
- Not All Prefills Are Equal: PPD Disaggregation for Multi-turn LLM Serving
This paper observes that, in multi-turn dialogue, the traditional Prefill-Decode (PD) disaggregation architecture is highly inefficient because every turn repeats P→D recomputation and KV transmission. It proposes PPD (Prefill-capable Decode), a dynamic routing system in which decode nodes decide, based on SLO weights, whether to process the Turn 2+ append-prefill locally. This reduces Turn 2+ TTFT by approximately 68%.
- OServe: Accelerating LLM Serving via Spatial-Temporal Workload Orchestration
OServe jointly models LLM serving’s “resource allocation + parallelism strategy + request routing” as a two-level max-flow problem on a flow network. It leverages LSTM-based workload prediction and ad hoc model switching via GPU interconnects to address real-world traffic heterogeneity in both spatial (different request types) and temporal (composition shifts over time) dimensions. Compared to vLLM, OServe achieves an average 1.5× and up to 2× improvement in end-to-end P99 latency and throughput.
- PipeSD: An Efficient Cloud-Edge Collaborative Pipeline Inference Framework with Speculative Decoding
PipeSD turns speculative decoding from sequential cloud-edge execution into a token-batch pipeline and replaces the fixed draft length with dual-threshold NAV triggering and Bayesian autotuning. On a real 5G cloud-edge testbed, PipeSD achieves 1.16×–2.16× speedup and a 14–25% reduction in cloud energy consumption.
- Plan for Speed: Dilated Scheduling for Masked Diffusion Language Models
This work proposes the Dilated Unmasking Scheduler (DUS): by using a "dilated, equidistant" predefined unmasking order that does not rely on model confidence, the number of denoiser calls per block of \(B\) tokens is reduced from \(\mathcal O(B)\) to \(\mathcal O(\log B)\). On LLaDA / Dream / DiffuCoder, this achieves a 5.8× wall-clock speedup with quality surpassing confidence-based parallel planners.
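To make the \(\mathcal O(\log B)\) count in the DUS summary above concrete, here is a minimal sketch of one possible dilated, equidistant schedule. It is an assumed construction (strides that halve each round, block size restricted to a power of two), not the schedule defined in the paper, and the function name `dilated_schedule` is hypothetical.

```python
def dilated_schedule(block_size: int) -> list[list[int]]:
    """Group positions 0..block_size-1 into rounds of equally spaced indices.

    The stride halves each round (B, B/2, ..., 1), so every position is
    revealed exactly once and the block is covered in log2(B) + 1 rounds.
    The order is fixed in advance and never consults model confidence.
    """
    if block_size < 1 or block_size & (block_size - 1):
        raise ValueError("block_size must be a power of two")
    rounds: list[list[int]] = []
    seen: set[int] = set()
    stride = block_size
    while stride >= 1:
        # Take every stride-th position that earlier rounds have not revealed.
        positions = [i for i in range(0, block_size, stride) if i not in seen]
        seen.update(positions)
        rounds.append(positions)
        stride //= 2
    return rounds


if __name__ == "__main__":
    for step, positions in enumerate(dilated_schedule(8)):
        print(step, positions)
```

For a block of 8 tokens this yields four rounds, `[0]`, `[4]`, `[2, 6]`, `[1, 3, 5, 7]`, so one denoiser call per round would replace the eight calls of a one-token-at-a-time order.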
- Scout: Active Information Foraging for Long-Text Understanding with Decoupled Epistemic States
Scout reframes million-token long-text understanding as an "active information foraging" process, introducing a provenance-anchored, trajectory-decoupled epistemic state \(\mathcal{E}_t\) as the sole basis for reasoning. Through gap-diagnosed self-evaluation, it iteratively contracts to a query-sufficient subset. On LooGLE-v2 and \(\infty\)Bench, it matches or surpasses state-of-the-art models like Gemini-3-Pro, while reducing token cost to about \(1/8\).
- SLAY: Geometry-Aware Spherical Linearized Attention with Yat-Kernel
SLAY linearizes the Yat-kernel, which is inspired by the physical "inverse-square interaction," in four steps: (1) spherical normalization, (2) Laplace integral representation via Bernstein's theorem, (3) Gauss-Laguerre quadrature, and (4) tensor product positive random features for polynomial+exponential kernels. This yields an \(O(L)\) attention mechanism nearly indistinguishable from softmax attention.
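Steps (2) and (3) of the SLAY summary above rest on a generic numerical trick that can be sketched in isolation: an inverse term is written as a Laplace integral and then replaced by a short weighted sum of exponentials. The example below uses the scalar identity \(1/a = \int_0^\infty e^{-at}\,dt\) as a stand-in, not the Yat-kernel itself, together with the standard two-point Gauss-Laguerre rule. The function name `inverse_via_quadrature` is hypothetical, and how SLAY applies the idea to its kernel is not specified in the summary.

```python
import math

# Two-point Gauss-Laguerre rule: integral of e^{-t} f(t) over [0, inf)
# is approximated by sum(w_i * f(t_i)); exact for polynomials up to degree 3.
SQRT2 = math.sqrt(2.0)
NODES = (2.0 - SQRT2, 2.0 + SQRT2)
WEIGHTS = ((2.0 + SQRT2) / 4.0, (2.0 - SQRT2) / 4.0)


def inverse_via_quadrature(a: float) -> float:
    """Approximate 1/a = int_0^inf e^{-a t} dt for a > 0.

    Writing e^{-a t} = e^{-t} * e^{-(a - 1) t} puts the integral in
    Gauss-Laguerre form, so 1/a becomes a weighted sum of exponentials.
    With only two nodes the estimate is accurate for a close to 1.
    """
    if a <= 0.0:
        raise ValueError("a must be positive")
    return sum(w * math.exp(-(a - 1.0) * t) for t, w in zip(NODES, WEIGHTS))


if __name__ == "__main__":
    for a in (1.0, 1.2):
        print(a, inverse_via_quadrature(a), 1.0 / a)
```

Each term in the sum is an exponential, which is the kind of factor that step (4) then approximates with random features; using more quadrature nodes would widen the range of \(a\) over which the sum stays accurate.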
- Theoretically Optimal Attention/FFN Ratios in Disaggregated LLM Serving
This work provides the first theoretical framework for the emerging Attention-FFN Disaggregation (AFD) inference architecture. Based on a probabilistic workload model with "finite mean prefill length + decode length following a geometric distribution," it derives a closed-form solution for the optimal A/F ratio under the rA-1F topology: \(r^*=\max\{r_A, r_C, r_{\text{peak}}\}\). A trace-calibrated simulator verifies that the theoretical and empirical optima differ by less than 10%.
- Towards Resource-Efficient LLMs: End-to-End Energy Accounting of Distillation Pipelines
The authors build a staged GPU energy measurement framework based on NVML, decomposing the distillation pipeline into "teacher side + student side + evaluation" for segment-wise accounting. They find that one-off teacher logit caching/synthetic data generation dominates energy use, causing KD and synthetic SFT to consume about \(2.4\times\) more energy than direct SFT for 1B–13B OLMo-2 students. They also provide a closed-form break-even formula, showing that distillation is only truly "energy-saving" when teacher outputs are reused more than \(N^*\) times.
- Training-Inference Consistent Segmented Execution for Long-Context LLMs
This paper proposes a long-context LLM framework where training and inference share exactly the same segmented forward execution semantics: only a fixed-length differentiable KV tail is retained across segments, plus a forward-only retrieval bypass. On LLaMA2-7B 32K/80K, it achieves comparable or even better LongBench/RULER performance than full attention with about \(6\times\) lower prefill peak memory.
- Understand and Accelerate Memory Processing Pipeline for Large Language Model Inference
This paper unifies optimizations in modern long-context LLM inference, such as sparse attention, RAG, and compressed context memory, into a four-stage "Prepare Memory → Compute Relevancy → Retrieval → Apply to Inference" memory processing pipeline. It shows quantitatively that this pipeline accounts for 22%–97% of total latency and that each stage exhibits highly heterogeneous computational characteristics. Based on this, the authors propose a GPU-FPGA heterogeneous system: regular/compute-intensive operations remain on the GPU, while sparse/irregular/memory-intensive operations are offloaded to the FPGA. On MI210 + Alveo U55C, the system achieves up to 2.2× end-to-end speedup and 4.7× energy reduction.