⚡ LLM Efficiency
💬 ACL2026 · 23 paper notes
📌 Same area in other venues: 📷 CVPR2026 (8) · 🔬 ICLR2026 (171) · 🧪 ICML2026 (48) · 🤖 AAAI2026 (9) · 🧠 NeurIPS2025 (34) · 📹 ICCV2025 (1)
🔥 Top topics: LLM ×7 · Reasoning ×2 · Diffusion Models ×2 · Alignment/RLHF ×2
- Alloc-MoE: Budget-Aware Expert Activation Allocation for Efficient Mixture-of-Experts Inference
-
The "number of activated experts" in MoE inference is abstracted as a global budget \(B\). Optimal Top-K allocation is performed across layers via dynamic programming (Alloc-L), followed by token-level redistribution using global Top-\((K \cdot T)\) selection (Alloc-T). This approach halves the activation budget of DeepSeek-V2-Lite while maintaining accuracy, achieving a 1.15× speedup in prefill and a 1.34× speedup in decode.
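The two allocation steps described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the per-layer `cost` table (a hypothetical measure of quality loss when a layer activates `k` experts), the function names, and the absence of any per-token minimum in the token-level step are all assumptions made for the example.

```python
from typing import List


def allocate_layer_budget(cost: List[List[float]], budget: int) -> List[int]:
    """Layer-level step: choose k per layer so that sum(k) == budget
    and the summed (hypothetical) cost is minimal.

    cost[l][k - 1] is the assumed degradation of layer l with k active experts.
    """
    n_layers, max_k = len(cost), len(cost[0])
    inf = float("inf")
    # best[l][b]: minimal cost of the first l layers using exactly b activations
    best = [[inf] * (budget + 1) for _ in range(n_layers + 1)]
    choice = [[0] * (budget + 1) for _ in range(n_layers + 1)]
    best[0][0] = 0.0
    for l in range(1, n_layers + 1):
        for b in range(budget + 1):
            for k in range(1, min(max_k, b) + 1):
                cand = best[l - 1][b - k] + cost[l - 1][k - 1]
                if cand < best[l][b]:
                    best[l][b] = cand
                    choice[l][b] = k
    if best[n_layers][budget] == inf:
        raise ValueError("budget cannot be met with 1..max_k experts per layer")
    # Walk the choice table backwards to recover the per-layer allocation.
    alloc, b = [], budget
    for l in range(n_layers, 0, -1):
        alloc.append(choice[l][b])
        b -= choice[l][b]
    return alloc[::-1]


def global_topk_tokens(scores: List[List[float]], k: int) -> List[List[int]]:
    """Token-level step: keep the K*T highest (token, expert) router scores
    across the whole layer instead of the top K for every token."""
    n_tokens = len(scores)
    pairs = [(s, t, e) for t, row in enumerate(scores) for e, s in enumerate(row)]
    pairs.sort(key=lambda p: -p[0])
    chosen: List[List[int]] = [[] for _ in range(n_tokens)]
    for _, t, e in pairs[: k * n_tokens]:
        chosen[t].append(e)
    return chosen
```

For example, with `cost = [[5.0, 1.0, 0.5], [2.0, 1.8, 1.7], [9.0, 3.0, 0.2]]` and a budget of 6, `allocate_layer_budget` returns `[2, 1, 3]`: the layer whose cost barely changes with `k` gives up activations to the layer that benefits most. With `scores = [[0.9, 0.8, 0.7], [0.4, 0.1, 0.05]]` and `k = 2`, `global_topk_tokens` returns `[[0, 1, 2], [0]]`, so the confident first token receives three experts and the second only one.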
- Are Large Language Models Economically Viable for Industry Deployment?
-
The Edge-Eval framework evaluates the full life cycle of LLMs on legacy T4 GPUs using five deployment metrics (Economic Break-even, Intelligence-Power Ratio, System Density, Cold Start Tax, and Quantization Fidelity). It finds that small models (<2B) outperform 7B models across the economic and ecological dimensions, and it identifies an anomaly in which QLoRA increases energy consumption by up to 7× despite reducing memory usage.
- Beyond Accuracy: Unveiling Inefficiency Patterns in Tool-Integrated Reasoning
-
This paper proposes PTE (Prefill Token Equivalents), a hardware-aware efficiency metric for tool-integrated reasoning (TIR) that unifies the costs of internal reasoning and external tool usage. Through large-scale experiments, it reveals four inefficiency patterns in TIR: confirmatory tool use, tool mixing, lack of tool priors, and tool format collapse.
- BOSCH: Black-Box Binary Optimization for Short-Context Attention-Head Selection in LLMs
-
The authors propose BOSCH, a training-free mixture-of-SWA (sliding-window attention) method that operates at the attention-head level. It models SWA head selection as a Large Neighborhood Search (LNS) problem and decomposes it into a three-stage optimization (Layer Importance Probing → Adaptive Rate Assignment → Grouped Head Selection). It consistently outperforms layer-level heuristics and six static head-level methods across four models and four ratio settings.
- Breaking Block Boundaries: Anchor-based History-stable Decoding for Diffusion Large Language Models
-
This paper proposes AHD (Anchor-based History-stable Decoding), a training-free, plug-and-play dynamic decoding strategy. By utilizing dynamic anchors to backtrack historical trajectories and identify cross-block stable tokens in diffusion LLMs, AHD achieves early unlocking. It reduces decoding steps by 80% on BBH while simultaneously improving performance by 3.67%.
- CoMeT: Collaborative Memory Transformer for Efficient Long Context Modeling
-
CoMeT introduces a "global memory + FIFO temporary memory" dual-memory plug-in for existing LLMs. By processing inputs in chunks, it achieves constant memory and linear time complexity. Fine-tuned only on 32k context, it enables precise retrieval at any position within 1M tokens and proposes hierarchical pipeline parallelism to allow fine-tuning 128k context on 16×80GB GPUs.
- CreditDecoding: Accelerating Parallel Decoding in Diffusion Large Language Models with Trace Credit
-
This paper proposes CreditDecoding, a training-free parallel decoding acceleration method that enhances correct but low-confidence tokens by accumulating token-level historical evidence (trace credit), achieving up to a 5.48× speedup and a 0.48 accuracy improvement on LLaDA-8B-Instruct.
- Lizard: An Efficient Linearization Framework for Large Language Models
-
Lizard replaces the softmax attention of pretrained Transformers with a hybrid subquadratic attention module (Gated Linear Attention for global compression + Anchor Window Attention for local precision + learnable gates replacing RoPE). Using only 0.04B tokens for distillation, it outperforms existing linearization methods by 9.4–24.5 points on 5-shot MMLU and achieves a 32% throughput increase via a tensor-core-friendly training algorithm.
- MTRouter: Cost-Aware Multi-Turn LLM Routing with History-Model Joint Embeddings
-
MTRouter models the selection of "which LLM to invoke at each turn" within multi-turn agent tasks as a per-turn routing problem under cost constraints. By using history-model joint embeddings to predict the contribution of candidate models to the final task outcome, it improves task performance while significantly reducing total invocation costs on ScienceWorld and HLE.
- Multi-Drafter Speculative Decoding with Alignment Feedback
-
This paper proposes MetaSD, a unified framework that integrates multiple heterogeneous drafters into speculative decoding. By modeling drafter selection as a Multi-Armed Bandit (MAB) problem and using Block Divergence as a reward signal, MetaSD dynamically selects the drafter most aligned with the target LLM. It consistently outperforms single-drafter methods in both black-box and white-box configurations.
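The bandit view of drafter selection can be illustrated with a short sketch. It uses UCB1, a standard bandit algorithm chosen here for illustration; the source does not state which bandit algorithm MetaSD uses, and the reward is assumed to be a number in [0, 1] derived from how well the drafted block matched the target model (a stand-in for the paper's Block Divergence signal). The class and method names are hypothetical.

```python
import math
from typing import List


class UCB1DrafterSelector:
    """Pick one of several drafters per speculation round with UCB1."""

    def __init__(self, n_drafters: int) -> None:
        self.counts: List[int] = [0] * n_drafters
        self.means: List[float] = [0.0] * n_drafters

    def select(self) -> int:
        # Try every drafter once before trusting the running statistics.
        for arm, count in enumerate(self.counts):
            if count == 0:
                return arm
        total = sum(self.counts)
        scores = [
            mean + math.sqrt(2.0 * math.log(total) / count)
            for mean, count in zip(self.means, self.counts)
        ]
        return max(range(len(scores)), key=scores.__getitem__)

    def update(self, arm: int, reward: float) -> None:
        # Incremental mean of the assumed [0, 1] alignment reward.
        self.counts[arm] += 1
        self.means[arm] += (reward - self.means[arm]) / self.counts[arm]
```

In use, each decoding round would call `select()`, run the chosen drafter, score the verified block, and pass that score to `update()`. Over time the selector concentrates on the drafter that aligns best with the target model while still occasionally re-testing the others.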
- Native Hybrid Attention for Efficient Sequence Modeling
-
This paper proposes Native Hybrid Attention (NHA), which unifies the long-term memory slots of linear RNNs with the short-term precise tokens of sliding windows through a single softmax attention operation. This natively unifies intra-layer and inter-layer mixing, dynamically allocating attention weights between long-term and short-term context without extra fusion parameters, and it outperforms Transformer and other hybrid baselines on recall-intensive and common-sense reasoning tasks.
- RACER: Retrieval-Augmented Contextual Rapid Speculative Decoding
-
RACER is a training-free speculative decoding method that unifies retrieval-based exact pattern matching with logit-based future prediction. By constructing a Logits Tree via a copy-logit strategy and a Retrieval Tree via an LRU-evicted AC automaton, it achieves over 2× inference acceleration across multiple benchmarks.
- Saber: Efficient Sampling with Adaptive Acceleration and Backtracking Enhanced Remasking for DLMs
-
This paper proposes Saber, a training-free sampling algorithm for Diffusion Language Models (DLMs). By utilizing adaptive acceleration (dynamically adjusting the volume of parallel decoding based on the established context) and backtracking-enhanced remasking (undoing tokens invalidated by new context), it achieves an average Pass@1 improvement of 1.9% while attaining a 251.4% inference speedup in code generation.
- Small Data, Big Noise: Adversarial Training for Robust Parameter-Efficient Fine-Tuning
-
This paper integrates adversarial training into Parameter-Efficient Fine-Tuning (PEFT). By employing a unified robust optimization framework, SDBN, it generates worst-case perturbations in the embedding space. Specific discrete uncertainty sets are introduced for "tokenization-breaking character noise" and "generative tasks." This approach significantly enhances the robustness of LoRA/Adapter/BitFit in low-data and noisy scenarios without adding trainable parameters or increasing VRAM.
- SpecBound: Adaptive Bounded Self-Speculation with Layer-wise Confidence Calibration
-
The authors propose SpecBound, a self-speculative decoding framework that suppresses spurious high-confidence predictions in shallow layers through layer-wise temperature annealing. By designing a bounded speculation algorithm to adaptively control the depth and width of drafts, the framework achieves up to 2.33× inference acceleration while maintaining lossless output.
- Speculative Verification: Exploiting Information Gain to Refine Speculative Decoding
-
This paper proposes Speculative Verification (SV), which introduces a companion model of the same scale as the draft model. By leveraging the similarity between draft and companion distributions to predict speculative accuracy, it dynamically adjusts the verification length to maximize effective throughput. This method achieves an average speedup of 1.4× and up to 1.9× compared to standard speculative decoding in large-batch inference scenarios.
- StructKV: Preserving the Structural Skeleton for Scalable Long-Context Inference
-
This paper proposes StructKV, a structure-aware KV Cache compression framework. It identifies global information hubs through cross-layer accumulated attention patterns (Global In-Degree Centrality), adaptively locates the optimal compression layer via Dynamic Pivot Detection, and separates computation from storage budgets using Structural Propagation & Decoupling. On LongBench and RULER, it achieves near full-context performance with 60% prefill + 10% KV retention.
- Tandem: Riding Together with Large and Small Language Models for Efficient Reasoning
-
Tandem has the large model generate only four types of short reasoning clues (Goal, Planning, Retrieval, and Action), while a small model uses perplexity and entropy to judge whether the clues are sufficient and then completes the answer. On MATH, GSM8K, and HumanEval, it matches or exceeds the performance of standalone large models using approximately 60% of the computational cost.
- Task-Aware LLM Routing with Multi-Level Task-Profile-Guided Data Synthesis for Cold-Start Scenarios
-
A multi-level task-profile-guided data synthesis framework is proposed to address the cold-start problem in LLM routing. TRouter, a routing method using task types as latent variables, is designed to model the query-cost-performance relationship via variational inference, achieving effective routing in both cold-start and in-domain settings.
- The Illusion of Specialization: Revealing the "Standing Committee" in Mixture-of-Experts Models
-
By introducing the CommitteeAudit framework, the authors discover a "Standing Committee" in MoE models—a compact, persistent set of experts consistently activated and dominating routing weights across different domains. This contrasts with the widely assumed domain-specific specialization, revealing an inherent centralized structure in sparse computation.
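One simple way to look for such a persistent expert set is to intersect the top-ranked experts of each domain. The sketch below is only an illustration of that idea, not the CommitteeAudit procedure: the `routing_mass` input (each expert's share of routing weight per domain), the `top_m` cutoff, and the function name are all hypothetical.

```python
from typing import Dict, List, Optional, Set


def standing_committee(routing_mass: Dict[str, List[float]], top_m: int) -> Set[int]:
    """Return experts that rank in the top_m by routing weight in every domain.

    routing_mass[domain][e] is the assumed share of routing weight that
    expert e receives on inputs from that domain.
    """
    committee: Optional[Set[int]] = None
    for mass in routing_mass.values():
        ranked = sorted(range(len(mass)), key=lambda e: -mass[e])
        top = set(ranked[:top_m])
        committee = top if committee is None else committee & top
    return committee if committee is not None else set()
```

With `{"code": [0.4, 0.3, 0.1, 0.2], "math": [0.35, 0.05, 0.2, 0.4], "law": [0.5, 0.1, 0.1, 0.3]}` and `top_m = 2`, the function returns `{0}`: expert 0 dominates in all three domains, which is the kind of domain-independent concentration the summary describes.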
- Threshold Differential Attention: Sink-free, Ultra-sparse, and Non-dispersive Long-context Attention
-
TDA (Threshold Differential Attention) combines length-adaptive thresholds with differential inhibitory views to make long-context Transformer attention sink-free and 99% exactly sparse while keeping performance competitive.
- TokenTiming: A Dynamic Alignment Method for Universal Speculative Decoding Model Pairs
-
TokenTiming re-encodes token sequences generated by a draft model into the target tokenizer space and utilizes Dynamic Time Warping (DTW) to construct many-to-many token alignments. This enables off-the-shelf small models with different vocabularies to serve as draft models for speculative decoding, achieving up to 1.57× speedup on various 14B-70B target models.
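To show how DTW yields a many-to-many alignment between two tokenizations of the same text, here is a minimal sketch of the classic algorithm. The distance function (zero when one token string contains the other, one otherwise) and all names are assumptions for illustration; the source does not specify the cost TokenTiming uses.

```python
from typing import Callable, List, Tuple


def substring_distance(x: str, y: str) -> float:
    """Hypothetical token distance: 0 if one piece contains the other."""
    return 0.0 if (x in y or y in x) else 1.0


def dtw_align(
    draft: List[str],
    target: List[str],
    dist: Callable[[str, str], float] = substring_distance,
) -> Tuple[float, List[Tuple[int, int]]]:
    """Classic DTW: returns the total cost and the warping path as
    (draft_index, target_index) pairs. Both sequences must be non-empty."""
    n, m = len(draft), len(target)
    inf = float("inf")
    acc = [[inf] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = min(acc[i - 1][j - 1], acc[i - 1][j], acc[i][j - 1])
            acc[i][j] = dist(draft[i - 1], target[j - 1]) + step
    # Backtrack from the end, always moving to the cheapest predecessor.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        moves = [
            (acc[i - 1][j - 1], i - 1, j - 1),
            (acc[i - 1][j], i - 1, j),
            (acc[i][j - 1], i, j - 1),
        ]
        _, i, j = min(moves, key=lambda mv: mv[0])
    return acc[n][m], path[::-1]
```

Aligning the draft pieces `["Hel", "lo", " world"]` with the target pieces `["Hello", " wor", "ld"]` gives a cost of `0.0` and the path `[(0, 0), (1, 0), (2, 1), (2, 2)]`: two draft tokens map onto one target token, and one draft token maps onto two target tokens, which is the many-to-many behavior that a fixed one-to-one vocabulary mapping cannot express.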
- Understanding LLM Performance Degradation in Multi-Instance Processing: The Roles of Instance Count and Context Length
-
This paper systematically evaluates the degradation patterns of 16 LLMs in multi-instance processing (MIP). It finds that performance decline is not solely caused by increasing context length; the instance count itself exerts a stronger influence on success rates. Specifically, almost all models collapse when processing over 1,000 instances and rarely proactively alert the user.