⚡ LLM Efficiency

🧪 ICML2025 · 12 paper notes

Autonomy-of-Experts Models (AoE)

AoE proposes allowing experts in an MoE to autonomously decide whether to process an input based on their own internal activation norms (rather than being determined by an external router). By reducing precomputation overhead through low-rank weight factorization, AoE outperforms traditional MoE in pre-training 700M-4B parameter language models.
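
The self-selection idea can be illustrated with a minimal sketch, assuming each expert owns a small low-rank projection and is ranked by the L2 norm of its own activation. The function names, the list-of-rows matrix layout, and the `top_k` parameter are hypothetical and are not taken from the paper.

```python
import math

def matvec(matrix, vec):
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in matrix]

def select_experts_by_norm(x, low_rank_projections, top_k):
    """Rank experts by the norm of their own low-rank activation of x.

    Assumption: low_rank_projections[i] is the cheap first factor of expert i,
    so computing it for every expert is the only precomputation needed.
    """
    activations = [matvec(proj, x) for proj in low_rank_projections]
    norms = [math.sqrt(sum(a * a for a in act)) for act in activations]
    ranked = sorted(range(len(norms)), key=lambda i: norms[i], reverse=True)
    chosen = ranked[:top_k]
    # The chosen experts can reuse their cached activation for the rest of
    # their forward pass; the other experts stop here.
    cached = {i: activations[i] for i in chosen}
    return chosen, cached, norms
```

No router appears in this sketch: the ranking signal comes entirely from the experts' own activations.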

Cooperation of Experts: Fusing Heterogeneous Information with Large Margin

This paper proposes the Cooperation of Experts (CoE) framework, which encodes heterogeneous information into multiplex networks. Through a two-level expert design and large-margin confidence tensor optimization, CoE achieves expert cooperation (rather than competition), comprehensively outperforming existing MoE and multiplex network methods in node classification tasks.

Curse of High Dimensionality Issue in Transformer for Long-context Modeling

This paper revisits the attention redundancy issue in sequence modeling from a supervised learning perspective and proposes the Dynamic Group Attention (DGA) mechanism. By dynamically grouping and aggregating unimportant tokens to reduce redundancy in attention computation, DGA maintains competitive performance while substantially reducing inference latency (achieving a 2.42x inference speedup for LLaMA2-7B under 16K context).
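
As a rough illustration of grouping and aggregating unimportant tokens, the sketch below keeps the highest-scoring tokens individually and mean-pools the rest in fixed-size groups. The importance scores, `keep_top`, `group_size`, and the choice of mean pooling are all assumptions made for illustration; the note does not specify how DGA scores or groups tokens.

```python
def group_unimportant_tokens(values, importance, keep_top, group_size):
    """Keep the keep_top most important tokens; mean-pool the others in groups."""
    order = sorted(range(len(values)), key=lambda i: importance[i], reverse=True)
    important = sorted(order[:keep_top])   # restore positional order
    rest = sorted(order[keep_top:])
    kept = [values[i] for i in important]
    dim = len(values[0])
    grouped = []
    for start in range(0, len(rest), group_size):
        members = rest[start:start + group_size]
        # One aggregated vector stands in for the whole group.
        pooled = [sum(values[i][d] for i in members) / len(members) for d in range(dim)]
        grouped.append(pooled)
    return kept, grouped
```

Attention would then run over `len(kept) + len(grouped)` entries instead of one entry per token.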

DSSD: Efficient Edge-Device LLM Deployment and Collaborative Inference via Distributed Split Speculative Decoding

This paper proposes the Distributed Split Speculative Decoding (DSSD) framework, which splits the verification stage of speculative decoding between the device and the edge. By replacing multiple uplink transmissions (the SLM's \(\gamma\) vocabulary distributions) with a single downlink transmission (a single vocabulary distribution of the LLM), DSSD significantly reduces communication latency while maintaining identical inference quality.
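
The communication saving can be made concrete with a back-of-the-envelope sketch that only counts payload bytes. The function names and the 2-byte probability encoding are hypothetical, and real latency also depends on uplink and downlink rates, which this sketch ignores.

```python
def distribution_payload_bytes(num_distributions, vocab_size, bytes_per_prob=2):
    """Bytes needed to transmit full vocabulary distributions."""
    return num_distributions * vocab_size * bytes_per_prob

def compare_payloads(gamma, vocab_size, bytes_per_prob=2):
    """Compare gamma uplink distributions against one downlink distribution."""
    uplink = distribution_payload_bytes(gamma, vocab_size, bytes_per_prob)
    downlink = distribution_payload_bytes(1, vocab_size, bytes_per_prob)
    return {
        "uplink_bytes": uplink,
        "downlink_bytes": downlink,
        "ratio": uplink / downlink,
    }
```

Under these assumptions the payload shrinks by a factor of \(\gamma\), since the vocabulary size and encoding cancel out.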

EasyInv: Toward Fast and Better DDIM Inversion

This paper proposes EasyInv, which, during inversion, periodically aggregates the current latent state with the previous one via a weighted sum (analogous to Kalman filtering). This strengthens the influence of the initial latent and suppresses accumulated noise error, achieving inversion quality comparable or even superior to that of iterative methods without any iterative optimization, while speeding up inference by approximately 3x.
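
The periodic weighted sum can be sketched as follows. This is a minimal illustration, not the paper's implementation: `step_fn` is a hypothetical placeholder for one inversion step, and the `period` and `eta` values are assumed for the example.

```python
def blend(current, previous, eta):
    """Weighted sum of the current and previous latent states."""
    return [eta * c + (1.0 - eta) * p for c, p in zip(current, previous)]

def invert_with_aggregation(z0, step_fn, num_steps, period=2, eta=0.5):
    """Run num_steps inversion steps, blending with the previous latent
    every `period` steps (period and eta are illustrative assumptions)."""
    z = list(z0)
    for t in range(1, num_steps + 1):
        z_next = step_fn(z, t)
        if t % period == 0:
            # Pull the new latent back toward the previous one to damp
            # the error accumulated by the inversion step.
            z_next = blend(z_next, z, eta)
        z = z_next
    return z
```

Because the blend is a single weighted sum, it adds no extra model evaluations to the inversion loop.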

Efficient Length-Generalizable Attention via Causal Retrieval for Long-Context Language Modeling

This paper proposes the Grouped Cross-Attention (GCA) mechanism, which integrates chunk-level causal retrieval into the attention mechanism to obtain an end-to-end learnable retriever. The resulting Differentiable Retrieval-based Transformer (DRT) reaches near-perfect accuracy on the passkey retrieval test with a 16M context, generalizing to lengths up to 1000 times the training length.

Ladder Residual: Parallelism-Aware Architecture for Accelerating Large Model Inference

This paper proposes Ladder Residual, a simple architectural modification that shifts the input of each block from the output of the previous layer to the output of the layer before the previous one (a staggered residual connection). This design decouples module computation from AllReduce communication, so that communication and computation can overlap. It achieves a 29% end-to-end speedup in 8-GPU Tensor Parallelism (TP) inference on a 70B model, while keeping model performance comparable to that of standard Transformers.
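
The staggered connection can be contrasted with the standard one in a toy sketch on scalars. The `pending` variable is a hypothetical stand-in for a block output whose AllReduce is still in flight; no real communication happens here.

```python
def standard_residual(x, blocks):
    """Each block reads the residual stream that already includes the
    previous block's output, so it must wait for that output."""
    for block in blocks:
        x = x + block(x)
    return x

def ladder_residual(x, blocks):
    """Each block reads the residual stream *before* the previous block's
    output has been added, so its computation does not depend on it."""
    pending = 0.0  # previous block's output, conceptually still in AllReduce
    for block in blocks:
        out = block(x)   # independent of `pending`, so the two can overlap
        x = x + pending  # the previous output joins the stream only now
        pending = out
    return x + pending
```

The two functions compute different values for more than one block, which is why the change is an architectural modification rather than a pure scheduling trick.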

Long-Short Alignment for Effective Long-Context Modeling in LLMs

This paper proposes a new perspective on length generalization from the angle of model output distributions, termed Long-Short Alignment. It highlights that the consistency of output distributions across inputs of different lengths is a key factor in length generalization. The authors introduce a Long-Short Misalignment metric and utilize it as a training regularization term, which significantly improves long-context modeling capabilities on both synthetic and natural language tasks.
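
A regularizer of this kind could look like the sketch below, which compares the output distribution for a long input with that for a shorter version of the same input. The symmetric KL divergence is used here only as a stand-in, since this note does not give the paper's exact metric; the function names and the `weight` value are also assumptions.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)

def long_short_misalignment(p_long, p_short):
    """Symmetric divergence between outputs for long and short inputs."""
    return 0.5 * (kl_divergence(p_long, p_short) + kl_divergence(p_short, p_long))

def regularized_loss(task_loss, p_long, p_short, weight=0.1):
    """Task loss plus a weighted misalignment penalty."""
    return task_loss + weight * long_short_misalignment(p_long, p_short)
```

The penalty is zero when the two distributions coincide and grows as they diverge, which pushes the model toward length-consistent outputs.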

Mixture of Lookup Experts

This paper proposes MoLE (Mixture of Lookup Experts), which changes the input of the routed experts in an MoE from intermediate features to embedding tokens. Because each expert's output then depends only on the token ID, the experts can be reparameterized into lookup tables (LUTs) before inference and offloaded to storage devices, achieving inference speed and memory footprint comparable to dense models while maintaining MoE-level performance.
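
The reparameterization can be sketched as precomputing every expert's output for every vocabulary entry. The function names are hypothetical, the experts are plain Python callables, and the gate weights are passed in directly rather than computed by a router.

```python
def build_lookup_tables(embedding_table, experts):
    """Precompute each expert's output for every token embedding.

    Valid only because the experts read the token embedding, not an
    intermediate feature, so the set of possible inputs is the vocabulary.
    """
    return [[expert(emb) for emb in embedding_table] for expert in experts]

def mole_layer_output(token_id, gate_weights, lookup_tables):
    """Combine the looked-up expert outputs with the given gate weights."""
    dim = len(lookup_tables[0][token_id])
    out = [0.0] * dim
    for weight, table in zip(gate_weights, lookup_tables):
        for d, value in enumerate(table[token_id]):
            out[d] += weight * value
    return out
```

At inference time only table rows are fetched, so no expert weights need to stay in accelerator memory in this sketch.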

MoH: Multi-Head Attention as Mixture-of-Head Attention

This paper reformulates Multi-Head Attention (MHA) into a summation form and proposes Mixture-of-Head Attention (MoH), inspired by MoE. By using a router to dynamically select the most relevant subset of attention heads for each token, MoH matches or even surpasses standard MHA performance while activating only \(50\% \sim 90\%\) of the heads. It also demonstrates that pre-trained models (such as LLaMA3-8B) can be successfully converted into MoH models via continue-tuning.
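
In the summation view, the attention output is a sum of per-head contributions, and a router decides which terms to keep. The sketch below assumes top-k selection with a softmax over the selected router logits; the function name, the gating normalization, and the omission of any always-on heads are simplifying assumptions.

```python
import math

def moh_combine(head_outputs, router_logits, top_k):
    """Weighted sum over the top_k heads chosen by the router for one token.

    head_outputs[i] is head i's contribution after its slice of the
    output projection, so plain MHA would be the unweighted sum of all.
    """
    ranked = sorted(range(len(router_logits)), key=lambda i: router_logits[i], reverse=True)
    active = ranked[:top_k]
    peak = max(router_logits[i] for i in active)
    exp_scores = {i: math.exp(router_logits[i] - peak) for i in active}
    total = sum(exp_scores.values())
    dim = len(head_outputs[0])
    combined = [0.0] * dim
    for i in active:
        weight = exp_scores[i] / total
        for d in range(dim):
            combined[d] += weight * head_outputs[i][d]
    return combined, sorted(active)
```

Heads outside `active` contribute nothing, so their attention computation can be skipped for that token.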

NExtLong: Toward Effective Long-Context Training without Long Documents

This paper proposes the NExtLong framework, which synthesizes long-context training data by segmenting documents into meta-chunks and inserting hard negative distractor texts retrieved from a pre-training corpus between these chunks. This forces the model to distinguish long-range dependency information from distractors, achieving an average improvement of 7.33% over the prior state-of-the-art long-context synthesis method, Quest, on the HELMET and RULER benchmarks.
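
The synthesis procedure can be sketched as chunking a document and interleaving retrieved distractors. Word-overlap (Jaccard) similarity is used here purely as a toy retriever, and the function names, `chunk_size`, and `negatives_per_chunk` are hypothetical; the note does not say which retriever the paper uses.

```python
def split_into_meta_chunks(text, chunk_size):
    """Split a document into chunks of chunk_size words."""
    words = text.split()
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]

def word_overlap(a, b):
    """Jaccard similarity over word sets (toy stand-in for a real retriever)."""
    set_a, set_b = set(a.split()), set(b.split())
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0

def retrieve_hard_negatives(chunk, corpus, k):
    """Return the k corpus passages most similar to the chunk."""
    ranked = sorted(corpus, key=lambda passage: word_overlap(chunk, passage), reverse=True)
    return ranked[:k]

def synthesize_long_sample(document, corpus, chunk_size, negatives_per_chunk):
    """Interleave meta-chunks with hard negatives placed between them."""
    chunks = split_into_meta_chunks(document, chunk_size)
    pieces = []
    for index, chunk in enumerate(chunks):
        pieces.append(chunk)
        if index < len(chunks) - 1:
            pieces.extend(retrieve_hard_negatives(chunk, corpus, negatives_per_chunk))
    return pieces
```

The distractors are similar to the chunk they follow, so related chunks end up far apart and separated by plausible but irrelevant text.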

Retraining-Free Merging of Sparse MoE via Hierarchical Clustering

This paper proposes HC-SMoE, a retraining-free expert merging framework. It compresses SMoE models by hierarchically clustering experts according to the similarity of their outputs and merging each cluster, reducing expert parameters by 25%-50% on Qwen and Mixtral while maintaining superior performance.
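
A minimal sketch of the cluster-then-merge idea follows, assuming each expert is summarized by its mean output on some calibration inputs. Average linkage, Euclidean distance, and merging by plain weight averaging are assumptions made for illustration, as are all function names.

```python
def euclidean(a, b):
    """Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def cluster_experts(mean_outputs, num_clusters):
    """Agglomerative clustering of experts by output similarity."""
    clusters = [[i] for i in range(len(mean_outputs))]

    def linkage(c1, c2):
        # Average pairwise distance between members of the two clusters.
        total = sum(euclidean(mean_outputs[i], mean_outputs[j]) for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > num_clusters:
        pairs = [(a, b) for a in range(len(clusters)) for b in range(a + 1, len(clusters))]
        a, b = min(pairs, key=lambda p: linkage(clusters[p[0]], clusters[p[1]]))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [sorted(c) for c in clusters]

def merge_weights(expert_weights, clusters):
    """Replace each cluster by the element-wise mean of its experts' weights."""
    merged = []
    for members in clusters:
        dim = len(expert_weights[members[0]])
        merged.append([sum(expert_weights[i][d] for i in members) / len(members) for d in range(dim)])
    return merged
```

Going from four experts to two clusters in this way halves the expert parameters without any gradient updates.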