⚡ LLM Efficiency

🔬 ICLR2026 · 19 paper notes

Did You Check the Right Pocket? Cost-Sensitive Store Routing for Memory-Augmented Agents: This paper formalizes multi-store retrieval in memory-augmented agents as a cost-sensitive store routing problem, demonstrates that selective retrieval can reduce context tokens by 62% while improving QA accuracy (86% vs. 81%) over exhaustive retrieval, and proposes a semantics-based heuristic routing baseline.
DND: Boosting Large Language Models with Dynamic Nested Depth: DND selects critical tokens at the end of each Transformer layer via a router and routes them back through the same layer for additional processing (nested depth). Combined with a routing control loss and a threshold control scheme for precise and stable token selection, DND achieves average performance gains of 1.88% and 0.87% on Qwen3-1.7B and Qwen3-30B-A3B, respectively, with fewer than 0.1M additional parameters.
EvoEngineer: Mastering Automated CUDA Kernel Code Evolution with Large Language Models: This paper proposes EvoEngineer, the first systematic LLM-based code evolution framework that decomposes code evolution into two orthogonal components — traverse technique (with a two-layer design: solution guiding + prompt engineering) and population management. On 91 real-world CUDA kernels, EvoEngineer achieves a median speedup of up to 2.72× and a code validity rate of 69.8%, outperforming existing methods on both performance and correctness.
Expert Divergence Learning for MoE-based Language Models: This paper addresses the expert homogenization problem in MoE training by maximizing the Jensen-Shannon divergence of routing distributions across different data domains, encouraging distinct expert subsets to be activated for different domains. The approach improves expert specialization and language modeling performance on a 15B-A1.5B model.
Fast Catch-Up, Late Switching: Optimal Batch Size Scheduling via Functional Scaling Laws: This paper derives the optimal batch size scheduling (BSS) strategy under a Functional Scaling Law (FSL) framework. For hard tasks, the optimal strategy is to train with small batches for most of the budget and switch to large batches only at the final stage (late switching). The paper further reveals a fast catch-up effect—after switching, the loss rapidly converges to the trajectory of full large-batch training—and validates these principles in LLM pretraining at 1.1B parameters and 1T tokens.
IterResearch: Rethinking Long-Horizon Agents with Interaction Scaling: IterResearch is an MDP-based iterative deep research paradigm that replaces mono-contextual linear accumulation with periodic workspace reconstruction. This lets agents scale to 2048 interactions within a 40K context length (performance improves from 3.5% to 42.5%) and surpass open-source agents by an average of 14.5 percentage points across 6 benchmarks.
LycheeDecode: Accelerating Long-Context LLM Inference via Hybrid-Head Sparse Decoding: LycheeDecode accelerates long-context LLM decoding by partitioning attention heads, at a fine granularity, into a small number of retrieval heads (which perform full attention to select critical tokens) and a large number of sparse heads (which reuse the selected tokens for sparse computation). Head roles are learned end-to-end via the Hard Kumaraswamy distribution, achieving a 2.7× speedup at 128K context length with no performance degradation.
LycheeDecode: Accelerating Long-Context LLM Inference via Hybrid-Head Sparse Decoding: This paper proposes LycheeDecode, a fine-grained hybrid-head sparse decoding method that partitions attention heads into a small number of "retrieval heads" and a large number of "sparse heads," employing the HardKuma distribution for differentiable head-type identification. The method achieves a 2.7× speedup under 128K context while matching or surpassing full-attention baselines.
MVAR: Visual Autoregressive Modeling with Scale and Spatial Markovian Conditioning: This paper proposes MVAR (Markovian Visual AutoRegressive), which introduces a scale Markov assumption (conditioning only on the adjacent preceding scale rather than all prior scales) and spatial Markov attention (restricting neighborhood size to \(k\)), reducing VAR's attention complexity from \(\mathcal{O}(N^2)\) to \(\mathcal{O}(Nk)\). MVAR achieves comparable or superior performance on ImageNet 256×256 while reducing inference memory by 3.0–4.2×, and requires only 8 RTX 4090 GPUs for training.
One-Prompt Strikes Back: Sparse Mixture of Experts for Prompt-based Continual Learning: This paper proposes SMoPE, a framework that organizes a single shared prompt into multiple prompt experts within a sparse MoE structure. Dynamic sparse activation is achieved via prompt-attention score aggregation, significantly alleviating knowledge interference while maintaining high parameter efficiency, achieving SOTA on multiple continual learning benchmarks.
RACE Attention: A Strictly Linear-Time Attention for Long-Sequence Training: This paper proposes RACE Attention — replacing softmax with a power angular kernel and approximating attention outputs via differentiable LSH sketches — achieving strictly linear time complexity, supporting up to 12M tokens on a single GPU and 75M tokens on a single CPU, while matching or surpassing softmax accuracy across diverse tasks.
Randomization Boosts KV Caching, Learning Balances Query Load: A Joint Perspective: This paper proposes the first unified mathematical model for KV cache-aware load balancing, introducing a randomized leaf-node eviction algorithm RLT (with \(O(\log n)\) competitive ratio) and a learning-based greedy router LBGR, achieving up to 11.96× latency reduction and 14.06× TTFT reduction in multi-LLM serving scenarios.
Semantic Parallelism: Redefining Efficient MoE Inference via Model-Data Co-Scheduling: This paper proposes the Semantic Parallelism (SP) paradigm, which predicts token-expert routing paths and co-schedules model placement with data dispatch to substantially reduce all-to-all communication overhead in MoE inference under expert parallelism. It achieves up to 2.78× throughput improvement in Attention-DP settings and up to 24.9% latency reduction in Attention-TP settings.
SwingArena: Adversarial Programming Arena for Long-context GitHub Issue Solving: This paper proposes SwingArena, an adversarial evaluation framework in which two LLMs alternately play the roles of patch submitter and test reviewer on real GitHub issues, with end-to-end verification through repository-native CI pipelines (compilation / lint / regression tests). Evaluated on 400 instances across C++, Python, Rust, and Go, the framework reveals behavioral divergence between models in terms of "aggressive patch generation" versus "defensive quality assurance."
TokenSeek: Memory Efficient Fine Tuning via Instance-Aware Token Selection: This paper proposes TokenSeek, a general instance-aware token seeking and discarding method that evaluates token importance by combining contextual (attention) and gradient information, updates parameters only on selected tokens, and achieves up to 65.7% reduction in activation memory while maintaining or surpassing full-token fine-tuning performance.
Understanding and Improving Length Generalization in Hierarchical Sparse Attention Models: This paper systematically dissects chunk-based sparse attention architectures, identifies three critical design principles (nonlinear Chunk Encoder + CLS token, Bypassing Residual Path, and enforced training-time sparsity), and successfully extrapolates a model trained on 4K context to 32 million tokens.
Universe Routing: Why Self-Evolving Agents Need Epistemic Control: This paper formalizes the tendency of autonomous agents to conflate incompatible epistemological frameworks (e.g., frequentist vs. Bayesian) during chain-of-thought reasoning as the "universe routing" problem. A lightweight 465M-parameter router is trained to classify queries into 7 mutually exclusive belief spaces and dispatch them to dedicated solvers. The work demonstrates that hard routing is 7× faster than soft MoE at equal accuracy, and that a modular architecture with rehearsal enables continual learning with zero forgetting.
When Does Divide and Conquer Work for Long Context LLM? A Noise Decomposition Framework: This paper proposes a theoretical framework that decomposes long-context task failures into three types of noise (task noise / model noise / aggregator noise), proves that weak models with chunked processing can outperform strong models with full-context processing when model noise grows superlinearly, and provides a method to efficiently estimate the optimal chunk size using only 3–5 samples.
xLSTM Scaling Laws: Competitive Performance with Linear Time-Complexity: This paper systematically compares the scaling laws of xLSTM and Transformer models, demonstrating that xLSTM strictly dominates Transformers of the same scale on the training loss–compute Pareto frontier, in the overtrained regime, and in inference speed, with the advantage growing as context length increases.
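
The Expert Divergence Learning entry above describes maximizing the Jensen-Shannon divergence of routing distributions across data domains. As a rough illustration of that quantity only, the sketch below computes the generalized Jensen-Shannon divergence for a set of per-domain routing distributions. The function names, the uniform weighting over domains, and the toy distributions are assumptions made for this example; this is not the paper's implementation or training loss.

```python
import math


def entropy(p):
    """Shannon entropy (in nats) of a discrete distribution; zero entries are skipped."""
    return -sum(x * math.log(x) for x in p if x > 0)


def js_divergence(dists):
    """Generalized Jensen-Shannon divergence with uniform weights (an assumption).

    Computed as H(mean of distributions) - mean of H(each distribution).
    `dists` is a list of per-domain routing distributions over the same experts.
    """
    n = len(dists)
    num_experts = len(dists[0])
    mean = [sum(d[i] for d in dists) / n for i in range(num_experts)]
    return entropy(mean) - sum(entropy(d) for d in dists) / n


# Hypothetical routing distributions over 4 experts for 2 domains.
homogeneous = [[0.25, 0.25, 0.25, 0.25], [0.25, 0.25, 0.25, 0.25]]
specialized = [[0.5, 0.5, 0.0, 0.0], [0.0, 0.0, 0.5, 0.5]]

jsd_homogeneous = js_divergence(homogeneous)  # identical routing in both domains
jsd_specialized = js_divergence(specialized)  # disjoint expert subsets per domain
```

In this toy setting, identical routing across domains gives a divergence of zero, while fully disjoint expert subsets for two domains give ln 2, the maximum for two uniformly weighted distributions. A larger value therefore corresponds to more domain-specific expert usage.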