
⚡ LLM Efficiency

🤖 AAAI2026 · 9 paper notes

Connectivity-Guided Sparsification of 2-FWL GNNs Preserving Full Expressivity

Co-Sparsify proposes a connectivity-aware sparsification framework that restricts 3-node interactions to biconnected components and 2-node interactions to connected components, eliminating provably redundant computation. It preserves full 2-FWL expressivity while substantially improving efficiency, achieving state-of-the-art results on synthetic substructure counting tasks and benchmarks including ZINC and QM9.
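The saving comes from enumerating higher-order tuples only inside components rather than over the whole graph. A minimal sketch of that restriction, using a standard Tarjan biconnected-components routine and a toy "bowtie" graph (both illustrative, not the paper's implementation):

```python
from itertools import combinations

def biconnected_components(adj):
    """Tarjan's algorithm: biconnected components of a graph given as
    an adjacency dict {node: [neighbors]}."""
    disc, low, stack, comps, timer = {}, {}, [], [], [0]

    def dfs(u, parent):
        disc[u] = low[u] = timer[0]; timer[0] += 1
        for v in adj[u]:
            if v not in disc:
                stack.append((u, v))
                dfs(v, u)
                low[u] = min(low[u], low[v])
                if low[v] >= disc[u]:          # u separates v's subtree
                    comp = set()
                    while True:
                        e = stack.pop()
                        comp.update(e)
                        if e == (u, v):
                            break
                    comps.append(comp)
            elif v != parent and disc[v] < disc[u]:
                stack.append((u, v))
                low[u] = min(low[u], disc[v])

    for u in adj:
        if u not in disc:
            dfs(u, None)
    return comps

# Toy graph: two triangles sharing node 2 (a "bowtie").
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3, 4], 3: [2, 4], 4: [2, 3]}
dense = set(combinations(range(5), 3))              # all 3-node tuples
sparse = {t for c in biconnected_components(adj)
          for t in combinations(sorted(c), 3)}      # component-restricted
print(len(dense), len(sparse))                      # 10 vs 2
```

Even on this tiny graph, restricting 3-node tuples to biconnected components drops the candidate set from 10 to 2; the paper's claim is that the discarded tuples are provably redundant for 2-FWL expressivity.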

Harnessing the Unseen: The Hidden Influence of Intrinsic Knowledge in Long-Context Language Models

This paper presents the first systematic study of how parametric knowledge influences generation in long-context language models (LCLMs), finding that this influence grows with context length and that methods designed to improve extrinsic retrieval suppress parametric recall. Based on these findings, it introduces the Hybrid Needle-in-a-Haystack (Hybrid NIAH) benchmark to jointly evaluate both capabilities.

How Many Experts Are Enough? Towards Optimal Semantic Specialization for Mixture-of-Experts

This paper proposes MASS, a framework that adaptively expands the MoE expert pool via gradient-based semantic drift detection, combined with a Top-p confidence routing strategy, to automatically discover the optimal number of experts without hyperparameter search while enhancing semantic differentiation across experts.
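The Top-p idea is nucleus-style selection over the gate distribution: instead of a fixed top-k, activate the smallest expert set whose cumulative confidence reaches a threshold. A minimal sketch under that assumption (the function name and renormalization choice are illustrative):

```python
import numpy as np

def top_p_route(gate_logits, p=0.9):
    """Pick the smallest expert set whose cumulative gate probability
    reaches p, then renormalize the kept gate weights."""
    probs = np.exp(gate_logits - gate_logits.max())
    probs /= probs.sum()
    order = np.argsort(-probs)                 # experts by descending confidence
    k = int(np.searchsorted(np.cumsum(probs[order]), p) + 1)
    chosen = order[:k]
    return chosen, probs[chosen] / probs[chosen].sum()

# A confident gate activates few experts; a flat gate would activate more.
experts, weights = top_p_route(np.log([0.6, 0.3, 0.05, 0.05]), p=0.8)
```

With this routing, the number of active experts adapts per token to the gate's confidence, which is what lets the framework avoid fixing k by hyperparameter search.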

InterMoE: Individual-Specific 3D Human Interaction Generation via Dynamic Temporal-Selective MoE

This paper proposes InterMoE, a Dynamic Temporal-Selective MoE architecture for text-driven two-person 3D interaction motion generation that addresses individual identity preservation and semantic fidelity. A Synergistic Router fuses semantic and kinematic features to guide routing, while Dynamic Temporal Selection enables each expert to adaptively select key temporal frames. The method achieves a 9% FID reduction on InterHuman and 22% on InterX.
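The temporal-selection step can be pictured as each expert scoring the motion frames and attending only to its top-scoring ones. A hypothetical sketch (the per-expert scoring vector and top-k mechanism are assumptions, not the paper's exact design):

```python
import numpy as np

def temporal_select(frames, score_vec, k):
    """Score every frame with an expert-specific vector and keep the
    top-k frames, preserving temporal order."""
    scores = frames @ score_vec                # (T,) relevance per frame
    keep = np.sort(np.argsort(-scores)[:k])    # top-k indices, time-ordered
    return frames[keep], keep

rng = np.random.default_rng(0)
frames = rng.normal(size=(16, 8))              # T=16 frames, 8-dim features
selected, idx = temporal_select(frames, rng.normal(size=8), k=4)
```

Because each expert selects its own key frames, different experts can specialize in different phases of the interaction rather than averaging over the whole sequence.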

Judge Q: Trainable Queries for Optimized Information Retention in KV Cache Eviction

This paper proposes Judge Q, which introduces trainable soft tokens into the model vocabulary and trains their attention patterns to align with those of actual decoding tokens, enabling them to replace local-window queries for evaluating KV cache importance during the prefill stage. This approach better preserves global information, achieving ~1-point improvement on LongBench and 3+ points on RULER.
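At eviction time, the trained soft token acts as a query whose attention weights over the cached keys serve as importance scores, replacing the usual local-window heuristic. A minimal numpy sketch of that scoring-and-eviction step (single head, no training loop; names are illustrative):

```python
import numpy as np

def judge_scores(keys, judge_query):
    """Importance of each cached key: softmax attention weights of the
    trained soft-token query over the whole cache."""
    logits = keys @ judge_query / np.sqrt(keys.shape[-1])
    w = np.exp(logits - logits.max())
    return w / w.sum()

def evict_kv(keys, values, judge_query, keep):
    """Keep the `keep` highest-scoring KV pairs, preserving position order."""
    idx = np.sort(np.argsort(-judge_scores(keys, judge_query))[:keep])
    return keys[idx], values[idx]

rng = np.random.default_rng(0)
keys = rng.normal(size=(32, 16))
values = rng.normal(size=(32, 16))
q = 3.0 * keys[7]                               # query strongly aligned with key 7
k_kept, v_kept = evict_kv(keys, values, q, keep=8)
```

Because the score is computed over the entire prefill cache rather than a trailing window, globally important entries survive eviction even when they sit far from the sequence end.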

MoETTA: Test-Time Adaptation Under Mixed Distribution Shifts with MoE-LayerNorm

This paper proposes MoETTA, a test-time adaptation framework that reparameterizes LayerNorm into multiple structurally decoupled expert branches. A routing mechanism assigns samples from different domains to different experts, enabling multi-directional parameter updates and overcoming the limitations of a single adaptation path under mixed distribution shifts. The paper also introduces two more realistic evaluation benchmarks—potpourri and potpourri+—and achieves state-of-the-art performance across all settings.
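The reparameterization can be sketched as a LayerNorm whose affine parameters are a routed mixture over expert branches, so different samples can pull the statistics in different directions. A minimal sketch, assuming soft routing over per-expert gain/bias (the router design is illustrative):

```python
import numpy as np

class MoELayerNorm:
    """LayerNorm with E expert branches; a router mixes their affine params."""
    def __init__(self, dim, n_experts, rng=np.random.default_rng(0)):
        self.gamma = np.ones((n_experts, dim))   # per-expert gain
        self.beta = np.zeros((n_experts, dim))   # per-expert bias
        self.router = rng.normal(size=(dim, n_experts)) * 0.01

    def __call__(self, x, eps=1e-5):
        mu = x.mean(-1, keepdims=True)
        var = x.var(-1, keepdims=True)
        xhat = (x - mu) / np.sqrt(var + eps)     # standard LN normalization
        logits = x @ self.router
        w = np.exp(logits - logits.max(-1, keepdims=True))
        w /= w.sum(-1, keepdims=True)            # per-sample expert weights
        return (w @ self.gamma) * xhat + (w @ self.beta)

rng = np.random.default_rng(1)
out = MoELayerNorm(8, n_experts=3)(rng.normal(size=(4, 8)))   # (4, 8)
```

At test time only the expert affine parameters (and router) would adapt, so samples routed to different experts can receive different parameter updates instead of sharing one adaptation path.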

Resource Efficient Sleep Staging via Multi-Level Masking and Prompt Learning

This paper proposes MASS (Mask-Aware Sleep Staging), a framework that achieves reliable sleep staging using only 10% of the original EEG signal through a multi-level masking strategy and hierarchical prompt learning mechanism, providing a practical solution for resource-constrained wearable sleep monitoring systems.
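The 10% budget can be reached by composing masks at more than one granularity. A hypothetical sketch of a two-level version (epoch level, then sample level; the exact levels and ratios in the paper may differ):

```python
import numpy as np

def multilevel_mask(eeg, epoch_keep=0.5, sample_keep=0.2, rng=None):
    """Two-level mask over EEG shaped (n_epochs, samples_per_epoch):
    keep a fraction of epochs, then a fraction of samples inside each
    kept epoch, so overall kept ratio = epoch_keep * sample_keep (= 10%)."""
    rng = rng or np.random.default_rng(0)
    n_ep, n_s = eeg.shape
    mask = np.zeros_like(eeg, dtype=bool)
    for e in rng.choice(n_ep, int(n_ep * epoch_keep), replace=False):
        kept = rng.choice(n_s, int(n_s * sample_keep), replace=False)
        mask[e, kept] = True
    return eeg * mask, mask

eeg = np.random.default_rng(0).normal(size=(40, 300))
masked, mask = multilevel_mask(eeg)              # mask.mean() == 0.10
```

On a wearable, only the unmasked 10% would need to be sensed or transmitted; the model is trained to stage sleep from that sparse view, with prompts compensating for the missing signal.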

Scaling and Transferability of Annealing Strategies in Large Language Model Training

This paper proposes a model-agnostic predictive framework that decomposes training loss into a forward-effect term (learning rate integral \(S\)), an annealing momentum term (Adam-style momentum integral \(M\)), and a model-size term \(N\). It demonstrates that annealing strategies can be transferred from small models/small batches to large models/large batches, achieving a prediction MAPE below 2%.
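The two schedule statistics are cheap to compute from the learning-rate sequence alone, which is what makes the framework model-agnostic. A minimal sketch, assuming \(S\) is the running LR sum and \(M\) accumulates an exponential momentum of the per-step LR decay (the paper's exact definition of \(M\) and the fitted loss form are omitted):

```python
def schedule_stats(lrs, beta=0.95):
    """S: learning-rate integral; M: momentum-style integral of LR decay."""
    S = sum(lrs)
    M = m = 0.0
    for prev, lr in zip(lrs, lrs[1:]):
        m = beta * m + (prev - lr)   # momentum accumulates annealing steps
        M += m
    return S, M

# Constant schedule: no annealing, so M = 0.
S_const, M_const = schedule_stats([1e-3] * 100)
# Linear decay over the same horizon: smaller S, positive annealing momentum.
S_dec, M_dec = schedule_stats([1e-3 * (1 - t / 100) for t in range(100)])
```

Since \(S\) and \(M\) depend only on the schedule, a strategy tuned on a small model or batch yields the same \((S, M)\) trajectory when replayed at scale, which is the basis of the transfer result.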

The Curious Case of Analogies: Investigating Analogical Reasoning in Large Language Models

Using mechanistic interpretability tools including Patchscopes, attention knockout, and linear probes, this paper systematically reveals the internal mechanisms of analogical reasoning in LLMs: models encode relational information effectively in middle-to-upper layers, but applying it to new entities is a harder bottleneck than extracting it; successful analogical reasoning correlates with strong structural alignment across stories, whereas failures reflect weakened or misaligned correspondence.
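Of the three tools, the linear probe is the simplest to picture: fit a linear readout on hidden states and check whether a relational label is decodable from them. A toy sketch with synthetic "hidden states" (ridge regression with a bias column; illustrative, not the paper's probe):

```python
import numpy as np

def fit_probe(H, y, lam=1e-2):
    """Ridge-regression linear probe: read a binary label out of hidden states."""
    Hb = np.hstack([H, np.ones((len(H), 1))])    # bias column
    return np.linalg.solve(Hb.T @ Hb + lam * np.eye(Hb.shape[1]), Hb.T @ y)

rng = np.random.default_rng(0)
H = rng.normal(size=(200, 8))                    # stand-in "hidden states"
y = (H[:, 0] > 0).astype(float)                  # label linearly decodable from dim 0
w = fit_probe(H, y)
pred = np.hstack([H, np.ones((len(H), 1))]) @ w
acc = ((pred > 0.5) == (y > 0.5)).mean()
```

High probe accuracy at a given layer is evidence the information is encoded there; the paper's point is that such decodability in middle-to-upper layers does not guarantee the model can apply the relation to new entities downstream.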