
⚡ LLM Efficiency

🔬 ICLR2026 · 171 paper notes

📌 Same area in other venues: 📷 CVPR2026 (8) · 💬 ACL2026 (23) · 🧪 ICML2026 (48) · 🤖 AAAI2026 (9) · 🧠 NeurIPS2025 (34) · 📹 ICCV2025 (1)

🔥 Top topics: LLM ×45 · Diffusion Models ×20 · Reasoning ×7 · Alignment/RLHF ×5 · Compression ×3

A Two-Phase Deep Learning Framework for Adaptive Time-Stepping in High-Speed Flow Modeling

ShockCast decouples "adaptive time-stepping for high-speed flows" into two learning problems: first, using a Neural CFL model to predict the next time step \(\Delta t\) based on the current flow field; then, using a Neural Solver conditioned on \(\Delta t\) to advance the flow field by \(\Delta t\). These two modules alternate autoregressively during inference, allowing the neural solver to process supersonic flow fields with shocks by refining or coarsening steps as needed, mirroring classical solvers.
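
The alternating inference loop described above can be sketched as follows. This is a toy illustration under stated assumptions, not the authors' implementation: `predict_dt` and `advance` are hypothetical stand-ins for the Neural CFL model and the \(\Delta t\)-conditioned neural solver, and the flow field is reduced to a plain list of floats.

```python
from typing import Callable, List, Tuple

State = List[float]  # toy stand-in for a flow field


def rollout(
    u0: State,
    predict_dt: Callable[[State], float],      # hypothetical Neural CFL model
    advance: Callable[[State, float], State],  # hypothetical dt-conditioned solver
    t_end: float,
) -> List[Tuple[float, State]]:
    """Alternate the two models until simulated time reaches t_end."""
    t, u = 0.0, u0
    trajectory = [(t, u)]
    while t < t_end:
        dt = min(predict_dt(u), t_end - t)  # clip the final step to land on t_end
        u = advance(u, dt)
        t += dt
        trajectory.append((t, u))
    return trajectory


def toy_dt(u: State) -> float:
    # CFL-like toy rule: larger magnitudes force smaller steps.
    return 0.5 / (1.0 + max(abs(x) for x in u))


def toy_advance(u: State, dt: float) -> State:
    # Toy dynamics: explicit-Euler decay of every component.
    return [x * (1.0 - dt) for x in u]


traj = rollout([4.0, 1.0], toy_dt, toy_advance, t_end=1.0)
```

In this toy run the step size grows as the field decays, which is the refine-or-coarsen behavior the two-phase design is meant to reproduce.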

Accelerating Diffusion Large Language Models with SlowFast Sampling: The Three Golden Principles

Addressing the issue that existing sampling strategies for Diffusion Large Language Models (dLLMs) have a "fixed speed that does not adjust with generation states," this paper summarizes three empirical laws (Certainty, Convergence, Locality). Based on these, it designs SlowFast Sampling, which dynamically switches between "Slow Phase Exploration" and "Fast Phase Acceleration." It can be orthogonally combined with dLLM-Cache—achieving up to 15.63× acceleration on LLaDA for GPQA, and reaching 34.22× when integrated with cache, with almost no loss in accuracy.

Attention Is All You Need for KV Cache in Diffusion LLMs

To address the redundancy in Diffusion Language Models (DLMs) where all tokens and layer KV pairs are recalculated at every step, this paper proposes Elastic-Cache—a training-free and architecture-agnostic method. It uses the "attention drift of the most-attended tokens" to determine when to refresh the cache, leverages the "deep-layers-change-first" pattern to decide from which layer to refresh, and applies block-level caching for distant MASK tokens outside a sliding window. It achieves up to 45.1× decoding acceleration on models like LLaDA and Dream-7B with almost no drop in performance.

Autoencoding-Free Context Compression for LLMs via Contextual Semantic Anchors

Instead of appending randomly initialized "compression tokens" and relying on autoencoding pre-training for context reconstruction like ICAE, SAC directly selects several "anchor tokens" from the original text. It adds a learnable anchor embedding to these tokens and employs bidirectional attention to aggregate global information into the anchors' KV caches. By completely discarding the autoencoding task, SAC consistently outperforms current compression methods in question answering and long-document summarization.

AutoSP: Unlocking Long-Context LLM Training Via Compiler-Based Sequence Parallelism

AutoSP elevates Sequence Parallelism (SP) from manual, framework-coupled operators to two specialized passes within the PyTorch-2.0 compiler stack: an SP-Pass on Torch-IR that automatically inserts communication and resizes activation buffers, and a Sequence-Aware Checkpointing (SAC-Pass) on the joint Aten-IR graph that relaxes min-cut constraints to recompute compute-intensive operators. This allows users to compile single-GPU models into distributed long-context training pipelines with a few lines of code, extending trainable sequence lengths by up to 2.7× on NVIDIA and 2.5× on AMD with near-zero throughput loss.

BA-LoRA: Bias-Alleviating Low-Rank Adaptation to Mitigate Catastrophic Inheritance in Large Language Models

BA-LoRA superimposes three output space regularizations—consistency, diversity, and SVD—onto the PiSSA spectral-initialized LoRA framework. These specifically address knowledge drift, representation collapse, and noise overfitting caused by the amplification of pre-training biases during fine-tuning. It consistently outperforms numerous LoRA variants on both NLG and NLU tasks, showing greater gains on noisier pre-trained models.

Beyond Fixed: Training-Free Variable-Length Denoising for Diffusion Large Language Models

DAEDAL exploits an internal signal of Diffusion Large Language Models (DLLMs): the confidence of EOS token predictions during denoising. Without any training, it first coarsely adjusts the sequence length from a short uniform initial value to a task-appropriate length before denoising, then locally inserts masks to expand low-confidence regions during denoising. This removes the need to manually preset the generation length, matching or exceeding the accuracy of fine-tuned fixed-length baselines across four math/code benchmarks while significantly increasing the proportion of effective tokens.

Beyond Masks: Efficient, Flexible Diffusion Language Models via Deletion-Insertion Processes

DID completely replaces the "mask-unmask" paradigm in diffusion language models with two continuous-time Markov chains (CTMC) representing "deletion-insertion": the forward process reduces a sequence to empty by deleting tokens, while the backward process reconstructs it by inserting tokens. By incorporating an "insertion score"-based DISE training objective and parallel dynamic programming, DID eliminates <MASK> and <PAD> tokens that consume nearly half of the computational budget. It natively supports variable-length generation and self-correction, achieving up to 3.42× training speedup and 3.79× inference speedup.

Beyond Real: Imaginary Extension of Rotary Position Embeddings for Long-Context LLMs

RoPE++ reclaims the negative imaginary part that standard RoPE discards from its complex-valued attention and uses it as a parallel imaginary attention head. This enhances long-context modeling either without increasing the KV cache or, in an alternative configuration, while halving the cache.

BoRA: Towards More Expressive Low-Rank Adaptation with Block Diversity

BoRA interprets the LoRA product \(BA\) as block matrix multiplication and breaks inter-block correlations by inserting an independent diagonal matrix \(\Sigma_{i,j}\) into each block product \(B_iA_j\). Using only \(b^2r\) additional parameters, BoRA increases the rank of the LoRA weights by a factor of \(b\), achieving a 2-4% accuracy improvement over LoRA on GLUE, mathematics, and commonsense reasoning tasks with comparable parameter counts.
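
The block structure can be written out explicitly. The following is a sketch consistent with the summary above; the dimensions \(m, n\), the equal-size partition, and the vectors \(\sigma_{i,j}\) are notational assumptions rather than the paper's exact formulation.

\[
\Delta W =
\begin{bmatrix}
B_1 \Sigma_{1,1} A_1 & \cdots & B_1 \Sigma_{1,b} A_b \\
\vdots & \ddots & \vdots \\
B_b \Sigma_{b,1} A_1 & \cdots & B_b \Sigma_{b,b} A_b
\end{bmatrix},
\qquad
B_i \in \mathbb{R}^{\frac{m}{b} \times r},\;
A_j \in \mathbb{R}^{r \times \frac{n}{b}},\;
\Sigma_{i,j} = \operatorname{diag}(\sigma_{i,j}) \in \mathbb{R}^{r \times r}.
\]

With every \(\Sigma_{i,j} = I\) this reduces to the plain LoRA product \(BA\), whose rank is at most \(r\). With independent \(\Sigma_{i,j}\), each block row still has rank at most \(r\), so \(\operatorname{rank}(\Delta W) \le br\), and the \(b^2\) diagonal matrices contribute \(b^2 r\) extra parameters.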

Cache What Lasts: Token Retention for Memory-Bounded KV Cache in LLMs

TRIM-KV inserts a lightweight "retention gate" into each attention head of a pre-trained LLM to predict the intrinsic long-term importance of a token (a scalar score that decays exponentially over time) at the time of generation. When the memory budget is exceeded, tokens with the lowest scores are evicted. By freezing the backbone and fine-tuning these gates with distillation and capacity losses, the method incurs almost zero overhead during inference while consistently outperforming heuristic eviction and learnable retrieval baselines across benchmarks such as mathematical reasoning and long-context memory. In memory-constrained scenarios, it even surpasses full-cache performance.
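
The eviction rule can be illustrated with a minimal sketch. It assumes, hypothetically, that each gate emits a per-token decay base in (0, 1) and that importance at a later step is that base raised to the token's age; the actual parameterization and training losses are not shown here.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CachedToken:
    position: int     # decoding step at which the token entered the cache
    retention: float  # hypothetical gate output in (0, 1), used as a decay base


def current_score(tok: CachedToken, now: int) -> float:
    """Exponentially decayed importance: retention ** age."""
    return tok.retention ** (now - tok.position)


def evict_to_budget(cache: List[CachedToken], now: int, budget: int) -> List[CachedToken]:
    """Keep the `budget` highest-scoring tokens, preserving positional order."""
    if len(cache) <= budget:
        return list(cache)
    keep = sorted(cache, key=lambda t: current_score(t, now), reverse=True)[:budget]
    return sorted(keep, key=lambda t: t.position)


cache = [
    CachedToken(0, 0.99),  # old but slowly decaying
    CachedToken(1, 0.50),
    CachedToken(2, 0.90),
    CachedToken(3, 0.60),  # recent but fast decaying
]
kept = evict_to_budget(cache, now=4, budget=2)
```

Here the oldest token survives because its score decays slowly, while the most recent token is evicted, which is the difference from a pure recency heuristic.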

Cactus: Accelerating Auto-Regressive Decoding with Constrained Acceptance Speculative Sampling

This paper reformulates speculative sampling as a constrained optimization problem—"maximizing acceptance rate under controlled divergence constraints"—and proposes Cactus. By reading only one probability value of a candidate token and adding a "bonus" determined by \(q\) and a divergence budget \(\delta\), it significantly improves acceptance rate and throughput while keeping the deviation from the verifier distribution controllable.

Cartridges: Lightweight and General-Purpose Long Context Representations via Self-Study

Cartridges replaces "online prefilling of long documents into the KV cache" with "offline training of a small learnable KV cache (a Cartridge) for each corpus." Self-Study (self-generated synthetic dialogues + context distillation) replicates general-purpose In-Context Learning (ICL) capabilities in the small cache, achieving on average a 38.6× reduction in memory and a 26.4× increase in throughput.

Cascadia: An Efficient Cascade Serving System for Large Language Models

Cascadia is a cascade serving system for large language models. It formulates the decisions of "which model size to use, how many GPUs to allocate, and which parallelism strategy to apply" as a constrained optimization problem. By using a bi-level iterative framework that combines MILP for deployment and Chebyshev-guided search for routing, it jointly solves for these variables. While maintaining answer quality, it tightens latency SLOs by up to 4\(\times\) and improves throughput by up to 5\(\times\) compared to single-model deployment.

Command-V: Training-Free Representation Finetuning Transfer

Command-V (⌘V) transfers a Representation Finetuning (ReFT) adapter trained on one model to another model with a different architecture through a pair of linear "translators" without any backpropagation or original training data. This allows the recipient to "freely" acquire new behaviors from the donor (e.g., refusal enhancement, jailbreaking, auto-CoT) with performance close to direct fine-tuning while reducing computational costs by several orders of magnitude.

Composer: A Search Framework for Hybrid Neural Architecture Design

Composer transforms the manual design process of "how to interleave computational primitives like Attention and MLP into superior LLMs" into an automated search framework. By employing Bayesian optimization on million-parameter small models to discover optimal interleaving patterns and extrapolating them ~1000× to 3B/8B scales, the resulting Composite architectures consistently outperform Llama 3.2 across the 350M–8B range. This achieves an average downstream accuracy gain of 2–2.1% while providing 1.25× training throughput and a 1.69× reduction in KV cache size.

CONCUR: A Framework for Continual Constrained and Unconstrained Routing

CONCUR trains a pair of (accuracy classifier + cost regressor) predictors individually for each "model + decoding method" strategy. It then formulates the task assignment as an optimization problem with or without a budget. Consequently, when new strategies emerge, only new predictors need to be added without retraining the entire router. It achieves higher accuracy and lower inference FLOPs than the strongest single strategy and existing routing methods in both in-distribution and out-of-distribution, as well as constrained and unconstrained settings.

CPQS-Tuning: A Model Self-Perception-Based Data Filtering Algorithm for Efficient Instruction Fine-Tuning

This paper proposes CPQS-Tuning: instead of using external scoring models or manual metrics to filter instruction fine-tuning data, it directly reads the hidden states of the target LLM. A small CNN translates the model's "implicit evaluation" of data into a Contrastive Perception Quality Score (CPQS). Selecting less than 10% of high-quality data based on this score results in training performance that exceeds using the full dataset.

CR-Net: Scaling Parameter-Efficient Training with Cross-Layer Low-Rank Structure

CR-Net discovers that the "difference between adjacent layer activations" possesses a strong low-rank structure. Consequently, it reformulates every linear mapping as "previous layer activation × learnable scaling + low-rank increment." This design reduces parameters by half without losing high-rank information. When combined with an activation recomputation strategy specifically designed for this cross-layer dependency, it scales pre-training from 60M to 13B, consistently outperforming existing low-rank methods while consuming less VRAM and computation.

DASH: Deterministic Attention Scheduling for High-throughput Reproducible LLM Training

DASH abstracts deterministic attention backward propagation as a DAG scheduling problem with the goal of minimizing critical path length. By employing two complementary strategies—"Descending Q-Tile Iteration" and "Shift Scheduling"—it eliminates pipeline bubbles, achieving up to a \(1.28\times\) throughput improvement for deterministic attention backward operators on H800 compared to FlashAttention-3's deterministic mode, making reproducible LLM training nearly cost-free.

Deep Hierarchical Learning with Nested Subspace Networks for Large Language Models

The paper proposes Nested Subspace Networks (NSN), which utilize low-rank decomposition to form strictly nested subspace hierarchies in linear layers. Combined with uncertainty-aware multi-rank training, a single model can adjust the trade-off between computation and performance on-the-fly during inference (50% FLOPs reduction with only 5% accuracy loss) and can be applied post-hoc to pretrained LLMs.

DefensiveKV: Taming the Fragility of KV Cache Eviction in LLM Inference

Addressing the issue where the "importance stability assumption" relied upon by KV cache eviction is fragile, and standard mean aggregation fails during critical moments, this paper proposes linear-time "Defensive Aggregation" (estimating worst-case risk using historical maximums + adaptive prior correction). Based on this, DefensiveKV and its cross-layer version Layer-DefensiveKV are constructed, reducing generation quality loss by 2.3\(\times\) and 4.3\(\times\) compared to the strongest baseline across 18 datasets with a 20% cache budget.
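
The contrast between mean aggregation and a worst-case (historical maximum) aggregation can be sketched as below. This is only an illustration of the max-versus-mean idea: the adaptive prior correction is omitted because the summary does not specify it, and the attention values are made up.

```python
from typing import List


def mean_aggregate(history: List[List[float]]) -> List[float]:
    """Average attention each cached token received over past queries."""
    return [sum(col) / len(col) for col in zip(*history)]


def max_aggregate(history: List[List[float]]) -> List[float]:
    """Worst-case view: the largest attention each token ever received."""
    return [max(col) for col in zip(*history)]


def keep_top(scores: List[float], budget: int) -> List[int]:
    """Indices of the `budget` highest-scoring tokens, in positional order."""
    order = sorted(range(len(scores)), key=scores.__getitem__, reverse=True)
    return sorted(order[:budget])


# Rows are past queries, columns are cached tokens (illustrative numbers).
# Token 1 is ignored for three queries, then becomes critical for the fourth.
history = [
    [0.30, 0.00, 0.25],
    [0.30, 0.00, 0.25],
    [0.30, 0.00, 0.25],
    [0.05, 0.70, 0.05],
]
kept_by_mean = keep_top(mean_aggregate(history), budget=2)
kept_by_max = keep_top(max_aggregate(history), budget=2)
```

Mean aggregation drops token 1 even though one query depended on it heavily; the max-based score retains it.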

Demystifying and Enhancing the Efficiency of Large Language Model Based Search Agents

This paper systematically analyzes the inefficiency of LLM search agents (interleaved reasoning and retrieval). It reveals that retrieval accuracy does not monotonically improve end-to-end efficiency (low recall forces more retrieval rounds, while high recall has excessive overhead) and shows extreme sensitivity to retrieval latency (FCFS scheduling and retrieval-induced pauses repeatedly evict KV-cache of long requests). The authors propose SearchAgent-X, an inference system utilizing high-recall approximate retrieval, priority scheduling, and non-blocking retrieval to achieve up to 3.4× throughput and 0.2–0.6× latency while maintaining generation quality.

Developmental Federated Tuning: A Cognitive-Inspired Paradigm for Efficient LLM Adaptation

DEVFT decomposes federated fine-tuning into "small-to-large" developmental stages, growing from a compact sub-model to a full LLM. By employing de-conflicting layer grouping and differential layer fusion to enable cross-stage knowledge transfer, it achieves 4.59× convergence acceleration, 10.67× communication savings, and a 9.07% average performance improvement on edge devices.

DiffAdapt: Difficulty-Adaptive Reasoning for Token-Efficient LLM Inference

The authors find that reasoning LLMs exhibit a "U-shaped entropy curve" across problem difficulties: easy problems are answered correctly but with high entropy (overthinking). Based on this, a lightweight probe that reads only the model's hidden states is trained to dynamically select among Easy, Normal, and Hard reasoning strategies for each problem. Without fine-tuning the base model, this method reduces token consumption by up to 22.4% and end-to-end latency to 1/6 of the original, while maintaining or even improving accuracy.

Difficulty–Diversity Collaborative Filtering for Data-Efficient LLM Fine-Tuning

This paper treats the interaction matrix of "model-question" correctness as a recommendation system rating matrix. It employs collaborative filtering to learn personalized question difficulty for each target model, then performs combinatorial optimization with semantic diversity. By selecting the 1000 most valuable samples from large-scale unlabeled corpora, it reduces annotation costs by 100–200× while achieving downstream performance close to full-dataset fine-tuning.

Diffusion Language Models Know the Answer Before Decoding

Diffusion Language Models (DLMs) often internally determine the correct answer mid-way through decoding. Based on this, this paper proposes Prophet, a training-free decoding paradigm that uses the "logit gap between the top-2 candidate tokens" to judge answer convergence. Once converged, it fills all remaining positions in a single step (Early-Submission Decoding), reducing decoding steps by up to 3.4\(\times\) on LLaDA-8B / Dream-7B with almost no loss in accuracy.
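
A minimal sketch of the convergence test follows. The aggregation rule (every masked position must clear the threshold) and the threshold value are assumptions made for illustration; the summary only states that the top-2 logit gap is the signal.

```python
from typing import List, Sequence


def top2_gap(logits: Sequence[float]) -> float:
    """Difference between the largest and second-largest logit."""
    first, second = sorted(logits, reverse=True)[:2]
    return first - second


def should_commit_early(masked_logits: List[Sequence[float]], threshold: float) -> bool:
    """True when every still-masked position has a clear top-1 candidate."""
    return all(top2_gap(row) >= threshold for row in masked_logits)


def fill_all(masked_logits: List[Sequence[float]]) -> List[int]:
    """Decode every remaining position in one step by taking the argmax."""
    return [max(range(len(row)), key=row.__getitem__) for row in masked_logits]


confident = [[9.0, 1.0, 0.5], [0.2, 8.0, 1.0]]
uncertain = [[9.0, 1.0, 0.5], [3.0, 2.5, 1.0]]
```

With a threshold of 5.0, `confident` triggers early submission and `fill_all(confident)` returns the argmax token at each position, while `uncertain` keeps decoding step by step.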

Diffusion LLMs Can Do Faster-Than-AR Inference via Discrete Diffusion Forcing

This paper proposes Discrete Diffusion Forcing (D2F), which transforms pre-trained diffusion language models (dLLMs) into a hybrid AR-diffusion paradigm featuring "block-level autoregression + inter-block parallel decoding." By acquiring this capability via low-cost asymmetric distillation and pairing it with pipeline parallel decoding, the authors demonstrate for the first time that open-source dLLM inference throughput can surpass that of autoregressive LLMs of the same scale (achieving 2.5× speedup over LLaMA3 on GSM8K and 50× over the original dLLM).

DirMoE: Dirichlet-Routed Mixture of Experts

DirMoE decouples MoE routing into two separate decisions: "which experts to select (Bernoulli/Gumbel-Sigmoid)" and "how to distribute weights among selected experts (Dirichlet)." Using a Dirichlet Variational Autoencoder framework, it achieves end-to-end differentiability and introduces a mathematically guaranteed "sparsity knob" \(\lambda\) for direct calibration, enhancing expert specialization without the need for auxiliary load-balancing losses.

DiSRouter: Distributed Self-Routing for LLM Selections

DiSRouter replaces the traditional "centralized external router" with a distributed self-routing paradigm in which each LLM decides for itself whether to answer. Queries pass through a sequence of LLM agents ordered by increasing cost; each agent, based on its self-awareness, either provides an answer or passes the query to a more powerful model, achieving a better trade-off between performance and cost.
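
The cascade can be sketched in a few lines. The agents here are hypothetical callables that return an answer together with a self-assessed confidence, and the fixed threshold is an assumption; how the models actually acquire self-awareness is not shown.

```python
from typing import Callable, List, Tuple

# Each agent returns (answer, self_confidence); both toy agents below are made up.
Agent = Callable[[str], Tuple[str, float]]


def self_route(
    query: str,
    agents: List[Tuple[str, Agent]],  # ordered from cheapest to most expensive
    threshold: float = 0.7,
) -> Tuple[str, str]:
    """Return (agent_name, answer) from the first agent confident enough to answer."""
    name, answer = "", ""
    for name, agent in agents:
        answer, confidence = agent(query)
        if confidence >= threshold:
            break  # this agent keeps the query
    return name, answer  # the last agent answers if nobody was confident


def small_model(query: str) -> Tuple[str, float]:
    return ("4", 0.9) if "2+2" in query else ("not sure", 0.2)


def large_model(query: str) -> Tuple[str, float]:
    return ("detailed answer", 0.95)


agents = [("small", small_model), ("large", large_model)]
```

No central router appears anywhere in the loop: the only routing signal is each agent's own decision to answer or defer.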

Distilling to Hybrid Attention Models via KL-Guided Layer Selection

When distilling a pre-trained softmax attention Transformer into a hybrid model with "few softmax layers + many linear attention layers," the importance of each layer is scored by temporarily restoring it to softmax and measuring the reduction in KL distillation loss. By greedily selecting the \(K\) most critical layers to remain as softmax, the method significantly improves inference efficiency while maintaining long-context retrieval capabilities.
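
The scoring-and-selection step might look like the sketch below. `kl_loss` is a hypothetical callable that returns the distillation KL loss when the given set of layers uses softmax attention and all others use linear attention; the toy loss exists only to make the example runnable.

```python
from typing import Callable, FrozenSet, List


def select_softmax_layers(
    num_layers: int,
    k: int,
    kl_loss: Callable[[FrozenSet[int]], float],  # hypothetical evaluation hook
) -> List[int]:
    """Score each layer by the KL reduction from restoring it to softmax; keep the top k."""
    baseline = kl_loss(frozenset())  # every layer uses linear attention
    gain = {i: baseline - kl_loss(frozenset({i})) for i in range(num_layers)}
    top = sorted(gain, key=gain.get, reverse=True)[:k]
    return sorted(top)


# Toy loss: each layer restored to softmax removes a fixed amount of KL.
importance = [0.1, 2.0, 0.3, 1.5]


def toy_kl_loss(softmax_layers: FrozenSet[int]) -> float:
    return 10.0 - sum(importance[i] for i in softmax_layers)


chosen = select_softmax_layers(num_layers=4, k=2, kl_loss=toy_kl_loss)
```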

DND: Boosting Large Language Models with Dynamic Nested Depth

DND selects key tokens via a router at the end of Transformer layers and sends them back to the same layer for additional processing (nested depth). Combined with routing control loss and threshold schemes for precise token selection, it achieves average performance gains of 1.88% on Qwen3-1.7B and 0.87% on Qwen3-30B-A3B with minimal parameter increase (<0.1M).

DPad: Efficient Diffusion Language Models with Suffix Dropout

DPad discovers that Diffusion Language Models (dLLMs) compute attention for all future suffix tokens at every step but retain very few, causing significant redundancy. It employs a "sliding window + distance-decay dropout" to discard distant suffix tokens before attention computation. This training-free, plug-and-play method achieves up to 61.39× acceleration on LLaDA-1.5/GSM8K (1024 tokens, 1-shot) when combined with parallel decoding and prefix caching, with accuracy even slightly improved.

dParallel: Learnable Parallel Decoding for dLLMs

dParallel uses "Certainty-Forcing Distillation" to reshape the "serial certainty convergence" (token-by-token prediction) of diffusion language models (dLLMs) into "parallel simultaneous convergence." On GSM8K, LLaDA-8B achieves an 8.5× acceleration, with decoding steps reduced from 256 to 30 without accuracy loss.

DualMap: Enabling Both Cache Affinity and Load Balancing for Distributed LLM Serving

DualMap utilizes two independent hash functions to map each request to two candidate instances and selects the optimal one based on system status. By leveraging the "power of two choices" principle, it simultaneously achieves cache affinity and load balancing within a single scheduling framework, increasing effective request capacity by up to 2.25× under the same TTFT SLO.
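
The two-choice mapping can be sketched as follows. The salted SHA-256 hashes, the use of a request's prompt prefix as the key, and queue length as the only "system status" signal are all assumptions for illustration.

```python
import hashlib
from typing import List, Tuple


def _bucket(key: str, salt: str, n: int) -> int:
    """Deterministically hash `key` into one of n instances."""
    digest = hashlib.sha256((salt + key).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n


def candidates(prefix: str, n: int) -> Tuple[int, int]:
    """Two candidate instances from two independent (salted) hash functions."""
    return _bucket(prefix, "h1:", n), _bucket(prefix, "h2:", n)


def pick_instance(prefix: str, load: List[int]) -> int:
    """Power of two choices: route to the less loaded of the two candidates."""
    a, b = candidates(prefix, len(load))
    return a if load[a] <= load[b] else b
```

Because the candidate pair is a pure function of the prefix, requests sharing a prefix keep landing on the same two instances (cache affinity), while the load comparison steers traffic away from whichever of the two is currently overloaded.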

Dynamic-dLLM: Dynamic Cache-Budget and Adaptive Parallel Decoding for Training-Free Acceleration of Diffusion LLM

Dynamic-dLLM is a training-free inference acceleration framework for Diffusion LLMs. Addressing the "dynamic" variations of tokens across different layers and decoding steps, it employs Dynamic Cache Update (DCU) to adaptively allocate cache update budgets per layer and Adaptive Parallel Decoding (APD) to dynamically calibrate decoding thresholds per token. It achieves an average speedup of over 3× (up to 4.48×) on models like LLaDA and Dream with negligible accuracy loss.

Dynamic Speculative Agent Planning

Addressing the issue where fixed speculative step length \(k\) in "plan-while-speculating" LLM agents either fails to save time or wastes excessive redundant tokens, this paper proposes DSP. DSP utilizes a lightweight DistilBERT regressor to predict the optimal speculation distance for each step online (without pre-deployment preparation). The prediction is modeled as a state-value function in reinforcement learning and updated using TD learning. While maintaining "lossless acceleration," DSP reduces total costs by 30% and invalid costs by up to 60%, while providing a knob for users to navigate the latency-cost trade-off.

DynamicInfer: Runtime-Aware Sparse Offloading for LLMs Inference on a Consumer-Grade GPU

DynamicInfer targets consumer-grade GPUs with insufficient VRAM by dynamically scheduling LLM FFN neurons between CPU and GPU based on runtime activation patterns. It uses cross-layer prediction, layered neuron caching, and load-aware thresholds so that more of the neurons actually in use reside on the GPU, achieving significant speedups over llama.cpp and PowerInfer with nearly unchanged accuracy.

Efficient Resource-Constrained Training of Transformers via Subspace Optimization

The authors propose WASI (Weight-Activation Subspace Iteration), which leverages the hypothesis that "parameter subspaces remain stable during fine-tuning." It simultaneously compresses Transformer weights (via SVD + Gram-Schmidt subspace iteration) and activations (via Tucker decomposition). Both training and inference are performed within low-rank representations, achieving 62× training memory compression and 1.4× acceleration on Raspberry Pi 5 with negligible accuracy loss.

EntropyLong: Effective Long-Context Training via Predictive Uncertainty

EntropyLong utilizes the model's own predictive entropy to locate "information gaps," retrieves distant context, and empirically tests whether it reduces entropy at those positions. By retaining only dependencies that bring genuine information gain to concatenate 128K training samples, it constructs "verified" long-range dependencies, significantly outperforming heuristic data construction methods on RULER and LongBench-v2.

Equilibrium Language Models

This paper replaces a contiguous segment of intermediate Transformer layers with a lightweight "fixed-point module" that solves for an equilibrium state to equivalently represent deep stacking. The approach achieves a 28% parameter reduction while retaining 99% accuracy and is designed specifically for low-memory edge deployment.

Expert Divergence Learning for MoE-based Language Models

This work addresses the expert homogenization problem in MoE training by maximizing the Jensen-Shannon divergence of routing distributions across different data domains. This encourages different domains to activate distinct subsets of experts, improving expert specialization and language modeling performance on models ranging from 3B to 15B parameters.
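
For reference, the Jensen-Shannon divergence of several routing distributions can be computed as below. The uniform weighting across domains and the toy distributions are assumptions; how the paper aggregates routing probabilities per domain is not shown.

```python
import math
from typing import List, Sequence


def kl(p: Sequence[float], q: Sequence[float]) -> float:
    """KL(p || q) in nats, skipping zero-probability entries of p."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(dists: List[Sequence[float]]) -> float:
    """Generalized JS divergence with uniform weights over the distributions."""
    n = len(dists)
    mixture = [sum(d[i] for d in dists) / n for i in range(len(dists[0]))]
    return sum(kl(d, mixture) for d in dists) / n


# Per-domain average routing probabilities over two experts (illustrative).
homogenized = [[0.5, 0.5], [0.5, 0.5]]  # both domains use experts identically
specialized = [[1.0, 0.0], [0.0, 1.0]]  # each domain has its own expert
```

The divergence is 0 for `homogenized` and reaches its maximum of \(\ln 2\) for `specialized`, so adding its negative to the training loss rewards domain-specific expert usage.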

Expert Merging in Sparse Mixture of Experts with Nash Bargaining

The authors reinterpret "expert merging" in Sparse MoE as a cooperative-competitive game between experts. They derive the merging coefficients for each expert from first principles using the Nash Bargaining Solution (NBS) and incorporate complex momentum to accelerate cross-layer propagation. This results in NAMEx, a unified framework that replaces the heuristic weighting used in CAMEx.

Explainable Token-level Noise Filtering for LLM Fine-tuning Datasets

Addressing the mismatch where "fine-tuning data is sentence-level, but LLM optimization is token-level," this paper proposes XTF. It decomposes the contribution of each token into three explainable attributes: Reasoning Importance (RI), Knowledge Novelty (KN), and Task Relevance (TR). Tokens lacking any of these attributes are identified as noise and masked during training. This approach improves fine-tuning accuracy by up to 13.7% across math, code, and medical tasks on 7 mainstream LLMs.

Extending the Context of Pretrained LLMs by Dropping Their Positional Embedding

RoPE is a critical inductive bias for accelerating convergence during pretraining but serves as the root cause hindering length extrapolation. This paper proposes DroPE: by directly removing all positional embeddings after pretraining and performing a brief "re-calibration" with a minimal number of tokens, LLMs can zero-shot generalize to sequences far exceeding their training length without any long-context fine-tuning.

Fast-dLLM: Training-free Acceleration of Diffusion LLM by Enabling KV Cache and Parallel Decoding

Fast-dLLM accelerates bidirectional Diffusion LLMs without retraining by introducing a block-wise approximate KV Cache and replacing fixed top-K parallel decoding with a "confidence threshold" strategy. It achieves up to a 27.6× end-to-end throughput gain on LLaDA and Dream with almost no loss in accuracy.
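
The confidence-threshold rule can be sketched as follows. The fallback that always unmasks the single most confident position (so every step makes progress) is an assumption of this sketch, as are the example values.

```python
from typing import Dict, List


def select_positions(confidence: Dict[int, float], threshold: float) -> List[int]:
    """Pick masked positions to decode this step.

    `confidence` maps each masked position to the model's top-1 probability.
    """
    if not confidence:
        return []
    chosen = [pos for pos, c in confidence.items() if c >= threshold]
    if not chosen:
        # Assumed fallback: decode the most confident position anyway.
        chosen = [max(confidence, key=confidence.get)]
    return sorted(chosen)


step_confidence = {3: 0.95, 4: 0.40, 5: 0.91, 6: 0.20}
```

Unlike fixed top-K decoding, the number of tokens committed per step varies with how certain the model is: two positions at a threshold of 0.9 here, but only one at 0.99.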

Fast-dLLM v2: Efficient Block-Diffusion LLM

Fast-dLLM v2 transforms pre-trained autoregressive Qwen2.5 models into block-diffusion language models via lightweight fine-tuning with approximately 1B tokens. Combined with hierarchical caching and confidence-based parallel decoding, it achieves up to 2.5× speedup over AR decoding without performance degradation.

Fast Catch-Up, Late Switching: Optimal Batch Size Scheduling via Functional Scaling Laws

This paper theoretically derives the optimal strategy for batch size scheduling through the Functional Scaling Law framework—for difficult tasks, the optimal strategy involves training with a small batch size for most of the duration and switching to a large batch size only in the final stage (late switching). It reveals the "fast catch-up" effect—where the loss rapidly catches up to the trajectory of a constant large batch after the switch—and validates this principle in 1.1B parameter, 1T token LLM pretraining.
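
A late-switching schedule is simple to express in code. The switch point `switch_frac` and the batch sizes below are hypothetical placeholders, not values from the paper.

```python
def late_switch_batch_size(
    step: int,
    total_steps: int,
    small: int,
    large: int,
    switch_frac: float = 0.75,  # hypothetical switch point
) -> int:
    """Small batch for most of training, large batch only in the final stage."""
    switch_step = int(switch_frac * total_steps)
    return small if step < switch_step else large


schedule = [late_switch_batch_size(s, 1000, small=256, large=4096) for s in range(1000)]
```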

FlashDLM: Accelerating Diffusion Language Model Inference via Efficient KV Caching and Guided Diffusion

Two training-free techniques enable a 7B/8B diffusion language model to achieve an average end-to-end speedup of 12×: FreeCache, which reuses stable KV projections, and Guided Diffusion, which uses consistency signals from small AR models to guide parallel demasking. For the first time, this makes diffusion LLM latency comparable to, or even lower than, that of same-sized autoregressive models.

Flatter Tokens are More Valuable for Speculative Draft Model Training

This paper discovers from a data-centric perspective that during speculative decoding draft model training, tokens with "flatter" (closer to uniform) target model prediction distributions are more valuable. Based on this, it proposes a target-model-only, offline-calculable flatness metric and the SFDD data distillation method, achieving over 2× training acceleration with 50% data while incurring less than a 4% loss in inference speedup.

FlexLinearAttention: Compiling a Unified Abstraction into Scalable Kernels for Linear Attention

FlexLA unifies various linear attention variants into a three-stage abstraction: "intra-chunk computation / inter-chunk state propagation / output merging." This allows users to describe algorithms in dozens of lines of PyTorch. A domain-specific compiler then automatically generates high-performance Triton kernels that fuse computation and communication, matching or exceeding the expert-handwritten library FLA (\(1.01\times \text{--} 4.9\times\)) on a single GPU and achieving up to \(7.2\times\) speedup over LASP2 on distributed systems, with near-linear scaling up to 128 GPUs and 16 million tokens.

FLoRG: Federated Fine-tuning with Low-rank Gram Matrices and Procrustes Alignment

FLoRG reparameterizes the two low-rank matrices of LoRA into a single low-rank matrix and aggregates only its Gram matrix. This transforms server-side aggregation from a "biased bilinear operation" into an "unbiased linear operation." It then employs Procrustes alignment to resolve the drift caused by non-unique decomposition, simultaneously eliminating aggregation errors, reducing communication overhead (up to 2041×), and tightening convergence bounds in federated fine-tuning.

Frayed RoPE and Long Inputs: A Geometric Perspective

This paper explains why RoPE models collapse beyond their training length through a unified geometric perspective: long inputs scatter and overlap the originally distinct key/query clusters, causing sink tokens (attention focal points) to fail. Based on this, the authors propose RoPE-ID, which applies high-frequency rotation to only half of the channels, enabling training-free extrapolation to longer contexts that matches or exceeds YaRN on RULER and LongBench.

FreeKV: Boosting KV Cache Retrieval for Efficient LLM Inference

FreeKV is a training-free algorithm-system co-optimization framework. It removes KV page selection and recall from the inference critical path via "speculative retrieval," compensates for accuracy loss with "fine-grained correction," and utilizes a hybrid CPU/GPU memory layout with double-buffering streaming recall. This allows retrieval-based KV cache compression to achieve up to 13× speedup over SOTA retrieval methods with almost no loss in accuracy.

From Collapse to Control: Understanding and Extending Context Length in Emerging Hybrid Models via Universal Position Interpolation

This paper systematically explains why hybrid Mamba-Transformer models suffer from context collapse beyond their training window and proposes Universal Position Interpolation (UPI). By simultaneously scaling Transformer RoPE frequencies and the step size \(\Delta_t\) of a few unstable Mamba heads, UPI extends the usable context of Bamba, Nemotron-H, and Mamba2 from 4K/8K up to 64K without retraining.

FSA: An Alternative Efficient Implementation of Native Sparse Attention Kernel

FSA reorders the NSA sparse attention kernel from "outer loop query token, inner loop KV block" to "outer loop KV block, inner loop query token." This eliminates padding waste in mainstream LLMs with few query heads per GQA group, reducing kernel latency by up to 3.5× and accelerating end-to-end training by up to 1.25×.

FutureFill: Fast Generation from Convolutional Sequence Models

To address the slow autoregressive decoding of convolutional/spectral sequence models (e.g., STU, Hyena), this paper proposes the FutureFill primitive. By using FFT to precompute the "contribution of generated tokens to future tokens," it reduces the time to generate \(L\) tokens from \(O(L^2)\) to quasi-linear \(O(L\log^2 L)\). For prompted generation, the cache is reduced from \(O(L+K)\) to \(O(K)\) (related only to generation length), all while remaining exact and without approximation.

Global Resolution: Optimal Multi-Draft Speculative Sampling via Convex Optimization

This paper provides the first efficient algorithm for the exact solution of the Optimal Transport (OT) verification criterion in multi-draft speculative sampling. It simplifies the exponential-scale Optimal Transport Linear Programming (OTLP) into a convex minimization problem with at most \(V\) variables, achieving a 90% acceptance rate with a per-token overhead under 100 ms under the i.i.d. draft setting.

Group Representational Position Encoding (GRAPE)

This paper proposes the GRAPE framework, which unifies the multiplicative (RoPE) and additive (ALiBi/FoX) position encoding families in Transformers through group actions. It proves that RoPE and ALiBi are exact special cases and introduces a path-integral additive variant, GRAPE-AP, which outperforms existing methods on downstream tasks.

Guided Speculative Inference for Efficient Test-Time Alignment of LLMs

GSI uses a small draft model to sample reasoning steps and performs soft best-of-n using a "tilted reward" that combines the reward with a log-likelihood-ratio correction. It falls back to the target model for resampling when scores are too low. On mathematical reasoning benchmarks, GSI approaches or even exceeds the accuracy of the large model's best-of-n while reducing end-to-end latency by up to 28%. It is the first speculative test-time scaling method with distribution guarantees for the optimal tilted policy.

Gumbel Distillation for Parallel Text Generation

This work uses the Gumbel-Max trick to externalize the "sampling randomness" of an autoregressive teacher into a deterministic Gumbel noise "blueprint." This allows parallel student models to learn a simple supervised "noise \(\rightarrow\) text" mapping, reducing the difficult joint distribution modeling problem to a straightforward regression and significantly closing the quality gap between parallel and autoregressive decoding.
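
The Gumbel-Max trick itself can be sketched as follows. This is only the standard sampling identity, not the paper's distillation pipeline, and the helper names `draw_gumbel` and `gumbel_max_sample` are hypothetical.

```python
import numpy as np

def draw_gumbel(rng, shape):
    """Draw standard Gumbel noise via the inverse-CDF transform."""
    u = rng.uniform(low=1e-12, high=1.0, size=shape)
    return -np.log(-np.log(u))

def gumbel_max_sample(logits, gumbel_noise):
    """Once the noise is fixed, the chosen token is a deterministic function:
    argmax(logits + noise) is distributed as softmax(logits) over the noise."""
    return int(np.argmax(logits + gumbel_noise))
```

With the noise held fixed, the teacher's output becomes a deterministic target, which is what makes a supervised noise-to-text mapping possible.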

Hierarchy Decoding: A Training-free Parallel Decoding Strategy for Diffusion Large Language Models

Addressing the performance degradation issue in diffusion Large Language Models (dLLMs) when decoding multiple tokens per step, this paper proposes the training-free Hierarchy-dLLM. It employs a divide-and-conquer approach to recursively partition continuous masked regions into sparse sub-regions for parallel decoding. By maintaining a sparse distribution of undecoded tokens to suppress distribution drift, it achieves up to a 17× speedup (approximately 1.5× faster than Fast-dLLM) while maintaining or even slightly improving accuracy.

Householder-Diagonalized Linear Attention (HDLA): Utilizing Rank-Enhanced Decay Mechanism for Efficient Sequence Modeling

HDLA utilizes generalized Householder matrices to achieve congruence diagonalization of the decay matrix in linear attention, extending the structure from the prevalent "diagonal + rank-1" to a more expressive "diagonal + rank-2" form. Combined with a chunk-wise parallel algorithm supporting arbitrary ranks, HDLA outperforms existing linear attention baselines across language modeling perplexity, MQAR/RULER retrieval, and MAD synthetic tasks with lower computational overhead.

ICaRus: Identical Cache Reuse for Efficient Multi-Model Inference

ICaRus conceptually splits a decoder-only Transformer into a "logical encoder (generating KV cache) + logical decoder (predicting tokens)." By fine-tuning only the logical decoder and freezing the logical encoder, a suite of task-specialized models can share the same bit-wise identical KV cache. This eliminates cache explosion and redundant prefill in multi-model serving, reducing P95 latency by 11.1× and increasing throughput by 3.8× in 8-model agentic workflows.

IceCache: Memory-Efficient KV-cache Management for Long-Sequence LLMs

IceCache integrates "clustering tokens by semantic similarity" with the paging mechanism of PagedAttention. By utilizing an incrementally updatable multi-level DCI tree, it packs semantically similar tokens into the same physical memory pages. This ensures that relevant tokens are highly co-located during query-aware retrieval, significantly improving hit rates and maintaining near-full cache accuracy with lower latency while using only 25% of the KV-cache budget.

In-Place Test-Time Training

This paper treats the down-projection matrix \(W_{down}\) of the MLP block in Transformers as "fast weights" that can be updated during inference. By combining an alignment objective for Next-Token Prediction with a chunk-based update mechanism, existing pre-trained LLMs can achieve "plug-and-play" Test-Time Training (TTT) capabilities without changing the architecture or training from scratch. This approach consistently outperforms the original models and competitors like GLA / DeltaNet / LaCT on long contexts ranging from 128k to 256k.

Inference-Cost-Aware Dynamic Tree Construction for Efficient Inference in Large Language Models

CAST introduces the overlooked system variable of "inference cost" (GPU model, batch size) into the dynamic draft tree construction of EAGLE-2/3. By modeling tree width, depth, and the number of verification tokens as a utility maximization problem of "acceptance gain vs. inference cost," it achieves slight improvements in single-sample scenarios and significant gains in batch scenarios over existing SOTA, reaching up to 5.2× acceleration.

InfLLM-V2: Dense-Sparse Switchable Attention for Seamless Short-to-Long Adaptation

InfLLM-V2 utilizes a trainable sparse attention with "zero extra parameters and reused dense attention weights," allowing the model to switch seamlessly between dense and sparse modes based on sequence length. This aligns with the "short pre-training → long fine-tuning" paradigm and is implemented via hardware-friendly block selection, achieving 4× speedup over dense attention while retaining 98.1% / 99.7% of performance in long-context understanding and reasoning.

Influence-Preserving Proxies for Gradient-Based Data Selection in LLM Fine-Tuning

IPROX abandons off-the-shelf small models as proxies for gradient-based influence data selection. Instead, it "distills" a low-rank proxy that preserves influence information directly from the target LLM—combining influence-weighted SVD compression with gradient-alignment fine-tuning. This allows a smaller proxy to outperform larger off-the-shelf models in data selection tasks.

IterResearch: Rethinking Long-Horizon Agents with Interaction Scaling

IterResearch proposes an MDP-based iterative deep research paradigm. By replacing linear context accumulation with periodic workspace reconstruction, the agent scales to 2048 interactions within a 40K context limit (improving performance from 3.5% to 42.5%), outperforming open-source agents by 14.5 percentage points on average across six benchmarks.

KnowProxy: Adapting Large Language Models by Knowledge-guided Proxy

KnowProxy uses a small proxy model to "digest" textual knowledge generated by a frozen large model for downstream adaptation. This approach moves away from the traditional proxy-tuning dependency on LLM probability distributions, enabling efficient fine-tuning for black-box LLMs while using dynamic routing to invoke the proxy only when the large model is uncertain.

Learning To Draft: Adaptive Speculative Decoding with Reinforcement Learning

LTD models "how deep to draft" and "how many candidate tokens to verify" in tree-based speculative decoding as two collaborative reinforcement learning policies. By directly using the throughput of each draft-and-verify cycle as the reward, it consistently improves LLM inference speed on Eagle3.

Learning to Parallel: Accelerating Diffusion Large Language Models via Learnable Parallel Decoding

Addressing the pain point that parallel decoding in diffusion large language models (dLLMs) relies on fixed heuristics (e.g., confidence thresholds) and lacks adaptability to different inputs, this paper employs an extremely lightweight learnable filter (2-layer MLP, ~2k parameters, 6-minute training) to approximate an oracle strategy of "finalize immediately once predicted correctly." Coupled with End-of-Text early stopping, it achieves up to 22.58× acceleration on LLaDA-8B with almost no performance loss, reaching 57.51× when combined with KV-Cache.

Let's (not) just put things in Context: Test-time Training for Long-context LLMs

This paper identifies that retrieval failure in long-context LLMs stems from the score dilution of static self-attention (distractor tokens dilute the attention quality on the target) and demonstrates that "thinking tokens" cannot fix this issue. It proposes query-only Test-time Training (qTTT)—caching KV during a single prefill, then performing a few gradient updates solely on the query projection matrix with fixed KV. Under an equivalent FLOP budget, qTTT significantly outperforms thinking tokens by redistributing inference-time compute from "generating more tokens" to "a few targeted query updates."
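
Score dilution can be illustrated with a toy calculation, assuming one target token with a fixed logit and many distractors that all share a lower logit. The function name and the choice of logits are illustrative assumptions, not values from the paper.

```python
import math

def target_attention_weight(target_logit, num_distractors, distractor_logit=0.0):
    """Softmax weight on a single target token among equal-logit distractors."""
    t = math.exp(target_logit)
    return t / (t + num_distractors * math.exp(distractor_logit))

# The target's weight shrinks as the context grows, even though its logit is unchanged.
weights = [target_attention_weight(5.0, n) for n in (100, 10_000, 1_000_000)]
```

Generating more tokens does not change the query that produces these logits, whereas updating the query projection does.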

Libra: Effective yet Efficient Load Balancing for Large-scale MoE Inference

Libra achieves near-perfect load balancing for Qwen3MoE and GLM-4.5 on 8 H200 GPUs by combining "speculative execution to predict next-layer expert activation" with a "two-stage locality-aware execution flow," completely hiding the overhead of expert replication and token sharding behind MoE computation. It improves prefill throughput by up to 19.2%.

Local Linear Attention: An Optimal Interpolation of Linear and Softmax Attention for Test-Time Regression

Viewing attention as a "test-time regression solver," the authors upgrade Softmax attention using local linear regression from statistics to derive Local Linear Attention (LLA). LLA combines the asymptotic convergence of linear attention with the strength of Softmax. Additionally, a hardware-efficient FlashLLA tiling algorithm is designed to reduce the memory complexity of a naive implementation from quadratic back to linear.

Log-Linear Attention

The authors replace the "fixed-size hidden state" in linear attention with a set of multi-scale hidden states that grow logarithmically with sequence length. This maintains matrix-multiply-friendly parallel training (\(O(T \log T)\) computation, \(O(\log T)\) decoding memory) while pushing the expressivity of linear attention toward that of softmax attention.
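
One way to picture a logarithmically growing set of states is a binary decomposition of the prefix, sketched below. The partition rule and the name `prefix_buckets` are assumptions made for illustration; the paper's exact bucketing may differ.

```python
def prefix_buckets(t):
    """Split the prefix [0, t) into power-of-two-sized segments, largest first.
    Returns half-open (start, end) ranges; there are at most t.bit_length() of them."""
    buckets, start = [], 0
    for bit in reversed(range(t.bit_length())):
        size = 1 << bit
        if t & size:                      # this power of two is present in t
            buckets.append((start, start + size))
            start += size
    return buckets
```

Under this scheme the oldest tokens are summarized coarsely and the most recent ones finely, with one hidden state per bucket.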

Long-Context Attention Benchmark: From Kernel Efficiency to Distributed Context Parallelism

This paper introduces LongCA-bench, a long-context attention benchmark that unifies 7 dense operators, 5 sparse operators, and 5 context parallelism mechanisms under a single data interface. Using up to 96 H100 GPUs and 768K sequence lengths, the study systematically evaluates the speed/memory trade-offs of various methods across a three-dimensional space of "mask patterns × sequence length × distributed scale."

LoopFormer: Elastic-Depth Looped Transformers for Latent Reasoning via Shortcut Modulation

LoopFormer explicitly conditions each iteration of a looped Transformer on a "normalized time \(t\) + step size \(\Delta t\)" and uses shortcut-consistency training to align trajectories of different lengths to the same endpoint. This enables a single model to gracefully scale its depth based on any arbitrarily specified inference budget \(M\) without retraining, avoiding the representation collapse typical of naive early exiting.

LoRA-S: An Efficient Low Rank Adaptation scheme via Sylvester equation

This paper employs the "horizontal lift" theory from differential geometry to optimize LoRA's two low-rank factors on a quotient manifold. It derives a universal iterative framework that enables any preconditioned optimizer to automatically achieve "Efficient Feature Learning (EFL) / transformation invariance." Furthermore, it replaces the hand-tuned weight decay hyperparameter with a decay matrix \(K\) solved via the Sylvester equation, resulting in two plug-and-play efficient LoRA optimizers: AdamS and LRACS.

LoRAGen: Structure-Aware Weight Space Learning for LoRA Generation

LoRAGen focuses on the "structural characteristics of the LoRA parameter space" by employing weight space loss on the full adaptation matrix \(\Delta W\) and a module-aware MoE decoder. This allows a latent diffusion model to generate LoRA parameters directly from natural language task descriptions, achieving performance close to task-specific LoRAs in-distribution and exceeding baselines by nearly 5 points on unseen tasks.

LouisKV: Efficient KV Cache Retrieval for Long Input-Output Sequences

LouisKV observes that key KVs exhibit strong temporal locality during decoding and distinct distribution patterns across input and output sequences. By replacing "per-token retrieval + page-level management" with "semantic boundary-triggered retrieval + decoupled fine-grained management (input clustering/output segmenting)," it achieves up to a 4.7× speedup over SOTA retrieval methods on various long-sequence tasks with near-zero accuracy loss.

LycheeDecode: Accelerating Long-Context LLM Inference via Hybrid-Head Sparse Decoding

This paper proposes LycheeDecode, which divides attention heads into a few retrieval heads (which run full attention to select key tokens) and numerous sparse heads (which reuse the selected tokens for sparse computation). Using the HardKuma distribution to learn head types end to end, it achieves a 2.7× speedup on 128K contexts without performance degradation.

MemAgent: Reshaping Long-Context LLM with Multi-Conv RL-based Memory Agent

MemAgent segments "unbounded long documents" into fixed-length chunks for streaming processing, replacing the ever-expanding context with a fixed-length, overwritable token memory. By utilizing an extended version of Multi-Conv DAPO for end-to-end training of memory read/write strategies, 8K window models can extrapolate to 3.5M token QA tasks with nearly no performance loss and strictly linear inference complexity.

Merge before Forget: A Single LoRA Continual Learning via Continual Merging

This paper reformulates "Continual Learning" as a "Sequential Model Merging" problem, maintaining only one pair of LoRA matrices {A, B} throughout the process. It initializes A for new tasks using the orthogonal basis of the previous task and performs time-aware scaling for merging B based on LoRA asymmetry. This reduces memory complexity from linear growth to constant while mitigating forgetting and rigidity.

MesaNet: Sequence Modeling by Locally Optimal Test-Time Training

MesaNet pushes "Test-Time Training" to its optimum: unlike DeltaNet which takes a single gradient step per token, it optimally solves the cumulative regularized squared error of "fitting a linear model using context" at every time step. Through a Conjugate Gradient (CG) solver and block-parallelism, the Mesa layer—previously restricted to serial execution and numerically unstable—can be scaled to billion-parameter training on GPU/TPU for the first time.
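
The "solve to optimality with CG" idea can be sketched as a ridge-regression readout, assuming the objective \(\sum_i \|W k_i - v_i\|^2 + \lambda \|W\|_F^2\). The function names, the dense (non-chunked) formulation, and the fixed \(\lambda\) are illustrative assumptions rather than the Mesa layer itself.

```python
import numpy as np

def conjugate_gradient(A, b, iters=50, tol=1e-16):
    """Solve A x = b for a symmetric positive-definite matrix A."""
    x = np.zeros_like(b)
    r = b - A @ x
    p = r.copy()
    rs = r @ r
    for _ in range(iters):
        if rs < tol:                      # squared residual small enough
            break
        Ap = A @ p
        alpha = rs / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

def ridge_readout(K, V, q, lam=1.0):
    """Output W* q, where W* minimizes sum_i ||W k_i - v_i||^2 + lam * ||W||_F^2."""
    H = K.T @ K + lam * np.eye(K.shape[1])   # regularized key covariance
    G = V.T @ K                              # value-key cross term
    return G @ conjugate_gradient(H, q)
```

CG needs only matrix-vector products with `H`, which is part of what makes this kind of solve amenable to batched execution on accelerators.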

MeSH: Memory-as-State-Highways for Recursive Transformers

This paper identifies two root causes for why recursive Transformers lag behind non-recursive models of equal compute—"undifferentiated computation" and "information overload." It proposes MeSH: a scheme using explicit memory slots and step-wise learnable read/write routers to replace the overloaded single hidden state. This allows a 1.4B recursive model to outperform a same-scale Vanilla Transformer while using 33% fewer parameters.

Meta-UCF: Unified Task-Conditioned LoRA Generation for Continual Learning in Large Language Models

The authors utilize a shared hypernetwork to instantly translate lightweight task embeddings into full-layer LoRA updates. By employing meta-contrastive and orthogonal objectives to push task embeddings toward near-orthogonality, they achieve continual learning without forgetting while maintaining constant memory (equivalent to the parameters of a single adapter).

MHLA: Restoring Expressivity of Linear Attention via Token-Level Multi-Head

This paper identifies "Global Context Collapse" as the root cause of performance degradation in linear attention—where all queries share a fixed \(d \times d\) global KV summary, capping the attention matrix rank at \(d\). The authors propose Multi-Head Linear Attention (MHLA), which partitions the sequence along the token dimension and utilizes a learnable coefficient matrix for query-conditioned mixing of local summaries. This approach elevates the rank upper bound to \(\sum_b \min(n_b, d)\) and restores the expressivity of softmax attention while maintaining \(O(N)\) complexity and avoiding extra convolutional or gating modules.
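
The two rank bounds quoted above can be compared with a small calculation. The block sizes and head dimension used here are made-up illustrative numbers, and `rank_upper_bounds` is a hypothetical helper.

```python
def rank_upper_bounds(block_sizes, d):
    """Return (bound with one global d x d summary, bound with per-block summaries)."""
    global_bound = min(sum(block_sizes), d)
    blockwise_bound = sum(min(n_b, d) for n_b in block_sizes)
    return global_bound, blockwise_bound

# Example: 1024 tokens split into 8 blocks of 128, head dimension d = 64.
bounds = rank_upper_bounds([128] * 8, d=64)
```

In this example the global bound is 64, while the blockwise bound is 8 × 64 = 512.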

MiSS: Revisiting the Trade-off in LoRA with an Efficient Shard-Sharing Structure

MiSS replaces the dual-matrix \(BA\) update of LoRA with a shard-sharing structure "expanded" from a single zero-initialized small matrix \(D\). This approach accelerates convergence while simultaneously excelling in memory and computational efficiency, thereby achieving a superior trade-off in the performance–memory–efficiency triangle.

Mitigating Non-IID Drift in Zeroth-Order Federated LLM Fine-Tuning with Transferable Sparsity

The paper proposes MEERKAT, a sparse zeroth-order federated fine-tuning method that updates only 0.1% of pre-trained sensitive parameters. It suppresses Non-IID drift through "extreme sparsity + high-frequency synchronization." Using traceable virtual paths, the authors discover the GradIP phenomenon, which enables MEERKAT-VP to identify extreme Non-IID clients and stop them early, improving global model quality.

Mixture-of-Experts Can Surpass Dense LLMs Under Strictly Equal Resource

Under the premise of strictly equal total parameters N, training compute C, and data volume D, the authors optimize the MoE backbone and control the activation rate within an optimal range of approximately 20%. This demonstrates for the first time that MoE can consistently surpass dense models of equivalent resources. A data reuse strategy is employed to eliminate MoE's additional data requirements.

MoL: Adaptive Mixture-of-Length Reasoning for Efficient Question Answering with Context

MoL utilizes a difficulty assessment based on cross-document information redundancy to assign a "difficulty score" to each question. It employs a dual-objective reward—rewarding expansion for incorrect answers and compression for correct ones—integrated with GRPO training. This encourages the model to naturally exhibit "intelligent conciseness": answering simple questions briefly and complex ones extensively, simultaneously improving accuracy and significantly compressing tokens across multiple contextual QA tasks.

MoM: Linear Sequence Modeling with Mixture-of-Memories

MoM replaces the single fixed-size memory in linear models with a set of independent memory states and a routing network. This allows different tokens to update only their assigned memories, significantly expanding memory capacity and eliminating write interference while maintaining linear complexity, bringing performance on recall-intensive tasks close to Transformers.

Neuron-Aware Data Selection in Instruction Tuning for Large Language Models

NAIT proposes using "neuron activation patterns" to select instruction tuning data. Specifically, it extracts directional vectors corresponding to specific abilities from a small number of in-domain samples and then ranks candidate samples based on their alignment scores with these vectors. On LLaMA-2-7b, using only 10% of Alpaca-GPT4 data selected via NAIT achieves a 3.24% average improvement over full fine-tuning. The method does not rely on external LLMs and costs only 1/19th of AlpaGasus.

NI Sampling: Accelerating Discrete Diffusion Sampling by Token Order Optimization

The study transforms the conservative "unlocking only a few tokens per step" sampling of Discrete Diffusion Language Models (dLLMs) into "unlocking all correctly predictable tokens at once." It replaces fixed confidence thresholds with a lightweight neural network (Neural Indicator) to perform this judgment. Compared to full-step sampling, it achieves up to 14.3× acceleration (25.0× when combined with KV caching) on LLaDA / Dream with almost no performance degradation.

Not-a-Bandit: Provably No-Regret Drafter Selection in Speculative Decoding for LLMs

Addressing the problem of "how to dynamically select the optimal drafter from multiple domain experts for each query," this paper points out that exploration is redundant in speculative decoding. A single trajectory verified by the target can counterfactually evaluate all drafters. Thus, the original multi-armed bandit problem is transformed into a full-information online learning problem. The proposed HedgeSpec achieves no-regret across \(N\) drafters, accelerating EAGLE-3 by up to 83.7% and improving MAT by up to 49% compared to bandit baselines.
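
The full-information setting can be sketched with a generic Hedge (multiplicative-weights) update. The loss values and learning rate below are hypothetical, and the paper's exact loss definition for drafters is not reproduced here.

```python
import math

def hedge_update(weights, losses, eta):
    """One Hedge step: every drafter's loss is observed, so all weights are updated."""
    new = [w * math.exp(-eta * loss) for w, loss in zip(weights, losses)]
    total = sum(new)
    return [w / total for w in new]

# Three hypothetical drafters; the second consistently incurs the lowest loss.
w = [1 / 3, 1 / 3, 1 / 3]
for _ in range(50):
    w = hedge_update(w, [0.9, 0.2, 0.6], eta=0.5)
```

Because a single verified trajectory scores every drafter, no rounds are spent exploring arms whose losses would otherwise be unobserved.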

Not All Bits Are Equal: Scale-Dependent Memory Optimization Strategies for Reasoning Models

Through 1700+ groups of experiments, this paper systematically demonstrates that the conclusion "4-bit quantization is memory-optimal," established for non-reasoning models, fails for reasoning models. The memory-optimal strategy is determined by the model's effective size (parameter count × bit-width), with a critical point at "8-bit 4B". Small models should spend memory on larger weights, while large models should spend it on longer generation or more parallel sampling.
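
The effective-size quantity can be made concrete with a short calculation, assuming it is read as weight memory (parameter count × bit-width, converted to bytes). The helper name and the choice of decimal gigabytes are assumptions.

```python
def effective_size_gb(num_params, bits_per_weight):
    """Weight memory in gigabytes (1 GB = 1e9 bytes) for a given bit-width."""
    return num_params * bits_per_weight / 8 / 1e9

# "8-bit 4B" and "4-bit 8B" occupy the same weight memory.
critical_point = effective_size_gb(4e9, 8)
same_budget = effective_size_gb(8e9, 4)
```

Both configurations come to 4 GB of weights, so at a fixed memory budget the remaining choice is how to split capacity between weights and generation-time state.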

Not All Models Suit Expert Offloading: On Local Routing Consistency of Mixture-of-Expert Models

This paper proposes two metrics, SRP and SCH, to quantify the "Local Routing Consistency" (whether consecutive tokens tend to activate the same set of experts) of MoE models. Through systematic analysis of 20 real-world MoE LLMs and a series of controlled toy models, it reveals trade-offs between local load balancing and routing consistency. Key findings include that shared experts hurt consistency while domain-specialized experts enhance it. The paper concludes with the deployment insight that a cache size of twice the number of active experts is most cost-effective.

On-the-Fly Adaptation to Quantization: Configuration-Aware LoRA for Efficient Fine-Tuning of Quantized LLMs

CoA-LoRA trains a "configuration-aware model" that directly maps any layer-wise quantization configuration to lightweight low-rank adjustments. This allows a single LoRA adapter to adapt to various bit-width combinations without per-configuration fine-tuning. Combined with a Pareto-based Gaussian Process configuration search to select high-quality training sets, it achieves a 1.74%–8.89% accuracy improvement over SOTA on four GLUE tasks, with total fine-tuning time staying nearly constant regardless of the number of configurations.

One-Prompt Strikes Back: Sparse Mixture of Experts for Prompt-based Continual Learning

This paper proposes SMoPE, a framework that organizes a single shared prompt into multiple prompt experts within a sparse MoE structure. By implementing dynamic sparse activation via prompt-attention score aggregation, it significantly alleviates knowledge interference while maintaining high parameter efficiency, achieving SOTA results across multiple continual learning benchmarks.

OPPO: Accelerating PPO-based RLHF via Pipeline Overlap

OPPO is a lightweight, model-agnostic framework for accelerating PPO-RLHF training. By overlapping actor generation and reward scoring via chunked streaming within a single step and utilizing inter-step "overcommitment" to defer tail latencies, it achieves training speedups of 1.8×–2.8× and increases GPU utilization by 1.4×–2.1× without altering PPO updates or degrading convergence quality.

Out of the Memory Barrier: A Highly Memory-Efficient Training System for LLMs with Million-Token Contexts

OOMB transforms million-token context LLM training into a system that proceeds serially by chunks, discards activations immediately, and recomputes them during the backward pass. It utilizes paged KV cache, asynchronous CPU offloading, and page-level sparse attention to manage the states that truly grow with sequence length, enabling Qwen2.5-7B to be trained with 4M token contexts on a single H200.

Overcoming Joint Intractability with Lossless Hierarchical Speculative Decoding

This paper proposes Hierarchical Speculative Decoding (HSD), a novel verification strategy using "hierarchical branch resampling + capping." Without altering the target model distribution (provably lossless), it significantly improves the number of draft tokens accepted per step, increasing average decoding speed by 6.7%, with gains exceeding 12% when integrated into EAGLE-3.

Parallel Sampling from Masked Diffusion Models via Conditional Independence Testing

PUNT is a training-free and model-agnostic sampler for Masked Diffusion Models (MDMs). In each step, it utilizes "contextual independence" testing combined with divide-and-conquer pruning to select a batch of non-interfering, high-confidence tokens for simultaneous decoding in only \(O(\log |M|)\) forward passes. It achieves higher generation quality with fewer forward passes on long-text alignment benchmarks (up to 16% higher on IFEval compared to baselines).

ParaRNN: Unlocking Parallel Training of Nonlinear RNNs for Large Language Models

The authors recast the sequential recursion of nonlinear RNNs into a system of \(L\) nonlinear equations, solved simultaneously using Newton iteration combined with block bi-diagonal parallel reduction. This enables classic nonlinear RNNs (GRU/LSTM) to be trained in parallel along the sequence length for the first time—achieving up to 665× speedup over naive sequential application and resulting in 7B-scale RNN language models with competitive perplexity against similarly sized Transformer and Mamba2 models.
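
The reformulation can be sketched on a toy scalar RNN \(h_t = \tanh(a h_{t-1} + x_t)\) with \(h_0 = 0\). This is an illustrative assumption, not the paper's GRU/LSTM cells. Each Newton step reduces to a linear recurrence, which is the part that a parallel reduction would handle; the sketch solves it with a plain loop for clarity.

```python
import numpy as np

def rnn_sequential(x, a):
    """Reference: step-by-step evaluation of h_t = tanh(a * h_{t-1} + x_t)."""
    h, out = 0.0, []
    for xt in x:
        h = np.tanh(a * h + xt)
        out.append(h)
    return np.array(out)

def rnn_newton(x, a, iters=20):
    """Solve all L equations h_t - tanh(a * h_{t-1} + x_t) = 0 jointly by Newton."""
    L = len(x)
    h = np.zeros(L)
    for _ in range(iters):
        h_prev = np.concatenate(([0.0], h[:-1]))
        pre = a * h_prev + x
        resid = h - np.tanh(pre)                 # residual of every equation
        c = a * (1.0 - np.tanh(pre) ** 2)        # sub-diagonal of the Jacobian
        # Bidiagonal solve: delta_t = c_t * delta_{t-1} - resid_t (a linear recurrence).
        delta, d_prev = np.zeros(L), 0.0
        for t in range(L):
            d_prev = c[t] * d_prev - resid[t]
            delta[t] = d_prev
        h = h + delta
    return h
```

Because the Jacobian is lower bidiagonal, each iteration makes at least one more leading state exact, so `iters >= L` guarantees agreement with the sequential result. In practice far fewer iterations are usually needed.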

PARD: Accelerating LLM Inference with Low-Cost Parallel Draft Model Adaptation

PARD transforms an off-the-shelf small language model into a target-agnostic draft model that outputs \(K\) tokens in parallel in a single forward pass. By utilizing Conditional Output Dropping (COD), the training cost for this adaptation is reduced to \(O(N)\). On vLLM, it enables LLaMA3.1-8B to reach 264.88 tokens/s, which is \(3.67\times\) faster than autoregressive generation and \(1.15\times\) faster than EAGLE-3.

Planned Diffusion

The model first generates a "plan" autoregressively to partition the response into several semantically independent blocks, and then performs parallel diffusion denoising on all blocks. This allows the model to determine its own denoising order. On AlpacaEval, it achieves a \(1.27\times\) to \(1.81\times\) speedup relative to autoregressive models with only a \(0.87\%\) to \(5.4\%\) win rate drop, advancing the quality-latency Pareto frontier for discrete diffusion parallel generation.

PLoP: Precise LoRA Placement for Efficient Finetuning of Large Models

PLoP utilizes a gradient-free, near-zero-overhead "Normalized Feature Norm" (NFN) to automatically determine which module types should receive LoRA adapters. By placing adapters on modules with the lowest alignment to the task, PLoP consistently outperforms (or at least matches) common heuristics, such as "Attention-only" or "MLP-only," across both SFT and RL post-training scenarios.

Predicting LLM Output Length via Entropy-Guided Representations

Instead of training independent auxiliary models to predict LLM output length, this paper reuses the main model's own hidden states. It employs Entropy-Guided Token Pooling (EGTP) for static prediction and Progressive Length Prediction (PLP) to handle "one-to-many" stochastic generation in scenarios like RL sampling. The authors release ForeLen, the first length prediction benchmark containing long-sequence, CoT, and RL data, on which the Mean Absolute Error (MAE) of the strongest baseline is reduced by 29.16% on average, significantly improving end-to-end throughput.

PrefixMemory-Tuning: Modernizing Prefix-Tuning by Decoupling the Prefix from Attention

This paper first empirically demonstrates that the true cause of Prefix-Tuning's failure in modern large models is the "weight trade-off between prefix and input within the attention softmax." It then proposes PrefixMemory-Tuning (PMT): moving the prefix module out of the attention head and approximating it with a trainable memory matrix \(M\) plus a kernel feature map \(\phi(\cdot)\). This decoupling ensures prefix contributions are no longer diluted by sequence length. PMT consistently outperforms Prefix-Tuning and matches or exceeds LoRA in few-shot classification, preference alignment, and mathematical reasoning.

Prima.cpp: Fast 30-70B LLM Inference on Heterogeneous and Low-Resource Home Clusters

prima.cpp assembles home laptops, desktops, phones, and tablets into a heterogeneous low-end cluster. By utilizing "Pipeline Ring Parallelism (PRP) + Prefetching," it hides disk loading latency within computation time. The Halda scheduler solves for the optimal layering scheme based on real-world device compute, memory, and disk capacity via Integer Linear Programming (ILP). This enables running 30-70B models on common home clusters even with severe memory shortages, achieving 674 ms/token for a 70B model with <6% memory pressure—a 5-17× reduction in TPOT compared to llama.cpp.

ProtoKV: Knowledge in Long Context is Already Organized Before You Query

ProtoKV discovers that LLMs spontaneously categorize tokens into "Position-Determined Tokens" (PDT) and "Semantic-Anchored Tokens" (SAT) during the prefilling stage. Based on this, it constructs semantic prototypes, clusters tokens into groups based on these prototypes, and retains/evicts entire clusters. This approach improves average LongBench accuracy by 2.11% over SOTA under the same memory budget, with computational overhead comparable to SnapKV.

ProxyAttn: Guided Sparse Attention via Representative Heads

ProxyAttn observes that multiple attention heads in long contexts "focus on highly consistent tokens but differ in sparsity levels." It utilizes token-level scores from a few "proxy heads" to approximate block importance for all heads, while assigning independent budgets to each head for differentiated sparse masks. Without training, it preserves performance while achieving up to 10.3× attention speedup and 2.4× end-to-end prefilling acceleration.

QuoKA: Query-Oriented KV Selection for Efficient LLM Prefill

QuoKA proposes a training-free sparse attention method that does not depend on custom kernels. During the chunked prefill phase, it utilizes the geometric observation that "queries farther from the mean are more important" to select a few representative queries. Key KVs are then chosen for these queries using cosine similarity. This reduces attention computation to sub-quadratic complexity, achieving a 3× reduction in TTFT latency and up to ~7× attention speedup while maintaining near-lossless performance on long-context tasks.
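
The two selection steps can be sketched as below. The Euclidean distance to the mean, the max-over-queries aggregation, and both function names are assumptions made for illustration; the paper's actual scoring may differ.

```python
import numpy as np

def select_queries(Q, num_queries):
    """Pick the queries farthest (Euclidean distance, assumed) from the mean query."""
    dist = np.linalg.norm(Q - Q.mean(axis=0), axis=1)
    return np.argsort(-dist)[:num_queries]

def select_keys(Q_sel, K, num_keys):
    """Keep the keys with the highest cosine similarity to any selected query."""
    Qn = Q_sel / np.linalg.norm(Q_sel, axis=1, keepdims=True)
    Kn = K / np.linalg.norm(K, axis=1, keepdims=True)
    score = (Qn @ Kn.T).max(axis=0)      # best match per key (aggregation is assumed)
    return np.sort(np.argsort(-score)[:num_keys])
```

Attention for the chunk is then computed only over the returned key indices, which is where the savings over dense prefill would come from. The sketch assumes no zero-norm rows.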

RACE Attention: A Strictly Linear-Time Attention Layer for Training on Outrageously Large Contexts

This paper replaces Softmax attention with a "sharpened angular kernel + differentiable LSH sketch (RACE)," turning attention into an operator that is strictly linear in both sequence length and embedding dimension. This pushes the manageable context for a single-layer attention from FlashAttention's ~4 million tokens to 12 million on GPU and 75 million on CPU, while maintaining parity or better accuracy on real-world tasks within 64K.

Randomization Boosts KV Caching, Learning Balances Query Load: A Joint Perspective

The authors propose the first unified mathematical model for KV cache-aware load balancing, designing the Randomized Leaf-node Token eviction algorithm (RLT) with an \(O(\log n)\) competitive ratio and the Learning-Based Greedy Routing (LBGR). This approach reduces latency by up to 11.96× and TTFT by 14.06× in multi-LLM serving scenarios.

Reasoning Language Model Inference Serving Unveiled: An Empirical Study

This is the first empirical study systematically characterizing the serving behavior of "Reasoning Large Language Models (RLLMs)" in online inference. The authors propose the ASU evaluation framework and ASU-Perf benchmark suite, identifying four significant differences between RLLMs and standard LLMs (intense VRAM fluctuations, straggler requests, difficulty-adaptive runtime, and domain preferences). The study further evaluates whether optimization techniques designed for traditional LLMs—such as weight quantization, KV Cache quantization, prefix caching, and speculative decoding—remain effective for RLLMs.

Reconstructing KV Caches with Cross-Layer Fusion for Enhanced Transformers

Addressing the observation that cross-layer KV cache sharing (e.g., YOCO, CLA) consistently performs worse than intra-layer methods (GQA), this paper identifies a "key-value asymmetry" phenomenon: top-layer values primarily originate from the bottom layers, while keys come from both bottom and middle layers. Based on this, the authors propose FusedKV (learnable channel-wise fusion on post-RoPE keys) and its lightweight version FusedKV-Lite (direct asymmetric reuse). On 332M–4B models, these methods reduce KV cache memory by 50% while achieving lower perplexity than standard full-cache Transformers.

ReFusion: A Diffusion Large Language Model with Parallel Autoregressive Decoding

ReFusion elevates parallel decoding in masked diffusion language models from the token level to the slot level (multi-token segments). Slots are selected in parallel via diffusion and filled serially via autoregression. By reordering generated slots ahead of masked ones at each step, it achieves full KV cache reuse and manageable learning complexity. Compared to previous diffusion models, it yields a 34% average performance improvement and 18× acceleration, while outperforming or matching strong autoregressive models with a 2.33× speed advantage.

RepSpec: Structural Re-parameterized Draft Model Training for Speculative Decoding

RepSpec adopts the structural re-parameterization idea from RepVGG, splitting each linear layer of the speculative decoding draft model into a multi-branch redundant structure during training and merging them back losslessly into a single layer during inference. This enhances the draft model's capability without increasing inference overhead. By adding a "LoRA-style non-linear hybrid branch" to further extend the accepted sequence length, it accelerates the SOTA EAGLE-3 by 4%–10%.
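
The lossless merge behind structural re-parameterization can be illustrated with a minimal sketch. It assumes purely linear parallel branches (a main weight, an extra dense branch, and a low-rank pair); the matrices, names, and shapes are hypothetical, and the non-linear hybrid branch mentioned above is not modeled here.

```python
def matmul(a, b):
    """Multiply matrix a (m x k) by matrix b (k x n)."""
    return [[sum(a[i][t] * b[t][j] for t in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def matadd(*mats):
    """Element-wise sum of equally shaped matrices."""
    return [[sum(vals) for vals in zip(*rows)] for rows in zip(*mats)]


def matvec(w, x):
    """Apply weight matrix w to input vector x."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]


# Hypothetical training-time branches of one 3 -> 2 linear layer.
w_main = [[1.0, 0.0, 2.0], [0.5, 1.0, 0.0]]
w_extra = [[0.0, 1.0, 0.0], [1.0, 0.0, 0.5]]
low_rank_b = [[1.0], [2.0]]       # shape 2 x 1
low_rank_a = [[0.5, 0.0, 1.0]]    # shape 1 x 3

x = [1.0, 2.0, 3.0]

# Training-time forward pass: sum of the outputs of all branches.
branch_outputs = [
    matvec(w_main, x),
    matvec(w_extra, x),
    matvec(matmul(low_rank_b, low_rank_a), x),
]
y_train = [sum(vals) for vals in zip(*branch_outputs)]

# Inference-time forward pass: a single merged weight matrix.
w_merged = matadd(w_main, w_extra, matmul(low_rank_b, low_rank_a))
y_infer = matvec(w_merged, x)
```

Both `y_train` and `y_infer` evaluate to `[12.5, 12.0]`, which is why a merged layer of this kind adds no inference cost relative to the single-branch layer.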

RESA: Bringing Back What Sparse Attention Ignores with Residual Estimation

Addressing the blind spot of Sparse Attention (SA)—which "only computes selected KV pairs and treats others as zero contribution"—RESA leverages the inherent low-rank property of attention logit matrices. It utilizes a rank-1 prior to estimate the contributions of ignored KVs and merges them online with an overhead of the same order as SA. This improves model quality by up to 26% under the same KV budget, or compresses the KV budget by 33.2% and increases attention throughput by 1.23× at equivalent quality.

ReST-KV: Robust KV Cache Eviction with Layer-wise Output Reconstruction and Spatial-Temporal Smoothing

ReST-KV redefines KV cache eviction as a "layer-wise output reconstruction" problem. It treats the error increment in the attention output of the current layer after deleting a specific KV pair as the importance metric, explicitly capturing the redistribution effect ignored by prior work. Combined with exponential moving average (EMA) temporal smoothing and adaptive window spatial (AWS) smoothing, it achieves a robust eviction strategy. It outperforms SOTA by 2.58% on LongBench and 15.2% on RULER, while reducing decoding latency by 10.61× for 128k sequences.

Retrospective Sparse Attention for Efficient Long-Context Generation

This paper proposes RetroAttention, a "retrospective" sparse attention mechanism. When new KV pairs are loaded during subsequent decoding steps, it retroactively corrects the already computed attention outputs of past queries. This allows historical queries to access more KV pairs without increasing the KV budget, mitigating error accumulation during long generation. Compared to the SOTA method Quest, it improves accuracy by up to 21.9% and expands effective KV exposure by up to 1.6×.

Revisiting Long-context Modeling from Context Denoising Perspective

This paper treats long-context modeling as a "signal denoising" problem: it uses Integrated Gradient (IG) scores to precisely locate critical tokens that truly influence predictions and employs a lightweight denoising strategy, CDT, to suppress irrelevant tokens at the input. This allows an 8B open-source model to achieve a score of 50.92 on LongBench-E, nearing GPT-4o's 51.00.

Revisiting Parameter Server in LLM Post-Training

Addressing the extreme sequence length variance and severe device load imbalance in LLM post-training, this paper reintroduces the classic Parameter Server (PS) concept into modern sharded data parallelism. It proposes On-Demand Communication (ODC), replacing layer-wise all-gather/reduce-scatter in FSDP with point-to-point gather/scatter-accumulate. By relaxing the synchronization granularity from "once per layer" to "once per minibatch," faster devices are no longer hindered by stragglers, achieving up to a 36% end-to-end speedup over standard FSDP.

RMAAT: Astrocyte-Inspired Memory Compression and Replay for Efficient Long-Context Transformers

RMAAT incorporates two types of biological astrocyte mechanisms for memory regulation into Transformers: it replaces \(O(N^2)\) self-attention with a linear complexity attention inspired by short-term plasticity (STP), and uses a "memory retention factor" derived from long-term plasticity (LTP) saturation curves to adaptively compress cross-segment memory tokens. Complemented by the AMRB training algorithm, which caches only memory tokens and recomputes the forward pass during backpropagation, it improves average accuracy on the Long Range Arena from RMT's 63.6% to 68.0%, while peak memory usage is only about 1/4 of the recurrent baseline.

Routing Manifold Alignment Improves Generalization of Mixture-of-Experts LLMs

This paper proposes RoMA (Routing Manifold Alignment), which incorporates a "manifold regularization term" into the post-training objective. By lightly fine-tuning only the routers in the final few layers of MoE LLMs, it ensures that semantically similar samples share similar expert selections, improving accuracy by 7–15% across three MoE models without increasing inference overhead.

Scaling Attention via Feature Sparsity

This paper accelerates attention by exploring a neglected axis—instead of pruning tokens, it applies Top-\(k\) feature sparsification to each \(d\)-dimensional query/key vector. This allows attention scores to be precisely calculated only on a few coordinates co-activated by queries and keys. Combined with an IO-aware FlashSFA kernel to avoid materializing the \(n\times n\) score matrix, the computational complexity of \(QK^\top\) is reduced from \(\Theta(n^2d)\) to \(\Theta(n^2k^2/d)\). It achieves up to 2.5× speedup and nearly 50% savings in both FLOPs and KV-cache while matching dense precision on GPT-2 / Qwen3.
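
A minimal sketch of the feature-sparsity idea follows. It assumes that Top-\(k\) keeps the \(k\) largest-magnitude coordinates of each vector (the selection rule is an assumption) and uses hypothetical helper names; it says nothing about how the FlashSFA kernel is implemented.

```python
def topk_sparsify(vec, k):
    """Keep the k largest-magnitude coordinates of vec as {index: value}."""
    top = sorted(range(len(vec)), key=lambda i: abs(vec[i]), reverse=True)[:k]
    return {i: vec[i] for i in top}


def sparse_score(q_sparse, k_sparse):
    """Dot product restricted to coordinates active in both query and key."""
    shared = q_sparse.keys() & k_sparse.keys()
    return sum(q_sparse[i] * k_sparse[i] for i in shared)


query = [0.9, -0.1, 0.0, 0.7]
key = [0.8, 0.05, -0.6, 0.1]

q_sparse = topk_sparsify(query, k=2)      # keeps coordinates 0 and 3
k_sparse = topk_sparsify(key, k=2)        # keeps coordinates 0 and 2
score = sparse_score(q_sparse, k_sparse)  # only coordinate 0 is shared
```

If the active coordinates were spread uniformly, two \(k\)-sparse vectors in \(d\) dimensions would share \(k^2/d\) coordinates on average, which is consistent with the \(\Theta(n^2k^2/d)\) cost quoted above.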

Scaling Large Vision-Language Model RL Training via Efficient Load Balancing

Addressing two major system bottlenecks in VLM Reinforcement Learning (RL) training—centralized multimodal data loading and extreme sequence load imbalance across GPUs—this paper proposes FlexRL, an end-to-end system. FlexRL utilizes ShadowLoader to offload visual data decoding/preprocessing to workers while passing only lightweight metadata on the controller, and employs FlexUlysses to adaptively shard sequences into fine-grained chunks for sub-sequence level load balancing. This achieved a maximum end-to-end throughput gain of 8.47× on a 128-GPU cluster.

Scaling Laws Meet Model Architecture: Toward Inference-Efficient LLMs

This paper extends the Chinchilla Scaling Law into a "conditional" version, explicitly incorporating three architectural factors—hidden dimension \(d_{model}\), MLP-to-attention parameter ratio \(r_{mlp/attn}\), and GQA—into loss prediction. Combined with a search framework, it identifies architectures that are both accurate and fast under fixed parameter and training token budgets. Models trained using this approach, the Panda and Surefire series, achieve up to a 2.1% accuracy improvement and 42% higher inference throughput compared to LLaMA-3.2.

Scaling Linear Attention Capacity with Sparse State Expansion

This paper reinterprets state updates in linear attention as "information classification." Based on this, it proposes Sparse State Expansion (SSE): using row-sparse writes and partitioned expansion to significantly increase fixed state capacity, enhancing long-context retrieval and mathematical reasoning without substantially increasing the number of parameters.

Scaling Up, Speeding Up: A Benchmark of Speculative Decoding for Efficient LLM Test-Time Scaling

This paper constructs the first benchmark specifically for evaluating "speculative decoding for accelerating LLM test-time scaling." By comparing 9 speculative decoding methods under a unified protocol across Best-of-N (BoN) and multi-round thinking paradigms, the study finds that reasoning trajectories in test-time scaling are highly redundant. Consequently, simple N-gram-based methods (particularly SAM) can approach or even outperform the training-based EAGLE-3, while hybrid methods combining both achieve the highest overall speedup.

Self-Speculative Decoding Accelerates Lossless Inference in Any-Order and Any-Subset Autoregressive Models

This paper proposes Any-Subset Speculative Decoding (ASSD), enabling Any-Subset Autoregressive Models (AS-ARM) to utilize the same network as both a fast drafter and a joint density oracle. Through rejection sampling, it achieves multiple token generation in parallel while guaranteeing lossless sampling from the true joint distribution, theoretically proving that the number of neural network calls never exceeds the number of generated tokens.
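
The losslessness claim rests on the rejection step of speculative sampling. The sketch below shows the standard single-token rule (accept with probability \(\min(1, p/q)\), otherwise resample from the normalized positive part of \(p - q\)); it is not the any-subset procedure of ASSD itself, and the function names are hypothetical.

```python
def accept_probability(p_target, q_draft, token):
    """Probability of keeping a drafted token: min(1, p(token) / q(token))."""
    return min(1.0, p_target[token] / q_draft[token])


def residual_distribution(p_target, q_draft):
    """Distribution to resample from after a rejection: normalize(max(p - q, 0))."""
    residual = [max(p - q, 0.0) for p, q in zip(p_target, q_draft)]
    total = sum(residual)
    if total == 0.0:  # p and q are identical, so a rejection cannot occur
        return list(p_target)
    return [r / total for r in residual]


def output_distribution(p_target, q_draft):
    """Exact distribution of the emitted token (accepted draft or resample)."""
    accepted = [q * accept_probability(p_target, q_draft, t) if q > 0 else 0.0
                for t, q in enumerate(q_draft)]
    reject_mass = 1.0 - sum(accepted)
    residual = residual_distribution(p_target, q_draft)
    return [a + reject_mass * r for a, r in zip(accepted, residual)]


p = [0.5, 0.3, 0.2]   # target distribution
q = [0.2, 0.6, 0.2]   # draft distribution
emitted = output_distribution(p, q)
```

For these toy distributions `emitted` matches `p` up to floating-point rounding, which is the sense in which the draft-then-verify scheme samples from the target distribution exactly.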

Self-Speculative Masked Diffusions

Self-Speculative Masked Diffusions integrates the non-causal parallel draft distribution of masked diffusion and an arbitrary-order causal target distribution into the same Transformer. By using in-model speculative sampling to verify multiple masked tokens in a single primary forward pass, it reduces the number of network forward passes by approximately \(2\times\) for text modeling and protein sequence generation with nearly identical quality.

Semantic Parallelism: Redefining Efficient MoE Inference via Model-Data Co-Scheduling

The authors propose the Semantic Parallelism paradigm, which significantly reduces all-to-all communication overhead in MoE inference by predicting token-expert routing paths and co-scheduling model placement and data distribution. This achieves up to 2.78× throughput improvement in Attention-DP scenarios and up to 24.9% latency reduction in Attention-TP scenarios.

Sequential Parallel Duality in Prefix Scannable Models

This paper utilizes parallel prefix scans to provide a unified characterization of efficient sequence models that are "parallelizable during training and streamable during inference." It extends this class of models to Prefix-Scannable Models (PSMs) by allowing non-associative aggregation operators, enabling Transformer-style softmax aggregation to achieve approximately linear training and \(O(\log n)\) memory streaming inference under fixed chunking.
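
For readers unfamiliar with prefix scans, here is a minimal sketch of the classical associative case only, using a Hillis-Steele style scan over affine maps to evaluate a linear recurrence. The non-associative extension that defines PSMs is not covered, and all names are hypothetical.

```python
def inclusive_scan(items, combine):
    """Hillis-Steele inclusive scan; `combine` must be associative."""
    result = list(items)
    step = 1
    while step < len(result):
        prev = result
        # Every position in this round is independent, so a real
        # implementation could compute the whole round in parallel.
        result = [prev[i] if i < step else combine(prev[i - step], prev[i])
                  for i in range(len(prev))]
        step *= 2
    return result


def compose_affine(first, second):
    """Compose h -> a1*h + b1 followed by h -> a2*h + b2."""
    a1, b1 = first
    a2, b2 = second
    return (a2 * a1, a2 * b1 + b2)


# Linear recurrence h_t = a_t * h_{t-1} + b_t with h_0 = 0.
a = [0.5, 2.0, 1.0, 0.5, 3.0]
b = [1.0, 0.0, 2.0, 1.0, -1.0]

# Parallel-friendly form: scan over the affine maps, then read off the offsets.
states_parallel = [offset for _, offset in
                   inclusive_scan(list(zip(a, b)), compose_affine)]

# Streaming form: one state carried from step to step.
states_sequential, h = [], 0.0
for a_t, b_t in zip(a, b):
    h = a_t * h + b_t
    states_sequential.append(h)
```

Both forms produce `[1.0, 2.0, 4.0, 3.0, 8.0]`; the scan needs about \(\log_2 n\) rounds of independent work, while the loop needs \(n\) dependent steps but only constant state.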

Short Window Attention Enables Long-Term Memorization

This paper investigates the division of labor between short-term and long-term memory in SWAX, a hybrid architecture alternating Sliding Window Attention (SWA) and xLSTM linear RNNs. It uncovers a counter-intuitive finding: the shorter the sliding window, the better the long-context retrieval (as short windows compel the linear RNN to learn long-range dependencies). Based on this, it proposes stochastic window training (randomly switching between windows of 128 or 2048 per batch), enabling the model to achieve optimal results in both short-context and long-context tasks.
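
Stochastic window training can be sketched as drawing a window size per batch and building the matching causal sliding-window mask. The uniform choice between the two sizes and the helper names are assumptions made for illustration.

```python
import random


def sliding_window_mask(seq_len, window):
    """mask[i][j] is True when query i may attend to key j (causal, last `window` keys)."""
    return [[0 <= i - j < window for j in range(seq_len)] for i in range(seq_len)]


def sample_window(rng, choices=(128, 2048)):
    """Pick the window size for the next batch (uniform choice is an assumption)."""
    return rng.choice(choices)


rng = random.Random(0)
window_for_batch = sample_window(rng)

# Small example so the mask is easy to read: 4 tokens, window of 2.
mask = sliding_window_mask(seq_len=4, window=2)
```

In the small example each token sees itself and its immediate predecessor only, so anything older has to be carried by the recurrent state.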

SinkTrack: Attention Sink based Context Anchoring for Large Language Models

SinkTrack transforms the naturally stable <BOS> attention sink in decoder-only LLMs into a context anchor. By utilizing training-free dual-track cross-attention to inject input context into <BOS> during the prefill stage, it mitigates hallucinations and long-context forgetting with almost zero additional decoding overhead.

Smooth Reading: Bridging the Gap of Recurrent LLM to Self-Attention LLM on Long-Context Understanding

Addressing the performance gap where recurrent LLMs (linear complexity with fixed memory) underperform self-attention LLMs on long-context tasks, this paper proposes Smooth Reading. It transforms the "single-pass reading" of the entire context into an "End-to-End Multi-Round (EMR)" inference paradigm—involving chunk-based processing, summarizing-while-reading, and cross-round hidden state accumulation. Furthermore, it identifies that this inference paradigm favors sliding window architectures with strong length extrapolation. Ultimately, this approach improves recurrent models on LongBench from 5.68% behind self-attention to 3.61% ahead, while maintaining a 2.5× training and 2× inference efficiency advantage.

SoLoPO: Unlocking Long-Context Capabilities in LLMs via Short-to-Long Preference Optimization

SoLoPO decomposes long-context preference optimization into "preference learning on short contexts" and "short-to-long reward consistency." By using shorter and cleaner data to activate the LLM's long-context localization and reasoning capabilities, it significantly reduces the time and memory pressure of long-sequence training.

SonicMoE: Accelerating MoE with IO and Tile-aware Optimizations

Addressing the increasing memory-bound issues of "fine-grained + high-sparsity" MoE on modern hardware, SonicMoE employs three strategies: rewriting the backward computational graph to minimize activation buffering, utilizing fused kernels that overlap IO with computation, and implementing token rounding routing that aligns tokens per expert with hardware tiles. On Hopper, it increases operator throughput for 7B fine-grained MoE by 1.86× and reduces activation memory by 45% compared to ScatterMoE, while providing an additional 1.16× speedup under high sparsity.

SpareTrain: Fault-Tolerant LLM Training via Low-Cost Dual Modular Redundancy

SpareTrain integrates Dual Modular Redundancy (DMR), the most thorough but traditionally most expensive method for detecting Silent Data Corruption (SDC), into LLM training systems. By recycling redundant computations naturally generated by backpropagation recomputation in Activation Checkpointing (AC) and hiding verification tasks within GPU idle windows during communication, it cuts the throughput overhead of DMR from ~100% to only 3–14% relative to unprotected training, without weakening detection capabilities.

Sparse Attention Adaptation for Long Reasoning

This paper proposes SeerAttention-R, a sparse attention framework specifically designed for the "long decoding" phase of reasoning models. By using a lightweight, plug-and-play self-distillation attention gate (AttnGate) to learn which KV blocks to activate at each step, it maintains near-lossless reasoning accuracy on benchmarks like AIME with only a 4K token budget. Trained on 0.4B tokens without modifying the original model weights, it achieves up to a 9× speedup compared to FlashAttention-3 on H100 using a companion TileLang block-sparse decoding kernel.

SparseD: Sparse Attention for Diffusion Language Models

To address the quadratic explosion of bidirectional attention with context length and slow inference in Diffusion Language Models (DLMs), SparseD employs three strategies: using full attention in early steps, one-time pre-computation of head-specific sparse patterns reused across steps, and isolated selection for prefill/generation. It achieves up to 1.50× lossless acceleration relative to FlashAttention on 64k context and 1024 denoising steps.

SpecBranch: Speculative Decoding via Hybrid Drafting and Rollback-Aware Branch Parallelism

Inspired by CPU branch prediction, SpecBranch allows the draft model to generate multiple "speculative branches" in parallel during the target model's verification phase to hedge against rejections. Using a lightweight three-way classifier (H-RAD) that fuses explicit target features with implicit confidence, it adaptively determines draft lengths and branch points. It reduces the rollback rate from 66–90% to below 40%, achieving a 1.8×–4.5× end-to-end speedup over autoregressive decoding while maintaining a lossless sampling distribution.

Speculative Speculative Decoding

This paper proposes Speculative Speculative Decoding, which transforms the serial dependency of "draft, then verify, then draft again" in standard speculative decoding into asynchronous pre-speculation: while verification is ongoing, the draft model guesses potential verification outcomes in advance and prepares the next round of candidates. The resulting SAGUARO algorithm is approximately 30% faster than strong speculative decoding baselines on Llama-3.1-70B and approaches a \(5\times\) speedup relative to autoregressive decoding.

Stacked From One: Multi-Scale Self-Injection for Context Window Extension

SHAREDLLM stacks a single short-context LLM into two parts: a "lower-layer compressor" and an "upper-layer decoder." The lower layer compresses long inputs into coarse-to-fine multi-granularity context trees and "self-injects" KV pairs into the upper layer only at the bottom few layers. This enables extrapolation to 128K using only 8K sequence training, achieving 2× speedup over streaming methods and 3× over encoder-decoder architectures, while maintaining equal or superior performance.

STEM: Scaling Transformers with Embedding Modules

STEM replaces the up-projection matrix in the SwiGLU FFN with a layer-local embedding table indexed by token ID. By utilizing static sparsity instead of dynamic MoE routing, it eliminates approximately one-third of FFN parameters and reduces per-token FLOPs. This approach results in more stable training and larger knowledge capacity, improving downstream average scores by approximately 3–4% at the 350M/1B scales.

SwingArena: Adversarial Programming Arena for Long-context GitHub Issue Solving

This paper proposes SwingArena, an adversarial evaluation framework where two LLMs alternately play the roles of patch submitter and test reviewer on real GitHub issues. Submissions are verified end-to-end via repository-native CI pipelines (compilation/lint/regression testing). Across 400 instances in C++, Python, Rust, and Go, the evaluation reveals a behavioral divergence between "aggressive patch generation" and "defensive quality assurance."

Tactic: Adaptive Sparse Attention with Clustering and Distribution Fitting for Long-Context LLMs

Tactic replaces fixed token budgets in sparse attention with a "cumulative attention score" target \(P\). It selects tokens in descending order of attention scores until the cumulative sum reaches \(P\). To efficiently approximate this selection during decoding, it employs K-means clustering for similarity-based sorting and distribution fitting to estimate token scores. This approach achieves up to 7.29× speedup in decoding attention and 1.58× end-to-end speedup while maintaining accuracy close to full attention.
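
The selection target itself is simple to state in code. The sketch below computes it exactly from known attention weights, which is the quantity that the clustering and distribution-fitting steps approximate during decoding; the function name and the toy weights are hypothetical.

```python
def select_by_cumulative_mass(scores, target_p):
    """Return indices of the highest-scoring tokens whose total mass reaches target_p.

    scores: normalized attention weights for one query (non-negative, summing to 1).
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    selected, mass = [], 0.0
    for idx in order:
        selected.append(idx)
        mass += scores[idx]
        if mass >= target_p:
            break
    return selected


weights = [0.05, 0.50, 0.10, 0.30, 0.05]
kept_loose = select_by_cumulative_mass(weights, target_p=0.5)    # 1 token suffices
kept_strict = select_by_cumulative_mass(weights, target_p=0.85)  # needs 3 tokens
```

The number of kept tokens adapts to how peaked the distribution is: here a target of 0.5 keeps only token 1, while a target of 0.85 keeps tokens 1, 3, and 2.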

Test-Time Training Done Right

This paper points out that existing Test-Time Training (TTT) approaches fail on long sequences because they rely on tiny online mini-batches (updating fast weights every 16–64 tokens), causing modern GPU utilization to remain below 5%. The authors take the opposite approach and propose LaCT (Large-Chunk Test-Time Training), which expands the update granularity to massive chunks of 2K–1M tokens. Combined with window attention to compensate for intra-chunk locality, LaCT achieves 70% GPU utilization with just a few dozen lines of pure PyTorch code. It demonstrates scalability up to 14B parameters and 56K–1M token contexts across three modalities: novel view synthesis, language modeling, and autoregressive video diffusion.

The End of Manual Decoding: Towards Truly End-to-End Language Models

This paper proposes AutoDeco, which attaches two lightweight prediction heads to a standard Transformer. It enables the model to predict the temperature and top-p for each decoding step, transforming the manually tuned decoding process into a differentiable, end-to-end trainable component. Across 8 benchmarks, it consistently outperforms default sampling and matches the "oracle" upper bound obtained by tuning parameters on the test set, with almost zero extra latency.

The Pensieve Paradigm: Stateful Language Models Mastering Their Own Context

This paper proposes StateLM—a class of foundation models endowed with the ability to "edit their own context." By utilizing a set of memory tools (deleting context, indexing, note-taking), the model reads, records key points, and deletes original text during multi-turn reasoning. This maintains a "sawtooth" context length rather than monotonic accumulation, allowing StateLM to significantly outperform standard LLMs on long-document QA, dialogue memory, and deep retrieval tasks while using only 1/4 of the active context.

ThinKV: Thought-Adaptive KV Cache Compression for Efficient Reasoning Models

ThinKV observes that attention sparsity in the long Chain-of-Thought (CoT) of reasoning models can categorize tokens into three types: "Reasoning, Execution, and Transition." It assigns quantization precision based on thought importance and progressively evicts low-value thought segments when the reasoning trajectory changes. By pairing this with an extended PagedAttention kernel that reuses evicted memory slots in-place, ThinKV achieves near-lossless accuracy with less than 5% of the KV cache, delivering up to 5.8× higher throughput than the SOTA.

Three Forward, One Backward: Memory-Efficient Full-Rank Fine-Tuning of Large Models via Extra Forward Passes

Addressing the inherent flaws of LoRA ("restricted expressiveness due to updates only in low-rank subspaces") and MeZO ("high variance and slow convergence of pure zeroth-order estimation"), this paper proposes LMAO. In each iteration, it alternates one forward+backward pass for LoRA (updating low-rank matrices \(A, B\)) with two perturbed forward passes for zeroth-order estimation (updating base weights \(W\)), constructing a full-rank update from "three forwards and one backward." This approaches the performance of full-parameter fine-tuning (FT) with a memory footprint comparable to LoRA / MeZO.
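
The zeroth-order half of such a scheme can be sketched with the two-point (SPSA-style) estimator used by MeZO-like methods: two forward passes at \(W \pm \epsilon z\) give a directional derivative that scales the random direction \(z\). The toy quadratic loss, the hyperparameters, and the function names are assumptions, and the LoRA forward+backward pass is not shown.

```python
import random


def zeroth_order_step(weights, loss_fn, lr, eps, rng):
    """One update from two perturbed forward passes, with no backward pass."""
    z = [rng.gauss(0.0, 1.0) for _ in weights]
    loss_plus = loss_fn([w + eps * zi for w, zi in zip(weights, z)])
    loss_minus = loss_fn([w - eps * zi for w, zi in zip(weights, z)])
    # Finite-difference estimate of the directional derivative along z.
    scale = (loss_plus - loss_minus) / (2.0 * eps)
    return [w - lr * scale * zi for w, zi in zip(weights, z)]


def quadratic_loss(weights, target=(1.0, -2.0, 0.5)):
    """Toy stand-in for a model's forward pass."""
    return sum((w - t) ** 2 for w, t in zip(weights, target))


rng = random.Random(0)
weights = [0.0, 0.0, 0.0]
initial_loss = quadratic_loss(weights)
for _ in range(300):
    weights = zeroth_order_step(weights, quadratic_loss, lr=0.02, eps=1e-3, rng=rng)
final_loss = quadratic_loss(weights)
```

Because only forward passes are needed, memory stays close to inference levels; the price is a noisy gradient estimate, which is the variance issue attributed to MeZO above.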

TileLang: Bridging Programmability and Performance in Modern Neural Network Operators

TileLang proposes a programmable GPU operator language where the "tile" is a first-class citizen. It explicitly exposes low-level knobs such as memory placement, data movement, and parallel partitioning to developers. By utilizing a unified Fused Tile-level Dataflow Graph (FTG) combined with a two-stage auto-completion process—"tile recommendation + tile inference"—it enables writing operators in under 70 lines of Python that achieve performance near hand-written CUDA. Compared to Triton, it achieves an average speedup of 3.02× on H100 and 2.65× on AMD.

Towards Greater Leverage: Scaling Laws for Efficient Mixture-of-Experts Language Models

This paper introduces "Efficiency Leverage" (EL) to quantify the compute savings of MoE relative to dense models. By training 300+ MoE models up to 28B parameters, the authors fit a unified scaling law using activation rate, expert granularity, and compute budget as variables. Based on this, they design MoE-mini with only 0.85B active parameters, which matches a 6.1B dense model using 7× less compute.

Training-Free Loosely Speculative Decoding: Accepting Semantically Correct Drafts Beyond Exact Match

Addressing the issue where standard Speculative Decoding (SPD) "exact match" rules prematurely reject semantically correct drafts, FLy utilizes the target model's own entropy and "self-correction" behavior to accept tokens that differ in phrasing but are semantically equivalent without any training. While maintaining \(\ge 99\%\) accuracy, it accelerates Llama-3.1-70B by 2.81× and 405B by 5.07× on average, outperforming the training-based EAGLE-3 by 1.62× on out-of-distribution (OOD) tasks.

TrimR: Validator-based, Training-free Thinking Trimming for Efficient Test-time Scaling

TrimR uses a 7B validator that requires no fine-tuning to detect, in real time, three types of redundancy ("overthinking," "underthinking," and "repetition") during the Chain-of-Thought (CoT) generation of Large Reasoning Models (LRMs). By injecting guide prompts that gently or forcibly conclude the reasoning, TrimR reduces the runtime of QwQ-32B, R1-Distill-Qwen-32B, and Pangu-R-38B by up to 70% on MATH500, AIME24/25, and GPQA, while maintaining near-identical accuracy (maximum drop of 1.7%).

TyphoonMLA: A Mixed Naive-Absorb MLA Kernel For Shared Prefix

TyphoonMLA identifies that in shared prefix scenarios, the shared segment of Multi-Head Latent Attention (MLA) decoding is better suited for naive computation, while the non-shared segment remains suitable for absorb computation. By splitting a single attention operation into two kernel paths and merging them via Log-Sum-Exp (LSE), the method achieves up to a \(3.24\times\) increase in MLA attention throughput and a \(1.48\times\) improvement in end-to-end token generation rate without modifying model precision or requiring retraining.
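
The Log-Sum-Exp merge is what makes the split exact. The sketch below uses plain softmax attention in pure Python (no MLA projections, no \(1/\sqrt{d}\) scaling, hypothetical names) and shows only how two partial results recombine, not the naive or absorb kernels themselves.

```python
import math


def partial_attention(q, keys, values):
    """Softmax attention over one key segment; returns (output, log-sum-exp of logits)."""
    logits = [sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    denom = sum(exps)
    dim = len(values[0])
    out = [sum(e * v[d] for e, v in zip(exps, values)) / denom for d in range(dim)]
    return out, m + math.log(denom)


def merge_by_lse(out_a, lse_a, out_b, lse_b):
    """Combine two partial attention outputs, weighting each by exp(lse)."""
    m = max(lse_a, lse_b)
    wa, wb = math.exp(lse_a - m), math.exp(lse_b - m)
    return [(wa * a + wb * b) / (wa + wb) for a, b in zip(out_a, out_b)]


q = [0.2, -0.1]
keys = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0], [-1.0, 2.0]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]

# Treat the first two keys as the shared prefix and the rest as the private suffix.
out_prefix, lse_prefix = partial_attention(q, keys[:2], values[:2])
out_suffix, lse_suffix = partial_attention(q, keys[2:], values[2:])
merged = merge_by_lse(out_prefix, lse_prefix, out_suffix, lse_suffix)

full, _ = partial_attention(q, keys, values)
```

`merged` agrees with `full` up to floating-point rounding, so each segment can be computed by whichever kernel path is cheaper without changing the result.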

UltraLLaDA: Scaling the Context Length to 128K for Diffusion Large Language Models

Addressing the long context scaling problem of diffusion large language models (diffusion LLMs), this paper proposes a Diffusion-aware NTK positional encoding scaling method that considers the bidirectional attention characteristics of diffusion. Combined with masked post-training to suppress cross-document interference, the context window of LLaDA-8B is extended from 4K to 128K with lightweight training (only 600 steps), significantly outperforming the training-free baseline LongLLaDA on NIAH, PPL, LongBench, and RULER.

UltraMemV2: Memory Networks Scaling to 120B Parameters with Superior Long-Context Learning

UltraMemV2 redesigns the memory-layer sparse architecture by integrating memory layers into every Transformer block and utilizing more efficient retrieval, value processing, initialization, and computational proportions. This allows the memory network to approach the performance of an 8-expert MoE under the same active computation while demonstrating stronger long-context memory and in-context learning with lower inference memory access.

Understanding and Improving Length Generalization in Hierarchical Sparse Attention Models

This paper systematically dissects chunk-based sparse attention architectures and identifies three critical design principles (Non-linear Chunk Encoder + CLS token, Bypassing Residual Path, and Forced Sparsity during training). These principles enable a model trained on a 4K context to successfully extrapolate to 32 million tokens.

Understanding the Mixture-of-Experts with Nadaraya-Watson Kernel

This paper reinterprets MoE routing using classical Nadaraya-Watson kernel regression (routing weights = kernel function, expert outputs = weighted "labels"). Based on this, MoE is viewed as a "large FFN," leading to the proposal of KERN—a zero-additional-overhead FFN-style routing function (ReLU activation + \(\ell_2\) normalization). KERN consistently outperforms Softmax/Sigmoid routing across various model scales, sequence lengths, and sparsity levels.
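
One plausible reading of "ReLU activation + \(\ell_2\) normalization" as a routing function is sketched below. The order of operations, the zero-norm handling, and the function name are assumptions, not details confirmed by the summary above.

```python
import math


def relu_l2_router(logits):
    """Map router logits to gate weights: ReLU, then divide by the L2 norm."""
    activated = [max(x, 0.0) for x in logits]
    norm = math.sqrt(sum(a * a for a in activated))
    if norm == 0.0:  # every expert was gated off
        return activated
    return [a / norm for a in activated]


gates = relu_l2_router([3.0, -1.0, 4.0])
```

Here `gates` is `[0.6, 0.0, 0.8]` up to rounding: the expert with a negative logit receives exactly zero weight, unlike with Softmax, where every expert keeps a positive weight.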

Universal Model Routing for Efficient LLM Inference

This paper proposes UniRoute, which encodes each LLM into a "prediction error vector on a small batch of representative prompts." Paired with a bilinear scorer, it allows the trained router to route to new LLMs appearing only at test time without retraining, achieving a better cost-quality trade-off across 30+ unseen models.

Unlocking Full Efficiency of Token Filtering in Large Language Model Training

Addressing the paradox where "token filtering improves model performance but saves almost no training time," this paper proposes CENTRIFUGE. It further filters activations of discarded tokens within the attention backward kernel, propagating sparsity from the output layer through all preceding layers. By replacing inefficient sparse GEMMs with "dimension-reduced dense GEMMs," it enables real speedups at "intermediate" sparsity levels (30%–50%). It achieves up to 49.9% backward speedup and 34.7% end-to-end speedup when filtering 50% of tokens, fully preserving the accuracy gains of token filtering (up to +26.6%).

vAttention: Verified Sparse Attention via Sampling

vAttention unifies "deterministic top-k selection of key tokens" and "random sampling of the long tail" into a single attention calculation framework. By using the Central Limit Theorem to adaptively determine the sampling budget, it provides the first user-specified \((\epsilon, \delta)\) approximation guarantee for each attention head. On RULER-HARD at 10% sparsity, it outperforms HashAttention by approximately 4.5 percentage points.
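
To give a feel for how a Central Limit Theorem argument turns \((\epsilon, \delta)\) into a sampling budget, here is the textbook sample-size rule for estimating a mean. It is a generic sketch with a hypothetical function name, not the paper's exact bound, and it assumes the standard deviation of the sampled terms is known or estimated separately.

```python
import math
from statistics import NormalDist


def clt_sample_budget(std_dev, epsilon, delta):
    """Samples needed so the sample mean is within epsilon with probability about 1 - delta."""
    z = NormalDist().inv_cdf(1.0 - delta / 2.0)  # two-sided normal quantile
    return math.ceil((z * std_dev / epsilon) ** 2)


budget_loose = clt_sample_budget(std_dev=1.0, epsilon=0.1, delta=0.05)
budget_tight = clt_sample_budget(std_dev=1.0, epsilon=0.05, delta=0.05)
```

With these inputs the loose budget is 385 samples, and halving \(\epsilon\) roughly quadruples it, which is why an adaptive budget per head can be much cheaper than a fixed worst-case one.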

When Does Divide and Conquer Work for Long Context LLM? A Noise Decomposition Framework

This paper proposes a theoretical framework that decomposes long-context task failures into three types of noise (task, model, and aggregator noise). It proves that when model noise grows super-linearly, a weak model combined with chunking can outperform a strong model using single-pass processing. Furthermore, it introduces a method to quickly estimate the optimal chunk size using only 3-5 samples.

Wide-In, Narrow-Out: Revokable Decoding for Efficient and Effective DLLMs

Addressing the "quality-speed dilemma" in Diffusion Large Language Models (DLLMs) where parallel decoding inevitably leads to performance degradation, this paper proposes WINO, a training-free decoding algorithm. By employing a parallel draft-and-verify mechanism—consisting of a low-threshold "aggressive drafting (Wide-In)" and a high-threshold "strict verification and re-masking of suspicious tokens (Narrow-Out)"—early errors can be revoked and rewritten with richer subsequent context. This achieves a \(6\times\sim10\times\) speedup on LLaDA / MMaDA while even improving accuracy.

xLSTM Scaling Laws: Competitive Performance with Linear Time-Complexity

This paper systematically compares the scaling laws of xLSTM and Transformer, demonstrating that xLSTM consistently outperforms Transformers of the same scale in terms of the training loss-compute Pareto frontier, overtraining regime, and inference speed, with the advantage increasing as context length grows.