
📦 Model Compression

🔬 ICLR2026 · 241 paper notes

📌 Same area in other venues: 📷 CVPR2026 (98) · 💬 ACL2026 (59) · 🧪 ICML2026 (117) · 🤖 AAAI2026 (60) · 🧠 NeurIPS2025 (140) · 📹 ICCV2025 (52)

🔥 Top topics: Model Compression ×75 · LLM ×42 · Compression ×23 · Diffusion Models ×14 · Reasoning ×11

A Fano-Style Accuracy Upper Bound for LLM Single-Pass Reasoning in Multi-Hop QA

This paper derives a Fano-style accuracy upper bound for single-pass LLM reasoning in Multi-Hop QA (MHQA) using information theory. It reveals a "cliff-like" precipitous drop in accuracy when task information requirements exceed model output capacity. Based on these insights, the authors design InfoQA, a multi-turn reasoning framework that breaks the single-pass bottleneck through capacity-aware decomposition, dependency-explicit workflows, and iterative query compression.
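For orientation, the textbook form of Fano's inequality already produces a bound of this shape. The statement below is the classical inequality, not the paper's exact theorem. The symbols are assumptions introduced here: \(A\) is the answer drawn from a candidate set \(\mathcal{A}\), \(Y\) is the single-pass model output, and \(C\) is an upper limit on the information (in bits) the output can carry about \(A\).

\[
\text{Acc} \;=\; 1 - P_e \;\le\; 1 - \frac{H(A \mid Y) - 1}{\log_2 |\mathcal{A}|},
\qquad
H(A \mid Y) \;=\; H(A) - I(A; Y) \;\ge\; H(A) - C .
\]

Under these assumptions, once the task requirement \(H(A)\) exceeds \(C\), the right-hand side falls below 1 and keeps shrinking as the gap grows, which is the qualitative behavior the summary describes.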

A Recovery Guarantee for Sparse Neural Networks

The authors prove the first sparse recovery guarantee for ReLU neural networks: for two-layer scalar output networks with Gaussian randomly sampled training data, an Iterative Hard Thresholding (IHT) algorithm based on convex reformulation precisely recovers sparse network weights, with memory requirements growing only linearly with the number of non-zero weights.

A universal compression theory for lottery ticket hypothesis and neural scaling laws

The paper proves a universal compression theorem: any permutation-invariant function can be asymptotically compressed to a \(\text{polylog}(d)\) scale with error approaching zero (which is the optimal compression rate). This directly leads to the proof of the dynamic lottery ticket hypothesis—any network can be compressed to polylogarithmic width while maintaining invariant learning dynamics—as well as dataset compression to polylogarithmic size while maintaining the loss landscape, and the acceleration of power-law scaling laws to arbitrarily fast decay rates.

ABBA-Adapters: Efficient and Expressive Fine-Tuning of Foundation Models

This paper proposes ABBA-Adapters, which parameterize the weight update as the Hadamard product of two independently learnable low-rank matrices, \(\Delta W = s(B_1A_1) \odot (B_2A_2)\). Under the same parameter budget, this yields an effective rank far higher than LoRA's (up to \(r_1 \cdot r_2\) vs. \(r\)). A Khatri-Rao reconstruction keeps memory efficiency comparable to LoRA, and the method significantly outperforms existing PEFT methods on arithmetic and commonsense reasoning tasks.
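The rank claim can be checked numerically. The following is a minimal NumPy sketch, not the authors' implementation: the function names, matrix sizes, and random initialization are assumptions chosen for illustration. It builds the Hadamard-product update, verifies that it equals a single product of Khatri-Rao factors with inner dimension \(r_1 \cdot r_2\), and compares its rank with a LoRA update that uses the same number of parameters.

```python
import numpy as np

def abba_update(B1, A1, B2, A2, s=1.0):
    """Delta W = s * (B1 A1) * (B2 A2), where * is the elementwise product."""
    return s * (B1 @ A1) * (B2 @ A2)

def khatri_rao_factors(B1, A1, B2, A2):
    """Rewrite the Hadamard product as one product with inner dimension r1*r2."""
    m, n = B1.shape[0], A1.shape[1]
    # Row-wise Khatri-Rao product of B1 and B2: shape (m, r1*r2).
    B_kr = np.einsum("ik,il->ikl", B1, B2).reshape(m, -1)
    # Column-wise Khatri-Rao product of A1 and A2: shape (r1*r2, n).
    A_kr = np.einsum("kj,lj->klj", A1, A2).reshape(-1, n)
    return B_kr, A_kr

rng = np.random.default_rng(0)
m, n, r1, r2 = 32, 32, 4, 4  # illustrative sizes only
B1, A1 = rng.standard_normal((m, r1)), rng.standard_normal((r1, n))
B2, A2 = rng.standard_normal((m, r2)), rng.standard_normal((r2, n))

delta_w = abba_update(B1, A1, B2, A2)
B_kr, A_kr = khatri_rao_factors(B1, A1, B2, A2)

# LoRA with rank r1 + r2 uses the same (m + n) * (r1 + r2) parameters.
lora = rng.standard_normal((m, r1 + r2)) @ rng.standard_normal((r1 + r2, n))

abba_rank = np.linalg.matrix_rank(delta_w)
lora_rank = np.linalg.matrix_rank(lora)
print(abba_rank, lora_rank)
```

For generic (random) factors the Hadamard update reaches rank \(r_1 \cdot r_2\), while the equal-budget LoRA update is capped at \(r_1 + r_2\).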

Achieving low-bit Muon through subspace preservation and grid quantization

This paper presents the first study on 4-bit compression of Muon optimizer states. It reveals that Newton-Schulz orthogonalization primarily amplifies quantization errors in the top singular subspace of the momentum matrix. Consequently, the authors propose 4-bit-Muon-GRASP: utilizing 8-bit to preserve the top subspace, 4-bit for the residual subspace, and grid quantization normalized along both rows and columns to suppress bi-dimensional outliers. This method achieves near-lossless accuracy in LLaMA 130M~1.1B pre-training and Qwen2.5-7B fine-tuning, reducing training memory by up to 28%.

ACPBench Hard: Unrestrained Reasoning about Action, Change, and Planning

ACPBench Hard is an open-ended generative benchmark for planning reasoning, built on the PDDL formal system and containing 8 task categories (13 domains × 8 tasks = 1040 problems). It ships with a symbolic validator that provides rigorous correctness guarantees. A systematic evaluation of 15 LLMs reveals that even the strongest reasoning model, o1-preview, achieves an accuracy of \(\le 66\%\) on half of the tasks. Furthermore, nearly all models fail the most basic "enumerate executable actions" task, exposing fundamental deficiencies in how current LLMs reason about planning.

Adaptive Nonlinear Compression for Large Foundation Models

NLA employs piecewise linear kernels to perform "nonlinear low-rank approximation" on weight matrices, coupled with a reconstruction-free all-matrix forward algorithm and an adaptive budget scheduler that allocates compression rates based on importance. This allows low-rank compression to achieve lower information loss and higher compression rates under the same parameter budget.

Adaptive Width Neural Networks

The AWN framework is proposed to automatically learn unbounded layer widths (number of neurons) during training via variational inference. By applying a soft ordering to neurons using a monotonically decreasing importance function, it enables width adaptation to task difficulty and supports zero-cost post-training truncation compression.

AdaRank: Adaptive Rank Pruning for Enhanced Model Merging

This paper proposes AdaRank, which adaptively selects singular components of task vectors using learnable binary masks (replacing the heuristic top-k choice). Combined with test-time entropy minimization, it significantly mitigates inter-task interference in multi-task model merging, achieving 89.4% accuracy on ViT-B/32.

AgilePruner: An Empirical Study of Attention and Diversity for Adaptive Visual Token Pruning in LVLMs

Through a systematic empirical analysis using erank (effective rank) and attention entropy, this study reveals the complementary characteristics of attention-based and diversity-based methods in visual token pruning—attention methods suppress hallucinations but have limited coverage, while diversity methods offer comprehensive coverage but are prone to introducing hallucinations. Based on these findings, AgilePruner is proposed to adaptively switch pruning strategies according to image complexity, demonstrating robust performance across 9 benchmarks.

AIRE-Prune: Asymptotic Impulse-Response Energy for State Pruning in State Space Models

AIRE-Prune calculates a closed-form "infinite-horizon impulse response energy" score for each state of diagonal State Space Models (SSMs). By using prefix normalization to align scores across different layers to a common scale, it prunes an average of 60.8% of states using only a single global threshold without retraining, while keeping the accuracy drop within 0.29 percentage points.
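To see why such a score has a closed form, consider one state of a discrete-time diagonal SSM, \(x_{t+1} = \lambda x_t + b u_t\), \(y_t = c x_t\), with \(|\lambda| < 1\). Its impulse response is \(c \lambda^k b\), so the energy is a geometric series. The sketch below assumes this discrete-time parameterization, and the threshold rule at the end is hypothetical; the paper's exact score and its prefix normalization are not reproduced here.

```python
import numpy as np

def impulse_response_energy(lam, b, c):
    """Closed form of sum_k |c * lam**k * b|**2 for |lam| < 1."""
    lam, b, c = (np.asarray(v) for v in (lam, b, c))
    return np.abs(c * b) ** 2 / (1.0 - np.abs(lam) ** 2)

def truncated_energy(lam, b, c, horizon=2000):
    """Brute-force check: sum the squared impulse response over a long horizon."""
    k = np.arange(horizon)[:, None]
    h = c * lam ** k * b  # shape (horizon, num_states)
    return np.sum(np.abs(h) ** 2, axis=0)

# Four illustrative states of one layer (values are made up).
lam = np.array([0.9, 0.5, 0.99, 0.1])
b = np.array([1.0, 1.0, 0.05, 2.0])
c = np.array([0.5, 1.0, 1.0, 0.1])

scores = impulse_response_energy(lam, b, c)
# Hypothetical pruning rule: keep states scoring at least half the median.
keep = scores >= 0.5 * np.median(scores)
print(scores, keep)
```

Note that the slowly decaying state (\(\lambda = 0.99\)) is still pruned in this toy example because its input and output gains are small; the score weighs decay rate and gain together.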

Alignment-Enhanced Integration of Connectivity and Spectral Sparsity in Dynamic Sparse Training of LLM

This work is the first to integrate dynamic connectivity sparsity (CHTs) and dynamic low-rank spectral sparsity into a unified sparse pre-training framework. It finds that naively summing the two branches leads to an output "cancellation" effect and introduces a simple alignment loss to synchronize them. The resulting CHTsL approaches dense performance on LLaMA-60M/130M while retaining only 10%–30% of the parameters.

Alignment through Meta-Weighted Online Sampling: Bridging the Gap between Data Generation and Preference Optimization

The MetaAPO framework is proposed, using a lightweight meta-learner (two-layer MLP) to dynamically estimate the alignment gap between offline and online data. It guides "which prompts require online sampling" (addressing distribution mismatch) and adaptively weights offline/online data during training (optimizing learning efficiency). MetaAPO surpasses baselines like DPO and Online DPO on AlpacaEval 2, Arena-Hard, and MT-Bench while reducing online labeling costs by 42%.

AMiD: Knowledge Distillation for LLMs with \(\alpha\)-mixture Assistant Distribution

This paper unifies various previously proposed "assistant distributions" (teacher-student intermediate distributions) in knowledge distillation into an α-mixture assistant distribution family with a newly designed variable \(\alpha\) using the "generalized \(f_\alpha\)-mean" from information geometry. Based on this, the unified distillation framework AMiD is proposed, which theoretically demonstrates optimality, reveals how \(\alpha\) regulates mode-covering/mode-seeking behavior, and consistently surpasses existing assistant distribution methods in experiments.

AnyBCQ: Hardware Efficient Flexible Binary-Coded Quantization for Multi-Precision LLMs

The authors propose AnyBCQ, a multi-precision LLM quantization framework based on binary-coded quantization (BCQ). By employing progressive precision expansion (freezing existing bit-planes and adding residual bit-planes), it supports dynamic switching between 2-4 bits for a single model. Dedicated CUDA kernels perform computations directly at the bit-plane level to avoid lookup table (LUT) and transposition overhead. In 2-bit settings, accuracy significantly outperforms Any-Precision LLM (35.3% vs 24.7% MMLU), with throughput reaching up to 3.0x that of FP16.
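Progressive precision expansion can be illustrated with a greedy residual fit. The sketch below is an assumption-laden simplification: it uses one scale per tensor (practical BCQ schemes typically use per-row or per-group scales), the helper names are hypothetical, and it does not reflect the paper's training procedure or CUDA kernels. Each new bit-plane is fitted to the residual left by the frozen planes, so a lower-precision model is always a prefix of a higher-precision one.

```python
import numpy as np

def add_bit_plane(w, planes):
    """Freeze the existing (alpha, bits) planes and fit one more on the residual."""
    residual = w - sum(alpha * bits for alpha, bits in planes)
    bits = np.where(residual >= 0, 1.0, -1.0)  # binary code in {-1, +1}
    alpha = np.abs(residual).mean()            # least-squares scale for this code
    return planes + [(alpha, bits)]

def reconstruct(planes, num_bits):
    """Dequantize using only the first num_bits planes."""
    return sum(alpha * bits for alpha, bits in planes[:num_bits])

rng = np.random.default_rng(0)
w = rng.standard_normal(4096)

planes = []
for _ in range(4):
    planes = add_bit_plane(w, planes)

# Switching precision means choosing how many planes to read.
errors = {k: float(np.mean((w - reconstruct(planes, k)) ** 2)) for k in (2, 3, 4)}
print(errors)
```

Because earlier planes are never modified, the 2-bit, 3-bit, and 4-bit models share storage, and the reconstruction error shrinks with each added plane.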

ARMOR: High-Performance Semi-Structured Pruning via Adaptive Matrix Factorization

ARMOR reformulates 2:4 semi-structured pruning as a "factorization" problem. Instead of directly deleting weights, it decomposes each weight matrix into a 2:4 sparse kernel plus two lightweight block-diagonal "wrapper matrices" acting as error correctors. These components are jointly optimized using block coordinate descent, theoretically guaranteeing a proxy loss no worse than SOTA. Empirically, it narrows the perplexity gap between 2:4 pruning and dense models by nearly 50% on Llama/Qwen while preserving most of the 2:4 acceleration and memory efficiency gains.

Asymmetric Synthetic Data Update for Domain Incremental Dataset Distillation

This paper introduces the new problem of "Domain Incremental Dataset Distillation (DIDD)" — continuously distilling sequentially arriving data from different domains into a single, fixed-size synthetic set. It proposes an Asymmetric Synthetic Data Update strategy based on meta-learning bi-level optimization to learn individual stability and plasticity gradient update rates for each synthetic image, thereby alleviating catastrophic forgetting under a fixed storage budget.

Automated Stateful Specialization for Adaptive Agent Systems

ASPEC proposes a fully automated lifecycle framework for "stateful expert agent teams": it first uses evolutionary search offline to discover a set of domain expert operators, then cultivates persistent memory through experience-based reflection, and finally utilizes a lightweight online "retain-then-escalate" meta-controller to decide whether to reuse the existing team or re-search the architecture for each query. It improved Gemini 2.0 Flash from 56.3% to 62.8% on the expert-level scientific benchmark GPQA, while maintaining significantly lower training and inference costs than similar automated frameworks.

Batch Pruning by Activation Stability

The paper proposes B-PAS—a method that monitors the variance of ReLU activations for each batch across epochs during training. It dynamically discards entire batches whose "activations have stabilized and no longer contribute to effective learning." On ResNet, CvT, and GPT-2, it achieves up to 57% data savings and 61% GPU node-hour reduction with maintained or slightly improved accuracy.

BEP: A Binary Error Propagation Algorithm for Binary Neural Networks Training

BEP proposes a purely binary discrete version of the chain rule in backpropagation: error signals are propagated layer-wise as binary \(\pm 1\) vectors. The entire forward and backward process is executed using only bitwise operations such as XNOR, Popcount, and integer increments/decrements. This achieves the first end-to-end full binary training of binary MLPs and RNNs, providing gains of up to +6.89% on MLPs and an average of +10.57% on RNNs compared to previous local learning rules.
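The bitwise primitive behind such training is the XNOR-popcount dot product. The sketch below illustrates only that primitive, using Python integers as bit vectors; the helper names are hypothetical, and BEP's actual error-propagation rule is not reproduced. For two length-\(n\) vectors over \(\pm 1\), the dot product equals twice the number of agreeing positions minus \(n\).

```python
def pack(signs):
    """Pack a sequence of +1/-1 values into an int, with bit value 1 meaning +1."""
    word = 0
    for i, s in enumerate(signs):
        if s == 1:
            word |= 1 << i
    return word

def binary_dot(a_word, b_word, n):
    """Dot product of two length-n +/-1 vectors via XNOR and popcount."""
    mask = (1 << n) - 1
    agree = ~(a_word ^ b_word) & mask  # XNOR: 1 where the signs match
    matches = bin(agree).count("1")    # popcount
    return 2 * matches - n

a = [1, -1, -1, 1, 1, -1, 1, 1]
b = [1, 1, -1, -1, 1, -1, -1, 1]

result = binary_dot(pack(a), pack(b), len(a))
print(result)  # 2
```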

Beyond Outliers: A Study of Optimizers Under Quantization

The authors provide the first systematic study of the relationship between "optimizer selection" and "quantization robustness." Training 50M–1.5B LLMs with six different optimizers, they find that traditional outlier metrics (MMR, Kurtosis) fail to predict post-quantization accuracy. Instead, they propose an analytical ABC error propagation decomposition and a new metric \(R_L\), revealing the counter-intuitive conclusion that Shampoo, despite having the most severe outliers, exhibits the least accuracy drop under PTQ/QAT and the highest parameter efficiency.

Beyond Student: An Asymmetric Network for Neural Network Inheritance

Instead of training a small-capacity student network to approximate a teacher, this work directly performs asymmetric low-rank decomposition on teacher weights, inherits principal component knowledge via SVD initialization, and reconstructs a "wide and deep" yet lightweight "Inherited Network" with an MoE-style "one dimension-reduction + multiple dimension-expansion experts" structure. It achieves faster convergence and higher accuracy than traditional student networks under the same parameter constraints.

Beyond Uniformity: Sample and Frequency Meta Weighting for Post-Training Quantization of Diffusion Models

This paper proposes a sample and frequency meta-weighting method for post-training quantization (PTQ) of diffusion models. Instead of treating all calibration samples and frequency components equally, it automatically learns which samples and which timestep-specific frequency components should influence quantization calibration through bi-level optimization, stably reducing FID in low-bit diffusion models.

BeyondBench: Contamination-Resistant Evaluation of Reasoning in Language Models

The authors propose BeyondBench, an evaluation framework that algorithmically and dynamically generates mathematical problems (44 tasks / 117 variants / 3 difficulty levels), so that each test is free from training data contamination. An evaluation of 101 language models (0.5B to 141B parameters) shows that even the strongest models reach only 56% accuracy on the Hard Suite, and that performance drops significantly when tools are not used.

Biologically Plausible Learning via Bidirectional Spike-Based Distillation

This paper proposes BSD (Bidirectional Spike-based Distillation), which trains a feedforward spiking network (stimulus → concept for perception) and a feedback spiking network (concept → stimulus for memory recall) by distilling spike features between them. The entire process uses only discrete binary spikes and unsigned error signals, achieving accuracy comparable to backpropagation across image classification/generation, text prediction, and temporal regression while satisfying five biological plasticity criteria.

Boomerang Distillation Enables Zero-Shot Model Size Interpolation

The "Boomerang Distillation" paradigm is proposed: by training only a single small student model and progressively "patching" the teacher's transformer layers back into the student, a whole family of intermediate-sized models can be constructed with zero training cost. Their performance smoothly interpolates between the student and teacher, matching or even exceeding same-sized models distilled individually.

Boosted Trees on a Diet: Compact Models for Resource-Constrained Devices

ToaD (Trees on a Diet) introduces regularization terms during GBDT training that encourage "feature and threshold reuse." Combined with a pointer-less, bit-coded memory layout featuring globally shared thresholds and leaf values, it compresses LightGBM models by 4–16 times without accuracy loss, allowing boosted trees to fit within KB-level microcontrollers.

Boosting Entropy with Bell Box Quantization

This paper proposes Bell Box Quantization (BBQ), the first quantization method to simultaneously satisfy "Information-Theoretic Optimality" (ITO) and "compute-efficiency." The core insight is that learning is domain-agnostic—the output domain of a quantizer does not need to match the input domain. Consequently, ITO quantization is performed in the input domain to maximize entropy, while mapping to hardware-accelerated data types in the output domain. BBQ consistently outperforms QuEST and LSQ in 1-4 bit QAPT scenarios.

Bridging Kolmogorov Complexity and Deep Learning: Asymptotically Optimal Description Length Objectives for Transformers

Starting from Kolmogorov complexity theory, this paper proposes a theoretical framework for "asymptotically optimal description length objectives." It proves the existence of such objectives for Transformers through a new proof of their computational universality and conducts empirical validation via a differentiable variational objective based on adaptive Gaussian mixture priors, revealing significant optimization challenges.

Bridging the Gap Between Promise and Performance for Microscaling FP4 Quantization

This paper systematically deconstructs the promise of "free speedup and preserved accuracy" offered by hardware-native 4-bit floating-point formats (MXFP4/NVFP4). Through theoretical proof of quantization errors, it identifies why existing quantization techniques fail on these formats. The authors propose MR-GPTQ, a tailored algorithm for FP4 characteristics, and the QuTLASS GPU kernel, achieving 2.2x~4x end-to-end acceleration on B200/RTX5090 while recovering MXFP4 accuracy from a 10% drop to near-NVFP4 levels.

Cannistraci-Hebb Training on Ultra-Sparse Spiking Neural Networks

CH-SNN integrates the Cannistraci-Hebb link prediction theory from brain science into the sparse training of Spiking Neural Networks (SNNs). Utilizing a four-stage pipeline—"Correlation-based Topological Initialization + Spike-aware Weight Initialization + Hybrid Scoring Pruning + CH3-L3 Topological Regrowth"—it achieves 97.75% structural sparsity across all linear layers while outperforming fully connected networks by 0.16% in accuracy. When deployed on edge neuromorphic chips, it reaches 98.84% sparsity, reduces synaptic operations by 97.5\(\times\), and lowers average energy consumption by 55\(\times\).

CAR-LoRA: Training Compression-Aware and Robust LoRA Adapters for Evolving LLMs

CAR-LoRA trains a "compression-aware and temporally robust" universal LoRA adapter by injecting random variations such as quantization, pruning, and layer skipping during training (using compressed weights for the forward pass and full-precision gradients for the backward pass). This allows a single adapter to be deployed directly to edge devices with various compression formats and future evolved base models without retraining, achieving performance close to QLoRA models that are specifically retrained for each configuration.

CARE: Covariance-Aware and Rank-Enhanced Decomposition for Enabling Multi-Head Latent Attention

CARE utilizes "activation covariance-weighted SVD + layer-wise adaptive rank allocation" to convert pre-trained GQA/MHA into MLA with an equivalent KV budget in a one-shot manner. By shifting the error minimization target from "weight space" to "activation space," it reduces one-shot perplexity by up to 215× and improves average accuracy by up to 1.70×.
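The shift from weight space to activation space can be sketched with a covariance-weighted truncated SVD. This is a generic construction under stated assumptions, not the paper's procedure: the function names, the Cholesky whitening, the ridge term `eps`, and the toy data are all introduced here, and the layer-wise rank allocation and the MLA conversion itself are omitted. The idea is that minimizing \(\|XW - XW_r\|_F\) is equivalent to a plain low-rank problem on \(L^\top W\), where \(LL^\top\) is the activation covariance.

```python
import numpy as np

def truncated_svd(M, rank):
    """Best rank-`rank` approximation of M in Frobenius norm."""
    U, S, Vt = np.linalg.svd(M, full_matrices=False)
    return (U[:, :rank] * S[:rank]) @ Vt[:rank]

def covariance_weighted_low_rank(W, X, rank, eps=1e-6):
    """Low-rank W_r that (approximately) minimizes ||X W - X W_r||_F."""
    d = W.shape[0]
    C = X.T @ X / X.shape[0] + eps * np.eye(d)  # activation covariance (ridge added)
    L = np.linalg.cholesky(C)                   # C = L @ L.T
    M_r = truncated_svd(L.T @ W, rank)          # truncate in the whitened space
    return np.linalg.solve(L.T, M_r)            # map back to weight space

rng = np.random.default_rng(0)
d, k, rank = 16, 16, 4
# Toy activations with strongly anisotropic feature scales.
X = rng.standard_normal((512, d)) * np.logspace(0, -2, d)
W = rng.standard_normal((d, k))

W_plain = truncated_svd(W, rank)
W_weighted = covariance_weighted_low_rank(W, X, rank)

plain_err = np.linalg.norm(X @ W - X @ W_plain)
weighted_err = np.linalg.norm(X @ W - X @ W_weighted)
print(plain_err, weighted_err)
```

With anisotropic activations, the weighted factorization spends its rank on the directions the inputs actually excite, so its output error is lower than that of the plain weight-space SVD at the same rank.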

Channel-Aware Mixed-Precision Quantization for Efficient Long-Context Inference

ChanMix identifies significant differences in quantization sensitivity across different channels of the KV cache—retrieval and outlier channels are fragile, while subnormal channels are robust. Based on this, it non-uniformly allocates bits by channel sensitivity (4-bit retrieval / 3-bit outlier / 2-bit normal / 1-bit subnormal) and implements 8-bit aligned packing using custom Triton kernels, significantly mitigating precision collapse in long-context retrieval under a 2-bit average budget.

CodeQuant: Unified Clustering and Quantization for Enhanced Outlier Smoothing in Low-Precision Mixture-of-Experts

CodeQuant unifies "learnable rotation to move activation outliers to the weight side" and "using clustering centroids to absorb weight outliers" into a post-training quantization (PTQ) framework designed specifically for MoE. Accompanied by a Look-Up Table (LUT) kernel for implementation, it improves the average accuracy of Qwen3-30B-A3B by 11.3% compared to QuaRot under A4W4 settings and achieves up to a 4.15× inference speedup.

COMI: Coarse-to-fine Context Compression via Marginal Information Gain

This paper proposes COMI, a coarse-to-fine adaptive context compression framework based on Marginal Information Gain (MIG = query relevance - semantic redundancy). At a 32x compression ratio, it improves the NaturalQuestions EM score by approximately 25 points compared to the second-best method. The core contribution lies in the simultaneous optimization of information relevance and diversity.
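A relevance-minus-redundancy score can be turned into a simple greedy selector. The sketch below is only an illustration of that scoring idea: cosine similarity as the relevance measure, maximum similarity to already selected chunks as the redundancy measure, and the function names are all assumptions, and COMI's coarse-to-fine compression pipeline is not reproduced.

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def select_by_marginal_gain(query, chunks, budget):
    """Greedily pick chunks by (query relevance - redundancy with picks so far)."""
    selected = []
    remaining = list(range(len(chunks)))
    while remaining and len(selected) < budget:
        def gain(i):
            relevance = cosine(query, chunks[i])
            redundancy = max(
                (cosine(chunks[i], chunks[j]) for j in selected), default=0.0
            )
            return relevance - redundancy
        best = max(remaining, key=gain)
        selected.append(best)
        remaining.remove(best)
    return selected

query = np.array([1.0, 1.0, 0.0])
chunks = [
    np.array([1.0, 0.0, 0.0]),   # relevant
    np.array([0.99, 0.1, 0.0]),  # near-duplicate of chunk 0, slightly more relevant
    np.array([0.0, 1.0, 0.0]),   # relevant and complementary
]

picked = select_by_marginal_gain(query, chunks, budget=2)
print(picked)  # [1, 2]
```

In the toy example, chunk 0 is skipped despite being relevant because it adds almost nothing beyond chunk 1, which is the trade-off between relevance and diversity that the summary highlights.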

Compute-Optimal Quantization-Aware Training

Based on 757 QAT experiments (86M-2.2B parameters, 1-6 bits), this paper discovers that the optimal QAT training fraction increases as total compute grows (contrary to the previous conclusion of a fixed 10%). It proposes a tokens-per-parameter-byte statistic and a new loss scaling law to accurately predict optimal QAT allocation strategies and final loss.

Constraint-guided Hardware-aware NAS through Gradient Modification

CONNAS moves hardware constraints out of the loss, where they usually appear as regularization terms, and instead enforces them by directly modifying the gradient direction of the architecture weights. This allows the gradient search process to automatically avoid infeasible architectures, eliminating the need for differentiable hardware metrics and regularization weight tuning. On NATS-Bench, the gap between the discovered architectures and the optimal feasible solutions is as small as 0.14%.

Coupling Experts and Routers in Mixture-of-Experts via an Auxiliary Loss

This paper proposes the Expert-Router Coupling (ERC) loss, a lightweight auxiliary loss that tightly couples router decisions with expert capabilities. It treats router parameters as proxy tokens for cluster centers and constrains the experts' activation norms on them, significantly improving MoE-LLM performance with only \(n^2\) activation computations.

Cross-Tokenizer Likelihood Scoring Algorithms for Language Model Distillation

This paper explores the recursive merge structure of BPE tokenization and proposes a "relative alphabet" framework. This allows teacher models to calculate exact sequence likelihoods on student vocabularies that differ from their own, enabling the direct application of classic KL distillation to cross-tokenizer scenarios. It achieves a 2%+ improvement over SOTA on GSM8K distillation and saves 12% VRAM during vocabulary pruning while simultaneously improving performance.

Cross-Domain Lossy Compression via Rate- and Classification-Constrained Optimal Transport

This work formalizes cross-domain lossy compression—where the encoder observes a degraded source and the decoder reconstructs a sample from a different target distribution—as an optimal transport problem under dual constraints of compression rate and classification loss. It derives closed-form DRC/RDC and DRPC tradeoff functions for Bernoulli sources (Hamming distortion) and Gaussian sources (MSE). The theoretical predictions are validated through deep end-to-end compression models on super-resolution, denoising, and inpainting tasks, showing consistency between theory and experimental behavior.

Cut Less, Fold More: Model Compression through the Lens of Projection Geometry

Structured pruning and model folding are unified into an orthogonal projection framework—where pruning is axis-aligned projection and folding is cluster-subspace projection. It is proven that under the condition of a rank difference of 1, folding's parameter reconstruction error is strictly smaller. Validation across 1000+ checkpoints shows folding typically outperforms pruning at medium-to-high compression rates.

Dataset Color Quantization: A Training-Oriented Framework for Dataset-Level Compression

The Dataset Color Quantization (DCQ) framework is proposed to reduce color redundancy at the dataset level through three mechanisms: chroma-aware clustering, attention-guided palette allocation, and texture-preserving optimization, achieving storage compression while maintaining training effectiveness.

Dataset Distillation as Pushforward Optimal Quantization

This paper reformulates decoupled dataset distillation as an optimal quantization problem. It proves that weighted clustering in latent space, combined with a diffusion prior, converges to an approximation of the true data distribution. The proposed DDOQ algorithm outperforms baselines like D4M on ImageNet-1K with minimal additional computational overhead.

DiCache: Let Diffusion Model Determine Its Own Cache

DiCache proposes a training-free adaptive caching strategy for diffusion models. It allows DiT to use shallow online probes during inference to determine when to reuse cache and how to combine historical caches. It improves speed while maintaining higher fidelity relative to the original model across WAN 2.1, HunyuanVideo, and Flux.

Differentiable JPEG-based Input Perturbation for Knowledge Distillation Amplification via Conditional Mutual Information Maximization

This paper proposes inserting a differentiable JPEG compression layer in front of a frozen teacher, training only 128 quantization parameters to perturb teacher inputs and directly maximize the teacher's Conditional Mutual Information (CMI). This generates "softer" and more informative supervisory signals—a plug-and-play distillation amplifier that does not modify teacher weights, achieving student Top-1 gains of up to 4.11%.

Diffusion Models as Dataset Distillation Priors

This paper formalizes "representativeness" as the Mercer kernel induced distance between synthetic and real samples within the feature space of diffusion models. By injecting this as an energy-based guidance into the reverse diffusion process, the method enables pre-trained diffusion models to output distilled datasets characterized by diversity, generalization, and representativeness in a training-free manner. It outperforms various SOTA generative distillation methods on ImageNet-1K and its subsets.

DiffVax: Optimization-Free Image Immunization Against Diffusion-Based Editing

DiffVax trains a feed-forward immunizer (UNet++) that generates imperceptible adversarial perturbations for any image in a single forward pass (~70ms). This causes diffusion-based malicious editing to fail, achieving a 250,000× speedup over prior per-image optimization methods and extending immunization to video content for the first time.

DisTaC: Conditioning Task Vectors via Distillation for Robust Model Merging

This paper reveals two hidden failure modes of model merging—task vector norm discrepancy and low source model confidence—and proposes DisTaC: "pre-conditioning" task vectors via knowledge distillation (rescaling norms + boosting confidence) before merging, enabling existing SOTA merging methods to function in realistic scenarios where they would otherwise fail.

Distillation of Large Language Models via Concrete Score Matching

Concrete Score Distillation (CSD) is proposed as a knowledge distillation loss for LLMs based on discrete score matching. By matching the relative logit differences between student and teacher across all vocabulary pairs, it concurrently overcomes the issues of softmax smoothing and the restricted solution space inherent in direct logit distillation.
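Matching logit differences over all vocabulary pairs can be written in a few lines. The sketch below uses an unweighted squared loss over pairs, which is an assumption made here for illustration (the paper's exact weighting and estimator may differ), and the function name is hypothetical. It shows the key property: the loss ignores any constant shift of the student logits, unlike a direct logit-matching loss.

```python
import numpy as np

def pairwise_logit_difference_loss(student_logits, teacher_logits):
    """Mean over pairs (i, j) of ((s_i - s_j) - (t_i - t_j))**2."""
    s = np.asarray(student_logits, dtype=float)
    t = np.asarray(teacher_logits, dtype=float)
    gap = s - t
    diff = gap[:, None] - gap[None, :]  # equals (s_i - s_j) - (t_i - t_j)
    return float(np.mean(diff ** 2))

teacher = np.array([2.0, 0.5, -1.0, 0.0])
student = np.array([1.5, 0.7, -0.2, 0.1])

loss = pairwise_logit_difference_loss(student, teacher)
shifted = pairwise_logit_difference_loss(student + 3.0, teacher)
direct_mse_shifted = float(np.mean((student + 3.0 - teacher) ** 2))
print(loss, shifted, direct_mse_shifted)
```

Since softmax is itself shift-invariant, a student whose logits differ from the teacher's by a constant already reproduces the teacher distribution; the pairwise loss treats it as a perfect match, whereas direct logit regression would penalize it.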

Distilling and Adapting: A Topology-Aware Framework for Zero-Shot Interaction Prediction in Multiplex Biological Networks

The CAZI-MBN framework is proposed, which integrates domain-specific LLM sequence embeddings, a topology-aware graph tokenizer, context-aware cross-layer attention, and teacher-student distillation. It achieves zero-shot interaction prediction for unseen entities in multiplex biological networks, improving AUROC by 3.1-20.4% over the best baselines across five benchmark datasets.

Distribution-Aware Multi-Granularity Phase Coding: Towards Lower Conversion Error for Spike-Driven Large Language Models

To address the conversion error caused by the "uniform discretization of non-uniform activations" in Spiking LLMs, this paper proposes Distribution-Aware Multi-Granularity Phase Coding. It uses multiple learnable phase bases to align discrete value density with activation distributions, coupled with an alternating optimization paradigm that trains only neurons without updating weights. On LLaMA-2-7B and LLaMA-3-8B, it achieves near-ANN accuracy and the lowest perplexity with extremely short conversion times (~2 minutes), while reducing MAC+AC energy consumption by 42%.

Draft-based Approximate Inference for LLMs

This paper proposes the Draft-based Approximate Inference framework, which uses lookahead predictions from a small draft model to estimate the importance of tokens and KV pairs more accurately. The framework includes SpecKV (KV cache dropping), SpecPC (prompt compression), and SpecKV-PC (cascaded compression), consistently outperforming existing baselines on long-context benchmarks.

Dr.LLM: Dynamic Layer Routing in LLMs

Lightweight routers are attached to each layer of a frozen pretrained LLM to decide whether a layer should be "skipped, executed, or repeated." Using high-quality paths searched via offline MCTS as supervision, this approach improves accuracy and saves computation without modifying base weights or requiring inference-time search.

DTO-KD: Dynamic Trade-off Optimization for Effective Knowledge Distillation

DTO-KD treats the trade-off between "task loss vs. imitation loss" in knowledge distillation as a multi-objective optimization problem. It calculates the weights of the two losses dynamically through a closed-form solution at the gradient level to automatically resolve gradient conflict and gradient dominance. This approach eliminates the need for manual weight tuning, achieves SOTA results on ImageNet-1K classification and COCO detection, and converges faster (matching 300-epoch performance in just 240 epochs).
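For two objectives, a gradient-level closed form of this kind exists in the classical min-norm (MGDA-style) solution, sketched below. This is a stand-in technique named plainly as such: the paper's own closed form may differ, and the function name and example gradients are assumptions. The weights minimize the norm of the combined gradient, which guarantees the resulting direction does not oppose either objective.

```python
import numpy as np

def min_norm_weights(g_task, g_distill):
    """Weights (alpha, 1 - alpha) minimizing ||alpha*g_task + (1-alpha)*g_distill||."""
    diff = g_task - g_distill
    denom = float(diff @ diff)
    if denom == 0.0:  # identical gradients: any convex combination works
        return 0.5, 0.5
    alpha = float(np.clip((g_distill - g_task) @ g_distill / denom, 0.0, 1.0))
    return alpha, 1.0 - alpha

# Two conflicting gradients (negative inner product), values are illustrative.
g_task = np.array([1.0, 0.0])
g_distill = np.array([-0.5, 1.0])

w_task, w_distill = min_norm_weights(g_task, g_distill)
direction = w_task * g_task + w_distill * g_distill
print(w_task, w_distill, direction)
```

The clipping step handles gradient dominance: when one gradient is much shorter and already aligned with the min-norm point, the solution collapses onto it instead of extrapolating outside the convex hull.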

DTP: Delta-Guided Two Stage Pruning for Mamba-based Multimodal Large Language Models

Addressing the vision token redundancy in Mamba-based Multimodal Large Language Models (MLLMs), DTP utilizes the input-dependent internal parameter \(\Delta_t\) of Mamba to estimate token importance. It employs selective pruning in early layers and complete pruning in late layers to nearly halve FLOPs while maintaining multimodal task performance.

DVD-Quant: Data-free Video Diffusion Transformers Quantization

DVD-Quant proposes a completely data-free post-training quantization framework for Video Diffusion Transformers. By integrating three modules—Bounded-init Grid Refinement (BGR), Auto-scaling Rotated Quantization (ARQ), and \(\delta\)-Guided Bit Switching (\(\delta\)-GBS)—it marks the first time a Video DiT achieves W4A4 quantization without quality degradation, while delivering approximately \(2\times\) acceleration on HunyuanVideo.

Efficient Orthogonal Fine-Tuning with Principal Subspace Adaptation

PSOFT moves orthogonal fine-tuning from the "full parameter space" to the "low-rank principal subspace of pre-trained weights." It uses SVD to construct dimension-compatible projections, gives a theoretical condition under which subspace geometry (angles and norms) is strictly preserved, and adds two tunable vectors to relax orthogonality. As a result, PSOFT matches or exceeds LoRA for the first time across three dimensions: parameter count, VRAM usage, and computational overhead.

Efficient Quantization of Mixture-of-Experts with Theoretical Generalization Guarantees

This paper proposes a per-expert mixed-precision quantization method with theoretical guarantees: it assigns bit-widths to each MoE expert based on the "change in router \(\ell_2\) norm during training (\(\Lambda_s\))." Experts with small norm changes (learning infrequent but critical features) receive high precision, while those with large changes receive low precision. Combined with "Max Intra-neuron Variance (MaxVar)" for local rearrangement, this allows Switch Transformer and Mixtral to be compressed to just over 2 bits with negligible accuracy drop and near-zero allocation overhead.

Efficient Reasoning with Balanced Thinking

This paper proposes ReBalance, a training-free framework that simultaneously alleviates overthinking and underthinking in Large Reasoning Models (LRMs) via confidence-based dynamic hidden state steering, improving both inference efficiency and accuracy.

Enhancing Multivariate Time Series Forecasting with Global Temporal Retrieval

This paper proposes the Global Temporal Retriever (GTR), a lightweight plug-and-play module. By maintaining adaptive global periodic embeddings and retrieving aligned global periodic information using absolute time indices, it enables any forecasting model to bypass look-back window limitations and capture global periodic patterns far longer than the input length.

Ensembling Pruned Attention Heads for Uncertainty-Aware Efficient Transformers

Hydra Ensembles achieves uncertainty quantification (UQ) performance comparable to or even superior to Deep Ensembles under near-single-model inference overheads (only 1.07×). This is achieved by applying differentiated attention head pruning to the same pre-trained Transformer and then fusing multiple pruned sub-networks into a single ensemble model for a single forward pass.

Entropy-Based Block Pruning for Efficient Large Language Models

This paper proposes EntroDrop, which uses the "entropy increase" of hidden states instead of traditional cosine similarity to measure the redundancy of Transformer blocks. It identifies a two-stage pattern in LLM hidden state entropy: "compression followed by expansion." By pruning the blocks with the smallest entropy increase, and only during the expansion stage, the method removes 37.5% of attention layers in Llama3.1-8B while retaining 95%+ performance, consistently outperforming cosine similarity-based pruning methods.
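
The selection step can be sketched as follows, assuming the hidden-state entropy after each block has already been estimated (the estimator itself is not shown). The function name and the rule "expansion starts at the entropy minimum" are illustrative assumptions.

```python
def select_blocks_to_prune(entropies, num_prune):
    """Pick blocks with the smallest entropy increase inside the expansion stage.

    entropies[0] is the entropy of the input to block 0, and entropies[i + 1]
    is the entropy of the output of block i (assumed to be precomputed).
    """
    # Entropy change contributed by each block.
    increases = [entropies[i + 1] - entropies[i] for i in range(len(entropies) - 1)]
    # Assumption: the expansion stage begins at the global entropy minimum.
    turning = min(range(len(entropies)), key=lambda i: entropies[i])
    candidates = list(range(turning, len(increases)))
    candidates.sort(key=lambda b: increases[b])
    return sorted(candidates[:num_prune])


# Hypothetical entropy profile: compression over blocks 0-1, expansion afterwards.
profile = [5.0, 4.0, 3.0, 3.5, 3.6, 4.4, 4.6]
pruned = select_blocks_to_prune(profile, num_prune=2)
```

Blocks 0 and 1 reduce entropy the most, but they are never candidates because they sit in the compression stage.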

Entropy-Monitored Kernelized Token Distillation for Audio-Visual Compression

Instead of directly distilling latent space features or outputs of the teacher, this work distills pairwise similarity relations between tokens using kernel functions (Gram matrices). It adaptively adjusts distillation weights based on the predicted entropy of each modality, achieving architecture-agnostic audio-visual model compression that retains ~97% performance with a 94% parameter reduction.

ERTACache: Error Rectification and Timesteps Adjustment for Efficient Diffusion

ERTACache formalizes "quality degradation caused by cache acceleration" into two categories: feature shift error and step amplification error. It combines three components (offline strategy calibration, trajectory-aware step adjustment, and closed-form residual linearization correction) to suppress both errors simultaneously, achieving over 2× acceleration with near-lossless quality on video and image diffusion models.

ES-dLLM: Efficient Inference for Diffusion Large Language Models by Early-Skipping

To address the high token computational redundancy during Diffusion Large Language Model (dLLM) inference, this paper proposes ES-dLLM, a training-free Early-Skipping acceleration framework. By estimating token importance and skipping low-importance positions in early layers, it achieves 5.6×–16.8× acceleration on LLaDA-8B and Dream-7B without sacrificing generation quality.

Evolution and compression in LLMs: On the emergence of human-aligned categorization

Through the Information Bottleneck (IB) framework and the Iterated In-Context Language Learning (IICLL) paradigm, this work demonstrates that LLMs, even without being trained on IB objectives, can spontaneously develop category structures that are highly aligned with human semantic systems and exhibit near-optimal compression efficiency.

Expert Merging: Model Merging with Unsupervised Expert Alignment and Importance-Guided Layer Chunking

Using 5–10 unlabeled calibration samples, this work learns a set of layer-wise coefficients to align the hidden states and logits of the merged model with domain-specific experts. It further introduces importance-guided chunking (Expert Merging++), outperforming both training-free and training-based merging baselines on LLMs/MLLMs, and even surpassing supervised mixture training.

Exploring Knowledge Purification in Multi-Teacher Knowledge Distillation for LLMs

Addressing the knowledge conflict problem in multi-teacher distillation where "more teachers lead to worse performance," this paper proposes the concept of "Knowledge Purification"—merging rationales from multiple teacher LLMs into a single unified rationale before distillation. By systematically comparing five purification methods across three categories (aggregation, routing, and RL selection), the authors find that routing-based methods are the most stable across both in-domain and out-of-domain scenarios.

Expressive yet Efficient Feature Expansion with Adaptive Cross-Hadamard Products

This paper transforms the "element-wise multiplication (Hadamard product)" into a learnable and efficient feature expansion operator called ACH. By utilizing differentiable discrete sampling to automatically select channels for cross-multiplication and stabilization via dynamic softsign normalization, the method expands channel dimensions with nearly zero convolutional parameters. Integrated into Hadaptive-Net via NAS, it achieves superior accuracy-speed trade-offs on ImageNet/CIFAR-100.

E²LoRA: Efficient and Effective Low-Rank Adaptation with Entropy-Guided Adaptive Sharing

The authors utilize gradient-based "proxy entropy" to detect inter-layer similarity and layer-wise information heterogeneity in pre-trained models. Based on this, they adaptively partition adjacent similar layers into the same sharing interval and allocate LoRA ranks to each interval according to its information content. This approach halves trainable parameters while matching or exceeding the performance of LoRA and ShareLoRA.

FASA: Frequency-Aware Sparse Attention

This paper discovers functional sparsity at the Frequency Chunk (FC) level in RoPE—where a small number of "dominant FCs" can effectively predict token importance. Based on this, it proposes the FASA framework, which achieves training-free KV cache compression through a two-stage process: predicting token importance via dominant FCs and focusing attention computation. On LongBench, it achieves nearly 100% full-KV performance while retaining only 256 tokens; on AIME24, it achieves a 2.56× speedup using only 18.9% of the cache.

Faster Vision Transformers with Adaptive Patches

APT (Adaptive Patch Transformer) employs multiple patch sizes within a single image—using large patches for flat regions and small patches for complex regions—to reduce the number of tokens at the source. This provides 30–50% speedup for any pretrained ViT with almost no performance drop, requiring only 1 epoch of fine-tuning to converge.

Fine-tuning Quantized Neural Networks with Zeroth-order Optimization

The paper proposes QZO, a method that estimates gradients by applying zeroth-order perturbations to quantization scaling factors (rather than discrete weights). Combined with Directional Derivative Clipping (DDC) to stabilize training, it achieves extreme memory-efficient fine-tuning for 4-bit/2-bit LLMs, reducing total memory by over 18x.
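
A minimal sketch of the idea, assuming a standard two-point (SPSA-style) zeroth-order estimator applied to the continuous scales while the integer codes stay frozen. The function name, the clipping threshold, and the toy loss are hypothetical; the paper's exact estimator and clipping rule may differ.

```python
import random


def zo_scale_gradient(loss_fn, scales, eps=1e-3, clip=1.0, rng=random):
    """Two-point zeroth-order gradient estimate w.r.t. quantization scales."""
    # Random perturbation direction, one entry per scale.
    z = [rng.gauss(0.0, 1.0) for _ in scales]
    plus = [s + eps * zi for s, zi in zip(scales, z)]
    minus = [s - eps * zi for s, zi in zip(scales, z)]
    # Finite-difference estimate of the directional derivative along z.
    d = (loss_fn(plus) - loss_fn(minus)) / (2 * eps)
    # Clip the directional derivative to limit the variance of the update.
    d = max(-clip, min(clip, d))
    return [d * zi for zi in z]


# Toy problem: frozen integer codes, one shared scale, target weights = 2 * codes.
codes = [3, -1]
target = [6.0, -2.0]


def toy_loss(scales):
    s = scales[0]
    return sum((s * q - t) ** 2 for q, t in zip(codes, target))


grad = zo_scale_gradient(toy_loss, [1.0], rng=random.Random(0))
```

Only forward evaluations of the loss are needed, which is where the memory saving comes from: no activations or weight gradients have to be stored.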

FlexHiNM-GP: Flexible Hierarchical Pruning via Region Allocation and Channel Permutation

Weight layers are adaptively partitioned into "dense (4:4)", "N:M (2:4)", and "fully pruned (0:4)" regions. This is coupled with a structure-aware Gyro-Permutation and differentiable 2:4 mask learning, enabling structured pruning to approach unstructured accuracy while maintaining GPU Sparse Tensor Core compatibility.
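
To make the three region types concrete, here is a small sketch of an N:M mask over groups of four weights. It uses plain magnitude ranking as a stand-in; the paper learns its 2:4 masks differentiably and applies Gyro-Permutation first, neither of which is reproduced here.

```python
def nm_mask(row, n=2, m=4):
    """Keep the n largest-magnitude weights in every group of m consecutive weights."""
    assert len(row) % m == 0, "row length must be a multiple of m"
    mask = []
    for start in range(0, len(row), m):
        group = row[start:start + m]
        # Indices of the n largest magnitudes within this group.
        keep = sorted(range(m), key=lambda i: abs(group[i]), reverse=True)[:n]
        mask.extend(1 if i in keep else 0 for i in range(m))
    return mask


row = [0.1, -0.9, 0.4, 0.05, 0.7, 0.2, -0.3, 0.6]
mask_2_4 = nm_mask(row, n=2)  # N:M region
mask_4_4 = nm_mask(row, n=4)  # dense region
mask_0_4 = nm_mask(row, n=0)  # fully pruned region
```

Setting `n` to 4, 2, or 0 per region reproduces the dense, 2:4, and fully pruned patterns named in the summary.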

FlexLoRA: Entropy-Guided Flexible Low-Rank Adaptation

FlexLoRA utilizes "spectral energy entropy" to measure the importance of each LoRA low-rank update at the matrix level. Under a global rank budget, it can both prune redundant ranks and extend new ranks for critical layers. Furthermore, a "Zero-Impact Initialization" strategy is employed to ensure training stability during capacity expansion, allowing for more efficient utilization of the parameter budget compared to fixed-rank LoRA and unidirectional pruning methods like AdaLoRA.
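
One plausible reading of "spectral energy entropy" is the Shannon entropy of the normalized squared singular values of a LoRA update. The sketch below assumes that definition and takes the singular values as given; how the score maps to pruning or extending ranks is not shown.

```python
import math


def spectral_energy_entropy(singular_values):
    """Shannon entropy of the normalized spectral energy (squared singular values)."""
    energy = [s * s for s in singular_values]
    total = sum(energy)
    # Zero-energy directions contribute nothing to the entropy.
    probs = [e / total for e in energy if e > 0]
    return -sum(p * math.log(p) for p in probs)


flat = spectral_energy_entropy([1.0, 1.0, 1.0, 1.0])    # energy spread evenly
peaked = spectral_energy_entropy([1.0, 0.0, 0.0, 0.0])  # energy in one direction
```

Under this definition, an update that spreads its energy evenly over four directions scores \(\log 4\), while one that concentrates it in a single direction scores 0.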

FlyPrompt: Brain-Inspired Random-Expanded Routing with Temporal-Ensemble Experts for General Continual Learning

Inspired by the neurobiological sparse expansion and modular integration of the Drosophila mushroom body, the FlyPrompt framework is proposed for General Continual Learning (GCL). It achieves non-iterative expert selection via a Random-Expanded Analytical Router (REAR) and enhances expert capabilities through Temporal-Ensemble Task-Experts (TE²) utilizing multi-time-scale EMA output heads. It achieves gains of up to 11.23%, 12.43%, and 7.62% on CIFAR-100, ImageNet-R, and CUB-200, respectively.

FreqKV: Key-Value Compression in Frequency Domain for Context Window Extension

FreqKV is proposed as a parameter-free, architecture-agnostic KV cache compression method. By iteratively compressing KV states in the frequency domain (preserving low frequencies and discarding high frequencies), it extends the context window of LLaMA-2-7B to 256K with only 8K length fine-tuning while maintaining stable perplexity.
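
The compression step can be sketched on a single channel, assuming a DCT is used as the frequency transform (the summary only says "frequency domain"). The helper names and the amplitude-preserving gain are illustrative assumptions.

```python
import math


def dct(x):
    """Orthonormal DCT-II of a 1D sequence."""
    n = len(x)
    out = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(scale * sum(x[t] * math.cos(math.pi * (t + 0.5) * k / n) for t in range(n)))
    return out


def idct(c):
    """Inverse of the orthonormal DCT-II above."""
    n = len(c)
    out = []
    for t in range(n):
        val = math.sqrt(1.0 / n) * c[0]
        val += sum(math.sqrt(2.0 / n) * c[k] * math.cos(math.pi * (t + 0.5) * k / n)
                   for k in range(1, n))
        out.append(val)
    return out


def compress_sequence(x, keep):
    """Shorten x to `keep` entries by retaining only its lowest-frequency components."""
    coeffs = dct(x)[:keep]            # discard high frequencies
    gain = math.sqrt(keep / len(x))   # keep amplitudes comparable after shortening
    return [gain * v for v in idct(coeffs)]


compressed = compress_sequence([2.0] * 8, keep=4)
```

In a KV cache this would be applied along the token axis of each key and value channel, and repeated whenever the cache fills up again.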

FutureMind: Equipping Small Language Models with Strategic Thinking-Pattern Priors via Adaptive Knowledge Distillation

This paper proposes FutureMind, a training-free framework that distills structured reasoning and retrieval strategies from LLMs into reusable thinking-pattern priors. Through a four-stage pipeline (problem analysis → logical reasoning → strategy planning → retrieval guidance) and three retrieval paradigms, it enables SLMs to achieve SOTA performance on multi-hop QA.

GAPrune: Gradient-Alignment Pruning for Domain-Aware Embeddings

GAPrune measures parameter domain importance via Fisher Information and cross-domain alignment via the cosine similarity between general and domain gradients. These are fused into a Domain-Alignment Importance (DAI) score for one-shot pruning, ensuring compressed embedding models retain general language capabilities while strengthening domain expertise on finance and chemistry benchmarks.

Generative Diffusion Prior Distillation for Long-Context Knowledge Transfer

This work reformulates the "full-sequence teacher \(\rightarrow\) partial-sequence student" distillation as an inverse problem. It treats the student's short-context features as "degraded observations" of the target long-context features. By using a diffusion model as a generative prior for the teacher's features to perform posterior sampling, the method provides each student feature with a set of "dynamic, diverse, and aggregatable" teacher signals. This enables a classifier seeing only sequence prefixes to achieve generalization capabilities approximating those of a full-sequence model.

GlowQ: Group-Shared Low-Rank Approximation for Quantized LLMs

GlowQ transforms the paradigm of "independent low-rank error correction for every layer" into "sharing a right factor \(B\) per input group, caching the projection \(BX\) once for reuse across modules," while selectively recovering specific groups/layers based on benefit. This significantly reduces first-token latency and increases throughput for quantized LLMs with negligible accuracy loss.

GmNet: Revisiting Gating Mechanisms From A Frequency View

This paper provides the first systematic frequency-domain explanation for the effectiveness of Gated Linear Units (GLUs): element-wise multiplication corresponds to frequency-domain convolution that widens the spectrum, and non-smooth activations preserve high-frequency energy. Based on this, the authors design GmNet, a minimalist architecture using a simple \(\sigma(x)\cdot x\) gate to correct the spectral bias in lightweight models, achieving new SOTA results for efficient models on ImageNet.

GPTailor: Large Language Model Pruning Through Layer Cutting and Stitching

GPTailor reformulates LLM structured pruning as a zeroth-order optimization problem of "layer-wise cutting and stitching over a family of fine-tuned variants from the same base." It supports three operations: layer deletion, cross-model layer selection, and layer merging. By employing a ParEGO multi-task objective and SMAC multi-fidelity search to automatically find configurations, it allows Llama2-13B to retain 97.3% of its original performance after removing approximately 25% of layers without any post-training repair, significantly exceeding previous SOTA.

Gradient-Aligned Calibration for Post-Training Quantization of Diffusion Models

This paper improves Post-Training Quantization (PTQ) for diffusion models by learning a set of importance weights for calibration samples across different timesteps via meta-learning. This aligns gradient directions and mitigates gradient conflicts across timesteps in the quantized model.

Gradient Intrinsic Dimensionality Alignment: Closing the Gap between LoRA and Full Fine-Tuning

This paper identifies that the fundamental cause of the performance gap between LoRA and full fine-tuning (FFT) is that the low-rank subspace dimension of LoRA is significantly smaller than the number of truly effective update directions in FFT gradients (Gradient Intrinsic Dimensionality, GID, with up to a 100x difference). It proposes an entropy-based estimator to measure layer-wise GID and utilizes RaLoRA / RaLoRA-Pro to align the equivalent rank of LoRA with GID without increasing the parameter count. This approach consistently approaches or exceeds FFT performance on GLUE, GSM8K, HumanEval, MT-Bench, and image classification.

GradPruner: Gradient-guided Layer Pruning Enabling Efficient Fine-Tuning and Inference for LLMs

GradPruner calculates the importance of each layer (IGIA-Matrix) using gradients accumulated during the initial 1% of steps of LoRA fine-tuning for layer pruning. It then performs "same-sign merging" of pruned layers into retained layers, achieving simultaneous training and inference speedups on downstream tasks: 40% parameter reduction with only 0.99% accuracy loss.

Grounding and Enhancing Informativeness and Utility in Dataset Distillation

The InfoUtil framework is proposed to maximize sample informativeness (identifying the most critical patches) using game-theoretic Shapley Values and maximize sample utility (selecting the most valuable samples for training) via gradient norms. It achieves a 6.1% improvement over the previous SOTA on ImageNet-1K.

HEAPr: Hessian-based Efficient Atomic Expert Pruning in Output Space

HEAPr decomposes each MoE expert into irreducible "atomic experts" (one column of \(W_{up}/W_{gate}\) + one row of \(W_{down}\)). It measures the importance of each atomic expert using second-order information from the Optimal Brain Surgeon (OBS). By moving the analysis from parameter space to output space, it reduces the Hessian storage complexity from \(O(d^4)\) to \(O(d^2)\). Global ranking and pruning of atomic experts across the entire model can be achieved on a small calibration set with only two forward passes and one backward pass, maintaining near-lossless performance at 20–25% pruning ratios.

HiFo-Prompt: Prompting with Hindsight and Foresight for LLM-based Automatic Heuristic Design

The HiFo-Prompt framework is proposed, which enhances LLM-driven Automatic Heuristic Design (AHD) through two synergistic modules: Hindsight (retrospective knowledge pool) and Foresight (prospective evolutionary navigator), significantly outperforming existing methods on tasks such as TSP and FSSP.

Highly Efficient and Effective LLMs with Multi-Boolean Architectures

This paper proposes a new framework that represents LLM weights using multi-kernel Boolean parameters. For the first time, it fine-tunes LLMs directly in the Boolean domain without full-precision latent weights, outperforming existing ultra-low-bit quantization and binarization methods in both representation capability and computational efficiency.

IDER: IDempotent Experience Replay for Reliable Continual Learning

This paper introduces idempotence into continual learning through two components: a standard idempotent module and an idempotent distillation module. By forcing the model to maintain output self-consistency when learning new tasks, it enhances prediction reliability (reducing calibration error) while significantly mitigating catastrophic forgetting.

IGU-LoRA: Adaptive Rank Allocation via Integrated Gradients and Uncertainty-Aware Scoring

To address the instability of rank allocation in AdaLoRA caused by scoring with instantaneous gradients, IGU-LoRA introduces "Integrated Gradients" into the parameter space to measure the importance of each singular value direction. It then calculates a signal-to-noise ratio (SNR) style uncertainty-aware score using EMA smoothing and deviation tracking to guide pruning, consistently exceeding LoRA / AdaLoRA / DoRA under the same parameter budget.

Improving Block-Wise LLM Quantization by 4-bit Block-Wise Optimal Float (BOF4): Analysis and Variations

This paper revisits the block-wise absmax quantization (NF4 / AF4) commonly used in QLoRA. It uses a Lloyd-style EM algorithm to directly solve for a 4-bit codebook (BOF4) that is optimal for end-to-end weight error. Combined with two simple modifications—"Signed Normalization" (BOF4-S) and "Outlier-Preserved Quantization" (OPQ)—it pushes quantization error and perplexity to the best levels among 4-bit block-wise quantization methods across three major LLM families.
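
For readers unfamiliar with block-wise absmax quantization, the sketch below shows the basic scheme that NF4, AF4, and BOF4 share. The uniform 16-level codebook is a placeholder assumption; BOF4's contribution is to replace it with an optimized codebook, which is not reproduced here.

```python
def quantize_block(block, codebook):
    """Normalize a block by its absolute maximum, then snap to the nearest code."""
    absmax = max(abs(v) for v in block) or 1.0  # avoid dividing by zero
    codes = [
        min(range(len(codebook)), key=lambda j: abs(v / absmax - codebook[j]))
        for v in block
    ]
    return absmax, codes


def dequantize_block(absmax, codes, codebook):
    """Map 4-bit codes back to weights using the stored per-block scale."""
    return [absmax * codebook[c] for c in codes]


# Placeholder codebook: 16 uniformly spaced levels in [-1, 1] (not NF4 or BOF4 values).
CODEBOOK = [-1.0 + 2.0 * i / 15 for i in range(16)]

block = [0.5, -2.0, 1.0, 0.25]
absmax, codes = quantize_block(block, CODEBOOK)
recon = dequantize_block(absmax, codes, CODEBOOK)
```

Because the codebook contains both endpoints, the largest-magnitude weight in each block is reconstructed exactly; all remaining error depends on where the interior levels are placed, which is the quantity a Lloyd-style optimization targets.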

In Good GRACES: Principled Teacher Selection for Knowledge Distillation

The authors propose GRACE, a lightweight scoring metric that predicts which teacher is most compatible with a specific student and task before distillation. By analyzing the student's gradient distribution on teacher-generated data—without requiring verifiers, teacher logits, internal states, or test data—it achieves up to 86% Spearman correlation with post-distillation performance on GSM8K/MATH.

Incentivizing Agentic Reasoning in LLM Judges via Tool-Integrated Reinforcement Learning

This paper proposes TIR-Judge, an end-to-end RL framework that trains LLM judge models to alternate between reasoning and code execution tools during the evaluation process. It outperforms 32B reasoning reward models with only 8B parameters across 7 public benchmarks, and TIR-Judge-Zero enables self-bootstrapped improvement without distillation.

Inconsistency Biases in Dynamic Data Pruning

This paper identifies that dynamic data pruning is hindered by two types of "inconsistency biases": Score Context Drift, caused by comparing importance scores across different model states, and Temporal Gradient Bias, resulting from non-uniform sampling across epochs. The proposed RePB framework (Local Window Pruning + Uniform Resampling + Cumulative Temporal Reweighting) structurally eliminates these biases, achieving or exceeding full training accuracy with an approximately 30% pruning rate across 16 datasets, 17 models, and 13 tasks.

InfoScan: Information-Efficient Visual Scanning via Resource-Adaptive Walks

InfoScan replaces the fixed raster or Hilbert scanning orders in Mamba-like visual backbones with content-adaptive paths. By quantifying the information of each patch via "entropy + local variance," it utilizes reinforcement learning to learn a scanning sequence that prioritizes information-dense regions, achieving higher accuracy with fewer parameters across classification, detection, and segmentation tasks.

InfoTok: Adaptive Discrete Video Tokenizer via Information-Theoretic Compression

InfoTok introduces Shannon’s source coding theorem into discrete video tokenization, utilizing ELBO to estimate the information volume of each video for adaptive token allocation. It proves that fixed or data-independent tokenizers are biased and suboptimal in representation length. InfoTok reduces token usage by approximately 20–50% while maintaining the same reconstruction quality, achieving a 2.3× higher compression rate than heuristic adaptive methods (ElasticTok) with 11× less inference overhead.

Inheriting Generalizable Knowledge from LLMs to Diverse Vertical Tasks

This paper proposes MASA (Matrix-level Alignment and Scalable Adaptation), which uses a set of minimal "gene matrices" to align with the FFN weights of LLMs to extract generalizable knowledge (via output and spectral alignment). These matrices are then reshaped to arbitrary dimensions using SVD-based adaptive scaling to initialize the FFN layers of lightweight models. This allows an 877M small model to achieve over 85% of the performance of a 7B source model on various vertical tasks, requiring significantly less pre-training data and converging faster than random initialization, distillation, or pruning.

Inlier-Centric Post-Training Quantization for Object Detection Models

InlierQ decomposes object detection activations into "task-relevant inliers" and "anomalies caused by background clutter or sensor noise." It separates the two using gradient-aware voxel saliency scores combined with EM fitting for posterior probabilities. Quantization error minimization is performed exclusively on the inlier set, significantly improving 2D/3D camera and LiDAR detection accuracy under low-bit (W4A4) settings.

INSTANT: Compressing Gradients and Activations for Resource-Efficient Training

INSTANT simultaneously projects the activation \(x\) and the output gradient \(g_y\) during backpropagation into their respective calibrated low-rank subspaces. By replacing full-rank matrix multiplications with low-rank multiplications, it reduces the backpropagation computational cost by 15× and activation memory by 32× with negligible accuracy loss.

Is Finer Better? The Limits of Microscaling Formats in Large Language Models

This paper discovers and explains the counter-intuitive "finer-is-worse" anomaly in microscaling quantization: when the block size decreases below a certain threshold, the limited dynamic range of the FP8 UE4M3 scale causes the quantization error of narrow-distribution tensors to increase rather than decrease. The paper proposes the FP8 UE5M3 scale format as a hardware-friendly solution.

KBVQ-MoE: KLT-guided SVD with Bias-Corrected Vector Quantization for MoE Large Language Models

The authors propose KBVQ-MoE, the first vector quantization framework specifically designed for MoE architectures. By utilizing KLT-guided SVD for Input-Driven Redundancy Elimination (IDRE) and Bias-Corrected Output Stabilization (BCOS), it achieves a 10%+ accuracy improvement over existing methods under 2-bit quantization.

KDP: Simplifying Representation Dynamics in Kernel Space

This work views the forward propagation of LLMs as a discrete dynamical system and observes that the representations of adjacent layers become highly similar once they enter a "slow manifold." By projecting representations into a kernel space, where non-linear inter-layer transformations can be linearly approximated, and using a simple network for the inverse transformation, consecutive Transformer layers can be folded. This achieves approximately 25% parameter reduction without requiring full-model fine-tuning.

KLAS: Using Similarity to Stitch Neural Networks for Improved Accuracy-Efficiency Tradeoffs

KLAS utilizes KL divergence to measure the similarity of intermediate representations in pretrained models, automatically selecting optimal "anchor + block pairs" from \(O(k^2n^2)\) stitching configurations. It shifts the accuracy-efficiency curve of stitched networks upward at the same fine-tuning cost as baselines (ImageNet-1K +1.21% top-1 at the same compute, or 1.33× FLOPs savings at the same accuracy).

Knowledge Distillation as Decontamination? Revisiting the "Data Laundering" Concern in Classification Tasks

The authors systematically examine the severity of "data laundering" (where a contaminated teacher smuggles test set knowledge to a clean student via distillation) across eight classification benchmarks. They find that the inflated accuracy caused by laundering is much smaller than direct contamination and statistically insignificant in most cases. Furthermore, they demonstrate that laundering and direct contamination are distinct phenomena driven by different mechanisms, primarily appearing when there is a large gap between training and test distributions—concluding that knowledge distillation acts more as a "decontaminating" filter than a leakage amplifier.

Knowledge Distillation for Large Language Models through Residual Learning

Addressing the issue where "the teacher itself can be wrong" in white-box distillation, this paper proposes residual learning: allowing the student to learn the difference between its own representation and the teacher's only at positions where the teacher makes incorrect predictions. By incorporating low-dimensional projection, MoE expert fusion, and cross-tokenizer attention, the method consistently outperforms existing white-box approaches in both same- and cross-tokenizer distillation.

Knowledge Fusion of Large Language Models Via Modular Skillpacks

This paper proposes GraftLLM, which extracts the capabilities of heterogeneous source models into compact and transferable "SkillPacks" (modular skill packages). By storing parameter increments through a module-aware adaptive compression strategy, it supports knowledge transfer, heterogeneous model fusion, and non-forgetting continual learning, significantly outperforming existing PEFT and parameter fusion methods across multiple scenarios.

Landscape of Thoughts: Visualizing the Reasoning Process of Large Language Models

The authors propose Landscape of Thoughts (LoT), the first tool to visualize LLM reasoning trajectories as 2D topographic maps. By using perplexity-based features and t-SNE projections, LoT reveals behavioral patterns in reasoning and can be adapted into a lightweight verifier to improve reasoning accuracy and test-time scaling effects.

LaplacianFormer: Rethinking Linear Attention with Laplacian Kernels

LaplacianFormer argues that existing linear attention mechanisms lack a theoretical basis for defaulting to Gaussian kernels, which excessively suppress moderately correlated tokens. By substituting the Gaussian kernel with a slower-decaying, non-vanishing gradient Laplacian kernel (based on the \(\ell_1\) distance) and combining it with injective normalization, Nyström low-rank approximation, and Newton–Schulz inversion, the model achieves a superior accuracy-efficiency trade-off on ImageNet with linear complexity.
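
The difference in decay between the two kernels is easy to check numerically. The sketch below assumes the textbook forms \(\exp(-\|q-k\|_2^2 / 2\sigma^2)\) and \(\exp(-\|q-k\|_1 / \sigma)\) with a shared bandwidth \(\sigma\); the paper's exact parametrization and normalization are not reproduced.

```python
import math


def gaussian_kernel(q, k, sigma=1.0):
    """Gaussian (RBF) kernel on the squared Euclidean distance."""
    sq = sum((a - b) ** 2 for a, b in zip(q, k))
    return math.exp(-sq / (2 * sigma ** 2))


def laplacian_kernel(q, k, sigma=1.0):
    """Laplacian kernel on the l1 distance."""
    l1 = sum(abs(a - b) for a, b in zip(q, k))
    return math.exp(-l1 / sigma)


query = [0.0, 0.0]
far_key = [3.0, 0.0]
gauss_far = gaussian_kernel(query, far_key)
lap_far = laplacian_kernel(query, far_key)
```

For this pair at distance 3, the Laplacian weight is \(e^{-3}\) while the Gaussian weight is \(e^{-4.5}\), so the more distant (moderately correlated) token keeps several times more weight under the Laplacian kernel. At short distances the ordering reverses, since the Gaussian is flatter near zero.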

Large Language Model Compression with Global Rank and Sparsity Optimization

This paper proposes CAP—a two-stage LLM compression framework that first uses Robust Principal Component Analysis (RPCA) to decompose weight matrices into low-rank and sparse candidate subspaces, then utilizes a global budget allocation based on Bernoulli probabilities and policy gradients to automatically decide which singular values and sparse entries to retain across layers. This approach requires no manual thresholds and no backpropagation through the original weights.

LD-MoLE: Learnable Dynamic Routing for Mixture of LoRA Experts

LD-MoLE is proposed, utilizing the Sparsegen closed-form projection to replace traditional TopK routing. It achieves differentiable, dynamic, and token-adaptive LoRA expert allocation. Combined with a lightweight MLP to predict sparsity factors and an analytic sparsity loss, it outperforms fixed-routing and ReLU-routing baselines across multiple benchmarks.
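
As a rough illustration of closed-form sparse routing, the sketch below implements sparsemax and, on top of it, a sparsegen-lin style projection in which the logits are rescaled by \(1/(1-\lambda)\) before the projection. Treating \(\lambda\) as the predicted sparsity factor and using this particular variant are assumptions; the paper's exact formulation may differ.

```python
def sparsemax(z):
    """Euclidean projection of the logits z onto the probability simplex."""
    zs = sorted(z, reverse=True)
    cumsum = 0.0
    tau = 0.0
    for k, v in enumerate(zs, start=1):
        cumsum += v
        # The support is the largest k for which this inequality still holds.
        if 1 + k * v > cumsum:
            tau = (cumsum - 1) / k
    return [max(v - tau, 0.0) for v in z]


def sparsegen_lin(z, lam=0.0):
    """Assumed sparsegen-lin form: a larger lam (< 1) yields sparser routing weights."""
    assert lam < 1, "lam must be strictly below 1"
    return sparsemax([v / (1 - lam) for v in z])


logits = [0.6, 0.4]
dense = sparsegen_lin(logits, lam=0.0)
sharper = sparsegen_lin(logits, lam=0.5)
sparse = sparsegen_lin(logits, lam=0.9)
```

Unlike TopK, the number of active experts here is not fixed in advance: it falls out of the projection and changes with both the logits and \(\lambda\), while the output stays piecewise linear in the logits.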

Learnable Sparsity for Vision Generative Models

EcoDiff utilizes an end-to-end differentiable masking objective spanning the entire denoising trajectory to perform structured pruning on diffusion and flow matching models. Combined with "Timestep Gradient Checkpointing," it reduces memory consumption from \(O(T)\) to \(O(1)\), enabling the pruning of 20% of parameters in SDXL/FLUX with minimal quality loss using only 100 samples and 10 A100 GPU hours.

Learning Semi-Structured Sparsity for LLMs via Shared and Context-Aware Hypernetwork

This work employs a lightweight hypernetwork, shared across layers and conditioned on layer/component embeddings, to directly generate n:m semi-structured sparsity masks for LLMs layer-by-layer. By merging the advantages of "fast but coarse" heuristics and "refined but expensive" optimization, it enables pruning LLaMA-2 models ranging from 7B to 70B on a single A100 while achieving the state-of-the-art precision-sparsity trade-off.

LeSTD: LLM Compression via Learning-based Sparse Tensor Decomposition

LeSTD packages the Q/K/V/O weight matrices of a Multi-Head Attention (MHA) layer into a 4th-order tensor to perform a "cross-head shared" Tucker decomposition. It then sparsifies the dense core tensor using pruning supported by closed-form importance scores, breaking the "dense core bottleneck" of tensor decomposition methods to maintain accuracy at higher compression rates while enabling direct inference in the compressed domain.

Light Differentiable Logic Gate Networks

This paper identifies that the root causes of gradient vanishing, discretization errors, and high training costs in Differentiable Logic Gate Networks (DLGN) lie in the "function-wise enumeration" parametrization of logic gate neurons. It proposes a non-redundant Input-Wise Parametrization (IWP), reducing the parameter count per gate from \(2^{2^n}\) to \(2^n\) (\(4\times\) reduction for binary inputs). Combined with a negative-asymmetric heavy-tailed residual initialization, this makes the network more memory-efficient, enables \(8.5\times\) faster convergence, and speeds up backpropagation by up to \(1.86\times\), while maintaining or improving accuracy on CIFAR-100.
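
The parameter counts quoted above can be verified directly, and the second function sketches what an input-wise gate might look like: one relaxed truth-table entry per input combination, blended multilinearly. That relaxation is an assumption made for illustration, not the paper's stated formula.

```python
def gate_param_counts(n_inputs):
    """Parameters per gate: one per Boolean function vs. one per input combination."""
    function_wise = 2 ** (2 ** n_inputs)  # enumerate every Boolean function
    input_wise = 2 ** n_inputs            # one entry per truth-table row
    return function_wise, input_wise


def soft_gate(a, b, table):
    """Relaxed 2-input gate: table[(i, j)] in [0, 1] is the output for inputs (i, j)."""
    return sum(
        table[(i, j)] * (a if i else 1 - a) * (b if j else 1 - b)
        for i in (0, 1)
        for j in (0, 1)
    )


counts = gate_param_counts(2)
AND_TABLE = {(0, 0): 0.0, (0, 1): 0.0, (1, 0): 0.0, (1, 1): 1.0}
hard_out = soft_gate(1.0, 1.0, AND_TABLE)
soft_out = soft_gate(0.5, 0.5, AND_TABLE)
```

For binary gates the counts are 16 and 4, matching the \(4\times\) reduction in the summary.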

LightMem: Lightweight and Efficient Memory-Augmented Generation

LightMem is proposed as a three-stage lightweight memory system inspired by the human Atkinson-Shiffrin memory model. Through cognitive sensory memory pre-compression, topic-aware short-term memory integration, and offline sleep-time updates, it achieves up to a 7.7% accuracy improvement on LongMemEval while reducing token consumption by up to 38x.

LipNeXt: Scaling up Lipschitz-based Certified Robustness to Billion-parameter Models

This paper proposes LipNeXt, the first unconstrained, convolution-free 1-Lipschitz architecture. By utilizing orthogonal manifold optimization to learn orthogonal matrices and a theoretically motivated Spatial Shift Module for spatial mixing, the model successfully scales to the billion-parameter level. It establishes new SOTA Certified Robust Accuracy (CRA) on CIFAR-10/100, Tiny-ImageNet, and ImageNet, achieving a +8% CRA gain at \(\varepsilon=1\) on ImageNet.

LLM DNA: Tracing Model Evolution via Functional Representations

Starting from a biological DNA analogy, this paper mathematically defines LLM DNA as a low-dimensional bi-Lipschitz representation of model functional behavior. It proves that this representation satisfies heredity and genetic determinism properties, and designs a training-free RepTrace pipeline to extract DNA and construct phylogenetic trees for 305 LLMs.

LoFT: Low-Rank Adaptation That Behaves Like Full Fine-Tuning

This paper proposes LoFT, a low-rank adaptation method that aligns internal optimizer dynamics (momentum and second moment) with full fine-tuning behavior. Composed of six building blocks, LoFT exactly recovers AdamW in the full-rank limit and significantly narrows the performance gap between LoRA and full fine-tuning across multiple benchmarks.

LogART: Pushing the Limit of Efficient Logarithmic Post-Training Quantization

LogART introduces "learnable rounding" into logarithmic post-training quantization (log-PTQ) for the first time. Combined with a logarithmic quantizer supporting dynamic multi-bases, asymmetry, and outlier resistance alongside efficient hyperparameter search, it pushes log-PTQ to ultra-low 3/4-bit widths. It achieves SOTA accuracy on LLMs, CNNs, and ViTs while enabling multiplier-less hardware with reduced area and power consumption.

LookaheadKV: Fast and Accurate KV Cache Eviction by Glimpsing into the Future without Generation

This paper proposes LookaheadKV, which utilizes learnable lookahead tokens and selectively activated LoRA modules to predict the attention importance scores of actual responses. This achieves fast and accurate KV cache eviction without draft generation, outperforming existing methods on multiple long-context benchmarks while reducing eviction overhead by up to 14.5x.

Lookup multivariate Kolmogorov-Arnold Networks

By replacing the 1D trainable functions in KAN with 2D counterparts and utilizing B-spline look-up tables for \(O(1)\) evaluation, the proposed lmKAN module can directly replace linear layers—reducing inference FLOPs by 1.6–78× at equivalent precision and achieving an order-of-magnitude speedup on H100 GPUs via optimized CUDA kernels.

LoRA-Mixer: Coordinate Modular LoRA Experts Through Serial Attention Routing

The paper "serializes" multiple independently trained LoRA experts into the input/output projection matrices of attention modules rather than replacing FFNs or using parallel branches. It incorporates a Routing Specialization Loss (RSL) that unifies load balancing and input-aware specialization via entropy regularization, outperforming LoRA-MoE SOTA on 15 multi-task benchmarks using only 48% of trainable parameters.

LS-Merge: Merging Language Models in Latent Space

LS-Merge encodes LLM weight tensors into a smooth latent space, performs interpolation in the latent space, and decodes them back into weights. This supports "single-model self-merging" and "heterogeneous merging across architectures (different widths/depths/model families)," which are either impossible or fragile in traditional weight-space merging.

LSA: Layer-wise Sparsity Allocation for Large Language Model Pruning Based on Minimal Linear Reconstruction Error

LSA directly characterizes the redundancy of Transformer layers using the "minimal linear reconstruction error assuming 50% least important weights are pruned." This approach bypasses Wanda-style weight scoring and manual reduce functions, enabling non-uniform sparsity allocation across layers (and even blocks/projections), outperforming methods like OWL and DLP under 70% high sparsity.

Many Eyes, One Mind: Temporal Multi-Perspective and Progressive Distillation for Spiking Neural Networks

To address two major pain points in Spiking Neural Network (SNN) distillation—"using a fixed ANN output to supervise all timesteps" and "information loss during truncated inference"—this paper proposes masked re-weighting to generate diverse temporal teacher signals (Many Eyes) and cumulative average predictions to progressively align with full-length predictions (One Mind). It achieves SOTA on CIFAR/ImageNet and supports reliable inference at any timestep.
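The "One Mind" idea of cumulative average predictions can be illustrated with a small sketch. It assumes the SNN emits one output vector per timestep and that the prediction available at timestep t is the mean of the outputs seen so far; the function name `cumulative_average` is illustrative and not taken from the paper.

```python
def cumulative_average(step_outputs):
    """Return, for every timestep t, the mean of outputs 1..t.

    step_outputs: list of equal-length lists, one per timestep.
    The last entry equals the full-length (all-timestep) average.
    """
    running = [0.0] * len(step_outputs[0])
    averages = []
    for t, out in enumerate(step_outputs, start=1):
        # Accumulate the per-class sum, then normalize by the elapsed steps.
        running = [r + o for r, o in zip(running, out)]
        averages.append([r / t for r in running])
    return averages


# Toy example: two timesteps, two classes.
preds = cumulative_average([[2.0, 0.0], [0.0, 2.0]])
print(preds)
```

Truncated inference at timestep t would read `preds[t - 1]`, which is the quantity the summary describes as being aligned with the full-length prediction.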

MaskPro: Linear-Space Probabilistic Learning for Strict (N:M)-Sparsity on LLMs

MaskPro reduces the logit storage for learned (N:M) semi-structured sparsity from \(O\!\left(\binom{M}{N}\frac{d}{M}\right)\) used in MaskLLM to linear \(O(d)\). It employs a forward-only policy gradient (enhanced with loss-residual and a moving average tracker for variance reduction) to train masks. This achieves (2:4) sparse masks approximating the performance of MaskLLM with memory costs comparable to rule-based methods and significantly lower training overhead than MaskLLM.
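The storage comparison above can be made concrete with a few lines of arithmetic. This is a sketch under stated assumptions: the categorical count follows the \(\binom{M}{N}\frac{d}{M}\) expression quoted in the summary, the linear count assumes one logit per weight (one possible reading of \(O(d)\)), and the function names are illustrative.

```python
from math import comb


def total_logits_categorical(d: int, n: int, m: int) -> int:
    """d/M groups, each storing one logit per admissible N:M mask."""
    return (d // m) * comb(m, n)


def total_logits_linear(d: int) -> int:
    """Assumption: one logit per weight, so storage grows linearly in d."""
    return d


d = 4096 * 4096  # hypothetical size of one weight matrix
for n, m in [(2, 4), (4, 8)]:
    ratio = total_logits_categorical(d, n, m) / total_logits_linear(d)
    print(f"{n}:{m} sparsity -> categorical/linear storage ratio = {ratio}")
```

The gap is modest for (2:4) but grows quickly with the group size, since \(\binom{M}{N}\) grows combinatorially while the linear count does not.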

MASS: MoErging through Adaptive Subspace Selection

MASS stores the low-rank singular subspaces updated by each task into a shared model. During inference, it utilizes a data-free and training-free "projection residual" router to automatically select the task subspace and classification head that best match the input without knowing the task identity, pushing model merging accuracy to approximately 98% of individually fine-tuned models.

Memba: Membrane-driven Parameter-Efficient Fine-Tuning for Mamba

Memba is proposed as a parameter-efficient fine-tuning method inspired by biological neuron membrane potentials. By introducing Leaky Integrate Membrane (LIM) neurons into the Mamba gating branch to achieve temporal adaptation, combined with LoRA placement optimization and cross-layer membrane transmission, it outperforms existing Mamba PEFT methods on language and vision tasks with minimal parameters.

MergOPT: A Merge-Aware Optimizer for Robust Model Merging

By shifting "merging" considerations forward to the fine-tuning stage, MergOPT models "other experts to be merged" as adversarial perturbations in the weight space during training. Using distributionally robust optimization (DRO), it trains experts that are naturally more robust to merging, achieving a gain of 3.5% (up to 9.5%) in subsequent merging with almost no additional training cost.

Metis: Training LLMs with FP4 Quantization

Metis identifies the "anisotropy of weights/activations/gradients singular spectra" as the root cause of FP4 training failure. It proposes splitting the spectrum in the spectral domain into "a few dominant components + long-tail residuals" for separate quantization. By utilizing sparse sampling and random projection, the SVD overhead is reduced to a negligible level, enabling W4A4G4 full FP4 training on LLaMA-3 8B. The training loss is only 0.4% higher than BF16, while downstream accuracy differs by only 0.1%.

MicroMix: Efficient Mixed-Precision Quantization with Microscaling Formats for Large Language Models

MicroMix implements weight-activation quantization for LLMs based on NVIDIA Blackwell's MXFP4/MXFP6/MXFP8 microscaling formats. It adaptively assigns 4/6/8 bits to activation channels based on a "quantization error threshold" per layer. Supported by a fused CUTLASS GEMM kernel that integrates reordering, quantization, and dequantization, it achieves near FP16 accuracy at an average precision of approximately 5 bits, while providing a 2.3–3.4x speedup over FP16.

MoBE: Mixture-of-Basis-Experts for Compressing MoE-based LLMs

MoBE decomposes the up/gate matrices of MoE experts into a rank decomposition \(W=AB\), where the larger \(B\) is represented as a linear combination of a small set of shared basis matrices across all experts in a layer. By minimizing reconstruction error alone, it compresses trillion-parameter MoEs like DeepSeek-V3 and Kimi-K2 by 24%–30% with an accuracy drop of only 1%–2%.

MobileLLM-R1: Exploring the Limits of Sub-Billion Language Model Reasoners with Open Training Recipes

Through meticulous data filtering and adaptive mixing strategies, the sub-billion parameter reasoning model MobileLLM-R1-950M was pre-trained using only 4.2T tokens (11.7% of Qwen3). It matches or exceeds Qwen3-0.6B on reasoning benchmarks such as AIME while fully open-sourcing data sources and training recipes.

Modality-free Graph In-context Alignment

The authors propose MF-GIA, the first graph in-context learning framework that simultaneously satisfies tuning-free inference, cross-domain alignment, and modality-free requirements. By capturing domain features via gradient fingerprints and aligning features and labels through FiLM-conditioned transformations, MF-GIA achieves SOTA performance on few-shot tasks across multiple graph domains.

MoNE: Replacing Redundant Experts with Lightweight Novices for Structured Pruning of MoE

This paper proposes MoNE (Mixture-of-Novices-and-Experts), which identifies redundant experts by jointly evaluating expert access frequency and output variance. These experts are replaced with their output mean (a "novice" constant vector). MoNE achieves more effective and robust compression across five MoE models compared to existing pruning methods, with an average accuracy drop of only 0.14 at a 25% pruning rate.
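The replacement step can be sketched as follows. This is a minimal illustration, assuming an expert is a callable that maps an input vector to an output vector and that the novice is the mean of its outputs over a calibration set; `mean_output` and `make_novice` are hypothetical names, and the redundancy scoring that decides which experts to replace is not shown.

```python
from statistics import fmean


def mean_output(outputs):
    """Column-wise mean of a list of equal-length output vectors."""
    return [fmean(col) for col in zip(*outputs)]


def make_novice(expert, calibration_inputs):
    """Replace an expert with a constant function returning its mean output."""
    const = mean_output([expert(x) for x in calibration_inputs])
    # The novice ignores its input: no matrix multiply, only a stored vector.
    return lambda x: list(const)


# Toy expert and calibration set.
expert = lambda x: [2 * x[0], x[1] + 1]
novice = make_novice(expert, [[1, 1], [3, 5]])
print(novice([100, 100]))
```

Because the novice is a single stored vector, the replaced expert's weight matrices can be dropped entirely, which is where the compression comes from.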

MoSA: Mosaic Shared Adaptation of Large Language Models

MoSA replaces LoRA's low-rank decomposition with mosaic-style parameter sharing, where the weight matrix is randomly partitioned into small blocks with each sharing a learnable scalar. This achieves full-rank, element-wise weight updates under an identical parameter budget, while custom backward kernels ensure zero inference overhead and efficient training.
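A toy version of the sharing scheme is sketched below. It assumes that every weight position is randomly assigned to one of a fixed number of shared scalars, so a "block" here is simply the set of positions with the same id; the paper's actual partitioning and kernels may differ, and all function names are illustrative.

```python
import random


def make_assignment(rows, cols, num_scalars, seed=0):
    """Randomly assign every weight position to one of num_scalars groups."""
    rng = random.Random(seed)
    # Round-robin ids guarantee every group is used; shuffling randomizes positions.
    ids = [i % num_scalars for i in range(rows * cols)]
    rng.shuffle(ids)
    return [ids[r * cols:(r + 1) * cols] for r in range(rows)]


def delta_weight(assignment, scalars):
    """Expand the shared scalars into a dense, element-wise weight update."""
    return [[scalars[g] for g in row] for row in assignment]


def scalar_grads(assignment, weight_grads, num_scalars):
    """Gradient of each shared scalar is the sum of gradients over its positions."""
    grads = [0.0] * num_scalars
    for a_row, g_row in zip(assignment, weight_grads):
        for g, dw in zip(a_row, g_row):
            grads[g] += dw
    return grads


assignment = make_assignment(4, 4, num_scalars=4)
update = delta_weight(assignment, [0.1, 0.2, 0.3, 0.4])
```

Since the update is dense, it can be added to the frozen weights once after training, which is consistent with the zero-inference-overhead claim in the summary.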

MOSS: Efficient and Accurate FP8 LLM Training with Microscaling and Automatic Scaling

MOSS employs "two-level microscaling" to quantize sensitive activations and "automatic scaling" to predict weight scaling factors. This allows FP8 training of 7B models to match BF16 accuracy while increasing throughput to 1.34×.

Multi-View Encoders for Performance Prediction in LLM-Based Agentic Workflows

This paper proposes Agentic Predictor, a multi-view workflow encoding framework that predicts the performance of LLM Agent workflows by jointly modeling graph structure, code semantics, and prompt information, significantly reducing expensive trial-and-error evaluations.

Navigating the Accuracy-Size Trade-Off with Flexible Model Merging

This paper proposes FLEXMERGE, a data-free model merging framework that decomposes each fine-tuned model into sequential blocks and greedily merges them in pairs based on block-level cosine similarity. This approach generates models of any size (including decimals) between "1× single merged model" and "M× retaining all models," providing the first systematic characterization of the "accuracy-size" trade-off curve for different merging algorithms.

NerVE: Nonlinear Eigenspectrum Dynamics in LLM Feed-Forward Networks

This paper proposes NerVE, a lightweight eigenspectrum analysis framework. Using four complementary indicators (Spectral Entropy, Participation Ratio, Early Eigenvalue Enrichment, and JS Divergence), it systematically reveals how nonlinearities in LLM FFNs reinject variance and reshape the eigenspectrum, and how architecture and optimizer choices imprint unique spectral signatures.

NLI: Non-uniform Linear Interpolation Approximation of Nonlinear Operations for Efficient LLMs Inference

The selection of piecewise interpolation points for nonlinear functions on an FP16 grid is modeled as a dynamic programming problem to obtain globally optimal, calibration-free non-uniform piecewise linear interpolation tables. Coupled with a two-level addressing hardware circuit, this allows nonlinear operators like SiLU, Softmax, and RMSNorm to achieve near-zero accuracy loss in LLMs while being 4 times more hardware-efficient than the SOTA.

No Outlier Channels but with Outlier Blocks

This paper points out that outliers in non-uniform quantization are no longer concentrated in "outlier channels" as in uniform quantization, but appear scattered as "outlier blocks." Accordingly, it proposes NuBitQ, a flexible arbitrary bit-width quantization framework, with an external Hessian-free and fine-tuning-free Outlier Compensation Plug-in (OCP). It achieves near-lossless 4-bit quantization and significantly outperforms existing non-uniform quantization methods at 2-bit.

Null-Space Filtering for Data-Free Continual Model Merging: Preserving Stability, Promoting Plasticity

This paper proposes the NUFILT framework, which leverages the geometric property that "task vectors approximately align with representation subspaces." By using null-space filtering to suppress interference to old tasks and projection-aware LoRA to restore plasticity for new tasks, NUFILT achieves continual model merging without any data access. It outperforms OPCM by 4-8% across vision, NLP, and multimodal benchmarks, approaching the upper bound of independent fine-tuning.

OBS-Diff: Accurate Pruning For Diffusion Models in One-Shot

OBS-Diff resurrects and adapts the classic Optimal Brain Surgeon (OBS) pruning for large-scale text-to-image diffusion models. Through a "Timestep-aware Hessian," it makes pruning criteria more sensitive to early denoising steps. By using "Module Packages," it amortizes expensive layer-wise calibration. Without requiring any training or fine-tuning, it supports unstructured, N:M semi-structured, and structured granularities, significantly outperforming baselines like Wanda and DSnoT at high sparsities (50%–70%).

ODESteer: A Unified ODE-Based Steering Framework for LLM Alignment

This paper proposes a unified theoretical framework for activation steering based on Ordinary Differential Equations (ODEs), interpreting traditional activation addition as Euler discretization of an ODE and equating steering direction identification to defining a barrier function. Based on this, the ODESteer method is designed to achieve fine-grained steering through multi-step adaptive ODE solving, yielding improvements of 5.7% on TruthfulQA, 2.5% on UltraFeedback, and 2.4% on RealToxicityPrompts.
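The correspondence between activation addition and Euler discretization can be sketched in a few lines. The example uses a fixed-step Euler solver as a simplification (the summary mentions adaptive multi-step solving), and `euler_steer` and the toy vector field are illustrative names rather than the paper's API.

```python
def euler_steer(h, field, total_time, num_steps):
    """Integrate dh/dt = field(h) from t=0 to t=total_time with Euler steps."""
    dt = total_time / num_steps
    for _ in range(num_steps):
        v = field(h)  # steering direction, re-evaluated at the current state
        h = [hi + dt * vi for hi, vi in zip(h, v)]
    return h


# A constant field with a single step reduces to activation addition: h + T * v.
constant_field = lambda h: [1.0, -1.0]
print(euler_steer([0.0, 1.0], constant_field, total_time=2.0, num_steps=1))

# A state-dependent field benefits from several smaller steps.
pull_to_origin = lambda h: [-x for x in h]
print(euler_steer([1.0, 1.0], pull_to_origin, total_time=1.0, num_steps=4))
```

With a constant direction the number of steps does not matter; the multi-step view only pays off when the steering direction depends on the current activation.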

Optimal Brain Restoration for Joint Quantization and Sparsification of LLMs

This paper proposes OBR (Optimal Brain Restoration), a training-free framework that utilizes a closed-form solution for group error compensation based on second-order Hessian information. It reconciles the conflicting weight distribution requirements of pruning and quantization, achieving the first W4A4KV4 + 50% sparse LLM. On Llama2-7B, perplexity degrades by only 1.4 relative to the FP16 dense baseline while delivering up to 4.72× speedup and 6.4× memory compression.

OrderDP: A Theoretically Guaranteed Lossless Dynamic Data Pruning Framework

OrderDP reformulates dynamic data pruning as a straightforward two-stage process: uniformly sampling a candidate pool followed by training on the Top-q samples with the highest losses. It proves that this mechanism unbiasedly minimizes a surrogate loss defined by weighted order statistics. This provides the first theoretical guarantee for convergence and generalization in dynamic pruning, achieving near-lossless performance with over 40% training cost savings on CIFAR/ImageNet.
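The two-stage selection described above can be sketched as follows. It assumes per-sample losses are already available for the dataset; `select_batch` and its parameter names are illustrative and not taken from the paper.

```python
import random


def select_batch(losses, pool_size, q, rng):
    """Stage 1: sample a candidate pool uniformly without replacement.
    Stage 2: keep the q candidates with the highest loss."""
    pool = rng.sample(range(len(losses)), pool_size)
    pool.sort(key=lambda i: losses[i], reverse=True)
    return pool[:q]


rng = random.Random(0)
losses = [0.1, 0.9, 0.5, 0.3]
# With the pool covering the whole dataset, the result is the global top-q.
print(select_batch(losses, pool_size=4, q=2, rng=rng))  # → [1, 2]
```

Only the `q` selected samples are used for the gradient step, so the training cost per iteration scales with `q` rather than with the pool size.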

Otters: An Energy-Efficient Spiking Transformer via Optical Time-to-First-Spike Encoding

This paper reinterprets the "natural signal decay" of optoelectronic devices—originally considered a physical defect—as the temporal decay function required for Time-to-First-Spike (TTFS) encoding. By combining a stepped dynamic threshold and a lossless QNN-to-SNN conversion algorithm, the authors develop a 1-bit KV Spiking Transformer. It achieves SOTA performance among SNNs across seven GLUE tasks while improving energy efficiency by 1.77x compared to the previous best spiking language models.

Paper Copilot: Tracking the Evolution of Peer Review in AI Conferences

The authors build Paper Copilot—a persistent digital archive and analysis platform for peer reviews across dozens of AI/ML conferences. By collecting review data via a hybrid strategy of OpenReview API, web scraping, and community contributions, the platform archives real-time rating snapshots (including dynamics before and after rebuttals). It reveals a structural shift in ICLR 2025 where decision entropy decreased anomalously, indicating a transition from probabilistic tiering to deterministic score-driven decisions. Additionally, it supports talent trajectory tracking through LLM-driven author-affiliation metadata extraction.

Parallel Token Prediction for Language Models

This paper proposes Parallel Token Prediction (PTP), which moves sampling stochasticity from post-processing to model inputs (auxiliary variables), making future tokens deterministic functions and enabling the joint prediction of multiple tokens in a single forward pass.

ParoQuant: Pairwise Rotation Quantization for Efficient Reasoning LLM Inference

This paper proposes ParoQuant, which eliminates weight outliers through a combination of hardware-efficient, optimizable independent Givens rotations and channel scaling, achieving high-precision, low-overhead 4-bit weight quantization on reasoning LLMs.

PASER: Post-Training Data Selection for Efficient Pruned Large Language Model Recovery

This paper proposes PASER, a post-training data selection method for recovering pruned LLMs. By utilizing manifold learning and spectral clustering to identify capability-related instruction sets and adaptively allocating data budgets based on capability degradation, PASER significantly outperforms full-data recovery using only 4%-20% of the original data.

Pedagogically-Inspired Data Synthesis for Language Model Knowledge Distillation

This paper proposes the IOA (Identifier-Organizer-Adapter) framework, which draws on Bloom’s mastery learning principles and Vygotsky’s Zone of Proximal Development (ZPD) theory. It achieves pedagogy-driven LLM knowledge distillation through three stages: diagnosing knowledge deficiencies, designing progressive curricula, and adapting to cognitive levels.

π-Flow: Policy-Based Few-Step Generation via Imitation Distillation

This paper proposes π-Flow, which modifies the output layer of a student flow model to predict a "policy." This policy performs precise ODE integration through dynamic flow velocities across multiple sub-steps within a single network evaluation. By employing an imitation distillation method to match teacher velocities on the student's own trajectories, it achieves stable and scalable few-step generation while avoiding the quality-diversity trade-off.

PiCa: Parameter-Efficient Fine-Tuning with Column Space Projection

PiCa demonstrates that projecting the fine-tuning update \(\Delta W\) onto the principal column space of pre-trained weights (the subspace spanned by top-\(r\) left singular vectors) is a theoretically supported and effective inductive bias. Building on this, it enables layers within the same functional group to share a single trainable matrix, achieving stable performance gains over SOTA methods like SVFT while using fewer parameters than rank-1 LoRA across NLP and vision tasks.

Plug-and-Play Fidelity Optimization for Diffusion Transformer Acceleration via Cumulative Error Minimization

CEM achieves high-fidelity DiT generation by modeling intrinsic "timestep × cache interval" errors offline and applying dynamic programming to find an optimal caching strategy under a given acceleration budget. It serves as a zero-overhead, plug-and-play plugin for various caching acceleration and quantization methods.

PM-KVQ: Progressive Mixed-Precision KV Cache Quantization for Long-CoT LLMs

To address the KV Cache VRAM explosion in Long Chain-of-Thought (long-CoT) reasoning models, PM-KVQ utilizes "progressive precision reduction + per-block bitwidth allocation" to maximize the VRAM budget and mitigate cumulative quantization errors. It further employs "short data + Positional Interpolation" for calibration to approximate long-sequence distributions. PM-KVQ improves reasoning benchmark accuracy by up to 8% under the same VRAM constraints while achieving 2.73–5.18× throughput compared to the 16-bit FP baseline.

Post-Training Quantization for Video Matting

This paper proposes PTQ4VM, the first post-training quantization framework specifically designed for video matting models. By utilizing a "Block-wise Initial Quantization + Global Affine Correction + Optical Flow Assistance" triad, it reduces errors by 10%–20% compared to existing PTQ methods under 4-bit settings, approaching full-precision performance while achieving an 8× reduction in computational cost.

Prune-then-Quantize or Quantize-then-Prune? Understanding the Impact of Compression Order in Joint Model Compression

When combining pruning and quantization into a single pipeline, the order of execution significantly impacts final accuracy. This paper formalizes the long-neglected problem of "compression order optimization," proposes the "Progressive Intensity Hypothesis" (weaker perturbations first, stronger perturbations later), and provides theoretical proof alongside extensive empirical support across language and vision models.

PTQ4ARVG: Post-Training Quantization for AutoRegressive Visual Generation Models

This paper proposes PTQ4ARVG, the first systematic PTQ framework tailored for AutoRegressive Visual Generation (ARVG) models. It addresses three unique quantization challenges in ARVG through Gain Projection Scaling (GPS), Static Token-wise Quantization (STWQ), and Distribution-Guided Calibration (DGC).

Q&C: When Quantization Meets Cache in Efficient Generation

This paper presents the first systematic study of the joint effects of "quantization + cache" acceleration mechanisms. It identifies that the superposition of these two techniques compromises the sample effectiveness of PTQ calibration sets and amplifies exposure bias in the sampling distribution. It proposes Temperature-Aware Parallel Clustering (TAP) to re-select calibration samples and training-free Variance Compensation (VC) to correct distribution variance, achieving up to a \(12.7\times\) speedup on DiT with almost no loss in generation quality.

QKV Projections Require a Fraction of Their Memory

PAMM (Point-Approximate Matrix Multiplication) is proposed as an activation compression technique that approximates QKV projection layer activations by randomly selecting a small number of representative tokens. It achieves up to 512× compression without compromising model performance.

Qronos: Correcting the Past by Shaping the Future... in Post-Training Quantization

Qronos is a novel Post-Training Quantization (PTQ) rounding algorithm that executes "error correction" and "error diffusion" alternately in a column-wise and element-wise manner. It not only corrects current weight/activation quantization errors but also explicitly compensates for residual errors accumulated from previously quantized layers. The paper proves an equivalent efficient implementation that reduces the peak VRAM of Llama3-8B by 18x and accelerates single-layer computation by up to 13.8x, consistently outperforming SOTA rounding methods like OPTQ, GPFQ, and GPTAQ on Llama3/Qwen3 at 4-bit and lower bit-widths.

Quant-dLLM: Post-Training Extreme Low-Bit Quantization for Diffusion Large Language Models

This paper proposes Quant-dLLM, a 2-bit weight-only post-training quantization (PTQ) framework specifically designed for Diffusion Large Language Models (dLLMs). It utilizes Masked Calibration Simulation (MCS) to align calibration data with the timestep-mask distribution of diffusion denoising, a Data-Aware arbitrary-order Quantizer (DAQ) to represent weights as an aggregation of multiple binary matrices, and Adaptive Block Mixed-Precision (ABMP) to distribute bits by importance under a strict average 2-bit budget. This improves average accuracy at 2-bit from the previous SOTA of 40.9% to 51.3%.

Quantized Gradient Projection for Memory-Efficient Continual Learning

This paper proposes QGPM, which uses quantization to compress the basis vectors in the "Gradient Projection Memory" (GPM) used for anti-forgetting in continual learning. By combining three designs, Outlier-robust Quantization (CINF), Error-Aware Gradient Projection (QEA), and Sparse Sketching Acceleration, it reduces the memory overhead to 1/4–1/6 of the original while maintaining nearly full precision (at 8 bits, accuracy drops by less than 0.5% compared to full-precision GPM).

QVLA: Not All Channels Are Equal in Vision-Language-Action Model's Quantization

QVLA identifies that directly applying "uniform bit-width quantization" from LLMs to VLA models causes collapse due to action error accumulation. It proposes a fine-grained quantization framework governed by action space sensitivity, assigning \(\{0,2,4,8,16\}\) bits (where 0 indicates pruning) to individual weight channels. On LIBERO, it allows OpenVLA-OFT to maintain a 98.9% success rate while using only 29.2% of VRAM and achieving a 1.49× speedup.

QWHA: Quantization-Aware Walsh-Hadamard Adaptation for Parameter-Efficient Fine-Tuning on Large Language Models

QWHA utilizes the Walsh-Hadamard Transform (WHT) as the transform kernel for adapters, combined with a quantization-aware initialization scheme featuring "per-channel budget allocation + maximum magnitude selection + numerical refinement." This allows Fourier-like sparse adapters to be effectively applied to low-bit quantization scenarios for the first time, achieving stable accuracy superior to LoRA-based and other FT adapters at 2–4 bits, with training speeds several times faster than existing FT adapters.

RAPID\(^3\): Tri-Level Reinforced Acceleration Policies for Diffusion Transformer

Three lightweight policy heads (Step-Skip / Cache-Reuse / Sparse-Attention) are attached to a frozen Diffusion Transformer. These heads are trained online via GRPO to decide acceleration strategies per-timestep and per-image. An adversarial discriminator is employed to prevent reward hacking, achieving approximately 3× speedup on SD3 and FLUX with almost no loss in image quality.

RCPU: Rotation-Constrained Error Compensation for Structured Pruning of Large Language Models

RCPU applies a "rotation-constrained" closed-form parameter update (Orthogonal Procrustes problem) after structured column pruning to realign the pruned subspace with the original output. This compensates for errors without destroying the geometric structure of pre-trained representations using only a small amount of calibration data. Combined with a column importance score that considers input variance, RCPU consistently outperforms baselines such as WANDA-sp and FLAP in perplexity and downstream task accuracy on Llama-7B and Llama-2-13B.
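For reference, the orthogonal Procrustes problem named above has a standard closed-form solution via the SVD. The notation is illustrative and not taken from the paper: \(A\) stands for the pruned layer's outputs on calibration data and \(B\) for the original outputs, assumed here to have matching shapes.

\[
R^\star = \arg\min_{R^\top R = I} \lVert A R - B \rVert_F,
\qquad
A^\top B = U \Sigma V^\top \;\Longrightarrow\; R^\star = U V^\top .
\]

Because \(R^\star\) is orthogonal, it preserves norms and angles, which matches the summary's point that the update compensates for pruning error without distorting the geometry of the pre-trained representations.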

REAP the Experts: Why Pruning Prevails for One-Shot MoE Compression

This paper theoretically proves that "expert merging" introduces irreducible errors due to the loss of the router's independent, input-dependent modulation capability over experts. Consequently, it proposes REAP—a one-shot pruning criterion that simultaneously considers router gating values and expert activation norms. REAP significantly outperforms merging and other pruning methods across various SMoEs ranging from 20B to 1T, particularly in generative tasks and at a 50% compression rate, achieving near-lossless performance on Qwen3-Coder-480B and Kimi-K2 for code generation.

Reasoning Models Can be Accurately Pruned Via Chain-of-Thought Reconstruction

The authors observe that applying standard LLM pruning methods (e.g., SparseGPT) directly to long Chain-of-Thought (CoT) reasoning models like DeepSeek-R1 leads to significant performance degradation and even slower inference. The root cause is that these methods only use "input prompts" for calibration, whereas reasoning is a "decode-dominated" task. They propose RAC (Reasoning-Aware Compression), which enables the model to self-generate CoTs during calibration and incorporates these on-policy activations into the reconstruction objective. As a plug-and-play patch, RAC allows SparseGPT to maintain approximately 95% of dense model accuracy at 50% sparsity.

Reassessing Layer Pruning in LLMs: New Insights and Methods

This paper invests thousands of GPU hours to systematically reassess "layer selection metrics" and "post-pruning fine-tuning methods" for LLM layer pruning. It reaches two counter-intuitive conclusions: the simplest "reverse-order pruning of the last few layers" outperforms various complex metrics, and "partial layer fine-tuning of only the lm head + last 3 layers" exceeds the standard LoRA. The authors provide a theoretical explanation using Pre-LN gradient flow, ultimately outperforming existing SOTA pruning methods by 2.36%–19.45% on Llama-3.1-8B, Llama-3-8B, and Llama-3-70B.

Rectified Decoupled Dataset Distillation: A Closer Look for Fair and Comprehensive Evaluation

This paper proposes RD3 (Rectified Decoupled Dataset Distillation), systematically revealing that performance gaps in existing decoupled dataset distillation methods stem primarily from inconsistent post-evaluation settings rather than distillation quality. It establishes a unified fair evaluation framework, correcting the reported 27.3% performance gap to 6.7%.

Rejuvenating Cross-Entropy Loss in Knowledge Distillation for Recommender Systems

The paper proves that maximizing the NDCG lower bound via CE loss in RecSys KD requires the "Closure Assumption" (the subset must contain the student's top items). However, the actual goal is to distill the ranking of the teacher's top items, and this conflict causes vanilla CE to perform poorly. The paper therefore proposes RCE-KD, which splits the teacher's top-K items into two groups based on whether they are in the student's top-K and applies exact CE and sampled approximate closure CE to them respectively, with adaptive fusion weights that adjust dynamically during training.

Rethinking Continual Learning with Progressive Neural Collapse

This paper proposes the ProNC framework, which balances maximum inter-class separation and minimum forgetting in continual learning by progressively expanding Equiangular Tight Frame (ETF) targets instead of using fixed, predefined ETFs.

Rethinking Residual Errors in Compensation-based LLM Quantization

This paper revisits the column-level calibration objectives of "column-wise quantization + compensated residual weights" methods like GPTQ / GPTAQ. It points out that these methods erroneously treat the "output of compensated weights" as the alignment baseline. Consequently, it derives a missing residual term, the Compensation-aware Error (CAE), and integrates it efficiently into weight update formulas using the neuron decomposition from GPTAQ. This modification, which requires almost zero structural changes to GPTQ / GPTAQ, consistently improves perplexity and downstream accuracy in 2–3 bit quantization.

Revisiting Weight Regularization for Low-Rank Continual Learning

This work reintroduces Elastic Weight Consolidation (EWC) into low-rank continual learning by estimating the Fisher Information Matrix in the full-dimensional space to regularize shared LoRA modules, achieving effective forgetting mitigation with constant storage overhead.

Robust Selective Activation with Randomized Temporal K-Winner-Take-All in Spiking Neural Networks for Continual Learning

To address catastrophic forgetting in Spiking Neural Networks (SNNs) for continual learning, this paper upgrades the traditional rate-based deterministic K-WTA to "Randomized Temporal K-WTA (RTK-WTA)". By ranking neurons based on their temporal traces rather than instantaneous firing rates and injecting controlled randomness \(\alpha\) into Top-K selection, the method expands the effective feature space and increases inter-class margins, achieving a 3.07–10.05% improvement over deterministic K-WTA on splitMNIST/splitCIFAR100.

Robust Training of Neural Networks at Arbitrary Precision and Sparsity

This paper argues that the instability in ultra-low-bit quantization training stems not from "non-differentiability" but from the fact that Straight-Through Estimator (STE) backpropagation is blind to quantization errors. By redefining quantization as additive noise and employing a denoising dequantization transform \(g\) derived from ridge regression, the authors explicitly reintegrate errors into the gradient path, enabling stable training of A1W1 and even sub-1-bit networks under standard training recipes.

S2R-HDR: A Large-Scale Rendered Dataset for HDR Fusion

This paper proposes S2R-HDR, the first large-scale high-quality synthetic HDR fusion dataset (24,000 samples), and designs S2R-Adapter, a domain adaptation method to bridge the synthetic-to-real gap, achieving SOTA HDR fusion performance on real-world datasets.

SAES-SVD: Self-Adaptive Suppression of Accumulated and Local Errors for SVD-based LLM Compression

SAES-SVD explicitly incorporates a "full-precision reference output alignment" cumulative error compensation term into the objective of layer-wise SVD low-rank compression. It derives a closed-form solution relying only on second-order activation statistics and adaptively selects the optimal compensation weight for each layer. This ensures the compressed model's output remains close to the full-precision baseline—reducing the average accuracy drop of LLaMA-7B at a 0.2 compression ratio from >0.05 to approximately 0.02, without requiring fine-tuning or mixed-rank allocation.

SAFA-SNN: Sparse-aware Fast Adaptive Spiking Neural Network for On-device Few-Shot Class-Incremental Learning

This paper proposes SAFA-SNN, which utilizes a "sparse-aware dynamic threshold + zero-order optimization + prototype orthogonal subspace projection" toolkit to enable Spiking Neural Networks (SNNs) to perform Few-Shot Class-Incremental Learning (FSCIL) on resource-constrained edge devices. It achieves a 4.01% higher accuracy than the second-best method in the final session of Mini-ImageNet and reduces training energy consumption by approximately 20% on CIFAR-100.

Scaling Reasoning Hop Exposes Weaknesses: Demystifying and Improving Hop Generalization in Large Language Models

This work systematically reveals the internal mechanism behind Large Language Model (LLM) failures in reasoning hop generalization—specifically, the competition between attention heads driving correct versus erroneous reasoning trajectories. The authors propose TCR (Test-time Correction of Reasoning), which dynamically identifies and deactivates erroneous processing heads (ep heads) during inference, achieving an average accuracy improvement of 5-7%.

ScalingCache: Extreme Acceleration of DiTs through Difference Scaling and Dynamic Interval Caching

ScalingCache is a training-free DiT inference acceleration framework. By offline estimation of a "difference scaling coefficient \(\alpha\)", it adaptively fuses zero-order (direct reuse) and first-order (linear extrapolation) cached features. Combined with a runtime dynamic caching interval strategy, it achieves approximately 2.5× acceleration on Wan2.1 and HunyuanVideo with only a 0.5% drop in VBench, and 3.1× near-lossless acceleration on FLUX.
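
As a rough illustration of how a single coefficient can interpolate between direct reuse and linear extrapolation, consider the sketch below. The function name `fuse_cached` and the exact fusion formula are assumptions made for illustration; the summary above does not state the formula ScalingCache uses.

```python
import numpy as np

def fuse_cached(prev, prev2, alpha):
    """Blend zero-order and first-order predictions of the next feature.

    prev  : feature cached at the most recent computed step
    prev2 : feature cached at the step before that
    alpha : scaling coefficient (hypothetical usage);
            alpha = 0 reuses `prev` unchanged (zero-order),
            alpha = 1 gives the linear extrapolation 2*prev - prev2 (first-order).
    """
    prev = np.asarray(prev, dtype=float)
    prev2 = np.asarray(prev2, dtype=float)
    return prev + alpha * (prev - prev2)
```

Values of `alpha` between 0 and 1 damp the extrapolated difference, which is one plausible reading of "difference scaling".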

Sculpting Subspaces: Constrained Full Fine-Tuning in LLMs for Continual Learning

This paper proposes OSFT (Orthogonal Subspace Fine-Tuning): performing SVD on each layer's weights, freezing the "high-rank subspace" corresponding to large singular values as old knowledge, and performing full-parameter updates only in the orthogonal "low-rank subspace." This enables continual learning of new tasks with fixed parameters and no task-gradient storage, achieving near-zero forgetting. It outperforms O-LoRA by 1.7 points on a 15-task benchmark and improves average accuracy on TRACE by approximately 7 points.
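
The idea of updating only in the subspace orthogonal to the dominant singular directions can be sketched with a two-sided projection of the gradient. This is a minimal sketch under assumptions: the function name `project_update`, the choice of `k`, and the two-sided projector are illustrative and not taken from the paper.

```python
import numpy as np

def project_update(W, G, k):
    """Remove from gradient G every component that touches the top-k
    singular directions of W, so a step along the result leaves the
    top-k left and right singular subspaces of W untouched.

    W : weight matrix, shape (m, n)
    G : gradient of the loss w.r.t. W, shape (m, n)
    k : number of leading singular directions to freeze (assumed given)
    """
    U, _, Vt = np.linalg.svd(W, full_matrices=False)
    U_k = U[:, :k]            # leading left singular vectors
    V_k = Vt[:k, :].T         # leading right singular vectors
    P_left = np.eye(W.shape[0]) - U_k @ U_k.T
    P_right = np.eye(W.shape[1]) - V_k @ V_k.T
    return P_left @ G @ P_right
```

A training step would then apply `W -= lr * project_update(W, G, k)`; how `k` is chosen per layer is left open here.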

SeeDNorm: Self-Rescaled Dynamic Normalization

SeeDNorm is proposed as an adaptive dynamic normalization layer that dynamically adjusts the scaling coefficient by using the input itself as a condition. This preserves input norm information during the forward pass while maintaining RMSNorm-like adaptive gradient adjustment capabilities during backpropagation. It consistently outperforms RMSNorm, LayerNorm, and DyT across language modeling and vision tasks with minimal additional parameters.

SERE: Similarity-based Expert Re-routing for Efficient Batch Decoding in MoE Models

The SERE method is proposed to dynamically re-route secondary experts to the most similar primary experts during batch decoding by pre-computing an expert similarity matrix. This achieves up to 2.0× acceleration with minimal quality loss and provides a plug-and-play CUDA kernel for vLLM.

SERQ: Saliency-Aware Low-Rank Error Reconstruction for LLM Quantization

SERQ unifies activation outliers and weight saliency into a single low-rank compensation matrix. Through three steps—static activation flattening, saliency-aware error reconstruction, and offline weight permutation—linear layers achieve a pure 4-bit end-to-end computation path under W4A4. It outperforms previous LoRA-style reconstruction and rotation-based methods in accuracy while adding negligible inference latency.

SFT Doesn't Always Hurt General Capabilities: Revisiting Domain-Specific Fine-Tuning in LLMs

This paper systematically revisits the impact of domain-specific SFT on the general capabilities of LLMs. It finds that using a smaller learning rate can significantly mitigate general capability degradation, and proposes the Token-Adaptive Loss Reweighting (TALR) method to further optimize the trade-off between domain adaptation and general capabilities by adaptively down-weighting the loss of low-probability tokens.

SGD-Based Knowledge Distillation with Bayesian Teachers: Theory and Guidelines

This paper views Knowledge Distillation (KD) from a Bayesian perspective as "supervision using Bayes Class Probabilities (BCP) instead of one-hot labels." It rigorously analyzes the convergence behavior of students trained with SGD, proving that learning from precise BCP eliminates the "neighborhood term" in the convergence bound (zeroing variance and allowing larger learning rates). Based on this, it provides a practical guideline: use better-calibrated Bayesian teachers for distillation. Experiments show student accuracy improvements of up to +4.27% and convergence noise reduction of up to 30%.

Shift-and-Sum Quantization for Visual Autoregressive Models

This paper proposes Shift-and-Sum quantization and calibration data resampling for Visual Autoregressive (VAR) models. The former specifically reduces errors in attention-value products for high-attention value tokens, while the latter aligns the VQ-VAE codebook sampling frequency in small calibration sets with the model's prediction probabilities. The method consistently outperforms BRECQ and LiteVAR on low-bit VAR and Infinity generation tasks.

SLA: Beyond Sparsity in Diffusion Transformers via Fine-Tunable Sparse–Linear Attention

The authors discover that attention weights in Diffusion Transformers can be decomposed into a "small number of high-rank + large number of extremely low-rank" components. They propose SLA—applying precise sparse attention to critical blocks, linear attention to marginal blocks, and skipping negligible blocks. Integrated into a single GPU kernel and requiring only a few thousand fine-tuning steps, SLA reduces attention computation by approximately 95% and achieves a 2.2× end-to-end acceleration in video generation with nearly zero quality loss.

SliderQuant: Accurate Post-Training Quantization for LLMs

SliderQuant observes that shallow/deep layers (especially the first and last) of LLMs are significantly more sensitive to quantization than middle layers. It proposes an adaptive sliding quantization framework incorporating "inter-layer sliding windows (progressive expansion for shallow, fixed for middle, progressive contraction for deep) + intra-layer incremental quantization." This approach significantly outperforms existing PTQ methods like GPTQ, OmniQuant, and CBQ under extremely low-bit settings such as W4A4 and W2A16.

SMixer: Rethinking Efficient-Training and Event-Driven SNNs

Addressing the dilemma where high-performance Spiking Neural Network (SNN) architectures are not truly event-driven and suffer from high training overheads, this paper proposes the Spiking-token Mixer (SMixer) backbone for deployment on asynchronous chips, combined with a zero-parameter Dynamic Spatial-Temporal Spiking Pruning (DSTSP) framework. This approach reduces training memory and energy consumption by approximately half while maintaining accuracy.

SPARTA: Scalable and Principled Benchmark of Tree-Structured Multi-hop QA over Text and Tables

This paper proposes SPARTA, an end-to-end framework for automatically constructing large-scale table-text multi-hop QA benchmarks. It generates high-quality nested SQL queries by referencing fact databases, utilizing provenance-based refinement and realistic structural constraints. SOTA models show an F1 drop of over 30 points on SPARTA.

SPR\(^2\)Q: Static Priority-based Rectifier Routing Quantization for Image Super-Resolution

SPR\(^2\)Q targets ultra-low bit Post-Training Quantization (PTQ) for image super-resolution models. By learning a set of low-rank rectifiers to compensate for weight increments before quantization and merging the optimal increments into layer weights via offline static priority routing, it significantly mitigates detail recovery loss in MambaIRv2-light under 4-bit, 2-bit, and even 1-bit settings without increasing inference overhead.

SSDi8: Accurate and Efficient 8-bit Quantization for State Space Duality

SSDi8 is the first post-training quantization framework specifically designed for the Mamba-2 State Space Duality (SSD) module. By employing a "sparsity-aware restructuring + persistent INT8 state path + dimension-decomposition-aware channel quantization + mean correction" suite, it maintains near-FP16 accuracy under W8A8 / W4A8 while accelerating SSD inference by up to 1.4×.

Stable-LoRA: Stabilizing Feature Learning of Low-Rank Adaptation

This paper analyzes LoRA training dynamics from the perspective of feature learning stability, proving that LoRA can be "self-stabilizing" under appropriate hyperparameters and initialization, while the commonly used non-zero initialization \(A_0\) disrupts this stability in the long term. Consequently, Stable-LoRA is proposed—applying exponential shrinkage to \(A\) during the initial training steps to retain the benefits of non-zero initialization while eliminating the induced instability, consistently outperforming baselines like AdamW across multiple models and tasks with almost no additional memory or computation.

STaMP: Sequence Transformation and Mixed Precision for Low-Precision Activation Quantization

STaMP proposes a reversible linear transformation along the sequence dimension (utilizing DCT/wavelets to compact activation energy into a few tokens) and then allocates higher bit-widths to these high-energy tokens. This significantly reduces activation quantization errors under a fixed average bit budget. It is orthogonally complementary to existing feature-dimension transformations (Hadamard/QuaRot), providing plug-and-play improvements for W4A4 quantization in LLMs and LVMs.

STAR: Similarity-guided Teacher-Assisted Refinement for Super-Tiny Function Calling Models

The STAR framework is proposed to synergize Constrained Knowledge Distillation (CKD) and Similarity-guided Reinforcement Learning (Sim-RL). It effectively transfers the function calling capabilities of large models to super-tiny 0.6B models, significantly outperforming baselines on BFCL and ACEBench.

Steering MoE LLMs via Expert (De)Activation

SteerMoE is proposed to detect behavior-associated experts via contrastive paired inputs and steer the behavior of MoE LLMs by activating or deactivating specific experts during inference (+20% safety, +27% faithfulness). The study also reveals a unique safety alignment vulnerability in MoE models (safety drop -100%).

Study of Training Dynamics for Memory-Constrained Fine-Tuning (TraDy)

Addressing the issue where edge devices are extremely memory-constrained and cannot perform full backpropagation, this paper utilizes three observations regarding fine-tuning training dynamics (heavy-tailed gradients, architecture-driven layer importance, and task-dependent channel importance) to decompose "where to update" into two steps: offline layer selection and online dynamic channel selection. The proposed TraDy randomly resamples input channels within pre-selected high-importance layers every epoch to approximate the full gradient under strict memory budgets. It achieves up to 99% activation sparsity, 95% weight derivative sparsity, and a 97% reduction in backward FLOPs, with accuracy even surpassing the deterministic oracle.

SumRA: Parameter Efficient Fine-Tuning with Singular Value Decomposition and Summed Orthogonal Basis

SumRA compresses all singular vectors obtained from SVD of pre-trained weights into the LoRA down-projection matrix \(A\) in a "disjoint and load-balanced" manner. By freezing \(A\) and training only the up-projection matrix \(B\), it halves the trainable parameters and enables cross-task sharing of \(A\), reducing the WER of Whisper adapted to five new languages from 14.42% (LoRA) to 12.41%.

SwiReasoning: Switch-Thinking in Latent and Explicit for Pareto-Superior Reasoning

This paper proposes SwiReasoning, a training-free LLM reasoning framework that dynamically switches between explicit (chain-of-thought) and implicit (latent space) reasoning modes through block-level confidence estimation based on entropy trends. It simultaneously improves accuracy (+1.8% to 3.1%) and token efficiency (+57% to 79%) in a Pareto-superior manner.

Taming Momentum: Rethinking Optimizer States Through Low-Rank Approximation

This work reveals that momentum EMA updates are equivalent to gradient descent for online linear regression. Based on this insight, the authors propose LoRA-Pre, which compresses optimizer momentum through low-rank decomposition to achieve memory-efficient LLM pre-training and fine-tuning, reaching optimal performance across all model scales with only 1/8 the rank of baseline methods.
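
One way to see the stated equivalence, sketched under the simplest assumption of a per-step quadratic loss (the symbols \(g_t\), \(m_t\), \(\beta\), and \(\mathcal{L}_t\) are introduced here for illustration; the paper's exact regression formulation may differ): let \(g_t\) be the gradient at step \(t\) and treat the momentum \(m\) as the parameter of the loss

\[
\mathcal{L}_t(m) = \tfrac{1}{2}\lVert m - g_t \rVert^2, \qquad \nabla_m \mathcal{L}_t(m) = m - g_t .
\]

A single gradient-descent step on \(\mathcal{L}_t\) with step size \(1-\beta\), starting from \(m_{t-1}\), gives

\[
m_t = m_{t-1} - (1-\beta)\,(m_{t-1} - g_t) = \beta\, m_{t-1} + (1-\beta)\, g_t ,
\]

which is exactly the EMA momentum update.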

TD-MoE: Tensor Decomposition for MoE Models

TD-MoE stacks all expert weights in an MoE layer into a three-dimensional tensor for joint Tucker decomposition, combined with multilinear whitening and adaptive 3D rank allocation. This captures "inter-expert structural redundancy" ignored by expert-wise methods, achieving nearly lossless performance at 20% compression and outperforming SVD-based SOTA by 11%~14% at 40%/60% compression.

Tequila: Trapping-free Ternary Quantization for Large Language Models

Addressing the issue in ternary quantization (compressing weights to \(\{-1, 0, +1\}\)) where many weights become "trapped" at the deadzone boundary and fail to receive effective gradients, this paper proposes Tequila. This method reactivates these "dead weights" as differentiable dynamic biases, allowing them to contribute signals in the forward pass and receive direct gradients in the backward pass. This improves accuracy on ARC by >4% over SOTA ternary methods with near-zero inference overhead, approaching full-precision performance (gap <1%) while achieving \(3.0\times\) inference acceleration.
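
To make the "deadzone" concrete, the sketch below shows a plain threshold-based ternary quantizer of the kind the summary describes as the starting point. It is not Tequila itself: the function name `ternarize` and the threshold rule (a fixed ratio times the mean absolute weight) are assumptions chosen for illustration.

```python
import numpy as np

def ternarize(w, ratio=0.7):
    """Map weights to {-1, 0, +1} with a symmetric deadzone.

    Weights with |w| <= delta fall into the deadzone and become 0;
    the rest keep only their sign. `ratio` is an assumed heuristic.
    """
    w = np.asarray(w, dtype=float)
    delta = ratio * np.mean(np.abs(w))       # deadzone half-width
    q = np.sign(w) * (np.abs(w) > delta)     # 0 inside the deadzone
    return q, delta
```

Weights that sit near `delta` flip between 0 and a sign under tiny updates, which is the regime where the summary says weights become trapped.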

Textual Equilibrium Propagation for Deep Compound AI Systems

This paper proposes Textual Equilibrium Propagation (TEP), an optimization method for compound AI systems based on local learning principles. Through a two-phase design consisting of a Free Phase and a Nudged Phase, it avoids the gradient explosion/vanishing problems inherent in global textual backpropagation, significantly outperforming TextGrad on deep workflows.

The Curious Case of In-Training Compression of State Space Models

This paper proposes COMPRESSM, which introduces "Balanced Truncation + Hankel Singular Value (HSV) analysis" from control theory into the training process of SSMs. By identifying and discarding state dimensions with low contribution to input-output mapping early in training, the model "starts big and shrinks during training," accelerating training while preserving critical structures that are lost when training small models from scratch.

The Geometry of LLM Quantization: GPTQ as Babai's Nearest Plane Algorithm

This paper provides the first proof that GPTQ (when executed back-to-front) is mathematically equivalent to the classical Babai's Nearest Plane algorithm in lattice theory. This equivalence yields a geometric interpretation and a layer-wise error upper bound, based on which an improved unclipped quantization method is designed.

The Lattice Geometry of Neural Network Quantization -- A Short Equivalence Proof of GPTQ and Babai's Algorithm

Independent of Chen et al. (2026), this work provides a more concise and elegant proof that GPTQ is equivalent to Babai's nearest plane algorithm, elucidating the potential of lattice basis reduction to improve neural network quantization.

The Unseen Frontier: Pushing the Limits of LLM Sparsity with Surrogate-Free ADMM

The Elsa method is proposed to directly solve the sparsity-constrained problem through ADMM-based constrained optimization without surrogate objectives. It breaks the 50-60% "sparsity wall" bottleneck in LLM pruning, maintaining high model fidelity even at 90% sparsity.

Thicker and Quicker: A Jumbo Token for Fast Plain Vision Transformers

To make small-scale plain ViTs both fast and accurate, this paper replaces the original CLS token with a "Jumbo token" that is \(J\) times wider than patch tokens and equips it with a cross-layer shared, dedicated wide FFN. This supplements global representation capacity with almost no added computation or memory overhead—achieving a 13% improvement over ViT+Registers at the ImageNet-1K Nano scale while maintaining full compatibility with the plain ViT ecosystem (MAE, SAR, segmentation heads, multi-modality, and time series).

TiTok: Transfer Token-level Knowledge via Contrastive Excess to Transplant LoRA

This paper proposes the TiTok framework, which achieves efficient LoRA adapter transfer across models through token-level contrastive excess scores. It requires no additional discriminator models and consistently outperforms TransLoRA and knowledge distillation baselines in reasoning and personalization tasks.

To Compress or Not? Pushing the Frontier of Lossless GenAI Model Weights Compression with Exponent Concentration

This paper discovers the "exponent concentration" (low entropy) phenomenon in post-training GenAI weights. It theoretically proves bounded exponent entropy via \(\alpha\)-stable distribution theory, corresponding to a compression limit of approximately FP4.67. Based on this, the authors design ECF8, a lossless FP8 compression framework (entropy-aware Huffman coding + GPU parallel decoding + just-in-time tensor management), achieving up to 26.9% memory savings and 177.1% throughput improvement on LLMs and DiTs with up to 671B parameters, maintaining zero bit-wise deviation in output.

Token Distillation: Attention-Aware Input Embeddings for New Tokens

This paper proposes the Token Distillation method, which distills multi-subword interaction information encoded by Transformer layers into a single token embedding. This achieves high-quality initialization for new token embeddings without pre-training hypernetworks and outperforms existing methods.

Topology and Geometry of the Learning Space of ReLU Networks: Connectivity and Size

This work systematically investigates the connectivity and singularity of the parameter space for feedforward ReLU networks based on general Directed Acyclic Graph (DAG) architectures from the perspectives of algebraic geometry and algebraic topology. It reveals the critical roles of bottleneck nodes and balance conditions in determining the topology of the parameter space and establishes a theoretical link between singularities and differentiable pruning.

Towards Efficient Constraint Handling in Neural Solvers for Routing Problems

The proposed Construct-and-Refine (CaR) framework implements efficient feasibility restoration through the joint training of a construction module and a lightweight refinement module. This provides the first general and efficient neural constraint handling solution for hard-constrained routing problems, significantly outperforming classical and neural SOTA solvers on TSPTW and CVRPBLTW.

Towards Lossless Memory-efficient Training of Spiking Neural Networks via Gradient Checkpointing and Spike Compression

Addressing the \(O(LT)\) memory explosion issue in Spiking Neural Networks (SNNs) during direct training with BPTT, this work packages "layer-wise gradient checkpointing + lossless binary spike compression + multi-stage checkpoint adjustment" into an automatic optimization pass. It reduces peak memory to 0.12×–0.47× without accuracy loss and with less than 20% slowdown.
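
Because spikes are binary, they can be stored at one bit each and restored exactly. The sketch below illustrates that idea with NumPy bit packing; the helper names `compress_spikes` and `decompress_spikes` are hypothetical, and the paper's actual compression scheme may differ.

```python
import numpy as np

def compress_spikes(spikes):
    """Pack a {0, 1} spike tensor into bytes (8 spikes per byte).

    Returns the packed uint8 array and the original shape, which is
    needed to undo the padding added by packbits.
    """
    spikes = np.asarray(spikes, dtype=np.uint8)
    return np.packbits(spikes.ravel()), spikes.shape

def decompress_spikes(packed, shape):
    """Exactly recover the spike tensor produced by compress_spikes."""
    n = int(np.prod(shape))
    return np.unpackbits(packed, count=n).reshape(shape)
```

In a checkpointing setup, the packed array would be what is kept between the forward and backward passes, with `decompress_spikes` called just before the gradient computation.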

Towards Quantization-Aware Training for Ultra-Low-Bit Reasoning LLMs

To address the issue where ultra-low-bit (\(\le\) 2 bit) quantization severely degrades reasoning capabilities, this paper proposes a two-stage QAT pipeline for reasoning LLMs. The first stage performs block-wise quantization calibration using "80% Reasoning + 20% Pre-training" mixed-domain data, and the second stage employs a teacher-guided reward rectification loss for fine-tuning. The 2-bit quantized Qwen3-8B outperforms the PTQ baseline by an average of 50.45% across five reasoning benchmarks.

Towards Reliable Benchmarking: A Contamination Free, Controllable Evaluation Framework for Multi-step LLM Function Calling

This paper proposes the FuncBenchGen framework, which models multi-step function calling as a DAG traversal problem to achieve data contamination-free evaluation with fine-grained task difficulty control. It reveals critical failure modes of reasoning models under long calling chains and connected distracting functions.

TP-Spikformer: Token Pruned Spiking Transformer

To address the high deployment overhead of Spiking Transformers, this paper proposes TP-Spikformer, a training-free and architecture-invariant token pruning method. It employs a neuroscience-inspired "Information Retention-driven Token Pruning" (IRToP) criterion to score tokens and a "Block-level Early-stopping Architecture" (IR-Arc) that allows unimportant tokens to skip subsequent computations instead of being deleted. Without fine-tuning, it achieves up to a 48% reduction in computational cost on ImageNet across multiple architectures and tasks, with only a 0.5–1.5% decrease in accuracy.

TRAC: Tensor-Train Based Across-Layer Compression for Parameter-Efficient Fine-Tuning

TRAC reformulates LoRA's low-rank incremental matrices \(A\) and \(B\) into Tensor-Train (TT) core sequences. By employing a strategy of "freezing/sharing specific cores across layers + restoring inter-layer flexibility via lightweight vector controllers," it reduces trainable parameters to an order of magnitude smaller than LoRA (20× on LLaMA2-13B, 14× on ViT-Large) while maintaining or exceeding LoRA's performance across NLU, NLG, common sense/mathematical reasoning, and image classification tasks.

Training Dynamics Impact Post-Training Quantization Robustness

The authors systematically measured GPTQ post-training quantization (PTQ) errors across open-source large model training trajectories (up to 32B parameters and 15T tokens). They found that the surge in quantization error is driven by training dynamics, such as learning rate decay, rather than increased training data volume. Accordingly, they propose two types of interventions—maintaining a larger learning rate and performing weight averaging along the trajectory—which significantly improve quantization robustness without sacrificing precision. These findings are unified through an explanation based on loss surface flatness (curvature/Hessian).

TurboBoA: Faster and Exact Attention-aware Quantization without Backpropagation

TurboBoA proposes a backpropagation-free post-training quantization (PTQ) method for LLMs. By introducing three innovations—multi-out-channel joint quantization, preceding layer error compensation, and adaptive grid selection—it achieves a speedup of over 3x while maintaining the accuracy advantages of BoA.

TurboQuant: Online Vector Quantization with Near-Optimal Distortion Rate

TurboQuant is a data-oblivious (requires no calibration, ready for online use) vector quantization algorithm: it first applies a random rotation to "transform" the coordinates of any input vector into nearly independent Beta distributions, then applies pre-solved Max-Lloyd optimal scalar quantizers to each coordinate. This keeps MSE distortion within a constant factor (\(\approx 2.7\)) of the information-theoretic lower bound across all bitrates and dimensions. To address bias in inner product estimation, it adds a 1-bit QJL layer to process residuals for unbiased estimation, outperforming existing product quantization in KV cache compression and ANN retrieval.

Understanding Dataset Distillation via Spectral Filtering

This paper proposes the UniDD spectral filtering framework, which unifies various dataset distillation methods as applying different filtering functions to the Feature-Feature Correlation (FFC) matrix to match the frequency information of the Feature-Label Correlation (FLC) matrix. Based on this insight, the authors introduce Curriculum Frequency Matching (CFM).

UniFlow: A Unified Pixel Flow Tokenizer for Visual Understanding and Generation

A universal unified tokenizer, UniFlow, is proposed. It preserves semantic understanding through hierarchical adaptive self-distillation and achieves high-fidelity reconstruction using a lightweight patch-wise pixel flow decoder. It achieves a win-win in understanding and generation across 13 benchmarks; the 7B UniFlow-XL outperforms the 14B TokenFlow-XL by 6.05% using 40% less data.

UniQL: Unified Quantization and Low-Rank Compression for Adaptive Edge LLMs

UniQL unifies post-training quantization and structured low-rank pruning into a "compute once in cloud, trim as needed on edge" pipeline. By employing pseudo-inverse-free weight ordering, quantization-aware SVD, and state-aware ordering, it enables Transformer, SSM, and hybrid models to configure 0–35% pruning rates in real-time on-device based on system load. After a single compression, the method achieves 4×–5.7× memory compression and 2.7×–3.4× throughput improvement while maintaining accuracy close to the original model.

UNITE: Universal Knowledge Integration from Task-Specific Experts

Addressing the issues of "fragmented and cross-layer redundant" expert knowledge in Mixture-of-Experts (MoE) models, UNITE first uses Fisher-information-weighted fusion to merge multiple experts per layer into a single expert, then applies Tucker decomposition to decouple cross-layer shared low-rank input/output subspaces (as "universal knowledge / learngene") from layer-specific coefficients. Finally, this set of shared subspaces is extracted once and reassembled repeatedly to construct lightweight target models of any depth, outperforming random initialization baselines by over +6% on reasoning tasks with only a fraction of the parameters compared to compression baselines.

Unveiling Super Experts in Mixture-of-Experts Large Language Models

This paper identifies and systematically investigates "Super Experts" (SE) in MoE LLMs—an extremely small subset of experts crucial for model inference, which drive massive activations and attention sinks through extreme activation outliers in down_proj.

Vulcan: Tailoring Compact Class-Specific Vision Transformers for Edge Intelligence

Vulcan discovers that the Feed-Forward Network (FFN) in a ViT stores "class-specific knowledge" while the Multi-Head Attention (MHA) stores "class-agnostic patterns." It proposes a "train-then-prune" post-training method that collapses FFN neurons toward high-activation anchor neurons and uses Truncated Nuclear Norm Regularization (TNNR) to compress MHA projection matrices into low-rank structures. This approach yields compact edge ViTs (20%–40% of the original size) that recognize target classes with nearly lossless accuracy, sometimes outperforming the original ViT on specific classes by up to 15.12%.

What Layers When: Learning to Skip Compute in LLMs with Residual Gates

GateSkip is proposed—inserting a sigmoid-linear gate at the output of each Attention/MLP branch in a decoder-only Transformer. During fine-tuning, gate sparsity and the language modeling objective are jointly learned. During inference, low-importance tokens are deterministically skipped based on a quantile threshold of gate values, achieving token-level layer-wise adaptive depth. On Llama 8B, it saves 15% computation while maintaining >90% accuracy. For instruction-tuned models, full computation actually improves accuracy, and ~50% savings still match the baseline. It is orthogonal and combinable with INT4 quantization, structured pruning, and self-speculative decoding.

Why Attention Patterns Exist: A Unifying Temporal Perspective Analysis

This paper proposes the TAPPA framework, which provides a unified explanation for the formation mechanisms of various attention patterns in LLMs (attention sink, diagonal, periodicity, etc.) from a temporal continuity perspective. It introduces the query self-similarity (q-similarity) metric to guide KV cache compression and model pruning tasks.

WINA: Weight Informed Neuron Activation for Accelerating Large Language Model Inference

WINA incorporates "weight column norm" and "hidden state magnitude" into the gating criterion for training-free sparse activation. By selecting top-K neurons using \(|x_i \cdot c_i|\) instead of merely \(|x_i|\), it provides a theoretically tighter approximate error upper bound. It achieves higher accuracy than TEAL/CATS/R-Sparse at identical sparsity levels on Llama/Mistral/Phi-4, with significant advantages at extreme sparsity levels such as 65%.
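
The selection rule can be sketched as follows, assuming \(c_i\) is the L2 norm of the weight column that multiplies \(x_i\). The function names `wina_mask` and `sparse_matvec` are hypothetical, and this is an illustration of the criterion rather than the authors' implementation.

```python
import numpy as np

def wina_mask(x, W, k):
    """Boolean mask keeping the k inputs with the largest |x_i * c_i|.

    x : hidden state, shape (d_in,)
    W : weight matrix, shape (d_out, d_in); column i multiplies x_i
    k : number of neurons to keep
    """
    col_norms = np.linalg.norm(W, axis=0)   # c_i, one norm per column
    scores = np.abs(x) * col_norms          # |x_i * c_i|
    keep = np.argsort(scores)[-k:]          # indices of the k largest scores
    mask = np.zeros(x.shape, dtype=bool)
    mask[keep] = True
    return mask

def sparse_matvec(x, W, k):
    """Approximate W @ x using only the selected inputs."""
    mask = wina_mask(x, W, k)
    return W[:, mask] @ x[mask]
```

With `x = [1.0, 0.5, 0.1]` and column norms `[0.1, 10.0, 1.0]`, a magnitude-only rule keeps input 0, while the weighted score keeps input 1, whose contribution to the output is larger.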

WSVD: Weighted Low-Rank Approximation for Fast and Efficient Execution of Low-Precision Vision-Learning Models

WSVD replaces traditional SVD performed on the entire K/V projection matrix with a "per-head" SVD approach, restores precision through Fisher importance-weighted fine-tuning, adds W8A8 quantization on top, and implements a Triton operator that fuses low-rank reconstruction directly into Flash Decoding. This achieves a real-world decoding speedup of over 1.8× compared to Flash Decoding for Vision-Language Models (VLMs) with almost no loss in accuracy.

Zeros Can Be Informative: Masked Binary U-Net for Image Segmentation on Tensor Cores

The authors observe that adding an explicit "zero" state to the weights of a binary U-Net allows sparsity to reach 90%+ while significantly recovering accuracy. Consequently, they propose MBU-Net, which selects key layers for zero-masking based on "cost-effectiveness" and maps these masked binary weights directly to GPU binary Tensor Cores (BMMA) using a "subtractive bit encoding" scheme. It achieves near full-precision accuracy (average 3% drop) while providing a 2.04× speedup and 3.54× energy reduction compared to FP16 U-Net across three segmentation datasets.