
📦 Model Compression

🧪 ICML2026 · 117 paper notes

📌 Same area in other venues: 📷 CVPR2026 (108) · 🔬 ICLR2026 (240) · 💬 ACL2026 (59) · 🤖 AAAI2026 (60) · 🧠 NeurIPS2025 (143) · 📹 ICCV2025 (52)

🔥 Top topics: Model Compression ×26 · LLM ×19 · Compression ×15 · Reasoning ×6 · Diffusion Models ×5

A Language-Guided Bayesian Optimization for Efficient LoRA Hyperparameter Search: The paper converts LoRA hyperparameter configurations into text with domain explanations, using a frozen LLM, learnable tokens, and a projection layer to construct a continuous search space for Bayesian Optimization (BO). By employing 10% of the data for proxy evaluation to reduce trial costs, it significantly outperforms default LoRA configurations and conventional HPO methods within approximately 30 search iterations.
A Queueing-Theoretic Framework for Stability Analysis of LLM Inference with KV Cache Memory Constraints: The paper establishes the first queueing model for LLM inference that explicitly incorporates KV cache memory dynamics, deriving a closed-form stability condition \(\lambda < \mu(1-\delta)\). This allows operators to directly calculate the required number of GPUs; validation on a single GPU, 8-GPU clusters, and LongBench real-world data shows errors \(\leq 10\%\).
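The stability condition above lends itself to a quick capacity check. The following is a minimal sketch and not the paper's procedure: it assumes \(\delta\) is a fractional capacity loss in \([0, 1)\), that \(\mu\) is a per-GPU service rate, and that load splits evenly across identical GPUs. The function names are hypothetical.

```python
import math


def is_stable(arrival_rate, service_rate, delta):
    """Check the stability condition lambda < mu * (1 - delta)."""
    return arrival_rate < service_rate * (1.0 - delta)


def min_gpus(arrival_rate, service_rate, delta):
    """Smallest GPU count n such that arrival_rate / n < mu * (1 - delta).

    Assumption (not from the source): requests are split evenly across
    n identical GPUs, each with service rate mu and the same delta.
    """
    per_gpu_capacity = service_rate * (1.0 - delta)
    if per_gpu_capacity <= 0:
        raise ValueError("delta must leave some usable capacity")
    # The inequality is strict, so an exact multiple needs one more GPU.
    return math.floor(arrival_rate / per_gpu_capacity) + 1


if __name__ == "__main__":
    # Illustrative numbers only: 30 req/s arriving, 10 req/s per GPU, delta = 0.25.
    print(is_stable(7.0, 10.0, 0.25))
    print(min_gpus(30.0, 10.0, 0.25))
```

With these illustrative inputs the usable per-GPU capacity is 7.5 requests per second, so a load of exactly 30 requests per second needs a fifth GPU to satisfy the strict inequality.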
Active Budget Allocation for Efficient Scaling Law Estimation via Surrogate-Guided Pruning: This paper models training budget allocation in scaling law experiments as a multi-round resource selection problem. By combining Successive Halving with learning curve surrogates to predict future potential, it approximates the full scaling law with up to 98.7% training cost savings on synthetic and nanoGPT learning curves.
Active Tabular Augmentation via Policy-Guided Diffusion Inpainting: This paper formalizes the "fidelity-utility gap" in tabular augmentation (where generators optimize for distribution matching, yet augmentation value stems from low-density regions). It proposes the TAP algorithm, which utilizes diffusion inpainting for manifold-constrained proposals, policy-guided utility-aligned selection, and hard-constraint gating with conservative window commitment. On 7 real-world tabular datasets, it achieves up to a 15.6% improvement in classification accuracy and a 32% reduction in regression RMSE compared to baselines.
Advantage Collapse in Group Relative Policy Optimization: Diagnosis and Mitigation: This paper identifies that GRPO loses gradient signals under binary verifiable rewards when intra-group rewards are identical. It proposes the ACR metric for real-time diagnosis of this "advantage collapse" and introduces AVSPO to inject virtual reward samples, restoring intra-group variance. This approach consistently improves performance by 4-6 percentage points across various Qwen2.5 mathematical reasoning models.
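The collapse described above follows directly from the standard group-normalized advantage. The sketch below is illustrative only: `collapsed_fraction` is a hypothetical stand-in for a diagnostic, not the paper's ACR definition, and the virtual-sample mechanism of AVSPO is not shown.

```python
import statistics


def group_advantages(rewards, eps=1e-6):
    """Group-relative advantages: (r - mean) / (std + eps) within one group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def collapsed_fraction(groups):
    """Fraction of groups whose rewards are all identical.

    Hypothetical diagnostic for illustration; the paper's ACR metric
    may be defined differently.
    """
    collapsed = sum(1 for g in groups if len(set(g)) == 1)
    return collapsed / len(groups)


if __name__ == "__main__":
    # All-correct group: every advantage is zero, so no gradient signal.
    print(group_advantages([1.0, 1.0, 1.0, 1.0]))
    # Mixed group: advantages are roughly +1 and -1.
    print(group_advantages([1.0, 0.0, 1.0, 0.0]))
    print(collapsed_fraction([[1, 1], [0, 0], [1, 0], [0, 1]]))
```

When every sampled response in a group is right, or every one is wrong, the numerator is zero for all of them, which is the gradient-free case the summary refers to.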
An Algebraic View of the Expressivity of Recurrent Language Models: This paper unifies the formal language expressivity of RNNs/SSMs as an algebraic problem: once numerical semantics are fixed, the languages a model can recognize are determined by its hierarchical transition monoids and their wreath products. Furthermore, the same architecture yields entirely different counting capabilities under floating-point versus unsigned integer semantics.
ArcVQ-VAE: A Spherical Vector Quantization Framework with ArcCosine Additive Margin: The authors diagnose the root cause of VQ-VAE codebook collapse as "codebook vector \(\ell_2\) norm imbalance + geometric clustering." They propose SAMP: Ball-Bounded Norm Regularization to constrain all codebook vectors within a time-varying Euclidean ball, and ArcCosine Additive Margin Loss—drawing inspiration from ArcFace—to push latent vectors apart on the sphere. This results in uniformly distributed codebooks and significantly higher utilization, outperforming mainstream VQ-VAE variants in ImageNet reconstruction and generation FID.
AREA: Attribute Extraction and Aggregation for CLIP-Based Class-Incremental Learning: This paper decomposes forgetting in CLIP-based class-incremental learning into "attribute extraction drift" and "attribute aggregation drift." It proposes AREA, which utilizes Principal Geodesic Analysis (PGA) to fix visual/textual attribute anchors on the hypersphere, combined with lightweight task experts, Variational Information Bottleneck (VIB) regularization, and Optimal Transport (OT) routing to stabilize attribute aggregation. This approach significantly improves average and final accuracy across nine CLIP-CIL benchmarks.
Auditing and Fixing Economic Validity in Tabular Foundation Models for Discrete Choice: This paper discovers that tabular foundation models (TFMs) such as TabPFN and Mitra exhibit high accuracy in discrete choice tasks but violate price-demand monotonicity and produce untrustworthy value-of-time (VOT) estimates. Consequently, it proposes a two-stage behavioral adapter that embeds TFM predictions into a utility model constrained by economic theory, achieving 100% behavioral validity while recovering most accuracy gains.
Beyond Temperature: Hyperfitting as a Late-Stage Geometric Expansion: This paper demonstrates through controlled experiments that Hyperfitting (training LLMs to near-zero loss on small datasets) is not a temperature-scaling-style distribution sharpening, but a dynamic, context-dependent token Rank Reordering mechanism. This mechanism is concentrated in the final layer of the Transformer as a "Terminal Geometric Expansion" (\(\Delta \text{Dim} \approx +80.8\)). Based on this, the paper proposes Late-Stage LoRA, which fine-tunes only the last 5 layers, maintaining generation diversity while reducing trainable parameters by approximately 80%.
Beyond Tokens: Enhancing RTL Quality Estimation via Structural Graph Learning: StructRTL is proposed for structural-aware graph self-supervised pre-training (Masked Node Modeling + Edge Prediction) on Control Data Flow Graphs (CDFG) of RTL designs. Combined with knowledge distillation from post-mapping netlists to CDFGs, it significantly outperforms LLM-based and handcrafted feature methods in area and delay prediction tasks.
BioArc: Discovering Optimal Neural Architectures for Biological Foundation Models: BioArc proposes a heterogeneous neural architecture search framework for biological foundation models. By automatically discovering optimal hybrid architectures in a search space containing five basic modules (CNN/LSTM/Transformer/Mamba/Hyena), it outperforms existing SOTA biological foundation models with less than 1/25 of the parameters.
Bounded Hyperbolic Tangent: A Stable and Efficient Alternative to Pre-Layer Normalization in Large Language Models: This paper proposes Bounded Hyperbolic Tangent (BHyT), a data-driven input-bounded \(\tanh\) transformation, as a plug-and-play alternative to Pre-Layer Normalization. It suppresses depth-wise activation growth while avoiding redundant variance calculations, achieving 1.6% faster training and a 1.77% increase in generation throughput compared to RMSNorm, with downstream performance consistently exceeding existing methods.
Breaking the MoE LLM Trilemma: Dynamic Expert Clustering with Structured Compression: To address the "load imbalance – parameter redundancy – communication overhead" trilemma in MoE LLMs, this paper proposes a unified framework: using online clustering based on "parameter + activation" dual similarity to group experts. Within groups, structured compression (~5×) is applied via a "shared base matrix + low-rank residuals." This is combined with two-stage hierarchical routing ("select group then select expert"), FP16/INT4 heterogeneous precision, and offline offloading of idle groups. On GLUE/WikiText-103, the framework matches standard MoE performance with ~80% parameter reduction, 10–20% throughput gain, and a 3× reduction in expert load variance.
Causal Forcing: Autoregressive Diffusion Distillation Done Right for High-Quality Real-Time Interactive Video: This paper identifies the theoretical requirement of "frame-wise injectivity" and proposes Causal Forcing—a method that replaces the bidirectional teacher with an autoregressive teacher for ODE distillation initialization. This avoids the performance collapse seen in Self-Forcing, achieving significant gains over Self-Forcing in dynamics (+19.3%), VisionReward (+8.7%), and instruction following (+16.7%), while maintaining the same inference latency (0.69s).
Compositional Consistency-Guided Decoding for Three-Way Logical Question Answering: By leveraging the deterministic negation mapping between hypothesis \(H\) and its negation \(\neg H\) in three-way logical QA, multiple LLM calls are composed at test-time and disambiguated through consistency constraints. This reduces epistemic Unknowns and improves reasoning accuracy without requiring training.
Compress then Merge: From Multiple LoRAs into One Low-Rank Adapter: The Compress-then-Merge (CtM) pipeline is proposed to learn a shared \(r\)-dimensional subspace and project each adapter into \(r \times r\) coordinate matrices before merging multiple LoRAs. Merging is then executed in the low-dimensional space, architecturally ensuring the output is a rank-\(r\) LoRA and avoiding the performance loss associated with truncated SVD in traditional Merge-then-Compress methods.
Continual Model Routing in Evolving Model Hubs: When the number of available experts in a model hub grows from hundreds to thousands and models are continuously added or retired, traditional "train-once routers" or "pure model card retrieval" become insufficient. The authors formalize this as a "continual classification (expanding label space)" problem, construct the CMRBench benchmark covering 4 phases and over 2,000 candidate models, and propose CARvE—a continual router using contrastive embedding scoring, checkpoint anchoring to prevent drift, and structured negative sample replay to maintain discriminative power. CARvE outperforms standard LoRA replay by 5 percentage points in D-Acc while reducing forgetting by half.
Critique-Guided Distillation for Robust Reasoning via Refinement: The method has the student consume rather than generate the teacher's critique during training: the student predicts the teacher's refined answer conditioned on (prompt, student draft, teacher critique). At inference, a single prompt pass generates longer and more accurate reasoning chains without compromising instruction-following capabilities, unlike CFT.
DAG-MoE: From Simple Mixture to Structural Aggregation in Mixture-of-Experts: DAG-MoE replaces the standard "weighted sum" aggregation of top-\(K\) expert outputs in MoE with structural aggregation via a dynamically learned Directed Acyclic Graph (DAG), significantly enhancing MoE expressivity and downstream inference performance with negligible increases in routing or parameter overhead.
Decomposing the Basic Abilities of Large Language Models: Mitigating Cross-Task Interference in Multi-Task Instruct-Tuning: This paper addresses the cross-task gradient conflict in multi-task instruction fine-tuning by proposing Badit. It first uses SVD to decompose pre-trained weights into naturally orthogonal high-singular-value LoRA "basic ability" experts. During training, it applies spherical K-means to dynamically group rank-1 components orthogonally. This shifts the focus from "isolating parameters by task" to "decoupling by basic abilities," achieving an average improvement of 2.68 Rouge over GainLoRA across six LLMs.
Decouple Searching from Training: Scaling Data Mixing via Model Merging for Large Language Model Pre-training: To identify optimal data mixing ratios in LLM pre-training without the prohibitive cost of proxy experiments, this paper proposes DeMix. The method involves training \(N\) component models only once (each corresponding to a candidate subset). Subsequently, any candidate ratio \(\{\alpha_i\}\) is treated as a "training-free" proxy through weighted merging \(\sum_i \alpha_i \Theta_i\). LightGBM is employed for iterative regression on the simplex to select the optimal recipe. DeMix achieves superior downstream scores using approximately \(6\times\) less compute than RegMix/CLIMB and provides the open-source 22T token DeMix Corpora.
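The merging step \(\sum_i \alpha_i \Theta_i\) can be sketched in a few lines. This is a minimal illustration under stated assumptions: each component model is represented as a plain dictionary of parameter lists with identical keys and shapes, and the function name is hypothetical. The LightGBM regression over candidate ratios is not shown.

```python
import math


def merge_models(components, alphas):
    """Weighted parameter average: merged = sum_i alpha_i * theta_i.

    components: list of dicts mapping parameter name -> list of floats
                (assumed to share keys and lengths).
    alphas:     mixing ratios, assumed to lie on the probability simplex.
    """
    if len(components) != len(alphas):
        raise ValueError("need one ratio per component model")
    if any(a < 0 for a in alphas) or not math.isclose(sum(alphas), 1.0):
        raise ValueError("ratios must be non-negative and sum to 1")
    merged = {}
    for name in components[0]:
        size = len(components[0][name])
        merged[name] = [
            sum(a * c[name][j] for a, c in zip(alphas, components))
            for j in range(size)
        ]
    return merged


if __name__ == "__main__":
    theta_a = {"w": [0.0, 4.0]}
    theta_b = {"w": [4.0, 0.0]}
    print(merge_models([theta_a, theta_b], [0.25, 0.75]))
```

Because merging is only arithmetic on already-trained weights, evaluating a new candidate ratio costs one merge plus one evaluation rather than a fresh proxy training run.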
Demystifying When Pruning Works via Representation Hierarchies: Starting from the three-level representation hierarchy of "embedding \(\rightarrow\) logit \(\rightarrow\) probability," this paper uses second-order Taylor expansion theory to prove: perturbations caused by pruning in the embedding and logit spaces are inherently small, but the non-linear softmax step amplifies these perturbations into the probability space by a factor of \(\mathrm{Var}_r(\Delta z)/(2T^2)\). Combined with step-wise accumulation through auto-regressive decoding, this ultimately causes generative tasks to collapse. In contrast, non-generative tasks remain naturally robust because they rely only on candidate token subspaces—unifying the explanation for why pruning is nearly lossless on MMLU and retrieval but drops to zero on GSM8K and HumanEval.
Detecting Fluent Optimization-Based Adversarial Prompts via Sequential Entropy Changes: The authors model "fluent optimization-based jailbreak suffix detection" as online changepoint detection on token-level entropy streams. By using the entropy distribution of fixed system prompts to calculate a MAD robust baseline for normalizing user token entropy, they run a Page-CUSUM cumulative statistic \(W_t^+\) that triggers an alert upon exceeding a threshold. Across 6 open-source aligned LLMs, this method achieves higher F1 scores than window-based perplexity for five attack types (GCG, AutoDAN, AdvPrompter, BEAST, AutoDAN-HGA), accurately localizes 79.6% of alerts within suffixes, and serves as a lightweight gate for LLaMA Guard, saving 17-42% of guard calls.
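The detection loop described above can be sketched as a one-sided CUSUM over MAD-normalized entropies. This is a rough illustration rather than the authors' implementation: the drift and threshold values, the 1.4826 consistency constant, and the function names are all assumptions, and the token entropies are taken as given inputs.

```python
import statistics


def mad_baseline(reference):
    """Robust center and scale from reference (system prompt) entropies."""
    center = statistics.median(reference)
    mad = statistics.median([abs(x - center) for x in reference])
    # 1.4826 makes MAD comparable to a standard deviation under normality
    # (an assumed convention, not taken from the source).
    return center, 1.4826 * mad


def first_alert(stream, center, scale, drift=0.5, threshold=20.0):
    """Return the index where the upper CUSUM statistic first exceeds threshold.

    W_t = max(0, W_{t-1} + z_t - drift), with z_t the normalized entropy.
    drift and threshold are hypothetical tuning values.
    """
    if scale <= 0:
        raise ValueError("scale must be positive")
    w = 0.0
    for t, x in enumerate(stream):
        z = (x - center) / scale
        w = max(0.0, w + z - drift)
        if w > threshold:
            return t
    return None


if __name__ == "__main__":
    center, scale = mad_baseline([1.0, 1.2, 0.8, 1.1, 0.9])
    # Entropy jumps at index 2; the cumulative statistic crosses at index 3.
    print(first_alert([1.0, 1.0, 3.0, 3.0, 3.0], center, scale))
    print(first_alert([1.0, 1.1, 0.9, 1.0], center, scale))
```

The statistic resets to zero during ordinary fluctuations and only accumulates under a sustained upward shift, which is what allows an alert to be placed at a position in the token stream.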
Dispersion Loss Counteracts Embedding Condensation and Improves Generalization in Small Language Models: This paper systematically observes the universal phenomenon of "embedding condensation," where token embeddings in small language models collapse into a narrow cone as depth increases—unlike in large models. The authors design an angular dispersion loss \(\mathcal{L}_{\text{disp}}\) to explicitly force embeddings to spread out. Without adding any parameters, this approach achieves an average improvement of 3.3% for Qwen3 / GPT2 across 10 benchmarks.
DIVER: Diving Deeper into Distilled Data via Expressive Semantic Recovery: DIVER transforms the classic Dataset Distillation (DD) from a "single-stage direct evaluation" into a two-stage paradigm: "distill first, then revive semantics with pretrained diffusion models." Through a three-step process of semantic inheritance, guidance, and fusion, it recovers suppressed high-level semantics from "gibberish" images distilled via ConvNets. This improves the accuracy of the same distilled data on heterogeneous architectures like ResNet18/ViT by 3–10 percentage points, requiring only 2.48s and 4GB VRAM per image.
Don't Ignore the Tail: Decoupling top-K Probabilities for Efficient Language Model Distillation: This paper proposes TAD (Tail-Aware Distillation), which explicitly decouples teacher top-\(K\) probabilities from "tail" probabilities within the standard KD KL divergence and amplifies the tail contribution. This allows LLM pre-training distillation to be completed within academic-scale compute (single H100 + 1 week), outperforming data-centric methods like MiniPLM.
DSL-Topic: Improving Topic Modeling by Distilling Soft Labels from Language Models: The authors utilize the next-token probability distribution generated by a small language model (SLM), prompted to "generate a theme word for the document," and project it onto the topic model vocabulary as dense soft labels. These labels replace the traditional Bag-of-Words (BoW) reconstruction target to train neural topic models (ProdLDA / ECRTM / FASTopic). This significantly improves Purity across three datasets (20NewsGroup, TweetTopic, StackOverflow) and provides a Bayesian interpretation of "projecting implicit LM posterior predictions onto a structured topic family."
Easier to Judge Than to Find: Predicting In-Context Learning Success for Demonstration Selection: This paper reframes ICL demonstration selection from "searching for the optimal \(D^\star\) in a vast combinatorial space" to "judging whether a sampled \((q,D)\) pair will succeed." It proposes DiSP—a framework that stratifies queries by difficulty and employs lightweight judge models for "sample-and-judge" with early stopping. DiSP achieves up to a 3.4% improvement over strong baselines on five classification benchmarks while reducing end-to-end real-time latency by up to 23×.
Effective Model Pruning: Measure the Redundancy of Model Components: This paper borrows the concept of "effective sample size" from particle filtering to map any scoring vector directly to an adaptive retention count \(N_{\text{eff}} = \lfloor 1/\sum_i \omega_i^2 \rfloor\) as a pruning threshold. This approach avoids manual sparsity setting and provides a theoretical upper bound on the loss change before and after pruning.
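The retention count \(N_{\text{eff}}\) is easy to compute once scores are normalized. The sketch below assumes the \(\omega_i\) are non-negative scores rescaled to sum to 1 and that the top-\(N_{\text{eff}}\) components by score are kept; both are assumptions for illustration, and the function names are hypothetical. Exact rational arithmetic is used so the floor does not flip on rounding error.

```python
import math
from fractions import Fraction


def effective_count(scores):
    """N_eff = floor(1 / sum_i w_i^2) with w_i = score_i / sum(scores)."""
    exact = [Fraction(s) for s in scores]
    total = sum(exact)
    if total <= 0 or any(s < 0 for s in exact):
        raise ValueError("scores must be non-negative with a positive sum")
    weights = [s / total for s in exact]
    return math.floor(1 / sum(w * w for w in weights))


def keep_indices(scores):
    """Indices of the N_eff highest-scoring components (assumed keep rule)."""
    n_keep = effective_count(scores)
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:n_keep])


if __name__ == "__main__":
    print(effective_count([1, 1, 1]))     # uniform scores: nothing is redundant
    print(effective_count([1, 0, 0, 0]))  # one dominant component
    print(keep_indices([4, 1, 4, 1]))
```

Uniform scores give \(N_{\text{eff}}\) equal to the number of components, while a single dominant score drives it toward 1, so the retention count adapts to how concentrated the importance scores are.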
Efficient Learned Image Compression without Entropy Coding: EF-LIC replaces the slow and serial entropy coding module in the learned image compression pipeline with a two-step approach: "unconstrained vector quantization to maximize index entropy + representation-domain context reparameterization to eliminate latent correlations." It is theoretically proven that its R–D performance can approach that of entropy coding schemes. In practice, it saves 67.86% bitrate compared to MS-ILLM on Kodak/LPIPS and achieves 10x faster decoding.
End-to-End Compression for Tabular Foundation Models: TACO attaches a learnable transformer compressor in front of TabPFN-like tabular foundation models. It compresses \(N\) rows of training context into \(K\ll N\) rows of latent representations before feeding them to the predictor. Through end-to-end joint meta-learning, it achieves a 94x inference speedup and 97% memory saving at a 1% compression rate with almost no loss in ROC-AUC.
Energy-Structured Low-Rank Adaptation for Continual Learning: E2-LoRA moves away from orthogonal constraints in parameter or input feature spaces, focusing instead on "task-induced output feature drift" \(\Delta \mathbf{Y}_t = \Delta \mathbf{W}_t \mathbf{X}_t\). By performing SVD on this drift, LoRA parameters are rearranged onto energy-concentrated and rank-ordered bases. This allows discarding low-energy ranks to reclaim capacity for new tasks, which, combined with an adaptive rank allocation strategy based on energy retention, achieves SOTA performance across several continual learning benchmarks.
Entropy-Aware On-Policy Distillation of Language Models: Addressing the issues of diversity collapse and gradient instability caused by reverse KL in high-entropy teacher regions during on-policy distillation, this paper proposes an adaptive strategy that mixes forward and reverse KL based on token-level teacher entropy, achieving up to a +5.05 improvement in Pass@8 across six mathematical reasoning benchmarks.
EpiCache: Episodic KV Cache Management for Long-Term Conversation on Resource-Constrained Environments: This paper proposes EpiCache, a training-free KV cache management framework. By employing block-wise prefilling to control memory bounds, episodic clustering to preserve topic-relevant context, and sensitivity-aware budget allocation for layer-wise optimization, it achieves near full-cache accuracy with 4-6x compression and reduces peak memory by 3.7x across three long-conversation QA benchmarks.
Event2Vec: Processing Neuromorphic Events Directly by Representations in Vector Space: Following the word2vec philosophy, sparse asynchronous events \((x,y,t,p)\) generated by event cameras are embedded directly into a vector space. By employing Parametric Spatial Embedding + Convolutional Temporal Embedding + K-Means++ aggregation, standard Transformers can preserve the sparse asynchronous nature of events while achieving high throughput on GPUs. The parameter count is reduced to \(\tfrac{1}{2.8} \sim \tfrac{1}{816}\) of previous SOTAs.
EVL-ECG: Efficient ECG Interpretation With Multi-Aspect Heterogeneous Knowledge Distillation: Addressing the VLM distillation problem for ECG interpretation—where teacher and student models are heterogeneous in visual token count, tokenizers, and sequence lengths—EVL-ECG introduces a cross-architecture distillation framework combining "Multi-Head Cross-Attention Alignment + Optimal Transport Visual Feature Matching + Geometric Intra-Architecture Matching." This pushes a 2B student model to SOTA, achieving a 2.4% higher AUC and 1.1% higher clinical accuracy than existing KD methods.
Exploiting Weight-Space Symmetries for Approximating Curvature: This paper proves that by exploiting the invariance of neural network loss to "weight-space symmetry groups" (such as parameter rearrangement/rescaling) and performing orbit averaging on a single gradient, a highly structured Hessian approximation—cheap to store and invert—can be analytically derived. Furthermore, Shampoo and Muon are shown to be special cases where an "identity group" is assigned to specific layers, thereby integrating these empirical optimizers into a unified symmetry-curvature framework.
FAIR-Calib: Frontier-Aware Instability-Reweighted Calibration for Post-Training Quantization of Diffusion Large Language Models: Addressing the "write-once, no-edit" vulnerability in Diffusion Large Language Models (dLLMs), FAIR-Calib utilizes a full-precision teacher to detect a "frontier-aware position prior." This prior is then applied as weights for layer-wise hidden-state MSE calibration. By specifically protecting boundary tokens that, once flipped by quantization errors, would be permanently locked and amplified, FAIR-Calib consistently outperforms existing quantization baselines under W4A4 on LLaDA and Dream.
FedRot-LoRA: Mitigating Rotational Misalignment in Federated LoRA: This paper identifies that the true "enemy" of naive factor-wise averaging in Federated LoRA is potential subspace misalignment caused by rotational invariance. It proposes solving for a rotation matrix \(R_i^t\) via orthogonal Procrustes on the client side to align \(A\) and \(B\) factors before aggregation. Both theory and experiments demonstrate that this significantly reduces aggregation error without increasing communication overhead.
FedSDR: Federated Self-Distillation with Rectification: To address "weight drift" caused by heterogeneous client data distributions in federated LLM fine-tuning, this work first uses the model itself to rewrite original instructions into a "model-understandable space" for data-level alignment (FedSD). It then employs a LoRA-S/LoRA-R dual-stream structure to absorb style noise and anchor factual correctness, respectively, while aggregating only LoRA-R. This decouples alignment from faithfulness, achieving SOTA results under various Non-IID settings.
Finer Parameter Steps for Low-Rank PEFT: A Controlled Study with CP Tensor Adapters: The authors replace LoRA's "growth by rank" with "growth by CP tensor component," reducing the single-step parameter increment from 4096 to 193 (a 21× reduction). Through a strictly controlled study on OPT-1.3B with SST-2, RTE, and BoolQ, they show that finer parameter granularity serves as a tool for "diagnosing PEFT budget sensitivity," but does not inherently yield a better accuracy-budget curve. The result is a sober negative-to-neutral conclusion rather than a claim that the proposed method is stronger.
FlattenGPT: Depth Compression for Transformer with Layer Flattening: This paper proposes FlattenGPT, which first "flattens" adjacent Transformer layers with high input similarity into a single layer of \(2\times\) width (preserving all parameter knowledge) and then applies channel pruning to restore the width to its original scale. This approach achieves the inference speedup of depth compression while avoiding the performance collapse caused by knowledge loss in traditional layer pruning.
Float8@2bits: Entropy Coding Enables Data-Free Model Compression: EntQuant preserves weights with Float8/Int8 precision but adds an additional \(\ell_1\) regularization during the quantization phase to "align" weights toward a low-entropy distribution. These are then losslessly compressed to approximately 2 bits using parallelized ANS entropy coding on the GPU. This achieves over 8× compression for 70B LLMs in under 10 minutes without requiring calibration data or recovery training, while inference is only 1.5–2× slower.
FRISM: Fine-Grained Reasoning Injection via Subspace-Level Model Merging for Vision–Language Models: FRISM refines "VLM × LRM merging" from layer-wise granularity to SVD subspace granularity. It utilizes the SVD subspaces of LRM task vectors as reasoning priors and employs an unlabeled self-distillation process (using KL divergence to preserve vision and spectral magnitude maximization to absorb reasoning) with learnable gates to find the optimal injection intensity. This significantly enhances VL reasoning performance without a substantial drop in vision capabilities.
From Per-Image Low-Rank to Encoding Mismatch: Rethinking Feature Distillation in Vision Transformers: The authors utilize a three-perspective diagnostic—sample-wise SVD, dataset-level PCA, and token-level Spectral Energy Pattern (SEP)—to reveal a seemingly paradoxical ViT representation geometry: "while per-image feature matrices are low-rank, the shared subspace across the dataset is nearly full-rank, and single-token spectral bandwidth approaches 100%." Based on this, they propose two minimalist patches, Lift (retaining a lifting projector at inference) and WideLast (widening only the final block to the teacher's width), which boost vanilla MSE feature distillation from 74.86% to 78.23% for DeiT-Tiny \(\leftarrow\) CaiT-S24.
GEMQ: Global Expert-Level Mixed-Precision Quantization for MoE LLMs: GEMQ upgrades expert-level bit allocation for MoE models from intra-layer local Linear Programming (LP) to cross-layer global LP. Combined with "post-quantization router weight fine-tuning" to realign distorted routing distributions and a "progressive bit reduction" iterative framework to refine importance estimation, GEMQ compresses models like Mixtral-8×7B to an average of 2.5 bits per expert with less than a 7% drop in average zero-shot performance across 7 benchmarks. It significantly outperforms PMQ, SpQR, MoEQuant, and EAQuant under the same bit budget.
Geo-Expert: Fine-Tuning an 8B Model into an Expert-Level Geological Reasoning LLM via LoRA: Geo-Expert utilizes 11,518 CoT-enhanced instruction data points distilled from five classic geology textbooks to fine-tune Qwen3-8B/32B and Gemma-3-27B models via LoRA. On Geo-Eval (comprising 387 hard boundary problems), Qwen3-8B-geo achieves an average score of 6.27, surpassing Llama-3.1-70B-Instruct (4.12) and GPT-4o (5.93), while Qwen3-32B-geo reaches 6.82, approaching GPT-5.4 (7.15). This demonstrates that high-quality domain alignment is more critical than scaling.
Global Convergence of Adaptive Sensing for Principal Eigenvector Estimation: This paper establishes the optimal convergence rate for compressed streaming PCA. The upper bound for the Oja algorithm using two adaptive measurements per step under noisy observations matches the information-theoretic lower bound (\(\Theta(\lambda_1 \lambda_2 d^2 / (\Delta^2 t))\)). It reveals for the first time that the fundamental cost of compression relative to full observation is an extra factor of \(d\), while adaptivity saves a factor of \(d\) compared to non-adaptive sensing.
GradPower: Powering Gradients for Faster Language Model Pre-Training: GradPower applies an element-wise "sign-preserving power" transformation \(\varphi_p(g_i)=\mathrm{sign}(g_i)\,|g_i|^p\) to raw gradients before feeding them into any gradient-based optimizer. With just one additional line of code and without altering internal AdamW/Muon logic or hyperparameters, it consistently achieves lower final loss across multiple scales of LLaMA and Qwen2MoE (66M to 2B). The gains are most significant under MoE architectures and wsd learning rate schedules.
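The transformation \(\varphi_p\) is simple enough to write out directly. The sketch below applies it to a flat list of gradient values in plain Python; in a real training loop it would be applied to each parameter's gradient before the optimizer step. The exponent value used in the example is purely illustrative, since the summary does not state a recommended \(p\).

```python
import math


def grad_power(grads, p):
    """Element-wise sign-preserving power: sign(g) * |g|**p."""
    return [math.copysign(abs(g) ** p, g) for g in grads]


if __name__ == "__main__":
    # p = 0.5 is a hypothetical choice for illustration only.
    print(grad_power([4.0, -4.0, 0.0], 0.5))
    # p = 1 leaves the gradients unchanged.
    print(grad_power([0.3, -1.5], 1.0))
```

Because the transform acts on gradients before they reach the optimizer, the optimizer's own state and hyperparameters are untouched, which matches the "one additional line" framing above.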
Hard Labels In! Rethinking the Role of Hard Labels in Mitigating Local Semantic Drift: Addressing the exorbitant storage costs of "storing massive soft labels per image" in large-scale dataset distillation, this paper demonstrates that Local View Semantic Drift (LVSD) occurs when the number of soft labels per image \(s\) is restricted. A three-stage training paradigm, HALD (soft→hard→soft), is proposed to use smoothed hard labels as semantic anchors to pull training back on track. On ImageNet-1K, it achieves 42.7% accuracy with 285M soft label storage, outperforming the SOTA LPLD by 9.0% while compressing soft label storage by 100x.
Hierarchical Image Tokenization for Multi-Scale Image Super Resolution: H-VAR reslices the VAR paradigm, which uses residual quantization for multi-scale generation, into Hierarchical Image Tokenization (HIT). This allows a small 310M model to output three meaningful intermediate resolutions (128 / 256 / 512) in a single forward pass. Combined with a DPO regularization term that favors HR outputs without requiring external reward models, it competes with the 1B-parameter VARSR on standard ISR datasets.
IDLM: Inverse-distilled Diffusion Language Models: This paper extends "Inverse Distillation" from continuous diffusion to discrete text diffusion models. By proving that the unique optimal solution for the IDLM loss under SEDD/MDLM/Duo is the true data distribution \(p^*\), and combining simplex relaxation with Gaussian reparameterization to solve the instability of discrete backpropagation, it compresses a 1024-step teacher DLM to 16 or even 4 steps while maintaining GenPPL, Entropy, and MAUVE with almost no degradation.
Images as Tables: In-Context Learning with TabPFN for Low-Data Detection of AI-Generated Images: The authors reformulate AI-generated image detection into a three-stage pipeline: first, a frozen DINOv3 compresses each image into a 768-dimensional CLS vector; next, PCA reduces this to 500 dimensions to serve as a single row in a table; finally, TabPFN performs in-context inference. This approach transforms the need to "retrain classification heads for new generators" into simply "replacing context samples in TabPFN." In low-data and cross-generator scenarios on GenImage, this method leads the strong baseline LATTE by up to 8.2% and wins in 54 out of 64 generator transfer pairs.
Jailbreak to Protect: Buffering and Reinforcing via Temporary Jailbreaking for Safe Fine-Tuning in Large Language Models: In the Fine-tuning-as-a-Service (FaaS) scenario, the authors reinterpret the strategy of "temporarily jailbreaking a model before user fine-tuning" as a gradient saturation mechanism. Based on this observation, they design the Buffer-and-Reinforce framework: a removable BufferLoRA is used to absorb harmful gradients during user fine-tuning, followed by a ReinforceLoRA that restores safety via QR orthogonal merging. This approach reduces harmful scores to approximately 8.5 without requiring any user-side safety data, while maintaining downstream task accuracy above 76.
LC-QAT: Data-Efficient 2-Bit QAT for LLMs via Linear-Constrained Vector Quantization: LC-QAT employs a parameterization of "shared linear transformation + discrete integer vectors" to reformulate the codebook lookup of Vector Quantization (VQ) as a differentiable round-and-project operation. This enables, for the first time, end-to-end QAT for 2-bit VQ. Leveraged by high-quality PTQ initialization, it matches or surpasses existing SOTA quantization methods using only 0.1%–10% of the training data.
LEAP: Learnable End-to-End Adaptive Pruning of Large Language Models: LEAP replaces the "categorical logit per valid sparsity pattern in each group" parameterization found in learnable mask pruning (MaskLLM/PATCH) with a "per-weight Bernoulli gate via Gumbel-Sigmoid". This bypasses the combinatorial explosion deadlock in unstructured sparsity, enabling end-to-end mask learning for unstructured LLM pruning for the first time. On five models (0.5B–8B) at 50%/60% sparsity, it achieves an average zero-shot accuracy improvement of +2.59 points over the strongest layer-wise baseline, ADMM.
Learned Subspace Compression for Communication-Efficient Pipeline Parallelism: To address the "inter-stage communication" bottleneck of pipeline parallelism in low-bandwidth networks, this paper proposes MAPL: allowing each pipeline stage to learn its own orthogonal projection on the Stiefel manifold to compress boundary activations. Combined with factorized anchor embeddings to decouple token shifts and residual vector quantization (RVQ), it achieves 4–16× communication compression on 150M–1B LLaMA models with only ~1% performance degradation compared to uncompressed baselines, significantly outperforming the fixed-subspace SSN.
LFQ: Logit-aware Final-block Quantization for Boosting the Generation Quality of Low-Bit Quantized LLMs: Addressing the quality degradation of block-wise PTQ in generation tasks, LFQ replaces the quantization objective of the final Transformer block from MSE to logit-level cross-entropy loss. This aligns the token distribution of the quantized model with the full-precision model, consistently improving accuracy across generation benchmarks such as IFEval, GSM8K, MATH500, and AIME.
LiftQuant: Continuous Bit-Width LLM via Dimensional Lifting and Projection: LiftQuant decouples LLM quantization bit-width from discrete integers (2/3/4 bit) into continuous fractions (e.g., 2.4-bit) through a "high-dimensional 1-bit lattice \(\rightarrow\) low-dimensional weight space projection" (lift-then-project) mechanism. This allows a 70B model to fit precisely into a 24GB GPU with a PPL significantly better than 2-bit baselines. The entire decoding path uses only linear transformations and 1-bit uniform quantizers, making it hardware-friendly.
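The value of fractional bit-widths is easy to see with back-of-the-envelope arithmetic. The sketch below counts weight storage only and assumes 1 GB = 1e9 bytes; scales, codebooks, KV cache, and activations are ignored, so real headroom is smaller. The helper names are hypothetical.

```python
def weight_memory_gb(num_params, bits_per_weight):
    """Weight storage in GB (1 GB = 1e9 bytes); ignores all non-weight memory."""
    return num_params * bits_per_weight / 8 / 1e9

def max_bits_for_budget(num_params, budget_gb):
    """Largest average bit-width whose weights alone fit in the budget."""
    return budget_gb * 1e9 * 8 / num_params

for bits in (2.0, 2.4, 3.0, 4.0):
    print(f"{bits:.1f}-bit 70B model: {weight_memory_gb(70e9, bits):.2f} GB")

print(f"max bits for 24 GB: {max_bits_for_budget(70e9, 24):.2f}")
```

Under these assumptions a 70B model needs 17.5 GB at 2 bits and 26.25 GB at 3 bits, so integer bit-widths either leave memory unused or overflow a 24 GB card, while 2.4 bits (21 GB) sits in between.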
LK Losses: Direct Acceptance Rate Optimization for Speculative Decoding: This paper argues that using KL divergence as a proxy for acceptance rate in speculative decoding training is sub-optimal: for small-capacity draft models, minimizing KL does not imply maximizing the acceptance rate. The authors propose LK losses (direct minimization of the negative log acceptance rate, plus a trust-region hybrid with KL) as a plug-in replacement. Across 4 draft architectures and 6 target models (8B–685B), they consistently improve the average acceptance length by 8–10%.
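A toy example makes the KL-versus-acceptance mismatch concrete. Under standard speculative sampling, a drafted token is accepted with probability \(\sum_x \min(p(x), q(x))\), where \(p\) is the target distribution and \(q\) the draft distribution. The three-token distributions below are made up for illustration, and the loss shown is just the plain negative log acceptance rate, not the paper's full LK objective.

```python
import math

def acceptance_rate(p_target, q_draft):
    """Per-token acceptance probability of speculative sampling: sum_x min(p, q)."""
    return sum(min(p, q) for p, q in zip(p_target, q_draft))

def kl(p, q):
    """Forward KL(p || q) in nats; assumes q > 0 wherever p > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.6, 0.3, 0.1]            # target model (hypothetical)
q_a = [0.45, 0.35, 0.20]       # draft A: spreads mass, small KL
q_b = [0.60, 0.399, 0.001]     # draft B: nearly drops a rare token, large KL

for name, q in (("A", q_a), ("B", q_b)):
    acc = acceptance_rate(p, q)
    print(f"draft {name}: KL={kl(p, q):.3f}  acceptance={acc:.3f}  -log(acc)={-math.log(acc):.3f}")
```

Draft A has the lower KL but also the lower acceptance rate, so a KL-trained draft model can prefer A even though B would be accepted more often.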
LLMs as Noisy Channels: A Shannon Perspective on Model Capacity and Scaling Laws: This paper reinterprets LLM training as a Shannon-Hartley noisy channel—where parameter count corresponds to bandwidth, training tokens to signal power, and data/model noise to channel noise. From this framework, it derives the Shannon Scaling Law \(C_{\text{LLM}} = aN^\alpha \log_2(1 + bD^\beta / (c(DN)^\gamma + dD^\delta + e))\), which unifies the explanation of classical monotonic scaling and recently discovered U-shaped degradation (catastrophic overtraining, quantization-induced degradation). It achieves an extrapolation \(R^2 = 0.847\) on a 12B model with 307B tokens using data from Pythia/OLMo2 models \(\le 6.9\text{B}\).
LoRA-DA: Data-Aware Initialization for Low-Rank Adaptation via Asymptotic Analysis: LoRA-DA reformulates the problem of "how to initialize LoRA matrices \(A\) and \(B\)" as an optimization problem aimed at minimizing the expected gap between the fine-tuned model and target model parameters. Through asymptotic analysis, the objective is decomposed into variance and bias terms. Using Fisher Information to characterize sampling randomness while preserving the anisotropy of the parameter space, LoRA-DA provides an initialization superior to "single-step gradient" methods, achieving stable performance gains across multiple NLP benchmarks.
Making Models Unmergeable via Scaling-Sensitive Loss Landscape: TRAP² embeds "unmergeability" directly into published weight updates during the fine-tuning stage. By performing adversarial optimization on the "update scaling factor \(s\)", the model maintains high utility at the authorized \(s=1\) but degrades rapidly at \(s \neq 1\) (off-nominal scaling commonly introduced by merging pipelines). This approach avoids reliance on Transformer architectural symmetries or full weight access, protecting both LoRA adapters and full checkpoints across Transformer and non-Transformer backbones against unauthorized model merging.
Memory-Efficient Partitioned DNN Inference on Resource-Constrained Android Crowds: This work presents the design of the "DNN Pipeline Scheduling Subsystem" within the CROWDio framework. Without modifying the model itself (no pruning, quantization, or distillation), a complete ONNX model is partitioned into layers and distributed across multiple Android devices with RAM as low as 3.3–7.4 GB for pipelined inference. By employing five mechanisms—JIT lazy loading, single-partition residency constraint, 4-tier affinity scheduling, zlib-compressed tensor transfer, and streaming 1:1 dependencies—the peak RSS on each device is suppressed to \(43\pm 2\) MB, achieving batch latencies 34% faster than traditional barrier synchronization.
MIC: Maximizing Informational Capacity in Adaptive Representations via Isotropic Subspace Alignment: This paper proposes MIC, which adds two geometric regularizations—SCR (limiting correlation between prefix/residual subspaces) and SIR (enforcing uniform variance for prefixes + hyperspherical uniformity)—on top of Matryoshka Representation Learning (MRL). This allows the model to maintain high discriminativeness even when truncated to extremely low dimensions such as 16/32/64, on average surpassing baselines like MRL and ESE.
Mind Your Margin and Boundary: Are Your Distilled Datasets Truly Robust?: This paper proposes the C2R framework, which reframes the robustness issue in dataset distillation as a "minimum robust margin" problem. By utilizing a triad of "Attack-Aware Curriculum (AAC) + Contrastive Robustness Loss (CRL) + Line-Search PGD (LS-PGD)," models trained on the resulting synthetic sets achieve approximately 2.8% higher average robust accuracy across six types of attacks compared to previous robust distillation SOTAs.
Model Merging Scaling Laws in Large Language Models: The authors empirically derive a dual-axis power law of the form \(L=L_*+BN^{-\beta}+A_0 N^{-\gamma}/(k+b)\) from 10,866 merged models. The base scale \(N\) determines the performance floor, while the number of experts \(k\) determines the tail. Four mainstream merging methods (Average, TA, TIES, DARE) share the same curve, turning the engineering questions of "how many experts to merge" and "when to stop" into a predictable and budgetable problem.
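To see how the law answers "when to stop", the sketch below evaluates \(L=L_*+BN^{-\beta}+A_0 N^{-\gamma}/(k+b)\) for increasing \(k\). All coefficient values are hypothetical placeholders, not fitted values from the paper.

```python
def merged_loss(N, k, L_star, B, beta, A0, gamma, b):
    """Dual-axis law: a floor set by N plus a 1/(k+b) tail in the number of experts."""
    return L_star + B * N ** (-beta) + A0 * N ** (-gamma) / (k + b)

# Hypothetical coefficients, for illustration only.
coef = dict(L_star=1.8, B=40.0, beta=0.3, A0=25.0, gamma=0.25, b=1.0)
N = 7e9

losses = [merged_loss(N, k, **coef) for k in range(1, 9)]
gains = [losses[i] - losses[i + 1] for i in range(len(losses) - 1)]   # benefit of one more expert
floor = coef["L_star"] + coef["B"] * N ** (-coef["beta"])             # limit as k grows

for k, g in enumerate(gains, start=1):
    print(f"k={k} -> k={k + 1}: loss drops by {g:.5f}")
print(f"floor at this N: {floor:.4f}")
```

The marginal gain of adding one expert shrinks roughly as \(1/(k+b)^2\), so a stopping rule can be phrased as "stop once the predicted gain falls below a chosen threshold".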
Multi-Adapter Representation Interventions via Energy Calibration: MARI identifies that existing "representation intervention" methods rely on a linear representation hypothesis—adding a single global steering vector to all inputs—which is unreliable because the optimal correction direction varies significantly across samples and can degrade general capabilities on benign inputs. It replaces the single adapter with multiple low-rank adapters and utilizes "competitive training + entropy routing" for sample-adaptive intervention. An independently trained low-rank probe calculates "propagation energy" for threshold gating to decide whether to enable intervention. This achieves a significant lead over ReFT in TruthfulQA/BBQ/Safety while maintaining or slightly improving MMLU/ARC scores.
NanoQuant: Efficient Sub-1-Bit Quantization of Large Language Models: NanoQuant reformulates weight quantization as a "low-rank binary decomposition" problem. It employs Hessian-aware ADMM to precisely initialize \(\pm 1\) factors and floating-point scales, followed by block-level STE reconstruction and global-scale KL calibration. Utilizing only 0.26M tokens of calibration data on a single H100 card, it enables PTQ to compress LLMs to true 1-bit or even sub-1-bit for the first time. For instance, it compresses Llama2-70B from 138 GB to 5.35 GB, allowing it to run on 8 GB consumer-grade GPUs.
NeUQI: Near-Optimal Uniform Quantization Parameter Initialization for Low-Bit LLMs: This paper points out that prevailing Post-Training Quantization (PTQ) methods all follow the Min-Max formula for initializing scale and zero-point. This legacy formula contains two long-overlooked constraints: "parameters determined by extreme values" and "zero-point must be an integer." The authors propose NeUQI, which removes these constraints by analytically solving for the optimal zero-point given a scale and employing a coarse-to-fine scale search. In LLaMA-2 7B 2-bit per-channel quantization, NeUQI reduces the C4 perplexity from 47.55 (Prev. SOTA MagR) to 17.50 and allows lightweight distillation to outperform the significantly more expensive PV-tuning.
OSAQ: Outlier Self-Absorption for Accurate Low-bit LLM Quantization: OSAQ leverages the consistent low-rank null space of the Hessian across different inputs in LLM layers to construct an additive weight perturbation \(\Delta W\) from a linear combination of null space vectors. This "self-absorbs" outlier weights without altering the second-order task loss, reducing the perplexity of 2-bit weight-only quantization by over 40% compared to vanilla GPTQ.
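The null-space idea can be checked numerically. In the sketch below, the synthetic activations, the Hessian proxy \(H = X^\top X\), and the least-norm choice of coefficients are all assumptions; the paper's rule for picking the linear combination is not reproduced. The point is only that a perturbation built from null-space vectors of \(H\) leaves the layer output on the calibration data unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, n_tokens, act_rank = 16, 4, 64, 10

# Synthetic calibration activations confined to a rank-10 subspace (assumption),
# so the Hessian proxy H = X^T X has a 6-dimensional null space.
X = rng.normal(size=(n_tokens, act_rank)) @ rng.normal(size=(act_rank, d_in))
H = X.T @ X

eigvals, eigvecs = np.linalg.eigh(H)
null_basis = eigvecs[:, eigvals < 1e-8 * eigvals.max()]   # columns span null(H)

W = rng.normal(size=(d_in, d_out))
W[3, 1] = 25.0                                            # inject an outlier weight

# Stand-in coefficient choice: least-norm projection, NOT the paper's rule.
coeffs = -null_basis.T @ W
delta_W = null_basis @ coeffs
W_new = W + delta_W

second_order_change = np.trace(delta_W.T @ H @ delta_W)   # quadratic loss term of the perturbation
print("null-space dimension:", null_basis.shape[1])
print("max |W| before/after:", np.abs(W).max(), np.abs(W_new).max())
print("output change:", np.abs(X @ W_new - X @ W).max())
```

Since the quadratic term \(\mathrm{tr}(\Delta W^\top H \Delta W)\) is numerically zero, the perturbation is free to reshape the weights before quantization without changing the second-order loss.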
PADD: Path-Aligned Decompression Distillation for Non-Router Teacher to Guide MoE Student Learning: PADD decomposes the task of "guiding a pre-trained MoE student to learn high-quality routing using a dense teacher without a router" into a unified two-stage, four-step pipeline. By first initializing and warming up student experts through teacher FFN neuron clustering, and then simultaneously performing online adaptive distillation, path-refined policy optimization (PR-GRPO), and reward-enhanced load balancing in a single training run, PADD enables small-activation MoE students to match or even surpass 7B dense teachers in mathematical reasoning at the same inference cost.
Parameters as Experts: Adapting Vision Models with Dynamic Parameter Routing: The authors treat "parameters themselves as experts"—maintaining a per-stage shared trainable parameter reservoir (shared expert center). A lightweight router dynamically synthesizes weights for low-rank projections and multi-scale depthwise convolutions for each ParaX adapter based on the current input. This simultaneously addresses the "input-agnosticism" and "cross-layer redundancy" of traditional adapters, consistently surpassing full fine-tuning with <5% trainable parameters on dense prediction tasks.
Partial Fusion of Neural Networks: Efficient Tradeoffs Between Ensembles and Weight Aggregation: The authors propose Partial Fusion: a method using partial optimal transport (partial OT) to merge only the "most similar" neurons between two networks while allowing the remaining neurons to exist independently. This creates a smooth, monotonic, and tunable accuracy–parameter curve between "weight aggregation (1× parameters)" and "full ensemble (2× parameters)". Furthermore, it is unified under a "generalized pruning of ensembles" perspective, enabling the same toolkit to compress individual models.
Persona-Pruner: Sculpting Lightweight Models for Role-Playing: Instead of equipping every character with a full general-purpose large model, this work uses only a text-based persona description to synthesize persona-specific calibration data. It then learns a binary mask on the intermediate dimensions of FFNs to "sculpt" the sub-network responsible for the character's identity from the base model. Under 50% sparsity, it recovers up to 93.8% of the performance loss in role-playing scores compared to the strongest pruning baselines while preserving general capabilities.
Plug-and-Play Spiking Operators: Breaking the Nonlinearity Bottleneck in Spiking Transformers: The authors decompose the three most difficult nonlinear operators in Transformers (Softmax, SiLU, RMSNorm) into three common primitives: "division / exponential / \(\ell_2\) norm." These are implemented as spike-friendly modules using LIF neuron group computation and shift-scaling, which can be assembled like building blocks back into the original operators. This plug-and-play approach requires no fine-tuning and integrates directly into existing ANN-to-SNN pipelines, resulting in \(<1\%\) accuracy loss for models like LLaMA-3-8B / Qwen3-8B / BERT.
Post-Hoc Merging Is Not Enough: Many-Shot Model Merging with Loss-Gap Balancing: This paper points out that mainstream model merging methods are "post-hoc"—merging only once after training—which is prone to information erasure caused by task interference. Instead, it proposes a many-shot iterative merging framework and introduces METIS. METIS uses task-level loss-gap weighting to compensate for erased tasks and a consensus mask to locate compatible updates, significantly improving multi-task capabilities while preserving single-task knowledge, particularly recovering the "worst-performing task."
Preserve-Then-Quantize: Balancing Rank Budgets for Quantization Error Reconstruction in LLMs: The authors propose SRR (Structured Residual Reconstruction), which explicitly splits the fixed low-rank budget \(r\) in Quantization Error Reconstruction (QER) into two parts: "preserving \(k\) principal singular directions before quantization" and "fitting the residual with the remaining \(r-k\) rank". Using a closed-form criterion based on a one-shot random probe to select \(k^\star\) per layer, SRR consistently outperforms LQER/QERA in 2/3-bit PTQ and QPEFT.
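The rank-budget split can be sketched in a few lines of NumPy. This is a simplified illustration under stated assumptions: the quantizer is a plain per-tensor min-max quantizer, no activation statistics are used, and \(k\) is swept exhaustively instead of being chosen by the paper's closed-form probe. The function names are hypothetical.

```python
import numpy as np

def quantize(W, bits=2):
    """Per-tensor uniform min-max quantizer (a stand-in for the real PTQ method)."""
    levels = 2 ** bits - 1
    lo, hi = W.min(), W.max()
    scale = (hi - lo) / levels
    return np.round((W - lo) / scale) * scale + lo

def low_rank(M, r):
    """Best rank-r approximation via truncated SVD (all zeros when r == 0)."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return (U[:, :r] * s[:r]) @ Vt[:r]

def preserve_then_quantize(W, r, k, bits=2):
    preserved = low_rank(W, k)                        # k principal directions kept in full precision
    Q = quantize(W - preserved, bits)                 # quantize only what is left
    correction = low_rank(W - preserved - Q, r - k)   # spend the remaining rank on the quantization error
    return preserved + Q + correction

rng = np.random.default_rng(0)
# Synthetic weight with a few dominant directions plus noise (assumption).
W = rng.normal(size=(64, 4)) @ rng.normal(size=(4, 48)) + 0.1 * rng.normal(size=(64, 48))

r = 8
errors = {k: np.linalg.norm(W - preserve_then_quantize(W, r, k)) for k in range(r + 1)}
baseline = np.linalg.norm(W - quantize(W))
for k, e in errors.items():
    print(f"k={k}: reconstruction error {e:.3f}")
print(f"quantization only: {baseline:.3f}")
```

Setting `k=0` spends the whole budget on fitting the quantization error, while `k=r` spends it all on preserving principal directions; the sweep shows how the error depends on the split for one synthetic matrix.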
PRISM: Synergizing Vision Foundation Models via Self-Organized Expert Specialization: PRISM distills three heterogeneous Vision Foundation Models (CLIP, SAM, and DINOv2) into a single ViT student. By employing a "dual-stream conditional MoE"—consisting of a shared anchor stream for gradient stability and a context-routed sparse expert stream for conflict resolution—experts self-organize to share consensus knowledge and branch for conflicting knowledge. It outperforms the previous SOTA, SAK, across all five tasks on PASCAL-Context.
Procedural Pretraining: Warming Up Language Models with Abstract Data: Injecting a lightweight "procedural data" warm-up (formal languages, stacks, cellular automata, etc.) before standard language/code/math pretraining consistently improves downstream performance with only 0.1–0.3% additional tokens. This strategy enables models to replicate the same loss using only 55–86% of the original data, representing a pretraining strategy that decouples "reasoning scaffolds" from "knowledge."
ProjQ: Project-and-Quantize for Adapter-Aware LLM Compression: ProjQ actively "shapes" the quantization noise of PTQ into a low-rank subspace and delegates this part to the subsequent LoRA adapter for elimination, thereby preserving LoRA capacity for learning downstream tasks. It achieves parity with standard 4-bit baselines using only 3 bits on LLaMA-2 / Qwen2.5 / Qwen3.
Provably Learning Attention with Queries: The authors prove that single-head softmax attention can be precisely recovered with surprising simplicity under value-query access—requiring only \(O(d^2)\) queries, which is significantly easier than ReLU MLPs of equivalent structure. When the head dimension \(r \ll d\), the complexity can be reduced to \(O(rd)\) using compressed sensing. The findings are extended to noisy oracles, membership queries, and multi-head unidentifiability.
QHyer: Q-conditioned Hybrid Attention-mamba Transformer for Offline Goal-conditioned RL: QHyer replaces the trajectory-dependent return-to-go (RTG) in Decision Transformers with state-dependent Q-values estimated via Normalizing Flows, and uses a gated Attention-Mamba hybrid backbone for content-adaptive history compression. It sets a new SOTA on both the non-Markovian and Markovian offline goal-conditioned RL datasets of OGBench/D4RL.
Quantifying the Uncertainty of Foundation Models with Singular Value Ensembles: Singular Value Ensemble (SVE) expresses "ensemble diversity" purely through the distinct re-weighting of SVD singular values—freezing the left and right singular vectors (shared "knowledge basis") of pre-trained weights while training an independent set of singular values for each ensemble member. With a parameter overhead of \(\lesssim 1\%\), its calibration quality approaches that of a true Deep Ensemble, bringing UQ into PEFT-friendly, resource-constrained scenarios.
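A minimal sketch of the parameterization: the singular vectors of a weight matrix are frozen and each ensemble member owns only a vector of singular values. The random perturbation below merely stands in for per-member training, and the matrix sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 6))                       # stand-in for a pre-trained weight
U, s, Vt = np.linalg.svd(W, full_matrices=False)  # U and Vt are frozen and shared

def member_weight(s_i):
    """Rebuild a member's weight from the shared basis and its own singular values."""
    return (U * s_i) @ Vt

# Hypothetical "trained" members: random re-weightings of the original spectrum.
members = [s * (1.0 + 0.05 * rng.normal(size=s.shape)) for _ in range(4)]

x = rng.normal(size=6)
outputs = np.stack([member_weight(s_i) @ x for s_i in members])   # shape (members, out_dim)
mean, spread = outputs.mean(axis=0), outputs.std(axis=0)          # spread is the uncertainty signal

overhead_4096 = min(4096, 4096) / (4096 * 4096)   # extra parameters per member for a 4096x4096 layer
print("predictive mean:", np.round(mean, 3))
print("predictive std: ", np.round(spread, 3))
print(f"per-member overhead for a 4096x4096 layer: {overhead_4096:.4%}")
```

Each member adds only \(\min(m, n)\) parameters per \(m \times n\) matrix, which is why the overall overhead can stay at or below the 1% level quoted above.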
RaBiT: Residual-Aware Binarization Training for Accurate and Efficient LLMs: This paper addresses a failure mode in residual binarized LLMs, termed "inter-path adaptation," in which parallel binary paths learn redundant features. The authors propose RaBiT, which derives all binary paths on the fly from a single shared full-precision weight and combines this with function-aware initialization. This structurally enforces a residual hierarchy, allowing 2-bit Llama2-7B to outperform strong VQ baselines in a matmul-free architecture for the first time (Wiki2 PPL 5.78 vs. QTIP 5.86) while achieving 4.49× inference acceleration.
ReQAT: Achieving Full-Precision Reasoning Accuracy with 4-bit Floating-Point Quantization-Aware Training: This paper discovers that FP4 quantization failures in large reasoning models (LRMs) are concentrated on "low-entropy tokens" (deterministic symbolic commitments like numbers and operators). It proposes ReQAT—a three-component toolkit (Trajectory-Aligned QAT + Selective Entropy Minimization + Quantization-Friendly Initialization for KV cache) specifically targeting these tokens. Under full W4A4KV4 quantization, ReQAT not only matches but even surpasses BF16 fine-tuning accuracy while achieving up to 3.9× throughput speedup.
ReSpinQuant: Efficient Layer-Wise LLM Quantization via Subspace Residual Rotation Approximation: ReSpinQuant preserves the dual advantages of low-bit LLM PTQ: "global rotations fused with weights" and "layer-wise rotations adaptable to outliers." It replaces the non-fusable rotation transition matrix \(\mathbf{T}=\mathbf{R}_{out}\mathbf{R}_{in}^{\top}\) at residual connections with a subspace orthogonal approximation of rank \(r\!\approx\!32\). This increases online overhead by only \(\sim0.2\%\), while outperforming SpinQuant and FlatQuant on W4A4/W3A3 tasks.
RQ-MoE: Residual Quantization via Mixture of Experts for Efficient Input-Dependent Vector Compression: RQ-MoE introduces a "two-level MoE + dual-stream quantization" design, enabling the codebook in Residual Quantization (RQ) to be dynamically generated per input. By decoupling the instruction stream from the reconstruction stream, it achieves 6–14× decoding acceleration while matching or exceeding QINCo's MSE/Recall performance across four retrieval benchmarks.
Saliency-Aware Model Merging: SA-Merging adapts the SynFlow connectivity score from structured pruning to data-free model merging scenarios. For each expert's task vector, it computes "end-to-end path sensitivity × aggregation direction consistency" as the saliency, iteratively removing updates with low saliency. This pushes data-free merging performance close to test-time adaptation levels across vision, language, and LoRA multi-task benchmarks.
ScaLoRA: Optimally Scaled Low-Rank Adaptation for Efficient High-Rank Fine-Tuning: The authors prove that LoRA's cumulative updates are trapped in a fixed low-rank subspace and propose ScaLoRA: after merging the old \(AB^\top\) into \(W^{pt}\) at each step, the adapter is restarted using an analytically derived optimal "column scaling". This allows the first and second moments of AdamW to be transferred equivariantly in \(O((m+n)r)\) (eliminating the need for resets or warm-ups), enabling cumulative updates to naturally achieve high rank. ScaLoRA consistently outperforms LoRA, MoRA, HiRA, ReLoRA, and LoRA-GA on DeBERTaV3, LLaMA2-7B, LLaMA3-8B, and Gemma3-12B.
Selective Coupling of Decoupled Informative Regions: Masked Attention Alignment for Data-Free Quantization of Vision Transformers: MaskAQ redefines Data-Free Quantization (DFQ) for ViTs as "aligning the attention of the full-precision model \(P\) and quantized model \(Q\) on sparse informative regions of synthetic samples." By decoupling foreground patches through differential entropy maximization, aligning attention with adaptive masks, and utilizing periodic refreshing to let samples evolve with \(Q\), MaskAQ improves ImageNet Top-1 accuracy by 3.1% over the previous SOTA on 3-bit DeiT-T.
Semantic Cache Distillation: Efficient State Transfer via Reuse and Selective Patching: Targeting the disaggregated LLM serving scenario where a "base model acts as the producer and a fine-tuned model acts as the consumer," this paper proposes SCD: replacing raw KV Cache transferred across devices with offline-learned low-rank semantic codes. Most layers use "Reuse" for bandwidth-saving low-rank reconstruction, while a few key layers use "Patch" to recompute pre-normalization inputs to truncate error accumulation. At 200 Gbps bandwidth, it achieves a 2.65× TTFT speedup relative to the Oracle with an F1 drop of less than 3%.
Semantic Integrity Matters: Benchmarking and Preserving High-Density Reasoning in KV Cache Compression: This paper first introduces a new benchmark, KVFundaBench, to systematically reveal the critical asymmetry where "retrieval-based long contexts are easy to compress, while reasoning-based ones are not." The authors attribute this to KV compression destroying the integrity of "semantic units" (few-shot examples). Consequently, they propose ShotKV—preserving entire shots as indivisible units during the prefill phase and performing dynamic token-level compression during the decoding phase. This approach improves LG-GSM8K performance from a baseline of 46.0 to 47.33 at a 40% compression rate and reduces end-to-end latency by 11.3% in long-input settings.
SPEED-Bench: A Unified and Diverse Benchmark for Speculative Decoding: SPEED-Bench is a unified benchmark for Speculative Decoding (SD). By combining a Qualitative split (880 samples maximizing semantic diversity) and a Throughput split (large-batch data organized by 1k–32k input length buckets across three entropy levels) with a measurement framework interfacing with vLLM / TensorRT-LLM / SGLang, it reveals the actual deployment behavior often obscured by "small data + single batch + HuggingFace" evaluations in previous SD papers.
SSR-Merge: Subspace Signal Routing for Training-Free LoRA Merging in Diffusion Models: The authors reconceptualize LoRA merging from "arithmetic in parameter space" to "routing internal signals within a unified subspace." By concatenating LoRAs along the rank dimension and inserting a router \(R=\mathbf{Q}\mathbf{G}^{-1}\) constructed via second-order statistics (de-correlation + directional guidance), they achieve a solution theoretically equivalent to the Ordinary Least Squares (OLS) optimum. This method is training-free, supports streaming updates, has zero inference overhead, and significantly outperforms SOTA methods like TIES/DARE on FLUX.1-dev.
SURGE: Surrogate Gradient Adaptation in Binary Neural Networks: SURGE connects a "full-precision auxiliary branch" in parallel with each binarization layer. While the forward output remains unchanged, the backward pass propagates an additional "non-STE truncated" high-order gradient from the full-precision branch. By using AGS to dynamically balance the contributions based on the gradient norm ratio, SURGE achieves 62.0% top-1 on ResNet-18/ImageNet, outperforming ReCU by 1.0% and IR-Net by 3.9%.
Swift-SVD: Theoretical Optimality Meets Practical Efficiency in Low-Rank LLM Compression: Addressing the dilemma where existing SVD low-rank compression either yields suboptimal reconstruction errors or achieves optimality at the cost of slow and numerically unstable Cholesky + multiple SVD operations, this paper proves a closed-form spectral solution theorem—optimal activation-aware compression is achieved by a single eigen-decomposition of \(Y=XW\). Combined with incremental covariance aggregation and dynamic rank allocation driven by the "negative correlation between layer importance and local compressibility," the proposed method achieves state-of-the-art compression accuracy across 6 LLMs and 8 datasets while accelerating end-to-end compression by 3–70×.
Task-Driven Subspace Decomposition for Knowledge Sharing and Isolation in LoRA-based Continual Learning: LoDA decomposes the LoRA down-projection matrix into a shared universal subspace and a task-specific isolation subspace based on "projection energy." It utilizes gradient alignment optimization (GAO) to train up-projections and applies a closed-form feature recalibration during fusion, consistently outperforming existing LoRA-CL methods across multiple benchmarks.
The Bridge-Garden Dilemma in LLM Distillation: Why Mixing Hard and Soft Labels Works: The authors find that a "linear mixture of soft and hard labels" consistently outperforms pure soft labels in LLM distillation. They shift the explanation from "hard labels facilitate optimization" to "hard labels suppress exposure bias." Using Bridge-Garden decomposition, they categorize sequence positions into "Bridges" (requiring precision) and "Gardens" (allowing flexibility), tying the mixing coefficient to contextual risk. They propose four adaptive hybrid strategies that surpass mainstream on-policy/divergence-based KD across seven teacher-student pairs with a 9.7× training cost advantage.
The Shape of Addition: Geometric Structures of Arithmetic in Large Language Models: The authors discover that activations in the final layer residual stream of Qwen3-4B are organized into a hierarchical manifold of "digit basins × carry fibers" during multi-operand addition. "Off-by-one" errors are reinterpreted as geometric slippage across quantization thresholds of a continuous carry potential along Isoremainder-sum Trajectories (IRST). Based on this, a dual-stream consistency check is proposed to correct off-by-one errors—where the model "internally knows" the truth but outputs the wrong token—during inference.
ToaSt: Token Channel Selection and Structured Pruning for Efficient ViT: ToaSt "decouples" ViT compression into two targeted strategies: for Multi-Head Self-Attention (MHSA), which accounts for less than 40% of FLOPs, it employs coupled per-head structured weight pruning to preserve the mathematical integrity of attention; for the Feed-Forward Network (FFN), which accounts for over 60% of FLOPs, it uses a training-free, plug-and-play "Token Channel Selection (TCS)" during inference to filter redundant noisy channels. This achieves a superior accuracy–efficiency trade-off across nine ViT models, such as 88.52% Top-1 (+1.64%p) on ViT-MAE-Huge while reducing FLOPs by 39.4%.
Token Sparse Attention: Efficient Long-Context Inference with Interleaved Token Selection: The authors observe that token "importance" fluctuates drastically across layers and heads, so traditional one-time token eviction commits to an irreversible decision too early. They propose Token Sparse Attention, where each layer and attention head independently selects \(L' \ll L\) tokens for dense attention. The output is then scattered back to the original sequence length, coupled with a residual path that allows skipped tokens to be re-evaluated in subsequent layers. This mechanism maintains dynamic layer/head selection while remaining compatible with dense kernels like FlashAttention, achieving a 3.23× attention speedup at 128K context with <1% accuracy loss when combined with FlexPrefill.
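The select, attend, scatter pattern can be sketched for a single head in NumPy. The importance score used here (attention from the last query) is a hypothetical choice, causal masking is omitted, and the paper's actual selection rule and kernels are not reproduced.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def token_sparse_attention(x, Wq, Wk, Wv, keep):
    """One head, non-causal: dense attention over `keep` selected tokens, scattered back to length L."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    d = q.shape[-1]
    # Hypothetical importance score: attention that the last query pays to each key.
    scores = softmax(q[-1:] @ k.T / np.sqrt(d))[0]
    idx = np.sort(np.argsort(scores)[-keep:])                    # positions of the selected tokens
    attn = softmax(q[idx] @ k[idx].T / np.sqrt(d)) @ v[idx]      # dense attention on L' tokens only
    out = np.zeros_like(x)
    out[idx] = attn                                              # scatter back to the full sequence
    return x + out, idx                                          # residual path carries skipped tokens forward

rng = np.random.default_rng(0)
L, d, keep = 32, 8, 6
x = rng.normal(size=(L, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))

y, idx = token_sparse_attention(x, Wq, Wk, Wv, keep)
skipped = np.setdiff1d(np.arange(L), idx)
print("selected positions:", idx)
```

Skipped tokens leave the layer unchanged through the residual connection, so a later layer or another head can still select them, which is what distinguishes this scheme from permanent eviction.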
Toward Understanding Adversarial Distillation: Why Robust Teachers Fail: This paper identifies a "robustly unlearnable set" that remains stable across various adversarial training methods. Through the feature learning theory of two-layer networks, it proves that when a strongly robust teacher provides high-confidence supervision on these samples, it forces the student to memorize pseudo-noise, thereby triggering robust overfitting. Conversely, maintaining high entropy on these samples suppresses noise gradients. Based on this, a teacher selection criterion based on the predictive entropy of unlearnable samples is proposed.
Towards Resource-Efficient LLMs: End-to-End Energy Accounting of Distillation Pipelines: The authors developed a staged GPU energy collection framework based on NVML, decomposing the distillation pipeline into "teacher side + student side + evaluation" for stepwise measurement. Findings indicate that for one-off runs, teacher logit caching and synthetic data generation represent the primary energy costs, causing KD and synthetic SFT to consume approximately \(2.4\times\) more energy than direct SFT on 1B–13B OLMo-2 students. A closed-form break-even formula is provided, showing that distillation only becomes "energy-efficient" when teacher outputs are reused more than \(N^*\) times.
Towards Steering without Sacrifice: Principled Training of Steering Vectors for Prompt-only Interventions: The authors derive a scaling constraint for the joint training of steering vector factors and directions, \(\eta_{\mathbf{v}}\eta_{\alpha}=\Theta(1)\), using infinite-width neural network scaling theory, thereby eliminating manual \(\alpha\) selection during inference. Simultaneously inspired by ReFT, they implement additive interventions only on the first 4 prompt tokens (PrOSV). This approach maintains model utility while consistently outperforming full-sequence FSSV across three scales of Gemma2 and Qwen2.5 models on AxBench.
T3S: Training Trajectory-Aware Token Selection, Breaking the "Imitation Shock" in Reasoning Distillation: This paper discovers a universal "Imitation Shock" when a strong student (e.g., Qwen3-8B) is distilled from DeepSeek-R1—where loss decreases monotonically, but accuracy first plunges before recovering. The root cause is that "Imitation-Anchor Tokens" dominate optimization in early stages, suppressing the tokens truly responsible for reasoning. T3S identifies anchor tokens via training trajectories and masks them, allowing yet-to-learn reasoning tokens to be learned earlier. This achieves performance gains in both AR and dLLM settings (Qwen3-8B surpasses DeepSeek-R1, Qwen3-32B approaches Qwen3-235B, and LLaDA-2.0-Mini achieves a 16B no-think SOTA by surpassing AR baselines).
Turning Stale Gradients into Stable Gradients: Coherent Coordinate Descent with Implicit Landscape Smoothing for Lightweight Zeroth-Order Optimization: This paper stores "stale" block-cyclic coordinate descent gradient estimates in a FIFO buffer, reuses them with momentum decay, and proves this is equivalent to BCCD with a warm-start. Simultaneously, it provides the counter-intuitive conclusion that a larger finite-difference step size \(\epsilon\) implicitly smooths the loss landscape and reduces the effective Lipschitz constant, allowing stale gradients to achieve stable descent.
TWLA: Achieving Ternary Weights and Low-Bit Activations for LLMs via Post-Training Quantization: TWLA is the first post-training quantization (PTQ) framework capable of simultaneously compressing weights to 1.58-bit (ternary) and activations to 4-bit. By employing a "two-stage ternary calibration from Euclidean to manifold + Kronecker orthogonal rotation for tri-modal weight shaping and outlier suppression + inter-layer aware activation mixed-precision allocation" trio, it maintains high accuracy under W1.58A4 and achieves true end-to-end inference acceleration.
UB-SMoE: Universally Balanced Sparse Mixture-of-Experts for Resource-Adaptive Federated Fine-tuning of Foundation Models: The authors observe that directly applying Sparse MoE to heterogeneous federated LoRA fine-tuning leads to two critical issues: "expert utilization imbalance" and "non-differentiability of Top-K". They propose Dynamic Modulated Routing (DMR) to rebalance expert activation and Universal Pseudo-Gradient (PG) to provide signals for inactive experts, forming a self-reinforcing cycle. This allows low-compute clients to achieve an 8.7× performance improvement while saving 45% of computation.
Unifying Dataset Pruning and Distillation for Efficient Large-scale Compression: This paper first exposes the illusion that "Dataset Distillation (DD) outperforms Pruning" using a unified dataset compression benchmark, revealing that DD's gains stem primarily from soft labels rather than synthetic images. It then proposes the hard-label-only PCA (Prune-Combine-Augment) framework, which significantly outperforms existing DD and dataset pruning (DP) methods under extreme compression ratios on ImageNet-1K while eliminating the soft labels that occupy 40× the storage of images.
UniSVQ: 2-bit Unified Scalar-Vector Quantization: UniSVQ unifies Scalar Quantization (SQ) and Vector Quantization (VQ) through an "affine transformation of integer lattices." This yields a 2-bit Post-Training Quantization (PTQ) scheme that achieves VQ-level accuracy with only 20 extra parameters per weight matrix, while maintaining the integer operator structure and inference throughput of SQ.
When Shared Knowledge Hurts: Spectral Over-Accumulation in Model Merging: This paper identifies that model merging suffers not only from task conflicts but also from the redundant accumulation of shared spectral directions into excessively large singular values. It proposes Singular Value Calibration (SVC), a training-free and data-free method that recalibrates singular values without altering singular vectors, consistently improving merging performance across vision and language tasks.
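The over-accumulation effect is easy to reproduce on toy matrices. The sketch below is not the paper's SVC procedure: `clip_spectrum` is a deliberately naive stand-in that caps singular values at a chosen threshold, and both function names are hypothetical. It only illustrates that summing task updates that share a direction inflates the top singular value, and that the spectrum can be rescaled while the singular vectors stay fixed.

```python
import numpy as np


def top_singular_value(m):
    """Largest singular value of a matrix."""
    return float(np.linalg.svd(m, compute_uv=False)[0])


def clip_spectrum(merged, cap):
    """Naive stand-in for calibration: cap singular values, keep singular vectors."""
    u, s, vt = np.linalg.svd(merged, full_matrices=False)
    return (u * np.minimum(s, cap)) @ vt


def make_task_update(rng, shared_direction, scale=5.0, noise=0.1):
    """Toy task update: a shared rank-1 component plus small task-specific noise."""
    n = shared_direction.shape[0]
    return scale * shared_direction + noise * rng.normal(size=(n, n))
```

Adding two such updates roughly doubles the singular value along the shared direction, which is the kind of inflation the summary describes; capping at the largest per-task value is one crude way to undo it without any data or training.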
WinQ: Accelerating Quantization-Aware Training of Language Models Around Saddle Points: WinQ attributes the slow convergence of low-bit language model QAT to weights being trapped near low-curvature saddle points. By utilizing periodic weight-quantization interpolation for re-initialization and noise perturbation for gradients, it accelerates 1-2 bit QAT by 1.5-4x with almost no additional training overhead, improving perplexity and zero-shot accuracy across various LLaMA/Qwen quantization configurations under the same training budget.
WUSH: Near-Optimal Adaptive Transforms for LLM Quantization: WUSH derives closed-form, data-adaptive blockwise linear transforms for low-bit weight-activation quantization of LLMs. It combines the Hadamard transform's ability to spread values evenly with second-order statistics of weights and activations, significantly improving accuracy in W4A4 (especially MXFP4) scenarios with almost no sacrifice in FP4 kernel throughput.
xKV: Cross-Layer KV-Cache Compression via Aligned Singular Vector Extraction: xKV discovers that while the per-token cosine similarity of KV-caches across different LLM layers is low, their principal singular vectors are highly aligned. Consequently, xKV utilizes a cross-layer shared low-rank basis to compress multiple layers of KV-cache simultaneously. Combined with selective reconstruction, it achieves up to 8x compression and a 4.23x increase in end-to-end throughput for long-context inference.
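A minimal sketch of sharing one low-rank factor across layers follows. It assumes the layers' caches are concatenated along the feature axis and factored with a single truncated SVD; the function names, the choice of which factor is shared, and the equal per-layer dimensions are assumptions for illustration, and the selective reconstruction step from the summary is not modeled.

```python
import numpy as np


def compress_layers(kv_layers, rank):
    """Jointly compress same-shaped [tokens, dim] caches from several layers.

    Returns one shared [tokens, rank] factor and one [rank, dim] map per layer.
    """
    stacked = np.concatenate(kv_layers, axis=1)            # [tokens, layers * dim]
    u, s, vt = np.linalg.svd(stacked, full_matrices=False)
    shared_factor = u[:, :rank] * s[:rank]                 # shared by all layers
    layer_maps = np.split(vt[:rank], len(kv_layers), axis=1)
    return shared_factor, layer_maps


def reconstruct_layer(shared_factor, layer_map):
    """Rebuild one layer's cache from the shared factor and its own map."""
    return shared_factor @ layer_map


def compression_ratio(kv_layers, rank):
    """Stored element count before compression divided by the count after."""
    tokens, dim = kv_layers[0].shape
    original = tokens * dim * len(kv_layers)
    compressed = tokens * rank + rank * dim * len(kv_layers)
    return original / compressed
```

The saving comes from storing the token-side factor once instead of once per layer, which only pays off when the layers' dominant singular directions really are aligned.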
ZipMoE: Efficient On-Device MoE Serving via Lossless Compression and Cache-Affinity Scheduling: ZipMoE targets large MoE model inference on mobile and edge devices. It decomposes BF16 expert parameters into compressible exponent bits and high-entropy sign-mantissa bits. Through lossless compression, hierarchical caching, and cache-affinity scheduling, it turns the expert loading process, previously bottlenecked by SSD I/O, into a parallelized decompression and reconstruction pipeline whose cost is hidden by multi-core CPUs. This reduces latency and increases throughput without altering model semantics.
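The exponent versus sign-mantissa split can be illustrated directly on BF16 bit patterns (1 sign bit, 8 exponent bits, 7 mantissa bits). The sketch below is not ZipMoE's codec: the function names are hypothetical and `zlib` is used only as a convenient stand-in for whatever lossless compressor the system employs.

```python
import zlib

import numpy as np


def float32_to_bf16_bits(weights):
    """Truncate float32 values to BF16 by keeping the upper 16 bits."""
    return (np.asarray(weights, dtype=np.float32).view(np.uint32) >> 16).astype(np.uint16)


def split_bf16(bits):
    """Split BF16 bit patterns into an exponent byte and a sign+mantissa byte."""
    exponent = ((bits >> 7) & 0xFF).astype(np.uint8)
    sign_mantissa = (((bits >> 8) & 0x80) | (bits & 0x7F)).astype(np.uint8)
    return exponent, sign_mantissa


def merge_bf16(exponent, sign_mantissa):
    """Inverse of split_bf16: rebuild the original 16-bit patterns exactly."""
    e = exponent.astype(np.uint16)
    sm = sign_mantissa.astype(np.uint16)
    return ((sm & 0x80) << 8) | (e << 7) | (sm & 0x7F)


def compressed_sizes(bits):
    """Compressed byte counts of the exponent and sign+mantissa streams."""
    exponent, sign_mantissa = split_bf16(bits)
    return (len(zlib.compress(exponent.tobytes())),
            len(zlib.compress(sign_mantissa.tobytes())))
```

For weights drawn from a narrow distribution, the exponent stream takes only a handful of distinct values and compresses well, while the sign-mantissa stream is close to incompressible. That asymmetry is why the two streams are treated separately.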