📦 Model Compression

🧪 ICML2025 · 74 paper notes

🔥 Top topics: LLM ×17 · Model Compression ×15 · Compression ×8 · Knowledge Distillation ×4 · Continual Learning ×4

A Cross Modal Knowledge Distillation & Data Augmentation Recipe for Improving Transcriptomics Representations through Morphological Features

Proposes Semi-Clipped (a CLIP-based cross-modal distillation method) and PEA (Perturbation Embedding Augmentation). In weakly paired data scenarios, these methods distill rich morphological features from microscopy images into transcriptomics representations, significantly improving their predictive power while maintaining the interpretability of gene expression.

A Mathematical Framework for AI-Human Integration in Work

This paper proposes a mathematical framework for evaluating AI-human work integration, decomposing skills into decision-level and execution-level sub-skills. It theoretically proves that the probability of work success exhibits a phase transition effect, and that merging complementary skills can yield super-additive gains. It also mathematically explains the "productivity compression" phenomenon where low-to-medium skilled workers benefit more from GenAI assistance, validating the framework using O*NET and Big-bench Lite data.

ABKD: Pursuing a Proper Allocation of the Probability Mass in Knowledge Distillation via α-β-Divergence

This paper provides an in-depth analysis of the probability mass allocation deficiencies of FKLD and RKLD in knowledge distillation, finding that they represent extremes in two effects: Hardness-Concentration and Confidence-Concentration. Based on this, the ABKD framework utilizing \(\alpha\)-\(\beta\)-divergence is proposed to flexibly balance these two effects by tuning \(\alpha\) and \(\beta\), achieving SOTA performance across 17 language/vision datasets and 12 teacher-student configurations.
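
To make the role of \(\alpha\) and \(\beta\) concrete, the following is a minimal sketch of the \(\alpha\)-\(\beta\)-divergence in its standard textbook form, \(D_{AB}^{(\alpha,\beta)}(p\|q) = -\frac{1}{\alpha\beta}\sum_k \left(p_k^{\alpha} q_k^{\beta} - \frac{\alpha}{\alpha+\beta} p_k^{\alpha+\beta} - \frac{\beta}{\alpha+\beta} q_k^{\alpha+\beta}\right)\). Whether ABKD uses exactly this parameterization is an assumption, and the function names are illustrative only.

```python
import math

def ab_divergence(p, q, alpha, beta):
    """Alpha-beta divergence between two discrete distributions.

    Assumes alpha, beta and alpha + beta are all non-zero and that every
    probability is strictly positive.
    """
    s = alpha + beta
    total = 0.0
    for pk, qk in zip(p, q):
        total += (pk ** alpha) * (qk ** beta)
        total -= alpha / s * pk ** s
        total -= beta / s * qk ** s
    return -total / (alpha * beta)

def forward_kl(p, q):
    """Forward KL divergence KL(p || q), used here only as a reference."""
    return sum(pk * math.log(pk / qk) for pk, qk in zip(p, q))

teacher = [0.7, 0.2, 0.1]
student = [0.5, 0.3, 0.2]

# alpha = 1 with beta close to 0 approaches the forward KL divergence.
near_fkl = ab_divergence(teacher, student, 1.0, 1e-6)
# alpha = beta = 1 reduces to half the squared Euclidean distance.
half_sq = ab_divergence(teacher, student, 1.0, 1.0)
```

Because the formula is symmetric under swapping \((\alpha, p)\) with \((\beta, q)\), taking \(\alpha \to 0\) with \(\beta = 1\) gives the reverse KL divergence, so intermediate values interpolate between the two extremes the paper analyzes.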

An Efficient Matrix Multiplication Algorithm for Accelerating Inference in Binary and Ternary Neural Networks

Proposes the RSR/RSR++ algorithms. By preprocessing fixed binary/ternary weight matrices into bucketed permutation indices, they perform vector-matrix multiplication in \(O(n^2/\log n)\) time, yielding up to 29× faster matrix multiplication and 6× memory savings over the standard \(O(n^2)\) method, as well as a 5.24× speedup in 1.58-bit LLM inference.
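
The bucketing idea can be illustrated with a simplified sketch for a binary matrix. This is not the authors' RSR/RSR++ implementation: `preprocess` and `multiply` are hypothetical names, and the sketch only shows how grouping rows that share the same bit pattern within a block of `k` columns lets the per-input work be reduced to one sum per bucket.

```python
def preprocess(A, k):
    """Group row indices of binary matrix A by their bit pattern in each k-column block."""
    n_cols = len(A[0])
    blocks = []
    for start in range(0, n_cols, k):
        width = min(k, n_cols - start)
        buckets = {}
        for i, row in enumerate(A):
            pattern = tuple(row[start:start + width])
            buckets.setdefault(pattern, []).append(i)
        blocks.append((start, buckets))
    return blocks

def multiply(v, blocks, n_cols):
    """Compute v @ A using the precomputed buckets instead of the raw matrix."""
    out = [0] * n_cols
    for start, buckets in blocks:
        for pattern, rows in buckets.items():
            # One sum per bucket: every row in it contributes identically.
            bucket_sum = sum(v[i] for i in rows)
            for j, bit in enumerate(pattern):
                if bit:
                    out[start + j] += bucket_sum
    return out

A = [
    [1, 0, 1, 1],
    [0, 0, 1, 0],
    [1, 0, 1, 1],
    [1, 1, 0, 0],
]
v = [2, -1, 3, 5]
blocks = preprocess(A, k=2)   # done once, since the weights are fixed
result = multiply(v, blocks, n_cols=4)
```

Each block has at most \(2^k\) distinct patterns, so choosing `k` on the order of \(\log n\) keeps the number of buckets per block small relative to \(n\); that is the intuition behind the sub-quadratic bound.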

any4: Learned 4-bit Numeric Representation for LLMs

This paper proposes any4, a method that learns the optimal 4-bit non-uniform quantization codebook for each row of the weight matrix via k-means clustering. Without requiring weight/activation preprocessing, any4 outperforms int4/fp4/nf4 on Llama 2/3, Mistral, and Mixtral, using only a single calibration sample.
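
As a rough illustration of a per-row learned codebook, the sketch below runs plain one-dimensional k-means with 16 centroids on a single weight row. It is a simplification under stated assumptions: the quantile initialization, the iteration count, and the function names are illustrative, and the calibration sample mentioned above is not modeled here.

```python
import random

def kmeans_1d(values, k=16, iters=25):
    """Plain Lloyd's algorithm on scalars, initialised at evenly spaced quantiles."""
    ordered = sorted(values)
    n = len(ordered)
    centroids = [ordered[(2 * i + 1) * n // (2 * k)] for i in range(k)]
    for _ in range(iters):
        sums = [0.0] * k
        counts = [0] * k
        for x in values:
            j = min(range(k), key=lambda c: abs(x - centroids[c]))
            sums[j] += x
            counts[j] += 1
        # Keep the old centroid when a cluster receives no points.
        centroids = [sums[j] / counts[j] if counts[j] else centroids[j]
                     for j in range(k)]
    return centroids

def quantize_row(row, k=16):
    """Return 4-bit codes (for k=16) and the row's own lookup table."""
    codebook = kmeans_1d(row, k)
    codes = [min(range(k), key=lambda c: abs(x - codebook[c])) for x in row]
    return codes, codebook

rng = random.Random(0)
row = [rng.gauss(0.0, 1.0) for _ in range(256)]
codes, codebook = quantize_row(row)
dequantized = [codebook[c] for c in codes]
mse = sum((a - b) ** 2 for a, b in zip(row, dequantized)) / len(row)
```

Storing one 16-entry table per row adds a small overhead on top of the 4-bit codes, which is the price paid for letting each row choose its own non-uniform levels.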

BECAME: BayEsian Continual Learning with Adaptive Model MErging

This paper proposes BECAME, which reformulates the model merging mechanism based on Bayesian continual learning principles. It derives a closed-form solution for the optimal merging coefficient using Laplace approximation. Combining gradient projection (stability) and unconstrained training (plasticity) into a two-stage framework, it significantly outperforms SOTA on multiple continual learning benchmarks.

Best Subset Selection: Optimal Pursuit for Feature Selection and Elimination

This paper revisits the feature selection/elimination criteria in classical best subset selection from an optimization perspective. It reveals that traditional criteria (correlation selection + Wald-T elimination) only capture "one-step changes" in the objective function while neglecting feature interactions. Consequently, the authors propose "objective-aware" optimal selection and elimination criteria to enhance classic algorithms such as OMP, CoSaMP, and (A)BESS as a plug-and-play Meta-Substitution. This achieves significant performance improvements in compressed sensing and sparse regression tasks without increasing computational complexity.

Beyond Communication Overhead: A Multilevel Monte Carlo Approach for Mitigating Compression Bias in Distributed Learning

This paper proposes a gradient compression scheme based on Multilevel Monte Carlo (MLMC), which constructs statistically unbiased gradient estimators from biased compressors, turning compression bias into manageable variance. This allows the approach to enjoy the theoretical guarantees of unbiased methods while maintaining the empirical efficiency of biased compressors. Combined with adaptive probability optimization, its superiority is validated on BERT fine-tuning and CIFAR-10.
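
The telescoping argument behind an MLMC estimator can be sketched as follows. Assume a ladder of compressors \(C^0, \dots, C^L\) with \(C^0 = 0\) and \(C^L\) the identity; sampling level \(l\) with probability \(p_l\) and sending \((C^l(g) - C^{l-1}(g))/p_l\) has expectation \(\sum_l (C^l(g) - C^{l-1}(g)) = g\). The Top-k ladder, the fixed probabilities, and all names below are illustrative assumptions, not the paper's exact construction.

```python
import random

def top_k(g, k):
    """Biased compressor: keep the k largest-magnitude entries, zero the rest."""
    if k <= 0:
        return [0.0] * len(g)
    order = sorted(range(len(g)), key=lambda i: abs(g[i]), reverse=True)
    keep = set(order[:k])
    return [g[i] if i in keep else 0.0 for i in range(len(g))]

def mlmc_term(g, level, ks, probs):
    """Value of the estimator when `level` (1-based) is the sampled level."""
    fine = top_k(g, ks[level])
    coarse = top_k(g, ks[level - 1])
    return [(f - c) / probs[level - 1] for f, c in zip(fine, coarse)]

def mlmc_estimate(g, ks, probs, rng):
    """Draw one level at random and return the corresponding correction term."""
    level = rng.choices(range(1, len(ks)), weights=probs)[0]
    return mlmc_term(g, level, ks, probs)

g = [0.5, -2.0, 1.0, 0.1]
ks = [0, 1, 2, 4]          # ks[0] = 0 gives the zero vector, ks[-1] = len(g) gives g
probs = [0.5, 0.3, 0.2]    # fixed here; the paper optimizes them adaptively

# Exact expectation over the level distribution, which recovers g.
expectation = [
    sum(probs[l - 1] * mlmc_term(g, l, ks, probs)[i] for l in range(1, len(ks)))
    for i in range(len(g))
]
sample = mlmc_estimate(g, ks, probs, random.Random(0))
```

Each transmitted term is a difference of two sparse vectors, so it stays cheap to send, while the division by \(p_l\) moves the compression bias into variance.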

Beyond Zero Initialization: Investigating the Impact of Non-Zero Initialization on LoRA Fine-Tuning Dynamics

This work theoretically analyzes and experimentally validates, from an infinite-width perspective, that initializing both the A and B matrices of LoRA as non-zero (Init[AB]), compared to the traditional zero initialization (Init[A]), significantly enhances robustness to suboptimal learning rates. Furthermore, the introduced random noise does not impair the fine-tuning performance—meaning that fine-tuning does not strictly need to start from the pre-trained model.

BlockDialect: Block-wise Fine-grained Mixed Format Quantization for Energy-Efficient LLM Inference

Proposes BlockDialect—a block-wise fine-grained mixed-format quantization method for weights and activations. It selects the optimal numerical format for each block from a formatbook of FP4 variants (dialects), improving accuracy on LLaMA3-8B by 10.78% compared to MXFP4, and remaining only 5.45% below full precision.

BoA: Attention-aware Post-training Quantization without Backpropagation

This paper proposes BoA—the first backpropagation-free algorithm for post-training quantization that accounts for cross-layer dependencies. By constructing an attention-aware Hessian matrix, it captures inter-layer interactions within the attention module, significantly outperforming existing PTQ methods at ultra-low bit-widths (INT2).

Bring Reason to Vision: Understanding Perception and Reasoning through Model Merging

By directly performing weighted averaging of the parameters of a mathematical reasoning LLM and the text components of a VLM (model merging), reasoning capability is transferred to the VLM without any training. Furthermore, a layer-wise distribution pattern is discovered: perception capabilities are concentrated in the early layers, while reasoning capabilities are concentrated in the middle and late layers.
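
A minimal sketch of this kind of weighted parameter averaging is shown below, assuming both models expose matching parameter names for the shared text components. The dictionary layout, the parameter names, and the coefficient `lam` are hypothetical.

```python
def merge_weights(vlm_params, reasoning_params, lam):
    """Interpolate parameters shared by both models; copy the rest unchanged.

    lam = 0 keeps the VLM weights, lam = 1 takes the reasoning LLM weights.
    """
    merged = {}
    for name, w_vlm in vlm_params.items():
        if name in reasoning_params:
            w_llm = reasoning_params[name]
            merged[name] = [(1 - lam) * a + lam * b for a, b in zip(w_vlm, w_llm)]
        else:
            # Components without a counterpart (e.g. a vision encoder) are untouched.
            merged[name] = list(w_vlm)
    return merged

vlm = {"text.layer0.w": [1.0, 2.0], "vision.patch_embed.w": [0.3, 0.4]}
reasoner = {"text.layer0.w": [3.0, 6.0]}
merged = merge_weights(vlm, reasoner, lam=0.25)
```

No gradient step is involved, which is why the transfer counts as training-free.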

Come Together, But Not Right Now: A Progressive Strategy to Boost Low-Rank Adaptation

Proposes CoTo (Come Together), a progressive training strategy that randomly deactivates LoRA adapters during the early phase of fine-tuning, with the activation probability increasing linearly from 0 to 1, to encourage an even gradient distribution across all layers. It theoretically guarantees dropout stability and linear mode connectivity (LMC). Empirical results demonstrate simultaneous improvements in single-task generalization, multi-task merging, and pruning robustness while reducing training overhead.
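
The linear activation schedule can be sketched as below. The `ramp_fraction` parameter (the share of training over which the probability rises) and the function names are assumptions made for illustration; the summary only states that the probability increases linearly from 0 to 1 during the early phase.

```python
import random

def activation_prob(step, total_steps, ramp_fraction=0.75):
    """Probability that an adapter is active at `step`, rising linearly to 1."""
    ramp_steps = max(1, int(ramp_fraction * total_steps))
    return min(1.0, step / ramp_steps)

def sample_active_adapters(num_adapters, step, total_steps, rng, ramp_fraction=0.75):
    """Independently decide, per adapter, whether it takes part in this step."""
    p = activation_prob(step, total_steps, ramp_fraction)
    return [rng.random() < p for _ in range(num_adapters)]

rng = random.Random(0)
early = sample_active_adapters(8, step=0, total_steps=1000, rng=rng)
late = sample_active_adapters(8, step=900, total_steps=1000, rng=rng)
```

At step 0 every adapter is switched off and after the ramp every adapter is always on, so the final phase of training matches ordinary LoRA fine-tuning.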

ConfPO: Exploiting Policy Model Confidence for Critical Token Selection in Preference Optimization

This paper proposes ConfPO, which identifies preference-critical tokens using the policy model's own confidence scores and optimizes only these tokens without requiring extra models or computational overhead. It consistently outperforms uniformly optimized DAA methods on AlpacaEval 2 and Arena-Hard while alleviating reward hacking.

Context Tuning for In-Context Optimization

This paper proposes Context Tuning, which initializes trainable prompt/KV prefixes using few-shot exemplars and optimizes these context representations (instead of model parameters) via gradient descent to enhance the few-shot adaptation of LLMs. The CT-KV variant achieves accuracy competitive with TTT at linear time complexity.

Core Context Aware Transformers for Long Context Language Modeling

Proposes Core Context Aware (CCA) Attention, which dynamically compresses input tokens into a small number of core tokens through globality-aware pooling, combined with a locality-preserving module to capture adjacent fine-grained information. It achieves plug-and-play replacement of standard self-attention, yielding a 7.9× speedup and 46% GPU memory savings under 128K context while maintaining modeling performance.

Diffusion Sampling Correction via Approximately 10 Parameters

Proposes the PCA-based Adaptive Search (PAS) method, which leverages the geometric property where sampling trajectories reside in a low-dimensional subspace of a high-dimensional space. By extracting a small number of orthogonal basis vectors via PCA and learning only about 10 coordinate parameters, PAS corrects the truncation errors of existing fast samplers. With sub-minute training on a single A100 GPU, it reduces the FID of DDIM on CIFAR10 from 15.69 to 4.37 (NFE=10).

Distilling Tool Knowledge into Language Models via Back-Translated Traces

This paper proposes a multi-agent back-translation pipeline: it first utilizes a Solver Agent to invoke tools (code interpreters) for solving mathematical problems and generating Tool-Integrated Reasoning (TIR) traces, then leverages a Translator Agent and a Rephrase Agent to translate the tool-execution trajectories into pure natural language reasoning chains. Finally, these synthetic data are used to fine-tune small language models, enabling them to internalize tool knowledge and structured reasoning capabilities without requiring tool access at inference time.

DLP: Dynamic Layerwise Pruning in Large Language Models

Proposes a dynamic layerwise pruning method, DLP, which adaptively calculates the relative importance of each layer using the median of weights and activation values. It performs non-uniform pruning based on the principle of "more important layers get lower sparsity ratios", reducing the perplexity of LLaMA2-7B by 7.79 and improving the average zero-shot accuracy by 2.7% at a high sparsity ratio of 70%.

Efficient Logit-based Knowledge Distillation of Deep Spiking Neural Networks for Full-Range Timestep Deployment

A temporally decoupled logit distillation framework is proposed, which exploits the inherent spatio-temporal dynamics of SNNs to decompose the training objective to each timestep. This achieves high-performance deployment of a single model across a full range of inference timesteps without needing to retrain for different timesteps.

Eigenspectrum Analysis of Neural Networks without Aspect Ratio Bias

The paper proposes FARMS (Fixed-Aspect-Ratio Matrix Subsampling) to eliminate the aspect ratio bias in weight eigenspectrum analysis via fixed-aspect-ratio submatrix sampling, thereby significantly improving HT-SR-based layer-wise learning rate allocation and model pruning.

FGFP: A Fractional Gaussian Filter and Pruning for Deep Neural Networks Compression

The FGFP framework is proposed, which integrates fractional calculus with Gaussian functions to construct a fractional Gaussian filter (FGF), requiring only 7 parameters per kernel. In conjunction with adaptive unstructured pruning (AUP), it achieves an 85.2% model compression rate with only a 1.52% accuracy drop for ResNet-20 on CIFAR-10, and a 69.1% compression rate with a 1.63% accuracy drop for ResNet-50 on ImageNet.

FlatQuant: Flatness Matters for LLM Quantization

This paper proposes FlatQuant, which uses learnable affine transformations (via Kronecker decomposition) to flatten weight and activation distributions. This achieves a \(\le 1\%\) accuracy loss on LLaMA-3-70B under W4A4 quantization for the first time, while yielding a \(2.3\times\) speedup in the prefill stage and a \(1.7\times\) speedup during decoding.
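
One reason a Kronecker-structured transform is cheap to apply is the identity \((P_1 \otimes P_2)\,\mathrm{vec}(X) = \mathrm{vec}(P_1 X P_2^{\top})\) for row-major flattening. The sketch below checks this identity on small matrices; it illustrates the general linear-algebra fact only, and the matrices and function names are made up rather than taken from FlatQuant.

```python
def matmul(A, B):
    """Dense matrix product on nested lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def kron(A, B):
    """Kronecker product: entry ((i, j), (k, l)) equals A[i][k] * B[j][l]."""
    return [[a * b for a in row_a for b in row_b] for row_a in A for row_b in B]

def apply_kron(P1, P2, x):
    """Apply (P1 kron P2) to x without ever forming the large matrix."""
    n1, n2 = len(P1), len(P2)
    X = [x[r * n2:(r + 1) * n2] for r in range(n1)]   # row-major reshape to n1 x n2
    P2_T = [list(col) for col in zip(*P2)]
    Y = matmul(matmul(P1, X), P2_T)
    return [y for row in Y for y in row]

P1 = [[1.0, 2.0], [0.0, 1.0]]
P2 = [[1.0, 0.0, 1.0], [0.0, 2.0, 0.0], [1.0, 1.0, 1.0]]
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]

fast = apply_kron(P1, P2, x)
direct = [sum(w * xi for w, xi in zip(row, x)) for row in kron(P1, P2)]
```

For a hidden size \(n = n_1 n_2\), the two small products cost on the order of \(n_1^2 n_2 + n_1 n_2^2\) multiplications instead of \(n^2\), and the transform needs only \(n_1^2 + n_2^2\) parameters.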

FloE: On-the-Fly MoE Inference on Memory-constrained GPU

Proposes FloE, an on-the-fly MoE inference system tailored for consumer-grade GPUs. By designing a specialized intra-expert hybrid compression scheme (contextual sparsification + ultra-low-bit quantization) together with dual predictors, FloE pipelines computation and transmission. It successfully deploys Mixtral-8×7B on a single RTX 3090 with only 11GB VRAM, achieving a 48.7× speedup compared to DeepSpeed-MII with only a 4.4%–7.6% performance degradation.

From Language Models over Tokens to Language Models over Characters

An algorithmic framework is proposed to precisely convert token-level language models into character-level language models. By defining "covering" (minimally covering prefix encoding sets) and approximating it using beam search, user-end issues caused by tokenization, such as the prompt boundary problem, are resolved, while improving the compression rate (bits/byte).

From Logits to Hierarchies: Hierarchical Clustering made Simple

The L2H (Logits to Hierarchies) algorithm is proposed, which utilizes only the logits output from pre-trained flat clustering models. Through masked softmax and iterative merging strategies, it constructs high-quality hierarchical clusterings without fine-tuning, significantly outperforming dedicated deep hierarchical clustering models while running on a CPU in less than a minute on ImageNet-scale datasets.

From Low Rank Gradient Subspace Stabilization to Low-Rank Weights: Observations, Theories, and Applications

By revealing differences in low-rank convergence across LLM weight matrices through Hessian spectral analysis, this work proposes WeLore, a non-uniform low-rank decomposition method that addresses both model compression and parameter-efficient fine-tuning in a unified way.

Function-Space Learning Rates

An efficient Monte Carlo estimation method for layer-wise function-space learning rates is proposed. Based on this, FLeRM (Function-space Learning Rate Matching) is designed to record function-space learning rates on a small model and automatically adjust the parameter-space learning rates of a large model, enabling hyperparameter transfer across width, depth, initialization scale, and LoRA rank.

Generalization Bounds via Meta-Learned Model Representations: PAC-Bayes and Sample Compression Hypernetworks

This paper proposes a hypernetwork-based meta-learning framework to obtain tight generalization bounds for neural networks. Three encoder-decoder architectures are designed (PAC-Bayes encoder, sample compression encoder, and a hybrid encoder). The hybrid approach is based on a new PAC-Bayes Sample Compression theorem supporting continuous messages, explicitly measuring model complexity via an information bottleneck, and obtaining non-vacuous generalization guarantees on both synthetic and real datasets.

GPTAQ: Efficient Finetuning-Free Quantization for Asymmetric Calibration

GPTAQ proposes a tuning-free quantization method featuring asymmetric calibration. By aligning the output of quantized layers with the exact output of the full-precision model (instead of just the current layer's output) and deriving a closed-form solution based on the Optimal Brain Compression (OBC) framework to jointly minimize both quantization and cumulative asymmetric errors, GPTAQ significantly improves the performance of GPTQ in low-bit quantization while adding only about 20 lines of code.

GuidedQuant: Large Language Model Quantization via Exploiting End Loss Guidance

GuidedQuant is proposed, which improves existing SOTA PTQ methods in scalar, vector, and weight-activation quantization as a plug-and-play module by incorporating end-to-end loss gradient information into layer-wise quantization objectives (preserving weight interactions within output channels). Meanwhile, the LNQ algorithm for non-uniform scalar quantization is proposed, reducing the 2-bit Llama-2-7B perplexity from 39.58 to 8.83.

Gumiho: A Hybrid Architecture to Prioritize Early Tokens in Speculative Decoding

Proposes Gumiho, a hybrid draft model architecture for speculative decoding: the first two tokens are generated using sequential Transformers to ensure accuracy, while subsequent tokens are generated using parallel MLP heads to improve efficiency. It further increases the accepted length through a Full Tree Attention mechanism, achieving up to 3.65x speedup on Vicuna/LLaMA.

Instruction-Following Pruning for Large Language Models

IFPruning is proposed: a small sparsity predictor dynamically generates pruning masks based on user instructions, trimming the intermediate dimensions of FFNs on demand. This enables a 9B model activating only 3B parameters to outperform dense models of the same scale by 5-8 percentage points in coding/math, while maintaining inference latency on par with a 3B dense model.

Joker: Joint Optimization Framework for Lightweight Kernel Machines

The Joker framework is proposed, which achieves unified and efficient training of various large-scale kernel models (KRR / KLR / SVM, etc.) with ~2GB of memory through dual block coordinate descent with trust region (DBCD-TR) and random Fourier feature approximation, achieving memory savings of up to 90% without performance loss.

LaCache: Ladder-Shaped KV Caching for Efficient Long-Context Modeling of Large Language Models

This paper proposes a ladder-shaped KV caching pattern that retains KV states of different token ranges across different layers, thereby expanding the capturable context span within a fixed cache budget, and supports infinite-length continuous generation through an iterative compaction mechanism.

Lego Sketch: A Scalable Memory-augmented Neural Network for Sketching Data Streams

This paper proposes Lego Sketch, a scalable memory-augmented neural network (MANN) based on modular "memory bricks". It addresses the scalability bottleneck of existing neural sketches—which require retraining across different data domains and spatial budgets—using normalized multi-hash embedding, scalable memory, and a self-guided weighted loss. It also provides the first theoretical error upper bound for neural sketches.

LIFT the Veil for the Truth: Principal Weights Emerge after Rank Reduction for Reasoning-Focused Supervised Fine-Tuning

It is discovered that the weights with the largest magnitude after low-rank approximation (Principal Weights) are the critical parameters for fine-tuning. LIFT is proposed, which updates only the top 5% of Principal Weights to outperform full-parameter fine-tuning on reasoning tasks while maintaining LoRA-level memory efficiency.

LightGTS: A Lightweight General Time Series Forecasting Model

This paper proposes LightGTS, which leverages the inherent scale-invariant periodic inductive bias of time series. Through two core techniques, Periodical Tokenization and Periodical Parallel Decoding, it achieves SOTA performance in both zero-shot and full-shot settings across 9 benchmark datasets using fewer than 5 million parameters, representing a 10-100x reduction in size compared to existing time series foundation models.

LoRA Fine-Tuning without GPUs: A CPU-Efficient Meta-Generation Framework for LLMs

A meta-generation framework is proposed for efficient LoRA fine-tuning on CPUs, which avoids GPU dependence through pre-computation and caching strategies, making LLM fine-tuning feasible in resource-constrained environments.

Make LoRA Great Again: Boosting LoRA with Adaptive Singular Values and Mixture-of-Experts Optimization Alignment

GOAT significantly enhances LoRA performance without altering the training algorithm or main architecture via "SVD-segmented initialized LoRA-MoE + theoretically derived scaling alignment", achieving SOTA on 25 tasks and substantially narrowing the gap with Full FT.

Merge-Friendly Post-Training Quantization for Multi-Target Domain Adaptation

This paper presents the first systematic analysis of how discretization noise introduced by quantization degrades model merging performance. It proposes HDRQ (Hessian and Distance Regularizing Quantization), which uses Hessian regularization to flatten the loss landscape, distance regularization to align weights across quantized models, and noise-sampling rounding to resolve rounding ambiguity. This allows quantized models to achieve merging performance close to full-precision equivalents in multi-target domain adaptation.

MKA: Memory-Keyed Attention for Efficient Long-Context Reasoning

Memory-Keyed Attention (MKA) is proposed to organize the KV cache into a three-level hierarchical memory (local, session, and long-term), dynamically allocating attention via a learnable routing gate. An accelerated version, FastMKA, fuses the memory sources prior to the attention computation, achieving up to 5 times the training throughput of MLA and reducing decoding latency to 54% of MLA, with only approximately 1% loss in perplexity.

MoRAgent: Parameter Efficient Agent Tuning with Mixture-of-Roles

This paper proposes the Mixture-of-Roles (MoR) framework, which decomposes agent capabilities into three roles: Reasoner, Executor, and Summarizer, with each role allocated a specialized group of LoRAs. It achieves agent performance close to or even exceeding full-parameter fine-tuning while only introducing minimal extra parameters (0.16B–0.36B).

Neutral Residues: Revisiting Adapters for Model Extension

This paper proposes Neutral Residues, which introduces a ReLU gate, an \(\ell_1\) sparse local loss, and low-variance initialization into adapters. This design forces the added residual blocks to output near-zero values on the original distribution, achieving the optimal trade-off between learning a new language and preserving English capabilities on Gemma-2B.

Olica: Efficient Structured Pruning of Large Language Models without Retraining

This paper proposes Olica, a retraining-free framework for structured LLM pruning. By performing orthogonal decomposition (PCA/SVD) on the matrix products of MHA layers and linear calibration (ridge regression closed-form solution + low-rank approximation) on FFN layers, Olica prunes LLaMA-7B in just 7 minutes using 256 samples and 3GB VRAM, outperforming existing methods that require retraining.

OrthoRank: Token Selection via Sink Token Orthogonality for Efficient LLM Inference

This paper proposes OrthoRank, a training-free dynamic token selection method. By utilizing the orthogonality between the sink token and other tokens in the hidden state space to measure token importance, OrthoRank selects the Top-K important tokens for full computation at each layer. Other tokens only participate in KV computation. OrthoRank achieves lower perplexity and higher zero-shot accuracy compared to layer pruning at the same sparsity rate.
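
A hedged sketch of the selection step follows. It assumes the sink token sits at position 0 and that tokens whose hidden states are closer to orthogonal to the sink token (smaller absolute cosine similarity) are ranked as more important; both points, along with the function names, are assumptions of this illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors given as lists."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def select_tokens(hidden, k, sink_index=0):
    """Return the positions of the k tokens most orthogonal to the sink token."""
    sink = hidden[sink_index]
    scored = [(abs(cosine(h, sink)), i)
              for i, h in enumerate(hidden) if i != sink_index]
    scored.sort()                      # smallest |cos| first
    return sorted(i for _, i in scored[:k])

hidden_states = [
    [1.0, 0.0],   # sink token
    [0.9, 0.1],   # nearly aligned with the sink
    [0.0, 1.0],   # orthogonal to the sink
    [0.5, 0.5],
    [0.1, 1.0],   # nearly orthogonal
]
selected = select_tokens(hidden_states, k=2)
```

Only the selected positions would then go through the full layer computation, while the remaining tokens take part in KV computation only, as described above.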

ParallelComp: Parallel Long-Context Compressor for Length Extrapolation

This paper proposes ParallelComp, a training-free parallel long-context compression method. By leveraging parallel KV cache eviction and attention calibration strategies, it enables an 8B LLM to extrapolate from an 8K context window up to 128K tokens on a single A100 GPU.

Parameter-Efficient Fine-Tuning of State Space Models

This work presents the first systematic benchmarking of six PEFT methods on State Space Models (SSMs/Mamba). It reveals that LoRA should be applied to linear projection layers rather than SSM modules, and proposes Sparse Dimension Tuning (SDT) to selectively update key state dimensions for more efficient SSM fine-tuning.

Persistent Topological Features in Large Language Models

This work introduces zigzag persistence from topological data analysis (TDA) to analyze the internal representations of LLMs. By tracking the continuous evolution of topological features of prompts across representation spaces of different layers, it identifies four processing phases and proposes a layer pruning criterion based on topological descriptors, achieving performance comparable to SOTA methods.

Predictive Data Selection: The Data That Predicts Is the Data That Teaches

The PreSelect method is proposed based on the hypothesis that "the data that can predict model capability is the data that can teach the model." By leveraging the rank correlation of multi-model losses to quantify document predictive strength, a fastText classifier is trained for efficient data selection. On a 1B model, training with 30B tokens selected by PreSelect outperforms random selection with 300B tokens, achieving a 10x compute saving.

Q-resafe: Assessing Safety Risks and Quantization-aware Safety Patching for Quantized Large Language Models

This paper systematically evaluates the impact of mainstream quantization methods (AWQ, AQLM, LLM-QAT, QLoRA) on LLM safety across different calibration datasets and bit-widths. It reveals that all quantization methods lead to a dramatic rise in Attack Success Rate (ASR) (from 0.3% to 85%). To address this, the authors propose the Q-resafe framework, which efficiently restores the safety capabilities of quantized models with extremely low computational overhead through safety patch data construction, DPO alignment, and selective safety-critical weight updates.

RADIO: Rate-Distortion Optimization for Large Language Model Compression

Starting from the rate-distortion theory in information theory, RADIO establishes a theoretical foundation for LLM quantization and proposes a concise quantization technique based on rate-distortion optimization. This technique can be scaled to models with hundreds of billions of parameters and allows users to flexibly specify the target model size or accuracy for post-training compression.

Random Initialization of Gated Sparse Adapters (RIGSA)

Proposes RIGSA, a sparse fine-tuning method based on randomly initialized full-rank adapters + ReZero gating + iterative magnitude pruning, which retains source task performance better than QLoRA while learning new tasks.

Rethinking Addressing in Language Models via Contextualized Equivariant Positional Encoding

This paper proposes TAPE (contexTualized equivariAnt Position Encoding), which replaces traditional fixed positional patterns by dynamically updating positional encodings in each layer based on the sequence content. It simultaneously enforces permutation and orthogonal equivariance to guarantee stability, significantly outperforming existing positional encoding methods on language modeling, arithmetic reasoning, and long-context retrieval tasks.

Rethinking the Stability-Plasticity Trade-off in Continual Learning from an Architectural Perspective

This paper reveals an inherent conflict between stability and plasticity at the architectural level in continual learning: wide-and-shallow networks exhibit better stability, whereas deep-and-narrow networks possess stronger plasticity. Consequently, the authors propose the Dual-Arch framework, which delegates stability and plasticity to two dedicated lightweight architectures and coordinates them via knowledge distillation. This achieves up to an 87% reduction in parameter count while simultaneously improving CL performance.

RocketKV: Accelerating Long-Context LLM Inference via Two-Stage KV Cache Compression

RocketKV is proposed, a training-free two-stage KV cache compression method. The first stage employs SnapKV for coarse-grained permanent eviction, and the second stage utilizes Hybrid Sparse Attention (HSA) for fine-grained dynamic top-k selection. RocketKV achieves up to a 400× compression ratio, 3.7× end-to-end speedup, and 32.6% peak memory savings with negligible accuracy loss on models such as Mistral-7B.

SAFE: Finding Sparse and Flat Minima to Improve Pruning

Models the pruning problem as a sharpness-aware optimization problem under sparse constraints, solved via the Alternating Direction Method of Multipliers (ADMM), simultaneously achieving sparsity and flat minima to improve generalization performance and robustness of pruned networks.

Semantic Shift Estimation via Dual-Projection and Classifier Reconstruction for Exemplar-Free Class-Incremental Learning

This work proposes the DPCR method, which estimates semantic shift through dual projection (task-level TSSP + category-level CIP) and reconstructs the classifier without backpropagation (BP) using ridge regression. It simultaneously addresses semantic shift and decision bias in exemplar-free class-incremental learning, outperforming SOTA on multiple benchmarks.

SepLLM: Accelerate Large Language Models by Compressing One Segment into One Separator

SepLLM is proposed, leveraging the property of separator tokens (such as punctuation) to naturally compress text segment information, by only retaining the KV cache of Initial + Separator + Neighboring tokens. This significantly reduces attention computation and memory usage while maintaining performance.
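
The retention rule can be sketched as a function that returns which positions keep their KV entries. The separator set, the window sizes, and the function name are illustrative assumptions, not values from the paper.

```python
def kept_positions(tokens, separators, n_initial, n_neighbors):
    """Positions whose KV entries are retained: initial, separator, and recent tokens."""
    n = len(tokens)
    keep = set(range(min(n_initial, n)))                  # initial tokens
    keep.update(range(max(0, n - n_neighbors), n))        # neighboring (most recent) tokens
    keep.update(i for i, tok in enumerate(tokens) if tok in separators)
    return sorted(keep)

tokens = ["The", "cat", "sat", ".", "It", "ran", ",", "then", "slept", "."]
positions = kept_positions(tokens, separators={".", ",", ";", "!", "?"},
                           n_initial=1, n_neighbors=2)
```

In this toy sequence half of the ten positions are dropped; the saving grows with sequence length because the initial and neighboring windows have fixed size, so only the separator count scales with the text.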

Sketch to Adapt: Fine-Tunable Sketches for Efficient LLM Adaptation

SpaLLM proposes a parameter sharing method based on sketching to unify the compression and fine-tuning processes of LLMs. By compressing pre-trained weights into a lookup table (LUT) and directly fine-tuning on the table values, this approach avoids the low-rank assumption and implementation complexity of dual-tower architectures like QLoRA. It achieves superior performance compared to QLoRA/LoftQ across multiple benchmarks with fewer trainable parameters.

SlimLLM: Accurate Structured Pruning for Large Language Models

This paper proposes SlimLLM, a structured pruning method for LLMs. It evaluates channels via feature space importance (considering both weight direction and magnitude), assesses attention heads holistically using Pearson similarity, and couples this with a simple linear regression recovery strategy and layer-wise pruning ratio allocation. SlimLLM retains 98.7% of performance on LLaMA under 20% pruning.

Sparse Spectral Training and Inference on Euclidean and Hyperbolic Neural Networks

Sparse Spectral Training (SST) is proposed to achieve a performance close to full-rank pre-training with memory overheads on par with LoRA. This is achieved by updating all singular values at each step in the spectral domain, selectively updating singular vectors via multinomial sampling according to their magnitudes, and periodically running re-SVD to maintain orthogonality.

Speculative Decoding in Decentralized LLM Inference: Turning Communication Latency into Computation Throughput

This paper proposes Decentralized Speculative Decoding (DSD), a plug-and-play decentralized LLM inference acceleration framework. By converting cross-node communication wait time into effective computation and combining it with an adaptive verification strategy based on semantic importance, DSD achieves up to \(2.59\times\) end-to-end acceleration without requiring retraining.

Strategic Fusion Optimizes Transformer Compression

This paper proposes the Strategic Fusion framework, which fuses 12 layer pruning signals based on activation, mutual information, gradient, weight, and attention through linear regression and random forest. Validated on the BERT model and 9 text classification datasets, multi-signal fusion pruning outperforms single-signal strategies, and when combined with knowledge distillation, the average accuracy-to-size ratio is improved by 18.84 times.

Text-to-LoRA: Instant Transformer Adaption

Text-to-LoRA (T2L) trains a hypernetwork to generate task-specific LoRA adapters for LLMs in a single forward pass using only natural language task descriptions. It matches the performance of specialized fine-tuned LoRAs on 9 training tasks and generalizes zero-shot to unseen tasks, enabling language-driven instant model adaptation.

The Diffusion Duality

Reveals that Uniform-state discrete diffusion processes inherently emerge from underlying Gaussian diffusion (via the argmax mapping). Leveraging this duality, curriculum learning strategies and consistency distillation from Gaussian diffusion are transferred to the discrete setting, achieving a 2x training speedup and acceleration of sampling by two orders of magnitude (from 1024 to 8 steps), outperforming autoregressive models on 3 out of 7 datasets in zero-shot perplexity.

Toward Data-centric Directed Graph Learning: An Entropy-driven Approach

Proposes EDEN (Entropy-driven Digraph Knowledge Distillation), which constructs a Hierarchical Knowledge Tree (HKT) from a data-centric perspective. By measuring directed topological structures and quantifying node mutual information, it reveals the latent associations between topology and node attributes in directed graphs. Serving as a plug-and-play module, it brings an average performance gain of 2-5% to any DiGNN and achieves SOTA on 14 datasets and 4 downstream tasks.

Towards an Optimal Control Perspective of ResNet Training

Formulates ResNet training as an optimal control problem, achieving self-regularization by adding stage cost losses to intermediate layers, and proves that redundant deep weights asymptotically vanish, laying the foundation for theory-driven layer pruning.

Training a Generally Curious Agent (Paprika)

Proposes the Paprika framework, which fine-tunes LLMs on diverse text-based decision-making tasks, enabling the model to learn general information-gathering and decision-making capabilities and transfer zero-shot to completely unseen tasks.

TreeLoRA: Efficient Continual Learning via Layer-Wise LoRAs Guided by a Hierarchical Gradient-Similarity Tree

This paper proposes TreeLoRA, which organizes LoRA adapters of historical tasks by constructing a hierarchical K-D tree based on gradient similarity. It utilizes the Lower Confidence Bound (LCB) multi-armed bandit algorithm to search for the most relevant task branches efficiently for knowledge sharing. Working in tandem with sparse gradient updates, it achieves a 3.2× speedup on ViT and a 2.4× speedup on LLMs, while maintaining or surpassing SOTA performance.

VocabTrim: Vocabulary Pruning for Efficient Speculative Decoding in LLMs

Proposes VocabTrim, a training-free method that reduces draft latency in speculative decoding by pruning the vocabulary of the draft model's LM head, achieving up to 16% memory-bound speedup on Llama-3.
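
A small sketch of LM-head vocabulary pruning is given below. It assumes tokens are kept according to their frequency in some calibration text, which is an assumption of this illustration; `trim_lm_head` and `draft_next_token` are hypothetical names.

```python
from collections import Counter

def trim_lm_head(lm_head, token_counts, keep):
    """Keep only the LM-head rows of the `keep` most frequent token ids."""
    kept_ids = sorted(tok for tok, _ in Counter(token_counts).most_common(keep))
    trimmed = [lm_head[tok] for tok in kept_ids]
    return trimmed, kept_ids

def draft_next_token(hidden, trimmed, kept_ids):
    """Greedy draft step over the reduced vocabulary, mapped back to full-vocab ids."""
    logits = [sum(h * w for h, w in zip(hidden, row)) for row in trimmed]
    best = max(range(len(logits)), key=logits.__getitem__)
    return kept_ids[best]

# Toy LM head: 5 vocabulary entries, hidden size 2.
lm_head = [[0.1, 0.0], [5.0, 0.0], [0.3, 0.0], [9.0, 0.0], [0.2, 0.0]]
token_counts = {0: 50, 1: 1, 2: 30, 3: 2, 4: 40}

trimmed, kept_ids = trim_lm_head(lm_head, token_counts, keep=3)
token = draft_next_token([1.0, 0.0], trimmed, kept_ids)
```

The draft model can no longer propose the pruned tokens (here ids 1 and 3), so the method trades a possible drop in acceptance rate for a smaller LM-head matrix multiply; the target model still verifies over the full vocabulary.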

Weak-to-Strong Jailbreaking on Large Language Models

This paper proposes the weak-to-strong jailbreak attack: using two small models (one safe, one unsafe) during inference to modify the decoding distribution of a large model via log-probability algebra. It requires only a single forward pass to increase the malicious response rate of aligned LLMs to over 99%, revealing a previously unnoticed, highly efficient attack surface in LLM alignment.

When Data-Free Knowledge Distillation Meets Non-Transferable Teacher: Escaping Out-of-Distribution

This paper investigates the challenges faced by Data-Free Knowledge Distillation (DFKD) when the teacher model is designed to be "non-transferable." Because synthetic samples tend to fall into out-of-distribution (OOD) regions, which causes distillation to fail, this work proposes an "escaping OOD" approach to achieve effective distillation.

WildChat-50m: A Deep Dive Into the Role of Synthetic Data in Post-Training

Constructs the largest public chat dataset to date, WildChat-50m (50+ open-source models \(\times\) 1M+ conversations = 125 million transcripts), systematically investigates the synthetic data quality of different data generation models (DGMs), and designs the Re-Wild SFT mixing scheme, which outperforms Tulu-3 using only 40% of its SFT data volume.