📦 Model Compression

🧪 ICML2026 · 21 paper notes

ArcVQ-VAE: A Spherical Vector Quantization Framework with ArcCosine Additive Margin

The authors diagnose the root cause of codebook collapse in VQ-VAE as "codebook vector \(\ell_2\) norm imbalance + geometric clustering." They propose SAMP: Ball-Bounded Norm Regularization constrains all codebook vectors within a time-varying Euclidean ball, and ArcCosine Additive Margin Loss, inspired by ArcFace, pushes latent vectors apart on the sphere. This leads to uniform codebook dispersion and a significant increase in utilization, outperforming mainstream VQ-VAE variants on ImageNet reconstruction and generation FID.

Breaking the MoE LLM Trilemma: Dynamic Expert Clustering with Structured Compression

To address the MoE LLM trilemma of load imbalance, parameter redundancy, and communication overhead, this paper proposes a unified framework. Experts are grouped online by clustering on both parameter and activation similarity. Within each group, structured compression (~5×) stores a shared base matrix plus a low-rank residual per expert. Routing is two-level (group selection, then expert selection), and is combined with FP16/INT4 heterogeneous precision and offloading of idle groups offline. On GLUE/WikiText-103, this achieves about 80% parameter reduction, 10–20% throughput improvement, and a 3× reduction in expert load variance, while matching standard MoE performance.

Demystifying When Pruning Works via Representation Hierarchies

Starting from the three-stage representation hierarchy "embedding → logit → probability," this paper uses a local Taylor expansion to show that pruning introduces only small perturbations in the embedding and logit spaces, but the nonlinear softmax step amplifies them in the probability space by a factor of \(\mathrm{Var}_r(\Delta z)/(2T^2)\). Accumulated step by step during autoregressive decoding, this ultimately leads to catastrophic failure in generation tasks. In contrast, non-generation tasks are naturally robust to pruning because they depend only on a candidate token subspace. This gives a unified explanation for why pruning is nearly lossless on MMLU and retrieval but drops to zero on GSM8K and HumanEval.
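
The \(\mathrm{Var}(\Delta z)/(2T^2)\) term can be checked numerically. The sketch below is not taken from the paper: it assumes the shift in probability space is measured by the KL divergence between the original and perturbed softmax outputs, and that the variance is taken under the unperturbed softmax distribution. Under those assumptions, a second-order Taylor expansion of the log-partition function gives exactly this term.

```python
import numpy as np

def softmax(z, T=1.0):
    """Temperature-scaled softmax with the usual max-subtraction for stability."""
    z = np.asarray(z, dtype=float) / T
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def kl(p, q):
    """KL(p || q) for strictly positive probability vectors."""
    return float(np.sum(p * (np.log(p) - np.log(q))))

def predicted_shift(z, dz, T=1.0):
    """Second-order estimate Var_p(dz) / (2 T^2), with p = softmax(z / T)."""
    p = softmax(z, T)
    mean = np.sum(p * dz)
    var = np.sum(p * (dz - mean) ** 2)
    return float(var / (2.0 * T ** 2))

rng = np.random.default_rng(0)
z = rng.normal(size=50)            # logits of the unpruned model (synthetic)
dz = 1e-2 * rng.normal(size=50)    # small logit perturbation standing in for pruning
T = 0.7

exact = kl(softmax(z, T), softmax(z + dz, T))
approx = predicted_shift(z, dz, T)
print(exact, approx)
```

Note that a constant shift of all logits has zero variance and leaves the distribution unchanged, which is why only the variance of the perturbation matters here.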

Dispersion Loss Counteracts Embedding Condensation and Improves Generalization in Small Language Models

This work systematically documents embedding condensation, in which the token embeddings of small language models collapse into a narrow cone as depth increases, a phenomenon not seen in large models. It proposes an angular dispersion loss \(\mathcal{L}_{\text{disp}}\) that directly encourages embeddings to spread out. Without introducing extra parameters, this loss yields an average improvement of 3.3% on 10 benchmarks for Qwen3 / GPT2.

Don't Ignore the Tail: Decoupling top-K Probabilities for Efficient Language Model Distillation

This paper proposes TAD (Tail-Aware Distillation): by explicitly separating the teacher's top-\(K\) probabilities from the "tail" probabilities in the standard KD KL divergence and amplifying the tail's contribution, it enables LLM pretraining distillation within academic-level compute (single H100 + 1 week), achieving average performance superior to data-centric methods like MiniPLM.
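
One way to picture the decoupling is through the chain rule for KL divergence, which splits the standard KD loss into a "top-\(K\) mass vs. tail mass" term, a within-top-\(K\) term, and a within-tail term. The sketch below is an illustration under that assumption, not the paper's exact loss: the function name `tail_aware_kd` and the single `tail_weight` multiplier are hypothetical.

```python
import numpy as np

def kl(p, q):
    """KL(p || q) for strictly positive probability vectors."""
    return float(np.sum(p * (np.log(p) - np.log(q))))

def tail_aware_kd(p_teacher, q_student, k, tail_weight=1.0):
    """Split KL(teacher || student) at the teacher's top-k set and reweight the tail.

    With tail_weight == 1 this equals the ordinary KL divergence (chain rule).
    tail_weight > 1 amplifies the within-tail term (hypothetical weighting).
    """
    top = np.argsort(p_teacher)[-k:]
    mask = np.zeros(p_teacher.shape, dtype=bool)
    mask[top] = True

    pt, qt = p_teacher[mask].sum(), q_student[mask].sum()
    # How much total mass each model puts on the top-k set vs. the tail.
    mass_term = kl(np.array([pt, 1.0 - pt]), np.array([qt, 1.0 - qt]))
    # Shape of the distribution inside the top-k set and inside the tail.
    top_term = pt * kl(p_teacher[mask] / pt, q_student[mask] / qt)
    tail_term = (1.0 - pt) * kl(p_teacher[~mask] / (1.0 - pt),
                                q_student[~mask] / (1.0 - qt))
    return mass_term + top_term + tail_weight * tail_term

rng = np.random.default_rng(0)
p = np.exp(rng.normal(size=100)); p /= p.sum()   # synthetic teacher distribution
q = np.exp(rng.normal(size=100)); q /= q.sum()   # synthetic student distribution

plain = tail_aware_kd(p, q, k=10, tail_weight=1.0)
boosted = tail_aware_kd(p, q, k=10, tail_weight=2.0)
print(plain, kl(p, q), boosted)
```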

FedRot-LoRA: Mitigating Rotational Misalignment in Federated LoRA

This paper identifies that the true "enemy" of naive factor-wise averaging in federated LoRA is the latent subspace misalignment caused by rotational invariance. It proposes that each client solves for a rotation matrix \(R_i^t\) via orthogonal Procrustes to align \(A,B\) factors before aggregation. Both theoretical and experimental results demonstrate significant reduction in aggregation error without increasing communication overhead.
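
The rotational invariance is easy to reproduce: for any orthogonal \(R\), \((BR)(R^\top A) = BA\), so two clients can hold the same update in different bases, and averaging their factors directly gives the wrong product. The sketch below is a minimal illustration, assuming the client aligns its \(B\) factor to a reference \(B\) (the choice of reference and of which factor drives the alignment are assumptions, not details from the paper).

```python
import numpy as np

def procrustes_rotation(B_client, B_ref):
    """Orthogonal R minimizing ||B_client @ R - B_ref||_F (orthogonal Procrustes)."""
    U, _, Vt = np.linalg.svd(B_client.T @ B_ref)
    return U @ Vt

def align_factors(B, A, B_ref):
    """Rotate (B, A) into the reference basis without changing the product B @ A."""
    R = procrustes_rotation(B, B_ref)
    return B @ R, R.T @ A

rng = np.random.default_rng(0)
m, n, r = 12, 10, 4
B_ref = rng.normal(size=(m, r))
A_ref = rng.normal(size=(r, n))

# A second client holding the same update, expressed in a rotated basis.
Q, _ = np.linalg.qr(rng.normal(size=(r, r)))
B_cli, A_cli = B_ref @ Q, Q.T @ A_ref
target = B_ref @ A_ref

# Naive factor-wise averaging vs. averaging after alignment.
naive = ((B_ref + B_cli) / 2) @ ((A_ref + A_cli) / 2)
B_al, A_al = align_factors(B_cli, A_cli, B_ref)
aligned = ((B_ref + B_al) / 2) @ ((A_ref + A_al) / 2)

naive_err = np.linalg.norm(naive - target)
aligned_err = np.linalg.norm(aligned - target)
print(naive_err, aligned_err)
```

Only the \(r \times r\) rotation is computed locally, so the factors exchanged have the same shapes as before alignment.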

FlattenGPT: Depth Compression for Transformer with Layer Flattening

This paper proposes FlattenGPT, which first "flattens" and merges adjacent transformer layers in LLMs with highly similar inputs into a single layer with 2× width (retaining all parameter knowledge), then applies channel pruning to the merged layer to restore the original width—thus achieving inference acceleration via depth compression while avoiding the catastrophic performance drop from directly discarding entire layers as in traditional pruning.
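
The flattening step can be illustrated on MLP sub-blocks alone. If two adjacent residual MLPs see nearly the same input, their contributions are approximately \(f_1(x) + f_2(x)\), and that sum is computed exactly by one MLP whose hidden width is doubled. The sketch below assumes ReLU MLPs without bias or normalization and ignores attention; the pruning score is a hypothetical placeholder, not the paper's criterion.

```python
import numpy as np

def mlp(x, W_up, W_down):
    """Two-layer ReLU MLP: (d -> h -> d)."""
    return np.maximum(x @ W_up, 0.0) @ W_down

def flatten_pair(W_up1, W_down1, W_up2, W_down2):
    """Merge two MLPs into one with 2x hidden width; output is f1(x) + f2(x)."""
    W_up = np.concatenate([W_up1, W_up2], axis=1)        # (d, 2h)
    W_down = np.concatenate([W_down1, W_down2], axis=0)  # (2h, d)
    return W_up, W_down

def prune_to_width(W_up, W_down, keep):
    """Keep `keep` hidden channels ranked by a weight-norm score (hypothetical)."""
    score = np.linalg.norm(W_up, axis=0) * np.linalg.norm(W_down, axis=1)
    idx = np.sort(np.argsort(score)[-keep:])
    return W_up[:, idx], W_down[idx, :]

rng = np.random.default_rng(0)
d, h = 8, 16
U1, D1 = rng.normal(size=(d, h)), rng.normal(size=(h, d))
U2, D2 = rng.normal(size=(d, h)), rng.normal(size=(h, d))
x = rng.normal(size=(5, d))

U, D = flatten_pair(U1, D1, U2, D2)
merged_out = mlp(x, U, D)
sum_out = mlp(x, U1, D1) + mlp(x, U2, D2)

U_p, D_p = prune_to_width(U, D, keep=h)   # back to the original width
print(np.allclose(merged_out, sum_out), U_p.shape, D_p.shape)
```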

From Per-Image Low-Rank to Encoding Mismatch: Rethinking Feature Distillation in Vision Transformers

The authors use a three-perspective analysis—sample-wise SVD, dataset-level PCA, and token-level Spectral Energy Pattern (SEP)—to reveal a seemingly paradoxical geometry in ViT representations: "Each image's feature matrix is low-rank, but the cross-image shared subspace is nearly full-rank, and the spectral bandwidth of single tokens approaches 100%." They then propose two minimal patches, Lift (retaining the lifting projector at inference) and WideLast (widening only the last block to teacher width), which enable plain MSE feature distillation to boost DeiT-Tiny ← CaiT-S24 from 74.86% to 78.23%.

Linearizing Vision Transformer with Test-Time Training

The authors observe that the inner model of two-layer TTT is structurally equivalent to Softmax attention (Softmax can be viewed as a two-layer dynamic MLP). This enables direct inheritance of all Q/K/V/MLP weights. Key Instance Normalization is used to handle shift-invariance, and depthwise conv on Q/K is added to inject locality. With only 1 hour of fine-tuning, Stable Diffusion 3.5 is linearized and accelerated by 1.32×–1.47×.

OSAQ: Outlier Self-Absorption for Accurate Low-bit LLM Quantization

OSAQ leverages the observation that the Hessian of each LLM layer maintains a consistent low-rank null space across different inputs. By linearly combining the null space vectors into an additive weight perturbation \(\Delta W\), OSAQ "self-absorbs" outlier weights without altering the second-order task loss, reducing the perplexity of 2-bit weight-only quantization by over 40% compared to naive GPTQ.

Preserve-Then-Quantize: Balancing Rank Budgets for Quantization Error Reconstruction in LLMs

The authors propose SRR (Structured Residual Reconstruction), which explicitly splits the fixed low-rank budget \(r\) in QER (Quantization Error Reconstruction) into two parts: "preserve the top \(k\) principal singular directions before quantization" and "use the remaining \(r-k\) ranks to fit the residual." They provide a closed-form criterion requiring only a single random probe to select \(k^\star\) per layer, consistently outperforming LQER/QERA in 2/3-bit PTQ and QPEFT.
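
The budget split can be sketched as follows. This is an illustration only: it uses a simple per-tensor round-to-nearest quantizer as a stand-in, takes \(k\) as a given input rather than implementing the paper's probe-based selection of \(k^\star\), and all function names are hypothetical.

```python
import numpy as np

def quantize_uniform(W, bits):
    """Symmetric per-tensor round-to-nearest quantizer (stand-in, not from the paper)."""
    levels = 2 ** (bits - 1) - 1
    scale = np.abs(W).max() / levels
    return np.clip(np.round(W / scale), -levels, levels) * scale

def top_rank(M, k):
    """Best rank-k approximation of M via truncated SVD."""
    if k == 0:
        return np.zeros_like(M)
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return (U[:, :k] * s[:k]) @ Vt[:k]

def preserve_then_quantize(W, r, k, bits=3):
    """Spend k ranks on W's principal directions, r - k on the quantization residual."""
    preserved = top_rank(W, k)                    # kept in high precision
    Q = quantize_uniform(W - preserved, bits)     # quantize what is left
    residual_fit = top_rank(W - preserved - Q, r - k)
    return Q, preserved, residual_fit

rng = np.random.default_rng(0)
W = rng.normal(size=(32, 24))
r, k = 8, 4

Q, preserved, residual_fit = preserve_then_quantize(W, r, k, bits=3)
low_rank = preserved + residual_fit               # total low-rank budget is at most r
err_without_fit = np.linalg.norm(W - Q - preserved)
err_with_fit = np.linalg.norm(W - Q - low_rank)
print(err_without_fit, err_with_fit)
```

Setting \(k = 0\) in this sketch spends the whole budget on fitting the quantization residual, which is the plain QER setting the summary describes SRR as generalizing.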

Proxy Compression for Language Modeling

The authors propose "proxy compression": during training, 90% of data is fed as short sequences produced by a tokenizer or neural compressor, and 10% as raw UTF-8 bytes, combined with sentinel tokens and a brief in-context translation warm-up. At inference, all compressors are discarded and the model sees only raw bytes, yet it significantly outperforms pure byte models under fixed compute, and at scale matches or surpasses tokenizer baselines.

Resting Neurons, Active Insights: Robustify Activation Sparsity for Large Language Models

This paper attributes the performance drop in LLMs caused by activation sparsity to "representation drift." Inspired by biological spontaneous firing, it injects a small, input-independent vector (SPON) into each layer, which can be absorbed into the bias after training. This approach significantly narrows the gap between sparse and dense models with nearly zero inference overhead.
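
The "absorbed into the bias" claim follows from linearity. The sketch below assumes the vector is added to the sparsified input of a linear layer, which is one plausible placement and not confirmed by the summary; the thresholding sparsifier and all names are hypothetical.

```python
import numpy as np

def sparsify(x, threshold):
    """Magnitude-threshold activation sparsity (hypothetical stand-in)."""
    return np.where(np.abs(x) > threshold, x, 0.0)

def fold_into_bias(W, b, s):
    """Since W @ (x + s) + b == W @ x + (b + W @ s), s can be folded into the bias."""
    return b + W @ s

rng = np.random.default_rng(0)
d_in, d_out = 16, 8
W = rng.normal(size=(d_out, d_in))
b = rng.normal(size=d_out)
s = 0.05 * rng.normal(size=d_in)   # small input-independent vector
x = rng.normal(size=d_in)

x_sparse = sparsify(x, threshold=0.5)
with_injection = W @ (x_sparse + s) + b      # explicit injection at inference
b_folded = fold_into_bias(W, b, s)
with_folded_bias = W @ x_sparse + b_folded   # same output, no extra work per token
print(np.allclose(with_injection, with_folded_bias))
```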

RQ-MoE: Residual Quantization via Mixture of Experts for Efficient Input-Dependent Vector Compression

RQ-MoE employs a "two-level MoE + dual-stream quantization" design, enabling the codebook for residual vector quantization (RQ) to be dynamically generated per input, and achieves 6–14× decoding acceleration by decoupling the instruction and reconstruction streams. On four retrieval benchmarks, it matches or surpasses QINCo in MSE/Recall.

ScaLoRA: Optimally Scaled Low-Rank Adaptation for Efficient High-Rank Fine-Tuning

The authors prove that LoRA's cumulative updates are trapped in a fixed low-rank subspace and propose ScaLoRA: at each step, after merging the old \(AB^\top\) into \(W^{pt}\), the adapter is restarted with an analytically optimal "column scaling", enabling AdamW first/second moments to be transferred equivariantly in \(O((m+n)r)\) time (no reset/warm-up needed), and cumulative updates naturally become high-rank. ScaLoRA consistently outperforms LoRA / MoRA / HiRA / ReLoRA / LoRA-GA on DeBERTaV3, LLaMA2-7B, LLaMA3-8B, and Gemma3-12B.

Semantic Integrity Matters: Benchmarking and Preserving High-Density Reasoning in KV Cache Compression

This paper first uses the new benchmark system KVFundaBench to reveal a key asymmetry: "retrieval-type long-context tasks can be compressed, reasoning-type cannot." The root cause is attributed to KV compression breaking the integrity of few-shot examples as "semantic units." Based on this, ShotKV is proposed—preserving each shot as an indivisible unit during prefill, and applying dynamic token-level compression during decoding. This allows LG-GSM8K to improve from a baseline of 46.0 to 47.33 at a 40% compression rate, and reduces end-to-end latency by 11.3% under long-input settings.

Stochastic Sparse Attention for Memory-Bound Inference

SANTA treats attention value aggregation \(AV\) as "weighted sum of value rows \(V\) by softmax probabilities \(A\)," and replaces it with an unbiased estimator: "sample \(S\ll n_k\) indices from \(A\) without replacement and directly average the corresponding \(V\) rows." Stratified/systematic sampling is used to reduce variance, and the method is implemented as a GPU kernel aligned with FlashDecoding. On 32k context, it achieves 1.5× end-to-end speedup over FlashInfer/FlashDecoding without loss of accuracy.
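
A minimal NumPy sketch of the sampling idea, for a single query: draw \(S\) indices according to the attention probabilities and average the selected value rows. It uses systematic sampling, one of the variance-reduction options the summary mentions. Note that systematic sampling can select the same index more than once when a probability exceeds \(1/S\), so this is an illustration of the estimator's unbiasedness rather than a reproduction of the paper's sampling scheme or GPU kernel.

```python
import numpy as np

def systematic_indices(probs, S, rng):
    """Pick S indices with one shared uniform offset and evenly spaced CDF points."""
    cdf = np.cumsum(probs)
    points = (rng.random() + np.arange(S)) / S
    idx = np.searchsorted(cdf, points, side="right")
    return np.minimum(idx, len(probs) - 1)   # guard against round-off in the CDF

def sampled_av(probs, V, S, rng):
    """Estimate probs @ V by averaging the S sampled rows of V."""
    return V[systematic_indices(probs, S, rng)].mean(axis=0)

rng = np.random.default_rng(0)
n_k, d, S = 64, 4, 8
logits = rng.normal(size=n_k)
probs = np.exp(logits - logits.max())
probs /= probs.sum()                          # softmax probabilities for one query
V = rng.normal(size=(n_k, d))

exact = probs @ V
# Averaging many independent estimates should approach the exact result.
estimate = np.mean([sampled_av(probs, V, S, rng) for _ in range(5000)], axis=0)
print(np.abs(estimate - exact).max())
```

Each estimate reads only \(S\) rows of \(V\) instead of all \(n_k\), which is where the savings for memory-bound decoding would come from.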

SURGE: Surrogate Gradient Adaptation in Binary Neural Networks

SURGE attaches a "full-precision auxiliary branch" in parallel to each binarized layer. The forward output remains unchanged, but in the backward pass, an extra "non-STE truncated" higher-order gradient is backpropagated from the full-precision branch. AGS dynamically balances the contributions of both paths according to the gradient norm ratio, enabling BNNs to achieve 62.0% top-1 on ResNet-18/ImageNet—1.0% higher than ReCU and 3.9% higher than IR-Net.

Task-Driven Subspace Decomposition for Knowledge Sharing and Isolation in LoRA-based Continual Learning

LoDA decomposes the LoRA down-projection matrix by "projection energy" into a general subspace shared across tasks and an isolated subspace that is only activated by new tasks. It then uses gradient alignment to train the up-projection and applies a closed-form recalibration to the general branch during merging, thereby consistently outperforming existing LoRA-CL methods on multiple continual learning benchmarks.

Test-Time Training with KV Binding Is Secretly Linear Attention

This paper uses four "memory paradox" counterexamples and a set of rigorous unrolling theorems to prove that TTT with KV-binding inner loops (e.g., LaCT, ViTTT), even with multi-layer MLPs and momentum, is essentially "learned linear attention operators." Based on this, the authors simplify and parallelize it into standard linear attention, achieving a 4× throughput boost with almost no performance drop.

Token Sparse Attention: Efficient Long-Context Inference with Interleaved Token Selection

The authors observe that the "importance" of tokens varies drastically across layers and heads, so traditional token eviction, which removes tokens in one shot, commits to an early decision that cannot be undone. They propose Token Sparse Attention, where each attention head in each layer independently selects \(L' \ll L\) tokens for dense attention, then scatters the output back to the original sequence length, with a residual path allowing skipped tokens to be reconsidered in the next layer. This preserves both head/layer-level dynamic selection and compatibility with dense kernels like FlashAttention. Combined with FlexPrefill, it achieves a 3.23× attention speedup with <1% accuracy loss on 128K context.