📦 Model Compression
🧠 NeurIPS2025 · 140 paper notes
📌 Same area in other venues: 📷 CVPR2026 (98) · 🔬 ICLR2026 (241) · 💬 ACL2026 (59) · 🧪 ICML2026 (117) · 🤖 AAAI2026 (60) · 📹 ICCV2025 (52)
🔥 Top topics: LLM ×23 · Compression ×19 · Model Compression ×16 · Knowledge Distillation ×8 · Adversarial Robustness ×7
- 4DGCPro: Efficient Hierarchical 4D Gaussian Compression for Progressive Volumetric Video Streaming
-
This paper proposes 4DGCPro, a hierarchical 4D Gaussian compression framework that achieves multi-bitrate progressive volumetric video streaming within a single model, via perception-weighted hierarchical Gaussian representation, motion-aware adaptive grouping, and end-to-end entropy-optimized training. The framework supports real-time decoding and rendering on mobile devices and surpasses existing SOTA in rate-distortion performance.
- A*-Thought: Efficient Reasoning via Bidirectional Compression for Low-Resource Settings
-
This paper proposes A*-Thought, a CoT compression framework based on the A* search algorithm. It introduces Bidirectional Importance Scoring (BIS) to measure each reasoning step's relevance to both the question and the answer, and combines it with path-level A* search to efficiently identify the most compact valid reasoning path within an exponentially large search space. Under a 512-token budget, A*-Thought improves QwQ-32B accuracy by 2.39×; under a 4096-token budget, it reduces output tokens by approximately 50% with negligible accuracy loss.
- A Granular Study of Safety Pretraining under Model Abliteration
-
This paper systematically investigates the effects of model abliteration, an inference-time attack that edits the activation space, on various data-driven safety pretraining stages. It finds that safety mechanisms relying solely on refusal training are highly vulnerable, whereas combining multiple safety signals (safe-only filtering + rephrasing + metatags + refusals) distributes safety behavior across a broader representational space, making it substantially more resistant to single-direction projection removal.
- A Partition Cover Approach for Tokenization
-
This paper reformulates tokenization as a partition cover optimization problem, proves it NP-hard, and proposes a polynomial-time greedy algorithm GreedTok that outperforms BPE in both compression rate and downstream tasks when pretraining a 1B-parameter LLM.
- A Simple Linear Patch Revives Layer-Pruned Large Language Models
-
LinearPatch inserts a lightweight symmetric matrix — fusing a Hadamard transform with channel scaling — at the pruning interface to repair activation magnitude mismatches caused by layer pruning. On LLaMA-3-8B, it retains 94.15% of the original performance without any training, and reaches 95.16% after 30 minutes of distillation.
- A Token is Worth over 1,000 Tokens: Efficient Knowledge Distillation through Low-Rank Clone
-
This paper proposes Low-Rank Clone (LRC), which compresses teacher weights into student weights via learnable low-rank projection matrices (soft pruning), while aligning intermediate activations of both attention and FFN modules (activation cloning). A 1.7B model trained on only 20B tokens surpasses Qwen3-1.7B trained on 36T tokens (64.98 vs. 63.17), achieving a 1,000× improvement in training efficiency.
- Accurate and Efficient Low-Rank Model Merging in Core Space
-
This paper proposes the Core Space Merging framework, which performs model merging within a common reference basis space constructed from low-rank LoRA matrices. This approach losslessly reduces the merging operation from the full \(m \times n\) space to a compact \(Tr \times Tr\) space (where \(T\) is the number of tasks and \(r\) is the LoRA rank), achieving state-of-the-art merging accuracy on Llama 3 8B while reducing computational cost by several orders of magnitude.
- Ada-KV: Optimizing KV Cache Eviction by Adaptive Budget Allocation for Efficient LLM Inference
-
Existing KV cache eviction methods uniformly allocate budgets across all attention heads, ignoring the substantial variation in attention concentration across heads. This paper proposes Ada-KV — the first head-wise adaptive budget allocation strategy — which redistributes budget from sparse heads to dispersed heads. It provides a theoretical proof that the approach minimizes an upper bound on eviction loss, and serves as a plug-and-play improvement over existing methods across 29 datasets.
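The head-wise allocation idea can be illustrated with a minimal sketch. It assumes that a per-token importance score is already available for every head, pools those scores across heads, and gives each head as many cache slots as it has entries in the global top set. The function name `allocate_budgets` and the toy scores are hypothetical, and this is not the authors' implementation.

```python
def allocate_budgets(scores, total_budget):
    """Split a shared KV-cache budget across heads.

    scores: one list of per-token importance scores per head (assumed given).
    Returns the number of cache slots assigned to each head.
    """
    # Pool every (score, head) pair so heads compete for the same budget.
    pooled = [(s, h) for h, head_scores in enumerate(scores) for s in head_scores]
    pooled.sort(key=lambda pair: pair[0], reverse=True)

    budgets = [0] * len(scores)
    for _, head in pooled[:total_budget]:
        budgets[head] += 1
    return budgets


scores = [
    [0.90, 0.05, 0.03, 0.02],  # concentrated head: one token dominates
    [0.30, 0.25, 0.25, 0.20],  # dispersed head: attention spread evenly
]
budgets = allocate_budgets(scores, total_budget=4)
print(budgets)
```

With a uniform split each head would keep two tokens; in this toy case the concentrated head gives up one slot to the dispersed head.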
- Adaptive Prediction-Powered AutoEval with Reliability and Efficiency Guarantees
-
This paper proposes the R-AutoEval+ framework, which introduces an adaptive weighting mechanism within the testing-by-betting framework to dynamically regulate reliance on LLM-judge-generated synthetic data. It is the first method to simultaneously guarantee evaluation reliability and sampling efficiency no worse than approaches using only real data under finite samples, validated across three scenarios: LLM quantization, prompt selection, and inference budget allocation.
- AdmTree: Compressing Lengthy Context with Adaptive Semantic Trees
-
This paper proposes AdmTree — an adaptive hierarchical context compression framework that constructs leaf gist tokens via information-density-driven dynamic segmentation, then aggregates them bottom-up into a binary semantic tree to achieve multi-granularity semantic preservation. It addresses two fundamental challenges: local detail loss in explicit methods and positional bias in implicit methods, outperforming the SOTA baseline Activation Beacon by over 10% on LongBench.
- AI-Generated Video Detection via Perceptual Straightening
-
This paper proposes ReStraV, a method grounded in the perceptual straightening hypothesis—which posits that real videos form straighter trajectories in neural representation space—to detect AI-generated videos. Using temporal curvature and step-size statistics extracted from DINOv2 feature space, a lightweight classifier is trained to distinguish real from generated content, achieving 97.17% accuracy and 98.63% AUROC on VidProM with only ~48ms inference time.
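As a rough illustration of the two statistics mentioned above, the sketch below computes step sizes (norms of consecutive feature differences) and temporal curvature (angles between consecutive difference vectors) for a sequence of per-frame feature vectors. The toy 2-D vectors stand in for DINOv2 features, the function name is hypothetical, and the exact statistics used in the paper may differ.

```python
import math


def step_sizes_and_curvatures(features):
    """features: list of per-frame feature vectors (consecutive frames must differ)."""
    # Displacement between consecutive frames in feature space.
    diffs = [[b - a for a, b in zip(p, q)] for p, q in zip(features, features[1:])]
    steps = [math.sqrt(sum(x * x for x in d)) for d in diffs]

    curvatures = []
    for d1, d2, n1, n2 in zip(diffs, diffs[1:], steps, steps[1:]):
        cos = sum(a * b for a, b in zip(d1, d2)) / (n1 * n2)
        cos = max(-1.0, min(1.0, cos))  # guard against rounding outside [-1, 1]
        curvatures.append(math.acos(cos))  # angle in radians
    return steps, curvatures


straight = [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]]
bent = [[0.0, 0.0], [1.0, 0.0], [1.0, 1.0]]
print(step_sizes_and_curvatures(straight))
print(step_sizes_and_curvatures(bent))
```

A perfectly straight trajectory has zero curvature at every step, while a right-angle turn yields an angle of π/2; summary statistics of such values would then feed the lightweight classifier.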
- All You Need is One: Capsule Prompt Tuning with a Single Vector
-
This paper proposes Capsule Prompt-Tuning (CaPT), identifying that existing task-aware soft prompts exhibit minimal interaction with input tokens — an "attention island" phenomenon. Incorporating instance-aware information into a single capsule prompt enables it to serve as an "attention anchor" that activates attention toward critical structural information, achieving superior performance over multi-prompt methods with extremely few parameters (e.g., only 0.003% of parameters on Llama3.2-1B).
- ATLAS: Autoformalizing Theorems through Lifting, Augmentation, and Synthesis of Data
-
ATLAS proposes a data generation framework based on a concept repository, expert iteration with knowledge distillation, and two novel augmentation strategies. It constructs a parallel corpus of 117K theorem statements, and achieves SOTA on all autoformalization benchmarks after fine-tuning Llama3.1-8B-Instruct.
- AutoJudge: Judge Decoding Without Manual Annotation
-
AutoJudge automates the annotation of "critical tokens" in Judge Decoding. A semi-greedy search replaces mismatched tokens and checks whether the final answer changes, which labels each token's importance; a logistic regression classifier is then trained to predict importance at inference time. This lets speculative decoding accept 40+ tokens per round (vs. ~20 in standard methods), achieving a 1.5× speedup on GSM8K with less than 1% accuracy loss.
- BaRISTA: Brain-Scale Informed Spatiotemporal Representation of Human Intracranial EEG
-
BaRISTA systematically investigates spatial encoding scales (electrode/parcel/lobe) for iEEG Transformers, finding that atlas parcel-level encoding combined with spatial masked reconstruction achieves 86.2% AUC on language task decoding (vs. PopT 79.5%). The choice of encoding scale has greater impact than masking strategy, and the model generalizes well across subjects.
- Beyond Higher Rank: Token-wise Input-Output Projections for Efficient Low-Rank Adaptation
-
TopLoRA analyzes the expressive capacity of LoRA from an input-output projection perspective, identifying that all tokens sharing a single projection matrix constitutes a critical bottleneck. It proposes dynamically adjusting LoRA weights via a learnable token-wise diagonal matrix \(\Sigma_X\) (i.e., \(\Delta W_X = B\Sigma_X A\)), achieving fine-grained adaptation without increasing rank, and consistently outperforming LoRA by 2–3% across tasks.
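The update \(\Delta W_X = B\Sigma_X A\) can be sketched in a few lines. The example assumes the diagonal of \(\Sigma_X\) for a given token is already available as a vector `sigma`; how that vector is produced from the token is not covered here. The helper names and the toy matrices are hypothetical.

```python
def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def token_wise_lora_delta(x, A, B, sigma):
    """Compute (B @ diag(sigma) @ A) @ x without forming the full matrix.

    A: r x d_in, B: d_out x r, sigma: length-r diagonal for this token (assumed given).
    """
    z = matvec(A, x)                          # project down to rank r
    z = [s * zi for s, zi in zip(sigma, z)]   # token-specific diagonal scaling
    return matvec(B, z)                       # project back up to d_out


A = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0]]
B = [[1.0, 0.0],
     [0.0, 1.0],
     [1.0, 1.0]]
x = [2.0, 3.0, 5.0]

print(token_wise_lora_delta(x, A, B, sigma=[1.0, 1.0]))  # plain LoRA update B @ A @ x
print(token_wise_lora_delta(x, A, B, sigma=[0.5, 2.0]))  # same A and B, different token scaling
```

Setting every entry of `sigma` to 1 recovers the ordinary LoRA update, which shows that the rank stays at \(r\) while the effective projection changes per token.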
- Beyond Random: Automatic Inner-Loop Optimization in Dataset Distillation
-
This paper proposes AT-BPTT (Adaptive Truncation BPTT), which partitions DNN training into early/middle/late stages and adaptively adjusts truncation strategies and window sizes accordingly. The method achieves average accuracy gains of 3–17% on CIFAR-10/100/Tiny-ImageNet/ImageNet-1K, while delivering 3.9× speedup and 63% memory reduction.
- Bézier Splatting for Fast and Differentiable Vector Graphics Rendering
-
Bézier Splatting integrates the Gaussian Splatting framework with Bézier curves by uniformly sampling 2D Gaussian points along each curve and rendering via α-blending to achieve differentiable vector graphics. The method achieves 30× forward and 150× backward speedups over DiffVG while maintaining or surpassing the image quality of methods such as LIVE.
- Binary Quadratic Quantization: Beyond First-Order Quantization for Real-Valued Matrix Compression
-
BQQ proposes quadratic binary quantization—representing weight matrices as products (rather than linear combinations) of binary matrices—thereby surpassing the expressive capacity of conventional first-order quantization. By extending AMFD (Annealed Mean-Field Descent) to PUBO problems for mixed-integer optimization, BQQ achieves a dramatic accuracy leap from 10.83% to 58.25% on 2-bit data-free ViT quantization.
- BioBench: A Blueprint to Move Beyond ImageNet for Scientific ML Benchmarks
-
BioBench is proposed as a unified benchmark spanning 9 ecological vision tasks, 4 taxonomic kingdoms, 6 image modalities, and 3.1 million images. It demonstrates that ImageNet top-1 accuracy explains only 34% of the variance across ecological tasks, and that approximately 30% of model rankings are incorrect among frontier models exceeding 75% accuracy.
- C-LoRA: Contextual Low-Rank Adaptation for Uncertainty Estimation in Large Language Models
-
This paper proposes C-LoRA, which introduces a lightweight contextual module to condition the distribution of LoRA low-rank matrices on the input data, enabling sample-level heteroscedastic uncertainty estimation and significantly improving calibration quality in few-shot fine-tuning scenarios.
- CAS-Spec: Cascade Adaptive Self-Speculative Decoding for On-the-Fly Lossless Inference Acceleration of LLMs
-
CAS-Spec constructs a multi-level draft model hierarchy from the target model itself via Dynamically Switchable Inference Acceleration (DSIA) strategies (e.g., layer sparsity at varying degrees), and employs the Dynamic Tree Cascade (DyTC) algorithm to adaptively route among draft models and allocate draft lengths based on online acceptance rates and latency predictions. The approach achieves lossless inference acceleration of 1.1×–2.3× in a fully training-free manner, with DyTC yielding gains of 47% and 48% over cascade and tree baselines, respectively.
- ChunkKV: Semantic-Preserving KV Cache Compression for Efficient Long-Context LLM Inference
-
ChunkKV elevates the basic unit of KV cache compression from discrete tokens to semantic chunks (groups of contiguous tokens). By aggregating attention scores at the chunk level, it selects semantically intact segments for retention, and leverages the high cross-layer index similarity induced by chunking to enable layer-wise index reuse. At a 10% compression ratio, ChunkKV improves over SnapKV/PyramidKV by up to 8.7% and achieves a 26.5% throughput gain.
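Chunk-level selection can be sketched as follows. The example assumes fixed-size chunks and uses the sum of token attention scores as the chunk score; both choices, and the function name `select_chunks`, are illustrative assumptions rather than the method's actual configuration.

```python
def select_chunks(token_scores, chunk_size, keep_chunks):
    """Return the token indices retained when whole chunks are kept or evicted."""
    n = len(token_scores)
    starts = list(range(0, n, chunk_size))
    # Aggregate token-level scores into one score per chunk (sum is an assumption).
    chunk_scores = [sum(token_scores[s:s + chunk_size]) for s in starts]

    ranked = sorted(range(len(starts)), key=lambda i: chunk_scores[i], reverse=True)
    kept = []
    for ci in sorted(ranked[:keep_chunks]):  # restore original token order
        start = starts[ci]
        kept.extend(range(start, min(start + chunk_size, n)))
    return kept


scores = [0.10, 0.10, 0.90, 0.80, 0.05, 0.05, 0.40, 0.50]
kept = select_chunks(scores, chunk_size=2, keep_chunks=2)
print(kept)
```

Because eviction happens per chunk, neighboring tokens survive or disappear together, which is what keeps retained segments contiguous.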
- CodeGEMM: A Codebook-Centric Approach to Efficient GEMM in Quantized LLMs
-
This paper proposes CodeGEMM, a codebook-centric GEMM kernel that precomputes inner products between centroids and activations and caches them as a Psumbook, replacing the conventional dequantization pipeline to achieve end-to-end speedups of 1.83× (8B) to 8.93× (70B) on 2-bit quantized LLMs.
- Compress, Gather, and Recompute: REFORMing Long-Context Processing in Transformers
-
This paper proposes REFORM, an inference framework that efficiently processes ultra-long contexts (up to millions of tokens) via a compress–gather–recompute three-stage pipeline. REFORM achieves improvements of 52% and 34% over the strongest baselines on RULER and BABILong respectively, while reducing inference time by 30% and peak memory usage by 5%.
- Correlation Dimension of Auto-Regressive Large Language Models
-
This paper introduces the correlation dimension from fractal geometry into LLM analysis. By measuring the recursive structure among next-token log-probability vectors, it quantifies the hierarchical complexity of text, revealing a three-stage evolution of LLM pretraining, an indicator of hallucination tendency, and a unified detection capability for multiple text degeneration patterns — none of which can be captured by perplexity.
- Data Efficient Adaptation in Large Language Models via Continuous Low-Rank Fine-Tuning
-
This paper proposes DEAL, a framework that leverages wavelet kernel feature filtering to preserve core historical knowledge in LoRA low-rank matrices, combined with a controlled knowledge update module and asymmetric regularization, enabling LLMs to acquire new knowledge without forgetting old tasks under few-shot continual fine-tuning.
- DeltaFlow: An Efficient Multi-frame Scene Flow Estimation Method
-
This paper proposes DeltaFlow (ΔFlow), which extracts motion cues via inter-frame voxel differences (Δ scheme) to enable multi-frame scene flow estimation with feature sizes that remain constant regardless of the number of input frames. The method achieves state-of-the-art performance on Argoverse 2, Waymo, and nuScenes while running 2× faster than the second-best multi-frame approach.
- DenoiseRotator: Enhance Pruning Robustness for LLMs via Importance Concentration
-
This paper proposes DenoiseRotator, a pre-pruning method that applies learnable orthogonal transformations to minimize the information entropy of parameter importance scores, concentrating importance into a small subset of parameters. On LLaMA3-70B under 2:4 semi-structured sparsity, perplexity degradation is reduced by 58% (8.1→3.4). The method is plug-and-play and compatible with Magnitude, Wanda, and SparseGPT.
- Dense Backpropagation Improves Training for Sparse Mixture-of-Experts
-
This paper proposes Default MoE, a method that maintains exponential moving averages (EMA) of inactive expert outputs as surrogate signals, enabling dense gradient updates for the MoE router without significant computational overhead, thereby improving sparse MoE training performance.
- Dependency Parsing is More Parameter-Efficient with Normalization
-
This paper identifies that the lack of normalization in biaffine scoring for dependency and semantic parsing leads to systematic overparameterization, and demonstrates that a simple \(1/\sqrt{d}\) scaling can reduce BiLSTM parameters by up to 85% while matching or surpassing original performance.
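A minimal sketch of a scaled biaffine score is shown below, by analogy with the \(1/\sqrt{d}\) factor used in scaled dot-product attention. Where exactly the paper applies the scaling is not stated in this summary, so dividing the bilinear term by \(\sqrt{d}\) is an assumption, as are the function name and toy inputs.

```python
import math


def biaffine_score(h_dep, W, h_head, scale=True):
    """Bilinear score h_dep^T W h_head, optionally divided by sqrt(d)."""
    d = len(h_head)
    Wh = [sum(w * x for w, x in zip(row, h_head)) for row in W]
    raw = sum(a * b for a, b in zip(h_dep, Wh))
    return raw / math.sqrt(d) if scale else raw


d = 4
identity = [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]
h = [1.0] * d

print(biaffine_score(h, identity, h, scale=False))
print(biaffine_score(h, identity, h, scale=True))
```

Without scaling, the magnitude of the raw score grows with the hidden size; dividing by \(\sqrt{d}\) keeps the inputs to the softmax in a comparable range as \(d\) changes.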
- Deterministic Continuous Replacement: Fast and Stable Module Replacement in Pretrained Transformers
-
DCR mixes teacher and student module outputs via a deterministic annealing weight \(\alpha(t)\), eliminating the gradient variance introduced by stochastic gating (e.g., BERT-of-Theseus), and achieves faster convergence and stronger feature alignment in cold-start module replacement scenarios.
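The deterministic mixing can be sketched as a convex combination of the two module outputs. The summary does not give the form of \(\alpha(t)\) or which module it weights, so the sketch assumes that \(\alpha(t)\) is the teacher weight and decays linearly from 1 to 0; the function names are hypothetical.

```python
def alpha(step, total_steps):
    """Teacher weight: linear decay from 1 to 0 (schedule shape is an assumption)."""
    return max(0.0, 1.0 - step / total_steps)


def mixed_output(teacher_out, student_out, step, total_steps):
    """Deterministic blend of teacher and student module outputs."""
    a = alpha(step, total_steps)
    return [a * t + (1.0 - a) * s for t, s in zip(teacher_out, student_out)]


teacher = [1.0, 2.0]
student = [3.0, 6.0]
for step in (0, 5, 10):
    print(step, mixed_output(teacher, student, step, total_steps=10))
```

Unlike a stochastic gate that samples which module to run, every forward pass at a given step uses the same blend, so the gating itself adds no sampling noise to the gradients.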
- Disentangling Latent Shifts of In-Context Learning with Weak Supervision
-
WILDA treats ICL as a weak supervision signal and encodes demonstration-induced latent shifts into lightweight LoRA adapters via a teacher-student framework, enabling efficient inference without repeated prompting. The student surpasses the teacher through pseudo-label correction and coverage extension, demonstrating weak-to-strong generalization.
- DP-LLM: Runtime Model Adaptation with Dynamic Layer-wise Precision Assignment
-
DP-LLM identifies that per-layer quantization sensitivity varies dynamically across decoding steps, and proposes a dynamic layer-wise precision selection mechanism based on relative error. At runtime, each layer is assigned a precision (h-bit or l-bit) conditioned on the current input, achieving a better performance–latency trade-off than static mixed-precision methods.
- DuoGPT: Training-free Dual Sparsity through Activation-aware Pruning in LLMs
-
This paper proposes DuoGPT, a dual-sparse framework that reinterprets activation sparsity as dynamic structured weight sparsity and combines it with unstructured weight pruning. By extending the OBC framework with activation-aware calibration and a dense-model output residual correction term, DuoGPT achieves significant speedup and memory savings during the LLM decoding phase without any retraining.
- Efficient Parametric SVD of Koopman Operator for Stochastic Dynamical Systems
-
This paper proposes a low-rank approximation (LoRA)-based objective to learn the top-k singular functions of the Koopman operator for stochastic dynamical systems, entirely avoiding the numerically unstable matrix decomposition operations present in VAMPnet/DPNet, with naturally unbiased gradients.
- Elastic ViTs from Pretrained Models without Retraining
-
SnapViT proposes a post-training structured pruning method that combines a local Hessian diagonal approximation derived from self-supervised gradients with global cross-module correlations estimated via evolutionary algorithms. Without any retraining or labels, it generates elastic ViT sub-networks spanning continuous sparsity levels in a single run, requiring less than 5 minutes on an A100 GPU.
- Explaining and Mitigating Crosslingual Tokenizer Inequities
-
This work systematically trains approximately 7,000 monolingual tokenizers covering 97 languages, providing the first demonstration that significant token premium disparities persist across languages even after controlling for training data size, vocabulary size, and algorithm. It further identifies vocabulary size and pre-tokenization strategy as key contributing factors, and proposes two mitigation approaches: language-specific optimal vocabulary size and SuperBPE.
- FALQON: Accelerating LoRA Fine-tuning with Low-Bit Floating-Point Arithmetic
-
FALQON eliminates the small-matrix quantization overhead introduced by standalone LoRA paths by directly melding LoRA adapters into FP8-quantized backbone weights. Combined with efficient gradient computation and a row-wise proxy update mechanism, it achieves approximately 3× training speedup over existing quantized LoRA methods.
- FastLongSpeech: Enhancing Large Speech-Language Models for Efficient Long-Speech Processing
-
FastLongSpeech compresses redundant speech representations via an iterative fusion strategy and transfers short-speech capabilities to long-speech scenarios through dynamic compression training. This enables large speech-language models (LSLMs) to process long speech efficiently without long-speech training data, achieving state-of-the-art performance on long-speech QA with a 70% improvement in inference efficiency.
- Few-Shot Knowledge Distillation of LLMs With Counterfactual Explanations
-
This paper proposes CoD (Counterfactual-explanation-infused Distillation), which injects counterfactual explanations into few-shot training sets to precisely map the teacher's decision boundary, achieving significant improvements over standard distillation methods across 6 datasets using only 8–512 samples.
- Fin3R: Fine-tuning Feed-forward 3D Reconstruction Models via Monocular Knowledge Distillation
-
Fin3R improves the geometric accuracy and robustness of feed-forward 3D reconstruction models (DUSt3R/MASt3R/CUT3R/VGGT) in a unified and lightweight manner: it freezes the decoder and fine-tunes the encoder via monocular knowledge distillation with re-normalization LoRA adapters.
- Find your Needle: Small Object Image Retrieval via Multi-Object Attention Optimization
-
MaO proposes a novel approach for Small Object Image Retrieval (SoIR) that integrates multi-object pre-training with attention-based feature refinement, aggregating representations of multiple objects into a single global descriptor, achieving substantial improvements over existing retrieval methods across multiple benchmarks.
- FiRA: Can We Achieve Full-Rank Training of LLMs Under Low-Rank Constraint?
-
This paper proposes Fira, the first LLM training framework that achieves full-rank training (full-rank gradients + full-rank weights) under low-rank constraints. By observing that the optimizer scaling factors in low-rank and full-rank training are highly similar, Fira approximates the correction of out-of-subspace gradients using low-rank scaling factors, and employs a norm-growth limiter to prevent loss spikes. Fira outperforms LoRA and GaLore in both pretraining and fine-tuning settings.
- FirstAidQA: A Synthetic Dataset for First Aid and Emergency Response in Low-Connectivity Settings
-
This paper introduces FirstAidQA, a dataset of 5,500 synthetic first aid question-answer pairs generated by ChatGPT-4o-mini from certified first aid textbooks and validated by human experts, designed to support fine-tuning of first aid AI systems in low-connectivity or offline environments.
- Gated Integration of Low-Rank Adaptation for Continual Learning of Large Language Models
-
This paper proposes GainLoRA, which introduces a gating module for each new task's LoRA branch in continual learning to generate adaptive integration coefficients. By enforcing orthogonal constraints, the new branch's output on old tasks is driven toward zero, effectively mitigating catastrophic forgetting.
- Geometric Data Valuation via Leverage Scores
-
This paper proposes a geometric data valuation method based on statistical leverage scores as an efficient proxy for Data Shapley values. The proposed method satisfies the axioms of symmetry, efficiency, and dummy player, and extends to ridge leverage scores to address the dimensionality saturation problem, providing theoretical guarantees of \(O(\varepsilon)\)-approximate optimality.
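For reference, the statistical leverage score of a data point \(x_i\) is \(x_i^\top (X^\top X)^{-1} x_i\), and the ridge variant replaces \(X^\top X\) with \(X^\top X + \lambda I\). The sketch below computes both for 2-D features using a closed-form 2×2 inverse; the restriction to two dimensions, the function name, and the toy data are simplifications for illustration, and any normalization the paper applies to turn scores into valuations is not shown.

```python
def leverage_scores(X, ridge=0.0):
    """Leverage scores x_i^T (X^T X + ridge * I)^{-1} x_i for 2-D rows of X."""
    # Entries of the symmetric 2x2 Gram matrix, with the ridge term on the diagonal.
    a = sum(x[0] * x[0] for x in X) + ridge
    b = sum(x[0] * x[1] for x in X)
    d = sum(x[1] * x[1] for x in X) + ridge
    det = a * d - b * b  # assumed non-zero (full-rank data or ridge > 0)
    inv = [[d / det, -b / det], [-b / det, a / det]]

    return [
        x[0] * (inv[0][0] * x[0] + inv[0][1] * x[1])
        + x[1] * (inv[1][0] * x[0] + inv[1][1] * x[1])
        for x in X
    ]


X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [3.0, 0.0]]
plain = leverage_scores(X)
ridged = leverage_scores(X, ridge=1.0)
print(plain, sum(plain))
print(ridged, sum(ridged))
```

The plain scores always sum to the rank of \(X\) (here 2), which is the saturation that the ridge term relaxes: with \(\lambda > 0\) the total falls below the rank.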
- Geometry of Decision Making in Language Models
-
By measuring the Intrinsic Dimension (ID) of hidden representations across layers in 28 open-source Transformer models at scale, this paper reveals a consistent "low–high–low" pattern: early layers operate on low-dimensional manifolds, middle layers expand the representational space, and later layers re-compress into low-dimensional representations aligned with decision-making.
- Global Minimizers of ℓp-Regularized Objectives Yield the Sparsest ReLU Neural Networks
-
This paper proves that, for single-hidden-layer ReLU networks, global minimizers of the \(\ell^p\) (\(0 < p < 1\)) path norm correspond exactly to the sparsest data-interpolating networks, thereby recasting the combinatorial sparse interpolation problem as a continuously differentiable optimization task.
- GoRA: Gradient-Driven Adaptive Low Rank Adaptation
-
GoRA leverages pre-computed gradient information to perform adaptive rank allocation and weight initialization together, prior to training. It assigns per-layer ranks based on parameter sensitivity and initializes the \(B\) matrix via the gradient pseudo-inverse so that the initial output approximates one step of gradient descent, thereby addressing both major bottlenecks of LoRA in a unified framework.
- Graph Your Own Prompt
-
This paper proposes a Graph Consistency Regularization (GCR) framework that inserts parameter-free Graph Consistency Layers (GCL) at arbitrary network depths. GCL aligns the relational graph of intermediate features with a class-aware semantic graph derived from model predictions, promoting semantically consistent feature learning in a self-prompting manner—improving classification generalization without modifying the architecture or introducing additional parameters.
- GraSS: Scalable Data Attribution with Gradient Sparsification and Sparse Projection
-
GraSS and FactGraSS form a two-stage gradient compression approach that exploits the inherent sparsity of per-sample gradients to achieve sublinear time and space complexity (\(O(k')\)), outperforming the SOTA baseline LoGra by 165% in throughput on billion-parameter models while maintaining data attribution quality.
- Graver: Generative Graph Vocabularies for Robust Graph Foundation Models Fine-tuning
-
This paper proposes Graver, a framework that decouples ego-graphs to extract transferable subgraph vocabularies, models their distributions via graphon experts, and routes relevant vocabularies to augment support samples through MoE-CoE, addressing the instability caused by structural mismatch in few-shot fine-tuning of graph foundation models (GFMs).
- Hankel Singular Value Regularization for Highly Compressible State Space Models
-
By regularizing the Hankel singular value nuclear norm of SSM layers during training to encourage rapid decay, the trained model can be compressed to 10% of its original order via balanced truncation with negligible accuracy loss. A block-diagonal rotation matrix parameterization reduces Gramian computation from \(\mathcal{O}(n^3)\) to \(\mathcal{O}(n^2)\).
- Homogeneous Keys, Heterogeneous Values: Exploiting Local KV Cache Asymmetry for Long-Context LLMs
-
This paper identifies a previously overlooked local Key-Value asymmetry in LLM attention mechanisms—neighboring Keys exhibit homogeneity (similar attention weight distributions), while neighboring Values are heterogeneously distributed. Based on this observation, the paper proposes AsymKV, a training-free compression framework that merges Keys via homogeneity and represents Values losslessly through cardinality normalization, outperforming H2O by 5 points on LongBench.
- How to Build a Consistency Model: Learning Flow Maps via Self-Distillation
-
This paper proposes a unified self-distillation framework for directly learning flow maps (the generalized form of consistency models). By exploiting the tangent condition, any distillation scheme is converted into a direct training algorithm that requires no pretrained teacher. Three algorithm families are derived (Eulerian / Lagrangian / Progressive), among which the Lagrangian method avoids spatial gradients and bootstrapping, achieving the most stable training and best performance.
- Hyperbolic Dataset Distillation
-
This paper proposes HDD, the first method to incorporate hyperbolic space into dataset distillation. By matching the Riemannian centroids of real and synthetic data in the Lorentz hyperbolic space—rather than performing distribution matching in Euclidean space—HDD leverages the hierarchical weighting property of hyperbolic geometry to assign higher influence to more representative, low-level samples. The method consistently improves over DM/IDM baselines across multiple datasets.
- Inference-Time Hyper-Scaling with KV Cache Compression
-
This paper proposes the Inference-Time Hyper-Scaling paradigm: by efficiently compressing the KV cache, more or longer parallel reasoning sequences can be generated under the same compute/memory budget, substantially improving the accuracy of reasoning models on tasks such as mathematics, code, and scientific reasoning.
- KeyDiff: Key Similarity-Based KV Cache Eviction for Long-Context LLM Inference in Resource-Constrained Environments
-
This paper proposes KeyDiff — an attention-score-free KV cache eviction strategy that maintains the cache by retaining keys with the lowest average cosine similarity to other keys (i.e., geometrically most unique). Under strict memory constraints in block-wise inference settings, KeyDiff achieves ≤0.04% accuracy loss on LongBench with an 8K cache budget, while reducing end-to-end inference latency by up to 30%.
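The retention rule described above can be sketched directly: score each key by its average cosine similarity to all other keys and keep the lowest-scoring ones. The quadratic pairwise loop is chosen for clarity only and is not claimed to match how the method computes the scores in practice; `keydiff_keep` and the toy keys are hypothetical.

```python
import math


def keydiff_keep(keys, budget):
    """Return indices of the `budget` keys least similar, on average, to the others."""
    # Normalize so that dot products are cosine similarities (keys assumed non-zero).
    unit = []
    for k in keys:
        norm = math.sqrt(sum(x * x for x in k))
        unit.append([x / norm for x in k])

    n = len(unit)
    avg_sim = []
    for i in range(n):
        sims = [sum(a * b for a, b in zip(unit[i], unit[j])) for j in range(n) if j != i]
        avg_sim.append(sum(sims) / len(sims))

    keep = sorted(range(n), key=lambda i: avg_sim[i])[:budget]
    return sorted(keep)


keys = [[1.0, 0.0], [0.9, 0.1], [1.0, 0.05], [0.0, 1.0], [-1.0, 0.0]]
print(keydiff_keep(keys, budget=2))
```

The three nearly parallel keys are treated as redundant, so the two keys pointing in distinct directions are retained, without any attention scores being consulted.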
- KINDLE: Knowledge-Guided Distillation for Prior-Free Gene Regulatory Network Inference
-
This paper proposes KINDLE, a three-stage framework that transfers gene regulatory knowledge learned by a prior-guided teacher model to a prior-free student model via knowledge distillation, achieving state-of-the-art performance in gene regulatory network (GRN) inference without relying on any external prior knowledge.
- Knowledge Distillation Detection for Open-weights Models
-
This paper introduces the task of knowledge distillation detection, proposing a data-free input synthesis and statistical scoring framework to determine whether an open-weights student model has been distilled from a specific teacher model.
- KVzip: Query-Agnostic KV Cache Compression with Context Reconstruction
-
This paper proposes KVzip, a query-agnostic KV cache eviction method that quantifies the importance of each KV pair by leveraging the LLM itself to reconstruct the original context from the cached KV pairs. KVzip achieves 3–4× KV cache compression and approximately 2× reduction in FlashAttention decoding latency, while significantly outperforming existing query-aware methods in multi-query scenarios.
- LayerIF: Estimating Layer Quality for Large Language Models using Influence Functions
-
LayerIF proposes using influence functions (IFs) to quantify the training quality of each layer in LLMs. By aggregating positive influence scores per layer, it derives a data-driven layer importance estimate, which is subsequently applied to two downstream tasks: LoRA-MoE expert allocation and layer-wise sparse pruning. The method achieves accuracy gains of 1.61% and 0.90% on Mistral-7B and Gemma-7B, respectively.
- Learning Grouped Lattice Vector Quantizers for Low-Bit LLM Compression
-
GLVQ learns a dedicated lattice codebook (defined by a learnable generator matrix) for each weight group of an LLM, combined with group-specific μ-law companding to handle heavy-tailed distributions. Under 2-bit quantization, it achieves a Wikitext-2 perplexity of 3.36 on Llama-2-70B, substantially outperforming QuIP# (3.91) and QTIP (3.78).
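The standard μ-law companding functions are \(F(x) = \operatorname{sgn}(x)\,\ln(1+\mu|x|)/\ln(1+\mu)\) and its inverse \(F^{-1}(y) = \operatorname{sgn}(y)\,((1+\mu)^{|y|}-1)/\mu\). The sketch below applies them around a plain uniform grid, which is only a stand-in for the learned lattice codebook; it also assumes weights are pre-scaled to \([-1, 1]\) and that \(\mu\) is fixed rather than learned per group. All function names are hypothetical.

```python
import math


def mu_compress(x, mu=255.0):
    """mu-law compression of a value in [-1, 1]."""
    return math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)


def mu_expand(y, mu=255.0):
    """Inverse of mu_compress."""
    return math.copysign(math.expm1(abs(y) * math.log1p(mu)) / mu, y)


def quantize_companded(x, levels=4, mu=255.0):
    """Compress, round to a uniform grid (stand-in for a lattice), then expand."""
    y = mu_compress(x, mu)
    step = 2.0 / (levels - 1)
    y_q = round((y + 1.0) / step) * step - 1.0
    return mu_expand(y_q, mu)


for w in (0.01, 0.3, -0.3, 1.0):
    print(w, mu_compress(w), quantize_companded(w))
```

Compression stretches the region near zero, so a grid that is uniform in the companded domain places more reconstruction levels where most weights lie and fewer in the heavy tails.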
- Learning to Better Search with Language Models via Guided Reinforced Self-Training
-
This paper proposes Guided-ReST, which progressively incorporates optimal solutions as subgoals into model-generated search trajectories to produce high-quality training data and distill more efficient search strategies. The approach yields substantial improvements in search efficiency and accuracy on Countdown and code self-repair tasks.
- Learning to Factorize and Adapt: A Versatile Approach Toward Universal Spatio-Temporal Foundation Models
-
This paper proposes FactoST-v2, a factorized spatio-temporal foundation model framework that decouples universal temporal pre-training from domain-specific spatial adaptation, achieving cross-domain zero-shot/few-shot/full-shot spatio-temporal forecasting with linear complexity.
- Less is More but Where: Dynamic Token Compression via LLM-Guided Keyframe Prior
-
This paper proposes DyToK, a training-free dynamic video token compression method that leverages query-conditioned keyframe priors inherent in the deep attention layers of VLLMs to adaptively allocate token budgets across frames, achieving plug-and-play optimal efficiency–accuracy trade-offs.
- Linear Attention for Efficient Bidirectional Sequence Modeling
-
This paper proposes Lion, a framework that, for the first time, systematically extends linear Transformers to bidirectional sequence modeling. It unifies three equivalent representations—full linear attention, bidirectional RNN, and chunkwise parallel—achieving training speeds up to 10× faster than SSMs while matching softmax Transformer performance.
- LittleBit: Ultra Low-Bit Quantization via Latent Factorization
-
This paper proposes LittleBit, a framework that achieves extreme LLM compression down to 0.1 BPW (bits per weight) via low-rank latent-space matrix factorization, binarization, and a multi-scale compensation mechanism. It compresses Llama2-13B to under 0.9 GB and substantially outperforms STBLLM in the sub-1-bit regime.
- Loquetier: A Virtualized Multi-LoRA Framework for Unified LLM Fine-tuning and Serving
-
Loquetier is a framework that unifies the fine-tuning and inference of multiple LoRA adapters within a single runtime via a Virtualized Module and a Segmented Multi-LoRA Multiplication (SMLM) kernel, achieving a 3.0× throughput improvement for inference-only tasks and a 46.4× higher SLO attainment rate for unified tasks.
- LT-Soups: Bridging Head and Tail Classes via Subsampled Model Soups
-
This paper proposes LT-Soups, a two-stage model merging framework that trains multiple models on subsampled datasets with progressively varying imbalance ratios and aggregates them via weight averaging, achieving balanced performance across head and tail classes over the full long-tail spectrum.
- Matryoshka Pilot: Learning to Drive Black-Box LLMs with LLMs
-
This paper proposes Matryoshka Pilot (M-Pilot), which employs a lightweight white-box LLM as a controller to generate intermediate guidance (task decomposition, high-level plans, user profiles) for driving black-box LLMs on complex long-horizon tasks such as reasoning, planning, and personalization, with iterative DPO enabling continual self-improvement.
- Memory-Efficient Training with In-Place FFT Implementation
-
This paper proposes rdFFT—the first truly in-place real-domain Fast Fourier Transform framework—which eliminates intermediate buffers via an implicit complex encoding scheme, achieving zero extra memory overhead for FFT/IFFT computation during training, with memory efficiency improvements exceeding 1500× in extreme cases.
- Mingle: Mixture of Null-Space Gated Low-Rank Experts for Test-Time Continual Model Merging
-
This paper proposes a new paradigm called Test-Time Continual Model Merging (TTCMM) and the Mingle framework, which employs a low-rank mixture-of-experts architecture with an adaptive null-space constrained gating mechanism to dynamically merge models at test time using a small number of unlabeled samples. Mingle outperforms state-of-the-art methods by 7–9% across multiple benchmarks while reducing forgetting to near zero.
- Mitigating Semantic Collapse in Partially Relevant Video Retrieval
-
To address semantic collapse in Partially Relevant Video Retrieval (PRVR), this paper proposes Text Correlation Preservation Learning (TCPL) and Cross-Branch Video Alignment (CBVA), which mitigate collapse phenomena in the text and video embedding spaces respectively, achieving substantial improvements in retrieval accuracy.
- Mixed Monotonicity Reachability Analysis of Neural ODE: A Trade-Off Between Tightness and Efficiency
-
This paper applies continuous-time mixed monotonicity techniques to the reachability analysis of Neural ODEs. By embedding Neural ODE dynamics into a mixed monotone system, it exploits the geometric simplicity of interval boxes to achieve efficient over-approximation, providing a controllable trade-off between tightness and computational efficiency.
- Mixture of Noise for Pre-Trained Model-Based Class-Incremental Learning
-
This paper proposes learning beneficial "mixture of noise" to suppress parameter drift in pre-trained models during class-incremental learning. By dynamically mixing task-specific noise with learned weights across tasks, the method achieves state-of-the-art performance, particularly in the challenging 50-step incremental setting.
- MTL-KD: Multi-Task Learning Via Knowledge Distillation for Generalizable Neural Vehicle Routing Solver
-
This paper proposes MTL-KD, a multi-task learning framework based on knowledge distillation. It distills policy knowledge from multiple RL single-task teacher models into a heavy-decoder student model, achieving efficient unified solving across diverse VRP variants with superior generalization on large-scale instances.
- Multi-Task Vehicle Routing Solver via Mixture of Specialized Experts under State-Decomposable MDP
-
This paper proposes the State-Decomposable MDP (SDMDP) framework, which reformulates multiple VRP variants as Cartesian products of base state spaces, and introduces the Mixture-of-Specialized-Experts Solver (MoSES), which leverages dedicated LoRA experts to enable latent space reuse of base policies, efficiently handling 16 VRP variants.
- MUSTAFAR: Promoting Unstructured Sparsity for KV Cache Pruning in LLM Inference
-
This paper proposes MUSTAFAR, a framework that systematically demonstrates the superiority of unstructured sparsity for KV cache pruning—achieving 70% sparsity on both Key and Value caches without accuracy degradation—and introduces a bitmap-based sparse format with a custom attention kernel, yielding up to 2.23× end-to-end inference throughput improvement.
- Navigating Simply, Aligning Deeply: Winning Solutions for Mouse vs. AI 2025
-
In the NeurIPS 2025 Mouse vs. AI competition, this paper presents the counterintuitive finding that a lightweight two-layer CNN substantially outperforms deep networks on visual robustness tasks, while demonstrating that a deeper ResNet architecture is more advantageous for neural alignment, revealing a fundamental tension between behavioral robustness and biological plausibility.
- Offline Policy Evaluation of Multi-Turn LLM Health Coaching with Real Users
-
This paper conducts offline policy evaluation (OPE) on a deployed LLM health coaching system with real users. It finds that a uniformly high tool-use policy improves average reward but harms specific user subgroups. Through simulator experiments, the paper further validates that early information-gain exploration (curiosity reward) accelerates user profile identification and improves task success rates.
- On the Creation of Narrow AI: Hierarchy and Nonlocality of Neural Network Skills
-
This paper investigates two fundamental challenges in creating narrow AI systems. First, hierarchical dependencies among tasks mean that certain narrow skills can be learned effectively only when the model is trained on broad distributions. Second, the nonlocality of skills makes it impossible to precisely separate desired from undesired capabilities via pruning. Even so, pruning followed by recovery fine-tuning still outperforms both distillation and training from scratch.
- On the Hardness of Approximating Distributions with Tractable Probabilistic Models
-
This paper proves that approximating arbitrary distributions with tractable probabilistic models (e.g., decomposable probabilistic circuits) under bounded \(f\)-divergence is NP-hard, and establishes an exponential size separation between decomposable PCs and (deterministic + decomposable) PCs under approximate modeling, demonstrating that approximation relaxations do not alleviate the complexity bottlenecks inherent in exact modeling.
- One-Step Diffusion-Based Image Compression with Semantic Distillation
-
This paper proposes OneDC—the first one-step diffusion-based generative image codec—which replaces text with the hyperprior as the semantic conditioning signal for the diffusion model and enhances its representational capacity via semantic distillation, achieving state-of-the-art perceptual quality with 39% bitrate savings and 20× decoding speedup over multi-step diffusion codecs.
- Online Mixture of Experts: No-Regret Learning for Optimal Collective Decision-Making
-
This paper proposes an Online Mixture of Experts (OMoE) framework comprising two algorithms — UCB-Successive Elimination and Online Weighted Majority Voting — with theoretical no-regret guarantees, and applies them to the online dynamic aggregation of LLM experts.
- Optimizing Distributional Geometry Alignment with Optimal Transport for Generative Dataset Distillation
-
This paper reformulates dataset distillation as an optimal transport (OT) distance minimization problem and achieves fine-grained distributional geometry alignment through a three-stage pipeline (OT-guided diffusion sampling, label-image alignment soft re-labeling, and OT logit matching), yielding at least 4% improvement over the previous state of the art on ImageNet-1K at IPC=10.
- Order-Level Attention Similarity Across Language Models: A Latent Commonality
-
This paper proposes Order-Level Attention (OLA)—an order-wise decomposition of Attention Rollout—and discovers that different language models exhibit significant similarity in same-order OLA (OLAS). OLA is shown to implicitly encode syntactic knowledge, and based on this finding, the paper proposes TOA, the first training-free cross-LM adapter transfer method.
- ORPO-Distill: Mixed-Policy Preference Optimization for Cross-Architecture LLM Distillation
-
This paper proposes ORPO-Distill, which reformulates cross-architecture LLM knowledge distillation as a preference optimization problem. The teacher model generates positive reasoning chains while the student model generates negative ones; an ORPO contrastive loss is used for training, augmented by a mixed-policy update strategy for student negative samples. The method consistently outperforms black-box KD baselines across 5 QA benchmarks.
- ParetoQ: Improving Scaling Laws in Extremely Low-bit LLM Quantization
-
This paper proposes ParetoQ — the first unified framework supporting 1/1.58/2/3/4-bit quantization — which systematically studies training strategies (full-precision pretraining vs. QAT budget allocation) and quantization function design (introducing the SEQ quantizer). The work demonstrates that 2-bit and 1.58-bit quantization outperform conventional 4-bit in the accuracy–model-size trade-off, and achieves state-of-the-art results across all bit-widths.
- PermLLM: Learnable Channel Permutation for N:M Sparse Large Language Models
-
This paper proposes PermLLM, the first learnable channel permutation (LCP) framework for N:M sparse LLMs. By relaxing discrete permutation matrices into differentiable soft permutation matrices via Sinkhorn normalization, PermLLM enables end-to-end optimization. Combined with a block-level permutation strategy that substantially reduces computational overhead, the framework effectively improves the performance of N:M sparse LLMs.
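The Sinkhorn relaxation mentioned above can be illustrated with a minimal sketch. This is not PermLLM's implementation: the function name `sinkhorn`, the temperature `tau`, the iteration count, and the example scores are all assumptions made for illustration. The sketch only shows how alternating row and column normalization turns an arbitrary score matrix into an approximately doubly stochastic "soft permutation" that gradients could flow through.

```python
import math


def sinkhorn(scores, n_iters=50, tau=1.0):
    """Relax a square score matrix into an approximately doubly stochastic matrix.

    scores: list of equal-length lists (hypothetical learnable logits).
    tau: temperature (assumed knob); smaller values give sharper matrices.
    """
    n = len(scores)
    # Exponentiate temperature-scaled scores so that every entry is positive.
    m = [[math.exp(x / tau) for x in row] for row in scores]
    for _ in range(n_iters):
        # Row step: make each row sum to 1.
        m = [[x / sum(row) for x in row] for row in m]
        # Column step: make each column sum to 1.
        col_sums = [sum(m[i][j] for i in range(n)) for j in range(n)]
        m = [[m[i][j] / col_sums[j] for j in range(n)] for i in range(n)]
    return m


scores = [[2.0, 0.0, 1.0],
          [0.0, 3.0, 1.0],
          [1.0, 0.0, 2.0]]
soft = sinkhorn(scores)             # smooth, differentiable relaxation
sharp = sinkhorn(scores, tau=0.1)   # closer to a hard permutation
```

Lowering `tau` pushes the result toward a 0/1 permutation matrix, which is why a temperature of this kind is a common way to trade smoothness against discreteness in relaxations like this one.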
- Perturbation Bounds for Low-Rank Inverse Approximations under Noise
-
This work derives the first non-asymptotic spectral norm perturbation bounds for low-rank inverse approximations \(\|(\tilde{A}^{-1})_p - A_p^{-1}\|\) under noise, via a novel contour bootstrapping technique that handles the non-entire function \(f(z) = 1/z\). Under favorable conditions, the proposed bounds improve upon classical bounds by a factor of \(\sqrt{n}\).
- PPG-Distill: Efficient Photoplethysmography Signals Analysis via Foundation Model Distillation
-
PPG-Distill proposes a knowledge distillation framework tailored for PPG signals. By combining prediction-level, feature-level, and patch-level (morphology + rhythm) distillation, it transfers knowledge from large PPG foundation models to lightweight student models, achieving up to 21.8% performance improvement alongside 7× inference speedup and 19× memory compression.
- Preference-driven Knowledge Distillation for Few-shot Node Classification
-
PKD is a framework that jointly leverages LLMs and multiple GNN teachers for few-shot node classification on text-attributed graphs. A GNN-preference node selector (GNS) uses KL divergence-based uncertainty to identify nodes requiring LLM annotation, while a node-preference GNN selector (NGS) employs RL to match each node with its optimal GNN teacher. PKD achieves consistent state-of-the-art performance across 9 datasets (e.g., Cornell 87% vs. baselines 59–82%).
- Q-Palette: Fractional-Bit Quantizers Toward Optimal Bit Allocation for Efficient LLM Deployment
-
This paper derives optimal bit allocation for Gaussianized weights from an information-theoretic perspective, proposes the Q-Palette collection of fractional-bit quantizers and a mixed-scheme quantization framework, and achieves near-optimal quantization performance with inference acceleration in LLM deployment.
- QSVD: Efficient Low-Rank Approximation for Unified Query-Key-Value Weight Compression
-
This paper proposes QSVD, which performs SVD on the joint QKV weight matrix and shares a single down-projection matrix across Q, K, and V to reduce KV cache size and computational overhead. Combined with importance-score-based adaptive rank allocation and a quantization scheme compatible with low-rank decomposition, QSVD achieves over 10% accuracy improvement on VLMs at lower hardware cost.
- QuadEnhancer: Leveraging Quadratic Transformations to Enhance Deep Neural Networks
-
This paper proposes a lightweight quadratic enhancer (QuadEnhancer) that introduces sparsified quadratic interaction terms into each linear layer, achieving significant performance improvements over existing neural network architectures with negligible additional parameters and computational overhead.
- Quantization Error Propagation: Revisiting Layer-Wise Post-Training Quantization
-
This paper identifies a critical bottleneck in existing layer-wise PTQ methods—namely, their neglect of cross-layer accumulation and growth of quantization errors—and proposes the QEP framework, which explicitly corrects accumulated errors via error propagation and compensation, achieving substantial performance gains under extremely low-bit settings (INT2/INT3).
- RAT: Bridging RNN Efficiency and Attention Accuracy via Chunk-based Sequence Modeling
-
This paper proposes RAT (Recurrence And aTtention), a chunk-based intermediate architecture that models local dependencies within chunks via linear RNNs and enables global access across chunks via softmax attention. At \(L=16\), RAT achieves a 9× single-layer decoding speedup and 10× maximum throughput improvement over standard attention with comparable performance; a hybrid variant alternating with sliding window attention achieves state-of-the-art results on nearly all benchmarks.
- RCCDA: Adaptive Model Updates in the Presence of Concept Drift under a Constrained Resource Budget
-
This paper proposes RCCDA, a lightweight model update policy based on the Lyapunov drift-plus-penalty framework. Under concept drift scenarios where the data distribution shifts over time, RCCDA greedily determines when to retrain the model using only historical inference loss and a tunable threshold, while provably satisfying strict resource budget constraints.
- Rectifying Soft-Label Entangled Bias in Long-Tailed Dataset Distillation
-
This paper identifies a dual entangled bias in soft labels within long-tailed dataset distillation — originating from both the distillation model and the distilled images — and proposes ADSA, an Adaptive Soft-label Alignment module that eliminates this bias via post-hoc calibration in logit space. As a plug-and-play module, ADSA integrates seamlessly into existing distillation pipelines, achieving up to 11.8% accuracy improvement on tail classes on ImageNet-1k-LT.
- Recurrent Attention-based Token Selection for Efficient Streaming Video-LLMs
-
This paper proposes rLiVS (Recurrent LLM-informed Visual Selection), a training-free and model-agnostic method for streaming video understanding. It achieves state-of-the-art performance on streaming video benchmarks through three complementary designs: LLM attention-guided visual token selection (retaining only ~6% of tokens), recurrent reuse of historical tokens, and caption-based retrieval for question answering.
- RefLoRA: Refactored Low-Rank Adaptation for Efficient Fine-Tuning of Large Models
-
RefLoRA selects the optimal low-rank factorization form at each iteration by minimizing an upper bound on the loss, thereby addressing the weight update inconsistency and imbalance caused by the non-uniqueness of the LoRA decomposition. It accelerates convergence and improves fine-tuning performance with negligible additional computational overhead.
- Reject Only Critical Tokens: Pivot-Aware Speculative Decoding
-
PAD proposes a new speculative decoding paradigm based on utility matching rather than distribution matching. It trains a lightweight classifier to identify pivot tokens and rejects only those draft tokens that would degrade final output utility, achieving a 2.46× speedup on GSM8K with negligible accuracy loss.
- REOrdering Patches Improves Vision Models
-
This paper reveals that patch ordering significantly affects the performance of long-sequence vision models, and proposes the REOrder framework, which leverages information-theoretic priors and reinforcement learning to automatically discover optimal patch permutations, achieving up to 3.01% improvement on ImageNet-1K and 13.35% on FMoW.
- REP: Resource-Efficient Prompting for Rehearsal-Free Continual Learning
-
REP reduces training time by up to 51% and memory consumption by up to 41% for prompt-based rehearsal-free continual learning methods, with negligible accuracy loss, via three complementary techniques: fast prompt selection using a lightweight surrogate model, Adaptive Token Merging (AToM), and Adaptive Layer Dropping (ALD).
- ReplaceMe: Network Simplification via Depth Pruning and Transformer Block Linearization
-
ReplaceMe is a training-free depth pruning method that uses a small calibration dataset to estimate a linear transformation approximating groups of pruned Transformer blocks. This transformation is fused into adjacent layer weights without introducing additional parameters, achieving 25% pruning on LLaMA-2-7B while retaining approximately 90% of original performance.
- Restoring Pruned Large Language Models via Lost Component Compensation
-
RestoreLCC proposes a targeted recovery strategy for pruned LLMs: it uses contrastive probing to identify critical attention heads, applies SVD decomposition to extract activation components lost during pruning, and injects them back into the pruned model as learnable bias vectors — significantly restoring performance without compromising sparsity or inference speed.
- Revisiting Semi-Supervised Learning in the Era of Foundation Models
-
A systematic study reveals that conventional SSL methods offer limited benefit in the VFM era—PEFT on labeled data alone can match SSL—motivating V-PET: a simple and effective semi-supervised learning approach that ensembles pseudo-labels from multiple PEFT methods and multiple VFMs.
- Robust Federated Finetuning of LLMs via Alternating Optimization of LoRA
-
This paper proposes RoLoRA, which alternately optimizes the down-projection (\(\mathbf{A}\)) and up-projection (\(\mathbf{B}\)) matrices of LoRA to address imprecise aggregation and limited expressiveness in federated learning. RoLoRA significantly outperforms FedAVG of LoRA and FFA-LoRA on RoBERTa-Large and Llama-2-7B.
- Robustifying Learning-Augmented Caching Efficiently without Compromising 1-Consistency
-
This paper proposes Guard, a lightweight robustification framework that improves the robustness of a broad class of learning-augmented caching algorithms to \(2H_{k-1}+2\) while preserving 1-consistency and incurring only \(O(1)\) additional overhead per request.
- S2M-Former: Spiking Symmetric Mixing Branchformer for Brain Auditory Attention Detection
-
This paper proposes S2M-Former, a spiking-driven symmetric mixing Branchformer framework that achieves SOTA-level accuracy on EEG-based auditory attention detection with only 0.06M parameters, via complementary learning across spatial-frequency dual branches and lightweight 1D token representations, while reducing energy consumption to 1/5.8 of dual-branch ANN counterparts.
- Scalable Exploration via Ensemble++
-
This paper proposes Ensemble++, which achieves regret bounds comparable to exact Thompson Sampling using only \(\Theta(d\log T)\) ensemble size via an incremental update mechanism over shared factor matrices, with natural extension to nonlinear/neural network settings.
- SCAN: Self-Denoising Monte Carlo Annotation for Robust Process Reward Learning
-
This paper proposes the SCAN framework, which analyzes the noise distribution in Monte Carlo annotations to design a self-denoising sampling strategy and a robust learning loss. A PRM trained on only 101K samples generated by a 1.5B model outperforms one trained on the human-annotated PRM800K dataset.
- Single-Teacher View Augmentation: Boosting Knowledge Distillation via Angular Diversity
-
This paper proposes Angular-KD, which attaches multiple lightweight linear branches to a single teacher model and introduces two angular diversity losses — a constrained inter-angle diversity loss and an intra-angle diversity loss — to generate diverse supervisory signals from a single teacher. This approach serves as a low-cost alternative to multi-teacher distillation and achieves state-of-the-art performance across multiple KD benchmarks.
- Skrull: Towards Efficient Long Context Fine-tuning through Dynamic Data Scheduling
-
To address the training inefficiency caused by mixing long and short sequences in Long-context Supervised Fine-Tuning (Long-SFT), this paper proposes Skrull, a dynamic data scheduler consisting of two components — Distribution-Aware Context Parallelism (DACP) and Global Data Scheduling (GDS) — achieving an average 3.76× (up to 7.54×) training speedup in realistic Long-SFT scenarios.
- Smooth Regularization for Efficient Video Recognition
-
This paper proposes a Gaussian Random Walk (GRW)-based smooth regularization technique that imposes temporal smoothness constraints (penalizing high-acceleration changes) on intermediate-layer embeddings of video recognition models, achieving 3.8%–6.4% accuracy improvements on lightweight models and establishing a new state of the art on Kinetics-600 under corresponding FLOP constraints.
- Spark Transformer: Reactivating Sparsity in FFN and Attention
-
This paper proposes the Spark Transformer architecture, which simultaneously achieves high-level activation sparsity in both FFN and attention mechanisms (only 8% of neurons activated in FFN; each token attends to at most 256 tokens) via a Statistical Top-k operator. The approach achieves a 2.5× FLOPs reduction and up to 1.79× inference speedup while maintaining quality comparable to Gemma-2.
- SpecAttn: Speculating Sparse Attention
-
SpecAttn proposes a training-free method that leverages attention weights already computed by the draft model in speculative decoding to predict important tokens for the verification model. Through KL divergence layer mapping, sorting-free top-p nucleus selection, and dynamic KV cache pruning, it achieves a 78.4% reduction in KV cache accesses with only a 15.29% increase in perplexity, significantly outperforming existing sparse attention methods.
- Spiking Brain Compression: Post-Training Second-Order Compression for Spiking Neural Networks
-
This paper proposes Spiking Brain Compression (SBC), a second-order post-training one-shot compression framework based on the Van Rossum Distance, designed specifically for spiking neural networks (SNNs). By introducing a Surrogate Membrane Potential (SMP) Hessian, SBC enables efficient module-wise pruning and quantization, and for the first time compresses SEW-ResNet152 and Spike-Driven Transformer at the ImageNet scale.
- Synergy between the Strong and the Weak: Spiking Neural Networks Are Inherently Superior in Temporal Processing
-
This paper identifies that SNNs can be naturally decomposed into multiple sub-models along the temporal dimension. By comparing output confidence across timestep sub-models to identify "strong" and "weak" instances, the paper proposes two self-distillation schemes — Strong2Weak and Weak2Strong — that significantly improve SNN performance without any external teacher model, achieving gains of up to 5.36% on neuromorphic datasets.
- The Graphon Limit Hypothesis: Understanding Neural Network Pruning via Infinite Width Analysis
-
This paper proposes the "Graphon Limit Hypothesis": as network width tends to infinity, the binary mask sequences produced by different pruning methods converge, under the cut distance, to their respective unique graphon limits. Building on this foundation, the paper derives a Graphon NTK to analyze the training dynamics of sparse networks, providing a theoretical explanation for why different pruning methods yield markedly different performance at the same sparsity level.
- The Structure of Relation Decoding Linear Operators in Large Language Models
-
This paper reveals that linear relation embeddings (LREs) in Transformer language models do not encode fine-grained relations but instead extract shared coarse-grained semantic attributes (e.g., "country," "gender"). A rank-3 tensor network is employed to compress large collections of relation decoding matrices by several orders of magnitude.
- Tighter CMI-Based Generalization Bounds via Stochastic Projection and Quantization
-
By incorporating stochastic projection and lossy compression into the CMI (conditional mutual information) framework, this paper derives tighter generalization bounds, resolves the failure of classical CMI bounds on SCO counterexamples, and proves that memorization is not necessary for good generalization.
- TokenSqueeze: Performance-Preserving Compression for Reasoning LLMs
-
TokenSqueeze proposes a three-stage pipeline — adaptive reasoning depth selection, intra-step linguistic refinement (with KL divergence constraints), and length-aware preference optimization — achieving 50% token compression of reasoning chains without accuracy degradation, using only self-generated data.
- Toward Efficient Inference Attacks: Shadow Model Sharing via Mixture-of-Experts
-
This paper proposes a Mixture-of-Experts (MoE)-based shadow model sharing framework that reduces the overall training cost of shadow models by sharing feature extraction layers across multiple inference attack tasks while training only lightweight task-specific expert modules, maintaining or improving attack performance.
- Towards Effective Federated Graph Foundation Model via Mitigating Knowledge Entanglement
-
This work is the first to propose the Federated Graph Foundation Model (FedGFM) paradigm, which integrates the distributed collaborative capability of federated graph learning with the cross-domain generalization capability of graph foundation models. Two modules — AncDAI (Anchor-based Domain-Aware Initialization) and AdaDPP (Adaptive Domain-sensitive Prompt Pool) — are introduced to mitigate knowledge entanglement, achieving state-of-the-art performance on 8 cross-task, cross-domain datasets against 20 baselines.
- Traversal Verification for Speculative Tree Decoding
-
This paper proposes Traversal Verification, a bottom-up verification algorithm that traverses from leaf nodes to the root. Rather than making acceptance/rejection decisions based on per-token probabilities, it considers the sequence-level probability of entire paths, thereby maximizing candidate utilization. The method is theoretically proven to be lossless and optimal on single chains, and consistently improves acceptance length by 2.2%–5.7% across diverse tree structures and tasks.
- Twilight: Adaptive Attention Sparsity with Hierarchical Top-p Pruning
-
This paper proposes Twilight, which replaces fixed-budget top-k attention sparsity with an approach inspired by top-p (nucleus) sampling: it dynamically selects the smallest set of tokens whose cumulative attention weight reaches a threshold p, so the token budget adapts to the attention distribution of each head. Twilight achieves up to 1.4× additional speedup over state-of-the-art sparse attention methods while maintaining accuracy.
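The top-p selection rule described above can be sketched in a few lines. This is only the plain selection rule, not Twilight's hierarchical pruning or its kernels; the function name `top_p_select` and the example attention weights are assumptions made for illustration.

```python
def top_p_select(weights, p):
    """Return indices of the smallest token set whose weights sum to at least p.

    weights: normalized attention weights of one head (assumed to sum to 1).
    p: cumulative-mass threshold in (0, 1].
    """
    # Visit tokens from the largest weight to the smallest.
    order = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    selected, total = [], 0.0
    for i in order:
        selected.append(i)
        total += weights[i]
        if total >= p:
            break
    return sorted(selected)


peaked_head = [0.5, 0.3, 0.1, 0.05, 0.05]
flat_head = [0.2, 0.2, 0.2, 0.2, 0.2]
print(top_p_select(peaked_head, 0.75))  # a peaked head keeps few tokens
print(top_p_select(flat_head, 0.75))    # a flat head keeps more tokens
```

With the same threshold, the peaked head keeps two tokens and the flat head keeps four, which is the adaptivity a fixed top-k budget cannot provide.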
- Understanding Differential Transformer Unchains Pretrained Self-Attentions
-
This paper conducts an in-depth analysis of the internal mechanism of the Differential Transformer, revealing that the differential operation is equivalent to a robust attention denoising process — it "unchains" pretrained self-attentions from the constraints of softmax normalization, enabling attention weights to be more freely allocated to genuinely important tokens.
- Uni-LoRA: One Vector is All You Need
-
This paper proposes Uni-LoRA, a unified framework demonstrating that the parameter reduction strategies of various LoRA variants (Tied-LoRA, VeRA, VB-LoRA, etc.) are fundamentally distinguished by the choice of projection matrix mapping the full parameter space \(\mathbb{R}^D\) to a low-dimensional subspace \(\mathbb{R}^d\). An isometric random grouping projection matrix is designed such that training a single vector suffices to reconstruct all LoRA parameters of an LLM, achieving extreme parameter efficiency.
- Universal Cross-Tokenizer Distillation via Approximate Likelihood Matching
-
This paper proposes Approximate Likelihood Matching (ALM), a principled cross-tokenizer distillation method based on binarized f-divergence, which for the first time enables effective distillation and pure distillation across fundamentally different tokenizers (e.g., subword → byte-level).
- VESSA: Video-based objEct-centric Self-Supervised Adaptation for Visual Foundation Models
-
VESSA proposes an unsupervised adaptation method for visual foundation models using short object-centric videos. Through a self-distillation framework combined with LoRA parameter-efficient fine-tuning and an uncertainty-weighted loss, it significantly improves downstream classification performance in target domains without requiring any labeled data.
- Vision-centric Token Compression in Large Language Model
-
Vist proposes a vision-centric slow-fast dual-path token compression framework that renders distant long-context text as images and compresses them with a lightweight vision encoder, coupled with a Probability-guided Visual Enhancement (PVE) training objective. Across 11 ICL benchmarks, it achieves comparable accuracy with 2.3× fewer tokens, reducing FLOPs by 16% and memory by 50%.
- VQToken: Neural Discrete Token Representation Learning for Extreme Token Reduction in Video Large Language Models
-
VQToken introduces the first vector-quantization-based framework for extreme video token compression. By adaptively discretizing continuous ViT embeddings into a compact codebook and preserving spatiotemporal positional information via a token hash function, it achieves only 0.66% accuracy loss on NextQA-MC using merely 0.07% of the original tokens (approximately 13 tokens).
- Weight Weaving: Parameter Pooling for Data-Free Model Merging
-
This paper proposes Weight Weaving, a plug-and-play data-free model merging enhancement method that eliminates the dependency on evaluation data by pooling model parameters (e.g., via averaging or random selection) over the scaling factor search space. Across three scenarios — multi-task learning, continual learning, and domain generalization — the method achieves an average accuracy improvement of up to 15.9 percentage points.
- When Worse is Better: Navigating the Compression-Generation Trade-off in Visual Tokenization
-
This paper systematically investigates the trade-off between visual tokenizer compression rate and generation quality through scaling laws. It finds that more aggressive compression—despite yielding worse reconstruction—benefits generation for smaller models. The paper proposes Causally Regularized Tokenization (CRT), which embeds autoregressive inductive bias into Stage 1 training, achieving 2–3× computational efficiency gains. A 775M-parameter model with 256 tokens/image matches LlamaGen-3B's FID of 2.18.
- Why Knowledge Distillation Works in Generative Models: A Minimal Working Explanation
-
Through theoretical analysis on Gaussian mixture models and large-scale experiments on the SmolLM2 family via multi-level distillation, this paper reveals the core mechanism of knowledge distillation in generative models: distillation induces a tradeoff in the student model between precision (generation quality) and recall (distribution coverage), governed by the entropy of the teacher distribution.
- Zero-Shot Embedding Drift Detection: A Lightweight Defense Against Prompt Injections in LLMs
-
This paper proposes ZEDD (Zero-shot Embedding Drift Detection), which detects prompt injection attacks by measuring semantic drift between benign and suspicious inputs in the embedding space. It leverages GMM/KDE to automatically determine detection thresholds, achieving >93% detection accuracy with <3% false positive rate across multiple LLM architectures.
- zip2zip: Inference-Time Adaptive Tokenization via Online Compression
-
This paper proposes zip2zip, which deeply integrates the classical LZW online lossless compression algorithm into the LLM inference pipeline. During decoding, frequently co-occurring tokens are continuously merged into reusable "hypertokens" to dynamically expand the vocabulary. Combined with a dynamic embedding layer and training on compressed-space language modeling, zip2zip enables existing LLMs to acquire inference-time adaptive tokenization capability with only 10 GPU-hours of LoRA fine-tuning, achieving 15–40% reduction in input/output sequence length and up to 40% reduction in end-to-end decoding latency, with negligible downstream task performance degradation.
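To make the LZW idea in the zip2zip summary concrete, here is a minimal sketch of textbook LZW applied to integer token ids. It is not the zip2zip implementation: the function names, the `base_vocab_size` parameter, and the example sequence are assumptions, and details such as a maximum merge length or the dynamic embedding layer are not modeled. New ids at or above `base_vocab_size` play the role of the "hypertokens" mentioned above.

```python
def lzw_compress(tokens, base_vocab_size):
    """Merge repeated token runs into new ids, building the codebook online."""
    codebook = {}              # tuple of base tokens -> hypertoken id
    next_id = base_vocab_size  # first id available for hypertokens
    out, current = [], ()

    def code_of(seq):
        # Single tokens keep their own id; longer runs use the codebook.
        return seq[0] if len(seq) == 1 else codebook[seq]

    for t in tokens:
        candidate = current + (t,)
        if len(candidate) == 1 or candidate in codebook:
            current = candidate            # keep extending the known run
        else:
            out.append(code_of(current))   # emit the longest known run
            codebook[candidate] = next_id  # register the new, longer run
            next_id += 1
            current = (t,)
    if current:
        out.append(code_of(current))
    return out


def lzw_decompress(codes, base_vocab_size):
    """Rebuild the original tokens; the codebook is re-derived, not transmitted."""
    table = {}                 # hypertoken id -> tuple of base tokens
    next_id = base_vocab_size
    out, prev = [], None
    for c in codes:
        if c < base_vocab_size:
            entry = (c,)
        elif c in table:
            entry = table[c]
        else:
            # The code refers to the entry being defined in this very step.
            entry = prev + (prev[0],)
        if prev is not None:
            table[next_id] = prev + (entry[0],)
            next_id += 1
        out.extend(entry)
        prev = entry
    return out


tokens = [1, 2, 1, 2, 1, 2, 1, 2]
codes = lzw_compress(tokens, base_vocab_size=10)
assert lzw_decompress(codes, base_vocab_size=10) == tokens
print(len(tokens), "->", len(codes))
```

Because the decoder rebuilds the same codebook from the codes it has already seen, no side channel is needed, which is the property that makes an online scheme of this kind usable during decoding.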