📦 Model Compression

🧠 NeurIPS2025 · 134 paper notes

4DGCPro: Efficient Hierarchical 4D Gaussian Compression for Progressive Volumetric Video Streaming: This paper proposes 4DGCPro, a hierarchical 4D Gaussian compression framework that achieves multi-bitrate progressive volumetric video streaming within a single model, via perception-weighted hierarchical Gaussian representation, motion-aware adaptive grouping, and end-to-end entropy-optimized training. The framework supports real-time decoding and rendering on mobile devices and surpasses existing SOTA in rate-distortion performance.
A*-Thought: Efficient Reasoning via Bidirectional Compression for Low-Resource Settings: This paper proposes A*-Thought, a CoT compression framework based on the A* search algorithm. It introduces Bidirectional Importance Scoring (BIS) to measure each reasoning step's relevance to both the question and the answer, and combines it with path-level A* search to efficiently identify the most compact valid reasoning path within an exponentially large search space. Under a 512-token budget, A*-Thought improves QwQ-32B accuracy by 2.39×; under a 4096-token budget, it reduces output tokens by approximately 50% with negligible accuracy loss.
A Granular Study of Safety Pretraining under Model Abliteration: This paper systematically investigates the effects of model abliteration, an inference-time activation-space editing attack, on various data-driven safety pretraining stages. It finds that safety mechanisms relying solely on refusal training are highly vulnerable, whereas combining multiple safety signals (safe-only filtering + rephrasing + metatags + refusals) distributes safety behavior across a broader representational space, making it substantially more resistant to single-direction projection removal.
A Partition Cover Approach for Tokenization: This paper reformulates tokenization as a partition cover optimization problem, proves it NP-hard, and proposes a polynomial-time greedy algorithm GreedTok that outperforms BPE in both compression rate and downstream tasks when pretraining a 1B-parameter LLM.
A Simple Linear Patch Revives Layer-Pruned Large Language Models: LinearPatch inserts a lightweight symmetric matrix — fusing a Hadamard transform with channel scaling — at the pruning interface to repair activation magnitude mismatches caused by layer pruning. On LLaMA-3-8B, it retains 94.15% of the original performance without any training, and reaches 95.16% after 30 minutes of distillation.
A Token is Worth over 1,000 Tokens: Efficient Knowledge Distillation through Low-Rank Clone: This paper proposes Low-Rank Clone (LRC), which compresses teacher weights into student weights via learnable low-rank projection matrices (soft pruning), while aligning intermediate activations of both attention and FFN modules (activation cloning). A 1.7B model trained on only 20B tokens surpasses Qwen3-1.7B trained on 36T tokens (64.98 vs. 63.17), improving training efficiency by more than 1,000×.
Accurate and Efficient Low-Rank Model Merging in Core Space: This paper proposes the Core Space Merging framework, which performs model merging within a common reference basis space constructed from low-rank LoRA matrices. This approach losslessly reduces the merging operation from the full \(m \times n\) space to a compact \(Tr \times Tr\) space (where \(T\) is the number of tasks and \(r\) is the LoRA rank), achieving state-of-the-art merging accuracy on Llama 3 8B while reducing computational cost by several orders of magnitude.
Ada-KV: Optimizing KV Cache Eviction by Adaptive Budget Allocation for Efficient LLM Inference: Existing KV cache eviction methods uniformly allocate budgets across all attention heads, ignoring the substantial variation in attention concentration across heads. This paper proposes Ada-KV, the first head-wise adaptive budget allocation strategy, which redistributes budget from heads with concentrated (sparse) attention to heads with dispersed attention. It provides a theoretical proof that the approach minimizes an upper bound on eviction loss, and serves as a plug-and-play improvement over existing methods across 29 datasets.
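The head-wise allocation idea can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: it assumes each head exposes one importance score per cached token, pools the scores across heads, keeps the globally highest entries, and reads each head's budget off the result. The function name `allocate_head_budgets` and the toy scores are hypothetical.

```python
def allocate_head_budgets(head_scores, total_budget):
    """Split a shared KV budget across heads by pooling per-token scores.

    head_scores[h][t] is an assumed importance score of token t in head h.
    total_budget is the number of KV entries to keep across all heads.
    """
    # Flatten to (score, head) pairs so every head competes for the same budget.
    pooled = [
        (score, h)
        for h, scores in enumerate(head_scores)
        for score in scores
    ]
    # Keep the globally highest-scoring entries.
    pooled.sort(key=lambda item: item[0], reverse=True)
    kept = pooled[:total_budget]

    # A head's budget is the number of its entries that survived.
    budgets = [0] * len(head_scores)
    for _, h in kept:
        budgets[h] += 1
    return budgets


# Head 0 concentrates attention on one token; head 1 spreads it out.
concentrated = [0.94, 0.02, 0.02, 0.02]
dispersed = [0.30, 0.28, 0.22, 0.20]
budgets = allocate_head_budgets([concentrated, dispersed], total_budget=4)
print(budgets)  # [1, 3]
```

A uniform split would give each head two entries; in this toy case the pooled selection moves one slot from the concentrated head to the dispersed one.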
Adaptive Prediction-Powered AutoEval with Reliability and Efficiency Guarantees: This paper proposes the R-AutoEval+ framework, which introduces an adaptive weighting mechanism within the testing-by-betting framework to dynamically regulate reliance on LLM-judge-generated synthetic data. It is the first method to simultaneously guarantee evaluation reliability and sampling efficiency no worse than approaches using only real data under finite samples, validated across three scenarios: LLM quantization, prompt selection, and inference budget allocation.
Adaptive Stochastic Coefficients for Accelerating Diffusion Sampling: By theoretically analyzing the complementary weaknesses of ODE and SDE solvers (ODE solvers accumulate irreducible gradient errors; SDE solvers amplify discretization errors at large step sizes), this paper proposes AdaSDE—a method that introduces a learnable stochastic coefficient \(\gamma_i\) at each denoising step to control noise injection intensity. Optimized via lightweight distillation, AdaSDE achieves state-of-the-art FID of 4.18 on CIFAR-10 and 8.05 on FFHQ at 5 NFE.
AdmTree: Compressing Lengthy Context with Adaptive Semantic Trees: This paper proposes AdmTree — an adaptive hierarchical context compression framework that constructs leaf gist tokens via information-density-driven dynamic segmentation, then aggregates them bottom-up into a binary semantic tree to achieve multi-granularity semantic preservation. It addresses two fundamental challenges: local detail loss in explicit methods and positional bias in implicit methods, outperforming the SOTA baseline Activation Beacon by over 10% on LongBench.
AI-Generated Video Detection via Perceptual Straightening: This paper proposes ReStraV, a method grounded in the perceptual straightening hypothesis—which posits that real videos form straighter trajectories in neural representation space—to detect AI-generated videos. Using temporal curvature and step-size statistics extracted from DINOv2 feature space, a lightweight classifier is trained to distinguish real from generated content, achieving 97.17% accuracy and 98.63% AUROC on VidProM with only ~48ms inference time.
All You Need is One: Capsule Prompt Tuning with a Single Vector: This paper proposes Capsule Prompt-Tuning (CaPT), identifying that existing task-aware soft prompts exhibit minimal interaction with input tokens — an "attention island" phenomenon. Incorporating instance-aware information into a single capsule prompt enables it to serve as an "attention anchor" that activates attention toward critical structural information, achieving superior performance over multi-prompt methods with extremely few parameters (e.g., only 0.003% of parameters on Llama3.2-1B).
ATLAS: Autoformalizing Theorems through Lifting, Augmentation, and Synthesis of Data: ATLAS proposes a data generation framework based on a concept repository, expert iteration with knowledge distillation, and two novel augmentation strategies. It constructs a parallel corpus of 117K theorem statements, and achieves SOTA on all autoformalization benchmarks after fine-tuning Llama3.1-8B-Instruct.
AutoJudge: Judge Decoding Without Manual Annotation: AutoJudge automates the annotation of "critical tokens" in Judge Decoding. It uses a semi-greedy search to replace mismatched tokens and checks whether the final answer changes, thereby labeling token importance, and then trains a logistic regression classifier to predict importance at inference time. This enables speculative decoding to accept 40+ tokens per round (vs. ~20 in standard methods), achieving a 1.5× speedup on GSM8K with less than 1% accuracy loss.
BaRISTA: Brain-Scale Informed Spatiotemporal Representation of Human Intracranial EEG: BaRISTA systematically investigates spatial encoding scales (electrode/parcel/lobe) for iEEG Transformers, finding that atlas parcel-level encoding combined with spatial masked reconstruction achieves 86.2% AUC on language task decoding (vs. PopT 79.5%). The choice of encoding scale has greater impact than masking strategy, and the model generalizes well across subjects.
Benford's Curse: Tracing Digit Bias to Numerical Hallucination in LLMs: This paper demonstrates that numerical hallucinations in LLMs originate from the Benford's Law-conforming digit frequency distribution in pretraining corpora—where digit 1 appears with ~30% probability while digit 9 appears with only ~5%—and that this bias is internalized by specific "digit-selective neurons" in the later FFN layers. A Digit Selectivity Coefficient (DSC) is proposed to localize biased neurons, and pruning 0.01% of neurons corrects 1.36–3.49% of erroneous predictions.
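The quoted digit frequencies match Benford's Law for leading digits, \(P(d) = \log_{10}(1 + 1/d)\). The short check below assumes the ~30% and ~5% figures refer to leading digits; the function name is illustrative.

```python
import math


def benford_first_digit_probs():
    """Return P(first digit = d) = log10(1 + 1/d) for d = 1..9."""
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}


probs = benford_first_digit_probs()
for d, p in probs.items():
    print(f"digit {d}: {p:.3f}")
```

Under this formula, digit 1 has probability about 0.301 and digit 9 about 0.046, and the nine probabilities sum to 1.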
Beyond Higher Rank: Token-wise Input-Output Projections for Efficient Low-Rank Adaptation: TopLoRA analyzes the expressive capacity of LoRA from an input-output projection perspective, identifying that all tokens sharing a single projection matrix constitutes a critical bottleneck. It proposes dynamically adjusting LoRA weights via a learnable token-wise diagonal matrix \(\Sigma_X\) (i.e., \(\Delta W_X = B\Sigma_X A\)), achieving fine-grained adaptation without increasing rank, and consistently outperforming LoRA by 2–3% across tasks.
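The update \(\Delta W_X = B\Sigma_X A\) can be applied to a token without forming the full matrix. The sketch below is a minimal illustration: how \(\Sigma_X\) is produced from the token is not described here, so the diagonal `sigma_x` is passed in as a hypothetical input, and the helper names are assumptions.

```python
def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def token_wise_lora_delta(x, A, B, sigma_x):
    """Compute (B @ diag(sigma_x) @ A) @ x for a single token x."""
    low = matvec(A, x)                               # project into the rank-r space
    scaled = [s * z for s, z in zip(sigma_x, low)]   # token-specific diagonal scaling
    return matvec(B, scaled)                         # project back to the output space


A = [[1.0, 0.0, 1.0],
     [0.0, 2.0, 0.0]]   # r = 2, n = 3
B = [[1.0, 0.0],
     [0.0, 1.0],
     [1.0, 1.0]]        # m = 3, r = 2
x = [1.0, 2.0, 3.0]

plain = token_wise_lora_delta(x, A, B, [1.0, 1.0])   # all-ones diagonal: ordinary B @ A @ x
scaled = token_wise_lora_delta(x, A, B, [0.5, 2.0])  # a different diagonal for this token
print(plain, scaled)  # [4.0, 4.0, 8.0] [2.0, 8.0, 10.0]
```

With an all-ones diagonal the result is the standard LoRA update, so the rank and the shapes of \(A\) and \(B\) are unchanged; only the per-token scaling differs.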
Beyond Random: Automatic Inner-Loop Optimization in Dataset Distillation: This paper proposes AT-BPTT (Adaptive Truncation BPTT), which partitions DNN training into early/middle/late stages and adaptively adjusts truncation strategies and window sizes accordingly. The method achieves average accuracy gains of 3–17% on CIFAR-10/100/Tiny-ImageNet/ImageNet-1K, while delivering 3.9× speedup and 63% memory reduction.
Bézier Splatting for Fast and Differentiable Vector Graphics Rendering: Bézier Splatting integrates the Gaussian Splatting framework with Bézier curves by uniformly sampling 2D Gaussian points along each curve and rendering via α-blending to achieve differentiable vector graphics. The method achieves 30× forward and 150× backward speedups over DiffVG while maintaining or surpassing the image quality of methods such as LIVE.
Binary Quadratic Quantization: Beyond First-Order Quantization for Real-Valued Matrix Compression: BQQ proposes quadratic binary quantization—representing weight matrices as products (rather than linear combinations) of binary matrices—thereby surpassing the expressive capacity of conventional first-order quantization. By extending AMFD (Annealed Mean-Field Descent) to PUBO problems for mixed-integer optimization, BQQ achieves a dramatic accuracy leap from 10.83% to 58.25% on 2-bit data-free ViT quantization.
BioBench: A Blueprint to Move Beyond ImageNet for Scientific ML Benchmarks: BioBench is proposed as a unified benchmark spanning 9 ecological vision tasks, 4 taxonomic kingdoms, 6 image modalities, and 3.1 million images. It demonstrates that ImageNet top-1 accuracy explains only 34% of the variance across ecological tasks, and that approximately 30% of model rankings are incorrect among frontier models exceeding 75% accuracy.
C-LoRA: Contextual Low-Rank Adaptation for Uncertainty Estimation in Large Language Models: This paper proposes C-LoRA, which introduces a lightweight contextual module to condition the distribution of LoRA low-rank matrices on the input data, enabling sample-level heteroscedastic uncertainty estimation and significantly improving calibration quality in few-shot fine-tuning scenarios.
CAS-Spec: Cascade Adaptive Self-Speculative Decoding for On-the-Fly Lossless Inference Acceleration of LLMs: CAS-Spec constructs a multi-level draft model hierarchy from the target model itself via Dynamically Switchable Inference Acceleration (DSIA) strategies (e.g., layer sparsity at varying degrees), and employs the Dynamic Tree Cascade (DyTC) algorithm to adaptively route among draft models and allocate draft lengths based on online acceptance rates and latency predictions. The approach achieves lossless inference acceleration of 1.1×–2.3× in a fully training-free manner, with DyTC yielding gains of 47% and 48% over cascade and tree baselines, respectively.
ChunkKV: Semantic-Preserving KV Cache Compression for Efficient Long-Context LLM Inference: ChunkKV elevates the basic unit of KV cache compression from discrete tokens to semantic chunks (groups of contiguous tokens). By aggregating attention scores at the chunk level, it selects semantically intact segments for retention, and leverages the high cross-layer index similarity induced by chunking to enable layer-wise index reuse. At a 10% compression ratio, ChunkKV improves over SnapKV/PyramidKV by up to 8.7% and achieves a 26.5% throughput gain.
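Chunk-level selection can be sketched as below. This is an illustration under assumptions, not the paper's code: chunks are taken to be fixed-size windows, a chunk's score is the sum of its token scores, and `select_chunks` is a hypothetical name.

```python
def select_chunks(token_scores, chunk_size, keep_chunks):
    """Keep whole chunks of contiguous tokens, ranked by summed score."""
    # Partition token indices into contiguous fixed-size chunks.
    chunks = [
        range(start, min(start + chunk_size, len(token_scores)))
        for start in range(0, len(token_scores), chunk_size)
    ]
    # Rank chunks by their aggregated score.
    ranked = sorted(
        chunks,
        key=lambda idxs: sum(token_scores[i] for i in idxs),
        reverse=True,
    )
    # Return the retained token indices in their original order.
    return sorted(i for idxs in ranked[:keep_chunks] for i in idxs)


scores = [0.1, 0.1, 0.9, 0.8, 0.05, 0.05, 0.4, 0.5]
kept = select_chunks(scores, chunk_size=2, keep_chunks=2)
print(kept)  # [2, 3, 6, 7]
```

Tokens are retained or evicted together with their neighbors, which is the property that keeps segments intact.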
CodeGEMM: A Codebook-Centric Approach to Efficient GEMM in Quantized LLMs: This paper proposes CodeGEMM, a codebook-centric GEMM kernel that precomputes inner products between centroids and activations and caches them as a Psumbook, replacing the conventional dequantization pipeline to achieve end-to-end speedups of 1.83× (8B) to 8.93× (70B) on 2-bit quantized LLMs.
Compress, Gather, and Recompute: REFORMing Long-Context Processing in Transformers: This paper proposes REFORM, an inference framework that efficiently processes ultra-long contexts (up to millions of tokens) via a compress–gather–recompute three-stage pipeline. REFORM achieves improvements of 52% and 34% over the strongest baselines on RULER and BABILong respectively, while reducing inference time by 30% and peak memory usage by 5%.
Correlation Dimension of Auto-Regressive Large Language Models: This paper introduces the correlation dimension from fractal geometry into LLM analysis. By measuring the recursive structure among next-token log-probability vectors, it quantifies the hierarchical complexity of text, revealing a three-stage evolution of LLM pretraining, an indicator of hallucination tendency, and a unified detection capability for multiple text degeneration patterns — none of which can be captured by perplexity.
Data Efficient Adaptation in Large Language Models via Continuous Low-Rank Fine-Tuning: This paper proposes DEAL, a framework that leverages wavelet kernel feature filtering to preserve core historical knowledge in LoRA low-rank matrices, combined with a controlled knowledge update module and asymmetric regularization, enabling LLMs to acquire new knowledge without forgetting old tasks under few-shot continual fine-tuning.
DeltaFlow: An Efficient Multi-frame Scene Flow Estimation Method: This paper proposes DeltaFlow (ΔFlow), which extracts motion cues via inter-frame voxel differences (Δ scheme) to enable multi-frame scene flow estimation with feature sizes that remain constant regardless of the number of input frames. The method achieves state-of-the-art performance on Argoverse 2, Waymo, and nuScenes while running 2× faster than the second-best multi-frame approach.
Dense Backpropagation Improves Training for Sparse Mixture-of-Experts: This paper proposes Default MoE, a method that maintains exponential moving averages (EMA) of inactive expert outputs as surrogate signals, enabling dense gradient updates for the MoE router without significant computational overhead, thereby improving sparse MoE training performance.
Dependency Parsing is More Parameter-Efficient with Normalization: This paper identifies that the lack of normalization in biaffine scoring for dependency and semantic parsing leads to systematic overparameterization, and demonstrates that a simple \(1/\sqrt{d}\) scaling can reduce BiLSTM parameters by up to 85% while matching or surpassing original performance.
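A minimal sketch of the scaling, assuming it is applied to the bilinear term in the same way as in scaled dot-product attention. Where exactly the parser applies it is not stated here, and `biaffine_score` is a hypothetical helper.

```python
import math


def biaffine_score(h_dep, W, h_head, normalize=True):
    """Bilinear arc score h_dep^T W h_head, optionally divided by sqrt(d)."""
    d = len(h_head)
    Wh = [sum(w * x for w, x in zip(row, h_head)) for row in W]
    raw = sum(a * b for a, b in zip(h_dep, Wh))
    return raw / math.sqrt(d) if normalize else raw


d = 4
identity = [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]
h = [1.0] * d

raw = biaffine_score(h, identity, h, normalize=False)
scaled = biaffine_score(h, identity, h)
print(raw, scaled)  # 4.0 2.0
```

The unscaled score grows with the hidden size \(d\), while the scaled score grows only with \(\sqrt{d}\), which keeps the inputs to the softmax in a comparable range as \(d\) changes.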
Deterministic Continuous Replacement: Fast and Stable Module Replacement in Pretrained Transformers: DCR mixes teacher and student module outputs via a deterministic annealing weight \(\alpha(t)\), eliminating the gradient variance introduced by stochastic gating (e.g., BERT-of-Theseus), and achieves faster convergence and stronger feature alignment in cold-start module replacement scenarios.
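The mixing step can be sketched as follows. The form of \(\alpha(t)\) and which module it weights are not given here, so this sketch assumes a linear schedule in which the teacher weight decays from 1 to 0; `alpha_linear` and `mixed_output` are hypothetical names.

```python
def alpha_linear(step, total_steps):
    """Assumed schedule: teacher weight decays linearly from 1 to 0."""
    return max(0.0, 1.0 - step / total_steps)


def mixed_output(teacher_out, student_out, step, total_steps):
    """Deterministic blend of teacher and student module outputs."""
    a = alpha_linear(step, total_steps)
    return [a * t + (1.0 - a) * s for t, s in zip(teacher_out, student_out)]


teacher = [1.0, 2.0]
student = [3.0, 6.0]
start = mixed_output(teacher, student, 0, 100)
middle = mixed_output(teacher, student, 50, 100)
end = mixed_output(teacher, student, 100, 100)
print(start, middle, end)  # [1.0, 2.0] [2.0, 4.0] [3.0, 6.0]
```

Because the weight depends only on the step, there is no sampling in the forward pass, in contrast to a stochastic gate that picks one module per step.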
Disentangling Latent Shifts of In-Context Learning with Weak Supervision: WILDA treats ICL as a weak supervision signal and encodes demonstration-induced latent shifts into lightweight LoRA adapters via a teacher-student framework, enabling efficient inference without repeated prompting. The student surpasses the teacher through pseudo-label correction and coverage extension, demonstrating weak-to-strong generalization.
Distillation Robustifies Unlearning: This paper reveals the core finding that "distillation can robustify unlearning" — distilling an unlearned model into a randomly initialized student network effectively discards latent capabilities. Building on this insight, the paper proposes UNDO (Unlearn-Noise-Distill-on-Outputs), which applies weight perturbation to the unlearned model prior to distillation, establishing a tunable compute–robustness trade-off that approaches the gold standard of retraining from scratch on both synthetic tasks and the WMDP benchmark.
DP-LLM: Runtime Model Adaptation with Dynamic Layer-wise Precision Assignment: DP-LLM identifies that per-layer quantization sensitivity varies dynamically across decoding steps, and proposes a dynamic layer-wise precision selection mechanism based on relative error. At runtime, each layer is assigned a precision (h-bit or l-bit) conditioned on the current input, achieving a better performance–latency trade-off than static mixed-precision methods.
DRAGON: Guard LLM Unlearning in Context via Negative Detection and Reasoning: DRAGON proposes a systematic LLM unlearning framework that requires no fine-tuning of the base model. It employs a two-layer detection module to identify prompts subject to unlearning, then uses a specially fine-tuned guard model to generate CoT reasoning instructions for in-context intervention, effectively removing private or harmful knowledge while preserving the model's general capabilities.
DuoGPT: Training-free Dual Sparsity through Activation-aware Pruning in LLMs: This paper proposes DuoGPT, a dual-sparse framework that reinterprets activation sparsity as dynamic structured weight sparsity and combines it with unstructured weight pruning. By extending the OBC framework with activation-aware calibration and a dense-model output residual correction term, DuoGPT achieves significant speedup and memory savings during the LLM decoding phase without any retraining.
Elastic ViTs from Pretrained Models without Retraining: SnapViT proposes a post-training structured pruning method that combines a local Hessian diagonal approximation derived from self-supervised gradients with global cross-module correlations estimated via evolutionary algorithms. Without any retraining or labels, it generates elastic ViT sub-networks spanning continuous sparsity levels in a single run, requiring less than 5 minutes on an A100 GPU.
FALQON: Accelerating LoRA Fine-tuning with Low-Bit Floating-Point Arithmetic: FALQON eliminates the small-matrix quantization overhead introduced by standalone LoRA paths by directly melding LoRA adapters into FP8-quantized backbone weights. Combined with efficient gradient computation and a row-wise proxy update mechanism, it achieves approximately 3× training speedup over existing quantized LoRA methods.
FastLongSpeech: Enhancing Large Speech-Language Models for Efficient Long-Speech Processing: FastLongSpeech compresses redundant speech representations via an iterative fusion strategy and transfers short-speech capabilities to long-speech scenarios through dynamic compression training. This enables large speech-language models (LSLMs) to process long speech efficiently without long-speech training data, achieving state-of-the-art performance on long-speech QA with a 70% improvement in inference efficiency.
Find your Needle: Small Object Image Retrieval via Multi-Object Attention Optimization: MaO proposes a novel approach for Small Object Image Retrieval (SoIR) that integrates multi-object pre-training with attention-based feature refinement, aggregating representations of multiple objects into a single global descriptor, achieving substantial improvements over existing retrieval methods across multiple benchmarks.
FiRA: Can We Achieve Full-Rank Training of LLMs Under Low-Rank Constraint?: This paper proposes Fira, the first LLM training framework that achieves full-rank training (full-rank gradients + full-rank weights) under low-rank constraints. By observing that the optimizer scaling factors in low-rank and full-rank training are highly similar, Fira approximates the correction of out-of-subspace gradients using low-rank scaling factors, and employs a norm-growth limiter to prevent loss spikes. Fira outperforms LoRA and GaLore in both pretraining and fine-tuning settings.
FirstAidQA: A Synthetic Dataset for First Aid and Emergency Response in Low-Connectivity Settings: This paper introduces FirstAidQA, a dataset of 5,500 synthetic first aid question-answer pairs generated by ChatGPT-4o-mini from certified first aid textbooks and validated by human experts, designed to support fine-tuning of first aid AI systems in low-connectivity or offline environments.
Gated Integration of Low-Rank Adaptation for Continual Learning of Large Language Models: This paper proposes GainLoRA, which introduces a gating module for each new task's LoRA branch in continual learning to generate adaptive integration coefficients. By enforcing orthogonal constraints, the new branch's output on old tasks is driven toward zero, effectively mitigating catastrophic forgetting.
Geometric Data Valuation via Leverage Scores: This paper proposes a geometric data valuation method based on statistical leverage scores as an efficient proxy for Data Shapley values. The proposed method satisfies the axioms of symmetry, efficiency, and dummy player, and extends to ridge leverage scores to address the dimensionality saturation problem, providing theoretical guarantees of \(O(\varepsilon)\)-approximate optimality.
Geometry of Decision Making in Language Models: By measuring the Intrinsic Dimension (ID) of hidden representations across layers in 28 open-source Transformer models at scale, this paper reveals a consistent "low–high–low" pattern: early layers operate on low-dimensional manifolds, middle layers expand the representational space, and later layers re-compress into low-dimensional representations aligned with decision-making.
Global Minimizers of ℓp-Regularized Objectives Yield the Sparsest ReLU Neural Networks: This paper proves that, for single-hidden-layer ReLU networks, global minimizers of the \(\ell^p\) (\(0 < p < 1\)) path norm correspond exactly to the sparsest data-interpolating networks, thereby recasting the combinatorial sparse interpolation problem as a continuously differentiable optimization task.
GoRA: Gradient-Driven Adaptive Low Rank Adaptation: GoRA leverages pre-computed gradient information to perform adaptive rank allocation and weight initialization simultaneously before training: it assigns per-layer ranks based on parameter sensitivity and initializes the \(B\) matrix via the gradient pseudo-inverse so that the initial output approximates one step of gradient descent, thereby addressing both major bottlenecks of LoRA in a unified framework.
Graph Your Own Prompt: This paper proposes a Graph Consistency Regularization (GCR) framework that inserts parameter-free Graph Consistency Layers (GCL) at arbitrary network depths. GCL aligns the relational graph of intermediate features with a class-aware semantic graph derived from model predictions, promoting semantically consistent feature learning in a self-prompting manner—improving classification generalization without modifying the architecture or introducing additional parameters.
GraSS: Scalable Data Attribution with Gradient Sparsification and Sparse Projection: This paper proposes GraSS and FactGraSS, two-stage gradient compression algorithms that exploit the inherent sparsity of per-sample gradients to achieve sublinear time and space complexity (\(O(k')\)), outperforming the SOTA baseline LoGra by 165% in throughput on billion-parameter models while maintaining data attribution quality.
Graver: Generative Graph Vocabularies for Robust Graph Foundation Models Fine-tuning: This paper proposes Graver, a framework that decouples ego-graphs to extract transferable subgraph vocabularies, models their distributions via graphon experts, and routes relevant vocabularies to augment support samples through MoE-CoE, addressing the instability caused by structural mismatch in few-shot fine-tuning of graph foundation models (GFMs).
Hankel Singular Value Regularization for Highly Compressible State Space Models: By regularizing the Hankel singular value nuclear norm of SSM layers during training to encourage rapid decay, the trained model can be compressed to 10% of its original order via balanced truncation with negligible accuracy loss. A block-diagonal rotation matrix parameterization reduces Gramian computation from \(\mathcal{O}(n^3)\) to \(\mathcal{O}(n^2)\).
Homogeneous Keys, Heterogeneous Values: Exploiting Local KV Cache Asymmetry for Long-Context LLMs: This paper identifies a previously overlooked local Key-Value asymmetry in LLM attention mechanisms—neighboring Keys exhibit homogeneity (similar attention weight distributions), while neighboring Values are heterogeneously distributed. Based on this observation, the paper proposes AsymKV, a training-free compression framework that merges Keys via homogeneity and represents Values losslessly through cardinality normalization, outperforming H2O by 5 points on LongBench.
Hyperbolic Dataset Distillation: This paper proposes HDD, the first method to incorporate hyperbolic space into dataset distillation. By matching the Riemannian centroids of real and synthetic data in the Lorentz hyperbolic space—rather than performing distribution matching in Euclidean space—HDD leverages the hierarchical weighting property of hyperbolic geometry to assign higher influence to more representative, low-level samples. The method consistently improves over DM/IDM baselines across multiple datasets.
Inference-Time Hyper-Scaling with KV Cache Compression: This paper proposes the Inference-Time Hyper-Scaling paradigm: by efficiently compressing the KV cache, more or longer parallel reasoning sequences can be generated under the same compute/memory budget, substantially improving the accuracy of reasoning models on tasks such as mathematics, code, and scientific reasoning.
KeyDiff: Key Similarity-Based KV Cache Eviction for Long-Context LLM Inference in Resource-Constrained Environments: This paper proposes KeyDiff — an attention-score-free KV cache eviction strategy that maintains the cache by retaining keys with the lowest average cosine similarity to other keys (i.e., geometrically most unique). Under strict memory constraints in block-wise inference settings, KeyDiff achieves ≤0.04% accuracy loss on LongBench with an 8K cache budget, while reducing end-to-end inference latency by up to 30%.
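The eviction rule described above can be written as a brute-force sketch. It follows the summary literally (average pairwise cosine similarity) and makes no claim about how the paper computes it efficiently; `keep_most_distinct_keys` is a hypothetical name.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def keep_most_distinct_keys(keys, budget):
    """Return indices of the `budget` keys with the lowest average similarity to the rest."""
    avg_sim = []
    for i, k in enumerate(keys):
        sims = [cosine(k, other) for j, other in enumerate(keys) if j != i]
        avg_sim.append(sum(sims) / len(sims))
    ranked = sorted(range(len(keys)), key=lambda i: avg_sim[i])
    return sorted(ranked[:budget])


keys = [
    [1.0, 0.0],
    [0.99, 0.1],
    [0.98, 0.2],
    [0.0, 1.0],
]
kept = keep_most_distinct_keys(keys, budget=2)
print(kept)  # [0, 3]
```

No attention scores are needed: the decision uses only the geometry of the cached keys, so the near-duplicate keys at indices 1 and 2 are evicted first.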
KINDLE: Knowledge-Guided Distillation for Prior-Free Gene Regulatory Network Inference: This paper proposes KINDLE, a three-stage framework that transfers gene regulatory knowledge learned by a prior-guided teacher model to a prior-free student model via knowledge distillation, achieving state-of-the-art performance in gene regulatory network (GRN) inference without relying on any external prior knowledge.
KTAE: A Model-Free Algorithm to Key-Tokens Advantage Estimation in Mathematical Reasoning: KTAE proposes a model-free token-level advantage estimation algorithm that quantifies the statistical association between each token and correct reasoning outcomes via Fisher's exact test and information gain. The resulting fine-grained token importance is superimposed on the rollout-level advantage of GRPO/DAPO, achieving superior performance on five mathematical reasoning benchmarks while significantly reducing generation length.
KVzip: Query-Agnostic KV Cache Compression with Context Reconstruction: This paper proposes KVzip, a query-agnostic KV cache eviction method that quantifies the importance of each KV pair by leveraging the LLM itself to reconstruct the original context from the cached KV pairs. KVzip achieves 3–4× KV cache compression and approximately 2× reduction in FlashAttention decoding latency, while significantly outperforming existing query-aware methods in multi-query scenarios.
LayerIF: Estimating Layer Quality for Large Language Models using Influence Functions: LayerIF proposes using influence functions (IFs) to quantify the training quality of each layer in LLMs. By aggregating positive influence scores per layer, it derives a data-driven layer importance estimate, which is subsequently applied to two downstream tasks: LoRA-MoE expert allocation and layer-wise sparse pruning. The method achieves accuracy gains of 1.61% and 0.90% on Mistral-7B and Gemma-7B, respectively.
Learning Grouped Lattice Vector Quantizers for Low-Bit LLM Compression: GLVQ learns a dedicated lattice codebook (defined by a learnable generator matrix) for each weight group of an LLM, combined with group-specific μ-law companding to handle heavy-tailed distributions. Under 2-bit quantization, it achieves a Wikitext-2 perplexity of 3.36 on Llama-2-70B, substantially outperforming QuIP# (3.91) and QTIP (3.78).
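For reference, standard μ-law companding maps \(x \in [-1, 1]\) to \(\operatorname{sign}(x)\,\ln(1 + \mu|x|) / \ln(1 + \mu)\). The sketch below uses a fixed \(\mu = 255\) purely as an assumed example value; the summary describes group-specific companding, and how each group's parameter is chosen is not shown here.

```python
import math


def mu_law_compress(x, mu=255.0):
    """Standard mu-law compression for x in [-1, 1]."""
    return math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)


def mu_law_expand(y, mu=255.0):
    """Inverse of mu_law_compress."""
    return math.copysign(math.expm1(abs(y) * math.log1p(mu)) / mu, y)


small = mu_law_compress(0.01)
restored = mu_law_expand(small)
print(small > 0.2, abs(restored - 0.01) < 1e-9)  # True True
```

Small magnitudes are stretched (0.01 maps above 0.2), so a uniform grid applied after companding spends more of its levels near zero, where the bulk of a heavy-tailed weight distribution lies.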
Learning to Better Search with Language Models via Guided Reinforced Self-Training: This paper proposes Guided-ReST, which progressively incorporates optimal solutions as subgoals into model-generated search trajectories to produce high-quality training data and distill more efficient search strategies. The approach yields substantial improvements in search efficiency and accuracy on Countdown and code self-repair tasks.
Learning to Factorize and Adapt: A Versatile Approach Toward Universal Spatio-Temporal Foundation Models: This paper proposes FactoST-v2, a factorized spatio-temporal foundation model framework that decouples universal temporal pre-training from domain-specific spatial adaptation, achieving cross-domain zero-shot/few-shot/full-shot spatio-temporal forecasting with linear complexity.
Less is More but Where: Dynamic Token Compression via LLM-Guided Keyframe Prior: This paper proposes DyToK, a training-free dynamic video token compression method that leverages query-conditioned keyframe priors inherent in the deep attention layers of VLLMs to adaptively allocate token budgets across frames, achieving plug-and-play optimal efficiency–accuracy trade-offs.
Linear Attention for Efficient Bidirectional Sequence Modeling: This paper proposes Lion, a framework that, for the first time, systematically extends linear Transformers to bidirectional sequence modeling. It unifies three equivalent representations—full linear attention, bidirectional RNN, and chunkwise parallel—achieving training speeds up to 10× faster than SSMs while matching softmax Transformer performance.
LittleBit: Ultra Low-Bit Quantization via Latent Factorization: This paper proposes LittleBit, a framework that achieves extreme LLM compression down to 0.1 BPW (bits per weight) via low-rank latent-space matrix factorization, binarization, and a multi-scale compensation mechanism. It compresses Llama2-13B to under 0.9 GB and substantially outperforms STBLLM in the sub-1-bit regime.
Loquetier: A Virtualized Multi-LoRA Framework for Unified LLM Fine-tuning and Serving: Loquetier is a framework that unifies the fine-tuning and inference of multiple LoRA adapters within a single runtime via a Virtualized Module and a Segmented Multi-LoRA Multiplication (SMLM) kernel, achieving a 3.0× throughput improvement for inference-only tasks and a 46.4× higher SLO attainment rate for unified tasks.
LT-Soups: Bridging Head and Tail Classes via Subsampled Model Soups: This paper proposes LT-Soups, a two-stage model merging framework that trains multiple models on subsampled datasets with progressively varying imbalance ratios and aggregates them via weight averaging, achieving balanced performance across head and tail classes over the full long-tail spectrum.
Matryoshka Pilot: Learning to Drive Black-Box LLMs with LLMs: This paper proposes Matryoshka Pilot (M-Pilot), which employs a lightweight white-box LLM as a controller to generate intermediate guidance (task decomposition, high-level plans, user profiles) for driving black-box LLMs on complex long-horizon tasks such as reasoning, planning, and personalization, with iterative DPO enabling continual self-improvement.
Memory-Efficient Training with In-Place FFT Implementation: This paper proposes rdFFT—the first truly in-place real-domain Fast Fourier Transform framework—which eliminates intermediate buffers via an implicit complex encoding scheme, achieving zero extra memory overhead for FFT/IFFT computation during training, with memory efficiency improvements exceeding 1500× in extreme cases.
Mingle: Mixture of Null-Space Gated Low-Rank Experts for Test-Time Continual Model Merging: This paper proposes a new paradigm called Test-Time Continual Model Merging (TTCMM) and the Mingle framework, which employs a low-rank mixture-of-experts architecture with an adaptive null-space constrained gating mechanism to dynamically merge models at test time using a small number of unlabeled samples. Mingle outperforms state-of-the-art methods by 7–9% across multiple benchmarks while reducing forgetting to near zero.
Mitigating Semantic Collapse in Partially Relevant Video Retrieval: To address semantic collapse in Partially Relevant Video Retrieval (PRVR), this paper proposes Text Correlation Preservation Learning (TCPL) and Cross-Branch Video Alignment (CBVA), which mitigate collapse phenomena in the text and video embedding spaces respectively, achieving substantial improvements in retrieval accuracy.
Mixture of Noise for Pre-Trained Model-Based Class-Incremental Learning: This paper proposes learning a beneficial "mixture of noise" to suppress parameter drift in pre-trained models during class-incremental learning. By dynamically mixing task-specific noise with learned weights across tasks, the method achieves state-of-the-art performance, particularly in the challenging 50-step incremental setting.
ModHiFi: Identifying High Fidelity Predictive Components for Model Modification: This paper proposes the Subset Fidelity metric and the ModHiFi framework. Through theoretical analysis, it proves that local reconstruction error linearly upper-bounds global prediction error for Lipschitz continuous networks. Without requiring training data, loss functions, or gradients—using only synthetic data—the framework identifies high-fidelity (HiFi) components within a model, and unifies the tasks of structured pruning and class unlearning under a single formulation.
Multi-Task Vehicle Routing Solver via Mixture of Specialized Experts under State-Decomposable MDP: This paper proposes the State-Decomposable MDP (SDMDP) framework, which reformulates multiple VRP variants as Cartesian products of base state spaces, and introduces the Mixture-of-Specialized-Experts Solver (MoSES), which leverages dedicated LoRA experts to enable latent space reuse of base policies, efficiently handling 16 VRP variants.
MUSTAFAR: Promoting Unstructured Sparsity for KV Cache Pruning in LLM Inference: This paper proposes MUSTAFAR, a framework that systematically demonstrates the superiority of unstructured sparsity for KV cache pruning—achieving 70% sparsity on both Key and Value caches without accuracy degradation—and introduces a bitmap-based sparse format with a custom attention kernel, yielding up to 2.23× end-to-end inference throughput improvement.
Navigating Simply, Aligning Deeply: Winning Solutions for Mouse vs. AI 2025: In the NeurIPS 2025 Mouse vs. AI competition, this paper presents the counterintuitive finding that a lightweight two-layer CNN substantially outperforms deep networks on visual robustness tasks, while demonstrating that a deeper ResNet architecture is more advantageous for neural alignment, revealing a fundamental tension between behavioral robustness and biological plausibility.
Offline Policy Evaluation of Multi-Turn LLM Health Coaching with Real Users: This paper conducts offline policy evaluation (OPE) on a deployed LLM health coaching system with real users. It finds that a uniformly high tool-use policy improves average reward but harms specific user subgroups. Through simulator experiments, the paper further validates that early information-gain exploration (curiosity reward) accelerates user profile identification and improves task success rates.
On the Creation of Narrow AI: Hierarchy and Nonlocality of Neural Network Skills: This paper investigates two fundamental challenges in creating narrow AI systems. First, hierarchical dependencies among tasks mean that certain narrow skills can be learned effectively only when training on broad distributions. Second, the nonlocality of skills makes it impossible to precisely separate desired from undesired capabilities via pruning. Even so, pruning followed by recovery fine-tuning still outperforms both distillation and training from scratch.
On the Hardness of Approximating Distributions with Tractable Probabilistic Models: This paper proves that approximating arbitrary distributions with tractable probabilistic models (e.g., decomposable probabilistic circuits) under bounded \(f\)-divergence is NP-hard, and establishes an exponential size separation between decomposable PCs and (deterministic + decomposable) PCs under approximate modeling, demonstrating that approximation relaxations do not alleviate the complexity bottlenecks inherent in exact modeling.
One-Step Diffusion-Based Image Compression with Semantic Distillation: This paper proposes OneDC—the first one-step diffusion-based generative image codec—which replaces text with the hyperprior as the semantic conditioning signal for the diffusion model and enhances its representational capacity via semantic distillation, achieving state-of-the-art perceptual quality with 39% bitrate savings and 20× decoding speedup over multi-step diffusion codecs.
Online Mixture of Experts: No-Regret Learning for Optimal Collective Decision-Making: This paper proposes an Online Mixture of Experts (OMoE) framework comprising two algorithms — UCB-Successive Elimination and Online Weighted Majority Voting — with theoretical no-regret guarantees, and applies them to the online dynamic aggregation of LLM experts.
Optimizing Distributional Geometry Alignment with Optimal Transport for Generative Dataset Distillation: This paper reformulates dataset distillation as an optimal transport (OT) distance minimization problem and achieves fine-grained distributional geometry alignment through a three-stage pipeline (OT-guided diffusion sampling, label-image alignment soft re-labeling, and OT logit matching), yielding at least 4% improvement over the previous state of the art on ImageNet-1K at IPC=10.
Order-Level Attention Similarity Across Language Models: A Latent Commonality: This paper proposes Order-Level Attention (OLA)—an order-wise decomposition of Attention Rollout—and discovers that different language models exhibit significant similarity in same-order OLA (OLAS). OLA is shown to implicitly encode syntactic knowledge, and based on this finding, the paper proposes TOA, the first training-free cross-LM adapter transfer method.
ParetoQ: Improving Scaling Laws in Extremely Low-bit LLM Quantization: This paper proposes ParetoQ — the first unified framework supporting 1/1.58/2/3/4-bit quantization — which systematically studies training strategies (full-precision pretraining vs. QAT budget allocation) and quantization function design (introducing the SEQ quantizer). The work demonstrates that 2-bit and 1.58-bit quantization outperform conventional 4-bit in the accuracy–model-size trade-off, and achieves state-of-the-art results across all bit-widths.
PermLLM: Learnable Channel Permutation for N:M Sparse Large Language Models: This paper proposes PermLLM, the first learnable channel permutation (LCP) framework for N:M sparse LLMs. By relaxing discrete permutation matrices into differentiable soft permutation matrices via Sinkhorn normalization, PermLLM enables end-to-end optimization. Combined with a block-level permutation strategy that substantially reduces computational overhead, the framework effectively improves the performance of N:M sparse LLMs.
PPG-Distill: Efficient Photoplethysmography Signals Analysis via Foundation Model Distillation: PPG-Distill proposes a knowledge distillation framework tailored for PPG signals. By combining prediction-level, feature-level, and patch-level (morphology + rhythm) distillation, it transfers knowledge from large PPG foundation models to lightweight student models, achieving up to 21.8% performance improvement alongside 7× inference speedup and 19× memory compression.
Q-Palette: Fractional-Bit Quantizers Toward Optimal Bit Allocation for Efficient LLM Deployment: This paper derives optimal bit allocation for Gaussianized weights from an information-theoretic perspective, proposes the Q-Palette collection of fractional-bit quantizers and a mixed-scheme quantization framework, and achieves near-optimal quantization performance with inference acceleration in LLM deployment.
QSVD: Efficient Low-Rank Approximation for Unified Query-Key-Value Weight Compression: This paper proposes QSVD, which performs SVD on the joint QKV weight matrix and shares a single down-projection matrix across Q, K, and V to reduce KV cache size and computational overhead. Combined with importance-score-based adaptive rank allocation and a quantization scheme compatible with low-rank decomposition, QSVD achieves over 10% accuracy improvement on VLMs at lower hardware cost.
QuadEnhancer: Leveraging Quadratic Transformations to Enhance Deep Neural Networks: This paper proposes a lightweight quadratic enhancer (QuadEnhancer) that introduces sparsified quadratic interaction terms into each linear layer, achieving significant performance improvements over existing neural network architectures with negligible additional parameters and computational overhead.
Quantization Error Propagation: Revisiting Layer-Wise Post-Training Quantization: This paper identifies a critical bottleneck in existing layer-wise PTQ methods—namely, their neglect of cross-layer accumulation and growth of quantization errors—and proposes the QEP framework, which explicitly corrects accumulated errors via error propagation and compensation, achieving substantial performance gains under extremely low-bit settings (INT2/INT3).
RAT: Bridging RNN Efficiency and Attention Accuracy via Chunk-based Sequence Modeling: This paper proposes RAT (Recurrence And aTtention), a chunk-based intermediate architecture that models local dependencies within chunks via linear RNNs and enables global access across chunks via softmax attention. At \(L=16\), RAT achieves a 9× single-layer decoding speedup and 10× maximum throughput improvement over standard attention with comparable performance; a hybrid variant alternating with sliding window attention achieves state-of-the-art results on nearly all benchmarks.
RCCDA: Adaptive Model Updates in the Presence of Concept Drift under a Constrained Resource Budget: This paper proposes RCCDA, a lightweight model update policy based on the Lyapunov drift-plus-penalty framework. Under concept drift scenarios where the data distribution shifts over time, RCCDA greedily determines when to retrain the model using only historical inference loss and a tunable threshold, while provably satisfying strict resource budget constraints.
Rectifying Soft-Label Entangled Bias in Long-Tailed Dataset Distillation: This paper identifies a dual entangled bias in soft labels within long-tailed dataset distillation — originating from both the distillation model and the distilled images — and proposes ADSA, an Adaptive Soft-label Alignment module that eliminates this bias via post-hoc calibration in logit space. As a plug-and-play module, ADSA integrates seamlessly into existing distillation pipelines, achieving up to 11.8% accuracy improvement on tail classes on ImageNet-1k-LT.
Recurrent Attention-based Token Selection for Efficient Streaming Video-LLMs: This paper proposes rLiVS (Recurrent LLM-informed Visual Selection), a training-free and model-agnostic method for streaming video understanding. It achieves state-of-the-art performance on streaming video benchmarks through three complementary designs: LLM attention-guided visual token selection (retaining only ~6% of tokens), recurrent reuse of historical tokens, and caption-based retrieval for question answering.
RefLoRA: Refactored Low-Rank Adaptation for Efficient Fine-Tuning of Large Models: RefLoRA selects the optimal low-rank factorization form at each iteration by minimizing an upper bound on the loss, thereby addressing the weight update inconsistency and imbalance caused by the non-uniqueness of the LoRA decomposition. It accelerates convergence and improves fine-tuning performance with negligible additional computational overhead.
Reject Only Critical Tokens: Pivot-Aware Speculative Decoding: PAD proposes a new speculative decoding paradigm based on utility matching rather than distribution matching. It trains a lightweight classifier to identify pivot tokens and rejects only those draft tokens that would degrade final output utility, achieving a 2.46× speedup on GSM8K with negligible accuracy loss.
REOrdering Patches Improves Vision Models: This paper reveals that patch ordering significantly affects the performance of long-sequence vision models, and proposes the REOrder framework, which leverages information-theoretic priors and reinforcement learning to automatically discover optimal patch permutations, achieving up to 3.01% improvement on ImageNet-1K and 13.35% on FMoW.
REP: Resource-Efficient Prompting for Rehearsal-Free Continual Learning: REP reduces training time by up to 51% and memory consumption by up to 41% for prompt-based rehearsal-free continual learning methods, with negligible accuracy loss, via three complementary techniques: fast prompt selection using a lightweight surrogate model, Adaptive Token Merging (AToM), and Adaptive Layer Dropping (ALD).
ReplaceMe: Network Simplification via Depth Pruning and Transformer Block Linearization: ReplaceMe is a training-free depth pruning method that uses a small calibration dataset to estimate a linear transformation approximating groups of pruned Transformer blocks. This transformation is fused into adjacent layer weights without introducing additional parameters, achieving 25% pruning on LLaMA-2-7B while retaining approximately 90% of original performance.
Representation Consistency for Accurate and Coherent LLM Answer Aggregation: This paper proposes Representation Consistency (RC), which improves answer aggregation by analyzing the consistency of internal activations when an LLM generates multiple candidate answers. Reasoning paths that yield the same answer with highly consistent internal representations are more likely to be correct. A sparse variant, RC-S, leveraging sparse autoencoders achieves the best performance, consistently outperforming Self-Consistency across 4 LLMs and 4 reasoning datasets.
Restoring Pruned Large Language Models via Lost Component Compensation: RestoreLCC proposes a targeted recovery strategy for pruned LLMs: it uses contrastive probing to identify critical attention heads, applies SVD decomposition to extract activation components lost during pruning, and injects them back into the pruned model as learnable bias vectors — significantly restoring performance without compromising sparsity or inference speed.
Revisiting Semi-Supervised Learning in the Era of Foundation Models: A systematic study reveals that conventional SSL methods offer limited benefit in the VFM era—PEFT on labeled data alone can match SSL—motivating V-PET: a simple and effective semi-supervised learning approach that ensembles pseudo-labels from multiple PEFT methods and multiple VFMs.
Robust Federated Finetuning of LLMs via Alternating Optimization of LoRA: This paper proposes RoLoRA, which alternately optimizes the down-projection (\(\mathbf{A}\)) and up-projection (\(\mathbf{B}\)) matrices of LoRA to address imprecise aggregation and limited expressiveness in federated learning. RoLoRA significantly outperforms FedAVG of LoRA and FFA-LoRA on RoBERTa-Large and Llama-2-7B.
Robustifying Learning-Augmented Caching Efficiently without Compromising 1-Consistency: This paper proposes Guard, a lightweight robustification framework that improves the robustness of a broad class of learning-augmented caching algorithms to \(2H_{k-1}+2\) while preserving 1-consistency and incurring only \(O(1)\) additional overhead per request.
S2M-Former: Spiking Symmetric Mixing Branchformer for Brain Auditory Attention Detection: This paper proposes S2M-Former, a spiking-driven symmetric mixing Branchformer framework that achieves SOTA-level accuracy on EEG-based auditory attention detection with only 0.06M parameters, via complementary learning across spatial-frequency dual branches and lightweight 1D token representations, while reducing energy consumption to 1/5.8 of dual-branch ANN counterparts.
Ensemble++: Scalable Exploration via Ensemble: This paper proposes Ensemble++, which achieves regret bounds comparable to exact Thompson Sampling using only \(\Theta(d\log T)\) ensemble size via an incremental update mechanism over shared factor matrices, with natural extension to nonlinear/neural network settings.
Single-Teacher View Augmentation: Boosting Knowledge Distillation via Angular Diversity: This paper proposes Angular-KD, which attaches multiple lightweight linear branches to a single teacher model and introduces two angular diversity losses — a constrained inter-angle diversity loss and an intra-angle diversity loss — to generate diverse supervisory signals from a single teacher. This approach serves as a low-cost alternative to multi-teacher distillation and achieves state-of-the-art performance across multiple KD benchmarks.
Skrull: Towards Efficient Long Context Fine-tuning through Dynamic Data Scheduling: To address the training inefficiency caused by mixing long and short sequences in Long-context Supervised Fine-Tuning (Long-SFT), this paper proposes Skrull, a dynamic data scheduler consisting of two components — Distribution-Aware Context Parallelism (DACP) and Global Data Scheduling (GDS) — achieving an average 3.76× (up to 7.54×) training speedup in realistic Long-SFT scenarios.
Smooth Regularization for Efficient Video Recognition: This paper proposes a Gaussian Random Walk (GRW)-based smooth regularization technique that imposes temporal smoothness constraints (penalizing high-acceleration changes) on intermediate-layer embeddings of video recognition models, achieving 3.8%–6.4% accuracy improvements on lightweight models and establishing a new state of the art on Kinetics-600 under corresponding FLOP constraints.
Spark Transformer: Reactivating Sparsity in FFN and Attention: This paper proposes the Spark Transformer architecture, which simultaneously achieves high-level activation sparsity in both FFN and attention mechanisms (only 8% of neurons activated in FFN; each token attends to at most 256 tokens) via a Statistical Top-k operator. The approach achieves a 2.5× FLOPs reduction and up to 1.79× inference speedup while maintaining quality comparable to Gemma-2.
SpecAttn: Speculating Sparse Attention: SpecAttn proposes a training-free method that leverages attention weights already computed by the draft model in speculative decoding to predict important tokens for the verification model. Through KL divergence layer mapping, sorting-free top-p nucleus selection, and dynamic KV cache pruning, it achieves a 78.4% reduction in KV cache accesses with only a 15.29% increase in perplexity, significantly outperforming existing sparse attention methods.
Specialization after Generalization: Towards Understanding Test-Time Training in Foundation Models: This paper proposes a "specialization after generalization" framework that theoretically and empirically explains the effectiveness of test-time training (TTT) on in-distribution data under the Linear Representation Hypothesis (LRH). Foundation models are globally underparameterized, leading to concept superposition interference. TTT mitigates this by locally specializing the model—reallocating model capacity to the small subset of concepts relevant to the test task—thereby improving predictive performance without increasing model size.
Spiking Brain Compression: Post-Training Second-Order Compression for Spiking Neural Networks: This paper proposes Spiking Brain Compression (SBC), a second-order post-training one-shot compression framework based on the Van Rossum Distance, designed specifically for spiking neural networks (SNNs). By introducing a Surrogate Membrane Potential (SMP) Hessian, SBC enables efficient module-wise pruning and quantization, and for the first time compresses SEW-ResNet152 and Spike-Driven Transformer at the ImageNet scale.
Synergy between the Strong and the Weak: Spiking Neural Networks Are Inherently Superior in Temporal Processing: This paper identifies that SNNs can be naturally decomposed into multiple sub-models along the temporal dimension. By comparing output confidence across timestep sub-models to identify "strong" and "weak" instances, the paper proposes two self-distillation schemes — Strong2Weak and Weak2Strong — that significantly improve SNN performance without any external teacher model, achieving gains of up to 5.36% on neuromorphic datasets.
The Graphon Limit Hypothesis: Understanding Neural Network Pruning via Infinite Width Analysis: This paper proposes the "Graphon Limit Hypothesis": as network width tends to infinity, the binary mask sequences produced by different pruning methods converge, under the cut distance, to their respective unique graphon limits. Building on this foundation, the paper derives a Graphon NTK to analyze the training dynamics of sparse networks, providing a theoretical explanation for why different pruning methods yield markedly different performance at the same sparsity level.
The Structure of Relation Decoding Linear Operators in Large Language Models: This paper reveals that linear relation embeddings (LREs) in Transformer language models do not encode fine-grained relations but instead extract shared coarse-grained semantic attributes (e.g., "country," "gender"). A rank-3 tensor network is employed to compress large collections of relation decoding matrices by several orders of magnitude.
Tighter CMI-Based Generalization Bounds via Stochastic Projection and Quantization: By incorporating stochastic projection and lossy compression into the CMI (conditional mutual information) framework, this paper derives tighter generalization bounds, resolves the failure of classical CMI bounds on SCO counterexamples, and proves that memorization is not necessary for good generalization.
TokenSqueeze: Performance-Preserving Compression for Reasoning LLMs: TokenSqueeze proposes a three-stage pipeline — adaptive reasoning depth selection, intra-step linguistic refinement (with KL divergence constraints), and length-aware preference optimization — achieving 50% token compression of reasoning chains without accuracy degradation, using only self-generated data.
Toward Efficient Inference Attacks: Shadow Model Sharing via Mixture-of-Experts: This paper proposes a Mixture-of-Experts (MoE)-based shadow model sharing framework that reduces the overall training cost of shadow models by sharing feature extraction layers across multiple inference attack tasks while training only lightweight task-specific expert modules, maintaining or improving attack performance.
Towards Effective Federated Graph Foundation Model via Mitigating Knowledge Entanglement: This work is the first to propose the Federated Graph Foundation Model (FedGFM) paradigm, which integrates the distributed collaborative capability of federated graph learning with the cross-domain generalization capability of graph foundation models. Two modules — AncDAI (Anchor-based Domain-Aware Initialization) and AdaDPP (Adaptive Domain-sensitive Prompt Pool) — are introduced to mitigate knowledge entanglement, achieving state-of-the-art performance on 8 cross-task, cross-domain datasets against 20 baselines.
Towards Unsupervised Open-Set Graph Domain Adaptation via Dual Reprogramming: This paper proposes GraphRTA, a framework that addresses the challenges of known-class classification and unknown-class detection in unsupervised open-set graph domain adaptation through two complementary mechanisms: model reprogramming (gradient-guided weight pruning) and graph reprogramming (target graph structure and feature optimization), without requiring manually specified thresholds.
Train with Perturbation, Infer after Merging: A Two-Stage Framework for Continual Learning: This paper proposes the Perturb-and-Merge (P&M) framework, which introduces model merging mechanisms into the continual learning paradigm. During training, random perturbations are added along the task vector direction to smooth the loss landscape; during inference, a closed-form optimal coefficient is used to compute a convex combination of the historical model and the current task model. Combined with LoRA, the framework achieves memory-efficient state-of-the-art continual learning performance.
Traversal Verification for Speculative Tree Decoding: This paper proposes Traversal Verification, a bottom-up verification algorithm that traverses from leaf nodes to the root. Rather than making acceptance/rejection decisions based on per-token probabilities, it considers the sequence-level probability of entire paths, thereby maximizing candidate utilization. The method is theoretically proven to be lossless and optimal on single chains, and consistently improves acceptance length by 2.2%–5.7% across diverse tree structures and tasks.
Twilight: Adaptive Attention Sparsity with Hierarchical Top-p Pruning: This paper proposes Twilight, which replaces fixed-budget top-k attention sparsity with an approach inspired by top-p (nucleus) sampling: it dynamically selects the minimum set of tokens whose cumulative attention weight reaches a threshold \(p\), adapting to the distribution characteristics of different attention heads. Twilight achieves up to 1.4× additional speedup over state-of-the-art sparse attention methods while maintaining accuracy.
Understanding Differential Transformer Unchains Pretrained Self-Attentions: This paper conducts an in-depth analysis of the internal mechanism of the Differential Transformer, revealing that the differential operation is equivalent to a robust attention denoising process — it "unchains" pretrained self-attentions from the constraints of softmax normalization, enabling attention weights to be more freely allocated to genuinely important tokens.
Uni-LoRA: One Vector is All You Need: This paper proposes Uni-LoRA, a unified framework demonstrating that the parameter reduction strategies of various LoRA variants (Tied-LoRA, VeRA, VB-LoRA, etc.) are fundamentally distinguished by the choice of projection matrix mapping the full parameter space \(\mathbb{R}^D\) to a low-dimensional subspace \(\mathbb{R}^d\). An isometric random grouping projection matrix is designed such that training a single vector suffices to reconstruct all LoRA parameters of an LLM, achieving extreme parameter efficiency.
Universal Cross-Tokenizer Distillation via Approximate Likelihood Matching: This paper proposes Approximate Likelihood Matching (ALM), a principled cross-tokenizer distillation method based on binarized f-divergence, which for the first time enables effective distillation and pure distillation across fundamentally different tokenizers (e.g., subword → byte-level).
VESSA: Video-based objEct-centric Self-Supervised Adaptation for Visual Foundation Models: VESSA proposes an unsupervised adaptation method for visual foundation models using short object-centric videos. Through a self-distillation framework combined with LoRA parameter-efficient fine-tuning and an uncertainty-weighted loss, it significantly improves downstream classification performance in target domains without requiring any labeled data.
Vision-centric Token Compression in Large Language Model: Vist proposes a vision-centric slow-fast dual-path token compression framework that renders distant long-context text as images and compresses them with a lightweight vision encoder, coupled with a Probability-guided Visual Enhancement (PVE) training objective. Across 11 ICL benchmarks, it achieves comparable accuracy with 2.3× fewer tokens, reducing FLOPs by 16% and memory by 50%.
VQToken: Neural Discrete Token Representation Learning for Extreme Token Reduction in Video Large Language Models: VQToken introduces the first vector-quantization-based framework for extreme video token compression. By adaptively discretizing continuous ViT embeddings into a compact codebook and preserving spatiotemporal positional information via a token hash function, it incurs only a 0.66% accuracy loss on NextQA-MC while using merely 0.07% of the original tokens (approximately 13 tokens).
When Worse is Better: Navigating the Compression-Generation Trade-off in Visual Tokenization: This paper systematically investigates the trade-off between visual tokenizer compression rate and generation quality through scaling laws. It finds that more aggressive compression—despite yielding worse reconstruction—benefits generation for smaller models. The paper proposes Causally Regularized Tokenization (CRT), which embeds autoregressive inductive bias into Stage 1 training, achieving 2–3× computational efficiency gains. A 775M-parameter model with 256 tokens/image matches LlamaGen-3B's FID of 2.18.
zip2zip: Inference-Time Adaptive Tokenization via Online Compression: This paper proposes zip2zip, which deeply integrates the classical LZW online lossless compression algorithm into the LLM inference pipeline. During decoding, frequently co-occurring tokens are continuously merged into reusable "hypertokens" to dynamically expand the vocabulary. Combined with a dynamic embedding layer and training on compressed-space language modeling, zip2zip enables existing LLMs to acquire inference-time adaptive tokenization capability with only 10 GPU-hours of LoRA fine-tuning, achieving 15–40% reduction in input/output sequence length and up to 40% reduction in end-to-end decoding latency, with negligible downstream task performance degradation.
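As a rough illustration of the LZW-style merging that the zip2zip note describes, the sketch below compresses a sequence of token IDs by registering repeated runs as new "hypertoken" IDs on the fly. It is a minimal toy under stated assumptions, not the paper's implementation: the function names `lzw_merge` and `expand`, the `max_merge_len` cap, and the choice to return the codebook for decoding (rather than rebuilding it online, as a full LZW decoder would) are all hypothetical simplifications.

```python
def lzw_merge(tokens, base_vocab_size, max_merge_len=4):
    """LZW-style online merging of token IDs into hypertoken IDs.

    Hypertoken IDs start at base_vocab_size (assumed to exceed every base ID).
    max_merge_len is a hypothetical cap on how many base tokens one
    hypertoken may cover.
    """
    codebook = {}            # tuple of base token IDs -> hypertoken ID
    next_id = base_vocab_size
    out = []
    current = ()             # longest run matched so far

    def emit(run):
        # A single base token is emitted as-is; longer runs use their hypertoken ID.
        out.append(run[0] if len(run) == 1 else codebook[run])

    for tok in tokens:
        candidate = current + (tok,)
        if len(candidate) == 1 or candidate in codebook:
            current = candidate          # keep extending the match
            continue
        emit(current)
        if len(candidate) <= max_merge_len:
            codebook[candidate] = next_id  # register a new hypertoken
            next_id += 1
        current = (tok,)
    if current:
        emit(current)
    return out, codebook


def expand(compressed, codebook):
    """Map hypertoken IDs back to the base token IDs they cover."""
    inverse = {hid: run for run, hid in codebook.items()}
    result = []
    for t in compressed:
        result.extend(inverse.get(t, (t,)))
    return result


tokens = [1, 2, 1, 2, 1, 2, 1, 2]
compressed, codebook = lzw_merge(tokens, base_vocab_size=10)
restored = expand(compressed, codebook)
```

On this repetitive toy input the compressed sequence is shorter than the original and `expand` recovers it exactly, which is the lossless property the note attributes to LZW. How hypertokens are embedded and trained is outside the scope of this sketch.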