📦 Model Compression

🔬 ICLR2026 · 92 paper notes

A Fano-Style Accuracy Upper Bound for LLM Single-Pass Reasoning in Multi-Hop QA: This paper derives a Fano-style accuracy upper bound for LLM single-pass reasoning on multi-hop QA using information theory, revealing a "cliff-like" accuracy collapse when task information demand exceeds model output capacity. Based on this analysis, the authors design a multi-turn reasoning framework, InfoQA, which overcomes the single-pass bottleneck via capacity-aware decomposition, dependency-explicit workflows, and iterative query compression.
A Recovery Guarantee for Sparse Neural Networks: This paper establishes the first sparse recovery guarantee for ReLU neural networks: for two-layer scalar-output networks with Gaussian random training data, an iterative hard thresholding (IHT) algorithm based on convex reformulation can exactly recover sparse network weights, with memory requirements scaling only linearly in the number of nonzero weights.
A State-Transition Framework for Efficient LLM Reasoning: This paper proposes an efficient reasoning framework that models the LLM reasoning process as a state-transition process. It uses Linear Attention to compress information from historical reasoning steps into a state matrix, reducing attention complexity from \(O(C^2)\) to \(O(C)\) and KV cache from \(O(C)\) to \(O(1)\), while preserving the full CoT sequence and maintaining reasoning capability. An additional momentum strategy mitigates the overthinking problem caused by noisy reasoning steps.
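The state-matrix idea in the entry above can be illustrated with plain, unnormalized linear attention. This is a minimal sketch, assuming an identity feature map and no normalization (the note does not say which variant the paper uses); the function names are hypothetical. It shows that folding each step into a fixed-size state gives the same outputs as revisiting the whole history.

```python
def linear_attention_recurrent(queries, keys, values):
    """Process steps one at a time, keeping only a d_k x d_v state matrix."""
    d_k, d_v = len(keys[0]), len(values[0])
    state = [[0.0] * d_v for _ in range(d_k)]
    outputs = []
    for q, k, v in zip(queries, keys, values):
        # Fold the new step into the state: S += k v^T (memory stays O(1) in history length).
        for i in range(d_k):
            for j in range(d_v):
                state[i][j] += k[i] * v[j]
        # Read out with the current query: o = q^T S.
        outputs.append([sum(q[i] * state[i][j] for i in range(d_k)) for j in range(d_v)])
    return outputs


def linear_attention_quadratic(queries, keys, values):
    """Reference version: explicitly revisit every past step for each query."""
    outputs = []
    for t, q in enumerate(queries):
        out = [0.0] * len(values[0])
        for s in range(t + 1):
            score = sum(qi * ki for qi, ki in zip(q, keys[s]))
            for j, vj in enumerate(values[s]):
                out[j] += score * vj
        outputs.append(out)
    return outputs
```

Both functions return the same outputs on the same inputs; the recurrent one never stores past keys or values, which is where the constant-size cache comes from.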
A universal compression theory for lottery ticket hypothesis and neural scaling laws: This paper proves a Universal Compression Theorem: any permutation-invariant function over \(d\) objects can be asymptotically compressed to polylog(d) objects with error approaching zero (which is the optimal compression rate). From this theorem, the authors directly derive: (1) a proof of the dynamic lottery ticket hypothesis — any network can be compressed to polylogarithmic width while preserving its learning dynamics; (2) a dataset compression result — any dataset can be compressed to polylogarithmic size while preserving the loss landscape; and (3) an acceleration of power-law scaling laws to arbitrarily fast decay rates.
ABBA-Adapters: Efficient and Expressive Fine-Tuning of Foundation Models: This paper proposes ABBA adapters, which parameterize weight updates as the Hadamard product of two independently learnable low-rank matrices, \(\Delta W = s(B_1A_1) \odot (B_2A_2)\). Under the same parameter budget, ABBA achieves an effective rank of \(r_1 \cdot r_2\) compared to LoRA's \(r\), representing a quadratic improvement. Through Khatri-Rao reconstruction, ABBA maintains memory efficiency comparable to LoRA, and significantly outperforms existing PEFT methods on arithmetic and commonsense reasoning tasks.
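To make the rank claim in the ABBA entry concrete, the sketch below builds \(\Delta W = s(B_1A_1) \odot (B_2A_2)\) from two rank-2 factors and checks that the Hadamard product can reach rank \(2 \cdot 2 = 4\). The matrices and helper names are hypothetical, chosen only so the ranks are easy to verify by hand; at this toy size the example shows the rank bound being attained, not the parameter-budget advantage.

```python
from fractions import Fraction


def matmul(a, b):
    """Plain matrix product of two list-of-lists matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]


def abba_delta(b1, a1, b2, a2, s=1):
    """Weight update s * (B1 A1) elementwise-times (B2 A2)."""
    m1, m2 = matmul(b1, a1), matmul(b2, a2)
    return [[s * x * y for x, y in zip(r1, r2)] for r1, r2 in zip(m1, m2)]


def rank(matrix):
    """Exact rank via Gaussian elimination over rationals."""
    m = [[Fraction(x) for x in row] for row in matrix]
    rows, cols, r = len(m), len(m[0]), 0
    for c in range(cols):
        pivot = next((i for i in range(r, rows) if m[i][c] != 0), None)
        if pivot is None:
            continue
        m[r], m[pivot] = m[pivot], m[r]
        for i in range(r + 1, rows):
            f = m[i][c] / m[r][c]
            m[i] = [x - f * y for x, y in zip(m[i], m[r])]
        r += 1
        if r == rows:
            break
    return r


# Hypothetical 4x4 example: each factor pair has inner dimension 2.
B1 = [[1, 0], [1, 1], [1, 0], [1, 1]]
A1 = [[1, 1, 1, 1], [0, 1, 0, 1]]
B2 = [[1, 0], [1, 0], [1, 1], [1, 1]]
A2 = [[1, 1, 1, 1], [0, 0, 1, 1]]
```

Here `rank(matmul(B1, A1))` and `rank(matmul(B2, A2))` are both 2, while `rank(abba_delta(B1, A1, B2, A2))` is 4.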
ACPBench Hard: Unrestrained Reasoning about Action, Change, and Planning: This paper introduces ACPBench Hard, an open-ended generative benchmark for planning reasoning comprising 8 task types grounded in PDDL formal systems (13 domains × 8 tasks, 1,040 questions in total), equipped with symbolic validators that provide rigorous correctness guarantees. A systematic evaluation of 15 LLMs reveals that even the strongest reasoning model, o1-preview, achieves accuracy ≤66% on half the tasks, and all models fail almost completely on the most fundamental task of enumerating applicable actions, exposing fundamental deficiencies in current LLMs' planning reasoning capabilities.
Adaptive Width Neural Networks: This paper proposes the AWN framework, which automatically learns the unbounded width (number of neurons) of each layer during training via variational inference. A monotonically decreasing importance function imposes a soft ordering on neurons, enabling width to adapt to task difficulty and supporting zero-cost post-training truncation for compression.
AgilePruner: An Empirical Study of Attention and Diversity for Adaptive Visual Token Pruning in LVLMs: Through systematic empirical analysis using erank (effective rank) and attention entropy, this work reveals the complementary nature of attention-based and diversity-based visual token pruning methods — attention methods suppress hallucinations but suffer from limited coverage, while diversity methods achieve broad coverage but tend to introduce hallucinations. Based on these findings, AgilePruner is proposed to adaptively switch pruning strategies according to image complexity, achieving robust performance across 9 benchmarks.
AMiD: Knowledge Distillation for LLMs with α-mixture Assistant Distribution: This paper proposes the α-mixture assistant distribution and a unified distillation framework, AMiD. By introducing a new design variable α that controls the geometric shape of the interpolation path between teacher and student distributions, AMiD generalizes existing assistant distribution methods (m-mixture and e-mixture are special cases at α=±1), proves optimality guarantees under arbitrary divergences and α values, and achieves state-of-the-art performance on multiple LLM distillation benchmarks.
AnyBCQ: Hardware Efficient Flexible Binary-Coded Quantization for Multi-Precision LLMs: This paper proposes AnyBCQ, a multi-precision LLM quantization framework based on Binary-Coded Quantization (BCQ). By progressively expanding precision (freezing existing bit-planes and appending residual bit-planes), a single model supports dynamic switching between 2-bit and 4-bit precision. Dedicated CUDA kernels perform computation directly at the bit-plane level, eliminating lookup-table and transpose overhead. At 2-bit, AnyBCQ substantially outperforms Any-Precision LLM in accuracy (MMLU 35.3% vs. 24.7%) and achieves up to 3.0× throughput over FP16.
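The "freeze existing bit-planes and append residual bit-planes" step can be sketched with the classic greedy form of binary-coded quantization, where weights are approximated as a sum of scaled sign vectors. This is an illustrative sketch under that assumption, not AnyBCQ's actual fitting procedure or kernels; all function names are hypothetical.

```python
def reconstruct(planes, n):
    """Sum of alpha * signs over all bit-planes."""
    out = [0.0] * n
    for alpha, signs in planes:
        for i, s in enumerate(signs):
            out[i] += alpha * s
    return out


def add_bit_plane(weights, planes):
    """Keep existing planes frozen and fit one more plane to the residual."""
    approx = reconstruct(planes, len(weights))
    residual = [w - a for w, a in zip(weights, approx)]
    signs = [1 if r >= 0 else -1 for r in residual]
    # For fixed signs, the mean absolute residual is the least-squares scale.
    alpha = sum(abs(r) for r in residual) / len(residual)
    return planes + [(alpha, signs)]


def fit(weights, num_planes):
    """Greedily build num_planes bit-planes, one residual at a time."""
    planes = []
    for _ in range(num_planes):
        planes = add_bit_plane(weights, planes)
    return planes


def squared_error(weights, planes):
    approx = reconstruct(planes, len(weights))
    return sum((w - a) ** 2 for w, a in zip(weights, approx))
```

Because earlier planes are never modified, the first two planes of `fit(w, 4)` are identical to `fit(w, 2)`, which is what lets a single stored model serve several precisions in this sketch.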
BeyondBench: Contamination-Resistant Evaluation of Reasoning in Language Models: This paper proposes BeyondBench, an evaluation framework that algorithmically generates mathematical problems on-the-fly (44 tasks / 117 variants / 3 difficulty levels) to ensure each evaluation instance is free from training data contamination. It evaluates 101 language models (0.5B–141B parameters), finding that even the strongest models achieve only 56% accuracy on the Hard Suite, with substantial performance drops when tools are unavailable.
Boomerang Distillation Enables Zero-Shot Model Size Interpolation: This paper proposes the Boomerang Distillation paradigm: train a single small student model, then construct an entire family of intermediate-sized models at zero additional training cost by progressively grafting teacher transformer layer blocks back onto the student. The resulting models interpolate smoothly in performance between the student and teacher, matching or even surpassing independently distilled models of equivalent size.
Boosting Entropy with Bell Box Quantization: This paper proposes Bell Box Quantization (BBQ), the first quantization method that simultaneously satisfies information-theoretic optimality (ITO) and compute-efficiency. The core insight is the domain-agnosticity of learning—the output domain of a quantizer need not coincide with its input domain. BBQ performs ITO quantization in the input domain to maximize entropy, then maps to hardware-acceleratable data types in the output domain, achieving comprehensive improvements over QuEST and LSQ in 1–4 bit QAPT settings.
Bridging Kolmogorov Complexity and Deep Learning: Asymptotically Optimal Description Length Objectives for Transformers: Starting from Kolmogorov complexity theory, this paper proposes a theoretical framework of "asymptotically optimal description length objectives," proves the existence of such objectives for Transformers via a novel proof of their computational universality, and empirically validates the framework through a differentiable variational objective based on an adaptive Gaussian mixture prior, revealing significant optimization challenges.
COMI: Coarse-to-fine Context Compression via Marginal Information Gain: This paper proposes COMI, a coarse-to-fine adaptive context compression framework based on Marginal Information Gain (MIG = query relevance − semantic redundancy). At a 32× compression ratio, COMI improves NaturalQuestions EM by approximately 25 points over the second-best method, with the core insight being the joint optimization of relevance and diversity among retained information.
Compute-Optimal Quantization-Aware Training: Through 757 QAT experiments spanning 86M–2.2B parameters and 1–6 bits, this paper demonstrates that the optimal QAT training fraction grows with total compute budget—contradicting the previously held belief that 10% is universally optimal—and proposes the tokens-per-parameter-byte statistic along with a new loss scaling law to accurately predict the optimal QAT allocation strategy and final loss across all configurations.
ConFu: Contemplate the Future for Better Speculative Sampling: ConFu introduces contemplate tokens into the draft model of speculative decoding, enabling it to anticipate the target model's future generation direction. Combined with a MoE dynamic mechanism and anchor-point sampling training, ConFu achieves 8–11% improvements in acceptance rate and generation speed over EAGLE-3.
Cross-Domain Lossy Compression via Rate- and Classification-Constrained Optimal Transport: This paper formalizes cross-domain lossy compression — where the encoder observes a degraded source and the decoder reconstructs samples from a different target distribution — as an optimal transport problem subject to dual constraints on rate and classification loss. Closed-form DRC/RDC and DRPC tradeoff functions are derived for Bernoulli sources (Hamming distortion) and Gaussian sources (MSE). The theoretical predictions are validated against the empirical behavior of deep end-to-end compression models on super-resolution, denoising, and inpainting tasks.
Cut Less, Fold More: Model Compression through the Lens of Projection Geometry: This paper unifies structured pruning and model folding under an orthogonal projection framework—pruning as coordinate-aligned projection and folding as clustering subspace projection—and proves that folding yields strictly smaller parameter reconstruction error under a rank-one difference condition. Validation across 1,000+ checkpoints demonstrates that folding consistently outperforms pruning at medium-to-high compression ratios.
Dataset Color Quantization: A Training-Oriented Framework for Dataset-Level Compression: This paper proposes the Dataset Color Quantization (DCQ) framework, which reduces color redundancy at the dataset level through three mechanisms — chromaticity-aware clustering, attention-guided palette allocation, and texture-preserved palette optimization — achieving storage compression while maintaining training performance.
Dataset Distillation as Pushforward Optimal Quantization: This work reformulates decoupled dataset distillation as an optimal quantization problem, proves that latent-space clustering with learned weights via a diffusion prior can converge to approximate the true data distribution, and proposes the DDOQ algorithm, which surpasses baselines such as D4M on ImageNet-1K with minimal additional computation.
DiffVax: Optimization-Free Image Immunization Against Diffusion-Based Editing: DiffVax trains a feed-forward immunizer (UNet++) that generates imperceptible adversarial perturbations for arbitrary images in a single forward pass (~70ms), causing diffusion-based malicious editing to fail. Compared to prior per-image optimization methods, DiffVax achieves a 250,000× speedup and is the first to extend immunization to video content.
Distillation of Large Language Models via Concrete Score Matching: This paper proposes Concrete Score Distillation (CSD), a knowledge distillation loss for LLMs grounded in discrete score matching. By matching the relative logit differences between all vocabulary token pairs across the student and teacher, CSD simultaneously overcomes the softmax-smoothing problem and the solution-space restriction inherent in direct logit distillation.
Distilling and Adapting: A Topology-Aware Framework for Zero-Shot Interaction Prediction in Multiplex Biological Networks: This paper proposes CAZI-MBN, a framework that integrates domain-specific LLM sequence embeddings, a topology-aware unified graph tokenizer, context-aware cross-layer attention, and teacher-student distillation to enable zero-shot interaction prediction for unseen entities in multiplex biological networks, achieving AUROC improvements of 3.1–20.4% over the best baseline across 5 benchmark datasets.
Draft-based Approximate Inference for LLMs: This paper proposes the Draft-based Approximate Inference framework, which leverages lookahead predictions from a lightweight draft model to more accurately estimate token/KV pair importance. The framework comprises three methods — SpecKV (KV cache dropping), SpecPC (prompt compression), and SpecKV-PC (cascaded compression) — and consistently outperforms existing baselines on long-context benchmarks.
Efficient Reasoning with Balanced Thinking: This paper proposes ReBalance, a training-free framework that simultaneously mitigates overthinking and underthinking in large reasoning models (LRMs) via confidence-guided dynamic hidden-state steering vectors, achieving joint improvements in both reasoning efficiency and accuracy.
Embedding Compression via Spherical Coordinates: This paper proposes an embedding compression method based on spherical coordinate transformation. By exploiting the mathematical property that angular coordinates of high-dimensional unit vectors concentrate near \(\pi/2\), the method substantially reduces the entropy of the exponent bits and high-order mantissa bits in IEEE 754 floating-point representations, achieving a 1.5× compression ratio — a 25% improvement over the best lossless methods — with reconstruction error below float32 machine precision.
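The concentration property mentioned in the entry above is easy to check numerically. The sketch below converts a random unit vector to standard hyperspherical angles and back; it illustrates only the coordinate transform and the clustering of angles near \(\pi/2\), not the paper's bit-level encoding. The helper names and the Gaussian test vector are assumptions made for illustration.

```python
import math
import random


def random_unit_vector(dim, seed=0):
    """Hypothetical stand-in for a normalized embedding."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(v * v for v in x))
    return [v / norm for v in x]


def to_angles(x):
    """Unit vector of length n -> n-1 hyperspherical angles."""
    n = len(x)
    # tail[i] = sqrt(x[i]^2 + ... + x[n-1]^2)
    tail, acc = [0.0] * n, 0.0
    for i in range(n - 1, -1, -1):
        acc += x[i] * x[i]
        tail[i] = math.sqrt(acc)
    angles = [math.acos(max(-1.0, min(1.0, x[i] / tail[i]))) for i in range(n - 2)]
    angles.append(math.atan2(x[n - 1], x[n - 2]))  # last angle carries the sign
    return angles


def from_angles(angles):
    """Inverse transform back to Cartesian coordinates."""
    x, sin_prod = [], 1.0
    for a in angles:
        x.append(sin_prod * math.cos(a))
        sin_prod *= math.sin(a)
    x.append(sin_prod)
    return x
```

For a 256-dimensional vector, the leading angles deviate from \(\pi/2\) only slightly on average, so their floating-point exponents and high-order mantissa bits vary little, which is the redundancy the method is described as exploiting.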
ES-dLLM: Efficient Inference for Diffusion Large Language Models by Early-Skipping: To address the extensive computational redundancy in diffusion large language model (dLLM) inference, this paper proposes ES-dLLM, a training-free Early-Skipping acceleration framework. By estimating token importance and skipping low-importance positions in early layers, ES-dLLM achieves 5.6×–16.8× speedup on LLaDA-8B and Dream-7B without degrading generation quality.
Evolution and compression in LLMs: On the emergence of human-aligned categorization: Through the Information Bottleneck (IB) framework and the Iterated In-Context Language Learning (IICLL) paradigm, this paper demonstrates that LLMs can spontaneously develop category structures that are highly aligned with human semantic categorization systems and achieve near-optimal compression efficiency, without having been trained on any IB objective.
FASA: Frequency-aware Sparse Attention: This paper identifies a functional sparsity in RoPE attention at the frequency chunk (FC) level—fewer than 1% of "dominant FCs" suffice to approximate the token selection behavior of full attention heads. Building on this finding, the authors propose FASA, a training-free framework that employs a two-stage strategy (dominant FCs predict token importance → full attention is computed only over important tokens), achieving 8× memory compression and 2.6× inference speedup with negligible quality loss.
Fine-tuning Quantized Neural Networks with Zeroth-order Optimization: This paper proposes QZO, a method that estimates gradients via zeroth-order perturbations applied to quantization scaling factors (rather than discrete weights), and stabilizes training with directional derivative clipping (DDC). QZO enables memory-efficient fine-tuning of 4-bit/2-bit LLMs with over 18× total memory reduction.
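A toy version of the idea in the entry above: perturb only the continuous scaling factors, estimate a directional derivative from two loss evaluations, clip it, and step. This is a sketch under stated assumptions, not QZO itself: the quadratic loss, the random-sign perturbations, and the names `loss` and `zo_step` are all hypothetical choices made to keep the example small and deterministic.

```python
import random


def loss(scales, codes, targets):
    """Squared error between dequantized weights (scale * integer code) and targets."""
    return sum((s * q - t) ** 2
               for s, group, tgt in zip(scales, codes, targets)
               for q, t in zip(group, tgt))


def zo_step(scales, codes, targets, rng, lr=0.01, eps=1e-3, clip=5.0):
    """One zeroth-order update on the scales; the integer codes stay fixed."""
    # Random sign perturbation (an assumption; other distributions are possible).
    z = [rng.choice((-1.0, 1.0)) for _ in scales]
    plus = [s + eps * zi for s, zi in zip(scales, z)]
    minus = [s - eps * zi for s, zi in zip(scales, z)]
    # Two forward evaluations give the directional derivative along z.
    d = (loss(plus, codes, targets) - loss(minus, codes, targets)) / (2 * eps)
    # Clip the directional derivative to bound the step size.
    d = max(-clip, min(clip, d))
    return [s - lr * d * zi for s, zi in zip(scales, z)]


def train(scales, codes, targets, steps=300, seed=0):
    rng = random.Random(seed)
    for _ in range(steps):
        scales = zo_step(scales, codes, targets, rng)
    return scales
```

No gradients or optimizer states for the weights are stored; only the per-group scales change, which is where the memory saving in this kind of scheme comes from.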
FlyPrompt: Brain-Inspired Random-Expanded Routing with Temporal-Ensemble Experts for General Continual Learning: Inspired by the Drosophila mushroom body, specifically its sparse random expansion and modular integration mechanisms, FlyPrompt decomposes General Continual Learning (GCL) into two sub-problems, expert routing and expert capacity. It addresses routing with a Random-Expanded Analytic Router (REAR) for non-iterative expert selection, and capacity with Temporal-Ensemble Experts (TE²) built on multi-timescale EMA output heads, achieving gains of up to 11.23%/12.43%/7.62% on CIFAR-100/ImageNet-R/CUB-200.
FreqKV: Key-Value Compression in Frequency Domain for Context Window Extension: This paper proposes FreqKV, a parameter-free and architecture-agnostic KV cache compression method that iteratively compresses KV states in the frequency domain by retaining low-frequency components and discarding high-frequency ones. With only lightweight fine-tuning on 8K-length sequences, FreqKV extends the context window of LLaMA-2-7B to 256K while maintaining stable perplexity.
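Retaining low-frequency components, as the FreqKV entry describes, can be illustrated on a single channel of cached states with a discrete cosine transform. This is a minimal sketch assuming an orthonormal DCT-II along the sequence axis and reconstruction at the original length purely to show what is lost; how the paper maps the retained components back into the cache is not specified in the note. Function names are hypothetical.

```python
import math


def _scale(k, n):
    return math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)


def dct(x):
    """Orthonormal DCT-II along the sequence axis."""
    n = len(x)
    return [_scale(k, n) * sum(x[t] * math.cos(math.pi * (t + 0.5) * k / n) for t in range(n))
            for k in range(n)]


def idct(c):
    """Inverse of dct (orthonormal DCT-III)."""
    n = len(c)
    return [sum(_scale(k, n) * c[k] * math.cos(math.pi * (t + 0.5) * k / n) for k in range(n))
            for t in range(n)]


def low_pass(x, keep):
    """Keep the `keep` lowest-frequency coefficients and zero out the rest."""
    coeffs = dct(x)
    return idct(coeffs[:keep] + [0.0] * (len(x) - keep))
```

For a smooth sequence such as `[1.0, 2.0, ..., 8.0]`, keeping 4 of the 8 coefficients reproduces every entry to within a fraction of a unit, since most of the energy sits in the low frequencies.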
Grounding and Enhancing Informativeness and Utility in Dataset Distillation: This paper proposes InfoUtil, a framework that maximizes sample informativeness via game-theoretic Shapley Values (to identify the most critical patches) and maximizes sample utility via gradient norms (to select the most training-valuable samples), achieving a 6.1% improvement over the previous SOTA on ImageNet-1K.
HiFo-Prompt: Prompting with Hindsight and Foresight for LLM-based Automatic Heuristic Design: This paper proposes the HiFo-Prompt framework, which enhances LLM-driven Automatic Heuristic Design (AHD) through two collaborative modules—Hindsight (a retrospective insight pool) and Foresight (a prospective evolutionary navigator)—achieving substantial improvements over existing methods on tasks such as TSP and FSSP.
Highly Efficient and Effective LLMs with Multi-Boolean Architectures: This paper proposes a novel framework that represents LLM weights as multi-kernel Boolean parameters, enabling, for the first time, direct finetuning of large language models entirely within the Boolean domain—without requiring full-precision latent weights. The approach simultaneously surpasses existing ultra-low-bit quantization and binarization methods in both representational capacity and computational efficiency.
IDER: IDempotent Experience Replay for Reliable Continual Learning: This paper introduces idempotence into continual learning, enforcing output self-consistency during new task acquisition via two components—a Standard Idempotent Module and an Idempotent Distillation Module—simultaneously improving prediction reliability (reduced calibration error) and significantly mitigating catastrophic forgetting.
Incentivizing Agentic Reasoning in LLM Judges via Tool-Integrated Reinforcement Learning: This paper proposes TIR-Judge, an end-to-end RL framework that trains LLM judge models to interleave reasoning and code execution during evaluation. With only 8B parameters, TIR-Judge surpasses 32B reasoning reward models across 7 public benchmarks; its distillation-free variant, TIR-Judge-Zero, achieves further self-bootstrapped improvement.
InftyThink: Breaking the Length Limits of Long-Context Reasoning in Large Language Models: InftyThink is proposed as a new paradigm that transforms monolithic long-form reasoning into iterative short-form reasoning with intermediate summarization. Without modifying model architecture, it achieves theoretically unbounded reasoning depth and significantly reduced computational cost, yielding an 11% improvement for Qwen2.5-Math-7B on AIME24.
Is Finer Better? The Limits of Microscaling Formats in Large Language Models: This paper identifies and explains a counterintuitive anomaly in microscaling quantization — namely that reducing block size below a certain threshold increases quantization error for narrow-distribution tensors due to the limited dynamic range of the FP8 UE4M3 scale format — and proposes FP8 UE5M3 as a hardware-friendly solution.
KBVQ-MoE: KLT-guided SVD with Bias-Corrected Vector Quantization for MoE Large Language Models: This paper proposes KBVQ-MoE, the first vector quantization framework specifically designed for MoE architectures. It removes redundancy shared across experts via KLT-guided SVD (IDRE) and stabilizes outputs through bias correction (BCOS), achieving 10%+ accuracy improvement over existing methods at 2-bit quantization.
Knowledge Fusion of Large Language Models Via Modular Skillpacks: This paper proposes GraftLLM, a framework that extracts capabilities from heterogeneous source models into compact, transferable "SkillPacks" (modular skill packages). Through a module-aware adaptive compression strategy that stores parameter deltas, GraftLLM supports knowledge transfer, heterogeneous model fusion, and continual learning without forgetting, significantly outperforming existing PEFT and parameter merging methods across multiple settings.
Landscape of Thoughts: Visualizing the Reasoning Process of Large Language Models: This paper proposes Landscape of Thoughts (LoT), the first tool to visualize LLM reasoning trajectories as two-dimensional terrain maps. By encoding intermediate states via perplexity-based features and projecting them with t-SNE, LoT reveals reasoning behavior patterns and can be adapted as a lightweight verifier to improve reasoning accuracy and test-time scaling.
LD-MoLE: Learnable Dynamic Routing for Mixture of LoRA Experts: This paper proposes LD-MoLE, which replaces conventional TopK routing with a Sparsegen closed-form projection to achieve differentiable, dynamic, token-adaptive LoRA expert assignment. A lightweight MLP predicts sparse factors, and an analytic sparsity loss is employed. LD-MoLE outperforms fixed-routing and ReLU-routing baselines across multiple benchmarks.
LightMem: Lightweight and Efficient Memory-Augmented Generation: This paper proposes LightMem, a three-stage lightweight memory system inspired by the human Atkinson–Shiffrin memory model. Through three modules — cognitive sensory memory pre-compression, topic-aware short-term memory consolidation, and offline sleep-time updating — LightMem achieves up to 7.7% accuracy improvement on LongMemEval while reducing token consumption by up to 38×.
LLM DNA: Tracing Model Evolution via Functional Representations: Drawing an analogy from biological DNA, this work formally defines LLM DNA as a low-dimensional bi-Lipschitz representation of a model's functional behavior, proves that it satisfies the properties of heritability and genetic determinism, and designs a training-free RepTrace pipeline to extract DNA from 305 LLMs and construct their evolutionary tree.
LLMs Encode Their Failures: Predicting Success from Pre-Generation Activations: This paper demonstrates that LLMs encode model-specific success probability information in their pre-generation internal activations. Training linear probes to extract this signal enables efficient model routing that matches the accuracy of the strongest model while reducing inference cost by 70% on benchmarks such as MATH.
LoFT: Low-Rank Adaptation That Behaves Like Full Fine-Tuning: This paper proposes LoFT, a low-rank adaptation method composed of six building blocks that aligns the internal optimizer dynamics (momentum and second-order moments) with those of full fine-tuning. In the full-rank limit, LoFT exactly recovers AdamW, and it substantially closes the performance gap between LoRA and full fine-tuning across multiple benchmarks.
LookaheadKV: Fast and Accurate KV Cache Eviction by Glimpsing into the Future without Generation: This paper proposes LookaheadKV, which predicts true response attention importance scores via learnable lookahead tokens and selectively activated LoRA modules, achieving fast and accurate KV cache eviction without draft generation. The method outperforms existing approaches on multiple long-context benchmarks and reduces eviction overhead by up to 14.5×.
Memba: Membrane-driven Parameter-Efficient Fine-Tuning for Mamba: This paper proposes Memba, a parameter-efficient fine-tuning method inspired by biological neuron membrane potentials. By introducing Leaky Integration Membrane (LIM) neurons into the gating branch of Mamba, Memba achieves temporal adaptability, combined with LoRA placement optimization and cross-layer membrane transfer. With minimal trainable parameters, Memba surpasses existing Mamba PEFT methods on both language and vision tasks.
MobileLLM-R1: Exploring the Limits of Sub-Billion Language Model Reasoners with Open Training Recipes: Through careful data selection and an adaptive mixing strategy, MobileLLM-R1-950M is pretrained on only 4.2T tokens (11.7% of Qwen3's token budget) and matches or surpasses Qwen3-0.6B on reasoning benchmarks such as AIME, while fully open-sourcing both data sources and training recipes.
Modality-free Graph In-context Alignment: This paper proposes MF-GIA, the first graph in-context learning framework that simultaneously satisfies three conditions: no post-training, cross-domain alignment, and modality-agnosticism. By capturing domain characteristics via gradient fingerprints, aligning features and labels through FiLM-conditioned transformations, MF-GIA achieves state-of-the-art performance on few-shot tasks across multiple graph domains.
MoNE: Replacing Redundant Experts with Lightweight Novices for Structured Pruning of MoE: This paper proposes MoNE (Mixture-of-Novices-and-Experts), which identifies redundant experts via joint evaluation of access frequency and output variance, and replaces them with their output mean vectors ("novices"). MoNE achieves more effective and robust compression than existing pruning methods across 5 MoE models, with an average accuracy drop of only 0.14 at a 25% pruning ratio.
Multi-View Encoders for Performance Prediction in LLM-Based Agentic Workflows: This paper proposes Agentic Predictor, a multi-view workflow encoding framework that jointly models graph structure, code semantics, and prompt information to predict the performance of LLM-based agentic workflows, substantially reducing costly trial-and-error evaluations.
Null-Space Filtering for Data-Free Continual Model Merging: Preserving Stability, Promoting Plasticity: This paper proposes NUFILT, a framework that exploits the geometric property of approximate alignment between task vectors and representation subspaces. By applying null-space filtering to suppress interference with previous tasks and projection-aware LoRA to restore plasticity for new tasks, NUFILT achieves continual model merging without accessing any data. It outperforms OPCM by 4–8% on vision, NLP, and multimodal benchmarks, approaching the upper bound of individual fine-tuning.
Parallel Token Prediction for Language Models: This paper proposes Parallel Token Prediction (PTP), which relocates sampling randomness from post-processing to model inputs via auxiliary variables, rendering future tokens deterministic functions of those variables and enabling joint prediction of multiple tokens in a single forward pass.
ParoQuant: Pairwise Rotation Quantization for Efficient Reasoning LLM Inference: ParoQuant is proposed to eliminate weight outliers via hardware-efficient and optimizable independent Givens rotations combined with channel scaling, achieving high-accuracy, low-overhead 4-bit weight quantization for reasoning LLMs.
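A Givens rotation mixes exactly two coordinates, so rotations on disjoint pairs do not interact and can be applied independently. The sketch below shows how one such rotation spreads an outlier across its pair while preserving the vector norm and remaining exactly invertible. The weights, pairs, and angles are hypothetical; in the method described above the angles are optimized rather than fixed.

```python
import math


def givens_rotate(vec, i, j, theta):
    """Rotate coordinates i and j of vec by angle theta; other entries are untouched."""
    c, s = math.cos(theta), math.sin(theta)
    out = list(vec)
    out[i] = c * vec[i] - s * vec[j]
    out[j] = s * vec[i] + c * vec[j]
    return out


def rotate_pairs(vec, pairs, thetas):
    """Apply rotations on disjoint index pairs; order does not matter for disjoint pairs."""
    out = list(vec)
    for (i, j), theta in zip(pairs, thetas):
        out = givens_rotate(out, i, j, theta)
    return out
```

For example, rotating `[8.0, 0.0, 1.0, -1.0]` on the pair `(0, 1)` by `math.pi / 4` replaces the outlier 8.0 with two entries of roughly 5.66, shrinking the range a quantizer has to cover; rotating by the negative angle restores the original vector.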
PASER: Post-Training Data Selection for Efficient Pruned Large Language Model Recovery: This paper proposes PASER, a post-training data selection method for recovering pruned LLMs. It identifies capability-relevant instruction subsets via manifold learning and spectral clustering, and adaptively allocates data budgets according to the degree of capability degradation. Using only 4%–20% of the original data, PASER significantly outperforms full-data recovery.
Pedagogically-Inspired Data Synthesis for Language Model Knowledge Distillation: This paper proposes the IOA (Identifier-Organizer-Adapter) framework, which draws on Bloom's mastery learning principles and Vygotsky's Zone of Proximal Development (ZPD) theory to achieve pedagogically-driven LLM knowledge distillation through three stages: diagnosing knowledge deficiencies, designing progressive curricula, and adapting to cognitive capacity.
π-Flow: Policy-Based Few-Step Generation via Imitation Distillation: This paper proposes π-Flow, which modifies the output layer of a student flow model to predict a policy that generates dynamic flow velocities through multiple sub-steps within a single network evaluation, enabling precise ODE integration. Combined with imitation distillation—matching teacher velocities along the student's own trajectories—the method achieves stable and scalable few-step generation without the quality–diversity trade-off.
PTQ4ARVG: Post-Training Quantization for AutoRegressive Visual Generation Models: PTQ4ARVG is proposed as the first systematic PTQ framework for autoregressive visual generation (ARVG) models. It addresses three ARVG-specific quantization challenges via Gain-Projected Scaling (GPS), Static Token-Wise Quantization (STWQ), and Distribution-Guided Calibration (DGC).
QKV Projections Require a Fraction of Their Memory: This paper proposes PAMM (Point-Approximate Matrix Multiplication), an activation compression technique that approximates QKV projection layer activations by randomly selecting a small number of representative tokens, achieving up to 512× compression without degrading model performance.
Rectified Decoupled Dataset Distillation: A Closer Look for Fair and Comprehensive Evaluation: This paper proposes RD3 (Rectified Decoupled Dataset Distillation), systematically demonstrating that performance discrepancies among existing decoupled dataset distillation methods stem primarily from inconsistent post-evaluation settings rather than differences in distillation quality. By establishing a unified and fair evaluation framework, the reported 27.3% performance gap is corrected to 6.7%.
Reference-Guided Machine Unlearning: This paper proposes ReGUn (Reference-Guided Unlearning), which leverages an independent held-out dataset as a reference standard for "unseen behavior." Through class-conditional distillation, the model's behavior on forget data is aligned to that on truly unseen data, achieving a superior forgetting–utility trade-off.
Rethinking Continual Learning with Progressive Neural Collapse: This paper proposes the ProNC framework, which replaces fixed pre-defined ETFs with a progressively expanding Equiangular Tight Frame (ETF) target to achieve a balance between maximal inter-class separation and minimal forgetting in continual learning.
Revisiting Weight Regularization for Low-Rank Continual Learning: This paper reintroduces Elastic Weight Consolidation (EWC) into low-rank continual learning by estimating the Fisher Information Matrix in the full-dimensional space to regularize a shared LoRA module, achieving effective forgetting mitigation under constant memory overhead.
S2R-HDR: A Large-Scale Rendered Dataset for HDR Fusion: This paper proposes S2R-HDR, the first large-scale high-quality synthetic HDR fusion dataset (24,000 samples), and introduces S2R-Adapter, a domain adaptation method that bridges the synthetic-to-real gap, achieving state-of-the-art HDR fusion performance on real-world datasets.
Scaling Reasoning Hop Exposes Weaknesses: Demystifying and Improving Hop Generalization in Large Language Models: This paper systematically uncovers the internal mechanism underlying LLM failures in reasoning hop generalization — namely, attention head competition between correct and erroneous reasoning trajectories — and proposes TCR (Test-time Correction of Reasoning), which dynamically identifies and deactivates erroneous processing heads (ep heads) at inference time to correct reasoning errors, achieving an average accuracy improvement of 5–7%.
SeeDNorm: Self-Rescaled Dynamic Normalization: This paper proposes SeeDNorm, an adaptive dynamic normalization layer that conditions the scaling coefficients on the input itself, thereby preserving input norm information in the forward pass while retaining RMSNorm-like adaptive gradient adjustment in the backward pass. With negligible additional parameters, SeeDNorm consistently outperforms RMSNorm, LayerNorm, and DyT on both language modeling and vision tasks.
SERE: Similarity-based Expert Re-routing for Efficient Batch Decoding in MoE Models: This paper proposes SERE, which pre-computes an expert similarity matrix and dynamically re-routes secondary experts to their most similar primary experts during batch decoding, achieving up to 2.0× speedup with negligible quality loss. The method ships with a plug-and-play vLLM CUDA kernel.
SFT Doesn't Always Hurt General Capabilities: Revisiting Domain-Specific Fine-Tuning in LLMs: This paper systematically revisits the impact of domain-specific SFT on the general capabilities of LLMs, demonstrating that a smaller learning rate can substantially mitigate general capability degradation. It also proposes Token-Adaptive Loss Reweighting (TALR), which further improves the trade-off between domain adaptation and general capability retention by adaptively down-weighting the loss of low-probability tokens.
Specialization after Generalization: Towards Understanding Test-Time Training in Foundation Models: Grounded in the Linear Representation Hypothesis (LRH), this paper proposes a theoretical framework termed specialization after generalization, providing the first systematic explanation of why test-time training (TTT) is effective under in-distribution settings. Foundation models suffer from concept superposition due to global underparameterization; TTT temporarily forgets irrelevant concepts to free model capacity, locally specializing to the small set of concepts relevant to the test task. The theory guarantees generalization even when the feature space is exponentially smaller than the concept space.
STAR: Similarity-guided Teacher-Assisted Refinement for Super-Tiny Function Calling Models: This paper proposes the STAR framework, which combines Constrained Knowledge Distillation (CKD) and Similarity-guided Reinforcement Learning (Sim-RL) to effectively transfer function calling capabilities from large models to super-tiny models at the 0.6B scale, achieving substantial improvements over baselines on BFCL and ACEBench.
Steering MoE LLMs via Expert (De)Activation: This paper proposes SteerMoE, which detects behavior-correlated experts via contrastive paired inputs and steers MoE LLM behavior at inference time by activating or deactivating specific experts (safety +20%, faithfulness +27%), while also exposing the fragility of safety alignment in MoE models (safety collapse −100%).
Stress-Testing Alignment Audits with Prompt-Level Strategic Deception: This paper constructs an automated prompt-level red-teaming pipeline (powered by Claude Opus 4.5) to augment situational awareness and strategic reasoning in existing fine-tuned model organisms, and stress-tests four black-box and white-box alignment auditing methods across six experimental settings. The pipeline successfully induces high-confidence incorrect guesses from all auditing methods and provides the first documented instance of prompt-level activation deception without any weight modification.
SwiReasoning: Switch-Thinking in Latent and Explicit for Pareto-Superior Reasoning: This paper proposes SwiReasoning, a training-free LLM reasoning framework that dynamically switches between explicit (chain-of-thought) and implicit (latent space) reasoning modes via entropy-trend-based block-level confidence estimation, achieving Pareto-superior improvements in both accuracy (+1.8%–3.1%) and token efficiency (+57%–79%).
Taming Momentum: Rethinking Optimizer States Through Low-Rank Approximation: This work reveals that EMA-based momentum updates are equivalent to gradient descent on an online linear regression objective, and builds upon this insight to propose LoRA-Pre — a method that compresses optimizer momentum via low-rank factorization for memory-efficient LLM pretraining and fine-tuning. LoRA-Pre achieves state-of-the-art performance across all model scales using only 1/8 the rank required by baseline methods.
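The equivalence stated in this entry can be checked numerically in its simplest form. The sketch below assumes the least-squares objective \(L(m) = \tfrac{1}{2}\lVert m - g_t\rVert^2\) and a step size of \(1 - \beta\); the paper's exact regression objective may differ, and the function names are hypothetical. Under those assumptions, one gradient step gives \(m_{t-1} - (1-\beta)(m_{t-1} - g_t) = \beta m_{t-1} + (1-\beta) g_t\), which is the EMA update.

```python
def ema_update(m_prev, grad, beta):
    """Standard EMA momentum update: m_t = beta * m_{t-1} + (1 - beta) * g_t."""
    return [beta * m + (1.0 - beta) * g for m, g in zip(m_prev, grad)]


def regression_gd_step(m_prev, grad, beta):
    """One gradient step on L(m) = 0.5 * ||m - grad||^2 with step size (1 - beta).

    The gradient of L with respect to m is (m - grad).
    """
    lr = 1.0 - beta
    return [m - lr * (m - g) for m, g in zip(m_prev, grad)]
```

Both functions return the same vector up to floating-point rounding, which is what makes it natural to treat the momentum buffer as a regression variable that can be factorized at low rank.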
Textual Equilibrium Propagation for Deep Compound AI Systems: This paper proposes Textual Equilibrium Propagation (TEP), a compound AI system optimization method grounded in local learning principles. Through a two-phase design consisting of a free phase and a nudged phase, TEP avoids gradient explosion/vanishing in global textual backpropagation and significantly outperforms TextGrad on deep workflows.
The Geometry of LLM Quantization: GPTQ as Babai's Nearest Plane Algorithm: This paper provides the first proof that GPTQ (executed in reverse order) is mathematically equivalent to Babai's nearest plane algorithm from classical lattice theory. The equivalence yields a geometric interpretation and layer-wise error upper bounds, on which the authors build an improved, clipping-free quantization method.
The Lattice Geometry of Neural Network Quantization -- A Short Equivalence Proof of GPTQ and Babai's Algorithm: Independently of Chen et al. (2026), this paper provides a more concise and elegant proof that GPTQ is equivalent to Babai's nearest plane algorithm, and clarifies the prospect of lattice basis reduction for improving neural network quantization.
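For readers unfamiliar with the lattice side of the two entries above, here is a minimal sketch of the classical Babai nearest plane algorithm. It is not GPTQ and not either paper's construction; it assumes a linearly independent basis given as rows, uses plain classical Gram-Schmidt, and all function names are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def gram_schmidt(basis):
    """Classical Gram-Schmidt orthogonalization (no normalization)."""
    ortho = []
    for b in basis:
        v = list(b)
        for o in ortho:
            mu = dot(b, o) / dot(o, o)
            v = [vi - mu * oi for vi, oi in zip(v, o)]
        ortho.append(v)
    return ortho


def babai_nearest_plane(basis, target):
    """Return integer coefficients and the lattice point close to target.

    Works from the last basis vector to the first: round the projection of
    the current residual onto the orthogonalized vector, then subtract that
    integer multiple of the original basis vector.
    """
    ortho = gram_schmidt(basis)
    residual = list(target)
    coeffs = [0] * len(basis)
    for i in reversed(range(len(basis))):
        # Python's round() breaks exact ties toward the even integer.
        c = round(dot(residual, ortho[i]) / dot(ortho[i], ortho[i]))
        coeffs[i] = c
        residual = [r - c * b for r, b in zip(residual, basis[i])]
    point = [t - r for t, r in zip(target, residual)]
    return coeffs, point
```

For example, `babai_nearest_plane([[2, 0], [1, 2]], [2.6, 1.9])` selects coefficients `[1, 1]`, which corresponds to the lattice point `(3, 2)`. The quality of the result depends on how close to orthogonal the basis is, which is why lattice basis reduction is relevant.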
The Unseen Frontier: Pushing the Limits of LLM Sparsity with Surrogate-Free ADMM: This paper proposes Elsa, a method that directly solves sparsity-constrained optimization via surrogate-free ADMM, breaking the 50–60% "sparsity wall" bottleneck in LLM pruning and maintaining high model fidelity even at 90% sparsity.
TiTok: Transfer Token-level Knowledge via Contrastive Excess to Transplant LoRA: This paper proposes TiTok, a framework that enables efficient cross-model transfer of LoRA adapters via token-level contrastive excess scores, without requiring an auxiliary discriminator model. TiTok consistently outperforms TransLoRA and knowledge distillation baselines on reasoning and personalization tasks.
Token Distillation: Attention-Aware Input Embeddings for New Tokens: This paper proposes Token Distillation, a method that distills multi-subword interaction information encoded across all Transformer layers into a single token embedding, enabling high-quality initialization of new token embeddings without requiring a pretrained hypernetwork and outperforming existing approaches.
Topology and Geometry of the Learning Space of ReLU Networks: Connectivity and Size: From the perspectives of algebraic geometry and algebraic topology, this paper systematically investigates the connectivity and singularity of the parameter space of feedforward ReLU networks defined on general DAG architectures. It reveals the critical role of bottleneck nodes and balance conditions in determining the topological structure of the parameter space, and establishes a theoretical connection between singularities and differentiable pruning.
Towards Efficient Constraint Handling in Neural Solvers for Routing Problems: This paper proposes the Construct-and-Refine (CaR) framework, which achieves efficient feasibility repair through joint training of a construction module and a lightweight refinement module. CaR provides the first general and efficient neural constraint-handling solution for hard-constrained routing problems, substantially outperforming classical and neural SOTA solvers on TSPTW and CVRPBLTW.
TurboBoA: Faster and Exact Attention-aware Quantization without Backpropagation: This paper proposes TurboBoA, a backpropagation-free post-training quantization method for LLMs that runs more than 3× faster than BoA while retaining its accuracy advantages. The speedup comes from three innovations: joint multi-output-channel quantization, preceding-layer error compensation, and adaptive grid selection.
Understanding Dataset Distillation via Spectral Filtering: This paper proposes UniDD, a spectral filtering framework that unifies diverse dataset distillation methods as applying different filter functions on the feature-feature correlation (FFC) matrix to match the frequency information of the feature-label correlation (FLC) matrix. Building on this insight, the paper further introduces Curriculum Frequency Matching (CFM).
UniFlow: A Unified Pixel Flow Tokenizer for Visual Understanding and Generation: This paper proposes UniFlow, a unified tokenizer that preserves semantic understanding via hierarchical adaptive self-distillation and achieves high-fidelity reconstruction via a lightweight patch-wise pixel flow decoder. UniFlow achieves state-of-the-art performance on both understanding and generation across 13 benchmarks. The 7B UniFlow-XL surpasses the 14B TokenFlow-XL by 6.05% on average across understanding benchmarks while using 40% less training data.
Unveiling Super Experts in Mixture-of-Experts Large Language Models: This paper is the first to discover and systematically study "Super Experts" (SEs) in MoE LLMs: an extremely small subset of experts that are critical to model inference and that drive massive activations and attention sink mechanisms through extreme activation outliers in their down_proj outputs.
What Layers When: Learning to Skip Compute in LLMs with Residual Gates: This paper proposes GateSkip, which inserts a sigmoid-linear gate at the output of each Attention/MLP branch in a decoder-only Transformer and jointly optimizes gate sparsity and the language modeling objective during fine-tuning. At inference time, it deterministically skips low-importance tokens layer by layer using a quantile threshold over gate values, achieving token-level adaptive depth. On Llama 8B, GateSkip saves 15% compute while retaining >90% accuracy; on instruction-tuned models, the full-compute variant improves accuracy over the baseline, and ~50% savings still matches the baseline. The method is orthogonal to, and composable with, INT4 quantization, structured pruning, and self-speculative decoding.
Why Attention Patterns Exist: A Unifying Temporal Perspective Analysis: This paper proposes the TAPPA framework, which gives a unified explanation, from a temporal continuity perspective, of how various attention patterns in LLMs form (attention sink, diagonal, periodic, etc.). It also uses query self-similarity (q-similarity) as a metric to guide KV cache compression and model pruning.