📦 Model Compression

💬 ACL2026 · 45 paper notes

A Computational Method for Measuring "Open Codes" in Qualitative Analysis

This paper proposes a theoretically grounded computational framework that employs an LLM-augmented code merging algorithm alongside four ground-truth-free metrics (Coverage, Overlap, Novelty, and Divergence) to systematically evaluate the performance of both human and AI coders in inductive qualitative coding.

A Layer-wise Analysis of Supervised Fine-Tuning

This paper conducts a systematic layer-wise analysis of SFT across 1B–32B models from three perspectives—information-theoretic, geometric, and optimization-based—revealing that instruction-following capability is concentrated in the middle layers (roughly the 20%–80% depth range) rather than uniformly distributed. Based on this finding, the paper proposes a Mid-Block Efficient Tuning strategy that selectively updates middle layers, achieving up to 10.2% improvement over standard LoRA on GSM8K.
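
A minimal sketch of the selective-update idea (the depth bounds and the `layers.<idx>` naming convention are assumptions for illustration, not the paper's exact recipe):

```python
# Minimal sketch: restrict fine-tuning / LoRA updates to the middle 20%-80%
# of transformer blocks. Depth bounds and parameter naming are illustrative.

def mid_block_indices(num_layers: int, lo: float = 0.2, hi: float = 0.8) -> list[int]:
    """Return indices of blocks whose relative depth falls in [lo, hi)."""
    start = int(num_layers * lo)
    end = int(num_layers * hi)
    return list(range(start, end))

def freeze_outside_mid_block(named_params, num_layers: int) -> None:
    """Enable gradients only for parameters in the selected middle blocks.

    `named_params` is an iterable of (name, parameter) pairs whose names
    contain 'layers.<idx>.' (a common but not universal naming convention).
    """
    keep = set(mid_block_indices(num_layers))
    for name, param in named_params:
        idx = None
        if ".layers." in name:
            idx = int(name.split(".layers.")[1].split(".")[0])
        param.requires_grad = idx in keep

print(mid_block_indices(32))  # blocks 6..24 of a 32-layer model
```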

Adaptive Layer Selection for Layer-Wise Token Pruning in LLM Inference

This paper proposes ASL (Adaptive Selection Layer), which monitors the variance of token attention score rankings to adaptively determine the layer at which KV cache pruning is performed. ASL significantly outperforms fixed-layer selection methods on difficult tasks while remaining training-free.
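
The adaptive selection criterion can be sketched roughly as follows: track how much the attention-based ranking of tokens changes across recent layers and prune at the first layer where the ranking has stabilized. The per-token aggregation, window size, and threshold below are illustrative assumptions rather than the paper's exact formulation.

```python
import numpy as np

def token_rankings(attn_per_layer: np.ndarray) -> np.ndarray:
    """attn_per_layer: (num_layers, num_tokens) aggregated attention per token.
    Returns the rank of each token at each layer (0 = most attended)."""
    order = np.argsort(-attn_per_layer, axis=1)
    ranks = np.empty_like(order)
    layers, tokens = attn_per_layer.shape
    ranks[np.arange(layers)[:, None], order] = np.arange(tokens)[None, :]
    return ranks

def select_pruning_layer(attn_per_layer: np.ndarray,
                         window: int = 3,
                         var_threshold: float = 2.0) -> int:
    """Pick the first layer where per-token rank variance over the last
    `window` layers drops below `var_threshold` (illustrative criterion)."""
    ranks = token_rankings(attn_per_layer).astype(float)
    num_layers = ranks.shape[0]
    for layer in range(window, num_layers):
        recent = ranks[layer - window:layer + 1]      # (window+1, tokens)
        if recent.var(axis=0).mean() < var_threshold:  # avg rank variance
            return layer
    return num_layers - 1  # fall back to the last layer

rng = np.random.default_rng(0)
base = rng.random(128)
noise = rng.random((16, 128)) * np.linspace(1.0, 0.0, 16)[:, None] ** 3
scores = base[None, :] + noise  # rankings stabilize in the deeper layers
print("prune KV cache at layer", select_pruning_layer(scores))
```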

Analytical FFN-to-MoE Restructuring via Activation Pattern Analysis

This paper proposes an analytical post-training framework that rapidly restructures dense FFN layers into sparse MoE by analyzing neuron activation patterns — distinguishing high-frequency shared experts from low-frequency routed experts and constructing routers directly from activation statistics — achieving 1.17× speedup with only 2k-sample fine-tuning.

arXiv2Table: Toward Realistic Benchmarking and Evaluation for LLM-Based Literature-Review Table Generation

This paper proposes the arXiv2Table benchmark (1,957 tables, 7,158 papers) and introduces distractor papers, schema-agnostic user requests, and an annotation-free QA-based evaluation framework to enable more realistic assessment of LLM-based literature-review table generation, along with an iterative batch generation method.

Calibrated Speculative Decoding: Frequency-Guided Candidate Selection for Efficient Inference

CSD proposes a training-free enhancement framework for speculative decoding that records high-frequency rejection patterns via an Online Correction Memory (OCM) to provide rescue candidates, and then validates candidate reliability through a Semantic Consistency Gating (SCG) mechanism based on probability ratios. The approach achieves up to 2.33× throughput improvement over standard speculative decoding while also improving accuracy on HumanEval and MATH500.
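
A rough sketch of the two components, with the keying scheme and the gating rule chosen for illustration (the paper's exact design may differ): the memory maps a short context signature of a rejected draft token to tokens the target model actually emitted there before, and the gate accepts a rescue candidate only if its target-model probability is within a fixed ratio of the target model's top choice.

```python
from collections import defaultdict, Counter
import numpy as np

class OnlineCorrectionMemory:
    """Maps a short context signature -> counts of tokens the target model
    actually emitted after a draft rejection (illustrative keying scheme)."""
    def __init__(self, context_len: int = 2):
        self.context_len = context_len
        self.counts = defaultdict(Counter)

    def key(self, prefix: list[int]) -> tuple[int, ...]:
        return tuple(prefix[-self.context_len:])

    def record(self, prefix: list[int], accepted_token: int) -> None:
        self.counts[self.key(prefix)][accepted_token] += 1

    def rescue_candidates(self, prefix: list[int], top_k: int = 3) -> list[int]:
        return [tok for tok, _ in self.counts[self.key(prefix)].most_common(top_k)]

def consistency_gate(target_probs: np.ndarray, candidate: int,
                     ratio_threshold: float = 0.5) -> bool:
    """Accept a rescue candidate only if its target-model probability is at
    least `ratio_threshold` times the target model's top-1 probability."""
    return target_probs[candidate] >= ratio_threshold * target_probs.max()

# Usage on a single rejection event (toy numbers):
ocm = OnlineCorrectionMemory()
prefix = [11, 42, 7]
ocm.record(prefix, accepted_token=99)
probs = np.full(128, 1e-3)
probs[99], probs[5] = 0.40, 0.45
probs /= probs.sum()
for cand in ocm.rescue_candidates(prefix):
    print(cand, "accepted" if consistency_gate(probs, cand) else "rejected")
```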

CBRS: Cognitive Blood Request System with Bilingual Dataset and Dual-Layer Filtering

CBRS proposes a multi-platform framework that efficiently detects and parses blood donation requests from social media message streams via a dual-layer filtering architecture (lightweight classifier + LLM). The work introduces the first bilingual dataset of 11K blood donation requests spanning Bengali, English, and transliterated Bengali. A LoRA fine-tuned Llama-3.2-3B achieves 92% zero-shot accuracy on the parsing task.

ChemAmp: Amplified Chemistry Tools via Composable Agents

This paper proposes a novel "tool amplification" paradigm (distinct from conventional tool orchestration) and introduces the ChemAmp framework, which treats chemistry-specific tools (UniMol2, Chemformer, etc.) as composable building blocks to dynamically construct task-specialized super-agents. ChemAmp surpasses both domain-specific models and general-purpose LLMs on four core chemistry tasks—including molecular design and reaction prediction—while reducing inference token costs by 94%.

CLAG: Adaptive Memory Organization via Agent-Driven Clustering for Small Language Model Agents

This paper proposes CLAG, a clustering-based agent memory framework that organizes memories into semantically coherent clusters via SLM-driven routing, performs local evolution updates within clusters, and filters noise through two-stage retrieval, achieving significant improvements over global memory pool baselines across multiple QA datasets.

Compositional Steering of Large Language Models with Steering Tokens

This paper proposes compositional steering tokens that compress behavioral instructions into input-space embedding vectors via self-distillation, and trains a dedicated composition token to capture the general concept of "composition." The approach demonstrates strong generalization to unseen behavior combinations, unseen behaviors, and unseen numbers of behaviors to compose.

DASH-KV: Accelerating Long-Context LLM Inference via Asymmetric KV Cache Hashing

This paper proposes DASH-KV, a framework that reformulates the attention mechanism as an approximate nearest neighbor search problem. By employing asymmetric deep hashing to encode queries and keys into binary codes, high-dimensional floating-point similarity computation is replaced with efficient Hamming distance bit operations. Combined with a dynamic mixed-precision mechanism, the approach reduces long-context inference complexity from \(O(N^2)\) to \(O(N)\) while matching the performance of full attention.
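
The core replacement of dot-product similarity with bit operations can be illustrated with a toy sketch. A random sign projection stands in for the learned asymmetric hash encoders (an assumption; the paper trains separate deep hash functions for queries and keys), and candidate keys are ranked by Hamming distance over packed bit codes.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_keys, n_bits = 64, 1024, 128

# Stand-in hash: a shared random sign projection. The actual method learns
# asymmetric deep hash functions for queries vs. keys.
proj = rng.standard_normal((d, n_bits))

def to_code(x: np.ndarray) -> np.ndarray:
    """Map float vectors (..., d) to packed binary codes (..., n_bits // 8)."""
    bits = (x @ proj > 0).astype(np.uint8)
    return np.packbits(bits, axis=-1)

def hamming(query_code: np.ndarray, key_codes: np.ndarray) -> np.ndarray:
    """Hamming distance between one packed code and many packed codes."""
    xor = np.bitwise_xor(key_codes, query_code)       # (n_keys, n_bits // 8)
    return np.unpackbits(xor, axis=-1).sum(axis=-1)   # popcount per key

keys = rng.standard_normal((n_keys, d))
query = keys[123] + 0.1 * rng.standard_normal(d)      # query close to key 123

key_codes = to_code(keys)
query_code = to_code(query)

top8 = np.argsort(hamming(query_code, key_codes))[:8]  # cheapest candidates
print("nearest keys by Hamming distance:", top8)
```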

DeepPrune: Parallel Scaling without Inter-Trace Redundancy

This paper proposes DeepPrune, which trains a dedicated judge model to predict answer equivalence from partial reasoning traces and combines it with an online greedy clustering algorithm to dynamically prune redundant parallel CoT paths. DeepPrune reduces token consumption by 65.73%–88.50% while maintaining competitive accuracy within 3 percentage points.
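
The pruning loop can be sketched roughly as follows, where `judge` stands in for the trained judge model (any callable returning the probability that two partial traces will end in equivalent answers); the threshold and the keep-one-per-cluster policy are illustrative assumptions.

```python
from typing import Callable

def greedy_prune(partial_traces: list[str],
                 judge: Callable[[str, str], float],
                 threshold: float = 0.8) -> list[int]:
    """Greedily cluster partial reasoning traces by judge-predicted answer
    equivalence; return indices of the traces to keep (one per cluster)."""
    representatives: list[int] = []
    for i, trace in enumerate(partial_traces):
        for rep in representatives:
            if judge(partial_traces[rep], trace) >= threshold:
                break  # predicted to reach the same answer: prune this trace
        else:
            representatives.append(i)  # no match found: start a new cluster
    return representatives

# Toy judge: call two traces equivalent if they mention the same number.
def toy_judge(a: str, b: str) -> float:
    nums_a = {t for t in a.split() if t.isdigit()}
    nums_b = {t for t in b.split() if t.isdigit()}
    return 1.0 if nums_a & nums_b else 0.0

traces = ["so the total is 42 because", "adding gives 42 , so",
          "we instead get 17 since"]
print(greedy_prune(traces, toy_judge))  # -> [0, 2]
```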

Efficient Learned Data Compression via Dual-Stream Feature Decoupling

This paper proposes FADE, a framework that employs a Dual-stream Multi-scale Decoupler to separate micro-syntactic and macro-semantic features into parallel shallow streams (replacing deep serial stacking), combined with a Hierarchical Gated Refiner and a Concurrent Stream Parallel Pipeline, achieving state-of-the-art performance in both compression ratio and throughput simultaneously.

Enabling Agents to Communicate Entirely in Latent Space

This paper proposes Interlat, a framework enabling LLM agents to communicate entirely in latent space. The sender directly transmits the final-layer hidden states as a continuous representation of its "thoughts"; the receiver interprets these latent messages via a communication adapter and further compresses them to as few as 8 tokens through latent-space reasoning, achieving up to 24× communication speedup while maintaining competitive performance.

Establishing a Scale for Kullback–Leibler Divergence in Language Models Across Various Settings

This paper embeds language models of diverse architectures into a unified space via log-likelihood vectors, systematically measures the characteristic KL divergence scales across multiple settings—pretraining, model scale, random seeds, quantization, fine-tuning, and inter-layer analysis—and reveals that pretraining trajectories exhibit subdiffusive behavior in log-likelihood space: despite continuous drift in weight space, the output distributions stabilize early in training.

FastKV: Decoupling of Context Reduction and KV Cache Compression for Prefill-Decoding Acceleration

This paper proposes FastKV, which decouples context reduction (Token-Selective Propagation during the prefill phase) from KV cache compression (layer-wise KV retention during the decoding phase), achieving 1.82× prefill speedup and 2.87× decoding speedup on LLaMA-3.1-8B-Instruct while limiting accuracy degradation to within 1% on LongBench.
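
The prefill-side idea can be sketched as selecting, at one chosen layer, the context tokens that receive the most attention and propagating only those to deeper layers. The head/query aggregation and keep ratio below are assumptions for illustration, not the paper's exact selection rule.

```python
import numpy as np

def select_tokens_for_propagation(attn: np.ndarray, keep_ratio: float = 0.25) -> np.ndarray:
    """attn: (num_heads, num_queries, num_keys) attention weights at the
    selection layer. Returns sorted indices of the key tokens to propagate."""
    score = attn.sum(axis=(0, 1))            # attention mass received per key token
    k = max(1, int(len(score) * keep_ratio))
    keep = np.argsort(-score)[:k]
    return np.sort(keep)                     # keep original token order

rng = np.random.default_rng(0)
attn = rng.dirichlet(np.ones(4096), size=(32, 64))   # (heads, queries, keys)
keep = select_tokens_for_propagation(attn)
print(f"propagating {keep.size} of {attn.shape[-1]} tokens to deeper layers")

# Deeper layers then run prefill only on hidden[keep], and the decode phase
# applies its own per-layer KV retention on the reduced cache.
```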

Find Your Optimal Teacher: Personalized Data Synthesis via Router-Guided Multi-Teacher Distillation

This paper proposes PerSyn (Personalized data Synthesis), which adopts a "Route then Generate" paradigm where a router assigns the optimal teacher model to each prompt by jointly considering student learnability and teacher response quality. Compared to the conventional "Generate then Select" paradigm, PerSyn is more efficient and effective, consistently outperforming all baselines across instruction tuning and mathematical reasoning tasks.

MAGEO: From Experience to Skill — Multi-Agent Generative Engine Optimization via Reusable Strategy Learning

This paper reframes Generative Engine Optimization (GEO) from per-instance heuristic optimization to a strategy learning problem, proposing the MAGEO multi-agent framework. The execution layer consists of four collaborating agents — preference, planning, editing, and evaluation — operating in an iterative Generate-Evaluate-Select loop, while the learning layer distills validated edit patterns into reusable engine-specific strategy skills. A Twin Branch causal evaluation protocol and the DSV-CF dual-axis metric are introduced, achieving substantial improvements over heuristic baselines across three mainstream generative engines.

From Weights to Activations: Is Steering the Next Frontier of Adaptation?

This paper systematically argues that steering (inference-time activation-space intervention) should be recognized as an independent model adaptation paradigm. It proposes eight functional evaluation criteria to compare steering against fine-tuning, PEFT, and prompt engineering, positioning steering as a locally reversible, activation-space behavior modification approach with unique advantages in computational efficiency, data efficiency, and reversibility.

HeteroCache: A Dynamic Retrieval Approach to Heterogeneous KV Cache Compression for Long-Context LLM Inference

This paper proposes HeteroCache, a training-free dynamic KV cache compression framework that exploits two dimensions of attention head characteristics—temporal heterogeneity (stable vs. drifting heads) and intra-layer redundancy (clustering of similar heads)—to implement fine-grained role assignment. Larger cache budgets are allocated to drifting heads, while representative heads sparsely monitor attention drift to trigger asynchronous on-demand retrieval, achieving 3× decoding speedup under 224K context.
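
A rough sketch of the role assignment (the drift measure and the budget rule here are illustrative assumptions): score each head by how much its top-k attended set changes across decode steps, then give drifting heads a larger share of the total KV budget.

```python
import numpy as np

def drift_score(attn_steps: np.ndarray, k: int = 32) -> float:
    """attn_steps: (num_steps, num_keys) attention of one head at successive
    decode steps. Score = 1 - mean Jaccard overlap of consecutive top-k sets."""
    tops = [set(np.argsort(-a)[:k]) for a in attn_steps]
    overlaps = [len(a & b) / len(a | b) for a, b in zip(tops, tops[1:])]
    return 1.0 - float(np.mean(overlaps))

def allocate_budgets(scores: np.ndarray, total_budget: int, floor: int = 16) -> np.ndarray:
    """Split `total_budget` cached tokens across heads in proportion to drift,
    with a small floor so stable heads keep a minimal cache."""
    spare = total_budget - floor * len(scores)
    shares = scores / scores.sum()
    return floor + np.floor(spare * shares).astype(int)

rng = np.random.default_rng(0)
num_heads, steps, keys = 8, 16, 2048
attn = rng.dirichlet(np.ones(keys), size=(num_heads, steps))
attn[:4] = attn[:4, :1]   # first four heads are stable: same pattern every step
scores = np.array([drift_score(attn[h]) for h in range(num_heads)])
print(allocate_budgets(scores, total_budget=4096))
```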

IMPACT: Importance-Aware Activation Space Reconstruction

This paper proposes IMPACT, a framework that shifts LLM low-rank compression from minimizing weight reconstruction error to minimizing importance-weighted activation reconstruction error. By incorporating gradient information into the activation covariance matrix, IMPACT derives a closed-form optimal solution, achieving up to 55.4% model size reduction while preserving accuracy.
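
A minimal sketch of the general recipe of importance-weighted, activation-aware low-rank factorization, assuming per-sample importance weights are already available (e.g., derived from gradients, as the paper suggests); IMPACT's exact weighting and closed form may differ.

```python
import numpy as np

def weighted_activation_lowrank(W: np.ndarray, X: np.ndarray,
                                weights: np.ndarray, rank: int, eps: float = 1e-6):
    """Factor W (d_out, d_in) into A @ B minimizing the importance-weighted
    activation reconstruction error || (W - A @ B) X diag(sqrt(w)) ||_F.

    X: (d_in, n) calibration activations; weights: (n,) importance per sample.
    """
    # Importance-weighted activation covariance (with a small ridge term).
    S = (X * weights) @ X.T + eps * np.eye(X.shape[0])
    # Symmetric square root S^{1/2} and its inverse via eigendecomposition.
    vals, vecs = np.linalg.eigh(S)
    S_half = (vecs * np.sqrt(vals)) @ vecs.T
    S_half_inv = (vecs / np.sqrt(vals)) @ vecs.T
    # Optimal rank-r factorization in the whitened space.
    U, s, Vt = np.linalg.svd(W @ S_half, full_matrices=False)
    A = U[:, :rank] * s[:rank]          # (d_out, rank)
    B = Vt[:rank] @ S_half_inv          # (rank, d_in)
    return A, B

rng = np.random.default_rng(0)
W = rng.standard_normal((256, 512))
X = rng.standard_normal((512, 1024))
w = rng.random(1024)                                   # stand-in importance weights
A, B = weighted_activation_lowrank(W, X, w, rank=64)
err = np.linalg.norm(((W - A @ B) @ X) * np.sqrt(w))   # weighted activation error
print(f"weighted reconstruction error: {err:.2f}")
```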

CadLLM: Improving the Throughput of Diffusion-based LLMs via Training-Free Confidence-Aware Calibration

This paper proposes CadLLM, a training-free adaptive inference acceleration method that leverages token decoding confidence signals in diffusion large language models (dLLMs) to dynamically adjust four dimensions—block size, number of steps, vocabulary sampling range, and commitment threshold—achieving 1.1–2.28× throughput improvements on LLaDA and DREAM while maintaining competitive accuracy.

JudgeMeNot: Personalizing Large Language Models to Emulate Judicial Reasoning in Hebrew

This paper proposes a synthetic-organic supervision pipeline that transforms raw judicial opinions into reasoning instruction-tuning data. Through a Chain-of-LoRA strategy (CLM → instruction tuning), the framework achieves high-fidelity emulation of individual judges' reasoning styles, producing outputs indistinguishable from authentic judicial writing in the low-resource Hebrew setting.

Latent-Condensed Transformer for Efficient Long Context Modeling

LCA proposes performing context compression directly in the latent space of MLA — aggregating semantic latent vectors via query-aware weighted pooling and preserving positional accuracy through anchor selection for positional keys — achieving 2.5× prefill speedup and 90% KV cache compression on 128K contexts while maintaining competitive performance.

LLM Prompt Duel Optimizer: Efficient Label-Free Prompt Optimization

This work formalizes label-free prompt optimization as a dueling bandit problem and proposes the Prompt Duel Optimizer (PDO), which employs Double Thompson Sampling to efficiently select the most informative prompt pairs for comparison. Combined with a top-performer mutation strategy to expand the search space, PDO identifies stronger prompts on BBH and MS MARCO with fewer judge calls than existing baselines.
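
A compact sketch of the dueling-bandit loop. It simplifies Double Thompson Sampling (the full version restricts the second arm with confidence bounds) and omits the mutation step; `toy_judge` stands in for the label-free LLM judge that declares the winner of each prompt duel.

```python
import numpy as np

rng = np.random.default_rng(0)

def duel_select(wins: np.ndarray) -> tuple[int, int]:
    """Simplified Double Thompson Sampling over pairwise Beta posteriors.
    wins[i, j] = number of duels prompt i has won against prompt j."""
    n = wins.shape[0]
    theta = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if i != j:
                theta[i, j] = rng.beta(wins[i, j] + 1, wins[j, i] + 1)
    copeland = (theta > 0.5).sum(axis=1)        # sampled Copeland scores
    first = int(np.argmax(copeland))
    # Resample the first arm's column and pick its strongest challenger.
    challengers = np.array([rng.beta(wins[j, first] + 1, wins[first, j] + 1)
                            if j != first else -np.inf for j in range(n)])
    return first, int(np.argmax(challengers))

def optimize(prompts: list[str], llm_judge, num_duels: int = 200) -> str:
    wins = np.zeros((len(prompts), len(prompts)))
    for _ in range(num_duels):
        i, j = duel_select(wins)
        winner = llm_judge(prompts[i], prompts[j])   # returns index i or j
        loser = j if winner == i else i
        wins[winner, loser] += 1
    return prompts[int(np.argmax(wins.sum(axis=1)))]

# Toy judge with hidden prompt qualities and noisy comparisons.
quality = {"p0": 0.3, "p1": 0.7, "p2": 0.5}
def toy_judge(a: str, b: str) -> int:
    names = list(quality)
    p = quality[a] / (quality[a] + quality[b])
    return names.index(a) if rng.random() < p else names.index(b)

print(optimize(list(quality), toy_judge))   # usually "p1"
```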

LoRA on the Go: Instance-level Dynamic LoRA Selection and Merging

This paper proposes LoGo (LoRA on the Go), a training-free framework that extracts LoRA activation signals (norm or entropy) via a single forward pass to dynamically select and merge the most relevant LoRA adapters at the instance level, enabling cross-task generalization without labeled data or additional training.
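
A toy sketch of the selection signal, assuming linear LoRA adapters and the norm-based score (the entropy variant and the paper's exact merging rule may differ): one forward pass through every adapter, a softmax over activation norms, and a weighted merge of the adapter outputs.

```python
import numpy as np

def logo_merge(x: np.ndarray, base_out: np.ndarray,
               adapters: list[tuple[np.ndarray, np.ndarray]],
               temperature: float = 1.0) -> np.ndarray:
    """x: (d_in,) input activation, base_out: (d_out,) frozen-layer output,
    adapters: list of LoRA pairs (A: (r, d_in), B: (d_out, r)).
    Scores each adapter by its output norm and merges with softmax weights."""
    outputs = [B @ (A @ x) for A, B in adapters]
    scores = np.array([np.linalg.norm(h) for h in outputs])
    weights = np.exp(scores / temperature)
    weights /= weights.sum()
    return base_out + sum(w * h for w, h in zip(weights, outputs))

rng = np.random.default_rng(0)
d_in, d_out, r = 64, 64, 8
x = rng.standard_normal(d_in)
base_out = rng.standard_normal(d_out)
adapters = [(rng.standard_normal((r, d_in)), 0.01 * rng.standard_normal((d_out, r)))
            for _ in range(4)]
print(logo_merge(x, base_out, adapters).shape)   # (64,)
```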

MAESTRO: Meta-learning Adaptive Estimation of Scalarization Trade-offs for Reward Optimization

This paper proposes MAESTRO, which reformulates reward scalarization in GRPO as a contextual bandit problem. A lightweight Conductor network leverages the final-layer hidden states of the policy model to adaptively select reward weights for each prompt–response pair, consistently outperforming static-reward and single-reward baselines across seven open-domain benchmarks.

Mem²Evolve: Towards Self-Evolving Agents via Co-Evolutionary Capability Expansion and Experience Distillation

This paper proposes Mem²Evolve, a self-evolving agent framework that achieves co-evolutionary capability expansion and experience distillation via a dual-memory mechanism (Asset Memory + Experience Memory). The framework attains an average Pass@1 of 70.24% across 8 benchmarks spanning 6 task categories, outperforming the strongest experience-centric and capability-centric baselines by 11.80% and 6.46%, respectively.

Mem^p: Exploring Agent Procedural Memory

This paper proposes the Mem^p framework, which systematically investigates how to construct learnable, updatable, and lifelong-evolving procedural memory for LLM agents. By distilling past task trajectories into fine-grained step-by-step instructions and high-level script abstractions, coupled with a dynamic update mechanism (addition / validation / reflection / retirement), Mem^p achieves consistent improvements in success rate and substantial reductions in execution steps on TravelPlanner and ALFWorld.

Meta-Tool: Efficient Few-Shot Tool Adaptation for Small Language Models

Through systematic comparison of hypernetwork-based LoRA adaptation versus carefully designed few-shot prompting across four benchmarks, this work demonstrates that a 227.8M-parameter hypernetwork yields zero gain—few-shot examples contribute +21.5%, document encoding contributes +5.0%, and the hypernetwork contributes 0%. A 3B model with well-crafted prompts achieves 79.7% of GPT-5's average performance at 10× lower latency.

Model Internal Sleuthing: Finding Lexical Identity and Inflectional Features in Modern Language Models

This paper systematically probes 25 Transformer language models (ranging from BERT Base to Qwen2.5-7B) and finds that lexical identity (lexeme) is linearly decodable in early layers but decays with depth, while inflectional features remain stably readable across all layers and occupy compact, controllable subspaces.

No-Worse Context-Aware Decoding: Preventing Neutral Regression in Context-Conditioned Generation

This paper proposes NWCAD, a decoding-time adapter that employs a two-stage gating mechanism to precisely fall back to context-free decoding when the context is uninformative (preventing neutral regression), and to leverage context for correction when it is helpful — simultaneously satisfying the objectives of "do no harm" and "be effective."

Polynomial Expansion Rank Adaptation: Enhancing Low-Rank Fine-Tuning with High-Order Interactions

This paper proposes PERA (Polynomial Expansion Rank Adaptation), which introduces structured polynomial expansions (square and cross terms) into the parameter space of low-rank factors, extending LoRA's linear adaptation space into a polynomial manifold. Without increasing rank or inference overhead, PERA significantly enhances the expressiveness of weight updates and consistently outperforms LoRA, DoRA, and HiRA on commonsense reasoning and NLU tasks.

Reason Only When Needed: Efficient Generative Reward Modeling via Model-Internal Uncertainty

This paper proposes E-GRM, a framework that estimates uncertainty from the convergence behavior of parallel decoding, triggers CoT reasoning only when necessary, and employs a discriminative scorer trained with a hybrid loss to evaluate reasoning path quality. E-GRM achieves state-of-the-art performance across multiple reward modeling benchmarks while reducing inference latency by 62%.

Representation-Guided Parameter-Efficient LLM Unlearning

This paper proposes ReGLU, a framework that shifts LLM unlearning from a "parameter importance" paradigm to a "representation space geometry" paradigm. It introduces Representation-guided Initialization for LoRA Adaptation (RILA), which aligns unlearning updates to the most discriminative subspace between the forget and retain sets, and a Representation Orthogonality Loss (ROL) that constrains updates from interfering with retain-set knowledge.

SAMoRA: Semantic-Aware Mixture of LoRA Experts for Task-Adaptive Learning

SAMoRA addresses imprecise routing and inflexible weight fusion in existing MoE-LoRA methods through a semantic-aware router and a task-adaptive scaling mechanism, achieving state-of-the-art performance on multi-task benchmarks with only 0.15% trainable parameters.

SCURank: Ranking Multiple Candidate Summaries with Summary Content Units for Enhanced Summarization

This paper proposes SCURank, a ranking framework based on Summary Content Units (SCUs). It extracts SCUs from candidate summaries, estimates information importance via cross-summary clustering, and scores candidates by informativeness. SCURank replaces unstable LLM-based direct ranking and coarse-grained ROUGE-based ranking. Combined with BRIO contrastive learning in a multi-LLM distillation setting, it significantly improves the summarization performance of distilled models.

SeLaR: Selective Latent Reasoning in Large Language Models

This paper proposes SeLaR, a lightweight training-free framework that activates soft-embedding latent reasoning exclusively at high-entropy "exploration steps" via an entropy gating mechanism, while retaining discrete decoding at high-confidence "certainty steps." An entropy-aware contrastive regularization is introduced to prevent soft embeddings from collapsing toward the dominant token. SeLaR consistently outperforms standard CoT and state-of-the-art training-free methods across five reasoning benchmarks.
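
The gating step itself is easy to sketch (the threshold and the soft-embedding form are illustrative, and the contrastive regularization is omitted): when the next-token distribution is high-entropy, feed a probability-weighted mixture of token embeddings forward instead of committing to a single token.

```python
import numpy as np

def softmax(logits: np.ndarray) -> np.ndarray:
    z = np.exp(logits - logits.max())
    return z / z.sum()

def next_input_embedding(logits: np.ndarray, emb: np.ndarray,
                         entropy_threshold: float = 2.0):
    """emb: (vocab, d) embedding table. Returns (embedding, used_latent_flag)."""
    p = softmax(logits)
    entropy = -(p * np.log(p + 1e-12)).sum()
    if entropy > entropy_threshold:
        return p @ emb, True             # exploration step: soft latent embedding
    return emb[int(p.argmax())], False   # certainty step: ordinary discrete token

rng = np.random.default_rng(0)
vocab, d = 1000, 64
emb = rng.standard_normal((vocab, d))
flat_logits = rng.standard_normal(vocab) * 0.1   # high entropy -> latent step
peaky_logits = np.zeros(vocab)
peaky_logits[7] = 12.0                           # low entropy -> discrete step
print(next_input_embedding(flat_logits, emb)[1],
      next_input_embedding(peaky_logits, emb)[1])
```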

Supplement Generation Training for Enhancing Agentic Task Performance

SGT (Supplement Generation Training) trains a small LLM (1.7B) to generate instance-specific supplement text (reasoning cues, summaries, error reminders, etc.) that is appended to the input, enabling a frozen large Actor model to solve tasks more effectively. SGT achieves an average improvement of 21% across 5 benchmarks without modifying the Actor's parameters.

Task-Stratified Knowledge Scaling Laws for Post-Training Quantized LLMs

This paper establishes the first task-stratified knowledge scaling laws for post-training quantization (PTQ), decomposing LLM capabilities into three levels—memorization, application, and reasoning—and jointly modeling four factors: model size, bit-width, group size, and calibration set size. The laws are validated across 293 PTQ configurations, revealing differentiated patterns: reasoning is sensitive to precision, application improves with scale, and memorization is sensitive to calibration data.

Training-Free Test-Time Contrastive Learning for Large Language Models

This paper proposes TF-TTCL, a gradient-free test-time contrastive learning framework that enables a frozen LLM to self-improve online through an Explore–Reflect–Guide cycle. It employs multi-agent role-playing to generate diverse reasoning trajectories, distills textual rules from positive–negative contrastive pairs into a memory bank, and retrieves relevant rules at inference time to guide generation.

UKP_Psycontrol at SemEval-2026 Task 2: Modeling Valence and Arousal Dynamics from Text

UKP_Psycontrol achieves first place on both subtasks of SemEval-2026 Task 2 by combining LLM prompting, a MaxEnt model with Ising interactions, and a neural regression model. The system reveals that LLMs excel at capturing static affective signals, whereas short-term affective changes are better explained by recent numerical trajectories than by textual semantics.

Which Reasoning Trajectories Teach Students to Reason Better? A Simple Metric of Informative Alignment

This paper proposes the Rank-Surprisal Ratio (RSR), a metric that jointly measures the informativeness and alignment of reasoning trajectories with respect to a student model, achieving an average Spearman correlation of 0.86 with post-training performance across 5 student models and 11 teacher models, and demonstrating utility in both trajectory selection and teacher selection.

WISCA: A Lightweight Model Transition Method to Improve LLM Training via Weight Scaling

This paper proposes an Equivalent Model Theory and the WISCA weight scaling strategy, which dynamically balances the L1 norms of \(W_q/W_k\) and \(W_v/W_o\) in Transformer attention layers during training—without altering model outputs—to steer optimization toward flatter loss minima. On GQA architectures, WISCA achieves an average 5.6% improvement on zero-shot evaluation and a 2.12% reduction in training perplexity.
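
The invariance behind the method is easy to verify: scaling \(W_q\) by \(s\) and \(W_k\) by \(1/s\) (and likewise \(W_v\) and \(W_o\)) leaves the attention output unchanged, so the scales can be chosen to equalize L1 norms. The single-head, bias-free layer below is a simplification of the paper's GQA setting.

```python
import numpy as np

def attention(x, Wq, Wk, Wv, Wo):
    """Single-head attention, no biases: a minimal stand-in for one layer."""
    q, k, v = x @ Wq.T, x @ Wk.T, x @ Wv.T
    attn = np.exp(q @ k.T / np.sqrt(q.shape[-1]))
    attn /= attn.sum(axis=-1, keepdims=True)
    return (attn @ v) @ Wo.T

def balance_pair(Wa, Wb):
    """Rescale so the two matrices have equal L1 norm; the two scales are
    reciprocal, which is what keeps the layer output unchanged."""
    s = np.sqrt(np.abs(Wb).sum() / np.abs(Wa).sum())
    return Wa * s, Wb / s

rng = np.random.default_rng(0)
d = 32
x = rng.standard_normal((8, d))
Wq, Wk, Wv, Wo = (rng.standard_normal((d, d)) * c for c in (3.0, 0.2, 5.0, 0.1))

Wq2, Wk2 = balance_pair(Wq, Wk)
Wv2, Wo2 = balance_pair(Wv, Wo)

before = attention(x, Wq, Wk, Wv, Wo)
after = attention(x, Wq2, Wk2, Wv2, Wo2)
print(np.abs(before - after).max())            # ~0 (float rounding): output unchanged
print(np.abs(Wq2).sum() / np.abs(Wk2).sum())   # 1.0: L1 norms are balanced
```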

YIELD: A Large-Scale Dataset and Evaluation Framework for Information Elicitation Agents

This paper introduces the Information Elicitation Agent (IEA) as a novel conversational paradigm, releases YIELD — the first large-scale human-human information elicitation dialogue dataset (2,281 conversations, 26M tokens) — formalizes the elicitation process as a finite-horizon POMDP, and proposes dedicated evaluation metrics (Conformity, Progression, TLR). Experiments demonstrate that fine-tuning on YIELD significantly improves LLM alignment with authentic elicitation behavior.