📦 Model Compression

📹 ICCV 2025 · 48 paper notes

A Good Teacher Adapts Their Knowledge for Distillation

This paper identifies the root cause of the teacher–student capacity gap in knowledge distillation as intra-class distribution mismatch in the output distributions, and proposes AID (Adapted Intra-class Distribution), a method that fine-tunes the teacher model prior to distillation to align its intra-class distribution with the student's learning capacity, achieving state-of-the-art performance across diverse architecture combinations.

Achieving More with Less: Additive Prompt Tuning for Rehearsal-Free Class-Incremental Learning

This paper proposes APT (Additive Prompt Tuning), which replaces the conventional prompt concatenation paradigm with an additive operation. By introducing only two learnable vectors added to the key/value projections of the CLS token, APT achieves state-of-the-art class-incremental learning performance while substantially reducing computational overhead (41.5% reduction in GFLOPs) and trainable parameters (78.2% reduction).

ARGMatch: Adaptive Refinement Gathering for Efficient Dense Matching

This paper proposes an Adaptive Refinement Gathering pipeline comprising three modules—a content-aware offset estimator, a local consistency matching corrector, and a local consistency upsampler—augmented with an adaptive gating mechanism. The approach substantially reduces reliance on heavyweight feature extractors and global matchers, achieving performance comparable to state-of-the-art methods with a lightweight model.

B-VLLM: A Vision Large Language Model with Balanced Spatio-Temporal Tokens

This paper proposes B-VLLM, a framework that dynamically balances spatio-temporal tokens within the VLLM context window budget through three modules: text-conditioned adaptive frame selection, temporal frame token merging, and spatial token sampling. It addresses the dilemma between uniform sampling (which neglects temporal dynamics) and per-frame token reduction (which loses spatial detail), achieving a 10% improvement on MVBench.

Beyond Low-Rank Tuning: Model Prior-Guided Rank Allocation for Effective Transfer in Low-Data and Large-Gap Regimes

This paper proposes SR-LoRA (Stable Rank-Guided LoRA), which leverages the stable rank of pretrained weight matrices as a natural prior to assign optimal per-layer ranks for LoRA modules. Without any search procedure, SR-LoRA achieves flexible layer-wise rank allocation and significantly outperforms fixed low-rank LoRA and other adaptive-rank methods in large-domain-gap and few-shot transfer scenarios such as medical imaging.
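The stable rank mentioned above is cheap to compute from the pretrained weights alone. A minimal sketch, assuming a proportional allocation rule (the paper's exact allocation formula may differ):

```python
import numpy as np

def stable_rank(W: np.ndarray) -> float:
    """Stable rank ||W||_F^2 / ||W||_2^2: a smooth proxy for the
    effective rank of a weight matrix, always <= rank(W)."""
    fro2 = np.sum(W ** 2)
    spec2 = np.linalg.norm(W, ord=2) ** 2  # largest singular value, squared
    return float(fro2 / spec2)

def allocate_ranks(weights, total_budget):
    """Hypothetical allocation: split a total LoRA rank budget across
    layers in proportion to each layer's stable rank (illustrative rule)."""
    srs = np.array([stable_rank(W) for W in weights])
    return np.maximum(1, np.round(total_budget * srs / srs.sum())).astype(int)

rng = np.random.default_rng(0)
low = np.outer(rng.normal(size=64), rng.normal(size=64))  # rank-1 matrix
iso = np.linalg.qr(rng.normal(size=(64, 64)))[0]          # orthogonal matrix
ranks = allocate_ranks([low, iso], total_budget=65)
```

A near-rank-1 layer gets stable rank close to 1, while an orthogonal (isotropic) layer gets stable rank equal to its dimension, so the budget naturally concentrates on layers whose weights spread energy across many directions.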

Bridging Continuous and Discrete Tokens for Autoregressive Visual Generation

This paper proposes TokenBridge, which converts continuous tokens into discrete tokens by applying post-training dimension-wise quantization to pre-trained continuous VAE features. The approach preserves the high-fidelity representation capability of continuous tokens while enabling straightforward autoregressive modeling with standard cross-entropy loss, achieving generation quality on ImageNet 256×256 comparable to continuous methods.
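Dimension-wise quantization of frozen continuous latents can be sketched in a few lines; this is an illustrative stand-in (uniform per-channel binning over the observed range), not TokenBridge's exact quantizer:

```python
import numpy as np

def dimwise_quantize(z, n_levels=64):
    """Bin each channel of a continuous latent independently into
    n_levels uniform levels over its observed range, yielding discrete
    indices a standard autoregressive model can predict with cross-entropy."""
    lo = z.min(axis=0, keepdims=True)
    hi = z.max(axis=0, keepdims=True)
    step = (hi - lo) / (n_levels - 1)
    idx = np.round((z - lo) / step).astype(int)  # discrete tokens
    recon = lo + idx * step                      # dequantized latent
    return idx, recon

rng = np.random.default_rng(1)
z = rng.normal(size=(1024, 8))  # stand-in for continuous VAE features
idx, recon = dimwise_quantize(z, n_levels=64)
```

Because quantization is applied post hoc and per dimension, the reconstruction error per channel is bounded by half the channel's step size, which is how high fidelity is retained while the tokens become discrete.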

CIARD: Cyclic Iterative Adversarial Robustness Distillation

This paper proposes CIARD, which addresses the optimization objective conflict between the clean teacher and robust teacher in dual-teacher ARD frameworks via a Contrastive Push Loss, and introduces an Iterative Teacher Training (ITT) strategy to continuously update the robust teacher and prevent performance degradation. CIARD simultaneously improves adversarial robustness by 3.53% and clean accuracy by 5.87% on CIFAR-10/100 and Tiny-ImageNet.

Color Matching Using Hypernetwork-Based Kolmogorov-Arnold Networks (cmKAN)

This paper proposes cmKAN, a hypernetwork-driven Kolmogorov-Arnold Network for color matching. A generator predicts spatially varying KAN spline parameters, supporting three scenarios (supervised / unsupervised / pairwise optimization) and three tasks (raw-to-raw / raw-to-sRGB / sRGB-to-sRGB). cmKAN outperforms existing methods by an average of 37.3% across all tasks while remaining extremely lightweight (76.4K parameters).

Colors See Colors Ignore: Clothes Changing ReID with Color Disentanglement

This paper proposes CSCI, a method that introduces a Color token to learn color representations (Color See) and employs a novel S2A self-attention mechanism to disentangle color information from ReID features (Color Ignore), effectively eliminating appearance bias in clothes-changing person re-identification without requiring any external annotations.

Competitive Distillation: A Simple Learning Strategy for Improving Visual Classification

This paper proposes a competitive distillation strategy in which, during multi-network joint training, the best-performing network is dynamically selected as the teacher at each iteration. Combined with a stochastic perturbation mechanism that introduces mutation operations analogous to genetic algorithms, the approach achieves significant improvements in visual classification performance.

Context Guided Transformer Entropy Modeling for Video Compression

This paper proposes the Context Guided Transformer (CGT) conditional entropy model, which reduces entropy modeling time by approximately 65% while achieving an 11% BD-Rate improvement in video compression. This is accomplished via a Temporal Context Resampler that reduces computational overhead and a Dependency-Weighted Spatial Context Assigner that explicitly models spatial dependencies.

Cross-Architecture Distillation Made Simple with Redundancy Suppression

This paper proposes RSD (Redundancy Suppression Distillation), which extracts architecture-agnostic knowledge via cross-architecture invariance maximization and feature decorrelation. Using a single simple RSD loss and a lightweight MLP decoupling module, RSD substantially outperforms OFA—the pioneering cross-architecture distillation method—on both CIFAR-100 and ImageNet-1k, while incurring only a fraction of OFA's parameter overhead.

Dataset Distillation via the Wasserstein Metric

This paper proposes WMDD (Wasserstein Metric-based Dataset Distillation), which replaces MMD with Wasserstein barycenters for distribution matching and incorporates per-class BatchNorm regularization, achieving state-of-the-art dataset distillation performance on large-scale benchmarks including ImageNet-1K.

DLF: Extreme Image Compression with Dual-generative Latent Fusion

This paper proposes the Dual-generative Latent Fusion (DLF) framework, which decomposes the image latent space into semantic and detail branches for separate compression, and eliminates inter-branch redundancy via a cross-branch interactive design. At extreme low bitrates (<0.01 bpp), DLF achieves state-of-the-art reconstruction quality with BD-Rate savings of up to 67.82% over MS-ILLM, while decoding significantly faster than diffusion-based approaches.

DuoLoRA: Cycle-Consistent and Rank-Disentangled Content-Style Personalization

DuoLoRA introduces rank-dimension mask learning (ZipRank) for LoRA merging, combined with SDXL layer priors and a cycle-consistent merging loss (Constyle loss), enabling efficient content-style LoRA composition that surpasses ZipLoRA and other state-of-the-art methods across multiple benchmarks while reducing trainable parameters by 19×.

EA-ViT: Efficient Adaptation for Elastic Vision Transformer

This paper proposes the first ViT framework that introduces elastic structure at the adaptation stage. Through a multi-dimensional elastic architecture, curriculum learning, and a lightweight router, a single adaptation run yields sub-models covering \(10^{26}\) configurations, consistently outperforming existing elastic methods across multiple downstream tasks.

Efficient Adaptation of Pre-Trained Vision Transformer Underpinned by Approximation Theory

This paper identifies that the row/column vectors of pre-trained ViT weight matrices exhibit approximate orthogonality, whereas the projection matrices learned by LoRA/Adapter do not. The authors propose AOFT, a strategy that generates approximately orthogonal down/up projection matrices from a single learnable vector, aligning the adaptation modules with the properties of the backbone network. This reduces the generalization error bound and achieves competitive performance on FGVC and VTAB-1k with fewer parameters.

FastVAR: Linear Visual Autoregressive Modeling via Cached Token Pruning

FastVAR proposes a training-free post-hoc acceleration method for VAR models. Grounded in the observation that large-scale steps primarily model high-frequency textures and are robust to pruning, it selects pivotal tokens via frequency-guided scoring (PTS) to retain only high-frequency tokens during the forward pass, and restores pruned positions using cached token maps from earlier scales (CTR). Built on top of FlashAttention, FastVAR achieves an additional 2.7× speedup with less than 1% performance degradation, and for the first time enables 2K image generation in 1.5 seconds on a single RTX 3090 GPU.

Fuse Before Transfer: Knowledge Fusion for Heterogeneous Distillation

This paper proposes FBT (Fuse Before Transfer), which mitigates the feature gap in cross-architecture knowledge distillation (CAKD) by first fusing modules (CNN/MSA/MLP) from heterogeneous teachers and students to construct an adaptive intermediate fusion model before knowledge transfer, and replaces the conventional MSE loss with a spatial-agnostic InfoNCE loss. FBT achieves an average improvement of 8.38% on CIFAR-100 and 2.31% on ImageNet-1K.

Gain-MLP: Improving HDR Gain Map Encoding via a Lightweight MLP

This paper proposes replacing traditional JPEG/HEIC compression with a lightweight 10 KB MLP network for encoding HDR gain maps. The MLP takes SDR image color and spatial coordinates \((r,g,b,x,y)\) as input and incorporates exponential residual encoding (gamma map), outperforming existing methods and traditional compression techniques across multiple HDR reconstruction metrics.
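The idea of encoding a gain map as a tiny coordinate MLP can be sketched as follows; the widths here are illustrative assumptions (the paper's exact architecture is not specified above), chosen so the fp32 parameter count lands near the stated ~10 KB:

```python
import numpy as np

def init_mlp(sizes, rng):
    """Hypothetical Gain-MLP: layer widths are illustrative."""
    return [(rng.normal(scale=0.1, size=(m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def forward(params, x):
    for i, (W, b) in enumerate(params):
        x = x @ W + b
        if i < len(params) - 1:
            x = np.maximum(x, 0.0)  # ReLU on hidden layers
    return x

rng = np.random.default_rng(0)
# Input is per-pixel (r, g, b, x, y); output is a 3-channel gain value.
params = init_mlp([5, 48, 48, 3], rng)
n_params = sum(W.size + b.size for W, b in params)  # 2787 weights
pix = rng.uniform(size=(4, 5))
gain = forward(params, pix)
```

At fp32 this is roughly 11 KB of weights: the "compressed file" is simply the network overfit to one image's gain map, decoded by evaluating the MLP at every pixel coordinate.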

Generalized Tensor-based Parameter-Efficient Fine-Tuning via Lie Group Transformations

This paper proposes LieRA, which leverages Lie group theory to generalize matrix-level PEFT methods (e.g., LoRA) to high-dimensional parameter spaces (e.g., convolutional kernels). By representing perturbations in the Lie algebra and mapping them back to the Lie group via the exponential map, LieRA achieves efficient fine-tuning while preserving the structural properties of the parameter space.

Gradient Short-Circuit: Efficient Out-of-Distribution Detection via Feature Intervention

This paper identifies that ID samples exhibit consistent local gradient directions while OOD samples display chaotic gradient directions, and proposes to "short-circuit" feature coordinates exploited by spurious gradients at inference time to suppress OOD confidence. A first-order Taylor approximation is employed to avoid a second forward pass, yielding a lightweight and efficient OOD detection method.

Heavy Labels Out! Dataset Distillation with Label Space Lightening

This paper proposes the HeLlO framework, which constructs a lightweight image-label projector using a CLIP pretrained model and LoRA-like low-rank knowledge transfer, reducing soft label storage in dataset distillation to 0.003% of the original while maintaining or surpassing SOTA performance.

Integrating Task-Specific and Universal Adapters for Pre-Trained Model-based Class-Incremental Learning

This paper proposes TUNA, a method that trains orthogonal task-specific adapters for each incremental task and merges them into a universal adapter. Combined with an entropy-based adapter selection mechanism and a dual-adapter ensemble inference strategy, TUNA achieves state-of-the-art performance in exemplar-free PTM-based class-incremental learning.

Knowledge Distillation with Refined Logits

RLD refines teacher knowledge into two complementary forms — Sample Confidence and Masked Correlation — to mitigate the negative effects of teacher mispredictions without disrupting inter-class correlations. It consistently outperforms existing logit distillation methods on both CIFAR-100 and ImageNet.

Learned Image Compression with Hierarchical Progressive Context Modeling

This paper proposes a Hierarchical Progressive Context Model (HPCM) that partitions the latent representation into multi-scale sub-representations and encodes them sequentially from coarse to fine, combined with a cross-attention-based progressive context fusion mechanism across coding steps, enabling more efficient long-range dependency modeling and more accurate entropy parameter estimation while achieving a better trade-off between compression performance and computational complexity.

Local Dense Logit Relations for Enhanced Knowledge Distillation

This paper proposes Local Dense Relational Logit Distillation (LDRLD), which captures fine-grained inter-class relations by recursively decoupling and recombining logit knowledge, combined with an Adaptive Decay Weight (ADW) strategy that assigns higher weights to critical class pairs. LDRLD consistently outperforms existing logit distillation state-of-the-art methods on CIFAR-100, ImageNet-1K, and Tiny-ImageNet.

MixA-Q: Revisiting Activation Sparsity for Vision Transformers from a Mixed-Precision Quantization Perspective

This paper proposes MixA-Q, a mixed-precision activation quantization framework that repurposes window-level activation sparsity (originally used for pruning) as a dimension for quantization — assigning lower bit-widths to less important windows rather than skipping their computation entirely. The method achieves lossless 1.35× speedup under PTQ and lossless 1.25× speedup under QAT on COCO object detection, while exhibiting superior out-of-distribution (OOD) robustness.

MotionFollower: Editing Video Motion via Lightweight Score-Guided Diffusion

This paper proposes MotionFollower, which achieves video motion editing via two lightweight convolutional controllers (pose + appearance) and a consistency guidance mechanism based on score function regularization, surpassing strong baselines such as MotionEditor while reducing GPU memory consumption by approximately 80%.

MSQ: Memory-Efficient Bit Sparsification Quantization

MSQ discovers mixed-precision quantization by computing the least significant bit (LSB) directly from the weights via a RoundClamp quantizer and imposing L1 regularization to induce bit-level sparsity, without explicitly creating bit-level trainable parameters. This reduces trainable parameters by 8× and training time by 86% while maintaining competitive accuracy–compression trade-offs.
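A hedged sketch of the two ingredients named above, with a simple symmetric uniform quantizer standing in for the paper's exact RoundClamp formulation:

```python
import numpy as np

def round_clamp(w, n_bits=4):
    """Round to the integer grid and clamp to the signed n-bit range.
    The step size (LSB) is derived directly from the weights, so no
    per-bit trainable parameters are created."""
    qmax = 2 ** (n_bits - 1) - 1
    lsb = np.abs(w).max() / qmax                      # LSB from the weights
    q = np.clip(np.round(w / lsb), -qmax - 1, qmax)   # integer levels
    return q * lsb, lsb

def l1_bit_sparsity(w, lsb):
    """L1 penalty on the integer levels: shrinking levels toward zero
    zeroes out low-order bits, which is what lets a layer's effective
    precision collapse to fewer bits."""
    return np.abs(np.round(w / lsb)).sum()

rng = np.random.default_rng(0)
w = rng.normal(scale=0.05, size=256)
wq, lsb = round_clamp(w, n_bits=4)
penalty = l1_bit_sparsity(w, lsb)
```

The quantization error of each weight is bounded by half the LSB, and the L1 term is what the training loss would add to encourage sparse (low-bit) integer levels.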

Multi-Object Sketch Animation by Scene Decomposition and Motion Planning

MoSketch is the first method to address multi-object sketch animation. It integrates four modules — LLM-based scene decomposition, LLM-based motion planning, a motion refinement network, and compositional SDS — under a divide-and-conquer strategy to tackle two core challenges: object-aware motion modeling and complex motion optimization. High-quality multi-object sketch animation is achieved without any training data.

OuroMamba: A Data-Free Quantization Framework for Vision Mamba

This paper proposes the first data-free post-training quantization (PTQ) framework for Vision Mamba Models (VMMs), which generates high-quality synthetic data via enhanced implicit attention and employs a mixed-precision quantization scheme with dynamic outlier detection. Under W4A4 settings, it significantly outperforms existing data-driven PTQ methods.

Partial Forward Blocking: A Novel Data Pruning Paradigm for Lossless Training Acceleration

This paper proposes Partial Forward Blocking (PFB), which computes sample importance at shallow layers during forward propagation and prunes low-importance samples by blocking their subsequent deep-layer forward passes. On ImageNet with 40% pruning, PFB achieves a 0.5% accuracy improvement and a 33% reduction in training time.
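The blocking step can be sketched as a cheap ranking over shallow features; the importance score used here (distance from the batch mean) is an illustrative stand-in, not the paper's metric:

```python
import numpy as np

def pfb_select(shallow_feats, prune_ratio=0.4):
    """Hypothetical PFB sketch: score each sample from shallow-layer
    features only, then block the lowest-scoring fraction of the batch
    from the expensive deep-layer forward pass."""
    score = np.linalg.norm(shallow_feats - shallow_feats.mean(axis=0), axis=1)
    n_keep = int(round(len(score) * (1 - prune_ratio)))
    keep = np.argsort(score)[::-1][:n_keep]  # highest-importance samples
    return np.sort(keep)

rng = np.random.default_rng(0)
feats = rng.normal(size=(100, 32))  # shallow-layer features for a batch
kept = pfb_select(feats, prune_ratio=0.4)
```

Only the `kept` indices continue through the deep layers, so the saved compute scales with the prune ratio times the cost of everything after the shallow scoring point.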

Perspective-Aware Teaching: Adapting Knowledge for Heterogeneous Distillation

This paper proposes PAT (Perspective-Aware Teaching), a framework that addresses the view mismatch problem across heterogeneous architectures via Region-Aware Attention (RAA) and the teacher unawareness problem via Adaptive Feedback Prompting (AFP), enabling feature-level distillation to comprehensively surpass logit-level methods in heterogeneous knowledge distillation for the first time.

PLAN: Proactive Low-Rank Allocation for Continual Learning

This paper proposes PLAN, a framework that proactively allocates orthogonal low-rank subspaces for each task and employs a perturbation-based strategy to minimize inter-task interference, enabling efficient, forgetting-free fine-tuning of large models in continual learning (CL) and establishing a new state of the art on standard CL benchmarks.

SAMO: A Lightweight Sharpness-Aware Approach for Multi-Task Optimization with Joint Global-Local Perturbation

This paper proposes SAMO, a lightweight sharpness-aware multi-task optimization method that mitigates task gradient conflicts via joint global-local perturbation, while substantially reducing computational overhead through zeroth-order gradient approximation and layer-wise normalization.

Scheduling Weight Transitions for Quantization-Aware Training

This paper identifies that conventional learning rate scheduling fails to control the effective step size of quantized weights in quantization-aware training (QAT), and proposes a Transition Rate (TR) scheduling technique that explicitly governs the number of discrete weight transitions via a Transition-Adaptive Learning Rate (TALR), substantially improving low-bit quantized model performance.

Soft Separation and Distillation: Toward Global Uniformity in Federated Unsupervised Learning

This paper proposes the Soft Separation and Distillation (SSD) framework, which addresses insufficient inter-client representation uniformity in federated unsupervised learning through two modules — Dimension Scaling Regularization (DSR) and Projector Distillation (PD) — significantly improving global representation quality without incurring additional communication overhead.

SSVQ: Unleashing the Potential of Vector Quantization with Sign-Splitting

This paper proposes Sign-Splitting Vector Quantization (SSVQ), which decouples the sign bits of weights from the codebook, introduces learnable sign parameters and an enhanced iterative freezing strategy, enabling each quantized weight to update independently along its own gradient direction during VQ fine-tuning. SSVQ significantly outperforms conventional VQ and scalar quantization under extreme compression ratios.
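The sign-splitting idea can be shown on scalar magnitudes for clarity (real VQ operates on vectors, and the learnable-sign and iterative-freezing parts of SSVQ are omitted here):

```python
import numpy as np

def ssvq_assign(w, codebook):
    """Store each weight's sign as its own bit and quantize only the
    magnitude against a non-negative codebook."""
    signs = np.sign(w)
    signs[signs == 0] = 1.0
    idx = np.argmin(np.abs(np.abs(w)[:, None] - codebook[None, :]), axis=1)
    return signs, idx

def ssvq_decode(signs, idx, codebook):
    return signs * codebook[idx]

rng = np.random.default_rng(0)
w = rng.normal(size=512)
# 8 magnitude levels spread over the empirical |w| distribution.
codebook = np.quantile(np.abs(w), np.linspace(0.05, 0.95, 8))
signs, idx = ssvq_assign(w, codebook)
w_hat = ssvq_decode(signs, idx, codebook)
```

Because the sign is stored exactly, no quantized weight can flip across zero, which is precisely the failure mode that hurts plain VQ at extreme compression ratios.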

StolenLoRA: Exploring LoRA Extraction Attacks via Synthetic Data

StolenLoRA is the first work to formulate model extraction attacks targeting LoRA-adapted models. It leverages LLM-driven Stable Diffusion to synthesize high-quality training data, eliminating the need for real training data, and designs a Disagreement-based Semi-supervised Learning (DSL) strategy that maximizes information gain through selective querying. With only 10k queries, StolenLoRA achieves an attack success rate (ASR) of up to 96.60%, exposing critical security vulnerabilities in LoRA-adapted models.

Task Vector Quantization for Memory-Efficient Model Merging

This paper proposes quantizing task vectors (the difference between fine-tuned and pre-trained weights) rather than the fine-tuned weights themselves. By exploiting the narrower numerical range of task vectors, the method achieves quantization down to 3-bit without accuracy loss. The paper further proposes Residual Task Vector Quantization (RTVQ), which decomposes task vectors into a shared high-precision base vector and low-precision per-task offsets, maintaining or even improving model merging performance while using only 8% of the original storage.
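A small numerical illustration of why the narrow range of task vectors matters (the residual decomposition of RTVQ is omitted; the uniform quantizer below is a generic sketch):

```python
import numpy as np

def quantize_uniform(v, n_bits):
    """Symmetric uniform quantizer: step size scales with max |v|."""
    qmax = 2 ** (n_bits - 1) - 1
    scale = np.abs(v).max() / qmax
    return np.clip(np.round(v / scale), -qmax - 1, qmax) * scale

rng = np.random.default_rng(0)
w_pre = rng.normal(scale=1.0, size=10000)   # pretrained weights (wide range)
tau = rng.normal(scale=0.02, size=10000)    # task vector (narrow range)
w_ft = w_pre + tau                          # fine-tuned weights

# At the same 3-bit budget, quantizing the task vector and re-adding the
# pretrained base is far more accurate than quantizing w_ft directly,
# because the quantization step scales with the (much smaller) value range.
err_task = np.abs((w_pre + quantize_uniform(tau, 3)) - w_ft).max()
err_full = np.abs(quantize_uniform(w_ft, 3) - w_ft).max()
```

The pretrained base is shared by all tasks, so per-task storage reduces to a few low-precision bits per weight.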

Time-Aware Auto White Balance in Mobile Photography

This paper proposes a lightweight illumination estimation method (~5K parameters) that leverages contextual metadata from mobile devices (timestamps and geolocation) alongside image color information. The method achieves performance on par with or superior to much larger models on a newly collected dataset of 3,224 smartphone images, and runs in under 0.25ms on a flagship mobile DSP.

TR-PTS: Task-Relevant Parameter and Token Selection for Efficient Tuning

This paper proposes TR-PTS, a framework that performs task-driven layer-wise parameter selection via the Fisher Information Matrix and dynamically filters/merges tokens using CLS attention scores. By tuning only 0.34%–0.60% of parameters, TR-PTS surpasses full fine-tuning by 3.40% on FGVC and 10.35% on VTAB.

UniConvNet: Expanding Effective Receptive Field while Maintaining Asymptotically Gaussian Distribution for ConvNets of Any Scale

This paper proposes UniConvNet, which employs a three-layer Receptive Field Aggregator (RFA) composed of moderately sized convolution kernels (7×7, 9×9, 11×11) to expand the Effective Receptive Field (ERF) while preserving its Asymptotically Gaussian Distribution (AGD), achieving consistent improvements over existing CNNs and ViTs across lightweight to large-scale model regimes.
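A rough 1-D numerical illustration (box kernels standing in for learned convolutions) of why stacking moderate kernels both widens the receptive field and keeps its profile Gaussian-like, by the central-limit effect:

```python
import numpy as np

# Composing 7-, 9-, and 11-tap kernels yields one equivalent kernel of
# size 7 + 9 + 11 - 2 = 25: a large receptive field is reached without
# any single giant kernel.
k7 = np.ones(7) / 7
k9 = np.ones(9) / 9
k11 = np.ones(11) / 11
composed = np.convolve(np.convolve(k7, k9), k11)
```

The composed kernel is symmetric and bell-shaped with its mass concentrated at the center, which mirrors the asymptotically Gaussian ERF profile the RFA design aims to preserve.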

Variance-Based Pruning for Accelerating and Compressing Trained Networks

This paper proposes Variance-Based Pruning (VBP), a one-shot structured pruning method that removes neurons with the smallest activation variance in MLP hidden layers and compensates their mean activations into the subsequent layer's bias. With only 10 epochs of fine-tuning, VBP recovers 99% of the original accuracy while reducing computation by 35% and parameters by 36%.
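The prune-and-compensate step can be sketched on one linear block; when a pruned unit is exactly constant, folding its mean into the next bias is lossless (a minimal sketch, not the full VBP pipeline):

```python
import numpy as np

def variance_prune(W2, b2, hidden, n_prune):
    """Drop the hidden units with the smallest activation variance and
    fold their mean activation into the next layer's bias, so the
    expected output is preserved without retraining."""
    var = hidden.var(axis=0)
    order = np.argsort(var)
    pruned, kept = order[:n_prune], np.sort(order[n_prune:])
    b2_new = b2 + W2[:, pruned] @ hidden[:, pruned].mean(axis=0)
    return W2[:, kept], b2_new, kept

rng = np.random.default_rng(0)
hidden = rng.normal(size=(256, 8))  # recorded hidden activations
hidden[:, 3] = 0.7                  # a constant unit: zero variance
W2 = rng.normal(size=(4, 8))
b2 = rng.normal(size=4)

W2p, b2p, kept = variance_prune(W2, b2, hidden, n_prune=1)
out_full = hidden @ W2.T + b2
out_pruned = hidden[:, kept] @ W2p.T + b2p
```

For low-variance (rather than exactly constant) units the compensation is approximate, which is why a short fine-tuning phase recovers the remaining accuracy.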

ViT-Linearizer: Distilling Quadratic Knowledge into Linear-Time Vision Models

This paper proposes ViT-Linearizer, a cross-architecture distillation framework that transfers the "quadratic knowledge" learned by ViT self-attention into linear-complexity recurrent models (Mamba-based Adventurer) via two core mechanisms: activation matching and masked prediction. The approach achieves 84.3% accuracy on ImageNet while delivering up to 4.2× inference speedup on high-resolution tasks.

VQ-SGen: A Vector Quantized Stroke Representation for Creative Sketch Generation

VQ-SGen treats each stroke as an independent entity and decouples its shape from positional information. By applying vector quantization (VQ), it constructs a compact discrete stroke codebook, and employs a cascaded autoregressive Transformer to sequentially generate semantic labels, shape codes, and position codes for each stroke. The method significantly outperforms existing approaches on the CreativeSketch dataset.