
📦 Model Compression

📷 CVPR2025 · 66 paper notes

📌 Same area in other venues: 📷 CVPR2026 (108) · 🔬 ICLR2026 (240) · 💬 ACL2026 (59) · 🧪 ICML2026 (117) · 🤖 AAAI2026 (60) · 🧠 NeurIPS2025 (143)

🔥 Top topics: Compression ×11 · Model Compression ×8 · Continual Learning ×4 · Knowledge Distillation ×3 · Few-/Zero-Shot Learning ×3

Adapter Merging with Centroid Prototype Mapping for Scalable Class-Incremental Learning: This paper proposes the ACMap framework, which incrementally averages and merges independently trained task adapters into a single adapter (maintaining \(O(1)\) inference complexity). Combined with centroid prototype mapping to align the representation of old task prototypes in the new subspace, it achieves comparable accuracy to the SOTA method EASE on five benchmarks while being 39 times faster in inference.
Alternating Gradient Flow Utility: A Unified Metric for Structural Pruning and Dynamic Routing in Deep Networks: A unified utility metric based on Alternating Gradient Flow (AGF) is proposed, utilizing feature-space total variation as a structural pruning metric. Combined with confidence-based cascade routing, this decouples offline topology construction from online dynamic inference. It avoids structural collapse caused by traditional metrics under extreme compression on ImageNet-1K, and matches the accuracy of the full model at 0.92x computational cost in dynamic inference on ImageNet-100.
An FPGA Implementation of Displacement Vector Search for Intra Pattern Copy in JPEG XS: This paper proposes the first FPGA architecture implementation for the displacement vector (DV) search module in JPEG XS Intra Pattern Copy (IPC). Utilizing a four-stage pipelined design and optimized memory organization, it achieves a throughput of 38.3 Mpixels/s and a power consumption of 277 mW on Xilinx Artix-7, laying the foundation for practical hardware deployment and ASIC transition of IPC.
ARCHE: Autoregressive Residual Compression with Hyperprior and Excitation: An end-to-end learned image compression framework, ARCHE, is proposed, which integrates hierarchical hyperprior, masked spatial autoregressive context, channel conditioning, and SE-excited channel recalibration into a unified probabilistic architecture. Without requiring Transformers or recurrent components, ARCHE reduces BD-Rate on Kodak by approximately 48% compared to the Ballé baseline and by about 5.6% compared to VVC Intra, with only 95M parameters and a 222ms decoding time.
AutoSSVH: Exploring Automated Frame Sampling for Efficient Self-Supervised Video Hashing: This paper proposes AutoSSVH, which selects the most challenging subset of frames as training signals through an adversarial automated frame sampling network (Grade-Net) and designs a Point-to-Set (P2Set) contrastive learning paradigm for hashing. It achieves efficient self-supervised video hashing retrieval and significantly outperforms existing methods on UCF101 and HMDB51.
BHViT: Binarized Hybrid Vision Transformer: To address the severe performance degradation in binarized ViTs, this paper proposes BHViT, a hybrid ViT architecture specifically designed for binarization. It features a multi-scale grouped dilated convolutional token mixer, quantization-decomposed attention matrix binarization, a shift-augmented MLP, and a regularization loss, achieving state-of-the-art performance for 1-bit binarized models on ImageNet-1K.
Binarized Mamba-Transformer for Lightweight Quad Bayer HybridEVS Demosaicing: This paper proposes BMTNet, a lightweight hybrid architecture that combines binarized Mamba and Swin Transformer blocks for demosaicing RAW images from Quad Bayer HybridEVS sensors. By keeping the core Selective Scan in full precision and incorporating global visual information to compensate for the accuracy lost to binarization, it significantly reduces computational complexity while maintaining high-quality demosaicing.
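To illustrate the incremental averaging mentioned in the ACMap entry above, here is a minimal sketch of folding independently trained task adapters into one merged adapter with a running mean, so only a single adapter is stored regardless of how many tasks have been seen. It assumes adapters are flat lists of floats and that merging is a plain uniform average; the function name `merge_adapter` is hypothetical, and the centroid prototype mapping step is not modeled.

```python
from typing import List


def merge_adapter(merged: List[float], new: List[float], task_index: int) -> List[float]:
    """Fold the adapter of task `task_index` (1-based) into the running average.

    Assumption: uniform averaging over tasks; this is an illustration of
    incremental averaging, not the paper's exact procedure.
    """
    if task_index < 1:
        raise ValueError("task_index must be 1-based")
    if len(merged) != len(new):
        raise ValueError("adapters must have the same number of parameters")
    # Running-mean update: mean_t = mean_{t-1} + (w_t - mean_{t-1}) / t
    return [m + (w - m) / task_index for m, w in zip(merged, new)]


# Three toy "adapters", each with two parameters.
adapters = [[1.0, 2.0], [3.0, 4.0], [5.0, 0.0]]
merged = [0.0, 0.0]
for t, adapter in enumerate(adapters, start=1):
    merged = merge_adapter(merged, adapter, t)

print(merged)  # element-wise mean of the three adapters
```

Because each update needs only the current merged adapter and the new one, earlier task adapters can be discarded after merging, which is what keeps storage and inference cost constant in the number of tasks.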

Chapter-Llama: Efficient Chaptering in Hour-Long Videos with LLMs

Charm: The Missing Piece in ViT Fine-Tuning for Image Aesthetic Assessment

CL-LoRA: Continual Low-Rank Adaptation for Rehearsal-Free Class-Incremental Learning: This paper proposes CL-LoRA, which designs a dual-adapter architecture (task-shared + task-specific LoRA). By combining knowledge distillation, gradient reassignment, and learnable block-wise weights, CL-LoRA achieves SOTA continual learning performance with only 0.3% trainable parameters.
CoA: Towards Real Image Dehazing via Compression-and-Adaptation: The Compression-and-Adaptation (CoA) framework is proposed for real-world image dehazing: a large-scale model is first trained on synthetic data, and then compressed and adapted to the real-world domain, balancing performance and deployment efficiency.
Curriculum Coarse-to-Fine Selection for High-IPC Dataset Distillation: This work proposes CCFS, which progressively selects suitable real samples from the original dataset via a curriculum learning framework to supplement distilled data. This addresses the incompatibility between distilled and real data in high-IPC scenarios and outperforms state-of-the-art methods on CIFAR-10/100 and Tiny-ImageNet by up to 6.6%.
Dataset Distillation with Neural Characteristic Function: A Minmax Perspective: The NCFM method is proposed, which reformulates dataset distillation as a minmax adversarial optimization problem by using the Neural Characteristic Function Difference (NCFD) parameterized by a neural network on the complex plane as a distribution distance metric. By aligning both phase (authenticity) and magnitude (diversity) information, it improves performance by up to 20.5% on ImageNet subsets while reducing GPU memory usage by over 300 times.
DELT: A Simple Diversity-driven EarlyLate Training for Dataset Distillation: Proposes the EarlyLate training strategy, which generates synthetic images of varying difficulty by starting different IPC sub-batches from distinct optimization points and running them for varying numbers of iterations. Within the batch-to-global matching framework, it significantly improves intra-class diversity while reducing computation time by 39.3%, achieving 66.1% accuracy on ImageNet-1K with IPC=50 (using ResNet-101, surpassing RDED by 4.9%).
DeRS: Towards Extremely Efficient Upcycled Mixture-of-Experts Models: This paper proposes the DeRS (Decompose-Replace-Synthesis) paradigm. Exploiting the extremely high similarity (cosine similarity >0.999) among experts in upcycled MoEs, DeRS decomposes \(N\) experts into one shared base weight plus \(N\) lightweight delta weights. Compressing the delta weights via sparsification, quantization, or low-rank representations reduces MoE layer parameters by 65% with no performance degradation, or reduces additional training parameters by a factor of 2270.
Distilling Long-tailed Datasets: This work presents the first systematic study on long-tailed dataset distillation. It reveals that existing methods degrade severely in long-tailed scenarios (even underperforming random selection) and proposes two strategies: Distribution-agnostic Matching (DAM) and Expert Decoupling (ED). The proposed method significantly outperforms existing approaches on CIFAR-10/100-LT and Tiny-ImageNet-LT (e.g., surpassing DATM by 19.7% at an imbalance factor of 100).
DKDM: Data-Free Knowledge Distillation for Diffusion Models with Any Architecture: This paper proposes the DKDM paradigm, achieving data-free knowledge distillation for diffusion models for the first time. By substituting the real data distribution with the reverse denoising process of a pre-trained teacher model, and incorporating a dynamic iterative distillation strategy to efficiently generate diverse training knowledge, it supports student models of any architecture, achieving generative performance comparable to or even better than data-driven training without any access to the original data.
DyCoke: Dynamic Compression of Tokens for Fast Video Large Language Models: DyCoke is a training-free dynamic visual token compression method with two stages. Temporal token merging eliminates 50-60% of cross-frame redundancy, and dynamic KV cache pruning retains only the most relevant tokens at each decoding step, reducing tokens by a further 70-90%. Together they bring the average number of tokens per frame in video LLMs down to 15, achieving a \(1.5\times\) speedup with comparable or slightly improved performance.
ECVC: Exploiting Non-Local Correlations in Multiple Frames for Contextual Video Compression: This paper proposes the ECVC video compression model, which captures non-local correlations among multiple reference frames through Multi-Frame Non-Local Contextual mining (MNLC) and Multi-Head Linear Cross-Attention (MHLCA). Combined with a Partial Cascaded Fine-Tuning Strategy (PCFS) to address the mismatch between training and testing sequence lengths, ECVC saves 10.5% and 11.5% bitrate under IP=32 and IP=-1 configurations compared to DCVC-FM, respectively.
EfficientViM: Efficient Vision Mamba with Hidden State Mixer based State Space Duality: EfficientViM is proposed, which shifts the channel mixing operations in the SSD layer from the token space (\(O(LD^2)\)) to the compressed hidden state space (\(O(ND^2)\), \(N \ll L\)). This achieves a 2x to 4x faster inference speed compared to existing Vision Mamba models while maintaining competitive accuracy (77.9% with 11,952 img/s for the M3 model on ImageNet-1K).
Embracing Collaboration Over Competition: Condensing Multiple Prompts for Visual In-Context Learning: Proposes Condenser, which condenses multiple Visual ICL prompt candidates into a single prompt via Patch-wise Cross-Attention, enabling multi-prompt collaboration instead of competitive selection. It achieves 46.63 mIoU with 16 input prompts (vs. 44.14 for a single prompt) on segmentation, detection, and colorization, while being \(15\times\) faster in inference than evaluating prompts one by one.
Emphasizing Discriminative Features for Dataset Distillation in Complex Scenarios: Proposes the EDF method to address the performance degradation of dataset distillation in complex scenarios (ImageNet subsets). It introduces Common Pattern Dropout (discarding parameter gradients of low-loss common patterns in trajectory matching) and Discriminative Area Enhancement (utilizing Grad-CAM to scale up gradients of discriminative regions), achieving lossless compression on datasets such as ImageMeow/ImageYellow with only 23% of the data.
Enhancing Dataset Distillation via Non-Critical Region Refinement: This paper proposes a three-stage framework, NRR-DD: using CAM to select low-confidence patches to initialize synthetic images, freezing critical regions while optimizing only non-critical regions to improve information density, and replacing 1000-dimensional soft labels with 2 distance values to achieve 500x storage compression. It achieves 46.1% accuracy on ImageNet-1K with IPC=10 (outperforming RDED by 25.7%), reducing soft label storage from 120GB to 0.2GB.
Expert Pyramid Tuning: Efficient Parameter Fine-Tuning for Expertise-Driven Task Allocation: Proposes Expert Pyramid Tuning (EPT), which introduces the concept of multi-scale feature pyramids from computer vision into LoRA-based MoE. By constructing experts with varying granularities through a shared meta-knowledge subspace and a deconvolutional pyramid projection mechanism, it achieves more efficient multi-task parameter fine-tuning.
Faster Parameter-Efficient Tuning with Token Redundancy Reduction (FPET): FPET adds a plug-and-play token redundancy reduction module to parameter-efficient tuning (PET). By merging approximately half of the tokens in the middle layer of Vision Transformers (ViTs) with a differentiable bipartite matching strategy, FPET achieves 20% faster inference than the original backbone and reduces GPU memory by around 40%, while maintaining accuracy comparable to state-of-the-art (SOTA) PET methods.
FIMA-Q: Post-Training Quantization for Vision Transformers by Fisher Information Matrix Approximation: Proposes FIMA-Q, which replaces the traditional diagonal approximation with a diagonal-plus-low-rank (DPLR) Fisher Information Matrix approximation to capture the impact of quantization errors on the output distribution more accurately, significantly outperforming existing methods in ultra-low-bit (3-bit) ViT quantization (ViT-B 77.63% vs QDrop 74.75%).
Gaze-LLE: Gaze Target Estimation via Large-Scale Learned Encoders: Gaze-LLE is proposed, a minimalist gaze target estimation framework based on a frozen DINOv2 encoder. With only ~2.8M trainable parameters (1-2 orders of magnitude fewer than previous methods), no auxiliary depth/pose models, and no independent head encoder, it achieves state-of-the-art (SOTA) performance (AUC 0.958) on benchmarks such as GazeFollow and VideoAttentionTarget, using only person position prompting and a lightweight transformer decoder.
Good, Cheap, and Fast: Overfitted Image Compression with Wasserstein Distortion: This paper applies Wasserstein Distortion (WD) as an optimization objective to the overfitted image codec C3. Combined with common randomness to achieve texture resampling, it achieves a visual quality-rate trade-off comparable to generative compression methods while maintaining extremely low decoding complexity (<1% MACs of HiFiC).
HiAP: A Multi-Granular Stochastic Auto-Pruning Framework for Vision Transformers: HiAP proposes a multi-granular automatic pruning framework. By deploying learnable Gumbel-Sigmoid gates at both macro (attention heads, FFN blocks) and micro (intra-head dimensions, FFN neurons) levels, the framework automatically discovers the optimal subnetwork within a single-stage end-to-end training process, eliminating the need for manual importance ranking or post-hoc thresholding.
HOT: Hadamard-based Optimized Training: The HOT method is proposed. By analyzing the differentiated sensitivity of different gradient paths (\(g_x\) for activation gradients and \(g_m\) for weight gradients) in backpropagation, Hadamard transform and quantization are selectively applied: \(g_x\) uses HT + INT4 to accelerate computation, while \(g_m\) uses HLA + INT8 to save activation memory. This achieves a 75% activation memory saving and 2.6x GPU acceleration, with only a 0.17% accuracy drop on ImageNet for ViT-B.
HyperLoRA: Parameter-Efficient Adaptive Generation for Portrait Synthesis: This work proposes HyperLoRA, a zero-shot personalized portrait generation method that directly generates LoRA weights through an adaptive network. It projects LoRA parameters into a low-dimensional linear space (1.2% of the original parameters), predicts combination coefficients from the input face using a perceiver resampler, and explicitly decomposes LoRA into ID-LoRA and Base-LoRA to decouple identity from irrelevant information. This achieves a balance among high fidelity, high editability, and fast inference.
Incremental Object Keypoint Learning (KAMP): This paper defines the Incremental Keypoint Learning (IKL) paradigm for the first time—where new tasks only label new keypoints and incremental training is conducted without retaining old data. The KAMP framework is proposed to model anatomical spatial relationships between old and new keypoints using a Knowledge Association Network (KA-Net), which is combined with a keypoint-guided spatial distillation loss. Across 4 datasets, the method not only effectively prevents forgetting but also achieves positive transfer to old keypoints (e.g., MPII AAA of 79.93% vs. 75.75% for LwF).
InsTaG: Learning Personalized 3D Talking Head from Few-Second Video: InsTaG is proposed to extract a general motion prior from multi-speaker long videos via Identity-Free Pre-training, and then rapidly learn high-fidelity personalized 3D talking heads from only a 5-second video via Motion-Aligned Adaptation, achieving 82.5 FPS real-time inference.
IterIS: Iterative Inference-Solving Alignment for LoRA Merging: IterIS is an iterative inference-solving method for LoRA merging. It directly extracts the input features of the unified adapter (rather than approximating them) to establish a more accurate optimization objective, uses regularization to reduce sample requirements to 1-5% of those of prior methods, and introduces adaptive weight balancing. It significantly outperforms baselines in LoRA merging across text-to-image diffusion models, vision-language models, and large language models.
JamMa: Ultra-lightweight Local Feature Matching with Joint Mamba: JamMa proposes an ultra-lightweight semi-dense feature matcher based on Joint Mamba. Through the JEGO scan-and-merge strategy, it achieves cross-view joint scanning, efficient four-way scanning, global receptive fields, and omnidirectional feature representation, achieving a superior performance-efficiency tradeoff compared to Transformer-based matchers with less than 50% of the parameters and FLOPs.
L-SWAG: Layer-Sample Wise Activation with Gradients for Zero-Shot NAS on Vision Transformers: This paper proposes the L-SWAG zero-cost proxy, which combines layer-wise gradient variance statistics (trainability) and activation pattern cardinality (expressivity). For the first time, it achieves a stable positive ranking correlation on ViT search spaces. Furthermore, it introduces the LIBRA-NAS ensemble algorithm to combine multiple proxy metrics, finding an architecture with a 17.0% test error rate on ImageNet-1k in just 0.1 GPU-day.
Layered Image Vectorization via Semantic Simplification: This paper proposes a progressive image vectorization method that utilizes the feature-average effect of Score Distillation Sampling (SDS) to generate a sequence of step-by-step simplified images. This sequence guides the layered reconstruction of vectors from macro semantic structures to fine details, outperforming existing methods significantly in visual fidelity, semantic alignment, and compact layered representation.
Learned Image Compression with Dictionary-based Entropy Model: Proposes a Dictionary-based Cross-Attention Entropy model (DCAE), which introduces a learnable dictionary to extract typical texture structure priors of natural images from the training dataset. Multi-scale feature aggregation and cross-attention yield accurate probability distribution estimation. With an encoding/decoding latency of only 193ms, it achieves BD-rates of -17.0%/-21.1%/-19.7% on Kodak/Tecnick/CLIC, outperforming state-of-the-art (SOTA) methods.
PrunNet: Learning Compatible Multi-Prize Subnetworks for Asymmetric Retrieval: The authors propose PrunNet (Prunable Network). By learning importance scores for each weight and incorporating conflict-aware gradient integration, PrunNet trains a unified model capable of generating compatible subnetworks at any capacity (20%-100%). It achieves a 46.29 mAP on GLDv2, surpassing the dense network baseline, while ensuring feature compatibility across subnetworks of all capacities.
Less is More: Efficient Model Merging with Binary Task Switch: Controlled experiments reveal that task vectors exhibit an "impulse characteristic"—only parameters with magnitudes exceeding a threshold make positive contributions to the task. Based on this, the T-Switch method is proposed to binarize task vectors into three components: activation switches, polarity switches, and scaling knobs, achieving dynamic model merging performance significantly superior to existing baselines while requiring only 1-3% of the storage space.
LALIC: Linear Attention Modeling for Learned Image Compression: This work introduces the RWKV linear attention mechanism to learned image compression for the first time. It designs a Bi-RWKV transform block to achieve global receptive field feature extraction with linear complexity. Combined with an RWKV spatial-channel-temporal context entropy model, it outperforms VTM-9.1 by 15.26% BD-rate with relatively low complexity.
Logits DeConfusion with CLIP for Few-Shot Learning: It is observed that CLIP suffers from severe inter-class confusion in logits under downstream tasks. To resolve this, this paper proposes Logits DeConfusion (LDC), which enhances feature representations through Multi-level Adapter Fusion (MAF) and integrates an Inter-Class Deconfusion (ICD) module to learn and eliminate confusion patterns via a residual architecture, achieving SOTA results across 11 benchmarks.
LoRA Subtraction for Drift-Resistant Space in Exemplar-Free Continual Learning: LoRA-DRS proposes a "LoRA subtraction" operation—subtracting the LoRA weights of old tasks from the pre-trained weights before learning a new task to construct a Drift-Resistant Space (DRS). Subsequently, the LoRA of the new task is trained via gradient projection within this space, combined with an augmented triplet loss to enhance plasticity. This approach achieves SOTA performance in exemplar-free continual learning, with a particularly significant advantage in long task sequences.
LSNet: See Large, Focus Small: Inspired by the dual-scale mechanism of human visual perception (peripheral for broad perception and foveal for fine aggregation), this paper proposes LS convolution (large-kernel depthwise convolution for perception + small-kernel dynamic convolution for aggregation) to build the LSNet lightweight network family, which outperforms existing SOTA lightweight models in the 0.3-1.3G FLOPs range.
Mamba-Adaptor: State Space Model Adaptor for Visual Recognition: Proposes Mamba-Adaptor to enhance Vision Mamba/SSM with two modules: Adaptor-T (temporal) preserves key historical states using a learnable memory selection mechanism, while Adaptor-S (spatial) enhances spatial locality with multi-scale dilated depthwise convolutions. It achieves 83.0% Top-1 accuracy on ImageNet (Mamba-Adaptor-b2) along with comprehensive improvements in detection, segmentation, and transfer learning.
MambaIC: State Space Models for High-Performance Learned Image Compression: Integrates SSM into both the non-linear transform and the context model of learned image compression for the first time. It enhances channel-spatial context modeling through the VSS block and eliminates spatial redundancy using window-based local attention, saving 12.52% BD-rate over VVC on the Kodak dataset, with even more pronounced advantages in high-resolution image compression.
Masking Meets Supervision: A Strong Learning Alliance: This work proposes Masked Sub-branch (MaskSub)—a generic framework that introduces high-ratio (50%) mask augmentation into supervised learning. By utilizing a self-distillation structure with a main branch (unmasked) and a sub-branch (masked), it addresses the training instability caused by strong mask augmentation. It consistently improves performance across various scenarios, including DeiT-III, MAE fine-tuning, CLIP fine-tuning, BERT training, as well as ResNet/Swin architectures.
MDP: Multidimensional Vision Model Pruning with Latency Constraint: MDP proposes a multidimensional pruning paradigm that models structured pruning at different granularities (such as channels, attention heads, Q/K/V, embedding dimensions, and entire blocks) as a Mixed Integer Non-Linear Programming (MINLP) problem. It jointly solves for the globally optimal pruning structure under strict latency constraints, significantly outperforming existing methods at high pruning ratios.
MobileMamba: Lightweight Multi-Receptive Visual Mamba Network: This work proposes MobileMamba, a lightweight visual network. Through a three-stage coarse-grained architectural design and the fine-grained MRFFI module (integrating Mamba global modeling, multi-kernel convolutional multi-scale perception, and Identity redundancy elimination), MobileMamba achieves an optimal balance between speed and accuracy on both classification and downstream high-resolution tasks.
Multi-modal Knowledge Distillation-based Human Trajectory Forecasting: This paper proposes the first multi-modal knowledge distillation framework for pedestrian trajectory prediction. A full-modal teacher model is trained using trajectories, human poses, and text descriptions, and its knowledge is distilled into a student model using only trajectories or trajectories and poses. This achieves up to approximately a 13% improvement in forecasting accuracy across three datasets: JRDB, SIT, and ETH-UCY.
MuTri: Multi-view Tri-alignment for OCT to OCTA 3D Image Translation: This paper proposes MuTri, which introduces vector quantization (VQ) to the 3D volumetric translation task from OCT to OCTA for the first time. Training has two stages. First, VQ-VAEs are pre-trained for OCT and OCTA reconstruction to provide multi-view priors. Then multi-view guidance, namely contrastive semantic alignment (3D OCT/OCTA views) and vessel structure alignment (2D OCTA projection view), guides the codebook learning of the translation VQ-VAE. The method outperforms state-of-the-art (SOTA) methods across three datasets.
MXNorm: Reusing MXFP block scales for efficient tensor normalisation: MXNorm proposes to reuse the block absmax already computed during MXFP quantization to approximate RMS, fusing normalization and MX quantization into a single statistics gathering operation. This achieves a drop-in replacement for RMSNorm, obtaining up to a 2.4× kernel speedup while maintaining training accuracy in Llama 3 8B pre-training.
Parameter Efficient Mamba Tuning via Projector-targeted Diagonal-centric Linear Transformation: This paper reveals that the projector, rather than the SSM, is the critical component for transfer learning in the Mamba architecture. Based on this finding, the authors propose ProDiaL, a method that indirectly fine-tunes frozen projector weights through a diagonal-centric linear transformation matrix. By training less than 1% of the parameters, ProDiaL outperforms LoRA/DoRA on downstream tasks across both vision and language Mamba models.
Plug-and-Play Versatile Compressed Video Enhancement: This paper proposes a codec-aware compressed video enhancement framework. By reusing compression priors such as compression factors, motion vectors, and partition maps from the bitstream, a single model adaptively enhances videos across various compression levels, while also serving as a plug-and-play module to assist multiple downstream vision tasks.
Q-DiT: Accurate Post-Training Quantization for Diffusion Transformers: This paper proposes Q-DiT, a post-training quantization method for Diffusion Transformers (DiTs). It automatically allocates quantization group sizes via evolutionary search and uses sample-wise dynamic activation quantization, achieving high-fidelity image/video generation under the W4A8 configuration.
QuartDepth: Post-Training Quantization for Real-Time Depth Estimation on the Edge: This paper proposes QuartDepth, a post-training quantization framework for ASIC edge devices. By employing LogNP activation polishing (transforming abnormally distributed activation values into quantization-friendly distributions), activation quantization compensation (updating weights to compensate for activation quantization errors), and Fisher information-guided weight reconstruction, depth estimation foundation models are quantized to W4A4/W4A8. A programmable hardware accelerator is also designed to achieve real-time inference.
Sampling Innovation-Based Adaptive Compressive Sensing: The SIB-ACS framework is proposed, which guides multi-stage adaptive sampling allocation through a "sampling innovation" criterion (measuring the reduction in reconstruction error brought by sampling increments) and designs a Principal Component Compressed Domain Network (PCCD-Net) for high-fidelity image reconstruction, significantly surpassing SOTA compressive sensing methods.
Sketch Down the FLOPs: Towards Efficient Networks for Human Sketch: This paper is the first to design an efficient inference network tailored specifically to the unique properties of human sketch data. By introducing cross-modal knowledge distillation (SketchyNetV1), a large network is compressed into a lightweight network while maintaining FG-SBIR accuracy. Furthermore, an RL-driven adaptive canvas size selector (SketchyNetV2) is developed to leverage the sparse and abstract nature of sketches, reducing FLOPs even further. Ultimately, this achieves a 99.37% reduction in FLOPs (40.18G \(\rightarrow\) 0.254G) with negligible loss in accuracy.
Style Quantization for Data-Efficient GAN Training: SQ-GAN enhances the effectiveness of discriminator consistency regularization under limited data by discretely quantizing the intermediate style space of StyleGAN into a learnable codebook, compressing the sparse continuous latent space into a compact and structured discrete proxy space. By utilizing CLIP embeddings and optimal transport distance to initialize the codebook, external semantic knowledge is injected into the codebook, significantly improving the generation quality of few-shot GANs.
TADFormer: Task-Adaptive Dynamic Transformer for Efficient Multi-Task Learning: TADFormer proposes a parameter-efficient fine-tuning framework for multi-task learning. By dynamically extracting fine-grained task features according to the input context through a Dynamic Task Filter (DTF), combined with task-prompt conditional operations and cross-task interactions, it achieves superior accuracy on PASCAL-Context with 8.4x fewer parameters than full fine-tuning.
Task Singular Vectors: Reducing Task Interference in Model Merging: The Task Singular Vectors (TSV) framework is proposed to analyze and resolve task interference in model merging within the SVD space of layer-wise task matrices. TSV-Compress compresses task vectors to 10% of their size while maintaining 99% accuracy, and TSV-Merge decorrelates singular vectors of different tasks via a whitening transformation, outperforming existing methods by an average of approximately 15 percentage points across 8/14/20 task merging scenarios.
Towards Practical Real-Time Neural Video Compression: This paper proposes DCVC-RT, the first neural video codec to achieve 1080p real-time encoding and decoding on consumer-grade hardware with compression efficiency surpassing H.266/VTM. The core finding is that operational complexity (rather than computational complexity) acts as the actual speed bottleneck. Based on this, implicit temporal modeling and a single-scale low-resolution latent representation are designed, achieving encoding/decoding speeds of 125/113 fps on an A100 GPU while saving 21% bitrate.
Tripartite Weight-Space Ensemble for Few-Shot Class-Incremental Learning: This paper proposes the Tri-WE method, which updates the entire model (instead of freezing the feature extractor) by interpolating three classification heads—base, previous session, and current session—in the weight space, and mitigates forgetting in few-shot scenarios using Amplified Data Knowledge Distillation (ADKD), achieving SOTA performance in FSCIL on miniImageNet/CUB200/CIFAR100.
Understanding Multi-layered Transmission Matrices: This work analyzes the theoretical foundation of multi-layered transmission matrix approximation from a frequency domain perspective. It reveals that the "missing cone" problem in microscopy unexpectedly becomes an advantage in wavefront shaping scenarios, proving that a small number of SLM layers can achieve effective scattering correction within a limited field of view.
WAVE: Weight Templates for Adaptive Initialization of Variable-sized Models: WAVE reformulates the initialization of variable-sized models as a multi-task learning problem. It achieves efficient initialization by combining shared size-agnostic weight templates with lightweight size-specific weight scalers via Kronecker products. Using only 3.3% of the pre-trained parameters, models initialized with WAVE and trained for just 10 epochs outperform models trained for 150 epochs.
What Makes a Good Dataset for Knowledge Distillation?: This work systematically investigates the fundamental question of "what data is effective" in knowledge distillation (KD). It finds that even non-natural synthetic images generated by OpenGL shaders can perform KD effectively. It also concludes that a good distillation dataset should meet several criteria: uniform distribution of teacher predictions, sufficient coverage of the decision space, high data diversity, and containment of decision boundary information.
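The decomposition described in the DeRS entry above (N experts stored as one shared base weight plus N lightweight deltas) can be illustrated with a toy sketch. Everything here is an assumption made for illustration: weights are flat lists of floats, the shared base is taken to be the element-wise mean of the experts (the summary does not say how the base is chosen), and the deltas are compressed by keeping only the largest-magnitude entries, which is just one possible form of the sparsification option. The function names `decompose`, `sparsify`, and `synthesize` are hypothetical.

```python
from typing import List, Tuple


def decompose(experts: List[List[float]]) -> Tuple[List[float], List[List[float]]]:
    """Split experts into a shared base (assumed: element-wise mean) and per-expert deltas."""
    n = len(experts)
    base = [sum(column) / n for column in zip(*experts)]
    deltas = [[w - b for w, b in zip(expert, base)] for expert in experts]
    return base, deltas


def sparsify(delta: List[float], keep: int) -> List[float]:
    """Keep the `keep` largest-magnitude entries of a delta and zero the rest."""
    order = sorted(range(len(delta)), key=lambda i: abs(delta[i]), reverse=True)
    kept = set(order[:keep])
    return [d if i in kept else 0.0 for i, d in enumerate(delta)]


def synthesize(base: List[float], delta: List[float]) -> List[float]:
    """Rebuild an expert's weights from the shared base and its (compressed) delta."""
    return [b + d for b, d in zip(base, delta)]


# Two nearly identical toy experts, differing only in the last weight.
experts = [[1.0, 2.0, 4.0, 8.0], [1.0, 2.0, 4.0, 9.0]]
base, deltas = decompose(experts)
sparse = sparsify(deltas[1], keep=1)
restored = synthesize(base, sparse)

print(base)      # shared base weight
print(sparse)    # compressed delta for the second expert
print(restored)  # reconstructed second expert
```

When experts are as similar as the entry reports, most delta entries are near zero, so a heavily compressed delta can still reconstruct each expert closely; in this toy case a single retained entry recovers the second expert exactly.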