📦 Model Compression

📷 CVPR2026 · 57 paper notes

4D-RGPT: Toward Region-level 4D Understanding via Perceptual Distillation

This paper proposes 4D-RGPT and the Perceptual 4D Distillation (P4D) framework, which enhances 4D perception in MLLMs by distilling knowledge of depth and optical flow from frozen 4D perceptual expert models. It also introduces R4D-Bench, the first region-level 4D video question-answering benchmark.

A Paradigm Shift: Fully End-to-End Training for Temporal Sentence Grounding in Videos

This paper proposes the first fully end-to-end framework for Temporal Sentence Grounding in Videos (TSGV). A Sentence-Conditioned Adapter (SCADA) is introduced to inject sentence embeddings into intermediate layers of the video backbone, dynamically modulating visual features. Combined with a video-centric learning strategy to accelerate training, the method surpasses state-of-the-art performance on Charades-STA and ActivityNet.

Adversarial Concept Distillation for One-Step Diffusion Personalization

OPAD is the first work to address personalization for one-step diffusion models (1-SDP). It achieves high-quality single-step concept generation via joint teacher–student training, an alignment loss, and adversarial supervision, and further introduces a collaborative learning stage in which samples generated by the student are fed back to improve the teacher.

An FPGA Implementation of Displacement Vector Search for Intra Pattern Copy in JPEG XS

This paper presents the first FPGA hardware acceleration architecture for the Intra Pattern Copy (IPC) tool in the JPEG XS standard. Through a four-stage pipelined DV comparison engine and IPC Group-aligned memory organization, the design achieves 38.3 Mpixels/s throughput and 277 mW power consumption on an Artix-7 FPGA.

ARCHE: Autoregressive Residual Compression with Hyperprior and Excitation

ARCHE is a fully convolutional end-to-end image compression architecture that unifies a hierarchical hyperprior, Masked PixelCNN spatial autoregression, channel-conditional modeling, SE channel excitation, and latent residual prediction — without relying on Transformers or recurrent components. With 95M parameters and a 222 ms decoding time, it achieves a 48% BD-Rate reduction over the Ballé baseline on Kodak and outperforms VVC Intra by 5.6%.

Batch Loss Score for Dynamic Data Pruning

This paper proposes Batch Loss Score (BLS), a method that estimates sample importance using only the mean batch loss — which is universally available — rather than per-sample loss, which is difficult to obtain in practice. Grounded in a signal-processing perspective via EMA-based low-pass filtering, BLS offers theoretical guarantees and can be integrated into existing dynamic pruning frameworks with just 3 lines of code.
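The EMA-based low-pass filtering the summary mentions can be illustrated with a small sketch (the function name and the 0.9 smoothing factor are assumptions for illustration, not the paper's settings):

```python
# Hypothetical sketch of EMA low-pass filtering over mean batch losses,
# in the spirit of BLS. Names and constants are assumed, not from the paper.

def ema_filter(batch_losses, alpha=0.9):
    """Low-pass filter a sequence of mean batch losses via an EMA."""
    smoothed, ema = [], batch_losses[0]
    for loss in batch_losses:
        ema = alpha * ema + (1 - alpha) * loss  # EMA update: keep slow trend
        smoothed.append(ema)
    return smoothed

losses = [1.0, 0.8, 1.2, 0.5, 0.9]
print(ema_filter(losses))  # high-frequency noise is attenuated
```

The filtered trend can then serve as an importance signal without ever touching per-sample losses.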

Beyond Loss Values: Robust Dynamic Pruning via Loss Trajectory Alignment

This paper proposes AlignPrune—a plug-and-play module based on loss trajectory alignment—that replaces conventional loss-value ranking with a Dynamic Alignment Score (DAS), achieving up to 6.3% accuracy improvement over standard dynamic data pruning methods under noisy label settings.

Bilevel Layer-Positioning LoRA for Real Image Dehazing

This paper proposes BiLaLoRA, which employs bilevel optimization to automatically identify the optimal network layers for LoRA insertion, coupled with H2C Loss — an unsupervised dehazing loss based on CLIP semantic directions — to efficiently adapt synthetic-data-pretrained dehazing models to real-world scenes. The approach reduces training time by 77.7% while matching full fine-tuning performance, and generalizes across models and domains.

BinaryAttention: One-Bit QK-Attention for Vision and Diffusion Transformers

This paper proposes BinaryAttention, which quantizes Query and Key in Transformer attention to 1-bit binary representations and replaces floating-point dot products with XNOR + popcount bitwise operations, achieving over 2× speedup over FlashAttention2 on A100 GPUs while matching or surpassing full-precision attention across vision classification, detection, segmentation, and diffusion generation tasks.
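The XNOR + popcount identity behind the speedup is easy to see in a standalone sketch (pure-Python illustration, not the paper's CUDA kernel): for two {−1, +1} vectors, matches minus mismatches equals d − 2·popcount(XOR).

```python
# Minimal sketch of the 1-bit QK dot product via XOR + popcount.
# Standalone illustration of the identity, not the paper's kernel.

def binarize(vec):
    """Pack signs of a float vector into an integer bitmask (bit set = positive)."""
    bits = 0
    for i, v in enumerate(vec):
        if v > 0:
            bits |= 1 << i
    return bits

def binary_dot(q_bits, k_bits, d):
    """Dot product of two {-1,+1}^d vectors from their bitmasks:
    dot = matches - mismatches = d - 2 * popcount(q XOR k)."""
    mismatches = bin(q_bits ^ k_bits).count("1")
    return d - 2 * mismatches

q = [0.3, -1.2, 0.7, -0.1]
k = [0.5, 0.4, -0.2, -0.9]
print(binary_dot(binarize(q), binarize(k), len(q)))  # → 0, same as sign(q)·sign(k)
```

On hardware, the XOR and popcount map directly to single instructions, which is where the advantage over floating-point multiply–accumulate comes from.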

Critical Patch-Aware Sparse Prompting with Decoupled Training for Continual Learning on the Edge

This paper proposes CPS-Prompt, a framework that combines task-aware critical patch sampling (CPS) and decoupled prompt-classifier training (DPCT) to achieve approximately 1.6× reduction in training-time memory and computation for prompt-based continual learning on edge devices, with only ~2% accuracy degradation.

DAGE: Dual-Stream Architecture for Efficient and Fine-Grained Geometry Estimation

This paper proposes DAGE, a dual-stream Transformer architecture that decouples global consistency modeling (low-resolution stream) from fine-grained detail preservation (high-resolution stream), fusing them via a lightweight Cross-Attention Adapter. DAGE achieves high-quality depth/point map estimation and camera pose prediction at 2K resolution and over 1000-frame sequences, running 2–28× faster than Pi3 and establishing a new state of the art on video geometry estimation.

Distilling Balanced Knowledge from a Biased Teacher

To address the head-class bias of teacher models in knowledge distillation under long-tailed distributions, this paper decomposes the conventional KL divergence loss into a cross-group component and a within-group component. By rebalancing the cross-group loss to calibrate the teacher's group-level predictions and reweighting the within-group loss to ensure equal contribution across groups, the proposed method consistently outperforms existing approaches on CIFAR-100-LT, TinyImageNet-LT, and ImageNet-LT — and even surpasses the teacher model itself.
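The decomposition described above is the KL chain rule: total KL splits exactly into a cross-group term over group marginals plus a weighted within-group term. A toy check (the two-way head/tail grouping and distributions are made up for illustration):

```python
import math

# Sketch of splitting KL(teacher || student) into cross-group and
# within-group components. Toy distributions and grouping are assumptions.

def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def grouped_kl(p, q, groups):
    """Return (cross_group, within_group); the two terms sum to KL(p||q)."""
    pg = {g: sum(p[i] for i in idx) for g, idx in groups.items()}
    qg = {g: sum(q[i] for i in idx) for g, idx in groups.items()}
    cross = sum(pg[g] * math.log(pg[g] / qg[g]) for g in groups if pg[g] > 0)
    within = sum(
        pg[g] * kl([p[i] / pg[g] for i in idx], [q[i] / qg[g] for i in idx])
        for g, idx in groups.items() if pg[g] > 0
    )
    return cross, within

p = [0.5, 0.3, 0.15, 0.05]        # teacher: biased toward head classes
q = [0.25, 0.25, 0.25, 0.25]      # student: uniform
groups = {"head": [0, 1], "tail": [2, 3]}
cross, within = grouped_kl(p, q, groups)
assert abs(cross + within - kl(p, q)) < 1e-9  # exact decomposition
```

Because the split is exact, the two terms can be rebalanced and reweighted independently, which is what the method exploits.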

DualReg: Dual-Space Filtering and Reinforcement for Rigid Registration

DualReg proposes a dual-space registration paradigm that progressively filters feature-space correspondences via lightweight 1-point RANSAC followed by 3-point RANSAC, then constructs geometric proxy point sets from the filtered anchor correspondences for joint dual-space optimization. The method achieves state-of-the-art accuracy on 3DMatch while running 32× faster than MAC.

Enhancing Mixture-of-Experts Specialization via Cluster-Aware Upcycling

This paper proposes Cluster-aware Upcycling, which extracts semantic structure from a dense model via spherical k-means clustering to initialize expert and router parameters in MoE, thereby breaking expert symmetry and promoting early specialization. Combined with an Expert Ensemble Self-Distillation (EESD) loss, the method consistently outperforms existing upcycling approaches on CLIP ViT benchmarks.

FAAR: Efficient Frequency-Aware Multi-Task Fine-Tuning via Automatic Rank Selection

This paper proposes FAAR, a frequency-aware parameter-efficient fine-tuning method for multi-task learning. It introduces Performance-Driven Rank Shrinking (PDRS) to dynamically select the optimal rank for each task and layer, and designs a Task-Spectral Pyramidal Decoder (TS-PD) that leverages FFT frequency information to enhance spatial awareness and cross-task consistency. FAAR achieves superior performance using only 1/9 the parameters of full fine-tuning.

FAIR-Pruner: Leveraging Tolerance of Difference for Flexible Automatic Layer-Wise Neural Network Pruning

FAIR-Pruner is a structured pruning framework that introduces the Tolerance of Differences (ToD) metric to reconcile two complementary perspectives: the Wasserstein Utilization Score (U-Score), which identifies redundant units based on class-conditional separability, and the Taylor-based Reconstruction Score (R-Score), which protects critical units. The framework automatically determines non-uniform per-layer pruning ratios and supports search-free flexible compression ratio adjustment, achieving state-of-the-art results on CIFAR-10, SVHN, and ImageNet.

Fixed Anchors Are Not Enough: Dynamic Retrieval and Persistent Homology for Dataset Distillation

RETA decouples two failure modes in residual matching for dataset distillation—the fit-complexity gap and the pull-to-anchor effect—by employing Dynamic Retrieval Connection (DRC) to adaptively select real patch anchors and Persistent Topology Alignment (PTA) to preserve intra-class diversity. The method achieves 64.3% (+3.1% vs. FADRM) on ImageNet-1K with ResNet-18 at IPC=50.

FlashVGGT: Efficient and Scalable Visual Geometry Transformers with Compressed Descriptor Attention

By replacing the global self-attention in VGGT with descriptor-based cross-attention, FlashVGGT reduces inference time on 1000 images to 9.3% of VGGT while maintaining competitive reconstruction accuracy, and scales to sequences of 3000+ images.

FOZO: Forward-Only Zeroth-Order Prompt Optimization for Test-Time Adaptation

This paper proposes FOZO, a forward-only zeroth-order prompt optimization paradigm that updates prompts via SPSA gradient estimation, a dynamic perturbation strategy, and shallow–deep feature statistics alignment—without modifying model weights. FOZO achieves 59.52% accuracy on ImageNet-C, surpassing all forward-only methods including FOA (58.13%), and supports INT8 quantized models.

Frequency Switching Mechanism for Parameter-Efficient Multi-Task Learning

Free Sinewich proposes a parameter-efficient multi-task learning framework based on frequency switching. By applying task-specific sinusoidal transformations \(M_t = \sin(\omega_t \cdot M_{AWB})\) to a shared low-rank base matrix, the method achieves genuine parameter reuse and task specialization at near-zero cost, attaining state-of-the-art performance on dense prediction benchmarks with the fewest trainable parameters.
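The switching mechanism amounts to an elementwise sinusoid applied to one shared matrix, as in this toy sketch (matrix values and frequencies are invented for illustration):

```python
import math

# Toy illustration of frequency switching: each task reuses one shared
# low-rank matrix through M_t = sin(omega_t * M_shared).
# Values and omegas below are assumptions, not from the paper.

def task_matrix(shared, omega):
    return [[math.sin(omega * v) for v in row] for row in shared]

shared = [[0.2, -0.5], [1.0, 0.3]]
seg_task = task_matrix(shared, omega=1.0)    # task 1's view of the weights
depth_task = task_matrix(shared, omega=3.0)  # task 2: same storage, new frequency
print(seg_task[0][0], depth_task[0][0])
```

Only the scalar frequencies \(\omega_t\) are task-specific, so adding a task costs essentially nothing in parameters.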

From Fewer Samples to Fewer Bits: Reframing Dataset Distillation as Joint Optimization of Precision and Compactness

This paper proposes QuADD, a framework that embeds a differentiable quantization module into the dataset distillation loop to jointly optimize synthetic data and quantization parameters, achieving Pareto-optimal compression of "fewer samples + lower precision" under a fixed bit budget.

Generative Video Compression with One-Dimensional Latent Representation

This paper proposes GVC1D, which for the first time replaces the 2D grid latent representation in video compression with a compact 1D token sequence. Combined with a 1D memory module for modeling long-term temporal context, GVC1D achieves over 60% bitrate savings on perceptual quality metrics.

GeoFusion-CAD: Structure-Aware Diffusion with Geometric State Space for Parametric 3D Design

This paper proposes GeoFusion-CAD, an end-to-end diffusion framework that encodes CAD programs as hierarchical tree structures and introduces a geometry-aware G-Mamba block with linear time complexity to replace quadratic-complexity Transformers, enabling scalable and structure-aware generation of long-sequence parametric CAD programs. The method substantially outperforms Transformer-based approaches on the newly constructed DeepCAD-240 benchmark (up to 240-step commands).

HiAP: A Multi-Granular Stochastic Auto-Pruning Framework for Vision Transformers

HiAP formulates ViT pruning as an end-to-end budget-aware learning problem, applying stochastic differentiable Gumbel-Sigmoid gating simultaneously at two granularities—entire heads/blocks (macro) and intra-head value dimensions/FFN neurons (micro)—to automatically discover a compact dense subnetwork satisfying a compute budget within a single training run, eliminating the need for importance ranking, threshold search, and separate fine-tuning.

HierAmp: Coarse-to-Fine Autoregressive Amplification for Generative Dataset Distillation

This paper proposes HierAmp, which injects learnable class tokens at each scale of the coarse-to-fine generation process of a Visual AutoRegressive (VAR) model to identify semantically salient regions, and amplifies attention to these regions via positive logit biasing. This enables distilled data to acquire richer and more diverse layouts at coarse scales while focusing on class-relevant details at fine scales, achieving state-of-the-art performance across multiple dataset distillation benchmarks.

Towards Generalizable AI-Generated Image Detection via Image-Adaptive Prompt Learning

This paper proposes IAPL (Image-Adaptive Prompt Learning), which introduces dynamic prompts at the input of a CLIP encoder. These prompts are generated via two complementary pathways: a Conditional Information Learner (extracting forgery-specific and generic cues from texture-rich regions) and test-time token tuning (minimizing entropy through multi-view consistency). The model adapts to each test image at inference time, attaining state-of-the-art average accuracies of 95.61% and 96.7% on UniversalFakeDetect and GenImage and markedly better detection generalization on unseen generators.

Learning through Creation: A Hash-Free Framework for On-the-Fly Category Discovery

This paper proposes the LTC framework, which leverages MKEE (Minimize Kernel Energy + Maximize Entropy) to online-generate pseudo-unknown class samples during training. Combined with a dual max-margin loss and an adaptive threshold mechanism, LTC achieves 1.5%–13.1% all-class accuracy improvements across 7 datasets, entirely eliminating the semantic degradation caused by hash encoding.

LLaVA-LE: Large Language-and-Vision Assistant for Lunar Exploration

LLaVA-LE is the first vision-language model tailored for lunar exploration. By constructing LUCID, a large-scale real lunar image-text dataset (96K images + 81K QA pairs), and applying two-stage curriculum fine-tuning on LLaVA, the model achieves a 3.3× improvement over the baseline on lunar geological understanding and multimodal reasoning.

MaMe & MaRe: Matrix-Based Token Merging and Restoration for Efficient Visual Perception and Synthesis

This paper proposes MaMe, a training-free and differentiable token merging method based on fully matrix-based operations, along with its inverse operation MaRe for token restoration, achieving efficient acceleration with minimal performance degradation across image classification, video recognition, and image generation tasks.

Markovian Scale Prediction: A New Era of Visual Autoregressive Generation

This work reformulates the visual autoregressive model (VAR) from a full-context-dependent next-scale prediction paradigm into a Markovian scale prediction process. By introducing a sliding-window history compensation mechanism for non-full-context modeling, the method achieves a 10.5% FID reduction and 83.8% peak memory reduction on ImageNet.

MARVO: Marine-Adaptive Radiance-aware Visual Odometry

MARVO is an underwater visual odometry framework that embeds a Physics-Aware Radiance Adapter (PARA) into the LoFTR feature matcher to compensate for wavelength-dependent attenuation, integrates GTSAM multi-sensor factor graph fusion, and employs reinforcement learning-based pose graph optimization (RL-PGO), achieving robust localization in underwater scenes.

MEMO: Human-like Crisp Edge Detection Using Masked Edge Prediction

This paper proposes MEMO, a framework that generates crisp single-pixel edge maps using only cross-entropy loss, achieved through masked edge training and a confidence-ordered progressive inference strategy. MEMO substantially outperforms prior methods on crispness-aware evaluation (CEval ODS on BSDS improves from 0.749 to 0.836).

Memory-Efficient Transfer Learning with Fading Side Networks via Masked Dual Path Distillation

MDPD proposes an efficient fine-tuning framework based on bidirectional knowledge distillation between a frozen backbone and a lightweight side network. Upon training completion, the side network is discarded, achieving both parameter- and memory-efficient training as well as inference-time acceleration.

On the Robustness of Diffusion-Based Image Compression to Bit-Flip Errors

This paper presents the first systematic study of bit-flip robustness in diffusion-based image compression. It demonstrates that reverse channel coding (RCC)-based diffusion compression methods are inherently more resilient to bit-flip errors than traditional and learned codecs, and proposes Robust Turbo-DDCM, which independently encodes each atom index to further enhance robustness. At BER \(10^{-3}\), the proposed method maintains high reconstruction quality with only a marginal increase in BPP.
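The bit-flip channel used in such robustness studies is simple to simulate: each bit of the coded stream is flipped independently with probability BER (a generic utility, not code from the paper).

```python
import random

# Generic simulation of a binary symmetric channel: each bit is flipped
# independently with probability `ber`. Not code from the paper.

def flip_bits(bitstream, ber, seed=0):
    rng = random.Random(seed)
    return [b ^ (rng.random() < ber) for b in bitstream]

bits = [1, 0, 1, 1, 0, 0, 1, 0] * 1000
corrupted = flip_bits(bits, ber=1e-3)
print(sum(b != c for b, c in zip(bits, corrupted)), "bits flipped out of", len(bits))
```

At BER \(10^{-3}\) roughly one in a thousand bits is corrupted, which is enough to break entropy-coded streams in conventional codecs but, per the paper, degrades RCC-based diffusion compression far more gracefully.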

Parallax to Align Them All: An OmniParallax Attention Mechanism for Distributed Multi-View Image Compression

This paper proposes the OmniParallax Attention Mechanism (OPAM) for Distributed Multi-view Image Compression (DMIC), which explicitly models inter-view correlations and aligned features between arbitrary view pairs via a two-stage parallax attention mechanism. The resulting ParaHydra framework is the first DMIC method to significantly outperform state-of-the-art MIC encoders while substantially reducing computational overhead.

PlanaReLoc: Camera Relocalization in 3D Planar Primitives via Region-Based Structure Matching

PlanaReLoc is the first camera relocalization paradigm centered on planar primitives and 3D planar maps. A deep matcher associates planar regions extracted from query images with map planar primitives in a unified embedding space, achieving lightweight 6-DoF camera relocalization without requiring textured maps, pose priors, or per-scene training.

Planning in 8 Tokens: A Compact Discrete Tokenizer for Latent World Model

This paper proposes CompACT, which compresses each image into only 8 discrete tokens (~128 bits) by leveraging a frozen pretrained visual encoder to retain planning-critical semantic information, while a generative decoder supplements perceptual details. This achieves approximately 40× speedup in world-model-based planning with no loss in accuracy.

PPCL: Pluggable Pruning with Contiguous Layer Distillation for Diffusion Transformers

This paper proposes PPCL, a structured pruning framework tailored for large-scale Multi-Modal Diffusion Transformers (MMDiT, 8–20B parameters). It trains linear probes to assess the substitutability of each layer, automatically localizes contiguous redundant layer intervals via first-order differences of CKA, and applies non-sequential alternating distillation for dual-axis pruning along depth and width. On Qwen-Image 20B, PPCL achieves a 50% parameter reduction and 1.8× inference speedup with an average performance drop of only 2.61%.

Preference-Aligned LoRA Merging: Preserving Subspace Coverage and Addressing Directional Anisotropy

This paper revisits the LoRA merging problem through two complementary lenses—subspace coverage and directional anisotropy—and proposes the TARA-Merging framework. By retaining LoRA directions to preserve subspace coverage and applying preference-weighted cross-entropy pseudo-loss for direction-level reweighting, TARA consistently outperforms existing merging methods across 8 vision and 6 NLI benchmarks.

PriVi: Towards a General-Purpose Video Model for Primate Behavior in the Wild

PriVi constructs a large-scale primate video pretraining dataset of 424 hours and performs domain-level pretraining (rather than dataset-level pretraining) on V-JEPA. This work is the first to demonstrate that domain-level pretraining of video models generalizes across datasets, surpassing fully fine-tuned specialized models on four primate behavior recognition benchmarks using a frozen classifier with only 220K parameters.

QuantVLA: Scale-Calibrated Post-Training Quantization for Vision-Language-Action Models

This paper proposes QuantVLA, the first training-free post-training quantization (PTQ) framework for Vision-Language-Action (VLA) models. Through a selective quantization layout and two lightweight calibration mechanisms—Attention Temperature Matching (ATM) and Output Head Balancing (OHB)—QuantVLA achieves approximately 70% memory reduction at W4A8 precision while surpassing the task success rate of the full-precision baseline.

RDVQ: Differentiable Vector Quantization for Rate-Distortion Optimization of Generative Image Compression

RDVQ introduces a differentiable relaxation over the codebook distribution, enabling for the first time end-to-end rate-distortion joint optimization for VQ-based image compression. At extremely low bitrates, the method achieves superior or competitive perceptual quality with less than 20% of the parameters of prior approaches.
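One common way to realize a differentiable relaxation over a codebook is a temperature-controlled softmax over codeword distances; a generic sketch is below (RDVQ's exact relaxation may differ — the codebook, temperature, and inputs here are invented):

```python
import math

# Generic softmax-over-distances relaxation of vector quantization.
# A sketch of the general idea only; not RDVQ's actual formulation.

def soft_quantize(z, codebook, tau=0.1):
    """Softly assign latent z to codewords; tau -> 0 recovers hard VQ."""
    d = [sum((zi - ci) ** 2 for zi, ci in zip(z, c)) for c in codebook]
    logits = [-di / tau for di in d]
    m = max(logits)                       # subtract max for numerical stability
    w = [math.exp(l - m) for l in logits]
    s = sum(w)
    w = [wi / s for wi in w]              # differentiable assignment probabilities
    return [sum(wi * c[j] for wi, c in zip(w, codebook)) for j in range(len(z))]

codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 2.0]]
print(soft_quantize([0.9, 1.1], codebook))  # ≈ nearest codeword [1.0, 1.0]
```

Because the assignment weights are smooth in both the latent and the codebook, rate and distortion gradients can flow through the quantizer end to end.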

RL-ScanIQA: Reinforcement-Learned Scanpaths for Blind 360° Image Quality Assessment

This paper proposes RL-ScanIQA, the first end-to-end reinforcement learning-based blind 360° image quality assessment (BIQA) framework. The core idea is to formulate scanpath generation as a sequential decision-making process, using a PPO policy to learn task-driven viewing strategies directly from quality assessment feedback, rather than relying on imitation learning from human gaze data. The framework consists of two jointly optimized modules—a scanpath generator and a quality assessor—augmented with multi-level rewards (step-level exploration, set-level diversity, and task-aligned perception) and distortion-space data augmentation. The method achieves state-of-the-art performance and strong cross-dataset generalization on three benchmarks: CVIQD, OIQA, and JUFE.

SODA: Sensitivity-Oriented Dynamic Acceleration for Diffusion Transformer

SODA achieves training-free, high-fidelity generation with controllable speedup for Diffusion Transformers via offline fine-grained sensitivity modeling, dynamic-programming-based cache schedule optimization, and a unified adaptive pruning strategy.

TALON: Test-time Adaptive Learning for On-the-Fly Category Discovery

This paper proposes TALON, the first test-time adaptive framework for On-the-Fly Category Discovery (OCD). By combining semantics-aware prototype updating, stable encoder adaptation, and margin-aware logit calibration, TALON operates directly in continuous feature space without hash encoding, substantially alleviating category explosion and significantly improving novel category discovery accuracy.

F²HDR: Two-Stage HDR Video Reconstruction via Flow Adapter and Physical Motion Modeling

This paper proposes F²HDR, a two-stage HDR video reconstruction framework that adapts general-purpose pre-trained optical flow to alternating-exposure scenes via a Flow Adapter for robust alignment, and employs physical motion modeling to extract continuous motion masks from optical flow to guide artifact removal in the second stage, achieving state-of-the-art performance on real-world HDR video benchmarks.

Towards Source-Aware Object Swapping with Initial Noise Perturbation

This paper proposes SourceSwap, which generates high-quality pseudo-paired data from single images via frequency-separated initial noise perturbation, and employs a source-aware dual U-Net architecture to learn cross-object alignment, enabling zero-shot, per-object-fine-tuning-free high-fidelity object swapping.

Understanding and Enforcing Weight Disentanglement in Task Arithmetic

This paper proposes Task Feature Specialization (TFS) as a sufficient condition for weight disentanglement, reveals that its geometric consequence is weight vector orthogonality, and introduces OrthoReg — a regularization method that enforces column-wise orthogonality of weight update matrices during fine-tuning to promote task vector disentanglement, substantially improving the performance of various task arithmetic methods.
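A column-orthogonality regularizer of this kind can be sketched as a penalty on the off-diagonal entries of the Gram matrix of the update columns (toy pure-Python version; the paper's exact regularizer may be normalized or weighted differently):

```python
# Sketch of a column-orthogonality penalty in the spirit of OrthoReg:
# penalize off-diagonal Gram entries of the weight-update columns.
# Toy illustration only; OrthoReg's exact form may differ.

def ortho_penalty(delta_w):
    """Sum of squared off-diagonal Gram entries of delta_w's columns."""
    rows, cols = len(delta_w), len(delta_w[0])
    penalty = 0.0
    for i in range(cols):
        for j in range(cols):
            if i == j:
                continue
            gram_ij = sum(delta_w[r][i] * delta_w[r][j] for r in range(rows))
            penalty += gram_ij ** 2
    return penalty

orthogonal = [[1.0, 0.0], [0.0, 1.0]]   # orthogonal columns -> zero penalty
correlated = [[1.0, 1.0], [0.0, 0.0]]   # identical columns -> penalized
print(ortho_penalty(orthogonal), ortho_penalty(correlated))  # → 0.0 2.0
```

Adding such a term to the fine-tuning loss pushes each task's weight update toward orthogonal columns, the geometric condition the paper links to weight disentanglement.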

UniComp: Rethinking Video Compression Through Informational Uniqueness

This paper proposes UniComp, a video token compression framework grounded in informational uniqueness rather than attention scores. Through three modules—Frame Group Fusion, Token Allocation, and Spatial Dynamic Compression—UniComp maximally preserves unique information across temporal, spatial, and global dimensions, surpassing the uncompressed baseline even when retaining only 10% of tokens.

Unlocking ImageNet's Multi-Object Nature: Automated Large-Scale Multilabel Annotation

A fully automated pipeline is proposed that leverages self-supervised ViT features for unsupervised object discovery, generating spatially grounded multi-label annotations for all 1.28 million ImageNet-1K training images without human annotation. Models trained with these labels achieve consistent gains on both in-domain and downstream multi-label tasks (ReaL +2.0 top-1, COCO +4.2 mAP).

WPT: World-to-Policy Transfer via Online World Model Distillation

WPT proposes a world-to-policy transfer training paradigm that injects future-predictive knowledge from a world model into a teacher policy via a trainable reward model, then transfers this knowledge to a lightweight student policy through policy distillation and world reward distillation, achieving a closed-loop driving score of 79.23 with a 4.9× inference speedup.