📦 Model Compression
📷 CVPR2026 · 98 paper notes
🔥 Top topics: Compression ×15 · Diffusion Models ×9 · Model Compression ×8 · Multimodal/VLM ×4 · Knowledge Distillation ×3
- 4D-RGPT: Toward Region-level 4D Understanding via Perceptual Distillation
-
This paper proposes 4D-RGPT and the Perceptual 4D Distillation (P4D) framework, which enhances 4D perception by distilling knowledge such as depth and optical flow from a frozen 4D perceptual expert model into an MLLM. It also introduces R4D-Bench, the first region-level 4D video question-answering benchmark.
- A Unified Framework for Knowledge Transfer in Bidirectional Model Scaling
-
BoT treats neural network weights as "continuous signals," where models of different sizes are simply discretized versions of the same signal at different resolutions. By applying 3D Discrete Wavelet Transform (DWT) for downsampling to achieve Large-to-Small (L2S) transfer and Inverse DWT (IDWT) with zero-padded high frequencies for upsampling to achieve Small-to-Large (S2L) transfer, it introduces the first training-free, zero-parameter framework that unifies cross-architecture knowledge transfer in both directions. It saves up to 67.1% of pre-training FLOPs on DeiT, BERT, and GPT.
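The two transfer directions can be illustrated with a toy example. This is a minimal sketch that assumes a one-level 1D Haar wavelet applied to a flat weight list; the summary describes a 3D DWT, and the function names `haar_down` and `haar_up` are hypothetical, not taken from the paper.

```python
import math

def haar_down(w):
    """One level of 1D Haar DWT on an even-length list: returns (low, high) bands."""
    s = 1.0 / math.sqrt(2.0)
    low = [(w[i] + w[i + 1]) * s for i in range(0, len(w), 2)]
    high = [(w[i] - w[i + 1]) * s for i in range(0, len(w), 2)]
    return low, high

def haar_up(low, high=None):
    """Inverse 1D Haar DWT; a missing high band is zero-padded."""
    if high is None:
        high = [0.0] * len(low)
    s = 1.0 / math.sqrt(2.0)
    out = []
    for a, d in zip(low, high):
        out.extend([(a + d) * s, (a - d) * s])
    return out

large = [1.0, 3.0, 2.0, 2.0, 5.0, 7.0, 0.0, 4.0]  # toy "large model" weights
small, detail = haar_down(large)   # L2S: keep only the low-frequency band
regrown = haar_up(small)           # S2L: IDWT with zeroed high frequencies
exact = haar_up(small, detail)     # with the true details, reconstruction is exact
```

With the high band zeroed, each pair of regrown weights equals the pair average of the original, which is the smooth part of the signal that both sizes share.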
- AdaBet: Gradient-free Layer Selection for Efficient Training of Deep Neural Networks
-
This paper proposes AdaBet, a gradient-free layer selection method based on algebraic topology (the first Betti number \(b_1\)). By calculating the topological complexity of the activation space of each layer through only a forward pass, it determines which layers require fine-tuning without the need for labels, gradients, or backpropagation. On ResNet50/VGG16/MobileNetV2/ViT-B16, AdaBet achieves higher accuracy than full training with only 10% of layers fine-tuned, while reducing peak memory by approximately 40%.
- Adaptive Depth Lightweight RGB-T Tracking with Holistic Token Routing
-
ADTrack treats network depth as a dynamically allocatable computational budget. By equipping a frozen dual-stream ViT-T backbone with multi-layer "anytime" prediction heads and a confidence-calibrated early-exit strategy, and employing a minimalist Holistic-Token-Guided Interaction (HTGI) module with only 37.3K parameters for low-cost cross-modal fusion, it achieves 70.2% PR / 56.3% SR on LasHeR. It runs at 148.3 FPS on GPU, 50.2 FPS on CPU, and 28.7 FPS on edge devices.
- Adaptive Video Distillation: Mitigating Oversaturation and Temporal Collapse in Few-Step Generation
-
Addressing the common issues of "color oversaturation + motion collapse" in DMD (Distribution Matching Distillation) for video diffusion models, this paper proposes an adaptive regression loss (using an EMA cache to dynamically down-weight unreliable real samples with high variance) and a temporal regularization loss (directly penalizing low inter-frame variance). Combined with an inference acceleration strategy that reduces the frame rate at high-noise steps and interpolates them back at low-noise steps, the method achieves 4-step generation on Wan2.1-1.3B/14B. The VBench/VBench2 scores surpass all distillation baselines, and user preference even exceeds that of the 50-step teacher model.
- AdaSVD: Singular Value Decomposition with Adaptive Mechanisms for Large Multimodal Models
-
AdaSVD utilizes "alternating least squares to compensate for truncated singular matrices" and "adaptive compression rate allocation based on layer importance." These mechanisms significantly reduce accuracy loss in SVD-based Large Multimodal Models (LMMs) under high compression rates (60%+), consistently outperforming SVD-LLM across LLaMA2, OPT, Mistral, and Vicuna.
- Back to Source: Open-Set Continual Test-Time Adaptation via Domain Compensation
-
Targeting the Open-set Continual Test-Time Adaptation (OCTTA) scenario, where "continual domain drift" and "unknown novel classes" occur simultaneously, this paper proposes DOCO. The method first splits the current batch into ID-like and OOD-like subsets. It then learns a visual prompt on the ID samples to "pull" feature statistics back to the source domain. Finally, this prompt is directly reused for OOD samples in the same batch to strip away their domain shift and expose their semantic novelty. This three-step closed loop, in which each step supports the next, achieves an H-score 4.7% higher than the second-best method on ImageNet-C.
- Balanced Dataset Distillation via Modeling Multiple Visual Pattern Distribution
-
This work identifies a prevalent "pattern imbalance" in existing dataset distillation methods (either favoring intra-class majority class-general patterns or rare marginal patterns). It proposes the BPS framework: first, each class is modeled as a distribution of multiple visual patterns using a hierarchical semantic structure; then, a pattern-balanced coreset is constructed by taking half of the IPC budget from both the "center" and "margin" of each pattern; finally, a student model is trained via knowledge distillation. BPS outperforms previous SOTA across four benchmarks and offers advantages in cross-architecture generalization and efficiency through its "model once, reuse for all IPC" approach.
- Batch Loss Score for Dynamic Data Pruning
-
Batch Loss Score (BLS) is proposed to estimate sample importance using only the mean batch loss instead of hard-to-acquire per-sample losses. Providing theoretical guarantees from a signal processing perspective via EMA low-pass filtering, it can be integrated into existing dynamic pruning frameworks with only 3 lines of code.
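A minimal sketch of the idea, assuming each sample's score is an exponential moving average of the mean loss of the batches it appears in. The function name, the `beta` value, and the exact update rule are assumptions for illustration; the summary does not give the paper's formula.

```python
def update_batch_loss_scores(scores, batch_indices, batch_mean_loss, beta=0.9):
    """EMA-update the score of every sample in the batch using only the batch mean loss."""
    for i in batch_indices:
        prev = scores.get(i)
        if prev is None:
            scores[i] = batch_mean_loss            # first visit: initialise with the batch loss
        else:
            scores[i] = beta * prev + (1.0 - beta) * batch_mean_loss
    return scores

scores = {}
update_batch_loss_scores(scores, [0, 1], batch_mean_loss=2.0)
update_batch_loss_scores(scores, [1, 2], batch_mean_loss=1.0)
```

Because a sample lands in different random batches over time, the EMA acts as a low-pass filter that averages out the contribution of its batch-mates.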
- Beyond Soft Label: Dataset Distillation via Orthogonal Gradient Matching
-
Addressing the issue where existing ImageNet-1K dataset distillation methods rely excessively on BN statistics matching and suffer performance collapse without soft labels, this paper argues from a gradient perspective that BN matching only aligns gradient "scales" while ignoring the "directions" that determine training. The authors propose Orthogonal Gradient Matching (OGM), which performs SVD on real/synthetic gradients, forces all singular values to 1 to align only the singular vectors, and utilizes the closed-form gradient of the Least Squares Error (LSE) loss to complete matching during the forward pass. At IPC=10, OGM achieves 47.0% with soft labels and 16.7% with hard labels, significantly surpassing baselines like RDED.
- Bilevel Layer-Positioning LoRA for Real Image Dehazing
-
This paper proposes BiLaLoRA, which automatically locates the optimal network layers for LoRA insertion through bilevel optimization. Combined with H2C Loss (an unsupervised dehazing loss based on CLIP semantic directions), it efficiently adapts dehazing models pre-trained on synthetic data to real-world scenarios, reducing training time by 77.7% while maintaining performance comparable to full fine-tuning across models and domains.
- BinaryAttention: One-Bit QK-Attention for Vision and Diffusion Transformers
-
BinaryAttention is proposed to quantize the Query and Key in Transformer attention into 1-bit binary representations. By replacing floating-point dot products with XNOR + popcount bitwise operations, it achieves over \(2\times\) speedup compared to FlashAttention2 on A100 GPUs, while maintaining or even surpassing full-precision attention performance across vision classification, detection, segmentation, and diffusion generation tasks.
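The XNOR + popcount replacement for a dot product can be checked on a toy example. This is a sketch that assumes plain sign binarization into {-1, +1} with bits packed into a Python integer; the helper names are hypothetical, and the paper's actual quantizer and GPU kernels are not reproduced here.

```python
def pack_signs(vec):
    """Pack sign bits of a real vector into an int (bit i is 1 when vec[i] >= 0)."""
    bits = 0
    for i, v in enumerate(vec):
        if v >= 0:
            bits |= 1 << i
    return bits

def binary_dot(a_bits, b_bits, dim):
    """Dot product of two {-1,+1}^dim vectors from packed bits via XNOR + popcount."""
    mask = (1 << dim) - 1
    agree = bin(~(a_bits ^ b_bits) & mask).count("1")  # XNOR, then popcount
    return 2 * agree - dim                             # agreements minus disagreements

q = [0.3, -1.2, 0.8, -0.1]
k = [0.5, 0.4, -0.9, -2.0]
score = binary_dot(pack_signs(q), pack_signs(k), len(q))

# Reference: the same dot product computed on explicit +/-1 values.
reference = sum((1 if x >= 0 else -1) * (1 if y >= 0 else -1) for x, y in zip(q, k))
```

Positions where the bits agree contribute +1 and the rest contribute -1, so the score is `2 * agree - dim`.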
- Block-based Learned Image Compression without Blocking Artifacts
-
This paper utilizes a set of analytical recurrence formulas to precisely calculate the minimum overlap required for each layer when a CNN-based image compression model is executed block-wise. This enables off-the-shelf models to run on a per-block basis without retraining, reducing peak memory to approximately 13% while achieving bit-to-bit consistency with full-image inference and completely eliminating block boundary artifacts.
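The flavor of such a recurrence can be sketched with the standard receptive-field arithmetic for a stack of convolutions. This is not the paper's set of formulas, only the textbook recurrence, and the layer list is a hypothetical toy stack.

```python
def receptive_field(layers):
    """layers: list of (kernel_size, stride) pairs, input side first."""
    rf, jump = 1, 1
    for kernel, stride in layers:
        rf += (kernel - 1) * jump   # each layer widens the field by (k - 1) input-space steps
        jump *= stride              # stride multiplies the spacing seen by later layers
    return rf

layers = [(3, 1), (3, 2), (3, 1)]    # hypothetical toy stack of 3x3 convolutions
rf = receptive_field(layers)
context_per_side = (rf - 1) // 2     # input pixels an output pixel depends on, per side
```

The per-side context gives a sense of how much neighbouring data a block needs; exact bit-level consistency additionally depends on padding and stride alignment, which the paper's per-layer formulas handle.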
- Bridging Domains through Subspace-Aware Model Merging
-
This paper discovers that merging models fine-tuned on different domains to generalize to unseen domains produces much stronger singular subspace conflicts than conventional multi-task merging. It proposes SCORE: a method that performs a change-of-basis using a shared orthogonal basis constructed from the concatenated principal singular vectors of all models. By retaining diagonal elements (consistent directions) and trimming off-diagonal outliers (conflicted directions), SCORE outperforms existing merging methods across 8 domain generalization benchmarks and 3 model scales.
- CADC: Content Adaptive Diffusion-Based Generative Image Compression
-
CADC makes the "encoding-side representation" and "decoding-side generation prior" of diffusion-based image compression content-adaptive throughout the process. It uses an uncertainty map to drive spatially varying quantization, a lightweight auxiliary decoder to force semantic information into the first 4 channels actually utilized by the diffusion decoder, and derives content-related text conditions bit-free from the auxiliary reconstructed image. It achieves SOTA perceptual quality at ultra-low bitrates (approximately 0.005–0.01 bpp).
- CAR-SAM: Cross-Attention Reconstruction for Post-Training Quantization of the Segment Anything Model
-
Addressing the unique challenges of "attention dissipation caused by low-bit quantization" and "reconstruction oscillation due to bidirectional coupling" in the SAM decoder, CAR-SAM employs MatMul-Aware Compensation (MAC)—which channels activation quantization errors from MatMul inputs back into preceding linear layer weights—and Joint Cross-Attention Reconstruction (JCAR)—which optimizes coupled cross-attention blocks together. This framework successfully compresses SAM/SAM2 to W4A4, achieving mAP improvements of 14.6% and 6.6% over previous state-of-the-art methods on SAM-B and SAM-L, respectively.
- CARLoS: Retrieval via Concise Assessment Representation of LoRAs at Scale
-
CARLoS bypasses metadata provided by LoRA authors. Instead, it "activates" each LoRA by generating images across a large set of prompt × seed pairs, computing the CLIP-space difference from the base model images. These are distilled into a tri-representation of "Direction / Strength / Consistency," enabling LoRA retrieval based on actual generative behavior rather than textual metadata. It outperforms four strong text-based retrieval baselines in both automated and human evaluations.
- Collaborative Multi-Mode Pruning for Vision-Language Models
-
Addressing the simultaneous "parameter redundancy" and "token redundancy" in VLMs, CoMP introduces a Collaborative Importance Measure (CIM, eliminating interference between parameter and token pruning) and a Multi-mode Pruning Strategy (MPS, adaptively selecting the most cost-effective pruning mode at each step). It significantly outperforms single-mode methods at high pruning rates (e.g., leading by 3.51% in test accuracy on NLVR2 at a 0.85 pruning rate).
- Content-Adaptive Hierarchical Hyperprior for Neural Video Coding
-
Addressing the long-neglected optimization of "hierarchical structure" (quality structure + reference structure) in neural video coding (NVC), this paper extracts a hierarchical hyperprior (hh) from the current frame. This prior uniformly guides the content-adaptive joint optimization of quality allocation and dual-reference fusion, saving 15.51% and 12.20% bitrate compared to the previous SOTA, DCVC-FM, under IP -1 and IP 32 settings respectively.
- Continual Distillation of Teachers from Different Domains
-
The paper proposes a new paradigm called "Continual Distillation (CD)" — where a student sequentially distills from a stream of teachers who arrive one after another, belong to different domains, and are mutually invisible. It identifies that distilling with "external data" (unseen by teachers) can transfer unseen domain knowledge (UKT), but sequential progression leads to the forgetting of this knowledge (UKF). Consequently, the authors propose SE2D (restricting self-distillation to external data) to alleviate forgetting and improve cross-domain average accuracy across multiple benchmarks.
- Critical Patch-Aware Sparse Prompting with Decoupled Training for Continual Learning on the Edge
-
This paper proposes the CPS-Prompt framework, which improves training-time memory and computational efficiency by approximately 1.6× for prompt-based continual learning on edge devices, at the cost of an approximately 2% drop in accuracy. It relies on two modules: task-aware Critical Patch Sampling (CPS) and Decoupled Prompt-Classifier Training (DPCT).
- Cross-Architecture Adaptation: Cloud-Edge Continual Test-Time Adaptation with Dynamic Sampling and Heterogeneous Distillation
-
Addressing the constraints of existing cloud-edge Continual Test-Time Adaptation (CTTA) which default to isomorphic CNNs, CAA enables the cloud to run a large ViT teacher while the edge runs a lightweight CNN student. Through a mechanism of "selective sample uploading via communication budget + cross-architecture heterogeneous distillation," it achieves heterogeneous synergistic adaptation, setting a new SOTA on ImageNet-C severity-5 with 41.2% mean accuracy while uploading the minimum number of samples.
- Cross-View Distillation and Adaptive Masking for Incomplete Multi-View Multi-Label Classification
-
To address multi-label classification with "dual missing of views and labels," this work utilizes a strong view as a teacher to distill knowledge into remaining weak views. Furthermore, a learnable binary gate is employed to mask views that remain unreliable after distillation. This approach consistently outperforms nine SOTA methods across six datasets.
- DAGE: Dual-Stream Architecture for Efficient and Fine-Grained Geometry Estimation
-
The DAGE dual-stream Transformer architecture is proposed to decouple global consistency modeling (low-resolution stream) from fine-grained detail preservation (high-resolution stream). By fusing these via a lightweight Cross-Attention Adapter, high-quality depth/point map estimation and pose prediction are achieved on 2K resolution and 1000-frame sequences. The method is \(2\times\) to \(28\times\) faster than Pi3 and achieves a new SOTA in video geometry estimation.
- Dataset Distillation by Influence Matching
-
Instead of forcing synthetic data to mimic the training process of real data (gradients/trajectories), this work directly aligns the "training outcome." By utilizing a linear-time, inverse-Hessian-free differentiable influence estimator, the authors reformulate dataset distillation as "influence of the synthetic set on parameters \(\approx\) influence of the real set on parameters." This approach consistently outperforms process-matching SOTA on CIFAR, Tiny-ImageNet, and Flickr30K (e.g., reaching 31.5% on Tiny-ImageNet IPC=10, a 4.7% gain over NCFM).
- Decompose, Mix, Adapt: A Unified Framework for Parameter-Efficient Neural Network Recombination and Compression
-
CRISP factorizes pre-trained weights into a "frozen basis \(B\) shared across layers + a learnable mixer \(A\) private to each layer." Shrinking and sharing \(B\) achieves Model Compression (MC), while freezing \(B\) and tuning only \(A\) achieves Parameter-Efficient Fine-Tuning (PEFT). This unified factor structure bridges two tasks previously handled separately. On VTAB-1K PEFT, it outperforms SOTA by 1.5% with fewer parameters; ViT compression exceeds SOTA by 1.5%, and the combined PEFT+MC setting outperforms existing baselines by over 1%.
- Discovering Adaptive Task Dependencies for Efficient Multi-Task Representation Compression
-
ATDC makes the compression order of multiple task features image-adaptive: it uses a lightweight proxy head to estimate task predictability and form a correlation matrix, then greedily constructs a Directed Acyclic Graph (DAG) for sequencing. Each task feature is residually encoded conditional on its "parent" tasks, achieving higher multi-task accuracy at lower bitrates on Taskonomy.
- Distilling Balanced Knowledge from a Biased Teacher
-
To address the issue of teacher models skewing toward head classes in knowledge distillation under long-tail distributions, this paper decomposes the traditional KL divergence loss into cross-group and within-group components. By rebalancing the cross-group loss to calibrate group-level predictions and reweighting the within-group loss to ensure equal contributions, the proposed method outperforms existing techniques on CIFAR-100-LT/TinyImageNet-LT/ImageNet-LT, even exceeding the teacher model's own performance.
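The split into cross-group and within-group terms follows the chain rule of KL divergence, which can be verified numerically. This sketch assumes the classes are partitioned into disjoint groups and uses made-up probabilities; the paper's rebalancing and reweighting factors are not reproduced.

```python
import math

def kl(p, q):
    """KL(p || q) for two discrete distributions given as lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def grouped_kl(p, q, groups):
    """Return (cross-group KL, list of within-group KLs, teacher group masses)."""
    P = [sum(p[i] for i in g) for g in groups]   # teacher mass per group
    Q = [sum(q[i] for i in g) for g in groups]   # student mass per group
    cross = kl(P, Q)
    within = [
        kl([p[i] / Pg for i in g], [q[i] / Qg for i in g])
        for g, Pg, Qg in zip(groups, P, Q)
    ]
    return cross, within, P

p = [0.5, 0.2, 0.2, 0.1]       # toy teacher probabilities
q = [0.4, 0.3, 0.1, 0.2]       # toy student probabilities
groups = [[0, 1], [2, 3]]      # hypothetical head / tail grouping
cross, within, P = grouped_kl(p, q, groups)
total = cross + sum(Pg * w for Pg, w in zip(P, within))
```

In the plain KL, each within-group term is weighted by the teacher's group mass, so a teacher biased toward head classes also biases which groups dominate the loss. This is the imbalance the reweighting targets.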
- Distributed Image Compression with Multimodal Side Information at Extremely Low Bitrates
-
To address the issues of blurred reconstruction and loss of detail in multi-view distributed image compression at extremely low bitrates (<0.1 bpp), this paper proposes MDIC. It is the first to feed side information into a pre-trained text-to-image diffusion model in a "text + visual" multimodal format, using a text-supervised visual mask to gate the restoration of category information and object-level details lost during quantization, achieving SOTA perceptual quality on KITTI Stereo and Cityscapes.
- DiT-Distill: Open-Set Fine-Grained Retrieval via Generative Curriculum Knowledge
-
This work refines and distills the "coarse-to-fine generative curriculum knowledge" encoded during the denoising process of a pre-trained text-to-image Diffusion Transformer (DiT) into a lightweight ViT retrieval backbone. This enables the small model to completely discard the DiT during inference while significantly improving R@1 in Open-Set Fine-Grained Retrieval (OSFR) (+9.8% on CUB, +18.6% on Stanford Cars).
- DMGD: Train-Free Dataset Distillation with Semantic-Distribution Matching in Diffusion Models
-
DMGD decouples "diffusion-based dataset distillation" into two independent objectives: semantic matching and distribution matching. It injects train-free guidance exclusively during the sampling stage. By utilizing dynamic soft labels to enhance diversity and Optimal Transport (OT) loss to align distribution structures, it outperforms fine-tuning-based SOTA methods on ImageNet-Woof/Nette/1K by an average of 2.1%/5.4%/2.4%.
- Dual-branch Distilled Transformer for Efficient Asymmetric UAV Tracking
-
EATrack utilizes a full-scale 12-layer ViT teacher to distill target representation and localization capabilities into an 8-layer lightweight student through "feature-level + prediction-level" dual-branch distillation focused solely on target regions. Combined with asymmetric inference and temporal adaptation, it achieves a 1.2% higher average success rate than previous SOTA methods on five UAV benchmarks while running at 241.9 FPS.
- DualReg: Dual-Space Filtering and Reinforcement for Rigid Registration
-
DualReg proposes a dual-space registration paradigm that first uses lightweight 1-point RANSAC + 3-point RANSAC to progressively filter feature-space correspondences, and then constructs geometric proxy point sets based on filtered anchors for joint dual-space optimization. It achieves SOTA accuracy on 3DMatch while being 32x faster than MAC.
- DuetMerging: Synergizing Dynamic and Static Strategies for Mitigating Task Interference in Model Merging
-
DuetMerging stacks task vectors from multiple expert models into a 3D tensor and applies Tucker decomposition to derive a "shared core tensor"-driven dynamic expert pool for suppressing task conflicts. It further employs neuron activation-guided sparsification to "surgically" salvage task-specific knowledge from decomposition residuals as a static correction. This "dynamic-static duet" achieves SOTA performance on 8 image classification tasks (99.2% normalized accuracy on ViT-B/32).
- Efficiency Follows Global-Local Decoupling
-
ConvNeur decouples the tasks of "global reasoning" and "preserving local details" into two independent branches: a convolutional branch dedicated to retaining local textural details, and a compressed "neural memory" branch that aggregates image-level context using a chunked approach with sub-quadratic complexity. A learned gating mechanism allows the global signal to modulate rather than overwrite local features. It achieves superior accuracy-efficiency tradeoffs on ImageNet/COCO/ADE20K with fewer FLOPs and parameters.
- Enhancing Mixture-of-Experts Specialization via Cluster-Aware Upcycling
-
This paper proposes Cluster-aware Upcycling, which initializes MoE expert and router parameters by extracting the semantic structure of dense models through spherical k-means clustering. This approach breaks expert symmetry and promotes early specialization. Combined with an Expert Ensemble Self-Distillation (EESD) loss, it consistently outperforms existing upcycling methods on CLIP ViT.
- Evidential Transformation Network: Turning Pretrained Models into Evidential Models for Post-hoc Uncertainty Estimation
-
The paper proposes the Evidential Transformation Network (ETN), a lightweight post-hoc module that transforms pretrained classifiers or LLMs into evidential models by learning sample-dependent affine transformations in the logit space, achieving reliable uncertainty estimation with minimal computational overhead.
- F²HDR: Two-Stage HDR Video Reconstruction via Flow Adapter and Physical Motion Modeling
-
The authors propose F²HDR, a two-stage HDR video reconstruction framework. It employs a Flow Adapter to adapt general pre-trained optical flow to alternating exposure scenarios for robust alignment. It utilizes physical motion modeling to extract continuous motion masks from optical flow, guiding artifact removal in the second stage. It achieves SOTA performance on real HDR video benchmarks.
- FAAR: Efficient Frequency-Aware Multi-Task Fine-Tuning via Automatic Rank Selection
-
FAAR is proposed as a frequency-aware multi-task parameter-efficient fine-tuning method. It dynamically selects the optimal rank for each task and layer through Performance-Driven Rank Shrinking (PDRS) and enhances spatial awareness and cross-task consistency using the Task-Spectral Pyramidal Decoder (TS-PD) with FFT frequency information. It achieves superior performance with only 1/9 of the parameters compared to traditional fine-tuning.
- FAST: Topology-Aware Frequency-Domain Distribution Matching for Coreset Selection
-
FAST reformulates "selecting a small core subset from a large dataset" as a continuous distribution matching optimization problem under spectral graph constraints. It uses Characteristic Function Distance (CFD) to match all moment information of the original dataset order-by-order in the frequency domain. Topological constraints are then applied to pull continuous solutions back to real discrete samples. Without relying on any proxy DNN, it achieves an average accuracy 9.12% higher than the previous SOTA, with a 96.57% reduction in energy consumption and a 2.2× speedup on CPU.
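A characteristic-function distance between a dataset and a candidate subset can be sketched in a few lines. This example assumes 1D samples, a small hand-picked frequency grid, and a mean squared difference of empirical characteristic functions. All of these are illustrative choices, and the spectral graph and topological constraints are not modeled.

```python
import cmath

def empirical_cf(samples, t):
    """Empirical characteristic function: mean of exp(i * t * x) over the samples."""
    return sum(cmath.exp(1j * t * x) for x in samples) / len(samples)

def cf_distance(xs, ys, freqs):
    """Mean squared gap between the two empirical characteristic functions."""
    return sum(abs(empirical_cf(xs, t) - empirical_cf(ys, t)) ** 2 for t in freqs) / len(freqs)

full = [0.0, 0.1, 0.2, 1.0, 1.1, 1.2]   # toy dataset with two clusters
freqs = [0.5, 1.0, 2.0, 4.0]            # hypothetical frequency grid
balanced = [0.1, 1.1]                   # one sample from each cluster
skewed = [0.0, 0.1]                     # both samples from the same cluster

d_balanced = cf_distance(full, balanced, freqs)
d_skewed = cf_distance(full, skewed, freqs)
```

The subset that covers both clusters yields the smaller distance, which is the behavior a distribution-matching coreset objective rewards.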
- Fixed Anchors Are Not Enough: Dynamic Retrieval and Persistent Homology for Dataset Distillation
-
RETA decouples two failure modes of residual matching in dataset distillation (fit-complexity gap and pull-to-anchor effect). It adaptively selects real patch anchors via Dynamic Retrieval Connection (DRC) and preserves intra-class diversity through Persistent Topology Alignment (PTA), achieving 64.3% on ImageNet-1K ResNet-18 IPC=50 (+3.1% vs. FADRM).
- FOZO: Forward-Only Zeroth-Order Prompt Optimization for Test-Time Adaptation
-
This paper proposes FOZO, a forward-only zeroth-order prompt optimization paradigm. By utilizing SPSA gradient estimation, a dynamic perturbation strategy, and shallow-deep feature statistical alignment, FOZO achieves efficient TTA without modifying model weights. It outperforms all forward-only methods on ImageNet-C with 59.52% accuracy (surpassing FOA's 58.13%) and supports INT8 quantized models.
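SPSA estimates a gradient from two forward evaluations, which is why no backpropagation is needed. The sketch below applies the standard SPSA estimator to a toy quadratic loss; the loss, step sizes, and "prompt" vector are hypothetical stand-ins, and FOZO's dynamic perturbation strategy is not modeled.

```python
import random

def spsa_gradient(loss_fn, params, c=1e-3, rng=random):
    """Standard SPSA: one random +/-1 perturbation, two loss evaluations."""
    delta = [rng.choice((-1.0, 1.0)) for _ in params]
    plus = [p + c * d for p, d in zip(params, delta)]
    minus = [p - c * d for p, d in zip(params, delta)]
    diff = (loss_fn(plus) - loss_fn(minus)) / (2.0 * c)
    return [diff / d for d in delta]

def toy_loss(prompt, target=(1.0, -2.0)):
    """Hypothetical stand-in for a forward-pass objective."""
    return sum((p - t) ** 2 for p, t in zip(prompt, target))

rng = random.Random(0)
prompt = [0.0, 0.0]
initial = toy_loss(prompt)
for _ in range(200):                      # forward-only updates, no backprop
    grad = spsa_gradient(toy_loss, prompt, rng=rng)
    prompt = [p - 0.1 * g for p, g in zip(prompt, grad)]
final = toy_loss(prompt)
```

Each step costs two loss evaluations regardless of the number of parameters, at the price of a noisier gradient than backpropagation would give.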
- FreqSIC: Frequency-aware Stereo Image Compression with Bi-directional Checkerboard Context Model
-
Addressing the two pain points of learned stereo image compression—"loss of high-frequency details" and "slow autoregressive entropy models"—FreqSIC introduces the Frequency-aware Stereo Context Transfer (FSCT) module. This module models left-right view redundancy separately on high and low-frequency components with adaptive weighting. Furthermore, it replaces the cumbersome spatial autoregressive entropy model with a bi-directional checkerboard context model embedded with FSCT. FreqSIC achieves SOTA rate-distortion performance on InStereo2K and Cityscapes while reducing codec latency to 1.62s (approximately 48 times faster than BiSIC's 78.6s).
- Frequency Switching Mechanism for Parameter-Efficient Multi-Task Learning
-
Free Sinewich proposes a parameter-efficient multi-task learning framework based on frequency switching. By applying sine transformations \(M_t = \sin(\omega_t \cdot M_{AWB})\) with different task-specific frequencies to a shared low-rank base matrix, it achieves true parameter reuse and task specialization at near-zero cost, reaching SOTA on dense prediction benchmarks with minimal trainable parameters.
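The frequency-switching formula \(M_t = \sin(\omega_t \cdot M_{AWB})\) can be illustrated directly. This sketch assumes the shared base matrix is already materialized as a nested list; the task names and frequency values are hypothetical.

```python
import math

def task_matrix(shared, omega):
    """Apply M_t = sin(omega_t * M) element-wise to the shared base matrix."""
    return [[math.sin(omega * v) for v in row] for row in shared]

shared = [[0.0, 0.5], [1.0, -0.5]]           # toy shared base matrix
task_freqs = {"seg": 1.0, "depth": 2.0}      # hypothetical per-task frequencies
mats = {task: task_matrix(shared, w) for task, w in task_freqs.items()}
```

Each task stores only its scalar frequency, so adding a task adds almost no parameters while still producing a distinct weight matrix.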
- Generative Video Compression with One-Dimensional Latent Representation
-
This paper proposes GVC1D, which for the first time replaces the 2D grid latent representation in video compression with a compact 1D token sequence. Combined with a 1D memory module for modeling long-term temporal context, it achieves over 60% bitrate savings in perceptual quality metrics.
- Gradient Knows Best: Mixed-Precision Quantization via Gradient-Guided Bit Allocation for Super-Resolution
-
For post-training mixed-precision quantization (PTQ-MPQ) of super-resolution (SR) models, this paper moves beyond using static statistics like activation standard deviation for layer-wise sensitivity estimation. Instead, it directly uses the "gradient of loss with respect to bit-width" for bit allocation, paired with a non-learning Dynamic Activation Normalization (DAN) to solve the activation range drift caused by the removal of BN in SR. It achieves a 1.26 dB higher PSNR on Urban100 compared to previous PTQ-MPQ methods and is 1.9\(\times\) faster for 3-bit EDSR\(\times\)4.
- Grid Distillation: Compositional Image Distillation via Structured Generative Grids
-
Grid Distillation compresses an entire image class into a "structured generative grid": it first uses Spectral-Submodular Image Selection (SSDIM) to select \(L^2\) representative images from CLIP embeddings—balancing coverage, diversity, and manifold geometry—to form a grid which is then downsampled. Subsequently, a single-step diffusion inversion (based on SD Turbo) restores high-frequency details lost during downsampling, followed by grid-aware cropping for training augmentation. The method significantly outperforms existing dataset distillation approaches across ImageWoof, ImageNette, ImageIDC, and ImageNet-1K, achieving 65.5% on ImageWoof IPC=10 (compared to 39.9% for VLCP).
- HeSS: Head Sensitivity Score for Sparsity Redistribution in VGGT
-
HeSS proposes the Head Sensitivity Score to quantify the sensitivity of each attention head in the global attention layers of VGGT to sparsification. Based on this score, it redistributes the attention budget from insensitive heads to sensitive ones, significantly outperforming the uniform sparsification method SparseVGGT at high sparsity levels with almost no additional runtime overhead.
- HierAmp: Coarse-to-Fine Autoregressive Amplification for Generative Dataset Distillation
-
This paper proposes HierAmp, which injects learnable category tokens into the coarse-to-fine generation process of Visual Autoregressive (VAR) models to identify semantically salient regions. By amplifying attention in these regions through positive logit bias, the distilled data achieves richer layout diversity at coarse scales and focuses on category-related details at fine scales, reaching SOTA performance on multiple dataset distillation benchmarks.
- ID-Sim: An Identity-Focused Similarity Metric
-
This paper introduces ID-Sim—a feed-forward perceptual metric specifically designed to measure "identity consistency." It mimics human "selective sensitivity" (insensitive to contextual changes like background/pose/lighting, but highly sensitive to subtle identity changes). By training LoRA and dual-head MLPs on a frozen DINOv3 ViT-L using real and synthetic edited data, combined with a dual objective of global CLS contrastive and local patch Optimal Transport contrastive, it outperforms existing metrics in 48 out of 49 evaluation settings across 7 datasets, using 100× less annotated data and a smaller backbone.
- IMS3: Breaking Distributional Aggregation in Diffusion-Based Dataset Distillation
-
Addressing the pain point where diffusion-based dataset distillation samples "over-aggregate in high-density regions and lack discriminative boundary samples," IMS3 leverages the instability of DDIM inversion for fine-tuning (Inversion-Matching, IM) to broaden the generated distribution toward low-density regions. It then employs training-free Selective Subgroup Sampling (S3) to pick synthetic subsets that are both representative and class-separable based on centroid similarity, achieving a new SOTA in diffusion-based distillation on ImageWoof, ImageNette, and ImageIDC.
- Is Bin Generation Indispensable? A Bin-Generation-Free Dataset Quantization via Semantic Perspective
-
Addressing the pain points of the "bin generation" step scaling cubically with the number of intra-class samples and fixed patch dropping ratios being unable to adapt to sample redundancy, BGFDQ replaces expensive bin generation with lightweight KNN identification, uses neighbor-aware coreset selection to ensure coverage and remove redundancy, and adaptively selects dropping ratios for each image via semantic offset. This reduces complexity from \(O(CM^3)\) to \(O(CM^2)\), consistently outperforming SOTA across four classification benchmarks (up to +5% on CIFAR-100) and scaling to 200,000 samples per class where bin generation methods fail due to OOM.
- LiDeRe: A Lightweight Readout for Fast and Data-Efficient Dense Prediction
-
LiDeRe argues that for dense prediction tasks with limited data, rather than using Parameter-Efficient Fine-Tuning (PEFT) such as LoRA—which requires backpropagation through the entire backbone—it is superior to attach a carefully designed lightweight readout on top of a frozen backbone. By integrating a "learnable interpolation prior" and "content-guided attention" into a feature interpolation module, this approach often achieves parity with or surpasses PEFT methods in semantic segmentation, pose estimation, object detection, and boundary detection using fewer than 400,000 trainable parameters, while offering faster training and lower memory consumption.
- LoPrune: Efficient Data Pruning for LoRA-Based Fine-Tuning of Vision Transformer
-
Addressing the overlooked "data redundancy" bottleneck in on-device LoRA fine-tuning, LoPrune proposes projecting sample influence functions onto the LoRA trainable subspace to calculate the TSA Score. By utilizing K-FAC curvature approximation for efficient single-epoch scoring, it reduces fine-tuning overhead by up to 72.9% and accelerates training by up to 3.69× on models like ViT/DeiT/Swin/DETR, while achieving accuracy gains of up to 3.50%.
- MambaSIC: Mamba-based Stereo Image Compression with Bi-directional Multi-reference Entropy Model
-
MambaSIC replaces expensive cross-attention in stereo image compression with a linear-complexity Stereo Visual State Space Block (Stereo VSSB) for inter-view context transfer. Combined with a checkerboard-partitioned bi-directional multi-reference entropy model instead of spatial autoregression, it refreshes Rate-Distortion performance on InStereo2K / Cityscapes (improving BD-PSNR) while reducing latency to 1.26s (approximately \(62\times\) faster than the SOTA BiSIC).
- Masking Teacher and Reinforcing Student for Distilling Vision-Language Models
-
Masters bridges the teacher-student capacity gap through a progressive strategy that masks teacher weights by magnitude and gradually restores them during training. This is combined with offline RL driven by accuracy and distillation transferability rewards, enabling compact VLMs to stably absorb knowledge and outperform same-sized models across 13 multimodal benchmarks.
- MEMO: Human-like Crisp Edge Detection Using Masked Edge Prediction
-
This paper proposes the MEMO framework, which employs masked edge training and a progressive inference strategy based on confidence ranking. Using only standard cross-entropy loss, it generates sharp, single-pixel edge maps and significantly outperforms existing methods in crispness-aware evaluation (increasing CEval ODS from 0.749 to 0.836 on BSDS).
- Memory-Efficient Transfer Learning with Fading Side Networks via Masked Dual Path Distillation
-
MDPD proposes efficient fine-tuning through bidirectional knowledge distillation between a frozen backbone and a lightweight side network. The side network is discarded after training, achieving both parameter/memory efficiency during training and high speed during inference.
- Mining Attribute Subspaces for Efficient Fine-tuning of 3D Foundation Models
-
Targeting 3D foundation models such as VGGT, the authors extract a "shared LoRA subspace" for four types of 3D variations (texture, geometry, camera, and lighting) using controlled synthetic data. They demonstrate that these subspaces are approximately orthogonal. By concatenating them into a set of compact LoRA bases, efficient fine-tuning is achieved by training only a small middle matrix. This method achieves superior downstream accuracy on 3D face anti-spoofing, clothed human reconstruction, and transparent object reconstruction with significantly fewer parameters (approx. 4M vs. 16M for LoRA).
- Mitigating The Distribution Shift of Diffusion-based Dataset Distillation
-
This paper identifies that using diffusion models for dataset distillation suffers from two types of distribution shifts: training-stage and sampling-stage shifts. It proposes a two-stage framework: during training, L1 sparse regularization (RSM) is used to force the diffusion model to learn a compact and sparse "distillation-aware" manifold; during sampling, greedy i.i.d. generation is replaced by synchronous denoising of the entire batch with Collaborative Guided Sampling (CGS), which integrates DPP diversity and distribution matching. The method achieves SOTA performance on ImageNet subsets and ImageNet-1K with lower computational cost.
- Neural Differentiation in Deep Networks: A Theoretical Framework for Expressivity and Representational Diversity
-
This paper proposes a mathematical framework for "neural differentiation," utilizing a unified Neural Differentiation Index (NDI) (integrating spectral diversity, entropy information, and second-order curvature sensitivity) to quantify the functional uniqueness of each neuron/channel. It provides provable error bounds for pruning; the resulting algorithm, NDP, achieves accuracies comparable to or exceeding the previous SOTA on MNIST, CIFAR-10, Tiny-ImageNet, and ImageNet at higher sparsity rates.
- OneSparse: A Unified Framework for Sparse Activation Layers in Vision Models
-
OneSparse unifies Mixture-of-Experts (MoE) and memory modules—two previously distinct sparse activation approaches—into a single "dispatch–process–combine" abstraction. Based on this, it introduces the Nexus Layer, a hybrid sparse layer that utilizes memory units to provide a low-cost baseline for all tokens while employing expert units to refine semantically critical regions. On ImageNet, COCO, and ADE20K, it achieves a superior accuracy-efficiency frontier compared to pure MoE and pure memory-based models at lower computational costs.
- Parallax to Align Them All: An OmniParallax Attention Mechanism for Distributed Multi-View Image Compression
-
This paper proposes the OmniParallax Attention Mechanism (OPAM) for Distributed Multi-View Image Compression (DMIC). By explicitly modeling correlations and aligned features between arbitrary view pairs via two-stage parallax attention, the constructed ParaHydra framework enables DMIC methods to significantly outperform SOTA MIC encoders for the first time while substantially reducing computational overhead.
- Perceptual Neural Video Compression with Color Separation and Rank Chain
-
To address the issues of existing neural video compression focusing solely on PSNR, neglecting the human eye's perceptual differences between luma and chroma, and inconsistent perceptual quality under variable bitrates, this paper proposes PNVC-CR. This framework combines a "luma-chroma separated dual-codec framework (PNVC-C)" with "rate-rank chain adversarial optimization (Rc-GAN)," achieving BD-rate savings of 77.71% / 53.94% / 54.44% / 42.27% on perceptual metrics like LPIPS / DISTS / KID / FID relative to VTM, while maintaining objective fidelity.
- Phased DMD: Few-step Distribution Matching Distillation via Score Matching within Subintervals
-
Addressing the dilemma where 1-step DMD distillation suffers from insufficient capacity and poor diversity, while direct multi-step expansion leads to VRAM explosion or performance degradation back to 1-step levels when using Stochastic Gradient Truncation (SGTS), this paper proposes Phased DMD. By partitioning the SNR range into subintervals and distilling one expert per phase moving progressively towards higher SNR (with intermediate phases stopping at intermediate timesteps rather than clean samples), the authors derive an unbiased subinterval score matching objective for scenarios lacking clean samples. This naturally produces few-step MoE generators that improve motion dynamics, visual fidelity, and generation diversity on large models such as Qwen-Image-20B and Wan2.2-28B.
- PlanaReLoc: Camera Relocalization in 3D Planar Primitives via Region-Based Structure Matching
-
PlanaReLoc is proposed as the first camera relocalization paradigm based on planar primitives and 3D planar maps. By associating planar regions of a query image with map planar primitives in a unified embedding space via a deep matcher, it achieves lightweight 6-DoF camera relocalization without requiring real-textured maps, pose priors, or per-scene training.
- Planning in 8 Tokens: A Compact Discrete Tokenizer for Latent World Model
-
CompACT is proposed to compress each image into only 8 discrete tokens (approx. 128 bits). By freezing a pretrained visual encoder to preserve planning-critical semantic information and employing generative decoding to supplement perceptual details, it accelerates world model-based planning by ~40x without compromising accuracy.
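The "8 tokens, approx. 128 bits" figure can be sanity-checked with a small calculation. The sketch below assumes every token is drawn from one shared codebook, in which case 128 bits over 8 tokens corresponds to 16 bits per token, i.e. a codebook of \(2^{16}\) entries. The codebook size is an inference from the numbers above, not a value stated in the summary, and both function names are hypothetical.

```python
import math


def bits_per_image(num_tokens: int, codebook_size: int) -> float:
    """Bits needed to store `num_tokens` indices into a shared codebook."""
    return num_tokens * math.log2(codebook_size)


def implied_codebook_size(total_bits: int, num_tokens: int) -> int:
    """Codebook size implied by a bit budget split evenly across tokens.

    Assumes the budget divides evenly and all tokens share one codebook.
    """
    return 2 ** (total_bits // num_tokens)


if __name__ == "__main__":
    print(implied_codebook_size(128, 8))  # codebook size implied by 128 bits / 8 tokens
    print(bits_per_image(8, 65536))       # bit budget for 8 tokens with that codebook
```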
- Preference-Aligned LoRA Merging: Preserving Subspace Coverage and Addressing Directional Anisotropy
-
This paper revisits the LoRA merging problem from the perspectives of subspace coverage and directional anisotropy. It proposes the TARA-Merging framework, which preserves LoRA directions and performs direction-level reweighting using a preference-weighted cross-entropy pseudo-loss, consistently outperforming existing merging methods across 8 vision and 6 NLI benchmarks.
- PRISM: Video Dataset Condensation with Progressive Refinement and Insertion for Sparse Motion
-
This paper proposes PRISM, a monolithic video dataset condensation method. Starting from only two temporal anchors (first and last frames), it adaptively inserts keyframes by detecting gradient direction conflicts. This approach achieves SOTA storage efficiency while maintaining content-motion coupling integrity, reaching 17.9% accuracy on miniUCF 1VPC with 20MB, roughly 5x less than the 94MB required by previous methods.
- PriVi: Towards a General-Purpose Video Model for Primate Behavior in the Wild
-
PriVi constructs a large-scale primate video pre-training dataset of 424 hours and performs domain-level pre-training (rather than pre-training on the target dataset) on V-JEPA. It demonstrates for the first time that domain-level pre-training for video models can generalize across datasets, outperforming specialized models with full fine-tuning on four primate behavior recognition benchmarks using a frozen classifier with only 220K parameters.
- Progressive Supernet Training for Efficient Visual Autoregressive Modeling
-
VARiant identifies a "scale-depth asymmetric dependence" in Visual Autoregressive (VAR) models: early low-resolution scales are highly dependent on network depth, while later high-resolution scales are robust to depth reductions. Based on this, a 30-layer VAR is trained as a weight-sharing elastic depth supernet (early scales use the full network; late scales use 2–16 layer subnets). Using a three-stage dynamic ratio progressive training to break the fixed-ratio Pareto frontier, d16/d8 subnets achieve near-lossless performance on ImageNet (FID 2.05/2.15 vs. 1.95) while saving 40–65% GPU memory.
- QKD: Quantum-Gated Task-interaction Knowledge Distillation for Class-Incremental Learning
-
QKD introduces quantum gating to class-incremental learning, modeling sample-task correlations in high-dimensional Hilbert space via parameterized quantum circuits. This guides cross-task knowledge distillation and inference-time adapter fusion, achieving SOTA performance across five benchmarks.
- Rank-Guided Pseudo-Bias Learning for Robust Black-Box Adaptation
-
PLD-Debias attaches a lightweight adapter to a completely frozen, parameter-invisible pre-trained vision encoder. It first "amplifies" latent spurious correlation directions via rank regularization and performs clustering to obtain pseudo-bias labels with 90%+ fidelity. Finally, it purifies representations using a dual loss approach of contrastive alignment and cluster-adaptive margins, achieving SOTA worst-group accuracy on CelebA, Waterbirds, and CMNIST without any group annotations.
- Real-Time Neural Video Compression with Unified Intra and Inter Coding
-
To address the weak intra-coding capability of real-time neural video compression (e.g., DCVC-RT) during scene cuts or new content—which typically causes quality drops, bitrate spikes, and error propagation due to "periodic refresh" mechanisms—this paper proposes a single-model unified intra/inter coding approach. By using dual-frame compression and mixed reference training, the model adaptively switches between intra and inter modes based on reference reliability. It achieves an average bitrate saving of 12.1% (BD-rate) over DCVC-RT while maintaining real-time speed, a smaller model size, and eliminating the need for periodic refresh.
- ReFTA: Breaking the Weight Reconstruction Bottleneck in Tensorized Parameter-Efficient Fine-Tuning
-
ReFTA stacks cross-layer weights into a third-order tensor and utilizes T-SVD to extract and fine-tune only the principal components. By leveraging the operator commutativity of tensor algebra, it swaps the order of "multiplication by \(U_0^\top\)" and "multiplication by input \(X\)." This entirely eliminates the redundant reconstruction of tensorized weights during forward and backward passes, achieving higher average accuracy in image classification and NLU with 96% fewer trainable parameters than LoRA.
- Rethinking Dataset Distillation: Hard Truths about Soft Labels
-
This is an analysis paper with a "myth-busting" nature: the authors systematically show that the apparent lead of large-scale dataset distillation (DD) methods is primarily sustained by the use of soft labels during downstream training. Once scalability analysis is performed across different label regimes, the advantage of high-quality subsets over random subsets nearly disappears. Based on this, they propose a compute-aware difficulty pruning metric CAD-Prune and a compute-aligned distillation method CA2D, which outperform existing DD methods across multiple IPC settings on ImageNet-1K.
- S2FT: Parameter-Efficient Fine-Tuning in Sparse Spectrum Domain
-
Addressing the issue that Fourier-based PEFT incorrectly assumes the weight change \(\Delta W\) has a sparse spectrum (when it is actually close to a power-uniform distribution), S2FT first estimates \(\Delta W\) and then uses row-column permutation to find a reversible transformation that maps it to a latent matrix \(\Delta\bar W\) with a truly sparse spectrum. By training only a few spectral coefficients in this sparse domain, it outperforms baselines like FourierFT with only 0.08% parameters.
- Sampling-Aware Quantization for Diffusion Models
-
This paper points out that the two acceleration paths for diffusion models—"fast samplers" and "network quantization"—conflict when used together: quantization noise perturbs the directional estimation of high-order samplers at each step, causing the smooth Probability Flow ODE to degrade into a variance-exploding SDE. The authors propose "Sampling-Aware Quantization," which uses a Mixed-Order Trajectory Alignment objective to align quantized first-order directional trajectories with full-precision high-order directional trajectories. This linearizes the probability flow, allowing for dual acceleration of "sampling speedup + model compression" under sparse steps with almost no quality degradation.
- SANER: Switchable Adapter with Non-parametric Enhanced Routing for Person De-Reidentification
-
SANER decouples the task of "selectively forgetting specific pedestrians" (De-ReID) from contradictory optimization in a single feature space into two independent low-rank adapters (forgetting / retention). It then utilizes a non-parametric test-time routing algorithm to determine the processing branch based on the similarity between queries and prototypes, effectively "forgetting" target identities with minimal impact on the identification accuracy of others.
- SelecTKD: Selective Token-Weighted Knowledge Distillation for LLMs
-
SelecTKD shifts the focus of LLM distillation from "what divergence to measure the teacher-student gap" to "on which tokens to apply supervision." Borrowing the "propose-verify" mechanism from speculative decoding, it assigns weights \(\{0, \beta, 1\}\) to each token, applying full loss only on tokens where the teacher is highly confident and consistent with the student. It achieves plug-and-play SOTA across instruction following, mathematics, code, and VLM tasks.
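A minimal sketch of token-selective distillation with weights in \(\{0, \beta, 1\}\) may help make the idea concrete. The verification rule used here is an assumption for illustration only: the student's argmax token is the "proposal", it receives weight 1 if it matches the teacher's top-1 token, weight \(\beta\) if it falls within the teacher's top-k, and weight 0 otherwise. The names `token_weight`, `weighted_kd_loss`, `beta`, and `top_k` are hypothetical, and the paper's actual criterion may differ.

```python
import math


def token_weight(teacher_probs, student_probs, beta=0.5, top_k=3):
    """Return a weight in {0, beta, 1} for one token position (assumed rule)."""
    # Student "proposes" its most likely token.
    proposal = max(range(len(student_probs)), key=student_probs.__getitem__)
    # Teacher "verifies" by ranking its own vocabulary by probability.
    ranked = sorted(range(len(teacher_probs)),
                    key=teacher_probs.__getitem__, reverse=True)
    if ranked[0] == proposal:
        return 1.0
    if proposal in ranked[:top_k]:
        return beta
    return 0.0


def weighted_kd_loss(teacher_seq, student_seq, beta=0.5, top_k=3):
    """Weight-normalized KL(teacher || student) over a sequence of positions.

    Assumes all student probabilities are strictly positive.
    """
    total, weight_sum = 0.0, 0.0
    for t, s in zip(teacher_seq, student_seq):
        w = token_weight(t, s, beta, top_k)
        kl = sum(p * math.log(p / q) for p, q in zip(t, s) if p > 0)
        total += w * kl
        weight_sum += w
    return total / weight_sum if weight_sum > 0 else 0.0
```

Positions with weight 0 contribute nothing to the loss, which is what shifts the design question from the choice of divergence to the choice of supervised tokens.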
- SG-LoRA: Semantic-guided LoRA Parameters Generation
-
SG-LoRA utilizes a textual task description as a "semantic bridge" to perform weighted aggregation of task semantics from a set of pre-trained expert LoRAs. It then directly samples and generates target LoRA parameters using a Conditional VAE (CVAE). This enables fine-tuning-free real-time model adaptation under conditions where no target task data is available and the task space is open, achieving or even surpassing the performance of task-specific fine-tuning (Oracle) in image-text retrieval.
- Streamlined Knowledge Distillation
-
This paper points out that the increasing complexity of recent logit distillation (stacking multiple knowledge alignments and relationship modeling) leads to redundant objectives and improper losses. It proposes a minimalist approach, SKD, which transfers only two types of knowledge: "instance-level" semantics via KL divergence and "direction-level" relations via the Gram matrix of normalized logits. For the latter, a Mahalanobis distance loss stabilized by Tikhonov regularization and Cholesky decomposition is designed (provably equivalent to the L2 norm in a covariance-whitened space). SKD not only outperforms all logit distillation methods but also exceeds feature distillation on CIFAR-100/ImageNet/COCO, while being the fastest to train.
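The equivalence mentioned above, between a Mahalanobis distance and an L2 norm in a whitened space, can be illustrated with a small stdlib-only sketch. It assumes the regularized covariance is \(\Sigma + \lambda I\) with Cholesky factor \(L\), so that \(d^\top(\Sigma + \lambda I)^{-1} d = \lVert L^{-1} d \rVert_2^2\). The function names and the default `lam` are hypothetical, and this is not claimed to be the exact loss used by SKD.

```python
import math


def cholesky(a):
    """Lower-triangular L with L @ L^T == a, for a symmetric positive-definite a."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                low[i][j] = math.sqrt(a[i][i] - s)
            else:
                low[i][j] = (a[i][j] - s) / low[j][j]
    return low


def forward_solve(low, b):
    """Solve low @ y = b for lower-triangular `low` by forward substitution."""
    y = []
    for i in range(len(b)):
        s = sum(low[i][k] * y[k] for k in range(i))
        y.append((b[i] - s) / low[i][i])
    return y


def mahalanobis_sq(diff, cov, lam=1e-3):
    """Squared Mahalanobis distance of `diff` under (cov + lam * I)."""
    n = len(cov)
    # Tikhonov regularization: add lam to the diagonal so the factorization is stable.
    reg = [[cov[i][j] + (lam if i == j else 0.0) for j in range(n)]
           for i in range(n)]
    # Whitening: z = L^{-1} diff, so the distance is a plain squared L2 norm of z.
    z = forward_solve(cholesky(reg), diff)
    return sum(v * v for v in z)
```

Solving a triangular system avoids forming an explicit matrix inverse, which is the usual reason for pairing Cholesky decomposition with this kind of loss.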
- TALON: Test-time Adaptive Learning for On-the-Fly Category Discovery
-
This paper proposes TALON, the first test-time adaptive framework for On-the-Fly Category Discovery (OCD). By utilizing semantic-aware prototype updates, stable encoder adaptation, and margin-aware logit calibration, it abandons hash encoding to model directly in continuous feature space, significantly mitigating category explosion and substantially improving new category discovery accuracy.
- TAS-LoRA: Transformer Architecture Search with Mixture-of-LoRA Experts
-
Addressing the persistent "feature collapse caused by shared weights" issue in one-shot Transformer Architecture Search (TAS), TAS-LoRA attaches a set of LoRA experts to a frozen supernet. An LSTM router, taking "architecture configurations" as input, dynamically combines experts for each subnet to learn subnet-specific features. Group-wise router initialization forces experts to differentiate from the early training stages. This approach consistently improves AutoFormer search results on ImageNet by 0.2–1.0 points across various scales with zero additional inference overhead.
- TaskIT: Memory-Efficient Fine-Tuning of Multi-LoRA LLMs via Cross-Task Importance Transfer
-
TaskIT adapts Multi-LoRA LLMs to new tasks on memory-constrained edge devices: it predicts the importance of each candidate LoRA position using "Cross-Task Transfer" without training any new modules, accurately estimates activation memory on Transformers using a "Block-level Memory Predictor," and finally selects LoRA positions, quantities, and ranks within a memory budget using a dynamic programming scheduler, achieving a superior accuracy-memory trade-off compared to Zero-FT, non-LoRA, and existing LoRA fine-tuning methods.
- Teacher-Guided Routing for Sparse Vision Mixture-of-Experts
-
A "teacher router" constructed from a frozen dense teacher model generates a stable expert assignment distribution. KL distillation is used to supervise the sparse MoE student's router, mitigating the "gradient only for selected experts" issue that causes routing fluctuation in early training. This achieves stable performance gains on ImageNet-1K / CIFAR-100 with zero additional inference overhead.
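The router-distillation term described above can be sketched as a KL divergence between two softmax distributions over experts. This is a minimal illustration under assumptions: both routers are taken to emit one logit per expert, and the names `softmax` and `router_kl` are hypothetical. How the teacher router is actually built from the dense model is not covered here.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def router_kl(teacher_logits, student_logits):
    """KL(teacher || student) over the expert assignment distribution.

    Every student logit appears in the softmax normalizer, so this term
    depends on all experts, not only the top-k experts the student selects.
    Assumes logit gaps are moderate enough that no student probability
    underflows to zero.
    """
    p = softmax(teacher_logits)
    q = softmax(student_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
```

Because the loss touches every expert logit, it provides a dense training signal to the router even for experts that were not selected for a given token.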
- Test-time Sparsity for Extreme Fast Action Diffusion
-
This paper proposes "test-time sparsity," utilizing a lightweight pruner with a shared encoder to dynamically predict residual blocks that can be pruned during each forward pass. Combined with an "omni-reuse" strategy that organizes historical features into a 3D lattice, it achieves 95% sparsity, a 92% reduction in FLOPs, and a 5× actual speedup in robotic action diffusion. This increases the inference frequency from 6Hz to 47.5Hz without a drop in success rate.
- ThinkingViT: Matryoshka Thinking Vision Transformer for Elastic Inference
-
ThinkingViT integrates a progressive mechanism—"predict quickly with fewer heads, rethink by expanding the sub-network if uncertain"—into a nested ViT. By utilizing Token Recycling to feed features from previous stages into subsequent rounds, it outperforms nested baselines like MatFormer and HydraViT on ImageNet-1K by up to 2.0 p.p. under equivalent throughput.
- Towards Generalizable AI-Generated Image Detection via Image-Adaptive Prompt Learning
-
This paper proposes Image-Adaptive Prompt Learning (IAPL), which dynamically adjusts the prompts of the CLIP encoder for each test image during inference. By integrating test-time token tuning and a conditional information learner, it achieves strong generalization to unseen generators, reaching state-of-the-art (SOTA) performance with average accuracies of 95.61% and 96.7% on UniversalFakeDetect and GenImage, respectively.
- Towards Unified Human Perception and Machine Understanding: Token Flow Guided Compression Framework
-
TFGC compresses images into 1D token sequences and utilizes the "token flow" phenomenon for variable bitrate masking combined with conditional Gaussian prediction to reconstruct missing tokens. Through a semantic guidance module, Large Vision-Language Models (LVLMs) directly consume the compressed tokens without decoding back to pixels, achieving superior human perception and machine understanding (caption/grounding/VQA) at ultra-low bitrates (0.02–0.06 bpp).
- Trainable Log-linear Sparse Attention for Efficient Diffusion Transformers
-
LLSA extends the "single-level coarse selection" of Top-K sparse attention into a "multi-level coarse-to-fine" hierarchical structure, reducing the complexity of both the block selection and attention phases from \(O(N^2)\) to log-linear. Combined with a sparse indexing backpropagation kernel that avoids constructing dense masks, it achieves a 28.27× inference speedup and 6.09× training speedup on 256×256 pixel DiTs without degrading generation quality.
- TWEO: Transformers Without Extreme Outliers Enables FP8 Training And Quantization For Dummies
-
This paper proposes TWEO, a "non-intrusive" method that reduces extreme activation outliers in Transformers from 10,000+ to below 20 using a single regularization loss term. By utilizing counterfactual experiments and SVD analysis, it proves that extreme outliers are "mechanical artifacts" of weight collinearity rather than being data-driven. Consequently, an \(L_p\) loss is designed to directly penalize activation magnitudes, enabling full-model FP8 pre-training (without complex mixed-precision engineering or architectural changes) to converge stably at BF16 levels with a 36% throughput increase, while making simple per-tensor static quantization (including residual streams) feasible for the first time.
- Ultra-Fast Neural Video Compression
-
This paper proposes DCVC-UF, which introduces a "chunk coding" paradigm that encodes multiple frames into a single compact latent and decodes them back in parallel. By completely removing frame-by-frame motion estimation and utilizing frame-specific decoders and single-step entropy decoding, it achieves 371 encoding / 274 decoding FPS at 1080p on a 4090 GPU, while saving 42.2% bitrate compared to VTM(LD), advancing the SOTA in the rate-distortion-complexity trade-off for neural video coding.
- Ultra-Low Bitrate Perceptual Image Compression with Shallow Encoder
-
This paper proposes AEIC, an asymmetric extreme image compression framework. It theoretically demonstrates that "at ultra-low bitrates (<0.05 bpp), latent variable variance is naturally small, making heavy encoders unnecessary." Consequently, the encoding side is implemented as a shallow pixel-domain convolutional network with only 0.94M parameters, while all generative capacity is offloaded to a single-step diffusion decoder. Using dual-sided feature distillation to transfer knowledge from a moderate encoder to the shallow encoder, the method achieves 35.8 FPS real-time encoding on 1080P images—approximately 19x faster than similar extreme compression methods—while leading in perceptual metrics (LPIPS/DISTS/FID/KID).
- Understanding and Enforcing Weight Disentanglement in Task Arithmetic
-
This work proposes Task Feature Specialization (TFS) as a sufficient condition for weight disentanglement, revealing that its geometric consequence is the orthogonality of weight vectors. Based on this, the OrthoReg regularization method is introduced. By enforcing orthogonality among the column vectors of the weight update matrix during fine-tuning, OrthoReg promotes task vector disentanglement and significantly enhances the performance of various task arithmetic methods.
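One simple way to penalize non-orthogonal columns in a weight update matrix is to sum the squared pairwise dot products between columns, which is zero exactly when all columns are mutually orthogonal. The sketch below uses that form as an assumption. The exact OrthoReg objective (for example, whether it also constrains column norms) is not specified in the summary, and `ortho_penalty` is a hypothetical name.

```python
def ortho_penalty(delta_w):
    """Sum of squared dot products between distinct columns of delta_w.

    `delta_w` is a list of rows (a rows x cols matrix). The penalty is the
    squared Frobenius norm of the strictly upper-triangular part of the
    Gram matrix delta_w^T @ delta_w.
    """
    rows, cols = len(delta_w), len(delta_w[0])
    penalty = 0.0
    for a in range(cols):
        for b in range(a + 1, cols):
            dot = sum(delta_w[r][a] * delta_w[r][b] for r in range(rows))
            penalty += dot * dot
    return penalty
```

In a fine-tuning loop, a term like this would be added to the task loss with a small coefficient. The coefficient and the choice of which layers to regularize are further assumptions not taken from the summary.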
- UniComp: Rethinking Video Compression Through Informational Uniqueness
-
This paper proposes UniComp, a video token compression framework based on informational uniqueness (rather than attention). By utilizing frame group fusion, token allocation, and spatial dynamic compression, it maximizes the preservation of unique information across temporal, spatial, and global dimensions. It outperforms uncompressed baselines even when retaining only 10% of tokens.
- When Lines Meet Textures: Spatial-Frequency Aligned Diffusion Features for Cross-Sparsity Correspondence
-
To address the difficulty of establishing semantic keypoint correspondences between "sparse line sketches" and "texture-rich photos," this paper proposes SFA-DIFT. It first fine-tunes CleanDIFT via LoRA into a cross-modally unified "clean diffusion feature" to align the spatial domain, then utilizes a wavelet-based Low-Frequency Feature Aggregation (LoFFA) module to align the frequency domain. It achieves a new SOTA for PCK on the self-constructed MS-PSC6K benchmark.
- WPT: World-to-Policy Transfer via Online World Model Distillation
-
WPT proposes a world-to-policy transfer training paradigm. It injects future prediction knowledge from a world model into a teacher policy via a trainable reward model, and subsequently transfers this knowledge to a lightweight student policy through policy distillation and world reward distillation, achieving a 79.23 driving score (closed-loop) and a 4.9x inference speedup.