🔄 Self-Supervised Learning
📷 CVPR2026 · 89 paper notes
📌 Same area in other venues: 🔬 ICLR2026 (81) · 💬 ACL2026 (1) · 🧪 ICML2026 (28) · 🤖 AAAI2026 (16) · 🧠 NeurIPS2025 (33) · 📹 ICCV2025 (13)
🔥 Top topics: Continual Learning ×20 · Self-Supervised Learning ×8 · Adversarial Robustness ×7 · Few-/Zero-Shot Learning ×5 · Layout & Composition ×3
- A Faster Path to Continual Learning
-
To address the slowness of the C-Flat optimizer, which computes three additional gradients per step, this paper identifies "direction-invariant" components within the first-order flatness gradients. These components are cached and reused in subsequent steps to skip redundant perturbation-gradient calculations. Combined with a linear scheduler that gradually increases the skip interval as tasks progress and an adaptive trigger based on gradient statistics, C-Flat Turbo achieves a 1.0×–1.25× speedup over C-Flat (recovering throughput from ~27% to ~60%) while maintaining or even slightly improving accuracy.
- AdaPrior: Bayesian-Inspired Adaptive Prior Correction for Long-Tailed Continual Learning
-
AdaPrior reinterprets Long-Tailed Continual Learning (LTCIL) as a "model-induced prior drift" problem. It uses an exponential moving average (EMA) to estimate the model's self-learned prior \(P_m(y)\) online, then applies Bayesian alignment for debiasing in both the training loss and inference-time post-processing. This single-stage, plug-and-play approach consistently outperforms recent LTCIL baselines on CIFAR100-LT, ImageNet-subset-LT, and iNaturalist18-subset.
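
As a rough illustration of the EMA prior estimate and debiasing step described above, the following minimal sketch tracks a running class prior from softmax outputs and subtracts its log from the logits. The function names, the momentum value, and the log-prior subtraction itself are assumptions made for illustration; the summary does not specify AdaPrior's exact alignment rule.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def update_prior(prior, batch_logits, momentum=0.9):
    """EMA update of the model-induced class prior from one batch of logits.

    `momentum` is a hypothetical hyperparameter, not taken from the paper.
    """
    n_classes = len(prior)
    batch_mean = [0.0] * n_classes
    for logits in batch_logits:
        for k, p in enumerate(softmax(logits)):
            batch_mean[k] += p / len(batch_logits)
    return [momentum * q + (1.0 - momentum) * b for q, b in zip(prior, batch_mean)]

def debias_logits(logits, prior, eps=1e-12):
    """Subtract the log prior so over-predicted classes are penalised (assumed form)."""
    return [z - math.log(q + eps) for z, q in zip(logits, prior)]

# Usage: a batch that favours class 0 shifts the running prior toward class 0.
prior = update_prior([1 / 3, 1 / 3, 1 / 3], [[4.0, 0.0, 0.0], [3.0, 1.0, 0.0]])
adjusted = debias_logits([1.0, 1.0, 1.0], prior)
```

With equal raw logits, the adjusted score for the over-predicted class ends up lowest, which is the intended direction of the correction.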
- An Optimal Transport-driven Approach for Cultivating Latent Space in Online Incremental Learning
-
This paper proposes an online mixture model learning framework (MMOT) based on Optimal Transport theory. By maintaining multiple adaptive centroids for each category, it precisely characterizes the multimodal nature of online data streams. Combined with a dynamic preservation strategy to enhance category discriminability, it effectively mitigates catastrophic forgetting in Online Class-Incremental Learning (OCIL).
- Assignment-Driven Hash Learning in a Hyper-Semantic Space for On-the-Fly Category Discovery
-
To address the critical issues of "feature-to-hash cascade degradation" and "known-class monopoly in the representation space" in On-the-fly Category Discovery (OCD), this paper constructs a hyper-semantic space comprising "derived subspaces" and "calibrated subspaces" to simultaneously characterize intra-class diversity and reserve space for new categories. Assignment-driven hash learning, featuring "soft prototype assignment + binary hash regularization," is then performed within this space. As a plug-and-play module for SMILE/PHE, it achieves an average All accuracy improvement of approximately 12.78% on six fine-grained datasets (based on SMILE).
- Beyond Binary Contrast: Modeling Continuous Skeleton Action Spaces with Transitional Anchors
-
To address the issues of isolated clusters and rigid boundaries caused by "binary contrast" in self-supervised skeleton action recognition, TranCLR synthesizes "transitional anchors" as manifold regularization terms between actions and reshapes the representation space from discrete point clouds into continuous smooth manifolds using three-level geometric manifold calibration. It achieves SOTA across linear evaluation, transfer learning, and retrieval on NTU/PKU-MMD, while reducing the Expected Calibration Error (ECE) from ~5.6% to 0.65%.
- Beyond Myopic Alignment: Lookahead Optimization for Online Class-Incremental Learning
-
Addressing the issue of forgetting caused by "conflict between current task gradients and replay gradients" in online class-incremental learning, this paper theoretically reveals that hypergradient methods essentially align task gradients to a shared meta-objective but are "myopic" as they only consider the current step. Consequently, it proposes LOR: before updating, it explores multiple future model states along a set of "plasticity-stability" trade-off directions, then optimizes the worst-case direction using a Log-Sum-Exp softened min-max objective. This pushes the model toward flatter, more forgetting-resistant regions, outperforming SOTA on Seq-CIFAR10/100 and Seq-TinyImageNet.
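
The Log-Sum-Exp softening of a worst-case objective mentioned above can be sketched in a few lines. This is a generic smooth-maximum, not LOR's actual implementation; `soft_worst_case` and `temperature` are hypothetical names, and the per-direction losses are assumed to be already computed.

```python
import math

def soft_worst_case(losses, temperature=10.0):
    """Smooth maximum: (1/t) * log(sum(exp(t * loss_i))), computed stably.

    As `temperature` grows, the value approaches max(losses); smaller values
    spread gradient signal over all candidate directions.
    """
    m = max(losses)
    return m + math.log(sum(math.exp(temperature * (l - m)) for l in losses)) / temperature

# Usage: losses of three hypothetical lookahead directions.
value = soft_worst_case([0.2, 0.9, 0.5])
```

The result always lies between `max(losses)` and `max(losses) + log(n) / temperature`, so the temperature controls how closely the objective tracks the hardest direction.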
- Beyond the Static World: Continual Category Discovery under Visual Drift
-
Addressing the realistic scenario where "unlabeled data streams both introduce new categories and originate from unfamiliar domains," this paper proposes the OCCD task. It introduces a three-component framework—"Optimal Transport for automatic separation of known/unknown samples → Adversarial Alignment of known class prototypes → Frequency-domain augmentation for category topological consistency"—achieving new SOTA performance in both new category discovery and old category recognition on DomainNet and SSB-C.
- Can You Learn to See Without Images? Procedural Warm-Up for Vision Transformers
-
Before a ViT officially processes images, a lightweight masked-token pre-training (warm-up) is conducted using purely symbolic sequences without any visual content (e.g., "balanced parentheses") generated by formal grammars. This forces the model to internalize universal computational mechanisms such as stack-based hierarchy and long-range dependencies. When followed by standard image training, this approach achieves a +1.72% top-1 gain on ImageNet-1K with only a 1% training budget expansion, effectively substituting for 28% of image data.
- CHEEM: Continual Learning by Reuse, New, Adapt and Skip -- A Hierarchical Exploration-Exploitation Approach
-
This paper proposes the CHEEM framework, which automatically learns task-aware dynamic ViT backbones via NAS with Hierarchical Exploration-Exploitation (HEE) sampling, selecting from four operations at each layer: Reuse, New, Adapt, and Skip. It significantly outperforms prompt-based methods on the MTIL and VDD benchmarks, approaching the upper bound of full fine-tuning.
- Chain-of-Models Pre-Training: Rethinking Training Acceleration of Vision Foundation Models
-
This paper proposes Chain-of-Models Pre-Training (CoM-PT), which arranges Vision Foundation Models (VFMs) by size into a "model chain." It achieves lossless pre-training acceleration through small-to-large inverse knowledge transfer (weight initialization + feature distillation), where training efficiency improves as the scale of the model family grows.
- CUE: Concept-Aware Multi-Label Expansion to Mitigate Concept Confusion in Long-Tailed Learning
-
Addressing "concept confusion" (where tail samples are misclassified into semantically related classes) during foundation model fine-tuning for long-tailed recognition, CUE provides instance-level semantic cues via zero-shot CLIP and class-level cues via LLMs. By supervising these related classes as positive labels using two Binary Logit-Adjustment (BLA) auxiliary losses, CUE preserves the inter-class relationships learned during pre-training, yielding significant gains for tail classes across four long-tailed benchmarks.
- D2Dewarp: Dual Dimensions Geometric Representation Learning Based Document Image Dewarping
-
This paper introduces D2Dewarp, the first dewarping method to learn document geometric representations from both horizontal and vertical dimensions. A UNet with dual decoders predicts horizontal lines (top/bottom boundaries of text lines/tables) and vertical lines (left/right boundaries) separately. The HV Fusion Module integrates features from both directions via mixed attention. Additionally, the DocDewarpHV dataset, containing 114K images with dual-dimension annotations, is constructed to support this framework.
- DDSF: Robust Few-Shot Learning via Disentangled Subspaces with Determinantal Point Process
-
To address prototype drift caused by noise/hard positive contamination in few-shot support sets, DDSF utilizes the Determinantal Point Process (DPP) to unify a "Filter-Repair-Expand" pipeline: first, suspicious samples are identified via DPP probabilistic inference rather than being discarded; second, they are "repaired" into effective features using a DPP volume-gradient-guided diffusion process; finally, class representations are expanded from fragile mean points into disentangled shared/unique subspaces. On the Meta-Dataset with OOD contamination, DDSF improves accuracy from the previous SOTA of 47.0% to 61.6% under 70% noise.
- Decision Boundary-aware Generation for Long-tailed Learning
-
Supplementing long-tail data with "diffusion models + head-to-tail feature transfer" implicitly leaks head-class features into tail classes and blurs decision boundaries. This paper first quantifies this "boundary ambiguity" using three metrics. It then proposes DBG, which uses adversarial de-classification noise to push samples near the decision boundary, relabels them with the \(k\) most confusable classes, and applies a classifier-driven dual-path cleaning step to discard harmful samples. On CIFAR-LT, DBG consistently reduces inter-class overlap and improves tail-class and overall accuracy for all generative baselines.
- Decouple Your Discovery and Memory in Continual Generalized Category Discovery
-
Addressing the limitation in Continual Generalized Category Discovery (C-GCD) where "over-protection of old classes to prevent forgetting crushes the discovery of new classes," this paper proposes the DYDM dual-branch framework. It utilizes a discovery branch to recognize new classes without constraints and a memory branch using backprop-free Recursive Least Squares (RLS) analytical classifiers to stably retain all old classes. Coupled by a knowledge rehearsal distillation loop, the method achieves significant improvements in both new class and overall accuracy across four benchmarks (CAA improves by 3.2–9.9% over SOTA Happy).
- DGS: Dual Gradient and Semantic-Shift Guided Low-Rank Adaptation for Class Incremental Learning
-
To address the issue where orthogonal gradient constraints are "too rigid and suppress plasticity" when using pre-trained models + LoRA for Class Incremental Learning, DGS replaces hard orthogonal constraints with an interpolated fusion gradient (original gradient \(\oplus\) gradient projected onto the pre-trained subspace). Combined with semantic-shift calibrated unified classifier alignment and a patch-token alignment loss, it outperforms existing PEFT-CIL methods across six standard benchmarks.
- DiverseDiT: Towards Diverse Representation Learning in Diffusion Transformers
-
Systematic analysis reveals that representation diversity among DiT blocks is a key factor for effective learning. This paper proposes DiverseDiT: using long residual connections to diversify inputs and a representation diversity loss to explicitly promote feature differentiation between blocks, accelerating convergence and improving generation quality without external guidance models.
- Dual-Estimator: Decoupling Global and Local Semantic Shift for Drift Compensation in Class-Incremental Learning
-
To address the unrealistic assumption in Exemplar-Free Class-Incremental Learning (EFCIL) that "semantic distribution and drift are uniform," this paper proposes a Mixture-of-Experts (MoE) estimator (modeling local semantic shift) and a Low-Rank (LoR) estimator (modeling global semantic shift) for decoupled compensation. Both estimators are updated via closed-form solutions within a few iterations and can be integrated into existing methods as plug-and-play modules, consistently outperforming current SOTA methods across six datasets.
- Easy2Hard: From Partially to Fully Unmatched Modalities as Negative Samples in Contrastive Learning
-
When the number of modalities \(M>2\), in-batch negative samples naturally vary in difficulty based on how many non-anchor modalities they share with the positive sample. Easy2Hard explicitly splits negative samples into "partially unmatched (easy)" and "fully unmatched (hard)" categories. It uses a sigmoid curriculum curve to smoothly shift the training weight from easy negatives to hard negatives as training progresses. This method consistently outperforms Symile / CLIP-Pairwise in zero-shot retrieval across five multimodal datasets.
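
A sigmoid curriculum of the kind described above can be sketched as follows. The `steepness` and `midpoint` parameters and both function names are illustrative assumptions; the summary states only that a sigmoid curve shifts weight from easy to hard negatives over training.

```python
import math

def hard_negative_weight(step, total_steps, steepness=10.0, midpoint=0.5):
    """Weight on fully unmatched (hard) negatives, rising smoothly from ~0 to ~1."""
    progress = step / total_steps
    return 1.0 / (1.0 + math.exp(-steepness * (progress - midpoint)))

def negative_weights(step, total_steps):
    """Return (easy_weight, hard_weight); the two always sum to 1."""
    w_hard = hard_negative_weight(step, total_steps)
    return 1.0 - w_hard, w_hard

# Usage: early in training the easy (partially unmatched) negatives dominate.
w_easy, w_hard = negative_weights(step=25, total_steps=100)
```

In a contrastive loss, these two scalars would scale the contributions of the easy and hard negative terms respectively.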
- Exemplar-Free Class Incremental Learning via Preserving Class-Discriminative Structure
-
This paper identifies that the essence of catastrophic forgetting in Exemplar-Free Class Incremental Learning (EFCIL) is the collapse of "class-discriminative structure." It proposes the Adaptive Prototype Calibration (APR) to correct the mean and covariance of old class prototypes (preserving intra-class structure) and the Structural Consistency Constraint (SCC) to maintain angular relationships between new samples and old prototypes (preserving inter-class structure). The method outperforms existing approaches such as SSIAT and SLCA across six benchmarks, with particularly significant gains on fine-grained datasets.
- Exemplar-Free Continual Learning for State Space Models
-
This paper proposes Inf-SSM, a geometry-aware, exemplar-free regularization method that encodes the "infinite-time behavior" of SSMs (e.g., Vim/Mamba) as a point on an extended observability subspace. By constraining the distance between the subspaces of new and old tasks on an infinite-dimensional Grassmann manifold and reducing the computational cost from \(\mathcal{O}(n^3)\) to \(\mathcal{O}(n^2)\), the method serves as a plug-and-play module that improves average accuracy (AA) by 8.31% and reduces forgetting (FM) by 9.36% for existing continual learning methods.
- Franca: Nested Matryoshka Clustering for Scalable Visual Representation Learning
-
Franca is the first fully open-source (data + code + weights + intermediate checkpoints) visual foundation model. Built on the DINOv2 framework, it introduces "nested Matryoshka multi-head clustering" to refine semantics layer-by-layer along feature dimensions, utilizes CyclicMask to balance mask spatial distribution, and employs RASA post-training to decouple absolute position information from dense features. Using only public data, it matches or surpasses closed-source models like DINOv2 and SigLIP 2 in segmentation, OOD detection, and 3D understanding.
- Free-Grained Hierarchical Visual Recognition
-
This paper proposes "free-grained" hierarchical visual recognition, which allows training labels to appear at any level of the taxonomy. It introduces text-guided pseudo-attributes and taxonomy-guided semi-supervised learning to compensate for missing supervision; during inference, the model adaptively selects its prediction depth.
- From Few-way to Many-way: Rethinking Few-shot Fine-grained Image Classification
-
This paper points out that existing Few-shot Fine-grained Classification (FSFG) methods are trained and evaluated only in "few-class" scenarios (e.g., 5-way), failing significantly when faced with "many-way" settings. The authors decompose the causes of this failure into three actionable guiding principles using a generalization bound based on the Class Discriminative Index. Accordingly, they propose SCEG—featuring multi-layer self and collaborative feature enhancement plus an episodic/global dual-scale Intra-Inter Loss—which achieves significant leads across 4 datasets in both few-way and the newly proposed many-way settings.
- GaussianMatch: Semi-Supervised Regression with Pseudo-Label Filtering via Multi-View Gaussian Consistency
-
Addressing the challenge in semi-supervised regression (SSR) where continuous outputs lack confidence scores and low-quality pseudo-labels contaminate training, GaussianMatch utilizes the Gaussian consistency of predictions from multiple weakly-augmented views of the same sample as a proxy for pseudo-label reliability. It retains only those samples where all views fall within a confidence interval and employs Bayesian variance smoothing to prevent over-filtering. Under the extreme scarcity of 30 labels on UTKFace, it reduces MAE by 15.36% and improves \(R^2\) by 50.21%.
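
The filtering idea above can be illustrated with a small sketch: a pseudo-label is kept only if every view's prediction falls inside an interval around the multi-view mean, with the interval width derived from a variance that is shrunk toward a prior. The shrinkage formula, `prior_var`, `prior_strength`, and `z` are all assumptions made for illustration, since the summary does not give GaussianMatch's exact smoothing rule.

```python
import math
from statistics import fmean, pvariance

def filter_pseudo_label(view_preds, prior_var, prior_strength=4.0, z=1.5):
    """Return (keep, pseudo_label) for one unlabeled sample.

    view_preds: regression outputs for several weakly augmented views.
    prior_var:  a reference variance (e.g. a batch-level estimate), assumed given.
    """
    n = len(view_preds)
    mean = fmean(view_preds)
    sample_var = pvariance(view_preds, mu=mean)
    # Shrink the per-sample variance toward the prior so that a tiny sample
    # variance does not reject nearly consistent views (assumed smoothing form).
    smoothed_var = (n * sample_var + prior_strength * prior_var) / (n + prior_strength)
    half_width = z * math.sqrt(smoothed_var)
    keep = all(abs(p - mean) <= half_width for p in view_preds)
    return keep, mean

# Usage: consistent views are kept; widely scattered views are rejected.
print(filter_pseudo_label([30.0, 30.5, 29.8, 30.2], prior_var=1.0))
print(filter_pseudo_label([20.0, 35.0, 28.0, 41.0], prior_var=1.0))
```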
- Geometry-driven OOD Detectors Are Class-Incremental Learners
-
GOD treats "each task classifier head possessing both IND recognition and OOD rejection capabilities" as a sufficient condition for Class-Incremental Learning (CIL). By replacing learnable classifier heads with fixed Equiangular Tight Frame (ETF) anchors and utilizing ETF loss (inter-class separation) along with ArcFace loss (intra-class compactness), it unifies "classification" and "uncertainty estimation" within a shared geometric space. This transforms cross-task routing from a fragile Task-ID predictor into a naturally emerging OOD decision, achieving SOTA performance across four benchmarks.
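
For reference, fixed ETF anchors are commonly built with the standard simplex ETF construction, sketched below in plain Python. Whether GOD uses exactly this construction (for example, with an additional rotation into the feature dimension) is not stated in the summary, so treat it as an assumption; `simplex_etf` and `dot` are illustrative names.

```python
import math

def simplex_etf(num_classes):
    """Rows are unit-norm anchors with pairwise cosine -1 / (num_classes - 1).

    Uses M = sqrt(K / (K - 1)) * (I - (1/K) * ones), the usual simplex ETF form.
    """
    k = num_classes
    scale = math.sqrt(k / (k - 1))
    return [
        [scale * ((1.0 if i == j else 0.0) - 1.0 / k) for j in range(k)]
        for i in range(k)
    ]

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

# Usage: five fixed anchors, maximally and equally separated.
anchors = simplex_etf(5)
cosine_01 = dot(anchors[0], anchors[1])
```

Because the anchors are fixed, only the feature extractor is trained to move each class toward its anchor, which is what allows every task head to share one geometric space.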
- HAD: Heterogeneity-Aware Distillation for Lifelong Heterogeneous Learning
-
This paper extends lifelong learning from "homogeneous task streams" to "heterogeneous task streams" (LHL) and instantiates it in dense prediction scenarios (LHL4DP). It proposes HAD (Heterogeneity-Aware Distillation), an exemplar-free method that utilizes a frozen teacher to generate pseudo-labels for self-distillation. Two complementary terms, Distribution-Balanced HAD (DB-HAD) and Saliency-Guided HAD (SG-HAD), are introduced to alleviate category/numerical imbalance in pseudo-labels and the loss of boundary information. The method significantly outperforms existing lifelong learning approaches on CityScapes, NYUv2, and Taskonomy.
- Harnessing the Power of Foundation Models for Accurate Material Classification
-
To address the scarcity of material classification labels, this paper constructs a balanced synthetic dataset of 21 categories using "diffusion model generation + semantic grounding auto-labeling." It then employs a two-stream fusion of "frozen DINOv2 vision stream + GPT-4V/CLIP language stream" for classification. It achieves 89% accuracy on FMD, outperforming the dedicated SOTA (MatSim) by 33%.
- HCL-FF: Hierarchical and Contrastive Learning for Forward-Forward Algorithm
-
Addressing the "lack of cross-layer coordination in layer-wise independent training" and the "semantic collapse of features after goodness decoupling" in the Forward-Forward (FF) algorithm, HCL-FF introduces "coarse-to-fine hierarchical supervision" and "supervised contrastive learning on decoupled features" as two local objectives for each layer. Without breaking the layer-wise independence of FF, it improves CIFAR-100 accuracy from 53.09% to 70.09% (+17.00%), setting a new SOTA for FF-based methods.
- Hier-COS: Making Deep Features Hierarchy-aware via Composition of Orthogonal Subspaces
-
This paper proposes the Hier-COS framework, which constructs a theoretically guaranteed hierarchy-aware vector space (HAVS) by assigning orthogonal basis vectors to each node in a hierarchy tree. This work unifies "hierarchy-aware fine-grained classification" and "hierarchical multi-level classification" for the first time while introducing a new evaluation metric, HOPS, consistently outperforming the previous SOTA across four datasets.
- How Much 3D Do Video Foundation Models Encode?
-
The authors propose the first model-agnostic probing framework, using "frozen video foundation model features + shallow feed-forward heads predicting 3D point clouds/depth/camera poses" to quantify the internal 3D understanding of various video models. The conclusion is that leading video generation models trained only on 2D videos (such as WAN2.1-14B) exhibit strong emergent 3D perception, even surpassing expert models trained specifically on 3D data (e.g., Fast3R) in cross-domain scenarios.
- HyCal: A Training-Free Prototype Calibration Method for Cross-Discipline Few-Shot Class-Incremental Learning
-
This paper identifies the "Domain Gravity" bias in heterogeneous domain continual learning—where data-rich or low-entropy domains exert disproportionate influence in the shared embedding space. It proposes HyCal, a training-free method that calibrates prototypes by fusing cosine similarity and Mahalanobis distance, achieving robust classification in cross-discipline imbalanced few-shot incremental learning.
- Is Parameter Isolation Better for Prompt-Based Continual Learning?
-
Addressing the mainstream "one set of prompts per task" paradigm in prompt-based continual learning, this paper proposes a Hash framework utilizing a shared prompt pool + task-aware sparse gating. It introduces a modulator based on historical activation statistics to simultaneously suppress abused prompts and protect essential ones, consistently outperforming static allocation methods across 4 class-incremental benchmarks with higher parameter efficiency.
- Learning by Analogy: A Causal Framework for Compositional Generalization
-
This paper formalizes "compositional generalization by analogy"—a form of human cognition—into a latent hierarchical generative process using causal language (modularity + principle of minimal change). It proves that this structure supports compositional generalization for complex conceptual interactions and can be identifiably recovered from image-text pairs. Based on this, it interprets diffusion timesteps as conceptual hierarchies to develop HierDiff, which improves performance on DPG-Bench from ELLA's 74.91 to 79.28.
- Learning Eigenstructures of Unstructured Data Manifolds
-
This paper moves away from the traditional "select operator, discretize, then decompose" pipeline. Instead, it uses a neural network to directly learn spectral bases from unstructured data of arbitrary dimensions (point clouds, image manifolds). Rooted in optimal approximation theory, the network simultaneously recovers spectral bases, implicit metrics (sampling density), and eigenvalues by minimizing the reconstruction error of probe functions, approximating the cotangent Laplacian oracle on 3D surfaces while scaling to high-dimensional image manifolds.
- Learning from Semantic Dictionaries: Discriminative Codebook Contrastive Learning for Unified Visual Representation and Generation
-
LEASE utilizes a pair of "Generative Codebook + Discriminative Codebook" to encode images offline into two aligned sequences of discrete tokens. A single encoder is then trained using both "Masked Reconstruction" and "Codebook Contrastive" objectives. This allows the same latent space to achieve both high-quality generation and strong discriminative power without data augmentation, online tokenizers, or distillation from frozen teacher models. It achieves a new SOTA in unified SSL on ImageNet-1K, with training speeds 48.7% faster than MAGE and 8.75% faster than Sorcen.
- Learning Like Humans: Analogical Concept Learning for Generalized Category Discovery
-
This paper proposes the AL-GCD framework, which simulates human analogical reasoning through an "Analogical Textual Concept Generator" (ATCG). The ATCG generates textual concepts for unknown samples by analogy from a vision-language knowledge base of known categories, transforming category discovery into a joint vision-language reasoning task. It achieves an average improvement of 5.0% across six benchmarks and 7.1% on fine-grained datasets.
- Learning to See Through a Baby's Eyes: Early Visual Diets Enable Robust Visual Intelligence in Humans and Machines
-
The authors encode three laws of infant visual development—grayscale to color, blurry to clear, and temporal continuity—into a "visual diet" for self-supervised training called CATDiet. Training SSL models solely on object-centric videos yields more robust recognition of corrupted images, shape bias, and depth perception across ten datasets. Furthermore, the models spontaneously demonstrate developmental signals consistent with macaque V1 synapse density and infant visual cliff behavior. A two-stage "CombDiet" is proposed as a warm-up for standard SSL, consistently outperforming conventional SSL.
- Measure The Feature Universe: Topology-based Pseudo Labeling and Gravity Consistency for Source-Free Domain Adaptation
-
Addressing Source-Free Domain Adaptation (SFDA), this paper models the target feature space as a "feature universe" with virtual feature padding. It propagates reliable pseudo-labels along a cosine k-NN graph via feature traversal and proposes "Gravity Consistency" regularization—using the similarity between weak and strong augmented features to modulate the strength of logit consistency. This approach consistently outperforms prior SFDA methods on Office-Home, DomainNet-126, and VisDA-C.
- MemFlow: A Lightweight Forward Memorizing Framework for Quick Domain Adaptive Feature Mapping
-
MemFlow proposes a "forward memorizing framework" inspired by the brain's memory mechanism that entirely bypasses backpropagation. By freezing the backbone and using randomly connected neurons to record feature-label associations as Gaussian distributions—retrieved and fused based on confidence—it enables rapid on-device domain adaptation. It achieves up to a 10% improvement across four cross-domain datasets while consuming less than 1% of the time required by traditional domain adaptation methods.
- MOMO: Mars Orbital Model — Foundation Model for Mars Orbital Applications
-
MOMO is the first foundation model for Mars remote sensing. It pre-trains MAEs independently on three Mars sensors (HiRISE/CTX/THEMIS) and proposes an Equal Validation Loss (EVL) checkpoint selection strategy for model fusion. It outperforms ImageNet pre-training and Earth Observation foundation models across 9 downstream tasks in Mars-Bench.
- NitroGen: An Open Foundation Model for Generalist Gaming Agents
-
NitroGen treats "controller input overlays used by players in livestreams" as natural action labels. It automatically extracts (frame, action) pairs from 40,000 hours of public videos covering 1,000+ games. By training a single vision-action Transformer using flow-matching, the model can directly play various 2D/3D games. Pre-trained weights provide a maximum relative success rate improvement of 52% when fine-tuned on unseen games.
- Nonparametric Deep Fine-grained Clustering with Low-Rank Guided Vision-Language Model
-
A frozen VLM is utilized as a "teacher" to reformulate low-rank compression in unsupervised fine-grained clustering as a top-k selection task. Combined with perturbed instance contrast and cluster centroid orthogonal constraints, these elements are integrated into a Dirichlet Process Variational Inference framework. This approach simultaneously learns representations and automatically infers the number of clusters, achieving SOTA results on fine-grained benchmarks such as CUB, Dogs, Flower, and Pet.
- On the Role of Temporal Granularity in the Robustness of Spiking Neural Networks
-
This paper revisits the robustness of Spiking Neural Networks (SNNs) from the perspective of "temporal granularity" (individual time steps) rather than "temporal averaging." It proposes TG-Attack, which constructs perturbations step-by-step (stronger attack), and defines the Temporal Sensitivity Value (TSV) using the Hessian of the per-step input-output gradient to estimate robustness without generating adversarial samples. Based on this, it designs a regularization term TG-Reg to constrain the TSV at each time step, consistently surpassing existing SOTA defenses across multiple datasets and networks.
- OpenVision 2: A Family of Generative Pretrained Visual Encoders for Multimodal Learning
-
OpenVision 2 removes the text encoder and contrastive loss from the previous generation (OpenVision), retaining only the "image encoder + text decoder" for pure generative caption-only pretraining. By randomly masking approximately 2/3 of visual tokens, it reduces ViT-L/14 training time by ~1.5× and memory usage by ~1.8× with almost no performance degradation, while enabling scaling the visual encoder up to 1 billion parameters.
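
The random token masking described above amounts to keeping a random third of the visual tokens. The sketch below shows that selection step on a plain list; `keep_visible_tokens`, `mask_ratio`, and `seed` are hypothetical names, and the actual OpenVision 2 pipeline operates on tensors rather than Python lists.

```python
import random

def keep_visible_tokens(tokens, mask_ratio=2 / 3, seed=None):
    """Drop roughly `mask_ratio` of the tokens uniformly at random.

    The surviving tokens keep their original order, so positional
    information remains meaningful downstream.
    """
    rng = random.Random(seed)
    n_keep = max(1, round(len(tokens) * (1 - mask_ratio)))
    kept_indices = sorted(rng.sample(range(len(tokens)), n_keep))
    return [tokens[i] for i in kept_indices]

# Usage: 12 stand-in visual tokens, of which about one third survive.
visible = keep_visible_tokens(list(range(12)), seed=0)
```

Since the decoder then processes only about one third of the visual tokens, this is a plausible source of the reported time and memory savings.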
- PAF: Perturbation-Aware Filtering for Open-Set Semi-Supervised Learning
-
PAF distills the phenomenon that "OOD samples exhibit more unstable representations under semantic-preserving perturbations" into a representation-level filtering signal. It employs Otsu's adaptive thresholding to dynamically exclude open-set (OOD) samples from unlabeled data. Combined with a two-stage training framework, it achieves SOTA performance in both seen-class classification accuracy and OOD detection AUC on benchmarks like MNIST, CIFAR, and TinyImageNet.
- Parameter-efficient Continual Learning for Enhancing Plasticity without Forgetting under Limited Model Capacity
-
GRAPA is a parameter-efficient continual learning method designed for "capacity-limited" scenarios. It first identifies safely reusable frozen parameters from old tasks using gradient direction consistency, and then adaptively determines a "just enough" pruning rate for each new task via A2C reinforcement learning. This significantly enhances plasticity (learning new tasks) without sacrificing stability (no forgetting). On six heterogeneous task sequences, it achieves an average accuracy improvement of up to 7.67%, with gains up to 14.92% on subsequent complex tasks.
- PointCSP: Cross-Sample Semantic Propagation and Stability Preservation in Self-Supervised Point Cloud Learning
-
Addressing the issue of "semantic inconsistency across scenes caused by sample-independent modeling" in scene-level point cloud self-supervision, PointCSP utilizes State Space Models to serialize samples within a batch into long sequences for Cross-Sample Semantic Propagation (CSP) to establish a globally consistent semantic space. It then employs an asymmetric teacher-student Stability Preservation Distillation (SPD) to eliminate batch-dependency shifts during single-scene testing, achieving new SOTA results across S3DIS, 3DSES, ScanObjectNN, ModelNet40, and ShapeNetPart.
- Progressive Mask Distillation for Self-supervised Video Representation
-
PMD addresses the issue where "a single masking rate cannot fully capture complex semantics" in masked video self-supervision. It employs four students with progressively increasing masking rates (75%→80%→85%→90%) for progressive distillation. Low-masking-rate students learn low-level semantics first and then serve as auxiliary teachers to guide high-masking-rate students in learning high-level semantics. Supplemented by difficulty-aware region enhancement and cross-layer feature alignment, it achieves SOTA on SSv2/K400/UCF-101/HMDB-51.
- Quantized Residuals to Continuous Prompts for Few-Shot Class Incremental Learning in Vision-Language Models
-
QR-Prompt discretely quantizes the "residuals" between CLIP visual and text features—which are typically smoothed out by contrastive learning—into a set of frozen Discriminative Subspace Codebooks (DSQ). These discrete codes are then translated into class-adaptive continuous prompts via a Hierarchical Prompt Encoder (HPE) and a Prompt Combiner (PC). This mechanism balances stability and plasticity in FSCIL, outperforming existing SOTA methods on CUB200, CIFAR100, and miniImageNet.
- Reading Your Actions: Learning Generalizable Action Representations via Pre-training AEMG
-
AEMG treats surface electromyography (EMG) as a "language"—utilizing an energy-driven tokenizer to segment muscle contractions into "words" and multi-channel coordination into "sentences." By applying vector quantization (VQ) codebooks and masked reconstruction for self-supervised pre-training, it yields a universal EMG foundation model across devices, subjects, and tasks. In the rigorous Zero-shot Leave-One-Subject-Out (LOSO) gesture recognition task, it outperforms six SOTA methods by an average of 5.79–9.25%.
- Reconstructing Spiking Neural Networks Using a Single Neuron with Autapses
-
Inspired by the autaptic self-feedback of cerebellar Purkinje cells, this paper introduces a set of "Time-Delay Autapses" to LIF neurons (TDA-LIF). By expanding a single spiking neuron in the temporal dimension and applying specific pruning/sharing strategies to the autapses, the authors equivalently reconstruct three SNN architectures: Reservoir Computing (RC), Multilayer Perceptron (MLP), and Convolution-like structures. This approach achieves accuracy comparable to standard SNNs of equivalent scale while reducing the number of neurons per layer to 1 and state VRAM from 8 KB to 4 Bytes, increasing single-neuron information density by orders of magnitude at the cost of temporal latency in extreme single-neuron settings.
- Recurrent Video Masked Autoencoders
-
RVM utilizes a "Transformer + GRU hybrid recurrent core" to aggregate video features frame-by-frame. Trained solely on an asymmetric pixel reconstruction objective—where unmasked past frames reconstruct a 95% masked future frame—it yields a general-purpose encoder proficient in both spatiotemporal tasks (action recognition, point/object tracking) and dense spatial tasks (depth, segmentation correspondence). Furthermore, small models achieve competitive performance without distillation, offering up to \(30\times\) parameter efficiency compared to existing Video MAEs.
- Reframing Long-Tailed Learning via Loss Landscape Geometry
-
This paper revisits the head-tail seesaw dilemma in long-tailed learning from the perspective of loss landscape geometry. It reveals that the root cause of tail class degradation is optimization convergence to sharp regions far from the tail-class optima. Inspired by continual learning, it proposes a dual-module framework comprising Grouped Knowledge Preservation (GKP) and Grouped Sharpness Awareness (GSA). The method achieves SOTA on CIFAR-LT, ImageNet-LT, and iNat2018 without requiring external data.
- Representation-Steered Incremental Adapter-Tuning for Class-Incremental Learning with Pre-Trained Models
-
RSIAT employs a single shared adapter (with parameters remaining constant regardless of task growth) for PTM-based class-incremental learning. It first shapes features to be intra-class compact and inter-class separable using a "Representation-Steered Loss" in the base task. Then, in incremental tasks, it utilizes "Residual AutoEncoder Projection + Orthogonal Loss" to align new and old feature spaces and suppress prototype drift. It achieves a superior stability-plasticity trade-off across six CIL benchmarks with fewer parameters.
- Residual Connections Harm Generative Representation Learning
-
The authors discover that the "identity shortcut" in residual connections injects shallow high-frequency details directly into deep layers, suppressing semantic abstraction. They propose Decayed Identity Shortcuts—an architectural modification where the weight of the identity shortcut decays monotonically with layer depth. With only one additional hyperparameter \(\alpha_{\min}\) and zero extra parameters, this method improves the KNN accuracy of MAE on ImageNet-1K from 27.4% to 63.9% and linear probing from 67.8% to 72.7%, while also enhancing the generation quality of diffusion models.
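The decayed shortcut can be sketched as follows. The linear schedule from 1.0 down to \(\alpha_{\min}\) is an assumption (the summary only says the decay is monotonic), and the names `shortcut_weights` and `forward` are hypothetical.

```python
from typing import Callable, List, Sequence


def shortcut_weights(num_layers: int, alpha_min: float) -> List[float]:
    """Identity-shortcut weight per layer, decaying from 1.0 to alpha_min.

    Assumption: the decay is linear in depth; only monotonicity is stated.
    """
    if num_layers < 1:
        raise ValueError("num_layers must be >= 1")
    if num_layers == 1:
        return [1.0]
    step = (1.0 - alpha_min) / (num_layers - 1)
    return [1.0 - step * i for i in range(num_layers)]


def forward(x: Sequence[float],
            blocks: Sequence[Callable[[Sequence[float]], List[float]]],
            alpha_min: float) -> List[float]:
    """Apply h <- alpha_l * h + f_l(h) per block; no parameters are added."""
    alphas = shortcut_weights(len(blocks), alpha_min)
    h = list(x)
    for alpha, block in zip(alphas, blocks):
        out = block(h)
        h = [alpha * hi + oi for hi, oi in zip(h, out)]
    return h
```

With `alpha_min = 1.0` every weight is 1.0 and the update reduces to a standard residual connection, which makes the single hyperparameter easy to ablate.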
- Rethinking SNN Online Training and Deployment: Gradient-Coherent Learning via Hybrid-Driven LIF Model
-
This paper proposes the HD-LIF (Hybrid-Driven LIF) spiking neuron model family. By employing different spike calculation mechanisms above and below the threshold, it theoretically proves gradient separability and alignment, resolving the inconsistency between forward and backward propagation in SNN online training. It simultaneously achieves full-stage optimization of learning accuracy, memory complexity, and power consumption—reaching 78.61% accuracy on CIFAR-100 with 10× parameter compression, 11× power reduction, and 30% NOPs savings.
- Robust Spiking Neural Networks by Temporal Mutual Information
-
From an information theory perspective, this paper proves that the upper bound of the robust error in deep networks is determined by the "mutual information (MI) between input and hidden representations." It indicates that the unique temporal characteristics of SNNs (accumulative firing + temporal spike dependence) naturally result in smaller mutual information. Based on this, it proposes a TMI regularization term that directly minimizes MI along the temporal dimension, consistently enhancing the intrinsic robustness of SNNs across multiple datasets like CIFAR/ImageNet under various attacks.
- Scaling Dense Event-Stream Pretraining from Visual Foundation Models
-
ScaleEvent treats Visual Foundation Models (VFM) such as DINOv3 as frozen teachers to perform large-scale cross-modal dense distillation on approximately 500,000 synchronized "image-event" pairs. By using an "Event Activation Mask + Structure-Aware Loss" to correct semantic collapse caused by differences in sparsity and granularity between images and events, it obtains fine-grained event representations transferable to segmentation, depth, and optical flow, reducing downstream RMSE by up to ~58%.
- Scaling Parallel Sequence Models to Vision Foundation Models
-
This paper transforms the linear-complexity 2D Spatial Propagation Network (GSPN) into a compressed latent space version (C-GSPN) and uses two-stage cross-operator distillation to transfer knowledge from an attention-based teacher. It marks the first successful push of sub-quadratic operators to CLIP-level vision foundation model pre-training—achieving 2× faster block-level latency than FlashAttention at 1K resolution, a 2.1% improvement in segmentation, and zero-shot accuracy approaching attention baselines.
- SECOS: Semantic Capture for Rigorous Classification in Open-World Semi-Supervised Learning
-
Addressing the issue where Open-World Semi-Supervised Learning (OWSSL) only performs "clustering" and relies on Hungarian matching for accuracy, SECOS utilizes a frozen CLIP to "ground" visual features of novel class samples to candidate textual labels. It generates reliable pseudo-labels through two stages (Global Compensation + In-batch Recapture) and aligns visual-semantic spaces using adapters. This allows direct prediction of textual labels during testing without any post-processing, outperforming competitors by up to 5.4% on 7 datasets even when they use Hungarian matching.
- Seeing Through the Shift: Causality-Inspired Robust Generalized Category Discovery
-
CausalGCD recasts "Cross-Domain Generalized Category Discovery" as a structural causal problem: it suppresses domain-related spurious shortcuts using Causal Dependence Risk (CDR) and locks the cross-domain invariant geometric relationships between known and novel classes using Causal Geometric Manifold Constraint (CGMC). It consistently outperforms SOTA methods like FREE and HiLo by approximately 2 percentage points on SSB-C and DomainNet benchmarks containing domain shifts.
- Semantic-Guided Global-Local Collaborative Prompt Learning for Few-Shot Class Incremental Learning
-
SGLC utilizes a frozen CLIP as a backbone and adapts to FSCIL using a dual-layer prompt learning approach consisting of "global vision-text prototype alignment + local attribute-multiview optimal transport alignment." LLM-generated semantic descriptions serve as teachers via knowledge distillation for both prompt layers, leading to comprehensive improvements over previous SOTA on miniImageNet, CIFAR-100, and CUB200 benchmarks.
- Shape-of-You: Fused Gromov-Wasserstein Optimal Transport for Semantic Correspondence in-the-Wild
-
This work reformulates semantic correspondence as a Fused Gromov-Wasserstein (FGW) optimal transport problem. By leveraging geometric structure constraints from 3D foundation models to generate globally consistent pseudo-labels, it addresses the geometric inconsistency caused by the locality and 2D appearance ambiguity of traditional nearest-neighbor matching.
- Smart Replay: Adaptive Scheduling of Memory Rehearsal for Computational Resource-Aware Incremental Learning
-
This paper introduces the "Computational Resource-Aware Incremental Learning (CRIL)" setting and designs Smart Replay. By treating the replay sample ratio \(\lambda_r\) in each mini-batch as a tunable control variable, it employs optimal control and a heuristic Q-function to dynamically schedule the replay ratio under a fixed computational budget. This achieves higher accuracy and lower forgetting than fixed-ratio baselines under the same compute constraints.
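To make the control variable concrete, here is a minimal sketch of composing one mini-batch for a given replay ratio \(\lambda_r\). It only shows how the ratio splits the batch; the scheduler that chooses \(\lambda_r\) is not modeled, and `compose_batch` and its arguments are hypothetical names.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def compose_batch(new_data: Sequence[T],
                  memory: Sequence[T],
                  batch_size: int,
                  replay_ratio: float,
                  rng: random.Random) -> List[T]:
    """Build a mini-batch with about replay_ratio * batch_size replayed samples."""
    if not 0.0 <= replay_ratio <= 1.0:
        raise ValueError("replay_ratio must lie in [0, 1]")
    # Never request more replay samples than the memory buffer holds.
    n_replay = min(round(replay_ratio * batch_size), len(memory))
    n_new = batch_size - n_replay
    replay = rng.sample(list(memory), n_replay)            # without replacement
    fresh = [rng.choice(new_data) for _ in range(n_new)]   # with replacement
    return fresh + replay
```

Because the batch size is fixed, raising the ratio trades new-task samples for replayed ones at constant per-step cost, which is the budgeted setting the summary describes.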
- Spectral Mixture-of-Experts for Continual Learning
-
To address "structural interference" and "compositional forgetting" in LoRA-MoE for continual learning, this paper proposes Spectral MoE: it employs non-overlapping frequency domain masks to constrain each expert into independent subspaces for inherent orthogonality, combined with a dual online/offline router and Dynamic Consistency Projection to lock routing policies. It achieves higher retention and plasticity in cross-domain task-agnostic incremental learning.
- Stabilizing Feature Geometry in Noisy Pretrained Models for Robust Downstream Tasks
-
The authors discover that pre-training noise not only weakens spectral energy but also causes a "rotation" of the principal feature subspace. They propose the Principal Direction Angle (PDA) to quantify this rotation and design the FGS framework—a lightweight projection head inserted after a frozen backbone using a trio of Perturbation Consistency, Variance-Activation Regularization, and Feature Consistency Distillation. FGS outperforms previous spectral methods on multiple vision benchmarks by at least +1.53% on average.
- Subspace Alignment for CLIP-based Continual Learning via Canonical Correlation Analysis
-
Addressing the issue where the "visual encoder drift is significantly greater than the text encoder drift" in CLIP-based continual learning (termed Asymmetric Drift by the authors), this paper proposes CCA-CL. It accumulates visual-text covariance statistics across tasks and employs closed-form Canonical Correlation Analysis (CCA) to solve for a shared subspace that maximizes cross-modal correlation. This pulls both modalities back into alignment without modifying CLIP parameters or storing exemplars. By incorporating Random Fourier Projections (RFP) for non-linearity, it achieves SOTA accuracy across four benchmarks with the fastest training speed (5.8 minutes on CIFAR-100).
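The closed-form step can be sketched with NumPy. This is a textbook CCA solved by whitening and SVD, not necessarily the authors' exact procedure: it assumes features are already centered, uses uncentered second-moment sums as the accumulated statistics, adds a small ridge term for stability, and omits the Random Fourier Projections. The names `CrossModalStats` and `solve_cca` are hypothetical.

```python
import numpy as np


class CrossModalStats:
    """Running second-moment statistics for visual (x) and text (y) features."""

    def __init__(self, dim_x: int, dim_y: int) -> None:
        self.cxx = np.zeros((dim_x, dim_x))
        self.cyy = np.zeros((dim_y, dim_y))
        self.cxy = np.zeros((dim_x, dim_y))

    def update(self, x: np.ndarray, y: np.ndarray) -> None:
        """Add one task's paired features; x is (n, dim_x), y is (n, dim_y)."""
        self.cxx += x.T @ x
        self.cyy += y.T @ y
        self.cxy += x.T @ y


def _inv_sqrt(mat: np.ndarray, ridge: float) -> np.ndarray:
    """Inverse square root of a symmetric PSD matrix, regularized by a ridge."""
    vals, vecs = np.linalg.eigh(mat + ridge * np.eye(mat.shape[0]))
    return vecs @ np.diag(vals ** -0.5) @ vecs.T


def solve_cca(stats: CrossModalStats, k: int, ridge: float = 1e-6):
    """Return projections (wx, wy) and the top-k canonical correlations."""
    wx_half = _inv_sqrt(stats.cxx, ridge)
    wy_half = _inv_sqrt(stats.cyy, ridge)
    u, s, vt = np.linalg.svd(wx_half @ stats.cxy @ wy_half)
    return wx_half @ u[:, :k], wy_half @ vt.T[:, :k], s[:k]
```

Only the three accumulated matrices are kept between tasks, so no exemplars need to be stored and the encoder weights are never touched, consistent with the summary above.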
- Suppressing Non-Semantic Noise in Masked Image Modeling Representations
-
This paper reveals that representations learned by Masked Image Modeling (MIM) retain a significant amount of non-semantic information (low-level features such as texture and color). It proposes a training-free post-processing method, SOAP (Semantically Orthogonal Artifact Projection), which identifies and removes non-semantic components via PCA, consistently improving zero-shot performance across multiple MIM models.
- TALO: Pushing 3D Vision Foundation Models Towards Globally Consistent Online Reconstruction
-
This paper proposes TALO, a high-degree-of-freedom alignment framework based on Thin Plate Spline (TPS). By propagating global control points and utilizing a point-agnostic submap registration design, it corrects spatially-varying inconsistencies of 3D vision foundation models in online reconstruction. It is compatible with various foundation models and camera configurations, significantly reducing trajectory errors on Waymo and nuScenes datasets.
- TAR: Token-Aware Refinement for Fine-grained Generalized Category Discovery
-
Targeting the "attention artifacts" issue in ViTs for fine-grained Generalized Category Discovery (GCD)—where a few high-norm tokens sequester attention, causing the [CLS] token to over-rely on global semantics while ignoring local discriminative cues—TAR introduces a plug-and-play three-module pipeline. It utilizes parameter-free reweighting to exclude high-norm tokens, samples reliable local tokens based on [CLS] consistency, and injects local details into [CLS] via gating, achieving consistent performance gains across fine-grained benchmarks like CUB, Cars, and Aircraft.
- TeFlow: Enabling Multi-frame Supervision for Self-Supervised Feed-forward Scene Flow Estimation
-
TeFlow is the first method to introduce multi-frame supervision into self-supervised feed-forward scene flow estimation. It employs a temporal ensemble strategy to construct a motion candidate pool and aggregates temporally consistent supervision signals via consensus voting. On Argoverse 2 it achieves a Three-way EPE of 3.57 cm, a 22.3% improvement over SeFlow++ and comparable to the optimization-based method Floxels, while maintaining real-time inference (8s vs. 24min).
- Temporal Imbalance of Positive and Negative Supervision in Class-Incremental Learning
-
This work identifies Temporal Imbalance as a neglected source of bias in Class-Incremental Learning (CIL) and proposes the Temporal-Adjusted Loss (TAL). By utilizing a time-decay memory kernel to dynamically reduce the weight of negative supervision for old classes, TAL significantly alleviates catastrophic forgetting in a plug-and-play manner.
- Temporal Interaction in Spiking Transformers with Multi-Delay Mixer
-
To address the deficiency where spiking self-attention "models only space and almost no time," this paper first proposes the TIC metric to quantify the issue. It then introduces the Multi-Delay Mixer (multi-branch learnable delays), inspired by biological axonal transmission delays, as a plug-and-play module to inject multi-scale temporal dependencies into Key/Value. This approach consistently sets new SOTA results for Spiking Transformers across static, neuromorphic, and long-sequence benchmarks.
- Temporal Representation Enhancement (TRE): Learning to Forget Dominant Patterns for Enhanced Temporal Spiking Features
-
To address the issue of temporal redundancy in Spiking Neural Networks (SNNs), where the same set of dominant channels are repeatedly activated across multiple timesteps, this paper proposes TRE. TRE estimates the contribution of each channel per category during training and uses adaptive threshold gating to temporarily mask "overly dominant" channels, forcing subsequent timesteps to mine complementary semantics. At inference, no masking is applied, resulting in zero extra overhead while achieving stable performance gains on CIFAR-100/ImageNet/DVS-CIFAR10.
- Text-Phase Synergy Network with Dual Priors for Unsupervised Cross-Domain Image Retrieval
-
This paper proposes TPSNet, which utilizes domain prompts learned by CLIP as text priors to provide fine-grained semantic supervision, while introducing phase spectrum features as phase priors to bridge domain distribution gaps and maintain semantic integrity. The synergy of text-phase dual priors achieves significant improvements in unsupervised cross-domain image retrieval.
- The Devil Is in Gradient Entanglement: Energy-Aware Gradient Coordinator for Robust Generalized Category Discovery
-
This paper identifies "gradient entanglement" in Generalized Category Discovery (GCD), where sharing parameters between supervised and unsupervised objectives causes unsupervised gradients to pollute supervised directions and supervised gradients to pull new class representations into old class subspaces. It proposes EAGC, a plug-and-play module that uses a supervised reference model to anchor labeled sample gradients (AGA) while adaptively soft-projecting unlabeled gradients out of the old class subspace based on energy (EEP), achieving an average gain of 3.2% in All ACC and 4.3% in New ACC across four GCD baselines and five datasets.
- Towards Stable Self-Supervised Object Representations in Unconstrained Egocentric Video
-
EgoViT employs a teacher-student ViT framework to jointly optimize three mechanisms—"proto-object discovery + depth geometric regularization + teacher-filtered temporal consistency"—from unlabeled egocentric videos. It achieves an +8.0% improvement in unsupervised object discovery CorLoc and a +4.8% increase in semantic segmentation mIoU.
- TrackMAE: Video Representation Learning via Track, Mask, and Predict
-
TrackMAE introduces explicit motion signals into the masked video modeling (MVM) framework: it extracts point trajectories with CoTracker3 as additional reconstruction targets and designs a motion-aware masking strategy to jointly learn spatial reconstruction and motion prediction. This approach significantly outperforms existing self-supervised video methods on motion-sensitive benchmarks (SSv2, FineGym).
- Trust-calibrated Collaborative Learning for Long-Tailed Visual Recognition
-
Addressing the issues in multi-expert "mutual distillation" for long-tailed recognition—where errors from a single expert propagate to the entire group (bias propagation) or the entire group collectively confirms errors with high confidence (error consolidation)—this paper proposes TCL. It employs a "Knowledge Quality Gate + Tail-Class Knowledge Compensation" to ensure only correct experts propagate knowledge while amplifying rare correct insights. Furthermore, a "Consensus Error Calibration" module detects and suppresses high-confidence negative classes agreed upon by all experts, improving CIFAR100-LT Top-1 accuracy from 57.2% to 58.7%.
- Tunable Soft Equivariance with Guarantees
-
This paper proposes an architecture-agnostic "soft equivariant" framework: projecting the weights of any pre-trained model into a subspace determined by the Lie algebra representation of a group. Using a truncation threshold \(b\), the model can be continuously tuned from "fully equivariant" to "fully non-equivariant," while providing a provable upper bound on the equivariant error. It simultaneously improves accuracy and reduces equivariant error in ImageNet classification, segmentation, and trajectory prediction.
- Unique Lives, Shared World: Learning from Single-Life Videos
-
A geometry-aware visual encoder can be trained via self-supervision using only "a single person's lifetime of egocentric video" (effectively 38 hours). The study further discovers that models independently trained on different individuals converge to highly consistent geometric representations. These "single-life" representations can transfer to downstream tasks like depth estimation, achieving performance comparable to models trained on diverse internet videos of equal duration.
- UniRefiner: Teaching Pre-trained ViTs to Self-Dispose Dross via Contrastive Register
-
UniRefiner systematically categorizes up to 40% of "spurious tokens" in the feature maps of large-scale pre-trained ViTs (including EVA-CLIP-8B and InternViT-6B) into three types. By using a multiplex detector to identify them and employing "contrastive registers" during LoRA self-distillation, it explicitly drives spurious signals into register regions while retaining clean semantics in image regions. With only 5k images and a few fine-tuning epochs, vision-language models (VLMs) originally ill-suited for dense tasks outperform DINOv2 on ADE20K (EVA-CLIP-8B reaches 51.9% mIoU, a +9.4% gain).
- UPLiFT: Efficient Pixel-Dense Feature Upsampling with Local Attenders
-
UPLiFT uses a weight-sharing convolutional \(2\times\) decoder to iteratively upsample low-resolution features from pretrained backbones (e.g., DINOv2) to pixel-level density. It introduces a Local Attender operator based entirely on fixed local offsets to replace cross-attention, reducing complexity from quadratic to linear while maintaining semantic consistency and avoiding "upsampling semantic drift"—outperforming existing upsamplers in segmentation and depth estimation with faster speeds.
- VideoSSR: Video Self-Supervised Reinforcement Learning
-
To address the dilemma where strong models are saturated by existing video RLVR datasets and manual annotation is too costly, VideoSSR automatically generates training data with verifiable answers from raw videos using three parameterizable self-supervised pretext tasks (anomaly grounding / object counting / temporal jigsaw). Combined with task-specific smooth reward functions for GRPO training, it improves Qwen3-VL-8B by an average of over 5% across 17 benchmarks.
- Vision Transformers Need More Than Registers
-
This paper argues that prevalent dense feature artifacts in ViT under label, text, and self-supervision are not merely high-norm token issues, but a consequence of the model learning to use background patches as global semantic shortcuts under the combined influence of coarse-grained supervision and global attention. To address this, the authors propose LaSt-ViT, which replaces original CLS aggregation with selective aggregation guided by frequency-domain stability, consistently improving localization, segmentation, and open-vocabulary tasks across 12 benchmarks.
- VT-Intrinsic: Physics-Based Decomposition of Reflectance and Shading using a Single Visible-Thermal Image Pair
-
VT-Intrinsic utilizes the physical complementarity between visible and thermal infrared images (unreflected light is absorbed as heat) to derive visible-thermal intensity ordinality that directly maps to reflectance and shading ordinality. This serves as a self-supervised signal for neural network optimization, enabling high-quality intrinsic image decomposition without pre-training data.
- Weight Space Representation Learning via Neural Field Adaptation
-
This paper proposes using a "pretrained neural field base model + multiplicative LoRA (mLoRA) + asymmetric masking" to constrain network weights fitted to individual samples into structured representations. This ensures that INR weights possess high-quality reconstructability, support weight diffusion model generation, and maintain semantic separability, significantly outperforming the prior weight space method HyperDiffusion on FFHQ and ShapeNet.
- Your Dissimilarities Define You: Complementary Learning Exploiting Class Diversities
-
Addressing the issue where Cross-Entropy (CE) suffers from vanishing gradients for non-target classes once a sample is correctly classified—thereby losing information about "how dissimilar classes are"—this paper proposes Complementary Dissimilarity Loss (CDL). It employs a "one-cold" target, where the target class is set to 0 and non-target classes are assigned probability mass based on dissimilarity, to explicitly supervise all non-target classes. This approach maintains non-vanishing gradients and actively pushes representations toward controllable Neural Collapse, providing consistent plug-and-play performance gains across closed-set, open-set, few-shot, and domain generalization tasks.
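A "one-cold" target can be sketched as below. The sketch assumes that non-target mass is directly proportional to a given per-class dissimilarity score; the summary does not state the exact mapping, and `one_cold_target` is a hypothetical name.

```python
from typing import List, Sequence


def one_cold_target(target: int, dissimilarity: Sequence[float]) -> List[float]:
    """Build a distribution with zero mass on the target class.

    dissimilarity[k] is a non-negative score between the target class and
    class k. Assumption: non-target mass is proportional to this score.
    """
    masses = [0.0 if k == target else d for k, d in enumerate(dissimilarity)]
    total = sum(masses)
    if total <= 0.0:
        raise ValueError("at least one non-target class needs positive dissimilarity")
    return [m / total for m in masses]
```

For example, `one_cold_target(0, [0.0, 1.0, 3.0])` yields `[0.0, 0.25, 0.75]`: the target class is "cold" and every non-target class keeps an explicit supervision signal.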