🔄 Self-Supervised Learning
🔬 ICLR2026 · 81 paper notes
🔥 Top topics: Self-Supervised Learning ×14 · Alignment/RLHF ×7 · Continual Learning ×5 · Adversarial Robustness ×4 · Multimodal/VLM ×3
- A Bayesian Nonparametric Framework for Learning Disentangled Representations
-
This paper replaces the common isotropic Gaussian prior in VAEs with a Bayesian nonparametric hierarchical mixture prior. While preserving provable identifiability, it allows the number of mixture components for each generative factor to grow adaptively with the data, learning modular and compact disentangled representations without any additional regularization terms.
- Adaptive Gaussian Expansion for On-the-fly Category Discovery
-
This paper demonstrates that the "On-the-fly Category Discovery" (OCD) task possesses a performance lower bound overlooked by existing hashing methods. It subsequently decomposes OCD into two sub-tasks: "Open-Set Recognition + Real-time New Category Discovery." By employing soft thresholds to categorize known classes directly and utilizing Adaptive Gaussian Expansion (AGE)—based on multivariate Gaussian density—for online incremental clustering of new classes, the authors improve overall accuracy by approximately 10% across multiple datasets.
- Adaptive Test-Time Training for Predicting Need for Invasive Mechanical Ventilation in Multi-Center Cohorts
-
This paper proposes AdaTTT, a framework for robust test-time adaptation on multi-center ICU EHR data. It combines dynamic feature-aware self-supervised learning (an adaptive masking strategy) with prototype-guided partial optimal transport alignment to predict the need for invasive mechanical ventilation (IMV) 24 hours in advance.
- Adversarial Encoding Perturbation and Synthesis for Set Representation Auxiliary Learning
-
SRAL treats each set as an empirical distribution and uses 2-Sliced-Wasserstein distance to encode "distribution-aware" representations. It injects adversarial perturbations at the feature/encoding layer rather than the input layer and employs min-max optimization to force the model to resist worst-case perturbations. This serves as a plug-and-play self-supervised auxiliary objective for various downstream tasks. Theoretically, this objective is equivalent to optimizing the Sliced-Wasserstein distance between sets in expectation. It consistently outperforms existing set encoders across four tasks: set similarity ranking, bundle recommendation, point cloud classification, and topic set expansion.
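As a rough illustration of the distance this summary refers to, the sketch below gives a Monte Carlo estimate of the 2-Sliced-Wasserstein distance between two point sets. It is not SRAL's encoder or training objective: the function name, the equal-set-size restriction, and the number of projections are assumptions made for brevity.

```python
import numpy as np

def sliced_wasserstein2(x, y, n_projections=128, seed=0):
    """Monte Carlo estimate of the 2-Sliced-Wasserstein distance between
    two equally sized point sets x and y of shape (n, d)."""
    rng = np.random.default_rng(seed)
    d = x.shape[1]
    # Random unit directions on the sphere.
    theta = rng.normal(size=(n_projections, d))
    theta /= np.linalg.norm(theta, axis=1, keepdims=True)
    # Project to 1D and sort, so matching order statistics are paired.
    px = np.sort(x @ theta.T, axis=0)
    py = np.sort(y @ theta.T, axis=0)
    # In 1D, optimal transport between equal-size sets pairs sorted samples.
    return float(np.sqrt(np.mean((px - py) ** 2)))
```

Because each projection is sorted, the estimate does not depend on the order of the elements, which is the property a set representation needs.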
- Architecture-Agnostic Test-Time Adaptation via Backprop-Free Embedding Alignment
-
PEA decomposes "domain shift" into three geometric distortions in the embedding space: translation (mean shift), scaling (variance shift), and rotation (covariance shift). It utilizes a backprop-free and architecture-agnostic layer-wise covariance alignment process. By performing only two forward passes per batch, it pulls shifted intermediate features back to the source domain distribution. It achieves SOTA accuracy on ImageNet-C / CIFAR-C with a memory footprint of only ~900MB, enabling direct deployment on Jetson Orin Nano edge devices.
- AutoDV: An End-to-End Deep Learning Model for High-Dimensional Data Visualization
-
AutoDV transforms traditional visualization (t-SNE / UMAP), which requires "per-dataset parameter tuning + iterative optimization," into a one-time trained, plug-and-play end-to-end model. It first converts datasets of arbitrary dimensions into multi-scale similarity graphs, then utilizes a multi-graph GNN + Graph Transformer to directly output 2D/3D embeddings, trained with an affine invariant loss. It achieves 89.37% relative accuracy to t-SNE and 91.05% to UMAP on unseen CIFAR-10 data, and even outperforms t-SNE/UMAP themselves on genomics and UCI tabular data.
- Bayesian Test-Time Adaptation via Dirichlet feature projection and GMM-Driven Inference for Motor Imagery EEG Decoding
-
BTTA-DG compresses the moment-to-moment prediction sequence of each EEG trial into a Dirichlet parameter vector. It utilizes a GMM fitted on historical trials as the likelihood and the deep model output as the prior to perform a gradient-free Bayesian posterior calibration. It achieves SOTA and real-time performance (15.7 ms/trial) in cross-subject/cross-session transfer for motor imagery BCI.
- Better Together: Leveraging Unpaired Multimodal Data for Stronger Unimodal Models
-
This paper proposes Unpaired Multimodal Learner (UML): it requires no sample-level pairing (e.g., image-text, audio-image). As long as the auxiliary modality shares semantic structure with the target modality, training signals from unpaired text, images, or audio are channeled into a unified representation via cross-modal weight sharing. This enhances the classification performance and robustness of models that ultimately use only the single target modality.
- Beyond Hearing: Learning Task-Agnostic ExG Representations from Earphones via Physiology-Informed Tokenization
-
Fifty hours of free-living ExG data were collected using lightweight earphone-style hardware. A "Physiology-Informed Multi-band Tokenization (PiMT)" is proposed to decompose signals into 12 sub-band tokens with explicit physical meanings. Combined with reconstructive self-supervised pre-training, a set of task-agnostic ExG representations applicable across five sensory tasks (visual, auditory, gustatory, tactile, olfactory) was learned.
- Bidirectional Predictive Coding
-
This paper proposes bidirectional Predictive Coding (bPC), which employs a single energy function to accommodate both "top-down generative" and "bottom-up discriminative" inference. This allows the same biologically plausible local circuit to perform accurate classification like discPC and generation/reconstruction like genPC, outperforming existing unidirectional or hybrid PC models in brain-inspired tasks such as cross-modal association and occlusion completion.
- Boosting Open Set Recognition Performance through Modulated Representation Learning
-
This paper points out that nearly all Open Set Recognition (OSR) methods employ a fixed temperature \(\tau\) for logits, restricting the model to a single point on the spectrum between "instance-level" and "class-level" features. The authors propose temperature scheduling (centering on a novel Negative Cosine Schedule, NegCosSch), allowing the model to initially define coarse decision boundaries at low temperatures and subsequently tighten intra-class samples as temperature increases. This improves both open-set and closed-set performance without additional computational overhead, particularly yielding the highest gains on the challenging Semantic Shift Benchmark (SSB).
- Bures-Isotropy Alignment: Manifold Learning of Generalized Category Discovery
-
BIA treats the class token representation in Generalized Category Discovery (GCD) as a manifold geometry problem requiring repair. It aligns the mini-batch class-token Gram matrix with an isotropic prior using Bures distance and achieves lightweight regularization through equivalent nuclear norm maximization. This enhances clustering accuracy and the stability of category number estimation without modifying the underlying GCD framework.
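The link between the Bures distance and the nuclear norm can be checked numerically. The sketch below assumes the Gram matrix is `Z @ Z.T` for row-normalized features `Z` and that the isotropic prior is `c * I`; both choices, and the function name, are illustrative assumptions rather than details taken from the paper. Under them, the squared Bures distance equals `tr(G) + c*n - 2*sqrt(c)*||Z||_*`, so with a fixed trace, shrinking the distance is the same as growing the nuclear norm.

```python
import numpy as np

def bures_sq_to_isotropic(gram, c):
    """Squared Bures distance between a PSD Gram matrix and c * I."""
    n = gram.shape[0]
    eigvals = np.clip(np.linalg.eigvalsh(gram), 0.0, None)
    # c * I commutes with everything, so the cross term reduces to
    # sqrt(c) * tr(gram^{1/2}).
    return float(eigvals.sum() + c * n - 2.0 * np.sqrt(c) * np.sqrt(eigvals).sum())

rng = np.random.default_rng(0)
z = rng.normal(size=(8, 16))
z /= np.linalg.norm(z, axis=1, keepdims=True)  # unit-norm rows, so tr(Z Z^T) = 8
gram = z @ z.T
c = 1.0

# tr(gram^{1/2}) is the sum of singular values of Z, i.e. its nuclear norm.
nuclear_norm = np.linalg.norm(z, ord="nuc")
closed_form = gram.shape[0] + c * gram.shape[0] - 2.0 * np.sqrt(c) * nuclear_norm
```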
- CARL: Camera-Agnostic Representation Learning for Spectral Image Analysis
-
CARL uses "wavelength position encoding + self-attention-cross-attention spectral encoder" to distill spectral images with arbitrary channel counts (RGB/MSI/HSI) into camera-agnostic feature representations. Combined with a feature-level spatio-spectral self-supervised strategy (CARL-SSL), it achieves cross-camera spatio-spectral joint representation learning for the first time, outperforming camera-specific and channel-independent baselines in medical, autonomous driving, and satellite domains.
- Chart Deep Research in LVLMs via Parallel Relative Policy Optimization
-
Proposes PRPO (Parallel Relative Policy Optimization) to resolve training bottlenecks in GRPO caused by multi-dimensional reward signal interference and heterogeneous data gradient conflicts through parallel decoupling at both reward and data levels. Simultaneously, constructs MCDR-Bench to transform subjective generation evaluation into objective error identification based on the "Principle of Error Uniqueness," enabling quantitative assessment of chart deep research capabilities.
- CoLA: Co-Calibrated Logit Adjustment for Long-Tailed Semi-Supervised Learning
-
To address two weaknesses of Logit Adjustment in long-tailed semi-supervised learning—"over-suppression of head classes caused by frequency counting" and "the global adjustment intensity \(\tau\) being a fixed hyperparameter decoupled from class-level adjustment"—CoLA introduces De-duplicated Distribution Estimation (DDDE) using effective rank and learns the optimal \(\tau\) (LMC) via meta-learning on a proxy validation set mirroring the estimated distribution, achieving SOTA results across four long-tailed benchmarks.
- Contrastive Predictive Coding Done Right for Mutual Information Estimation
-
This paper theoretically debunks the long-standing misconception that "InfoNCE is a mutual information estimator"—it is actually a variational lower bound of another divergence (K-way JSD) and can never converge to the KL divergence. The authors introduce a simple modification by adding an "anchor class" (InfoNCE-anchor), allowing the critic to directly learn unambiguous density ratios. This results in a low-bias, low-variance, plug-and-play MI estimator. Furthermore, they use proper scoring rules to unify the NCE / InfoNCE / f-divergence family of contrastive objectives into a single framework.
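One well-known symptom of the gap between InfoNCE and mutual information is that the quantity `log K - loss` can never exceed `log K`, however good the critic is. The sketch below shows this saturation on a `(K, K)` score matrix; it illustrates that general fact only, not the paper's K-way JSD argument or the InfoNCE-anchor estimator, and the function name is hypothetical.

```python
import numpy as np

def infonce_estimate(scores):
    """log K minus the InfoNCE loss for a (K, K) critic score matrix
    whose diagonal holds the positive pairs."""
    k = scores.shape[0]
    # Numerically stable row-wise log-softmax, evaluated on the diagonal.
    row_max = scores.max(axis=1, keepdims=True)
    log_norm = row_max[:, 0] + np.log(np.exp(scores - row_max).sum(axis=1))
    log_softmax_diag = np.diag(scores) - log_norm
    # Each log-softmax term is <= 0, so the result is capped at log K.
    return float(np.log(k) + log_softmax_diag.mean())
```

A critic that scores every positive pair overwhelmingly higher than every negative still returns exactly `log K`, so larger dependence between the variables cannot be reflected in the estimate.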
- CSRv2: Unlocking Ultra-Sparse Embeddings
-
CSRv2 utilizes "progressive k-annealing + sparse supervised contrastive learning + full-backbone fine-tuning" to advance Contrastive Sparse Representation (CSR) into the ultra-sparse range of \(k\le 4\). This approach reduces dead neurons from 80% to 20% and achieves a 14% accuracy gain at \(k=2\). It enables embeddings with only 2 active dimensions to match the performance of CSR at \(k=8\) or MRL at 32 dimensions, providing up to a 300× improvement in computational and memory efficiency compared to dense embeddings.
- Debiased and Denoised Representation Learning for Incomplete Multi-view Clustering
-
This paper proposes DDR-IMVC, which uses unbiased consensus representations learned from complete samples to correct the biased representations of missing-view samples, and then employs robust contrastive learning in the form of truncated InfoNCE to suppress completion noise, achieving more stable clustering results on multiple incomplete multi-view clustering datasets.
- Detect, Decide, Unlearn: A Transfer-Aware Framework for Continual Learning
-
To address the negative transfer issue in continual learning where "remembering outdated knowledge hinders new tasks," this paper proposes the DEDUCE framework. It detects negative transfer using transferability bounds or gradient conflict analysis, decides whether to trigger unlearning, and finally selectively erases interfering old knowledge using batch-level Local Unlearning (LUM) and network-level Global Unlearning (GUM). As a plug-and-play enhancement, it can be integrated into 9 CL baselines, achieving a maximum average performance gain of 4.55%.
- Difficult Examples Hurt Unsupervised Contrastive Learning: A Theoretical Perspective
-
This work rigorously proves through similarity graph model theory that "difficult examples" (cross-class high-similarity sample pairs) damage unsupervised contrastive learning performance. It shows that difficult examples strictly deteriorate the generalization error bound and proposes three theory-guided mitigation strategies: deleting difficult examples, adjusting margin, and temperature scaling, leading to up to 10.42% linear probing accuracy improvement on TinyImageNet. This finding is counter-intuitive: while "more data is better" is common in deep learning, meticulously removing difficult examples in contrastive learning is beneficial.
- Disentangled representation learning through unsupervised symmetry group discovery
-
This paper enables an embodied agent to automatically discover the underlying symmetry group decomposition of its action space through unsupervised interaction with the environment. It then learns "Linear Symmetry-Based Disentanglement" (LSBD) based on this discovered structure, overcoming the limitation of prior methods requiring manual pre-specification of group structures. The method outperforms existing LSBD approaches across three different group-structured environments.
- Disentanglement of Variations with Multimodal Generative Modeling
-
IDMVAE builds upon the multimodal VAE framework by adding two types of mutual information (MI) regularization—maximizing cross-view MI to extract shared variables and using cycle-consistent generative augmentation to remove redundancy. By replacing Gaussian priors with diffusion models, it achieves clean separation of shared and private information on challenging datasets where likelihood models are insufficient.
- Disentangling the Factors of Convergence between Brains and DINOv3
-
The authors train a series of DINOv3 self-supervised vision models with systematically controlled variables from scratch. Using three complementary metrics—"Encoding Score / Spatial Score / Temporal Score"—they align model representations with human fMRI and MEG data. This approach quantitatively disentangles how "model scale, training amount, and image type" independently and interactively drive models to become "brain-like," revealing that the emergence of this similarity follows a timeline highly consistent with human cortical development.
- DiVeQ: Differentiable Vector Quantization Using the Reparameterization Trick
-
DiVeQ reformulates the non-differentiable operation of "mapping latent variables to the nearest codeword" as "adding an error vector to the latent variable, where the vector aligns with the direction of the nearest codeword and its length equals the quantization error." This maintains hard quantization in the forward pass while ensuring smooth gradient flow in the backward pass. Its space-filling variant, SF-DiVeQ, generalizes the quantization target from discrete codewords to line segments between codewords. It achieves higher reconstruction accuracy than STE / EMA / Rotation Trick / Gumbel-Softmax / NSVQ across image compression, generation, and speech coding tasks without requiring auxiliary losses or temperature annealing.
- Diverse Dictionary Learning
-
When the generative process \(g\) of observations \(X=g(Z)\) and the latent variables \(Z\) are both unknown, and one is reluctant to introduce strong assumptions like linearity or auxiliary supervision, this paper proves that the intersection, complement, symmetric difference of latent variable "sets," as well as the latent-observation dependency structure, remain identifiable under minimal assumptions. It identifies that achieving this only requires a universal inductive bias during estimation: adding an L1 sparsity penalty to the Jacobian ("dependency sparsity").
- Dual Perspectives on Non-Contrastive Self-Supervised Learning
-
This paper rigorously proves from both optimization and dynamical systems perspectives that the stop-gradient (SG) and EMA training processes commonly used in non-contrastive self-supervised learning do not minimize any well-defined objective function. However, they do avoid collapse upon convergence, and their non-trivial equilibria are asymptotically stable in the linear case.
- Equivariant Splitting: Self-supervised learning from incomplete data
-
By combining the invariance prior of "Equivariant Imaging (EI)" with the efficient unbiased properties of "measurement splitting," this paper proposes the Equivariant Splitting (ES) loss. This allows training a reconstructor that approximates the MMSE using only a single highly under-sampled forward operator, without requiring multiple forward evaluations as in EI.
- Exploiting Low-Dimensional Manifold of Features for Few-Shot Whole Slide Image Classification
-
This paper discovers that features from Pathology Foundation Models (PFMs) possess a low-dimensional manifold geometry (with an effective rank of only 29.7 out of 512 dimensions). Standard linear layers tend to destroy this structure, leading to few-shot overfitting. The authors propose the plug-and-play MR Block, which utilizes a frozen random matrix as a geometric anchor and a low-rank residual path for task adaptation, achieving SOTA performance in few-shot WSI classification.
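For readers unfamiliar with the term, one common definition of effective rank is the exponential of the Shannon entropy of the normalized singular values. The sketch below uses that definition; whether the paper computes its 29.7 figure in exactly this way (for example, with or without centering the features) is an assumption.

```python
import numpy as np

def effective_rank(matrix):
    """Effective rank as exp(entropy) of the normalized singular values."""
    s = np.linalg.svd(matrix, compute_uv=False)
    p = s / s.sum()
    p = p[p > 0]  # drop exact zeros so that log(p) is defined
    return float(np.exp(-(p * np.log(p)).sum()))
```

A matrix whose singular values are all equal reaches the full dimension, while a matrix dominated by a few directions scores close to the number of those directions.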
- FedOpenMatch: Towards Semi-Supervised Federated Learning in Open-Set Environments
-
This paper formally introduces the "Open-Set Semi-Supervised Federated Learning" (OSSFL) problem, where unlabeled data at clients contains unknown category samples outside the label space. It proposes FedOpenMatch, the first framework for this task, which employs a one-vs-all (OVA) outlier detector reinforced by "gradient stop + logit adjustment" combined with logit consistency regularization, improving open-set accuracy by up to 14.33% under heterogeneous federated data.
- Fly-CL: A Fly-Inspired Framework for Enhancing Efficient Decorrelation and Reduced Training Time in Pre-trained Model-based Continual Representation Learning
-
Inspired by the fruit fly olfactory circuit, this work proposes the Fly-CL framework. It achieves SOTA performance in pre-trained model-based continual learning while significantly reducing training time through a three-stage progressive decorrelation process involving sparse random projection, top-k operation, and streaming ridge classification.
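The three stages named in this summary can be sketched as follows. This is a minimal reading of the description, not Fly-CL's implementation: the class name, the binary projection, the density, and the ridge strength are all illustrative assumptions, and the input `x` stands for features from a frozen pre-trained model.

```python
import numpy as np

class FlyStyleClassifier:
    def __init__(self, in_dim, proj_dim, n_classes, k, density=0.1, ridge=1.0, seed=0):
        rng = np.random.default_rng(seed)
        # Stage 1: fixed sparse binary random projection (never trained).
        self.proj = (rng.random((in_dim, proj_dim)) < density).astype(np.float64)
        self.k, self.ridge, self.n_classes = k, ridge, n_classes
        # Sufficient statistics of ridge regression, accumulated over the stream.
        self.gram = np.zeros((proj_dim, proj_dim))
        self.cross = np.zeros((proj_dim, n_classes))

    def _encode(self, x):
        h = x @ self.proj
        # Stage 2: keep the k largest activations per sample, zero the rest
        # (ties at the threshold may keep a few more than k).
        thresh = np.partition(h, -self.k, axis=1)[:, -self.k][:, None]
        return np.where(h >= thresh, h, 0.0)

    def partial_fit(self, x, y):
        # Stage 3: streaming update; no earlier samples need to be stored.
        h = self._encode(x)
        self.gram += h.T @ h
        self.cross += h.T @ np.eye(self.n_classes)[y]

    def predict(self, x):
        reg = self.ridge * np.eye(len(self.gram))
        w = np.linalg.solve(self.gram + reg, self.cross)
        return (self._encode(x) @ w).argmax(axis=1)
```

Because only the two accumulated matrices are kept, feeding the data in several batches yields the same classifier as feeding it all at once, which is what makes the ridge step suitable for a continual stream.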
- GUIDE: Gated Uncertainty-Informed Disentangled Experts for Long-tailed Recognition
-
GUIDE systematically dismantles the three-level "representation-decision-optimization" entanglement prevalent in multi-expert long-tailed recognition. It employs competitive specialization to force experts to learn distinct features, utilizes epistemic/aleatoric uncertainty decomposition to diagnose difficult samples for targeted refinement, and implements dual-time-scale updates to isolate the optimization of the main task from the meta-strategy, setting new SOTA results across five long-tailed benchmarks.
- HiMAE: Hierarchical Masked Autoencoders Discover Resolution-Specific Structure in Wearable Time Series
-
HiMAE integrates masked autoencoders into a U-Net-style hierarchical 1D CNN, allowing intermediate layers to naturally correspond to embeddings at different temporal resolutions. This transforms "resolution" from a hyperparameter into a probe-based diagnostic tool, while the model is small enough to perform sub-millisecond inference on smartwatch CPUs.
- In Context Semi-Supervised Learning
-
This paper introduces the In-Context Semi-Supervised Learning (IC-SSL) problem and constructs a two-stage Transformer. It first learns geometric spectral representations from a large number of unlabeled samples within the same context, then executes categorical ICL with a few labels during forward propagation, significantly improving classification accuracy and cross-geometric generalization in low-label scenarios.
- Incomplete Multi-View Multi-Label Classification via Shared Codebook and Fused-Teacher Self-Distillation
-
Addressing the "dual-missing" scenario where both views and labels are absent, SCSD utilizes a cross-view shared discrete codebook to quantize and align various views into a consistent representation. It further achieves robust multi-view multi-label classification through weighted fusion based on label correlation and self-distillation using "fused predictions as teachers."
- InfoNCE Induces Gaussian Distribution
-
This paper theoretically demonstrates that the InfoNCE loss induces representations to converge toward a Gaussian distribution through two complementary mechanisms: an empirical idealization path (alignment + spherical uniformity → Gaussian) and a regularized path (vanishing regularization → isotropic Gaussian). These findings are validated using synthetic data and CIFAR-10.
- Learning Dynamics of Logits Debiasing for Long-Tailed Semi-Supervised Learning
-
This paper provides a unified explanation of various debiasing methods in Long-Tailed Semi-Supervised Learning (LTSSL) from the perspective of "learning dynamics"—demonstrating that they all essentially reshape gradient flows. Based on this, it proposes DyTrim, a training-efficient dynamic pruning framework that performs class-aware hard pruning for labeled data and confidence-based soft pruning for unlabeled data to reallocate the gradient budget toward samples that actually rectify bias.
- MaskCO: Masked Generation Drives Effective Representation Learning and Exploiting for Combinatorial Optimization
-
MaskCO redefines "learning to solve combinatorial optimization" as "masked self-supervision on optimal solutions": part of the optimal solution is masked and the model learns to complete it. This splits a single (instance, solution) pair into exponentially many local learning signals, and a "mask-reconstruct" loop at inference iteratively refines solutions. It reduces the optimality gap by 99%+ on TSP/CVRP/MIS and achieves an approximately 10× speedup.
- Maximizing Asynchronicity in Event-based Neural Networks
-
This paper proposes the EVA framework, which treats events as language tokens and uses an RWKV-6-based linear-attention asynchronous encoder for event-by-event feature updates. Trained with Multi-Representation Prediction (MRP) and Next-Representation Prediction (NRP) self-supervised objectives, it learns generalizable features and is the first method in the Asynchronous-to-Synchronous (A2S) paradigm to handle a demanding object detection task (0.477 mAP on the Gen1 dataset).
- Maximizing Incremental Information Entropy for Contrastive Learning
-
The IE-CL (Incremental-Entropy Contrastive Learning) framework is proposed to explicitly optimize the entropy gain between augmented views (rather than just maximizing mutual information). By treating the encoder as an information bottleneck and jointly optimizing learnable transformations (entropy generation) with encoder regularizers (entropy preservation), it consistently improves contrastive learning performance on CIFAR-10/100, STL-10, and ImageNet under small-batch settings. The core modules can be integrated into existing frameworks as plug-and-play components.
- Mechanistic Independence: A Principle for Identifiable Disentangled Representations
-
This paper proposes "mechanistic independence" as a unifying principle for the identifiability of disentangled representations. It defines factors by how they act on observations through a generator (rather than how they are distributed), providing a family of subspace identifiability theorems that are invariant to latent density re-weighting and hold even under nonlinear, non-invertible mixing.
- Midway Network: Learning Representations for Recognition and Motion from Latent Dynamics
-
Midway Network transfers "latent dynamics modeling" from decision-making domains to natural videos. By employing a midway top-down path to infer latent motion variables between frames, combined with dense forward prediction and a hierarchical structure, it is the first method to successfully learn both "object recognition (semantic segmentation)" and "motion understanding (optical flow)" representations using only natural videos.
- Mini-cluster Guided Long-tailed Deep Clustering
-
This paper proposes MiniClustering, which utilizes an auxiliary "fine-grained over-clustering" head to estimate how many mini-clusters each target cluster occupies. Under purely unsupervised conditions, it infers head/tail attributes for each class to re-weight the self-training loss, systematically introducing the re-weighting concept from supervised long-tailed learning into deep clustering for the first time.
- Multimodality as Supervision: Self-Supervised Specialization to the Test Environment via Multimodality
-
By treating "pre-training data originating entirely from the deployment environment itself" as a sandbox, this paper proposes Test-Space Training (TST): performing cross-modal self-supervised pre-training using multimodal data collected within a single test space. The resulting model outperforms universal models trained on internet-scale data (e.g., DINOv2, CLIP, 4M-21) on segmentation, detection, and captioning tasks within that specific environment.
- NEO — No-Optimization Test-Time Adaptation through Latent Re-Centering
-
NEO discovers that input distribution shifts cause a global translation in penultimate embeddings shared across samples and classes. By re-centering test features to the origin using a single global centroid vector, it outperforms seven mainstream TTA methods with zero optimization, zero hyperparameters, and near-zero additional overhead.
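The re-centering step described here needs very little code. The sketch below keeps a running mean of the penultimate embeddings seen at test time and subtracts it before the classifier head is applied. How NEO actually estimates the centroid (running, per batch, or otherwise) is an assumption, and the class name is hypothetical.

```python
import numpy as np

class LatentRecentering:
    def __init__(self, dim):
        self.centroid = np.zeros(dim)  # single global centroid vector
        self.count = 0

    def __call__(self, feats):
        """Update the centroid with a batch of embeddings (n, dim) and
        return the batch shifted so the centroid sits at the origin."""
        n = feats.shape[0]
        self.count += n
        # Incremental mean update over all test embeddings seen so far.
        self.centroid += (feats.sum(axis=0) - n * self.centroid) / self.count
        return feats - self.centroid
```

No gradients, learning rate, or per-class statistics are involved; the shifted features are passed to the unchanged classifier head.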
- On the Alignment Between Supervised and Self-Supervised Contrastive Learning
-
This paper theoretically proves that under shared randomness, self-supervised contrastive learning (CL) and a supervised surrogate—"Negative-only Supervised Contrastive Learning" (NSCL)—maintain a high level of alignment in the representation similarity space throughout training (with high-probability lower bounds for CKA/RSA), even while their parameters might diverge exponentially. This establishes NSCL as a principled bridge connecting self-supervised and supervised learning.
- One-Shot Exemplars for Class Grounding in Self-Supervised Learning
-
This paper proposes the OSESSL (One-Shot Exemplar SSL) setting, in which only one labeled image per class is provided to "ground" self-supervised features into the real class space. The method constructs class prototypes using labeled exemplars and discriminative neighbors to align unlabeled data, while employing interpolation consistency to smooth decision boundaries. On CIFAR-100 and ImageNet-100, k-NN accuracy improves by approximately 3% and 6% over the previous SOTA.
- OrthoRF: Exploring Orthogonality in Object-Centric Representations
-
Building on unsupervised object discovery frameworks like Rotating Features (RF) that "bind objects via phase synchrony," OrthoRF enforces orthogonality in an \(n\)-dimensional orientation space through softmax competitive binding and an inner-product orthogonal loss. This allows objects to occupy distinct dimensions, eliminating the need for post-hoc k-means clustering. The method matches or exceeds existing techniques in overlapping, noisy, or out-of-distribution scenarios and can recover occluded object parts within the intermediate representations.
- Part-level Semantic-guided Contrastive Learning for Fine-grained Visual Classification
-
PSCL utilizes ClearCLIP to decouple "region selection" and "region representation" into two separate branches. Combined with multi-scale multi-part progressive reasoning and a vision-language contrastive loss incorporating intermediate-granularity categories, it achieves SOTA or highly competitive accuracy across five FGVC datasets.
- PAS: Estimating the Target Accuracy Before Domain Adaptation
-
This paper proposes PAS (Potential Adaptability Score)—an asymmetric score computed before actual domain adaptation training using only pre-trained model embeddings. It measures the transferability of a source domain and a pre-trained model to an unlabeled target task by evaluating the "relative margin between the nearest and second-nearest distances" from target samples to source class centroids. This allows for selecting the optimal "source domain + pre-trained model" combination that yields the highest post-adaptation target accuracy, avoiding the heavy overhead of exhaustive training.
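A margin of the kind described can be sketched from embeddings alone. The version below averages `(d2 - d1) / d2` over target samples, where `d1` and `d2` are the nearest and second-nearest distances to source class centroids. The exact normalization and aggregation used by PAS are not given in this summary, so this formula and the function name are assumptions.

```python
import numpy as np

def relative_margin_score(source_feats, source_labels, target_feats):
    """Mean relative margin between nearest and second-nearest
    source-centroid distances, computed per target sample."""
    classes = np.unique(source_labels)
    centroids = np.stack(
        [source_feats[source_labels == c].mean(axis=0) for c in classes]
    )
    # Pairwise Euclidean distances, shape (n_target, n_classes).
    dists = np.linalg.norm(target_feats[:, None, :] - centroids[None, :, :], axis=2)
    nearest_two = np.sort(dists, axis=1)[:, :2]
    d1, d2 = nearest_two[:, 0], nearest_two[:, 1]
    # 1 means the sample sits on a centroid; 0 means it is equidistant.
    return float(np.mean((d2 - d1) / (d2 + 1e-12)))
```

The score is asymmetric because centroids come only from the labeled source side, while target samples contribute no labels.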
- Plug-and-Play Compositionality for Boosting Continual Learning with Foundation Models
-
CompSLOT utilizes Slot Attention to unsupervisedly extract image concept slots from frozen ViT backbones. It then selects class-related "primitives" and distills the pairwise similarities of these primitives into the logits of arbitrary continual learners. This "plug-and-play" mechanism consistently improves performance and alleviates catastrophic forgetting across various foundation-model-based continual learning methods.
- PonderLM: Pretraining Language Models to Ponder in Continuous Space
-
PonderLM introduces a "pondering" mechanism during pre-training: it converts the predicted probability distribution into a continuous embedding via a weighted sum and performs repeated forward passes. Without requiring annotated data or reinforcement learning, a 2.8B model outperforms a 6.9B model across 9 downstream tasks.
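The weighted-sum step can be sketched as a softmax over the vocabulary followed by a product with an embedding table. That the weights are applied to token embeddings, and how the result is fed back into the next forward pass, are assumptions here; the function and argument names are hypothetical.

```python
import numpy as np

def pondering_embedding(logits, embedding_matrix):
    """Turn next-token logits (..., vocab) into a continuous embedding
    (..., dim) as the probability-weighted sum of embedding rows."""
    z = logits - logits.max(axis=-1, keepdims=True)  # stable softmax
    probs = np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)
    return probs @ embedding_matrix
```

A confident prediction returns nearly the embedding of one token, while an uncertain one returns a blend, so the operation stays differentiable where sampling a discrete token would not.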
- PredNext: Explicit Cross-View Temporal Prediction for Unsupervised Learning in Spiking Neural Networks
-
PredNext introduces a plug-and-play "cross-view future prediction" module for self-supervised video learning in Spiking Neural Networks (SNNs). By simultaneously predicting features of the next time step and the next clip within the same video, it enhances temporal feature consistency without imposing rigid constraints. This allows deep SNNs to learn unsupervised representations on large-scale video datasets like UCF101 that approach the performance of ImageNet supervised pre-training.
- PRISM: Progressive Robust Learning for Open-World Continual Category Discovery
-
PRISM proposes "Open-World Continual Category Discovery" (OW-CCD), a more realistic setting where data streams contain both new categories and domain shifts. By utilizing a "High-frequency Categorical Shunting + Sparse Assignment Matching + Invariant Knowledge Transfer" toolkit, it consistently achieves new CCD SOTA on SSB-C and DomainNet (with a 15.1% gain on the clean domain of CUB-C).
- PromptHub: Enhancing Multi-Prompt Visual In-Context Learning with Locality-Aware Fusion, Concentration and Alignment
-
PromptHub upgrades Multi-Prompt Visual In-Context Learning (VICL) fusion from "patch-wise concatenation" to "locality-enhanced fusion in embedding space." Coupled with a triple loss loop (prediction/alignment/utilization) and VICL-specific data augmentation, it ensures the backbone truly trusts and utilizes the fused prompts, consistently outperforming the predecessor CONDENSER in segmentation, detection, and colorization tasks.
- Regularized Latent Dynamics Prediction is a Strong Baseline for Behavioral Foundation Models
-
The paper proposes Regularized Latent Dynamics Prediction (RLDP), which maintains feature diversity by adding simple orthogonal regularization to a self-supervised latent next-state prediction objective. RLDP matches or exceeds complex SOTA representation learning methods in zero-shot RL, demonstrating significant advantages particularly in low-coverage scenarios.
- Relationship Alignment for View-aware Multi-view Clustering
-
RAV preserves the neighborhood structure of each view through "cross-view sample relationship alignment" and dynamically adjusts the intensity of cluster-level label contrastive learning using "view-aware adaptive weighting" based on Wasserstein distance. This ensures strong alignment for similar views and weak alignment for dissimilar views, overall surpassing existing SOTA on ten multi-view clustering benchmarks.
- Representation Alignment for Diffusion Transformers without External Components
-
This paper discovers a "bad-to-good" evolution of discriminative representations within Diffusion Transformers. It proposes SRA (Self-Representation Alignment): aligning student representations at "shallower layers + higher noise" with EMA teacher representations at "deeper layers + lower noise." This accelerates DiT/SiT training without any external tasks or pre-trained encoders, significantly outperforming methods depending on external tasks and approaching REPA, which relies on DINOv2.
- Rethinking JEPA: Compute-Efficient Video Self-Supervised Learning with Frozen Teachers
-
This paper replaces the "online EMA teacher" in V-JEPA with a "static teacher" that is pre-trained via pixel reconstruction and subsequently frozen. This results in SALT, a simplified two-stage framework that requires no anti-collapse regularization. SALT outperforms V-JEPA 2 in frozen backbone evaluations while saving compute, and unexpectedly reveals that a small, "weak" teacher can effectively supervise a very strong student.
- Rethinking Unsupervised Cross-Modal Flow Estimation: Learning from Decoupled Optimization and Consistency Constraint
-
DCFlow shifts unsupervised cross-modal optical flow estimation from "implicit learning via appearance similarity" to "decoupled optimization + explicit motion supervision." By utilizing geo-aware single-image data synthesis, it generates reliable synthetic flow labels for the flow network, allowing the modality translation and flow networks to train on their respective sub-tasks independently. These are then jointly fine-tuned using cross-modal consistency constraints, significantly reducing EPE across five real-world datasets and achieving state-of-the-art (SOTA) performance among unsupervised methods.
- Samples Are Not Equal: A Sample Selection Approach for Deep Clustering
-
This paper argues that deep clustering over-learns simple and redundant samples in high-density regions. It proposes a plug-and-play sample selection component: using local density to re-estimate clustering prototypes during initialization, and dynamically removing learned samples during training based on prediction consistency and pseudo-label stability. This approach simultaneously improves clustering accuracy and training efficiency across various deep clustering baselines.
- SCAD: Super-Class-Aware Debiasing for Long-Tailed Semi-Supervised Learning
-
SCAD identifies the issue of "local bias within semantically similar categories" in long-tailed semi-supervised learning. It utilizes automatically discovered super-class contexts to perform instance-level dynamic corrections for logit adjustment. SCAD consistently improves existing LTSSL methods on benchmarks such as CIFAR, STL, ImageNet-127, and Food101-LT.
- Self-Predictive Representations for Combinatorial Generalization in Behavioral Cloning
-
Addressing the combinatorial generalization deficit in Goal-Conditioned Behavioral Cloning (GCBC), which fails to "stitch" novel state-goal pairs, this paper proposes BYOL-\(\gamma\): a self-predictive representation learning objective that samples future states using a geometric distribution to approximate the successor measure. As an auxiliary loss for BC, it requires neither TD learning nor negative samples, achieving average success rates across OGBench stitching tasks that surpass all baseline methods.
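The geometric sampling of future states mentioned above can be sketched as follows. This is a hedged illustration, not the authors' code: the function name `sample_future_index`, the offset convention (k starts at 1), and the clipping to the end of the trajectory are all assumptions.

```python
import math
import random


def sample_future_index(t, horizon, gamma, rng):
    """Return a future index t + k with k ~ Geometric(1 - gamma), k = 1, 2, ...

    P(k) = (1 - gamma) * gamma ** (k - 1), sampled by inverting the CDF.
    Clipping to the last index (horizon - 1) is an assumption of this sketch.
    Requires 0 < gamma < 1.
    """
    u = rng.random()  # uniform in [0, 1)
    k = 1 + int(math.log(1.0 - u) / math.log(gamma))
    return min(t + k, horizon - 1)


rng = random.Random(0)
offsets = [sample_future_index(0, 10 ** 6, 0.9, rng) for _ in range(20000)]
# The mean offset should be close to 1 / (1 - gamma) = 10.
print(sum(offsets) / len(offsets))
```

A larger `gamma` places more weight on distant future states, so the sampled targets cover a discounted distribution over the rest of the trajectory rather than only the immediate next state.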
- SNAP-UQ: Self-supervised Next-Activation Prediction for Single-Pass Uncertainty
-
SNAP-UQ proposes a single-pass uncertainty estimation method tailored for TinyML scenarios: it attaches tiny int8 prediction heads to selected layers of the backbone network to predict next-layer activation statistics in a self-supervised manner. The deviation between actual activations and predictions ("surprisal") is aggregated into an uncertainty score. This approach requires no additional forward passes, temporal buffering, or ensembles, enabling reliable out-of-distribution (OOD) and failure detection on microcontrollers with only a few dozen KB of additional flash memory.
- SnaPhArd Contrast Learning
-
Starting from optimality conditions, this paper theoretically proves that "easy samples" in contrastive learning act as fixation points for the optimal solution and induce representation collapse. It proposes SPACL: using dynamic anchors + farthest point iteration to select hard positives, adversarial generators to create hard negatives, and relative thresholds to filter trivial negatives. It consistently outperforms or matches SOTA across image classification, knowledge graph link prediction, and out-of-distribution intent detection tasks.
- Soft Equivariance Regularization for Invariant Self-Supervised Learning
-
This paper proposes SER (Soft Equivariance Regularization), a layer-decoupling design that applies soft equivariance regularization to intermediate layers of ViT while maintaining the invariance objective at the final layer. It consistently improves classification accuracy and robustness for invariant SSL methods (MoCo-v3, DINO, Barlow Twins) without introducing additional modules.
- Spatial Structure and Selective Text Jointly Facilitate Image Clustering
-
SATC constructs a graph for each image, using GAT to extract spatial structure features between patches and compensate for the missing local structure in CLIP. It employs a selector based on "textual compactness \(\tau\)" to automatically decide whether to introduce text features for a given dataset. Finally, it achieves clustering through mutual distillation across vision, spatial, and textual modalities, outperforming SOTA methods such as TAC across 18 benchmarks.
- Spatially Informed Autoencoders for Interpretable Visual Representation Learning
-
This paper proposes SI-VAE (Spatially Informed Variational Autoencoder), which utilizes the pseudo-likelihood of spatial point processes as a self-supervised objective to supervise the VAE latent space. This allows the model to learn statistically interpretable representations of "spatial arrangements between objects" rather than just pixel intensities. On synthetic data, it improves point pattern classification accuracy from 48% (standard VAE) to 80%–90% and enables zero-shot conditional simulation of point processes from single images, applied to the analysis of protein localization in human cells.
- SplitLoRA: Balancing Stability and Plasticity in Continual Learning Through Gradient Space Splitting
-
SplitLoRA transforms the long-standing challenge of "determining the minor subspace dimension" in continual learning from a heuristic threshold into a solvable optimization problem. It derives a theoretical upper bound for "Stability Loss + Plasticity Loss" as a function of the subspace dimension \(k\), solves for the module-specific optimal \(k^*\) for each LoRA module, and fixes the LoRA down-projection matrix \(A\) to this subspace while training only \(B\). It outperforms existing methods by 2%–5% on ImageNet-R, CIFAR-100, and DomainNet.
- Symmetric Space Learning for Combinatorial Generalization
-
This paper proposes CartanFM, which constrains the latent representation space to a symmetric space. By utilizing Cartan decomposition and geodesic symmetry consistency, the model extrapolates symmetries from observed combinations to unobserved ones. It significantly outperforms VAEs and existing symmetry learning methods on combinatorial generalization benchmarks such as dSprites, 3D Shapes, and MPI3D.
- Temporal Slowness in Central Vision Drives Semantic Object Learning
-
By simulating human central vision (fixation-based cropping) and the temporal slowness principle (temporal contrastive learning), this work trains an SSL model on Ego4D data. The findings suggest that the combination of these two effectively enhances semantic object representations—central vision strengthens foreground extraction, while temporal slowness distills semantic information during fixations.
- Test-Time Efficient Pretrained Model Portfolios for Time Series Forecasting
-
Chroma is proposed as a framework for small pretrained time series model portfolios. By producing frequency/domain experts from a generalist model through post-training (achieving 10× training acceleration) and combining them via test-time model selection or greedy ensemble, a portfolio with 4M parameters matches the performance of large monolithic models with 205M-500M parameters on Chronos Benchmark II, while maintaining inference computation far below that of test-time fine-tuning.
- TrainRef: Curating Data with Label Distributions and Minimal Reference Samples for Accurate Prediction and Reliable Confidence
-
TrainRef utilizes a minimal (one sample per class is sufficient) trusted reference set \(D_\text{ref}\) as "extrinsic normality" to select clean samples. It rewrites labels from "one-hot classes" into "label distributions." Through a three-phase process—MIM pre-training, influence function filtering, and curation-training co-evolution—it achieves new SOTA performance in both accuracy and Expected Calibration Error (ECE) on CIFAR-100, Animal-10N, and WebVision.
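For readers unfamiliar with the calibration metric named above, here is a minimal sketch of the standard equal-width-bin Expected Calibration Error. The function name, the bin count, and the left-closed binning convention are choices made for this illustration; the evaluation protocol used in the paper is not described in this note.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average of |accuracy - mean confidence| over equal-width confidence bins."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 is placed in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += len(members) / n * abs(accuracy - avg_conf)
    return ece


# A model that is always fully confident but right only half the time.
print(expected_calibration_error([1.0, 1.0, 1.0, 1.0], [True, False, True, False]))
```

Lower values mean that the stated confidence tracks the observed accuracy more closely.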
- Two-Way is Better Than One: Bidirectional Alignment with Cycle Consistency for Exemplar-Free Class-Incremental Learning
-
Addressing the challenge of "old class prototype drift caused by backbone updates" in Exemplar-Free Class-Incremental Learning, this paper proposes BiCyc: simultaneously learning an "old \(\to\) new" adapter \(A\) and a "new \(\to\) old" distiller \(D\) during the training phase. It enforces both as mutual inverse mappings using stop-gradient gating and cycle consistency loss, thereby accurately transporting old class Gaussian prototypes to the new feature space. It minimizes forgetting and outperforms state-of-the-art methods like AdaGauss and DPCR on from-scratch benchmarks including CIFAR-100 and TinyImageNet.
- Uncover Underlying Correspondence for Robust Multi-view Clustering
-
This paper treats cross-view correspondences in noisy multi-view clustering as latent variables and proposes CorreGen. Using an EM framework, it generates soft correspondence distributions in the embedding space while simultaneously handling category-level mismatch, mismatched samples, and unalignable samples through GMM marginal estimation and a virtual sample mechanism, significantly enhancing clustering robustness in noisy correspondence scenarios.
- Understanding the Learning Phases in Self-Supervised Learning via Critical Periods
-
This paper identifies a "transferability tradeoff" in self-supervised pre-training—where intermediate checkpoints exhibit stronger out-of-distribution (OOD) generalization than final ones. Drawing on the biological and supervised learning concept of "critical periods," the authors characterize SSL through three stages—plasticity, consolidation, and over-specialization—using deficit injection and Fisher Information (FI) probes. They further propose two lightweight strategies, CP-guided checkpoint selection and self-distillation, to balance in-distribution (ID) and OOD performance.
- Understanding the Robustness of Distributed Self-Supervised Learning Frameworks Against Non-IID Data
-
This paper provides a rigorous theoretical analysis of the robustness of different distributed self-supervised learning (D-SSL) frameworks under non-IID data. It proves that Masked Image Modeling (MIM) is inherently more resistant to heterogeneity than Contrastive Learning (CL), and that robustness increases with the average network connectivity (Federated Learning is at least as robust as Decentralized Learning). Based on these insights, the authors design MAR loss with local-global alignment regularization as a practical exemplar.
- Unified and Efficient Multi-view Clustering from Probabilistic Perspective
-
UEMCP reinterprets anchor-based multi-view clustering as "data point \(\rightarrow\) anchor \(\rightarrow\) category" probabilistic transition learning. It simultaneously learns consensus anchors, view weights, anchor graphs, and category assignments within a unified objective, achieving superior clustering performance and near-linear complexity on several large-scale multi-view datasets.
- Unsupervised Representation Learning - An Invariant Risk Minimization Perspective
-
This paper generalizes Invariant Risk Minimization (IRM), which originally depends on labels, to unlabeled scenarios. It redefines "invariance" as "feature distribution alignment across environments" and proposes two methods: PICA for linear Gaussian cases and VIAE for deep generative models. The approach achieves label-independent invariant structure extraction and cross-environment generalization on synthetic data, modified MNIST, and CelebA.
- Why Prototypes Collapse: Diagnosing and Preventing Partial Collapse in Prototypical Self-Supervised Learning
-
This work diagnoses the root cause of partial prototype collapse in prototypical self-supervised learning (SSL) as shortcut learning induced by the joint optimization of the encoder and prototypes. It proposes a fully decoupled training strategy—using an online GMM to independently estimate prototypes—to eliminate collapse and improve downstream performance.
- XIL: Cross-Expanding Incremental Learning
-
This paper proposes a novel continual learning setting, XIL, where class-incremental data originates from evolving domains. It requires the model to "fill" new classes back into old domains and "expand" old classes into new domains (Bi-directional Domain Transfer, BiDoT). The XEED framework is introduced, utilizing domain-specific prompts, diffusion models to generate cross-domain transfer samples, and evolving prototype classification, improving BiDoT scores by up to 31.41% on datasets with strong domain shifts.
- ZeroSiam: An Efficient Asymmetry for Test-Time Entropy Optimization without Collapse
-
To address the issue where test-time entropy minimization easily collapses into degenerate solutions (predicting the same class for all samples), this paper transfers the "asymmetric structure" from negative-free SSL. By inserting a learnable predictor before the classifier and applying a stop-gradient, it creates asymmetric online/target branches within a single forward pass. Alignment regularization excludes constant one-hot solutions from the optimal set. With almost zero extra overhead, it achieves greater stability and performance across vision TTA and LLM reasoning tasks, especially on collapse-prone small models.
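The collapse problem described in the ZeroSiam entry can be seen with a small calculation. The sketch below is not the ZeroSiam method (it has no predictor or stop-gradient); it only shows that plain entropy minimization rewards the degenerate solution. The logits are made-up values for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mean_entropy(batch_logits):
    """Average Shannon entropy (in nats) of the softmax predictions in a batch."""
    total = 0.0
    for logits in batch_logits:
        probs = softmax(logits)
        total -= sum(p * math.log(p) for p in probs if p > 0.0)
    return total / len(batch_logits)


# Hypothetical logits: a collapsed model predicts class 0 for every sample,
# a varied model predicts different classes with moderate confidence.
collapsed = [[20.0, 0.0, 0.0]] * 4
varied = [[2.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 2.0], [2.0, 0.0, 0.0]]
print(mean_entropy(collapsed) < mean_entropy(varied))
```

Because the collapsed batch attains near-zero entropy regardless of the inputs, an entropy-only objective cannot distinguish it from a genuinely confident model, which is why an additional constraint such as the alignment regularization described above is needed.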