🔄 Self-Supervised Learning

🧠 NeurIPS 2025 · 36 paper notes

A Joint Learning Approach to Hardware Caching and Prefetching: This paper proposes a joint training framework that unifies hardware cache replacement and prefetching policies. By constructing shared feature representations via a joint encoder and contrastive learning, the framework breaks the performance bottleneck imposed by independently trained policies.
Adv-SSL: Adversarial Self-Supervised Representation Learning with Theoretical Guarantees: This paper proposes Adv-SSL, which rewrites the Frobenius norm of the covariance regularization term as a minimax dual form, eliminating the biased sample-level risk estimation present in methods such as Barlow Twins. The approach substantially improves downstream classification performance without incurring additional computational cost, and provides end-to-end theoretical convergence guarantees.
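The minimax rewriting mentioned in the Adv-SSL note can be illustrated with a standard variational identity for the squared Frobenius norm, \(\|A\|_F^2 = \max_G \, 2\langle A, G\rangle_F - \|G\|_F^2\), with the maximum attained at \(G = A\). The sketch below only checks this identity numerically on a toy matrix. Whether the paper uses exactly this form is an assumption, and `A` is a hypothetical stand-in for the covariance regularization term. The inner objective is linear in `A`, which is what makes unbiased sample estimates possible once `A` is an expectation.

```python
import random

def frob_inner(a, b):
    """Frobenius inner product <A, B> = sum_ij A_ij * B_ij."""
    return sum(x * y for row_a, row_b in zip(a, b) for x, y in zip(row_a, row_b))

def dual_objective(a, g):
    """Inner objective 2<A, G> - ||G||_F^2, to be maximized over G."""
    return 2.0 * frob_inner(a, g) - frob_inner(g, g)

def random_matrix(rng, n):
    return [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(n)]

rng = random.Random(0)
A = random_matrix(rng, 3)          # hypothetical toy regularization matrix
frob_sq = frob_inner(A, A)         # ||A||_F^2 computed directly
at_opt = dual_objective(A, A)      # dual objective at the maximizer G = A
# Any other G gives a value no larger than ||A||_F^2,
# because ||A||_F^2 - (2<A,G> - ||G||_F^2) = ||A - G||_F^2 >= 0.
others = [dual_objective(A, random_matrix(rng, 3)) for _ in range(100)]

print(abs(at_opt - frob_sq) < 1e-12, max(others) <= frob_sq)
```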
Angular Constraint Embedding via SpherePair Loss for Constrained Clustering: This paper proposes the SpherePair loss function, which performs pairwise constraint embedding learning in angular space (rather than Euclidean space), enabling a deep constrained clustering method that requires neither anchors nor prior knowledge of the number of clusters, while providing rigorous theoretical guarantees for determining optimal hyperparameters.
Asymptotic and Finite-Time Guarantees for Langevin-Based Temperature Annealing in InfoNCE: By modeling embedding evolution as Langevin dynamics on a compact Riemannian manifold, this paper proves that the convergence guarantees of classical simulated annealing extend to the temperature scheduling setting in contrastive learning: a sufficiently slow logarithmic inverse-temperature schedule guarantees probabilistic convergence to the globally optimal representation set, whereas faster schedules risk trapping the system in suboptimal minima.
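To make the "logarithmic inverse-temperature schedule" concrete, here is a minimal sketch of a schedule of the form \(\beta_t = \log(t + 2) / c\) plugged into a single-anchor InfoNCE loss. The constant `c`, the `+ 2` offset (used only to keep \(\beta_0 > 0\)), and the toy similarity values are all assumptions for illustration, not values from the paper.

```python
import math

def inverse_temperature(step, c=1.0):
    """Logarithmic schedule beta_t = log(step + 2) / c (c is a hypothetical constant)."""
    return math.log(step + 2) / c

def info_nce(similarities, positive_index, beta):
    """InfoNCE loss for one anchor: -log softmax(beta * similarities)[positive_index]."""
    logits = [beta * s for s in similarities]
    m = max(logits)  # subtract the max for a numerically stable log-sum-exp
    log_sum_exp = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum_exp - logits[positive_index]

sims = [0.9, 0.1, -0.3]  # toy similarities; index 0 is the positive
for step in (0, 10, 1000, 100000):
    beta = inverse_temperature(step, c=0.1)
    print(step, round(1.0 / beta, 4), round(info_nce(sims, 0, beta), 4))
```

The temperature \(1/\beta_t\) shrinks very slowly under this schedule, which is the regime the convergence result refers to; a linear or exponential schedule would cool much faster.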
BrainOmni: A Brain Foundation Model for Unified EEG and MEG Signals: This paper proposes BrainOmni—the first brain signal foundation model that unifies EEG and MEG—by discretizing heterogeneous brain signals into a unified token space via BrainTokenizer (incorporating a physical Sensor Encoder), followed by self-supervised masked prediction pretraining with a Criss-Cross Transformer. The model achieves an 11.7 percentage-point improvement on Alzheimer's disease detection and demonstrates zero-shot reconstruction generalization to completely unseen devices.
Connecting Jensen-Shannon and Kullback-Leibler Divergences: A New Bound for Representation Learning: This paper derives a tight, optimal lower bound on the KL divergence as a function of the JS divergence, \(\Xi(D_{\text{JS}}) \leq D_{\text{KL}}\), that holds in the general case. It proves that training a discriminator by minimizing cross-entropy loss is equivalent to maximizing a guaranteed lower bound on mutual information, which supplies the previously missing theoretical foundation for JSD-based discriminative representation learning methods. The tightness and practical utility of the bound are validated on MI estimation and within the Information Bottleneck framework.
Continuous Subspace Optimization for Continual Learning (CoSO): This paper proposes CoSO, a framework that dynamically derives continuous subspaces from per-step gradient SVD (rather than LoRA's fixed subspace), combined with orthogonal projection onto historical task subspaces to prevent interference and Frequent Directions for efficient gradient information aggregation. CoSO achieves 78.19% final accuracy on ImageNet-R with 20 tasks, surpassing the best baseline by 2.77 percentage points.
Contrastive Consolidation of Top-Down Modulations Achieves Sparsely Supervised Continual Learning: This paper proposes Task-Modulated Contrastive Learning (TMCL), inspired by top-down modulation in the neocortex. In continual learning, sparse label information (as little as 1% labels) is integrated via affine modulation, and contrastive learning is then used to consolidate the modulation information into feedforward weights. TMCL surpasses both unsupervised and supervised baselines on class-incremental learning and transfer learning benchmarks.
Contrastive Representations for Temporal Reasoning: This paper proposes CRTR (Contrastive Representations for Temporal Reasoning), which introduces intra-trajectory negative pairs by repeating trajectory IDs within training batches. This eliminates the reliance on static contextual features in standard temporal contrastive learning, enabling representations that reflect temporal structure. CRTR achieves, for the first time, search-free solving on combinatorial reasoning tasks such as the Rubik's Cube.
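One way to picture "repeating trajectory IDs within training batches" is a batch sampler in which every selected trajectory contributes several anchor/positive pairs, so that other time steps of the same trajectory appear as negatives. The sketch below is an assumption-laden illustration: the function name `sample_batch`, the choice of the next time step as the positive, and the toy data are hypothetical and not taken from the paper.

```python
import random
from collections import Counter

def sample_batch(trajectories, num_ids, repeats, rng):
    """Return (traj_id, anchor_t, positive_t) triples.

    Each sampled trajectory ID appears `repeats` times, so a contrastive
    loss over the batch sees negatives drawn from the same trajectory.
    """
    ids = rng.sample(sorted(trajectories), num_ids)
    batch = []
    for tid in ids:
        length = len(trajectories[tid])
        for _ in range(repeats):
            t = rng.randrange(length - 1)   # anchor time step
            batch.append((tid, t, t + 1))   # positive: next step (assumed)
    return batch

# Toy data: 5 trajectories of 10 states each (states are just integers here).
trajectories = {f"traj{i}": list(range(10)) for i in range(5)}
batch = sample_batch(trajectories, num_ids=3, repeats=4, rng=random.Random(0))
id_counts = Counter(tid for tid, _, _ in batch)
print(len(batch), dict(id_counts))
```

With `repeats=1` this reduces to the usual setup where all negatives come from other trajectories, which is where static context features can be enough to tell samples apart.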
Curiosity-driven RL for Symbolic Equation Solving: This work combines curiosity-driven exploration mechanisms (RND, ICM, etc.) with a graph action space based on expression trees, enabling a PPO agent to solve nonlinear equations involving radicals, exponentials, and trigonometric functions — surpassing prior RL methods that were limited to linear equations.
DataRater: Meta-Learned Dataset Curation: This paper proposes DataRater, a meta-gradient-based data valuation framework that employs meta-learning to automatically score and filter low-quality training samples. It achieves up to 46.6% net compute savings across multiple pre-training datasets, and a DataRater trained on a 400M internal model generalizes directly to LLM training at scales ranging from 50M to 1B parameters.
Disentangling Hyperedges through the Lens of Category Theory: This work is the first to analyze hyperedge disentanglement through the lens of category theory. By deriving a naturality condition, it establishes a "factor representation consistency" criterion (aggregation-then-disentanglement vs. disentanglement-then-aggregation should yield consistent results), and proposes Natural-HNN, which comprehensively outperforms 14 baselines across 6 cancer subtype classification datasets (BRCA F1: 75.7% → 80.4%) while achieving 100% accuracy in capturing the functional context of genetic pathways.
Foundation Models for Scientific Discovery: From Paradigm Enhancement to Paradigm Transition: This paper proposes a three-stage framework (meta-scientific integration → hybrid human-AI co-creation → autonomous scientific discovery) to characterize how foundation models are driving a transition in scientific paradigms from tool-based enhancement toward paradigm-level transformation. It also provides a systematic survey of FM integration across the four classical scientific paradigms: experimental, theoretical, computational, and data-driven.
Hybrid Autoencoders for Tabular Data: Leveraging Model-Based Augmentation in Low-Label Settings: This paper proposes TANDEM (Tree-And-Neural Dual Encoder Model), a hybrid autoencoder architecture that jointly trains a neural network encoder and an Oblivious Soft Decision Tree (OSDT) encoder, and introduces a sample-level stochastic gating network as a learnable data augmentation mechanism. TANDEM achieves superior performance over strong baselines—including tree-based and deep learning methods—in low-label tabular data settings.
Implicit Modeling for Transferability Estimation of Vision Foundation Models: This paper proposes Implicit Transferability Modeling (ITM), a framework that encodes the transferability of model–task pairs via a latent variable \(z\), and employs Divide-and-conquer Variational Approximation (DVA) to efficiently simulate embedding space evolution. On 10 downstream tasks with 10 diverse pre-trained models, the weighted Kendall \(\tau_w\) improves from the previous state-of-the-art of 0.45 to 0.61.
Know Thyself by Knowing Others: Learning Neuron Identity from Population Context: This paper proposes NuCLR, a self-supervised framework that learns neuron-level representations enriched with population context via contrastive learning—pulling together different temporal windows of the same neuron and pushing apart different neurons within a population. NuCLR achieves new state-of-the-art performance on cell type and brain region decoding, and is the first to demonstrate cross-animal zero-shot generalization and data scaling laws in this domain.
Long-Tailed Recognition via Information-Preservable Two-Stage Learning: This paper proposes an information-preservable two-stage learning framework: Stage 1 employs Balanced Negative Sampling (BNS) to learn an effective and separable feature space via mutual information maximization; Stage 2 uses Information-Preservable DPP (IP-DPP) to sample the most informative examples in a mathematically principled manner to correct majority-biased decision boundaries. The method achieves state-of-the-art performance on multiple long-tailed benchmarks.
M-GRPO: Stabilizing Self-Supervised Reinforcement Learning for Large Language Models with Momentum-Anchored Policy Optimization: To address policy collapse and entropy collapse in self-supervised reinforcement learning for LLMs, this paper proposes M-GRPO, a momentum-anchored GRPO framework combined with an IQR-based low-entropy trajectory filtering method, achieving stable training and state-of-the-art performance.
M-GRPO: Stabilizing Self-Supervised Reinforcement Learning for Large Language Models with Momentum-Anchored Policy Optimization: To address the pervasive "policy collapse" problem in self-supervised reinforcement learning from verifiable rewards (SS-RLVR) during extended training, this paper proposes M-GRPO: a framework that employs a momentum model to provide stable pseudo-label targets alongside IQR-based low-entropy trajectory filtering to prevent entropy collapse. Training Qwen3-4B-Base on unlabeled MATH data, the final checkpoint directly surpasses the manually selected best checkpoint of SRT, achieving +2.92% on AIME24 and +5.05% on GPQA.
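The IQR-based low-entropy filter can be sketched as a lower Tukey fence on per-trajectory entropy: trajectories whose entropy falls below \(Q_1 - k \cdot \text{IQR}\) are dropped. The multiplier `k = 1.5` is the conventional default and an assumption here, as are the function name and the toy entropy values; the paper's exact rule may differ.

```python
import statistics

def filter_low_entropy(entropies, k=1.5):
    """Return indices of trajectories kept after dropping low-entropy outliers.

    A trajectory is dropped when its entropy is below Q1 - k * IQR.
    k = 1.5 is the conventional Tukey fence (an assumption, not the paper's value).
    """
    q1, _, q3 = statistics.quantiles(entropies, n=4)
    lower_fence = q1 - k * (q3 - q1)
    return [i for i, h in enumerate(entropies) if h >= lower_fence]

# Toy per-trajectory entropies; index 6 is an abnormally confident rollout.
entropies = [2.0, 2.1, 1.9, 2.2, 2.05, 1.95, 0.1, 2.15]
kept = filter_low_entropy(entropies)
print(kept)
```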
Manifolds and Modules: How Function Develops in a Neural Foundation Model: This work opens the "black box" of a state-of-the-art neural activity foundation model (FNN) from a computational neuroscience perspective. By constructing decoding and encoding manifolds, the study reveals that each processing module (encoder, recurrent, readout) exhibits qualitatively distinct representational structures, and identifies critical discrepancies between the model and the biological visual system.
Memory-Integrated Reconfigurable Adapters: A Unified Framework for Settings with Multiple Tasks: MIRA embeds Hopfield-style associative memory modules into each layer of a ViT, storing and retrieving LoRA adapter weights as key-value pairs. Through a two-stage training procedure (adaptation + consolidation), it simultaneously addresses domain generalization (DG), class-incremental learning (CIL), and domain-incremental learning (DIL) within a single unified architecture, significantly outperforming task-specific methods across multiple benchmarks.
Minimal Semantic Sufficiency Meets Unsupervised Domain Generalization: MS-UDG operates without class or domain labels, decomposing representations into semantic and variation components via an Information Disentanglement Module (IDM). Coupled with a Semantic Representation Optimization Module (SROM) that simultaneously maximizes semantic information and minimizes variation interference, the method achieves 72.89% accuracy on PACS (+1.5% vs. CycleMAE). Theoretical analysis proves that minimally sufficient semantic representations minimize the downstream Bayes error rate.
Mitra: Mixed Synthetic Priors for Enhancing Tabular Foundation Models: This paper presents the first systematic study of design principles for synthetic priors, identifying diversity, distinctiveness, and real-data alignment as critical attributes. Based on these findings, the authors propose Mitra — a tabular foundation model trained on a carefully selected mixture of synthetic priors — which consistently outperforms TabPFNv2 and TabICL on both classification and regression benchmarks.
One Filters All: A Generalist Filter for State Estimation: This paper proposes LLM-Filter, which reprograms a large language model (LLM) as a generalist state estimator. Through a System-as-Prompt (SaP) mechanism, the frozen LLM achieves zero-shot generalization to unseen dynamical systems, surpassing state-of-the-art learning-based filters.
SEAL: Semantic-Aware Hierarchical Learning for Generalized Category Discovery: This paper proposes SEAL, a framework that leverages naturally occurring semantic hierarchies (rather than manually constructed abstract hierarchies) to guide generalized category discovery. Through hierarchically semantic-guided soft contrastive learning and a cross-granularity consistency module, SEAL achieves state-of-the-art performance on fine-grained benchmarks.
Soft Task-Aware Routing of Experts for Equivariant Representation Learning: This paper proposes STAR (Soft Task-Aware Routing), which employs a MoE routing mechanism to coordinate shared and task-specific information between invariant and equivariant representation learning objectives, reducing redundant feature learning and improving downstream transfer performance.
STaRFormer: Semi-Supervised Task-Informed Representation Learning via Dynamic Attention-Based Regional Masking: This paper proposes STaRFormer, which employs Dynamic Attention-based Regional Masking (DAReM) to identify task-critical regions and apply masking perturbations, coupled with intra-batch and intra-class semi-supervised contrastive learning to embed task information into latent representations. The method comprehensively outperforms state-of-the-art baselines across 56 datasets spanning non-stationary, irregularly sampled, classification, anomaly detection, and regression settings.
T-REGS: Minimum Spanning Tree Regularization for Self-Supervised Learning: This paper proposes T-REGS — a self-supervised learning regularization framework based on maximizing the length of the minimum spanning tree (MST). The authors theoretically prove that the method simultaneously prevents dimensional collapse and promotes uniform distribution of representations on compact Riemannian manifolds, with empirical validation on standard JE-SSL benchmarks.
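The intuition behind maximizing MST length can be checked on toy data: embeddings spread evenly over a circle have a much longer minimum spanning tree than embeddings bunched into a small arc. The sketch below computes the Euclidean MST length with Prim's algorithm in plain Python. It illustrates only the quantity being regularized, not the T-REGS training objective itself, and the point sets are hypothetical.

```python
import math

def mst_length(points):
    """Total edge length of the Euclidean minimum spanning tree (Prim, O(n^2))."""
    n = len(points)
    if n < 2:
        return 0.0
    in_tree = [False] * n
    best = [math.inf] * n   # cheapest known edge from the tree to each vertex
    best[0] = 0.0
    total = 0.0
    for _ in range(n):
        # Pick the vertex outside the tree with the cheapest connecting edge.
        u = min((i for i in range(n) if not in_tree[i]), key=best.__getitem__)
        in_tree[u] = True
        total += best[u]
        for v in range(n):
            if not in_tree[v]:
                d = math.dist(points[u], points[v])
                if d < best[v]:
                    best[v] = d
    return total

def on_circle(angles):
    return [(math.cos(a), math.sin(a)) for a in angles]

spread = on_circle([2 * math.pi * k / 8 for k in range(8)])   # uniform on the circle
collapsed = on_circle([0.01 * k for k in range(8)])           # bunched in a tiny arc
print(mst_length(spread) > mst_length(collapsed))
```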
TabArena: A Living Benchmark for Machine Learning on Tabular Data: This paper introduces TabArena, the first continuously maintained "living" benchmark for tabular machine learning. From 1,053 candidate datasets, 51 are curated and 16 models are evaluated through large-scale experiments (~25 million model training runs). Key findings: under post-hoc ensembling, deep learning models match or surpass GBDTs; tabular foundation models excel on small datasets; and cross-model ensembles further advance the state of the art.
TabSTAR: A Tabular Foundation Model for Tabular Data with Text Fields: TabSTAR is a foundation model designed specifically for tabular data with text fields. It achieves target-aware text representations through end-to-end optimization with an unfrozen text encoder (e5-small-v2), injects target semantics via target-aware tokens, and enables cross-dataset transfer learning through a dataset-parameter-free architecture. After pre-training on 350 datasets, TabSTAR surpasses CatBoost-Tuned (4h tuning) on 12 out of 14 classification datasets and outperforms TabPFN-v2 on 8 out of 11 datasets.
The Complexity of Finding Local Optima in Contrastive Learning: This paper proves that finding local optima in contrastive learning is computationally hard: the discrete triplet maximization problem is PLS-hard (even when \(d=1\)), and continuous triplet loss minimization is CLS-hard, implying that (under standard assumptions) no polynomial-time algorithm exists for finding local optima.
Towards Reliable and Holistic Visual In-Context Learning Prompt Selection: This paper proposes RH-Partial2Global, which for the first time employs Spearman rank correlation tests to demonstrate that the "similarity-first hypothesis" in VICL is statistically significant yet exhibits extremely weak correlation strength (\(\bar{\rho} \approx 0.03\text{-}0.05\)). By constructing reliable candidate sets via Jackknife conformal prediction and achieving comprehensive uniform pairwise preference sampling through covering designs, the method consistently outperforms state-of-the-art approaches across three visual tasks: segmentation, detection, and colorization.
TRIDENT: Tri-Modal Molecular Representation Learning with Taxonomic Annotations and Structural Relationships: TRIDENT is a tri-modal molecular representation learning framework that introduces Hierarchical Taxonomic Annotations (HTA) as a third modality. It combines a volumetric contrastive loss for global tri-modal alignment with a functional group–text local alignment module, dynamically balancing the two objectives via a momentum mechanism. The framework achieves state-of-the-art performance across 18 molecular property prediction tasks.
Uncertainty-Guided Model Selection for Tabular Foundation Models in Biomolecule Efficacy Prediction: This paper proposes OligoICP, a method that uses the interquartile range (IQR) of TabPFN's predicted distributions as a label-free model selection heuristic, outperforming both specialized SOTA models and naive ensembles on siRNA knockdown efficiency prediction.
Understanding Ice Crystal Habit Diversity with Self-Supervised Learning: This paper presents the first application of self-supervised learning (SSL) to latent representation learning for ice crystal images. By pre-training a ViT on a large-scale cloud particle image dataset, the method learns continuous latent representations of ice crystal habits and quantifies habit diversity using the vMF concentration parameter, achieving a state-of-the-art classification accuracy of 84.39% with a 30× reduction in computational cost.
You Can Trust Your Clustering Model: A Parameter-free Self-Boosting Plug-in for Deep Clustering: This paper proposes DCBoost, a plug-and-play module requiring no additional hyperparameters, which selects high-confidence samples via adaptive k-NN and leverages reliable local structural information to guide global feature space optimization, substantially improving the performance of existing deep clustering models.