
🔄 Self-Supervised Learning

🧪 ICML2025 · 22 paper notes


🔥 Top topics: Self-Supervised Learning ×3

A Bayesian Model Selection Criterion for Selecting Pretraining Checkpoints: Introduces "downstream free energy" as a Bayesian model selection criterion for the adaptability of pretraining checkpoints, proves that "pretraining free energy" is an upper-bound proxy for it that requires no downstream data, and experimentally validates that a large learning rate, small batch size, and high momentum improve downstream transfer performance by reducing pretraining free energy.
AdaWorld: Learning Adaptable World Models with Latent Actions: AdaWorld builds highly adaptable world models through action-aware pre-training: it extracts latent actions from videos in a self-supervised manner, which supports zero-shot action transfer and fast adaptation to new environments with few interactions.
Alpha-SQL: Zero-Shot Text-to-SQL using Monte Carlo Tree Search: Alpha-SQL models zero-shot Text-to-SQL as a tree search problem. By combining a Monte Carlo Tree Search (MCTS) framework with an LLM-as-Action-Model and a self-supervised reward function, it achieves 69.7% execution accuracy on the BIRD dataset using a 32B open-source model without any fine-tuning, surpassing the GPT-4o-based zero-shot SOTA by 2.5 percentage points.
Beyond Sensor Data: Foundation Models of Behavioral Data from Wearables Improve Health Predictions: Using physical activity data from 162K participants and 2.5 billion hours of wearable behavioral data from the Apple Heart and Movement Study, this work systematically explores combinations of tokenizers and architectures. The resulting WBM, a behavioral foundation model built on TST + Mamba-2 + contrastive learning, significantly outperforms hand-crafted feature baselines across 57 health detection tasks and complements PPG sensor models.
CLARIFY: Contrastive Preference Reinforcement Learning for Untangling Ambiguous Queries: Proposes CLARIFY, a method that uses contrastive learning to build a trajectory embedding space that integrates preference information, then applies rejection sampling to select clearer, more distinguishable preference queries, thereby improving the annotation efficiency and policy performance of offline preference-based RL (PbRL) under non-ideal feedback.
ReSA: Clustering Properties of Self-Supervised Learning: This work systematically analyzes the clustering properties of the components of joint embedding architecture (JEA)-based SSL. It finds that the encoding has better and more stable clustering properties than the embedding and the hidden layers of the projector. Based on this, ReSA (Representation Self-Assignment) uses clustering information from the encoding to guide embedding learning, forming a positive-feedback SSL framework that significantly outperforms SOTA on multiple standard benchmarks.
Collapse-Proof Non-Contrastive Self-Supervised Learning: This paper proposes the FALCON method, which designs the projector and loss function based on the principles of hyperdimensional computing. It theoretically proves the simultaneous prevention of four known training failure modes (representation collapse, dimensional collapse, cluster collapse, and intracluster collapse) while naturally endowing representations with decorrelation and clustering properties.
Contextures: Representations from Contexts: Establishes the contexture theory, which unifies various representation learning paradigms (including supervised learning, self-supervised learning, and manifold learning) by proving that each can be understood as learning the top-\(d\) singular functions of the expectation operator induced by context variables. It also reveals diminishing marginal returns in model scaling and proposes metrics for evaluating context quality.
Deep Learning is Not So Mysterious or Different: This is a position paper arguing that the generalization phenomena deemed "mysterious" in deep learning (benign overfitting, double descent, and the success of overparameterization) are neither unique to deep learning nor mysterious. They can be formalized using long-standing generalization frameworks (PAC-Bayes and countable-hypothesis bounds) and unified under the explanatory principle of soft inductive biases.
Discovering Global False Negatives On the Fly for Self-supervised Contrastive Learning: GloFND learns a dynamic threshold for each anchor sample, discovering and filtering global false negatives on the fly during training. This improves representation quality in contrastive learning at low computational overhead.
Foundation Model Insights and a Multi-Model Approach for Superior Fine-Grained One-shot Subset Selection: This paper systematically investigates the advantages and disadvantages of using Foundation Models (FMs) to replace conventional Information Extractors (IEs) for subset selection. It finds that FMs significantly outperform traditional IEs on fine-grained datasets. Consequently, the RAM-APL method is proposed, which uses multiple FMs (DINOv2 + CLIP) to jointly measure sample importance along both intra-class and inter-class dimensions, achieving SOTA results on three fine-grained datasets.
Generalization Analysis for Supervised Contrastive Representation Learning under Non-IID Settings: This paper establishes the first generalization bounds for supervised contrastive representation learning (CRL) under non-independent and identically distributed (non-IID) settings. By leveraging U-statistics decomposition techniques to handle the dependency issue arising from overlapping samples in training tuples, it provides the convergence rate of the excess risk with respect to the number of labeled samples \(N\).
Griffin: Towards a Graph-Centric Relational Database Foundation Model: Griffin is the first foundation model designed for Relational Databases (RDBs). By transforming multi-table structures into heterogeneous graphs, and combining a unified encoder/decoder, cross-attention, and a hierarchical aggregation MPNN, it conducts self-supervised masked completion pre-training on over 150M rows of data followed by joint SFT, achieving predictions that generalize across databases, domains, and tasks.
MTL-UE: Learning to Learn Nothing for Multi-Task Learning: MTL-UE is the first unlearnable example generation framework tailored for Multi-Task Learning (MTL). Utilizing an encoder-decoder architecture to inject task-specific class prior embeddings, it reduces the intra-class variance of shortcut features. Coupled with intra- and inter-task embedding cosine regularization, it increases inter-class distances and reduces redundancy. On CelebA (40 tasks), it degrades MTL model accuracy from 91% to 59%, demonstrating consistent effectiveness across 4 datasets, 3 base UE methods, 5 backbones, and 5 MTL strategies.
Neighbour-Driven Gaussian Process Variational Autoencoders for Scalable Structured Latent Modelling: Proposes two nearest-neighbour-based Gaussian process prior approximation methods (HPA and SPA) to introduce neighbour-driven sparsity into the latent space inference of GPVAEs. This enables scalable mini-batch training while retaining key latent dependencies, eliminating reliance on a large number of inducing points or restricted kernel functions.
PDE-Transformer: Efficient and Versatile Transformers for Physics Simulations: Proposes PDE-Transformer, an improved Transformer architecture for physics simulations. With separate channel embeddings, shifted-window attention, and a multi-scale U-shaped structure, it outperforms existing SOTA on 16 PDE types and demonstrates strong transfer capability to downstream tasks.
Proxy-FDA: Proxy-based Feature Distribution Alignment for Fine-tuning Vision Foundation Models without Forgetting: This paper proposes a structure-level feature regularization method termed Proxy-FDA. By transferring the nearest neighbor graph from the pre-trained feature space to the fine-tuned feature space, and employing a lightweight proxy generator to synthesize novel features to enhance distribution coverage, Proxy-FDA achieves forward transfer across all fine-tuning tasks without sacrificing downstream accuracy.
Test-Time Canonicalization by Foundation Models for Robust Perception: This work proposes the FoCal framework, which leverages visual priors from CLIP and Stable Diffusion during the inference phase. Utilizing a "Vary-then-Rank" strategy, it transforms input images into their most visually canonical versions, enhancing downstream model robustness to variations in viewpoint, illumination, and rotation without any retraining.
Test-Time Training Provably Improves Transformers as In-Context Learners: This paper proves that Test-Time Training (TTT) enhances the In-Context Learning (ICL) capabilities of Transformers, and validates on the tabular foundation model TabPFN that TTT can reduce the required sample size by a factor of 3 to 5 while yielding significant improvements in inference efficiency.
Towards Benchmarking Foundation Models for Tabular Data With Text: The first systematic study on modeling tabular data containing text features: qualitative counterexamples are designed to expose the failure modes of three types of text embeddings, 13 real-world datasets are manually curated, and text features are found to improve predictive accuracy on 11/13 datasets, although no single optimal embedding method exists, indicating that tabular data with text remains an unsolved problem.
Update Your Transformer to the Latest Release: Re-Basin of Task Vectors: Proposes TransFusion, a two-level weight permutation method (inter-head + intra-head) specifically designed for Transformers, enabling data-free and training-free migration of fine-tuned knowledge (task vectors) from old models to new foundation models.
What Has a Foundation Model Found? Using Inductive Bias to Probe for World Models: This paper proposes "Inductive Bias Probes", which evaluate whether the extrapolation behavior of foundation models aligns with hypothesized world models by repeatedly fine-tuning them on synthetic datasets. The findings reveal that while foundation models can accurately predict sequences in domains such as orbital mechanics, Othello, and lattice problems, they do not truly learn the underlying world models but rather develop task-specific heuristic strategies.
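
As a small illustration of the "task vector" notion mentioned in the TransFusion entry above, the sketch below shows plain task-vector arithmetic: a task vector is the per-parameter difference between fine-tuned and base weights, and it can be added to another set of weights. This is not the TransFusion method itself. It assumes the old and new models already have their parameters aligned, which is the problem the two-level permutation is described as addressing. The names `task_vector`, `apply_task_vector`, and the toy weight dictionaries are hypothetical.

```python
from typing import Dict, List

# Toy stand-in for a model's parameters: name -> flat list of floats.
Weights = Dict[str, List[float]]


def task_vector(base: Weights, finetuned: Weights) -> Weights:
    """Per-parameter difference: fine-tuned weights minus base weights."""
    return {
        name: [f - b for f, b in zip(finetuned[name], base[name])]
        for name in base
    }


def apply_task_vector(new_base: Weights, tau: Weights, scale: float = 1.0) -> Weights:
    """Add a (scaled) task vector to another model's weights.

    Assumption: new_base and tau share parameter names, shapes, and
    unit ordering, i.e. any needed permutation was already applied.
    """
    return {
        name: [w + scale * t for w, t in zip(new_base[name], tau[name])]
        for name in new_base
    }


# Hypothetical toy weights for illustration only.
old_base = {"w": [1.0, 2.0]}
old_finetuned = {"w": [1.5, 1.0]}
new_base = {"w": [2.0, 2.0]}

tau = task_vector(old_base, old_finetuned)
patched = apply_task_vector(new_base, tau)
```

Without an alignment step, adding `tau` to an independently trained model would generally mix unrelated units, which is why the naive arithmetic alone is not expected to transfer fine-tuned knowledge between releases.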