
🔄 Self-Supervised Learning

🧪 ICML2026 · 28 paper notes

📌 Same area in other venues: 📷 CVPR2026 (92) · 🔬 ICLR2026 (81) · 💬 ACL2026 (1) · 🤖 AAAI2026 (16) · 🧠 NeurIPS2025 (35) · 📹 ICCV2025 (13)

🔥 Top topics: Self-Supervised Learning ×3 · Few-/Zero-Shot Learning ×2

A Refined Generalization Analysis for Extreme Multi-class Supervised Contrastive Representation Learning: This paper improves the sample complexity upper bound for supervised contrastive learning (where tuples are constructed from a finite labeled data pool). By employing two distinct U-statistic estimators, it replaces bounds that depend on the minimum class probability with bounds that depend only on the number of classes or the sample size in extreme multi-class scenarios.
Active Learning with Foundation Model Priors: Efficient Learning under Class Imbalance: This paper proposes PriorAL, which utilizes foundation model predictions as priors for joint decision-making with small models via a "Product of Experts." It employs imbalance-aware entropy filtering to partition the unlabeled pool into a "clean set (for free pseudo-labeling)" and a "noise set (for human annotation)," achieving over 50% savings in labeling costs on image/text tasks characterized by both class imbalance and label noise.
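As a rough illustration of the "Product of Experts" idea in the PriorAL entry above, the sketch below multiplies two categorical distributions elementwise and renormalizes, then flags low-entropy samples. It is a minimal sketch under stated assumptions: the function names, the plain (not imbalance-aware) entropy threshold, and the threshold value are all hypothetical, and the paper's actual combination rule and filtering may differ.

```python
import numpy as np

def product_of_experts(p_prior, p_small, eps=1e-12):
    """Normalized elementwise product of two categorical distributions.

    p_prior: foundation-model class probabilities (assumed input).
    p_small: small-model class probabilities (assumed input).
    """
    p_prior = np.asarray(p_prior, dtype=float)
    p_small = np.asarray(p_small, dtype=float)
    # Work in log space for numerical stability.
    log_joint = np.log(p_prior + eps) + np.log(p_small + eps)
    log_joint -= log_joint.max(axis=-1, keepdims=True)
    joint = np.exp(log_joint)
    return joint / joint.sum(axis=-1, keepdims=True)

def low_entropy_mask(probs, threshold):
    """True where predictive entropy is below a (hypothetical) threshold.

    This is a plain entropy filter, not the imbalance-aware variant
    described in the paper.
    """
    probs = np.asarray(probs, dtype=float)
    entropy = -(probs * np.log(probs + 1e-12)).sum(axis=-1)
    return entropy < threshold
```

Samples where the mask is true would correspond to a candidate "clean set" for pseudo-labeling, and the rest to candidates for human annotation.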
Beyond Distribution Estimation: Simplex Anchored Structural Inference Towards Universal Semi-Supervised Learning: This paper proposes SAGE, which replaces "estimating unlabeled data distributions" with "structural inference in the representation space." By combining simplex ETF geometric anchors, high-order graph propagation, and distribution-agnostic reliability weighting, SAGE achieves an average accuracy improvement of 8.52% under the UniSSL setting with extreme label scarcity and arbitrary unlabeled distributions.
Can Local Learning Match Self-Supervised Backpropagation?: This paper theoretically proves that local self-supervised learning (local-SSL) can precisely achieve the gradient updates of global backpropagation (BP-SSL) in deep linear networks. Based on this insight, the authors propose CLAPP++ (introducing 2D spatial dependence and direct feedback), which achieves performance comparable to global BP-SSL on CIFAR-10/STL-10/Tiny ImageNet, setting a new SOTA for local-SSL.
Data Augmentation of Contrastive Learning is Estimating Positive-incentive Noise: The authors prove that "predefined data augmentation (rotation/cropping/flipping)" in contrastive learning is equivalent to a point estimation of Positive-incentive Noise (π-noise). They then upgrade π-noise from "point estimation" to a learnable distribution by training a π-noise generator (PiNDA) to add learnable noise as augmentation. This leads to consistent gains for SimCLR / BYOL / SimSiam / MoCo / DINO in vision and is naturally compatible with non-visual data without manual augmentation, such as HAR / Reuters / Epsilon.
FLAG: Foundation Model Representation with Latent Diffusion Alignment via Graph for Spatial Gene Expression Prediction: FLAG reformulates the prediction of spatial gene expression from H&E pathology images as a structured distribution generation problem. It employs a fixed spatial graph encoder to compress tissue topology into conditional vectors, uses a DiT for denoising in the gene dimension, and injects gene-gene regulatory priors through intermediate layer alignment with Gene Foundation Models (GFMs). This approach reaches new best scores on Gene Structural Correlation (GSC) and Spatial Structural Correlation (SSC) while maintaining competitive PCC/MSE performance.
From Zero to Hero: Advancing Zero-Shot Foundation Models for Tabular Outlier Detection: This paper proposes OutFormer, a tabular Prior-Fitted Network (PFN) pretrained on a mixture of three synthetic priors (GMM, SCM, and Copula) and stabilized through a Multi-Armed Bandit-based Self-Evolving Curriculum. It achieves zero-shot tabular outlier detection by processing training data in-context and generating labels in a single forward pass. OutFormer achieves SOTA rankings across ADBench and two new benchmarks containing 1500+ datasets, while maintaining inference latency comparable to shallow models.
How 'Neural' is a Neural Foundation Model?: The authors treat a "SOTA foundation model of mouse visual cortex (FNN)" as a physiological experimental subject. Analyzing its encoder, recurrent, and readout modules with three tools (decoding manifolds, encoding manifolds, and decoding trajectories), they find that FNN's fitting accuracy is primarily sustained by a large number of homogeneous feature maps in the readout, while only the recurrent module is truly "brain-like." Using a newly proposed tubularity metric, they show quantitatively that "early encoding layers lack biological-grade temporal structure," and offer explicit suggestions for future neural foundation models: "add recurrence early and reduce feature dimensions in the readout."
Inconsistency-Aware Minimization: Improving Generalization with Unlabeled Data: This paper proposes "local inconsistency" \(S_\rho(\theta)\)—the worst-case KL divergence within a parameter ball—which can be calculated using only unlabeled data. By employing it as a training regularization term, the resulting IAM optimizer performs comparably to or better than SAM/ASAM in supervised tasks and brings additional improvements in semi-supervised (FixMatch) and self-supervised (SimCLR) scenarios by leveraging unlabeled batches.
InfoAtlas: A Foundation Model for Zero-Shot Statistical Dependence Estimation: InfoAtlas transforms mutual information estimation from an optimization problem where an evaluation network is trained from scratch for each dataset into a "single forward pass" problem using a hypernetwork pre-trained on large-scale synthetic data. This achieves accuracy comparable to neural estimators like MINE/MINDE while providing a 100× speedup.
Learning Graph Foundation Models on Riemannian Graph-of-Graphs: R-GFM treats subgraphs of "different hop counts" as nodes in a higher-level Graph-of-Graphs (GoG), using a dynamic MoE router to assign each GoG to the Riemannian manifold (Hyperbolic / Euclidean / Spherical) that best matches its curvature. It simultaneously addresses two inherent flaws in existing graph foundation models—fixed receptive fields and single Euclidean embeddings—achieving up to a 49% relative improvement in downstream tasks.
Learning to Extrapolate to New Tasks: A Relational Approach to Task Extrapolation: This paper proposes the Relational Task Extrapolator (RTE), which reinterprets "new tasks outside the training support" as a compositional problem of "known anchor tasks + seen inter-task transformations." It trains a relational operator \(\Psi\) to assemble these anchor-transform pairs at test time to predict the outputs of unseen tasks.
LEC: Linear Expectation Constraints for Selection-Conditioned Risk Control in Selective Prediction and Routing Systems: Addressing the long-standing issue in LLM selective prediction where "UCB risk bounds are too conservative and offer few usable thresholds," the authors rewrite the objective "post-selection error rate \(\le \alpha\)" as a linear expectation constraint involving indicator functions for selection and error. This leads to a finite-sample sufficient condition (Eq. 5) that depends only on the calibration set. This approach maintains strict finite-sample guarantees while being significantly tighter than UCB. The framework naturally extends to two-model routing systems for joint threshold calibration, achieving consistent power gains across CommonsenseQA, TriviaQA, ScienceQA, and MM-Vet v2, and accepting 9.5% more samples than Clopper-Pearson UCB on TriviaQA.
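The rewriting step in the LEC entry above can be spelled out in one line. The notation here is an assumption for illustration: \(S \in \{0,1\}\) is the selection indicator and \(E \in \{0,1\}\) is the error indicator, with \(\Pr(S=1) > 0\). The paper's finite-sample condition (Eq. 5) is not reproduced.

```latex
\begin{align*}
\Pr(E = 1 \mid S = 1) \le \alpha
&\iff \frac{\mathbb{E}[S\,E]}{\mathbb{E}[S]} \le \alpha \\
&\iff \mathbb{E}\big[S\,(E - \alpha)\big] \le 0 .
\end{align*}
```

The last form is linear in the expectation of a bounded quantity, which is what makes a calibration-set estimate of it straightforward to work with.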
LimiX-2M: Mitigating Low-Rank Collapse and Attention Bottlenecks in Tabular Foundation Models: Addressing two major pathologies in tabular foundation models like TabPFN-v2—severe low-rank collapse in shallow layers and the negligible contribution of sample-attention in the final layer to prediction signals—the authors propose using Radial Basis Functions to expand each scalar into a set of local responses (RaBEL) to unlock degrees of freedom in the "value direction." Furthermore, the bidirectional attention blocks are rearranged from F→S→N to S→N→F to ensure all attention paths flow into the readout. With only 2M parameters, this model consistently outperforms the 7M TabPFN-v2 and 27M TabICL across mainstream tabular benchmarks.
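To make "expand each scalar into a set of local responses" in the LimiX-2M entry above concrete, here is a minimal sketch of a Gaussian radial-basis expansion. The fixed centers, the shared bandwidth `gamma`, and the function name are hypothetical choices for illustration; RaBEL as described in the paper may parameterize or learn these differently.

```python
import numpy as np

def rbf_expand(x, centers, gamma=1.0):
    """Map each scalar in x to K Gaussian RBF responses.

    x:       array of any shape (...).
    centers: 1-D array of K centers (assumed fixed here).
    gamma:   shared inverse-width parameter (assumed).
    Returns an array of shape (..., K).
    """
    x = np.asarray(x, dtype=float)[..., None]    # (..., 1)
    centers = np.asarray(centers, dtype=float)   # (K,)
    return np.exp(-gamma * (x - centers) ** 2)   # broadcasts to (..., K)
```

For example, `rbf_expand(table, np.linspace(-2, 2, 8))` would turn an `(n_rows, n_cols)` table of standardized values into an `(n_rows, n_cols, 8)` array, so that nearby values produce similar response vectors and distant values produce nearly orthogonal ones.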
Mitigating Label Shift in Tabular In-Context Learning via Test-Time Posterior Adjustment: This work proposes posterior correction for "Tabular Foundation Models" such as TabPFN, which feed training sets directly into attention mechanisms as context. It identifies severe overfitting to the training set's majority class and introduces DistPFN: a posterior reweighting method using \(\tilde{p}(y) \propto \hat{p}(y)^2 / p_{train}(y)\). Across 253 OpenML datasets, it improves the accuracy of TabPFN-v2 from 72.7% to 76.9% under strong label shift (\(\beta=5\)) without retraining, test-prior estimation, or architectural modifications.
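The reweighting rule \(\tilde{p}(y) \propto \hat{p}(y)^2 / p_{train}(y)\) from the DistPFN entry above is simple enough to sketch directly. This is a minimal sketch, assuming \(\hat{p}(y)\) is the model's per-sample predicted class distribution and \(p_{train}(y)\) is the empirical class frequency of the context set; the function name and the `eps` guard are hypothetical additions.

```python
import numpy as np

def adjust_posterior(p_hat, p_train, eps=1e-12):
    """Apply p_tilde(y) proportional to p_hat(y)^2 / p_train(y), then normalize.

    p_hat:   predicted class probabilities, shape (..., C).
    p_train: training-set class frequencies, shape (C,).
    eps:     guard against division by zero (an added assumption).
    """
    p_hat = np.asarray(p_hat, dtype=float)
    p_train = np.asarray(p_train, dtype=float)
    scores = p_hat ** 2 / (p_train + eps)
    return scores / scores.sum(axis=-1, keepdims=True)
```

With a 90/10 training prior, a prediction of `[0.6, 0.4]` becomes roughly `[0.2, 0.8]`: the minority class wins once the majority-class bias is divided out.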
NITP: Next Implicit Token Prediction for LLM Pre-training: NITP provides continuous representation-space supervision for the final hidden states by using shallow representations as implicit targets. This supplements standard NTP to prevent hidden representations from degenerating into low-dimensional anisotropic configurations, achieving a 5.7% improvement in MMLU-Pro on a 9B MoE and general gains of 4-6% in reasoning tasks with only ~2% additional computational overhead.
NumLeak: Public Numeric Benchmarks as Latent Labels in Foundation Models: NumLeak detects and quantifies the degree of foundation model memorization of public numeric benchmarks (financial factors, macroeconomic data, climate data) via a four-layer diagnostic protocol—revealing how such contamination leaks into downstream financial signals and mitigating risks through system prompt defenses; Opus 4.7 achieves a within-25 bps accuracy of 0.60 and a Pearson \(r = 0.99\) on the Mkt-RF factor.
PartCo: Part-Level Correspondence Priors Enhance Category Discovery: PartCo introduces a plug-and-play framework to enhance Generalized Category Discovery by explicitly leveraging part-level feature correspondences inherent in Vision Transformer patch tokens, improving baselines like SimGCD / SPTNet / FlipClass by 2-10% across multiple benchmarks including CUB, Stanford-Cars, and ImageNet-100.
Provable Accuracy Collapse in Embedding-Based Representations under Dimensionality Mismatch: The authors prove that in typical triplet tasks within contrastive learning, if the embedding dimension \(d\) is less than a certain constant multiple of the true dimension \(D\), the accuracy "collapses" to the 50% baseline (equivalent to a 1D random embedding) regardless of the optimizer. Furthermore, this phenomenon is shown to be hard to approximate in polynomial time under the Unique Games Conjecture.
Riemannian Metric Matching for Scalable Geometric Modeling of Distributions: The "Riemannian metric of the data manifold" is rewritten as a carré du champ operator, and a neural network is trained to learn this operator directly using a denoising-style conditional regression loss. This eliminates the need for kNN graph construction or computing large network Jacobians, allowing for the amortized estimation of intrinsic dimensionality, tangent spaces, and geodesic paths on high-dimensional data at a constant cost. Inference is up to 400x faster than kNN-based diffusion geometry estimation.
Scaling Continual Learning to 300+ Tasks with Bi-Level Routing Mixture-of-Experts: The authors propose CaRE: inserting a Bi-Level Routing MoE (BR-MoE) into each ViT block. It first uses a "class-perceiver" to select Top-M relevant task routes based on entropy, then each route activates Top-K task experts while adding a shared EMA expert. This allows the model to retain old knowledge while absorbing new classes even in sequences exceeding 300 tasks. The work also introduces the 1000-class OmniBenchmark-1K to fill the gap in long-sequence CIL evaluation.
Statistical Consistency and Generalization of Contrastive Representation Learning: This paper establishes the Fisher/statistical consistency for Contrastive Representation Learning (CRL) for the first time, proving that minimizing upstream contrastive loss is equivalent to optimizing downstream AUC-type retrieval performance. It provides refined generalization bounds of \(O(1/m+1/\sqrt n)\) (supervised) and \(O(1/\sqrt m+1/\sqrt n)\) (self-supervised) depending on the number of positive samples \(n\) and negative samples \(m\), theoretically explaining why CLIP/SimCLR continue to benefit from tens of thousands of negative samples.
The Geometry of Projection Heads: Conditioning, Invariance and Collapse: This paper analyzes projection heads in self-supervised learning as trainable metric tensors from a Riemannian geometry perspective. It demonstrates that their role is to dynamically whiten the optimization landscape, escape collapse saddle points via negative curvature from smooth activations, and induce metric singularities along data augmentation directions—collectively explaining the long-standing mystery of why these heads are "required during training but discarded for inference."
Towards One-for-All Anomaly Detection for Tabular Data: OFA-TAD is proposed: using "neighbor distance" as a cross-domain universal anomaly cue, multi-view distance representations are extracted from metric spaces induced by various feature transformations. These are adaptively fused using a Mixture of Experts (MoE) gating mechanism. After a single training phase, the model generalizes directly to unseen tabular datasets for anomaly detection without any target-domain fine-tuning.
TRACER: Robust Multimodal Fine-tuning Proven with WMA Teacher + Geometric Decomposition: TRACER utilizes closed-form theoretical analysis to geometrically decompose contrastive fine-tuning into "task subspace" and "orthogonal preservation" components. It proves that EMA teachers collapse and lose regularization power, which motivates a Weighted Moving Average (WMA) teacher. This teacher maintains finite-horizon continuous constraints and achieves unbiased convergence in the task subspace. On CLIP ViT-B/16, average performance under ImageNet distribution shifts improves to 64.07%, versus 62.54% for CaRot.
Understanding Self-Supervised Learning via Latent Distribution Matching: The authors unify contrastive, non-contrastive, and predictive SSL as "Latent Distribution Matching (LDM)": maximizing the log-probability of samples under a hypothesized latent model (alignment) plus maximizing latent entropy (uniformity). Based on this, they derive a nonlinear identifiable predictive SSL equipped with a Kalman predictor.
When Softmax Fails at the Top: Extreme Value Corrections for InfoNCE: This paper interprets InfoNCE as a top-1 selection likelihood and points out that standard softmax implicitly assumes a Gumbel tail distribution. However, hard negatives with high similarity in normalized embeddings more frequently exhibit Weibull behavior with finite endpoints. The authors propose WEINCE, a parameter-free method that adaptively mixes softmax logits with endpoint shortfall logits using in-batch tail statistics to stably improve self-supervised representation quality.
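For readers who want the baseline that the entry above starts from, the sketch below computes standard InfoNCE for a single query as the negative log-probability that the positive key is selected by a softmax over cosine similarities. It shows only the standard softmax form; the WEINCE endpoint-shortfall correction is not reproduced, and the function name and default temperature are assumptions.

```python
import numpy as np

def info_nce(query, keys, positive_index, temperature=0.1):
    """Standard InfoNCE for one query: -log softmax(sim / T)[positive].

    query: shape (d,); keys: shape (N, d), one of which is the positive.
    """
    query = np.asarray(query, dtype=float)
    keys = np.asarray(keys, dtype=float)
    # L2-normalize so that dot products are cosine similarities.
    q = query / np.linalg.norm(query)
    k = keys / np.linalg.norm(keys, axis=1, keepdims=True)
    logits = k @ q / temperature
    logits = logits - logits.max()               # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum())
    return -log_probs[positive_index]
```

Because the embeddings are normalized, every similarity is bounded above by 1, which is the "finite endpoint" that the entry's Weibull argument refers to.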
Zero-Flow Encoders: The paper discovers a counter-intuitive phenomenon: a rectified flow trained with independent coupling is zero at \(t=0.5\) if and only if the source and target distributions are identical ("Zero-Flow Criterion"). By generalizing this to conditional distributions, the authors prove that \(\mathbf{v}_{t=0.5}=0\) is equivalent to the encoder \(f(Y)\) being sufficient for predicting \(X\) (conditional independence). Based on this, they design a simulation-free least-squares loss without parametric density assumptions that learns both Markov blankets in graphical models and self-supervised representations in a unified way, naturally circumventing the "shortcut problem" inherent in contrastive learning.
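The easy ("if") direction of the Zero-Flow Criterion in the entry above follows from a symmetry argument, sketched here. It assumes the standard rectified-flow setup with linear interpolation and the conditional-expectation form of the velocity field; the converse direction and the conditional generalization are the paper's contribution and are not derived here.

```latex
\begin{align*}
X_t &= (1-t)\,X_0 + t\,X_1, \qquad
v_t(x) = \mathbb{E}\,[\,X_1 - X_0 \mid X_t = x\,], \\
(X_0, X_1) &\overset{d}{=} (X_1, X_0)
\quad \text{(independent coupling, identical marginals)}, \\
v_{1/2}(x) &= \mathbb{E}\,[\,X_1 - X_0 \mid \tfrac{1}{2}(X_0 + X_1) = x\,]
            = \mathbb{E}\,[\,X_0 - X_1 \mid \tfrac{1}{2}(X_0 + X_1) = x\,]
            = -\,v_{1/2}(x), \\
\Rightarrow\; v_{1/2}(x) &= 0 .
\end{align*}
```

The midpoint \(t = 0.5\) is the only time at which swapping \(X_0\) and \(X_1\) leaves \(X_t\) unchanged, which is why the criterion is stated there.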