
🔄 Self-Supervised Learning

🧪 ICML2026 · 9 paper notes

📌 Same area in other venues: 📷 CVPR2026 (30) · 🔬 ICLR2026 (14) · 🤖 AAAI2026 (13) · 🧠 NeurIPS2025 (32) · 📹 ICCV2025 (11)

🔥 Top topics: Self-Supervised Learning ×2

A Refined Generalization Analysis for Extreme Multi-class Supervised Contrastive Representation Learning

This paper tightens the sample-complexity upper bound for supervised contrastive learning, where tuples are constructed from a finite pool of labeled data. By introducing two different U-statistic estimators, it obtains bounds for the extreme multi-class setting that depend only on the number of classes or the sample size, rather than on the minimum class probability.

Beyond Distribution Estimation: Simplex Anchored Structural Inference Towards Universal Semi-Supervised Learning

This paper proposes SAGE, which replaces "estimating the distribution of unlabeled data" with "structural inference in representation space." By combining simplex ETF geometric anchors, high-order graph propagation, and distribution-agnostic reliability weighting, it achieves an average 8.52% accuracy improvement under the UniSSL setting with extremely scarce labels and arbitrary unlabeled distributions.
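The "simplex ETF geometric anchors" mentioned above refer to a standard construction: \(K\) unit vectors whose pairwise cosine similarity is exactly \(-1/(K-1)\). The sketch below builds such anchors with the textbook formula; how SAGE actually uses them is not described in this note, and the function names (`simplex_etf`, `dot`) are hypothetical, chosen for illustration only.

```python
import math


def simplex_etf(num_classes):
    """Return num_classes unit vectors in R^num_classes forming a simplex ETF.

    Textbook construction: scale * (e_i - 1/K * ones), with
    scale = sqrt(K / (K - 1)). Not taken from the SAGE paper.
    """
    k = num_classes
    if k < 2:
        raise ValueError("a simplex ETF needs at least 2 classes")
    scale = math.sqrt(k / (k - 1))
    return [
        [scale * ((1.0 if i == j else 0.0) - 1.0 / k) for j in range(k)]
        for i in range(k)
    ]


def dot(u, v):
    """Plain inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


anchors = simplex_etf(4)
self_similarity = dot(anchors[0], anchors[0])   # unit norm
cross_similarity = dot(anchors[0], anchors[1])  # equals -1 / (K - 1)
```

Every anchor is equally far from every other one, which is why this geometry is a common choice for class prototypes that should not favor any class.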

Data Augmentation of Contrastive Learning is Estimating Positive-incentive Noise

The authors prove that the predefined data augmentations used in contrastive learning (rotation, cropping, flipping) are equivalent to a point estimate of Positive-incentive Noise (π-noise). They then replace the point estimate with a learnable distribution: a π-noise generator is trained to add learnable noise to the original image as augmentation (PiNDA). This yields consistent improvements for SimCLR, BYOL, SimSiam, MoCo, and DINO on vision tasks, and extends naturally to non-vision data (HAR, Reuters, Epsilon) where manual augmentations are unavailable.

How 'Neural' is a Neural Foundation Model?

The authors treat a state-of-the-art foundation model (FNN) of mouse visual cortex as a physiological experimental subject, analyzing its encoder, recurrent, and readout modules with three tools: the decoding manifold, the encoding manifold, and the decoding trajectory. They find that FNN's fitting accuracy relies mainly on the readout's homogeneous feature maps, while only the recurrent module is truly "brain-like." Using a newly proposed tubularity metric, they show quantitatively that early encoding layers lack biological temporal structure, and they give two recommendations for future neural foundation models: add recurrence early, and reduce the feature dimension in the readout.

Provable Accuracy Collapse in Embedding-Based Representations under Dimensionality Mismatch

The authors prove that, for typical triplet tasks in contrastive learning, whenever the embedding dimension \(d\) is less than a constant multiple of the true dimension \(D\), accuracy "collapses" to the 50% baseline of a random 1D embedding, regardless of the optimizer used. On the algorithmic side, they show that, under the Unique Games Conjecture, this cannot be approximated in polynomial time.
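To make the "50% baseline of a 1D random embedding" concrete, here is a minimal simulation. It does not reproduce the paper's theorem; it only checks that an embedding drawn independently of the ground truth answers triplet queries ("is \(j\) or \(k\) closer to \(i\)?") at about chance level. All names and sizes (`triplet_accuracy`, 200 points, \(D = 16\), 5000 triplets) are assumptions made for this illustration.

```python
import random


def sq_dist(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def triplet_accuracy(truth, embedding, triplets):
    """Fraction of triplets (i, j, k) where the embedding agrees with the
    ground truth on whether j is closer to i than k is."""
    correct = 0
    for i, j, k in triplets:
        true_j = sq_dist(truth[i], truth[j]) < sq_dist(truth[i], truth[k])
        pred_j = sq_dist(embedding[i], embedding[j]) < sq_dist(embedding[i], embedding[k])
        correct += true_j == pred_j
    return correct / len(triplets)


rng = random.Random(0)
num_points, true_dim = 200, 16  # illustrative sizes, not from the paper

# Ground-truth points in R^D and a 1D embedding drawn independently of them.
truth = [[rng.gauss(0, 1) for _ in range(true_dim)] for _ in range(num_points)]
random_1d = [[rng.gauss(0, 1)] for _ in range(num_points)]

# Each triplet uses three distinct point indices.
triplets = [tuple(rng.sample(range(num_points), 3)) for _ in range(5000)]

baseline = triplet_accuracy(truth, random_1d, triplets)  # close to 0.5
oracle = triplet_accuracy(truth, truth, triplets)        # exactly 1.0
```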

Statistical Consistency and Generalization of Contrastive Representation Learning

This work is the first to establish the Fisher/statistical consistency of contrastive representation learning (CRL), showing that "minimizing upstream contrastive loss is equivalent to optimal downstream AUC-type retrieval performance." It further provides sharp generalization bounds dependent on the number of positive samples \(n\) and negative samples \(m\): \(O(1/m+1/\sqrt n)\) (supervised) and \(O(1/\sqrt m+1/\sqrt n)\) (self-supervised). This theoretically explains, for the first time, why using tens of thousands of negatives in CLIP/SimCLR continues to improve performance.
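As a rough numerical reading of the two rates, the sketch below evaluates \(1/m + 1/\sqrt n\) and \(1/\sqrt m + 1/\sqrt n\) with all hidden constants set to 1. That choice is an assumption made purely for illustration, so the printed values indicate scaling only, not actual error levels. The function names are hypothetical.

```python
import math


def supervised_rate(n, m):
    """O(1/m + 1/sqrt(n)) with the hidden constant set to 1 (assumption)."""
    return 1.0 / m + 1.0 / math.sqrt(n)


def self_supervised_rate(n, m):
    """O(1/sqrt(m) + 1/sqrt(n)) with the hidden constant set to 1 (assumption)."""
    return 1.0 / math.sqrt(m) + 1.0 / math.sqrt(n)


# Hold the number of positives fixed and grow the number of negatives.
n = 10**6
for m in (10, 100, 1_000, 10_000):
    print(m, round(supervised_rate(n, m), 5), round(self_supervised_rate(n, m), 5))
```

In the self-supervised rate the negative-sample term shrinks only as \(1/\sqrt m\), so it remains comparable to or larger than the \(1/\sqrt n\) term until \(m\) is on the order of \(n\). This matches the note's point that adding negatives keeps helping at large \(m\).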

Text-Conditional JEPA for Learning Semantically Rich Visual Representations

This paper proposes TC-JEPA, which additionally conditions the I-JEPA masked-feature predictor on image captions. By applying multi-layer sparse cross-attention, patch representations become predictable under textual "prompts," which allows the model to learn semantically richer visual representations that are better suited to dense prediction, without a contrastive loss.

The Geometric Mechanics of Contrastive Representation Learning: Alignment Potentials, Entropic Dispersion, and Cross-modal Divergence

This paper uses a measure-theoretic framework to lift the InfoNCE loss to a deterministic "population energy" over representation distributions. It proves that the unimodal case is convex and converges to a unique Gibbs equilibrium, whereas the symmetric multimodal case exhibits a persistent negative symmetric KL coupling that geometrically and inevitably induces a modality gap.
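For readers who want the finite-batch object that the "population energy" generalizes, here is a minimal sketch of the standard batch InfoNCE loss. It is the common textbook form, not the paper's measure-theoretic formulation. The name `info_nce`, the default temperature, and the use of in-batch negatives are assumptions.

```python
import math


def dot(u, v):
    """Plain inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def info_nce(anchors, positives, temperature=0.1):
    """Mean over i of -log( exp(s_ii / t) / sum_j exp(s_ij / t) ),
    where s_ij = <anchors[i], positives[j]>.

    positives[i] is the positive for anchors[i]; the other positives in
    the batch act as negatives.
    """
    n = len(anchors)
    total = 0.0
    for i in range(n):
        logits = [dot(anchors[i], positives[j]) / temperature for j in range(n)]
        peak = max(logits)  # subtract the max for a stable log-sum-exp
        log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
        # log_norm is the dispersion term, -logits[i] the alignment term.
        total += log_norm - logits[i]
    return total / n


orthogonal = [[1.0, 0.0], [0.0, 1.0]]
collapsed = [[1.0, 0.0], [1.0, 0.0]]

loss_orthogonal = info_nce(orthogonal, orthogonal, temperature=1.0)
loss_collapsed = info_nce(collapsed, collapsed, temperature=1.0)  # log(2)
```

The per-sample loss splits into `-logits[i]`, which rewards alignment with the positive, and the log-sum-exp term, which penalizes similarity to everything in the batch. When all embeddings collapse to one point the loss equals \(\log n\) for a batch of size \(n\).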

Understanding Self-Supervised Learning via Latent Distribution Matching

The authors unify contrastive, non-contrastive, and predictive SSL under "Latent Distribution Matching (LDM)": maximizing the log-likelihood of samples under an assumed latent model (alignment) while maximizing latent entropy (uniformity). From this view they derive a nonlinear, identifiable predictive SSL method with a Kalman predictor.