
🔄 Self-Supervised Learning

🔬 ICLR2026 · 15 paper notes

Difficult Examples Hurt Unsupervised Contrastive Learning: A Theoretical Perspective

Using a similarity graph model, this paper gives a rigorous theoretical proof that difficult examples (cross-class sample pairs with high similarity) hurt unsupervised contrastive learning: they strictly worsen the generalization error bound. Three theoretically grounded mitigation strategies are proposed: removing difficult examples, adjusting margins, and temperature scaling. On TinyImageNet, the approach yields up to a 10.42% improvement in linear probing accuracy. The finding is counterintuitive: although "more data is better" is a common principle in deep learning, carefully removing difficult examples is in fact beneficial in contrastive learning.
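
A minimal sketch of the removal strategy, assuming it is implemented by masking overly similar negatives out of the InfoNCE denominator; the threshold, temperature, and masking rule below are illustrative, not the paper's exact procedure.

```python
import torch
import torch.nn.functional as F

def infonce_drop_difficult(z1, z2, temperature=0.5, sim_threshold=0.9):
    """InfoNCE where cross-view negatives with very high similarity (likely
    same-class 'difficult examples') are masked out of the denominator.
    sim_threshold is a hypothetical knob, not a value from the paper."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    n = z1.size(0)
    sim = z1 @ z2.t()                                    # (n, n) cosine similarities
    pos = sim.diag() / temperature                       # matched views are positives
    off_diag = ~torch.eye(n, dtype=torch.bool, device=sim.device)
    keep = off_diag & (sim < sim_threshold)              # drop "difficult" negatives
    neg = (sim / temperature).exp() * keep
    denom = neg.sum(dim=1) + pos.exp()
    return -(pos - denom.log()).mean()
```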

Enhancing Molecular Property Predictions by Learning from Bond Modelling and Interactions

DeMol is a dual-graph enhanced multi-scale interaction framework that introduces parallel atom-centric and bond-centric channels along with Double-Helix Blocks to explicitly model atom–atom, atom–bond, and bond–bond interactions, achieving state-of-the-art performance on PCQM4Mv2, OC20, QM9, and related benchmarks.
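
A minimal sketch of the parallel atom-centric and bond-centric channels with a cross-channel exchange, assuming simple sum aggregation and linear updates; the actual Double-Helix Blocks and multi-scale interactions in DeMol are more elaborate.

```python
import torch
import torch.nn as nn

class DualChannelBlock(nn.Module):
    """Parallel atom-centric and bond-centric updates with a cross-channel
    exchange. Aggregation and update rules are simplified assumptions and do
    not reproduce the actual Double-Helix Block."""
    def __init__(self, dim):
        super().__init__()
        self.atom_upd = nn.Linear(2 * dim, dim)
        self.bond_upd = nn.Linear(2 * dim, dim)

    def forward(self, atom_h, bond_h, edge_index):
        # atom_h: (N, dim), bond_h: (E, dim), edge_index: (2, E) atom indices per bond
        src, dst = edge_index
        # atom channel: aggregate incident bond features (atom-bond interaction)
        agg = torch.zeros_like(atom_h).index_add_(0, dst, bond_h)
        atom_h = torch.relu(self.atom_upd(torch.cat([atom_h, agg], dim=-1)))
        # bond channel: read back the two endpoint atoms (bond-atom interaction)
        endpoints = atom_h[src] + atom_h[dst]
        bond_h = torch.relu(self.bond_upd(torch.cat([bond_h, endpoints], dim=-1)))
        return atom_h, bond_h
```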

Fly-CL: A Fly-Inspired Framework for Enhancing Efficient Decorrelation and Reduced Training Time in Pre-trained Model-based Continual Representation Learning

Inspired by the Drosophila olfactory circuit, Fly-CL achieves progressive decorrelation through three stages (sparse random projection, top-\(k\) activation, and streaming ridge classification), significantly reducing training time while attaining state-of-the-art performance in pre-trained-model-based continual learning.
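
The three-stage pipeline can be sketched in a few lines of NumPy; the projection width, sparsity, top-\(k\), and ridge penalty below are illustrative values, and a one-shot closed-form ridge solve stands in for the streaming classifier.

```python
import numpy as np

def fly_cl_sketch(features, labels, proj_dim=2000, k=50, lam=1e-3, sparsity=0.1):
    """features: (n, d) pre-trained embeddings; labels: (n,) integer class ids.
    All hyperparameters are illustrative, not the paper's settings."""
    n, d = features.shape
    rng = np.random.default_rng(0)
    # 1) sparse random projection (random binary connectivity, as in the fly circuit)
    proj = (rng.random((d, proj_dim)) < sparsity).astype(np.float32)
    h = features @ proj
    # 2) top-k activation: keep only the k strongest responses per sample
    thresh = np.partition(h, -k, axis=1)[:, -k][:, None]
    h = np.where(h >= thresh, h, 0.0)
    # 3) ridge classifier; a streaming variant would accumulate H^T H and H^T Y
    #    batch by batch instead of solving once
    y = np.eye(labels.max() + 1)[labels]
    w = np.linalg.solve(h.T @ h + lam * np.eye(proj_dim), h.T @ y)
    return h @ w                       # class scores
```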

Gradient-Sign Masking for Task Vector Transport Across Pre-Trained Models

This paper proposes GradFix, which constructs a binary mask from gradient signs computed with the target pre-trained model on a handful of samples and applies it coordinate-wise to the source model's task vector, retaining only the components aligned with the descent direction of the target loss landscape. Without any fine-tuning, GradFix transfers task knowledge across pre-trained models, comes with a rigorous first-order descent guarantee, and substantially outperforms both naive transfer and few-shot fine-tuning on vision and language benchmarks.
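
A minimal sketch of the sign-masking step, assuming the mask simply keeps the task-vector coordinates whose sign matches the negative target gradient; the paper's exact mask construction and application may differ.

```python
import torch

def gradfix_transfer(source_task_vector, target_grad):
    """Keep only task-vector coordinates whose sign matches the descent
    direction (-grad) of the target model; zero out the rest."""
    mask = (torch.sign(source_task_vector) == torch.sign(-target_grad)).float()
    return source_task_vector * mask

# Usage sketch: target_grad comes from a handful of target-task samples, and the
# masked vector is added to the target pre-trained weights without fine-tuning.
```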

InfoNCE Induces Gaussian Distribution

This paper theoretically proves that the InfoNCE loss drives representations toward a Gaussian distribution via two complementary mechanisms: an empirical idealization route (alignment + spherical uniformity → Gaussian) and a regularization route (vanishing regularizer → isotropic Gaussian). The findings are validated on synthetic data and CIFAR-10.
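
A quick numerical illustration of the first route: coordinates of a point drawn uniformly from a high-dimensional unit sphere look approximately Gaussian after rescaling. This is a sanity check of the "spherical uniformity → Gaussian" intuition, not the paper's proof.

```python
import numpy as np
from scipy import stats

# Coordinates of a point drawn uniformly from a high-dimensional unit sphere are
# approximately N(0, 1) after rescaling by sqrt(d).
d, n = 512, 10_000
x = np.random.randn(n, d)
x /= np.linalg.norm(x, axis=1, keepdims=True)   # uniform samples on the unit sphere
coord = np.sqrt(d) * x[:, 0]                    # one rescaled coordinate
print(stats.kstest(coord, "norm"))              # typically fails to reject normality
```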

Maximizing Asynchronicity in Event-based Neural Networks

This paper proposes EVA, a framework that treats events as language tokens and employs an RWKV-6-based linear attention asynchronous encoder to update features event-by-event. Combined with a self-supervised learning scheme consisting of Multi-Representation Prediction (MRP) and Next-Representation Prediction (NRP), EVA learns generalizable features and, for the first time, successfully tackles the challenging object detection task within the Asynchronous-to-Synchronous (A2S) paradigm (0.477 mAP on the Gen1 dataset).
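
A toy sketch of the event-by-event update, replacing the RWKV-6 blocks with a single gated linear recurrence so the constant per-event cost is visible; the event featurization and recurrence form are simplifying assumptions.

```python
import torch
import torch.nn as nn

class AsyncEventEncoder(nn.Module):
    """Updates a feature state one event at a time with a gated linear
    recurrence, so the cost per incoming event is constant. The event
    featurization and recurrence form are simplifying assumptions."""
    def __init__(self, dim):
        super().__init__()
        self.embed = nn.Linear(4, dim)           # (x, y, polarity, dt) per event
        self.decay = nn.Parameter(torch.zeros(dim))
        self.value = nn.Linear(dim, dim)

    def forward(self, events):                   # events: (T, 4), one row per event
        state = torch.zeros(self.value.out_features, device=events.device)
        for e in events:                         # asynchronous, event-by-event update
            state = torch.sigmoid(self.decay) * state + self.value(self.embed(e))
        return state
```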

Maximizing Incremental Information Entropy for Contrastive Learning

This paper proposes IE-CL (Incremental-Entropy Contrastive Learning), a framework that explicitly maximizes entropy gain between augmented views—rather than merely maximizing mutual information—by treating the encoder as an information bottleneck and jointly optimizing a learnable transformation module (for entropy generation) and an encoder regularizer (for entropy preservation). IE-CL consistently improves contrastive learning performance on CIFAR-10/100, STL-10, and ImageNet under small-batch settings, with its core modules serving as plug-and-play components compatible with existing frameworks.
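
A rough sketch of an entropy-gain objective, assuming a Gaussian (log-determinant of the batch covariance) entropy estimator; the paper's estimator, transformation module, and regularizer weighting are not reproduced here.

```python
import torch

def gaussian_entropy(z, eps=1e-4):
    """Differential entropy of a Gaussian fit to the batch (up to constants);
    a crude stand-in for the paper's entropy estimator."""
    z = z - z.mean(dim=0, keepdim=True)
    cov = z.t() @ z / (z.size(0) - 1) + eps * torch.eye(z.size(1), device=z.device)
    return 0.5 * torch.logdet(cov)

def incremental_entropy_loss(z_view, z_transformed):
    """Reward the entropy gained by the learnable transformation over the raw
    view; the sign convention and weighting are illustrative."""
    return -(gaussian_entropy(z_transformed) - gaussian_entropy(z_view))
```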

No Other Representation Component Is Needed: Diffusion Transformers Can Provide Representation Guidance by Themselves

This paper proposes Self-Representation Alignment (SRA), which identifies that internal representations of diffusion Transformers exhibit a quality gradient along two dimensions—increasing layer depth and decreasing noise level. Based on this observation, SRA aligns early-layer, high-noise representations of a student network to late-layer, low-noise representations of an EMA teacher, requiring no external representation components (DINOv2/CLIP/MAE), and substantially accelerates convergence while improving generation quality on DiT and SiT (SiT-XL/2 achieves FID 1.58 at 800 epochs, comparable to REPA which relies on DINOv2).
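
A minimal sketch of the alignment term, assuming per-layer feature lists, a stop-gradient EMA teacher, and cosine distance; the actual projection heads and layer/noise pairing in SRA may differ.

```python
import torch
import torch.nn.functional as F

def sra_loss(student_feats, teacher_feats, student_layer=4, teacher_layer=-1):
    """student_feats / teacher_feats: lists of per-layer token features, where the
    student sees a noisier input and the teacher (EMA weights) a less noisy one.
    Layer indices and the cosine distance are illustrative choices."""
    s = F.normalize(student_feats[student_layer], dim=-1)
    with torch.no_grad():
        t = F.normalize(teacher_feats[teacher_layer], dim=-1)  # teacher gets no gradient
    return (1 - (s * t).sum(dim=-1)).mean()                    # align early/high-noise to late/low-noise
```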

PonderLM: Pretraining Language Models to Ponder in Continuous Space

This paper proposes PonderLM, which introduces a "pondering" mechanism at pretraining time: the predicted next-token probability distribution is used to form a probability-weighted sum of token embeddings (a continuous pondering embedding), which is fed back to the model for additional forward passes. Without labeled data or reinforcement learning, a 2.8B model trained with this approach surpasses a 6.9B baseline on 9 downstream tasks.
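
The core pondering step reduces to a probability-weighted mixture of token embeddings; a minimal sketch, omitting the repeated forward passes and any projection the paper may apply:

```python
import torch

def pondering_embedding(logits, token_embeddings):
    """logits: (batch, vocab); token_embeddings: (vocab, dim).
    Returns the probability-weighted mixture of token embeddings that is fed
    back to the model in place of a sampled discrete token."""
    probs = torch.softmax(logits, dim=-1)
    return probs @ token_embeddings        # (batch, dim) continuous "pondering" embedding
```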

Regularized Latent Dynamics Prediction is a Strong Baseline for Behavioral Foundation Models

This paper proposes Regularized Latent Dynamics Prediction (RLDP), which augments a self-supervised latent next-state prediction objective with a simple orthogonality regularization to preserve feature diversity. RLDP matches or surpasses complex state-of-the-art representation learning methods in zero-shot RL, with particularly notable advantages in low-coverage settings.
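
A minimal sketch of the two-term objective, assuming an MSE prediction loss and a Gram-matrix orthogonality penalty on the latent features; the paper's exact regularizer and weighting may differ.

```python
import torch
import torch.nn.functional as F

def rldp_loss(z_t, z_next_pred, z_next, ortho_weight=1.0):
    """Latent next-state prediction plus an orthogonality penalty on the feature
    Gram matrix to preserve diversity."""
    pred = F.mse_loss(z_next_pred, z_next.detach())     # predict the next latent state
    z = F.normalize(z_t, dim=0)                         # normalize each feature dimension
    gram = z.t() @ z                                    # (dim, dim) feature correlations
    eye = torch.eye(gram.size(0), device=gram.device)
    ortho = (gram - eye).pow(2).mean()                  # push features toward orthogonality
    return pred + ortho_weight * ortho
```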

SNAP-UQ: Self-supervised Next-Activation Prediction for Single-Pass Uncertainty

SNAP-UQ proposes a single-forward-pass uncertainty estimation method tailored for TinyML scenarios. Lightweight int8 prediction heads are attached at selected tap layers of a backbone network; these heads predict the activation statistics of the next layer in a self-supervised manner. The deviation ("surprisal") between predicted and actual activations is aggregated into an uncertainty score. The method requires no additional forward passes, temporal buffers, or ensembles, and adds only tens of kilobytes of flash memory, enabling reliable distribution-shift detection and failure detection on microcontrollers.
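
A toy float-precision sketch of one tap-layer head, assuming it predicts a Gaussian over the next layer's mean activation and that surprisal is the corresponding negative log-likelihood; the deployed heads are int8, and the real statistics and aggregation may differ.

```python
import torch
import torch.nn as nn

class SnapHead(nn.Module):
    """Predicts a Gaussian over the next layer's mean activation from the tap
    layer's features; surprisal is the corresponding negative log-likelihood."""
    def __init__(self, tap_dim):
        super().__init__()
        self.proj = nn.Linear(tap_dim, 2)               # predict (mean, log_std)

    def surprisal(self, tap_act, next_act):
        pred_mean, log_std = self.proj(tap_act).unbind(dim=-1)
        actual = next_act.mean(dim=-1)                  # observed next-layer statistic
        return 0.5 * ((actual - pred_mean) / log_std.exp()) ** 2 + log_std

# A single uncertainty score would aggregate (e.g., sum) surprisal over all tap layers.
```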

Soft Equivariance Regularization for Invariant Self-Supervised Learning

This paper proposes SER (Soft Equivariance Regularization), a layer-decoupled design that applies soft equivariance regularization to intermediate ViT layers while preserving the invariance objective at the final layer. Without introducing additional modules, SER consistently improves classification accuracy and robustness for invariant SSL methods (MoCo-v3, DINO, Barlow Twins).
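
One way to realize a soft equivariance signal at an intermediate layer is to ask a small probe to recover the augmentation parameters from that layer's features while the invariance objective stays at the output; this instantiation is an assumption for illustration, not necessarily SER's exact regularizer.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EquivarianceProbe(nn.Module):
    """Soft equivariance signal at an intermediate layer: a small probe must
    recover the augmentation parameters (e.g., crop/color jitter strengths) from
    that layer's features, so they are not discarded too early."""
    def __init__(self, dim, n_aug_params=4):
        super().__init__()
        self.probe = nn.Linear(dim, n_aug_params)

    def forward(self, mid_feats, aug_params):
        # mid_feats: (batch, dim) pooled intermediate ViT features of the augmented view
        # aug_params: (batch, n_aug_params) parameters of the applied augmentation
        return F.mse_loss(self.probe(mid_feats), aug_params)
```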

Spectrum Tuning: Post-Training for Distributional Coverage and In-Context Steerability

This paper reveals that post-training methods such as RLHF and DPO systematically impair models' in-context steerability, output coverage, and distributional alignment. It proposes the Spectrum Suite evaluation framework and the Spectrum Tuning method, representing the first post-training approach that improves distributional alignment.

Temporal Slowness in Central Vision Drives Semantic Object Learning

By simulating human central vision (foveal cropping) and the temporal slowness principle (temporal contrastive learning) in SSL models trained on Ego4D data, this work shows that combining the two mechanisms effectively improves semantic object representations: central vision enhances foreground extraction, while temporal slowness distills semantic information during fixation periods.
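
A minimal sketch of how the two mechanisms combine when forming a positive pair: two central (foveal) crops taken a few frames apart from the same clip, later fed to a temporal contrastive loss; the crop fraction and lag range are illustrative.

```python
import torch

def foveal_temporal_pair(video, crop_frac=0.5, max_lag=3):
    """video: (T, C, H, W). Returns two central ('foveal') crops taken a few
    frames apart, to be fed to a temporal contrastive loss."""
    t, c, h, w = video.shape
    lag = torch.randint(1, max_lag + 1, (1,)).item()
    i = torch.randint(0, t - lag, (1,)).item()
    ch, cw = int(h * crop_frac), int(w * crop_frac)
    top, left = (h - ch) // 2, (w - cw) // 2
    crop = lambda frame: frame[:, top:top + ch, left:left + cw]
    return crop(video[i]), crop(video[i + lag])
```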

Why Prototypes Collapse: Diagnosing and Preventing Partial Collapse in Prototypical Self-Supervised Learning

This paper diagnoses that the root cause of partial prototype collapse in prototypical self-supervised learning is shortcut learning induced by joint optimization of the encoder and prototypes. It proposes a fully decoupled training strategy—estimating prototypes independently via an online GMM—to completely eliminate collapse and improve downstream performance.
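
A simplified sketch of the decoupled prototype estimation, using a periodic GMM refit on detached embeddings in place of the paper's online GMM; the prototype count and covariance type are illustrative.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def refit_prototypes(embeddings, n_prototypes=64):
    """embeddings: (n, dim) detached (no-gradient) features. Prototypes are the
    GMM component means, estimated independently of the encoder's objective."""
    gmm = GaussianMixture(n_components=n_prototypes, covariance_type="diag")
    gmm.fit(embeddings)
    return gmm.means_          # (n_prototypes, dim) used as fixed assignment targets
```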