🔄 Self-Supervised Learning

🎞️ ECCV2024 · 16 paper notes

📌 Same area in other venues: 📷 CVPR2026 (92) · 🔬 ICLR2026 (81) · 💬 ACL2026 (1) · 🧪 ICML2026 (28) · 🤖 AAAI2026 (16) · 🧠 NeurIPS2025 (35)

🔥 Top topics: Self-Supervised Learning ×6

Adaptive Multi-head Contrastive Learning

This paper proposes Adaptive Multi-head Contrastive Learning (AMCL), which generates different feature perspectives through multiple projection heads and independently weights each sample pair using an adaptive temperature mechanism derived from Maximum Likelihood Estimation (MLE). This effectively resolves the overlap in similarity distributions of positive and negative samples under diverse data augmentations, consistently improving the performance of SimCLR, MoCo, and Barlow Twins.
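
As a rough illustration of the loss structure described above, the sketch below averages an InfoNCE loss over several projection heads and divides each pairwise similarity by its own temperature. It is a minimal sketch, not the paper's implementation: the function name, the array shapes, and the choice to pass the temperatures in as an input are assumptions, and the MLE-based rule AMCL uses to derive them is not reproduced here.

```python
import numpy as np

def multi_head_info_nce(z1, z2, tau):
    """InfoNCE averaged over projection heads with per-pair temperatures.

    z1, z2: (heads, batch, dim) embeddings of two augmented views.
    tau:    (heads, batch, batch) positive temperatures, one per sample pair
            (supplied by the caller; hypothetical stand-in for AMCL's rule).
    """
    z1 = z1 / np.linalg.norm(z1, axis=-1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=-1, keepdims=True)
    sim = np.einsum("hid,hjd->hij", z1, z2)            # cosine similarities
    logits = sim / tau                                  # per-pair scaling
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    positives = np.diagonal(log_prob, axis1=1, axis2=2)   # (heads, batch)
    return float(-positives.mean())
```

With identical views, lowering every temperature sharpens the softmax around the positive pair, so the loss drops.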

COHO: Context-Sensitive City-Scale Hierarchical Urban Layout Generation

This work proposes a city-scale 2.5D urban layout generation method based on Graph Masked Autoencoders (GMAE). By capturing multi-level semantic context across buildings, blocks, and communities through a canonical graph representation and combining it with priority-scheduled iterative sampling, the method achieves realistic, semantically consistent, and topologically correct large-scale urban layout generation across 330 US cities.

Efficient Image Pre-Training with Siamese Cropped Masked Autoencoders

CropMAE is proposed: a siamese masked autoencoder trained by replacing video frame pairs with two randomly cropped views of the same image. With an extremely high masking ratio of 98.5%, it learns object boundary-aware representations using only 2 visible patches. This accelerates training by up to 23.8× compared to SiamMAE while achieving competitive performance on video propagation tasks.
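
The link between the 98.5% masking ratio and the 2 visible patches can be checked with a little arithmetic. The sketch below assumes a 224×224 input split into 16×16 patches and an MAE-style rule that truncates the number of kept patches to an integer; neither detail is stated in the summary above, and the function name is hypothetical.

```python
import random

def visible_patch_indices(num_patches, mask_ratio, seed=0):
    """Pick the patches left visible under a given masking ratio."""
    # Assumed rule: keep floor(num_patches * (1 - mask_ratio)) patches.
    keep = int(num_patches * (1 - mask_ratio))
    rng = random.Random(seed)
    return sorted(rng.sample(range(num_patches), keep))

# Assumed geometry: 224 / 16 = 14 patches per side, 196 in total.
num_patches = (224 // 16) ** 2
visible = visible_patch_indices(num_patches, mask_ratio=0.985)
print(num_patches, len(visible))
```

Under these assumptions, 196 × 0.015 = 2.94, which truncates to 2 visible patches.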

Exemplar-Free Continual Representation Learning via Learnable Drift Compensation

This paper proposes Learnable Drift Compensation (LDC), which trains a forward projector to map the old feature space to the new feature space. The projector compensates for the semantic drift of class prototypes without storing old exemplars, and it enables exemplar-free semi-supervised continual learning for the first time.
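
The idea of mapping old-space prototypes into the new feature space can be sketched as follows. This is a simplified illustration under stated assumptions: the projector is taken to be linear and is fitted in closed form by least squares on paired features of the same samples, whereas the summary only says the projector is trained. The function names are hypothetical.

```python
import numpy as np

def fit_forward_projector(old_feats, new_feats):
    """Fit W so that old_feats @ W approximates new_feats.

    old_feats, new_feats: (n_samples, dim) features of the same samples
    from the old and the new backbone. Least squares is an assumed
    stand-in for gradient-based training of the projector.
    """
    W, *_ = np.linalg.lstsq(old_feats, new_feats, rcond=None)
    return W

def compensate_prototypes(old_prototypes, W):
    """Move stored class prototypes into the new feature space."""
    return old_prototypes @ W
```

Only the prototypes and the projector are needed to update old classes, which is why no old exemplars have to be kept.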

FlowCon: Out-of-Distribution Detection using Flow-Based Contrastive Learning

This paper proposes FlowCon, a density-estimation-based OOD detection method that combines normalizing flows with supervised contrastive learning. A contrastive loss based on the Bhattacharyya coefficient is applied in the latent space of the flow model to learn class-conditional Gaussian distributions, giving efficient OOD detection without external OOD data or retraining of the classifier.
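
For reference, the Bhattacharyya coefficient between two Gaussians has a closed form. The sketch below computes it for diagonal covariances; the restriction to diagonal covariances and the function name are assumptions made for brevity, and how FlowCon turns this similarity into a contrastive loss is not shown.

```python
import numpy as np

def bhattacharyya_coefficient(mu1, var1, mu2, var2):
    """Bhattacharyya coefficient of two diagonal-covariance Gaussians.

    Returns a value in (0, 1]; 1 means the two distributions coincide.
    """
    mu1, var1, mu2, var2 = (np.asarray(a, dtype=float)
                            for a in (mu1, var1, mu2, var2))
    var = 0.5 * (var1 + var2)                      # averaged covariance
    mean_term = 0.125 * np.sum((mu1 - mu2) ** 2 / var)
    cov_term = 0.5 * np.sum(np.log(var) - 0.5 * (np.log(var1) + np.log(var2)))
    return float(np.exp(-(mean_term + cov_term)))  # exp(-Bhattacharyya distance)
```

Identical class distributions give a coefficient of 1, and well-separated ones give a value near 0, which is the overlap signal a contrastive objective can push apart or pull together.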

InfMAE: A Foundation Model in the Infrared Modality

This paper proposes InfMAE, the first foundation model designed specifically for the infrared modality. By constructing the Inf30 dataset with 300,000 infrared images, and designing an information-aware masking strategy along with a multi-scale encoder, the proposed method outperforms existing state-of-the-art approaches in three downstream tasks: infrared semantic segmentation, object detection, and small object detection.

MarineInst: A Foundation Model for Marine Image Analysis with Instance Visual Description

This paper proposes MarineInst, a foundation model for marine image analysis that simultaneously outputs instance masks and semantic descriptions. Additionally, it constructs MarineInst20M—the largest marine image dataset to date (20 million images), supporting multi-level marine visual analysis tasks from image-level scene understanding to region-level instance understanding.

PosFormer: Recognizing Complex Handwritten Mathematical Expression with Position Forest Transformer

The Position Forest Transformer (PosFormer) is proposed. By encoding the LaTeX sequence of a mathematical expression into a position forest structure, it explicitly models the hierarchical and spatial relationships among symbols. Together with an implicit attention correction module, PosFormer outperforms SOTA methods on single-line, multi-line, and complex expression datasets without introducing additional inference overhead.

PromptCCD: Learning Gaussian Mixture Prompt Pool for Continual Category Discovery

This paper proposes the PromptCCD framework, which utilizes a Gaussian Mixture Model (GMM) as a prompt pool to achieve continual discovery of novel categories in unlabeled data streams while mitigating catastrophic forgetting.

Rethinking Unsupervised Outlier Detection via Multiple Thresholding

This work proposes the Multi-T (Multiple Thresholding) module, which generates two thresholds to isolate inliers and outliers within a target dataset, respectively. By utilizing the identified inliers to train a clean normal manifold and using the outliers for feature denoising, Multi-T significantly enhances the performance of existing outlier scoring methods.
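
The two-threshold split can be illustrated with a small sketch. How Multi-T actually derives its thresholds is not described in the summary above, so the quantile-based thresholds, their default values, and the function name below are all hypothetical.

```python
import numpy as np

def split_by_two_thresholds(scores, low_q=0.3, high_q=0.95):
    """Split outlier scores into confident inliers and confident outliers.

    Samples scoring at or below the low threshold are treated as inliers,
    those at or above the high threshold as outliers; samples in between
    stay unassigned. Quantile thresholds are an assumption of this sketch.
    """
    scores = np.asarray(scores, dtype=float)
    t_in, t_out = np.quantile(scores, [low_q, high_q])
    inliers = scores <= t_in
    outliers = scores >= t_out
    return inliers, outliers
```

The inlier mask would select the samples used to fit the clean normal manifold, and the outlier mask the samples used for feature denoising.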

Revisiting Supervision for Continual Representation Learning

This work challenges the common belief that "self-supervised learning outperforms supervised learning in continual representation learning." It reveals that supervised learning with an MLP projector can build stronger representations than SSL in continual learning scenarios. The key lies not in the presence or absence of labels, but in the enhancement of feature transferability by the MLP projector.

SCPNet: Unsupervised Cross-modal Homography Estimation via Intra-modal Self-supervised Learning

SCPNet is proposed as the first method to achieve effective unsupervised cross-modal homography estimation on datasets with large modal discrepancy (e.g., satellite-to-map). It relies on the synergy of three key components: intra-modal self-supervised learning, correlation networks, and consistent feature map projection. Its MACE is 14% lower than that of the supervised method MHN.

Self-supervised Video Copy Localization with Regional Token Representation

This paper proposes a self-supervised video copy localization framework. By introducing Regional Tokens into Vision Transformers to capture local regional information and utilizing the transitivity property to automatically generate training data, the proposed method outperforms supervised approaches without requiring manual annotations.

STSP: Spatial-Temporal Subspace Projection for Video Class-Incremental Learning

The Spatial-Temporal Subspace Projection (STSP) method is proposed to address catastrophic forgetting in video class-incremental learning. By representing each class with orthogonal subspace bases using a Temporal-based Subspace Classifier (TSC) and constraining gradients to the null space of old task features using Spatial-based Gradient Projection (SGP), the method achieves SOTA results on HMDB51, UCF101, and SSv2.
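
Constraining gradients to the null space of old task features can be sketched for a single linear layer. This is a generic null-space projection under stated assumptions, not the paper's SGP module: the SVD-based basis, the `energy` cutoff, and the function names are hypothetical.

```python
import numpy as np

def null_space_projector(old_feats, energy=0.99):
    """Projector onto the (approximate) null space of old task features.

    old_feats: (n_samples, dim) inputs seen by the layer on old tasks.
    energy: fraction of squared singular values treated as the old subspace
            (an assumed hyperparameter).
    """
    _, s, vt = np.linalg.svd(old_feats, full_matrices=False)
    cum = np.cumsum(s ** 2) / np.sum(s ** 2)
    k = int(np.searchsorted(cum, energy)) + 1
    basis = vt[:k].T                               # (dim, k) old-feature basis
    return np.eye(old_feats.shape[1]) - basis @ basis.T

def project_gradient(grad, projector):
    """Project a (out_dim, dim) weight gradient for a layer y = W x."""
    return grad @ projector
```

A weight update built from the projected gradient leaves the layer's outputs on old-task inputs unchanged, which is the mechanism that limits forgetting.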

ViC-MAE: Self-Supervised Representation Learning from Images and Video with Contrastive Masked Autoencoders

ViC-MAE unifies contrastive learning and masked autoencoders into a single framework. By treating short video clips as augmented views (rather than duplicating images into video frames), it achieves strong performance on both image and video downstream tasks, reaching 87.1% top-1 on ImageNet-1K (+2.4% over OmniMAE) and 75.9% on SSv2.

WeCromCL: Weakly Supervised Cross-Modality Contrastive Learning for Transcription-only Supervised Text Spotting

The WeCromCL framework is proposed to achieve scene text spotting using only transcription annotations (without location annotations) through weakly supervised atomic cross-modality contrastive learning. The detected anchor points are utilized as pseudo-labels to train a single-point supervised text detector, achieving performance close to fully supervised methods without bounding box annotations.