🔄 Self-Supervised Learning

📷 CVPR2025 · 26 paper notes

📌 Same area in other venues: 📷 CVPR2026 (92) · 🔬 ICLR2026 (81) · 💬 ACL2026 (1) · 🧪 ICML2026 (28) · 🤖 AAAI2026 (16) · 🧠 NeurIPS2025 (35)

🔥 Top topics: Continual Learning ×4 · Few-/Zero-Shot Learning ×3 · Adversarial Robustness ×2

AutoSSVH: Automated Frame Sampling for Self-Supervised Video Hashing

BoSS: A Best-of-Strategies Selector as an Oracle for Deep Active Learning: Proposes BoSS—a scalable active learning oracle strategy that generates candidate batches by ensembling multiple selection strategies, evaluates performance gain by freezing the backbone and retraining only the last layer, and selects the optimal batch. It showcases oracle performance on large-scale datasets like ImageNet for the first time, revealing that SOTA active learning strategies still have significant room for improvement.
Breaking the Tuning Barrier: Zero-Hyperparameters Yield Multi-Corner Analysis Via Learned Priors: Replaces legacy hand-tuned priors with a pretrained foundation model (TabPFN) so that multi-corner circuit yield analysis requires no hyperparameter tuning. The backbone is frozen and used for in-context learning, knowledge transfers automatically across corners, and automatic feature selection reduces the input from 1152 to 48 dimensions. The method achieves SOTA accuracy (MRE down to 0.11%) on SRAM benchmarks while reducing verification costs by over 10x.

CheXWorld: Image World Modeling for Radiograph Representation Learning

Do Your Best and Get Enough Rest for Continual Learning: Inspired by Ebbinghaus's forgetting curve theory, this paper proposes the View-Batch Model (VBM). By replacing multiple distinct samples in a batch with multiple augmented views (replay) of the same sample, VBM extends the recall interval by a factor of \(V\) to an optimal range. Concurrently, it employs a one-to-many KL-divergence self-supervised loss to extract more knowledge from a single sample ("do your best"). Serving as a drop-in replacement, VBM consistently improves performance across various continual learning methods.

Escaping Plato's Cave: Towards the Alignment of 3D and Text Latent Spaces

Few-Shot Implicit Function Generation via Equivariance: Generates implicit functions (NeRF/SDF) from few-shot samples using equivariance constraints, leveraging symmetry priors to reduce data requirements.
From Prototypes to General Distributions: An Efficient Curriculum for Masked Image Modeling: Proposes a prototype-driven curriculum learning for MAE, which identifies "prototype" samples (representative images close to cluster centroids) in the dataset using K-means clustering. By using a temperature-controlled sampling strategy, the training smoothly transitions from prototypes to the full distribution, achieving an up to \(8\times\) training acceleration (a 200-epoch prototype curriculum performs comparably to an 800-epoch standard MAE).
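
The temperature-controlled transition from prototypes to the full distribution can be illustrated with a small sketch. It assumes, purely for illustration, that each image's sampling weight is a softmax over its negative distance to the nearest K-means centroid; the function name `prototype_sampling_weights` and the example distances are hypothetical and are not taken from the paper.

```python
import math


def prototype_sampling_weights(distances, temperature):
    """Softmax over negative centroid distances (an illustrative choice)."""
    logits = [-d / temperature for d in distances]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical distances of three images to their nearest K-means centroid.
distances = [0.1, 1.0, 3.0]

# Low temperature: almost all probability mass sits on the prototype-like image.
early = prototype_sampling_weights(distances, temperature=0.05)

# Very high temperature: the weights approach uniform sampling over all images.
late = prototype_sampling_weights(distances, temperature=1e6)
```

Under this assumption, raising the temperature over the course of training moves the sampler smoothly from prototype-heavy batches to the full dataset.
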
Hyperbolic Category Discovery: Proposes the HypCD framework, which moves representation learning in Generalized Category Discovery (GCD) from Euclidean/spherical spaces to hyperbolic space (the Poincaré ball model). Because volume in hyperbolic space grows exponentially with radius, it is naturally suited to encoding hierarchical structure; building on this, the work introduces hybrid distance-angle similarity learning and a hyperbolic classifier. Applied to SelEx, it raises accuracy on CUB from 69.1% to 71.8% and on ImageNet-100 from 87.1% to 88.3%.
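
For readers unfamiliar with the Poincaré ball model, the following minimal sketch computes the standard geodesic distance in the unit ball, assuming curvature -1. It is not HypCD's hybrid distance-angle similarity, which the summary does not specify; the function name `poincare_distance` is illustrative.

```python
import math


def poincare_distance(u, v):
    """Geodesic distance in the unit Poincaré ball (curvature -1 assumed)."""
    sq_norm_u = sum(a * a for a in u)
    sq_norm_v = sum(b * b for b in v)
    if sq_norm_u >= 1.0 or sq_norm_v >= 1.0:
        raise ValueError("points must lie strictly inside the unit ball")
    sq_diff = sum((a - b) ** 2 for a, b in zip(u, v))
    scale = (1.0 - sq_norm_u) * (1.0 - sq_norm_v)
    return math.acosh(1.0 + 2.0 * sq_diff / scale)


origin = [0.0, 0.0]

# Two pairs with the same Euclidean gap (0.1) but different hyperbolic distance.
near_center = poincare_distance(origin, [0.1, 0.0])
near_boundary = poincare_distance([0.8, 0.0], [0.9, 0.0])
```

The same Euclidean gap corresponds to a larger hyperbolic distance near the boundary, which is the extra room that makes the space convenient for tree-like hierarchies.
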
Learning to Normalize on the SPD Manifold under Bures-Wasserstein Geometry: This paper proposes GBWBN, the first batch normalization method for the SPD manifold based on generalized Bures-Wasserstein geometry. By introducing learnable metric parameters and matrix power non-linear transformations to effectively handle ill-conditioned covariance matrices, it achieves SOTA performance on skeleton-based action recognition and EEG classification.
MAP: Unleashing Hybrid Mamba-Transformer Vision Backbone's Potential with Masked Autoregressive Pretraining: Proposes Masked Autoregressive Pretraining (MAP), a hierarchical pretraining objective that combines local MAE modeling with row-level autoregressive decoding. It successfully pretrains hybrid Mamba-Transformer vision backbones for the first time, significantly outperforming the MAE and AR strategies used individually.
MaRI: Material Retrieval Integration across Domains: This work proposes the MaRI framework, which constructs a shared embedding space using dual DINOv2 encoders (image + material) via contrastive learning. By combining synthetic data from Blender and real-world material data generated by ZeST, it achieves accurate cross-domain PBR material retrieval.
MetaWriter: Personalized Handwritten Text Recognition Using Meta-Learned Prompt Tuning: MetaWriter formulates the personalized adaptation of handwritten text recognition as a prompt tuning problem. By combining it with a Masked Autoencoder (MAE) self-supervised auxiliary task, it achieves label-free test-time adaptation. It optimizes prompt initialization via meta-learning to align the self-supervised loss with the recognition loss, achieving SOTA on IAM and RIMES by updating less than 1% of the parameters.
MOS: Modeling Object-Scene Associations in Generalized Category Discovery: Challenges the traditional view in GCD that "scene information is noise," revealing that scenes are misunderstood as noise due to the "ambiguity challenge" (conflicts in base/novel relations between objects and scenes). It proposes the MOS framework, which effectively utilizes scene information through a dual-branch network and an MLP scene-aware module, achieving an average improvement of \(4\%\) on fine-grained GCD.
OCRT: Boosting Foundation Models in the Open World with Object-Concept-Relation Triad: OCRT proposes a plug-and-play three-stage pipeline—Object (Slot Attention decoupling), Concept (importance filtering), and Relation (concept graph reasoning)—which significantly improves the accuracy of SAM on weakly-supervised medical/camouflaged segmentation and the robustness of CLIP under adversarial attacks, without altering the FM backbone.
Order-Robust Class Incremental Learning: Graph-Driven Dynamic Similarity Grouping: Proposes GDDSG, which uses graph coloring over class similarities to form groups whose member classes are as dissimilar as possible, reducing interference. Each group is learned independently with an NCM classifier and a LoRA adapter, achieving 94.00% accuracy and only a 0.78% forgetting rate on CIFAR-100 10-step (prev. SOTA RanPAC: 90.50% / 3.49%).
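
The grouping idea can be sketched with a plain greedy graph coloring. This is an illustration under stated assumptions, not GDDSG's actual procedure: classes are joined by an edge when their similarity exceeds a hypothetical `threshold`, and classes that receive the same color form one group. The similarity matrix below is made up.

```python
def group_by_coloring(similarity, threshold):
    """Greedy coloring: similar classes (edge) never share a color/group."""
    n = len(similarity)
    colors = [-1] * n  # -1 means "not yet assigned"
    for i in range(n):
        # Colors already taken by classes that are too similar to class i.
        used = {
            colors[j]
            for j in range(n)
            if j != i and colors[j] != -1 and similarity[i][j] > threshold
        }
        color = 0
        while color in used:
            color += 1
        colors[i] = color
    return colors


# Hypothetical similarities: classes 0/1 are alike, and classes 2/3 are alike.
similarity = [
    [1.0, 0.9, 0.1, 0.2],
    [0.9, 1.0, 0.1, 0.1],
    [0.1, 0.1, 1.0, 0.8],
    [0.2, 0.1, 0.8, 1.0],
]
groups = group_by_coloring(similarity, threshold=0.5)
```

In this toy case the two similar pairs are split apart, so each group contains only mutually dissimilar classes.
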
Representation Learning for Spatiotemporal Physical Systems: This work systematically evaluates the capability of general self-supervised learning methods to learn physically meaningful representations in spatiotemporal physical systems. The evaluation reveals that JEPA, which performs predictions in the latent space, significantly outperforms pixel-level reconstruction methods (MAE) and autoregressive models, closely approaching the performance of the domain-specific physical modeling method DISCO.
SATA: Spatial Autocorrelation Token Analysis for Enhancing the Robustness of Vision Transformers: This paper proposes SATA (Spatial Autocorrelation Token Analysis), a training-free robustness enhancement method for ViTs. By grouping tokens based on spatial correlation patterns via spatial autocorrelation analysis and reweighting token representations according to the grouping information, SATA improves ViT robustness under distribution shifts and adversarial attacks without compromising clean performance.
ScaleLSD: Scalable Deep Line Segment Detection Streamlined: By streamlining the line segment detection architecture (introducing HAT-induced proposal verification) and designing an efficient pseudo-label generation pipeline (LSD-Rectifier), ScaleLSD achieves large-scale self-supervised training on 10 million unlabeled images for the first time, comprehensively outperforming classical non-deep LSD methods in zero-shot evaluations.
SEC-Prompt: SEmantic Complementary Prompting for Few-Shot Class-Incremental Learning: Proposes the SEC-Prompt (SEmantic Complementary Prompting) framework, which learns two sets of semantically complementary prompts: discriminative prompts (D-Prompt), which reinforce inter-class discrimination, and non-discriminative prompts (ND-Prompt), which facilitate generalization to new classes. The two sets cooperate through an adaptive query mechanism and achieve SOTA performance on three benchmark datasets.
SMILE: Infusing Spatial and Motion Semantics in Masked Video Learning: This paper proposes SMILE, which enhances masked video modeling through synthetic motion augmentation (overlaying segmented objects moving along random trajectories onto videos) and CLIP feature reconstruction targets. Combined with a trajectory-guided masking strategy, it significantly boosts K400 linear probing to 56.2% (an improvement over the previous SOTA of 47.5%).
Spectral State Space Model for Rotation-Invariant Visual Representation Learning: Proposes Spectral VMamba, which orders the patch traversal sequence using eigenvectors of the spectral graph Laplacian (instead of predefined scan lines) and combines this ordering with a Rotation Feature Normalizer (RFN, aggregating features of 4 canonical rotations) to achieve 87.86% accuracy on miniImageNet with complete invariance to canonical rotations.
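
Why aggregating over the 4 canonical rotations gives invariance can be seen in a toy sketch. It assumes the aggregation is a simple mean, which the summary does not state, and `toy_features` is a hypothetical stand-in for a backbone that is deliberately not rotation invariant.

```python
def rot90(grid):
    """Rotate a 2D list by 90 degrees clockwise."""
    return [list(row) for row in zip(*grid[::-1])]


def toy_features(grid):
    """Hypothetical position-sensitive features (not rotation invariant)."""
    width = len(grid[0])
    top_left = grid[0][0]
    weighted = sum(
        value * (r * width + c)
        for r, row in enumerate(grid)
        for c, value in enumerate(row)
    )
    return [float(top_left), float(weighted)]


def rotation_pooled_features(grid):
    """Average the features of the input's four canonical rotations."""
    feats = []
    current = grid
    for _ in range(4):
        feats.append(toy_features(current))
        current = rot90(current)
    return [sum(f[k] for f in feats) / 4 for k in range(len(feats[0]))]


image = [[1, 2], [3, 4]]
pooled = rotation_pooled_features(image)
pooled_rotated = rotation_pooled_features(rot90(image))
```

Rotating the input only permutes the set of four views being averaged, so `pooled` and `pooled_rotated` coincide even though `toy_features` alone changes under rotation.
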
Task-Agnostic Guided Feature Expansion for Class-Incremental Learning: The TagFex framework is proposed to continuously capture task-agnostic features via continual self-supervised learning. These features are adaptively integrated with task-specific features using merge attention and then distilled back into the inference model, mitigating the feature collision problem in expansion-based class-incremental learning.
Text-Phase Synergy Network with Dual Priors for Unsupervised Cross-Domain Image Retrieval: TPSNet is proposed to address unsupervised cross-domain image retrieval using dual text-phase priors: domain prompts (text prior) provide more precise semantic supervision than pseudo-labels, while phase features (phase prior) achieve semantic-preserving domain-invariant alignment, with both synergistically fused through cross-attention.
Transformers without Normalization: Observes that the input-output mapping of LayerNorm has a tanh-like shape and proposes Dynamic Tanh (DyT) as a plug-and-play alternative to normalization layers: \(\text{DyT}(x) = \gamma \odot \tanh(\alpha x) + \beta\). DyT matches or exceeds LN across tasks spanning vision, language, diffusion, and speech.
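
The DyT formula can be written out directly. The sketch below uses plain Python lists for one feature vector and assumes that \(\alpha\) is a single scalar while \(\gamma\) and \(\beta\) hold one entry per channel; in a real model these would be learnable parameters, and the values here are arbitrary.

```python
import math


def dyt(x, alpha, gamma, beta):
    """Elementwise DyT(x) = gamma * tanh(alpha * x) + beta for one vector."""
    return [g * math.tanh(alpha * xi) + b for xi, g, b in zip(x, gamma, beta)]


# Arbitrary example values (assumed, not taken from the paper).
x = [0.0, 1.0, 100.0]
gamma = [1.0, 2.0, 1.0]
beta = [0.0, 1.0, 0.0]
y = dyt(x, alpha=0.5, gamma=gamma, beta=beta)
```

Unlike LayerNorm, this operates on each element independently without computing mean or variance statistics, and the tanh squashes extreme activations such as the third input toward the range set by \(\gamma\) and \(\beta\).
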
UniSTD: Towards Unified Spatio-Temporal Learning Across Diverse Disciplines: The authors propose the UniSTD framework, which utilizes a standard Transformer combined with a rank-adaptive mixture-of-experts (RA-MoE) and a lightweight temporal module. This design enables a single model to simultaneously handle 10 spatio-temporal predictive tasks across 4 diverse disciplines without performance loss, outperforming existing joint-training methods by 18.8 PSNR in multi-task scenarios.