🔄 Self-Supervised Learning
📷 CVPR 2026 · 38 paper notes
- A Stitch in Time: Learning Procedural Workflow via Self-Supervised Plackett-Luce Ranking
  - This paper proposes PL-Stitch, a self-supervised framework that uses the Plackett-Luce probabilistic ranking model to turn the temporal ordering of video frames into a pretraining signal. The method learns "procedure-aware" video representations and consistently outperforms existing self-supervised approaches on surgical phase recognition and cooking action segmentation.
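As a concrete reference for the ranking objective above, here is a minimal NumPy sketch of a generic Plackett-Luce negative log-likelihood over per-frame scores. This is the textbook PL model, not the paper's exact training objective, and the function name is mine.

```python
import numpy as np

def plackett_luce_nll(scores, order):
    """Negative log-likelihood of a ranking under the Plackett-Luce model.

    scores: per-item scores (higher = should be ranked earlier).
    order:  the observed permutation of item indices.
    """
    s = np.asarray(scores, dtype=float)[list(order)]
    nll = 0.0
    for i in range(len(s) - 1):          # the last factor is always 1
        # P(next pick) = exp(s_i) / sum_{j >= i} exp(s_j), in log space
        tail = s[i:]
        m = tail.max()                   # log-sum-exp stabilization
        nll -= s[i] - (m + np.log(np.exp(tail - m).sum()))
    return nll

# An ordering that agrees with the scores is more likely (lower NLL)
# than the reversed one, which is what makes temporal order a usable
# self-supervised signal.
good = plackett_luce_nll([3.0, 2.0, 1.0], [0, 1, 2])
bad = plackett_luce_nll([3.0, 2.0, 1.0], [2, 1, 0])
```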
- AcTTA: Rethinking Test-Time Adaptation via Dynamic Activation
  - This paper proposes AcTTA, a framework that for the first time treats activation functions as learnable components for test-time adaptation (TTA). By introducing a parameterized activation center shift \(c\) and asymmetric gradient scaling \(\lambda_{pos}, \lambda_{neg}\) to replace or augment conventional normalization-layer adaptation, AcTTA consistently outperforms all normalization-based TTA methods on CIFAR-10/100-C and ImageNet-C, while supporting learning rates up to 10× larger.
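The note names the symbols but not the functional form. One plausible reading of a center shift `c` with asymmetric scaling `lam_pos`, `lam_neg` is a leaky-ReLU-like activation whose kink and slopes are free (adaptable) parameters; the sketch below is that assumption, not the paper's definition.

```python
import numpy as np

def shifted_asymmetric_act(x, c=0.0, lam_pos=1.0, lam_neg=0.1):
    """Hypothetical center-shifted, asymmetrically scaled activation:
    a leaky-ReLU-like function whose kink sits at c instead of 0 and
    whose two slopes lam_pos / lam_neg are free parameters."""
    x = np.asarray(x, dtype=float)
    return np.where(x >= c, lam_pos * (x - c), lam_neg * (x - c))

y = shifted_asymmetric_act(np.array([-2.0, 0.5, 2.0]),
                           c=0.5, lam_pos=1.0, lam_neg=0.1)
```

With `c=0`, `lam_pos=1`, `lam_neg` small, this reduces to a plain leaky ReLU; TTA would then adapt `c` and the two slopes instead of (or alongside) normalization statistics.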
- An Optimal Transport-driven Approach for Cultivating Latent Space in Online Incremental Learning
  - This paper proposes MMOT, an online mixture model learning framework driven by optimal transport theory. By maintaining multiple adaptive centroids per class, MMOT more accurately captures the multimodal structure of online data streams. Combined with a dynamic preservation strategy that enhances class discriminability, MMOT effectively alleviates catastrophic forgetting in online class-incremental learning (OCIL).
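As background for the OT machinery the note refers to, here is a minimal Sinkhorn sketch assigning a two-mode class to two per-class centroids. This is generic entropic optimal transport, not MMOT's actual algorithm, and the data is synthetic.

```python
import numpy as np

def sinkhorn(cost, a, b, eps=1.0, n_iter=100):
    """Standard entropic-OT Sinkhorn iterations returning the transport
    plan between histograms a and b for the given cost matrix."""
    K = np.exp(-cost / eps)
    u = np.ones_like(a)
    for _ in range(n_iter):
        v = b / (K.T @ u)
        u = a / (K @ v)
    return u[:, None] * K * v[None, :]

rng = np.random.default_rng(0)
# A two-mode "class": samples around two clusters, two centroids for it.
X = np.concatenate([rng.normal(size=(20, 2)) + [5.0, 0.0],
                    rng.normal(size=(20, 2)) - [5.0, 0.0]])
cents = np.array([[4.0, 0.0], [-4.0, 0.0]])
cost = ((X[:, None] - cents[None]) ** 2).sum(-1)   # squared distances
P = sinkhorn(cost, np.full(40, 1 / 40), np.full(2, 1 / 2))
assign = P.argmax(1)   # each sample goes to the centroid of its mode
```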
- BD-Merging: Bias-Aware Dynamic Model Merging with Evidence-Guided Contrastive Learning
  - This paper proposes the BD-Merging framework, which trains a debiased router via Dirichlet evidential modeling, Adjacency Discrepancy Score (ADS), and discrepancy-aware contrastive learning to adaptively assign model merging weights, significantly improving the robustness and generalization of merged models under test-time distribution shifts and on unseen tasks.
- BoSS: A Best-of-Strategies Selector as an Oracle for Deep Active Learning
  - This paper proposes BoSS, a scalable oracle strategy selection framework. In each active learning round, multiple query strategies are run in parallel on random sub-pools to generate candidate batches; each candidate batch is evaluated rapidly by freezing the backbone and retraining only the final linear head; the batch yielding the greatest performance gain is selected. This framework enables quantification of the gap between existing AL strategies and the theoretical optimum.
- BoSS: A Best-of-Strategies Selector as an Oracle for Deep Active Learning
  - This paper proposes BoSS (Best-of-Strategies Selector), which generates 100 candidate batches by ensembling 10 complementary AL selection strategies, and efficiently evaluates the performance gain of each candidate batch by freezing the pretrained backbone and retraining only the final linear layer. The best-performing batch is selected as an Oracle upper-bound reference. BoSS is the first deep active learning Oracle scalable to ImageNet, and reveals that current state-of-the-art strategies still leave approximately a 2× accuracy improvement gap on large-scale, many-class datasets.
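The frozen-backbone evaluation step can be sketched cheaply. The toy below uses a closed-form ridge-regression head as a stand-in for retraining the final linear layer; the Gaussian features and random candidate batches are synthetic placeholders for real frozen features and real strategy outputs.

```python
import numpy as np

rng = np.random.default_rng(0)

def head_accuracy(X_lab, y_lab, X_val, y_val, n_cls, lam=1e-2):
    """Fit a closed-form ridge-regression head on frozen features and
    report validation accuracy (a cheap stand-in for retraining the
    final linear layer)."""
    Y = np.eye(n_cls)[y_lab]                       # one-hot targets
    A = X_lab.T @ X_lab + lam * np.eye(X_lab.shape[1])
    W = np.linalg.solve(A, X_lab.T @ Y)
    return float(((X_val @ W).argmax(1) == y_val).mean())

# Toy "frozen features": two well-separated Gaussian classes in 5-D.
n_cls, d = 2, 5
means = rng.normal(size=(n_cls, d)) * 3.0
X_pool = np.concatenate([means[c] + rng.normal(size=(50, d)) for c in range(n_cls)])
y_pool = np.repeat(np.arange(n_cls), 50)
X_val = np.concatenate([means[c] + rng.normal(size=(50, d)) for c in range(n_cls)])
y_val = np.repeat(np.arange(n_cls), 50)

# Candidate batches (random sub-pool picks standing in for different
# query strategies); the best-evaluated batch is the oracle choice.
candidates = [rng.choice(len(X_pool), size=20, replace=False) for _ in range(5)]
accs = [head_accuracy(X_pool[b], y_pool[b], X_val, y_val, n_cls) for b in candidates]
best = int(np.argmax(accs))
```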
- Breaking the Tuning Barrier: Zero-Hyperparameters Yield Multi-Corner Analysis Via Learned Priors
  - This paper proposes a zero-hyperparameter multi-corner yield analysis framework built on learned priors (the TabPFN foundation model). By replacing traditional GP/normalizing-flow hyperparameter tuning with in-context Bayesian inference, and combining automatic feature selection, cross-corner knowledge transfer, and uncertainty-driven active learning, the framework achieves an MRE as low as 0.11% with no manual tuning, reducing verification cost by over 10×.
- Breaking the Tuning Barrier: Zero-Hyperparameters Yield Multi-Corner Analysis Via Learned Priors
  - This paper proposes replacing handcrafted priors (GP kernels, IS Gaussian assumptions) with the learned prior of the foundation model TabPFN, enabling zero-hyperparameter multi-PVT-corner yield analysis. On industrial-grade SRAM benchmarks, the method achieves state-of-the-art accuracy (MRE as low as 0.11%) while reducing verification cost by more than 10×.
- Chain-of-Models Pre-Training: Rethinking Training Acceleration of Vision Foundation Models
  - This paper proposes Chain-of-Models Pre-Training (CoM-PT), which arranges vision foundation models in a size-ordered "model chain" and progressively accelerates training via inverse knowledge transfer (weight initialization + feature distillation) from smaller to larger models, achieving lossless training acceleration whose efficiency improves as the model family grows.
- CraterBench-R: Instance-Level Crater Retrieval for Planetary Scale
  - This work is the first to formalize crater analysis as an instance-level image retrieval problem. It introduces the CraterBench-R benchmark (~25K Mars crater IDs, 50K gallery, 5K queries), and through systematic diagnosis reveals that single-vector pooling imposes an accuracy ceiling while supervised metric learning consistently degrades performance. A training-free instance token aggregation method is proposed—selecting K seed tokens via top-K attention or FPS and performing cosine nearest-neighbor residual assignment—to compress 196 ViT patch tokens into K representative tokens for late interaction matching. At K=64, the method matches full-token accuracy with substantially reduced storage. A practical two-stage pipeline (single-vector coarse retrieval + instance token re-ranking) recovers 89–94% of full-pipeline accuracy.
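A minimal sketch of the aggregation idea, with token norm standing in for the paper's top-K attention or FPS seed selection (the seeding rule here is my simplification), plus a generic late-interaction (MaxSim) scorer.

```python
import numpy as np

def aggregate_tokens(T, K):
    """Compress N x d patch tokens into K tokens: choose K seed tokens
    (here by L2 norm, a stand-in for the paper's top-K attention or
    FPS seeding), assign every token to its nearest seed by cosine
    similarity, and average each group."""
    Tn = T / np.linalg.norm(T, axis=1, keepdims=True)
    seeds = np.argsort(-np.linalg.norm(T, axis=1))[:K]
    assign = (Tn @ Tn[seeds].T).argmax(1)          # cosine NN assignment
    return np.stack([T[assign == k].mean(0) if (assign == k).any()
                     else T[seeds[k]] for k in range(K)])

def maxsim(Q, G):
    """Late-interaction (MaxSim) score: every query token is matched
    to its best gallery token and the similarities are summed."""
    Qn = Q / np.linalg.norm(Q, axis=1, keepdims=True)
    Gn = G / np.linalg.norm(G, axis=1, keepdims=True)
    return float((Qn @ Gn.T).max(1).sum())

rng = np.random.default_rng(0)
tokens = rng.normal(size=(196, 32))    # e.g. 14 x 14 ViT patch tokens
compact = aggregate_tokens(tokens, K=16)
```

Retrieval then scores `maxsim(compact_query, compact_gallery_image)` per gallery entry, with K tokens stored instead of 196.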
- D2Dewarp: Dual Dimensions Geometric Representation Learning Based Document Image Dewarping
  - This paper proposes D2Dewarp—the first document dewarping method that learns geometric representations from both horizontal and vertical dimensions. A UNet with dual decoders predicts horizontal lines (top/bottom boundaries of documents, tables, and text lines) and vertical lines (left/right boundaries) respectively. An HV Fusion Module cross-fuses features from both directions via mixed attention. The authors also introduce the DocDewarpHV dataset containing 114K images with dual-dimension annotations.
- DiverseDiT: Towards Diverse Representation Learning in Diffusion Transformers
  - Through systematic analysis, this work identifies inter-block representation diversity as a key factor for effective learning in DiTs, and proposes DiverseDiT: long residual connections to diversify inputs combined with a representation diversity loss to explicitly promote feature differentiation across blocks—accelerating convergence and improving generation quality without any external guidance model.
- GeoBridge: A Semantic-Anchored Multi-View Foundation Model for Geo-Localization
  - This paper proposes GeoBridge, a semantic-anchored multi-view foundation model for geo-localization that bridges UAV, street-view, and satellite imagery through textual descriptions acting as cross-modal semantic anchors, enabling bidirectional cross-view matching and language-to-image localization. The authors also introduce the GeoLoc dataset (50K+ location tuples across 36 countries).
- GeoChemAD: Benchmarking Unsupervised Geochemical Anomaly Detection for Mineral Exploration
  - This paper introduces GeoChemAD, the first open-source multi-region multi-element geochemical anomaly detection benchmark (8 subsets covering three sampling media—sediment/rock chip/soil—and four target elements—Au/Cu/Ni/W), and proposes GeoChemFormer, a two-stage Transformer framework that first learns spatial context and then models inter-element dependencies, achieving a mean AUC of 0.7712 that surpasses all baselines.
- GeoChemAD: Benchmarking Unsupervised Geochemical Anomaly Detection for Mineral Exploration
  - This paper introduces GeoChemAD, an open-source benchmark dataset, and GeoChemFormer, a two-stage framework that performs unsupervised geochemical anomaly detection via spatial context learning and elemental dependency modeling, achieving an average AUC of 0.7712 across 8 subsets.
- Group-DINOmics: Incorporating People Dynamics into DINO for Self-supervised Group Activity Feature Learning
  - This paper proposes leveraging DINOv3 with two self-supervised pretraining tasks — individual optical flow estimation and group-relevant object localization — to learn group activity features (GAF), achieving substantial improvements over existing methods without any group activity annotations.
- LaS-Comp: Zero-shot 3D Completion with Latent-Spatial Consistency
  - This paper proposes LaS-Comp, a zero-shot, category-agnostic 3D shape completion framework. It injects known geometry in the spatial domain via an Explicit Replacement Stage (ERS) and optimizes boundary consistency in the latent space via gradient-based updates in an Implicit Alignment Stage (IAS). The framework bridges the gap between the latent space and spatial domain of pretrained 3D foundation models, achieving state-of-the-art performance across diverse partial observation patterns.
- MINE-JEPA: In-Domain Self-Supervised Learning for Mineral Exploration
  - This paper proposes Mine-JEPA, the first in-domain self-supervised learning (SSL) pipeline for side-scan sonar (SSS) mine classification. Built upon SIGReg regularization loss, sonar-adapted augmentation strategies, and ImageNet initialization, Mine-JEPA pretrained on only 1,170 unlabeled sonar images surpasses DINOv3—a foundation model pretrained on 1.7 billion images.
- MOMO: Mars Orbital Model — Foundation Model for Mars Orbital Applications
  - MOMO is the first foundation model for Mars remote sensing. It pre-trains MAE separately on imagery from three Mars sensors (HiRISE/CTX/THEMIS) and proposes an Equal Validation Loss (EVL) checkpoint-selection strategy for model merging, outperforming ImageNet pre-training and Earth-observation foundation models across 9 downstream tasks in Mars-Bench.
- OmniGCD: Abstracting Generalized Category Discovery for Modality Agnosticism
  - This paper proposes OmniGCD, the first modality-agnostic generalized category discovery method. A GCDformer trained on synthetic data transforms the GCD latent space of arbitrary modalities at test time into representations more amenable to clustering, achieving zero-shot GCD across 16 datasets spanning four modalities.
- An Optimal Transport-driven Approach for Cultivating Latent Space in Online Incremental Learning
  - This paper proposes an online mixture model framework driven by optimal transport theory (MMOT), which maintains multiple adaptive centroids per class to capture the multimodal distribution of streaming data. Combined with a dynamic preservation strategy to mitigate catastrophic forgetting, the method substantially outperforms existing approaches in the OCIL setting.
- Re-Depth Anything: Test-Time Depth Refinement via Self-Supervised Re-lighting
  - This paper proposes Re-Depth Anything, which refines depth predictions from Depth Anything V2/3 at inference time through self-supervised optimization: the predicted depth map is augmented via re-lighting, and a 2D diffusion model's SDS loss is used to guide the optimization without any labeled data.
- Representation Learning for Spatiotemporal Physical Systems
  - This paper systematically compares four self-supervised/physics-modeling methods on three PDE-based physical systems (active matter, shear flow, and Rayleigh-Bénard convection), finding that latent-space prediction (JEPA) consistently outperforms pixel-level prediction (VideoMAE) on physical parameter estimation tasks — achieving 28%–51% relative MSE reduction — and that JEPA trained with only 10% of fine-tuning data surpasses VideoMAE trained on 100% of the data. Notably, methods specifically designed for physical modeling are not always the optimal choice.
- Representation Learning for Spatiotemporal Physical Systems
  - This paper systematically benchmarks four learning paradigms — JEPA, VideoMAE, an autoregressive foundation model (MPP), and an operator learning method (DISCO) — across three PDE-based physical systems. It finds that latent-space predictive objectives (JEPA) consistently outperform pixel-level prediction methods on the downstream task of physical parameter estimation, achieving 28–51% relative MSE reduction with greater data efficiency.
- Robustness of Vision Foundation Models to Common Perturbations
  - This paper presents the first systematic study on the robustness of vision foundation models to common perturbations (e.g., JPEG compression, brightness adjustment). It proposes three robustness metrics, formalizes five mathematical properties, finds that foundation models are generally non-robust, and introduces a fine-tuning method that improves robustness without sacrificing utility.
- Shape-of-You: Fused Gromov-Wasserstein Optimal Transport for Semantic Correspondence in-the-Wild
  - This paper reformulates semantic correspondence as a Fused Gromov-Wasserstein (FGW) optimal transport problem, leveraging geometric structural constraints from 3D foundation models to generate globally consistent pseudo labels, thereby addressing the geometric inconsistency caused by the locality and 2D appearance ambiguity inherent in conventional nearest-neighbor matching.
- SpHOR: A Representation Learning Perspective on Open-set Recognition
  - SpHOR proposes a two-stage decoupled training framework: Stage 1 performs OSR-tailored representation learning via orthogonal label embeddings, spherical constraints (vMF distribution), and Mixup/Label Smoothing; Stage 2 freezes the encoder and trains a linear classifier. The method achieves up to 5.1%/5.2% gains in OSCR/AUROC on the Semantic Shift Benchmark, and introduces two new metrics: Angular Separability and Norm Separability.
- SpHOR: A Representation Learning Perspective on Open-set Recognition for Identifying Unknown Classes in Deep Neural Networks
  - This paper proposes SpHOR, a two-stage decoupled training framework for open-set recognition (OSR) that explicitly shapes the feature space via spherical representation learning (vMF distributions), orthogonal label embeddings, and integrated Mixup/Label Smoothing, achieving up to 5.1% OSCR improvement on the Semantic Shift Benchmark.
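The orthogonal-anchor and spherical-constraint ingredients can be sketched generically: QR-based orthonormal class embeddings and kappa-scaled cosine logits are standard constructions, not SpHOR's exact recipe, and the toy features below are synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)

def orthogonal_label_embeddings(n_cls, d):
    """Orthonormal class anchors via QR decomposition (needs d >= n_cls)."""
    Q, _ = np.linalg.qr(rng.normal(size=(d, n_cls)))
    return Q.T                                 # n_cls x d, orthonormal rows

def vmf_logits(feats, anchors, kappa=10.0):
    """Cosine logits scaled by a vMF-style concentration kappa; features
    are projected to the unit sphere so only angular information is used."""
    f = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    return kappa * f @ anchors.T

anchors = orthogonal_label_embeddings(n_cls=4, d=16)
feats = anchors + 0.05 * rng.normal(size=(4, 16))  # one noisy sample per class
pred = vmf_logits(feats, anchors).argmax(1)
```

Because the anchors are mutually orthogonal, unknown-class features tend to land far (in angle) from every anchor, which is the geometric intuition behind angular separability in OSR.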
- Suppressing Non-Semantic Noise in Masked Image Modeling Representations
  - This paper identifies that representations learned by Masked Image Modeling (MIM) retain substantial non-semantic information (e.g., low-level features such as texture and color), and proposes a training-free post-hoc method, SOAP (Semantically Orthogonal Artifact Projection), which leverages PCA to identify and project out non-semantic components, consistently improving zero-shot performance across multiple MIM models.
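A minimal stand-in for the projection step: removing top principal components from a feature matrix. SOAP's actual component selection is more involved than "drop the top-K PCA directions"; the nuisance direction in the toy data below is fabricated.

```python
import numpy as np

def soap_project(Z, n_remove):
    """Remove the top n_remove principal components from features Z
    (n x d): a minimal stand-in for projecting out non-semantic
    directions; the paper's component selection is more involved."""
    mu = Z.mean(0)
    _, _, Vt = np.linalg.svd(Z - mu, full_matrices=False)
    V = Vt[:n_remove].T                    # d x n_remove directions
    return Z - (Z - mu) @ V @ V.T          # project them out

rng = np.random.default_rng(0)
# Features = semantic part + one dominant nuisance direction.
sem = rng.normal(size=(200, 16))
nuisance = (rng.normal(size=(200, 1)) * 20.0) @ rng.normal(size=(1, 16))
Z = sem + nuisance
Z_clean = soap_project(Z, n_remove=1)
```

Since the nuisance dominates the variance, it is exactly what the top principal component captures, and projecting it out leaves the semantic part largely intact.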
- TALO: Pushing 3D Vision Foundation Models Towards Globally Consistent Online Reconstruction
  - This paper proposes TALO, a high-degrees-of-freedom alignment framework based on Thin Plate Spline (TPS), which corrects spatially varying geometric inconsistencies of 3D vision foundation models (3DVFMs) in online reconstruction via globally propagated control points and a point-agnostic submap registration design. TALO is compatible with multiple foundation models and camera configurations, and significantly reduces trajectory error on the Waymo and nuScenes datasets.
- TeFlow: Enabling Multi-frame Supervision for Self-Supervised Feed-forward Scene Flow Estimation
  - This paper proposes TeFlow, the first method to introduce multi-frame supervision into self-supervised feed-forward scene flow estimation. It builds a motion candidate pool via temporal aggregation and selects temporally consistent supervision signals through consensus voting, achieving a Three-way EPE of 3.57 cm on Argoverse 2, on par with the optimization-based method Floxels while running in 8 s instead of 24 min, and a 22.3% improvement over SeFlow++.
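The consensus-voting idea can be sketched as a simple inlier vote over per-point flow candidates; the threshold-based vote below is my simplification of the paper's scheme.

```python
import numpy as np

def consensus_flow(candidates, thresh=0.5):
    """Return the candidate flow with the most support: each candidate
    votes for every other candidate within `thresh` of it, and the
    best-supported one wins (a simple inlier vote; the paper's scheme
    is more elaborate)."""
    C = np.asarray(candidates, dtype=float)
    dists = np.linalg.norm(C[:, None] - C[None], axis=-1)
    votes = (dists < thresh).sum(1)
    return C[int(np.argmax(votes))]

# Three mutually consistent candidates (e.g. from different temporal
# offsets) and one outlier: the vote discards the outlier.
flow = consensus_flow([[1.0, 0.0, 0.0],
                       [1.1, 0.0, 0.0],
                       [0.9, 0.1, 0.0],
                       [5.0, 5.0, 0.0]])
```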
- Text-Phase Synergy Network with Dual Priors for Unsupervised Cross-Domain Image Retrieval
  - This paper proposes TPSNet, which leverages CLIP-learned domain prompts as text priors to provide fine-grained semantic supervision, while introducing phase spectrum features as phase priors to bridge domain distribution gaps and preserve semantic integrity. Significant improvements in unsupervised cross-domain image retrieval (UCDIR) are achieved through the synergistic combination of text-phase dual priors.
- TrackMAE: Video Representation Learning via Track, Mask, and Predict
  - This paper introduces explicit motion signals into the masked video modeling (MVM) framework. Point trajectories extracted via CoTracker3 serve as auxiliary reconstruction targets, complemented by a motion-aware masking strategy. The model jointly learns spatial reconstruction and motion prediction, achieving substantial gains over existing video self-supervised methods on motion-sensitive benchmarks (SSv2, FineGym).
- UniGeoCLIP: Unified Geospatial Contrastive Learning
  - UniGeoCLIP is the first to align five complementary geospatial modalities (aerial imagery, street-view imagery, digital surface models, text, and GPS coordinates) into a unified embedding space via pure contrastive learning, and proposes a multi-scale coordinate encoder to enhance spatial representation capacity.
- Vision Transformers Need More Than Registers
  - This paper argues that dense feature artifacts in ViTs trained under label supervision, text supervision, and self-supervision share a common root cause: rather than a simple high-norm token problem, models learn to exploit background patches as global semantic shortcuts, driven by coarse-grained supervision combined with global attention. The authors accordingly propose LaSt-ViT, which replaces standard CLS aggregation with frequency-domain stability-guided selective aggregation, yielding consistent improvements in localization, segmentation, and open-vocabulary tasks across 12 benchmarks.
- Vision Transformers Need More Than Registers
  - This paper systematically analyzes the artifact phenomenon widely observed in ViTs across fully supervised, text-supervised, and self-supervised paradigms, revealing that the root cause is "lazy aggregation"—ViTs exploit semantically irrelevant background patches as shortcuts to represent global semantics. The authors propose LaSt-ViT (LazyStrike ViT), which anchors the CLS token to foreground regions via frequency-aware selective channel aggregation, consistently eliminating artifacts and improving performance across 12 benchmarks.
- VT-Intrinsic: Physics-Based Decomposition of Reflectance and Shading using a Single Visible-Thermal Image Pair
  - VT-Intrinsic exploits the physical complementarity between visible and thermal infrared images—unreflected light is absorbed as heat—to derive ordinal relationships between visible-thermal intensities that directly correspond to ordinal relationships in reflectance and shading. These ordinal relations serve as dense self-supervised signals to drive neural network optimization, achieving high-quality intrinsic image decomposition without any pre-training data.
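An ordinal constraint of this kind is typically enforced with a hinge (margin ranking) penalty; the loss below is that generic construction, with an arbitrary margin, not the paper's exact objective.

```python
import numpy as np

def ordinal_hinge(pred_hi, pred_lo, margin=0.05):
    """Hinge penalty enforcing pred_hi > pred_lo + margin, i.e. one
    ordinal constraint of the kind a visible/thermal intensity
    comparison would induce on predicted reflectance or shading.
    The margin value here is arbitrary."""
    return np.maximum(0.0, margin - (pred_hi - pred_lo))

ok = ordinal_hinge(0.8, 0.3)    # ordering satisfied: zero loss
bad = ordinal_hinge(0.3, 0.8)   # ordering violated: positive loss
```

Summed over many pixel pairs, such penalties act as the dense self-supervised signal the note describes: no absolute targets, only relative orderings.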
- Zero-Ablation Overstates Register Content Dependence in DINO Vision Transformers
  - Three substitution control experiments (mean substitution, noise substitution, and cross-image shuffling) demonstrate that zero-ablation overstates the dependence on the precise content of register tokens in DINO-series ViTs — the model requires only "reasonable register-like activations" rather than image-specific values.
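The substitution controls are easy to reproduce on a toy model. The sketch below fabricates register activations as a large shared offset plus small per-image noise and uses a tiny tanh readout in place of a real ViT, so mean substitution and cross-image shuffling barely move the output while zero-ablation moves it a lot; everything here is synthetic, illustrating the logic of the experiment rather than the paper's measurements.

```python
import numpy as np

rng = np.random.default_rng(0)

def readout(tokens, W):
    """Toy readout head standing in for the rest of the ViT."""
    return np.tanh(tokens.mean(0) @ W)

B, n_reg, d = 8, 4, 16
W = rng.normal(size=(d, 10)) * 0.3
# Fabricated register activations: a large shared, content-free offset
# plus small image-specific noise.
base = rng.normal(size=(1, n_reg, d)) * 3.0
regs = base + 0.1 * rng.normal(size=(B, n_reg, d))
mean_reg = regs.mean(0)

def drift(substitute):
    """Mean absolute output change when image i's registers are
    replaced by substitute(i)."""
    diffs = [np.abs(readout(regs[i], W) - readout(substitute(i), W)).mean()
             for i in range(B)]
    return float(np.mean(diffs))

d_zero = drift(lambda i: np.zeros((n_reg, d)))   # zero-ablation
d_mean = drift(lambda i: mean_reg)               # mean substitution
d_shuf = drift(lambda i: regs[(i + 1) % B])      # cross-image shuffle
```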