🔄 Self-Supervised Learning
📹 ICCV2025 · 11 paper notes
- A Token-level Text Image Foundation Model for Document Understanding (TokenFD/TokenVL)
  - This paper proposes TokenFD, the first token-level text image foundation model, pre-trained on 20 million images and 1.8 billion BPE token-mask pairs via token-level vision-language alignment to achieve image-as-text semantic understanding. Built upon TokenFD, TokenVL is introduced as a document understanding MLLM, achieving a score of 860 on OCRBench (highest among 8B-class models) and an average improvement of 8.8% across ten VQA benchmarks including DocVQA.
- Always Skip Attention
  - This paper theoretically demonstrates that the self-attention mechanism in Vision Transformers is inherently ill-conditioned, leading to training collapse in the absence of skip connections. It further proposes Token Graying (TG), a method that improves the condition number of input tokens to enhance ViT training stability and performance.
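The conditioning problem is easy to make concrete. A toy sketch in the spirit of TG: flattening the singular-value spectrum of a token matrix provably lowers its condition number (the exponent-based reconstruction below is an illustrative assumption, not the paper's exact procedure).

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy token matrix: 16 tokens, 8 dims, deliberately ill-conditioned
X = rng.normal(size=(16, 8)) @ np.diag([100, 50, 10, 5, 1, 0.5, 0.1, 0.01])

def token_graying(X, gamma=0.5):
    """Flatten the singular-value spectrum of the token matrix.
    Hypothetical sketch of spectrum-based conditioning, NOT the paper's method."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U @ np.diag(s ** gamma) @ Vt  # large singular values shrink the most

print(np.linalg.cond(X))                  # large: ill-conditioned tokens
print(np.linalg.cond(token_graying(X)))   # square root of the original for gamma=0.5
```

With `gamma=0.5` the condition number drops to exactly its square root, since every singular value is replaced by its square root while the left/right singular vectors are kept.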
- CObL: Toward Zero-Shot Ordinal Layering without User Prompting
  - This paper presents CObL, an architecture based on multiple frozen Stable Diffusion UNets operating in parallel, capable of inferring an occlusion-ordered object layer representation (one amodally-completed object per layer) from a single image without any user prompts or prior knowledge of object count. Trained on only a few thousand synthetic tabletop scenes, CObL generalizes zero-shot to real-world photographs.
- From Linearity to Non-Linearity: How Masked Autoencoders Capture Spatial Correlations
  - This paper theoretically analyzes how MAE learns spatial correlations in images. It derives a closed-form solution for linear MAE, reveals how masking ratio and patch size select short- or long-range spatial features, and extends the analysis to nonlinear MAE, providing theoretical guidance for hyperparameter selection in practice.
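The two hyperparameters the analysis studies are easy to picture in code. A minimal sketch of MAE-style random patch masking, assuming the standard ViT-B geometry (224-pixel images, 16-pixel patches, 75% masking); larger patches and higher ratios force reconstruction to lean on longer-range correlations:

```python
import numpy as np

rng = np.random.default_rng(0)

def random_mask(num_patches, mask_ratio):
    """Boolean mask hiding ~mask_ratio of the patches (MAE-style)."""
    n_mask = int(num_patches * mask_ratio)
    idx = rng.permutation(num_patches)[:n_mask]
    mask = np.zeros(num_patches, dtype=bool)
    mask[idx] = True
    return mask

# 14x14 patch grid (224px image / 16px patches), standard 75% masking ratio
mask = random_mask(14 * 14, 0.75)
print(mask.sum())  # 147 patches hidden, 49 left visible
```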
- Improving Large Vision and Language Models by Learning from a Panel of Peers
  - This paper proposes the Panel-of-Peers (PoP) learning framework, in which multiple LVLMs of comparable capability generate candidate responses and score one another's outputs to construct preference data. Combined with iterative self-improvement via SimPO, PoP raises the average score across 15 benchmarks from 48% to 57% without any human-annotated data.
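The preference-construction step can be sketched in a few lines: each peer scores every candidate, and the best- and worst-rated candidates become a (chosen, rejected) pair for SimPO-style training. Everything below is a hypothetical stand-in — the scorers here are trivial functions, not LVLMs.

```python
def build_preference_pair(candidates, peer_scorers):
    """Average each candidate's score across the peer panel and return the
    top- and bottom-ranked candidates as a (chosen, rejected) preference pair.
    Hypothetical sketch of the PoP data-construction step."""
    avg = [sum(score(c) for score in peer_scorers) / len(peer_scorers)
           for c in candidates]
    ranked = sorted(zip(avg, candidates), key=lambda x: x[0])
    return ranked[-1][1], ranked[0][1]

# Trivial stand-in scorers (real peers would be LVLMs judging responses)
peers = [len, lambda s: s.count("e"), lambda s: 1.0]
chosen, rejected = build_preference_pair(
    ["short", "a longer answer", "mid one"], peers)
print(chosen, rejected)  # a longer answer short
```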
- LoftUp: Learning a Coordinate-Based Feature Upsampler for Vision Foundation Models
  - LoftUp maps low-resolution VFM features to arbitrary high resolutions through a coordinate-based cross-attention architecture, using class-agnostic mask refinement and self-distillation to construct full-resolution pseudo-ground-truth for training. It achieves average improvements of 10–20% across six downstream tasks and nearly 50% on video object segmentation.
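The core idea — querying low-resolution features with high-resolution pixel coordinates — fits in a single numpy cross-attention step. All sizes and the linear coordinate projection below are illustrative assumptions, not LoftUp's actual architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

def cross_attention(Q, K, V):
    """Single-head scaled dot-product cross-attention (numpy sketch)."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V

# Low-res feature map: a 4x4 grid of 32-dim VFM features (toy sizes)
feats = rng.normal(size=(16, 32))
# High-res queries: normalized (x, y) coordinates of an 8x8 target grid,
# linearly projected to the feature dimension (projection is hypothetical)
coords = np.stack(np.meshgrid(np.linspace(0, 1, 8),
                              np.linspace(0, 1, 8)), -1).reshape(-1, 2)
W_q = rng.normal(size=(2, 32))
upsampled = cross_attention(coords @ W_q, feats, feats)
print(upsampled.shape)  # (64, 32): one feature vector per high-res location
```

Because the queries are coordinates rather than a fixed grid, the same module can target any output resolution.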
- Manual-PA: Learning 3D Part Assembly from Instruction Diagrams
  - This paper proposes Manual-PA, a Transformer-based instruction-guided 3D part assembly framework that infers assembly order by aligning 3D parts with instruction step diagrams via contrastive learning, then uses the learned order as soft guidance through positional encoding for 6DoF pose prediction, significantly outperforming existing methods on PartNet.
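The part-diagram alignment can be illustrated with a generic InfoNCE-style objective — a standard contrastive loss used here as a stand-in (Manual-PA's exact loss may differ) — where matching part/step pairs share an index along the diagonal:

```python
import numpy as np

def info_nce(part_emb, step_emb, tau=0.1):
    """Generic InfoNCE loss between part embeddings and diagram-step
    embeddings; row i of each matrix is assumed to be a matching pair."""
    part = part_emb / np.linalg.norm(part_emb, axis=1, keepdims=True)
    step = step_emb / np.linalg.norm(step_emb, axis=1, keepdims=True)
    logits = part @ step.T / tau
    logits -= logits.max(axis=1, keepdims=True)          # numerical stability
    log_p = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_p))                      # pull diagonal pairs together

rng = np.random.default_rng(0)
x = rng.normal(size=(6, 16))
loss_aligned = info_nce(x, x)                     # identical embeddings: low loss
loss_random = info_nce(x, rng.normal(size=(6, 16)))
print(loss_aligned, loss_random)  # aligned loss is much smaller
```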
- MoSiC: Optimal-Transport Motion Trajectory for Dense Self-Supervised Learning
  - MoSiC extracts long-range motion trajectories via an offline point tracker and propagates cluster assignments along the temporal dimension through an Optimal Transport (Sinkhorn-Knopp)-based clustering mechanism. This enables learning spatially and temporally consistent dense representations from video data, improving DINOv2 by 1%–6% across multiple image and video benchmarks using only video for training.
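The Sinkhorn-Knopp step itself is compact: alternately normalize the rows and columns of an exponentiated score matrix so that every sample gets a soft cluster assignment while cluster usage stays balanced. A toy-sized sketch (the trajectory-level machinery of MoSiC is omitted):

```python
import numpy as np

def sinkhorn(scores, n_iters=100, eps=0.5):
    """Sinkhorn-Knopp projection of similarity scores onto a (near) doubly
    stochastic matrix: soft assignments with balanced cluster usage."""
    Q = np.exp((scores - scores.max()) / eps)
    r = np.full(Q.shape[0], 1.0 / Q.shape[0])   # uniform mass over samples
    c = np.full(Q.shape[1], 1.0 / Q.shape[1])   # uniform mass over clusters
    for _ in range(n_iters):
        Q *= (r / Q.sum(axis=1))[:, None]       # match row marginals
        Q *= (c / Q.sum(axis=0))[None, :]       # match column marginals
    return Q / Q.sum(axis=1, keepdims=True)     # one distribution per sample

rng = np.random.default_rng(0)
A = sinkhorn(rng.normal(size=(8, 4)))
print(A.sum(axis=1))  # each row sums to 1: a soft cluster assignment
print(A.sum(axis=0))  # each column ≈ 2: 8 samples spread evenly over 4 clusters
```

The balancing constraint is what prevents the degenerate solution of every sample collapsing into one cluster.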
- Scaling Language-Free Visual Representation Learning
  - By training DINOv2/MAE-series models (1B–7B parameters) on MetaCLIP's 2 billion web images, this work systematically demonstrates that purely visual self-supervised learning (SSL) exhibits superior scaling behavior compared to CLIP in both model and data dimensions, surpassing CLIP on average VQA performance at 5B+ parameters—including OCR/Chart tasks conventionally assumed to require language supervision.
- To Label or Not to Label: PALM – A Predictive Model for Evaluating Sample Efficiency in Active Learning Models
  - This paper proposes PALM — a unified mathematical model that characterizes active learning trajectories using four interpretable parameters (maximum accuracy \(A_{\max}\), coverage efficiency \(\delta\), initial learning offset \(\alpha\), and scalability \(\beta\)). The model predicts complete learning curves from limited labeled data, enabling quantitative and fair comparison of active learning strategies.
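As an illustration only — the paper defines its own functional form — a four-parameter saturating curve over the labeled-data fraction \(n\) shows how parameters like these can describe a full learning trajectory. Both `palm_curve` and its equation are hypothetical, not PALM's formula:

```python
import numpy as np

def palm_curve(n, A_max, delta, alpha, beta):
    """Hypothetical saturating learning curve with PALM-style parameters:
    accuracy rises once n exceeds the offset alpha and approaches A_max;
    delta and beta shape how fast labels translate into accuracy.
    NOT the functional form from the paper."""
    return A_max * (1.0 - np.exp(-delta * np.maximum(n - alpha, 0.0))) ** beta

n = np.linspace(0.0, 1.0, 6)            # fraction of the pool that is labeled
acc = palm_curve(n, A_max=0.9, delta=8.0, alpha=0.02, beta=1.2)
print(acc)  # monotonically increasing, bounded above by A_max
```

Fitting such a curve to a few early measurements is what lets the predicted trajectory stand in for running each strategy to completion.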
- WIR3D: Visually-Informed and Geometry-Aware 3D Shape Abstraction
  - WIR3D optimizes a set of 3D Bézier curve parameters under the spatial guidance of CLIP intermediate-layer activations to faithfully represent the geometric structure and visually salient features (including texture) of 3D shapes from arbitrary viewpoints, achieving sparse yet semantically rich 3D shape abstraction.
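The primitive being optimized is just a parametric curve; evaluating one cubic 3D Bézier segment takes a few lines (the control points below are arbitrary examples, and the CLIP-guided optimization loop is omitted):

```python
import numpy as np

def cubic_bezier(P, t):
    """Evaluate a cubic Bézier curve in Bernstein form.
    P: (4, 3) control points; returns (len(t), 3) curve points."""
    t = np.asarray(t, dtype=float)
    B = np.column_stack([(1 - t)**3,          # Bernstein basis polynomials
                         3 * t * (1 - t)**2,
                         3 * t**2 * (1 - t),
                         t**3])
    return B @ P

P = np.array([[0, 0, 0], [1, 0, 1], [2, 1, 1], [3, 1, 0]], dtype=float)
pts = cubic_bezier(P, np.linspace(0, 1, 5))
print(pts[0], pts[-1])  # endpoints coincide with the first and last control points
```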