Exploring Structural Degradation in Dense Representations for Self-supervised Learning

Conference: NeurIPS 2025 arXiv: 2510.17299 Code: GitHub Area: Image Segmentation Keywords: Self-supervised Learning, Dense Representations, Performance Degradation, Model Selection, Regularization

TL;DR

This paper identifies and systematically investigates the Self-supervised Dense Degradation (SDD) phenomenon — where longer training improves classification yet hurts dense task performance — and proposes the DSE metric along with DSE-guided model selection and regularization strategies, achieving an average mIoU improvement of 3.0%.

Background & Motivation

Self-supervised learning (SSL) has achieved remarkable success in image-level representation learning, but improvements in dense (patch/pixel-level) representation learning remain limited. This paper uncovers a counterintuitive phenomenon:

Self-supervised Dense Degradation (SDD): Although training loss converges and classification performance steadily improves, performance on dense tasks such as semantic segmentation degrades in the later stages of training.

Universality: SDD appears consistently across 16 state-of-the-art SSL methods, spanning contrastive learning (MoCo v3, DenseCL), non-contrastive learning (BYOL, SimSiam, DINO), clustering-based methods (SwAV), and masked modeling approaches (MAE, I-JEPA).

Not Overfitting: SDD persists even when training and evaluation are conducted on the same dataset (COCO) — e.g., DINO exhibits a 4.0% mIoU drop.

Existing Metrics Fail: Metrics such as α-REQ, RankMe, and Lidar are primarily designed for image-level tasks and exhibit negative correlation with dense task performance.

Method

Overall Architecture

Grounded in error rate decomposition theory, the paper proposes the Dense representation Structure Estimator (DSE), which combines a class separability measure and an effective dimensionality measure:

\[\text{DSE} = M_{inter} - M_{intra} + \lambda \cdot M_{dim}\]

Key Designs

Theoretical Foundation:

  • Theorem 2 (Class-related measure): Proves that when the intra-class radius (estimated via the trace of the normalized representation matrix) is smaller than the inter-class distance, a simple nearest-neighbor classifier suffices for correct classification.
  • Corollary 5 (Dimensionality effect): Proves that the downstream error rate decays exponentially with representation dimension \(d\): \(\text{Err} \leq \delta + 2K\exp(-\tilde{C}_\delta \cdot d)\).

Class Separability Measure:

  • Pseudo-labels are generated via k-means clustering.
  • Intra-class radius: \(M_{intra} = \frac{1}{k}\sum_{j=1}^{k} \frac{\sum_{i=1}^{\min(\tilde{N}_j, d)} \sigma_i(\tilde{Z}_c^j)}{(\tilde{N}_j - 1)}\)
  • Inter-class distance: \(M_{inter} = \frac{1}{k}\sum_{j=1}^{k} \frac{1}{N_j}\sum_{z \in \tilde{Z}_j} \min_{i \neq j} \|z - \tilde{\mu}_i\|^2\)
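As a toy illustration (not the authors' implementation), the two separability terms can be sketched on pre-assigned pseudo-labels. The intra-class radius here is simplified to a mean distance to the class centroid; the paper instead uses singular values of the centered per-class representation matrix:

```python
import math

# Toy dense representations with k-means-style pseudo-labels (assumed given).
clusters = {
    0: [(0.0, 0.0), (0.2, 0.1), (-0.1, 0.2)],
    1: [(3.0, 3.0), (3.1, 2.8), (2.9, 3.2)],
}

def centroid(points):
    return tuple(sum(p[d] for p in points) / len(points)
                 for d in range(len(points[0])))

centroids = {j: centroid(pts) for j, pts in clusters.items()}

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def m_inter(clusters, centroids):
    # Mean squared distance of each point to the nearest *other* centroid,
    # averaged over clusters.
    vals = []
    for j, pts in clusters.items():
        others = [c for i, c in centroids.items() if i != j]
        vals.append(sum(min(sq_dist(z, c) for c in others) for z in pts) / len(pts))
    return sum(vals) / len(vals)

def m_intra(clusters, centroids):
    # Simplified proxy: mean distance of points to their own centroid.
    vals = []
    for j, pts in clusters.items():
        vals.append(sum(math.sqrt(sq_dist(z, centroids[j])) for z in pts) / len(pts))
    return sum(vals) / len(vals)

# On well-separated toy data the inter-class term dominates the radius.
print(m_inter(clusters, centroids), m_intra(clusters, centroids))
```

Per Theorem 2, the regime of interest is precisely \(M_{intra} < M_{inter}\), which the toy data satisfies.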

Effective Dimensionality Measure: Randomly samples \(B'\) independent dense representations and computes the effective rank \(M_{dim} = \text{Erank}(\bar{Z}) = \exp(-\sum_i p_i \log p_i)\), where \(p_i\) are the singular values of \(\bar{Z}\) normalized to sum to one.
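The effective rank is the exponentiated entropy of the normalized singular-value spectrum. A minimal sketch, taking the singular values as given:

```python
import math

def effective_rank(singular_values):
    """Erank = exp(entropy of the normalized singular-value distribution)."""
    total = sum(singular_values)
    p = [s / total for s in singular_values if s > 0]
    return math.exp(-sum(pi * math.log(pi) for pi in p))

# A flat spectrum uses all directions equally -> Erank equals the count.
print(effective_rank([1.0, 1.0, 1.0, 1.0]))   # 4.0
# A collapsed spectrum concentrates mass in one direction -> Erank near 1.
print(effective_rank([10.0, 0.01, 0.01, 0.01]))
```

This is why Erank is a natural probe for dimensional collapse: it falls toward 1 as the representation spectrum concentrates.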

Adaptive Scaling: \(\lambda = \text{Std}(M_{inter} - M_{intra}) / \text{Std}(M_{dim})\)
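Given per-checkpoint component values, combining them with the adaptive scale is straightforward. A sketch with hypothetical numbers; the choice of population standard deviation is our assumption:

```python
from statistics import pstdev

# Hypothetical per-checkpoint component values along a training run.
m_inter = [5.0, 5.5, 5.2, 4.8]
m_intra = [1.0, 0.9, 1.1, 1.4]
m_dim   = [120.0, 150.0, 140.0, 90.0]

sep = [a - b for a, b in zip(m_inter, m_intra)]

# Adaptive lambda puts the separability and dimensionality terms
# on a comparable scale across checkpoints.
lam = pstdev(sep) / pstdev(m_dim)

dse = [s + lam * d for s, d in zip(sep, m_dim)]
best = max(range(len(dse)), key=dse.__getitem__)
print(lam, best)
```

Without the scaling, \(M_{dim}\) (which can be in the hundreds) would drown out the separability term.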

Loss & Training

DSE-Guided Model Selection (offline):

  1. Compute DSE for all checkpoints.
  2. Identify local maxima of DSE as candidates.
  3. Select the top-3 checkpoints with the highest DSE values.
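The selection procedure reduces to a peak-finding pass over the DSE trajectory. A sketch (function name and tie-breaking are ours):

```python
def select_checkpoints(dse_values, top=3):
    """Keep local maxima of the DSE curve, then pick the `top` highest."""
    n = len(dse_values)
    candidates = [
        i for i in range(n)
        if (i == 0 or dse_values[i] >= dse_values[i - 1])
        and (i == n - 1 or dse_values[i] >= dse_values[i + 1])
    ]
    return sorted(candidates, key=lambda i: dse_values[i], reverse=True)[:top]

# Hypothetical DSE trajectory over training checkpoints.
dse = [0.2, 0.5, 0.4, 0.7, 0.65, 0.9, 0.3]
print(select_checkpoints(dse))  # [5, 3, 1]
```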

DSE Regularization (online): \(\mathcal{L} = \mathcal{L}_{original} - \beta \cdot \text{DSE}\), where \(\lambda = 1\) and \(\beta = 0.001\); training is resumed for 10 epochs from the checkpoint with the best initial performance.
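In code the regularized objective is a one-liner; subtracting \(\beta \cdot \text{DSE}\) rewards checkpoints with higher separability and effective dimensionality:

```python
def regularized_loss(original_loss, dse, beta=0.001):
    """L = L_original - beta * DSE (DSE is maximized, so it enters negatively)."""
    return original_loss - beta * dse

# With beta = 0.001, a DSE of 100 shaves 0.1 off the loss.
print(regularized_loss(1.0, 100.0))  # 0.9
```

In practice DSE must be computed on dense features of the current batch so that gradients flow through it; that wiring is method-specific, as the limitations section notes.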

Key Experimental Results

Main Results

SDD phenomenon across 16 SSL methods (Best vs. Last mIoU gap on COCO-Stuff / PASCAL VOC / ADE20k / Cityscapes):

Method COCO Diff VOC Diff ADE20k Diff Cityscapes Diff
MoCo v3 -22.0 -45.2 -14.4 -11.5
DINO -4.4 -11.3 -4.2 -0.1
iBOT -2.5 -3.0 -3.7 -3.2
I-JEPA -5.6 -7.6 -4.5 -3.9
BYOL -6.4 -6.7 -7.9 -7.5
MAE -0.4 -1.3 -0.7 -2.1

DSE model selection results (+MS denotes improvement after model selection):

Method COCO mIoU VOC mIoU
MoCo v3 15.1 → 30.9 (+15.8) 5.9 → 42.0 (+36.1)
BYOL 30.7 → 37.1 (+6.4) 45.4 → 51.1 (+5.7)
I-JEPA 34.0 → 39.6 (+5.6) 52.6 → 59.3 (+6.7)
EsViT 33.4 → 41.6 (+8.2) 54.3 → 59.8 (+5.5)

Ablation Study

DSE vs. other metrics (average Kendall's τ):

Metric COCO VOC ADE20k City Avg.
α-ReQ -0.07 -0.05 -0.05 0.09 -0.02
RankMe -0.10 -0.09 -0.14 0.00 -0.08
Lidar -0.37 -0.36 -0.26 -0.21 -0.30
RankMe† (dense-adapted) 0.25 0.26 0.22 0.23 0.24
DSE (Ours) 0.58 0.60 0.56 0.49 0.57
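The Kendall's τ values above measure how well each metric's ranking of checkpoints agrees with the ranking by actual mIoU. A minimal pure-Python version (ignoring tie corrections) with hypothetical numbers:

```python
from itertools import combinations

def kendall_tau(x, y):
    """Rank correlation: (concordant - discordant) pairs over all pairs."""
    pairs = list(combinations(range(len(x)), 2))
    c = sum(1 for i, j in pairs if (x[i] - x[j]) * (y[i] - y[j]) > 0)
    d = sum(1 for i, j in pairs if (x[i] - x[j]) * (y[i] - y[j]) < 0)
    return (c - d) / len(pairs)

# Hypothetical metric values per checkpoint vs. measured mIoU.
metric = [0.1, 0.4, 0.35, 0.8]
miou   = [20.0, 31.0, 30.0, 35.0]
print(kendall_tau(metric, miou))  # 1.0 (perfectly concordant here)
```

A τ near 0.57, as DSE achieves, means most checkpoint pairs are ranked consistently; the negative values for α-ReQ and Lidar mean those metrics tend to rank dense performance backwards.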

DSE component ablation (average Kendall's τ):

Class Separability Eff. Dimensionality COCO VOC ADE20k City Avg.
✓ ✗ 0.45 0.42 0.33 0.37 0.39
✗ ✓ 0.25 0.26 0.22 0.23 0.24
✓ ✓ 0.58 0.60 0.56 0.49 0.57

Efficiency comparison:

Method Avg. Improvement Compute Cost (GPU·h)
Loss-based -1.0 0.0
Supervised +3.6 2.43
DSE (Ours) +3.0 0.025 (~97× speedup)

Key Findings

  1. Degradation causes vary by method: In MoCo v3, SDD stems from dimensional collapse; in DINO, it stems from declining class separability.
  2. DSE regularization reverses degradation: On iBOT and I-JEPA, adding DSE regularization halts the performance decline.
  3. DSE generalizes to image-level tasks: On ImageNet k-NN evaluation, the average Kendall's τ reaches 0.86, outperforming RankMe (0.79).
  4. Minimal data suffices: Only 2,048 images (~0.16% of training data) are needed to reliably compute DSE.

Highlights & Insights

  1. Breadth of the discovered phenomenon: SDD is systematically validated across 16 methods × 4 datasets × multiple evaluation protocols, constituting a community-level finding of significant importance.
  2. Elegant unification of theory and practice: Starting from error rate decomposition, the paper derives class separability and dimensionality as the two governing factors, then designs a directly optimizable DSE metric.
  3. High practical utility: DSE-guided model selection requires only 0.025 GPU·h while yielding an average mIoU gain of 3.0%.
  4. Insight from Proposition 1: Reveals the fundamental issue that instance-level distance measures using k-means pseudo-labels always predict 100% accuracy, motivating the design of a class-level radius measure instead.

Limitations & Future Work

  1. Theoretical analysis is primarily scoped to the linear probing setting, without fully accounting for distribution shift in transfer learning.
  2. DSE exhibits relatively weaker predictive power on regression tasks such as depth estimation (lower Kendall's τ).
  3. Online DSE regularization requires method-specific adaptation (e.g., the manner in which dense representations are extracted from the student model).
  4. The specific mechanisms underlying dimensional collapse or separability degradation in individual methods remain underexplored.
  • DINOv3 (concurrent work) investigates degradation from the perspective of the iBOT/DINOv2 family and addresses it via Gram matrix distillation, complementing the present work.
  • The dense-adapted RankMe† is essentially equivalent to the proposed \(M_{dim}\) component, but on its own it fails to capture degradation in class separability.
  • The findings offer a new perspective for SSL training strategy design: an appropriate trade-off between class separability and dimensional collapse should be pursued.

Rating

  • Novelty: ⭐⭐⭐⭐⭐ — Identifies an important phenomenon insufficiently recognized by the community.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ — Systematic validation across 16 methods × 4 datasets.
  • Value: ⭐⭐⭐⭐⭐ — A near-zero-cost model selection strategy with consistent gains.
  • Writing Quality: ⭐⭐⭐⭐⭐ — The logical chain from phenomenon to theory to method is exceptionally clear.