Self-Supervised Contrastive Learning is Approximately Supervised Contrastive Learning

Conference: NeurIPS 2025
arXiv: 2506.04411
Code: Available (project page)
Area: Interpretability
Keywords: Contrastive Learning, Self-Supervised Learning, Supervised Contrastive Loss, Neural Collapse, Few-Shot Learning

TL;DR

This paper theoretically proves that self-supervised contrastive learning (DCL) is approximately equivalent to a supervised contrastive loss (NSCL), with the gap vanishing at rate \(O(1/C)\) as the number of classes increases. It further proves that the global optimum of NSCL satisfies Neural Collapse (augmentation collapse + within-class collapse + Simplex ETF), and proposes a tighter few-shot error bound based on directional CDNV.

Background & Motivation

Background: Self-supervised contrastive learning methods (e.g., SimCLR, MoCo) learn representations from unlabeled data that match or exceed supervised learning on downstream tasks. However, the theoretical foundation remains incomplete — why does a loss function with no access to labels learn representations that cluster semantically by class?

Limitations of Prior Work:

  • Existing theoretical analyses (e.g., Arora et al., Saunshi et al.) rely on strong assumptions such as conditional independence of augmented views given class labels, limiting the generality of their conclusions.
  • Alignment-and-uniformity analyses characterize properties of representations but do not explain why CL organizes class structure.
  • Neural Collapse is well understood for supervised classification losses but has not been connected to self-supervised contrastive learning.

Key Challenge: In self-supervised CL, same-class samples are treated as negatives in the denominator and pushed apart, which should theoretically hurt within-class compactness — yet representations still cluster by class. This apparent contradiction demands explanation.

Goal: (1) Formally connect SSL CL to supervised CL; (2) characterize the geometric structure of representations learned by CL; (3) explain why CL representations support few-shot transfer.

Key Insight: When the number of classes \(C\) is large, the fraction of same-class samples among negatives is small (\(\sim 1/C\)), so the extra terms in the DCL denominator from same-class samples become negligible, making DCL approximately equivalent to NSCL, which excludes same-class negatives.

Core Idea: Self-supervised contrastive learning is essentially optimizing a supervised contrastive loss approximately, with the gap vanishing as the number of classes grows.
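This approximation can be checked numerically. The sketch below is a minimal NumPy toy, not the paper's implementation: temperature is fixed to 1, there is a single augmented view per sample, and the denominator sums only over anchor views. Under those simplifications it computes both losses on random unit-norm features and verifies the Theorem 1 sandwich \(\mathcal{L}^{\text{NSCL}} \leq \mathcal{L}^{\text{DCL}} \leq \mathcal{L}^{\text{NSCL}} + \log(1 + n_{\max} e^2/(N - n_{\max}))\):

```python
import numpy as np

rng = np.random.default_rng(0)

def unit(v):
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

def dcl_nscl(z, z_pos, y):
    """Mean per-anchor DCL and NSCL losses (temperature = 1).

    DCL:  denominator sums exp(sim) over all other samples j != i.
    NSCL: denominator sums only over samples with y_j != y_i.
    """
    sims = z @ z.T                       # pairwise cosine similarities
    np.fill_diagonal(sims, -np.inf)      # drop j = i (exp(-inf) = 0)
    pos = np.sum(z * z_pos, axis=1)      # sim(z_i, z_i^+)
    exp_s = np.exp(sims)
    diff = y[:, None] != y[None, :]      # mask of different-class pairs
    dcl = -pos + np.log(exp_s.sum(axis=1))
    nscl = -pos + np.log((exp_s * diff).sum(axis=1))
    return dcl.mean(), nscl.mean()

# Balanced toy problem: N samples, C classes, unit-norm features.
N, C, d = 200, 10, 32
y = np.arange(N) % C
z = unit(rng.normal(size=(N, d)))
z_pos = unit(z + 0.1 * rng.normal(size=(N, d)))  # stand-in "augmented view"

dcl, nscl = dcl_nscl(z, z_pos, y)
n_max = N // C
gap_bound = np.log(1 + n_max * np.e**2 / (N - n_max))
assert nscl <= dcl <= nscl + gap_bound
```

Holding the per-class count fixed while growing \(C\) shrinks `gap_bound` roughly as \(e^2/(C-1)\), which is the \(O(1/C)\) rate above.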

Method

Overall Architecture

This is a theory-driven work. Three core contributions build on each other: first proving DCL ≈ NSCL (Theorem 1) → then proving the global optimum of NSCL satisfies Neural Collapse (Theorem 2) → then proposing a tighter few-shot error bound (Proposition 1), together providing a complete explanation of why CL works.

Key Designs

  1. DCL–NSCL Duality (Theorem 1):

    • Function: Proves an upper bound on the gap between the DCL and NSCL losses.
    • Mechanism: The DCL denominator sums over all \(j \neq i\), while the NSCL denominator sums only over \(y_j \neq y_i\). The extra same-class terms number at most \(K(n_{\max}-1)\), which is small relative to the total negative count \(K(N - n_{\max})\). Formally: \(\mathcal{L}^{\text{NSCL}} \leq \mathcal{L}^{\text{DCL}} \leq \mathcal{L}^{\text{NSCL}} + \log\!\left(1 + \frac{n_{\max} e^2}{N - n_{\max}}\right)\). For balanced classification, \(\frac{n_{\max}}{N - n_{\max}} = \frac{1}{C-1}\).
    • Design Motivation: The bound is label-agnostic and architecture-agnostic — no assumptions on model architecture or data distribution are required.
    • Key Corollary: As \(C \to \infty\), the DCL and NSCL losses coincide, so in the limit self-supervised contrastive learning optimizes the same objective as its supervised counterpart.
  2. Neural Collapse at the Global Optimum of NSCL (Theorem 2):

    • Function: Characterizes the geometric structure of NSCL global optima under the unconstrained features model.
    • Mechanism: Any global optimum of NSCL simultaneously satisfies three properties:
      • Augmentation collapse: All augmented views of the same sample map to the same point, \(z_i^{l_1} = z_i^{l_2}\).
      • Within-class collapse: All samples of the same class share the same representation, \(z_i = z_j\) if \(y_i = y_j\).
      • Simplex ETF: Class means form an equiangular tight frame, \(\langle \mu_c, \mu_{c'} \rangle = -\frac{\|\mu_c\|^2}{C-1}\).
    • Design Motivation: These properties are identical to the global optima of cross-entropy, MSE, and supervised contrastive losses — NSCL also induces Neural Collapse.
  3. Directional CDNV-based Few-Shot Error Bound (Proposition 1):

    • Function: Proposes a tighter upper bound on few-shot classification error than existing CDNV-based bounds.
    • Mechanism: Introduces the directional CDNV \(\tilde{V}_f\), which measures variance only along the direction connecting class means (rather than total variance). The new bound is \(\text{err}^{\text{NCC}}_{m,D}(f) \leq (C'-1)\!\left[8\tilde{V}_f + \frac{8}{\sqrt{m}}V_f^s + \frac{8}{\sqrt{m}}V_f + \frac{4}{m}V_f\right]\).
    • Key Advantage: Since \(\tilde{V}_f \leq V_f\), and for isotropic distributions \(\tilde{V}_f = \frac{1}{d} V_f\), the directional CDNV can be small even when the full CDNV is large — explaining why SSL representations transfer effectively even when full CDNV appears unfavorable.
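The Simplex ETF condition in Theorem 2 can be verified on an explicit construction. The snippet below is an illustrative check, not from the paper's code: it builds the standard \(C\)-point simplex ETF and confirms that all pairwise inner products equal \(-\|\mu_c\|^2/(C-1)\):

```python
import numpy as np

C = 10
# Rows of M: class means of the standard Simplex ETF in R^C
# (unit norm, equally separated, summing to zero).
M = np.sqrt(C / (C - 1)) * (np.eye(C) - np.ones((C, C)) / C)

G = M @ M.T                              # Gram matrix of class means
assert np.allclose(np.diag(G), 1.0)      # equal norms: ||mu_c|| = 1
off_diag = G[~np.eye(C, dtype=bool)]
# ETF condition from Theorem 2: <mu_c, mu_c'> = -||mu_c||^2 / (C - 1)
assert np.allclose(off_diag, -1.0 / (C - 1))
```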

Key Experimental Results

Main Results

| Dataset | Method | NCC 100-shot Acc | LP 100-shot Acc |
| --- | --- | --- | --- |
| CIFAR-10 | DCL | 85.3% | 86.3% |
| CIFAR-10 | NSCL | 95.7% | 95.6% |
| CIFAR-100 | DCL | 57.3% | 61.7% |
| CIFAR-100 | NSCL | 70.8% | 73.7% |
| mini-ImageNet | DCL | 69.0% | 72.9% |
| mini-ImageNet | NSCL | 79.8% | 81.3% |
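The NCC columns refer to nearest-class-center evaluation: class means are estimated from the \(m\) labeled shots, and each test point is assigned to the nearest mean. A minimal sketch of this protocol on synthetic features (the Gaussian clusters are purely illustrative, not the paper's data):

```python
import numpy as np

rng = np.random.default_rng(1)

def ncc_accuracy(shots_z, shots_y, test_z, test_y):
    """Nearest-class-center classifier: estimate each class mean from
    the labeled shots, then assign test points to the closest mean."""
    classes = np.unique(shots_y)
    means = np.stack([shots_z[shots_y == c].mean(axis=0) for c in classes])
    d2 = ((test_z[:, None, :] - means[None, :, :]) ** 2).sum(axis=-1)
    preds = classes[np.argmin(d2, axis=1)]
    return float((preds == test_y).mean())

# Two well-separated synthetic "feature" clusters, m = 100 shots each.
m, d = 100, 16
shots_z = np.concatenate([rng.normal(0, 1, (m, d)), rng.normal(3, 1, (m, d))])
shots_y = np.repeat([0, 1], m)
test_z = np.concatenate([rng.normal(0, 1, (500, d)), rng.normal(3, 1, (500, d))])
test_y = np.repeat([0, 1], 500)

acc = ncc_accuracy(shots_z, shots_y, test_z, test_y)
assert acc > 0.95
```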

Ablation Study

| Validation | Result |
| --- | --- |
| DCL–NSCL loss gap vs. \(C\) | Decays as \(O(1/C)\), consistent with theory |
| Correlation between the two losses | Highly correlated throughout training (correlation > 0.99) |
| Does minimizing DCL also minimize NSCL? | Yes; NSCL values are close to those from directly optimizing NSCL |
| Is the directional CDNV bound tighter? | Yes; the influence of full CDNV diminishes as \(m\) increases |
| Tightness of the error bound | The optimized Cor. 1 bound closely tracks the actual error in practice |

Key Findings

  • DCL essentially optimizes NSCL: Even without label access, the NSCL value achieved during DCL training is close to that of directly optimizing NSCL.
  • More classes → DCL ≈ NSCL is more accurate: The loss gap on CIFAR-100 (\(C=100\)) is an order of magnitude smaller than on CIFAR-10 (\(C=10\)).
  • Directional CDNV dominates few-shot performance: As the number of labeled shots increases, the influence of full CDNV diminishes, while directional CDNV remains the primary factor.
  • Results hold for ViT and MoCo: Conclusions are not limited to ResNet/SimCLR architectures.
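The gap between full and directional CDNV is easy to reproduce on synthetic isotropic clusters, where the directional CDNV is smaller by a factor of \(d\). The sketch below is illustrative only; the toy means and the `dir_var` helper are assumptions, not the paper's code:

```python
import numpy as np

rng = np.random.default_rng(2)
d, n = 64, 20000

# Two isotropic Gaussian "classes" with means 3 units apart.
mu_a = np.zeros(d)
mu_b = 3.0 * np.eye(d)[0]
Za = mu_a + rng.normal(size=(n, d))
Zb = mu_b + rng.normal(size=(n, d))

u = (mu_b - mu_a) / np.linalg.norm(mu_b - mu_a)   # inter-mean direction
dist2 = np.sum((mu_b - mu_a) ** 2)

def total_var(Z, mu):                 # full within-class variance
    return np.mean(np.sum((Z - mu) ** 2, axis=1))

def dir_var(Z, mu):                   # variance along u only
    return np.mean(((Z - mu) @ u) ** 2)

V = (total_var(Za, mu_a) + total_var(Zb, mu_b)) / (2 * dist2)      # full CDNV
V_dir = (dir_var(Za, mu_a) + dir_var(Zb, mu_b)) / (2 * dist2)      # directional

assert V_dir <= V
assert np.isclose(V_dir, V / d, rtol=0.1)  # isotropic case: V_dir ~ V / d
```

Here the full CDNV is large (the clusters are high-variance), yet the directional CDNV is small, which is exactly the regime where the paper's Proposition 1 bound is much tighter than the full-CDNV bound.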

Highlights & Insights

  • Elegant theoretical connection: A single concise inequality (Theorem 1) formally connects SSL and supervised learning — two seemingly disparate paradigms — without requiring any assumption on model architecture or data distribution, which no prior theoretical work achieved.
  • Neural Collapse extended to SSL: Previously, Neural Collapse was known only for the optima of CE/MSE/SCL. This paper proves it also holds for NSCL, thereby indirectly explaining the semantic clustering behavior of SSL via DCL ≈ NSCL.
  • Introduction of directional CDNV: This metric is more fine-grained than full CDNV and explains the common yet puzzling phenomenon where representations have large total variance but still achieve strong few-shot performance.

Limitations & Future Work

  • Limitation of the unconstrained features model: Theorem 2 holds under the unconstrained features model; actual neural network parameterization may result in incomplete collapse.
  • Suboptimal constants in the bound: The coefficients (8, 8, 4) in Proposition 1 are not optimal; Cor. 1 provides an improved version but is still not the tightest possible.
  • Optimization dynamics not analyzed: The paper characterizes global optima but does not analyze whether or how quickly SGD converges to these structures.
  • DCL cannot achieve the NSCL optimum: Since DCL is label-agnostic, it cannot achieve perfect within-class collapse (which requires label information) — how the residual gap affects practical performance warrants further investigation.
Comparison with Prior Work

  • vs. Arora et al. (2019): Their contrastive-learning theory assumes augmented views are conditionally independent given the class label; the bound in this paper requires no such assumption and is architecture- and label-agnostic, making it more general.
  • vs. Neural Collapse literature: This paper is the first to extend NC properties to unsupervised/self-supervised loss functions.
  • vs. Alignment & Uniformity analysis: The latter characterizes low-order statistics of representations; this paper reveals deeper class-oriented distributional structure.

Rating

  • Novelty: ⭐⭐⭐⭐⭐ First formal connection between SSL CL and supervised CL; strong theoretical contribution.
  • Experimental Thoroughness: ⭐⭐⭐⭐ Validated across multiple datasets and architectures; theoretical predictions align well with experiments.