ReSA: Clustering Properties of Self-Supervised Learning¶

Conference: ICML 2025
arXiv: 2501.18452
Code: None
Area: Self-Supervised Learning
Keywords: self-supervised learning, clustering properties, ReSA, positive feedback, Sinkhorn-Knopp

TL;DR¶

This work systematically analyzes the clustering properties of various components in JEA-based SSL. It discovers that the encoding possesses superior and more stable clustering capabilities compared to the embedding and hidden layers of the projector. Based on this, ReSA (Representation Self-Assignment) is proposed to utilize encoding clustering information to guide embedding learning, forming a positive-feedback SSL framework that significantly outperforms SOTA on multiple standard benchmarks.

Background & Motivation¶

Background¶

Background: Self-Supervised Learning (SSL) learns semantically rich representations without labels through Joint Embedding Architectures (JEAs), and on some visual representation learning tasks it outperforms supervised learning. A JEA consists of a shared encoder \(E_{\theta_e}\) and a projector \(G_{\theta_g}\), which output the encoding \(H\) and the embedding \(Z\), respectively. Existing studies (Ben-Shaul et al., 2023) find that SSL representations exhibit hierarchical clustering properties: training strengthens clustering structure at three levels (sample, semantic class, and superclass).

Limitations of Prior Work¶

Limitations of Prior Work: (1) Although SSL representations are known to possess clustering properties, scarcely any methods use these properties to improve SSL itself, so rich clustering information goes unused; (2) It remains unclear whether the encoding or the embedding has better clustering properties, and the optimization dynamics and information flow of the projector remain open questions; (3) Existing online clustering methods such as SwAV use learnable prototypes to map embeddings to a clustering space, which requires extra parameters, and they cluster on embeddings, where information has already been lost.

Key Challenge: SSL representations possess rich clustering properties, but these properties are not used to improve SSL itself, and clustering information is unevenly distributed across the components of the JEA.

Goal¶

Goal: (1) Answer "where is the best place to extract clustering properties?"; (2) Answer "how to utilize clustering properties?"; (3) Answer "does positive feedback promote better clustering?".

Core Idea: Encoding has optimal clustering properties \(\rightarrow\) online self-clustering on encoding \(\rightarrow\) using the cluster assignment matrix to guide cross-entropy loss of embedding \(\rightarrow\) positive feedback loop to improve representation quality.

Method¶

Overall Architecture¶

On top of the standard JEA (encoder + projector), ReSA extracts clustering information from the encoder output \(H\). It generates an online cluster assignment matrix \(A_H\) via the Sinkhorn-Knopp algorithm, using it to guide the cross-entropy loss between projector outputs \(Z, Z'\), forming a closed-loop positive feedback: better encoding \(\rightarrow\) better cluster assignment \(\rightarrow\) better training signals \(\rightarrow\) better encoding.

Key Designs¶

Empirical Finding of the Superior Clustering Properties of Encoding:
- Function: Determine the optimal source of clustering information
- Mechanism: Evaluate the clustering properties of the encoding \(H\), embedding \(Z\), and projector hidden layers \(P_0, P_1\) of methods like SimCLR, VICReg, SwAV, and BYOL on CIFAR-10/100 using the Silhouette Coefficient (SC_mean to measure local clustering capability, SC_std to measure stability) and Adjusted Rand Index (ARI to measure global clustering capability). Findings: (a) encoding achieves higher SC_mean, lower SC_std, and higher ARI across nearly all methods; (b) during training, the clustering metrics of encoding continuously improve, whereas those of embedding degrade in later stages; (c) although the projector hidden layers yield linear evaluation accuracy close to that of encoding, their clustering properties are significantly worse.
- Design Motivation: Confirming that the encoding is the optimal clustering information source, laying the foundation for subsequent designs.
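
The Silhouette-based metrics above can be made concrete with a minimal NumPy sketch. It assumes that SC_mean and SC_std are the mean and standard deviation of the per-sample silhouette values, that clusters are given by class labels, and that distances are Euclidean; none of these choices is confirmed by the summary, and the function name `silhouette_stats` is hypothetical.

```python
import numpy as np

def silhouette_stats(X, labels):
    """Return (mean, std) of per-sample silhouette values.

    X: (n, d) features, one row per sample (assumed layout).
    labels: (n,) cluster ids; at least two distinct clusters are required.
    """
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    # Pairwise Euclidean distances, shape (n, n).
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    clusters = np.unique(labels)
    s = np.zeros(len(X))
    for i in range(len(X)):
        same = labels == labels[i]
        n_same = same.sum()
        if n_same == 1:
            continue  # common convention: a singleton cluster scores 0
        # a: mean distance to the other members of the own cluster
        # (D[i, i] is 0, so dividing by n_same - 1 excludes the sample itself).
        a = D[i, same].sum() / (n_same - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(D[i, labels == c].mean() for c in clusters if c != labels[i])
        denom = max(a, b)
        s[i] = (b - a) / denom if denom > 0 else 0.0
    return s.mean(), s.std()
```

A higher mean indicates tighter, better-separated clusters, and a lower standard deviation indicates that this holds uniformly across samples. In practice a library routine such as scikit-learn's `silhouette_samples` (and `adjusted_rand_score` for ARI) would normally replace this loop.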
Online Self-Clustering Mechanism:
- Function: Extract clustering information from encoding and generate a soft assignment matrix
- Mechanism: Instead of using learnable prototypes, the encodings within a mini-batch serve both as the data points to be clustered and as the clustering anchors. After L2-normalizing the encodings, the cosine self-similarity matrix \(S_H = H^\top H\) is computed, and \(\exp(S_H/\epsilon)\) is then converted into an (approximately) doubly stochastic matrix \(A_H\) via the Sinkhorn-Knopp algorithm (3 iterations, regularization parameter \(\epsilon=0.05\)), which serves as the cluster assignment.
- Design Motivation: Unlike SwAV which uses learnable prototypes, ReSA requires no additional parameters and operates in the encoding space—directly leveraging the superior clustering properties of the encoding. Sinkhorn-Knopp does not involve gradient propagation and is implemented efficiently on GPUs.
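
The self-clustering step described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: encodings are stored as columns of `H` (so that \(H^\top H\) is \(m \times m\)), the function name `sinkhorn_self_assignment` is hypothetical, and the max-subtraction is a numerical-stability detail added here.

```python
import numpy as np

def sinkhorn_self_assignment(H, eps=0.05, n_iters=3):
    """Map encodings H of shape (d, m) to an (m, m) soft assignment matrix."""
    # L2-normalize each column so that H.T @ H holds cosine similarities.
    H = H / np.linalg.norm(H, axis=0, keepdims=True)
    S = H.T @ H
    # Subtracting the global max only rescales the matrix by a constant,
    # which the normalizations below cancel; it avoids overflow in exp.
    A = np.exp((S - S.max()) / eps)
    # Sinkhorn-Knopp: alternate row and column normalization.
    for _ in range(n_iters):
        A = A / A.sum(axis=1, keepdims=True)  # rows sum to 1
        A = A / A.sum(axis=0, keepdims=True)  # columns sum to 1
    return A
```

With only a few iterations the result is approximately, not exactly, doubly stochastic: the columns sum to 1 after the final step, while the rows are only close to 1. In a training loop this matrix would be treated as a constant target, which matches the statement above that Sinkhorn-Knopp involves no gradient propagation.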
Cluster-Guided Cross-Entropy Loss:
- Function: Utilize cluster assignments to guide embedding learning
- Mechanism: The ReSA loss is defined as \(\ell_{\text{ReSA}} = -\frac{1}{2m}\Big(\sum_{i,j} \big[A_H \circ \log \mathcal{D}(Z^\top Z')\big]_{ij} + \sum_{i,j} \big[A_H^\top \circ \log \mathcal{D}(Z'^\top Z)\big]_{ij}\Big)\), where \(m\) is the batch size, \(\mathcal{D}\) denotes softmax normalization with temperature \(\tau\), and \(\circ\) represents the Hadamard product. \(A_H\) originates from encoding clustering information and guides the alignment of similar samples in the embedding space.
- Design Motivation: This contrasts with the "swapped prediction" mechanism of SwAV: SwAV performs Sinkhorn on embeddings and then swaps predictions, whereas ReSA performs Sinkhorn on encodings and uses the result to guide the embeddings, capitalizing on the superior clustering properties of encodings.
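
To make the loss formula concrete, here is a minimal NumPy sketch of the forward computation. It rests on several assumptions that the summary does not state: \(\mathcal{D}\) is taken to be a row-wise softmax, the embeddings are L2-normalized before the similarity is computed, samples are stored as columns, and `tau=0.1` is a placeholder value. The names `log_softmax_rows` and `resa_loss` are hypothetical.

```python
import numpy as np

def log_softmax_rows(logits):
    """Numerically stable log-softmax over each row."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

def resa_loss(A_H, Z, Z_prime, tau=0.1):
    """Cross-entropy between the assignment A_H (m, m) and embedding similarities.

    Z, Z_prime: (d, m) embeddings of the two views, one column per sample.
    """
    Z = Z / np.linalg.norm(Z, axis=0, keepdims=True)
    Zp = Z_prime / np.linalg.norm(Z_prime, axis=0, keepdims=True)
    m = Z.shape[1]
    log_p = log_softmax_rows(Z.T @ Zp / tau)  # log D(Z^T Z')
    log_q = log_softmax_rows(Zp.T @ Z / tau)  # log D(Z'^T Z)
    # Elementwise products followed by a sum over all (i, j) entries.
    return -(np.sum(A_H * log_p) + np.sum(A_H.T * log_q)) / (2 * m)
```

Because the entries of \(A_H\) are non-negative and the log-probabilities are non-positive, the loss is non-negative, and it shrinks as the embedding similarities concentrate on the pairs that \(A_H\) marks as belonging together.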

Loss & Training¶

The total loss is the ReSA cross-entropy loss, with the temperature hyperparameter \(\tau\) controlling distribution sharpness. No extra contrastive negative samples or momentum encoders are required, as the regularization of Sinkhorn-Knopp naturally prevents representation collapse.

Key Experimental Results¶

Main Results: ImageNet Linear Evaluation¶

| Method | Backbone | Epochs | Top-1 Acc. |
|---|---|---|---|
| SimCLR | ResNet-50 | 200 | 66.5% |
| BYOL | ResNet-50 | 200 | 70.6% |
| SwAV | ResNet-50 | 200 | 71.8% |
| VICReg | ResNet-50 | 200 | 68.6% |
| ReSA | ResNet-50 | 200 | 73.2% |

Ablation Study: Clustering Source¶

| Clustering Source | SC_mean ↑ | SC_std ↓ | ARI ↑ | Training Stability |
|---|---|---|---|---|
| Embedding \(Z\) | Lower | High | Lower | Degradation in later stages |
| Projector hidden layer \(P_0\) | Medium | Medium | Medium | Unstable |
| Encoding \(H\) | Highest | Lowest | Highest | Continuous improvement |

Training Efficiency Comparison¶

| Method | Epochs required to reach 70% Top-1 | Notes |
|---|---|---|
| SimCLR | Not reached | — |
| BYOL | ~200 | — |
| SwAV | ~180 | — |
| ReSA | ~150 | Faster convergence |

Key Findings¶

Encoding possesses the optimal clustering properties in almost all SSL methods—this is a universal phenomenon rather than an exception of specific methods.
The positive feedback mechanism of ReSA not only boosts performance but also accelerates convergence—better clustering signals lead to more efficient training.
ReSA simultaneously improves both fine-grained and coarse-grained clustering properties.

Highlights & Insights¶

A New Paradigm of Positive Feedback SSL: Clustering properties \(\rightarrow\) training signals \(\rightarrow\) better representation \(\rightarrow\) better clustering properties—this self-reinforcing loop represents a conceptual contribution to SSL methodology.
Systematic Analysis of Encoding vs. Embedding: This study is the first to rigorously quantify the differences in clustering properties across various JEA components, offering a fresh perspective on the mechanism of the projector.
Prototype-Free Online Clustering: Unlike the learnable prototypes in SwAV/DINOv2, ReSA performs self-clustering directly among samples—more elegant and requiring no extra parameters.

Limitations & Future Work¶

Computational Overhead of Sinkhorn-Knopp under Large Batch Sizes: The self-similarity matrix \(S_H\) is sized \(m \times m\), resulting in increased memory and computational costs in the case of large batch sizes.
Validation Limited to Vision SSL: Clustering properties in NLP and multimodal SSL may exhibit different characteristics.
Relation to Knowledge Distillation Methods: ReSA can be viewed as a form of self-distillation—where the encoding acts as a "teacher" to guide the embedding—making its relationship with DINO/iBOT worthy of in-depth exploration.
Implicit Assumption on the Number of Clusters: Sinkhorn-Knopp does not explicitly set the number of clusters, but the batch size implicitly constrains the discoverable cluster structures.
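
To put the first limitation in rough numbers, the short calculation below estimates the memory taken by one dense copy of the \(m \times m\) matrix \(S_H\). It assumes 4-byte (float32) entries and counts a single copy only; the helper name `similarity_matrix_mib` is hypothetical, and real training would hold additional intermediate buffers.

```python
def similarity_matrix_mib(m, bytes_per_entry=4):
    """Size in MiB of one dense m x m matrix."""
    return m * m * bytes_per_entry / 2**20

# Quadratic growth: quadrupling the batch size multiplies memory by 16.
for m in (1024, 4096, 16384):
    print(m, similarity_matrix_mib(m))  # 4.0, 64.0, 1024.0 MiB respectively
```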

vs. SwAV (Caron et al., 2020): SwAV performs clustering on embeddings using learnable prototypes—ReSA conducts prototype-free self-clustering on encodings, exploiting a superior source of information.
vs. DINOv2 (Oquab et al., 2023): Also utilizes Sinkhorn-Knopp but adopts learnable prototypes—ReSA's prototype-free design is more streamlined.
vs. Ben-Shaul et al. (2023): They discovered the hierarchical clustering properties of SSL—this work is the first to leverage these properties to improve SSL itself.
vs. Ma et al. (2023): Utilizes the enhanced robustness of the encoding to reweight positive pair alignments—but ignores clustering information.

Rating¶

Novelty: ⭐⭐⭐⭐ Positive-feedback SSL is a conceptual contribution; the systematic analysis of encoding clustering properties is of high value.
Experimental Thoroughness: ⭐⭐⭐⭐ Multi-method and multi-dataset clustering analysis + large-scale experiments on ImageNet.
Writing Quality: ⭐⭐⭐⭐ Structured clearly around three progressive questions.
Value: ⭐⭐⭐⭐ Provides a new methodological paradigm for the SSL community.