A Refined Generalization Analysis for Extreme Multi-class Supervised Contrastive Representation Learning¶

Conference: ICML 2026
arXiv: 2605.07596
Code: None
Area: Self-supervised learning / Representation learning / Theoretical analysis
Keywords: Contrastive learning, Generalization bounds, U-statistics, Extreme multi-class, Sample complexity

TL;DR¶

This paper improves the sample complexity upper bound for supervised contrastive learning (where tuples are constructed from a finite labeled data pool). Using two distinct U-statistic estimators, it replaces bounds that scale with the inverse of the minimum class probability by bounds that depend only on the number of classes \(R\) or, in extreme multi-class scenarios, on the number of negative samples per tuple \(k\).

Background & Motivation¶

Background Contrastive representation learning has achieved significant empirical success across various machine learning tasks. However, the theoretical understanding of its sample complexity remains insufficient. Existing analyses (e.g., Arora et al. 2019) typically assume that input tuples are independent and identically distributed (i.i.d.), an assumption that often fails in practical settings.

Limitations of Prior Work In actual pipelines, contrastive tuples are constructed from a finite pool of labeled data, leading to dependencies between tuples. Recent work has analyzed this setting using U-statistics, but its analysis requires uniform risk concentration across all categories. This results in sample complexity scaling at the order of \(\rho_{\min}^{-1}\) (the reciprocal of the minimum class probability), which is overly pessimistic in extreme multi-class scenarios with many tail classes.

Key Challenge Existing methods struggle when dealing with imbalanced data: they must ensure estimation accuracy for all classes while simultaneously avoiding the severe impact of the minimum class probability on complexity.

Goal To improve the generalization analysis of supervised contrastive learning and achieve tighter bounds within extreme multi-class settings.

Key Insight The requirement for uniform concentration is relaxed, allowing estimators for different categories to have heterogeneous precisions. Simultaneously, a brand-new U-statistic estimator is designed to enforce joint concentration across classes rather than individual concentration at the class level.

Core Idea Two levels of innovation are introduced. First, the class-level fusion U-statistic estimator is improved by removing the dependence on the minimum class probability, replacing it with a dependence on the number of classes \(R\). Second, a completely different estimator is introduced based on the joint concentration of class collision probabilities, which makes the complexity depend only on the number of negative samples per tuple \(k\) in extreme multi-class scenarios.

Method¶

Overall Architecture¶

The paper investigates the supervised contrastive learning setting where tuples are constructed from a finite pool of labeled data. Given a labeled dataset \(S=\{X_j\}_{j=1}^N\), for a representation function \(f\in\mathcal{F}\) and contrastive loss \(\phi\), the tuple-level loss is \(\ell_{\phi,f}(X,X^+,\{X_i^-\}_{i=1}^k)\), where \(X,X^+\) belong to the same class and \(\{X_i^-\}\) are \(k\) negative samples. Because these tuples repeatedly reuse the same set of data points, they are no longer independent. One cannot directly apply classical concentration results for i.i.d. tuples. Instead, the paper utilizes U-statistics to estimate the population contrastive risk and define its uniform excess risk (the maximum gap between empirical and population risk).
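The dependence between tuples can be made concrete with a small enumeration. The following is only a sketch under stated assumptions: the helper `collision_free_tuples` and the toy label list are hypothetical, "collision-free" is read as "every negative has a label different from the anchor's", and negatives are taken as unordered subsets of size \(k\). The paper's exact tuple convention may differ.

```python
from itertools import combinations, permutations

def collision_free_tuples(labels, k):
    """Yield (anchor, positive, negatives) index tuples from a labeled pool.

    Assumption: "collision-free" means every negative has a label different
    from the anchor's; negatives are taken as unordered k-subsets.
    """
    n = len(labels)
    for a, p in permutations(range(n), 2):
        if labels[a] != labels[p]:
            continue  # the positive must share the anchor's class
        other_class = [j for j in range(n) if labels[j] != labels[a]]
        for negs in combinations(other_class, k):
            yield a, p, negs

labels = [0, 0, 1, 1, 2]  # hypothetical pool of N = 5 labeled points
tuples = list(collision_free_tuples(labels, k=2))

# Check whether every pair of tuples shares at least one pool index.
overlapping = all(
    set((a, p) + negs) & set((b, q) + negs2)
    for (a, p, negs), (b, q, negs2) in combinations(tuples, 2)
)
```

In this toy pool every tuple uses 4 of the 5 available points, so any two tuples necessarily share data points. That reuse is what rules out concentration arguments written for independent tuples.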

The methodology follows two parallel lines of improvement, both designed to remove the bottleneck in Hieu & Ledent (2025). That analysis decomposes the population risk into class-wise risks and requires every class to concentrate separately at the same precision, so the sample complexity is penalized by \(\rho_{\min}^{-1}\). The first line (Key Design 1) retains the same class-level fusion estimator but relaxes the "uniform precision" requirement, reducing complexity to depend only on the class count \(R\). The second line (Key Design 2) introduces a completely new joint concentration estimator, shifting from the "class" dimension to the "sample" dimension, further reducing complexity to depend only on the number of negative samples per tuple \(k\) under extreme multi-class conditions. The concentration proofs for both lines are built on the same mathematical tool: survival probability decomposition (Key Design 3).

Key Designs¶

1. Improved Class-level Fusion Estimator: Replacing \(\rho_{\min}^{-1}\) with Class Count \(R\) Previous work using U-statistics required uniform risk concentration across all classes, forcing sample complexity to scale with \(\rho_{\min}^{-1}\). This paper relaxes this uniformity, allowing different class estimators to concentrate at different rates: rare classes with small risk contributions use relaxed precision, while only the high-probability classes maintain high precision. This non-uniform precision allocation follows from the population risk decomposition \(L_\phi(f)=\sum_{r=1}^R\rho_r L_r(f)\): each class risk \(L_r(f)\) enters with weight \(\rho_r\), so rare classes contribute little to the total risk and do not need high-precision estimates. Consequently, the sample complexity is reduced from \(O(R\cdot\rho_{\min}^{-1})\) to \(O(R)\), eliminating dependence on the minimum class probability.
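The intuition behind non-uniform precision can be illustrated with a back-of-the-envelope calculation. This is not the paper's proof: it assumes a heuristic per-class error of \(1/\sqrt{N\rho_r}\) (roughly \(N\rho_r\) samples from class \(r\)), drops constants, logarithmic factors and the complexity term, and uses hypothetical function names. Under that heuristic, demanding the same precision \(\varepsilon\) for every class needs \(N\geq 1/(\varepsilon^2\rho_{\min})\), while only demanding that the weighted error \(\sum_r\rho_r/\sqrt{N\rho_r}\) stays below \(\varepsilon\) needs \(N\geq(\sum_r\sqrt{\rho_r})^2/\varepsilon^2\leq R/\varepsilon^2\) by Cauchy-Schwarz.

```python
import math

def uniform_precision_samples(rho, eps):
    """Pool size so that EVERY class reaches precision eps,
    under the heuristic per-class error 1 / sqrt(N * rho_r)."""
    return 1.0 / (eps ** 2 * min(rho))

def weighted_precision_samples(rho, eps):
    """Pool size so that the rho-weighted error
    sum_r rho_r / sqrt(N * rho_r) is at most eps."""
    root_sum = sum(math.sqrt(r) for r in rho)
    return (root_sum / eps) ** 2

# Hypothetical long-tailed distribution: one head class, 100 rare classes.
rho = [0.9] + [0.001] * 100
eps = 0.1
n_uniform = uniform_precision_samples(rho, eps)
n_weighted = weighted_precision_samples(rho, eps)
class_count_cap = len(rho) / eps ** 2  # the R / eps^2 ceiling from Cauchy-Schwarz
```

In this toy setting the weighted requirement stays below the \(R/\varepsilon^2\) ceiling regardless of how small the rarest class is, whereas the uniform requirement grows with \(\rho_{\min}^{-1}\).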

2. Joint Concentration Estimator: Shifting from "Class" to "Sample" Dimension While the first design is an improvement, \(R\) can still be very large in extreme multi-class settings. The second design utilizes a different U-statistic formulation that enforces joint concentration across classes: the collision-free contrastive risk is decomposed into "at least one collision negative sample" and "exactly zero collision negative samples." The resulting bound is governed by the product of \((1-\tau)^2\), where \(\tau\) is the class collision probability, and the number of negative samples per tuple \(k\), rather than by the number of classes. In the extreme multi-class limit where many \(\rho_r\) are very small and the collision probability \(\tau\to 0\), the complexity reduces to the ideal \(O(k)\), aligning with classical k-tuple learning theory.

3. Survival Probability Decomposition: Shared Mathematical Foundation To prove U-statistic concentration for both estimators, a tool is needed to measure the contribution of each class separately. The paper utilizes survival probabilities: for a distribution \(\mathcal{D}\) and level \(\ell\), it defines \(p_{\mathcal{D}}(\ell)=\Pr(X\geq\ell)\). By decomposing \(\mathbb{E}[\min\{k_i,X_i\}]=\sum_{\ell=1}^{k_i}p_i(\ell)\), the tuple-level objective is rewritten as a weighted sum of marginal survival probabilities. This allows contributions across classes and levels to be measured and concentrated independently, providing a unified proof framework for both new estimators.
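The identity \(\mathbb{E}[\min\{k,X\}]=\sum_{\ell=1}^{k}\Pr(X\geq\ell)\) holds for any non-negative integer-valued \(X\) and can be checked exactly. The sketch below uses a binomial count purely as an illustrative stand-in (the summary does not specify the distribution of \(X_i\)); the function names are hypothetical, and exact rational arithmetic avoids floating-point tolerance.

```python
from fractions import Fraction
from math import comb

def binomial_pmf(n, p):
    """Exact pmf of Binomial(n, p) as a list indexed by x = 0..n."""
    return [comb(n, x) * p ** x * (1 - p) ** (n - x) for x in range(n + 1)]

def expected_truncated(pmf, k):
    """E[min(k, X)] computed directly from the pmf."""
    return sum(min(k, x) * px for x, px in enumerate(pmf))

def survival_sum(pmf, k):
    """sum_{l=1}^{k} Pr(X >= l), the survival-probability form."""
    return sum(sum(pmf[level:]) for level in range(1, k + 1))

pmf = binomial_pmf(6, Fraction(1, 4))
identity_holds = all(
    expected_truncated(pmf, k) == survival_sum(pmf, k) for k in range(0, 9)
)
```

Each level \(\ell\) contributes its own term \(\Pr(X\geq\ell)\), which is what lets the contributions be handled one level at a time.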

Loss & Training¶

The paper focuses on theoretical analysis. The core object is the Logistic contrastive loss \(\phi(\mathbf{v})=\ln(1+\sum_{i=1}^k e^{-v_i})\). The analysis is based on general Lipschitz parameterized function classes, with a complexity term \(\mathcal{C}_N(\mathcal{H})\sim\widetilde{O}(\sqrt{W})\), where \(W\) is the number of parameters.
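As a concrete reading of the loss, the sketch below evaluates \(\phi(\mathbf{v})=\ln(1+\sum_{i=1}^k e^{-v_i})\) in a numerically stable way. The summary does not define \(\mathbf{v}\); the helper `margins` assumes the common convention \(v_i=f(x)^\top(f(x^+)-f(x_i^-))\) from the contrastive learning literature, and both function names are hypothetical.

```python
import math

def logistic_contrastive_loss(v):
    """phi(v) = ln(1 + sum_i exp(-v_i)), via a log-sum-exp shift for stability."""
    shift = max([0.0] + [-vi for vi in v])
    return shift + math.log(math.exp(-shift) + sum(math.exp(-vi - shift) for vi in v))

def margins(anchor, positive, negatives):
    """Assumed convention: v_i = <f(x), f(x+)> - <f(x), f(x_i^-)>."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))
    pos_sim = dot(anchor, positive)
    return [pos_sim - dot(anchor, neg) for neg in negatives]

# Hypothetical 2-D embeddings with k = 2 negatives.
v = margins([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0], [-1.0, 0.0]])
loss = logistic_contrastive_loss(v)
```

The loss shrinks toward zero as every margin grows, and each additional negative adds one more exponential term inside the logarithm.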

Key Experimental Results¶

Main Results¶

Method	Estimator Type	Sample Complexity (Default)	Sample Complexity (Balanced)	Dependent on \(\rho_{\min}\)?
Arora et al. 2019	Collision-allowed U-stat	\(O(\sqrt{k/N})\) (i.i.d. tuples)	-	No
Hieu 2025	Collision-free class fusion	\(\mathcal{C}^2_N R\max[\rho_{\min}^{-1},(1-\rho_{\max})^{-1}]\)	\(\mathcal{C}^2_N R\)	Yes
Ours (Contrib 1)	Improved class fusion	\(\mathcal{C}^2_N[\hat{\theta}_{k+2}R+(1-\hat{\theta}_{k+2})^2k]\)	\(\mathcal{C}^2_N R\)	No
Ours (Contrib 2)	Joint concentration (New)	\(\mathcal{C}^2_N k(1-\tau)^2\)	\(\mathcal{C}^2_N k\)	No

Where \(\hat{\theta}_{k+2}=\Pr[\rho_r\leq 2/(k+2)]\) represents the proportion of low-probability classes, and \(\tau\) is the class collision probability.
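To get a feel for the Default-column expressions, the sketch below evaluates two of them, as transcribed in the table, on a hypothetical Zipf-like class distribution. It makes several assumptions: the shared factor \(\mathcal{C}^2_N\), constants and logarithmic terms are dropped; \(\tau\) is taken to be \(\sum_r\rho_r^2\), the probability that two independent draws share a class (the summary only calls it the class collision probability); and all function and key names are illustrative.

```python
def collision_probability(rho):
    """Assumed definition: tau = sum_r rho_r^2."""
    return sum(r * r for r in rho)

def default_bound_factors(rho, k):
    """Default-column factors from the table, with C_N^2 omitted."""
    R = len(rho)
    tau = collision_probability(rho)
    hieu = R * max(1.0 / min(rho), 1.0 / (1.0 - max(rho)))
    joint = k * (1.0 - tau) ** 2
    return {"tau": tau, "hieu_2025": hieu, "joint_concentration": joint}

# Hypothetical long-tailed distribution: rho_r proportional to 1 / r.
R, k = 1000, 10
weights = [1.0 / r for r in range(1, R + 1)]
total = sum(weights)
rho = [w / total for w in weights]
factors = default_bound_factors(rho, k)
```

Under these assumptions the joint-concentration factor never exceeds \(k\), while the class-fusion factor of the earlier analysis is at least \(R\) and grows further as the rarest class shrinks.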

Ablation Study¶

Setting	Result	Description
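Perfectly Balanced (\(\rho_r=1/R\))	Ours (Contrib 1) equivalent to Hieu	When all classes are equiprobable, the improved class-fusion bound coincides with Hieu's at \(O(R)\) (Balanced column of the main table), while the joint concentration bound is \(O(k)\).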
Extreme Multi-class (Most \(\rho_r\ll \frac{2}{k+2}\))	Ours \(\approx O(k)\) vs Hieu \(O(R)\)	New estimator exploits the small contribution of rare classes.
Long-tail Distribution	Improvement depends on \(\hat{\theta}_{k+2}\)	Longer tails lead to greater potential for improvement.

Key Findings¶

Both U-statistic estimators have specific use cases: class-level fusion is suitable for scenarios with dominant majority classes, while the joint concentration estimator performs optimally near balanced distributions.
The magnitude of improvement is quantified by \(\hat{\theta}_{k+2}\) (the proportion of low-probability classes, \(\Pr[\rho_r\leq 2/(k+2)]\)); in extreme multi-class settings the bound can improve from \(O(R)\) to \(O(k)\).
Theoretical results do not depend on the number of rare classes, but rather their relative total size in the population.

Highlights & Insights¶

Refined Non-uniform Concentration: Allowing class-level precision heterogeneity is a simple but powerful idea that directly corresponds to real-world differences in class contributions, avoiding pessimistic bounds caused by the minimum class probability.
Dual Innovation in U-Statistics: The second estimator, through collision probability decomposition, cleverly shifts the theoretical bridge from the "class" dimension to the "sample" dimension, representing a theoretical leap from fixed multi-class to extreme multi-class.
Unification with Classical Theory: Recovering the \(O(k)\) rate in the extreme multi-class limit aligns with Hoeffding-type results in k-tuple learning, demonstrating theoretical consistency.

Limitations & Future Work¶

The contribution is primarily theoretical, lacking empirical validation of how different U-statistic estimators perform in actual contrastive learning pipelines.
Assumption of avoidable class collisions: In practice (especially in self-supervised learning), collisions cannot be entirely avoided. While the paper discusses collision-allowed risk, the analysis is less deep than the collision-free case.
Specific lower bounds for function class complexity are not addressed: The \(\mathcal{C}_N(\mathcal{H})\) term in the sample complexity bound may still be large for certain function classes.

Related Work & Insights¶

vs. Arora et al. 2019: Arora et al. assume i.i.d. tuples, so their sample complexity is expressed in the number of tuples rather than the number of labeled data points \(N\). This work handles realistic finite-pool construction, making the framework more practical.
vs. Hieu & Ledent 2025: Directly improves their U-statistic analysis by relaxing the uniform concentration requirement, replacing the \(\rho_{\min}^{-1}\) dependence with a dependence on \(R\) or, in extreme multi-class scenarios, on \(k\).
vs. SSL Theory: Contributes a tight analysis for the supervised version, laying the groundwork for understanding the complexity of self-supervised learning (including collisions).

Rating¶

Novelty: ⭐⭐⭐⭐⭐ The construction and analysis of two parallel U-statistic estimators are novel, especially the theoretical breakthrough of the joint concentration idea.
Experimental Thoroughness: ⭐⭐⭐ Theoretical paper without empirical experiments; the theoretical results are complete but lack practical validation.
Writing Quality: ⭐⭐⭐⭐ Mathematical expressions are rigorous and clear, with main results being easy to interpret.
Value: ⭐⭐⭐⭐ Deepens the understanding of generalization in supervised contrastive learning and advances the theoretical foundation of multi-class learning.