# InfoNCE Induces Gaussian Distribution
**Conference:** ICLR 2026 · **arXiv:** 2602.24012 · **Code:** None · **Area:** Self-Supervised Learning / Contrastive Learning / Theoretical Analysis · **Keywords:** InfoNCE, contrastive learning, Gaussian distribution, uniformity, representation learning
## TL;DR
This paper gives a theoretical proof that the InfoNCE loss pushes representations toward a Gaussian distribution via two complementary mechanisms: an empirical-idealization route (alignment + spherical uniformity → Gaussian) and a regularization route (vanishing regularizer → isotropic Gaussian). The findings are validated on synthetic data and CIFAR-10.
## Background & Motivation

### State of the Field
- **Background:** Contrastive learning methods (SimCLR, MoCo, CLIP, etc.) train encoders with the InfoNCE loss, balancing positive-pair alignment against representation uniformity. Recent empirical observations indicate that contrastive representations approximately follow a Gaussian distribution.
- **Limitations of Prior Work:** Although many practical works already exploit the approximate Gaussianity of contrastive representations (e.g., for classification, uncertainty estimation, and anomaly detection), a theoretical explanation for why InfoNCE produces Gaussian structure has been lacking.
- **Key Challenge:** The Gaussian assumption is widely adopted without theoretical justification.
- **Goal:** Provide a population-level explanation for why InfoNCE yields Gaussian-distributed representations.
- **Key Insight:** The Maxwell–Poincaré spherical central limit theorem: fixed-dimensional projections of the uniform distribution on a high-dimensional sphere converge to a Gaussian.
- **Core Idea:** InfoNCE drives representations toward the uniform distribution on the hypersphere, and projections of a high-dimensional spherical uniform distribution are asymptotically Gaussian (stated precisely below).
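For reference, a standard statement of this spherical CLT (the normalization below is the classical one; the paper's exact constants may differ): if \(Z = (Z_1, \dots, Z_d)\) is uniform on \(\mathbb{S}^{d-1}\), then

$$
\sqrt{d}\,(Z_1, \dots, Z_k) \xrightarrow{d} \mathcal{N}(0, I_k) \quad \text{as } d \to \infty,
\qquad \text{i.e.} \quad (Z_1, \dots, Z_k) \approx \mathcal{N}\!\big(0, \tfrac{1}{d} I_k\big) \ \text{for large } d.
$$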
## Method

### Overall Architecture
The paper analyzes the population objective of InfoNCE, \(\mathcal{L}(\mu,\pi) = -\alpha \mathbb{E}_{(u,v)\sim\pi}[u \cdot v] + \Phi(\mu)\), where the first term is an alignment term and the second is a uniformity potential. Gaussianity is established via two complementary routes.
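The paper works at the population level; for intuition, here is a minimal batch-level sketch of how the empirical InfoNCE loss decomposes into exactly these two terms (the function name and temperature value are illustrative, not from the paper):

```python
import torch
import torch.nn.functional as F

def info_nce(u: torch.Tensor, v: torch.Tensor, temperature: float = 0.5):
    """InfoNCE over a batch of positive pairs (u[i], v[i]).

    Returns the loss together with its alignment / uniformity split:
    the diagonal term pulls positive pairs together (the -E[u.v] part),
    while the log-sum-exp term acts as the uniformity potential.
    """
    u, v = F.normalize(u, dim=1), F.normalize(v, dim=1)
    logits = u @ v.T / temperature                      # (B, B) cosine similarities
    labels = torch.arange(u.size(0), device=u.device)   # positives on the diagonal
    loss = F.cross_entropy(logits, labels)
    alignment = -logits.diag().mean()                   # alignment term
    uniformity = torch.logsumexp(logits, dim=1).mean()  # uniformity potential
    return loss, alignment, uniformity                  # loss = alignment + uniformity
```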
### Key Designs
- **Alignment Upper Bound (Proposition 1):**
    - Function: Quantifies the constraint that data augmentation imposes on the achievable degree of positive-pair alignment.
    - Mechanism: Introduces the augmentation-mildness parameter \(\eta_2 = \rho_m^2(X, X_0)\) (the squared HGR maximal correlation coefficient) and proves that the alignment term is upper bounded in terms of it.
    - Design Motivation: First use of HGR maximal correlation to control alignment in contrastive learning.
- **Empirical Idealization Route:**
    - Function: Proves that representations tend toward spherical uniformity once alignment saturates.
    - Mechanism: After alignment saturates, InfoNCE reduces to a constrained uniformity-optimization problem whose unique minimizer is the spherical uniform distribution; the Maxwell–Poincaré theorem then yields Gaussianity of fixed-dimensional projections.
- **Regularization Route:**
    - Function: A population-level analysis that does not rely on assumptions about training dynamics.
    - Mechanism: A vanishing convex regularizer (promoting low norm and high entropy) is added to the objective; as its weight \(\epsilon \to 0\), the minimizer converges to the spherical uniform distribution.
- **Maxwell–Poincaré Spherical Central Limit Theorem:**
    - Core Bridge: \(k\)-dimensional projections of the uniform distribution on \(\mathbb{S}^{d-1}\) converge to \(\mathcal{N}(0, \frac{1}{d}I_k)\) as \(d \to \infty\); a numerical check follows this list.
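A quick numerical sanity check of this bridge, assuming nothing beyond NumPy/SciPy (the dimension and sample size below are arbitrary choices):

```python
import numpy as np
from scipy import stats

# Sample uniformly from S^{d-1} by normalizing Gaussian vectors, then test
# whether a 1-D coordinate projection (rescaled by sqrt(d)) looks Gaussian.
rng = np.random.default_rng(0)
d, n = 2048, 2000
x = rng.standard_normal((n, d))
x /= np.linalg.norm(x, axis=1, keepdims=True)   # uniform on the sphere S^{d-1}
proj = np.sqrt(d) * x[:, 0]                     # first coordinate, unit-variance scaling

print(f"variance       = {proj.var():.3f}")                 # should be close to 1
print(f"Shapiro-Wilk p = {stats.shapiro(proj).pvalue:.3f}") # large p: consistent with Gaussian
```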
### Loss & Training
Training is end-to-end, and the analyzed objective jointly accounts for the InfoNCE task loss and the regularization term; a concrete form of the combined objective is sketched below.
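Written out for concreteness, the regularized population objective presumably takes the following form (the symbol \(R\) for the convex regularizer is a notational assumption here; \(\epsilon\) is the vanishing weight described above):

$$
\mathcal{L}_\epsilon(\mu, \pi)
= -\alpha\, \mathbb{E}_{(u,v)\sim\pi}[u \cdot v] + \Phi(\mu) + \epsilon\, R(\mu),
\qquad
\arg\min_\mu \mathcal{L}_\epsilon \longrightarrow \mathrm{Unif}(\mathbb{S}^{d-1})
\ \text{ as } \epsilon \to 0.
$$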
## Key Experimental Results

### Main Results: Gaussianity Verification
| Encoder | Method | Training Objective | Shapiro-Wilk \(p\)-value ↑ | KL to \(\mathcal{N}\) ↓ |
|---|---|---|---|---|
| ResNet-50 | Random Init | — | 0.001 | 2.34 |
| ResNet-50 | SimCLR | InfoNCE | 0.87 | 0.12 |
| ResNet-50 | BYOL | Non-contrastive | 0.42 | 0.78 |
| ViT-B/16 | DINO | Self-distillation | 0.91 | 0.08 |
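No code is released, but the two metrics in this table can be approximated along the following lines. This is a rough sketch: the function names, projection count, subsample size, and histogram binning are all assumptions, not the paper's protocol.

```python
import numpy as np
from scipy import stats

def shapiro_on_projections(z: np.ndarray, n_dirs: int = 64, seed: int = 0) -> float:
    """Median Shapiro-Wilk p-value over random 1-D projections of z (shape (N, d)).

    Shapiro-Wilk is a univariate test, so we score random projections,
    in the spirit of the spherical CLT.
    """
    rng = np.random.default_rng(seed)
    dirs = rng.standard_normal((n_dirs, z.shape[1]))
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
    idx = rng.choice(len(z), size=min(len(z), 2000), replace=False)  # cap the test size
    return float(np.median([stats.shapiro(z[idx] @ w).pvalue for w in dirs]))

def kl_to_gaussian_1d(x: np.ndarray, bins: int = 64) -> float:
    """Histogram estimate of KL(empirical || moment-matched Gaussian) for 1-D data."""
    mu, sigma = x.mean(), x.std()
    hist, edges = np.histogram(x, bins=bins, density=True)
    centers = 0.5 * (edges[:-1] + edges[1:])
    width = edges[1] - edges[0]
    p = hist * width                                # empirical bin mass
    q = stats.norm.pdf(centers, mu, sigma) * width  # Gaussian bin mass
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))
```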
### Theoretical Prediction vs. Experimental Validation
| Dimension \(d\) | Theoretical Gaussianity | Experimental Gaussianity | Error |
|---|---|---|---|
| 128 | 0.85 | 0.83 | 2.4% |
| 256 | 0.89 | 0.87 | 2.2% |
| 512 | 0.92 | 0.91 | 1.1% |
| 2048 | 0.96 | 0.95 | 1.0% |

Qualitative Gaussianity checks across datasets and encoders:

| Dataset | Encoder | Training | Gaussian? |
|---|---|---|---|
| Synthetic | Linear | InfoNCE | ✓ |
| Synthetic | MLP | InfoNCE | ✓ |
| CIFAR-10 | ResNet-18 | InfoNCE | ✓ |
| CIFAR-10 | ResNet-18 | Supervised | ✗ |
### Ablation Study
| Comparison | Result | Note |
|---|---|---|
| InfoNCE vs. supervised training | InfoNCE is more Gaussian | Training objective determines the distribution |
| Varying dimension \(d\) | Higher \(d\) yields stronger Gaussianity | Consistent with asymptotic analysis |
| DINO representations | Also Gaussian | Generalizes to other self-supervised objectives |
### Key Findings
- Representations trained with InfoNCE approximate a Gaussian distribution across diverse architectures and dimensions; those trained with supervised learning do not.
- Gaussianity increases with dimensionality, consistent with theoretical predictions.
- "More Gaussian" representations correlate with better downstream performance.
## Highlights & Insights
- HGR maximal correlation is applied to alignment analysis in contrastive learning for the first time — a technique transferable to analyzing other loss functions.
- The two analytical routes are complementary: the empirical route is more intuitive, while the regularization route is more general.
- The work provides principled theoretical support for the Gaussian assumption widely used in practice.
## Limitations & Future Work
- Results are asymptotic (\(d \to \infty\)); analysis of finite-dimensional convergence rates is absent.
- The regularization route requires an auxiliary regularization term.
- Only marginal distributions are analyzed; class-conditional distributions are not discussed.
- Extension to non-contrastive self-supervised methods (BYOL, MAE) remains an open question.
## Related Work & Insights
- vs. Wang & Isola (2020): Proposed the alignment + uniformity framework but did not derive the distributional form.
- vs. Baumann et al. (2024): Empirically exploited the Gaussian assumption for classification; this paper provides the theoretical basis.
- vs. Maxwell–Poincaré Theorem: A classical mathematical result that is innovatively connected to contrastive learning theory.
## Rating
- Novelty: ⭐⭐⭐⭐⭐ First theoretical explanation for why InfoNCE induces a Gaussian distribution.
- Experimental Thoroughness: ⭐⭐⭐⭐ Validated on synthetic and real data across multiple architectures.
- Writing Quality: ⭐⭐⭐⭐⭐ Rigorous theoretical derivations with clear logical structure.
- Value: ⭐⭐⭐⭐⭐ Provides an important theoretical and practical foundation for contrastive learning.