Self-Supervised Learning from Structural Invariance¶
Conference: ICLR 2026 arXiv: 2602.02381 Code: https://github.com/SkrighYZ/AdaSSL Area: Self-Supervised Learning / Causal Representation Learning Keywords: self-supervised learning, latent variable model, structural invariance, heteroscedasticity, causal representation
TL;DR¶
This paper proposes AdaSSL, which introduces latent variables to model conditional uncertainty between positive pairs, derives a variational lower bound on mutual information, and enables SSL to handle complex (multimodal, heteroscedastic) conditional distributions in naturally paired data. AdaSSL outperforms baselines on causal representation learning, fine-grained image understanding, and video world models.
Background & Motivation¶
Background: Joint-embedding SSL methods (e.g., SimCLR, BYOL) learn representations by encouraging similarity between positive pair embeddings, typically relying on hand-crafted data augmentations to construct semantically related pairs.
Limitations of Prior Work: Hand-crafted augmentations (cropping, color jitter) cannot precisely simulate real-world variation factors, may discard fine-grained information, require modality-specific heuristics, and differ from natural distribution shifts. Naturally paired data (e.g., adjacent video frames, image-text pairs) better reflects real-world variation, but introduces complex conditional distributions \(p(\mathbf{z}^+|\mathbf{z})\)—heteroscedastic and multimodal—which existing SSL methods cannot model.
Key Challenge: InfoNCE's dot-product similarity implicitly assumes a vMF conditional distribution (isotropic noise), and AnInfoNCE extends this to anisotropic but still input-independent noise. Proposition 2.1, however, establishes that even when noise in the latent space is isotropic, mapping to the normalized embedding space inevitably induces heteroscedasticity: a necessary consequence of the geometric mismatch between the two spaces.
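The geometric argument behind Proposition 2.1 can be checked numerically: add the same isotropic Gaussian noise to latent points of different norms, project to the unit circle, and compare the spread of the resulting embeddings. A minimal pure-Python sketch (all names are illustrative, not from the paper):

```python
import math
import random

random.seed(0)

def normalize(v):
    """Project a 2-D latent onto the unit circle (the embedding manifold)."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def embedding_variance(center, sigma=0.1, n=5000):
    """Trace of the sample covariance of sphere-projected noisy latents.

    Isotropic Gaussian noise with scale `sigma` is added to `center`,
    and each noisy latent is mapped to the unit circle.
    """
    samples = []
    mean = [0.0, 0.0]
    for _ in range(n):
        z = [c + random.gauss(0, sigma) for c in center]
        e = normalize(z)
        samples.append(e)
        mean = [m + x / n for m, x in zip(mean, e)]
    return sum((x - m) ** 2 for s in samples for x, m in zip(s, mean)) / n

# Identical isotropic latent noise, different latent norms ->
# very different variance on the embedding sphere (heteroscedasticity).
v_near = embedding_variance([0.5, 0.0])  # latent close to the origin
v_far = embedding_variance([5.0, 0.0])   # latent far from the origin
```

Here the embedding variance scales roughly like \((\sigma/\|\mathbf{z}\|)^2\), so a fixed similarity function cannot be calibrated for all inputs at once.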
Goal: Enable SSL to flexibly model arbitrarily complex conditional distributions \(p(\mathbf{z}^+|\mathbf{z})\) while keeping the similarity function simple.
Key Insight: Inspired by JEPA, the method introduces a latent variable \(\mathbf{r}\) to capture predictive uncertainty, decomposing the complex conditional distribution into two steps: first sample \(\mathbf{r}\) (e.g., camera motion, actions), then predict \(\mathbf{z}^+\) with a simple model.
Core Idea: Via the chain rule of mutual information \(I(f(\mathbf{x}); f(\mathbf{x}^+)) = I(f(\mathbf{x}), \mathbf{r}; f(\mathbf{x}^+)) - I(\mathbf{r}; f(\mathbf{x}^+)|f(\mathbf{x}))\), the first term is optimized with an extended InfoNCE (simple similarity + latent variable), and the second term is regularized with KL divergence to prevent \(\mathbf{r}\) from encoding shortcuts.
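For completeness, the decomposition is just the chain rule of mutual information, rearranged:

```latex
% Chain rule, then rearrange to isolate the SSL objective:
I\big(f(\mathbf{x}), \mathbf{r};\, f(\mathbf{x}^+)\big)
  = I\big(f(\mathbf{x});\, f(\mathbf{x}^+)\big)
  + I\big(\mathbf{r};\, f(\mathbf{x}^+) \mid f(\mathbf{x})\big)
\;\Longrightarrow\;
I\big(f(\mathbf{x});\, f(\mathbf{x}^+)\big)
  = I\big(f(\mathbf{x}), \mathbf{r};\, f(\mathbf{x}^+)\big)
  - I\big(\mathbf{r};\, f(\mathbf{x}^+) \mid f(\mathbf{x})\big)
```

Maximizing the first term while penalizing the second yields a lower bound on the target mutual information.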
Method¶
Overall Architecture¶
An encoder \(f\) extracts embeddings, and a latent variable \(\mathbf{r}\) captures uncertainty between positive pairs. An editing function \(t(f(\mathbf{x}), \mathbf{r})\) modifies embeddings to bring them closer to \(f(\mathbf{x}^+)\). The objective combines an SSL loss (InfoNCE or BYOL) with a regularization term that limits the information content of \(\mathbf{r}\).
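The architecture's contract can be illustrated in a few lines: an edit \(\mathbf{r}\) applied through \(t\) should move \(f(\mathbf{x})\) toward \(f(\mathbf{x}^+)\). A toy sketch with a stand-in encoder and an additive editing function (names and shapes are illustrative, not from the paper):

```python
import math

def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def f(x):
    """Stand-in encoder: identity followed by L2 normalization."""
    return normalize(x)

def t(z, r):
    """Editing function: additively applies the latent edit r to z."""
    return [zi + ri for zi, ri in zip(z, r)]

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

x, xp = [1.0, 0.0], [0.8, 0.6]               # a toy "natural" positive pair
z, zp = f(x), f(xp)
r = [0.5 * (p - q) for p, q in zip(zp, z)]   # an edit covering half the gap
assert dist(t(z, r), zp) < dist(z, zp)        # editing moves f(x) toward f(x+)
```

In the actual method, \(\mathbf{r}\) is not hand-constructed like this; it is inferred (AdaSSL-V) or predicted (AdaSSL-S) from the pair.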
Key Designs¶
- AdaSSL-V (Variational Version):
- Function: Models the latent variable using a variational distribution \(q_\phi(\mathbf{r}|\mathbf{x}, \mathbf{x}^+)\)
- Mechanism: \(\mathcal{L} = \mathcal{L}_{SSL}(\mathbb{E}_{q_\phi} \psi_1(\mathbf{x}, \mathbf{r}), \psi_2(\mathbf{x}^+)) + \beta D_{KL}(q_\phi(\mathbf{r}|\mathbf{x}, \mathbf{x}^+) \| p_\theta(\mathbf{r}|\mathbf{x}))\), where KL regularization prevents \(\mathbf{r}\) from directly encoding \(f(\mathbf{x}^+)\)
- Design Motivation: Derives a tractable lower bound on \(I(f(\mathbf{x}); f(\mathbf{x}^+))\) with theoretical rigor
- AdaSSL-S (Sparse Version):
- Function: Deterministically predicts \(\mathbf{r}\) and regularizes its sparsity
- Mechanism: \(\mathbf{r} = m(f(\mathbf{x}), f(\mathbf{x}^+))\), with differentiable L0 penalty via Gumbel-Sigmoid. The editing function adopts a modular low-rank design: \(t(f(\mathbf{x}), \mathbf{r}) = f(\mathbf{x}) + \sum_i r_i (\mathbf{B}_i \mathbf{A}_i f(\mathbf{x}) + b_i)\)
- Design Motivation: Natural variations typically correspond to sparse changes in latent factors; the sparsity inductive bias better aligns with causal representation learning
- Necessity of Heteroscedasticity (Proposition 2.1):
- Function: Theoretically proves that the conditional distribution of pairs in embedding space is necessarily heteroscedastic
- Mechanism: When the latent space \(\mathbb{R}^{d_z}\) is mapped to a curved manifold (e.g., unit sphere \(\mathbb{S}^{d_f}\)), local neighborhood distortions are position-dependent, producing location-dependent variance even when the original noise is isotropic
- Design Motivation: Provides a fundamental proof of the inadequacy of standard SSL similarity functions
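The AdaSSL-S editing function is mechanically similar to a gated bank of low-rank adapters. A minimal sketch of \(t(f(\mathbf{x}), \mathbf{r}) = f(\mathbf{x}) + \sum_i r_i (\mathbf{B}_i \mathbf{A}_i f(\mathbf{x}) + b_i)\) with Gumbel-Sigmoid gates (dimensions, the gate temperature, and the stand-in predictor logits are all illustrative assumptions):

```python
import math
import random

random.seed(0)
D, K, RANK = 4, 3, 1  # embedding dim, number of edit modules, module rank (toy sizes)

def matvec(M, v):
    return [sum(mij * vj for mij, vj in zip(row, v)) for row in M]

def gumbel_sigmoid(logit, temp=0.5):
    """Relaxed binary gate: sigmoid of a logistic-noise-perturbed logit."""
    u = random.random()
    g = math.log(u + 1e-9) - math.log(1.0 - u + 1e-9)
    return 1.0 / (1.0 + math.exp(-(logit + g) / temp))

def edit(z, r, A, B, b):
    """t(f(x), r) = f(x) + sum_i r_i (B_i A_i f(x) + b_i): modular low-rank edit."""
    out = list(z)
    for i in range(K):
        low = matvec(A[i], z)   # A_i: RANK x D, down-projection
        hi = matvec(B[i], low)  # B_i: D x RANK, up-projection
        for d in range(D):
            out[d] += r[i] * (hi[d] + b[i][d])
    return out

A = [[[random.gauss(0, 0.1) for _ in range(D)] for _ in range(RANK)] for _ in range(K)]
B = [[[random.gauss(0, 0.1) for _ in range(RANK)] for _ in range(D)] for _ in range(K)]
b = [[0.0] * D for _ in range(K)]
z = [1.0, 0.0, 0.0, 0.0]

# Gate logits would come from the predictor m(f(x), f(x+)); fixed here for illustration.
r = [gumbel_sigmoid(l) for l in (-4.0, 3.0, -4.0)]
l0_penalty = sum(r)  # relaxed L0: expected number of active modules
z_edited = edit(z, r, A, B, b)
```

The relaxed L0 penalty pushes most gates toward zero, so each positive pair is explained by a small number of active edit modules.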
Loss & Training¶
- AdaSSL-V: InfoNCE + KL regularization (\(\beta\) controls strength)
- AdaSSL-S: InfoNCE + L0 sparsity regularization (Gumbel-Sigmoid)
- Compatible with non-contrastive methods such as BYOL
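Putting the pieces together, one AdaSSL-V-style loss evaluation combines InfoNCE over the edited embedding with a KL penalty between posterior and prior over \(\mathbf{r}\). A self-contained sketch with diagonal Gaussians and a linear editing map (all shapes, parameter values, and the stand-in encoder are illustrative assumptions, not the paper's implementation):

```python
import math
import random

random.seed(0)
D, R = 4, 2  # embedding dim, latent dim (toy sizes)

def f(x):
    """Stand-in encoder: L2 normalization of the input vector."""
    n = math.sqrt(sum(v * v for v in x))
    return [v / n for v in x]

def t(z, r, W):
    """Editing function: shifts the embedding by a linear map of r."""
    return [zi + sum(W[i][j] * r[j] for j in range(R)) for i, zi in enumerate(z)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def info_nce(anchor, positive, negatives, tau=0.1):
    """InfoNCE with dot-product similarity: one positive, k negatives."""
    logits = [dot(anchor, positive) / tau] + [dot(anchor, n) / tau for n in negatives]
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return -(logits[0] - log_z)

def kl_diag_gauss(mu_q, var_q, mu_p, var_p):
    """KL(q || p) for diagonal Gaussians: the AdaSSL-V regularizer."""
    return 0.5 * sum(
        math.log(vp / vq) + (vq + (mq - mp) ** 2) / vp - 1.0
        for mq, vq, mp, vp in zip(mu_q, var_q, mu_p, var_p)
    )

x = [random.gauss(0, 1) for _ in range(D)]
xp = [xi + random.gauss(0, 0.1) for xi in x]  # natural positive pair
W = [[random.gauss(0, 0.1) for _ in range(R)] for _ in range(D)]
mu_q, var_q = [0.1] * R, [0.9] * R            # posterior q(r | x, x+)
mu_p, var_p = [0.0] * R, [1.0] * R            # prior p(r | x)
r = [m + math.sqrt(v) * random.gauss(0, 1) for m, v in zip(mu_q, var_q)]
negatives = [f([random.gauss(0, 1) for _ in range(D)]) for _ in range(8)]
beta = 0.1
loss = info_nce(t(f(x), r, W), f(xp), negatives) \
    + beta * kl_diag_gauss(mu_q, var_q, mu_p, var_p)
```

Swapping `info_nce` for a BYOL-style prediction loss changes only the first term, which is how the framework stays compatible with non-contrastive methods.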
Key Experimental Results¶
Main Results¶
| Task / Dataset | Metric | AdaSSL | InfoNCE | AnInfoNCE | H-InfoNCE |
|---|---|---|---|---|---|
| Numerical Heteroscedastic (OOD) | R² | 0.92+ | <0.27 | <0.40 | 0.76 |
| 3DIdent (CRL) | DCI | 0.85+ | 0.72 | 0.74 | 0.78 |
| CelebA Fine-Grained | 40-attr Acc | Best | Lower | Lower | Moderate |
| Moving-MNIST Acceleration | R² | 0.55 (BYOL baseline 0.15) | — | — | — |
Ablation Study¶
| Configuration | Numerical OOD R² | Notes |
|---|---|---|
| AdaSSL-V | 0.92+ | Full variational version |
| AdaSSL-S | 0.90+ | Sparse version, slightly lower but sparser |
| H-InfoNCE | 0.76 | Heteroscedastic but no latent variable |
| InfoNCE | <0.27 | Baseline completely fails |
| AnInfoNCE | <0.40 | Anisotropy insufficient |
Key Findings¶
- Under complex conditional distributions (multimodal + heteroscedastic), InfoNCE and AnInfoNCE completely fail (OOD R² < 0.4), while AdaSSL maintains 0.9+
- Naturally paired data (vs. standard augmentations) significantly improves downstream performance when modeled correctly
- The sparse \(\mathbf{r}\) learned by AdaSSL-S aligns with ground-truth variation factors
- In video world models, AdaSSL captures stochastic acceleration, which BYOL discards
Highlights & Insights¶
- The heteroscedasticity theorem reveals a fundamental limitation of standard SSL—not an empirical observation but a mathematical inevitability
- Generality of latent variable modeling: the same framework is compatible with both contrastive and distillation-based SSL, and applies across numerical, image, and video domains
- The sparse modular editing design (\(\mathbf{r}\) controlling low-rank editing modules) is conceptually analogous to LoRA-style ideas
Limitations & Future Work¶
- AdaSSL-S requires additional treatment when applied to distillation methods (e.g., BYOL)
- The latent variable dimension \(d_r\) must be preset; automatic determination would be preferable
- Large-scale validation is lacking (no ImageNet-scale experiments)
- When the number of modes in multimodal conditional distributions is unknown, the choice of variational prior warrants further investigation
Related Work & Insights¶
- vs. AnInfoNCE: The anisotropic weight \(\Lambda\) is global and does not vary with data. AdaSSL achieves data-adaptive modeling through latent variables
- vs. JEPA/V-JEPA: JEPA assumes \(\mathbf{r}\) is known (e.g., actions); AdaSSL infers \(\mathbf{r}\) from data pairs
- vs. LieSSL: Lie group transformations assume invertible and structured changes; AdaSSL is more flexible
Rating¶
- Novelty: ⭐⭐⭐⭐⭐ Heteroscedasticity theorem + MI lower bound + dual-variant design; theoretically and methodologically deep
- Experimental Thoroughness: ⭐⭐⭐⭐ Multi-task validation (numerical/CRL/image/video), but lacking large-scale comparisons
- Writing Quality: ⭐⭐⭐⭐⭐ Theoretical motivation is clear; the logic from theory to method to experiments flows smoothly
- Value: ⭐⭐⭐⭐ Addresses a fundamental theoretical problem in SSL with strong methodological generality