Equivariance by Contrast: Identifiable Equivariant Embeddings from Unlabeled Finite Group Actions

Conference: NeurIPS 2025
arXiv: 2510.21706
Code: Not provided
Area: Representation Learning / Group Theory
Keywords: Equivariant Representations, Contrastive Learning, Nonlinear ICA, Group Representations, Identifiability

TL;DR

Proposes Equivariance by Contrast (EbC), an encoder-only method that jointly learns equivariant embedding spaces and implicit group representations from paired observations \((\mathbf{y}, g \cdot \mathbf{y})\). It aligns finite group actions with invertible linear mappings in the latent space and provides theoretical guarantees for identifiability.

Background & Motivation

  • In many real-world inference problems, relations between observations are governed by structured transformations: rotation/translation in computer vision, gene knockouts in biology, or sensory stimuli in neuroscience.
  • The goal is to learn equivariant embeddings where the group action corresponds to a linear transformation \(\mathbf{x}' = \mathbf{R}(g)\mathbf{x}\) in the embedding space.
  • Nonlinear ICA provides a theoretical foundation for this problem, but requires additional structural assumptions to make it solvable.
  • Limitations of prior work: CARE is restricted to orthogonal representations, STL yields non-linear rather than linear equivariant relations, and NFT requires learning a generative model.
  • An approach is needed that is encoder-only, free from generative models, devoid of group-specific inductive biases, and applicable to general linear group representations.

Core Problem

How can an encoder \(\phi\) and a group representation \(\mathbf{R}'\) be learned from paired observations \((\mathbf{y}, g \cdot \mathbf{y})\) with unknown group elements \(g\), such that \(\phi(g \cdot \mathbf{y}) = \mathbf{R}'(g)\phi(\mathbf{y})\) holds with identifiability guarantees?

Method

Data Assumptions and Problem Setup

Data is provided in batches, each containing \(n+1\) pairs of samples \(\{(\mathbf{y}_i, \mathbf{y}'_i)\}\), where all pairs in the same batch undergo the same group action \(g\). The data generation process is:

\[\mathbf{y}_i = \mathbf{f}(\mathbf{x}_i), \quad \mathbf{y}'_i = \mathbf{f}(\mathbf{R}(g)\mathbf{x}_i)\]

where \(\mathbf{f}\) is an unknown non-linear mixing function and \(\mathbf{R}: G \to \text{GL}(d, \mathbb{R})\) is a linear representation of the group \(G\) acting on the \(d\)-dimensional latents \(\mathbf{x}_i\).
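The batch structure above can be illustrated with a small simulation. This is a minimal sketch under assumptions that are not from the paper: the group is the cyclic group of eight planar rotations, the mixing function is a hypothetical two-layer map with a leaky ReLU, and the helper names (`rotation`, `make_mixing`, `sample_batch`) are invented for illustration.

```python
import numpy as np

def rotation(theta):
    """2x2 rotation matrix, used here as R(g) for a cyclic rotation group."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

def leaky_relu(h, slope=0.2):
    """Elementwise leaky ReLU, an invertible nonlinearity."""
    return np.where(h > 0, h, slope * h)

def make_mixing(rng, d=2):
    """Hypothetical mixing f: linear map, leaky ReLU, linear map.
    Random Gaussian weight matrices are invertible with probability one."""
    W1 = rng.normal(size=(d, d))
    W2 = rng.normal(size=(d, d))
    return lambda x: leaky_relu(x @ W1.T) @ W2.T

def sample_batch(rng, f, n, group_order=8):
    """Return n+1 observation pairs that all share one group element g."""
    k = rng.integers(group_order)              # index of g, hidden from the learner
    R = rotation(2 * np.pi * k / group_order)  # R(g)
    X = rng.normal(size=(n + 1, 2))            # latents x_i as rows
    return f(X), f(X @ R.T), R                 # y_i = f(x_i), y'_i = f(R(g) x_i)

rng = np.random.default_rng(0)
f = make_mixing(rng)
Y, Y_prime, R = sample_batch(rng, f, n=16)
```

Only `Y` and `Y_prime` would be visible to a learner; `R` is returned here purely so the simulation can be checked.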

Implicit Group Representation Estimation

The group representation matrix is estimated from the encoded sample pairs using least-squares regression:

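\[\hat{\mathbf{R}}(\mathbf{X}, \mathbf{X}') = \arg\min_{\mathbf{R} \in \text{GL}(d)} \|\mathbf{X}' - \mathbf{X}\mathbf{R}^\top\|_F^2, \qquad \hat{\mathbf{R}}^\top = (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{X}'\]

where the rows of \(\mathbf{X}\) and \(\mathbf{X}'\) hold the embeddings of the \(n\) context pairs. These \(n\) pairs are used to estimate \(\hat{\mathbf{R}}\), and the remaining pair serves as the query. The closed form is the ordinary least-squares solution, which assumes \(\mathbf{X}^\top\mathbf{X}\) is invertible.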
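The least-squares step can be sketched in a few lines of NumPy. The function name `estimate_group_matrix` and the test data are illustrative assumptions. With samples stored as rows, the regression \(\mathbf{X}' \approx \mathbf{X}\mathbf{R}^\top\) is solved for \(\mathbf{R}^\top\), and the result is transposed at the end.

```python
import numpy as np

def estimate_group_matrix(X, X_prime):
    """Least-squares estimate of R from rows x_i of X and x'_i ~ R x_i of X_prime."""
    # Solve X @ B ~ X_prime for B = R^T, then transpose to obtain R.
    B, *_ = np.linalg.lstsq(X, X_prime, rcond=None)
    return B.T

rng = np.random.default_rng(0)
d, n = 3, 32
R_true = rng.normal(size=(d, d))   # an arbitrary matrix, invertible almost surely
X = rng.normal(size=(n, d))        # n context embeddings as rows
X_prime = X @ R_true.T             # exact, noise-free transformed embeddings
R_hat = estimate_group_matrix(X, X_prime)
```

In this noise-free setting `R_hat` matches `R_true` up to floating-point error. `np.linalg.lstsq` is used instead of forming \((\mathbf{X}^\top\mathbf{X})^{-1}\) explicitly because it is numerically better behaved when \(\mathbf{X}\) is poorly conditioned.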

Contrastive Learning Objective

The training objective combines the InfoNCE loss with the group structure:

\[p_\phi(\mathbf{y}' \mid \mathbf{y}, \mathbf{Y}, \mathbf{Y}', S) = \frac{\exp(-\|\mathbf{u}_\phi(\mathbf{y}, \mathbf{Y}, \mathbf{Y}') - \phi(\mathbf{y}')\|^2)}{\sum_{\mathbf{y}'' \in S} \exp(-\|\mathbf{u}_\phi(\mathbf{y}, \mathbf{Y}, \mathbf{Y}') - \phi(\mathbf{y}'')\|^2)}\]

where \(\mathbf{u}_\phi\) is the operation that infers the group action from context pairs and applies it to the query sample:

\[\mathbf{u}_\phi(\mathbf{y}, \mathbf{Y}, \mathbf{Y}') = \hat{\mathbf{R}}(\phi(\mathbf{Y}), \phi(\mathbf{Y}'))\phi(\mathbf{y})\]

The final optimization is: \(\min_\phi \mathcal{L}[\phi] = -\mathbb{E}[\log p_\phi(\mathbf{y}' \mid \mathbf{y}, \mathbf{Y}, \mathbf{Y}', S)]\)
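To make the objective concrete, the following sketch evaluates the loss for a single query using embeddings that are already computed. It is a forward-pass illustration only, not the authors' implementation: in training, \(\phi\) would be a neural network and gradients would flow through the least-squares step. The function name `ebc_loss` and the toy data are assumptions.

```python
import numpy as np

def ebc_loss(Z_ctx, Z_ctx_prime, z_query, Z_candidates, target_index):
    """Negative log-probability of the true target among the candidate set S.

    Z_ctx, Z_ctx_prime : (n, d) embeddings of context pairs sharing one group action
    z_query            : (d,)   embedding phi(y) of the query
    Z_candidates       : (k, d) embeddings of all candidates in S
    target_index       : row of Z_candidates that holds phi(y')
    """
    B, *_ = np.linalg.lstsq(Z_ctx, Z_ctx_prime, rcond=None)  # B = R_hat^T
    u = z_query @ B                                  # u = R_hat phi(y), as a row vector
    logits = -np.sum((Z_candidates - u) ** 2, axis=1)  # negative squared distances
    logits = logits - logits.max()                   # shift for numerical stability
    log_prob = logits - np.log(np.sum(np.exp(logits)))
    return -log_prob[target_index]

rng = np.random.default_rng(0)
d, n = 3, 16
R_true = rng.normal(size=(d, d))
Z_ctx = rng.normal(size=(n, d))
z_query = rng.normal(size=d)
negatives = rng.normal(size=(7, d))
candidates = np.vstack([z_query @ R_true.T, negatives])  # row 0 is the true target
loss_true = ebc_loss(Z_ctx, Z_ctx @ R_true.T, z_query, candidates, 0)
loss_wrong = ebc_loss(Z_ctx, Z_ctx @ R_true.T, z_query, candidates, 1)
```

Because the toy embeddings are exactly equivariant, the predicted point `u` coincides with the true target, so `loss_true` is smaller than `loss_wrong`.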

Content-Style Disentanglement

The embedding space is partitioned into an equivariant subspace (\(n\) dimensions) and an invariant subspace (\(m\) dimensions) by constraining the structure of the group representation matrix:

\[\hat{\mathbf{R}}_{n+m}' = \begin{pmatrix} \hat{\mathbf{R}}_n & \mathbf{0} \\ \mathbf{0} & \mathbf{I}_m \end{pmatrix}\]
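The block structure can be illustrated as follows. This is a small sketch with a hypothetical helper `block_representation`; here `n` and `m` are the equivariant and invariant dimensions, as in the equation above.

```python
import numpy as np

def block_representation(R_n, m):
    """Embed an n x n estimate into an (n+m) x (n+m) matrix that acts as the
    identity on the last m (invariant) coordinates."""
    n = R_n.shape[0]
    R = np.eye(n + m)
    R[:n, :n] = R_n
    return R

theta = np.pi / 3
R_n = np.array([[np.cos(theta), -np.sin(theta)],
                [np.sin(theta),  np.cos(theta)]])
R_full = block_representation(R_n, m=3)

z = np.arange(1.0, 6.0)   # first 2 coordinates equivariant, last 3 invariant
z_out = R_full @ z
```

Applying `R_full` transforms only the first `n` coordinates of `z` and leaves the last `m` unchanged, which is what makes that subspace invariant to the group action.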

Identifiability Guarantees

Theorem 1 (Group Representation Identifiability): Under data diversity conditions, an encoder \(\phi\) that minimizes the contrastive objective, with \(\mathbf{h} := \phi \circ \mathbf{f}\):

  • (a) Recovers the original vector space up to linear indeterminacy: \(\mathbf{h}(\mathbf{x}) = \mathbf{L}\mathbf{x}\), \(\mathbf{L} \in \text{GL}(d)\)
  • (b) Recovers the group representation up to conjugacy: \(\hat{\mathbf{R}}(\mathbf{h}(\mathbf{X}), \mathbf{h}(g\mathbf{X})) = \mathbf{L}\mathbf{R}(g)\mathbf{L}^{-1}\)

Corollary (Equivariance): \(\mathbf{h}(g\mathbf{x}) = g\mathbf{h}(\mathbf{x})\), i.e., the encoder strictly preserves equivariance.
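As a short worked step, the corollary follows directly from part (a) of Theorem 1 once the action on latents is written as \(\mathbf{R}(g)\) and the action on embeddings as the conjugate representation from part (b):

\[\mathbf{h}(\mathbf{R}(g)\mathbf{x}) = \mathbf{L}\mathbf{R}(g)\mathbf{x} = \mathbf{L}\mathbf{R}(g)\mathbf{L}^{-1}\,\mathbf{L}\mathbf{x} = \big(\mathbf{L}\mathbf{R}(g)\mathbf{L}^{-1}\big)\,\mathbf{h}(\mathbf{x})\]

The conjugated matrices again form a representation, since \(\big(\mathbf{L}\mathbf{R}(g_1)\mathbf{L}^{-1}\big)\big(\mathbf{L}\mathbf{R}(g_2)\mathbf{L}^{-1}\big) = \mathbf{L}\mathbf{R}(g_1 g_2)\mathbf{L}^{-1}\).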

Key Experimental Results

Comprehensive Results on Synthetic and Image Data

Group \(G\) Data \(R^2(x)\) \(R^2(G)\) Acc(C) Acc(G,5)
\(SO_3\) — InfoNCE non-linear 0.0 0.0 98.9
\(SO_3\) EbC non-linear 99.7 99.7 99.1
\(O_3\) — InfoNCE non-linear 0.0 0.0 99.1
\(O_3\) EbC non-linear 99.8 99.7 99.2
\(GL_3\) — InfoNCE non-linear 0.1 0.0 98.5
\(GL_3\) EbC non-linear 99.8 99.7 98.5
\(R_m \times \mathbb{Z}_n^2\) — InfoNCE idSprites 99.97 0.36
\(R_m \times \mathbb{Z}_n^2\) EbC idSprites 74.04 99.91

Key Findings:

  • EbC achieves latent space and group representation recovery quality of \(R^2 > 99\%\) across all synthetic groups.
  • InfoNCE/LDS/SLDS baselines completely fail to recover the group structure (\(R^2 \approx 0\)), learning only invariant representations.
  • A content-group structure trade-off is observed on idSprites: group structure recovery reaches 99.91%, but content classification drops to 74%.
  • The linear baseline EbC(lin.) degrades severely under non-linear mixing (\(R^2(x)\) of roughly 60% to 70%).
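For readers who want to reproduce a score of this kind, the sketch below computes an \(R^2\) value for latent recovery. It rests on an assumption the summary does not confirm: that \(R^2(x)\) is the coefficient of determination of the best affine map from embeddings to ground-truth latents. The function name `linear_r2` and the toy data are illustrative.

```python
import numpy as np

def linear_r2(Z, X):
    """R^2 of the best affine map from embeddings Z (rows) to true latents X (rows)."""
    Z1 = np.hstack([Z, np.ones((Z.shape[0], 1))])   # append an intercept column
    W, *_ = np.linalg.lstsq(Z1, X, rcond=None)      # least-squares regression weights
    ss_res = np.sum((X - Z1 @ W) ** 2)              # residual sum of squares
    ss_tot = np.sum((X - X.mean(axis=0)) ** 2)      # total sum of squares
    return 1.0 - ss_res / ss_tot

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 3))                      # ground-truth latents
L = rng.normal(size=(3, 3))
r2_linear = linear_r2(X @ L.T, X)                   # embeddings equal L x
r2_noise = linear_r2(rng.normal(size=(2000, 3)), X) # embeddings unrelated to x
```

Under this definition, embeddings that are an invertible linear function of the latents score close to 1, while embeddings unrelated to the latents score close to 0, mirroring the contrast between EbC and the baselines reported above.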

Model Robustness

  • When the embedding dimension is over-specified: group structure kNN accuracy remains >99%, and content classification stabilizes at >80%.
  • Dimension misspecification (with true dimension as 3): Acc(G) exhibits a clear peak at the correct dimension \(d=3\), serving as a reliable basis for hyperparameter selection.
  • Performance remains robust as the mixing layer depth increases up to 4 layers, after which it begins to degrade.

Highlights & Insights

  • ⭐ First encoder-only method to achieve general linear group representation learning without requiring generative models or group-specific inductive biases, covering non-Abelian groups like \(O(n), GL(n)\).
  • ⭐ Elegant design of implicit group representation: directly estimating the group matrix from embedding pairs via least squares. The encoder and the group representation are unified through a single \(\phi\).
  • ⭐ Theoretical completeness: proves linear identifiability and group representation identifiability from the contrastive/discriminative framework of nonlinear ICA.
  • The Acc(G) metric can be utilized without access to true latent variables, offering a practical model choice criterion.

Limitations & Future Work

  • The drop in content classification accuracy on idSprites (74%) indicates remaining room for improvement in content-style separation.
  • The theoretical assumptions require the data to satisfy "sufficient diversity" conditions; its applicability to low-data regimes remains to be validated.
  • Currently, validation is limited to finite groups; extending this to continuous groups (such as continuous \(SO(3)\)) is not addressed.
  • The data generation process does not incorporate noise (the theorems assume exact group actions), and robustness to noise is only preliminarily explored in the appendix.
  • Extensive evaluation on real-world visual datasets beyond idSprites is left for future work.

Comparison with prior methods:

| Method | Type | Group Representation | Requires Generative Model | Identifiability |
|---|---|---|---|---|
| CARE | Encoder | Orthogonal (Hyperspherical) | No | Limited |
| STL | Encoder | Non-linear Equivariant | No | No |
| NFT | Generative + Encoder | General Linear | Yes | Yes |
| EbC (Ours) | Encoder-only | General Linear (GL) | No | Yes (Linear Indeterminacy) |

Insights & Connections

  • The least-squares estimation of the group matrix is highly practical for scenarios with known paired data, with potential applications in physical simulation, robotics, etc.
  • The combination of "contrastive learning + group structure" provides a novel regularization tool for self-supervised representation learning.
  • Choosing the embedding dimension at the peak of Acc(G) is a hyperparameter-selection strategy that may transfer to other latent-dimension selection problems.
  • The connection to nonlinear ICA identifiability theory establishes a solid theoretical foundation for equivariant representation learning.

Rating

  • ⭐ Novelty: 9/10 — The emergence of equivariance and group representations from contrastive learning is a novel perspective with outstanding theoretical contributions.
  • ⭐ Experimental Thoroughness: 7/10 — Synthetic experiments are comprehensive, but empirical evaluation on visual data is limited to idSprites, lacking more realistic scenarios.
  • ⭐ Writing Quality: 8/10 — Clear theoretical derivations, consistent notation, and intuitive illustrations.
  • ⭐ Value: 8/10 — Establishes a new paradigm for equivariant representation learning, with theoretical significance outweighing immediate practical applications.