Learning (Approximately) Equivariant Networks via Constrained Optimization¶
Conference: NeurIPS 2025 arXiv: 2505.13631 Code: None Area: Machine Learning Theory / Equivariant Neural Networks Keywords: equivariance, constrained optimization, homotopy methods, approximate symmetry, dual methods
TL;DR¶
This paper proposes ACE (Adaptive Constrained Equivariance), a framework that formulates equivariant neural network training as a constrained optimization problem. Via dual methods, ACE automatically and progressively transitions from a flexible non-equivariant model to an equivariant one, adapting to both fully and partially symmetric data without manual hyperparameter tuning.
Background & Motivation¶
Equivariant neural networks encode symmetries architecturally to improve generalization and sample efficiency, yet their training faces three core challenges:
- Complex loss landscapes: Equivariance constraints complicate the loss landscape even when data are fully symmetric, slowing optimization.
- Partially symmetric real-world data: Effects such as noise, measurement bias, and dynamical phase transitions break perfect symmetry, causing strictly equivariant models to underfit.
- Manual tuning burden: Existing relaxation methods (e.g., penalty weights in REMUL, annealing schedules in PennPaper) require extensive domain-specific tuning.
Limitations of prior work:
- REMUL: Adds an equivariance penalty to the loss and adaptively adjusts weights \(\alpha, \beta\), but provides no guarantee on the degree of equivariance in the final solution.
- PennPaper: Manually decreases the perturbation parameter \(\gamma\) to zero, but is sensitive to the chosen schedule and introduces additional hyperparameters via a Lie-derivative penalty.
Method¶
Overall Architecture¶
A homotopy architecture \(f_{\theta,\gamma} = f_{\theta,\gamma}^L \circ \cdots \circ f_{\theta,\gamma}^1\) is constructed, where each layer is defined as \(f_{\theta,\gamma}^i = f_\theta^{\text{eq},i} + \gamma_i f_\theta^{\text{neq},i}\). When \(\gamma = 0\) the model is equivariant; when \(|\gamma_i| > 0\) deviations are permitted.
Training is formulated as a constrained optimization problem, and a dual method (gradient descent–ascent) is employed to automatically adjust \(\gamma\).
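The homotopy construction can be made concrete with a small sketch. This is an illustrative toy, not the paper's implementation: the equivariant branch is a DeepSets-style map \(a x + b\,\mathrm{mean}(x)\mathbf{1}\) (which commutes with coordinate permutations), and the non-equivariant branch is an arbitrary dense matrix; both choices are assumptions for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5

# Equivariant branch: a DeepSets-style map a*x + b*mean(x)*1, which
# commutes with any permutation of the coordinates.
a, b = 0.7, 0.3
def f_eq(x):
    return a * x + b * x.mean() * np.ones_like(x)

# Non-equivariant branch: an arbitrary dense matrix (generically breaks
# permutation symmetry).
W = rng.normal(size=(d, d))
def f_neq(x):
    return W @ x

def layer(x, gamma):
    """Homotopy layer f_{theta,gamma} = f_eq + gamma * f_neq."""
    return f_eq(x) + gamma * f_neq(x)

# At gamma = 0 the layer is exactly permutation-equivariant; at gamma = 1
# the dense branch introduces a measurable violation.
x = rng.normal(size=d)
P = np.roll(np.eye(d), 1, axis=0)  # cyclic-shift permutation matrix
err0 = np.linalg.norm(P @ layer(x, 0.0) - layer(P @ x, 0.0))
err1 = np.linalg.norm(P @ layer(x, 1.0) - layer(P @ x, 1.0))
```

Stacking `layer` with per-layer \(\gamma_i\) recovers the full architecture \(f_{\theta,\gamma}^L \circ \cdots \circ f_{\theta,\gamma}^1\).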
Key Designs¶
- Equality-constraint scheme (fully symmetric data, Algorithm 1):
- Function: Constrains \(\gamma_i = 0\) via dual variables \(\lambda_i\) that automatically govern the transition rate.
- Mechanism: Initialize \(\gamma^{(0)} = 1\) (non-equivariant); dual variables \(\lambda_i\) grow continuously whenever \(\gamma_i > 0\), progressively pushing the model toward equivariance. Key updates: \(\gamma_i^{(t+1)} = \gamma_i^{(t)} - \eta_p(\nabla_{\gamma_i} J_0^{(t)} + \lambda_i^{(t)})\), \(\lambda_i^{(t+1)} = \lambda_i^{(t)} + \eta_d \gamma_i^{(t)}\).
- Design Motivation: The dual method is equivalent to adaptive annealing—tightening the equivariance constraint at a rate governed by its actual effect on downstream performance.
- Elastic inequality-constraint scheme (partially symmetric data, Algorithm 2):
- Function: Replaces equality constraints with \(|\gamma_i| \leq u_i\), where slack variables \(u_i\) are also optimized.
- Mechanism: A term \(\frac{\rho}{2}\|u\|^2\) is added to the objective to penalize large slacks. The optimum satisfies \(u_i^* = \lambda_i^*/\rho\), so layers where the equivariance constraint is most costly to satisfy accrue larger \(\lambda_i\) and are granted proportionally larger slack \(u_i\). Projected update: \(\lambda_i^{(t+1)} = [\lambda_i^{(t)} + \eta_d(|\gamma_i^{(t)}| - u_i^{(t)})]_+\).
- Design Motivation: When data are partially symmetric, \(\gamma_i\) does not vanish in certain layers; the magnitude of the dual variable automatically identifies which layers require relaxed equivariance.
- Theoretical guarantees (Theorem 4.1 & 4.2):
- Function: Provide explicit bounds on the approximation error introduced by truncating \(\gamma\) to zero at deployment, and on the residual equivariance violation of the trained model.
- Mechanism: Thm 4.1 — \(\|f_{\theta,\gamma}(x) - f_{\theta,0}(x)\| \leq [\sum_{k=0}^{L-1}(1+\bar{\gamma})^k] \bar{\gamma} B M^{L-1} \|x\|\); Thm 4.2 — \(\|\rho_Y(g)f_{\theta,\gamma}(x) - f_{\theta,\gamma}(\rho_X(g)x)\| \leq 2\bar{\gamma}(M + C\bar{\gamma})^{L-1}LB^2\|x\|\).
- Design Motivation: These bounds guarantee that when \(\gamma_i\) is sufficiently small, the error incurred by truncating to an equivariant model remains controlled.
Loss & Training¶
Lagrangian for the equality-constraint formulation: \(\hat{L}(\theta, \gamma, \lambda) = \frac{1}{N}\sum_{n=1}^N \ell_0(f_{\theta,\gamma}(x_n), y_n) + \sum_{i=1}^L \lambda_i \gamma_i\)
Crucially, no equivariance penalty is used (\(\beta = 0\)); the method relies entirely on the constraint and dual updates. Only two learning rates \(\eta_p, \eta_d\) and the elastic constant \(\rho = 1\) are required.
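An end-to-end toy illustrates the Lagrangian training dynamics. Everything here is a hypothetical minimal instance: the data (a sign-equivariant target \(y = 2x\)), the model \(f(x) = wx + \gamma\) (a linear equivariant branch plus a symmetry-breaking constant offset), and the step sizes are all illustrative choices, not the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(1)
# Fully symmetric toy data: y = 2x is equivariant under the sign flip
# (x, y) -> (-x, -y).
x = rng.normal(size=256)
y = 2.0 * x

# Toy homotopy model: f(x) = w*x (sign-equivariant) + gamma*1 (a constant
# offset that breaks the symmetry).
w, gamma, lam = 0.0, 1.0, 0.0
eta_p, eta_d = 0.05, 0.05

for _ in range(500):
    r = w * x + gamma - y                     # residuals
    grad_w = 2.0 * np.mean(r * x)             # d(MSE)/dw
    grad_gamma = 2.0 * np.mean(r)             # d(MSE)/dgamma
    # Primal descent on L = J0 + lam*gamma, dual ascent on lam;
    # no equivariance penalty term appears anywhere.
    w -= eta_p * grad_w
    gamma -= eta_p * (grad_gamma + lam)
    lam += eta_d * gamma
# w converges to the equivariant solution and gamma is driven toward 0
# automatically, with no annealing schedule.
```

Because the data are fully symmetric, the dual variable tightens the constraint exactly as fast as the downstream loss tolerates, matching the "adaptive annealing" interpretation above.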
Key Experimental Results¶
Main Results (Tables)¶
CMU MoCap motion prediction MSE (\(\times 10^{-2}\)):
| Model | Run | Walk |
|---|---|---|
| EGNN | 50.9±0.9 | 28.7±1.6 |
| EGNO (original) | 33.9±1.7 | 8.1±1.6 |
| EGNO + ACE (equality) | improved | improved |
| EGNO + ACE (elastic inequality) | best | best |
N-Body physical simulation: SEGNN + ACE outperforms standard SEGNN in both validation MSE and sample efficiency.
Ablation Study¶
- The equality-constraint scheme (Alg. 1) improves convergence trajectories on fully symmetric data: flexible early exploration followed by progressive tightening toward equivariance.
- The elastic inequality scheme (Alg. 2) maintains partial equivariance while improving performance on noisy or symmetry-broken data.
- Equivariance error is verified to approach zero progressively during training (Figure 4).
- The theoretical bound (Thm 4.2) is consistent with observed trends in equivariance violation.
Key Findings¶
- ACE consistently improves performance across multiple architectures (SEGNN, EGNN, EGNO, p4m-CNN) and tasks (N-Body, motion prediction, image classification).
- On fully symmetric data, the advantage of ACE stems from smoothing the optimization landscape.
- On partially symmetric data, ACE automatically identifies which layers require relaxed equivariance (those with larger \(\lambda_i\)).
- Sample efficiency gains are substantial: ACE achieves lower error for the same number of samples.
- Robustness to input perturbations is also improved.
Highlights & Insights¶
- No manual tuning required: The equivariance transition is fully automatic, requiring no schedules, penalty weights, or domain knowledge.
- Theory–practice consistency: Theoretical bounds accurately predict observed experimental behavior.
- High generality: Applicable to any differentiable architecture \(f_{\theta,\gamma}\) where \(f_{\theta,0}\) is equivariant.
- The deep connection between dual methods and homotopy/simulated annealing provides an optimization-theoretic perspective on the approach.
Limitations & Future Work¶
- \(\gamma_i\) does not reach exactly zero in finite iterations, necessitating a final truncation step that incurs the error characterized by Thm 4.1.
- The non-equivariant branch \(f_\theta^{\text{neq},i}\) increases parameter count and computational cost.
- Scalability to large models (e.g., large GNNs) has not been verified.
- Convergence guarantees for the dual method in non-convex settings rely on a "sufficiently expressive" parameterization assumption.
- Only discrete groups (e.g., p4m) and continuous groups (e.g., SE(3)) have been tested; applicability to larger groups remains to be explored.
Related Work & Insights¶
- REMUL (EquivMTL): A penalty-based approach requiring tuning of \(\alpha, \beta\) with no equivariance guarantee; ACE instead enforces equivariance as an explicit constraint via dual methods.
- PennPaper: Relies on manual schedules and a Lie-derivative penalty; ACE automates this process via the dual formulation.
- Residual Pathway Priors: Adds an invariant branch to increase flexibility, but lacks a progressive tightening mechanism.
- The constrained optimization perspective of ACE generalizes naturally to other structured neural network constraints (e.g., sparsity, low-rank).
Rating¶
⭐⭐⭐⭐ — Contributes both theoretically and methodologically, providing a general, tuning-free solution for training equivariant networks with comprehensive experimental validation.