Tunable Soft Equivariance with Guarantees

Conference: CVPR 2026
Paper: CVF Open Access
Code: To be confirmed
Area: Representation Learning / Equivariance Theory
Keywords: Soft Equivariance, Equivariant Error Bounds, Lie Algebra Projection, Schur Decomposition, Pre-trained Model Adaptation

TL;DR

This paper proposes an architecture-agnostic "soft equivariant" framework that projects the weights of any pre-trained model onto a subspace determined by the Lie algebra representation of a group. A truncation threshold \(b\) tunes the model continuously from "fully equivariant" to "fully non-equivariant," and every setting comes with a provable upper bound on the equivariant error. The method improves accuracy and reduces equivariant error at the same time on ImageNet classification, segmentation, and trajectory prediction.

Background & Motivation

Background: Equivariance (where input transformations lead to predictable output transformations) is a fundamental inductive bias in computer vision. For example, in segmentation tasks, if an object translates, the predicted mask should also translate. Specially designed group-equivariant networks (Group Convolutions, Equivariant Transformers/GNNs) are theoretically elegant and effective in specific tasks but are rarely adopted in mainstream large-scale models.

Limitations of Prior Work: Real-world data is often only "approximately equivariant"; strictly imposing equivariance can sacrifice model expressivity. "Soft equivariance" approaches have emerged, but existing methods have major drawbacks: (1) Data augmentation and regularization (penalizing equivariant error in the loss) provide no guarantees for final equivariance after training; (2) Mixing non-equivariant branches into equivariant backbones (e.g., RPP) allows for a trade-off but still offers no guarantees and relies on specialized architectures, making it difficult to apply to off-the-shelf models like ViT or ResNet.

Key Challenge: A trade-off exists between expressivity and equivariance. Current methods are either uncontrollable, unprovable, or non-general—there is no way to take a pre-trained ViT and "precisely inject tunable and bounded equivariance while preserving its power."

Goal: Construct a soft equivariant layer that satisfies three criteria: (a) compatible with any pre-trained model without adding parameters; (b) continuously tunable softness; (c) provides a theoretical upper bound on equivariant error at any level of tuning.

Key Insight: The authors start from the specific case where anti-aliasing (blurring) in CNNs improves translation invariance (Zhang 2019). Signal processing shows that low-pass filtering is equivalent to "projecting onto a band-limited subspace," which remains band-limited under translation. Thus, blurring is essentially a projection operator. The authors generalize this "projection = soft equivariance" perspective from translation groups to any compact connected Lie group.

Core Idea: Use "generalized blurring," that is, projecting weights into a subspace determined by the group action, to achieve soft equivariance. The number of preserved directions is determined by a truncation threshold \(b\). A smaller \(b\) makes the model more equivariant, and the projection structure admits a closed-form bound on the equivariant error.

Method

Overall Architecture

The method can be summarized in one sentence: Given a group \(G\) and a pre-trained layer, compute the singular-value structure (or eigenstructure) of the group's Lie algebra representation offline, construct a projection operator from it, and force the layer's learnable parameters into the resulting subspace during training. The pipeline is "Group Structure → Projection Operator → Apply to any linear layer." It adds no learnable parameters and can be integrated into ViT patch embeddings, positional encodings, convolutions, and fully connected layers for point features; pointwise non-linearities like ReLU are inherently equivariant and require no modification.

The authors first redefine "soft equivariance" as a scale-invariant relative quantity (Key Design 1). They then use the SVD of the Lie algebra representation of a continuous group to construct the projection operator and prove error bounds (Key Design 2). For groups whose Lie algebra representations are normal matrices, such as rotations, they use the Schur decomposition to reduce the construction cost from \(O((d d')^3)\) to \(O(\max(d,d')^3)\) (Key Design 3). Finally, they extend the theory to discrete groups using "group forward difference operators" (Key Design 4). The truncation threshold \(b\) is the sole knob for softness and can be tuned on a validation set.

Key Designs

1. \(\epsilon\)-Soft Equivariance: Defining a scale-invariant, interpretable error via Jacobian normalization

Prior works define soft equivariance as an absolute error constraint \(\lVert F(\rho_X(g)x) - \rho_Y(g)F(x)\rVert \le \epsilon\), but this value drifts with the scale of \(F(x)\), making \(\epsilon\) uninterpretable. This paper changes it to a relative metric:

\[\frac{\lVert F(\rho_X(g)x) - \rho_Y(g)F(x)\rVert}{\lVert J_F(x)\rVert_F \,\lVert x\rVert} \le \epsilon, \quad \forall g\in G,\ x\in X.\]

Here \(J_F(x)\) is the Jacobian of \(F\) at \(x\), and its Frobenius norm \(\lVert J_F(x)\rVert_F\) measures the local output sensitivity. Intuitively, the denominator normalizes the "violation" by the model's own local scale at that point. Thus, \(\epsilon\) measures equivariance breakdown relative to the model's internal changes, making it comparable across tasks and models. When \(\rho_Y\) is the identity, it reduces to soft invariance. This definition is the foundation for all subsequent error bounds.
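To make the normalization concrete, the following is a minimal sketch that evaluates the left-hand side of the definition for a single \(g\) and a single \(x\). The finite-difference Jacobian, the toy linear layer `W`, and all function names are assumptions made for illustration; the source does not say how the authors evaluate the quantity in practice.

```python
import numpy as np

def jacobian_fd(F, x, h=1e-6):
    """Central finite-difference Jacobian of F at x, shape (out_dim, in_dim)."""
    x = np.asarray(x, dtype=float)
    cols = []
    for j in range(x.size):
        e = np.zeros_like(x)
        e[j] = h
        cols.append((F(x + e) - F(x - e)) / (2.0 * h))
    return np.stack(cols, axis=1)

def relative_equivariance_error(F, rho_X, rho_Y, x):
    """Left-hand side of the epsilon-soft definition for one g and one x."""
    violation = np.linalg.norm(F(rho_X @ x) - rho_Y @ F(x))
    local_scale = np.linalg.norm(jacobian_fd(F, x), ord="fro") * np.linalg.norm(x)
    return violation / local_scale

def rotation(theta):
    """2D rotation matrix, used here as both rho_X(g) and rho_Y(g)."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

# Toy layer (hypothetical): a rotation-commuting part plus a small
# symmetry-breaking part.
W_eq = np.array([[1.0, -0.3], [0.3, 1.0]])
W = W_eq + 0.1 * np.array([[1.0, 0.0], [0.0, -1.0]])

F = lambda x: W @ x
F_scaled = lambda x: 10.0 * (W @ x)   # same layer, output rescaled by 10
F_exact = lambda x: W_eq @ x          # exactly equivariant layer

R = rotation(0.4)
x = np.array([0.7, -1.2])
err = relative_equivariance_error(F, R, R, x)
err_scaled = relative_equivariance_error(F_scaled, R, R, x)
err_exact = relative_equivariance_error(F_exact, R, R, x)
```

Rescaling the output by 10 multiplies both the violation and the Jacobian norm by 10, so `err` and `err_scaled` agree, which is the scale invariance the definition is designed to provide. An absolute error would differ by a factor of 10 between the two.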

2. Lie Algebra Projection Operator: Restricting weights to subspaces with "low group action sensitivity"

This is the core of the paper. For a fully connected invariant layer \(y=w^\top x\), the authors do not learn \(w\) directly. Instead, they let \(w = B_{\text{inv}}\theta\), where \(\theta\) are the learnable parameters and \(B_{\text{inv}}\) is a fixed projection operator. This is derived by taking the Lie algebra representation \(\bar A = d\rho_X(A)\) (infinitesimal generators characterizing first-order action near identity), performing SVD \(\bar A = U\Sigma V^\top\) with singular values in ascending order, and keeping only those left-singular vectors whose singular values are below threshold \(b\):

\[B_{\text{inv}} = \sum_{i:\,\sigma_i < b} u_i u_i^\top.\]

Directions with large singular values are those most "agitated" by the group action. Filtering them out keeps the weights in a subspace that is insensitive to the group action; this is the "generalized blurring." The error bound (Claim 1) is \(\epsilon_b = b\sqrt{n_G}\, r_G + \delta_G\), where \(n_G\) is the number of generators, \(r_G\) is the injectivity radius, and \(\delta_G\) is the remainder of the first-order Taylor expansion. For equivariant (rather than invariant) layers, the constraints are combined via Kronecker products into the matrix \(L = d\rho_X(A)^\top\!\otimes I - I\otimes d\rho_Y(A)\), which acts on the vectorized weight matrix, and \(B_{\text{eq}}\) is constructed from the right-singular vectors of \(L\) with singular values below \(b\), yielding a similar bound (Claim 2). The construction requires no specific architecture and no new parameters.
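The two constructions can be sketched in NumPy as follows. This is an illustration under stated assumptions, not the authors' code: it handles a single generator, uses a column-major `vec` convention for \(L\), and the function names and toy generators (a spectral derivative for circular translation, the so(2) generator for planar rotation) are chosen here for the example.

```python
import numpy as np

def invariant_projector(A_bar, b):
    """B_inv = sum of u_i u_i^T over left-singular vectors with sigma_i < b."""
    U, s, _ = np.linalg.svd(A_bar)
    U_keep = U[:, s < b]          # mask, so NumPy's descending order is irrelevant
    return U_keep @ U_keep.T

def equivariant_projector(A_X, A_Y, b):
    """Projector on vec(W) (column-major) from right-singular vectors of L."""
    d, d_out = A_X.shape[0], A_Y.shape[0]
    # L @ vec(W) == vec(W @ A_X - A_Y @ W) for W of shape (d_out, d).
    L = np.kron(A_X.T, np.eye(d_out)) - np.kron(np.eye(d), A_Y)
    _, s, Vt = np.linalg.svd(L)
    V_keep = Vt[s < b].T
    return V_keep @ V_keep.T

# Invariant case: generator of circular translation on length-n signals,
# built as a spectral derivative (odd n keeps the matrix real).
n = 9
omega = 2.0 * np.pi * np.fft.fftfreq(n)
dft = np.fft.fft(np.eye(n), axis=0)
A_shift = np.fft.ifft(1j * omega[:, None] * dft, axis=0).real

b = 1.0
B_inv = invariant_projector(A_shift, b)

# The same projector written as an ideal low-pass filter with cutoff b.
low_pass = np.fft.ifft((np.abs(omega) < b)[:, None] * dft, axis=0).real

rng = np.random.default_rng(0)
w = B_inv @ rng.normal(size=n)    # w = B_inv @ theta
first_order_drift = np.linalg.norm(A_shift.T @ w)

# Equivariant case: planar rotation acting on both input and output.
A_rot = np.array([[0.0, -1.0], [1.0, 0.0]])
B_eq = equivariant_projector(A_rot, A_rot, b)
W = rng.normal(size=(2, 2))
W_proj = (B_eq @ W.reshape(-1, order="F")).reshape(2, 2, order="F")
```

For translation, `B_inv` coincides with a low-pass filter, which recovers the blurring picture that motivates the method. In the rotation example the singular values of \(L\) are 0, 0, 2, 2, so any \(b\) between 0 and 2 keeps exactly the weights that commute with the generator.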

3. Efficient Implementation with Schur Decomposition: Reducing costs for normal groups

The complexity of SVD on \(L\) is \(O((d\cdot d')^3)\). While computed only once before training, it becomes prohibitive as \(d\cdot d'\) grows (e.g., 15 minutes for 14×14 inputs). The authors observe that when the Lie algebra representation is a normal matrix (commuting with its conjugate transpose, as in 2D/3D rotations), one can use real Schur decomposition \(d\rho_X = U_X\Lambda_X U_X^\top\) and \(d\rho_Y = U_Y\Lambda_Y U_Y^\top\). \(\Lambda\) is a block-diagonal matrix of \(1\times1\) or \(2\times2\) blocks, reducing complexity to \(O(\max(d,d')^3)\). Weights are shifted to the Schur basis \(\Delta' = U_Y^\top \Delta U_X\), and projection \(B_{\text{Schur}}\) is applied block-wise. This reduced the 14×14 SVD time from 15 minutes to under 1 second (Tab. 6).
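The reason the cost can drop to the size of the two small factors can be checked numerically. The sketch below is not the real Schur construction described above: as a stated simplification it uses complex eigenvalues from `np.linalg.eigvals`, and it only verifies that the singular values of \(L\) are determined by the spectra of the two generators. The toy generators and variable names are assumptions for illustration, and the block-wise projection \(B_{\text{Schur}}\) itself is not reproduced.

```python
import numpy as np

# Toy normal generators (both skew-symmetric): rotation about the z-axis
# acting on 3D inputs (d = 3) and on 2D outputs (d' = 2).
A_X = np.array([[0.0, -1.0, 0.0],
                [1.0,  0.0, 0.0],
                [0.0,  0.0, 0.0]])
A_Y = np.array([[0.0, -1.0],
                [1.0,  0.0]])
d, d_out = A_X.shape[0], A_Y.shape[0]

# Dense route: SVD of the (d * d') x (d * d') Kronecker matrix.
L = np.kron(A_X.T, np.eye(d_out)) - np.kron(np.eye(d), A_Y)
sv_dense = np.sort(np.linalg.svd(L, compute_uv=False))

# Factored route: only the d x d and d' x d' matrices are decomposed.
# For normal generators, L is normal, so its singular values are the
# absolute values of its eigenvalues lambda_i(A_X) - mu_j(A_Y).
lam_X = np.linalg.eigvals(A_X)
lam_Y = np.linalg.eigvals(A_Y)
sv_factored = np.sort(np.abs(lam_X[:, None] - lam_Y[None, :]).ravel())
```

Both routes give the singular values 0, 0, 1, 1, 2, 2 here, but the factored route never forms or decomposes the large matrix, which is where the \(O(\max(d,d')^3)\) scaling comes from.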

4. Discrete Group Extension: Using group forward difference operators

The previous designs rely on Lie algebra representations. Discrete groups (e.g., finite rotation groups) lack Lie algebras. The authors introduce the "group forward difference operator" \(\Delta_s f(g) = f(sg) - f(g)\) as a discrete analogue. Based on this, they provide a first-order Taylor approximation for discrete groups (Lemma 2): \(\hat f(g) = f(e) + \sum_i n_{s_i}\Delta_{s_i}f(e)\), where the error is bounded by the word metric \(d_S\) and Lipschitz constant \(h\). By replacing \(d\rho\) with \(\Delta_s\), the projection operators and error bounds transfer directly to discrete groups.
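A small sketch of the substitution for the cyclic group of 90° rotations acting on 2×2 images follows. Reading the discrete stand-in for \(d\rho\) as \(\rho(s) - I\), the forward difference applied to the representation, is an assumption made here, as are the helper name `rot90_matrix` and the threshold value. Lemma 2 and its error bound are not reproduced.

```python
import numpy as np

def rot90_matrix(side):
    """Permutation matrix rho(s) of a 90-degree rotation on flattened images."""
    n = side * side
    P = np.zeros((n, n))
    for j in range(n):
        e = np.zeros(n)
        e[j] = 1.0
        P[:, j] = np.rot90(e.reshape(side, side)).ravel()
    return P

P = rot90_matrix(2)
Delta_s = P - np.eye(4)            # assumed discrete analogue of d rho_X(A)

# Same recipe as the Lie algebra case: keep left-singular vectors below b.
U, s, _ = np.linalg.svd(Delta_s)
b = 1.0                            # singular values here are 0, sqrt(2), sqrt(2), 2
U_keep = U[:, s < b]
B_inv = U_keep @ U_keep.T

rng = np.random.default_rng(0)
w = B_inv @ rng.normal(size=4)     # w = B_inv @ theta
x = rng.normal(size=4)
out_original = w @ x
out_rotated = w @ (P @ x)
```

With this threshold only the constant direction survives, so the projected layer averages the four pixels and is exactly invariant to the rotation. Raising \(b\) above \(\sqrt{2}\) would readmit two more directions and trade invariance for flexibility.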

Loss & Training

The framework does not change the training objective. It replaces the weights of the modified layers with "projection × learnable parameters" and then performs standard fine-tuning. The truncation threshold \(b\) is the only knob: a smaller \(b\) keeps fewer directions and is more equivariant, while a larger \(b\) is more flexible. Smooth thresholding was found to perform better than hard truncation (Tab. 7).
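The "projection × learnable parameters" reparameterization can be sketched with plain gradient descent on a toy regression problem. Everything specific here is an assumption for illustration: the synthetic data, the hypothetical pre-trained weight `w_pre`, initializing \(\theta\) from it, hard truncation (the smooth variant is not specified in this summary), and the learning rate.

```python
import numpy as np

rng = np.random.default_rng(0)

# Fixed projector for rotations about the z-axis acting on 3D inputs.
A_z = np.array([[0.0, -1.0, 0.0],
                [1.0,  0.0, 0.0],
                [0.0,  0.0, 0.0]])
U, s, _ = np.linalg.svd(A_z)
U_keep = U[:, s < 0.5]             # singular values are 1, 1, 0
B = U_keep @ U_keep.T              # keeps only the z direction

# Synthetic targets that are mostly, but not exactly, rotation-invariant.
X = rng.normal(size=(256, 3))
y = 2.0 * X[:, 2] + 0.3 * X[:, 0]

# Hypothetical pre-trained weight; theta starts from it (an assumption).
w_pre = np.array([0.5, -0.2, 1.0])
theta = w_pre.copy()

lr = 0.1
for _ in range(200):
    w = B @ theta                              # effective weight
    grad_w = X.T @ (X @ w - y) / len(y)        # least-squares gradient
    theta -= lr * (B.T @ grad_w)               # chain rule through w = B @ theta

w = B @ theta

# Invariance check on a rotated input.
c, sn = np.cos(0.7), np.sin(0.7)
R_z = np.array([[c, -sn, 0.0], [sn, c, 0.0], [0.0, 0.0, 1.0]])
x_test = rng.normal(size=3)
```

The loss and optimizer are untouched; only the weight is routed through the fixed projector, so the effective weight stays in the retained subspace at every step regardless of what the data prefer.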

Key Experimental Results

Main Results

On ImageNet-1K, Ours achieves the highest accuracy and the lowest (or tied-lowest) invariance error across three backbones (two are listed below), without the accuracy drop seen with canonicalizers (iErr is reported in units of \(\times10^{-2}\)):

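| Backbone | Method | Acc↑ | aAcc↑ | cAcc↑ | iErr↓ |
| --- | --- | --- | --- | --- | --- |
| ViT | Base | 81.67 | 77.29 | 79.40 | 0.36 |
| ViT | Canon. | 76.51 | 75.81 | 76.15 | 0.15 |
| ViT | Ours | 82.28 | 80.56 | 81.40 | 0.15 |
| DINOv2 | Base | 84.27 | 82.82 | 83.52 | 0.13 |
| DINOv2 | Ours | 85.31 | 84.44 | 84.87 | 0.05 |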

Semantic segmentation (PASCAL VOC) and human trajectory prediction (ETH/UCY) showed similar trends. In trajectory prediction, Ours outperformed the "fully equivariant" EqAuto—strict equivariance actually hurt accuracy, whereas Ours provided a better balance.

Ablation Study

| Configuration | Key Metric | Note |
| --- | --- | --- |
| SVD (14×14) | ~890 s | Extremely slow for large dimensions |
| Schur (14×14) | 0.25 s | Equivalent construction for normal generators; ~3500× faster |
| Hard Threshold | mIoU 73.92 | Direct truncation |
| Smooth Threshold | mIoU 74.78 | Better accuracy and lower error |

Key Findings

  • No trade-off on ImageNet: Accuracy and invariance error improved simultaneously, which is rare for soft equivariance. Projecting to group-insensitive subspaces seems to act as a robust stabilizer.
  • Strict equivariance isn't always better: In trajectory prediction, fully equivariant models (EqAuto) performed worse than Ours, confirming that real data is only "approximately" equivariant.
  • Schur is the key to scalability: It makes the framework usable for large dimensions.
  • Limitations of canonicalizers: Rotations move corner pixels out of view, which causes the canonicalization networks to drift; they fail to reach zero error and lose significant accuracy (-5 to -10 points).

Highlights & Insights

  • Generalized Anti-Aliasing: Generalizes the empirical observation that blurring improves invariance into a formal theory for any compact connected Lie group.
  • Architecture-Agnostic & Zero Param Overhead: Can be applied to existing ViT/ResNet/DINOv2 weights directly, unlike previous soft methods requiring specific equivariant backbones.
  • Provable Error Bounds: Turns "softness" into an interpretable knob \(b\) tied to an error bound that grows linearly in \(b\).
  • Portable Methodology: The paradigm of using the spectral structure of group actions to constrain weights is highly transferable to point clouds or generative models.

Limitations & Future Work

  • First-order Taylor remainder \(\delta_G\): The bound includes a remainder term that is not controlled by \(b\); for large transformations, the bound may become loose.
  • Normal Matrix Dependency: Schur acceleration requires normal matrices; general groups still require expensive SVD.
  • Focus on 2D Rotations: Most experiments focus on 2D rotation; more complex groups (e.g., O(5)) were only tested on synthetic datasets.

Comparison with Related Work

  • vs. Group-Equivariant Architectures: Those impose strict equivariance and require specific designs; Ours is soft, architecture-agnostic, and treats equivariance as a spectrum.
  • vs. Augmentation/Regularization: Those offer no guarantees; Ours provides provable, closed-form error bounds.
  • vs. RPP/ResEq: Those mix branches and double model size; Ours adds no parameters and achieves better trade-offs.

Rating

  • Novelty: ⭐⭐⭐⭐⭐ 5/5
  • Experimental Thoroughness: ⭐⭐⭐⭐ 4/5
  • Writing Quality: ⭐⭐⭐⭐ 4/5
  • Value: ⭐⭐⭐⭐⭐ 5/5