Weak-SIGReg: Covariance Regularization for Stable Deep Learning¶
Conference: ICLR 2026 · arXiv: 2603.05924 · Code: GitHub · Area: Optimization Stability / Representation Regularization · Keywords: covariance regularization, optimization stability, ViT, SIGReg, representation collapse, random sketching
TL;DR¶
This work transfers SIGReg regularization from LeJEPA's self-supervised learning setting to supervised learning and proposes a computationally efficient variant called Weak-SIGReg—constraining the covariance matrix toward the identity (rather than matching all moments). Random projections reduce memory from \(O(C^2)\) to \(O(CK)\). On a ViT without BN or residual connections, this approach recovers CIFAR-100 accuracy from 20.73% (collapsed) to 72.02%, matching or surpassing carefully tuned baselines.
Background & Motivation¶
Background: Modern neural network training relies on architectural priors such as Batch Normalization and residual connections for optimization stability. In self-supervised learning, methods like VICReg and Barlow Twins have demonstrated that covariance regularization can prevent representation collapse.
Limitations of Prior Work:

- Removing BN/residual connections, or training low-inductive-bias architectures (ViT) on small datasets with strong augmentation, frequently causes training collapse (accuracy ~20%, degenerating to random guessing).
- Existing solutions depend on delicate hyperparameter tuning (specific weight decay, initialization schemes, positional embedding types, learning rate schedules), making them brittle and non-generalizable.
- Covariance regularization from self-supervised learning (VICReg, SIGReg) has not been systematically applied to supervised learning.
Key Challenge: Optimization stability relies on architectural tricks rather than principled methods—can regularization replace architectural priors?
Core Idea: Adopting an interacting particle system perspective, hidden-layer representations are treated as particles evolving under stochastic dynamics. The "stochastic flux" (finite batch size, high learning rate, data augmentation) drives the representation density toward degenerate states (dimensional collapse). Constraining the representation distribution toward an isotropic Gaussian prevents this degeneracy.
Method¶
Overall Architecture¶
Encoder \(f_\theta\) outputs batch representations \(Z \in \mathbb{R}^{N \times C}\) → random projection \(S \in \mathbb{R}^{C \times K}\) reduces dimensionality to \(ZS\) → covariance of the projected representations is computed → Frobenius norm constrains the covariance toward the identity matrix → added as a regularization term to the total loss.
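Written out, the regularization term implied by this pipeline is the following (assuming a standard centered, \(1/(N-1)\)-normalized empirical covariance, which the summary does not spell out):

\[
\tilde{Z} = Z S \in \mathbb{R}^{N \times K}, \qquad
\operatorname{Cov}(\tilde{Z}) = \frac{1}{N-1}\,\bigl(\tilde{Z} - \bar{\tilde{Z}}\bigr)^{\top}\bigl(\tilde{Z} - \bar{\tilde{Z}}\bigr), \qquad
\mathcal{R}(Z) = \bigl\|\operatorname{Cov}(\tilde{Z}) - I_K\bigr\|_F ,
\]

where \(\bar{\tilde{Z}}\) denotes the batch mean of the projected representations broadcast over rows.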
Key Designs¶
- Strong SIGReg (from LeJEPA)
    - Function: matches the empirical characteristic function (ECF) to the analytically derived Gaussian characteristic function.
    - Matching is performed after random projection to a \(K\)-dimensional space.
    - Theoretically constrains all moments (mean, covariance, skewness, kurtosis, …), driving representations toward a perfectly isotropic Gaussian.
    - Computationally heavy: requires evaluating the characteristic function.
- Weak-SIGReg (this paper)
    - Function: constrains only the second-order moment (covariance), discarding higher-order moment constraints.
    - Core assumption: in supervised learning, preventing dimensional collapse primarily requires covariance conditioning; full distributional matching is unnecessary.
    - Loss function: \(\mathcal{L} = \mathcal{L}_{CE} + \lambda \|\text{Cov}(ZS) - I\|_F\)
    - \(S \in \mathbb{R}^{C \times K}\) is a fixed random projection matrix (the Johnson–Lindenstrauss lemma guarantees approximate preservation of geometric structure).
    - Memory advantage: the full \(C \times C\) covariance costs \(O(C^2)\), whereas the projected covariance costs only \(O(CK)\); e.g., with \(C = 1024\) and \(K = 64\), roughly 1.05M versus 65K entries, a 16× reduction.
    - Minimalist implementation: ~10 lines of PyTorch code, plug-and-play (a sketch follows after this list).
- Relation to VICReg/Barlow Twins: conceptually similar, but used as a purely internal regularizer (no dual-branch architecture or augmented views required), applied directly on top of the supervised loss.
- Physical Intuition (Interacting Particle Systems)
    - Batch representations are treated as particles evolving under Dean–Kawasaki stochastic dynamics.
    - "Stochastic flux" (SGD noise, small batch size, strong augmentation) drives the representation density onto a low-dimensional manifold (collapse).
    - SIGReg constrains the representation density toward an isotropic Gaussian, preventing density degeneracy.
    - Strong SIGReg constrains the density toward a perfectly spherical Gaussian; Weak-SIGReg constrains only the covariance, permitting more flexible geometry while still preventing collapse.
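A minimal sketch of the Weak-SIGReg penalty described above, in PyTorch (roughly the ten lines the summary advertises). The function name and the unbiased \(1/(N-1)\) covariance normalization are assumptions for illustration, not the paper's exact code:

```python
import torch

def weak_sigreg(z: torch.Tensor, s: torch.Tensor) -> torch.Tensor:
    """Frobenius-norm penalty pushing Cov(ZS) toward the identity.

    z: batch representations, shape (N, C); s: fixed random projection, shape (C, K).
    """
    p = z @ s                               # project to K dimensions: (N, K)
    p = p - p.mean(dim=0, keepdim=True)     # center over the batch
    cov = (p.T @ p) / (p.shape[0] - 1)      # empirical covariance: (K, K)
    eye = torch.eye(cov.shape[0], device=cov.device, dtype=cov.dtype)
    return torch.linalg.norm(cov - eye, ord="fro")
```

The penalty touches only the \(K \times K\) projected covariance, which is what keeps the memory cost at \(O(CK)\) rather than \(O(C^2)\).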
Loss & Training¶
- Added as a regularization term to the standard cross-entropy loss.
- Gradient clipping (norm=1.0) is applied in all experiments to ensure fair comparison.
- The random projection matrix \(S\) is generated before training and kept fixed.
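Putting the training details above together, a hedged sketch of one supervised training step: the fixed projection \(S\) generated before training, cross-entropy plus the \(\lambda\)-weighted penalty, and gradient clipping at norm 1.0. Here `loader`, `model`, `optimizer`, the encoder/head split, the projection scaling, and the value of \(\lambda\) are placeholders rather than settings from the paper; `weak_sigreg` is the helper sketched in the previous section.

```python
import torch
import torch.nn.functional as F

C, K = 1024, 64                       # dimensions used in the paper's example
S = torch.randn(C, K) / K ** 0.5      # fixed random projection; this scaling is an assumption
lam = 0.1                             # placeholder regularization strength

for x, y in loader:                   # `loader`, `model`, `optimizer` assumed to exist
    z = model.encoder(x)              # (N, C) batch representations
    logits = model.head(z)
    loss = F.cross_entropy(logits, y) + lam * weak_sigreg(z, S.to(z.device))
    optimizer.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # clipping used in all experiments
    optimizer.step()
```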
Key Experimental Results¶
ViT on CIFAR-100 (No BN / No Residual)¶
| Configuration | SIGReg | Top-1 Acc | Status |
|---|---|---|---|
| AdamW baseline | None | 20.73% | Collapsed |
| AdamW | Strong (LeJEPA) | 70.20% | Converged |
| AdamW | Weak (Ours) | 72.02% | Converged |
→ Weak-SIGReg not only recovers training but slightly outperforms the more computationally expensive Strong SIGReg.
vs. Expert-Tuned Baselines¶
| Setting | SIGReg | Top-1 Acc |
|---|---|---|
| Expert-tuned baseline (specific weight decay + init + PE + LR schedule) | None | 70.76% |
| Expert-tuned | Strong | 72.71% |
| Expert-tuned | Weak | 71.65% |
→ Weak-SIGReg matches expert-tuned performance without any specialized tuning—demonstrating practical value as a robust default stabilizer.
Vanilla MLP (6-layer, pure SGD, no BN / no residual)¶
| Augmentation | SIGReg | Top-1 Acc |
|---|---|---|
| None | None | 26.77% |
| None | Strong | 35.99% |
| None | Weak | 42.17% |
→ Under extreme settings (6-layer MLP without BN + pure SGD), Weak-SIGReg yields larger gains—suggesting that covariance constraints effectively serve as a "soft Batch Normalization."
Key Findings¶
- Weak ≥ Strong: In nearly all settings, Weak-SIGReg matches or surpasses Strong SIGReg (the one exception is the expert-tuned configuration, where Strong is slightly ahead), indicating that second-order moment constraints suffice in supervised learning; full distributional matching is unnecessary.
- 20.73% → 72.02%: SIGReg recovers training from complete collapse to normal operation—not a marginal improvement, but a qualitative fix.
- Replacing architectural tricks: SIGReg can functionally substitute for the stabilizing effects of BN and residual connections.
- Random projection makes high-dimensional covariance regularization practically feasible—direct computation and storage of a \(1024 \times 1024\) covariance matrix would otherwise be prohibitive.
Highlights & Insights¶
- Transfer from SSL to supervised learning: VICReg, Barlow Twins, and SIGReg all originated in SSL—this work demonstrates that the same ideas are highly effective as supervised regularizers.
- The interacting particle system physical intuition is compelling—framing training dynamics as stochastic particle evolution, where stability equates to preventing density degeneracy.
- Minimalist implementation (~10 lines of code) makes the approach highly practical—it can be directly incorporated into any training pipeline.
- The Weak > Strong finding is counterintuitive yet meaningful: the supervised signal already provides directional constraints, so only collapse prevention (second-order moments) is needed, without enforcing distributional shape (all moments).
Limitations & Future Work¶
- Validated only on CIFAR-100—effectiveness at ImageNet scale is unknown.
- Performance gap relative to standard BN + residual architectures is not quantified (72% vs. potentially higher with BN + residual).
- Sensitivity of the random projection dimension \(K\) across different layers and architectures is not analyzed.
- Guidance for tuning the regularization strength \(\lambda\) is absent.
- Not evaluated on NLP models (e.g., Transformer language models).
Related Work & Insights¶
- vs. VICReg: VICReg regularizes SSL representations with three terms—variance, invariance, and covariance; Weak-SIGReg uses only the covariance term as a supervised regularizer.
- vs. Batch Normalization: BN is an architecture-embedded mean/variance normalization; SIGReg is a loss-level covariance constraint—more expressive and controllable.
- vs. LeJEPA's SIGReg: LeJEPA applies Strong SIGReg for SSL; this work demonstrates that the Weak variant performs better under supervision while being more computationally efficient.
Rating¶
- Novelty: ⭐⭐⭐ Primarily transfers an existing technique (SIGReg) to a new setting (supervised learning) and proposes a simplified variant.
- Experimental Thoroughness: ⭐⭐⭐ Limited to CIFAR-100 scale; only two architectures evaluated (ViT + MLP).
- Writing Quality: ⭐⭐⭐⭐ Physical intuition is clearly articulated; inline implementation code is intuitive.
- Value: ⭐⭐⭐⭐ A minimalist, practical stabilization tool; the "20% → 72%" recovery is impressive.