
LipNeXt: Scaling up Lipschitz-based Certified Robustness to Billion-parameter Models

Conference: ICLR 2026 arXiv: 2601.18513 Code: None Area: Others / Adversarial Robustness Keywords: Lipschitz constraints, certified robustness, orthogonal matrices, manifold optimization, spatial shift module

TL;DR

This paper proposes LipNeXt, the first unconstrained, convolution-free 1-Lipschitz architecture: it learns orthogonal matrices via manifold optimization and achieves spatial mixing through a Spatial Shift Module derived from Theorem 1. LipNeXt scales to billion-parameter models and establishes new state-of-the-art certified robust accuracy (CRA) on CIFAR-10/100, Tiny-ImageNet, and ImageNet, with a +8% CRA gain on ImageNet at \(\varepsilon=1\).

Background & Motivation

The challenge of adversarial robustness: Adversarial examples pose a fundamental threat to safety-critical applications such as autonomous driving, medical imaging, and malware detection. Empirical defenses cannot provide formal guarantees, leaving models vulnerable to stronger attacks.

Two paradigms for certified robustness: (a) Randomized smoothing (RS) provides probabilistic guarantees through noise averaging; (b) Lipschitz-based methods exploit the network's Lipschitz constant to provide deterministic (worst-case) guarantees. This paper focuses on the latter.

Scaling bottleneck of Lipschitz methods: Existing methods predominantly use VGG-style architectures with \(\leq 32\)M parameters, which already underfit on CIFAR-100 and suffer significant performance degradation on ImageNet. Performance gains from larger models saturate quickly.

Orthogonal matrices: performance-critical yet computationally expensive: Tight Lipschitz bounds require all weight matrices to be orthogonal. Existing explicit methods (matrix exponential SOC, Cayley transform, LOT-Orth, Cholesky-Orth) and implicit methods (AOL, CPL, SLL layers) introduce substantial computational overhead—including FFT, matrix inversions, and power iterations—that limits scalability and low-precision training.

Attention mechanisms are incompatible with Lipschitz control: Transformer attention lacks direct Lipschitz constraint mechanisms. Nevertheless, ConvNeXt and MetaFormer demonstrate that macro-level design principles from the Transformer era can be integrated with Lipschitz architectures.

Core motivation: Can one design a 1-Lipschitz architecture that requires neither constrained reparameterization nor convolutions, enabling certified robustness to benefit from scaling laws in the same way as standard training?

Method

Overall Architecture

A LipNeXt Block stacks four 1-Lipschitz components: an orthogonal matrix \(R \in \mathcal{M}_C\) (channel mixing) → Spatial Shift \(\mathcal{S}\) (spatial mixing) → its transpose \(R^\top\) (back-projection) → an orthogonal linear map \(M\) followed by the \(\beta\)-Abs activation. The complete block is:

\[Z = \sigma(M R^\top \mathcal{S}(R(X + p)) + b)\]

where \(p \in \mathbb{R}^{H \times W \times 1}\) is a learnable positional encoding and \(\sigma\) is the \(\beta\)-Abs activation. Spatial dimensions are aggregated via L2 Spatial Pooling: \([\text{L2Pool}(X)]_c = \sqrt{\sum_{h,w} X_{h,w,c}^2}\). The entire network strictly maintains the 1-Lipschitz property.
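A minimal PyTorch sketch of this block equation, assuming a channels-last layout; the module names, identity initialization of \(R\) and \(M\), and the `shift_fn`/`act_fn` placeholders are illustrative (the actual model keeps \(R\) and \(M\) orthogonal via the manifold optimizer described next, and `shift_fn`/`act_fn` stand for the Spatial Shift and β-Abs components sketched later):

```python
import torch
import torch.nn as nn

class LipNeXtBlockSketch(nn.Module):
    """Illustrative forward pass of Z = sigma(M R^T S(R(X + p)) + b)."""
    def __init__(self, channels, height, width, shift_fn, act_fn):
        super().__init__()
        # R and M must be kept orthogonal by the manifold optimizer (not shown here).
        self.R = nn.Parameter(torch.eye(channels))            # orthogonal channel mixer
        self.M = nn.Parameter(torch.eye(channels))            # orthogonal linear map
        self.p = nn.Parameter(torch.zeros(height, width, 1))  # learnable positional encoding
        self.b = nn.Parameter(torch.zeros(channels))          # bias
        self.shift_fn, self.act_fn = shift_fn, act_fn         # spatial shift S and beta-Abs

    def forward(self, x):                                     # x: (B, H, W, C)
        h = (x + self.p) @ self.R.T                           # channel mixing by R
        h = self.shift_fn(h)                                  # norm-preserving spatial mixing S
        h = h @ self.R                                        # back-projection by R^T
        h = h @ self.M.T                                      # orthogonal linear M
        return self.act_fn(h + self.b)                        # beta-Abs activation

def l2_spatial_pool(x):
    """L2 spatial pooling: [L2Pool(X)]_c = sqrt(sum_{h,w} X_{h,w,c}^2)."""
    return torch.sqrt((x ** 2).sum(dim=(1, 2)))               # (B, H, W, C) -> (B, C)
```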

Key Design 1: FastExp Manifold Optimization

Core observation: During large-model training, the learning rate \(\eta \sim 10^{-3}\) is small, implying that the Frobenius norm of the skew-symmetric parameter matrix \(A\) in the exponential map (Eq. 3) is also small. This motivates an adaptive truncated Taylor expansion:

\[\text{FastExp}(A) = \begin{cases} I + A + \frac{1}{2}A^2, & \|A\|_F < 0.05 \\ I + A + \frac{1}{2}A^2 + \frac{1}{6}A^3, & 0.05 \leq \|A\|_F < 0.25 \\ I + A + \frac{1}{2}A^2 + \frac{1}{6}A^3 + \frac{1}{24}A^4, & 0.25 \leq \|A\|_F < 1 \\ \exp(A), & \|A\|_F \geq 1 \end{cases}\]
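A minimal sketch of this piecewise rule in PyTorch, assuming a precomputed skew-symmetric update \(A\) (`fast_exp` is an illustrative name, not the authors' code):

```python
import torch

def fast_exp(A):
    """Adaptive truncated Taylor approximation of exp(A) for a small skew-symmetric A,
    following the piecewise thresholds on ||A||_F above (a sketch)."""
    n = torch.linalg.matrix_norm(A, ord="fro")
    I = torch.eye(A.shape[-1], dtype=A.dtype, device=A.device)
    A2 = A @ A
    if n < 0.05:
        return I + A + A2 / 2
    A3 = A2 @ A
    if n < 0.25:
        return I + A + A2 / 2 + A3 / 6
    A4 = A3 @ A
    if n < 1.0:
        return I + A + A2 / 2 + A3 / 6 + A4 / 24
    return torch.linalg.matrix_exp(A)   # fall back to the exact matrix exponential
```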

Two stabilization techniques:

  • (a) Periodic Polar Retraction: At the end of each epoch, SVD \(X = U\Sigma V^\top\) is performed and the matrix is reset as \(X \leftarrow UV^\top\), correcting accumulated truncation errors.
  • (b) Manifold Lookahead: The standard Lookahead weight interpolation \(0.5X_t + 0.5X_{t-K}\) destroys orthogonality. Instead, interpolation is performed in the tangent space over skew-symmetric updates: \(X_{\text{slow}} \leftarrow X_{\text{slow}} \cdot \text{FastExp}(\frac{1}{2}\sum_{j=t-K+1}^{t} \Delta_j)\), preserving orthogonality on the manifold.
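A sketch of both stabilization steps, reusing the `fast_exp` helper from the previous snippet; the function names and the list-of-updates interface are assumptions:

```python
import torch

def polar_retraction(X):
    """Periodic polar retraction: snap X back onto the orthogonal manifold via SVD
    (run once per epoch to correct accumulated truncation error)."""
    U, _, Vh = torch.linalg.svd(X)
    return U @ Vh                                  # X <- U V^T

def manifold_lookahead(X_slow, deltas):
    """Manifold Lookahead: average the last K skew-symmetric updates in the tangent
    space, then move along the manifold with FastExp, preserving orthogonality."""
    avg = 0.5 * torch.stack(deltas).sum(dim=0)     # (1/2) * sum of Delta_j, j = t-K+1..t
    return X_slow @ fast_exp(avg)                  # X_slow <- X_slow * FastExp(avg)
```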

The additional per-step overhead is at most 5 matrix multiplications, far cheaper than FFT-based convolutions or power iteration.

Key Design 2: Spatial Shift Module (Theorem 1)

Theorem 1: Let \(f_K\) be a spatial convolution with kernel \(K \in \mathbb{R}^{k \times k}\), unit stride, and circular padding. \(f_K\) is norm-preserving (an isometry, hence tightly 1-Lipschitz), i.e., \(\|f_K(X) - f_K(Y)\|_F = \|X - Y\|_F\) for all \(X, Y\), if and only if \(K\) contains exactly one nonzero element with value \(\pm 1\).

Implication: A norm-preserving depthwise convolution necessarily degenerates into a spatial shift; this theoretical result directly motivates the architectural design.
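A quick numerical illustration of the "if" direction (not from the paper; shapes and kernel sizes are arbitrary): a circular convolution whose kernel has a single ±1 entry is a pure pixel permutation and preserves Frobenius norms, while a generic kernel does not.

```python
import torch
import torch.nn.functional as F

def circular_conv(x, k):
    """Single-channel 2D convolution with unit stride and circular padding."""
    pad = k.shape[-1] // 2
    x = F.pad(x, (pad, pad, pad, pad), mode="circular")
    return F.conv2d(x, k)

x, y = torch.randn(1, 1, 8, 8), torch.randn(1, 1, 8, 8)

one_hot = torch.zeros(1, 1, 3, 3)
one_hot[0, 0, 0, 2] = 1.0                  # exactly one nonzero entry (+1): a pure shift
generic = torch.randn(1, 1, 3, 3)          # arbitrary dense kernel

# Convolution is linear, so checking f(x - y) against x - y tests the isometry property.
d = x - y
print(torch.norm(circular_conv(d, one_hot)) / torch.norm(d))   # ~1.0 (norm-preserving)
print(torch.norm(circular_conv(d, generic)) / torch.norm(d))   # generally far from 1.0
```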

2D implementation: Each token's features are partitioned into 5 groups (shift up / shift down / shift left / shift right / no shift), each realized as a circular shift. An orthogonal matrix \(R\) mixes channels before and after the shift, ensuring that the shift does not always act on the same fixed channel subsets. The empirically optimal shift ratio is \(\alpha \in \{1/8, 1/16\}\).
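One way the shift might be realized in PyTorch (a sketch; the channels-last layout, group size of \(\alpha C\) per shifted direction, and the function name are assumptions):

```python
import torch

def spatial_shift(x, alpha=1/8):
    """Parameter-free spatial shift: circularly roll four channel groups by one pixel
    (up / down / left / right) and leave the remaining channels untouched."""
    B, H, W, C = x.shape                    # channels-last layout assumed
    g = int(C * alpha)                      # channels per shifted group
    out = x.clone()
    out[..., 0 * g:1 * g] = torch.roll(x[..., 0 * g:1 * g], shifts=-1, dims=1)  # up
    out[..., 1 * g:2 * g] = torch.roll(x[..., 1 * g:2 * g], shifts=+1, dims=1)  # down
    out[..., 2 * g:3 * g] = torch.roll(x[..., 2 * g:3 * g], shifts=-1, dims=2)  # left
    out[..., 3 * g:4 * g] = torch.roll(x[..., 3 * g:4 * g], shifts=+1, dims=2)  # right
    # The map only permutes tensor entries, so it is exactly norm-preserving.
    return out
```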

Circular padding vs. zero padding: Zero padding implicitly introduces positional information, whereas circular padding does not but guarantees norm-preservation. This paper adopts circular padding with an explicit positional encoding \(p\), which experiments confirm is superior to zero-padding schemes.

Key Design 3: β-Abs Activation

\[[\beta\text{-Abs}(\boldsymbol{x})]_i = \begin{cases} |x_i|, & i \leq \beta d \\ x_i, & \text{otherwise} \end{cases}\]

The parameter \(\beta \in [0,1]\) controls the degree of nonlinearity. At \(\beta = 0.5\), the commonly used MinMax activation can be expressed as: \(\exists R \in \mathcal{M}_{2d}, \text{MinMax}(x) = R^\top \beta\text{-Abs}(Rx)\). The activation is 1-Lipschitz and GPU-friendly, requiring no sorting or pairing operations.
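A minimal sketch of the activation (the function name and channels-last layout are assumptions):

```python
import torch

def beta_abs(x, beta=0.5):
    """beta-Abs: apply |.| to the first beta*d channels and identity to the rest.
    Elementwise and 1-Lipschitz; no sorting or channel pairing is needed."""
    d = x.shape[-1]
    k = int(beta * d)
    return torch.cat([x[..., :k].abs(), x[..., k:]], dim=-1)
```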

Loss & Training

EMMA loss is used for certified robust training, following the training recipe of LiResNet++. bfloat16 precision training is supported (LiResNet requires float32 due to numerical overflow in power iteration, and BRONet requires the equivalent of float64 due to complex FFT arithmetic). Multi-class classification is handled via one-vs-rest decomposition.

Key Experimental Results

Table 1: CIFAR-10/100 + Tiny-ImageNet Main Results

| Dataset | Model | Params | Clean Acc | CRA @ ε=36/255 | CRA @ ε=72/255 | CRA @ ε=108/255 |
|---|---|---|---|---|---|---|
| CIFAR-10 | LiResNet | 83M | 81.0 | 69.8 | 56.3 | 42.9 |
| CIFAR-10 | BRONet | 68M | 81.6 | 70.6 | 57.2 | 42.5 |
| CIFAR-10 | LipNeXt L32W1024 | 64M | 81.5 | 71.2 | 59.2 | 45.9 |
| CIFAR-10 | LipNeXt L32W2048 | 256M | 85.0 | 73.2 | 58.8 | 43.3 |
| CIFAR-100 | LiResNet | 83M | 53.0 | 40.2 | 28.3 | 19.2 |
| CIFAR-100 | BRONet | 68M | 54.3 | 40.2 | 29.1 | 20.3 |
| CIFAR-100 | LipNeXt L32W2048 | 256M | 57.4 | 44.1 | 31.9 | 22.2 |
| Tiny-IN | BRONet | 75M | 41.2 | 29.0 | 19.0 | 12.1 |
| Tiny-IN | LipNeXt L32W2048 | 256M | 45.5 | 35.0 | 25.9 | 18.0 |

Table 3: ImageNet Results

| Model | Params | Training Speed (min/epoch) | CRA @ ε=1 | Clean @ ε=36/255 | CRA @ ε=36/255 |
|---|---|---|---|---|---|
| LiResNet | 51M | 5.3 | 14.2 | 45.6 | 35.0 |
| BRONet | 86M | 10.5 | - | 49.3 | 37.6 |
| LipNeXt 1B | 1B | 8.9 | 21.1 | 55.9 | 40.3 |
| LipNeXt 2B | 2B | 17.8 | 22.4 | 57.0 | 41.2 |

CRA at \(\varepsilon=1\) on ImageNet improves by +8% over BRONet; CRA at \(\varepsilon=36/255\) improves by +3%.

Table 4: Scaling Experiments (ImageNet 400 classes, ε=1)

| Config | Depth | Width | Clean Acc | CRA |
|---|---|---|---|---|
| Fixed depth = 32 | 32 | 1024 → 4096 | 40.5 → 51.7 | 22.9 → 30.0 |
| Fixed width = 2048 | 8 → 128 | 2048 | 30.7 → 47.5 | 22.4 → 26.9 |
| Fixed params = 1B | 32 | 4096 | 51.7 | 30.0 |
| Fixed params = 1B | 64 | 2896 | 51.2 | 29.6 |

A depth of 32 layers is optimal under a fixed parameter budget. Both width and depth yield non-saturating gains.

Key Findings

  1. Lipschitz certification can benefit from scaling: CRA continues to improve from 1B to 2B parameters, challenging the conventional wisdom that certified robustness is confined to small models.
  2. Stability under low-precision training: LipNeXt supports bfloat16 training, whereas LiResNet requires float32 because power iteration overflows under bf16, and BRONet's complex FFT arithmetic is equivalent to float64. This lets LipNeXt keep benefiting from hardware acceleration for low-precision formats.
  3. FastExp approximation is sufficiently accurate: The combination of adaptive Taylor truncation, periodic SVD retraction, and manifold Lookahead ensures numerical stability, with performance on par with exact matrix exponentiation.
  4. Theoretical limits of norm-preserving convolutions: Theorem 1 proves that norm-preserving depthwise convolutions under circular padding can only be spatial shifts—a tight necessary and sufficient condition.
  5. Necessity of positional encoding: Circular padding introduces no positional information, so an explicit positional encoding is required; with it, circular padding outperforms zero-padding schemes in the experiments.

Highlights & Insights

  • Theory-driven architecture design: Theorem 1 naturally derives the Spatial Shift Module from the norm-preservation condition, rather than as an empirical design choice.
  • Paradigm shift from constraints to manifolds: Replacing the "reparameterize, then project" approach to orthogonality with direct optimization on the manifold yields a conceptually clean and computationally efficient approach (only 5 matrix multiplications per step).
  • First billion-scale certified robust model: This work demonstrates that deterministic certification need not be confined to small models, opening new directions for subsequent research.
  • Training efficiency: Despite being 10–20× larger, LipNeXt achieves comparable training throughput to prior work (1B model at 8.9 min/epoch vs. BRONet at 86M parameters and 10.5 min/epoch).

Limitations & Future Work

  • Only \(\ell_2\)-norm certification is considered; \(\ell_\infty\)-norm certification is often more relevant in practice but remains harder.
  • Training the 2B parameter model requires 16× H100 GPUs; deployment at this scale necessitates distillation or other compression techniques.
  • The maximum CRA@ε=108/255 on CIFAR-10 is lower than AOL (45.9 vs. 49.0), as AOL trades clean accuracy for robustness at large perturbation radii.
  • The method has not been trained on large-scale image-text datasets, making direct comparisons with randomized smoothing approaches (which can leverage pretrained models such as CLIP) potentially incomplete.

Comparison with Prior Methods

| Method | Orthogonal Matrix Implementation | Spatial Mixing | Scalable | Low-Precision Training |
|---|---|---|---|---|
| LipNeXt (Ours) | Manifold optimization + FastExp | Spatial Shift (parameter-free) | ✅ 1–2B | ✅ bf16 |
| LiResNet (Hu et al., 2024) | Cholesky-Orth | Convolution + power iteration | ❌ Saturates at 83M | ❌ Requires float32 |
| BRONet (Lai et al., 2025) | Block Reflector | FFT-based frequency-domain convolution | ❌ 86M | ❌ Equivalent to float64 |

vs. LiResNet: LipNeXt retains the macro-level structure of LiResNet but replaces all core components (manifold optimization in place of Cholesky-Orth, the parameter-free Spatial Shift in place of power-iteration-based convolution), eliminating the scaling bottleneck.

vs. BRONet: BRONet's FFT convolutions require complex-valued arithmetic (equivalent to float64), whereas LipNeXt's Spatial Shift is a parameter-free integer indexing operation. LipNeXt already surpasses BRONet at equal parameter counts, and the advantage grows further at larger scales.

Rating

  • Novelty: ⭐⭐⭐⭐⭐ First convolution-free, manifold-optimized billion-scale certified robust architecture; Theorem 1 provides theory-driven design
  • Experimental Thoroughness: ⭐⭐⭐⭐ Four datasets, scaling experiments, and extensive ablations; \(\ell_\infty\) experiments are absent
  • Writing Quality: ⭐⭐⭐⭐ Theoretical derivations are clear, Algorithm 1 is complete, and motivation is developed progressively
  • Value: ⭐⭐⭐⭐⭐ A significant milestone in certified robustness, demonstrating that deterministic guarantees can track modern scaling trends