Convex Basins in Single-Index Model Loss Landscapes: Applications to Robust Recovery under Strong Adversarial Corruption¶
Conference: ICML2026
arXiv: 2605.29497
Code: None
Area: Optimization / Robust Statistics / Single-Index Models
Keywords: Single-Index Models, Robust Regression, Heavy-tailed Noise, Strong Adversarial Corruption, Convex Basin, Spectral Initialization
TL;DR¶
Under heavy-tailed noise and a constant fraction of strong adversarial corruption, the authors prove that the squared loss of Gaussian Single-Index Models (SIMs) with a broad class of non-monotonic link functions (GeLU, Swish, Tanh, Probit, Logistic, Phase Retrieval, etc.) possesses a dimension-independent, constant-radius convex basin. Based on this, they design a robust recovery algorithm with \(\tilde{O}(nd)\) time and \(\tilde{O}(d)\) sample complexity, achieving a final estimation error of \(O(\sigma\sqrt{\epsilon})\).
Background & Motivation¶
Background: The Single-Index Model (SIM) \(Y=f(X^\top\beta^\star)+\zeta\) unifies linear regression, Logistic regression, phase retrieval, and generalized linear models (GLMs) into a semiparametric family. Modern gating primitives in neural networks, such as GeLU and Swish, are naturally non-monotonic link functions. Existing robust recovery theories only cover three narrow settings: linear (\(f(x)=x\)), strictly monotonic (Logistic and other GLMs), and phase retrieval (\(f(z)=z^2\)), as addressed by Pensia et al. (JASA 2024), Awasthi et al. (NeurIPS 2022), and Buna and Rebeschini (AISTATS 2025), respectively.
Limitations of Prior Work: Extending these proofs to general "non-monotonic + non-asymmetric" link functions (e.g., GeLU, Swish) causes immediate failure. First-order proof strategies (Arous et al.) rely on martingale-drift decomposition, requiring zero-mean stochastic deviations, which is destroyed by strong adversarial corruption of an \(\epsilon\) fraction of samples. Furthermore, the symmetric structure of phase retrieval (quadratic link) cannot be transferred to asymmetric cases.
Key Challenge: To perform robust recovery in high dimensions, at least two structural conditions must be met: (i) the squared loss must have a dimension-independent constant-radius convex basin around \(\beta^\star\) to allow second-order convergence proofs; (ii) this basin must be efficiently reachable from random initialization. Until now, these two conditions were only simultaneously satisfied for phase retrieval, and it was unknown if broader non-monotonic links possessed both.
Goal: Identify a set of sufficiently mild conditions on the link function \(f\) such that (i) and (ii) hold simultaneously, and provide a near-linear time, sample-optimal robust recovery algorithm.
Key Insight: The presence of a "convex basin" is translated into a purely one-dimensional integral condition regarding the Gaussian expectation of \(f\) (Assumption 2.1), and "reachability" is translated into the second-moment criterion \(\mathrm{ESC}(\beta,f):=\mathbb{E}[(f'(X^\top\beta))^2 + f(X^\top\beta)f''(X^\top\beta)]>0\) (Assumption 2.2). This converges the burden of high-dimensional proofs onto the 1D properties of the link function itself.
Core Idea: Characterize "convex basin existence + reachability" using two 1D conditions: "local Lipschitz complexity + ESC." Then, use spectral initialization to enter the basin and Robust Gradient Descent (RGD) to refine the estimate, extending robust recovery from phase retrieval to the entire class of SIMs with a generation index \(\le 2\).
Method¶
Overall Architecture¶
Input: A dataset \(\{(x_i,y_i)\}_{i=1}^N\) corrupted by an \(\epsilon\) fraction of strong adversaries, where \(x_i\sim\mathcal{N}(0,\mathbf I_d)\), and an unknown index vector \(\beta^\star\) with \(\|\beta^\star\|_2=1\). Output: A unit vector \(\hat\beta\) such that \(\|\hat\beta-\beta^\star\|_2=O(\sigma\sqrt\epsilon)\). Algorithm 1 splits samples into \(P+1\) equal buckets. First, LRSI is used for spectral initialization (\(\beta_0\leftarrow\text{LRSI}(N_1,\epsilon)\)) to land in the convex basin. Then, LRGD (Robust Gradient Descent) is applied (\(\beta_P\leftarrow\text{LRGD}(N_{2..P+1},\beta_0,\epsilon,\alpha,\gamma)\)) to refine the error from \(O(\epsilon^{1/4})\) to \(O(\sqrt\epsilon)\), followed by normalization.
Key Designs¶
-
Convex Basin Existence (Assumption 2.1 + Theorem 3.1):
- Function: Characterizes which link functions ensure the squared loss \(\mathcal L(\beta)=\frac12\mathbb E[(f(X^\top\beta)-Y)^2]\) has a dimension-independent constant-radius convex basin around \(\beta^\star\).
- Mechanism: At \(\beta^\star\), Gaussian symmetry simplifies the Hessian to \(H(\beta^\star) = \mathbb E[(f'(Z))^2]\mathbf I_d+(\mathbb E[Z^2(f'(Z))^2]-\mathbb E[(f'(Z))^2])\beta^\star\beta^{\star\top}\). Thus, \(\lambda_{\min}(H(\beta^\star))=\mu:=\min\{\mathbb E[f'(Z)^2],\mathbb E[Z^2 f'(Z)^2]\}\). The Mean-Value Theorem decomposes the operator norm bound of \(H(\beta)-H(\beta^\star)\) into \(C_{\text{lip}}(R):=\sup_{\|\beta-\beta^\star\|\le R}\sqrt{\mathbb E_{z\sim\mathcal N(0,\|\beta\|^2)}[18f'(z)^2 f''(z)^2+2f'''(z)^2 f(z)^2]}\cdot\|\beta-\beta^\star\|\). Crucially, \(C_{\text{lip}}(R)\) involves only 1D Gaussian integrals of \(f\) and its first three derivatives, independent of dimension \(d\). Choosing \(R\le\mu/(2(315)^{1/4}C_{\text{lip}}(R))\) guarantees \(\|\Delta(\beta)\|_{\text{op}}\le\mu/2\), ensuring \(\frac{\mu}{2}\mathbf I_d\preceq H(\beta)\preceq(\frac{\mu}{2}+\mu_1)\mathbf I_d\) over the ball \(\mathcal B(\beta^\star,R)\).
- Design Motivation: Collapse the high-dimensional basin existence problem into 1D conditions, which are automatically satisfied if \(f\) and its third derivative have finite fourth moments (polynomial growth), unifying GeLU, Swish, Tanh, Probit, Logistic, and phase retrieval.
-
ESC Condition + LRSI Spectral Initialization (Assumption 2.2 + Theorem 4.2):
- Function: Drives random initialization into the convex basin using second-moment spectral methods, with guarantees under strong adversarial corruption.
- Mechanism: Define \(\tilde Y:=Y X\). Using Stein's second-order identity, \(\beta^\star\) is the top eigenvector of \(\mathbb E[\tilde Y\tilde Y^\top]\) if and only if \(\mathrm{ESC}(\beta^\star;f)>0\) (this is a higher-order "monotonicity" since \(\mathrm{ESC}(\beta,f)=\mathbb E[(f^2(X^\top\beta))'']\)). By proving \(\tilde Y\) is \((4,C_4)\)-hypercontractive, where \(C_4=3(\mathbb E[f(X^\top\beta^\star)^8]^{1/8}+K_4)/\sigma\), one can use the near-linear time robust 1-ePCA subroutine from Jambulapati et al. (2024) to extract \(\hat u\) and set \(\beta_0:=\hat u\). The theory guarantees \(\text{dist}(\beta_0,\beta^\star)=O(C_4\epsilon^{1/4}\sqrt{\sigma^2+\mathbb E[f^2]+c}/\sqrt c)\), ensuring \(\beta_0\) lands in the basin with high probability.
- Design Motivation: Previous phase retrieval initializations relied on symmetry. ESC characterizes "signal identification via second moments" as a pure link function property, extending reachability to non-symmetric GeLU/Swish while maintaining \(O(nd)\) complexity.
-
LRGD Robust Gradient Descent (Theorem 4.1):
- Function: Refines the \(O(\epsilon^{1/4})\) error from spectral initialization to the optimal \(O(\sigma\sqrt\epsilon)\) rate.
- Mechanism: Within the basin, \(\mathcal L\) is \(\gamma=\mu/2\) strongly convex and \(\alpha=\mu/2+\mu_1\) smooth, so standard GD with \(\eta=2/(\alpha+\gamma)\) converges linearly. Since only corrupted samples are available, \(\nabla\mathcal L(\beta)=\mathbb E[(f(X^\top\beta)-Y)f'(X^\top\beta)X]\) is estimated using the robust mean estimator (Lemma 2.2) from Diakonikolas et al. (2022), yielding a robust gradient \(\hat g\) such that \(\|\hat g-\nabla\mathcal L(\beta)\|=O(\sigma'\sqrt\epsilon)\). Sample splitting across independent buckets avoids trajectory-dependent correlation.
- Design Motivation: Spectral estimation alone hits a statistical lower bound of \(\epsilon^{1/4}\). Robust mean estimation maintains \(\tilde O(nd)\) complexity while reaching the information-theoretically optimal \(\sigma\sqrt\epsilon\) error level.
Loss & Training¶
The objective is the squared loss \(\mathcal L(\beta)=\frac12\mathbb E[(f(X^\top\beta)-Y)^2]\). Total sample complexity is \(n=\tilde O(m+P\tilde m)\), where \(m=\Theta(C_4^2(d\log d+\log(1/\delta))/\epsilon^{3/2})\) for spectral initialization, \(\tilde m=\tilde O(d/\epsilon)\) for each robust gradient step, and \(P=O(1)\). Total runtime is \(\tilde O(md/C_4^4+P\tilde m d)=\tilde O(nd)\). The tolerable corruption fraction \(\epsilon\) is \(O(\min\{1/C_4^4,\,c^2\min\{R^4,1\}/(C_4^4(\sigma^2+\mathbb E[f^2]+c)^2),\,\gamma^2/\phi_1,\,\gamma^2 R^2/(\sigma^2\phi_2)\})\).
Key Experimental Results¶
This is a purely theoretical paper. The following tables summarize its theoretical metrics across different link functions and settings compared to prior work.
Main Results (Theoretical Indices for Robust Recovery)¶
| Link Function / Task | Noise + Corruption | Error Rate | Time | Samples | Source |
|---|---|---|---|---|---|
| Linear Regression \(f(x)=x\) | Heavy-tailed + Adversarial | \(O(\sigma\sqrt\epsilon)\) (Optimal) | \(\tilde O(nd)\) | \(\tilde O(d)\) | Cherapanamjeri et al. 2020 / Pensia et al. 2025 |
| Logistic (Monotonic GLM) | Gauss + Adversarial | \(O(\sigma\epsilon\log\frac1\epsilon)\) (Optimal) | Not explicitly bnd | \(\tilde O(d)\) | Awasthi et al. 2022 |
| Logistic (Squared Loss) | Heavy-tailed + Adversarial | \(O(\sigma\sqrt\epsilon)\) | \(\tilde O(nd)\) Streaming | \(\tilde O(d^2)\) | Diakonikolas et al. 2022 |
| Phase Retrieval \(f(z)=z^2\) | Heavy-tailed + Adversarial | \(O(\sigma\sqrt\epsilon)\) | Polynomial | \(\tilde O(d)\) | Das & Batra 2026 |
| GeLU / Swish / Tanh / Probit / Logistic / Phase Retrieval (Ours) | Heavy-tailed + Adversarial | \(O(\sigma\sqrt\epsilon)\) | \(\tilde O(nd)\) | \(\tilde O(d)\) | Ours (Thm 4.1) |
Structural Conclusion Comparison¶
| Condition / Conclusion | Phase Retrieval (Buna-Rebeschini 2025) | Monotonic GLM (Awasthi 2022) | Ours (Thm 3.1 + 4.2) |
|---|---|---|---|
| Convex Basin Existence | Quadratic link only | Monotonic links only | Any \(f\) satisfying Assumption 2.1 (incl. GeLU, Swish) |
| Basin Radius \(R\) | Dimension-independent constant | Unclear | Dimension-independent (1D Gaussian integral) |
| Reachability Proof | Symmetry + Standard PCA | Tensor PCA / Special Proof | Second-order Stein + Robust 1-ePCA |
| Asymmetric & Non-monotonic | No | No | Yes |
| Spectral Init Time | Polynomial | Polynomial | \(\tilde O(nd)\) Near-linear |
Key Findings¶
- Convex basin existence is fully characterized by 1D conditions: \(C_{\text{lip}}(R)\) is merely a 1D Gaussian integral of \(f\) and its derivatives. As long as \(f\) grows at most polynomially, the basin radius is automatically dimension-independent.
- Spectral methods are bottlenecked at the \(\epsilon^{1/4}\) statistical lower bound: Robust GD is essential to reach \(\sqrt\epsilon\). This reuses the "Wirtinger Flow = Spectral Init + GD Refinement" two-stage structure from phase retrieval but generalizes reachability via ESC.
- ESC represents "High-order Monotonicity": Since \(\mathrm{ESC}(\beta,f)=\mathbb E[(f^2(X^\top\beta))'']\), it determines whether the top eigenvector of the second-moment matrix aligns with \(\beta^\star\). Signal can be recovered even if \(f\) is non-monotonic, provided \(f^2\) is "convex."
Highlights & Insights¶
- 1D Reduction Paradigm: Complex high-dimensional structures (basin radius, Hessian spectrum, reachability) are characterized via Stein identities + 1D Gaussian integrals. This compresses high-dimensional robust analysis into "link function auditing + existing robust subroutines."
- Bringing Modern Activations into SIM Theory: Functions like GeLU and Swish are now provably robustly recoverable. This suggests that the theoretically analyzable boundaries of gated Transformer layers are broader than previously thought.
- Transferable Techniques: (1) Linearizing signal identification for non-linear problems using \(\tilde Y=YX\) and Stein's identity; (2) Incorporating \((4,C_4)\)-hypercontractivity to bring robust PCA complexity down to \(O(nd)\); (3) The "Spectral Init + Robust GD" two-stage template is likely applicable to any problem with "convex basin presence + spectral reachability."
Limitations & Future Work¶
- Unit Norm Constraint: \(\|\beta^\star\|_2=1\) is a standard but non-negligible normalization constraint. Jointly estimating magnitude and direction is an open question.
- Sub-optimal Error Rate: This work achieves \(O(\sigma\sqrt\epsilon)\), whereas Awasthi et al. (2022) achieve \(O(\sigma\epsilon\log\frac1\epsilon)\) for monotonic links under Gaussian noise. Reaching \(O(\epsilon)\) for non-monotonic links remains unsolved.
- Generation Index \(>2\): If the signal disappears in the second moment (e.g., specific terms in \(f(z)=z^3\)), higher-order tensor methods are required, as the spectral framework fails.
- Known Link Function: In semiparametric statistics, \(f\) is often unknown. Extending this to joint estimation of a robust and unknown \(f\) is a natural next step.
Related Work & Insights¶
- vs Buna & Rebeschini (2025) / Das & Batra (2026): They target phase retrieval, relying on quadratic symmetry and poly-time PCA. This work uses ESC + Stein + Robust 1-ePCA to generalize the two-stage framework to all non-monotonic asymmetric links and reduces PCA to \(\tilde O(nd)\).
- vs Diakonikolas et al. (2019b/2022): They target Logistic recovery on specific losses. Their sample complexity is \(O(d^5)\) or \(O(d^2)\). This work maintains the optimal \(\tilde O(d)\) samples for the squared loss.
- vs Awasthi et al. (2022): They provide the optimal rate \(O(\epsilon \log \frac{1}{\epsilon})\) for monotonic GLMs but lack runtime guarantees and ignore non-monotonicity. This work expands to "non-monotonic + near-linear time + heavy-tailed noise" at the cost of a slightly slower error rate.
Rating¶
- Novelty: ⭐⭐⭐⭐⭐ First to achieve convex basin existence + reachability + near-linear time for non-monotonic asymmetric SIMs.
- Experimental Thoroughness: ⭐⭐⭐ Pure theoretical paper; however, theoretical comparisons are clear and cover wide link function instances.
- Writing Quality: ⭐⭐⭐⭐ Three main contributions are well-defined. Proof blueprints and subroutine boundaries are clear; notation is dense.
- Value: ⭐⭐⭐⭐ Brings modern activations (GeLU/Swish) into the provable scope of robust statistics, contributing structurally to robust optimization.