Neural Networks Learn Generic Multi-Index Models Near Information-Theoretic Limit¶
Conference: ICLR 2026 arXiv: 2511.15120 Code: None Area: Optimization Keywords: Multi-Index Models, Information-Theoretic Lower Bounds, Feature Learning, Power Iteration, Two-Layer Neural Networks
TL;DR¶
This paper proves that, under generic non-degeneracy assumptions, standard two-layer neural networks trained via hierarchical gradient descent can learn generic Gaussian Multi-Index models \(f(\bm{x})=g(\bm{U}\bm{x})\) with \(\tilde{O}(d)\) samples and \(\tilde{O}(d^2)\) time, achieving information-theoretically optimal sample and time complexity. This is the first proof that gradient-based neural network training learns this hierarchical function class at the information-theoretic limit.
Background & Motivation¶
Background: The Multi-Index model \(f^*(\bm{x})=g^*(\bm{U}\bm{x})\) is a standard framework for studying representation learning, where the output depends on the input only through its projection onto a low-dimensional subspace. Information-theoretic lower bounds require \(\Theta(d)\) samples, and polynomial-time spectral algorithms are known to attain this bound.
Limitations of Prior Work:
- Single-index models (\(r=1\)) are well understood theoretically, but guarantees for neural network learning of Multi-Index models (\(r>1\)) remain extremely limited.
- Damian et al. (2022) showed that neural networks can learn Multi-Index models, but require a suboptimal \(\tilde{\Theta}(d^2)\) samples.
- Arnaboldi et al. address only a special subclass with staircase structure, far from the general case.
- The concurrent work of Mousavi-Hosseini & Wu (2026) achieves only weak recovery of partial feature directions and cannot guarantee full subspace recovery.
Key Challenge: \(\Theta(d)\) samples are information-theoretically necessary, and spectral methods show this bound is achievable in polynomial time; yet whether neural networks trained via standard gradient descent can attain the same complexity remained unproven.
Key Insight: By analyzing gradient descent dynamics under small initialization, the paper identifies that the inner-layer weights implicitly perform a power iteration procedure, analogous to spectral initialization on the local Hessian. Critically, the number of training steps must be carefully calibrated: too few steps leave sample noise unreduced; too many cause all neurons to align with the leading eigenvector direction, losing coverage of non-dominant subspace directions.
Core Idea: Gradient descent on neural networks implicitly performs "power iteration with early stopping"—an appropriate number of training steps suffices to suppress finite-sample noise while preserving subspace diversity, thereby approaching the information-theoretic optimum.
Method¶
Overall Architecture¶
A two-layer neural network \(f_{\Theta}(\bm{x}) = \sum_j a_j \sigma(\bm{w}_j^T \bm{x} + b_j)\) is trained via hierarchical gradient descent: Phase 1: Train inner-layer weights \(\bm{W}\) on dataset \(\mathcal{D}_1\) for \(T_1\) steps; Phase 2: Re-initialize biases \(\bm{b}\), then train outer-layer weights \(\bm{a}\) on dataset \(\mathcal{D}_2\) for \(T_2\) steps.
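In code, the forward pass \(f_{\Theta}(\bm{x}) = \sum_j a_j \sigma(\bm{w}_j^T \bm{x} + b_j)\) is a one-liner; the following numpy sketch is illustrative (the `tanh` activation and all names are assumptions, not the paper's choices):

```python
import numpy as np

def two_layer_net(X, W, a, b, sigma=np.tanh):
    """Forward pass f(x) = sum_j a_j * sigma(w_j^T x + b_j).

    X: (n, d) batch of inputs, W: (m, d) inner-layer weights,
    a: (m,) outer-layer weights, b: (m,) biases. Returns (n,) predictions.
    """
    return sigma(X @ W.T + b) @ a
```

The hierarchical schedule then trains `W` first (Phase 1) and `a` second (Phase 2), each on its own data split.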
Key Designs¶
- Implicit Power Iteration Mechanism:
- Function: Proves that the first \(T_1\) steps of gradient descent implicitly simulate power iteration on the local Hessian \(\hat{\Sigma}_\ell = \frac{1}{n}\sum_i \ell_i \bm{x}_i\bm{x}_i^T\).
- Mechanism: Under small initialization (\(\epsilon_0 \to 0\)), the activation approximates a quadratic (\(\sigma''(0)=1\)), neuron gradient updates are nearly independent, and each update \(\bm{w}_j^{(t+1)} \approx \hat{\Sigma}_\ell \bm{w}_j^{(t)}\) resembles the power method. The large eigenvalues of \(\hat{\Sigma}_\ell\) correspond to signal (the latent subspace), while small eigenvalues correspond to finite-sample noise.
- Design Motivation: Power iteration naturally amplifies signal directions and attenuates noise—but must be stopped at the right step.
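The mechanism can be seen in a minimal numpy simulation. Everything below is a toy stand-in for the paper's setting: the planted quadratic labels, the choice \(\ell_i = y_i - \bar{y}\) as the loss-derivative weight, and all dimensions are illustrative assumptions. Repeated multiplication by \(\hat{\Sigma}_\ell\) drives a randomly initialized weight vector into the planted subspace:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, r = 40, 4000, 2

# Planted r-dimensional subspace U (orthonormal rows) and a toy quadratic label.
U = np.linalg.qr(rng.standard_normal((d, r)))[0].T
X = rng.standard_normal((n, d))
y = ((X @ U.T) ** 2).sum(axis=1)
ell = y - y.mean()                 # stand-in for the loss-derivative weights l_i

# Local-Hessian-like matrix: Sigma = (1/n) sum_i l_i x_i x_i^T.
# Its population version is 2 U^T U: large eigenvalues mark the signal subspace.
Sigma = (X * ell[:, None]).T @ X / n

w = rng.standard_normal(d)
w /= np.linalg.norm(w)
align0 = np.linalg.norm(U @ w)     # ~ sqrt(r/d) at random initialization

for _ in range(10):                # one GD step ~ one power-iteration step
    w = Sigma @ w
    w /= np.linalg.norm(w)

align = np.linalg.norm(U @ w)      # near 1: finite-sample noise attenuated
```

The eigenvalue gap between the signal (order 1) and the noise bulk (order \(\sqrt{d/n}\)) is what power iteration exploits.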
- Critical Stopping Time Condition:
- Function: Determines the optimal number of inner-layer training steps \(T_1\).
- Mechanism: \(T_1\) too small → noise directions are insufficiently attenuated (especially when \(n \sim d\)); \(T_1\) too large → all neurons align to the leading eigenvector direction, losing full coverage of the subspace. The optimal \(T_1\) ensures that all \(r\) signal directions are covered by sufficiently many neurons, while noise is reduced to a negligible level.
- Design Motivation: This reveals a counterintuitive phenomenon—optimal learning requires a diverging (rather than \(O(1)\)) number of first-layer training steps.
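The tradeoff can be illustrated with a toy spectrum (the eigenvalues, dimensions, and neuron count below are illustrative assumptions, not the paper's): after a moderate number of implicit power-iteration steps every signal direction is covered by some neuron, but after many more steps all neurons collapse onto the leading eigenvector and coverage of the second direction is lost.

```python
import numpy as np

rng = np.random.default_rng(1)
d, m = 100, 200

# Toy spectrum: two signal directions with distinct eigenvalues plus a noisy
# bulk, mimicking the local Hessian at sample size n ~ d.
eigs = np.concatenate(([2.0, 1.5], 0.3 * rng.uniform(size=d - 2)))
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
Sigma = (Q * eigs) @ Q.T
signal = Q[:, :2]                  # the r = 2 planted directions

W = rng.standard_normal((m, d))    # m neurons at random initialization

def coverage(W):
    """Worst (over signal directions) best alignment achieved by any neuron."""
    Wn = W / np.linalg.norm(W, axis=1, keepdims=True)
    return np.abs(Wn @ signal).max(axis=0).min()

cov = {}
for t in range(1, 61):
    W = W @ Sigma                  # one implicit power-iteration step per neuron
    if t in (5, 60):
        cov[t] = coverage(W)
# Moderate t: noise is suppressed but both directions keep dedicated neurons.
# Very large t: every neuron aligns with the top eigenvector, so the second
# direction's coverage decays like (1.5/2.0)^t.
```

This is the quantitative content of the stopping condition: \(T_1\) must be large enough to kill the \(0.3\)-scale bulk but small enough that the \((\lambda_2/\lambda_1)^{T_1}\) factor has not yet erased subspace diversity.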
- Generic Loss Function as Data Preprocessing:
- Function: Interprets the choice of loss function as a form of data preprocessing.
- Mechanism: The choice of \(\ell'(0, y)\) determines the spectral structure of \(\Sigma_\ell\). For Multi-Index models with generative exponent 2, an appropriate loss function can make \(\Sigma_\ell\) non-degenerate (Assumption 5). This extends the role of the loss function from an "optimization objective" to a "preprocessor for feature extraction."
- Design Motivation: Broadens the applicable class of objective functions beyond squared loss to include \(\ell^1\), Huber loss, and others.
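A small numerical illustration of this point (a single planted direction, toy sample sizes, and a clipped residual as a Huber-like stand-in are all assumptions for illustration): different choices of \(\ell'(0,y)\) reweight the samples and change the size of the outlier eigenvalue of \(\hat{\Sigma}_\ell\), while both keep the planted direction spectrally visible.

```python
import numpy as np

rng = np.random.default_rng(2)
d, n = 30, 20000
u = np.zeros(d)
u[0] = 1.0                              # single planted direction
X = rng.standard_normal((n, d))
y = (X @ u) ** 2                        # generative-exponent-2 target

def sigma_ell_spectrum(weights):
    """Eigenvalues (descending) of (1/n) sum_i w_i x_i x_i^T."""
    S = (X * weights[:, None]).T @ X / n
    return np.sort(np.linalg.eigvalsh(S))[::-1]

# Different losses induce different per-sample weights ell'(0, y_i):
sq = sigma_ell_spectrum(y - y.mean())                       # squared loss
hub = sigma_ell_spectrum(np.clip(y - y.mean(), -1.0, 1.0))  # Huber-like clip
# Both spectra exhibit an outlier eigenvalue along u, separated from the bulk;
# the preprocessing changes its magnitude, not its existence.
```

In this sense the loss acts as a preprocessor: clipping (as in Huber or \(\ell^1\)) trades a smaller outlier eigenvalue for robustness, and richer links can activate directions that the squared loss misses.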
Loss & Training¶
- Phase 1: \(\bm{W}^{(t)} \leftarrow \bm{W}^{(t-1)} - \eta_1[\nabla_W \hat{\mathcal{L}}_{\mathcal{D}_1} + \beta_1 \bm{W}^{(t-1)}]\) (GD with weight decay)
- Phase 2: Standard GD on outer weights \(\bm{a}\) (convex optimization, converges to optimum)
- Symmetric initialization: \(a_j = -a_{m-j}\), \(\bm{w}_j^{(0)} = \bm{w}_{m-j}^{(0)}\) (eliminates mean bias)
- Learning rate selection is critical: excessively small learning rates cause the power iteration approximation to break down.
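Putting the pieces together, the training procedure above can be sketched as follows. This is a hedged sketch: the hyperparameters, `tanh` activation, bias re-initialization law, and gradient form are illustrative assumptions, and the paper's algorithm may differ in these details.

```python
import numpy as np

def train_two_phase(X1, y1, X2, y2, m, T1, T2, eta1, eta2, beta1, eps0=1e-3,
                    sigma=np.tanh, dsigma=lambda z: 1 - np.tanh(z) ** 2):
    """Two-phase (layer-wise) training sketch; m is assumed even."""
    n1, d = X1.shape
    rng = np.random.default_rng(0)
    # Symmetric small initialization: paired neurons share w and get opposite
    # outer weights, so the network output is identically zero at init.
    half = rng.standard_normal((m // 2, d)) * eps0
    W = np.vstack([half, half])
    a = np.concatenate([np.ones(m // 2), -np.ones(m // 2)]) / m
    b = np.zeros(m)

    # Phase 1: GD with weight decay on the inner weights W only (data split 1).
    for _ in range(T1):
        err = sigma(X1 @ W.T + b) @ a - y1
        gradW = a[:, None] * ((dsigma(X1 @ W.T + b) * err[:, None]).T @ X1) / n1
        W -= eta1 * (gradW + beta1 * W)

    # Phase 2: re-initialize biases, then fit the outer layer a on the second,
    # independent split; this step is convex in a.
    b = rng.uniform(-1.0, 1.0, size=m)
    Phi = sigma(X2 @ W.T + b)           # features are frozen from Phase 1
    for _ in range(T2):
        a -= eta2 * Phi.T @ (Phi @ a - y2) / len(y2)
    return W, a, b
```

Sample splitting (`X1` vs `X2`) mirrors the paper's use of two independent datasets, which keeps the Phase 2 regression independent of the features learned in Phase 1.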
Key Experimental Results¶
Main Results (Key Quantities in Theorem 1)¶
| Quantity | Complexity | Information-Theoretically Optimal? |
|---|---|---|
| Sample complexity | \(\tilde{O}(d)\) | Yes (leading order) |
| Time complexity | \(\tilde{O}(d^2)\) | Yes (leading order) |
| Test error | \(o_d(1)\) (vanishes as \(d \to \infty\)) | Yes |
| Applicable rank | Any constant \(r\) | — |
| Applicable losses | Squared, \(\ell^1\), Huber, etc. | — |
Comparison with Prior Work¶
| Work | Sample Complexity | End-to-End Learning? | Generic Model? |
|---|---|---|---|
| Damian et al. 2022 | \(\tilde{\Theta}(d^2)\) | Yes | Partial |
| Arnaboldi et al. 2023 | \(\tilde{O}(d^L)\) (\(L\)=staircase steps) | Yes | Staircase structure |
| Mousavi-Hosseini & Wu 2026 | \(O(d)\) | No (weak recovery only) | Yes |
| Ours | \(\tilde{O}(d)\) | Yes | Yes (non-staircase) |
Key Findings¶
- First end-to-end optimal guarantee: Neural networks trained with GD fully learn Multi-Index models with information-theoretically optimal \(\tilde{O}(d)\) samples.
- Training steps cannot be too few: Optimal results require a number of first-layer training steps \(T_1\) that diverges with \(d\), placing the dynamics outside the reach of DMFT/state-evolution analyses that track only \(O(1)\) steps.
- Learning rate cannot be too small: The power iteration approximation requires a moderately large learning rate; too small a rate causes the dynamics to degenerate.
- No spectral initialization or warm-starting required: Standard random initialization suffices—neural networks learn the correct features through gradient descent alone.
- Loss function as preprocessing: Choosing an appropriate loss function is equivalent to selecting a data preprocessing scheme, which can activate additional signal directions.
Highlights & Insights¶
- The discovery of the "implicit power iteration with early stopping" mechanism is the paper's central theoretical contribution—gradient descent is not merely an optimizer; its dynamics inherently encode a spectral method. This provides a new perspective on understanding why deep learning works.
- The counterintuitive finding that \(O(1)\) steps are insufficient challenges DMFT-based analytical frameworks, suggesting that the true mechanism of neural network learning may differ from the predictions of these simplified theories.
- Treating the loss function as preprocessing establishes a connection between loss selection and feature extraction—in practice, the choice of loss function can be informed by its effect on signal extraction, not solely by its optimization properties.
Limitations & Future Work¶
- The non-staircase assumption is a key limitation—target functions with a generative staircase structure are not covered.
- Hierarchical training (W before a) rather than standard joint GD—theoretically cleaner but less aligned with practice.
- Requires sample splitting (two independent datasets)—a cost of theoretical tractability.
- Only two-layer networks are considered—analysis of deeper networks remains to be developed.
- Constant \(r\) (subspace dimension) and \(p\) (polynomial degree)—high-dimensional settings are not addressed.
- Leading-order optimal but constants are not precisely characterized—the gap in constants relative to optimal spectral methods is not quantified.
Related Work & Insights¶
- vs. Damian et al. 2022: The first neural network learning result for Multi-Index models, but requiring \(d^2\) samples; this paper reduces the requirement to \(d\)—a leading-order improvement.
- vs. Mousavi-Hosseini & Wu 2026: A concurrent work achieving \(O(d)\) samples but only weak, incomplete subspace recovery; this paper proves full recovery with end-to-end learning.
- vs. Spectral methods (DLB25): Information-theoretically optimal but not based on neural networks; this paper demonstrates that neural networks can match the efficiency of spectral methods.
- vs. DMFT/state-evolution methods: These require \(O(1)\) training steps or warm-starting; this paper proves that optimal learning requires a diverging number of steps.
Rating¶
- Novelty: ⭐⭐⭐⭐⭐ First proof that neural networks attain the information-theoretic optimum for Multi-Index learning; exceptionally deep theoretical contributions.
- Experimental Thoroughness: ⭐⭐ Purely theoretical work with no empirical validation (though the theory is self-contained and complete).
- Writing Quality: ⭐⭐⭐⭐ Mathematically rigorous with clear problem positioning, though notation density is high.
- Value: ⭐⭐⭐⭐⭐ A milestone contribution to the theory of feature learning in neural networks.