# Global Minimizers of ℓp-Regularized Objectives Yield the Sparsest ReLU Neural Networks
- Conference: NeurIPS 2025
- arXiv: 2505.21791
- Code: None
- Area: Neural Network Sparsification / Optimization Theory
- Keywords: ℓp regularization, ReLU networks, sparsity, global minimizers, network pruning
## TL;DR
This paper proves that, for single-hidden-layer ReLU networks, global minimizers of the \(\ell^p\) (\(0 < p < 1\)) path norm correspond exactly to the sparsest data-interpolating networks, thereby recasting the combinatorial sparse-interpolation problem as a continuous optimization task.
## Background & Motivation

### Importance of Sparse Networks
Overparameterized neural networks can interpolate a given dataset in many ways. A fundamental question is: among all feasible solutions, which should be preferred? Sparse networks (those with the fewest parameters or neurons) are particularly valuable for several reasons:

- Computational efficiency: lower storage and inference overhead
- Generalization: sparse networks empirically tend to generalize better
- Interpretability: fewer neurons yield a more transparent model structure
- Model compression: provides a foundation for knowledge distillation and deployment
### Limitations of Prior Work
Existing sparsification strategies are largely heuristic and lack theoretical guarantees:
| Method Category | Representative Methods | Theoretical Guarantee |
|---|---|---|
| Pruning | Lottery Ticket, Magnitude Pruning | No sparsest-solution guarantee |
| \(\ell^1\) Regularization | Lasso, path norm minimization | Non-unique solutions; possibly non-sparsest |
| Structured Sparsity | Group Lasso, \(\ell_{2,1}\) | Group-level sparsity, not finest-grained |
| Stochastic Gating | \(\ell^0\) approximation | Approximate solutions, no exact guarantees |
Notably, \(\ell^1\) path norm minimization, despite its apparent promotion of sparsity, has the following shortcomings in the neural network setting:

- Solutions are generally non-unique
- Certain \(\ell^1\)-optimal solutions may have arbitrarily many neurons
- The intuition that \(\ell^1\) implies sparsity therefore does not carry over to neural networks
### Paper Goals
This paper proposes a continuous (almost everywhere differentiable) regularization objective based on the \(\ell^p\) (\(0 < p < 1\)) quasi-norm, whose global minimizers are provably the sparsest ReLU networks.
## Method

### Overall Architecture

#### Univariate Case (Input Dimension \(d = 1\))
Consider a single-hidden-layer ReLU network of the standard form

\[
f_\theta(x) = \sum_{k=1}^{K} v_k \,\sigma(w_k x + b_k) + a x + c, \qquad \sigma(t) = \max(t, 0),
\]

with hidden width \(K\), parameters \(\theta = (v_k, w_k, b_k)_k\), and an affine skip term. Given data \((x_n, y_n)_{n=1}^N\), the \(\ell^p\) path norm minimization problem is

\[
\min_{\theta} \; \sum_{k=1}^{K} |v_k w_k|^p \quad \text{s.t.} \quad f_\theta(x_n) = y_n, \; n = 1, \dots, N.
\]
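To make the objective concrete, here is a minimal NumPy sketch, not from the paper, that evaluates \(f_\theta\) and its \(\ell^p\) path norm for an illustrative univariate network (all weights below are hypothetical):

```python
import numpy as np

def forward(x, v, w, b, a=0.0, c=0.0):
    """Single-hidden-layer ReLU network with an affine skip term:
    f(x) = sum_k v_k * max(w_k * x + b_k, 0) + a * x + c."""
    return np.maximum(np.outer(x, w) + b, 0.0) @ v + a * x + c

def lp_path_norm(v, w, p):
    """ell^p path norm, 0 < p < 1: sum_k |v_k * w_k|^p."""
    return np.sum(np.abs(v * w) ** p)

# Illustrative network with K = 3 hidden neurons.
v = np.array([1.0, -2.0, 0.5])
w = np.array([1.0, 1.0, -1.0])
b = np.array([0.0, -1.0, 2.0])

x = np.linspace(-3.0, 3.0, 7)
print(forward(x, v, w, b))          # network values on a grid
print(lp_path_norm(v, w, p=0.5))    # the quantity being minimized
```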
#### Multivariate Case (Input Dimension \(d > 1\))

For ReLU networks mapping \(\mathbb{R}^d \to \mathbb{R}\), an additional \(\ell_\infty\) constraint on the weights is imposed, since (as discussed under Key Findings below) unconstrained \(\ell^p\) minimization need not have bounded minimizers.
### Key Designs

#### Variational Reformulation (Proposition 3.1)

The network optimization problem is equivalently reformulated as a variational problem over continuous piecewise-linear (CPWL) functions:

\[
\min_{f \,\text{CPWL}} \; V_p(f) \quad \text{s.t.} \quad f(x_n) = y_n, \; n = 1, \dots, N,
\]
where \(V_p(f)\) denotes the \(p\)-variation of the derivative of \(f\). For a CPWL function with \(K\) breakpoints,

\[
V_p(f) = \sum_{k=1}^{K} |c_k|^p,
\]

with \(c_k\) the slope change at the \(k\)-th breakpoint. This reformulation turns the parameter-space optimization into a geometric problem in function space.
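A small sketch (with hypothetical slopes) of this quantity, illustrating that \(V_p\) interpolates between the breakpoint count as \(p \to 0\) and the total variation of \(f'\) at \(p = 1\):

```python
import numpy as np

def p_variation(slopes, p):
    """V_p of a CPWL function, given the slopes of its consecutive
    linear pieces: the sum of |slope change|^p over breakpoints."""
    c = np.diff(slopes)      # c_k = slope change at the k-th breakpoint
    c = c[c != 0.0]          # only genuine breakpoints contribute
    return np.sum(np.abs(c) ** p)

slopes = np.array([0.0, 2.0, -1.0, -1.0, 0.5])  # hypothetical pieces
for p in (1.0, 0.5, 0.1, 0.01):
    print(p, p_variation(slopes, p))
# As p -> 0, each term |c_k|^p -> 1, so V_p counts breakpoints (here 3),
# i.e. the ell^0 objective; at p = 1 it is the ell^1 path norm.
```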
#### Geometric Characterization (Theorem 3.1)
The core theorem precisely characterizes the geometric behavior of optimal functions for \(0 < p < 1\):
- Linear regions: The optimal function must be linear before the second data point \(x_2\) and after the second-to-last data point \(x_{N-1}\), as well as between consecutive data points at which the discrete curvature changes sign.
- Constant-curvature regions: Over a run of consecutive data points \(x_i, \ldots, x_{i+m}\) sharing the same discrete curvature sign, the function is convex (resp. concave) with at most \(m-1\) breakpoints.
- Slope constraints: The slope \(u_j\) of the \(j\)-th linear piece in such a run satisfies \(s_{i+j-1} \leq u_j \leq s_{i+j}\), where \(s_i\) is the secant slope between \(x_i\) and \(x_{i+1}\) (assuming positive curvature; see the sketch after this list).
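The characterization is phrased in terms of the secant slopes \(s_i = (y_{i+1} - y_i)/(x_{i+1} - x_i)\) and the sign pattern of their changes; a brief sketch on hypothetical data extracts both:

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])  # hypothetical sorted data
y = np.array([0.0, 1.0, 3.0, 4.0, 3.5])

s = np.diff(y) / np.diff(x)   # secant slopes s_1, ..., s_{N-1}
curv = np.sign(np.diff(s))    # discrete curvature sign at x_2, ..., x_{N-1}

print("secant slopes:  ", s)     # [ 1.  2.  1. -0.5]
print("curvature signs:", curv)  # [ 1. -1. -1.]
# Runs of equal curvature sign delimit the regions where the optimal
# interpolant is convex (resp. concave), with each piece slope u_j
# sandwiched between consecutive secant slopes.
```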
#### Uniqueness and Sparsity (Theorem 3.2)
For Lebesgue almost every \(0 < p < 1\), the solution is unique. Moreover, there exists a data-dependent threshold \(p^*\) such that for \(0 < p < p^*\), the \(\ell^p\)-optimal solution coincides exactly with the sparsest (\(\ell^0\)-optimal) interpolating network.
In particular:

- Any \(\ell^p\)-optimal solution has at most \(N-2\) active neurons.
- By contrast, \(\ell^1\)-optimal solutions may have arbitrarily many neurons.
#### Multivariate Extension (Theorem 4.1)

Using the Bauer maximum principle and properties of concave minimization over polytopes (a toy illustration follows this list):

- The optimization problem is reduced to minimizing a finite-dimensional concave function over a polytope.
- The optimal solution must be attained at an extreme point of the polytope.
- There exists \(p^*\) such that the \(\ell^p\)-\(\ell^0\) equivalence holds.
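A toy illustration of the extreme-point mechanism (not the paper's construction): for \(0 < p < 1\), \(x \mapsto \sum_i x_i^p\) is concave on the nonnegative orthant, so its minimum over a polytope, here the probability simplex, is attained at a vertex, which is exactly a sparsest feasible point:

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.5

def lp_energy(x):
    return np.sum(np.abs(x) ** p)   # concave on x >= 0 for 0 < p < 1

# Polytope: the simplex {x >= 0, sum(x) = 1} in R^3.
vertices = np.eye(3)                        # extreme points e_1, e_2, e_3
interior = rng.dirichlet(np.ones(3), 1000)  # random points of the simplex

print("min over vertices:", min(lp_energy(v) for v in vertices))   # 1.0
print("min over interior:", min(lp_energy(x) for x in interior))   # > 1
# The concave objective is minimized at an extreme point (a 1-sparse
# vector), never strictly inside: the Bauer-type argument in a nutshell.
```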
### Loss & Training

This paper does not address specific training algorithms; its core contribution is establishing the correct form of the objective function. The \(\ell^p\) path norm objective is continuous and almost everywhere differentiable, so it can in principle be optimized by (non-convex) gradient descent.
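As a minimal sketch of what such training could look like (entirely my own construction: the toy dataset, the smoothing \(\varepsilon\), the penalty weight \(\lambda\), and the step size are all illustrative assumptions), one can run gradient descent on a penalized, smoothed version of the objective:

```python
import numpy as np

rng = np.random.default_rng(1)
x_data = np.array([0.0, 1.0, 2.0, 3.0])   # toy 1-D dataset
y_data = np.array([0.0, 1.0, 1.0, 2.0])

K, p = 8, 0.5                     # overparameterized width, ell^p exponent
lam, eps, lr = 1e-3, 1e-8, 1e-2   # penalty weight, smoothing, step size
v = rng.normal(size=K)
w = rng.normal(size=K)
b = rng.normal(size=K)

for _ in range(20000):
    pre = np.outer(x_data, w) + b        # (N, K) pre-activations
    act = np.maximum(pre, 0.0)           # ReLU outputs
    r = act @ v - y_data                 # interpolation residuals
    # smoothed penalty: sum_k ((v_k w_k)^2 + eps)^(p/2)  ~  |v_k w_k|^p
    s = ((v * w) ** 2 + eps) ** (p / 2 - 1)
    mask = (pre > 0).astype(float)       # ReLU derivative
    g_v = act.T @ r + lam * p * v * w**2 * s
    g_w = v * ((mask * x_data[:, None]).T @ r) + lam * p * v**2 * w * s
    g_b = v * (mask.T @ r)
    v, w, b = v - lr * g_v, w - lr * g_w, b - lr * g_b

f = np.maximum(np.outer(x_data, w) + b, 0.0) @ v
print("max fit error :", np.max(np.abs(f - y_data)))
print("active neurons:", int(np.sum(np.abs(v * w) > 1e-3)))
```

Whether such a scheme reaches the global optimum is exactly the open convergence question noted under Limitations below.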
## Key Experimental Results

### Comparison of Theoretical Guarantees
This is a purely theoretical work. Core results are presented as theorems:
| Theorem | Dimension | Result | Condition |
|---|---|---|---|
| Theorem 3.1 | \(d=1\) | Geometric characterization: optimal function is piecewise linear + locally convex/concave | \(0 < p < 1\) |
| Corollary 3.1.1 | \(d=1\) | \(\ell^p\)-optimal \(\Rightarrow\) \(\ell^1\)-optimal; \(\leq N-2\) neurons | \(0 < p < 1\) |
| Theorem 3.2 | \(d=1\) | Almost unique; \(\exists\, p^*\) such that \(\ell^p = \ell^0\) | No data assumptions required |
| Proposition 4.1 | Any \(d\) | \(\ell^0\)-optimal solution has \(\leq N\) active neurons | General position assumption |
| Theorem 4.1 | Any \(d\) | \(\exists\, p^*\) such that \(\ell^p\)-optimal \(=\) \(\ell^0\)-optimal | Requires \(\ell_\infty\) constraint |
### Sparsity Guarantees Across Regularization Strategies
| Regularization Strategy | Sparsest Guarantee | Neuron Count Upper Bound | Solution Uniqueness |
|---|---|---|---|
| \(\ell^1\) path norm | No (arbitrarily many neurons possible) | None (unbounded in the worst case) | Non-unique |
| \(\ell^0\) path norm | Yes (by definition) | \(N\) | Non-unique |
| \(\ell^p\) (\(0 < p < 1\)) | Yes (for sufficiently small \(p\)) | \(N-2\) (\(d=1\)) / \(N\) (any \(d\)) | Almost unique |
| Weight decay (\(\ell^2\)) | No | No guarantee | — |
### Key Findings
- \(\ell^1\) does not guarantee sparsity in neural networks: Unlike finite-dimensional compressed sensing, where \(\ell^1\) minimization recovers sparse solutions, \(\ell^1\) path norm minimization in neural networks can yield solutions with arbitrarily many neurons, because the parameterization of CPWL functions is non-unique (see the sketch after this list).
- \(\ell^p\) (\(0 < p < 1\)) simultaneously minimizes \(\ell^1\) and \(\ell^0\): Corollary 3.1.1 shows that \(\ell^p\)-optimal solutions are also \(\ell^1\)-optimal, thereby achieving both sparsity (few neurons) and norm control (bounded weights).
- From combinatorial to continuous optimization: While \(\ell^0\) minimization is NP-hard, this paper proves that minimizing the continuous \(\ell^p\) objective exactly recovers the \(\ell^0\) solution.
- Constraints are necessary in the multivariate case: Contrary to a claim of Peng et al. (2015), unconstrained \(\ell^p\) minimization does not necessarily yield bounded solutions (this paper corrects an error in their proof); hence an explicit \(\ell_\infty\) constraint is required in the multivariate setting.
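A minimal numerical sketch of that non-uniqueness mechanism (my own example, not from the paper): splitting one neuron into \(m\) identical copies with output weights scaled by \(1/m\) leaves both the function and the \(\ell^1\) path norm unchanged, but multiplies the \(\ell^p\) contribution by \(m^{1-p} > 1\), so only \(\ell^p\) penalizes the redundant parameterization:

```python
import numpy as np

def path_norm(v, w, p):
    return np.sum(np.abs(v * w) ** p)

v, w = np.array([2.0]), np.array([1.0])   # one neuron: f(x) = 2*relu(x)
m = 5
v_split = np.full(m, v[0] / m)            # five copies of 0.4*relu(x):
w_split = np.full(m, w[0])                # the same function of x

for p in (1.0, 0.5):
    print(p, path_norm(v, w, p), path_norm(v_split, w_split, p))
# p = 1.0: 2.00 vs 2.00 -> ell^1 cannot tell the two networks apart
# p = 0.5: 1.41 vs 3.16 -> ell^p strictly prefers the sparse network
```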
## Highlights & Insights
- Conceptual breakthrough: The first rigorous proof that a continuous regularization objective can exactly recover the sparsest neural network.
- Variational perspective: Recasting neural network optimization as a variational problem over the space of CPWL functions yields strong geometric intuition.
- Elegant use of the Bauer maximum principle: Exploiting the extremal properties of concave functions on convex sets reduces an infinite-dimensional problem to the comparison of finitely many candidate solutions.
- Correction of a prior error: Identifies and corrects an incorrect claim in Peng et al. (2015) regarding the boundedness of \(\ell^p\) minimization solutions.
## Limitations & Future Work
- Results are limited to single-hidden-layer ReLU networks; extension to deep networks is an important direction for future work.
- No method is provided for computing the threshold \(p^*\); how to select \(p\) in practice remains an open problem.
- Only exact interpolation is considered; the case with training error (approximate fitting) is not addressed.
- The \(\ell_\infty\) constraint required in the multivariate case needs further study regarding practical implementation during training.
- Convergence of non-convex optimization: even with the correct objective, whether gradient descent can find the global optimum remains to be analyzed.
## Related Work & Insights
- Debarre et al. (2022): Characterizes \(\ell^1\) path norm optimal solutions but cannot guarantee sparsity.
- Boursier & Flammarion (2023): Proves sparsity under \(\ell^1\) + bias regularization with strong data assumptions.
- Ergen & Pilanci (2021): Requires \(d > N\) and whitened data assumptions.
- Pilanci & Ergen (2020): Convex reformulation framework, which inspired the derivation of Lemma 4.1 in this paper.
- Parhi & Nowak (2021, 2022): Representation theorems within a variational framework; the present work is a natural extension thereof.
- The results of this paper provide a theoretical foundation for designing new sparse training algorithms (e.g., reweighted \(\ell^1\) methods; a sketch follows).
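For instance, the majorize-minimize view of the \(\ell^p\) penalty suggests a reweighted \(\ell^1\) scheme (a sketch of the standard weight update, not an algorithm from the paper; \(\varepsilon\) is a hypothetical smoothing constant):

```python
import numpy as np

def irl1_weights(c, p=0.5, eps=1e-6):
    """One MM step for sum_k |c_k|^p: linearizing around the current
    iterate c gives per-coordinate ell^1 weights p * (|c_k| + eps)^(p-1)."""
    return p * (np.abs(c) + eps) ** (p - 1)

c = np.array([1.0, 0.05, 0.0])   # hypothetical current path weights
print(irl1_weights(c))           # [~0.5, ~2.24, 500.0]
# Small coordinates receive large ell^1 weights and are driven to exactly
# zero in the next weighted-ell^1 step, mimicking ell^p sparsification.
```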
## Rating
- Novelty: ⭐⭐⭐⭐⭐ — First rigorous establishment of \(\ell^p\)-\(\ell^0\) equivalence
- Technical Depth: ⭐⭐⭐⭐⭐ — Sophisticated variational analysis and convex geometry arguments
- Experimental Thoroughness: ⭐⭐⭐ — Theoretical breakthrough, but algorithmic design requires follow-up work
- Writing Quality: ⭐⭐⭐⭐ — Geometric intuition is well presented, though mathematical details are dense
- Overall Score: 8.5/10