# Global Minimizers of ℓp-Regularized Objectives Yield the Sparsest ReLU Neural Networks
- Conference: NeurIPS 2025
- arXiv: 2505.21791
- Code: None
- Area: Neural Network Sparsification / Optimization Theory
- Keywords: ℓp regularization, ReLU networks, sparsity, global minimizers, network pruning
## TL;DR
This paper proves that, for single-hidden-layer ReLU networks, global minimizers of the \(\ell^p\) (\(0 < p < 1\)) path norm correspond exactly to the sparsest data-interpolating networks, thereby recasting the combinatorial sparse-interpolation problem as a continuous optimization task.
## Background & Motivation

### Importance of Sparse Networks
Overparameterized neural networks can interpolate a given dataset in many ways. A fundamental question is: among all feasible solutions, which should be preferred? Sparse networks (those with the fewest parameters or neurons) are particularly valuable for several reasons:

- Computational efficiency: lower storage and inference overhead
- Generalization: sparse networks empirically tend to generalize better
- Interpretability: fewer neurons yield a more transparent model structure
- Model compression: provides a foundation for knowledge distillation and deployment
### Limitations of Prior Work
Existing sparsification strategies are largely heuristic and lack theoretical guarantees:
| Method Category | Representative Methods | Theoretical Guarantee |
|---|---|---|
| Pruning | Lottery Ticket, Magnitude Pruning | No sparsest-solution guarantee |
| \(\ell^1\) Regularization | Lasso, path norm minimization | Non-unique solutions; possibly non-sparsest |
| Structured Sparsity | Group Lasso, \(\ell_{2,1}\) | Group-level sparsity, not finest-grained |
| Stochastic Gating | \(\ell^0\) approximation | Approximate solutions, no exact guarantees |
Notably, \(\ell^1\) path norm minimization, despite its apparent promotion of sparsity, has the following shortcomings in the neural network setting:

- Solutions are generally non-unique
- Certain \(\ell^1\)-optimal solutions may have arbitrarily many neurons
- The intuition that \(\ell^1\) implies sparsity therefore does not carry over to neural networks
### Paper Goals
This paper proposes a continuous (almost everywhere differentiable) regularization objective based on the \(\ell^p\) (\(0 < p < 1\)) quasi-norm, whose global minimizers are provably the sparsest ReLU networks.
## Method

### Overall Architecture

#### Univariate Case (Input Dimension \(d = 1\))
Consider a single-hidden-layer ReLU network of the standard form

\[
f_\theta(x) = \sum_{k=1}^{K} v_k \,\sigma(w_k x + b_k) + a x + c, \qquad \sigma(t) = \max(t, 0),
\]

with hidden width \(K\), parameters \(\theta = (v_k, w_k, b_k)_k\), and an affine skip term. Given data \((x_n, y_n)_{n=1}^N\), the \(\ell^p\) path norm minimization problem is

\[
\min_{\theta} \; \sum_{k=1}^{K} |v_k w_k|^p \quad \text{s.t.} \quad f_\theta(x_n) = y_n, \; n = 1, \dots, N.
\]
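To make the objective concrete, here is a minimal NumPy sketch, not from the paper, that evaluates \(f_\theta\) and its \(\ell^p\) path norm for an illustrative univariate network (all weights below are hypothetical):

```python
import numpy as np

def forward(x, v, w, b, a=0.0, c=0.0):
    """Single-hidden-layer ReLU network with an affine skip term:
    f(x) = sum_k v_k * max(w_k * x + b_k, 0) + a * x + c."""
    return np.maximum(np.outer(x, w) + b, 0.0) @ v + a * x + c

def lp_path_norm(v, w, p):
    """ell^p path norm, 0 < p < 1: sum_k |v_k * w_k|^p."""
    return np.sum(np.abs(v * w) ** p)

# Illustrative network with K = 3 hidden neurons.
v = np.array([1.0, -2.0, 0.5])
w = np.array([1.0, 1.0, -1.0])
b = np.array([0.0, -1.0, 2.0])

x = np.linspace(-3.0, 3.0, 7)
print(forward(x, v, w, b))          # network values on a grid
print(lp_path_norm(v, w, p=0.5))    # the quantity being minimized
```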
#### Multivariate Case (Input Dimension \(d > 1\))

For ReLU networks mapping \(\mathbb{R}^d \to \mathbb{R}\), an additional \(\ell_\infty\) constraint on the weights is imposed, since (as discussed under Key Findings below) unconstrained \(\ell^p\) minimization need not have bounded minimizers.
### Key Designs

#### Variational Reformulation (Proposition 3.1)

The network optimization problem is equivalently reformulated as a variational problem over continuous piecewise-linear (CPWL) functions:

\[
\min_{f \,\text{CPWL}} \; V_p(f) \quad \text{s.t.} \quad f(x_n) = y_n, \; n = 1, \dots, N,
\]
where \(V_p(f)\) denotes the \(p\)-variation of the derivative of \(f\). For a CPWL function with \(K\) breakpoints,

\[
V_p(f) = \sum_{k=1}^{K} |c_k|^p,
\]

with \(c_k\) the slope change at the \(k\)-th breakpoint. This reformulation turns the parameter-space optimization into a geometric problem in function space.
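A small sketch (with hypothetical slopes) of this quantity, illustrating that \(V_p\) interpolates between the breakpoint count as \(p \to 0\) and the total variation of \(f'\) at \(p = 1\):

```python
import numpy as np

def p_variation(slopes, p):
    """V_p of a CPWL function, given the slopes of its consecutive
    linear pieces: the sum of |slope change|^p over breakpoints."""
    c = np.diff(slopes)      # c_k = slope change at the k-th breakpoint
    c = c[c != 0.0]          # only genuine breakpoints contribute
    return np.sum(np.abs(c) ** p)

slopes = np.array([0.0, 2.0, -1.0, -1.0, 0.5])  # hypothetical pieces
for p in (1.0, 0.5, 0.1, 0.01):
    print(p, p_variation(slopes, p))
# As p -> 0, each term |c_k|^p -> 1, so V_p counts breakpoints (here 3),
# i.e. the ell^0 objective; at p = 1 it is the ell^1 path norm.
```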
#### Geometric Characterization (Theorem 3.1)
The core theorem precisely characterizes the geometric behavior of optimal functions for \(0 < p < 1\):
- Linear regions: The optimal function must be linear before the second data point \(x_2\) and after the second-to-last data point \(x_{N-1}\), as well as between consecutive data points at which the discrete curvature changes sign.
- Constant-curvature regions: Over a run of consecutive data points \(x_i, \ldots, x_{i+m}\) sharing the same discrete curvature sign, the function is convex (resp. concave) with at most \(m-1\) breakpoints.
- Slope constraints: The slope \(u_j\) of the \(j\)-th linear piece in such a run satisfies \(s_{i+j-1} \leq u_j \leq s_{i+j}\), where \(s_i\) is the secant slope between \(x_i\) and \(x_{i+1}\) (assuming positive curvature; see the sketch after this list).
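The characterization is phrased in terms of the secant slopes \(s_i = (y_{i+1} - y_i)/(x_{i+1} - x_i)\) and the sign pattern of their changes; a brief sketch on hypothetical data extracts both:

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])  # hypothetical sorted data
y = np.array([0.0, 1.0, 3.0, 4.0, 3.5])

s = np.diff(y) / np.diff(x)   # secant slopes s_1, ..., s_{N-1}
curv = np.sign(np.diff(s))    # discrete curvature sign at x_2, ..., x_{N-1}

print("secant slopes:  ", s)     # [ 1.  2.  1. -0.5]
print("curvature signs:", curv)  # [ 1. -1. -1.]
# Runs of equal curvature sign delimit the regions where the optimal
# interpolant is convex (resp. concave), with each piece slope u_j
# sandwiched between consecutive secant slopes.
```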
#### Uniqueness and Sparsity (Theorem 3.2)
For Lebesgue almost every \(0 < p < 1\), the solution is unique. Moreover, there exists a data-dependent threshold \(p^*\) such that for \(0 < p < p^*\), the \(\ell^p\)-optimal solution coincides exactly with the sparsest (\(\ell^0\)-optimal) interpolating network.
In particular:

- Any \(\ell^p\)-optimal solution has at most \(N-2\) active neurons.
- By contrast, \(\ell^1\)-optimal solutions may have arbitrarily many neurons.
#### Multivariate Extension (Theorem 4.1)

Using the Bauer maximum principle and properties of concave minimization over polytopes (a toy illustration follows this list):

- The optimization problem is reduced to minimizing a finite-dimensional concave function over a polytope.
- The optimal solution must be attained at an extreme point of the polytope.
- There exists \(p^*\) such that the \(\ell^p\)-\(\ell^0\) equivalence holds.
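A toy illustration of the extreme-point mechanism (not the paper's construction): for \(0 < p < 1\), \(x \mapsto \sum_i x_i^p\) is concave on the nonnegative orthant, so its minimum over a polytope, here the probability simplex, is attained at a vertex, which is exactly a sparsest feasible point:

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.5

def lp_energy(x):
    return np.sum(np.abs(x) ** p)   # concave on x >= 0 for 0 < p < 1

# Polytope: the simplex {x >= 0, sum(x) = 1} in R^3.
vertices = np.eye(3)                        # extreme points e_1, e_2, e_3
interior = rng.dirichlet(np.ones(3), 1000)  # random points of the simplex

print("min over vertices:", min(lp_energy(v) for v in vertices))   # 1.0
print("min over interior:", min(lp_energy(x) for x in interior))   # > 1
# The concave objective is minimized at an extreme point (a 1-sparse
# vector), never strictly inside: the Bauer-type argument in a nutshell.
```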
### Loss & Training

This paper does not address specific training algorithms; its core contribution is establishing the correct form of the objective function. The \(\ell^p\) path norm objective is continuous and almost everywhere differentiable, so it can in principle be optimized by (non-convex) gradient descent.
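As a minimal sketch of what such training could look like (entirely my own construction: the toy dataset, the smoothing \(\varepsilon\), the penalty weight \(\lambda\), and the step size are all illustrative assumptions), one can run gradient descent on a penalized, smoothed version of the objective:

```python
import numpy as np

rng = np.random.default_rng(1)
x_data = np.array([0.0, 1.0, 2.0, 3.0])   # toy 1-D dataset
y_data = np.array([0.0, 1.0, 1.0, 2.0])

K, p = 8, 0.5                     # overparameterized width, ell^p exponent
lam, eps, lr = 1e-3, 1e-8, 1e-2   # penalty weight, smoothing, step size
v = rng.normal(size=K)
w = rng.normal(size=K)
b = rng.normal(size=K)

for _ in range(20000):
    pre = np.outer(x_data, w) + b        # (N, K) pre-activations
    act = np.maximum(pre, 0.0)           # ReLU outputs
    r = act @ v - y_data                 # interpolation residuals
    # smoothed penalty: sum_k ((v_k w_k)^2 + eps)^(p/2)  ~  |v_k w_k|^p
    s = ((v * w) ** 2 + eps) ** (p / 2 - 1)
    mask = (pre > 0).astype(float)       # ReLU derivative
    g_v = act.T @ r + lam * p * v * w**2 * s
    g_w = v * ((mask * x_data[:, None]).T @ r) + lam * p * v**2 * w * s
    g_b = v * (mask.T @ r)
    v, w, b = v - lr * g_v, w - lr * g_w, b - lr * g_b

f = np.maximum(np.outer(x_data, w) + b, 0.0) @ v
print("max fit error :", np.max(np.abs(f - y_data)))
print("active neurons:", int(np.sum(np.abs(v * w) > 1e-3)))
```

Whether such a scheme reaches the global optimum is exactly the open convergence question noted under Limitations below.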
## Key Experimental Results

### Comparison of Theoretical Guarantees
This is a purely theoretical work. Core results are presented as theorems:
| Theorem | Dimension | Result | Condition |
|---|---|---|---|
| Theorem 3.1 | \(d=1\) | Geometric characterization: optimal function is piecewise linear + locally convex/concave | \(0 < p < 1\) |
| Corollary 3.1.1 | \(d=1\) | \(\ell^p\)-optimal \(\Rightarrow\) \(\ell^1\)-optimal; \(\leq N-2\) neurons | \(0 < p < 1\) |
| Theorem 3.2 | \(d=1\) | Almost unique; \(\exists\, p^*\) such that \(\ell^p = \ell^0\) | No data assumptions required |
| Proposition 4.1 | Any \(d\) | \(\ell^0\)-optimal solution has \(\leq N\) active neurons | General position assumption |
| Theorem 4.1 | Any \(d\) | \(\exists\, p^*\) such that \(\ell^p\)-optimal \(=\) \(\ell^0\)-optimal | Requires \(\ell_\infty\) constraint |
### Sparsity Guarantees Across Regularization Strategies
| Regularization Strategy | Sparsest Guarantee | Neuron Count Upper Bound | Solution Uniqueness |
|---|---|---|---|
| \(\ell^1\) path norm | No (arbitrarily many neurons possible) | None (unbounded in the worst case) | Non-unique |
| \(\ell^0\) path norm | Yes (by definition) | \(N\) | Non-unique |
| \(\ell^p\) (\(0 < p < 1\)) | Yes (for sufficiently small \(p\)) | \(N-2\) (\(d=1\)) / \(N\) (any \(d\)) | Almost unique |
| Weight decay (\(\ell^2\)) | No | No guarantee | — |
### Key Findings
- \(\ell^1\) does not guarantee sparsity in neural networks: Unlike finite-dimensional compressed sensing, where \(\ell^1\) minimization recovers sparse solutions, \(\ell^1\) path norm minimization in neural networks can yield solutions with arbitrarily many neurons, because the parameterization of CPWL functions is non-unique (see the sketch after this list).
- \(\ell^p\) (\(0 < p < 1\)) simultaneously minimizes \(\ell^1\) and \(\ell^0\): Corollary 3.1.1 shows that \(\ell^p\)-optimal solutions are also \(\ell^1\)-optimal, thereby achieving both sparsity (few neurons) and norm control (bounded weights).
- From combinatorial to continuous optimization: While \(\ell^0\) minimization is NP-hard, this paper proves that minimizing the continuous \(\ell^p\) objective exactly recovers the \(\ell^0\) solution.
- Constraints are necessary in the multivariate case: Contrary to a claim of Peng et al. (2015), unconstrained \(\ell^p\) minimization does not necessarily yield bounded solutions (this paper corrects an error in their proof); hence an explicit \(\ell_\infty\) constraint is required in the multivariate setting.
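A minimal numerical sketch of that non-uniqueness mechanism (my own example, not from the paper): splitting one neuron into \(m\) identical copies with output weights scaled by \(1/m\) leaves both the function and the \(\ell^1\) path norm unchanged, but multiplies the \(\ell^p\) contribution by \(m^{1-p} > 1\), so only \(\ell^p\) penalizes the redundant parameterization:

```python
import numpy as np

def path_norm(v, w, p):
    return np.sum(np.abs(v * w) ** p)

v, w = np.array([2.0]), np.array([1.0])   # one neuron: f(x) = 2*relu(x)
m = 5
v_split = np.full(m, v[0] / m)            # five copies of 0.4*relu(x):
w_split = np.full(m, w[0])                # the same function of x

for p in (1.0, 0.5):
    print(p, path_norm(v, w, p), path_norm(v_split, w_split, p))
# p = 1.0: 2.00 vs 2.00 -> ell^1 cannot tell the two networks apart
# p = 0.5: 1.41 vs 3.16 -> ell^p strictly prefers the sparse network
```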
## Highlights & Insights
- Conceptual breakthrough: The first rigorous proof that a continuous regularization objective can exactly recover the sparsest neural network.
- Variational perspective: Recasting neural network optimization as a variational problem over the space of CPWL functions yields strong geometric intuition.
- Elegant use of the Bauer maximum principle: Exploiting the extremal properties of concave functions on convex sets reduces an infinite-dimensional problem to the comparison of finitely many candidate solutions.
- Correction of a prior error: Identifies and corrects an incorrect claim in Peng et al. (2015) regarding the boundedness of \(\ell^p\) minimization solutions.
## Limitations & Future Work
- Results are limited to single-hidden-layer ReLU networks; extension to deep networks is an important direction for future work.
- No method is provided for computing the threshold \(p^*\); how to select \(p\) in practice remains an open problem.
- Only exact interpolation is considered; the case with training error (approximate fitting) is not addressed.
- The \(\ell_\infty\) constraint required in the multivariate case needs further study regarding practical implementation during training.
- Convergence of non-convex optimization: even with the correct objective, whether gradient descent can find the global optimum remains to be analyzed.
## Related Work & Insights
- Debarre et al. (2022): Characterizes \(\ell^1\) path norm optimal solutions but cannot guarantee sparsity.
- Boursier & Flammarion (2023): Proves sparsity under \(\ell^1\) + bias regularization with strong data assumptions.
- Ergen & Pilanci (2021): Requires \(d > N\) and whitened data assumptions.
- Pilanci & Ergen (2020): Convex reformulation framework, which inspired the derivation of Lemma 4.1 in this paper.
- Parhi & Nowak (2021, 2022): Representation theorems within a variational framework; the present work is a natural extension thereof.
- The results of this paper provide a theoretical foundation for designing new sparse training algorithms (e.g., reweighted \(\ell^1\) methods; a sketch follows).
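For instance, the majorize-minimize view of the \(\ell^p\) penalty suggests a reweighted \(\ell^1\) scheme (a sketch of the standard weight update, not an algorithm from the paper; \(\varepsilon\) is a hypothetical smoothing constant):

```python
import numpy as np

def irl1_weights(c, p=0.5, eps=1e-6):
    """One MM step for sum_k |c_k|^p: linearizing around the current
    iterate c gives per-coordinate ell^1 weights p * (|c_k| + eps)^(p-1)."""
    return p * (np.abs(c) + eps) ** (p - 1)

c = np.array([1.0, 0.05, 0.0])   # hypothetical current path weights
print(irl1_weights(c))           # [~0.5, ~2.24, 500.0]
# Small coordinates receive large ell^1 weights and are driven to exactly
# zero in the next weighted-ell^1 step, mimicking ell^p sparsification.
```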
## Rating
- Novelty: ⭐⭐⭐⭐⭐ — First rigorous establishment of \(\ell^p\)-\(\ell^0\) equivalence
- Technical Depth: ⭐⭐⭐⭐⭐ — Sophisticated variational analysis and convex geometry arguments
- Experimental Thoroughness: ⭐⭐⭐ — Theoretical breakthrough, but algorithmic design requires follow-up work
- Writing Quality: ⭐⭐⭐⭐ — Geometric intuition is well presented, though mathematical details are dense
- Overall Score: 8.5/10