
Depth-Bounds for Neural Networks via the Braid Arrangement

Conference: NeurIPS 2025
arXiv: 2502.09324
Code: None
Area: Others (Deep Learning Theory / Expressivity Analysis)
Keywords: ReLU network depth lower bounds, piecewise linear functions, braid arrangement, maxout networks, combinatorial proof

TL;DR

This paper proves that, under \(\mathcal{B}_d^0\)-conforming constraints, exactly representing \(\max\{0, x_1, \ldots, x_d\}\) with a ReLU network requires \(\Omega(\log \log d)\) layers; this is the first non-constant depth lower bound without weight restrictions. It also shows that rank-(3,2) maxout networks can compute the maximum of 7 values, exceeding the naive bound of \(3 \times 2 = 6\).

Background & Motivation

Background: ReLU networks can exactly represent all continuous piecewise linear (CPWL) functions. Arora et al. (2018) showed that \(\lceil \log_2(d+1) \rceil\) layers suffice. Bakaev et al. (2025) improved this to \(\lceil \log_3(d-1) \rceil + 1\) layers. However, depth lower bounds remain largely unexplored.
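
To see what the improvement means numerically, here is a quick tabulation of both upper-bound formulas (a sanity-check sketch of the arithmetic, not code from the paper):

```python
import math

# Tabulate the two depth upper bounds quoted above for a few input sizes d.
for d in [5, 20, 100, 1000]:
    arora = math.ceil(math.log2(d + 1))          # Arora et al. (2018)
    bakaev = math.ceil(math.log(d - 1, 3)) + 1   # Bakaev et al. (2025)
    print(f"d={d:4d}  Arora: {arora} layers  Bakaev: {bakaev} layers")
```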

Core Open Problem: What is the minimum number of layers required to exactly represent all CPWL functions on \(\mathbb{R}^d\)? The best unconditional lower bound currently stands at only 2.

Key Reduction: By Wang & Sun (2005), any CPWL function can be expressed in terms of \(\max\) operations, reducing the problem to the question: how deep must a network be to compute the maximum of \(d\) values?
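
For intuition, a minimal one-dimensional instance of this reduction (standard identities, not taken from the paper): a CPWL function such as \(|x|\) is a max of affine functions, and any pairwise max reduces to a single ReLU:

\[
|x| = \max\{x, -x\}, \qquad \max\{a, b\} = a + \max\{0,\, b - a\} = a + \mathrm{ReLU}(b - a).
\]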

Research Approach: The analysis restricts breakpoints to lie on the braid arrangement, i.e., \(\mathcal{B}_d^0\)-conforming networks. While not fully general, this restriction provides a tractable analytical framework.

Method

Core Concepts

Braid Arrangement \(\mathcal{B}_d\): The hyperplane arrangement defined by \(x_i = x_j\) for \(1 \leq i < j \leq d\). \(\mathcal{B}_d^0\) additionally includes \(x_i = 0\).
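
Concretely, the full-dimensional chambers of \(\mathcal{B}_d\) correspond to orderings of the coordinates, so locating a generic point in the arrangement amounts to sorting (a minimal illustration, not code from the paper):

```python
import numpy as np

# A generic point avoids every hyperplane x_i = x_j; its chamber of the
# braid arrangement B_d is determined by how its coordinates sort.
x = np.array([0.3, -1.2, 0.7])
order = np.argsort(-x)   # descending order of coordinates
print(order)             # [2 0 1], i.e. x_3 > x_1 > x_2
```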

\(\mathcal{B}_d^0\)-conforming Networks: All neuron breakpoints lie on hyperplanes of the braid arrangement. The standard binary-tree computation of max (taking pairwise maxima layer by layer) is \(\mathcal{B}_d^0\)-conforming.
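
A minimal sketch of that standard construction, using the identity \(\max\{a, b\} = a + \mathrm{ReLU}(b - a)\) so that every breakpoint lies on a hyperplane \(x_i = x_j\) (our own illustration, not the paper's code):

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def max_binary_tree(x):
    """Compute max(x_1, ..., x_d) in ceil(log2 d) ReLU rounds of
    pairwise maxima; each breakpoint sits on a hyperplane x_i = x_j,
    so the construction is braid-conforming."""
    vals = list(x)
    while len(vals) > 1:
        nxt = [vals[i] + relu(vals[i + 1] - vals[i])   # = max of the pair
               for i in range(0, len(vals) - 1, 2)]
        if len(vals) % 2 == 1:   # odd leftover passes through unchanged
            nxt.append(vals[-1])
        vals = nxt
    return vals[0]

print(max_binary_tree([0.0, 3.0, -1.0, 2.5, 7.0]))  # 7.0
```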

Key Isomorphism: The space \(\mathcal{V}_{\mathcal{B}_d}\) of \(\mathcal{B}_d\)-compatible CPWL functions is isomorphic to the space of set functions \(\mathcal{F}_d = \mathbb{R}^{2^{[d]}}\) (dimension \(2^d\)) via the map \(\Phi(f)(S) = f(\mathbb{1}_S)\).

Subspace Hierarchy: \(\mathcal{V}_{\mathcal{B}_d}(k) = \text{span}\{\sigma_M \mid M \subseteq [d], |M| \leq k\}\), where \(\sigma_M(x) = \max_{i \in M} x_i\). A strict inclusion chain holds:

\[\mathcal{V}_{\mathcal{B}_d}(0) \subsetneq \mathcal{V}_{\mathcal{B}_d}(1) \subsetneq \cdots \subsetneq \mathcal{V}_{\mathcal{B}_d}(d) = \mathcal{V}_{\mathcal{B}_d}\]
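
The chain can be made concrete for small \(d\) via \(\Phi\): since \(\Phi(\sigma_M)(S) = \max_{i \in M} \mathbb{1}_S(i)\), the image of \(\sigma_M\) is the indicator of \(M \cap S \neq \emptyset\). The sketch below (assuming, as our own convention rather than necessarily the paper's, that \(\sigma_\emptyset\) is the constant function 1) checks that the span dimensions strictly increase with \(k\):

```python
import itertools
import numpy as np

d = 3
subsets = [frozenset(c) for r in range(d + 1)
           for c in itertools.combinations(range(d), r)]

def phi_sigma(M):
    # Phi(sigma_M)(S) = 1 iff M intersects S; sigma_empty := constant 1.
    if not M:
        return np.ones(len(subsets))
    return np.array([1.0 if M & S else 0.0 for S in subsets])

for k in range(d + 1):
    rows = [phi_sigma(frozenset(M))
            for r in range(k + 1)
            for M in itertools.combinations(range(d), r)]
    print(k, np.linalg.matrix_rank(np.array(rows)))
# ranks 1, 4, 7, 8 strictly increase up to dim F_3 = 2^3 = 8
```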

Main Theorem: Double-Logarithmic Lower Bound

Theorem 4.8: For an \(\ell\)-layer rank-2 maxout network, \(\mathcal{M}_{\mathcal{B}_d}^2(\ell) \subseteq \mathcal{V}_{\mathcal{B}_d}(2^{2^\ell - 1})\).

Corollary 4.9: \(\max\{0, x_1, \ldots, x_{2^{2^\ell-1}}\}\) cannot be represented by an \(\ell\)-layer \(\mathcal{B}_d^0\)-conforming ReLU network.

This establishes a depth lower bound of \(\Omega(\log \log d)\).

Proof Mechanism

Define the operation \(\mathcal{A}(U)\): the new subspace obtained by applying one layer of rank-2 maxout to functions in a subspace \(U\).

Key Recurrence (Proposition 4.7): \(\mathcal{A}(\mathcal{F}_{\mathcal{L}}(k)) \subseteq \mathcal{F}_{\mathcal{L}}(k^2 + k)\), where \(\mathcal{F}_{\mathcal{L}}(k)\) is the analogue of \(\mathcal{V}_{\mathcal{B}_d}(k)\) for set functions on a sublattice \(\mathcal{L}\) of the Boolean lattice.

That is, one maxout layer can raise the subspace level from at most \(k\) to at most \(k^2 + k\). Iterating:

  • 1 layer: \(k = 2\)
  • 2 layers: \(k = 6\)
  • 3 layers: \(k = 42\)
  • \(\ell\) layers: \(k \leq 2^{2^\ell - 1}\) (double-exponential growth)

Therefore, an \(\ell\)-layer network can represent \(\sigma_{[d]}\) (the max of \(d\) values) only if \(d \leq 2^{2^\ell - 1}\); solving for \(\ell\) gives \(\ell \geq \log_2(\log_2 d + 1)\), i.e., \(\ell = \Omega(\log \log d)\).
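
A few lines of arithmetic confirm both the recurrence and the inverted depth bound (a sanity-check sketch of the numbers above, not code from the paper; `min_depth` is our own hypothetical helper):

```python
import math

# Iterate the Proposition 4.7 recurrence k -> k^2 + k (k = 2 after one
# layer) and compare with the closed-form bound 2^(2^l - 1).
k = 2
for layers in range(1, 6):
    bound = 2 ** (2 ** layers - 1)
    assert k <= bound
    print(f"{layers} layers: level <= {k} (closed-form bound {bound})")
    k = k * k + k

def min_depth(d):
    # Smallest depth not ruled out by d <= 2^(2^l - 1).
    return math.ceil(math.log2(math.log2(d) + 1))

print(min_depth(4), min_depth(100), min_depth(10**6))  # 2 3 5
```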

Key technical ingredients:

  1. The isomorphism \(\Phi\) translates the problem into analysis over set functions.
  2. The conforming condition is reformulated as \(F \in \mathcal{C}_{\mathcal{L}}\): \(F(S)\) and \(F(T)\) cannot have opposite signs when \(S \subseteq T\) (see the sketch after this list).
  3. Lemma 4.5 ("pushing down supports"): positive and negative supports can be pushed to lower levels of the lattice.
  4. Decomposition and induction over sublattices of the Boolean lattice.
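
Condition 2 is easy to test operationally; below is a minimal sketch (our own code with a hypothetical interface, not the paper's) that checks the sign condition over the Boolean lattice of \([d]\):

```python
import itertools

def is_sign_conforming(F, d):
    """Check the reformulated conforming condition: no pair S ⊆ T of
    subsets of [d] has F(S) and F(T) of strictly opposite signs.
    F maps frozensets to reals (hypothetical interface)."""
    subsets = [frozenset(c) for r in range(d + 1)
               for c in itertools.combinations(range(d), r)]
    return not any(S <= T and F[S] * F[T] < 0
                   for S in subsets for T in subsets)

# Example: Phi(max{x_1, x_2}) for d = 2 is nonnegative everywhere,
# hence trivially satisfies the condition.
F = {frozenset(): 0.0, frozenset({0}): 1.0,
     frozenset({1}): 1.0, frozenset({0, 1}): 1.0}
print(is_sign_conforming(F, 2))  # True
```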

Combinatorial Proof for \(d=4\)

Theorem 5.2: \(\mathcal{M}_{\mathcal{B}_d}^2(2) = \mathcal{V}_{\mathcal{B}_d}(4)\).

That is, a two-layer \(\mathcal{B}_d^0\)-conforming ReLU network cannot exactly represent \(\max\{0, x_1, \ldots, x_4\}\). Previously, this impossibility was only verified computationally via MIP; this paper provides the first purely combinatorial proof.

Unexpected Expressivity of Maxout

Theorem 6.2: \(\max\{0, x_1, \ldots, x_6\}\) can be computed by a rank-(3,2) maxout network.

This shows that the natural bound \(\prod r_i = 3 \times 2 = 6\) on the number of arguments is not tight: a rank-(3,2) network represents the max of 7 values, one more than the naive layer-by-layer composition suggests.

Construction: Two functions \(f_1, f_2 \in \mathcal{V}_{\mathcal{B}_7}(3)\) are identified such that \(\max\{f_1, f_2\} \in \mathcal{V}_{\mathcal{B}_7}(7) \setminus \mathcal{V}_{\mathcal{B}_7}(6)\).
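
For contrast with the naive intuition, the obvious composition reaches only \(3 \times 2 = 6\) arguments (sketch below); the explicit \(f_1, f_2\) achieving 7 are constructed in the paper and not reproduced here:

```python
def naive_rank32(x):
    """Naive rank-(3,2) composition: two rank-3 maxout units feeding a
    rank-2 maxout output compute the max of 3 * 2 = 6 inputs. Theorem 6.2
    shows that a cleverer choice of affine pre-activations reaches 7."""
    h1 = max(x[0], x[1], x[2])   # rank-3 maxout unit
    h2 = max(x[3], x[4], x[5])   # rank-3 maxout unit
    return max(h1, h2)           # rank-2 output unit

print(naive_rank32([0, 3, -1, 2, 5, 4]))  # 5
```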

Key Experimental Results

This is a purely theoretical work with no numerical experiments. Core results are summarized below:

| Result | Content |
|---|---|
| Depth lower bound | \(\Omega(\log \log d)\) (under \(\mathcal{B}_d^0\)-conformity) |
| Previous SOTA | Constant (2) |
| Upper bound | \(\lceil \log_3(d-1) \rceil + 1\) (Bakaev et al.) |
| Exact result for \(d = 4\) | 2 layers \(\Leftrightarrow\) \(\mathcal{V}_{\mathcal{B}_d}(4)\) |
| Maxout surprise | Rank-(3,2) computes the max of 7 values (standard bound: 6) |

Comparison with Existing Lower Bounds

| Method | Restriction | Lower Bound |
|---|---|---|
| Averkov et al. (2025) | \(N\)-ary fractional weights | \(\Omega(\log d / \log \log N)\) |
| Haase et al. (2023) | Integer weights | \(\lceil \log_2(d+1) \rceil\) |
| Bakaev et al. (2025a) | Non-negative weights | Dedicated lower bound |
| Ours | Breakpoints on the braid arrangement | \(\Omega(\log \log d)\) |

This work gives the only non-constant depth lower bound that imposes no restriction on the weights.

Highlights & Insights

  • Although \(\log \log d\) grows slowly, it breaks the barrier of constant lower bounds—representing the first substantial progress in this direction in several years.
  • The proof methodology is elegant: a geometric problem (CPWL functions on polyhedral fans) is fully reduced to a combinatorial one (set functions on Boolean lattices).
  • The combinatorial proof for \(d=4\) supersedes the MIP-based computational verification of Hertrich et al. (2023), providing deeper structural insight.
  • The unexpected expressivity of maxout networks has independent significance: it shows that the expressive power of higher-order activation functions cannot be characterized by simple product formulas.
  • The approach is potentially generalizable to other fan structures, offering tools for understanding depth lower bounds for general networks.

Limitations & Future Work

  • The \(\mathcal{B}_d^0\)-conforming constraint is a genuine restriction: Bakaev et al. (2025b) have shown that general (non-conforming) 2-layer networks can compute the max of 5 values.
  • The \(\log \log d\) growth is extremely slow (the bound does not force more than 10 layers until \(d > 2^{2^{10}-1}\)), leaving a large gap from the expected \(\Theta(\log d)\) upper bound.
  • As a purely theoretical work, its practical implications for architecture design are limited.
  • The proof techniques rely heavily on the structure of the braid arrangement and do not directly extend to general breakpoint configurations.

Rating

  • Novelty: ⭐⭐⭐⭐⭐ First non-constant depth lower bound without weight restrictions; of major academic significance.
  • Experimental Thoroughness: ⭐⭐⭐ Purely theoretical work with no numerical experiments (compensated by constructive proofs).
  • Writing Quality: ⭐⭐⭐⭐ Clear structure; illustrations of Boolean lattice decompositions are helpful for comprehension.
  • Value: ⭐⭐⭐⭐ Advances the core question of depth vs. expressivity in deep learning theory.