LOWER BOUNDS ON THE DEPTH OF INTEGRAL RELU NEURAL NETWORKS VIA LATTICE POLYTOPES

Abstract

We prove that the set of functions representable by ReLU neural networks with integer weights strictly increases with the network depth while allowing arbitrary width. More precisely, we show that ⌈log_2(n)⌉ hidden layers are indeed necessary to compute the maximum of n numbers, matching known upper bounds. Our results are based on the known duality between neural networks and Newton polytopes via tropical geometry. The integrality assumption implies that these Newton polytopes are lattice polytopes. Our depth lower bounds then follow from a parity argument on the normalized volume of faces of such polytopes.

1. INTRODUCTION

Classical results in the area of understanding the expressivity of neural networks are the so-called universal approximation theorems (Cybenko, 1989; Hornik, 1991). They state that shallow neural networks are already capable of approximately representing every continuous function on a bounded domain. However, in order to gain a complete understanding of what is going on in modern neural networks, we would also like to answer the following question: what is the precise set of functions we can compute exactly with neural networks of a certain depth? For instance, insights about exact representability have recently boosted our understanding of the computational complexity of training neural networks, in terms of both algorithms (Arora et al., 2018; Khalife & Basu, 2022) and hardness results (Goel et al., 2021; Froese et al., 2022; Bertschinger et al., 2022).

Arguably, the most prominent activation function nowadays is the rectified linear unit (ReLU) (Glorot et al., 2011; Goodfellow et al., 2016). While its popularity is primarily fueled by intuition and empirical success, replacing previously used smooth activation functions like sigmoids with ReLUs has interesting implications from a mathematical perspective: suddenly, methods from discrete geometry studying piecewise linear functions and polytopes play a crucial role in understanding neural networks (Arora et al., 2018; Zhang et al., 2018; Hertrich et al., 2021), supplementing the traditionally dominant analytical point of view. A fundamental result in this direction is by Arora et al. (2018), who show that a function is representable by a ReLU neural network if and only if it is continuous and piecewise linear (CPWL). Moreover, their proof implies that ⌈log_2(n + 1)⌉ hidden layers are sufficient to represent every CPWL function with n-dimensional input.
A natural follow-up question is the following: is this logarithmic number of layers actually necessary, or can shallower neural networks already represent all CPWL functions? Hertrich et al. (2021) conjecture that the former alternative is true. More precisely, if ReLU_n(k) denotes the set of CPWL functions defined on R^n and computable with k hidden layers, the conjecture can be formulated as follows.

Conjecture 1 (Hertrich et al. (2021)). ReLU_n(k − 1) ⊊ ReLU_n(k) for all k ≤ ⌈log_2(n + 1)⌉.

Note that ReLU_n(⌈log_2(n + 1)⌉) is the entire set of CPWL functions defined on R^n by the result of Arora et al. (2018). While Hertrich et al. (2021) provide some evidence for their conjecture, it remains open for every input dimension n ≥ 4. Even more drastically, there is not a single CPWL function known for which one can prove that two hidden layers do not suffice to represent it. Even for a function as simple as max{0, x_1, x_2, x_3, x_4}, it is unknown whether two hidden layers are sufficient. In fact, max{0, x_1, x_2, x_3, x_4} is not just an arbitrary example. Based on a result by Wang & Sun (2005), Hertrich et al. (2021) show that their conjecture is equivalent to the following statement.

Conjecture 2 (Hertrich et al. (2021)). For n = 2^k, the function max{0, x_1, . . . , x_n} is not contained in ReLU_n(k).

This reformulation gives rise to interesting interpretations in terms of two elements commonly used in practical neural network architectures: max-pooling and maxout. Max-pooling units are used between (ReLU or other) layers and simply output the maximum of several inputs (that is, they "pool" them together); they contain no trainable parameters themselves. In contrast, maxout networks are an alternative to (and in fact a generalization of) ReLU networks.
Each neuron in a maxout network outputs the maximum of several (trainable) affine combinations of the outputs of the previous layer, in contrast to comparing a single affine combination with zero as in the ReLU case. Thus, the conjecture would imply that one in fact needs logarithmically many ReLU layers to replace a max-pooling unit or a maxout layer, providing a theoretical justification that these elements are indeed more powerful than pure ReLU networks.
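For illustration, the following sketch (function names are ours, not from the paper) contrasts the two neuron types: a ReLU neuron compares a single affine combination with zero, while a maxout neuron takes the maximum over several trainable affine pieces, so a single maxout neuron already computes max{0, x_1, . . . , x_n} exactly.

```python
def relu_neuron(x, w, b):
    # ReLU neuron: maximum of one affine combination and zero.
    return max(0.0, sum(wi * xi for wi, xi in zip(w, x)) + b)

def maxout_neuron(x, pieces):
    # Maxout neuron: maximum over several trainable affine pieces (w, b).
    # Choosing the pieces [(w, b), (all-zero weights, 0)] recovers a ReLU
    # neuron, which is why maxout generalizes ReLU.
    return max(sum(wi * xi for wi, xi in zip(w, x)) + b for (w, b) in pieces)

# One maxout neuron computes max{0, x_1, ..., x_4} exactly, whereas
# Conjecture 2 asserts that 2 hidden ReLU layers cannot.
x = [3.0, -1.0, 7.0, 2.0]
pieces = [([0, 0, 0, 0], 0.0),
          ([1, 0, 0, 0], 0.0),
          ([0, 1, 0, 0], 0.0),
          ([0, 0, 1, 0], 0.0),
          ([0, 0, 0, 1], 0.0)]
print(maxout_neuron(x, pieces))  # 7.0
```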

1.1. OUR RESULTS

In this paper we prove that the conjecture by Hertrich et al. (2021) is true for all n ∈ N under the additional assumption that all weights in the neural network are restricted to be integral. In other words, if ReLU_n^Z(k) is the set of functions defined on R^n representable with k hidden layers and only integer weights, we show the following.

Theorem 3. For n = 2^k, the function max{0, x_1, . . . , x_n} is not contained in ReLU_n^Z(k).

Proving Theorem 3 is our main contribution. The overall strategy is highlighted in Section 1.2; we put all the puzzle pieces together and provide a formal proof in Section 4. The arguments in Hertrich et al. (2021) can be adapted to show that the equivalence between the two conjectures is also valid in the integer case. Thus, we obtain that adding more layers to an integral neural network indeed increases the set of representable functions, up to a logarithmic number of layers. A formal proof can be found in Section 4.

Corollary 4. ReLU_n^Z(k − 1) ⊊ ReLU_n^Z(k) for all k ≤ ⌈log_2(n + 1)⌉.

To the best of our knowledge, our result is the first non-constant (namely logarithmic) lower bound on the depth of ReLU neural networks without any restriction on the width. Without the integrality assumption, the best known lower bound remains two hidden layers (Mukherjee & Basu, 2017), which is already valid for the simple function max{0, x_1, x_2}. While the integrality assumption is rather implausible for practical neural network applications, where weights are usually tuned by gradient descent, from the perspective of analyzing theoretical expressivity the assumption is arguably plausible. To see this, suppose a ReLU network represents the function max{0, x_1, . . . , x_n}. Then every fractional weight must either cancel out or add up to an integer together with other fractional weights, because every linear piece of the final function has only integer coefficients.
Hence, it is natural to assume that no fractional weights exist in the first place. Unfortunately, however, this intuition cannot easily be turned into a proof, because combinations of fractional weights may yield integer coefficients that could not be achieved without fractional weights.
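A toy instance of this obstacle (our own example, using only the positive homogeneity of ReLU): a network with fractional weights can compute a function all of whose linear pieces have integer coefficients.

```python
def relu(x):
    return max(0.0, x)

def fractional_net(x):
    # Hidden weight 1/2 and output weight 2 are both fractional, yet
    # 2 * ReLU(x / 2) = ReLU(x) because ReLU(c * x) = c * ReLU(x) for c > 0,
    # so every linear piece of the output has integer coefficients.
    return 2.0 * relu(0.5 * x)

for x in [-3.0, 0.0, 5.0]:
    assert fractional_net(x) == relu(x)
```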

