LOWER BOUNDS ON THE DEPTH OF INTEGRAL RELU NEURAL NETWORKS VIA LATTICE POLYTOPES

Abstract

We prove that the set of functions representable by ReLU neural networks with integer weights strictly increases with the network depth while allowing arbitrary width. More precisely, we show that $\lceil\log_2(n)\rceil$ hidden layers are indeed necessary to compute the maximum of $n$ numbers, matching known upper bounds. Our results are based on the known duality between neural networks and Newton polytopes via tropical geometry. The integrality assumption implies that these Newton polytopes are lattice polytopes. Then, our depth lower bounds follow from a parity argument on the normalized volume of faces of such polytopes.

1. INTRODUCTION

Classical results in the area of understanding the expressivity of neural networks are so-called universal approximation theorems (Cybenko, 1989; Hornik, 1991). They state that shallow neural networks are already capable of approximately representing every continuous function on a bounded domain. However, in order to gain a complete understanding of what is going on in modern neural networks, we would also like to answer the following question: what is the precise set of functions we can compute exactly with neural networks of a certain depth? For instance, insights about exact representability have recently boosted our understanding of the computational complexity of training neural networks, in terms of both algorithms (Arora et al., 2018; Khalife & Basu, 2022) and hardness results (Goel et al., 2021; Froese et al., 2022; Bertschinger et al., 2022).

Arguably, the most prominent activation function nowadays is the rectified linear unit (ReLU) (Glorot et al., 2011; Goodfellow et al., 2016). While its popularity is primarily fueled by intuition and empirical success, replacing previously used smooth activation functions like sigmoids with ReLUs has some interesting implications from a mathematical perspective: suddenly, methods from discrete geometry studying piecewise linear functions and polytopes play a crucial role in understanding neural networks (Arora et al., 2018; Zhang et al., 2018; Hertrich et al., 2021), supplementing the traditionally dominant analytical point of view. A fundamental result in this direction is by Arora et al. (2018), who show that a function is representable by a ReLU neural network if and only if it is continuous and piecewise linear (CPWL). Moreover, their proof implies that $\lceil\log_2(n+1)\rceil$ many hidden layers are sufficient to represent every CPWL function with $n$-dimensional input. A natural follow-up question is the following: is this logarithmic number of layers actually necessary, or can shallower neural networks already represent all CPWL functions?
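To illustrate where the logarithmic depth in the upper bound comes from, consider the identity $\max(x, y) = \mathrm{ReLU}(x - y) + \mathrm{ReLU}(y) - \mathrm{ReLU}(-y)$, which computes the maximum of two numbers with a single hidden layer using integer weights only. The following Python sketch (our own illustration with hypothetical helper names, not code from any of the works cited above) applies this identity in rounds of pairwise reduction, so that the maximum of $n$ numbers is computed with $\lceil\log_2(n)\rceil$ such hidden layers:

    import numpy as np

    def relu(z):
        return np.maximum(z, 0.0)

    def max2(x, y):
        # One hidden layer with integer weights:
        # max(x, y) = ReLU(x - y) + ReLU(y) - ReLU(-y)
        h = relu(np.array([x - y, y, -y]))
        return h[0] + h[1] - h[2]

    def max_n(values):
        # Pairwise reduction: each round corresponds to one hidden layer,
        # so the maximum of n numbers uses ceil(log2(n)) hidden layers.
        vals = list(values)
        while len(vals) > 1:
            vals = [max2(vals[i], vals[i + 1]) if i + 1 < len(vals) else vals[i]
                    for i in range(0, len(vals), 2)]
        return vals[0]

    print(max_n([3.0, -1.0, 7.5, 2.0]))  # prints 7.5

The depth lower bound stated in the abstract asserts that, under the integrality assumption on the weights, this recursive construction is depth-optimal: no network with fewer than $\lceil\log_2(n)\rceil$ hidden layers computes the maximum of $n$ numbers.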

