BEYOND COUNTING LINEAR REGIONS OF NEURAL NETWORKS, SIMPLE LINEAR REGIONS DOMINATE!

Abstract

Functions represented by a neural network with the widely-used ReLU activation are piecewise linear over linear regions (polytopes). Understanding the properties of such polytopes is of fundamental importance for the development of neural networks. So far, both theoretical and empirical studies of polytopes have stopped at counting their number. Despite its successes, e.g., in explaining the power of depth, counting the number of polytopes places all polytopes on an equal footing, which is an essentially incomplete characterization. Beyond counting, here we study the shapes of polytopes via the number of simplices obtained by triangulating them. First, we establish the properties of this simplex count and derive upper and lower bounds on the maximum number of simplices that a network can generate. Next, by computing and analyzing the histogram of simplices across polytopes, we find that a ReLU network has surprisingly uniform and simple polytopes, although these polytopes can in theory be rather diverse and complicated. This finding is a novel implicit bias that concretely reveals what kind of simple functions a network learns, and it sheds light on why deep learning does not overfit. Lastly, we establish a theorem that explains why the polytopes produced by a deep network are simple and uniform. The core idea of the proof is counter-intuitive: adding depth probably does not create a more complicated polytope. We hope our work inspires more research into the polytopes of ReLU neural networks, thereby advancing the understanding of neural networks to a new level.

1. INTRODUCTION

A thread of studies (Chu et al., 2018; Balestriero & Baraniuk, 2020; Hanin & Rolnick, 2019b; Schonsheck et al., 2019) showed that a neural network with a piecewise linear activation partitions the input space into many convex regions, mathematically referred to as polytopes, and that each polytope is associated with a linear function (hereafter, we use convex regions, linear regions, and polytopes interchangeably). Hence, a neural network is essentially a piecewise linear function over the input domain. Based on this elegant result, a variety of important theoretical advances and empirical findings turn the investigation of neural networks into the investigation of polytopes. By addressing basic questions such as how common operations affect the formation of polytopes (Zhang & Wu, 2020) and how the network topology affects the number of polytopes (Cohen et al., 2016; Poole et al., 2016; Xiong et al., 2020), our understanding of the expressivity of networks has been greatly deepened. To demonstrate the utility of studying polytopes, we present two representative examples. The first is the explanation of the power of depth. In the era of deep learning, many studies (Mohri et al., 2018; Bianchini & Scarselli, 2014; Telgarsky, 2015; Arora et al., 2016) attempted to explain why a deep network can perform superbly compared to a shallow one. One explanation rests on the superior representation power of deep networks, i.e., a deep network can express a more complicated function that a shallow one of similar size cannot (Cohen et al., 2016; Poole et al., 2016; Xiong et al., 2020). The basic idea is to characterize the complexity of the function expressed by a neural network, thereby demonstrating that increasing depth grows this complexity measure far more than increasing width does.
Currently, the number of linear regions is one of the most popular complexity measures because it respects the functional structure of the widely-used ReLU networks. Pascanu et al. (2013) first proposed to use the number of linear regions as the complexity measure. By directly applying Zaslavsky's Theorem (Zaslavsky, 1997), Pascanu et al. (2013) obtained a lower bound $\prod_{l=0}^{L-1}\left\lfloor \frac{n_l}{n_0}\right\rfloor^{n_0}\sum_{i=0}^{n_0}\binom{n_L}{i}$ for the maximum number of linear regions of a fully-connected ReLU network with $n_0$ inputs and $L$ hidden layers of widths $n_1, n_2, \ldots, n_L$. Since this work, deriving lower and upper bounds for the maximum number of linear regions has become an active topic (Montufar et al., 2014; Telgarsky, 2015; Montúfar, 2017; Serra et al., 2018; Croce et al., 2019; Hu & Zhang, 2018; Xiong et al., 2020). All these bounds suggest the expressive power of depth. The second interesting example is the finding of the high-capacity-low-reality phenomenon (Hu et al., 2021; Hanin & Rolnick, 2019b): the theoretical tight upper bound for the number of polytopes is much larger than what is actually learned by a network, i.e., deep ReLU networks have surprisingly few polytopes both at initialization and throughout training. Specifically, Hanin & Rolnick (2019b) proved that the expected number of linear regions in a ReLU network is bounded by a function of the total number of neurons and the input dimension. This counter-intuitive phenomenon can also be regarded as an implicit bias, which to some extent suggests why a deep network does not overfit. Although theoretically many linear regions could be generated to learn a task, a deep network tends to find a simple function with few polytopes.
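The region counts discussed above can be probed empirically on toy networks. As a minimal sketch (the network sizes, the input range, and the grid resolution below are illustrative choices, not taken from the cited papers), the following snippet estimates the number of linear regions of a small random ReLU network by enumerating the distinct activation patterns over a grid of 2D inputs; every distinct pattern corresponds to one linear region that the grid intersects:

```python
import numpy as np

def activation_pattern(x, weights, biases):
    """Return the on/off pattern of every ReLU neuron for input x."""
    pattern = []
    h = x
    for W, b in zip(weights, biases):
        pre = W @ h + b
        pattern.append(tuple(bool(p) for p in (pre > 0)))
        h = np.maximum(pre, 0.0)  # ReLU
    return tuple(pattern)

def count_linear_regions(weights, biases, grid_size=100, lo=-1.0, hi=1.0):
    """Estimate #linear regions as the number of distinct activation
    patterns observed on a grid_size x grid_size grid of 2D inputs."""
    xs = np.linspace(lo, hi, grid_size)
    patterns = set()
    for x1 in xs:
        for x2 in xs:
            patterns.add(activation_pattern(np.array([x1, x2]), weights, biases))
    return len(patterns)

rng = np.random.default_rng(0)
# Toy network: 2 inputs, two hidden layers of width 4 (8 neurons total).
weights = [rng.standard_normal((4, 2)), rng.standard_normal((4, 4))]
biases = [rng.standard_normal(4), rng.standard_normal(4)]
n_regions = count_linear_regions(weights, biases)
print(n_regions)
```

Because the grid only samples the domain, regions smaller than the grid spacing are missed, so this is a lower estimate rather than an exact count; with 8 neurons the count can never exceed the 2^8 = 256 possible activation patterns.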

[Figure 1: two panels over the input domain $[-1, 1]^2$, (a) "Linear Regions" and (b) "Simplices".] Figure 1: The number of simplices a polytope contains can reveal the shape information of that polytope, from which one can extract valuable information about a neural network.

Although figuring out the properties of the polytopes of a neural network is of fundamental importance for the understanding of neural networks, current studies of polytopes have an important limitation. So far, both theoretical and empirical studies stop at counting the number of polytopes, which blocks us from gaining other valuable findings. As we know, in a feed-forward network with $L$ hidden layers, each polytope is enclosed by a group of hyperplanes, as shown in Figure 1(a), and each hyperplane is associated with a neuron. The details of how polytopes are formed in a ReLU network are given in Appendix A. Hence, any polytope is created by at most $\sum_{i=1}^{L} n_i$ and at least $n_0 + 1$ hyperplanes, which is a wide range: the face numbers of polytopes can vary greatly. Unfortunately, the existing "counting" studies do not accommodate these differences among polytopes. Therefore, it is highly necessary to move a step forward, i.e., to know what each polytope is, thereby capturing a more complete picture of a neural network. As a first attempt, we seamlessly divide each polytope into simplices via a triangulation, and we describe the shape of a polytope by the minimum number of simplices needed to partition it, as Figure 1 shows. For example, in $\mathbb{R}^2$, if a polytope comprises three simplices, it is a pentagon. In this manuscript, 1) to demonstrate the utility of the total number of simplices (#simplices) relative to the total number of polytopes (#polytopes), we characterize its basic properties and estimate the lower and upper bounds of the maximum #simplices for ReLU networks. The key to the bound estimation is to estimate the total number of faces over all polytopes.
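The pentagon example above can be checked directly. The sketch below (the pentagon coordinates are illustrative, not taken from Figure 1) triangulates a convex polygon by fanning from one vertex, so a polygon with $v$ vertices decomposes into $v - 2$ simplices; a pentagon therefore yields three:

```python
def fan_triangulation(vertices):
    """Triangulate a convex polygon, given as an ordered vertex list,
    by connecting vertex 0 to every non-adjacent edge (a 'fan')."""
    v0 = vertices[0]
    return [(v0, vertices[i], vertices[i + 1]) for i in range(1, len(vertices) - 1)]

# Illustrative pentagon: 5 vertices -> 5 - 2 = 3 simplices (triangles).
pentagon = [(0.0, 1.0), (-1.0, 0.3), (-0.6, -1.0), (0.6, -1.0), (1.0, 0.3)]
print(len(fan_triangulation(pentagon)))  # 3
```

For a convex polygon the fan already achieves the minimum simplex count; for convex polytopes in higher dimensions, computing a minimum triangulation is substantially harder.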
2) We observe that the polytopes formed by ReLU networks are surprisingly uniform and simple. Here, uniformity and simplicity mean that although theoretically quite diverse and complicated polytopes can be derived, simple polytopes dominate, i.e., deep networks tend to find a function with a uniform and simple polytope pattern rather than a complicated one. This is another high-capacity-low-reality phenomenon and an implicit simplicity bias of neural networks, showing how fruitful it is to go beyond counting. Previously, Hanin & Rolnick (2019b) showed that deep ReLU networks have few polytopes. We report that polytopes are not only few but also simple and uniform. Compared to (Hanin & Rolnick, 2019b), our observation more convincingly illustrates why deep networks do not overfit: showing that the number of polytopes is small is insufficient to claim that a network learns a simple solution, because a network could have a small number of very complicated polytopes. 3) We establish a theorem that bounds the average number of faces of the polytopes of a network by a small number under some mild assumptions, thereby illustrating why the polytopes produced by a deep network are simple and uniform. To summarize, our contributions are threefold. 1) We point out the limitation of counting #polytopes. To address it, we propose to use #simplices to investigate the shapes of polytopes. Investigating the polytopes of a network can lead to a more complete characterization of ReLU networks and upgrade our knowledge of them to a new level. 2) We empirically find that a ReLU network

