TROPEX: AN ALGORITHM FOR EXTRACTING LINEAR TERMS IN DEEP NEURAL NETWORKS

Abstract

Deep neural networks with rectified linear (ReLU) activations are piecewise linear functions, where hyperplanes partition the input space into an astronomically high number of linear regions. Previous work focused on counting linear regions to measure the network's expressive power and on analyzing geometric properties of the hyperplane configurations. In contrast, we aim to understand the impact of the linear terms on network performance, by examining the information encoded in their coefficients. To this end, we derive TropEx, a non-trivial tropical algebra-inspired algorithm to systematically extract linear terms based on data. Applied to convolutional and fully-connected networks, our algorithm uncovers significant differences in how the different architectures utilize linear regions for generalization. This underlines the importance of systematic linear term exploration for better understanding generalization in neural networks trained on complex data sets.

1. INTRODUCTION

Many of the most widely used neural network architectures, including VGG (Simonyan & Zisserman, 2015), GoogLeNet (Szegedy et al., 2015) and ResNet (He et al., 2016), make use of rectified linear activations (ReLU; Hahnloser et al., 2000; Glorot et al., 2011), i.e., σ(x) = max{x, 0}, and are therefore piecewise linear functions. Despite the apparent simplicity of these functions, there is a lack of theoretical understanding of the factors that contribute to the success of such architectures. Previous attempts at understanding piecewise linear network functions have focused on estimating the number of linear terms, which are the linear pieces (affine functions) that constitute the network function. A linear region is defined as a maximally connected subset of the input space on which the network function is linear. Since computing the exact number of linear regions is intractable, work has focused on obtaining upper and lower bounds for this number (Arora et al., 2016; Serra et al., 2018; Pascanu et al., 2013; Raghu et al., 2017; Montufar et al., 2014; Montúfar, 2017; Xiong et al., 2020; Zhang et al., 2018). To our knowledge, the best currently known upper and lower bounds were calculated by Serra et al. (2018). Raghu et al. (2017) show these bounds to be asymptotically tight. All of the mentioned papers share the intuition that the number of linear regions of a neural network measures its expressivity. Since the bounds grow linearly in width and exponentially in depth, deep networks are interpreted to have greater representational power. However, these bounds are staggeringly high: the upper bound on the number of linear regions in (Serra et al., 2018) exceeds 10^300 even for the smallest networks we experimented on. (There are approximately 10^80 atoms in the universe.) For slightly larger networks, the upper bound exceeds 10^17000, whereas the lower bound exceeds 10^83 linear regions.
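To make the notion of a linear term concrete, the following sketch shows how the affine function attached to one linear region can be read off from the ReLU activation pattern of a tiny two-layer network. The weights and the helper `local_linear_term` are hypothetical illustrations, not the paper's TropEx algorithm: on the region containing x, the network equals A·y + c, where A and c are obtained by replacing each ReLU with a fixed 0/1 mask.

```python
import numpy as np

rng = np.random.default_rng(0)

# A tiny two-layer ReLU network with hypothetical random weights:
# net(x) = W2 @ relu(W1 @ x + b1) + b2
W1, b1 = rng.standard_normal((8, 3)), rng.standard_normal(8)
W2, b2 = rng.standard_normal((1, 8)), rng.standard_normal(1)

def net(x):
    return W2 @ np.maximum(W1 @ x + b1, 0.0) + b2

def local_linear_term(x):
    """Return (A, c) with net(y) = A @ y + c on x's linear region."""
    D = np.diag((W1 @ x + b1 > 0).astype(float))  # 0/1 mask of active units
    A = W2 @ D @ W1                               # slope of the linear term
    c = W2 @ D @ b1 + b2                          # offset of the linear term
    return A, c

x = rng.standard_normal(3)
A, c = local_linear_term(x)
print(np.allclose(net(x), A @ x + c))  # True: the affine term matches net on x
```

The identity holds because, for a fixed activation pattern, max{z, 0} equals D·z, so the composition of layers collapses into a single affine map.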
The number of training samples (≤ 10^6) is generally much smaller than the estimated number of linear regions, so that almost none of the linear regions contains training data. This raises the question of how representative the number of linear regions is for network performance, and how information extracted from training samples passes on to the many data-free linear regions for successful generalization to test data. There are indications that a high number of linear regions is not required for good network performance. Frankle & Carbin (2019) point out that smaller networks perform as well as large ones when a suitable initialization of the smaller network can be found from training the larger one. Hence,
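The mismatch between sample count and region count can be illustrated with a toy experiment (hypothetical random weights and data, not the networks from the paper): each linear region of a single ReLU layer corresponds to an activation pattern, and a finite data set can touch at most as many patterns as it has samples.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical single ReLU layer with 32 units on 10-dimensional inputs.
W, b = rng.standard_normal((32, 10)), rng.standard_normal(32)
X = rng.standard_normal((1000, 10))  # a toy "training set" of 1000 samples

# Each sample lands in the linear region given by its activation pattern.
patterns = {tuple((W @ x + b > 0).astype(int)) for x in X}

# At most 1000 of the up to 2**32 possible patterns are ever visited,
# so almost all linear regions contain no data.
print(len(patterns), "of at most", 2**32, "patterns visited")
```

The same counting argument scales up: with ≤ 10^6 training samples and bounds of 10^83 or more regions, the fraction of regions containing any data is vanishingly small.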

