TROPEX: AN ALGORITHM FOR EXTRACTING LINEAR TERMS IN DEEP NEURAL NETWORKS

Abstract

Deep neural networks with rectified linear (ReLU) activations are piecewise linear functions, where hyperplanes partition the input space into an astronomically high number of linear regions. Previous work focused on counting linear regions to measure the network's expressive power and on analyzing geometric properties of the hyperplane configurations. In contrast, we aim to understand the impact of the linear terms on network performance by examining the information encoded in their coefficients. To this end, we derive TropEx, a non-trivial tropical algebra-inspired algorithm to systematically extract linear terms based on data. Applied to convolutional and fully-connected networks, our algorithm uncovers significant differences in how the different networks utilize linear regions for generalization. This underlines the importance of systematic linear term exploration for better understanding generalization in neural networks trained on complex data sets.

1. INTRODUCTION

Many of the most widely used neural network architectures, including VGG (Simonyan & Zisserman, 2015), GoogLeNet (Szegedy et al., 2015) and ResNet (He et al., 2016), use rectified linear activations (ReLU, σ(x) = max{x, 0}; Hahnloser et al., 2000; Glorot et al., 2011) and are therefore piecewise linear functions. Despite the apparent simplicity of these functions, there is a lack of theoretical understanding of the factors that contribute to the success of such architectures. Previous attempts at understanding piecewise linear network functions have focused on estimating the number of linear terms, i.e., the linear pieces (affine functions) that constitute the network function. A linear region is defined as a maximally connected subset of the input space on which the network function is linear. Since computing the exact number of linear regions is intractable, work has focused on obtaining upper and lower bounds for this number (Arora et al., 2016; Serra et al., 2018; Pascanu et al., 2013; Raghu et al., 2017; Montufar et al., 2014; Montúfar, 2017; Xiong et al., 2020; Zhang et al., 2018). To our knowledge, the best current upper and lower bounds are those of Serra et al. (2018). Raghu et al. (2017) show these bounds to be asymptotically tight. All of the mentioned papers share the intuition that the number of linear regions of a neural network measures its expressivity. Since the bounds grow linearly in width and exponentially in depth, deep networks are interpreted as having greater representational power. However, these bounds are staggeringly high: the upper bound on the number of linear regions in (Serra et al., 2018) exceeds 10^300 even for the smallest networks we experimented on. (There are approximately 10^80 atoms in the universe.) For slightly larger networks, the upper bound exceeds 10^17000, whereas the lower bound exceeds 10^83 linear regions.
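The notion of a linear region can be made concrete for a one-hidden-layer ReLU network: every input induces a binary activation pattern over the hidden units, and two inputs lie in the same linear region exactly when their patterns agree. The following sketch, a toy network with illustrative sizes and names (not tied to any experiment in this paper), shows that random samples essentially always occupy distinct regions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy one-hidden-layer ReLU network: x -> relu(W1 @ x + b1).
# All sizes and names here are illustrative assumptions.
d_in, d_hidden = 10, 64
W1 = rng.standard_normal((d_hidden, d_in))
b1 = rng.standard_normal(d_hidden)

def activation_pattern(x):
    """Which hidden ReLUs fire at x. For a one-hidden-layer network,
    two inputs share a linear region iff they share this pattern."""
    return tuple((W1 @ x + b1 > 0).astype(int))

# Count how many distinct linear regions 1000 random samples occupy.
X = rng.standard_normal((1000, d_in))
patterns = {activation_pattern(x) for x in X}
print(len(patterns), "regions for", len(X), "samples")
```

With overwhelming probability every sample lands in its own region, mirroring the observation that the number of data points is vastly smaller than the number of regions.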
The number of training samples (≤ 10^6) is generally much smaller than the estimated number of linear regions, so that almost none of the linear regions contain training data. This raises the question of how representative the number of linear regions is for network performance, and how information extracted from training samples passes on to the many data-free linear regions to enable successful generalization to test data. There are indications that a high number of linear regions is not required for good network performance. Frankle & Carbin (2019) point out that smaller networks perform similarly well to large ones when a suitable initialization of the smaller network can be found by training the larger one. Hence, the enormous number of linear regions may not be necessary for generalization. Other work (2020) studies geometric properties of linear regions and notices that batch normalization and dropout, albeit leading to similar network accuracies, produce differently looking linear regions. Our approach to understanding linear regions differs in that it investigates the linear coefficients of the regions. To this end, we propose TropEx, a tropical algebra-based algorithm extracting linear terms of the network function N (Figure 1) using a data set X. TropEx outputs an extracted function N(X) containing only the linear terms corresponding to regions on which data lies. As a result, N and N(X) agree on neighbourhoods of all data points. This creates a tool for studying generalization from a new viewpoint, i.e., the perspective of linear regions and their coefficients.
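To make "the linear term at a data point" concrete: on the region containing a point x, the network is an affine map whose coefficients can be read off by freezing the ReLU mask at x. A minimal sketch for a one-hidden-layer network follows (illustrative names and sizes; TropEx itself handles general architectures via its tropical matrix framework):

```python
import numpy as np

rng = np.random.default_rng(1)
# Toy network with assumed sizes: x -> W2 @ relu(W1 @ x + b1) + b2.
d_in, d_h, d_out = 5, 32, 3
W1, b1 = rng.standard_normal((d_h, d_in)), rng.standard_normal(d_h)
W2, b2 = rng.standard_normal((d_out, d_h)), rng.standard_normal(d_out)

def net(x):
    return W2 @ np.maximum(W1 @ x + b1, 0.0) + b2

def linear_term_at(x):
    """Coefficients (A, c) of the affine piece the network computes at x:
    freeze the ReLU mask at x, then compose the now-linear layers."""
    mask = (W1 @ x + b1 > 0).astype(float)  # active hidden units at x
    A = (W2 * mask) @ W1                    # shape (d_out, d_in)
    c = (W2 * mask) @ b1 + b2
    return A, c

x0 = rng.standard_normal(d_in)
A, c = linear_term_at(x0)
# On a small neighbourhood of x0, the affine piece matches the network exactly.
x_near = x0 + 1e-4 * rng.standard_normal(d_in)
assert np.allclose(net(x_near), A @ x_near + c)
```

The coefficient matrix A and offset c are exactly the kind of per-region data whose statistics the paper proposes to study.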

Our contributions are as follows:

• A new computational framework representing tropical functions (Definition B.4) as matrices, to efficiently perform the tropical calculations appearing in networks with rectified linear activations.
• This framework allows us to derive TropEx, an algorithm to systematically extract linear terms from piecewise linear network functions.¹
• An application of TropEx to fully-connected (FCN) and convolutional networks (CNN) reveals that (i) all training and test samples consistently fall into different linear regions; (ii) simple tasks (MNIST) can be solved with the few linear regions of training samples alone, while this does not hold for more complex data sets; (iii) FCNs and CNNs differ in how they use linear regions free of training data for their performance on test data: several measures illustrate that CNNs, in contrast to FCNs, tend to learn more diverse linear terms; (iv) we confirm that the number of linear regions alone is not a good indicator of network performance, and show that the coefficients of linear regions contain information about architecture and classification performance.

2. BACKGROUND AND OVERVIEW

It was recently shown by Charisopoulos & Maragos (2018) and Zhang et al. (2018) that ReLU neural network functions coincide with tropical rational maps. Tropical rational maps are exactly those functions in which each entry of the output vector can be written as a difference of maxima of affine functions,

N_i(x) = max{a_{i1}^+(x), ..., a_{in}^+(x)} - max{a_{i1}^-(x), ..., a_{im}^-(x)},

where each a_{ij}^+ and a_{ij}^- is an affine function of x.
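For a single output of a one-hidden-layer network, this identity can be checked directly: splitting the output weights into positive and negative parts turns the network into a difference of two tropical polynomials, each a maximum over subsets of hidden units of sums of affine preactivations. A small numerical sketch under assumed toy sizes (not from the paper's implementation):

```python
import itertools
import numpy as np

rng = np.random.default_rng(2)
d_in, d_h = 4, 3  # tiny, so all 2**d_h tropical terms stay enumerable
W1, b1 = rng.standard_normal((d_h, d_in)), rng.standard_normal(d_h)
w2, b2 = rng.standard_normal(d_h), rng.standard_normal()

def net(x):
    """Scalar-output ReLU network: w2 . relu(W1 @ x + b1) + b2."""
    return w2 @ np.maximum(W1 @ x + b1, 0.0) + b2

# Split the output weights into non-negative parts: w2 = pos - neg.
pos, neg = np.maximum(w2, 0.0), np.maximum(-w2, 0.0)

# For c_j >= 0:  sum_j c_j * max(pre_j, 0) = max over subsets S of
# sum_{j in S} c_j * pre_j  -- a maximum of affine functions of x.
subsets = [S for r in range(d_h + 1)
           for S in itertools.combinations(range(d_h), r)]

def tropical_poly(coeffs, x):
    pre = W1 @ x + b1
    return max(sum(coeffs[j] * pre[j] for j in S) for S in subsets)

def as_difference_of_maxima(x):
    return (tropical_poly(pos, x) + b2) - tropical_poly(neg, x)

x = rng.standard_normal(d_in)
assert np.isclose(net(x), as_difference_of_maxima(x))
```

Note that the number of affine terms in each maximum grows exponentially in the number of hidden units, which is why TropEx extracts only the terms realized on actual data rather than enumerating all of them.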



¹ Link to open-source implementation: https://github.com/martrim/tropex



Figure 1: A ReLU network function before (left) and after (right) extraction. Left: Hyperplanes separate the input space into linear regions. Most of them do not contain any data points. Each data point occupies its own linear region. Right: After extraction, the function remains unchanged on the linear regions of training samples. Test samples now fall into regions of training samples.

