TROPEX: AN ALGORITHM FOR EXTRACTING LINEAR TERMS IN DEEP NEURAL NETWORKS

Abstract

Deep neural networks with rectified linear (ReLU) activations are piecewise linear functions, where hyperplanes partition the input space into an astronomically high number of linear regions. Previous work focused on counting linear regions to measure a network's expressive power and on analyzing geometric properties of the hyperplane configurations. In contrast, we aim to understand the impact of the linear terms on network performance by examining the information encoded in their coefficients. To this end, we derive TropEx, a non-trivial tropical algebra-inspired algorithm to systematically extract linear terms based on data. Applied to convolutional and fully-connected networks, our algorithm uncovers significant differences in how the different networks utilize linear regions for generalization. This underlines the importance of systematic linear term exploration for better understanding generalization in neural networks trained on complex data sets.

1. INTRODUCTION

Many of the most widely used neural network architectures, including VGG (Simonyan & Zisserman, 2015), GoogLeNet (Szegedy et al., 2015) and ResNet (He et al., 2016), use rectified linear activations (ReLU; Hahnloser et al., 2000; Glorot et al., 2011), i.e., σ(x) = max{x, 0}, and are therefore piecewise linear functions. Despite the apparent simplicity of these functions, there is a lack of theoretical understanding of the factors that contribute to the success of such architectures. Previous attempts at understanding piecewise linear network functions have focused on estimating the number of linear terms, the linear pieces (affine functions) that constitute the network function. A linear region is defined as a maximally connected subset of the input space on which the network function is linear. Since computing the exact number of linear regions is intractable, work has focused on obtaining upper and lower bounds for this number (Arora et al., 2016; Serra et al., 2018; Pascanu et al., 2013; Raghu et al., 2017; Montufar et al., 2014; Montúfar, 2017; Xiong et al., 2020; Zhang et al., 2018). To our knowledge, the currently best upper and lower bounds were calculated by Serra et al. (2018); Raghu et al. (2017) show these bounds to be asymptotically tight. All of the mentioned papers share the intuition that the number of linear regions of a neural network measures its expressivity. Since the bounds grow linearly in width and exponentially in depth, deep networks are interpreted to have greater representational power. However, these bounds are staggeringly high: the upper bound on the number of linear regions in Serra et al. (2018) exceeds 10^300 even for the smallest networks we experimented on. (There are approximately 10^80 atoms in the universe.) For slightly larger networks, the upper bound exceeds 10^17000, and the lower bound exceeds 10^83 linear regions.
The number of training samples (at most 10^6 here) is generally much smaller than the estimated number of linear regions, so that almost none of the linear regions contain training data. This raises the question of how representative the number of linear regions is for network performance, and how information extracted from training samples passes on to the many data-free linear regions for successful generalization to test data. There are indications that a high number of linear regions is not required for good network performance. Frankle & Carbin (2019) point out that smaller networks perform similarly well to large ones when a suitable initialization of the smaller network can be found by training the larger one. Hence, the expressivity of the large network is helpful to explore the parameter space, but the small, less expressive network is sufficient to achieve high accuracy. Lee et al. (2019) and Croce et al. (2018) modify the training loss to encourage larger linear regions with the goal of robustness to adversarial attacks. Hanin & Rolnick (2019b;a) argue that in practice there are fewer linear regions than expected from the bounds and empirically investigate this for the MNIST data set. All these observations question the explanatory power of astronomically high bounds on the number of linear regions. More recently, the focus of research on linear regions has been shifting away from pure counting towards an understanding of the linear regions themselves. Zhang & Wu (2020) study geometric properties of linear regions and notice that batch normalization and dropout, albeit leading to similar network accuracies, produce different-looking linear regions. Our approach to the understanding of linear regions differs in that it investigates the linear coefficients of linear regions. To this end, we propose TropEx, a tropical algebra-based algorithm extracting the linear terms of the network function N (Figure 1) using a data set X.
TropEx outputs an extracted function N^(X) containing only the linear terms corresponding to regions on which data lies. As a result, N and N^(X) agree on neighbourhoods of all data points. This provides a tool for the study of generalization from a new viewpoint: the perspective of linear regions and their coefficients.

Our contributions are as follows:

• A new computational framework representing tropical functions (Definition B.4) as matrices, to efficiently perform the tropical calculations appearing in networks with rectified linear activations.
• Based on this framework, TropEx, an algorithm to systematically extract linear terms from piecewise linear network functions.¹
• An application of TropEx to fully-connected (FCN) and convolutional networks (CNN), which reveals that (i) consistently, all training and test samples fall into different linear regions; (ii) simple tasks (MNIST) can be solved with the few linear regions of training samples alone, while this does not hold for more complex data sets; (iii) FCNs and CNNs differ in how they use linear regions free of training data for their performance on test data: several measures illustrate that CNNs, in contrast to FCNs, tend to learn more diverse linear terms; (iv) we confirm that the number of linear regions alone is not a good indicator of network performance and show that the coefficients of linear regions contain information on architecture and classification performance.

2. BACKGROUND AND OVERVIEW

It was recently shown by Charisopoulos & Maragos (2018) and Zhang et al. (2018) that ReLU neural network functions are the same as tropical rational maps. Tropical rational maps are exactly those functions where each entry of the output vector can be written as a difference of maxima

N_i(x) = max{a⁺_i1(x), ..., a⁺_in(x)} − max{a⁻_i1(x), ..., a⁻_im(x)},   (1)

where each a⁺_ij, a⁻_ij : R^d → R is an affine function with only nonnegative coefficients, taking the form x ↦ Σ_j w_j x_j + w_0 with all w_j ∈ R≥0. Since the number of terms in (1) dwarfs the number of atoms in the universe, it is impossible to obtain this expression in practice. Therefore, we only extract those terms that correspond to linear regions of data points. For a fixed data point x ∈ X, the maximum of the network outputs can be written as max_i N_i(x) = a⁺_x(x) − a⁻_x(x), where a⁺_x, a⁻_x are the affine functions such that a⁺_x(x) ≥ a⁺_ij(x) and a⁻_x(x) ≥ a⁻_ij(x) for all i, j. TropEx extracts a⁺_x and a⁻_x. The extracted terms can be used to construct a tropical map N^(X)(x) = (N^(X)_1(x), ..., N^(X)_s(x)) with maximally enlarged linear regions, given by

N^(X)_i(x) = max{a⁺_{x_k1}(x), ..., a⁺_{x_kDi}(x)} − max{a⁻_{x_k1}(x), ..., a⁻_{x_kDi}(x)},   (2)

where x_k1, ..., x_kDi are the D_i data points given label i by the original network. Being a tropical rational map, the function N^(X) is again a ReLU neural network function by Zhang et al. (2018). The maximal entries of the two output vectors (hence also the assigned labels) of the extracted function N^(X) and the original network N agree in a neighbourhood of any data point x ∈ X. We discuss the basics of tropical algebra in Appendix B.1 and refer to Maclagan & Sturmfels (2015) for a detailed introduction.
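As a tiny numerical illustration of the decomposition in equation (1) — our own sketch, not an example from the paper — consider the scalar ReLU network N(x) = −3·relu(2x − 1) + 0.5, which can be rewritten as a difference of two maxima of affine functions whose slopes are all nonnegative:

```python
import numpy as np

# Illustrative sketch (ours): a scalar ReLU network written as a
# tropical rational map, i.e. a difference of two maxima of affine
# functions with nonnegative slopes, as in equation (1).

def network(x):
    # N(x) = -3 * relu(2x - 1) + 0.5
    return -3.0 * max(2.0 * x - 1.0, 0.0) + 0.5

def tropical(x):
    plus = max([0.0 * x + 0.5])              # a+ term: slope 0 >= 0
    minus = max([6.0 * x - 3.0, 0.0 * x])    # a- terms: slopes 6 and 0 >= 0
    return plus - minus

# The two expressions agree everywhere on a test grid.
for x in np.linspace(-2.0, 2.0, 401):
    assert abs(network(x) - tropical(x)) < 1e-9
```

The negative output weight is absorbed into the subtracted maximum, which is why both maxima end up with only nonnegative slopes.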

3.1. MATRIX REPRESENTATION OF TROPICAL RATIONAL MAPS

Representing tropical rational maps symbolically on a computer would make computations too slow. Therefore, we represent tropical rational maps as multi-dimensional arrays.

Definition 3.1. Given an affine function a : R^{d_0} → R; x ↦ Σ_k w_k x_k + w_0, we call the vector (w_0, w_1, ..., w_{d_0}) its coefficient vector, the scalar w_0 its constant part and the vector (w_1, ..., w_{d_0}) its variable part.

We can represent functions N_i : R^{d_0} → R as in equation (1) in the following way: let the rows of the matrix A⁺_i ∈ R^{n×d_0} and the entries of the vector a⁺_i ∈ R^{n×1} be the variable and constant parts of the affine functions a⁺_ij, respectively (analogously for A⁻_i and a⁻_i). We can then define

(A⁺_i, a⁺_i)(x) = max{A⁺_i x + a⁺_i},

where the maximum is taken over the rows of the resulting column vector. If we define the formal quotient² of matrix-vector pairs by

((A⁺_i, a⁺_i)/(A⁻_i, a⁻_i))(x) = max{A⁺_i x + a⁺_i} − max{A⁻_i x + a⁻_i},

then N_i(x) = ((A⁺_i, a⁺_i)/(A⁻_i, a⁻_i))(x), giving us a matrix representation of the function N_i. An entire network function with s output dimensions can then be represented by a list ((A⁺_i, a⁺_i)/(A⁻_i, a⁻_i))_{1≤i≤s}. The advantage of the proposed matrix representation of tropical rational maps is that natural operations on the matrices perform the calculations that arise for (concatenations of) layers of neural networks (see supplements). A dense layer ℓ : R^{d_1} → R^{d_2} with ReLU activation is represented as a list ((A⁺_i, a⁺_i)/(A⁻_i, a⁻_i))_{1≤i≤d_2}. Denoting by W_pos and W_neg the positive and negative parts of a matrix W, respectively, i.e. (w_pos)_ij = max{w_ij, 0} and (w_neg)_ij = max{−w_ij, 0}, the matrix representation of a single neuron n_i(x) = max{w·x + b, 0} is given by

A⁺_i = [w_pos; w_neg], a⁺_i = [b_pos; b_neg];  A⁻_i = w_neg, a⁻_i = b_neg,

where [·;·] stacks rows.
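The single-neuron representation above can be checked numerically. The following sketch (ours, directly following the formulas of this section) builds the matrix-vector pairs for a random neuron and verifies that the formal quotient reproduces max{w·x + b, 0}:

```python
import numpy as np

# Sketch (ours): matrix representation of a single ReLU neuron
#   n(x) = max(w.x + b, 0)
# as A+ = [w_pos; w_neg], a+ = [b_pos; b_neg]; A- = w_neg, a- = b_neg,
# and a check that max(A+ x + a+) - max(A- x + a-) equals n(x).

rng = np.random.default_rng(0)
w = rng.standard_normal(5)
b = float(rng.standard_normal())

w_pos, w_neg = np.maximum(w, 0.0), np.maximum(-w, 0.0)
b_pos, b_neg = max(b, 0.0), max(-b, 0.0)

A_plus = np.stack([w_pos, w_neg])      # 2 x d, rows = variable parts
a_plus = np.array([b_pos, b_neg])      # constant parts
A_minus = w_neg[None, :]               # 1 x d
a_minus = np.array([b_neg])

def trop_eval(A, a, x):
    # (A, a)(x) = max over rows of Ax + a
    return float(np.max(A @ x + a))

for _ in range(100):
    x = rng.standard_normal(5)
    neuron = max(float(w @ x) + b, 0.0)
    quotient = trop_eval(A_plus, a_plus, x) - trop_eval(A_minus, a_minus, x)
    assert abs(neuron - quotient) < 1e-9
```

The identity holds because the first row of A⁺ minus the subtracted term gives w·x + b, while the second row minus the subtracted term gives 0.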

3.2. EXTRACTING LINEAR TERMS OF A CLASSIFICATION NETWORK

We now consider a classification neural network N with s labels. We show that we can represent the network N with a matrix-vector pair (A⁻, a⁻) in the denominator that is constant over all output dimensions. The proof is given in Section C.2 of the supplementary material.

Algorithm 3.1 TropEx: Extracting Linear Terms of a Neural Network
Inputs: neural network N; data set X = {(x_{i_k}, i)} with D_i points of label i
Output: extracted function N^(X) = ((A⁺_i, a⁺_i)/(A⁻, a⁻))_{1≤i≤s}
1: W, b ← weight matrix and bias vector of the last layer
2: C⁻, c⁻ ← column sums of W_neg, b_neg
3: C⁺ ← W + C⁻; c⁺ ← b + c⁻
4: for i = 1 to s do
5:   A⁺_i ← rep(C⁺_{i•}, D_i); a⁺_i ← rep(c⁺_i, D_i)   ▷ repetition D_i times along the rows
6: A⁻ ← rep(C⁻, D); a⁻ ← rep(c⁻, D)   ▷ D = total number of data points
7: A_max ← maxima of the columns of all A⁺_i and A⁻ stacked
8: for each layer ℓ of N not yet used, from last to first, do
9:   (N^(X), A_max) ← merge(N^(X), A_max) according to Table 1

Lemma 3.2. Let N : R^d → R^s be the function of a ReLU neural network for classification with s output neurons. Then there are affine functions a⁺_ij, a⁻_j such that

N(x) = (max{a⁺_11(x), ..., a⁺_{1n_1}(x)}, ..., max{a⁺_s1(x), ..., a⁺_{sn_s}(x)})ᵀ − max{a⁻_1(x), ..., a⁻_m(x)},   (3)

where the maximum of the a⁻_j(x) on the right is subtracted from each entry of the vector on the left.³ In terms of our matrices, the classification network N with s labels can be represented by a list ((A⁺_i, a⁺_i)/(A⁻, a⁻))_{1≤i≤s} of matrix-vector pairs. The label is then given by argmax_{1≤i≤s} (A⁺_i, a⁺_i)(x).

How to get the extracted function N^(X) of (2) from the network N: TropEx extracts, for each data point x_k of label i, the affine functions a⁺_ij and a⁻_l from the network representation in (3) such that a⁺_ij(x_k) ≥ a⁺_{ĩj̃}(x_k) and a⁻_l(x_k) ≥ a⁻_l̃(x_k) for all ĩ, j̃, l̃. We start with the last layer of the network and inductively merge new layers into the existing matrix pairs.
The merge operations depend on the type of the layer, as shown in Table 1. Putting things together gives Algorithm 3.1, which outputs the extracted function N^(X). The run-time and storage complexities per data point correspond to 3 forward passes through the network. Theorem 3.3 states that TropEx indeed results in a selection of linear terms based on a data set of points and that the extracted tropical function agrees with the network on neighbourhoods of all these points. Its proof and the complete, non-trivial derivation of the algorithm are in the appendix, where we develop a framework that enables calculations on tropical matrices corresponding to manipulations of the tropical functions they represent. For illustrative purposes, we also present a worked-out example of applying TropEx to a toy neural network there.

Theorem 3.3. Let N = (N_1, ..., N_s) : R^d → R^s be the function of a ReLU neural network for classification into s classes. Let N^(X) be the network obtained from Algorithm 3.1, applied to N using a data set X = {(x_{k_j}, i) | 1 ≤ i ≤ s, 1 ≤ j ≤ D_i} of D_i data points x_{k_j} given label i by N. (There are D points x_1, ..., x_D in total.) Then,
(1) for all labels i,

N^(X)_i(x) = max{a⁺_i1(x), ..., a⁺_{iD_i}(x)} − max{a⁻_1(x), ..., a⁻_D(x)},

where the a⁺_ij and a⁻_l are extracted from a representation of N_i as in Equation (3); and
(2) every data point x_k has a neighbourhood U_k on which the maximum of the extracted function agrees with the maximal network output: max_{1≤i≤s} N^(X)_i(x) = max_{1≤i≤s} N_i(x) for all x ∈ U_k. In particular, N^(X) and N classify all points in U_k by assigning the same label.

Table 1: Merge operations (N^(X), A_max) → (N^(X), A_max), according to the type of layer ℓ. A_k denotes the slice of A corresponding to data point k; ℓ̄ denotes all of the network up to and including ℓ. Read expressions like A ← K + AW as A⁺_i ← K + A⁺_i W and A⁻ ← K + A⁻W for all i.

BNorm:   γ, β, µ, σ, ε ← batchnorm parameters; s ← γ/√(σ² + ε); t ← −µ·s + β;
         A ← sA; a ← a + At; A_max ← |s|·A_max
Conv:    F, b ← filter and bias of ℓ; K ← ConvTrans(A_max, F_neg);
         A_max ← K + ConvTrans(A_max, F_pos); a ← a + Ab; A ← K + ConvTrans(A, F)
Dense:   W, b ← weights and bias of ℓ; K ← A_max W_neg;
         A_max ← K + A_max W_pos; a ← a + Ab; A ← K + AW
Maxpool: A, A_max ← repeat to input shape of ℓ; A_k ← set to 0 according to activations of ℓ̄(x_k)
Flatten: A, A_max ← reshape to input shape of ℓ
ReLU:    A_kj ← 0 if ℓ̄(x_k)_j = 0
L-ReLU:  α ← Leaky ReLU parameter; A_kj ← α·A_kj if ℓ̄(x_k)_j ≤ 0
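To build intuition for what the extraction produces, the following sketch (ours; a simplified stand-in, not the backward tropical merge of Algorithm 3.1) recovers the local affine function a_x on the linear region of a data point for a toy fully-connected ReLU network by freezing the activation pattern at x, then checks the agreement property of Theorem 3.3 on a small neighbourhood. The architecture and all names are illustrative.

```python
import numpy as np

# Simplified sketch (ours, not the TropEx merge): for a toy fully-connected
# ReLU network, recover the affine function a_x valid on the linear region
# of a point x by freezing the ReLU activation pattern at x, then verify
# agreement with the network on a neighbourhood of x (cf. Theorem 3.3).

rng = np.random.default_rng(1)
sizes = [4, 8, 8, 3]                       # toy architecture, illustrative only
Ws = [rng.standard_normal((m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
bs = [rng.standard_normal(m) for m in sizes[1:]]

def forward(x):
    for W, b in zip(Ws[:-1], bs[:-1]):
        x = np.maximum(W @ x + b, 0.0)
    return Ws[-1] @ x + bs[-1]

def local_affine(x):
    """Return (C, d) such that N(y) = C y + d on the linear region of x."""
    C, d = np.eye(len(x)), np.zeros(len(x))
    for W, b in zip(Ws[:-1], bs[:-1]):
        C, d = W @ C, W @ d + b
        mask = (C @ x + d) > 0               # activation pattern at x
        C, d = mask[:, None] * C, mask * d   # freeze ReLUs as 0/1 gates
    return Ws[-1] @ C, Ws[-1] @ d + bs[-1]

x = rng.standard_normal(4)
C, d = local_affine(x)
assert np.allclose(forward(x), C @ x + d)
for _ in range(20):                          # tiny perturbations stay in the region
    y = x + 1e-6 * rng.standard_normal(4)
    assert np.allclose(forward(y), C @ y + d)
```

TropEx achieves the same result without a forward pass per candidate term, by merging layers backwards into the matrix pairs as in Table 1; the sketch above only illustrates the object being extracted.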

4. EXPERIMENTS

TropEx extracts a function containing only the linear terms corresponding to regions on which the given data lies. The extracted function agrees with the network on this data. This allows us to compare linear regions of training and test data, to separate the network structure from the information contained in the linear coefficients, and to test how well the linear coefficients generalize to test data.

Setup. We train neural networks on MNIST (LeCun et al., 2010) and CIFAR10 (Krizhevsky, 2009). After training, we use training data points x^(tr) to extract linear terms a⁺_{x^(tr)}(x) and a⁻_{x^(tr)}(x) and construct an extracted function N^(X) as in equation (2). For some experiments, we also extract linear terms a⁺_{x^(te)}(x) and a⁻_{x^(te)}(x) corresponding to test data points x^(te). Regarding architectures, we use fully-connected networks, AllCNN-C from Springenberg et al. (2015) and variations of VGG-B from Simonyan & Zisserman (2015). Section D in the appendix summarizes all architectures used in our experiments. If not stated otherwise, we use the architecture Conv for CNNs and FCN8 for FCNs. It is not our goal to train networks to state-of-the-art performance, but rather to compare variations of simple networks composed of the layers shown in Table 1. All layers have ReLU activations, except for the last layer, where we apply a softmax output function over the ten respective classes.⁴ We train five networks of each architecture to ensure the consistency of our results. Further details on the training setup can be found in the appendix.

Train and test linear regions. First, we investigate how training and test samples are distributed over the linear regions of the neural networks. For each data point x, let a_x = a⁺_x − a⁻_x be the affine function corresponding to the linear region on which x lies.
Observing that a_x ≠ a_{x'} for all distinct training and test points x, x', we see that all points lie in different linear regions.⁵ This is not a result of overfitting during training: all data points also occupy different regions when we check the linear regions after 1, 3, 5, 10, 15, 20, 30, 40, and 50 epochs of training on CIFAR10, and after epochs 1 to 20 on MNIST. Therefore, we conclude that the generalization capabilities of neural networks cannot be explained by test samples falling into the same linear regions as training samples (or, in other words, by test samples inducing the same activation patterns as training samples). To compare the linear region of each test point with the training region it falls into after extraction, we take the corresponding coefficient vectors and calculate (i) the angle and (ii) the Euclidean norm difference between them. Figure 2 shows both values for all test samples, where we differentiate between those test points x^(te) that are correctly classified by N^(X) (in blue) and those that receive a wrong label (in red). We observe a clear difference between the CNN and the FCN. For CNNs, the coefficient vectors of training and test affine functions are all close to orthogonal, for correctly as well as incorrectly classified points. For FCNs, the angles and distances of correctly classified points are smaller than those of incorrectly classified ones, but still far from zero, ruling out the possibility of test samples falling into very similar neighbouring regions. Finally, instead of comparing linear coefficients, we also tested the similarity of activation patterns before and after extraction. The results in Section E.5 of the appendix show that in each layer only approximately 80% of neuron activations agree between test and training regions, showing that the activation patterns of test samples also deviate considerably and that generalization cannot simply be explained by very similar activation patterns.

Table 2: Results on test data after extraction of linear terms. Column 1: agreement of network N with extracted function N^(X). Columns 2&3: multi-class accuracy for N and N^(X). All values are averages over 5 runs and over each of the architectures in Table 3, Section D of the appendix.

Accuracy of the extracted functions. As predicted by Theorem 3.3, the maximum of the extracted function N^(X) agrees with the maximum of the original network on all training points for each of our networks. In particular, network and extracted function assign the same label to each training point.
To investigate how well the coefficients of these linear regions generalize to unseen data, we compare the test accuracy of the network and the extracted function in Table 2. We see a consistent difference between CNNs and FCNs across both data sets and all architectures: there is a drastic drop in the test accuracy of the CNNs, as opposed to a relatively small drop in the accuracy of the FCNs. Interestingly, for MNIST, the extracted tropical function has almost the same test accuracy as the original network. This is surprising, as all known bounds on the number of linear regions of the original network suggest numbers of the order of 10^80 up to over 10^17000, of which we only retain 60,000 after reduction. Hence, for fully-connected networks on a simple task, the coefficients used on training data generalize well to test data, but for complex data, the learned coefficients generalize worse. This is remarkable, since previous studies of linear regions (Hanin & Rolnick, 2019b;a; Zhang & Wu, 2020) were forced to base their experiments on small data sets (or small networks) for computational reasons, and it seems that care must be taken when generalizing observations to more complex tasks. Moreover, the results reveal another difference between FCNs and CNNs that we further investigate: Figure 3 tracks test accuracy and label agreement while the networks are transformed into the extracted function N^(X); there is a clear difference between fully-connected networks and CNNs, which is consistent over all networks (Figure 3, right).
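The two similarity measures used to compare coefficient vectors can be sketched as follows (our own code; the "norm difference" is read here as the Euclidean distance between the vectors, matching the "distance" of Figure 2):

```python
import numpy as np

# Sketch (ours): the two measures used to compare the coefficient vector of
# a test point's linear region with that of the training region it falls
# into after extraction.

def angle_deg(u, v):
    """Angle between two coefficient vectors, in degrees."""
    cos = u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))

def norm_difference(u, v):
    """Euclidean distance between the two coefficient vectors."""
    return float(np.linalg.norm(u - v))

# In high dimension, independent random vectors are nearly orthogonal --
# qualitatively the behaviour observed for CNN coefficient vectors.
rng = np.random.default_rng(0)
u, v = rng.standard_normal(3072), rng.standard_normal(3072)
assert 80.0 < angle_deg(u, v) < 100.0
```

The near-orthogonality of independent high-dimensional vectors is the baseline against which the smaller FCN angles in Figure 2 stand out.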

Number of linear regions

The fully-connected network Wide and the CNN Narrow have the same number of nodes after each parameter layer. Since Narrow has only a few connections between its nodes while Wide is fully connected, it is reasonable to assume that the number of linear regions of Wide is greater than that of Narrow, as its theoretical upper bound is higher. Hence, one would expect the extraction of a fixed number of linear terms to result in a smaller change for the initially worse-performing Narrow; yet the drop in test accuracy for Narrow is almost 5 times the drop for Wide (36.6% vs. 8.1%). An estimate of the number of linear regions in practice (Appendix E.8) further suggests that Narrow has more linear regions than Wide, both being astronomically high. This all contradicts our intuition about how CNNs and FCNs work from the perspective of network expressivity in terms of bounds on the maximal number of linear regions.

Network Training

We compare the performance of the extracted function N^(X) at different stages of training of the network N. Figure 4 displays the test accuracy (mean and standard deviation over 5 networks) of the extracted function, and the agreement between the label assignments of the extracted function and the original network function, for Narrow (CNN) and Wide (fully-connected) over several epochs. The difference between fully-connected and convolutional networks is even more striking here. For the CNN, the agreement between the extracted tropical function and the original network function falls rapidly after only one epoch and only slightly decreases from there. For the FCN, the agreement decreases slowly over the entire 50 training epochs and never reaches a value as low as that of the CNN after its first epoch.

Information encoded in linear coefficients

The extracted functions all share the same number of linear terms, hence their difference in performance must be explained by the coefficient values. With this in mind, we attempt an interpretation of the above results and hypothesize that the difference lies in how FCNs and CNNs store information important for the classification task in the coefficients of linear regions. An FCN has full freedom to compose weights into tailored coefficients of linear regions, whereas a CNN imposes structure on the weight space through filters and weight sharing, which prevents it from composing tailored linear coefficients for correct label assignments. Instead, the structural properties of convolutions play a significant role in generalization; extracting linear terms removes this structure, which changes the outcome on test data, as the coefficients of linear regions alone are limited in meaning. As training progresses, to achieve higher accuracy, the FCN reduces the information stored in linear coefficients and also learns to use some structure, so that the removal of this structure could explain the decrease in performance of the extracted function. An experiment in which we visually inspect misclassified images (Appendix E.2) is in line with this interpretation, suggesting that the object shape is encoded in the linear coefficients of the FCNs, whereas for CNNs only simple features such as background color are encoded in the linear coefficients. We visualize coefficients in Appendix E.7 to further support the observed differences. For each input dimension, we additionally compute the Pearson correlation between the linear coefficients of two separately trained networks over all training samples. We reduce the resulting vector to a single number by averaging the correlation factors over the dimensions. We experiment with two pairs of FCNs and CNNs trained on CIFAR10 and on MNIST.
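The correlation measure just described can be sketched as follows (our own illustration on synthetic arrays; array shapes and names are assumptions, not the paper's code):

```python
import numpy as np

# Sketch (ours): per-dimension Pearson correlation between the linear
# coefficients of two separately trained networks, averaged over input
# dimensions. coeffs_* has shape (num_samples, input_dim): one coefficient
# vector per training sample's linear region.

def mean_coefficient_correlation(coeffs_a, coeffs_b):
    corrs = [np.corrcoef(coeffs_a[:, j], coeffs_b[:, j])[0, 1]
             for j in range(coeffs_a.shape[1])]
    return float(np.mean(corrs))

# Illustrative data: a noisy copy yields high correlation, an independent
# draw yields correlation near zero (the CNN-like case).
rng = np.random.default_rng(0)
a = rng.standard_normal((500, 32))
assert mean_coefficient_correlation(a, a + 0.1 * rng.standard_normal(a.shape)) > 0.9
assert abs(mean_coefficient_correlation(a, rng.standard_normal(a.shape))) < 0.1
```

Averaging the per-dimension correlations into one scalar is what produces the single curves per network pair shown in Figure 5.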
Figure 5 shows the evolution of the correlation during training, confirming that the similarity of coefficient values is also larger for FCNs than for CNNs if measured by correlation. The correlation of linear coefficients after re-training and convergence for the FCNs is significant. Interestingly, for the networks trained on CIFAR10, we notice jumps in the correlation values precisely when the learning rates get decreased.

5. CONCLUSION

The function of a ReLU network is piecewise linear, with an astronomically high number of linear regions. We introduced TropEx, an algorithm to systematically extract linear regions based on data points. The derivation is based on a matrix representation of tropical functions that supports efficient algorithmic development. TropEx enables investigations of the linear components of piecewise linear network functions: By extracting the networks' linear terms, the algorithm allows us to compare training and test regions and to systematically analyze their linear coefficients. Applying TropEx to fully-connected and convolutional architectures shows significant differences between linear regions of CNNs and FCNs. Other possible use cases are outlined in Appendix G. Our findings indicate a potential benefit of shifting focus from counting linear regions to an understanding of their interplay, as differences between CNNs and FCNs may be found in the coefficients of the extracted linear terms. Several measures of similarity indicate that the linear terms of CNNs are more diverse than those of FCNs and suggest that CNNs efficiently exploit the structure imposed by their architecture, whereas FCNs rely on encoding information in the values of linear coefficients.



Footnotes:
1. Link to open source implementation: https://github.com/martrim/tropex
2. This is in line with the tropical algebra notation: a tropical quotient is the same as a usual difference.
3. This implies that every ReLU neural network classifier can be represented by a convex function.
4. We also experimented with replacing ReLU with Leaky ReLU. Our observations are in line with what we describe for ReLU. More details can be found in the appendix, Section E.3.
5. Except for the small 2-layer architecture, where the data points lie on 59,850 regions instead of 60,000.
6. We compare linear coefficients of the full network function instead of weight vectors. Whereas symmetries in the parameterization of a network function make comparisons of weight vectors complicated, the comparison of linear coefficients is well-defined.



Figure 1: A ReLU network function before (left) and after (right) extraction. Left: Hyperplanes separate the input space into linear regions. Most of them do not contain any data points. Each data point occupies its own linear region. Right: After extraction, the function remains unchanged on the linear regions of training samples. Test samples now fall into regions of training samples.

Figure 2: Angle and distance between the coefficients of the linear regions of test data and the training data used for correct (blue) and incorrect (red) classification by the extracted function. All angles for CNNs (left) are close to orthogonal, while FCNs (right) show clear correlations between angles, distances and correctness of prediction.

Figure 3: Left: Comparison of average test accuracy (dotted red) and agreement with labels assigned by the original network (blue) for networks Narrow and Wide while being transformed into tropical functions. The letters D,M,C denote dense, maxpooling and convolutional layers, respectively. Right: Average test accuracy for all CIFAR networks while being transformed into a tropical function. Fully-connected: dotted blue lines, CNNs: full red lines. The curves show a significant difference between the network types with FCN performance being more stable to extraction of linear regions.

Figure 4: Mean and standard deviation of the performance of extracted tropical function during training over 5 networks. The performance of the CNN network function suffers strongly from extraction early in training, whereas the FCN shows a slow, gradual decline.

Figure 5: Pearson correlation between the linear coefficients of two separately trained networks, averaged over all input dimensions. Black dots indicate a reduction of the learning rate by a factor of 10. Coefficients are more correlated after re-training for FCNs than for CNNs, suggesting that FCNs encode more information in the coefficients of linear regions than CNNs.

Re-training the network. The observation of smaller angles for FCNs in Figure 2 further supports our interpretation that the coefficient values of linear regions play a larger role in classification for FCNs, since smaller angles together with small Euclidean distances are explained by similarity of the coefficient values. This suggests also comparing the similarity of linear coefficients after re-training the network, in order to further understand the information encoded in the linear coefficients. Again, we find that the coefficient vectors of two separately trained CNNs are close to orthogonal, whereas both angles and distances are considerably smaller for FCNs.⁶ Plots are shown in Appendix E.6.



ACKNOWLEDGEMENTS

This work was supported in part by the European Research Council Consolidator grant SEED, CNCS-UEFISCDI PN-III-PCCF-2016-0180, the Swedish Foundation for Strategic Research (SSF) Smart Systems Program, as well as the Wallenberg AI, Autonomous Systems and Software Program (WASP) funded by the Knut and Alice Wallenberg Foundation.

