ON THE DECISION BOUNDARIES OF NEURAL NETWORKS: A TROPICAL GEOMETRY PERSPECTIVE

Abstract

This work tackles the problem of characterizing and understanding the decision boundaries of neural networks with piecewise linear (non-linear) activation functions. We use tropical geometry, a new development in the area of algebraic geometry, to characterize the decision boundaries of a simple network of the form (Affine, ReLU, Affine). Our main finding is that the decision boundaries are a subset of a tropical hypersurface, which is intimately related to a polytope formed by the convex hull of two zonotopes. The generators of these zonotopes are functions of the network parameters. This geometric characterization provides new perspectives to three tasks. (i) We propose a new tropical perspective on the lottery ticket hypothesis, where we view the effect of different initializations on the tropical geometric representation of a network's decision boundaries. (ii) We propose new tropical-based optimization reformulations that directly influence the decision boundaries of the network for the task of network pruning. (iii) Lastly, we briefly discuss the reformulation of the generation of adversarial attacks in a tropical sense, which we elaborate on in detail in the supplementary material.

1. INTRODUCTION

Deep Neural Networks (DNNs) have demonstrated outstanding performance across a variety of research domains, including computer vision (Krizhevsky et al., 2012), speech recognition (Hinton et al., 2012), natural language processing (Bahdanau et al., 2015; Devlin et al., 2018), quantum chemistry (Schütt et al., 2017), and healthcare (Ardila et al., 2019; Zhou et al., 2019), to name a few (LeCun et al., 2015). Nevertheless, a rigorous interpretation of their success remains elusive (Shalev-Shwartz & Ben-David, 2014). For instance, in an attempt to uncover the expressive power of DNNs, the work of Montufar et al. (2014) studied the complexity of functions computable by DNNs with piecewise linear activations and derived a lower bound on the maximum number of linear regions. Several other works have followed to improve such estimates under certain assumptions (Arora et al., 2018). In addition, and in an attempt to understand some of the subtle behaviours DNNs exhibit, e.g. their sensitive reaction to small input perturbations, several works have directly investigated the decision boundaries induced by a DNN for classification. The work of Moosavi-Dezfooli et al. (2019) showed that the smoothness of these decision boundaries and their curvature can play a vital role in network robustness. Moreover, the expressiveness of these decision boundaries at perturbed inputs was studied in He et al. (2018), where it was shown that these boundaries do not resemble the boundaries around benign inputs. The work of Li et al. (2018) showed that, under certain assumptions, the decision boundaries of the last fully connected layer of a DNN will converge to a linear SVM. Also, Beise et al. (2018) showed that the decision regions of DNNs with width smaller than the input dimension are unbounded. More recently, and due to the popularity of the piecewise linear ReLU as an activation function, there has been a surge in the number of works that study this class of DNNs in particular.
As a result, this has incited significant interest in new mathematical tools that help analyze piecewise linear functions, such as tropical geometry. While tropical geometry has shown its potential in many applications, such as dynamic programming (Joswig & Schröter, 2019), linear programming (Allamigeon et al., 2015), multi-objective discrete optimization (Joswig & Loho, 2019), enumerative geometry (Mikhalkin, 2004), and economics (Akian et al., 2009; Mai Tran & Yu, 2015), it has only recently been used to analyze DNNs. For instance, the work of Zhang et al. (2018) showed an equivalence between the family of DNNs with piecewise linear activations and integer weight matrices and the family of tropical rational maps, i.e. ratios of two multivariate polynomials in tropical algebra. That study was mostly concerned with characterizing the complexity of a DNN by counting the number of linear regions into which the function represented by the DNN divides the input space; this was done by counting the number of vertices of a polytope representation, recovering the results of Montufar et al. (2014) with a simpler analysis. More recently, Smyrnis & Maragos (2019) leveraged this equivalence to propose a heuristic for neural network minimization through approximating the tropical rational map.

Contributions. In this paper, we take the results of Zhang et al. (2018) several steps further and present a novel perspective on the decision boundaries of DNNs using tropical geometry. To that end, our contributions are three-fold. (i) We derive a geometric representation (convex hull between two zonotopes) for a superset of the decision boundaries of a DNN in the form (Affine, ReLU, Affine). (ii) We demonstrate support for the lottery ticket hypothesis (Frankle & Carbin, 2019) from a geometric perspective.
(iii) We leverage the geometric representation of the decision boundaries, referred to as the decision boundaries polytope, in two interesting applications: network pruning and adversarial attacks. For tropical pruning, we design a geometrically inspired optimization to prune the parameters of a given network such that the decision boundaries polytope of the pruned network does not deviate too much from that of its original counterpart. We conduct extensive experiments with AlexNet (Krizhevsky et al., 2012) and VGG16 (Simonyan & Zisserman, 2014) on the SVHN (Netzer et al., 2011), CIFAR10, and CIFAR100 (Krizhevsky & Hinton, 2009) datasets, in which a 90% pruning rate is achieved with a marginal drop in testing accuracy. For tropical adversarial attacks, we show that one can construct input adversaries that can change network predictions by perturbing the decision boundaries polytope.

2. PRELIMINARIES TO TROPICAL GEOMETRY

For completeness, we first provide preliminaries to tropical geometry (Itenberg et al., 2009; Maclagan & Sturmfels, 2015).

Definition 1. (Tropical Semiring) The tropical semiring T is the triplet {R ∪ {−∞}, ⊕, ⊙}, where ⊕ and ⊙ denote tropical addition and tropical multiplication, respectively. They are defined as: x ⊕ y = max{x, y} and x ⊙ y = x + y, ∀x, y ∈ T. It can be readily shown that −∞ is the additive identity and 0 is the multiplicative identity. Given the previous definition, a tropical power can be formulated as x^⊙a = x ⊙ x ⊙ ⋯ ⊙ x = a·x, for x ∈ T and a ∈ N, where a·x is standard multiplication. Moreover, a tropical quotient can be defined as x ⊘ y = x − y, where x − y is standard subtraction. For ease of notation, we write x^⊙a as x^a.

Definition 2. (Tropical Polynomials) For x ∈ T^d, c_i ∈ R and a_i ∈ N^d, a d-variable tropical polynomial with n monomials f : T^d → T can be expressed as f(x) = (c_1 ⊙ x^{a_1}) ⊕ (c_2 ⊙ x^{a_2}) ⊕ ⋯ ⊕ (c_n ⊙ x^{a_n}), where a_i ≠ a_j for i ≠ j. We use the more compact vector notation x^a = x_1^{a_1} ⊙ x_2^{a_2} ⊙ ⋯ ⊙ x_d^{a_d}. Moreover, and for ease of notation, we will denote c_i ⊙ x^{a_i} as c_i x^{a_i} throughout the paper.

Definition 3. (Tropical Rational Functions) A tropical rational function is a standard difference, or equivalently a tropical quotient, of two tropical polynomials: f(x) ⊘ g(x) = f(x) − g(x).

Algebraic curves or hypersurfaces in algebraic geometry, which are the solution sets of polynomials, can be analogously extended to tropical polynomials.

Definition 4. (Tropical Hypersurfaces) The tropical hypersurface of a tropical polynomial f(x) = c_1 x^{a_1} ⊕ ⋯ ⊕ c_n x^{a_n} is the set of points x where the value of f is attained by two or more of its monomials, i.e. T(f) := {x ∈ R^d : c_i x^{a_i} = c_j x^{a_j} = f(x) for some a_i ≠ a_j}. Tropical hypersurfaces divide the domain of f into convex regions, on each of which f is linear. Also, every tropical polynomial can be associated with a Newton polytope.
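To make the semiring operations concrete, here is a minimal sketch (our own illustration, not from the paper) of tropical addition, multiplication, and powers, together with the evaluation of a tropical polynomial in ordinary arithmetic:

```python
# A minimal sketch of the tropical semiring (R ∪ {-inf}, ⊕, ⊙) from
# Definitions 1-2: ⊕ is max, ⊙ is ordinary addition.

def trop_add(x, y):      # x ⊕ y = max{x, y}
    return max(x, y)

def trop_mul(x, y):      # x ⊙ y = x + y
    return x + y

def trop_pow(x, a):      # x^a = a * x (standard multiplication)
    return a * x

def trop_poly(coeffs, exponents, point):
    """Evaluate f(x) = ⊕_i (c_i ⊙ x^{a_i}) at `point` (a list of reals).
    In ordinary arithmetic each monomial c_i ⊙ x^{a_i} is c_i + <a_i, x>."""
    vals = [c + sum(a_j * x_j for a_j, x_j in zip(a, point))
            for c, a in zip(coeffs, exponents)]
    return max(vals)

# f(x, y) = x ⊕ y ⊕ 0, the first example discussed below Definition 4.
f = lambda x, y: trop_poly([0, 0, 0], [(1, 0), (0, 1), (0, 0)], [x, y])
print(f(2.0, -1.0))  # max(2, -1, 0) = 2.0
```

Note that a tropical polynomial is simply a max of affine functions of its variables, which is why it is piecewise linear and convex.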
A tropical polynomial determines a dual subdivision, which can be constructed by projecting the collection of upper faces (UF) of P(f) := ConvHull{(a_i, c_i) ∈ R^d × R : i = 1, . . . , n} onto R^d. That is to say, the dual subdivision determined by f is given as δ(f) := {π(p) ⊂ R^d : p ∈ UF(P(f))}, where π : R^d × R → R^d is the projection that drops the last coordinate. It has been shown by Maclagan & Sturmfels (2015) that the tropical hypersurface T(f) is the (d−1)-skeleton of the polyhedral complex dual to δ(f). This implies that each node of the dual subdivision δ(f) corresponds to one region in R^d where f is linear. This is exemplified in Figure 1 with three tropical polynomials; to see this clearly, we elaborate on the first tropical polynomial example, f(x, y) = x ⊕ y ⊕ 0. Note that, as per Definition 4, the tropical hypersurface is the set of points (x, y) where at least two of x, y, and 0 are equal and attain the maximum. This indeed gives rise to the three solid red lines indicating the tropical hypersurface. As for the dual subdivision δ(f), we observe that x ⊕ y ⊕ 0 can be written as (x^1 ⊙ y^0) ⊕ (x^0 ⊙ y^1) ⊕ (x^0 ⊙ y^0). Thus, and since the monomials are bias free (c_i = 0), P(f) = ConvHull{(1, 0, 0), (0, 1, 0), (0, 0, 0)}. It is then easy to see that δ(f) = ConvHull{(1, 0), (0, 1), (0, 0)}, since UF(P(f)) = P(f); this is the black triangle in solid lines in Figure 1. One key observation in all three examples in Figure 1 is that the number of regions where f is linear (3, 6 and 10, respectively) is equal to the number of nodes in the corresponding dual subdivision. Second, the tropical hypersurfaces are parallel to the normals to the edges of the dual subdivision polytope. This observation will be essential for the remaining part of the paper. Several other observations are summarized by Brugallé & Shaw (2014). Moreover, Zhang et al.
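The correspondence between nodes of the dual subdivision and linear regions can be checked numerically. The following sketch (ours, for illustration) classifies grid points of R² by which monomial of f(x, y) = x ⊕ y ⊕ 0 attains the maximum, and counts the resulting linear regions:

```python
# Quick numerical check that f(x, y) = x ⊕ y ⊕ 0 = max(x, y, 0) is linear on
# 3 regions, matching the 3 nodes of its dual subdivision
# ConvHull{(1, 0), (0, 1), (0, 0)}.

import itertools

monomials = [(1, 0), (0, 1), (0, 0)]  # exponents a_i; all biases c_i = 0

def active_monomial(x, y):
    """Index of the monomial attaining the max at (x, y)."""
    vals = [a[0] * x + a[1] * y for a in monomials]
    return vals.index(max(vals))

grid = [i * 0.5 for i in range(-10, 11)]
regions = {active_monomial(x, y) for x, y in itertools.product(grid, grid)}
print(len(regions))  # 3 linear regions
```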
(2018) showed an equivalence between tropical rational maps and a family of neural networks f : R^n → R^k with piecewise linear activations through the following theorem.

Theorem 1. (Tropical Characterization of Neural Networks, (Zhang et al., 2018)) A feedforward neural network with integer weights, real biases, and piecewise linear activation functions is a function f : R^n → R^k whose coordinates are tropical rational functions of the input, i.e., f(x) = H(x) ⊘ Q(x) = H(x) − Q(x), where H and Q are tropical polynomials.

While this is new in the context of tropical geometry, it is not surprising, since any piecewise linear function can be written as a difference of two max functions over a set of hyperplanes (Melzer, 1986). Before any further discussion, we first recap the definition of zonotopes.

Definition 6. Let u_1, . . . , u_L ∈ R^n. The zonotope formed by u_1, . . . , u_L is defined as Z(u_1, . . . , u_L) := {Σ_{i=1}^L x_i u_i : 0 ≤ x_i ≤ 1}. Equivalently, Z can be expressed with respect to the generator matrix U ∈ R^{L×n}, where U(i, :) = u_i^⊤, as Z_U := {U^⊤ x : x ∈ [0, 1]^L}.

Another common definition of a zonotope is the Minkowski sum of the set of line segments {u_1, . . . , u_L} (refer to appendix), where a line segment of the vector u_i in R^n is defined as {αu_i : α ∈ [0, 1]}. It is well known that the number of vertices of a zonotope is polynomial in the number of line segments, i.e. |vert(Z_U)| ≤ 2 Σ_{i=0}^{n−1} (L−1 choose i) = O(L^{n−1}) (Gritzmann & Sturmfels, 1993).

3. DECISION BOUNDARIES OF NEURAL NETWORKS AS POLYTOPES

In this section, we analyze the decision boundaries of a network in the form (Affine, ReLU, Affine) using tropical geometry. For ease, we use ReLUs as the non-linear activation, but any other piecewise linear function can also be used. The functional form of this network is f(x) = B max(Ax + c_1, 0) + c_2, where max(·) is an element-wise operator. The outputs of the network f are the logit scores.
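As a quick illustration of Definition 6, the sketch below (ours; exact vertex enumeration as in Stinson et al. (2016) is far more efficient, but unnecessary for small examples) enumerates the images of the corners of [0,1]^L, among which the zonotope's vertices lie:

```python
# Illustrative sketch of Definition 6: the zonotope Z_U = {U^T x : x ∈ [0,1]^L}.
# Its vertices lie among the images of the 2^L corners of [0,1]^L, so for small
# L we can enumerate candidate vertices by brute force.

import itertools
import numpy as np

def zonotope_corner_points(U):
    """U is L x n with rows u_1, ..., u_L; return the 2^L corner images U^T x."""
    L, _ = U.shape
    corners = itertools.product([0.0, 1.0], repeat=L)
    return np.array([U.T @ np.array(x) for x in corners])

U = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])  # 3 line segments in R^2
pts = zonotope_corner_points(U)
print(len(pts))  # 8 candidate points; the zonotope itself is a hexagon
```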
Throughout this section, we assume that A ∈ Z^{p×n}, B ∈ Z^{2×p}, c_1 ∈ R^p and c_2 ∈ R^2. For ease of notation, we only consider networks with two outputs, i.e. B ∈ Z^{2×p}; the extension to multi-class outputs follows naturally and is discussed in the appendix. Now, since f is a piecewise linear function, each output can be expressed as a tropical rational as per Theorem 1. If f_1 and f_2 refer to the first and second outputs respectively, we have f_1(x) = H_1(x) ⊘ Q_1(x) and f_2(x) = H_2(x) ⊘ Q_2(x), where H_1, H_2, Q_1 and Q_2 are tropical polynomials. In what follows and for ease of presentation, we present our main results for a network f with no biases, i.e. c_1 = 0 and c_2 = 0, and we leave the generalization to the appendix.

Theorem 2. For a bias-free neural network f(x) : R^n → R^2, where A ∈ Z^{p×n} and B ∈ Z^{2×p}, let R(x) = H_1(x) ⊙ Q_2(x) ⊕ H_2(x) ⊙ Q_1(x) be a tropical polynomial. Then:

• Let B = {x ∈ R^n : f_1(x) = f_2(x)} define the decision boundaries of f. Then B ⊆ T(R(x)).

• δ(R(x)) = ConvHull(Z_{G1}, Z_{G2}). Z_{G1} is a zonotope in R^n with line segments {(B^+(1, j) + B^−(2, j))[A^+(j, :), A^−(j, :)]}_{j=1}^p and shift (B^−(1, :) + B^+(2, :))A^−, where A^+ = max(A, 0) and A^− = max(−A, 0). Z_{G2} is a zonotope in R^n with line segments {(B^−(1, j) + B^+(2, j))[A^+(j, :), A^−(j, :)]}_{j=1}^p and shift (B^+(1, :) + B^−(2, :))A^−.

Digesting Theorem 2. This theorem characterizes the decision boundaries (where f_1(x) = f_2(x)) of a bias-free neural network of the form (Affine, ReLU, Affine) through the lens of tropical geometry. In particular, the first result of Theorem 2 states that the tropical hypersurface T(R(x)) of the tropical polynomial R(x) is a superset of the set of points forming the decision boundaries, i.e. B. Just as discussed earlier and exemplified in Figure 1, tropical hypersurfaces are associated with a corresponding dual subdivision polytope δ(R(x)).
Based on this, the second result of Theorem 2 states that this dual subdivision is precisely the convex hull of the two zonotopes Z_{G1} and Z_{G2}, where each zonotope is a function only of the network parameters A and B. Theorem 2 thus bridges the gap between the behaviour of the decision boundaries B, through the superset T(R(x)), and the polytope δ(R(x)), which is the convex hull of two zonotopes. It is worthwhile to mention that Zhang et al. (2018) discussed a special case of the first part of Theorem 2 for a neural network with a single output and a score function s(x) to classify the output. To the best of our knowledge, this work is the first to propose a tropical geometric formulation of a superset containing the decision boundaries of a multi-class classification neural network. In particular, the first result of Theorem 2 states that one can study the decision boundaries, B, directly by studying their superset T(R(x)). While studying T(R(x)) can be equally difficult, the second result of Theorem 2 comes in handy. First, note that, since the network is bias-free, π becomes an identity mapping with δ(R(x)) = ∆(R(x)), and thus the dual subdivision δ(R(x)), which is the Newton polytope ∆(R(x)) in this case, becomes a well-structured geometric object that can be exploited to study the decision boundaries.

[Figure 3 caption: the training dataset; the decision boundaries polytope of the original network (before pruning); and the decision boundaries polytopes of networks pruned at different pruning percentages using different initializations. In the original polytope there are many more vertices than the 4 that are visible, but they are very close to each other, forming many small edges that are not visible in the figure.]

While Theorem 2 presents a strong relation between a polytope (the convex hull of two zonotopes) and the decision boundaries, it remains unclear how such a polytope can be efficiently constructed.
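Following the statement of Theorem 2, the generators of Z_{G1} and Z_{G2} can be assembled directly from A and B. The sketch below is our own reading of the theorem's notation (the helper name is ours): the j-th line segment of Z_{G1} is the segment between the scaled rows A^+(j,:) and A^−(j,:), scaled by B^+(1,j) + B^−(2,j), and similarly for Z_{G2}:

```python
# Sketch of the zonotope generators in Theorem 2 for a bias-free network
# f(x) = B max(Ax, 0) with A in Z^{p x n}, B in Z^{2 x p}. Here
# A+ = max(A, 0) and A- = max(-A, 0), taken elementwise.

import numpy as np

def theorem2_generators(A, B):
    Ap, Am = np.maximum(A, 0), np.maximum(-A, 0)
    Bp, Bm = np.maximum(B, 0), np.maximum(-B, 0)
    c1 = Bp[0] + Bm[1]   # scales of Z_{G1}'s segments: B+(1,j) + B-(2,j)
    c2 = Bm[0] + Bp[1]   # scales of Z_{G2}'s segments: B-(1,j) + B+(2,j)
    # Each segment has end points c * A+(j,:) and c * A-(j,:).
    seg1 = [(c1[j] * Ap[j], c1[j] * Am[j]) for j in range(A.shape[0])]
    seg2 = [(c2[j] * Ap[j], c2[j] * Am[j]) for j in range(A.shape[0])]
    return seg1, seg2

A = np.array([[1, -2], [3, 0]])
B = np.array([[1, -1], [-2, 2]])
seg1, seg2 = theorem2_generators(A, B)
print(seg1[0])  # end points of Z_{G1}'s segment for j = 0
```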
Although the number of vertices of a zonotope is polynomial in the number of its generating line segments, fast algorithms for enumerating these vertices are still restricted to zonotopes with line segments starting at the origin (Stinson et al., 2016). Since the line segments generating the zonotopes in Theorem 2 have arbitrary end points, we present the next result, which transforms these line segments into a generator matrix of line segments starting from the origin, as in Definition 6. This result is essential for an efficient computation of the zonotopes in Theorem 2.

Proposition 1. The zonotope formed by p line segments in R^n with arbitrary end points {[u_1^i, u_2^i]}_{i=1}^p is equivalent to the zonotope formed by the line segments {[u_1^i − u_2^i, 0]}_{i=1}^p with a shift of Σ_{i=1}^p u_2^i.

With the following corollary, we can now represent the arbitrary-end-point line segments forming the zonotopes in Theorem 2 with generator matrices, which allows us to leverage existing algorithms that enumerate zonotope vertices (Stinson et al., 2016). Next, we show several applications of Theorem 2 that leverage this tropical geometric structure.
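Proposition 1 can be verified numerically: a point on the segment [u_1, u_2] is x u_1 + (1 − x) u_2 = x(u_1 − u_2) + u_2, so summing over segments yields the claimed shift. A small check of our own:

```python
# Numerical check of Proposition 1: a zonotope generated by segments with
# arbitrary end points [u1_i, u2_i] equals the zonotope of [u1_i - u2_i, 0]
# shifted by sum_i u2_i. We compare the two point sets over the corners of
# the parameter cube (integer inputs, so the comparison is exact).

import itertools
import numpy as np

rng = np.random.default_rng(0)
p, n = 3, 2
u1 = rng.integers(-3, 4, size=(p, n)).astype(float)
u2 = rng.integers(-3, 4, size=(p, n)).astype(float)

corners = list(itertools.product([0.0, 1.0], repeat=p))
# Original: points sum_i (x_i u1_i + (1 - x_i) u2_i).
orig = {tuple(sum(x[i] * u1[i] + (1 - x[i]) * u2[i] for i in range(p)))
        for x in corners}
# Transformed: points sum_i x_i (u1_i - u2_i) plus the shift sum_i u2_i.
shift = u2.sum(axis=0)
new = {tuple(sum(x[i] * (u1[i] - u2[i]) for i in range(p)) + shift)
       for x in corners}
print(orig == new)  # True
```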

4. TROPICAL PERSPECTIVE TO THE LOTTERY TICKET HYPOTHESIS

The lottery ticket hypothesis was recently proposed by Frankle & Carbin (2019), in which the authors surmise the existence of sparse trainable sub-networks of dense, randomly-initialized, feedforward networks that, when trained in isolation, perform as well as the original network in a similar number of iterations. To find such sub-networks, Frankle & Carbin (2019) propose the following simple algorithm: perform standard network pruning, initialize the pruned network with the same initialization that was used in the original training setting, and train with the same number of epochs. They hypothesize that this results in a smaller network with a similar accuracy. In other words, a sub-network can have decision boundaries similar to those of the original larger network. While we do not provide a theoretical reason why this pruning algorithm performs favorably, we utilize the geometric structure that arises from Theorem 2 to reaffirm this behaviour. In particular, we show that the orientation of the dual subdivision δ(R(x)) (referred to as the decision boundaries polytope), whose edge normals are parallel to T(R(x)), a superset of the decision boundaries, is preserved after pruning with the proposed initialization algorithm of Frankle & Carbin (2019). Conversely, pruning with a different initialization at each iteration results in a significant variation in the orientation of the decision boundaries polytope and ultimately in reduced accuracy. To this end, we train a neural network with 2 inputs (n = 2), 2 outputs, and a single hidden layer with 40 nodes (p = 40). We then prune the network by removing the smallest x% of the weights. The pruned network is then trained using different initializations: (i) the same initialization as the original network (Frankle & Carbin, 2019), (ii) Xavier (Glorot & Bengio, 2010), (iii) standard Gaussian, and (iv) zero-mean Gaussian with variance 0.1. Figure 3 shows the decision boundaries polytope, i.e.
δ(R(x)), as we perform more pruning (increasing the x%) with different initializations. First, we show the decision boundaries by sampling and classifying points in a grid with the trained network (first subfigure). We then plot the decision boundaries polytope δ(R(x)) as per the second part of Theorem 2 denoted as original polytope (second subfigure). While there are many overlapping vertices in the original polytope, the normals to some of the edges (the major visible edges) are parallel to the decision boundaries shown in the first subfigure of Figure 3 . We later show the decision boundaries polytope for the same network under different levels of pruning. One can observe that the orientation of δ(R(x)) for all different initialization schemes deviates much more from the original polytope as compared to the lottery ticket initialization. This gives an indication that lottery ticket initialization indeed preserves the decision boundaries, since it preserves the orientation of the decision boundaries polytope throughout the evolution of pruning. An alternative means to study the lottery ticket could be to directly observe the polytopes representing the functional form of the network, i.e. δ(H {1,2} (x)) and δ(Q {1,2} (x)), in lieu of the decision boundaries polytopes. However, this strategy may fail to provide a conclusive analysis of the lottery ticket, since there can exist multiple polytopes δ(H {1,2} (x)) and δ(Q {1,2} (x)) for networks with the same decision boundaries. This highlights the importance of studying the decision boundaries directly. Additional discussions and experiments are left for the appendix.

5. TROPICAL NETWORK PRUNING

Network pruning has been identified as an effective approach to reduce the computational cost and memory usage during network inference. While it dates back to the work of LeCun et al. (1990) and Hassibi & Stork (1993), network pruning has recently gained more attention, due to the fact that most neural networks over-parameterize commonly used datasets. In network pruning, the task is to find a smaller subset of the network parameters such that the resulting smaller network has decision boundaries similar to those of the original over-parameterized network (and thus supposedly similar accuracy). In this section, we show a new geometric approach to network pruning. In particular, and as indicated by Theorem 2, preserving the polytope δ(R(x)) preserves a superset of the decision boundaries, T(R(x)), and thus the decision boundaries themselves.

Motivational Insight. For a single hidden layer neural network, the dual subdivision to the decision boundaries is the polytope that is the convex hull of two zonotopes, each formed by taking the Minkowski sum of a set of line segments (Theorem 2). Figure 4 shows an example where pruning a neuron in the network has no effect on the dual subdivision polytope and hence no effect on performance. This occurs since the tropical hypersurface T(R(x)) before and after pruning is preserved, thus keeping the decision boundaries the same.

Problem Formulation. In light of this motivational insight, a natural question arises: given an over-parameterized binary-output neural network f(x) = B max(Ax, 0), can one construct a new neural network, parameterized by sparser weight matrices Ã and B̃, such that this smaller network has a dual subdivision δ(R̃(x)) that preserves the decision boundaries of the original network? To address this question, we propose the following optimization problem to compute Ã and B̃:

min_{Ã,B̃} d(δ(R̃(x)), δ(R(x))) = min_{Ã,B̃} d(ConvHull(Z_{G̃1}, Z_{G̃2}), ConvHull(Z_{G1}, Z_{G2})).  (1)

The function d(·, ·)
defines a distance between two geometric objects. Since the generators G̃1 and G̃2 are functions of Ã and B̃ (as per Theorem 2), this optimization problem can be challenging to solve. However, for pruning purposes, one can observe from Theorem 2 that a smaller number of line segments (rows) in the generators G̃1 and G̃2 corresponds to a smaller number of rows in the weight matrix Ã (sparser weights). We also observe that if G̃1 ≈ G1 and G̃2 ≈ G2, then δ(R̃(x)) ≈ δ(R(x)), and thus the decision boundaries tend to be preserved as a consequence. Therefore, we propose the following optimization problem as a surrogate to Problem (1):

min_{Ã,B̃} (1/2)(‖G̃1 − G1‖_F^2 + ‖G̃2 − G2‖_F^2) + λ_1 ‖G̃1‖_{2,1} + λ_2 ‖G̃2‖_{2,1}.  (2)

The matrix mixed norm for C ∈ R^{n×k} is defined as ‖C‖_{2,1} = Σ_{i=1}^n ‖C(i, :)‖_2, which encourages the matrix C to be row sparse, i.e. complete rows of C are zero. The first two terms in Problem (2) aim at approximating the original dual subdivision δ(R(x)) by approximating the underlying generator matrices G1 and G2; this preserves the orientation of the decision boundaries of the newly constructed network. The last two terms in Problem (2) act as regularizers that control the sparsity of the constructed network by controlling the number of active line segments. Since Problem (2) is separable in the rows of Ã and B̃, we solve it via alternating optimization over these rows, where each sub-problem can be shown to be convex and to admit a closed-form solution, leading to a very efficient solver. For ease of notation, we refer to ReLU(B̃(i, :)) and ReLU(−B̃(i, :)) as B̃^+(i, :) and B̃^−(i, :), respectively.
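The surrogate objective in Problem (2) is straightforward to evaluate; the sketch below (our own plain implementation, not the paper's code) spells out the Frobenius fitting terms and the row-sparsity-inducing 2,1-norm regularizers:

```python
# Surrogate objective of Problem (2):
# (1/2)(||G1t - G1||_F^2 + ||G2t - G2||_F^2) + lam1*||G1t||_{2,1} + lam2*||G2t||_{2,1},
# where ||C||_{2,1} = sum_i ||C(i, :)||_2 encourages entire rows of C to vanish.

import numpy as np

def norm_21(C):
    return np.linalg.norm(C, axis=1).sum()

def surrogate_objective(G1t, G2t, G1, G2, lam1, lam2):
    fit = 0.5 * (np.linalg.norm(G1t - G1, "fro") ** 2
                 + np.linalg.norm(G2t - G2, "fro") ** 2)
    reg = lam1 * norm_21(G1t) + lam2 * norm_21(G2t)
    return fit + reg

G1 = np.array([[3.0, 4.0], [0.0, 0.0]])
print(surrogate_objective(G1, G1, G1, G1, 1.0, 1.0))  # fit = 0, reg = 2 * 5 = 10.0
```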
As such, the per-row update for Ã (the first linear layer) is given as follows:

Ã(i, :) = max( 1 − ((λ_1 c_1^i + λ_2 c_2^i)/2) / ( ((c_1^i + c_2^i)/2) · ‖(c_1^i G_1(i, :) + c_2^i G_2(i, :)) / ((c_1^i + c_2^i)/2)‖_2 ), 0 ) · (c_1^i G_1(i, :) + c_2^i G_2(i, :)) / ((c_1^i + c_2^i)/2),

where c_1^i is the i-th element of c_1 = ReLU(B(1, :)) + ReLU(−B(2, :)), and c_2^i the i-th element of c_2 = ReLU(B(2, :)) + ReLU(−B(1, :)). Similarly, the closed-form update for the j-th element of the second linear layer is as follows:

B̃^+(1, j) = max( 0, ( Ã(j, :)^⊤ G̃_1^+(j, :) − λ ‖Ã(j, :)‖_2 ) / ‖Ã(j, :)‖_2^2 ),

where G̃_1^+ = Diag(B^+(1, :)) A. A similar argument can be used to update the variables B̃^+(2, :), B̃^−(1, :), and B̃^−(2, :). The details of deriving the aforementioned update steps and the extension to the multi-class case are left to the appendix. Note that all updates are cheap, as each is expressed in a closed-form single step. In all subsequent experiments, we find that running the alternating optimization for a single iteration is sufficient to converge to a reasonable solution, thus leading to a very efficient overall solver.

Extension to Deeper Networks. While the theoretical results in Theorem 2 and Corollary 1 only hold for a shallow network of the form (Affine, ReLU, Affine), we propose a greedy heuristic to prune much deeper networks by applying the aforementioned optimization to consecutive blocks of (Affine, ReLU, Affine), starting from the input and ending at the output of the network. Similar extensions from a theoretical study of two-layer networks have been adopted in several works, such as Bibi et al. (2018).

Experiments on Tropical Pruning. Here, we evaluate the performance of the proposed pruning approach against several classical approaches on several architectures and datasets. In particular, we compare our tropical pruning approach against Class Blind (CB), Class Uniform (CU) and Class Distribution (CD) (Han et al., 2015; See et al., 2016).
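The closed-form row updates above are instances of block (group) soft-thresholding: the minimizer of (1/2)‖a − v‖_2^2 + λ‖a‖_2 is max(1 − λ/‖v‖_2, 0)·v, which either shrinks a row toward the origin or zeroes it entirely. The following is a hedged sketch of that proximal operator, not the paper's exact solver:

```python
# Block soft-thresholding: prox of the l2 norm. Rows whose "signal" v has
# norm below the threshold lam are zeroed, i.e. pruned; otherwise the row is
# shrunk by the factor (1 - lam/||v||_2).

import numpy as np

def block_soft_threshold(v, lam):
    nrm = np.linalg.norm(v)
    if nrm <= lam:
        return np.zeros_like(v)   # the whole row is pruned
    return (1.0 - lam / nrm) * v

row = np.array([3.0, 4.0])               # ||row||_2 = 5
print(block_soft_threshold(row, 1.0))    # shrinks the row by factor 0.8
print(block_soft_threshold(row, 6.0))    # lam >= norm, so the row is zeroed
```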
In Class Blind, all the parameters of a layer are sorted by magnitude and the x% with smallest magnitude are pruned. In contrast, Class Uniform prunes the parameters with the smallest x% magnitudes per node in a layer. Lastly, Class Distribution performs pruning of all parameters for each node in the layer, just as in Class Uniform, but the parameters are pruned based on the standard deviation σ_c of the magnitude of the parameters per node. Since fully connected layers in deep neural networks tend to have much higher memory complexity than convolutional layers, we restrict our focus to pruning fully connected layers. We train AlexNet and VGG16 on the SVHN, CIFAR10, and CIFAR100 datasets. We observe that we can prune more than 90% of the classifier parameters for both networks without affecting the accuracy. Since pruning is often a single block within a larger compression scheme that in many cases involves inexpensive fast fine-tuning, we demonstrate experimentally that our approach is competitive with, and sometimes outperforms, other methods even when all parameters, or only the biases, are fine-tuned after pruning. These experiments, in addition to many others, are left for the appendix.

Setup. To account for the discrepancy in input resolution, we adapt the architectures of AlexNet and VGG16, since they were originally trained on ImageNet (Deng et al., 2009). The fully connected layers of AlexNet and VGG16 have sizes (256, 512, 10) and (512, 512, 10) for SVHN and CIFAR10, respectively, with the last dimension increased to 100 for CIFAR100. All networks were trained to baseline test accuracies of (92%, 74%, 43%) for AlexNet on SVHN, CIFAR10, and CIFAR100, respectively, and (92%, 92%, 70%) for VGG16. To evaluate the performance of pruning, and following previous work (Han et al., 2015), we report the area under the curve (AUC) of the pruning-accuracy plot. The higher the AUC, the better the trade-off between pruning rate and accuracy.
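For concreteness, the two magnitude-based baselines can be sketched as follows (our own simplified versions, treating each row of a weight matrix as one node's parameters; Class Distribution's σ_c-based rule is omitted):

```python
# Simplified sketches of magnitude pruning baselines:
# Class Blind sorts all weights of a layer globally and zeroes the smallest x%;
# Class Uniform applies the same rule independently per node (row).

import numpy as np

def class_blind(W, frac):
    k = int(frac * W.size)
    if k == 0:
        return W.copy()
    thresh = np.sort(np.abs(W), axis=None)[k - 1]
    return np.where(np.abs(W) <= thresh, 0.0, W)

def class_uniform(W, frac):
    return np.vstack([class_blind(W[i:i + 1], frac)
                      for i in range(W.shape[0])])

W = np.array([[0.1, -2.0], [0.3, 5.0]])
print(class_blind(W, 0.5))    # zeros the two smallest-magnitude entries
```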
For efficiency purposes, we run the optimization in Problem (2) for a single alternating iteration to identify the rows of Ã and the elements of B̃ that will be pruned.

Results. Figure 5 shows the comparison between our tropical approach and the three popular pruning schemes on both AlexNet and VGG16 over the different datasets. Our proposed approach can indeed prune out as much as 90% of the parameters of the classifier without sacrificing much of the accuracy. For AlexNet, we achieve much better pruning performance than the other methods: we outperform them in AUC by 3%, 3%, and 2% on SVHN, CIFAR10 and CIFAR100, respectively. This indicates that the decision boundaries can indeed be preserved by preserving the dual subdivision polytope. For VGG16, we perform similarly well on both SVHN and CIFAR10 and slightly worse on CIFAR100. While the performance achieved here is comparable to the other pruning schemes, if not better, we emphasize that our contribution does not lie in outperforming state-of-the-art pruning methods, but in giving a new geometry-based perspective to network pruning. We also conducted experiments where only the network biases or only the classifier are fine-tuned after pruning. Retraining only the biases can be sufficient, as they do not contribute to the orientation of the decision boundaries polytope (and effectively the decision boundaries), but only to its translation. Discussions on biases and more results are left for the appendix.

Comparison Against Tropical Geometry Approaches. A recent tropical geometry inspired approach was proposed to address the problem of network pruning. In particular, Smyrnis & Maragos (2019; 2020) (SM) proposed an interesting, yet heuristic, algorithm to directly approximate the tropical rational map by approximating its Newton polytope. For a fair comparison, and following the setup of SM, we train LeNet on MNIST and monitor the test accuracy as we prune its neurons.
We report (neurons kept, SM, ours) triplets in (%) as follows: (100, 98.60, 98.84), (90, 95.71, 98.82), (75, 95.05, 98.8), (50, 95.52, 98.71), (25, 91.04, 98.36), (10, 92.79, 97.99), and (5, 92.93, 94.91) . It is clear that tropical pruning outperforms SM by a margin that reaches 7%. This demonstrates that our theoretically motivated approach is still superior to more recent pruning approaches.

6. TROPICAL ADVERSARIAL ATTACKS

DNNs are notorious for being sensitive to imperceptible noise at their inputs, referred to as adversarial attacks. Several works have investigated DNNs' decision boundaries in the presence of such adversaries. For instance, Khoury & Hadfield-Menell (2018) analyzed the high dimensional geometry of adversarial examples by means of manifold reconstruction, while He et al. (2018) crafted adversarial attacks by estimating the distance to the decision boundaries using random search directions. In this work, we show how Theorem 2 can be leveraged to construct a tropical geometric adversarial attack. Due to space limitations, we leave the extensive formulation, the algorithm to find the adversary, and the experimental results on synthetic and real datasets to the appendix.

7. CONCLUSION

We leverage tropical geometry to characterize the decision boundaries of neural networks in the form (Affine, ReLU, Affine) and relate them to geometric objects such as zonotopes. We then provide a tropical perspective to support the lottery ticket hypothesis, prune networks, and design adversarial attacks. A natural extension is a compact derivation for the characterization of the decision boundaries of convolutional neural networks and graph convolutional networks.

8. PRELIMINARIES AND DEFINITIONS

Fact 1. The Minkowski sum of two sets P and Q is P + Q = {p + q : p ∈ P, q ∈ Q}.

Fact 2. Let f be a tropical polynomial and let a ∈ N. Then P(f^a) = aP(f).

Let both f and g be tropical polynomials. Then:

Fact 3. P(f ⊙ g) = P(f) + P(g). (3)

Fact 4. P(f ⊕ g) = ConvexHull(V(P(f)) ∪ V(P(g))). (4)

Note that V(P(f)) is the set of vertices of the polytope P(f).

Definition 7. (Upper Face of a Polytope P) UF(P) is an upper face of polytope P in R^n if x + te_n ∉ P for any x ∈ UF(P) and t > 0, where e_n is the last canonical basis vector. Formally, UF(P) = {x ∈ P : x + te_n ∉ P, ∀t > 0}.
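Fact 1 can be illustrated directly for finite point sets (our own sketch); recall from Definition 6 that a zonotope is precisely the Minkowski sum of its generating line segments:

```python
# Illustrative check of Fact 1 (Minkowski sum): P + Q = {p + q : p ∈ P, q ∈ Q}.
# For finite point sets this is a direct double loop; summing the end points of
# two orthogonal unit segments yields the corners of the unit square.

import itertools

def minkowski_sum(P, Q):
    return {tuple(p_i + q_i for p_i, q_i in zip(p, q))
            for p, q in itertools.product(P, Q)}

square = minkowski_sum({(0, 0), (1, 0)}, {(0, 0), (0, 1)})
print(sorted(square))  # corners of the unit square
```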

9. EXAMPLES

We revisit the second example in Figure 1. Note that the two-dimensional tropical polynomial f(x, y) can be written as follows:
f(x, y) = (x ⊕ y ⊕ 0) ⊙ ((x ⊙ (-1)) ⊕ (y ⊙ 1) ⊕ 0)
= (x ⊙ x ⊙ (-1)) ⊕ (x ⊙ y ⊙ 1) ⊕ (x ⊙ 0) ⊕ (y ⊙ x ⊙ (-1)) ⊕ (y ⊙ y ⊙ 1) ⊕ (y ⊙ 0) ⊕ (0 ⊙ x ⊙ (-1)) ⊕ (0 ⊙ y ⊙ 1) ⊕ (0 ⊙ 0)
= (x² ⊙ (-1)) ⊕ (x ⊙ y ⊙ 1) ⊕ x ⊕ (y ⊙ x ⊙ (-1)) ⊕ (y² ⊙ 1) ⊕ y ⊕ (x ⊙ (-1)) ⊕ (y ⊙ 1) ⊕ 0
= (x² ⊙ (-1)) ⊕ (x ⊙ y ⊙ 1) ⊕ x ⊕ (y² ⊙ 1) ⊕ (y ⊙ 1) ⊕ 0
= (x² ⊙ y⁰ ⊙ (-1)) ⊕ (x¹ ⊙ y¹ ⊙ 1) ⊕ (x¹ ⊙ y⁰ ⊙ 0) ⊕ (x⁰ ⊙ y² ⊙ 1) ⊕ (x⁰ ⊙ y¹ ⊙ 1) ⊕ (x⁰ ⊙ y⁰ ⊙ 0).
The first equality follows since multiplication distributes over addition in rings and semirings. The second equality follows since 0 is the multiplicative identity. The penultimate equality follows since y ⊙ 1 ≥ y, x ⊙ y ⊙ 1 ≥ x ⊙ y ⊙ (-1) and x ≥ x ⊙ (-1) for all x, y. Therefore, the tropical hypersurface T(f) is defined as the set of (x, y) where f achieves its maximum at least twice among its monomials. That is to say,
T(f) = {f(x, y) = x² ⊙ (-1) = x ⊙ y ⊙ 1} ∪ {f(x, y) = x² ⊙ (-1) = x} ∪ {f(x, y) = x = 0} ∪ {f(x, y) = x = x ⊙ y ⊙ 1} ∪ {f(x, y) = y ⊙ 1 = 0} ∪ {f(x, y) = y ⊙ 1 = x ⊙ y ⊙ 1} ∪ {f(x, y) = y ⊙ 1 = y² ⊙ 1} ∪ {f(x, y) = y² ⊙ 1 = x ⊙ y ⊙ 1}.
This set T(f) is shown by the red lines in the second example in Figure 1. As for constructing the dual subdivision δ(f), we project the upper faces of the Newton polygon P(f) to R². Note that P(f) with biases, as per the definition in Section 2, is given as P(f) = ConvexHull{(aᵢ, cᵢ) ∈ R² × R, ∀i = 1, . . . , 6}, where aᵢ and cᵢ are the exponents and biases of the monomials of f, respectively. Therefore, P(f) = ConvexHull{(2, 0, -1), (1, 1, 1), (1, 0, 0), (0, 2, 1), (0, 1, 1), (0, 0, 0)} as shown in Figure 6(a). As per Definition 7, the set of upper faces of P(f) is: UF(P(f)) = ConvexHull{(0, 2, 1), (1, 1, 1), (0, 1, 1)} ∪ ConvexHull{(0, 1, 1), (1, 1, 1), (1, 0, 0)} ∪ ConvexHull{(0, 1, 1), (1, 0, 0), (0, 0, 0)} ∪ ConvexHull{(1, 1, 1), (2, 0, -1), (1, 0, 0)}. This set UF(P(f)) is then projected, through π, to R², shown by the yellow dashed lines in Figure 6(a), to construct the dual subdivision δ(f) in Figure 6(b).
For example, note that the point (0, 2, 1) ∈ UF(P(f)); thereafter, π(0, 2, 1) = (0, 2) ∈ δ(f).
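The hypersurface condition for this example can be verified numerically. The sketch below is our own illustration; the monomial triples follow the Newton-polytope vertices listed above, and a point lies on T(f) exactly when the maximum is attained by at least two monomials:

```python
# Monomials of the simplified f above, stored as (a, b, c) triples with
# tropical value a*x + b*y + c; the triples are the Newton-polytope vertices.
monomials = [(2, 0, -1), (1, 1, 1), (1, 0, 0), (0, 2, 1), (0, 1, 1), (0, 0, 0)]

def on_tropical_hypersurface(x, y, tol=1e-9):
    """A point lies on T(f) iff the maximum is attained by >= 2 monomials."""
    vals = [a * x + b * y + c for a, b, c in monomials]
    top = max(vals)
    return sum(1 for v in vals if abs(v - top) <= tol) >= 2

# On the piece {f = y ⊙ 1 = 0}: at (-10, -1), monomials (0, 1, 1) and
# (0, 0, 0) both attain the maximum value 0.
assert on_tropical_hypersurface(-10.0, -1.0)
# At a generic point, a single monomial strictly dominates.
assert not on_tropical_hypersurface(10.0, 0.0)
```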

10. PROOF OF THEOREM 2

Theorem 2. For a bias-free neural network of the form f(x) : Rⁿ → R², where A ∈ Z^{p×n} and B ∈ Z^{2×p}, let R(x) = H₁(x) ⊙ Q₂(x) ⊕ H₂(x) ⊙ Q₁(x) be a tropical polynomial. Then:
• Let B = {x ∈ Rⁿ : f₁(x) = f₂(x)} define the decision boundaries of f. Then B ⊆ T(R(x)).
• δ(R(x)) = ConvexHull(Z_{G1}, Z_{G2}). Z_{G1} is a zonotope in Rⁿ with line segments {(B⁺(1, j) + B⁻(2, j))[A⁺(j, :), A⁻(j, :)]}_{j=1}^p and shift (B⁻(1, :) + B⁺(2, :))A⁻. Z_{G2} is a zonotope in Rⁿ with line segments {(B⁻(1, j) + B⁺(2, j))[A⁺(j, :), A⁻(j, :)]}_{j=1}^p and shift (B⁺(1, :) + B⁻(2, :))A⁻. The line segment (B⁺(1, j) + B⁻(2, j))[A⁺(j, :), A⁻(j, :)] has end points A⁺(j, :) and A⁻(j, :) in Rⁿ and is scaled by the constant (B⁺(1, j) + B⁻(2, j)). Note that A⁺ = max(A, 0) and A⁻ = max(-A, 0), where max(·) is element-wise.

Proof. For the first part, recall from Theorem 1 that both f₁ and f₂ are tropical rationals and hence
f₁(x) = H₁(x) - Q₁(x),
f₂(x) = H₂(x) - Q₂(x).
Thus
B = {x ∈ Rⁿ : f₁(x) = f₂(x)}
= {x ∈ Rⁿ : H₁(x) - Q₁(x) = H₂(x) - Q₂(x)}
= {x ∈ Rⁿ : H₁(x) + Q₂(x) = H₂(x) + Q₁(x)}
= {x ∈ Rⁿ : H₁(x) ⊙ Q₂(x) = H₂(x) ⊙ Q₁(x)}.
Recall that a tropical hypersurface is the set of x where the maximum is attained by two or more monomials. Therefore, the tropical hypersurface of R(x) is the set of x where the maximum is attained by two or more monomials in H₁(x) ⊙ Q₂(x), or by two or more monomials in H₂(x) ⊙ Q₁(x), or by monomials of both at the same time; the last set is exactly the decision boundaries. Hence, we can write T(R(x)) = T(H₁(x) ⊙ Q₂(x)) ∪ T(H₂(x) ⊙ Q₁(x)) ∪ B. Therefore B ⊆ T(R(x)). For the second part of the theorem, we first use the decomposition proposed by Zhang et al. (2018); Berrada et al.
(2016) to show that a network f(x) = B max(Ax, 0) can be decomposed as a tropical rational as follows:
f(x) = (B⁺ - B⁻)(max(A⁺x, A⁻x) - A⁻x)
= [B⁺ max(A⁺x, A⁻x) + B⁻A⁻x] - [B⁻ max(A⁺x, A⁻x) + B⁺A⁻x].
Therefore, we have that
H₁(x) + Q₂(x) = (B⁺(1, :) + B⁻(2, :)) max(A⁺x, A⁻x) + (B⁻(1, :) + B⁺(2, :))A⁻x,
H₂(x) + Q₁(x) = (B⁻(1, :) + B⁺(2, :)) max(A⁺x, A⁻x) + (B⁺(1, :) + B⁻(2, :))A⁻x.
Therefore, note that:
δ(R(x)) = δ(H₁(x) ⊙ Q₂(x) ⊕ H₂(x) ⊙ Q₁(x))
= ConvexHull(δ(H₁(x) ⊙ Q₂(x)), δ(H₂(x) ⊙ Q₁(x)))   (by Fact 4)
= ConvexHull(δ(H₁(x)) +̃ δ(Q₂(x)), δ(H₂(x)) +̃ δ(Q₁(x))).   (by Fact 3)
Now observe that H₁(x) = Σ_{j=1}^p (B⁺(1, j) + B⁻(2, j)) max(A⁺(j, :)x, A⁻(j, :)x), which tropically is given as H₁(x) = ⊙_{j=1}^p (x^{A⁺(j,:)} ⊕ x^{A⁻(j,:)})^{B⁺(1,j) + B⁻(2,j)}. Thus we have that:
δ(H₁(x)) = (B⁺(1, 1) + B⁻(2, 1)) δ(x^{A⁺(1,:)} ⊕ x^{A⁻(1,:)}) +̃ . . . +̃ (B⁺(1, p) + B⁻(2, p)) δ(x^{A⁺(p,:)} ⊕ x^{A⁻(p,:)})
= (B⁺(1, 1) + B⁻(2, 1)) ConvexHull(A⁺(1, :), A⁻(1, :)) +̃ . . . +̃ (B⁺(1, p) + B⁻(2, p)) ConvexHull(A⁺(p, :), A⁻(p, :)).
The operator +̃ indicates a Minkowski sum between sets. Note that ConvexHull(A⁺(i, :), A⁻(i, :)) is the convex hull of two points, which is a line segment in Zⁿ with end points {A⁺(i, :), A⁻(i, :)}, scaled by (B⁺(1, i) + B⁻(2, i)). Observe that δ(H₁(x)) is a Minkowski sum of line segments, which is a zonotope. Moreover, note that Q₂(x) = (B⁻(1, :) + B⁺(2, :))A⁻x tropically is given as Q₂(x) = ⊙_{j=1}^p x^{(B⁻(1,j) + B⁺(2,j))A⁻(j,:)}. One can see that δ(Q₂(x)) is the Minkowski sum of the points {(B⁻(1, j) + B⁺(2, j))A⁻(j, :)} ∀j in Rⁿ (which is a standard sum), resulting in a single point. Lastly, δ(H₁(x)) +̃ δ(Q₂(x)) is a Minkowski sum between a zonotope and a single point, which corresponds to a shifted zonotope. A similar symmetric argument can be applied for the second part δ(H₂(x)) +̃ δ(Q₁(x)). It is also worth mentioning that the extension to networks with multi-class output is trivial.
In that case, all of the analysis applies exactly when studying the decision boundary between any two classes (i, j), where B = {x ∈ Rⁿ : fᵢ(x) = fⱼ(x)}, and the rest of the proof remains the same.
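The decomposition at the heart of the proof, f(x) = [B⁺ max(A⁺x, A⁻x) + B⁻A⁻x] − [B⁻ max(A⁺x, A⁻x) + B⁺A⁻x], can be verified numerically. The sketch below is our own illustrative check on random integer weights, not the authors' code:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 4, 6
A = rng.integers(-3, 4, size=(p, n)).astype(float)
B = rng.integers(-3, 4, size=(2, p)).astype(float)
Ap, Am = np.maximum(A, 0), np.maximum(-A, 0)   # A = A+ - A-
Bp, Bm = np.maximum(B, 0), np.maximum(-B, 0)   # B = B+ - B-

for _ in range(100):
    x = rng.standard_normal(n)
    f = B @ np.maximum(A @ x, 0)
    m = np.maximum(Ap @ x, Am @ x)   # max(A+x, A-x) = max(Ax, 0) + A-x
    H = Bp @ m + Bm @ (Am @ x)       # H_i(x), convex piece of the numerator
    Q = Bm @ m + Bp @ (Am @ x)       # Q_i(x), convex piece of the denominator
    assert np.allclose(f, H - Q)     # f_i = H_i - Q_i: a tropical rational
```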

11. PROOF OF PROPOSITION 1

Proposition 1. The zonotope formed by p line segments in Rⁿ with arbitrary end points {[u₁ⁱ, u₂ⁱ]}_{i=1}^p is equivalent to the zonotope formed by the line segments {[u₁ⁱ - u₂ⁱ, 0]}_{i=1}^p with a shift of Σ_{i=1}^p u₂ⁱ.

Proof. Let Uⱼ be the matrix with Uⱼ(:, i) = uⱼⁱ, i = 1, . . . , p, let w be a column vector with w(i) = wᵢ, i = 1, . . . , p, and let 1_p be the column vector of ones of length p. Then, the zonotope Z formed by the Minkowski sum of line segments with arbitrary end points can be defined as:
Z = {Σ_{i=1}^p wᵢu₁ⁱ + (1 - wᵢ)u₂ⁱ ; wᵢ ∈ [0, 1] ∀i}
= {U₁w - U₂w + U₂1_p, w ∈ [0, 1]^p}
= {(U₁ - U₂)w + U₂1_p, w ∈ [0, 1]^p}
= {(U₁ - U₂)w, w ∈ [0, 1]^p} +̃ U₂1_p.
Since the Minkowski sum between a polytope and a point is a translation, the proposition follows directly from Definition 6.

Proof of Corollary 1. This follows directly by applying Proposition 1 to the second bullet point of Theorem 2.

11.1 OPTIMIZATION OF OBJECTIVE 2 OF THE BINARY CLASSIFIER

min_{Ã,B̃} ½‖G̃₁ - G₁‖²_F + ½‖G̃₂ - G₂‖²_F + λ₁‖G̃₁‖₂,₁ + λ₂‖G̃₂‖₂,₁.
Note that G̃₁ = Diag[ReLU(B̃(1, :)) + ReLU(-B̃(2, :))]Ã and G̃₂ = Diag[ReLU(B̃(2, :)) + ReLU(-B̃(1, :))]Ã. Note that G₁ = Diag[ReLU(B(1, :)) + ReLU(-B(2, :))]A and G₂ = Diag[ReLU(B(2, :)) + ReLU(-B(1, :))]A. For ease of notation, we refer to ReLU(B̃(i, :)) and ReLU(-B̃(i, :)) as B̃⁺(i, :) and B̃⁻(i, :), respectively. We solve the problem with coordinate descent, alternating over the variables.
Update Ã.
Ã ← arg min_{Ã} ½‖Diag(c₁)Ã - G₁‖²_F + ½‖Diag(c₂)Ã - G₂‖²_F + λ₁‖Diag(c₁)Ã‖₂,₁ + λ₂‖Diag(c₂)Ã‖₂,₁,
where c₁ = ReLU(B̃(1, :)) + ReLU(-B̃(2, :)) and c₂ = ReLU(B̃(2, :)) + ReLU(-B̃(1, :)). Note that the problem is separable per row of Ã. Therefore, the problem reduces to updating the rows of Ã independently, and each subproblem exhibits a closed-form solution. Hence, G̃_{(i⁺,j⁻)} can be seen as a concatenation of G̃_{i⁺} and G̃_{j⁻}.
Thus, the objective in Equation 6 can be expanded as follows:
min_{Ã,B̃} Σ_{{i,j}∈S} d(ConvexHull(Z_{G̃(i⁺,j⁻)}, Z_{G̃(j⁺,i⁻)}), ConvexHull(Z_{G(i⁺,j⁻)}, Z_{G(j⁺,i⁻)}))
= min_{Ã,B̃} Σ_{{i,j}∈S} d(ConvexHull(Z_{G̃i⁺} +̃ Z_{G̃j⁻}, Z_{G̃j⁺} +̃ Z_{G̃i⁻}), ConvexHull(Z_{Gi⁺} +̃ Z_{Gj⁻}, Z_{Gj⁺} +̃ Z_{Gi⁻}))
≈ min_{Ã,B̃} Σ_{{i,j}∈S} ½‖G̃i⁺ - Gi⁺‖²_F + ½‖G̃i⁻ - Gi⁻‖²_F + ½‖G̃j⁺ - Gj⁺‖²_F + ½‖G̃j⁻ - Gj⁻‖²_F
= min_{Ã,B̃} ((k - 1)/2) Σ_{i=1}^k ‖G̃i⁺ - Gi⁺‖²_F + ‖G̃i⁻ - Gi⁻‖²_F.
The approximation follows by an argument similar to the binary-classifier case. The last equality follows from a counting argument: each class appears in k - 1 of the pairs in S. We solve the objective for all multi-class networks in the experiments with alternating optimization, in a similar fashion to the binary-classifier case. Similarly to the binary classification approach, we introduce the ‖·‖₂,₁ penalty to enforce sparsity for pruning purposes. Therefore, the overall objective has the form:
min_{Ã,B̃} Σ_{i=1}^k ½(‖G̃i⁺ - Gi⁺‖²_F + ‖G̃i⁻ - Gi⁻‖²_F) + λ(‖G̃i⁺‖₂,₁ + ‖G̃i⁻‖₂,₁).
For completeness, we derive the updates for Ã and B̃. Update Ã.

Ã = arg min_{Ã} Σ_{i=1}^k ½‖Diag(B̃⁺(i, :))Ã - Gi⁺‖²_F + ½‖Diag(B̃⁻(i, :))Ã - Gi⁻‖²_F + λ‖Diag(B̃⁺(i, :))Ã‖₂,₁ + λ‖Diag(B̃⁻(i, :))Ã‖₂,₁.
Similar to the binary classification case, the problem is separable in the rows of Ã, and a closed-form solution in terms of the proximal operator of the ℓ₂ norm follows naturally for each Ã(i, :).
Update B̃⁺(i, :).
B̃⁺(i, :) = arg min_{B̃⁺(i,:)} ½‖Diag(B̃⁺(i, :))Ã - Gi⁺‖²_F + λ‖Diag(B̃⁺(i, :))Ã‖₂,₁, s.t. B̃⁺(i, :) ≥ 0.
Note that the problem is separable per coordinate of B̃⁺(i, :), and each subproblem is updated as:
B̃⁺(i, j) = arg min_{B̃⁺(i,j)} ½‖B̃⁺(i, j)Ã(j, :) - Gi⁺(j, :)‖₂² + λ‖B̃⁺(i, j)Ã(j, :)‖₂, s.t. B̃⁺(i, j) ≥ 0
= arg min_{B̃⁺(i,j)} ½‖B̃⁺(i, j)Ã(j, :) - Gi⁺(j, :)‖₂² + λB̃⁺(i, j)‖Ã(j, :)‖₂, s.t. B̃⁺(i, j) ≥ 0
= max(0, (Ã(j, :)ᵀGi⁺(j, :) - λ‖Ã(j, :)‖₂) / ‖Ã(j, :)‖₂²).
A similar argument can be used to update B̃⁻(i, :) ∀i. Finally, the parameters of the pruned network are constructed as A ← Ã and B ← B̃⁺ - B̃⁻.

13. TROPICAL ADVERSARIAL ATTACKS

Dual View of Adversarial Attacks. For a classifier f : Rⁿ → R^k and an input x₀ classified as c, a standard formulation for targeted adversarial attacks towards a different class t is:
min_η D(η) s.t. arg max_i fᵢ(x₀ + η) = t ≠ c.
This objective aims to compute the lowest-energy input noise η (measured by D) such that the new sample (x₀ + η) crosses the decision boundaries of f into a new classification region. Here, we present a dual view of adversarial attacks. Instead of designing a sample noise η such that (x₀ + η) belongs to a new decision region, one can instead fix x₀ and perturb the network parameters to move the decision boundaries in a way that makes x₀ appear in a new classification region. In particular, let A₁ be the first linear layer of f, such that f(x₀) = g(A₁x₀).
One can now perturb A₁ to alter the decision boundaries, and relate this parameter perturbation to the input perturbation as follows: g((A₁ + ξ_{A₁})x₀) = g(A₁x₀ + ξ_{A₁}x₀) = g(A₁x₀ + A₁η) = f(x₀ + η). From this dual view, we observe that traditional adversarial attacks are intimately related to perturbing the parameters of the first linear layer through the linear system A₁η = ξ_{A₁}x₀; the two views and formulations are identical under this condition. With this analysis, Theorem 2 provides explicit means to geometrically construct adversarial attacks by perturbing the decision boundaries. In particular, since the normals to the dual subdivision polytope δ(R(x)) of a given DNN represent the tropical hypersurface T(R(x)), which is a superset of the decision boundaries set B, ξ_{A₁} can be designed to sufficiently perturb the dual subdivision, resulting in a change of the network prediction of x₀ to the targeted class t. Based on this observation, we design an optimization problem that generates two sets of perturbations, an input perturbation and a parameter perturbation, that are equivalent to each other.
Formulation. Based on this observation, we formulate the problem as follows:
min_{η, ξ_{A₁}} D₁(η) + D₂(ξ_{A₁})
s.t. -loss(g(A₁(x₀ + η)), t) ≤ -1; -loss(g((A₁ + ξ_{A₁})x₀), t) ≤ -1;
(x₀ + η) ∈ [0, 1]ⁿ; ‖η‖_∞ ≤ ε₁;
‖ξ_{A₁}‖_{∞,∞} ≤ ε₂; A₁η = ξ_{A₁}x₀. (9)
The loss is the standard cross-entropy loss. The first row of constraints ensures that the network prediction is the desired target class t when the input x₀ is perturbed by η, and equivalently by perturbing the first linear layer A₁ by ξ_{A₁}. This is identical to f₁ as proposed by Carlini & Wagner (2016). Moreover, the third and fourth constraints guarantee that the perturbed input is feasible and that the perturbation is bounded, respectively.
The fifth constraint limits the maximum perturbation on the first linear layer, while the last constraint enforces the dual equivalence between input perturbation and parameter perturbation. The function D₂ captures the perturbation of the dual subdivision polytope upon perturbing the first linear layer by ξ_{A₁}. For a single-hidden-layer neural network parameterized as (A₁ + ξ_{A₁}) ∈ R^{p×n} and B ∈ R^{2×p} for the first and second layers, respectively, D₂ can capture the perturbations in each of the two zonotopes discussed in Theorem 2, and we define it as:
D₂(ξ_{A₁}) = ½ Σ_{j=1}^2 ‖Diag(B⁺(j, :))ξ_{A₁}‖²_F + ‖Diag(B⁻(j, :))ξ_{A₁}‖²_F.
We solve Problem (9) with a penalty method on the linear equality constraint, where each penalty step is solved with ADMM (Boyd et al., 2011) in a similar fashion to the work of Xu et al. (2018). Expanding the definition for the binary case:
D₂(ξ_{A₁}) = ½‖G̃₁ - G₁‖²_F + ½‖G̃₂ - G₂‖²_F = ½‖Diag(B⁺(1, :))ξ_{A₁}‖²_F + ½‖Diag(B⁻(1, :))ξ_{A₁}‖²_F + ½‖Diag(B⁺(2, :))ξ_{A₁}‖²_F + ½‖Diag(B⁻(2, :))ξ_{A₁}‖²_F.
This can thereafter be extended to a multi-class network with k classes as follows:
D₂(ξ_{A₁}) = ½ Σ_{j=1}^k ‖Diag(B⁺(j, :))ξ_{A₁}‖²_F + ‖Diag(B⁻(j, :))ξ_{A₁}‖²_F.
Following Xu et al. (2018), we take D₁(η) = ½‖η‖₂². Therefore, we can write Problem (9) as follows:
min_{η, ξ_{A₁}} D₁(η) + Σ_{j=1}^k ‖Diag(B⁺(j, :))ξ_{A₁}‖²_F + ‖Diag(B⁻(j, :))ξ_{A₁}‖²_F
s.t. -loss(g(A₁(x₀ + η)), t) ≤ -1; -loss(g((A₁ + ξ_{A₁})x₀), t) ≤ -1; (x₀ + η) ∈ [0, 1]ⁿ; ‖η‖_∞ ≤ ε₁; ‖ξ_{A₁}‖_{∞,∞} ≤ ε₂; A₁η - ξ_{A₁}x₀ = 0.
To enforce the linear equality constraint A₁η - ξ_{A₁}x₀ = 0, we use a penalty method, where in each iteration of the penalty method we solve the sub-problem with ADMM updates. That is, we solve the following optimization problem with ADMM, with increasing λ such that λ → ∞. For ease of notation, let us denote L(x₀ + η) = -loss(g(A₁(x₀ + η)), t) and L(A₁) = -loss(g((A₁ + ξ_{A₁})x₀), t). The sub-problem is:
min_{η,z,w,ξ_{A₁}} ‖η‖₂² + L(x₀ + z) + h₁(w) + h₂(ξ_{A₁}) + λ‖A₁η - ξ_{A₁}x₀‖₂² + L(A₁), s.t. η = z, z = w,
where
h₁(w) = 0 if (x₀ + w) ∈ [0, 1]ⁿ and ‖w‖_∞ ≤ ε₁, and ∞ otherwise;
h₂(ξ_{A₁}) = 0 if ‖ξ_{A₁}‖_{∞,∞} ≤ ε₂, and ∞ otherwise.
The augmented Lagrangian is given as follows:
L(η, w, z, ξ_{A₁}, u, v) := ‖η‖₂² + L(x₀ + z) + h₁(w) + Σ_{j=1}^k ‖Diag(B⁺(j, :))ξ_{A₁}‖²_F + ‖Diag(B⁻(j, :))ξ_{A₁}‖²_F + L(A₁) + h₂(ξ_{A₁}) + λ‖A₁η - ξ_{A₁}x₀‖₂² + uᵀ(η - z) + vᵀ(w - z) + (ρ/2)(‖η - z‖₂² + ‖w - z‖₂²).
Thereafter, the ADMM updates are given as follows:
{η^{k+1}, w^{k+1}} = arg min_{η,w} L(η, w, z^k, ξ^k_{A₁}, u^k, v^k),
z^{k+1} = arg min_z L(η^{k+1}, w^{k+1}, z, ξ^k_{A₁}, u^k, v^k),
ξ^{k+1}_{A₁} = arg min_{ξ_{A₁}} L(η^{k+1}, w^{k+1}, z^{k+1}, ξ_{A₁}, u^k, v^k),
u^{k+1} = u^k + ρ(η^{k+1} - z^{k+1}),
v^{k+1} = v^k + ρ(w^{k+1} - z^{k+1}).
Updating η:
η^{k+1} = arg min_η ‖η‖₂² + λ‖A₁η - ξ^k_{A₁}x₀‖₂² + uᵀη + (ρ/2)‖η - z^k‖₂²
= (2λA₁ᵀA₁ + (2 + ρ)I)⁻¹(2λA₁ᵀξ^k_{A₁}x₀ + ρz^k - u^k).
Updating w:
w^{k+1} = arg min_w v^{kᵀ}w + h₁(w) + (ρ/2)‖w - z^k‖₂² = arg min_w ½‖w - (z^k - v^k/ρ)‖₂² + (1/ρ)h₁(w).
The update of w is separable in coordinates as follows:
w^{k+1} = min(1 - x₀, ε₁) if z^k - v^k/ρ > min(1 - x₀, ε₁); max(-x₀, -ε₁) if z^k - v^k/ρ < max(-x₀, -ε₁); z^k - v^k/ρ otherwise.
Updating z: Liu et al. (2019) showed that linearized ADMM converges for some non-convex problems. Therefore, by linearizing L and adding the Bregman divergence term (η_k/2)‖z - z^k‖₂², we can update z as follows:
z^{k+1} = arg min_z L(x₀ + z) - u^{kᵀ}z - v^{kᵀ}z + (ρ/2)(‖η^{k+1} - z‖₂² + ‖w^{k+1} - z‖₂²)
= (1/(η_k + 2ρ))(η_k z^k + ρ(η^{k+1} + u^k/ρ + w^{k+1} + v^k/ρ) - ∇L(z^k + x₀)).
It is worth mentioning that the analysis up to this step is inspired by Xu et al. (2018), with modifications adapting it to our new formulation.
Updating ξ_{A₁}:
ξ^{k+1}_{A₁} = arg min_{ξ_{A₁}} ‖ξ_{A₁}‖²_F + λ‖ξ_{A₁}x₀ - A₁η‖₂² + L(A₁), s.t. ‖ξ_{A₁}‖_{∞,∞} ≤ ε₂.
The previous problem can be solved with proximal gradient methods.
Experimental Setup. For the tropical adversarial attack experiments, there are five hyperparameters:
ε₁: the upper bound on the ℓ∞ norm of η.
ε₂: the upper bound on the ‖·‖_{∞,∞} norm of the perturbation of the first linear layer.
λ: the regularizer enforcing the equality between the input perturbation and the first-layer perturbation.
η: the Bregman divergence constant.
ρ: the ADMM constant.
For all of the experiments, we set the values of ε₂, λ, η and ρ to 1, 10⁻³, 2.5 and 1, respectively. As for ε₁, it is set to 0.1 when attacking MNIST images of digit 4, and to 0.2 for all other MNIST images.
Algorithm 1: Solving Problem (9).
Input: A₁ ∈ R^{p×n}, B ∈ R^{k×p}, x₀ ∈ Rⁿ, t, λ > 0, γ > 1, K > 0, ξ_{A₁} = 0_{p×n}, η¹ = z¹ = w¹ = u¹ = v¹ = 0_n.
Output: η, ξ_{A₁}.
Initialize: ρ = ρ₀.
while not converged do
  for k ≤ K do
    η update: η^{k+1} = (2λA₁ᵀA₁ + (2 + ρ)I)⁻¹(2λA₁ᵀξ^k_{A₁}x₀ + ρz^k - u^k)
    w update: w^{k+1} = min(1 - x₀, ε₁) if z^k - v^k/ρ > min(1 - x₀, ε₁); max(-x₀, -ε₁) if z^k - v^k/ρ < max(-x₀, -ε₁); z^k - v^k/ρ otherwise
    z update: z^{k+1} = (1/(η_k + 2ρ))(η_k z^k + ρ(η^{k+1} + u^k/ρ + w^k + v^k/ρ) - ∇L(z^k + x₀))
    ξ_{A₁} update: ξ^{k+1}_{A₁} = arg min_{ξ_{A₁}} ‖ξ_{A₁}‖²_F + λ‖ξ_{A₁}x₀ - A₁η^{k+1}‖₂² + L(A₁), s.t. ‖ξ_{A₁}‖_{∞,∞} ≤ ε₂
    u update: u^{k+1} = u^k + ρ(η^{k+1} - z^{k+1})
    v update: v^{k+1} = v^k + ρ(w^{k+1} - z^{k+1})
    ρ ← γρ
  end
  λ ← γλ; ρ ← ρ₀
end
Motivational Insight into the Dual View. We train a network with 2 inputs, 50 hidden nodes and 2 outputs on a synthetic dataset, and then solve Problem (9) for a given x₀, shown in black in Figure 7. We show the decision boundaries with and without the perturbation ξ_{A₁} at the first linear layer. As shown in Figure 7, perturbing an edge of the dual subdivision polytope, by perturbing the first linear layer, corresponds to perturbing the decision boundaries and results in the misclassification of x₀. As expected, perturbing different decision boundaries corresponds to perturbing different edges of the dual subdivision.
Note that the generated input perturbation η is also sufficient to fool the network on x₀ + η, and by construction is equivalent to perturbing the decision boundaries of the network. We later show another example where we alter the position of x₀ and construct successful adversaries in both the input space and the parameter space. Furthermore, we conduct experiments on MNIST images in a later section, which show that successful adversarial attacks η can be designed by solving Problem (9). Figure 7 shows another example where the sample to be attacked is closer to a different decision boundary. Observe how the corresponding edge of the decision boundaries polytope has been altered accordingly. MNIST Experiments. Here, we design perturbations to misclassify MNIST images. Figure 8 shows several adversarial examples that change the network prediction for digits 8 and 9 to digits 7, 5, and 4, respectively. In some cases, the perturbation η is as small as ε = 0.1, where x₀ ∈ [0, 1]ⁿ. Several other adversarial results are reported in Figure 9. We again emphasize that our approach is not meant to compete with state-of-the-art adversarial attacks, but rather to provide a novel, geometrically inspired perspective that can shed new light on this field.
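For concreteness, the closed-form η- and w-updates used in Algorithm 1 can be sanity-checked numerically. The sketch below is our own illustrative code with arbitrary dimensions and constants; it verifies stationarity of the quadratic η-subproblem at its closed-form solution, and that the clipped w-update keeps x₀ + w inside [0, 1]ⁿ:

```python
import numpy as np

rng = np.random.default_rng(4)
p, n = 6, 4
A1 = rng.standard_normal((p, n))
x0 = rng.uniform(0, 1, n)
xi = 0.01 * rng.standard_normal((p, n))
z = rng.standard_normal(n)
u = rng.standard_normal(n)
lam, rho, eps1 = 1e-3, 1.0, 0.2

# eta-update: minimizer of ||eta||^2 + lam*||A1 eta - xi x0||^2
#             + u^T eta + (rho/2)*||eta - z||^2.
lhs = 2 * lam * A1.T @ A1 + (2 + rho) * np.eye(n)
eta = np.linalg.solve(lhs, 2 * lam * A1.T @ (xi @ x0) + rho * z - u)

# The gradient of the quadratic objective must vanish at the solution.
grad = 2 * eta + 2 * lam * A1.T @ (A1 @ eta - xi @ x0) + u + rho * (eta - z)
assert np.allclose(grad, 0, atol=1e-8)

# w-update: coordinate-wise clipping of z - v/rho onto the box
# [max(-x0, -eps1), min(1 - x0, eps1)], so x0 + w stays in [0, 1]^n.
v = rng.standard_normal(n)
w = np.clip(z - v / rho, np.maximum(-x0, -eps1), np.minimum(1 - x0, eps1))
assert np.all(x0 + w >= -1e-12) and np.all(x0 + w <= 1 + 1e-12)
```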

14. EXPERIMENTAL DETAILS AND SUPPLEMENTAL RESULTS

In this section, we describe the settings and the values of the hyperparameters used in the experiments. Moreover, we show further results supplementing those in the main manuscript.

14.1 A TROPICAL VIEW OF THE LOTTERY TICKET HYPOTHESIS

We first conduct further experiments supplementing those of Section 4. In particular, we re-affirm the lottery ticket hypothesis on three more synthetic datasets, in an experimental setup similar to the one shown in Figure 3. The new supplemental experiments are shown in Figure 10. A similar conclusion holds: the lottery ticket initialization consistently preserves the decision boundaries polytope better than other initialization schemes over different percentages of pruning. A natural question is whether it is necessary to visualize the dual subdivision polytope of the decision boundaries, i.e. δ(R(x)), where R(x) = H₁(x) ⊙ Q₂(x) ⊕ H₂(x) ⊙ Q₁(x), as opposed to visualizing the dual subdivisions of the tropical polynomials δ(H_{1,2}(x)) and δ(Q_{1,2}(x)) directly, for the tropical re-affirmation of the lottery ticket hypothesis. This is similar to asking whether it is necessary to study the decision boundaries polytope δ(R(x)) as opposed to the dual subdivision polytopes of the functional form of the network, since for the 2-output neural network described in Theorem 2 we have f₁(x) = H₁(x) - Q₁(x) and f₂(x) = H₂(x) - Q₂(x). We demonstrate the differences between these two views with an experiment. For this purpose, we train a single-hidden-layer neural network on the same dataset shown in Figure 3. We perform several iterations of pruning, in a similar fashion to Section 5, and visualize at each iteration both the decision boundaries polytope and all the dual subdivisions of the aforementioned tropical polynomials representing the functional form of the network, i.e. δ(H_{1,2}(x)) and δ(Q_{1,2}(x)).
It can be observed from Figure 11 that, although the decision boundaries are barely affected by the lottery ticket pruning, the zonotopes representing the functional form of the network undergo large variations. That is to say, investigating the dual subdivisions describing the functional form of the network through the four zonotopes δ(H_{1,2}(x)) and δ(Q_{1,2}(x)) is not sufficiently indicative of the behaviour of the decision boundaries.

14.2. TROPICAL PRUNING

Toy Setup. To verify our theoretical work, we first prune small networks of the form Affine followed by ReLU followed by another Affine layer. We train the aforementioned network on two 2D datasets with a varying number of hidden nodes (100, 200, 300). In this setup, we observe from Figure 12 that when the assumptions of Theorem 2 hold, our proposed tropical pruning is indeed competitive with, and in many cases outperforms, the other pruning schemes that are unaware of the decision boundaries. Experimental Setup. In all experiments of the tropical pruning section, all algorithms are run for only a single iteration, where λ increases linearly from 0.02 with a step of 0.01. Increasing λ corresponds to increasing weight sparsity, and we continue until sparsity reaches 100%. Supplemental Experiments. We conduct further experiments on AlexNet and VGG16 on the SVHN, CIFAR10 and CIFAR100 datasets. We examine the performance when only the biases of the classifier are fine-tuned after pruning, as shown in Figure 13. Moreover, a similar experiment is reported for the same networks when the biases of the complete network are fine-tuned, as in Figure 14.
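The λ sweep in the experimental setup can be illustrated in isolation: the row-wise ℓ2,1 proximal operator zeroes any row whose norm falls below λ, so row sparsity grows monotonically as λ increases. The toy sketch below is ours; it applies the prox to a standalone random generator matrix `G` rather than running the full alternating optimization:

```python
import numpy as np

def l21_row_sparsity(G, tol=1e-8):
    """Fraction of zero rows: the sparsity the ||.||_{2,1} penalty promotes."""
    return float(np.mean(np.linalg.norm(G, axis=1) <= tol))

def prox_l21(G, lam):
    """Row-wise proximal operator of lam * ||G||_{2,1}: soft-threshold rows."""
    norms = np.linalg.norm(G, axis=1, keepdims=True)
    scale = np.maximum(0.0, 1.0 - lam / np.maximum(norms, 1e-12))
    return scale * G

rng = np.random.default_rng(8)
G = rng.standard_normal((50, 10))
lam, sparsities = 0.02, []
while l21_row_sparsity(prox_l21(G, lam)) < 1.0:   # sweep until 100% sparsity
    sparsities.append(l21_row_sparsity(prox_l21(G, lam)))
    lam += 0.01                                    # linear schedule, step 0.01

assert sparsities == sorted(sparsities)            # sparsity grows with lambda
```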



Footnotes. (1) Code regenerating all our experiments is attached in the supplementary material. (2) A semiring is a ring that lacks an additive inverse. (3) All proofs are left for the appendix. (4) Without loss of generality, as one can very well approximate real weights as fractions and multiply by the least common multiple of the denominators, as discussed in Zhang et al. (2018).



Figure 1: Tropical Hypersurfaces and their Corresponding Dual Subdivisions. We show three tropical polynomials, where the solid red and black lines are the tropical hypersurfaces T(f) and dual subdivisions δ(f) of the corresponding tropical polynomials, respectively. T(f) divides the domain of f into convex regions where f is linear. Moreover, each region is in one-to-one correspondence with a node of δ(f). Lastly, the tropical hypersurfaces are parallel to the normals of the edges of δ(f), shown by dashed red lines.

Figure 2: Decision Boundaries as Geometric Structures. The decision boundaries B (in red) comprise two linear pieces separating classes C1 and C2. As per Theorem 2, the dual subdivision of this single-hidden-layer neural network is the convex hull of the zonotopes Z_{G1} and Z_{G2}. The normals to the dual subdivision δ(R(x)) are in one-to-one correspondence with the tropical hypersurface T(R(x)), which is a superset of the decision boundaries B. Note that some of the normals to δ(R(x)) (in red) are parallel to the decision boundaries.

Figure 3: Effect of Different Initializations on the Decision Boundaries Polytope. From left to right:

decision boundaries as per the second part of Theorem 2. Now, based on the results of Maclagan & Sturmfels (2015) (Proposition 3.1.6), and as discussed in Figure 1, the normals to the edges of the polytope δ(R(x)) (the convex hull of two zonotopes) are in one-to-one correspondence with the tropical hypersurface T(R(x)). Therefore, one can study the decision boundaries, or at least their superset T(R(x)), by studying the orientation of the dual subdivision δ(R(x)).

The generators of Z G1 , Z G2 in Theorem 2 can be defined as G 1 = Diag[(B + (1, :)) + (B -(2, :))]A and G 2 = Diag[(B + (2, :)) + (B -(1, :))]A, both with shift (B -(1, :) + B + (2, :) + B + (1, :) + B -(2, :)) A -, where Diag(v) arranges v in a diagonal matrix.
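As a concrete illustration (our own sketch, with random weights standing in for a trained network), the generator matrices and the common shift of the two zonotopes can be computed directly from A and B:

```python
import numpy as np

rng = np.random.default_rng(5)
p, n = 6, 4
A = rng.standard_normal((p, n))
B = rng.standard_normal((2, p))
Bp, Bm = np.maximum(B, 0), np.maximum(-B, 0)   # B+ and B-

# Generator matrices of the two zonotopes from Theorem 2:
G1 = np.diag(Bp[0] + Bm[1]) @ A                # Diag[B+(1,:) + B-(2,:)] A
G2 = np.diag(Bp[1] + Bm[0]) @ A                # Diag[B+(2,:) + B-(1,:)] A
# Shared shift; translating both zonotopes jointly leaves the shape of their
# convex hull, and hence the boundary normals, unchanged.
shift = (Bm[0] + Bp[1] + Bp[0] + Bm[1]) @ np.maximum(-A, 0)

assert G1.shape == (p, n) and G2.shape == (p, n) and shift.shape == (n,)
```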

Figure 4: Tropical Pruning Pipeline. Pruning the 4th node, or equivalently removing the two yellow vertices of zonotope Z_{G2}, does not affect the decision boundaries polytope, which leads to no change in accuracy.

) is not quadratic in its variables since, as per Corollary 1, G̃₁ = Diag[ReLU(B̃(1, :)) + ReLU(-B̃(2, :))]Ã and G̃₂ = Diag[ReLU(B̃(2, :)) + ReLU(-B̃(1, :))]Ã. However, since Problem (

Figure 5: Results of Tropical Pruning. Pruning-accuracy plots for AlexNet (top) and VGG16 (bottom) trained on SVHN, CIFAR10, and CIFAR100, pruned with our tropical method and three other pruning methods.

Figure 6: The Newton Polygon and the Corresponding Dual Subdivision. The figure on the left shows the Newton polygon P(f) for the tropical polynomial defined in the second example in Figure 1. The dual subdivision δ(f) is constructed by projecting the upper faces of P(f) (shaded) onto R².


The function D₂(ξ_{A₁}) captures the perturbation of the dual subdivision polytope, in the sense that the dual subdivision of the network with first linear layer A₁ should remain similar to that of the network with first linear layer A₁ + ξ_{A₁}. This can be generally formulated as an approximation to the following distance function: d(ConvexHull(Z_{G̃1}, Z_{G̃2}), ConvexHull(Z_{G1}, Z_{G2})), where G̃₁ = Diag[ReLU(B(1, :)) + ReLU(-B(2, :))](A₁ + ξ_{A₁}), G̃₂ = Diag[ReLU(B(2, :)) + ReLU(-B(1, :))](A₁ + ξ_{A₁}), G₁ = Diag[ReLU(B(1, :)) + ReLU(-B(2, :))]A₁ and G₂ = Diag[ReLU(B(2, :)) + ReLU(-B(1, :))]A₁. In particular, to approximate the function d, one can use an argument similar to that used in network pruning, such that D₂ approximates the generators of the zonotopes directly as follows:
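The dual equivalence A₁η = ξ_{A₁}x₀ and the generator-based surrogate D₂ can both be checked numerically. The sketch below is our own illustration (random weights, a rank-one ξ we construct ourselves); it confirms that the two attack views yield identical pre-activations and that D₂ vanishes only for an unperturbed layer:

```python
import numpy as np

rng = np.random.default_rng(3)
p, n, k = 6, 4, 3
A1 = rng.standard_normal((p, n))
B = rng.standard_normal((k, p))
x0 = rng.uniform(0.1, 0.9, n)
eta = 0.05 * rng.standard_normal(n)            # input perturbation

# A rank-one first-layer perturbation satisfying A1 @ eta = xi @ x0, so both
# attack views produce identical pre-activations.
xi = np.outer(A1 @ eta, x0) / (x0 @ x0)
assert np.allclose(A1 @ eta, xi @ x0)
assert np.allclose((A1 + xi) @ x0, A1 @ (x0 + eta))

def d2(B, xi):
    """Generator-based surrogate D2 for the dual-subdivision perturbation."""
    Bp, Bm = np.maximum(B, 0), np.maximum(-B, 0)
    return 0.5 * sum(
        np.linalg.norm(np.diag(Bp[j]) @ xi) ** 2     # Frobenius norm squared
        + np.linalg.norm(np.diag(Bm[j]) @ xi) ** 2
        for j in range(B.shape[0])
    )

assert d2(B, np.zeros((p, n))) == 0.0          # unperturbed layer: zero cost
assert d2(B, xi) > 0.0
```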

Figure 7: Dual View of Tropical Adversarial Attacks. We show the effects of tropical adversarial attacks on a synthetic binary dataset at two different input points (in black). From left to right: the decision regions of the original and perturbed models, and decision boundaries polytopes (green for original and blue for perturbed).

Figure 8: Effect of Tropical Adversarial Attacks on MNIST Dataset. We show qualitative examples of adversarial attacks, produced by solving Problem (9), on two digits (8,9) from MNIST. From left to right, images are classified as [8,7,5,4] and [9,7,5,4] respectively.

Figure 9: Effect of Tropical Adversarial Attacks on MNIST Images. First row, from the left: clean image, then perturbed images classified as [7, 3, 2, 1, 0], respectively. Second row, from the left: clean image, then perturbed images classified as [9, 8, 7, 3, 2]. Third row, from the left: clean image, then perturbed images classified as [9, 8, 7, 5, 3]. Fourth row, from the left: clean image, then perturbed images classified as [9, 4, 3, 2, 1]. Fifth row, from the left: clean image, then perturbed images classified as [8, 4, 3, 2, 1].

Figure 10: Effect of Different Initializations on the Decision Boundaries Polytope. From left to right: training dataset, decision boundaries polytope of original network followed by the decision boundaries polytope during several iterations of pruning with different initializations.

Figure 11: Comparison between the decision boundaries polytope and the polytopes representing the functional form of the network. First column: the decision boundaries polytope δ(R(x)); the remaining columns are the zonotopes δ(H₁(x)), δ(Q₁(x)), δ(H₂(x)) and δ(Q₂(x)), respectively. Under varying pruning rates across the rows, observe that the changes affecting the dual subdivisions of the functional representations are far larger than those affecting the decision boundaries polytope.

Figure 12: Pruning Results on Toy Networks. We apply tropical pruning to a toy network of the form Affine followed by ReLU followed by another Affine. From left to right: (a) the dataset used for training, (b) pruning networks with 100 hidden nodes, (c) 200 hidden nodes, and (d) 300 hidden nodes.

Figure 13: Results of Tropical Pruning with Fine Tuning the Biases of the Classifier. Tropical pruning applied on AlexNet and VGG16 trained on SVHN, CIFAR10, CIFAR100 against different pruning methods with fine tuning the biases of the classifier only.

Figure 14: Results of Tropical Pruning with Fine Tuning the Biases of the Network. Tropical pruning applied on AlexNet and VGG16 trained on SVHN, CIFAR10, CIFAR100 against different pruning methods with fine tuning the biases of the network.


Ã(i, :) = arg min_{Ã(i,:)} of the corresponding row-wise subproblem above. Update B̃⁺(1, :):
B̃⁺(1, :) = arg min_{B̃⁺(1,:)} ½‖Diag(B̃⁺(1, :))Ã - G₁‖²_F + λ‖Diag(B̃⁺(1, :))Ã‖₂,₁, s.t. B̃⁺(1, :) ≥ 0.
Note that the problem is separable in the coordinates of B̃⁺(1, :), and projected gradient descent can be used to solve each subproblem. A similar symmetric argument can be used to update the variables B̃⁺(2, :), B̃⁻(1, :) and B̃⁻(2, :).
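Such coordinate-wise nonnegative updates also admit the closed form max(0, (Ã(j, :)ᵀG(j, :) − λ‖Ã(j, :)‖₂)/‖Ã(j, :)‖₂²) derived for the multi-class case. The snippet below is our own sanity check; `a_row` and `g_row` are hypothetical stand-ins for a row of Ã and the corresponding target generator row, and the closed form is compared against a dense grid search over the convex scalar subproblem:

```python
import numpy as np

def update_b_plus(a_row, g_row, lam):
    """Closed-form nonnegative update for one coordinate:
    argmin_{b >= 0} 0.5 * ||b * a_row - g_row||^2 + lam * b * ||a_row||."""
    norm_sq = a_row @ a_row
    return max(0.0, (a_row @ g_row - lam * np.sqrt(norm_sq)) / norm_sq)

rng = np.random.default_rng(2)
a, g = rng.standard_normal(8), rng.standard_normal(8)
lam = 0.1
b = update_b_plus(a, g, lam)

# The scalar subproblem is convex, so a fine grid over b >= 0 should not find
# a strictly better value than the closed form.
obj = lambda t: 0.5 * np.sum((t * a - g) ** 2) + lam * t * np.linalg.norm(a)
best = min(obj(t) for t in np.linspace(0, 5, 20001))
assert b >= 0
assert obj(b) <= best + 1e-6
```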

12. ADAPTING OPTIMIZATION 2 FOR MULTI-CLASS CLASSIFIER

Note that Theorem 2 describes a superset of the decision boundaries of a binary classifier through the dual subdivision of R(x), i.e. δ(R(x)). For a neural network f with k classes, a natural extension is to analyze the pairwise decision boundaries of all k classes. Thus, let T(R_{ij}(x)) be the superset of the decision boundaries separating classes i and j. A natural extension of the geometric loss in Equation 1 is then to preserve the polytopes among all pairs of classes. The set S contains all pairwise combinations of the k classes, i.e. S = {{i, j} : i ≠ j, i = 1, . . . , k, j = 1, . . . , k}. The zonotope Z_{G̃(i⁺,j⁻)} has the generator matrix G̃_{(i⁺,j⁻)} = Diag[ReLU(B̃(i, :)) + ReLU(-B̃(j, :))]Ã. However, such an approach is generally computationally expensive, particularly when k is large. To this end, we observe that Z_{G̃(i⁺,j⁻)} can equivalently be written as a Minkowski sum of two zonotopes with generators G̃_{i⁺} = Diag[ReLU(B̃(i, :))]Ã and G̃_{j⁻} = Diag[ReLU(-B̃(j, :))]Ã. That is to say, Z_{G̃(i⁺,j⁻)} = Z_{G̃i⁺} +̃ Z_{G̃j⁻}. This follows from the associativity of Minkowski sums, given as follows: Fact 5. Let {Sᵢ}_{i=1}^n be a set of n line segments. Then S = S₁ +̃ . . . +̃ Sₙ = P +̃ V, where P = +̃_{j∈C₁} Sⱼ and V = +̃_{j∈C₂} Sⱼ for any complementary partition (C₁, C₂) of the set {Sᵢ}_{i=1}^n.
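The Minkowski-sum identity Z_{G̃(i⁺,j⁻)} = Z_{G̃i⁺} +̃ Z_{G̃j⁻} can be checked numerically: a zonotope point generated by the concatenated generator matrix is exactly the sum of a point from each factor zonotope. The sketch below is our own illustration with random stand-in generator matrices:

```python
import numpy as np

rng = np.random.default_rng(7)
p, n = 5, 3
Gi = rng.standard_normal((p, n))   # stand-in generators of Z_{G_i+}
Gj = rng.standard_normal((p, n))   # stand-in generators of Z_{G_j-}

# A point of the zonotope with concatenated generators [Gi; Gj] is the sum of
# a point of Z_{G_i+} and a point of Z_{G_j-}: the Minkowski-sum identity.
w1, w2 = rng.uniform(0, 1, p), rng.uniform(0, 1, p)
point_concat = np.concatenate([w1, w2]) @ np.vstack([Gi, Gj])
point_minkowski = w1 @ Gi + w2 @ Gj
assert np.allclose(point_concat, point_minkowski)
```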

