A GENERAL COMPUTATIONAL FRAMEWORK TO MEASURE THE EXPRESSIVENESS OF COMPLEX NETWORKS USING A TIGHT UPPER BOUND OF LINEAR REGIONS

Anonymous

Abstract

The expressiveness of deep neural networks (DNNs) is one perspective from which to understand their surprising performance. The number of linear regions, i.e., the number of pieces of the piece-wise-linear function represented by a DNN, is generally used to measure expressiveness. Since the number itself is hard to obtain, an upper bound on the number of regions partitioned by a rectifier network is a more practical measurement of the expressiveness of a rectifier DNN. In this work, we propose a new and tighter upper bound on the number of regions. Inspired by the proof of this upper bound and the matrix-computation framework of Hinz & Van de Geer (2019), we propose a general computational approach to compute a tight upper bound on the number of regions for theoretically any network structure (e.g., DNNs with all kinds of skip connections and residual structures). Our experiments show that our upper bound is tighter than existing ones, and explain why skip connections and residual structures can improve network performance.

1. INTRODUCTION

Deep neural networks (DNNs) (LeCun et al., 2015) have obtained great success in many fields such as computer vision, speech recognition and natural language processing (Krizhevsky et al., 2012; Hinton et al., 2012; Devlin et al., 2018; Goodfellow et al., 2014). However, it is not completely understood why DNNs perform well with satisfying generalization on different tasks. Expressiveness is one perspective used to address this open question. More specifically, one can theoretically study the expressiveness of DNNs using approximation theory (Cybenko, 1989; Hornik et al., 1989; Hanin, 2019; Mhaskar & Poggio, 2016; Arora et al., 2016), or measure the expressiveness of a given DNN. While sigmoid or tanh functions were employed as activation functions in early work on DNNs, rectified linear units (ReLU) and other piece-wise linear functions are more popular nowadays. Yarotsky (2017) proved that any DNN with piece-wise linear activation functions can be transformed into a DNN with ReLU. Thus, the study of expressiveness usually focuses on ReLU DNNs. It is known that a ReLU DNN represents a piece-wise linear (PWL) function, which applies a different linear transform on each region; with more regions, the PWL function is more complex and has stronger expressive ability. Therefore, the number of linear regions is intuitively a meaningful measurement of expressiveness (Pascanu et al., 2013; Montufar et al., 2014; Raghu et al., 2017; Serra et al., 2018; Hinz & Van de Geer, 2019). A direct measurement of the number of linear regions is difficult, if not impossible, so an upper bound on the number of linear regions is used in practice as a figure of merit to characterize expressiveness. Inspired by the computational framework in Hinz & Van de Geer (2019), we improve the upper bound in Serra et al. (2018) for multilayer perceptrons (MLPs) and extend the framework to more complex networks.
More importantly, we propose a general approach to construct a more accurate upper bound for almost any type of network. The contributions of this paper are listed as follows.

• Through a geometric analysis, we derive a recursive formula for γ, which is a key parameter for constructing a tight upper bound. Employing a better initial value, we propose a tighter upper bound for deep fully-connected ReLU networks. In addition, the recursive formula provides the potential to further improve the upper bound given an improved initial value.

• Different from Hinz & Van de Geer (2019), we not only consider deep fully-connected ReLU networks, but also extend the computational framework to more widely used network architectures, such as skip connections, pooling layers and so on. With the extension, the upper bound of U-Net (Ronneberger et al., 2015) or other common networks can be computed. By comparing the upper bounds of different networks, we show the relation between the expressiveness of networks with or without special structures.

• Our experiments show that novel network structures enhance the upper bound in most cases. For cases in which the upper bound is almost not enhanced by novel network settings, we explain the observation by analysing the partition efficiency and the practical number of linear regions.

2.1. RELATED WORK

There is a body of literature on the number of linear regions of ReLU DNNs. Pascanu et al. (2013) compare the number of linear regions of shallow networks by providing a lower bound. Montufar et al. (2014) give a simple but improved upper bound compared with Pascanu et al. (2013). Montúfar (2017) proposes an even tighter upper bound than Montufar et al. (2014), and Raghu et al. (2017) prove a similar result of the same order as Montúfar (2017). Later, Serra et al. (2018) propose a tighter upper bound and a method to count the practical number of linear regions. Furthermore, Serra & Ramalingam (2018) and Hanin & Rolnick (2019a;b) explore the properties of the practical number of linear regions. Finally, Hinz & Van de Geer (2019) employ a matrix-computation form to erect a framework for computing the upper bound, which generalizes previous work (Montufar et al., 2014; Montúfar, 2017; Serra et al., 2018).

2.2. NOTATIONS, DEFINITIONS AND PROPERTIES

In this section, we will introduce some definitions and propositions. Since the main computational framework is inspired by Hinz & Van de Geer (2019), some notations and definitions are similar. Assume a ReLU MLP has the form

f(x) = W^{(L)} σ(W^{(L−1)} ⋯ σ(W^{(1)} x + b^{(1)}) ⋯ + b^{(L−1)}) + b^{(L)},   (1)

where x ∈ ℝ^{n_0}, W^{(i)} ∈ ℝ^{n_i×n_{i−1}} is the weight matrix of the i-th layer, b^{(i)} ∈ ℝ^{n_i} is the bias vector, and σ(x) = max(x, 0) denotes the ReLU function. f(x) can also be written recursively as

h_0(x) = x,  h_i(x) = σ(W^{(i)} h_{i−1}(x) + b^{(i)}), 1 ≤ i < L,  f(x) = h_L(x) = W^{(L)} h_{L−1}(x) + b^{(L)}.   (2)

Firstly, we define the linear region in the following way.

Definition 1. For a PWL function f(x): ℝ^{n_0} → ℝ^{n_L}, we say D is a linear region if D satisfies: (a) D is connected; (b) f is an affine function on D; (c) for any D' ⊋ D, f is not affine on D'.

For a PWL function f, the domain can be partitioned into different linear regions. Let P(f) = {D_i | D_i is a linear region of f, and D_i ∩ D_j = ∅ for all D_i ≠ D_j} represent all the linear regions of f. We then define the activation pattern of ReLU DNNs as follows.

Definition 2. For any x ∈ ℝ^{n_0}, define the activation pattern of x in the i-th layer, s_{h_i}(x) ∈ {0,1}^{n_i}, by

s_{h_i}(x)_j = 1 if W^{(i)}_{j,:} h_{i−1}(x) + b^{(i)}_j > 0, and 0 if W^{(i)}_{j,:} h_{i−1}(x) + b^{(i)}_j ≤ 0,

for i ∈ {1, 2, …, L−1}, j ∈ {1, 2, …, n_i}, where W^{(i)}_{j,:} is the j-th row of W^{(i)} and b^{(i)}_j is the j-th component of b^{(i)}.

(Hinz & Van de Geer, 2019) For any x, h_i(x) can be rewritten as h_i(x) = W^{(i)}(x) h_{i−1}(x) + b^{(i)}(x), where W^{(i)}(x) is W^{(i)} with some rows set to zero and b^{(i)}(x) is b^{(i)} with some components set to zero. More precisely, W^{(i)}(x)_{j,:} = W^{(i)}_{j,:} if s_{h_i}(x)_j = 1 and 0 if s_{h_i}(x)_j = 0, and b^{(i)}(x)_j = b^{(i)}_j if s_{h_i}(x)_j = 1 and 0 if s_{h_i}(x)_j = 0.
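As a concrete illustration of Definition 2 (a minimal numpy sketch with toy layer sizes of our own choosing, not taken from the paper), one can sample inputs and collect the distinct activation patterns they induce; the number of distinct patterns found is a lower bound on |S(h)|:

```python
import numpy as np

rng = np.random.default_rng(0)

# A toy two-hidden-layer ReLU MLP (illustrative sizes, not from the paper).
sizes = [2, 3, 3, 1]          # n_0, n_1, n_2, n_L
Ws = [rng.standard_normal((m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
bs = [rng.standard_normal(m) for m in sizes[1:]]

def activation_pattern(x):
    """Return S_h(x) = (s_{h_1}(x), ..., s_{h_{L-1}}(x)) per Definition 2."""
    s, h = [], x
    for W, b in zip(Ws[:-1], bs[:-1]):   # hidden layers only
        z = W @ h + b
        s.append(tuple((z > 0).astype(int)))
        h = np.maximum(z, 0)             # ReLU
    return tuple(s)

# The number of distinct sampled patterns lower-bounds |S(h)|, which in
# turn upper-bounds the number of linear regions |P(f)|.
pts = rng.uniform(-5, 5, size=(20000, sizes[0]))
patterns = {activation_pattern(x) for x in pts}
print(len(patterns))
```

Enumerating patterns exactly is infeasible for large widths, which is why the rest of the paper works with computable upper bounds instead.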
To conveniently represent activation patterns of multiple layers in an MLP, we denote h^{(i)} = {h_1, …, h_i}, where h_i is defined in Eq. 2, and S_{h^{(i)}}(x) = (s_{h_1}(x), …, s_{h_i}(x)), S(h^{(i)}) = {S_{h^{(i)}}(x) | x ∈ ℝ^{n_0}}. Given a fixed x, it is easy to prove that h_i(x) is an affine transform (i = 1, 2, …, L). Suppose that s ∈ {0,1}^{n_1} × ⋯ × {0,1}^{n_{L−1}}, h = {h_1, …, h_{L−1}} and h(x) represents h_{L−1}(x). If D = {x | S_h(x) = s} ≠ ∅, then f is an affine transform on D, and it is easy to prove that there exists a linear region D' such that D ⊆ D'. Therefore |P(f)| ≤ |S(h)|. In our computational framework, the histogram is a key concept, defined as follows.

Definition 3. (Hinz & Van de Geer, 2019) A histogram v is an element of

V = { x ∈ ℕ^ℕ : |x|_1 = Σ_{j=0}^{∞} x_j < ∞ }.

A histogram is used to represent a discrete distribution on ℕ. For example, the histogram of the nonnegative integers G = {1, 0, 1, 4, 3, 2, 3, 1} is (1, 3, 1, 2, 1). For convenience, let v_i = Σ_{x∈G} 1_{x=i} and denote the histogram of G by Hist(G). We can then define an order relation.

Definition 4. (Hinz & Van de Geer, 2019) For any two histograms v, w, define the order relation

v ⪯ w :⇔ ∀J ∈ ℕ, Σ_{j=J}^{∞} v_j ≤ Σ_{j=J}^{∞} w_j.

Obviously, two histograms are not always comparable under ⪯. But we can define a max operation such that v^{(i)} ⪯ max{v^{(i)} | i ∈ I}, where I is an index collection. More precisely:

Definition 5. (Hinz & Van de Geer, 2019) For a finite index collection I, let V_I = {v^{(i)} | i ∈ I} and define the max operation

max(V_I)_J = max_{i∈I} ( Σ_{j=J}^{∞} v^{(i)}_j ) − max_{i∈I} ( Σ_{j=J+1}^{∞} v^{(i)}_j ) for J ∈ ℕ,   (7)

where v^{(i)}_j is the j-th component of the histogram v^{(i)}.

When a region is divided by hyperplanes, the number of partitioned regions is affected by the space dimension, which is defined as follows.

Definition 6.
For a connected and convex set D ⊆ ℝ^n, suppose there exist a set of linearly independent vectors {v^{(i)} | v^{(i)} ∈ ℝ^n, i = 1, …, k}, k ≤ n, and a fixed vector c ∈ ℝ^n such that (a) every x ∈ D with x ≠ c can be written as x = c + Σ_{i=1}^{k} a_i v^{(i)} with the coefficients {a_i} not all 0; (b) for each i there exists a_i ≠ 0 such that a_i v^{(i)} + c ∈ D. Then the space dimension of D is k, denoted Sd(D) = k.

The following proposition shows the change of space dimension after an affine transform.

Proposition 1. Suppose D ⊆ ℝ^n is a connected and convex set with space dimension k (k ≤ n), and f is an affine transform with domain D, written as f(x) = Ax + b with A ∈ ℝ^{m×n} and b ∈ ℝ^m. Then f(D) is a connected and convex set and Sd(f(D)) ≤ min(k, rank(A)).

The proof of Proposition 1 is given in Appendix A.1.1.
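The histogram machinery of Definitions 3-5 is straightforward to implement. The following sketch (plain Python, with hist, leq and hmax as our own illustrative names) reproduces the example Hist({1, 0, 1, 4, 3, 2, 3, 1}) = (1, 3, 1, 2, 1) and shows how two incomparable histograms are both dominated by their max:

```python
def hist(G):
    """Hist(G): counts of each nonnegative integer in G (Definition 3)."""
    v = [0] * (max(G) + 1)
    for x in G:
        v[x] += 1
    return tuple(v)

def tail(v, J):
    """Tail sum: sum_{j >= J} v_j."""
    return sum(v[J:])

def leq(v, w):
    """The order relation v ⪯ w of Definition 4 (compare all tail sums)."""
    return all(tail(v, J) <= tail(w, J)
               for J in range(max(len(v), len(w)) + 1))

def hmax(vs):
    """The max operation of Definition 5: dominates every v in vs under ⪯."""
    n = max(len(v) for v in vs)
    vs = [tuple(v) + (0,) * (n - len(v)) for v in vs]
    return tuple(max(tail(v, J) for v in vs)
                 - max(tail(v, J + 1) for v in vs)
                 for J in range(n))

print(hist([1, 0, 1, 4, 3, 2, 3, 1]))  # (1, 3, 1, 2, 1)
```

Because the tail sums of hmax(vs) telescope to the pointwise maximum of the tail sums, every input histogram is ⪯ the result, even when the inputs themselves are incomparable.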

Now we can analyze the relationship between the change of space dimension and activation patterns.

Let us first consider the 1st layer of an MLP f. W^{(1)}_{j,:} and b^{(1)}_j construct a hyperplane (W^{(1)}_{j,:} x + b^{(1)}_j = 0) in ℝ^{n_0}, and this hyperplane divides ℝ^{n_0} into two parts: one part corresponds to activation of the j-th node in the 1st layer and the other to non-activation, which is represented by the j-th component of s_{h_1}(x). Therefore the possible activation patterns s_{h_1}(x) are in one-to-one correspondence with the regions divided by the n_1 hyperplanes {W^{(1)}_{j,:} x + b^{(1)}_j = 0, j = 1, …, n_1}, which are denoted by H_{h_1}. For any region D divided by H_{h_1}, we have h_1(x) = W^{(1)}(x^{(0)}) x + b^{(1)}(x^{(0)}), where x^{(0)} is any point in D and rank(W^{(1)}(x^{(0)})) ≤ |s|_1, with s the activation pattern corresponding to D. According to Proposition 1, h_1(D) satisfies Sd(h_1(D)) ≤ min{n_0, |s|_1}. Similarly, H_{h_2} divides h_1(D) into different parts, which corresponds to dividing D into more sub-regions. In general, every element of S(h^{(i)}) corresponds to one of the regions partitioned by H_{h_1}, H_{h_2}, …, H_{h_i}. The next definitions describe the relationship between the change of space dimension and activation patterns.

Definition 7. Define H_sd(S_h) as the space dimension histogram of the regions partitioned by a ReLU network defined by Eq. 1, where h = {h_1, …, h_{L−1}}, i.e.,

H_sd(S_h) = Hist({Sd(h(D(s))) | D(s) = {x | S_h(x) = s}, s ∈ S(h)}).

Here, for h ∈ RL(0, n'), we define H(S_h) = e_0 = (1, 0, 0, …). Let Γ be the set of all (γ_{n,n'})_{n'∈ℕ_+, n∈{0,…,n'}} that satisfy the bound conditions. When n > n', γ_{n,n'} is defined to be equal to γ_{n',n'}, since max{H_a(S_h) | h ∈ RL(n, n')} is equal to max{H_a(S_h) | h ∈ RL(n', n')} (Hinz & Van de Geer, 2019). By definition, γ_{n,n'} represents an upper bound of the activation histogram of the regions derived from an n-dimensional space partitioned by n' hyperplanes. According to Proposition 1, this upper bound is also related to the upper bound of the space dimension. Therefore, the tighter γ_{n,n'} is, the more accurate the computed upper bound of the number of linear regions will be.

The following function is used to describe the relationship between the upper bounds of the activation histogram and the space dimension.

Definition 11. (Hinz & Van de Geer, 2019) For i* ∈ ℕ, define a clipping function cl_{i*}(·): V → V as

cl_{i*}(v)_i = v_i for i < i*;  Σ_{j=i*}^{∞} v_j for i = i*;  0 for i > i*.   (12)

With the definitions and notations above, we can introduce the computational framework to compute the upper bound of the number of linear regions as follows.

Proposition 3. (Hinz & Van de Geer, 2019) For a γ ∈ Γ, define the matrix B^{(γ)}_{n'} ∈ ℕ^{(n'+1)×(n'+1)} as

(B^{(γ)}_{n'})_{i,j} = (cl_{j−1}(γ_{j−1,n'}))_{i−1},  i, j ∈ {1, …, n'+1}.

Then the number of linear regions of an MLP in Eq. 1 is upper bounded by

|B^{(γ)}_{n_{L−1}} M_{n_{L−2},n_{L−1}} ⋯ B^{(γ)}_{n_1} M_{n_0,n_1} e_{n_0+1}|_1,   (13)

where M_{n,n'} ∈ ℝ^{(n'+1)×(n+1)}, (M)_{i,j} = δ_{i,min(j,n'+1)}, and e_{n_0+1} ∈ ℕ^{n_0+1} is the (n_0+1)-th standard basis vector.

3. MAIN RESULTS

In this section, we introduce a better choice of γ and compare it to Serra et al. (2018). We also extend the computational framework to some widely used network structures.

3.1. A TIGHTER UPPER BOUND ON THE NUMBER OF REGIONS

Before giving our main results, we define a new function on histograms as follows.

Definition 12. Define a downward-move function dm(·): V → V by dm(v)_i = v_{i−1}, with dm(v)_0 = 0.

We then prove the following two theorems used to compute γ.

Theorem 1. If γ_{n,n'} satisfies the bound condition in Definition 10 when n < n', then

γ_{n,n'} = γ_{n−1,n'−1} + dm(γ_{n,n'−1})   (14)

also satisfies the bound condition.

Theorem 2. Given any h ∈ RL(1, n), its activation histogram H_a(S_h), denoted by v, satisfies Σ_{i=t}^{n} v_i ≤ 2(n−t)+1.

The proofs of Theorems 1 and 2 are in Appendices A.1.3 and A.1.4, respectively. From Theorem 2, we can derive a feasible choice of γ_{1,n}, whose form is

γ_{1,n} = (0, …, 0, n mod 2, 2, …, 2, 1),   (15)

with ⌈n/2⌉ − 1 leading zeros and ⌊n/2⌋ twos. For example, γ_{1,4} = (0, 0, 2, 2, 1). Since there exists h ∈ RL(1, n) such that H_a(S_h) = γ_{1,n} (the proof can be seen in Appendix A.1.5), Eq. 15 is the tightest upper bound, i.e., max{H_a(S_h) | h ∈ RL(1, n)} satisfies Eq. 15. Together with the fact that max{H_a(S_h) | h ∈ RL(2, 1)} = (1, 1), any γ_{n,n'} can be computed.
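The recursion of Theorem 1 together with the initial value of Eq. 15 can be turned into a short program. The following sketch (a minimal illustration of ours, using the convention γ_{n,n'} = γ_{n',n'} for n > n' stated above) computes any γ_{n,n'}:

```python
from functools import lru_cache
from math import ceil

def dm(v):
    """Downward-move function of Definition 12: shift entries up one index."""
    return (0,) + tuple(v)

def add(v, w):
    """Component-wise sum of two histograms, padding with zeros."""
    n = max(len(v), len(w))
    v = tuple(v) + (0,) * (n - len(v))
    w = tuple(w) + (0,) * (n - len(w))
    return tuple(a + b for a, b in zip(v, w))

@lru_cache(maxsize=None)
def gamma(n, npr):
    """Our gamma_{n,n'} via Theorem 1's recursion and the Eq. 15 initial value."""
    if n > npr:                       # convention: gamma_{n,n'} = gamma_{n',n'}
        return gamma(npr, npr)
    if n == 1:                        # Eq. 15
        return ((0,) * (ceil(npr / 2) - 1) + (npr % 2,)
                + (2,) * (npr // 2) + (1,))
    return add(gamma(n - 1, npr - 1), dm(gamma(n, npr - 1)))

print(gamma(1, 4))  # (0, 0, 2, 2, 1)
```

Note that γ_{2,2} comes out as the full binomial histogram (1, 2, 1), consistent with Lemma 1 in the appendix, which fixes |γ_{n,n'}|_1 regardless of how the mass is distributed.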

3.2. COMPARISON WITH OTHER BOUNDS

We use γ^{ours}_{n,n'} and γ^{serra}_{n,n'} to denote the γ proposed by us and by Serra et al. (2018), respectively. According to Hinz & Van de Geer (2019),

(γ^{serra}_{n,n'})_i = 0 for 0 ≤ i < n'−n, and (γ^{serra}_{n,n'})_i = C(n', i) for n'−n ≤ i ≤ n'.   (16)

It is easy to verify that γ^{serra}_{n,n'} also satisfies Eq. 14, but its initial value differs from that of γ^{ours}_{n,n'}: we have γ^{serra}_{1,n} = (0, …, 0, n, 1). Obviously, γ^{ours}_{1,n} ⪯ γ^{serra}_{1,n}. Since the two operations + and dm(·) preserve the relation ⪯, for any n, n' ∈ ℕ_+ we have γ^{ours}_{n,n'} ⪯ γ^{serra}_{n,n'} (see an example in Appendix A.2.1). By the following theorem, the upper bound computed by Eq. 13 using γ^{ours}_{n,n'} is tighter than the one using γ^{serra}_{n,n'}.

Theorem 3. Given γ^{(1)}, γ^{(2)} ∈ Γ, if γ^{(1)}_{n,n'} ⪯ γ^{(2)}_{n,n'} for any n, n' ∈ ℕ_+, then the upper bound computed by Eq. 13 using γ^{(1)} is less than or equal to the one using γ^{(2)}.

The proof of Theorem 3 is in Appendix A.1.6. Theorem 3 shows that a "smaller" γ_{n,n'} yields a tighter upper bound, and Theorem 1 implies that a more accurate initial value, e.g., a "smaller" γ_{2,n} or γ_{3,n}, could further improve the upper bound. Therefore Theorem 1 provides a potential approach to achieve an even tighter bound.
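As a sketch of how the Eq. 13 pipeline is evaluated in practice (using the closed-form γ^{serra} of Eq. 16 for concreteness; our γ can be passed in instead, and upper_bound, B, M are our own illustrative names):

```python
import numpy as np
from math import comb

def gamma_serra(n, npr):
    """Closed form of Eq. 16: zeros below index n'-n, binomials C(n', i) above."""
    return np.array([0 if i < npr - n else comb(npr, i)
                     for i in range(npr + 1)])

def cl(v, k):
    """Clipping function cl_k of Definition 11: mass above index k moves to k."""
    w = np.zeros_like(v)
    w[:k] = v[:k]
    w[k] = v[k:].sum()
    return w

def B(npr, gamma):
    """B^(gamma)_{n'} of Proposition 3, written with 0-indexed columns."""
    return np.column_stack([cl(gamma(j, npr), j) for j in range(npr + 1)])

def M(n, npr):
    """M_{n,n'}: clips the dimension index of a histogram to at most n'."""
    A = np.zeros((npr + 1, n + 1), dtype=int)
    for j in range(n + 1):
        A[min(j, npr), j] = 1
    return A

def upper_bound(widths, gamma=gamma_serra):
    """Eq. 13 for an MLP with input dim widths[0] and hidden widths widths[1:]."""
    v = np.zeros(widths[0] + 1, dtype=int)
    v[-1] = 1                               # one region of dimension n_0
    for n_prev, n_cur in zip(widths, widths[1:]):
        v = B(n_cur, gamma) @ (M(n_prev, n_cur) @ v)
    return int(v.sum())

print(upper_bound([2, 3]))  # 7 = C(3,0) + C(3,1) + C(3,2)
```

For a single hidden layer the pipeline reduces to the classical hyperplane-arrangement count Σ_{j=0}^{n_0} C(n_1, j), which is a quick sanity check of the matrix form.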

3.3. EXPANSION TO COMMON NETWORK STRUCTURES

Proposition 3 applies only to ReLU MLPs. In this section we extend it to widely used network structures by introducing the corresponding matrix computations.

Pooling and unpooling layers. Since a pooling or unpooling layer, e.g., average-pooling or linear interpolation, can be written as a linear transform, it can be denoted by y = Ax. Suppose v is the space dimension histogram of the input regions and w is the histogram of the output regions. Then we have the following proposition.

Proposition 4. Suppose rank(A) = k; then w ⪯ cl_k(v).

Since cl_k(v) has a matrix-computation form similar to M in Eq. 13, the effect of this type of layer on the space dimension histogram can be regarded as another matrix in Eq. 13. Similarly, we have the following proposition for max-pooling layers.

Proposition 5. Suppose the max-pooling layer is equivalent to a rank-k maxout layer with n input nodes and n_l output nodes. Let c = (k² − k)n_l; then w ⪯ cl_{n_l}(diag{|γ_{0,c}|_1, …, |γ_{n,c}|_1} v).

The proofs of Propositions 4 and 5 are in Appendices A.1.7 and A.1.8.
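A minimal numeric illustration of Proposition 4 (a toy 1-D average-pooling map and an example histogram of our own choosing):

```python
import numpy as np

# 1-D average pooling of size 2 over 4 inputs, written as a linear map y = Ax.
A = np.array([[0.5, 0.5, 0.0, 0.0],
              [0.0, 0.0, 0.5, 0.5]])
k = int(np.linalg.matrix_rank(A))      # rank bounds the output space dimension

def cl(v, k):
    """Clipping function cl_k of Definition 11."""
    w = np.zeros_like(v)
    w[:k] = v[:k]
    w[k] = v[k:].sum()
    return w

v = np.array([0, 0, 1, 5, 9])          # example space-dimension histogram
print(k, cl(v, k))                      # Proposition 4: w ⪯ cl_k(v)
```

The pooling layer creates no new regions; it only collapses the dimensions of existing ones, which is exactly what the clip cl_k expresses.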

Skip connection and residual structure

The skip connection is very popular in current network architecture design. When a network is equipped with a skip connection, the upper bound changes. The following proposition gives the corresponding method to compute the upper bound with a skip connection.

Proposition 6. Given a network, suppose that v is the space dimension histogram of the input regions of the i-th layer, w is the histogram of the output regions of the j-th layer, and w ⪯ (∏_{k=i}^{j} A_k) v. When the input of the i-th layer is concatenated to the output of the j-th layer, i.e., a skip connection, the computation of w changes and satisfies w ⪯ Σ_n v_n |Be_n|_1 e_n, where B = ∏_{k=i}^{j} A_k. This also has a matrix multiplication form.

Residual structures (He et al., 2016) are similar to skip connections and can be regarded as a skip connection plus a simple full-rank square matrix computation; therefore their effect on the space dimension histogram is the same as that of a skip connection, since concatenation is equivalent to addition in the upper bound computation. Thus, we have the following proposition.

Proposition 7. Suppose the residual structure adds the input of the i-th layer to the output of the j-th layer, with v, w, B as in Proposition 6. Then w ⪯ Σ_n v_n |Be_n|_1 e_n.

The proofs of Propositions 6 and 7 are in Appendices A.1.9 and A.1.10. In addition, the analysis for dense connections (Huang et al., 2017) is similar to that for skip connections and is further discussed in Section 5. With the above analysis, we can deal with more complex networks. For example, a U-Net (Ronneberger et al., 2015) is composed of convolutional layers, pooling layers, unpooling layers and skip connections. If we regard convolutional layers as fully-connected layers, then we can compute its upper bound (see details in Appendix A.2.3).
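The effect described in Propositions 6 and 7 can be sketched as a diagonal matrix acting on the space dimension histogram (a toy illustration with an assumed block matrix B, not taken from the paper):

```python
import numpy as np

def skip_effect(B):
    """Props. 6-7: when a skip/residual link spans a block whose histogram map
    is B, the bound becomes C v with C = diag{|B e_0|_1, |B e_1|_1, ...}:
    each input region keeps its space dimension, while the number of regions
    produced per n-dimensional input region is still bounded by |B e_n|_1."""
    return np.diag(B.sum(axis=0))

# Assumed histogram map B for the spanned block: column n is the bound
# histogram for one n-dimensional input region without the skip link.
B = np.array([[1, 0, 0],
              [0, 3, 1],
              [0, 0, 4]])
C = skip_effect(B)
v = np.array([0, 1, 1])      # one 1-D and one 2-D input region
print(C @ v)                  # bound with the skip connection
```

The column sums |B e_n|_1 count regions while discarding where B put their dimensions, so the skip link trades possible dimension loss for none at all, at the same region count.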
Though in this paper we have not extended the framework to all possible network structures and architectures, according to Proposition 1 and by using our computational framework, it is possible to compute the change of the space dimension histogram for any other network structure and architecture. Based on these bricks for different layers and structures, we may derive the upper bound of much more complex and practically used networks.

4. EXPERIMENTS

Our experiments are twofold. The first compares our upper bound with that of Serra et al. (2018): Figures 1(a) and 1(b) show how the ratio of the two bounds changes as the number of hidden layers increases from 1 to 10 with different n_0, and Figure 1(c) shows how the ratio changes when the depth of the network becomes larger. The second verifies the effectiveness of skip connections and residual structures in improving expressiveness, as illustrated in Table 1 and Table 2. We compute the upper bounds of auto-encoders (AEs) and U-Nets; each AE and its corresponding U-Net have the same network architecture except for the skip connections. As for residual structures, we build two identical networks for an image classification task with and without residual structures. For each pair of networks (with or without one special structure), the ratio of their bounds is computed to measure how much the upper bound is enhanced by the special structure. Different network architecture settings are tried in the experiments. Because of the large memory required by the complete γ_{n,n'}, we only consider networks of relatively small sizes (network input size 24 × 24 or 16 × 16). In addition, convolutional layers are regarded as fully-connected layers in the computation.

5. DISCUSSION

The proposed upper bound has been theoretically proved to be tighter than that of Serra et al. (2018). Furthermore, Figure 1(a) and Figure 1(b) show that when n_i increases from 0.6n_0 to 1.5n_0, the ratio curve of the two bounds moves upward; however, when n_i increases to 2n_0, the curve moves downward, and when n_i increases further and is much larger than n_0, the two bounds are the same. This trend is related to the extent of similarity between the B matrices of the two bounds, whose middle columns differ most while the left and right parts are similar (see details in Appendix A.2.3). As for Figure 1(c), it shows that the difference between the two upper bounds is enlarged when the depth of the network increases. In general, both theoretical analysis and experimental results show our bound is tighter than that of Serra et al. (2018).

The results in Table 1 and Table 2 show that special structures such as skip connections and residual structures enhance the upper bound, and when the residual structure is used in more layers, the enhancement increases (comparing No. 1, 3 and 4 in Table 2). However, it seems that when the number of channels is larger, the enhancement is weaker (comparing No. 1 to 2, 3 to 4, and No. 5 to No. 6 in Table 1); a similar result can be observed in Table 2 (comparing No. 1 to No. 5). These results explain why skip connections and residual structures are effective from the perspective of expressiveness.

However, there also exist some settings in which the two upper bounds are close. We provide an explanation for this observation through the concept of partition efficiency. First of all, we note that it is not the upper bound but the practical number of linear regions that is directly related to network expressiveness. Though the upper bound measures the expressiveness to some extent, there is a gap between the upper bound and the real practical maximum. Actually, any upper bound which can be computed by the framework of matrix computation, including Pascanu et al. (2013), Serra et al. (2018), etc., assumes that all the output regions of the current layer are further divided at the same time by all the hyperplanes that the next layer implies. However, this assumption is not always sound (see details in Appendix A.2.4). When the number of regions far exceeds the dimension of the hyperplanes, it is hard to imagine that one hyperplane can partition all the regions. But a special structure (e.g., a skip connection) may increase the number of regions partitioned by one hyperplane, such that the practical number can be increased even if the upper bound is not enhanced. Intuitively, skip connections increase the dimension of the input space, so the partition efficiency may be higher. A good example is dense connections (Huang et al., 2017), which are only a special case of skip connections but can further increase the dimension and thus lead to better expressiveness; this may explain the success of DenseNets. Though we have not completely proved the higher partition efficiency of skip connections, some evidence seems to confirm our intuition, as shown in Appendices A.1.11 and A.1.12.

6. CONCLUSION

In this paper, we study the expressiveness of ReLU DNNs using an upper bound on the number of regions into which the network partitions the input space. We provide a tighter upper bound than previous work and propose a computational framework to compute the upper bounds of more complex and practically more useful network structures. Our experiments verify that special network structures (e.g., skip connections and residual structures) can enhance the upper bound. Our work reveals that the number of linear regions is a good measurement of expressiveness, and it may guide us to design more efficient new network architectures, since our computational framework is able to practically compute their upper bounds.

A.1.1 THE PROOF OF PROPOSITION 1

Proof. By Definition 6, D has linearly independent vectors {v^{(1)}, …, v^{(k)}} and a fixed vector c such that every x ∈ D can be written as x = c + Σ_{i=1}^{k} a_i v^{(i)}, and for each i there exists a_i ≠ 0 with a_i v^{(i)} + c ∈ D. Let D' = D − c; then D' is a translation of D. Since D is convex, D' and f(D') are also convex. Because f(D) = f(D' + c) = AD' + Ac + b, we have Sd(f(D)) = Sd(AD'). Let D'' = A(D') = {Ax | x ∈ D'}. For any y ∈ D'', there exists x ∈ D' such that y = Ax. Denote Av^{(i)} by w^{(i)}; then y can be represented by {w^{(1)}, w^{(2)}, …, w^{(k)}}. From {w^{(1)}, w^{(2)}, …, w^{(k)}} choose a set of vectors which are linearly independent and can linearly represent {w^{(1)}, w^{(2)}, …, w^{(k)}}; for convenience, suppose they are {w^{(1)}, w^{(2)}, …, w^{(t)}}, t ≤ k. Therefore any y ∈ D'' can be represented by {w^{(1)}, w^{(2)}, …, w^{(t)}}. Thus Sd(D'') = Sd(f(D)) = t, and we have

t = rank([w^{(1)} … w^{(t)}]) = rank([w^{(1)} … w^{(k)}]) = rank(A[v^{(1)} … v^{(k)}]) ≤ min{rank(A), rank([v^{(1)} … v^{(k)}])} = min{rank(A), k},

where the last equality is derived from the fact that {v^{(1)}, v^{(2)}, …, v^{(k)}} are k linearly independent vectors.

A.1.2 THE PROOF OF PROPOSITION 2

Proof. For any s ∈ S(h), let D(s) = {x | S_h(x) = s}. As long as Sd(h(D(s))) ≤ min{n_0, |s_{h_1}|_1, |s_{h_2}|_1, …, |s_{h_{L−1}}|_1} is established, the proposition is proved. Apparently, Sd(D(s)) = n_0. Through the analysis in Section 2, h(·) is an affine transform on D(s): for any x^{(1)}, x^{(2)} ∈ D(s), W^{(i)}(x^{(1)}) = W^{(i)}(x^{(2)}) and b^{(i)}(x^{(1)}) = b^{(i)}(x^{(2)}), so we use x^{(0)} ∈ D(s) to represent them. Then h(·) on D(s) can be written as

h(x) = W^{(L−1)}(x^{(0)})(… W^{(1)}(x^{(0)}) x + b^{(1)}(x^{(0)}) …) + b^{(L−1)}(x^{(0)}) = W̄(x^{(0)}) x + b̄(x^{(0)}).

By Proposition 1, we have

Sd(h(D(s))) ≤ min{rank(W̄(x^{(0)})), n_0} = min{n_0, rank(∏_{i=1}^{L−1} W^{(i)}(x^{(0)}))} ≤ min{n_0, min_i rank(W^{(i)}(x^{(0)}))} ≤ min{n_0, |s_{h_1}|_1, |s_{h_2}|_1, …, |s_{h_{L−1}}|_1}.

The last two inequalities are derived from rank(∏_{i=1}^{L−1} W^{(i)}(x^{(0)})) ≤ min_i rank(W^{(i)}(x^{(0)})) and rank(W^{(i)}(x^{(0)})) ≤ |s_{h_i}|_1.

A.1.3 THE PROOF OF THEOREM 1

Proof. Obviously, γ_{n,n'} in Eq. 14 satisfies the second condition. As for the first condition, consider hyperplanes {L_1, L_2, …, L_{n'}} dividing ℝ^n. These hyperplanes correspond to one h ∈ RL(n, n'), whose activation histogram is denoted by v_1. Here, we assume that L_1, L_2, …, L_{n'} are not parallel to each other, since in the parallel case it is easy to imagine and verify that there exists h' ∈ RL(n, n') which satisfies H_a(S_h) ⪯ H_a(S_{h'}). Suppose that the hyperplanes {L_1, L_2, …, L_{n'−1}} divide ℝ^n into t regions {R_1, R_2, …, R_t}, whose activation histogram is v_2, and that L_{n'} crosses the regions {R_1, …, R_p}, whose activation histogram is v_3; among the uncrossed regions, denote the histogram of those on the active side of L_{n'} by v_4 and of those on the inactive side by v_5. Crossing a region splits it into a part on the inactive side (activation number unchanged) and a part on the active side (activation number increased by one), so

v_2 = v_3 + v_4 + v_5,   (17)

v_1 = v_3 + dm(v_3) + dm(v_4) + v_5.   (18)

Let us only focus on {R_1, R_2, …, R_p}. Suppose {L_1, L_2, …, L_m} (m ≤ n'−1) are borders of these regions. {L_{n'} ∩ L_1, L_{n'} ∩ L_2, …, L_{n'} ∩ L_m} are hyperplanes in L_{n'}, and their active directions are projections of {L_1, L_2, …, L_m} in L_{n'}.
So the activation histogram of {L_{n'} ∩ L_1, L_{n'} ∩ L_2, …, L_{n'} ∩ L_m} in L_{n'}, which is denoted by v_6, is equal to v_3. By the assumption that γ_{n,n'} satisfies the bound condition when n < n', we have

v_3 = v_6 ⪯ γ_{n−1,m} ⪯ γ_{n−1,n'−1}.   (19)

By Eq. 17, Eq. 18 and Eq. 19, v_1 satisfies

v_1 ⪯ γ_{n−1,n'−1} + dm(v_3 + v_4 + v_5) = γ_{n−1,n'−1} + dm(v_2) ⪯ γ_{n−1,n'−1} + dm(γ_{n,n'−1}).

Because of the arbitrariness of v_1, γ_{n,n'} = γ_{n−1,n'−1} + dm(γ_{n,n'−1}) satisfies the first condition.

A.1.4 THE PROOF OF THEOREM 2

Proof. In a 1-dimensional space, a hyperplane is a point on the number axis, and its activation direction is either left or right. It is apparent that n segmentation points divide a line into n+1 parts, and for any x ∈ ℝ the activation number is at most n. Therefore if t < ⌈n/2⌉, then Σ_{i=t}^{n} v_i ≤ n+1 ≤ 2(n−t)+1. If t ≥ ⌈n/2⌉, denote the activation patterns of the n+1 regions of ℝ by {s_1, s_2, …, s_{n+1}} (see Figure 2(a)); the number of regions whose activation number is no less than t is then at most 2(n−t)+1, so Σ_{i=t}^{n} v_i ≤ 2(n−t)+1.

A.1.5 THE PROOF OF EQ. 15

Proof. Suppose the n points of h on the number axis are p_1, p_2, …, p_n. Let the directions of p_1, …, p_{⌈n/2⌉} be right and those of the other points left. It is easy to verify that H_a(S_h) = γ_{1,n}, where γ_{1,n} is defined by Eq. 15. Then γ_{1,n} ⪯ max{H_a(S_h) | h ∈ RL(1, n)}. Since max{H_a(S_h) | h ∈ RL(1, n)} ⪯ γ_{1,n}, we conclude γ_{1,n} = max{H_a(S_h) | h ∈ RL(1, n)}.

A.1.6 THE PROOF OF THEOREM 3

Proof. First note that (B^{γ^{(1)}}_{n'})_{:,j} ⪯ (B^{γ^{(2)}}_{n'})_{:,j}, and that v ⪯ w implies |v|_1 ≤ |w|_1. We prove that if v ⪯ w, then

M_{n_i,n_{i+1}} v ⪯ M_{n_i,n_{i+1}} w   (21)

and

B^{γ^{(1)}}_{n_i} v ⪯ B^{γ^{(2)}}_{n_i} w.   (22)

The theorem can then be derived easily from Eq. 21 and Eq. 22. Because M_{n_i,n_{i+1}} v = cl_{n_{i+1}}(v), Eq. 21 is easily verified. For B^{γ^{(k)}}_{n_i}, k = 1, 2, we have (B^{γ^{(k)}}_{n_i})_{:,j_1} ⪯ (B^{γ^{(k)}}_{n_i})_{:,j_2} when j_1 ≤ j_2, because of the second condition in Definition 10 and the property of the clipping function. For convenience, (B^{γ^{(k)}}_{n_i})_{m,j} is denoted by B^{(k)}_{m,j}, k = 1, 2. By Definition 4, Eq. 22 means that for all M,

Σ_{m=M}^{∞} (B^{(1)} v)_m ≤ Σ_{m=M}^{∞} (B^{(2)} w)_m ⇔ Σ_{m=M}^{∞} Σ_{j=0}^{∞} B^{(1)}_{m,j} v_j ≤ Σ_{m=M}^{∞} Σ_{j=0}^{∞} B^{(2)}_{m,j} w_j ⇔ Σ_{j=0}^{∞} Σ_{m=M}^{∞} B^{(1)}_{m,j} v_j ≤ Σ_{j=0}^{∞} Σ_{m=M}^{∞} B^{(2)}_{m,j} w_j.

Because B^{(1)}_{:,j} ⪯ B^{(2)}_{:,j}, we have ∀M ≥ 0, Σ_{m=M}^{∞} B^{(1)}_{m,j} ≤ Σ_{m=M}^{∞} B^{(2)}_{m,j}. Let a_j = Σ_{m=M}^{∞} B^{(1)}_{m,j} and b_j = Σ_{m=M}^{∞} B^{(2)}_{m,j}; then a_j ≤ b_j. Because B^{(k)}_{:,j_1} ⪯ B^{(k)}_{:,j_2} when j_1 < j_2,

a_0 ≤ a_1 ≤ a_2 ≤ …   (23)

and

b_0 ≤ b_1 ≤ b_2 ≤ …   (24)
By these notations, we have

Σ_{j=0}^{∞} Σ_{m=M}^{∞} B^{(1)}_{m,j} v_j ≤ Σ_{j=0}^{∞} Σ_{m=M}^{∞} B^{(2)}_{m,j} w_j ⇔ Σ_{j=0}^{∞} a_j v_j ≤ Σ_{j=0}^{∞} b_j w_j ⇔ Σ_{j=0}^{n_i} a_j v_j ≤ Σ_{j=0}^{n_i} b_j w_j,

where the last equivalence is derived from v_j = w_j = 0 when j > n_i. Consider the left part of the last inequality. Employing v ⪯ w ⇔ Σ_{j=J}^{n_i} v_j ≤ Σ_{j=J}^{n_i} w_j and Eq. 24, the following inequality can be derived:

Σ_{j=0}^{n_i} a_j v_j ≤ Σ_{j=0}^{n_i} b_j v_j = b_0 v_0 + Σ_{j=1}^{n_i} b_j v_j ≤ b_0 (Σ_{j=0}^{n_i} w_j − Σ_{j=1}^{n_i} v_j) + Σ_{j=1}^{n_i} b_j v_j ≤ b_0 w_0 + b_1 (Σ_{j=1}^{n_i} w_j − Σ_{j=1}^{n_i} v_j) + Σ_{j=1}^{n_i} b_j v_j = b_0 w_0 + b_1 w_1 + b_1 (Σ_{j=2}^{n_i} w_j − Σ_{j=2}^{n_i} v_j) + Σ_{j=2}^{n_i} b_j v_j ≤ Σ_{j=0}^{1} b_j w_j + b_2 (Σ_{j=2}^{n_i} w_j − Σ_{j=2}^{n_i} v_j) + Σ_{j=2}^{n_i} b_j v_j = Σ_{j=0}^{2} b_j w_j + b_2 (Σ_{j=3}^{n_i} w_j − Σ_{j=3}^{n_i} v_j) + Σ_{j=3}^{n_i} b_j v_j ≤ … ≤ Σ_{j=0}^{n_i−1} b_j w_j + b_{n_i}(w_{n_i} − v_{n_i}) + b_{n_i} v_{n_i} = Σ_{j=0}^{n_i} b_j w_j.

Therefore the left part is less than or equal to the right part, i.e., Eq. 22 is established and the theorem is proved.

A.1.7 THE PROOF OF PROPOSITION 4

Proof. By Proposition 1, the space dimension of each output region is at most min{the input region's space dimension, rank(A)}, and the linear map creates no new regions; then w ⪯ cl_k(v). Suppose v ∈ ℝ^{n+1}, i.e., v_i = 0 when i > n; it is easy to verify that cl_k(v) = M_{n,k} v, where M_{n,k} ∈ ℝ^{(k+1)×(n+1)}, (M)_{i,j} = δ_{i,min(j,k+1)}.

A.1.8 THE PROOF OF PROPOSITION 5

Before the proof, we show the following lemma.

Lemma 1. Suppose γ_{n,n'} satisfies the recursion formula in Theorem 1 and the initial values satisfy |γ_{n_1,n_2}|_1 = Σ_{s=0}^{n_1} C(n_2, s); then every γ_{n,n'} satisfies |γ_{n,n'}|_1 = Σ_{s=0}^{n} C(n', s).

Proof. Since γ_{n,n'} = γ_{n−1,n'−1} + dm(γ_{n,n'−1}) and dm(·) does not change |·|_1, we have |γ_{n,n'}|_1 = |γ_{n−1,n'−1}|_1 + |γ_{n,n'−1}|_1. By the inductive assumption, suppose |γ_{n_1,n_2}|_1 = Σ_{s=0}^{n_1} C(n_2, s) is established when n_1 ≤ n and n_2 < n'. Then

|γ_{n,n'}|_1 = Σ_{s=0}^{n−1} C(n'−1, s) + Σ_{s=0}^{n} C(n'−1, s) = Σ_{s=1}^{n} (C(n'−1, s) + C(n'−1, s−1)) + C(n'−1, 0) = Σ_{s=1}^{n} C(n', s) + C(n', 0) = Σ_{s=0}^{n} C(n', s).

Therefore every γ_{n,n'} satisfies the formula; it is easy to verify that |γ_{1,n}|_1 = n + 1 = C(n, 0) + C(n, 1).

Proof (of Proposition 5). Suppose an input region R with Sd(R) = n' is divided by the maxout layer, which corresponds to an arrangement of at most c = (k² − k)n_l hyperplanes, into sub-regions {p_i}. For any sub-region p_i, Sd(p_i) = n'.
Since h is equivalent to an affine transform on p_i with matrix A of rank n_A, we have Sd(h(p_i)) ≤ min{n_A, n'} by Proposition 1. Another fact is that n_A ≤ n_l. Therefore Sd(h(p_i)) ≤ min{n_l, n'}. That is to say, R with space dimension n' is divided into at most |γ_{n',c}|_1 sub-regions, and the space dimension of each output sub-region is no larger than min{n_l, n'}. If the space dimension histogram of the input regions is v (∈ ℝ^{n+1}), then it is easy to verify that after the partition by h the histogram of sub-regions v' satisfies v' ⪯ diag{|γ_{0,c}|_1, …, |γ_{n,c}|_1} v. In addition, h changes their space dimension. Thus w ⪯ cl_{n_l}(v') ⪯ cl_{n_l}(diag{|γ_{0,c}|_1, …, |γ_{n,c}|_1} v). Let C = diag{|γ_{0,c}|_1, …, |γ_{n,c}|_1}; then cl_{n_l}(diag{|γ_{0,c}|_1, …, |γ_{n,c}|_1} v) = M_{n,n_l} C v.

A.1.9 THE PROOF OF PROPOSITION 6

Proof. When the skip connection is not added, w ⪯ Bv. Let v = e_n, i.e., the number of input regions is one. Suppose the region is R ⊆ ℝ^m and R is divided into p sub-regions. Apparently p ≤ |Be_n|_1 and Sd(R) = Sd(r) = n ≤ m, where r is one of the sub-regions. Suppose the part of the network from the i-th layer to the j-th layer is equivalent to an affine transform with matrix C, and let r' = C(r). When the skip connection is added, the output of r is [C; I](r), denoted by r''. Since rank([C; I]) = m, Sd(r'') ≤ min{m, Sd(r)} = n. This implies that the space dimension of r'' may be as large as n. Therefore the space dimension histogram w_R of the p sub-regions satisfies

w_R ⪯ |Be_n|_1 e_n.   (25)

For any input region with space dimension n, Eq. 25 is always established. Thus, when the space dimension histogram of the input regions is v, the histogram of the output regions satisfies

w ⪯ Σ_R w_R ⪯ Σ_n v_n |Be_n|_1 e_n.   (26)

Let C' = diag{|Be_0|_1, |Be_1|_1, …, |Be_n|_1, …}; then Σ_n v_n |Be_n|_1 e_n = C'v.

A.1.10 THE PROOF OF PROPOSITION 7

Proof.
Any residual structure can be regarded as the composition of one skip connection and a linear transform. For any input region $r \subseteq \mathbb{R}^m$ partitioned by the network, the residual structure part has the following form:
$$G_{l+1}(z, y) = \sigma\left(W^{(l+1)\prime}\begin{pmatrix} z \\ y \end{pmatrix} + b^{(l+1)\prime}\right),$$
where $z = f_l(x) \in \mathbb{R}^{n_l}$, $y = f_m(x) \in \mathbb{R}^{n_m}$, $1 < m < l$. Then, given a specific $F_{l+1}$, there exists a $G_{l+1}$ such that the total number of regions partitioned by $G_{l+1}$ is no less than that by $F_{l+1}$.

Proof. For a connected set $R \subseteq$ the input space, $f_l(R)$ and $f_m(R)$ are still connected sets. For each hyperplane $H_i$ represented by $\sum_{j=1}^{n_l} a_{i,j} z_j + b_i = 0$ in $F_{l+1}$, design a corresponding hyperplane $H'_i$, $\sum_{j=1}^{n_l} a_{i,j} z_j + \sum_{j=1}^{n_m} c_{i,j} y_j + b_i = 0$, in $G_{l+1}$, where $c_{i,j} = 0$. Suppose $H_i$ crosses $\{R_1, R_2, \ldots, R_{k_i}\}$ and take $k_i$ intersections $\{p_1, p_2, \ldots, p_{k_i}\}$, where $p_j$ is an interior point of $R_j$ ($1 \le j \le k_i$). Let $f_l(x_j) = p_j$; then $(f_l(x_j), f_m(x_j)) \in H'_i$ and it is also an interior point of $R'_j$, which is the same as $R_j$ when cutting the last $n_m$ dimensions. So $H'_i$ crosses at least $k_i$ regions, which indicates that the total number of regions partitioned by $G_{l+1}$ is no less than that by $F_{l+1}$.

A.1.12 PROPOSITION 9 AND ITS PROOF

Proposition 9. For a three-layer MLP which has two hidden layers, i.e.
$$f(x) = W^{(3)}\sigma\big(W^{(2)}\sigma(W^{(1)}x + b^{(1)}) + b^{(2)}\big) + b^{(3)},$$
suppose that $n_0 = 1$ and every segment partitioned by the first layer keeps its space dimension, i.e. its image under the first layer is still a segment. If the input is concatenated to the output of the first layer (just like a skip connection), then no matter what the parameters in the first layer are, the practical maximum number of linear regions is $(n_1+1)(n_2+1)$. Specifically, without this special structure (i.e. as a plain MLP) and assuming $n_1 = 3$, there exist parameters in the first layer such that the practical maximum number is no more than $(n_1+1)(n_2+1) - n_2$.

Proof.
Denote $h_1(x) = \sigma(W^{(1)}x + b^{(1)})$. The first layer divides the input space into $n_1+1$ regions $\{r_1, r_2, \ldots, r_{n_1+1}\}$, and each $h_1(r_i)$ is a segment in $\mathbb{R}^{n_1}$. Because of the skip connection, the dimension of the hyperplanes in the second layer is $n_1$. It is apparent that for any $n_1+1$ points there always exists a hyperplane containing them. Thus, for interior points $x_i \in r_i$ ($1 \le i \le n_1+1$), there exists a hyperplane containing all of them and therefore crossing $n_1+1$ regions. We have $n_2$ hyperplanes in the second layer, which shows that the number of linear regions is $(n_1+1)(n_2+1)$. And obviously the maximum number is not larger than this (see Theorem 7 in Serra et al. (2018)).

As for the special case, take $[0,1]$ as the input space without loss of generality. Let
$$W^{(1)} = \begin{pmatrix} 1/t_1 \\ -1/t_2 \\ 1/t_3 \end{pmatrix}, \qquad b^{(1)} = \begin{pmatrix} -1 \\ 1 \\ -1 \end{pmatrix}.$$
Then we get four output regions in $\mathbb{R}^3$, shown in Figure 3. We can see that $R_1$, $R_2$ and $R_3$ are on the same plane, i.e. the $x$-$y$ plane. For any interior points $p_i \in R_i$, $i = 1, 2, 3$, the only hyperplane crossing them is the $x$-$y$ plane, which is not able to divide $R_1$, $R_2$ and $R_3$. So the practical maximum number is no more than $(n_1+1)(n_2+1) - n_2$.

Under review as a conference paper at ICLR 2021

Here $n' = 6$. Then $\gamma^{\text{ours}}_{\cdot,6}$ and $\gamma^{\text{serra}}_{\cdot,6}$ are shown as follows:
$$\gamma^{\text{ours}}_{\cdot,6} = \begin{pmatrix}
0 & 0 & 0 & 0 & 0 & 0 & 1 \\
0 & 0 & 0 & 0 & 1 & 6 & 6 \\
0 & 0 & 1 & 4 & 14 & 15 & 15 \\
0 & 2 & 5 & 16 & 20 & 20 & 20 \\
0 & 2 & 9 & 15 & 15 & 15 & 15 \\
0 & 2 & 6 & 6 & 6 & 6 & 6 \\
1 & 1 & 1 & 1 & 1 & 1 & 1
\end{pmatrix}, \qquad
\gamma^{\text{serra}}_{\cdot,6} = \begin{pmatrix}
0 & 0 & 0 & 0 & 0 & 0 & 1 \\
0 & 0 & 0 & 0 & 0 & 6 & 6 \\
0 & 0 & 0 & 0 & 15 & 15 & 15 \\
0 & 0 & 0 & 20 & 20 & 20 & 20 \\
0 & 0 & 15 & 15 & 15 & 15 & 15 \\
0 & 6 & 6 & 6 & 6 & 6 & 6 \\
1 & 1 & 1 & 1 & 1 & 1 & 1
\end{pmatrix}.$$

We take the first setting in Table 1 as an example. The numbers in "4-8-16-32" correspond to the ones in red in Figure 5, and all the numbers represent the channel number of the current tensor. For all settings in Table 1, the channel numbers of the input and output are 1. The kernel size in every convolutional layer is $3 \times 3$ with stride 1. We keep the size unchanged after each convolutional layer by zero padding. We use average-pooling as the pooling layers and filling-zero as the unpooling layers. The down-sampling rate and up-sampling rate are both 2. Except for the last convolutional layer, ReLU is added after every convolutional layer. The only differences among the settings are the channel numbers and the depth of down-sampling.
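One way to see the relation between the two $\gamma_{\cdot,6}$ matrices above is to compare their column sums: both equal $\sum_{s\le j}\binom{6}{s}$, so neither matrix reduces the total sub-region count; they differ only in how that mass is distributed across rows, which is what makes the clipped bounds differ. A small numerical check, with the matrices transcribed from the text:

```python
import numpy as np
from math import comb

gamma_ours = np.array([
    [0, 0, 0, 0, 0, 0, 1],
    [0, 0, 0, 0, 1, 6, 6],
    [0, 0, 1, 4, 14, 15, 15],
    [0, 2, 5, 16, 20, 20, 20],
    [0, 2, 9, 15, 15, 15, 15],
    [0, 2, 6, 6, 6, 6, 6],
    [1, 1, 1, 1, 1, 1, 1],
])
gamma_serra = np.array([
    [0, 0, 0, 0, 0, 0, 1],
    [0, 0, 0, 0, 0, 6, 6],
    [0, 0, 0, 0, 15, 15, 15],
    [0, 0, 0, 20, 20, 20, 20],
    [0, 0, 15, 15, 15, 15, 15],
    [0, 6, 6, 6, 6, 6, 6],
    [1, 1, 1, 1, 1, 1, 1],
])

# column j of both matrices sums to sum_{s<=j} C(6, s)
binom_sums = [sum(comb(6, s) for s in range(j + 1)) for j in range(7)]
assert list(gamma_ours.sum(axis=0)) == binom_sums
assert list(gamma_serra.sum(axis=0)) == binom_sums
```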

A.3.2 NETWORK ARCHITECTURES IN TABLE 2

We also take the first setting in Table 2 as an example. The numbers in "4-p16-p16-r16-r16-r16" correspond to the ones in red in Figure 6. "p" means that there is a pooling layer before the convolutional layer, and "r" means that a residual structure is added in the convolutional layer. The last part of the network is three fully-connected layers, and ReLU is not added in the final layer. The other settings are the same as in Appendix A.3.1.
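The shapes implied by such a setting string can be traced with a few lines of code. `shapes` is our own helper, not the paper's; it assumes the conventions stated above (3×3 convolutions with stride 1 and zero padding that keep the spatial size, average pooling with down-sampling rate 2, residual structures that leave shapes unchanged) and an input of size 24×24 with 1 channel.

```python
def shapes(setting, size=24, in_ch=1):
    """Trace (channels, height, width) through a setting string like
    '4-p16-p16-r16-r16-r16': 'p' = pooling (rate 2) before the conv,
    'r' = residual structure around the conv. Convs are 3x3, stride 1,
    zero-padded, so they keep the spatial size."""
    ch, h = in_ch, size
    out = [(ch, h, h)]
    for tok in setting.split("-"):
        if tok.startswith("p"):
            h //= 2          # average pooling with down-sampling rate 2
            tok = tok[1:]
        elif tok.startswith("r"):
            tok = tok[1:]    # the residual skip does not change shapes
        ch = int(tok)
        out.append((ch, h, h))
    return out
```

For the first setting of Table 2, `shapes("4-p16-p16-r16-r16-r16")` starts at `(1, 24, 24)` and ends with four `(16, 6, 6)` tensors after the two poolings.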



Figure 1: The comparison of our upper bound and that of Serra et al. (2018). The y axis represents the ratio of the two, which can be used to measure the difference. The MLPs are in the form $n_0\text{-}n_i\text{-}n_i\text{-}\cdots\text{-}n_i\text{-}1$ with $k$ hidden layers. (a) The setting is $n_0 = 10$, $n_i = 6, 8, 10, 15, 20, 25$, $k = 1, 2, \ldots, 10$. (b) The setting is $n_0 = 10$, $n_i = 6, 8, 10, 15, 20, 25$, $k = 1, 2, \ldots, 10$. (c) The setting is $n_0 = n_i = 10$, $k = 10, 20, \ldots, 100$.

When $n_i < n_0$, $B_{n_i}$ affects the upper bound most, and the difference between $B^{(\text{ours})}_{n_i}$ and $B^{(\text{serra})}_{n_i}$ is more significant as $n_i$ decreases. When $n_i > n_0$, $(B_n)_{:,n_0+1}$, the $(n_0+1)$-th column of $B_n$, is the main source of


(a)). No matter what the activation directions of the $n$ points are, we have $|s_1| + |s_{n+1}| = n$, since $|s_1|$ is equal to the number of left activation directions and $|s_{n+1}|$ is equal to the number of right activation directions. Another obvious conclusion is that

Figure 2: (a) The number axis is partitioned into $n+1$ parts; $s_i$ represents the $i$-th region. (b) An example of a partition and the activation number of each region. The abscissa corresponds to the position and the vertical axis represents the activation number.

A.1.6 THE PROOF OF THEOREM 3

Proof. Different $B^n_\gamma$ can be derived from different $\gamma_{n,n'}$. Since the clipping function keeps the order relation $\preceq$, every column of $B^n_{\gamma^{(1)}}$ and $B^n_{\gamma^{(2)}}$

A.1.7 THE PROOF OF PROPOSITION 4

Proof. Consider any input region $D$, and let $D'$ be the corresponding output region, i.e. $D' = A(D)$. By Proposition 1, $\mathrm{Sd}(D') \le \min\{\mathrm{Sd}(D), k\}$. Because $\mathrm{Hist}\big(\{\min\{\mathrm{Sd}(D), k\} \mid D \text{ is any input region}\}\big) = \mathrm{cl}_k(v)$

$\gamma_{n,n'}$ proposed by us satisfies $|\gamma_{n,n'}|_1 = \sum_{s=0}^{n}\binom{n'}{s}$. Next we prove Proposition 5.

Proof. Denote the maxout layer by $h$. According to the proof of Theorem 10 in Serra et al. (2018), one rank-$k$ maxout layer with $n_l$ output nodes corresponds to dividing one region by $\frac{k(k-1)}{2} n_l$ hyperplanes. Suppose $R$ is one input region with $\mathrm{Sd}(R) = n'$, partitioned into $p$ sub-regions $\{r_1, \ldots, r_p\}$. Since one $d$-dimensional space is at most partitioned into

By Proposition 6, the space dimension histogram of the output regions $\binom{C}{I}(r)$, denoted by $w$, satisfies $w \preceq \sum_n v_n |Be_n|_1\, e_n$. Since $\mathrm{rank}([I\;\, I]) = m \ge \mathrm{Sd}(r)$, according to Proposition 1 the linear transform will not change the histogram.

A.1.11 PROPOSITION 8 AND ITS PROOF

Proposition 8. For an MLP, let $f_m$ represent the first $m$ layers ($1 \le m \le l$), i.e. $f_m(x)$ is the output of the $m$-th layer, and let $F_{l+1}(z) = \sigma(W^{(l+1)}z + b^{(l+1)})$ represent the $(l+1)$-th layer of the MLP. Consider another network layer,

Figure 3: The four output regions in R 3

A.2.2 AN EXAMPLE OF UPPER BOUND COMPUTATION FOR U-NET

We take the U-net in Appendix A.3.1 as an example. Firstly, we use matrices to represent all layers except skip connections. When computing the upper bound, convolutional layers are regarded as fully-connected layers, denoted by $C_i$. Suppose the pooling layers are average-pooling layers and the unpooling ones are filling-zero ones; they are denoted by $P_i$ and $U_i$. Here, the subscript $i$ means the order in the network. Then according to Proposition 3 and Proposition 4 we have
$$C_1 = B_{2304},\; C_2 = B_{1152},\; C_3 = B_{576},\; C_4 = B_{288},\; C_5 = B_{144},\; C_6 = B_{288},\; C_7 = B_{576},\; C_8 = B_{2304},$$
$$P_1 = M_{2304,576},\; P_2 = M_{1152,288},\; P_3 = M_{576,144},\; U_1 = M_{144,144} = I_{144},\; U_2 = M_{288,288} = I_{288},\; U_3 = M_{576,576} = I_{576},$$
where $B$, $M$ are defined by Proposition 3. By Proposition 6, the upper bound $N$ is computed as follows:
$$\begin{aligned}
S_3 &= U_1 C_5 M_{288,144} C_4 M_{144,288} P_3 \in \mathbb{R}^{144\times 576}, \\
S'_3 &= \mathrm{diag}\{|S_3 e_0|_1, |S_3 e_1|_1, \ldots, |S_3 e_{576}|_1\} \in \mathbb{R}^{576\times 576}, \\
S_2 &= U_2 C_6 M_{576,288} S'_3 C_3 M_{288,576} P_2 \in \mathbb{R}^{288\times 1152}, \\
S'_2 &= \mathrm{diag}\{|S_2 e_0|_1, |S_2 e_1|_1, \ldots, |S_2 e_{1152}|_1\} \in \mathbb{R}^{1152\times 1152}, \\
S_1 &= U_3 C_7 M_{1152,576} S'_2 C_2 M_{576,1152} P_1 \in \mathbb{R}^{576\times 2304}, \\
S'_1 &= \mathrm{diag}\{|S_1 e_0|_1, |S_1 e_1|_1, \ldots, |S_1 e_{2304}|_1\} \in \mathbb{R}^{2304\times 2304}, \\
N &= |C_8 S'_1 C_1 M_{576,2304}\, e_{576}|_1.
\end{aligned}$$

A.2.3 THE COMPARISON OF $B^{\text{ours}}_n$ AND $B^{\text{serra}}_n$

Here we consider $n = 6$. By Eq. 29 and the definition of the clipping function, we have

A.2.4 A SIMPLE EXAMPLE OF IMPERFECT PARTITION

In this part, we use a simple example (in Figure 4) to illustrate imperfect partition. Consider a two-layer MLP with $n_0 = 1$, $n_1 = 2$. The original input space is $\mathbb{R}$, i.e. the number axis, one line. The first layer partitions the line into three parts (see Figure 4(a)) and the corresponding output regions in $\mathbb{R}^2$ are shown in Figure 4(b). In Figure 4(c), it is easy to observe that no hyperplane can partition all three regions simultaneously.
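The matrix pipeline above can be mirrored in code on a toy scale. The sketch below only reproduces the composition pattern (an inner chain $S$, its column-norm diagonal $S'$, then a final 1-norm); the constructor `B` is a hypothetical stand-in for the layer matrices of Proposition 3 (here simply a diagonal matrix whose $j$-th entry is the one-layer bound $\sum_{s\le j}\binom{c}{s}$), so the resulting number is illustrative, not the paper's.

```python
import numpy as np
from math import comb

def M(n, k):
    # clipping matrix: (M_{n,k})_{i,j} = delta_{i, min(j, k)}
    A = np.zeros((k + 1, n + 1))
    for j in range(n + 1):
        A[min(j, k), j] = 1.0
    return A

def B(c, n):
    # hypothetical stand-in for the layer matrix of Proposition 3:
    # diagonal entry j is the one-layer region bound sum_{s<=j} C(c, s)
    return np.diag([sum(comb(c, s) for s in range(j + 1)) for j in range(n + 1)])

def diag_colnorms(S):
    # S' = diag{|S e_0|_1, |S e_1|_1, ...}
    return np.diag(np.abs(S).sum(axis=0))

# toy "U-net": input dimension 4, one pooling down to dimension 2, then back up
C1, C2, C3 = B(4, 4), B(2, 2), B(4, 4)   # conv blocks at dims 4, 2, 4
P1 = M(4, 2)                              # average pooling: dim 4 -> 2
U1 = M(2, 2)                              # filling-zero unpooling: identity
S1 = U1 @ C2 @ P1                         # inner encoder-decoder chain
e4 = np.eye(5)[:, 4]                      # one input region of dimension 4
N = np.abs(C3 @ diag_colnorms(S1) @ C1 @ e4).sum()
```

Note that `M(n, n)` is the identity, which is why the unpooling matrices $U_i = M_{144,144}, \ldots$ reduce to $I$ in the computation above.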

Figure 4: (a) The input region R is divided into three parts in three colors and activation directions of blue points are drawn above the line; (b) The output regions of the first layer; (c) The blue lines represent hyperplanes in the second layer

Figure 6: Network architecture with residual structures in No.1 of Table 2

Definition 8. Define $H_d(S_h)$ as the dimension histogram of the regions partitioned by a ReLU network defined by Eq. 1, where $h = \{h_1, \ldots, h_{L-1}\}$, i.e.
$$H_d(S_h) = \mathrm{Hist}\left(\left\{\min\left\{n_0, |s_{h_1}|_1, \ldots, |s_{h_{L-1}}|_1\right\} \,\middle|\, s \in S(h)\right\}\right).$$
Given a ReLU network defined by Eq. 1, let $h = \{h_1, \ldots, h_{L-1}\}$; then $H_{sd}(S_h) \preceq H_d(S_h)$.
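Definition 8 can be sketched literally in code. The helper below is our own; `patterns` is an assumed encoding of the activation-pattern set $S(h)$, one tuple of per-layer 0/1 activation vectors per region, and the histogram is indexed by dimension $0, \ldots, n_0$.

```python
import numpy as np
from collections import Counter

def dimension_histogram(patterns, n0):
    """H_d: histogram of min{n0, |s_{h_1}|_1, ..., |s_{h_{L-1}}|_1}
    over all activation patterns s, one tuple of 0/1 vectors per region."""
    dims = [min([n0] + [int(np.sum(layer)) for layer in s]) for s in patterns]
    counts = Counter(dims)
    hist = np.zeros(n0 + 1)
    for d, c in counts.items():
        hist[d] = c
    return hist

# two regions of a toy net with n0 = 2; each pattern lists the per-layer
# activation vectors of one region
patterns = [([1, 1, 0], [1, 0]), ([0, 0, 1], [1, 1])]
h = dimension_histogram(patterns, 2)  # both regions have min dimension 1
```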

The upper bounds of AEs and U-nets. The channel setting describes the architecture of the networks. For instance, 4-8-16-32 represents an AE whose encoder has three down-sampling layers and whose channels in the different layers are 4, 8, 16, 32, respectively (see details in Appendix A.3.1). The input size in the first four experiments is $24 \times 24$, while in the last two it is $16 \times 16$. The upper bounds are listed in the third and fourth columns; the former corresponds to U-nets and the latter to AEs. Besides, the ratios of the two are listed in the last column.

The upper bound of a simple network for the classification task with or without residual structures. The channel setting shows where the main differences between the architectures are (see details in Appendix A.3.2). "p16" means that there is a pooling layer before the convolutional layer with 16 channels, and "r16" means that a residual structure is added in the convolutional layer with 16 channels. The last three columns of this table are similar to Table 1. In this part, the input size in all the experiments is $24 \times 24$.

Maithra Raghu, Ben Poole, Jon Kleinberg, Surya Ganguli, and Jascha Sohl-Dickstein. On the expressive power of deep neural networks. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pp. 2847-2854. JMLR.org, 2017.

A.3 NETWORK ARCHITECTURES

A.3.1 NETWORK ARCHITECTURES IN TABLE 1

