RETHINKING THE PRUNING CRITERIA FOR CONVOLUTIONAL NEURAL NETWORK

Anonymous authors

Abstract

Channel pruning is a popular technique for compressing convolutional neural networks (CNNs), and various pruning criteria have been proposed to remove the redundant filters of CNNs. From our comprehensive experiments, we found two blind spots of pruning criteria: (1) Similarity: There are strong similarities among several primary pruning criteria that are widely cited and compared. According to these criteria, the ranks of filters' importance in a convolutional layer are almost the same, resulting in similar pruned structures. (2) Applicability: For a large network (where each convolutional layer has a large number of filters), some criteria cannot distinguish the network redundancy well from the measured filters' importance. In this paper, we theoretically validate these two findings under our assumption that the well-trained convolutional filters in each layer approximately follow a Gaussian-like distribution. This assumption is verified through systematic and extensive statistical tests.

1. INTRODUCTION

Pruning (LeCun et al., 1990; Hassibi & Stork, 1993; Han et al., 2015; He et al., 2019) a trained neural network is a common approach to network compression. In particular, for networks with convolutional filters, channel pruning refers to pruning the filters in the convolutional layers. There are several critical factors for channel pruning.

Procedures. One-shot methods (Li et al., 2016): train a network from scratch; use a certain criterion to calculate the filters' importance and prune the filters with small importance; after additional training, the pruned network can recover its accuracy to some extent. Iterative methods (He et al., 2018; Frankle & Carbin, 2019): unlike one-shot methods, they prune and fine-tune a network alternately.

Criteria. The filters' importance is defined by a given criterion. Many types of pruning criteria have been proposed from different ideas, such as Norm-based (Li et al., 2016), Activation-based (Hu et al., 2016; Luo & Wu, 2017), Importance-based (Molchanov et al., 2016; 2019a), BN-based (Liu et al., 2017b), and so on.

Strategy. Layer-wise pruning: in each layer, we sort the filters by the importance measured by a given criterion and prune those with small importance. Global pruning: different from layer-wise pruning, global pruning sorts the filters from all layers by their importance and prunes them.

Table 1: The pruned filters' indices ordered by the filters' importance from given pruning criteria, taking VGG16 (3rd Conv) and ResNet18 (12th Conv) as examples. The pruned filters' indices (the ranks of filters' importance) are almost the same under different pruning criteria, which leads to similar pruned structures.

For Norm-based criteria, He et al. (2019) point out two implicit requirements: (1) the norm deviation of the filters should not be too small; (2) the minimum norm of the filters should be small enough.
Since these two conditions do not always hold, a new criterion considering the relative importance of the filters (the ℓ1 and ℓ2 norms can be seen as criteria that use the absolute importance of filters) was proposed (He et al., 2019). Since this criterion uses the Fermat point (i.e., the geometric median (Cohen et al., 2016)), we call this method Fermat. However, due to the high cost of computing the Fermat point, He et al. (2019) relaxed it and obtained another criterion, the GM method. Let F_ij ∈ R^{N_i×k×k} denote the j-th filter of the i-th convolutional layer, where N_i is the number of input channels of the i-th layer and k denotes the kernel size of the convolutional filter. The i-th layer has N_{i+1} filters. The details of these criteria are shown in Table 2, where F̄ denotes the Fermat point of the F_ij in Euclidean space. These four pruning criteria are called Norm-based pruning in this paper, as they all include a norm in their design.

Table 2: Norm-based pruning criteria.
  ℓ1:                        ||F_ij||_1
  ℓ2:                        ||F_ij||_2
  Fermat (He et al., 2019):  ||F̄ − F_ij||_2
  GM (He et al., 2019):      Σ_{k=1}^{N_{i+1}} ||F_ik − F_ij||_2

Previous works (Luo et al., 2017; Han et al., 2015; Ding et al., 2019; Dong et al., 2017; Renda et al., 2020), including the criteria mentioned above, usually focus on (a) how much the model is compressed; (b) how much performance is restored; (c) the inference efficiency of the pruned network; and (d) the cost of finding the pruned network. However, there is little work discussing two blind spots of the pruning criteria: (1) Similarity: What are the actual differences among these previous pruning criteria? Using VGG16 and ResNet18 on ImageNet, we show the ranks of filters' importance under different criteria in Table 1. They have almost the same ordering, leading to similar pruned structures. In this situation, the criteria using the absolute importance of filters (ℓ1, ℓ2) and the criteria using the relative importance of filters (Fermat, GM) may not be significantly different.
(2) Applicability: How applicable are these pruning criteria for pruning CNNs? Consider a toy example with the ℓ2 criterion. If the ℓ2 norms (regarded as importance) of the filters in one layer are 0.9, 0.8, 0.4 and 0.01, then under the smaller-norm-less-informative assumption (Ye et al., 2018), we should clearly prune the last filter. However, if the norms are close, e.g., 0.91, 0.92, 0.93, 0.92, it is hard to decide which filter to prune even though the first is the smallest. As shown in Fig. 1, taking Wide ResNet28-10 (WRN) as an example, this is a real problem when pruning a network with the ℓ2 criterion. Some related methods and opinions on pruning criteria are mentioned in He et al. (2019) and Molchanov et al. (2019a); we analyze them further in this paper. Since the criteria in Table 2 are widely cited and compared (Liu et al., 2020b; Li et al., 2020b; He et al., 2020; Liu et al., 2020a; Li et al., 2020a), it is important to study them with respect to the above two issues. To study them rigorously: First, in Section 2 we propose an assumption about the distribution of the parameters of the convolutional filters, called the Convolution Weight Distribution Assumption (CWDA), supported by systematic and comprehensive statistical tests (Appendix P). Next, in Sections 3 and 4, we theoretically verify the Similarity and Applicability problems for several Norm-based criteria in layer-wise pruning. Last but not least, in Section 5, we discuss further issues: (1) the conditions under which CWDA holds, (2) the Similarity and Applicability of other types of pruning criteria, and (3) another pruning strategy, global pruning.
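The layer-wise and global strategies described above differ only in where the filter ranking is computed. The following sketch illustrates both, assuming each layer's filters are flattened into a NumPy array of shape (num_filters, filter_dim) and using the ℓ1 norm as the criterion; the function names are ours, for illustration only.

```python
import numpy as np

def l1_importance(filters):
    # filters: (num_filters, filter_dim); l1 norm of each filter
    return np.abs(filters).sum(axis=1)

def layerwise_prune(layers, ratio):
    """Keep the top (1 - ratio) fraction of filters in every layer independently."""
    kept = []
    for W in layers:
        imp = l1_importance(W)
        n_keep = max(1, int(round(len(imp) * (1 - ratio))))
        # indices of the most important filters, sorted for readability
        kept.append(sorted(np.argsort(imp)[::-1][:n_keep].tolist()))
    return kept

def global_prune(layers, ratio):
    """Rank the filters of all layers together, then keep the top fraction."""
    scores = [(imp, li, fi)
              for li, W in enumerate(layers)
              for fi, imp in enumerate(l1_importance(W))]
    scores.sort(reverse=True)
    n_keep = int(round(len(scores) * (1 - ratio)))
    kept = [[] for _ in layers]
    for _, li, fi in scores[:n_keep]:
        kept[li].append(fi)
    return [sorted(k) for k in kept]
```

Layer-wise pruning always keeps a fixed fraction per layer, while global pruning can remove a whole layer whose importance magnitudes are systematically smaller, which is exactly the "pruned off" problem discussed in Section 5.3.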

Contribution. (1) We propose and verify an assumption called CWDA, which reveals that well-trained convolutional filters approximately follow a Gaussian-like distribution. (2) We analyze the Similarity and Applicability problems of different types of pruning criteria. Under CWDA, we rigorously validate these issues for Norm-based criteria in layer-wise pruning. (3) Under CWDA, some Norm-based criteria using the global pruning strategy suffer from inconsistent magnitudes of importance between different layers, which explains the phenomenon that they sometimes cut off entire layers of the network. The estimations using CWDA almost coincide with the actual results obtained from real networks, which further demonstrates the effectiveness of CWDA.

2. WEIGHT DISTRIBUTION ASSUMPTION

In this section, to study the Similarity and Applicability problems of the pruning criteria shown in Table 2, we propose and verify an assumption about the distribution of the parameters in convolutional filters.

Assumption (Convolution Weight Distribution Assumption, CWDA). Let F_ij ∈ R^{N_i×k×k} be the j-th well-trained filter of the i-th convolutional layer. In the i-th layer, the F_ij, j = 1, 2, ..., N_{i+1}, are i.i.d. and follow a Gaussian-like distribution:

    F_ij ∼ N(0, c² · Σ_block),    (1)

where c is a constant and Σ_block = diag(K_1, K_2, ..., K_{N_i}) is a block diagonal matrix whose off-diagonal elements are close to 0, with K_l ∈ R^{k²×k²}, l = 1, 2, ..., N_i.

This assumption is based on the observation shown in Fig. 2. Let F ∈ R^{(N_i×k×k)×N_{i+1}} denote all the parameters in the i-th layer; we use the correlation matrix FF^T to estimate the shape of Σ_block. Taking the last convolutional layers of ResNet18 and VGG16 trained on ImageNet as examples, we find that FF^T is a block diagonal matrix. Specifically, each block is a k²×k² matrix, and the off-diagonal elements are close to 0. For the j-th filter F_ij ∈ R^{N_i×k×k} in the i-th layer, as shown in Fig. 2(e), this phenomenon reveals that the parameters in the same channel of F_ij tend to be linearly correlated, while the parameters of any two different channels (yellow and green channels in Fig. 2(e)) in F_ij have only a low linear correlation. Since the kernel size k is a small constant (like 1 or 3) and N_i ≫ 1, the side length k² of each block is much smaller than the side length k²·N_i of the covariance matrix in most convolutional layers. Therefore, the block diagonal matrix Σ_block can be regarded as a diagonal matrix, as shown in Fig. 2(b) and (d). For convenience of analysis, we relax the assumption, i.e., F_ij approximately follows the Gaussian-like distribution

    F_ij ∼ N(0, c² · I_{N_i×k×k}),    (2)

where I_{N_i×k×k} is an identity matrix. The statistical tests in Section 2.1 and the estimation by CWDA in Fig. 10 show the reasonability of this relaxation. In the remaining sections, we use Eq. (2) to represent CWDA unless otherwise specified. In Fig. 3, taking VGG16 and ResNet18 on the ImageNet dataset as examples, we visualize CWDA through the distribution of the convolutional filters.
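The block-diagonal observation behind CWDA can be checked numerically from a weight tensor. Below is a minimal sketch of the FF^T computation; it uses synthetic i.i.d. Gaussian weights of shape (N_{i+1}, N_i, k, k) as a stand-in for a trained layer (with a real network, one would load this tensor from a checkpoint), and the helper names are ours.

```python
import numpy as np

def correlation_matrix(weights):
    """weights: (n_filters, n_in, k, k). Returns the (n_in*k^2) x (n_in*k^2)
    correlation matrix F F^T used to estimate the shape of Sigma_block."""
    n_filters = weights.shape[0]
    F = weights.reshape(n_filters, -1).T            # rows: one weight position across filters
    F = F - F.mean(axis=1, keepdims=True)
    F = F / (np.linalg.norm(F, axis=1, keepdims=True) + 1e-12)
    return F @ F.T

def off_block_mass(C, block):
    """Mean absolute correlation outside the block x block diagonal blocks."""
    mask = np.ones_like(C, dtype=bool)
    for s in range(0, C.shape[0], block):
        mask[s:s + block, s:s + block] = False
    return float(np.abs(C[mask]).mean())

# synthetic stand-in: 512 filters of shape (8, 3, 3)
rng = np.random.default_rng(0)
C = correlation_matrix(rng.normal(size=(512, 8, 3, 3)))
print(C.shape, off_block_mass(C, block=9))   # off-block correlations stay near 0
```

For a trained layer one would inspect whether the k²×k² diagonal blocks carry noticeably more mass than the rest, as in Fig. 2.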

2.1. STATISTICAL TEST

In fact, CWDA is not easy to verify. For example, for ResNet164 (on CIFAR100), the number of filters in the first stage is only 16, which is too small to estimate the statistics accurately. More such objective reasons are given in Section 5.1. Because of these problems, we instead verify three necessary conditions of CWDA: (1) Gaussian (i.e., whether F_ij approximately follows a Gaussian-like distribution); (2) Standard deviation (i.e., whether the standard deviation of each filter in any layer is close to a constant c); (3) Mean (i.e., whether the mean of F_ij is close to 0). In Table 3, to illustrate that CWDA holds universally, we consider a variety of factors, such as network structure, optimizer, regularization, initialization, dataset, training strategy, and other computer vision tasks (e.g., semantic segmentation, detection, image matting, and so on). The details of these statistical tests are shown in Appendix P.

Table 3: The experiments for the comprehensive statistical tests on CWDA, covering network structure (P.1), optimizer (P.2), regularization (P.3), and further factors.

3. SIMILARITY

Figure 4: Test accuracy of ResNet56 on CIFAR10/100 under different pruning ratios. "L1 pruned" and "L1 tuned" denote the test accuracy of ResNet56 after ℓ1 pruning and after fine-tuning, respectively. A pruning ratio of 0.5 means we prune 50% of the filters in all layers.

Empirical Analysis. (1) In Fig. 4, we show the test accuracy of ResNet56 after pruning and after fine-tuning under different pruning ratios and datasets. The test accuracy curves of different pruning criteria at different stages are very close under different pruning ratios. This phenomenon implies that the networks pruned using different criteria are very similar, and that there are strong similarities among these pruning criteria. Experiments with other commonly used pruning ratios can be found in Appendix N. (2) In Fig.
5, we show Spearman's rank correlation coefficient (Sp) between different pruning criteria. The Sp in most convolutional layers is above 0.9, which means the network structures are almost the same after pruning. Note that the Sp in transition layers (i.e., layers where the dimensions of the filters change, such as the layer between stage 1 and stage 2 of ResNet164, whose dimensions are 16 and 32, respectively) is relatively small. This is interesting but does not greatly affect the structural similarity of the whole pruned network. The reason for this phenomenon may be that the layers in these areas are sensitive. Similar observations appear in Fig. 2 of Ding et al. (2019) and in Figs. 6 and 10 of Li et al. (2016).

Theoretical Analysis. After the experimental verification, we now prove theoretically the similarities among the criteria in Table 2 under layer-wise pruning. Let C1 and C2 be two pruning criteria that calculate the importance of all convolutional filters of one layer. If they produce similar ranks of importance, we say C1 and C2 are approximately monotonic to each other, written C1 ≅ C2. In Section 3, we use the Sp to describe this relationship, but it is hard to analyze theoretically. Therefore, we consider a stronger condition. Let X = (x_1, x_2, ..., x_k) and Y = (y_1, y_2, ..., y_k) be two given sequences. We first normalize their magnitudes, i.e., let X̃ = X/E(X) and Ỹ = Y/E(Y). This operation does not change the ranking of the elements of X and Y, because E(X) and E(Y) are constants, i.e., X ≅ Y ⇔ X̃ ≅ Ỹ. After that, if both Var(X̃/Ỹ) and Var(Ỹ/X̃) are small enough, then the Sp between X and Y is close to 1, where X̃/Ỹ = (x̃_1/ỹ_1, ..., x̃_k/ỹ_k). The reason is that in this situation, the ratios X̃/Ỹ and Ỹ/X̃ will be close to two constants a and b.
Note that, for any 1 ≤ i ≤ k, x̃_i ≈ a·ỹ_i and ỹ_i ≈ b·x̃_i. Then ab ≈ 1 and a, b ≠ 0. Therefore, there exists an approximately monotonic (linear) mapping from ỹ_i to x̃_i, which makes the Sp between X and Y close to 1.

Theorem 1. Let X ∼ N(0, c² · I_n), and let (C1, C2) be one of (ℓ1, ℓ2), (ℓ1, Fermat) or (Fermat, GM). Then

    max{ Var_X[ C̃2(X)/C̃1(X) ], Var_X[ C̃1(X)/C̃2(X) ] } ≤ B(n),

where C̃1(X) denotes C1(X)/E(C1(X)), C̃2(X) denotes C2(X)/E(C2(X)), and B(n) is an upper bound of the left-hand side with B(n) → 0 as n grows large. Proof. (See Appendix G.)

For the i-th convolutional layer of a neural network, since the F_ij, j = 1, 2, ..., N_{i+1}, satisfy CWDA and the dimensions of F_ij are generally large, we obtain ℓ1 ≅ ℓ2, ℓ2 ≅ Fermat and Fermat ≅ GM from Theorem 1. Therefore, ℓ1 ≅ ℓ2 ≅ Fermat ≅ GM, which verifies the strong similarities among the criteria shown in Table 2.
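Theorem 1 can be checked numerically for the pair (ℓ1, ℓ2): filters drawn under CWDA give a small normalized-ratio variance and a Spearman coefficient close to 1. A small simulation sketch (the dimensions are illustrative):

```python
import numpy as np

def spearman(x, y):
    # Spearman's rank correlation (continuous data, so ties are negligible)
    rx = np.argsort(np.argsort(x)).astype(float)
    ry = np.argsort(np.argsort(y)).astype(float)
    rx -= rx.mean(); ry -= ry.mean()
    return float(rx @ ry / np.sqrt((rx @ rx) * (ry @ ry)))

rng = np.random.default_rng(0)
n_filters, dim = 256, 1152               # e.g. 128 input channels with 3x3 kernels
X = rng.normal(size=(n_filters, dim))    # filters drawn under CWDA (c = 1)

l1 = np.abs(X).sum(axis=1)
l2 = np.sqrt((X ** 2).sum(axis=1))

# normalized ratio from Theorem 1; its variance should be O(1/n)
r = (l1 / l1.mean()) / (l2 / l2.mean())
print(r.var(), spearman(l1, l2))         # small variance, Sp close to 1
```

The ratio variance here is far below the 1/n bound of Theorem 2, and the two importance rankings are nearly identical.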

4. APPLICABILITY

In this section, we analyze the Applicability problem of the Norm-based criteria when they are used to prune a large network (each convolutional layer has a large number of filters). In Fig. 1, we can see that ℓ2 cannot distinguish the redundancy of Wide ResNet28-10 (regarded as a large network) well from the measured filters' importance, because the importance values are very close (the distribution looks sharp). This problem is related to the variance of the importance. He et al. (2019) argue that a small norm deviation (small variance of importance) makes it difficult to find an appropriate threshold to select filters to prune. However, even if the variance is large, the Applicability problem may remain, because the magnitude of the importance may be much greater than the variance; we use the mean of the importance to represent its magnitude. We believe the following two situations may cause the Applicability problem for the filters F_i in the i-th convolutional layer: (1) the mean E(F_i) = 0 and the variance Var(F_i) is close to 0; (2) E(F_i) ≠ 0 and Var(F_i)/E(F_i) is close to 0. For a network with a large number of convolutional filters, the dimensions of the filters are also large. For ℓ2 pruning, according to CWDA (proof in Appendix L), the mean of ℓ2(F_i) is √2·σ_i·Γ((k_i+1)/2)/Γ(k_i/2) ≠ 0, where σ_i and k_i are the standard deviation and dimension of the parameters in the i-th layer, respectively, and Var(F_i)/E(F_i) → 0 when k_i is large enough. For these reasons, the importance measured by the ℓ2 norm tends to be nearly identical, i.e., it is hard to distinguish the network redundancy well in this situation. Moreover, from the proof in Appendix H, we know that the Fermat point F̄ of the F_i and the origin 0 approximately coincide. According to Table 2, ||F̄ − F_i||₂ ≈ ||0 − F_i||₂ = ||F_i||₂.
Therefore, the importance values of Fermat and ℓ2 are almost equivalent when CWDA holds, so a similar conclusion can be drawn for the Fermat criterion. Intuitively, the ℓ1 criterion should not have the same Applicability problem, since its Var(F_i)/E(F_i) tends to a non-zero constant proportional to σ_i. In other words, the ℓ1 criterion does not necessarily have Applicability problems unless σ_i is small enough. Besides the Norm-based criteria, we analyze another type of pruning criterion called RePr (Prakash et al., 2019), which considers the orthogonality among the filters in one layer. Based on the proof in Appendix L and the fact that the importance measured by the ℓ2 norm tends to be identical, we also know that this criterion cannot prune the network well when the number of filters is too large. Moreover, in Section 5.2, we study the Applicability problem for more types of pruning criteria from a numerical perspective.
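The closed forms above suggest a simple numerical check: under CWDA, Var/E of the ℓ2 importance shrinks as the filter dimension k_i grows, while for ℓ1 it stays near a constant proportional to σ_i. A Monte-Carlo sketch (sample sizes are illustrative; `l2_mean` implements the scaled-chi mean via log-Gamma to avoid overflow):

```python
import numpy as np
from math import lgamma, exp, sqrt

def var_over_mean(norm, k, sigma=1.0, n_filters=2000, seed=0):
    """Monte-Carlo Var(importance)/E(importance) for filters ~ N(0, sigma^2 I_k)."""
    rng = np.random.default_rng(seed)
    X = rng.normal(scale=sigma, size=(n_filters, k))
    imp = np.abs(X).sum(axis=1) if norm == "l1" else np.sqrt((X ** 2).sum(axis=1))
    return float(imp.var() / imp.mean())

def l2_mean(k, sigma=1.0):
    # closed form sqrt(2)*sigma*Gamma((k+1)/2)/Gamma(k/2) from the scaled chi distribution
    return sqrt(2.0) * sigma * exp(lgamma((k + 1) / 2) - lgamma(k / 2))

for k in (64, 512, 4096):
    print(k, var_over_mean("l1", k), var_over_mean("l2", k), l2_mean(k))
```

The ℓ2 ratio decays roughly like 1/√k while the ℓ1 ratio is dimension-independent, matching the argument above.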

5. DISCUSSION

5.1. WHY DOES CWDA SOMETIMES NOT HOLD?

CWDA may not always hold. As shown in Appendix P, a small number of convolutional filters do not pass the statistical tests. In this section, we analyze this phenomenon. (1) The network needs to be trained well enough. The distribution of the parameters is only meaningful once the network is well trained. If the network has not converged, it is easy to construct a scenario that does not satisfy CWDA; e.g., in Fig. 16 (Appendix D), we train a network with uniform initialization. Although the distribution of the parameters converges toward a normal distribution as the number of epochs increases, after only a few epochs it is still close to a uniform distribution, which obviously does not satisfy CWDA. Moreover, if a network is not trained well, e.g., its parameters converge to bad local minima, many unforeseen circumstances may cause CWDA to be violated. To eliminate these factors, the network should be well trained when studying the parameter distribution. (2) The number of filters is insufficient. In Appendix P, the layers that do not pass the statistical tests are mostly near the front of the network. A common feature of these layers is that they have few filters, which makes the statistics hard to estimate reliably. In Fig. 5, due to the sensitivity of the layers, the Sp in the transition layers is relatively small. Taking the second convolutional layer (64 filters) of VGG16 on CIFAR10 as an example, we find that increasing the number of filters alleviates this sensitivity. As shown in Fig. 6, when we change the number of filters in this layer from 64 to 128 or 256, the Sp increases significantly, which suggests that a sufficient number of filters is essential. (3) The dimension of the filters is not large enough. The dimension of a filter in the i-th layer is closely related to the number of filters in the (i−1)-th layer.
To eliminate the influence of the number of filters, we can consider the correlation matrix of the parameters within each filter. As analyzed in Section 2, only when the dimension of the convolutional filter is large enough does this matrix approximate a diagonal matrix. This may also be why the layers near the front of the network usually do not pass the statistical tests in Appendix P.

Figure 6: The Sp between different pruning criteria on VGG16 (CIFAR10). The number of filters in the second convolutional layer is changed from 64 to 256.

5.2. WHAT ABOUT OTHER PRUNING CRITERIA?

In this section, we study the Similarity and Applicability problems for more types of pruning criteria through numerical experiments, such as Activation-based pruning (Hu et al., 2016; Luo & Wu, 2017), e.g., APoZ (Hu et al., 2016), Importance-based pruning (Molchanov et al., 2016; 2019a), and BN-based pruning (Liu et al., 2017b). The details of these criteria can be found in Appendix M.

The Similarity of different types of pruning criteria. In Fig. 7, we show the Sp between different types of pruning criteria; only Sp values greater than 0.7 are shown, because Sp < 0.7 means there is no strong similarity between the two criteria in the current layer. From the Sp shown in Fig. 7, we make the following observations: (1) As verified in Section 3, ℓ1 and ℓ2 maintain a strong similarity in each layer. (2) In the layers shown in Fig. 7(a) and Fig. 7(d), the Sp between most pairs of criteria is not large, which indicates that these methods differ greatly in how they measure the redundancy of convolutional filters. This can lead to one criterion considering a convolutional filter important while another considers it redundant; a specific example is shown in Appendix E. (3) Intuitively, criteria of the same type should be similar; however, Fig. 7 shows that this is not always the case.

The Applicability of different types of pruning criteria. According to the analysis in Section 4, the Applicability problem depends on the mean and variance of the importance. Fig. 8 shows the importance measured by different pruning criteria on each layer of VGG16. VGG16 is a network whose width gradually becomes larger with depth (from 64 to 512 filters). For this reason, we can study the Applicability problem in moderately wide (shallow) and sufficiently wide (deep) convolutional layers. First, we analyze the Norm-based criteria.
The means of both ℓ1 and ℓ2 are relatively large, but in most layers the variance/mean of ℓ2 is much smaller than that of ℓ1, which means that ℓ2 pruning has the Applicability problem while ℓ1 does not. This is consistent with the conclusion in Section 4. Next, for the Activation-based criteria, the mean and variance/mean are both large, which means these two Activation-based criteria can distinguish the network redundancy well from their measured filters' importance. However, for the Importance-based and BN-based criteria, the mean and variance/mean are close to 0. According to condition (1) in Section 4, these criteria have the Applicability problem, especially in the deep layers.
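The two failure conditions of Section 4 can be turned into a rough numerical diagnostic for any vector of importance scores. A hedged sketch (the function name and threshold are ours, and the check is purely heuristic):

```python
import numpy as np

def applicability_flag(importance, tol=1e-2):
    """Heuristic check of the two failure conditions from Section 4:
    (1) mean ~ 0 and variance ~ 0; (2) mean != 0 but variance/mean ~ 0."""
    m = float(np.mean(importance))
    v = float(np.var(importance))
    if abs(m) < tol and v < tol:
        return "problem: near-zero mean and variance"
    if abs(m) >= tol and v / abs(m) < tol:
        return "problem: importance values nearly identical"
    return "ok"
```

Running this on the per-layer importance vectors of Fig. 8 would flag the ℓ2, Importance-based and BN-based criteria in the deep layers, in line with the discussion above.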

5.3. WHAT ABOUT GLOBAL PRUNING?

Compared with layer-wise pruning, global pruning is more widely used in channel pruning (Liu et al., 2018; Molchanov et al., 2016; Liu et al., 2017b).

Figure 9: Similarity while using global pruning.

Similarity while using global pruning. In Fig. 9, we show the similarity of different types of pruning criteria using global pruning on VGG16 and ResNet56. Compared with the layer-wise results in Fig. 7, the similarities of most pruning criteria are quite different under global pruning. In particular, for ℓ1 and ℓ2 in Fig. 9(a), the similarity is not as strong as in the layer-wise case. We argue that this phenomenon may be due to the differences between the parameter distributions of different convolutional layers. More analysis can be found in Appendix A.

Applicability while using global pruning. In fact, for global pruning, Norm-based criteria are not prone to Applicability problems. From Section 4, the estimated magnitudes of the importance in the i-th layer measured by ℓ1 and ℓ2 are σ_i·k_i·√(2/π) and √2·σ_i·Γ((k_i+1)/2)/Γ(k_i/2), respectively. Since σ_i and k_i differ considerably across layers, the variance of the importance is large in this situation. Fig. 10 shows this difference of magnitude across convolutional layers. Our estimation also explains a common problem in practical applications of global pruning: the network is easily pruned off. As shown in Fig. 10, taking ResNet56 as an example, since the importance in the first stage is much smaller than in the deeper layers, global pruning will prune the convolutional filters of the first stage first. To solve the problem of inconsistent magnitudes, we suggest applying some normalization or establishing a protection mechanism, e.g., one that ensures each layer retains at least a certain number of convolutional filters.
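The magnitude inconsistency can be reproduced with two synthetic layers that satisfy CWDA with different σ_i and k_i (the specific shapes and standard deviations below are illustrative, not measured from ResNet56):

```python
import numpy as np

rng = np.random.default_rng(0)
# two layers satisfying CWDA with different sigma_i and dimension k_i:
# a narrow early layer vs. a wide deep layer (illustrative values only)
early = rng.normal(scale=0.05, size=(16, 16 * 3 * 3))    # 16 filters, k_i = 144
deep = rng.normal(scale=0.02, size=(64, 512 * 3 * 3))    # 64 filters, k_i = 4608

l2 = lambda W: np.sqrt((W ** 2).sum(axis=1))
importance = np.concatenate([l2(early), l2(deep)])
layer_of = np.array([0] * 16 + [1] * 64)

# globally prune the 30% least important of all 80 filters
order = np.argsort(importance)
pruned = layer_of[order[:24]]
print("filters pruned from the early layer:", int((pruned == 0).sum()), "of 16")
```

Because the ℓ2 magnitudes of the two layers live on different scales, the early layer is wiped out entirely, illustrating why a per-layer normalization or protection mechanism is needed.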

6. CONCLUSION

In this paper, we identified two blind spots of pruning criteria: Similarity and Applicability. For Similarity, some primary pruning criteria produce very similar pruning results. For Applicability, some criteria cannot distinguish the redundancy of the filters well when the number of filters is large. Comprehensive experiments validate these two findings under our assumption, CWDA. Under CWDA, these two blind spots are also discussed for global pruning.

A MORE ANALYSIS ON GLOBAL PRUNING

In this section, we use the random functions provided by NumPy to manually generate data and study the influence of the distribution on the similarity between Norm-based criteria. Each time, we use these distribution functions to generate 100 vectors with 50 dimensions each; each vector can be regarded as a convolutional filter of dimension 50. As shown in Table 4 (left), we compare the Sp of these generated convolutional filters between the ℓ1, ℓ2 and GM criteria. If the Sp is greater than 0.9, we mark it green; otherwise, we mark it red. Under these generated distributions, ℓ1 and ℓ2 keep a large Sp in most cases, which indicates that they still maintain strong similarity for these distributions, but the similarity between GM and ℓ1 (or ℓ2) is relatively poor. If we apply zero-mean normalization to the generated data X, i.e., X − E(X), then, as shown in Table 4 (right), the Sp between these three criteria is greater than 0.9 for all distributions. This phenomenon may indicate that the normality in CWDA is not necessary for the similarity of the criteria, but the zero mean is key to the similarity. Next, we consider mixed distributions, i.e., as shown in Table 5, among the 100 vectors we generate, the first 50 vectors come from Distribution 1 and the last 50 from Distribution 2.
There are several notable phenomena in Table 5: (1) For a mixed distribution composed of two zero-mean normal distributions with different variances, the three criteria maintain strong similarity. (2) For mixed distributions, zero-mean normalization may not make the Sp between the three criteria stronger, which is inconsistent with the situation in Table 4. (3) For ℓ1 and ℓ2, the Sp measured on the mixed distributions is generally lower than on the single distributions (Table 4). According to CWDA, the mixed distribution in phenomenon (1) actually corresponds to the global pruning strategy. If CWDA holds, the distribution of the parameters in different convolutional layers depends on the per-layer variance of the parameters, and these variances are generally different. Therefore, the global pruning strategy actually prunes convolutional filters that follow different Gaussian distributions. Unlike layer-wise pruning, these filters cannot be represented by one random variable F with F ∼ N(0, c²·I_{N_i×k×k}); they need to be described by a stochastic process, i.e., F_t ∼ N(0, c²·I^(t)_{N_i×k×k}). According to the experiment of phenomenon (1) and CWDA, we would expect ℓ1 and ℓ2 to still maintain strong similarity under global pruning. However, as shown in Fig. 9, the Sp of ℓ1 and ℓ2 is not very large, which seems to contradict phenomenon (1). In fact, according to the analysis in Section 5.1, CWDA may not hold in the first few layers of a CNN. If the convolutional filters of these layers are also considered for global pruning, then the similarity between ℓ1 and ℓ2 may be weakened, which is consistent with the experiment in Table 5. To verify this statement, Fig. 11 shows the result for VGG16 under the global pruning strategy, where "Start layer" denotes the first layer included in global pruning.
For example, when the Start layer is 0, we have general global pruning; when the Start layer is 5, the filters of the first five layers are not considered for pruning. In Fig. 11, we can see that as the Start layer increases, the similarity between ℓ1 and ℓ2 becomes stronger and stronger. This shows that the weaker similarity between ℓ1 and ℓ2 under global pruning (compared with the layer-wise case) is mainly caused by the first few layers, which do not satisfy CWDA. Moreover, in phenomenon (2), global pruning differs from the layer-wise case shown in Table 4: zero-mean normalization may not make these pruning criteria similar. Therefore, global pruning is worthy of further study, for example using stochastic processes to explore the similarity and the Applicability problem of different types of pruning criteria.
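The Appendix A experiment is easy to reproduce: generate 100 vectors of dimension 50 from a chosen distribution and compare the Sp of the ℓ1, ℓ2 and GM criteria before and after zero-mean normalization. A sketch with two of the distributions (the helper functions are ours; GM here is the sum of pairwise ℓ2 distances, following Table 2):

```python
import numpy as np

def spearman(x, y):
    # Spearman's rank correlation for continuous data
    rx = np.argsort(np.argsort(x)).astype(float)
    ry = np.argsort(np.argsort(y)).astype(float)
    rx -= rx.mean(); ry -= ry.mean()
    return float(rx @ ry / np.sqrt((rx @ rx) * (ry @ ry)))

def criteria(V):
    """V: (n_vectors, dim). Returns l1, l2 and GM importance of each vector,
    where GM is the sum of l2 distances to all other vectors (Table 2)."""
    l1 = np.abs(V).sum(axis=1)
    l2 = np.sqrt((V ** 2).sum(axis=1))
    d = np.sqrt(((V[:, None, :] - V[None, :, :]) ** 2).sum(axis=2))
    return l1, l2, d.sum(axis=1)

rng = np.random.default_rng(0)
for name, V in [("normal(0,1)", rng.normal(size=(100, 50))),
                ("uniform(0,1)", rng.uniform(size=(100, 50)))]:
    l1, l2, gm = criteria(V)
    l1c, l2c, gmc = criteria(V - V.mean())      # zero-mean normalization
    print(name, spearman(l1, gm), spearman(l1c, gmc))
```

For zero-mean data the GM criterion becomes approximately monotone in the ℓ2 norm, which is why centering raises the Sp in Table 4 (right).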

E A SPECIFIC EXAMPLE FOR DIFFERENT PRUNING CRITERIA

The following lists give the indices of pruned filters, ordered by the filters' importance, from different types of pruning criteria, taking VGG16 (2nd Conv) as an example. The 5th filter in this layer is regarded as a redundant convolutional filter by the APoZ criterion, but the other criteria consider it almost the most important.

Taylor 1: [27, 36, 25, 11, 6, 23, 24, 16, 0, 57, 48, 53, 1, 61, 18, 55, 34, 15, 51, 58, 31, 3, 12, 21, 59, 30, 7, 38, 41, 50, 10, 33, 17, 46, 62, 13, 49, 43, 42, 47, 2, 32, 44, 20, 39, 52, 56, 40, 9, 26, 37, 22, 29, 54, 60, 8, 14, 45, 4, 63, 19, 35, 28, 5]

Taylor 2: [23, 32, 36, 11, 62, 16, 30, 59, 10, 13, 2, 50, 38, 0, 46, 43, 21, 26, 15, 22, 7, 51, 39, 33, 14, 58, 9, 40, 57, 6, 61, 44, 20, 48, 3, 53, 41, 56, 17, 12, 18, 31, 4, 1, 25, 19, 63, 24, 54, 45, 52, 37, 55, 47, 34, 35, 8, 29, 42, 27, 49, 28, 60, 5]

BN β: [52, 46, 32, 21, 14, 29, 17, 0, 19, 36, 1, 51, 44, 40, 41, 60, 57, 27, 22, 53, 63, 8, 30, 26, 23, 58, 39, 18, 9, 47, 31, 35, 11, 37, 55, 45, 3, 61, 6, 4, 33, 25, 15, 48, 43, 28, 56, 2, 13, 16, 34, 20, 59, 10, 7, 24, 50, 62, 12, 49, 38, 42, 5, 54]

APoZ: [5, 10, 38, 42, 62, 24, 13, 12, 7, 28, 59, 15, 23, 11, 16, 56, 34, 35, 57, 19, 2, 49, 43, 25, 6, 63, 61, 36, 9, 27, 33, 20, 48, 58, 55, 18, 51, 31, 1, 0, 53, 37, 26, 29, 47, 60, 8, 44, 41, 46, 21, 17, 14, 32, 52, 22, 39, 3, 40, 30, 4, 45, 50, 54]

F RELATED PROPOSITIONS

Proposition 1 (Amoroso distribution). The Amoroso distribution is a four-parameter, continuous, univariate, unimodal probability density with semi-infinite range (Crooks, 2012). Its probability density function is

    Amoroso(x | a, θ, α, β) = (1/Γ(α)) · |β/θ| · ((x − a)/θ)^{αβ−1} · exp(−((x − a)/θ)^β),

for x, a, θ, α, β ∈ R, α > 0, with range x ≥ a if θ > 0 and x ≤ a if θ < 0. The mean and variance of the Amoroso distribution are

    E_{X∼Amoroso(x|a,θ,α,β)}[X] = a + θ · Γ(α + 1/β)/Γ(α),    (5)

and

    Var_{X∼Amoroso(x|a,θ,α,β)}[X] = θ² · [Γ(α + 2/β)/Γ(α) − Γ(α + 1/β)²/Γ(α)²].    (6)

Proposition 2 (Half-normal distribution).
Let the random variable X follow a normal distribution N(0, σ²); then Y = |X| follows a half-normal distribution (Pescim et al., 2010). Moreover, Y also follows Amoroso(x | 0, √2σ, 1/2, 2). By Eq. (5) and Eq. (6), the mean and variance of the half-normal distribution are

    E_{X∼N(0,σ²)}|X| = σ·√(2/π),  and  Var_{X∼N(0,σ²)}|X| = σ²·(1 − 2/π).

Proposition 3 (Scaled chi distribution). Let X = (x_1, x_2, ..., x_k), where the x_i, i = 1, ..., k, are k independent, normally distributed random variables with mean 0 and standard deviation σ. The statistic ℓ2(X) = sqrt(Σ_{i=1}^k x_i²) follows a scaled chi distribution (Crooks, 2012). Moreover, ℓ2(X) also follows Amoroso(x | 0, √2σ, k/2, 2). By Eq. (5) and Eq. (6), the moments and variance of the scaled chi distribution are

    E_{X∼N(0,σ²·I_k)}[ℓ2(X)^j] = 2^{j/2} · σ^j · Γ((k+j)/2)/Γ(k/2),

and

    Var_{X∼N(0,σ²·I_k)}[ℓ2(X)] = 2σ² · [Γ(k/2 + 1)/Γ(k/2) − Γ((k+1)/2)²/Γ(k/2)²].

G PROOF OF THEOREM 1

Theorem 1. Let X ∼ N(0, c²·I_n), and let (C1, C2) be one of (ℓ1, ℓ2), (ℓ1, Fermat) or (Fermat, GM). Then

    max{ Var_X[C̃2(X)/C̃1(X)], Var_X[C̃1(X)/C̃2(X)] } ≤ B(n),

where C̃1(X) denotes C1(X)/E(C1(X)) and C̃2(X) denotes C2(X)/E(C2(X)); B(n) denotes an upper bound of the left-hand side, and B(n) → 0 when n is large enough.

For the i-th layer, we use v_j to represent F_ij, j = 1, 2, ..., N, and the v_j satisfy CWDA (i.e., the v_j are i.i.d. with v_j ∼ N(0, c²·I)). From the theoretical analysis in Section 3, the fact that two criteria C1 and C2 satisfy Eq. (11) is equivalent to C1 ≅ C2.

(1) For (ℓ2, ℓ1). The relation ℓ2 ≅ ℓ1 (their importance rankings are similar) is not trivial. Generally, for convolutional filters, dim(v_j) is large. Since the v_j satisfy CWDA, Theorem 2 shows that the variance of the ratio between ℓ1 and ℓ2 is bounded by O(dim(v_j)^{−1}), which means ℓ2 and ℓ1 are approximately monotonic. A specific numerical validation is shown in Fig. 17 of Appendix H.

Theorem 2.
Let X ∼ N(0, c²·I_n). Then

max{ Var_X[ℓ̃2(X)/ℓ̃1(X)], Var_X[ℓ̃1(X)/ℓ̃2(X)] } ≲ 1/n,   (12)

where ℓ̃1(X) denotes ℓ1(X)/E(ℓ1(X)) and ℓ̃2(X) denotes ℓ2(X)/E(ℓ2(X)).

Proof. (See Appendix H.)

(2) For (ℓ1, Fermat). Since v_i satisfies CWDA, from Theorem 3 we know that the Fermat point of the v_i and the origin 0 approximately coincide. According to Table 2, ||Fermat − v_i||₂ ≈ ||0 − v_i||₂ = ||v_i||₂. Therefore, from Theorem 2, the bound B(n) for (ℓ1, Fermat) is also O(1/n). Moreover, under CWDA, the centroid of the v_i is G = (1/N)·Σ_{i=1}^N v_i ≈ 0. Hence,

G ≈ 0 ≈ Fermat.   (13)

Theorem 3. Let the random variables v_i ∈ R^k be i.i.d. and follow the normal distribution N(0, σ²·I_k). For F ∈ R^k, we have

argmin_F E_{v_i∼N(0,σ²·I_k)} Σ_{i=1}^n ||F − v_i||₂ = 0.

Proof. (See Appendix I.)

(3) For (GM, Fermat). First, we state the following two theorems:

Theorem 4. Let a₁, ..., a_n ∈ R^k be n random variables following N(0, c²·I_k). When k is large enough, we have the estimation

Var_{a_i}[F₁(a_i)/F₂(a_i)] ≈ 1/(2nk), Var_{a_i}[F₂(a_i)/F₁(a_i)] ≈ 1/(2nk),   (14)

where F₁(a_i) = Σ_{i=1}^n ||a_i||₂ / E(Σ_{i=1}^n ||a_i||₂) and F₂(a_i) = Σ_{i=1}^n ||a_i||₂² / E(Σ_{i=1}^n ||a_i||₂²).

Proof. (See Appendix J.)

Theorem 5. Let v₀, v₁, ..., v_k be k + 1 vectors in the n-dimensional Euclidean space E^n. For all P in E^n,

Σ_{i=0}^k ||P − v_i||₂² = Σ_{i=0}^k ||G − v_i||₂² + (k + 1)·||P − G||₂²,   (15)

where G is the centroid of the v_i, will hold if one of the following conditions is satisfied: (1) k ≥ n and rank(v₁ − v₀, v₂ − v₀, ..., v_k − v₀) = n; (2) k < n and (v₁ − v₀, v₂ − v₀, ..., v_k − v₀) are linearly independent; (3) v_i ∼ N(0, c²·I_n), in which case Eq. (15) holds with probability 1.

Proof. (See Appendix K.)

Let P ∈ {v₁, v₂, ..., v_N}. Since v_i ∼ N(0, c²·I), we obtain a_i = P − v_i ∼ N(0, 2c²·I) if P ≠ v_i.
According to the analysis in Section 3 and Theorem 4, we have n i=1 ||a i || 2 ∼ = n i=1 ||a i || 2 2 , Next, we can prove (k + 1)||P -F || 2 2 (Fermat) and N i=1 ||P -v i || 2 (GM) are approximately monotonic, where P ∈ {v 1 , v 2 , ..., v N }. (k + 1)||P -F || 2 2 ∼ = (k + 1)||P -G|| 2 2 Since Eq. (13) = N i=1 ||P -v i || 2 2 - N i=1 ||G -v i || 2 2 Since Theorem 5 ∼ = N i=1 ||P -v i || 2 - N i=1 ||G -v i || 2 2 Since Eq. ( 16) ∼ = N i=1 ||P -v i || 2 (17) The reason for the last equation is that N i=1 ||G -v i || 2 2 is a constant for given v i . H PROOF OF THEOREM 2 Proposition 4 (Stirling's formula). 6 For big enough x and x ∈ R + , we have an approximation of Gamma function: Γ(x + 1) ≈ √ 2πx x e x . Proposition 5 (FKG inequality). If f and g are increasing functions on R n (Graham, 1983), we have E(f )E(g) ≤ E(f g). ( ) Say that a function on R n is increasing if it is an increasing function in each of its arguments.(i.e., for fixed values of the other arguments). Proposition 6. Let f (X, Y ) is a two dimensional differentiable function. According to Taylor theorem (Hormander, 1983) , we have f (X, Y ) = f (E(X), E(Y )) + cyc (X -E(X)) ∂ ∂X f (E(X), E(Y )) + Remainder1, f (X, Y ) = f (E(X), E(Y )) + cyc (X -E(X)) ∂ ∂X f (E(X), E(Y ))+ 1 2 cyc (X -E(X)) T ∂ 2 ∂X 2 f (E(X), E(Y ))(X -E(X)) + Remainder2 (21) Lemma 1. Let X and Y are random variables. Then we have such an estimation Var X Y ≈ E(X) E(Y ) 2 VarX E(X) 2 + VarY E(Y ) 2 -2 Cov(X, Y ) E(X)E(Y ) . ( ) Proof. Let f (X, Y ) = X/Y , according to the definition of variance, we have Varf (X, Y ) = E[f (X, Y ) -E(f (X, Y ))] 2 ≈ E[f (X, Y ) -E f (E(X), E(Y )) + cyc (X -E(X)) ∂ ∂X f (E(X), E(Y )) ] 2 from Eq. (20) = E[f (X, Y ) -f (E(X), E(Y )) - cyc E(X -E(X)) ∂ ∂X f (E(X), E(Y ))] 2 = E[f (X, Y ) -f (E(X), E(Y ))] 2 ≈ E[ cyc (X -E(X)) ∂ ∂X f (E(X), E(Y ))] 2 from Eq. 
(20) = 2Cov(X, Y ) ∂ ∂X f (E(X), E(Y )) ∂ ∂Y f (E(X), E(Y )) + cyc [ ∂ ∂X f (E(X), E(Y ))] 2 • VarX = 2Cov(X, Y ) • 1 E(Y ) • - E(X) (E(Y )) 2 + 1 (E(Y )) 2 • VarX + (EX) 2 (EY ) 4 • VarY = E(X) E(Y ) 2 VarX E(X) 2 + VarY E(Y ) 2 -2 Cov(X, Y ) E(X)E(Y ) . Lemma 2. For big enough x and x ∈ R + , we have lim x→+∞ Γ( x+1 2 ) Γ( x 2 ) 2 • 1 x = 1 2 . ( ) And lim x→+∞ Γ( x 2 + 1) Γ( x 2 ) - Γ( x+1 2 ) Γ( x 2 ) 2 = 1 4 . ( ) Proof. lim x→+∞ Γ( x+1 2 ) Γ( x 2 ) 2 • 1 x ≈ lim x→+∞   2π( x-1 2 ) • ( x-1 2e ) x-1 2 2π( x-2 2 ) • ( x-2 2e ) x-2 2   2 • 1 x from Proposition. 4 = lim x→+∞ x -1 x -2 • ( x-1 2e ) x-2 ( x-2 2e ) x-2 • x -1 2e • 1 x = lim x→+∞ 1 + 1 x -2 x-2 • x -1 x -2 • x -1 2e • 1 x = 1 2 on the other hand, we have lim x→+∞ Γ( x 2 + 1) Γ( x 2 ) - Γ( x+1 2 ) Γ( x 2 ) 2 = lim x→+∞ x 2 -1 + 1 x -2 x-2 • x -1 x -2 • x -1 2e = lim x→+∞ x 2e e -(1 + 1 x ) x = 1 2 - 1 e (-e) 2 = 1 4 Theorem 2 Let X ∼ N (0, c 2 • I n ), we have max Var X 2 (X) 1 (X) , Var X 1 (X) 2 (X) 1 n . where 1 (X) denotes 1 (X)/E( 1 (X)) and 2 (X) denotes 2 (X)/E( 2 (X)). Proof. For the ratio 2 (X)/ 1 (X), we have Var 2 (X) 1 (X) = E( 1 (X)) E( 2 (X)) ≤ Var 2 (X) E( 2 (X)) 2 + Var 1 (X) E( 1 (X)) 2 . Therefore, max Var X 2 (X) 1 (X) , Var X 1 (X) 2 (X) ≤ Var 2 (X) E( 2 (X)) 2 + Var 1 (X) E( 1 (X)) 2 = 2σ 2 Γ( n 2 +1) Γ( n 2 ) - Γ( n+1 2 ) 2 Γ( n 2 ) 2 ( √ 2σ • Γ( n+1 2 ) Γ( n 2 ) ) 2 + σ 2 1 -2 π n (n • σ 2/π) 2 from Proposition. 3 and 2 ≈ 1 2n + ( π 2 -1) 1 n from Lemma 2 = π -1 2n Because the approximation is widely used in the proof of Theorem 1, it is necessary to verify it numerically. As shown in Fig. 17 , we use ResNet56 on Cifar100 and ResNet110 on Cifar10 respectively to verify Theorem 1. From Fig. 17 , we find that the estimationn of Theorem 1 is reliable, i.e., the estimation O( 1 n ) for max Var X 2 (X) 1 (X) , Var X 1(X ) 2(X ) is appropriate.
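The O(1/n) estimate in Theorem 2, and the closed-form mean of ℓ2 from Proposition 3, can both be checked with a quick Monte Carlo simulation. The dimensions and sample counts below are illustrative sketches under the paper's Gaussian assumption, not the settings used in Fig. 17:

```python
import math
import numpy as np

rng = np.random.default_rng(0)
for n in (64, 256, 1024):
    X = rng.normal(0.0, 1.0, size=(20_000, n))  # samples of X ~ N(0, I_n)
    l1 = np.abs(X).sum(axis=1)
    l2 = np.sqrt((X ** 2).sum(axis=1))
    # normalized criteria (divide by the empirical mean) and their ratio
    ratio = (l2 / l2.mean()) / (l1 / l1.mean())
    # closed-form E[l2] from Proposition 3: sqrt(2) * Gamma((n+1)/2) / Gamma(n/2)
    theory_l2 = math.sqrt(2.0) * math.exp(math.lgamma((n + 1) / 2) - math.lgamma(n / 2))
    print(n, ratio.var(), 1.0 / n, l2.mean(), theory_l2)
```

The empirical variance of the ratio stays well below 1/n, consistent with the bound, and shrinks as n grows.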

I PROOF OF THEOREM 3

Proposition 7. Let L_p^(α)(x) denote the generalized Laguerre function. It has the following properties:

d^n/dx^n L_p^(α)(x) = (−1)^n · L_{p−n}^(α+n)(x),   (26)

and, for α > 0,

L_{−1/2}^(α)(x) > 0.   (27)

Theorem 3. For F ∈ R^k, we have

argmin_F E_{v_i∼N(0,σ²·I_k)} Σ_{i=1}^n ||F − v_i||₂ = 0.

Proof. Let w_i = F − v_i, so that w_i ∼ N(F, σ²·I_k). Then

E_{v_i∼N(0,σ²·I_k)} Σ_{i=1}^n ||F − v_i||₂ = Σ_{i=1}^n E_{v_i∼N(0,σ²·I_k)} ||F − v_i||₂ = Σ_{i=1}^n E_{w_i∼N(F,σ²·I_k)} ||w_i||₂ = n·σ·√(π/2) · L_{1/2}^(k/2−1)(−||F||₂²/(2σ²)).

The last equation holds because ||w_i||₂ follows a scaled noncentral chi distribution when w_i ∼ N(F, σ²·I_k). Let T(x) = L_{1/2}^(k/2−1)(−x²/(2σ²)); we compute the minimum of T(x). From Eq. (26),

d/dx T(x) = (x/σ²) · L_{−1/2}^(k/2)(−x²/(2σ²)).

By Eq. (27), d/dx T(x) > 0 when x > 0, and d/dx T(x) ≤ 0 when x ≤ 0. Hence T(x) attains its minimum at ||F||₂ = 0, i.e., F = 0.
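Theorem 3 can be sanity-checked numerically: for Gaussian v_i, the expected sum of distances is minimized at the origin. A minimal sketch (the dimension k and the shifted candidate point are illustrative choices, not values from the paper):

```python
import numpy as np

rng = np.random.default_rng(1)
k = 64
v = rng.normal(0.0, 1.0, size=(200_000, k))  # samples of v_i ~ N(0, I_k)

def expected_distance(F):
    # Monte Carlo estimate of E ||F - v||_2 for v ~ N(0, I_k)
    return np.linalg.norm(F - v, axis=1).mean()

at_origin = expected_distance(np.zeros(k))
away = expected_distance(np.full(k, 0.5))  # an arbitrary shifted point
print(at_origin, away)  # the objective is smaller at F = 0
```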

J PROOF OF THEOREM 4

Lemma 3. For two random variables X, Y ∈ R k follow N (0, c 2 • I k ) and they are i.i.d. When k is large enough, we have: E (||X|| 2 2 -||Y || 2 2 ) 2 2||X|| 2 • ||Y || 2 ≈ 2c 2 + 4c 2 k + 1 2k 2 , ( ) and Var (||X|| 2 2 -||Y || 2 2 ) 2 2||X|| 2 • ||Y || 2 8c 4 + 16c 4 k + c 2 k 2 , ( ) Proof. According to Proposition 3 and Lemma 2, it is easy to know (similar method in Eq.( 86)), when k is large enough, that E (2||X|| 2 • ||Y || 2 ) = 2c 2 k, Var (2||X|| 2 • ||Y || 2 ) = c 2 + 4c 4 k, and E (||X|| 2 2 -||Y || 2 2 ) 2 = 4c 4 k, Var (||X|| 2 2 -||Y || 2 2 ) 2 = 16c 8 (2k 2 + 3k). Since Lemma 1, we have an estimation Var (||X|| 2 2 -||Y || 2 2 ) 2 2||X|| 2 • ||Y || 2 ≤ E(||X|| 2 2 -||Y || 2 2 ) 2 E2||X|| 2 • ||Y || 2 2 Var(||X|| 2 2 -||Y || 2 2 ) 2 E(||X|| 2 2 -||Y || 2 2 ) 2 + Var(2||X|| 2 • ||Y || 2 ) 2 ) E(2||X|| 2 • ||Y || 2 ) 2 ≈ 4c 4 k 2c 2 k 2 • c 2 + 4c 4 k 4c 4 k + 16c 8 (2k 2 + 3k) 16c 8 k 2 Since Eq.( 31) and Eq.( 32) = 8c 4 + 16c 4 k + c 2 k 2 . From Eq.( 21) and Lemma 1, we also can obtain an estimation of E(A/B), where A and B are two random variables. i.e., E A B ≈ EA EB + Var(B) • EA (EB) 3 . ( ) Therefore, E (||X|| 2 2 -||Y || 2 2 ) 2 2||X|| 2 • ||Y || 2 ≈ E(||X|| 2 2 -||Y || 2 2 ) 2 E2||X|| 2 • ||Y || 2 + Var(2||X|| 2 • ||Y || 2 ) • E(||X|| 2 2 -||Y || 2 2 ) 2 (E2||X|| 2 • ||Y || 2 ) 3 Since Eq.( 33) ≈ 4c 4 k 2c 2 k + 4c 4 k 8c 6 k 3 • (c 2 + 4c 4 k) Since Eq.(31) and Eq.(32) = 2c 2 + 4c 2 k + 1 2k 2 . Note that, the approximation is widely used in the proof of Eq.( 29) and Eq.( 30). Hence, it is also necessary to verify it numerically. As shown in Fig. 18 , the estimation is appropriate. According to Lemma 3, the mathematical expectation and variance of the ratio of (||X|| 2 2 -||Y || 2 2 ) 2 and 2||X|| 2 • ||Y || 2 are both close to 0 when k is large enough and c is small enough. that is, 2(||X|| 2 • ||Y || 2 ) (||X|| 2 2 -||Y || 2 2 ) 2 . ( ) By the way, the convolutional filters easily meet the condition that k is large enough. Theorem 4. 
For n random variables a i ∈ R k follow N (0, c 2 • I k ). When k is large enough, we have such an estimation: Var ai F 1 (a i ) F 2 (a i ) ≈ 1 2nk , Var ai F 2 (a i ) F 1 (a i ) ≈ 1 2nk . where F 1 (a i ) = n i=1 ||a i || 2 /E( n i=1 ||a i || 2 ) and F 2 (a i ) = n i=1 ||a i || 2 2 /E( n i=1 ||a i || 2 2 ). Proof. Since Eq. ( 9) and Eq. ( 10), we have Var ai F 1 (a i ) F 2 (a i ) = nc 2 k nc √ k 2 • Var ai n i=1 ||a i || 2 n i=1 ||a i || 2 2 . ( ) and Var ai F 2 (a i ) F 1 (a i ) = nc √ k nc 2 k 2 • Var ai n i=1 ||a i || 2 2 n i=1 ||a i || 2 . ( ) Figure 18 : (Left) The numerical verification of Eq.( 29) and (Right) The numerical verification of Eq.( 30). X and Y follow N (0, c 2 • I k ). According to Lagrange's identity, we have n i=1 ||a i || 2 2 n i=1 1 = n i=1 ||a i || 2 2 + 1≤i<j≤n (||a i || 2 2 -||a j || 2 2 ) 2 = n i=1 ||a i || 2 2 + 1≤i<j≤n (||a i || 2 • ||a j || 2 ) + 2 1≤i<j≤n (||a i || 2 2 -||a j || 2 2 ) 2 ≈ n i=1 ||a i || 2 2 + 2 1≤i<j≤n (||a i || 2 • ||a j || 2 ) Since Eq. (34) = n i=1 ||a i || 2 2 so we have Var ai∼N (0,c 2 •I k ) n i=1 ||a i || 2 n i=1 ||a i || 2 2 ≈ Var ai∼N (0,c 2 •I k ) n n i=1 ||a i || 2 By central limit theorem, we have √ n( 1 n n i=1 ||a i || 2 -µ) ∼ N (0, σ 2 ). And let g(x) = 1 x , we can use Delta methodfoot_8 to find the distribution of g( 1 n n i=1 ||a i || 2 ): √ n g( n i=1 ||a i || 2 n ) -g(µ)) ∼ N (0, σ 2 • [g (µ)] 2 ) = N (0, σ 2 • 1 µ 4 ). ( ) where µ and σ 2 denote the mean and variance of ||a i || 2 respectively. From Eq. (37), we have 9) and Eq. ( 10) Var ai∼N (0,c 2 •I k ) n i=1 ||a i || 2 n i=1 ||a i || 2 2 ≈ Var ai∼N (0,c 2 •I k ) n n i=1 ||a i || 2 = σ 2 • 1 µ 4 • n Since Eq. (38) = 2c 2 Γ( k 2 + 1) Γ( k 2 ) - Γ( k+1 2 ) 2 Γ( k 2 ) 2 • 1 ( √ 2c • Γ( k+1 2 ) Γ( k 2 ) ) 4 • n Since Eq. ( = 1 2c 2 • nk 2 Since Lemma. 2 Figure 19 : A numerical verification of Theorem 4, where F 1 = n i=1 ||a i || 2 /E( n i=1 ||a i || 2 ) and F 2 = n i=1 ||a i || 2 2 /E( n i=1 ||a i || 2 2 ). a i follow N (0, 0.01 2 • I k ). 
Since Eq. ( 35), we have Var ai F 1 (a i ) F 2 (a i ) = nc 2 k nc √ k 2 • Var ai n i=1 ||a i || 2 n i=1 ||a i || 2 2 ≈ 1 2nk . ( ) Similar to Eq. ( 37), Var ai∼N (0,c 2 •I k ) n i=1 ||a i || 2 2 n i=1 ||a i || 2 ≈ Var ai∼N (0,c 2 •I k ) n i=1 ||a i || 2 n ( ) Var ai∼N (0,c 2 •I k ) n i=1 ||a i || 2 2 n i=1 ||a i || 2 ≈ Var ai∼N (0,c 2 •I k ) n i=1 ||a i || 2 n Similar to Eq. (37) = σ 2 • 1 n Since central limit theorem = 2c 2 Γ( k 2 + 1) Γ( k 2 ) - Γ( k+1 2 ) 2 Γ( k 2 ) 2 • 1 n Since Eq. (10) = c 2 2n Since Lemma. 2 Since Eq. ( 36), we have Var ai F 2 (a i ) F 1 (a i ) = nc √ k nc 2 k 2 • Var ai n i=1 ||a i || 2 2 n i=1 ||a i || 2 ≈ 1 2nk . ( ) From Eq.(39) and Eq.( 41), Theorem 4 holds. In Fig. 19 , we also show a numerical verification of Theorem 4.
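Theorem 4 can likewise be checked by simulation. The sketch below (with illustrative n, k, and c, not the configuration of Fig. 19) estimates Var(F₁/F₂) and compares it with 1/(2nk):

```python
import numpy as np

rng = np.random.default_rng(2)
n, k, trials = 8, 256, 10_000
a = rng.normal(0.0, 0.01, size=(trials, n, k))  # a_i ~ N(0, c^2 I_k), c = 0.01
s1 = np.linalg.norm(a, axis=2).sum(axis=1)  # sum_i ||a_i||_2
s2 = (a ** 2).sum(axis=(1, 2))              # sum_i ||a_i||_2^2
F1 = s1 / s1.mean()                          # normalized criteria
F2 = s2 / s2.mean()
var_ratio = (F1 / F2).var()
print(var_ratio, 1.0 / (2 * n * k))  # same order of magnitude
```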

K PROOF OF THEOREM 5

Proposition 8. For a n × m random matrix (a ij ) n×m , where a ij ∼ N (0, σ 2 ). And Eq. ( 8) holds with probability 1. rank((a ij ) n×m ) = min(m, n). ( ) Lemma 4. Let v 0 , v 1 , ..., v k be the k + 1 vectors in n dimensional Euclidean space V and k ≤ n. If rank(v 1 -v 0 , v 2 -v 0 , ..., v k -v 0 ) = n, then ∀x ∈ V , ∃λ i (0 ≤ i ≤ k), s.t. x = k i=0 λ i • v i , and k i=0 λ i = 1. We call λ = (λ 0 , λ 1 , ..., λ k ) the generalized barycentric coordinate with respect to (v 0 , v 1 , ..., v k ). (In general, barycentric coordinate is a concept in Polytope) Proof. Note that v i is the element of n dimensional linear space V and rank(v 1 -v 0 , v 2 -v 0 , ..., v k - v 0 ) = n. It means (v 1 -v 0 , v 2 -v 0 , ..., v k -v 0 ) form a set of basis in the linear space V . ∀x ∈ V , x -v 0 can be expressed linearly by them, i.e.,∃t i (1 ≤ i ≤ k) s.t. x = v 0 + k i=1 t i (v i -v 0 ) = (1 - k i=1 t i )v 0 + k i=1 t i v i . Let λ 0 = (1 - k i=1 t i ) and λ i = t i (1 ≤ i ≤ k), Lemma 4 holds. Lemma 5. Let v 0 , v 1 , ..., v k be the k + 1 vectors in n dimensional Euclidean space V . ∀a, b ∈ V , and the generalized barycentric coordinate of a, b with respect to (v 0 , v 1 , ..., v k ) are λ = (λ 0 , λ 1 , ..., λ k ) T and µ = (µ 0 , µ 1 , ..., µ k ) T ,respectively. Then ||a -b|| 2 2 = (λ -µ) T D(λ -µ), where D = (-1 2 d ij ) (k+1)×(k+1) , and d ij = ||v i -v j || 2 2 . Proof. Since Lemma 4, let R = [v 0 , v 1 , ..., v k ] n×(k+1) , and we have a = Rλ and b = Rµ. Moreover, ||a -b|| 2 2 = (a -b) T (a -b) (45) = [R(λ -µ)] T [R(λ -µ)] (46) = (λ -µ) T R T R(λ -µ). Note that, for D = (- 1 2 d ij ) (k+1)×(k+1) , - 1 2 d ij = - 1 2 (v i -v j ) T (v i -v j ) (48) = v T i v j - 1 2 (v T i v i + v T j v j ). ( ) So we have D = R T R -1 2 (v T i v i + v T j v j ) (k+1)×(k+1) . It can be further simplified to D = R T R -1 2 (V α T + αV T ), where V = (v T 0 v 0 , ..., v T k v k ) T and α = (1, ..., 1) T . 
So ||a -b|| 2 2 = (λ -µ) T R T R(λ -µ) (50) = (λ -µ) T (D + 1 2 (V α T + αV T ))(λ -µ) (51) = (λ -µ) T D(λ -µ) + 1 2 (λ -µ) T (V α T + αV T )(λ -µ), therefore, we only need to prove (λ -µ) T (V α T + αV T )(λ -µ) = 0. From Lemma 4, we have α T (λ -µ) = (λ -µ) T α = 0 and the Lemma 5 holds. Definition 1 (Ultra dimension). For a set U composed of vectors in a n dimensional linear space V , we define dim(U ) as the Ultra dimension of U . The definition is that if U has k linearly independent vectors and there are no more, then dim(U ) = k. In fact, if U is a linear subspace in V , then the Ultra dimension and the dimensions of the linear subspace are equivalent. If U is a linear manifold, U = {x + v 0 |x ∈ W }, where v 0 and W are non-zero vectors and linear subspaces in V , respectively. And dim(W ) = r. Then dim(U ) = r, v 0 ∈ W r + 1, v 0 / ∈ W In other words, dim(U ) ≥ dim(W ) always holds. Lemma 6. For arbitrary k (1 ≤ k ≤ n -1), let a 1 , a 2 , ..., a k be k linearly independent vectors in n dimensional linear space V . Consider one n-1 dimensional linear subspace W in V and a non-zero vector v 0 in V . They form a linear manifold  P = {v 0 +α|α ∈ W }. If P ⊂ L = span(a 1 , a 2 , ..., a k ), 1 For the linear manifold P , if v 0 ∈ W . This means that P is equal to the linear subspace W . Since Eq. ( 54), we have W ⊂ L and dim(W ) = dim(L). Hence, P = W = L. However, a 1 , a 2 , ..., a k do not all belong to P , a contradiction. 2 For the linear manifold P , if v 0 / ∈ W , then dim(P ) = n. Because v 0 / ∈ W , that is, v 0 cannot be represented by a set of basis of W . In other words, v 0 and a set of basis of W are linearly independent. However, the dimension of W is n -1, hence dim(P ) = n. From Eq. (54), we have P ⊂ L, so n = dim(P ) ≤ dim(L) = k = n -1, a contradiction. Therefore, Lemma 6 holds for n -k = 1. Assume the induction hypothesis that Lemma 6 is true when n -k = l, where 1 ≤ l. 
when n -k = l + 1, i.e., k = n -(l + 1), we also can find a vector p 1 ∈ P s.t. a 1 , a 2 , ..., a k , p 1 linearly independent. Otherwise, ∀p ∈ P would be linearly represented by a 1 , a 2 , ..., a k . Similarly, we have Eq. ( 54). Note that, from Definition 1, dim(P ) ≥ n -1, hence n -1 ≤ dim(P ) ≤ dim(L) = k = n -(l + 1). a contradiction. At this time, we have k + 1 = n -(l + 1) + 1 = n -l vectors a 1 , a 2 , ..., a k , p 1 which are not all on P . Note that n -(n -l) = l, using the induction hypothesis, the Lemma 6 also holds for n -k = l. In summary, Lemma 6 holds. Theorem 5. Let v 0 , v 1 , ..., v k be the k + 1 vectors in n dimensional Euclidean space E n . For all P in E n , k i=0 ||P -v i || 2 2 = k i=0 ||G -v i || 2 2 + (k + 1)||P -G|| 2 2 . where G is the centroid of v i , will hold if it satisfies one of the following conditions: (1)if k ≥ n and rank(v 1 -v 0 , v 2 -v 0 , ..., v k -v 0 ) = n. (2)if k < n and (v 1 -v 0 , v 2 -v 0 , ..., v k -v 0 ) are linearly independent. (3)if v i ∼ N (0, c • I n ), Eq.( 15) holds with probability 1 where c is a constant. Proof. For Theorem 5 (1). From Lemma 4, ∀P ∈ E n ,∃γ = (γ 0 , ..., γ k ), s.t. P can be represented by k i=0 γ i v i , where k i=0 γ i = 1. In fact, for each v i , it also can be respresented by k j=0 β ij v i , where k i=0 β ij = 1. We just take (β i0 , β i1 , ..., β ik ) as one of the standard orthogonal basis i = (0, 0, ..., 1 i , ...0). According to lemma 5, ||P -v i || 2 2 = (γ -i ) T D(γ -i ) (57) = γ T Dγ -2γ T D i + T i D i (58) = γ T Dγ -2γ T D i . ( ) The final equation is because the diagonal elements of the matrix are all 0. On the other hand, we have ||G -v i || 2 2 = ( 1 k + 1 k i=0 i -i ) T D( 1 k + 1 k i=0 i -i ) (60) = 1 (k + 1) 2 α T Dα - 2 k + 1 α T D i + T i D i (61) = 1 (k + 1) 2 α T Dα - 2 k + 1 α T D i , where α = k i=0 i , i.e.,α = (1, 1, ..., 1). Next, we consider ||P -G|| 2 2 . 
||P -G|| 2 2 = (γ - 1 k + 1 α) T D(γ - 1 k + 1 α) (63) = γ T Dγ + 1 (k + 1) 2 α T Dα - 2 k + 1 γ T Dα. In summary, we have k i=0 ||P -v i || 2 2 -||G -v i || 2 2 = (k + 1)γ T Dγ -2γ T Dα + 1 k + 1 α T Dα (65) = (k + 1)||P -G|| 2 2 (66) Therefore, Theorem 5 (1) holds. For Theorem 5 (2). Next, we prove the case of k < n. Obviously, Lemma 4 does not hold. We consider about such a linear space W 1 = span(P -G), i.e., a linear space expanded by P -G, and its orthogonal complement W ⊥ 1 (in E n ). Since dimension formula from linear space, it is easy to konw that dim(W ⊥ 1 ) = n -1. Two linear manifolds T 1 and T 2 are constructed as follows, T 1 = {x + G|x ∈ W ⊥ 1 } T 2 = {x + G -v 0 |x ∈ W ⊥ 1 } ( ) ∀v i ∈ T 1 , we have (v i -G) T (P -G) = 0, Furthermore, ||P -v i || 2 2 = ||v i -G|| 2 2 + ||P -G|| 2 2 . ( ) It is easy to know that G -v 0 is not 0. If v 1 -v 0 , ..., v k -v 0 are all belong to T 2 , it means v 1 , .., v k are all in T 1 . Hence, we have Eq. (69). By summing both sides of Eq. ( 69) for i, it is obvious find that Theorem 5 (2) holds. If v 1 -v 0 , ..., v k -v 0 are not all belong to T 2 , since Lemma 6, there are n -k vectors p 1 -v 0 , p 2 -v 0 , .., p n-k -v 0 from T 2 s.t. they and v 1 -v 0 , ..., v k -v 0 are linearly independent, where p i obviously belongs to manifold T 1 . At the same time, we have 2G -p i ∈ T 1 , we can also construct n -k new vectors 2G -p i -v 0 ∈ T 2 and calculate the rank that rank(v 1 -v 0 , ..., v k -v 0 , p 1 -v 0 , ..., p n-k -v 0 , 2G -p 1 -v 0 , ..., 2G -p n-k -v 0 ) = rank(v 1 -v 0 , ..., v k -v 0 , p 1 -v 0 , ..., p n-k -v 0 , 2(G -v 0 ), ..., 2(G -v 0 )) (70) = rank(v 1 -v 0 , ..., v k -v 0 , p 1 -v 0 , ..., p n-k -v 0 , 0, ..., 0) (71) = n (72) The reason of the final equation is that k i=1 (v i -v 0 ) = (k + 1)(G -v 0 ). Note that there are a total of k + (n -k) + (n -k) = n + (n -k) ≥ n vectors, meets the lemma 4 condition. 
For the convenience of description, we define L (1) i = v i , (0 ≤ i ≤ k), L (2) i = p i , (1 ≤ i ≤ n -k), L (3) i = 2G -p i , (1 ≤ i ≤ n -k). ( ) And their centroid is G = 1 2n -k + 1 k i=0 v i + n-k i=1 (L (2) i + L (3) i ) (76) = 1 2n -k + 1 ((k + 1)G + 2(n -k)G) (77) = G (78) That is, the newly added vector does not change the centroid of v i . On the other hand, since both L (2) i and L (3) i are in the linear manifold T 1 , and it meets the conditions of the Eq.( 69). Similar to the derivation in the Theorem 5 (1), we have (2n -k + 1)||P -G|| 2 2 = t=L (1) i ,L i ,L i ||P -t|| 2 2 -||G -t|| 2 2 (79) = k i=0 ||P -v i || 2 2 -||G -v i || 2 2 + t=L (2) i ,L (3) i ||P -t|| 2 2 -||G -t|| 2 2 (80) = k i=0 ||P -v i || 2 2 -||G -v i || 2 2 + 2(n -k)||P -G|| 2 2 (81) The final equation is because both L (2) i and L (3) i are in the linear manifold T 1 and satisfy Eq. (69). To simplify Eq. ( 81), we obtain k i=0 ||P -v i || 2 2 -||G -v i || 2 2 = (k + 1)||P -G|| 2 2 . Therefore, Theorem 5 (2) holds. For Theorem 5 (3). When k ≥ n, from Proposition 8, we know that rank(v 1 -v 0 , v 2 -v 0 , ..., v kv 0 ) = n holds with probability 1. Hence, if we use the similar deduction from Theorem 5 (1), we can find that Theorem 5 (3) holds when k ≥ n. On the other hand, when k < n, we can get the same result also according to Proposition 8. The reason is that (v 1 -v 0 , v 2 -v 0 , ..., v k -v 0 ) are linearly independent with probability 1. L THE GEOMETRIC STRUCTURE OF CONVOLUTIONAL FILTERS. Theorem 6. Let v i ∈ R k and v i ∼ N (0, c 2 • I k ). If k → ∞ and c = 0, then (1) ||v i || 2 ≈ ||v j || 2 → √ 2c • Γ((k+1)/2) Γ(k/2) ,1 ≤ i < j ≤ N ; (2) angle(v i , v j ) → π 2 ,1 ≤ i < j ≤ N ; (3) ||v i -v j || 2 ≈ ||v i -v t || 2 ,1 ≤ i < j < t ≤ N ; (4) E(||v i || 1 )/Var(||v i || 1 ) → a non-zero constant. Proof. First, since Chebyshev inequality, for 1 ≤ i ≤ N and a given M , we have P |||v i || 2 -E(||v i || 2 )| ≥ M Var(||v i || 2 ) ≤ 1 M . ( ) from Eq. ( 9), Eq. ( 10) and Lemma. 
(2), we can rewrite Eq. ( 82) when k → ∞: P ||v i || 2 ∈ √ 2c • Γ((k + 1)/2) Γ(k/2) - M 2 c, √ 2c • Γ((k + 1)/2) Γ(k/2) + M 2 c ≥ 1 - 1 M . For a small enough > 0, let M = 1/ . Note that M 2 c = c/ √ 2 is a constant. When k → ∞, √ 2c • Γ((k+1)/2) Γ(k/2) M 2 c. Hence, for any i ∈ [1, N ] and any small enough , we have P ||v i || 2 ≈ √ 2c • Γ((k + 1)/2) Γ(k/2) ≥ 1 -. So Theorem 6(1) holds. Let v i = (v i1 , v i2 , ..., v ik ) and v j = (v j1 , v j2 , ..., v jk ). So < v i , v j >= k p=1 v ip v jp . Note that, v i and v j are independent, hence E(v ip v jp ) = 0, Var(v ip v jp ) = Var(v ip )Var(v jp ) + (E(v ip )) 2 Var(v jp ) + (E(v jp )) 2 Var(v ip ) = 1, since central limit theorem, we have √ k 1 k k p=1 v ip v jp -0 ∼ N (0, 1), According to Eq. ( 9), Lemma 2 and Eq. ( 87), when k → ∞, we have < v i , v j > ||v i || 2 • ||v j || 2 → 1 √ k • < v i , v j > √ k ∼ N (0, 1 k ) → N (0, 0). ( ) So Theorem 6(2) holds. From Theorem 6(1) and Theorem 6(2), Theorem 6(3) can be proved through Pythagoras theorem. For 6(4), from Proposition. 2, we have E(||v i || 1 ) Var(||v i || 1 ) = k • c 2 π k • c 2 (1 -2 π ) = 2 π c(1 -2 π ) As shown in Fig. 21 , Theorem 6 actually reveals the geometric structure formed by the parameters of the convolutional filters in CNNs. Specifically, from Theorem 6 (1), the convolutional filters v i of each layer locate approximately on the surface of k dimensional sphere with 0 as the origin and √ 2c • Γ((k+1)/2) Γ(k/2) as the radius. Then, from Theorem 6 (2), the vectors formed by any two different convolutional filters in the same layer are approximately orthogonal. As this result, for any three different filters v 1 , v 2 and v j , we can use Pythagoras theorem and Theorem 6 (1) to prove that they are equidistant,i.e., ||v 1 -v 2 || 2 ≈ ||v 2 -v 3 || 2 ≈ ||v 3 -v 1 || 2 . In fact, Fig. 20 provides another view of the geometric structure of convolutional filters. Since CWDA, E(v i ) = 0. 
So the correlation matrix {Cor(v_i, v_j)}_{N×N} = c · {v_iᵀ v_j}_{N×N}, where c is a constant. That is to say, the correlation matrix and the Gram matrix differ only by a coefficient. Therefore, the diagonal elements of the matrix are ||v_i||₂², and the off-diagonal elements are the dot products between the convolutional filters. This numerical visualization also verifies the conclusion of Theorem 6.
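The three properties in Theorem 6 are easy to reproduce with synthetic high-dimensional Gaussian "filters" (a sketch; the sizes below are illustrative, not those of a real network):

```python
import numpy as np

rng = np.random.default_rng(3)
k, n_filters = 4096, 32  # high-dimensional filters, as in a wide conv layer
v = rng.normal(0.0, 0.02, size=(n_filters, k))

norms = np.linalg.norm(v, axis=1)
rel_spread = norms.std() / norms.mean()      # (1) norms nearly equal

cos = (v @ v.T) / np.outer(norms, norms)     # cosine similarities
off = cos[~np.eye(n_filters, dtype=bool)]
max_cos = np.abs(off).max()                  # (2) pairwise angles near pi/2

dists = np.linalg.norm(v[:, None, :] - v[None, :, :], axis=2)
d_off = dists[~np.eye(n_filters, dtype=bool)]
dist_spread = d_off.std() / d_off.mean()     # (3) filters nearly equidistant
print(rel_spread, max_cos, dist_spread)
```

All three quantities are small, matching the sphere-like geometric structure described above.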

M THE DETAILS OF OTHER PRUNING CRITERIA

For notation, we denote the i-th convolutional filter in layer l as F_i^l and the input feature maps in layer l as I^l ∈ R^{N×I_l×H_l×W_l}, where N, I_l, H_l, W_l denote the training set size, number of channels, height, and width, respectively, with i = 1, 2, ..., λ_l and l = 1, 2, ..., L. The formulations of the filters' importance under each pruning criterion are as follows.

Norm-based criteria:
• ℓ1-Norm (Li et al., 2016): ||F_i^l||₁;
• ℓ2-Norm (Frankle & Carbin, 2019): ||F_i^l||₂.

BN-based criteria (Liu et al., 2017b):
• BN γ: |γ_i^l|, where γ_i^l is the scaling factor in the Batch Normalization layer l;
• BN β: |β_i^l|, where β_i^l is the shifting factor in the Batch Normalization layer l.

Activation-based criteria:
• APoZ (Hu et al., 2016): Σ_{p,q} 1((|I^l * F_i^l|)_{p,q} > σ) / (N×I_l×H_l×W_l), where we set σ = 0.0001 as in Luo & Wu (2017), 1(·) is the indicator function, * is the convolution operator, and I^l * F_i^l is the i-th output feature map;
• Entropy (Luo & Wu, 2017): we first compute G_i^l = GAP(I^l * F_i^l), where G_i^l ∈ R^{N×1} and GAP(·) is Global Average Pooling. Then we estimate a statistical distribution for G_i^l by dividing all elements of G_i^l into m bins. Let p_j be the probability of bin j; the importance score is −Σ_{j=1}^m p_j log p_j.

First-order Taylor-based criteria (Molchanov et al., 2016; 2019a;b):
• Taylor ℓ1-Norm: ||(∂loss/∂F_i^l) · F_i^l||₁;
• Taylor ℓ2-Norm: ||(∂loss/∂F_i^l) · F_i^l||₂.

The loss is the cross-entropy loss on a split of the original training set. All the settings of these experiments can be found at https://github.com/bearpaw/pytorch-classification. Specifically, for the pruning ratio:
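As a concrete illustration of the norm-based criteria above, here is a minimal layer-wise pruning sketch on a toy weight tensor (the shapes and pruning ratio are illustrative, not the paper's experimental settings):

```python
import numpy as np

rng = np.random.default_rng(4)
# a toy conv layer: 16 filters, each of shape (in_channels=8, 3, 3)
W = rng.normal(0.0, 0.05, size=(16, 8, 3, 3))

l1 = np.abs(W).sum(axis=(1, 2, 3))          # l1-norm importance (Li et al., 2016)
l2 = np.sqrt((W ** 2).sum(axis=(1, 2, 3)))  # l2-norm importance

ratio = 0.25
n_prune = int(len(W) * ratio)
order = np.argsort(l1)         # ascending importance
pruned = order[:n_prune]       # prune the least-important filters
kept = order[n_prune:]
print(sorted(pruned.tolist()))
```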

N ADDITIONAL EXPERIMENTS ABOUT IMAGE CLASSIFICATION

O ABOUT WEIGHT DECAY

Figure 22: KS test (Lilliefors, 1967) under different settings of weight decay. We train ResNet110 and WRN-28-10 on CIFAR100 with different weight-decay values (1e-3, 3e-4, and 0) and use the KS test to verify whether the parameters of different layers follow a normal distribution. From Fig. 22, we find: (1) when weight decay (wd) is non-zero, the normality is higher than when weight decay is 0; (2) even when weight decay is 0, the p-value can still be much greater than 0.05, which suggests that weight-decay regularization may not be the key reason for CWDA. The distributions of the parameters in these two networks (weight decay = 0) are shown in Fig. 24 and Fig. 23.
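The KS normality check used in Fig. 22 can be sketched without any test library by computing the Lilliefors-style KS statistic directly. Here, synthetic Gaussian weights stand in for a trained layer's flattened parameters (an assumption for illustration):

```python
import math
import numpy as np

rng = np.random.default_rng(5)
w = np.sort(rng.normal(0.0, 0.02, size=10_000))  # flattened layer weights

# KS statistic against N(mean, std) fitted to the sample,
# i.e. the statistic underlying the Lilliefors (1967) test
z = (w - w.mean()) / w.std()
cdf = np.array([0.5 * (1.0 + math.erf(t / math.sqrt(2.0))) for t in z])
n = len(w)
D = max(np.max(np.arange(1, n + 1) / n - cdf),
        np.max(cdf - np.arange(n) / n))
print(D)  # small D: consistent with a normal distribution
```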

P STATISTICAL TEST

In this section, according to Table 3 and Section 2.1, we run a series of statistical tests for the necessary conditions of CWDA. Let F_{ij} ∈ R^{N_i×k×k} represent the j-th filter of the i-th convolutional layer.

(1) Gaussian (i.e., to verify whether F_{ij} approximately follows a Gaussian-alike distribution). In the i-th layer, we use the Kolmogorov-Smirnov (KS) test (Lilliefors, 1967) to check whether all the weights in the same layer follow a normal distribution.

(2) Standard deviation (i.e., to verify whether the standard deviation of each filter in any layer tends to a constant c). Let σ_j denote the standard deviation of all the weights of filter F_{ij} in the i-th layer. We use Student's t test (Efron, 1969) to check whether the variance of these σ_j is small enough. The null hypothesis H₀ and the alternative hypothesis H₁ are:

H₀: Var(σ₁, σ₂, ..., σ_{N_i}) ≤ σ₀², H₁: Var(σ₁, σ₂, ..., σ_{N_i}) > σ₀²,

where N_i denotes the number of filters in the i-th layer and σ₀ is a given real number that is small enough, e.g., σ₀² = 0.0001.

(3) Mean (i.e., to verify whether the mean of F_{ij} is 0). Let the mean of all the weights in the same layer be µ. We use Student's t test (Efron, 1969) to check whether µ is close to 0. First, we check the upper bound (Mean-Left) of µ:

H₀: µ ≤ ε, H₁: µ > ε,

where ε is a small constant, e.g., ε = 0.01. Next, we check the lower bound (Mean-Right), with null and alternative hypotheses:

H₀: µ ≥ −ε, H₁: µ < −ε.

Of course, the p values for H₀: µ ≥ −ε and H₀: µ ≤ ε should be the same.

There are several notable points:
• In all the statistical tests, the confidence level is 0.95, ε = 0.01, and σ₀² = 0.0001.
• We use green to indicate that the convolutional filters in a layer pass the statistical test; conversely, red means the filters do not pass the tests.
• In most layers, the convolutional filters pass the statistical tests, except for a few layers at the front of the network. This phenomenon is consistent with the analysis in Section 5.1 and does not mean CWDA is untrue.
• Most of the experiments are image classification, except for the tests in Appendix P.7.
• p-value and c-value denote the p value and the critical value (confidence level 0.95), respectively; t-value is the t statistic in Student's t test. If the p-value is larger than the c-value, or the t-value is smaller than the c-value, we consider the filter to pass the tests.
• In fact, some of the experiments in Table 3 are



In Section 5, we make further discussion and analysis of the conditions for CWDA to be satisfied. The statistical tests for the situation with or without weight decay can be found in Appendix O.

Footnotes:
• Sp is a nonparametric measure of ranking correlation; it assesses how well the relationship between two variables can be described by a monotonic function, i.e., in this paper, the filters' ranking sequences in the same layer under two criteria.
• From the proof in Appendix L, the mean of ℓ1(F_i) is σ_i · k · √(2/π).
• The empirical result for slimming training (Liu et al., 2017b) is shown in Appendix B.
• https://docs.scipy.org/doc/numpy-1.13.0/reference/routines.random.html
• https://en.wikipedia.org/wiki/Stirling%27s_approximation
• Crooks, "Survey of simple, continuous, univariate probability distributions", and Wikipedia.
• https://en.wikipedia.org/wiki/Delta_method



Figure 1: The distribution of the ℓ2 norm (WRN)

Figure 2: (a-d) Visualization of correlation matrix F F T . More experiments on the different layers of other networks can be found in Appendix Q .(e) The structure of F ij .

Figure 3: Visualization of the distribution of convolutional filters. The parameters of a convolutional filter approximately follow a Gaussian-alike distribution.

Figure 5: Spearman's rank correlation coefficient (Sp) between different pruning criteria on several networks and datasets (more experiments can be found in Appendix R).

Figure 7: The Sp between different types of pruning criteria on VGG16 and ResNet56. For each type, we choose two representative criteria: (1) Norm-based: ℓ1 and ℓ2; (2) Importance-based: Taylor ℓ1 and Taylor ℓ2 (Molchanov et al., 2016; 2019a;b); (3) BN-based: BN γ and BN β (Liu et al., 2017b); (4) Activation-based: Entropy (Luo & Wu, 2017) and APoZ (Hu et al., 2016). The details of these criteria can be found in Appendix M.

It can be seen from Fig. 7(b) and Fig. 7(c) that the Sp between Taylor ℓ1 and Taylor ℓ2 is not large, but Taylor ℓ2 has strong similarity with both Norm-based criteria. Moreover, the Sp between BN γ and each Norm-based criterion exceeds 0.9, but it is not large in other layers (Fig. 7(b) and Fig. 7(d)). These phenomena are worthy of further study.
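Under the CWDA assumption, the strong Sp between the ℓ1 and ℓ2 rankings can be reproduced with a small simulation on synthetic Gaussian filters (sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(6)
n_filters, dim = 64, 2304  # e.g. 256 x 3 x 3 filters
v = rng.normal(0.0, 0.02, size=(n_filters, dim))  # CWDA-style filters

l1 = np.abs(v).sum(axis=1)           # l1-norm importance
l2 = np.sqrt((v ** 2).sum(axis=1))   # l2-norm importance

def spearman(a, b):
    # Spearman's rank correlation = Pearson correlation of the ranks
    ra = np.argsort(np.argsort(a)).astype(float)
    rb = np.argsort(np.argsort(b)).astype(float)
    return float(np.corrcoef(ra, rb)[0, 1])

sp = spearman(l1, l2)
print(sp)  # close to 1: the two criteria rank filters almost identically
```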

Figure 8: The visualization of Applicability problem for different types of criteria. (VGG16)

Figure 10: The magnitude of the importance measured by the ℓ1 and ℓ2 criteria. The actual magnitude almost coincides with the estimation obtained by CWDA.
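The CWDA-based magnitude estimates behind Fig. 10 follow from Propositions 2 and 3: E(ℓ1) = k·σ·√(2/π) and E(ℓ2) = √2·σ·Γ((k+1)/2)/Γ(k/2). A quick check of both closed forms on synthetic filters (k and σ are illustrative):

```python
import math
import numpy as np

rng = np.random.default_rng(7)
k, sigma = 576, 0.03  # e.g. a 64 x 3 x 3 filter
v = rng.normal(0.0, sigma, size=(50_000, k))

emp_l1 = np.abs(v).sum(axis=1).mean()
emp_l2 = np.sqrt((v ** 2).sum(axis=1)).mean()

est_l1 = k * sigma * math.sqrt(2.0 / math.pi)                 # Proposition 2
est_l2 = math.sqrt(2.0) * sigma * math.exp(
    math.lgamma((k + 1) / 2) - math.lgamma(k / 2))             # Proposition 3
print(emp_l1, est_l1, emp_l2, est_l2)
```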

Figure 11: The similarity for different type of pruning criteria while using global pruning with different start layer.

Figure 12: The distribution of the ℓ2 norm when the WRN is not trained well. Left: without data augmentation. Right: trained with uniform initialization for 6 epochs. As in Fig. 1, ℓ2 pruning still has Applicability problems when the network is not trained well.

Figure 16: The distribution of the convolutional filter (141 th Conv) with kaiming-uniform initialization for each epoch.

Figure 17: The approximation of Theorem 2: (Left) the example about ResNet56; (Right) the example about ResNet110.

Figure 20: The correlation matrix of convolutional filters. For clarity, we use the first 64 filters in each layer to calculate the Gram matrix.

Figure 21: The geometric structure of convolutional filters when the network has a large number of filters in each layer. For every pair of filters in one layer: (1) their ℓ2 norms are equivalent (||v1||₂ ≈ ||v2||₂ ≈ ||v3||₂); (2) they are equidistant (||v1 − v2||₂ ≈ ||v2 − v3||₂ ≈ ||v3 − v1||₂); (3) they are orthogonal (v1ᵀv2 ≈ v2ᵀv3 ≈ v3ᵀv1 ≈ 0).

Figure 23: The distribution of parameters in different convolutional filters (WRN-28-10, wd = 0).

Figure 24: The distribution of parameters in different convolutional filters (ResNet110, wd = 0).

repeated. These experiments are shown in the following tables. The (*) means that, for brevity, the repeated experiments are shown only once, on the experiment marked with (*).

Figure 25: The passing rate of the statistical test in Appendix P(2), where 0 < σ0 ≤ 0.1 and the passing rate is the ratio of the number of filters that pass the test to the total number of filters.
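The exact test of Appendix P(2) is not reproduced in this excerpt; as a hedged stand-in, the sketch below computes a passing rate using a one-sample Kolmogorov–Smirnov test of each flattened filter against N(0, σ0²). The filter sizes, σ0, and significance level are illustrative assumptions.

```python
import math
import numpy as np

def ks_stat(sample, sigma0):
    """One-sample KS statistic against the fully specified N(0, sigma0^2)."""
    x = np.sort(np.asarray(sample, dtype=float))
    n = len(x)
    cdf = np.array([0.5 * (1.0 + math.erf(v / (sigma0 * math.sqrt(2.0))))
                    for v in x])
    emp_hi = np.arange(1, n + 1) / n
    emp_lo = np.arange(0, n) / n
    return float(max(np.max(emp_hi - cdf), np.max(cdf - emp_lo)))

def passing_rate(filters, sigma0, alpha=0.05):
    """Fraction of filters whose KS statistic stays below the critical value."""
    n = filters.shape[1]
    crit = 1.36 / math.sqrt(n)   # asymptotic 5% critical value for a simple null
    passed = sum(ks_stat(f, sigma0) < crit for f in filters)
    return passed / len(filters)

rng = np.random.default_rng(3)
sigma0 = 0.05
filters = rng.normal(0.0, sigma0, size=(32, 1024))  # toy stand-in for trained filters
print(passing_rate(filters, sigma0))  # high: nearly all Gaussian filters pass
```

For genuinely Gaussian-alike filters the rate sits near 1 − α, which is the behavior the high passing rates in Fig. 25 reflect.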

Figure 26: The filters that cannot pass the statistical test in Appendix P(2). We record the convolutional filters that cannot pass the statistical test under all settings of σ0 in Fig. 25. It can be seen that the filters that fail the test are concentrated in the first few layers. This is consistent with the statement in Section 5.1.

Figure 27: Network Structure

Figure 28: Optimizer

Figure 29: Initialization

Figure 30: Attention mechanism

Figure 38: PyTorch pre-trained model

Norm-based pruning criteria.

Aaditya Prakash, James Storer, Dinei Florencio, and Cha Zhang. RePr: Improved training of convolutional filters. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 10666-10675, 2019.

Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.

The similarity when using different distributions.

The similarity when using a mixed distribution.

a_1, a_2, ..., a_k do not all belong to P, then there must exist n − k vectors p_1, p_2, ..., p_{n−k} from P such that (a_1, a_2, ..., a_k, p_1, p_2, ..., p_{n−k}) form a basis for the linear space V.

Proof. We use mathematical induction. First, we show that Lemma 6 holds for n − k = 1; that is, we need to find a vector p_1 ∈ P such that a_1, a_2, ..., a_k, p_1 are linearly independent. If no such p_1 exists, then every p ∈ P can be linearly represented by a_1, a_2, ..., a_k. In other words,

The accuracy (%) of several networks and datasets using different pruning criteria.

The repeated experiments in Table 3.

Cifar100 ResNet164

Cifar100 VGG19

Cifar100 ASGD-ResNet164

Cifar100 kaiming-uniform-ResNet164

Cifar100 Xavier-normal-ResNet164

Cifar100 Xavier-uniform-ResNet164

AlphaGAN matting

Cifar10 VGG19

Appendix

In this section, we use more networks (PyTorch pre-trained models⁹) to verify the observations in Fig. 2(a-d). We show the results of three layers for each network.

Q.1 VGG16

⁹ https://pytorch.org/docs/stable/torchvision/models.html

There are several additional experiments of Fig. 5 in the following figures. They show the Spearman's rank correlation coefficient (Sp) among ℓ1 pruning, ℓ2 pruning, and GM pruning for all the experiments in Table 3. These experiments further visualize the strong similarities of these pruning methods in different situations.

