RETHINKING THE PRUNING CRITERIA FOR CONVOLUTIONAL NEURAL NETWORK

Anonymous

Abstract

Channel pruning is a popular technique for compressing convolutional neural networks (CNNs), and various pruning criteria have been proposed to remove the redundant filters of CNNs. From our comprehensive experiments, we find two blind spots of pruning criteria: (1) Similarity: there are strong similarities among several primary pruning criteria that are widely cited and compared. Under these criteria, the ranks of filters' importance in a convolutional layer are almost the same, resulting in similar pruned structures. (2) Applicability: for a large network (where each convolutional layer has a large number of filters), some criteria cannot distinguish network redundancy well through the measured filters' importance. In this paper, we theoretically validate these two findings under our assumption that the well-trained convolutional filters in each layer approximately follow a Gaussian-alike distribution. This assumption is verified through systematic and extensive statistical tests.

1. INTRODUCTION

Pruning (LeCun et al., 1990; Hassibi & Stork, 1993; Han et al., 2015; He et al., 2019) a trained neural network is a common approach to network compression. In particular, for neural networks with convolutional filters, channel pruning refers to pruning the filters in the convolutional layers. There are several critical factors for channel pruning. Procedures. One-shot methods (Li et al., 2016): train a network from scratch; use a certain criterion to calculate the filters' importance and prune the filters with small importance; after additional training, the pruned network can recover its accuracy to some extent. Iterative methods (He et al., 2018; Frankle & Carbin, 2019): unlike one-shot methods, these prune and fine-tune a network alternately. Criteria. The filters' importance is defined by a given criterion. Starting from different ideas, many types of pruning criteria have been proposed, such as Norm-based (Li et al., 2016), Activation-based (Hu et al., 2016; Luo & Wu, 2017), Importance-based (Molchanov et al., 2016; 2019a), BN-based (Liu et al., 2017b), and so on. Strategy. Layer-wise pruning: in each layer, we sort the filters by the importance measured by a given criterion and prune those with small importance. Global pruning: different from layer-wise pruning, global pruning sorts the filters from all layers by their importance and prunes them.

Table 1.
Criterion   ResNet18 (12th Conv)                    VGG16 (3rd Conv)
ℓ1          [111, 212, 33, 61, 68, 152, 171, 45]    [102, 28, 9, 88, 66, 109, 86, 45]
ℓ2          [111, 33, 212, 61, 171, 42, 243, 129]   [102, 28, 88, 9, 109, 66, 86, 45]
GM          [111, 212, 33, 61, 68, 45, 171, 42]     [102, 28, 9, 88, 109, 66, 45, 86]
Fermat      [111, 212, 33, 61, 45, 171, 42, 68]     [102, 28, 88, 9, 109, 66, 45, 86]

As one of the simplest and most effective channel pruning criteria, ℓ1 pruning (Li et al., 2016) is widely used in industry. The core idea of this criterion is to sort the ℓ1 norms of the filters in one layer and then prune the filters with small ℓ1 norm.
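The layer-wise ℓ1 procedure described above can be sketched in a few lines. This is an illustrative sketch, not the exact implementation of Li et al. (2016); the function name, the (N_out, N_in, k, k) weight layout, and the pruning ratio are our own illustrative choices.

```python
import numpy as np

def l1_prune_indices(filters, prune_ratio=0.5):
    """Rank the filters of one conv layer by their l1 norm and return
    the indices of the filters to prune (smallest norms first).

    `filters` has shape (N_out, N_in, k, k), matching F_ij in the text.
    """
    n_out = filters.shape[0]
    # l1 importance of each filter: sum of the absolute weights
    importance = np.abs(filters).reshape(n_out, -1).sum(axis=1)
    n_prune = int(n_out * prune_ratio)
    # indices of the least important filters
    return np.argsort(importance)[:n_prune]

# Example: a layer with 8 filters, 3 input channels, 3x3 kernels
rng = np.random.default_rng(0)
layer = rng.normal(size=(8, 3, 3, 3))
pruned = l1_prune_indices(layer, prune_ratio=0.25)
```

In the one-shot procedure, the rows of the weight tensor selected by `pruned` (and the corresponding input channels of the next layer) would then be removed before fine-tuning.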
Similarly, there is ℓ2 pruning (Frankle & Carbin, 2019; He et al., 2018). Through a study of the distribution of norms, He et al. (2019) demonstrate that these criteria should satisfy two conditions: (1) the variance of the norms of the filters cannot be too small; (2) the minimum norm of the filters should be small enough. Since these two conditions do not always hold, a new criterion considering the relative importance of the filters (ℓ1 and ℓ2 pruning can be seen as algorithms that use the absolute importance of filters) is proposed (He et al., 2019). Since this criterion uses the Fermat point (i.e., the geometric median (Cohen et al., 2016)), we call this method Fermat. However, due to the high cost of computing the Fermat point, He et al. (2019) relaxed it and obtained another criterion, the GM method. Let F_ij ∈ R^{N_i×k×k} denote the j-th filter of the i-th convolutional layer, where N_i is the number of input channels of the i-th layer and k denotes the kernel size of the convolutional filter. In the i-th layer, there are N_{i+1} filters. The details of these criteria are shown in Table 2; for example, the Fermat criterion measures the importance of F_ij as ||F_ij − F̄||_2, where F̄ denotes the Fermat point of the F_ij in Euclidean space, and the GM criterion relaxes this to Σ_{k=1}^{N_{i+1}} ||F_ik − F_ij||_2. These four pruning criteria are called Norm-based pruning in this paper as they all include a norm in their design. Previous works (Luo et al., 2017; Han et al., 2015; Ding et al., 2019; Dong et al., 2017; Renda et al., 2020), including the criteria mentioned above, usually focus on (a) how much the model is compressed; (b) how much performance is restored; (c) the inference efficiency of the pruned network; and (d) the cost of finding the pruned network. However, there is little work discussing two blind spots of the pruning criteria: (1) Similarity: What are the actual differences among these previous pruning criteria? Using VGG16 and ResNet18 on ImageNet, we show the ranks of filters' importance under different criteria in Table 1.
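The GM criterion's distance sum can be sketched as follows. This is a minimal sketch of the relaxed criterion of He et al. (2019): each filter's importance is its total Euclidean distance to the other filters in the layer, so filters near the layer's geometric median score low. The function name and the quadratic-memory pairwise computation are our own illustrative choices.

```python
import numpy as np

def gm_importance(filters):
    """GM-style relative importance of each filter F_ij in one layer:
    sum_k ||F_ik - F_ij||_2 over all filters k of the same layer.
    Filters with a small total distance lie near the geometric median
    and are considered redundant (pruned first)."""
    flat = filters.reshape(filters.shape[0], -1)
    # pairwise Euclidean distances between all filters in the layer
    diff = flat[:, None, :] - flat[None, :, :]
    dists = np.linalg.norm(diff, axis=2)
    return dists.sum(axis=1)

# A layer with two identical filters and one distinct filter:
# the duplicated filters get a smaller (more prunable) score.
f = np.zeros((3, 2, 2, 2))
f[2] += 1.0
scores = gm_importance(f)
```

Note the contrast with ℓ1/ℓ2 pruning: the score of a filter here depends on the other filters in the layer (relative importance), not on the filter's own magnitude (absolute importance).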
It is easy to see that they have almost the same sequence, leading to similar pruned structures. In this situation, the criteria using the absolute importance of filters (ℓ1, ℓ2) and the criteria using the relative importance of filters (Fermat, GM) may not be significantly different. (2) Applicability: What is the applicability of these pruning criteria for pruning CNNs? Consider a toy example w.r.t. the ℓ2 criterion. If the ℓ2 norms (regarded as importance) of the filters in one layer are 0.9, 0.8, 0.4 and 0.01, then according to the smaller-norm-less-informative assumption (Ye et al., 2018), it is easy to see that we should prune the last filter. However, if the norms are close, e.g., 0.91, 0.92, 0.93, 0.92, it is hard to know which filter should be pruned even though the first one is the smallest. As shown in Fig. 1, taking Wide ResNet28-10 (WRN) as an example, this is a real and existing problem when pruning a network with the ℓ2 criterion. Some similar research methods and opinions on pruning criteria are mentioned in He et al. (2019) and Molchanov et al. (2019a); we make further analysis and observations of them in this paper. Since the criteria in Table 2 are widely cited and compared (Liu et al., 2020b; Li et al., 2020b; He et al., 2020; Liu et al., 2020a; Li et al., 2020a), it is important to study them with respect to the above two issues. To study them rigorously: first, in Section 2, we propose an assumption about the distribution of the parameters of the convolutional filters, called the Convolution Weight Distribution Assumption (CWDA), supported by systematic and comprehensive statistical tests (Appendix P). Next, in Sections 3 and 4, we theoretically verify the Similarity and Applicability issues for some Norm-based criteria in layer-wise pruning. Last but not least, in Section 5, we discuss further issues, including: (1) the conditions under which CWDA holds, (2) Similarity and Applicability under other types of pruning criteria, and (3) another pruning strategy, global pruning.
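The toy example above can be made concrete with a simple diagnostic: when the coefficient of variation (std/mean) of the layer's norms is tiny, the ℓ2 criterion has little to discriminate on. The threshold 0.05 below is an illustrative choice of ours, not a value from the paper.

```python
import numpy as np

def norms_are_discriminative(norms, cv_threshold=0.05):
    """Heuristic check of whether a layer's filter norms are spread
    out enough for a norm-based criterion to be meaningful.
    Uses the coefficient of variation (std / mean) of the norms."""
    norms = np.asarray(norms, dtype=float)
    return norms.std() / norms.mean() > cv_threshold

# The two cases from the toy example in the text:
clear = norms_are_discriminative([0.9, 0.8, 0.4, 0.01])    # well separated
unclear = norms_are_discriminative([0.91, 0.92, 0.93, 0.92])  # nearly equal
```

For the first layer the check passes (one filter is clearly redundant); for the second it fails, mirroring the Fig. 1 situation where the ℓ2 norms of WRN's filters concentrate in a narrow band.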

Contribution. (1) We propose and verify an assumption called CWDA, which reveals that well-trained convolutional filters approximately follow a Gaussian-alike distribution.
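As a rough illustration of what "Gaussian-alike" means operationally, the moment check below tests whether flattened filter weights have near-zero skewness and near-zero excess kurtosis. This heuristic and its tolerances are our own illustrative choices; the actual verification of CWDA uses the formal statistical tests of Appendix P.

```python
import numpy as np

def gaussian_alike(weights, skew_tol=0.5, kurt_tol=1.0):
    """Crude moment-based check that conv weights look Gaussian-alike:
    sample skewness and excess kurtosis of the flattened weights are
    both close to zero (as they are for a normal distribution)."""
    w = np.asarray(weights, dtype=float).ravel()
    z = (w - w.mean()) / w.std()
    skew = np.mean(z ** 3)          # 0 for a Gaussian
    kurt = np.mean(z ** 4) - 3.0    # 0 for a Gaussian
    return abs(skew) < skew_tol and abs(kurt) < kurt_tol

rng = np.random.default_rng(1)
passes = gaussian_alike(rng.normal(size=(64, 3, 3, 3)))       # Gaussian weights
fails = gaussian_alike(rng.exponential(size=(64, 3, 3, 3)))   # heavily skewed
```

Applied to the layers of a well-trained network, CWDA predicts that a check of this kind should pass for the convolutional weight tensors.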



Figure 1: The distribution of the ℓ2 norm (WRN).

Table 1: The pruned filters' indices, ordered by the filters' importance under the given pruning criteria, taking VGG16 (3rd Conv) and ResNet18 (12th Conv) as examples. The pruned filters' indices (the ranks of the filters' importance) are almost the same under different pruning criteria, which leads to similar pruned structures.

Table 2: Norm-based pruning criteria.

