RETHINKING THE PRUNING CRITERIA FOR CONVOLUTIONAL NEURAL NETWORK

Anonymous

Abstract

Channel pruning is a popular technique for compressing convolutional neural networks (CNNs), and various pruning criteria have been proposed to remove the redundant filters of CNNs. From our comprehensive experiments, we found some blind spots in existing pruning criteria: (1) Similarity: There are strong similarities among several primary pruning criteria that are widely cited and compared. According to these criteria, the ranks of filters' importance in a convolutional layer are almost the same, resulting in similar pruned structures. (2) Applicability: For a large network (in which each convolutional layer has a large number of filters), some criteria cannot distinguish the network redundancy well from their measured filters' importance. In this paper, we theoretically validate these two findings under our assumption that the well-trained convolutional filters in each layer approximately follow a Gaussian-like distribution. This assumption is verified through systematic and extensive statistical tests.
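The Gaussian-like assumption on trained filter weights can be checked empirically. A minimal sketch below uses the Shapiro-Wilk test as one possible normality check; the `gaussian_like` helper and the synthetic weights are illustrative assumptions, not the paper's exact testing procedure:

```python
import numpy as np
from scipy.stats import shapiro

def gaussian_like(weights, alpha=0.05):
    """Shapiro-Wilk normality test on the flattened weights of one conv layer.

    Returns (passes, W-statistic, p-value); `passes` means the test fails to
    reject normality at significance level `alpha`.
    """
    stat, p = shapiro(np.asarray(weights).ravel())
    return p > alpha, stat, p

# Stand-in for one trained layer's weights (64 filters of shape 3x3x3);
# in practice one would pass e.g. `conv.weight.detach().cpu().numpy()`.
rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.02, size=(64, 3, 3, 3))
print(gaussian_like(w))
```

For very large layers, a Kolmogorov-Smirnov or Anderson-Darling test may be preferable, since the Shapiro-Wilk test is designed for moderate sample sizes.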

1. INTRODUCTION

Pruning (LeCun et al., 1990; Hassibi & Stork, 1993; Han et al., 2015; He et al., 2019) a trained neural network is a common approach to network compression. In particular, for networks with convolutional filters, channel pruning refers to pruning the filters of the convolutional layers. Several factors are critical for channel pruning.

Procedures. One-shot methods (Li et al., 2016): train a network from scratch; use a given criterion to measure the filters' importance, and prune the filters with small importance; after additional training, the pruned network can recover its accuracy to some extent. Iterative methods (He et al., 2018; Frankle & Carbin, 2019): unlike one-shot methods, these prune and fine-tune a network alternately.

Criteria. The filters' importance is defined by a given criterion. Starting from different ideas, many types of pruning criteria have been proposed, such as norm-based (Li et al., 2016), activation-based (Hu et al., 2016; Luo & Wu, 2017), importance-based (Molchanov et al., 2016; 2019a), and BN-based (Liu et al., 2017b) criteria.

Strategy. Layer-wise pruning: within each layer, sort the filters by the importance measured by a given criterion and prune those with small importance. Global pruning: in contrast, sort the filters from all layers by their importance and prune across the whole network.

Table 1: The pruned filters' indices ordered by the filters' importance under the given pruning criteria, taking ResNet18 (12th conv layer) and VGG16 (3rd conv layer) as examples. The pruned filters' indices (the ranks of the filters' importance) are almost the same across the different pruning criteria, which leads to similar pruned structures.

Criterion | ResNet18 (12th Conv)                 | VGG16 (3rd Conv)
l1        | [111, 212, 33, 61, 68, 152, 171, 45] | [102, 28, 9, 88, 66, 109, 86, 45]
l2        | [111, 33, 212, 61, 171, 42, 243, 129]| [102, 28, 88, 9, 109, 66, 86, 45]
GM        | [111, 212, 33, 61, 68, 45, 171, 42]  | [102, 28, 9, 88, 109, 66, 45, 86]
Fermat    | [111, 212, 33, 61, 45, 171, 42, 68]  | [102, 28, 88, 9, 109, 66, 45, 86]

As one of the simplest and most effective channel pruning criteria, l1 pruning (Li et al., 2016) is widely used in industry. The core idea of this criterion is to sort the filters of one layer by their l1 norms and then prune the filters with small l1 norm. Similarly, there is l2 pruning (Frankle & Carbin, 2019; He et al., 2018). Through a study of the distribution of filter norms, He et al. (2019) demonstrate that these criteria implicitly require two conditions: (1) the variance of the filter norms cannot be too small; (2) the minimum of the filter norms should be close to zero.
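As a concrete illustration of layer-wise l1 pruning, the sketch below ranks the filters of one convolutional layer by their l1 norms and returns the indices of the filters to prune; the `l1_prune_indices` helper and the toy weights are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def l1_prune_indices(weights, prune_ratio):
    """Rank the filters of one conv layer by l1 norm; return indices to prune.

    weights: array of shape (out_channels, in_channels, k, k).
    """
    norms = np.abs(weights).sum(axis=(1, 2, 3))  # one l1 norm per filter
    n_prune = int(prune_ratio * weights.shape[0])
    order = np.argsort(norms)                    # ascending: smallest norms first
    return order[:n_prune]

# Toy layer: four 2x3x3 filters constructed so their l1 norms are 1, 2, 3, 4.
w = np.stack([np.full((2, 3, 3), c / 18.0) for c in [1.0, 2.0, 3.0, 4.0]])
print(l1_prune_indices(w, 0.5))  # -> [0 1], the two smallest-norm filters
```

Global pruning differs only in that the norms of all layers are concatenated and sorted together, which requires the norm distributions of different layers to be comparable.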

