AN OPERATOR NORM BASED PASSIVE FILTER PRUNING METHOD FOR EFFICIENT CNNS

Anonymous

Abstract

Convolutional neural networks (CNNs) have shown state-of-the-art performance in various applications. However, CNNs are resource-hungry due to their high computational complexity and memory storage requirements. Recent efforts toward achieving computational efficiency in CNNs involve filter pruning methods that eliminate some of the filters in CNNs based on the "importance" of the filters. Existing passive filter pruning methods typically use the entry-wise norm of the filters to quantify filter importance, without considering how well a filter contributes to producing the node output. Under a high pruning ratio, where a large number of filters are to be pruned from the network, the entry-wise norm methods always select high entry-wise norm filters as important and ignore the diversity learned by the other filters, which may result in performance degradation. To address this, we present a passive filter pruning method in which filters are pruned based on their contribution to producing the output, by implicitly considering the operator norm of the filters. The computational cost and memory requirement are reduced significantly by eliminating filters and their corresponding feature maps from the network. Accuracy similar to that of the original network is recovered by fine-tuning the pruned network. The proposed pruning method gives similar or better performance than entry-wise norm-based pruning methods at various pruning ratios. The efficacy of the proposed pruning method is evaluated on audio scene classification (e.g. TAU Urban Acoustic Scenes 2020) and image classification (MNIST handwritten digit classification, CIFAR-10).

1. INTRODUCTION

Convolutional neural networks (CNNs) have shown great success and exhibit state-of-the-art performance compared to traditional hand-crafted methods in many domains (Gu et al. (2018)). Even though CNNs are highly effective at solving complex non-linear tasks (Denton et al. (2014)), it may be challenging to deploy large-size CNNs on resource-constrained devices such as mobile phones or internet of things (IoT) devices, owing to their high computational costs during inference and their memory requirements (Simonyan & Zisserman (2015); Krizhevsky et al. (2012)). Thus, the issue of reducing the size and the computational cost of CNNs has drawn a significant amount of attention in the research community. Recent efforts toward reducing the computational complexity of CNNs involve pruning methods, in which a set of parameters, such as weights or filters, is eliminated from the CNN. These pruning methods are motivated by the existence of redundant parameters (Denil et al. (2013); Livni et al. (2014)) in CNNs that only add extra computation without contributing much to performance (Frankle & Carbin (2019)). For example, Li et al. (2017) found that 64% of the parameters, contributing approximately 34% of the computation time, are redundant. Eliminating such redundant parameters yields small CNNs that perform similarly to the original CNNs while reducing the computation and the memory requirement. While eliminating individual weights from an unpruned CNN may result in a highly sparse network with few parameters, the resulting pruned network is unstructured and is not straightforward to run more efficiently. The practical acceleration achievable with unstructured sparse pruned networks is limited, despite high sparsity, due to their random connectivity (Luo et al. (2017)).
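The distinction between unstructured and structured pruning can be made concrete with a small NumPy sketch (an illustrative toy layer, not a model from this paper): unstructured pruning zeroes individual weights but leaves the tensor shape intact, whereas filter pruning removes whole output channels and genuinely shrinks the layer.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy conv layer: 8 filters, 3 input channels, 3x3 kernels.
W = rng.standard_normal((8, 3, 3, 3))

# Unstructured pruning: zero the 50% smallest-magnitude weights.
# The tensor keeps its shape, so dense libraries gain no speed-up.
thresh = np.quantile(np.abs(W), 0.5)
W_unstructured = np.where(np.abs(W) < thresh, 0.0, W)
print(W_unstructured.shape)  # (8, 3, 3, 3): same shape, just sparse

# Structured (filter) pruning: drop whole filters, here the 4 with the
# smallest entry-wise l2-norm, shrinking the layer to 4 output channels.
norms = np.linalg.norm(W.reshape(8, -1), axis=1)
keep = np.argsort(norms)[-4:]
W_structured = W[keep]
print(W_structured.shape)  # (4, 3, 3, 3): a genuinely smaller layer
```

The structured variant reduces both multiply-accumulate operations and the feature maps passed to the next layer, which is why it needs no specialised sparse kernels.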
Moreover, unstructured sparse networks cannot be supported by off-the-shelf libraries and require specialised software or hardware for speed-up (Wen et al. (2016); Han et al. (2016)).

Figure 1: A geometrical view of the output produced by a convolution operation, where input feature maps X in R^2 are transformed to output feature maps Y in R^2 using a transformation matrix F. F is decomposed into a left singular matrix (U), a right singular matrix (W), and a diagonal matrix (Σ) using singular value decomposition. U and W are orthogonal matrices that rotate the input, and σ_1 and σ_2 are singular values that scale the input. F stretches X maximally by ||F|| = σ_1, which is the operator norm of F.

To address this unstructured pruning problem, several filter pruning methods have been proposed that eliminate whole filters, resulting in a structured pruned network (Luo et al. (2018)) that does not require additional resources for speed-up. In these structured filter pruning methods, the "importance" of a filter, used to decide whether a filter is retained or eliminated, is measured using either active or passive methods. Active filter pruning methods involve a dataset to compute the importance of filters. For example, (Luo & Wu (2017); Polyak & Wolf (2015); Hu et al. (2016); Lin et al. (2020); Yeom et al. (2021)) use feature maps, the outputs generated from CNN filters for a set of examples, and apply metrics such as entropy, variance, the average rank of feature maps, and the average percentage of zeros in the feature maps to quantify the importance of the filters. Other methods, including (Liu et al. (2017); Lin et al. (2019)), compute the important set of filters during the training process by optimizing a soft mask associated with each feature map using regularization. However, these methods are time-consuming and use extra memory resources to obtain the feature maps. On the other hand, passive filter pruning methods use only the parameters of the filters, without involving any dataset or optimization process, to compute the importance of the filters. Therefore, passive filter pruning methods are less time-consuming and require fewer memory resources to identify the important set of filters. In particular, when a pre-trained network to be pruned already exists, identifying the pruned set of filters through an optimization process (Liu et al. (2017); Lin et al. (2019)) would be heavily computationally expensive compared to passive filter pruning.

A typical passive filter pruning method uses an entry-wise norm of the filters to measure their importance: for example, an l1-norm (the sum of the absolute values of each entry in the filter) or an l2-norm (the square root of the sum of the squared entries in the filter). Li et al. (2017) eliminate filters having a smaller entry-wise l1-norm or l2-norm as measured from the origin, and find that eliminating filters based on the entry-wise l2-norm gives performance similar to that of the entry-wise l1-norm. He et al. (2019) eliminate the filters with a smaller l2-norm as measured from the geometric median of all filters. Both methods assume that a filter with a smaller entry-wise norm is less informative or less important, without considering how significantly a filter contributes to producing the output. For example, an illustration of the contribution of a filter in producing output is shown in Figure 1, where the filter F produces an output Y by maximally stretching the input X by the largest singular value σ_1, which is the operator norm of F. However, entry-wise norm methods do not consider any input-output relationship and rely only on the individual entries of a filter when computing its importance. To illustrate this further, Figure 2(a) shows pictorially that two filters F_1 and F_3 having the same entry-wise norm contribute differently and produce different outputs, owing to the different operator norms of the filters shown in Figure 2(b). Hence, such filters should be given different importance. Moreover, when only a few filters are to be retained in the CNN to yield a very small pruned CNN, selecting only filters with a high entry-wise norm may ignore smaller-norm filters that may also contribute significantly to producing the output (Ye et al. (2018)). This may degrade the accuracy of the pruned network.
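The gap between entry-wise norms and the operator norm can be verified numerically. The sketch below uses two illustrative 2x2 matrices (chosen for this example, not the actual filters of Figure 2): they have identical entry-wise l1- and l2-norms, yet their operator norms (largest singular values) differ, so they stretch inputs by different amounts.

```python
import numpy as np

# Two "filters" (as 2x2 matrices) with identical entry-wise norms.
F1 = np.array([[1.0, 0.0],
               [0.0, 1.0]])   # identity: scales every input by 1
F3 = np.array([[1.0, 1.0],
               [0.0, 0.0]])   # rank-1: collapses inputs onto one direction

# Entry-wise l1- and l2-norms agree, so norm-based pruning would
# assign the two filters equal importance.
print(np.abs(F1).sum(), np.abs(F3).sum())      # 2.0 2.0
print(np.linalg.norm(F1), np.linalg.norm(F3))  # both sqrt(2) ~ 1.414

# Operator norm = largest singular value sigma_1: the maximum factor
# by which the filter can stretch any input.
sigma1_F1 = np.linalg.svd(F1, compute_uv=False)[0]
sigma1_F3 = np.linalg.svd(F3, compute_uv=False)[0]
print(sigma1_F1, sigma1_F3)  # 1.0 vs sqrt(2) ~ 1.414
```

Ranking by operator norm separates the two filters, whereas ranking by either entry-wise norm cannot, which is the behaviour the proposed method exploits.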

