RECEDING NEURON IMPORTANCES FOR STRUCTURED PRUNING

Abstract

Structured pruning efficiently compresses networks by identifying and removing unimportant neurons. While this can be elegantly achieved by applying sparsity-inducing regularisation to BatchNorm parameters, an L1 penalty would shrink all scaling factors rather than just those of superfluous neurons. To tackle this issue, we introduce a simple BatchNorm variation with bounded scaling parameters, based on which we design a novel regularisation term that suppresses only neurons with low importance. Under our method, the weights of unnecessary neurons effectively recede, producing a polarised bimodal distribution of importances. We show that neural networks trained this way can be pruned to a larger extent and with less deterioration. We one-shot prune VGG and ResNet architectures at different ratios on the CIFAR and ImageNet datasets. In the case of VGG-style networks, our method significantly outperforms existing approaches, particularly under severe pruning.

1. INTRODUCTION

Modern deep neural network architectures (Simonyan & Zisserman, 2014; He et al., 2016) achieve state-of-the-art performance but require significant computational resources, which makes their deployment onto edge devices difficult. Even though it has been shown that it is possible to train less over-parametrised models from scratch and obtain similar performance (Frankle & Carbin, 2018), it remains a non-trivial task to actually find such a winning subnetwork. In this work, we focus on structured one-shot pruning as a means of network compression, which is typically composed of three stages: i) training a large model to convergence, ii) removing parameters with low importance, and iii) fine-tuning the remaining network.

Unstructured pruning, which works at the weight level (Yang et al., 2019; Frankle & Carbin, 2018), can remove a much higher number of parameters but produces sparse weight matrices which cannot be efficiently utilised without specialised hardware (Han et al., 2016). In contrast, by removing entire neurons, structured pruning finds efficient structures akin to an implicit architecture search (Liu et al., 2018). Structured pruning methods attribute an importance score to each neuron, which enables their ranking and ultimately the decision of which to dispose of (Li et al., 2016; Molchanov et al., 2016a). To this end, BatchNorm layers (Ioffe & Szegedy, 2015) become very appealing as they explicitly learn parameters which uniformly scale the outputs of each neuron. This scaling parameter can be used as a proxy for the importance the model attributes to a neuron, as a value of zero would effectively suppress an output. Furthermore, one can regularise these layers to obtain neuron-level sparsity whilst maintaining classification performance (Liu et al., 2017; Zhuang et al., 2020). Such methods typically define neuron importance as the absolute value of its scaling parameter, an approach which limits the design of regularisers.
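To make the limitation concrete, the sketch below illustrates the baseline behaviour described above: an L1 sparsity penalty on BatchNorm scaling factors in the style of Liu et al. (2017). This is a minimal NumPy illustration with hypothetical scaling-factor values, not the method proposed in this work; it shows that the L1 subgradient has the same magnitude for every channel, so high-importance scaling factors are shrunk just as strongly as near-zero ones.

```python
import numpy as np

# Hypothetical BatchNorm scaling factors (gamma) for one layer:
# two clearly "important" channels and two near-zero "unimportant" ones.
gamma = np.array([1.8, 0.9, 0.05, 0.01])

# L1 sparsity penalty on the scaling factors (lambda is illustrative).
lam = 1e-2
penalty = lam * np.abs(gamma).sum()

# Subgradient of the L1 term with respect to gamma: lam * sign(gamma).
# Its magnitude is constant across channels, independent of importance,
# so every scaling factor is pushed towards zero at the same rate.
grad = lam * np.sign(gamma)

print(penalty)
print(grad)  # [0.01 0.01 0.01 0.01]
```

Ranking channels by |gamma| and removing the lowest-ranked ones still works under this penalty, but because the shrinkage is uniform, important and unimportant scaling factors are not pulled apart; this motivates the bounded, importance-aware regulariser developed in the remainder of the paper.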
Because the measure is only half-bounded, one cannot easily define levels of importance without looking at the overall distribution, making it difficult to target specific neurons. An example is the L1 regulariser (Liu et al., 2017), which shrinks all parameters with a constant gradient, even ones with high importance. Ideally, one would design a regulariser which creates sparsity by shrinking only unimportant neurons, leaving the others untouched. In this work, we create such a regulariser and show it outperforms existing approaches at a rate that increases with the number of neurons pruned. Our contributions are two-fold: we first introduce a simple variation of BatchNorm, which linearly transforms channels using bounded scaling factors. This

