RECEDING NEURON IMPORTANCES FOR STRUCTURED PRUNING

Abstract

Structured pruning efficiently compresses networks by identifying and removing unimportant neurons. While this can be elegantly achieved by applying sparsity-inducing regularisation on BatchNorm parameters, an L1 penalty would shrink all scaling factors rather than just those of superfluous neurons. To tackle this issue, we introduce a simple BatchNorm variation with bounded scaling parameters, based on which we design a novel regularisation term that suppresses only neurons with low importance. Under our method, the weights of unnecessary neurons effectively recede, producing a polarised bimodal distribution of importances. We show that neural networks trained this way can be pruned to a larger extent and with less deterioration. We one-shot prune VGG and ResNet architectures at different ratios on CIFAR and ImageNet datasets. In the case of VGG-style networks, our method significantly outperforms existing approaches, particularly under severe pruning.

1. INTRODUCTION

Modern deep neural network architectures (Simonyan & Zisserman, 2014; He et al., 2016) achieve state-of-the-art performance but require significant computational resources, which makes their deployment onto edge devices difficult. Even though it has been shown that it is possible to train less over-parametrised models from scratch and obtain similar performance (Frankle & Carbin, 2018), it remains a non-trivial task to actually find such a winning subnetwork. In this work, we focus on structured one-shot pruning as a means of network compression, which is typically composed of three stages: (i) training a large model to convergence, (ii) removing parameters with low importance, and (iii) fine-tuning the remaining network. Unstructured pruning, which works at the weight level (Yang et al., 2019; Frankle & Carbin, 2018), can remove a much higher number of parameters but produces sparse weight matrices which cannot be efficiently utilised without specialised hardware (Han et al., 2016). In contrast, by removing entire neurons, structured pruning finds efficient structures akin to an implicit architecture search (Liu et al., 2018).

Structured pruning methods attribute an importance score to each neuron, which enables their ranking and ultimately the decision of which to dispose of (Li et al., 2016; Molchanov et al., 2016a). To this end, BatchNorm layers (Ioffe & Szegedy, 2015) become very appealing, as they explicitly learn parameters which uniformly scale the outputs of each neuron. This scaling parameter can be used as a proxy for the importance the model attributes to a neuron, as a value of zero would effectively suppress an output. Furthermore, one can regularise these layers to obtain neuron-level sparsity whilst maintaining classification performance (Liu et al., 2017; Zhuang et al., 2020). Such methods typically define neuron importance as the absolute value of its scaling parameter, an approach which limits the design of regularisers.
Because the measure is only half-bounded, one cannot easily define levels of importance without looking at the overall distribution, making it difficult to target specific neurons. An example is the L1 regulariser (Liu et al., 2017), which shrinks all parameters with a constant gradient, even ones with high importance. Ideally, one would design a regulariser which creates sparsity by shrinking only unimportant neurons, leaving the others untouched. In this work, we create such a regulariser and show it outperforms existing approaches at a rate that increases with the number of neurons pruned. Our contributions are two-fold: we first introduce a simple variation of BatchNorm which linearly transforms channels using bounded scalers. This layer maintains the same performance as the original, while offering a bounded importance score for neurons. Building on this measure, we then define a novel regularisation focused on shrinking only neurons of lower importance, by having its gradient decay exponentially for higher importances. Our method significantly outperforms related approaches for VGG models, and we show that severe degradation can be attributed to over-pruning early layers of the network.
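The intuition behind such a regulariser can be sketched numerically. The following toy numpy example assumes a bounded importance s = sigmoid(θ) per neuron and an illustrative penalty R(s) = (1 − exp(−c·s))/c, chosen so that its gradient with respect to s equals exp(−c·s); the exact functional form used in this paper is defined later, and the names below are ours:

```python
import numpy as np

def sigmoid(theta):
    return 1.0 / (1.0 + np.exp(-theta))

def receding_penalty(theta, c=5.0):
    """Illustrative penalty over bounded importances s = sigmoid(theta) in (0, 1).

    R(s) = (1 - exp(-c * s)) / c has gradient dR/ds = exp(-c * s): a
    near-constant shrinking force on unimportant neurons that vanishes
    exponentially as importance grows.
    """
    s = sigmoid(theta)
    return np.sum((1.0 - np.exp(-c * s)) / c)

def receding_penalty_grad(theta, c=5.0):
    # Chain rule through the sigmoid: dR/dtheta = exp(-c * s) * s * (1 - s).
    s = sigmoid(theta)
    return np.exp(-c * s) * s * (1.0 - s)
```

Under this sketch, a low-importance neuron (θ = −2) receives a gradient roughly an order of magnitude larger than a high-importance one (θ = 2), so gradient descent recedes only the former, in contrast to the constant gradient of the L1 penalty.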

2. RELATED WORK

Neural network compression through pruning is most commonly divided into structured and unstructured approaches. Unstructured pruning has gained a lot of attention in recent years (Frankle & Carbin, 2018), as it challenges conventional wisdom over the role of over-parametrisation and weight initialisation in the optimisation of deep neural networks (Frankle et al., 2020). While these methods can achieve superior theoretical compression rates (Renda et al., 2020), their use remains impractical without specialised hardware that can take advantage of sparsity (Han et al., 2016). Structured pruning, on the other hand, removes entire neurons from an architecture, thus achieving real memory and computational efficiencies (Liu et al., 2018).

At the heart of structured pruning lies the task of identifying unimportant neurons to remove from a network. Quantifying importance can be based on numerous criteria, including filter norms (Li et al., 2016; He et al., 2018), reconstruction errors (He et al., 2017; Luo et al., 2017; Molchanov et al., 2016b; Yu et al., 2018), redundancy (He et al., 2019; Suau et al., 2020; Wang et al., 2018) and BatchNorm parameters (Liu et al., 2017; Zhuang et al., 2020). Our work belongs to the latter category, as we focus on deriving an importance score based solely on channel scaling parameters.

In addition to defining importance measures, one can add regularisation during training to nudge networks into utilising their capacity more sparingly. Most relevant to our work are methods which apply sparsity regularisations on BatchNorm parameters (Liu et al., 2017; Zhuang et al., 2020). The most popular approach is Network Slimming (Liu et al., 2017), which constrains the BatchNorm scaling parameters using the L1 penalty. A drawback of this method is that it shrinks all parameters with an equal gradient, irrespective of their importance. This issue is also addressed by Zhuang et al. (2020), who propose a regulariser that explicitly maximises the polarisation of the BatchNorm scaler distribution. While the method effectively increases the margin between important and unimportant neurons, it does so by both shrinking and expanding weights. Another method designed for non-linear shrinking is that of Yang et al. (2019), who propose the ratio of the L1 and L2 norms as a sparsity regulariser. Even though this method is not based on BatchNorm, but is targeted at filter weights, we include it in our comparison as it has a similar motivation to our work.

In terms of pruning setup, the most popular method is one-shot pruning (Li et al., 2016; Zhuang et al., 2020; Yang et al., 2019), where all desired neurons are removed at once and the remaining network is fine-tuned. Iterative approaches (Han et al., 2015) periodically prune and fine-tune until a target ratio is met. Recent works aim to eliminate the need for fine-tuning (Chen et al., 2021) altogether, or to prune models after initialisation in a data-free manner (Lee et al., 2018; Wang et al., 2020).
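As a concrete illustration of the L1/L2 ratio mentioned above, the measure can be computed in a few lines; this is a sketch of the general ratio, not the exact per-filter formulation of the cited work, and the function name is ours:

```python
import numpy as np

def l1_over_l2(w, eps=1e-12):
    # Scale-invariant sparsity measure: multiplying w by a constant leaves
    # the ratio unchanged, so minimising it rewards concentrating mass in
    # few entries rather than uniformly shrinking all weights.
    return np.sum(np.abs(w)) / (np.sqrt(np.sum(w * w)) + eps)
```

For example, the sparse vector [1, 0, 0, 0] has a ratio of 1, while the equally-normed dense vector [0.5, 0.5, 0.5, 0.5] has a ratio of 2, so the penalty favours the sparse solution.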

3. SIGMOID BATCHNORM

We introduce a variation of BatchNorm which uses a single learnable parameter per channel and offers a bounded importance score for filters. In its original formulation, BatchNorm (Ioffe & Szegedy, 2015) first normalises each input channel x using the batch statistics mean µ_B and standard deviation σ_B, then applies an affine transformation using learnable parameters γ and β:

BN(x, γ, β) = γ · (x − µ_B) / √(σ²_B + ϵ) + β

While BatchNorm has become ubiquitous in Deep Learning, the reasons behind its effectiveness are not fully understood (Santurkar et al., 2018). The proliferation of BatchNorm variations (Ba et al., 2016; Wu & He, 2018; Ulyanov et al., 2016) suggests its benefits arise from normalising activations rather than from the affine transformation following it. We empirically show that the transformation can be replaced with a linear one without loss of performance.
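The standard transform, and a bounded single-parameter variant consistent with the section title, can be sketched for 2-D activations as follows. This is a minimal numpy illustration only: the use of s = sigmoid(θ) as the per-channel scaler is our assumed reading of the layer described above, and the exact parameterisation may differ:

```python
import numpy as np

def batchnorm(x, gamma, beta, eps=1e-5):
    # x: (batch, channels). Standard BatchNorm: normalise each channel with
    # batch statistics, then apply the learnable affine gamma * xhat + beta.
    xhat = (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + eps)
    return gamma * xhat + beta

def sigmoid_batchnorm(x, theta, eps=1e-5):
    # Sketch of the bounded variant: a single parameter theta per channel,
    # squashed to s = sigmoid(theta) in (0, 1) and used as a linear scaler.
    # (Assumed form; the paper's exact layer definition may differ.)
    xhat = (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + eps)
    s = 1.0 / (1.0 + np.exp(-theta))
    return s * xhat
```

A practical consequence of this form is that s itself lies in (0, 1) and can serve directly as a bounded importance score, with s → 0 suppressing the channel entirely.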

