CHIPNET: BUDGET-AWARE PRUNING WITH HEAVISIDE CONTINUOUS APPROXIMATIONS

Abstract

Structured pruning methods are among the most effective strategies for extracting small, resource-efficient convolutional neural networks from their dense counterparts with minimal loss in accuracy. However, most existing methods still suffer from one or more limitations, including 1) the need to train the dense model from scratch with pruning-related parameters embedded in the architecture, 2) the requirement of model-specific hyperparameter settings, 3) the inability to include budget-related constraints in the training process, and 4) instability under extreme pruning. In this paper, we present ChipNet, a deterministic pruning strategy that employs a continuous Heaviside function and a novel crispness loss to identify a highly sparse network within an existing dense network. Our choice of the continuous Heaviside function is inspired by the field of design optimization, where the material distribution task is posed as a continuous optimization problem, but only discrete values (0 or 1) are practically feasible and expected as final outcomes. Our approach's flexible design facilitates its use with different choices of budget constraints while maintaining stability for very low target budgets. Experimental results show that ChipNet outperforms state-of-the-art structured pruning methods by remarkable margins of up to 16.1% in terms of accuracy. Further, we show that the masks obtained with ChipNet are transferable across datasets. In certain cases, masks transferred from a model trained on a feature-rich teacher dataset provide better performance on the student dataset than those obtained by pruning directly on the student data itself.

1. INTRODUCTION

Convolutional Neural Networks (CNNs) have enabled several breakthroughs across various disciplines of deep learning, especially owing to their effectiveness in extracting complex features. However, these models demand significantly high computational power, making them hard to use on low-memory hardware platforms that require high inference speed. Moreover, most existing deep networks are heavily over-parameterized, resulting in a high memory footprint (Denil et al., 2013; Frankle & Carbin, 2018). Several strategies have been proposed to tackle this issue, including network pruning (Liu et al., 2018), neural architecture search using methods such as reinforcement learning (Jaafra et al., 2019), and vector quantization (Gong et al., 2014), among others. Among these, network pruning has proved very effective in designing small, resource-efficient architectures that perform on par with their dense counterparts. Network pruning refers to the removal of unnecessary weights or filters from a given architecture without compromising its accuracy. It can broadly be classified into two categories: unstructured pruning and structured pruning. Unstructured pruning involves removing individual neurons or their connection weights from the network to make it sparse. While this strategy reduces the number of parameters in the model, the computational requirements remain the same (Li et al., 2017). Structured pruning methods, on the other hand, remove entire channels from the network. This strategy preserves the regular structure, thereby taking advantage of the high degree of parallelism provided by modern hardware (Liu et al., 2017; Gordon et al., 2018). Several structured pruning approaches have been proposed in the recent literature. A general consensus is that variational approaches using a sparsity-prior loss and learnable dropout parameters outperform deterministic methods (Lemaire et al., 2019).
Some of these methods learn sparsity as part of pretraining, and have been shown to perform better than the three-stage pretrain-prune-finetune methods. However, since such approaches need to train the model from scratch with pruning-related variables embedded into the network, they cannot benefit from off-the-shelf pretrained weights (Liu et al., 2017; Alvarez & Salzmann, 2017). Others require choosing hyperparameters based on the choice of network, and cannot easily be adapted to new models (Gordon et al., 2018). Further, with most of these methods, controlled pruning cannot be performed, and a resource-usage constraint can only be satisfied through a trial-and-error approach. Recently, Lemaire et al. (2019) presented a budget-aware pruning method that includes the budget constraint as part of the training process. A major drawback of this approach and other recent methods is that they are unstable for very low resource budgets, and require additional tricks to work. Overall, a robust budget-aware pruning approach that can be coupled with different budget constraints while maintaining stability for very low target budgets is still missing from the existing literature. In this paper, we present ChipNet, a deterministic strategy for structured pruning that employs a continuous Heaviside function and a crispness loss to identify a highly sparse network within an existing pretrained dense network. The name 'ChipNet' stands for Continuous Heaviside Pruning of Networks. Our pruning strategy draws inspiration from the field of design optimization, where the material distribution task is posed as a continuous optimization problem, but only discrete values (0 or 1) are practically feasible. Thus, only such values are produced as final outcomes through continuous Heaviside projections. We use a similar strategy to obtain the masks in our sparsity learning approach.
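To build intuition for such a projection, the following is a minimal sketch of a continuous Heaviside approximation. The exact parameterization used in ChipNet is given later in the paper; here we use a generic logistic form with a sharpness parameter beta (the channel logits and beta values are purely illustrative), showing how increasing beta drives a soft, differentiable mask toward a crisp 0/1 gate:

```python
import numpy as np

def soft_heaviside(z, beta):
    """Continuous approximation of the Heaviside step function.
    As beta grows, the output approaches a hard 0/1 gate."""
    return 1.0 / (1.0 + np.exp(-beta * z))

# Channel "importance" logits (learnable in practice; fixed here for illustration).
z = np.array([-2.0, -0.5, 0.3, 1.5])

soft = soft_heaviside(z, beta=2.0)    # smooth, differentiable mask
crisp = soft_heaviside(z, beta=50.0)  # nearly binary mask

print(np.round(soft, 3))
print(np.round(crisp, 3))
```

During training, the smooth mask keeps the problem differentiable; annealing beta (or applying a final hard threshold) recovers the discrete channel selection.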
The flexible design of ChipNet facilitates its use with different choices of budget constraints, such as a restriction on the maximum number of parameters, FLOPs, channels, or the volume of activations in the network. Through experiments, we show that ChipNet consistently outperforms state-of-the-art pruning methods for different choices of budget constraints. ChipNet is stable even for very low resource budgets, and we demonstrate this through experiments where the network is pruned to as little as 1% of its parameters. We show that in such extreme cases, ChipNet outperforms the respective baselines by remarkable margins, with an accuracy gap of slightly over 16% observed in one of the experiments. The masks learnt by ChipNet are transferable across datasets. We show that in certain cases, masks transferred from a model trained on a feature-rich teacher dataset provide better performance on the student dataset than those obtained by pruning directly on the student data itself.
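As a rough illustration of how a parameter budget follows from channel masks (the layer shapes, kernel sizes, and masks below are made up, and the count ignores biases and batch-norm terms), the fraction of convolutional parameters retained can be computed layer by layer:

```python
import numpy as np

# Per-layer output-channel masks (1 = keep); shapes are illustrative.
masks = [np.array(m) for m in ([1, 1, 0, 1], [1, 0, 0, 1, 1, 0], [0, 1, 1])]
in_channels = 3  # input channels to the first layer
kh = kw = 3      # kernel size, same for every layer here

total = retained = 0
prev_total = prev_kept = in_channels
for m in masks:
    total += prev_total * len(m) * kh * kw           # dense parameter count
    retained += prev_kept * int(m.sum()) * kh * kw   # parameters surviving the mask
    prev_total, prev_kept = len(m), int(m.sum())

print(retained / total)  # fraction of the parameter budget in use
```

A budget constraint of this kind can then be compared against a target ratio during training; analogous counts yield FLOPs, channel, or activation-volume budgets.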

2. RELATED WORK

As hypothesized by Frankle & Carbin (2018), most neural networks are overparameterized, with a large portion (as much as 90%) of the weights being of little significance to the output of the model. Clearly, there exists enormous scope to reduce the size of these networks. Several works have explored the efficiency of network pruning strategies for reducing the storage requirements of these networks and accelerating inference speed (LeCun et al., 1990; Dong et al., 2017). Some early works by Han et al. (2015a; c); Zhu & Gupta (2017) involve the removal of individual neurons from a network to make it sparse. This reduces the storage requirements of these networks; however, no improvement in inference speed is observed. Recently, several works have focused on structured network pruning, as it involves pruning entire channels/filters or even layers to maintain the regular structure (Luo et al., 2017; Li et al., 2017; Alvarez & Salzmann, 2016). The focus of this paper is on structured network pruning; thus, we briefly discuss here the recent works related to this approach. The recent work by Li et al. (2017) identifies less important channels based on the L1-norm. Luo et al. (2017); He et al. (2017) perform channel selection based on the channels' influence on the activation values of the next layer. Liu et al. (2017) perform channel-level pruning by imposing LASSO regularization on the scaling terms in the batch-norm layers, and prune the model based on a global threshold. He et al. (2018b) automatically learn the compression ratio of each layer with reinforcement learning. Louizos et al. (2017); Alvarez & Salzmann (2017; 2016) train and prune the network in a single-stage strategy.
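The L1-norm criterion of Li et al. (2017) mentioned above can be sketched in a few lines (the weight tensor shape and the number of retained channels are illustrative, not taken from the original work):

```python
import numpy as np

# Toy conv-layer weights: (out_channels, in_channels, kH, kW).
rng = np.random.default_rng(0)
weights = rng.normal(size=(8, 3, 3, 3))

# Score each output channel by the L1-norm of its filter.
l1 = np.abs(weights).reshape(weights.shape[0], -1).sum(axis=1)

keep = 4  # retain the 4 highest-scoring channels
kept_idx = np.sort(np.argsort(l1)[-keep:])

pruned = weights[kept_idx]
print(pruned.shape)  # (keep, in_channels, kH, kW)
```

In a full pipeline, the corresponding input channels of the next layer are removed as well, and the pruned model is then finetuned.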

