LAYER-ADAPTIVE SPARSITY FOR THE MAGNITUDE-BASED PRUNING

Abstract

Recent discoveries on neural network pruning reveal that, with a carefully chosen layerwise sparsity, a simple magnitude-based pruning achieves a state-of-the-art tradeoff between sparsity and performance. However, without a clear consensus on "how to choose," the layerwise sparsities are mostly selected algorithm-by-algorithm, often resorting to handcrafted heuristics or an extensive hyperparameter search. To fill this gap, we propose a novel importance score for global pruning, coined the layer-adaptive magnitude-based pruning (LAMP) score; the score is a rescaled version of the weight magnitude that incorporates the model-level ℓ2 distortion incurred by pruning, and it does not require any hyperparameter tuning or heavy computation. Under various image classification setups, LAMP consistently outperforms popular existing schemes for layerwise sparsity selection. Furthermore, we observe that LAMP continues to outperform baselines even in weight-rewinding setups, while the connectivity-oriented layerwise sparsity (the strongest baseline overall) performs worse than a simple global magnitude-based pruning in this case. Code: https://github.com/jaeho-lee/layer-

1. INTRODUCTION

Neural network pruning is the art of removing "unimportant weights" from a model, with the intention to meet practical constraints (Han et al., 2015), mitigate overfitting (Hanson & Pratt, 1988), enhance interpretability (Mozer & Smolensky, 1988), or deepen our understanding of neural network training (Frankle & Carbin, 2019). Yet, the importance of a weight is still a vaguely defined notion, and thus a wide range of pruning algorithms based on various importance scores have been proposed. One popular approach is to estimate the loss increment from removing the target weight and use it as an importance score, e.g., Hessian-based approximations (LeCun et al., 1989; Hassibi & Stork, 1993; Dong et al., 2017), coreset-based estimates (Baykal et al., 2019; Mussay et al., 2020), convex optimization (Aghasi et al., 2017), and operator distortion (Park et al., 2020). Other approaches include on-the-fly ℓ1 regularization (Louizos et al., 2018; Xiao et al., 2019), Bayesian methods (Molchanov et al., 2017; Louizos et al., 2017; Dai et al., 2018), and reinforcement learning (Lin et al., 2017).

Recent discoveries (Gale et al., 2019; Evci et al., 2020) demonstrate that, given an appropriate choice of layerwise sparsity, simply pruning on the basis of weight magnitude yields a surprisingly powerful unstructured pruning scheme. For instance, Gale et al. (2019) evaluate the performance of magnitude-based pruning (MP; Han et al. (2015); Zhu & Gupta (2018)) with an extensive hyperparameter tuning, and show that MP achieves comparable or better performance than state-of-the-art pruning algorithms that use more complicated importance scores.
To arrive at such a performance level, the authors introduce the following handcrafted heuristic: leave the first convolutional layer fully dense, and prune up to only 80% of the weights from the last fully-connected layer; the heuristic is motivated by the sparsity patterns of other state-of-the-art algorithms (Molchanov et al., 2017) and additional experimental/architectural observations. Unfortunately, there is an apparent lack of consensus on "how to choose the layerwise sparsity" for magnitude-based pruning. Instead, the layerwise sparsity is selected mostly on an algorithm-by-algorithm basis. One common method is the global MP criterion (see, e.g., Morcos et al. (2019)), which applies a single magnitude threshold across all layers.

Contributions. In search of a "go-to" layerwise sparsity for MP, we take a model-level distortion minimization perspective towards MP. We build on the observation of Dong et al. (2017); Park et al. (2020) that each neural network layer can be viewed as an operator, and MP is a choice that incurs minimum ℓ2 distortion to the operator output (given a worst-case input signal). We take this perspective further to examine the "model-level" distortion incurred by pruning a layer; preceding layers scale the input signal to the target layer, and succeeding layers scale the output distortion. Based on the distortion minimization framework, we propose a novel importance score for global pruning, coined LAMP (Layer-Adaptive Magnitude-based Pruning). The LAMP score is a rescaled weight magnitude, approximating the model-level distortion from pruning. Importantly, the LAMP score is designed to approximate the distortion on the model being pruned, i.e., all connections with a smaller LAMP score than the target weight are already pruned. Global pruning with the LAMP score (i.e., applying a single global threshold to LAMP scores) is equivalent to MP with an automatically determined layerwise sparsity.
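The score described above can be sketched in a few lines of NumPy. This is our illustrative reading of the definition, not code from the paper; the function name `lamp_scores` and the array-based interface are our own. For each weight, the score is its squared magnitude divided by the sum of squared magnitudes of all weights in the same layer whose magnitude is at least as large (the "surviving" weights once everything with a smaller score is pruned):

```python
import numpy as np

def lamp_scores(weights):
    """Compute LAMP scores for one layer's weight tensor.

    Score of a weight w: w**2 divided by the sum of squared magnitudes
    of all weights in the layer with magnitude >= |w|, i.e., the
    connections that would still survive after pruning every weight
    with a smaller score.
    """
    sq = weights.ravel() ** 2
    order = np.argsort(sq)                      # ascending squared magnitude
    suffix = np.cumsum(sq[order][::-1])[::-1]   # per-weight "surviving" denominator
    scores = np.empty_like(sq)
    scores[order] = sq[order] / suffix
    return scores.reshape(weights.shape)
```

Note that the largest-magnitude weight in every layer receives a score of exactly 1, so a global threshold on LAMP scores never prunes a layer entirely.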
At the same time, pruning with LAMP keeps the benefits of MP intact: the LAMP score is efficiently computable, hyperparameter-free, and does not rely on any model-specific knowledge. We validate the effectiveness of LAMP under diverse experimental setups, encompassing various convolutional neural network architectures and image datasets (CIFAR-10/100, SVHN, Restricted ImageNet). In all considered setups, LAMP consistently outperforms the baseline layerwise sparsity selection schemes. We also perform additional ablation studies with one-shot pruning and weight-rewinding setups to confirm that LAMP performs reliably well under a wider range of scenarios.

Organization. In Section 2, we briefly describe existing methods to choose the layerwise sparsity for magnitude-based pruning. In Section 3, we formally introduce LAMP and describe how the ℓ2 distortion minimization perspective motivates the LAMP score. In Section 4, we empirically validate the effectiveness and versatility of LAMP. In Section 5, we take a closer look at the layerwise sparsity discovered by LAMP and compare it with baseline methods and previously proposed handcrafted heuristics. In Section 6, we summarize our findings and discuss future directions. Appendices include the experimental details (Appendix A), complexity analysis (Appendix B), derivation of the LAMP score (Appendix C), additional experiments on Transformer (Appendix D), and detailed experimental results with standard deviations (Appendix E).

2. RELATED WORK

This section gives a (necessarily non-exhaustive) survey of various layerwise sparsity selection schemes used for magnitude-based pruning algorithms. Magnitude-based pruning of neural networks dates back to the early works of Janowsky (1989); LeCun et al. (1989), and has been actively studied since.






Figure 1: The LAMP score is a squared weight magnitude, normalized by the sum of squared magnitudes of all "surviving weights" in the layer. Global pruning by LAMP is equivalent to layerwise magnitude-based pruning with an automatically chosen layerwise sparsity.
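The equivalence stated in the caption can be checked directly: within a layer the LAMP score is monotone in the weight magnitude, so thresholding LAMP scores globally prunes each layer's smallest-magnitude weights, i.e., performs layerwise MP at whatever sparsity the global threshold induces. The following standalone sketch (our own illustration, with a compact score helper repeated so it runs on its own; the layer sizes and 50% target sparsity are arbitrary) verifies this on random layers:

```python
import numpy as np

def lamp_scores(w):
    """Squared magnitude over the sum of squared magnitudes of all
    weights in the layer with at-least-as-large magnitude."""
    sq = w.ravel() ** 2
    order = np.argsort(sq)
    suffix = np.cumsum(sq[order][::-1])[::-1]
    scores = np.empty_like(sq)
    scores[order] = sq[order] / suffix
    return scores

rng = np.random.default_rng(0)
layers = [rng.normal(size=50), rng.normal(size=80)]
scores = [lamp_scores(w) for w in layers]

# Global pruning: one threshold on LAMP scores across all layers.
n_total, sparsity = 130, 0.5
threshold = np.sort(np.concatenate(scores))[int(sparsity * n_total)]
global_masks = [s >= threshold for s in scores]

# Equivalent layerwise MP: prune each layer at the sparsity that the
# global LAMP threshold implicitly selected for it.
for w, mask in zip(layers, global_masks):
    n_pruned = int((~mask).sum())
    mp_mask = np.abs(w) >= np.sort(np.abs(w))[n_pruned]
    assert (mask == mp_mask).all()  # identical surviving weights
```

Because the per-layer maximum always scores 1, the induced layerwise sparsities adapt automatically without ever emptying a layer.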

