LAYER-ADAPTIVE SPARSITY FOR THE MAGNITUDE-BASED PRUNING

Abstract

Recent discoveries on neural network pruning reveal that, with a carefully chosen layerwise sparsity, a simple magnitude-based pruning achieves a state-of-the-art tradeoff between sparsity and performance. However, without a clear consensus on "how to choose," the layerwise sparsities are mostly selected algorithm by algorithm, often resorting to handcrafted heuristics or an extensive hyperparameter search. To fill this gap, we propose a novel importance score for global pruning, coined the layer-adaptive magnitude-based pruning (LAMP) score; the score is a rescaled version of the weight magnitude that incorporates the model-level ℓ2 distortion incurred by pruning, and does not require any hyperparameter tuning or heavy computation. Under various image classification setups, LAMP consistently outperforms popular existing schemes for layerwise sparsity selection. Furthermore, we observe that LAMP continues to outperform baselines even in weight-rewinding setups, while the connectivity-oriented layerwise sparsity (the strongest baseline overall) performs worse than simple global magnitude-based pruning in this case. Code: https://github.com/jaeho-lee/layer-
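As one concrete reading of the "rescaled magnitude" idea, the sketch below assigns every weight its squared magnitude divided by the total squared magnitude of all weights at least as large in the same layer, a suffix-normalized score in the spirit of the ℓ2-distortion description above. The exact LAMP definition is given in the body of the paper; the function name and details here are illustrative assumptions, not the authors' code.

```python
import numpy as np

def rescaled_magnitude_scores(weights):
    """Illustrative suffix-normalized magnitude score for one layer.

    Each weight's squared magnitude is divided by the sum of squared
    magnitudes of all weights at least as large as itself, so the score
    reflects the layer's surviving l2 mass rather than raw magnitude.
    """
    w = weights.flatten()
    order = np.argsort(np.abs(w))            # indices, ascending by magnitude
    sq = w[order] ** 2
    # suffix[i] = sum of squares of all weights with magnitude >= |w_i|
    suffix = np.cumsum(sq[::-1])[::-1]
    scores = np.empty_like(w)
    scores[order] = sq / suffix
    return scores.reshape(weights.shape)
```

Note that the largest-magnitude weight in a layer always receives score 1, which keeps at least one weight per layer highly ranked under any global cutoff.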

1. INTRODUCTION

Neural network pruning is the art of removing "unimportant weights" from a model, with the intention of meeting practical constraints (Han et al., 2015), mitigating overfitting (Hanson & Pratt, 1988), enhancing interpretability (Mozer & Smolensky, 1988), or deepening our understanding of neural network training (Frankle & Carbin, 2019). Yet, the importance of a weight is still a vaguely defined notion, and thus a wide range of pruning algorithms based on various importance scores has been proposed. One popular approach is to estimate the loss increment from removing the target weight and use it as an importance score, e.g., Hessian-based approximations (LeCun et al., 1989; Hassibi & Stork, 1993; Dong et al., 2017), coreset-based estimates (Baykal et al., 2019; Mussay et al., 2020), convex optimization (Aghasi et al., 2017), and operator distortion (Park et al., 2020). Other approaches include on-the-fly ℓ1 regularization (Louizos et al., 2018; Xiao et al., 2019), Bayesian methods (Molchanov et al., 2017; Louizos et al., 2017; Dai et al., 2018), and reinforcement learning (Lin et al., 2017). Recent discoveries (Gale et al., 2019; Evci et al., 2020) demonstrate that, given an appropriate choice of layerwise sparsity, simply pruning on the basis of weight magnitude yields a surprisingly powerful unstructured pruning scheme. For instance, Gale et al. (2019) evaluate the performance of magnitude-based pruning (MP; Han et al. (2015); Zhu & Gupta (2018)) with an extensive hyperparameter tuning, and show that MP achieves comparable or better performance than state-of-the-art pruning algorithms that use more complicated importance scores.
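To make the terminology concrete, here is a minimal NumPy sketch of magnitude-based pruning under a prescribed layerwise sparsity: each layer independently zeroes out its smallest-magnitude weights. The function name and tie-breaking are illustrative assumptions, not any cited implementation.

```python
import numpy as np

def prune_layerwise(layers, sparsities):
    """Magnitude-based pruning with a given per-layer sparsity.

    `layers` is a list of weight arrays; `sparsities` gives the fraction
    of weights to remove from each layer. Within a layer, the weights
    with the smallest magnitudes are set to zero.
    """
    pruned = []
    for w, s in zip(layers, sparsities):
        k = int(round(s * w.size))           # number of weights to remove
        if k == 0:
            pruned.append(w.copy())
            continue
        # k-th smallest magnitude serves as the layer's cutoff
        thresh = np.sort(np.abs(w).flatten())[k - 1]
        pruned.append(np.where(np.abs(w) > thresh, w, 0.0))
    return pruned
```

The open question the paper addresses is precisely how the `sparsities` argument should be chosen across layers.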
To arrive at such a performance level, the authors introduce the following handcrafted heuristic: leave the first convolutional layer fully dense, and prune up to only 80% of the weights from the last fully-connected layer; the heuristic is motivated by the sparsity pattern from other state-of-the-art algorithms (Molchanov et al., 2017) and additional experimental/architectural observations. Unfortunately, there is an apparent lack of consensus on "how to choose the layerwise sparsity" for magnitude-based pruning. Instead, the layerwise sparsity is selected mostly on an algorithm-by-algorithm basis. One common method is the global MP criterion (see, e.g., Morcos et al. (2019)),
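The global MP criterion just mentioned pools magnitudes from all layers, picks a single threshold that achieves the target overall sparsity, and lets each layer's sparsity fall out implicitly. A minimal sketch, with illustrative names (not the code from any cited work):

```python
import numpy as np

def prune_global(layers, sparsity):
    """Global magnitude pruning: one threshold shared by all layers.

    All magnitudes are concatenated, a single cutoff achieving the
    target overall sparsity is found, and every layer is pruned against
    that same cutoff, so layerwise sparsities emerge implicitly.
    """
    all_mags = np.concatenate([np.abs(w).flatten() for w in layers])
    k = int(round(sparsity * all_mags.size))  # total weights to remove
    if k == 0:
        return [w.copy() for w in layers]
    thresh = np.sort(all_mags)[k - 1]
    return [np.where(np.abs(w) > thresh, w, 0.0) for w in layers]
```

Because the cutoff is shared, layers whose weights are small on average (often the later, wider layers) end up far sparser than layers with large weights, which is exactly the kind of implicit layerwise allocation the paper contrasts against.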

