NEURAL PRUNING VIA GROWING REGULARIZATION

Abstract

Regularization has long been utilized to learn sparsity in deep neural network pruning. However, its role has mainly been explored in the small-penalty-strength regime. In this work, we extend its application to a new scenario where the regularization grows large gradually to tackle two central problems of pruning: pruning schedule and weight importance scoring. (1) The former topic is newly brought up in this work; we find it critical to pruning performance, yet it has received little research attention. Specifically, we propose an L2 regularization variant with rising penalty factors and show that it brings significant accuracy gains over its one-shot counterpart, even when the same weights are removed. (2) The growing penalty scheme also gives us an approach to exploit Hessian information for more accurate pruning without knowing its specific values, and thus without the common Hessian approximation problems. Empirically, the proposed algorithms are easy to implement and scalable to large datasets and networks in both structured and unstructured pruning. Their effectiveness is demonstrated with modern deep neural networks on the CIFAR and ImageNet datasets, achieving competitive results compared with many state-of-the-art algorithms. Our code and trained models are publicly available at https://github.com/mingsuntse/regularization-pruning.

1. INTRODUCTION

As deep neural networks have advanced in recent years LeCun et al. (2015); Schmidhuber (2015), their remarkable effectiveness has come at the cost of rising storage, memory footprint, computing resources, and energy consumption Cheng et al. (2017); Deng et al. (2020). Neural network pruning Han et al. (2015; 2016); Li et al. (2017); Wen et al. (2016); He et al. (2017); Gale et al. (2019) is seen as a promising way to alleviate this problem. Since its early debut Mozer & Smolensky (1989); Reed (1993), the central problem of neural network pruning has arguably been how to choose which weights to discard, i.e., the weight importance scoring problem LeCun et al. (1990); Hassibi & Stork (1993); Molchanov et al. (2017b; 2019); Wang et al. (2019a); He et al. (2020). Approaches to the scoring problem generally fall into two groups: importance-based and regularization-based Reed (1993). The former focuses on directly proposing a theoretically sound importance criterion so that the unimportant weights can be pruned once and for all; the pruning process is thus typically one-shot. In contrast, regularization-based approaches typically select unimportant weights through training with a penalty term Han et al. (2015); Wen et al. (2016); Liu et al. (2017). However, the penalty strength is usually kept small to avoid damaging the model expressivity. Yet a large penalty strength can be helpful, in two specific respects. (1) A large penalty can push unimportant weights very close to zero, so that the subsequent pruning barely hurts performance even if the simple weight magnitude is adopted as the criterion. (2) It is well known that different weights of a neural network lie in regions with different local quadratic structures, i.e., Hessian information. Many methods try to tap into this to build a more accurate scoring LeCun et al. (1990); Hassibi & Stork (1993); Wang et al. (2019a); Singh & Alistarh (2020).
However, for deep networks, the Hessian is especially hard to estimate; sometimes even the computation itself is intractable without resorting to proper approximation Wang et al. (2019a). On this problem, we ask: is it possible to exploit Hessian information without knowing its specific values? This is the second scenario where a growing regularization can help. We will show that, under a growing regularization, the weight magnitudes naturally separate because of their different underlying local quadratic structures, so we can pick out the unimportant weights more faithfully even with the simple magnitude-based criterion. Corresponding to these two aspects, we present two algorithms based on a growing L2 regularization paradigm: the first highlights a better pruning schedule¹ and the second explores a better pruning criterion.
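As a toy illustration of this separation effect (a hypothetical 1-D sketch, not the paper's algorithm): for a quadratic loss L(w) = ½h(w − w₀)² with curvature h (the "Hessian") plus an L2 penalty ½λw², the penalized minimum is w* = h·w₀/(h + λ). As λ grows, weights sitting in low-curvature regions collapse toward zero much faster than those in high-curvature regions, so their magnitudes separate even when they start at the same value:

```python
# Toy sketch: under a growing L2 penalty, weight magnitudes separate
# according to local curvature (Hessian), even when they start equal.
# For L(w) = 0.5*h*(w - w0)^2 + 0.5*lam*w^2 the minimizer is h*w0/(h + lam).

def penalized_minimum(h, w0, lam):
    """Closed-form minimizer of 0.5*h*(w - w0)**2 + 0.5*lam*w**2."""
    return h * w0 / (h + lam)

w0 = 1.0                                   # both weights start equal
curvatures = {"flat": 0.1, "sharp": 10.0}  # hypothetical Hessian values

for lam in [0.0, 0.1, 1.0, 10.0]:          # growing penalty strength
    mags = {k: penalized_minimum(h, w0, lam) for k, h in curvatures.items()}
    print(f"lam={lam:5.1f}  flat={mags['flat']:.3f}  sharp={mags['sharp']:.3f}")

# The low-curvature ("flat") weight is driven toward zero first, so a simple
# magnitude criterion now reflects second-order information implicitly.
```

This is why magnitude-based selection, applied after the penalty has grown, can act as an implicit Hessian-based criterion.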

Our contributions.

(1) We propose a simple yet effective growing regularization scheme, which helps transfer the model expressivity to the remaining part during pruning. The encouraging performance suggests that the pruning schedule may be as critical as the weight importance criterion and deserves more research attention. (2) We further adopt growing regularization to exploit the Hessian implicitly, without knowing its specific values. The method helps choose the unimportant weights more faithfully, with a theoretically sound basis. In this regard, our paper is the first to show the connection between magnitude-based pruning and Hessian-based pruning, pointing out that the latter can be turned into the former through our proposed growing regularization scheme. (3) The two proposed algorithms are easy to implement and scalable to large-scale datasets and networks. We show their effectiveness compared with many state-of-the-art methods. Notably, the methods work seamlessly for both filter pruning and unstructured pruning.
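The growing-penalty idea can be sketched as a staircase schedule on the extra L2 factor applied to weights selected for removal (a minimal sketch with hypothetical hyper-parameter names; the paper's actual update rules and values may differ):

```python
# Minimal sketch of a growing-regularization schedule: the penalty factor on
# to-be-pruned weights rises by a small increment every `interval` steps,
# capped at a ceiling, so those weights are squeezed toward zero gradually
# rather than removed in one shot. Names (delta, interval, ceiling) are
# illustrative, not the paper's.

def growing_penalty(step, delta=1e-4, interval=10, ceiling=1.0):
    """Penalty factor at a training step: grows by `delta` every
    `interval` steps, capped at `ceiling`."""
    return min(ceiling, (step // interval) * delta)

def sgd_step(w, grad, lam, lr=0.01):
    """One SGD step on loss + 0.5*lam*w^2 (extra L2 on a penalized weight)."""
    return w - lr * (grad + lam * w)

# Example: a weight whose task gradient is ~0 (unimportant) is driven toward
# zero as the penalty grows, after which magnitude pruning is nearly harmless.
w = 0.5
for step in range(20000):
    w = sgd_step(w, grad=0.0, lam=growing_penalty(step))
print(f"final magnitude: {abs(w):.4f}")
```

In a real network the same schedule would simply be added as a per-weight weight-decay term in the training loop, leaving the optimizer untouched.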

2. RELATED WORK

Regularization-based pruning. The first group of relevant works applies regularization to learn sparsity. The most famous is probably L0 or L1 regularization Louizos et al. (2018); Liu et al. (2017); Ye et al. (2018) due to their sparsity-inducing nature. In addition, the common L2 regularization has also been explored for approximate sparsity Han et al. (2015; 2016). The early papers focus more on unstructured pruning, which benefits model compression but not acceleration. For structured pruning in favor of acceleration, Group-wise Brain Damage Lebedev & Lempitsky (2016) and SSL Wen et al. (2016) propose to use Group LASSO Yuan & Lin (2006) to learn regular sparsity, where the penalty strength is still kept small because the penalty is uniformly applied to all the weights. To resolve this, Ding et al. (2018) and Wang et al. (2019c) propose to employ different penalty factors for different weights, enabling large regularization.

Importance-based pruning. Importance-based pruning tries to establish advanced importance criteria that reflect the true relative importance among weights as faithfully as possible. The pruned weights are usually decided immediately by a proposed formula rather than by training (although the whole pruning process can involve training, e.g., iterative pruning). The most widely used criterion is magnitude-based: weight absolute value for unstructured pruning Han et al. (2015; 2016) or L1/L2 norm for structured pruning Li et al. (2017). This heuristic criterion was proposed long ago Reed (1993) and has been argued to be inaccurate. In this respect, improvement mainly comes from using Hessian information to obtain a more accurate approximation of the loss increase when a weight is removed LeCun et al. (1990); Hassibi & Stork (1993). The Hessian is intractable to compute for large networks, so some methods (e.g., EigenDamage Wang et al. (2019a), WoodFisher Singh & Alistarh (2020)) employ cheap approximations (such as K-FAC Fisher Martens & Grosse (2015)) to make the second-order criteria tractable on deep networks. Note that there is no hard boundary between the importance-based and regularization-based groups; many papers present their schemes as a combination of the two Ding et al. (2018); Wang et al. (2019c).

The difference mainly lies in the emphasis: regularization-based methods focus more on an advanced penalty scheme so that the subsequent pruning criterion can be simple, while importance-based methods focus more on an advanced importance criterion itself. Meanwhile, the regularization paradigm always involves iterative training, whereas importance-based methods can be one-shot LeCun et al. (1990); Hassibi & Stork (1993); Wang et al. (2019a) (no training for picking weights to prune) or involve iterative training Molchanov et al. (2017b; 2019); Ding et al. (2019a;b).

¹ By pruning schedule, we mean the way to remove weights (e.g., removing all weights in a single step or in multiple steps), not the training schedule such as learning rate settings, etc.

Other model compression methods. Apart from pruning, there are also many other model compression approaches, e.g., quantization Courbariaux & Bengio (2016); Courbariaux et al. (2016); Rastegari et al. (2016), knowledge distillation Buciluǎ et al. (2006); Hinton et al. (2014), low-

