NEURAL PRUNING VIA GROWING REGULARIZATION

Abstract

Regularization has long been utilized to learn sparsity in deep neural network pruning. However, its role has mainly been explored in the small-penalty-strength regime. In this work, we extend its application to a new scenario in which the regularization grows large gradually, to tackle two central problems of pruning: the pruning schedule and weight importance scoring. (1) The former topic is newly raised in this work; we find it critical to pruning performance, yet it has received little research attention. Specifically, we propose an L2 regularization variant with rising penalty factors and show it can bring significant accuracy gains compared with its one-shot counterpart, even when the same weights are removed. (2) The growing penalty scheme also gives us an approach to exploit Hessian information for more accurate pruning without knowing its specific values, thus avoiding the common problems of Hessian approximation. Empirically, the proposed algorithms are easy to implement and scalable to large datasets and networks in both structured and unstructured pruning. Their effectiveness is demonstrated with modern deep neural networks on the CIFAR and ImageNet datasets, achieving competitive results compared to many state-of-the-art algorithms. Our code and trained models are publicly available at https://github.com/mingsuntse/regularization-pruning.
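The rising-penalty idea in the abstract can be sketched in a few lines of NumPy. The toy quadratic objective, the half-by-magnitude selection, and the schedule constants below are illustrative assumptions, not the paper's actual setup or hyperparameters: a per-weight L2 penalty factor grows gradually on the weights selected for removal, driving them toward zero so that the final magnitude-based removal barely changes the loss.

```python
import numpy as np

# Illustrative sketch (assumed toy setup, not the paper's implementation):
# an L2 penalty whose factor rises gradually on weights selected for
# removal, pushing them toward zero before magnitude pruning.

rng = np.random.default_rng(0)
w = rng.normal(size=8)                          # toy "layer" weights
target = w.copy()                               # data term anchors w at its start
prune_mask = np.abs(w) < np.median(np.abs(w))   # half picked by magnitude

lam = np.zeros_like(w)                          # per-weight penalty factors
lr, delta = 0.01, 0.01                          # step size; penalty growth rate

for step in range(2000):
    grad = (w - target) + lam * w               # data gradient + L2 penalty term
    w -= lr * grad
    lam[prune_mask] += delta                    # grow penalty on pruned set only

shrunk = float(np.abs(w[prune_mask]).max())     # penalized weights are now tiny
w[prune_mask] = 0.0                             # removal barely changes the loss
```

Because the penalty factor `lam` rises slowly instead of jumping to a large value at once, each penalized weight tracks its drifting optimum `target / (1 + lam)` smoothly toward zero, while unpenalized weights stay at their data-fit values.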

1. INTRODUCTION

As deep neural networks advance in recent years LeCun et al. (2015); Schmidhuber (2015), their remarkable effectiveness comes at the cost of rising storage, memory footprint, computing resources, and energy consumption Cheng et al. (2017); Deng et al. (2020). Neural network pruning Han et al. (2015; 2016); Li et al. (2017); Wen et al. (2016); He et al. (2017); Gale et al. (2019) is deemed a promising force to alleviate this problem. Since its early debut Mozer & Smolensky (1989); Reed (1993), the central problem of neural network pruning has (arguably) been how to choose which weights to discard, i.e., the weight importance scoring problem LeCun et al. (1990); Hassibi & Stork (1993); Molchanov et al. (2017b; 2019); Wang et al. (2019a); He et al. (2020).

The approaches to the scoring problem generally fall into two groups: importance-based and regularization-based Reed (1993). The former focuses on directly proposing a theoretically sound importance criterion so that unimportant weights can be pruned once and for all; the pruning process is thus typically one-shot. In contrast, regularization-based approaches typically select unimportant weights through training with a penalty term Han et al. (2015); Wen et al. (2016); Liu et al. (2017). However, the penalty strength is usually kept small to avoid damaging the model expressivity. Yet a large penalty strength can be helpful, in two specific aspects. (1) A large penalty can push unimportant weights very close to zero, so that the subsequent pruning barely hurts performance even if simple weight magnitude is adopted as the criterion. (2) It is well known that different weights of a neural network lie in regions with different local quadratic structures, i.e., Hessian information. Many methods try to tap into this to build a more accurate scoring LeCun et al. (1990); Hassibi & Stork (1993); Wang et al. (2019a); Singh & Alistarh (2020). However, for deep networks, the Hessian is especially hard to estimate; sometimes even computing it is intractable without resorting to proper approximations Wang et al. (2019a). On this problem, we ask: Is it possible to exploit the Hessian information without knowing its specific values?
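A toy example can show why such second-order information matters for importance scoring. The numbers and diagonal-Hessian setup below are hypothetical, in the spirit of the OBD saliency of LeCun et al. (1990): for a locally quadratic loss, zeroing weight w_i raises the loss by roughly 0.5 * H_ii * w_i^2, so a small-magnitude weight lying in a sharp direction can matter more than a larger weight in a flat one.

```python
import numpy as np

# Toy, assumed setup: a quadratic loss with a diagonal Hessian H.
# Zeroing weight w[i] raises the loss by roughly 0.5 * H[i] * w[i]**2
# (the OBD saliency), so magnitude-based and curvature-aware scoring
# can disagree on which weight to prune first.

w = np.array([0.5, 1.0])        # weight 0 has the smaller magnitude...
H = np.array([8.0, 1.0])        # ...but sits in a much sharper direction

magnitude_score = np.abs(w)             # magnitude criterion
obd_score = 0.5 * H * w**2              # Hessian-aware (OBD-style) saliency

prune_by_magnitude = int(np.argmin(magnitude_score))   # picks weight 0
prune_by_obd = int(np.argmin(obd_score))               # picks weight 1
```

Here the magnitude criterion removes weight 0, while the curvature-aware score removes weight 1, since zeroing weight 0 would cost twice as much loss (1.0 vs. 0.5) despite its smaller magnitude.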

