PROGRESSIVE SKELETONIZATION: TRIMMING MORE FAT FROM A NETWORK AT INITIALIZATION

Abstract

Recent studies have shown that skeletonization (pruning parameters) of networks at initialization provides all the practical benefits of sparsity both at inference and training time, while only marginally degrading performance. However, we observe that beyond a certain level of sparsity (approximately 95%), these approaches fail to preserve network performance, and in many cases surprisingly perform even worse than trivial random pruning. To this end, we propose an objective to find a skeletonized network with maximum foresight connection sensitivity (FORCE), whereby the trainability, in terms of connection sensitivity, of a pruned network is taken into consideration. We then propose two approximate procedures to maximize our objective: (1) Iterative SNIP, which allows parameters that were unimportant at earlier stages of skeletonization to become important at later stages; and (2) FORCE, an iterative process that additionally enables exploration by letting already pruned parameters resurrect at later stages of skeletonization. Empirical analysis on a large suite of experiments shows that our approach, while performing at least as well as other recent approaches at moderate pruning levels, provides remarkably improved performance at higher pruning levels (removing up to 99.5% of parameters while keeping the networks trainable).

1. INTRODUCTION

The majority of pruning algorithms for Deep Neural Networks require training dense models, and often fine-tuning sparse sub-networks, in order to obtain their pruned counterparts. In Frankle & Carbin (2019), the authors provide empirical evidence for the hypothesis that there exist sparse sub-networks that can be trained from scratch to achieve performance similar to that of the dense ones. However, their method to find such sub-networks requires training the full-sized model and intermediate sub-networks, making the process much more expensive. Recently, Lee et al. (2019) presented SNIP. Building upon a saliency criterion for pruning trained models that is almost three decades old (Mozer & Smolensky, 1989), they are able to predict, at initialization, the importance each weight will have later in training. Pruning at initialization is much cheaper than conventional pruning. Moreover, while traditional pruning methods can help accelerate inference, pruning at initialization may go one step further and provide the same benefits at training time (Elsen et al., 2020). Wang et al. (2020) (GRASP) noted that, after applying the pruning mask, gradients are modified due to non-trivial interactions between weights; thus, maximizing the SNIP criterion before pruning might be sub-optimal. They present an approximation to maximize the gradient norm after pruning, treating pruning as a perturbation on the weight matrix and applying a first-order Taylor approximation. While they show improved performance, their approximation involves computing a Hessian-vector product, which is expensive both in terms of memory and computation.

Figure 1: Test accuracies on CIFAR-10 (ResNet50) for different pruning methods. Each point is the average over 3 runs of prune-train-test. The shaded areas denote the standard deviation of the runs (too small to be visible in some cases). Random corresponds to removing connections uniformly.
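To make the saliency criterion concrete, the following is a minimal NumPy sketch of one-shot SNIP-style pruning on a toy weight vector. It uses the Mozer & Smolensky-style connection sensitivity |g * w| (gradient times weight); the function names and the toy numbers are ours, not the paper's, and a real implementation would compute the gradients from a network's loss on a mini-batch.

```python
import numpy as np

def snip_saliency(weights, grads):
    """SNIP-style connection sensitivity: |g * w|, normalized to sum to 1."""
    scores = np.abs(weights * grads)
    return scores / scores.sum()

def one_shot_mask(weights, grads, sparsity):
    """Keep the top (1 - sparsity) fraction of weights by saliency."""
    scores = snip_saliency(weights, grads)
    k = int(round((1.0 - sparsity) * weights.size))  # number of weights to keep
    keep = np.argsort(scores.ravel())[-k:]           # indices of the k largest scores
    mask = np.zeros(weights.size, dtype=bool)
    mask[keep] = True
    return mask.reshape(weights.shape)

# Toy example: a "layer" of 4 weights with hand-set gradients.
w = np.array([0.5, -2.0, 0.1, 1.0])
g = np.array([1.0, 0.1, 3.0, 0.5])
mask = one_shot_mask(w, g, sparsity=0.5)  # keep 2 of 4 connections
```

Note that a large weight with a small gradient (here the second entry) can score lower than a small weight with a large gradient, which is exactly what distinguishes this criterion from magnitude pruning.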
We argue that neither the SNIP nor the GRASP approximation of the gradients after pruning holds at high pruning levels, where a large portion of the weights is removed at once. In this work, while we rely on the saliency criterion introduced by Mozer & Smolensky (1989), we optimize what this saliency would be after pruning, rather than before. Hence, we name our criterion Foresight Connection sEnsitivity (FORCE). We introduce two approximate procedures to progressively optimize our objective. The first, which turns out to be equivalent to applying SNIP iteratively, removes a small fraction of weights at each step and re-computes the gradients after each pruning round. This allows us to take into account the intricate interactions between weights, re-adjusting the importance of connections at each step. The second procedure, which we name FORCE, is also iterative in nature but, contrary to the first, allows pruned parameters to resurrect. Hence, it supports exploration, which is not possible with iterative SNIP. Moreover, one-shot SNIP can be viewed as the particular case of using a single iteration. Empirically, we find that both SNIP and GRASP suffer a sharp drop in performance when targeting higher pruning levels; surprisingly, they perform even worse than random pruning, as can be seen in Fig. 1. In contrast, our proposed pruning procedures prove to be significantly more robust over a wide range of pruning levels.
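The iterative procedure described above can be sketched in a few lines of NumPy. This is an illustrative toy, not the paper's implementation: we use a linear least-squares "network" so the gradient has a closed form, and an exponential sparsity schedule between rounds (the exact schedule is an assumption of ours). The key point it shows is that saliency is re-computed on the already-pruned network at every step.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(32, 10))   # toy inputs
y = rng.normal(size=32)         # toy targets
w = rng.normal(size=10)         # weights "at initialization" (never updated)

def grad(w, mask):
    """Gradient of the loss 0.5 * ||X (mask * w) - y||^2 on the pruned network."""
    return X.T @ (X @ (mask * w) - y)

def iterative_snip(w, final_sparsity, steps):
    """Prune gradually: at step t, keep an exponentially decaying fraction of
    weights, re-computing connection sensitivity after each pruning round.
    Only surviving weights compete, so pruned weights cannot resurrect."""
    mask = np.ones_like(w, dtype=bool)
    n = w.size
    for t in range(1, steps + 1):
        keep_frac = (1.0 - final_sparsity) ** (t / steps)  # assumed schedule
        k = max(1, int(round(keep_frac * n)))
        scores = np.abs(w * grad(w, mask)) * mask  # zero score for pruned weights
        keep = np.argsort(scores)[-k:]
        mask = np.zeros_like(mask)
        mask[keep] = True
    return mask

mask = iterative_snip(w, final_sparsity=0.8, steps=4)
```

Dropping the `* mask` factor in the score line would let previously pruned weights re-enter the competition at later rounds, which is, in spirit, the exploration that distinguishes the FORCE variant from iterative SNIP.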

2. RELATED WORK

Pruning trained models Most pruning works follow the train-prune-fine-tune cycle (Mozer & Smolensky, 1989; LeCun et al., 1990; Hassibi et al., 1993; Han et al., 2015; Molchanov et al., 2017; Guo et al., 2016), which requires training the dense network until convergence, followed by multiple iterations of pruning and fine-tuning until a target sparsity is reached. In particular, Molchanov et al. (2017) present a criterion very similar to that of Mozer & Smolensky (1989), and therefore to Lee et al. (2019) and our FORCE, but they focus on pruning whole neurons and interleave training rounds with pruning. Frankle & Carbin (2019) and Frankle et al. (2020) showed that it is possible to find sparse sub-networks that, when trained from scratch or from an early training iteration, match or even surpass the performance of their dense counterparts. Nevertheless, to find them they use a costly procedure based on Han et al. (2015). All these methods rely on having a trained network and are thus not applicable before training. In contrast, our algorithm finds a trainable sub-network with randomly initialized weights, making the overall pruning cost much cheaper and presenting an opportunity to leverage sparsity during training as well.

Induce sparsity during training Another popular approach has been to induce sparsity during training. This can be achieved by modifying the loss function to consider sparsity as part of the optimization (Chauvin, 1989; Carreira-Perpiñán & Idelbayev, 2018; Louizos et al., 2018) or by dynamically pruning during training (Bellec et al., 2018; Mocanu et al., 2018; Mostafa & Wang, 2019; Dai et al., 2019; Dettmers & Zettlemoyer, 2020; Lin et al., 2020; Kusupati et al., 2020; Evci et al., 2019). These methods are usually cheaper than pruning after training, but they still need to train the network to select the final sparse sub-network.
Since we focus on finding sparse sub-networks before any weight update, these methods are not directly comparable to ours.

Pruning at initialization These methods represent a significant leap with respect to other pruning approaches. While traditional pruning mechanisms focus on bringing speed-ups and memory reductions at inference time, pruning at initialization brings the same gains at both training and inference time. Moreover, such methods can be seen as a form of Neural Architecture Search (Zoph & Le, 2016) to find more efficient network topologies. Thus, they are of both theoretical and practical interest.

