PROGRESSIVE SKELETONIZATION: TRIMMING MORE FAT FROM A NETWORK AT INITIALIZATION

Abstract

Recent studies have shown that skeletonization (pruning parameters) of networks at initialization provides all the practical benefits of sparsity both at inference and training time, while only marginally degrading their performance. However, we observe that beyond a certain level of sparsity (approximately 95%), these approaches fail to preserve the network performance and, to our surprise, in many cases perform even worse than trivial random pruning. To this end, we propose an objective to find a skeletonized network with maximum foresight connection sensitivity (FORCE), whereby the trainability, in terms of connection sensitivity, of a pruned network is taken into consideration. We then propose two approximate procedures to maximize our objective: (1) Iterative SNIP, which allows parameters that were unimportant at earlier stages of skeletonization to become important at later stages; and (2) FORCE, an iterative process that enables exploration by allowing already pruned parameters to resurrect at later stages of skeletonization. Empirical analysis on a large suite of experiments shows that our approach, while performing at least as well as other recent approaches at moderate pruning levels, provides remarkably improved performance at higher pruning levels (removing up to 99.5% of parameters while keeping the networks trainable).

1. INTRODUCTION

The majority of pruning algorithms for Deep Neural Networks require training dense models and often fine-tuning sparse sub-networks in order to obtain their pruned counterparts. In Frankle & Carbin (2019), the authors provide empirical evidence to support the hypothesis that there exist sparse sub-networks that can be trained from scratch to achieve similar performance as the dense ones. However, their method to find such sub-networks requires training the full-sized model and intermediate sub-networks, making the process much more expensive. Recently, Lee et al. (2019) presented SNIP. Building upon an almost three-decades-old saliency criterion for pruning trained models (Mozer & Smolensky, 1989), they are able to predict, at initialization, the importance each weight will have later in training. Pruning-at-initialization methods are much cheaper than conventional pruning methods. Moreover, while traditional pruning methods can only help accelerate inference, pruning at initialization may go one step further and provide the same benefits at training time (Elsen et al., 2020). Wang et al. (2020) (GRASP) noted that after applying the pruning mask, gradients are modified due to non-trivial interactions between weights; thus, maximizing the SNIP criterion before pruning might be sub-optimal. They present an approximation to maximize the gradient norm after pruning, where they treat pruning as a perturbation on the weight matrix and use a first-order Taylor approximation. While they show improved performance, their approximation involves computing a Hessian-vector product, which is expensive both in terms of memory and computation.
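The SNIP-style saliency described above can be sketched compactly: each connection is scored by the magnitude of its weight times the gradient of the loss with respect to that weight at initialization, and only the top-scoring fraction is kept. The following is a minimal illustrative sketch in NumPy (function and variable names are our own, not from any of the cited papers); in practice the gradients would come from backpropagation on a mini-batch.

```python
import numpy as np

def snip_scores(weights, grads):
    """Connection sensitivity in the spirit of SNIP: |weight * gradient|,
    normalized to sum to 1. `weights` and `grads` are flat arrays of the
    same shape; the gradient is taken at initialization."""
    s = np.abs(weights * grads)
    return s / s.sum()

def prune_mask(scores, sparsity):
    """Build a binary mask keeping the top (1 - sparsity) fraction of
    connections by saliency score."""
    k = int(round((1.0 - sparsity) * scores.size))  # connections to keep
    mask = np.zeros_like(scores, dtype=bool)
    if k > 0:
        mask[np.argsort(scores)[-k:]] = True  # highest-scoring entries
    return mask

# Toy example: four weights, uniform gradients, 50% sparsity.
w = np.array([1.0, -2.0, 0.1, 3.0])
g = np.array([0.5, 0.5, 0.5, 0.5])
mask = prune_mask(snip_scores(w, g), sparsity=0.5)
```

Note that this one-shot variant computes the scores once and prunes immediately; the iterative procedures proposed in this paper instead prune gradually over several steps, re-evaluating saliency on the intermediate sub-networks.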

