HOW I LEARNED TO STOP WORRYING AND LOVE RETRAINING

Abstract

Many Neural Network Pruning approaches consist of several iterative training and pruning steps, seemingly losing a significant amount of their performance after pruning and then recovering it in the subsequent retraining phase. Recent works of Renda et al. (2020) and Le & Hua (2021) demonstrate the significance of the learning rate schedule during the retraining phase and propose specific heuristics for choosing such a schedule for IMP (Han et al., 2015). We place these findings in the context of the results of Li et al. (2020) regarding the training of models within a fixed training budget and demonstrate that, consequently, the retraining phase can be massively shortened using a simple linear learning rate schedule. Improving on existing retraining approaches, we additionally propose a method to adaptively select the initial value of the linear schedule. Going a step further, we propose similarly imposing a budget on the initial dense training phase and show that the resulting simple and efficient method is capable of outperforming significantly more complex or heavily parameterized state-of-the-art approaches that attempt to sparsify the network during training. These findings not only advance our understanding of the retraining phase, but more broadly question the belief that one should aim to avoid the need for retraining and reduce the negative effects of 'hard' pruning by incorporating the sparsification process into the standard training.

1. INTRODUCTION

Modern Neural Network architectures are commonly highly over-parameterized (Zhang et al., 2016), containing millions or even billions of parameters, resulting in both high memory requirements as well as computationally intensive and long training and inference times. It has been shown, however (LeCun et al., 1989; Hassibi & Stork, 1993; Han et al., 2015; Gale et al., 2019; Lin et al., 2020; Blalock et al., 2020), that modern architectures can be compressed dramatically by pruning, i.e., removing redundant structures such as individual weights, entire neurons or convolutional filters. The resulting sparse models require only a fraction of the storage and floating-point operations (FLOPs) for inference, while experiencing little to no degradation in predictive power compared to the dense model. Although it has been observed that pruning might have a regularizing effect and be beneficial to generalization (Blalock et al., 2020), a very heavily pruned model will normally be less performant than its dense (or moderately pruned) counterpart (Hoefler et al., 2021). One approach to pruning consists of removing part of a network's weights from the model architecture after a standard training process, seemingly losing most of its predictive performance, and then retraining to compensate for that pruning-induced loss. This can be done either One Shot, that is, pruning and retraining only once, or the process of pruning and retraining can be repeated iteratively. Although dating back to the early work of Janowsky (1989), this approach was most notably proposed by Han et al. (2015) in the form of ITERATIVE MAGNITUDE PRUNING (IMP). In its full iterative form, as for example formulated by Renda et al. (2020), IMP can require the original train time several times over to produce a pruned network, resulting in hundreds of retraining epochs on top of the original training procedure and leading to its reputation for being computationally impractical (Liu et al., 2020; Ding et al., 2019; Hoefler et al., 2021; Lin et al., 2020; Wortsman et al., 2019). This, as well as the belief that IMP achieves sub-optimal states (Carreira-Perpinán & Idelbayev, 2018; Liu et al., 2020), is one of the motivating factors behind methods that similarly start with an initially dense model but incorporate the sparsification into the training. We refer to such dense-to-sparse methods as pruning-stable (Bartoldson et al., 2020).

Motivated by recent results of Li et al. (2020) regarding the training of Neural Networks under constraints on the number of training iterations, we challenge these commonly held beliefs by rethinking the retraining phase of IMP within the context of Budgeted Training and demonstrate that it can be massively shortened by using a simple linearly decaying learning rate schedule. We further demonstrate the importance of the learning rate scheme during the retraining phase and improve upon the results of Renda et al. (2020) and Le & Hua (2021) by proposing a simple and efficient approach to also choose the initial value of the learning rate, a problem which has not been previously addressed in the context of pruning. We also propose likewise imposing a budget on the initial dense training phase of IMP, turning it into a method capable of efficiently producing sparse, trained networks without the need for a pretrained model by effectively leveraging a cyclic linear learning rate schedule. The resulting method is able to outperform significantly more complex and heavily parameterized state-of-the-art approaches, which aim to reach pruning-stability at the end of training by incorporating the sparsification into the training process, while using fewer computational resources.

Contributions. The major contributions are as follows:

1. We empirically find that the results of Li et al. (2020) regarding the Budgeted Training of Neural Networks apply to the retraining phase of IMP, providing further context for the results of Renda et al. (2020) and Le & Hua (2021). Building on this, we find that the runtime of IMP can be drastically shortened by using a simple linear learning rate schedule with little to no degradation in model performance.

2. We propose a novel way to choose the initial value of this linear schedule without the need to tune additional hyperparameters, in the form of ADAPTIVE LINEAR LEARNING RATE RESTARTING (ALLR). Our approach takes the impact of pruning as well as the overall retraining time into account, improving upon previously proposed retraining schedules on a variety of learning tasks.

3. By considering the initial dense training phase as part of the same budgeted training scheme, we derive a simple yet effective method in the form of BUDGETED IMP (BIMP) that can outperform many pruning-stable approaches given the same number of iterations to train a network from scratch.

We believe that our findings not only advance the general understanding of the retraining phase, but more broadly question the belief that methods aiming for pruning-stability are generally preferable over methods that rely on 'hard' pruning and retraining, both in terms of the quality of the resulting networks and in terms of the speed at which they are obtained. We also hope that BIMP can serve as a modular and easily implemented baseline against which future approaches can be realistically compared.

Outline. Section 2 contains a summary of existing literature and network pruning approaches. It also contains a reinterpretation of some of these results in the context of Budgeted Training, as well as a technical description of the methods we are proposing. In Section 3 we experimentally analyze and verify the claims made in the preceding section. We conclude this paper with some relevant discussion in Section 4.
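To make the budgeted retraining scheme of the first contribution concrete, the linearly decaying schedule can be sketched in a few lines. The function name and interface below are our own illustrative choices, not the paper's implementation:

```python
def linear_retraining_schedule(eta_init, budget_steps):
    """Per-step learning rates for a retraining phase of `budget_steps`
    iterations: the rate decays linearly from `eta_init` towards zero
    over the fixed budget, in the spirit of Budgeted Training.
    (Illustrative sketch, not the paper's exact implementation.)"""
    return [eta_init * (1.0 - t / budget_steps) for t in range(budget_steps)]
```

ALLR additionally adapts `eta_init` to the severity of the pruning step and the length of the retraining phase, while BIMP repeats such a linear schedule cyclically across pruning rounds; both are described in detail in Section 2.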

2. PRELIMINARIES AND METHODOLOGY

While the sparsification of Neural Networks includes a wide variety of approaches, we will focus on the analysis of Model Pruning, i.e., the removal of redundant structures in a Neural Network. We focus on performing unstructured pruning, that is, the removal of individual weights, while also providing experiments for its structured counterpart, where entire groups of elements, such as convolutional filters, are removed. We will also focus on approaches that follow the dense-to-sparse paradigm, i.e., that start with a dense network and then either sparsify the network during training or after training, as opposed to methods that prune before training (e.g. Lee et al., 2019) or dynamic sparse training methods (e.g. Evci et al., 2020) where the networks are sparse throughout the entire training process. For a full and detailed survey of pruning algorithms we refer the reader to Hoefler et al. (2021).
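As a minimal sketch of the unstructured, magnitude-based pruning step underlying IMP: the smallest-magnitude weights across all layers are set to zero until a target sparsity is reached. The function and its NumPy-based interface are our own illustration, not the paper's code:

```python
import numpy as np

def global_magnitude_prune(weights, sparsity):
    """Unstructured global magnitude pruning (illustrative sketch).

    `weights` is a list of per-layer weight arrays; `sparsity` is the
    fraction of weights to remove. Weights whose magnitude falls at or
    below a single global threshold are zeroed out across all layers.
    """
    all_mags = np.concatenate([np.abs(w).ravel() for w in weights])
    k = int(sparsity * all_mags.size)  # number of weights to remove
    if k == 0:
        return [w.copy() for w in weights]
    # k-th smallest magnitude serves as the global pruning threshold.
    threshold = np.partition(all_mags, k - 1)[k - 1]
    return [w * (np.abs(w) > threshold) for w in weights]
```

Structured pruning replaces the elementwise mask with a score per group (e.g. the norm of a convolutional filter) and removes entire groups at once.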
