HOW I LEARNED TO STOP WORRYING AND LOVE RETRAINING

Abstract

Many Neural Network Pruning approaches consist of several iterative training and pruning steps, seemingly losing a significant amount of their performance after pruning and then recovering it in the subsequent retraining phase. Recent works of Renda et al. (2020) and Le & Hua (2021) demonstrate the significance of the learning rate schedule during the retraining phase and propose specific heuristics for choosing such a schedule for IMP (Han et al., 2015). We place these findings in the context of the results of Li et al. (2020) regarding the training of models within a fixed training budget and demonstrate that, consequently, the retraining phase can be massively shortened using a simple linear learning rate schedule. Improving on existing retraining approaches, we additionally propose a method to adaptively select the initial value of the linear schedule. Going a step further, we propose similarly imposing a budget on the initial dense training phase and show that the resulting simple and efficient method is capable of outperforming significantly more complex or heavily parameterized state-of-the-art approaches that attempt to sparsify the network during training. These findings not only advance our understanding of the retraining phase, but more broadly question the belief that one should aim to avoid the need for retraining and reduce the negative effects of 'hard' pruning by incorporating the sparsification process into the standard training.

1. INTRODUCTION

Modern Neural Network architectures are commonly highly over-parameterized (Zhang et al., 2016), containing millions or even billions of parameters, resulting in both high memory requirements as well as computationally intensive and long training and inference times. It has been shown, however (LeCun et al., 1989; Hassibi & Stork, 1993; Han et al., 2015; Gale et al., 2019; Lin et al., 2020; Blalock et al., 2020), that modern architectures can be compressed dramatically by pruning, i.e., removing redundant structures such as individual weights, entire neurons or convolutional filters. The resulting sparse models require only a fraction of the storage and floating-point operations (FLOPs) for inference, while experiencing little to no degradation in predictive power compared to the dense model. Although it has been observed that pruning might have a regularizing effect and be beneficial to generalization (Blalock et al., 2020), a very heavily pruned model will normally be less performant than its dense (or moderately pruned) counterpart (Hoefler et al., 2021). One approach to pruning consists of removing part of a network's weights from the model architecture after a standard training process, at which point the model seemingly loses most of its predictive performance, and then retraining to compensate for that pruning-induced loss. This can be done either One Shot, that is, pruning and retraining only once, or iteratively, repeating the pruning and retraining process several times. Although dating back to the early work of Janowsky (1989), this approach was most notably proposed by Han et al. (2015) in the form of ITERATIVE MAGNITUDE PRUNING (IMP). In its full iterative form, for example as formulated by Renda et al.
(2020), IMP can require several times the original training time to produce a pruned network, resulting in hundreds of retraining epochs on top of the original training procedure and leading to its reputation for being computationally impractical (Liu et al., 2020; Ding et al., 2019; Hoefler et al., 2021; Lin et al., 2020; Wortsman et al., 2019). This, as well as the belief that IMP achieves sub-optimal states (Carreira-Perpinán & Idelbayev, 2018; Liu et al., 2020), is one of the motivating factors behind methods that similarly start with an initially dense model but incorporate the sparsification into the training. We refer to such dense-to-sparse methods as pruning-stable (Bartoldson et al., 2020). Motivated by recent results of Li et al. (2020) regarding the training of Neural Networks under constraints on the number of training iterations, we challenge these commonly held beliefs by rethinking the retraining phase of IMP within the context of Budgeted Training and demonstrate that it can be massively shortened by using a simple linearly decaying learning rate schedule. We further demonstrate the importance of the learning rate scheme during the retraining phase and improve upon the results of Renda et al. (2020) and Le & Hua (2021) by proposing a simple and efficient approach to also choose the initial value of the learning rate, a problem which has not previously been addressed in the context of pruning. We also propose likewise imposing a budget on the initial dense training phase of IMP, turning it into a method capable of efficiently producing sparse, trained networks without the need for a pretrained model by effectively leveraging a cyclic linear learning rate schedule.
The resulting method is able to outperform significantly more complex and heavily parameterized state-of-the-art approaches that aim to reach pruning-stability at the end of training by incorporating the sparsification into the training process, while using fewer computational resources.

Contributions. Our major contributions are as follows:

1. We empirically find that the results of Li et al. (2020) regarding the Budgeted Training of Neural Networks apply to the retraining phase of IMP, providing further context for the results of Renda et al. (2020) and Le & Hua (2021). Building on this, we find that the runtime of IMP can be drastically shortened by using a simple linear learning rate schedule with little to no degradation in model performance.

2. We propose ADAPTIVE LINEAR LEARNING RATE RESTARTING (ALLR), a novel way to choose the initial value of this linear schedule without the need to tune additional hyperparameters. Our approach takes the impact of pruning as well as the overall retraining time into account, improving upon previously proposed retraining schedules on a variety of learning tasks.

3. By considering the initial dense training phase as part of the same budgeted training scheme, we derive BUDGETED IMP (BIMP), a simple yet effective method that can outperform many pruning-stable approaches given the same number of iterations to train a network from scratch.

We believe that our findings not only advance the general understanding of the retraining phase, but more broadly question the belief that methods aiming for pruning-stability are generally preferable over methods that rely on 'hard' pruning and retraining, both in terms of the quality of the resulting networks and in terms of the speed at which they are obtained. We also hope that BIMP can serve as a modular and easily implemented baseline against which future approaches can be realistically compared. Outline.
Section 2 contains a summary of existing literature and network pruning approaches. It also contains a reinterpretation of some of these results in the context of Budgeted Training as well as a technical description of the methods we are proposing. In Section 3 we will experimentally analyze and verify the claims made in the preceding section. We conclude this paper with some relevant discussion in Section 4.

2. PRELIMINARIES AND METHODOLOGY

While the sparsification of Neural Networks includes a wide variety of approaches, we will focus on the analysis of Model Pruning, i.e., the removal of redundant structures in a Neural Network. We focus on performing unstructured pruning, that is, the removal of individual weights, but also provide experiments for its structured counterpart, where entire groups of elements, such as convolutional filters, are removed. We will also focus on approaches that follow the dense-to-sparse paradigm, i.e., that start with a dense network and then sparsify the network either during or after training, as opposed to methods that prune before training (e.g., Lee et al., 2019) or dynamic sparse training methods (e.g., Evci et al., 2020), where the networks are sparse throughout the entire training process. For a full and detailed survey of pruning algorithms, we refer the reader to Hoefler et al. (2021). Pruning-unstable methods are exemplified by ITERATIVE MAGNITUDE PRUNING (IMP) (Han et al., 2015). In its original form, IMP first employs standard network training, adding a common ℓ2-regularization term to the objective, and then removes all weights from the network with magnitude below a certain threshold. The network at this point commonly loses some or even all of its learned predictive power, so it is then retrained for a fixed number of epochs. This prune-retrain cycle is usually repeated a number of times; the threshold at every pruning step is determined as the appropriate percentile such that, at the end of a given number of iterations, a desired target sparsity is met. Renda et al. (2020) suggested the following complete approach: train a network for T epochs and then iteratively prune 20% of the remaining weights and retrain for T_rt = T epochs until the desired sparsity is reached. For a goal sparsity of 98% and T = 200 original training epochs, the algorithm would therefore require 18 prune-retrain cycles, for a massive 3600 total retraining epochs.
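The cycle count above follows directly from the geometric pruning rule. A minimal sketch of this arithmetic (the function name is ours; the 20% per-cycle fraction is the one used by Renda et al., 2020):

```python
import math

def imp_schedule(target_sparsity, prune_fraction=0.2, retrain_epochs=200):
    """Number of prune-retrain cycles (and total retraining epochs) needed by
    iterative magnitude pruning when a fixed fraction of the *remaining*
    weights is pruned in each cycle."""
    # After k cycles, a (1 - prune_fraction)**k fraction of weights remains;
    # we need this to drop to at most (1 - target_sparsity).
    cycles = math.ceil(math.log(1 - target_sparsity) / math.log(1 - prune_fraction))
    return cycles, cycles * retrain_epochs

print(imp_schedule(0.98))  # 18 cycles of 200 retraining epochs each
```

For a 95% sparsity target, the same rule yields 14 cycles, matching the retraining budgets reported in Section 3.2.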
There has been some recent interest in the learning rate schedule used during retraining. The original approach by Han et al. (2015) is commonly referred to as FINE TUNING (FT): suppose we train for T epochs using the learning rate schedule (η_t)_{t≤T} and retrain for T_rt epochs per prune-retrain cycle, then FT retrains the pruned network using a constant learning rate of η_T, i.e., the last learning rate used during the original training. Renda et al. (2020) note that the learning rate schedule during retraining can have a dramatic impact on the predictive performance of the pruned network and propose LEARNING RATE REWINDING (LRW), where one retrains the pruned network for T_rt epochs using the last T_rt learning rates η_{T−T_rt+1}, ..., η_T during each cycle. Le & Hua (2021) further improved upon these results by proposing SCALED LEARNING RATE RESTARTING (SLR), where the pruned network is retrained using a proportionally identical schedule, i.e., by compressing (η_t)_{t≤T} into the retraining time frame of T_rt epochs, with a short warm-up phase. They also introduced CYCLIC LEARNING RATE RESTARTING (CLR), based on the 1-cycle learning rate schedule of Smith & Topin (2017), where the original schedule (commonly a stepped one) is replaced with a cosine-based one starting at the same initial learning rate η_1, likewise including a short warm-up phase. Figure 1 depicts the aforementioned schedules for a retraining budget of 60 epochs.
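As a self-contained illustration of these retraining schemes, the following sketch (our own simplification, not the authors' code) returns the learning rate at epoch t of a T_rt-epoch retraining cycle. We assume a stepped original schedule with 10x drops at 50% and 75% of training and omit the warm-up phases for brevity; the linear variant proposed in Section 2.1 is included for comparison:

```python
import math

def original_lr(t, T=200, eta1=0.1):
    """Stepped schedule of the original training: decay by 10x at 50% and 75%."""
    if t < 0.5 * T:
        return eta1
    if t < 0.75 * T:
        return eta1 / 10
    return eta1 / 100

def retrain_lr(method, t, T_rt, T=200, eta1=0.1):
    """Learning rate at retraining epoch t (0-indexed) under each scheme."""
    if method == "FT":    # constant at the last original learning rate
        return original_lr(T - 1, T, eta1)
    if method == "LRW":   # replay the last T_rt values of the original schedule
        return original_lr(T - T_rt + t, T, eta1)
    if method == "SLR":   # compress the full original schedule into T_rt epochs
        return original_lr(t * T / T_rt, T, eta1)
    if method == "CLR":   # cosine decay from eta1 over the retraining budget
        return 0.5 * eta1 * (1 + math.cos(math.pi * t / T_rt))
    if method == "LLR":   # linear decay from eta1 to zero (Section 2.1)
        return eta1 * (1 - t / T_rt)
    raise ValueError(method)
```

For example, with T = 200 and T_rt = 60, FT retrains constantly at η_T = 0.001, LRW starts at 0.01 (the schedule value at epoch 140), and SLR, CLR and LLR all restart at η_1 = 0.1.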

2.1. RETHINKING RETRAINING AS BUDGETED TRAINING

LRW was proposed as a variant of WEIGHT REWINDING (WR) (Frankle et al., 2019), suggesting that its success is due to some connection to the Lottery Ticket Hypothesis. Le & Hua (2021) already gave a more grounded motivation when introducing SLR by noting that its main feature is the "usage of large learning rates". By proposing CLR and motivating it through the 1-cycle learning rate schedule of Smith & Topin (2017), they also already demonstrated that there is no particular significance to basing the learning rate schedule on the one used during the original training. We think that the results of Li et al. (2020) regarding the training of Neural Networks within a fixed iteration budget (Budgeted Training) provide relevant further context for the varying success achieved by these methods, as well as indications of how one can further improve upon them, in particular when retraining is assumed to be significantly shorter than the original training, that is, when T_rt ≪ T. Li et al. (2020) study training when a resource budget is given in the form of a fixed number of epochs or iterations that the network will be trained for, instead of following the common assumption that training is executed until asymptotic convergence is achieved to some satisfactory degree. Specifically, they empirically determine which learning rate schedules are best suited to achieve the highest possible performance within a given budget. The two major takeaways from their results are as follows: 1. Compressing any given learning rate schedule to fit within a specific budget significantly outperforms simply truncating it once the budget is reached. Li et al. (2020) refer to this compression as BUDGET-AWARE CONVERSION (BAC). This clearly aligns with the findings of Le & Hua (2021) that SLR outperforms LRW, since SLR is simply the BAC of the original training schedule while LRW is a truncated version of it (albeit truncated 'from the back' instead of the front). 2.
Certain learning rate schedules are more suited to a wide variety of budgets than others. In particular, their results indicate that a linear schedule performs best when a tight budget is given, closely followed by a cosine-based approach. This provides an explanation for why CLR outperforms SLR when the original learning rate schedule follows more traditional recommendations. Put succinctly, the empirical results of Li et al. (2020) regarding the learning rate schedule in a budgeted training context closely resemble the development and improvement of retraining schedules in the context of pruning. Hence, we claim that retraining should first and foremost be considered under the aspect of Budgeted Training and that lessons derived in the latter setting are generally applicable in this context. Motivated by the findings of Li et al. (2020), we therefore propose LINEAR LEARNING RATE RESTARTING (LLR), leveraging a linear learning rate schedule during retraining: LLR linearly decays the learning rate during each retrain cycle from an initial value of η_1 to zero after a short warm-up phase. In the iterative setting, this effectively results in a cyclic learning rate schedule over the course of pruning and retraining, which has previously been found to help generalization (Smith, 2017). Going one step further, we also propose dynamically adapting the initial value of the retraining schedule by relating it not just to the initial learning rate of the original training but also to the impact of the previous pruning step, resulting in ADAPTIVE LINEAR LEARNING RATE RESTARTING (ALLR). While previous works have focused on the actual schedule of the learning rate during retraining, the initial value has only implicitly been dealt with. FT chooses the last learning rate value η_T, which is typically the smallest.
On the other hand, SLR and CLR rely on the initial value corresponding to the maximum value η_1 of the original schedule, to which the authors attribute the success of their methods. The initial value of LRW is implicitly chosen in proportion to the retraining time by truncating the original schedule from the back. Existing works have shown that, to find minima that generalize well, the learning rate should exhibit both a large-step and a small-step regime (Jastrzębski et al., 2017; Li et al., 2019; You et al., 2019; Leclerc & Madry, 2020). When choosing the initial value of the retraining schedule, the two characteristics of a prune-retrain cycle have to be taken into account: its length and the impact of pruning. Given a tight retraining budget, it might occur that large initial steps cannot be compensated for adequately, while a too small learning rate (possibly over a long retraining period) might be insufficient to recover from a large pruning-induced performance degradation. An adaptive way of choosing the initial step size must therefore address the following question: how much of an increase in loss do we have to compensate for, and do we have sufficient time to properly perform both a large-step and a small-step learning rate regime? To that end, ALLR discounts the initial value η_1 by a factor d ∈ [0, 1] to account for both the available retraining time (similar to LRW, where the magnitude of the initial learning rate naturally depends on T_rt) and the performance drop induced by pruning. Since measuring the decrease in train accuracy would require an additional evaluation epoch and is thus undesirable (cf.
Appendix C.1 for an ablation study), ALLR achieves this goal by first measuring the relative L2-norm change in the weights due to pruning. That is, after pruning an s ∈ (0, 1] fraction of the remaining weights, we compute the normalized distance between the weight vector W and its pruned version W_p as d_1 = ‖W − W_p‖_2 / (‖W‖_2 · √s) ∈ [0, 1], where the normalization by √s ensures that d_1 can actually attain the full range of values in [0, 1]. We then determine d_2 = T_rt / T to account for the length of the retrain phase and choose d · η_1 with d = max(d_1, d_2) as the initial learning rate for ALLR. This approach effectively interpolates between the recommendations of Renda et al. (2020) and Le & Hua (2021) based on a computationally cheap proxy. Appendix C.1 contains several ablation studies to justify our design choices. In Section 3.1, we will verify our claims by empirically comparing the retraining schedules, namely FT, LRW, SLR, CLR, LLR and ALLR, against one another as well as against tuned versions of their underlying (constant, stepped, cosine or linear) schedules. We then study in Section 3.2 to what degree the retraining phase of IMP can be shortened when leveraging the proposed schedules.

2.2. PRUNING-STABILITY: TRYING TO AVOID RETRAINING

Pruning-stable algorithms are defined by their attempt to find a well-performing pruned model from an initially dense one during the training procedure, so that the ultimate 'hard' pruning step results in almost no drop in accuracy and the retraining phase becomes superfluous. They do so by inducing a strong implicit bias during some otherwise standard training setup, either by gradual pruning, i.e., extending the pruning mask dynamically, or by employing regularization- and constraint-optimization techniques to learn an almost sparse structure throughout training.
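To make the ALLR rule from Section 2.1 concrete, the following minimal NumPy sketch computes the adaptive initial value and the subsequent linear decay (array and function names are our own; the short warm-up phase is omitted):

```python
import numpy as np

def allr_initial_lr(w, prune_frac, T_rt, T, eta1):
    """Initial learning rate of the linear retraining schedule under ALLR.

    w          -- flattened weight vector before this pruning step
    prune_frac -- fraction s of the remaining weights pruned by magnitude
    T_rt, T    -- retraining budget and original training length (epochs)
    eta1       -- initial learning rate of the original training
    """
    # Magnitude pruning: zero out the s smallest-magnitude weights.
    k = int(round(prune_frac * w.size))
    w_p = w.copy()
    w_p[np.argsort(np.abs(w))[:k]] = 0.0

    # d1: pruning impact, normalized by sqrt(s) so that it can reach 1.
    d1 = np.linalg.norm(w - w_p) / (np.linalg.norm(w) * np.sqrt(prune_frac))
    # d2: relative length of the retraining phase (the LRW-like term).
    d2 = T_rt / T
    return max(d1, d2) * eta1

def linear_schedule(eta0, t, T_rt):
    """LLR/ALLR: decay linearly from eta0 to zero over the retraining cycle."""
    return eta0 * (1 - t / T_rt)
```

If all weights have equal magnitude, d_1 = 1 and ALLR restarts at the full η_1 as SLR and CLR would; if the pruned weights are negligibly small, d_1 ≈ 0 and the budget ratio d_2 takes over.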
Many methods also rely on some kind of 'soft' pruning to achieve this, e.g., by zeroing out weights or strongly pushing them towards zero, but not fully removing them from the network architecture during training. Let us briefly summarize a variety of methods that have been proposed in this category over the last couple of years: LC (Carreira-Perpinán & Idelbayev, 2018) and GSM (Ding et al., 2019) both employ a modification of weight decay and force the k weights with the smallest score more rapidly towards zero, where k is the number of parameters that will eventually be pruned and the score is the parameter magnitude or its product with the loss gradient. Similarly, DNW (Wortsman et al., 2019) zeroes out the smallest k weights in the forward pass while still using a dense gradient. CS (Savarese et al., 2020), STR (Kusupati et al., 2020) and DST (Liu et al., 2020) all rely on the creation of additional trainable threshold parameters, which are applied to sparsify the model while being regularly trained alongside the usual weights. Here, the training objectives are modified via penalty terms to control the sparsification. GMP (Zhu & Gupta, 2017; Gale et al., 2019) follows a tunable pruning schedule which sparsifies the network throughout training by dynamically extending and updating a pruning mask. Finally, based on this idea, DPF (Lin et al., 2020) maintains a pruning mask which is extended using the pruning schedule of Zhu & Gupta (2017), but allows for error compensation by modifying the update rule to use the (stochastic) gradient of the pruned model while updating the dense parameters. The two most commonly claimed advantages of pruning-stable methods compared to IMP are the following: 1. They result in a pruned model faster when training from scratch, since they avoid the expensive iterative prune-retrain cycles. Ding et al. (2019) for example advertise that there is "no need for time consuming retraining", Liu et al.
(2020) try to "avoid the expensive pruning and fine-tuning iterations", Hoefler et al. (2021) state that sparsifying during training "is usually cheaper than the train-then-sparsify schedule", Lin et al. (2020) argue that IMP is "computationally expensive" and "outperformed by algorithms that explore different sparsity masks instead of a single one", Wortsman et al. (2019) try to "train a sparse Neural Network without retraining or fine-tuning", and Frankle & Carbin (2018) (in the context of the Lottery Ticket Hypothesis) state that "iterative pruning is computationally intensive, requiring training a network 15 or more times consecutively". 2. They produce preferable results, either because they avoid 'hard' pruning or due to the particular implicit bias they employ. Liu et al. (2020) for example state that 'hard' pruning methods suffer from a "failure to properly recover the pruned weights", and Carreira-Perpinán & Idelbayev (2018) argue that learning the pruning set throughout training "helps find a better subset and hence prune more weights with no or little loss degradation". While many of the previously listed methods perform well and achieve state-of-the-art results, so far little empirical evidence has been given to support the claim that these advantages of pruning-stability have actually been achieved. To verify this, we propose BUDGETED IMP (BIMP), where the same lessons we previously derived from Budgeted Training for the retraining phase of IMP are applied to the initial training of the network. More specifically, given a budget of T epochs, we simply train a network for some T_0 < T epochs using a linearly decaying learning rate schedule and then apply IMP with the proposed schedules on the output for the remaining T − T_0 epochs.
The resulting method is capable of obtaining a pruned model from a random initialization within any given budget T while still maintaining all of the key characteristics of IMP, most notably the fact that (1) we 'hard' prune and do not allow weights to recover in subsequent steps, (2) we do not impose any particular additional implicit bias besides a common weight decay term during either training or retraining, and (3) we follow a prescribed static training schedule, with the exception of adapting the initial learning rate to the impact of the last pruning step in the case of ALLR. This clearly delineates BIMP from the previously listed methods and allows us to compare them on equal terms by giving all methods the same budget of T epochs in total, independent of whether they are spent on 'normal' training or retraining. In Section 3.3 we thoroughly compare our proposed approach to the previously listed pruning-stable methods in a fair setting. We remark that the implicit biases of many pruning-stable approaches can result in a substantial computational overhead that we deliberately ignore here by comparing methods on a per-epoch basis, therefore giving these methods an advantage in the comparison. However, we include the images-per-second throughput of the individual algorithms, which highlights that BIMP is among the most efficient approaches. Finally, let us remark that there has been a significant amount of attention on how to select the specific weights to be pruned. Ranking weights for pruning based on the magnitude of their current values has established itself as the approach of choice (Lee et al., 2019), and specific criteria have been proposed that take the particular network architecture into consideration (Zhu & Gupta, 2017; Gale et al., 2019; Evci et al., 2020; Lee et al., 2020). We have verified some of these results in Appendix C.2 and will stick to the simple global selection criterion used by Han et al. (2015) for BIMP.
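Putting the pieces together, the overall BIMP procedure can be summarized by the following training-loop sketch. This is our own toy illustration, not the released implementation: the `step` argument is a placeholder for one epoch of masked SGD, the model is a flat weight vector, ALLR's adaptive initial value and the warm-up phases are omitted, and the per-cycle pruning fraction is chosen geometrically to hit the target sparsity:

```python
import numpy as np

def magnitude_prune(w, mask, frac):
    """Zero out the `frac` smallest-magnitude weights among those still active."""
    active = np.flatnonzero(mask)
    k = int(round(frac * active.size))
    drop = active[np.argsort(np.abs(w[active]))[:k]]
    mask[drop] = False
    w[~mask] = 0.0
    return w, mask

def bimp(w, T=200, T0=80, cycles=3, target_sparsity=0.9, eta1=0.1,
         step=lambda w, lr, mask: w):  # placeholder for one epoch of training
    """BIMP sketch: budgeted dense training, then IMP prune-retrain cycles,
    each with a linearly decaying learning rate."""
    mask = np.ones(w.size, dtype=bool)
    # Phase 1: dense training for T0 epochs with a linearly decaying LR.
    for t in range(T0):
        w = step(w, eta1 * (1 - t / T0), mask)
    # Phase 2: split the remaining T - T0 epochs into prune-retrain cycles.
    per_cycle_frac = 1 - (1 - target_sparsity) ** (1 / cycles)
    T_rt = (T - T0) // cycles
    for _ in range(cycles):
        w, mask = magnitude_prune(w, mask, per_cycle_frac)
        for t in range(T_rt):
            w = step(w, eta1 * (1 - t / T_rt), mask)
    return w, mask

w, mask = bimp(np.random.default_rng(0).standard_normal(1000))
print(1 - mask.mean())  # final sparsity, approximately 0.9
```

Note how the total budget T is respected regardless of how it is split between the dense phase and retraining; T_0 and the number of cycles are exactly the hyperparameters tuned in Section 3.3.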

3. EXPERIMENTAL RESULTS

Let us outline the general methodological approach to the computational experiments in this section, including datasets, architectures and metrics. Experiment-specific details are found in the respective subsections. We note that, given the surge of interest in pruning, Blalock et al. (2020) proposed experimental guidelines in the hope of standardizing the experimental setup. We aim to follow these guidelines whenever possible. All experiments performed throughout this computational study are based on the PyTorch framework (Paszke et al., 2019), using the original code of the methods whenever available. All results and metrics were logged and analyzed using Weights & Biases (Biewald, 2020). We have made our code and general setup available at github.com/ZIB-IOL/BIMP for the sake of reproducibility. We perform extensive experiments on image recognition datasets such as ImageNet (Russakovsky et al., 2015) and CIFAR-10/100 (Krizhevsky et al., 2009), the semantic segmentation tasks COCO (Lin et al., 2014) and CityScapes (Cordts et al., 2016), as well as neural machine translation (NMT) on WMT16 (Bojar et al., 2016). In particular, we employed ResNets (He et al., 2015), Wide ResNets (WRN) (Zagoruyko & Komodakis, 2016), VGG (Simonyan & Zisserman, 2014) and the transformer-based MaxViT (Tu et al., 2022) architecture, as well as PSPNet (Zhao et al., 2017) and DeepLabV3 (Chen et al., 2017) in the case of CityScapes and COCO, respectively. For NMT, we used a T5 transformer (Raffel et al., 2020) available through HuggingFace (Wolf et al., 2020). Exact parameters can be found in Appendix A, where we also define what can be considered a 'standard' training setup for each setting, which we rely on whenever not otherwise specified.
The focus of our analysis will be the tradeoff between model sparsity and final test performance, the latter being accuracy in the case of image classification, mIoU (mean intersection over union) for segmentation, or the BLEU score (Post, 2018) for NMT. As a secondary measure, we will also consider the theoretical speedup (Blalock et al., 2020) induced by the sparsity; see Appendix A for full details. We use a validation set of 10% of the training data for hyperparameter selection.

3.1. LEARNING RATE SCHEDULES DURING RETRAINING

Table 1 contains part of the results of the comparison between FT, LRW, SLR, CLR and our proposed approaches LLR and ALLR for ImageNet in the One Shot setting. First of all, we find that retraining after pruning is in fact a Budgeted Training scenario, as the insights for normal dense training (Li et al., 2020) transfer to the retraining case. This is further observable when comparing the translated schedules to versions of constant, stepped, exponential, cosine and linear learning rate schedules, where the initial learning rate after pruning was tuned using a grid search (cf. Appendix B.1, which also contains additional results, exact parameter grids and the results for other datasets). In general, linear and cosine-based schedules clearly outperform the constant and stepped ones, with a slight advantage of LLR over CLR. However, for short retraining times and in the medium sparsity range, the fixed restarting schedules CLR and LLR fail to yield results competitive with FT and LRW, since a too large initial learning rate is detrimental given a restricted retraining budget. ALLR is able to consistently improve upon previous approaches, which becomes especially noticeable in the small retraining budget regime, whereas for larger budgets the approaches begin to converge. We think that ALLR is a suitable drop-in replacement when performing retraining. We similarly observe the strength of ALLR in Figure 2, depicting the highest achievable test accuracy for each number of total retraining epochs, including both the One Shot as well as the iterative pruning case. Appendix B.1 includes the full results on different tasks and datasets, longer retraining budgets, as well as the structured pruning setting, where we remove convolutional filters based on their respective norm (Li et al., 2016).

3.2. BUDGETING THE RETRAINING PHASE

In this part, we treat the number of retrain epochs per prune-retrain cycle T_rt as well as the total number of such cycles J as tunable hyperparameters for IMP and try to determine the tradeoff between the predictive performance of the final pruned network and the total number of retraining epochs J · T_rt. As a baseline performance for a pruned network, we use the approach suggested by Renda et al. (2020), as it serves as a good benchmark for the current potential of IMP. In Figure 3 we present the envelope of the results for ResNet-56 trained on CIFAR-10 with target sparsities of 90%, 95%, and 98%, respectively. The parameters for the retraining phase were optimized using a grid search over T_rt ∈ {10, 15, 20, ..., 60} and J ∈ {1, 2, ..., 6} using ALLR. We find that IMP is capable of achieving what has previously been considered its full potential with significantly less than the total number of retraining epochs usually budgeted for its iterative form. More concretely, for all three levels of sparsity, 90%, 95%, and 98%, IMP meets the baseline laid out by Renda et al. (2020) after only around 100 epochs of retraining, instead of requiring the 2000, 2800, and 3600 epochs used to establish that baseline, respectively. In fact, given the superiority of a linear learning rate schedule over a more commonly used stepped one in the case of budgeted training (Li et al., 2020), this baseline might not even reflect the full potential of IMP. Given that we have just established that the retraining phase of IMP takes well to enforcing a budget when using an appropriate learning rate schedule, and that Li et al. (2020) already established that 'normal' training can be significantly shortened through a linear learning rate schedule, it is reasonable to assume that the same holds for the original training phase, without strongly impacting the pruning and retraining part and therefore the ultimate product of IMP.
To verify this, we trained ResNet-56 on CIFAR-10 using a linearly decaying learning rate schedule for between 5% and 100% of 200 epochs, the latter being what we consider 'full' training, and then applied IMP with ALLR to the resulting network, both One Shot and iteratively, for target sparsities of 90%, 95%, and 98%. Figure 7 in the appendix shows the results for the iterative setting, where we retrain for three cycles of 15 epochs each. We can see that IMP takes well to budgeting the initial training period, both in the One Shot and in the iterative setting, with the target sparsity seemingly having little influence on how much the initial training can be compressed.
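The budget accounting behind the grid search in this subsection is simple enough to spell out; a small sketch (ours; the accuracy lookup for each configuration is of course omitted):

```python
from itertools import product

# Grids from Section 3.2: retrain epochs per cycle and number of cycles.
T_RT_GRID = range(10, 65, 5)   # {10, 15, ..., 60}
J_GRID = range(1, 7)           # {1, 2, ..., 6}

# Each configuration's total retraining cost is J * T_rt; the envelope in
# Figure 3 reports, for every total budget, the best accuracy of any
# configuration not exceeding it.
configs = sorted(product(J_GRID, T_RT_GRID), key=lambda c: c[0] * c[1])
cheapest, priciest = configs[0], configs[-1]
print(cheapest, cheapest[0] * cheapest[1])
print(priciest, priciest[0] * priciest[1])
```

Even the most expensive configuration, J = 6 cycles of T_rt = 60 epochs, costs 360 retraining epochs, an order of magnitude below the thousands budgeted by the baseline of Renda et al. (2020).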

3.3. THE EFFICACY OF PRUNING-STABILITY

We conclude by comparing the performance of BIMP to pruning-stable approaches. To that end, we train models on ImageNet, CIFAR-100 and CIFAR-10 and give all methods an equal budget of 90 epochs (200 for CIFAR) to derive a pruned model from a randomly initialized one. For BIMP, we employ ALLR and treat the initial training length T_0 as well as the number of prune-retrain cycles as hyperparameters, ensuring that the overall epoch budget T is not exceeded. The hyperparameters of each of the pruning-stable methods were likewise tuned using manually defined grid searches, resorting to the recommendations of the original publications whenever possible, see Appendix B.1. For GMP, GSM, DPF, DNW, and LC, we give the methods prespecified sparsity levels, the same as given to BIMP. Tuning the remaining methods in a way that allows for a fair comparison, however, is significantly more difficult, since none of them allows clearly specifying a desired level of sparsity; instead, they require tuning additional hyperparameters as part of the grid search. Despite our best efforts, we were only able to cover part of the desired sparsity range using STR and DST. For CS in the case of ImageNet, we were unable to tune the hyperparameters successfully. In addition, we noticed that each of these methods can exhibit some variance in the level of sparsity achieved even for fixed hyperparameters, so we list the standard deviation of the final sparsity with respect to the random seed initialization. We note that in the original works, LC and GSM were applied to pretrained models. To allow a fair comparison, we applied LC and GSM to both randomly initialized as well as pretrained models and chose the best results for each sparsity, giving them a larger budget than the others. Further, we noticed that some pruning-stable methods can profit from retraining.
For CIFAR-10 and CIFAR-100, we hence retrain all methods except BIMP for 30 epochs using FT and use the accuracy reached after retraining whenever it exceeds the original one. All results are averaged over multiple seeds with standard deviations indicated. Table 2 reports the final test performance, theoretical speedup, and actually achieved sparsity of all methods for CIFAR-10 and ImageNet, where we defer the full results to Appendix B.4. The results show that BIMP is able to outperform many of the pruning-stable methods considered here. For ImageNet, DNW consistently performs on par with or better than BIMP, albeit at the price of needing roughly twice as long for training, cf. the images-per-second throughput. Surprisingly, despite a broad hyperparameter grid search, most methods seem to be at a disadvantage compared to BIMP, with DPF being both an efficient and a strong competitor. BIMP obtains these results within the same number of overall training epochs, and this ignores the computational overhead of some of the more involved methods. We note that the authors of STR report better results on ImageNet, which we unfortunately were unable to replicate in our experimental setting (cf. Appendix B.1 for the exact hyperparameter grid). In Appendix C.3, we have included an ablation study where we compare BIMP to several modifications of GMP not previously suggested in the literature, since GMP is the closest in design to BIMP out of all pruning-stable methods considered here. Most notably, this includes variants of GMP with both a global and a cyclical linear learning rate schedule as well as a 'hard' pruning variant.

4. DISCUSSION AND OUTLOOK

The learning rate is often considered the single most important hyperparameter in Deep Learning, yet it remains poorly understood, certainly from a theoretical perspective but also from an empirical one. Our work therefore provides an important building block: we established that, counter to the often explicitly stated belief that IMP is inefficient, many significantly more complex and sometimes strenuously motivated methods are outperformed by perhaps the most basic of approaches when proper care is taken with the learning rate. Despite providing a strong retraining alternative with ALLR, we emphasize that the main goal of this work is not to suggest yet another acronym and claim that it is the be-all and end-all of network pruning, but instead (a) to hopefully focus the efforts of the community on understanding more basic questions before suggesting convoluted novel methods and (b) to emphasize that IMP can serve as a strong, easily implemented, and modular baseline. We think the modularity is of particular importance here, as individual aspects can easily be exchanged or modified, e.g., when, what, how, and how much to prune or how to retrain, in order to formulate rigorous ablation studies.

REPRODUCIBILITY

Reproducibility is of utmost importance for any comparative computational study such as this. All experiments are based on the PyTorch framework and use publicly available datasets. The implementations of the ResNet-56 and ResNet-18 network architectures are based on github.com/JJGO/shrinkbench and github.com/charlieokonomiyaki/pytorch-resnet18-cifar10, respectively, the implementation of the WideResNet network architecture is based on github.com/meliketoy/wide-resnet.pytorch, the implementation of the VGG-16 network architecture is based on github.com/jaeho-lee/layer-adaptive-sparsity, and the implementation of the ResNet-50 network architecture is taken from PyTorch. Regarding the pruning methods, the code was taken from the respective publications whenever possible. Regarding the different variants of magnitude pruning such as ERK or UNIFORM+, we closely followed the implementation of Lee et al. (2020) available at github.com/jaeho-lee/layer-adaptive-sparsity. For metrics such as the theoretical speedup, we relied on the implementation in the ShrinkBench framework of Blalock et al. (2020), see github.com/JJGO/shrinkbench. We have made our code and general setup available at github.com/ZIB-IOL/BIMP for the sake of reproducibility.

A TECHNICAL DETAILS AND TRAINING SETTINGS

A.1 TECHNICAL DETAILS AND GENERAL TRAINING SETTINGS

We define pruning-stability, theoretical speedup as well as several pruning selection criteria for IMP, i.e., different criteria that are used to select weights for pruning. We will analyze the impact of such criteria under different retraining schedules in Appendix C.2. Further, Table 3 shows the default training settings used throughout this work.

Definition A.1 (Bartoldson et al. (2020)). Let t_pre and t_post be the test accuracy before and after pruning the trained model, respectively. Assuming t_post ≤ t_pre, we define the pruning-stability of a method as Δ_stability := 1 − (t_pre − t_post)/t_pre ∈ [0, 1].

Pruning-stable methods are sparsification algorithms that learn a sparse solution throughout training such that Δ_stability ≈ 1. For example, methods that perform the forward pass using an already sparsified copy of the parameters (e.g., DNW by Wortsman et al., 2019) will have Δ_stability = 1, since the 'hard' pruning step only consists of an application of the already present pruning mask, which has no further effect. Methods that actively drive certain parameter groups towards zero more rapidly (such as Carreira-Perpinán & Idelbayev, 2018; Ding et al., 2019) will have a pruning-stability close to 1, since the projection of (magnitude) pruning at the end of training will perturb the parameters only slightly. Crucial to our analysis are the tradeoffs between the model sparsity, the final performance (measured by final test accuracy, BLEU scores or IoU) and the theoretical speedup induced by the sparsity (Blalock et al., 2020). The theoretical speedup is a metric measuring the ratio of FLOPs needed for inference between the dense and the sparse model. More precisely, let F_d be the number of FLOPs the dense model needs for inference, and let F_s similarly denote the same number for the pruned model, given some sparsity s.
The theoretical speedup is defined as F_d / F_s and depends solely on the position of the zero weights within the network and layers, not on the numerical values of the non-zero parameters. IMP in its original form treats all trainable parameters as a single vector and computes a global threshold below which parameters are removed, independent of the layer they belong to. This simple approach, which we will refer to as GLOBAL, has been subject to criticism for not determining optimal layer-dependent pruning rates and for being inconsistent (Liu et al., 2020). Fully-connected layers, for example, have many more parameters than convolutional layers and are therefore much less sensitive to weight removal (Han et al., 2015; Carreira-Perpinán & Idelbayev, 2018). Further, it has been observed that the position of a layer can play a role in whether that layer is amenable to pruning: often, first and last layers are claimed to be especially relevant for the classification performance (Gale et al., 2019). On the other hand, in which layers pruning takes place significantly impacts the sparsity-induced theoretical speedup (Blalock et al., 2020). Lastly, the non-negative homogeneity of modern ReLU-based Neural Network architectures (Neyshabur et al., 2015) would also seem to indicate a certain amount of arbitrariness to this heuristic selection rule, or at least a strong dependence on the network initialization and the optimizer used: weights can be rescaled such that GLOBAL removes all parameters of a layer, destroying the pruned network, without having affected the output of the unpruned network. Determining which weights to remove is hence crucial for successful pruning and several methods have been designed to address this fact. Zhu & Gupta (2017) introduced the UNIFORM allocation, in which a global sparsity level is enforced by pruning each layer to exactly this sparsity. Gale et al.
(2019) extend this approach in the form of UNIFORM+ by (a) keeping the first convolutional layer dense and (b) pruning at most 80% of the connections in the last fully-connected layer. Evci et al. (2020) propose a reformulation of the ERDŐS-RÉNYI KERNEL (ERK) criterion (Mocanu et al., 2018) to take the layer and kernel dimensions into account when determining the layerwise sparsity distribution. In particular, ERK allocates higher sparsity to layers with more parameters. Finally, Lee et al. (2020) propose LAYER-ADAPTIVE MAGNITUDE-BASED PRUNING (LAMP), an approach which takes an L2-distortion perspective by relaxing the problem of minimizing the output distortion at time of pruning with respect to the worst-case input. We note that we follow the advice of Evci et al. (2020) and Dettmers & Zettlemoyer (2019) and do not prune biases and batch-normalization parameters: they amount to only a negligible fraction of the total weights, yet keeping them has a very positive impact on the performance of the learned model. Further, for the computations involving GMP, we similarly employ the global selection criterion, since we found it to yield better results than UNIFORM+. We will compare these approaches in Appendix C.2 with a focus on the impact of the retraining phase. Since Le & Hua (2021) found that SLR can be used to obtain strong results even when pruning convolutional filters randomly, i.e., by assigning random importance scores to the filters instead of using the magnitude criterion or others, we are interested in understanding the importance of the retraining technique when considering different sparsity distributions. For experiments involving the pruning of convolutional filters instead of weights, we follow Li et al. (2016) and remove filters using an L2-norm criterion, enforcing a uniform distribution of sparsity among the layers.
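To make the difference between the selection criteria concrete, the GLOBAL and UNIFORM magnitude criteria can be sketched as follows (an illustrative NumPy sketch; function names are ours and ties at the threshold are handled naively):

```python
import numpy as np

def global_masks(weights, sparsity):
    """GLOBAL criterion: a single magnitude threshold across all layers."""
    flat = np.concatenate([np.abs(w).ravel() for w in weights])
    k = int(sparsity * flat.size)  # number of weights to remove
    threshold = np.sort(flat)[k - 1] if k > 0 else -np.inf
    return [np.abs(w) > threshold for w in weights]

def uniform_masks(weights, sparsity):
    """UNIFORM allocation (Zhu & Gupta, 2017): prune every layer to the goal sparsity."""
    masks = []
    for w in weights:
        k = int(sparsity * w.size)
        threshold = np.sort(np.abs(w).ravel())[k - 1] if k > 0 else -np.inf
        masks.append(np.abs(w) > threshold)
    return masks

# Toy example: one small-magnitude layer and one large-magnitude layer. At 50%
# global sparsity, the entire first layer falls below the global threshold,
# while UNIFORM keeps half of each layer.
layers = [np.array([0.1, 0.2, 0.3, 0.4]), np.array([1.0, 2.0, 3.0, 4.0])]
```

The toy example also illustrates the rescaling argument above: a layer whose magnitudes are small relative to the rest of the network can be emptied entirely by the global threshold.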

B EXTENDED RESULTS AND COMPLETE TABLES

B.1 LEARNING RATE SCHEDULES DURING RETRAINING

This section contains the complete results regarding the performance of different learning rate schedules for retraining. For the fixed schedules, i.e., FT, LRW, SLR, CLR, LLR and ALLR, only the weight decay parameter is tuned and the best configuration reported. For the tuned schedules, i.e., constant, stepped (BAC), cosine and linear, we tune the weight decay, the initial value of the learning rate as well as the length of the warm-up as follows. For ResNet-18 and ResNet-56 trained on CIFAR-10, the weight decay is tuned using a grid search over 1e-4, 2e-4, and 5e-4; for the tuned schedules in the lower half of the table, the initial value is chosen using a grid search over {0.001, 0.005, 0.01, 0.05, 0.1, 0.5}, where in the iterative case we use the same value for each cycle, and the warm-up is tuned over either zero or ten percent of the retraining budget. For WideResNet on CIFAR-100, the weight decay is tuned over 2e-4 and 5e-4, the initial value is chosen using a grid search over {0.0008, 0.004, 0.02, 0.05, 0.1, 0.5}, and we vary the warm-up length between zero and ten percent of the retraining budget. For ResNet-50 on ImageNet, MaxViT on ImageNet, DeepLabV3 on COCO, PSPNet on CityScapes and T5-small on WMT16, we only present the results for fixed schedules. For ResNet-50 on ImageNet, we keep the weight decay fixed at 1e-4, while for the other four aforementioned architecture-dataset pairs, we set the weight decay to 1e-5. As indicated in the caption, each table displays the results for one architecture-dataset pair either in the One Shot or the iterative setting. If no pruning method is indicated, we report the results of magnitude pruning with a global selection criterion. Whenever we perform filter pruning, we indicate it in the caption of the table and rely on a uniform selection of filters by their L2-norm. Similarly to Figure 3 in the main part, Figure 4 and Figure 5 display the envelope when retraining with LLR and ALLR, respectively.
Here, the weight decay values, including those used for the baseline, were individually tuned for each datapoint using a grid search over 1e-4, 2e-4 and 5e-4. All results are averaged over two seeds with max-min-bands indicated. Although LLR and ALLR show slightly different behaviour, we note that both reach the performance of the baseline with significantly fewer retraining epochs than required to establish that baseline. Figure 6 and Figure 7 show the behaviour of One Shot as well as iterative pruning when budgeting the initial training, where retraining is performed with LLR and ALLR, respectively. Here, the initial training is budgeted between 5% and 100% of 200 epochs, which we consider the 'full' training. The initial training follows a linearly decaying learning rate schedule which starts from 0.1. After initial training, IMP is applied One Shot (retraining for 30 epochs) or iteratively (3 cycles of 15 epochs each). The individual datapoints are given by the length of the initial training, namely 10, 25, 50, 75, 100, 125, 150, 175 and 200 epochs.
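The tuned schedules above combine an optional linear warm-up (zero or ten percent of the retraining budget) with a decaying phase; for the linear case, such a schedule can be sketched as follows (a minimal sketch; naming and granularity are ours):

```python
def linear_with_warmup(step, total_steps, lr_init, warmup_frac=0.1):
    """Linear warm-up to lr_init over the first warmup_frac of the budget,
    followed by a linear decay to zero over the remainder.
    warmup_frac=0.0 recovers the plain linearly decaying schedule."""
    warmup_steps = int(warmup_frac * total_steps)
    if step < warmup_steps:
        return lr_init * (step + 1) / warmup_steps
    return lr_init * (1.0 - (step - warmup_steps) / (total_steps - warmup_steps))
```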

B.4 COMPARISON TO STATE-OF-THE-ART PRUNING-STABLE METHODS

Table 22 lists all methods that take part in the comparative study between BIMP and state-of-the-art pruning-stable methods (cf. Section 3.3), where we add IMP as a reference method. In the following subsections, we list the exact hyperparameter grids used for each pruning-stable method.

B.4.1 RESNET-56 ON CIFAR-10

For each method (including BIMP) we tune the weight decay over 1e-4, 5e-4, 1e-3 and 5e-3 and keep the momentum fixed at 0.9. Since the learning rate schedule might need additional tuning, we vary the initial learning rate over 0.05, 0.1, 0.15 and 0.2 for all methods except CS. The decay of the schedule follows the same pattern as listed in Table 3. Since CS required the broadest grid, we fixed the learning rate schedule to the one in Table 3. Otherwise, we used the following grids.

BIMP

Initial training budget epochs: {20, 60, 100}. Number of pruning phases of equal length: {1, 2, 3}.

GMP

Equally distributed pruning steps: {20, 100}.

GSM

Momentum: {0.9, 0.95, 0.99}.

Table 22: Overview of sparsification methods. CS, STR and DST control the sparsity implicitly via additional hyperparameters. IMP is the only method that is pruning-instable by design, i.e., it loses its performance right after the final pruning step. Further, IMP is the only method that is sparsity-agnostic throughout the regular training: the sparsity does not play a role while training to convergence. All other methods require training an entire model when changing the goal sparsity.

DPF

As for GMP, we tune the number of pruning steps, i.e., {20, 100}, and the weight decay.

DNW

We only tune the weight decay, since there are no additional hyperparameters.

CS

As recommended by Savarese et al. (2020), we fix the temperature β at 300. We tune the mask initialization s_0 over {-0.3, -0.25, -0.2, -0.1, -0.05, 0, 0.05, 0.1, 0.2, 0.25, 0.3} and the ℓ1 penalty λ over {1e-8, 1e-7}.

STR

We tune the initial threshold value s_init ∈ {-100, -50, -5, -2, -1, 0, 5, 50, 100}. In an extended grid search, we also used weight decays in {5e-05, 1e-4} and varied s_init ∈ {-40, -30, -20, -10}.
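As a reminder of why s_init matters: STR does not prune explicitly but reparameterizes each weight with a learned soft threshold g(s) = sigmoid(s), so sparsity is only controlled implicitly via s. A rough scalar sketch (our paraphrase of the reparameterization, not the authors' code):

```python
import math

def str_reparam(weights, s):
    """Soft-threshold reparameterization: each weight is shrunk towards zero by
    the learned threshold g(s) = sigmoid(s); weights whose magnitude falls below
    g(s) become exactly zero, so sparsity emerges from the value of s."""
    g = 1.0 / (1.0 + math.exp(-s))
    return [math.copysign(max(abs(w) - g, 0.0), w) for w in weights]

# A very negative s_init (e.g. -100) makes g(s) ~ 0, i.e., the network starts
# dense; larger values of s zero out ever larger portions of the weights.
```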

DST

We tune the sparsity-controlling regularization parameter α ∈ {5e-6, 1e-5, 5e-5, 1e-4, 5e-4}. In an extended grid search, we used weight decays in {0, 1e-4} and tuned α over {1e-7, 5e-7, 1e-6}.

B.4.2 WIDERESNET ON CIFAR-100

For each method (including BIMP) we tune weight decay over 1e-4, 2e-4 and 5e-4 and keep momentum fixed at 0.9. We vary the initial learning rate between 0.05, 0.1 and 0.15 for all methods and we use the learning rate schedule of Table 3 but additionally include a linear schedule with initial learning rate value set to 0.1. Otherwise, we used the following grids.

BIMP

Initial training budget epochs: {20, 60, 100}. Number of pruning phases of equal length: {1, 2, 3, 4}.

GMP

Equally distributed pruning steps: {20, 100}.

GSM

Momentum: {0.9, 0.95, 0.99}.

B.4.4 RESULTS

Table 23 and Table 24 show the full results when comparing BIMP to pruning-stable methods in the case of ResNet-56 on CIFAR-10 and WideResNet on CIFAR-100, respectively. Table 23: ResNet-56 on CIFAR-10: Results of the comparison between BIMP and pruning-stable methods for the sparsity range between 90% and 99.5%. The columns are structured as follows: First, the method is stated. Secondly, we denote the images-per-second throughput during training, i.e., a higher number indicates a faster algorithm. The remaining columns are substructured as follows: Each column corresponds to one goal sparsity and each subcolumn denotes the Top-1 accuracy, the theoretical speedup and the actual sparsity reached. All results include standard deviations. Missing values (indicated by -) correspond to cases where we were unable to obtain results in the desired sparsity range, i.e., there did not exist a training configuration with average final sparsity within a 0.25% interval around the goal sparsity and the closest one is too far away or belongs to another column.


We analyze the impact of our choices regarding the design of ALLR. First of all, we justify the usage of the proxy to determine the initial learning rate. Recall that ALLR discounts the initial value η_1 by a factor d ∈ [0, 1] to account for both the available retraining time (similar to LRW) and the actual increase in loss induced by pruning. It does so by choosing d = max(d_1, d_2), where d_1 = (‖W − W_p‖_2 / ‖W‖_2) · √s ∈ [0, 1] measures the relative L2-norm change in the weights due to pruning and d_2 = T_rt / T accounts for the length of the retraining phase in comparison to the original training length. The motivation behind choosing these factors lies in handling different retraining scenarios. When choosing the initial value, two aspects have to be taken into account. 1. The number of retraining epochs available, T_rt, might be very limited. A large initial learning rate might yield too large oscillations of the loss, from which we cannot recover in a too short retraining timeframe. To find highly generalizing minima, we need both phases: a large-step and a small-step retraining regime (Jastrzębski et al., 2017; Li et al., 2019; You et al., 2019; Leclerc & Madry, 2020). If T_rt is too small, we do not have enough time to do both. This is the advantage of LRW: the magnitude of the initial learning rate depends directly on T_rt. 2. The pruning-induced decrease in accuracy might be very small, depending on the fraction of weights we remove. Pruning only a small fraction of the weights will most likely have little impact on the loss: in a highly over-parameterized network, removing a small fraction of the weights will not drive the parameters far away from the current (local) optimum. In that case, convergence is accelerated by performing small learning rate steps.
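The resulting selection rule for the initial value of the linear retraining schedule can be sketched as follows (a minimal NumPy sketch; the function name and the clipping of d_1 are our choices):

```python
import numpy as np

def allr_initial_lr(w_dense, mask, sparsity, t_retrain, t_train, lr_max=0.1):
    """Discount the maximal learning rate by d = max(d1, d2):
    d1 -- relative L2 distortion caused by pruning, scaled by sqrt(sparsity);
    d2 -- retraining budget relative to the original training length, clipped to 1."""
    w_pruned = w_dense * mask
    d1 = np.linalg.norm(w_dense - w_pruned) / np.linalg.norm(w_dense) * np.sqrt(sparsity)
    d2 = min(t_retrain / t_train, 1.0)
    return max(min(d1, 1.0), d2) * lr_max
```

Note that a full retraining budget (t_retrain = t_train) recovers the maximal initial value regardless of the pruning-induced distortion.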
We expect that especially in the higher sparsity regime, where the loss increases dramatically due to pruning, a larger initial learning rate is desirable to be able to compensate for the drop in accuracy and approach the optimum faster. In other words, we seek to choose the learning rate to initially be as large as possible (but bounded by the largest learning rate used throughout training), where 'as large as possible' means taking two factors into account: how much increase in loss do we have to compensate for, and do we have enough time to properly perform both a large-step regime and a small-step regime? The metric d_2 is an immediate proxy for the duration of the retraining phase, motivated by LRW. Note that we clip d_2 to not become larger than 1. When we are allowed to retrain at least as long as the initial training, we prefer the initial value of the original training, which is the maximum value in the problem setting. In any other case, we take a fraction of it. However, regarding the metric d_1, we choose to measure the drop in L2-norm induced by pruning instead of measuring the actual drop in accuracy, since the latter is only available after performing an entire forward pass on the train dataset. To investigate whether this replacement is justified, we compare ALLR to an accuracy-drop-based variant of it, namely AccALLR. AccALLR works exactly like ALLR, but we instead choose d_1 as follows: d_1 = (Acc(Φ_d) − Acc(Φ_s)) / Acc(Φ_d) ∈ [0, 1], where Acc(Φ) denotes the train accuracy of model Φ and Φ_d, Φ_s denote the dense and sparse model, respectively. We clip d_1 between 0 and 1. The metric d_1 then measures the relative drop in accuracy compared to the dense model. It hence requires a complete forward pass of the sparse model on the training data before commencing retraining. For CIFAR-10, Table 25 shows that despite AccALLR having a slight advantage over the 'less-informed' ALLR, the differences often lie within the margin of the standard deviation.
The discounting factor of ALLR taking the norm drop into account is hence indeed a well-functioning proxy for the pruning-induced drop in accuracy. Similarly, Table 26 shows the comparison on ImageNet in the iterative setting. Especially for moderate sparsities such as 70%, we see that the accuracy is a more precise metric when determining the initial learning rate; however, it becomes imprecise at higher sparsities. High distortions of the parameters will lead to the accuracy dropping entirely to that of a random classifier, resulting in the largest possible learning rate being taken as the initial value. ALLR gives a more robust estimate. We further show the impact of the two factors d_1 and d_2 of ALLR by considering all four cases of selectively disabling a subset of the metrics, that is, we compare ALLR to ALLRd1 (ALLR only using d_1), ALLRd2 (ALLR only using d_2) and LLR (which is the same as applying no discounting factor at all) when training ResNet-56 on CIFAR-10. Table 27 displays the performance of the four different variations when testing against the different scenarios as outlined above, i.e., with low to high sparsity (80-98%) and short to long retraining time (2-20% of the budget), where we stick to the One Shot setting. First of all, we observe that ALLR consistently performs best or second best among all variants. Note that the discounting factor d = max(d_1, d_2) of ALLR is attained at either d_1 or d_2 and ALLR always matches the performance of the better of the two single-factor variants. Further, we observe that ALLR behaves exactly as designed and addresses the different retraining scenarios as outlined above. In the low sparsity, short retraining regime, both variants disabling one of the two discounting factors perform equally well and are superior to selecting a large initial learning rate as LLR does.
When the retraining time is increased, larger initial learning rates are better suited despite low sparsity, as visible in the marginal difference to LLR. With further decreasing sparsity and increasing retraining time, the effect of ALLRd1 would vanish to the benefit of ALLRd2. On the other hand, in the high sparsity regime, we notice that it is beneficial to start with a high initial learning rate and it is not sufficient to only account for the length of the retraining time, as ALLRd2 shows with a difference of ten percent in test accuracy to its competitors. LRW would behave similarly in this particular case. Overall, ALLR addresses these issues by accounting both for the increase in loss as well as for the available retraining time.

We compare the original GLOBAL pruning criterion of IMP to the previously introduced alternatives. In the case of ResNet-56 on CIFAR-10 (Figure 8) and VGG-16 on CIFAR-10 (Figure 11), we report the weight decay configuration with the highest accuracy, where we optimized over the values 1e-4, 5e-4 and 1e-3. For WideResNet on CIFAR-100 (Figure 9) and ResNet-50 on ImageNet (Figure 10) we relied on a weight decay value of 1e-4 for both architectures. The CIFAR-10 and CIFAR-100 results are averaged over three seeds and max-min-bands are indicated. For ImageNet, the results are based on a single seed. We tested both FT (Han et al., 2015) and SLR (Le & Hua, 2021) to see whether the learning rate scheme during retraining has any impact on the performance of the pruning selection scheme. Surprisingly, the simple global selection criterion performs at least on par with the best of all tested methods at any sparsity level for every combination of dataset and architecture tested here when considering the sparsity of the pruned network as the relevant measure. Using SLR during retraining compresses the results by equalizing performance, but otherwise does not change the overall picture.
We note that the results on CIFAR-100 using FT largely track with those reported by Lee et al. (2020) , with the exception of the strong performance of the global selection criterion. Apart from slightly different network architectures, we note that they used significantly more retraining epochs, e.g., 100 instead of 30, and that they use AdamW (Loshchilov & Hutter, 2019 ) instead of SGD. Comparing the impact different optimizers can have on the pruning selection schemes seems like a potentially interesting direction for future research. While the sparsity-vs.-performance tradeoff has certainly been an important part of the justification of modifications to global selection criterion, let us also directly address two further points that are commonly made in this context. First, the global selection criterion has previously been reported to suffer from a pruning-induced collapse at very high levels of sparsity in certain network architectures that is avoided by other approaches. This phenomenon has been studied in the pruning before training literature and was coined layer-collapse by Tanaka et al. (2020) , who hypothesize that it can be avoided by using smaller pruning steps since gradient descent restabilizes the network after pruning by following a layer-wise magnitude conservation law. To verify whether these observations also hold in the pruning after training setting, we trained a VGG-16 network on CIFAR-10, as also done by Lee et al. (2020) , both in the One Shot and in the iterative setting. The results are reported in Figure 11 and show that layer collapse is clearly occurring for both FT and SLR for the global selection criterion at sparsity levels above 99% in the One Shot setting, but disappears entirely when pruning iteratively. This indicates that layer collapse, while a genuine potential issue, can be avoided even using the global selection criterion. 
We also remark that SLR needs fewer prune-retrain cycles than FT to avoid layer-collapse, possibly indicating that the retraining strategy impacts the speed of restabilization of the network in the hypothesis posed by Tanaka et al. (2020). The second important aspect to consider is that layer-dependent selection criteria are also intended to address the inherent tradeoff not just between the achieved sparsity of the pruned network and its performance, but also with the theoretical computational speedup. Figure 8, Figure 9 and Figure 10 include plots highlighting the achieved performance in relation to the theoretical speedup. The key takeaway here is that for both the ResNet-56 and the WideResNet network architecture, there is overall surprisingly little distinction between all five tested methods, with Uniform+ and ERK taking the lead and the global selection criterion performing well to average. For the ResNet-50 architecture, however, a much more drastic separation occurs, with Uniform performing best, followed by Uniform+ and then the global selection criterion. Overall, the picture is significantly less clear. However, despite its simplicity, the global approach performs on par with respect to managing the accuracy-vs.-speedup tradeoff, where we observe that for ResNet-50 on ImageNet it even outperforms methods such as LAMP and ERK regarding both objectives, performance and speedup.
[Figure legend: Uniform (Zhu & Gupta, 2017), Uniform+ (Gale et al., 2019), ERK (Evci et al., 2020), LAMP (Lee et al., 2020)]

Figure 11: VGG-16 on CIFAR-10: Performance-vs.-sparsity tradeoffs in the One Shot (above) and Iterative (below) setting with FT (left) and SLR (right) as retraining methods. In the One Shot setting the model is retrained for 30 epochs after pruning and the iterative setting consists of 3 prune-retrain cycles with 10 epochs each. For One Shot we observe layer-collapse, while the iterative splitting into less severe pruning steps avoids the problem. Note that the total number of retraining epochs between the two settings is identical here.

C.3 COMPARING BIMP TO GMP

We note that BIMP has a fair number of similarities to GMP (Zhu & Gupta, 2017) and we will therefore seek the direct comparison between the two. In particular, both methods effectively prune and retrain (although that terminology is commonly reserved for IMP) at distinct predetermined points during the overall training process. Let us start by highlighting the particular design decisions by which BIMP and GMP differ:

1. BIMP 'hard' prunes while GMP 'soft' prunes. In GMP, pruned weights are zeroed out during the forward and backward pass, but the mask is recomputed at the next pruning step, hence allowing previously pruned weights to recover. It is unclear, for example, how the momentum buffer of SGD or the memory of optimizers like Adam (Kingma & Ba, 2014) are supposed to take this soft pruning into consideration, but we believe that this aspect overall contributes very little to explaining the different performance of GMP and BIMP.

2. BIMP requires sufficiently long retraining time between two pruning points to recover from the last pruning step. GMP on the other hand can have much shorter distances between the equally distributed pruning points, pruning as often as every 100 training iterations (Zhu & Gupta, 2017).

3. Both BIMP and GMP rely on magnitude pruning; however, we have decided to use the simple GLOBAL selection criterion of Han et al. (2015) for both, while GMP was originally proposed using the UNIFORM selection and further improved by Gale et al. (2019) with the UNIFORM+ criterion, see Appendix C.2 for a discussion. The version of GMP included in the main part of this text in fact relies on the same global criterion as BIMP, since we found it to yield better results than UNIFORM+ with respect to the final accuracy.

4. BIMP relies on a cyclic linear learning rate schedule where the cycles coincide with the points at which the network is pruned, while GMP can use any kind of learning rate scheme that would commonly be employed for that kind of architecture and dataset, i.e., normally not a cyclic one. Zhu & Gupta (2017) note that GMP can be quite sensitive to the learning rate. In the main body of the text we have mostly relied on a common stepped learning rate schedule. Given the importance of the learning rate schedule, we have also tested a version of GMP using both a linearly decaying schedule as well as a cyclic learning rate schedule similar to what we have suggested for BIMP, i.e., the cycles are chosen to exactly end at the next pruning step.

5. We have relied on the simple exponential pruning schedule suggested by Renda et al. (2020) for BIMP, while GMP relies on a particular schedule defined by a cubic polynomial that effectively leads to pruning larger amounts initially and progressively smaller amounts later in training when compared to BIMP. While we think that the pruning schedule probably has a significant impact on the performance of the pruning method and can possibly interact with the learning rate schedule in particularly interesting ways, we have so far not attempted exchanging the schedule in BIMP for either the one of GMP or a novel one.

In Table 28 we have included the previously mentioned modifications of GMP and compared them to BIMP for WideResNet trained on CIFAR-100. In particular, for each variant of GMP we indicate which learning rate schedule we use and whether we prune 'hard' or 'soft'. The models are trained according to the same settings as indicated in Table 3, where stepped refers to the stepped learning rate case. On the other hand, linear indicates a linear learning rate schedule and cyclic refers to a linearly decaying learning rate schedule that is restarted after every pruning point.
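The 'soft' vs. 'hard' distinction discussed above can be illustrated on a single weight vector (our toy sketch; the actual implementations operate per training iteration and per layer):

```python
import numpy as np

def soft_prune_mask(w, sparsity):
    """'Soft' pruning as in GMP: the mask is recomputed from the *current*
    weights at every pruning point, so previously pruned weights may recover."""
    k = int(sparsity * w.size)
    threshold = np.sort(np.abs(w).ravel())[k - 1] if k > 0 else -np.inf
    return np.abs(w) > threshold

def hard_prune_mask(w, mask, sparsity):
    """'Hard' pruning as in (B)IMP: the new mask is computed on the already
    masked weights and intersected with the old mask -- removed weights stay
    removed, so the mask can only shrink."""
    masked = np.where(mask, np.abs(w), 0.0)
    k = int(sparsity * w.size)
    threshold = np.sort(masked.ravel())[k - 1] if k > 0 else -np.inf
    return (masked > threshold) & mask
```

If a weight that was pruned at the last step has grown large in the meantime, soft pruning readmits it while hard pruning keeps it at zero.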
We use a weight decay value of 1e-4 and set the initial value of the learning rate to 0.1. For GMP we prune every 5 or 10 epochs, while for BIMP we fix the initial training length to 100 epochs and split the remaining 100 epochs equally between 1 to 4 cycles. All results are averaged over two seeds with standard deviations indicated. We note that there seems to be surprisingly little difference between hard and soft pruning. The impact of the learning rate schedule is more nuanced: using a linearly decaying schedule throughout training can give a slight increase in performance, albeit the classical stepped learning rate schedule seems to work better in the high-sparsity regime. Whether the cyclic restarting of the learning rate works depends crucially on the distance between two pruning points and the impact of pruning, which BIMP with ALLR seems to leverage. For the medium sparsity of 90% a cyclic learning rate seems to be detrimental, while in the high-sparsity regime we see improvements. Table 29 further reports results for ResNet-50 trained on ImageNet, where we stuck with 'hard' pruning for each GMP variant. Similarly, we use the same settings as in Table 3, set the weight decay to 1e-4 and the initial value of the learning rate to 0.1. GMP prunes every 5 or 10 epochs, whereas BIMP uses an initial training length of 60 or 75 epochs, splitting the remaining 30 or 15 epochs into 1 to 4 cycles of equal length. Here, the stepped learning rate schedule seems to hold an advantage over the linear one, with the slight exception of the highest sparsity. Interestingly, the cyclic learning rate schedule seems to be in conflict with, and too aggressive for, the sparsification schedule of GMP. Overall, we think there is a fair amount of nuance here that deserves further exploration; drawing any definitive conclusions will probably require a more diverse testbed.
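The two pruning schedules contrasted in point 5 can be sketched as follows. This is an illustrative reconstruction with our own function names and defaults: the cubic form follows Zhu & Gupta (2017), and the exponential form assumes a constant fraction of the remaining weights is pruned per cycle, in the spirit of Renda et al. (2020):

```python
def cubic_sparsity(t, t0, n, dt, s_i=0.0, s_f=0.9):
    """GMP schedule (Zhu & Gupta, 2017): sparsity ramps from s_i to s_f
    following a cubic polynomial over n pruning steps spaced dt apart,
    pruning larger amounts early and smaller amounts late."""
    t = min(max(t, t0), t0 + n * dt)  # clamp to the pruning window
    frac = 1.0 - (t - t0) / (n * dt)
    return s_f + (s_i - s_f) * frac ** 3

def exponential_sparsity(k, n, s_f=0.9):
    """Exponential schedule: after cycle k of n, the remaining density is
    (1 - s_f)**(k / n), i.e., a fixed fraction is pruned in every cycle."""
    return 1.0 - (1.0 - s_f) ** (k / n)
```

Note that halfway through the pruning window the cubic schedule has already removed well over half of its target, whereas the exponential schedule spreads the per-cycle pruning fraction evenly in relative terms.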



Note that a large initial and then (often exponentially) decaying learning rate has become standard practice for regular training (Leclerc & Madry, 2020). The conventional explanation for the success of such schedules from an optimization perspective is that an initially large learning rate accelerates training and avoids local minima, while the gradual decay helps convergence to an optimum without oscillating around it. However, there are also indications that the use of large learning rates and the separation of training into a large- and small-step regime help from a generalization perspective (Jastrzębski et al., 2017; Li et al., 2019; You et al., 2019; Leclerc & Madry, 2020).

To compute the number of FLOPs, we sample a single batch from the test set. The code to compute the theoretical speedup has been adapted from the repository of the ShrinkBench framework (Blalock et al., 2020).
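In its simplest form, the theoretical speedup reported throughout can be computed as the ratio of dense FLOPs to the FLOPs remaining after pruning, assuming each layer's FLOPs scale linearly with its fraction of nonzero weights. The sketch below is our simplification of that idea, not the ShrinkBench code itself:

```python
def theoretical_speedup(layer_flops, layer_sparsities):
    """Ratio of dense FLOPs to FLOPs left after pruning, assuming a
    layer's cost scales with its fraction of nonzero weights."""
    dense = sum(layer_flops)
    sparse = sum(f * (1.0 - s) for f, s in zip(layer_flops, layer_sparsities))
    return dense / sparse
```

For example, a network whose layers are all pruned to 50% sparsity has a theoretical speedup of 2x under this model, regardless of how FLOPs are distributed across layers.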



Figure 1: The different learning rate schedules for IMP when retraining for 60 epochs, assuming a stepped learning rate schedule during an initial training lasting for 200 epochs.

Figure 2: ResNet-50 on ImageNet: Performance of different retraining schedules compared to the dense model shown over the total number of epochs used for retraining including both One Shot and iterative magnitude pruning. Results are averaged over two seeds with max-min-bands indicated and the plots depict sparsity 70%, 80% and 90% from left to right.

Figure 3: ResNet-56 on CIFAR-10: Envelope of the performance of IMP using ALLR compared to the baseline of Renda et al. (2020) shown over the total number of epochs used for retraining.

Figure 4: ResNet-56 on CIFAR-10: Envelope of the performance of IMP using LLR compared to the baseline of Renda et al. (2020) shown over the total number of epochs used for retraining. Results are averaged over two seeds with max-min-bands indicated.

Figure 5: ResNet-56 on CIFAR-10: Envelope of the performance of IMP using ALLR compared to the baseline of Renda et al. (2020) shown over the total number of epochs used for retraining. Results are averaged over two seeds with max-min-bands indicated.

Figure 6: ResNet-56 on CIFAR-10: Test accuracy achieved by IMP when retraining One Shot for 30 epochs (above) or iteratively for three cycles of 15 epochs each (below) with LLR after budgeting the initial training length. Each line depicts a different goal sparsity and values are indicated as deviation from the performance of IMP when applied to a network trained with the full budget. Results are averaged over two seeds with max-min-bands indicated.

Figure 7: ResNet-56 on CIFAR-10: Test accuracy achieved by IMP when retraining One Shot for 30 epochs (above) or iteratively for three cycles of 15 epochs each (below) with ALLR after budgeting the initial training length. Each line depicts a different goal sparsity and values are indicated as deviation from the performance of IMP when applied to a network trained with the full budget. Results are averaged over two seeds with max-min-bands indicated.


Figure 9: WideResNet on CIFAR-100 (One Shot): Sparsity-vs.-performance (above) and theoretical speedup-vs.-performance (below) tradeoffs in the One Shot setting with FT (left) and SLR (right) as retraining methods. Retraining is done for 30 epochs. The plot includes max-min confidence intervals.

ResNet-50 on ImageNet: Performance of the different learning rate translation schemes for One Shot IMP for target sparsities of 70%, 80% and 90% and retrain times of 2.22% (2 epochs), 5.55% (5 epochs) and 11.11% (10 epochs) of the initial training budget. The first, second, and third best values are highlighted. Results are averaged over two seeds with the standard deviation indicated.

Comparison between BIMP and pruning-stable methods when training for goal sparsity levels of 90%, 95%, 99% (CIFAR-10) and 70%, 80%, 90% (ImageNet), denoted in the main columns. Each subcolumn denotes the Top-1 accuracy, the theoretical speedup and the actual sparsity achieved by the method. Further, we denote the images-per-second throughput during training, i.e., a higher number indicates a faster method. All results are averaged over multiple seeds and include standard deviations. The first, second, and third best values are highlighted.

Exact training configurations used throughout the experiments for IMP. We note that others have reported an accuracy of around 80% for WRN28x10 trained on CIFAR-100 that we were unable to replicate. The discrepancy is most likely due to an inconsistency in PyTorch's dropout implementation. For experiments involving Vision-Transformers, we used label smoothing as well as gradient clipping. For COCO and CityScapes architectures, we rely on pretrained backbones and report the common mean Intersection-over-Union (IoU) metric measured on the validation set. For the NMT task we report the BLEU score on the test set, where we limit the sequence length to 128 throughout.

ResNet-56 on CIFAR-10 (One Shot): Performance of the different learning rate translation schemes (above) compared to tuned schedules (below) for IMP in the One Shot setting for target sparsity of 50%, 90%, 95%, 99% and a retrain time of 2.5%, 5%, 10%, and 25% of the initial training budget. The first, second, and third best values for the translation schemes are highlighted. Results are averaged over two seeds with the standard deviation indicated.

.09 88.50 ±0.17 89.59 ±0.03 90.27 ±0.54 64.68 ±2.53 68.97 ±2.14 73.48 ±1.13 77.54 ±0.31
LRW 87.61 ±0.04 88.56 ±0.16 89.60 ±0.08 91.03 ±0.08 64.66 ±2.54 68.98 ±2.14 73.44 ±1.17 81.38 ±0.16
SLR 88.38 ±0.45 89.54 ±0.23 90.43 ±0.08 90.75 ±0.48 77.05 ±0.16 78.98 ±0.04 80.23 ±0.28 81.42 ±0.03

ResNet-56 on CIFAR-10 (Iterative): Performance of the different learning rate translation schemes (above) compared to tuned schedules (below) for IMP in the iterative setting for target sparsity of 90%, 95%, 99% and retrain times as indicated in the Budget row. Here 2 × 2.5% indicates two prune-retrain cycles, each of which having length equal to 2.5% of the overall training budget. The first, second, and third best values for the translation schemes are highlighted. Results are averaged over two seeds with the standard deviation indicated.

.09 91.49 ±0.01 91.83 ±0.11 91.22 ±0.10 91.83 ±0.04 91.81 ±0.30
LRW 90.88 ±0.14 91.47 ±0.07 91.99 ±0.08 91.28 ±0.08 91.60 ±0.04 91.74 ±0.20
SLR 91.33 ±0.01 92.08 ±0.07 92.41 ±0.10 91.62 ±0.39 92.05 ±0.01 92.49 ±0.23
CLR 92.05 ±0.08 92.47 ±0.08 92.69 ±0.42 91.92 ±0.11 92.73 ±0.09 92.69 ±0.08
LLR 91.79 ±0.13 92.27 ±0.38 92.76 ±0.22 92.10 ±0.35 92.60 ±0.31 92.81 ±0.35
ALLR 91.95 ±0.32 92.26 ±0.13 92.78 ±0.33 92.20 ±0.47 92.66 ±0.35 92.79 ±0.03
constant 91.10 ±0.21 91.45 ±0.11 91.86 ±0.10 91.20 ±0.13 91.52 ±0.04 91.88 ±0.19
stepped 91.80 ±0.26 92.10 ±0.33 92.62 ±0.08 91.81 ±0.22 92.52 ±0.43 92.77 ±0.13
cosine 92.01 ±0.01 92.30 ±0.15 92.77 ±0.54 92.22 ±0.21 92.74 ±0.11 92.89 ±0.08
linear 92.08 ±0.04 92.42 ±0.20 92.94 ±0.04 92.07 ±0.06 92.67 ±0.26 92.80 ±0.12

ResNet-18 on CIFAR-10 (One Shot): Performance of the different learning rate translation schemes (above) compared to tuned schedules (below) for IMP in the One Shot setting for target sparsity of 70%, 90%, 95%, 98% and a retrain time of 5%, 10%, and 25% of the initial training budget. The first, second, and third best values for the translation schemes are highlighted. Results are averaged over two seeds with the standard deviation indicated.

.12 95.09 ±0.24 95.23 ±0.01 94.63 ±0.08 94.75 ±0.30 94.77 ±0.07
LRW 95.05 ±0.16 95.08 ±0.23 95.05 ±0.01 94.62 ±0.08 94.74 ±0.30 94.91 ±0.20
SLR 94.21 ±0.10 94.82 ±0.16 94.86 ±0.06 94.24 ±0.07 94.66 ±0.20 94.90 ±0.41
CLR 94.70 ±0.07 95.02 ±0.14 95.27 ±0.01 94.65 ±0.01 94.88 ±0.15 94.97 ±0.08
LLR 94.58 ±0.13 94.82 ±0.16 95.00 ±0.06 94.61 ±0.25 94.87 ±0.14 95.03 ±0.04
ALLR 94.99 ±0.03 95.08 ±0.08 95.34 ±0.15 94.66 ±0.01 94.94 ±0.15 95.02 ±0.07
constant 94.97 ±0.12 95.09 ±0.25 95.25 ±0.01 94.62 ±0.12 94.73 ±0.33 94.73 ±0.04
.13 94.59 ±0.25 94.68 ±0.07 93.48 ±0.06 93.86 ±0.23 93.97 ±0.08
constant 93.69 ±0.11 94.15 ±0.11 94.19 ±0.06 92.09 ±0.21 92.38 ±0.13 93.14 ±0.16
stepped 94.42 ±0.13 94.55 ±0.18 94.72 ±0.28 93.44 ±0.18 93.82 ±0.13 93.86 ±0.22
cosine 94.50 ±0.05 94.73 ±0.01 94.80 ±0.01 93.64 ±0.13 94.01 ±0.10 94.06 ±0.02
linear 94.68 ±0.17 94.70 ±0.01 94.98 ±0.10 93.67 ±0.10 94.09 ±0.01 94.22 ±0.03

WideResNet on CIFAR-100 (One Shot): Performance of the different learning rate translation schemes (above) compared to tuned schedules (below) for IMP in the One Shot setting for target sparsity of 90%, 95%, 99% and a retrain time of 2.5%, 5%, 10% of the initial training budget. The first, second, and third best values for the translation schemes are highlighted. Results are averaged over two seeds with the standard deviation indicated.

ResNet-50 on ImageNet (One Shot): Performance of the different learning rate translation schemes for IMP in the One Shot setting for target sparsity of 70%, 80%, 90%, 95% and a retrain time of 2.22% (2 epochs), 5.55% (5 epochs), 11.11% (10 epochs), 22.22% (20 epochs) of the initial training budget. The first, second, and third best values for the translation schemes are highlighted. Results are averaged over two seeds with the standard deviation indicated.

ResNet-50 on ImageNet (Iterative): Performance of the different learning rate translation schemes for IMP in the iterative setting for target sparsity of 70%, 80%, 90% and retrain times as indicated in the Budget row. Here 2 × 2.22% indicates two prune-retrain cycles, each of which having length equal to 2.22% of the overall training budget. The first, second, and third best values for the translation schemes are highlighted. Results are averaged over two seeds with the standard deviation indicated.

MaxViT on ImageNet (Iterative): Performance of the different learning rate translation schemes for IMP in the iterative setting for target sparsity of 75%, 80%, 85%, 90% and retrain times as indicated in the Budget row. Here 2 × 2.5% indicates two prune-retrain cycles, each of which having length equal to 2.5% of the overall training budget. The first, second, and third best values for the translation schemes are highlighted. Results are averaged over two seeds with the standard deviation indicated.

DeepLabV3 on COCO (Iterative): Performance of the different learning rate translation schemes for IMP in the iterative setting for target sparsity of 50 -80% and retrain times as indicated in the Budget row. Here 2×6.66% indicates two prune-retrain cycles, each of which having length equal to 6.66% of the overall training budget. The first, second, and third best values for the translation schemes are highlighted. Results are averaged over two seeds with the standard deviation indicated.

PSPNet on CityScapes (Iterative): Performance of the different learning rate translation schemes for IMP in the iterative setting for target sparsity of 60 -90% and retrain times as indicated in the Budget row. Here 2×1.66% indicates two prune-retrain cycles, each of which having length equal to 1.66% of the overall training budget. The first, second, and third best values for the translation schemes are highlighted. Results are averaged over two seeds with the standard deviation indicated.

) of the initial training budget. The first, second, and third best values for the translation schemes are highlighted. Results are averaged over two seeds with the standard deviation indicated.

.05 23.93 ±0.20 24.24 ±0.15 24.08 ±0.07 24.06 ±0.15 24.36 ±0.09
LRW 23.76 ±0.05 23.93 ±0.20 24.24 ±0.15 24.08 ±0.07 24.68 ±0.19 24.92 ±0.12
SLR 21.66 ±0.40 22.09 ±0.13 22.74 ±0.11 23.36 ±0.34 24.09 ±0.36 24.61 ±0.03
CLR 22.69 ±0.06 23.29 ±0.17 23.79 ±0.34 24.47 ±0.31 24.73 ±0.06 25.28 ±0.19
LLR 22.97 ±0.01 23.20 ±0.15 24.00 ±0.14 24.53 ±0.22 25.11 ±0.05 24.95 ±0.43
ALLR 24.11 ±0.17 24.29 ±0.02 24.51 ±0.00 24.63 ±0.21 25.27 ±0.30 25.17 ±0.39

ResNet-56 on CIFAR-10 (One Shot, Filter Pruning): Performance of the different learning rate translation schemes in the One Shot setting for target sparsity of 50 -80%, and a retrain time of 5% (10 epochs), 10% (20 epochs), 15% (30 epochs) of the initial training budget. The first, second, and third best values for the translation schemes are highlighted. Results are averaged over two seeds with the standard deviation indicated.

ResNet-50 on ImageNet (One Shot, Filter Pruning): Performance of the different learning rate translation schemes in the One Shot setting for target sparsity of 30 -50%, and a retrain time of 2.22% (2 epochs), 5.55% (5 epochs) and 11.11% (10 epochs) of the initial training budget. The first, second, and third best values for the translation schemes are highlighted. Results are averaged over two seeds with the standard deviation indicated.

ResNet-50 on ImageNet (Iterative, Filter Pruning): Performance of the different learning rate translation schemes in the iterative setting for target sparsity of 30%, 40%, 50% and retrain times as indicated in the Budget row. Here 2 × 2.22% indicates two prune-retrain cycles, each of which having length equal to 2.22% of the overall training budget. The first, second, and third best values for the translation schemes are highlighted. Results are averaged over two seeds with the standard deviation indicated.

MaxViT on ImageNet (One Shot, Filter Pruning): Performance of the different learning rate translation schemes in the One Shot setting for target sparsity of 30 -50%, and a retrain time of 1% (2 epochs), 2.5% (5 epochs), 5% (10 epochs) and 10% (20 epochs) of the initial training budget. The first, second, and third best values for the translation schemes are highlighted. Results are averaged over two seeds with the standard deviation indicated.

MaxViT on ImageNet (Iterative, Filter Pruning): Performance of the different learning rate translation schemes in the iterative setting for target sparsity of 30% -50% and retrain times as indicated in the Budget row. Here 2 × 2.5% indicates two prune-retrain cycles, each of which having length equal to 2.5% of the overall training budget. The first, second, and third best values for the translation schemes are highlighted. Results are averaged over two seeds with the standard deviation indicated.

training epochs. We keep weight decay fixed at 5e-4.

WideResNet on CIFAR-100: Results of the comparison between BIMP and pruning-stable methods for the sparsity range between 90% and 98%. The columns are structured as follows: first, the method is stated; secondly, we denote the images-per-second throughput during training, i.e., a higher number indicates a faster algorithm. The remaining columns each correspond to one goal sparsity, and each subcolumn denotes the Top-1 accuracy, the theoretical speedup and the actual sparsity reached. All results include standard deviations.

ResNet-56 on CIFAR-10 (One Shot): Performance of ALLR versus its accuracy-drop based variant AccALLR for IMP in the One Shot setting for target sparsity of 80%, 90%, 98% and a retrain time of 2% (2 epochs), 5% (5 epochs), 20% (20 epochs) of the initial training budget. The first and second best values for the translation schemes are highlighted. Results are averaged over two seeds with the standard deviation indicated.

ResNet-50 on ImageNet (Iterative): Performance of ALLR versus its accuracy-drop based variant AccALLR for IMP in the iterative setting for target sparsity of 70% and 90%. The first and second best values for the translation schemes are highlighted. Results are averaged over two seeds with the standard deviation indicated.

ResNet-56 on CIFAR-10 (One Shot): Performance of the ALLR derivations for IMP in the One Shot setting for target sparsity of 90%, 98% and a retrain time of 2% (2 epochs), 5% (5 epochs), 20% (20 epochs) of the initial training budget. The first, second, and third best values for the translation schemes are highlighted. Results are averaged over two seeds with the standard deviation indicated.

Figure 8: ResNet-56 on CIFAR-10 (One Shot): Sparsity-vs.-performance (above) and theoretical speedup-vs.-performance (below) tradeoffs in the One Shot setting with FT (left) and SLR (right) as retraining methods. Retraining is done for 30 epochs. The plot includes max-min confidence intervals.

Figure 10: ResNet-50 on ImageNet (One Shot): Sparsity-vs.-performance (above) and theoretical speedup-vs.-performance (below) tradeoffs in the One Shot setting with FT (left) and SLR (right) as retraining methods. Retraining is done for 10 epochs.

WideResNet on CIFAR-100: Comparison between BIMP and GMP variants for goal sparsity levels of 90% and 95%, denoted in the main columns. Each subcolumn denotes the Top-1 accuracy, the theoretical speedup and the actual sparsity achieved by the method. All results are averaged over two seeds and include standard deviations.

Comparison between BIMP and GMP variants for goal sparsity levels of 70%, 80% and 90%, denoted in the main columns. Each subcolumn denotes the Top-1 accuracy, the theoretical speedup and the actual sparsity achieved by the method. All results are averaged over two seeds and include standard deviations.

ACKNOWLEDGEMENTS

This research was partially supported by the DFG Cluster of Excellence MATH+ (EXC-2046/1, project id 390685689) funded by the Deutsche Forschungsgemeinschaft (DFG). We would like to thank Berkant Turan for his support in conducting the NMT experiment using the HuggingFace library.

