UNCOVERING THE IMPACT OF HYPERPARAMETERS FOR GLOBAL MAGNITUDE PRUNING

Anonymous authors
Paper under double-blind review

Abstract

A common paradigm in model pruning is to train a model, prune, and then either fine-tune or, in the lottery ticket framework, reinitialize and retrain. Prior work has implicitly assumed that the best training configuration for model evaluation is also the best configuration for mask discovery. However, what if a training configuration that yields worse performance actually yields a mask that trains to higher performance? To test this, we decoupled the hyperparameters for mask discovery (H_find) and mask evaluation (H_eval). Using unstructured magnitude pruning on vision classification tasks, we discovered the "decoupled find-eval phenomenon," in which certain H_find values lead to models with lower performance but generate masks with substantially higher eventual performance than masks found using the same hyperparameters for both stages. We show that this phenomenon holds across a number of models, datasets, and configurations, and also for one-shot structured pruning. Finally, we demonstrate that different H_find values yield masks with materially different layerwise pruning ratios and that the decoupled find-eval phenomenon is causally mediated by these ratios. Our results demonstrate the practical utility of decoupling hyperparameters and provide clear insights into the mechanisms underlying this counterintuitive effect.

1. INTRODUCTION

There has been significant progress in deep learning in recent years, but many of the best-performing networks are extremely large (Kolesnikov et al., 2019; Brown et al., 2020). This can be problematic due to the amount of compute and memory needed to train and deploy such models. One popular approach is model pruning: removing weights (unstructured pruning) or units (structured pruning) from a trained network in order to generate a smaller network with near-equivalent (and in some cases, better) performance (Blalock et al., 2020; Lin et al., 2020; Liu et al., 2017; He et al., 2019b; Molchanov et al., 2019). One of the most commonly used heuristics is magnitude pruning, in which the lowest-magnitude weights/units are removed; it is simple and competitive with more complex methods (Han et al., 2015; Gale et al., 2019). The lottery ticket hypothesis (Frankle & Carbin, 2018), a related concept, posits that a large network contains a smaller subnetwork at initialization that can be trained to high performance in isolation, and provides a simple pruning method to find such winning lottery tickets (LTs). In addition to allowing the training of sparse models from scratch, the existence of LTs suggests that overparameterization is not strictly required to train a network to high performance; rather, overparameterization may simply be necessary to find a good starting point for training.

When pruning a network, one must decide how much to prune each layer. Should all layers be pruned equally, or should some layers be pruned more than others? Previous studies have shown that global pruning results in better compression and performance than layerwise (or uniform) pruning (Frankle & Carbin, 2018; Morcos et al., 2019). This is because global pruning ranks all weights/units together independent of layer, granting the network the flexibility to find the ideal layerwise pruning ratios (LPRs), which we define as the percentage of each layer that is pruned.
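The distinction between global ranking and the resulting per-layer ratios can be sketched in a few lines; the helper names below are illustrative, not the paper's implementation:

```python
import numpy as np

def global_magnitude_mask(layers, sparsity):
    """Rank all weights across layers by |w| and prune the
    lowest-magnitude fraction `sparsity` globally."""
    all_mags = np.concatenate([np.abs(w).ravel() for w in layers])
    k = int(sparsity * all_mags.size)
    threshold = np.sort(all_mags)[k]  # single global cutoff
    return [np.abs(w) >= threshold for w in layers]

def layerwise_pruning_ratios(masks):
    """LPR: the fraction pruned in each layer; global pruning
    lets this differ across layers."""
    return [1.0 - m.mean() for m in masks]
```

Because the cutoff is shared across layers, a layer whose weights are small in magnitude overall ends up pruned more heavily, which is exactly what the LPR measures.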
One can frame most pruning methods as having two phases of training: one to find the mask, and one to evaluate the mask by training the pruned model (whether by rewinding and retraining or by fine-tuning). Common practice for methods that rewind weights after pruning, such as lottery ticket pruning, has been to use the same hyperparameters for both finding and evaluating masks (see Blalock et al. (2020) for an extensive review). Methods that fine-tune weights after pruning typically train at a smaller learning rate than the training phase used to find the mask (Han et al., 2015; Renda et al., 2020), but other hyperparameters are held constant. Pruning methods also rest on the assumption that the models with the best performance will generate the best masks, such that the optimal hyperparameters for mask generation and mask evaluation should be identical. However, what if the mechanisms underlying good mask generation are not perfectly correlated with those leading to good performance? Consequently, what if models which converge to worse performance can yield better masks? To test this, we explored settings for global magnitude pruning in which different hyperparameters were used for mask generation and mask evaluation. In particular, we focused on three hyperparameters that are commonly adjusted in practice: learning rate, batch size, and weight decay. Using this paradigm, we make the following contributions:

• Surprisingly, we found that the best hyperparameters to find the mask (H_find) are often different from the best hyperparameters to train the regular model or to evaluate the mask (H_eval; Figures 1 and 2), which we term the "decoupled find-eval phenomenon". Counterintuitively, this means that a model with worse performance prior to pruning can generate a mask which leads to better performance for the final pruned model than a mask generated by a model with higher pre-pruning performance.
• We show that this phenomenon is not an artifact of the particular setting we studied: it also occurs for structured pruning (Figure 5), other datasets, and other variants of LTs, including late rewinding, learning rate warmup, and learning rate rewinding (Figure 4).

• We found that different hyperparameters for mask generation led to materially different LPRs (Figures 6a and A18). Notably, a larger learning rate, smaller batch size, and larger weight decay (which resulted in worse masks) consistently pruned early layers far more than masks found with the opposite settings.

• Finally, we show that this phenomenon is causally mediated by the differences in LPRs. When the LPR is fixed to a "good" LPR (i.e., that of a high-performance mask), the previously bad hyperparameters now lead to better performance and better mask generation (Figure 7). The same is true for the inverse experiment.
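The LPR-fixing intervention in the last contribution amounts to pruning each layer locally to a ratio taken from another mask, rather than using a global cutoff. A minimal sketch, with a function name of our own choosing:

```python
import numpy as np

def prune_with_fixed_lpr(layers, lprs):
    """Prune each layer by magnitude to a prescribed layerwise ratio,
    e.g. an LPR vector measured from a different mask."""
    masks = []
    for w, r in zip(layers, lprs):
        k = int(r * w.size)  # number of weights to prune in this layer
        thresh = np.sort(np.abs(w).ravel())[k] if k > 0 else 0.0
        masks.append(np.abs(w) >= thresh)
    return masks
```

Transplanting a "good" LPR vector in this way isolates the contribution of the per-layer ratios from that of the specific weights each mask selects.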

2.1. MODIFIED PRUNING PROCEDURES

Our main experiments are based on the lottery ticket procedure from Frankle & Carbin (2018), along with follow-up work (Renda et al., 2020), for unstructured pruning. We used global iterative magnitude pruning (IMP) for LTs because it has been shown to outperform both one-shot pruning, in which the pruning procedure is performed once rather than iteratively, and local (or uniform layerwise) pruning, in which each layer has the same pruning ratio (Morcos et al., 2019; Frankle & Carbin, 2018). We also investigated structured one-shot pruning, following Liu et al. (2017), which also prunes globally. All experiments use magnitude pruning.

We define four different sets of hyperparameters: H_unpruned, used for regular training of an unpruned model; H_find, used to find masks (i.e., the pre-training part of the pruning procedure); H_eval, used for obtaining the final pruned model to be used for inference and for evaluating the final performance of a mask; and H_LT, which refers to the hyperparameters optimized for LTs, where H_find = H_eval. Note that H_LT is often different from H_unpruned in practice and in previous literature (Liu et al., 2019; Frankle & Carbin, 2018). Our modified procedures can incorporate changes in any hyperparameters, but in this paper we focus on experiments with learning rate (LR), batch size (BS), and weight decay (WD). For the main experiments, we change one hyperparameter at a time. When only one hyperparameter differs between H_find and H_eval, we denote it as LR_find, WD_eval, etc.

Unstructured pruning. To account for these distinct sets of hyperparameters, we slightly modified the procedure for unstructured IMP as described in Algorithm A1. We emphasize that our modification requires no additional compute compared to the original LT procedure if used practically. However, to generate additional data points for analysis in the present work, we separately evaluated
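Algorithm A1 is not reproduced here, but the decoupled procedure can be sketched as follows. The `train(weights, masks, hparams)` routine is a placeholder, and the names and structure are our own simplification, not the paper's exact algorithm:

```python
import numpy as np

def imp_decoupled(init_weights, train, prune_frac, n_rounds, h_find, h_eval):
    """Iterative magnitude pruning with decoupled hyperparameters.

    `train` is a placeholder returning trained weights;
    `init_weights` is a list of arrays (one per layer).
    """
    masks = [np.ones_like(w, dtype=bool) for w in init_weights]
    for _ in range(n_rounds):
        # Find phase: rewind to init, train under h_find, prune the
        # lowest-magnitude fraction of the surviving weights globally.
        trained = train(init_weights, masks, h_find)
        mags = np.concatenate([np.abs(w[m]) for w, m in zip(trained, masks)])
        k = int(prune_frac * mags.size)
        thresh = np.sort(mags)[k]
        masks = [m & (np.abs(w) >= thresh) for w, m in zip(trained, masks)]
    # Eval phase: rewind to the initial weights and retrain under h_eval.
    return train(init_weights, masks, h_eval), masks
```

Setting `h_find == h_eval` recovers the standard LT procedure (H_LT), so the modification adds no training runs beyond the original pipeline.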

