ON THE PREDICTABILITY OF PRUNING ACROSS SCALES

Abstract

We show that the error of iteratively pruned networks empirically follows a scaling law with interpretable coefficients that depend on the architecture and task. We functionally approximate the error of the pruned networks, showing that it is predictable in terms of an invariant tying width, depth, and pruning level, such that networks of vastly different sparsities are freely interchangeable. We demonstrate the accuracy of this functional approximation over scales spanning orders of magnitude in depth, width, dataset size, and sparsity. We show that the scaling law's functional form holds (generalizes) for large-scale datasets (CIFAR-10, ImageNet), architectures (ResNets, VGGs), and iterative pruning algorithms (IMP, SynFlow). As neural networks become ever larger and more expensive to train, our findings suggest a framework for reasoning conceptually and analytically about pruning.

1. INTRODUCTION

For decades, neural network pruning (eliminating unwanted parts of the network) has been a popular approach for reducing network sizes or the computational demands of inference (LeCun et al., 1990; Reed, 1993; Han et al., 2015). In practice, pruning can reduce the parameter-counts of contemporary models by 2x (Gordon et al., 2020) to 5x (Renda et al., 2020) with no reduction in accuracy. More than 80 pruning techniques have been published in the past decade (Blalock et al., 2020), but, despite this enormous volume of research, there remains little guidance on important aspects of pruning. Consider a seemingly simple question one might ask when using a particular pruning technique: Given a family of neural networks (e.g., ResNets on ImageNet of various widths and depths), which family member should we prune (and by how much) to obtain the network with the smallest parameter-count such that error does not exceed some threshold k?

As a first try, we could attempt to answer this question using brute force: we could prune every member of a family (i.e., perform grid search over widths and depths) and select the smallest pruned network that satisfies our constraint on error. However, depending on the technique, pruning one network (let alone grid searching) could take days or weeks on expensive hardware. If we want a more efficient alternative, we will need to make assumptions about pruned networks: namely, that there is some structure to the way that their error behaves. For example, that pruning a particular network changes the error in a predictable way, or that changing the width or depth of a network changes the error when pruning it in a predictable way. We could then train a smaller number of networks, characterize this structure, and estimate the answer to our question.

We have reason to believe that such structure does exist for pruning: techniques already take advantage of it implicitly. For example, Cai et al. (2019) create a single neural network architecture that can be scaled down to many different sizes; to choose which subnetwork to deploy, Cai et al. train an auxiliary, black-box neural network to predict subnetwork performance. Although this black-box approach implies the existence of structure, it does not reveal this structure explicitly or make it possible to reason analytically in a fashion that could answer our research question.

Outside the context of pruning algorithms, such structure has been observed, and further codified explicitly, yielding insights and predictions in the form of scaling laws. Tan and Le (2019) design the EfficientNet family by developing a heuristic for predicting efficient tradeoffs between depth, width, and resolution. Hestness et al. (2017) observe a power-law relationship between dataset size and the error of vision and NLP models. Rosenfeld et al. (2020) use a power scaling law to predict the error of all variations of architecture families and dataset sizes jointly, for computer vision and natural language processing settings. Kaplan et al. (2020) develop a similar power law for language models that incorporates the computational cost of training.

Inspired by this line of work, we address our research question about pruning by developing a scaling law to predict the error of networks as they are pruned. To the best of our knowledge, no explicit scaling law that holds across pruning algorithms and network types currently exists. To formulate such a predictive scaling law, we consider the dependence of generalization error on the pruning-induced density for networks of different depths and widths trained on different dataset sizes. We begin by developing a functional form that accurately estimates the generalization error of a specific model as it is pruned (Section 3).
We then account for other architectural degrees of freedom, expanding the functional form for pruning into a scaling law that jointly considers density alongside width, depth, and dataset size (Section 4). The basis for this joint scaling law is an invariant we uncover that describes ways that we can interchange depth, width, and pruning without affecting error. The result is a scaling law that accurately predicts the performance of pruned networks across scales. Finally, we use this scaling law to answer our motivating question (Section 7).

The same functional form can accurately estimate the error for both unstructured magnitude pruning (Renda et al., 2020) and SynFlow (Tanaka et al., 2020) when fit to the corresponding data, suggesting we have uncovered structure that may be applicable to iterative pruning more generally. And now that we have established this functional form, fitting it requires only a small amount of data (Appendix 5).

In summary, our contributions are as follows:

• We develop a scaling law that accurately estimates the error when pruning a single network.
• We observe and characterize an invariant that allows error-preserving interchangeability among depth, width, and pruning density.
• Using this invariant, we extend our single-network scaling law into a joint scaling law that predicts the error of all members of a network family at all dataset sizes and all pruning densities.
• In doing so, we demonstrate that there is structure to the behavior of the error of iteratively pruned networks that we can capture explicitly with a simple functional form.
• Our scaling law enables a framework for reasoning analytically about pruning, allowing us to answer our motivating question and similar questions about pruning.

2. EXPERIMENTAL SETUP

Pruning. We study two techniques for pruning neural networks: iterative magnitude pruning (IMP) (Janowsky, 1989; Han et al., 2015; Frankle et al., 2020) in the main body of the paper and SynFlow (Tanaka et al., 2020) in Appendix E. We describe IMP in detail here and SynFlow in Appendix A. IMP prunes by removing a fraction (typically 20%, as we do here) of individual weights with the lowest magnitudes at the end of training.[1] We choose these weights globally throughout the network, i.e., without regard to specific layers. We use per-weight magnitude pruning because it is generic, well-studied (Han et al., 2015), and matches the sparsity/accuracy tradeoffs of more complicated methods (Gale et al., 2019; Blalock et al., 2020; Renda et al., 2020).

Pruning weights typically reduces the accuracy of the trained network, so it is standard practice to train further after pruning to recover accuracy. For IMP, we use a practice called weight rewinding, in which the values of unpruned weights are rewound to their values at epoch 10 and the training process is repeated from there to completion. To achieve density levels below 80%, this process is repeated iteratively (pruning by 20%, rewinding, and retraining) until a desired density level is reached. Renda et al. (2020) demonstrate that IMP with weight rewinding achieves state-of-the-art tradeoffs between sparsity and accuracy. For a formal statement of this pruning algorithm, see Appendix A.

Datasets. We study the image classification tasks CIFAR-10 and ImageNet. Our scaling law predicts the error when training with the entire dataset and with smaller subsamples.
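The global magnitude-pruning step at the heart of the IMP procedure described in the Pruning paragraph can be sketched in a few lines of NumPy. This is an illustrative sketch, not the authors' implementation; the function name and the mask representation are our own, and the rewind-and-retrain loop is only summarized in comments:

```python
import numpy as np

def imp_prune_step(weights, masks, prune_fraction=0.2):
    """One IMP round: globally prune the lowest-magnitude 20% of the
    weights that survive under the current masks (no per-layer quotas)."""
    # Magnitudes of currently unpruned weights, pooled across all layers.
    surviving = np.concatenate(
        [np.abs(w)[m.astype(bool)] for w, m in zip(weights, masks)]
    )
    k = int(prune_fraction * surviving.size)
    threshold = np.partition(surviving, k)[k]  # k-th smallest surviving magnitude
    # Weights below the threshold are pruned; already-pruned weights stay pruned.
    return [m * (np.abs(w) >= threshold) for w, m in zip(weights, masks)]

# Full IMP with weight rewinding (schematically):
#   train to completion; then repeat:
#     masks = imp_prune_step(weights, masks)   # prune 20% globally
#     rewind unpruned weights to their epoch-10 values
#     retrain from epoch 10 to completion
# After t rounds, the density is 0.8 ** t (e.g., ~10.7% after 10 rounds).
```

Because the threshold is computed over all layers jointly, layers with many small weights end up sparser than layers with large weights, which is the intended behavior of global magnitude pruning.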



[1] We do not prune biases or BatchNorm, so pruning 20% of weights prunes fewer than 20% of parameters.



To subsample a dataset to a size of n, we randomly select n of the training examples without regard to individual classes such
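Class-agnostic subsampling of this kind is straightforward to implement; a minimal sketch follows (the function name, seed handling, and example sizes are our own, not from the paper):

```python
import numpy as np

def subsample_indices(num_examples, n, seed=0):
    """Choose n training-example indices uniformly at random,
    without regard to class labels (class-agnostic subsampling)."""
    rng = np.random.default_rng(seed)
    return rng.choice(num_examples, size=n, replace=False)

# e.g., a 10% subsample of CIFAR-10's training set:
# idx = subsample_indices(50_000, 5_000)
```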

