ON THE PREDICTABILITY OF PRUNING ACROSS SCALES

Abstract

We show that the error of iteratively pruned networks empirically follows a scaling law with interpretable coefficients that depend on the architecture and task. We functionally approximate the error of the pruned networks, showing that it is predictable in terms of an invariant tying width, depth, and pruning level, such that networks of vastly different sparsities are freely interchangeable. We demonstrate the accuracy of this functional approximation over scales spanning orders of magnitude in depth, width, dataset size, and sparsity. We show that the scaling-law functional form holds (generalizes) across large-scale datasets (CIFAR-10, ImageNet), architectures (ResNets, VGGs), and iterative pruning algorithms (IMP, SynFlow). As neural networks become ever larger and more expensive to train, our findings suggest a framework for reasoning conceptually and analytically about pruning.

1. INTRODUCTION

For decades, neural network pruning (eliminating unwanted parts of the network) has been a popular approach for reducing network sizes or the computational demands of inference (LeCun et al., 1990; Reed, 1993; Han et al., 2015). In practice, pruning can reduce the parameter-counts of contemporary models by 2x (Gordon et al., 2020) to 5x (Renda et al., 2020) with no reduction in accuracy. More than 80 pruning techniques have been published in the past decade (Blalock et al., 2020), but, despite this enormous volume of research, there remains little guidance on important aspects of pruning.

Consider a seemingly simple question one might ask when using a particular pruning technique: Given a family of neural networks (e.g., ResNets on ImageNet of various widths and depths), which family member should we prune, and by how much, to obtain the network with the smallest parameter-count such that error does not exceed some threshold k?

As a first try, we could attempt to answer this question using brute force: we could prune every member of a family (i.e., perform grid search over widths and depths) and select the smallest pruned network that satisfies our constraint on error. However, depending on the technique, pruning one network (let alone grid searching) could take days or weeks on expensive hardware.

If we want a more efficient alternative, we will need to make assumptions about pruned networks: namely, that there is some structure to the way their error behaves. For example, that pruning a particular network changes the error in a predictable way, or that changing the width or depth of a network changes the error when pruning it in a predictable way. We could then train a smaller number of networks, characterize this structure, and estimate the answer to our question.

We have reason to believe that such structure does exist for pruning: techniques already take advantage of it implicitly. For example, Cai et al. (2019) create a single neural network architecture that can be scaled down to many different sizes; to choose which subnetwork to deploy, Cai et al. train an auxiliary, black-box neural network to predict subnetwork performance. Although this black-box approach implies the existence of structure, it does not reveal this structure explicitly or make it possible to reason analytically in a fashion that could answer our research question.

Outside the context of pruning algorithms, such structure has been observed, and further codified explicitly, yielding insights and predictions in the form of scaling laws. Tan and Le (2019) design the EfficientNet family by developing a heuristic for predicting efficient tradeoffs between depth, width, and resolution. Hestness et al. (2017) observe a power-law relationship between dataset size
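To make the scaling-law idea concrete, the sketch below fits the simplest such form, err ≈ a · n^(−b), to a handful of (size, error) measurements by least squares in log-log space. This is an illustrative simplification, not the functional form developed in this paper (which couples width, depth, and pruning level); the data here is synthetic and the helper name is our own.

```python
import numpy as np

def fit_power_law(sizes, errors):
    """Fit errors ~ a * sizes**(-b) by linear regression in log-log space.

    A power law is a straight line after taking logs:
        log(err) = log(a) - b * log(n)
    so the slope and intercept of that line recover b and a.
    """
    slope, intercept = np.polyfit(np.log(sizes), np.log(errors), 1)
    return np.exp(intercept), -slope

# Synthetic, noise-free measurements generated from a known power law,
# standing in for the handful of networks one would actually train.
sizes = np.array([1e3, 1e4, 1e5, 1e6])
errors = 2.0 * sizes ** -0.35

a, b = fit_power_law(sizes, errors)
print(round(a, 2), round(b, 2))  # recovers the generating a = 2.0, b = 0.35
```

Given such a fit, one could predict the error of an untrained configuration and pick the smallest one satisfying an error budget, rather than grid searching over the whole family, which is precisely the kind of analytical reasoning a black-box predictor does not support.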

