WHEN ARE NEURAL PRUNING APPROXIMATION BOUNDS USEFUL?

Abstract

Approximation bounds for neural network pruning attempt to predict the tradeoff between sparsity and fidelity when shrinking neural networks. In the first half of this paper, we empirically evaluate the predictive power of two recently proposed methods based on coreset algorithms. We identify several circumstances in which the bounds are loose or impractical to use and provide a brief analysis of the components of the bounds that contribute to those shortcomings. In the second half, we examine the role of fine-tuning in prunability and observe that even tight approximation bounds would be poor predictors of accuracy after fine-tuning. This is because fine-tuning can recover large amounts of accuracy while simultaneously maintaining or even increasing approximation error. We discuss the implications of these findings for the application of coreset-based pruning methods in practice and for the role of approximation in the pruning community. Our code is available in the attached supplementary material.

1. INTRODUCTION

Recent work has attempted to prune neural networks while providing a mathematical upper bound on the deviation from the original network. These bounds were first introduced as a component in proofs of generalization in statistical learning theory (Arora et al., 2018). Recent improvements on these pruning methods have come from the field of randomized algorithms, leveraging the concept of coresets (Mussay et al., 2020; Baykal et al., 2019a;b; Liebenwein et al., 2019), with empirical evaluations focusing on practical improvements in the sparsity/accuracy trade-off. However, surprisingly little attention has been given to evaluating the tightness of the proposed bounds or the predictive power of the guarantees they provide. In the first half of this paper, we fill this gap by empirically evaluating the tightness of two recently proposed coreset bounds, using them to predict the accuracy of neural networks after pruning. We conduct a series of experiments of increasing difficulty, starting with networks that are known to be highly sparse and ending with networks in which no regularization is applied during training. We identify several circumstances in which the bounds are loose or impractical to use and provide a brief analysis of the bound components that contribute to those shortcomings. In the second half of the paper we introduce a fine-tuning procedure after pruning, which is common practice in the pruning community. Fine-tuning can drastically improve the accuracy achieved at high levels of sparsity, taking some models from 40% accuracy after pruning to 98% accuracy after fine-tuning. However, we observe that this improvement does not usually include a reduction in approximation error; indeed, fine-tuning can even increase the approximation error in hidden layers while simultaneously recovering the original performance of the network.
We argue that this phenomenon limits the predictive power of approximation bounds which are constructed "layer-by-layer" and which ignore the final softmax layer at the output. This motivates further discussion of the practical applications of coreset-based methods and whether "approximation" is the correct goal for the pruning community.

2. RELATED WORK

Over-parameterization A popular recipe for training neural networks is to make the network as large as possible, such that it achieves zero training loss, and then regularize to encourage generalization (Karpathy, 2019). It is possible that such over-parameterized networks have favorable optimization properties (Li et al., 2020; Du et al., 2018b). However, over-parameterized networks are becoming increasingly burdensome to evaluate on consumer hardware, which has driven interest in precise and effective pruning methods that can take advantage of sparsity in the discovered solutions.

Pruning With Guarantees Despite the massive literature on neural pruning techniques, few papers provide any kind of mathematical guarantee on the result of pruning. Besides the coreset methods discussed below, Serra et al. (2020) provide a method for "losslessly" compressing feedforward neural networks by solving a mixed-integer linear program. They observe that applying L1 regularization during training is required to achieve non-negligible lossless compression rates, which closely aligns with the observations in our work. Havasi et al. (2018) provide an explicit compression/accuracy trade-off for Bayesian neural networks, but their compression method cannot be used to improve inference efficiency, only the memory needed to store the model.

Empirical Scaling Laws Rosenfeld et al. (2020) have observed that error after pruning can be empirically modeled as a function of network size, dataset size, and sparsity with good predictive results. This can be seen as a more practical alternative to formal guarantees on accuracy after pruning, where a few small-scale experiments are run to estimate the accuracy after pruning at larger scales.

3. CORESET APPROXIMATION BOUNDS

3.1. CORESET WEIGHT PRUNING

An algorithm from Baykal et al. (2019b) for pruning the weights of a single perceptron is presented in Algorithm 1; it involves sampling from a distribution over weights, where each weight's sampling probability roughly measures its importance to maintaining the output. The algorithm keeps the selected weights (with some re-weighting) and prunes the rest. The resulting pruned perceptron is guaranteed to output values that are arbitrarily close to those of the original perceptron if a sufficiently large number of samples is taken. The relationship between the closeness of the approximation and the number of samples required is given by Theorem A.1. Further details and required assumptions are provided in Appendix A.1. The method can be used to sparsify a whole layer by sparsifying each neuron individually, and bounds for pruning multi-layer neural networks can be constructed using a "layer-by-layer" approach, which is discussed in Section 5.2. Baykal et al. (2019b) also present a deterministic version of their algorithm, which selects the weights with the lowest importances for pruning. This alternative algorithm is presented in Algorithm 2, and the corresponding guaranteed approximation error is presented in Theorem A.2, the proof of which is based on the probabilistic proof of Theorem A.1.
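To make the sampling scheme concrete, the following sketch illustrates the general recipe (this is our own simplification, not the authors' exact algorithm): each weight of a single neuron receives a data-informed importance score, indices are sampled with replacement from the induced distribution, and the survivors are reweighted so the pruned neuron's pre-activation remains an unbiased estimate of the original. The function name and the particular importance proxy are hypothetical.

```python
import numpy as np

def prune_neuron_weights(w, X, m, rng=None):
    """Sparsify one neuron's weight vector by importance sampling.

    w : (d,) weight vector of the neuron.
    X : (n, d) sample of input points used to estimate importance.
    m : number of with-replacement samples to draw.
    Returns a reweighted sparse weight vector w_hat with <= m nonzeros.
    """
    rng = np.random.default_rng(rng)
    # Importance of weight j: the largest contribution |w_j * x_j| seen in
    # the data -- a crude proxy for the sensitivity scores used in the
    # coreset analyses.
    s = np.max(np.abs(w[None, :] * X), axis=0)
    if s.sum() == 0:
        return np.zeros_like(w, dtype=float)
    p = s / s.sum()
    idx = rng.choice(len(w), size=m, replace=True, p=p)
    w_hat = np.zeros_like(w, dtype=float)
    # Horvitz-Thompson style reweighting keeps the estimator unbiased:
    # each draw contributes w_j / (m * p_j), so E[w_hat_j] = w_j.
    np.add.at(w_hat, idx, w[idx] / (m * p[idx]))
    return w_hat
```

Larger `m` concentrates the pruned pre-activations around the originals, which is exactly the sample-size/error trade-off that Theorem A.1 quantifies.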

3.2. CORESET NEURON PRUNING

Whole neurons can be pruned rather than individual weights. In this case, importances are assigned to each neuron, creating a probability distribution that can be sampled from. A method from Mussay et al. (2020) for pruning feed-forward neural networks with a single hidden layer is shown in Algorithm 3. This algorithm is data-independent: it ignores the training data and instead only assumes that the inputs have a bounded L2 norm. Neuron importance is determined by the L2 norm of the parameters of each neuron and the magnitude of the weights assigned to it in the next layer. Similar to weight pruning, the approximation error can be bounded from above depending on the number of samples taken from the distribution over neurons. The relationship is expressed in Theorem A.4. Notice that unlike the multiplicative approximation bound provided for weight pruning, this bound is an additive error bound, meaning that the guaranteed error term is constant regardless of the scale of the "true" value. More details about this method are provided in Appendix A.3.
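The data-independent importance described above can be sketched as follows (a simplified illustration under our own assumptions, not Algorithm 3 itself): a hidden neuron's score is the product of the L2 norm of its incoming weights and the magnitude of its outgoing weight, neurons are sampled from the induced distribution, and the surviving outgoing weights are reweighted. The function name and the single-output restriction are our simplifications.

```python
import numpy as np

def prune_hidden_neurons(W1, w2, m, rng=None):
    """Prune hidden neurons of a one-hidden-layer network by importance sampling.

    W1 : (h, d) incoming weights; row i parameterizes hidden neuron i.
    w2 : (h,) outgoing weights to a single output unit.
    m  : number of with-replacement neuron samples to draw.
    Returns (W1_hat, w2_hat) with at most m surviving neurons.
    """
    rng = np.random.default_rng(rng)
    # Data-independent importance: a neuron matters if its incoming weights
    # are large (it can activate strongly on norm-bounded inputs) and its
    # outgoing weight is large (its activation is amplified downstream).
    s = np.linalg.norm(W1, axis=1) * np.abs(w2)
    p = s / s.sum()
    idx = rng.choice(len(w2), size=m, replace=True, p=p)
    keep = np.unique(idx)
    counts = np.bincount(idx, minlength=len(w2))[keep]
    # Reweight surviving outgoing weights so the pruned network's output
    # is an unbiased estimate of the original on the kept support.
    w2_hat = w2[keep] * counts / (m * p[keep])
    return W1[keep], w2_hat
```

Because the scores never consult training data, only the bounded-input assumption, the resulting guarantee is additive: the error term depends on the input norm bound rather than scaling with the magnitude of the true output.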

