WHEN ARE NEURAL PRUNING APPROXIMATION BOUNDS USEFUL?

Abstract

Approximation bounds for neural network pruning attempt to predict the tradeoff between sparsity and fidelity when shrinking neural networks. In the first half of this paper, we empirically evaluate the predictive power of two recently proposed methods based on coreset algorithms. We identify several circumstances in which the bounds are loose or impractical to use and provide a brief analysis of the components of the bounds that contribute to those shortcomings. In the second half, we examine the role of fine-tuning in prunability and observe that even tight approximation bounds would be poor predictors of accuracy after fine-tuning. This is because fine-tuning can recover large amounts of accuracy while simultaneously maintaining or even increasing approximation error. We discuss the implications of these findings for the application of coreset-based pruning methods in practice and for the role of approximation in the pruning community. Our code is available in the attached supplementary material.

1. INTRODUCTION

Recent work has attempted to prune neural networks while providing a mathematical upper bound on the deviation from the original network. These bounds were first introduced as a component in proofs of generalization in statistical learning theory (Arora et al., 2018). Recent improvements on these pruning methods have come from the field of randomized algorithms, leveraging the concept of coresets (Mussay et al., 2020; Baykal et al., 2019a;b; Liebenwein et al., 2019), with empirical evaluations focusing on practical improvements in the sparsity/accuracy trade-off. However, surprisingly little attention has been given to evaluating the tightness of the proposed bounds or the predictive power of the guarantees they provide. In the first half of this paper, we fill this gap by empirically evaluating the tightness of two recently proposed coreset bounds, using them to predict the accuracy of neural networks after pruning. We conduct a series of experiments of increasing difficulty, starting with networks that are known to be highly sparse and ending with networks trained without any regularization. We identify several circumstances in which the bounds are loose or impractical to use and provide a brief analysis of the bound components that contribute to those shortcomings. In the second half of the paper we introduce a fine-tuning procedure after pruning, which is common practice in the pruning community. Fine-tuning can drastically improve the accuracy achieved at high levels of sparsity, taking some models from 40% accuracy after pruning to 98% accuracy after fine-tuning. However, we observe that this improvement does not usually include a reduction in approximation error; indeed, fine-tuning can even increase approximation error in hidden layers while simultaneously recovering the original performance of the network.
We argue that this phenomenon limits the predictive power of approximation bounds that are constructed "layer-by-layer" and that ignore the final softmax layer at the output. This motivates further discussion of the practical applications of coreset-based methods and of whether "approximation" is the correct goal for the pruning community.
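The distinction between layer-wise approximation error and output-level accuracy can be made concrete with a small sketch. The toy network, the magnitude-pruning rule, and all sizes below are illustrative assumptions, not the methods evaluated in this paper; the sketch only shows that the two quantities a bound might target (hidden-layer deviation vs. agreement of final predictions) are measured differently.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 2-layer network (illustrative stand-in; not the models studied here).
W1 = rng.normal(size=(64, 32))
W2 = rng.normal(size=(32, 10))

def forward(x, w1, w2):
    h = np.maximum(x @ w1, 0.0)  # ReLU hidden layer
    return h @ w2                # logits (pre-softmax)

def magnitude_prune(w, sparsity):
    """Zero out the smallest-magnitude fraction `sparsity` of the weights."""
    k = int(sparsity * w.size)
    thresh = np.sort(np.abs(w), axis=None)[k]
    return np.where(np.abs(w) < thresh, 0.0, w)

X = rng.normal(size=(256, 64))
W1p, W2p = magnitude_prune(W1, 0.8), magnitude_prune(W2, 0.8)

# Layer-wise relative approximation error: the kind of quantity
# layer-by-layer bounds control.
h, hp = np.maximum(X @ W1, 0.0), np.maximum(X @ W1p, 0.0)
layer_err = np.linalg.norm(h - hp) / np.linalg.norm(h)

# Output-level behavior after the argmax over logits: the quantity that
# actually determines classification accuracy.
agree = np.mean(forward(X, W1, W2).argmax(1) == forward(X, W1p, W2p).argmax(1))
print(f"hidden-layer relative error: {layer_err:.2f}, "
      f"prediction agreement: {agree:.2f}")
```

A fine-tuning step would update `W1p` and `W2p` to improve agreement (accuracy) directly, and nothing in that objective forces `layer_err` to shrink, which is the gap the discussion above points to.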

