A GRADIENT FLOW FRAMEWORK FOR ANALYZING NETWORK PRUNING

Abstract

Recent network pruning methods focus on pruning models early-on in training. To estimate the impact of removing a parameter, these methods use importance measures that were originally designed to prune trained models. Despite lacking justification for their use early-on in training, such measures result in surprisingly low accuracy loss. To better explain this behavior, we develop a general framework that uses gradient flow to unify state-of-the-art importance measures through the norm of model parameters. We use this framework to determine the relationship between pruning measures and evolution of model parameters, establishing several results related to pruning models early-on in training: (i) magnitude-based pruning removes parameters that contribute least to reduction in loss, resulting in models that converge faster than magnitude-agnostic methods; (ii) loss-preservation based pruning preserves first-order model evolution dynamics and its use is therefore justified for pruning minimally trained models; and (iii) gradient-norm based pruning affects second-order model evolution dynamics, such that increasing gradient norm via pruning can produce poorly performing models. We validate our claims on several VGG-13, MobileNet-V1, and ResNet-56 models trained on CIFAR-10/CIFAR-100.

1. INTRODUCTION

The use of Deep Neural Networks (DNNs) in intelligent edge systems has been enabled by extensive research on model compression. "Pruning" techniques are commonly used to remove "unimportant" filters to either preserve or promote specific, desirable model properties. Most pruning methods were originally designed to compress trained models, with the goal of reducing inference costs only. For example, Li et al. (2017); He et al. (2018) proposed removing filters with small ℓ1/ℓ2 norm, thus ensuring minimal change in model output. Molchanov et al. (2017; 2019); Theis et al. (2018) proposed preserving the loss of a model, generally using Taylor expansions around a filter's parameters to estimate the change in loss caused by its removal. Recent works focus on pruning models at initialization (Lee et al. (2019; 2020)) or after minimal training (You et al. (2020)), thus enabling reduction in both inference and training costs. To estimate the impact of removing a parameter, these methods use the same importance measures as those designed for pruning trained models. Since such measures focus on preserving model outputs or loss, Wang et al. (2020) argue they are not well-motivated for pruning models early-on in training. However, in this paper, we demonstrate that if the relationship between importance measures used for pruning trained models and the evolution of model parameters is established, their use early-on in training can be better justified. In particular, we employ gradient flow¹ to develop a general framework that relates state-of-the-art importance measures used in network pruning through the norm of model parameters. This framework establishes the relationship between regularly used importance measures and the evolution of a model's parameters, thus demonstrating why measures designed to prune trained models also perform well early-on in training.
More generally, our framework enables a better understanding of which properties make a parameter dispensable according to a particular importance measure. Our findings are as follows. (i) Magnitude-based pruning measures remove parameters that contribute least to reduction in loss; models pruned this way therefore converge faster than models pruned with magnitude-agnostic measures. (ii) Loss-preservation based measures remove parameters with the least tendency to change, thus preserving first-order model evolution dynamics; the use of loss preservation is therefore justified for pruning models early-on in training as well. (iii) Gradient-norm based pruning is linearly related to second-order model evolution dynamics. Increasing gradient norm via pruning for even slightly trained models can permanently damage earlier layers, producing poorly performing architectures; this behavior results from aggressively pruning filters that maximally increase model loss. We validate our claims on several VGG-13, MobileNet-V1, and ResNet-56 models trained on CIFAR-10 and CIFAR-100.
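As a concrete toy illustration of these three classes of measures (our own sketch, not the paper's code), the following NumPy snippet scores the parameters of a small quadratic loss by (i) magnitude, (ii) the first-order Taylor loss-preservation term |θ g|, and (iii) a GraSP-style gradient-norm term θ (Hg). The quadratic form, its coefficients, and the sign convention for (iii) are assumptions made for illustration; sign conventions for gradient-norm scores vary across works.

```python
import numpy as np

# Toy quadratic loss L(theta) = 0.5 * theta^T A theta - b^T theta,
# so the gradient is g = A @ theta - b and the Hessian is H = A.
A = np.diag([1.0, 2.0, 4.0])
b = np.array([1.0, 1.0, 1.0])
theta = np.array([2.0, 0.5, -0.1])

g = A @ theta - b          # g(Theta(t))
H = A                      # H(Theta(t))

# (i) Magnitude-based importance: |theta|.
I_magnitude = np.abs(theta)

# (ii) Loss-preservation importance: first-order Taylor estimate of the
# change in loss when a parameter is zeroed, |theta * g|.
I_loss_pres = np.abs(theta * g)

# (iii) Gradient-norm-based (GraSP-style) importance: theta * (H g),
# the per-parameter first-order contribution to the change in ||g||^2.
I_grad_norm = theta * (H @ g)

# Prune the single least-important parameter under each measure.
for name, score in [("magnitude", I_magnitude),
                    ("loss-preservation", I_loss_pres),
                    ("gradient-norm", I_grad_norm)]:
    print(name, "-> prune index", int(np.argmin(score)))
```

Note that the three measures already disagree on this toy problem: magnitude prunes the smallest weight, while the Taylor-based scores prune the weight whose gradient component is zero.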

2. RELATED WORK

Several pruning frameworks define importance measures to estimate the impact of removing a parameter. The most popular importance measures are based on parameter magnitude (Li et al. (2017); He et al. (2018); Liu et al. (2017)) or loss preservation (Molchanov et al. (2019; 2017); Theis et al. (2018); Gao et al. (2019)). Recent works show that, using these measures, models pruned at initialization (Lee et al. (2019); Wang et al. (2020); Hayou et al. (2021)) or after minimal training (You et al. (2020)) achieve final performance similar to that of the original networks. Since measures for pruning trained models are motivated by output or loss preservation, Wang et al. (2020) argue they may not be well suited for pruning models early-on in training. They thus propose GraSP, a measure that promotes the preservation of parameters that increase the gradient norm. Despite its success, the foundations of network pruning are not well understood. Recent work has shown that good "subnetworks" that achieve performance similar to the original network exist within both trained (Ye et al. (2020)) and untrained models (Frankle & Carbin (2019); Malach et al. (2020); Pensia et al. (2020)). These works thus prove that networks can be pruned without loss in performance, but do not indicate how a network should be pruned, i.e., which importance measures are preferable. In fact, Liu et al. (2019) show that reinitializing pruned models before retraining rarely affects their performance, indicating that the consequential differences among importance measures lie in the properties of the architectures they produce. Since different importance measures perform differently (see Appendix E), analyzing popular measures to determine which model properties they tend to preserve can reveal which measures lead to better-performing architectures.

From an implementation standpoint, pruning approaches fall into two categories. The first, structured pruning (Li et al. (2017); He et al. (2018); Liu et al. (2017); Molchanov et al. (2019; 2017); Gao et al. (2019)), removes entire filters, thus preserving structural regularity and directly improving execution efficiency on commodity hardware platforms. The second, unstructured pruning (Han et al. (2016b); LeCun et al. (1990); Hassibi & Stork (1993)), is more fine-grained, operating at the level of individual parameters instead of filters. Unstructured pruning has recently been used to reduce computational complexity as well, but requires specially designed hardware (Han et al. (2016a)) or software (Elsen et al. (2020)) capable of accelerating sparse operations. By clarifying the benefits and pitfalls of popular importance measures, our work aims to ensure practitioners can make informed choices for reducing DNN training/inference expenditure via network pruning. Thus, while the results in this paper apply in both structured and unstructured settings, our experimental evaluation primarily focuses on structured pruning early-on in training. Results on unstructured pruning are relegated to Appendix H.

¹ Gradient flow refers to gradient descent with an infinitesimal learning rate; see Equation 6 for a short primer.

3. PRELIMINARIES: CLASSES OF STANDARD IMPORTANCE MEASURES

In this section, we review the most successful classes of importance measures for network pruning; these measures are our focus in subsequent sections. We use bold symbols to denote vectors and italics for scalar variables. Consider a model parameterized as Θ(t) at time t. We denote the gradient of the loss with respect to the model parameters at time t as g(Θ(t)), the Hessian as H(Θ(t)), and the model loss as L(Θ(t)). A general model parameter is denoted θ(t), and the importance of a set of parameters Θ_p(t) is denoted I(Θ_p(t)).
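Since later sections build on gradient flow (per the footnote, gradient descent with an infinitesimal learning rate), the following minimal sketch (ours, for illustration; the quadratic loss and step size are assumptions) shows that Euler discretization of dΘ/dt = -g(Θ(t)) with step size η recovers ordinary gradient descent, and that L(Θ(t)) is non-increasing along the flow, since dL/dt = -||g||² ≤ 0.

```python
import numpy as np

# Quadratic loss L(theta) = 0.5 * ||theta||^2, with gradient g(theta) = theta.
def loss(theta):
    return 0.5 * float(theta @ theta)

def grad(theta):
    return theta

# Euler discretization of gradient flow d(theta)/dt = -g(theta(t)):
# one step with learning rate eta is exactly a gradient-descent update.
eta = 0.01
theta = np.array([1.0, -2.0])
losses = [loss(theta)]
for _ in range(1000):
    theta = theta - eta * grad(theta)   # theta(t + eta) ~ theta(t) - eta * g
    losses.append(loss(theta))

# Along the flow dL/dt = -||g||^2 <= 0, so the recorded loss never increases.
assert all(l2 <= l1 for l1, l2 in zip(losses, losses[1:]))
print("final loss:", losses[-1])
```

As η → 0, the discrete iterates converge to the continuous-time trajectory Θ(t), which is the object the analysis in later sections works with.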

