REVISITING LOSS MODELLING FOR UNSTRUCTURED PRUNING

Abstract

By removing parameters from deep neural networks, unstructured pruning methods aim to reduce memory footprint and computational cost while maintaining prediction accuracy. To tackle this otherwise intractable problem, many of these methods model the loss landscape using first- or second-order Taylor expansions to identify which parameters can be discarded. We revisit loss modelling for unstructured pruning: we show the importance of ensuring the locality of the pruning steps, and we systematically compare first- and second-order Taylor expansions. Finally, we show that better preserving the original network function does not necessarily transfer to better-performing networks after fine-tuning, suggesting that considering only the impact of pruning on the loss may not be a sufficient objective for designing good pruning criteria.

1. INTRODUCTION

Neural networks are getting bigger, requiring ever more computational resources not only for training but also for inference. However, resources are often limited, especially on mobile devices and low-power chips. In unstructured pruning, the goal is to remove some parameters (i.e. set them to zero) while maintaining good prediction performance. This is fundamentally a combinatorial optimization problem that is intractable even for small-scale neural networks, and thus various heuristics have been developed to prune the model either before training (Lee et al., 2019b; Wang et al., 2020), during training (Louizos et al., 2017; Molchanov et al., 2017; Ding et al., 2019), or in an iterative training/fine-tuning fashion (LeCun et al., 1990; Hassibi & Stork, 1993; Han et al., 2015; Frankle & Carbin, 2018; Renda et al., 2020). Early pruning work, Optimal Brain Damage (OBD) (LeCun et al., 1990), and later Optimal Brain Surgeon (OBS) (Hassibi & Stork, 1993), proposed to estimate the importance of each parameter by approximating the effect of removing it, using the second-order term of a Taylor expansion of the loss function around converged parameters. This type of approach involves the Hessian, which is challenging to compute since it scales quadratically with the number of parameters in the network. Several approximations have thus been explored in the literature (LeCun et al., 1990; Hassibi & Stork, 1993; Heskes, 2000; Zeng & Urtasun, 2019; Wang et al., 2019). However, state-of-the-art unstructured pruning methods typically rely on Magnitude Pruning (MP) (Han et al., 2015), a simple and computationally cheap criterion based on weight magnitude that works extremely well in practice (Renda et al., 2020). This paper revisits linear and diagonal quadratic models of the local loss landscape for unstructured pruning.
In particular, since these models are local approximations and thus assume that pruning steps correspond to small vectors in parameter space, we investigate how this locality assumption affects their performance. Moreover, we show that the convergence assumption behind OBD and OBS, which is overlooked and violated by current methods, can be relaxed by keeping the gradient term in the quadratic model. Finally, to avoid computing second-order information, we compare diagonal quadratic models to simpler linear models. While our empirical study demonstrates that pruning criteria based on linear and quadratic loss models are good at preserving the training loss, it also shows that this benefit does not necessarily transfer to better networks after fine-tuning, suggesting that preserving the loss might not be the best objective to optimize for. Our contributions can be summarized as follows:

1. We present pruning criteria based on both linear and diagonal quadratic models of the loss, and show how they compare to OBD and MP at preserving the training loss.
2. We study two strategies to better enforce locality in the pruning steps, pruning in several stages and regularising the step size, and show how they improve the quality of the criteria.
3. We show that pruning criteria that are better at preserving the loss do not necessarily yield better fine-tuned networks, raising questions about the adequacy of such criteria.

2. BACKGROUND: UNSTRUCTURED PRUNING

2.1. UNSTRUCTURED PRUNING PROBLEM FORMULATION

For a given architecture, neural networks are a family of functions $f_\theta : \mathcal{X} \to \mathcal{Y}$ from an input space $\mathcal{X}$ to an output space $\mathcal{Y}$, where $\theta \in \mathbb{R}^D$ is the vector containing all the parameters of the network. Neural networks are usually trained by seeking parameters $\theta$ that minimize the empirical risk $L(\theta) = \frac{1}{N} \sum_i \ell(f_\theta(x_i), t_i)$ of a loss function $\ell$ on a training dataset $\mathcal{D} = \{(x_i, t_i)\}_{1 \le i \le N}$, composed of $N$ (example, target) pairs. The goal of unstructured pruning is to find a step $\Delta\theta$ to add to the current parameters $\theta$ such that $\|\theta + \Delta\theta\|_0 = (1 - \kappa)D$, i.e. the parameter vector after pruning has the desired sparsity $\kappa \in [0, 1]$. While doing so, the performance of the pruned network should be maintained, so $L(\theta + \Delta\theta)$ should not differ much from $L(\theta)$. Unstructured pruning thus amounts to the following minimization problem:

$$\underset{\Delta\theta}{\text{minimize}} \;\; \Delta L(\theta, \Delta\theta) \overset{\text{def}}{=} \left| L(\theta + \Delta\theta) - L(\theta) \right| \quad \text{s.t.} \quad \|\theta + \Delta\theta\|_0 = (1 - \kappa)D$$

Directly solving this problem would require evaluating $L(\theta + \Delta\theta)$ for all possible values of $\Delta\theta$, which is prohibitively expensive, so one needs to rely on heuristics to find good solutions.
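The heuristics discussed in this paper all follow the same pattern: assign each parameter a saliency score and zero out the $\kappa D$ least salient ones. As a minimal illustration (the function name and toy vector below are ours, not from the paper), this score-and-mask step can be sketched in NumPy:

```python
import numpy as np

def prune_by_saliency(theta, saliency, kappa):
    """Zero out the kappa*D least salient parameters.

    theta:    1-D parameter vector of size D
    saliency: per-parameter importance scores (same shape as theta)
    kappa:    desired sparsity in [0, 1]
    Returns a pruned copy of theta with ||theta'||_0 = (1 - kappa) * D.
    """
    n_prune = int(round(kappa * theta.size))
    # indices of the n_prune least important parameters
    prune_idx = np.argsort(saliency)[:n_prune]
    theta_pruned = theta.copy()
    theta_pruned[prune_idx] = 0.0
    return theta_pruned

theta = np.array([0.5, -0.1, 2.0, 0.01, -0.7])
saliency = theta ** 2            # e.g. a magnitude-based score
pruned = prune_by_saliency(theta, saliency, kappa=0.4)
# 2 of the 5 weights are zeroed, so 3 non-zeros remain
```

The choice of the `saliency` vector is exactly what differentiates the criteria compared in this paper.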

2.2. OPTIMAL BRAIN DAMAGE CRITERION

Optimal Brain Damage (OBD) (LeCun et al., 1990) proposes a quadratic model of $L(\theta + \Delta\theta)$, leading to the following approximation of $\Delta L(\theta, \Delta\theta)$:

$$\Delta L_{\text{QM}}(\theta, \Delta\theta) = \frac{\partial L(\theta)}{\partial \theta}^{\!\top} \Delta\theta + \frac{1}{2} \Delta\theta^\top H(\theta) \Delta\theta$$

where $H(\theta)$ is the Hessian of $L(\theta)$. $H(\theta)$ being intractable even for small-scale networks, its Generalized Gauss-Newton approximation $G(\theta)$ (Schraudolph, 2002) is used in practice, as detailed in Appendix A.¹ Two further approximations are then made: first, the training of the network is assumed to have converged, so the gradient of the loss w.r.t. $\theta$ is 0, which makes the linear term vanish; second, the interactions between parameters are neglected, which corresponds to a diagonal approximation of $G(\theta)$, leading to the following model:

$$\Delta L_{\text{OBD}}(\theta, \Delta\theta_k) \approx \frac{1}{2} G_{kk}(\theta) \, \Delta\theta_k^2 \;\;\Rightarrow\;\; s_k^{\text{OBD}} = \frac{1}{2} G_{kk}(\theta) \, \theta_k^2 \qquad (3)$$

$s_k^{\text{OBD}}$ is the saliency of each parameter, estimating how much the loss will change if that parameter is pruned, i.e. if $\Delta\theta_k = -\theta_k$. Parameters can thus be ranked by order of importance, and the ones with the smallest saliencies (i.e. the least influence on the loss) are pruned, while the ones with the biggest saliencies are kept unchanged. This can be interpreted as finding and applying a binary mask $m \in \{0, 1\}^D$ to the parameters such that $\theta + \Delta\theta = \theta \odot m$, where $\odot$ is the element-wise product.
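To make Eq. 3 concrete, here is a hedged sketch of the OBD saliency $s_k^{\text{OBD}} = \frac{1}{2} G_{kk}(\theta)\,\theta_k^2$ on a toy problem of our choosing (not from the paper): a linear least-squares loss $L(\theta) = \frac{1}{2N}\|X\theta - t\|^2$, for which the Gauss-Newton diagonal is exact and equals $G_{kk} = \frac{1}{N}\sum_i X_{ik}^2$. For a real network, $G_{kk}$ would instead be estimated as described in Appendix A.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 100, 5
X = rng.normal(size=(N, D))          # toy design matrix
theta = rng.normal(size=D)           # "trained" parameters

# Diagonal of the Generalized Gauss-Newton matrix for this quadratic loss
G_diag = (X ** 2).mean(axis=0)

# OBD saliency per parameter (Eq. 3): s_k = 1/2 * G_kk * theta_k^2
s_obd = 0.5 * G_diag * theta ** 2

# Prune the least salient parameter, i.e. apply delta_theta_k = -theta_k
k = int(np.argmin(s_obd))
theta_pruned = theta.copy()
theta_pruned[k] = 0.0
```

Note that both approximations of OBD are visible here: the gradient term is dropped entirely, and only the diagonal of $G(\theta)$ is used.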

2.3. MAGNITUDE PRUNING CRITERION

Magnitude Pruning (MP) (Han et al., 2015) is a popular pruning criterion in which the saliency is simply the squared magnitude of the parameter:

$$s_k^{\text{MP}} = \theta_k^2 \qquad (4)$$

Despite its simplicity, MP works extremely well in practice (Gale et al., 2019), and is used in current state-of-the-art methods (Renda et al., 2020). We use global MP as a baseline in all our experiments.
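"Global" MP ranks the saliencies of Eq. 4 jointly across all layers, rather than pruning each layer to the same sparsity independently. A minimal sketch, assuming the weights are given as a list of per-layer arrays (the function name and toy layers are ours):

```python
import numpy as np

def global_magnitude_prune(layers, kappa):
    """Global MP: zero the kappa fraction of weights with the smallest
    squared magnitude, ranked across all layers jointly.

    layers: list of weight arrays (any shapes); kappa: global sparsity.
    Returns pruned copies of the layer arrays.
    """
    flat = np.concatenate([w.ravel() for w in layers])
    n_prune = int(round(kappa * flat.size))
    if n_prune == 0:
        return [w.copy() for w in layers]
    # Global threshold: the n_prune-th smallest saliency s_k = theta_k^2
    threshold = np.sort(flat ** 2)[n_prune - 1]
    return [np.where(w ** 2 <= threshold, 0.0, w) for w in layers]

layers = [np.array([0.5, -0.05, 1.2]),
          np.array([[0.01, -2.0], [0.3, 0.02]])]
pruned = global_magnitude_prune(layers, kappa=0.5)
```

Because the threshold is shared across layers, layers with many small weights end up sparser than layers with large weights, which is part of why global MP is a strong baseline.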



¹ Although LeCun et al. (1990) use H(θ) in the equations of OBD, it is actually G(θ) that was used in practice (LeCun, 2007).

