BALANCING TRAINING TIME VS. PERFORMANCE WITH BAYESIAN EARLY PRUNING

Abstract

Pruning is an approach to alleviate overparameterization of deep neural networks (DNNs) by zeroing out or pruning DNN elements with little to no efficacy at a given task. In contrast to related works that prune before or after training, this paper presents a novel method to perform early pruning of DNN elements (e.g., neurons or convolutional filters) during the training process while preserving performance upon convergence. To achieve this, we model the future efficacy of DNN elements in a Bayesian manner conditioned upon efficacy data collected during training, and prune DNN elements which are predicted to have low efficacy after training completion. Empirical evaluations show that the proposed Bayesian early pruning improves the computational efficiency of DNN training. Using our approach, we reduce the training time of ResNet-50 on ImageNet by 48.6% while reaching a validation accuracy of 72.5%.

1. INTRODUCTION

Deep neural networks (DNNs) are known to be overparameterized (Allen-Zhu et al., 2019) as they usually have more learnable parameters than needed for a given learning task. A trained DNN thus contains many ineffectual parameters that can be safely pruned or zeroed out with little to no effect on its predictive accuracy. Pruning (LeCun et al., 1989) alleviates the overparameterization of a DNN by identifying and removing its ineffectual parameters while preserving its predictive accuracy on the validation/test dataset. Pruning is typically applied to a DNN after training to speed up testing-time evaluation. For standard image classification tasks on the MNIST, CIFAR-10, and ImageNet datasets, it can reduce the number of learnable parameters by 50% or more while maintaining test accuracy (Han et al., 2015; Li et al., 2017; Molchanov et al., 2017). Notably, the overparameterization of a DNN also causes considerable training time to be wasted on those DNN elements (e.g., connection weights, neurons, or convolutional filters) which are eventually ineffectual after training and can thus be safely pruned. Our work considers early pruning of such DNN elements by identifying and removing them throughout the training process instead of after training. As a result, this can significantly reduce the time incurred by the training process without compromising the final test accuracy (upon convergence) much. Recent work (Section 5) in foresight pruning (Lee et al., 2019; Wang et al., 2020) shows that pruning heuristics applied at initialization work well to prune connection weights without significantly degrading performance. In contrast to these works, we prune throughout the training procedure, which improves performance after convergence, albeit with somewhat longer training times. In this work, we pose early pruning as a constrained optimization problem (Section 3.1).
A key challenge in the optimization is accurately modeling the future efficacy of DNN elements. We achieve this with a multi-output Gaussian process (MOGP) which models the belief of future efficacy conditioned upon efficacy measurements collected during training (Section 3.2). Although the posed optimization problem is NP-hard, we derive an efficient Bayesian early pruning (BEP) approximation algorithm which appropriately balances the inherent training time vs. performance tradeoff in pruning prior to convergence (Section 3.3). Our algorithm relies on a measure of network element efficacy termed saliency (LeCun et al., 1989). The development of saliency functions is an active area of research with no clear optimal choice. To accommodate this, our algorithm is agnostic, and therefore flexible, to the choice of saliency function. We use BEP to prune neurons and convolutional filters to achieve practical speedup during training (Section 4). Our approach also compares favorably to state-of-the-art works such as SNIP (Lee et al., 2019), GraSP (Wang et al., 2020), and momentum-based dynamic sparse reparameterization (Dettmers & Zettlemoyer, 2019).

2. PRUNING

Consider a dataset of D training examples X = {x_1, ..., x_D}, Y = {y_1, ..., y_D} and a neural network N_{v_t} parameterized by a vector of M prunable network elements (e.g., weight parameters, neurons, or convolutional filters) v_t ≜ [v^a_t]_{a=1,...,M}, where v_t represents the network elements after t iterations of stochastic gradient descent (SGD) for t = 1, ..., T. Let L(X, Y; N_{v_t}) be the loss function for the neural network N_{v_t}. Pruning aims to refine the network elements given some sparsity budget B while preserving the accuracy of the neural network after convergence (i.e., N_{v_T}), which can be stated as a constrained optimization problem (Molchanov et al., 2017):

    min_{m ∈ {0,1}^M} |L(X, Y; N_{m ⊙ v_T}) − L(X, Y; N_{v_T})|   s.t.   ||m||_0 ≤ B        (1)

where ⊙ is the Hadamard product and m is a pruning mask. Note that we abuse the Hadamard product for notational simplicity: for a = 1, ..., M, m^a × v^a_T corresponds to pruning v^a_T if m^a = 0, and keeping v^a_T otherwise. Pruning a network element refers to zeroing the network element or the weight parameters which compute it. Any weight parameters which reference the output of the pruned network element are also zeroed since the element outputs a constant 0. The above optimization problem is difficult due to the NP-hardness of combinatorial optimization. This motivates the use of a saliency function s which measures the efficacy of network elements at minimizing the loss function. A network element with small saliency can be pruned since it is not salient in minimizing the loss. Consequently, pruning can be done by maximizing the total saliency of the retained network elements given the sparsity budget B:

    max_{m ∈ {0,1}^M} Σ_{a=1}^M m^a s(a; X, Y, N_{v_T}, L)   s.t.   ||m||_0 ≤ B        (2)

where s(a; X, Y, N_{v_T}, L) measures the saliency of v^a_T at minimizing L after convergence through T iterations of SGD.
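The masking semantics of (1) can be illustrated with a minimal sketch. Here each element of v is treated as a scalar for simplicity; in a real DNN each entry would be a neuron's or filter's weight tensor, and the function name is ours, not the paper's:

```python
import numpy as np

def apply_pruning_mask(v, m):
    """Zero out pruned network elements: m[a] = 0 prunes v[a].

    Toy illustration of the Hadamard-product mask in (1); each
    element of v stands in for a whole neuron or filter.
    """
    assert set(np.unique(m)).issubset({0, 1})  # m must be a binary mask
    return m * v

v = np.array([0.5, -1.2, 0.03, 2.1])  # hypothetical element weights
m = np.array([1, 1, 0, 1])            # prune the third element
print(apply_pruning_mask(v, m))       # third entry becomes 0.0
```

Downstream weights that consume a pruned element's output can be zeroed the same way, since the element now emits a constant 0.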
The above optimization problem can be efficiently solved by selecting the B most salient network elements in v_T. The construction of the saliency function has been discussed in many existing works: some approaches derive the saliency function from first-order (LeCun et al., 1989; Molchanov et al., 2017) or second-order (Hassibi & Stork, 1992; Wang et al., 2020) Taylor series approximations of L. Other common saliency functions include the L1 (Li et al., 2017) or L2 (Wen et al., 2016) norm of the network element weights, as well as mean activation (Polyak & Wolf, 2015). In this work, we use a first-order Taylor series approximation saliency function defined for neurons and convolutional filters (Molchanov et al., 2017); however, our approach remains flexible to an arbitrary choice of saliency function on a plug-and-play basis.
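The exact greedy solution of (2) — keep the B most salient elements — can be sketched as follows (the saliency values here are hypothetical inputs, standing in for any of the saliency functions cited above):

```python
import numpy as np

def greedy_prune(saliency, B):
    """Solve (2) exactly: keep the B most salient elements.

    saliency: length-M array of s(a; ...) values measured after training.
    Returns a binary mask m with ||m||_0 <= B.
    """
    m = np.zeros_like(saliency, dtype=int)
    keep = np.argsort(-saliency)[:B]  # indices of the B largest saliencies
    m[keep] = 1
    return m

s = np.array([0.9, 0.1, 0.5, 0.7, 0.05])  # hypothetical saliencies
print(greedy_prune(s, B=3))               # keeps indices 0, 3, and 2
```

Because (2) is separable across elements under the cardinality constraint, this top-B selection is optimal, unlike the general problem (1).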

3.1. PROBLEM STATEMENT

As mentioned above, existing saliency-based pruning is typically performed after training convergence (i.e., (2)) to speed up testing-time evaluation, which wastes considerable time on training network elements that will eventually be pruned. To resolve this issue, we extend the pruning problem definition (2) along the temporal dimension, allowing network elements to be pruned during the training process consisting of T iterations of SGD. Let s^a_t ≜ s(a; X, Y, N_{v_t}, L) be a random variable denoting the saliency of network element v^a_t after t iterations of SGD, s_t ≜ [s^a_t]_{a=1,...,M} for t = 1, ..., T, and s_{τ1:τ2} ≜ [s_t]_{t=τ1,...,τ2} be a vector of the saliency of all network elements between iterations τ1 and τ2. Our early pruning algorithm is designed to maximize the saliency of the unpruned elements after iteration T while allowing pruning at each iteration t given some computational budget B_{t,c} for t = 1, ..., T:

    ρ_T(m_{T−1}, B_{T,c}, B_s) ≜ max_{m_T} m_T • s_T        (3a)
        s.t.  ||m_T||_0 ≤ B_s        (3b)
              m_T ≤ m_{T−1}        (3c)
              B_{T,c} ≥ 0        (3d)

    ρ_t(m_{t−1}, B_{t,c}, B_s) ≜ max_{m_t} E_{p(s_{t+1} | ŝ_{1:t})}[ρ_{t+1}(m_t, B_{t,c} − ||m_t||_0, B_s)]        (4a)
        s.t.  m_t ≤ m_{t−1}        (4b)

where B_s is the trained network sparsity budget, ŝ_{1:t} is a vector of observed values for s_{1:t}, m_0 is an M-dimensional vector of 1's, and m_t ≤ m_{t−1} represents an element-wise comparison between m_t and m_{t−1}: m^a_t ≤ m^a_{t−1} for a = 1, ..., M. At each iteration t, the saliency ŝ_t is observed, and m_t ∈ {0,1}^M in ρ_t represents a pruning decision performed to maximize the expectation of ρ_{t+1} conditioned upon the saliency measurements ŝ_{1:t} collected up to and including iteration t. This recursive structure terminates with the base case ρ_T, where the saliency of the unpruned elements is maximized after T iterations of training.
In the above early pruning formulation, constraints (3c) and (4b) ensure pruning is performed in a practical manner whereby once a network element is pruned, it can no longer be recovered in a later training iteration. We define a trained network sparsity budget B_s (3b), which may differ significantly from the initial network size ||m_0||_0 (e.g., when the network is trained on GPUs but deployed on resource-constrained edge or mobile devices). We also constrain a total computational effort budget B_{t,c}, which is reduced per training iteration t by the number of unpruned network elements ||m_t||_0. We constrain B_{T,c} ≥ 0 (3d) to ensure training completion within the specified computational budget. Here we assume that a sparser pruning mask m_t corresponds to lower computational effort during training iteration t due to updating fewer network elements. Finally, (3a) maximizes the saliency with a pruning mask m_T constrained by the sparsity budget B_s (3b). Our early pruning formulation balances the saliency of network elements after convergence against the total computational effort to train such a network (i.e., m_T • s_T vs. Σ_{t=1}^T ||m_t||_0). This appropriately captures the balancing act of training-time early pruning, whereby computational effort is saved by early pruning network elements while preserving the saliency of the remaining network elements after convergence.

3.2. MODELING THE SALIENCY WITH MULTI-OUTPUT GAUSSIAN PROCESS

To solve the above early pruning problem, we need to model the belief p(s_{1:T}) of the saliency in order to compute the predictive belief p(s_{t+1:T} | s_{1:t}) of the future saliency in (4a). At first glance, one may consider decomposing the belief as p(s_{1:T}) ≜ Π_{a=1}^M p(s^a_{1:T}) and modeling the saliency s^a_{1:T} ≜ [s^a_t]_{t=1,...,T} of each network element independently. Such independent models, however, ignore the co-adaptation and co-evolution of network elements, which have been shown to be a common occurrence in DNNs (Hinton et al., 2012; Srivastava et al., 2014; Wang et al., 2020). Moreover, modeling the correlations between the saliency of different network elements explicitly is non-trivial, since considerable feature engineering is needed to represent diverse network elements such as neurons, connections, or convolutional filters. To resolve these issues, we use a multi-output Gaussian process (MOGP) to jointly model the belief p(s_{1:T}) of all saliency measurements. Specifically, we assume that the saliency s^a_t of the a-th network element at iteration t is a linear mixture of Q independent latent functions {u_q(t)}_{q=1}^Q: s^a_t ≜ Σ_{q=1}^Q γ^a_q u_q(t). As shown in (Álvarez & Lawrence, 2011), if each u_q(t) is an independent GP with zero prior mean and covariance k_q(t, t'), then the resulting distribution p(s_{1:T}) is a multivariate Gaussian with zero prior mean and covariance determined by the mixing weights: cov[s^a_t, s^{a'}_{t'}] = Σ_{q=1}^Q γ^a_q γ^{a'}_q k_q(t, t'). This explicit covariance between s^a_t and s^{a'}_{t'} helps exploit the co-evolution and co-adaptation of network elements within the neural network. To capture the horizontal asymptote trend of s^a_1, . . .
, s^a_T as visualized in Appendix A.2, we turn to a kernel used for modeling decaying exponential curves known as the "exponential kernel" (Swersky et al., 2014) and set k_q(t, t') ≜ β_q^{α_q} / (t + t' + β_q)^{α_q}, where α_q and β_q are hyperparameters of the MOGP and can be learned via maximum likelihood estimation (Álvarez & Lawrence, 2011). Then, given a vector of observed saliency ŝ_{1:t}, the MOGP regression model provides a Gaussian predictive distribution for any future saliency s_{t'}; the predictive mean µ_{T|1:t} and covariance are given in Appendix A.3.

3.3. BAYESIAN EARLY PRUNING

To make the recursion (4a) tractable, we approximate it by committing at iteration t to a pruning mask m_t that is held fixed until iteration T:

    ρ̃_t(m_{t−1}, B_{t,c}, B_s) ≜ max_{m_t} E_{p(s_T | ŝ_{1:t})}[ρ_T(m_t, B_{t,c} − (T − t)||m_t||_0, B_s)]        (5)

This approach allows us to lift (3d) from (3), to which we add a Lagrange multiplier and obtain:

    ρ̃_t(m_{t−1}, B_{t,c}, B_s) ≜ max_{m_t} E_{p(s_T | ŝ_{1:t})}[ρ̃_T(m_t, B_s)] + λ(B_{t,c} − (T − t)||m_t||_0)        (6)

for t = 1, ..., T − 1, where ρ̃_T is defined as ρ_T without constraint (3d). Consequently, such a ρ̃_T can be solved in a greedy manner as in (2). Hereafter, we omit B_{t,c} as a parameter of ρ̃_T as it no longer constrains the solution of ρ̃_T. Note that the presence of an additive penalty in a maximization problem is due to the constraint B_{T,c} ≥ 0 ⇔ −B_{T,c} ≤ 0, as typically expected prior to Lagrangian reformulation. The above optimization problem remains NP-hard as E_{p(s_T | ŝ_{1:t})}[ρ̃_T(m_t, B_s)] is submodular in m_t (see Appendix B). Although greedy approximations exist for submodular optimization, their running time of O(||m_{t−1}||_0^2) remains far too slow due to the large number of network elements in DNNs. Fortunately, the problem can be significantly simplified by exploiting the following lemma (its proof is in Appendix C):

Lemma 1. Let e^{(i)} be an M-dimensional one-hot vector with the i-th element equal to 1, and consider any 1 ≤ a, b ≤ M and m ∈ {0,1}^M s.t. m ∧ (e^{(a)} ∨ e^{(b)}) = 0.
Given a vector of observed saliency ŝ_{1:t}, if µ^a_{T|1:t} ≥ µ^b_{T|1:t} and µ^a_{T|1:t} ≥ 0, then

    E_{p(s_T | ŝ_{1:t})}[ρ̃_T(m ∨ e^{(b)}, B_s)] − E_{p(s_T | ŝ_{1:t})}[ρ̃_T(m ∨ e^{(a)}, B_s)] ≤ µ^b_{T|1:t} Φ(ν/θ) + θ φ(ν/θ)

where θ ≜ (σ^{aa}_{T|1:t} + σ^{bb}_{T|1:t} − 2σ^{ab}_{T|1:t})^{1/2}, ν ≜ µ^b_{T|1:t} − µ^a_{T|1:t}, and Φ and φ are the standard normal CDF and PDF, respectively.

Here, '∨' and '∧' represent bitwise OR and AND operations, respectively; the bitwise OR denotes the inclusion of e^{(a)} or e^{(b)} in m_t. Due to the strong tail decay of φ and Φ, Lemma 1 indicates that at most a marginal improvement is possible by opting for m_t = m ∨ e^{(b)} as opposed to m_t = m ∨ e^{(a)} given µ^a_{T|1:t} ≥ µ^b_{T|1:t}. Lemma 1 admits the following approach to optimizing ρ̃_t: starting with m_t = 0_M, we consider the inclusion of network elements in m_t in descending order of {µ^a_{T|1:t}}_{a=1}^M, which can be computed analytically using the MOGP. The network element denoted by e^{(a)} is included in m_t if it improves the objective in (5). The algorithm terminates once the highest not-yet-included element fails to improve the objective function, as a consequence of the penalty term outweighing the improvement in E_{p(s_T | ŝ_{1:t})}[ρ̃_T]. The remaining excluded elements are then pruned. Following the algorithm sketch above, we define the utility of network element v^a_t with respect to a candidate pruning mask m_t ≤ m_{t−1}, which measures the improvement in E_{p(s_T | ŝ_{1:t})}[ρ̃_T] as a consequence of including e^{(a)} in m_t:

    ∆(a, m_t, ŝ_{1:t}, B_s) ≜ E_{p(s_T | ŝ_{1:t})}[ρ̃_T(e^{(a)} ∨ m_t, B_s) − ρ̃_T(m_t, B_s)].

We can now take a Lagrangian approach to pruning decisions during iteration t by balancing the utility of network element v^a_t against the change of the penalty (i.e., λ(T − t)) in Algorithm 1. Due to the relatively expensive cost of performing early pruning, we early prune every T_step iterations of SGD. Typically, T_step was chosen to correspond to 10-20 epochs of training.
To compute ∆(·), we sample from p(s_T | ŝ_{1:t}) and use a greedy selection algorithm per sample as in (2). In our implementation, we also enforce an additional hard constraint ||m_t||_0 ≥ B_s, which we believe is desirable for practical reasons. We used a fixed value of B_{1,c} = ||m_0||_0 T_0 + B_s(T − T_0) in all our experiments.

Algorithm 1 Bayesian Early Pruning
Require: DNN N, v_1, T_0, T_step, T, B_{1,c}, B_s, Lagrangian penalty λ
 1: ŝ_{1:T_0} ← train(N_{v_1}, T_0)                      // Train for T_0 iterations to create seed dataset.
 2: B_{T_0,c} ← B_{1,c} − T_0 · dim(v_1)                 // Track computational effort expenditure.
 3: for k ← 0, ..., (T − T_0)/T_step; t ← T_0 + k·T_step do   // Early prune every T_step iterations from T_0.
 4:     µ_{T|1:t}, σ_{T|1:t} ← MOGP(ŝ_{1:t})             // Train and perform inference.
 5:     ŝ_T ← argsort(−µ_{T|1:t})                        // Sort descending.
 6:     m_t ← 0_{dim(v_t)}                               // Initial pruning mask.
 7:     for a ← ŝ_T^1, ..., ŝ_T^{dim(v_t)} do            // Consider each network element.
 8:         if B_{t,c} − (T − t)||m_t||_0 > 0 then       // Remaining B_{t,c} budget can support training v^a_t.
 9:             m_t ← m_t ∨ e^{(a)}
10:         else if ∆(a, m_t, ŝ_{1:t}, B_s) ≥ λ(T − t) then   // Balance utility against change of penalty.
11:             m_t ← m_t ∨ e^{(a)}
12:         else
13:             break
14:     v_t ← prune(v_t, m_t)
15:     B_{t+T_step,c} ← B_{t,c} − T_step · dim(v_t)     // Track computational effort expenditure.
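The inner decision loop of Algorithm 1 (lines 6–13) can be sketched in simplified form. This is an illustrative reconstruction, not the paper's implementation: the MOGP posterior means and the utility estimate ∆ are supplied as inputs (`mu_T` and `utility` are our stand-in names), and the hard constraint ||m_t||_0 ≥ B_s is omitted for brevity:

```python
import numpy as np

def bep_prune_step(mu_T, utility, t, T, B_tc, lam):
    """One early-pruning decision (simplified sketch of Algorithm 1).

    mu_T:    posterior mean of final-iteration saliency per element
             (produced by the MOGP in the paper; an input here).
    utility: callable a -> estimate of Delta(a, m_t, ...), the expected
             gain in E[rho_T] from keeping element a.
    Returns the pruning mask m_t.
    """
    m = np.zeros(len(mu_T), dtype=int)
    for a in np.argsort(-mu_T):              # descending posterior mean
        if B_tc - (T - t) * m.sum() > 0:     # budget still supports training a
            m[a] = 1
        elif utility(a) >= lam * (T - t):    # utility outweighs penalty growth
            m[a] = 1
        else:
            break                            # later elements have lower means
    return m

mu = np.array([3.0, 1.0, 2.0, 0.2])          # hypothetical posterior means
mask = bep_prune_step(mu, utility=lambda a: 0.0, t=8, T=10, B_tc=5, lam=1e-4)
print(mask)  # budget supports three elements; the fourth is pruned
```

The early break is justified by Lemma 1: once an element with the highest remaining posterior mean fails the utility test, elements with lower means can offer at most marginal improvement.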

4. EXPERIMENTS AND DISCUSSION

We evaluate our modeling approach as well as our BEP algorithm on the CIFAR-10, CIFAR-100 (Krizhevsky, 2009), and ImageNet (Deng et al., 2009) datasets. For CIFAR-10/CIFAR-100, we used a benchmark convolutional neural network (CNN) with 4 convolutional layers and 1 dense layer. For ImageNet, we evaluated on the ResNet-50 architecture (He et al., 2016a). Due to the cubic time complexity of MOGPs, we used a variational approximation (Hensman et al., 2015). In all of our models, we used 60 variational inducing points per latent function. We used the GPflow library (Matthews et al., 2017) to build our models.

4.1. MODELING EVALUATION

A key assertion in our approach is the importance of capturing co-adaptation and co-evolution effects among network elements. To verify that our MOGP approach captures these effects, we compare MOGP vs. GP belief modeling, where the GP assumes independence of saliency measurements across network elements (i.e., p(s_{1:T}) ≜ Π_{a=1}^M p(s^a_{1:T})). A dataset of saliency measurements of convolutional filters and neurons was constructed by instrumenting the training process of our 5-layer CNN on the CIFAR-10/CIFAR-100 datasets. Keras (Chollet, 2015) was used to train this model over 150 epochs. We trained belief models with small (t = [0, 26] epochs), medium (t = [0, 40] epochs), and large (t = [0, 75] epochs) training datasets of saliency measurements. For GPs, a separate model was trained per network element (convolutional filter or neuron). For MOGP, all network elements in a single layer shared one MOGP model with Q ∈ {4, 8, 18, 32} latent functions. We evaluated these models using the log likelihood of the remainder of the saliency measurements, and present the results in Table 1 for CIFAR-100. [Table 1 flattened in extraction: held-out predictive likelihood scores for GP vs. Q-MOGP (Q = 4, 8, 18, 32) per layer and training-set size; the GP baseline is worse by orders of magnitude in several settings, and increasing Q improves MOGP with diminishing returns.] Our MOGP approach better captures the saliency of network elements than the GP approach. Furthermore, using additional latent functions improves MOGP modeling with diminishing returns. We visualize the qualitative differences between GP and MOGP prediction in Figure 1. We observe that MOGP is able to capture the long-term trend of saliency curves with significantly less data than GP.

4.2. SMALL-SCALE EXPERIMENTS

We applied the early pruning algorithm to the aforementioned architecture and training regimen. We first investigated the behavior of the penalty parameter λ and observed that it was difficult to tune properly, being either too aggressive or too passive at pruning. To rectify this issue, we used a feedback loop to determine the penalty λ_t at iteration t dynamically. Dynamic penalty scaling uses feedback from earlier pruning iterations to increase or decrease the iteration penalty at time t:

    λ_t = λ · (1/λ)^{e(t)},   where   e(t) ≜ (T − t)||m_t||_0 / B_{t,c} − 1.

The dynamic penalty is increased if the anticipated compute required to complete training, (T − t)||m_t||_0, begins to exceed the remaining compute budget B_{t,c}. In that case, a higher penalty is needed to satisfy the computational budget constraint as per (6). We compare dynamic penalty scaling against a fixed penalty in Fig. 2 using T_0 = 20 epochs and T_step = 10 epochs for the first convolutional layer of our CNN. Going forward, we use dynamic penalty scaling in our experiments.

[Table 2 flattened in extraction: test accuracies with standard errors for SNIP, GraSP, and BEP with λ ∈ {1e-2, 1e-4, 1e-7} across increasing sparsity levels on CIFAR-10/CIFAR-100; one baseline row is truncated. The BEP variants obtain the highest accuracies in nearly all settings.]

We compare our work with SNIP (Lee et al., 2019), GraSP (Wang et al., 2020), and momentum-based dynamic sparse reparameterization (DSR) (Dettmers & Zettlemoyer, 2019). To compare against DSR, which is a prune-and-regrow method, we instantiate a smaller network of the size BEP yields after training has completed.
The SNIP and GraSP approaches are extended to neurons/filters by averaging the saliencies of the constituent weight parameters. We experimented with various degrees of sparsity, using BEP to prune a portion of the filters/neurons of each layer. We present the results in Table 2. Our approach better preserves performance at equivalent sparsity. A lower penalty yields higher-performing results, showing that λ serves well at balancing performance vs. computational budget. We also investigate the robustness of BEP and MOGP hyperparameters: we vary the number of MOGP variational inducing points, the number of MOGP latent functions, and T_step, and observe the performance of BEP 1e-4 on CIFAR-10/CIFAR-100 at 80%, 90%, and 95% sparsity. We present these results in Table 3. We observe that, in general, all hyperparameters are robust to changes; mild degradation is observed only at the extremal hyperparameter settings.

4.3. SPEEDING UP RESNET TRAINING ON IMAGENET

Our chief goal in this work is to speed up training of large-scale DNNs such as ResNet (He et al., 2016a;b) on the ImageNet dataset. Pruning ResNet requires a careful definition of network element saliency to allow pruning of all layers. ResNet contains long sequences of residual units with matching numbers of input/output channels. The inputs of residual units are connected via shortcut connections (i.e., through addition) to the outputs of the residual units. Due to these shortcut connections, this structure requires that within a sequence of residual units, the numbers of input/output channels of all residual units match exactly. This necessitates group pruning of residual unit channels for a sequence of residual units, where group pruning an output channel of a residual unit sequence requires pruning it from the inputs/outputs of all residual units within the sequence. (In our observations, saliency measurements do not capture network element efficacy well when comparing across layers; thus, pruning whole networks using network element saliency yields poorly performing networks with bottlenecks. This limitation of saliency functions is well known; see Molchanov et al. (2017).)

We trained ResNet-50 with BEP as well as SNIP and GraSP. We group pruned less aggressively since residual unit channels feed into a large number of residual units, making aggressive pruning likely to degrade performance. We ran BEP iterations at t = [15, 20, 25, 35, 45, 55, 75] epochs. We trained for 100 epochs on 4× Nvidia GeForce GTX 1080 Ti GPUs. More experimental details are found in Appendix G.2. We present our results in Table 4. We achieve higher performance than related techniques, albeit at longer wall time. Our approach captures the training time vs. performance tradeoff present in DNNs, unlike competing approaches.
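The group-pruning requirement can be sketched as follows. This is an illustrative sketch under our own assumptions: the per-unit channel saliencies are hypothetical inputs, and summing them per channel is one plausible aggregation rule; the paper requires only that a channel be pruned jointly from every unit in the sequence:

```python
import numpy as np

def group_channel_saliency(unit_saliencies):
    """Aggregate per-unit channel saliencies for a residual-unit sequence.

    unit_saliencies: list of length-C arrays, one per residual unit in the
    sequence (all units share C input/output channels because of the
    shortcut additions). Returns one saliency value per shared channel.
    """
    return np.sum(np.stack(unit_saliencies), axis=0)

# Hypothetical sequence of 3 residual units sharing 4 channels:
s_units = [np.array([0.2, 0.9, 0.1, 0.4]),
           np.array([0.3, 0.8, 0.2, 0.1]),
           np.array([0.1, 0.7, 0.1, 0.2])]
group_s = group_channel_saliency(s_units)
# Pruning the channel with the lowest group saliency removes it from the
# inputs/outputs of all three units simultaneously.
prune_channel = int(np.argmin(group_s))
print(prune_channel)
```

Treating the whole channel group as one network element keeps the input/output channel counts of every unit in the sequence consistent after pruning.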

5. RELATED WORK

Pruning and related techniques. Initial works in DNN pruning center around saliency-based pruning after training, including Skeletonization (Mozer & Smolensky, 1988), Optimal Brain Damage and follow-up work (Hassibi & Stork, 1992; LeCun et al., 1989), as well as sensitivity-based pruning (Karnin, 1990). In recent years, saliency functions have been adapted to pruning neurons or convolutional filters. Knowledge distillation (Hinton et al., 2015; Lu et al., 2017; Tung & Mori, 2019; Yim et al., 2017) aims to transfer the capabilities of a trained network into a smaller network. Weight sharing (Nowlan & Hinton, 1992; Ullrich et al., 2017) and low-rank matrix factorization (Denton et al., 2014; Jaderberg et al., 2014) aim to compress the parameterization of neural networks. Network quantization (Courbariaux et al., 2015; Hubara et al., 2017; Micikevicius et al., 2018) provides coarse granularity in trading off computational effort vs. performance, though current GPUs only extend native support to 16-bit floating point operations. Furthermore, our approach is orthogonal to quantization, allowing the techniques to be combined for further speedup.

Initialization-time or training-time pruning. Frankle & Carbin (2019) show that a randomly initialized DNN contains a small subnetwork which, if trained by itself, yields performance equivalent to the original network. SNIP (Lee et al., 2019) and GraSP (Wang et al., 2020) propose pruning connection weights prior to the training process through first-order and second-order saliency functions, respectively. Sparse Evolutionary Training (Mocanu et al., 2018) proposes initializing networks with a sparse topology prior to training. Narang et al. (2017) consider connection weight pruning during training for recurrent neural networks using a heuristic approach. Dynamic sparse reparameterization considers pruning and regrowing parameter weights during the training process (Bellec et al., 2018; Dettmers & Zettlemoyer, 2019; Mostafa & Wang, 2019).
Dai et al. (2019) propose a grow-and-prune approach to learning network architecture and connection layout. We differ from existing work in that our focus is on speeding up neural network training, whereas other works in training-time pruning aim to achieve sparse network layouts. To the best of our knowledge, except for the small speedups presented in (Dettmers & Zettlemoyer, 2019), the above works do not demonstrate training-time speedup using popular deep learning libraries run on modern GPUs. PruneTrain (Lym et al., 2019) also proposes pruning filters during training with periodic pruning iterations to achieve speedup while minimizing performance degradation. In contrast to our approach, PruneTrain does not allow specification of the desired network size after training. A specified network size may be useful when training for resource-constrained devices such as mobile phones or edge devices. We compare with PruneTrain under the early pruning problem definition in Appendix E.

6. CONCLUSION

This paper presents a novel, efficient algorithm to prune DNN elements such as neurons or convolutional filters during the training process. To achieve early pruning before training converges while preserving the performance of the DNN upon convergence, a Bayesian model (i.e., MOGP) is used to predict the saliency of DNN elements in future (unseen) training iterations by exploiting the exponentially decaying behavior of the saliency and the correlations between the saliency of different network elements. We then exploit a property (Lemma 1) of the objective function and propose an efficient Bayesian early pruning algorithm. Empirical evaluations on benchmark datasets show that our algorithm compares favorably to related works for pruning convolutional filters and neurons. Our approach remains flexible to changes in the saliency function and appropriately balances the training time vs. performance tradeoff in training DNNs. We are able to train an early pruned ResNet-50 model achieving a 48.6% speedup (37h vs. 55h) while maintaining a validation accuracy of 72.5%.

A.1 SALIENCY FUNCTION FOR NEURONS AND CONVOLUTIONAL FILTERS

In this work, we use the first-order Taylor-series saliency function proposed by Molchanov et al. (2017). Our design (Section 3) remains flexible to the use of arbitrary saliency functions on a plug-and-play basis. We partition a DNN of L layers, where layer ℓ contains C_ℓ convolutional filters, into a sequence of convolutional filters [z_{ℓ,c}]_{c=1,...,C_ℓ; ℓ=1,...,L}. Each filter z_{ℓ,c} : R^{C_{ℓ−1} × W_{ℓ−1} × H_{ℓ−1}} → R^{W_ℓ × H_ℓ} can be considered one network element in v_T, with z_{ℓ,c}(P_{ℓ−1}) ≜ R(W_{ℓ,c} * P_{ℓ−1} + b_{ℓ,c}), where W_{ℓ,c} ∈ R^{C_{ℓ−1} × O_ℓ × O_ℓ} and b_{ℓ,c} are the kernel weights and bias with receptive field O_ℓ × O_ℓ, '*' represents the convolution operation, R is the activation function, P_{ℓ−1} represents the output of z_{ℓ−1} ≜ [z_{ℓ−1,c'}]_{c'=1,...,C_{ℓ−1}} with P_0 corresponding to an input x_d ∈ X, and W_ℓ, H_ℓ are the width and height dimensions of layer ℓ for ℓ = 1, ..., L. Let N_{z_ℓ : z_{ℓ'}} ≜ z_{ℓ'} ∘ · · ·
∘ z_ℓ denote a partial neural network of layers [ℓ, ..., ℓ'] for 1 ≤ ℓ ≤ ℓ' ≤ L. The Taylor-series saliency function on the convolutional filter z_{ℓ,c}, denoted s([ℓ, c]), is defined as:

    s([ℓ, c]) ≜ (1/D) Σ_{d=1}^D | (1/(W_ℓ × H_ℓ)) Σ_{j=1}^{W_ℓ × H_ℓ} (∂L(P^{(x_d)}_ℓ, y_d; N_{z_{ℓ+1} : z_L}) / ∂P^{(x_d)}_{ℓ,c,j}) · P^{(x_d)}_{ℓ,c,j} |

where P^{(x_d)}_ℓ is the output of the partial neural network N_{z_1 : z_ℓ} with x_d as the input, and [P^{(x_d)}_{ℓ,c,j}]_{j=1,...,W_ℓ × H_ℓ} is the output of the c-th filter in vectorized form. This function uses the first-order Taylor-series approximation of L to approximate the change in loss if z_{ℓ,c} were replaced by a constant 0 function. Under this saliency definition, pruning filter z_{ℓ,c} corresponds to collectively zeroing W_{ℓ,c} and b_{ℓ,c}, as well as the weight parameters [W_{ℓ+1,c',{:,:,c}}]_{c'=1,...,C_{ℓ+1}} of z_{ℓ+1} which utilize the output of z_{ℓ,c}. This definition can be extended to elements (e.g., neurons) which output scalars by setting W_ℓ = H_ℓ = 1.
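The saliency above can be sketched numerically given a filter's activations and the loss gradients at those activations. This is an illustrative sketch with our own names; we place the absolute value around the spatial average as in the common implementation of Molchanov et al. (2017):

```python
import numpy as np

def taylor_saliency(activations, gradients):
    """First-order Taylor saliency for one filter (sketch).

    activations: (D, W, H) filter outputs P over D examples.
    gradients:   (D, W, H) gradients dL/dP at those outputs.
    Approximates the change in loss if the filter's output were
    replaced by the constant 0 function.
    """
    D = activations.shape[0]
    # Spatial average of gradient * activation per example, then |.|:
    per_example = np.abs((gradients * activations).reshape(D, -1).mean(axis=1))
    return per_example.mean()  # average over the D examples

acts = np.ones((4, 2, 2))            # hypothetical filter outputs P
grads = -0.25 * np.ones((4, 2, 2))   # hypothetical dL/dP
print(taylor_saliency(acts, grads))  # prints 0.25
```

In practice the activations and gradients are captured during the ordinary forward/backward passes of training, so computing this saliency adds little overhead.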

A.2 ON THE CHOICE OF THE "EXPONENTIAL KERNEL"

We justify our choice of the exponential kernel as a modeling mechanism by presenting visualizations of saliency measurements collected during training and comparing them to samples drawn from the exponential kernel k_q(t, t') ≜ β^α / (t + t' + β)^α, as shown in Figs. 3 and 4. Both the saliency and the function samples exhibit exponentially decaying behavior, which makes the exponential kernel a strong fit for modeling saliency evolution over time. Furthermore, we note that the exponential kernel was used to great effect by Swersky et al. (2014) for modeling loss curves as a function of epochs. Loss curves also exhibit asymptotic behavior similar to saliency measurement curves, providing further evidence that the exponential kernel is an apt fit for our task.
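The kernel itself is a one-liner; the sketch below (hyperparameter values are hypothetical) shows how its prior variance k(t, t) shrinks as training progresses, matching the decay toward a horizontal asymptote:

```python
import numpy as np

def exponential_kernel(t1, t2, alpha, beta):
    """'Exponential' kernel of Swersky et al. (2014):
    k(t, t') = beta^alpha / (t + t' + beta)^alpha.
    Covariances decay as either iteration index grows, modeling
    curves that flatten toward a horizontal asymptote.
    """
    return beta**alpha / (t1 + t2 + beta)**alpha

# Covariance matrix over training iterations 1..5 (alpha, beta hypothetical):
ts = np.arange(1, 6, dtype=float)
K = exponential_kernel(ts[:, None], ts[None, :], alpha=1.0, beta=2.0)
# Prior variance shrinks with t, reflecting decreasing saliency variability:
assert K[0, 0] > K[4, 4]
```

In the MOGP, one such kernel is learned per latent function u_q, with α_q and β_q fit by maximum likelihood.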

A.3 PREDICTIVE DISTRIBUTION OF THE SALIENCY

Let the prior covariance matrix be K_{τ1:τ2} ≜ [cov[s^a_t, s^{a'}_{t'}]]_{a,a'=1,...,M; t,t'=τ1,...,τ2} for any 1 ≤ τ1 ≤ τ2 ≤ T. Given a vector of observed saliency ŝ_{1:t}, the MOGP regression model provides a Gaussian predictive distribution p(s_{t'} | ŝ_{1:t}) = N(µ_{t'|1:t}, K_{t'|1:t}) for any future saliency s_{t'} with the following posterior mean vector and covariance matrix:

    µ_{t'|1:t} ≜ K_{[t' t]} K_{1:t}^{-1} ŝ_{1:t},    K_{t'|1:t} ≜ K_{t':t'} − K_{[t' t]} K_{1:t}^{-1} K_{[t' t]}^T

where K_{[t' t]} ≜ [cov[s^a_{t'}, s^{a'}_τ]]_{a,a'=1,...,M; τ=1,...,t}.
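The posterior formulas above can be implemented directly with a Cholesky factorization; the following is an illustrative sketch (the function name and the jitter term are ours, not the paper's):

```python
import numpy as np

def gp_posterior(K_obs, K_cross, K_pred, s_obs, jitter=1e-8):
    """Gaussian posterior over future saliency (sketch of Appendix A.3).

    K_obs:   (n, n) prior covariance among observed saliencies s_{1:t}.
    K_cross: (p, n) prior covariance between future and observed points.
    K_pred:  (p, p) prior covariance among the future points.
    Returns mu = K_cross K_obs^{-1} s_obs and
            C  = K_pred - K_cross K_obs^{-1} K_cross^T.
    """
    n = len(K_obs)
    L = np.linalg.cholesky(K_obs + jitter * np.eye(n))  # stable inverse
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, s_obs))
    mu = K_cross @ alpha
    V = np.linalg.solve(L, K_cross.T)
    cov = K_pred - V.T @ V
    return mu, cov

# Tiny example with one observed and one future point (values hypothetical):
mu, cov = gp_posterior(np.array([[1.0]]), np.array([[0.5]]),
                       np.array([[1.0]]), np.array([2.0]))
print(mu, cov)  # mean pulled toward the observation, variance reduced
```

With the mixture covariance of Section 3.2 plugged into these matrices, the same computation yields µ_{T|1:t} and K_{T|1:t} jointly over all elements of a layer.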

B SUBMODULARITY OF E[ρ T ]

In (6), the problem of choosing m from {0,1}^M can be considered as selecting a subset A of indices from {1, ..., M} such that m^a_t = 1 for a ∈ A, and m^a_t = 0 otherwise. Therefore, P(m) ≜ E_{p(s_T | ŝ_{1:t})}[ρ̃_T(m, B_s)] can be considered a set function, which we will show to be submodular. For notational consistency, we continue to write P(m) instead of representing it as a function of the index subset A.

Lemma 2 (Submodularity). Let m', m'' ∈ {0,1}^M, and let e^{(a)} be an arbitrary M-dimensional one-hot vector with 1 ≤ a ≤ M. We have

    P(m' ∨ e^{(a)}) − P(m') ≥ P(m'' ∨ e^{(a)}) − P(m'')

for any m' ≤ m'', m' ∧ e^{(a)} = 0, and m'' ∧ e^{(a)} = 0.

Proof. According to (3),

    E_{p(s_T | ŝ_{1:t})}[ρ̃_T(m, B_s)] = E_{p(s_T | ŝ_{1:t})}[max_{m_T} m_T • s_T   s.t.   ||m_T||_0 ≤ B_s, m_T ≤ m].

Let α(m) ≜ argmax_{m_T} m_T • s_T s.t. ||m_T||_0 ≤ B_s, m_T ≤ m return the optimized mask m_T given any m, and let Λ_m ≜ min(α(m) ⊙ s_T) be the minimal saliency of the network elements selected at iteration T for P(m). Then we have

    P(m ∨ e^{(a)}) = E_{p(s_T | ŝ_{1:t})}[ρ̃_T(m ∨ e^{(a)}, B_s)] = E_{p(s_T | ŝ_{1:t})}[ρ̃_T(m, B_s) − Λ_m + max(s^a_T, Λ_m)].

The second equality is due to the fact that the network element v^a_T would only replace the lowest included element in m_T in order to maximize the objective. Then,

    P(m ∨ e^{(a)}) − P(m) = E_{p(s_T | ŝ_{1:t})}[−Λ_m + max(s^a_T, Λ_m)] = E_{p(s_T | ŝ_{1:t})}[max(s^a_T − Λ_m, 0)].

Given m' ≤ m'', we have Λ_{m'} ≤ Λ_{m''} since m_T ≤ m' in α(m') is a tighter constraint than m_T ≤ m'' in α(m''). Consequently, max(s^a_T − Λ_{m'}, 0) ≥ max(s^a_T − Λ_{m''}, 0), and thus P(m' ∨ e^{(a)}) − P(m') ≥ P(m'' ∨ e^{(a)}) − P(m'').

C PROOF OF LEMMA 1

We restate Lemma 1 for clarity. Lemma 1. Let e^(i) be an M-dimensional one-hot vector with the i-th element equal to 1. To prove this lemma, we first prove the following:

Lemma 3. E_{p(s_T|s_{1:t})}[ρ_T(m ∨ e^(b))] − E_{p(s_T|s_{1:t})}[ρ_T(m ∨ e^(a))] ≤ E[max(s^b_T − s^a_T, 0)].

Proof. Due to (9), we have

E_{p(s_T|s_{1:t})}[ρ_T(m ∨ e^(b))] − E_{p(s_T|s_{1:t})}[ρ_T(m ∨ e^(a))]
= P(m ∨ e^(b)) − P(m) − (P(m ∨ e^(a)) − P(m))
= E_{p(s_T|s_{1:t})}[max(s^b_T − Λ_m, 0)] − E_{p(s_T|s_{1:t})}[max(s^a_T − Λ_m, 0)]
= E_{p(s_T|s_{1:t})}[max(s^b_T − Λ_m, 0) − max(s^a_T − Λ_m, 0)]    (10)
= E_{p(s_T|s_{1:t})}[max(s^b_T − s^a_T, Λ_m − s^a_T) − max(0, Λ_m − s^a_T)]    (11)
≤ E_{p(s_T|s_{1:t})}[max(s^b_T − s^a_T, 0)]    (12)

Equality (11) is obtained by adding Λ_m − s^a_T to each term of the two max functions in (10). Inequality (12) can be proved by considering the following two cases. If Λ_m − s^a_T ≥ 0, then

max(s^b_T − s^a_T, Λ_m − s^a_T) − max(0, Λ_m − s^a_T) = max(s^b_T − s^a_T, Λ_m − s^a_T) − (Λ_m − s^a_T) = max(s^b_T − s^a_T − (Λ_m − s^a_T), 0) ≤ max(s^b_T − s^a_T, 0) .

If Λ_m − s^a_T < 0, then

max(s^b_T − s^a_T, Λ_m − s^a_T) − max(0, Λ_m − s^a_T) = max(s^b_T − s^a_T, Λ_m − s^a_T) ≤ max(s^b_T − s^a_T, 0) .

Next, we utilize a well-known bound on the maximum of two Gaussian random variables (Nadarajah & Kotz, 2008), which we restate:

Lemma 4. Let s^a, s^b be Gaussian random variables with means µ^a, µ^b and standard deviations σ^a, σ^b. Then

E[max(s^a, s^b)] ≤ µ^a Φ((µ^a − µ^b)/θ) + µ^b Φ((µ^b − µ^a)/θ) + θ φ((µ^b − µ^a)/θ)

where θ ≜ sqrt([σ^b]^2 + [σ^a]^2 − 2 cov(s^b, s^a)) and Φ, φ are the standard normal CDF and PDF, respectively.
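Lemma 4 can be sanity-checked numerically. A quick Monte Carlo sketch for two independent Gaussians (cov = 0; the means and standard deviations below are arbitrary test values, not from the paper). For jointly Gaussian variables the right-hand side is in fact the exact expectation, so the estimate should match it up to sampling noise:

```python
import numpy as np
from math import erf, exp, pi, sqrt

def max_gaussian_bound(mu_a, mu_b, sig_a, sig_b, cov_ab=0.0):
    """Bound on E[max(s^a, s^b)] from Lemma 4 (Nadarajah & Kotz, 2008)."""
    theta = sqrt(sig_a ** 2 + sig_b ** 2 - 2.0 * cov_ab)
    Phi = lambda x: 0.5 * (1.0 + erf(x / sqrt(2.0)))    # standard normal CDF
    phi = lambda x: exp(-x * x / 2.0) / sqrt(2.0 * pi)  # standard normal PDF
    d = (mu_a - mu_b) / theta
    return mu_a * Phi(d) + mu_b * Phi(-d) + theta * phi(d)

# Monte Carlo estimate of E[max(s^a, s^b)] for independent Gaussian draws
rng = np.random.default_rng(0)
mc = np.maximum(rng.normal(1.0, 0.5, 200_000),
                rng.normal(0.2, 0.3, 200_000)).mean()
```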

D DYNAMIC PENALTY SCALING AS A FEEDBACK LOOP

We designed a feedback loop to automatically determine λ_t during early pruning. A proportional feedback loop can be defined as

λ_t ≜ λ + K_p × e(t)

where K_p ≥ 0 is a proportional constant which modulates λ_t according to a signed measure of error e(·) at time t. Note that λ_t ≥ λ when e(t) ≥ 0, and the opposite occurs when e(t) ≤ 0, which allows the error to serve as feedback for determining λ_t. Implicitly, λ_t asserts some control over e(t + 1), thus closing the feedback loop. Traditional PID approaches to determining K_p do not work in our case, as λ may vary over several orders of magnitude. Consequently, a natural choice for K_p is λ itself, which keeps K_p and λ at the same order of magnitude: λ_t = λ + λ × e(t) = λ(1 + e(t)). We make two adaptations of the above to our task. First, as λ is likely to be extremely small, we use exponentiation rather than multiplication. Second, as λ ≤ 1 in practice, we use 1 − e(t) as the exponent:

λ_t = λ^{1 − e(t)} = λ · (1/λ)^{e(t)} .

The derivation is completed by our definition of e(t):

e(t) ≜ (T − t)||m_t||_0 / B_{t,c} − 1 .

This measures the discrepancy between the anticipated compute required to complete training, (T − t)||m_t||_0, and the remaining budget B_{t,c}, with e(t) = 0 if the two are equal. This is a natural measure of feedback for λ, as we expect the two to be equal if λ is serving well to early prune the network.
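The update above fits in a few lines. A minimal sketch, with `mask` standing in for m_t and `budget_remaining` for B_{t,c} (the function name and argument layout are ours):

```python
import numpy as np

def dynamic_penalty(lam, t, T, mask, budget_remaining):
    """Dynamic penalty scaling: lambda_t = lambda^(1 - e(t)).

    e(t) compares the compute anticipated to finish training the currently
    unpruned network, (T - t) * ||m_t||_0, against the remaining budget
    B_{t,c}. Since lambda <= 1, lambda_t rises above lambda when over
    budget (e(t) > 0) and falls below it when under budget (e(t) < 0).
    """
    e = (T - t) * np.count_nonzero(mask) / budget_remaining - 1.0
    return lam ** (1.0 - e)
```

For example, with λ = 1e-4 and the anticipated compute exactly double the remaining budget (e(t) = 1), the penalty rises to λ^0 = 1, aggressively encouraging pruning.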

E COMPARISON WITH PRUNETRAIN

We compare with PruneTrain (Lym et al., 2019) in Table 5. PruneTrain uses an orthogonal technique of dynamically increasing the minibatch size to achieve further wall-time improvements, which prevents accurate wall-time comparisons between BEP and PruneTrain. To compare with PruneTrain, we instead match inference cost, computing the FLOPs of a convolutional layer as

FLOPs ≜ 2HW(C_in K^2 + 1)C_out

where H, W, C_in are the input height, width, and channels respectively, K is the convolutional kernel size, and C_out is the number of output channels of the layer. Under equivalent inference cost, BEP 1e-1 outperforms PruneTrain in Top-1 performance. We also find that BEP 1e-1 and BEP 1e-4 consume fewer training FLOPs than the baseline. It should be noted that PruneTrain does not provide a mechanism to constrain the trained network size; thus it is unclear how to utilize it to solve the early pruning problem (3), (4).
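The FLOPs formula is straightforward to evaluate. A small helper (the function name is our own):

```python
def conv_flops(h, w, c_in, k, c_out):
    """Inference FLOPs of a convolutional layer: 2HW(C_in K^2 + 1)C_out.

    The factor of 2 counts one multiply and one add per MAC, and the +1
    accounts for the bias term; h and w are the layer's input height
    and width.
    """
    return 2 * h * w * (c_in * k ** 2 + 1) * c_out
```

For instance, a 3×3 convolution over a 32×32 RGB input producing 64 channels costs conv_flops(32, 32, 3, 3, 64) = 3,670,016 FLOPs.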

F TABLE OF NOTATIONS

We list a table of notations used elsewhere in the paper in Table 6.

         Lyr 1                                   Lyr 2                                Lyr 3
GP       1.19(0.5)   1.08(0.06)  1.07(1.07)e5    0.96(0.04)  0.93(0.03)  2.47(0.04)   0.49(0.01)  0.48(0.01)  1.33(0.02)
4-MOGP   1.15(0.05)  0.89(0.06)  2.44(0.05)      0.91(0.02)  0.80(0.03)  2.20(0.03)   0.38(0.02)  0.39(0.02)  1.25(0.02)
8-MOGP   1.09(0.04)  0.86(0.05)  2.38(0.04)      0.84(0.03)  0.78(0.03)  2.16(0.03)   0.32(0.01)  0.35(0.02)  1.20(0.02)
18-MOGP  0.97(0.04)  0.80(0.05)  2.33(0.04)      0.89(0.03)  0.76(0.03)  2.13(0.03)   0.31(0.01)  0.35(0.02)  1.20(0.02)
32-MOGP  0.96(0.06)  0.81(0.06)  2.32(0.04)      0.79(0.03)  0.74(0.03)  2.13(0.03)   0.31(0.01)  0.34(0.02)  1.20(0.02)

G MORE EXPERIMENTAL RESULTS AND EXPERIMENTAL DETAILS

G.1 GP VS. MOGP LOG-LIKELIHOOD ON CIFAR-10 DATASET

Table 7 presents the results of the experiment in Section 4.1 for the CIFAR-10 dataset.

G.2 EXPERIMENTAL DETAILS

To train our CIFAR-10 and CIFAR-100 models, we used an Adam optimizer (Kingma & Ba, 2015) with an initial learning rate of 0.001, an exponential learning-rate decay of k = 0.985, and a batch size of 32. Training was paused three times per epoch, evenly spaced. During each pause, we collected saliency measurements using 40% of the training dataset. This instrumentation subset was randomly selected from the training dataset at initialization and remained constant throughout the training procedure. We preprocessed saliency evaluations into a standardized [0, 10] range. We used (8) to measure the saliency of neurons/convolutional filters. For the convolutional layers we used 12 latent MOGP functions; for the dense layer we used 4 latent MOGP functions.

For our ResNet-50 model, we used an SGD with momentum optimizer with an initial learning rate of 0.1. The learning rate was divided by ten at t = [30, 60, 80] epochs. We collected saliency data every 5 iterations of SGD and averaged them into buckets corresponding to 625 iterations of SGD to form our dataset. We used a minimum of 4 latent functions per MOGP; this was dynamically increased (up to a maximum of 15) if the model could not fit the data. We sampled 10K points from our MOGP model to estimate ∆(·) for CIFAR-10/CIFAR-100; for ResNet we sampled 15K points. We repeated experiments 5 times for reporting accuracy on CIFAR-10/CIFAR-100.

G.3 PRUNING ON RESNET

The ResNet architecture is composed of a sequence of residual units: Z_ℓ ≜ F(P_{ℓ−1}) + P_{ℓ−1}, where P_{ℓ−1} is the output of the previous residual unit Z_{ℓ−1} and '+' denotes elementwise addition. Internally, F is typically implemented as three stacked convolutional layers: F(P_{ℓ−1}) ≜ [z_3 ∘ z_2 ∘ z_1](P_{ℓ−1}), where z_1, z_2, z_3 are convolutional layers. Within this setting we consider convolutional filter pruning. Although z_1 and z_2 may be pruned using the procedure described earlier, pruning z_3 requires a different procedure: due to the direct addition of P_{ℓ−1} to F(P_{ℓ−1}), the output dimensions of Z_{ℓ−1} and z_3 must match exactly. Thus a ResNet architecture consists of sequences of residual units of length B with matching input/output dimensions: ζ ≜ [Z_ℓ]_{ℓ=1,…,B}, s.t. dim(P_1) = dim(P_2) = … = dim(P_B). We propose group pruning of the layers [z_3^ℓ]_{ℓ=1,…,B}, where filters are removed from all z_3 layers in a residual unit sequence in tandem. We define s([ζ, c]) ≜ Σ_{ℓ=1}^B s([z_3^ℓ, c]), where s(·) is defined for convolutional layers as in (8). To prune channel c from ζ, we prune it from each layer in [z_3^ℓ]_{ℓ=1,…,B}. We typically pruned sequence channels less aggressively than convolutional filters, as these channels feed into several convolutional layers.
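The group saliency is simply a sum over the tied z_3 layers. A minimal sketch, where the `ResidualUnit` container and `layer_saliency` callback are our own stand-ins for the paper's saliency function (8):

```python
from dataclasses import dataclass
from typing import Any, Callable, List

@dataclass
class ResidualUnit:
    z3: Any  # last conv layer of the unit; its output channels are tied

def sequence_channel_saliency(sequence: List[ResidualUnit], channel: int,
                              layer_saliency: Callable[[Any, int], float]) -> float:
    """Saliency of output channel `channel` across a residual-unit sequence.

    Because P_{l-1} is added elementwise to F(P_{l-1}), the z3 layers of all
    units in the sequence must keep matching output channels, so the group
    saliency s([zeta, c]) sums the per-layer saliency of channel c over units.
    """
    return sum(layer_saliency(unit.z3, channel) for unit in sequence)
```

Pruning a channel then means removing it from every `z3` in the sequence at once, keeping the residual additions dimensionally consistent.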



Footnotes

In contrast, foresight pruning (Wang et al., 2020) removes DNN elements prior to the training process.

Popular deep learning libraries do not accelerate sparse matrix operations over dense matrix operations; thus, pruning network connections cannot easily be capitalized upon for performance improvements. It is also unclear whether moderately sparse matrix operations (i.e., operations on matrices generated by connection pruning) can be significantly accelerated on massively parallel architectures such as GPUs (see Yang et al. (2018), Fig. 7). See Section 5 in Buluç & Gilbert (2008) for challenges in parallel sparse matrix multiplication.

Implementation details of this saliency function can be found in Appendix A.1.

In contrast to PruneTrain (Lym et al., 2019), our problem definition balances training time vs. performance under an additional constraint on the trained network size (3b). We discuss this further in Section 5.

Among the various types of MOGPs (see Álvarez & Lawrence (2011) for a detailed review), we choose this linear model so that the correlations between s^a_t and s^{a'}_{t'} can be computed analytically.

We omit (4b) as it is automatically satisfied due to our simplification.

Note that as µ^a_{T|1:t} ≥ µ^b_{T|1:t}, Φ(·) ≤ 0.5 and experiences tail decay proportional to µ^a_{T|1:t} − µ^b_{T|1:t}.

Code available at https://github.com/keras-team/keras/blob/master/examples/cifar10_cnn.py

Complete experimental setup details can be found in Appendix G.2.

In our observations, jointly modeling the belief of multiple layers' saliency measurements using MOGP yielded no measurable improvement in log-likelihood. For CIFAR-10, see Appendix G.1.

Further details can be found in Appendix D.

We formally define saliency on residual unit sequences in Appendix G.3.

We omit comparison to DSR due to a differing underlying deep learning library, which makes wall-time comparisons inaccurate.

For brevity, we omit the parameters X, Y, N_{z_1:z_L}, L.

Here we use {} to distinguish indexing into a tensor from indexing into the sequence of tensors [W_{ℓ+1,c}].

This approach is inspired by Proportional-Integral-Derivative (PID) controllers (Bellman, 2015); see Åström et al. (1993) for an introductory survey.

Generally, saliency evaluations are relatively small (≤ 0.01), which leads to poorly fitting models or positive log-likelihood. Precise details of our data preprocessing are in Appendix G.4.



Figure 1: Visualization of qualitative differences between GP and MOGP prediction. Top: GP; Bottom: 18-MOGP. The dataset is separated into training (green) and validation (blue). The posterior belief of the saliency is visualized as the predictive mean (red line) and 95% confidence interval (error bars).

Figure 2: Comparing dynamic penalty scaling vs. static on pruning a 32-convolutional filter layer in a CNN. Dynamic penalty scaling encourages gradual pruning across a wide variety of settings of λ.

(Appendix A.1 and A.2; Wang et al. (2020), Section 3, last paragraph). Development of saliency functions which overcome this shortcoming while remaining performant is a difficult open problem outside the scope of this work. Precise details of the ResNet architecture may be found in He et al. (2016a), Section 3.

Figure 3: Convolutional filter saliency over 150 epochs of SGD on CIFAR-10.

A MODELING DETAILS

A.1 SALIENCY FUNCTION

Figure 4: Function samples drawn from the exponential kernel.

[P(m' ∨ e^(a)) − P(m')] ≥ [P(m'' ∨ e^(a)) − P(m'')] .

∀ 1 ≤ a, b ≤ M and m ∈ {0, 1}^M s.t. m ∧ (e^(a) ∨ e^(b)) = 0: given a vector of observed saliency s_{1:t}, if µ^a_{T|1:t} ≥ µ^b_{T|1:t} and µ^a_{T|1:t} ≥ 0, then

E_{p(s_T|s_{1:t})}[ρ_T(m ∨ e^(b))] − E_{p(s_T|s_{1:t})}[ρ_T(m ∨ e^(a))] ≤ µ^b_{T|1:t} Φ(ν/θ) + θ φ(ν/θ)

where θ ≜ sqrt(σ^{aa}_{T|1:t} + σ^{bb}_{T|1:t} − 2σ^{ab}_{T|1:t}), ν ≜ µ^b_{T|1:t} − µ^a_{T|1:t}, and Φ and φ are the standard normal CDF and PDF, respectively.

]_{t'=t,…,T}, [B_{t',c}]_{t'=t,…,T}, and m_T · s_T. Instead, we consider a simplification of the above problem by only considering solutions of the form m_{T−1} = m_{T−2} = … = m_t, which yields:

Comparing log likelihood (standard error) of test data for independent GPs (GP) vs. MOGP with n latent functions (n-MOGP) on collected saliency measurements from CIFAR-100 training. Measurements are given as a multiple of -10 4 (lower is better). MOGP outperforms GP, particularly on the small dataset. Results are averaged over 20 runs. Extremely large values are due to the GP model being unable to fit the data.

Performance (standard error) against percentage of neurons/filters pruned per layer with varying λ for tested algorithms.

Ablation study showing performance (standard error) vs. varying early pruning hyperparameters: MOGP variational inducing points (Ind. pnts.), MOGP latent functions (Lat. func.), and T_step. Default settings for the hyperparameters are 60, 1.0×, and 10, respectively. Outside of the highest sparsity setting, 95%, all hyperparameters are robust to changes, with mild degradation observed in the extremal settings.

Li et al. (2017) define a saliency function on convolutional filters by using the L1 norm. Molchanov et al. (2017) propose using a first-order Taylor-series approximation of the objective function as a saliency measure. Dong et al. (2017) propose layer-wise pruning of weight parameters using a Hessian-based saliency measure. Several variants of pruning after training exist. Han et al. (2015) propose iterative pruning, where pruning is performed in stages alternating with fine-tuning. Guo et al. (2016) suggest dynamic network surgery, where pruning is performed on-the-fly during evaluation time. Li et al. (2017) and He et al. propose reinforcement learning for pruning decisions. A comprehensive overview may be found in Gale et al. (2019).

BEP vs. SNIP and Grasp on ResNet-50 on ImageNet dataset. We vary percentage of residual unit sequence channels (Seq) and filters pruned (Lyr). 'Train' refers to wall time during network training, 'Prune' refers to pruning modeling/inference overhead. Benchmark wall time for ResNet-50 is 55h on our hardware. Benchmark performance is 75.7% for unpruned ResNet-50.

The a-th element of µ_{t'|1:t}, denoted µ^a_{t'|1:t}, is the predictive mean of the saliency s^a_{t'}, and the [a, a']-th element of K_{t'|1:t}, denoted σ^{aa'}_{t'|1:t}, is the predictive (co)variance between the saliencies s^a_{t'} and s^{a'}_{t'}.

Comparing BEP with PruneTrain on ResNet-50 on ImageNet dataset. PruneTrain uses a stronger ResNet-50 baseline with Top-1 76.2% performance. BEP uses a 75.7% baseline. 'Train' refers to wall time during network training, 'Prune' refers to pruning modeling/inference overhead. Benchmark wall time for ResNet-50 is 55h on our hardware. Comparison performed using 47% of ResNet-50 baseline inference FLOPs. Train FLOPs refers to proportion of ResNet-50 baseline training FLOPs used to train the network.

Notations used elsewhere in the paper.
s^a_t — Random variable representing the saliency measurement of network element a at time t.
s_t — Sequence of random variables [s^a_t]_{a=1,…,M}.
s_{τ1:τ2} — Sequence of random variables [s_t]_{t=τ1,…,τ2}.

Comparing log likelihood (standard error) of test data for Independent GPs (GP) vs. MOGP with n latent functions (n-MOGP) on collected saliency measurements from CIFAR-10 training.

G.4 DATA PREPROCESSING

We followed the same data preprocessing procedure for both our small-scale and ImageNet experiments. To standardize the saliency measurements of a training dataset s_{1:t} in our modeling experiments, we clip them between 0 and an upper bound computed as ub ≜ percentile(s_{1:t}, 95) × 1.3. This procedure removes outliers. We used 1.3 as a multiplier because this upper bound is also used to transform the test dataset, which may have higher saliency evaluations. After clipping the training data, we perform a trend check for each element v^a by fitting a linear regression model to the data s^a_{1:t}. For s^a_{1:t} with an increasing trend (i.e., the linear regression model has positive slope), we perform the transformation s^a_{1:t} = ub − s^a_{1:t}; the reasoning is that the exponential kernel strongly prefers decaying curves. After this preprocessing, we scale the saliency measurements up to a [0, 10] range: s_{1:t} = s_{1:t} × 10. We found that without scaling to larger values, the log-likelihood of our models exhibited extremely high positive values due to the small magnitudes of unscaled saliency measurements. We transform the test data in our modeling experiments, s_{t+1:T}, with the same procedure, using the same ub and per-element v^a regression models as computed on the training data; we measure log-likelihood after this transformation for the test dataset in our small-scale experiments. During the BEP algorithm, the same steps are followed, except that we invert the trend-check transformation (s^a_{1:t} = ub − s^a_{1:t}) on the predicted MOGP distribution of s_T prior to sampling for estimation of ∆(·).
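The pipeline above can be sketched as follows. This is a minimal sketch under assumed shapes (rows are time steps, columns are network elements; all names are ours):

```python
import numpy as np

def preprocess_saliency(train, test):
    """Standardize saliency series: clip outliers, flip increasing trends,
    and scale up, mirroring the three steps described in this appendix."""
    train = np.asarray(train, dtype=float)
    test = np.asarray(test, dtype=float)
    # 1. Outlier removal: clip to [0, ub] with ub = 95th percentile * 1.3.
    ub = np.percentile(train, 95) * 1.3
    train = np.clip(train, 0.0, ub)
    # 2. Trend check: flip series whose least-squares slope is positive,
    #    since the exponential kernel prefers decaying curves.
    steps = np.arange(train.shape[0])
    flipped = np.zeros(train.shape[1], dtype=bool)
    for a in range(train.shape[1]):
        slope = np.polyfit(steps, train[:, a], 1)[0]
        if slope > 0:
            train[:, a] = ub - train[:, a]
            flipped[a] = True
    # Test data reuses the training ub and the per-element flips.
    test = np.clip(test, 0.0, ub)
    test[:, flipped] = ub - test[:, flipped]
    # 3. Scale up so log-likelihoods are well-behaved.
    return train * 10.0, test * 10.0, ub, flipped
```

The returned `ub` and `flipped` can later be used to invert the trend-check transformation, as done before sampling in the BEP algorithm.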

