GRADIENT FLOW IN SPARSE NEURAL NETWORKS AND HOW LOTTERY TICKETS WIN

Abstract

Sparse Neural Networks (NNs) can match the generalization of dense NNs using a fraction of the compute/storage for inference, and have the potential to enable efficient training. However, naively training unstructured sparse NNs from random initialization results in significantly worse generalization, with the notable exceptions of Lottery Tickets (LTs) and Dynamic Sparse Training (DST). In this work, we attempt to answer: (1) why does training unstructured sparse networks from random initialization perform poorly; and (2) what makes LTs and DST the exceptions? We show that sparse NNs have poor gradient flow at initialization and propose a modified initialization for unstructured connectivity. Furthermore, we find that DST methods significantly improve gradient flow during training over traditional sparse training methods. Finally, we show that LTs do not improve gradient flow; rather, their success lies in re-learning the pruning solution they are derived from. However, this comes at the cost of learning novel solutions.

1. Introduction

Deep Neural Networks (DNNs) are the state-of-the-art method for solving problems in computer vision, speech recognition, and many other fields. While early research in deep learning focused on application to new problems, or pushing state-of-the-art performance with ever larger and more computationally expensive models, a broader focus has emerged towards their efficient real-world application. One such focus is on the observation that only a sparse subset of the dense connectivity is required for inference, as is apparent in the success of pruning (Han et al., 2015; Mozer et al., 1989b). Pruning has a long history in the Neural Network (NN) literature, and remains the most popular approach for finding sparse NNs. Sparse NNs found by pruning algorithms (Han et al., 2015; Louizos et al., 2017; Molchanov et al., 2017; Zhu et al., 2018), i.e. pruning solutions, can match dense NN generalization with much better efficiency at inference time.

However, naively training an (unstructured) sparse NN from a random initialization (i.e. from scratch) typically leads to significantly worse generalization. Two methods in particular have shown some success at addressing this problem: Lottery Tickets (LTs) and Dynamic Sparse Training (DST). The mechanism behind the success of both methods is not well understood, however. For example, we don't know how to find LTs efficiently, while RigL (Evci et al., 2020), a recent DST method, requires 5× the training steps to match dense NN generalization. Only by understanding how these methods overcome the difficulty of sparse training can we improve upon them.

A significant breakthrough in training DNNs, namely addressing vanishing and exploding gradients, arose from understanding gradient flow both at initialization and during training. In this work we investigate the role of gradient flow in the difficulty of training unstructured sparse NNs from random initializations and from LT initializations.
Our experimental investigation results in the following insights:

1. Sparse NNs have poor gradient flow at initialization. In §3.1 and §4.1 we show that existing methods for initializing sparse NNs are incorrect in not considering heterogeneous connectivity. We believe we are the first to show that sparsity-aware initialization methods improve gradient flow and training.
2. Sparse NNs have poor gradient flow during training. In §3.2 and §4.2 we observe that even in sparse NN architectures less sensitive to incorrect initialization, the gradient flow during training is poor. We show that the DST methods achieving the best generalization have improved gradient flow.
3. Lottery Tickets don't improve upon (1) or (2); instead they re-learn the pruning solution. In §3.3 and §4.3 we show that a LT initialization resides within the same basin of attraction as the original pruning solution it is derived from, and that a LT solution is highly similar to the pruning solution in function space.

2. Related Work

Pruning

Pruning is commonly used in the Neural Network (NN) literature to obtain sparse networks (Castellano et al., 1997; Hanson et al., 1988; Kusupati et al., 2020; Mozer et al., 1989a,b; Setiono, 1997; Sietsma et al., 1988; Wortsman et al., 2019). Pruning algorithms remove connections of a trained dense network using various criteria including weight magnitude (Han et al., 2015, 2016; Zhu et al., 2018), gradient-based measures (Molchanov et al., 2016), and second-order terms based on the Hessian (Hassibi et al., 1993; LeCun et al., 1990). While the majority of pruning algorithms focus on pruning after training, a subset focuses on pruning NNs before training (Lee et al., 2019; Tanaka et al., 2020; Wang et al., 2020). Gradient Signal Preservation (GRaSP) (Wang et al., 2020) is particularly relevant to our study, since its pruning criterion aims to preserve gradient flow, and its authors observe a positive correlation between initial gradient flow and final generalization. However, the recent work of Frankle et al. (2020b) suggests that the reported gains are due to the sparsity distributions discovered rather than the particular sub-network. Another limitation of these algorithms is that they don't scale to large-scale tasks like ResNet-50 training on ImageNet-2012.

Lottery Tickets

Frankle et al. (2019a) showed the existence of sparse sub-networks at initialization, known as Lottery Tickets, which can be trained to match the generalization of the corresponding dense Deep Neural Network (DNN). The initial work of Frankle et al. (2019a) inspired much follow-up work. Gale et al. (2019) and Liu et al. (2019) observed that the initial formulation was not applicable to larger networks with higher learning rates. Frankle et al. (2019b, 2020a) proposed late rewinding as a solution. Morcos et al. (2019) and Sabatelli et al. (2020) showed that Lottery Tickets (LTs) trained on large datasets transfer to smaller ones, but not vice versa. Frankle et al. (2020c), Ramanujan et al. (2019), and Zhou et al. (2019) focused on further understanding LTs, and on finding sparse sub-networks at initialization. As one might expect, sufficiently large networks have smaller solutions hidden in them. Malach et al. (2020) studied this and proved the existence of solutions in sufficiently large networks. However, it is an open question whether finding such networks at initialization can be done more efficiently than with existing pruning algorithms.

Dynamic Sparse Training

Most training algorithms work on pre-determined architectures and optimize parameters using fixed learning schedules. Dynamic Sparse Training (DST), on the other hand, aims to optimize the sparse NN connectivity jointly with model parameters. Mocanu et al. (2018) and Mostafa et al. (2019) propose replacing low magnitude parameters with random connections and report improved generalization. Dettmers et al. (2019) proposed using momentum values, whereas Evci et al. (2020) used gradient estimates directly to guide the selection of new connections, reporting results that are on par with pruning algorithms. In §4.2 we study these algorithms and try to understand the role of gradient flow in their success.

Random Initialization of Sparse NN

The vast majority of pre-existing work on training sparse NNs from scratch has used the common initialization methods (Glorot et al., 2010; He et al., 2015) derived for dense NNs, with only a few notable exceptions. Gale et al. (2019), Liu et al. (2019), and Ramanujan et al. (2019) scaled the variance (fan-in/fan-out) of a sparse NN layer according to the layer's sparsity, effectively using the standard initialization for a small dense layer with the same number of weights as the sparse layer.

3. Analyzing Gradient Flow in Sparse Neural Networks

A significant breakthrough in training very deep NNs arose in addressing the vanishing and exploding gradient problem, both at initialization and during training. This problem was understood by analyzing the signal propagation within a DNN, and addressed by improved initialization methods (Glorot et al., 2010; He et al., 2015; Xiao et al., 2018) alongside normalization methods, such as Batch Normalization (BatchNorm) (Ioffe et al., 2015). In our work, following Wang et al. (2020), we study these problems using the gradient flow, $\nabla L(\theta)^\top \nabla L(\theta)$, which is the first-order approximation of the decrease in the loss expected after a gradient step. We observe poor gradient flow for the predominant sparse NN initialization strategy and propose a solution in §3.1. Then in §3.2 and §3.3 we summarize Dynamic Sparse Training (DST) methods and the LT hypothesis, respectively.

[Figure 1a: dense layer and its weight matrix (fan-in = 2, fan-out = 3).]
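As a minimal illustration of the gradient-flow quantity defined above, the sketch below evaluates the squared gradient norm for a toy quadratic loss and compares the first-order prediction of the loss decrease (step size times gradient flow) with the actual decrease after one gradient step. The quadratic loss, the step size `lr`, and all function names are assumptions made for illustration only; none of them come from the experiments in this paper.

```python
def loss(theta, curvatures):
    # Toy quadratic loss L(theta) = 0.5 * sum_k c_k * theta_k^2 (illustrative only).
    return 0.5 * sum(c * t * t for c, t in zip(curvatures, theta))

def grad(theta, curvatures):
    # Analytic gradient of the toy loss: dL/dtheta_k = c_k * theta_k.
    return [c * t for c, t in zip(curvatures, theta)]

def gradient_flow(g):
    # Gradient flow as used in the text: g^T g, the squared L2 norm of the gradient.
    return sum(x * x for x in g)

def sgd_step(theta, g, lr):
    # One plain gradient-descent step.
    return [t - lr * x for t, x in zip(theta, g)]

curvatures = [1.0, 0.5, 2.0]   # hypothetical per-coordinate curvatures
theta = [1.0, -2.0, 0.5]       # hypothetical parameters
lr = 0.01                      # hypothetical step size

g = grad(theta, curvatures)
flow = gradient_flow(g)
predicted_decrease = lr * flow  # first-order estimate of the loss decrease
actual_decrease = loss(theta, curvatures) - loss(sgd_step(theta, g, lr), curvatures)
print(flow, predicted_decrease, actual_decrease)
```

For a small step size the two decreases nearly coincide; the gap is the second-order term that the first-order approximation ignores.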

3.1. The Initialization Problem in Sparse Networks

Here we analyze the gradient flow at initialization for random sparse NNs, motivating the derivation of a more general initialization for NNs with heterogeneous connectivity, such as sparse NNs. In practice, without a method such as BatchNorm (Ioffe et al., 2015), using the correct initialization can be the difference between being able to train a DNN or not, as observed for VGG16 in our results (§4.1, Table 1). The initializations proposed by Glorot et al. (2010) and He et al. (2015) ensure that the output distribution of every neuron in a layer has zero mean and unit variance. They do this by sampling from a Gaussian distribution with a variance based on the number of incoming/outgoing connections, which is assumed to be identical for all neurons in a dense layer, as illustrated in Fig. 1a. In an unstructured sparse NN, however, the number of incoming/outgoing connections is not identical for all neurons in a layer, as illustrated in Fig. 1b.

In Appendix A.2 we derive the initialization for this more general case. In Appendix A.1 we explain in full the generalized Glorot et al. (2010) and He et al. (2015) initializations, in the forward, backward and average use cases. Here we focus only on the generalized He et al. (2015) initialization for forward propagation, which we used in our experiments. For every weight $w^{[\ell]}_{ij}$ of $W^{[\ell]} \in \mathbb{R}^{n^{[\ell]} \times n^{[\ell-1]}}$ in a layer $\ell$ with $n^{[\ell]}$ neurons, and mask $[m^{[\ell]}_{ij}] = M^{[\ell]} \in \{0,1\}^{n^{[\ell]} \times n^{[\ell-1]}}$,

\[ w^{[\ell]}_{ij} \sim \mathcal{N}\left(0, \frac{2}{\text{fan-in}^{[\ell]}_i}\right), \tag{1} \]

where $\text{fan-in}^{[\ell]}_i = \sum_{j=1}^{n^{[\ell-1]}} m^{[\ell]}_{ij}$ is the number of incoming connections for neuron $i$ in layer $\ell$. In the special case of a dense layer where $m^{[\ell]}_{ij} = 1, \forall i,j$, Eq. (1) reduces to the initialization proposed by He et al. (2015), since $\text{fan-in}^{[\ell]}_i = n^{[\ell-1]}, \forall i$. Using the dense initialization in a sparse DNN causes the signal to vanish, as empirically observed in Fig. 1c, whereas our initialization keeps the variance of the signal constant.

The initialization proposed by Liu et al. (2019) is a special case of ours where it is assumed that $\text{fan-in}^{[\ell]}_i \equiv \text{fan-in}^{[\ell]}, \forall i$, i.e. all neurons in a layer have the same number of unmasked incoming connections. Surprisingly, the initialization of Liu et al. (2019) also preserves the signal in Fig. 1c (discussed in §4.1).
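The three initializations compared in this section can be sketched as follows. This is a minimal standard-library illustration, not the code used in the experiments: the function names, the toy mask, and the treatment of neurons with no incoming connections are assumptions, and the `"layer"` scheme is one reading of the uniform fan-in special case (a single value per layer, taken here as the mean fan-in).

```python
import math
import random

def fan_in_per_neuron(mask):
    # mask[i][j] = 1 if the connection from input j to neuron i is kept, else 0.
    return [sum(row) for row in mask]

def init_std(mask, scheme):
    """Per-neuron standard deviations for a He-style (ReLU) initialization.

    "dense":  sqrt(2 / n_in) for every neuron, ignoring the mask.
    "layer":  sqrt(2 / mean fan-in), one value per layer (uniform fan-in assumption).
    "neuron": sqrt(2 / fan_in_i), one value per neuron, as in Eq. (1).
    """
    n_in = len(mask[0])
    fan_in = fan_in_per_neuron(mask)
    if scheme == "dense":
        return [math.sqrt(2.0 / n_in)] * len(mask)
    if scheme == "layer":
        mean_fan_in = sum(fan_in) / len(fan_in)
        return [math.sqrt(2.0 / mean_fan_in)] * len(mask)
    if scheme == "neuron":
        # Neurons with no incoming connections get std 0 (a choice made for this sketch).
        return [math.sqrt(2.0 / f) if f > 0 else 0.0 for f in fan_in]
    raise ValueError(scheme)

def sample_weights(mask, scheme, rng):
    # Draw each kept weight from N(0, std_i^2); masked weights are exactly zero.
    stds = init_std(mask, scheme)
    return [[m * rng.gauss(0.0, s) for m in row] for row, s in zip(mask, stds)]

# Hypothetical 3-neuron layer with 4 inputs and heterogeneous fan-in (4, 1 and 2).
mask = [[1, 1, 1, 1],
        [1, 0, 0, 0],
        [1, 1, 0, 0]]
rng = random.Random(0)
stds = init_std(mask, "neuron")
weights = sample_weights(mask, "neuron", rng)
for scheme in ("dense", "layer", "neuron"):
    print(scheme, [round(s, 4) for s in init_std(mask, scheme)])
```

With heterogeneous fan-in the per-neuron scheme gives each neuron its own scale, while the dense scheme uses the same small scale everywhere, which is the situation in which the text reports a vanishing signal.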

3.2. Gradient Flow during Training and Dynamic Sparse Training

While initialization is important for the first training step, the gradient flow during the early stages of training is not well addressed by initialization alone, as shown by normalization methods (Ioffe et al., 2015). Our findings show that even with BatchNorm, the gradient flow during training in unstructured sparse NNs is poor. Recently, a promising new approach to training sparse NNs has emerged, Dynamic Sparse Training (DST), which learns connectivity adaptively during training and shows significant improvements over baseline methods that use a fixed mask. These methods perform periodic updates on the sparse connectivity of each layer, commonly replacing the least-magnitude connections with new connections selected using various criteria. We consider two of these methods: Sparse Evolutionary Training (SET) (Mocanu et al., 2018), which chooses new connections randomly, and Rigged Lottery (RigL) (Evci et al., 2020), which chooses connections with high gradient magnitude. RigL improves over SET and matches pruning performance with sufficient training time. Since these methods have only recently been proposed, there is a lack of understanding of why and how they achieve better results.
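The drop-and-grow update described above can be sketched for a single layer as follows. This is a simplified illustration under stated assumptions, not the reference implementation of either method: the layer is stored as flat lists, growth candidates are the connections that were inactive before the drop, newly grown weights start at zero, and the name `dst_update` and its arguments are hypothetical.

```python
import random

def dst_update(weights, grads, mask, n_swap, grow="rigl", rng=None):
    """One connectivity update for a single layer stored as flat lists.

    Drop: deactivate the n_swap active connections with the smallest |weight|.
    Grow: activate n_swap inactive connections, chosen by
          "rigl"          -> largest |gradient|,
          "set"           -> uniformly at random,
          "rigl_inverted" -> smallest |gradient| (the inverted baseline).
    """
    rng = rng or random.Random(0)
    active = [k for k, m in enumerate(mask) if m == 1]
    inactive = [k for k, m in enumerate(mask) if m == 0]
    n_swap = min(n_swap, len(active), len(inactive))

    dropped = sorted(active, key=lambda k: abs(weights[k]))[:n_swap]
    if grow == "rigl":
        grown = sorted(inactive, key=lambda k: -abs(grads[k]))[:n_swap]
    elif grow == "rigl_inverted":
        grown = sorted(inactive, key=lambda k: abs(grads[k]))[:n_swap]
    elif grow == "set":
        grown = rng.sample(inactive, n_swap)
    else:
        raise ValueError(grow)

    new_mask, new_weights = list(mask), list(weights)
    for k in dropped:
        new_mask[k], new_weights[k] = 0, 0.0
    for k in grown:
        new_mask[k], new_weights[k] = 1, 0.0  # grown weights start at zero (assumption)
    return new_weights, new_mask

# Hypothetical layer with 6 connections, 3 of them active.
weights = [0.5, -0.01, 0.0, 0.0, 0.3, 0.0]
grads = [0.1, 0.2, 0.9, -0.05, 0.0, -1.5]
mask = [1, 1, 0, 0, 1, 0]

_, rigl_mask = dst_update(weights, grads, mask, n_swap=1, grow="rigl")
_, inverted_mask = dst_update(weights, grads, mask, n_swap=1, grow="rigl_inverted")
_, set_mask = dst_update(weights, grads, mask, n_swap=1, grow="set")
print(rigl_mask, inverted_mask, set_mask)
```

In every variant the number of active connections is unchanged; only the growth criterion differs, which is the distinction between SET and RigL drawn in the text.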

3.3. Lottery Ticket Hypothesis

A recent approach for training unstructured sparse NNs while achieving similar generalization to the original dense solution is the Lottery Ticket Hypothesis (LTH) (Frankle et al., 2019a). Notably, rather than training a pruned NN structure from random initialization, the LTH uses the dense initialization from which the pruning solution was derived.

Definition [Lottery Ticket Hypothesis]: Given a NN $f$ with a parameter vector $\theta$ and an optimization function $O^N(f, \theta) = \theta^N$, which gives the optimized parameters of $f$ after $N$ training steps, there exists a sparse sub-network characterized by the binary mask $M$ such that for some iteration $K$, $O^N(f, \theta^K * M)$ performs as well as $O^N(f, \theta) * M$, whereas the model trained from another random initialization $\theta_S$ using the same mask, $O^N(f, \theta_S * M)$, typically does not.

Frankle et al. (2019a) initially claimed the LTH held for $K = 0$, but later revised this to $N \gg K \geq 0$ (Frankle et al., 2019b; Liu et al., 2019). LTs enjoy significantly faster convergence compared to regular NN training, but require the connectivity mask as found by the pruning solution (Frankle et al., 2019a) along with values from early training (Frankle et al., 2019b). Given the importance of the early phase of training (Frankle et al., 2020c; Lewkowycz et al., n.d.), it is natural to ask about the difference between lottery tickets and the solutions they are derived from (i.e. pruning solutions). Answering this question can help us understand whether the success of LTs is primarily due to their relation to the pruning solution, or whether we can identify generalizable characteristics that help with sparse NN training.
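The objects in the definition above can be made concrete with a small sketch: a mask M obtained from the trained parameters, the LT initialization (early-training parameters multiplied by M), and the pruning solution (trained parameters multiplied by M). The sketch assumes one-shot global magnitude pruning on a flat parameter vector purely for brevity; the experiments in this paper prune each layer iteratively, and all names and values below are hypothetical.

```python
def magnitude_mask(theta_final, sparsity):
    # Binary mask keeping the largest-|weight| entries of a trained parameter vector.
    n = len(theta_final)
    n_keep = n - int(round(sparsity * n))
    order = sorted(range(n), key=lambda k: -abs(theta_final[k]))
    keep = set(order[:n_keep])
    return [1 if k in keep else 0 for k in range(n)]

def apply_mask(theta, mask):
    # Element-wise product theta * M.
    return [t * m for t, m in zip(theta, mask)]

def lottery_ticket_init(theta_k, theta_final, sparsity):
    # LT initialization: parameters from step K, masked with the pruning mask.
    mask = magnitude_mask(theta_final, sparsity)
    return apply_mask(theta_k, mask), mask

theta_k = [0.2, -0.1, 0.4, 0.05]       # hypothetical parameters at step K
theta_final = [0.9, -0.05, -1.2, 0.1]  # hypothetical parameters after N steps

lt_init, mask = lottery_ticket_init(theta_k, theta_final, sparsity=0.5)
pruning_solution = apply_mask(theta_final, mask)
print(mask, lt_init, pruning_solution)
```

The LT initialization and the pruning solution share the same mask by construction, which is why the two can be compared directly in parameter space in §4.3.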

4. Experiments

Here we show empirically that: (1) sparsity-aware initialization improves gradient flow at initialization for all methods, and achieves higher generalization for networks without BatchNorm; (2) the mask updates of DST methods increase gradient flow and create new negative eigenvalues in the Hessian, which we believe to be the main factor for improved generalization; and (3) lottery tickets have poor gradient flow, but achieve good performance by effectively re-learning the pruning solution, meaning they do not address the problem of training sparse NNs in general. Our experiments include the following settings: LeNet5 on MNIST, VGG16 on ImageNet-2012 and ResNet-50 on ImageNet-2012. Experimental details can be found in Appendix B.

4.1. Gradient Flow at Initialization

In this section, we measure the gradient flow over the course of training (Fig. 2) and evaluate the performance of our generalized He initialization method (Table 1), and that proposed by Liu et al. (2019), against the commonly used masked dense initialization. Additional gradient flow plots for the remaining methods are shared in Appendix C. Sparse NNs initialized using the initialization distribution of a dense model (Scratch in Fig. 2) start in a flat region where gradient flow is very small and don't make any early progress. Learning starts after 1000 iterations for LeNet5 and 5000 for VGG16; however, their generalization is sub-optimal.

Liu et al. (2019) claim their proposed initialization has no empirical effect as compared to the masked dense initialization. Although their initialization is technically incorrect for heterogeneous connectivity (see §3.1), our results show it to be largely as effective as our proposed initialization. This indicates that the assumption of roughly uniform sparsity across the neurons of a layer is sufficient for the masks we considered. Both of these initializations remedy the vanishing gradient problem at initialization (Scratch+ in Fig. 2) and result in better generalization for all methods. For instance, improved initialization results in an improvement of about 11 percentage points in Top-1 accuracy for VGG16 (62.52 vs 51.81).

While initialization is extremely important for NNs without BatchNorm and skip connections, its effect on modern architectures, such as ResNet-50, is limited (Evci et al., 2019; Frankle et al., 2020b; Zhang et al., 2019). We confirm these observations in our ResNet-50 experiments in which, despite some initial improvement in gradient flow, our initialization seems to have no effect on final generalization. We observe significant increases in gradient norm after each learning rate drop (due to increased variance in gradients), which suggests that studying the gradient norm in the later part of training might not be helpful.
On the other hand, we observe a significant difference in gradient flow during training between sparse networks and small dense models of a similar parameter count. Can the performance gap between static-sparse and dense models be explained by this difference?

Figure 4: A cartoon illustration of the loss landscape of a sparse model, after it is pruned from a dense solution to create a LT sub-network. A lottery ticket initialization is within the basin of attraction of the pruned model's solution. In contrast, a random initialization is unlikely to be close to the dense solution's basin.

4.2. Gradient Flow during Training and Dynamic Sparse Training

In Fig. 2 we observed improved gradient flow for RigL. In this section we focus on those iterations in which the sparse connectivity is updated, and measure the change in gradient flow along with the Hessian spectrum. We also run the inverted baseline for RigL (RigL Inverted), in which the growing criterion is reversed and the connections with the smallest gradient magnitudes are activated.

DST methods such as RigL replace low-saliency connections during training. Assuming the pruned connections indeed have a low impact on the loss, we might expect to see an increased gradient norm after new connections are activated, especially in the case of RigL, which picks new connections with high-magnitude gradients. In Fig. 3 we confirm that RigL updates increase the norm of the gradient significantly, especially in the first half of training, whereas SET, which picks new connections randomly, seems to be less effective at this. Using the inverted RigL criterion doesn't improve the gradient flow, as expected, and without this improvement in gradient flow RigL's performance degrades (73.83±0.12 for ResNet-50 and 92.71±7.67 for LeNet5). These results suggest that improving gradient flow early in training might be the key to training sparse networks, and that this is what RigL appears to be doing. Additional plots for different initialization methods are shared in Appendix C.

RigL falls short of matching Small-Dense performance while consistently having higher gradient flow during training, which highlights a limitation of looking solely at gradient flow. When the gradient is zero, or uninformative due to the error term of the approximation, analyzing the Hessian could provide additional insights (Ghorbani et al., 2019; Papyan, 2019; Sagun et al., 2017). In Appendix E, we show the Hessian spectrum before and after sparse connectivity updates. After RigL updates we observe more negative eigenvalues, with significantly larger magnitudes, as compared to SET.
On the other hand, small dense models have smaller positive outlier eigenvalues while having significantly larger negative ones, which is again a sign of better-conditioned optimization. We leave further investigation of the relationship between gradient flow and the Hessian as future work.
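The role of negative Hessian eigenvalues mentioned above can be made explicit with the standard second-order Taylor expansion of the loss. This is a textbook identity added for orientation, not a result of this paper; $H$ denotes the Hessian at $\theta$, and $\eta$, $\epsilon$, $v$ and $\lambda$ are symbols introduced here for illustration.

```latex
% Second-order expansion around \theta for a small perturbation \delta:
L(\theta + \delta) \approx L(\theta) + \nabla L(\theta)^\top \delta
  + \tfrac{1}{2}\, \delta^\top H \delta .

% Gradient step \delta = -\eta \nabla L(\theta): the first-order term is the
% gradient flow, the second-order term is the error of that approximation.
L(\theta - \eta \nabla L(\theta)) \approx L(\theta)
  - \eta\, \nabla L(\theta)^\top \nabla L(\theta)
  + \tfrac{\eta^2}{2}\, \nabla L(\theta)^\top H\, \nabla L(\theta) .

% Near-zero gradient, step \delta = \epsilon v along a unit eigenvector v
% of H with eigenvalue \lambda:
L(\theta + \epsilon v) \approx L(\theta) + \tfrac{1}{2}\, \epsilon^2 \lambda ,
\qquad \text{which decreases the loss whenever } \lambda < 0 .
```

Under this reading, directions with negative eigenvalues are directions along which the loss can still decrease even where the gradient itself carries little information.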

4.3. Why Lottery Tickets are Successful

We found that LTs do not improve gradient flow, either at initialization or early in training, as shown in Fig. 2. This may be surprising given the apparent success of LTs; however, the questions posed in §3.3 present an alternative hypothesis for the ease of training from a LT initialization. Here we present results showing that (1) LT initializations are consistently closer to the pruning solution than a random initialization, (2) trained LTs (i.e. LT solutions) consistently end up in the same basin as the pruning solution, and (3) LT solutions are highly similar to pruning solutions under various function similarity measures. Our resulting understanding of LTs in the context of the pruning solution and the loss landscape is illustrated in Fig. 4.

Experimental Setup

To investigate the relationship between the pruned and LT solutions we perform experiments on two models/datasets: a 95% sparse LeNet5 architecture (LeCun et al., 1989) trained on MNIST (where the original LT formulation works, i.e. K = 0), and an 80% sparse ResNet-50 (Wu et al., 2018) on ImageNet-2012 (Russakovsky et al., 2015) (where K = 0 doesn't work (Frankle et al., 2019b)), for which we use values from K = 2000 (≈ 6th epoch). In both cases, we find a LT initialization by pruning each layer of a dense NN separately using magnitude-based iterative pruning (Zhu et al., 2018). Further details about our experiments can be found in Appendix B.

Lottery Tickets Are Close to the Pruning Solution

We train 5 different models using different seeds from both scratch (random) and LT initializations, the results of which are in Figs. 5b and 5e. These networks share the same pruning mask and therefore lie in the same solution space. We visualize distances between the initial and final points of these experiments in Figs. 5a and 5d using 2D Multi-dimensional Scaling (MDS) (Kruskal, 1964).

Lottery Tickets Are in the Pruning Solution Basin

Investigating paths between different solutions is a popular tool for understanding how various points in parameter space relate to each other in the loss landscape (Draxler et al., 2018; Evci et al., 2019; Fort et al., 2020; Frankle et al., 2020a; Garipov et al., 2018; Goodfellow et al., 2015). For example, Frankle et al. (2019b) use linear interpolations to show that LTs always go to the same basin ¶ when trained in different data orders. In Figs. 5c and 5f we look at the linear paths between the pruning solution and 4 other points: the LT initialization/solution and the random (scratch) initialization/solution. Each experiment is repeated 5 times with different random seeds, and mean values are provided with 80% confidence intervals.
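The linear-path analysis described above can be sketched as follows: evaluate the loss at evenly spaced points on the straight line between two parameter vectors and check whether it rises above both endpoints. The one-dimensional double-well loss, the barrier definition, and all names are assumptions chosen for illustration; they stand in for the network losses plotted in Figs. 5c and 5f.

```python
def interpolate(theta_a, theta_b, alpha):
    # Point on the straight line between two parameter vectors (alpha in [0, 1]).
    return [(1.0 - alpha) * a + alpha * b for a, b in zip(theta_a, theta_b)]

def loss_along_path(loss_fn, theta_a, theta_b, n_points=11):
    # Loss evaluated at evenly spaced points between theta_a and theta_b.
    alphas = [k / (n_points - 1) for k in range(n_points)]
    return [loss_fn(interpolate(theta_a, theta_b, a)) for a in alphas]

def barrier_height(path_losses):
    # Excess of the highest path loss over the higher of the two endpoints
    # (one possible definition of a loss barrier; an assumption of this sketch).
    return max(path_losses) - max(path_losses[0], path_losses[-1])

def double_well(theta):
    # Toy loss with two basins, at x = -1 and x = +1, separated by a barrier at x = 0.
    x = theta[0]
    return (x * x - 1.0) ** 2

# Two solutions in different basins: the linear path crosses the barrier.
across_basins = barrier_height(loss_along_path(double_well, [-1.0], [1.0]))
# A point already inside the target basin: the loss decreases monotonically.
same_basin = barrier_height(loss_along_path(double_well, [0.8], [1.0]))
print(across_basins, same_basin)
```

A barrier height of zero corresponds to the "linearly connected" case in the basin definition used in this section, while a positive value corresponds to a loss barrier between the two points.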
In both experiments we observe that the loss along the linear path between the LT initialization and the pruning solution decreases faster than along the path that originates from the scratch initialization. After training, the linear paths towards the pruning solution change drastically. The path from the scratch solution depicts a loss barrier; the scratch solution seems to be in a different basin than the pruning solution ||. In contrast, LTs are linearly connected to the pruning solution in both small and large-scale experiments, indicating that LTs have the same basin of attraction as the pruning solutions they are derived from. While it seems likely, these results do not, however, explicitly show that the LT and pruning solutions have learned similar functions.

¶ We define a basin as a set of points, each of which is linearly connected to at least one other point in the set.
|| This is not always true: it is possible that non-linear low-energy paths exist between two solutions (Draxler et al., 2018; Garipov et al., 2018), but searching for such paths is outside the scope of this work.

Table 2: Ensemble & Prediction Disagreement. We compare the function similarity (Fort et al., 2020) with the original pruning solution and ensemble generalization over 5 sparse models, trained from random initializations and LTs. As a baseline, we also show results for 5 pruned models trained from different random initializations. See Appendix F for the complete results.

Lottery Tickets Learn Similar Functions to the Pruning Solution

Fort et al. (2020) motivate deep ensembles by empirically showing that models starting from different random initializations typically learn different solutions, as compared to models trained from similar initializations. Here we adopt the analysis of Fort et al. (2020), but compare LT initializations and random initializations using fractional disagreement.
The fractional disagreement with the pruning solution is the fraction of class predictions over which the LT and scratch models disagree with the pruning solution they were derived from. In Table 2 we show the mean fractional disagreement over all pairs of models. We run two versions of scratch training: (1) Scratch (Diff. Init.), with different weight initializations and different data orders; and (2) Scratch, with the same weight initialization and different data orders. Each experiment is run with 5 different seeds. Finally, we restart training from the pruning solution (Prune Restart) using, again, 5 different data orders.

The results presented in Table 2 suggest that all 5 LT models converge on a solution almost identical to the pruning solution. Interestingly, the 5 LT models are even more similar to each other (Disagree. column) than to the pruning solution, possibly because they share an initialization and training is stable (Frankle et al., 2019b). The disagreement of the Prune Restart solutions with the original pruning solution matches the disagreement of the lottery solutions, showing the extent of the similarity between LT and pruning solutions. Our results show that having a fixed initialization alone cannot explain the low disagreement observed for the LT experiments, as the Scratch solutions obtain an average disagreement of 0.0316 despite using the same initialization, which is more than 7 times that of the LT solutions (0.0043).

Finding a different LT initialization is costly, whereas using a different initialization in Scratch (Diff. Init.) training is free, as the initializations are random. Using different initializations we can obtain more diverse solutions and thus achieve higher ensemble accuracy. As suggested by the analysis of Fort et al. (2020), ensembles of different solutions are more robust, and generalize better, than ensembles of similar solutions.
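The fractional disagreement measure used above can be sketched in a few lines. The prediction vectors below are invented for illustration and are not results from Table 2; the function names are likewise hypothetical.

```python
def fractional_disagreement(preds_a, preds_b):
    # Fraction of examples on which two models predict different classes.
    if len(preds_a) != len(preds_b) or not preds_a:
        raise ValueError("prediction lists must be non-empty and of equal length")
    return sum(a != b for a, b in zip(preds_a, preds_b)) / len(preds_a)

def mean_pairwise_disagreement(all_preds):
    # Mean fractional disagreement over all unordered pairs of models.
    pairs = [(i, j) for i in range(len(all_preds)) for j in range(i + 1, len(all_preds))]
    total = sum(fractional_disagreement(all_preds[i], all_preds[j]) for i, j in pairs)
    return total / len(pairs)

# Hypothetical class predictions on 8 test examples.
pruning_preds = [0, 1, 2, 1, 0, 2, 1, 0]
lt_preds = [0, 1, 2, 1, 0, 2, 1, 1]       # differs from the pruning solution once
scratch_preds = [0, 2, 2, 0, 0, 2, 1, 1]  # differs from the pruning solution three times

lt_vs_pruning = fractional_disagreement(lt_preds, pruning_preds)
scratch_vs_pruning = fractional_disagreement(scratch_preds, pruning_preds)
pairwise = mean_pairwise_disagreement([pruning_preds, lt_preds, scratch_preds])
print(lt_vs_pruning, scratch_vs_pruning, pairwise)
```

A low disagreement with the pruning solution means the two models make nearly the same predictions, which is the sense in which the text calls LT solutions similar in function space.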
An ensemble of 5 LT models with low disagreement doesn't significantly improve generalization as compared to an ensemble of 5 different pruning solutions with similar individual test accuracy. We further demonstrate these results by comparing the output probability distributions using the Kullback-Leibler Divergence (KL) and the Jensen-Shannon Divergence (JSD) in Appendix F.

Implications:

(a) Rewinding of LTs. Frankle et al. (2019b, 2020a) argued that LTs work when the training is stable, and thus converges to the same basin when trained with different data sampling orders. In §4.3, we show that this basin is the same one found by pruning, and since the training converges to the same basin as before, we expect to see limited gains from rewinding, if any. This is partially confirmed by Renda et al. (2020), who show that restarting the learning rate schedule from the pruning solution performs better than rewinding the weights.
(b) Transfer of LTs. Given the close relationship between LTs and pruning solutions, the observation that LTs trained on large datasets transfer to smaller ones, but not vice versa (Morcos et al., 2019; Sabatelli et al., 2020), can be explained by a common observation in transfer learning: networks trained on large datasets transfer to smaller ones.
(c) LT's Robustness to Perturbations. Frankle et al. (2020c) and Zhou et al. (2019) found that certain perturbations, like only using the signs of weights at initialization, do not impact LT generalization, while others, like shuffling the weights, do. Our results bring further insight to these observations: as long as the perturbation is small enough that a LT stays in the same basin of attraction, results will be as good as the pruning solution.
(d) Success of LTs. While it is exciting to see widespread applicability of LTs in different domains (Brix et al., 2020; Li et al., 2020; Venkatesh et al., 2020), the results presented in this paper suggest this success may be due to the underlying pruning algorithm (and transfer learning) rather than LT initializations themselves.

5. Conclusion

We attempted to answer the questions of (1) why training unstructured sparse networks from random initialization performs poorly; and (2) what makes Lottery Tickets (LTs) and Dynamic Sparse Training (DST) the exceptions. We identified that randomly initialized unstructured sparse Neural Networks (NNs) exhibit poor gradient flow when initialized naively, and proposed an alternative initialization that scales the initial variance for each neuron separately. Furthermore, we showed that modern sparse NN architectures are more sensitive to poor gradient flow during early training than to initialization alone. We observed that this is somewhat addressed by state-of-the-art DST methods, such as Rigged Lottery (RigL), which significantly improves gradient flow during early training over traditional sparse training methods. Finally, we showed that LTs do not improve gradient flow either at initialization or during training; rather, their success lies in effectively re-learning the original pruning solution they are derived from. We showed that a LT initialization resides within the same basin of attraction as the pruning solution and, furthermore, that when trained the LT solution learns a highly similar solution to the pruning solution. These findings suggest that LTs are fundamentally limited in their potential for improving the training of sparse NNs more generally.

Figure 6: Glorot/He Initialization for a Sparse NN. Panels: (a) Dense Layer (weight matrix; fan-in = 2, fan-out = 3); (b) Sparse Layer (weight matrix, mask, and per-neuron fan-in/fan-out). Glorot et al. (2010) and He et al. (2015) restrict the outputs of all neurons to be zero-mean and of unit variance.
All neurons in a dense NN layer (a) have the same fan-in/fan-out, whereas in a sparse NN (b) the fan-in/fan-out can differ for every neuron, potentially requiring sampling from a different distribution for every neuron. The fan-in matrix contains the values used in Eq. ( 1) for each neuron. A Glorot/He Initialization Generalized to Neural Networks with Heterogeneous Connectivity: Full Explanation/Derivation Here we derive the full generalized initialization for both the forwards/backwards cases (i.e. fan-in/fan-out), refer to Fig. 6 for an illustration of how the connectivity for the fan-in/fan-out cases are determined for each neuron. A.1 Generalized Glorot/He Initialization: Backwards, Forwards and Average Cases For every weight w [ ] ij ∈ W n [ ] ×n [ -1] in a layer with n [ ] neurons, connecting neuron i in layer to neuron j in layer ( -1) with n [ -1] neurons, and weight mask [m [ ] ij ]=M ∈[0,1] n [ ] ×n [ -1] , Glorot et al. ( ): w [ ] ij ∼N 0, 1 u He et al. (2015): w [ ] ij ∼N 0, 2 u where u=        fan-in [ ] i (forward) fan-out [ ] j (backward) fan-in [ ] i +fan-out [ ] j /2 (average) (2) where,

fan-in

[ ] i = n [ -1] j=1 m [ ] ij , fan-out [ ] j = n [ ] i=1 m [ ] ij , are the number of incoming and outgoing connections respectively. In the special case of a dense layer where m [ ] ij =1,∀i,j, Eq. ( 1) reduces to the initializations proposed by (Glorot et al., 2010; He et al., 2015 ) since fan-in [ ] i =n [ -1] ,∀i, and fan-out [ ] j =n [ ] ,∀j. A.2 Derivation: Fixed Mask, Forward Propagation Given a sparse NN, where the output of a neuron a i is given by, a [ ] i = f z [ ] i , where z [ ] i = n [ -1] j m [ ] ij w [ ] ij a [ -1] j , where m [ ] ij ∈ M [ ] and w [ ] ij ∈ W [ ] are the mask and weights respectively for layer , and a [ -1] j the output of the previous layer. Assume the mask [ -1] is an indicator matrix. M [ ] ∈ 1 n [ ] ×n [ -1] is constant, where 1 n [ ] ×n As in Glorot et al. (2010) we want to ensure Var(a [ ] i ) = Var(a [ -1] i ), and mean(a [ ] i ) = 0. Assume that f(x)≈x for x close to 0, e.g. in the case of f(x)=tanh(x), and that w [ ] ij and a [ -1] j are independent, Var(a [ -1] i )≈Var(z [ ] i ) (3) =Var   n [l-1] j=1 m [ ] ij w [ ] ij a [ -1] j   (4) = n [ -1] j=1 Var m [ ] ij w [ ] ij a [ -1] j (independent sum) (5) = n [ -1] j=1 m [ ] ij 2 Var w [ ] ij a [ -1] j ∵m [ ] ij is constant,Var(cX)=c 2 Var(X) (6) = n [ -1] j=1 m [ ] ij Var w [ ] ij a [ -1] j ∵m [ ] ij ∈[0,1], m [ ] ij 2 =m [ ] ij . (7) = n [ -1] j=1 m [ ] ij Var(w [ ] ij )Var(a [ -1] j ). (independent product) (8) Assume Var(w [ ] im ) = Var(w [ ] in ),∀n,m, i.e. the variance of all weights for a given neuron are the same, and Var(a [ -1] n ) = Var(a [ -1] m ), i.e. the variance of any of the outputs of the previous layer are the same. Therefore we can simplify Eq. ( 8), Var(a [ -1] i )= n [ -1] j=1 m [ ] ij Var(w [ ] ij )Var(a [ -1] j ) ] ij )Var(a [ -1] j ) n [ -1] j=1 m [ ] ij . 
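Let neuron $i$'s number of non-masked weights be denoted $\text{fan-in}^{[\ell]}_i$, where $\text{fan-in}^{[\ell]}_i = \sum_{j=1}^{n^{[\ell-1]}} m^{[\ell]}_{ij}$, then

\[
\mathrm{Var}(a^{[\ell-1]}_i) = \text{fan-in}^{[\ell]}_i\, \mathrm{Var}(w^{[\ell]}_{ij})\, \mathrm{Var}(a^{[\ell-1]}_j). \tag{11}
\]

Recall, $\mathrm{Var}(a^{[\ell-1]}_i) = \mathrm{Var}(a^{[\ell-1]}_j)$, so

\[
\mathrm{Var}(w^{[\ell]}_{ij}) = \frac{1}{\text{fan-in}^{[\ell]}_i}. \tag{12}
\]

Therefore, in order for the output of each neuron $a^{[\ell]}_i$ in layer $\ell$ to have unit variance and mean 0, we need to sample the weights for each neuron from the normal distribution

\[
[w^{[\ell]}_{ij}] \sim \mathcal{N}\left(0, \frac{1}{\text{fan-in}^{[\ell]}_i}\right). \tag{13}
\]

For the ReLU activation function, following the derivation in He et al. (2015),

\[
[w^{[\ell]}_{ij}] \sim \mathcal{N}\left(0, \frac{2}{\text{fan-in}^{[\ell]}_i}\right). \tag{14}
\]

A.3 Fixed Mask: Backward Pass

Given a sparse NN, where the output of a neuron $a_i$ is given by

\[
a^{[\ell]}_i = f\left(z^{[\ell]}_i\right),
\quad\text{where}\quad
z^{[\ell]}_i = \sum_{j=1}^{n^{[\ell-1]}} m^{[\ell]}_{ij}\, w^{[\ell]}_{ij}\, a^{[\ell-1]}_j,
\]

where $m^{[\ell]}_{ij} \in M^{[\ell]}$ and $w^{[\ell]}_{ij} \in W^{[\ell]}$ are the mask and weights respectively for layer $\ell$, and $a^{[\ell-1]}_j$ is the output of the previous layer. Assume the mask $M^{[\ell]} \in \mathbb{1}^{n^{[\ell]} \times n^{[\ell-1]}}$ is constant, where $\mathbb{1}^{n^{[\ell]} \times n^{[\ell-1]}}$ is an indicator matrix, and let $L(\theta)$, with $\theta = \{W^{[\ell]},\ \ell = 0 \ldots N\}$, be the loss we are optimizing.

As in Glorot et al. (2010), from the backward-propagation standpoint, we want to ensure $\mathrm{Var}\left(\frac{\partial L}{\partial z^{[\ell]}_i}\right) = \mathrm{Var}\left(\frac{\partial L}{\partial z^{[\ell-1]}_i}\right)$ and $\mathrm{mean}\left(\frac{\partial L}{\partial z^{[\ell]}_i}\right) = 0$. Assume that $f'(0) = 1$. Then

\[
\begin{aligned}
\mathrm{Var}\left(\frac{\partial L}{\partial z^{[\ell]}_j}\right) &\approx \mathrm{Var}\left(\frac{\partial L}{\partial a^{[\ell-1]}_j}\right) && (15)\\
&= \mathrm{Var}\left(\sum_{i=1}^{n^{[\ell]}} m^{[\ell]}_{ij}\, w^{[\ell]}_{ij}\, \frac{\partial L}{\partial z^{[\ell]}_i}\right) && (16)\\
&= \sum_{i=1}^{n^{[\ell]}} \mathrm{Var}\left(m^{[\ell]}_{ij}\, w^{[\ell]}_{ij}\, \frac{\partial L}{\partial z^{[\ell]}_i}\right) \quad \text{(independent sum)} && (17)\\
&= \sum_{i=1}^{n^{[\ell]}} \left(m^{[\ell]}_{ij}\right)^2 \mathrm{Var}\left(w^{[\ell]}_{ij}\, \frac{\partial L}{\partial z^{[\ell]}_i}\right) \quad \because m^{[\ell]}_{ij} \text{ is constant},\ \mathrm{Var}(cX) = c^2\,\mathrm{Var}(X) && (18)\\
&= \sum_{i=1}^{n^{[\ell]}} m^{[\ell]}_{ij}\, \mathrm{Var}\left(w^{[\ell]}_{ij}\, \frac{\partial L}{\partial z^{[\ell]}_i}\right) \quad \because m^{[\ell]}_{ij} \in \{0,1\},\ \left(m^{[\ell]}_{ij}\right)^2 = m^{[\ell]}_{ij} && (19)\\
&= \sum_{i=1}^{n^{[\ell]}} m^{[\ell]}_{ij}\, \mathrm{Var}(w^{[\ell]}_{ij})\, \mathrm{Var}\left(\frac{\partial L}{\partial z^{[\ell]}_i}\right) \quad \text{(independent product)} && (20)
\end{aligned}
\]

Assume $\mathrm{Var}(w^{[\ell]}_{mj}) = \mathrm{Var}(w^{[\ell]}_{nj}),\ \forall n,m$, i.e. the variance of all weights for a given neuron is the same, and $\mathrm{Var}\left(\frac{\partial L}{\partial z^{[\ell]}_n}\right) = \mathrm{Var}\left(\frac{\partial L}{\partial z^{[\ell]}_m}\right)$, i.e. the variance of the output gradients of each neuron at layer $\ell$ is the same. Then we can simplify Eq. (20),

\[
\begin{aligned}
\mathrm{Var}\left(\frac{\partial L}{\partial z^{[\ell]}_j}\right) &= \sum_{i=1}^{n^{[\ell]}} m^{[\ell]}_{ij}\, \mathrm{Var}(w^{[\ell]}_{ij})\, \mathrm{Var}\left(\frac{\partial L}{\partial z^{[\ell]}_i}\right) && (21)\\
&= \mathrm{Var}(w^{[\ell]}_{ij})\, \mathrm{Var}\left(\frac{\partial L}{\partial z^{[\ell]}_i}\right) \sum_{i=1}^{n^{[\ell]}} m^{[\ell]}_{ij}. && (22)
\end{aligned}
\]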
Let neuron $j$'s number of non-masked outgoing weights be denoted $\text{fan-out}^{[\ell]}_j$, where $\text{fan-out}^{[\ell]}_j = \sum_{i=1}^{n^{[\ell]}} m^{[\ell]}_{ij}$, then

\[
\mathrm{Var}\left(\frac{\partial L}{\partial z^{[\ell]}_j}\right) = \text{fan-out}^{[\ell]}_j\, \mathrm{Var}(w^{[\ell]}_{ij})\, \mathrm{Var}\left(\frac{\partial L}{\partial z^{[\ell]}_i}\right). \tag{23}
\]

Recall, $\mathrm{Var}\left(\frac{\partial L}{\partial z^{[\ell]}_j}\right) = \mathrm{Var}\left(\frac{\partial L}{\partial z^{[\ell]}_i}\right)$, so

\[
\mathrm{Var}(w^{[\ell]}_{ij}) = \frac{1}{\text{fan-out}^{[\ell]}_j}. \tag{24}
\]

Therefore, in order for the back-propagated gradient to keep the same variance from layer to layer, with mean 0, we need to sample the weights from the normal distribution

\[
[w^{[\ell]}_{ij}] \sim \mathcal{N}\left(0, \frac{1}{\text{fan-out}^{[\ell]}_j}\right), \tag{25}
\]

which is the backward case of Eq. (2). For the ReLU activation function, following the derivation in He et al. (2015),

\[
[w^{[\ell]}_{ij}] \sim \mathcal{N}\left(0, \frac{2}{\text{fan-out}^{[\ell]}_j}\right). \tag{26}
\]

The training hyper-parameters used in §4.3 are shared in Table 3. All experiments in this section start with a pruning experiment, after which the sparsity masks found by pruning are used to perform LT experiments. We use iterative magnitude pruning (Zhu et al., 2018) in our experiments, which is a well-studied and more efficient pruning method than the one used by Frankle et al. (2019a). Our pruning algorithm performs iterative pruning without rewinding the weights between intermediate steps and requires significantly fewer iterations. We expect our results would be even more pronounced with additional rewinding steps. We use SGD with momentum in all of our experiments. Scratch and Lottery experiments use the same hyper-parameters. Additional specific details of our experiments are shared below.
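To make the pruning procedure described above concrete, the following is a minimal sketch of gradual magnitude pruning without weight rewinding. It assumes the cubic sparsity schedule commonly associated with Zhu et al. (2018); the function names, the flat weight list and the rounding rule are illustrative assumptions, not the code used for the experiments.

```python
def target_sparsity(step, t_start, t_end, final_sparsity, initial_sparsity=0.0):
    """Sparsity to reach at `step` (assumed cubic ramp between t_start and t_end)."""
    if step <= t_start:
        return initial_sparsity
    if step >= t_end:
        return final_sparsity
    progress = (step - t_start) / (t_end - t_start)
    return final_sparsity + (initial_sparsity - final_sparsity) * (1.0 - progress) ** 3


def magnitude_mask(weights, sparsity):
    """Binary mask that zeroes the smallest-magnitude fraction `sparsity` of weights."""
    n_prune = int(round(sparsity * len(weights)))
    order = sorted(range(len(weights)), key=lambda k: abs(weights[k]))
    mask = [1] * len(weights)
    for k in order[:n_prune]:
        mask[k] = 0
    return mask


def prune_step(weights, step, t_start, t_end, final_sparsity):
    """Apply the mask to the current (already trained) weights; no rewinding."""
    mask = magnitude_mask(weights, target_sparsity(step, t_start, t_end, final_sparsity))
    return [w * m for w, m in zip(weights, mask)], mask
```

In a training loop, `prune_step` would be called every f iterations between t_start and t_end (the quantities listed in the hyper-parameter tables), with ordinary SGD updates on the surviving weights in between.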

LeNet5

We prune all layers of LeNet5 so that they reach 95% final sparsity (i.e. 95% of the parameters are zeros). We choose this sparsity because it is where we start observing stark differences between Lottery and Scratch in terms of performance. We set the weight decay to zero, similar to the MNIST experiments done in the original LT paper (Frankle et al., 2019a). Training hyper-parameters used for these experiments are shared in Table 4.

MNIST

In this setting, the hyper-parameters are almost the same as in §4.3, except we enable weight decay as it brings better generalization. We do a grid search over weight-decays = {0.001, 0.0001, 0.00005, 0.00001, 0.0005} and learning-rates = {0.1, 0.2, 0.05, 0.02, 0.01} and

D Fully Connected Neural Network Experiments on MNIST

In this section we repeat our experiments from §4.1, using different sparse initialization methods and analyzing gradient flow, for a standard fully-connected NN with 2 hidden layers of 300 and 100 units. We use the same grid as in the LeNet5 experiments for hyper-parameter selection. The best results were obtained with a learning rate of 0.2, a weight decay coefficient of 0.0001 and a mask update frequency of 500 (used in DST methods). The rest of the hyper-parameters remained unchanged from the LeNet5 experiments.

The results of training with various initialization methods are shown in Table 5. Although the results are not as drastic as with LeNet5, we see that here too sparsity-aware initialization (the proposed initialization, and that of Liu et al. (2019)) shows a significant improvement in the test accuracy of Scratch, and RigL or our proposed initialization, although not quite reaching Lottery or RigL accuracy. Finally, we see no significant effect on SET training, with none of the initialization variants showing a significant increase over any of the others, although the initialization of Liu et al. (2019) does marginally better.

The gradient flow of this model is shown in Fig. 9. While we see moderate improvements to Scratch gradient flow early on in training with our proposed initialization (a), RigL shows significantly higher gradient flow throughout training, in particular after mask updates (b), mirroring the results of LeNet5. The interpolation graphs in (c) differ only slightly from those of LeNet5, again showing that our results for LeNet5 broadly hold for the fully-connected model.

E Hessian Spectrum of LeNet5

Given a loss function $L$ and parameters $\theta$, we can write the first-order Taylor approximation of the change in loss $\Delta L = L(\theta_{t+1}) - L(\theta_t)$ after a single training step with learning rate $\alpha > 0$ as:

\[
\Delta L \approx -\alpha\, \nabla L(\theta)^T \nabla L(\theta). \tag{27}
\]

Note that as long as the approximation error is small, gradient descent is guaranteed to decrease the loss by an amount proportional to $\nabla L(\theta)^T \nabla L(\theta)$, which we refer to as the gradient flow. In practice large learning rates are used, and the first-order approximation might not be accurate. Instead we can look at the second-order approximation of $\Delta L$:

\[
\Delta L \approx -\alpha\, \nabla L(\theta)^T \nabla L(\theta) + \frac{\alpha^2}{2}\, \nabla L(\theta)^T H(\theta)\, \nabla L(\theta), \tag{28}
\]

where $H(\theta)$ is the Hessian of the loss function. The eigenvalue spectrum of the Hessian can help us understand the local landscape (Sagun et al., 2017) and identify optimization difficulties (Ghorbani et al., 2019). For example, if and when the gradient is aligned with large-magnitude eigenvalues, the second term of Eq. (28) can have a significant effect on the optimization of $L$. If the gradient is aligned with large positive eigenvalues, it can prevent gradient descent from decreasing the loss and harm the optimization. Similarly, if it is aligned with negative eigenvalues, it can help to accelerate optimization.

We show the Hessian spectrum before and after the topology updates in Fig. 10. After RigL updates we observe new negative eigenvalues with significantly larger magnitudes. We also see larger positive eigenvalues, which disappear after a few iterations **. In comparison, the effect of SET updates on the Hessian spectrum seems limited.

We also evaluate the Hessian spectrum of LeNet5 during training. In Fig. 11b, we observe similar shapes for each method on the positive side of the spectrum; on the negative side, however, dense models seem to have more mass. We plot the magnitude of the largest negative eigenvalue to characterize this behaviour in Fig. 11a. We observe a significant difference between sparse and dense models, and observe that sparse networks trained with RigL have larger negative eigenvalues.
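As a small worked illustration of the second-order approximation in this section, the sketch below uses a toy quadratic loss with a diagonal Hessian, for which the second-order expansion is exact. The loss, the step size and all function names are assumptions chosen for illustration; they are not taken from the experiments.

```python
def loss(theta, h):
    """Toy quadratic loss L(theta) = 0.5 * sum_k h[k] * theta[k]**2 (Hessian = diag(h))."""
    return 0.5 * sum(hk * tk * tk for hk, tk in zip(h, theta))


def grad(theta, h):
    """Gradient of the toy loss."""
    return [hk * tk for hk, tk in zip(h, theta)]


def predicted_change(theta, h, alpha):
    """Return (first-order, second-order) predictions of the change in loss."""
    g = grad(theta, h)
    first = -alpha * sum(gk * gk for gk in g)                              # gradient-flow term
    second = 0.5 * alpha ** 2 * sum(gk * hk * gk for gk, hk in zip(g, h))  # curvature term
    return first, first + second


def actual_change(theta, h, alpha):
    """Change in loss after one gradient descent step of size alpha."""
    g = grad(theta, h)
    new_theta = [tk - alpha * gk for tk, gk in zip(theta, g)]
    return loss(new_theta, h) - loss(theta, h)


h = [10.0, -1.0]  # one large positive and one negative eigenvalue
for theta in ([1.0, 0.0], [0.0, 1.0]):  # gradient aligned with each eigenvector in turn
    print(theta, predicted_change(theta, h, 0.25), actual_change(theta, h, 0.25))
```

With the gradient aligned with the large positive eigenvalue, the curvature term outweighs the gradient-flow term and the step increases the loss even though the first-order estimate is negative; aligned with the negative eigenvalue, the curvature term makes the decrease larger than the first-order estimate.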

F Comparing Function Similarity

Table 6 gives a full list of comparison metrics of the predictions on the test set for LeNet5 on MNIST and ResNet50 on ImageNet-2012; in particular, here we also compare the output probability distributions using relevant metrics.

** We share videos of these transitions in supplementary material.

Table 6: Ensemble/Prediction Disagreement. In order to show the function similarity of LTs to the pruning solution, we follow the analysis of Fort et al. (2020), and compare the function similarity and ensemble generalization over 5 sparse models trained using random initializations and LTs with the original pruning solution they are derived from. The fractional disagreement is the pairwise disagreement of class predictions over the test set, as compared within the group of sparse models, and as compared to the pruned model whose mask they were derived from. Kullback-Leibler Divergence (KL) and Jensen-Shannon Divergence (JSD) compare the prediction distributions over all the test samples.
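The metrics named in the Table 6 caption can be sketched as follows. This is a minimal illustration under assumptions: predictions are plain Python lists, KL and JSD are averaged per test sample, and a small `eps` guards against division by zero; how these quantities were aggregated for Table 6 is not specified here.

```python
import math


def fractional_disagreement(preds_a, preds_b):
    """Fraction of test samples on which two models predict different classes."""
    return sum(a != b for a, b in zip(preds_a, preds_b)) / len(preds_a)


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same classes."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetrised KL against the mixture m = (p + q) / 2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def mean_over_samples(metric, probs_a, probs_b):
    """Average a per-sample divergence over all test samples (assumed aggregation)."""
    return sum(metric(p, q) for p, q in zip(probs_a, probs_b)) / len(probs_a)
```

For example, `mean_over_samples(js_divergence, probs_lt, probs_pruned)` would compare the softmax outputs of a lottery ticket solution with those of the pruning solution, where `probs_lt` and `probs_pruned` are hypothetical lists of per-sample probability vectors.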



* We omit learning rate for simplicity.
* See Frankle et al. (2019b) for details.
* indicates element-wise multiplication, respecting the mask.
† Implementation of our sparse initialization, Hessian calculation and code for reproducing our experiments will be open sourced with the final version. Additionally we provide videos that show the evolution of the Hessian during training under different algorithms in the supplementary material.
‡ Models with BatchNorm and skip connections are less affected by initialization, and this is likely why the authors did not observe this effect.
§ Note: We use ReLU activation functions, unlike the original architecture (LeCun et al., 1989).



Figure 1: Glorot/He Initialization for a Sparse NN. All neurons in a dense NN layer (a) have the same fan-in, whereas in a sparse NN (b) the fan-in can differ for every neuron, potentially requiring sampling from a different distribution for every neuron. The initialization derivation and the fan-out variant are explained further in Appendix A.1. (c) Std. dev. of the pre-softmax output of LeNet5 with input sampled from a normal distribution, over 5 different randomly-initialized sparse NNs for a range of sparsities.

Figure 2: Gradient Flow of Sparse Models during Training. Gradient flow during training averaged over multiple runs, '+' indicates training runs with our proposed sparse initialization and Small Dense corresponds to training of a dense network with same number of parameters as the sparse networks. Lottery ticket runs for ResNet-50 include late-rewinding.

Figure 4: Lottery Tickets Are Biased Towards the Pruning Solution, Unlike Random Initialization. A cartoon illustration of the loss landscape of a sparse model, after it is pruned from a dense solution to create a LT sub-network. A lottery ticket initialization is within the basin of attraction of the pruned model's solution. In contrast, a random initialization is unlikely to be close to the dense solution's basin.

Figure 5: MDS Embeddings/L2 Distances: (a, d): 2D Multi-dimensional Scaling (MDS) embedding of sparse NNs with the same connectivity/mask; (b, e): the average L2-distance between a pruning solution and other derived sparse networks; (c, f): linear path between the pruning solution (α=1.0) and LT/scratch at both initialization, and solution (end of training). Top and bottom rows are for MNIST/LeNet5 and ImageNet-2012/ResNet-50 respectively.


Figure 8: Effect of Mask Updates in Dynamic Sparse Training. Effect of mask updates on the gradient norm. We measure the gradient norm before and after the mask updates and plot the ∆. '+' indicates the proposed initialization, which is used in the MNIST experiments.

Figure 9: Sparse 300-100 MLP experiments. Gradient flow during training averaged over multiple runs, '+' indicates training runs with our proposed sparse initialization.

* This is the pruning solution that the LT and scratch models are derived from.
** 5 pruning solutions found with different random initializations, one of which is the pruning solution above.
*** Here we compare 4 different pruned models with the pruning solution the LT/Scratch are derived from.

Figure 10: Hessian spectrum before and after mask updates: (left) SET, (right) RigL. Similar to Ghorbani et al. (2019), we estimate the spectral density of the Hessian using Gaussian kernels.

Results of Trained Sparse/Dense Models from Different Initializations. The initializations proposed in Eq. (1) (Ours) and Liu et al. (2019) consistently improve generalization over masked dense (Original), except for ResNet50. Note that VGG16 trained without a sparsity-aware initialization fails to converge in some instances. Baseline corresponds to the original dense architecture, whereas Small Dense corresponds to a smaller dense model with approximately the same parameter count as the sparse models.
55±1.03 63.19±0.26 63.13±0.15 72.93±0.27 72.77±0.27 72.56±0.14
RigL 80.82±34.74 98.14±0.17 98.13±0.09 37.15±26.20 63.69±0.02 63.56±0.06 74.41±0.05 74.38±0.10 74.38±0.01

embeddings. LeNet5/MNIST: In Fig. 5b, we provide the average L2 distance to the pruning solution at initialization (d init) and after training (d final). We observe that LT initializations start significantly closer to the pruning solution on average (d init = 13.61 vs. 17.46). After training, LTs end up more than 3× closer to the pruning solution compared to scratch. ResNet-50/ImageNet-2012: We observe similar results for ResNet-50/ImageNet-2012. LTs, again, start closer to the pruning solution, and solutions are 5× closer (d final = 39.35 vs. 215.98). With these observations, the non-random initial loss values for LT initialization first reported by Zhou et al. (2019) seem reasonable. LTs are biased towards the pruning solution they are derived from, but are they in the same basin?


§4.3: Experiment Details/Hyperparameters. Initial Learning Rate (LR), LR Schedule (Sched.), Batch size (Batch.), Momentum (m), Weight Decay (WD); t start, t end and f are the pruning starting iteration, end iteration, and mask update frequency respectively. The Step schedule has a linear warm-up in the first 5 epochs and decreases the learning rate by a factor of 10 at epochs 30, 70 and 90.

§4.1: Experiment Details/Hyperparameters. Initial Learning Rate (LR), LR Schedule (Sched.), Batch size (Batch.), Momentum (m), Weight Decay (WD), Initial Drop Fraction (Drop.); t end and f are the end iteration and mask update frequency respectively. The LeNet5+ row corresponds to the LeNet5 experiments with our sparse initialization, whereas LeNet5 is the regular masked initialization. The Step schedule has a linear warm-up in the first 5 epochs and decreases the learning rate by a factor of 10 at epochs 30, 70 and 90.

Results of Trained Fully-Connected MNIST Model from Different Initializations.


pick the values with top test accuracy. We use the masks found by pruning experiments in all of our MNIST experiments in this section to isolate the effect of the initialization. We simplify the update schedule of Dynamic Sparse Training (DST) methods such that they decay with the learning rate. This approach fits well, since the original decay function used in these experiments is the cosine decay, which is the same as our learning rate schedule. We scale the learning rate such that it matches the initial drop fraction provided. Mask update frequency and initial drop fraction are chosen from a grid search over {50, 100, 500} and {0.01, 0.1, 0.3} respectively. To allow fair comparison, we use Glorot scaling in all of our initializations (i.e. scale=1 and we average fan-in and fan-out values), as it is the default initialization for TensorFlow layers and our results show that it outperforms He initialization by a small margin with the hyper-parameters used. Using He initialization brings similar results.
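The sparsity-aware Glorot scaling mentioned above (scale=1, averaging fan-in and fan-out) corresponds to the "average" case of Eq. (2). A minimal sketch, assuming a single layer stored as nested Python lists and a hypothetical function name, could look like this:

```python
import math
import random


def sparse_glorot_init(mask, mode="average", scale=1.0, seed=0):
    """Sample masked weights with a per-connection variance scale / u.

    mask[i][j] is 1 if input j is connected to output neuron i, else 0.
    scale=1.0 gives Glorot-style variance, scale=2.0 He-style variance.
    """
    rng = random.Random(seed)
    n_out, n_in = len(mask), len(mask[0])
    fan_in = [sum(row) for row in mask]                                      # per output neuron i
    fan_out = [sum(mask[i][j] for i in range(n_out)) for j in range(n_in)]  # per input neuron j
    weights = [[0.0] * n_in for _ in range(n_out)]
    for i in range(n_out):
        for j in range(n_in):
            if not mask[i][j]:
                continue  # masked weights stay exactly zero
            if mode == "forward":
                u = fan_in[i]
            elif mode == "backward":
                u = fan_out[j]
            else:
                u = (fan_in[i] + fan_out[j]) / 2
            weights[i][j] = rng.gauss(0.0, math.sqrt(scale / u))
    return weights, fan_in, fan_out


mask = [[1, 0, 1, 1],
        [0, 1, 0, 0],
        [1, 1, 0, 0]]
weights, fan_in, fan_out = sparse_glorot_init(mask)
```

For a fully dense mask, every neuron has the same fan-in and fan-out, so the sketch falls back to the usual dense Glorot/He variance.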

Hessian calculation

The Hessian is calculated on the full training set using Hessian-vector products. We mask our network after each gradient call and calculate only non-zero rows. After calculating the full Hessian, we use numpy.linalg.eigh (van der Walt et al., 2011) to calculate the eigenvalues of the Hessian.
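The procedure above can be sketched on a toy problem as follows. This is an illustration under stated assumptions: the Hessian-vector product is approximated here by central finite differences of a user-supplied gradient function, which is a stand-in and not necessarily how the products were computed for the experiments, and `grad_fn`, `toy_grad` and the helper names are hypothetical.

```python
def masked_grad(grad_fn, theta, mask):
    """Gradient of the masked network: mask the parameters, then mask the gradient."""
    masked_theta = [t * m for t, m in zip(theta, mask)]
    return [g * m for g, m in zip(grad_fn(masked_theta), mask)]


def hvp(grad_fn, theta, mask, v, eps=1e-4):
    """Approximate the Hessian-vector product H v by central differences of the gradient."""
    plus = masked_grad(grad_fn, [t + eps * vi for t, vi in zip(theta, v)], mask)
    minus = masked_grad(grad_fn, [t - eps * vi for t, vi in zip(theta, v)], mask)
    return [(p - q) / (2 * eps) for p, q in zip(plus, minus)]


def masked_hessian(grad_fn, theta, mask):
    """Assemble the Hessian restricted to non-masked parameters, one column per HVP."""
    active = [k for k, m in enumerate(mask) if m]
    columns = []
    for k in active:
        basis = [1.0 if idx == k else 0.0 for idx in range(len(theta))]
        hv = hvp(grad_fn, theta, mask, basis)
        columns.append([hv[r] for r in active])  # keep only non-masked rows
    return [[columns[c][r] for c in range(len(active))] for r in range(len(active))]


def toy_grad(theta):
    """Gradient of the toy loss t0**2 + t0*t1 + 2*t1**2 + t2**4."""
    t0, t1, t2 = theta
    return [2 * t0 + t1, t0 + 4 * t1, 4 * t2 ** 3]


hessian = masked_hessian(toy_grad, [0.3, -0.2, 0.7], [1, 1, 0])
```

The resulting dense symmetric matrix over the non-masked parameters is what would then be passed to an eigensolver such as numpy.linalg.eigh.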

ImageNet-2012

In this setting, hyper-parameters are almost the same as in §4.3, except for the VGG16 architecture, where we use a smaller batch size and learning rate. For all DST methods, we use a cosine drop schedule (Dettmers et al., 2019) and the hyper-parameters proposed by Evci et al. (2019). For VGG, we reduce the mask update frequency and the initial drop fraction, as we observe better performance after doing a grid search over {50, 100, 500} and {0.1, 0.3, 0.5} respectively. We also use a non-uniform (ERK) sparsity distribution among layers as described in Evci et al. (2020), since we observed that it brings better performance.
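A cosine drop schedule of the kind referred to above can be sketched as below. The exact functional form, the handling of steps past the end iteration and the function names are assumptions made for illustration; the parameters mirror the initial drop fraction, end iteration and mask update frequency listed in the hyper-parameter tables.

```python
import math


def cosine_drop_fraction(step, initial_drop, t_end):
    """Fraction of active connections to drop at `step`, annealed to zero at t_end."""
    if step >= t_end:
        return 0.0
    return 0.5 * initial_drop * (1.0 + math.cos(math.pi * step / t_end))


def is_update_step(step, update_freq, t_end):
    """Mask updates happen every `update_freq` steps until t_end."""
    return 0 < step < t_end and step % update_freq == 0


def connections_to_drop(step, n_active, initial_drop, t_end):
    """Number of connections a DST method would drop (and regrow) at this update."""
    return int(cosine_drop_fraction(step, initial_drop, t_end) * n_active)
```

With an initial drop fraction of 0.3, the schedule starts by replacing 30% of the active connections at a mask update and replaces progressively fewer as training approaches the end iteration.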

C Additional Gradient Flow Plots

Here we share additional gradient flow figures for method/initialization combinations presented in 3 and 2.

In Fig. 7b, we show the gradient flow for DST methods and scratch training. Using RigL helps improve gradient flow with both initializations, helping learning to start earlier than regular Scratch training. Sparse Evolutionary Training (SET) seems to have limited effect on the gradient flow.

In Fig. 7b, we share gradient flow when the scaled initialization of Liu et al. (2019) is used. Similar to the proposed initialization, we observe improved gradient flow for all cases. Unlike with our initialization, however, RigL doesn't improve gradient flow in this setting, highlighting an interesting future research direction on the relationship between initialization and the DST methods.

In Fig. 7c, we share gradient flow when He initialization is used instead of Glorot initialization. We observe that Scratch training starts learning faster in this case. Gradient flow seems to be similar for other sparse initialization methods.

In Fig. 8, we share gradient flow improvements after DST updates on connectivity for different initialization methods. ResNet-50 curves match the results in Fig. 3. LeNet5 curves, however, seem to be adversely affected by poor initialization at the beginning of training. We start observing improvements with RigL when learning starts (around the 400th iteration).

