KEEP THE GRADIENTS FLOWING: USING GRADIENT FLOW TO STUDY SPARSE NETWORK OPTIMIZATION

Anonymous

Abstract

Training sparse networks to converge to the same performance as dense neural architectures has proven to be elusive. Recent work suggests that initialization is the key. However, while this direction of research has had some success, focusing on initialization alone appears to be inadequate. In this paper, we take a broader view of training sparse networks and consider various choices made during training that might disadvantage sparse networks. We measure the gradient flow across different networks and datasets, and show that the default choices of optimizers, activation functions and regularizers used for dense networks can disadvantage sparse networks. Based upon these findings, we show that gradient flow in sparse networks can be improved by reconsidering aspects of the architecture design and the training regime. Our work suggests that initialization is only one piece of the puzzle and a wider view of tailoring optimization to sparse networks yields promising results.

1. INTRODUCTION

Over the last decade, a "bigger is better" race in the number of model parameters has gripped the field of machine learning (Amodei et al., 2018; Thompson et al., 2020) , primarily driven by overparameterized deep neural networks (DNNs). Additional parameters improve top-line metrics, but drive up the cost of training (Horowitz, 2014; Strubell et al., 2019; Hooker, 2020) and increase the latency and memory footprint at inference time (Warden & Situnayake, 2019; Samala et al., 2018; Lane & Warden, 2018) . Moreover, overparameterized networks have been shown to be more prone to memorization (Zhang et al., 2016) . To address some of these limitations, there has been a renewed focus on compression techniques that preserve top-line performance while improving efficiency. A large amount of research focus has centered on pruning, where weights estimated to be unnecessary are removed from the network at the end of training (Louizos et al., 2017; Wen et al., 2016; Cun et al., 1990; Hassibi et al., 1993a; Ström, 1997; Hassibi et al., 1993b; Zhu & Gupta, 2017; See et al., 2016; Narang et al., 2017) . Pruning has shown a remarkable ability to preserve top-line metrics of performance, even when removing the majority of weights (Hooker et al., 2019; Gale et al., 2019) . However, most pruning techniques still require training a large, overparameterized model before pruning a subset of weights. Due to the drawbacks of starting dense prior to introducing sparsity, there has been a recent focus on methods that allow networks which start sparse at initialization, to converge to similar performance as dense networks (Frankle & Carbin, 2018; Frankle et al., 2019b; Liu et al., 2018a) . These efforts have focused disproportionately on trying to understand the properties of initial sparse weight distributions that allow for convergence. However, while this work has had some success, focusing on initialization alone has proven to be inadequate (Frankle et al., 2020; Evci et al., 2019) . 
In this work, we take a broader view of why training sparse networks to converge to the same performance as dense networks has proven to be elusive. We reconsider many of the basic building blocks of the training process and ask whether they disadvantage sparse networks or not. Our work focuses on the behaviour of networks with random, fixed sparsity at initialization, and we aim to gain further intuition into how these networks learn. Furthermore, we provide tooling tailored to the analysis of these networks. In order to effectively study sparse network optimization in a controlled environment, we propose an experimental framework, Same Capacity Sparse vs Dense Comparison (SC-SDC). Contrary to most prior work comparing sparse to dense networks, where overparameterized dense networks are compared to smaller sparse networks, SC-SDC compares sparse networks to their equivalent capacity dense networks (same number of active connections and depth). This ensures that the results are a direct consequence of the sparse connections themselves and not due to having more or fewer weights (as is the case when comparing large, dense networks to smaller, sparse networks). We go beyond simply comparing top-line metrics by also measuring the impact of each intervention on gradient flow. Historically, exploding and vanishing gradients were a common problem in neural networks (Hochreiter et al., 2001; Hochreiter, 1991; Bengio et al., 1994; Glorot & Bengio, 2010; Goodfellow et al., 2016). Recent work has suggested that poor gradient flow is an exacerbated issue in sparse networks (Wang et al., 2020; Evci et al., 2020). To accurately measure gradient flow in sparse networks, we propose a normalized measure of gradient flow, which we term Effective Gradient Flow (EGF). This measure normalizes by the number of active weights and thus is better suited to studying the training dynamics of sparse networks.
We use this measure in conjunction with SC-SDC to see where sparse optimization fails and to consider where this failure could be a result of poor gradient flow.

Contributions Our contributions can be enumerated as follows:

1. Measuring effective gradient flow. We conduct large-scale experiments to evaluate the role of regularization, optimization and architecture choices on sparse models. We evaluate multiple datasets and architectures and propose a new measure of gradient flow, Effective Gradient Flow (EGF), which we show to be a stronger predictor of top-line metrics such as accuracy and loss than current gradient flow formulations.

2. Batch normalization plays a disproportionate role in stabilizing sparse networks. We show that batch normalization is more important for sparse networks than it is for dense networks, which suggests that gradient instability is a key obstacle to starting sparse.

3. Not all optimizers and regularizers are created equal. Weight decay and data augmentation can hurt sparse network optimization, particularly when used in conjunction with accelerating, adaptive optimization methods that use an exponentially decaying average of past squared gradients, such as Adam (Kingma & Ba, 2014) and RMSProp (Hinton et al., 2012). We show this is highly correlated with a high EGF (gradient flow) and how batch normalization helps stabilize EGF.

4. Changing activation functions can benefit sparse networks. We benchmark a wide set of activation functions, specifically ReLU (Nair & Hinton, 2010) and non-saturating activation functions such as PReLU (He et al., 2015), ELU (Clevert et al., 2015), SReLU (Jin et al., 2015), Swish (Ramachandran et al., 2017) and Sigmoid (Neal, 1992). Our results show that when using adaptive optimization methods, Swish is a promising activation function, while when using stochastic gradient descent, PReLU performs better than the other activation functions.

2.1. SAME CAPACITY SPARSE VS DENSE COMPARISON (SC-SDC)

SC-SDC can be summarized as follows (see Figure 1 for an overview):

1. Initialize

For a chosen network depth (number of layers) $L$ and a maximum network width $N_{MaxW}$, we compare sparse and dense networks at various widths, while ensuring they have the same parameter count. Initially, we mask the weights $\theta_S$ of sparse network $S$:

$$a^l_S = \theta^l_S \odot m^l, \quad a^l_D = \theta^l_D, \quad \text{for } l = 1, \dots, L, \quad (1)$$

where $\theta^l_S \odot m^l$ denotes an element-wise product of the weights $\theta_S$ of layer $l$ and the random binary matrix (mask) for layer $l$, $m^l$; $a^l_S$ is the nonzero weights in layer $l$ of sparse network $S$; and $a^l_D$ is the nonzero weights in layer $l$ of dense network $D$ (all the weights, since no masking occurs). For a fair comparison, we need to ensure the same number of nonzero weights for sparse network $S$ and dense network $D$ at each layer:

$$\|a^l_S\|_0 = \|a^l_D\|_0, \quad \text{for } l = 1, \dots, L \quad (2)$$

We provide more implementation details of how we achieve this in Appendix A.1.
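As a concrete sketch of this masking step (a minimal NumPy illustration with a hypothetical single layer; the `random_mask` helper and the shapes are our own assumptions, not the paper's code), the element-wise product and the per-layer L0 check look like:

```python
import numpy as np

def random_mask(shape, n_active, rng):
    # Binary mask with exactly n_active ones, placed uniformly at random.
    m = np.zeros(int(np.prod(shape)))
    m[rng.choice(m.size, size=n_active, replace=False)] = 1.0
    return m.reshape(shape)

rng = np.random.default_rng(0)

# Hypothetical example: one layer, dense width 2 vs sparse max width 4.
theta_dense = rng.standard_normal((3, 2))   # 6 weights, all active
theta_sparse = rng.standard_normal((3, 4))  # 12 weights, masked down to 6
mask = random_mask((3, 4), n_active=theta_dense.size, rng=rng)

a_dense = theta_dense           # a^l_D: no masking occurs
a_sparse = theta_sparse * mask  # a^l_S: element-wise product with the mask

# The L0 check: same number of nonzero weights in this layer.
assert np.count_nonzero(mask) == np.count_nonzero(a_dense)
```

The sparse weight matrix stays at the larger shape; only the count of active entries matches the dense layer.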

2. Match active weight distributions

Following prior work (Liu et al., 2018b; Gale et al., 2019), we ensure the nonzero weights at initialization of the sparse and dense networks are sampled from the same distribution at each layer:

$$a^l_S \sim P^l, \quad a^l_D \sim P^l, \quad \text{for } l = 1, \dots, L, \quad (3)$$

where $P^l$ refers to the initial weight distribution at layer $l$, for example Kaiming initialization (He et al., 2015). This ensures that both sets of active weights (sparse and dense) are initially sampled from the same distribution.
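For instance (a hedged sketch; the layer sizes and the `kaiming_normal` helper are illustrative assumptions, not the paper's code), drawing both networks' weights from the same Kaiming-normal distribution per layer might look like:

```python
import numpy as np

def kaiming_normal(shape, fan_in, rng):
    # He initialization: zero-mean normal with std sqrt(2 / fan_in)
    # (He et al., 2015).
    return rng.standard_normal(shape) * np.sqrt(2.0 / fan_in)

rng = np.random.default_rng(0)
fan_in = 3072  # e.g. the CIFAR input dimension

# Both networks draw their weights from the same P^l for this layer;
# the sparse network is simply wider before masking.
theta_dense = kaiming_normal((fan_in, 308), fan_in, rng)
theta_sparse = kaiming_normal((fan_in, 3076), fan_in, rng)
```

After masking, the surviving sparse weights are thus a random subsample of the same initial distribution the dense network uses.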

3. Train

We then train the sparse and dense networks for 1000 epochs (allowing for convergence).

4. Evaluate the better architecture

We gather the results across the widths/capacity levels and conduct a paired, one-tailed Wilcoxon signed-rank test (Wilcoxon, 1945) to evaluate the better architecture. Our null hypothesis ($H_0$) is that sparse networks have similar or worse test accuracy than dense networks (lower or the same median), while our alternative hypothesis ($H_1$) is that sparse networks have better test accuracy than dense networks of the same capacity (higher median). This can be formulated as:

$$H_0: \text{Sparse} \le \text{Dense}, \qquad H_1: \text{Sparse} > \text{Dense} \quad (4)$$

The goal of SC-SDC is to compare sparse and dense networks at the same capacity level. By same capacity, we refer to the same number of active weights, but other notions of capacity can also be used; we briefly discuss this in Appendix A.1.

Table 1: The average absolute Kendall rank correlation between different formulations of gradient flow and generalization. The subscript denotes the p-norm (l1 or l2 norm). EGF has a higher absolute correlation than standard gradient flow measures, and this is consistent across Fashion MNIST (see Appendix B.1).
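As an illustration of the evaluation step (the accuracy numbers below are hypothetical, and SciPy's `wilcoxon` is one standard implementation of this test; we cannot confirm the paper's exact tooling):

```python
from scipy.stats import wilcoxon

# Hypothetical paired test accuracies at five dense widths (not real results).
sparse_acc = [31.2, 33.5, 34.1, 34.8, 35.0]
dense_acc = [30.1, 33.0, 33.8, 34.2, 34.6]

# One-sided test of H1: sparse > dense, on paired samples.
stat, p = wilcoxon(sparse_acc, dense_acc, alternative="greater")
print(p < 0.05)  # every paired difference is positive, so H0 is rejected
```

With five pairs and all differences positive, the exact one-sided p-value is 1/32, below the usual 0.05 threshold.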

2.2. MEASURING GRADIENT FLOW

Gradient flow (GF) is used to study optimization dynamics and is typically approximated by taking the norm of the gradients of the network (Pascanu et al., 2013; Nocedal et al., 2002; Chen et al., 2018; Wang et al., 2020; Evci et al., 2020). We consider a feedforward neural network $f: \mathbb{R}^D \to \mathbb{R}$, with inputs $x \in \mathbb{R}^D$ and network weights $\theta$. The gradient norm is usually computed by concatenating all the gradients of a network into a single vector, $g = \frac{\partial C}{\partial \theta}$, where $C$ is the cost function, and then taking the vector norm:

$$gf_p = \|g\|_p, \quad (5)$$

where $p$ denotes the p-norm.

Effective Gradient Flow Traditional measures of gradient flow take the l1 or l2 norm of all the gradients (Chen et al., 2018; Pascanu et al., 2013; Evci et al., 2020). This is not appropriate for sparse networks, as it would include gradients of masked weights, which have no influence on the forward pass. Furthermore, computing the l1 or l2 norm across all weights in the network gives disproportionate influence to layers with more weights. We instead propose a simple modification of Equation 5, which we term Effective Gradient Flow (EGF), that averages the masked gradient norm (only gradients of active weights) across all layers. We calculate EGF as follows:

$$g^l = \frac{\partial C}{\partial \theta^l} \odot m^l, \quad \text{for } l = 1, \dots, L$$

$$egf_p = \frac{\sum_{l=1}^{L} \|g^l\|_p}{L}, \quad (6)$$

where $L$ is the number of layers and $\frac{\partial C}{\partial \theta^l} \odot m^l$ denotes an element-wise product of the gradients of layer $l$, $\frac{\partial C}{\partial \theta^l}$, and the mask $m^l$ applied to the weights of layer $l$. For a fully dense network, $m^l$ is a matrix of all ones, since no gradients are masked.
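A minimal NumPy sketch of Equation 6 (our own illustrative implementation, not the authors' code):

```python
import numpy as np

def effective_gradient_flow(grads, masks, p=2):
    # EGF: per-layer p-norm of the gradients of active (unmasked) weights,
    # averaged over layers so that large layers do not dominate (Equation 6).
    norms = [np.linalg.norm((g * m).ravel(), ord=p) for g, m in zip(grads, masks)]
    return float(np.mean(norms))

# Two-layer example: an all-ones mask behaves like a dense layer.
grads = [np.array([3.0, 4.0]), np.array([1.0, 2.0])]
masks = [np.ones(2), np.array([1.0, 0.0])]
print(effective_gradient_flow(grads, masks))  # (5.0 + 1.0) / 2 = 3.0
```

Note that the second layer's masked-out gradient (the 2.0) is excluded, and each layer contributes one norm to the average regardless of its size.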

EGF has the following favourable properties:

• Gradient flow is evenly distributed across layers. EGF weights each layer's gradient norm equally, preventing layers with many weights from dominating the measure and preventing layers with vanishing gradients from being hidden in the formulation, as is the case with Equation 5 (where all gradients are concatenated together).

• Only gradients of active weights are used. EGF ensures that for sparse networks, only gradients of active weights are used. Even though weights are masked, their gradients are not necessarily zero, since the partial derivative of the loss with respect to a weight is influenced by other weights and activations. Thus a weight can be zero while its gradient is nonzero.

• Possibility for application in gradient-based pruning methods. Tanaka et al. (2020) showed that gradient-based pruning methods like GRASP (Wang et al., 2020) and SNIP (Lee et al., 2018a) disproportionately prune large layers and are susceptible to layer-collapse, which is when an algorithm prunes all the weights in a specific layer. Because EGF is evenly distributed across layers, maintaining EGF (as opposed to the standard gradient norm) could be used as a pruning criterion. Furthermore, current approaches that measure or approximate the change in gradient flow during pruning in sparse networks (Wang et al., 2020; Evci et al., 2020; Singh Lubana & Dick, 2020) could benefit from this new formulation.

To evaluate EGF against other standard gradient norm measures, such as the l1 and l2 norm, we empirically compare these measures and their correlation to test loss and accuracy. We take the absolute average of the Kendall rank correlation (Kendall, 1938) across the different experiment configurations. We follow a similar approach to Jiang et al. (2019), but unlike their work, which focuses on correlating network complexity measures to the generalization gap, we measure the correlation of gradient flow to performance (accuracy and loss).
We measure gradient flow at 10 evenly spaced points throughout training, specifically at the end of epochs 0, 99, 199, 299, 399, 499, 599, 699, 799, 899 and 999. Our results in Table 1 show that EGF has a higher average absolute correlation to both test loss and accuracy. This is also true on Fashion MNIST (see Appendix B.1). Due to these comparative benefits, we use EGF for the remainder of the paper to measure the impact of interventions. We include all measures of gradient flow in Appendix B for completeness.
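The correlation computation can be sketched as follows (the readings are hypothetical; `scipy.stats.kendalltau` is a standard implementation, assumed here for illustration):

```python
from scipy.stats import kendalltau

# Hypothetical EGF and test-accuracy readings at five training checkpoints.
egf_values = [0.9, 0.7, 0.5, 0.4, 0.3]
test_accuracy = [20.0, 25.0, 30.0, 33.0, 35.0]

tau, p = kendalltau(egf_values, test_accuracy)
print(abs(tau))  # a perfectly monotone (decreasing) relationship gives 1.0
```

In the paper's protocol, the absolute value of such correlations would then be averaged across experiment configurations.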

2.3. ARCHITECTURE, NORMALIZATION, REGULARIZATION AND OPTIMIZER VARIANTS

We briefly describe our key experiment variants below, and also include for completeness all unique variants in Table 2 . Activation functions ReLU networks (Nair & Hinton, 2010 ) are known to be more resilient to vanishing gradients than networks that use Sigmoid or Tanh activations, since they only result in vanishing gradients when the input is less than zero, while on active paths, due to ReLU's linearity, the gradients flow uninhibited (Glorot et al., 2011) . Although most experiments are run on ReLU networks, we also explore different activation functions, namely PReLU (He et al., 2015) , ELU (Clevert et al., 2015) , Swish (Ramachandran et al., 2017) , SReLU (Jin et al., 2015) and Sigmoid (Neal, 1992) .

Batch normalization and Skip Connections

Other methods to help alleviate the vanishing gradient problem include the addition of skip connections (every two layers) (Srivastava et al., 2015; He et al., 2016) and batch normalization (Ioffe & Szegedy, 2015) . We empirically explore these methods.

Optimization and Regularization techniques

We explore the impact of popular regularization methods: weight decay/l2 regularization (0.0001) (Krogh & Hertz, 1992; Hanson & Pratt, 1989) and data augmentation (random crops and random horizontal flipping (Krizhevsky et al., 2012)). Furthermore, we benchmark the impact of the most widely used optimizers: minibatch stochastic gradient descent (with momentum (0.9) (Sutskever et al., 2013; Polyak, 1964) and without momentum (Robbins & Monro, 1951)), Adam (Kingma & Ba, 2014), Adagrad (Duchi et al., 2011) and RMSProp (Hinton et al., 2012).

SC-SDC MLP Setting

We use the SC-SDC empirical setting (Section 2.1) for all experiment variants. We train over 6000 MLPs for 1000 epochs and evaluate performance on CIFAR-10 and CIFAR-100 (Krizhevsky et al., 2009). We compare sparse and dense networks across various widths, depths, learning rates, regularization and optimization methods, as shown in Table 2. We choose a maximum network width $N_{MaxW}$ of $n + 4$, where $n$ is the input dimension of the network. In the case of CIFAR, $n = 3072$, and so our maximum width is $N_{MaxW} = 3076$. We repeat these experiments with one, two and four hidden layers, with the number of active weights in these networks ranging from 949,256 to 31,765,568. In Section 4, we discuss results achieved using four hidden layers on CIFAR-100; we provide the one and two hidden layer results in Appendix C.

Dense Width Following from SC-SDC, these networks are compared at various network widths, specifically widths of 308, 923, 1538, 2153 and 2768 (10%, 30%, 50%, 70% and 90% of our maximum width $N_{MaxW} = 3076$), as shown in Table 2. We use the term dense width to refer to the width of a network if that network were dense. For example, when comparing sparse and dense networks at a dense width of 308, the dense network has a width of 308, while the sparse network has a width of $N_{MaxW}$ (3076) but the same number of active connections as its dense counterpart. We provide a more detailed discussion of the choices made in the SC-SDC implementation in Appendix A.1.

Extended CNN Setting We also extend our experiments to Wide ResNet-50 (the WRN-28-10 variant) (Zagoruyko & Komodakis, 2016) and use the optimization and regularization configurations from that paper.
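As a sketch of where the stated active-weight range comes from (under our assumption of bias-free fully connected layers and 10 or 100 output classes for CIFAR-10/CIFAR-100 respectively; the helper name is ours):

```python
def dense_param_count(input_dim, width, n_hidden, n_classes):
    # Bias-free MLP: input layer, (n_hidden - 1) hidden-to-hidden layers,
    # and an output layer.
    return input_dim * width + (n_hidden - 1) * width * width + width * n_classes

# Smallest setting: one hidden layer, dense width 308, CIFAR-10 (10 classes).
print(dense_param_count(3072, 308, 1, 10))    # 949256
# Largest setting: four hidden layers, dense width 2768, CIFAR-100.
print(dense_param_count(3072, 2768, 4, 100))  # 31765568
```

Under these assumptions the two endpoints reproduce the 949,256 to 31,765,568 range quoted above; the sparse counterparts use width 3076 but are masked down to the same counts.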

4.1. COMPARISON OF DENSE AND SPARSE INTERVENTIONS USING SC-SDC

In this section, we use the results of the Wilcoxon signed-rank test from SC-SDC to identify which optimization choices are currently well suited to sparse networks and which are not. Furthermore, for each variant, we also measure the gradient flow using EGF ($egf_2$, Equation 6), as described in the previous section. Our main findings show that: 1. Batch normalization is critical to training sparse networks, more so than it is for dense networks. This suggests that gradient instability is a key obstacle for sparse optimization. 2. Weight decay (with and without batch normalization) and data augmentation (without batch normalization) can hurt both sparse and dense network optimization. This particularly occurs when using accelerated, adaptive optimization methods that use an exponentially decaying average of past squared gradients, such as Adam and RMSProp (Ruder, 2016). In these methods, large EGF (gradient flow) strongly correlates with poor test accuracy. 3. Non-saturating activation functions, such as Swish (Ramachandran et al., 2017) and PReLU (He et al., 2015), achieve promising results in both the sparse and dense regimes. However, these results are more statistically significant for sparse networks, which could motivate the use of similar activation functions for training sparse networks.

Batch normalization plays a disproportionate role in stabilizing sparse networks Batch normalization ensures that the distribution of the nonlinearity inputs remains stable as the network trains, which was hypothesized to help stabilize gradient propagation (gradients do not explode or vanish) (Ioffe & Szegedy, 2015). These findings follow from Table 3a, Figure 12 and Figure 10a. From Table 3a and Figure 10a, we see that methods such as L2 and data augmentation usually favour dense networks (apart from Adam and RMSProp, when using data augmentation). However, with the addition of batch normalization (Table 3b, L2 to L2 BN and DA to DA BN), these methods favour the sparse variants.
This is especially apparent in Figure 10a, where batch normalization improves performance across all sparse optimizers, while resulting in a lower, more stable EGF. This further emphasizes the importance of stabilizing gradient flow, particularly in sparse networks.

Weight decay and data augmentation can hurt sparse network optimization When we take a closer look at the effects of weight decay and data augmentation on sparse network accuracy (Figure 2, Figure 13), we see that weight decay (even with batch normalization) drastically decreases accuracy when used with adaptive optimization methods that use an exponentially decaying average of past squared gradients (Adam and RMSProp). Furthermore, it results in distinctively larger EGF values, which hints that Adam and RMSProp are more sensitive to large gradient norms than other optimizers. This agrees with Loshchilov & Hutter (2017), who proposed a different formulation of weight decay for adaptive methods, since the current L2 regularization formulation for adaptive methods could lead to weights with large gradients being regularized less, although this was not experimentally verified. In the context of data augmentation, we see poor test accuracy when it is used without batch normalization (Figure 10a). If used with batch normalization (Figure 2), it results in a lower EGF and the best test accuracy. This further emphasizes the need to stabilize gradient flow and how EGF can be used to this end.

The potential of non-saturating activation functions: Swish and PReLU We also explore the effect of different activation functions on sparse network optimization. For the activation function variants, the best configuration for each optimizer was chosen. For Adagrad, Adam and RMSProp we use BN, SC and DA, while for SGD, we use BN, SC, L2 and DA.
From Table 3c , we see that Swish, PReLU and Sigmoid favour sparse architectures, but from the performance results from Figure 12 , we see that only Swish and PReLU are viable activation choices. We continue to see a consistent trend for adaptive methods (most notably in Adam and RMSProp), that higher EGF values, for example in SReLU, correspond to poor performance (Figure 11b ), while promising methods result in a lower EGF value (such as Swish). This further emphasizes how EGF can be used to guide advances in network optimization.

4.2. GENERALIZATION OF RESULTS ACROSS ARCHITECTURE TYPES.

In this section, we move on from SC-SDC and extend our results to Wide ResNet-50. We note from Figure 3 that most of our results from SC-SDC also hold on larger, more complicated models. We see that L2 regularization (even with batch normalization) hurts performance for adaptive methods (Adagrad and Adam) and also results in higher EGF values (Figure 14). Furthermore, we see that data augmentation is beneficial when used with batch normalization. Finally, we see that Swish is a promising activation function for adaptive methods and leads to lower EGF (Figure 14). This shows that the SC-SDC results are not constrained to small-scale experiments and that SC-SDC can be used to learn about the dynamics of larger, more complicated networks.

5. RELATED WORK

Pruning at Initialization Methods that prune at initialization aim to start sparse, instead of first pre-training an overparameterized network and then pruning. These methods use certain criteria to estimate, at initialization, which weights should remain active. These criteria include connection sensitivity (Lee et al., 2018b), gradient flow (via the Hessian-vector product) (Wang et al., 2020) and conservation of synaptic saliency (Tanaka et al., 2020). Another branch of pruning is Dynamic Sparse Training, which uses information gathered during the training process to dynamically update the sparsity pattern of sparse networks (Mostafa & Wang, 2019; Bellec et al., 2017; Mocanu et al., 2018; Dettmers & Zettlemoyer, 2019; Evci et al., 2019). While our work is motivated by the same goal of allowing networks to start sparse and converge to the same performance as dense networks, we instead focus on the impact of optimization and regularization choices on sparse networks.

Sparse Network Optimization as Pruning Criteria Optimization in sparse networks has often been neglected in favour of studying network initialization. However, some work has looked at sparse network optimization from different perspectives, mainly as a guide for pruning criteria. This includes using gradient information (Mozer & Smolensky, 1989; LeCun et al., 1989; Hassibi & Stork, 1992; Karnin, 1990), approximations of gradient flow (Wang et al., 2020; Dettmers & Zettlemoyer, 2019; Evci et al., 2020) and the Neural Tangent Kernel (NTK) (Liu & Zenke, 2020) to guide the introduction of sparsity.
Sparse Network Optimization to study Network Dynamics Apart from use as pruning criteria, optimization information has been used to investigate aspects of sparse networks, such as their loss landscape (Evci et al., 2019) , how they are impacted by SGD noise (Frankle et al., 2019a) , the effect of different activation functions (Dubowski, 2020) and their weight initialization (Lee et al., 2019) . Our work differs from these approaches as we consider more aspects of the optimization and regularization process in a controlled experimental setting (SC-SDC), while using EGF to reason about some of the results.

6. CONCLUSION AND FUTURE WORK

In this work, we take a wider view of sparse optimization strategies and introduce appropriate tooling (EGF, SC-SDC) to measure the impact of architecture and optimization choices on sparse networks. Our results show that weight decay and data augmentation can hurt optimization when adaptive optimization methods are used, and that this usually corresponds to a much higher EGF. Furthermore, we show that batch normalization is critical to training sparse networks, more so than it is for dense networks, as it helps stabilize gradient flow. We also show the potential of non-saturating activation functions, such as Swish and PReLU, for sparse networks. Finally, we show that our results extend to more complicated models like Wide ResNet-50.

A SC-SDC

In this section, we provide more information about SC-SDC and its benefits.

A.1 SC-SDC IMPLEMENTATION DETAILS

Wilcoxon Signed Rank Test This is a non-parametric test that compares dependent or paired samples, without assuming that the differences between the paired experiments are normally distributed (McDonald, 2009; Demšar, 2006).

Random Sparsity Our work focuses on the training dynamics of random, sparse networks. This ensures that what is learned is not dependent on a specific pruning method, but rather can be used to better understand sparse training in general. Going forward, it would be interesting to explore these dynamics on pruned networks. We achieve random sparsity by generating a random mask for each layer and multiplying the weights by this mask during each forward pass. The sparsity is distributed evenly across the network. For example, a 20% sparse MLP has 20% of the weights remaining in each layer.

Dense Width A critical component of how we specify our experiments is a term we define as dense width. In order to fairly compare sparse and dense networks, we need them to have the same number of active connections at each depth. In the case of sparse networks, this means ensuring they have the same number of active connections as the dense networks, while remaining sparse. Dense width refers to the width of a network if that network were dense. This process of comparing sparse and dense networks at different dense widths is illustrated in Figure 5.

Fair comparison of Sparse and Dense networks

As can be seen from Figure 5, SC-SDC ensures the exact same active parameter count, but the sparse networks will be connected to more neurons. It is possible that this increased number of activations can give sparse networks higher representational power; however, most work on the expressivity of neural networks looks at this from a depth perspective and proves that networks of certain depths are universal approximators (Eldan & Shamir, 2016; Hornik et al., 1989; Funahashi, 1989). To this end, we ensure these networks have the same depth, but we believe an interesting direction going forward would be to ensure they have a similar number of active neurons.

SC-SDC comparison details

For completeness, we provide more details of how we ensure sparse and dense networks have the same capacity. Following from Equation 2, to ensure the same number of weights in sparse and dense networks, we ensure they have the same number of active weights at each layer:

$$\|a^l_S\|_0 = \|a^l_D\|_0, \quad \text{for } l = 1, \dots, L$$

This is achieved by masking each of the weight layers of sparse network $S$:

$$a^l_S = \theta^l_S \odot m^l, \quad \text{for } l = 1, \dots, L,$$

where $m^l$ is a random binary matrix (mask) for layer $l$, such that $\|m^l\|_0 = \|a^l_D\|_0$, with $\|a^l_D\|_0$ determined by the chosen capacity at which the networks are compared. For SC-SDC, we need a maximum network width $N_{MaxW}$ and a comparison width $N_W$. We choose a maximum network width $N_{MaxW}$ of $n + 4$, where $n$ is the input dimension of the network. In the case of CIFAR, $n = 3072$, and so our maximum width is $N_{MaxW} = 3076$. The choice of $n + 4$ follows from Lu et al. (2017), where the authors prove a universal approximation theorem for width-bounded ReLU networks, with width bounded by $n + 4$. Our comparison width $N_W$ is equivalent to the dense widths we vary in our experiments: 308, 923, 1538, 2153 and 2768. The dimensions of each layer are as follows:

1. First layer: $\theta^1_D \in \mathbb{R}^{I \times N_W}$, $\theta^1_S \in \mathbb{R}^{I \times N_{MaxW}}$, $m^1 \in \{0,1\}^{I \times N_{MaxW}}$

2. Intermediate layers: $\theta^{\{2,\dots,L-1\}}_D \in \mathbb{R}^{N_W \times N_W}$, $\theta^{\{2,\dots,L-1\}}_S \in \mathbb{R}^{N_{MaxW} \times N_{MaxW}}$, $m^{\{2,\dots,L-1\}} \in \{0,1\}^{N_{MaxW} \times N_{MaxW}}$ (11)

3. Final layer: $\theta^L_D \in \mathbb{R}^{N_W \times O}$, $\theta^L_S \in \mathbb{R}^{N_{MaxW} \times O}$, $m^L \in \{0,1\}^{N_{MaxW} \times O}$,

where $N_{MaxW}$ is the maximum width of the sparse layers, $N_W$ is the comparison width, $I$ is the input dimension, $O$ is the output dimension, $L$ is the number of layers in the network, $\theta^l_S$ is the weights in layer $l$ of sparse network $S$ and $\theta^l_D$ is the weights in layer $l$ of dense network $D$. This process would be the same for convolutional layers, but with a third dimension to handle the different channels.
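These shapes and the per-layer mask construction can be sketched as follows (the helper names are ours, and small widths are used for clarity rather than the paper's actual dimensions):

```python
import numpy as np

def layer_shapes(i, n_w, n_maxw, n_hidden, o):
    # Weight shapes for the dense (width N_W) and sparse (width N_MaxW) nets:
    # first layer, (n_hidden - 1) intermediate layers, final layer.
    dense = [(i, n_w)] + [(n_w, n_w)] * (n_hidden - 1) + [(n_w, o)]
    sparse = [(i, n_maxw)] + [(n_maxw, n_maxw)] * (n_hidden - 1) + [(n_maxw, o)]
    return dense, sparse

def layer_masks(dense_shapes, sparse_shapes, rng):
    # One random binary mask per sparse layer, with exactly as many ones as
    # the matching dense layer has weights (||m^l||_0 = ||a^l_D||_0).
    masks = []
    for d, s in zip(dense_shapes, sparse_shapes):
        m = np.zeros(s[0] * s[1])
        m[rng.choice(m.size, size=d[0] * d[1], replace=False)] = 1.0
        masks.append(m.reshape(s))
    return masks

# Toy example: input dim 6, dense width 2, max width 4, 2 hidden layers, 3 classes.
dense, sparse = layer_shapes(6, 2, 4, 2, 3)
masks = layer_masks(dense, sparse, np.random.default_rng(0))
for d, m in zip(dense, masks):
    # Per-layer active counts match between sparse and dense networks.
    assert int(np.count_nonzero(m)) == d[0] * d[1]
```

The same construction scales directly to the paper's widths (e.g. $N_W = 308$, $N_{MaxW} = 3076$).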
In Figure 4, we provide an illustrative example showing how sparse and dense networks are compared fairly. The benefits of SC-SDC can be summarized as follows:

• We can better understand sparse network optimization. SC-SDC allows us to identify which optimization or regularization methods are poorly suited to sparse networks in a controlled setting, ensuring the results are a direct consequence of the sparse connections themselves.

• We learn at what parameter and size budget sparse networks are better than dense ones. Comparing sparse and dense networks of the same capacity allows us to see which architecture is better at different configurations. In configurations where sparse architectures perform better, we could exploit advances in sparse matrix computation and storage (Zhao et al., 2018; Merrill & Garland, 2016) and simply default to sparse architectures.

We extend our experiments to Fashion MNIST (Xiao et al., 2017), a dataset that is distinctively different from the CIFAR datasets used in Section 2.2. We ran 450 experiments on networks with four hidden layers, using a learning rate of 0.001, for 500 epochs. We varied the configurations as follows:

• Optimizers: Adagrad, Adam and SGD with momentum.
• Regularization methods: no regularization, batch normalization, skip connections, l2 (0.0001) and data augmentation.

From Table 4, we see that, out of the gradient flow formulations, EGF still correlates best with generalization performance.

For completeness, we present the full set of results using the different formulations of gradient flow on CIFAR-100. Namely, we show $\|g\|_1$ (Equation 5) (Figure 6), $\|g\|_2$ (Equation 5) (Figure 7), $egf_1$ (Equation 6) (Figure 8) and $egf_2$ (Equation 6) (Figure 9).

In this section, we present the detailed results for our experiments: 1. Detailed results with a low learning rate (0.001). 2. Detailed results with a high learning rate (0.1). 3. Results for different activation functions.



Footnote: We measure EGF at 10 points throughout training and take the average.



Figure 1: Same Capacity Sparse vs Dense Comparison (SC-SDC)

Figure 2: (a) Test accuracy (upper image) and (b) gradient flow (lower image) for dense and sparse MLPs with four hidden layers and a large learning rate (0.1) on CIFAR-100, across different regularization methods and promising activations. NR: no regularization, BN: batchnorm, SC: skip connections, DA: data augmentation, L2: weight decay. The results for all optimizers can be found in Figure 13.

Figure 3: WideResNet50 Test Accuracy on CIFAR-100. The density ranges from 1% to 100%. The gradient flow results can be found in Figure 14.

Figure 4: Fair comparison of sparse and dense neural networks

Figure 5: Comparing sparse and dense neural network fairly at different widths

Figure 6: Gradient Flow in CIFAR-100 using ||g|| 1

Figure 7: Gradient Flow in CIFAR-100 using ||g|| 2

Figure 8: Gradient Flow in CIFAR-100 using egf 1

Figure 9: Gradient Flow in CIFAR-100 using egf 2

Figure 13: Effect of Different Interventions on Accuracy and Gradient Flow for Dense and Sparse Networks on CIFAR-100, with large learning rate (0.1), across all optimizers. (a) Test Accuracy for Dense and Sparse Networks on CIFAR-100.

Figure 19: Test Accuracy for CIFAR-100 with 0.1 Learning Rate

Implications. Our work is timely, as sparse training dynamics are poorly understood. Most training algorithms and methods have been developed to suit dense networks. Our work provides insight into the nature of sparse optimization and suggests that a wider viewpoint beyond initialization is necessary for sparse networks to converge to performance comparable to dense ones. Our proposed approach provides a more accurate measurement of the training dynamics of sparse networks and can be used to inform future work on the design of networks and optimization techniques tailored explicitly to sparsity.

The average correlation between gradient flow measures and generalization performance
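A correlation of this kind can be computed per configuration and then averaged. The sketch below uses Pearson correlation on hypothetical (EGF, test accuracy) pairs; the numbers are illustrative placeholders, not values from the paper.

```python
from scipy.stats import pearsonr

# Hypothetical (EGF, test accuracy) pairs across runs (illustrative only).
egf = [0.8, 1.3, 2.1, 2.9, 3.5]
test_acc = [21.0, 24.5, 26.1, 28.9, 30.2]

# Pearson correlation between the gradient-flow measure and accuracy.
r, p_value = pearsonr(egf, test_acc)
```

Rank correlations (e.g. Spearman's) would serve equally well if only the monotonic relationship matters.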

Different network configurations for sparse and dense comparisons

From Tables 3a and 3b, we see that batch normalization is statistically more important for sparse network performance than it is for dense networks, across most configurations and learning rates.

Wilcoxon Signed Rank Test results for ReLU networks with four hidden layers, trained on CIFAR-100, using different learning rates. We use a p-value threshold of 0.05; bold values indicate configurations where sparse networks perform better than dense networks in a statistically significant manner (reject H0 from Equation 4), while non-bold values indicate that dense networks may have the same or better test accuracy in that configuration. The performance results for these networks are presented in Figure
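A test of this form can be run with a paired Wilcoxon signed-rank test. The accuracies below are hypothetical placeholders for one configuration, not values from the paper.

```python
from scipy.stats import wilcoxon

# Hypothetical per-seed test accuracies (sparse vs dense) for one
# configuration; the numbers are placeholders, not from the paper.
sparse_acc = [29.9, 30.4, 29.1, 30.8, 29.6]
dense_acc = [27.3, 28.0, 27.9, 26.5, 27.7]

# One-sided alternative: sparse accuracy exceeds dense accuracy.
stat, p = wilcoxon(sparse_acc, dense_acc, alternative="greater")
reject_h0 = p < 0.05  # significant at the 0.05 threshold used in the table
```

With only five seeds the exact distribution is used, so the smallest attainable one-sided p-value is 1/32.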

The average correlation between gradient flow measures and generalization performance for FMNIST. Columns: Measure, Correlation to Test Loss, Correlation to Test Accuracy.

Test Accuracy summary for CIFAR-10 with low learning rate (0.001)

Test Loss summary for CIFAR-10 with low learning rate (0.001)

Test Accuracy summary for CIFAR-100 with low learning rate (0.001). (Per-optimizer mean +/- std values for Adam, RMSProp, SGD, and SGD with momentum (0.9); the table's column structure is not recoverable here.)

Test Loss summary for CIFAR-100 with low learning rate (0.001). (Per-optimizer mean +/- std values; the table's column structure is not recoverable here.)

Test Accuracy summary for CIFAR-10 with high learning rate (0.1)

