KEEP THE GRADIENTS FLOWING: USING GRADIENT FLOW TO STUDY SPARSE NETWORK OPTIMIZATION

Anonymous

Abstract

Training sparse networks to converge to the same performance as dense neural architectures has proven to be elusive. Recent work suggests that initialization is the key. However, while this direction of research has had some success, focusing on initialization alone appears to be inadequate. In this paper, we take a broader view of training sparse networks and consider various choices made during training that might disadvantage sparse networks. We measure the gradient flow across different networks and datasets, and show that the default choices of optimizers, activation functions and regularizers used for dense networks can disadvantage sparse networks. Based upon these findings, we show that gradient flow in sparse networks can be improved by reconsidering aspects of the architecture design and the training regime. Our work suggests that initialization is only one piece of the puzzle and a wider view of tailoring optimization to sparse networks yields promising results.

1. INTRODUCTION

Over the last decade, a "bigger is better" race in the number of model parameters has gripped the field of machine learning (Amodei et al., 2018; Thompson et al., 2020), primarily driven by overparameterized deep neural networks (DNNs). Additional parameters improve top-line metrics, but drive up the cost of training (Horowitz, 2014; Strubell et al., 2019; Hooker, 2020) and increase the latency and memory footprint at inference time (Warden & Situnayake, 2019; Samala et al., 2018; Lane & Warden, 2018). Moreover, overparameterized networks have been shown to be more prone to memorization (Zhang et al., 2016). To address some of these limitations, there has been a renewed focus on compression techniques that preserve top-line performance while improving efficiency. A large amount of research has centered on pruning, where weights estimated to be unnecessary are removed from the network at the end of training (Louizos et al., 2017; Wen et al., 2016; Cun et al., 1990; Hassibi et al., 1993a; Ström, 1997; Hassibi et al., 1993b; Zhu & Gupta, 2017; See et al., 2016; Narang et al., 2017). Pruning has shown a remarkable ability to preserve top-line metrics of performance, even when removing the majority of weights (Hooker et al., 2019; Gale et al., 2019). However, most pruning techniques still require training a large, overparameterized model before pruning a subset of weights. Due to the drawbacks of starting dense prior to introducing sparsity, there has been a recent focus on methods that allow networks that start sparse at initialization to converge to similar performance as dense networks (Frankle & Carbin, 2018; Frankle et al., 2019b; Liu et al., 2018a). These efforts have focused disproportionately on trying to understand the properties of initial sparse weight distributions that allow for convergence. However, while this work has had some success, focusing on initialization alone has proven to be inadequate (Frankle et al., 2020; Evci et al., 2019).
In this work, we take a broader view of why training sparse networks to converge to the same performance as dense networks has proven to be elusive. We reconsider many of the basic building blocks of the training process and ask whether they disadvantage sparse networks. Our work focuses on the behaviour of networks with random, fixed sparsity at initialization, and we aim to gain further intuition into how these networks learn. Furthermore, we provide tooling tailored to the analysis of these networks. In order to effectively study sparse network optimization in a controlled environment, we propose an experimental framework, Same Capacity Sparse vs Dense Comparison (SC-SDC). Contrary to most prior work comparing sparse to dense networks, where overparameterized dense networks are compared to smaller sparse networks, SC-SDC compares sparse networks to dense networks of equivalent capacity (same number of active connections and same depth). This ensures that the results are a direct consequence of the sparse connections themselves and not of having more or fewer weights (as is the case when comparing large, dense networks to smaller, sparse networks).

We go beyond simply comparing top-line metrics by also measuring the impact of each intervention on gradient flow. Historically, exploding and vanishing gradients were a common problem in neural networks (Hochreiter et al., 2001; Hochreiter, 1991; Bengio et al., 1994; Glorot & Bengio, 2010; Goodfellow et al., 2016). Recent work has suggested that poor gradient flow is an exacerbated issue in sparse networks (Wang et al., 2020; Evci et al., 2020). To accurately measure gradient flow in sparse networks, we propose a normalized measure of gradient flow, which we term Effective Gradient Flow (EGF); this measure normalizes by the number of active weights and is thus better suited to studying the training dynamics of sparse networks.
We use this measure in conjunction with SC-SDC to identify where sparse optimization fails and to consider where that failure could be a result of poor gradient flow.

Contributions Our contributions can be enumerated as follows:

1. Measuring effective gradient flow. We conduct large-scale experiments to evaluate the role of regularization, optimization and architecture choices on sparse models. We evaluate multiple datasets and architectures and propose a new measure of gradient flow, Effective Gradient Flow (EGF), that we show to be a stronger predictor of top-line metrics such as accuracy and loss than current gradient flow formulations.

2. Batch normalization plays a disproportionate role in stabilizing sparse networks. We show that batch normalization is more important for sparse networks than it is for dense networks, which suggests that gradient instability is a key obstacle to starting sparse.

3. Not all optimizers and regularizers are created equal. Weight decay and data augmentation can hurt sparse network optimization, particularly when used in conjunction with accelerating, adaptive optimization methods that use an exponentially decaying average of past squared gradients, such as Adam (Kingma & Ba, 2014) and RMSProp (Hinton et al., 2012). We show this is highly correlated with high EGF (gradient flow) and that batch normalization helps stabilize EGF.

4. Changing activation functions can benefit sparse networks. We benchmark a wide set of activation functions, specifically ReLU (Nair & Hinton, 2010) and non-saturating activation functions such as PReLU (He et al., 2015), ELU (Clevert et al., 2015), SReLU (Jin et al., 2015), Swish (Ramachandran et al., 2017) and Sigmoid (Neal, 1992). Our results show that Swish is a promising activation function when using adaptive optimization methods, while PReLU performs better than the other activation functions when using stochastic gradient descent.

An overview of SC-SDC is given in Figure 1.
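The "random, fixed sparsity at initialization" setting studied throughout the paper can be sketched as follows. This is a generic illustration of the training regime, not the paper's implementation: a binary mask is drawn once before training and applied to both the weights and their gradients on every step, so removed connections neither fire nor receive updates.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_fixed_mask(shape, sparsity):
    """Binary mask with a `sparsity` fraction of weights permanently removed.

    The mask is drawn once at initialization and never changes during
    training (in contrast to pruning at the end of training).
    """
    n = int(np.prod(shape))
    n_active = n - int(round(sparsity * n))
    flat = np.zeros(n)
    flat[rng.choice(n, size=n_active, replace=False)] = 1.0
    return flat.reshape(shape)

# 90%-sparse layer: mask the initial weights, then mask every update.
mask = random_fixed_mask((256, 128), sparsity=0.9)
w = rng.standard_normal((256, 128)) * mask
grad = rng.standard_normal((256, 128))  # stand-in for a backprop gradient
w -= 0.1 * (grad * mask)                # masked SGD step
```

Because the mask multiplies the gradient as well as the weights, gradient flow through the network is restricted to the surviving connections, which is precisely what EGF is designed to measure.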



Implications Our work is timely, as sparse training dynamics are poorly understood. Most training algorithms and methods have been developed to suit dense networks. Our work provides insight into the nature of sparse optimization and suggests that a viewpoint wider than initialization alone is necessary for sparse networks to converge to performance comparable to dense networks. Our proposed approach provides a more accurate measurement of the training dynamics of sparse networks and can be used to inform future work on the design of networks and optimization techniques tailored explicitly to sparsity.

2. SAME CAPACITY SPARSE VS DENSE COMPARISON

Our goal is to measure which architecture and optimization choices favor sparse networks relative to dense networks. To fairly compare sparse and dense networks, we propose Same Capacity Sparse vs Dense Comparison (SC-SDC), a simple framework which allows us to study sparse network optimization and identify which training configurations are not well suited to sparse networks.
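The capacity-matching arithmetic behind SC-SDC can be sketched for a simple two-hidden-layer MLP. This is an illustrative calculation under assumed architectures (the paper's actual networks and layer shapes are not reproduced here): given a sparse MLP of hidden width `hidden` at a target sparsity, we solve for the width of a dense MLP of the same depth whose parameter count equals the sparse network's number of active connections.

```python
import math

def matched_dense_width(d_in, d_out, hidden, sparsity):
    """Dense hidden width matching the active connections of a sparse MLP.

    The sparse network has layer sizes [d_in, hidden, hidden, d_out] with a
    `sparsity` fraction of weights removed. Its active-connection count is
    (1 - sparsity) * (d_in*hidden + hidden^2 + hidden*d_out). We solve the
    quadratic h^2 + (d_in + d_out) h - active = 0 for the dense width h.
    """
    active = (1.0 - sparsity) * (d_in * hidden + hidden * hidden + hidden * d_out)
    b = d_in + d_out
    h = (-b + math.sqrt(b * b + 4.0 * active)) / 2.0
    return int(round(h))

# e.g. a 90%-sparse MLP with hidden width 1024 on 784-dim inputs, 10 classes:
width = matched_dense_width(784, 10, 1024, sparsity=0.9)
```

With these (assumed) numbers, the dense equivalent has hidden width 189; any performance gap between the two networks can then be attributed to the sparse connectivity pattern rather than to a difference in parameter count.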

