KEEP THE GRADIENTS FLOWING: USING GRADIENT FLOW TO STUDY SPARSE NETWORK OPTIMIZATION

Anonymous

Abstract

Training sparse networks to converge to the same performance as dense neural architectures has proven to be elusive. Recent work suggests that initialization is the key. However, while this direction of research has had some success, focusing on initialization alone appears to be inadequate. In this paper, we take a broader view of training sparse networks and consider various choices made during training that might disadvantage sparse networks. We measure the gradient flow across different networks and datasets, and show that the default choices of optimizers, activation functions and regularizers used for dense networks can disadvantage sparse networks. Based upon these findings, we show that gradient flow in sparse networks can be improved by reconsidering aspects of the architecture design and the training regime. Our work suggests that initialization is only one piece of the puzzle and a wider view of tailoring optimization to sparse networks yields promising results.

1. INTRODUCTION

Over the last decade, a "bigger is better" race in the number of model parameters has gripped the field of machine learning (Amodei et al., 2018; Thompson et al., 2020), primarily driven by overparameterized deep neural networks (DNNs). Additional parameters improve top-line metrics, but drive up the cost of training (Horowitz, 2014; Strubell et al., 2019; Hooker, 2020) and increase the latency and memory footprint at inference time (Warden & Situnayake, 2019; Samala et al., 2018; Lane & Warden, 2018). Moreover, overparameterized networks have been shown to be more prone to memorization (Zhang et al., 2016). To address some of these limitations, there has been a renewed focus on compression techniques that preserve top-line performance while improving efficiency. A large amount of research has centered on pruning, where weights estimated to be unnecessary are removed from the network at the end of training (Louizos et al., 2017; Wen et al., 2016; Cun et al., 1990; Hassibi et al., 1993a; Ström, 1997; Hassibi et al., 1993b; Zhu & Gupta, 2017; See et al., 2016; Narang et al., 2017). Pruning has shown a remarkable ability to preserve top-line metrics of performance, even when removing the majority of weights (Hooker et al., 2019; Gale et al., 2019). However, most pruning techniques still require training a large, overparameterized model before pruning a subset of weights.

Due to the drawbacks of starting dense before introducing sparsity, there has been a recent focus on methods that allow networks that start sparse at initialization to converge to similar performance as dense networks (Frankle & Carbin, 2018; Frankle et al., 2019b; Liu et al., 2018a). These efforts have focused disproportionately on trying to understand the properties of initial sparse weight distributions that allow for convergence. However, while this work has had some success, focusing on initialization alone has proven to be inadequate (Frankle et al., 2020; Evci et al., 2019).
In this work, we take a broader view of why training sparse networks to converge to the same performance as dense networks has proven to be elusive. We reconsider many of the basic building blocks of the training process and ask whether they disadvantage sparse networks or not. Our work focuses on the behaviour of networks with random, fixed sparsity at initialization and we aim to gain further intuition into how these networks learn. Furthermore, we provide tooling tailored to the analysis of these networks.
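The central diagnostic here, gradient flow through a network with a fixed random sparsity mask, can be illustrated with a small numpy sketch. The two-layer ReLU architecture, helper names, and densities below are illustrative assumptions, not the paper's implementation; gradient flow is measured simply as the norm of the gradient restricted to the surviving (unmasked) weights.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_sparse_layer(fan_in, fan_out, density):
    """He-initialized weights with a fixed random binary mask applied at init."""
    w = rng.standard_normal((fan_in, fan_out)) * np.sqrt(2.0 / fan_in)
    mask = (rng.random((fan_in, fan_out)) < density).astype(w.dtype)
    return w * mask, mask

def masked_gradients(x, y, w1, m1, w2, m2):
    """One forward/backward pass of MSE loss; gradients of pruned weights stay zero."""
    h_pre = x @ w1
    h = np.maximum(h_pre, 0.0)            # ReLU
    out = h @ w2
    d_out = 2.0 * (out - y) / len(x)      # dL/d_out for mean-squared error
    g2 = (h.T @ d_out) * m2               # mask zeroes gradients of pruned weights
    d_h = (d_out @ w2.T) * (h_pre > 0.0)  # backprop through ReLU
    g1 = (x.T @ d_h) * m1
    return g1, g2

# Measure gradient flow at initialization for a ~90%-sparse two-layer network.
w1, m1 = make_sparse_layer(32, 64, density=0.1)
w2, m2 = make_sparse_layer(64, 1, density=0.1)
x = rng.standard_normal((128, 32))
y = rng.standard_normal((128, 1))
g1, g2 = masked_gradients(x, y, w1, m1, w2, m2)
flow = np.linalg.norm(g1) + np.linalg.norm(g2)
```

Tracking `flow` across layers and training steps is one simple way to compare how default choices (optimizer, activation, regularizer) affect sparse versus dense networks.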

