GRADIENT FLOW IN SPARSE NEURAL NETWORKS AND HOW LOTTERY TICKETS WIN

Abstract

Sparse Neural Networks (NNs) can match the generalization of dense NNs using a fraction of the compute/storage for inference, and have the potential to enable efficient training. However, naively training unstructured sparse NNs from random initialization results in significantly worse generalization, with the notable exceptions of Lottery Tickets (LTs) and Dynamic Sparse Training (DST). In this work, we attempt to answer: (1) why does training unstructured sparse networks from random initialization perform poorly; and (2) what makes LTs and DST the exceptions? We show that sparse NNs have poor gradient flow at initialization and propose a modified initialization for unstructured connectivity. Furthermore, we find that DST methods significantly improve gradient flow during training over traditional sparse training methods. Finally, we show that LTs do not improve gradient flow; rather, their success lies in re-learning the pruning solution they are derived from. However, this comes at the cost of learning novel solutions.

1. Introduction

Deep Neural Networks (DNNs) are the state-of-the-art method for solving problems in computer vision, speech recognition, and many other fields. While early research in deep learning focused on applying DNNs to new problems, or on pushing state-of-the-art performance with ever larger and more computationally expensive models, a broader focus has since emerged on their efficient real-world application. One such focus stems from the observation that only a sparse subset of a network's dense connectivity is required for inference, as apparent in the success of pruning (Han et al., 2015; Mozer et al., 1989b). Pruning has a long history in the Neural Network (NN) literature, and remains the most popular approach for finding sparse NNs. Sparse NNs found by pruning algorithms (Han et al., 2015; Louizos et al., 2017; Molchanov et al., 2017; Zhu et al., 2018), i.e. pruning solutions, can match dense NN generalization with much better efficiency at inference time. However, naively training an (unstructured) sparse NN from a random initialization (i.e. from scratch) typically leads to significantly worse generalization. Two methods in particular have shown some success at addressing this problem: Lottery Tickets (LTs) and Dynamic Sparse Training (DST). However, the mechanism behind the success of both methods is not well understood; e.g. we do not know how to find LTs efficiently, while RigL (Evci et al., 2020), a recent DST method, requires 5× the training steps to match dense NN generalization. Only by understanding how these methods overcome the difficulty of sparse training can we improve upon them. A significant breakthrough in training DNNs, addressing vanishing and exploding gradients, arose from understanding gradient flow both at initialization and during training. In this work we investigate the role of gradient flow in the difficulty of training unstructured sparse NNs from random initializations and from LT initializations.
Our experimental investigation results in the following insights:

1. Sparse NNs have poor gradient flow at initialization. In §3.1 and §4.1 we show that existing methods for initializing sparse NNs are incorrect in not accounting for heterogeneous connectivity. We believe we are the first to show that sparsity-aware initialization methods improve gradient flow and training.

2. Sparse NNs have poor gradient flow during training. In §3.2 and §4.2, we observe that even in sparse NN architectures less sensitive to incorrect initialization, gradient flow during training is poor. We show that the DST methods achieving the best generalization have improved gradient flow.

3. Lottery Tickets do not improve upon (1) or (2); instead they re-learn the pruning solution. In §3.3 and §4.3 we show that a LT initialization resides within the same basin of attraction as the pruning solution it is derived from, and that a LT solution is highly similar to the pruning solution in function space.
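To illustrate the idea behind a sparsity-aware initialization (insight 1), the following is a minimal NumPy sketch. The function name `sparse_he_init`, the He-style gain of 2 (appropriate for ReLU), and the per-unit scaling are our illustrative assumptions, not the paper's exact method; the point is that each unit's weights are scaled by its actual fan-in under the sparse mask, rather than by the dense layer's fan-in.

```python
import numpy as np

def sparse_he_init(mask, rng):
    """He-style initialization using each unit's actual fan-in under `mask`.

    `mask` is a (fan_out, fan_in) binary connectivity matrix. Standard dense
    initialization scales all weights by the dense fan-in, which mis-scales
    units with heterogeneous (sparse) connectivity; here each output unit is
    scaled by its own number of incoming connections, preserving the
    per-unit pre-activation variance.
    """
    fan_in_per_unit = mask.sum(axis=1, keepdims=True)  # nonzero inputs per unit
    fan_in_per_unit = np.maximum(fan_in_per_unit, 1)   # guard against dead units
    std = np.sqrt(2.0 / fan_in_per_unit)               # He std, computed per unit
    w = rng.standard_normal(mask.shape) * std          # broadcast per-row scaling
    return w * mask                                    # zero out pruned weights

rng = np.random.default_rng(0)
mask = np.ones((2, 40000))
mask[1, ::2] = 0                  # unit 1 keeps only half of its connections
w = sparse_he_init(mask, rng)
```

A dense-style initialization would give both rows the same standard deviation; here the sparser unit receives a larger one, compensating for its reduced fan-in.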

2. Related Work

Pruning.  Pruning is commonly used in the Neural Network (NN) literature to obtain sparse networks (Castellano et al., 1997; Hanson et al., 1988; Kusupati et al., 2020; Mozer et al., 1989a,b; Setiono, 1997; Sietsma et al., 1988; Wortsman et al., 2019). Pruning algorithms remove connections of a trained dense network using various criteria, including weight magnitude (Han et al., 2015, 2016; Zhu et al., 2018), gradient-based measures (Molchanov et al., 2016), and 2nd-order terms based on the Hessian (Hassibi et al., 1993; LeCun et al., 1990). While the majority of pruning algorithms focus on pruning after training, a subset focuses on pruning NNs before training (Lee et al., 2019; Tanaka et al., 2020; Wang et al., 2020). Gradient Signal Preservation (GraSP) (Wang et al., 2020) is particularly relevant to our study, since its pruning criterion aims to preserve gradient flow, and its authors observe a positive correlation between initial gradient flow and final generalization.

Lottery Tickets.  Frankle et al. (2019a) showed the existence of sparse sub-networks at initialization, known as Lottery Tickets, which can be trained to match the generalization of the corresponding dense Deep Neural Network (DNN). This initial work inspired much follow-up. Gale et al. (2019) and Liu et al. (2019) observed that the original formulation was not applicable to larger networks with higher learning rates; Frankle et al. (2019b, 2020a) proposed late rewinding as a solution. Morcos et al. (2019) and Sabatelli et al. (2020) showed that Lottery Tickets (LTs) found on large datasets transfer to smaller ones, but not vice versa. Frankle et al. (2020c), Ramanujan et al. (2019), and Zhou et al. (2019) focused on further understanding LTs and on finding sparse sub-networks at initialization. However, recent work by Frankle et al. (2020b) suggests that the reported gains are due to the sparsity distributions discovered rather than the particular sub-networks; another limitation of these algorithms is that they do not scale to large tasks such as ResNet-50 training on ImageNet-2012. As one might expect, sufficiently large networks have smaller solutions hidden within them; Malach et al. (2020) studied this and proved the existence of such solutions in sufficiently large networks. However, it remains an open question whether such networks can be found at initialization more efficiently than with existing pruning algorithms.

Dynamic Sparse Training.  Most training algorithms work on pre-determined architectures and optimize parameters using fixed learning schedules. Dynamic Sparse Training (DST), on the other hand, optimizes the sparse NN connectivity jointly with the model parameters. Mocanu et al. (2018) and Mostafa et al. (2019) propose replacing low-magnitude parameters with random connections and report improved generalization. Dettmers et al. (2019) proposed using momentum values, whereas Evci et al. (2020) used gradient estimates directly to guide the selection of new connections, reporting results on par with pruning algorithms. In §4.2 we study these algorithms and try to understand the role of gradient flow in their success.

Random Initialization of Sparse NNs.  When training sparse NNs from scratch, the vast majority of pre-existing work has used the common initialization methods (Glorot et al., 2010; He et al., 2015) derived for dense NNs, with only a few notable exceptions: Gale et al. (2019), Liu et al. (2019), and Ramanujan et al. (2019) scaled the initialization variance (fan-in/fan-out) of a sparse NN layer according to the layer's sparsity, effectively using the standard initialization of a small dense layer with the same number of weights as the sparse layer.

3. Analyzing Gradient Flow in Sparse Neural Networks

A significant breakthrough in training very deep NNs arose from addressing the vanishing and exploding gradient problem, both at initialization and during training. The problem was understood by analyzing the signal propagation within a DNN, and addressed with improved initialization methods (Glorot et al., 2010; He et al., 2015; Xiao et al., 2018) alongside normalization methods such as Batch Normalization (BatchNorm) (Ioffe et al., 2015). In our work, following Wang et al. (2020), we study these problems using the gradient flow, ∇L(θ)ᵀ∇L(θ), which is the first-order approximation* of the decrease in the loss expected after a gradient step. We observe poor gradient flow for the predominant sparse NN initialization strategy and propose a solution in §3.1. Then, in §3.2 and §3.3, we summarize Dynamic Sparse Training (DST) methods and the LT hypothesis, respectively.

* We omit the learning rate for simplicity.
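To make the gradient-flow quantity concrete, here is a minimal NumPy sketch on a toy least-squares problem; the model and all names are our own illustration. It checks that ∇L(θ)ᵀ∇L(θ), scaled by the learning rate, predicts the decrease in loss after a gradient step to first order.

```python
import numpy as np

# Toy model: L(theta) = 0.5 * ||X @ theta - y||^2.
# Gradient flow is grad(L)^T grad(L) = ||grad||^2; to first order, a gradient
# step with learning rate lr decreases the loss by lr * ||grad||^2.
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 10))
y = rng.standard_normal(50)
theta = rng.standard_normal(10)

def loss(t):
    r = X @ t - y
    return 0.5 * r @ r

grad = X.T @ (X @ theta - y)    # exact gradient of the quadratic loss
gradient_flow = grad @ grad     # the quantity studied in this work

lr = 1e-5                       # small step, so the first-order term dominates
actual_drop = loss(theta) - loss(theta - lr * grad)
predicted_drop = lr * gradient_flow
print(actual_drop, predicted_drop)  # nearly equal for small lr
```

When gradient flow is small, the predicted (and actual) loss decrease per step is small, which is why poor gradient flow makes sparse NNs slow to train.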

