GRADIENT FLOW IN SPARSE NEURAL NETWORKS AND HOW LOTTERY TICKETS WIN

Abstract

Sparse Neural Networks (NNs) can match the generalization of dense NNs using a fraction of the compute/storage for inference, and have the potential to enable efficient training. However, naively training unstructured sparse NNs from random initialization results in significantly worse generalization, with the notable exceptions of Lottery Tickets (LTs) and Dynamic Sparse Training (DST). In this work, we attempt to answer: (1) why training unstructured sparse networks from random initialization performs poorly, and (2) what makes LTs and DST the exceptions? We show that sparse NNs have poor gradient flow at initialization and propose a modified initialization for unstructured connectivity. Furthermore, we find that DST methods significantly improve gradient flow during training over traditional sparse training methods. Finally, we show that LTs do not improve gradient flow; rather, their success lies in re-learning the pruning solution they are derived from. However, this comes at the cost of learning novel solutions.

1. Introduction

Deep Neural Networks (DNNs) are the state-of-the-art method for solving problems in computer vision, speech recognition, and many other fields. While early research in deep learning focused on application to new problems, or on pushing state-of-the-art performance with ever larger and more computationally expensive models, a broader focus has emerged towards their efficient real-world application. One such focus follows from the observation that only a sparse subset of dense connectivity is required for inference, as apparent in the success of pruning (Han et al., 2015; Mozer et al., 1989b). Pruning has a long history in the Neural Network (NN) literature, and remains the most popular approach for finding sparse NNs. Sparse NNs found by pruning algorithms (Han et al., 2015; Louizos et al., 2017; Molchanov et al., 2017; Zhu et al., 2018) (i.e. pruning solutions) can match dense NN generalization with much better efficiency at inference time. However, naively training an (unstructured) sparse NN from a random initialization (i.e. from scratch) typically leads to significantly worse generalization. Two methods in particular have shown some success at addressing this problem: Lottery Tickets (LTs) and Dynamic Sparse Training (DST). The mechanism behind the success of both of these methods, however, is not well understood: e.g., we don't know how to find LTs efficiently, while RigL (Evci et al., 2020), a recent DST method, requires 5× the training steps to match dense NN generalization. Only by understanding how these methods overcome the difficulty of sparse training can we improve upon them. A significant breakthrough in training DNNs, addressing vanishing and exploding gradients, arose from understanding gradient flow both at initialization and during training. In this work we investigate the role of gradient flow in the difficulty of training unstructured sparse NNs from random initializations and from LT initializations.
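To make the notion of gradient flow concrete: it can be measured as the norm of the loss gradient with respect to the weights. The following minimal NumPy sketch (an illustration, not the paper's experimental setup) applies a random binary mask to a linear layer initialized with standard dense fan-in scaling, and shows that the gradient norm collapses as sparsity grows; the layer sizes, loss, and mask distribution are our own illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def grad_norm(sparsity, n_in=256, n_out=256, n_samples=64):
    """Gradient norm of a masked linear layer y = (W * M) @ x
    under the toy loss L = 0.5 * ||y||^2 (manual backprop)."""
    # Standard dense-fan-in initialization, unaware of the mask.
    W = rng.normal(0.0, np.sqrt(1.0 / n_in), size=(n_out, n_in))
    # Random unstructured mask: a fraction `sparsity` of weights removed.
    M = (rng.random((n_out, n_in)) > sparsity).astype(float)
    x = rng.normal(size=(n_in, n_samples))
    y = (W * M) @ x                        # forward pass
    dW = (y @ x.T / n_samples) * M         # dL/dW; gradient only flows through the mask
    return np.linalg.norm(dW)

dense_norm = grad_norm(sparsity=0.0)
sparse_norm = grad_norm(sparsity=0.9)      # 90% of weights removed
print(dense_norm, sparse_norm)
```

With 90% of the weights removed, both the activations and the surviving gradient entries shrink, so the sparse layer's gradient norm is far below the dense one, which is the initialization-time symptom the paper studies.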
Our experimental investigation results in the following insights:

1. Sparse NNs have poor gradient flow at initialization. In §3.1 and §4.1 we show that existing methods for initializing sparse NNs are incorrect because they fail to account for heterogeneous connectivity. We believe we are the first to show that sparsity-aware initialization methods improve gradient flow and training.

2. Sparse NNs have poor gradient flow during training. In §3.2 and §4.2 we observe that, even in sparse NN architectures less sensitive to incorrect initialization, gradient flow during training is poor. We show that the DST methods achieving the best generalization have improved gradient flow.

3. Lottery Tickets don't improve upon (1) or (2); instead they re-learn the pruning solution. In §3.3 and §4.3 we show that an LT initialization resides within the same basin of attraction as the original pruning solution it is derived from, and that the LT solution is highly similar to the pruning solution in function space.
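The idea behind sparsity-aware initialization in insight (1) can be sketched as follows: instead of scaling each weight by the layer's dense fan-in, scale each output neuron's weights by its *actual* number of surviving incoming connections. This NumPy sketch is our own illustration of that principle (the sizes, sparsity level, and LeCun-style 1/fan-in scaling are assumptions, not the paper's exact method); it checks that per-neuron fan-in scaling preserves pre-activation variance where the mask-unaware initialization does not.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_out, n_samples = 512, 512, 4096
sparsity = 0.9

# Random unstructured mask and the actual fan-in of each output neuron.
M = (rng.random((n_out, n_in)) > sparsity).astype(float)
fan_in = np.maximum(M.sum(axis=1, keepdims=True), 1.0)  # guard fully-pruned rows

x = rng.normal(size=(n_in, n_samples))

# Mask-unaware init: scale by the dense fan-in n_in. Pre-activation
# variance shrinks roughly by (1 - sparsity).
W_naive = rng.normal(0.0, np.sqrt(1.0 / n_in), size=(n_out, n_in))
var_naive = ((W_naive * M) @ x).var()

# Sparsity-aware init: scale each row by its own surviving fan-in,
# restoring unit pre-activation variance.
W_aware = rng.normal(size=(n_out, n_in)) * np.sqrt(1.0 / fan_in)
var_aware = ((W_aware * M) @ x).var()

print(var_naive, var_aware)   # roughly 0.1 vs roughly 1.0
```

The naive variance sits near 1 - sparsity = 0.1, while the per-neuron scaling keeps it near 1, which is the signal-propagation argument for a connectivity-aware initialization.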

