MEMORY EFFICIENT DYNAMIC SPARSE TRAINING

Anonymous

Abstract

The excessive memory and energy consumption of modern Artificial Neural Networks (ANNs) poses limitations on the machines that can run these models. Sparsification of ANNs is often motivated by time, memory, and energy savings only during model inference, yielding no benefits during training. A growing body of work is now focusing on providing the benefits of model sparsification during training as well. While these methods improve the energy efficiency during training, the algorithms yielding the most accurate models still have a peak memory usage on the same order as the dense model. We propose a Dynamic Sparse Training (DST) algorithm that reduces the peak memory usage during training while preserving the energy advantages of sparsely trained models. We evaluate our algorithm on CIFAR-10/100 using ResNet-56 and VGG-16 and compare it against a range of sparsification methods. The benefits of our method are twofold: first, it allows a given model to be trained to an accuracy on par with the dense model while requiring significantly less memory and energy; second, the savings in memory and energy can be allocated towards training an even larger sparse model on the same machine, generally improving the accuracy of the model.

1. INTRODUCTION

Artificial Neural Networks (ANNs) are currently the most prominent machine learning method because of their superiority in a broad range of applications, including computer vision (O'Mahony et al., 2019; Voulodimos et al., 2018; Guo et al., 2016a), natural language processing (Otter et al., 2020; Young et al., 2018), and reinforcement learning (Schrittwieser et al., 2020; Arulkumaran et al., 2017), among many others (Wang et al., 2020; Zhou et al., 2020; Liu et al., 2017). However, as ANNs keep increasing in size to further improve their representational power (Du et al., 2019; Novak et al., 2018), the memory and energy requirements to train and make inferences with these models become a limiting factor (Hwang, 2018; Ahmed & Wahed, 2020). The scaling problem of ANNs is most prominent in the fully-connected layers, the dense part of an ANN, which involve multiplication with a weight matrix that scales quadratically with the number of units, making very wide ANNs infeasible. This problem is exacerbated as ANNs learn from high-dimensional inputs such as video and spatial data (Garcia-Garcia et al., 2018; Ma et al., 2019) and produce high-dimensional representations for many-class classification or generative models for images and video (Mildenhall et al., 2021; Ramesh et al., 2021), all of which are gaining importance. A large body of work has addressed the scaling problem (Reed, 1993; Gale et al., 2019; Blalock et al., 2020; Hoefler et al., 2021). Many of these studies look into sparsity of the weight matrix as a solution, based on the observation that the weight distribution of a dense model at the end of training often has a peak around zero, indicating that the majority of weights contribute little to the function being computed (Han et al., 2015). By utilizing sparse matrix representations and operations, the floating-point operations (FLOPs), and thus the energy usage, of a model can be reduced dramatically.
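To make the FLOP savings concrete, the following sketch (our illustration, not code from this work) applies magnitude-based sparsification to a toy weight matrix with the peaked distribution described above, stores the survivors in a COO-style sparse representation, and compares the multiply-accumulate counts of the dense and sparse matrix-vector products. The layer size and sparsity level are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "trained" weight matrix: most entries are near zero, as in the
# peaked weight distributions observed by Han et al. (2015).
n = 512
W = rng.normal(scale=0.01, size=(n, n))
important = rng.random((n, n)) < 0.1            # ~10% of weights matter
W[important] += rng.normal(scale=1.0, size=important.sum())

# Magnitude-based sparsification: keep only the largest 10% of weights.
k = int(0.1 * W.size)
threshold = np.sort(np.abs(W), axis=None)[-k]
mask = np.abs(W) >= threshold

# COO-style sparse representation: values plus their coordinates.
rows, cols = np.nonzero(mask)
vals = W[rows, cols]

# The sparse matrix-vector product touches only the stored weights, so
# the multiply-accumulate count drops from n*n to nnz (~10% here).
x = rng.normal(size=n)
y_sparse = np.zeros(n)
np.add.at(y_sparse, rows, vals * x[cols])

y_dense = (W * mask) @ x                        # masked dense reference
assert np.allclose(y_sparse, y_dense)
print(f"dense MACs: {n * n}, sparse MACs: {len(vals)}")
```

The energy argument follows the same proportion: each skipped multiply-accumulate is arithmetic and memory traffic that never happens.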
Biological neural networks have also evolved to utilize sparsity, which is seen as an important property for learning efficiency (Pessoa, 2014; Bullmore & Sporns, 2009; Watts & Strogatz, 1998). Early works in ANN sparsity removed connections from a trained dense model based on the magnitude of the weights, a process called pruning (Janowsky, 1989; Ström, 1997), resulting in a more efficient model for inference. While later works improved upon this technique (Guo et al., 2016b; Dong et al., 2017; Yu et al., 2018), they all incur at least the cost of training a dense model, yielding no efficiency benefit during training. This limits the size of sparse models that can be trained on a given machine to that of the largest dense model it can train. In light of this limitation, the Lottery Ticket Hypothesis (LTH) (Frankle & Carbin, 2018) surprisingly posits that there exists a subnetwork within a dense over-parameterized model that, when trained with the same initial weights, results in a sparse model with accuracy comparable to that of the dense model. However, the proposed method for finding a Winning Ticket within a dense model is very compute intensive, as it requires training the dense model (typically multiple times) to obtain the subnetwork. Moreover, later work weakened the hypothesis for larger ANNs (Frankle et al., 2019). Despite this, the LTH was still an important catalyst for new methods that aim to find the Winning Ticket more efficiently. Efficient methods for finding a Winning Ticket can be categorized as: pruning before training and Dynamic Sparse Training (DST). The before-training methods prune connections from a randomly initialized dense model (Lee et al., 2018; Wang et al., 2019; Tanaka et al., 2020). In contrast, the DST methods start with a randomly initialized sparse model and change the connections dynamically during training, maintaining the overall sparsity (Mocanu et al., 2018; Mostafa & Wang, 2019; Evci et al., 2020).
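The cost of the original LTH search can be seen in a minimal skeleton. The sketch below is our illustration, not the authors' code: `train` is a stub standing in for full SGD, and the round count and pruning fraction are arbitrary. The point is structural: every round trains at the dense model's cost before the mask shrinks, which is what makes the procedure expensive.

```python
import numpy as np

rng = np.random.default_rng(0)

def train(w, mask, steps=100):
    """Stand-in for SGD; a real implementation would backpropagate and
    update only the unmasked weights."""
    return w + mask * rng.normal(scale=0.1, size=w.shape)

def find_ticket(n=1000, rounds=3, prune_frac=0.5):
    """Iterative magnitude pruning with weight rewinding, in the style of
    Frankle & Carbin (2018): each round trains, prunes the smallest
    surviving weights, and rewinds the rest to their initial values."""
    w_init = rng.normal(size=n)
    mask = np.ones(n, dtype=bool)
    for _ in range(rounds):
        w = train(w_init.copy(), mask)      # dense-sized training cost
        n_keep = int(mask.sum() * (1 - prune_frac))
        survivors = np.flatnonzero(mask)
        keep = survivors[np.argsort(np.abs(w[survivors]))[-n_keep:]]
        mask = np.zeros(n, dtype=bool)
        mask[keep] = True
        # Rewind: surviving weights restart from their original init.
    return w_init * mask, mask

w, mask = find_ticket()
print(mask.sum())  # 1000 -> 500 -> 250 -> 125 surviving connections
```

Note that the dense parameter vector `w_init` (and its gradients, in a real run) must be held in memory throughout, which is precisely the peak-memory cost the methods discussed next try to avoid.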
In practice, DST methods generally achieve better accuracy than the pruning before training methods (Wang et al., 2019). In addition, the DST methods do not need to represent the dense model at any point, giving them a clear memory efficiency advantage, an important property for our motivation. The first DST method was Sparse Evolutionary Training (SET) (Mocanu et al., 2018). SET removes the connections with the lowest weight magnitude, a common pruning strategy (Mostafa & Wang, 2019; Dettmers & Zettlemoyer, 2019; Evci et al., 2020), and grows new connections uniformly at random. RigL (Evci et al., 2020) improved upon SET by instead growing the connections with the largest gradient magnitude, as these connections are expected to attain large weight magnitudes under gradient descent optimization. While these methods drastically reduce the FLOPs required to train sparse models, the pruning before training methods and RigL share an important limitation: their peak memory usage during training is on the same order as that of the dense model. This is because the pruning before training methods use the dense randomly initialized model, and RigL requires periodically computing the gradient of the loss with respect to all possible connections. We present a DST algorithm that reduces the peak memory usage to the order of a sparse model while maintaining the improvements in compute and achieving the same accuracy as RigL. We achieve this by efficiently sampling a subset of the inactive connections using a heuristic of the gradient magnitude, and evaluating the gradient only on this subset. Interestingly, the size of the subset can be on the same order as the number of active connections, reducing the peak memory to the order of the sparse model. We evaluate the accuracy of our method on CIFAR-10/100 using ResNet-56 and VGG-16 models and compare it against a range of sparsification methods at sparsity levels of 90%, 95%, and 98%.
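A single prune-and-grow update, with the growth criterion as the only difference between SET, RigL, and a subset-sampled variant in the spirit of our method, can be sketched as follows. This is our own NumPy illustration: the function name, the uniform subset sampling (a stand-in for the gradient-magnitude heuristic), and the toy dense gradient are illustrative choices, not the exact procedure used in this work.

```python
import numpy as np

rng = np.random.default_rng(0)

def prune_grow_step(w, mask, grad, n_update, growth="random", subset_frac=None):
    """One DST update: drop the n_update active weights with the smallest
    magnitude, then grow n_update currently inactive connections.

    growth="random"   -- SET-style uniform growth (Mocanu et al., 2018)
    growth="gradient" -- RigL-style growth by gradient magnitude
                         (Evci et al., 2020)
    subset_frac       -- if set, score only a sampled subset of the
                         inactive connections, so the dense gradient
                         never needs to be fully materialized.
    """
    active = np.flatnonzero(mask)
    inactive = np.flatnonzero(~mask)        # fixed before pruning

    # Prune: remove the active connections with the smallest |w|.
    drop = active[np.argsort(np.abs(w[active]))[:n_update]]
    mask[drop] = False
    w[drop] = 0.0

    if subset_frac is not None:
        # Peak memory stays on the order of the sparse model: only a
        # subset of inactive connections is ever considered.
        m = max(n_update, int(subset_frac * inactive.size))
        inactive = rng.choice(inactive, size=m, replace=False)

    if growth == "random":
        grow = rng.choice(inactive, size=n_update, replace=False)
    else:  # "gradient"
        grow = inactive[np.argsort(np.abs(grad[inactive]))[-n_update:]]

    mask[grow] = True                       # new connections start at zero
    return w, mask

# Toy layer: 1000 possible connections at 90% sparsity.
n, density = 1000, 0.1
mask = np.zeros(n, dtype=bool)
mask[rng.choice(n, size=int(density * n), replace=False)] = True
w = np.where(mask, rng.normal(size=n), 0.0)
grad = rng.normal(size=n)                   # stand-in for a backprop gradient

w, mask = prune_grow_step(w, mask, grad, n_update=10,
                          growth="gradient", subset_frac=0.2)
print(mask.sum())  # overall sparsity unchanged: 100 active connections
```

With `subset_frac` set, the gradient only needs to be evaluated at the sampled inactive coordinates plus the active ones, which is what keeps the peak memory on the order of the sparse model.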
The benefits of our method can be utilized in the following two ways:
1. It allows a given model to be trained to an accuracy on par with the dense model while requiring significantly less memory and energy.
2. The savings in memory and energy can be allocated towards training an even larger sparse model on the same machine, generally improving the accuracy of the model.

2. RELATED WORK

A variety of methods have been proposed that aim to reduce the size of ANNs, such as dimensionality reduction of the model parameters (Jaderberg et al., 2014; Novikov et al., 2015) and weight quantization (Gupta et al., 2015; Mishra et al., 2018). However, we are interested in model sparsification methods because they reduce both the size and the FLOPs of a model. Following Wang et al. (2019), we categorize the sparsification methods as: pruning after training, pruning during training, pruning before training, and Dynamic Sparse Training.

After training. The first pruning algorithms operated on dense trained models, pruning the connections with the smallest weight magnitude (Janowsky, 1989; Thimm & Fiesler, 1995; Ström, 1997; Han et al., 2015). This method was later generalized to first-order (Mozer & Smolensky, 1988; Karnin, 1990; Molchanov et al., 2019a;b) and second-order (LeCun et al., 1989; Hassibi & Stork,

