MEMORY EFFICIENT DYNAMIC SPARSE TRAINING

Anonymous

Abstract

The excessive memory and energy consumption of modern Artificial Neural Networks (ANNs) limits the machines that can run these models. Sparsification of ANNs is often motivated by time, memory, and energy savings during model inference only, yielding no benefits during training. A growing body of work now focuses on extending the benefits of model sparsification to training as well. While these methods improve energy efficiency during training, the algorithms yielding the most accurate models still have a peak memory usage on the same order as that of the dense model. We propose a Dynamic Sparse Training (DST) algorithm that reduces the peak memory usage during training while preserving the energy advantages of sparsely trained models. We evaluate our algorithm on CIFAR-10/100 using ResNet-56 and VGG-16 and compare it against a range of sparsification methods. The benefits of our method are twofold: first, it allows a given model to be trained to an accuracy on par with that of the dense model while requiring significantly less memory and energy; second, the savings in memory and energy can instead be allocated towards training an even larger sparse model on the same machine, generally improving the accuracy of the model.

1. INTRODUCTION

Artificial Neural Networks (ANNs) are currently the most prominent machine learning method because of their superiority in a broad range of applications, including computer vision (O'Mahony et al., 2019; Voulodimos et al., 2018; Guo et al., 2016a), natural language processing (Otter et al., 2020; Young et al., 2018), and reinforcement learning (Schrittwieser et al., 2020; Arulkumaran et al., 2017), among many others (Wang et al., 2020; Zhou et al., 2020; Liu et al., 2017). However, as ANNs keep increasing in size to further improve their representational power (Du et al., 2019; Novak et al., 2018), the memory and energy required to train and make inferences with these models become a limiting factor (Hwang, 2018; Ahmed & Wahed, 2020). The scaling problem of ANNs is most prominent in the fully-connected layers, the dense part of an ANN in which each layer multiplies its input by a weight matrix whose size scales quadratically with the number of units, making very wide ANNs infeasible. This problem is exacerbated as ANNs learn from high-dimensional inputs such as video and spatial data (Garcia-Garcia et al., 2018; Ma et al., 2019) and produce high-dimensional representations for many-class classification or for generative models of images and video (Mildenhall et al., 2021; Ramesh et al., 2021), all of which are gaining importance.

A large body of work has addressed this scaling problem (Reed, 1993; Gale et al., 2019; Blalock et al., 2020; Hoefler et al., 2021). Many of these studies turn to sparsity of the weight matrix as a solution, based on the observation that the weight distribution of a dense model at the end of training often peaks around zero, indicating that the majority of weights contribute little to the function being computed (Han et al., 2015). By utilizing sparse matrix representations and operations, the number of floating-point operations (FLOPs), and thus the energy usage, of a model can be reduced dramatically.
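As a back-of-the-envelope illustration (not part of the paper; the function names and the CSR-style storage model are our own assumptions), the following sketch counts parameters, stored numbers, and multiply-accumulate FLOPs for a dense linear layer versus a sparse one that keeps only a fraction of its weights:

```python
# Hypothetical cost model: a linear layer between two n_units-wide layers.

def dense_linear_cost(n_units):
    """A dense layer stores n^2 weights and performs one multiply and
    one add per weight in a forward pass (2*n^2 FLOPs)."""
    params = n_units * n_units
    flops = 2 * n_units * n_units
    return params, flops

def sparse_linear_cost(n_units, density):
    """Keep only a fraction `density` of the weights. CSR-style storage
    holds one value and one column index per nonzero, plus n+1 row
    pointers; multiply-accumulates are performed only on nonzeros."""
    nnz = int(n_units * n_units * density)
    params = nnz                        # nonzero weight values
    stored = 2 * nnz + n_units + 1      # values + column indices + row pointers
    flops = 2 * nnz
    return params, stored, flops

dense_params, dense_flops = dense_linear_cost(4096)
sparse_params, sparse_stored, sparse_flops = sparse_linear_cost(4096, 0.1)
print(dense_params, dense_flops)    # 16777216 33554432
print(sparse_params, sparse_flops)  # roughly 10x fewer of each
```

At 10% density the layer stores and computes roughly an order of magnitude less; the row-pointer overhead is negligible for wide layers.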
Biological neural networks have also evolved to utilize sparsity, which is seen as an important property for learning efficiency (Pessoa, 2014; Bullmore & Sporns, 2009; Watts & Strogatz, 1998). Early works on ANN sparsity removed connections of a trained dense model based on the magnitude of their weights, a process called pruning (Janowsky, 1989; Ström, 1997), resulting in a more efficient model for inference. While later works improved upon this technique (Guo et al., 2016b; Dong et al., 2017; Yu et al., 2018), they all incur at least the cost of training a dense model, yielding no efficiency benefits during training.
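One-shot magnitude pruning, as used in these early works, can be sketched as follows (a minimal illustration on a flat list of weights; the helper name and interface are our own, not from the cited papers):

```python
# Minimal sketch of one-shot magnitude pruning: after dense training,
# zero out the fraction `sparsity` of weights with the smallest magnitude.

def magnitude_prune(weights, sparsity):
    """Return a copy of `weights` with the smallest-magnitude entries
    set to zero. Ties at the threshold are also pruned."""
    n_prune = int(len(weights) * sparsity)
    if n_prune == 0:
        return list(weights)
    # Threshold = magnitude of the n_prune-th smallest weight.
    threshold = sorted(abs(w) for w in weights)[n_prune - 1]
    return [w if abs(w) > threshold else 0.0 for w in weights]

pruned = magnitude_prune([0.9, -0.05, 0.4, 0.01, -0.7, 0.002], 0.5)
print(pruned)  # -> [0.9, 0.0, 0.4, 0.0, -0.7, 0.0]
```

In practice the pruned model is stored in a sparse format and usually fine-tuned briefly to recover accuracy, but note that the full dense model must exist, and be trained, before any weight can be removed.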

