AUTOSPARSE: TOWARDS AUTOMATED SPARSE TRAINING

Abstract

Sparse training is emerging as a promising avenue for reducing the computational cost of training neural networks. Several recent studies have proposed pruning methods that use learnable thresholds to efficiently explore the non-uniform distribution of sparsity inherent within models. In this paper, we propose Gradient Annealing (GA), a gradient-driven approach in which the gradients of pruned-out weights are scaled down in a non-linear manner. GA provides an elegant trade-off between sparsity and accuracy without the need for additional sparsity-inducing regularization. We integrate GA with the latest learnable-threshold pruning methods to create an automated sparse training algorithm called AutoSparse. Our algorithm achieves state-of-the-art accuracy at 80% sparsity for ResNet50 and 75% sparsity for MobileNetV1 on ImageNet-1K. AutoSparse also achieves a 7× reduction in inference FLOPs and a > 2× reduction in training FLOPs for ResNet50 on ImageNet-1K at 80% sparsity. Finally, GA generalizes well to fixed-budget (Top-K, 80%) sparse training methods, improving the accuracy of ResNet50 on ImageNet-1K to outperform TopKAST+PP by 0.3%. MEST (the state-of-the-art method for fixed-budget sparsity) achieves accuracy comparable to AutoSparse at 80% sparsity, but uses 20% more training FLOPs and 45% more inference FLOPs.
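The gradient-scaling idea behind GA can be sketched as follows. This is a minimal illustration under assumed choices, since the abstract only specifies that gradients of pruned-out weights are scaled down non-linearly: the cosine decay schedule, the soft-thresholding pruning rule, and all function names here are our assumptions, not the paper's exact recipe.

```python
# Sketch of Gradient Annealing (GA): gradients of pruned-out weights are
# scaled by a decaying factor alpha, so pruned weights can still "regrow"
# early in training but are progressively frozen out. The cosine schedule
# and all names below are illustrative assumptions.
import math

def soft_threshold(w, t):
    """Prune weight w against threshold t; returns (value, pruned?)."""
    return (0.0, True) if abs(w) <= t else (w - math.copysign(t, w), False)

def anneal_alpha(step, total_steps, alpha0=1.0):
    """Non-linear (cosine) decay of the gradient scale for pruned weights."""
    return alpha0 * 0.5 * (1.0 + math.cos(math.pi * step / total_steps))

def ga_grad(grad, pruned, step, total_steps):
    """Scale the gradient of a pruned-out weight by the annealed alpha;
    gradients of surviving weights pass through unchanged."""
    return grad * anneal_alpha(step, total_steps) if pruned else grad
```

With this schedule, alpha starts at 1 (pruned weights receive full gradients and can re-enter the model) and decays toward 0, so late in training the pruned set effectively stops changing.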

1. INTRODUCTION

Deep learning models have emerged as the preferred solution for many important problems in the domains of computer vision He et al. (2016); Dosovitskiy et al. (2021), language modeling Brown et al. (2020), recommender systems Naumov et al. (2019) and reinforcement learning Silver et al. (2017). Models have grown larger and more complex over the years, as they are applied to increasingly difficult problems on ever-growing datasets. In addition, DNN models are designed to operate in the overparameterized regime Arora et al. (2018); Belkin et al. (2019); Ardalani et al. (2019) to facilitate easier optimization using gradient descent based methods. As a consequence, the computational cost of performing training and inference tasks on state-of-the-art models has been growing at an exponential rate Amodei & Hernandez. The excess model capacity also makes DNNs more resilient to noise during training: reduced-precision training methods Micikevicius et al. (2017); Wang et al. (2018); Sun et al. (2019) have successfully exploited this aspect of DNNs to speed up training and inference tasks significantly. Today, state-of-the-art training hardware delivers significantly more reduced-precision FLOPs than traditional FP32 computation. Sparsity is another avenue for improving compute efficiency by exploiting excess model capacity, thereby reducing the number of FLOPs needed for each iteration. Several studies have shown that while an overparameterized model helps to ease training, only a fraction of that model capacity is needed to perform inference Li et al. (2020); Han et al. (2015); Narang et al. (2017); Ström (1997); Gale et al. (2019). A wide array of studies have also proposed methods that prune dense networks to produce sparse models for inference (dense-to-sparse) Molchanov et al. (2017); Zhu & Gupta (2017); Frankle & Carbin (2019); Renda et al. (2020). More recently, there is growing interest in sparse-to-sparse methods Frankle & Carbin (2019); Mostafa & Wang (2019); Bellec et al. (2018); Evci et al. (2021); Lee (2021); Dettmers & Zettlemoyer (2019); Jayakumar et al. (2021); Zhang et al. (2022); Schwarz et al. (2021); Yuan et al. (2021) that train models with end-to-end sparsity to reduce the computational cost of training.

This paper presents techniques to improve and generalize sparse training methods for easy integration into different training workflows. Sparse training methods can be broadly divided into two categories: a) deterministic pruning methods, which initialize the model with a desired fixed sparsity budget at each layer and enforce it throughout the training cycle, and b) learnable threshold pruning methods, which attempt to discover the sparsity distribution within the model by learning layer-wise threshold parameters. While the latter methods can aim for a desired sparsity level by selecting an appropriate combination of initialization and hyperparameters, the final model sparsity may not be exactly what was desired; hence they are non-deterministic. Please refer to Hoefler et al. (2021) for a detailed categorization and discussion of various sparsification methods.

Deterministic Pruning: Deterministic pruning methods expect prior knowledge of how much sparsity can be extracted out of any given model. This sparsity budget is often determined by trial-and-error or extrapolated from previously published studies. Once the sparsity budget is determined, a choice must be made between a uniform and a non-uniform distribution of sparsity across the layers. The majority of methods in this category Frankle & Carbin (2019); Bellec et al. (2018); Evci et al. (2021); Lee (2021); Jayakumar et al. (2021); Zhang et al. (2022); Schwarz et al. (2021); Zhou et al. (2021); Yuan et al. (2021) opt for a uniform sparsity distribution because it requires fewer hyperparameters; a subset of these methods Jayakumar et al. (2021); Zhang et al. (2022); Schwarz et al. (2021) keep the first and last layers dense while the sparsity budget is uniformly distributed across the rest of the layers. Fewer methods in this category Mostafa & Wang (2019); Dettmers & Zettlemoyer (2019) use a non-uniform distribution across layers via dynamic weight reallocation heuristics. A non-uniform distribution allows more degrees of freedom to explore ways to improve accuracy at any given sparsity budget. The best performing methods in this category are the ones that skip pruning the first and last layers.

1.1. LEARNABLE THRESHOLD PRUNING

Learnable threshold pruning methods offer a two-fold advantage over deterministic pruning methods: 1) they are computationally efficient, as the per-layer overhead of deriving threshold values (e.g. choosing the Top-K largest weights) is eliminated, and 2) they automatically learn the non-uniform sparsity distribution inherent within the model, producing a more FLOPs-efficient sparse model for inference. For example, an 80% sparse ResNet50 produced via the fixed-budget method MEST requires 50% more FLOPs than one produced by the learned sparsity method AutoSparse (discussed later), as the compute profile across layers is non-uniform (Figure 1a). Current state-of-the-art methods in this space Liu et al. (2020); Kusupati et al. (2020) rely on L2 regularization based approaches to guide threshold updates and penalize pruned weights. Dynamic Sparse Training Liu et al. (2020) is a sparse training method that starts with threshold values initialized to zero and pushes the small thresholds up using an exponential loss function added to an L2 regularization term. Soft Threshold Reparameterization Kusupati et al. (2020) initializes threshold parameters with large negative values and controls the induction of sparsity by using different weight decay (λ) values to achieve different sparsity levels under L2 regularization. The sparsity vs. accuracy trade-off remains a challenge and is discussed below.

Exploring Accuracy Vs. Sparsity Trade-Off:

Deterministic pruning methods, notwithstanding their limitations, offer a consistent level of sparsity throughout training, which in turn can lead to predictable performance improvements. For learnable threshold methods, performance can be measured as the reduction in training FLOPs accumulated across the entire duration of training. To meet this goal, learnable threshold methods must be able to induce sparsity early in training and provide algorithmic means to explore the trade-off of increasing model sparsity while reducing accuracy loss. Empirical studies of the aforementioned methods Liu et al. (2020); Kusupati et al. (2020) indicate that the L2 regularization based approach offers, at best, a weak trade-off: better accuracy at the expense of lower average sparsity. We also found these methods to be susceptible to runaway sparsity (e.g. hitting 100% model sparsity) if higher levels of sparsity were induced right from the start of training. To mitigate this problem, DST Liu et al. (2020) implements hard upper-limit checks on sparsity (e.g. 99%) that trigger a reset of the offending threshold and all the associated weights to prevent loss of accuracy. Similarly, STR Kusupati et al. (2020) uses a combination of small initial threshold values and an appropriate λ to delay the induction of sparsity until later in the training cycle (e.g. 30 epochs), controlling the unfettered growth of sparsity.
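The hard upper-limit guard against runaway sparsity described above can be sketched as follows; this is an illustrative reconstruction, not DST's actual code. The 99% cap follows the text, while the reset value and all names are our assumptions.

```python
# Illustrative sketch of a runaway-sparsity guard: if a layer's sparsity
# exceeds a hard cap (e.g. 99%, as in DST), reset its learnable threshold
# so the pruned weights become active again. Names and the reset value
# (zero) are assumptions for illustration.
def layer_sparsity(weights, threshold):
    """Fraction of weights pruned by magnitude thresholding."""
    pruned = sum(1 for w in weights if abs(w) <= threshold)
    return pruned / len(weights)

def guard_threshold(weights, threshold, cap=0.99, reset=0.0):
    """Reset the offending threshold when sparsity exceeds the cap."""
    if layer_sparsity(weights, threshold) > cap:
        return reset  # all weights become active again
    return threshold
```

The guard trades a temporary loss of sparsity for stability: after a reset, the threshold must be re-learned, which is why methods that avoid triggering it (by delaying sparsity induction) can be preferable.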


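For contrast with the learnable-threshold approach, the deterministic (fixed-budget, Top-K) pruning discussed in the introduction can be sketched as below; this is a generic hedged illustration of magnitude-based Top-K selection, not any specific paper's implementation.

```python
# Sketch of fixed-budget (Top-K) magnitude pruning: keep the
# (1 - sparsity) fraction of largest-magnitude weights in a layer and
# zero out the rest. Purely illustrative; real implementations operate
# on tensors, not Python lists.
def topk_mask(weights, sparsity):
    """Keep the Top-K largest-magnitude weights; K = len * (1 - sparsity)."""
    k = max(1, round(len(weights) * (1.0 - sparsity)))
    keep = set(sorted(range(len(weights)),
                      key=lambda i: -abs(weights[i]))[:k])
    return [w if i in keep else 0.0 for i, w in enumerate(weights)]
```

Note that the budget (sparsity) is fixed per layer and must be supplied up front, which is exactly the prior knowledge that learnable-threshold methods aim to avoid.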