AUTOSPARSE: TOWARDS AUTOMATED SPARSE TRAINING

Abstract

Sparse training is emerging as a promising avenue for reducing the computational cost of training neural networks. Several recent studies have proposed pruning methods using learnable thresholds to efficiently explore the non-uniform distribution of sparsity inherent within the models. In this paper, we propose Gradient Annealing (GA), a gradient-driven approach where gradients of pruned-out weights are scaled down in a non-linear manner. GA provides an elegant trade-off between sparsity and accuracy without the need for additional sparsity-inducing regularization. We integrated GA with the latest learnable-threshold-based pruning methods to create an automated sparse training algorithm called AutoSparse. Our algorithm achieves state-of-the-art accuracy with 80% sparsity for ResNet50 and 75% sparsity for MobileNetV1 on ImageNet-1K. AutoSparse also yields a 7× reduction in inference FLOPS and a > 2× reduction in training FLOPS for ResNet50 on ImageNet at 80% sparsity. Finally, GA generalizes well to fixed-budget (Top-K, 80%) sparse training methods, improving the accuracy of ResNet50 on ImageNet-1K to outperform TopKAST+PP by 0.3%. MEST (the state-of-the-art method for fixed-budget sparsity) achieves accuracy comparable to AutoSparse at 80% sparsity, but uses 20% more training FLOPS and 45% more inference FLOPS.
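To make the GA mechanism concrete, the following is a minimal PyTorch-style sketch of gradients of pruned-out weights being scaled by an annealed factor. The class and function names, the magnitude-threshold pruning criterion, and the polynomial decay schedule are illustrative assumptions, not the reference implementation of AutoSparse.

```python
# Hypothetical sketch of Gradient Annealing (GA): weights below a threshold are
# pruned in the forward pass; gradients flowing to those pruned weights are
# scaled by a factor `alpha` that is annealed towards 0 during training.
import torch


class GradientAnnealedPrune(torch.autograd.Function):
    @staticmethod
    def forward(ctx, weight, threshold, alpha):
        mask = (weight.abs() > threshold).to(weight.dtype)
        ctx.save_for_backward(mask)
        ctx.alpha = alpha
        # Pruned weights contribute zero in the forward pass.
        return weight * mask

    @staticmethod
    def backward(ctx, grad_output):
        (mask,) = ctx.saved_tensors
        # Surviving weights receive the full gradient; pruned weights receive a
        # gradient scaled by alpha (non-linearly decayed over training).
        grad_weight = grad_output * (mask + ctx.alpha * (1.0 - mask))
        return grad_weight, None, None


def annealing_factor(step, total_steps, power=2.0):
    # One possible non-linear (polynomial) decay from 1 to 0; the schedule used
    # by the paper's method may differ.
    return max(0.0, 1.0 - step / total_steps) ** power


if __name__ == "__main__":
    # Toy usage: prune a random weight tensor and inspect the scaled gradients.
    w = torch.randn(4, 4, requires_grad=True)
    alpha = annealing_factor(step=100, total_steps=1000)
    loss = GradientAnnealedPrune.apply(w, 0.5, alpha).sum()
    loss.backward()
    print(w.grad)
```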

1. INTRODUCTION

Deep learning models have emerged as the preferred solution for many important problems in the domains of computer vision He et al. (2016); Dosovitskiy et al. (2021), language modeling Brown et al. (2020), recommender systems Naumov et al. (2019) and reinforcement learning Silver et al. (2017). Models have grown larger and more complex over the years, as they are applied to increasingly difficult problems on ever-growing datasets. In addition, DNN models are designed to operate in the overparameterized regime Arora et al. (2018); Belkin et al. (2019); Ardalani et al. (2019) to facilitate easier optimization using gradient descent based methods. As a consequence, the computational cost of performing training and inference tasks on state-of-the-art models has been growing at an exponential rate Amodei & Hernandez. The excess model capacity also makes DNNs more resilient to noise during training: reduced-precision training methods Micikevicius et al. (2017); Wang et al. (2018); Sun et al. (2019) have successfully exploited this aspect of DNNs to speed up training and inference tasks significantly. Today, state-of-the-art training hardware delivers significantly more reduced-precision FLOPS than traditional FP32 computation. Sparsity is another avenue for improving compute efficiency by exploiting excess model capacity, thereby reducing the number of FLOPS needed for each iteration. Several studies have shown that, while an overparameterized model helps to ease training, only a fraction of that model capacity is needed to perform inference Li et al. (2020); Han et al. (2015); Narang et al. (2017); Ström (1997); Gale et al. (2019). A wide array of studies have also proposed methods to prune dense networks to produce sparse models for inference (dense-to-sparse) Molchanov et al. (2017); Zhu & Gupta (2017); Frankle & Carbin (2019); Renda et al. (2020). More recently, there is growing interest in sparse-to-sparse methods Frankle & Carbin (2019); Mostafa & Wang (2019); Bellec et al. (2018); Evci et al. (2021); Lee (2021); Dettmers & Zettlemoyer (2019); Jayakumar et al. (2021); Zhang et al. (2022); Schwarz et al. (2021); Yuan et al. (2021) that train models with end-to-end sparsity to reduce the computational cost of training. This paper presents techniques to improve and generalize sparse training methods for easy integration into different training workflows.

