LEARNING N:M FINE-GRAINED STRUCTURED SPARSE NEURAL NETWORKS FROM SCRATCH

Abstract

Sparsity in Deep Neural Networks (DNNs) has been widely studied to compress and accelerate models in resource-constrained environments. It can be generally categorized into unstructured fine-grained sparsity, which zeroes out individual weights distributed across the neural network, and structured coarse-grained sparsity, which prunes blocks of sub-networks of a neural network. Fine-grained sparsity can achieve a high compression ratio but is not hardware-friendly and hence receives limited speed gains. On the other hand, coarse-grained sparsity cannot concurrently achieve both apparent acceleration on modern GPUs and decent performance. In this paper, we are the first to study training from scratch an N:M fine-grained structured sparse network, which can maintain the advantages of both unstructured fine-grained sparsity and structured coarse-grained sparsity simultaneously on specifically designed GPUs. Specifically, a 2:4 sparse network can achieve a 2× speed-up without any performance drop on Nvidia A100 GPUs. Furthermore, we propose a novel and effective ingredient, the sparse-refined straight-through estimator (SR-STE), to alleviate the negative influence of the approximated gradients computed by vanilla STE during optimization. We also define a metric, Sparse Architecture Divergence (SAD), to measure the sparse network's topology change during the training process. Finally, we justify SR-STE's advantages with SAD and demonstrate the effectiveness of SR-STE by performing comprehensive experiments on various tasks. Source codes and models are available at https://github.com/NM-sparsity/NM-sparsity.

1. INTRODUCTION

Deep neural networks (DNNs) have shown promising performance on various tasks including computer vision, natural language processing, speech recognition, etc. However, a DNN usually comes with a large number of learnable parameters, ranging from millions to even billions (e.g., GPT-3 (Brown et al., 2020)), making the model burdensome and difficult to deploy in real-world applications. Therefore, researchers have investigated how to speed up and compress DNNs via various methods such as knowledge distillation (Hinton et al., 2015), quantization (Jacob et al., 2018; Zhou et al., 2017), efficient model architecture design (Howard et al., 2017), and structured sparsity (Wen et al., 2016; Li et al., 2016). In this paper, we focus on the problem of sparsifying DNNs.

Sparsity in DNNs can be categorized into unstructured sparsity and structured sparsity. Unstructured sparsity prunes individual weights at any location, which is fine-grained and can achieve an extremely high compression ratio (Han et al., 2015; Guo et al., 2016). However, unstructured sparsity struggles to take advantage of vector-processing architectures, which increases latency due to dependent sequences of reads (Nvidia, 2020). Compared with unstructured sparsity, structured sparsity is more friendly to hardware, especially for block pruning (Wang et al., 2019), kernel shape sparsity (Tan et al., 2020), or channel and filter pruning (Li et al., 2016; Wen et al., 2016). Although structured sparsity can speed up DNNs on commodity hardware, it hurts model performance more significantly than unstructured fine-grained sparsity. For example, a ResNet-50 network generated by unstructured pruning can achieve a 5.96× compression ratio with the same accuracy as the original network, but it can only achieve 1× compression in the case of structured sparsity (Renda et al., 2020). Therefore, how to combine unstructured sparsity and structured sparsity to accelerate DNNs on modern hardware (e.g., GPUs) becomes a challenging yet valuable problem.

Recently, the Nvidia Ampere A100 GPU has been equipped with Sparse Tensor Cores to accelerate 2:4 structured fine-grained sparsity. Here, N:M sparsity indicates a sparsity pattern in which only N weights are non-zero within every group of M consecutive weights. To the best of our knowledge, the A100 is the first commodity sparse hardware, where the Sparse Tensor Cores can support several common operations including linear layers, convolutions, recurrent cells, transformer blocks, etc. Specifically, consider a typical matrix multiplication X × W in DNNs, where X and W denote the input tensor and the parameter tensor, respectively. The Dense Tensor Cores perform the X_{16×32} × W_{32×8} matrix multiplication in 2 cycles, while the Sparse Tensor Cores need only 1 cycle if the parameter tensor W satisfies the 2:4 structured sparse pattern. Nvidia has proposed the ASP^1 (APEX's Automatic Sparsity) solution (Nvidia, 2020) to sparsify a dense neural network so that it satisfies the 2:4 fine-grained structured sparsity requirement. The recipe contains three steps: (1) training a dense network until convergence; (2) pruning for 2:4 sparsity with magnitude-based single-shot pruning; (3) repeating the original training procedure. However, ASP is computationally expensive since it requires training the full dense model from scratch and then fine-tuning again. Therefore, we still lack a simple recipe to obtain a structured sparse DNN model whose performance matches the dense network without extra fine-tuning.
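To make the N:M pattern concrete, the following minimal sketch (ours, not the released ASP implementation; the function name nm_prune_mask is illustrative) projects a weight tensor onto the N:M sparse set by keeping the N largest-magnitude weights in every group of M consecutive weights, as in step (2) of the ASP recipe above.

```python
import torch

def nm_prune_mask(weight: torch.Tensor, n: int = 2, m: int = 4) -> torch.Tensor:
    """Return a {0, 1} mask keeping the n largest-magnitude weights in every
    group of m consecutive weights (grouped along the flattened last dimension)."""
    assert weight.numel() % m == 0, "number of weights must be divisible by m"
    groups = weight.detach().abs().reshape(-1, m)   # one row per group of m weights
    topk_idx = groups.topk(n, dim=1).indices        # positions of the n largest magnitudes
    mask = torch.zeros_like(groups)
    mask.scatter_(1, topk_idx, 1.0)                 # mark exactly n survivors per group
    return mask.reshape(weight.shape)

# Usage: a 2:4-sparse version of a linear layer's weight matrix.
w = torch.randn(32, 16)
w_sparse = w * nm_prune_mask(w, n=2, m=4)           # at most 2 non-zeros in every 4 weights
```

Grouping along the last (input) dimension follows the "every M consecutive weights" definition above; a weight matrix masked this way is eligible for acceleration by the Sparse Tensor Cores.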
This paper addresses the following question: Can we design a simple yet universal recipe to learn N:M sparse neural networks from scratch in an efficient way? Although SET-MLP can easily outperform a dense MLP (Bourgin et al., 2019), it is difficult to simultaneously find the optimal sparse architecture (connections) and the optimal parameters when training sparse CNNs and Transformers (Evci et al., 2019b). There are two schemes to obtain such sparse models. One is a two-stage scheme, which discovers a sparse neural architecture by pruning a well-trained dense network and then uses the same or even greater computational resources to retrain the sparse model (Nvidia, 2020; Evci et al., 2019b; Han et al., 2015; Frankle & Carbin, 2018). The other is a one-stage scheme, which dynamically alternates between optimizing the parameters and pruning the network architecture based on different criteria (Bellec et al., 2017; Mocanu et al., 2018; Mostafa & Wang, 2019; Evci et al., 2019b; Kusupati et al., 2020; Dettmers & Zettlemoyer, 2019). Compared with the two-stage scheme, the one-stage scheme saves training time and cost but usually obtains lower performance.

To overcome the aforementioned trade-off between training cost and performance, we present a simple yet effective framework to train sparse neural networks from scratch. Specifically, we employ magnitude-based pruning (Renda et al., 2020; Gale et al., 2019) during the forward pass. Considering that the pruning operation is non-differentiable (a similar dilemma arises in model quantization (Courbariaux et al., 2016)), we extend the Straight-Through Estimator (STE) (Bengio et al., 2013), widely used in model quantization, to support back-propagation through sparse neural networks. However, STE introduces perturbations during back-propagation (Yin et al., 2019; Bengio et al., 2013). Hence, we define Sparse Architecture Divergence (SAD) to further analyze N:M sparse networks trained with STE so that we can identify the impact of these perturbations on sparse network training. Based on the SAD analysis, we propose a sparse-refined term that mitigates the influence of the approximated gradients and thereby alleviates their negative impact. We also compare the performance of neural networks with different granularities of fine-grained structured sparsity (i.e., 1:4, 2:4, 2:8, 4:8) and conduct thorough experiments on several typical deep neural networks with different N:M sparsity levels, covering image classification, detection, segmentation, optical flow estimation, and machine translation. Experimental results show that models with our proposed structured sparsity suffer a negligible performance drop and can sometimes even outperform the dense models.

The main contributions of this paper are three-fold. (1) To the best of our knowledge, this is the first systematic study of training N:M structured sparse neural networks from scratch without performance drop. The N:M structured sparsity is a missing yet promising ingredient in model acceleration and can serve as a valuable complement to various compression methods. (2) We extend STE to tackle the problem of training N:M sparse neural networks. To alleviate the limitations of STE in sparsifying networks, we propose a sparse-refined term to enhance the effectiveness



^1 https://github.com/NVIDIA/apex/tree/master/apex/contrib/sparsity

