LEARNING N:M FINE-GRAINED STRUCTURED SPARSE NEURAL NETWORKS FROM SCRATCH

Abstract

Sparsity in Deep Neural Networks (DNNs) has been widely studied to compress and accelerate the models on resource-constrained environments. It can be generally categorized into unstructured fine-grained sparsity, which zeroes out multiple individual weights distributed across the neural network, and structured coarse-grained sparsity, which prunes blocks of sub-networks of a neural network. Fine-grained sparsity can achieve a high compression ratio but is not hardware-friendly and hence receives limited speed gains. On the other hand, coarse-grained sparsity cannot concurrently achieve both apparent acceleration on modern GPUs and decent performance. In this paper, we are the first to study training N:M fine-grained structured sparse networks from scratch, which can maintain the advantages of both unstructured fine-grained sparsity and structured coarse-grained sparsity simultaneously on specifically designed GPUs. Specifically, a 2:4 sparse network can achieve a 2× speed-up without performance drop on Nvidia A100 GPUs. Furthermore, we propose a novel and effective ingredient, the sparse-refined straight-through estimator (SR-STE), to alleviate the negative influence of the approximated gradients computed by vanilla STE during optimization. We also define a metric, Sparse Architecture Divergence (SAD), to measure the sparse network's topology change during the training process. Finally, we justify SR-STE's advantages with SAD and demonstrate the effectiveness of SR-STE by performing comprehensive experiments on various tasks. Source codes and models are available at https://github.com/NM-sparsity/NM-sparsity.
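To make the N:M pattern concrete, the following minimal sketch (PyTorch-style Python; the function name and the straight-through trick shown here are illustrative, not the authors' released implementation) builds an N:M mask by keeping the N largest-magnitude weights in every group of M consecutive weights along the input dimension, and shows how a vanilla straight-through estimator lets gradients reach all dense weights, including the pruned ones, during training from scratch.

import torch

def nm_sparse_mask(weight, n=2, m=4):
    # Keep the n largest-magnitude entries in every group of m consecutive
    # weights along the input dimension; all other entries are zeroed.
    out_features, in_features = weight.shape
    assert in_features % m == 0, "input dimension must be divisible by m"
    groups = weight.abs().reshape(out_features, in_features // m, m)
    _, idx = groups.topk(n, dim=-1)                      # positions to keep
    mask = torch.zeros_like(groups).scatter_(-1, idx, 1.0)
    return mask.reshape(out_features, in_features)

# Example: a 2:4 mask leaves exactly 2 non-zeros in every 4 consecutive weights.
w = torch.randn(8, 16, requires_grad=True)
mask = nm_sparse_mask(w.detach(), n=2, m=4)

# Vanilla straight-through estimator (STE): the forward pass uses the pruned
# weights, while the backward pass treats pruning as the identity, so the
# gradient flows to every dense weight rather than only the kept ones.
w_sparse = w + (w * mask - w).detach()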

1. INTRODUCTION

Deep neural networks (DNNs) have shown promising performance on various tasks including computer vision, natural language processing, speech recognition, etc. However, a DNN usually comes with a large number of learnable parameters, ranging from millions to even billions (e.g., GPT-3 (Brown et al., 2020)), making the model burdensome and difficult to apply in real-world deployments. Therefore, researchers began to investigate how to speed up and compress DNNs via various methods such as knowledge distillation (Hinton et al., 2015), quantization (Jacob et al., 2018; Zhou et al., 2017), designing efficient model architectures (Howard et al., 2017), and structured sparsity (Wen et al., 2016; Li et al., 2016). In this paper, we focus on the problem of sparsifying DNNs.

Sparsity in DNNs can be categorized into unstructured sparsity and structured sparsity. Unstructured sparsity prunes individual weights at any location, which is fine-grained and can achieve an extremely high compression ratio (Han et al., 2015; Guo et al., 2016). However, unstructured sparsity struggles to take advantage of vector-processing architectures, which increases latency due to dependent sequences of reads (Nvidia, 2020). Compared with unstructured sparsity, structured sparsity is more hardware-friendly, especially for block pruning (Wang et al., 2019), kernel shape sparsity (Tan et al., 2020), or channel and filter pruning (Li et al., 2016; Wen et al., 2016). Although structured sparsity can speed up DNNs on commodity hardware, it hurts model performance more significantly than unstructured fine-grained sparsity. For example, a ResNet-50 network generated by unstructured pruning can achieve a 5.96× compression ratio with the same accuracy as the original network, but it can only achieve 1× com-

