MINIMUM VARIANCE UNBIASED N:M SPARSITY FOR THE NEURAL GRADIENTS

Abstract

In deep learning, fine-grained N:M sparsity reduces the data footprint and bandwidth of a general matrix multiply (GEMM) by up to 2×, and doubles throughput by skipping computation of zero values. So far, it has mainly been used to prune weights in order to accelerate the forward and backward phases. We examine how this method can also be used for the neural gradients (i.e., loss gradients with respect to the intermediate neural layer outputs). To this end, we first establish a tensor-level optimality criterion. Previous works aimed to minimize the mean-square error (MSE) of each pruned block. We show that while minimization of the MSE works well for pruning the weights and activations, it catastrophically fails for the neural gradients. Instead, we show that accurate pruning of the neural gradients requires an unbiased minimum-variance pruning mask. We design such specialized masks, and find that in most cases, 1:2 sparsity is sufficient for training, and 2:4 sparsity is usually enough when it is not. Further, we suggest combining several such methods in order to potentially speed up training even more. A reference implementation is supplied in the supplementary material.
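To make the distinction concrete, the sketch below (a NumPy illustration under our own simplified assumptions; the helper names are hypothetical and the stochastic rule is not necessarily the exact mask derived in the paper) contrasts deterministic MSE-minimizing 1:2 pruning with an unbiased stochastic rule that keeps each element with probability proportional to its magnitude and rescales the kept element by the inverse of its keep probability:

```python
import numpy as np

def prune_1_2_mse(x):
    """Deterministic 1:2 pruning: in each pair of elements, zero the one
    with smaller magnitude. This minimizes the per-block mean-square
    error but is biased: E[pruned] != original."""
    pairs = x.reshape(-1, 2).copy()
    drop = np.argmin(np.abs(pairs), axis=1)
    pairs[np.arange(len(pairs)), drop] = 0.0
    return pairs.reshape(x.shape)

def prune_1_2_unbiased(x, rng):
    """Stochastic 1:2 pruning: keep element i of each pair with
    probability p_i = |x_i| / (|x_0| + |x_1|) and rescale it by 1/p_i,
    so the pruned block is an unbiased estimate of the original."""
    pairs = x.reshape(-1, 2)
    mag = np.abs(pairs)
    total = np.maximum(mag.sum(axis=1), 1e-12)
    p1 = mag[:, 1] / total                      # keep-probability of element 1
    keep = (rng.random(len(pairs)) < p1).astype(int)
    rows = np.arange(len(pairs))
    p_kept = np.maximum(np.where(keep == 1, p1, 1.0 - p1), 1e-12)
    out = np.zeros_like(pairs)
    out[rows, keep] = pairs[rows, keep] / p_kept
    return out.reshape(x.shape)

rng = np.random.default_rng(0)
x = np.array([1.0, 3.0, -2.0, 0.5])
# Averaging many stochastic prunings recovers x (unbiasedness), while
# the deterministic MSE mask always returns the same biased result.
avg = np.mean([prune_1_2_unbiased(x, rng) for _ in range(20000)], axis=0)
```

Averaging many stochastic prunings recovers the original block in expectation, which is the property argued here to be essential for the neural gradients, whereas the deterministic MSE mask systematically shrinks the tensor.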

1. INTRODUCTION

Pruning Deep Neural Networks (DNNs) is one of the most effective and widely studied methods to improve DNN resource efficiency. Since DNNs are over-parametrized, most researchers have focused on weight pruning. Recently, however, researchers suggested that sparsity of activations (Jaszczur et al., 2021; Kurtz et al., 2020) and gradients (Chmiel et al., 2021b) could be exploited as well. All these types of unstructured pruning only reduce the memory footprint (Frankle & Carbin, 2018; Evci et al., 2020). It is possible to also reduce the compute footprint by enforcing some structure on the pruning mask, such as block sparsity (Wen et al., 2016), filter sparsity (Li et al., 2017), or N:M fine-grained sparsity (Nvidia, 2020; Hubara et al., 2021; Mishra et al., 2021). We focus on N:M fine-grained sparsity, in which N out of every M contiguous elements are pruned, for at least one of the two matrices involved in the matrix multiplication. Nvidia's sparse tensor cores (Nvidia, 2020; Mishra et al., 2021) can use N:M fine-grained sparsity to accelerate matrix multiplication. Specifically, Nvidia (2020) used a 2:4 format to accelerate inference by up to 2×. They suggested a three-step scheme: (a) train a dense model, (b) prune weights to obtain a fixed 2:4 mask, and (c) use the original training regime to retrain with the masked weights. Subsequent works suggested methods to accelerate different parts of this scheme. First, Zhou et al. (2021) showed that steps (a) and (b) can be omitted by training with an N:M mask from scratch using a straight-through estimator (STE) and additional regularization. Specifically, they keep a dense copy of the weights and set different weight-decay rates for the masked and unmasked weights. Next, Hubara et al. (2021) focused on accelerating the remaining step (c), i.e., sparse training, as we do here.
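As a concrete illustration of step (b) of the three-step scheme, the following sketch (a hypothetical NumPy helper; the production path uses Nvidia's sparse tensor cores and associated pruning tooling) builds a fixed 2:4 mask by keeping the two largest-magnitude weights in every group of four contiguous elements:

```python
import numpy as np

def mask_2_4(w):
    """Magnitude-based 2:4 mask: in every group of 4 contiguous weights,
    keep the 2 largest-magnitude values and zero the other 2, yielding
    50% structured sparsity that sparse tensor cores can exploit."""
    groups = w.reshape(-1, 4)
    # indices of the 2 smallest-magnitude entries in each group of 4
    drop = np.argsort(np.abs(groups), axis=1)[:, :2]
    mask = np.ones_like(groups, dtype=bool)
    np.put_along_axis(mask, drop, False, axis=1)
    return mask.reshape(w.shape)

w = np.array([0.1, -2.0, 0.3, 1.5, -0.2, 0.05, 4.0, -1.0])
m = mask_2_4(w)
# m keeps {-2.0, 1.5} in the first group and {4.0, -1.0} in the second
```

In the retraining step (c), the masked weights `w * m` replace the dense weights in every forward and backward pass while the mask itself stays fixed.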

