MINIMUM VARIANCE UNBIASED N:M SPARSITY FOR THE NEURAL GRADIENTS

Abstract

In deep learning, fine-grained N:M sparsity reduces the data footprint and bandwidth of a General Matrix Multiply (GEMM) by up to 2x, and doubles throughput by skipping the computation of zero values. So far, it has mainly been used to prune weights in order to accelerate the forward and backward phases. We examine how this method can also be used for the neural gradients (i.e., loss gradients with respect to the intermediate neural layer outputs). To this end, we first establish a tensor-level optimality criterion. Previous works aimed to minimize the mean-square error (MSE) of each pruned block. We show that while MSE minimization works well for pruning the weights and activations, it fails catastrophically for the neural gradients. Instead, we show that accurate pruning of the neural gradients requires an unbiased minimum-variance pruning mask. We design such specialized masks, and find that in most cases 1:2 sparsity is sufficient for training, and that 2:4 sparsity is usually enough when it is not. Further, we suggest combining several such methods in order to potentially speed up training even more. A reference implementation is supplied in the supplementary material.

1. INTRODUCTION

Pruning Deep Neural Networks (DNNs) is one of the most effective and widely studied methods to improve DNN resource efficiency. Since DNNs are over-parameterized, most researchers have focused on weight pruning. Recently, however, researchers suggested that sparsity in the activations (Jaszczur et al., 2021; Kurtz et al., 2020) and gradients (Chmiel et al., 2021b) could be exploited as well. However, all these types of unstructured pruning only reduce the memory footprint (Frankle & Carbin, 2018; Evci et al., 2020). It is possible to also reduce the compute footprint by enforcing some structure on the pruning mask, such as block sparsity (Wen et al., 2016), filter sparsity (Li et al., 2017), or N:M fine-grained sparsity (Nvidia, 2020; Hubara et al., 2021; Mishra et al., 2021). We focus on N:M fine-grained sparsity, in which N out of every M contiguous elements are pruned, for at least one of the two matrices involved in the matrix multiplication. Nvidia's sparse tensor cores (Nvidia, 2020; Mishra et al., 2021) can use N:M fine-grained sparsity to accelerate matrix multiplication. Specifically, Nvidia (2020) used a 2:4 format to accelerate inference by up to 2x. They suggested a three-step scheme: (a) train a dense model, (b) prune weights to obtain a fixed 2:4 mask, and (c) retrain with the masked weights using the original training regime.

Recall that each training step uses backpropagation, which has three phases. Each phase requires a General Matrix Multiplication (GEMM) for every DNN layer l:

[Forward]  z_l = W_l h_{l-1};  h_l = f_l(z_l)  (1)
[Backward] g_l = Diag(f'_l(z_l)) W^T_{l+1} g_{l+1}  (2)
[Update]   dC/dW_l = g_l h^T_{l-1},  (3)

where C is the loss function and, in each layer l, f_l is a non-linear activation function, W_l represents the weights, z_l the pre-activations, h_l the post-activations, and g_l = dC/dz_l is the neural gradient.
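To make the three GEMMs concrete, the following minimal NumPy sketch computes Equations (1)-(3) for a single fully connected layer with a ReLU activation (the layer sizes, ReLU choice, and random inputs are illustrative assumptions, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, d_next = 8, 6, 4

W_l = rng.normal(size=(d_out, d_in))        # weights of layer l
W_next = rng.normal(size=(d_next, d_out))   # weights of layer l+1
h_prev = rng.normal(size=(d_in, 1))         # post-activations of layer l-1
g_next = rng.normal(size=(d_next, 1))       # neural gradient of layer l+1

# [Forward]  z_l = W_l h_{l-1};  h_l = f_l(z_l), with f_l = ReLU
z_l = W_l @ h_prev
h_l = np.maximum(z_l, 0.0)

# [Backward] g_l = Diag(f'_l(z_l)) W^T_{l+1} g_{l+1}
g_l = (z_l > 0).astype(float) * (W_next.T @ g_next)

# [Update]   dC/dW_l = g_l h^T_{l-1}
dW_l = g_l @ h_prev.T
print(dW_l.shape)  # (6, 8)
```

Each phase is one GEMM, so sparsifying one operand of each (weights in the forward/backward phases, neural gradients in the update phase) is what makes 2x acceleration of all three phases possible.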
Nvidia suggested accelerating only the inference phase (i.e., the forward pass in Equation (1)), while the backward and update passes were kept dense. Noting that the backward phase uses the transposed (sparse) weight matrix, Hubara et al. (2021) used a transposable mask, i.e., a mask that can be transposed and still match the N:M fine-grained structure. This enabled the acceleration of the backward phase. Although Hubara et al. (2021) suggested different methods to find the optimal transposable mask efficiently, they did not suggest how to accelerate the update phase. In this work we explore different methods to accelerate the update phase as well using N:M sparsity. In Equation (3), we must decide whether to prune the activations (h_l) or the neural gradients (g_l). To avoid a mismatch with the forward phase in Equation (1), where the activations are not pruned, we focus in this work on the neural gradients for the update phase. To that end, we examine fine-grained pruning of the gradients and establish a tensor-level optimality criterion. So far, N:M sparsity in the weights has been obtained by minimizing the mean square error (MSE). We explain (Section 3) that, while this MSE criterion can also be used for N:M sparsity in the activations (which can be useful for inference, as we discuss in Section 6), N:M sparsity in the neural gradients is better served by a minimum variance unbiased estimator (MVUE). We develop (in Section 4) such MVUE pruning methods for 1:2 and 2:4 sparsity in the neural gradients. Our experiments (in Section 5) suggest that, while the traditional minimum-MSE method causes training to fail, our MVUE method with 1:2 sparsity is usually sufficient for training, and 2:4 sparsity is enough when it is not.
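The MSE-vs-MVUE distinction can be illustrated on a 1:2 block. A minimum-MSE pruner deterministically keeps the larger-magnitude element, which biases the result toward zero for the dropped coordinate. An unbiased alternative, sketched below, keeps element i with probability proportional to |x_i| and rescales the survivor by the inverse of that probability, so the pruned block equals the original in expectation. This is only a plausible illustration of the 1:2 case; the paper's exact MVUE constructions are given in its Section 4.

```python
import numpy as np

def prune_1_2_mse(block):
    # Deterministic minimum-MSE 1:2 pruning: keep the larger-magnitude element.
    out = np.zeros_like(block)
    i = int(np.argmax(np.abs(block)))
    out[i] = block[i]
    return out

def prune_1_2_unbiased(block, rng):
    # Stochastic 1:2 pruning: keep element i with probability p_i = |x_i| / (|x_0| + |x_1|)
    # and rescale it by 1/p_i, so E[output] equals the input block (unbiased).
    a = np.abs(block)
    s = a.sum()
    if s == 0.0:
        return np.zeros_like(block)
    p = a / s
    i = rng.choice(2, p=p)
    out = np.zeros_like(block)
    out[i] = block[i] / p[i]
    return out

rng = np.random.default_rng(0)
x = np.array([1.0, 3.0])
est = np.mean([prune_1_2_unbiased(x, rng) for _ in range(100_000)], axis=0)
print(prune_1_2_mse(x))  # always drops the smaller element: biased
print(est)               # averages back to roughly [1.0, 3.0]: unbiased
```

The unbiased pruner has higher per-sample error than the MSE pruner, but because the gradient noise averages out over many update steps, unbiasedness is the property that matters for the update phase.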
Moreover, we suggest combining several such methods (fine-grained sparse neural gradients and sparse transposable fine-grained weights) in order to potentially speed up training even more and accelerate all training phases with N:M fine-grained sparsity. In Table 1 we present all the N:M fine-grained structured sparsity methods, the parts of the network they accelerate, the relevant optimality criterion we use, and the configurations we use to fully accelerate training. In summary, this paper makes the following contributions:

• We develop an unbiased minimum-variance optimality criterion for pruning neural gradients with N:M structured sparsity.

• We propose 1:2 and 2:4 unbiased minimum-variance methods to prune the neural gradients and demonstrate that they achieve small or no degradation, where previous methods failed.

• We combine these methods with previous methods for N:M structured sparsity in the weights, and observe small or no degradation. Thus, the GEMMs in all training phases can potentially be accelerated by 2x.



Subsequent works suggested methods to accelerate different parts of this scheme. First, Zhou et al. (2021) omitted steps (a) and (b) by training with an N:M mask from scratch using a straight-through estimator (STE) and additional regularization. Specifically, they keep a dense copy of the weights and set different weight decay rates for the masked and unmasked weights. Next, Hubara et al. (2021) focused on accelerating the remaining step (c), i.e., sparse training, as we do here.
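The dense-copy STE idea can be sketched in a few lines. The following toy loop (sizes, learning rate, and the stand-in gradient are illustrative assumptions, and the per-group weight-decay regularization of Zhou et al. is omitted) shows the core mechanism: a 2:4 mask is recomputed from the dense copy each step, the masked weights are used in the GEMMs, and the full gradient updates the dense copy so pruned weights can later re-enter the mask:

```python
import numpy as np

def mask_2_4(w):
    # Keep the 2 largest-magnitude entries in each contiguous block of 4 (2:4 sparsity).
    w4 = w.reshape(-1, 4)
    keep = np.argsort(np.abs(w4), axis=1)[:, 2:]   # indices of the 2 largest per block
    m = np.zeros_like(w4)
    np.put_along_axis(m, keep, 1.0, axis=1)
    return m.reshape(w.shape)

rng = np.random.default_rng(0)
w_dense = rng.normal(size=(4, 8))                  # dense copy, kept throughout training
for _ in range(3):
    m = mask_2_4(w_dense)                          # recompute the mask from the dense copy
    w_sparse = w_dense * m                         # weights actually used in the GEMMs
    grad = rng.normal(size=w_dense.shape)          # stand-in for a real weight gradient
    w_dense -= 0.1 * grad                          # STE: the dense copy gets the full gradient
```

Because the mask is derived from the dense copy rather than fixed up front, this removes the need for a separate dense pre-training and pruning stage.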

Table 1: Exploring fine-grained sparsity in different training phases with different sampling methods (MVUE, MSE). While previous methods aim to accelerate the forward and backward phases, we focus on accelerating the update phase. The combination of all methods allows us to accelerate all training phases.

Phase    | Weight pruning | Transposable weight pruning | Neural gradient pruning (ours) | Combination (ours)
Forward  | ✓ (MSE)        | ✓ (MSE)                     | ✗                              | ✓ (MSE)
Backward | ✗              | ✓ (MSE)                     | ✗                              | ✓ (MSE)
Update   | ✗              | ✗                           | ✓ (MVUE)                       | ✓ (MVUE)

