NEURAL GRADIENTS ARE NEAR-LOGNORMAL: IMPROVED QUANTIZED AND SPARSE TRAINING

Abstract

While training can largely be accelerated by reducing the time needed to propagate neural gradients (loss gradients with respect to intermediate neural-layer outputs) back through the model, most previous works focus on the quantization/pruning of weights and activations. These methods are often not applicable to neural gradients, which have very different statistical properties. In contrast to weights and activations, we find that the distribution of neural gradients is approximately lognormal. Considering this, we suggest two closed-form analytical methods to reduce the computational and memory burdens of neural gradients. The first method optimizes the floating-point format and scale of the gradients. The second method accurately sets sparsity thresholds for gradient pruning. Each method achieves state-of-the-art results on ImageNet. To the best of our knowledge, this paper is the first to (1) quantize the gradients to 6-bit floating-point formats, or (2) achieve up to 85% gradient sparsity, in each case without accuracy degradation. A reference implementation accompanies the paper in the supplementary material.

1. INTRODUCTION

Neural gradients are used in the training process of deep networks to backpropagate the error gradient throughout the model, which allows computing the required weight updates. As these neural gradients are needed for a substantial share of the underlying computations (about 2/3), compressing them can alleviate data-throughput requirements and accelerate the training process. Compressing neural gradients reduces the memory footprint of intermediate calculations and the data-transfer bandwidth inside the hardware accelerator. Moreover, in model-parallel distributed training, neural gradients are one of the main bottlenecks, as they must be transferred between devices (Rong et al., 2020; Gupta et al., 2020).

Many previous works (Banner et al., 2019; Fang et al., 2020) compress tensors such as weights and activations by approximating their distributions with an analytically tractable density. These works often assume a bell-shaped distribution such as the Gaussian or Laplace, which has been reported to fail for neural gradients (Ye et al., 2019). One key observation in this paper is that neural gradient distributions are heavy-tailed, fundamentally different from the light-tailed distributions of weights and activations. Further statistical and distributional tests reveal that gradient magnitudes follow a lognormal distribution.

Adopting this lognormal observation, we suggest two main applications, quantization and pruning, used to reduce the computational and memory burden of neural gradients. To tackle these challenges, we first formalize the problems and find closed-form expressions that enable us to predict the optimal quantization and pruning policies. These measures are easy to use and depend only on the estimated lognormal parameters. Figure 1 summarizes these applications and their derivation. The first application uses the lognormal prior to enable low-precision floating-point (FP) quantization of the gradients.
Here we optimize two tasks. The first is to find a partition between mantissa and exponent bit-widths that minimizes quantization noise for a given n-bit FP gradient representation. The second is to scale these gradients so that they are properly represented within a limited dynamic range (the distance between the maximum and minimum that the FP format can represent). We provide useful insights that give empirically based heuristics such as loss scaling (Micikevicius et al., 2018) a more grounded theoretical basis. Optimizing both tasks, we obtain state-of-the-art results for FP quantization of the neural gradients. The second application performs accurate and predictable stochastic pruning of gradients on the fly, which results in two state-of-the-art pruning schemes. The first translates the desired sparsity level into an accurate threshold, and the other enables combined use of different sparsity levels at different layers (heterogeneous sparsity).

Many compression methods have been proposed (…, 2020), aiming to reduce bandwidth and memory footprint as well as computation time. Most of these methods focus on the quantization or pruning of the weights/activations in the forward path (Banner et al., 2019; Nahshan et al., 2019) or of the weight gradients (Bernstein et al., 2018; Alistarh et al., 2016) in the backward path. So far, neural gradients have received less attention. Some of these methods (Banner et al., 2019; Ye et al., 2019; Fang et al., 2020) use a systematic and rigorous statistical approach to optimize various distortion measures. For example, Banner et al. (2019) used the normal distributional assumption (of weights and activations) to analytically minimize the mean-squared quantization error. Our work follows a similar line, rigorously optimizing similar performance measures for the quantization and pruning of gradient distributions, which differ from those of the weights and activations.
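To make the sparsity-to-threshold translation concrete, here is a minimal sketch, assuming lognormal gradient magnitudes and the standard unbiased form of stochastic pruning (sub-threshold values are zeroed or snapped to the threshold); this is an illustration of the idea, not necessarily the paper's exact scheme, and all names and parameter values are ours. Inverting the normal CDF of log-magnitudes maps a target fraction to a magnitude threshold in closed form:

```python
import math
import random
from statistics import NormalDist

def sparsity_threshold(mu, sigma, target):
    """If |g| ~ Lognormal(mu, sigma), then P(|g| < t) = Phi((ln t - mu) / sigma).
    Inverting the standard normal CDF gives the magnitude threshold below
    which the desired fraction of gradients falls."""
    return math.exp(mu + sigma * NormalDist().inv_cdf(target))

def stochastic_prune(g, t, rng):
    """Unbiased stochastic pruning: a sub-threshold gradient is either zeroed
    or snapped to +/- t with probability |g| / t, so E[output] = g."""
    if abs(g) >= t:
        return g
    return math.copysign(t, g) if rng.random() < abs(g) / t else 0.0

rng = random.Random(0)
mu, sigma = -9.0, 2.0  # illustrative lognormal parameters
grads = [math.exp(rng.gauss(mu, sigma)) * rng.choice([-1.0, 1.0])
         for _ in range(100_000)]

t = sparsity_threshold(mu, sigma, target=0.85)
below = sum(abs(g) < t for g in grads) / len(grads)      # ~0.85 by construction
pruned = [stochastic_prune(g, t, rng) for g in grads]
realized = sum(p == 0.0 for p in pruned) / len(pruned)   # lands below 0.85
print(f"t={t:.3g}  fraction below t={below:.3f}  realized sparsity={realized:.3f}")
```

Note that because stochastic pruning lets some sub-threshold values survive, the realized sparsity falls below the CDF target; this gap is precisely what an accurate closed-form threshold analysis has to account for.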

[Figure 1: Neural gradients distribution (lognormal)]
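The lognormality claim itself is easy to sanity-check on any gradient tensor: if |g| is lognormal, then log|g| is normal, so the Kolmogorov-Smirnov distance between the empirical CDF of log-magnitudes and a fitted normal should be small. A minimal sketch on synthetic data (the lognormal "gradients" here merely stand in for real tensors; all names are illustrative):

```python
import math
import random

random.seed(0)
n = 50_000
# Stand-in for real neural gradients: lognormal magnitudes, random signs.
grads = [math.exp(random.gauss(-9.0, 2.0)) * random.choice([-1.0, 1.0])
         for _ in range(n)]

# If |g| ~ Lognormal(mu, sigma), then log|g| ~ Normal(mu, sigma).
log_mag = sorted(math.log(abs(g)) for g in grads)
mu = sum(log_mag) / n
sigma = math.sqrt(sum((x - mu) ** 2 for x in log_mag) / n)

def norm_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

# Kolmogorov-Smirnov distance between the empirical CDF of log|g| and the
# fitted normal CDF; small values are consistent with lognormality.
ks = max(max((i + 1) / n - c, c - i / n)
         for i, c in enumerate(norm_cdf((x - mu) / sigma) for x in log_mag))
print(f"mu={mu:.2f}  sigma={sigma:.2f}  KS distance={ks:.4f}")
```

On real gradient tensors, the same two fitted parameters (mu, sigma) are all the closed-form quantization and pruning measures depend on.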

Gradient Quantization. While a lot of research has focused on the quantization of weights and activations for inference (Krishnamoorthi, 2018; Choi et al., 2018; Jain et al., 2020), there have also been major advances in quantization during training; many of these efforts struggled to represent the high dynamic range of the gradients (Banner et al., 2018; Wu et al., 2018). Cambier et al. (2020) suggest keeping a per-tensor shift and scale in full precision so that tensors fit within the FP8 dynamic range. The allocation of the available bits between mantissa and exponent has proven crucial in deep-learning workloads: for example, BF16 (1-8-7: sign-exponent-mantissa) has shown greater success than the traditional FP16 (1-5-10) format due to its wider dynamic range (Henry et al., 2019; Kalamkar et al., 2019). Research on the required format and the exponent-versus-mantissa trade-off is ongoing, with growing interest in lower-precision representations such as FP8. Some works have
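The dynamic-range gap between these formats follows directly from the bit layout. A small sketch, assuming IEEE-style formats with bias 2^(e-1) - 1 and ignoring subnormals for simplicity (the function name is ours):

```python
import math

def fp_dynamic_range(e, m):
    """Smallest and largest positive *normal* values of an IEEE-style
    floating-point format with e exponent bits and m mantissa bits
    (bias 2**(e-1) - 1); subnormals are ignored for simplicity."""
    bias = 2 ** (e - 1) - 1
    min_normal = 2.0 ** (1 - bias)                           # exponent field = 1
    max_normal = (2 - 2 ** -m) * 2.0 ** (2 ** e - 2 - bias)  # largest finite value
    return min_normal, max_normal

for name, e, m in [("FP16 (1-5-10)", 5, 10), ("BF16 (1-8-7)", 8, 7)]:
    lo, hi = fp_dynamic_range(e, m)
    print(f"{name}: min normal {lo:.3g}, max {hi:.3g}, range ~2^{math.log2(hi / lo):.0f}")
```

FP16 spans roughly 2^30 between its smallest normal and largest value, while BF16 spans about 2^254, which is why heavy-tailed gradients are far less likely to overflow or underflow in BF16 despite its coarser mantissa.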

