ACCURATE NEURAL TRAINING WITH 4-BIT MATRIX MULTIPLICATIONS AT STANDARD FORMATS

Abstract

Quantization of the weights and activations is one of the main methods to reduce the computational footprint of training Deep Neural Networks (DNNs). Current methods enable 4-bit quantization of the forward phase. However, this constitutes only a third of the training process. Reducing the computational footprint of the entire training process requires the quantization of the neural gradients, i.e., the loss gradients with respect to the outputs of intermediate neural layers. Previous works separately showed that accurate 4-bit quantization of the neural gradients needs to (1) be unbiased and (2) have a log scale. However, no previous work aimed to combine both ideas, as we do in this work. Specifically, we examine the importance of unbiased quantization in quantized neural network training, where to maintain it, and how to combine it with logarithmic quantization. Based on this, we suggest a logarithmic unbiased quantization (LUQ) method to quantize both the forward and backward phases to 4-bit, achieving state-of-the-art results in 4-bit training without overhead. For example, in ResNet50 on ImageNet, we achieve a degradation of 1.1%. We further improve this to a degradation of only 0.32% after three epochs of high precision fine-tuning, combined with a variance reduction method; both these methods add overhead comparable to previously suggested methods. A reference implementation is supplied in the supplementary material.

1. INTRODUCTION

Deep neural network (DNN) training consists of three main general-matrix-multiply (GEMM) phases: the forward phase, backward phase, and update phase. Quantization has become one of the main methods to compress DNNs and reduce the GEMM computational resources. Previous works showed how to quantize the weights and activations in the forward pass to 4 bits while preserving model accuracy (Banner et al., 2019; Nahshan et al., 2019; Bhalgat et al., 2020; Choi et al., 2018b). Despite these advances, they only apply to a third of the training process, while the backward phase and update phase are still computed with higher precision. Recently, Sun et al. (2020) were able, for the first time, to train a DNN while reducing the numerical precision of most of its parts to 4 bits, with some degradation (e.g., 2.49% error in ResNet50). To do so, Sun et al. (2020) suggested a non-standard radix-4 floating-point format, combined with double quantization of the neural gradients (called two-phase rounding). This was an impressive step forward in the ability to quantize all GEMMs in training. However, since a radix-4 format is not aligned with the conventional radix-2, any numerical conversion between the two requires an explicit multiplication to modify both the exponent and mantissa. Thus, their non-standard quantization requires specific hardware support (Kupriianova et al., 2013) that can significantly reduce the benefit of quantization to low bits (Appendix A.6), and makes it less practical.

The main challenge in reducing the numerical precision of the entire training process is quantizing the neural gradients, i.e., the backpropagated error. Previous works showed, separately, that accurate low-precision representation of the neural gradients requires: (1) logarithmic quantization and (2) unbiased quantization. Specifically, Chmiel et al. (2021) showed that the neural gradients have a heavy-tailed, near-lognormal distribution and found an analytical expression for the optimal floating-point format. At low precision levels, the optimal format is logarithmic. For example, for FP4 the optimal format is [sign, exponent, mantissa] = [1, 3, 0], i.e., without mantissa bits. In contrast, weights and activations are well approximated with Normal or Laplacian distributions (Banner et al., 2019; Choi et al., 2018a), and are therefore better represented using uniform quantization (e.g., INT4). However, Chmiel et al. (2021) did not use unbiased quantization (nor did any of the previous works that use logarithmic quantization of the neural gradients (Li et al., 2020; Miyashita et al., 2016; Ortiz et al., 2018)). Therefore, they were unable to successfully quantize in this FP4 format (their narrowest format was FP5). Chen et al. (2020a) showed that unbiased quantization of the neural gradients is essential to obtain unbiased weight gradients, which is required for SGD convergence analysis (Bottou et al., 2018). However, they focused on quantization using integer formats, as did other works that pointed out the importance of unbiasedness (Banner et al., 2018; Zhong et al., 2022). Naive quantization of the neural gradients using the optimal FP4 format (logarithmic) results in biased estimates of the FP32 weight gradients, and this leads to severe degradation in test accuracy.
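To make the format contrast concrete, the following NumPy sketch (an illustration added here, not part of the reference implementation) enumerates the positive levels of the FP4 [1, 3, 0] logarithmic format next to an INT4-style uniform grid covering the same range; the scale alpha is a placeholder value.

```python
import numpy as np

# Positive magnitudes representable by an FP4 [sign, exponent, mantissa] = [1, 3, 0] format:
# with 3 exponent bits and no mantissa, the levels are alpha * 2^e for e = 0, ..., 7,
# where alpha is the smallest representable magnitude (a free per-tensor scale).
alpha = 1.0  # illustrative scale only
fp4_log_levels = alpha * 2.0 ** np.arange(8)                 # logarithmic spacing
int4_uniform_levels = np.arange(1, 8) * (alpha * 2.0**7 / 7)  # uniform spacing over the same range

print("FP4 (log) levels:    ", fp4_log_levels)
print("INT4 (uniform) levels:", int4_uniform_levels)
```

With the same number of levels, the logarithmic grid spans a dynamic range of 2^7, which is what makes it a better fit for the heavy-tailed gradient distribution described above, whereas the uniform grid suits the Normal- or Laplacian-like weights and activations.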
In particular, a major issue is that under aggressive (naive) quantization many neural gradients with magnitudes below the representable range are zeroed, resulting in biased estimates of the FP32 gradients and reduced model accuracy. Using either a logarithmic scale or unbiased rounding alone fails catastrophically at 4-bit quantization of the neural gradients (e.g., see Fig. 2 below). Therefore, it is critical to combine them, as we do in this paper in Section 4. To do this, we stochastically quantize gradients below the representable range to either zero or the smallest representable magnitude α, to provide unbiased estimates within that "underflow" range. Additionally, in order to represent the maximum magnitude without bias, we dynamically adjust α so that the maximum always falls on the logarithmic grid that starts at α. Finally, to completely eliminate bias, we devise an efficient way to use stochastic rounding on a logarithmic scale for the values between α and the maximum. Together, this gradient quantization method is called Logarithmic Unbiased Quantization (LUQ). For 4-bit quantization it uses a numerical format with one sign bit, three exponent bits, and zero mantissa bits, along with stochastic mapping (to zero or α) of gradients whose magnitudes are below α and stochastic rounding within the representable range.

Main contribution: LUQ, for the first time, combines logarithmic quantization with unbiased quantization of the neural gradients, and does so efficiently using a standard format. By additionally quantizing the forward phase to INT4, we enable, for the first time, an efficient scheme for "full 4-bit training", i.e., the weights, activations, and neural gradients are quantized to 4 bits in standard formats (see Appendix A.1), so all GEMMs can be done in 4-bit and bandwidth can be reduced. As we show, this method requires little to no overhead while achieving state-of-the-art accuracy: for example, in ResNet50 we get 1.1% error degradation with standard formats; in comparison, the previous method (Sun et al., 2020) had 2.49% error degradation but required non-standard formats, as well as additional modifications which have additional overhead. Moreover, in Section 5 we suggest two optional simple methods to further reduce the degradation, with some overhead: the first reduces the quantization variance of the neural gradients using re-sampling, while the second is fine-tuning in high precision. Combining LUQ with these two methods, we achieve, for the first time, only 0.32% error degradation in 4-bit training of ResNet50. The overhead of our additional methods is no more than that of similar modifications previously suggested in Sun et al. (2020). Lastly, in Section 7 we discuss how to reduce remaining overheads such as data movement, scaling operations, and GEMM-related operations.
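To make the description above concrete, here is a minimal NumPy sketch of an LUQ-style quantizer as described in this section. The function name luq_quantize, the per-tensor scale handling, and the use of NumPy's random generator are illustrative assumptions; this is not the paper's reference implementation (which is supplied in the supplementary material).

```python
import numpy as np

def luq_quantize(x, exp_bits=3, rng=None):
    """Illustrative LUQ-style quantizer: sign + exp_bits exponent bits, no mantissa.

    Magnitudes below alpha are stochastically mapped to 0 or alpha (unbiased underflow);
    magnitudes in [alpha, max] are stochastically rounded to alpha * 2^e (unbiased
    logarithmic rounding). Both steps keep E[q(x)] = x.
    """
    if rng is None:
        rng = np.random.default_rng()
    n_levels = 2 ** exp_bits                  # 8 magnitude levels: alpha * 2^0, ..., alpha * 2^(n_levels-1)
    max_mag = np.abs(x).max()
    if max_mag == 0:
        return np.zeros_like(x)
    alpha = max_mag / 2.0 ** (n_levels - 1)   # place the maximum exactly on the top grid point

    sign = np.sign(x)
    mag = np.abs(x)

    # Unbiased stochastic underflow: |x| < alpha -> alpha with probability |x| / alpha, else 0.
    underflow = mag < alpha
    keep = rng.random(x.shape) < (mag / alpha)
    mag_under = np.where(keep, alpha, 0.0)

    # Unbiased stochastic rounding on the logarithmic grid for alpha <= |x| <= max.
    e = np.floor(np.log2(np.maximum(mag, alpha) / alpha))  # lower exponent index
    low = alpha * 2.0 ** e                                  # grid point just below |x|
    p_up = (mag - low) / low                                # P(round up to 2 * low) makes the estimate unbiased
    round_up = rng.random(x.shape) < p_up
    mag_log = np.where(round_up, 2.0 * low, low)

    return sign * np.where(underflow, mag_under, mag_log)

# Quick (illustrative) unbiasedness check: averaging many quantizations recovers the input.
g = np.random.randn(4, 4).astype(np.float32)
est = np.mean([luq_quantize(g) for _ in range(10000)], axis=0)
print(np.abs(est - g).max())  # should be small, since the quantizer is unbiased
```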

2. RELATED WORKS

Quantization of neural networks: Quantization has been extensively investigated in the last few years. Most of the quantization research has focused on reducing the numerical precision of the weights and activations for inference (e.g., Courbariaux et al. (2016); Rastegari et al. (2016); Banner et al. (2019); Nahshan et al. (2019); Choi et al. (2018b); Bhalgat et al. (2020); Choi et al. (2018a); Liang et al. (2021)). In this case, for standard ImageNet models, the best performing methods can achieve quantization to 4 bits with small or no degradation (Choi et al., 2018a; Sakr et al., 2022). These methods can

