IMPROVED FULLY QUANTIZED TRAINING VIA RECTIFYING BATCH NORMALIZATION

Abstract

Quantization-aware Training (QAT) reduces training cost by quantizing neural network weights and activations in the forward pass, and improves speed at the inference stage. QAT can be extended to Fully-Quantized Training (FQT), which further accelerates training by quantizing gradients in the backward pass, as backpropagation typically occupies half of the training time. Unfortunately, gradient quantization is challenging because Stochastic Gradient Descent (SGD) based training is sensitive to the precision of the gradient signal. In particular, the noise introduced by gradient quantization accumulates during the backward pass, causing the exploding gradient problem and resulting in unstable training and a significant accuracy drop. Although Batch Normalization (BatchNorm) is the de-facto means of stabilizing training in the regular full-precision scenario, we observe that it fails to prevent gradient explosion when gradient quantizers are injected into the backward pass. Surprisingly, our theory shows that BatchNorm can amplify the noise accumulation, which in turn hastens the explosion of gradients. A BatchNorm rectification method is derived from our theory to suppress this amplification effect and bridge the performance gap between full-precision training and FQT. Adding this simple rectification loss to baselines yields better results than most prior FQT algorithms on various neural network architectures and datasets, regardless of the gradient bit-width used (8, 4, and 2 bits).



Quantization-aware Training (QAT) is a popular track of research that simulates neural network quantization (of weights and activations) during the course of training to curb the inference-time accuracy drop of low-bit models (e.g., INT8 quantization). On the other hand, theoretical calculations of BitOps computation costs (Yang & Jin (2021); Guo et al. (2020)) readily show that backpropagation accounts for half of the computation during training. Empirical data 1 shows that the backward pass sometimes costs even more in practice. Decreasing the gradient bit-width thus reduces the computation overhead of backpropagation (Horowitz (2014)). If the variables in the backward pass are also quantized, then together with the forward quantization in QAT, all network variables required in training are fully quantized and the whole training process can be accelerated on dedicated hardware, i.e., Fully-Quantized Training (FQT), making large-model training far more accessible to users with limited computation capability. Recent work (Zhu et al. (2020)) has shown that INT8 FQT speeds up the forward pass and the backward pass by 1.63× and 1.94×, respectively, when training ResNet-50 on ImageNet with an NVIDIA Pascal GPU.

Yet gradient quantization under the FQT scheme is vastly underexplored, as it is notoriously more challenging than forward quantization in QAT. It has been observed that network training is sensitive to the precision of gradients, and that low-bit gradient quantization leads to unstable training and a significant accuracy drop (see Fig. 1). More importantly, the accumulation of gradient quantization noise in the backward pass (see Fig. 2) causes the exploding gradient problem during backpropagation, sometimes even resulting in training failure. In contrast to weight/activation quantization, the gradient quantization noise produced during backpropagation cannot be automatically corrected by optimizing the objective loss. Unlike prior works that optimize gradient quantizers for quantization noise reduction (Zhou et al. (2016); Choi et al. (2018); Zhu et al. (2020)), this paper reveals the negative effect of Batch Normalization

1 https://github.com/jcjohnson/cnn-benchmarks
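To make the FQT setting concrete, the sketch below simulates a per-tensor, stochastically rounded low-bit gradient quantizer of the kind studied in this line of work. This is a minimal illustration under our own assumptions (function name, per-tensor max-abs scaling, stochastic rounding), not the specific quantizer used by any of the cited methods; its unbiasedness in expectation does not remove the per-step noise whose accumulation the paper analyzes.

```python
import numpy as np

rng = np.random.default_rng(0)

def quantize_grad(x, num_bits=8):
    """Per-tensor low-bit quantizer with stochastic rounding.
    Illustrative only: real FQT quantizers differ in scaling/rounding details."""
    qmax = 2 ** (num_bits - 1) - 1          # e.g. 127 for 8 bits
    scale = np.max(np.abs(x)) / qmax        # per-tensor scale factor
    if scale == 0.0:
        return x.copy()
    scaled = x / scale                      # map into [-qmax, qmax]
    floor = np.floor(scaled)
    # Round up with probability equal to the fractional part, so that
    # E[quantize_grad(x)] = x (the quantizer is unbiased in expectation).
    q = floor + (rng.random(x.shape) < (scaled - floor))
    q = np.clip(q, -qmax - 1, qmax)         # signed integer range
    return q * scale                        # dequantize back to float

g = rng.normal(size=(4, 4)).astype(np.float32)   # a dummy gradient tensor
g4 = quantize_grad(g, num_bits=4)
# The element-wise error is bounded by one quantization step (the scale).
print(np.max(np.abs(g - g4)) <= np.max(np.abs(g)) / 7 + 1e-6)  # prints True
```

Because each call injects bounded but nonzero noise, applying such a quantizer to the gradient at every layer compounds the noise along the backward pass, which is the accumulation effect discussed above.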

