END-TO-END QUANTIZED TRAINING VIA LOG-BARRIER EXTENSIONS

Abstract

Quantization of neural network parameters and activations has emerged as a successful approach to reducing model size and inference time on hardware that supports native low-precision arithmetic. Fully quantized training would facilitate further computational speed-ups as well as enable model training on embedded devices, a feature that would alleviate privacy concerns resulting from the transfer of sensitive data and models that is necessitated by off-device training. Existing approaches to quantization-aware training (QAT) perform "fake" quantization in the forward pass in order to learn model parameters that will perform well when quantized, but rely on higher precision variables to avoid overflow in large matrix multiplications, which is unsuitable for training on fully low-precision (e.g. 8-bit) hardware. To enable fully end-to-end quantized training, we propose Log Barrier Tail-bounded Quantization (LogBTQ). LogBTQ introduces a loss term, inspired by the log-barrier for constrained optimization, that enforces soft constraints on the range of values that model parameters can take on. By constraining and sparsifying model parameters, activations and inputs, our approach eliminates overflow in practice, allowing for fully quantized 8-bit training of deep neural network models. We show that models trained using our approach achieve results competitive with state-of-the-art full-precision networks on the MNIST, CIFAR-10 and ImageNet classification benchmarks.

1. INTRODUCTION

As state-of-the-art deep learning models for vision, language understanding and speech grow increasingly large and computationally burdensome (He et al., 2017; Devlin et al., 2018; Karita et al., 2019), there is a countervailing demand, motivated by latency, security and privacy concerns, to perform training and inference on smaller devices at the edge rather than in server farms in the cloud. Model quantization has emerged as a promising approach to enabling deployment of deep learning models on edge devices, reducing energy, latency and storage requirements by performing computation in low precision (fewer than 32 bits). There are two primary strategies for quantization. Post-training approaches quantize the parameters of a model trained in full precision post hoc, and tend to suffer a heavy accuracy penalty because their inference graph differs substantially from the training graph (Jacob et al., 2018). Quantization-aware training (QAT) (Bhuwalka et al., 2020) combats this discrepancy by simulating quantization during training, so that the learned parameters perform well when inference is carried out in low precision. In this work, we focus on the latter setting, which is suitable for fully quantized training on low-precision (e.g. 8-bit) devices.

Though QAT yields quantized models that perform largely on par with their non-quantized counterparts, current state-of-the-art QAT methods (Wu et al., 2018; Wang et al., 2018; Bhuwalka et al., 2020) are not suitable for training on fully low-precision hardware because they employ fake quantization: each operation is executed in 32- or 16-bit floating-point arithmetic, and only its output is quantized to lower precision, e.g. int8. This creates two key incompatibilities with fully low-precision training, and consequently with deployment on real low-precision hardware. First, existing QAT approaches assume perfect sums in inner-product operations, which means that the accumulators used to compute matrix multiplies (the acc row in Table 1) must have higher precision than the values being multiplied (the other bit-precision rows in Table 1). This avoids losing resolution in low-precision additions, also known as swamping (Wang et al., 2018; see the note on swamping and overflow below). Second, QAT commonly uses dynamic quantization ranges per layer, meaning the mapping between high- and low-precision values varies by layer and is carefully tuned as a function of the network architecture, optimization dynamics and data during training. While this practice results in higher quantized inference accuracy, it poses a challenge for low-precision training, since it is unclear how to tune those ranges when training on new data in the absence of high-precision arithmetic. These incompatibilities present a substantial hurdle to quantized training in practice. For example, an automotive electronics manufacturer may want to deploy a machine learning model on its 8-bit door-lock or power-window controller to adapt to users' habits; in this scenario, existing approaches for quantized training would fail (Sakr et al., 2019).
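To make the incompatibility concrete, the sketch below (a minimal illustration assuming PyTorch; the scale value and tensor shapes are arbitrary) shows the standard fake-quantization pattern: values are snapped to an int8 grid in the forward pass, but the rounded values are kept in float32, so the matrix-multiply accumulation still relies on exactly the high-precision arithmetic that fully 8-bit hardware lacks.

    import torch

    def fake_quantize(x, scale, num_bits=8):
        """Simulate int8 quantization in the forward pass ("fake" quantization).

        The tensor is rounded to the nearest representable level but kept in
        float32, so any subsequent matmul still accumulates in full precision.
        A straight-through estimator lets gradients flow through the rounding.
        """
        qmin, qmax = -(2 ** (num_bits - 1)), 2 ** (num_bits - 1) - 1
        q = torch.clamp(torch.round(x / scale), qmin, qmax)
        x_q = q * scale                       # dequantize back to float32
        return x + (x_q - x).detach()         # straight-through estimator

    # The weights "look" quantized, but the accumulation below still happens
    # in float32 -- the higher-precision accumulator discussed above.
    w = torch.randn(64, 64, requires_grad=True)
    a = torch.randn(32, 64)
    y = a @ fake_quantize(w, scale=0.05)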
In response, we propose a new approach for fully quantized training of neural network models, inspired by the barrier method from convex optimization (Boyd & Vandenberghe, 2004). Log Barrier Tail-bounded Quantization (LogBTQ) uses a log-barrier extension loss (Kervadec et al., 2019) to constrain the output of the network, encouraging all model parameters and activations to stay within the same predefined range. The log-barrier function itself is a smooth approximation of the indicator function, which makes it well suited to selecting the weights that lie within the quantization range (see Figure 1, left). By fixing a single quantization range throughout the network at the start of training, our approach obviates the need for dynamic ranges, and the limits of the range are set so as to alleviate overflow (see the note on swamping and overflow below) in matrix-multiply accumulations. We combine the log-barrier extension loss with an L1 regularization term (Hoffer et al., 2018) to further reduce the total magnitude of parameters and activations in the model. To allow gradients, which tend to form a peaked distribution of extremely small values (Zhou et al., 2016; Jain et al., 2020), to be quantized using the same range as the rest of the network, we also adopt the nonlinear µ-law algorithm from audio applications (Deng & Doroslovacki, 2006) to construct a new MU8 codebook that handles "swamping" better than the standard IEEE floating-point formats. Experiments show that our approach achieves competitive results compared to state-of-the-art full-precision models on the MNIST, CIFAR-10 and ImageNet classification benchmarks, despite our models being trained end-to-end using only 8 bits of precision.
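As a concrete illustration of the constraint term, the following sketch implements the log-barrier extension of Kervadec et al. (2019) for a box constraint |w| <= c. This is a minimal sketch: the range c, the temperature t, and the way the penalty is combined with the task loss and the L1 term are illustrative choices, not the paper's exact formulation.

    import math
    import torch

    def log_barrier_extension(z, t=5.0):
        """Log-barrier extension of Kervadec et al. (2019) for a constraint z <= 0.

        For z <= -1/t^2 this is the standard log barrier -(1/t) * log(-z); beyond
        that point it continues linearly, so the penalty stays finite and
        differentiable even when the constraint is violated.
        """
        threshold = -1.0 / t ** 2
        z_safe = torch.clamp(z, max=threshold)          # keep the log argument positive
        barrier = -(1.0 / t) * torch.log(-z_safe)
        linear = t * z - (1.0 / t) * math.log(1.0 / t ** 2) + 1.0 / t
        return torch.where(z <= threshold, barrier, linear)

    def range_penalty(w, c=2.0, t=5.0):
        """Soft penalty keeping parameters inside a fixed quantization range [-c, c],
        written as the two inequalities w - c <= 0 and -w - c <= 0."""
        return (log_barrier_extension(w - c, t) + log_barrier_extension(-w - c, t)).sum()

    # Illustrative use alongside the L1 term mentioned above (coefficients hypothetical):
    # loss = task_loss + lambda_barrier * range_penalty(w) + lambda_l1 * w.abs().sum()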



Note on swamping and overflow. Swamping: in floating-point accumulation, a small-magnitude value is ignored (effectively truncated) when it is added to a large-magnitude sum. Overflowing: in fixed-point accumulation, the accumulated value wraps around to a small value when it exceeds the largest value representable at the given accumulation precision.
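A tiny NumPy illustration of both effects (the specific values are arbitrary and chosen only to trigger each behavior):

    import numpy as np

    # Swamping: in float16, the spacing between representable values near 2048 is 2,
    # so adding 1.0 to the running sum has no effect -- the small addend is lost.
    acc = np.float16(2048.0)
    print(acc + np.float16(1.0))    # 2048.0

    # Overflowing: in int8 fixed-point accumulation, exceeding 127 wraps around
    # (NumPy may emit an overflow warning here).
    acc = np.int8(120)
    print(acc + np.int8(10))        # -126 instead of 130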



Figure 1: Left: Visualization of the log-barrier constraint applied to network parameters quantized in the range [-2, 2]. See §3.3 for an approximate tail bound on possible overflow. Right: µ-law encoding vs. FP8 (1-5-2) and FP8 (1-4-3) for all possible values on the interval [-2, 2]; µ-law maintains higher precision where small values concentrate.
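For reference, a minimal NumPy sketch of the µ-law companding curve plotted in the right panel. Here µ = 255 is the standard audio setting and the uniform 127-level grid is an illustrative choice, not necessarily the paper's MU8 codebook construction.

    import numpy as np

    MU = 255.0  # standard audio setting; the paper's MU8 codebook may differ

    def mu_law_encode(x, x_max=2.0, mu=MU):
        """Compress values in [-x_max, x_max] with the mu-law companding curve,
        which spends most of its resolution near zero."""
        x = np.clip(x, -x_max, x_max) / x_max
        return np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)

    def mu_law_decode(y, x_max=2.0, mu=MU):
        """Invert the companding curve back to the original range."""
        return np.sign(y) * ((1.0 + mu) ** np.abs(y) - 1.0) / mu * x_max

    # An 8-bit codebook: quantize uniformly in the companded domain, then expand.
    x = np.array([-2.0, -0.5, -0.01, 0.0, 0.01, 0.5, 2.0])
    codes = np.round(mu_law_encode(x) * 127).astype(np.int8)
    x_hat = mu_law_decode(codes.astype(np.float64) / 127.0)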

