POST-TRAINING WEIGHTED QUANTIZATION OF NEURAL NETWORKS FOR LANGUAGE MODELS

Abstract

As a practical model compression technique, parameter quantization is especially effective for language models with large memory footprints. Neural network quantization is usually performed to reduce quantization loss under the assumption that the quantization error of each parameter contributes equally to the overall training loss. The importance of each parameter, however, can differ widely: for the same number of quantization bits, quantizing certain parameters increases the training loss more than quantizing others. In this paper, we consider a non-uniform quantization scheme, specifically binary-coding-based quantization, to obtain a high compression ratio and efficient computation while avoiding the large accuracy degradation of uniform quantization (e.g., INT8). We then derive quantization optimization methods that take the importance of each parameter into account. We demonstrate that for post-training quantization, weight magnitude can serve as an importance measure and significantly improve model accuracy compared to previous schemes that ignore importance. For various language models including BERT, DistilBERT, AWD-LSTM, and Transformer, our proposed post-training quantization achieves 2-4 bits per weight with reasonable accuracy degradation.

1. INTRODUCTION

Training techniques for deep neural networks (DNNs) have been developed in ways that incur substantial parameter redundancy in order to expedite the search for local minima (Denil et al., 2013; Jonathan Frankle, 2019). As a result, various model compression techniques, including parameter pruning (Han et al., 2015; He et al., 2017), quantization (Courbariaux et al., 2015; Rastegari et al., 2016), low-rank approximation (N. Sainath et al., 2013; Prabhavalkar et al., 2016), and knowledge distillation (Hinton et al., 2015; Polino et al., 2018), have been proposed to lower storage requirements and improve inference performance. Several compression techniques can be combined synergistically to enhance the compression ratio (Han et al., 2016; Zhu et al., 2017). In this work, we consider parameter quantization, which maintains structured model formats and offers a high compression ratio. Note that due to limited hardware resources, quantization is an essential method for any inference system. In general, quantization is classified into uniform quantization based on fixed-point parameter representations (Jacob et al., 2018; Han et al., 2016) and non-uniform quantization associated with binary codes (Zhou et al., 2017; Rastegari et al., 2016) or codebooks (Choi et al., 2017; Stock et al., 2020). Most DNN quantization methods are based on the principle of minimizing the mean squared error (MSE) of the quantized parameters (Rastegari et al., 2016; Xu et al., 2018; Zhou et al., 2017). Minimizing the MSE is also an underlying principle of low-rank approximation techniques such as the singular value decomposition (SVD) (Prabhavalkar et al., 2016; N. Sainath et al., 2013). Note, however, that minimizing the MSE implies that each parameter is equally important (i.e., squared errors from parameters are accumulated without considering the importance of each weight).
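As a concrete illustration of MSE-driven binary-coding quantization, the sketch below greedily decomposes a weight vector into a sum of scaled binary vectors. The function names and the greedy residual scheme are our own illustration of the general idea, not the implementation of any specific cited method:

```python
import numpy as np

def greedy_binary_quantize(w, num_bits=3):
    """Greedily decompose w into num_bits scaled binary vectors,
    minimizing the mean squared error (MSE) of each step's residual."""
    residual = w.astype(np.float64).copy()
    alphas, codes = [], []
    for _ in range(num_bits):
        b = np.sign(residual)
        b[b == 0] = 1.0                    # break ties at zero
        alpha = np.mean(np.abs(residual))  # MSE-optimal scale for b = sign(r)
        alphas.append(alpha)
        codes.append(b)
        residual -= alpha * b
    return np.array(alphas), np.stack(codes)

def dequantize(alphas, codes):
    """Reconstruct the full-precision approximation sum_k alpha_k * b_k."""
    return (alphas[:, None] * codes).sum(axis=0)
```

For a fixed binary code b with entries in {-1, +1}, the scale minimizing the squared error is alpha = b^T r / n, which equals the mean absolute residual when b = sign(r); each additional bit then quantizes the remaining residual.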
In practice, the impact of each parameter's quantization-induced perturbation on the training loss can be vastly different, and such impact needs to be analyzed through a sensitivity study of each parameter with respect to changes in the training loss. In other words, minimizing the MSE (or the Euclidean distance between the original and quantized parameters) may not correspond to minimizing the training loss after quantization. The robustness of each parameter to quantization error can be expressed as sensitivity: the sensitivity of the i-th parameter w_i is the amount of change in the loss function when w_i is perturbed. A parameter with high sensitivity requires a relatively smaller quantization error when quantization is performed in a group manner. Several previous works acknowledge the distinct sensitivity of each parameter to improve quantization quality. Because exact sensitivity estimation of each parameter with respect to the loss function is highly complicated, various heuristic techniques have been introduced. For example, Hessian-weighted k-means clustering has been used for codebook-based implementations (Choi et al., 2017), and a Taylor series expansion bounding the difference in the loss function has been used to decide the optimal number of quantization bits for each weight (Khoram & Li, 2018). The Hessian matrix can also be used to assign a different number of quantization bits to each layer (Dong et al., 2019; Shen et al., 2019). Stock et al. (2020) minimize the reconstruction error on the output activations after quantizing each layer. In this paper, we propose a weighted quantization framework in which quantized parameters follow the structure of binary codes so as to achieve a high compression ratio and high computational efficiency (Rastegari et al., 2016; Jeon et al., 2020).
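The Taylor-expansion view of sensitivity mentioned above can be made concrete with a toy example. The diagonal-Hessian form and the quadratic loss below are illustrative assumptions of ours, chosen so the second-order estimate is exact and easy to check:

```python
import numpy as np

def taylor_sensitivity(grad, hess_diag, delta):
    """Second-order Taylor estimate of the loss change when parameter i is
    perturbed by delta_i: dL_i ~ g_i * d_i + 0.5 * h_ii * d_i^2,
    using only the diagonal of the Hessian."""
    return grad * delta + 0.5 * hess_diag * delta ** 2

# Toy loss L(w) = 0.5 * sum_i c_i * w_i^2, so grad = c * w and hess_diag = c.
c = np.array([10.0, 1.0, 0.1])     # per-parameter curvature
w = np.array([0.3, 0.3, 0.3])
delta = np.full(3, 0.05)           # identical quantization error everywhere
dl = taylor_sensitivity(c * w, c, delta)
```

Even though every parameter receives the same perturbation, the estimated loss increase differs by orders of magnitude across the three parameters, which is exactly why equal-weight MSE minimization can be a poor proxy for the training loss.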
Specifically, given that the importance of each parameter is represented as a real number between 0 and 1, we derive an optimal quantization solution modified from previous binary-coding-based quantization methods that assume equal parameter importance. Like previous attempts, we find that calculating the exact importance of each parameter is challenging. As a practical approximation of importance, we show that magnitude-based importance estimation is especially effective for post-training non-uniform quantization.
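A minimal sketch of how a per-parameter importance score in [0, 1] could be folded into binary-coding quantization is shown below. The greedy structure and the closed-form weighted scale are our own illustration of the idea, with magnitude-based importance as the default; they are not the paper's exact derivation:

```python
import numpy as np

def weighted_binary_quantize(w, num_bits=3, importance=None):
    """Greedy binary-coding quantization in which each weight's squared
    error is scaled by an importance score in [0, 1]."""
    if importance is None:
        # Magnitude-based importance (an assumption for this sketch).
        importance = np.abs(w) / np.max(np.abs(w))
    residual = w.astype(np.float64).copy()
    alphas, codes = [], []
    for _ in range(num_bits):
        b = np.sign(residual)
        b[b == 0] = 1.0
        # Scale minimizing sum_i importance_i * (r_i - alpha * b_i)^2.
        alpha = np.sum(importance * residual * b) / np.sum(importance)
        alphas.append(alpha)
        codes.append(b)
        residual -= alpha * b
    return np.array(alphas), np.stack(codes)
```

Setting importance to all ones recovers the unweighted scale (the mean absolute residual); a non-uniform importance vector pulls the scale toward matching the high-importance weights at the expense of the low-importance ones.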

2. POST-TRAINING PARAMETER QUANTIZATION FOR LANGUAGE MODELS

The number of parameters in language models is increasing dramatically (e.g., GPT-3 (Brown et al., 2020) requires 175 billion parameters). Correspondingly, model compression for language models is becoming a mandatory process to reduce response time and inference energy. We devise a compression method considering the following:

• Recent language models are usually memory-bound because of small batch sizes and the lack of layers with high data reuse (e.g., convolutional layers). Thus, reducing the memory footprint is critical.
• Compression algorithms should be supported by dedicated kernels, designed specifically for language models if possible.
• Compression-aware training is challenging and expensive when hyper-parameters are added to already huge language models (hence, we choose a post-training method).

Fixed-point inference using uniform quantization is not desirable for language models because of noticeable accuracy degradation (Shen et al., 2019; Jeon et al., 2020), while the advantage of small computational units (e.g., INT8 MACs) is insignificant for memory-bound applications. Thus, we adopt float-based parameter quantization (i.e., the expected values of quantized parameters remain full-precision), which requires a much smaller number of quantization bits than fixed-point quantization (Xu et al., 2018; Stock et al., 2020). Recently, a kernel library called BiQGEMM (Jeon et al., 2020) was introduced to support binary-coding-based quantization and accelerate quantized neural networks. Using lookup tables, BiQGEMM enables byte-level memory accesses and achieves an 8.3× smaller run-time memory footprint and a 3.5× speedup on a mobile CPU for Transformer (Chung et al., 2020). As a result, binary-coding-based quantization has become a practical approach to quantizing language models. We therefore restrict our interest to binary-coding-based quantization in this paper.
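The lookup-table idea behind kernels such as BiQGEMM can be illustrated as follows. This simplified sketch (function names, group size, and bit-packing format are all our own assumptions, not the actual BiQGEMM implementation) precomputes, for each group of g activations, the partial sum for every possible g-bit sign pattern, so each row of a binary weight matrix sums a few table entries instead of performing n multiplications:

```python
import numpy as np

def lut_binary_matvec(codes, alphas, x, g=8):
    """Compute y = sum_k alpha_k * (B_k @ x), where each B_k is a {-1, +1}
    matrix stored as g-bit integer codes, via precomputed lookup tables.
    codes: (num_bits, rows, n // g) integers; bit j set means weight +1."""
    num_bits, rows, n_groups = codes.shape
    # All 2^g sign patterns: bit set -> +1, bit clear -> -1.
    patterns = np.arange(2 ** g)
    signs = np.where((patterns[:, None] >> np.arange(g)) & 1, 1.0, -1.0)
    # lut[p, grp] = signed partial sum of activation group grp under pattern p.
    x_groups = x.reshape(n_groups, g)
    lut = signs @ x_groups.T              # shape (2^g, n_groups)
    y = np.zeros(rows)
    for k in range(num_bits):
        # Each row gathers one table entry per group and sums them.
        y += alphas[k] * lut[codes[k], np.arange(n_groups)].sum(axis=1)
    return y
```

The table costs 2^g * (n / g) additions to build but is shared across all output rows and all bit-planes, which is why this style of kernel pays off when the weight matrix is tall, as in Transformer feed-forward layers.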
Quantization-aware training is an active research area for improving model accuracy (Courbariaux et al., 2015; Lee et al., 2018). We note, however, that in the case of language models there are numerous occasions when retraining for quantization is not feasible. For example, quantization-aware training requires in-depth knowledge of model compression that model designers may not have. The original training code or the entire training dataset may not be shared with model compression engineers. Moreover, modifying the original DNN models to be aware of quantization would significantly increase model design effort and training time. Since language models already demand significant training time and cost, the additional training complexity of quantization-aware training is not a practical option. As such, post-training quantization without retraining is gaining increasing attention (Zhao et al., 2019; Nagel et al., 2019).

