POST-TRAINING WEIGHTED QUANTIZATION OF NEURAL NETWORKS FOR LANGUAGE MODELS

Abstract

As a practical model compression technique, parameter quantization is effective especially for language models associated with a large memory footprint. Neural network quantization is usually performed to reduce quantization loss under the assumption that the quantization error of each parameter contributes equally to the overall training loss. The importance of each parameter, however, may differ widely such that, for the same number of quantization bits, certain parameters lead to higher training loss than others after quantization. In this paper, we consider a non-uniform quantization scheme, specifically binary-coding-based quantization, for a high compression ratio and efficient computations while avoiding the large accuracy degradation of uniform quantization (e.g., INT8). We then derive quantization optimization methods that take into account the importance of each parameter. We demonstrate that for post-training quantization, weight magnitude can represent importance and improve model accuracy significantly compared to previous schemes lacking importance considerations. For various language models including BERT, DistilBERT, AWD-LSTM, and Transformer, our proposed post-training quantization achieves 2-4 bits per weight with reasonable accuracy degradation.

1. INTRODUCTION

Training techniques for deep neural networks (DNNs) have been developed in ways that incur substantial parameter redundancy to expedite the search for local minima (Denil et al., 2013; Jonathan Frankle, 2019). As a result, various model compression techniques, including parameter pruning (Han et al., 2015; He et al., 2017), quantization (Courbariaux et al., 2015; Rastegari et al., 2016), low-rank approximation (N. Sainath et al., 2013; Prabhavalkar et al., 2016), and knowledge distillation (Hinton et al., 2015; Polino et al., 2018), have been proposed to lower storage requirements and improve inference performance. Several compression techniques can be combined synergistically to enhance the compression ratio (Han et al., 2016; Zhu et al., 2017). In this work, we consider parameter quantization, which maintains structured model formats and offers a high compression ratio. Note that, due to limited hardware resources, quantization is an essential method for any inference system. In general, quantization is classified into uniform quantization based on fixed-point parameter representations (Jacob et al., 2018; Han et al., 2016) and non-uniform quantization associated with binary codes (Zhou et al., 2017; Rastegari et al., 2016) or codebooks (Choi et al., 2017; Stock et al., 2020). Most DNN quantization methods follow the principle of minimizing the mean squared error (MSE) of the quantized parameters (Rastegari et al., 2016; Xu et al., 2018; Zhou et al., 2017). Optimizing the MSE is also an underlying principle of low-rank approximation techniques such as the singular value decomposition (SVD) (Prabhavalkar et al., 2016; N. Sainath et al., 2013). Note, however, that minimizing the MSE implies that each parameter is equally important (i.e., squared errors from parameters are accumulated without considering the importance of each weight).
In practice, the impact of each parameter's quantization perturbation on the training loss can be vastly different, and such impact needs to be analyzed through a sensitivity study of each parameter with respect to a change in the training loss value. In other words, minimizing the MSE (or the Euclidean distance between original and quantized parameters) may not correspond to minimizing the training loss after quantization. The robustness of each parameter to quantization error can be expressed as sensitivity: the sensitivity of the $i$-th parameter $w_i$ is the amount of change in the loss function when $w_i$ is perturbed. A parameter with high sensitivity requires a relatively smaller quantization error when quantization is performed in a group manner. Several previous works acknowledge the distinct sensitivity of each parameter to improve quantization quality. Because exact sensitivity estimation of each parameter with respect to the loss function is highly complicated, various heuristic techniques have been introduced. For example, Hessian-weighted k-means clustering has been used for codebook-based implementations (Choi et al., 2017), and a Taylor series expansion bounding the loss difference has been used to decide the optimal number of quantization bits for each weight (Khoram & Li, 2018). The Hessian matrix can also be used to assign different numbers of quantization bits to each layer (Dong et al., 2019; Shen et al., 2019). Minimizing the reconstruction error on the output activations after quantizing each layer is performed in (Stock et al., 2020). In this paper, we propose a weighted quantization framework in which quantized parameters follow the structure of binary codes so as to achieve a high compression ratio and high computational efficiency (Rastegari et al., 2016; Jeon et al., 2020).
Specifically, given that the importance of each parameter is represented as a real number between 0 and 1, we derive an optimal quantization solution modified from previous binary-coding-based quantization methods that assume equal parameter importance. Similar to previous attempts, we find that calculating the exact importance of each parameter is challenging. As a practical approximation, we show that magnitude-based importance estimation is especially effective for post-training non-uniform quantization.

2. POST-TRAINING PARAMETER QUANTIZATION FOR LANGUAGE MODELS

The number of parameters in language models is increasing dramatically (e.g., GPT-3 (Brown et al., 2020) requires 175 billion parameters). Correspondingly, model compression for language models is becoming a mandatory process to reduce response time and inference energy. We devise a compression method considering the following:

• Recent language models are usually memory-bound because of small batch sizes and the lack of layers with high data reuse (e.g., convolutional layers). Thus, reducing the memory footprint is critical.
• Compression algorithms should be supported by dedicated kernels, designed specifically for language models if possible.
• Compression-aware training is challenging and expensive if hyper-parameters are added to already huge language models (hence, we choose a post-training method).

Fixed-point inference using uniform quantization is not desirable for language models because of noticeable accuracy degradation (Shen et al., 2019; Jeon et al., 2020), while the advantage of small computational units (e.g., INT8 MACs) is insignificant for memory-bound applications. Thus, we adopt float-based parameter quantization (i.e., the expected values of quantized parameters remain in full precision), which requires far fewer quantization bits than fixed-point quantization (Xu et al., 2018; Stock et al., 2020). Recently, a kernel library called BiQGEMM (Jeon et al., 2020) was introduced to support binary-coding-based quantization techniques and accelerate quantized neural networks. Using lookup tables, BiQGEMM enables byte-level memory accesses and achieves an 8.3× reduction in run-time memory footprint and a 3.5× speedup on a mobile CPU for the Transformer (Chung et al., 2020). As a result, binary-coding-based quantization has become a practical approach to quantizing language models. Accordingly, we restrict our interest to binary-coding-based quantization techniques in this paper.
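The lookup-table trick behind BiQGEMM can be sketched in a few lines. The mock-up below is our own illustrative NumPy version, not the actual kernel (which operates on packed bits with byte-level accesses): for each length-`sub` slice of the input activation vector, the dot products with all $2^{sub}$ sign patterns are precomputed once, so every quantized output row reduces to table lookups and additions.

```python
import itertools
import numpy as np

def lut_binary_matvec(B, alpha, x, sub=8):
    """Toy lookup-table mat-vec for 1-bit binary-coded weights W ~ alpha * B.

    For every `sub`-element slice of x, the partial sums of all 2^sub sign
    patterns are computed once and shared across all output rows.
    """
    n = len(x)
    y = np.zeros(B.shape[0])
    for s in range(0, n, sub):
        xs = x[s:s + sub]
        k = len(xs)
        # dot products of this slice with every possible sign pattern
        patterns = np.array(list(itertools.product([-1.0, 1.0], repeat=k)))
        table = patterns @ xs
        # each row's sign block (+1 -> bit 1, -1 -> bit 0) indexes the table
        bits = (B[:, s:s + sub] > 0).astype(int)
        idx = bits @ (1 << np.arange(k)[::-1])
        y += table[idx]
    return alpha * y
```

For $q$-bit codes, the same tables can be reused for every binary vector $\mathbf{b}_i$, which is why the cost grows slowly with the number of quantization bits.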
Quantization-aware training is an active research area aiming to improve model accuracy (Courbariaux et al., 2015; Lee et al., 2018). We note, however, that in the case of language models there are numerous occasions when retraining for quantization is not feasible. For example, quantization-aware training requires in-depth knowledge of model compression, while model designers may not have such expertise. Moreover, the original training code or the entire training data may not be shared with model compression engineers. Also, modifying the original DNN models to be aware of quantization would increase model design efforts and training time significantly. Since language models already demand significant training time and cost, the additional complexity of quantization-aware training is not a practical option. As such, post-training quantization without retraining is gaining increasing attention (Zhao et al., 2019; Nagel et al., 2019).

3. WEIGHTED QUANTIZATION BASED ON THE BINARY CODES

As discussed, we choose post-training binary-coding-based quantization as our strategy to compress language DNN models efficiently. Following Binary-Weight-Networks (Rastegari et al., 2016), which introduced binary codes as a quantization format, a weight vector $\mathbf{w}$ is approximated as $\alpha\mathbf{b}$ using a scaling factor $\alpha \in \mathbb{R}$ and a binary vector $\mathbf{b} \in \{-1, +1\}^n$, where $n$ is the vector size. A real-valued scaling factor is shared by multiple weights, so the binary vector $\mathbf{b}$ accounts for most of the weight storage requirement. Binary codes eliminate the need for dequantization during inference, leading to reduced on-chip memory size for weights. In this section, we study general weighted quantization methodologies in which quantization follows the format of the binary codes and the quantization objective incorporates sensitivity information.

3.1. GREEDY METHOD AND ALTERNATING METHOD WITHOUT IMPORTANCE CONSIDERATIONS

In general, non-uniform weight quantization methods (in the form of binary codes) strive to minimize $\|\mathbf{w} - \alpha\mathbf{b}\|^2$. In the case of 1-bit quantization, we obtain the following analytical solution:

$$\mathbf{b}^* = \mathrm{sign}(\mathbf{w}), \quad \alpha^* = \frac{\mathbf{w}^\top \mathbf{b}^*}{n}. \quad (1)$$

In the case of multi-bit quantization, on the other hand, there is no analytical solution (Rastegari et al., 2016; Xu et al., 2018), so various approximate methods exist.

Greedy Method: As a computationally simple method, the 1-bit quantization of Eq. (1) can be extended to multi-bit ($q$-bit) quantization (Guo et al., 2017). Specifically, the $i$-th bit ($i > 1$) is obtained by minimizing the residue of the $(i-1)$-th bit as follows:

$$\min_{\alpha_i, \mathbf{b}_i} \|\mathbf{r}_{i-1} - \alpha_i \mathbf{b}_i\|^2, \quad \text{where } \mathbf{r}_{i-1} = \mathbf{w} - \sum_{j=1}^{i-1} \alpha_j \mathbf{b}_j, \quad 1 < i \leq q. \quad (2)$$

The optimal solution of Eq. (2) is then given as $\mathbf{b}_i^* = \mathrm{sign}(\mathbf{r}_{i-1})$, $\alpha_i^* = \mathbf{r}_{i-1}^\top \mathbf{b}_i^* / n$.

Alternating Method: The Greedy method described above is non-iterative. In order to reduce $\|\mathbf{w} - \sum_{i=1}^{q} \alpha_i \mathbf{b}_i\|^2$ further than the Greedy method, iterative methods are necessary, and increasing the number of iterations tends to lower the quantization error. Once initial $\alpha$ and $\mathbf{b}$ values are calculated by the Greedy method, $\{\alpha_i\}_{i=1}^q$ can be refined (Guo et al., 2017) as

$$[\alpha_1, \ldots, \alpha_q]^\top = \left(\mathbf{B}_q^\top \mathbf{B}_q\right)^{-1} \mathbf{B}_q^\top \mathbf{w}, \quad \text{where } \mathbf{B}_q = [\mathbf{b}_1, \ldots, \mathbf{b}_q] \in \{-1, +1\}^{n \times q}. \quad (4)$$

Then $\mathbf{B}_q$ can be refined as well by binary search given the newly refined $\{\alpha_i\}_{i=1}^q$. As a result, $\{\alpha_i\}_{i=1}^q$ and $\mathbf{B}_q$ are refined alternately until there is no noticeable improvement in quantization error. This iterative procedure was introduced as the Alternating multi-bit method (Xu et al., 2018).
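The two methods above can be sketched compactly in NumPy. This is a simplified illustration: for the code-refinement step of the Alternating method we enumerate all $2^q$ candidate codes per weight instead of the binary search of (Xu et al., 2018), which is equivalent for small $q$.

```python
import itertools
import numpy as np

def greedy_quantize(w, q):
    """Greedy method: the i-th bit quantizes the residue of the (i-1)-th bit."""
    r = w.astype(np.float64).copy()
    alphas, bits = [], []
    for _ in range(q):
        b = np.where(r >= 0, 1.0, -1.0)   # b* = sign(r)
        a = np.abs(r).mean()              # alpha* = r^T b* / n = mean(|r|)
        alphas.append(a)
        bits.append(b)
        r = r - a * b
    return np.array(alphas), np.stack(bits, axis=1)   # shapes (q,), (n, q)

def alternating_refine(w, alphas, B, iters=20):
    """Alternating method: refine alphas by least squares, then the codes."""
    q = len(alphas)
    codes = np.array(list(itertools.product([-1.0, 1.0], repeat=q)))  # (2^q, q)
    for _ in range(iters):
        # [alpha_1 .. alpha_q] = (B^T B)^-1 B^T w, via least squares
        alphas = np.linalg.lstsq(B, w, rcond=None)[0]
        # pick, per weight, the code whose reconstruction is closest
        recon = codes @ alphas
        B = codes[np.abs(w[:, None] - recon[None, :]).argmin(axis=1)]
    return alphas, B
```

Both refinement steps are individually optimal given the other variable, so the quantization error is non-increasing over iterations.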

3.2. IMPORTANCE-AWARE WEIGHTED QUANTIZATION

Let us assume that the importance of the $i$-th parameter is normalized and given as $m_i$ ($0 \leq m_i \leq 1$). Then, we minimize the weighted quantization loss $\sum_{i=1}^{n} m_i (w_i - \hat{w}_i)^2$, where $w_i$ is quantized to $\hat{w}_i = \sum_{j=1}^{q} \alpha_j b_{j,i}$ and $b_{j,i}$ denotes the $i$-th element of $\mathbf{b}_j$. Before studying how to estimate importance values, we are interested in finding modified versions of the Greedy and Alternating methods when importance values are given. For 1-bit quantization, weighted quantization also has an analytical solution:

$$\mathbf{b}^* = \mathrm{sign}(\mathbf{w}), \quad \alpha^* = \frac{\sum_{i=1}^{n} m_i |w_i|}{\sum_{i=1}^{n} m_i}. \quad (5)$$

Note that if all importance values are equal (e.g., $m_i = 1$ for all $i$), then Eq. (5) reduces to Eq. (1); correspondingly, Eq. (1) can be regarded as a special case of Eq. (5). Compared to the conventional Greedy method, our importance-aware Greedy method modifies the $\alpha_i$ calculation to $\alpha_i^* = \sum_{k=1}^{n} m_k |r_{i-1,k}| / \sum_{k=1}^{n} m_k$, where $r_{i-1,k}$ is the $k$-th element of the residue $\mathbf{r}_{i-1}$. For the importance-aware Alternating method, we first run the importance-aware Greedy method. Then, Eq. (4) is transformed to employ importance. Let us define an $n$-by-$n$ diagonal matrix $\mathbf{M} = \mathrm{diag}(m_1, \ldots, m_n)$, where each diagonal element is an importance value $m_i$. By solving a weighted linear least-squares problem, the $\alpha$ values are refined as

$$[\alpha_1, \ldots, \alpha_q]^\top = \left(\mathbf{B}_q^\top \mathbf{M} \mathbf{B}_q\right)^{-1} \mathbf{B}_q^\top \mathbf{M} \mathbf{w}, \quad \text{where } \mathbf{B}_q = [\mathbf{b}_1, \ldots, \mathbf{b}_q] \in \{-1, +1\}^{n \times q}, \quad (6)$$

while refining $\mathbf{B}_q$ is still performed by binary search using the refined scaling factors. Accordingly, Eq. (4) is a particular case of Eq. (6) in which $\mathbf{M}$ is the identity matrix. Overall, our proposed importance-aware quantization scheme is general enough to include the previous methods as special cases. In the rest of this paper, we investigate simple and efficient schemes for estimating importance metrics applicable to post-training non-uniform binary-coding-based quantization.
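Under stated assumptions ($\mathbf{m}$ given, with enough nonzero entries for the $q \times q$ system in Eq. (6) to be well-conditioned), the importance-aware variants only change how the scaling factors are computed. A minimal NumPy sketch:

```python
import itertools
import numpy as np

def weighted_greedy_quantize(w, m, q):
    """Importance-aware Greedy: alpha* = sum(m |r|) / sum(m), per Eq. (5)."""
    r = w.astype(np.float64).copy()
    alphas, bits = [], []
    for _ in range(q):
        b = np.where(r >= 0, 1.0, -1.0)
        a = (m * np.abs(r)).sum() / m.sum()
        alphas.append(a)
        bits.append(b)
        r = r - a * b
    return np.array(alphas), np.stack(bits, axis=1)

def weighted_alternating_refine(w, m, alphas, B, iters=20):
    """Importance-aware Alternating: alphas = (B^T M B)^-1 B^T M w (Eq. 6)."""
    q = len(alphas)
    codes = np.array(list(itertools.product([-1.0, 1.0], repeat=q)))
    for _ in range(iters):
        MB = B * m[:, None]                     # M B, with M = diag(m)
        alphas = np.linalg.solve(B.T @ MB, MB.T @ w)
        # per-weight code choice: m_i scales all candidates of weight i
        # equally, so the argmin is the same as in the unweighted case
        recon = codes @ alphas
        B = codes[np.abs(w[:, None] - recon[None, :]).argmin(axis=1)]
    return alphas, B
```

Setting $\mathbf{m}$ to all ones recovers the conventional Greedy and Alternating methods, matching the special-case argument above.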

4. IMPORTANCE ESTIMATION USING WEIGHT MAGNITUDE

Sensitivity or importance of each parameter can be estimated by evaluating the loss function change induced by a parameter perturbation. Such estimation, however, is computationally demanding since perturbing each parameter requires computing entire feedforward paths, while the number of parameters keeps increasing in recent DNN designs. Moreover, it is difficult to decide an appropriate amount of perturbation for each parameter. In this section, we propose an efficient importance estimation method based on weight magnitude for fast and accurate post-training weight quantization.

4.1. WEIGHT MAGNITUDE AS IMPORTANCE

To study a factor affecting importance, we revisit Optimal Brain Damage (OBD) (LeCun et al., 1990). OBD approximates the loss function by a Taylor series, and the perturbation $\delta L$ of the loss function caused by weight perturbations is presented as

$$\delta L = \sum_{i=1}^{v} \frac{\partial L}{\partial w_i} \delta w_i + \frac{1}{2} \sum_{i=1}^{v} h_{i,i} \delta w_i^2 + \frac{1}{2} \sum_{i=1}^{v} \sum_{\substack{j=1 \\ j \neq i}}^{v} h_{i,j} \delta w_i \delta w_j + O(\|\delta \mathbf{w}\|^3), \quad (7)$$

where $v$ is the number of weights and $h_{i,j}$ is an element of the Hessian matrix. Note that at a local minimum, the first term vanishes while all the $h_{i,i}$ are non-negative. Using the "diagonal" and "quadratic" approximations (LeCun et al., 1990), Eq. (7) is simplified as

$$\delta L \approx \frac{1}{2} \sum_{i=1}^{v} h_{i,i} \delta w_i^2. \quad (8)$$

As described in Eq. (8), the diagonal elements of the Hessian matrix can serve as effective importance metrics for loss-aware training algorithms for quantization (Hou et al., 2017; Hou & Kwok, 2018). For post-training quantization, unfortunately, the first and second partial derivatives (i.e., the gradient and the Hessian) are not available. Thus, for post-training quantization, importance must be given as a function of weight magnitude. In other words, we are interested in whether Hessian-based importance can be replaced with magnitude-based importance. The goal of this work is to empirically show that magnitude-based importance is indeed practical for post-training binary-coding-based quantization. To verify our basic assumption that large weights present higher importance for binary-coding-based quantization, we controlled the scaling factors (i.e., the $\alpha$ values minimizing the MSE are multiplied by a 'Scaling Factor Multiplier') of a layer in an LSTM model on the PTB dataset (see details of the models and dataset in the Appendix), as shown in Figure 1. Indeed, when the scaling factors become larger than the ones obtained by minimizing the MSE, test perplexity improves for both layers.

4.2. HYPER-PARAMETERS FOR IMPORTANCE METRICS

The underlying principle of our importance estimation methodology is that relatively smaller quantization errors should be assigned to weights of larger magnitude. To fine-tune our proposed importance estimation scheme, we introduce the following three hyper-parameters: 1) an exponent controlling the correlation between the importance and the magnitude of a weight, 2) a parameter clipping importance to handle outliers in magnitude distributions, and 3) a pruning parameter excluding weights of low magnitude from quantization optimization.

E (Exponent)

A basic form of the normalized importance of each weight is

$$m_i = \left(\frac{|w_i|}{w_{\max}}\right)^E, \quad (9)$$

where $w_{\max}$ is the maximum weight magnitude in a given layer (i.e., $m_i$ is computed layer-wise) and the weight of the largest magnitude is considered the most important. A value between 0.0 and 1.0 is primarily adopted for $E$ in our experiments to yield a sub-linear importance increase as weight magnitude increases. $E = 0$ reduces to the conventional quantization method with $m_i = 1$ for all $w_i$.

C (Clipping Importance)

For a given distribution of weight magnitudes, a few large outliers may distort the entire distribution of $m_i$ obtained by Eq. (9). In other words, because of a few exceptionally large weights, most weights may exhibit small importance values. A conventional technique to prevent outliers in a distribution is to clip weights/activations and/or gradients (Choi et al., 2018; Zhao et al., 2019; Goodfellow et al., 2016). For Eq. (9), $w_{\max}$ is set to the weight magnitude at the $(C \times 100)$-th percentile, where $0 < C \leq 1$. If $m_i$ exceeds 1.0, it is clamped to 1.0 (note that $w_i$ itself is not clipped).

P (Pruning for Quantization)

Due to regularization effects, many weights have small magnitudes, and a weight distribution usually follows a Gaussian distribution (Goodfellow et al., 2016). As a result, a large number of small weights (with less importance, as in Eq. (8)) may account for a large portion of the total quantization error unless the $m_i$ values of those weights are extremely small. Note that even though pruning prior to quantization is an effective method to improve quantization quality (Li & Liu, 2016; Zhu et al., 2017), pruning would require one additional bit per weight for masking information, or sparse matrix formats with low parallelism for DNN inference.
In our work, 1) we exclude weights of magnitude smaller than the $(P \times 100)$-th percentile from the quantization optimization, 2) we find the scaling factors and the binary codes using only the remaining weights, and 3) we assign each excluded small weight to the binary code with the smallest magnitude available from combining scaling factors (while its sign is maintained). Accordingly, while we adopt the idea of parameter pruning for quantization, no additional pruning mask data is necessary. In short, small weights are ignored while obtaining the binary codes and are then replaced with the smallest representable weight in the binary codes.
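The three hyper-parameters can be combined into a single importance routine. The function below is our own sketch (the name and the returned `keep` mask are ours): it computes Eq. (9) per layer with clipping ($C$) and flags the weights that $P$ excludes from optimization.

```python
import numpy as np

def magnitude_importance(w, E=1.0, C=1.0, P=0.0):
    """m_i = (|w_i| / w_max)^E (Eq. 9), where
    C: w_max is the magnitude at the (C*100)-th percentile, m_i clamped to 1;
    P: weights below the (P*100)-th percentile are flagged for exclusion."""
    mag = np.abs(w)
    w_max = np.quantile(mag, C)           # C = 1.0 recovers the plain maximum
    m = np.minimum((mag / w_max) ** E, 1.0)
    keep = mag >= np.quantile(mag, P)     # P = 0.0 keeps every weight
    return m, keep
```

Weights with `keep == False` are left out when optimizing $\alpha$ and $\mathbf{B}$, and are afterwards mapped to the smallest representable code while preserving their signs, so no pruning mask has to be stored.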

4.3. EMPIRICAL OBSERVATIONS

To verify the basic operation of our proposed method, we perform post-training weighted quantization using fine-tuned BERT-base models (Devlin et al., 2018) on the MNLI and MRPC datasets from the GLUE benchmark (Wang et al., 2018). For fine-tuned BERT models, we quantize all weights except those of the segment embedding layer and the classification layer, which have a tiny storage footprint. For the conventional and weighted Alternating quantization methods, we run 20 iterative refinements of the $\alpha$ and $\mathbf{B}$ values, beyond which no further noticeable improvement in quantization error is observed. Given a weight matrix or tensor, $\alpha$ and $\mathbf{B}$ are computed for each row independently (hence, we study row-wise quantization in this work). Due to the space limit, see the Appendix for additional experimental results with models not included in this section. We first analyze how a simple weighted quantization scheme (with $E$=1.0, $C$=1.0, and $P$=0.0) adds distinguishing features to the conventional quantization methods. Minimizing $\sum_{i=1}^{n} m_i (w_i - \hat{w}_i)^2$ is preferred to minimizing the MSE for post-training quantization. Interestingly, for the Greedy method, the quantization MSE is also reduced by weighted quantization; we conjecture that for weight distributions in DNNs, Eq. (5) is probably a better approximation than Eq. (1) even for minimizing the quantization MSE. For both the Greedy and Alternating methods, scaling factors increase under weighted quantization as a consequence of the magnitude-based importance design. Let us now study the impact of our proposed weighted quantization on model accuracy when $E$, $C$, and $P$ are varied for fine-tuning. Figure 2 shows the test accuracy and quantization error of a fine-tuned BERT-base model on MRPC when we sweep only one of $E$, $C$, or $P$ across all layers. All hyper-parameters for importance metrics clearly open a new search space for model accuracy that is somewhat uncorrelated with quantization error.
Using various combinations of $E$, $C$, and $P$, Table 2 reports the test accuracy of fine-tuned BERT and DistilBERT models using the Greedy and Alternating quantization methods. Even though numerous hyper-parameter combinations outperform the Greedy and Alternating methods without importance, the best set of hyper-parameters varies for each model, and hence an automated hyper-parameter search process is desirable. To enable such an automated process, we need to verify that a set of importance hyper-parameters searched on the training dataset is also valid for the test dataset. We extensively explored hyper-parameter combinations (of $E$, $C$, and $P$) with different numbers of refinement iterations using BERT on MRPC and MNLI, and confirmed that the training accuracy achieved by our weighted quantization is highly correlated with the test accuracy, i.e., our proposed hyper-parameters maintain a generalization capability (see Fig. 4 in the Appendix). Increasing the number of scaling factors (by decreasing the number of parameters sharing a scaling factor) enhances model accuracy at the cost of increased memory footprint and computation overhead during inference. Our proposed weighted quantization obtains larger accuracy improvements when fewer scaling factors are utilized (see Table 7 in the Appendix, which compares model accuracy for different numbers of scaling factors when the Alternating method is applied to BERT on MRPC).

5. EXPERIMENTAL RESULTS

We observed that an optimal set of hyper-parameters must be found empirically. Unfortunately, optimizing $E$, $C$, and $P$ for post-training quantization to obtain the best model accuracy is challenging because 1) trained models present a variety of weight distributions and 2) the hyper-parameters are correlated. To automate the hyper-parameter search process, we adopt Bayesian optimization (BO), implemented using a publicly available library (Nogueira, 2014-). Once we perform a rough and fast grid search for hyper-parameters (as shown in Table 2), BO then fine-tunes the hyper-parameters. As a result, we achieve fast post-training quantization even when the optimal hyper-parameters vary for each layer. For BO experiments, the training dataset $D_t$ is used during hyper-parameter search, and the test dataset $D_v$ validates the optimization procedure (refer to Table 4). In other words, given a hyper-parameter vector $\mathbf{x} = \{E, C, P\} \in \mathbb{R}^3$, BO tries to find the optimal $\mathbf{x}^* = \arg\max_{\mathbf{x}} f(\mathbf{x}; D_t)$, where $f$ measures the accuracy of the model. Then, test accuracy is measured as $f(\mathbf{x}^*; D_v)$. To the best of our knowledge, our work is the first post-training binary-coding-based quantization considering weight importance. We therefore compare our results with the conventional Greedy and Alternating algorithms. We consider three different search methods for our proposed scheme: 1) (manual search) we investigate 16 prearranged sets of hyper-parameters for the importance metric (described in Table 5); 2) (model-wise BO) the same values of $E$, $C$, and $P$ are explored and applied to all layers; and 3) (layer-wise BO) hyper-parameters are searched locally for each layer and then fixed before proceeding to the next layer (hence, BO for quantization is performed in a layer-by-layer manner).
For all model-wise and layer-wise BO runs across the various models, the same 16 sets of hyper-parameters (given in Table 5) are explored first as initial samples. BO outperforms our manual search, while layer-wise BO improves the test score further, as presented in Tables 8 (BERT-base), 9 (DistilBERT-base), 11 (Longformer), 13 (AWD-LSTM), and 14 (Transformer NMT). Among the three methods considered, layer-wise BO is the best because the optimal set of hyper-parameters turns out to be vastly different for each layer, as shown in Figures 3, 5, 6, 8, and 9. Table 3 presents the overall comparison of test scores for various language models quantized by conventional Alternating quantization and by our proposed weighted quantization. Compared to conventional Alternating quantization (our baseline for post-training binary-coding-based quantization) with equal importance for each parameter, ours improves the test scores of all language models in Table 3. We note that weighted quantization yields different amounts of improvement depending on the model. Even though a thorough analysis of these differences would entail an in-depth sensitivity analysis of the parameters with respect to the test scores, magnitude-based weight pruning (without retraining) provides an approximate correlation between importance and magnitude for a given weight distribution (see Figure 7 in the Appendix). Indeed, the achievable pruning rate of the Longformer is much lower than that of the other models (as shown in Figure 7), which can partly explain the difficulty of enhancing Longformer quantization. We also note that our proposed method depends heavily on the target objective function. As such, Transformers show different quantization results depending on whether perplexity (PPL) or BLEU score is targeted, due to the somewhat low correlation between PPL and BLEU (Appendix C.3).
We applied our weighted quantization scheme to ResNet models on CIFAR-10 and ImageNet (refer to Appendix D) for which model accuracy is also significantly enhanced similar to language models.

6. CONCLUSIONS AND FUTURE WORK

In this paper, we propose a weighted quantization framework employing importance metrics, which is useful when each parameter shows a different sensitivity toward a change in the loss function. For binary-coding-based quantization, our choice for language models because of existing efficient kernel designs (e.g., BiQGEMM) and its high compression ratio, we derive modified Greedy and Alternating methods assuming that each importance value is represented as a real number between 0 and 1. Using various DNN models, we demonstrate that a magnitude-based importance metric is effective for post-training quantization in the form of binary codes. To fine-tune model accuracy, we also propose three hyper-parameters that need to be investigated empirically, since the optimal set of hyper-parameters varies with each model design. We suggest Bayesian optimization as an effective technique to automate the hyper-parameter search process. Our proposed hyper-parameters can be optimized independently for each layer to further improve the compression ratio and/or model accuracy. It would be interesting to study additional hyper-parameters effective for post-training quantization. Since our weighted quantization framework is general (rather than depending on a particular approximation such as the Hessian), if proper importance metrics are found, our proposed quantization techniques can be extended to quantization-aware training.

A MODELS AND DATASETS

A.1 LSTM MODELS

1-Layer LSTM model (in Fig. 1) (Zaremba et al., 2014): A one-layer LSTM model with 300 hidden states is used on the PTB dataset. We compress the LSTM layer and the embedding layer of the pre-trained language model to draw Figure 1. A scaling factor is extracted for each row; e.g., for the (10000, 300) embedding layer, quantization produces 10,000 scaling factors.

AWD-LSTM (Merity et al., 2017): We use a 3-layer AWD-LSTM model. The embedding size is 400 and the hidden vector size within the LSTM layers is 1550. We train the original model for 500 epochs and then fine-tune for an additional 300 epochs. We compress both models, including the embedding and softmax layers, by our post-training quantization method.

Dataset: To evaluate our quantization method, we use the test set of the Penn Treebank (PTB) dataset (Marcus et al., 1993) for the LSTM language models. To find the quantization parameters with BO, we use the validation set.

A.2 HUGGINGFACE LANGUAGE MODELS

To evaluate recently developed language models, we utilize the transformers library (PyTorch version) developed by huggingface (Wolf et al., 2019). For all BERT-based models, the last classification layer and the sequence embedding layer are not quantized because their weights are relatively small compared to the other weights.

BERT (Devlin et al., 2018) / DistilBERT (Sanh et al., 2019): We fine-tune the pre-trained BERT-base and DistilBERT-base models to evaluate our method. The BERT-base model consists of 12 encoder blocks with a hidden size of 768, and the DistilBERT-base model consists of 6 encoder blocks with a hidden size of 768. We follow the fine-tuning recipes found in the transformers repository. For the MRPC and MNLI tasks, the initial learning rate is 2e-5 and the number of training epochs is 3. For the SQuAD task, the initial learning rate is 3e-5 and the number of training epochs is 2.

Longformer (Beltagy et al., 2020): To evaluate the Longformer, we choose the pre-trained longformer-base model, which has 12 Longformer encoder blocks with a hidden size of 768. We fine-tune the pre-trained model on the SQuAD v1.1 dataset with the same recipe.

Dataset: We use three language tasks: MRPC, MNLI, and SQuAD (v1.1). To search the quantization parameters ($E$, $C$, and $P$), we use a randomly sampled fraction of the training dataset when evaluation is too time-consuming. For the MRPC task, we use the whole training dataset because it is small enough. For the MNLI task, we use only 10% of the training dataset. For the SQuAD task, we use only 6.7% of the training dataset for Bayesian optimization, while the dev dataset is used for testing because no test dataset is published.

A.3 OPENNMT TRANSFORMER

We use a pre-trained Transformer model for a neural machine translation task (Klein et al., 2018). The Transformer model consists of 6 encoder blocks, 6 decoder blocks, and embedding layers. Note that the embedding layers are not shared, i.e., the embedding weights are not tied. The vocabulary size is 50k and the size of the hidden vector is 512.

Dataset: We evaluate the pre-trained model in one translation direction: English to German (en2de). We use the validation set of newstest2017 for the BO processes and the test set for the test evaluation. All datasets are pre-processed with SentencePiece (Kudo & Richardson, 2018). All translation scores are BLEU scores computed by the sacrebleu script (Post, 2018) with a beam size of 1.

Under review as a conference paper at ICLR 2021

A.4 RESNET FOR IMAGE CLASSIFICATION

We conduct experiments using ResNet32 on CIFAR-10 (Krizhevsky, 2009) and ResNet18 (He et al., 2016) on ImageNet (Russakovsky et al., 2015). For convolution tensors, $\alpha$ and $\mathbf{B}$ are computed per channel. We keep the first and last layers of the ResNet models in full precision because those layers are very small while requiring many quantization bits (McDonnell, 2018). For ResNet18, we use the ImageNet1K training dataset, a small subset of the ImageNet dataset, for fast hyper-parameter search, while test accuracy is still measured on the entire ImageNet test set. To obtain the same accuracy for the same set of hyper-parameters, the training dataset is not randomly manipulated (e.g., by cropping and flipping).

B BAYESIAN OPTIMIZATION FOR WEIGHTED QUANTIZATION

BO is one of the automated machine learning (AutoML) techniques for searching optimal hyper-parameters of networks. Given a black-box function $f$, BO aims to find the optimum $x^* = \arg\max_x f(x)$. Suppose the observations are described as $\mathbf{y} = [f(x_1), f(x_2), \ldots, f(x_n)]^T$ and $y_*$ is the output of an unobserved point $x_*$. Under the assumption that $f(x)$ is drawn from a Gaussian process, the distribution of $y_* \mid \mathbf{y}$ follows $\mathcal{N}(K_* K^{-1} \mathbf{y},\; K_{**} - K_* K^{-1} K_*^T)$, where
$$K = \begin{bmatrix} k(x_1, x_1) & \cdots & k(x_1, x_n) \\ \vdots & \ddots & \vdots \\ k(x_n, x_1) & \cdots & k(x_n, x_n) \end{bmatrix}, \quad K_* = [k(x_*, x_1) \;\cdots\; k(x_*, x_n)], \quad K_{**} = k(x_*, x_*).$$
The kernel function $k(x, x')$ is one of the hyper-parameters of BO and measures the similarity between $x$ and $x'$ (i.e., the output is high when they are close). There are various kernel functions (Rasmussen & Williams, 2006); we use the squared-exponential kernel, one of the popular choices for regression (Ebden, 2015). To identify which unobserved point is taken as $x_{n+1}$, an acquisition function needs to be specified. The expected improvement acquisition function $a_{EI}$ (Lizotte, 2008) (see Eq. (11)) is the most commonly used and is selected for our experiments:
$$a_{EI}(x_* \mid \mathbf{y}) = \big(Z\Phi(Z) + \phi(Z)\big)\,\sigma(x_*), \quad \text{where } Z = \begin{cases} \dfrac{\mu(x_*) - f(x^+) - \xi}{\sigma(x_*)} & \text{if } \sigma(x_*) > 0 \\ 0 & \text{if } \sigma(x_*) = 0 \end{cases} \tag{11}$$
and $f(x^+) = \max_{1 \le i \le n} f(x_i)$. The parameter $\xi$ is the trade-off factor between exploitation and exploration. In our experiments, we set $\xi$ to 0.2, which implies that exploitation has more influence on determining $x_{n+1}$. After computing $a_{EI}$ for randomly sampled unobserved points $x_*$, $x_{n+1}$ is chosen as $\arg\max_{x_*} a_{EI}(x_* \mid \mathbf{y})$. Further details of BO can be found in (Brochu et al., 2010; Nogueira, 2014; Snoek et al., 2012).
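The GP posterior and the expected improvement acquisition can be sketched in a few lines. This is a toy 1-D illustration, not the experiment code; the objective function and candidate grid are assumptions.

```python
import numpy as np
from math import erf

def sq_exp_kernel(a, b, length=1.0):
    """Squared-exponential kernel k(x, x') = exp(-(x - x')^2 / (2 l^2))."""
    d = a[:, None] - b[None, :]
    return np.exp(-d ** 2 / (2.0 * length ** 2))

def gp_posterior(X, y, X_star, noise=1e-6):
    """GP posterior mean and std at X_star given observations (X, y)."""
    K = sq_exp_kernel(X, X) + noise * np.eye(len(X))
    K_s = sq_exp_kernel(X_star, X)
    K_inv = np.linalg.inv(K)
    mu = K_s @ K_inv @ y
    # diag(K_** - K_* K^{-1} K_*^T); K_**'s diagonal is 1 for this kernel
    var = 1.0 - np.sum((K_s @ K_inv) * K_s, axis=1)
    return mu, np.sqrt(np.maximum(var, 0.0))

def expected_improvement(mu, sigma, f_best, xi=0.2):
    """a_EI = (Z Phi(Z) + phi(Z)) * sigma, Z = (mu - f_best - xi) / sigma."""
    z = np.where(sigma > 0, (mu - f_best - xi) / np.maximum(sigma, 1e-12), 0.0)
    pdf = np.exp(-0.5 * z ** 2) / np.sqrt(2.0 * np.pi)
    cdf = 0.5 * (1.0 + np.array([erf(v / np.sqrt(2.0)) for v in z]))
    return (z * cdf + pdf) * sigma

# Toy 1-D objective (an assumption for illustration only).
f = lambda x: -(x - 0.3) ** 2
X = np.array([0.0, 0.5, 1.0])           # observed points
y = f(X)
X_star = np.linspace(0.0, 1.0, 101)     # candidate points
mu, sigma = gp_posterior(X, y, X_star)
ei = expected_improvement(mu, sigma, y.max())
x_next = X_star[np.argmax(ei)]          # next point to evaluate
```

In the actual experiments, the search space is the (E, C, P) hyper-parameters and the objective is the training metric after quantization, but the posterior/acquisition machinery is the same.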
Table 13: Hyper-parameter search results of fine-tuned AWD-LSTM (quantized into 2/3/4 bits per weight) using manual search, model-wise BO, and layer-wise BO.

See Appendix for detailed descriptions of the models and datasets selected for our experiments.
Available at https://github.com/pytorch/examples/tree/master/word_language_model
The detailed parameters are described in https://github.com/salesforce/awd-lstm-lm
https://github.com/huggingface/transformers
Available at https://huggingface.co/allenai/longformer-base-4096
Available at https://opennmt.net/Models-py/
Available at https://s3.amazonaws.com/pytorch-tutorial-assets/imagenet_1k.zip



Figure 1: Quantization error (MSE) and test perplexity when one selected layer of an LSTM model on the PTB dataset is quantized to 1 bit. (Left): Embedding layer. (Right): LSTM layer.

Figure 2: Test accuracy and quantization error (MSE) of fine-tuned BERT-base model on MRPC when weights are quantized into 3 bits by our proposed method while one of E, C, or P varies.

Figure 3: E, C, and P values searched by layer-wise BO for BERT-base on MRPC, MNLI, and SQUAD. X-axis shows layer index and y-axis shows hyper-parameters optimized differently for each layer. BO is necessary to efficiently and quickly find such diversified E, C, and P values.

Figure 4: Relationship of training accuracy achieved by weighted quantization and test accuracy using BERT on MRPC (LEFT) and BERT on MNLI (RIGHT).

Figure 5: E, C, and P values searched by layer-wise BO for DistilBERT-base on MRPC, MNLI, and SQUAD. X-axis shows layer index and y-axis shows hyper-parameters optimized differently for each layer.

Figure 6: E, C, and P values searched by layer-wise BO for Longformer on SQUAD v1.1. X-axis shows layer index and y-axis shows hyper-parameters optimized differently for each layer.

Figure 7: Test score degradation by (post-training) pruning weights (based on the magnitude) using various pre-trained language models. Weights of a layer are pruned by the same target pruning rate. For the same pruning rate, the Longformer presents sharper score degradation, which partly explains the difficulty of improving test scores by our proposed weighted quantization method compared to the conventional Alternating quantization.

Figure 8: E, C, and P values searched by layer-wise BO for fine-tuned AWD-LSTM model. X-axis shows layer index and y-axis shows hyper-parameters optimized differently for each layer.

Figure 9: E, C, and P values searched by layer-wise BO for Transformer. BO is performed to optimize PPL (Left) or BLEU (Right). X-axis shows layer index and y-axis shows hyper-parameters optimized differently for each layer.

Figure 10: E, C, and P values searched by layer-wise BO for ResNet18 on ImageNet. X-axis shows layer index and y-axis shows hyper-parameters optimized differently for each layer.

Post-training (3 bits per weight) quantization comparison on MSE (quantization error), average scaling factor values, training loss, and training model accuracy. For importance metrics, E = 1.0 is used while P and C are not considered.

Table 1 presents comparisons of quantization error (MSE), average scaling factor values, training loss, and training accuracy. Note that for Alternating weighted quantization, despite a larger quantization MSE (i.e., $\sum_{i=1}^{n}(w_i - \hat{w}_i)^2$), the training loss and training model accuracy are improved, which confirms that minimizing the MSE alone does not minimize the training loss.
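The effect can be illustrated with a greedy binary-coding sketch (an illustration only, not the paper's exact Alternating algorithm): with an importance vector, each scaling factor minimizes the importance-weighted squared error instead of the plain MSE, so the weighted quantizer can have a larger plain MSE while reducing the weighted error.

```python
import numpy as np

def greedy_binary_coding(w, bits, importance=None):
    """Greedy multi-bit binary-coding quantization of a weight vector.

    Each step fits alpha_k * sign(r) to the current residual r. With an
    importance vector, alpha_k minimizes the importance-weighted squared
    error sum_i imp_i * (r_i - alpha_k * b_i)^2 instead of the plain MSE.
    """
    imp = np.ones_like(w) if importance is None else importance
    r = w.copy()
    w_hat = np.zeros_like(w)
    for _ in range(bits):
        b = np.where(r >= 0, 1.0, -1.0)
        alpha = np.sum(imp * np.abs(r)) / np.sum(imp)  # weighted optimum
        w_hat += alpha * b
        r = w - w_hat
    return w_hat

rng = np.random.default_rng(1)
w = rng.standard_normal(1024)
imp = np.abs(w)                          # weight magnitude as importance
plain = greedy_binary_coding(w, bits=3)
weighted = greedy_binary_coding(w, bits=3, importance=imp)
mse = lambda a: np.sum((w - a) ** 2)             # plain quantization error
wmse = lambda a: np.sum(imp * (w - a) ** 2)      # importance-weighted error
```

For a fixed binary code, the plain scaling factor is optimal for `mse` and the weighted one for `wmse`, so each variant wins under its own objective.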

Test score after post-training quantization with various E, C, and P choices when the number of quantization bits is 3.

Quantization results on various language models (see Appendix for details on model descriptions). Alternating quantization scheme significantly improves test scores when combined with our proposed importance metrics (described as 'Ours') that are searched by layer-wise BO.

Dataset usages and maximum iterations for Bayesian optimization. In the case of a large training dataset (such as MNLI and SQUAD v1.1), we use randomly sampled subsets.

16 sets of hyper-parameters selected for our manual search of importance metric.

Test score after post-training quantization with various E, C, and P choices when the number of quantization bits is 4.

Hyper-parameter search results of BERT-base (quantized into 3 bits per weight) using manual search, model-wise BO, and layer-wise BO.

Hyper-parameter search results of DistilBERT-base (quantized into 3 bits per weight) using manual search, model-wise BO, and layer-wise BO.

F1 scores of Longformer on SQUAD v1.1 after post-training quantization (4 bits per weight) with various E, C, and P choices.



Quantizing the Transformer using 3 bits per weight with different quantization schemes and different metrics to be optimized by BO.

Post-training quantization comparison on quantization MSE, average scaling factor values, training loss, and training model accuracy. For importance metrics, E=1.0 is used while P and C are not considered.

Model accuracy (%) on the test dataset after post-training quantization with various E and C choices. q is the number of quantization bits.

The optimal hyper-parameters searched by Bayesian optimization when Alternating quantization method is utilized and q is the number of quantization bits.

