FLEXROUND: LEARNABLE ROUNDING BY ELEMENT-WISE DIVISION FOR POST-TRAINING QUANTIZATION

Abstract

Post-training Quantization (PTQ) has been gaining popularity for the deployment of deep neural networks on resource-limited devices since, unlike quantization-aware training, neither a full training dataset nor end-to-end training is required at all. As PTQ schemes based on reconstructing each layer or block output prove effective in enhancing quantized model performance, recent works have developed algorithms to devise and learn a new weight-rounding scheme so as to better reconstruct each layer or block output. We notice, however, that such new rounding schemes are established on element-wise addition. In this work, we propose a simple yet effective new rounding mechanism for post-training weight quantization, coined FlexRound, via element-wise division, to learn not only a common quantization grid size but also a different scale for each pre-trained weight. Thanks to the reciprocal rule of derivatives induced by element-wise division, FlexRound is inherently able to exploit the importance of a pre-trained weight when updating its corresponding scale, and thus flexibly quantizes a pre-trained weight depending on its own importance. We empirically validate the efficacy of FlexRound on a wide range of models and tasks. To the best of our knowledge, our work is the first to carry out comprehensive experiments on not only image classification and natural language understanding but also natural language generation in the per-tensor uniform PTQ setting. Our code will be open-sourced soon.

1. INTRODUCTION

Recent years have witnessed the unprecedented success of deep neural networks in a wide variety of domains including computer vision, natural language processing, automatic speech recognition, and so on. Although state-of-the-art deep neural networks surpass human-level performance, their computation cost and memory usage inevitably grow as networks become deeper and wider. In order to reduce the model size and accelerate inference, many researchers have developed diverse compression techniques such as network quantization (Courbariaux et al., 2016) and network pruning (Han et al., 2016). In this paper, we concentrate on network quantization due to the advantage that INT4 or INT8 quantization allows us to accelerate quantized neural networks using off-the-shelf accelerators such as the NVIDIA A100 Tensor Core GPU (Wu et al., 2020) or ARM Cortex MCUs (Kim et al., 2021). Network quantization techniques can generally be divided into two categories: quantization-aware training (QAT) and post-training quantization (PTQ). When quantizing neural networks via QAT (Jung et al., 2019; Jain et al., 2019; Zhao et al., 2020; Esser et al., 2020; Lee et al., 2021), the performance gap between a full-precision neural network and its quantized counterpart can be marginal. Yet, QAT requires end-to-end retraining or fine-tuning on a full training dataset, which often consumes an enormous amount of time and resources to obtain a quantized neural network with competitive performance. Furthermore, a whole training dataset may not be available due to data privacy issues or demands to utilize legacy models. Such drawbacks of QAT are the reasons why researchers have recently paid more attention to PTQ (Zhao et al., 2019; Wang et al., 2020; Nahshan et al., 2021), which requires neither a full training dataset nor end-to-end learning at all.
PTQ was initially performed via a rounding-to-nearest scheme by minimizing the quantization error in the parameter space. Unfortunately, this approach suffers from severe performance degradation. Since the loss degradation resulting from quantization can be approximated as the second-order error in a Taylor expansion by viewing quantized weights as perturbed weights, Nagel et al. (2020) and Li et al. (2021) substantiate that reconstructing each layer or block output is equivalent to minimizing the approximation of the loss degradation resulting from quantization under some assumptions. Accordingly, recent works (Nagel et al., 2020; Li et al., 2021; Hubara et al., 2021; Wei et al., 2022) have suggested reconstructing each layer or block output by devising and learning a new weight-rounding scheme, deviating from rounding-to-nearest, as an effort to preserve the performance of a full-precision model. However, all the new rounding schemes designed in existing studies either round or quantize pre-trained weights adaptively via element-wise addition. Changing the perspective of a new rounding policy from element-wise addition to element-wise division, we propose a simple yet effective post-training weight quantization method called FlexRound, which flexibly quantizes pre-trained weights by learning how much each pre-trained weight should be divided by. Interestingly, thanks to the reciprocal rule of derivatives induced by element-wise division, FlexRound can inherently leverage pre-trained weights when updating an individual scale for every pre-trained weight. Specifically, we corroborate that a relatively wider range of discrete values needs to be explored when quantizing pre-trained weights of large magnitude. The rationale behind such an approach is that the magnitude of a weight can be considered as its importance.
Given that it is crucial to retain the knowledge of important weights even after quantization so as to maintain the performance of a pre-trained model, the constraints associated with quantizing weights of large absolute value should be relaxed compared to those of small absolute value (i.e., an important weight can be quantized not only to one of its two nearest discrete values but also to discrete values farther away). Accordingly, FlexRound quantizes pre-trained weights flexibly depending on their own importance, thereby leading to better performance. Our contributions are threefold:

• We propose FlexRound as a new rounding scheme for post-training weight quantization based on the principle of element-wise division, which enables learning a separate scale for every pre-trained weight as well as a common quantization grid size across a group (e.g., a channel or a layer).

• We demonstrate that such a rounding scheme via element-wise division takes the importance of pre-trained weights into consideration when updating their corresponding scales, so that FlexRound can quantize pre-trained weights of large magnitude (i.e., important pre-trained weights) more flexibly.

• To the best of our knowledge, we are the first to conduct extensive experiments in the form of per-tensor uniform PTQ reconstruction on natural language generation as well as image classification and natural language understanding. We verify the effectiveness of FlexRound on numerous models such as ResNet, MobileNetV2, BERT, GPT-Neo, and OPT.

2. RELATED WORK

Recently, many researchers have attempted to quantize a wide range of models for various tasks such as vision and language understanding/generation without any (re)training. OCS (Zhao et al., 2019) replicates channels entailing outliers, and then halves the outliers of those channels. Unfortunately, even though OCS explicitly addresses outliers, it still suffers from severe accuracy degradation when both weights and activations are quantized to low bit-widths. As an alternative solution, Wang et al. (2020) proposed Bit-Split, which splits an integer into several bits and optimizes them separately. Although Wang et al. (2020) showed that the performance of Bit-Split is close to that of a full-precision model in the low-bit setting, Bit-Split may not be effective for certain architectures including MobileNetV2. To overcome the limitations discussed above, Nagel et al. (2020) and Hubara et al. (2021) minimize the mean squared error (in a layer-by-layer fashion) between the full-precision layer's output and its quantized layer's output by inventing and learning new weight-rounding mechanisms, dubbed AdaRound and AdaQuant, respectively. As such layer-wise reconstruction error minimization opened the door to the 4-bit PTQ regime, Li et al. (2021) proposed block-wise reconstruction, titled BRECQ, to consider cross-layer dependency and enable fully quantizing MobileNetV2 to 4-bit. In addition to block-wise reconstruction, Wei et al. (2022) proposed QDrop, which drops the quantization of activations at random during reconstruction so that activation quantization becomes synchronized with weight quantization. Both BRECQ and QDrop, however, are based on AdaRound, which cannot learn a quantization grid size jointly and allows each weight to be rounded only either up or down. AdaQuant quantizes weights adaptively; however, it does not take the magnitude of weights into account, which turns out to be important as we discuss later.
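For concreteness, the addition-based rounding of AdaRound that BRECQ and QDrop build on can be sketched as follows. This is a minimal PyTorch sketch, not the authors' code; the rectified-sigmoid constants ζ = 1.1 and γ = −0.1 follow Nagel et al. (2020), and the regularizer that drives h(V) toward {0, 1} during learning is omitted:

```python
import torch

def adaround_h(V: torch.Tensor, zeta: float = 1.1, gamma: float = -0.1) -> torch.Tensor:
    # rectified sigmoid of AdaRound: a soft rounding variable clipped to [0, 1]
    return torch.clamp(torch.sigmoid(V) * (zeta - gamma) + gamma, 0.0, 1.0)

def adaround_quantize(W: torch.Tensor, s1: float, V: torch.Tensor) -> torch.Tensor:
    # element-wise ADDITION: each weight can only round down (h = 0) or up (h = 1),
    # i.e., it must land on one of its two nearest grid points
    return s1 * (torch.floor(W / s1) + adaround_h(V))

torch.manual_seed(0)
W = torch.randn(4, 8)
V = torch.randn(4, 8)                   # continuous variables learned during reconstruction
W_hat = adaround_quantize(W, s1=0.1, V=V)
```

Because h(V) lies in [0, 1], the quantized weight can never move beyond its two nearest grid points; this is exactly the constraint that FlexRound relaxes.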
As another line of post-training quantization (PTQ) research, some PTQ techniques specialize in quantizing language models such as BERT and GPT-like models. Bondarenko et al. (2021) first applied PTQ to BERT by introducing a per-embedding-group activation quantization scheme to deal with highly dynamic activation ranges. Bai et al. (2021) studied PTQ reconstruction in parallel for BERT. Yao et al. (2022) proposed ZeroQuant, which quantizes BERT and GPT-3 in a group-wise weight quantization manner driven by token-wise activation quantization via layer-by-layer knowledge distillation. Dettmers et al. (2022) quantize large language models like OPT with vector-wise weight quantization and mixed-precision decomposition with FP16 activations. None of those methods considers per-tensor weight quantization, which can enable integer matrix-to-matrix multiplication API/function calls (Migacz, 2017). Most of the aforementioned PTQ studies target either vision models or language models, but not both, and most of their experimental results are obtained via channel-wise/group-wise/vector-wise weight quantization at the expense of reduced parallelism. To the best of our knowledge, our work is the first to carry out extensive experiments on diverse tasks ranging from image classification to natural language generation assuming a per-tensor uniform PTQ setting.

3. METHODOLOGY

In this section, we first present the notations used in the paper, describe the concept and design of FlexRound for per-tensor uniform post-training quantization (PTQ) reconstruction, and then, scrutinize how FlexRound can leverage the importance of a pre-trained weight.

3.1. PRELIMINARIES

Notations. A scalar, a vector, and a matrix (or a tensor) are expressed as a non-bold letter, a small bold letter, and a capital bold letter (e.g., $s$, $\mathbf{s}$, and $\mathbf{S}$), respectively. $\widehat{W}$ indicates the quantized counterpart of $W$. The input to a convolutional or fully-connected layer is denoted as $X$ if all previous layers are intact, or as $\widehat{X}$ if all previous layers are quantized. The $(i,j)$ element of a matrix $W$ is represented as $W_{(i,j)}$. We let $\odot$ and $/$ indicate element-wise product and element-wise division, respectively, following NumPy-style broadcasting. $\lfloor \cdot \rceil$ and $\lfloor \cdot \rfloor$ express the rounding function and the floor function. $\|\cdot\|_F$ represents the Frobenius norm.

PTQ Background. The conventional uniform PTQ approach is to quantize pre-trained weights $W$ to be $\widehat{W} = s_1 \lfloor W / s_1 \rceil$ via rounding-to-nearest and to minimize $\|W - \widehat{W}\|_F^2$ with respect to the quantization grid size $s_1$; however, minimizing the quantization error in the parameter space is not equivalent to minimizing the final task loss. On the grounds that Li et al. (2021) prove that the loss degradation resulting from quantization can be approximated as a quadratic form of the network output and its Hessian matrix, several existing studies have striven to minimize $\|WX - \widehat{W}\widehat{X}\|_F^2$ layer-by-layer or block-by-block with respect to continuous variables $V$ using only a small amount of data, where $\widehat{W}$ is either $s_1 (\lfloor W / s_1 \rfloor + h(V))$ with a certain function $h(\cdot)$ (Nagel et al., 2020) or $s_1 \lfloor (W + V) / s_1 \rceil$ (Hubara et al., 2021). However, all these rounding mechanisms are founded on element-wise addition.
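As a concrete illustration of the conventional approach above, a per-tensor rounding-to-nearest quantizer can be sketched as follows. This is a minimal sketch: the max-abs choice of $s_1$ is our simplification, whereas the text above minimizes $\|W - \widehat{W}\|_F^2$ over $s_1$:

```python
import torch

def rtn_quantize(w: torch.Tensor, num_bits: int = 4):
    """Per-tensor symmetric rounding-to-nearest: W_hat = s1 * round(W / s1)."""
    qmax = 2 ** (num_bits - 1) - 1                # e.g., 7 for 4-bit
    s1 = w.abs().max() / qmax                     # common quantization grid size s1
    w_int = torch.clamp(torch.round(w / s1), -qmax - 1, qmax)
    return s1 * w_int, s1

torch.manual_seed(0)
w = torch.randn(64, 128)
w_hat, s1 = rtn_quantize(w, num_bits=4)
rel_err = torch.norm(w - w_hat) / torch.norm(w)   # quantization error in parameter space
```

Minimizing this parameter-space error is exactly what the reconstruction-based methods discussed above improve upon, by targeting the layer output error instead.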

3.2. FLEXROUND

Unlike prior works based on element-wise addition, we exploit element-wise division for quantizing pre-trained weights. We formulate our proposed weight-rounding scheme via element-wise division as

$$\widehat{W} = s_1 \lfloor W / S \rceil, \qquad (1)$$

where the shape of $S$ is equal to that of $W$, while all entries of $S$ as well as the quantization grid size $s_1$ are positive and learnable. Similarly to preceding studies, both $s_1$ and $S$ are updated so as to minimize $\|WX - \widehat{W}\widehat{X}\|_F^2$. Eq. 1 implies that the basic formula of FlexRound supports per-tensor uniform PTQ. Notice that although FlexRound can adopt a per-channel weight quantization scheme simply by replacing the scalar $s_1$ with a vector $\mathbf{s}_1$, since we show later that per-tensor uniform PTQ (using FlexRound) is enough to attain the accuracy of a full-precision model, we set a single quantization grid size $s_1$ for each layer (per-tensor quantization schemes can enable integer matrix-to-matrix multiplication API/function calls that facilitate efficient inference of quantized models (Migacz, 2017)). From now on, thus, we study only the per-tensor uniform PTQ reconstruction. The overall procedure of FlexRound is described in Figure 1. To construct $S$, we first let $S$ be $s_1 \odot S_2$, where $S_2$ has the same shape as $W$ (for both a fully-connected and a convolutional layer) and all elements of $S_2$ are learnable. Then, motivated by the wide acknowledgement that the statistics of output channels can vary greatly (Nagel et al., 2019; Lou et al., 2020), we account for the variation of output channels' statistics by complementing $S$ with an additional learnable tensor $\mathbf{s}_3$, where $\mathbf{s}_3 \in \mathbb{R}^{C_{out} \times 1}_{>0}$ in the case of a fully-connected layer and $\mathbf{s}_3 \in \mathbb{R}^{C_{out} \times 1 \times 1 \times 1}_{>0}$ in the case of a convolutional layer. For a convolutional layer, $S$ is additionally complemented by another learnable tensor $\mathbf{s}_4 \in \mathbb{R}^{1 \times C_{in} \times 1 \times 1}_{>0}$. Consequently, $S$ is formulated as $s_1 \odot S_2 \odot \mathbf{s}_3$ for a fully-connected layer, as displayed in Figure 2, and as $s_1 \odot S_2 \odot \mathbf{s}_3 \odot \mathbf{s}_4$ for a convolutional layer.
Accordingly, the quantization process of FlexRound can be expressed as

$$\widehat{W} = \begin{cases} s_1 \big\lfloor W / (s_1 \odot S_2 \odot \mathbf{s}_3) \big\rceil & \text{for a fully-connected layer,} \\ s_1 \big\lfloor W / (s_1 \odot S_2 \odot \mathbf{s}_3 \odot \mathbf{s}_4) \big\rceil & \text{for a convolutional layer,} \end{cases} \qquad (2)$$

where all entries of $S_2$, $\mathbf{s}_3$, and $\mathbf{s}_4$ are initialized to ones so that learning starts from rounding-to-nearest, $s_1 \lfloor W / s_1 \rceil$. $s_1$, $S_2$, $\mathbf{s}_3$, and $\mathbf{s}_4$ are updated to minimize $\|WX - \widehat{W}\widehat{X}\|_F^2$ subject to the constraint that all their elements remain positive. Since $s_1$, $S_2$, $\mathbf{s}_3$, and $\mathbf{s}_4$ are all learnable and FlexRound does not need any explicit regularization term, no additional hyper-parameter is necessary, which makes FlexRound convenient for practitioners. Moreover, as all entries of $s_1$, $S_2$, $\mathbf{s}_3$, and $\mathbf{s}_4$ are positive and FlexRound is based on element-wise division, FlexRound encourages $\widehat{W}$ to keep the same sign as $W$. Hence, FlexRound prevents extreme changes of weights during quantization, unlike element-wise-addition rounding schemes such as AdaQuant (Hubara et al., 2021).
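Eq. 2 for a fully-connected layer can be sketched in PyTorch as follows. This is a sketch, not the authors' code: keeping the parameters positive via log-space reparameterization and implementing the rounding with a straight-through estimator are our implementation choices.

```python
import torch

class FlexRoundLinear(torch.nn.Module):
    """Sketch of Eq. 2 for a fully-connected weight:
    W_hat = s1 * round(W / (s1 * S2 * s3))."""

    def __init__(self, w: torch.Tensor, s1_init: float):
        super().__init__()
        self.w = w                                                    # frozen pre-trained weight
        self.log_s1 = torch.nn.Parameter(torch.tensor(s1_init).log())
        self.log_S2 = torch.nn.Parameter(torch.zeros_like(w))         # S2 initialized to ones
        self.log_s3 = torch.nn.Parameter(torch.zeros(w.shape[0], 1))  # per-output-channel scale

    def forward(self) -> torch.Tensor:
        s1 = self.log_s1.exp()
        S = s1 * self.log_S2.exp() * self.log_s3.exp()   # S = s1 ⊙ S2 ⊙ s3
        v = self.w / S
        v_q = v + (torch.round(v) - v).detach()          # STE: round forward, identity backward
        return s1 * v_q

torch.manual_seed(0)
w = torch.randn(8, 16)
quant = FlexRoundLinear(w, s1_init=0.05)
w_hat = quant()    # at initialization this coincides with rounding-to-nearest
```

Because all scale parameters start at ones, the first forward pass reproduces rounding-to-nearest, and the reconstruction objective then adjusts the per-weight divisors.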

4. EXPERIMENTS

In this section, we present experimental results for benchmark datasets and network models in computer vision and natural language processing tasks. We first empirically confirm that the additional tensors $\mathbf{s}_3$ and $\mathbf{s}_4$ introduced in Section 3.2 make distinct contributions in the per-tensor uniform post-training quantization (PTQ) setting. Then, we compare the performance of FlexRound with that of some state-of-the-art PTQ approaches in the following cases: image classification on the ImageNet (Russakovsky et al., 2015) dataset with the ResNet (He et al., 2016) and MobileNetV2 (Sandler et al., 2018) architectures (Section 4.3), natural language understanding (NLU) on the GLUE (Wang et al., 2018) benchmark with the BERT (Devlin et al., 2018) and GPT-Neo (Black et al., 2021) architectures (Section 4.4), and natural language generation (NLG) on WikiText2 (Merity et al., 2016) and Penn Treebank (PTB) (Marcus et al., 1993) with the GPT-Neo and OPT (Zhang et al., 2022) architectures (Section 4.4). For brevity, we let "B + X" and "Q + X" indicate that a certain rounding scheme 'X' is performed in the experimental setup described in BRECQ (Li et al., 2021) or QDrop (Wei et al., 2022), respectively (an experimental setup includes, e.g., the definition of a block unit for reconstruction error minimization or the probability of dropping activation quantization). As in BRECQ and QDrop, we also utilize the LSQ technique (Esser et al., 2020) when updating the activation step size for activation quantization. Throughout our comprehensive experiments, we verify that FlexRound can achieve performance competitive with a full-precision model on the above tasks even in the per-tensor uniform PTQ reconstruction setting, which has not been shown previously. All experimental results in this section are obtained with our own implementation based on open-source code.

4.1. LEVERAGING THE IMPORTANCE OF A PRE-TRAINED WEIGHT

As we discussed previously, either element-wise addition or element-wise division yields a better rounding scheme than rounding-to-nearest. To investigate the difference between element-wise addition and element-wise division, it is instructive to analyze the gradient of the reconstruction error $L = \|WX - \widehat{W}\widehat{X}\|_F^2$ with respect to $S'$ (where $S'$ is $S_2 \odot \mathbf{s}_3$ for a fully-connected layer and $S_2 \odot \mathbf{s}_3 \odot \mathbf{s}_4$ for a convolutional layer). Through this analysis, we show that, unlike element-wise addition, element-wise division enables $\partial L / \partial S'$ to leverage the importance of the pre-trained weights $W$, as follows[foot_0]: using the straight-through estimator (Bengio et al., 2013), for every $i$ and $j$, $\partial L / \partial S'_{(i,j)}$ is directly proportional to $W_{(i,j)} \, \partial L / \partial \widehat{W}_{(i,j)}$, which implies that $S'_{(i,j)}$ is (partially) affected by $W_{(i,j)}$. As a result, $\widehat{W}_{(i,j)} = s_1 \lfloor W_{(i,j)} / (s_1 S'_{(i,j)}) \rceil$ can also be updated under the influence of $W_{(i,j)}$. In other words, the larger the magnitude of a pre-trained weight $W_{(i,j)}$, the higher the chance that $\widehat{W}_{(i,j)}$ receives a larger update during the PTQ reconstruction. In light of the fact that the magnitude of a weight can be regarded as a measure of its importance when compressing a neural network (Han et al., 2015; Zhu & Gupta, 2017), if the goal is to enhance model accuracy after quantization, it is reasonable to have less important (i.e., smaller-magnitude) weights rounded either up or down only, while allowing more important (i.e., larger-magnitude) weights to be quantized to one of the two closest quantization grids or beyond. Figure 3 presents the amount of weight updates through FlexRound for MobileNetV2 and ResNet-18. On the left and center of Figure 3, histograms describe the change of $\widehat{W}_{(i,j)}$ grouped by small pre-trained weights ($|W| < 1$, left) and large pre-trained weights ($|W| > 1$, center).
On the right side, scatter plots show the amount of grid shifts from the grids obtainable by the rounding-to-nearest (RTN) scheme. We note that MobileNetV2 and ResNet-18 are quantized distinctly by FlexRound. For example, in the case of MobileNetV2 as illustrated in Figure 3(a), the change of $\widehat{W}_{(i,j)}$ attained by minimizing $L$ is more aggressive (i.e., rounding can deviate by more than one step up or down) when the absolute value of $W_{(i,j)}$ is larger than one, which means that FlexRound more flexibly quantizes pre-trained weights of large magnitude, as highlighted by the red dotted squares in Figure 3(a). The amount of aggressively rounded weights in the first convolutional layer of the first block of MobileNetV2 is around 12.8% of the total. For ResNet-18, however, there are no pre-trained weights whose magnitudes are larger than one. Thus, most pre-trained weights are rounded either up or down as shown in Figure 3(b) (e.g., only about 1.5% of weights are rounded aggressively in the first convolutional layer of the first block of ResNet-18). Different rounding results by FlexRound, AdaRound, and AdaQuant are visually compared in Appendix A.
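The claimed proportionality can be checked numerically with autograd. This is a self-contained sketch; `g_out` stands in for $\partial L / \partial \widehat{W}$ and is not a quantity from the paper:

```python
import torch

torch.manual_seed(0)
s1 = 0.1
W = torch.randn(4, 4)
S_prime = (torch.rand(4, 4) + 0.5).requires_grad_(True)   # positive scales S'

v = W / (s1 * S_prime)
v_q = v + (torch.round(v) - v).detach()     # straight-through estimator
W_hat = s1 * v_q

g_out = torch.randn(4, 4)                   # stands in for dL/dW_hat
(W_hat * g_out).sum().backward()

# reciprocal rule (derived in Appendix B): dL/dS' = -(W / S'^2) * dL/dW_hat,
# so the scale update is directly proportional to the weight's magnitude
expected = -(W / S_prime.detach() ** 2) * g_out
assert torch.allclose(S_prime.grad, expected, atol=1e-4)
```

Larger-magnitude weights thus receive proportionally larger gradients on their scales, which is the mechanism behind the aggressive rounding observed for MobileNetV2.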

4.2. ABLATION STUDY

To justify the introduction of $\mathbf{s}_3$ and $\mathbf{s}_4$ in FlexRound under the per-tensor uniform PTQ setting, we investigate their impact on the performance of FlexRound using the ImageNet dataset with pre-trained weights quantized into 2-bit (activations are not quantized). As shown in the last two rows of Table 1, the presence of $\mathbf{s}_3$ and $\mathbf{s}_4$ enhances the accuracy for all models. Interestingly, FlexRound outperforms both AdaQuant and AdaRound even without $\mathbf{s}_3$ and $\mathbf{s}_4$, which supports the effectiveness of element-wise division itself.

4.3. IMAGE CLASSIFICATION

In this subsection, we quantize ResNet-18, ResNet-50, and MobileNetV2 in the low-bit PTQ reconstruction setting with 1024 randomly sampled images. A linear symmetric per-tensor quantization format is assumed for weights and/or activations. For FlexRound, the output of each layer or block is reconstructed over 5k iterations while all learnable parameters (i.e., $s_1$, $S_2$, $\mathbf{s}_3$, and $\mathbf{s}_4$) are updated with a single learning rate (e.g., 4e-4 for the ResNet models quantized to 3-bit or 4-bit, or 1e-3 for the ResNet models quantized to 2-bit and for MobileNetV2). The first and last layers are quantized into 8-bit, and batch normalization layers are folded into convolutions, as done in Li et al. (2021). Our experiments are based on full-precision pre-trained models available from the BRECQ (Li et al., 2021) GitHub repository[foot_1], and we report the median over five random trials. Assuming the quantization of weights only, we compare FlexRound with AdaRound and AdaQuant, which utilize the principle of element-wise addition to decide rounding operations. Table 2 shows that FlexRound consistently outperforms those two addition-based rounding policies. Note that the performance of AdaQuant is inferior to that of AdaRound in Table 2; correspondingly, hereafter we compare FlexRound to AdaRound only, to save space. Table 3 provides model accuracy when AdaRound and FlexRound (quantizing both weights and activations) are combined with the settings of BRECQ or QDrop.
In Table 3 , it should be noted that FlexRound is particularly successful for MobileNetV2 incorporating weights of large magnitude, for the reason that we explained in Section 4.1. It is also interesting to see that even when both weights and activations of the ResNet models are quantized into 4-bit under the per-tensor uniform PTQ setting, the performance degradation (compared to a full-precision pre-trained model) is negligible (less than 1.5%) in Table 3 .

4.4. LANGUAGE MODELS

All language models we consider in this paper are based on the Transformer structure (Vaswani et al., 2017). To quantize Transformers into 8-bit, we apply a linear asymmetric per-tensor quantization scheme to both weights and activations, while reconstruction (for PTQ) is considered for each Transformer layer, which includes attention sublayers and feed-forward sublayers. All weights are quantized into 8-bit except the last randomly initialized layer. As for activation quantization, on-the-fly (static) quantization is conducted before every fully-connected layer, except the inputs of the softmax layer and the normalization layer, which remain in full precision as in Zafrir et al. (2019) and Zhang et al. (2020).

BERT and GPT-Neo on GLUE. We evaluate the natural language understanding (NLU) performance of FlexRound using various models including BERT Base, BERT Large, GPT-Neo 125M, GPT-Neo 1.3B, and GPT-Neo 2.7B on the GLUE benchmark. The learning rate applied to all learnable parameters ($s_1$, $S_2$, and $\mathbf{s}_3$) is selected to be 2e-4 for BERT and 3e-4 for GPT-Neo. The reconstruction process uses 1024 random samples for 20k iterations. For all experiments, the batch size is 64 and the maximum sequence length is 128. We utilize pre-trained language models (PLMs) and datasets available from the HuggingFace (Wolf et al., 2020) repository[foot_2]. Further experimental details are deferred to Appendix G. In Table 4, we report the performance of 'Q + AdaRound' and 'Q + FlexRound', the most promising settings as shown in Table 3. We notice that 'Q + FlexRound' yields better NLU scores than 'Q + AdaRound' for most NLU tasks. In particular, for the MNLI and QQP datasets, 'Q + FlexRound' achieves comparable or even superior performance to a full-precision model in the per-tensor uniform PTQ setting, except for GPT-Neo 125M.
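The 8-bit linear asymmetric per-tensor scheme applied above can be sketched as follows. Min/max calibration here is our simplification; the experiments instead learn the activation step size with LSQ:

```python
import torch

def asym_quantize(x: torch.Tensor, num_bits: int = 8) -> torch.Tensor:
    """Linear asymmetric per-tensor quantization with min/max calibration."""
    qmax = 2 ** num_bits - 1                      # 255 for 8-bit
    lo, hi = x.min(), x.max()
    scale = (hi - lo) / qmax                      # step size
    zero_point = torch.round(-lo / scale)         # integer offset so that 0 is representable
    x_int = torch.clamp(torch.round(x / scale) + zero_point, 0, qmax)
    return scale * (x_int - zero_point)           # dequantized activations

torch.manual_seed(0)
x = torch.randn(4, 128) * 2 + 1                   # skewed activations
x_q = asym_quantize(x)                            # max error is about scale / 2
```

The asymmetric zero-point lets the grid cover skewed activation ranges (e.g., post-GELU outputs) without wasting levels, which is why it is preferred over a symmetric grid for activations.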
GPT-Neo and OPT on WikiText2 and PTB. We test the natural language generation (NLG) performance of FlexRound on the WikiText2 and PTB datasets. PLMs (for NLG) are quantized by FlexRound (in a per-tensor quantization manner), while a small amount of downstream-task data is used for reconstruction and evaluation. Specifically, the PLMs include GPT-Neo 125M, GPT-Neo 1.3B, GPT-Neo 2.7B, OPT 125M, OPT 1.3B, and OPT 2.7B, while 256 downstream-task data samples are chosen at random for reconstruction. More details on the experimental setup are provided in Appendix I. Table 5 presents the results of GPT-Neo and OPT on NLG tasks; it is clear that 'Q + FlexRound' is superior to 'Q + AdaRound' for all models and NLG tasks. Note that for GPT-Neo, 'Q + FlexRound' achieves performance similar to that of a full-precision PLM even in the per-tensor uniform PTQ setting, whereas some previous attempts rely on group-wise or vector-wise quantization (Yao et al., 2022; Dettmers et al., 2022).

5. CONCLUSION

We propose a new rounding scheme, named FlexRound, for post-training quantization based on the principle of element-wise division, which enables learning both a common quantization grid size and an individual scale for each pre-trained weight. We validate that FlexRound can flexibly quantize pre-trained weights by exploiting their magnitude as a measure of importance. Consequently, FlexRound can achieve performance comparable to a full-precision model even in the per-tensor uniform PTQ setting. As future work, we plan to quantize large language models beyond 6.7B parameters in the per-tensor uniform PTQ setting.

A COMPARISON OF FLEXROUND TO ADAROUND AND ADAQUANT

B DERIVATION OF SECTION 4.1

Let $L = \|WX - \widehat{W}\widehat{X}\|_F^2$ and let $S'$ be $S_2 \odot \mathbf{s}_3$ for a fully-connected layer and $S_2 \odot \mathbf{s}_3 \odot \mathbf{s}_4$ for a convolutional layer. In the case of a fully-connected layer,

$$
\begin{aligned}
\frac{\partial L}{\partial S'_{(i,j)}}
&= \frac{\partial \widehat{W}_{(i,j)}}{\partial S'_{(i,j)}} \frac{\partial L}{\partial \widehat{W}_{(i,j)}}
= \frac{\partial}{\partial S'_{(i,j)}} \left( s_1 \Big\lfloor \frac{W_{(i,j)}}{s_1 S'_{(i,j)}} \Big\rceil \right) \frac{\partial L}{\partial \widehat{W}_{(i,j)}}
= s_1 \frac{\partial}{\partial S'_{(i,j)}} \Big\lfloor \frac{W_{(i,j)}}{s_1 S'_{(i,j)}} \Big\rceil \frac{\partial L}{\partial \widehat{W}_{(i,j)}} \\
&= s_1 \frac{\partial}{\partial S'_{(i,j)}} \left( \frac{W_{(i,j)}}{s_1 S'_{(i,j)}} \right) \frac{\partial L}{\partial \widehat{W}_{(i,j)}} \quad (\because \text{straight-through estimator}) \\
&= s_1 \frac{W_{(i,j)}}{s_1} \frac{\partial}{\partial S'_{(i,j)}} \left( \frac{1}{S'_{(i,j)}} \right) \frac{\partial L}{\partial \widehat{W}_{(i,j)}}
= W_{(i,j)} \left( -\frac{1}{S'^2_{(i,j)}} \right) \frac{\partial L}{\partial \widehat{W}_{(i,j)}}
= -\frac{W_{(i,j)}}{S'^2_{(i,j)}} \frac{\partial L}{\partial \widehat{W}_{(i,j)}}.
\end{aligned}
$$

The derivation for a convolutional layer follows by replacing $W_{(i,j)}$ with $W_{(i,j,k,l)}$ and $S'_{(i,j)}$ with $S'_{(i,j,k,l)}$.

D IMPORTANCE OF JOINTLY LEARNING THE QUANTIZATION GRID SIZE $s_1$ WITH ROUNDING

G BERT AND GPT-NEO ON GLUE

The experimental setting of 'Q + AdaRound' follows Wei et al. (2022). To investigate the natural language understanding performance of FlexRound from BERT[foot_3] to GPT-Neo[foot_4], we directly fine-tune pre-trained models on the GLUE[foot_5] dataset. For BERT, we use uncased models. Hyper-parameter selection for fine-tuning a pre-trained model is given in Table 10. We use the Adam optimizer as default for all methods and models. In the QDrop setting, the probability of dropping activation quantization is set to 0.5. We utilize the HuggingFace repository[foot_6] for the evaluation method without any modification.

H BERT ON SQUAD

We also evaluate FlexRound on the SQuAD[foot_7] dataset for the BERT models. Both BERT Base and BERT Large are uncased models. For 'Q + FlexRound', the learning rate is set to 1e-4 for both models. For both 'Q + AdaRound' and 'Q + FlexRound', the batch size and the number of iterations for reconstruction are 64 and 20k, respectively. We use the Adam optimizer as default for all methods and models. The other experimental settings of 'Q + AdaRound' follow Wei et al. (2022). Table 12 shows the hyper-parameter selection for fine-tuning; BERT Base and BERT Large use the same configuration. The other settings for fine-tuning and the evaluation method are the same as in the HuggingFace repository[foot_8].

I GPT-NEO AND OPT ON WIKITEXT2 AND PTB

To evaluate FlexRound for natural language generation tasks, we utilize the WikiText2 and PTB datasets. Table 13 reports the learning rate, the batch size, and the number of iterations for 'Q + FlexRound'. The experimental setting of 'Q + AdaRound' follows Wei et al. (2022) except for the number of iterations; we employ 15k iterations for GPT-Neo and 20k iterations for OPT. The batch size for 'Q + AdaRound' is the same as that for 'Q + FlexRound'. We use the Adam optimizer as default for all methods and models. The probability of dropping activation quantization is set to 0.5 in the QDrop setting. We use the HuggingFace repository for the evaluation method without any modification. Table 13: Hyper-parameter selection for 'Q + FlexRound' in Table 5.

J FINETUNED GPT-NEO AND OPT ON WIKITEXT2 AND PTB

Regarding the evaluation of quantized pre-trained language models, the performance of quantized OPT (by 'Q + AdaRound' or 'Q + FlexRound') is not close to that of full-precision OPT, whereas GPT-Neo can be quantized without noticeable accuracy degradation. To investigate whether this observation also holds for fine-tuned OPT, we conduct additional experiments on fine-tuned OPT and GPT-Neo with the WikiText2 and PTB datasets. As shown in Table 14, the quantized performance of fine-tuned OPT turns out to be close to full-precision performance. Considering that the models were fine-tuned on each downstream dataset, we use a smaller calibration set and fewer iterations for reconstruction: 128 samples and 500 iterations for all experiments. The learning rate and batch size for the experiments are shown in Table 15. Other settings are the same as in Appendix I.



For simplicity, we take into account the case of a fully-connected layer.
https://github.com/yhhhli/BRECQ
https://github.com/huggingface/transformers
https://huggingface.co/bert-base-uncased
https://huggingface.co/EleutherAI/gpt-neo-1.3B
https://huggingface.co/datasets/glue
https://github.com/huggingface/transformers/tree/main/examples/pytorch/text-classification
https://huggingface.co/datasets/squad
https://github.com/huggingface/transformers/tree/main/examples/pytorch/question-answering



(a) A new rounding scheme via element-wise division. Both s1 and S are updated toward minimizing the reconstruction error, L. (b) Rounding functions with learned parameters s1 and S as shown in (a).

Figure 1: Illustration of FlexRound in the per-tensor uniform PTQ reconstruction. As seen in (b), FlexRound flexibly quantizes pre-trained weights, as evidenced by $W_{(2,4)} < W_{(3,2)}$ but $\widehat{W}_{(2,4)} > \widehat{W}_{(3,2)}$.

Figure 2: Formation of S for a linear layer.

Figure 3: Weight updates through FlexRound of the first convolutional layer in the first block of (a) MobileNetV2 and (b) ResNet-18, after quantizing pre-trained weights into 4-bit (by FlexRound) while activations are kept in full-precision.

Figure 4 shows the comparison of FlexRound to AdaRound and AdaQuant. As seen in Figure 4(a), FlexRound can quantize pre-trained weights more flexibly than AdaRound and AdaQuant. As weights of large magnitude are not quantized aggressively in the middle of Figure 4(a) compared to the right of Figure 4(a), AdaQuant quantizes weights of large importance only marginally, which seems to make it difficult for AdaQuant to quantize MobileNetV2 into 4-bit.

Figure 4: Scatter plot of the amount of grid shifts from the rounding-to-nearest grid in the first layer of the first block in MobileNetV2 and ResNet-18 when only weights are quantized into 4-bit.

Figure 5: Ablation study on sample size when quantizing MobileNetV2 into 4-bit. Only weights are quantized, with activations kept in full-precision. We employ pre-trained models available from the official PyTorch repository.

Top-1/Top-5 accuracy (%) on ImageNet by ResNet-18, ResNet-50, and MobileNetV2 with only weights quantized into 2-bit. "B + X" denotes the implementation of X in the setting of BRECQ. We employ pre-trained models available from the official PyTorch repository.

Top-1/Top-5 accuracy (%) for ResNet-18, ResNet-50, and MobileNetV2 on ImageNet when both weights and activations are quantized. "B + X" and "Q + Y" represent the implementation of X in the BRECQ's setting and that of Y in the QDrop's setting, respectively. We employ pre-trained models available from the BRECQ github repository.

Performance of BERT-Base and BERT-Large on the GLUE benchmark. For evaluation metrics, matched and mismatched accuracies are reported for MNLI, F1 score and accuracy are reported for QQP, Matthews correlation is reported for CoLA, Pearson and Spearman correlations are reported for STS-B, and accuracy is reported for the others. "Q + X" indicates the implementation of X in the QDrop's setting.

Performance of GPT-Neo 125M, GPT-Neo 1.3B, GPT-Neo 2.7B, OPT 125M, OPT 1.3B, and OPT 2.7B on the WikiText2 and PTB datasets. The perplexity (PPL) is employed as a performance metric; the lower the PPL, the better. "Q + X" means the implementation of X in the QDrop's setting.

Top-1/Top-5 accuracy (%) on ImageNet by ResNet-18, ResNet-50, and MobileNetV2 with only weights quantized into 4-bit. "B + X" denotes the implementation of X in the setting of BRECQ. We employ pre-trained models available from the official PyTorch repository.

To demonstrate the importance of jointly learning s1 with the rounding, we conducted an additional study with s1 fixed. When s1 is fixed, the performance of FlexRound is almost comparable to that of AdaRound for the ResNet models, while FlexRound is somewhat superior to AdaRound for MobileNetV2. When s1 is learned jointly with the rounding, however, FlexRound outperforms AdaRound for all models. It is therefore critical to learn s1 jointly with the rounding.

Hyper-parameter selection for fine-tuning BERT-Base, BERT-Large, GPT-Neo 125M, GPT-Neo 1.3B, and GPT-Neo 2.7B on the GLUE benchmark.

The table below additionally shows the performance of FlexRound on the SQuADv1 (Rajpurkar et al., 2016) dataset.

F1 score for BERT-Base and BERT-Large on the SQuADv1 dataset when both weights and activations are quantized into 8-bit. "Q + X" represents the implementation of X in the QDrop's setting.

Hyper-parameter selection for fine-tuning BERT-Base and BERT-Large on the SQuADv1 dataset.


Performance of GPT-Neo 125M, GPT-Neo 1.3B, GPT-Neo 2.7B, OPT 125M, OPT 1.3B, and OPT 2.7B finetuned on the WikiText2 and PTB datasets. The perplexity (PPL) is employed as a performance metric; the lower the PPL, the better. "Q + X" means the implementation of X in the QDrop's setting.

Hyper-parameter selection for 'Q + FlexRound' in Table 14. The sample size is 128 and the number of iterations is 500.


C RESNET-18, RESNET-50, AND MOBILENETV2 ON IMAGENET WITH PRE-TRAINED MODELS FROM THE OFFICIAL PYTORCH REPOSITORY

To identify whether any benefit comes from combining both addition and division, we combine AdaQuant with FlexRound. AdaQuant + FlexRound is superior to AdaQuant but inferior to FlexRound alone. This might be due to the naive combination of AdaQuant with FlexRound; considering both addition and division more carefully would be an interesting direction for future work.
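One way such a naive combination could look, purely as a sketch: an AdaQuant-style additive perturbation v applied before FlexRound-style division by a learned per-weight scale s. The function name, the half-up rounding, and the exact placement of v are our assumptions, not the combination actually evaluated.

```python
import math

def combined_round(w, s1, s, v):
    """Hypothetical naive combination of addition and division: perturb
    the weight by a learned offset v (addition), divide by a learned
    per-weight scale s (division), then round half-up onto the common
    quantization grid of size s1."""
    return s1 * math.floor((w + v) / (s1 * s) + 0.5)
```

With v = 0 and s = 1 this reduces to rounding-to-nearest; either learned term alone can change which grid point a weight lands on.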

