FLEXROUND: LEARNABLE ROUNDING BY ELEMENT-WISE DIVISION FOR POST-TRAINING QUANTIZATION

Abstract

Post-training quantization (PTQ) has been gaining popularity for deploying deep neural networks on resource-limited devices since, unlike quantization-aware training, it requires neither a full training dataset nor end-to-end training. As PTQ schemes based on reconstructing each layer or block output have proven effective at enhancing quantized model performance, recent works have developed algorithms that devise and learn new weight-rounding schemes so as to better reconstruct each layer or block output. We notice, however, that such new rounding schemes are all built on element-wise addition. In this work, we propose a simple yet effective new rounding mechanism for post-training weight quantization, coined FlexRound, based on element-wise division, which learns not only a common quantization grid size but also a different scale for each pre-trained weight. Thanks to the reciprocal rule of derivatives induced by element-wise division, FlexRound inherently exploits the importance of a pre-trained weight when updating its corresponding scale, and can thus quantize each pre-trained weight flexibly depending on its importance. We empirically validate the efficacy of FlexRound on a wide range of models and tasks. To the best of our knowledge, our work is the first to carry out comprehensive experiments on not only image classification and natural language understanding but also natural language generation in the per-tensor uniform PTQ setting. Our code will be open-sourced soon.

1. INTRODUCTION

Recent years have witnessed the unprecedented success of deep neural networks in a wide variety of domains including computer vision, natural language processing, and automatic speech recognition. Although state-of-the-art deep neural networks surpass human-level performance, their computational cost and memory usage keep growing as networks become deeper and wider. To reduce model size and accelerate inference, many researchers have explored diverse compression techniques such as network quantization (Courbariaux et al., 2016) and network pruning (Han et al., 2016). In this paper, we concentrate on network quantization because INT4 or INT8 quantization allows quantized neural networks to be accelerated on off-the-shelf hardware such as the NVIDIA A100 Tensor Core GPU (Wu et al., 2020) or ARM Cortex MCUs (Kim et al., 2021). Network quantization techniques generally fall into two categories: quantization-aware training (QAT) and post-training quantization (PTQ). When quantizing neural networks via QAT (Jung et al., 2019; Jain et al., 2019; Zhao et al., 2020; Esser et al., 2020; Lee et al., 2021), the performance gap between a full-precision neural network and its quantized counterpart can be marginal. Yet, QAT requires end-to-end retraining or fine-tuning on a full training dataset, which often demands an enormous amount of time and resources to obtain a quantized neural network with competitive performance. Furthermore, the full training dataset may not be available due to data privacy issues or the need to support legacy models. These drawbacks of QAT are why researchers have recently paid more attention to PTQ (Zhao et al., 2019; Wang et al., 2020; Nahshan et al., 2021), which needs neither a full training dataset nor end-to-end learning.
PTQ was initially performed via a rounding-to-nearest scheme that minimizes the quantization error in the parameter space. Unfortunately, this approach suffers from severe performance degradation. Since the loss degradation caused by quantization can be approximated as the second-order error term of a Taylor expansion by viewing quantized weights as perturbed weights, Nagel et al. (2020) and Li et al. (2021) substantiate that reconstructing each layer or block output is equivalent, under some assumptions, to minimizing this approximation of the loss degradation. Accordingly, recent works (Nagel et al., 2020; Li et al., 2021; Hubara et al., 2021; Wei et al., 2022) have suggested reconstructing each layer or block output by devising and learning a new weight-rounding scheme, deviating from rounding-to-nearest, in an effort to preserve the performance of the full-precision model. However, all the new rounding schemes designed in existing studies round or quantize pre-trained weights adaptively via element-wise addition. Changing the perspective of a new rounding policy from element-wise addition to element-wise division, we propose a simple yet effective post-training weight quantization method called FlexRound, which flexibly quantizes pre-trained weights by learning how much each pre-trained weight should be divided by. Interestingly, thanks to the reciprocal rule of derivatives induced by element-wise division, FlexRound can inherently leverage the pre-trained weights themselves when updating an individual scale for every pre-trained weight. Specifically, we corroborate that a relatively wide range of discrete values needs to be explored when quantizing pre-trained weights of large magnitude, the rationale being that the magnitude of a weight can be considered a proxy for its importance.
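To make the element-wise-division idea concrete, the following is a minimal pure-Python sketch. The function name, the toy weight matrix, and the grid size are our own illustrative choices, not the paper's notation, and the straight-through estimator needed to learn the scales through the non-differentiable rounding is omitted:

```python
def flexround(W, s, S):
    """Quantize W by element-wise division: w_hat = s * round(w / (s * S_ij)).

    s    : common quantization grid size shared across a group (learnable).
    S_ij : individual scale for each pre-trained weight (learnable, init 1).
    With S_ij = 1 this reduces to rounding-to-nearest; a learned S_ij != 1
    lets a weight land on a grid point other than its nearest one.
    """
    return [[s * round(w / (s * sc)) for w, sc in zip(wr, sr)]
            for wr, sr in zip(W, S)]

W = [[0.31, -1.24], [2.05, -0.07]]
s = 0.5
S_ones = [[1.0, 1.0], [1.0, 1.0]]   # all scales 1: rounding-to-nearest
print(flexround(W, s, S_ones))       # [[0.5, -1.0], [2.0, 0.0]]

# A learned S_ij < 1 inflates w / (s * S_ij), so the weight 0.31 can be
# pushed to the farther grid point 1.0 instead of its nearest one, 0.5:
S_flex = [[0.3, 1.0], [1.0, 1.0]]
print(flexround(W, s, S_flex))       # [[1.0, -1.0], [2.0, 0.0]]
```

Note that with all individual scales fixed at 1 the scheme collapses to plain rounding-to-nearest; the flexibility comes entirely from learning the per-weight divisors.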
Given that retaining the knowledge of important weights after quantization is crucial to maintaining the performance of a pre-trained model, the constraints on quantizing weights of large absolute value should be relaxed compared to those of small absolute value (i.e., an important weight may be quantized not only to one of its two nearest discrete values but also to discrete values farther away). Accordingly, FlexRound quantizes each pre-trained weight flexibly depending on its own importance, thereby leading to better performance. Our contributions are threefold:
• We propose FlexRound, a new rounding scheme for post-training weight quantization based on the principle of element-wise division, which learns a separate scale for every pre-trained weight as well as a common quantization grid size shared across a group (e.g., a channel or a layer).
• We demonstrate that this element-wise-division rounding scheme takes the importance of pre-trained weights into account when updating their corresponding scales, so that FlexRound can quantize pre-trained weights of large magnitude (i.e., important pre-trained weights) more flexibly.
• To the best of our knowledge, we are the first to conduct extensive experiments in the per-tensor uniform PTQ reconstruction setting on natural language generation as well as image classification and natural language understanding. We verify the effectiveness of FlexRound on numerous models, including ResNet, MobileNetV2, BERT, GPT-Neo, and OPT.
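The reciprocal rule invoked above can be checked directly: the derivative of the pre-rounding quantity w / scale with respect to the scale is -w / scale², so the gradient that drives a scale update is proportional to the magnitude of its weight. The function names and toy numbers below are our own illustrative choices:

```python
def dq_dscale(w, scale):
    # Reciprocal rule: d(w / scale) / d(scale) = -w / scale**2.
    # The gradient updating a weight's individual scale is proportional to
    # that weight's magnitude, so large (important) weights receive larger
    # scale updates and can move farther from their nearest grid point.
    return -w / scale ** 2

# Sanity check the analytic gradient against central finite differences
# at scale = 1.0 for a small weight and a large weight:
eps = 1e-6
for w in (0.1, 2.0):
    numeric = ((w / (1.0 + eps)) - (w / (1.0 - eps))) / (2 * eps)
    assert abs(numeric - dq_dscale(w, 1.0)) < 1e-4

# At the same scale, the weight of magnitude 2.0 gets a gradient 20x
# larger than the weight of magnitude 0.1.
```

This is why, in contrast to addition-based rounding (whose perturbation gradient is independent of the weight's value), division-based rounding automatically relaxes the quantization constraint on weights of large magnitude.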

2. RELATED WORK

Recently, many researchers have attempted to quantize a wide range of models for various tasks, such as vision and language understanding/generation, without any (re)training. OCS (Zhao et al., 2019) replicates channels containing outliers and then halves the values of those channels. Unfortunately, even though OCS explicitly addresses outliers, it still suffers from severe accuracy degradation when both weights and activations are quantized to low bit-widths. As an alternative, Wang et al. (2020) proposed Bit-Split, which splits an integer into several bits and optimizes them separately. Although Wang et al. (2020) showed that Bit-Split can approach the performance of a full-precision model in the low-bit setting, it may not be effective for certain architectures such as MobileNetV2.
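The identity that channel splitting relies on can be sketched for a single output neuron; the function names and toy values below are our own illustration, not code from OCS:

```python
def linear(w, x):
    # One output neuron: y = sum_i w_i * x_i.
    return sum(wi * xi for wi, xi in zip(w, x))

def split_outlier_channel(w, x, idx):
    # OCS-style weight splitting: duplicate channel idx and halve both
    # copies of its weight. Since w_j*x_j == (w_j/2)*x_j + (w_j/2)*x_j,
    # the layer output is preserved while the outlier weight's magnitude
    # shrinks, making the weight range friendlier to quantization.
    w2 = w[:idx] + [w[idx] / 2, w[idx] / 2] + w[idx + 1:]
    x2 = x[:idx] + [x[idx], x[idx]] + x[idx + 1:]
    return w2, x2

w = [4.0, 0.5, -1.0]   # channel 0 holds the outlier weight
x = [1.0, 2.0, 3.0]
w2, x2 = split_outlier_channel(w, x, 0)
assert linear(w2, x2) == linear(w, x)   # output unchanged
assert max(map(abs, w2)) == 2.0         # outlier magnitude halved
```

The trade-off is that each split adds a channel, growing the layer; and, as noted above, shrinking outliers alone does not prevent accuracy loss once activations are also quantized to low bit-widths.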

