HYBRID AND NON-UNIFORM QUANTIZATION METHODS USING RETRO SYNTHESIS DATA FOR EFFICIENT INFERENCE

Anonymous

Abstract

Existing quantization-aware training methods attempt to compensate for quantization loss by leveraging training data, like most post-training quantization methods, and are also time-consuming. Neither approach is suitable for privacy-constrained applications, as both are tightly coupled with training data. In contrast, this paper proposes a data-independent post-training quantization scheme that eliminates the need for training data. This is achieved by generating a faux dataset, hereafter referred to as 'Retro-Synthesis Data', from the FP32 model's layer statistics and then using it for quantization. This approach outperformed state-of-the-art methods, including ZeroQ and DFQ, on models with and without Batch-Normalization layers for 8-, 6-, and 4-bit precisions on the ImageNet and CIFAR-10 datasets. We also introduce two further variants of post-training quantization, namely 'Hybrid Quantization' and 'Non-Uniform Quantization'. The Hybrid Quantization scheme determines the sensitivity of each layer to per-tensor versus per-channel quantization and thereby generates hybrid quantized models that are 10 to 20% faster at inference while achieving the same or better accuracy than per-channel quantization. This method also surpassed FP32 accuracy when applied to ResNet-18 and ResNet-50 on the ImageNet dataset. In the proposed Non-Uniform Quantization scheme, the weights are grouped into clusters, and each cluster is assigned a different number of quantization steps depending on the number of weights and their range in that cluster. This method yielded a 1% accuracy improvement over state-of-the-art methods on the ImageNet dataset.
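The cluster-based Non-Uniform Quantization idea summarized above can be sketched as follows. This is only a toy illustration using a simple 1-D k-means; the function and parameter names (`cluster_quantize`, `total_levels`) are ours, not the paper's, and the paper's actual cluster and level allocation may differ:

```python
import numpy as np

def cluster_quantize(weights, n_clusters=4, total_levels=256, iters=20):
    """Toy sketch: partition weights into clusters via 1-D k-means, then
    give each cluster a share of the quantization levels proportional to
    its weight count and range (our assumption of the allocation rule)."""
    w = weights.ravel()
    # initialize centroids spread over the weight range
    centroids = np.linspace(w.min(), w.max(), n_clusters)
    for _ in range(iters):
        assign = np.abs(w[:, None] - centroids[None, :]).argmin(axis=1)
        for k in range(n_clusters):
            if np.any(assign == k):
                centroids[k] = w[assign == k].mean()
    # allocate levels per cluster, proportional to (count * range)
    scores = []
    for k in range(n_clusters):
        wk = w[assign == k]
        span = wk.max() - wk.min() if wk.size else 0.0
        scores.append(wk.size * max(span, 1e-8))
    scores = np.array(scores)
    levels = np.maximum((total_levels * scores / scores.sum()).astype(int), 2)
    # quantize each cluster uniformly with its own number of steps
    q = np.empty_like(w)
    for k in range(n_clusters):
        mask = assign == k
        if not mask.any():
            continue
        wk = w[mask]
        lo, hi = wk.min(), wk.max()
        step = (hi - lo) / (levels[k] - 1) if hi > lo else 1.0
        q[mask] = lo + np.round((wk - lo) / step) * step
    return q.reshape(weights.shape)
```

Because dense weight regions receive more levels, the per-weight rounding error can be lower than with a single uniform grid over the whole range.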

1. INTRODUCTION

Quantization is a widely used and necessary approach for converting heavy Deep Neural Network (DNN) models in floating-point (FP32) format to a lightweight, lower-precision format compatible with edge-device inference. The introduction of lower-precision computing hardware such as the Qualcomm Hexagon DSP (Codrescu, 2015) has led to various quantization methods (Morgan et al., 1991; Rastegari et al., 2016; Wu et al., 2016; Zhou et al., 2017; Li et al., 2019; Dong et al., 2019; Krishnamoorthi, 2018) compatible with edge devices. Quantizing an FP32 DNN to INT8 or lower precision reduces the model size by at least 4x, depending on the precision chosen. Since the computations also happen at lower precision, this implicitly yields faster inference and lower power consumption. These benefits come with the caveat of accuracy loss, due to the noise introduced into the model's weights and activations. To reduce this accuracy loss, quantization-aware fine-tuning methods were introduced (Zhu et al., 2016; Zhang et al., 2018; Choukroun et al., 2019; Jacob et al., 2018; Baskin et al., 2019; Courbariaux et al., 2015), wherein the FP32 model is trained along with quantizers and quantized weights. The major disadvantage of these methods is that they are computationally intensive and time-consuming, since they involve the whole training process. To address this, various post-training quantization methods (Morgan et al., 1991; Wu et al., 2016; Li et al., 2019; Banner et al., 2019) have been developed, which result in trivial to heavy accuracy loss depending on the DNN being evaluated. Moreover, to determine the quantized model's weight and activation ranges, most of these methods require access to training data, which may not always be available for applications with security and privacy constraints involving card details, health records, or personal images.
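The basic FP32-to-INT8 conversion referred to above can be sketched as symmetric per-tensor uniform quantization. This is a simplified illustration of the general idea, not the paper's specific scheme:

```python
import numpy as np

def quantize_int8(x):
    """Symmetric per-tensor quantization: map [-max|x|, max|x|] onto the
    integer grid [-127, 127] with a single scale factor."""
    scale = np.max(np.abs(x)) / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(size=1000).astype(np.float32)
q, s = quantize_int8(w)
err = float(np.max(np.abs(dequantize(q, s) - w)))
# INT8 storage is 4x smaller than FP32, and the rounding noise
# per weight is bounded by half the quantization step (scale / 2)
```

Per-channel quantization differs only in computing a separate `scale` for each output channel, which shrinks the error for layers whose channels have very different weight ranges.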
Contemporary research in post-training quantization (Nagel et al., 2019; Cai et al., 2020) eliminated the need for training data by estimating the quantization parameters from the Batch-Normalization (BN) layer statistics of the FP32 model, but fails to produce good accuracy when BN layers are absent from the model. To address the above shortcomings, this paper proposes a data-independent post-training quantization method that estimates the quantization ranges by leveraging 'retro-synthesis' data generated from the original FP32 model. This method achieves better accuracy than both data-independent and data-dependent state-of-the-art quantization methods on ResNet18, ResNet50 (He et al., 2016), MobileNetV2 (Sandler et al., 2018), AlexNet (Krizhevsky et al., 2012), and ISONet (Qi et al., 2020) on the ImageNet dataset (Deng et al., 2009). It also outperforms state-of-the-art methods at lower precisions such as 6 and 4 bit on the ImageNet and CIFAR-10 datasets. Generating the entire 'retro-synthesis' dataset takes only 10 to 12 seconds, a minimal overhead compared to the benefit of data independence it provides. Additionally, this paper introduces two variants of post-training quantization methods, namely 'Hybrid Quantization' and 'Non-Uniform Quantization'.

2. RELATED WORK

2.1. QUANTIZATION AWARE TRAINING BASED METHODS

An efficient integer-only arithmetic inference method for commonly available integer-only hardware is proposed in Jacob et al. (2018), wherein a training procedure is employed that preserves the model's accuracy even after quantization. The work in Zhang et al. (2018) trains a quantized-bit-compatible DNN and associated quantizers for both weights and activations, instead of relying on handcrafted quantization schemes, for better accuracy. A 'Trained Ternary Quantization' approach is proposed in Zhu et al. (2016), wherein the model is trained to reduce its weights to 2-bit precision, achieving a 16x model size reduction without much accuracy loss. Inspired by these methods, Baskin et al. (2019) propose a 'Compression Aware Training' scheme that trains a model to learn better compression of feature maps during inference. Similarly, in the binary-connect method (Courbariaux et al., 2015), the network is trained with binary weights during the forward and backward passes, which act as a regularizer. Since these methods mostly train the networks with quantized weights and quantizers, their downside is not only that they are time-consuming but also that they demand training data, which is not always accessible.

2.2. POST TRAINING QUANTIZATION BASED METHODS

Several post-training quantization methods have been proposed to replace time-consuming quantization-aware training. The method in Choukroun et al. (2019) avoids full network training by formalizing linear quantization as a 'Minimum Mean Squared Error' problem, achieving good accuracy without retraining the model. The 'ACIQ' method (Banner et al., 2019) achieved accuracy close to FP32 models by estimating an analytical clipping range for the activations of the DNN. However, to compensate for the accuracy loss, this method relies on a run-time per-channel quantization scheme for activations, which is inefficient and not hardware-friendly. Along similar lines, the OCS method (Zhao et al., 2019) proposes eliminating outliers for better accuracy with minimal overhead. Though these methods considerably reduce the time taken for quantization, they are unfortunately tightly coupled with training data, and hence unsuitable for applications where access to training data is restricted. Contemporary research on data-free post-training quantization has succeeded in eliminating this need. By adopting a per-tensor quantization approach, the DFQ method (Nagel et al., 2019) achieved accuracy similar to per-channel quantization through cross-layer equalization and bias correction, successfully eliminating the large weight-range variations across the channels of a layer by rescaling weights across channels. In contrast, ZeroQ (Cai et al., 2020) proposed a quantization method that eliminates the need for training data by generating distilled data with the help of the Batch-Normalization layer statistics of the FP32 model, using it to determine the activation ranges for quantization, and achieved state-of-the-art accuracy. However, these methods tend to suffer accuracy degradation when no Batch-Normalization layers are present in the FP32 model.
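The BN-statistics-driven distilled-data idea attributed to ZeroQ above can be illustrated with a toy, gradient-free stand-in: draw inputs whose per-channel statistics match the stored BN running mean and variance. ZeroQ itself optimizes the input by gradient descent through the network; this simplification, and all names in it, are ours:

```python
import numpy as np

def synthesize_bn_matched(running_mean, running_var, n_samples=32, spatial=(8, 8)):
    """Toy stand-in for distilled data: sample Gaussian activations whose
    per-channel mean/variance exactly equal the BN layer's stored statistics."""
    c = running_mean.shape[0]
    h, w = spatial
    x = np.random.default_rng(0).normal(size=(n_samples, c, h, w))
    # standardize each channel, then rescale to the target statistics
    mu = x.mean(axis=(0, 2, 3), keepdims=True)
    sd = x.std(axis=(0, 2, 3), keepdims=True)
    x = (x - mu) / sd
    x = x * np.sqrt(running_var).reshape(1, c, 1, 1) \
        + running_mean.reshape(1, c, 1, 1)
    return x

# hypothetical stored BN statistics for a 3-channel layer
mean = np.array([0.5, -1.0, 2.0])
var = np.array([1.0, 0.25, 4.0])
data = synthesize_bn_matched(mean, var)
# such statistic-matched data can be fed through the FP32 model to
# estimate activation ranges without touching any real training data
```

The limitation the text notes follows directly: without BN layers there are no stored statistics to match, which is the gap the retro-synthesis approach aims to close.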

