HYBRID AND NON-UNIFORM QUANTIZATION METHODS USING RETRO SYNTHESIS DATA FOR EFFICIENT INFERENCE

Anonymous

Abstract

Existing quantization-aware training methods attempt to compensate for quantization loss by leveraging training data, as do most post-training quantization methods, and are also time-consuming. Neither class of methods is suitable for privacy-constrained applications, since both are tightly coupled to training data. In contrast, this paper proposes a data-independent post-training quantization scheme that eliminates the need for training data. This is achieved by generating a faux dataset, hereafter referred to as 'Retro-Synthesis Data', from the FP32 model's layer statistics and then using it for quantization. This approach outperformed state-of-the-art methods, including ZeroQ and DFQ, on models with and without Batch-Normalization layers at 8, 6, and 4-bit precisions on the ImageNet and CIFAR-10 datasets. We also introduce two further post-training quantization variants, namely 'Hybrid Quantization' and 'Non-Uniform Quantization'. The Hybrid Quantization scheme determines each layer's sensitivity to per-tensor versus per-channel quantization, and thereby generates hybrid quantized models that are 10 to 20% faster at inference while achieving the same or better accuracy compared to per-channel quantization. This method also surpassed FP32 accuracy when applied to ResNet-18 and ResNet-50 models on the ImageNet dataset. In the proposed Non-Uniform Quantization scheme, the weights are grouped into clusters, and each cluster is assigned a varied number of quantization steps depending on the number of weights and their range within that cluster. This method yielded a 1% accuracy improvement over state-of-the-art methods on the ImageNet dataset.

1. INTRODUCTION

Quantization is a widely used and necessary approach to convert heavy Deep Neural Network (DNN) models in Floating Point (FP32) format to a lightweight, lower-precision format compatible with edge-device inference. The introduction of lower-precision computing hardware like the Qualcomm Hexagon DSP (Codrescu, 2015) has led to various quantization methods (Morgan et al., 1991; Rastegari et al., 2016; Wu et al., 2016; Zhou et al., 2017; Li et al., 2019; Dong et al., 2019; Krishnamoorthi, 2018) compatible with edge devices. Quantizing an FP32 DNN to INT8 or lower precision reduces model size by at least 4×, depending on the precision chosen. Also, since the computations happen in lower precision, this implicitly yields faster inference and lower power consumption. These benefits of quantization come with the caveat of accuracy loss, due to noise introduced into the model's weights and activations. To reduce this accuracy loss, quantization-aware fine-tuning methods have been introduced (Zhu et al., 2016; Zhang et al., 2018; Choukroun et al., 2019; Jacob et al., 2018; Baskin et al., 2019; Courbariaux et al., 2015), wherein the FP32 model is trained along with quantizers and quantized weights. The major disadvantage of these methods is that they are computationally intensive and time-consuming, since they involve the whole training process. To address this, various post-training quantization methods (Morgan et al., 1991; Wu et al., 2016; Li et al., 2019; Banner et al., 2019) have been developed, resulting in trivial to heavy accuracy loss when evaluated on different DNNs. Also, to determine the quantized model's weight and activation ranges, most of these methods require access to training data, which may not always be available in applications with security and privacy constraints.
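To make the generic scheme concrete, the following is a minimal sketch of symmetric per-tensor INT8 weight quantization, illustrating the 4× size reduction and the bounded rounding noise mentioned above. This is a generic illustration, not the paper's method; the function names and the symmetric-range choice are our own assumptions.

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    """Map FP32 weights to INT8 with a single per-tensor scale.

    (Illustrative only: a symmetric scheme with range [-127, 127];
    the paper's methods use more refined, data-free range estimation.)
    """
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.randn(64, 64).astype(np.float32)
q, scale = quantize_int8(w)

# 4x size reduction: 1 byte per weight instead of 4.
assert q.nbytes * 4 == w.nbytes
# Rounding introduces bounded noise: at most half a quantization step.
assert np.max(np.abs(dequantize(q, scale) - w)) <= scale / 2 + 1e-6
```

Per-channel quantization, referenced later in the Hybrid Quantization scheme, would instead compute one `scale` per output channel, trading extra bookkeeping for lower quantization error.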

