IMPROVING POST TRAINING NEURAL QUANTIZATION: LAYER-WISE CALIBRATION AND INTEGER PROGRAMMING

Abstract

Lately, post-training quantization methods have gained considerable attention, as they are simple to use and require only a small unlabeled calibration set. This small dataset cannot be used to fine-tune the model without significant over-fitting. Instead, these methods only use the calibration set to set the activations' dynamic ranges. However, such methods have so far resulted in significant accuracy degradation when used below 8 bits (except on small datasets). Here we aim to break the 8-bit barrier. To this end, we minimize the quantization error of each layer separately by optimizing its parameters over the calibration set. We empirically demonstrate that this approach is (1) much less susceptible to over-fitting than standard fine-tuning approaches, and can be used even on a very small calibration set; and (2) more powerful than previous methods, which only set the activations' dynamic ranges. Furthermore, through a novel integer programming formulation, we demonstrate how to optimally allocate the bit-width of each layer while constraining either the accuracy degradation or the model compression. Finally, we suggest tuning the model's global statistics to correct biases introduced during quantization. Together, these methods yield state-of-the-art results for both vision and text models. For instance, on ResNet50, we obtain less than 1% accuracy degradation with 4-bit weights and activations in all layers but the smallest two. Our code is publicly available at https://github.com/
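To make the layer-wise calibration idea concrete, a minimal sketch (in illustrative notation; the exact set of optimized variables is an assumption here, not necessarily the formulation developed later in the paper) is, for a layer $l$ with weights $W_l$ and calibration inputs $X_l$,

$$\min_{\Delta_w,\,\Delta_x,\,V}\; \left\lVert\, W_l X_l \;-\; Q_{\Delta_w}\!\left(W_l + V\right)\, Q_{\Delta_x}\!\left(X_l\right) \right\rVert_F^2 ,$$

where $Q_{\Delta}(\cdot)$ denotes rounding to a quantization grid with step size $\Delta$ and $V$ is a small continuous perturbation of the weights. Because the objective depends on a single layer and only on the calibration inputs, it can be optimized for each layer independently, which is why such a procedure is far less prone to over-fitting than end-to-end fine-tuning on the same small set.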

1. INTRODUCTION

The pursuit of advanced Deep Neural Networks (DNNs) has led researchers to construct deeper and wider networks, making them expensive to use in terms of power and time. This increases the need for efficient implementations of these networks. Efficient networks reduce cloud-vendor costs and make it possible to run them on low-power devices such as smartphones and wearables. The most common off-the-shelf approach to improving network efficiency is quantization, which reduces the network's numerical precision and, with it, its complexity and memory footprint. DNN quantization techniques can be classified as either post-training or quantization-aware training (QAT) techniques (Han et al., 2015; Courbariaux et al., 2015; Hubara et al., 2017; Zhou et al., 2016). Although QAT techniques generally achieve better results, there are important real-world scenarios in which they are not applicable: cases where the training data is sensitive or simply unavailable at the time of deployment, for instance when off-the-shelf or legacy models are used, or when medical records are involved. Therefore, much attention has recently been dedicated to post-training quantization methods (Nagel et al., 2019; Banner et al., 2018; Zhao et al., 2019), which can be more easily applied in practice. These methods allow network quantization to happen seamlessly at deployment, requiring nothing from the user beyond a small unlabeled calibration set. Unfortunately, post-training quantization below 8 bits has so far incurred significant accuracy degradation, and in some cases even higher numerical precision is required. In this paper, our goal is to break this barrier by distilling all the information the pre-trained model and the calibration set encode. Specifically, we seek an optimal quantization scheme for current state-of-the-art hardware, which usually supports 16-, 8-, and 4-bit data types with per-channel quantization of the weights. To that end, we suggest a three-stage pipeline: layer-by-layer calibration of the quantized parameters over the calibration set, integer programming to allocate per-layer bit-widths, and a final tuning of the model's global statistics.
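As a hedged illustration of the integer-programming stage (a minimal sketch using the PuLP library, not the paper's code; the layer names, degradation values, sizes, and budget below are hypothetical placeholders rather than measurements), per-layer bit-width allocation can be posed as choosing exactly one supported bit-width per layer so as to minimize an additive degradation proxy subject to a model-size budget:

```python
# Minimal sketch of bit-width allocation as an integer program (PuLP).
# All numbers are hypothetical placeholders for illustration only.
import pulp

layers = ["conv1", "block1", "block2", "fc"]   # hypothetical layer names
bit_options = [4, 8, 16]                       # data types assumed supported by the hardware

# Per-layer degradation proxy (e.g., calibration-set loss increase) at each bit-width.
degradation = {
    "conv1":  {4: 0.80, 8: 0.10, 16: 0.00},
    "block1": {4: 0.20, 8: 0.05, 16: 0.00},
    "block2": {4: 0.15, 8: 0.04, 16: 0.00},
    "fc":     {4: 0.30, 8: 0.06, 16: 0.00},
}
# Per-layer size in MB at each bit-width (hypothetical).
size = {
    "conv1":  {4: 0.5, 8: 1.0, 16: 2.0},
    "block1": {4: 2.0, 8: 4.0, 16: 8.0},
    "block2": {4: 4.0, 8: 8.0, 16: 16.0},
    "fc":     {4: 1.0, 8: 2.0, 16: 4.0},
}
budget_mb = 12.0  # overall model-size constraint (hypothetical)

prob = pulp.LpProblem("bit_width_allocation", pulp.LpMinimize)
# x[l][b] = 1 iff layer l is quantized to b bits.
x = {l: {b: pulp.LpVariable(f"x_{l}_{b}", cat="Binary") for b in bit_options} for l in layers}

# Objective: total degradation, assumed to add up across layers.
prob += pulp.lpSum(degradation[l][b] * x[l][b] for l in layers for b in bit_options)
# Each layer gets exactly one bit-width.
for l in layers:
    prob += pulp.lpSum(x[l][b] for b in bit_options) == 1
# The compressed model must fit the size budget.
prob += pulp.lpSum(size[l][b] * x[l][b] for l in layers for b in bit_options) <= budget_mb

prob.solve(pulp.PULP_CBC_CMD(msg=False))
chosen = {l: next(b for b in bit_options if pulp.value(x[l][b]) > 0.5) for l in layers}
print(chosen)
```

Swapping the roles of the objective and the constraint (minimizing size subject to a degradation budget) gives the complementary formulation mentioned in the abstract, where model compression rather than accuracy degradation is the constrained quantity.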


