IMPROVING POST TRAINING NEURAL QUANTIZATION: LAYER-WISE CALIBRATION AND INTEGER PROGRAMMING

Abstract

Lately, post-training quantization methods have gained considerable attention, as they are simple to use and require only a small unlabeled calibration set. This small dataset cannot be used to fine-tune the model without significant over-fitting. Instead, these methods only use the calibration set to set the activations' dynamic ranges. However, such methods have always resulted in significant accuracy degradation when used below 8 bits (except on small datasets). Here we aim to break the 8-bit barrier. To this end, we minimize the quantization errors of each layer separately by optimizing its parameters over the calibration set. We empirically demonstrate that this approach is: (1) much less susceptible to over-fitting than the standard fine-tuning approaches, and can be used even on a very small calibration set; and (2) more powerful than previous methods, which only set the activations' dynamic ranges. Furthermore, by proposing a novel integer programming formulation, we demonstrate how to optimally allocate the bit-widths of each layer while constraining accuracy degradation or model compression. Finally, we suggest tuning the model's global statistics to correct biases introduced during quantization. Together, these methods yield state-of-the-art results for both vision and text models. For instance, on ResNet50 we obtain less than 1% accuracy degradation with 4-bit weights and activations in all layers but the smallest two. Our code is publicly available at https://github.com/

1. INTRODUCTION

The pursuit of advanced Deep Neural Networks (DNNs) leads researchers to construct deeper and wider networks, making them expensive to use in terms of power and time and increasing the need for efficient implementations. Efficient networks reduce cloud-vendor costs and make it possible to run them on low-power devices such as smartphones and wearable devices. The most common off-the-shelf approach to improving network efficiency is quantization, which reduces the numerical precision of the network and, with it, its complexity and memory footprint. DNN quantization techniques can be classified as either post-training or quantization-aware training (QAT) techniques (Han et al., 2015; Courbariaux et al., 2015; Hubara et al., 2017; Zhou et al., 2016). Although QAT techniques generally achieve better results, there are important real-world scenarios in which they are not applicable: cases where the training data is sensitive or simply unavailable at deployment time, for instance when off-the-shelf or legacy models are used, or when medical records are involved. Therefore, much attention has recently been dedicated to post-training quantization methods (Nagel et al., 2019; Banner et al., 2018; Zhao et al., 2019), which can be more easily applied in practice. These methods allow network quantization to happen seamlessly at deployment, requiring nothing from the user beyond a small unlabeled calibration set. Unfortunately, post-training quantization below 8 bits has so far incurred significant accuracy degradation, and in some cases even higher numerical precision is required. In this paper, our goal is to break this barrier by distilling all the information the pre-trained model and the calibration set encode. We aim to find an optimal scheme for current state-of-the-art hardware, which usually supports 16-, 8-, and 4-bit data types with per-channel quantization of the weights. To that end, we suggest a three-stage pipeline: methods applied solely on a small calibration set to reduce the local error introduced during the quantization process (e.g., round-off errors), followed by integer programming to determine the bit-width of each layer so that the overall accuracy degradation is minimized. Even without mixed precision, the suggested method is much less prone to over-fitting than current methods and yields best-in-class results for 8-bit MobileNet-V2 and BERT-base, trained on the ImageNet and SQuAD1.1 datasets, respectively.

Our paper makes several contributions to mixed-precision post-training quantization:

1. AdaQuant: A layer-by-layer optimization method that minimizes the error between the quantized layer output and the full-precision layer output. This method can consume only a small calibration dataset from the training data without over-fitting. In a comprehensive study, we show that AdaQuant defines a new state of the art for post-training quantization on several networks and tasks, including vision models (ResNet18, ResNet50, MobileNet-V2) and a language model (BERT).

2. Integer programming: As some parts of the network may tolerate lower precision than others, we suggest an integer-linear-programming-based approach for determining the precision level of each layer. This method maximizes either the expected speedup or the savings in power consumption without violating a predefined constraint on network accuracy degradation or compression.

3. Batch-norm tuning: Following quantization, we observe an inherent bias in the mean and variance of the batch-norm statistics. We show that by employing the re-estimated statistics in batch normalization, much of the quantized network's degradation can be recovered (see the sketch after this list).

4. Light and advanced pipelines: We analyze the advantages and disadvantages of each of the given methods and suggest two pipelines: (1) a light pipeline that does not require a backward pass and thus can be invoked even on inference-only hardware; and (2) an advanced pipeline that also includes AdaQuant and bias tuning.
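As a rough illustration of the batch-norm tuning step referenced in contribution 3, the following PyTorch sketch re-estimates BatchNorm running statistics on the calibration set. It is our illustration rather than the paper's released code; `calib_loader` is a hypothetical DataLoader over the calibration images, and the reset/cumulative-average policy is an assumption.

```python
import torch
import torch.nn as nn

@torch.no_grad()
def reestimate_bn_stats(model: nn.Module, calib_loader, num_batches: int = 32):
    """Reset and re-estimate BatchNorm running statistics on the calibration set."""
    for m in model.modules():
        if isinstance(m, (nn.BatchNorm1d, nn.BatchNorm2d, nn.BatchNorm3d)):
            m.reset_running_stats()
            m.momentum = None  # None -> cumulative moving average over the seen batches
    model.train()  # BN layers update their running stats only in train mode
    for i, (images, _) in enumerate(calib_loader):
        if i >= num_batches:
            break
        model(images)
    model.eval()
    return model
```

Since only the BN buffers (not parameters) are updated, this step needs no backward pass.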

2. RELATED WORK

There has been a significant effort to accelerate inference via quantization (Courbariaux et al., 2015; Han et al., 2015; Rastegari et al., 2016; Zhou et al., 2017). These works involve re-training in order to compensate for the degradation caused by the quantization process. Post-training quantization, on the other hand, is applied to a model after it has been trained. Thus, it avoids re-training and, as such, is much simpler to use. However, naively quantizing a full-precision model to INT4 or lower to accelerate computation usually incurs significant accuracy degradation (Krishnamoorthi, 2018; Jacob et al., 2018).

AdaQuant: A recent post-training quantization method, termed AdaRound (Nagel et al., 2020), suggested optimizing the rounding policy. Instead of using the predominant round-to-nearest approach, it formulates a per-layer quadratic optimization problem over the round-off error. Our proposed method, AdaQuant, takes another step and relaxes AdaRound's implicit constraint, which forces the quantized weights to stay within ±1 of their round-to-nearest value. This is done by optimizing the weights and quantization parameters of each layer separately, over the calibration set, to minimize the MSE between the layer's original and quantized outputs. As opposed to AdaRound, we apply AdaQuant to find optimal quantization parameters not only for the weights but also for the activations. In addition, we suggest two flavors of AdaQuant: (1) parallel-AdaQuant, suited for the mixed-precision setting; and (2) sequential-AdaQuant, suited for a fixed configuration.
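To make the per-layer objective above concrete, here is a minimal PyTorch sketch of an AdaQuant-style optimization for a single linear layer. This is our own sketch under stated assumptions, not the authors' released implementation: the symmetric uniform quantizer, the straight-through estimator, the hyper-parameters, and the cached calibration tensors `x_calib` / `y_calib` (the layer's full-precision inputs and outputs) are all illustrative.

```python
import torch
import torch.nn.functional as F

def ste_round(x):
    # Round with a straight-through estimator: forward rounds, backward is identity.
    return x + (torch.round(x) - x).detach()

def quantize(t, scale, num_bits=4):
    # Symmetric uniform quantizer; gradients flow to both the tensor and the scale.
    qmax = 2 ** (num_bits - 1) - 1
    q = torch.clamp(ste_round(t / scale), -qmax - 1, qmax)
    return q * scale

def adaquant_layer(weight, bias, x_calib, y_calib, num_bits=4, steps=1000, lr=1e-3):
    """Optimize one layer's weights, bias, and quantization scale on the calibration
    set, minimizing the MSE between its full-precision and quantized outputs."""
    w = weight.detach().clone().requires_grad_(True)
    b = bias.detach().clone().requires_grad_(True)
    scale = (weight.detach().abs().max() / (2 ** (num_bits - 1) - 1)).clone().requires_grad_(True)
    opt = torch.optim.Adam([w, b, scale], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        y_hat = F.linear(x_calib, quantize(w, scale, num_bits), b)
        loss = F.mse_loss(y_hat, y_calib)  # y_calib: cached full-precision layer outputs
        loss.backward()
        opt.step()
    return quantize(w, scale, num_bits).detach(), b.detach(), scale.detach()
```

In the parallel flavor each layer is optimized independently from cached full-precision inputs and outputs, whereas the sequential flavor feeds each layer the outputs of its already-quantized predecessors; a convolutional layer would use `F.conv2d` in place of `F.linear`.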



Integer programming: Early work by Lin et al. (2016) used a convex optimization formulation which results in a simple greedy compression scheme. Aflalo et al. (2020) used a combinatorial optimization approach for network pruning: their problem was formulated as a Knapsack problem that optimizes the trade-off between the channels' importance and their associated computational cost. Cai et al. (2020) find a mixed-precision configuration with a guaranteed Pareto-efficient allocation with respect to model size and accuracy degradation. While this provides a "best-effort" standard (e.g., the configuration cannot be further compressed without hurting accuracy), it does not suggest which of all possible outcomes is best. To the best of our knowledge, this work is the first to formalize a generic integer program, which can easily be adapted to various types of models and requirements with a clear objective and constraints.
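To illustrate the kind of bit-width allocation program discussed above, the sketch below states a small integer-linear program with the `pulp` library. The layer names, parameter counts, per-layer degradation estimates, candidate bit-widths, and the 60% compression constraint are all illustrative assumptions, not the paper's exact objective or measured numbers.

```python
import pulp

# Illustrative per-layer data: parameter counts and assumed accuracy-degradation
# estimates for quantizing each layer to each candidate bit-width.
layers = ["conv1", "layer1", "layer2", "fc"]
params = {"conv1": 9408, "layer1": 220_000, "layer2": 1_220_000, "fc": 2_050_000}
bits = [8, 4]
degradation = {  # hypothetical per-(layer, bit-width) accuracy-drop estimates
    ("conv1", 8): 0.02, ("conv1", 4): 0.60,
    ("layer1", 8): 0.01, ("layer1", 4): 0.15,
    ("layer2", 8): 0.01, ("layer2", 4): 0.10,
    ("fc", 8): 0.02, ("fc", 4): 0.12,
}

prob = pulp.LpProblem("bit_width_allocation", pulp.LpMinimize)
x = pulp.LpVariable.dicts("x", [(l, b) for l in layers for b in bits], cat="Binary")

# Objective: minimize the total estimated accuracy degradation.
prob += pulp.lpSum(degradation[(l, b)] * x[(l, b)] for l in layers for b in bits)

# Exactly one bit-width per layer.
for l in layers:
    prob += pulp.lpSum(x[(l, b)] for b in bits) == 1

# Constraint: weight memory must not exceed 60% of the 8-bit model size.
full_size = sum(params[l] * 8 for l in layers)
prob += pulp.lpSum(params[l] * b * x[(l, b)] for l in layers for b in bits) <= 0.6 * full_size

prob.solve(pulp.PULP_CBC_CMD(msg=False))
for l in layers:
    chosen = [b for b in bits if pulp.value(x[(l, b)]) > 0.5]
    print(l, "->", chosen[0], "bits")
```

The paper's formulation instead maximizes the expected speedup or power savings subject to a degradation or compression constraint; that variant is also linear and fits the same template with the objective and constraint swapped.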
