MIXQUANT: A QUANTIZATION BIT-WIDTH SEARCH THAT CAN OPTIMIZE THE PERFORMANCE OF YOUR QUANTIZATION METHOD

Abstract

Quantization is a technique for creating efficient Deep Neural Networks (DNNs) by performing computations and storing tensors at bit-widths lower than f32 floating-point precision. Quantization reduces model size and inference latency, and therefore allows DNNs to be deployed on platforms with constrained computational resources and in real-time systems. However, quantization can lead to numerical instability caused by roundoff error, which results in inaccurate computations and, therefore, a decrease in quantized model accuracy. In this paper we focus on simulated quantized inference, where the quantized model parameters are stored in low precision but the mathematical operations on them (e.g., matrix multiplications and additions) are performed with floating-point arithmetic. This means that the DNN parameters are first quantized from f32 to, for example, int4, and then dequantized back to f32 to perform computations. We show that this roundtrip of quantizing and dequantizing the model parameters introduces roundoff error, which may lead to numerical instability. Prior works have shown that biases and activations are more sensitive to quantization and are best kept in full precision or quantized with higher bit-widths; analogously, we show that some weights are more sensitive than others, which should be reflected in their quantization bit-widths. To that end we propose MixQuant, a search algorithm that finds the optimal custom quantization bit-width for each layer's weights based on roundoff error and can be combined with any quantization method as a form of pre-processing optimization. We show that combining MixQuant with BRECQ, a state-of-the-art quantization method, yields better quantized model accuracy than BRECQ alone. Additionally, we combine MixQuant with vanilla asymmetric quantization to show that MixQuant has the potential to optimize the performance of any quantization technique.

1. INTRODUCTION

Quantization is a method for mapping continuous values to a set of discrete values. The goal of neural network quantization is to perform computations and store tensors at bit-widths lower than floating-point precision in order to reduce model size and inference latency while maintaining model accuracy, which allows DNNs to be deployed on platforms with constrained computational resources, e.g., for real-time inference on mobile devices. Quantization can be performed during training or inference. In this paper we focus on quantized inference, specifically post-training quantization, which quantizes a full-precision trained model without the need for re-training or fine-tuning. Quantized inference can be either simulated or integer-only; in this paper we focus on simulated quantization, where the quantized model parameters are stored in low precision but the mathematical operations on them (e.g., matrix multiplications and additions) are performed with floating-point arithmetic (Gholami et al., 2022). In TensorFlow, PyTorch, and HuggingFace (QDQBERT model), simulated quantization is referred to as fake quantization. This means that the DNN parameters are first quantized from f32 to, for example, int4, and then dequantized back to f32 to perform the forward pass executed during inference. We show that this roundtrip of quantizing and dequantizing the model parameters introduces roundoff error, which may lead to numerical instability. Prior works have shown that biases and activations are more sensitive to quantization and are best kept in full precision or quantized with higher bit-widths (Zhou et al., 2016); analogously, we show that some weights are more sensitive than others, which should be reflected in their quantization bit-widths.
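The quantize-dequantize roundtrip described above can be illustrated with vanilla asymmetric (affine) quantization. The sketch below is ours, not the paper's implementation; the function names are illustrative. It quantizes a weight tensor to int4, dequantizes it back to f32, and measures the resulting roundoff error:

```python
import numpy as np

def asymmetric_quantize(w, num_bits):
    """Map a float tensor to unsigned integers with an affine (asymmetric) scheme."""
    qmin, qmax = 0, 2 ** num_bits - 1
    scale = (w.max() - w.min()) / (qmax - qmin)
    zero_point = np.round(-w.min() / scale)
    q = np.clip(np.round(w / scale + zero_point), qmin, qmax)
    return q.astype(np.int32), scale, zero_point

def dequantize(q, scale, zero_point):
    """Recover an f32 approximation of the original tensor."""
    return scale * (q.astype(np.float32) - zero_point)

rng = np.random.default_rng(0)
w = rng.normal(0.0, 1.0, size=1000).astype(np.float32)

# int4 roundtrip: quantize, then dequantize back to f32 for computation.
q, s, z = asymmetric_quantize(w, num_bits=4)
w_hat = dequantize(q, s, z)
roundoff = np.abs(w - w_hat).mean()  # mean roundoff error introduced by the roundtrip
```

Running the same roundtrip at higher bit-widths shrinks the roundoff error, which is exactly the sensitivity that a per-layer bit-width choice can exploit.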
To that end we propose MixQuant, a search algorithm that finds the optimal quantization bit-width from int2, int3, int4, int5, int6, int7, and int8 for each layer's weights based on roundoff error and can be combined with any quantization method as a form of pre-processing optimization. We show that combining MixQuant with BRECQ (Li et al., 2021), a state-of-the-art quantization method, yields better quantized model accuracy than BRECQ alone. Additionally, we combine MixQuant with vanilla asymmetric quantization to show that MixQuant has the potential to optimize the performance of any quantization technique. MixQuant has three main benefits. First, MixQuant is a component of the quantization process that finds optimal mixed-precision quantization bit-widths, which can be plugged into any quantization method to optimize its performance. Second, MixQuant is linear and runs in a matter of seconds, which makes it practical. Third, combining MixQuant with BRECQ, a state-of-the-art quantization method, yields better quantized model accuracy than BRECQ alone, OMSE (Choukroun et al., 2019), AdaRound (Nagel et al., 2020), AdaQuant (Hubara et al., 2020), and Bit-Split (Wang et al., 2020).

2. RELATED WORK

Neural Network Quantization. Neural network quantization can be applied to training (Gupta et al., 2015; Zhou et al., 2016; Hubara et al., 2017; Bartan & Pilanci, 2021; Elthakeb et al., 2020) or to inference. There are two paradigms in quantized DNN inference: post-training quantization (PTQ) and quantization-aware training (QAT) (Jacob et al., 2018; Tailor et al., 2021). In contrast to PTQ, QAT requires that the f32 model be retrained while simulating quantized inference in the forward pass. While MixQuant can be integrated with either, we focus on PTQ, which does not require any re-training.

Hubara et al. (2021) and Li et al. (2021) are amongst the current state-of-the-art post-training quantization works. Hubara et al. (2021) introduce AdaQuant, which finds optimal quantization for both weights and activations and is based on minimizing the error between quantized layer outputs and f32 layer outputs. This approach is similar to MixQuant; however, MixQuant finds the optimal quantization bit-widths based on quantization error (QE) minimization, while AdaQuant treats the bit-width as a constant and quantizes all weights and activations using the same bit-width (either int8 or int4). Li et al. (2021) propose BRECQ, a quantization method based on DNN block reconstruction. Nagel et al. (2020) propose AdaRound, an adaptive rounding method for weights that achieves better accuracy than rounding to the nearest. They formulate the rounding procedure as an optimization problem that minimizes the expected difference between the model loss with and without weight-quantization perturbation. Li et al. (2020) develop a method based on constraining all quantization levels to sums of Powers-of-Two terms, Wang et al. (2020) propose a Bit-Split and Stitching framework (Bit-split), Nahshan et al. (2021) study the effect of quantization on the structure of the loss landscape, Banner et al. (2019) develop ACIQ-Mix, a 4-bit convolutional neural network quantization method, and Cai et al. (2020) perform zero-shot quantization (ZeroQ) based on distilling a dataset that matches the input data distribution.

Quantization originated with convolutional neural networks, but it has been extended to natural language processing neural networks as well. Chen & Sun (2020) propose differentiable product quantization, a learnable compression scheme for embedding layers in DNNs. Kim et al. (2021) study an integer-only quantization scheme for transformers, where the entire inference is performed with pure integer arithmetic. Other works have studied hardware optimization for quantization or the relationship between quantization and adversarial robustness. Han et al. (2020) focus on performance optimization for low-bit convolution on ARM CPU and NVIDIA GPU. Fu et al. (2021) investigate quantized models' adversarial robustness. They find that when an adversarially trained model is quantized to different precisions in a post-training manner, the associated adversarial attacks transfer poorly between different precisions.

