MIXED-PRECISION INFERENCE QUANTIZATION: RADICALLY TOWARDS FASTER INFERENCE SPEED, LOWER STORAGE REQUIREMENT, AND LOWER LOSS

Abstract

Model quantization is important for compressing models and improving computation speed. However, the prevailing view among researchers is that the loss function value of a quantized model is usually higher than that of the full-precision model. This study provides a methodology for acquiring, without fine-tuning, a mixed-precision quantized model whose loss is lower than that of the full-precision model. Applying our algorithm to different models on different datasets, we obtain quantized models with lower loss than their full-precision counterparts.

1. INTRODUCTION

Neural network storage, inference, and training are computationally intensive due to the massive parameter sizes of neural networks. Therefore, developing compression algorithms for machine learning models is necessary. Model quantization, which relies on a model's robustness to computational noise, is one of the most important compression techniques. The primary sources of this noise are truncation and data type conversion errors. In the quantization process, the high-precision data type initially used for a model's parameters is replaced with a lower-precision data type. Both PyTorch and TensorFlow provide quantization techniques that convert floats to integers. The various quantization techniques share the same theoretical foundation: the substitution of approximate data for the original data in the storage and inference processes. A lower-precision data format requires less memory, and computing with lower-precision data requires fewer computational resources and less time. In quantization, the precision loss incurred by conversions between quantization levels and between data types is the source of the noise. Current works, however, raise the following issues: 1. No study examines how to reduce the loss function value of a model using quantization technology; there is a myth that a quantized model's loss is necessarily higher than that of the full-precision model. 2. The setting of some works conflicts with the requirements of current computation devices: such devices must use a single data type within one computation process, which means that a layer's weights and the layer's input must share the same quantization level. 3. No one has examined which types of models are stable in the quantization process and why. The purpose of this paper is mainly to discuss whether quantization technology always leads to an increase in the model's loss function and how to obtain a better-performing quantized model via a quantization method.
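The data type conversion error mentioned above can be observed directly. The following minimal sketch (illustrative, not part of the paper's method) measures the noise introduced when a double-precision value is stored in single precision:

```python
import numpy as np

# Truncation / data-type-conversion error: the kind of noise
# that quantization relies on models being robust to.
x64 = np.float64(0.1)   # value stored in double precision
x32 = np.float32(x64)   # same value converted to single precision

# The conversion error is small but nonzero, because 0.1 is not
# exactly representable in either format and float32 keeps fewer bits.
conversion_error = float(abs(np.float64(x32) - x64))
```

The same mechanism, at a much coarser granularity, produces the quantization noise discussed in the rest of the paper.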
In current papers, the main target of existing algorithms is to obtain a quantized model whose loss function value is not much higher than that of the full-precision model. In contrast, we give an algorithm that can find a quantized model that is better than the full-precision model, i.e., whose loss function value is lower than that of the full-precision model, while respecting the requirements of current computation devices. This research provides a basic analysis of the computational noise robustness of neural networks. Furthermore, we present a method for acquiring a quantized model with lower loss than the full-precision model by applying the floor and ceiling functions in different layers, with a focus on layerwise post-training static model quantization. As an added benefit of the algorithm analysis, we give a theoretical result answering which types of models are stable in the quantization process and why, when the noise introduced by the quantization process can be covered by the neighborhood concept. Problem 1. The objective of quantization is to solve the following optimization problem:

2. RELATED WORK

min_{q∈Q} ||q(w) − w||₂

where q is the quantization scheme, q(w) is the quantized model under scheme q, and w represents the weights, i.e., the parameters, of the neural network. Although Problem 1 gives researchers a target to aim for when performing quantization, this problem definition has two shortcomings: 1. The search space of all possible mixed-precision layout schemes is a discrete space that is exponentially large in the number of layers, and there is no effective method to solve the corresponding search problem. 2. There is a gap between the problem's target and the final task's target: no term related to the final task, such as the loss function or accuracy, appears in the problem definition.
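As a concrete illustration (a minimal sketch under assumed settings, not the paper's algorithm), the objective above can be evaluated for a simple uniform symmetric quantizer; the choice of rounding function (floor, ceiling, or round-to-nearest) per layer is one of the discrete choices that make the search space exponential in the number of layers:

```python
import numpy as np

def uniform_quantize(w, n_bits=8, rounding=np.round):
    """Uniform symmetric quantizer (an illustrative baseline).

    `rounding` may be np.floor, np.ceil, or np.round, mirroring the
    per-layer floor/ceiling choice discussed in the paper.
    """
    scale = np.max(np.abs(w)) / (2 ** (n_bits - 1) - 1)
    return rounding(w / scale) * scale  # dequantized weights

w = np.linspace(-1.0, 1.0, 101)  # toy "layer weights"

# ||q(w) - w||_2, the objective of Problem 1, for each rounding choice
errors = {name: np.linalg.norm(uniform_quantize(w, 8, r) - w)
          for name, r in [("floor", np.floor),
                          ("ceil", np.ceil),
                          ("round", np.round)]}
```

Note that round-to-nearest minimizes this norm, yet, as shortcoming 2 points out, minimizing the norm is only a proxy: it says nothing directly about the loss function of the final task.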

3.1. MODEL COMPUTATION, NOISE GENERATION AND QUANTIZATION

Compressed models for the inference process are computed using different methods depending on the hardware, programming methods, and deep learning framework. All of these methods introduce noise into the computing process. One reason for this noise problem is that, although it is common practice to store and compute model parameters using different data types, only data of the same precision can be combined in a single precise computation in a computer framework. Therefore, before performing computations on nonuniform data, a computer converts them into the same data type. Usually, a lower-precision data type in a standard computing environment is converted into a higher-precision data type; this ensures that the results are correct but requires more computational resources and time. However, to accelerate computation, some works on artificial intelligence (AI) computation propose converting higher-precision data types into lower-precision data types, based on the premise that AI models are not sensitive to compression noise. The commonly used quantization technique converts data directly, using a lower-precision data type to map linearly to a higher-precision data type. We use the following example, presented in Yao et al. (2020), to illustrate the quantization method. Suppose that two data objects input_1 and input_2 are to be subjected to a computing operation, such as multiplication. After the quantization process, we have Q_1 = int(input_1 / scale_1) and Q_2 = int(input_2 / scale_2), and we can write

Q_output = int(input_1 * input_2 / scale_output) ≈ int(Q_1 * Q_2 * scale_1 * scale_2 / scale_output),

where scale_output, scale_1, and scale_2 are precalculated scale factors that depend on the distributions of input_1, input_2, and the output; Q_i is stored as a lower-precision data type, such as an integer. All
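The scale-factor multiplication above can be sketched in a few lines. This is a toy instance with illustrative scale values (the real scale factors would be precalculated from the data distributions, as the text describes):

```python
def q(x, scale):
    # int() truncates toward zero, matching the int(.) in the example
    return int(x / scale)

# Illustrative precalculated scale factors (assumed values, not from the paper)
scale1, scale2, scale_out = 0.05, 0.02, 0.1

x1, x2 = 1.37, -0.85
Q1, Q2 = q(x1, scale1), q(x2, scale2)   # stored as low-precision integers

# Exact quantized product vs. the approximation computed purely from
# the integer codes Q1, Q2 and the scale factors.
Q_exact  = q(x1 * x2, scale_out)
Q_approx = int(Q1 * Q2 * scale1 * scale2 / scale_out)
```

Because the integer product Q1 * Q2 can be computed in cheap low-precision arithmetic, the approximation on the right-hand side is what makes quantized inference fast; the gap between Q_exact and Q_approx is exactly the truncation noise discussed above.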



Model compression methods include pruning methods Han et al. (2015); Li et al. (2016); Mao et al. (2017), knowledge distillation Hinton et al. (2015), weight sharing Ullrich et al. (2017), and quantization methods. From the perspective of the precision layout, post-training quantization methods can be mainly divided into channelwise Li et al. (2019); Qian et al. (2020), groupwise Dong et al. (2019b), and layerwise Dong et al. (2019a) methods. Layerwise mixed-precision layout schemes are more hardware-friendly: parameters of the same precision are organized together, making full use of a program's temporal and spatial locality. Some works analyze the best quantization relationship between a layer's weights and its input Sakr et al. (2017); Sakr & Shanbhag (2018), but in current computation architectures, the quantization levels for weights and inputs must be the same. A common problem definition for quantization Dong et al. (2019a); Morgan et al. (1991); Courbariaux et al. (2015); Yao et al. (2020) is as follows Gholami et al. (2021).

