MIXED-PRECISION INFERENCE QUANTIZATION: RADICALLY TOWARDS FASTER INFERENCE SPEED, LOWER STORAGE REQUIREMENT, AND LOWER LOSS

Abstract

Model quantization is important for compressing models and improving computing speed. However, it is widely assumed that the loss of a quantized model is usually higher than that of the full-precision model. This study provides a methodology for acquiring a mixed-precision quantized model with a lower loss than the full-precision model, without "fine-tuning". Applying our algorithm to different models on different datasets, we obtain quantized models with lower loss than their full-precision counterparts.

1. INTRODUCTION

Neural network storage, inference, and training are computationally intensive because of the massive parameter sizes of neural networks. Therefore, developing compression algorithms for machine learning models is necessary. Model quantization, which relies on the robustness of models to computational noise, is one of the most important compression techniques. The primary sources of noise are truncation and data type conversion errors. In the quantization process, the initial high-precision data type used for a model's parameters is replaced with a lower-precision data type. Both PyTorch and TensorFlow provide quantization techniques that translate floats to integers. The various quantization techniques share the same theoretical foundation: approximate data are substituted for the original data in the storage and inference processes. A lower-precision data format requires less memory, and computing with lower-precision data requires fewer computing resources and less time. In quantization, the precision loss in conversions between quantization levels and between data types is the source of the noise. Current works, however, leave the following issues open:
1. No study examines how to reduce the loss function value of a model using quantization technology. There is a myth that the quantized model's loss is always higher than that of the full-precision model.
2. The setting of some works conflicts with the requirements of current computation devices: these devices have to use the same data type for both operands in one computation, which means that a layer's weights and the layer's input have to be at the same quantization level.
3. No one has examined which types of models are stable in the quantization process and why.
The purpose of this paper is mainly to discuss whether quantization always increases the model's loss function, and how to obtain a better-performing quantized model through quantization.
In current papers, the main target of the algorithms is to obtain a quantized model whose loss function value is not much higher than that of the full-precision model. In contrast, we want to give an algorithm that finds a quantized model that is better than the full-precision model, i.e., whose loss function value is lower than that of the full-precision model, under the requirements of current computation devices. This research provides a basic analysis of the computational noise robustness of neural networks. Furthermore, we present a method for acquiring a quantized model with a lower loss than the full-precision model by using the floor and ceiling functions in different layers, with a focus on layerwise post-training static model quantization. As an added benefit of the algorithm analysis, we give a theoretical result that answers which types of models are stable in the quantization process and why, when the noise introduced by the quantization process is covered by the neighborhood concept.

2. RELATED WORK

Problem 1 The objective of quantization is to solve the following optimization problem:

$$\min_{q \in Q} \|q(w) - w\|^2$$

where $q$ is the quantization scheme, $q(w)$ is the quantized model with quantization $q$, and $w$ represents the weights, i.e., parameters, in the neural network.

Although problem 1 gives researchers a target to aim for when performing quantization, the current problem definition has two shortcomings:
1. The search space of all possible mixed-precision layout schemes is a discrete space that is exponentially large in the number of layers. There is no effective method to solve the corresponding search problem.
2. There is a gap between the problem target and the final task target. No terms related to the final task target, such as the loss function or accuracy, appear in the current problem definition.
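To make the second shortcoming concrete, the following minimal sketch evaluates the Problem 1 objective for a few bit widths. The uniform min/max quantizer, the function names and the weight values are assumptions made for illustration only; they are not taken from the paper.

```python
def uniform_quantize(weights, bits):
    """Quantize to 2**bits evenly spaced levels spanning [min, max] (assumed scheme)."""
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / (2 ** bits - 1)      # assumes hi > lo
    return [lo + round((w - lo) / scale) * scale for w in weights]


def problem1_objective(weights, bits):
    """Squared L2 distance ||q(w) - w||^2 from Problem 1."""
    quantized = uniform_quantize(weights, bits)
    return sum((q - w) ** 2 for q, w in zip(quantized, weights))


# Hypothetical layer weights, chosen only for illustration.
weights = [-0.82, -0.31, 0.05, 0.27, 0.64, 0.93]
objective_by_bits = {bits: problem1_objective(weights, bits) for bits in (4, 8, 16)}
```

The objective only measures the distance between the weights and their quantized values; neither the dataset nor the loss function appears anywhere in the computation.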

3.1. MODEL COMPUTATION, NOISE GENERATION AND QUANTIZATION

Compressed models for the inference process are computed using different methods depending on the hardware, programming methods and deep learning framework. All of these methods introduce noise into the computing process. One reason for this noise is that, although it is common practice to store model parameters using different data types, only data of the same precision can be combined in one computation. Therefore, before performing computations on nonuniform data, a computer converts them into the same data type. Usually, in a standard computing environment, the lower-precision data type is converted into the higher-precision data type; this ensures that the results are correct but requires more computational resources and time. However, to accelerate computation, some works on artificial intelligence (AI) computations propose converting higher-precision data types into lower-precision data types, based on the premise that AI models are not sensitive to compression noise. The commonly used quantization technology converts data directly and uses a lower-precision data type to map to a higher-precision data type linearly. We use the following example, presented in Yao et al. (2020), to illustrate the quantization method. Suppose that two data objects $input_1$ and $input_2$ are to be subjected to a computing operation, such as multiplication. After the quantization process, we have $Q_1 = int(\frac{input_1}{scale_1})$ and $Q_2 = int(\frac{input_2}{scale_2})$, and we can write

$$Q_{output} = int\left(\frac{input_1 * input_2}{scale_{output}}\right) \approx int\left(Q_1 Q_2 \frac{scale_1 * scale_2}{scale_{output}}\right)$$

where $scale_{output}$, $scale_1$ and $scale_2$ are precalculated scale factors that depend on the distributions of $input_1$, $input_2$ and the output; $Q_i$ is stored as a lower-precision data type, such as an integer. All scale terms can be precalculated and established ahead of time.
Then, throughout the whole inference process, only computations on the $Q_i$ values are needed, which are fast. In this method, the noise is introduced by the int(·) operation. This basic idea gives rise to several variants, such as (non)uniform quantization and (non)symmetrical quantization. When we focus on the quantization strategy, i.e., the round function in a quantization framework like Micronet, there are at least three strategies: rounding up, i.e., the ceil function in Python; rounding down, i.e., the floor function in Python; and rounding to nearest, i.e., the round function in Python. Rounding to nearest is the most common way to perform quantization. In this paper, however, we show how to mix rounding up and rounding down to obtain a mixed-precision quantized model that is better than the full-precision model.
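The multiplication example and the three rounding strategies can be sketched in a few lines of Python. This is an illustrative sketch only: the input values and scale factors are hypothetical, and the helper names `quantize` and `quantized_product` do not come from the paper or from any framework.

```python
import math


def quantize(value, scale, rounding=round):
    """Map a real value to an integer code Q = rounding(value / scale)."""
    return int(rounding(value / scale))


def quantized_product(q1, q2, scale1, scale2, scale_out, rounding=round):
    """Integer code of the product, computed from the codes q1 and q2 only."""
    return int(rounding(q1 * q2 * scale1 * scale2 / scale_out))


# Hypothetical inputs and precalculated scale factors.
input1, input2 = 0.73, -1.18
scale1, scale2, scale_out = 0.01, 0.02, 0.01

q1 = quantize(input1, scale1)
q2 = quantize(input2, scale2)
q_out = quantized_product(q1, q2, scale1, scale2, scale_out)
approx = q_out * scale_out            # dequantized result
exact = input1 * input2               # full-precision reference

# Sign of the quantization error under the three rounding strategies.
# Note: Python's round() rounds exact ties to the nearest even integer.
x, scale = 0.737, 0.01
errors = {
    name: quantize(x, scale, fn) * scale - x
    for name, fn in (("floor", math.floor), ("ceil", math.ceil), ("round", round))
}
```

Here the floor error is never positive, the ceil error is never negative, and the round error is at most half a scale step in magnitude; this sign property is what the later sections rely on.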

3.2. NEURAL NETWORKS

In this paper, we mainly use the mathematical properties of extreme points to analyze quantization methods. This approach is universal to all cases, not only neural networks. However, there is a myth in the community that it is the neural network properties that guarantee the success of quantization methods Wang et al. (2019).

Definition 1 A neural network can be described in the form of the following Eq. 1.

$$model(x) = h_1(h_{2,1}(h_{3,1}(\ldots), \ldots, h_{3,k}, w_{2,1}), h_{2,2}(h_{3,k+1}(\ldots, w_3), \ldots, w_{2,2}), \ldots, w_1) \quad (1)$$

where $h_{i,j}$, $i \in [2, \ldots, n]$, are the $(n-i+1)$th layers in the neural network and $w_{i,j}$ is the parameter in $h_{i,j}(\cdot)$.

Definition 1 means that a neural network, without training, can be any function. With definition 1, a neural network is no longer a mathematical concept, but this idea is widely used in practice Roesch et al. (2019). We can see from definition 1 that being a composite function is the only mathematical property of a neural network that can be used for analysis. In practice, the loss function is one method to evaluate a neural network: a lower loss on a dataset means a better-performing neural network. For example, the training process optimizes the model's loss, i.e., the following Eq. 2.

$$\min_w f(w) = E_{sample}\, \ell(w, sample) = \frac{1}{m} \sum_{(x_i, y_i) \in D} \ell(w, x_i, y_i) \quad (2)$$

where $f(\cdot)$ is the loss of the model on a dataset, $w$ represents the model parameters, $D$ is the dataset, $m$ is the size of the dataset, $\ell(\cdot)$ is the loss function for a sample and $(x_i, y_i)$ represents a sample in the dataset and its label. In this paper, we state our conclusions for sequential neural networks because they are easy to describe; the conclusions do not depend on the structure of the neural network. For a sequential $n$-layer neural network, $\ell(\cdot)$ is given by Eq. 3.

$$\ell(w, x_i, y_i) = L(model_n(x_i, w), y_i), \qquad model_n = h_1(h_2(h_3(h_4(\cdots h_n(h_{n+1}, w_n) \cdots, w_4), w_3), w_2), w_1) \quad (3)$$

where $L(\cdot)$ is the loss function, such as the cross-entropy function; $h_i$, $i \in [1, \ldots, n]$, is the $(n-i+1)$th layer in the neural network; $w = (w_n^T, w_{n-1}^T, \cdots, w_1^T)^T$, where $w_i$ is the parameter in $h_i(\cdot)$; and, for a unified format, $h_{n+1}$ stands for the sample $x$.

In addition to the noise added to the parameters directly, noise is also introduced between different layers in the inference process, because different quantization levels or data types of different precisions are used in different layers.

4. ALGORITHM AND BASIC ANALYSIS

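After quantization, the quantized loss for a sample in the inference process, i.e., $\bar{\ell}(\cdot)$, is as follows.

$$\bar{\ell}(w, x_i, y_i) = L(h_1(h_2(\cdots h_n(h_{n+1} + \varepsilon_n, w_n + \delta_n) + \varepsilon_{n-1} \cdots, w_2 + \delta_2) + \varepsilon_1, w_1 + \delta_1), y_i)$$

where $\delta_i$, $i \in [1, \ldots, n]$, and $\varepsilon_i$, $i \in [1, \ldots, n]$, are the minor errors that are introduced by model parameter quantization and by data type conversion in the mixed-precision layout scheme, respectively. Thus, we obtain the following expression based on the basic total differential calculation.

$$\bar{\ell}(w, x_i, y_i) - \ell(w, x_i, y_i) = \sum_{i=1}^{n} \left( \frac{\partial \ell}{\partial h_{i+1}} \cdot \varepsilon_i + \frac{\partial \ell}{\partial w_i} \cdot \delta_i \right) \quad (4)$$

where $\cdot$ is the inner product and $*$ is the scalar product in the following parts. For the loss on the whole dataset, we obtain

$$\min_{\varepsilon \in E} \bar{f}(w) - f(w) = \frac{1}{m} \sum_{(x_j, y_j) \in D} \sum_{i=1}^{n} \left( \frac{\partial \ell}{\partial h_{i+1}} \cdot \varepsilon_i + \frac{\partial \ell}{\partial w_i} \cdot \delta_i \right) = \frac{1}{m} \sum_{i=1}^{n} \sum_{(x_j, y_j) \in D} \frac{\partial \ell}{\partial h_{i+1}} \cdot \varepsilon_i \quad (5)$$

where $\bar{f}(w) = \frac{1}{m} \sum \bar{\ell}(\cdot)$. The second equality in Eq. 5 holds because, for a well-trained model, the expectation of the gradient of $\ell(\cdot)$ with respect to the parameters is zero, i.e., every component of $\sum_{(x_j, y_j) \in D} \frac{\partial \ell}{\partial w}$ vanishes: $\sum_{(x_j, y_j) \in D} \frac{\partial \ell}{\partial w_i} = 0$.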
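The first-order expansion above can be checked numerically on a toy model. The sketch below assumes a hypothetical two-layer scalar network (a tanh layer followed by a linear layer) with a squared-error loss; the model, the parameter values and the noise values are all assumptions chosen for illustration.

```python
import math


def forward(x, w1, w2, eps1=0.0, eps2=0.0, d1=0.0, d2=0.0):
    """Two-layer sequential model h1(h2(x + eps2, w2 + d2) + eps1, w1 + d1)."""
    h2 = math.tanh((w2 + d2) * (x + eps2))   # inner layer with noisy input and weight
    return (w1 + d1) * (h2 + eps1)           # outer layer with noisy input and weight


def loss(out, y):
    return (out - y) ** 2


# Hypothetical sample, parameters and small quantization errors.
x, y, w1, w2 = 0.5, 1.0, 1.5, 0.8
eps1, eps2, d1, d2 = 1e-3, -1e-3, 5e-4, -5e-4

# Analytic gradients at the noise-free point.
h2 = math.tanh(w2 * x)
dl_dout = 2.0 * (w1 * h2 - y)
grad_h2 = dl_dout * w1                           # gradient w.r.t. the input of layer 1
grad_w1 = dl_dout * h2
grad_x = dl_dout * w1 * (1.0 - h2 ** 2) * w2     # gradient w.r.t. the input of layer 2
grad_w2 = dl_dout * w1 * (1.0 - h2 ** 2) * x

# Total-differential estimate of the loss change versus the exact change.
first_order = grad_h2 * eps1 + grad_w1 * d1 + grad_x * eps2 + grad_w2 * d2
exact_change = loss(forward(x, w1, w2, eps1, eps2, d1, d2), y) - loss(forward(x, w1, w2), y)
```

For noise of this size the two quantities agree up to second-order terms, which is the regime in which the total differential is valid.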

4.1.2. TARGET AND ALGORITHM GUARANTEE

The key is to choose an appropriate vector $\varepsilon$ to gain a lower-loss model. When the inner product term, i.e., $\sum_{(x_i, y_i) \in D} \frac{\partial \ell}{\partial h_{i+1}} \cdot \varepsilon$, is negative, the loss of the quantized model, i.e., $\bar{f}$, is lower than that of the full-precision model. The goal is therefore an appropriate $\varepsilon$ that produces a negative inner product. Existing methods do not take the error in the layer's input into consideration, which prevents their use and analysis in the mixed-precision computing area. As a result, these works can only be used to store a compressed neural network on disk. When the compressed model is loaded into memory for inference, it has to be recovered into the full-precision model.

Figure 1: When the predicted point ($x_p$) is outside the neighborhood range but not very far from $x_0$, the secant line between $x_0$ and $x_1$ performs significantly better than the tangent line. In practice, $x_1$ is chosen as the maximum quantization noise.

4.2. THE MAP FROM MATHEMATICAL ANALYSIS TO REAL ENGINEERING

In the above analysis, the whole process assumes that the vector $\varepsilon$ is small enough for the total differential method to apply. However, in practice, the range of $\varepsilon$ may be within [-0.1, 0.1], which escapes the concept of neighborhood. What is more, mapping the vector $\varepsilon$ onto the round operation should be fully discussed. This part shows how to deal with this gap between analysis and engineering.

4.2.1. ROUND FUNCTION CHOICE

We use the convenient language of probability theory to describe $\frac{\partial \ell}{\partial h_{i+1}} \cdot \varepsilon$, since $\varepsilon$ is naturally a stochastic vector. We set $\varepsilon = [e_1, e_2, \ldots, e_k]$, where the $e_i$ are i.i.d. random variables. We also set $\frac{\partial \ell}{\partial h_{i+1}} = [p_1, p_2, \ldots, p_k]$, where the $p_i$ are i.i.d. random variables.¹ $e$ and $p$ are independent of each other. Then, we have $\frac{\partial \ell}{\partial h_{i+1}} \cdot \varepsilon = \sum_{i=1}^{k} e_i * p_i = kep$ and the following Eq. 6.

$$E \frac{\partial \ell}{\partial h_{i+1}} \cdot \varepsilon = E \sum_{i=1}^{k} e_i * p_i = E\,kep = k\,Ee\,Ep \quad (6)$$

For a trained model, $Ep$ can be computed as $Ep = \frac{1}{k} * \frac{\partial \ell}{\partial h_{i+1}} \cdot \mathbf{1}$, where $\mathbf{1}$ is the all-ones vector. Then, to gain a negative $E \frac{\partial \ell}{\partial h_{i+1}} \cdot \varepsilon$, $Ee$ should have the opposite sign to $Ep$. To obtain a suitable vector $\varepsilon$, we use different round functions to fix the sign of $Ee$. The round up function, i.e., the ceil function in Python, produces an error vector whose elements are all non-negative. The round down function, i.e., the floor function, produces an error vector whose elements are all non-positive. Thus, the choice of round method determines whether $Ee$ is positive or negative. Although the parameters in layers have strong noise robustness, we still try to add as little noise to them as possible. Thus, in the parameter quantization process, we use the rounding method, i.e., the round function in Python, because rounding to nearest exerts less noise on the original data.
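The sign argument can be illustrated with a short sketch. It assumes a uniform quantizer with step `scale` and picks the round function from the sign of the mean gradient; the helper names and all numeric values are hypothetical, and the gradient here is chosen with all-positive entries so that every term of the inner product has the same sign.

```python
import math


def choose_rounding(grad):
    """Pick the round function whose error sign is opposite to the mean gradient Ep."""
    mean_grad = sum(grad) / len(grad)      # Ep = (1/k) * grad . 1
    if mean_grad > 0:
        return math.floor                  # round down: errors <= 0
    if mean_grad < 0:
        return math.ceil                   # round up: errors >= 0
    return round                           # no preferred direction


def quantization_error(values, scale, rounding):
    """Error vector (quantized value minus original value) for a uniform quantizer."""
    return [rounding(v / scale) * scale - v for v in values]


# Hypothetical layer input and gradient of the loss with respect to that input.
layer_input = [0.137, 0.482, 0.915, 0.268]
grad = [0.30, 0.10, 0.25, 0.05]
scale = 0.05

rounding = choose_rounding(grad)
eps = quantization_error(layer_input, scale, rounding)
first_order_change = sum(g * e for g, e in zip(grad, eps))
```

With gradient entries of mixed sign, only the expectation of the inner product is guaranteed to have the desired sign, which is why the analysis is stated in terms of $Ee$ and $Ep$.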

4.2.2. REPLACE GRADIENT WITH SECANT LINE SLOPE

Although the elements of the vector $\varepsilon$ are not small enough to use the total differential directly, they are still small. For example, when using INT8 to quantize the res14 model without identity mapping, each element of $\varepsilon$ is less than 0.01. This fact shows that $\varepsilon_i * \varepsilon_i$ is small and has only a tiny influence on the final loss function. Thus, we can use the slope of the secant line to replace the gradient in the total differential, as shown in figure 1. When the noise is close to zero, both the layer's parameters and the layer's input are quantized at this quantization level to reduce computation resources. In the algorithm design, we can use $error_{min}$ to control this case.
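A one-dimensional sketch shows why the secant slope predicts the loss change better than the tangent slope once the noise leaves the immediate neighborhood. The loss profile `loss_along_noise` is a hypothetical function chosen for illustration, and the two secant helpers mirror the one-sided slopes used for the positive and negative directions.

```python
import math


def secant_plus(f, x0, delta):
    """Slope of the secant line from x0 to x0 + delta."""
    return (f(x0 + delta) - f(x0)) / delta


def secant_minus(f, x0, delta):
    """Slope of the secant line from x0 - delta to x0."""
    return (f(x0) - f(x0 - delta)) / delta


def loss_along_noise(t):
    """Hypothetical 1-D loss profile along one noise direction."""
    return math.cosh(3.0 * t) - t


x0, delta, x_p = 0.0, 0.1, 0.08            # delta plays the role of the maximum noise
tangent_slope = -1.0                       # exact derivative 3*sinh(3t) - 1 at t = 0
true_change = loss_along_noise(x_p) - loss_along_noise(x0)

tangent_error = abs(tangent_slope * (x_p - x0) - true_change)
secant_error = abs(secant_plus(loss_along_noise, x0, delta) * (x_p - x0) - true_change)
```

In this example the secant prediction is several times closer to the true change than the tangent prediction, and the two one-sided secant slopes differ, so the direction of the noise has to be taken into account.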

4.2.3. THE PROBABILITY OF GETTING A BETTER MODEL

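To bound the probability of getting a non-negative $\frac{\partial \ell}{\partial h_{i+1}} \cdot \varepsilon$, we use Chebyshev's theorem and obtain the following Eq. 7.

$$P\left(\frac{\partial \ell}{\partial h_{i+1}} \cdot \varepsilon \ge 0\right) < P\left(\left|\frac{\partial \ell}{\partial h_{i+1}} \cdot \varepsilon - E\,kep\right| \ge \left|E\,kep\right|\right) \le \frac{Var(kep)}{(E\,kep)^2} = \frac{Var(e)Var(p)}{(Ee\,Ep)^2} + \frac{Var(e)}{(Ee)^2} + \frac{Var(p)}{(Ep)^2} \quad (7)$$

Based on Eq. 7, for a layer whose $\frac{\partial \ell}{\partial h_{i+1}} \cdot \mathbf{1}$ is large and whose $Var(p)$ is small, we can use a high quantization level and still obtain, with high probability, a model that is better than the full-precision model. To guarantee that the success probability is high, we introduce an algorithm parameter µ: the algorithm quantizes Layer $i$ only when $Ep > µ$.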
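The right-hand side of the bound is easy to evaluate once the first two moments of $e$ and $p$ are known. The sketch below assumes that the round-down error is uniform on $[-\Delta, 0]$ and that $p$ is Gaussian; both distributions and all numeric values are assumptions for illustration, and the Monte Carlo check considers a single product $e * p$.

```python
import random


def chebyshev_bound(mean_e, var_e, mean_p, var_p):
    """Right-hand side of the Chebyshev bound for independent e and p."""
    return (var_e * var_p / (mean_e * mean_p) ** 2
            + var_e / mean_e ** 2
            + var_p / mean_p ** 2)


delta = 0.01
mean_e, var_e = -delta / 2, delta ** 2 / 12    # round-down error, assumed uniform on [-delta, 0]
mean_p, std_p = 1.0, 0.2                       # hypothetical gradient statistics

bound = chebyshev_bound(mean_e, var_e, mean_p, std_p ** 2)

# Empirical frequency of a non-negative product e * p under the same assumptions.
rng = random.Random(0)
trials = 20000
non_negative = sum(
    1 for _ in range(trials)
    if rng.uniform(-delta, 0.0) * rng.gauss(mean_p, std_p) >= 0.0
)
frequency = non_negative / trials
```

The bound shrinks when $|Ep|$ grows relative to the spread of $p$, which matches the role of the threshold µ; the empirical frequency is typically far below the bound because Chebyshev's inequality is loose.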

4.3. ALGORITHM DESCRIPTION

Based on the above map between analysis and engineering, we obtain algorithm 1. Algorithm 1 is a radical probabilistic algorithm: it tries high quantization levels first in order to obtain a small quantized model. Under an appropriate setting of µ, algorithm 1 gives a better model with high probability.

5. THE LIMITATION OF ALGORITHM 1 AND MODEL ROBUSTNESS

Although algorithm 1 provides a better model, in experiments we find that it is hard to gain significant improvements on ResNet50 / ResNet101, which suggests that our algorithm has limitations. To show that this limitation is rooted in the model properties, we prove a stronger conclusion in this section.

The neural network is described under the probably approximately correct (PAC) learning framework Denilson & Barbosa (2016). A neural network hypothesis class $H$ consists of the neural networks that share the same structure. The learning algorithms, $A$, are SGD and its variants for the neural network hypothesis class. Identity mapping is when the input to some layer is passed directly, or as a shortcut, to some other layer. Neural networks that mainly consist of identity mappings, like ResNet or DenseNet, have succeeded in the CV area. Then, we can obtain the following propositions.

Proposition 1 There is a set of functions $G$. For any random variable vector $x$ and any random variable vector $y$, $\exists g \in G$ which satisfies $E g(x) \cdot E y \le 0$ and $g(x)$ belongs to the neighborhood of 0.

Brief proof: From the analysis in algorithm 1, we can find an appropriate $E\varepsilon$ such that $E \frac{\partial \ell}{\partial h_{i+1}} \cdot \varepsilon \le 0$. We can use $g \in G$ to replace $\varepsilon$. Then, proposition 1 is proved, which is also shown in figure 3. The set that consists of Relu(Conv(·)) satisfies the requirements of $G$.

Proposition 1 tells us how to structure a deep residual network. Repeatedly using proposition 1 and retraining the new model shows that, for neural networks consisting of residual blocks like ResNet, the deeper, the better. It is shown in figure 4.

$$\ldots \cdot \ldots = E_{sample} L(model^*_n, sample) - E_{sample} L(model_{n+1}, sample) < E_{sample} L(model^*_n, sample) - E_{sample} L(model^*_{n+1}, sample) = 0 \quad (8)$$

Because $i$ and $E\varepsilon$ can be chosen at random, we can tell that $E \frac{\partial \ell}{\partial h_{i+1}}$ is zero or very close to zero. Proposition 2 gives one criterion for a residual network to be SOTA. Then, we can prove the following theorem 1.

Theorem 1 When quantization noise is under the concept of neighborhood, SOTA or near-SOTA residual networks on a dataset exhibit high noise robustness.

Brief proof: Based on Eq. 5 and proposition 2, we know

$$\bar{f}(w) - f(w) = \sum_{i=1}^{n} E \frac{\partial \ell}{\partial h_{i+1}} \cdot \varepsilon_i \le \sum_{i=1}^{n} E \frac{\partial \ell}{\partial h_{i+1}}\, E \varepsilon_i = 0 \quad (9)$$

which means that the noise has only a higher-order infinitesimal influence, i.e., $o(noise)$. Thus, SOTA or near-SOTA residual networks on a dataset exhibit high noise robustness.

Theorem 1 shows that the robustness becomes stronger as the number of layers and identity mappings increases, when quantization noise is under the concept of neighborhood. Based on Theorem 1, we obtain the following corollary.

Corollary 1 When the quantization noise is small, algorithm 1 cannot improve the performance of a SOTA or near-SOTA model by much.

6. EXPERIMENT

In this section, we evaluate the performance of algorithm 1. Our objective is to show that the quantized model gained by algorithm 1 is better than the full precision model without "fine-tuning" technology.

6.1.1. DATASET AND MODEL

We use the MNIST, CIFAR 10, CIFAR 100, and ImageNet-100 datasets. The calibration and training sets are split from the original training dataset, and the calibration dataset has the same size as the test dataset. On MNIST, we employ a DNN as a benchmark, in accordance with the work of Sakr et al. (2017): the model has a 784-512-256-128-64-10 design with a ReLU layer between consecutive layers. We apply ResNet8/14 and VGG11/13 to the CIFAR 10 dataset. We employ VGG13, ResNet34, and MobileNet on the CIFAR 100 dataset, and MobileNet on ImageNet-100. We remove the identity mapping structure from the ResNet models in our experiments to magnify the experimental effects.



¹ We can also treat $p_i$ as random variables with different distributions, or directly use the vector $E \frac{\partial \ell}{\partial h_{i+1}}$ in the following analyses. The conclusions are the same as, or close to, those of the current analysis.



Model compression methods include pruning methods Han et al. (2015); Li et al. (2016); Mao et al. (2017), knowledge distillation Hinton et al. (2015), weight sharing Ullrich et al. (2017) and quantization methods. From the perspective of the precision layout, post-training quantization methods can be mainly divided into channelwise Li et al. (2019); Qian et al. (2020), groupwise Dong et al. (2019b) and layerwise Dong et al. (2019a) methods. Layerwise mixed-precision layout schemes are friendlier to hardware: parameters of the same precision are organized together, making full use of a program's temporal and spatial locality. Some works analyze the relationship between the best quantization of a layer's weights and that of its input Sakr et al. (2017); Sakr & Shanbhag (2018). But in current computation architectures, the quantization level for the weights and the input should be the same. A common problem definition for quantization Dong et al. (2019a); Morgan et al. (1991); Courbariaux et al. (2015); Yao et al. (2020) is as follows Gholami et al. (2021).

inference are complex. Different algorithms use different assumptions to solve the problem. Most of them pay much attention to the noise on the parameters of the NN Dong et al. (2019a); Yao et al. (2020); Gholami et al. (2021); Nagel et al. (2020).

One question is why $\sum_{(x_j, y_j) \in D} \frac{\partial \ell}{\partial w}$ is zero but $\sum_{(x_i, y_i) \in D} \frac{\partial \ell}{\partial h_{i+1}}$ is non-zero. The optimization algorithm optimizes $w$ in the training process. Thus, $\sum_{(x_i, y_i) \in D} \frac{\partial \ell}{\partial h_{i+1}}$ is random in the final model, except for layers with bias terms, like the batch norm layer: the bias term absorbs this gradient, and it is trained in the optimization process. What is more, in a model that mainly consists of identity mappings, $\sum_{(x_i, y_i) \in D} \frac{\partial \ell}{\partial h_{i+1}}$ is close to the zero vector, and we will show this in the next chapter. Our problem setting for quantization is different from previous works like HAWQ Dong et al. (2019a); Yao et al. (2020); Dong et al. (2019b); Nagel et al. (2020).

Figure 2: Different directions have different secant lines in the prediction process.

Figure 3: Proof of proposition 1

For a well-trained neural network $model^*_n \in H_n$ obtained by learning algorithm $A$, there exists a $model_{n+1} \in H_{n+1}$ which is slightly better than $model^*_n$. The difference between $H_n$ and $H_{n+1}$ is that $model_{n+1} \in H_{n+1}$ has one more residual block than $model_n \in H_n$, and the function in the residual block is in $G$.

Using proposition 1 in different places, we can get different networks, like ResNet or DenseNet. Based on the construction process of proposition 1, we can prove the following proposition 2.

Proposition 2 For a dataset's SOTA or close-to-SOTA residual network, all $E \frac{\partial \ell}{\partial h_{i+1}}$ are close to zero.

Brief proof: The SOTA model implies that adding new layers will not improve model performance, i.e., for well-trained $model^*_n \in H_n$ and well-trained $model^*_{n+1} \in H_{n+1}$, $E_{sample} L(model^*_n, sample) - E_{sample} L(model^*_{n+1}, sample) = 0$. So for any $i$ and any appropriate $E\varepsilon$, we have the following Eq. 8.

annex

Algorithm 1: Radical Mixed-Precision Inference Layout Scheme
Input: Neural network M, quantization levels [q_1, q_2, ..., q_n], error_min, error_max, µ, calibration dataset D
Output: Quantized neural network M
Arrange Q = [q_1, q_2, ..., q_n] in ascending order Q = [q_i1, q_i2, ..., q_in] based on the size of parameters under q_i; // For example, Q=[INT8, INT4, INT16] becomes Q=[INT4, INT8, INT16]
for q_i in Q do
    for Layer_i in M do
        if Layer_i is quantized then
            continue
        end
        Compute the error ∆ of h_{i+1} under the q_i quantization level on D
        if ∆ < error_min then
            quantize Layer_i's parameters and input by the rounding method at the q_i quantization level
        ...
            quantize Layer_i's input by the round down method at the q_i quantization level
        ...
            quantize Layer_i's input by the round up method at the q_i quantization level
        end
        if Choice != 0 then
            quantize Layer_i's parameters by the rounding method at the q_i quantization level
        end
    end
end
return M

Because the secant line differs with the direction, as shown in figure 2, we have to define $secant_+(h_i, \Delta)$, $\Delta \in R^1_+$, and $secant_-(h_i, \Delta)$, $\Delta \in R^1_+$. $\Delta$ is the maximum error introduced by quantization. For example, the scale parameter in the example of Section 3.1 is the maximum error introduced by quantization. In the algorithm, we use $secant_\pm(\cdot)$ to replace $\frac{\partial \ell}{\partial h_{i+1}}^{\pm} \cdot \mathbf{1}$. We use this definition because, compared to computing by the definition of the secant, the secant function is easy to compute.

Although we know empirically that the elements of the vector $\varepsilon$ are less than 0.01, we still have to include a mechanism in the real algorithm design that keeps the analysis valid in practice. Thus, we set a value $error_{max}$ that is small enough with respect to the final loss function. When $\Delta > error_{max}$, we can choose a quantization level with more bits, or full precision, for this layer. When $\varepsilon$ is close to zero, i.e., we use a quantization level with more bits.
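The control flow of the layout scheme can be sketched as follows. The branch conditions for the round down and round up cases are not recoverable from the algorithm listing, so the rule used in `choose_mode` (round down when the negative-direction secant exceeds µ, round up when the positive-direction secant is below -µ) is an assumption inferred from the surrounding analysis, not the paper's stated rule. The data layout, the function names and all statistics are hypothetical.

```python
def choose_mode(delta, secant_plus, secant_minus, error_min, error_max, mu):
    """Return the rounding mode for a layer's input, or None to leave it unquantized."""
    if delta > error_max:
        return None                # noise too large at this quantization level
    if delta < error_min:
        return "round"             # negligible noise: plain rounding
    if secant_minus > mu:
        return "floor"             # assumed rule: negative errors lower the loss
    if secant_plus < -mu:
        return "ceil"              # assumed rule: positive errors lower the loss
    return None


def layout(layers, levels, error_min, error_max, mu):
    """layers maps a layer name to {bits: (delta, secant_plus, secant_minus)}."""
    plan = {}
    for bits in sorted(levels):            # try the smallest storage size first
        for name, stats in layers.items():
            if name in plan:               # layer already quantized
                continue
            mode = choose_mode(*stats[bits], error_min, error_max, mu)
            if mode is not None:
                plan[name] = (bits, mode)
    return plan


# Hypothetical calibration statistics per layer and bit width.
layers = {
    "conv1": {4: (0.25, 0.9, 0.8), 8: (0.015, 0.9, 0.8)},
    "conv2": {4: (0.04, -0.7, -0.6), 8: (0.003, -0.7, -0.6)},
    "fc": {4: (0.06, 0.1, 0.2), 8: (0.004, 0.1, 0.2)},
}
plan = layout(layers, levels=[8, 4], error_min=0.0, error_max=0.1, mu=0.4)
```

Layers that never satisfy a condition, such as `fc` in this example, stay in full precision, which reflects the role of µ as a guard on the success probability.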

6.1.2. ALGORITHM SETTING

Before the quantization process, we run the whole calibration dataset in full precision, find the min and max values of each layer's input over the dataset, and compute ∆ from them. We use this setting because we want to enlarge the noise and obtain clear experimental results. In the CIFAR 10 VGG experiments, under this setting, we cannot find an appropriate layer to quantize because all ∆s are large or the Choice variables are zero. Thus, we use the min/max of the current quantization vector, as in HAWQ's Dong et al. (2019a) experiments, to compute ∆. To obtain a high-performance model as radically as possible, we set $error_{min} = 0$ and $error_{max} = 0.1$. For the CIFAR100 and ImageNet-100 experiments, we use µ = 0.6; for the MNIST and CIFAR10 datasets, µ is 0.4. The value of µ was adjusted during our experiments.
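The calibration step can be sketched as below. The sketch assumes that ∆ equals the step of a uniform quantizer over the observed min/max range, which is one plausible reading of "the scale parameter is the maximum error"; the function names and the activation values are hypothetical, while the threshold 0.1 is the $error_{max}$ value from the setting above.

```python
def max_quantization_error(activations, bits):
    """Step size of a uniform quantizer over [min, max], used here as the error bound."""
    lo, hi = min(activations), max(activations)
    return (hi - lo) / (2 ** bits - 1)


def pick_level(activations, levels, error_max):
    """Lowest bit width whose error bound stays within error_max, else None."""
    for bits in sorted(levels):
        if max_quantization_error(activations, bits) <= error_max:
            return bits
    return None                    # keep this layer in full precision


# Hypothetical layer inputs collected on the calibration dataset.
calibration_inputs = [-1.2, -0.4, 0.0, 0.7, 1.9, 2.55]
chosen_bits = pick_level(calibration_inputs, levels=[8, 4, 16], error_max=0.1)

# A layer with a very wide input range stays unquantized at these levels.
wide_inputs = [-100.0, 100.0]
wide_choice = pick_level(wide_inputs, levels=[4, 8], error_max=0.1)
```

Using the min and max over the whole calibration dataset gives a wide range and therefore a large ∆, which is consistent with the remark that this setting enlarges the noise.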

6.2. EXPERIMENTAL RESULTS

In this part, we also report the range of the noise introduced into the input of a layer by the different quantization levels. In our experiments, we find that the INT8 quantization level brings noise of less than 1e-2; only a few layers exceed this level (usually staying below 5e-1). For INT4, the quantization noise is less than 5e-2, and a few layers exceed 1; the layers whose noise is larger than 1 should be left out of quantization at this level. In our algorithm, we substitute the secant line for the gradient in Eq. 5, which expands the applicability of Fig. 5 beyond the mathematical neighborhood concept into the 1e-2 ball. In fact, we find that the gradient can be used in Eq. 5 only when the noise is less than 3e-4. However, the secant line fails in some cases when the noise is larger than 5e-2, which is smaller than the noise introduced by INT4. At the INT8 quantization level, our algorithm can find a model with a lower loss function value than the full-precision model.

7. CONCLUSION

This paper shows that quantization can improve a model's performance, i.e., yield a lower loss. Moreover, based on our analysis, we propose a Radical Mixed-Precision Inference Layout Scheme, which can produce a quantized model that is better than the full-precision model. We also show that residual networks are very resistant to noise, which means that the performance of a SOTA residual network is stable under any quantization algorithm.

