MIXED-PRECISION INFERENCE QUANTIZATION: PROBLEM RESETTING AND TRADITIONAL NP-HARD PROBLEM

Abstract

Model quantization, which exploits a model's resilience to computational noise, is important for compressing models and improving computation speed. Existing quantization techniques rely heavily on experience and "fine-tuning" skills. In this paper, we map the mixed-precision layout problem onto a traditional NP-hard problem, which can then be solved by low-cost methods such as branch and bound without "fine-tuning". Experimental results show that our method outperforms HAWQ-v2, one of the current SOTA methods for the mixed-precision layout problem.

1. INTRODUCTION

Neural network storage, inference, and training are computationally intensive due to the massive parameter counts of modern networks. Developing compression algorithms for machine learning models is therefore necessary. Model quantization, which builds on a model's robustness to computational noise, is one of the most important compression techniques. Computational noise robustness measures an algorithm's performance when noise is added during the computation process; the primary sources of noise are truncation and data-type conversion errors. During model quantization, the high-precision data type originally used for a model's parameters is replaced with a lower-precision data type. It is typical to replace FP32 with FP16, and both PyTorch and TensorFlow provide quantization techniques that map floats to integers. Various quantization methods share the same theoretical foundation: substituting approximate data for the original data in the storage and inference processes. A lower-precision data format requires less memory, and computing with lower-precision data requires fewer resources and less time. In quantization, the precision loss incurred by quantization-level and data-type conversions is the source of the noise. The primary issue with model quantization is that a naive quantization scheme is likely to raise the loss function: massive-scale model parameters cannot be replaced with very low-precision data without a significant loss of accuracy. Nor is it possible to use the same quantization level for all parameters, i.e., to introduce the same level of noise everywhere, and still obtain good performance. Mixed-precision quantization is one way to address this issue: higher-precision data is used for the more "sensitive" model parameters, and lower-precision data for the "nonsensitive" ones.
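The conversion noise mentioned above can be observed directly. The following sketch, using NumPy and randomly drawn placeholder "weights" (not from any real model), measures the error introduced by casting FP32 values to FP16 and back:

```python
import numpy as np

# Hypothetical FP32 "weights", drawn at random purely for illustration.
rng = np.random.default_rng(0)
w_fp32 = rng.standard_normal(1000).astype(np.float32)

# Casting to FP16 and back models the data-type conversion step.
w_fp16 = w_fp32.astype(np.float16)
noise = w_fp32 - w_fp16.astype(np.float32)

# FP16 keeps a 10-bit mantissa, so for values of order 1 the absolute
# rounding error is small but nonzero.
print(np.any(noise != 0), float(np.max(np.abs(noise))))
```

The noise is nonzero for almost every element, which is exactly the computational noise that quantization-robust models must tolerate.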
Higher-precision data means that only small noise is added to the original data, while lower-precision data means that large noise is added. But mixed-precision quantization also has its restrictions: in a single computing process, for example, a GPU or CPU must use the same data type. Moreover, current mixed-precision algorithms face the following challenges: 1. Many algorithms are built on empirical experience and "tuning" skills. 2. Some algorithms forgo analysis of the neural network and dataset and instead base model quantization on hardware features; for these, it is impossible to bound the algorithm's performance. 3. Some algorithms utilize Hessian information. Most of these are analyzable, but obtaining Hessian information requires considerable computing resources and time, and some of these methods are useful only for storage purposes. The consensus among researchers is that quantization without "fine-tuning" is harmful to model performance. Current quantization technology widely uses "fine-tuning", but no one can explain why a given algorithm fails without it, even when its mathematical basis is well defined. This paper establishes the model quantization problem setting for the inference process. Based on our analysis, we map the layerwise post-training quantization (PTQ) problem onto a traditional NP-hard problem, the extended 0-1 knapsack problem, which can be solved by the branch and bound method without excessive computing resources and without inferring or training the model many times. Our work thus dispenses with "fine-tuning" and has clear interpretability. Compared with HAWQ-v2, a SOTA mixed-precision algorithm, the quantized model produced by our method without "fine-tuning" performs better under the same computational resource limitation. The objective of quantization is commonly formalized as Problem 1 below.
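The knapsack-style search described above can be sketched as follows. This is a generic textbook branch-and-bound routine, not the paper's exact formulation: every per-layer (size, error) entry below is an invented placeholder standing in for measured sensitivities. Each layer picks one bit-width, the total size must fit a budget, and the summed quantization error is minimized.

```python
# Hypothetical per-layer data: for each candidate bit-width, a pair
# (memory cost, sensitivity-weighted error). Values are illustrative only.
LAYERS = [
    {2: (1.0, 9.0), 4: (2.0, 3.0), 8: (4.0, 0.5)},  # layer 0
    {2: (2.0, 7.0), 4: (4.0, 2.0), 8: (8.0, 0.3)},  # layer 1
    {2: (1.5, 5.0), 4: (3.0, 1.5), 8: (6.0, 0.2)},  # layer 2
]
BUDGET = 9.0  # total size budget

def branch_and_bound(layers, budget):
    """Assign one bit-width per layer, minimizing total error under a size budget."""
    n = len(layers)
    # Optimistic bounds: cheapest possible error / smallest possible size
    # for the remaining layers, used to prune hopeless branches.
    min_err_suffix = [0.0] * (n + 1)
    min_size_suffix = [0.0] * (n + 1)
    for i in range(n - 1, -1, -1):
        min_err_suffix[i] = min_err_suffix[i + 1] + min(e for _, e in layers[i].values())
        min_size_suffix[i] = min_size_suffix[i + 1] + min(s for s, _ in layers[i].values())
    best = [float("inf"), None]  # incumbent error, incumbent layout

    def dfs(i, size, err, layout):
        if err + min_err_suffix[i] >= best[0]:
            return  # bound: cannot beat the incumbent
        if size + min_size_suffix[i] > budget:
            return  # infeasible: even the smallest completion overflows
        if i == n:
            best[0], best[1] = err, tuple(layout)
            return
        for bits, (s, e) in layers[i].items():
            if size + s <= budget:
                dfs(i + 1, size + s, err + e, layout + [bits])

    dfs(0, 0.0, 0.0, [])
    return best[1], best[0]

layout, error = branch_and_bound(LAYERS, BUDGET)
print(layout, error)
```

With these placeholder numbers the search settles on a uniform mid-precision layout; in the paper's setting the (size, error) entries would come from sensitivity measurements rather than constants, and the pruning bounds are what keep the cost far below enumerating all layouts.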

2. RELATED WORK

Problem 1. $\min_{q \in Q} \|q(w) - w\|^2$, where $q$ is the quantization scheme, $Q$ is the set of admissible schemes, $q(w)$ is the quantized model under scheme $q$, and $w$ represents the weights, i.e., the parameters, of the neural network. Although Problem 1 gives researchers a target to aim for when performing quantization, this problem definition has two shortcomings: 1. The search space of all possible mixed-precision layout schemes is a discrete space that is exponentially large in the number of layers, and there is no effective method for solving the corresponding search problem. 2. There is a gap between the problem target and the final task target: no term related to the final task, such as the loss function or accuracy, appears in the problem definition.
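Both shortcomings can be made concrete with a toy sketch. The weights below are random placeholders and the two candidate bit-widths are arbitrary; the code enumerates every layerwise layout and evaluates the weight-space objective. The number of layouts grows as 2^L, and because the objective never references the task loss or any resource constraint, the unconstrained minimizer is trivially the all-highest-precision layout.

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(1)
# Hypothetical weights of a 4-layer network (illustrative only).
weights = [rng.standard_normal(64).astype(np.float32) for _ in range(4)]

def quantize(w, bits):
    """Uniform symmetric quantization of w to the given bit-width."""
    scale = np.max(np.abs(w)) / (2 ** (bits - 1) - 1)
    return np.round(w / scale) * scale

# Each layer independently chooses 4 or 8 bits: 2**L layouts in total.
layouts = list(product([4, 8], repeat=len(weights)))
errors = {
    layout: sum(float(np.sum((quantize(w, b) - w) ** 2))
                for w, b in zip(weights, layout))
    for layout in layouts
}
best = min(errors, key=errors.get)
print(len(layouts), best)
```

The brute-force enumeration already needs 2^4 = 16 evaluations for four layers, and the winning layout is simply 8 bits everywhere, which is why a useful formulation must add a resource constraint and connect the objective to the task target.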

3.1. MODEL COMPUTATION, NOISE GENERATION AND QUANTIZATION

Compressed models for the inference process are computed using different methods depending on the hardware, programming methods, and deep learning framework, and all of these methods introduce noise into the computing process. One reason is that although it is common practice to store and compute model parameters using different data types, only data of the same precision can support precise computation. Therefore, before performing computations on nonuniform data, a computer converts them to the same data type. Usually, in a standard computing environment, a lower-precision data type is converted into a higher-precision one; this ensures that the results are correct but requires more computational resources and time. However, to accelerate computation, some works on artificial intelligence (AI) computing propose converting higher-precision data types into lower-precision ones, based on the premise that AI models are not sensitive to compression noise. The most commonly used quantization technique converts data directly, linearly mapping a higher-precision data type onto a lower-precision one.
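The linear float-to-integer mapping described above can be sketched as follows. This is the common scale/zero-point affine scheme; real frameworks differ in rounding and clamping details, and the helper names here are our own:

```python
import numpy as np

def linear_quantize(x, num_bits=8):
    """Affine map of float32 values onto unsigned integers of num_bits."""
    qmin, qmax = 0, 2 ** num_bits - 1
    scale = (x.max() - x.min()) / (qmax - qmin)
    zero_point = int(round(qmin - x.min() / scale))
    q = np.clip(np.round(x / scale) + zero_point, qmin, qmax).astype(np.uint8)
    return q, scale, zero_point

def linear_dequantize(q, scale, zero_point):
    """Approximate reconstruction of the original floats."""
    return scale * (q.astype(np.float32) - zero_point)

x = np.linspace(-1.0, 1.0, 101, dtype=np.float32)
q, scale, zp = linear_quantize(x)
x_hat = linear_dequantize(q, scale, zp)
max_err = float(np.max(np.abs(x - x_hat)))
print(max_err < scale)  # reconstruction error stays below one quantization step
```

The reconstruction error per element is bounded by the quantization step `scale`; this per-element precision loss is exactly the noise source discussed above.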



Model compression methods include pruning methods Han et al. (2015); Li et al. (2016); Mao et al. (2017), knowledge distillation Hinton et al. (2015), weight sharing Ullrich et al. (2017), and quantization methods. From the perspective of the precision layout, post-training quantization methods can be mainly divided into channelwise Li et al. (2019); Qian et al. (2020), groupwise Dong et al. (2019b), and layerwise Dong et al. (2019a) methods. Layerwise mixed-precision layout schemes are more hardware friendly: parameters of the same precision are organized together, making full use of a program's temporal and spatial locality. A common problem definition for quantization Dong et al. (2019a); Morgan et al. (1991); Courbariaux et al. (2015); Yao et al. (2020) is as follows Gholami et al. (2021).

