MIXED-PRECISION INFERENCE QUANTIZATION: PROBLEM RESETTING AND TRADITIONAL NP-HARD PROBLEM

Abstract

Model quantization, which exploits a model's resilience to computational noise, is important for compressing models and improving computing speed. Existing quantization techniques rely heavily on experience and "fine-tuning" skills. In this paper, we map the mixed-precision layout problem to a traditional NP-hard problem, which can then be solved by low-cost methods such as branch and bound, without "fine-tuning". Experimental results show that our method outperforms HAWQ-V2, one of the current SOTA methods for the mixed-precision layout problem.

1. INTRODUCTION

Neural network storage, inference, and training are computationally intensive due to the massive parameter counts of modern networks. Developing compression algorithms for machine learning models is therefore necessary. Model quantization, which relies on robustness to computational noise, is one of the most important compression techniques. Computational noise robustness measures an algorithm's performance when noise is added during the computation process; the primary sources of this noise are truncation and data type conversion errors. During model quantization, the high-precision data type originally used for a model's parameters is replaced with a lower-precision one. Replacing FP32 with FP16 is typical, and both PyTorch and TensorFlow provide quantization techniques that map floats to integers. These methods share the same theoretical foundation: approximate data is substituted for the original data during storage and inference. A lower-precision data format requires less memory, and computing with lower-precision data requires fewer resources and less time. In quantization, the precision lost in quantization-level and data type conversions is the source of the noise.

The primary issue with model quantization is that a naive quantization scheme is likely to raise the loss function. Massive-scale model parameters cannot be replaced with extremely low-precision data without a significant loss of accuracy. Nor can the same quantization level be applied to all model parameters, i.e., the same level of noise introduced everywhere, while retaining good performance. Mixed-precision quantization is one way to address this issue: higher-precision data is used for the more "sensitive" model parameters, while lower-precision data is used for the "insensitive" ones.
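As a minimal illustration of the float-to-integer mapping mentioned above, the sketch below implements a generic affine int8 quantization scheme (an assumption for illustration, not the specific PyTorch or TensorFlow implementation): each float is mapped to an 8-bit integer via a scale and zero point, and the round-trip error is exactly the "computational noise" the quantized model must tolerate. The function names `quantize_int8` and `dequantize` are hypothetical.

```python
import numpy as np

def quantize_int8(x):
    """Affine (asymmetric) quantization of a float array to int8.

    The original values are replaced by a low-precision approximation;
    the rounding error is the noise added to the computation.
    """
    lo, hi = float(x.min()), float(x.max())
    scale = (hi - lo) / 255.0 if hi > lo else 1.0
    zero_point = np.round(-lo / scale) - 128.0
    q = np.clip(np.round(x / scale + zero_point), -128, 127).astype(np.int8)
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    """Recover the low-precision approximation of the original floats."""
    return (q.astype(np.float32) - zero_point) * scale

rng = np.random.default_rng(0)
w = rng.normal(size=1000).astype(np.float32)   # stand-in for a weight tensor
q, s, z = quantize_int8(w)
noise = np.abs(dequantize(q, s, z) - w).max()  # worst-case quantization noise
```

The worst-case noise is bounded by half a quantization step (`0.5 * scale`), which is why coarser data types, with larger steps, inject larger noise into the inference computation.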
Here, higher-precision data means the original data is perturbed by small noise, while lower-precision data means it is perturbed by large noise. But mixed-precision quantization also has its restrictions: within a single computing process, for example, a GPU or CPU must use a single data type. Moreover, current mixed-precision algorithms face the following challenges: 1. These algorithms are built on empirical experience and "tuning" skills. 2. Some algorithms forgo analysis of the neural network and dataset, while others base model quantization on hardware features; for such algorithms, performance bounds cannot be shown. 3. Some algorithms utilize Hessian information, and most of these are analyzable. However, obtaining Hessian information necessitates

