FIT: A METRIC FOR MODEL SENSITIVITY

Abstract

Model compression is vital to the deployment of deep learning on edge devices. Low-precision representations, achieved via quantization of weights and activations, can reduce inference time and memory requirements. However, quantifying and predicting the response of a model to the changes associated with this procedure remains challenging. This response is non-linear and heterogeneous throughout the network. Understanding which groups of parameters and activations are more sensitive to quantization than others is a critical stage in maximizing efficiency. For this purpose, we propose FIT. Motivated by an information-geometric perspective, FIT combines the Fisher information with a model of quantization. We find that FIT can estimate the final performance of a network without retraining. FIT effectively fuses contributions from both parameter and activation quantization into a single metric. Additionally, FIT is fast to compute when compared to existing methods, demonstrating favourable convergence properties. These properties are validated experimentally across hundreds of quantization configurations, with a focus on layer-wise mixed-precision quantization.
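To give a concrete sense of the kind of quantity FIT describes, the sketch below accumulates a Fisher-weighted quantization perturbation per parameter tensor. It is an illustration only, not the paper's exact formulation: the diagonal empirical Fisher (squared gradients) is an assumed approximation, and `model`, `loss_fn`, `calib_loader` and `quantize` are placeholder names for the user's own network, loss, calibration data and quantizer.

```python
# Illustrative sketch (not the paper's exact definition of FIT): a
# Fisher-weighted quantization sensitivity score, using squared gradients as
# a diagonal empirical-Fisher approximation. `model`, `loss_fn`,
# `calib_loader` and `quantize` are placeholders for the user's own components.
import torch


def fisher_weighted_sensitivity(model, loss_fn, calib_loader, quantize, n_batches=8):
    """Accumulate sum_i F_ii * (w_i - Q(w_i))^2 for each parameter tensor."""
    scores = {name: 0.0 for name, _ in model.named_parameters()}
    seen = 0
    model.eval()
    for x, y in calib_loader:
        if seen == n_batches:
            break
        model.zero_grad()
        loss_fn(model(x), y).backward()
        for name, p in model.named_parameters():
            if p.grad is None:
                continue
            delta_sq = (p.detach() - quantize(p.detach())) ** 2  # quantization perturbation
            fisher_diag = p.grad.detach() ** 2                   # diagonal empirical Fisher
            scores[name] += (fisher_diag * delta_sq).sum().item()
        seen += 1
    # Average over the calibration batches actually used.
    return {name: s / max(seen, 1) for name, s in scores.items()}
```

Under this reading, parameter groups with a large score are expected to be more sensitive to quantization, which is the kind of ranking used to assign bit-widths in mixed-precision schemes.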

1. INTRODUCTION

The computational costs and memory footprints associated with deep neural networks (DNNs) hamper their deployment to resource-constrained environments such as mobile devices (Ignatov et al., 2018), self-driving cars (Liu et al., 2019), or high-energy physics experiments (Coelho et al., 2021). Latency, storage and even environmental limitations directly conflict with the current machine learning regime of performance improvement through scale. For deep learning practitioners, adhering to these strict requirements whilst implementing state-of-the-art solutions is a constant challenge. As a result, compression methods such as quantization (Gray & Neuhoff, 1998) and pruning (Janowsky, 1989) have become essential stages in deployment on the edge. In this paper, we focus on quantization.

Quantization refers to the use of lower-precision representations for values within the network, such as weights, activations and even gradients. This could, for example, involve reducing values stored in 32-bit floating-point (FP) single precision to INT-8/4/2 integer precision (IP). This reduces memory requirements whilst allowing models to meet strict latency and energy-consumption criteria on high-performance hardware such as FPGAs.

Despite these benefits, there is a trade-off associated with quantization. As full-precision computation is approximated with less precise representations, the model often incurs a drop in performance. In practice, this trade-off is worthwhile for resource-constrained applications. However, the DNN performance degradation associated with quantization can become unacceptable under aggressive schemes, where post-training quantization (PTQ) to 8 bits and below is applied to the whole network (Jacob et al., 2018). Quantization Aware Training (QAT) (Jacob et al., 2018) is often used to recover lost performance; however, even after QAT, aggressive quantization may still result in a large performance drop. Model performance is limited by sub-optimal quantization schemes. It is known that different layers, within different architectures, respond differently to quantization (Wu et al., 2018). Akin to how more detailed regions of images are more challenging to compress, as
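To make the quantize-dequantize step described above concrete, the following is a minimal sketch of simulated ("fake") symmetric uniform quantization of a weight tensor, as commonly used in post-training quantization. The per-tensor max-abs scale and the random tensor are illustrative assumptions, not the scheme used in this paper's experiments.

```python
# Minimal sketch of simulated ("fake") symmetric uniform quantization: FP32
# values are rounded to a b-bit integer grid and mapped back, so the
# perturbation introduced by low precision can be inspected in floating point.
# The per-tensor max-abs scale is an illustrative choice.
import torch


def fake_quantize(w: torch.Tensor, bits: int = 8) -> torch.Tensor:
    qmax = 2 ** (bits - 1) - 1             # e.g. 127 for INT8
    scale = w.abs().max() / qmax           # per-tensor symmetric scale
    q = torch.clamp(torch.round(w / scale), min=-qmax - 1, max=qmax)
    return q * scale                       # dequantize back to FP32


w = torch.randn(256, 256)
for b in (8, 4, 2):
    mse = (w - fake_quantize(w, b)).pow(2).mean().item()
    print(f"INT{b}: mean squared quantization error = {mse:.2e}")
```

Running such a sketch shows the perturbation growing rapidly as the bit-width drops from 8 to 2, which is precisely why the performance degradation under aggressive schemes is the central concern here.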

