FIT: A METRIC FOR MODEL SENSITIVITY

Abstract

Model compression is vital to the deployment of deep learning on edge devices. Low precision representations, achieved via quantization of weights and activations, can reduce inference time and memory requirements. However, quantifying and predicting the response of a model to the changes associated with this procedure remains challenging. This response is non-linear and heterogeneous throughout the network. Understanding which groups of parameters and activations are more sensitive to quantization than others is a critical stage in maximizing efficiency. For this purpose, we propose FIT. Motivated by an information geometric perspective, FIT combines the Fisher information with a model of quantization. We find that FIT can estimate the final performance of a network without retraining. FIT effectively fuses contributions from both parameter and activation quantization into a single metric. Additionally, FIT is fast to compute when compared to existing methods, demonstrating favourable convergence properties. These properties are validated experimentally across hundreds of quantization configurations, with a focus on layer-wise mixed-precision quantization.

1. INTRODUCTION

The computational costs and memory footprints associated with deep neural networks (DNNs) hamper their deployment to resource-constrained environments such as mobile devices (Ignatov et al., 2018), self-driving cars (Liu et al., 2019), or high-energy physics experiments (Coelho et al., 2021). Latency, storage and even environmental limitations directly conflict with the current machine learning regime of performance improvement through scale. For deep learning practitioners, adhering to these strict requirements whilst implementing state-of-the-art solutions is a constant challenge. As a result, compression methods, such as quantization (Gray & Neuhoff, 1998) and pruning (Janowsky, 1989), have become essential stages in deployment on the edge. In this paper, we focus on quantization.

Quantization refers to the use of lower-precision representations for values within the network, such as weights, activations and even gradients. This could, for example, involve reducing values stored in 32-bit floating point (FP) single precision to INT-8/4/2 integer precision (IP); a minimal sketch of such a mapping is given below. Quantization reduces memory requirements whilst allowing models to meet strict latency and energy consumption criteria on high-performance hardware such as FPGAs.

Despite these benefits, there is a trade-off associated with quantization. As full-precision computation is approximated with less precise representations, the model often incurs a drop in performance. In practice, this trade-off is worthwhile for resource-constrained applications. However, the performance degradation associated with quantization can become unacceptable under aggressive schemes, where post-training quantization (PTQ) to 8 bits and below is applied to the whole network (Jacob et al., 2018). Quantization Aware Training (QAT) (Jacob et al., 2018) is often used to recover lost performance; however, even after QAT, aggressive quantization may still result in a large performance drop. Model performance is limited by sub-optimal quantization schemes.

It is known that different layers, within different architectures, respond differently to quantization (Wu et al., 2018). Just as more detailed regions of an image are more challenging to compress, so certain groups of parameters are more sensitive to reduced precision. As shown clearly by Wu et al. (2018), uniform bit-width schemes fail to capture this heterogeneity. Mixed-precision quantization (MPQ), where each layer within the network is assigned a different precision, allows us to push the performance-compression trade-off to the limit. However, determining which bit widths to assign to each layer is non-trivial, and the search space of quantization configurations is exponential in the number of layers and activations.

Existing methods employ techniques such as neural architecture search and deep reinforcement learning, which are computationally expensive and less general. Methods which aim to explicitly capture the sensitivity (or importance) of layers within the network present improved performance and reduced complexity. In particular, previous works employ the Hessian, taking loss-landscape curvature as sensitivity and achieving state-of-the-art compression. Even so, many explicit methods are slow to compute, grounded in intuition, and fail to include activation quantization. Furthermore, previous works determine performance based on only a handful of configurations. Further elaboration is presented in Section 2.
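To ground the discussion, the following is a minimal sketch of simulated ("fake") symmetric uniform quantization of a weight tensor in PyTorch. The function name, per-tensor scale and bit widths are illustrative assumptions for exposition, not the specific scheme studied in this paper.

    import torch

    def quantize_uniform(w: torch.Tensor, bits: int = 8) -> torch.Tensor:
        # Simulated symmetric uniform quantization: values are mapped to
        # integers in [-qmax, qmax] and then de-quantized, so the output is
        # still floating point but takes on only 2^bits - 1 distinct levels.
        qmax = 2 ** (bits - 1) - 1
        scale = w.abs().max() / qmax  # per-tensor scale (illustrative choice)
        w_int = torch.clamp(torch.round(w / scale), -qmax, qmax)
        return w_int * scale

    w = torch.randn(64, 128)
    for b in (8, 4, 2):
        err = (w - quantize_uniform(w, bits=b)).pow(2).mean()
        print(b, err.item())  # perturbation grows as the bit width shrinks

Lowering the bit width enlarges the perturbation between the original and quantized weights; predicting how this perturbation affects the loss, without retraining, is precisely the problem addressed in this paper.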
The Fisher Information and the Hessian are closely related. In particular, many previous works in optimisation present the Fisher Information as an alternative to the Hessian. In this paper, we use the Fisher Information Trace as a means of capturing the network dynamics. We obtain our final FIT metric, which includes a quantization model, through a general proof in Section 3, grounded within the field of information geometry. The layer-wise form of FIT closely resembles that of Hessian Aware Quantization (HAWQ), presented by Dong et al. (2020); a schematic sketch of the core computation follows the list of contributions below.

Our contributions in this work are as follows:

1. We introduce the Fisher Information Trace (FIT) metric to determine the effects of quantization. To the best of our knowledge, this is the first application of the Fisher Information to generate MPQ configurations and predict final model performance. We show that FIT demonstrates improved convergence properties, is faster to compute than alternative metrics, and can be used to predict final model performance after quantization.

2. The sensitivity of parameters and activations to quantization is combined within FIT as a single metric. We show that this consistently improves performance.

3. We introduce a rank correlation evaluation procedure for mixed-precision quantization, which yields more significant results with which to inform practitioners.
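Before the formal derivation in Section 3, the sketch below illustrates the ingredient at the heart of FIT: a per-layer estimate of the Fisher Information trace obtained from squared gradients. The function name, batch budget, and the use of the empirical Fisher (gradients taken at the data labels) are our own illustrative assumptions; the precise FIT formula, including the quantization model, is given in Section 3.

    import torch.nn.functional as F

    def empirical_fisher_traces(model, loader, n_batches=8):
        # Accumulate squared per-parameter gradients of the negative
        # log-likelihood; their per-layer sums, averaged over batches,
        # estimate the trace of each layer's Fisher Information block.
        traces = {name: 0.0 for name, _ in model.named_parameters()}
        for i, (x, y) in enumerate(loader):
            if i == n_batches:
                break
            model.zero_grad()
            F.cross_entropy(model(x), y).backward()
            for name, p in model.named_parameters():
                if p.grad is not None:
                    traces[name] += p.grad.pow(2).sum().item() / n_batches
        return traces

Weighting each trace by the size of that layer's quantization perturbation, for example by the squared difference between quantized and original weights, then yields a single HAWQ-style sensitivity score per layer, mirroring the layer-wise form of FIT mentioned above.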

2. PREVIOUS WORK

In this section, we primarily focus on mixed-precision quantization (MPQ), and also give context to the information geometric perspective and the Hessian.

Mixed-Precision Quantization
As noted in Section 1, the search space of possible quantization configurations, i.e. the bit setting for each layer and/or activation, is exponential in the number of layers:

$\mathcal{O}\left(|B|^{2L}\right)$,
where B is the set of bit precisions and L the number of layers. For example, with |B| = 3 candidate precisions and L = 50 layers, the space already contains 3^100, or roughly 5 x 10^47, configurations. Tackling this large search space has proved challenging; however, recent works have made headway in improving the state of the art. CW-HAWQ (Qian et al., 2020), AutoQ (Lou et al., 2019) and HAQ (Wang et al., 2019) deploy Deep Reinforcement Learning (DRL) to automatically determine the required quantization configuration, given a set of constraints (e.g. accuracy, latency or size). AutoQ improves upon HAQ by employing a hierarchical agent with a hardware-aware objective function. CW-HAWQ seeks further improvements by reducing the search space with explicit second-order information, as outlined by Dong et al. (2020). The search space is also often explored using Neural Architecture Search (NAS). For instance, Wu et al. (2018) obtain 10-20x model compression with little to no accuracy degradation. Unfortunately, both the DRL and NAS approaches suffer from large computational resource requirements. As a result, evaluation is only possible on a small number of configurations. These methods explore the search space of possible model configurations without explicitly capturing the dynamics of the network. Instead, this is learned implicitly, which restricts generalisation.

More recent works have successfully reduced the search space of model configurations through explicit methods, which capture the relative sensitivity of layers to quantization. The bit-width assignment is based on this sensitivity. The eigenvalues of the Hessian matrix yield a heuristic based on the local curvature: higher local curvature indicates higher sensitivity to parameter perturbation, such as results from quantization to a lower bit precision. Choi et al. (2016) exploit

