POST-TRAINING WEIGHTED QUANTIZATION OF NEURAL NETWORKS FOR LANGUAGE MODELS

Abstract

As a practical model compression technique, parameter quantization is especially effective for language models with a large memory footprint. Neural network quantization is usually performed to reduce quantization loss under the assumption that the quantization error of each parameter contributes equally to the overall training loss. The importance of individual parameters, however, may differ greatly: for the same number of quantization bits, quantizing certain parameters increases the training loss more than quantizing others. In this paper, we consider a non-uniform quantization scheme, specifically binary-coding-based quantization, to achieve a high compression ratio and efficient computation while avoiding the large accuracy degradation caused by uniform quantization (e.g., INT8). We then derive quantization optimization methods that take the importance of each parameter into account. We demonstrate that, for post-training quantization, weight magnitude can represent importance and improves model accuracy significantly compared to previous schemes that lack importance considerations. For various language models, including BERT, DistilBERT, AWD-LSTM, and Transformer, our proposed post-training quantization achieves 2-4 bits per weight with reasonable accuracy degradation.

1. INTRODUCTION

Training techniques for deep neural networks (DNNs) have been developed in ways that incur substantial parameter redundancy to expedite the search for local minima (Denil et al., 2013; Frankle & Carbin, 2019). As a result, various model compression techniques, including parameter pruning (Han et al., 2015; He et al., 2017), quantization (Courbariaux et al., 2015; Rastegari et al., 2016), low-rank approximation (Sainath et al., 2013; Prabhavalkar et al., 2016), and knowledge distillation (Hinton et al., 2015; Polino et al., 2018), have been proposed to lower storage requirements and improve inference performance. Several compression techniques can be combined synergistically to enhance the compression ratio (Han et al., 2016; Zhu et al., 2017). In this work, we consider parameter quantization, which maintains a structured model format and offers a high compression ratio. Note that, given limited hardware resources, quantization is an essential technique for virtually any inference system. In general, quantization is classified into uniform quantization based on fixed-point parameter representations (Jacob et al., 2018; Han et al., 2016) and non-uniform quantization based on binary codes (Zhou et al., 2017; Rastegari et al., 2016) or codebooks (Choi et al., 2017; Stock et al., 2020). Most DNN quantization methods follow the principle of minimizing the mean squared error (MSE) of the quantized parameters (Rastegari et al., 2016; Xu et al., 2018; Zhou et al., 2017). Minimizing the MSE is also an underlying principle of low-rank approximation techniques such as the singular value decomposition (SVD) (Prabhavalkar et al., 2016; Sainath et al., 2013). Note, however, that minimizing the MSE implies that every parameter is equally important (i.e., squared errors are accumulated without considering the importance of each weight).
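To make the MSE-minimization principle concrete, the following is a minimal sketch of greedy multi-bit binary-coding quantization, where a weight vector is approximated as a sum of scaled binary vectors, w ≈ Σ_k α_k b_k with b_k ∈ {−1, +1}. The greedy residual scheme shown here is one common variant; the function name and structure are illustrative, not the exact algorithm of any cited work.

```python
import numpy as np

def binary_code_quantize(w, num_bits):
    """Approximate w as sum_k alpha_k * b_k with b_k in {-1, +1},
    greedily minimizing the (unweighted) MSE one bit at a time."""
    residual = w.astype(np.float64).copy()
    approx = np.zeros_like(residual)
    for _ in range(num_bits):
        b = np.sign(residual)
        b[b == 0] = 1.0                    # break ties toward +1
        # For fixed b, alpha minimizing ||residual - alpha*b||^2
        # is (b . residual) / (b . b) = mean(|residual|).
        alpha = np.mean(np.abs(residual))
        approx += alpha * b
        residual -= alpha * b
    return approx

rng = np.random.default_rng(0)
w = rng.normal(size=1024)
for bits in (1, 2, 3):
    mse = np.mean((w - binary_code_quantize(w, bits)) ** 2)
    print(f"{bits}-bit MSE: {mse:.4f}")
```

Each additional bit quantizes the residual left by the previous bits, so the (unweighted) MSE decreases monotonically with the bit budget; note that every squared error term is weighted equally, which is exactly the assumption questioned below.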
In practice, the impact of each parameter's quantization perturbation on the training loss can differ vastly, and this impact needs to be analyzed through a sensitivity study of each parameter with respect to changes in the training loss. In other words, minimizing the MSE (or the Euclidean distance between the original and quantized parameters) may not correspond to minimizing the training loss after quantization. The robustness of each parameter to quantization error can be expressed as its sensitivity: the sensitivity of the i-th parameter w_i is the amount of change in the loss function when w_i is perturbed. A parameter


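The importance-weighted variant of the objective above can be sketched as follows. Here each squared error is scaled by an importance score s_i, and for 1-bit binary coding with b_i = sign(w_i), setting the derivative of Σ_i s_i (w_i − α b_i)² with respect to α to zero gives the closed-form scale α = Σ_i s_i |w_i| / Σ_i s_i. Using weight magnitude as the importance score follows the idea proposed in this paper; the function below is an illustrative sketch, not the paper's exact algorithm.

```python
import numpy as np

def weighted_binary_quantize(w, importance):
    """1-bit binary quantization minimizing the importance-weighted MSE
    sum_i s_i * (w_i - alpha * b_i)^2 with b_i = sign(w_i).
    The minimizing scale is alpha = sum_i s_i*|w_i| / sum_i s_i."""
    b = np.sign(w)
    b[b == 0] = 1.0
    alpha = np.sum(importance * np.abs(w)) / np.sum(importance)
    return alpha * b

rng = np.random.default_rng(1)
w = rng.normal(size=4096)
s = np.abs(w)  # weight magnitude as importance (post-training setting)

q_weighted = weighted_binary_quantize(w, s)
q_uniform = weighted_binary_quantize(w, np.ones_like(w))  # plain MSE baseline

def wmse(q):
    """Importance-weighted MSE under scores s."""
    return np.sum(s * (w - q) ** 2) / np.sum(s)

print(f"weighted-MSE (importance-aware): {wmse(q_weighted):.4f}")
print(f"weighted-MSE (plain MSE scale):  {wmse(q_uniform):.4f}")
```

Because the importance-aware scale is optimal for the weighted objective, it yields a lower weighted error than the plain MSE scale: large-magnitude (high-importance) weights are reconstructed more faithfully at the expense of small ones.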