TRAINING WITH QUANTIZATION NOISE FOR EXTREME MODEL COMPRESSION

Abstract

We tackle the problem of producing compact models, maximizing their accuracy for a given model size. A standard solution is to train networks with Quantization Aware Training (Jacob et al., 2018), where the weights are quantized during training and the gradients approximated with the Straight-Through Estimator (Bengio et al., 2013). In this paper, we extend this approach to work beyond int8 fixed-point quantization with extreme compression methods where the approximations introduced by STE are severe, such as Product Quantization. Our proposal is to quantize only a different random subset of weights during each forward pass, allowing unbiased gradients to flow through the other weights. Controlling the amount of noise and its form allows for extreme compression rates while maintaining the performance of the original model. As a result, we establish new state-of-the-art compromises between accuracy and model size both in natural language processing and image classification. For example, applying our method to state-of-the-art Transformer and ConvNet architectures, we can achieve 82.5% accuracy on MNLI by compressing RoBERTa to 14 MB and 80.0% top-1 accuracy on ImageNet by compressing an EfficientNet-B3 to 3.3 MB.

1. INTRODUCTION

Many of the best performing neural network architectures in real-world applications have a large number of parameters. For example, the current standard machine translation architecture, the Transformer (Vaswani et al., 2017), has layers that contain millions of parameters. Even models designed to jointly optimize performance and parameter efficiency, such as EfficientNets (Tan & Le, 2019), still require dozens to hundreds of megabytes, which limits their use in domains like robotics or virtual assistants.

Model compression schemes reduce the memory footprint of overparametrized models. Pruning (LeCun et al., 1990) and distillation (Hinton et al., 2015) remove parameters by reducing the number of network weights. In contrast, quantization focuses on reducing the bits per weight. This makes quantization particularly interesting when compressing models whose network architecture has already been carefully optimized. Whereas deleting weights or whole hidden units inevitably leads to a drop in performance, we demonstrate that quantizing the weights can be performed with little to no loss in accuracy.

Popular post-processing quantization methods, like scalar quantization, replace the floating-point weights of a trained network by a lower-precision representation, like fixed-width integers (Vanhoucke et al., 2011). These approaches achieve a good compression rate with the additional benefit of accelerating inference on supporting hardware. However, the errors made by these approximations accumulate in the computations operated during the forward pass, inducing a significant drop in performance (Stock et al., 2019).

A solution to address this drifting effect is to directly quantize the network during training. This raises two challenges. First, the discretization operators have a null gradient: the derivative with respect to their input is zero almost everywhere, which requires special workarounds to train a network containing them. The second challenge, which often comes with these workarounds, is the discrepancy between the functions implemented by the network at train and test time. Quantization Aware Training (QAT) (Jacob et al., 2018) resolves these issues by quantizing all the weights during the forward pass and using a straight-through estimator (STE) (Bengio et al., 2013) to compute the gradient. This works when the error introduced by STE is small, as with int8 quantization, but does not suffice in regimes where the approximation made by the compression is more severe.

In this work, we show that quantizing only a subset of weights instead of the entire network during training is more stable for high compression schemes. Indeed, by quantizing only a random fraction of the network at each forward pass, most of the weights are updated with unbiased gradients. Interestingly, we show that our method can employ a simpler quantization scheme during training. This is particularly useful for quantizers with trainable parameters, such as the Product Quantizer (PQ), for which our quantization proxy is not parametrized. Our approach simply applies a quantization noise, called Quant-Noise, to a random subset of the weights, see Figure 1. We observe that this makes a network resilient to various types of discretization methods: it significantly improves the accuracy associated with (a) low-precision representations of weights, like int8; and (b) state-of-the-art PQ. Further, we demonstrate that Quant-Noise can be applied to existing trained networks as a post-processing step, to improve their performance after quantization.
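To make the core idea concrete, the sketch below is a minimal PyTorch rendering of this training scheme, written for this exposition rather than taken from the authors' released code; the function name quant_noise_forward, the block size, and the symmetric int8 proxy quantizer are all illustrative assumptions.

```python
import torch

def quant_noise_forward(weight, p=0.1, block_size=8, training=True):
    """Fake-quantize a random fraction p of weight blocks (training-time noise).

    Illustrative sketch: a symmetric int8 quantizer stands in for the target
    compression scheme, and the straight-through estimator is implemented
    with the usual detach trick. Assumes weight.numel() % block_size == 0.
    """
    if not training:
        p = 1.0  # at inference, quantize every block so train and test match
    blocks = weight.reshape(-1, block_size)
    # Draw the random subset of blocks to quantize on this forward pass.
    mask = (torch.rand(blocks.shape[0], 1, device=weight.device) < p).float()
    # Simple symmetric int8 proxy quantizer with a per-tensor scale.
    scale = weight.detach().abs().max().clamp(min=1e-8) / 127.0
    quantized = torch.round(blocks / scale).clamp(-127, 127) * scale
    # Quantize the selected blocks only; the rest keep their float values.
    noised = mask * quantized + (1.0 - mask) * blocks
    # Straight-through estimator: the forward pass uses `noised`, while the
    # backward pass sees the identity, so gradients reach every weight.
    out = blocks + (noised - blocks).detach()
    return out.reshape(weight.shape)
```

During training, the blocks left unquantized on a given forward pass receive exact, unbiased gradients, while the quantized blocks fall back to the straight-through estimator; at inference, p is set to 1 so that the deployed network computes exactly the quantized function.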
In summary, this paper makes the following contributions:

• We introduce the Quant-Noise technique to learn networks that are more resilient to a variety of quantization methods, such as int4, int8, and PQ;

• Adding Quant-Noise to PQ leads to new state-of-the-art trade-offs between accuracy and model size (a minimal PQ sketch follows this list). For instance, in natural language processing (NLP), we reach 82.5% accuracy on MNLI by compressing RoBERTa to 14 MB. Similarly, in computer vision, we report 80.0% top-1 accuracy on ImageNet by compressing an EfficientNet-B3 to 3.3 MB;

• By combining PQ and int8 to quantize weights and activations for networks trained with Quant-Noise, we obtain extreme compression with fixed-precision computation and achieve 79.8% top-1 accuracy on ImageNet and 21.1 perplexity on WikiText-103.
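To ground the Product Quantization side of these results, the following sketch, again ours and not the paper's implementation, shows the core PQ mechanics: the weight matrix is split into d-dimensional subvectors, a codebook of k centroids is learned with a few k-means iterations, and each subvector is then stored as a single codebook index. The helper pq_quantize and its defaults (d=4, k=256, iters=10) are hypothetical, not the paper's exact configuration.

```python
import torch

def pq_quantize(weight, d=4, k=256, iters=10):
    """Minimal Product Quantization sketch (illustrative, not the paper's code).

    Splits the flattened weight matrix into subvectors of dimension d, learns
    k centroids with a few k-means iterations, and rebuilds the weights from
    the codebook. Assumes weight.numel() is divisible by d and that there are
    at least k subvectors.
    """
    # Operate on detached weights: this is a post-processing quantizer.
    subvectors = weight.detach().reshape(-1, d)              # (n, d)
    # Initialize the codebook from k randomly chosen subvectors.
    centroids = subvectors[torch.randperm(subvectors.shape[0])[:k]].clone()
    for _ in range(iters):
        # Assignment step: map each subvector to its nearest centroid.
        assignments = torch.cdist(subvectors, centroids).argmin(dim=1)
        # Update step: move each centroid to the mean of its assigned subvectors.
        for j in range(k):
            members = subvectors[assignments == j]
            if members.shape[0] > 0:
                centroids[j] = members.mean(dim=0)
    assignments = torch.cdist(subvectors, centroids).argmin(dim=1)
    # Reconstruction: each subvector is replaced by its centroid; only the
    # codebook (k x d floats) and the per-subvector indices need be stored.
    return centroids[assignments].reshape(weight.shape), centroids, assignments
```

Storing one 8-bit index per block of d = 4 float32 weights (16 bytes) already gives roughly a 16x reduction before accounting for the codebook; the third contribution above additionally quantizes activations to int8 so that inference runs entirely in fixed precision.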

2. RELATED WORK

Model compression. Many compression methods focus on efficient parameterization, via weight pruning (LeCun et al., 1990; Li et al., 2016; Huang et al., 2018; Mittal et al., 2018), weight sharing (Dehghani et al., 2018; Turc et al., 2019; Lan et al., 2019), or dedicated architectures (Tan & Le, 2019; Zhang et al., 2017; Howard et al., 2019). Weight pruning is implemented during training (Louizos et al., 2017) or as a fine-tuning post-processing step (Han et al., 2015; 2016). Many pruning methods are unstructured, i.e., they remove individual weights (LeCun et al., 1990; Molchanov et al., 2017). On the other hand, structured pruning methods follow the structure of the weights to

Figure 1: Quant-Noise trains models to be resilient to inference-time quantization by mimicking the effect of the quantization method during training. This allows for extreme compression rates without much loss in accuracy on a variety of tasks and benchmarks.

