WAVEQ: GRADIENT-BASED DEEP QUANTIZATION OF NEURAL NETWORKS THROUGH SINUSOIDAL REGULARIZATION

Abstract

Deep quantization of neural networks below eight bits can lead to superlinear benefits in storage and compute efficiency. However, homogeneously quantizing all the layers to the same level does not account for the distinctions among the layers and their individual properties. Heterogeneous assignment of bitwidths to the layers is attractive but opens an exponentially large, non-contiguous hyperparameter space ((Available Bitwidths)^(# Layers)). Thus, finding the bitwidths while also quantizing the network to those levels becomes a major challenge. This paper addresses this challenge through a sinusoidal regularization mechanism, dubbed WaveQ. Adding our parametrized sinusoidal regularizer enables WaveQ not only to find the quantized weights but also to learn the bitwidth of each layer by making the period of the sinusoidal regularizer a trainable parameter. In addition, the sinusoidal regularizer itself is designed to align its minima with the quantization levels. With these two innovations, during training, stochastic gradient descent uses the form of the sinusoidal regularizer and its minima to push the weights to the quantization levels, while it also learns the period that determines the bitwidth of each layer separately. As such, WaveQ is a gradient-based mechanism that jointly learns the quantized weights as well as the heterogeneous bitwidths. We show that WaveQ balances compute efficiency and accuracy, and provides a heterogeneous bitwidth assignment for quantization of a large variety of deep networks (e.g., AlexNet, MobileNet, SVHN) that virtually preserves the accuracy. WaveQ is versatile and can also be used with predetermined bitwidths by fixing the period of the sinusoidal regularizer. In this case, WaveQ, on average, improves the accuracy of quantized training algorithms (DoReFa and WRPN) by ∼4.8%, and outperforms multiple state-of-the-art techniques. Finally, WaveQ is applicable to quantizing transformers and yields significant benefits.

1. INTRODUCTION

Quantization, in general, and deep quantization (below eight bits) (Krishnamoorthi, 2018), in particular, aims not only to reduce the compute requirements of DNNs but also to reduce their memory footprint (Zhou et al., 2016; Judd et al., 2016b; Hubara et al., 2017; Mishra et al., 2018; Sharma et al., 2018). Nevertheless, without specialized training algorithms, quantization can diminish the accuracy. As such, the practical utility of quantization hinges on addressing two fundamental challenges: (1) discovering the appropriate bitwidth of quantization for each layer while considering the accuracy; and (2) learning weights in the quantized domain for a given set of bitwidths. This paper formulates both of these challenges as a gradient-based joint optimization problem by introducing a novel additional sinusoidal regularization term in the training loss, called WaveQ. Two main insights drive this work. (1) Sinusoidal functions (sin²) have inherent periodic minima, and by adjusting the period, the minima can be positioned on the quantization levels corresponding to a bitwidth at per-layer granularity. (2) As such, the sinusoidal period becomes a direct and continuous representation of the bitwidth. Therefore, WaveQ incorporates this continuous variable (i.e., the period) as a differentiable part of the training loss in the form of a regularizer. Hence, WaveQ is a differentiable regularization mechanism that piggybacks on the stochastic gradient descent that trains the neural network to also learn the bitwidth (the period). Simultaneously, this parametric sinusoidal regularizer pushes the weights toward the quantization levels (the minima of sin²). By adding our parametric sinusoidal regularizer to the original training objective function, our method automatically yields the bitwidths for each layer along with nearly quantized weights for those bitwidths.
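The two insights above can be sketched numerically. The following is a minimal NumPy sketch, assuming the regularizer takes the form sin²(πw/Δ) with bin size Δ = 1/(2^β − 1); the function name `waveq_reg` and this exact functional form are our illustration, not the paper's definitive formulation:

```python
import numpy as np

def waveq_reg(w, beta):
    """Parametric sinusoidal regularizer (illustrative sketch).

    The minima of sin^2 fall exactly on the quantization levels
    n / (2**beta - 1); beta, a continuous stand-in for the bitwidth,
    controls the period and can therefore be trained by gradient descent.
    """
    step = 1.0 / (2.0 ** beta - 1.0)      # quantization bin size (the period)
    return np.sum(np.sin(np.pi * w / step) ** 2)

# The regularizer vanishes on the quantization levels and is positive
# in between, so its gradient pushes weights toward the levels.
beta = 3.0                                 # 3-bit levels: n / 7
levels = np.arange(-7, 8) / 7.0
assert waveq_reg(levels, beta) < 1e-9      # zero (up to float error) on levels
midpoints = levels[:-1] + 0.5 / 7.0
assert waveq_reg(midpoints, beta) > 1.0    # strictly positive between levels
```

Because both `w` and `beta` enter the expression differentiably, a single backward pass yields gradients for the weights and for the per-layer period.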
In fact, the original optimization procedure itself is harnessed for this purpose, which is enabled by the differentiability of the sinusoidal regularization term. As such, quantized training algorithms (Zhou et al., 2016; Mishra et al., 2018) that still use some form of backpropagation (Rumelhart et al., 1986) can effectively utilize the proposed mechanism by modifying their loss. Moreover, the proposed technique is flexible as it enables heterogeneous quantization across the layers. The WaveQ regularization can be applied either for training a model from scratch or for fine-tuning a pretrained model. In contrast to the prior inspiring works (Uhlich et al., 2019; Esser et al., 2019), WaveQ is the only technique that casts finding the bitwidths and the corresponding quantized weights as a simultaneous gradient-based optimization through sinusoidal regularization during the training process. We also prove a theoretical result that provides insight into why the proposed approach leads to solutions preserving the original accuracy during quantization. We evaluate WaveQ using different bitwidth assignments across different DNNs (AlexNet, CIFAR-10, MobileNet, ResNet-18, ResNet-20, SVHN, and VGG-11). To show the versatility of WaveQ, it is used with two different quantized training algorithms, DoReFa (Zhou et al., 2016) and WRPN (Mishra et al., 2018). Over all the bitwidth assignments, the proposed regularization, on average, improves the top-1 accuracy of DoReFa by 4.8%. The reduction in the bitwidth, on average, leads to a 77.5% reduction in the energy consumed during the execution of these networks. Finally, we apply WaveQ to Transformer DNNs by augmenting their loss with the WaveQ parametric sinusoidal regularization. In this case, conventional stochastic gradient descent plus WaveQ regularization is used to quantize the big Transformer model from (Ott et al., 2018) for machine translation on the IWSLT14 German-English dataset (IWS).
For 5-, 6-, and 7-bit quantization, training with WaveQ yields 0.46, 0.14, and 0.04 improved BiLingual Evaluation Understudy (BLEU) scores, respectively. As a point of reference, the original big Transformer model from (Ott et al., 2018) improves the BLEU by only 0.1 over the state-of-the-art. Code available at https://github.com/waveq-reg/waveq

2. JOINT LEARNING OF LAYER BITWIDTHS AND QUANTIZED PARAMETERS

Our proposed method, WaveQ, exploits weight regularization in order to automatically quantize a neural network while training. To that end, Section 2.1 describes the role of regularization in neural networks and then Section 2.2 explains WaveQ in more detail.

2.1 PRELIMINARIES

Quantizer. We first discuss how quantization of weights works. Consider a floating-point variable w_f to be mapped into a quantized domain using (b + 1) bits. Let Q be a set of (2k + 1) quantized values, where k = 2^b − 1. Considering linear quantization, Q can be represented as {−1, −(k−1)/k, ..., −1/k, 0, 1/k, ..., (k−1)/k, 1}, where 1/k is the size of the quantization bin. Now, w_f can be mapped to the b-bit quantization space (Zhou et al., 2016) as follows:

w_qo = 2 × quantize_b( tanh(w_f) / (2 max(|tanh(W_f)|)) + 1/2 ) − 1    (2.1)

In Equation 2.1, quantize_b(x) = round((2^b − 1)x) / (2^b − 1), w_f is a scalar, W_f is a vector, and w_qo is a scalar in the range [−1, 1]. Then, a scaling factor c is determined per layer to map the final quantized weight w_q into the range [−c, +c]. As such, w_q takes the form c·w_qo, where c > 0 and w_qo ∈ Q.

Soft constraints through regularization and the loss landscape of neural networks. Neural networks' loss landscapes are known to be highly non-convex, and it has been empirically verified that the loss surfaces of large networks have many local minima that provide essentially equivalent test errors (Choromanska et al., 2015; Li et al., 2018).
This opens up the possibility of adding soft constraints as extra custom objectives during the training process, in addition to the original objective (i.e., minimizing the accuracy loss). The added constraint could serve to improve generalization or to impose a preference on the weight values.
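The quantizer of Equation 2.1 can be sketched as follows. This is a minimal NumPy sketch under our own naming (`quantize_b`, `dorefa_weight_quantize` are hypothetical helpers), not the authors' implementation:

```python
import numpy as np

def quantize_b(x, b):
    """Uniform quantizer: snaps x in [0, 1] to multiples of 1/(2^b - 1)."""
    k = 2 ** b - 1
    return np.round(k * x) / k

def dorefa_weight_quantize(W_f, b):
    """Eq. 2.1: maps float weights W_f to w_qo in Q = {-1, ..., -1/k, 0, 1/k, ..., 1}."""
    t = np.tanh(W_f)
    x = t / (2.0 * np.max(np.abs(t))) + 0.5   # normalize into [0, 1]
    return 2.0 * quantize_b(x, b) - 1.0       # back to [-1, 1], on the grid

W = np.array([-1.3, -0.2, 0.0, 0.4, 2.1])
W_q = dorefa_weight_quantize(W, b=2)
k = 2 ** 2 - 1
# every quantized weight is an integer multiple of 1/k inside [-1, 1]
assert np.allclose(np.round(W_q * k), W_q * k)
assert np.all(np.abs(W_q) <= 1.0)
```

A per-layer scale c (not shown) would then map these values to [−c, +c] as w_q = c·w_qo.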



Figure 1: Sketch of a hypothetical loss surface (the original task loss to be minimized) plus an extra regularization term in 2-D weight space, for (a) weight decay and (b) WaveQ.

