WAVEQ: GRADIENT-BASED DEEP QUANTIZATION OF NEURAL NETWORKS THROUGH SINUSOIDAL REGULARIZATION

Abstract

Deep quantization of neural networks below eight bits can lead to superlinear benefits in storage and compute efficiency. However, homogeneously quantizing all the layers to the same level does not account for the distinction of the layers and their individual properties. Heterogeneous assignment of bitwidths to the layers is attractive, but it opens an exponentially large, non-continuous hyperparameter space ((Available Bitwidths)^(# Layers)). Thus, finding the bitwidths while also quantizing the network to those levels becomes a major challenge. This paper addresses this challenge through a sinusoidal regularization mechanism, dubbed WaveQ. Adding our parametrized sinusoidal regularizer enables WaveQ to not only find the quantized weights but also learn the bitwidth of each layer by making the period of the sinusoidal regularizer a trainable parameter. In addition, the sinusoidal regularizer itself is designed to align its minima with the quantization levels. With these two innovations, during training, stochastic gradient descent uses the form of the sinusoidal regularizer and its minima to push the weights to the quantization levels, while it also learns the period, which determines the bitwidth of each layer separately. As such, WaveQ is a gradient-based mechanism that jointly learns the quantized weights as well as the heterogeneous bitwidths. We show that WaveQ balances compute efficiency and accuracy, and provides a heterogeneous bitwidth assignment for the quantization of a large variety of deep networks (AlexNet, ResNet-18, and MobileNet) that virtually preserves accuracy. WaveQ is versatile and can also be used with predetermined bitwidths by fixing the period of the sinusoidal regularizer. In this case, WaveQ, on average, improves the accuracy of quantized training algorithms (DoReFa and WRPN) by ∼4.8%, and outperforms multiple state-of-the-art techniques. Finally, WaveQ is applicable to quantizing Transformers and yields significant benefits.

1. INTRODUCTION

Quantization in general, and deep quantization (below eight bits) (Krishnamoorthi, 2018) in particular, aims not only to reduce the compute requirements of DNNs but also to reduce their memory footprint (Zhou et al., 2016; Judd et al., 2016b; Hubara et al., 2017; Mishra et al., 2018; Sharma et al., 2018). Nevertheless, without specialized training algorithms, quantization can diminish accuracy. As such, the practical utility of quantization hinges upon addressing two fundamental challenges: (1) discovering the appropriate bitwidth of quantization for each layer while considering the accuracy; and (2) learning weights in the quantized domain for a given set of bitwidths. This paper formulates both of these challenges as a gradient-based joint optimization problem by introducing an additional novel sinusoidal regularization term, called WaveQ, in the training loss. The following two main insights drive this work. (1) Sinusoidal functions (sin²) have inherent periodic minima, and by adjusting the period, the minima can be positioned on the quantization levels corresponding to a bitwidth at per-layer granularity. (2) As such, the sinusoidal period becomes a direct and continuous representation of the bitwidth. Therefore, WaveQ incorporates this continuous variable (i.e., the period) as a differentiable part of the training loss in the form of a regularizer. Hence, WaveQ is a differentiable regularization mechanism that piggybacks on the stochastic gradient descent that trains the neural network to also learn the bitwidth (the period). Simultaneously, this parametric sinusoidal regularizer pushes the weights to the quantization levels (the sin² minima). By adding our parametric sinusoidal regularizer to the original training objective function, our method automatically yields the bitwidths for each layer along with nearly quantized weights for those bitwidths.
In fact, the original optimization procedure itself is harnessed for this purpose, which is enabled by the differentiability of the sinusoidal regularization term. As such, quantized training algorithms (Zhou et al., 2016; Mishra et al., 2018) that still use some form of backpropagation (Rumelhart et al., 1986) can effectively utilize the proposed mechanism by modifying their loss. Moreover, the proposed technique is flexible as it enables heterogeneous quantization across the layers. The WaveQ regularization can be applied both when training a model from scratch and when fine-tuning a pretrained model. In contrast to the prior inspiring works (Uhlich et al., 2019; Esser et al., 2019), WaveQ is the only technique that casts finding the bitwidths and the corresponding quantized weights as a simultaneous gradient-based optimization through sinusoidal regularization during the training process. We also prove a theoretical result that provides insight into why the proposed approach leads to solutions preserving the original accuracy during quantization. We evaluate WaveQ using different bitwidth assignments across different DNNs (AlexNet, ResNet-18, and MobileNet). To show the versatility of WaveQ, it is used with two different quantized training algorithms, DoReFa (Zhou et al., 2016) and WRPN (Mishra et al., 2018). Over all the bitwidth assignments, the proposed regularization, on average, improves the top-1 accuracy of DoReFa by 4.8%. The reduction in the bitwidth, on average, leads to a 77.5% reduction in the energy consumed during the execution of these networks. Finally, we apply WaveQ to Transformer DNNs by augmenting their loss with the WaveQ parametric sinusoidal regularizer. In this case, conventional stochastic gradient descent plus WaveQ regularization is used to quantize the big Transformer model from (Ott et al., 2018) for machine translation on the IWSLT14 German-English dataset (IWS).
For 5, 6, and 7-bit quantization, training with WaveQ yields 0.46, 0.14, and 0.04 improved BiLingual Evaluation Understudy (BLEU) scores, respectively. As a point of reference, the original big Transformer model from (Ott et al., 2018) improves the BLEU by only 0.1 over the state-of-the-art. Code available at https://github.com/waveq-reg/waveq

2. JOINT LEARNING OF LAYER BITWIDTHS AND QUANTIZED PARAMETERS

Our proposed method, WaveQ, exploits weight regularization in order to automatically quantize a neural network while training. To that end, Section 2.1 describes the role of regularization in neural networks and then Section 2.2 explains WaveQ in more detail.

2.1. PRELIMINARIES

Quantizer. We first discuss how quantization of the weights works. Consider a floating-point variable w_f to be mapped into a quantized domain using (b + 1) bits. Let Q be a set of (2k + 1) quantized values, where k = 2^b − 1. Considering linear quantization, Q can be represented as {−1, −(k−1)/k, ..., −1/k, 0, 1/k, ..., (k−1)/k, 1}, where 1/k is the size of the quantization bin. Now, w_f can be mapped to the b-bit quantization space (Zhou et al., 2016) as shown in Equation 2.1 below.

Soft constraints through regularization and the loss landscape of neural networks. Neural networks' loss landscapes are known to be highly non-convex, and it has been empirically verified that loss surfaces for large networks have many local minima that essentially provide equivalent test errors (Choromanska et al., 2015; Li et al., 2018). This opens up the possibility of adding soft constraints as extra custom objectives during the training process, in addition to the original objective (i.e., minimizing the accuracy loss). The added constraint could serve the purpose of increasing generalization or imposing some preference on the weight values.
w_qo = 2 × quantize_b( tanh(w_f) / (2 max(|tanh(W_f)|)) + 1/2 ) − 1    (2.1)

In Equation 2.1, quantize_b(x) = (1/(2^b − 1)) round((2^b − 1) x); w_f is a scalar and W_f is a vector.

R(W; β) = λ_w Σ_i Σ_j sin²(π w_ij (2^{β_i} − 1)) / 2^{β_i} + λ_β Σ_i β_i    (2.2)

In Equation 2.2, λ_w is the weights quantization regularization strength, which governs how strongly weight quantization errors are penalized, and λ_β is the bitwidth regularization strength. The parameter β_i is proportional to the quantization bitwidth, which is elaborated later in this section. Figure 2(a) shows a 3-D visualization of our regularizer, R. The periodic regularizer induces a periodic pattern of minima that correspond to the desired quantization levels. Such correspondence is achieved by matching the period to the quantization step (1/(2^{β_i} − 1)) based on a particular number of bits (β_i) for a given layer i.

Learning the sinusoidal period. The parameter β_i in Equation 2.2 controls the period of the sinusoidal regularizer. Thereby, β_i is directly proportional to the actual quantization bitwidth (b_i) of layer i as follows:

b_i = ⌈β_i⌉, and α_i = 2^{b_i} / 2^{β_i}    (2.3)

In Equation 2.3, α_i ∈ R+ is a scaling factor. Note that b_i ∈ Z is the only discrete parameter, while β_i ∈ R+ is a continuous real-valued variable, and ⌈·⌉ is the ceiling operator. While the first term in Equation 2.2 is only responsible for promoting quantized weights, the second term (λ_β Σ_i β_i) aims to reduce the number of bits for each layer i individually, while the overall loss aims to maximize accuracy. As such, this term is a soft constraint that yields heterogeneous bitwidths for different layers. The main insight here is that β_i, which also controls the period of the sinusoidal term, is a continuous-valued parameter by definition. As such, β_i acts as an ideal optimization objective and a proxy to minimize the actual quantization bitwidth b_i. Therefore, WaveQ avoids the issues of gradient-based optimization for discrete-valued parameters.
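The quantizer of Equation 2.1 can be sketched in plain Python as follows; the function names (`quantize_b`, `quantize_weight`) and argument names are ours, not from the paper.

```python
import math

def quantize_b(x, b):
    """quantize_b of Eq. 2.1: snap x in [0, 1] onto a grid with step 1/(2^b - 1)."""
    k = 2 ** b - 1
    return round(k * x) / k

def quantize_weight(w_f, b, max_abs_tanh):
    """DoReFa-style b-bit weight quantization (Eq. 2.1).

    max_abs_tanh is max(|tanh(W_f)|) taken over the full weight tensor W_f;
    the result lies on a uniform grid of 2^b levels inside [-1, 1].
    """
    x = math.tanh(w_f) / (2.0 * max_abs_tanh) + 0.5   # squash into [0, 1]
    return 2.0 * quantize_b(x, b) - 1.0               # map back to [-1, 1]
```

For example, with b = 2 the output levels are {−1, −1/3, 1/3, 1}: the weight with the largest |tanh| snaps to ±1 and the rest land on the intermediate levels.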
Furthermore, the benefit of learning the sinusoidal period is twofold. First, it provides a smooth, differentiable objective for finding minimal bitwidths. Second, it simultaneously learns the scaling factor (α_i) associated with the found bitwidth. Leveraging the sinusoidal properties, WaveQ learns the following two quantization parameters simultaneously: (1) a per-layer quantization bitwidth (b_i), along with (2) a scaling factor (α_i), through learning the period of the sinusoidal function. Additionally, by exploiting the periodicity, differentiability, and local convexity profile of sinusoidal functions, WaveQ automatically propels network weights towards values that are inherently closer to the quantization levels according to the jointly learned quantizer parameters b_i, α_i. The denominator (2^{β_i}) in the sinusoidal term (Σ_i Σ_j sin²(π w_ij (2^{β_i} − 1)) / 2^{β_i}) is used to control the range of the derivatives of the proposed regularization term with respect to β. This denominator is chosen to limit vanishing and exploding gradients during training. To this end, we compared three variants of Equation 2.2 with different normalizations, defined for k = 0, 1, and 2 as:

R_k(w; β) = λ_w Σ_i Σ_j sin²(π w_ij (2^{β_i} − 1)) / 2^{kβ_i} + λ_β Σ_i β_i    (2.5)

Figure 3(a), (b), (c) provide a visualization of how each of the proposed scaled variants affects the first and second derivatives. For R_{k=0} and R_{k=2}, there are regions of vanishing or exploding gradients. Only the regularization R_{k=1} (the proposed one) is free of such issues.

Setting the regularization strengths. The convergence behavior depends on the setting of the regularization strengths λ_w and λ_β. Our proposed objective seeks to learn multiple quantization parameters in conjunction. As such, the learning process can be portrayed as three phases (Figure 2(e)). In Phase (1), the learning process optimizes for the original task loss E_0.
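As a concrete sketch in plain Python (function names are ours), the proposed R_{k=1} of Equation 2.5 and the β → (b, α) mapping of Equation 2.3 can be written as:

```python
import math

def waveq_regularizer(weights, betas, lam_w=1.0, lam_b=0.0):
    """R_{k=1} of Eq. 2.5: a per-layer sin^2 term whose period matches the
    quantization step 1/(2^beta_i - 1), normalized by 2^beta_i, plus a
    bitwidth penalty lam_b * sum_i beta_i."""
    r = 0.0
    for layer_w, beta in zip(weights, betas):
        levels = 2.0 ** beta - 1.0
        sin_term = sum(math.sin(math.pi * w * levels) ** 2 for w in layer_w)
        r += lam_w * sin_term / 2.0 ** beta + lam_b * beta
    return r

def bitwidth_and_scale(beta):
    """Eq. 2.3: discrete bitwidth b_i = ceil(beta_i) and scale
    alpha_i = 2^{b_i} / 2^{beta_i}."""
    b = math.ceil(beta)
    return b, 2.0 ** b / 2.0 ** beta
```

Note that the regularizer vanishes exactly when every weight sits on a level m/(2^β − 1): e.g., for β = 2 the weights 1/3, −2/3, and 0 incur (numerically) zero penalty, while a mid-bin weight such as 0.5 is penalized.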
Initially, the small λ_w and λ_β values allow the gradient descent to explore the optimization surface freely. As the training process moves forward, we transition to Phase (2), where the larger λ_w and λ_β gradually engage the weights quantization regularization and the bitwidth regularization, respectively. Note that, for this to work, the strength of the weights quantization regularization λ_w should be higher than the strength of the bitwidth regularization λ_β, such that a bitwidth per layer can be properly evaluated and eventually learned during this phase. After the bitwidth regularizer converges to a bitwidth for each layer, we transition to Phase (3), where we fix the learned bitwidths and gradually decay λ_β while keeping λ_w high. The criterion for choosing λ_w and λ_β is to keep the magnitude of the regularization loss smaller than the magnitude of the accuracy loss. The mathematical formulas used to generate the λ_w and λ_β profiles can be found in the supplementary material (Figure 8).
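The three-phase schedule can be sketched as follows. This is an illustrative profile only: the paper's exact formulas are in its supplementary material (Figure 8), and the phase boundaries (1/3, 2/3 of training) and peak values here are assumptions.

```python
import math

def regularization_strengths(t, t_total, lam_w_max=1e-2, lam_b_max=1e-4):
    """Hypothetical three-phase profile for (lambda_w, lambda_b).

    Phase 1 (explore): both strengths stay at zero.
    Phase 2 (engage):  exponential ramp, with lambda_w kept above lambda_b.
    Phase 3 (commit):  lambda_b decays while lambda_w stays high.
    """
    p = t / t_total
    if p < 1.0 / 3.0:                       # phase 1
        ramp = 0.0
    elif p < 2.0 / 3.0:                     # phase 2: ramp from 0 up to 1
        ramp = (math.exp(3.0 * (p - 1.0 / 3.0)) - 1.0) / (math.e - 1.0)
    else:                                   # phase 3
        ramp = 1.0
    lam_w = lam_w_max * ramp
    if p < 2.0 / 3.0:
        lam_b = lam_b_max * ramp
    else:
        lam_b = lam_b_max * math.exp(-6.0 * (p - 2.0 / 3.0))  # decay lambda_b
    return lam_w, lam_b
```

The ramp is continuous at both phase boundaries, and λ_w dominates λ_β throughout, matching the ordering requirement described above.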

3. THEORETICAL ANALYSIS

The results of this section are motivated as follows. Intuitively, we would like to show that the global minima of E = E_0 + R are very close to the minima of E_0 that minimize R. In other words, we expect to extract, among the original solutions, the ones that are most prone to being quantized. To establish such a result, we will not consider the minima of E = E_0 + R, but the sequence S_n of the sets of minima of E_n = E_0 + δ_n R, defined for any sequence δ_n of real positive numbers. The next theorem shows that our intuition holds true, at least asymptotically in n, provided δ_n → 0.

Theorem 1. Let E_0, R : R^n → [0, ∞) be continuous and assume that the set S_{E_0} of the global minima of E_0 is non-empty and compact. As S_{E_0} is compact, we can also define S_{E_0,R} ⊆ S_{E_0} as the set of minima of E_0 which minimize R. Let δ_n be a sequence of real positive numbers, and let S_n be the set of global minima of E_n = E_0 + δ_n R. The following are true: 1. If δ_n → 0 and S_n → S*, then S* ⊆ S_{E_0,R}. 2. If δ_n → 0, then there is a subsequence δ_{n_k} → 0 and a non-empty set S* ⊆ S_{E_0,R} such that S_{n_k} → S*.

Proof. For the first statement, assume that S_n → S*. We wish to show that S* ⊆ S_{E_0,R}. Let x_n be a sequence of global minima of E_n = E_0 + δ_n R converging to x*. It suffices to show that x* ∈ S_{E_0,R}. First, let us observe that x* ∈ S_{E_0}. Indeed, let λ = inf_{x ∈ R^n} E_0(x) and let x̂ ∈ S_{E_0}. Then, λ ≤ E_0(x_n) ≤ (E_0 + δ_n R)(x_n) ≤ (E_0 + δ_n R)(x̂) = λ + δ_n R(x̂) → λ. Thus, since E_0 is continuous and x_n → x*, we have that E_0(x*) = λ, which implies x* ∈ S_{E_0}. Next, define µ = inf_{x ∈ S_{E_0}} R(x), and let x̄ ∈ S_{E_0,R}, so that R(x̄) = µ. Now observe that, by the minimality of x_n, we have λ + δ_n µ = (E_0 + δ_n R)(x̄) ≥ (E_0 + δ_n R)(x_n) ≥ λ + δ_n R(x_n). Thus, R(x_n) ≤ µ for all n. Since R is continuous and x_n → x*, we have that R(x*) ≤ µ, which implies that R(x*) = µ since x* ∈ S_{E_0}. Thus, x* ∈ S_{E_0,R}. The second statement follows from the first statement and the standard theory of the Hausdorff distance on compact metric spaces.
Theorem 1 implies that by decreasing the strength of R, one recovers the subset of the original solutions that achieves the smallest quantization loss. In practice, we are not interested in global minima, and we should not decrease the strength of R too much. In our context, Theorem 1 should then be understood as a proof of concept of why the proposed approach leads to the expected result. Experiments carried out in the next section support this claim. Additionally, note that while the theorem is stated in terms of a limit as the regularization parameter vanishes, the proof in fact gives a corresponding stability result: namely, if the regularization parameter is sufficiently small relative to the main loss, then the minimizers will be "almost" quantized. For the interested reader, we provide a more detailed version of the above analysis in the supplementary material.
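A toy numerical check of this selection effect (the functions below are illustrative, not from the paper): E0 has two global minima, at 0.0 and 0.3, but only 0.0 lies on an integer quantization level, where the sinusoidal R vanishes. Minimizing E0 + δ·R over a grid for shrinking δ keeps selecting the quantization-friendly minimum, as the theorem predicts.

```python
import math

def E0(x):
    return (x * (x - 0.3)) ** 2          # global minima at x = 0.0 and x = 0.3

def R(x):
    return math.sin(math.pi * x) ** 2    # vanishes only on integer levels

def argmin_on_grid(f, lo=-0.5, hi=0.8, n=20001):
    """Brute-force global minimizer of f on a uniform grid."""
    xs = [lo + (hi - lo) * i / (n - 1) for i in range(n)]
    return min(xs, key=f)

# As delta shrinks, the global minimizer of E0 + delta * R stays at the
# minimum of E0 that also minimizes R (here, x = 0.0).
for delta in (1e-1, 1e-2, 1e-3):
    x_star = argmin_on_grid(lambda x: E0(x) + delta * R(x))
    assert abs(x_star) < 1e-3
```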

4. EXPERIMENTAL RESULTS

To demonstrate the effectiveness of WaveQ, we evaluated it on several deep neural networks with different image classification datasets (CIFAR10, SVHN, and ImageNet), and one Transformer-based network, the big Transformer model from (Ott et al., 2018), on the IWSLT14 German-English dataset (IWS). We provide results for two different types of quantization. First, we show quantization results for learned heterogeneous bitwidths using WaveQ, and we provide different analyses to assess the quality of these learned bitwidth assignments. Second, we further provide results assuming a preset homogeneous bitwidth assignment as a special setting of WaveQ. This is, in some cases, a practical assumption that might stem from particular hardware requirements or constraints. Table 1 provides a summary of the evaluated networks and datasets for both learned heterogeneous bitwidths and the special case of training with preset homogeneous bitwidth assignments. We compare our proposed WaveQ method with PACT (Choi et al., 2018a), LQ-Nets (Zhang et al., 2018), DSQ (Gong et al., 2019), and DoReFa (Zhou et al., 2016).

Experimental setup. We implemented WaveQ on top of DoReFa inside Distiller (Zmora et al., 2018), an open-source framework for quantization by Intel that implements various quantization techniques. The reported accuracies for DoReFa and WRPN are with the built-in implementations in Distiller, which may not exactly match the accuracies reported in their respective papers. However, an independent implementation from a major company provides an unbiased foundation for comparison. We quantize all convolution and fully connected layers, except for the first and last layers, which use 8 bits. This assumption is commensurate with the previous techniques.

4.1. LEARNED HETEROGENEOUS BITWIDTHS

As for quantizing both weights and activations, Table 1 shows that incorporating WaveQ into the quantized training process yields the best accuracy results, outperforming PACT, LQ-Net, DSQ, and DoReFa with significant margins.
Furthermore, it can be seen that the learned heterogeneous bitwidths yield better accuracy compared to the preset 4-bit homogeneous assignments, with a lower bitwidth on average (3.85, 3.57, and 3.95 bits for AlexNet, ResNet-18, and MobileNet, respectively). Figure 4(a),(b) (bottom bar graphs) show the learned heterogeneous weight bitwidths over the layers for AlexNet and ResNet-18, respectively. As seen, WaveQ parametric regularization yields a spectrum of varying bitwidth assignments to the layers, which vary from 2 bits to 8 bits with an irregular pattern. These results demonstrate that the proposed regularization, WaveQ, automatically distinguishes different layers and their varying importance with respect to accuracy while learning their respective bitwidths. To assess the quality of these bitwidth assignments, we conduct a sensitivity analysis on the relatively big networks (see the next subsection).

Benefits of heterogeneous quantization. Figure 4(a),(b) (top graphs) show various comparisons and sensitivity results for learned heterogeneous bitwidth assignments for the bigger networks (AlexNet and ResNet-18). It is infeasible to enumerate these networks' respective quantization spaces. Compared to 4-bit homogeneous quantization, the learned heterogeneous assignments achieve better accuracy with a lower average bitwidth: 3.85 bits for AlexNet and 3.57 bits for ResNet-18. This demonstrates that a homogeneous (uniform) assignment of the bits is not always the desired choice for minimizing accuracy loss. Furthermore, Figure 4 also shows that decrementing the learned bitwidth of any single layer at a time results in a 0.44% and 0.24% average reduction in accuracy for AlexNet and ResNet-18, respectively. The dotted blue line with markers shows how differently decrementing the bitwidth of various individual layers affects the accuracy. This trend further demonstrates the effectiveness of learning with WaveQ to find the lowest bitwidth that maximizes the accuracy.

Energy savings.
To demonstrate the energy savings of the solutions found by WaveQ, we evaluate it on Stripes (Judd et al., 2016a), a custom accelerator designed for DNNs, which exploits bit-serial computation to support flexible bitwidths for DNN operations. As shown in Table 1, the reduction in the bitwidth, on average, leads to a 77.5% reduction in the energy consumed during the execution of these networks. We also consider preset homogeneous bitwidth quantization, which WaveQ supports under a special setting where we fix β (to a preset bitwidth). Hence, only the first regularization term is engaged, propelling the weights to the quantization levels.

4.2. PRESET HOMOGENEOUS BITWIDTH QUANTIZATION

Table 2 shows the accuracies of different networks.

Transformers (encoder-decoder architectures) have been shown to achieve the best results for NLP tasks including machine translation (Vaswani et al., 2017) and automatic speech recognition (Mohamed et al., 2019). A Transformer layer relies on a key-value self-attention mechanism for learning relationships between distant concepts, rather than relying on recurrent connections and memory cells. Herein, we extend the application of WaveQ regularization to improve the accuracy of deeply quantized (below 8 bits) Transformer models. We run our experiments on the IWSLT 2014 German-English (DE-EN) dataset. We use the Transformer model implemented in the fairseq-py toolkit (Ott et al., 2019). All experiments are based on the big Transformer model with 6 blocks in the encoder and decoder networks. Table 3 shows the effect of applying WaveQ regularization to the training process for 5, 6, and 7-bit quantization on the final accuracy (BLEU score). WaveQ consistently improves the BLEU score of quantized models at various quantization bitwidths (7-5 bits). Moreover, higher improvements are obtained at lower bitwidths.

We conduct an experiment that uses WaveQ for training from scratch. For the sake of clarity, we consider in this experiment the case of preset bitwidth assignments (i.e., λ_β = 0). As Figure 6 illustrates, using a constant λ_w results in the weights being stuck in a region close to their initialization (i.e., the quantization objective dominates the accuracy objective). However, if we dynamically change λ_w following the exponential curve in Figure 6, Row (III), Col (I), during the from-scratch training, the weights no longer get stuck. Instead, the weights traverse the space (i.e., jump from wave to wave), as illustrated in Figure 6, Cols (II) and (III), for CIFAR and SVHN, respectively. In these two columns, Rows (I), (II), (III) correspond to quantization with 3, 4, and 5 bits, respectively.
Initially, the smaller λ_w values allow the gradient descent to explore the optimization surface freely; as the training process moves forward, the larger λ_w gradually engages the sinusoidal regularizer and eventually pushes the weights close to the quantization levels. Further convergence analysis is provided in the supplementary material. In terms of test accuracy, both profiles yield similar results (Profile 1: 74.95% vs. Profile 2: 74.45%).

6. RELATED WORK

This research lies at the intersection of (1) quantized training algorithms and (2) techniques that discover bitwidths for quantization. The following discusses the most related works in both directions. Most related methods define a new optimization problem and use a special method for solving it. For example, (Bai et al., 2019) uses a proximal gradient method (adding a prox step after each stochastic gradient step), (Yang et al., 2020) uses ADMM, and (Tung & Mori, 2018) takes a Bayesian approach. This only makes the training more difficult and slower, and increases the computational complexity. In contrast, WaveQ exploits the conventional stochastic gradient descent method, jointly optimizing for the original training loss while softly constraining it to simultaneously learn the quantized parameters and, more importantly, the bitwidths. The differentiability of the adaptive sinusoidal regularizer enables simultaneously learning the bitwidths and pushing the weight values to the quantization levels. As such, WaveQ can be used as a complementary method to some of these efforts, which is demonstrated by experiments with both DoReFa-Net (Zhou et al., 2016) and WRPN (Mishra et al., 2018). Our preliminary efforts (Anonymous) and another concurrent work (Naumov et al., 2018) use a sinusoidal regularization to push the weights closer to the quantization levels. However, neither of these two works makes the period a differentiable parameter nor finds bitwidths during training.

Quantized training algorithms. There have been several techniques (Zhou et al., 2016; Zhu et al., 2017; Mishra et al., 2018) that train a neural network in a quantized domain after the bitwidths of the layers are determined manually. DoReFa-Net (Zhou et al., 2016) uses the straight-through estimator (Bengio et al., 2013) for quantization and extends it to arbitrary k-bit quantization of weights, activations, and gradients.
WRPN (Mishra et al., 2018) is a training algorithm that compensates for the reduced precision by increasing the number of filter maps in a layer (doubling or tripling). TTQ (Zhu et al., 2017) quantizes the weights to ternary values by using per-layer scaling coefficients learned during training. These scaling coefficients are used to scale the weights during inference. PACT (Choi et al., 2018a) proposes a technique for quantizing activations by introducing an activation clipping parameter α. This parameter (α) represents the clipping level in the activation function and is learned via backpropagation during training. More recently, VNQ (Achterhold et al., 2018) uses a variational Bayesian approach for quantizing neural network weights during training. VNQ requires a careful choice of the prior distribution of the weights, which is not straightforward, and the model is often intractable. In contrast, WaveQ is directly applicable without introducing extra hyperparameters to optimize. Additionally, VNQ takes a probabilistic approach, while WaveQ is a deterministic approach towards soft quantization.

Loss-aware weight quantization. Recent works have pursued loss-aware minimization approaches for quantization. (Hou et al., 2017) and (Hou & Kwok, 2018) developed approximate solutions using a proximal Newton algorithm to minimize the loss function directly under the constraint of low-bitwidth weights. One effort (Choi et al., 2018b) proposed to learn the quantization of DNNs through a regularization term based on the mean squared quantization error. LQ-Net (Zhang et al., 2018) proposes to jointly train the network and its quantizers. DSQ (Gong et al., 2019) employs a series of tanh functions to gradually approximate the staircase function for low-bit quantization (e.g., the sign function for the 1-bit case), while keeping the smoothness needed for easy gradient calculation.
Although some of these techniques use regularization to guide the process of quantized training, none explores the use of adaptive sinusoidal regularizers for quantization. Most recently, (Nguyen et al., 2020) suggested using the |cos| function as a regularizer. Moreover, unlike WaveQ, these techniques do not find the bitwidths for quantizing the layers.

Techniques for discovering quantization bitwidths. A recent line of research has focused on methods which can also find the optimal quantization parameters, e.g., the bitwidth and the step size, in parallel with the network weights. Recent work (Ye et al., 2018) based on ADMM (adm) runs a binary search to minimize the total squared quantization error in order to decide the quantization levels for the layers. Most recently, (Uhlich et al., 2019) proposed to indirectly learn the quantizer's parameters via a Straight-Through Estimator (STE) (Bengio et al., 2013) based approach. In a similar vein, (Esser et al., 2019) proposed to learn the quantization mapping for each layer in a deep network by approximating the gradient with respect to the quantizer step size in a way that is sensitive to quantized state transitions. On another front, recent works (Elthakeb et al., 2018; Wang et al., 2018) proposed reinforcement learning based approaches to find an optimal bitwidth assignment policy.

Quantizing Transformers. FullyQT (Prato et al., 2019) uses the bucketing-based uniform quantization proposed by QSGD (Alistarh et al., 2016) and extends it to Transformers. Q8BERT (Zafrir et al., 2019) quantizes all the GEMM (General Matrix Multiply) operations to 8 bits by adding an additional term for the quantization loss during training, calculated based on the rounding effect of floating-point values (Shaw et al., 2018). WaveQ, however, uses a sinusoidal regularizer to automatically push the weights towards the quantization levels.

7. CONCLUSION

This paper devised WaveQ, which casts the two problems of finding layer bitwidths and quantized weights as a gradient-based optimization through parametric sinusoidal regularization. WaveQ provides significant improvements over the state of the art and is even applicable to Transformers.

A.3 STATEMENT OF THE THEOREM

Theorem 2. Let F, G : R^n → [0, ∞) be continuous and assume that F is coercive, so that the set S_F of global minima of F is non-empty and compact, and we can define S_{F,G} ⊆ S_F as the set of minima of F which minimize G. For δ > 0, let S_{F+δG} be the set of points at which F + δG is globally minimized. The following are true: 1. If δ_n → 0 and S_{F+δ_nG} → S*, then S* ⊆ S_{F,G}. 2. If δ_n → 0, then there is a subsequence δ_{n_k} → 0 and a non-empty set S* ⊆ S_{F,G} such that S_{F+δ_{n_k}G} → S*.

Proof. The second statement follows from the first statement and the standard theory of the Hausdorff distance on compact metric spaces. For the first statement, assume that S_{F+δ_nG} → S*. We wish to show that S* ⊆ S_{F,G}. Let x_n be a sequence of global minima of F + δ_n G converging to x*. It suffices to show that x* ∈ S_{F,G}. First, let us observe that x* ∈ S_F. Indeed, let λ = inf_{x ∈ R^n} F(x) and let x̂ ∈ S_F. Then, λ ≤ F(x_n) ≤ (F + δ_n G)(x_n) ≤ (F + δ_n G)(x̂) = λ + δ_n G(x̂) → λ. Thus, since F is continuous and x_n → x*, we have that F(x*) = λ, which implies x* ∈ S_F. Next, define µ = inf_{x ∈ S_F} G(x), and let x̄ ∈ S_{F,G}, so that G(x̄) = µ. Now observe that, by the minimality of x_n, we have λ + δ_n µ = (F + δ_n G)(x̄) ≥ (F + δ_n G)(x_n) ≥ λ + δ_n G(x_n). Thus, G(x_n) ≤ µ for all n. Since G is continuous and x_n → x*, we have that G(x*) ≤ µ, which implies that G(x*) = µ since x* ∈ S_F. Thus, x* ∈ S_{F,G}.

B QUANTIZER

Here, we give an overview of the quantization method used. Consider a floating-point variable w_f to be mapped into a quantized domain using (b + 1) bits. Let Q be a set of (2k + 1) quantized values, where k = 2^b − 1. Considering linear quantization, Q can be represented as {−1, −(k−1)/k, ..., −1/k, 0, 1/k, ..., (k−1)/k, 1}, where 1/k is the size of the quantization bin. Now, w_f can be mapped to the b-bit quantization space (Zhou et al., 2016) as follows:

w_qo = 2 × quantize_b( tanh(w_f) / (2 max(|tanh(W_f)|)) + 1/2 ) − 1    (B.1)

where quantize_b(x) = (1/(2^b − 1)) round((2^b − 1) x); w_f is a scalar, W_f is a vector, w_qo is a scalar, and tanh is used to limit the range of the weights to [−1, 1].
Then, a scaling factor c is determined per layer to map the final quantized weight w_q into the range [−c, +c]. As such, w_q takes the form c · w_qo, where c > 0 and w_qo ∈ Q. The learned parameters (b, α), as explained in Section 2.2, can be mapped to the quantizer parameters of Equation B.1. For (b + 1)-bit quantization (the extra bit is the sign bit):

k = 2^b − 1, and c = α = 2^b / 2^β    (B.2)
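In plain Python (the helper name `quantizer_params` is ours), the mapping of Equation B.2 from a learned continuous β to the quantizer parameters (b, k, c) can be sketched as:

```python
import math

def quantizer_params(beta):
    """Eq. B.2: bitwidth b = ceil(beta), positive-level count k = 2^b - 1,
    and per-layer scale c = alpha = 2^b / 2^beta (so that w_q = c * w_qo)."""
    b = math.ceil(beta)
    k = 2 ** b - 1
    c = 2.0 ** b / 2.0 ** beta
    return b, k, c
```

Since b = ⌈β⌉, the scale c always falls in [1, 2): quantizer_params(3.0) gives (3, 7, 1.0), while quantizer_params(3.3) gives (4, 15, 2^0.7 ≈ 1.62).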

C CONVERGENCE ANALYSIS

Figure 8(a), (b) show the convergence behavior of WaveQ by visualizing both the accuracy and the regularization loss over the finetuning epochs for networks trained on CIFAR10 and SVHN. As can be seen, the regularization loss (WaveQ loss) is minimized across the finetuning epochs while the accuracy is maximized. This demonstrates the validity of the proposed regularization in optimizing the two objectives simultaneously. Figure 8(c), (d) contrast the convergence behavior with and without WaveQ for the case of training VGG-11 from scratch. As can be seen, at the onset of training, the accuracy in the presence of WaveQ lags behind that without WaveQ. This can be explained as a result of optimizing for an extra objective. Shortly thereafter, the regularization effect kicks in and eventually achieves a ∼6% accuracy improvement. The convergence behavior, however, is primarily controlled by the regularization strengths (λ_w, λ_β). As briefly mentioned in Section 2.2, λ_w, λ_β ∈ [0, ∞) are hyperparameters that weight the relative contribution of the proposed regularization objective against the standard accuracy objective. We reckon that a careful setting of λ_w and λ_β across the layers and during the training epochs is essential for optimum results (Choi et al., 2018b).

D WAVEQ PERFORMANCE ON BERT

Additionally, Table 5 provides layer-wise quantization with a heterogeneous mix of 4 and 5 bits for the BERT model. In all cases, WaveQ improves the UPOS and LAS metrics for two French treebanks (SPOKEN, PARTUT).

E TRAINING FROM SCRATCH

F REGULARIZATION STRENGTHS

Having a regularization strength is a normal setting associated with any regularization method. The criterion for choosing λ_w and λ_β is to balance the magnitude of the regularization loss so that it stays smaller than the magnitude of the accuracy loss. We then perform a grid search over a few points and choose the settings with the best convergence. From the theoretical perspective, while the theorem is stated in terms of a limit as the regularization parameter vanishes, the proof in fact gives a corresponding stability result: if the regularization parameter is sufficiently small relative to the main loss, then the minimizers will be "almost" quantized.
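The selection procedure described above can be sketched as follows. This is a hedged sketch, not the paper's code: the helper `train_and_eval(lam_w, lam_b)` is hypothetical and would briefly train with one candidate setting and return (accuracy loss, regularization loss).

```python
import itertools

def pick_strengths(train_and_eval, lam_w_grid, lam_b_grid):
    """Grid search over (lam_w, lam_b); keep candidates whose
    regularization loss stays below the accuracy loss (the criterion
    from the text), then prefer the best resulting convergence."""
    best_acc_loss, best_key = None, None
    for lam_w, lam_b in itertools.product(lam_w_grid, lam_b_grid):
        acc_loss, reg_loss = train_and_eval(lam_w, lam_b)
        if reg_loss < acc_loss and (best_acc_loss is None
                                    or acc_loss < best_acc_loss):
            best_acc_loss, best_key = acc_loss, (lam_w, lam_b)
    return best_key

# Toy stand-in for train_and_eval, purely for illustration:
best = pick_strengths(lambda lw, lb: (0.5 + lw, 0.1 * lb),
                      lam_w_grid=[0.01, 0.1], lam_b_grid=[0.001])
```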

G HETEROGENEOUS COMPARISON

For MobileNet-V2, WaveQ quantizes the network to an average bitwidth of 3.95 (20.66 GBOPs) compared to 5.90 (29.16 GBOPs) reported by Uhlich et al. (2019). Similarly, for ResNet-18, WaveQ achieves an average bitwidth of 3.57 (62.56 GBOPs) compared to 5.47 (65.90 GBOPs) by Uhlich et al. (2019). Table 7 shows this comparison. However, this comparison should be read with care, since WaveQ is a regularization method and not a full-blown quantization technique.
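A small sketch of how a per-layer bitwidth assignment collapses to a single "average bitwidth" figure. A parameter-count-weighted mean is one common convention; the paper's exact weighting (e.g., by parameters or by operations) is our assumption here.

```python
def average_bitwidth(bitwidths, param_counts):
    """Parameter-weighted average bitwidth across layers (one common
    convention; the exact weighting used in the paper is assumed)."""
    total = sum(param_counts)
    return sum(b * n for b, n in zip(bitwidths, param_counts)) / total

# Hypothetical 4-layer network with a heterogeneous assignment:
avg = average_bitwidth([4, 4, 3, 4], [1000, 2000, 4000, 1000])
```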



Figure 1: Sketch for a hypothetical loss surface (original task loss to be minimized) and an extra regularization term in 2-D weight space: for (a) weight decay, and (b) WaveQ.

Figure 2: (a) 3-D visualization of the proposed generalized objective WaveQ. (b) WaveQ 2-D profile, w.r.t. weights, adapting for arbitrary bitwidths. (c) Example of adapting to ternary quantization. (d) WaveQ 2-D profile w.r.t. bitwidth. (e) Regularization strength profiles, λ_w and λ_β, across training iterations.

Regularization in action. Regularization effectively constrains weight parameters by adding a corresponding term (regularizer) to the objective loss function. A classical example is Weight Decay (Krogh & Hertz, 1991), which aims to reduce the network complexity by limiting the growth of the weights. This soft constraint is realized by adding a term proportional to the L2 norm of the weights to the objective function as the regularizer that penalizes large weight values. WaveQ, on the other hand, uses regularization to push the weights to the quantization levels. For the sake of simplicity and clarity, Figure 1 (a) and (b) illustrate a geometrical sketch of a hypothetical loss surface (original objective function to be minimized) and an extra regularization term in 2-D weight space, respectively. For weight decay regularization (Figure 1 (a)), the faded circle shows that the regularization loss is minimized as we get closer to the origin. The point w_opt is the optimum for the loss function alone, and the overall optimum solution is achieved by striking a balance between the original loss term and the regularization loss term. Similarly, Figure 1 (b) shows a representation of the proposed periodic regularization for a fixed bitwidth β. A periodic pattern of minima pockets surrounds the original optimum point. The objective of the optimization problem is to find the best solution that is the closest to one of those minima pockets, where weight values nearly match the desired quantization levels, hence the name quantization-friendly.
2.2 WAVEQ REGULARIZATION

The proposed regularizer is formulated in Equation 2.2, where the first term pushes the weights to the quantization levels and the second, correlated term aims to reduce the bitwidth of each individual layer heterogeneously. The weight-quantization term is

R(w; β) = λ_w Σ_i Σ_j sin²( π w_ij (2^{β_i} − 1) / 2^{β_i} )   (weight quantization regularization)
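A minimal NumPy sketch of the weight-quantization term. The argument sin²(π·w·(2^β − 1)/2^β) follows our reading of the extracted formula, so the exact period factor should be checked against the original paper; the function name is ours.

```python
import numpy as np

def waveq_weight_term(weights_per_layer, betas, lam_w):
    """First WaveQ regularization term: a smooth, differentiable
    sinusoidal penalty whose minima sit on the quantization levels,
    with a trainable per-layer period set by the bitwidth beta."""
    total = 0.0
    for w, beta in zip(weights_per_layer, betas):
        period_factor = (2.0**beta - 1.0) / 2.0**beta
        total += np.sum(np.sin(np.pi * w * period_factor) ** 2)
    return lam_w * total

# A weight exactly on a minimum (w = 0) contributes zero loss:
loss_at_level = waveq_weight_term([np.array([0.0])], betas=[2], lam_w=1.0)
```

Because the penalty is smooth in both w and β, stochastic gradient descent can descend on the weights and the per-layer period simultaneously, which is what enables the joint learning of quantized weights and bitwidths.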

Figure 2 (b), (c) show a 2-D profile w.r.t. the weights (w), while (d) shows a 2-D profile w.r.t. the bitwidth (β). Periodic sinusoidal regularization. As shown in Equation 2.2, the first regularization term is based on a periodic (sinusoidal) function that adds a smooth and differentiable term to the original objective (Figure 2 (b)).

Figure 3: Visualization for three variants of the proposed regularization objective using different normalizations and their respective first and second derivatives with respect to β. (a) R 0 (w; β), (b) R 1 (w; β), and (c) R 2 (w; β).


Figure 4: Quantization bitwidth assignments across layers. (a) AlexNet (average bitwidth = 3.85 bits). (b) ResNet-18 (average bitwidth = 3.57 bits).

DSQ (Gong et al., 2019) and DoReFa are current state-of-the-art methods that report results with homogeneous 3- and 4-bit weight/activation quantization for various networks (AlexNet, ResNet-18, and MobileNet).

Figure 5: Evolution of weight distributions over training epochs at different layers and bitwidths for different networks. (a) CIFAR10, (b) SVHN, (c) AlexNet, (d) ResNet18.

using plain WRPN, plain DoReFa, and DoReFa + WaveQ for 3-, 4-, and 5-bit bitwidths. As depicted, the results concretely show the effect of incorporating WaveQ into existing quantized training techniques and how it outperforms the previously reported accuracies.

Weight distributions during training. Figure 5 shows the evolution of weight distributions over fine-tuning epochs for different layers of the CIFAR10, SVHN, AlexNet, and ResNet-18 networks. The high-precision weights form clusters and gradually converge around the quantization centroids as the regularization loss is minimized along with the main accuracy loss.

4.3 WAVEQ FOR TRANSFORMER QUANTIZATION

Figure 6: Weight trajectories. The 10 colored lines in each plot denote the trajectory of 10 different weights.

Row(I)-Col(I) shows weight trajectories without WaveQ as a point of reference. Row(II)-Col(I) shows the weight trajectories when WaveQ is used with a constant λ w .

Figure 7: Weight trajectories and training losses for different λ_w profiles.

Next, we compare two profiles of the regularization strength λ_w. Profile 1: λ_w gradually increases as training proceeds, then gradually decays towards the end of training (Figure 7 (a)). Profile 2: λ_w gradually increases as training proceeds and remains high (Figure 7 (c)). Figure 7 (a), (c) depict the different loss components, and Figure 7 (b), (d) visualize the weight trajectories. Both profiles show that the accuracy loss is minimized unimpeded along with the WaveQ loss. Our theoretical results align with Profile 1 (Figure 7 (b)). Although λ_w decays back towards the end of training, the weights mostly remain tied to their quantization levels, except for a few deflections that cause a slight increase of the regularization loss towards the end of training. In terms of test accuracy, both profiles yield similar results (Profile 1: 74.95% vs. Profile 2: 74.45%). Note that while the theorem is stated in terms of a limit as the regularization parameter vanishes, the proof in fact gives a corresponding stability result: if the regularization parameter is sufficiently small relative to the main loss, then the minimizers will be "almost" quantized.
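The two λ_w profiles can be sketched as simple schedules. This is a hedged sketch: the linear ramp shape and the 80%-of-training turning point are our assumptions for illustration, not values from the paper.

```python
def lambda_profile_1(step, total_steps, lam_max):
    """Profile 1: ramp up over the first 80% of training, then
    linearly decay towards the end of training."""
    ramp_end = 0.8 * total_steps
    if step <= ramp_end:
        return lam_max * step / ramp_end
    return lam_max * (total_steps - step) / (total_steps - ramp_end)

def lambda_profile_2(step, total_steps, lam_max):
    """Profile 2: ramp up over the first 80% of training, then hold."""
    ramp_end = 0.8 * total_steps
    if step <= ramp_end:
        return lam_max * step / ramp_end
    return lam_max
```

With Profile 1, the regularization pressure is released at the end of training; the observation above is that the weights nonetheless remain near their quantization levels.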

Figure 8: Convergence behavior: accuracy and WaveQ regularization loss over fine-tuning epochs for (a) CIFAR10, (b) SVHN. Comparing convergence behavior with and without WaveQ during training from scratch (c) accuracy, (d) training loss. Network: VGG-11, 2-bit DoReFa quantization

and the sequence S_n = S_{E_n} of the global minima of E_n. Then, the following holds true:
1. If δ_n → 0 and S_n → S*, then S* ⊆ S_{E_0,R}.
2. If δ_n → 0, then there is a subsequence δ_{n_k} → 0 and a non-empty set S* ⊆ S_{E_0,R} such that S_{n_k} → S*,
where the convergence of sets, denoted by S_n → S*, is defined as the convergence to 0 of their Hausdorff distance.

Comparison with state-of-the-art quantization methods on ImageNet. The " W/A " values are the bitwidths of weights/activations.



Accuracies

Performance of WaveQ for quantizing Transformers.

Hyperparameters settings.

Performance of WaveQ on BERT.



Validation top-1 accuracy for training from scratch w/ WaveQ vs w/o WaveQ.

shows a comparison between training from scratch with WaveQ and without it. It can be seen that incorporating WaveQ into the training process achieves strictly better accuracy than the baseline training without WaveQ across all cases. Moreover, higher improvements are obtained at lower bitwidths, reaching up to 35%.


Intuitively, we would like to show that if the regularization parameter ε > 0 is very small, then the global minima of the relaxed function are very close to the global minima of F that are closest to Q. To achieve this, we first introduce the concept of convergence of sets, and then we show that this intuition is correct by proving that the set of global minima of the relaxed function converges to a subset of the global minima of F closest to Q.

Lemma A.9. Let S_δ be a family of compact subsets of R^n. Then lim_{δ→0} S_δ = S* if and only if the following two conditions hold:
1. If x_δ ∈ S_δ converges to x, then x ∈ S*.
2. For every x ∈ S*, there exists a family x_δ ∈ S_δ with x_δ → x.
The lemma is just an exercise in the definition.
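The set convergence used in Lemma A.9 is convergence in Hausdorff distance, which for finite sets can be computed directly. A small illustrative sketch (ours, not from the paper) of sets S_δ shrinking onto a limit set S*:

```python
def hausdorff(A, B):
    """Hausdorff distance between two finite sets of reals:
    the larger of the two directed nearest-point distances."""
    d = lambda x, S: min(abs(x - s) for s in S)
    return max(max(d(a, B) for a in A),
               max(d(b, A) for b in B))

# As delta -> 0, S_delta = {delta, 1 - delta} converges to S* = {0, 1}:
S_star = [0.0, 1.0]
dists = [hausdorff([0.0 + delta, 1.0 - delta], S_star)
         for delta in (0.5, 0.1, 0.01)]
```

Here the Hausdorff distance equals δ itself, so it vanishes as δ → 0, matching the definition of S_δ → S*.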

