SEMI-RELAXED QUANTIZATION WITH DROPBITS: TRAINING LOW-BIT NEURAL NETWORKS VIA BIT-WISE REGULARIZATION

Abstract

Network quantization, which aims to reduce the bit-widths of network weights and activations, has emerged as one of the key ingredients for shrinking neural networks so that they can be deployed to resource-limited devices. To overcome the difficulty of transforming continuous activations and weights into discrete ones, a recent study called Relaxed Quantization (RQ) (Louizos et al., 2019) successfully employs the popular Gumbel-Softmax, which makes this transformation amenable to efficient gradient-based optimization. However, RQ with this Gumbel-Softmax relaxation still suffers from large quantization error, since a high temperature is required to keep the gradient variance low, and hence shows suboptimal performance. To resolve this issue, we propose a novel method, Semi-Relaxed Quantization (SRQ), which uses a multi-class straight-through estimator to effectively reduce the quantization error, along with a new regularization technique, DropBits, which replaces dropout by randomly dropping bits instead of neurons to reduce the distribution bias of the multi-class straight-through estimator in SRQ. As a natural extension of DropBits, we further introduce a way of learning heterogeneous quantization levels, using DropBits to find the proper bit-width for each layer. We experimentally validate our method on various benchmark datasets and network architectures, and also support a new hypothesis for quantization: learning heterogeneous quantization levels outperforms using the same but fixed quantization levels from scratch.

1. INTRODUCTION

Deep neural networks have achieved great success in various computer vision applications such as image classification, object detection/segmentation, pose estimation, and action recognition. However, state-of-the-art neural network architectures require too much computation and memory to be deployed to resource-limited devices. Therefore, researchers have been exploring various approaches to compress deep neural networks in order to reduce their memory usage and computation cost. In this paper, we focus on neural network quantization, which aims to reduce the bit-width of a neural network while maintaining performance competitive with a full-precision network. It is typically divided into two groups: uniform and heterogeneous quantization.

In uniform quantization, one of the simplest methods is to round the full-precision weights and activations to the nearest grid points, $\hat{x} = \alpha \lfloor x/\alpha + 1/2 \rfloor$, where $\alpha$ controls the grid interval size. However, this naïve approach incurs severe performance degradation on large datasets. Recent network quantization methods tackle this problem from different perspectives. In particular, Relaxed Quantization (RQ) (Louizos et al., 2019) employs Gumbel-Softmax (Jang et al., 2017; Maddison et al., 2017) to force weights and activations to be located near quantization grid points with high density. Louizos et al. (2019) note the importance of keeping the gradient variance small, which leads them to use high Gumbel-Softmax temperatures in RQ. However, such high temperatures may cause large quantization error, thus preventing quantized networks from achieving performance comparable to full-precision networks.

To resolve this issue, we first propose Semi-Relaxed Quantization (SRQ), which uses the mode of the original categorical distribution in the forward pass and thereby induces small quantization error. This is clearly distinguished from Gumbel-Softmax, which chooses the argmax among samples from the concrete distribution.
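As a reference point, the naive nearest-grid rounding above can be sketched in a few lines (an illustrative sketch, not the paper's implementation; the function name and the clipping to the representable range are our own choices):

```python
import math

def round_to_grid(x: float, alpha: float, b: int) -> float:
    """Naive uniform quantization: round x to the nearest point on the
    grid alpha * [-2^(b-1), ..., 2^(b-1) - 1] (illustrative sketch)."""
    lo, hi = -2 ** (b - 1), 2 ** (b - 1) - 1
    idx = math.floor(x / alpha + 0.5)   # nearest-grid index
    idx = max(lo, min(hi, idx))         # clip to the representable range
    return alpha * idx

# With b = 2 and alpha = 0.5, the grid is {-1.0, -0.5, 0.0, 0.5}.
```

This deterministic rounding is exactly the operation whose non-differentiability motivates the relaxations discussed next.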
To cluster weights cohesively around quantization grid points, we devise a multi-class straight-through estimator (STE) that also allows for efficient gradient-based optimization. As this STE is biased like the traditional one for the binary case (Bengio et al., 2013), we present a novel technique, DropBits, to reduce the distribution bias of the multi-class STE in SRQ. Motivated by Dropout (Srivastava et al., 2014), DropBits drops bits rather than neurons/filters to train low-bit neural networks under the SRQ framework. In addition to uniform quantization, DropBits allows for heterogeneous quantization, which learns a different bit-width per parameter/channel/layer by dropping redundant bits. DropBits with learnable bit-drop rates adaptively finds the optimal bit-width for each group of parameters, possibly further reducing the overall bit count. In contrast to recent studies on heterogeneous quantization (Wang et al., 2019; Uhlich et al., 2020), in which almost all layers possess at least 4 and up to 10 bits, our method yields much more resource-efficient low-bit neural networks with at most 4 bits in every layer. With trainable bit-widths, we also articulate a new hypothesis for quantization: one can find a network with learned bit-widths (termed a 'quantized sub-network') that performs better than a network trained from scratch with the same but fixed bit-widths.

Our contribution is threefold:

• We propose a new quantization method, Semi-Relaxed Quantization (SRQ), which introduces a multi-class straight-through estimator to reduce the quantization error of Relaxed Quantization when transforming continuous activations and weights into discrete ones. We further present a novel technique, DropBits, to reduce the distribution bias of the multi-class straight-through estimator in SRQ.

• Extending the DropBits technique, we propose a more resource-efficient heterogeneous quantization algorithm to curtail redundant bit-widths across groups of weights and/or activations (e.g.
across layers) and verify that our method is able to find 'quantized sub-networks'.

• We conduct extensive experiments on several benchmark datasets to demonstrate the effectiveness of our method. We achieve new state-of-the-art results for ResNet-18 and MobileNetV2 on the ImageNet dataset when all layers are uniformly quantized.

2. RELATED WORK

While our goal in this work is to obtain an extremely low-bit neural network for both weights and activations, here we broadly discuss existing quantization techniques with various goals and settings. BinaryConnect (Courbariaux et al., 2015) first attempted to binarize weights to ±1 by employing a deterministic or stochastic operation. To obtain better performance, various studies on binarization and ternarization have been conducted (Rastegari et al., 2016; Li et al., 2016; Achterhold et al., 2018; Shayer et al., 2018). To reduce hardware cost at inference, Geng et al. (2019) proposed a softmax approximation via a look-up table. Although these works effectively decrease the model size and raise the accuracy, they are limited to quantizing weights, with activations remaining in full precision.

To take full advantage of quantization at run-time, it is necessary to quantize activations as well. Researchers have recently focused more on simultaneously quantizing both weights and activations (Zhou et al., 2016; Yin et al., 2018; Choi et al., 2018; Zhang et al., 2018; Gong et al., 2019; Jung et al., 2019; Esser et al., 2020). XNOR-Net (Rastegari et al., 2016), the beginning of this line of work, exploits the efficiency of XNOR and bit-counting operations. QIL (Jung et al., 2019) also quantizes weights and activations by introducing parametrized learnable quantizers that can be trained jointly with the weight parameters. Esser et al. (2020) recently presented a simple technique that approximates the gradients with respect to the grid interval size to improve QIL. Nevertheless, these methods do not quantize the first or last layer, which leaves room to improve power-efficiency. For ease of deployment in practice, it is inevitable to quantize the weights and activations of all layers, which is the most challenging setting. Louizos et al. (2019) proposed to cluster weights by using Gumbel-Softmax, but this approach has drawbacks, as we will discuss in Section 3.2. Jain et al.
(2019) presented efficient fixed-point implementations by constraining the grid interval size to a power of two, but they quantized the first and last layers to at least 8 bits. Zhao et al. (2020) proposed to quantize the grid interval size and the batch normalization parameters for deploying quantized models on low-bit integer hardware, but this requires an accelerator specific to their approach. As another line of work, Fromm et al. (2018) proposed heterogeneous binarization given a pre-defined bit distribution. HAWQ (Dong et al., 2019) determines the bit-width for each block heuristically based on the top eigenvalue of the Hessian. Unfortunately, neither of them learns optimal bit-widths for heterogeneity. Toward this, Wang et al. (2019) and Uhlich et al. (2020) proposed layer-wise heterogeneous quantization by exploiting reinforcement learning and by learning the dynamic range of quantizers, respectively. However, their results show that almost all layers use between 4 and 10 bits, which would be suboptimal. Lou et al. (2020) presented channel-wise heterogeneous quantization by exploiting hierarchical reinforcement learning, but channel-wise precision constrains the structure of accelerators, thereby restricting the applicability of the model.

3. METHOD

In this section, we briefly review Relaxed Quantization (RQ) (Louizos et al., 2019) and propose Semi-Relaxed Quantization, which selects the nearest grid point in the forward pass to decrease the quantization error. To make it learnable and to cluster compressed parameters cohesively, SRQ expresses the nearest-grid selection of the forward pass in an equivalent form, the combination of a logistic distribution and an argmax, and performs the backward pass on it. Then, we present the DropBits technique to reduce the distribution bias of SRQ and its extension to heterogeneous quantization.

3.1. PRELIMINARY: RELAXED QUANTIZATION

Relaxed Quantization (RQ) considers the following quantization grid for weights: $\mathcal{G} = \alpha[-2^{b-1}, \ldots, 0, \ldots, 2^{b-1} - 1] =: [g_0, \ldots, g_{2^b-1}]$, where $b$ is the bit-width and a learnable parameter $\alpha > 0$ for each layer controls the grid interval. When quantizing activations, the grid points in $\mathcal{G}$ start from zero since the output of ReLU is always non-negative. Then, $x$ (a weight or an activation) is perturbed by noise $\epsilon$ as $\tilde{x} = x + \epsilon$, which enables gradient-based optimization for the non-differentiable rounding. The noise follows a distribution $p(\epsilon) = \mathrm{Logistic}(0, \sigma)$ so that $p(\tilde{x})$ is governed by $\mathrm{Logistic}(x, \sigma)$, where $\sigma$ represents the standard deviation. Under $p(\tilde{x})$, we can easily compute the unnormalized probability of $x$ being quantized to each grid point $g_i$ in closed form as below:

$$\pi_i = p(\hat{x} = g_i \mid x, \alpha, \sigma) = \mathrm{Sigmoid}\big((g_i + \alpha/2 - x)/\sigma\big) - \mathrm{Sigmoid}\big((g_i - \alpha/2 - x)/\sigma\big), \quad (1)$$

where $\hat{x}$ denotes a quantized realization of $x$. Note that the cumulative distribution function of the logistic distribution is just a sigmoid function.
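The closed-form probabilities in (1) are straightforward to compute; the following sketch (our own illustrative code, with hypothetical names) evaluates $\pi_i$ for every grid point:

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def quant_probs(x: float, alpha: float, sigma: float, b: int):
    """Unnormalized probability pi_i that x-tilde = x + eps, with
    eps ~ Logistic(0, sigma), lands in the width-alpha interval around
    grid point g_i (sketch of Eq. (1))."""
    grid = [alpha * i for i in range(-2 ** (b - 1), 2 ** (b - 1))]
    pis = [sigmoid((g + alpha / 2 - x) / sigma)
           - sigmoid((g - alpha / 2 - x) / sigma) for g in grid]
    return grid, pis
```

The probabilities do not sum to one because the logistic tails beyond the outermost intervals are excluded, which is why RQ normalizes them before sampling.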
Finally, given the unnormalized categorical probabilities $\pi = \{\pi_i\}_{i=0}^{2^b-1}$ for grid points $\mathcal{G} = \{g_i\}_{i=0}^{2^b-1}$, RQ discretizes $x$ to $\hat{x} = r \cdot \mathcal{G}$ by sampling $r = \{r_i\}_{i=0}^{2^b-1}$ from the concrete distribution (Jang et al., 2017; Maddison et al., 2017) with a temperature $\tau$:

$$u_i \sim \mathrm{Gumbel}(0, 1), \quad r_i = \frac{\exp\big((\log \pi_i + u_i)/\tau\big)}{\sum_{j=0}^{2^b-1} \exp\big((\log \pi_j + u_j)/\tau\big)}, \quad \hat{x} = \sum_{i=0}^{2^b-1} r_i g_i. \quad (2)$$

The RQ algorithm is described in detail in the Appendix.
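A single RQ forward pass in (2) can be sketched as follows (an illustrative sketch; in practice this runs on tensors with automatic differentiation rather than Python floats):

```python
import math
import random

def rq_sample(pis, grid, tau=1.0, rng=random):
    """One RQ forward pass (sketch of Eq. (2)): relax the categorical
    distribution over grid points with the concrete distribution at
    temperature tau, then form the soft quantized value x_hat."""
    # Gumbel(0, 1) noise via inverse transform of Uniform(0, 1)
    gumbels = [-math.log(-math.log(rng.random())) for _ in pis]
    logits = [(math.log(p) + u) / tau for p, u in zip(pis, gumbels)]
    m = max(logits)                                  # for numerical stability
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    r = [e / total for e in exps]
    x_hat = sum(ri * gi for ri, gi in zip(r, grid))  # soft mixture of grid points
    return r, x_hat
```

Note that `x_hat` is a convex combination of grid points, so a single sample can land far from the nearest grid point; this is the quantization-error issue discussed in Section 3.2.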

3.2. SEMI-RELAXED QUANTIZATION -FIXING PITFALLS OF RQ

Although RQ achieves competitive performance with both the weights and activations of neural networks quantized, the quantization probability modeling of RQ may still incur large quantization error, thereby yielding suboptimal performance. To be specific, Louizos et al. (2019) recommend high temperatures for the concrete distribution in (2) (e.g., $\tau = 1.0$ or $2.0$), since exploiting low temperatures hinders networks from converging due to high gradient variance. However, it turns out that the concrete distribution with such a high temperature is almost indistinguishable from the uniform distribution.

Figure 1: (a∼e) sampling distributions of RQ according to different Gumbel samples, (f) the Gumbel samples used in (a∼e), and (g) SRQ, which is the same as the original categorical distribution. The x-axis denotes the weight value (except in (f)); the marked point denotes the original value $\alpha/2$ and, for RQ (a∼e), the value $\hat{x}$ computed by (2) for input $x = \alpha/2$.

As a concrete example, we consider 2-bit quantization with $\mathcal{G} = \alpha[-2, -1, 0, 1]$ for a fixed scale parameter $\alpha > 0$ and $\sigma = \alpha/3$, and we set $\tau$ to 1.0 as in Louizos et al. (2019). Now, suppose that the original weight value is $\alpha/2$. As shown in Figure 1-(d, e), it can sporadically be quantized below zero by RQ, since the original categorical distribution has support on $-\alpha$ and $-2\alpha$. This may be acceptable on average, but RQ computes only one sample in each forward pass due to the computational burden, which can occasionally lead to very large quantization error for that particular sample. To avoid such counterintuitive samples with large quantization error, as seen in Figure 1-(d, e), we propose 'Semi-Relaxed Quantization' (SRQ), which instead directly considers the original categorical distribution in Figure 1-(g). To be concrete, for a weight or an activation $x$, the probability of $x$ being quantized to each grid point is $r_i = \pi_i / \sum_{j=0}^{2^b-1} \pi_j$ for $i \in \{0, \ldots, 2^b - 1\}$ with $b$-bit precision, where $\pi_i$ is computed as in (1).
In such a manner, selecting a grid point for $x$ can be thought of as sampling from the categorical distribution with categories $\mathcal{G} = \{g_i\}_{i=0}^{2^b-1}$ and corresponding probabilities $r = \{r_i\}_{i=0}^{2^b-1}$, as illustrated in Figure 1-(g). Then, the grid point $g_{i_{\max}}$ with $i_{\max} = \operatorname{argmax}_i r_i$ is the most reasonable speculation, as it has the highest probability. SRQ therefore chooses the mode of the original categorical distribution, $g_{i_{\max}}$, and assigns it to $x$, entirely distinct from Gumbel-Softmax, which selects the argmax among samples from the concrete distribution. As a result, SRQ does not suffer at all from the counterintuitive-sample problem that RQ encounters. The last essential part of SRQ is to handle the non-differentiable argmax operator in computing $i_{\max}$. Toward this, we propose a multi-class straight-through estimator (STE) that allows backpropagating through a non-differentiable categorical sample by approximating $\partial L/\partial r_{i_{\max}}$ with $\partial L/\partial y_{i_{\max}}$ and $\partial L/\partial r_i$ with zero for $i \neq i_{\max}$, where $L$ is the cross entropy between the true label and the prediction made by the quantized neural network as delineated in the previous paragraph, and $y_{i_{\max}}$ is the $i_{\max}$-th entry of the one-hot vector $y$. The forward and backward passes of SRQ are summarized as follows:

$$\text{Forward: } y = \mathrm{one\_hot}[\operatorname{argmax}_i r_i], \qquad \text{Backward: } \frac{\partial L}{\partial r_{i_{\max}}} = \frac{\partial L}{\partial y_{i_{\max}}} \text{ and } \frac{\partial L}{\partial r_i} = 0 \text{ for } i \neq i_{\max}. \quad (3)$$

This formulation brings two important advantages for network quantization. First of all, (3) makes the variance of the gradient estimator zero. Since SRQ always chooses the mode of the original categorical distribution (i.e., there is no randomness in the forward pass of SRQ), and the gradient of the loss $L$ with respect to the individual categorical probabilities is defined to be zero everywhere except at the coordinate corresponding to the mode, the variance of gradients in SRQ is indeed zero. The other advantage is that the backward pass (3) clusters network weight parameters cohesively.
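The forward and backward rules in (3) can be sketched as plain functions (an illustrative sketch of the STE logic only; in a real framework this would be a custom autograd operation):

```python
def srq_forward(r):
    """SRQ forward pass: pick the mode of the categorical distribution
    as a one-hot vector; no sampling, hence zero gradient variance."""
    i_max = max(range(len(r)), key=lambda i: r[i])
    y = [0.0] * len(r)
    y[i_max] = 1.0
    return y, i_max

def srq_backward(grad_y, i_max, n):
    """Multi-class straight-through estimator (Eq. (3)): route dL/dy_imax
    straight to dL/dr_imax and zero out every other coordinate."""
    grad_r = [0.0] * n
    grad_r[i_max] = grad_y[i_max]
    return grad_r
```

In a framework such as PyTorch, the same logic would live in the `forward`/`backward` of a custom differentiable function.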
Under the assumption that $r_i = \pi_i$, $\frac{\partial L}{\partial x}$ is proportional to

$$\frac{\partial \pi_{i_{\max}}}{\partial x} = -\frac{1}{\sigma}\left[\mathrm{Sigmoid}\!\left(\tfrac{g_{i_{\max}} + \alpha/2 - x}{\sigma}\right)\mathrm{Sigmoid}\!\left(-\tfrac{g_{i_{\max}} + \alpha/2 - x}{\sigma}\right) - \mathrm{Sigmoid}\!\left(\tfrac{g_{i_{\max}} - \alpha/2 - x}{\sigma}\right)\mathrm{Sigmoid}\!\left(-\tfrac{g_{i_{\max}} - \alpha/2 - x}{\sigma}\right)\right].$$

When $x$ is close to $g_{i_{\max}}$, $\partial \pi_{i_{\max}}/\partial x$ is nearly zero, and so is $\partial L/\partial x$. With an appropriate learning rate, $x$ converges to $g_{i_{\max}}$, which leads SRQ to cluster weights better than RQ, as shown in Figure 2. Although $\partial L/\partial x$ is almost zero, $\alpha$ is still trained. After $\alpha$ is updated, a gap opens between $x$ and its grid point, so $x$ can be trained again. Hence, the network continues to train until it finds the optimal $\alpha$.
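This closed-form derivative, which uses the identity $\mathrm{Sigmoid}'(z) = \mathrm{Sigmoid}(z)\,\mathrm{Sigmoid}(-z)$, can be checked numerically. The sketch below (our own code, with the overall sign from differentiating with respect to $x$ made explicit) compares it against a central finite difference and confirms it vanishes when $x$ sits exactly on a grid point:

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def pi(x, g, alpha, sigma):
    """pi_i from Eq. (1) for a single grid point g."""
    return (sigmoid((g + alpha / 2 - x) / sigma)
            - sigmoid((g - alpha / 2 - x) / sigma))

def dpi_dx(x, g, alpha, sigma):
    """Closed form of d(pi_i)/dx via sigmoid'(z) = sigmoid(z)*sigmoid(-z);
    the leading minus sign comes from d/dx of (g +- alpha/2 - x)/sigma."""
    zp = (g + alpha / 2 - x) / sigma
    zm = (g - alpha / 2 - x) / sigma
    return -(sigmoid(zp) * sigmoid(-zp) - sigmoid(zm) * sigmoid(-zm)) / sigma
```

At $x = g$ the two sigmoid products coincide by symmetry, so the derivative is exactly zero, which is the clustering mechanism described above.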

3.3. DROPBITS

Although our multi-class STE enjoys low gradient variance, it is biased toward the mode, like the binary one in Bengio et al. (2013). To reduce the bias of an STE, Chung et al. (2016) propose the slope annealing trick, but this strategy is only applicable to the binary case. To address this limitation, we propose a novel method, DropBits, to decrease the distribution bias of a multi-class STE. Inspired by dropping neurons in Dropout (Srivastava et al., 2014), we drop an arbitrary number of grid points at random every iteration, which in effect sets the probability of being quantized to the dropped grid points to zero.

Figure 3-(a): Endpoints-sharing mask. Binary masks $Z_1, Z_2, Z_3$ multiply the unnormalized probabilities (e.g., $\tilde{\pi}_0 = Z_3 \pi_0$, $\tilde{\pi}_1 = Z_2 \pi_1$, $\tilde{\pi}_7 = Z_3 \pi_7$) before the normalization $r_i = \tilde{\pi}_i / \sum_{j=0}^{7} \tilde{\pi}_j$.

However, a design policy in which each grid point has its own binary mask would make the number of masks grow exponentially with the bit-width. Taking into account appropriate noise levels with a less aggressive design, the following two examples are available: (a) endpoints of the grid share the same binary mask, and (b) grid points at the same bit-level share the same binary mask (see Figure 3). Hereafter, we consider the (b) bitwise-sharing masks for groups of grid points, unless otherwise specified. Now, we introduce how to formulate the binary masks. Unlike the practical Dropout implementation that divides activations by $1 - p$ (where $p$ is the dropout probability), we employ an explicit binary mask $Z$ whose probability $\Pi$ can be optimized jointly with the model parameters. As the Bernoulli random variable is non-differentiable, we relax the binary mask via the hard concrete distribution (Louizos et al., 2018). While the binary concrete distribution (Maddison et al., 2017) has support $(0, 1)$, the hard concrete distribution stretches it slightly at both ends, thus concentrating more mass on exact 0 and 1.
Assuming disjoint masks, we describe the construction of a binary mask $Z_k$ for the $k$-th bit-level using the hard concrete distribution as

$$U_k \sim \mathrm{Uniform}(0, 1), \quad S_k = \mathrm{Sigmoid}\big(\big(\log U_k - \log(1 - U_k) + \log \Pi_k - \log(1 - \Pi_k)\big)/\tau\big), \quad (4)$$
$$\bar{S}_k = S_k(\zeta - \gamma) + \gamma, \quad Z_k = \min(\max(\bar{S}_k, 0), 1),$$

where $\tau$ is a temperature for the hard concrete distributions, and $\gamma < 0$ and $\zeta > 1$ reflect the stretching level. For $i = 2^{b-1} - 1$, $2^{b-1}$, and $2^{b-1} + 1$, we do not sample from the above procedure but fix $Z = 1$ so as to prevent all the binary masks from becoming zero (see 'No Mask' in Figure 3).

Figure 3-(b): Bitwise-sharing mask. Grid points at the same bit-level share one binary mask (e.g., $\tilde{\pi}_0 = Z_2 \pi_0$, $\tilde{\pi}_1 = Z_2 \pi_1$, $\tilde{\pi}_7 = Z_2 \pi_7$), followed by the normalization $r_i = \tilde{\pi}_i / \sum_{j=0}^{7} \tilde{\pi}_j$.

With the value of each mask generated from the above procedure, the probability of being quantized to each grid point is re-calculated by multiplying the $\pi_i$'s by their corresponding binary masks $Z_k$ (e.g., $\tilde{\pi}_0 = Z_2 \cdot \pi_0$) and then normalizing them to sum to 1. As seen in Figure 4, the sampling distribution of SRQ is biased toward the mode, $-3\alpha$. With DropBits adjusting the $\pi_i$'s to $\tilde{\pi}_i$'s based on the $Z_k$'s, the sampling distribution of SRQ + DropBits resembles the original categorical distribution more closely than that of SRQ does, which means that DropBits effectively reduces the distribution bias of the multi-class straight-through estimator in SRQ. Moreover, DropBits does not require any hand-crafted scheduling at all, thanks to the learnable nature of $\Pi_k$, whereas such scheduling is vital for Gumbel-Softmax (Jang et al., 2017; Maddison et al., 2017) and the slope annealing trick (Chung et al., 2016). Although the quantization grid for weights is symmetric with respect to zero, the grid for activations starts from zero, which makes it difficult to exploit the symmetrically-designed DropBits for activations. Therefore, DropBits is applied only to weights in our experiments.
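The hard concrete sampling in (4) is easy to sketch (illustrative code; the hyperparameter defaults follow the values used in our implementation details, $\tau = 0.2$, $\gamma = -0.1$, $\zeta = 1.1$):

```python
import math
import random

def hard_concrete_mask(Pi, tau=0.2, gamma=-0.1, zeta=1.1, rng=random):
    """Sample a relaxed binary mask Z_k from the hard concrete
    distribution (Eq. (4)). gamma < 0 and zeta > 1 stretch the binary
    concrete support (0, 1) so that exact 0 and 1 receive positive mass."""
    u = rng.random()
    s = 1.0 / (1.0 + math.exp(-(math.log(u) - math.log(1.0 - u)
                                + math.log(Pi) - math.log(1.0 - Pi)) / tau))
    s_bar = s * (zeta - gamma) + gamma   # stretch to (gamma, zeta)
    return min(max(s_bar, 0.0), 1.0)     # clamp back to [0, 1]
```

With a high keep-probability such as $\Pi = 0.9$, most samples clamp to exactly 1 and a small fraction to exactly 0, which is precisely the bit-dropping behavior DropBits relies on.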
Assuming that the $Z_k$'s are shared across all weights of each layer, the overall procedure is described in Figure 5. The overall algorithm, including the test phase, is deferred to the Appendix due to the space limit.

3.4. EMPIRICAL ANALYSIS OF QUANTIZATION ERROR, DISTRIBUTION BIAS, VARIANCE

In this section, we empirically compare (i) the expected quantization error, (ii) the $\ell_2$-norm of the distribution bias, and (iii) the gradient variance of RQ, SRQ, and SRQ + DropBits. The expected quantization error is the expected difference between an input value and its quantized value, $\mathbb{E}[|x - \hat{x}|]$, where $\hat{x}$ is the quantized value of $x$ under each algorithm and the expectation is taken over the randomness of $\hat{x}$. As described in Appendix B, it is not possible to compute the bias of the gradient with respect to $\pi$, so we instead compute the bias of the gradient with respect to $x$ as a proxy. However, it is not straightforward to say that SRQ + DropBits is better than RQ or vice versa in terms of the bias with respect to $x$ (see Figure 8-(b) in Appendix B). As an additional indirect metric, we suggest the distribution bias: the difference between the original categorical distribution and the distribution approximated by each algorithm, $\mathbb{E}[p] - \pi_{\mathrm{origin}}$, where the vector $\pi_{\mathrm{origin}} := (\pi_i / \sum_{j=0}^{2^b-1} \pi_j)_{i=0}^{2^b-1}$ is the original categorical distribution and the vector $p := (p_i)_{i=0}^{2^b-1}$ is the approximated one (i.e., $(p_i)_{i=0}^{2^b-1} = (r_i)_{i=0}^{2^b-1}$ in (2) for RQ, and $(p_i)_{i=0}^{2^b-1} = (y_i)_{i=0}^{2^b-1}$ in (3) for SRQ (+ DropBits)). As the distribution bias is itself a vector, we compare the $\ell_2$-norm of the distribution bias of each algorithm. Note that our new notion of distribution bias is only loosely coupled with the gradient, in the sense that both converge to zero if there is no approximation of the non-differentiable parts. However, it can be used as an indirect indicator of how biased a distribution is used in computing the gradient estimator. For the variance, we use the closed form of the gradient estimator $\frac{\partial r}{\partial \pi}$ of each algorithm: for $i \neq j$, (a) RQ: $\frac{\partial r_i}{\partial \pi_j} = -\frac{r_i r_j}{\tau \pi_j}$ with randomness over $u_i$ and $u_j$, and (b) SRQ + DropBits: $\frac{\partial r_i}{\partial \pi_j} = -\frac{r_i r_j}{\pi_j}$ with randomness over the binary masks $Z_i$ and $Z_j$. The case $i = j$ can be derived similarly.
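The closed-form cross-gradient of the normalization step, $\partial r_i / \partial \pi_j = -r_i r_j / \pi_j$ for $i \neq j$, can be verified with a quick finite-difference check (illustrative sketch):

```python
def normalize(pis):
    """Normalization step r_i = pi_i / sum_j pi_j shared by SRQ and RQ."""
    s = sum(pis)
    return [p / s for p in pis]

# For r_i = pi_i / S with S = sum_j pi_j, differentiating with respect to
# pi_j (j != i) gives dr_i/dpi_j = -pi_i / S^2 = -r_i * r_j / pi_j.
```

The comment spells out the one-line derivation; the test below confirms it numerically on a small example.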
Armed with these notions, we conduct a case study on 3-bit quantization with grid points $\mathcal{G} = \alpha[-4, \ldots, 3]$ via Monte Carlo simulation, with $\alpha$ simply set to 1.0. Here, we let $x$ range over the midpoints of consecutive grid points, i.e., $x = -3.5\alpha, \ldots, 2.5\alpha$. For the gradient variance, since there are many pairs $\frac{\partial r_i}{\partial \pi_j}$ for different $i$ and $j$, we representatively choose $i$ as the index of the grid point closest to $x$, and $j$ as the indices of the two neighboring grid points of $x$ (e.g., if $x \in [g_{i-1}, g_i]$, then $j = i - 1$ and $i$). In Figure 6, SRQ shows smaller quantization error than RQ for all $x = -3.5\alpha, \ldots, 2.5\alpha$. This is because SRQ deliberately performs biased estimation on the underlying categorical distribution to prevent large quantization error from occurring in the forward pass, while sharing this underlying distribution with RQ in the backward pass. In exchange, we devised DropBits to reduce the incurred distribution bias of SRQ, which is indeed effective, as can be seen in Figure 6. Interestingly, SRQ + DropBits even achieves smaller distribution bias than RQ for all $x = -3.5\alpha, \ldots, 2.5\alpha$.

3.5. LEARNING BIT-WIDTH TOWARDS RESOURCE-EFFICIENCY

As noted in Sections 1 and 2, recent studies on heterogeneous quantization use at least 4 bits in almost all layers, and up to 10, which leaves much room for saving energy and memory. Toward a more resource-efficient scheme, we introduce an additional regularization on DropBits to drop redundant bits. Since the mask design in Figure 3-(b) reflects the actual bit-level and the probability of each binary mask in DropBits is learnable, we can penalize the use of higher bit-levels via a sparsity-encouraging regularizer such as the $\ell_1$-norm. As Louizos et al. (2018) proposed a relaxed $\ell_0$ regularization using the hard concrete binary mask, we adopt this continuous version of $\ell_0$ as our sparsity-inducing regularizer. Following (4), we define the smoothed $\ell_0$-norm as

$$R(Z; \Pi) = \mathrm{Sigmoid}\Big(\log \frac{\Pi}{1 - \Pi} - \tau \log \frac{-\gamma}{\zeta}\Big).$$

One caveat is that we need not regularize the masks for a low bit-level while a higher bit-level is still alive (in that case, the low bit-level is still necessary for quantization). We thus design the regularization so as only to permit the probability of the binary mask at the current highest bit-level to approach zero. More concretely, for bit-level binary masks $\{Z_k\}_{k=1}^{b-1}$ as in Figure 3-(b) and the corresponding probabilities $\{\Pi_k\}_{k=1}^{b-1}$, our regularization term for learning the bit-width is

$$R\big(\{Z_k\}_{k=1}^{b-1}, \{\Pi_k\}_{k=1}^{b-1}\big) = \sum_{k=1}^{b-1} \mathbb{I}(Z_k > 0) \prod_{j=k+1}^{b-1} \mathbb{I}(Z_j = 0)\, R(Z_k; \Pi_k).$$

Note that $\{Z_k\}_{k=1}^{b-1}$ is assigned to each group (e.g., all weights or activations in a layer or a channel). Hence, every weight in a group shares the same sparsity pattern (and, as a result, the same bit-width), while learned bit-widths are allowed to be heterogeneous across groups.
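The smoothed $\ell_0$ penalty and the highest-live-bit gating can be sketched as follows (illustrative code with hypothetical function names; the indicator gating here uses the sampled mask values directly):

```python
import math

def smoothed_l0(Pi, tau=0.2, gamma=-0.1, zeta=1.1):
    """Smoothed L0 penalty R(Z; Pi) = Sigmoid(log(Pi/(1-Pi)) - tau*log(-gamma/zeta)):
    the probability that the hard concrete mask is nonzero (Louizos et al., 2018)."""
    return 1.0 / (1.0 + math.exp(-(math.log(Pi / (1.0 - Pi))
                                   - tau * math.log(-gamma / zeta))))

def bit_width_penalty(z, Pi, tau=0.2, gamma=-0.1, zeta=1.1):
    """Penalize only the currently-highest live bit-level: R(Z_k; Pi_k) is
    gated by I(Z_k > 0) * prod_{j > k} I(Z_j == 0)."""
    total = 0.0
    for k in range(len(z)):
        gate = (z[k] > 0) and all(z[j] == 0 for j in range(k + 1, len(z)))
        if gate:
            total += smoothed_l0(Pi[k], tau, gamma, zeta)
    return total
```

With this gating, exactly one bit-level (the highest one still alive) contributes to the penalty at a time, so the regularizer peels off bits from the top down.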
Assuming the $l$-th layer shares binary masks $Z^l := \{Z_k^l\}_{k=1}^{b-1}$ with associated probabilities $\Pi^l := \{\Pi_k^l\}_{k=1}^{b-1}$, our final objective for an $L$-layer neural network becomes

$$L(\theta, \alpha, \sigma, Z, \Pi) + \lambda \sum_{l=1}^{L} R(Z^l, \Pi^l),$$

where $\alpha = \{\alpha_l\}_{l=1}^{L}$ and $\sigma = \{\sigma_l\}_{l=1}^{L}$ represent the layer-wise grid interval parameters and standard deviations of the logistic distributions, $Z = \{Z^l\}_{l=1}^{L}$, $\Pi = \{\Pi^l\}_{l=1}^{L}$, and $\lambda$ is a regularization parameter. In the inference phase, we simply drop unnecessary bits based on the values of $\Pi$.

3.6. A NEW HYPOTHESIS FOR QUANTIZATION

Frankle & Carbin (2019) articulated the 'lottery ticket hypothesis', stating that one can find sparse sub-networks, 'winning tickets', within randomly-initialized dense neural networks that are easier to train than sparse networks resulting from pruning. In this section, we define a new hypothesis for quantization with a slightly different (in some sense opposite) perspective from the original one.

Notation. $a \succ_{\mathrm{bit}} b$ and $a =_{\mathrm{bit}} b$ denote that $a$ has strictly higher bit-width than $b$ for at least one of the groups (e.g., channels or layers), and that $a$ has the same bit-precision as $b$ across all groups, respectively.

Definition. For a network $f(x; \theta)$ with randomly-initialized parameters $\theta$, we consider a quantized network $f(x; \theta')$ obtained from $f(x; \theta)$ such that $\theta \succ_{\mathrm{bit}} \theta'$. If the accuracy of $f(x; \theta')$ is higher than that

Under review as a conference paper at ICLR 2021

This hypothesis implies that learned bit-widths would be superior to pre-defined bit-widths. To the best of our knowledge, our study is the first attempt to delve into this hypothesis.

4. EXPERIMENTS

Since popular deep learning libraries such as TensorFlow (Abadi et al., 2016) and PyTorch from v1.3 (Paszke et al., 2019) already provide their own 8-bit quantization functionalities, we focus on lower bit-width regimes (i.e., 2∼4 bits). In contrast to some other quantization papers, our method uniformly quantizes the weights and activations of all layers, including both the first and last layers. We first show that SRQ and DropBits each make their own contribution, neither of which is negligible. Then, we evaluate our method, SRQ + DropBits, on a large-scale dataset with deep networks. Finally, we demonstrate that our heterogeneous quantization method yields promising results even when all layers have at most 4 bits, and we validate the new hypothesis for quantization from Section 3.6.

4.1. ABLATION STUDIES

To validate the efficacy of SRQ and DropBits, we successively apply each piece of our method to RQ for LeNet-5 (LeCun et al., 1998) on MNIST and VGG-7 (Simonyan & Zisserman, 2014) on CIFAR-10. Table 1 shows that SRQ outperforms RQ in most cases. One might wonder whether the issue of RQ raised in Section 3.2 could be addressed by an annealing schedule for the temperature $\tau$ in RQ. It could be, but RQ with an annealing schedule suffers from high gradient variance due to the low temperatures at the end of training, as shown in Figure 7. As a result, annealing $\tau$ gives rise to worse performance than RQ, as shown in Table 1. SRQ, however, suffers from neither problem, thus displaying the best learning curve in Figure 7. Finally, it can be clearly seen that DropBits consistently improves SRQ by decreasing the distribution bias of our multi-class STE.

Table 3: Test error (%) for quantized sub-networks using LeNet-5 on MNIST, VGG-7 on CIFAR-10, and ResNet-18 on ImageNet. Here, an underline means a learned bit-width and "T" stands for ternary precision.

4.2. IMAGENET

To verify the effectiveness of our algorithm on the ImageNet dataset, we quantize the ResNet-18 (He et al., 2016) and MobileNetV2 (Sandler et al., 2018) architectures, initialized with the pre-trained full-precision networks available from the official PyTorch repository. In Table 2, our method is compared only to state-of-the-art algorithms that quantize both the weights and activations of all layers, for fair comparison. An extensive comparison against recent works that keep the first or last layer in full precision is given in the Appendix. Table 2 illustrates how much better our model performs than the latest quantization methods as well as our baseline, RQ. On ResNet-18, SRQ + DropBits outdoes RQ, QIL, LLSQF, and TQT, even achieving 4-bit top-1 and top-5 errors nearly matching those of the full-precision network.
On MobileNetV2, SRQ + DropBits with 4 bits surpasses all existing studies by more than one percentage point. Moreover, we quantize MobileNetV2 to 3 bits and obtain competitive performance, which is remarkable given that no previous work has successfully quantized MobileNetV2 to 3 bits.

4.3. FINDING QUANTIZED SUB-NETWORKS

In this experiment, we validate the new hypothesis for quantization by training the probabilities of the binary masks with the regularizer of Section 3.5 to learn the bit-width of each layer. For brevity, only the weights are heterogeneously quantized, and the bit-width for activations is fixed to its initial value. In Table 3, the fourth column reports the per-layer bit-widths learned by our regularizer, while the fifth and last columns report the test error when fixing each layer's bit-width to the learned values (fourth column) and training from scratch, and when using our regularization approach, respectively. Table 3 shows that the structure learned by our heterogeneous quantization method (last column) is superior to the fixed structure with learned bit-widths trained from scratch (fifth column) in all cases. One might doubt whether our regularizer can recognize which layers are truly redundant. This is indirectly substantiated by the observation that the fixed structure with learned bit-widths trained from scratch (fifth column) outperforms uniform quantization (third column) on CIFAR-10. More experiments with different values of the regularization parameter $\lambda$ are deferred to the Appendix.

5. CONCLUSION

We proposed Semi-Relaxed Quantization (SRQ), which effectively clusters weights in low bit-width regimes, along with DropBits, which reduces the distribution bias of SRQ. We empirically showed that SRQ and DropBits each possess their own value, together leading SRQ + DropBits to achieve state-of-the-art performance for ResNet-18 and MobileNetV2 on ImageNet. Furthermore, we took a step further toward heterogeneous quantization by simply penalizing the binary masks in DropBits, which enables us to find quantized sub-networks. As future work, we plan to extend our heterogeneous quantization method to activations and to apply it to other quantizers.

A ALGORITHM OF SEMI-RELAXED QUANTIZATION WITH DROPBITS

Algorithm 1 Semi-Relaxed Quantization (SRQ) + DropBits
1: Input: Training data
2: Initialize: Bit-width $b$; network parameters $\{W_l, b_l\}_{l=1}^{L}$; layer-wise grid interval parameters and standard deviations of the logistic distribution $\{\alpha_l, \sigma_l\}_{l=1}^{L}$. Initialize the layer-wise grid $G_l = \alpha_l[-2^{b-1}, \ldots, 2^{b-1} - 1] =: [g_{l,0}, g_{l,1}, \ldots, g_{l,2^b-1}]$ for $l \in \{1, \ldots, L\}$.
3: procedure TRAINING
4: for $l = 1, \ldots, L$ do
5: $x \leftarrow$ each entry of $W_l$ or $b_l$
6: $I_l = G_l - \alpha_l/2$ ▷ Shift the grid by $-\alpha_l/2$
7: $F = \mathrm{Sigmoid}\big((I_l - x)/\sigma_l\big)$ ▷ Compute CDFs
8: $\pi_i = F[i+1] - F[i]$ for $i = 0, \ldots, 2^b - 1$
9: if DropBits then
10: Sample masks $Z_k$ for $k = 0, \ldots, b - 1$ from (4)
11: $\tilde{\pi} = \pi \odot Z$
12: $r_i = \tilde{\pi}_i / \sum_j \tilde{\pi}_j$
⋮
27: $W_l = \min(\max(\alpha_l \cdot \mathrm{Round}(W_l/\alpha_l), g_{l,0}), g_{l,2^b-1})$
28: $b_l = \min(\max(\alpha_l \cdot \mathrm{Round}(b_l/\alpha_l), g_{l,0}), g_{l,2^b-1})$
29: end for
30: end procedure

B COMPARISON OF BIAS BETWEEN RQ, SRQ, AND SRQ + DROPBITS

In general, when the loss involves discrete random variables, the true gradient, i.e., the gradient of the expectation of the loss with respect to the parameters of the discrete random variables, can be obtained using existing stochastic gradient estimation techniques such as the score function estimator (see Equations (4) and (5) in Maddison et al. (2017)). However, the reason we did not explicitly compare the bias of the gradient is that the distribution parameters (i.e., $\pi = \{\pi_i\}_{i=0}^{2^b-1}$) in our setting are not independent of the network parameters $x$. In fact, $\pi$ is a function of $x$ via (1). Given that an inverse of $\pi$ does not exist, because $\pi_i$ is not one-to-one with respect to $x$ for each $i$, it is not possible to directly apply existing techniques such as the score function estimator to compute unbiased estimators. Instead, we compute the bias of the gradient with respect to $x$ as a proxy for that with respect to $\pi$.

E COMPARISON OF SRQ + DROPBITS WITH GUMBEL-SOFTMAX + MULTI-CLASS STE

As described in Section 3.4, our SRQ + DropBits shows smaller quantization error, gradient variance, and distribution bias than RQ, while maintaining stochasticity. For this reason, we employ the deterministic scheme in the first place and then inject stochasticity via DropBits. To further demonstrate the effectiveness of SRQ + DropBits, we empirically compare it with an algorithm that uses the Gumbel-Softmax STE in the forward pass (instead of DropBits) and our multi-class STE in the backward pass; we call this algorithm "Gumbel-Softmax + multi-class STE". Despite employing our multi-class STE in the backward pass, Gumbel-Softmax + multi-class STE performs worse than SRQ + DropBits, primarily because the Gumbel-Softmax STE still incurs a large quantization error, just as RQ does.
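The forward/backward split of a multi-class straight-through estimator can be made concrete with a small sketch. This is our own illustration of the general STE pattern, not the paper's code: the forward pass takes the hard argmax over the categorical probabilities, while gradients flow through the differentiable soft expectation (in an autodiff framework, the difference `hard - soft` would be detached from the graph).

```python
import numpy as np

def hard_forward(pi, grid):
    """Forward pass: pick the mode of the categorical distribution,
    which keeps the quantization error small."""
    return grid[np.argmax(pi)]

def soft_expectation(pi, grid):
    """Backward surrogate: the expectation under pi, which is
    differentiable in the parameters that produced pi."""
    return np.dot(pi / pi.sum(), grid)

def ste_value(pi, grid):
    """Straight-through composition: the value equals the hard forward,
    while in autodiff the gradient would be that of the soft surrogate
    (the term (hard - soft) is treated as a constant)."""
    hard = hard_forward(pi, grid)
    soft = soft_expectation(pi, grid)
    return soft + (hard - soft)

pi = np.array([0.1, 0.7, 0.2])
grid = np.array([-1.0, 0.0, 1.0])
y = ste_value(pi, grid)  # equals the hard value, here grid[1] = 0.0
```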

F IMPLEMENTATION DETAILS

The weights and activations of all layers, including the first and last (denoted by W and A), are assumed to be perturbed as W̃ = W + ε and Ã = A + ε, respectively, with ε ∼ L(0, σ) as described in Section 2. For the DropBits regularization in Section 3.3, we initialize the probability of each binary mask with Π ∼ N(0.9, 0.01²) (i.e., a low dropout probability). The concrete distribution of a binary mask is stretched with ζ = 1.1 and γ = -0.1 as recommended in Louizos et al. (2018), and τ is initialized to 0.2 to make the binary masks more discrete.

For MNIST experiments, we train LeNet-5 with the 32C5 - MP2 - 64C5 - MP2 - 512FC - Softmax architecture for 100 epochs irrespective of the bit-width. The learning rate is set to 5e-4 regardless of the bit-width and decayed exponentially with decay factor 0.8 over the last 50 epochs. The input is normalized to the [-1, 1] range without any data augmentation.

For CIFAR-10 experiments, following the convention of Rastegari et al. (2016) that changes the location of the max-pooling layer, each max-pooling layer is placed after a convolutional layer but before the batch normalization and activation function. We train VGG-7 with the 2x(128C3) - MP2 - 2x(256C3) - MP2 - 2x(512C3) - MP2 - 1024FC - Softmax architecture for 300 epochs with an initial learning rate of 1e-4 regardless of the bit-width. The learning rate is multiplied by 0.1 at 50% of the total epochs and decayed exponentially with decay factor 0.9 during the last 50 epochs. The input images are preprocessed by subtracting the mean and dividing by the standard deviation. The training set is augmented as follows: (i) a random 32 × 32 crop is sampled from an image padded with 4 pixels on each side; (ii) images are randomly flipped horizontally. The test set is evaluated without any padding or cropping. Note that a batch normalization layer follows every convolutional layer in VGG-7, but not in LeNet-5.
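The DropBits mask sampling described above, using the stretched concrete (hard concrete) distribution of Louizos et al. (2018) with ζ = 1.1, γ = -0.1, and τ = 0.2, can be sketched as follows. The logit parameterization of the keep probability is our own assumption for illustration.

```python
import numpy as np

# Stretch/temperature constants as stated above (Louizos et al., 2018).
ZETA, GAMMA, TAU = 1.1, -0.1, 0.2

def sample_hard_concrete_mask(keep_prob, rng):
    """Sample a (nearly) binary mask from the stretched concrete
    distribution: a sigmoid-transformed logistic sample is stretched to
    (gamma, zeta) and then hard-clipped to [0, 1]."""
    log_alpha = np.log(keep_prob) - np.log(1.0 - keep_prob)  # logit of keep prob. (assumed param.)
    u = rng.uniform(1e-6, 1.0 - 1e-6)
    s = 1.0 / (1.0 + np.exp(-(np.log(u) - np.log(1.0 - u) + log_alpha) / TAU))
    s_bar = s * (ZETA - GAMMA) + GAMMA   # stretch to (gamma, zeta)
    return float(np.clip(s_bar, 0.0, 1.0))

rng = np.random.default_rng(0)
# Masks initialized with keep probability around 0.9, as in our setup:
masks = [sample_hard_concrete_mask(0.9, rng) for _ in range(5)]
```

With a low temperature such as τ = 0.2, the sampled masks concentrate near exactly 0 or exactly 1, which is why we describe them as "more discrete".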
In Section 4.1, RQ with an annealed temperature τ is implemented following Jang et al. (2017): τ is updated every 1000 iterations by the schedule τ = max(0.5, exp(-t/100000)) in 3-bit and τ = max(0.5, 2 exp(-t/100000)) in 4-bit, in order to make the decrease of τ as gradual as possible. Here, t is the global training iteration. For the ImageNet experiments in Section 4.2, the weight parameters of both ResNet-18 and MobileNetV2 are initialized with the pre-trained full-precision models available from the official PyTorch repository. When quantizing ResNet-18 to 3-bit, fine-tuning runs for 80 epochs with a batch size of 256: the learning rate is initialized to 2e-5 and halved at epochs 50, 60, and 68. When quantizing ResNet-18 to 4-bit, fine-tuning runs for 150 epochs with a batch size of 128: the learning rate is 5e-6 for the first 125 epochs and 1e-6 for the last 25 epochs. When quantizing MobileNetV2 to 3-bit and 4-bit, fine-tuning runs for 25 epochs with a batch size of 48 and an initial learning rate of 2e-5: the learning rate is halved at epochs 15 and 20 for 3-bit, and at epochs 10, 12, 18, and 20 for 4-bit. We employ AdamW (Decoupled Weight Decay Regularization; Loshchilov & Hutter, 2019) with a weight decay factor of 0.01. In Sections 4.3 and D, if the probability of a binary mask falls below 0.5, we drop the corresponding bit. For LeNet-5 on MNIST and VGG-7 on CIFAR-10, the regularization term of Section 3.5 is active only for the first 50% of the total epochs; with the remaining bit-width of each layer fixed, fine-tuning is conducted for the last 50%. For ResNet-18 on ImageNet, we initialize the weights with the pre-trained full-precision model and train for ten epochs for simplicity.
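The temperature annealing schedule above can be written as a trivial helper; the constants are exactly those stated, and in practice the value is refreshed every 1000 iterations rather than every step.

```python
import math

def rq_temperature(t, bit_width):
    """Annealed Gumbel-Softmax temperature for RQ, floored at 0.5.
    t is the global training iteration; the 4-bit variant starts at 2.0."""
    scale = 1.0 if bit_width == 3 else 2.0
    return max(0.5, scale * math.exp(-t / 100000))
```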
During training, the regularization term of Section 3.5 is active only for the first 9 epochs, and fine-tuning is performed for the last epoch with the remaining bit-width of each layer fixed. All experiments in Tables 3 and 5 were conducted using AdamW, with the weight decay set to 0.01 for LeNet-5, 0.02 for VGG-7, and 0.01 for ResNet-18. We consider the regularization parameter λ ∈ [5 × 10^-5, 10^-2] to encourage layer-wise heterogeneity.

G ALGORITHM OF RQ

We provide the algorithmic details of RQ as follows. For quantizing weights, G = [g_i]_{i=0}^{2^b-1} = α[-2^{b-1}, ..., 0, ..., 2^{b-1} - 1]; when quantizing activations, however, the grid starts from zero since the outputs of ReLU activations are always non-negative, that is, G = [g_i]_{i=0}^{2^b-1} = α[0, ..., 2^b - 1]. The objective in RQ is the cross-entropy loss between the class labels and the class probabilities predicted with quantized weights, biases, and activations, which is the same as the loss function L in our method.
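The two grid constructions can be sketched directly; this is a one-line illustration of the definitions above, with names of our own choosing.

```python
import numpy as np

def rq_grid(alpha, b, for_activation):
    """Quantization grid in RQ: symmetric around zero for weights,
    non-negative for (post-ReLU) activations."""
    if for_activation:
        return alpha * np.arange(0, 2 ** b)                 # alpha * [0, ..., 2^b - 1]
    return alpha * np.arange(-2 ** (b - 1), 2 ** (b - 1))   # alpha * [-2^{b-1}, ..., 2^{b-1} - 1]

gw = rq_grid(0.5, 3, for_activation=False)  # 8 points from -2.0 to 1.5
ga = rq_grid(0.5, 3, for_activation=True)   # 8 points from 0.0 to 3.5
```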



Footnotes. (1) Since i_max does not change, the assumption is not unreasonable. (2) We could not reproduce the results of RQ in 2-bit, so we experiment only with 3-bit and 4-bit RQ. (3) Our own implementation with all layers quantized, using the pretrained models available from PyTorch.



Figure 1: Probability of being quantized to each grid point by RQ (τ = 1.0) and SRQ, respectively.

Figure 2: Weight distributions for 3-bit quantized LeNet-5 by (a) RQ and (b) SRQ. The x-axis and y-axis represent the weight values and their frequencies, respectively. The vertical dashed lines denote grid points.


Figure 3: Two mask designs in 3-bit

Figure 4: Illustration of the effect of DropBits on SRQ. For a given weight, (a) the categorical distribution over the grid points, r_i = π_i / Σ_{j=0}^{7} π_j for i = 0, ..., 7; (b) the sampling distribution of SRQ, obtained by taking the argmax of r_i; and (c) the sampling distribution of SRQ + DropBits, obtained by taking the argmax of the masked probabilities r̃_i = π̃_i / Σ_{j=0}^{7} π̃_j. Here, the Π_k's are initialized to 0.7 for clarity.

Figure 5: Illustration of Semi-Relaxed Quantization (SRQ) framework with DropBits technique.

Figure 6: Comparison of RQ, SRQ, and SRQ + DropBits in quantization error/distribution bias/variance.

Figure 7: Learning curves of VGG-7 quantized by RQ, RQ with annealing τ , and SRQ in 3-bit.

Figure 8: The comparison of bias (a) between SRQ and SRQ + DropBits, and (b) between SRQ + DropBits and RQ, when training LeNet-5 on MNIST in 3-bit. In this experiment, only weights are quantized. The x-axis denotes the value of a weight in the last fully connected layer.

Algorithm 2 Relaxed Quantization (RQ) (Louizos et al., 2019) for training
1: Input: x (a weight or an activation)
2: Initialize: scale α, standard deviation σ, grid G = [g_i]_{i=0}^{2^b-1} = [g_0, ..., g_{2^b-1}]
3: Require: bit-width b, temperature τ
4: I = [g_0 - α/2, ..., g_{2^b-1} - α/2, g_{2^b-1} + α/2]
5: F = Sigmoid((I - x)/σ)                      ▷ compute CDF
6: π_i = F[i+1] - F[i] for i = 0, ..., 2^b - 1  ▷ unnormalized prob. for each grid point
7: # sample from the concrete distribution
8: u_i ∼ Gumbel(0, 1) for i = 0, ..., 2^b - 1
9: r_i = exp((log π_i + u_i)/τ) / Σ_j exp((log π_j + u_j)/τ)
10: Output: x̃ = Σ_{i=0}^{2^b-1} r_i g_i

Algorithm 3 Relaxed Quantization (RQ) (Louizos et al., 2019) for inference
1: Input: x (a weight or an activation)
2: Require: scale α, grid G = [g_0, ..., g_{2^b-1}]
3: x̂ = α · Round(x/α)
4: Output: min(max(x̂, g_0), g_{2^b-1})
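The training and inference passes of RQ can be sketched in NumPy as follows. This is an illustration under the definitions above, not RQ's released code; Gumbel(0, 1) noise is sampled as -log(-log(U)) for U ∼ Uniform(0, 1), and the softmax is computed with the usual max-subtraction for numerical stability.

```python
import numpy as np

def rq_train_quantize(x, alpha, sigma, b, tau, rng):
    """RQ training pass: relax the categorical assignment to grid points
    with the concrete (Gumbel-Softmax) distribution at temperature tau,
    and output the resulting convex combination of grid points."""
    grid = alpha * np.arange(-2 ** (b - 1), 2 ** (b - 1))
    edges = np.append(grid - alpha / 2, grid[-1] + alpha / 2)
    cdf = 1.0 / (1.0 + np.exp(-(edges - x) / sigma))
    pi = cdf[1:] - cdf[:-1]                            # unnormalized prob. per grid point
    g = -np.log(-np.log(rng.uniform(size=pi.shape)))   # Gumbel(0, 1) noise
    logits = (np.log(pi) + g) / tau
    r = np.exp(logits - logits.max())
    r /= r.sum()                                       # concrete-distribution weights
    return float(np.dot(r, grid))                      # soft quantized value

def rq_infer_quantize(x, alpha, b):
    """RQ inference pass: deterministic round-to-nearest with clipping
    to the ends of the grid."""
    grid_min = -alpha * 2 ** (b - 1)
    grid_max = alpha * (2 ** (b - 1) - 1)
    return float(np.clip(alpha * np.round(x / alpha), grid_min, grid_max))

rng = np.random.default_rng(0)
y_train = rq_train_quantize(0.3, alpha=0.25, sigma=0.05, b=3, tau=1.0, rng=rng)
y_infer = rq_infer_quantize(0.3, alpha=0.25, b=3)  # rounds to the nearest grid point, 0.25
```

At a high temperature such as τ = 1.0, the training output is a blend of several grid points, which is exactly the quantization-error gap between RQ's training-time and inference-time behavior that SRQ is designed to close.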

Test error (%) for LeNet-5 on MNIST and VGG-7 on CIFAR-10. "Ann." stands for annealing the temperature τ in RQ.

Top-1/Top-5 error (%) with ResNet-18 and MobileNetV2 on ImageNet.

Test error (%) for LeNet-5 on MNIST and VGG-7 on CIFAR-10.

C EXTENSIVE COMPARISON FOR RESNET-18 AND MOBILENETV2 ON IMAGENET

Our method, SRQ + DropBits, surpasses quantization methods that keep the first or last layer in full precision, as well as the latest algorithms that quantize the weights and activations of all layers including the first and last, except QIL (Jung et al., 2019) and LSQ (Esser et al., 2020); both of these utilize a full-precision first and last layer and employ their own ResNet-18 pretrained model, which performs much better than the one available from the official PyTorch repository.

Per-layer bit-width configurations for ResNet-18 and the corresponding errors (%):
3/3/3/3/3/3/3/3/3/3/3/3/3/3/3/3/3/3/3/3/4: 36.46 / 34.94
3/3: 37.80
3/3/2/3/2/3/3/3/3/3/3/3/2/3/3/3/3/3/3/3/3: 41.01 / 40.30
3/3/2/2/2/2/3/3/3/3/3/3/2/3/3/3/3/3/3/3/3: 43.41 / 42.13

