EXPLORING THE POTENTIAL OF LOW-BIT TRAINING OF CONVOLUTIONAL NEURAL NETWORKS Anonymous

Abstract

In this paper, we propose a low-bit training framework for convolutional neural networks. Our framework focuses on reducing the energy and time consumption of convolution kernels by quantizing all the convolutional operands (activation, weight, and error) to low bit-width. Specifically, we propose a multi-level scaling (MLS) tensor format, in which the element-wise bit-width can be largely reduced to simplify floating-point computations to nearly fixed-point. Then, we describe the dynamic quantization and the low-bit tensor convolution arithmetic that efficiently leverage the MLS tensor format. Experiments show that our framework achieves a better trade-off between accuracy and bit-width than previous methods. When training ResNet-20 on CIFAR-10, all convolution operands can be quantized to a 1-bit mantissa and 2-bit exponent while retaining the same accuracy as full-precision training. When training ResNet-18 on ImageNet, with a 4-bit mantissa and 2-bit exponent, our framework keeps the accuracy loss below 1%. Energy consumption analysis shows that our design can achieve over 6.8× higher energy efficiency than training with floating-point arithmetic.

1. INTRODUCTION

Convolutional neural networks (CNNs) have achieved state-of-the-art performance in many computer vision tasks, such as image classification (Krizhevsky et al., 2012) and object detection (Redmon et al., 2016; Liu et al., 2016). However, deep CNNs are both computation- and storage-intensive. The training process can consume up to hundreds of ExaFLOPs of computation and tens of GBytes of storage (Simonyan & Zisserman, 2014), posing a tremendous challenge for training in resource-constrained environments. At present, the most common training method is to use GPUs, but this consumes a large amount of energy. The power of a running GPU is about 250W, and it usually takes more than 10 GPU-days to train one CNN model on ImageNet (Deng et al., 2009). This makes AI applications expensive and not environment-friendly.

Table 1: The number of different operations in the training process (batch size = 1). Abbreviations: "EW-Add": element-wise addition; "F": forward pass; "B": backward pass.

Reducing the precision of NNs has drawn great attention since it can reduce both the storage and computational complexity. It has been pointed out that the power consumption and circuit area of fixed-point multiplication and addition units are greatly reduced compared with floating-point ones (Horowitz, 2014). Many studies (Jacob et al., 2017a; Dong et al., 2019; Banner et al., 2018b) focus on amending the training process to acquire a reduced-precision model with higher inference efficiency. Besides the studies on improving inference efficiency, there exist studies that accelerate the training process. Wang et al. (2018) and Sun et al. (2019) reduce the floating-point bit-width to 8 during the training process. Wu et al. (2018) implements a full-integer training procedure to reduce the cost but fails to reach acceptable performance. As shown in Tab. 1, Conv accounts for the majority of the operations in the training process.
Therefore, this work aims at simplifying convolution to low-bit operations while retaining performance similar to the full-precision baseline. The contributions of this paper are:

1. We propose a low-bit training framework to improve the energy efficiency of CNN training. We design a low-bit tensor format with multi-level scaling (MLS format), which strikes a better trade-off between accuracy and bit-width while taking hardware efficiency into consideration. The multi-level scaling technique extracts the common exponent of tensor elements as much as possible to reduce the element-wise bit-width, thus improving the energy efficiency. To leverage the MLS format efficiently, we develop the corresponding dynamic quantization procedure and MLS tensor convolution arithmetic.

2. Extensive experiments demonstrate the effectiveness of our low-bit training framework. Only a 1-bit mantissa and 2-bit exponent are needed to train ResNet-20 on CIFAR-10 while retaining the same accuracy as full-precision training. On ImageNet, a 4-bit mantissa and 2-bit exponent suffice for training ResNet-18 with an accuracy loss within 1%. Our method achieves higher energy efficiency using fewer bits than previous floating-point training methods, and better accuracy than previous fixed-point training methods.

3. We estimate the hardware energy of implementing the MLS convolution arithmetic. Using our MLS tensor format, the energy efficiency of convolution can be improved by over 6.8× compared with full-precision training, and by over 1.2× compared with previous low-bit training methods.

2. RELATED WORK

2.1. POST-TRAINING QUANTIZATION

Earlier quantization methods like (Han et al., 2015) focused on post-training quantization: they quantized a pre-trained full-precision model using a codebook generated by clustering or other criteria (e.g., SQNR (Lin et al., 2015), entropy (Park et al., 2017)). Banner et al. (2018b) selected the quantization bit-width and clipping value for each channel through analytical investigation. Jacob et al. (2017b) developed an integer-arithmetic convolution for efficient inference, but it is hard to use in training because the scale of the output tensor must be known before the calculation. These quantization methods need pretrained models and cannot accelerate the training process.

2.2. QUANTIZATION-AWARE TRAINING

Quantization-aware training considers quantization effects in the training process. Some methods train ultra-low-bit networks such as binary (Rastegari et al., 2016) or ternary (Li et al., 2016) networks with a layer-wise scaling factor. Although follow-up studies (Liu et al., 2020; Qin et al., 2019) have proposed training techniques to improve the performance of binary networks, the extremely low bit-width still causes notable performance degradation. Other methods seek to retain accuracy with relatively higher precision, such as 8-bit (Jacob et al., 2017a). Gysel et al. (2018) developed a GPU-based training framework to obtain dynamic fixed-point models. These methods focus on accelerating the inference process; the training process still uses floating-point operations.

2.3. LOW-BIT TRAINING

To accelerate the training process, studies have focused on designing better floating-point data formats. Dillon et al. (2017) proposed a novel 16-bit floating-point format that is more suitable for CNN training, while Köster et al. (2017) proposed Flexpoint, which contains a 16-bit mantissa and a 5-bit tensor-shared exponent (scale), similar to the dynamic fixed-point format proposed by Gysel et al. (2018). Recently, 8-bit floating-point formats (Wang et al., 2018; Sun et al., 2019) were used with chunk-based accumulation and a hybrid format to solve swamping.

Some studies used fixed-point formats in both the forward and backward processes (Zhou et al., 2016). Wu et al. (2018) and Yang et al. (2020) implemented full-integer training frameworks for integer-arithmetic machines. However, their methods caused notable accuracy degradation. Banner et al. (2018a) used 8-bit and 16-bit quantization based on integer arithmetic (Jacob et al., 2017b) to achieve accuracy comparable with the full-precision baseline, but as discussed earlier, that arithmetic is not well suited for training. These methods reduce both the training and inference costs. In this paper, we seek to strike a better trade-off between accuracy and bit-width.

3. MULTI-LEVEL SCALING LOW-BIT TENSOR FORMAT

In this paper, we denote the filters and feature map of the convolution operation as weight (W) and activation (A), respectively. In the back-propagation, the gradients of feature map and weights are denoted as error (E) and gradient (G), respectively.

3.1. MAPPING FORMULA OF THE QUANTIZATION SCHEME

In quantized CNNs, floating-point values are quantized to a fixed-point representation. In a commonly used scheme (Jacob et al., 2017b), the mapping function is float = scale × (Fix + Bias), in which scale and Bias are shared within one tensor. However, since the data distribution changes over time during training, one cannot simplify the Bias calculation as Jacob et al. (2017b) did. Thus, we adopt an unbiased quantization scheme and extend the scaling factor to three levels for better representation ability. The mapping formula of our quantization scheme is

X̂[i, j, k, l] = S_s[i, j, k, l] × S_t × S_g[i, j] × X̄[i, j, k, l]    (1)

where [·] denotes the indexing operation, S_s is a sign tensor, S_t is a tensor-wise scaling factor shared within one tensor, and S_g is a group-wise scaling factor shared within one group, which is a structured subset of the tensor. Our paper considers three grouping dimensions: 1) grouping by the 1st dimension of the tensor, 2) by the 2nd dimension, or 3) by the 1st and 2nd dimensions simultaneously. S_t, S_g, and X̄ use the same format, which we refer to as ⟨E, M⟩: a customized floating-point format with an E-bit exponent and an M-bit mantissa (no sign bit). A value in the ⟨E, M⟩ format is

float = I2F(Man, Exp) = Frac × 2^(−Exp) = (1 + Man / 2^M) × 2^(−Exp)    (2)

where Man and Exp are the M-bit mantissa and E-bit exponent, and Frac ∈ [1, 2) is the fraction.
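As a concrete illustration, the ⟨E, M⟩ decoding of Eq. 2 can be sketched in a few lines. The function name `i2f` and its Python form are ours, not the paper's:

```python
def i2f(man: int, exp: int, m_bits: int) -> float:
    """Decode a value in the <E, M> format of Eq. 2: (1 + man / 2^M) * 2^(-exp).
    The sign is stored separately (in the sign tensor S_s), so the result is
    always positive."""
    frac = 1.0 + man / (1 << m_bits)  # fraction in [1, 2)
    return frac * 2.0 ** (-exp)
```

For example, with a 3-bit mantissa, `i2f(4, 1, 3)` decodes to 1.5 × 2⁻¹ = 0.75.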

3.2. DETAILS ON THE SCALING FACTORS

The first-level tensor-wise scaling factor S_t is an ordinary floating-point number (⟨E_t, M_t⟩ = ⟨8, 23⟩), the same as the unquantized data in training, to retain as much precision as possible. Considering the actual hardware implementation cost, there are some restrictions on the second-level group-wise scaling factor S_g. Since calculation results from different tensor groups need to be aggregated, using S_g in an ordinary floating-point format would require more expensive conversions and operations in the hardware implementation. We propose two hardware-friendly group-wise scaling schemes, whose formats are denoted ⟨E_g, 0⟩ and ⟨E_g, 1⟩. A scaling factor in the ⟨E_g, 0⟩ format is simply a power of two, which can be implemented easily as a shift in hardware. From Eq. 2, a value S_g = I2F(Man_g, Exp_g) in the ⟨E_g, 1⟩ format can be written as

S_g = (1 + Man_g / 2) × 2^(−Exp_g) = { 2^(−Exp_g) + 2^(−Exp_g − 1)   if Man_g = 1
                                       2^(−Exp_g)                    if Man_g = 0    (3)

which is a sum of at most two shifts and can be implemented with small hardware overhead. The third-level scaling factor S_x = I2F(0, Exp_x) = 2^(−Exp_x) is the element-wise exponent in X̄ = S_x (1 + Man_x / 2^(M_x)), and we can see that the elements of X̄ in Eq. 1 are in an ⟨E_x, M_x⟩ format. The specific values of E_x and M_x determine the type and cost of the basic multiply-accumulate (MAC) operation, which will be discussed later in Sec. 5.2.
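The shift-based scaling of Eq. 3 can be emulated in software as follows. This is our illustrative sketch: real hardware would use integer shifts, which we stand in for with powers of two in floating point:

```python
def group_scale(p: float, exp_g: int, man_g: int = 0) -> float:
    """Apply a group-wise scaling factor in the <Eg, 0> (man_g fixed to 0) or
    <Eg, 1> format to a partial result p. Per Eq. 3, the <Eg, 1> case with
    man_g = 1 is the sum of two shifted copies of p."""
    if man_g:
        # S_g = 2^-exp_g + 2^-(exp_g + 1), i.e. two shift-adds
        return p * 2.0 ** (-exp_g) + p * 2.0 ** (-exp_g - 1)
    # S_g = 2^-exp_g, a single shift
    return p * 2.0 ** (-exp_g)
```

For example, scaling a partial sum of 8 by S_g = 1.5 × 2⁻¹ (`exp_g=1, man_g=1`) gives 8 × 0.5 + 8 × 0.25 = 6.0.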

4. LOW-BIT TRAINING FRAMEWORK OF CNN

As shown in Fig. 2, the convolution operation (Conv) is followed by batch normalization (BN) and nonlinear operations (e.g., ReLU, pooling). Since Convs account for the majority of the computational cost, we apply quantization right before the Convs in the training process, covering three types: Conv(W, A), Conv(W, E), and Conv(A, E). Note that the output data of the Convs is in floating-point format, and the other operations operate on floating-point numbers. An iteration of this low-bit training process is summarized in Appendix Alg. 2, in which the major differences from the vanilla training process are the dynamic quantization procedure DynamicQuantization and the low-bit tensor convolution arithmetic LowbitConv.

4.1. DYNAMIC QUANTIZATION

The dynamic quantization converts floating-point tensors to MLS tensors by calculating the scaling factors S_s, S_t, S_g and the elements X̄, as shown in Alg. 1. Exponent(·) and Fraction(·) obtain the exponent (an integer) and fraction (a value in [0, 1)) of a floating-point number. While calculating the quantized elements X̄, we adopt stochastic rounding (Gupta et al., 2015) and implement it using a uniformly distributed random tensor r ∼ U[−1/2, 1/2]:

StochasticRound(x, r) = NearestRound(x + r) = { ⌈x⌉ with probability x − ⌊x⌋
                                                ⌊x⌋ with probability ⌈x⌉ − x    (4)

Note that Alg. 1 describes how we simulate the dynamic quantization process using floating-point arithmetic. In the hardware design, the exponent and mantissa are obtained directly, and the clip/quantize operations are done by taking bits out of a machine number.

Algorithm 1: Software-simulated dynamic quantization process
Input: X: float 4-d tensor; Axis: grouping dimension; R: U[−1/2, 1/2] distributed random tensor; ⟨E_g, M_g⟩: format of group-wise scaling factors; ⟨E_x, M_x⟩: format of each element
Output: S_s: sign tensor; S_t: tensor-wise scaling factor; S_g: group-wise scaling factors; X̄: quantized elements
  /* calculating scaling factors */
  S_s = Sign(X)
  S_r = Max(Abs(X), axis = Axis)              // group-wise maximum magnitude
  S_t = Max(S_r)                              // tensor-wise maximum magnitude
  S_gf = S_r ÷ S_t                            // group-wise scaling factors before quantization (< 1)
  Exp_g, Frac_g = Exponent(S_gf), Fraction(S_gf)
  Exp_g = Clip(Exp_g, 1 − 2^(E_g), 0)         // clip Exp_g to E_g bits
  Frac_g = Ceil(Frac_g × 2^(M_g)) ÷ 2^(M_g)   // quantize Frac_g to M_g bits
  S_g = Frac_g × 2^(Exp_g)                    // group-wise scaling factors after quantization
  /* calculating elements */
  X_f = Abs(X) ÷ S_g ÷ S_t                    // divide out the scaling factors
  Exp_x, Frac_x = Exponent(X_f), Fraction(X_f)
  /* quantize Frac_x to M_x bits with underflow handling† */
  E_xmin = 1 − 2^(E_x)
  Frac_xs = Frac_x × 2^(M_x) if not underflow, else Frac_x × 2^(M_x − E_xmin + Exp_x)
  Frac_xint = Clip(StochasticRound(Frac_xs, R), 0, 2^(M_x) − 1)
  Frac_x = Frac_xint × 2^(−M_x) if not underflow, else Frac_xint × 2^(−M_x + E_xmin − Exp_x)
  Exp_x = Clip(Exp_x, E_xmin, −1)
  X̄ = Frac_x × 2^(Exp_x)                     // elements after quantization
  Return S_s, S_t, S_g, X̄

†: The underflow handling follows the IEEE 754 standard (Hough, 2019).
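A simplified NumPy simulation of this procedure is sketched below. This is our own simplification, not the paper's exact code: it drops the IEEE-style underflow handling, represents elements as (1 + man/2^Mx) × 2^exp with a real-valued (unbiased) exponent, and assumes no all-zero tensor:

```python
import numpy as np

def dynamic_quantize(x, axis, e_g=8, m_g=1, e_x=2, m_x=2, rng=None):
    """Simplified simulation of the MLS dynamic quantization (sketch of Alg. 1).
    Returns (sign tensor, tensor scale, group scales, quantized elements)."""
    rng = rng or np.random.default_rng(0)
    s_s = np.sign(x)                                    # sign tensor S_s
    s_r = np.max(np.abs(x), axis=axis, keepdims=True)   # group-wise max magnitude
    s_t = np.max(s_r)                                   # tensor-wise max magnitude S_t
    s_gf = s_r / s_t                                    # unquantized group scales, <= 1
    with np.errstate(divide="ignore"):
        exp_g = np.clip(np.floor(np.log2(s_gf)), 1 - 2 ** e_g, 0)
    frac_g = np.ceil(s_gf / 2.0 ** exp_g * 2 ** m_g) / 2 ** m_g  # ceil keeps S_g >= s_gf
    s_g = np.where(s_gf > 0, frac_g * 2.0 ** exp_g, 1.0)         # guard all-zero groups
    x_f = np.abs(x) / (s_g * s_t)                       # elements now in [0, 1]
    with np.errstate(divide="ignore"):
        exp_x = np.clip(np.floor(np.log2(x_f)), 1 - 2 ** e_x, 0)
    man = x_f / 2.0 ** exp_x - 1.0                      # mantissa fraction in [0, 1)
    r = rng.uniform(-0.5, 0.5, size=x_f.shape)          # Eq. 4: stochastic rounding
    man_int = np.clip(np.round(man * 2 ** m_x + r), 0, 2 ** m_x - 1)
    x_bar = (1.0 + man_int / 2 ** m_x) * 2.0 ** exp_x   # quantized elements X_bar
    return s_s, s_t, s_g, x_bar
```

Multiplying the four returned factors element-wise (as in Eq. 1) reconstructs an approximation of the input; for inputs that are exactly representable in the chosen format, the reconstruction is exact.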

4.2. LOW-BIT TENSOR CONVOLUTION ARITHMETIC

In this section, we describe how to perform convolution with two low-bit MLS tensors. Denoting the input channel number as C and the kernel size as K, the original formula of convolution in training is

Z[n, co, x, y] = Σ_{ci=0}^{C−1} Σ_{i=0}^{K−1} Σ_{j=0}^{K−1} W[co, ci, i, j] × A[n, ci, x + i, y + j]    (5)

We take Conv(W, A) as the example to describe the low-bit convolution arithmetic; the other two types of convolution can be implemented similarly. Using the MLS data format and denoting the corresponding values (scaling factors S, exponents Exp, fractions Frac) of W and A by the superscripts (w) and (a), one output element Z[n, co, x, y] of Conv(W, A) can be calculated as

Z[n, co, x, y] = Σ_{ci} Σ_{i} Σ_{j} S_t^(w) S_g^(w)[co, ci] W̄[co, ci, i, j] × S_t^(a) S_g^(a)[n, ci] Ā[n, ci, x + i, y + j]
              = S_t^(w) S_t^(a) Σ_{ci} [ S_g^(w)[co, ci] S_g^(a)[n, ci] Σ_{i} Σ_{j} W̄[co, ci, i, j] Ā[n, ci, x + i, y + j] ]
              = S_t^(z) Σ_{ci} S^(p)[n, co, ci] P[n, co, ci]    (6)

where S_t^(z) = S_t^(w) S_t^(a) and S^(p)[n, co, ci] = S_g^(w)[co, ci] S_g^(a)[n, ci]. Eq. 6 shows that the accumulation consists of intra-group MACs that calculate P[n, co, ci] and inter-group MACs that calculate Z. The intra-group calculation of P[n, co, ci] is

P[n, co, ci] = Σ_{i,j=0}^{K−1} Frac^(w)[co, ci, i, j] Frac^(a)[n, ci, i, j] × 2^(Exp^(w)[co, ci, i, j] + Exp^(a)[n, ci, i, j])    (7)

where the Frac and Exp terms are fractions and exponents, whose precisions are (M_x + 1) bits and E_x bits, respectively. Thus the intra-group calculation contains an (M_x + 1)-bit multiplication and a (2^(E_x+1) − 4)-bit shift, and the (2M_x + 2^(E_x+1) − 2)-bit integer results are accumulated with sufficient bit-width to get the partial sum P. In some previous work (Wang et al., 2018; Sun et al., 2019), the accumulator has to be floating-point, since E_x = 5 is used. As for the inter-group calculation, each element in S^(p) is an ⟨E, 2⟩ number obtained by multiplying two ⟨E, 1⟩ numbers. Omitting the n index for simplicity, the calculation can be written as

Z[co, x] = Σ_{ci=0}^{C−1} S^(p)[co, ci] P[x, ci]
         = Σ_{ci} { P[x, ci] 2^(−Exp^(p)[co, ci])                                         if Man^(p)[co, ci] = 00
                    P[x, ci] 2^(−Exp^(p)[co, ci]) + P[x, ci] 2^(−Exp^(p)[co, ci] − 1)     if Man^(p)[co, ci] = 01/10
                    P[x, ci] 2^(1 − Exp^(p)[co, ci]) + P[x, ci] 2^(−Exp^(p)[co, ci] − 2)  if Man^(p)[co, ci] = 11    (8)

Due to the special format of S^(p), the calculation of the floating-point Z can be implemented efficiently in hardware, in which no floating-point multiplication is involved.
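The shift-only inter-group accumulation of Eq. 8 can be sketched as follows. The function name is ours, and the hardware shifts are emulated with powers of two; the 2-bit mantissa of S^(p) is the pair of ⟨E, 1⟩ mantissa bits, so its fraction is 1 (00), 1.5 (01/10), or 2.25 (11):

```python
def intergroup_accumulate(p, exp_p, man_p):
    """Accumulate integer partial sums p[ci], each scaled by S_p[ci] in the
    <E, 2> format of Eq. 8, using at most two shift-adds per term."""
    z = 0.0
    for p_ci, e, m in zip(p, exp_p, man_p):
        if m == 0b00:             # frac 1.0:  S_p = 2^-e
            z += p_ci * 2.0 ** (-e)
        elif m in (0b01, 0b10):   # frac 1.5:  S_p = 2^-e + 2^-(e+1)
            z += p_ci * 2.0 ** (-e) + p_ci * 2.0 ** (-e - 1)
        else:                     # frac 2.25: S_p = 2^(1-e) + 2^-(e+2)
            z += p_ci * 2.0 ** (1 - e) + p_ci * 2.0 ** (-e - 2)
    return z
```

For instance, a single partial sum of 4 with exponent 1 and mantissa 11 contributes 4 × 2.25 × 2⁻¹ = 4.5, computed as 4 × 2⁰ + 4 × 2⁻³.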

5. EXPERIMENTS

We train ResNet (He et al., 2016) on CIFAR-10 (Krizhevsky, 2010) and ImageNet (Deng et al., 2009) with our low-bit training framework, experimenting with MLS tensor formats under different ⟨E_x, M_x⟩ configurations. We adopt the same quantization bit-width for W, A, and E so that the hardware design stays simple. The training results on CIFAR-10 and ImageNet are shown in Tab. 2. We can see that our method achieves smaller accuracy degradation using lower bit-width. A previous study (Zhou et al., 2016) found that quantizing E to a low bit-width hurts performance; however, our method can quantize E to M_x = 1 or 2 on CIFAR-10 with a small accuracy drop from 92.45% to 91.97%. On ImageNet, the accuracy degradation of our method is rather minor under 8-bit quantization (a 0.6% accuracy drop from 69.1% to 68.5%). In the cases with lower bit-width, our method achieves higher accuracy (66.5%) with only 4 bits than Banner et al. (2018a) with 8 bits (66.4%). With the ⟨2, 4⟩ data format, the accuracy loss is less than 1%. In this case, the bit-width 2M_x + 2^(E_x+1) − 2 = 14, which means that the accumulation can be conducted using 16-bit integers instead of floating-point (Mellempudi et al., 2019).
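The accumulator bit-width claim can be checked with a one-line helper (our sketch, following the formula in Sec. 5.2):

```python
def intragroup_product_bits(e_x: int, m_x: int) -> int:
    """Bit-width of one intra-group product: an (Mx+1)-bit by (Mx+1)-bit
    multiply yields 2*Mx + 2 bits, and the exponent shift adds up to
    2^(Ex+1) - 4 bits, for a total of 2*Mx + 2^(Ex+1) - 2 bits."""
    return 2 * m_x + 2 ** (e_x + 1) - 2
```

With ⟨E_x, M_x⟩ = ⟨2, 4⟩ this gives 14 bits, which fits a 16-bit integer accumulator; with E_x = 5 (as in prior 8-bit floating-point work) the products span far more bits, which is why those designs fall back to floating-point accumulation.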

5.1. ABLATION STUDY

5.1.1. GROUPING DIMENSION

Group-wise scaling is beneficial because the data ranges vary across different groups. We compare the average relative quantization errors (AREs) of the three grouping dimensions (Sec. 3.1) with the ⟨8, 1⟩ group-wise scaling format and the ⟨0, 3⟩ element format. The first row of Fig. 3 shows that the AREs are smaller when each tensor is split into N × C groups. Furthermore, we compare these grouping dimensions in the training process. The results in the first section of Tab. 3 show that the reduction of AREs is important for the accuracy of low-bit training, and that the low-bit training accuracy is higher when tensors are split into N × C groups. We can also see that M_g = 1 is important for the performance, especially with low M_x (e.g., when M_x = 1).

Figure 3: Average relative quantization errors (AREs) of W, E, A (left, middle, right) in each layer when training ResNet-20 on CIFAR-10. X axis: layer index. Row 1: different grouping dimensions (⟨0, 3⟩ formatted X̄, ⟨8, 1⟩ formatted S_g); Row 2: different E_x (⟨E_x, 3⟩ formatted X̄, no group-wise scaling); Row 3: different E_x (⟨E_x, 3⟩ formatted X̄, ⟨8, 1⟩ formatted S_g, N × C groups).

5.1.2. ELEMENT-WISE EXPONENT

To demonstrate the effectiveness of the element-wise exponent, we compare the AREs of quantization with different E_x without group-wise scaling; the results are shown in the second row of Fig. 3. We can see that using more exponent bits results in larger dynamic ranges and smaller AREs, and with larger E_x, the AREs of different layers are closer. Besides the ARE evaluation, Tab. 3 also shows that larger E_x achieves better accuracies, especially when M_x is extremely small. As shown in Fig. 3 Row 3 and Tab. 3, jointly using the group-wise scaling and the element-wise exponent further improves the ARE and accuracy. We can also see that the group-wise scaling is essential for simplifying the floating-point accumulator to fixed-point, since it allows a small E_x (e.g., 2). Our method not only makes the FP MUL narrower than 8 bits, but also simplifies the local accumulator.

5.2. HARDWARE ENERGY ESTIMATION

Fig. 4 shows a typical convolution hardware architecture, which consists of three main components: local multiplication, local accumulation, and an addition tree. Our algorithm mainly improves the local MAC. Compared with the full-precision design, we simplify the FP MUL to a bit-width of less than 8 and the local FP accumulator to a 16-bit integer one. According to the data reported by Yang et al. (2020), their energy efficiencies are at least 7× and 20× higher than the full-precision design, respectively. While our method significantly reduces the cost of the convolution, it also introduces some overhead: 1) The group-wise maximum statistics (Line 2) and scaling-factor division (Line 9) in Alg. 1 account for the main overhead of dynamic quantization. The cost of these two operations is comparable with that of a batch normalization operation, which is relatively small compared with convolution since the number of operations is smaller by orders of magnitude (Tab. 1). 2) The group-wise scaling factors introduce additional scaling. Fortunately, when using the ⟨E_g, 0⟩ or ⟨E_g, 1⟩ format, we can implement it efficiently with shifts (see Eq. 3). To summarize, the introduced overhead is small compared with the reduced cost. According to the numbers of different operations in the training process (Tab. 1) and the energy consumption of each operation (Appendix Tab. 4) (Horowitz, 2014), we estimate that our convolution arithmetic is over 6.8× more energy-efficient than the full-precision one when training ResNet (see Appendix for details). Due to the simplified integer accumulator, our energy efficiency is at least 24% higher than that of other low-bit floating-point training algorithms (Mellempudi et al., 2019; Wang et al., 2018).

6. CONCLUSION

This paper proposes a low-bit training framework that enables training a CNN with lower bit-width convolutions while retaining accuracy. Specifically, we design a multi-level scaling (MLS) tensor format and develop the corresponding quantization procedure and low-bit convolution arithmetic. Experimental results and the energy analysis demonstrate the effectiveness of our method.

A LOW-BIT TRAINING FRAMEWORK

The t-th iteration of low-bit training with vanilla SGD (Alg. 2) proceeds as follows:

  /* forward propagation */
  for l in 1 : L do
    qW^l = DynamicQuantization(W^l)
    qA^(l−1) = DynamicQuantization(A^(l−1))
    Z^l = LowbitConv(qW^l, qA^(l−1))
    Y^l = BatchNorm(Z^l)
    A^l = Activation(Y^l)
  ∂loss/∂A^L = Criterion(A^L, T)
  /* backward propagation */
  for l in L : 1 do
    ∂loss/∂Y^l = ∂loss/∂A^l × Activation′(Y^l)
    ∂loss/∂Z^l = ∂loss/∂Y^l × ∂Y^l/∂Z^l
    qE^l = DynamicQuantization(∂loss/∂Z^l)
    G^l = LowbitConv(qE^l, qA^(l−1))
    W^l_(t+1) = W^l_t − lr × G^l
    if l is not 1 then
      ∂loss/∂qA^(l−1) = LowbitConv(qE^l, qW^l)
      ∂loss/∂A^(l−1) = STE(∂loss/∂qA^(l−1))
  Return W^(1:L)_(t+1)

B EXPERIMENTAL SETUP

In all the experiments, the first and the last layer are left unquantized following previous studies (Zhou et al., 2016; Mellempudi et al., 2019; Sun et al., 2019) . For both CIFAR-10 and ImageNet, SGD with momentum 0.9 and weight decay 5e-4 is used, and the initial learning rate is set to 0.1. We train the models for 90 epochs on ImageNet, and decay the learning rate by 10 every 30 epochs. On CIFAR-10, we train the models for 160 epochs and decay the learning rate by 10 at epoch 80 and 120. 
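The step-decay schedules above can be written as a small helper. The function name and structure are ours; the milestones follow the text:

```python
def learning_rate(epoch: int, dataset: str = "cifar10", base_lr: float = 0.1) -> float:
    """Step-decay learning-rate schedule from the experimental setup:
    CIFAR-10 decays by 10x at epochs 80 and 120 (160 epochs total);
    ImageNet decays by 10x every 30 epochs (90 epochs total)."""
    if dataset == "cifar10":
        milestones = [80, 120]
    else:  # imagenet
        milestones = [30, 60]
    drops = sum(epoch >= m for m in milestones)  # number of decays so far
    return base_lr * (0.1 ** drops)
```

For example, on CIFAR-10 the learning rate is 0.1 for epochs 0-79, 0.01 for epochs 80-119, and 0.001 afterwards.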

C GROUP-WISE SCALING

Group-wise scaling is beneficial because the data ranges vary across different groups, as shown in Fig. 5. The blue lines show the maximum value in each group when A and E are grouped by channel or by sample. If we used the overall maximum value (green lines in Fig. 5) as the tensor-wise scale, many small elements would be swamped. Usually, in over half of the groups, all elements are smaller than half of the overall maximum (red line). Thus, to fully exploit the bit-width, it is natural to use group-wise scaling factors.

D ELEMENT-WISE EXPONENT

Fig. 6 shows the performance of training ResNet-20 on CIFAR-10 with different ⟨E_x, M_x⟩ configurations. We can see that, when the mantissa bit-width M_x is extremely low (e.g., 1), the element-wise exponent bit-width E_x is essential for achieving acceptable performance.

E ENERGY EFFICIENCY ESTIMATION

Tab. 4 (Horowitz, 2014) reports that the energy consumption of a FP32 multiplication (FP32 MUL) is about 4 times that of a FP32 addition (FP32 ADD). Denoting the energy consumption of FP32 ADD as C and FP32 MUL as 4C, and following (Yang et al., 2020), we estimate the cost of an FP8 MUL and an INT16 ADD as 4/7 C and 1/20 C, respectively. Then, using the operation statistics in Tab. 1, we can calculate the energy consumption of one training iteration and estimate the energy-efficiency improvement ratio:

EnergyRatio = [4(#MUL) + 1(#LocalACC) + 1(#TreeADD)] / [4/7(#MUL) + 1/20(#LocalADD + #Scale) + 1(#TreeACC)] ≈ 6.8    (9)

To evaluate the energy-efficiency advantage more accurately, we implemented the RTL designs of the two MAC units in Fig. 4 and used Design Compiler to simulate the area and power; the power results are shown in Tab. 5. The simulated results of the RTL implementation are similar to our estimation above, both showing the energy efficiency of our framework.
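Eq. 9 can be reproduced numerically. The operation counts passed in below are hypothetical placeholders; the actual counts come from Tab. 1, which is not reproduced in the text:

```python
def energy_ratio(n_mul: float, n_local: float, n_tree: float, n_scale: float) -> float:
    """Eq. 9 with relative op costs (FP32 ADD = 1): FP32 MUL = 4,
    low-bit FP MUL = 4/7, INT16 ADD = 1/20, FP tree accumulation = 1.
    Arguments are the per-iteration operation counts."""
    full_precision = 4 * n_mul + 1 * n_local + 1 * n_tree
    ours = (4 / 7) * n_mul + (1 / 20) * (n_local + n_scale) + 1 * n_tree
    return full_precision / ours
```

With multiplication-dominated counts (as in convolution, where the local MACs dwarf the tree additions and scaling operations), the ratio lands well above the reported 6.8× bound.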



Figure 1: Illustration of the multi-level scaling (MLS) low-bit tensor data format.

Figure 2: Computation flow of the proposed low-bit training.

Figure 4: The convolution hardware architecture. (a) Previous studies (Mellempudi et al., 2019) developed low-bit floating-point multiplication (FP MUL) (e.g., 8-bit), but FP32 accumulations are still needed. (b) Our method not only makes the FP MUL narrower than 8 bits, but also simplifies the local accumulator.

Figure 5: Maximum value of each group of A (left two) and E (right two). (a)(c): Grouped by channel; (b)(d): Grouped by sample.

Figure 6: Performance with different ⟨E_x, M_x⟩ configurations; no group-wise scaling is used.

Table 2: Comparison of low-bit training methods on CIFAR-10 and ImageNet. A single number in the bit-width column stands for M_x, with the corresponding E_x = 0. "f" indicates that FP numbers are used.

Table 3: Ablation study (ResNet-20 on CIFAR-10). "Div." means that the training failed to converge.

Algorithm 2: The t-th iteration of low-bit training with vanilla SGD

Table 4: The cost estimation of primitive operations with a 45nm process at 0.9V (Horowitz, 2014).

Table 5: The power evaluation results (mW) of MAC units with different arithmetic, with a TSMC 65nm process and 100MHz clock, simulated by Design Compiler.

