EXPLORING THE POTENTIAL OF LOW-BIT TRAINING OF CONVOLUTIONAL NEURAL NETWORKS

Anonymous

Abstract

In this paper, we propose a low-bit training framework for convolutional neural networks. Our framework focuses on reducing the energy and time consumption of convolution kernels by quantizing all the convolutional operands (activation, weight, and error) to low bit-widths. Specifically, we propose a multi-level scaling (MLS) tensor format, in which the element-wise bit-width can be largely reduced, simplifying floating-point computations to nearly fixed-point ones. We then describe the dynamic quantization and the low-bit tensor convolution arithmetic that efficiently leverage the MLS tensor format. Experiments show that our framework achieves a better trade-off between accuracy and bit-width than previous methods. When training ResNet-20 on CIFAR-10, all convolution operands can be quantized to a 1-bit mantissa and 2-bit exponent while retaining the same accuracy as full-precision training. When training ResNet-18 on ImageNet with a 4-bit mantissa and 2-bit exponent, our framework keeps the accuracy loss below 1%. Energy consumption analysis shows that our design can achieve over 6.8× higher energy efficiency than training with floating-point arithmetic.

1. INTRODUCTION

Convolutional neural networks (CNNs) have achieved state-of-the-art performance in many computer vision tasks, such as image classification (Krizhevsky et al., 2012) and object detection (Redmon et al., 2016; Liu et al., 2016). However, deep CNNs are both computation- and storage-intensive. The training process can consume up to hundreds of ExaFLOPs of computation and tens of GBytes of storage (Simonyan & Zisserman, 2014), posing a tremendous challenge for training in resource-constrained environments. At present, the most common training platform is the GPU, but it consumes much energy. A running GPU draws about 250 W, and it usually takes more than 10 GPU-days to train one CNN model on ImageNet (Deng et al., 2009), which makes AI applications expensive and not environment-friendly.

Reducing the precision of NNs has drawn great attention, since it reduces both the storage requirement and the computational complexity. The power consumption and circuit area of fixed-point multiplication and addition units are greatly reduced compared with floating-point ones (Horowitz, 2014). Many studies (Jacob et al., 2017a; Dong et al., 2019; Banner et al., 2018b) focus on amending the training process to acquire a reduced-precision model with higher inference efficiency. Besides these studies on improving inference efficiency, other studies accelerate the training process itself: Wang et al. (2018) and Sun et al. (2019) reduce the floating-point bit-width to 8 during training, while Wu et al. (2018) implements a full-integer training procedure to reduce the cost but fails to reach acceptable accuracy.

As shown in Tab. 1, convolution accounts for the majority of the operations in the training process. Therefore, this work aims at simplifying convolution to low-bit operations while retaining performance similar to the full-precision baseline. The contributions of this paper are:

1. This paper proposes a low-bit training framework to improve the energy efficiency of CNN training.
We design a low-bit tensor format with multi-level scaling (the MLS format), which strikes a better trade-off between accuracy and bit-width while taking hardware efficiency into consideration. The multi-level scaling technique extracts the common exponent of tensor elements as much as possible to reduce the element-wise bit-width, thus improving energy efficiency. To leverage the MLS format efficiently, we develop the corresponding dynamic quantization and MLS tensor convolution arithmetic.

2. Extensive experiments demonstrate the effectiveness of our low-bit training framework.

Only a 1-bit mantissa and 2-bit exponent are needed to train ResNet-20 on CIFAR-10 while retaining the same accuracy as full-precision training. On ImageNet, a 4-bit mantissa and 2-bit exponent are enough for training ResNet-18, with an accuracy loss within 1%. Our method achieves higher energy efficiency using fewer bits than previous floating-point training methods, and better accuracy than previous fixed-point training methods.

3. We estimate the energy consumption of hardware implementing the MLS convolution arithmetic. Using our MLS tensor format, the energy efficiency of convolution can be improved by over 6.8× compared with full-precision training, and by over 1.2× compared with previous low-bit training methods.
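To make the multi-level scaling idea concrete, the following is a minimal NumPy sketch of quantizing a tensor with a shared power-of-two scale plus a tiny per-element exponent and mantissa. This is our own simplification, not the paper's exact MLS format: the function name, the single shared scale level, and the exponent clipping range are illustrative assumptions.

```python
import numpy as np

def mls_quantize(x, m_bits=2, e_bits=2):
    """Illustrative two-level scaling quantizer (not the paper's exact algorithm)."""
    # Level 1: tensor-shared power-of-two scale (the "common exponent").
    max_abs = np.max(np.abs(x))
    if max_abs == 0:
        return np.zeros_like(x, dtype=np.float64)
    shared_exp = np.floor(np.log2(max_abs))
    y = x / 2.0 ** shared_exp                  # magnitudes now lie in (0, 2)

    # Level 2: tiny per-element exponent (e_bits) and mantissa (m_bits).
    out = np.zeros_like(x, dtype=np.float64)
    nz = np.abs(y) > 0
    mag = np.abs(y[nz])
    elem_exp = np.clip(np.floor(np.log2(mag)), -(2 ** e_bits) + 1, 0.0)
    # Round the normalized magnitude to m_bits fractional bits.
    mant = np.round(mag / 2.0 ** elem_exp * 2 ** m_bits) / 2 ** m_bits
    out[nz] = np.sign(y[nz]) * mant * 2.0 ** (elem_exp + shared_exp)
    return out
```

Because the shared scale absorbs most of the dynamic range, each element keeps only a few private bits, which is what lets the element-wise arithmetic become nearly fixed-point.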



2. RELATED WORK

2.1 POST-TRAINING QUANTIZATION

Post-training quantization methods like (Han et al., 2015) quantize a pre-trained full-precision model using a codebook generated by clustering or other criteria (e.g., SQNR (Lin et al., 2015), entropy (Park et al., 2017)). Banner et al. (2018b) select the quantization bit-width and clipping value for each channel through an analytical investigation. Jacob et al. (2017b) developed an integer-arithmetic convolution for efficient inference, but it is hard to use in training because the scale of the output tensor must be known before the calculation. These quantization methods need pre-trained models and cannot accelerate the training process.

2.2 QUANTIZATION-AWARE TRAINING

Quantization-aware training considers quantization effects in the training process. Some methods train ultra-low-bit networks, such as binary (Rastegari et al., 2016) or ternary (Li et al., 2016) networks, with a layer-wise scaling factor. Although follow-up studies (Liu et al., 2020; Qin et al., 2019) have proposed training techniques to improve the performance of binary networks, the extremely low bit-width still causes notable performance degradation. Other methods seek to retain accuracy with relatively higher precision, such as 8 bits (Jacob et al., 2017a). Gysel et al. (2018) developed a GPU-based training framework to obtain dynamic fixed-point models. These methods focus on accelerating the inference process; the training process still uses floating-point operations.

2.3 LOW-BIT TRAINING

To accelerate the training process, studies have focused on designing better floating-point data formats. Dillon et al. (2017) proposed a novel 16-bit floating-point format that is more suitable for CNN training, while Köster et al. (2017) proposed Flexpoint, which contains a 16-bit mantissa and a 5-bit tensor-shared exponent (scale), similar to the dynamic fixed-point format proposed by Gysel et al. (2018). Recently, 8-bit floating-point formats (Wang et al., 2018; Sun et al., 2019) were used with chunk-based accumulation and a hybrid format to solve swamping. Wu et al. (2018) goes further and implements a full-integer training procedure to reduce the cost, but fails to achieve acceptable performance.
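Swamping, where small addends are rounded away once the accumulator grows large, is easy to reproduce. The sketch below contrasts naive fp16 accumulation with chunk-based accumulation in the spirit of Wang et al. (2018); the chunk size and the fp32 combination of partial sums are illustrative choices, not the paper's exact configuration.

```python
import numpy as np

def naive_sum_fp16(values):
    """Accumulate entirely in fp16: every add rounds to half precision."""
    acc = np.float16(0.0)
    for v in values:
        acc = np.float16(acc + v)  # once acc is large, small v rounds away
    return float(acc)

def chunked_sum_fp16(values, chunk=64):
    """Sum short chunks in fp16, then combine partials at higher precision."""
    partials = []
    for i in range(0, len(values), chunk):
        # Each partial sum stays small relative to its addends, avoiding swamping.
        partials.append(naive_sum_fp16(values[i:i + chunk]))
    # Combining partials in fp32 is the "hybrid format" idea.
    return float(sum(np.float32(p) for p in partials))

# 8192 addends of 0.25: the exact sum is 2048, but naive fp16
# accumulation stalls long before reaching it.
vals = [np.float16(0.25)] * 8192
```

Running this shows the naive accumulator saturating well below the exact sum of 2048, while the chunked version recovers it, which is why low-bit training papers pair reduced-precision multiplies with careful accumulation.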

Table 1: The number of different operations in the training process (batch size = 1).

