ALL-YOU-CAN-FIT 8-BIT FLEXIBLE FLOATING-POINT FORMAT FOR ACCURATE AND MEMORY-EFFICIENT INFERENCE OF DEEP NEURAL NETWORKS

Anonymous

Abstract

Modern deep neural network (DNN) models generally require a huge number of weight and activation values to achieve good inference outcomes. Those data inevitably demand a massive off-chip memory capacity/bandwidth, and the situation gets even worse if they are represented in high-precision floating-point formats. Efforts have been made to represent those data in various 8-bit floating-point formats; nevertheless, a notable accuracy loss is still unavoidable. In this paper we introduce an extremely flexible 8-bit floating-point (FFP8) format whose defining factors (the bit widths of the exponent/fraction fields, the exponent bias, and even the presence of the sign bit) are all configurable. We also present a methodology to properly determine those factors so that the accuracy of model inference can be maximized. This methodology is founded on a key observation: both the maximum magnitude and the value distribution are quite dissimilar between weights and activations in most DNN models. Experimental results demonstrate that the proposed FFP8 format achieves an extremely low accuracy loss of 0.1% ∼ 0.3% for several representative image classification models even without model retraining. Moreover, it is easy to turn a classical floating-point processing unit into an FFP8-compliant one, and the extra hardware cost is minor.

1. INTRODUCTION

With the rapid progress of deep neural network (DNN) techniques, innovative applications of deep learning in various domains, such as computer vision and natural language processing (NLP), are becoming more mature and powerful (Huang et al., 2017; Vaswani et al., 2017; Szegedy et al., 2015; Howard et al., 2017; He et al., 2017; Krizhevsky et al., 2012). To improve model accuracy, one of the most commonly used strategies is to add more layers to a network, which inevitably increases the number of weight parameters and activation values of a model. Today, it is typical to store weights and activations in the 32-bit IEEE single-precision floating-point format (FP32). These 32-bit data accesses thus become an extremely heavy burden on the memory subsystem of a typical edge or AIoT device, which often has very limited memory capacity and bandwidth. Even for high-end GPU or dedicated network processing unit (NPU) based computing platforms, off-chip DRAM bandwidth remains a major performance bottleneck.

To relieve this memory-bandwidth bottleneck, several approaches have been explored, including (but not limited to) weight pruning (Li et al., 2017; Han et al., 2016), weight/activation quantization (Courbariaux et al., 2015; Hubara et al., 2017), and probably the most straightforward one: storing weights and activations in a shorter format (Köster et al., 2017). One obvious choice is the 16-bit IEEE half-precision floating-point format (FP16), which consists of 1 sign bit, 5 exponent bits, and 10 fraction bits. In addition, Google proposed another 16-bit format, named Brain Floating-Point Format (BFP16), obtained simply by truncating the lower half of the FP32 format (Kalamkar et al., 2019). Compared with FP16, BFP16 allows a significantly wider dynamic value range at the cost of 3 bits of precision. Note that the exponent bias in all of the above formats is not a free design parameter.
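As a concrete illustration (our sketch, not taken from any of the cited works), the BFP16 truncation described above and the dynamic range implied by a format's exponent width and bias can be computed as follows; `fp_range` assumes an IEEE-style encoding whose top exponent code is reserved for infinities/NaNs:

```python
import struct

def fp32_to_bfp16_bits(x: float) -> int:
    """BFP16 is the upper half of FP32: reinterpret the value as a 32-bit
    pattern and keep the top 16 bits (truncation, no rounding)."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    return bits >> 16

def bfp16_bits_to_fp32(b: int) -> float:
    """Pad the 16-bit BFP16 pattern with zero fraction bits to get FP32 back."""
    return struct.unpack(">f", struct.pack(">I", b << 16))[0]

def fp_range(e_bits: int, m_bits: int, bias: int = None):
    """Smallest positive normal and largest finite value of an IEEE-style
    format with the given exponent/fraction widths and exponent bias.
    The default bias is the conventional 2^(e-1) - 1."""
    if bias is None:
        bias = 2 ** (e_bits - 1) - 1
    min_normal = 2.0 ** (1 - bias)
    max_exp = (2 ** e_bits - 2) - bias  # top exponent code reserved for inf/NaN
    max_val = (2.0 - 2.0 ** (-m_bits)) * 2.0 ** max_exp
    return min_normal, max_val

print(fp_range(5, 10))  # FP16: smallest normal 2^-14, largest finite 65504
print(fp_range(8, 7))   # BFP16: same exponent range as FP32, only 7 fraction bits
```

Comparing `fp_range(5, 10)` with `fp_range(8, 7)` makes the trade-off explicit: BFP16 covers roughly the same range as FP32 but with 7 fraction bits, while FP16 tops out at 65504 with 10 fraction bits.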
Conventionally, its value is solely determined by the exponent size; for example, for FP16 with a 5-bit exponent, the exponent bias is automatically fixed to 15 (2^(5-1) - 1).

To make the data even shorter, 8-bit fixed-point signed/unsigned integer formats (INT8 and UINT8) are also broadly adopted. However, the 8-bit fixed-point format inherently has a narrower dynamic value range, so the model accuracy loss is usually not negligible even after extra symmetric or asymmetric quantization. As a consequence, a number of attempts concentrate on utilizing mixed-precision or pure 8-bit floating-point numbers in deep learning applications. Various techniques have been developed for mixed-precision training (Banner et al., 2018; Micikevicius et al., 2018; Das et al., 2018; Zhou et al., 2016). Moreover, recent studies proposed several training frameworks that produce weights only in 8-bit floating-point formats (Wang & Choi, 2018; Cambier et al., 2020; Sun et al., 2019). In these studies, the underlying 8-bit floating-point numbers in training and inference are represented in the format FP8(1, 5, 2) or FP8(1, 4, 3), where the three enclosed parameters indicate the bit lengths of the sign, exponent, and fraction fields, respectively. Note that 4 or 5 bits are essential for the exponent in their frameworks; otherwise, the corresponding dynamic range may not cover both weight and activation values well. Consequently, only 2 or 3 bits are left for the fraction, which inevitably leads to lower accuracy.

In this paper, we present an extremely flexible 8-bit floating-point (FFP8) number format. In FFP8, all parameters (the bit widths of the exponent/fraction fields, the exponent bias, and the presence of the sign bit) are configurable. Three major features of our inference methodology associated with the proposed FFP8 format are as follows. First, it is observed that both the maximum magnitude and the value distribution are quite dissimilar between weights and activations in most DNNs.
This suggests that, to achieve higher accuracy, the best exponent size and exponent bias for weights should differ from those for activations. Second, a large class of commonly used activation functions (e.g., ReLU) always produces nonnegative outputs, so activations are effectively unsigned whenever such a function is in use. The sign bit is therefore not required for those activations, which makes either the exponent or the fraction field one bit longer; even a single bit can make a big impact when only 8 bits are available. Third, all the aforementioned studies require their own sophisticated training frameworks to produce 8-bit floating-point models. Our flow does not: it simply takes a model generated by any conventional FP32 training framework as input and converts the pre-trained FP32 model into an FFP8 model.

The rest of this paper is organized as follows. Section 2 briefly introduces related work. In Section 3, we elaborate on the proposed FFP8 format and how to properly convert a pre-trained FP32 model into an FFP8 one. Section 4 demonstrates the experimental results on various DNN models. The system and hardware design issues for supporting FFP8 numbers are discussed in Section 5. Finally, concluding remarks are given in Section 6.
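To make the conversion idea concrete, the sketch below (our own illustration, not the authors' code; the function name, default parameters, and the flush-to-zero handling of subnormals are our simplifications) rounds FP32 values to the nearest value representable in a hypothetical FFP8(sign, exponent, fraction) configuration with a given bias, assuming no exponent code is reserved for infinities/NaNs:

```python
import numpy as np

def quantize_ffp8(x, sign_bits=1, e_bits=4, m_bits=3, bias=7):
    """Round FP32 values to the nearest value representable in an
    FFP8-like format. A sketch: clamps to the finite range, flushes
    subnormals to zero, and ignores special (inf/NaN) encodings."""
    x = np.asarray(x, dtype=np.float32)
    max_exp = (2 ** e_bits - 1) - bias        # no reserved exponent code
    max_val = (2.0 - 2.0 ** (-m_bits)) * 2.0 ** max_exp
    min_normal = 2.0 ** (1 - bias)            # smallest positive normal
    sign = np.sign(x)
    mag = np.clip(np.abs(x), 0.0, max_val)
    # Quantization step inside each binade is 2^(exponent - m_bits).
    exp = np.floor(np.log2(np.where(mag > 0, mag, 1.0)))
    step = 2.0 ** (exp - m_bits)
    q = np.round(mag / step) * step
    q = np.where(mag < min_normal, 0.0, q)    # flush subnormals to zero
    if sign_bits == 0:
        sign = np.maximum(sign, 0.0)          # unsigned: negatives clamp to 0
    return sign * q
```

For instance, with 3 fraction bits the neighbors of 1.07 are 1.0 and 1.125, so `quantize_ffp8(1.07)` yields 1.125; dropping the sign bit (`sign_bits=0`) maps any negative input to 0, which is harmless after a ReLU since such activations are nonnegative anyway.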

2. RELATED WORK

DNNs are becoming larger and more complicated, which means they require more memory and consume more energy during inference. As a result, it is getting harder to deploy them on systems with limited memory capacity and power budget (e.g., edge devices). Prior studies (Han et al., 2016; Zhao et al., 2020; Horowitz, 2014) also demonstrated that off-chip DRAM access is responsible for a significant share of system power consumption. Hence, how to reduce the memory usage of weights and activations remains an active research topic. As mentioned in the previous section, one way to do so is to use short 8-bit floating-point number formats.

Wang & Choi (2018) introduced a DNN training methodology using 8-bit floating-point numbers in the FP8(1, 5, 2) format. The methodology features chunk-based accumulation and stochastic rounding to minimize accuracy loss. Besides, to achieve a better trade-off between precision and dynamic range during model training, Sun et al. (2019) proposed an improved methodology that utilizes two different 8-bit floating-point formats: FP8(1, 4, 3) for forward propagation and FP8(1, 5, 2) for backward propagation. Nevertheless, both methodologies fail to represent a DNN model entirely in 8-bit numbers: the first and last layers of the given model are still kept in 16-bit floating-point numbers; otherwise, the model suffers about 2% accuracy degradation. Cambier et al. (2020) then proposed the S2FP8 format, which allows a DNN model to be represented completely in 8-bit floating-point numbers. By adding a scaling factor and a shifting factor, data can be well represented in FP8(1, 5, 2) after proper shifting and squeezing operations, eliminating the need for 16-bit floating-point numbers. However, S2FP8 still results in about 1% accuracy drop on ResNet-50 (He et al., 2018).
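The shifting idea can be sketched as follows. This is a simplified illustration under our own assumptions, not the S2FP8 algorithm itself: only a power-of-two shift component is shown, chosen from the mean log2-magnitude of the tensor, whereas the actual S2FP8 scheme also applies a squeeze factor and defines its statistics differently:

```python
import numpy as np

def shift_to_format(x, target_center=0.0):
    """Simplified sketch of the 'shift' idea behind S2FP8: rescale a tensor
    by a power of two so that the mean log2-magnitude of its nonzero entries
    sits near target_center, where a narrow 8-bit format has the most
    precision. Returns the shifted tensor and the scale that undoes it."""
    x = np.asarray(x, dtype=np.float32)
    nz = np.abs(x[x != 0])
    if nz.size == 0:
        return x, 1.0
    mean_log = np.mean(np.log2(nz))
    shift = 2.0 ** np.round(target_center - mean_log)  # power-of-two scale
    return x * shift, 1.0 / shift
```

Because the scale is an exact power of two, shifting and unshifting introduce no rounding error of their own; all quantization error comes from the subsequent 8-bit encoding.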

