WINNING BOTH THE ACCURACY OF FLOATING POINT ACTIVATION AND THE SIMPLICITY OF INTEGER ARITHMETIC

Abstract

Even though floating-point (FP) numbers have been adopted as a de facto standard data format for deep learning, the complexity of FP arithmetic impedes broader deployment of Deep Neural Networks (DNNs). Recent works such as quantization attempt to replace the FP matrix multiplication (MatMul) of DNNs with simple integer MatMul by converting the datatypes of both weights and activations into integers. Unfortunately, unlike weights, whose values are static, dynamic activations are challenging to represent with integers. In this paper, to simultaneously achieve the accuracy of FP activations and the simplicity of integer arithmetic, we present a method that replaces FP arithmetic with integer arithmetic while keeping FP activations in their storage format and quantizing only the weights. The proposed method pre-aligns the significands of FP activations on-the-fly, just ahead of the MatMul, so that the aligned significands (integers) can be used for the computation. Inspired by the observation that conventional FP arithmetic does not produce exact results due to rounding, we demonstrate that the proposed integer-arithmetic-based scheme yields the same level of error as FP arithmetic when DNNs use FP activations and quantized weights. Experimental results show that hardware based on the proposed scheme significantly improves energy efficiency and throughput-per-area over FP-arithmetic-based designs while maintaining a similar level of accuracy.
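The core idea of significand pre-alignment described above can be illustrated with a minimal NumPy sketch. Here all names (`align_significands`, `num_bits`) and the choice of aligning to the row-wise maximum exponent are illustrative assumptions for exposition, not the paper's hardware datapath: each FP activation is decomposed into significand and exponent, shifted to a shared exponent, and rounded to an integer, after which the MatMul with integer (quantized) weights proceeds purely in integer arithmetic.

```python
import numpy as np

def align_significands(acts, num_bits=24):
    """Sketch of significand pre-alignment: represent each FP activation as an
    integer significand under a single shared exponent (rounding as needed)."""
    # Decompose each activation as mant * 2**exp, with |mant| in [0.5, 1).
    mant, exp = np.frexp(acts.astype(np.float64))
    shared_exp = exp.max()
    # Shift each significand down to the shared exponent, then scale to an
    # integer with num_bits of precision; the shift discards low-order bits,
    # which mirrors the rounding inherent in conventional FP accumulation.
    shifts = shared_exp - exp
    aligned = np.round(mant * (1 << num_bits) / (2.0 ** shifts)).astype(np.int64)
    # aligned * 2**(shared_exp - num_bits) reconstructs the original values.
    return aligned, shared_exp - num_bits

# Integer dot product of aligned activations with already-integer weights.
acts = np.array([0.15, -2.5, 7.0, 0.003])
weights = np.array([3, -1, 2, 5], dtype=np.int64)  # quantized weights

aligned, scale_exp = align_significands(acts)
int_result = (aligned * weights).sum() * (2.0 ** scale_exp)  # integer MatMul + final rescale
fp_result = float((acts * weights).sum())
# int_result matches fp_result up to the alignment rounding error
```

In this sketch only a single final rescale by the shared exponent is needed, so the inner accumulation loop, which dominates MatMul cost, uses integer multiplies and adds exclusively.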

1. INTRODUCTION

Deep Neural Networks (DNNs) usually use Floating-Point (FP) number systems to represent a wide range of weight and activation values. Such a comprehensive representation, however, demands high computational complexity and cost for FP matrix multiplication (MatMul) (Sze et al., 2017). On the other hand, integer (a.k.a. fixed-point) arithmetic logic is much simpler and consumes less energy than its FP counterpart (Jouppi et al., 2021). As such, the computational efficiency of DNNs can be enhanced by replacing FP arithmetic with integer arithmetic. Accordingly, quantization has been actively studied as a promising technique to support DNN computations with integer arithmetic, as it maps input values from a (virtually) continuous domain (FP numbers) to output values in a discrete set (integers) (Jacob et al., 2018). Note that even though several studies have successfully quantized the weights and activations of some target DNNs with low-precision integer values (Li et al., 2021; Wu et al., 2022), quantization remains challenging for numerous DNNs. In particular, activation values are known to be more difficult to quantize than weight parameters because activations are generated dynamically during inference while the distribution of weights is static. The uncertainty in the distribution of dynamic activation values limits the ability to estimate a proper quantization range (Choi et al., 2018). Such issues with activation quantization become even more serious when DNNs involve highly non-linear activation functions (e.g., GeLU) or modules that increase the variance of the activations (e.g., softmax and normalization layers) (Jeon et al., 2020). As a result, while the weight parameters can be successfully quantized even for generative mod-

