BLOCK AND SUBWORD-SCALING FLOATING-POINT (BSFP): AN EFFICIENT NON-UNIFORM QUANTIZATION FOR LOW PRECISION INFERENCE

Abstract

In this paper, we propose Block and Subword-Scaling Floating-Point (BSFP), a datatype with a non-uniform quantization scheme for the skewed and non-uniform distributions of weight vectors in neural networks. By quantizing each weight vector as the superposition of multiple subword vectors (in two's complement) with scaling factors (in Low-bit Floating-Point, LBFP), BSFP can effectively fit the distribution of weight vectors while maintaining high computational efficiency. Furthermore, we present a grid search-based MSE-optimal quantization flow and a scaled serial processing engine to complete the quantization pipeline and the infrastructure. Experimental results on the ImageNet classification task show that our proposed method outperforms state-of-the-art Microsoft Floating Point (MSFP) by up to 18.57% top-1 accuracy at the same weight precision and reduces model size by up to 10.3%. Furthermore, BSFP achieves up to 2.0× the computing throughput and up to 5.3× the energy efficiency of MSFP under the same silicon area budget.

1. INTRODUCTION

Deep Neural Networks (DNNs) have continuously enabled more and more eye-catching artificial intelligence (AI) applications Johnson et al. (2016); Lin et al. (2014); Deng et al. (2009). However, their large model size and high computational complexity hinder the wide deployment of DNNs to latency-sensitive cloud services and energy-constrained edge devices. To address the performance and energy challenges, in addition to compacting neural network structures Sandler et al. (2018); Ma et al. (2018), reducing the bitwidths of weights or activations has also been extensively explored Jacob et al. (2018); Darvish Rouhani et al. (2020); Tambe et al. (2020); Li et al. (2020). In particular, non-conventional datatypes and custom hardware are emerging to optimize the performance, energy efficiency, area efficiency, and memory requirements of DNN inference. Prior industry and academic research has explored low-bit floating-point datatypes Kalamkar et al. (2019); Jouppi et al. (2020); NVIDIA (2022); Tambe et al. (2020), block-based floating-point datatypes Darvish Rouhani et al. (2020); Köster et al. (2017), low-bit fixed-point datatypes NVIDIA (2020); Jacob et al. (2018), and power-of-two fixed-point datatypes Miyashita et al. (2016); Zhou et al. (2017); Li et al. (2020) as potential candidates for efficient DNN inference. Among these datatypes, Microsoft Floating Point (MSFP), a block-based floating-point type shown in Figure 1(b), claims to achieve the state-of-the-art tradeoff among dynamic range, DNN accuracy, and hardware complexity Darvish Rouhani et al. (2020).

This work focuses on post-training quantization, which is preferable in practice. First, for end users, it involves no data (including private data) and enables a low-friction deployment pipeline Nagel et al. (2019). Second, according to our discussions with an IC design house that tapes out AI chips in advanced technology nodes, the industry (at least their application-side customers) appreciates post-training quantization because, in most cases, AI application companies are reluctant to release AI models and training data to AI accelerator companies. Although we focus on post-training quantization, we still include fine-tuning results in Appendix A.

This paper proposes Block and Subword-Scaling Floating-Point (BSFP), a new class of datatypes with a bit-efficient, non-uniform quantization method and custom hardware to improve energy efficiency and performance over the state-of-the-art MSFP. As shown in Figure 1(a), the key idea of BSFP is to approximate each full-precision weight vector by the sum of two subword vectors, each with its own scaling. More specifically, each subword is a low-bit (e.g., 2-bit), signed (two's complement) integer, and each scaling is a low-bit floating-point (LBFP) number (e.g., a 7-bit one). We will show that BSFP is superior to MSFP in capturing the non-uniformity and skewness of per-vector weight distributions, which are common for vectors of a small number (e.g., 16) of weights.

Although BSFP adopts two scalings and two subword vectors, it can still be computed efficiently for three reasons. First, the computation cost of each scaling is amortized over 16 weights. Second, each scaling is an LBFP number and involves only low-bit operations, e.g., multiplications with a 3-bit mantissa. Third, the subword-vector structure naturally fits bit-serial computation architectures Qian Zhang et al. (2022); Judd et al. (2016).

A key property of BSFP is that it approximates the desired weight vector with both a coarse and a fine vector: one subword vector with a large scaling captures large weights, and the other subword vector with a small scaling mitigates the remaining deviations. Therefore, BSFP can accommodate large outliers and fine resolutions simultaneously. Figure 2(a) compares the quantization results of a real 16-element weight vector from ShuffleNet-v2 under 8-level BSFP and 15-level MSFP. This example demonstrates that even BSFP with relatively fewer quantization levels can achieve smaller quantization errors (e.g., in terms of MSE) than MSFP with more quantization levels. We summarize the rationales for BSFP's superiority below:

• No waste of quantization levels: BSFP uses two's complement for each subword and does not waste precious quantization levels. In comparison, MSFP resembles sign-magnitude representation and wastes one quantization level (i.e., the duplicated +0 and -0). Even worse, the impact of wasted quantization levels grows as the bitwidth shrinks. For instance, a 3-bit two's complement number represents eight quantization levels, one level (about 14.3%) more than the seven levels of a 3-bit sign-magnitude number.

• Adaptation to skewed distributions: BSFP exploits the asymmetrical nature of two's complement numbers (e.g., -2, -1, 0, 1 for 2-bit two's complement) together with the sign of the associated scaling to adapt to asymmetrical weight distributions in weight vectors. In comparison, MSFP is permanently restricted to symmetrical quantization levels and wastes quantization levels when fitting asymmetrical distributions.

• Adaptation to non-uniform distributions: BSFP can offer non-uniform quantization levels by combining two subword-scaling vectors. In comparison, MSFP always quantizes weight vectors uniformly, even though they may exhibit non-uniform weight distributions.

• Better freedom of quantization step size: The quantization step sizes of BSFP are defined by the two scalings, which are (low-bitwidth) floating-point values. In contrast, the quantization step size of MSFP can only be a power of two, e.g., 0.5, 0.25, 0.125.
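To make the superposition idea concrete, the following is a minimal sketch of BSFP-style quantization of one weight vector: the vector is approximated as s1·q1 + s2·q2, where q1 and q2 are 2-bit two's complement subword vectors and the scalings are chosen by a grid search minimizing MSE. Note that this is an illustrative simplification, not the paper's exact flow: the function name `quantize_bsfp` and the candidate scale grid are our assumptions, and the scalings are kept as full-precision floats here rather than actual LBFP values.

```python
import numpy as np

def quantize_bsfp(w, bits=2, scale_grid=None):
    """Approximate w as s1*q1 + s2*q2 with low-bit two's complement subwords.

    Illustrative sketch only: scalings are plain floats (not LBFP) and the
    grid of candidate scalings is a simple power-of-two sweep we made up.
    """
    lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1  # e.g. -2..1 for 2-bit
    if scale_grid is None:
        m = np.abs(w).max() + 1e-12
        # candidate scalings spanning coarse to fine resolutions
        scale_grid = m * 2.0 ** np.arange(-6, 1, dtype=np.float64)

    def fit_subword(r, s):
        # nearest representable subword for each element of residual r
        return np.clip(np.round(r / s), lo, hi)

    best = None
    for s1 in scale_grid:
        q1 = fit_subword(w, s1)          # coarse term captures large weights
        r = w - s1 * q1                  # remaining deviation
        for s2 in scale_grid:
            q2 = fit_subword(r, s2)      # fine term mitigates the residual
            mse = np.mean((w - (s1 * q1 + s2 * q2)) ** 2)
            if best is None or mse < best[0]:
                best = (mse, s1, q1, s2, q2)
    return best
```

For a skewed 16-element vector, the coarse term absorbs the large weights and the fine term refines the rest, mirroring the coarse/fine decomposition described above.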
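For contrast, a shared-scale, sign-magnitude scheme in the spirit of MSFP can be sketched as follows. This is our own simplified stand-in (the function name `quantize_msfp` and the power-of-two scale choice are assumptions, and real MSFP stores a shared exponent per bounding box): a b-bit sign-magnitude mantissa covers only 2^b − 1 distinct levels because +0 and −0 coincide, whereas b-bit two's complement covers the full 2^b.

```python
import numpy as np

def quantize_msfp(w, bits=4):
    """Simplified shared-scale, sign-magnitude quantization for contrast.

    One power-of-two scale is shared by the whole vector; each element
    becomes a sign-magnitude integer in [-(2^(bits-1)-1), 2^(bits-1)-1].
    """
    max_mag = 2 ** (bits - 1) - 1  # e.g. 7 for 4-bit (15 levels, not 16)
    # shared power-of-two scale, chosen large enough to avoid clipping
    scale = 2.0 ** np.ceil(np.log2(np.abs(w).max() / max_mag + 1e-38))
    q = np.clip(np.round(w / scale), -max_mag, max_mag)
    return q, scale
```

Because the scale is restricted to powers of two and the levels are symmetric around zero, this scheme cannot shift its levels toward a skewed distribution the way the two signed BSFP scalings can.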
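The amortization argument for the two scalings can also be sketched in a few lines: a dot product against a BSFP-encoded weight vector reduces to two integer dot products plus one multiplication per scaling, so each scaling's cost is spread over all 16 elements. The function name `bsfp_dot` is our own; this shows the dataflow, not the bit-serial hardware.

```python
import numpy as np

def bsfp_dot(a, s1, q1, s2, q2):
    """Dot product of activations a with a BSFP weight vector s1*q1 + s2*q2.

    The two integer dot products (q1 @ a, q2 @ a) dominate the work; each
    scaling is applied exactly once per vector, amortizing its cost.
    """
    return s1 * float(q1 @ a) + s2 * float(q2 @ a)
```

By distributivity this equals a · (s1·q1 + s2·q2), so accuracy is unchanged while the scaling multiplications are moved outside the per-element loop.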

Figure 1: Number system comparison between (a) the proposed Block and Subword-Scaling Floating-Point (BSFP), (b) Microsoft FP (MSFP; Darvish Rouhani et al. (2020)), and (c) floating-point numbers (IEEE 754 FP16, Google BF16 Jouppi et al. (2020), and Nvidia TensorFloat (TF19) NVIDIA (2022)).

