BLOCK AND SUBWORD-SCALING FLOATING-POINT (BSFP): AN EFFICIENT NON-UNIFORM QUANTIZATION FOR LOW PRECISION INFERENCE

Abstract

In this paper, we propose Block and Subword-Scaling Floating-Point (BSFP), a datatype with a non-uniform quantization scheme tailored to the skewed and non-uniform distributions of weight vectors in neural networks. By quantizing each weight vector as the superposition of multiple subword vectors (in two's complement) with scaling factors (in Low-bit Floating-Point, LBFP), BSFP can effectively fit the distribution of weight vectors while maintaining high computational efficiency. Furthermore, we present a grid-search-based MSE-optimal quantization flow and a scaled serial processing engine to complete the quantization pipeline and infrastructure. Experimental results on the ImageNet classification task show that our proposed method outperforms state-of-the-art Microsoft Floating Point (MSFP) by up to 18.57% top-1 accuracy at the same weight precision and reduces model size by up to 10.3%. Furthermore, BSFP achieves up to 2.0× the computing throughput and up to 5.3× the energy efficiency of MSFP under the same silicon area budget.
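To make the scheme concrete, the following sketch illustrates the core idea of BSFP quantization: a weight block is approximated as a superposition of two's-complement subword vectors, each with its own scaling factor chosen from a candidate grid by minimum MSE. This is a simplified, hypothetical illustration, not the paper's implementation: it uses a greedy residual fit over a power-of-two scale grid as a stand-in for LBFP scaling factors and the full grid-search flow, and all function names are our own.

```python
import numpy as np

def quantize_subword(v, scale, bits=4):
    """Round v/scale to a two's-complement integer in [-2^(bits-1), 2^(bits-1)-1]."""
    lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    return np.clip(np.round(v / scale), lo, hi)

def bsfp_quantize(block, scale_grid, bits=4, num_subwords=2):
    """Greedy residual fit: approximate `block` as a sum of scaled
    two's-complement subword vectors. For each subword, the scale is
    picked from `scale_grid` by minimum MSE against the residual
    (a simplified stand-in for the paper's grid-search-based flow)."""
    residual = np.asarray(block, dtype=np.float64)
    scales, subwords = [], []
    for _ in range(num_subwords):
        best = None
        for s in scale_grid:
            q = quantize_subword(residual, s, bits)
            mse = np.mean((residual - s * q) ** 2)
            if best is None or mse < best[0]:
                best = (mse, s, q)
        _, s, q = best
        scales.append(s)
        subwords.append(q)
        residual = residual - s * q  # next subword fits what is left
    return scales, subwords

def bsfp_dequantize(scales, subwords):
    """Reconstruct the block as the superposition of scaled subwords."""
    return sum(s * q for s, q in zip(scales, subwords))
```

Because each additional subword fits the residual of the previous ones, reconstruction error is non-increasing in the number of subwords, which is how the superposition captures skewed, non-uniform weight distributions that a single shared scale cannot.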



This work focuses on post-training quantization, which is preferable in practice. First, for end users, it requires no training data (including private data) and enables a low-friction deployment pipeline Nagel et al. (2019). Second, according to our discussions with an IC design house that tapes out AI chips in advanced technology nodes, the industry (at least their application-side customers) appreciates post-training quantization because, in most cases, AI application companies are reluctant to release AI models and training data to AI accelerator companies. Although we focus on post-training quantization, we also include fine-tuning results in Appendix A. This paper proposes Block and Subword-Scaling Floating-Point (BSFP), a new class of datatypes with a bit-efficient, non-uniform quantization method and custom hardware to improve the energy



Deep neural networks (DNNs) have continuously enabled more and more eye-catching artificial intelligence (AI) applications Johnson et al. (2016); Lin et al. (2014); Deng et al. (2009). However, their large model sizes and high computational complexity hinder the wide deployment of DNNs in latency-sensitive cloud services and on energy-constrained edge devices. To address the performance and energy challenges, in addition to compacting neural network structures Sandler et al. (2018); Ma et al. (2018), reducing the bitwidths of weights or activations has also been extensively explored Jacob et al. (2018); Darvish Rouhani et al. (2020); Tambe et al. (2020); Li et al. (2020). In particular, non-conventional datatypes and custom hardware are emerging to optimize the performance, energy efficiency, area efficiency, and memory requirements of DNN inference. Prior research in industry and academia has explored low-bit floating-point datatypes Kalamkar et al. (2019); Jouppi et al. (2020); NVIDIA (2022); Tambe et al. (2020), block-based floating-point datatypes Darvish Rouhani et al. (2020); Köster et al. (2017), low-bit fixed-point datatypes NVIDIA (2020); Jacob et al. (2018), and power-of-two fixed-point datatypes Miyashita et al. (2016); Zhou et al. (2017); Li et al. (2020) as potential candidates for efficient DNN inference. Among these datatypes, Microsoft Floating Point (MSFP), a block-based floating-point type as shown in Figure 1(b), claims to achieve the state-of-the-art tradeoff among dynamic range, DNN accuracy, and hardware complexity Darvish Rouhani et al. (2020).

