ACCURACY BOOSTERS: EPOCH-DRIVEN MIXED-MANTISSA BLOCK FLOATING-POINT FOR DNN TRAINING

Abstract

The unprecedented growth in DNN model complexity, size, and the amount of training data has led to a commensurate increase in demand for computing and a search for minimal numerical encodings. Recent research advocates Hybrid Block Floating-Point (HBFP) as a technique that minimizes silicon provisioning in accelerators by converting the majority of arithmetic operations in training to 8-bit fixed point. In this paper, we perform a full-scale exploration of the HBFP design space, including minimal mantissa encoding, varying block sizes, and mixed mantissa bit widths across layers and epochs. We propose Accuracy Boosters, an epoch-driven mixed-mantissa HBFP technique that uses 6-bit mantissas only in the last epoch and converts 99.7% of all arithmetic operations in training to 4-bit mantissas. Accuracy Boosters enable reducing silicon provisioning for an HBFP training accelerator by 16.98× compared to FP32, while preserving or outperforming FP32 accuracy.

1. INTRODUCTION

Improvements in Deep Neural Network (DNN) algorithms over the past decade have led to unprecedented growth in model complexity and dataset size, and consequently in the computational resources required to train DNN models. One of the largest DNN models, GPT-3 (Brown et al., 2020), has 175 billion parameters and requires 3.14×10²³ FLOPs to train. With the slowdown in Moore's law, researchers and vendors have begun to search for ways to improve the arithmetic density of the underlying hardware platforms. Narrow bit-width (lower-precision) number formats (Wang & Kanwar, 2019; Micikevicius et al., 2018; Sun et al., 2019; Mellempudi et al., 2019; Sun et al., 2020) have emerged as a promising approach to increase arithmetic density, as well as to reduce the required operand storage and communication bandwidth, while maintaining high accuracy for training. Many have tried fixed-point formats, often used in inference, to further reduce silicon logic complexity for arithmetic (Courbariaux et al., 2015a; Hubara et al., 2016; Rastegari et al., 2016; Dettmers et al., 2022a;b). Fixed-point formats, unfortunately, suffer from a severely limited numerical range, especially for arithmetic in backward propagation. As such, researchers have tried mixed-precision training to trade off accuracy for efficiency (Zhang et al., 2021; Fu et al., 2020; 2021). Recently, there have been several proposals for block floating point (Köster et al., 2017; Das et al., 2018; Zhang et al., 2021), a numerical encoding that groups a block of mantissas under a single shared exponent, so that arithmetic within the block relies only on fixed-point operations.
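The block floating-point encoding described above can be illustrated with a minimal sketch: each block of values shares one exponent derived from the largest magnitude in the block, and the individual values are rounded to narrow fixed-point mantissas. The function name, parameters, and rounding policy below are illustrative assumptions, not the encoding used by any specific accelerator.

```python
import math

def bfp_quantize(values, mantissa_bits, block_size):
    """Toy block floating-point quantization (illustrative sketch).

    Each block of `block_size` values shares a single exponent chosen
    from the largest magnitude in the block; mantissas are rounded to
    signed fixed-point integers with `mantissa_bits` bits (incl. sign).
    Returns the dequantized values so the rounding error is visible.
    """
    out = []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        max_abs = max(abs(v) for v in block) or 1.0
        shared_exp = math.floor(math.log2(max_abs)) + 1  # covers max_abs
        scale = 2.0 ** (shared_exp - (mantissa_bits - 1))
        limit = 2 ** (mantissa_bits - 1) - 1
        for v in block:
            q = max(-limit, min(limit, round(v / scale)))  # saturating round
            out.append(q * scale)
    return out
```

Note how small values in a block dominated by a large one are flushed toward zero: this is the accuracy/block-size interplay the paper explores.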
Block floating point asymptotically approaches the arithmetic density of fixed point with larger block sizes and naturally lends itself to mixed-precision hardware, where a block with the same number of exponent bits can have a fixed-point datapath that is bit-sliced for various multiples of mantissa bit encodings (e.g., the same way today's CPU cores implement SIMD). While block floating point has been promising for inference (e.g., Microsoft Floating Point (Darvish Rouhani et al., 2020)), most proposals to train with block floating point have either failed to reach its full potential by requiring small blocks and/or fall just short of reaching FP32 accuracy. One specific proposal, Hybrid Block Floating Point (HBFP) (Drumond et al., 2018), uses a mixed-precision format in which the dot products that dominate training (e.g., convolutions, matrix multiplications, outer products) are performed in block floating point, while higher precision (e.g., FP32) is used for other, less frequent operations requiring larger numerical ranges (e.g., activations, regularizations). HBFP simultaneously offers the high accuracy of floating point and the superior hardware density of fixed point, delivering up to 8.5× higher throughput than FP16 with 2× more compact models (Drumond, 2020). Prior work on HBFP presented only a preliminary analysis of the design space for power-of-two mantissa bit widths (e.g., 2-, 4-, and 8-bit mantissas). In this paper, we make the observation that the parameter space for HBFP is quite rich, presenting several opportunities for further improving efficiency and density in hardware platforms. First, custom accelerators can support non-power-of-two numerical formats; minimizing the number of bits improves operand storage and communication linearly, and arithmetic logic quadratically.
Second, there is an interplay between the block size and the number of mantissa bits, allowing for an overall denser numerical format with smaller blocks while maintaining high accuracy. Finally, HBFP allows for mixed-mantissa block floating-point encodings. Prior work studies training with various HBFP formats in isolation; however, the design space of mixed-mantissa HBFP is yet to be explored. To minimize the number of bits in HBFP, we explore the interplay between the block size and the number of mantissa bits. We show that HBFP with six or more mantissa bits has no sensitivity to block size. While HBFP with fewer mantissa bits is sensitive to block size, those configurations do not achieve sufficient accuracy even with the smallest blocks and require additional methods to increase accuracy. Accuracy Boosters further minimize the number of bits and enable training with 4-bit mantissas. Our method improves epoch-wise mixed-precision training by introducing high precision into the training process only at the last epoch. The main contributions of this paper are as follows:
• We show that HBFP6 is the smallest HBFP format achieving competitive accuracies.
• We enable HBFP5 training for smaller models (e.g., ResNet20) by using small block sizes and for larger models (e.g., DenseNet40) by keeping the first and last layers in FP32.
• We improve the silicon area- and energy-efficiency of training without significant accuracy loss by performing a large fraction of epochs in low precision for CNN and Transformer models. We show that for a few models, our method even outperforms FP32 training.
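The epoch-driven schedule described above can be sketched as a trivial lookup: every epoch trains with the narrow mantissa except the final one, which is "boosted" to a wider mantissa. The function name and the default bit widths below are assumptions for illustration (the paper's 4-bit/6-bit setting), not an API from the paper.

```python
def mantissa_bits_for_epoch(epoch, total_epochs, low_bits=4, boost_bits=6):
    """Epoch-driven mixed-mantissa schedule in the spirit of Accuracy
    Boosters: use a narrow mantissa for all epochs except the last,
    which switches to a wider mantissa. Illustrative sketch only;
    epochs are numbered from 0."""
    return boost_bits if epoch == total_epochs - 1 else low_bits

# With this schedule, the fraction of epochs trained at low precision
# is (total_epochs - 1) / total_epochs, e.g. 89/90 for a 90-epoch run.
```

In a training loop, the returned bit width would parameterize the HBFP quantizer applied to the dot-product operands for that epoch.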

2. WHY MINIMIZE HBFP?

We argue that employing both smaller mantissa bit widths and larger block sizes is the key to improving HBFP hardware efficiency. Prior results show that, to first order, minimizing the number of bits in fixed-point arithmetic reduces operand storage and memory bandwidth linearly, and multiplication power and area quadratically (Gholami et al., 2021). As a result, HBFP hardware efficiency increases with a reduced number of mantissa bits due to the large fraction of fixed-point operations. We also note that the hardware area and energy expenditure of HBFP accelerators is determined by the number of mantissa bits and the block size, because the overhead of the exponent bits is negligible. Therefore, we work with 10-bit exponents as in prior work (Drumond et al., 2018) and explore the HBFP design space by varying the mantissa bit width and the block size. Our experiments with different mantissa bit widths give rise to a re-configurable DNN accelerator for mixed-mantissa HBFP. Furthermore, we demonstrate that smaller block sizes reduce the fraction of fixed-point operations in dot products, causing an increase in area overhead. In line with these observations, we establish decreasing the mantissa bit width and increasing the block size as design guidelines for HBFP-based accelerators.

HBFP is a mixed-precision training technique that brings area- and energy-efficient fixed-point arithmetic into DNN training. DNN training traditionally uses floating-point operations because training algorithms require high precision. However, floating-point hardware occupies a larger silicon area and consumes more energy than fixed-point arithmetic due to exponent management and normalization. HBFP alleviates the need for floating-point operations by employing fixed-point arithmetic for dot products, the most common operations in DNN training. Furthermore, it assigns exponents to fixed-point tensors to emulate the dynamic range provided by floating point and thus maintain training accuracy. In other words, HBFP achieves fixed-point efficiency in DNN training with floating-point accuracy.
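The key property described above, that the dot product itself needs only fixed-point arithmetic, can be seen in a minimal sketch: two blocks of integer mantissas are multiply-accumulated as plain integers, and the shared exponents are applied once per block pair rather than once per element. The function below is a hypothetical illustration, not the accelerator datapath from the paper.

```python
def hbfp_dot(mant_a, exp_a, mant_b, exp_b):
    """Toy HBFP-style dot product (illustrative sketch).

    `mant_a` and `mant_b` are blocks of integer (fixed-point) mantissas;
    `exp_a` and `exp_b` are the single exponents shared by each block.
    The multiply-accumulate loop is pure integer arithmetic; the two
    shared exponents are applied with one rescaling at the end.
    """
    acc = 0
    for a, b in zip(mant_a, mant_b):
        acc += a * b                      # integer multiply-accumulate
    return acc * 2.0 ** (exp_a + exp_b)   # one exponent add per block pair
```

This is why the exponent overhead is amortized over the whole block: a block of N elements costs N integer multiply-adds but only one exponent addition and rescale.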

