A BLOCK MINIFLOAT REPRESENTATION FOR TRAINING DEEP NEURAL NETWORKS

Abstract

Training Deep Neural Networks (DNNs) with high efficiency can be difficult to achieve with native floating-point representations and commercially available hardware. Specialized arithmetic with custom acceleration offers perhaps the most promising alternative. Ongoing research is trending towards narrow floating-point representations, called minifloats, that pack more operations into a given silicon area and consume less power. In this paper, we introduce Block Minifloat (BM), a new spectrum of minifloat formats capable of training DNNs end-to-end with only 4-8 bit weight, activation and gradient tensors. While standard floating-point representations have two degrees of freedom, via the exponent and mantissa, BM exposes the exponent bias as an additional field for optimization. Crucially, this enables training with fewer exponent bits, yielding dense integer-like hardware for fused multiply-add (FMA) operations. For ResNet trained on ImageNet, 6-bit BM achieves almost no degradation in floating-point accuracy with FMA units that are 4.1× (23.9×) smaller and consume 2.3× (16.1×) less energy than FP8 (FP32). Furthermore, our 8-bit BM format matches floating-point accuracy while delivering higher computational density and faster expected training times.

1. INTRODUCTION

The energy consumption and execution time associated with training Deep Neural Networks (DNNs) are directly related to the precision of the underlying numerical representation. Most commercial accelerators, such as NVIDIA Graphics Processing Units (GPUs), employ conventional floating-point representations due to their ease of use and wide dynamic range. However, double-precision (FP64) and single-precision (FP32) formats have relatively high memory bandwidth requirements and incur significant hardware overhead for general matrix multiplication (GEMM). To reduce these costs and deliver training at increased speed and scale, representations have moved to 16-bit formats, with NVIDIA and Google providing FP16 (IEEE-754, 2019) and Bfloat16 (Kalamkar et al., 2019) respectively. With computational requirements for DNNs likely to increase, further performance gains are necessary in both datacenter and edge devices, where physical constraints are stricter. New number representations must be easy to use and lead to high-accuracy results.

Recent 8-bit floating-point representations have shown particular promise, achieving accuracy equivalent to FP32 over different tasks and datasets (Wang et al., 2018; Sun et al., 2019). We refer to such representations as minifloats in this paper. Minifloats are ideal candidates for optimization: by varying the number of exponent and mantissa bits, many formats can be explored, each with a different trade-off between dynamic range and precision. At the extremes, these include logarithmic (exponent-only) and fixed-point (mantissa-only) representations, which provide substantial gains in speed and hardware density over their floating-point counterparts. For instance, 32-bit integer adders are approximately 10× smaller and 4× more energy efficient than comparable FP16 units (Dally, 2015). That said, fixed-point representations still lack the dynamic range necessary to represent small gradients for backpropagation, and must be combined with other techniques to achieve training convergence.
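To make the trade-offs above concrete, the sketch below decodes a generic minifloat from its packed bits, with the exponent width, mantissa width and exponent bias all left as parameters. This is an illustrative decoder of our own devising (the function name is not from the paper, and special NaN/Inf encodings are omitted for brevity); it shows how the bias acts as a third degree of freedom that simply shifts the representable range, which is the field the BM format exposes for optimization.

```python
def decode_minifloat(bits: int, e_bits: int, m_bits: int, bias: int) -> float:
    """Decode a packed minifloat: 1 sign bit, e_bits exponent bits, m_bits mantissa bits.

    NaN/Inf encodings are omitted; exponent 0 is treated as subnormal.
    """
    mantissa = bits & ((1 << m_bits) - 1)
    exponent = (bits >> m_bits) & ((1 << e_bits) - 1)
    sign = -1.0 if (bits >> (e_bits + m_bits)) & 1 else 1.0
    if exponent == 0:
        # Subnormal: no implicit leading 1, fixed exponent of (1 - bias).
        return sign * (mantissa / (1 << m_bits)) * 2.0 ** (1 - bias)
    # Normal: implicit leading 1, exponent shifted by the bias.
    return sign * (1 + mantissa / (1 << m_bits)) * 2.0 ** (exponent - bias)


# An 8-bit format with 4 exponent and 3 mantissa bits; with bias 7 the
# pattern 0_0111_000 decodes to 1.0, while raising the bias to 9 rescales
# the same bit pattern down to 0.25 -- the range shifts, precision is unchanged.
print(decode_minifloat(0b0_0111_000, e_bits=4, m_bits=3, bias=7))  # 1.0
print(decode_minifloat(0b0_0111_000, e_bits=4, m_bits=3, bias=9))  # 0.25
```

Setting e_bits to 0 collapses the format toward fixed point, while setting m_bits to 0 yields a logarithmic representation, covering the extremes mentioned above.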
Block floating point (BFP), as used in (Yang et al., 2019; Drumond et al., 2018), shares a single exponent across blocks of 8-bit integer numbers, providing a coarse-grained form of dynamic range for training. This

