REVISITING BFLOAT16 TRAINING

Abstract

State-of-the-art generic low-precision training algorithms use a mix of 16-bit and 32-bit precision, creating the folklore that 16-bit precision alone is not enough to maximize model accuracy. As a result, deep learning accelerators are forced to support both 16-bit and 32-bit compute units, which is more costly for hardware design than using only 16-bit units. We ask: can we do pure 16-bit training, which requires only 16-bit compute units, while still matching the model accuracy attained by 32-bit training? Towards this end, we study pure 16-bit training algorithms on the widely adopted BFloat16 compute unit. While these units conventionally use nearest rounding to cast outputs to 16-bit precision, we show that nearest rounding for model weight updates can often cancel small updates, which degrades convergence and model accuracy. Motivated by this, we identify two simple existing techniques, stochastic rounding and Kahan summation, that remedy the model accuracy degradation in pure 16-bit training. We empirically show that these two techniques can enable up to a 7% absolute validation accuracy gain in pure 16-bit training, yielding validation accuracy ranging from 0.1% lower to 0.2% higher than that of 32-bit precision training across seven deep learning applications.

1. INTRODUCTION

Recently there has been an explosion in the compute resources required for training deep learning models (Shoeybi et al., 2019; Rajbhandari et al., 2019; Real et al., 2019). As a result, there has been broad interest in leveraging low-precision (< 32-bit) training algorithms to reduce the required compute resources (De Sa et al., 2017; Hubara et al., 2017; Gupta et al., 2015). Among these algorithms, mixed-precision training, in which model activations and gradients are stored in a 16-bit floating-point format while model weights and optimizer states use 32-bit precision, is commonly used when training generic deep learning models (Micikevicius et al., 2017; Kalamkar et al., 2019). While there is a wide body of literature showing that low-precision training can minimally impact accuracy on specific models (Wang et al., 2018b; De Sa et al., 2015; Zhang et al., 2017), conventional wisdom suggests that at least some 32-bit computation is required as a fail-safe in generic deep learning training. As such, new accelerator architectures for deep learning are forced to support both 32-bit and 16-bit compute units. This is much more costly in terms of area, power, and speed than hardware with only 16-bit compute units (Horowitz, 2014; Galal et al., 2013). In this paper we question whether 32-bit compute units are truly needed for new deep learning hardware accelerators. Namely, can we match the model accuracy of 32-bit-precision algorithms while leveraging only 16-bit compute units? To answer this question, we study pure 16-bit training algorithms: ones which use only 16-bit compute units and which store activations, gradients, model weights, and optimizer states all in 16-bit precision. Specifically, we focus on training with the BFloat16 compute unit, which is widely adopted in modern deep learning accelerators (Jouppi et al., 2017; Burgess et al., 2019). Such units take 16-bit inputs, perform computation, and then round the results to a 16-bit output.
BFloat16 compute units can provide 3× higher power efficiency, 1.5× lower latency, and 1.5× smaller chip area than 32-bit units (Horowitz, 2014; Galal et al., 2013). In addition, pure 16-bit training algorithms can reduce the memory footprint and bandwidth consumption of model weights and optimizer states by 2× compared to mixed-precision or 32-bit precision training, which matters especially for large models with billions of weights (Shoeybi et al., 2019; Rajbhandari et al., 2019). Developing reliable pure 16-bit training algorithms will enable hardware designers to realize these advantages. The simplest approach to pure 16-bit training is to take a 32-bit baseline and "make it low-precision" by replacing all the 32-bit numbers with 16-bit numbers and replacing each 32-bit floating-point operation with its 16-bit analog, using nearest rounding* to quantize as necessary: we call this approach the standard algorithm. Unfortunately, we show empirically that standard pure 16-bit training does not match 32-bit training in model accuracy across deep learning models. For example, the standard pure 16-bit training algorithm one would run on conventional hardware attains 16% and 7% lower training and validation accuracies, respectively, than a 32-bit baseline. Motivated by this observation, we start by analyzing what factors limit the model accuracy of this standard pure 16-bit algorithm. The goal of our analysis is to inspire a clean, minimal set of simple techniques that allow pure 16-bit training to attain strong model accuracy for state-of-the-art deep learning models across application domains. Towards this end, we derive insights from a simple least-squares regression model in Section 3. Using this least-squares regression model, we reveal that nearest rounding of compute unit outputs causes significant convergence degradation and consequent model accuracy loss.
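The cancellation effect is easy to reproduce. The following minimal NumPy sketch (our own illustration, not the paper's code; the helper name bf16_nearest is ours) emulates a BFloat16 output port by rounding a float32 value to the nearest bfloat16, and shows a weight update that is silently lost:

```python
import numpy as np

def bf16_nearest(x):
    """Round float32 values to bfloat16 (round-to-nearest-even on the
    low 16 bits of the float32 bit pattern), widened back to float32."""
    x = np.atleast_1d(np.asarray(x, dtype=np.float32))
    bits = x.view(np.uint32)
    rounded = (bits + 0x7FFF + ((bits >> 16) & 1)) & np.uint32(0xFFFF0000)
    return rounded.view(np.float32)

# A weight of 1.0 never moves: the 1e-4 update is below half an ulp
# (2^-8 ~ 0.0039) at 1.0, so every rounded sum falls back to 1.0.
w = np.float32(1.0)
for _ in range(100):
    w = bf16_nearest(w + np.float32(1e-4))[0]
print(w)  # stays 1.0; a float32 weight would reach ~1.01
```

This is exactly the regime deep learning training enters once updates become small relative to weights: every individual update rounds away, so the weight stops moving entirely.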
More concretely, we show a key theoretical insight hidden in existing work: when running stochastic gradient descent on a least-squares regression model, nearest rounding while updating model weights ignores small updates. This phenomenon significantly degrades the convergence of stochastic gradient descent when model updates become small relative to model weights, which is also what we observe when training deep learning models. In comparison, nearest rounding in the forward and backward pass of backpropagation has a negligible impact on convergence. These insights lead us to consider two simple existing techniques to achieve high-accuracy pure 16-bit training. First, we can use stochastic rounding instead of nearest rounding for the model weight updates. Here, the rounded weights become an unbiased estimate of the precise weights without rounding: thus, regardless of the magnitude of updates, the expectation of rounded weights converges at the same speed as the precise weights. Second, we can use the well-known Kahan summation algorithm (Kahan, 1965) to accumulate model updates while still keeping nearest rounding for all operations. This method tracks and compensates weight rounding errors across iterations with auxiliary 16-bit values, which avoids catastrophic cancellation of many small model weight updates. Empirically, in Section 4 we first validate that, as suggested by our theory, nearest rounding for model weight updates is the sole bottleneck for convergence and model accuracy on several deep learning models. We then demonstrate that pure 16-bit training using stochastic rounding or Kahan summation on model weight updates can match 32-bit training in model accuracy across a wide range of applications (He et al., 2016; Amodei et al., 2016; Devlin et al., 2018; Naumov et al., 2019) . 
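To make the first remedy concrete, here is a sketch of stochastic rounding to bfloat16 (our own illustration under the standard construction: add uniform noise over the discarded bits, then truncate). The probability of rounding up equals the discarded fraction, so the rounded value is an unbiased estimate of the input:

```python
import numpy as np

def bf16_stochastic(x, rng):
    """Stochastically round float32 to bfloat16: add uniform 16-bit
    noise to the low bits, then truncate, so that E[round(x)] = x."""
    x = np.atleast_1d(np.asarray(x, dtype=np.float32))
    bits = x.view(np.uint32).astype(np.uint64)
    noise = rng.integers(0, 1 << 16, size=bits.shape, dtype=np.uint64)
    rounded = ((bits + noise) & np.uint64(0xFFFF0000)).astype(np.uint32)
    return rounded.view(np.float32)

rng = np.random.default_rng(0)
# 1.0001 rounds up to the next bfloat16 value (1.0078125) with
# probability ~1.3%, and down to 1.0 otherwise, so the *average*
# of many rounded samples recovers 1.0001.
samples = bf16_stochastic(np.full(200_000, 1.0001, dtype=np.float32), rng)
print(samples.mean())  # ~1.0001
```

Because each rounded weight is unbiased, small updates are no longer systematically discarded: in expectation the weight drifts at the same rate as an unrounded weight would.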
To validate that nearest rounding for model weight updates is the cause of the accuracy degradation, we show that if we store model weights in 32-bit precision without rounding during weight updates, while keeping 16-bit precision and nearest rounding for all other operations, then the attained model accuracy matches full 32-bit precision training. Next, we demonstrate that 16-bit training with stochastic rounding for weight updates attains model accuracy matching 32-bit training for five out of seven applications in our study. Note that while it works most of the time, stochastic rounding is not a silver bullet, as using it alone could not fully match 32-bit training on all models. To address this, we show that Kahan summation for model weight updates closes the remaining gaps on all the models we consider. Kahan summation comes with a trade-off: it requires 2× the weight memory, but achieves up to 0.2% higher validation accuracy than stochastic rounding. Our results suggest that deep learning accelerators using only 16-bit compute units are feasible if stochastic rounding is supported by the hardware and Kahan summation by the software stack.
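The second remedy can be sketched as follows (again our own illustration, not the paper's code): a Kahan-compensated weight update in which every intermediate result is rounded to nearest bfloat16, as a 16-bit-only chip would do. The auxiliary value c (the 2× weight-memory cost) carries the rounding error that the 16-bit weight could not absorb:

```python
import numpy as np

def bf16(x):
    """Round a float32 scalar to bfloat16 (round-to-nearest-even)."""
    b = np.atleast_1d(np.float32(x)).view(np.uint32)
    b = (b + 0x7FFF + ((b >> 16) & 1)) & np.uint32(0xFFFF0000)
    return b.view(np.float32)[0]

def kahan_step(w, c, update):
    """One compensated weight update, all intermediates in bfloat16."""
    y = bf16(update - c)       # fold in the error left over from last step
    t = bf16(w + y)            # low-order bits of y may be dropped here...
    c = bf16(bf16(t - w) - y)  # ...and are recovered into c
    return t, c

w, c = np.float32(1.0), np.float32(0.0)
naive = np.float32(1.0)
for _ in range(100):
    w, c = kahan_step(w, c, np.float32(1e-4))
    naive = bf16(naive + np.float32(1e-4))
print(naive, w)  # naive stays 1.0; the compensated weight reaches ~1.01
```

Individually each 1e-4 update is below half an ulp and rounds away, yet the compensated weight tracks the true accumulated sum to within one bfloat16 ulp, while the naive weight never moves.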

2. PRELIMINARIES

In this section we establish the background and notation for our study and present the preliminary observations that motivate our work. We focus on stochastic gradient descent (SGD), the primary workhorse used to train deep learning models. SGD computes gradients from a subset of training samples and uses them to update the model weights so as to decrease the loss in expectation. In the classic supervised learning setting, let $(X, y)$ be a dataset where $X = [x_1, x_2, \ldots, x_n] \in \mathbb{R}^{n \times d}$ and $y = (y_1, y_2, \ldots, y_n) \in \mathbb{R}^n$. On this dataset, we use stochastic gradient descent to optimize a loss function $f(w) = \frac{1}{n} \sum_{i=1}^{n} f_i(w, x_i, y_i)$ defined by the model. At the $t$-th iteration, we sample an index subset $I^{(t)} \subset \{1, 2, \ldots, n\}$ and compute a sample gradient $\nabla f_{I^{(t)}}(w_t)$ as an unbiased estimate of the full gradient $\nabla f(w)$. In deep learning, model training can be described as a compute graph where the compute graph operators such as addition and ma-



* This nearest rounding is the standard rounding mode for compute unit outputs, commonly supported across hardware platforms (Intel, 2018; Nvidia, 2020).

