SCHRÖDINGER'S FP: TRAINING NEURAL NETWORKS WITH DYNAMIC FLOATING-POINT CONTAINERS

Abstract

We introduce a software-hardware co-design approach to reduce memory traffic and footprint during training with BFloat16 or FP32, in order to boost energy efficiency and execution time performance. Our methods dynamically adjust the size and format of the floating-point containers used to store activations and weights during training. The different value distributions lead us to different approaches for exponents and mantissas. Gecko exploits the favourable exponent distribution with a lossless delta encoding approach to reduce the total exponent footprint by up to 58% in comparison to the FP32 baseline. To contend with the noisy mantissa distributions, we present two lossy methods that eliminate as many least significant bits as possible without affecting accuracy. Quantum Mantissa is a machine learning mantissa compression method that taps into the gradient descent algorithm to learn the minimal mantissa bitlengths at a per-layer granularity, obtaining up to 92% reduction in total mantissa footprint. Alternatively, BitChop observes changes in the loss function during training to adjust the mantissa bitlength network-wide, yielding a reduction of 81% in footprint. Schrödinger's FP implements hardware encoders/decoders that, guided by Gecko/Quantum Mantissa or Gecko/BitChop, transparently encode/decode values when transferring to/from off-chip memory, boosting energy efficiency and reducing execution time.

1. INTRODUCTION

Training most state-of-the-art neural networks has become an exascale-class task (Venkataramani et al., 2017; Amodei et al., 2018) requiring many graphics processors (NVidia, 2017) or specialized accelerators, e.g., (Jouppi et al., 2017; Hab, 2019; Liao et al., 2019; Cer, 2019). While training is both computationally and data demanding, it is the memory transfers to off-chip DRAM for stashing (i.e., saving and much later recovering) activation and weight tensors that dominate overall execution time and energy (Jain et al., 2018) (see Fig. 1). The per-batch data volume easily surpasses on-chip memory capacities, necessitating off-chip DRAM accesses which are up to two orders of magnitude slower and more energy expensive. It is no wonder that reducing this overhead has been receiving attention throughout the software/hardware stack. Chen et al. (2016) and Zheng et al. (2020) recompute rather than stash activations, whereas microbatching strives to keep activations on chip (Huang et al., 2018). Encoding methods target specific value patterns such as zeros (Rhu et al., 2018) or redundant spatial information (Evans et al., 2020), or exploit underlying properties of training for certain tensors, e.g., the outputs of ReLU or Pooling (Jain et al., 2018). These lossless and lossy encodings use fewer bits for stashed tensor content to reduce tensor volume. This also boosts the effective capacity of each node's main memory, which further reduces traffic during distributed training. All aforementioned methods either shift significant costs to compute or target only some values, offering only limited relief.

The most direct way to reduce tensor volume is to use a more compact datatype. Initially, with the goal of demonstrating that neural networks can tackle challenging problems, training relied on single-precision 32b floating-point (FP32), which remains the datatype of choice when achieving the best accuracy is the priority. Recently, we have seen some success in training with more compact datatypes such as half-precision FP16, BFloat16 (Kalamkar et al., 2019), dynamic floating-point (Das et al., 2018), and Flexpoint (Köster et al., 2017), and even with combinations with other datatypes such as fixed-point (Das et al., 2018; Micikevicius et al., 2018; NVIDIA; Drumond et al., 2018). Obviously, knowing in advance which compact datatypes to use during training would be best. However, given that this goal still eludes us, our work asks whether we can harness the training process itself to automatically learn them. Ideally, such a method would automatically tailor datatypes to meet the demands of each tensor, layer, and network. Furthermore, it could continuously adjust datatype selection as training progresses, adapting to the changing needs. In addition to accelerating training, methods such as ours can further inform efforts for selecting more efficient datatypes for inference, such as those by Micikevicius et al. (2022) or Sun et al. (2020). A similar idea has successfully targeted fixed-point inference by using reinforcement learning (Wang et al., 2018a), clever differentiable datatype definitions (Nikolić et al., 2020), architecture search (Wu et al., 2018), and profiling (Nikolić et al., 2018). However, all of these are too expensive for training, and their overheads would overshadow the benefits of a more compact training datatype. Given that floating point remains the datatype of choice, we focus on floating-point datatype selection.
We explore the possibility of dynamically and continuously adjusting the mantissa bitlength (fractional bits) and the container (overall bits) of floating-point values (activations and/or weights) for stashed tensors, and of doing so transparently at no additional burden to the user. Our solution is Schrödinger's FP, a family of methods that dynamically adjust the floating-point encoding and complement the aforementioned training acceleration methods. Our approach is end-to-end fully automated, requiring no input, guessing, or advance knowledge from the operator. Schrödinger's FP can be used to reduce memory overheads and boost computation throughput. In this work, we limit our attention to boosting energy efficiency and performance by using Schrödinger's FP to transparently encode values as they are being stashed to off-chip memory, and decode them to their original format as they are being read back. This application can be used as a plug-in over any hardware without changing the existing on-chip memory hierarchy and compute units. Similarly, Schrödinger's FP will generally work in conjunction with methods that improve accuracy for a preselected datatype, or that partition, distribute, or reschedule the training work to improve energy efficiency and performance.

Schrödinger's FP uses tailored approaches for the mantissa and exponent. It dynamically adjusts mantissa bitlengths in order to store and read fewer bits per number in off-chip memory. This work explores two such methods. The first, Quantum Mantissa, harnesses the training algorithm itself to learn on-the-fly the mantissa bitlengths that are needed per tensor/layer, and continuously adapts those bitlengths per batch. Quantum Mantissa introduces a single learning parameter per tensor and a loss
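To make the mantissa side concrete, the effect of storing a value with fewer mantissa bits can be sketched as below. This is not the paper's hardware encoder: it is a minimal NumPy sketch under our own assumptions (the function name `truncate_mantissa`, round-half-up rounding, and finite normal inputs are all illustrative choices).

```python
import numpy as np

def truncate_mantissa(x: np.ndarray, keep_bits: int) -> np.ndarray:
    """Keep only `keep_bits` of the 23 FP32 mantissa bits.

    Rounding is approximated by adding half a ULP of the reduced
    format before zeroing the dropped bits.  Assumes finite normal
    values (NaN/Inf handling is omitted in this sketch).
    """
    assert 0 <= keep_bits <= 23
    bits = x.astype(np.float32).view(np.uint32)
    drop = 23 - keep_bits
    if drop == 0:
        return x.astype(np.float32)
    half = np.uint32(1 << (drop - 1))                       # half ULP of the reduced format
    mask = np.uint32((0xFFFFFFFF << drop) & 0xFFFFFFFF)     # clears the dropped bits
    return ((bits + half) & mask).view(np.float32)
```

Values representable in the reduced format pass through unchanged; everything else rounds to the nearest representable value, which is why accuracy can survive aggressive bitlength reductions when the bitlength is chosen per layer.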



Figure 1: Training process and its memory transfers. Blue - activations, typically saved to off-chip memory during the forward pass and retrieved during the backward pass. Red - weights, typically stored and loaded once from off-chip memory. Gray - updates and gradients, which through mini-batching during the backward pass can often fit on-chip.

IBM managed to push the datatype to 8b (Wang et al., 2018b) and 4b (Sun et al., 2020) extremes for some cases. As Moore's law and Dennard scaling for semiconductors have come to an end, using more efficient datatypes during training is receiving wider attention: even major hardware manufacturers are investigating how to use 8b floating point with different mantissa/exponent ratios according to the perceived needs of tensors (Micikevicius et al., 2022). These methods require careful trial-and-error investigation of where, when, and which narrow datatypes to use, especially because different tensors, tasks, architectures, and layers require different datatypes. Consequently, there is no guarantee of success: whether a choice of datatypes is viable can only be evaluated post mortem, after full trial-and-error training runs. Moreover, since the datatypes are chosen statically, there is no opportunity to amend the choice if accuracy suffers (e.g., the significant drop with deeper networks identified by IBM (Sun et al., 2020)).
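Unlike the lossy mantissa side, the exponent side described in the abstract (Gecko) is lossless: exponents cluster tightly, so storing each one as a small delta from a per-group base needs far fewer than 8 bits per value. The sketch below illustrates the idea only; the group size, per-group base, field widths, and the helper names `delta_encode_exponents`/`encoded_exponent_bits` are our own assumptions, not the paper's actual format.

```python
import numpy as np

def delta_encode_exponents(x: np.ndarray, group: int = 64):
    """Delta-encode the 8b exponents of FP32 values per group.

    Each group stores one 8b base exponent plus a per-value delta;
    base + delta reconstructs every exponent exactly, so the scheme
    is lossless.  The delta bitwidth shrinks when a group's exponents
    are clustered, which is what makes the encoding pay off.
    """
    bits = x.astype(np.float32).view(np.uint32)
    exp = ((bits >> 23) & 0xFF).astype(np.int32)
    encoded = []
    for i in range(0, len(exp), group):
        g = exp[i:i + group]
        base = int(g.min())
        deltas = g - base                                   # all >= 0
        width = max(int(deltas.max()).bit_length(), 1)      # bits per delta
        encoded.append((base, width, deltas.astype(np.uint8)))
    return encoded

def encoded_exponent_bits(encoded) -> int:
    # 8b base + 5b width field + `width` bits per value, per group
    return sum(8 + 5 + w * len(d) for _, w, d in encoded)
```

For a tensor whose values share a narrow dynamic range, each group collapses to roughly 1 bit per exponent plus a small per-group header, versus the 8 bits per value that FP32 spends.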

