ENABLING BINARY NEURAL NETWORK TRAINING ON THE EDGE

Abstract

The ever-growing computational demands of increasingly complex machine learning models frequently necessitate the use of powerful cloud-based infrastructure for their training. Binary neural networks are known to be promising candidates for on-device inference due to their extreme compute and memory savings over higher-precision alternatives. In this paper, we demonstrate that they are also strongly robust to gradient quantization, thereby making the training of modern models on the edge a practical reality. We introduce a low-cost binary neural network training strategy exhibiting sizable memory footprint reductions and energy savings versus Courbariaux & Bengio (2016)'s standard approach. Against the latter, we see coincident memory requirement and energy consumption drops of 2-6×, while reaching similar test accuracy in comparable time, across a range of small-scale models trained to classify popular datasets. We also showcase ImageNet training of ResNetE-18, achieving a 3.12× memory reduction over the aforementioned standard. Such savings will allow unnecessary cloud offloading to be avoided, reducing latency and increasing energy efficiency while also safeguarding privacy.

1. INTRODUCTION

Although binary neural networks (BNNs) feature weights and activations with just single-bit precision, many models are able to reach accuracy indistinguishable from that of their higher-precision counterparts (Courbariaux & Bengio, 2016; Wang et al., 2019b). Since BNNs are functionally complete, their limited precision does not impose an upper bound on achievable accuracy (Constantinides, 2019). BNNs represent the ideal class of neural networks for edge inference, particularly for custom hardware implementation, due to their use of XNOR for multiplication: a fast and cheap operation to perform. Their use of compact weights also suits systems with limited memory and increases opportunities for caching, providing further potential performance boosts. FINN, the seminal BNN implementation for field-programmable gate arrays (FPGAs), reached the highest CIFAR-10 and SVHN classification rates to date at the time of its publication (Umuroglu et al., 2017). Despite featuring binary forward propagation, existing BNN training approaches perform backward propagation using high-precision floating-point data types, typically float32, often making training infeasible on resource-constrained devices. The high-precision activations retained between forward and backward propagation commonly constitute the largest proportion of the total memory footprint of a training run (Sohoni et al., 2019; Cai et al., 2020). Additionally, backward propagation with high-precision gradients is costly, challenging the energy limitations of edge platforms. An understanding of standard BNN training algorithms led us to ask two questions: why are high-precision weight gradients used when we are only concerned with weights' signs, and why are high-precision activations used when the computation of weight gradients only requires binary activations as input?
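To make the XNOR-based arithmetic concrete, the following NumPy sketch (our own illustration, not taken from FINN or any cited implementation) computes a dot product of two {-1, +1} vectors stored as bits, using the standard XNOR-popcount identity:

```python
import numpy as np

def binary_dot(a_bits: np.ndarray, w_bits: np.ndarray) -> int:
    """Dot product of two {-1, +1} vectors stored as {0, 1} bits.

    Encoding +1 as 1 and -1 as 0, elementwise multiplication over
    {-1, +1} becomes XNOR over the bits, and accumulation reduces to
    a popcount: dot = 2 * popcount(XNOR(a, w)) - n.
    """
    n = a_bits.size
    matches = np.logical_not(np.logical_xor(a_bits, w_bits))  # XNOR
    return 2 * int(np.count_nonzero(matches)) - n             # popcount identity

# Cross-check against ordinary arithmetic on the decoded ±1 values.
a = np.array([1, 0, 1, 1], dtype=bool)  # decodes to [+1, -1, +1, +1]
w = np.array([1, 1, 0, 1], dtype=bool)  # decodes to [+1, +1, -1, +1]
assert binary_dot(a, w) == int(np.dot(2 * a.astype(int) - 1, 2 * w.astype(int) - 1))
```

On binary hardware the XNOR and popcount map directly onto single gates and a small adder tree, which is the source of the compute savings described above.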
In this paper, we present a low-memory, low-energy BNN training scheme based on this intuition, featuring (i) the use of binary, power-of-two and 16-bit floating-point data types, and (ii) batch normalization modifications enabling the buffering of binary activations. By increasing the viability of learning on the edge, this work will reduce the domain mismatch between training and inference, particularly in conjunction with federated learning (McMahan et al., 2017; Bonawitz et al., 2019), and ensure privacy for sensitive applications (Agarwal et al., 2018). Via the aggressive energy and memory footprint reductions they facilitate, our proposals will enable BNNs to be trained without the network access reliance, latency and energy overheads or data divulgence inherent to cloud offloading. To this end, we make the following novel contributions.

• We conduct the first variable representation and lifetime analysis of the standard BNN training process, informing the application of beneficial approximations. In particular, we binarize weight gradients owing to the lack of importance of their magnitudes, modify the forward and backward batch normalization operations such that we remove the need to buffer high-precision activations, and determine and apply appropriate additional quantization schemes (power-of-two activation gradients and reduced-precision floating-point data) taken from the literature.

• Against Courbariaux & Bengio (2016)'s approach, we demonstrate the preservation of test accuracy and convergence rate when training BNNs to classify MNIST, CIFAR-10, SVHN and ImageNet while lowering memory and energy needs by up to 5.67× and 4.53×, respectively.
• We provide an open-source release of our training software, along with our memory and energy estimation tools, to the community (source supplied in .zip for review).
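The two quantizers named in the first contribution can be sketched as follows. This is an illustrative NumPy rendering of the general ideas (sign-only weight gradients; sign-plus-power-of-two activation gradients); the function names and the rounding-in-log-space choice are our own assumptions, not the paper's implementation:

```python
import numpy as np

def binarize_grad(g: np.ndarray) -> np.ndarray:
    """Keep only weight-gradient signs: since the binarized weights are
    used only through their signs, gradient magnitudes carry little
    information. This sketch returns the bare sign in {-1, 0, +1}."""
    return np.sign(g)

def po2_quantize(g: np.ndarray) -> np.ndarray:
    """Round each activation-gradient magnitude to the nearest power of
    two (nearest in log space), preserving sign: wide dynamic range is
    retained while each value needs only a sign bit and a small exponent."""
    out = np.zeros_like(g)
    nz = g != 0
    out[nz] = np.sign(g[nz]) * 2.0 ** np.round(np.log2(np.abs(g[nz])))
    return out

g = np.array([0.3, -0.004, 1.7, 0.0])
print(po2_quantize(g))  # 0.3 -> 0.25, -0.004 -> -2**-8, 1.7 -> 2.0, 0 -> 0
```

A power-of-two value can then be stored as just its sign and exponent, and multiplication by it reduces to a shift, which is the source of the energy savings claimed above.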

2. RELATED WORK

The authors of all published works on BNN inference acceleration to date made use of high-precision floating-point data types during training (Courbariaux et al., 2015; Courbariaux & Bengio, 2016; Lin et al., 2017; Ghasemzadeh et al., 2018; Liu et al., 2018; Wang et al., 2019a; 2020; Umuroglu et al., 2020; He et al., 2020; Liu et al., 2020). There is precedent, however, for the use of quantization when training non-binary networks, as we show in Table 1 via side-by-side comparison of the approximation approaches taken in those works along with that proposed herein.

Table 1: Comparison of applied approximations vs related low-cost neural network training works.
1 Arbitrary precision was supported, but significant accuracy degradation was observed below 6 bits.
2 Activations were not retained between forward and backward propagation in order to save memory.
3 Power-of-two format comprising sign bit and exponent.

The effects of quantizing the gradients of networks with high-precision data, either fixed or floating point, have been studied extensively. Zhou et al. (2016) and Wu et al. (2018a) trained networks with fixed-point weights and activations using fixed-point gradients, reporting no accuracy loss for AlexNet classifying ImageNet with gradients wider than five bits. Wen et al. (2017) and Bernstein et al. (2018) focused solely on aggressive weight gradient quantization, aiming to reduce communication costs for distributed learning. Weight gradients were losslessly quantized into ternary and binary formats, respectively, with forward propagation and activation gradients kept at high precision. In this work, we make the novel observations that activation gradient dynamic range is more important than precision, and that BNNs are more robust to approximation than higher-precision networks. We thus propose a data representation scheme more aggressive than all of the aforementioned works combined, delivering large memory and energy savings with near-lossless performance. Gradient checkpointing, the recomputation of activations during backward propagation, has been proposed as a method to reduce the memory consumption of training (Chen et al., 2016; Gruslys et al., 2016).
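The claim that activation gradient dynamic range matters more than precision can be illustrated with a small, self-contained comparison. This is our own synthetic sketch, not an experiment from any of the works cited: a uniform fixed-point-style quantizer zeroes small gradients outright, while a sign-plus-power-of-two format bounds the relative error of every non-zero value:

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic "gradients" spanning several orders of magnitude.
g = rng.standard_normal(10_000) * 10.0 ** rng.uniform(-6, 0, 10_000)

def uniform_quantize(x: np.ndarray, bits: int, max_abs: float) -> np.ndarray:
    """k-bit uniform (fixed-point-style) quantizer clipped to [-max_abs, max_abs]."""
    levels = 2 ** (bits - 1) - 1
    step = max_abs / levels
    return np.clip(np.round(x / step), -levels, levels) * step

def po2_quantize(x: np.ndarray) -> np.ndarray:
    """Sign bit plus nearest power-of-two magnitude: few bits, wide dynamic range."""
    out = np.zeros_like(x)
    nz = x != 0
    out[nz] = np.sign(x[nz]) * 2.0 ** np.round(np.log2(np.abs(x[nz])))
    return out

u = uniform_quantize(g, bits=8, max_abs=float(np.abs(g).max()))
p = po2_quantize(g)
# The uniform quantizer flushes a substantial fraction of gradients to
# zero; the power-of-two quantizer zeroes none, and its relative error
# is bounded (below sqrt(2) - 1) regardless of magnitude.
print("zeroed by 8-bit uniform:", np.mean(u == 0))
print("zeroed by power-of-two:", np.mean(p == 0))
print("worst po2 relative error:", (np.abs(p - g) / np.abs(g)).max())
```

The bounded relative error follows because rounding in log space lands within a factor of sqrt(2) of the true magnitude, regardless of how small that magnitude is.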

