OPTIMIZING DATA-FLOW IN BINARY NEURAL NETWORKS

Abstract

Binary Neural Networks (BNNs) can significantly accelerate the inference time of a neural network by replacing its expensive floating-point arithmetic with bitwise operations. Most existing solutions, however, do not fully optimize the data flow through the BNN layers, and intermediate conversions from 1 to 16/32 bits often further hinder efficiency. We propose a novel training scheme that can increase data flow and parallelism in the BNN pipeline; specifically, we introduce a clipping block that decreases the data width from 32 bits to 8. Furthermore, we reduce the internal accumulator size of a binary layer, usually kept at 32 bits to prevent data overflow, without losing accuracy. Additionally, we provide an optimization of the Batch Normalization layer that both reduces latency and simplifies deployment. Finally, we present an optimized implementation of the Binary Direct Convolution for ARM instruction sets. Our experiments show a consistent improvement of the inference speed (up to 1.77× and 1.9× compared to two state-of-the-art BNN frameworks) with no drop in accuracy for at least one full-precision model.

1. INTRODUCTION

In the last decade, deep neural networks (DNNs) have come to demonstrate high accuracy on many datasets such as ImageNet Russakovsky et al. (2015), outperforming legacy methods and sometimes even human experts (Krizhevsky et al. (2012); Simonyan & Zisserman (2014); Szegedy et al. (2015); He et al. (2016)). These improvements have been achieved by increasing the depth and complexity of the network, leading to intensive usage of computational resources and memory bandwidth. Large DNN models run smoothly on expensive GPU-based machines but cannot be easily deployed to edge devices (i.e., small mobile or IoT systems), which are typically more resource-constrained. Various techniques have been introduced to mitigate this problem, including network quantization Choi et al. (2018); Hubara et al. (2016); Lin et al. (2017); Rastegari et al. (2016); Zhou et al. (2016), network pruning Han et al. (2015); Wen et al. (2016), and efficient architecture design Howard et al. (2017); Tan & Le (2019). Recent work on quantization (e.g., Courbariaux et al. (2016); Hubara et al. (2016); Liu et al. (2018); Martinez et al. (2020)) has shown that a DNN model can even be quantized to 1 bit (also known as binarization), thus achieving a remarkable speedup compared to the full-precision network. The memory requirement of such a binarized DNN (BNN) is drastically reduced compared to a DNN of the same structure, since a significant proportion of weights and activations can be represented by 1 bit, usually in {-1, +1}. In addition, high-precision multiply-and-accumulate operations can be replaced by faster XNOR and popcount operations. However, this aggressive quantization can make BNNs less accurate than their full-precision counterparts. Some researchers showed that the performance loss often arises from the gradient mismatch problem caused by the non-differentiable binary activation function Darabi et al. (2018); Liu et al. (2018). This non-differentiability of the quantization functions prevents gradient back-propagation through the quantization layer. Therefore, previous works used the straight-through estimator (STE) to approximate the gradient of non-differentiable layers Bengio et al. (2013); Hubara et al. (2016). Furthermore, to prevent the binarization of weights and activations from producing feature maps of lower quality and capacity, a combination of binary and floating-point layers is usually adopted.

Unfortunately, each time a binary layer is connected to a floating-point one, the efficiency of the pipeline is compromised by input/output data type conversion between layers. In addition, the internal parallelism of a binary layer depends on the encoding of its accumulator, which is often kept at 32 bits to prevent overflow. In this paper we present several optimizations that allow training a BNN with an inter-layer data width of 8 bits. Most prior work on BNNs emphasizes overall network accuracy; in contrast, our aim is to preserve the initial accuracy while improving efficiency. Our contributions (graphically highlighted in Figures 1i and 1ii) can be summarized as follows:

• a novel training scheme that improves the data flow in the BNN pipeline (Section 3.1); specifically, we introduce a clipping block to shrink the data width from 32 to 8 bits while simultaneously reducing the internal accumulator size;
• an optimization of the Batch Normalization layer that decreases latency and simplifies deployment (Section 3.2);
• an optimized Binary Direct Convolution method for ARM instruction sets (Section 3.3).

To prove the effectiveness of the proposed optimizations, in Section 4 we provide experimental evaluations that show the speed-up relative to state-of-the-art BNN engines such as LCE Bannink et al. (2021) and DaBNN Zhang et al. (2019).

2. RELATED WORK

BNNs were first introduced by Courbariaux et al. (2016), who established an end-to-end gradient back-propagation framework for training binary weights and activations. They achieved good success on small classification datasets including CIFAR10 Krizhevsky et al. (2009) and MNIST Netzer et al. (2011), but encountered a severe accuracy drop on ImageNet. Many subsequent studies focused on enhancing BNN accuracy. Rastegari et al. (2016) proposed XNOR-Net, where real-valued scaling factors are used to multiply the binary weight kernels; this methodology then became a representative binarization approach for bridging the gap between BNNs and their real-valued counterparts. Bi-Real Net Liu et al. (2018) added shortcuts to propagate values along the feature maps, which further boosted the top-1 accuracy on ImageNet; nevertheless, the model still relies on 32-bit floating point to execute the batch normalization and addition operators (as shown in Fig. 1iia).

One of the major weaknesses of BNNs is the gradient approximation introduced by the STE binarization function Courbariaux et al. (2016). STE computes the derivative of sign as if the binary operation were the linear function A(x) = max(-1, min(1, x)):

STE(x) = A'(x) = [-1 ≤ x ≤ 1],

where [·] is the Iverson bracket, i.e., 1 when the condition holds and 0 otherwise. The STE formulation above thus additionally cancels the gradients when the inputs grow too large Hubara et al. (2016). STE provides only a coarse approximation of the gradient, which inevitably affects the test accuracy of the BNN.

To address this issue, other recent studies tried to improve the performance of BNNs by adopting proper optimization methods for the quantization. Inspired by STE, many works update the parameters approximately by introducing auxiliary loss functions Gu et al. (2019); Qin et al. (2020). Besides the many efforts to develop more efficient and accurate architectures, a few works have provided benchmarks on real devices such as ARM processors. Based on the analysis provided in Bannink et al. (2021), the fastest inference engines for binary neural networks with proven benchmarks (Section 4 of Bannink et al. (2021)) are LCE and DaBNN.

3. DATA-FLOW OPTIMIZATIONS

As illustrated in Fig. 1a, the most commonly used BNN architectures (e.g., VGG and ResNet) have four essential blocks in each convolutional/fully-connected (CONV/FC) layer: sign (binarization), XNOR, popcount, and Batch Normalization (BN). Since the weights, inputs, and outputs are all binary, the traditional multiply-and-accumulate operation is replaced by XNOR and bit counting (i.e., popcount). XNOR and popcount are usually fused to improve efficiency. The use of Batch Normalization after each binarized layer is very important in BNNs, as pointed out by Santurkar et al.
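To make the XNOR/popcount substitution concrete, the sketch below computes a dot product between two vectors over {-1, +1} that have been bit-packed into 32-bit words. The packing layout (bit 1 encodes +1, bit 0 encodes -1) and the use of the GCC/Clang `__builtin_popcount` intrinsic are assumptions for illustration, not the paper's implementation:

```c
#include <stdint.h>

/* Dot product of two vectors over {-1, +1}, bit-packed into 32-bit
 * words (bit 1 encodes +1, bit 0 encodes -1). XNOR marks positions
 * where the operands agree, so the dot product equals
 * agreements - disagreements = 2 * popcount(xnor) - n_bits. */
static int binary_dot(const uint32_t *a, const uint32_t *b,
                      int n_words, int n_bits)
{
    int agree = 0;
    for (int i = 0; i < n_words; ++i) {
        uint32_t agree_bits = ~(a[i] ^ b[i]);       /* XNOR */
        if (i == n_words - 1 && n_bits % 32 != 0)   /* mask padding bits */
            agree_bits &= (1u << (n_bits % 32)) - 1;
        agree += __builtin_popcount(agree_bits);    /* GCC/Clang builtin */
    }
    return 2 * agree - n_bits;
}
```

Because one word holds 32 binary weights, a single XNOR plus popcount replaces 32 floating-point multiply-accumulates, which is the source of the speedup discussed above.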

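The data-width reduction performed by the clipping block of Section 3.1 can be illustrated with a plain saturating narrowing of the 32-bit popcount accumulator into the signed 8-bit range. Note this is only a generic sketch of the inference-time effect; the fixed thresholds here are an assumption, whereas the paper's clipping block is learned during training:

```c
#include <stdint.h>

/* Saturate a 32-bit accumulator into the signed 8-bit range so the
 * value can flow to the next layer with an 8-bit data width.
 * The fixed [-128, 127] thresholds are illustrative assumptions. */
static int8_t clip_to_int8(int32_t acc)
{
    if (acc > 127)  return 127;
    if (acc < -128) return -128;
    return (int8_t)acc;
}
```

Applying such a clip between layers avoids the 1-to-32-bit round trips mentioned in the introduction and allows narrower internal accumulators.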

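The straight-through estimator discussed in the related work can be sketched as a pair of scalar functions, with the clipped identity A(x) = max(-1, min(1, x)) as the surrogate whose derivative is used in the backward pass. This is a minimal numerical illustration, not a training recipe:

```c
/* Sign binarization used in the forward pass, and the STE surrogate
 * used in the backward pass: the upstream gradient passes through
 * unchanged where |x| <= 1 and is cancelled where the input is
 * too large, matching STE(x) = A'(x) = [-1 <= x <= 1]. */
static float sign_forward(float x)
{
    return x >= 0.0f ? 1.0f : -1.0f;
}

static float ste_backward(float x, float upstream_grad)
{
    return (x >= -1.0f && x <= 1.0f) ? upstream_grad : 0.0f;
}
```

The mismatch between the true (zero almost everywhere) derivative of sign and this surrogate is exactly the coarse gradient approximation that motivates the auxiliary-loss approaches cited above.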