OPTIMIZING DATA-FLOW IN BINARY NEURAL NETWORKS

Abstract

Binary Neural Networks (BNNs) can significantly accelerate the inference time of a neural network by replacing its expensive floating-point arithmetic with bitwise operations. Most existing solutions, however, do not fully optimize the data flow through the BNN layers, and intermediate conversions from 1 to 16/32 bits often further hinder efficiency. We propose a novel training scheme that can increase data flow and parallelism in the BNN pipeline; specifically, we introduce a clipping block that decreases the data-width from 32 to 8 bits. Furthermore, we reduce the internal accumulator size of a binary layer, usually kept at 32 bits to prevent data overflow, without losing accuracy. Additionally, we provide an optimization of the Batch Normalization layer that both reduces latency and simplifies deployment. Finally, we present an optimized implementation of the Binary Direct Convolution for ARM instruction sets. Our experiments show a consistent improvement of the inference speed (up to 1.77× and 1.9× compared to two state-of-the-art BNN frameworks) with no drop in accuracy for at least one full-precision model.

1. INTRODUCTION

In the last decade, deep neural networks (DNNs) have come to demonstrate high accuracy on many datasets like ImageNet Russakovsky et al. (2015), outperforming legacy methods and sometimes even human experts (Krizhevsky et al. (2012), Simonyan & Zisserman (2014), Szegedy et al. (2015), He et al. (2016)). These improvements have been achieved by increasing the depth and complexity of the network, leading to intensive usage of computational resources and memory bandwidth. Large DNN models run smoothly on expensive GPU-based machines but cannot be easily deployed to edge devices (i.e., small mobile or IoT systems), which are typically more resource-constrained.

Various techniques have been introduced to mitigate this problem, including network quantization Choi et al. (2018); Hubara et al. (2016); Lin et al. (2017); Rastegari et al. (2016); Zhou et al. (2016), network pruning Han et al. (2015); Wen et al. (2016), and efficient architecture design Howard et al. (2017); Tan & Le (2019). Recent work on quantization (e.g., Courbariaux et al. (2016); Hubara et al. (2016); Liu et al. (2018); Martinez et al. (2020)) has shown that a DNN model can even be quantized to 1 bit (also known as binarization), thus achieving a remarkable speedup compared to the full-precision network. The memory requirement of such a binarized DNN (BNN) is drastically reduced compared to a DNN of the same structure, since a significant proportion of weights and activations can be represented by 1 bit, usually {-1, +1}. In addition, high-precision multiply-and-accumulate operations can be replaced by faster XNOR and popcount operations. However, the aggressive quantization can make BNNs less accurate than their full-precision counterparts. Some researchers showed that the performance loss often arises from the gradient mismatch problem caused by the non-differentiable binary activation function Darabi et al. (2018); Liu et al. (2018). This non-differentiability of the quantization functions prevents gradient back-propagation through the quantization layer; therefore, previous works used the straight-through estimator (STE) to approximate the gradient on non-differentiable layers Bengio et al. (2013); Hubara et al. (2016). Furthermore, to prevent the binarization of weights and activations from producing feature maps of lower quality and capacity, a combination of binary and floating-point layers is usually adopted. Unfortunately, each time a binary layer is connected to a floating-point one, the efficiency of the pipeline decreases, since intermediate conversions between 1-bit and 16/32-bit representations are required.
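To make the XNOR/popcount substitution concrete, consider a dot product between two {-1, +1} vectors. Encoding +1 as bit 1 and -1 as bit 0, two positions contribute +1 exactly when their bits match, so the dot product equals matches minus mismatches, i.e., n - 2·popcount(a XOR b). The following minimal Python sketch (an illustration of the general technique, not the paper's ARM implementation; `binary_dot` is a hypothetical helper name) shows the equivalence:

```python
def binary_dot(a_bits: int, b_bits: int, n: int) -> int:
    """Dot product of two {-1, +1} vectors of length n, packed as
    n-bit integers (bit 1 encodes +1, bit 0 encodes -1).

    Positions where the bits differ contribute -1, matching positions
    contribute +1, hence: dot = n - 2 * popcount(a XOR b).
    (Real BNN kernels use XNOR plus popcount; XOR with the complement
    count is the same arithmetic.)
    """
    mismatches = bin(a_bits ^ b_bits).count("1")
    return n - 2 * mismatches

# a = [+1, -1, +1, +1] -> 0b1011, b = [+1, +1, -1, +1] -> 0b1101
# Floating-point dot product: 1 - 1 - 1 + 1 = 0
print(binary_dot(0b1011, 0b1101, 4))  # -> 0
```

On ARM, the same computation maps to vectorized EOR and population-count instructions over whole registers, which is what makes binary layers fast in practice.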
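The straight-through estimator mentioned above can be sketched as a pair of forward/backward rules (an illustrative NumPy version of the clipped STE of Hubara et al. (2016), not tied to any particular framework; the function names are hypothetical): the forward pass applies the non-differentiable sign function, while the backward pass lets the incoming gradient through unchanged wherever the input lies in [-1, 1] and zeroes it elsewhere.

```python
import numpy as np

def binarize_forward(x: np.ndarray) -> np.ndarray:
    """Non-differentiable binary activation: maps inputs to {-1, +1}."""
    return np.where(x >= 0, 1.0, -1.0)

def binarize_backward(x: np.ndarray, grad_out: np.ndarray) -> np.ndarray:
    """Clipped straight-through estimator: pretend sign(x) is the
    identity inside [-1, 1], so the gradient passes through there
    and is zeroed outside (where the clipped identity is flat)."""
    return grad_out * (np.abs(x) <= 1.0)

x = np.array([-2.0, -0.5, 0.5, 2.0])
print(binarize_forward(x))                   # -> [-1. -1.  1.  1.]
print(binarize_backward(x, np.ones_like(x))) # -> [0. 1. 1. 0.]
```

The clipping to [-1, 1] is what distinguishes this variant from the plain identity STE of Bengio et al. (2013): it stops gradients from flowing through activations that are already saturated.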

