SPARSE BINARY NEURAL NETWORKS

Abstract

Quantized neural networks are gaining popularity thanks to their ability to solve complex tasks with accuracy comparable to full-precision Deep Neural Networks (DNNs), while also reducing computational power and storage requirements and increasing processing speed. These properties make them an attractive alternative for the development and deployment of DNN-based applications on Internet-Of-Things (IoT) devices. Among quantized networks, Binary Neural Networks (BNNs) have reported the largest speed-ups. However, they suffer from a fixed and limited compression factor that may prove insufficient for certain devices with very limited resources. In this work, we propose Sparse Binary Neural Networks (SBNNs), a novel model and training scheme that introduces sparsity in BNNs by using positive 0/1 binary weights, instead of the -1/+1 weights used by state-of-the-art binary networks. As a result, our method achieves a high compression factor and reduces the number of operations and parameters at inference time. We study the properties of our method through experiments on linear and convolutional networks over the MNIST and CIFAR-10 datasets. Experiments confirm that SBNNs can achieve high compression rates and good generalization, while further reducing the number of operations with respect to BNNs, making them a viable option for deploying DNNs on very cheap, low-power IoT devices and sensors.

1. INTRODUCTION

The term Internet-Of-Things (IoT) became notable in the late 2000s under the idea of enabling internet access for electrical and electronic devices (Miraz et al., 2015), thus allowing them to collect and exchange data. Since its introduction, the number of connected devices has surpassed the number of humans connected to the internet (Evans, 2011). The increasing number of both mobile and embedded IoT devices has led to a sensor-rich world, capable of addressing a wide range of real-time applications, such as security systems, healthcare monitoring, environmental meters, factory automation and autonomous vehicles, where both accuracy and time matter (Al-Fuqaha et al., 2015). At the same time, Deep Neural Networks (DNNs) have reached and surpassed state-of-the-art results for multiple tasks involving images and video (Krizhevsky et al., 2012), speech (Hinton et al., 2012) or language processing (Collobert & Weston, 2008). Thanks to their ability to process large amounts of complex, heterogeneous data and to extract the patterns needed to take autonomous decisions with high reliability (LeCun et al., 2015), DNNs have the potential to enable a myriad of new IoT applications. DNNs, however, suffer from high resource consumption in terms of computational power, memory and energy (Canziani et al., 2016). In contrast, most IoT devices are characterized by their limited resources: they have limited processing power and small storage capabilities, they are not GPU-enabled, and they are powered by batteries of limited capacity, which are expected to last over 10 years without being replaced or recharged (Global System for Mobile Communications, 2018). All these constraints remain an important bottleneck towards deploying DNN models in IoT applications (Yao et al., 2018).
Deploying DNNs on IoT devices requires compressing them so that they fit on such devices, while enabling real-time "intelligent" interactions with the environment (Yao et al., 2018) and without degrading their accuracy. Sparsity, compression and quantization are means of reducing the size and power consumption of DNNs, under the constraint of keeping accuracy high. Different studies (Denil et al., 2013; Frankle & Carbin, 2019) have demonstrated that deep models contain optimal subnetworks, which can perform the same task as their super-network with less memory and computational burden. Among the various techniques used to extract these subnetworks, pruning and quantization have shown promising results. The former removes parameters to obtain a sparser network, whereas the latter reduces the bit-width used to represent the parameters. Under this principle, Binary Neural Networks (BNNs) (Courbariaux et al., 2015) and Ternary Neural Networks (TNNs) (Hwang & Sung, 2014) are two recently proposed types of quantized networks, whose weights and activation functions use one and two bits, respectively. This approach avoids the multiplication operations of the forward propagation, which are well known to be computationally expensive, and replaces them with low-cost bitwise operations, thus speeding up the resulting networks and compressing them. For instance, BNNs with binary weights {-1, +1} can reach a compression factor of up to approximately 32 times w.r.t. full-precision models (Rastegari et al., 2016). Despite this improvement, the compression factor is upper-bounded at 32, the result of representing the network's weights with 1 bit instead of the full-precision 32 bits, and may prove insufficient for certain small, low-power embedded devices. To address this limitation, we introduce a novel quantized model denoted Sparse Binary Neural Network (SBNN).
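As a minimal sketch of the ideas above, the following snippet illustrates the standard sign-based binarization used by BNNs (mapping real-valued weights to {-1, +1}) and the resulting 32× upper bound on compression; the function names are ours, chosen for illustration.

```python
import numpy as np

def binarize(w: np.ndarray) -> np.ndarray:
    """Deterministic BNN-style binarization: map real weights to {-1, +1}
    by their sign, so each weight can be stored in a single bit."""
    return np.where(w >= 0, 1.0, -1.0)

def compression_factor(full_bits: int = 32, quant_bits: int = 1) -> float:
    """Storage compression from reducing the per-weight bit-width.
    With 1 bit per weight instead of a 32-bit float, the factor is
    bounded at 32x, as discussed above."""
    return full_bits / quant_bits

w = np.array([0.37, -1.2, 0.0, -0.05])
print(binarize(w))            # [ 1. -1.  1. -1.]
print(compression_factor())   # 32.0
```

The bound is fixed because it depends only on the bit-width ratio, not on the network: this is precisely the limitation that motivates introducing sparsity on top of binarization.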
It shares the advantages of BNNs, as it quantizes using only one bit, while also introducing sparsity. Our SBNN uses 0s and 1s as weights, instead of +1s and -1s (Courbariaux et al., 2015; Rastegari et al., 2016), reducing the total number of required operations and achieving higher network compression rates and lower energy consumption at inference time. To achieve this, we propose a training scheme that starts from a "nearly-empty" model, rather than starting from a fully connected model and pruning its connections, as most state-of-the-art works do. The remainder of this work is organized as follows. Section 2 discusses previous works on sparsity and quantization in DNNs. The core of our contribution is described in Section 3. In Section 4, we study the properties of the proposed method and assess its performance, in terms of accuracy and compression, through a set of experiments on the MNIST and CIFAR-10 datasets. Finally, a discussion of the results and the main conclusions are drawn in Section 5.
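To make the benefit of 0/1 weights concrete, the following sketch contrasts a {0, 1} mapping with the {-1, +1} one. Note that this thresholding rule and its parameter tau are purely illustrative assumptions of ours (the actual SBNN quantization and training scheme are defined in Section 3); the point is that zero-valued weights need not be stored or computed at inference, which is where the extra compression comes from.

```python
import numpy as np

def binarize_01(w: np.ndarray, tau: float = 0.5) -> np.ndarray:
    """Illustrative thresholding to positive binary weights {0, 1}.
    Unlike {-1, +1} binarization, zeros drop out of the computation."""
    return (w > tau).astype(np.float64)

w = np.array([0.9, 0.1, -0.3, 0.7])
b = binarize_01(w)
sparsity = 1.0 - b.mean()  # fraction of weights that are exactly zero

print(b)         # [1. 0. 0. 1.]
print(sparsity)  # 0.5
```

With a {-1, +1} codebook every weight contributes an operation; with a {0, 1} codebook only the surviving 1s do, so both storage and operation counts scale with the number of nonzero weights.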

2. RELATED WORK

We review different approaches to achieving sparsity and quantization in quantized networks.
The concept of sparsity has been well studied beyond quantized neural networks, as it reduces the computational and storage requirements of a network and prevents overfitting. Methods to achieve sparsity either induce it explicitly during learning through regularization, such as L0 (Louizos et al., 2018) or L1 (Han et al., 2015) regularization; do it incrementally by gradually augmenting small networks (Bello, 1992); or apply post hoc pruning (Srivastava et al., 2014; Srinivas et al., 2017; Gomez et al., 2019). In the context of quantized networks, Han et al. (2016) proposed magnitude-pruning of the nearly-zero parameters of a trained full-precision dense model, followed by a quantization step on the remaining weights. The method achieved high compression rates of ∼35-49× and inference speed-ups on well-known DNN topologies, without incurring accuracy losses. However, it has a time-consuming train-prune stage and a relatively limited, model-dependent speed-up. Tung & Mori (2018) optimized the scheme of Han et al. (2016), reporting an improvement in accuracy, but at the cost of a smaller compression factor. TNNs naturally perform magnitude-based pruning thanks to their quantization function, which maps real-valued weights to {-1, 0, +1}. Nevertheless, some works (Faraone et al., 2017; Marban et al., 2020) have achieved larger compression rates by explicitly inducing sparsity through regularization. Current BNN implementations do not address sparsity explicitly and focus on compression improvement through quantization (Rastegari et al., 2016). To account for sparsity, our work maps real-valued weights to the positive values {0, 1}, instead of the standard mapping to {-1, +1}.
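The two sparsification mechanisms discussed above can be sketched as follows. Both the threshold values and the function names are illustrative assumptions of ours, not the exact rules of the cited works (Han et al. (2016) in particular derive layer-wise thresholds during an iterative train-prune loop).

```python
import numpy as np

def magnitude_prune(w: np.ndarray, threshold: float) -> np.ndarray:
    """Magnitude pruning in the spirit of Han et al. (2016): zero out
    weights whose absolute value falls below a threshold, keeping the
    rest at full precision."""
    return np.where(np.abs(w) < threshold, 0.0, w)

def ternarize(w: np.ndarray, delta: float) -> np.ndarray:
    """Ternary quantization: map real weights to {-1, 0, +1}. Near-zero
    weights are mapped to 0, so pruning happens as a by-product."""
    return np.sign(w) * (np.abs(w) >= delta)

w = np.array([0.8, -0.03, 0.4, -0.6, 0.01])
pruned = magnitude_prune(w, threshold=0.05)   # [0.8, 0.0, 0.4, -0.6, 0.0]
ternary = ternarize(w, delta=0.05)            # values in {-1, 0, +1}
```

The contrast motivates the SBNN design: pruning keeps full-precision survivors, ternarization needs two bits per weight, whereas a {0, 1} codebook obtains both sparsity and 1-bit storage at once.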
Network quantization allows the use of fixed-point arithmetic and a smaller bit-width to represent network parameters w.r.t. the full-precision counterpart. As such, it has been

