FTBNN: RETHINKING NON-LINEARITY FOR 1-BIT CNNS AND GOING BEYOND

Anonymous

Abstract

Binary neural networks (BNNs), in which both weights and activations are binarized to 1 bit, have been widely studied in recent years for their great benefits of highly accelerated computation and substantially reduced memory footprint, which appeal to the development of resource-constrained devices. In contrast to previous methods, which tend to reduce the quantization error when training BNN structures, we argue that the binarized convolution process becomes increasingly linear as this error is minimized, which in turn hampers a BNN's discriminative ability. In this paper, we re-investigate and tune proper non-linear modules to fix that contradiction, leading to a strong baseline that achieves state-of-the-art performance on the large-scale ImageNet dataset in terms of both accuracy and training efficiency. Going further, we find that the proposed BNN model still has much potential to be compressed, without losing accuracy, by making better use of efficient binary operations. In addition, the limited capacity of the BNN model can be increased with the help of group execution. Based on these insights, we improve the baseline by an additional 4∼5% in top-1 accuracy, even with less computational cost. Our code and all trained models will be made public.

1. INTRODUCTION

In the past decade, Deep Neural Networks (DNNs), in particular Deep Convolutional Neural Networks (DCNNs), have revolutionized computer vision and been ubiquitously applied to various vision tasks, including image classification (Krizhevsky et al., 2012), object detection (Liu et al., 2020a) and semantic segmentation (Minaee et al., 2020). The top-performing DCNNs (He et al., 2016; Huang et al., 2017) are data and energy hungry, relying on cloud centers with clusters of power-hungry processors to speed up processing, which greatly impedes their deployment on ubiquitous edge devices such as smartphones, automobiles, wearables and IoT devices with very limited computing resources. Therefore, in the past few years, numerous research efforts have been devoted to developing DNN compression techniques that pursue a satisfactory trade-off between computational efficiency and prediction accuracy (Deng et al., 2020). Among these techniques, Binary Neural Networks (BNNs), which first appeared in the pioneering work of Hubara et al. (2016), have attracted increasing attention due to their favorable properties such as fast inference, low power consumption and memory saving. In a BNN, the weights and activations during inference are aggressively quantized to 1 bit (namely, two values), which can lead to a 32× saving in memory footprint and up to a 64× speedup on CPUs (Rastegari et al., 2016). However, despite recent progress (Liu et al., 2018; Gu et al., 2019; Kim et al., 2020b), BNNs still trail the accuracy of their full-precision counterparts. This is because binarization inevitably causes serious information loss, owing to the limited representational capacity of extremely discrete values; in addition, the discontinuous nature of the binarization operation makes optimizing the deep network difficult (Alizadeh et al., 2018).
A popular direction for enhancing the predictive performance of a BNN is to make the binary operation mimic the behavior of its full-precision counterpart by reducing the quantization error caused by the binarization function. For example, XNOR-Net (Rastegari et al., 2016) first introduced scaling factors for both the binary weights and activations, so that the output of the binary convolution can be rescaled to closely match the result of the real-valued convolution, just as if the original full-precision weights and activations were used. The method outperforms its vanilla counterpart BNN (Hubara et al., 2016) by a large margin (44.2% vs. 27.9% Top-1 accuracy on ImageNet with the AlexNet architecture (Krizhevsky et al., 2012)). Owing to the remarkable success of XNOR-Net, a series of approaches subsequently emerged that either find better scaling factors or propose novel optimization strategies to further reduce the quantization error. Specifically, XNOR-Net++ (Bulat & Tzimiropoulos, 2019) improved the computation of the scaling factors by treating them as model parameters that can be learnt end-to-end from the target loss, while Real-to-Bin (Martinez et al., 2020) proposed to compute the scaling factors on the fly from individual input samples, which is more flexible. From another perspective, IR-Net (Qin et al., 2020) progressively evolves the backward function for binarization from an identity map into the original Sign function during training, which avoids a large quantization error in the early stage of training. BONN (Gu et al., 2019) added a Bayesian loss encouraging the weight kernels to follow a Gaussian mixture model, with each Gaussian centered at a quantization value, leading to higher accuracy. Other works aiming to reduce the quantization error include ABC-Net (Lin et al., 2017), Bi-Real Net (Liu et al., 2018), ProxyBNN (He et al., 2020), etc.
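To make the scaling-factor idea concrete, the following is a minimal NumPy sketch of XNOR-Net-style binarization (our simplified illustration, not the paper's implementation): each tensor is replaced by its sign, and a single scaling factor α = mean(|·|) rescales the binary result towards the real-valued one. The function names `binarize_with_scale` and `binary_dot` are ours, chosen for clarity.

```python
import numpy as np

def binarize_with_scale(t):
    """Binarize a tensor into {-1, +1} and return a scalar scaling
    factor alpha = mean(|t|), as in XNOR-Net (simplified: one alpha
    per tensor rather than per output channel)."""
    alpha = np.mean(np.abs(t))
    return alpha, np.sign(t)

def binary_dot(w, x):
    """Approximate the real-valued dot product w.x using binarized
    operands rescaled by their scaling factors. On 1-bit hardware the
    inner product of sign vectors becomes an XNOR + popcount; here we
    use a plain dot product for illustration."""
    aw, bw = binarize_with_scale(w)
    ax, bx = binarize_with_scale(x)
    return aw * ax * np.dot(bw, bx)

rng = np.random.default_rng(0)
w = rng.standard_normal(256)   # toy "weight" vector
x = rng.standard_normal(256)   # toy "activation" vector
exact = np.dot(w, x)           # real-valued result
approx = binary_dot(w, x)      # scaled binary approximation
```

Reducing the gap between `exact` and `approx` is precisely the quantization-error objective that the methods above pursue, via better (learned or input-dependent) choices of the scaling factors.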
However, another problem arises as the quantization error is optimized towards 0, especially for structures like Bi-Real Net (Fig. 1), where the only non-linear function is the binarization function itself. In the extreme case (quantization error equal to 0), the binary convolution with scaling factors perfectly mimics the real-valued convolution, so the non-linearity of the binarization function is eliminated, hindering the discriminative ability of the BNN. It is therefore necessary to re-investigate the non-linear properties of BNNs when inheriting existing advanced structures. Based on this consideration, we conduct an experiment on the MNIST dataset (LeCun & Cortes, 2010) using a 2-layer Bi-Real Net like structure (which begins with an initial real-valued convolution layer and two basic blocks as illustrated in Fig. 1 (b), optionally followed by a non-linear module, and ends with a fully connected (FC) layer), and some interesting phenomena can be observed, as shown in Fig. 2, where we visualize the feature space before the FC layer and calculate the feature discrepancy caused by the binarization process as well as the corresponding classification accuracy (ACC, in %) for each model. First, comparing the first two figures, despite the large quantization error introduced by binarization, the binary model achieves much higher accuracy than the real-valued model, which has no quantization error. This indicates that the binarization function has a latent ability to enhance the model's discriminative power, and also explains why Bi-Real Net
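The collapse-to-linearity argument can be illustrated with a toy NumPy sketch (ours, for intuition only): if every binary block perfectly reproduces a real-valued linear map, then stacking blocks without any non-linear module is equivalent to a single linear map, whereas inserting a non-linearity such as ReLU between blocks breaks this collapse.

```python
import numpy as np

rng = np.random.default_rng(1)
W1 = rng.standard_normal((8, 8))  # stand-in for a "perfectly de-quantized" binary layer
W2 = rng.standard_normal((8, 8))  # a second such layer
x = rng.standard_normal(8)

# Without a non-linear module, the two layers compose into ONE linear map:
stacked = W2 @ (W1 @ x)
collapsed = (W2 @ W1) @ x        # same map, single matrix

# Inserting a non-linear module (e.g. ReLU) between the layers breaks
# the collapse and restores non-linear (discriminative) capacity.
relu = lambda t: np.maximum(t, 0.0)
nonlinear = W2 @ relu(W1 @ x)
```

Here `stacked` and `collapsed` coincide, while `nonlinear` generally does not, which is why explicit non-linear modules become essential once the binarization itself no longer supplies non-linearity.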



Figure 1: Left: the basic block in the original Bi-Real Net vs. the simplified basic block in FTBNN, where we directly absorb the explicit scaling factors into the BN layer by leveraging BN's scaling factors. Right: the non-linear modules (ReLU or FPReLU) are explicitly added after each basic block in FTBNN. To maximize the model's discriminative power while keeping its training stable, the number of ReLU modules is controlled and the proposed FPReLU follows most blocks. Training curves on ImageNet of the two 18-layer networks are depicted to show the training efficiency: solid lines denote Top-1 accuracy on the validation set (y-axis on the right); dashed lines denote training loss (y-axis on the left). Both models are trained from scratch.

