WEIGHTS HAVING STABLE SIGNS ARE IMPORTANT: FINDING PRIMARY SUBNETWORKS AND KERNELS TO COMPRESS BINARY WEIGHT NETWORKS

Abstract

Binary Weight Networks (BWNs) have significantly lower computational and memory costs compared to their full-precision counterparts. To address the non-differentiability of BWNs, existing methods usually use the Straight-Through-Estimator (STE). In the optimization, they learn optimal binary weights represented as a combination of scaling factors and weight signs to approximate 32-bit floating-point weight values, usually with a layer-wise quantization scheme. In this paper, we begin with an empirical study of training BWNs with STE under the settings of using common techniques and tricks. We show that when batch normalization is used after convolutional layers, adapting scaling factors with either hand-crafted or learnable methods brings marginal or no accuracy gain to the final model, while the change of weight signs is crucial in the training of BWNs. Furthermore, we observe two surprising training phenomena. Firstly, the training of BWNs demonstrates the process of seeking primary binary sub-networks whose weight signs are determined and fixed at the early training stage, which is akin to recent findings on the lottery ticket hypothesis for efficient learning of sparse neural networks. Secondly, we find that binary kernels in the convolutional layers of final models tend to be concentrated on a limited number of the most frequent binary kernels, showing that binary weight networks may have the potential to be further compressed. This breaks the common wisdom that representing each weight with a single bit already pushes quantization to its compression limit. To test this hypothesis, we additionally propose a binary kernel quantization method, and we call the resulting models Quantized Binary-Kernel Networks (QBNs). We hope these new experimental observations will provide new design insights for improving the training of BWNs and broadening their use.

1. INTRODUCTION

Convolutional Neural Networks (CNNs) have achieved great success in many computer vision tasks such as image classification (Krizhevsky et al., 2012), object detection (Girshick et al., 2014) and semantic segmentation (Long et al., 2015). However, modern CNNs usually have a large number of parameters, posing heavy costs on memory and computation. To ease their deployment in resource-constrained environments, different types of neural network compression and acceleration techniques have been proposed in recent years, such as network pruning (Han et al., 2015; Li et al., 2017), network quantization (Hubara et al., 2016; Rastegari et al., 2016; Zhou et al., 2016), knowledge distillation (Ba & Caruana, 2014; Hinton et al., 2015), and efficient CNN architecture engineering and searching (Howard et al., 2017; Zhang et al., 2018b; Zoph & Le, 2017). Comparatively, network quantization is more commercially attractive as it can not only benefit specialized hardware accelerator designs (Sze et al., 2017), but also be readily combined with other techniques for further improved compression and acceleration performance (Mishra & Marr, 2018; Han et al., 2016; Zhou et al., 2017). Quantization methods aim to approximate full-precision (32-bit floating-point) neural networks with low-precision (low-bit) ones. In particular, the extremely quantized models called Binarized Neural Networks (BNNs) (Courbariaux et al., 2015; 2016; Rastegari et al., 2016) force the weights, or even both weights and activations, to 1-bit values (+1 and -1), bringing a 32× reduction in model size and allowing costly 32-bit floating-point multiplications to be replaced by much cheaper binary bit-wise operations. Because of this, how to train accurate BNNs, either in a post-training manner or from scratch, has attracted increasing attention. However, training BNNs poses a non-differentiability issue, as converting full-precision weights into binary values leads to zero gradients.
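To make the storage saving concrete, the following is a minimal NumPy sketch (names and shapes are our own illustration, not from the paper) of binarizing a weight tensor and computing the resulting compression ratio:

```python
import numpy as np

def binarize(w):
    # Map full-precision weights to {+1, -1}; zeros go to +1
    # (a common convention, assumed here).
    return np.where(w >= 0, 1.0, -1.0)

# Toy example: 16 output filters, 16 input channels, 3x3 kernels.
rng = np.random.default_rng(0)
w = rng.standard_normal((16, 16, 3, 3)).astype(np.float32)
b = binarize(w)

# 32 bits per float weight vs. 1 bit per binary weight.
full_bits = w.size * 32
binary_bits = b.size * 1
print(full_bits // binary_bits)  # 32x model-size reduction
```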
To combat this issue, most existing methods use the Straight-Through-Estimator (STE). Although there are a few attempts (Achterhold et al., 2018; Chen et al., 2019; Bai et al., 2019; Hou et al., 2017) to learn BNNs without STE by using proximal gradient methods or meta-learning methods, they suffer from worse accuracy and heavier parameter tuning compared to STE based methods. In STE based methods, full-precision weights are retained during training, and the gradients w.r.t. them and their binarized counterparts are assumed to be the same. In the forward pass of training, the full-precision weights of the currently learnt model are quantized to binary values for prediction loss calculation. In the backward pass, the gradients w.r.t. the full-precision weights, instead of the binary ones, are used for model update. To compensate for drastic information loss and train more accurate BNNs, most state-of-the-art STE based methods follow the formulation of (Rastegari et al., 2016), in which the binary weights are represented as a combination of scaling factors and weight signs to approximate 32-bit floating-point weight values layer by layer, while introducing various modifications. These modifications include but are not limited to expanding binary weights to have multiple binary bases (Lin et al., 2017; Guo et al., 2017), replacing hand-crafted scaling factors with learnable ones (Zhang et al., 2018a), making an ensemble of multiple binary models (Zhu et al., 2019), searching high-performance binary network architectures (Kim et al., 2020), and designing improved regularization objectives, optimizers and activation functions (Cai et al., 2017; Liu et al., 2018; Helwegen et al., 2019; Martinez et al., 2020). There are also a few works trying to better understand the training of BNNs with STE.
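The forward/backward scheme described above can be sketched as follows. This is a hedged NumPy illustration of the (Rastegari et al., 2016) formulation with per-filter scaling factors and a clipped straight-through gradient; function names are our own, and it is not the paper's exact implementation:

```python
import numpy as np

def bwn_forward(w):
    # Binarize as in (Rastegari et al., 2016): one scaling factor per
    # output filter, alpha = mean(|w|), times the weight signs.
    alpha = np.abs(w).mean(axis=(1, 2, 3), keepdims=True)
    signs = np.where(w >= 0, 1.0, -1.0)
    return alpha * signs, alpha

def ste_backward(grad_wb, w, clip=1.0):
    # Straight-Through Estimator: pass the gradient w.r.t. the binary
    # weights straight to the full-precision weights, zeroing it where
    # |w| exceeds a clipping threshold (a common variant; the plain
    # STE omits the mask).
    return grad_wb * (np.abs(w) <= clip)

rng = np.random.default_rng(0)
w = rng.uniform(-2.0, 2.0, size=(8, 4, 3, 3))  # full-precision weights
wb, alpha = bwn_forward(w)       # forward: binary weights for the loss
g = rng.standard_normal(w.shape) # gradient w.r.t. the binary weights
gw = ste_backward(g, w)          # backward: update full-precision w
```

Note that the full-precision weights only change sign when the accumulated STE gradient updates push them across zero, which is why tracking sign changes (as done later in this paper) is a natural lens on BWN training.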
In (Alizadeh et al., 2019), the authors evaluate some of the widely used tricks, showing that adapting the learning rate with a second-moment optimizer is crucial to train BNNs with STE based methods, while other tricks such as weight and gradient clipping are less important. Bethge et al. (2019) show that commonly used techniques such as hand-crafted scaling factors and custom gradients are also not crucial. Sajad et al. (2019) demonstrate that learnable scaling factors combined with a modified sign function can enhance the accuracy of BNNs. Anderson & Berg (2018) offer an interpretation, in terms of high-dimensional geometry, of why binary models can approximate their full-precision references. Galloway et al. (2018) validate that BNNs have surprisingly improved robustness against some adversarial attacks compared to their full-precision counterparts. In this paper, we revisit the training of BNNs, particularly Binary Weight Networks (BWNs), with STE, but from a new perspective, exploring structural weight behaviors during training. Our main contributions are summarized as follows:

• We use two popular methods (Rastegari et al., 2016; Zhang et al., 2018a) for an empirical study, showing that both hand-crafted and learnable scaling factors are not that important, while the change of weight signs plays the key role in the training of BWNs, under the settings of using common techniques and tricks.

• More importantly, we observe two surprising training phenomena: (1) the training of BWNs demonstrates the process of seeking primary binary sub-networks whose weight signs are determined and fixed at the early training stage, which is akin to recent findings of the lottery ticket hypothesis (Frankle & Carbin, 2019) for training sparse neural networks; (2) binary kernels in the convolutional layers (Conv layers) of final BWNs tend to be concentrated on a limited number of binary kernels, showing that binary weight networks may have the potential to be further compressed. This breaks the common understanding that representing each weight with a single bit already pushes quantization to its compression limit.

• We propose a binary kernel quantization method to compress BWNs, yielding a new type of BWNs called Quantized Binary-Kernel Networks (QBNs).
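To make the compression idea concrete, the following is an illustrative NumPy sketch of quantizing 3×3 binary kernels against a codebook of the k most frequent kernels. The function names and the nearest-codeword assignment rule are our own assumptions for illustration, not necessarily the paper's exact QBN method:

```python
import numpy as np
from collections import Counter

def quantize_binary_kernels(bw, k=8):
    # Collect all 3x3 binary (+1/-1) kernels, keep the k most frequent
    # as a codebook, and replace every kernel by its nearest codeword.
    # For +/-1 vectors, dot product = 9 - 2 * Hamming distance, so the
    # codeword with the largest dot product is the closest one.
    flat = bw.reshape(-1, 9)
    counts = Counter(map(tuple, flat.tolist()))
    codebook = np.array([c for c, _ in counts.most_common(k)], dtype=bw.dtype)
    nearest = np.argmax(flat @ codebook.T, axis=1)
    quantized = codebook[nearest].reshape(bw.shape)
    index_bits = int(np.ceil(np.log2(k)))  # bits per kernel index
    return quantized, index_bits

rng = np.random.default_rng(0)
w = rng.standard_normal((32, 16, 3, 3))
bw = np.where(w >= 0, 1.0, -1.0)
qbw, bits = quantize_binary_kernels(bw, k=8)
# A raw binary 3x3 kernel needs 9 bits; a codebook index with k=8
# needs only 3 bits, a further 3x reduction (plus the small codebook).
print(bits)  # 3
```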

2. AN EMPIRICAL STUDY ON UNDERSTANDING BWNS' TRAINING

In this section, we briefly describe the BWNs we use in our experiments, implementation details, scaling factors in BWNs, full-precision weight norms, weight signs, and sub-networks in BWNs.

