WEIGHTS HAVING STABLE SIGNS ARE IMPORTANT: FINDING PRIMARY SUBNETWORKS AND KERNELS TO COMPRESS BINARY WEIGHT NETWORKS

Abstract

Binary Weight Networks (BWNs) have significantly lower computational and memory costs than their full-precision counterparts. To address the non-differentiability of BWNs, existing methods usually use the Straight-Through Estimator (STE). During optimization, they learn binary weights represented as a combination of scaling factors and weight signs to approximate the 32-bit floating-point weight values, usually with a layer-wise quantization scheme. In this paper, we begin with an empirical study of training BWNs with STE under common techniques and tricks. We show that when batch normalization follows the convolutional layers, adapting scaling factors with either hand-crafted or learnable methods brings marginal or no accuracy gain to the final model, while the change of weight signs is crucial in the training of BWNs. Furthermore, we observe two striking training phenomena. Firstly, the training of BWNs demonstrates a process of seeking primary binary sub-networks whose weight signs are determined and fixed at an early training stage, which is akin to recent findings on the lottery ticket hypothesis for efficient learning of sparse neural networks. Secondly, we find that the binary kernels in the convolutional layers of final models tend to concentrate on a limited number of the most frequent binary kernels, suggesting that binary weight networks may have the potential to be compressed further, which breaks the common wisdom that representing each weight with a single bit pushes quantization to the extreme of compression. To test this hypothesis, we additionally propose a binary kernel quantization method and call the resulting models Quantized Binary-Kernel Networks (QBNs). We hope these new experimental observations will shed new light on improving the training and broadening the usage of BWNs.

1. INTRODUCTION

Convolutional Neural Networks (CNNs) have achieved great success in many computer vision tasks such as image classification (Krizhevsky et al., 2012), object detection (Girshick et al., 2014), and semantic segmentation (Long et al., 2015). However, modern CNNs usually have a large number of parameters, imposing heavy memory and computation costs. To ease their deployment in resource-constrained environments, different types of neural network compression and acceleration techniques have been proposed in recent years, such as network pruning (Han et al., 2015; Li et al., 2017), network quantization (Hubara et al., 2016; Rastegari et al., 2016; Zhou et al., 2016), knowledge distillation (Ba & Caruana, 2014; Hinton et al., 2015), and efficient CNN architecture engineering and searching (Howard et al., 2017; Zhang et al., 2018b; Zoph & Le, 2017). Comparatively, network quantization is more commercially attractive, as it not only benefits specialized hardware accelerator designs (Sze et al., 2017) but can also be readily combined with other techniques for further compression and acceleration (Mishra & Marr, 2018; Han et al., 2016; Zhou et al., 2017). Quantization methods aim to approximate full-precision (32-bit floating-point) neural networks with low-precision (low-bit) ones. In particular, the extremely quantized models called Binarized Neural Networks (BNNs) (Courbariaux et al., 2015; 2016; Rastegari et al., 2016) force the weights, or even both weights and activations, to take 1-bit values (+1 and -1), bringing a 32× reduction in model size and allowing costly 32-bit floating-point multiplications to be replaced by much cheaper binary bit-wise operations. Because of this, how to train accurate BNNs, either post-training or from scratch, has attracted increasing attention. However, training BNNs poses a non-differentiability issue, as converting full-precision weights into binary values leads to zero gradients.
To combat this issue, most existing methods use the Straight-Through Estimator (STE). Although there are a few attempts (Achterhold et al., 2018; Chen et al., 2019; Bai et al., 2019; Hou et al., 2017) to learn BNNs without STE, using proximal gradient or meta-learning methods, they suffer from worse accuracy and heavier parameter tuning compared to STE-based methods. In STE-based methods, full-precision weights are retained during training, and the gradients w.r.t. them and w.r.t. their binarized counterparts are assumed to be the same. In the forward pass, the full-precision weights of the currently learnt model are quantized to binary values for the prediction loss calculation. In the backward pass, the gradients w.r.t. the full-precision weights, instead of the binary ones, are used for the model update. To compensate for the drastic information loss and train more accurate BNNs, most state-of-the-art STE-based methods follow the formulation of (Rastegari et al., 2016), in which the binary weights are represented as a combination of scaling factors and weight signs to approximate the 32-bit floating-point weight values layer by layer, while introducing many modifications. These modifications include, but are not limited to, expanding binary weights to multiple binary bases (Lin et al., 2017; Guo et al., 2017), replacing hand-crafted scaling factors with learnable ones (Zhang et al., 2018a), building ensembles of multiple binary models (Zhu et al., 2019), searching for high-performance binary network architectures (Kim et al., 2020), and designing improved regularization objectives, optimizers, and activation functions (Cai et al., 2017; Liu et al., 2018; Helwegen et al., 2019; Martinez et al., 2020). There are also a few works trying to better understand the training of BNNs with STE.
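The straight-through update described above can be sketched in a few lines of NumPy (a minimal illustration, not the authors' implementation; `grad_fn` is a hypothetical stand-in for the gradient of the prediction loss w.r.t. the binary weights):

```python
import numpy as np

def ste_step(w, grad_fn, lr=0.01):
    """One STE training step: the forward pass uses sign(w), and the
    gradient w.r.t. the binary weights is applied directly to the
    latent full-precision weights (the straight-through assumption)."""
    b = np.sign(w)        # forward: binarize the latent weights
    g = grad_fn(b)        # gradient of the loss w.r.t. the binary weights
    return w - lr * g     # backward: update the full-precision weights

w = np.array([0.3, -0.2, 0.05])
w_new = ste_step(w, grad_fn=lambda b: b)  # toy gradient equal to sign(w)
```

Note that the full-precision weights, not the binary ones, accumulate the updates, which is why their norms and signs carry the training signal studied in this paper.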
In (Alizadeh et al., 2019), the authors evaluate some widely used tricks, showing that adapting the learning rate with a second-moment optimizer is crucial to training BNNs with STE-based methods, while other tricks such as weight and gradient clipping are less important. Bethge et al. (2019) show that commonly used techniques such as hand-crafted scaling factors and custom gradients are also not crucial. Sajad et al. (2019) demonstrate that learnable scaling factors combined with a modified sign function can enhance the accuracy of BNNs. Anderson & Berg (2018) interpret why binary models can approximate their full-precision references in terms of high-dimensional geometry. Galloway et al. (2018) validate that BNNs have surprisingly improved robustness against some adversarial attacks compared to their full-precision counterparts. In this paper, we revisit the training of BNNs, particularly Binary Weight Networks (BWNs) with STE, from a new perspective, exploring structural weight behaviors during training. Our main contributions are summarized as follows:
• We use two popular methods, (Rastegari et al., 2016) and (Zhang et al., 2018a), for an empirical study, showing that both hand-crafted and learnable scaling factors are not that important, while the change of weight signs plays the key role in the training of BWNs, under common techniques and tricks.
• More importantly, we observe two striking training phenomena: (1) the training of BWNs demonstrates a process of seeking primary binary sub-networks whose weight signs are determined and fixed at an early training stage, akin to recent findings of the lottery ticket hypothesis (Frankle & Carbin, 2019) for training sparse neural networks; (2) binary kernels in the convolutional layers (Conv layers) of final BWNs tend to concentrate on a limited number of binary kernels, suggesting binary weight networks may have the potential to be compressed further. This breaks the common understanding that representing each weight with a single bit pushes quantization to the extreme of compression.
• We propose a binary kernel quantization method to compress BWNs, yielding a new type of BWN called Quantized Binary-Kernel Networks (QBNs).

2. AN EMPIRICAL STUDY ON UNDERSTANDING BWNS' TRAINING

In this section we briefly describe the BWNs used in our experiments, implementation details, scaling factors in BWNs, full-precision weight norms, weight signs, and sub-networks in BWNs.

2.1. DIFFERENT BINARY WEIGHT NETWORKS

BWNs generally refers to networks with binary weights, and several variants exist. Overall, they use αB to replace the full-precision weight W, where B = sign(W) and α is chosen, in either a learnable or a calculated way, to minimize ‖αB − W‖. In the following experiments, we use the variant implemented in XNor-Net (Rastegari et al., 2016), denoted XNor-BWN, and the one implemented in LQ-Net (Zhang et al., 2018a), denoted LQ-BWN.

Compare learnable SF and γ in BN: LQ-BWN uses channel-wise scaling factors. From the experiments in Appendix C, we find that these channel-wise scaling factors have a high correlation with the γ of the BN following the corresponding binary Conv. This finding indicates that BN's γ can replace channel-wise SF to some extent:

x̂ = Normalize(x) = (x − x̄) / √(σ² + ε),   y = Affine(x̂) = γx̂ + β

y_α = γ · (αx − αx̄) / √(α²σ² + ε) + β ≈ γ · (x − x̄) / √(σ² + ε) + β = y    (2)

Quantization Error Curve: Another purpose of using scaling factors is to reduce the quantization error between full-precision weights and binary weights, according to a BNN survey (Qin et al., 2020). With the experiments in Appendix D, we show that the quantization error is not actually reduced by scaling factors, whereas weight decay does help with this reduction.
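Equation 2's claim, that a constant scaling factor is absorbed by the subsequent BatchNorm, can be checked numerically (a toy sketch with made-up γ, β, and α values):

```python
import numpy as np

def batchnorm(x, gamma=1.5, beta=0.3, eps=1e-5):
    """BatchNorm over a 1-D batch: normalize, then affine transform."""
    return gamma * (x - x.mean()) / np.sqrt(x.var() + eps) + beta

rng = np.random.default_rng(0)
x = rng.normal(size=10_000)   # pre-activations from a binary Conv
alpha = 0.05                  # a constant layer-wise scaling factor
y, y_scaled = batchnorm(x), batchnorm(alpha * x)
# BN's batch statistics rescale with alpha, so the scaling factor
# cancels out (up to the epsilon term) and the outputs nearly coincide.
print(np.abs(y - y_scaled).max())
```

The residual difference comes only from the ε term in the denominator, which is why the approximation in Equation 2 holds tightly for typical activation scales.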

2.4. WEIGHT NORM, WEIGHT SIGN, AND SUB-NETWORKS IN BWNS

We have already analysed one essential element of BWNs, the scaling factor, in the previous section; another essential element is the weight sign. In deterministic binarization methods, the signs of the full-precision weights decide the signs of their binary weights through a sign() function. In this section, we discuss the relationship between weight norm and weight sign, and how to find primary binary sub-networks in BWNs.

Weight Histogram: We visualize the full-precision weight distribution in different layers of different networks, as shown in Figure 1. Rather than a bi-modal distribution, it shows a distribution centered around 0. This again proves that the actual distance, the so-called quantization error, is very large. Moreover, the many weights close to zero behave very unstably, changing their signs under small perturbations. More experiments and visualizations are in Appendix E.

Flipping Weights' Signs: We flip weights' signs during inference according to the weights' full-precision norm, as shown in Figure 12 of Appendix G. In two experiments, we flip the weights with the largest norm and with the smallest norm, respectively. Even though the weights have the same norm after binarization, and the total changed norm is the same for the same flipping percentage, there is still a very large gap between the two results. Flipping the weights with large full-precision magnitude causes a significant performance drop compared to flipping those close to zero. This reveals that weights differ: those with small norm can tolerate sign flipping, while those with large norm cannot, even though both kinds of weights have the same norm after binarization.

Tracing Large Weights: From the last experiment, we conclude that weights with large norm are vulnerable and important during inference; however, their function during training remains unclear. We therefore conduct two experiments to trace these large weights during training.
We use "these large weights" to denote the weights with the largest magnitude/norm in the network after training has finished. One experiment traces these large weights' signs, to find when they become the same as in the fully trained model. The other traces these large weights' indices, to find when they become the largest weights among all weights. The results for VGG-7 are shown in Figure 3; the results for ResNet-20 (Figure 9) and ResNet-18 (Figure 10) are placed in Appendix F. We find that these large weights are mostly decided in the early training stage. The larger the final magnitude of a weight, the earlier its sign is decided and fixed. The same rule applies to magnitude: weights that end up with larger magnitude already attain larger magnitude at a very early stage. Both curves follow a trend similar to that of the accuracy curve.
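The sign-tracing analysis can be sketched as follows (a simplified illustration; `snapshots`, a hypothetical list of per-epoch weight arrays, is assumed rather than taken from the paper's training logs):

```python
import numpy as np

def first_stable_epoch(snapshots, top_frac=0.1):
    """Return the first epoch from which the signs of the top-`top_frac`
    largest-magnitude final weights agree with their final signs and
    never change again."""
    final = snapshots[-1].ravel()
    k = max(1, int(final.size * top_frac))
    idx = np.argsort(-np.abs(final))[:k]   # indices of the large final weights
    target = np.sign(final[idx])           # their final signs
    match = [np.array_equal(np.sign(s.ravel()[idx]), target) for s in snapshots]
    for t, _ in enumerate(match):
        if all(match[t:]):                 # signs stay fixed from epoch t on
            return t
    return len(match) - 1
```

Sweeping `top_frac` from small to large reproduces the paper's qualitative finding: the larger the final magnitude, the smaller the returned epoch.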

2.5. PRIMARY BINARY SUB-NETWORKS IN BWNS

We find that there are weights with large norm that fix their signs in the early training stage. These weights are stable, yet the network is vulnerable to inverting their signs. We name these weights Primary Binary Sub-Networks. This idea is akin to the lottery ticket hypothesis (Frankle & Carbin, 2019), but differs in that the weights of our primary binary sub-networks usually have fixed signs, and the rest of the BWN is not zeroed out as in pruned networks. After binarization, every weight in a primary binary sub-network has the same norm, but not the same importance. The lottery ticket hypothesis is based on full-precision network pruning and focuses on obtaining sparse networks via retraining, whereas our claim is the meta-level idea that weights with larger norm are stable yet sensitive to sign changes. We show how we utilize this idea in the rest of the paper.
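The sign-flipping experiment behind this sensitivity claim can be sketched as follows (a simplified per-tensor version; in the paper, flipping is applied to every binary Conv layer with the same percentage):

```python
import numpy as np

def flip_signs(w, frac, largest=True):
    """Flip the signs of a fraction `frac` of the weights, chosen by
    full-precision magnitude (largest-first or smallest-first)."""
    flat = w.ravel().copy()
    k = int(flat.size * frac)
    order = np.argsort(-np.abs(flat)) if largest else np.argsort(np.abs(flat))
    flat[order[:k]] *= -1
    return flat.reshape(w.shape)

w = np.array([0.9, -0.5, 0.01, -0.02])
# After binarization both variants change the binary tensor by the same
# amount, yet flipping the large-magnitude weights is what hurts accuracy.
print(flip_signs(w, 0.5, largest=True))
print(flip_signs(w, 0.5, largest=False))
```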

2.6. BINARY-KERNEL DISTRIBUTION

Besides the centered distribution of full-precision weights in each layer, we find that there exists another distribution, over binary kernels, in each layer. For a binary kernel of size 3 × 3, there are 2⁹ possible kernels in total. For easier illustration, we use 0 to 511 to index these kernels, as shown in Figure 4. From Figure 4, we can find that certain binary kernels are favored across different layers and networks (excluding the first Conv layer, which is usually not binarized).

Figure 5: A pipeline illustrating our compression method on BWNs using 2-bit kernels. We first set the weights with larger norm to ±1 and keep the weights with smaller norm, then calculate the L2 distance to the 2-bit selected binary kernels. After sorting the distances, we assign the kernel with the smallest distance to the original kernel. On the right are two figures showing the distribution of binary kernels before and after quantization.
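A 3 × 3 binary kernel can be indexed by reading its signs as a 9-bit number (the exact bit order in the paper is shown only in Figure 4; row-major order with +1 → 1 is our assumption for illustration):

```python
import numpy as np

def kernel_index(kernel):
    """Map a 3x3 +/-1 kernel to an integer in [0, 511] by reading its
    entries row-major as bits (+1 -> 1, -1 -> 0). Bit order is an
    illustrative assumption, not the paper's exact convention."""
    bits = (np.asarray(kernel).ravel() > 0).astype(int)
    return int("".join(map(str, bits)), 2)

k = [[ 1, -1,  1],
     [-1,  1, -1],
     [ 1, -1,  1]]
print(kernel_index(k))  # binary 101010101 -> 341
```

Any fixed bijection between kernels and 0..511 works for the frequency statistics; only consistency across layers matters.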

3. QUANTIZED BINARY-KERNEL NETWORKS

In this section, we introduce Quantized Binary-Kernel Networks (QBNs). The previous sections yield several conclusions: 1. Scaling factors are not essential to BWNs, which guides us not to concentrate on designing scaling factors, since good learning rates help in most cases; 2. Weights with larger magnitude constitute the primary binary sub-networks in BWNs; these large weights are stable but sensitive to sign changes, and are determined and fixed in the early training stage; 3. Binary kernels are centered on a limited number of the most frequent kernels. These conclusions lead us to propose a new compression algorithm that further compresses BWNs into a more structured and compact network, which we name Quantized Binary-Kernel Networks (QBNs). QBN essentially preserves the primary binary sub-networks to the greatest extent, changing the signs of smaller weights and quantizing the less frequent kernels to the most frequent ones to save space.

3.1. ALGORITHM

Before training a QBN, we first train an ordinary VGG-7 XNor-BWN on Cifar-10 and extract the binary kernel distribution of its last Conv layer, as already shown in Figure 5. We then sort these binary kernels by their appearance frequency and select the top 2¹, 2², ..., 2⁸ frequent binary kernels. These kernels are called the selected binary kernels K_{0,1,...,2⁸−1}; in the rest of the paper, "selected binary kernels" refers to the kernels K_{0,1,...,2⁸−1} in our algorithm. In the following experiments, the selected binary kernels are extracted from the last Conv layer of one single VGG-7 BWN. After this pre-processing and obtaining K_{0,1,...,2⁸−1}, we train a QBN using Algorithm 1, which is written in Python-style pseudocode. We use the function where(A, B, C) from NumPy, which returns B where condition A holds and C otherwise. We fix the scaling factors to 0.05 when using the default learning rate mentioned in the experimental settings of Section 2.2. We use the L2 norm to compute the distance between a full-precision kernel W_ij and the selected kernels K_m; during the forward pass, the full-precision kernel is replaced by the selected kernel with the shortest distance to it.
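The quantization step of Algorithm 1 can be sketched for a single kernel as follows (variable names are ours; the real algorithm applies this to every 3 × 3 kernel of a layer during the forward pass):

```python
import numpy as np

def quantize_kernel(w, selected, delta=0.02):
    """QBN-style kernel quantization sketch. Weights with |w| > delta are
    snapped to +/-1 first (preserving the primary sub-network, via
    np.where); smaller weights keep their full-precision values. The
    nearest selected binary kernel under L2 distance then replaces the
    original kernel."""
    w_mixed = np.where(np.abs(w) > delta, np.sign(w), w)
    dists = ((selected - w_mixed) ** 2).sum(axis=(1, 2))
    return selected[np.argmin(dists)]

selected = np.stack([np.ones((3, 3)), -np.ones((3, 3))])  # a toy 1-bit set
w = np.array([[0.4,  0.01, 0.3],
              [0.2, -0.01, 0.5],
              [0.3,  0.02, 0.4]])
print(quantize_kernel(w, selected))  # snaps to the all-ones kernel
```

Pre-binarizing the large weights makes a sign disagreement on them cost roughly (1 − (−1))² = 4 in the distance, so the chosen kernel tends to agree with the primary sub-network.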

3.2. QBN EXPERIMENTS

We display our QBN experiments in Table 1, using the same experimental settings mentioned in Section 2.2. Besides testing different networks and datasets, we also use different quantization bits on these networks to see how QBN performs. When the quantization bit p < 9, each binary kernel can be represented with fewer than 9 bits, which provides the compression ability of QBN. We use the compressed ratio (CR), a number larger than 1, to show the ratio between the parameters of the original BWN and those of the compressed model, counting only the binarized layers. In this paper, we do not use 8-bit quantized binary kernels, which have a high computational cost and a small compressed ratio.
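The compressed ratio can be computed as follows (our reading of the paper's definition: 9 bits per original 3 × 3 binary kernel versus p index bits per quantized kernel, over the binarized layers only):

```python
def compression_ratio(kernels_per_layer, bits_per_layer):
    """CR between a BWN (9 bits per 3x3 binary kernel) and its QBN
    (p-bit kernel indices), counting only the binarized layers."""
    original = sum(9 * n for n in kernels_per_layer)
    compressed = sum(p * n for n, p in zip(kernels_per_layer, bits_per_layer))
    return original / compressed

# Toy example: three binarized layers with per-layer bit widths 6, 4, 2.
print(compression_ratio([1000, 2000, 4000], [6, 4, 2]))
```

Under this reading, a uniform 2-bit assignment gives 9/2 = 4.5×, and pushing the large later layers below 2 bits is what lifts the CR above 5×.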

4. DISCUSSION ON QBN

In this section, we discuss the experimental results of QBN and its potential usages, including model compression, kernel quantization strategies, the existence and transferability of the selected kernels, and alternative selections of binary kernels.

4.1. MODEL COMPRESSION

With the discovery that BWNs contain primary binary sub-networks, we can reduce the number of parameters needed to represent a binary kernel by changing the signs of small-magnitude weights, with a bearable impact on the performance of BWNs. For VGG-7 on Cifar-10 and ResNet-18 on ImageNet, we can compress the parameters to an extremely small number by replacing the full set of 512 types of 3×3 binary kernels with fewer types drawn from the 2^k selected binary kernels, and the compressed ratio can be higher than 5×. ResNet-20 and ResNet-56, which are thinner and have a small number of channels and parameters, have a lower endurance for compression; their compressed ratio can reach 1.5× with a bearable accuracy drop (less than 3% on Cifar-10). For more aggressive compression with very-low-bit quantized binary kernels, the training stability of networks with fewer parameters, such as ResNet-20, drops due to their limited number of parameters. The experimental results are shown in Table 3 in Appendix H.

4.2. CONNECTION BETWEEN QBN AND PRIMARY BINARY SUB-NETWORKS

We use a hyper-parameter threshold ∆ in Algorithm 1 to bridge QBN and primary binary sub-networks. When ∆ = 0, we first binarize all weights and then quantize the resulting binary kernels to the selected kernels. When ∆ is large enough, we directly quantize the full-precision kernels. When ∆ lies in a proper range of the weight norm, the large weights are first binarized to ±1. Since the weight norm is usually small compared to 1 (see the weight visualizations in Figure 1 and Figure 8), these large weights receive a larger penalty for changing their signs when calculating the L2 distance between full-precision kernels and the selected binary kernels. Thus, ∆ is a hyper-parameter deciding what portion of weights is considered large, in other words, the primary binary sub-network. According to our experiments with different ∆ in Figure 16 of Appendix O, ∆ > 0 is almost always better than ∆ = 0; applying the sign() operation to all weights eliminates the full-precision norm information. Overall, these results suggest that preserving the primary binary sub-network first (∆ > 0) helps quantize binary kernels, compared to binarizing all weights first.

4.3. QUANTIZATION BIT STRATEGY

When using a low quantization bit for binary kernels, the performance drop is not negligible; thus, how to assign quantization bits to different layers is important. VGG-7 and ResNet contain many more parameters in higher layers (layers near the output), which have more channels, but their computational cost is similar in each layer. From the viewpoint of model compression, we find that higher layers have a higher endurance for low-bit quantized kernels than lower layers (layers near the input). We therefore use low bits in the last layer/group and more bits for the remaining layers/groups to avoid bottlenecks.

4.4. EXISTENCE AND TRANSFERABILITY OF THE SELECTED KERNELS

To prove the existence of the selected kernels in other cases, and the transferability of our selected kernels, we conduct experiments extracting the top frequent kernels from different networks and layers and comparing them with our selected kernels in Appendix L. We then conduct fine-tuning experiments on a pretrained BWN; this is studied further in Appendix M.

4.5. OTHER SELECTION OF BINARY-KERNELS

We discuss alternative selections of binary kernels in Appendix N. For very-low-bit quantization, we suggest using the most frequent binary kernels rather than the less frequent ones. For cases with quantization bit p > 4, the choice of binary kernels is not an essential problem.

A APPENDIX

In this appendix, we display many additional experiments omitted from the main text due to page limits. These experiments are all referenced in the main text as supplementary material to strengthen our main idea and make our contributions more convincing.

B EXPERIMENTS ON TRAINING BWNS

In this appendix section, we display our experiments on training different networks using XNor-BWN and LQ-BWN, as shown in Table 2.

D QUANTIZATION ERROR CURVE

In XNor-Net, where XNor-BWN was first proposed, the scaling factors are chosen to minimize the quantization error in a calculated, deterministic way. In the BNN survey (Qin et al., 2020), the authors summarize several BWN algorithms under "Minimizing the Quantization Error", which share the common form shown in Equation 3:

J(b, α) = ‖x − αb‖²,   α*, b* = argmin_{α,b} J(b, α)    (3)

We therefore plot the quantization error curves of different networks, as shown in Figure 7. In most cases, the quantization error between full-precision weights and binary weights is not minimized. We are therefore concerned that it might not be reasonable to use scaling factors to reduce the quantization error; going further, it might not even be necessary to reduce it.

We display how the selected binary kernels used throughout our experiments, taken from the last Conv of VGG-7, appear in the different layers of other networks, plotting Figure 14 to compare their frequencies. We also apply QBNs to fine-tune pre-trained BWNs: when using relatively more quantization bits, the network can usually attain comparable performance without training from scratch. In our QBN algorithm, the computational cost increases with the number of quantization bits, so we can directly fine-tune a pre-trained BWN. The experiments in Appendix M show that we can obtain a uniform 7-bit QBN in one epoch and a uniform 6-bit QBN in two epochs. This provides a more efficient way to train high-quantization-bit QBNs.

L THE DEGREE OF AGGREGATION OF BINARY-KERNELS

We compute statistics on what percentage of the target networks' top frequent binary kernels can be covered by our selected binary kernels from the last Conv of XNor-BWN VGG-7. The first figure on the top left of Figure 14 shows that, when using 1-bit quantization on the binary kernels of XNor-BWN ResNet-18's first 4 layers (the first group of ResNet-18), our selected kernels cover 100% of XNor-BWN ResNet-18's top 2¹ frequent kernels. The percentage is weighted by the occurrence count of each kernel in the layer; 100% is the limit attained if we directly chose the most frequent kernels of that layer/group as our selected kernels. Even at the lowest percentage, which appears when using 5 bits, more than 80% of the kernels in the first group's top 2⁵ frequent kernels can be represented directly by our selected kernels.
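The coverage percentage in Figure 14 can be computed as follows (our reconstruction of the statistic, weighting each kernel by its occurrence count in the layer; `counts` and `selected` are illustrative names):

```python
import numpy as np

def coverage(counts, selected, p):
    """Weighted fraction of a layer's top-2^p most frequent binary kernels
    that also appear in the selected kernel set. counts[i] is how often
    kernel index i occurs in the layer; 100% means the selected set covers
    as much as choosing this layer's own top kernels would."""
    top = np.argsort(-counts)[: 2 ** p]
    covered = [i for i in top if i in set(selected)]
    return counts[covered].sum() / counts[top].sum()

counts = np.array([50, 40, 30, 20, 10, 5, 4, 1], dtype=float)
print(coverage(counts, selected=[0, 1, 3, 6], p=2))  # covers 110/140
```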


Figure 14: The ratio between the number of kernels in one layer that are in our selected binary kernels, and the number that are in the top 2^p frequent binary kernels of that layer. Thus 100% is the upper bound, meaning our selected binary kernels cover the same number of kernels as the layer's own top frequent kernels. The X-axis is log2-scaled to visualize quantization bits. G0 to G3 are the groups of ResNet-18; each group contains two blocks, with two layers per block.

M FINE-TUNE WITH HIGHER QUANTIZATION BITS (6-BIT, 7-BIT)

Table 5 reports fine-tuning ResNet-18 on ImageNet for one epoch with relatively higher quantization bits. The pretrained model is the normal ResNet-18 BWN at the 90th epoch. We list the number of epochs that fine-tuning requires. For lower bits, the fine-tuned performance cannot reach that of training from scratch.

N OTHER SELECTION OF BINARY-KERNELS

Here we list three reasons for collecting the binary kernels in this way. 1. If we regard QBN as a clustering method with a given number of clusters, the kernels with high appearance frequency are most likely the cluster centers. 2. We visualize the top frequent kernels in Figure 15; they resemble conventional spatial filters, including the "Moving Average Filter", "Point Detector", "Line Detector", and so on. 3. We use the less frequent binary kernels as the selected kernels to test whether our frequency-based selection is a good choice. With the less frequent kernels, an accuracy drop is observed across different experiments. The less frequent kernels are chosen by inverting the order of the top 128 frequent kernels; e.g., for 2² kernels, they are the 125th, 126th, 127th, and 128th most frequent kernels. Given the experimental results in Table 6, we find that using a very low quantization bit, specifically less than or equal to 3, with the least frequent kernels significantly hurts the network. As the quantization bit increases, the difference in performance shrinks. We therefore suggest using the most frequent kernels when training a very-low-bit QBN or fine-tuning a pre-trained BWN. In other cases, such as training from scratch with more than 3 quantization bits, the frequency of the selected kernels is not a strictly decisive factor.
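The "Reverse" ablation selects kernels as follows (a sketch; `freq_order`, an array of kernel indices sorted from most to least frequent, is a hypothetical input):

```python
def select_kernels(freq_order, p, reverse=False):
    """Pick 2^p kernels from the top-128 frequency pool: the most frequent
    ones normally, or the least frequent ones of that pool under the
    'Reverse' ablation (e.g. p=2 -> the 125th-128th most frequent)."""
    pool = list(freq_order[:128])
    k = 2 ** p
    return pool[-k:] if reverse else pool[:k]

order = list(range(512))  # toy: kernel i is the (i+1)-th most frequent
print(select_kernels(order, 2))                # [0, 1, 2, 3]
print(select_kernels(order, 2, reverse=True))  # [124, 125, 126, 127]
```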

O INFLUENCE OF HYPER-PARAMETER ∆

To test the influence of the hyper-parameter ∆ in our algorithm, we use VGG-7 with all 5-bit quantized kernels and ResNet-56 with FP-9-6-2. We do not use ResNet-20 due to its training instability, which we have discussed. The results are shown in Figure 16.

P THE SELECTED KERNELS FROM OTHER CONV LAYERS

In the previous experiments, our selected binary kernels K_m use the statistics extracted from the last Conv layer of one single VGG-7 and treat them as constant kernels. We further use different strategies to demonstrate that the selected binary kernels have a strong ability to generalize to other networks. In the following experiments, we use the selected binary kernels from the last



We use the code of DoReFa-Net to implement XNor-BWN, which matches the original implementation. https://github.com/tensorpack/tensorpack/tree/master/examples/DoReFa-Net LQ-BWN is the 1-bit-weight, 32-bit-activation version of LQ-Nets. https://github.com/microsoft/LQ-Nets



Figure 1: The visualization of full-precision weight distributions in BWNs. The X-axis indicates the full-precision weight value, while the Y-axis indicates the frequency with which that value appears in a certain layer of a binary weight network. In the sub-captions, VGG is VGG-7 and R-20 is ResNet-20. For VGG-7, we draw the weight distributions of the 2nd, 4th, and 6th Conv layers (Conv1, Conv3, Conv5). For ResNet-20, we display the weight distributions of the first and last Conv layers.

Figure 2: Inference accuracy on training sets after flipping a certain percentage of weights' signs. We use two flipping methods: flipping the weights with the largest norm (from largest to smallest) and flipping those with the smallest norm. The X-axis indicates the percentage of weights flipped, while the Y-axis indicates the inference accuracy. The top-left point in each figure is the un-flipped case, which matches the result reported in Table 2. The flipping operation is applied to each binary Conv layer, with the same flipping percentage for every layer.


Figure 4: The visualization of binary weight kernels in one Conv layer after assigning indices. The left figure is an example illustrating how we index a 3 × 3 kernel into the range 0 to 511. The two figures on the right visualize the binary weight kernels in the last Conv layers of XNor-BWN VGG-7 and XNor-BWN ResNet-20 after assigning indices; their X-axis indicates the index of a 3 × 3 binary weight kernel, while the Y-axis indicates the frequency with which that kernel appears in the Conv layer.

Figure 6: The relation between the scaling factors of LQ-BWN and the gammas of the BatchNorm layer after the corresponding Conv layer, channel by channel, after normalizing their values. The X-axis is the scaling factor value and the Y-axis is the gamma in the corresponding BatchNorm layer. Each point in the figure is a combination of the two values after normalization. r in the legend indicates the correlation coefficient. Text below the figures indicates the layer: VGG is VGG-7, R-20 is ResNet-20, R-18 is ResNet-18, C is Conv, G is Group, and B is Block. These abbreviations are also used in the rest of the paper.

Figure 7: L2 distance between full-precision weights and binarized weights during training. We use the L2 distance of the trained network at the last epoch as the unit norm; other L2 distances are divided by this unit distance to better display the increasing or decreasing trend. w/o WD means the same experiment trained without weight decay. Two figures are displayed together: the left one uses WD, while the right one does not. The X-axis is training epochs, while the Y-axis is the re-scaled sum of the L2 norms over all binarized layers.

Figure 15: Top: visualization of the top 16 frequent binary kernels. Bottom: visualization of the 16 least frequent binary kernels among the top 2⁷ frequent binary kernels, i.e., the 113th to 128th most frequent kernels.

We use the same training parameters on each network. On Cifar-10, the network is trained for 200 epochs; the learning rate is initially 0.02 and divided by 10 at epochs 80 and 160. For random crop, we first zero-pad the image to 40 × 40, then randomly crop to 32 × 32. Each BWN trained on ImageNet is trained for 100 epochs; the initial learning rate is 0.1 and is decayed by 0.1 at epochs 30, 60, and 90. Images are rescaled to 256 × 256 and randomly cropped to 224 × 224. No additional data augmentations are used. For all networks, weight decay of 4 × 10⁻⁵ is applied to all Conv layers.

Algorithm 1: QBN. Parameters: quantized kernel bit number p, selected kernels K_{0,1,...,2^p−1}, hyper-parameter threshold ∆, weight input channel number I, output channel number O, scaling factors.

Table 1: Experiments with VGG-7, ResNet-20, and ResNet-56 on Cifar-10, and ResNet-18 on ImageNet. The baselines of full-precision networks and BWNs are in Table 2. FP indicates the first full-precision Conv layer, which is not quantized following common practice. VGG-7 has 6 Conv layers, and we use per-layer quantized bit numbers to indicate how many selected quantized kernels are used. ResNet-20 and ResNet-56 have three groups; each group shares the same number of channels, 16, 32, and 64 in order. We assign the same quantized bit number to each group. ResNet-18 has four groups with 64, 128, 256, and 512 channels. CR indicates the compressed ratio. Acc is the top-1 test accuracy on Cifar-10 and the top-1 validation accuracy on ImageNet, reported as the mean of the best accuracy over 5 runs with different random seeds. More results are displayed in Appendix H.

Shuchang Zhou, Yuxin Wu, Zekun Ni, Xinyu Zhou, He Wen, and Yuheng Zou. DoReFa-Net: Training low bitwidth convolutional neural networks with low bitwidth gradients. arXiv preprint arXiv:1606.06160, 2016. Shilin Zhu, Xin Dong, and Hao Su. Binary ensemble neural network: More bits per network or more networks per bit? In CVPR, 2019. Barret Zoph and Quoc V. Le. Neural architecture search with reinforcement learning. In ICLR, 2017.

This is a performance table of different quantization methods with different network architectures on different datasets and training methods. "Baseline" means the full-precision network. "w/o WD" means without weight decay, and "w/o SF" means without scaling factors. SF = 1 means we fix the scaling factors in all layers to 1. LR×10 means we magnify the learning rate 10 times. Test Acc (validation Acc on ImageNet) means top-1 accuracy on the test set (validation set).

We train several networks using the normal top frequent binary kernels, and using the least frequent ones, named "Reverse" in the table since they follow the reversed order of the top 128 frequent kernels. 10.0% in the table indicates that training fails to converge from the beginning.

G FLIPPING AND PRE-FIXING LARGE MAGNITUDE WEIGHTS AND RETRAINING

We first flip 10% of the weights of a trained BWN, then fix its weights according to their magnitude and retrain it, as shown in Figure 12. The small learning rate in retraining is 0.0002, which corresponds to the ordinary 160th-to-200th-epoch training. The large learning rate is the original learning rate used when training a BWN from scratch. We use three flipping strategies: flipping the largest-magnitude weights, random weights, and the smallest-magnitude weights.

H MODEL COMPRESSION ADDITIONAL EXPERIMENTS

We conduct more experiments with different quantization bits (and thus different compressed ratios) on different networks and datasets, as shown in Table 3.

I BINARY-KERNEL DISTRIBUTION IN EACH LAYER

We visualize the binary-kernel frequencies of other layers and networks in Figure 13. We sum the frequencies of the top 2^p binary kernels in each group of ResNet-18, with a log-scale x-axis.


Figure 13: The sum of the frequencies of the top 2^p binary kernels of XNor-BWN and LQ-BWN ResNet-18, out of all binary kernels. The X-axis is log2-scaled to visualize quantization bits. G0 to G3 are the groups of ResNet-18; each group contains two blocks, with two layers per block. This proves that the QBN algorithm can adapt to a wide range of networks and datasets. To prove the transferability of the selected binary-kernels,

