WEIGHTS HAVING STABLE SIGNS ARE IMPORTANT: FINDING PRIMARY SUBNETWORKS AND KERNELS TO COMPRESS BINARY WEIGHT NETWORKS

Abstract

Binary Weight Networks (BWNs) have significantly lower computational and memory costs than their full-precision counterparts. To address the non-differentiable nature of binarization, existing methods usually rely on the Straight-Through Estimator (STE). During optimization, they learn binary weights represented as a combination of scaling factors and weight signs to approximate 32-bit floating-point weight values, usually under a layer-wise quantization scheme. In this paper, we begin with an empirical study of training BWNs with the STE under common techniques and training tricks. We show that, when batch normalization follows the convolutional layers, adapting scaling factors with either hand-crafted or learnable methods brings marginal or no accuracy gain to the final model, while the change of weight signs is crucial to the training of BWNs. Furthermore, we observe two surprising training phenomena. First, training a BWN amounts to seeking a primary binary sub-network whose weight signs are determined and fixed at an early training stage, which is akin to recent findings on the lottery ticket hypothesis for efficient learning of sparse neural networks. Second, the binary kernels in the convolutional layers of final models tend to concentrate on a small number of the most frequent binary kernels, suggesting that binary weight networks can be compressed further; this challenges the common wisdom that representing each weight with a single bit already pushes quantization to its compression limit. To test this hypothesis, we additionally propose a binary kernel quantization method, and we call the resulting models Quantized Binary-Kernel Networks (QBNs). We hope these new experimental observations will provide design insights to improve the training and broaden the usage of BWNs.
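The layer-wise binarization and STE gradient described above can be sketched as follows. This is a minimal NumPy illustration under common conventions (an XNOR-Net-style mean-absolute-value scaling factor and gradient clipping where |w| > 1); the function names are illustrative and not from the paper:

```python
import numpy as np

def binarize_ste_forward(w):
    """Forward pass: binarize weights to {-1, +1}, scaled by the mean
    absolute value of the layer's weights (the layer-wise scaling factor)."""
    alpha = np.mean(np.abs(w))                 # hand-crafted scaling factor
    signs = np.where(w >= 0, 1.0, -1.0)        # binary weight signs
    return alpha * signs

def binarize_ste_backward(w, grad_out, clip=1.0):
    """Backward pass: the Straight-Through Estimator passes the incoming
    gradient through sign() as if it were the identity, zeroing it where
    the latent weight magnitude exceeds the clipping threshold."""
    return grad_out * (np.abs(w) <= clip)
```

In training, the binarized weights are used in the forward convolution while the full-precision latent weights `w` accumulate the STE gradients; the paper's observations concern how the signs of these latent weights stabilize early in training.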

1. INTRODUCTION

Convolutional Neural Networks (CNNs) have achieved great success in many computer vision tasks such as image classification (Krizhevsky et al., 2012), object detection (Girshick et al., 2014) and semantic segmentation (Long et al., 2015). However, modern CNNs usually have a large number of parameters, posing heavy costs on memory and computation. To ease their deployment in resource-constrained environments, different types of neural network compression and acceleration techniques have been proposed in recent years, such as network pruning (Han et al., 2015; Li et al., 2017), network quantization (Hubara et al., 2016; Rastegari et al., 2016; Zhou et al., 2016), knowledge distillation (Ba & Caruana, 2014; Hinton et al., 2015), and efficient CNN architecture engineering and searching (Howard et al., 2017; Zhang et al., 2018b; Zoph & Le, 2017). Comparatively, network quantization is more commercially attractive, as it not only benefits specialized hardware accelerator designs (Sze et al., 2017), but can also be readily combined with other techniques for further compression and acceleration (Mishra & Marr, 2018; Han et al., 2016; Zhou et al., 2017). Quantization methods aim to approximate full-precision (32-bit floating-point) neural networks with low-precision (low-bit) ones. In particular, the extremely quantized models called Binarized Neural Networks (BNNs) (Courbariaux et al., 2015; 2016; Rastegari et al., 2016) force the weights, or even both weights and activations, to take 1-bit values (+1 and -1), bringing a 32× reduction in model size and making costly 32-bit floating-point

