Multi-Prize Lottery Ticket Hypothesis: Finding Accurate Binary Neural Networks by Pruning a Randomly Weighted Network

Abstract

Recently, Frankle & Carbin (2019) demonstrated that randomly-initialized dense networks contain subnetworks that, once found, can be trained to reach test accuracy comparable to the trained dense network. However, finding these high-performing trainable subnetworks is expensive, requiring an iterative process of training and pruning weights. In this paper, we propose (and prove) a stronger Multi-Prize Lottery Ticket Hypothesis: A sufficiently over-parameterized neural network with random weights contains several subnetworks (winning tickets) that (a) have comparable accuracy to a dense target network with learned weights (prize 1), (b) do not require any further training to achieve prize 1 (prize 2), and (c) are robust to extreme forms of quantization (i.e., binary weights and/or activations) (prize 3). This provides a new paradigm for learning compact yet highly accurate binary neural networks simply by pruning and quantizing randomly weighted full-precision neural networks. We also propose an algorithm for finding multi-prize tickets (MPTs) and test it through a series of experiments on the CIFAR-10 and ImageNet datasets. Empirical results indicate that as models grow deeper and wider, multi-prize tickets start to reach test accuracy similar to (and sometimes even higher than) that of their significantly larger, full-precision, weight-trained counterparts. Without ever updating the weight values, our MPTs-1/32 not only set a new binary-weight-network state-of-the-art (SOTA) Top-1 accuracy (94.8% on CIFAR-10 and 74.03% on ImageNet) but also outperform their full-precision counterparts by 1.78% and 0.76%, respectively. Further, our MPT-1/1 achieves SOTA Top-1 accuracy (91.9%) for binary neural networks on CIFAR-10. Code and pre-trained models are available at: https://github.com/chrundle/biprop.

1. INTRODUCTION

Deep learning (DL) has made significant breakthroughs in a wide range of applications (Goodfellow et al., 2016). These performance improvements can be attributed to the significant growth in model size and the availability of massive computational resources to train such models. These gains, however, have come at the cost of large memory consumption, high inference time, and increased power consumption. This not only limits the potential applications where DL can make an impact but also has serious consequences, such as (a) generating a huge carbon footprint and (b) creating roadblocks to the democratization of AI. Note that significant parameter redundancy and a large number of floating-point operations are key factors incurring these costs.

Thus, to discard redundancy from DNNs, one can either (a) Prune: remove non-essential connections from an existing dense network, or (b) Quantize: constrain the full-precision (FP) weight and activation values to a set of discrete values that can be represented using fewer bits. Further, one can exploit the complementary nature of pruning and quantization to combine their strengths. In addition to saving memory, binarization yields more power-efficient networks with significant computation acceleration, since expensive multiply-accumulate operations (MACs) can be replaced by cheap XNOR and bit-counting operations (Qin et al., 2020a). In light of these benefits, it is of interest to ask whether conditions exist under which a binarized DNN can be pruned to achieve accuracy comparable to the dense FP DNN. More importantly, even if these favorable conditions are met, how do we find these extremely compressed (or compact) yet highly accurate subnetworks? Traditional pruning schemes have shown that a pretrained DNN can be pruned without a significant loss in performance.
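To make the computational benefit of binarization concrete, the following sketch shows how a dot product between two {-1, +1} vectors reduces to an XNOR followed by a bit count. The packing convention and function name are illustrative, not from the paper.

```python
# Illustrative sketch: with weights and activations in {-1, +1}, a
# multiply-accumulate reduces to XNOR plus popcount. Bit 1 encodes +1
# and bit 0 encodes -1; the first vector element sits in the lowest bit.

def binary_dot(a_bits: int, w_bits: int, n: int) -> int:
    """Dot product of two length-n {-1, +1} vectors packed as integers."""
    # XNOR marks the positions where the two signs agree.
    agree = ~(a_bits ^ w_bits) & ((1 << n) - 1)
    matches = bin(agree).count("1")  # popcount
    # Each agreement contributes +1, each disagreement -1.
    return 2 * matches - n

# Example: a = [+1, -1, +1, +1], w = [+1, +1, -1, +1]
a = 0b1101
w = 0b1011
print(binary_dot(a, w, 4))  # -> 0, matching (1)(1)+(-1)(1)+(1)(-1)+(1)(1)
```

On hardware, the XNOR and popcount operate on whole machine words at once, which is the source of the acceleration over floating-point MACs.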
Recently, Frankle & Carbin (2019) made a breakthrough by showing that dense networks contain sparse subnetworks that can match the performance of the original network when trained from scratch with weights reset to their initialization (the Lottery Ticket Hypothesis). Although the original approach to finding these subnetworks still required training the dense network, several efforts (Wang et al., 2020b; You et al., 2019; Wang et al., 2020a) have been made to overcome this limitation. More recently, an even more intriguing phenomenon has been reported: a dense network with random initialization contains subnetworks that achieve high accuracy without any further training (Zhou et al., 2019; Ramanujan et al., 2020; Malach et al., 2020; Orseau et al., 2020). These trends highlight good progress toward efficiently and accurately pruning DNNs. In contrast to these positive developments for pruning, results on binarizing DNNs have been mostly negative. To the best of our knowledge, post-training schemes have not succeeded in binarizing pretrained models without retraining. Even when training binary neural networks (BNNs) from scratch (though inefficient), the community has not been able to make BNNs achieve results comparable to their full-precision counterparts. The main reason is that network structures and weight-optimization techniques are predominantly developed for full-precision DNNs and may not be suitable for training BNNs. Thus, closing the accuracy gap between the full-precision and binarized versions may require a paradigm shift. Furthermore, this makes one wonder whether efficiently and accurately binarizing DNNs, analogous to the recent trends in pruning, is even feasible. In this paper, we show that a randomly initialized dense network contains extremely sparse binary subnetworks that, without any weight training (i.e., efficient), have performance comparable to their trained dense full-precision counterparts (i.e., accurate).
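The "high accuracy without any further training" result rests on selecting a subnetwork rather than updating weights. The following minimal sketch, in the spirit of Ramanujan et al. (2020), illustrates the mechanism: weights stay frozen at their random initialization and a per-weight score decides which connections survive. The function names and the fixed keep-fraction are illustrative assumptions.

```python
import random

random.seed(0)

def topk_mask(scores, keep_frac):
    """Return a 0/1 mask keeping the keep_frac highest-scoring entries."""
    k = max(1, int(keep_frac * len(scores)))
    thresh = sorted(scores, reverse=True)[k - 1]
    return [1 if s >= thresh else 0 for s in scores]

n = 8
weights = [random.gauss(0, 1) for _ in range(n)]  # frozen random weights
scores = [random.gauss(0, 1) for _ in range(n)]   # trainable pruning scores
mask = topk_mask(scores, keep_frac=0.5)           # subnetwork selection

x = [random.gauss(0, 1) for _ in range(n)]
# Forward pass of one neuron of the pruned random network:
y = sum(w * m * xi for w, m, xi in zip(weights, mask, x))
print(sum(mask))  # -> 4 (half of the 8 connections survive)
```

In the actual algorithms, gradient descent updates the scores (with a straight-through estimator for the discrete mask) while the weights themselves are never touched.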
Based on this, we state our hypothesis:

Multi-Prize Lottery Ticket Hypothesis. A sufficiently over-parameterized neural network with random weights contains several subnetworks (winning tickets) that (a) have comparable accuracy to a dense target network with learned weights (prize 1), (b) do not require any further training to achieve prize 1 (prize 2), and (c) are robust to extreme forms of quantization (i.e., binary weights and/or activations) (prize 3).

Contributions. First, we propose the multi-prize lottery ticket hypothesis as a new perspective on finding neural networks with drastically reduced memory size, much faster test-time inference and
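Combining the three prizes, a multi-prize ticket amounts to a pruned, sign-binarized version of the random initialization. The sketch below shows what such a forward pass could look like; using the mean magnitude of the surviving weights as the gain alpha is an illustrative choice, not necessarily the paper's exact rescaling.

```python
import random

random.seed(1)

n = 16
w = [random.gauss(0, 1) for _ in range(n)]  # frozen random FP weights
# A found subnetwork (here simulated with a random 25% mask):
mask = [1 if random.random() < 0.25 else 0 for _ in range(n)]
if not any(mask):
    mask[0] = 1  # guard against an empty subnetwork

# Per-layer gain: mean magnitude of surviving weights (assumed choice).
kept = [abs(wi) for wi, m in zip(w, mask) if m]
alpha = sum(kept) / len(kept)

# Binary-weight ticket: each surviving weight becomes +alpha or -alpha.
w_bin = [alpha * (1 if wi > 0 else -1) * m for wi, m in zip(w, mask)]

x = [random.gauss(0, 1) for _ in range(n)]
y = sum(wb * xi for wb, xi in zip(w_bin, x))  # inference, no weight training
```

Because every surviving weight shares the same magnitude, the layer can be stored as one bit per weight plus a single scalar, which is what makes the resulting network both compact and fast at test time.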



A detailed discussion on related work on pruning and quantization is provided in Appendix F.



Figure 1: Multi-Prize Ticket Performance: Multi-prize tickets, obtained only by pruning and binarizing random networks, outperform trained full-precision and SOTA binary-weight networks.

