MULTI-PRIZE LOTTERY TICKET HYPOTHESIS: FINDING ACCURATE BINARY NEURAL NETWORKS BY PRUNING A RANDOMLY WEIGHTED NETWORK

Abstract

Recently, Frankle & Carbin (2019) demonstrated that randomly initialized dense networks contain subnetworks that, once found, can be trained to reach test accuracy comparable to the trained dense network. However, finding these high-performing trainable subnetworks is expensive, requiring an iterative process of training and pruning weights. In this paper, we propose (and prove) a stronger Multi-Prize Lottery Ticket Hypothesis: A sufficiently over-parameterized neural network with random weights contains several subnetworks (winning tickets) that (a) have comparable accuracy to a dense target network with learned weights (prize 1), (b) do not require any further training to achieve prize 1 (prize 2), and (c) are robust to extreme forms of quantization (i.e., binary weights and/or activations) (prize 3). This provides a new paradigm for learning compact yet highly accurate binary neural networks simply by pruning and quantizing randomly weighted full-precision neural networks. We also propose an algorithm for finding multi-prize tickets (MPTs) and test it by performing a series of experiments on the CIFAR-10 and ImageNet datasets. Empirical results indicate that as models grow deeper and wider, multi-prize tickets start to reach similar (and sometimes even higher) test accuracy compared to their significantly larger and full-precision counterparts that have been weight-trained. Without ever updating the weight values, our MPTs-1/32 not only set a new binary-weight-network state-of-the-art (SOTA) Top-1 accuracy (94.8% on CIFAR-10 and 74.03% on ImageNet) but also outperform their full-precision counterparts by 1.78% and 0.76%, respectively. Further, our MPT-1/1 achieves SOTA Top-1 accuracy (91.9%) for binary neural networks on CIFAR-10. Code and pre-trained models are available at: https://github.com/chrundle/biprop.

1. INTRODUCTION

Deep learning (DL) has made significant breakthroughs in a wide range of applications (Goodfellow et al., 2016). These performance improvements can be attributed to significant growth in model size and the availability of massive computational resources to train such models. However, these gains have come at the cost of large memory consumption, high inference time, and increased power consumption. This not only limits the potential applications where DL can make an impact but also has serious consequences, such as (a) generating a huge carbon footprint and (b) creating roadblocks to the democratization of AI. Note that significant parameter redundancy and a large number of floating-point operations are the key factors incurring these costs. Thus, to discard the redundancy in DNNs, one can either (a) prune: remove non-essential connections from an existing dense network, or (b) quantize: constrain the full-precision (FP) weight and activation values to a set of discrete values that allows them to be represented using fewer bits. Further, one can exploit the complementary nature of pruning and quantization to combine their strengths. Although pruning and quantization are typical approaches for compressing DNNs (Neill, 2020), it is not clear under what conditions and to what extent compression can be achieved without sacrificing accuracy. The most extreme form of quantization is binarization, where weights and/or activations can only take two possible values, namely -1 (0) or +1 (the interest of this paper). In addition to saving memory, binarization results in more power-efficient networks with significant computation acceleration, since expensive multiply-accumulate operations (MACs) can be replaced by cheap XNOR and bit-counting operations (Qin et al., 2020a).
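To make the claimed acceleration concrete, the following standalone sketch (ours, not from the paper) checks that a dot product of ±1 vectors reduces to a single XNOR followed by a bit count:

```python
import numpy as np

# A ±1 dot product is (#matches - #mismatches). Encoding +1 -> bit 1 and
# -1 -> bit 0, it equals 2 * popcount(XNOR(w, x)) - n, so one XNOR and one
# bit count replace n multiply-accumulate operations.
rng = np.random.default_rng(0)
n = 64
w = rng.choice([-1, 1], size=n)
x = rng.choice([-1, 1], size=n)

wb = int("".join("1" if v == 1 else "0" for v in w), 2)  # pack to n-bit ints
xb = int("".join("1" if v == 1 else "0" for v in x), 2)

matches = bin(~(wb ^ xb) & ((1 << n) - 1)).count("1")    # popcount of XNOR
assert int(np.dot(w, x)) == 2 * matches - n
```

On hardware, the packed representation also shrinks memory traffic by 32× relative to FP32 weights.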
In light of these benefits, it is of interest to ask whether conditions exist under which a binarized DNN can be pruned to achieve accuracy comparable to the dense FP DNN. More importantly, even if these favorable conditions are met, how do we find these extremely compressed (or compact) and highly accurate subnetworks? Traditional pruning schemes have shown that a pretrained DNN can be pruned without a significant loss in performance. Recently, (Frankle & Carbin, 2019) made a breakthrough by showing that dense networks contain sparse subnetworks that can match the performance of the original network when trained from scratch with weights reset to their initialization (the Lottery Ticket Hypothesis). Although the original approach to finding these subnetworks still required training the dense network, some efforts (Wang et al., 2020b; You et al., 2019; Wang et al., 2020a) have been made to overcome this limitation. Recently, a more intriguing phenomenon has been reported: a dense network with random initialization contains subnetworks that achieve high accuracy without any further training (Zhou et al., 2019; Ramanujan et al., 2020; Malach et al., 2020; Orseau et al., 2020). These trends highlight good progress toward efficiently and accurately pruning DNNs. In contrast to these positive developments for pruning, results on binarizing DNNs have been mostly negative. To the best of our knowledge, post-training schemes have not succeeded in binarizing pretrained models without retraining. Even when training binary neural networks (BNNs) from scratch (though inefficient), the community has not been able to make BNNs achieve results comparable to their full-precision counterparts. The main reason is that network structures and weight-optimization techniques are predominantly developed for full-precision DNNs and may not be suitable for training BNNs.
Thus, closing the accuracy gap between the full-precision and binarized versions may require a paradigm shift. Furthermore, this also makes one wonder whether efficiently and accurately binarizing DNNs, similar to the recent trends in pruning, is even feasible. In this paper, we show that a randomly initialized dense network contains extremely sparse binary subnetworks that, without any weight training (i.e., efficient), have comparable performance to their trained dense and full-precision counterparts (i.e., accurate). Based on this, we state our hypothesis: Multi-Prize Lottery Ticket Hypothesis. A sufficiently over-parameterized neural network with random weights contains several subnetworks (winning tickets) that (a) have comparable accuracy to a dense target network with learned weights (prize 1), (b) do not require any further training to achieve prize 1 (prize 2), and (c) are robust to extreme forms of quantization (i.e., binary weights and/or activations) (prize 3). Contributions. First, we propose the Multi-Prize Lottery Ticket Hypothesis as a new perspective on finding neural networks with drastically reduced memory size, much faster test-time inference, and lower power consumption compared to their dense and full-precision counterparts. Next, we provide theoretical evidence of the existence of highly accurate binary subnetworks within a randomly weighted DNN (i.e., proving the Multi-Prize Lottery Ticket Hypothesis). Specifically, we mathematically prove that we can find an $\epsilon$-approximation of a fully-connected ReLU DNN of width $n$ and depth $\ell$ using a sparse binary-weight DNN of sufficient width. Our proof indicates that this can be accomplished by pruning and binarizing the weights of a randomly weighted neural network that is a factor of $O(n^{3/2}\ell/\epsilon)$ wider and twice as deep.
To the best of our knowledge, this is the first theoretical work proving the existence of highly accurate binary subnetworks within a sufficiently overparameterized randomly initialized neural network. Finally, we provide biprop (binarize-prune optimizer) in Algorithm 1 to identify MPTs within randomly weighted DNNs and empirically test our hypothesis. This provides a completely new way to learn BNNs without relying on weight optimization. Results. We explore two variants of multi-prize tickets: one with binary weights (MPT-1/32) and the other with binary weights and activations (MPT-1/1), where x/y denotes x and y bits used to represent weights and activations, respectively. The MPTs we find have 60-80% fewer parameters than the original network. We perform a series of experiments on small- and large-scale datasets for image recognition, namely CIFAR-10 (Krizhevsky et al., 2009) and ImageNet (Deng et al., 2009). On CIFAR-10, we test the performance of multi-prize tickets against the trend of making the model deeper and wider. We found that as models grow deeper and wider, both variants of multi-prize tickets start to reach similar (and sometimes even higher) test accuracy compared to the dense and full-precision original network with learned weights. In other words, the performance of multi-prize tickets improves with the amount of redundancy in the original network. We also carry out experiments with state-of-the-art (SOTA) architectures on the CIFAR-10 and ImageNet datasets with the aim of investigating their redundancy. We find that within most randomly weighted SOTA DNNs reside extremely compact (i.e., sparse and binary) subnetworks which are smaller than, but match the performance of, trained dense and full-precision target networks. Furthermore, with minimal hyperparameter tuning, our MPTs achieve Top-1 accuracy comparable to (or higher than) SOTA BNNs. The performance of MPTs is further improved by allowing the parameters in the BatchNorm layers to be learned.
Finally, on both CIFAR-10 and ImageNet, MPT-1/32 subnetworks outperform their significantly larger and full-precision counterparts that have been weight-trained.

2. MULTI-PRIZE LOTTERY TICKETS: THEORY AND ALGORITHMS

We first prove the existence of MPTs in an overparameterized randomly weighted DNN. For ease of presentation, we state an informal version of Theorem 2 which can be found in Appendix B. We then explore two variants of tickets (MPT-1/32 and MPT-1/1) and provide an algorithm to find them.

2.1. PROVING THE MULTI-PRIZE LOTTERY TICKETS HYPOTHESIS

In this section we seek to answer the following question: What is the required amount of overparameterization such that a randomly weighted neural network can be compressed to a sparse binary subnetwork that approximates a dense trained target network?

Theorem 1. (Informal Statement of Theorem 2) Let $\epsilon, \delta > 0$. For every fully-connected (FC) target network with ReLU activations of depth $\ell$ and width $n$ with bounded weights, a random binary FC network with ReLU activations of depth $2\ell$ and width $O\!\left(\frac{n^{3/2}\ell}{\epsilon} + n\log\frac{n\ell}{\delta}\right)$ contains with probability $1-\delta$ a binary subnetwork that approximates the target network with error at most $\epsilon$.

Sketch of Proof. Consider a FC ReLU network $F(x) = W^{(\ell)}\sigma(W^{(\ell-1)} \cdots \sigma(W^{(1)}x))$, where $\sigma(x) = \max\{0, x\}$, $x \in \mathbb{R}^d$, $W^{(i)} \in \mathbb{R}^{k_i \times k_{i-1}}$, $k_0 = d$, and $i \in [\ell]$. Additionally, consider a FC network with binary weights given by $G(x) = B^{(\ell')}\sigma(B^{(\ell'-1)} \cdots \sigma(B^{(1)}x))$, where $B^{(i)} \in \{-1,+1\}^{k'_i \times k'_{i-1}}$, $k'_0 = d$, and $i \in [\ell']$. Our goal is to determine a lower bound on the depth $\ell'$ and the widths $\{k'_i\}_{i=1}^{\ell'}$ such that with probability $1-\delta$ the network $G(x)$ contains a subnetwork $\tilde{G}(x)$ satisfying $\|\tilde{G}(x) - F(x)\| \le \epsilon$, for any $\epsilon > 0$ and $\delta \in (0,1)$. We first establish lower bounds on the width of a network of the form $g(x) = B^{(2)}\sigma(B^{(1)}x)$ such that with probability $1-\delta'$ there exists a subnetwork $\tilde{g}(x)$ of $g(x)$ satisfying $\|\tilde{g}(x) - \sigma(Wx)\| \le \epsilon'$, for any $\epsilon' > 0$ and $\delta' \in (0,1)$. This process is carried out in detail in Lemmas 1, 2, and 3 in Appendix B. We have now approximated a single-layer FC real-valued network using a subnetwork of a two-layer FC binary network. Hence, we can take $\ell' = 2\ell$, and Lemma 3 provides lower bounds on the width of each intermediate layer such that with probability $1-\delta$ there exists a subnetwork $\tilde{G}(x)$ of $G(x)$ satisfying $\|\tilde{G}(x) - F(x)\| \le \epsilon$. This is accomplished in Theorem 2 in Appendix B. To the best of our knowledge, this is the first theoretical result proving that a sparse binary-weight DNN can approximate a real-valued target DNN.
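The replication trick at the heart of Lemma 1 can be sanity-checked numerically. The sketch below (ours, with arbitrary illustrative values of $\alpha$ and $\epsilon$) verifies that $c = \lfloor |\alpha|/\epsilon \rfloor$ paired ReLU units reproduce $\alpha x$ to within $\epsilon$ via the identity $\sigma(a) - \sigma(-a) = a$:

```python
import numpy as np

# Approximate the scalar map x -> alpha * x using only units of the form
# sigma(eps * sign(alpha) * x), as in the Lemma 1 construction.
eps, alpha = 0.05, 0.37
c = int(abs(alpha) // eps)            # c * eps <= |alpha| <= (c + 1) * eps
relu = lambda z: np.maximum(z, 0.0)

for x in np.linspace(-1.0, 1.0, 101):
    # c positive-path units minus c negative-path units give c*eps*sign(alpha)*x
    g = c * relu(eps * np.sign(alpha) * x) - c * relu(-eps * np.sign(alpha) * x)
    assert abs(g - alpha * x) <= eps  # error |x| * (|alpha| - c*eps) <= eps
```

The residual error here is $|x|\,(|\alpha| - c\epsilon) \le \epsilon$ for $\|x\|_\infty \le 1$, matching the bound (5) in Appendix B.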
As it has been established that real-valued DNNs are universal approximators (Scarselli & Tsoi, 1998), our result carries the implication that sparse binary-weight DNNs are also universal approximators. In relation to the first result establishing the existence of real-valued subnetworks in a randomly weighted DNN approximating a real-valued target DNN (Malach et al., 2020), the lower bound on the width established in Theorem 2 improves on their lower bound of $O\!\left(\frac{n^2 \ell^2}{\epsilon^2} \log\frac{n\ell}{\delta}\right)$.

2.2. FINDING MULTI-PRIZE WINNING TICKETS

Given the existence of multi-prize winning tickets from Theorem 2, a natural question arises: how should we find them? In this section, we answer this question by introducing an algorithm for finding multi-prize tickets. Specifically, we explore two variants of multi-prize tickets in this paper: 1) MPT-1/32, where weights are quantized to 1 bit and activations are real-valued (i.e., 32 bits), and 2) MPT-1/1, where both weights and activations are quantized to 1 bit. We first outline a generic process for identifying MPTs along with some theoretical motivation for our approach. Given a neural network $g(x; W)$ with weights $W \in \mathbb{R}^m$, we can express a subnetwork of $g$ using a binary mask $M \in \{0,1\}^m$ as $g(x; M \odot W)$, where $\odot$ denotes the Hadamard product. Hence, a binary subnetwork can be expressed as $g(x; M \odot B)$, where $B \in \{-1,+1\}^m$. Lemma 1 in Appendix B indicates that rescaling the binary weights to $\{-\alpha, \alpha\}$ using a gain term $\alpha \in \mathbb{R}$ is necessary to achieve good performance of the resulting subnetwork. We note that the use of gain terms is common in binary neural networks (Qin et al., 2020a; Martinez et al., 2020; Bulat & Tzimiropoulos, 2019). Combining all this allows us to represent a binary subnetwork as $g(x; \alpha(M \odot B))$. Now we focus on how to update $M$, $B$, and $\alpha$. Suppose $f(x; W^*)$ is a target network with optimized weights $W^*$ that we wish to approximate. Assuming $g(x; \cdot)$ is $\kappa$-Lipschitz continuous yields
$$\underbrace{\|g(x; \alpha(M \odot B)) - f(x; W^*)\|}_{\text{MPT error}} \;\le\; \kappa \underbrace{\|M \odot (W - \alpha B)\|}_{\text{binarization error}} \;+\; \underbrace{\|g(x; M \odot W) - f(x; W^*)\|}_{\text{subnetwork error}}. \quad (1)$$
Hence, the MPT error is bounded above by the error of the subnetwork of $g$ with the original weights and the error from binarizing the current subnetwork. This informs our approach for identifying MPTs: 1) update a pruning mask $M$ that reduces the subnetwork error (lines 7-9 in Algorithm 1), and 2) apply binarization with a gain term that minimizes the binarization error (lines 4 and 10). We first discuss how to update $M$.
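For intuition, the decomposition in (1) can be checked numerically for a single linear layer $g(x; V) = Vx$, where $\kappa = \|x\|_2$ works because $\|Av\|_2 \le \|A\|_F \|v\|_2$. All tensors below are toy stand-ins of our own choosing:

```python
import torch

torch.manual_seed(0)
d = 16
W = torch.randn(d, d)                      # random weights of g
W_star = torch.randn(d, d)                 # "trained" target weights (toy stand-in)
M = (torch.rand(d, d) > 0.5).float()       # an arbitrary pruning mask
B = torch.sign(W)                          # binarized weights
alpha = (M * W).abs().sum() / M.sum()      # gain term (closed form from Section 2.2)
x = torch.randn(d)

mpt_err = torch.norm(alpha * (M * B) @ x - W_star @ x)          # left-hand side of (1)
bin_err = torch.norm(x) * torch.norm(M * (W - alpha * B))       # kappa * binarization error
sub_err = torch.norm((M * W) @ x - W_star @ x)                  # subnetwork error
assert mpt_err <= bin_err + sub_err + 1e-4                      # triangle inequality holds
```

The check is just the triangle inequality plus the Frobenius bound, so it holds for any choice of mask, gain, and input.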
While we could search for $M$ by minimizing the subnetwork error in (1), this would require the use of a pretrained target network (i.e., $f(x; W^*)$). To avoid requiring a target network, we instead aim to minimize the training loss with respect to $M$ in the current binary subnetwork. Directly optimizing over the pruning mask is a combinatorial problem, so to update the pruning mask efficiently we optimize over a set of scores $S \in \mathbb{R}^m$ corresponding to the randomly initialized weights in the network. In this approach, each component of the randomly initialized weights is assigned a pruning score. The pruning scores are updated via backpropagation by computing the gradient of the loss function over minibatches with respect to the pruning scores (line 7). The magnitudes of the scores are then used to identify the $P$ percent of weights in each layer that are least important to the success of the binary subnetwork (line 8). The components of the pruning mask corresponding to these indices are set to 0 and the remaining components are set to 1 (line 9). To avoid unintentionally pruning an entire layer of the network, we use a pruning mask for each layer that prunes $P$ percent of the weights in that layer. We chose to update the mask $M$ via pruning scores because it is computationally efficient; the use of pruning scores is a well-established optimization technique used in a range of applications (Joshi & Boyd, 2009; Ramanujan et al., 2020).

Algorithm 1 biprop: Finding multi-prize tickets in a randomly weighted neural network
1: Input: Neural network $g(x; \cdot)$ with 1- or 32-bit activations; network depth $\ell$; layer widths $\{k_j\}_{j=1}^{\ell}$; loss function $L$; training data $\{(x^{(i)}, y^{(i)})\}_{i=1}^{N}$; pruning percentage $P$.
2: Randomly initialize FP parameters: network weights $\{W^{(j)}\}_{j=1}^{\ell}$; pruning scores $\{S^{(j)}\}_{j=1}^{\ell}$.
3: Initialize layerwise pruning masks: $\{M^{(j)}\}_{j=1}^{\ell}$, each to 1.
4: Initialize binary subnetwork weights: $\{B^{(j)}\}_{j=1}^{\ell} \leftarrow \{\operatorname{sign}(W^{(j)})\}_{j=1}^{\ell}$.
5: Initialize layerwise gain terms: $\{\alpha^{(j)}\}_{j=1}^{\ell} \leftarrow \{\|M^{(j)} \odot W^{(j)}\|_1 / \|M^{(j)}\|_1\}_{j=1}^{\ell}$.
6: for $k = 1$ to $N_{\mathrm{epochs}}$ do
7: $\quad S^{(j)} \leftarrow S^{(j)} - \eta \nabla_{S^{(j)}} L(\{\alpha^{(j)}(M^{(j)} \odot B^{(j)})\}_{j=1}^{\ell})$ ▷ Update pruning scores at layer $j$
8: $\quad \{\tau(i)\}_{i=1}^{k_j} \leftarrow$ sorting of indices $\{i\}_{i=1}^{k_j}$ s.t. $|S^{(j)}_{\tau(i)}| \le |S^{(j)}_{\tau(i+1)}|$ ▷ Index sort over values $|S^{(j)}|$
9: $\quad M^{(j)}_{i} \leftarrow \mathbb{1}_{\{\tau(i) \ge k_j P / 100\}}(i)$ ▷ Update pruning mask at layer $j$
10: $\quad \alpha^{(j)} \leftarrow \|M^{(j)} \odot W^{(j)}\|_1 / \|M^{(j)}\|_1$ ▷ Update gain term at layer $j$
11: Output: Return the binarized subnetwork $g(x; \{\alpha^{(j)}(M^{(j)} \odot B^{(j)})\}_{j=1}^{\ell})$.

We now consider how to update $B$ and $\alpha$. Keeping $M$ fixed, we can derive the following closed-form expressions that minimize the binarization error in (1): $B^* = \operatorname{sign}(W)$ and $\alpha^* = \|M \odot W\|_1 / \|M\|_1$. These closed-form expressions indicate that only the gain term needs to be recomputed after each update to $M$; hence, $B = \operatorname{sign}(W)$ throughout our entire approach (line 4). We update a gain term for each layer of the subnetwork based on the formula for $\alpha^*$ (line 10). More details on the derivation of $B^*$ and $\alpha^*$ are provided in Appendix C. Pseudocode for our method biprop (binarize-prune optimizer) is given in Algorithm 1; cross-entropy loss is used in our experiments. Note that the processes for identifying MPT-1/32 and MPT-1/1 differ only in the computation of the gradient. Next, we explain how these gradients can be computed.
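The steps above can be sketched for a single layer as follows. The toy data, layer sizes, and the `.detach()`-based straight-through step are our illustrative choices, not the authors' released implementation:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d_in, d_out, P = 20, 10, 50                        # P = pruning percentage
W = torch.randn(d_out, d_in)                       # random weights, never trained
B = torch.sign(W)                                  # line 4: fixed binary weights
S = torch.rand(d_out, d_in, requires_grad=True)    # pruning scores (the only trainables)

def hard_mask(S, P):
    # lines 8-9: keep the (100 - P)% of weights with the largest |score|
    k = int(S.numel() * P / 100)
    thresh = S.abs().flatten().kthvalue(k).values
    return (S.abs() > thresh).float()

x = torch.randn(128, d_in)                         # toy minibatch
y = torch.randint(0, d_out, (128,))
opt = torch.optim.SGD([S], lr=0.1)
losses = []
for _ in range(200):                               # line 6: epochs
    M = hard_mask(S, P)
    M = (M - S).detach() + S                       # straight-through: dM/dS = 1
    alpha = (M.detach() * W).abs().sum() / M.detach().sum()  # line 10: gain term
    logits = x @ (alpha * (M * B)).t()             # subnetwork g(x; alpha(M ⊙ B))
    loss = F.cross_entropy(logits, y)
    losses.append(loss.item())
    opt.zero_grad(); loss.backward(); opt.step()   # line 7: update scores only
```

Note that `W` and `B` are never updated; only the scores `S` (and hence the mask and gain) change, mirroring lines 7-10 of Algorithm 1.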

2.2.1. UPDATING PRUNING SCORES FOR BINARY-WEIGHT TICKETS (MPT-1/32)

As an example, for a FC network where the state at each layer is defined recursively by $U^{(1)} = \alpha^{(1)}(B^{(1)} \odot M^{(1)})x$ and $U^{(j)} = \alpha^{(j)}(B^{(j)} \odot M^{(j)})\sigma(U^{(j-1)})$, we have
$$\frac{\partial L}{\partial S^{(j)}_{p,q}} = \frac{\partial L}{\partial U^{(j)}_{q}} \frac{\partial U^{(j)}_{q}}{\partial M^{(j)}_{p,q}} \frac{\partial M^{(j)}_{p,q}}{\partial S^{(j)}_{p,q}}.$$
We use the straight-through estimator (Bengio et al., 2013) for $\frac{\partial M^{(j)}_{p,q}}{\partial S^{(j)}_{p,q}}$, which yields
$$\frac{\partial L}{\partial S^{(j)}_{p,q}} = \frac{\partial L}{\partial U^{(j)}_{q}}\, \alpha^{(j)} B^{(j)}_{p,q}\, \sigma\!\left(U^{(j-1)}_{p}\right),$$
where $\frac{\partial L}{\partial U^{(j)}_{q}}$ is computed via backpropagation.
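This formula can be checked against automatic differentiation on a toy layer. With the straight-through step $\partial M / \partial S = 1$, the gradient with respect to the mask equals the gradient with respect to the scores; the names below are ours:

```python
import torch

torch.manual_seed(0)
p_dim, q_dim, alpha = 5, 3, 0.8
B = torch.sign(torch.randn(q_dim, p_dim))          # fixed binary weights
U_prev = torch.randn(p_dim)                        # previous-layer state
M = torch.ones(q_dim, p_dim, requires_grad=True)   # mask; STE makes dM/dS = 1

U = alpha * (B * M) @ torch.relu(U_prev)           # U_q = alpha * sum_p B_qp M_qp sigma(U_prev)_p
L = U.sum()                                        # so dL/dU_q = 1 for every q
L.backward()

# Section 2.2.1 formula: dL/dS_qp = (dL/dU_q) * alpha * B_qp * sigma(U_prev)_p
manual = alpha * B * torch.relu(U_prev)            # broadcasts over output rows
assert torch.allclose(M.grad, manual, atol=1e-6)
```

Autograd and the hand-derived expression agree entrywise, confirming the chain-rule decomposition above.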

2.2.2. UPDATING PRUNING SCORES FOR BINARY-ACTIVATION TICKETS (MPT-1/1)

Note that MPT-1/1 uses the sign activation function. From Section 2.2.1, it immediately follows that
$$\frac{\partial L}{\partial S^{(j)}_{p,q}} = \frac{\partial L}{\partial U^{(j)}_{q}}\, \alpha^{(j)} B^{(j)}_{p,q}\, \operatorname{sign}\!\left(U^{(j-1)}_{p}\right).$$
However, updating $\frac{\partial L}{\partial U^{(j)}_{q}}$ via backpropagation requires a gradient estimator for the sign activation function. To motivate our choice of estimator, note that we can approximate the sign function using a quadratic spline parameterized by some $t > 0$:
$$s_t(x) = \begin{cases} -1 & x < -t \\ q_1(x) & x \in [-t, 0) \\ q_2(x) & x \in [0, t) \\ 1 & x \ge t \end{cases}. \quad (2)$$
In (2), $q_i(x) = a_i x^2 + b_i x + c_i$, and suitable values for the coefficients are derived using the following zero- and first-order constraints: $q_1(-t) = -1$, $q_1(0) = 0$, $q_2(0) = 0$, $q_2(t) = 1$, $q_1'(-t) = 0$, $q_1'(0) = q_2'(0)$, and $q_2'(t) = 0$. This yields $q_1(x) = (x/t)^2 + 2(x/t)$ and $q_2(x) = -(x/t)^2 + 2(x/t)$. As $s_t(x)$ approximates $\operatorname{sign}(x)$, we can use $s_t'(x)$ as our gradient estimator. Since $q_1'(x) = \frac{2}{t}\left(1 + \frac{x}{t}\right)$ and $q_2'(x) = \frac{2}{t}\left(1 - \frac{x}{t}\right)$, it follows that
$$s_t'(x) = \frac{2}{t}\left(1 - \frac{|x|}{t}\right)\mathbb{1}_{\{x \in [-t,t]\}}(x).$$
The choice to approximate sign using a quadratic spline instead of a cubic spline results in a gradient estimator that can be implemented efficiently in PyTorch as torch.clamp(2 * (1 - torch.abs(x)/t)/t, min=0.0). We note that $\lim_{t \to 0} s_t(x) = \operatorname{sign}(x)$, which suggests that smaller values of $t$ yield closer approximations. Our experiments use $s_1'(x)$ as the gradient estimator since we found it to work well in practice. Finally, we note that taking $t = 1$ in our gradient estimator yields the same value as the gradient estimator in (Liu et al., 2018a); however, our implementation in PyTorch is 6× more memory efficient.
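One way to wire this estimator into autograd is a custom function whose forward pass is the exact sign and whose backward pass applies the clamp expression quoted above; the class name and the $t = 1$ default are our illustrative choices:

```python
import torch

class SignSTE(torch.autograd.Function):
    """sign(x) on the forward pass; quadratic-spline derivative s_t'(x) on the backward pass."""

    @staticmethod
    def forward(ctx, x, t=1.0):
        ctx.save_for_backward(x)
        ctx.t = t
        return torch.sign(x)

    @staticmethod
    def backward(ctx, grad_out):
        x, = ctx.saved_tensors
        t = ctx.t
        # s_t'(x) = (2/t)(1 - |x|/t) on [-t, t], zero elsewhere
        est = torch.clamp(2 * (1 - torch.abs(x) / t) / t, min=0.0)
        return grad_out * est, None            # no gradient for t

x = torch.tensor([-2.0, -0.5, 0.25, 3.0], requires_grad=True)
y = SignSTE.apply(x)
y.sum().backward()
# y is exactly sign(x) = [-1, -1, 1, 1]; x.grad is [0.0, 1.0, 1.5, 0.0]
```

Inputs outside $[-t, t]$ receive zero gradient, while inputs near zero receive the largest gradient, matching the spline derivative plotted from $s_t'$.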

3. EXPERIMENTAL RESULTS

The primary goal of the experiments in Section 3.1 is to empirically verify our Multi-Prize Lottery Ticket Hypothesis. As a secondary objective, we would like to determine tunable factors that make randomly initialized networks amenable to containing readily identifiable multi-prize tickets (MPTs). Thus, we test our hypothesis against the general trend of increasing the model size (depth and width) and monitor the accuracy of the identified MPTs. After verifying our Multi-Prize Lottery Ticket Hypothesis, we compare the performance of MPTs to the state of the art in binary neural networks and their dense counterparts on the CIFAR-10 and ImageNet datasets in Section 3.2. We use VGG (Simonyan & Zisserman, 2014) variants as our network architectures for searching for MPTs. In each randomly weighted network, we find winning tickets MPT-1/32 and MPT-1/1 for different pruning rates using Algorithm 1. We choose our baselines as dense full-precision models with learned weights. In all experiments, we use three independent initializations and report the average Top-1 accuracy with error bars extending to the lowest and highest Top-1 accuracy. Additional experiment configuration details are provided in Appendix A.

3.1.1. DO WINNING TICKETS EXIST IN DEEP NETWORKS?

In this experiment, we empirically test the following hypothesis: as a network grows deeper, the performance of multi-prize tickets in the randomly initialized network will approach the performance of the same network with learned weights. We are further interested in exploring the network depth required for our hypothesis to hold. In Figure 2, we vary the depth of VGG architectures (d = 2 to 8) and compare the Top-1 accuracy of MPTs (at different pruning rates) with the weight-trained dense network. We notice that there exists a range of pruning rates where the performance of MPTs is very similar, and beyond this range the performance drops quickly. Interestingly, as the network depth increases, more parameters can be pruned without hurting the performance of MPTs. For example, MPT-1/32 can match the performance of trained Conv-8 while having only ∼20% of its parameter count. Interestingly, the performance gap between MPT-1/32 and MPT-1/1 does not change much with depth across different pruning rates. We further note that the performance of MPTs improves with increasing depth, and both variants start to approach the performance of the dense model with learned weights. This gain starts to plateau beyond a certain depth, suggesting that the MPTs might be approaching the limit of their achievable accuracy. Surprisingly, MPT-1/32 performs equally well as (or better than) the weight-trained model despite having 50-80% fewer parameters and binarized weights.

3.1.2. DO WINNING TICKETS EXIST IN WIDE NETWORKS?

Figure 4: Effect of Varying Width on MPT-1/1: Comparing the Top-1 accuracy of sparse and binary MPT-1/1 to the dense, full-precision, weight-optimized network on CIFAR-10.

Similar to the previous experiment, we empirically test the following hypothesis: as a network grows wider, the performance of multi-prize tickets in the randomly initialized network will approach the performance of the same network with learned weights. We are further interested in exploring the layer width required for our hypothesis to hold. In Figures 3 and 4, we vary the width of different VGG architectures and compare the Top-1 accuracy of MPT-1/32 and MPT-1/1 tickets (at different pruning rates) with the weight-trained dense network. A width multiplier of 1 corresponds to the models in Figure 2. The performance of all models improves when increasing the width, and the performance of both MPT-1/32 and MPT-1/1 starts to approach the performance of the dense model with learned weights, although this gain starts to plateau beyond a certain width. For both MPT-1/32 and MPT-1/1, as the width and depth increase, the performance at different pruning rates approaches the same value. This phenomenon yields a more significant gain in performance for MPTs with higher pruning rates. Similar to the previous experiment, the performance of MPT-1/32 matches (or exceeds) the performance of dense models for a large range of pruning rates. Furthermore, in the high-width regime, a large number of weights (∼90%) can be pruned without a noticeable impact on the performance of MPTs. We also notice that the performance gap between MPT-1/32 and MPT-1/1 decreases significantly with an increase in width, in sharp contrast with the depth experiments, where the performance gap between MPT-1/32 and MPT-1/1 appeared to be largely independent of the depth.

Published as a conference paper at ICLR 2021

Key Takeaways.
Our experiments verify the Multi-Prize Lottery Ticket Hypothesis and additionally convey the significance of choosing an appropriate network depth and layer width for a given pruning rate. In particular, we find that a network with a large width can be pruned more aggressively without sacrificing much accuracy, while the accuracy of a network with a smaller width suffers when pruning a large percentage of the weights. Similar patterns hold for the depth of the networks as well. The amount of overparameterization needed to approach the performance of dense networks seems to differ for the MPT variants: MPT-1/1 requires greater depth and width than MPT-1/32.

3.2. HOW REDUNDANT ARE STATE-OF-THE-ART DEEP NEURAL NETWORKS?

Having shown that MPTs can perform equally well as (or better than) overparameterized networks, this experiment aims to answer: are state-of-the-art weight-trained DNNs overparameterized enough that significantly smaller multi-prize tickets can match (or beat) their performance? Experimental Configuration. Instead of focusing on extremely large DNNs, we experiment with small to moderate-size DNNs. Specifically, we analyze the redundancy of the following backbone models: (1) VGG-Small and ResNet-18 on CIFAR-10, and (2) WideResNet-34 and WideResNet-50 on ImageNet. As we show later, even these models are highly redundant; thus, our findings extend naturally to larger models. In this process, we also perform a comprehensive comparison of the performance of our multi-prize winning tickets with the state of the art in binary neural networks (BNNs). Details on the experimental configuration are provided in Appendix A. This experiment uses Algorithm 1 to find MPTs within randomly initialized backbone networks. We compare the Top-1 accuracy and number of non-zero parameters of our MPT-1/32 and MPT-1/1 tickets with selected BNN baselines (Qin et al., 2020a). Results for CIFAR-10 and ImageNet are shown in Tables 1, 2 and Tables 3, 4, respectively. Next to each MPT method we include the percentage of weights pruned in parentheses. Motivated by (Frankle et al., 2020), we also include models in which the BatchNorm parameters are learned when identifying the random subnetwork using biprop, indicated by +BN. A more comprehensive comparison can be found in Appendix D. Our results highlight that SOTA DNN models are extremely redundant. For a similar parameter count, our binary MPT-1/32 models outperform even full-precision models with learned weights. When compared to the state of the art in BNNs, with minimal hyperparameter tuning our multi-prize tickets achieve comparable (or higher) Top-1 accuracy.
Specifically, our MPT-1/32 outperforms trained binary-weight networks on CIFAR-10 and ImageNet, and our MPT-1/1 outperforms trained binary-weight-and-activation networks on CIFAR-10. Further, on CIFAR-10 and ImageNet, MPT-1/32 networks with significantly reduced parameter counts outperform dense and full-precision networks with learned weights. Searching for MPT-1/1 in BNN-specific architectures (Kim et al., 2020; Bulat et al., 2020a) and adopting other commonly used tricks to improve model and representation capacities (Bulat et al., 2020b; Yang et al., 2020; Lin et al., 2020; 2021) are likely to yield MPT-1/1 networks with improved performance. For example, up to a 7% gain in MPT-1/1 accuracy was achieved by simply allowing the BatchNorm parameters to be updated. Additionally, alternative approaches for updating the pruning mask in biprop could alleviate issues with back-propagating gradients through binary-activation networks.

4. DISCUSSION AND IMPLICATIONS

Existing compression approaches (e.g., pruning and binarization) typically rely on some form of weight training. This paper showed that a sufficiently overparameterized randomly weighted network contains binary subnetworks that achieve high accuracy (comparable to the dense and full-precision original network with learned weights) without any training. We referred to this finding as the Multi-Prize Lottery Ticket Hypothesis. We also proved the existence of such winning tickets and presented a generic procedure to find them. Our comparison with state-of-the-art neural networks corroborated our hypothesis: with minimal hyperparameter tuning, our binary-weight multi-prize tickets outperformed the current state of the art in BNNs, demonstrating their practical importance. Our work has several important practical and theoretical implications. In contrast to weight optimization, which requires large model sizes and massive compute resources to achieve high performance, our hypothesis suggests that one can achieve similar performance without ever training the large model. Therefore, strategies such as fast ticket search (You et al., 2019) or forward ticket selection (Ye et al., 2020) can be developed to enable more efficient ways of finding, or even designing, MPTs. Finally, as opposed to weight optimization, biprop by design achieves compact yet accurate models. Theoretical. MPTs achieve performance similar to the model with learned weights. First, this observation notes the benefit of overparameterization in neural network learning and reinforces the idea that an important task of gradient descent (and of learning in general) may be to effectively compress overparameterized models to find multi-prize tickets. Next, our results highlight the expressive power of MPTs: since we showed that compressed binary subnetworks can approximate any target neural network, and such networks are known to be universal approximators, our MPTs are also universal approximators.
Finally, the Multi-Prize Lottery Ticket Hypothesis also sheds light on the generalization properties of DNNs. Generalization theory for DL is still in its infancy, and it is not clear what and how DNNs learn (Neyshabur et al., 2017). The Multi-Prize Lottery Ticket Hypothesis may serve as a valuable tool for answering such questions, as it indicates the dependence of generalization on compressibility. Practical. The huge storage and heavy computation requirements of state-of-the-art deep neural networks inevitably limit their applications in practice. Multi-prize tickets are significantly lighter, faster, and more efficient while maintaining performance. This unlocks a range of potential applications for DL (e.g., applications with resource-constrained devices such as mobile phones, embedded devices, etc.). Our results also indicate that existing SOTA models might be spending far more compute and power than is needed to achieve a given performance. In other words, SOTA DL models have poor energy efficiency and a significant carbon footprint (Strubell et al., 2019). In this regard, MPTs have the potential to enable environmentally friendly artificial intelligence.

A HYPERPARAMETER CONFIGURATIONS

A.1 HYPERPARAMETERS FOR SECTION 3.1

Experimental Configuration. For MPT-1/32 tickets, the network structure is not modified from the original. For MPT-1/1 tickets, the network structure is modified by moving the max-pooling layer directly after the convolution layer and adding a batch-normalization layer before the binary activation function, as is common in many BNN architectures (Rastegari et al., 2016). We choose our baselines as dense full-precision models with learned weights. The baselines were obtained by training backbone networks using the Adam optimizer with a learning rate of 0.0003 for 100 epochs and a batch size of 60. In each randomly weighted backbone network, we find winning tickets MPT-1/32 and MPT-1/1 for different pruning rates using Algorithm 1. The weights are initialized using the Kaiming Normal distribution (He et al., 2015) for all models except MPT-1/32 on ImageNet, where we use the Signed Constant initialization (Ramanujan et al., 2020) as it yielded slightly better performance. All training routines use a cosine decay learning rate policy. For ImageNet training we used a label smoothing value of 0.1 and a learning rate warmup length of 5 epochs.

B PROVING THE MULTI-PRIZE LOTTERY TICKETS HYPOTHESIS

In the following analysis, note that we write $\mathrm{Bin}(\{-1,+1\}^{m \times n})$ to denote matrices of dimension $m \times n$ whose components are independently sampled from a binomial distribution over $\{-1,+1\}$ with probability $p = 1/2$.

Lemma 1. Let $s \in [d]$, $\alpha \in \left[-\frac{1}{\sqrt{s}}, \frac{1}{\sqrt{s}}\right]$, $i \in [d]$, and $\epsilon, \delta > 0$ be given. Let $B \in \{-1,+1\}^{k \times d}$ be chosen randomly from $\mathrm{Bin}(\{-1,+1\}^{k \times d})$ and $u \in \{-1,+1\}^{k}$ be chosen randomly from $\mathrm{Bin}(\{-1,+1\}^{k})$.
If k ≥ 16/(ε√s) + 16 log(2/δ), then with probability at least 1 - δ there exist masks m ∈ {0, 1}^k and M ∈ {0, 1}^{k×d} such that the function g : R^d → R defined by

g(x) = (m ⊙ u)^T σ(ε(M ⊙ B)x)    (4)

satisfies

|g(x) - α x_i| ≤ ε, for all ||x||_∞ ≤ 1.    (5)

Furthermore, ||m||_0 = ||M||_0 ≤ 2/(ε√s) and max_{1≤j≤k} ||M_{j,:}||_0 ≤ 1.

Proof. If |α| ≤ ε then taking M = 0 yields the desired result. Suppose that |α| > ε. Then there exists c_i ∈ N such that c_i ε ≤ |α| ≤ (c_i + 1)ε and

|c_i ε - |α|| ≤ ε.    (6)

Hence, it follows that

|c_i ε sign(α) x_i - α x_i| = |x_i| |c_i ε - |α|| ≤ ε,

where the final inequality follows from (6) and the hypothesis that ||x||_∞ ≤ 1. Our goal now is to show that with probability 1 - δ the random initialization of u and B yields masks m and M such that g(x) = c_i ε sign(α) x_i. Now fix i ∈ [d] and take k' = k/2. First, we consider the probability

P(|{j ∈ [k'] : u_j = +1 and B_{j,i} = sign(α)}| < c_i).

As u and B_{:,i} are each sampled from a binomial distribution with k trials, the distribution that the pair (u_j, B_{j,i}) is sampled from is a multinomial distribution with four possible events, each having probability 1/4. Since we are only interested in the event (u_j, B_{j,i}) = (+1, sign(α)), we can instead consider a binomial distribution where P((u_j, B_{j,i}) = (+1, sign(α))) = 1/4 and P((u_j, B_{j,i}) ≠ (+1, sign(α))) = 3/4. Hence, using Hoeffding's inequality we have that

P(|{j ∈ [k'] : u_j = +1 and B_{j,i} = sign(α)}| < c_i) ≤ exp(-2k'(1/4 - c_i/k')^2)    (9)
  = exp(-(1/8)k' + c_i - 2c_i^2/k')    (10)
  < exp(-(1/8)k' + c_i),    (11)

where the final inequality follows since exp(·) is an increasing function and -2c_i^2/k' < 0. From (6) and the fact that |α| ≤ 1/√s, it follows that

c_i ≤ 1/(ε√s).    (12)

Combining our hypothesis in (3) with (12) yields

-(1/8)k' + c_i = -(1/16)k + c_i ≤ -(1/16)(16/(ε√s) + 16 log(2/δ)) + 1/(ε√s) = log(δ/2).    (13)

Substituting (13) into (11) yields

P(|{j ∈ [k'] : u_j = +1 and B_{j,i} = sign(α)}| < c_i) < δ/2.    (14)
Additionally, it follows from the same argument that

P(|{k' < j ≤ k : u_j = -1 and B_{j,i} = -sign(α)}| < c_i) < δ/2.    (15)

From (14) and (15) it follows with probability at least 1 - δ that there exist sets S_+ := {j : u_j = +1 and B_{j,i} = sign(α)} and S_- := {j : u_j = -1 and B_{j,i} = -sign(α)} satisfying |S_+| = |S_-| = c_i and S_+ ∩ S_- = ∅, with the masks m and M chosen as in (16) and (17). Using the definition of g(x) in (4), we now have that

g(x) = Σ_{j ∈ S_+} σ(ε sign(α) x_i) - Σ_{j ∈ S_-} σ(-ε sign(α) x_i)    (18)
     = c_i σ(ε sign(α) x_i) - c_i σ(-ε sign(α) x_i)    (19)
     = c_i ε sign(α) x_i,    (20)

where the final equality follows from the identity σ(a) - σ(-a) = a, for all a ∈ R. This concludes the proof of (5). Lastly, by our choice of m in (16), M in (17), and (12), it follows that ||m||_0 = ||M||_0 = 2c_i ≤ 2/(ε√s) and max_{1≤j≤k} ||M_{j,:}||_0 ≤ 1, which concludes the proof.

The next step is to consider an analogue of Lemma A.2 from (Malach et al., 2020), which we provide in Lemma 2.

Lemma 2. Let s ∈ [d], w* ∈ [-1/√s, 1/√s]^d with ||w*||_0 ≤ s, and ε, δ > 0 be given. Let B ∈ {-1, +1}^{k×d} be chosen randomly from Bin({-1, +1}^{k×d}) and u ∈ {-1, +1}^k be chosen randomly from Bin({-1, +1}^k). If k ≥ s(16√s/ε + 16 log(2s/δ)), then with probability at least 1 - δ there exist masks m ∈ {0, 1}^k and M ∈ {0, 1}^{k×d} such that the function g : R^d → R defined by

g(x) = (m ⊙ u)^T σ(ε(M ⊙ B)x)    (24)

satisfies the guarantee in (25), with ||m||_0 = ||M||_0 ≤ 2s√s/ε and max_{1≤j≤k} ||M_{j,:}||_0 ≤ 1. For the proof, with k' = k/s, we use the submatrices

u^(i) := (u_{k'(i-1)+1}, ..., u_{k'i}) ∈ {-1, +1}^{k'},    (26)
m^(i) := (m_{k'(i-1)+1}, ..., m_{k'i}) ∈ {0, 1}^{k'},    (27)
B^(i) := rows k'(i-1)+1 through k'i of B, in {-1, +1}^{k'×d},    (28)
M^(i) := rows k'(i-1)+1 through k'i of M, in {0, 1}^{k'×d},    (29)

for i ∈ [s]. Note that these submatrices satisfy

u = (u^(1); ...; u^(s)), m = (m^(1); ...; m^(s)), B = (B^(1); ...; B^(s)), M = (M^(1); ...; M^(s)).    (30)

Now fix i ∈ [s] and define

g_i(x) := (m^(i) ⊙ u^(i))^T σ(ε(M^(i) ⊙ B^(i))x).    (31)

By (23), taking ε' = ε/s and δ' = δ/s yields that k' ≥ 16/(ε'√s) + 16 log(2/δ').
Hence, it follows from Lemma 1 that with probability at least 1 - δ/s there exist m^(i) ∈ {0, 1}^{k'} and M^(i) ∈ {0, 1}^{k'×d} such that

|g_i(x) - w*_i x_i| ≤ ε' = ε/s,    (32)

for every x ∈ R^d with ||x||_∞ ≤ 1, and

||m^(i)||_0 = ||M^(i)||_0 ≤ 2/(ε'√s) = 2√s/ε and max_{k'(i-1)+1 ≤ j ≤ k'i} ||M^(i)_{j,:}||_0 ≤ 1.    (33)

By the definition of g(x) in (24), using (30) yields

g(x) = (m ⊙ u)^T σ(ε(M ⊙ B)x) = Σ_{i=1}^s (m^(i) ⊙ u^(i))^T σ(ε(M^(i) ⊙ B^(i))x) = Σ_{i=1}^s g_i(x).    (34)

Hence, combining (32) for all i ∈ [s], it follows that with probability at least 1 - δ we have

|g(x) - ⟨w*, x⟩| = |Σ_{i=1}^s g_i(x) - Σ_{i=1}^s w*_i x_i| ≤ Σ_{i=1}^s |g_i(x) - w*_i x_i| ≤ ε.    (35)

Finally, it follows from (30) and (33) that ||m||_0 = ||M||_0 ≤ 2s√s/ε and max_{1≤j≤k} ||M_{j,:}||_0 ≤ 1, which concludes the proof.

We now state and prove an analogue of Lemma A.5 in (Malach et al., 2020), which is the last lemma we will need to establish the desired result.

Lemma 3. Let s ∈ [d], W* ∈ [-1/√s, 1/√s]^{n×d} with ||W*||_0 ≤ s, F : R^d → R^n defined by F_i(x) = σ(⟨w*_i, x⟩), and ε, δ > 0 be given. Let B ∈ {-1, +1}^{k×d} be chosen randomly from Bin({-1, +1}^{k×d}) and U ∈ {-1, +1}^{k×n} be chosen randomly from Bin({-1, +1}^{k×n}). If k ≥ ns(16√(ns)/ε + 16 log(2ns/δ)), then with probability at least 1 - δ there exist masks M̃ ∈ {0, 1}^{k×n} and M ∈ {0, 1}^{k×d} such that the function G : R^d → R^n defined by

G(x) = σ((M̃ ⊙ U)^T σ(ε(M ⊙ B)x))    (38)

satisfies

||G(x) - F(x)||_2 ≤ ε, for all ||x||_∞ ≤ 1.    (39)

Furthermore, ||M̃||_0 = ||M||_0 ≤ 2ns√(ns)/ε.

Proof. Assume k = ns(16√(ns)/ε + 16 log(2ns/δ)) and set k' = k/n. Note that if k > ns(16√(ns)/ε + 16 log(2ns/δ)), then excess neurons can be masked to yield the desired value for k. As in the proof of Lemma 2, we can split U, M̃, B, and M into n submatrices, denoted U^(i) ∈ {-1, +1}^{k'×n}, M̃^(i) ∈ {0, 1}^{k'×n}, B^(i) ∈ {-1, +1}^{k'×d}, and M^(i) ∈ {0, 1}^{k'×d} for i ∈ [n], such that

U = (U^(1); ...; U^(n)), M̃ = (M̃^(1); ...; M̃^(n)), B = (B^(1); ...; B^(n)), and M = (M^(1); ...; M^(n)).    (40)
To simplify notation in the following definition, we define the vectors m̃^(i) := M̃^(i)_{:,i} and ũ^(i) := U^(i)_{:,i}. Now we define the functions g_i : R^d → R by

g_i(x) = (m̃^(i) ⊙ ũ^(i))^T σ(ε(M^(i) ⊙ B^(i))x),    (41)

for each i ∈ [n]. With probability at least 1 - δ/n there exist masks m̃^(i) and M^(i) with

||m̃^(i)||_0 = ||M^(i)||_0 ≤ 2s√s/ε' = 2s√(ns)/ε    (42)

such that

|g_i(x) - ⟨w*_i, x⟩| ≤ ε/√n, for all ||x||_∞ ≤ 1.    (43)

For each i ∈ [n], note that this results in choosing the columns of the mask M̃^(i) by

M̃^(i)_{:,l} = m̃^(i) if l = i, and 0 otherwise.    (44)

Combining this choice with (40) yields

(M̃ ⊙ U)^T σ(ε(M ⊙ B)x) = (g_1(x), ..., g_n(x))^T.    (45)

By the definition of G(x) in (38), it follows from (45) that

G(x) = (σ(g_1(x)), ..., σ(g_n(x)))^T.    (46)

Combining (43) and (46), and using the fact that σ is 1-Lipschitz, we have with probability at least 1 - δ that

||G(x) - F(x)||_2^2 = Σ_{i=1}^n (σ(g_i(x)) - σ(⟨w*_i, x⟩))^2 ≤ Σ_{i=1}^n (g_i(x) - ⟨w*_i, x⟩)^2 ≤ ε^2.    (47)

Finally, it follows from (42) and (44) that ||M̃||_0 = ||M||_0 ≤ 2ns√(ns)/ε, which concludes the proof.

We are now ready to prove the main result in Theorem 2.

Theorem 2. Let ℓ, n, s ∈ N, W^(1)* ∈ [-1/√s, 1/√s]^{n×d}, {W^(i)*}_{i=2}^{ℓ-1} ∈ [-1/√n, 1/√n]^{n×n}, and W^(ℓ)* ∈ [-1/√n, 1/√n]^{1×n}. Assume that for each i ∈ [ℓ] we have ||W^(i)*||_2 ≤ 1 and max_j ||W^(i)*_j||_0 ≤ s. Define F(x) := F^(ℓ) ∘ ... ∘ F^(1)(x), where F^(i)(x) = σ(W^(i)* x) for i ∈ [ℓ-1] and F^(ℓ)(x) = W^(ℓ)* x. Fix ε, δ ∈ (0, 1). Let B^(1) ∈ {-1, +1}^{k×d} be sampled from Bin({-1, +1}^{k×d}), {B^(i)}_{i=2}^{ℓ} be sampled from Bin({-1, +1}^{k×n}), {U^(i)}_{i=1}^{ℓ-1} be sampled from Bin({-1, +1}^{k×n}), and U^(ℓ) ∈ {-1, +1}^{k×1} be sampled from Bin({-1, +1}^{k×1}). If k ≥ ns(32√(ns)/ε + 16 log(2ℓns/δ)), then with probability at least 1 - δ there exist binary masks {M^(i)}_{i=1}^{ℓ} and {M̃^(i)}_{i=1}^{ℓ} for {B^(i)}_{i=1}^{ℓ} and {U^(i)}_{i=1}^{ℓ}, respectively, such that the function G : R^d → R defined by G(x) := G^(ℓ) ∘ ... ∘ G^(1)(x), where

G^(i)(x) := σ((M̃^(i) ⊙ U^(i))^T σ(ε(M^(i) ⊙ B^(i))x)), for i ∈ [ℓ-1],    (51)
G^(ℓ)(x) := (M̃^(ℓ) ⊙ U^(ℓ))^T σ(ε(M^(ℓ) ⊙ B^(ℓ))x),    (52)

satisfies

|G(x) - F(x)| ≤ ε, for all ||x||_2 ≤ 1.    (53)
Additionally, Σ_{i=1}^ℓ ||M̃^(i)||_0 = Σ_{i=1}^ℓ ||M^(i)||_0 ≤ 4ℓns√(ns)/ε.

Proof. Let i ∈ [ℓ]. Using Lemma 3 with ε' = ε/2 and δ' = δ/ℓ, with probability at least 1 - δ/ℓ there exist M^(i) and M̃^(i) such that

||G^(i)(x) - F^(i)(x)||_2 ≤ ε/2, for all ||x||_∞ ≤ 1,    (54)

and

||M̃^(i)||_0 = ||M^(i)||_0 ≤ 2ns√(ns)/ε' = 4ns√(ns)/ε.    (55)

The remainder of the proof follows from applying the same argument as in the proof of Theorem A.6 from (Malach et al., 2020).

Training an over-parameterized model is often not necessary to obtain an efficient final model, and the network architecture itself is more important than the weights that remain after pruning pretrained networks. These findings have revived interest in approaches for finding sparse and trainable subnetworks. For example, (Lee et al., 2018; Wang et al., 2020b; You et al., 2019; Wang et al., 2020a) explored efficient approaches to search for such subnetworks. Along this line of work, a striking finding was reported by (Zhou et al., 2019; Ramanujan et al., 2020), showing that randomly-initialized neural networks contain sparse subnetworks that achieve good performance without any training. (Malach et al., 2020; Pensia et al., 2020) provided theoretical evidence for this phenomenon and showed that one can approximate any target neural network by pruning a sufficiently over-parameterized network with random weights.

F.2 BINARIZATION

Similar to pruning, we categorize binarization methods based on whether a model is binarized after training or during training (see (Qin et al., 2020a) for a comprehensive review). Post-Training Binarization. To the best of our knowledge, none of the post-training schemes have succeeded in binarizing pretrained models, with or without retraining, while achieving reasonable test accuracy. Most existing works (Han et al., 2015; Zhou et al., 2017) are limited to ternary weight quantization. Training-Aware Binarization. There have been several efforts to improve the performance of BNN training. This is a challenging problem, as binarization introduces discontinuities that make differentiation during backpropagation difficult. BinaryConnect (Courbariaux et al., 2015) established how to train networks with binary weights within the familiar back-propagation paradigm. BinaryNet (Courbariaux et al., 2016) further quantizes both the weights and the activations to 1-bit values. Unfortunately, these early schemes resulted in a staggering drop in accuracy compared to their full-precision counterparts. In an attempt to improve performance, XNOR-Net (Rastegari et al., 2016) proposed adding a real-valued channel-wise scaling factor. DoReFa-Net (Zhou et al., 2016) extends XNOR-Net to accelerate the training process using quantized gradients. ABC-Net (Lin et al., 2017) improved performance by using more weight bases and activation bases at the cost of increased memory and computation. There have also been efforts to modify network architectures to make them more amenable to binary neural network training. For example, Bi-Real Net (Liu et al., 2018a) added layer-wise identity shortcuts, and AutoBNN (Shen et al., 2020) proposed to widen or squeeze the channels in an automatic manner. (Han et al., 2020) proposed to learn to binarize neurons with noisy supervision.
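To make the channel-wise scaling idea concrete, the following is a minimal sketch (not the code of any of the cited works) of XNOR-Net-style weight binarization: each channel's weights are replaced by sign(w) times a per-channel scale alpha = mean(|w|), which minimizes ||w - alpha * sign(w)||^2. The function names are illustrative.

```python
def binarize_channel(weights):
    """Binarize one channel's weight vector; returns (alpha, signs).

    alpha = mean absolute weight, the least-squares optimal scale for
    approximating the real weights by alpha * sign(w)."""
    alpha = sum(abs(w) for w in weights) / len(weights)
    signs = [1.0 if w >= 0 else -1.0 for w in weights]
    return alpha, signs

def binarize_layer(weight_matrix):
    """Apply channel-wise binarization to a list of per-channel weight vectors."""
    return [binarize_channel(channel) for channel in weight_matrix]
```

At inference time, the binary signs enable XNOR/bit-count arithmetic while the single scalar alpha per channel is applied afterward in full precision.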
Efforts have also been made to design gradient estimators that extend the straight-through estimator (STE) (Bengio et al., 2013) for more accurate gradient back-propagation. DSQ (Gong et al., 2019) used differentiable soft quantization to obtain accurate gradients in the backward pass. On the other hand, PCNN (Gu et al., 2019) proposed a new discrete back-propagation via projection algorithm to build BNNs.
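For readers unfamiliar with the STE mentioned above, a minimal sketch follows (illustrative, not taken from any cited implementation): the forward pass applies the non-differentiable sign function, while the backward pass approximates its gradient by the identity, optionally zeroed outside |x| <= 1 as in BinaryNet's clipped variant.

```python
def ste_forward(x):
    """Forward pass: binarize with the sign function (sign(0) taken as +1)."""
    return 1.0 if x >= 0 else -1.0

def ste_backward(x, upstream_grad, clip=True):
    """Backward pass: d sign(x)/dx is 0 almost everywhere, so the STE
    pretends it is 1, passing the upstream gradient through unchanged
    (zeroed outside the clip range when clip=True)."""
    if clip and abs(x) > 1.0:
        return 0.0
    return upstream_grad
```

The clipped variant prevents gradients from flowing through pre-activations whose magnitude is already saturated, which empirically stabilizes BNN training.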

F.3 OTHER RELATED DIRECTIONS

Gaier & Ha (2019) proposed a search method for neural network architectures that can already perform a task without any explicit weight training, i.e., every weight in the network shares the same value. Recent work on randomly wired neural networks (Xie et al., 2019) showed that constructing neural networks with random graph algorithms often outperforms manually engineered architectures. As opposed to the fixed wirings in (Xie et al., 2019), (Wortsman et al., 2019) learned the network parameters as well as the structure. This shows that finding a good architecture is akin to finding a sparse subnetwork of the complete graph.



A detailed discussion on related work on pruning and quantization is provided in Appendix F. Although our results are derived under certain assumptions (e.g., fully-connected, ReLU neural network approximated by a subnetwork with binary weights), our algorithm is not restricted by these assumptions. A comparison of MPT-1/32 found using biprop and edgepopup is provided in Appendix E, which demonstrates that biprop outperforms edgepopup.



Figure 1: Multi-Prize Ticket Performance: Multi-prize tickets, obtained only by pruning and binarizing random networks, outperform trained full-precision and SOTA binary weight networks.

Building upon edge-popup (Ramanujan et al., 2020), we implement Algorithm 1 to identify MPTs.

3.1 WHERE CAN WE EXPECT TO FIND MULTI-PRIZE TICKETS?

In this section, we empirically test the effect of overparameterization on the performance of MPTs. We overparameterize networks by making them (a) deeper (Sec. 3.1.1) and (b) wider (Sec. 3.1.2).
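The mask-selection step that edge-popup-style methods rely on can be sketched as follows. This is an illustrative sketch, not the paper's Algorithm 1: each weight carries a learned score, and on the forward pass only the highest-scoring fraction (1 - prune_rate) of weights in a layer is kept. All names are assumptions for illustration.

```python
def topk_mask(scores, prune_rate):
    """Return a 0/1 mask keeping the highest-scoring fraction of weights.

    scores: per-weight importance scores (assumed distinct; ties would
    keep slightly more weights than requested).
    prune_rate: fraction of weights to remove, in [0, 1)."""
    n_keep = max(1, round(len(scores) * (1.0 - prune_rate)))
    threshold = sorted(scores, reverse=True)[n_keep - 1]
    return [1 if s >= threshold else 0 for s in scores]
```

During training, the scores (not the weights) are updated by gradient descent, and the mask is recomputed each forward pass; the underlying random weights are never changed.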

Figure 2: Effect of Varying Depth and Pruning Rate: Comparing the Top-1 accuracy of small and binary MPTs to a large, full-precision, and weight-optimized network on CIFAR-10.

Figure 3: Effect of Varying Width on MPT-1/32: Comparing the Top-1 accuracy of sparse and binary MPT-1/32 to dense, full-precision, and weight-optimized network on CIFAR-10.

Our biprop framework enjoys certain advantages over traditional weight optimization. First, contemporary experience suggests that sparse BNN training from scratch is challenging. Both sparsity and binarization bring their own challenges for gradient-based weight training: getting stuck at bad local minima in the sparse regime, incompatibility with backpropagation due to the discontinuity of the activation function, etc. Although we used gradient-based approaches in this paper, biprop is flexible enough to accommodate different classes of algorithms that might avoid the pitfalls of gradient-based weight training.

Here S_+ := {j : u_j = +1 and B_{j,i} = sign(α)} and S_- := {j : u_j = -1 and B_{j,i} = -sign(α)} satisfy |S_+| = |S_-| = c_i and S_+ ∩ S_- = ∅. Using these sets, we define the components of the masks m and M by

m_j = 1 if j ∈ S_+ ∪ S_-, and 0 otherwise,    (16)
M_{j,l} = 1 if j ∈ S_+ ∪ S_- and l = i, and 0 otherwise.    (17)

satisfies

|g(x) - ⟨w*, x⟩| ≤ ε, for all ||x||_∞ ≤ 1.    (25)

Furthermore, ||m||_0 = ||M||_0 ≤ 2s√s/ε and max_{1≤j≤k} ||M_{j,:}||_0 ≤ 1.

Proof. Assume k = s(16√s/ε + 16 log(2s/δ)) and set k' = k/s. Note that if k > s(16√s/ε + 16 log(2s/δ)), then the excess neurons can be masked, yielding the desired value for k. We decompose u, m, B, and M into s equal-sized submatrices by defining

Now let I := {i ∈ [d] : w*_i ≠ 0}. By our hypothesis ||w*||_0 ≤ s, it follows that |I| ≤ s. Without loss of generality, assume that I ⊆ [s]. Now fix i ∈ [s] and define g_i : R^d → R by

for each i ∈ [n]. Taking ε' = ε/√n and δ' = δ/n, it follows from (37) that k' ≥ s(16√s/ε' + 16 log(2s/δ')). As the hypotheses of Lemma 2 are satisfied, with probability at least 1 - δ/n there exist masks m̃^(i) and M^(i)





Hyperparameter Configurations for CIFAR-10 Experiments

A.2 HYPERPARAMETERS FOR SECTION 3.2

Hyperparameter Configurations for CIFAR-10 Experiments

ACKNOWLEDGEMENTS

The authors would like to thank Shreya Chaganti for her valuable contributions to the biprop open source code development and for her help on training MPT models for the final version of the paper. This work was performed under the auspices of the U.S. Department of Energy by the Lawrence Livermore National Laboratory under Contract No. DE-AC52-07NA27344, Lawrence Livermore National Security, LLC. This document was prepared as an account of the work sponsored by an agency of the United States Government. Neither the United States Government nor Lawrence Livermore National Security, LLC, nor any of their employees makes any warranty, expressed or implied, or assumes any legal liability or responsibility for the accuracy, completeness, or usefulness of any information, apparatus, product, or process disclosed, or represents that its use would not infringe privately owned rights. Reference herein to any specific commercial product, process, or service by trade name, trademark, manufacturer, or otherwise does not necessarily constitute or imply its endorsement, recommendation, or favoring by the United States Government or Lawrence Livermore National Security, LLC. The views and opinions of the authors expressed herein do not necessarily state or reflect those of the United States Government or Lawrence Livermore National Security, LLC, and shall not be used for advertising or product endorsement purposes. This work was supported by LLNL Laboratory Directed Research and Development project 20-ER-014 and released with LLNL tracking number LLNL-CONF-815432.

C MOTIVATION FOR FRAMEWORK TO IDENTIFY MPTS

Suppose that f(x; W*) with optimized weights W* is a target network that we wish to approximate. Let g(x; W) denote the network in which we want to identify a MPT-1/32 that is an ε-approximation of f(x; W*), for some ε > 0. Now assume that g(x; ·) is Lipschitz continuous with constant κ, B ∈ {-1, +1}^m are binary parameters for g, and α ∈ R is the gain term. It follows that

||g(x; α(M ⊙ B)) - f(x; W*)|| ≤ ||g(x; α(M ⊙ B)) - g(x; M ⊙ W)|| + ||g(x; M ⊙ W) - f(x; W*)||
                              ≤ κ ||(M ⊙ W) - α(M ⊙ B)|| + ||g(x; M ⊙ W) - f(x; W*)||.    (56)

If we take M to be a fixed binary mask, we can minimize the error of binarizing the subnetwork parameters M ⊙ W by solving the optimization problem

min_{α, B} ||(M ⊙ W) - α(M ⊙ B)||^2,    (57)

where M, W, and B are stacked into vectors of some length, say n. As the pruning mask M is applied to both W and B, solving problem (57) is equivalent to solving problem (2) in (Rastegari et al., 2016) with a different dimension. Hence, it immediately follows that one closed-form solution for B in problem (57) is

B* = sign(W).    (58)

Taking the derivative of the cost function in (57) with respect to α and setting it equal to zero yields

α = ((M ⊙ B)^T (M ⊙ W)) / ((M ⊙ B)^T (M ⊙ B)).    (59)

Recalling that M ∈ {0, 1}^n and using (58), we have

(M ⊙ sign(W))^T (M ⊙ W) = ||M ⊙ W||_1    (60)

and

(M ⊙ sign(W))^T (M ⊙ sign(W)) = ||M||_0.    (61)

Substituting (60) and (61) into (59) and solving for α yields the closed-form solution

α* = ||M ⊙ W||_1 / ||M||_0.    (62)

Hence, α* and B* minimize the right-hand side of (56) and, consequently, reduce the approximation error of the MPT-1/32. So when the binarization error, ||(M ⊙ W) - α(M ⊙ sign(W))||, and the subnetwork error, ||g(x; M ⊙ W) - f(x; W*)||, are sufficiently small, the binarized subnetwork g(x; α(M ⊙ sign(W))) serves as a good approximation to the target network. These closed-form expressions for the gain term and the binarized weights are the updates used for the gain term and binary subnetwork weights in biprop after updating the binary pruning mask.
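The closed-form update derived above can be sketched in a few lines. This is an illustrative sketch under the assumption that the mask and weights are flattened into equal-length lists; the function name is ours, not biprop's:

```python
def gain_and_binary_weights(mask, weights):
    """Closed-form binarization of a pruned weight vector.

    Returns (alpha, binary) where binary = sign(W) and
    alpha = ||M * W||_1 / ||M||_0, i.e., the mean absolute value
    of the weights that survive the pruning mask."""
    assert len(mask) == len(weights)
    kept = [abs(w) for m, w in zip(mask, weights) if m == 1]
    alpha = sum(kept) / len(kept)  # ||M ⊙ W||_1 / ||M||_0
    binary = [1.0 if w >= 0 else -1.0 for w in weights]
    return alpha, binary
```

Note that only the unpruned entries influence alpha, since the mask multiplies both the real and the binarized weights in the objective.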

D COMPARISON OF MPTS WITH BINARY NEURAL NETWORK SOTA

Here we provide a more exhaustive comparison of MPT-1/32 and MPT-1/1 on CIFAR-10 and ImageNet to SOTA methods: BinaryConnect (Courbariaux et al., 2015), BNN (Courbariaux et al., 2016), DoReFa-Net (Zhou et al., 2016), LQ-Nets (Zhang et al., 2018), BWN and XNOR-Net (Rastegari et al., 2016), ABC-Net (Lin et al., 2017), IR-Net (Qin et al., 2020b), LAB (Hou et al., 2016), ProxQuant (Bai et al., 2018), DSQ (Gong et al., 2019), and BBG (Shen et al., 2020). Results for CIFAR-10 can be found in Tables 8 and 9 and results for ImageNet can be found in Tables 10 and 11. Next to the MPT method we include the percentage of weights pruned and the layer width multiplier (if larger than 1) in parentheses.

Note that the binarization step of biprop can be avoided while finding MPT-1/32 by initializing (and pruning) our backbone neural network with a binary initialization (e.g., edgepopup with the Signed Constant initialization (Ramanujan et al., 2020)). In this specific instance, biprop boils down to edgepopup with proper scaling. Next, we compare the performance of MPT-1/32 networks identified using these two approaches. Both networks presented below use the same hyperparameter configurations and are trained for 250 epochs on the CIFAR-10 dataset. We initialize the networks identified with edgepopup using the Signed Constant initialization, as it yielded their best performance; MPT-1/32 networks identified using biprop are initialized using the Kaiming Normal initialization. We plot the average over three experiments for each pruning percentage, with bars extending to the minimum and maximum accuracy. Additionally, for each network we include the Top-1 accuracy of a dense model with learned weights. These plots can be found in Figure 5. We find that MPT-1/32 networks identified with biprop outperform networks identified using edgepopup. This highlights the benefit of binarization (in conjunction with pruning) as a learning strategy.
Post-Training Pruning. Traditional pruning methods leverage a three-stage pipeline: pretraining (a large model), pruning, and fine-tuning. The main distinction among these approaches lies in the criteria used for pruning. One of the most popular approaches is magnitude-based pruning, where weights with magnitude below a certain threshold are discarded (Hagiwara, 1993). Further, a penalty term (e.g., l1, l2, or lasso weight regularization) can be used during training to encourage a model to learn smaller-magnitude weights, which are then removed post-training (Weigend et al., 1991). Models can also be pruned by measuring the importance of weights via the sensitivity of the loss function when weights are removed, pruning those that cause the smallest change in the loss (LeCun et al., 1990).

Pruning Before Training. Thus far, we have discussed methods for pruning pretrained DNNs. Recently, (Frankle & Carbin, 2019) proposed the Lottery Ticket Hypothesis and showed that randomly-initialized neural networks contain sparse subnetworks that can be effectively trained from scratch when reset to their initialization. Further, (Liu et al., 2018b) showed that training an over-parameterized model is often not necessary to obtain an efficient final model.
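The magnitude-based criterion described above amounts to a one-line thresholding rule; the following minimal sketch (illustrative, not from any cited work) zeroes out weights whose absolute value falls below a threshold:

```python
def magnitude_prune(weights, threshold):
    """Zero out every weight whose absolute value is below the threshold,
    leaving the surviving weights unchanged."""
    return [w if abs(w) >= threshold else 0.0 for w in weights]
```

In the three-stage pipeline, this step is followed by fine-tuning the surviving weights to recover any accuracy lost to pruning.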

