HOW ERDÖS AND RÉNYI WIN THE LOTTERY

Abstract

Random masks define surprisingly effective sparse neural network models, as has been shown empirically. The resulting Erdös-Rényi (ER) random graphs can often compete with dense architectures, and even state-of-the-art lottery ticket pruning algorithms struggle to outperform them, although the random baselines do not rely on computationally expensive pruning-training iterations and can be drawn initially without significant computational overhead. We offer a theoretical explanation of how such ER masks can approximate arbitrary target networks if they are wider by a factor of $1/\log(1/\text{sparsity})$. While we are the first to show theoretically and experimentally that random ER source networks contain strong lottery tickets, we also prove the existence of weak lottery tickets that require a lower degree of overparametrization than strong lottery tickets. These results are based on the observation that ER masks are well trainable in practice, which we verify in experiments with varied choices of random masks. Some of these data-free choices outperform previously proposed random approaches on standard image classification benchmark datasets.

1. INTRODUCTION

The impressive breakthroughs achieved by deep learning have largely been attributed to the extensive overparametrization of deep neural networks, as it seems to have multiple benefits for their representational power and optimization (Belkin et al., 2019). The resulting trend towards ever larger models and datasets, however, imposes increasing computational and energy costs that are difficult to meet. This raises the question: Is this high degree of overparameterization truly necessary? Training general small-scale or sparse deep neural network architectures from scratch remains a challenge for standard initialization schemes (Li et al., 2016; Han et al., 2015). However, Frankle & Carbin (2019) have recently demonstrated that there exist sparse architectures that can be trained to solve standard benchmark problems competitively. According to their Lottery Ticket Hypothesis (LTH), dense randomly initialized networks contain subnetworks that can be trained in isolation to a test accuracy that is comparable to that of the original dense network. Such subnetworks, the lottery tickets (LTs), have since been identified as weak lottery tickets (WLTs) by pruning algorithms that require computationally expensive pruning-retraining iterations (Frankle & Carbin, 2019; Tanaka et al., 2020) or mask learning procedures (Savarese et al., 2020; Sreenivasan et al., 2022b). While these can lead to computational gains at training and inference time and reduce memory requirements (Hassibi et al., 1993; Han et al., 2015), the real goal remains to identify good trainable architectures before training, as this could lead to significant computational savings. Yet, contemporary pruning-at-initialization approaches (Lee et al., 2018; Wang et al., 2020; Tanaka et al., 2020; Fischer & Burkholz, 2022; Frankle et al., 2021) achieve less competitive performance.
For that reason, it is all the more remarkable that even iterative state-of-the-art approaches struggle to outperform a simple, computationally cheap, and data-independent alternative: random pruning at initialization (Su et al., 2020). Liu et al. (2021) have provided systematic experimental evidence for its 'unreasonable' effectiveness in multiple settings, including complex, large-scale architectures and data. We explain theoretically why random masks can be effective by proving that randomly masked networks, so-called Erdös-Rényi (ER) networks, contain lottery tickets under realistic conditions. Our results imply that sparse ER networks are highly expressive and have the universal function approximation property like dense networks. This insight also provides a missing piece in the theoretical foundation for dynamic sparse training approaches (Evci et al., 2020a; Liu et al., 2021; Bellec et al., 2018) that start pruning from a random ER network instead of a dense one. The main underlying idea could also be exploited in different sparsification settings to save computational resources. To the best of our knowledge, we are the first to utilize it in the search for strong lottery tickets (SLTs). Most theoretical results pertaining to LTs focus on the existence of such SLTs (Malach et al., 2020; Pensia et al., 2020; Fischer et al., 2021; da Cunha et al., 2022; Burkholz, 2022a;b). These are subnetworks of large, randomly initialized source networks, which do not require any further training after pruning. Ramanujan et al. (2020) have provided experimental evidence for their existence and suggested that training neural networks could be achieved by pruning alone. By modifying their proposed algorithm, edge-popup (EP), we show experimentally that SLTs are contained in randomly masked ER source networks. We also prove this existence rigorously by transferring the construction of Burkholz (2022a) to random networks.
This introduces an additional factor $1/\log(1/\text{sparsity})$ in the lower bound on the width of the source network that guarantees existence with high probability. In contrast to previous works on SLTs, we also prove the existence of weak LTs. Since every strong LT is also a weak LT, formally, the theory for strong LTs also covers the existence of weak LTs. However, experiments and theoretical derivations suggest that limiting ourselves to SLTs leads to LTs with lower sparsity than what can be achieved by WLT algorithms (Fischer & Burkholz, 2022). In line with this observation, we derive improved existence results for WLTs. However, these cannot overcome the overparameterization factor $1/\log(1/\text{sparsity})$, which is required even under ideal conditions, as we argue by deriving a lower bound on the required width. Our strategy relies on a property of ER networks that is crucial for their effectiveness: they are well trainable with standard initialization approaches. With various experiments on benchmark image data and commonly used neural network architectures, we verify the validity of this assumption, complementing the experiments by Liu et al. (2021) with different choices of sparsity ratios. This demonstrates that multiple choices can lead to competitive results. Some of these choices outperform previously proposed random masks (Liu et al., 2021), highlighting the potential for tuning sparsity ratios in applications.

Contributions. In summary, our contributions are as follows: 1) We show theoretically and empirically that ER random networks contain LTs with high probability if the ER source network is wider than a target by a factor of $1/\log(1/\text{sparsity})$. 2) We prove the existence of strong as well as weak LTs in random ER source networks. 3) In support of our theory, we verify in experiments that ER networks are well trainable with standard initialization schemes for various choices of layerwise sparsity ratios.
4) We propose two data-independent, flow-preserving, and computationally cheap approaches to draw random ER masks with layerwise sparsity ratios, termed balanced and pyramidal. These can outperform previously proposed choices on standard architectures, highlighting the potential benefits of tuning ER sparsity ratios. 5) Our theory explains why ER networks are likely not competitive at extreme sparsities. This can, however, be remedied by targeted rewiring of random edges as proposed by dynamic sparse training, for which we provide theoretical support with our analysis.

1.1. RELATED WORK

Algorithms to prune neural networks can be broadly categorized into two groups: pruning after training and pruning before or during training. Algorithms of the first group, which prune after training, are effective in speeding up inference, but they still rely on a computationally expensive training procedure (Hassibi et al., 1993; LeCun et al., 1989; Molchanov et al., 2016; Dong et al., 2017; Yu et al., 2022). Algorithms of the second group prune at initialization (Lee et al., 2018; Wang et al., 2020; Tanaka et al., 2020; Sreenivasan et al., 2022b) or follow a computationally expensive cycle of pruning and retraining for multiple iterations (Gale et al., 2019; Savarese et al., 2020; You et al., 2019; Frankle & Carbin, 2019; Renda et al., 2019; Dettmers & Zettlemoyer, 2019). These methods find trainable subnetworks, i.e., WLTs (Frankle & Carbin, 2019). Single-shot pruning approaches are computationally cheaper but are susceptible to problems like layer collapse, which renders the pruned network untrainable (Lee et al., 2018; Wang et al., 2020). Tanaka et al. (2020) address this issue by preserving flow in the network through their scoring mechanism. The best performing WLTs are still obtained by expensive iterative pruning methods like Iterative Magnitude Pruning (IMP) and Iterative Synflow (Frankle & Carbin, 2019; Fischer & Burkholz, 2022), or by training mask parameters of dense networks (Sreenivasan et al., 2022b; Savarese et al., 2020). However, Su et al. (2020) found that ER masks can outperform expensive iterative pruning strategies in different situations. Inspired by this finding, Golubeva et al. (2021) and Chang et al. (2021) have hypothesized that sparse overparameterized networks are more effective than smaller dense networks with the same number of parameters. Liu et al.
(2021) have further demonstrated the competitiveness of ER masks for two data- and pruning-free choices of layerwise sparsity ratios across a wide range of neural network architectures and datasets, including complex ones. We show how and when this effectiveness is reasonable. Complementing the experiments by Liu et al. (2021), we highlight that ER masks are competitive for various choices of layerwise sparsity ratios. In addition, we build on the theory for SLTs, which proves SLT existence if the width of the randomly initialized source network (Malach et al., 2020; Pensia et al., 2020; Orseau et al., 2020; Fischer et al., 2021; Burkholz et al., 2022; Burkholz, 2022b; Ferbach et al., 2022) exceeds a value that is proportional to the width of a target network. This theory has been inspired by experimental evidence for SLTs (Ramanujan et al., 2020; Zhou et al., 2019; Diffenderfer & Kailkhura, 2021; Sreenivasan et al., 2022a). The underlying algorithm, edge-popup (Ramanujan et al., 2020), finds SLTs by training scores for each parameter of the dense source network and is thus computationally as expensive as dense training. We show that smaller ER-masked source networks can be trained instead, as they also contain SLTs, which avoids training a complete network. However, pruning for SLTs does not seem to achieve as high sparsity ratios as WLT pruning algorithms (Fischer & Burkholz, 2022), which is also reflected in our existence proofs for SLTs and WLTs. Remarkably, even WLTs require that the ER source network and the resulting LTs are overparameterized relative to the target network, as we show by providing a lower bound on the required width. Our theory suggests that random ER networks face a fundamental limitation at extreme sparsities, as the overparameterization factor scales in this regime as $1/\log(1/\text{sparsity}) \approx 1/(1 - \text{sparsity})$.
This shortcoming could potentially be addressed by targeted rewiring of random edges with Dynamic Sparse Training (DST), which starts pruning from an ER network (Liu et al., 2021; Evci et al., 2020a;b).

2. ERDÖS-RÉNYI NETWORKS AS LOTTERY TICKETS

Our main contribution is to prove the existence of strong and weak lottery tickets in ER networks. In comparison with previous existence results for complete source networks, we require a network width that is larger by a factor of $1/\log(1/\text{sparsity})$. To formalize our claims, we first introduce our notation.

Background, Notation, and Proof Setup

Let $x = (x_1, x_2, \dots, x_d) \in [a_1, b_1]^d$ be a bounded $d$-dimensional input vector, where $a_1, b_1 \in \mathbb{R}$ with $a_1 < b_1$. $f: [a_1, b_1]^d \to \mathbb{R}^{n_L}$ is a fully-connected feed-forward neural network with architecture $(n_0, n_1, \dots, n_L)$, i.e., depth $L$ and $n_l$ neurons in layer $l$. Every layer $l \in \{1, 2, \dots, L\}$ computes the neuron states $x^{(l)} = \phi(h^{(l)})$, $h^{(l)} = W^{(l-1)} x^{(l-1)} + b^{(l-1)}$. $h^{(l)}$ is called the pre-activation, $W^{(l)} \in \mathbb{R}^{n_l \times n_{l-1}}$ is the weight matrix, and $b^{(l)}$ is the bias vector. We also write $f(x; \theta)$ to emphasize the dependence of the neural network on its parameters $\theta = (W^{(l)}, b^{(l)})_{l=1}^{L}$. For simplicity, we restrict ourselves to the common ReLU activation function $\phi(x) = \max\{x, 0\}$, but most of our results can easily be extended to more general activation functions as in (Burkholz, 2022b;a). In addition to fully-connected layers, we also consider convolutional layers. For a convenient notation, without loss of generality, we flatten the weight tensors so that $W^{(l)} \in \mathbb{R}^{c_l \times c_{l-1} \times k_l}$, where $c_l$, $c_{l-1}$, $k_l$ are the output channels, input channels, and filter dimension, respectively. For instance, a 2-dimensional convolution on image data would result in $k_l = k'_{1,l} k'_{2,l}$, where $k'_{1,l}, k'_{2,l}$ define the filter size. We distinguish two kinds of neural networks, a target network $f_T$ and a source network $f_S$. $f_T$ is approximated by a lottery ticket (LT) that is obtained by masking the parameters of $f_S$ and, in the case of a weak LT, learning the parameters. We assume that $f_T$ has depth $L$ and parameters $W_T$. Note that $l$ ranges from $0$ to $L$ for the source network, while it only ranges from $1$ to $L$ for the target network. The extra source network layer $l = 0$ accounts for an extra layer that we need in our LT construction.

ER networks. Instead of a complete source network, we will consider random Erdös-Rényi (ER) networks $f_{ER} \in ER(\mathbf{p})$ with layerwise sparsity ratios $p_l$.
$f_{ER}$ can be defined as a subnetwork of a complete source network using a binary mask $S^{(l)}_{ER} \in \{0,1\}^{n_l \times n_{l-1}}$ or $S^{(l)}_{ER} \in \{0,1\}^{n_l \times n_{l-1} \times k_l}$ for every layer. The mask entries are drawn from independent Bernoulli distributions with layerwise success probability $p_l > 0$, i.e., $s^{(l)}_{ij,ER} \sim \mathrm{Ber}(p_l)$. The random pruning is performed initially with negligible computational overhead, and the mask stays fixed during training. Note that $p_l$ is also the expected density of that layer. The overall expected density of the network is given as $p = \sum_l m_l p_l / \sum_k m_k = 1 - \text{sparsity}$, where $m_l$ denotes the number of parameters in layer $l$. In case of uniform $p_l = p$, we also write $ER(p)$ instead of $ER(\mathbf{p})$. An ER neural network is defined as $f_{ER} = f_S(x; W \circ S_{ER})$. We will also call $f_{ER} \in ER(\mathbf{p})$ the source network, as we might need to prune additional parameters during a LT construction. Here $\mathbf{p}$ denotes the vector of layerwise success probabilities. This pruning defines another mask $S_{LT}$, which corresponds to a subnetwork of $S_{ER}$, i.e., a zero entry $s_{ij,ER} = 0$ implies $s_{ij,LT} = 0$, but the converse is not true. We skip the subscripts LT or ER if the nature of the mask is clear from the context.
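Drawing such a mask is a one-shot sampling operation. A minimal NumPy sketch (function names are ours, purely illustrative) that samples layerwise Bernoulli masks and reports the resulting overall density:

```python
import numpy as np

def sample_er_masks(shapes, densities, seed=0):
    """Draw a fixed binary ER mask per layer: each entry is kept
    independently with the layerwise probability p_l."""
    rng = np.random.default_rng(seed)
    return [rng.random(s) < p for s, p in zip(shapes, densities)]

def overall_density(masks):
    """Empirical overall density p = (total nonzeros) / (total parameters)."""
    return sum(int(m.sum()) for m in masks) / sum(m.size for m in masks)

# Hypothetical 3-layer network with layerwise densities p_l.
masks = sample_er_masks([(64, 32), (128, 64), (10, 128)], [0.5, 0.3, 0.8])
print(f"overall density ~ {overall_density(masks):.2f}")
```

The mask is drawn once at initialization and stays fixed during training, which is why the computational overhead is negligible compared to iterative pruning.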

2.1. EXISTENCE OF STRONG LOTTERY TICKETS

Most strong lottery ticket (SLT) existence proofs explicitly construct a LT that approximates any target network of a given width and depth. The LT is defined by a mask that encodes the result of pruning a randomly initialized source network. Most proofs that derive a logarithmic lower bound on the overparametrization factor (i.e., the factor by which the source network is supposed to be wider than the target network) (Pensia et al., 2020; Burkholz et al., 2022; Burkholz, 2022a; da Cunha et al., 2022; Burkholz, 2022b; Ferbach et al., 2022) solve multiple subset sum approximation problems (Lueker, 1998). For every target parameter $z$, they identify some random parameters $X_1, \dots, X_n$ of the source network that can be masked or not in order to approximate $z$. In case of an ER source network, a fraction $1 - p$ of the connections is missing in comparison with a dense source network. These missing connections also reduce the number of available source parameters $X_1, \dots, X_n$. To take this into account, we modify the corresponding subset sum approximations according to the following lemma.

Lemma 2.1 (Subset sum approximation in ER networks). Let $X_1, \dots, X_n$ be independent, uniformly distributed random variables with $X_i \sim U([-1, 1])$, and let $M_1, \dots, M_n$ be independent, Bernoulli distributed random variables with $M_i \sim \mathrm{Ber}(p)$ for a $p > 0$. Let $\epsilon, \delta \in (0, 1)$ be given. Then for any $z \in [-1, 1]$ there exists a subset $I \subset [n]$ so that with probability at least $1 - \delta$ we have $|z - \sum_{i \in I} M_i X_i| \le \epsilon$ if

$$n \ge C \frac{1}{\log(1/(1-p))} \log\left(\frac{1}{\min(\delta, \epsilon)}\right).$$

The proof is given in Appendix A.2 and utilizes the original subset sum approximation result for random subsets of the base set $X_1, \dots, X_n$. In addition, it solves the challenge of combining the involved constants while respecting the probability distribution of the random subsets.
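Lemma 2.1 can be illustrated numerically: draw the $X_i$ and Bernoulli masks $M_i$, then search for the subset of surviving values whose sum best approximates a target $z$. A small brute-force sketch (purely illustrative, not part of the proof):

```python
import itertools
import random

def best_subset_error(values, z):
    """Brute-force the subset I of `values` minimizing |z - sum_{i in I} v_i|."""
    best = abs(z)  # error of the empty subset
    for r in range(1, len(values) + 1):
        for comb in itertools.combinations(values, r):
            best = min(best, abs(z - sum(comb)))
    return best

# Source parameters X_i ~ U([-1, 1]); the Bernoulli mask M_i ~ Ber(p)
# removes a fraction 1 - p of them before the subset search.
random.seed(0)
n, p, z = 30, 0.6, 0.37
xs = [random.uniform(-1, 1) for _ in range(n)]
survivors = [x for x in xs if random.random() < p][:16]  # cap for runtime
err = best_subset_error(survivors, z)
print(f"{len(survivors)} survivors, best error {err:.2e}")
```

Even after masking, a modest number of survivors typically suffices for a very accurate approximation, which is the content of the lemma: the ER mask only inflates the required base set by the factor $1/\log(1/(1-p))$.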
For simplicity, we have formulated it for uniform random variables and target parameters $z \in [-1, 1]$, but it could easily be extended to random variables that contain a uniform distribution (like normal distributions) and generally bounded targets as in Corollary 7 of (Burkholz et al., 2022). In comparison with the original subset sum approximation result, we need a base set that is larger by a factor $\frac{1}{\log(1/(1-p))}$. This is exactly the factor by which we can modify contemporary SLT existence results to transfer them to ER source networks. By replacing the subset sum approximation construction with Lemma 2.1, we can thus show SLT existence for fully-connected (Pensia et al., 2020; Burkholz, 2022b), convolutional (Burkholz et al., 2022; Burkholz, 2022a; da Cunha et al., 2022), and residual ER networks (Burkholz, 2022a) or random GNNs (Ferbach et al., 2022). To give an example of the effective use of this lemma and discuss the general transfer strategy, we explicitly extend the SLT existence results by Burkholz (2022b) for fully-connected networks to ER source networks. We thus show that pruning a random source network of depth $L+1$ with widths larger than a logarithmic factor can approximate any target network of depth $L$ with a given probability $1 - \delta$.

Theorem 2.2 (Existence of SLTs in ER networks). Let $\epsilon, \delta \in (0, 1)$, a target network $f_T$ of depth $L$, and an $ER(\mathbf{p})$ source network $f_S$ of depth $L+1$ with edge probabilities $p_l$ in each layer $l$ and iid initial parameters $\theta$ with $w^{(l)}_{ij} \sim U([-1, 1])$, $b^{(l)}_i \sim U([-1, 1])$ be given. Then with probability at least $1 - \delta$, there exists a mask $S_{LT}$ so that each target output component $i$ is approximated as $\max_{x \in D} \|f_{T,i}(x) - f_{S,i}(x; W_S \circ S_{LT})\| \le \epsilon$ if

$$n_{S,l} \ge C \frac{n_{T,l}}{\log(1/(1-p_{l+1}))} \log\left(\frac{1}{\min\{\epsilon_l, \delta/\rho\}}\right) \quad \text{for } l \ge 1,$$

where $\epsilon_l = g(\epsilon, f_T)$ is defined in Appendix A.2 and $\rho = C N_T^{1+\gamma} \left(\log(1/(1 - \min_l p_l))\right)^{-(1+\gamma)} \log(1/\min\{\min_l \epsilon_l, \delta\})$ for any $\gamma \ge 0$.
We also require $n_{S,0} \ge C d \frac{1}{\log(1/(1-p_1))} \log\left(\frac{1}{\min\{\epsilon_1, \delta/\rho\}}\right)$, where $C > 0$ denotes a generic constant that is independent of $n_{T,l}$, $L$, $p_l$, $\delta$, and $\epsilon$.

Proof outline: The main LT construction idea is visualized in Fig. 3 (b). For every target neuron, multiple approximating copies are created in the respective layer of the LT to serve as basis for modified subset sum approximations (see Lemma 2.1) of the parameters that lead to the next layer. In line with this approach, the first layer of the LT consists of univariate blocks that create multiple copies of the input neurons. In addition to Lemma 2.1, the total number of subset sum approximation problems $\rho$ that have to be solved also needs to be re-assessed for ER source networks, as it influences the probability of LT existence. This modification is driven by the same factor as the width increase. The full proof is given in Appendix A.2.

Experiments for SLTs. To verify the existence of SLTs in ER networks experimentally, we have conducted experiments with a ResNet18 model on CIFAR10. Average results based on 3 independent repetitions are shown in Table 1. We initialize the ResNet18 as a sparse ER network and use the edge-popup (Ramanujan et al., 2020) algorithm to find a SLT, restricting edge-popup to the nonzero (unmasked) parameters in the ER network, as explained in Fig. 1. Our results show that it is possible to obtain SLTs by starting pruning from a sparse random network instead of a dense network. Importantly, we can start with a sparse ER network of up to 0.8 sparsity and still achieve competitive performance when finding a SLT with final sparsity 0.9, without the need to train a dense network from scratch. Additional experiments for ResNet18 and VGG16 on CIFAR10 and ResNet110 on CIFAR100 are presented in the appendix (see, e.g., Tables 13 and 15).

Table 1:

Sparsity  | 0 → 0.9    | 0.5 → 0.9  | 0.7 → 0.9  | 0.5 → 0.95 | 0.8 → 0.95 | 0.5 → 0.99
Test Acc. | 87.9 ± 0.2 | 88.1 ± 0.3 | 88.0 ± 0.3 | 87.8 ± 0.3 | 88.1 ± 0.1 | 87.9 ± 0.1

The ER network is initialized with a uniform initial sparsity, which is gradually annealed to attain a SLT of the final sparsity (initial → final sparsity). Note that the first column serves as a baseline.
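Restricting edge-popup to an ER source amounts to ranking the trained scores only over the unmasked entries, so that masked-out edges can never enter the ticket. A simplified NumPy sketch of this selection step (not the full score-training loop; names are ours):

```python
import numpy as np

def slt_mask_from_er(scores, er_mask, final_density):
    """Edge-popup selection restricted to an ER mask: keep the top-scoring
    fraction of weights among the *unmasked* ER edges only."""
    flat = np.where(er_mask, scores, -np.inf).ravel()  # masked edges ineligible
    k = int(final_density * scores.size)
    keep = np.argsort(flat)[-k:]                       # indices of top-k scores
    slt = np.zeros(scores.size, dtype=bool)
    slt[keep] = True
    return slt.reshape(scores.shape)

rng = np.random.default_rng(0)
er = rng.random((8, 8)) < 0.5      # ER(0.5) source mask
scores = rng.random((8, 8))        # stand-in for trained edge-popup scores
slt = slt_mask_from_er(scores, er, final_density=0.1)
assert not np.any(slt & ~er)       # the SLT is a subnetwork of the ER mask
```

In the actual algorithm the scores are updated by backpropagation each step; only the selection among surviving ER edges differs from standard edge-popup.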

2.2. EXISTENCE OF WEAK LOTTERY TICKETS

So far, LT existence proofs have been restricted to SLTs, which do not need to be trained after pruning. These proofs (Fischer et al., 2021; Burkholz, 2022b; Pensia et al., 2020) automatically hold for WLTs, as every SLT is also a WLT. Hence, SLTs and WLTs in ER networks exist, as we have shown with Theorem 2.2. However, the constructed LTs usually have more parameters than the target network, as a sum over multiple random weight parameters is used to approximate a single target parameter. If we could use only one source network parameter for every target parameter, we could potentially achieve sparser LTs. The reason that LT theory is primarily focused on SLTs is that it is not well understood which properties and parameter initialization schemes render arbitrarily pruned networks trainable with SGD (Li et al., 2016; Han et al., 2015). ER masks, however, do not seem to suffer from this limitation, as they have been found to be well trainable in practice (Su et al., 2020; Ma et al., 2021; Liu et al., 2021), which we also verify in experiments in Section 3 in a broader context with different choices of layerwise sparsity ratios. The following assumption underlying our WLT existence proofs is therefore justified.

Assumption 2.3 (ER networks are trainable). An ER network $f \in ER(\mathbf{p})$ with layerwise density vector $\mathbf{p}$ is trainable by SGD with standard weight initializations (He et al., 2015) for different layerwise sparsity ratios.

Based on this assumption, we provide a construction of a LT by deriving an exact representation of the target network that only uses existing edges of a random ER network.

2.3. WLT EXISTENCE FOR A SINGLE HIDDEN LAYER TARGET NETWORK

We start by showing WLT existence for a fully-connected target network with a single hidden layer. Our proof strategy is visually explained in Figure 1. To approximate a target network with a single hidden layer, we use a two-hidden-layer ER source network. For a given density, the following theorem identifies the width of an ER source network above which we can show existence.

Theorem 2.4 (Existence of an ER network as a WLT for a single hidden layer target network). Let Assumption 2.3 be fulfilled, and let a single hidden layer fully-connected target network $f_T(x) = W^{(2)}_T \phi(W^{(1)}_T x + b^{(1)}_T) + b^{(2)}_T$, $\delta \in (0, 1)$, a target density $p$, and a 2-layer ER source network $f_S \in ER(\mathbf{p})$ with widths $n_{S,0} = q_0 d$, $n_{S,1} = q_1 n_{T,1}$, $n_{S,2} = q_2 n_{T,2}$ be given. If

$$q_0 \ge \frac{1}{\log(1/(1-p_1))} \log\left(\frac{2 m_{T,1} q_1}{\delta}\right), \quad q_1 \ge \frac{1}{\log(1/(1-p_2))} \log\left(\frac{2 m_{T,2}}{\delta}\right),$$

and $q_2 = 1$, then with probability $1 - \delta$ the source network $f_S$ contains a weak LT $f_{WLT}$.

Proof outline: Similar to the strategy for SLTs, we create a univariate first layer in the source network, as explained in Fig. 1. Different from the subset sum approximation in the case of SLTs, we can now use the trainability Assumption 2.3 to choose a weight in the source ER network that exactly learns the value of a target weight. The key idea is to create multiple copies (blocks in Fig. 1 (b)) in the source network for each target neuron such that every target link is realized by pointing to at least one of these copies in the ER source. In the appendix, we derive the corresponding weight and bias parameters that can be learned by SGD. Thus, our main task is to estimate the probability that we can find representatives of all target links in the ER source network, i.e., every neuron in layer $l = 1$ has at least one edge to every block in $l = 0$ of size $q_0$, as shown in Fig. 1 (b). This probability is given by $(1 - (1-p_1)^{q_0})^{m_{T,1} q_1}$.
For the second layer, we repeat a similar argument to bound the probability $(1 - (1-p_2)^{q_1})^{m_{T,2}}$ with $q_2 = 1$, since we do not require multiple copies of the output neurons. Bounding this probability by $1 - \delta$ completes the proof, as detailed in Appendix A.3. Theorem 2.4 shows that $q_0$ and $q_1$ scale with $1/\log(1/\text{sparsity})$. We now generalize the idea of creating multiple copies of target neurons in every layer to a fully-connected network of depth $L$, which yields a similar result as above and is stated formally in Theorem 2.5.
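For concreteness, the source widths prescribed by Theorem 2.4 can be computed directly. An illustrative helper (names are ours; we take the smallest integers satisfying the bounds and ignore the generic constant):

```python
import math

def wlt_source_widths(d, n_t1, n_t2, m_t1, m_t2, p1, p2, delta):
    """Smallest integer width factors satisfying Theorem 2.4:
    q_1 >= log(2 m_T2 / delta) / log(1/(1 - p2)) and
    q_0 >= log(2 m_T1 q_1 / delta) / log(1/(1 - p1)), with q_2 = 1.
    Returns the resulting ER source widths (n_S0, n_S1, n_S2)."""
    q1 = math.ceil(math.log(2 * m_t2 / delta) / math.log(1 / (1 - p2)))
    q0 = math.ceil(math.log(2 * m_t1 * q1 / delta) / math.log(1 / (1 - p1)))
    return q0 * d, q1 * n_t1, n_t2  # q_2 = 1

# Hypothetical target 784 -> 100 -> 10 with an ER(0.5) source, delta = 0.1.
widths = wlt_source_widths(d=784, n_t1=100, n_t2=10,
                           m_t1=784 * 100, m_t2=100 * 10,
                           p1=0.5, p2=0.5, delta=0.1)
print(widths)  # -> (19600, 1500, 10)
```

Note how the required overparametrization grows only logarithmically in the number of target parameters and in $1/\delta$, but inversely in $\log(1/(1-p_l))$.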

(Figure 1: Target Network, ER with Edge-Popup, Source Network.)

2.4. EXTENSION TO L-LAYER TARGET NETWORKS

Extending our insight from the 2-layer construction of the source network in the previous section, we provide a general result for a target network $f_T$ of depth $L$ and ER source networks with different layerwise sparsity ratios $p_l$. While we could approximate each target layer separately with two ER source layers, we instead present a construction that requires only one additional layer, so that $L_S = L + 1$. This transfers the approach for SLTs (Burkholz, 2022b;a) to weak ER networks. But we have to solve two extra challenges. (a) We need to ensure that a sufficient number of neurons is connected to the main network and can be used in the LT construction. (b) We have to show that the required number of potential matches for target neurons $q_l$ does not explode for an increasing number of layers. In fact, it only scales logarithmically in the relevant variables.

Theorem 2.5 (Existence of weak lottery tickets in ER networks). With Assumption 2.3, given a fully-connected target network $f_T$ of depth $L$, $\delta \in (0, 1)$, a target density $p$, and an $(L+1)$-layer ER source network $f_S \in ER(\mathbf{p})$ with widths $n_{S,0} = q_0 d$ and $n_{S,l} = q_l n_{T,l}$, $l \in \{1, 2, \dots, L\}$, where

$$q_l \ge \frac{1}{\log(1/(1-p_{l+1}))} \log\left(\frac{L m_{T,l+1} q_{l+1}}{\delta}\right) \quad \text{for } l \in \{0, 1, \dots, L-1\}$$

and $q_L = 1$, then with probability $1 - \delta$ the random source network $f_S$ contains a weak LT $f_{WLT}$.

Proof outline: Again, we follow the same procedure of finding the smallest width for every layer in the source network such that there is at least one connecting edge between a target neuron copy and one of the copies in the previous layer. Repeating this argument for every layer, starting from the output layer in reverse order, gives us the lower bound on the factor $q_l$ in every layer $l \in \{0, 1, \dots, L\}$. We can use the trainability Assumption 2.3 to choose the weights of the sparse ER network such that for every target parameter there is at least one nonzero (unmasked) parameter in the source that exactly learns the required value.
Full details are given in Appendix A.4.
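The reverse recursion for the width factors $q_l$ in Theorem 2.5 can be sketched as follows (illustrative helper with hypothetical layer sizes; the smallest integers satisfying the bounds are taken):

```python
import math

def er_width_factors(m_t, p, L, delta):
    """Layerwise width factors q_l from Theorem 2.5, computed in reverse:
    q_L = 1 and q_l = ceil( log(L * m_{T,l+1} * q_{l+1} / delta)
                            / log(1/(1 - p_{l+1})) ).
    m_t[l] and p[l] are indexed 1..L (index 0 is a dummy entry)."""
    q = [1] * (L + 1)
    for l in range(L - 1, -1, -1):
        q[l] = math.ceil(math.log(L * m_t[l + 1] * q[l + 1] / delta)
                         / math.log(1 / (1 - p[l + 1])))
    return q

# Hypothetical 3-layer target with parameter counts m_t and an ER(0.5) source.
q = er_width_factors(m_t=[0, 78400, 10000, 1000],
                     p=[0.0, 0.5, 0.5, 0.5], L=3, delta=0.1)
print(q)  # -> [26, 23, 15, 1]
```

Because $q_{l+1}$ enters only inside the logarithm, the factors grow slowly from the output layer backwards, which is exactly the claim that they do not explode with depth.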

2.5. EXISTENCE FOR CONVOLUTIONAL LAYERS

We can also extend our WLT existence results in ER networks to convolutional layers, whose number of channels needs to be overparameterized by a factor of $1/\log(1/\text{sparsity})$.

Theorem 2.6 (Existence of WLTs for ER networks with convolutional layers). With Assumption 2.3, given a target network $f_T$ of depth $L$ with convolutional layers $h^{(l)}_{T,i} = \sum_{j=1}^{c_{l-1}} W^{(l)}_{T,ij} * x^{(l-1)}_j + b^{(l)}_{T,i}$, $W_T \in \mathbb{R}^{c_l \times c_{l-1} \times k_l}$, $\delta \in (0, 1)$, a target density $p$, and an $(L+1)$-layer ER source network $f_S \in ER(\mathbf{p})$ with convolutional layers $h^{(l)}_{S,i} = \sum_{j=1}^{c_{l-1}} W^{(l)}_{S,ij} * x^{(l-1)}_j + b^{(l)}_{S,i}$, $W_S \in \mathbb{R}^{q_l c_l \times c_{l-1} \times k_l}$, where

$$q_l \ge \frac{1}{\log(1/(1-p_{l+1}))} \log\left(\frac{L m_{T,l+1} q_{l+1}}{\delta}\right) \quad \text{for } l \in \{0, 1, \dots, L-1\}$$

and $q_L = 1$, then with probability $1 - \delta$ the source network $f_S$ contains a weak LT $f_{WLT}$.

Main idea: Similarly as in the case of fully-connected ER source networks, we create $q_l$ copies of every output channel $c_l$ of the target in the LT. Every filter element of the target can be learned using the trainability Assumption 2.3. Note that any tensor entry that leads to the same block is sufficient, since the convolution is a bilinear operation, so that $\left(\sum_{i' \in I_i} W_{i'j}\right) * x_j = \sum_{i' \in I_i} \left(W_{i'j} * x_j\right)$. Specifically, $\sum_{i' \in I_i} W^{(l)}_{S,i'j}$ can represent a target element $w^{(l)}_{T,ije}$ if at least one weight $w^{(l)}_{S,i'je}$ is nonzero. Our method is visually explained in Fig. 3 alongside the full proof in Appendix A.5.

Theoretical insights

We have shown that ER networks are provably a good starting point to find SLTs and WLTs without the overhead of multiple prune-train iterations. Our analysis furthermore reveals that ER networks are still overparametrized, as many of the random edges get pruned in the LT construction. This insight presents a theoretical justification for pruning approaches that start from random ER masks, like Dynamic Sparse Training (Evci et al., 2020a; Mocanu et al., 2018). But how far can we go with random ER masks alone? Our LT constructions suggest a need for considerable overparametrization if we wanted to start from ER masks with extreme initial sparsities $\ge 0.9$, since $1/\log(1/(1-p_l)) \approx 1/p_l$ for $p_l \ll 1$. The next theorem establishes that we cannot expect to get around this $1/\log(1/(1-p_l))$ limitation. Targeted rewiring, however, might improve even extremely sparse random masks, as we demonstrate in experiments.

Theorem 2.7 (Lower bound on overparametrization in ER networks). There exist univariate target networks $f_T(x) = \phi(w_T^\top x + b_T)$ that cannot be represented by a random 1-hidden-layer ER source network $f_S \in ER(p)$ (i.e., as a weak or strong LT) with probability at least $1 - \delta$ if its width is

$$n_{S,1} < \frac{1}{\log(1/(1-p))} \log\left(\frac{1}{1 - (1-\delta)^{1/d}}\right).$$
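The threshold of Theorem 2.7 can be evaluated directly, which makes the $\approx 1/p$ scaling at extreme sparsity concrete (illustrative helper, names are ours):

```python
import math

def width_lower_bound(p, d, delta):
    """Theorem 2.7 threshold: below this width, some univariate target
    cannot be represented by an ER(p) 1-hidden-layer source network
    with probability at least 1 - delta."""
    return (math.log(1 / (1 - (1 - delta) ** (1 / d)))
            / math.log(1 / (1 - p)))

# Reducing the density p by 10x inflates the required width by roughly 10x,
# reflecting 1/log(1/(1-p)) ~ 1/p for small p:
for p in (0.1, 0.01):
    print(f"p = {p}: n_S1 must be at least {width_lower_bound(p, d=100, delta=0.05):.0f}")
```

For $d = 100$ and $\delta = 0.05$, the bound grows from roughly 72 at density $p = 0.1$ to roughly 754 at $p = 0.01$, i.e., close to the $1/p$ rate.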

3. EXPERIMENTS FOR WEAK LOTTERY TICKETS

Our existence proofs for WLTs rely on the property that ER networks are trainable in practice. We perform experiments on benchmark image datasets to validate this assumption.

Layerwise sparsity ratios. There are plenty of reasonable choices for the layerwise sparsity ratios and thus ER probabilities $p_l$. Our theory applies to all of them. The optimal choice for a given source network architecture depends on the target network, and thus on the solution to a learning problem, which is usually unknown a priori in practice. To demonstrate that our theory holds for different approaches and to provide practical insights into standard image classification problems, we investigate the following layerwise sparsity ratios in experiments. The simplest baseline is a globally uniform choice $p_l = p$. Liu et al. (2021) have compared this choice in extensive experiments with their main proposal, ERK, which assigns $p_l \propto \frac{n_{in} + n_{out}}{n_{in} n_{out}}$ to a linear layer and $p_l \propto \frac{c_l + c_{l-1} + k_l}{c_l c_{l-1} k_l}$ (Mocanu et al., 2017) to a convolutional layer. In addition, we propose a pyramidal and a balanced approach, which are visualized in Appendix A.14 for VGG19. Pyramidal: This method emulates a property of WLTs that are obtained by IMP (Frankle & Carbin, 2019), namely that the layer densities decay with increasing depth of the network. For a network of depth $L$, we use $p_l = (p_1)^l$ with $p_1 \in (0, 1)$, where $p_1$ is chosen such that the overall density of the network matches a given target density.

Balanced:

The second layerwise sparsity method aims to maintain the same number of parameters in every layer for a given target network density $p$ and source network architecture. Each neuron has a similar in- and out-degree on average. Every layer has $x = \frac{p}{L} \sum_{l=1}^{L} m_l$ nonzero parameters. Such an ER network can be realized with $p_l = x / m_l$. In case $x \ge m_l$, we set $p_l = 1$. To judge the quality of LT pruning algorithms, ER randomizations of pruned tickets have also been studied as baselines, which challenge state-of-the-art pruning algorithms (Su et al., 2020; Ma et al., 2021). The corresponding sparsity ratios are computationally more cumbersome to obtain and thus of reduced practical interest. We still report comparisons with randomized Snip (Lee et al., 2018), Iterative Synflow (Tanaka et al., 2020), and IMP (Frankle & Carbin, 2019). To complement (Liu et al., 2021), we conduct experiments in more extreme sparsity regimes $\ge 0.9$ to test the limit up to which ER networks are a viable alternative to more advanced but computationally expensive pruning algorithms. We show empirically that even in regimes of reduced performance, rewiring edges by Dynamic Sparse Training (DST) can improve the performance substantially, which highlights the utility of random ER masks even at extreme sparsities.

Experimental setup. We conduct our experiments with two datasets built for image classification tasks: CIFAR10 and CIFAR100 (Krizhevsky et al., 2009). Additional experiments on Tiny ImageNet (Russakovsky et al., 2015) are reported in Appendix A.12. We train two popular architectures, VGG16 (Simonyan & Zisserman, 2015) and ResNet18 (He et al., 2016), to classify images in the CIFAR10 dataset. On the larger CIFAR100 dataset, we use VGG19 and ResNet50. Each model is trained using SGD with learning rate 0.1, momentum 0.9, weight decay 0.0005, and batch size 128. We use the same hyperparameters as Ma et al. (2021) and train every model for 160 epochs.
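The balanced and pyramidal ratios described above can be computed in a few lines. A NumPy sketch (function names are ours; for pyramidal we solve for $p_1$ by bisection so that the overall density matches $p$, which is our reading of the scheme):

```python
import numpy as np

def balanced_densities(m, p):
    """Balanced ratios: every layer receives the same number
    x = (p / L) * sum_l m_l of nonzero parameters, i.e. p_l = x / m_l,
    capped at 1 where x >= m_l. m[l] is the parameter count of layer l."""
    m = np.asarray(m, dtype=float)
    x = p * m.sum() / len(m)
    return np.minimum(1.0, x / m)

def pyramidal_densities(m, p, iters=60):
    """Pyramidal ratios p_l = (p_1)^l, with p_1 found by bisection so that
    the overall density sum_l m_l p_l / sum_l m_l equals p."""
    m = np.asarray(m, dtype=float)
    exps = np.arange(1, len(m) + 1)
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2
        if (m * mid ** exps).sum() / m.sum() < p:
            lo = mid
        else:
            hi = mid
    return ((lo + hi) / 2) ** exps

m = [100, 200, 100]                 # toy layer parameter counts
print(balanced_densities(m, 0.3))   # -> [0.4 0.2 0.4]
print(pyramidal_densities(m, 0.3))  # densities decaying with depth
```

Both schemes are data-free and need only the architecture's layer sizes, in line with the claim that they incur negligible computational overhead.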
We repeat all our experiments over three runs and report averages and 0.95-confidence intervals, which can be found in the appendix due to space constraints. Our code builds on the work of Liu et al. (2021) and Tanaka et al. (2020) and is included in the supplement. All our experiments were run on 4 Nvidia A100 GPUs.

Results on CIFAR10 and CIFAR100
Experiments on the CIFAR10 dataset are shown in Table 2 and on CIFAR100 in Table 3. The pyramidal and balanced methods are competitive and even outperform ERK in our experiments for sparsities up to 0.99. Importantly, they also outperform layerwise sparsity ratios obtained by the expensive iterative pruning algorithms Synflow and IMP. However, for extreme sparsities $1-p \geq 0.99$, the performance of ER networks drops significantly and even breaks down completely for methods like ER Snip and pyramidal. We conjecture that ER Snip and pyramidal are susceptible to layer collapse in the higher layers, which even flow repair (see Appendix A.1) can only partially remedy.

Dynamical Sparse Training

In order to improve the expressiveness of ER networks and achieve extremely sparse WLTs, ER networks can be rewired with the help of DST. Specifically, we use the RiGL algorithm (Evci et al., 2020a). First, we only rewire edges, which allows us to start from relatively sparse networks. Our results on CIFAR10 with VGG16 in Table 4 indicate that DST can improve even extremely sparse architectures. Usually, DST is started from ER networks of sparsity 0.5; we show in the appendix that it can also be initialized at higher sparsity with insignificant losses in accuracy. In particular, initial balanced or pyramidal sparsity ratios seem to improve the performance of RiGL. In Appendix A.13, we report additional experiments that further illustrate the utility of balanced and pyramidal sparsity ratios for typical DST experiments that prune ER networks. These results suggest that there is potential for tuning ER sparsity ratios to reduce computational costs and increase predictive performance.
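A single rewiring step of the kind RiGL performs can be sketched for one layer in numpy. This is a simplified illustration under our own naming, not the reference implementation: the actual algorithm operates on all layers jointly, decays the rewiring fraction over training, and initializes regrown weights to zero.

```python
import numpy as np

def rigl_rewire(weights, grads, mask, rewire_frac=0.3):
    """One RiGL-style rewiring step on a single layer: drop the
    smallest-magnitude active weights and regrow the same number of
    connections where the dense gradient magnitude is largest,
    keeping the number of active connections fixed."""
    active = np.flatnonzero(mask)
    n_rewire = int(rewire_frac * active.size)
    if n_rewire == 0:
        return mask
    # drop: active connections with the smallest weight magnitude
    drop = active[np.argsort(np.abs(weights.flat[active]))[:n_rewire]]
    new_mask = mask.copy()
    new_mask.flat[drop] = 0
    # grow: currently inactive positions with the largest dense gradient
    inactive = np.flatnonzero(new_mask == 0)
    grow = inactive[np.argsort(-np.abs(grads.flat[inactive]))[:n_rewire]]
    new_mask.flat[grow] = 1
    return new_mask
```

Since exactly as many connections are grown as dropped, the layer's sparsity level is preserved by construction.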

4. CONCLUSIONS

We have systematically explained the effectiveness of random pruning and thus provided a theoretical justification for the use of Erdös-Rényi (ER) masks as strong baselines for lottery ticket pruning and as a starting point for dynamical sparse training (DST). So far, this effectiveness had been demonstrated only experimentally, and only for weak lottery tickets (WLTs). We have shown theoretically and experimentally that ER networks also contain strong lottery tickets. This finding is also of practical interest, as initial random sparse masks avoid the computationally expensive process of pruning a dense network from scratch. Remarkably, under the assumption that ER networks are well trainable in practice, we could also prove the existence of WLTs. Our theory holds for a wide range of sparsity ratios, as we have demonstrated in experiments. Yet, it suggests the necessity of high overparametrization in regimes of extreme sparsity. These limitations can be partially remedied by a combination of random pruning and targeted rewiring as realized, for instance, by DST.

A APPENDIX

A.1 FLOW PRESERVATION

Targeted pruning is known to be susceptible to layer collapse or a sub-optimal use of resources (given in the form of trainable parameters) when intermediary neurons receive no input despite nonzero output weights, or have zero output weights despite nonzero input weights. To avoid this issue, Tanaka et al. (2020) have derived a specific data-independent pruning criterion, synaptic flow. Yet, flow preservation can also be achieved with a simple and computationally efficient random repair strategy that applies to diverse masking methods, including random ER masking. The main idea behind this algorithm is to connect neurons (or filters) with zero in- or out-degree to at least one other randomly chosen neuron (or filter) in the network. To preserve the global sparsity, a new edge can replace a randomly chosen previous edge. Alternatively, ER networks with flow preservation could also be obtained by rejection sampling, which is equivalent to conditioning neurons on nonzero in- and out-degrees. To still meet the target density $p_l$, the ER probability $\tilde{p}_l$ would need to be adjusted appropriately. Our experiments reveal, however, that most randomly masked standard ResNet and VGG architectures usually preserve flow with high probability for different layerwise sparsity ratios up to sparsities $\approx 0.95$ (see Tables 5 and 6). The most problematic layers are the first and the last layer if the number of input channels and output neurons is relatively small. In consequence, most pruning schemes keep these layers relatively dense. In our theoretical derivations, we assume flow preservation in the first layer. We propose two methods to achieve flow preservation, which guarantees that every neuron (or filter) has in-degree and out-degree at least 1.
Rejection Sampling: We can resample the mask entries $s^{(l)}_{ij,ER}$ of neurons (filters) that have zero in-degree or zero out-degree until every such neuron has in-degree and out-degree at least one.
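For a single fully-connected layer, rejection sampling can be sketched as follows. This is a per-layer simplification under our own naming; across a full network, in-degrees are determined by one layer's mask and out-degrees by the next layer's.

```python
import numpy as np

def er_mask_with_flow(n_out, n_in, p, rng):
    """Draw an ER(p) mask for a fully-connected layer by rejection
    sampling: resample until every row (out-neuron) and every column
    (in-neuron) contains at least one nonzero entry, i.e. all in- and
    out-degrees within this layer are >= 1."""
    while True:
        mask = (rng.random((n_out, n_in)) < p).astype(int)
        if mask.sum(axis=1).min() >= 1 and mask.sum(axis=0).min() >= 1:
            return mask
```

For very small $p$ or very wide layers, rejection sampling can become slow, which is one reason to prefer the Random Addition repair below in practice.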

Random Addition:

We randomly add an edge to a neuron with zero in-degree or out-degree. While this method adds extra edges to the network, the total number of edges that need to be added is usually negligible in practice. We verify the number of corrections required in an ER network to preserve flow; notice that in most cases ER networks inherently preserve flow. For each of the used layerwise sparsity ratios, we calculate the number of connections (edges) added to the network to ensure that every neuron (or filter) has in-degree and out-degree at least 1 using the Random Addition method. Tables 5 and 6 show the results. Our analysis shows that flow preservation is an important property that avoids layer collapse in sparse networks and is inherently satisfied in reasonable sparsity regimes up to $\approx 0.9$. It has a similar effect as keeping the first and final layer dense during pruning, as some pruning algorithms do (Liu et al., 2021). Figure 2 compares the different layerwise sparsity methods for ER networks with and without flow preservation. Our results show that flow preservation is especially important for the pyramidal and ER Snip methods. Both methods have a higher sparsity in the final layer, which leads to performance problems at high global sparsities. Flow preservation addresses this partially, so that a clear improvement is visible for the pyramidal method at sparsities 0.99 and 0.995.
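The Random Addition repair can be sketched for a single fully-connected layer as follows (a numpy sketch under our own naming; a sparsity-preserving variant would additionally remove a randomly chosen existing edge for every edge added):

```python
import numpy as np

def repair_flow(mask, rng):
    """Random Addition repair for one layer's ER mask: any out-neuron
    with zero in-degree (empty row) or in-neuron with zero out-degree
    (empty column) gets one randomly placed edge. Returns the repaired
    mask and the number of edges added."""
    mask = mask.copy()
    added = 0
    for i in np.flatnonzero(mask.sum(axis=1) == 0):   # empty rows
        mask[i, rng.integers(mask.shape[1])] = 1
        added += 1
    for j in np.flatnonzero(mask.sum(axis=0) == 0):   # empty columns
        mask[rng.integers(mask.shape[0]), j] = 1
        added += 1
    return mask, added
```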

A.2 PROOF FOR EXISTENCE OF STRONG LOTTERY TICKETS IN ER NETWORKS

As discussed in the main manuscript, most SLT existence proofs that derive a logarithmic lower bound on the overparametrization factor of the source network utilize subset sum approximation (Lueker, 1998) in the explicit construction of a lottery ticket that approximates a target network (Pensia et al., 2020; Burkholz et al., 2022; Burkholz, 2022a; da Cunha et al., 2022; Burkholz, 2022b). We can transfer all of these proofs to ER source networks by modifying the subset sum approximation results to random variables that are set to zero with a Bernoulli probability $p$, to account for randomly missing links in the source network. We just have to replace Lueker's subset sum approximation result by Lemma 2.1 in the corresponding proofs. For simplicity, we have formulated it for uniform random variables and target parameters $z \in [-1, 1]$, but it could easily be extended to random variables that contain a uniform distribution (like normal distributions) and generally bounded targets as in Corollary 7 of (Burkholz et al., 2022). For convenience, we restate Lemma 2.1 from the main manuscript.

Lemma A.1 (Subset sum approximation in ER networks). Let $X_1, ..., X_n$ be independent, uniformly distributed random variables with $X_i \sim U([-1, 1])$ and $M_1, ..., M_n$ be independent, Bernoulli distributed random variables with $M_i \sim \text{Ber}(p)$ for a $p > 0$. Let $\epsilon, \delta \in (0, 1)$ be given. Then for any $z \in [-1, 1]$ there exists a subset $I \subset [n]$ so that with probability at least $1 - \delta$ we have $|z - \sum_{i \in I} M_i X_i| \leq \epsilon$ if
$$n \geq C \frac{1}{\log(1/(1-p))} \log \frac{1}{\min(\delta, \epsilon)}.$$

Proof. Random variables $\tilde{X}_i = M_i X_i$ do not contribute to the approximation of a target value $z$ if they are zero, and thus in particular in the case that $M_i = 0$, which happens with probability $1-p$ for each index $i$. We can thus remove all variables $\tilde{X}_i$ for which $M_i = 0$.
After a change of indexing, we arrive at a subset $\tilde{X}_1, ..., \tilde{X}_K$ of $K$ random variables, which are uniformly distributed as $\tilde{X}_i = M_i X_i = X_i \sim U([-1, 1])$, since $M_i$ is independent of $X_i$. The number of variables $K$ follows a binomial distribution, $K \sim \text{Bin}(n, p)$, since $M_1, ..., M_n$ are independent Bernoulli distributed. For fixed $K = k$, Lueker (1998) has proven that there exist constants $a_k > 0$ and $b_k > 0$ so that the probability that the approximation is not possible is of the form
$$P\Big(\forall I \subset [k]: \big|z - \sum_{i \in I} \tilde{X}_i\big| > \epsilon'\Big) \leq a_k \exp(-b_k k)/\epsilon'.$$
Using this result and defining $a := \max_{k \in [n]} a_k > 0$ and $b := \min_{k \in [n]} b_k > 0$, we just have to take an average with respect to the random variable $K \sim \text{Bin}(n, p)$:
$$P\Big(\forall I \subset [n]: \big|z - \sum_{i \in I} \tilde{X}_i\big| > \epsilon'\Big) \leq \sum_{k=0}^{n} \frac{a_k}{\epsilon'} \exp(-b_k k) \binom{n}{k} p^k (1-p)^{n-k} \leq \frac{a}{\epsilon'} \sum_{k=0}^{n} \binom{n}{k} \exp(-bk) p^k (1-p)^{n-k} = \frac{a}{\epsilon'} \big[1 - p(1 - \exp(-b))\big]^n.$$
To ensure that the subset sum approximation is feasible with probability at least $1 - \delta'$, we need to fulfill $\frac{a}{\epsilon'}[1 - p(1 - \exp(-b))]^n \leq \delta'$. Solving for $n$ leads to
$$n \geq \frac{1}{\log\big(1/(1 - p(1 - \exp(-b)))\big)} \log \frac{a}{\delta' \epsilon'}.$$
This inequality is satisfied if
$$n \geq C \frac{1}{\log(1/(1-p))} \log \frac{1}{\min\{\delta', \epsilon'\}}$$
for a generic constant $C > 0$ that depends on $a$ and $b$.

With this modified subset sum approximation, we show next that, in comparison with a complete source network, an ER network needs to be wider by a factor $\frac{1}{\log(1/(1-p))}$. To provide an example of how to transfer an SLT existence proof, we focus on the construction by Burkholz (2022b). Note that in all our theorems we assume that flow is preserved in the first layer, as it is reasonable to apply a simple and computationally cheap flow preservation algorithm after drawing a random mask (see Appendix A.1). This algorithm just ensures that all neurons are connected to the main network and are thus useful for training a neural network. If we do not assume that flow is preserved, some neurons in the first layer might be disconnected from all input neurons with probability $(1-p_0)^d$.
Disconnected neurons could simply be ignored in the LT construction. Their share is usually negligible but, technically, without flow preservation we would need to ensure that $n_{S,1} \geq C(1-p_0)^d + n^*_{S,1}$, where $n^*_{S,1}$ denotes the bound on the width that we are actually going to derive.

Theorem A.2 (Existence of SLTs in ER networks). Let $\epsilon, \delta \in (0, 1)$, a target network $f_T$ of depth $L$, and an ER(p) source network $f_S$ of depth $L+1$ with edge probabilities $p_l$ in each layer $l$ and iid initial parameters $\theta$ with $w^{(l)}_{ij} \sim U([-1, 1])$, $b^{(l)}_i \sim U([-1, 1])$ be given. Then with probability at least $1 - \delta$, there exists a mask $S_{LT}$ so that each target output component $i$ is approximated as $\max_{x \in D} \|f_{T,i}(x) - f_{S,i}(x; W_S \circ S_{LT})\| \leq \epsilon$ if
$$n_{S,l} \geq C \frac{n_{T,l}}{\log(1/(1-p_{l+1}))} \log \frac{1}{\min\{\epsilon_l, \delta/\rho\}} \quad \text{for } l \geq 1,$$
where $\epsilon_l = g(\epsilon, f_T)$ is defined in Equation (3) and $\rho = \frac{C N_T^{1+\gamma}}{\log(1/(1-\min_l p_l))^{1+\gamma}} \log(1/\min\{\min_l \epsilon_l, \delta\})$ for any $\gamma \geq 0$. We also require $n_{S,0} \geq C d \frac{1}{\log(1/(1-p_1))} \log \frac{1}{\min\{\epsilon_1, \delta/\rho\}}$, where $C > 0$ denotes a generic constant that is independent of $n_{T,l}$, $L$, $p_l$, $\delta$, and $\epsilon$. Here, $\epsilon_l = g(\epsilon, f_T)$ is defined in accordance with Lemma 5.1 of (Burkholz, 2022b):
$$\epsilon_l = g(\epsilon, f_T) = \frac{\epsilon}{n_{T,L}\, L\, (1 + B_{l-1})(1 + \epsilon_L)^{L-1}} \prod_{k=l+1}^{L} \big(\|W^{(k)}_T\|_\infty + \epsilon_L\big)^{-1}, \qquad B_l := \sup_{x \in D} \|x^{(l)}_T\|_1. \quad (3)$$

Proof. To prove the existence of strong lottery tickets in ER networks, we modify the proof by Burkholz (2022b) for complete fully-connected networks. We first answer the question how the fact that random weights are set irreversibly to zero changes our construction. Fig. 1 visualizes the general schematic. The general idea is that we have to create multiple copies $\rho_l$ of each target neuron in the LT, as these enable the approximation of target parameters by utilizing the subset sum approximation as modified by Lemma 2.1. First, as Fig. 1 visualizes, we have to argue why and how we can create univariate blocks in the first layer or, in general, in 2L constructions.
In this case, a target layer is approximated by two appropriately pruned layers of the source network. The first of these two source layers contains only univariate neurons that form blocks consisting of neurons of the same type, which correspond to the same input target neuron $i$. All weights that start in the same block $i$ and end in the same neuron $j$ can then be utilized to approximate the target parameter $w_{T,ji}$. The required univariate blocks can easily be realized by pruning if flow is preserved: each neuron in source layer $l = 0$ has at least one incoming edge, which can survive the pruning. Since this edge could be adjacent to any of the input neurons with the same probability, we can always find enough neurons in Layer $l = 0$ that point to any of the input neurons, which allows us to form univariate blocks of similar size $B$. Second, we have to analyze how the construction of each following target layer is affected by randomly missing edges in the source network. Each target weight $w^{(l)}_{T,ij}$ can be approximated by $w^{(l)}_{T,ij} \approx \sum_{j' \in I} m^{(l)}_{S,i'j'} w^{(l)}_{S,i'j'}$, where neuron $i'$ in the LT approximates the target neuron $i$ and neuron $j'$ in the LT approximates the target neuron $j$. The subset $I$ is chosen based on the modified subset sum approximation and informs the mask of the LT. Thus, $I$ exists according to Lemma 2.1, since the initially random mask entries $m^{(l)}_{S,i'j'}$ of the source network are Bernoulli distributed with probability $p_l$. The second issue that needs to be modified for ER networks is the analysis of the number of required subset sum approximation problems $\rho$. As explained before, the main idea of the construction is to create $\rho_l$ copies of each target neuron of target Layer $l$ in Layer $l$ of the LT. These copies then serve multiple subset sum approximations to approximate the target neurons in the next layer, in a similar way as the univariate blocks of the first layer.
This, however, increases the total number of subset sum approximation problems $\rho$ that need to be solved, which influences the probability with which we can solve all of them. Using a union bound, we can spend $\delta/\rho$ on every approximation with a modified $\rho$ for ER networks. Similar to (Burkholz, 2022b), we can derive a lower bound on $\rho_l$ in the subsequent layers, so that the subset sum approximation is feasible for every parameter of layer $l$ when the block size $B$ satisfies
$$B \geq \frac{1}{\log(1/(1-p_l))} \log \frac{a \rho}{\delta' \epsilon'},$$
so that with an appropriately chosen constant $C$ we have
$$B \geq \frac{C}{\log(1/(1-p_l))} \log \frac{1}{\min\{\delta'/\rho, \epsilon'\}},$$
and it follows in total that
$$n_{S,l} \geq C \frac{n_{T,l}}{\log(1/(1-p_{l+1}))} \log \frac{1}{\min\{\epsilon_l, \delta/\rho\}}.$$
The remaining objective is to find a $\rho \geq \rho' = \sum_{l=1}^{L} \rho'_l$, where $\rho'$ is the factor of increased subset sum approximation problems required to approximate $L$ target layers with an ER source network and $\rho'_l$ counts the number of parameters in each LT layer. Following Burkholz (2022b)'s method to identify $\rho$, we start with the last layer. The number $\rho'_L$ of subset sum approximation problems that have to be solved to approximate the last layer determines the number of neurons required in the previous layer, which in turn determines the required number of neurons in the layer before it, and so on. The last layer requires solving exactly $\rho'_L = n_{T,L}\, n_{T,L-1}$ subset sum problems, which can be solved with sufficiently high probability if $n_{S,L-1} \geq \frac{C n_{T,L-1}}{\log(1/(1-p_L))} \log(1/\min\{\epsilon_L, \delta/\rho'\})$. As we would need maximally $\frac{C}{\log(1/(1-p_L))} \log(1/\min\{\epsilon_L, \delta/\rho'\})$ sets of the target parameters in the last layer, we can bound $\rho'_{L-1} \leq \frac{C N_{L-1}}{\log(1/(1-p_L))} \log(1/\min\{\epsilon_L, \delta/\rho'\})$. Repeating the same argument for every layer, we derive $\rho'_l \leq \frac{C N_l}{\log(1/(1-p_{l+1}))} \log(1/\min\{\epsilon_{l+1}, \delta/\rho'\})$. In total, we find that
$$\rho' = \sum_{l=1}^{L} \rho'_l \leq \sum_{l=1}^{L} \frac{C N_l}{\log(1/(1-p_{l+1}))} \log(1/\min\{\epsilon_{l+1}, \delta/\rho'\}) \leq \frac{C N_t}{\log(1/(1-\min_l p_l))} \log(1/\min\{\min_l \epsilon_l, \delta/\rho\}).$$
Here, $N_l = n_{T,l}\, n_{T,l-1}$ and $N_t = \sum_l N_l$. A $\rho$ that fulfills $\rho \geq \frac{C N_t}{\log(1/(1-\min_l p_l))} \log(1/\min\{\min_l \epsilon_l, \delta/\rho\})$ will be sufficient. It is easy to see that $\rho = \frac{C N_T^{1+\gamma}}{\log(1/(1-\min_l p_l))^{1+\gamma}} \log(1/\min\{\min_l \epsilon_l, \delta\})$ for any $\gamma \geq 0$ fulfills this requirement. We have thus shown the existence of SLTs in ER networks following similar ideas as the proof of Theorem 5.2 by Burkholz (2022b); accordingly, our construction also applies to more general activation functions than ReLUs. Note that we could also follow the proof strategy of Pensia et al. (2020) to show the existence of strong lottery tickets in ER networks. The key difference between the proofs of Burkholz (2022b) and Pensia et al. (2020) is how the subset sum base is created to approximate a target parameter. Pensia et al. (2020) use two layers for every layer in the target and create a basis set to approximate every target weight, while Burkholz (2022b) goes one step further and creates multiple subset sum approximations of every target weight to avoid the two-layer construction. In both cases, the underlying subset sum approximation can be modified for ER networks as shown above, and the same proof strategy as in (Burkholz, 2022b) or (Pensia et al., 2020) can be followed. Similarly, we could also extend our proofs to convolutional and residual architectures (Burkholz, 2022a).
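The mechanism behind Lemma A.1 can be checked empirically with a small brute-force search: draw uniform variables, mask them with Bernoulli variables, and look for the subset sum closest to a target. This is an illustrative sketch under our own naming, with small sizes so that exhaustive search is feasible:

```python
import itertools
import random

def best_subset_error(z, n=18, p=0.5, seed=0):
    """Draw X_i ~ U([-1, 1]) masked by M_i ~ Ber(p), keep the survivors
    (M_i = 1), and search all subsets for the best approximation of the
    target z. Returns the smallest achievable |z - sum over subset|."""
    rng = random.Random(seed)
    survivors = [rng.uniform(-1, 1) for _ in range(n) if rng.random() < p]
    best = abs(z)  # the empty subset
    for r in range(1, len(survivors) + 1):
        for subset in itertools.combinations(survivors, r):
            best = min(best, abs(z - sum(subset)))
    return best
```

With roughly $K \approx np$ survivors, the $2^K$ subset sums cover $[-1, 1]$ increasingly densely, which is the intuition behind the logarithmic width requirement.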

A.3 WLT EXISTENCE PROOF FOR SINGLE HIDDEN LAYER TARGET NETWORK

Theorem A.3 (Existence of an ER network as a WLT for a single hidden layer target network). Let Assumption 2.3 be fulfilled and let a single hidden layer fully-connected target network $f_T(x) = W^{(2)}_T \phi(W^{(1)}_T x + b^{(1)}_T) + b^{(2)}_T$, $\delta \in (0, 1)$, a target density $p$, and a two-hidden-layer ER source network $f_S \in \text{ER}(p)$ with widths $n_{S,0} = q_0 d$, $n_{S,1} = q_1 n_{T,1}$, $n_{S,2} = q_2 n_{T,2}$ be given. If
$$q_0 \geq \frac{1}{\log(1/(1-p_1))} \log \frac{2 m_{T,1} q_1}{\delta}, \qquad q_1 \geq \frac{1}{\log(1/(1-p_2))} \log \frac{2 m_{T,2}}{\delta}, \qquad q_2 = 1,$$
then with probability $1 - \delta$ the source network $f_S$ contains a weak LT $f_{WLT} = f_S(x; W_S \circ S_{LT})$.

Proof of Theorem 2.4. A two hidden layer network can approximate a single hidden layer target network as explained in Section 2.4. $(q_0, q_1, q_2)$ are the overparametrization factors in each layer of the source network which ensure that we can find the links that we need in our WLT construction. Why would we need any form of overparametrization? Different from the SLT construction, we do not need to employ multiple parameters to approximate a single parameter and thus do not use any subset sum approximation. Our trainability assumption allows us to define the parameters of the ER source network so that a target network is exactly represented (i.e., with $\epsilon = 0$). Yet, we still need to prove that we can find all required nonzero entries in our mask. To increase the probability that a target link exists, we create multiple copies of input neurons. As in the SLT construction, we prune the neurons in the first layer to univariate neurons and choose the bias large enough so that the ReLU acts essentially as an identity function. $p_0 > 0$ can thus be arbitrary, as long as flow is preserved. Note that $q_2 = 1$, as the output neurons of source and target should be identical, $n_{T,2} = n_{S,2}$. The last (output) layer of the target contains $n_{T,2}$ neurons and the penultimate layer $n_{T,1}$.
In the source network, we create $q_1$ copies of each neuron of the hidden layer of the target network such that $n_{S,1} = q_1 \cdot n_{T,1}$. Our goal is to bound the width of Layer 1 in the ER network such that there is at least one nonzero edge in the ER network for every nonzero target weight. To lower bound $q_1$, each nonzero weight $w^{(2)}_{T,ij}$ must have at least one nonzero weight (edge) in the source network with sufficiently high probability, i.e., every neuron of the output layer must have a nonzero edge to every block of the previous layer, as explained in Figure 1. The probability that at least one such edge exists for each output neuron is $\big(1 - (1-p_2)^{q_1}\big)^{m_{T,2}}$. Similarly, the probability that each neuron in the second layer of the source has at least one nonzero edge to each of the univariate blocks of the first layer is $\big(1 - (1-p_1)^{q_0}\big)^{m_{T,1} q_1}$. Since each layer construction is independent of the other, these probabilities can be multiplied to obtain the probability that we can represent the entire target network:
$$\prod_{l=1}^{2} \big(1 - (1-p_l)^{q_{l-1}}\big)^{m_{T,l} q_l} \geq 1 - \delta.$$
One way to fulfill this inequality is to split the error between the two product terms,
$$\big(1 - (1-p_1)^{q_0}\big)^{m_{T,1} q_1} \geq (1-\delta)^{1/2} \quad \text{and} \quad \big(1 - (1-p_2)^{q_1}\big)^{m_{T,2} q_2} \geq (1-\delta)^{1/2}.$$
Both inequalities above are satisfied with
$$1 - (1-p_2)^{q_1} \geq 1 - \frac{\delta}{2 m_{T,2} q_2} \quad \text{and} \quad 1 - (1-p_1)^{q_0} \geq 1 - \frac{\delta}{2 m_{T,1} q_1}.$$
We can now solve for $q_i$, $i \in \{0, 1\}$:
$$q_0 \geq \frac{1}{\log(1/(1-p_1))} \log \frac{2 m_{T,1} q_1}{\delta} \quad \text{and} \quad q_1 \geq \frac{1}{\log(1/(1-p_2))} \log \frac{2 m_{T,2}}{\delta}, \quad \text{since } q_2 = 1.$$
After having identified a representative link in the source ER network for each target weight, we next define the weights and biases of the source ER network leveraging the trainability Assumption 2.3. Each representative link in the ER source network is assigned the weight of its corresponding target.
For the first layer of the source network, which is a univariate construction of the input, the weights are set to one and the bias is chosen large enough so that all relevant inputs pass through the ReLU activation function as if it were the identity:
$$w^{(0)}_{S,ij} = 1 \;\; \forall j \in \{1, ..., d\},\; i \in \{1, ..., n_{S,0}\}, \qquad b^{(0)}_{S,i} = \begin{cases} -a_1 & \text{if } a_1 \leq 0 \\ 0 & \text{if } a_1 > 0 \end{cases} \;\; \text{for every } i \in \{1, ..., n_{S,0}\}.$$
Recall that $a_1$ is defined as the lower bound of each input component $x$. We compensate for this additional bias in the last layer. For the second layer, every weight $w_{T,ij}$ of the target network is assigned to one of the nonzero mask entries of the ER source network that lead from the corresponding input block $j$ to the output block $i$; the remaining extra weights in the source are set to zero:
$$w^{(1)}_{S,i'j'} = w_{T,ij}, \quad i' \in \{q_1 i, q_1 i + 1, ..., q_1 i + q_1\},\; j' \in \{q_0 j, q_0 j + 1, ..., q_0 j + q_0\}$$
for one pair $(i', j')$. The remaining connections between $i'$ and block $j$ can be pruned away, i.e., masked or set to zero. The bias of the second layer is chosen so that it compensates for the extra bias added in the univariate construction of the first layer:
$$b^{(1)}_{S,i'} = b^{(1)}_{T,i} - w^{(1)}_{T,ij} b^{(0)}_{S,j'} \quad \forall i' \in \{1, ..., n_{S,1}\}.$$
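The width requirements of Theorem A.3 are easy to evaluate numerically. The following sketch (function name and example sizes are our own) computes the smallest integer overparametrization factors and the resulting source widths:

```python
import math

def wlt_widths_single_layer(d, n_t1, n_t2, m_t1, m_t2, p1, p2, delta):
    """Smallest integer factors (q0, q1, q2) satisfying Theorem A.3:
      q1 >= log(2 m_T2 / delta) / log(1/(1 - p2)),
      q0 >= log(2 m_T1 q1 / delta) / log(1/(1 - p1)),  q2 = 1,
    so an ER source of widths (q0*d, q1*n_T1, n_T2) contains the target
    as a WLT with probability >= 1 - delta."""
    q2 = 1
    q1 = math.ceil(math.log(2 * m_t2 / delta) / math.log(1 / (1 - p2)))
    q0 = math.ceil(math.log(2 * m_t1 * q1 / delta) / math.log(1 / (1 - p1)))
    return q0 * d, q1 * n_t1, q2 * n_t2, (q0, q1, q2)
```

Note the mild dependence on the failure probability and parameter counts: both enter only logarithmically, while the ER probabilities enter through the $1/\log(1/(1-p_l))$ factor.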

A.4 WLT EXISTENCE PROOF FOR A TARGET NETWORK OF DEPTH L

In this section, we generalize the idea of constructing a source ER network presented in Section 2.3 to a fully-connected target network $f_T(x)$ of depth $L$. Each layer $l$ has a weight matrix $W^{(l)}_T \in \mathbb{R}^{n_{T,l-1} \times n_{T,l}}$, where $n_{T,l}$ is the number of neurons in layer $l$ and $m_{T,l} = n_{T,l-1} \cdot n_{T,l}$ is the number of parameters of the weight matrix. As before, we assume that the target network is equipped with the nonlinear ReLU activation function $\phi(x)$. Each layer has a layerwise expected density $p_l$.

Theorem A.4 (Existence of weak lottery tickets in ER networks). Let Assumption 2.3 be fulfilled and let a fully-connected target network $f_T$ of depth $L$, $\delta \in (0, 1)$, a target density $p$, and an $(L+1)$-layer ER source network $f_S \in \text{ER}(p)$ with widths $n_{S,0} = q_0 d$ and $n_{S,l} = q_l n_{T,l}$, $l \in \{1, ..., L\}$, be given, where
$$q_l \geq \frac{1}{\log(1/(1-p_{l+1}))} \log \frac{L\, m_{T,l+1}\, q_{l+1}}{\delta} \quad \text{for } l \in \{0, ..., L-1\} \quad \text{and} \quad q_L = 1.$$
Then with probability $1 - \delta$ the random source network $f_S$ contains a weak LT $f_{WLT} = f_S(x; W_S \circ S_{LT})$.

Proof of Theorem 2.5. We now construct a source network $f_S(x)$ that contains a random subnetwork which replicates $f_T(x)$ with probability $1 - \delta$. As explained in Section 2.3, we first construct a univariate layer (with index $l = 0$) in the source network, assuming flow preservation. Next, we calculate the overparametrization factor required for every layer of the source network using the same argument as in Appendix A.3, starting from the last layer and working our way backwards. The output layer has the same number of neurons in both the source and the target, $n_{S,L} = n_{T,L}$; hence, the required width overparametrization factor is $q_L = 1$. In every intermediary layer, we create blocks of neurons that consist of $q_l$ replicates of the same target neuron. How large should $q_l$ be? In the second to last layer, the probability that each neuron of the output layer has at least one edge to each of the blocks of size $q_{L-1}$ in Layer $L-1$ is $\big(1 - (1-p_L)^{q_{L-1}}\big)^{m_{T,L} q_L}$.
We can similarly compute this probability for every layer, all the way to the input of the source network, which ensures that there is at least one edge between a neuron in every layer and each of the blocks of size $q_{l-1}$ in the previous layer. The probability that Layer $l$ can be constructed is thus $\big(1 - (1-p_l)^{q_{l-1}}\big)^{m_{T,l} q_l}$. These events are independent and should hold simultaneously with probability $1 - \delta$. The following inequality formalizes our argument:
$$\prod_{l=1}^{L} \big(1 - (1-p_l)^{q_{l-1}}\big)^{m_{T,l} q_l} \geq 1 - \delta.$$
One way to fulfill this inequality is to ensure that $\big(1 - (1-p_l)^{q_{l-1}}\big)^{m_{T,l} q_l} \geq (1-\delta)^{1/L}$ for each layer, and thus $1 - (1-p_l)^{q_{l-1}} \geq (1-\delta)^{1/(m_{T,l} q_l L)}$. This inequality is fulfilled if
$$1 - (1-p_l)^{q_{l-1}} \geq 1 - \frac{\delta}{m_{T,l}\, q_l\, L}.$$
Note that for convolutional layers, $m_{T,l}$ is the number of nonzero parameters of $W^{(l)}_T \in \mathbb{R}^{c_l \times c_{l-1} \times k_l}$. The following width overparametrization of the output channels of a convolutional network,
$$q_{l-1} \geq \frac{\log \frac{\delta}{m_{T,l} q_l L}}{\log(1-p_l)} = \frac{1}{\log(1/(1-p_l))} \log \frac{L\, m_{T,l}\, q_l}{\delta},$$
allows an ER network to contain a WLT with probability $1 - \delta$. The weights of the convolutional network can now be chosen using the trainability Assumption 2.3 as $w^{(l)}_{S,i'j'k} = w^{(l)}_{T,ijk}$ for every $i' \in \{q_l i, q_l i + 1, ..., q_l i + q_l\}$, where $j' \in \{q_{l-1} j, q_{l-1} j + 1, ..., q_{l-1} j + q_{l-1}\}$ is chosen randomly among all non-masked connections of $i'$ to block $j$, and the remaining connections are pruned away or set to zero. The biases are set as in the proof of Theorem 2.4.

(Figure caption) For each target output filter in $W^{(l)}_T$, we create $q_l$ copies in the source weight tensor $W^{(l)}_S$, as shown on the left (a). The width overparametrization is further elucidated in (b), where each filter element of a target output filter has $q_l$ independent copies in the source, at least one of which is nonzero (unmasked). Coloured squares in (b) show the nonzero parameters in the source ER network.
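The backward recursion for the factors $q_l$ in Theorem A.4 (starting from $q_L = 1$) can be sketched as follows; the function name and example layer sizes are our own:

```python
import math

def wlt_overparam_factors(m_t, p, delta):
    """Backward recursion from Theorem A.4: given per-layer parameter
    counts m_t = [m_T1, ..., m_TL] and ER probabilities p = [p_1, ..., p_L],
    compute the smallest integer factors q_0, ..., q_L with q_L = 1 and
    q_{l} >= log(L * m_T{l+1} * q_{l+1} / delta) / log(1/(1 - p_{l+1}))."""
    L = len(m_t)
    q = [1] * (L + 1)               # q[L] = 1: output widths match
    for l in range(L - 1, -1, -1):  # q[l] depends on layer l+1 (m_t[l], p[l])
        q[l] = math.ceil(
            math.log(L * m_t[l] * q[l + 1] / delta)
            / math.log(1 / (1 - p[l])))
    return q
```

Since $q_{l+1}$ only enters logarithmically, the factors grow slowly towards the input, mirroring the statement that each layer needs only logarithmic overparametrization.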
A.6 LOWER BOUND ON THE OVERPARAMETRIZATION OF ER NETWORKS

Our theoretical analysis suggests that ER networks require a width overparametrization by a factor of $1/\log(1/\text{sparsity})$ to exist as a LT. We also show that we cannot do substantially better than a width that is proportional to this factor.

Theorem A.6 (Lower bound on overparametrization in ER networks). There exist univariate target networks $f_T(x) = \phi(w_T^T x + b_T)$ that cannot be represented by a random 1-hidden-layer ER source network $f_S \in \text{ER}(p)$ with probability at least $1 - \delta$ if its width is
$$n_{S,1} < \frac{1}{\log(1/(1-p))} \log \frac{1}{1 - (1-\delta)^{1/d}}.$$

Proof. The main idea is to find the minimum width of a single hidden layer ER(p) network which can approximate a single-output target $f_T(x) = \phi(w_T^T x + b_T)$. This minimum is achieved when every target weight in $w_T$ is approximated by exactly one path in the ER network from the input to the output (through the hidden layer). We derive the probability that for every weight of the target there is at least one non-masked path in the ER source that can represent this weight, as shown in Figure 4. Bounding this probability gives a lower bound on the minimum width required for the ER network to be able to represent the target network. There are $n_{S,1}$ paths from an input neuron to an output neuron in the source network, and each of these paths exists independently with probability $p^2$, since both the input and output links of the path must be nonzero and each edge exists independently. Starting from the first input neuron, the probability that there is at least one path from input $x_i$ to the output is $1 - (1-p^2)^{n_{S,1}}$. The paths exist independently from each other if they start in different input neurons. Thus, the probability that we can represent an arbitrary target neuron with $d$ input neurons is $\big(1 - (1-p^2)^{n_{S,1}}\big)^d$. In order to find the minimum width required, we lower bound this probability as
$$\big(1 - (1-p^2)^{n_{S,1}}\big)^d \geq 1 - \delta.$$
Solving this inequality for $n_{S,1}$ proves the statement, since we would need
$$n_{S,1} \geq \frac{1}{\log(1/(1-p^2))} \log \frac{1}{1 - (1-\delta)^{1/d}} \geq \frac{1}{\log(1/(1-p))} \log \frac{1}{1 - (1-\delta)^{1/d}}.$$

A.7 EXPERIMENTAL SETUP

Our code base is built on code made available by the authors of Liu et al. (2021) and Tanaka et al. (2020). To train ER networks as weak lottery tickets for the different layerwise sparsity ratios, we use a learning rate of 0.1, decayed by a factor of 1/10 at epochs 80 and 120. We train with a batch size of 128 for 160 epochs using SGD with momentum 0.9 and weight decay 0.0005. The same hyperparameter setup is used for both ResNet18 and VGG16. For experiments on strong lottery tickets with edge popup, we use an iterative version of edge popup as described in (Fischer & Burkholz, 2022): we initialize a sparse network and anneal the sparsity iteratively while keeping the mask fixed. For ResNet18 we use a learning rate of 0.1 and anneal in 5 levels with 100 epochs per level. The batch size is 128 and we use SGD with momentum 0.9 and weight decay 0.0005. We report performances after one run for each of these experiments due to limited computation. In the DST experiments, we use the same setup as for the ER networks stated above and modify the mask every 100 iterations. For sparse-to-sparse training with DST, we use weight magnitude as the importance score for pruning (with prune rate 0.5) and the gradient for growth.
For the weak lottery ticket pruning baselines Iterative Synflow and IMP, we prune the network in 25 levels with a 30-epoch warm-up and train the final pruned network for 160 epochs with a learning rate of 0.001. The batch size is 256 and the learning rate is decayed by a factor of 0.1 at epochs 80 and 120 with the Adam optimizer (Kingma & Ba, 2015). We use the code base of the authors of Synflow (Tanaka et al., 2020) to search for SLTs in ER ResNet18, gradually annealing the sparsity of the ER network in 5 levels as proposed by (Fischer & Burkholz, 2022). The results are presented in Table 11. As a reference, we also report baseline results for dense networks in Table 12. Additional experiments for ER VGG16 on CIFAR10 are also reported. We have further performed experiments that compare ER networks with the state-of-the-art pruning algorithms Iterative Synflow (Tanaka et al., 2020) and Iterative Magnitude Pruning (Frankle & Carbin, 2019). We use the iterative version of Synflow (Fischer & Burkholz, 2022); the results are presented in Fig. 5. ER networks with layerwise sparsity ratios, which are chosen independently of the data with negligible computational overhead, are able to challenge and even outperform state-of-the-art pruning algorithms that require computationally highly demanding pruning-training iterations. This highlights the general effectiveness of random masks at moderate sparsity levels. While random masks outperform IMP consistently in our experiments, for extreme sparsities of 0.999, Iterative Synflow still performs best among all considered algorithms. Starting Synflow from a random mask instead of a complete network, however, could potentially save computational resources while achieving similar results; we leave this analysis for future work. To showcase the scalability of the suggested algorithms, we additionally report experiments for a larger model, ResNet110.

A.11.1 RESULTS FOR WLTS IN ER NETWORKS

We report results for different layerwise sparsities in Table 16. As a reference, we also report results for pruning with the baseline algorithm Iterative Magnitude Pruning in Table 17. We also conducted experiments with the Iterative Synflow algorithm (Tanaka et al., 2020); see Appendix A.10. In addition, we report experiments with different layerwise sparsity methods in ER networks for the Tiny Imagenet dataset. We use a VGG19 and a ResNet20 and show that our proposed layerwise sparsity methods for ER networks are competitive on this dataset. Note that we use the validation set provided by the creators of Tiny Imagenet (Russakovsky et al., 2015) as a test set to measure the generalization performance of our trained models. See Tables 18 and 19. We further report results where the model is initialized with an ER network of some initial sparsity and pruned to a final sparsity (initial → final) while the mask is modified with the RiGL (Evci et al., 2020a) algorithm. Notably, we observe that it is possible to start at a sparsity of up to 0.95 and still achieve a competitive test accuracy, only marginally worse than starting with a sparsity of 0.5.

A.14 VISUALIZING LAYERWISE SPARSITIES FOR ER NETWORKS

We report layerwise sparsity ratios for the proposed methods discussed in Section 3 in comparison to ERK, for VGG19 on CIFAR100.



$W^{(l)}_T$, $b^{(l)}_T$, $n_{T,l}$, $m_{T,l}$ are the weight matrix, bias, number of neurons, and number of nonzero parameters of the weight matrix in layer $l \in \{1, 2, \dots, L\}$. Note that this implies $m_l \le n_l\, n_{l-1}$. Similarly, $f_S$ has depth $L+1$ with parameters $\left(W^{(l)}_S, b^{(l)}_S, n_{S,l}, m_{S,l}\right)_{l=0}^{L}$.

Figure 1: LTs in ER networks: (a) shows a single-layer target network $f_T(x)$. (b) visualizes the source network $f_S(x)$, which contains a WLT. (c) shows a strong LT. For simplicity, the figure shows connections for only one neuron in every layer of $f_S$. Dotted and solid lines together form the random mask $S_{ER}$, while the solid lines alone correspond to the nonzero weights of the final LT ($S_{LT}$).

$$\frac{\sum_{l=1}^{L} p_l\, m_l}{\sum_{l=1}^{L} m_l} = p.$$
Given the architecture, we use a polynomial equation solver (Harris et al., 2020) to determine layerwise probabilities $p_l$ that satisfy this constraint.
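To illustrate, the constraint above can be solved with a standard polynomial root finder. The sketch below assumes the pyramidal parameterization $p_l = q^l$ for a base density $q \in (0,1)$; the function name and the exact parameterization are illustrative assumptions, not the paper's code:

```python
import numpy as np

def pyramidal_densities(m, p):
    """Solve sum_l m_l * q**l = p * sum_l m_l for the base density
    q in (0, 1), then set the layerwise density p_l = q**l.
    (Illustrative sketch of a pyramidal scheme.)"""
    # coefficients in low-to-high degree order: constant, q, q**2, ...
    coeffs = np.array([-p * sum(m)] + list(m), dtype=float)
    roots = np.polynomial.polynomial.polyroots(coeffs)
    q = next(r.real for r in roots if abs(r.imag) < 1e-9 and 0 < r.real < 1)
    return [q ** l for l in range(1, len(m) + 1)]

# example: three layers with parameter counts m_l, global density 0.5
print(pyramidal_densities([100, 200, 100], 0.5))
```

The left-hand side is monotone in $q$ on $(0,1)$, so exactly one admissible root exists and the weighted average of the returned densities recovers the global density $p$.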

Figure 2: Flow Comparison: We compare the results of ER networks for each layerwise sparsity method with and without flow preservation. Solid lines denote that flow is preserved while dotted lines show the corresponding method without flow preservation for a VGG16 on CIFAR10.

Figure 3: Construction of a convolutional target in an ER network: for every output channel $c_{T,l}$ in the target convolutional weight tensor $W^{(l)}_T$, the target filter is assembled from elements scattered across different input channels of the ER source network.

Figure 4: Lower bound on the width of an ER source network (right) required to approximate the target network (left) under the trainability Assumption 2.3. The solid edges in the source network on the right are the nonzero (unmasked) edges, while the dotted lines are masked away in an ER source network.

Figure 5: Comparing ER networks with Iterative Synflow and IMP: For both VGG16 and ResNet18 models, the average and confidence intervals over three runs are shown. The legend indicates the layerwise sparsity ratio used for masking in the ER network.

ER networks for Strong Lottery Tickets: Average results and 95% confidence intervals for training an ER ResNet18 network with edge popup (Ramanujan et al., 2020) on CIFAR10.

ER networks with different layerwise sparsities on CIFAR10 with VGG16. We compare test accuracies of our layerwise sparsity ratios, balanced and pyramidal, with the uniform baseline, ERK, and ER networks with layerwise sparsity ratios obtained by IMP, Iterative Synflow, and Snip (denoted by ER). Confidence intervals are reported in Appendix A.8.

A.1) cannot dramatically increase the network's expressiveness. Additional results with ResNets on CIFAR10 and CIFAR100 are reported in Appendix A.8. ER networks on CIFAR100 with VGG19: extending the comparison in Table 2 to test accuracies on the CIFAR100 dataset. See Appendix A.8 for confidence intervals.

ER networks rewired with DST: Test accuracies for an ER(p) VGG16 network initialized with sparsity $= 1-p$ (original) and after rewiring edges with RiGL (Evci et al., 2020a; Liu et al., 2021) (rewired) on CIFAR10. Confidence intervals are reported in Appendix A.13.



for these experiments. ER networks with different layerwise sparsities on CIFAR10 with ResNet18.

ER networks with different layerwise sparsities on CIFAR100 with ResNet50.

ER networks with different layerwise sparsities on CIFAR10 with VGG16. We compare our layerwise sparsity ratios, balanced and pyramidal, with the uniform baseline, ERK, and ER networks with layerwise sparsity ratios obtained by IMP, Iterative Synflow, and Snip.

A.9 ADDITIONAL EXPERIMENTS FOR STRONG LOTTERY TICKETS IN ER NETWORKS



ER networks with different layerwise sparsities on CIFAR100 with VGG19. We compare our layerwise sparsity ratios, balanced and pyramidal, with the uniform baseline, ERK, and ER networks with layerwise sparsity ratios obtained by IMP, Iterative Synflow, and Snip.

ER networks for Strong Lottery Tickets: Average results on training an ER ResNet18 network with edge popup (Ramanujan et al., 2020) on CIFAR10. The ER network is initialized with a uniform initial sparsity, which is gradually annealed to attain a SLT of the final sparsity (initial → final sparsity). Baseline results for initially dense networks are reported in Table 12.

Baseline for edge popup with ResNet18 on CIFAR10: The results for finding a SLT using edge popup starting from a dense network are shown. Our ER results starting from a sparse network are comparable to these baseline results, which validates the efficiency of ER networks.

ER networks for Strong Lottery Tickets: SLTs in VGG16 ER networks on CIFAR10. The ER network is initialized with a uniform initial sparsity and gradually annealed to attain a SLT of the final sparsity (initial → final sparsity).

Baseline for edge popup with VGG16 on CIFAR10: Baseline results of edge popup to obtain SLTs on CIFAR10 with VGG16.

A.9.1 SLTS IN ER NETWORKS FOR RESNET110 ON CIFAR100

We find SLTs within ER networks using the edge popup algorithm for a larger ResNet110 model, as reported in Table 15.

Test Acc. 61.91 ± 0.13 61.76 ± 0.53 61.78 ± 0.61

Edge popup (SLT) results on ER networks with ResNet110 on CIFAR100. Results are reported for one run due to limited compute.

A.10 COMPARING ER NETWORKS WITH ITERATIVE SYNFLOW AND IMP

We also attempted to prune a ResNet110 with Iterative Synflow, but the algorithm fails for such a large model.

Pyramidal (ours) 71.16 ± 0.22 69.56 ± 0.31 63.23 ± 1.29 52.37 ± 0.51
ERK 70.76 ± 0.82 69.96 ± 0.58 68.14 ± 0.34 64.92 ± 0.31

Results for WLTs in ER networks on CIFAR100 with ResNet110. Averages and standard deviations are reported across three runs.

Results on CIFAR100 with ResNet110 pruned with the iterative magnitude pruning (IMP) algorithm for reference. Only one run of IMP was performed for each of these sparsities.

A.12 EXPERIMENTS WITH TINY IMAGENET FOR WLTS

Pyramidal (ours) 58.92 ± 0.12 58.46 ± 0.15 58.08 ± 0.05 41.06 ± 0.28

Results for ER networks on Tiny Imagenet with VGG19.

Results for ER networks on Tiny Imagenet with ResNet20.

A.13 DYNAMICAL SPARSE TRAINING ON ER NETWORKS

In addition to the rewiring experiments shown in Table 4, we use Dynamical Sparse Training to prune an already sparse ER network to a higher sparsity and test whether this can achieve the same performance as performing DST starting from a denser network. Similar experiments have been shown by Liu et al. (2021); however, we report results on ER networks starting at much higher sparsities. Our results, shown in Table 21, are able to match the performance of Liu et al. (2021) while being more efficient, as we start at a higher sparsity.

± 0.14 84.53 ± 0.20 88.28 ± 0.52
Balanced 89.31 ± 0.11 91.41 ± 0.43 85.91 ± 0.40 89.30 ± 0.03
Pyramidal 90.41 ± 0.03 91.97 ± 0.08 87.76 ± 0.13 90.61 ± 0.15

ER networks rewired with DST: An ER(p) VGG16 network with sparsity $= 1-p$ is initialized and the mask is modified by rewiring edges with RiGL on CIFAR10.

Balanced (ours) 93.08 ± 0.01 92.75 ± 0.25 92.70 ± 0.10
Pyramidal (ours) 93.13 ± 0.05 92.93 ± 0.08 92.58 ± 0.21
ERK 92.94 ± 0.12 92.77 ± 0.01 92.47 ± 0.11
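A single rewiring step of the kind used in these DST experiments, with magnitude-based pruning at rate 0.5 and gradient-based growth as described in Appendix A.7, can be sketched as follows. This is an illustrative re-implementation of a RiGL-style update, not the authors' code:

```python
import numpy as np

def rigl_step(weights, grads, mask, prune_rate=0.5):
    """One RiGL-style rewiring step (sketch): drop the fraction
    `prune_rate` of active weights with smallest magnitude, then grow
    the same number of masked connections with largest gradient
    magnitude, keeping the overall sparsity constant."""
    active = np.flatnonzero(mask)
    inactive = np.flatnonzero(mask == 0)
    n_rewire = int(prune_rate * active.size)
    # prune: smallest-|w| active connections
    drop = active[np.argsort(np.abs(weights[active]))[:n_rewire]]
    # grow: largest-|grad| inactive connections
    grow = inactive[np.argsort(-np.abs(grads[inactive]))[:n_rewire]]
    new_mask = mask.copy()
    new_mask[drop] = 0
    new_mask[grow] = 1
    return new_mask
```

Because the number of dropped and grown connections is identical, the mask's sparsity is preserved across steps, which is what allows training to start from an already sparse ER network.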

Sparse-to-sparse training with DST: The final test accuracy for VGG16 on CIFAR10 is reported.


Solving for $q_{l-1}$ leads to
$$q_{l-1} \ge \frac{\log\frac{\delta}{m_{T,l}\, q_l\, L}}{\log(1-p_l)} = \frac{1}{\log(1/(1-p_l))}\,\log\frac{L\, m_{T,l}\, q_l}{\delta}.$$
We can thus compute the required width overparametrization for every layer starting from the last one, where we know $q_L = 1$. Note that $q_l$ depends on the logarithm of $q_{l+1}$ of the next layer, which ensures that $q_l$ does not blow up as depth increases. After making sure that the required edges exist in the ER network to represent every target weight, we still have to derive concrete parameter choices. It then follows from the trainability Assumption 2.3 that these choices (or equally good ones) can be found, which proves the existence of WLTs in ER networks. As in the single-hidden-layer case, each representative link in the ER source network is assigned the weight of its corresponding target, and the weights in the first univariate layer are set to 1. The biases in the univariate layer are chosen so that all inputs pass through the ReLU activation; the biases in the next layer compensate for these additional biases. For the subsequent layers of the source network, $l \in \{1, 2, \dots, L\}$, the weights are $w^{(l)}_{S,i'j'} = w^{(l)}_{T,ij}$, where $j'$ is randomly chosen among all the non-masked connections of $i'$ to block $j$, with $j' \in \{q_{l-1} j, q_{l-1} j + 1, \dots, q_{l-1} j + q_{l-1}\}$ and $i' \in \{q_l i, q_l i + 1, \dots, q_l i + q_l\}$. The remaining connections between block $j$ and $i'$ can be pruned away or their weight parameters set to zero. The biases are set to the corresponding target bias for layers $l \in \{2, \dots, L\}$: $b^{(l)}_{S,i'} = b^{(l)}_{T,i}$ for all $i' \in \{q_l i, q_l i + 1, \dots, q_l i + q_l\}$. The second layer $l = 1$ has an additional term to compensate for the bias in the first (univariate) layer: for all $i' \in \{q_1 i, q_1 i + 1, \dots, q_1 i + q_1\}$, $b^{(1)}_{S,i'} = b^{(1)}_{T,i} - \sum_j w^{(1)}_{T,ij}\, b^{(0)}_{S,j}$.

A.5 EXISTENCE FOR CONVOLUTIONAL LAYERS

Theorem A.5 (Existence of WLTs for ER networks with convolutional layers). Under Assumption 2.3, given a target network $f_T$ of depth $L$ with convolutional layers $h^{(l)}$, with probability $1-\delta$ the source network $f_S$ contains a weak LT $f_{WLT}$.

Proof: The linearity of convolutions allows us to construct a target filter by combining elements that are scattered across different input channels in the ER source network, as shown in Figure 3. Using the same argument as in the fully-connected case, we bound the probability that at least one of the $q_{l-1}$ channels of every filter element in a convolutional weight tensor has a non-masked entry to a channel in the next layer. As for fully-connected networks, we can create blocks of channels that correspond to replicates of the same target channel. The first layer can be pruned down to univariate convolutional filters. The probability that each layer can thus be reconstructed in the convolutional network can be bounded as
$$\left(1-(1-p_l)^{q_{l-1}}\right)^{m_{T,l}\, q_l} \ge 1-\delta.$$
Here, a target sparsity of 0.9 means that 10% of the parameters are retained.
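The linearity argument at the heart of this proof can be demonstrated directly: a convolution with a target filter equals the sum of convolutions with single-element filters, each of which may live in a different replicated source channel. A minimal 1-D sketch (helper names are ours):

```python
import numpy as np

def conv1d_valid(x, w):
    """Plain 1-D valid cross-correlation used for the check."""
    k = len(w)
    return np.array([np.dot(x[i:i + k], w) for i in range(len(x) - k + 1)])

# Split a target filter w into single-element filters, convolve each
# separately, and sum the results, mirroring the construction in Figure 3.
rng = np.random.default_rng(0)
x = rng.standard_normal(16)
w = np.array([0.3, -1.2, 0.7])
parts = []
for j, w_j in enumerate(w):
    e = np.zeros_like(w)
    e[j] = w_j  # one nonzero filter element, as in one source channel
    parts.append(conv1d_valid(x, e))
print(np.allclose(sum(parts), conv1d_valid(x, w)))
```

Because the partial convolutions sum exactly to the target convolution, scattering the filter elements across source channels loses no expressiveness.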

