ROBUST PRUNING AT INITIALIZATION

Abstract

Overparameterized Neural Networks (NN) display state-of-the-art performance. However, there is a growing need for smaller, energy-efficient neural networks that make it possible to run machine learning applications on devices with limited computational resources. A popular approach consists of using pruning techniques. While these techniques have traditionally focused on pruning pre-trained NN (LeCun et al., 1990; Hassibi et al., 1993), recent work by Lee et al. (2018) has shown promising results when pruning at initialization. However, for deep NNs, such procedures remain unsatisfactory as the resulting pruned networks can be difficult to train and, for instance, they do not prevent one layer from being fully pruned. In this paper, we provide a comprehensive theoretical analysis of magnitude- and gradient-based pruning at initialization and of the training of sparse architectures. This allows us to propose novel principled approaches which we validate experimentally on a variety of NN architectures.

1. INTRODUCTION

Overparameterized deep NNs have achieved state-of-the-art (SOTA) performance in many tasks (Nguyen and Hein, 2018; Du et al., 2019; Zhang et al., 2016; Neyshabur et al., 2019). However, it is impractical to implement such models on small devices such as mobile phones. To address this problem, network pruning is widely used to reduce the time and space requirements both at training and test time. The main idea is to identify weights that do not contribute significantly to the model performance based on some criterion, and to remove them from the NN. However, most pruning procedures currently available can only be applied after having trained the full NN (LeCun et al., 1990; Hassibi et al., 1993; Mozer and Smolensky, 1989; Dong et al., 2017), although methods that prune the NN during training have become available. For example, Louizos et al.
(2018) propose an algorithm which adds an L0 regularization on the weights to enforce sparsity, while Carreira-Perpiñán and Idelbayev (2018); Alvarez and Salzmann (2017); Li et al. (2020) propose the inclusion of compression inside training steps. Other pruning variants consider training a secondary network that learns a pruning mask for a given architecture (Li et al. (2020); Liu et al. (2019)). Recently, Frankle and Carbin (2019) have introduced and validated experimentally the Lottery Ticket Hypothesis, which conjectures the existence of a sparse subnetwork that achieves similar performance to the original NN. These empirical findings have motivated the development of pruning at initialization, such as SNIP (Lee et al. (2018)), which demonstrated performance similar to classical pruning-after-training methods. Importantly, pruning at initialization never requires training the complete NN and is thus more memory efficient, allowing one to train deep NN using limited computational resources. However, such techniques may suffer from different problems. In particular, nothing prevents such methods from pruning one whole layer of the NN, making it untrainable. More generally, it is typically difficult to train the resulting pruned NN (Li et al., 2018). Lee et al. (2020) try to tackle this issue by enforcing dynamical isometry using orthogonal weights, while Wang et al. (2020) (GraSP) use Hessian based pruning to preserve gradient flow. Other work by Tanaka et al. (2020) considers a data-agnostic iterative approach using the concept of synaptic flow in order to avoid the layer-collapse phenomenon (pruning a whole layer). In our work, we use principled scaling and re-parameterization to solve this issue, and show numerically that our algorithm achieves SOTA performance on CIFAR10, CIFAR100, TinyImageNet and ImageNet in some scenarios and remains competitive in others.



In this paper, we provide novel algorithms for Sensitivity-Based Pruning (SBP), i.e. pruning schemes that prune a weight $W$ based on the magnitude of $|W \frac{\partial \mathcal{L}}{\partial W}|$ at initialization, where $\mathcal{L}$ is the loss. Experimentally, compared to other available one-shot pruning schemes, these algorithms provide state-of-the-art results (this might not be true in some regimes). Our work is motivated by a new theoretical analysis of gradient back-propagation relying on the mean-field approximation of deep NN (Hayou et al., 2019; Schoenholz et al., 2017; Poole et al., 2016; Yang and Schoenholz, 2017; Xiao et al., 2018; Lee et al., 2018; Matthews et al., 2018). Our contribution is threefold:

• For deep fully connected FeedForward NN (FFNN) and Convolutional NN (CNN), it has been previously shown that only an initialization on the so-called Edge of Chaos (EOC) makes models trainable; see e.g. (Schoenholz et al., 2017; Hayou et al., 2019). For such models, we show that an EOC initialization is also necessary for SBP to be efficient. Outside this regime, one layer can be fully pruned.

• For these models, pruning pushes the NN out of the EOC, making the resulting pruned model difficult to train. We introduce a simple rescaling trick to bring the pruned model back into the EOC regime, making the pruned NN easily trainable.

• Unlike FFNN and CNN, we show that Resnets are better suited for pruning at initialization since they 'live' on the EOC by default (Yang and Schoenholz, 2017). However, they can suffer from exploding gradients, which we resolve by introducing a re-parameterization called 'Stable Resnet' (SR). The performance of the resulting SBP-SR pruning algorithm is illustrated in Table 1: SBP-SR allows for pruning up to 99.5% of ResNet104 on CIFAR10 while still retaining around 87% test accuracy.

The precise statements and proofs of the theoretical results are given in the Supplementary.
Appendix H also includes the proof of a weak version of the Lottery Ticket Hypothesis (Frankle and Carbin, 2019) showing that, starting from a randomly initialized NN, there exists a subnetwork initialized on the EOC.

2. SENSITIVITY PRUNING FOR FFNN/CNN AND THE RESCALING TRICK

2.1. SETUP AND NOTATIONS

Let $x$ be an input in $\mathbb{R}^d$. A NN of depth $L$ is defined by

$$y^l(x) = F_l(W^l, y^{l-1}(x)) + B^l, \quad 1 \le l \le L,$$

where $y^l(x)$ is the vector of pre-activations, $W^l$ and $B^l$ are respectively the weights and bias of the $l$-th layer, and $F_l$ is a mapping that defines the nature of the layer. The weights and bias are initialized with $W^l \overset{iid}{\sim} \mathcal{N}(0, \sigma_w^2/v_l)$, where $v_l$ is a scaling factor used to control the variance of $y^l$, and $B^l \overset{iid}{\sim} \mathcal{N}(0, \sigma_b^2)$. Hereafter, $M_l$ denotes the number of weights in the $l$-th layer, $\phi$ the activation function, and $[m:n] := \{m, m+1, \dots, n\}$ for $m \le n$. Two examples of such architectures are:

• Fully connected FFNN. For a FFNN of depth $L$ and widths $(N_l)_{0 \le l \le L}$, we have $v_l = N_{l-1}$, $M_l = N_{l-1} N_l$ and

$$y_i^1(x) = \sum_{j=1}^{d} W_{ij}^1 x_j + B_i^1, \qquad y_i^l(x) = \sum_{j=1}^{N_{l-1}} W_{ij}^l \phi(y_j^{l-1}(x)) + B_i^l \quad \text{for } l \ge 2. \qquad (2)$$

• CNN. For a 1D CNN of depth $L$, number of channels $(n_l)_{l \le L}$, and number of neurons per channel $(N_l)_{l \le L}$, we have

$$y_{i,\alpha}^1(x) = \sum_{j=1}^{n_0} \sum_{\beta \in ker_1} W_{i,j,\beta}^1 x_{j,\alpha+\beta} + b_i^1, \qquad y_{i,\alpha}^l(x) = \sum_{j=1}^{n_{l-1}} \sum_{\beta \in ker_l} W_{i,j,\beta}^l \phi(y_{j,\alpha+\beta}^{l-1}(x)) + b_i^l \quad \text{for } l \ge 2, \qquad (3)$$

where $i \in [1:n_l]$ is the channel index, $\alpha \in [0:N_l-1]$ is the neuron location, $ker_l = [-k_l : k_l]$ is the filter range, and $2k_l + 1$ is the filter size. To simplify the analysis, we assume hereafter that $N_l = N$ and $k_l = k$ for all $l$. Here, we have $v_l = n_{l-1}(2k+1)$ and $M_l = n_{l-1} n_l (2k+1)$. We assume periodic boundary conditions, so $y_{i,\alpha}^l = y_{i,\alpha+N}^l = y_{i,\alpha-N}^l$. Generalization to multidimensional convolutions is straightforward.

When no specific architecture is mentioned, $(W_i^l)_{1 \le i \le M_l}$ denotes the weights of the $l$-th layer. In practice, a pruning algorithm creates a binary mask $\delta$ over the weights to force the pruned weights to be zero. The neural network after pruning is given by

$$y^l(x) = F_l(\delta^l \odot W^l, y^{l-1}(x)) + B^l, \qquad (4)$$

where $\odot$ is the Hadamard (i.e. element-wise) product. In this paper, we focus on pruning at initialization.
The mask is typically created by using a vector $g^l$ of the same dimension as $W^l$, obtained via a mapping of choice (see below); we then prune the network by keeping the weights that correspond to the top $k$ values in the sequence $(g_i^l)_{i,l}$, where $k$ is fixed by the sparsity that we want to achieve. There are three popular types of criteria in the literature:

• Magnitude based pruning (MBP): We prune weights based on the magnitude $|W|$.

• Sensitivity based pruning (SBP): We prune the weights based on the values of $|W \frac{\partial \mathcal{L}}{\partial W}|$, where $\mathcal{L}$ is the loss. This is motivated by the first-order approximation $\mathcal{L}(W) \approx \mathcal{L}(W{=}0) + W \frac{\partial \mathcal{L}}{\partial W}$ used in SNIP (Lee et al. (2018)).

• Hessian based pruning (HBP): We prune the weights based on some function that uses the Hessian of the loss function, as in GraSP (Wang et al., 2020).

In the remainder of the paper, we focus exclusively on SBP, while our analysis of MBP is given in Appendix E. We leave HBP for future work. However, we include empirical results with GraSP (Wang et al., 2020) in Section 4. Hereafter, we denote by $s$ the sparsity, i.e. the fraction of weights we want to prune. Let $A_l$ be the set of indices of the weights in the $l$-th layer that are pruned, i.e. $A_l = \{i \in [1:M_l] \text{ s.t. } \delta_i^l = 0\}$. We define the critical sparsity $s_{cr}$ by

$$s_{cr} = \min\{s \in (0,1) \text{ s.t. } \exists l, \ |A_l| = M_l\},$$

where $|A_l|$ is the cardinality of $A_l$. Intuitively, $s_{cr}$ represents the maximal sparsity we are allowed to choose without fully pruning at least one layer. $s_{cr}$ is random as the weights are initialized randomly. Thus, we study the behaviour of the expected value $\mathbb{E}[s_{cr}]$, where, hereafter, all expectations are taken w.r.t. the random initial weights. This provides theoretical guidelines for pruning at initialization. For all $l \in [1:L]$, we define $\alpha_l$ by $v_l = \alpha_l N$ where $N > 0$, and $\zeta_l > 0$ such that $M_l = \zeta_l N^2$, where we recall that $v_l$ is a scaling factor controlling the variance of $y^l$ and $M_l$ is the number of weights in the $l$-th layer. This notation assumes that, in each layer, the number of weights is quadratic in the number of neurons, which is satisfied by classical FFNN and CNN architectures.
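The global top-$k$ masking step described above can be sketched in a few lines. The following minimal NumPy example (names and shapes are ours, purely illustrative) builds per-layer binary masks by thresholding saliency scores globally across all layers:

```python
import numpy as np

def global_topk_masks(scores, sparsity):
    """Build per-layer binary masks keeping the top (1 - sparsity) fraction of
    saliency scores globally (across all layers), as in one-shot pruning."""
    flat = np.concatenate([s.ravel() for s in scores])
    k = int((1.0 - sparsity) * flat.size)          # number of weights kept
    threshold = np.sort(flat)[-k] if k > 0 else np.inf  # k-th largest score
    return [(s >= threshold).astype(np.float32) for s in scores]

# toy usage: two "layers" of saliency scores, e.g. |W * dL/dW| for SBP
rng = np.random.default_rng(0)
scores = [np.abs(rng.normal(size=(4, 4))), np.abs(rng.normal(size=(4, 4)))]
masks = global_topk_masks(scores, sparsity=0.5)
kept = sum(int(m.sum()) for m in masks)
print(kept)  # 16 of the 32 weights are kept
```

Because the threshold is global rather than per layer, nothing prevents all surviving weights from concentrating in a few layers; this is exactly the layer-collapse risk that motivates the analysis of $s_{cr}$ above.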

2.2. SENSITIVITY-BASED PRUNING (SBP)

SBP is a data-dependent pruning method that uses the data to compute the gradient with back-propagation at initialization (one-shot pruning). We randomly sample a batch and compute the gradients of the loss with respect to each weight. The mask is then defined by $\delta_i^l = \mathbb{1}\big(|W_i^l \frac{\partial \mathcal{L}}{\partial W_i^l}| \ge t_s\big)$, where $t_s = |W \frac{\partial \mathcal{L}}{\partial W}|_{(k_s)}$, $k_s = (1-s) \sum_l M_l$, and $|W \frac{\partial \mathcal{L}}{\partial W}|_{(k_s)}$ is the $k_s$-th largest value of the sequence $(|W_i^l \frac{\partial \mathcal{L}}{\partial W_i^l}|)_{1 \le l \le L, \, 1 \le i \le M_l}$. However, this simple approach suffers from the well-known exploding/vanishing gradients problem, which renders the first/last few layers, respectively, susceptible to being completely pruned. We give a formal definition of this problem.

Definition 1 (Well-conditioned & ill-conditioned NN). Let $m_l = \mathbb{E}[|W_1^l \frac{\partial \mathcal{L}}{\partial W_1^l}|^2]$ for $l \in [1:L]$. We say that the NN is well-conditioned if there exist $A, B > 0$ such that for all $L \ge 1$ and $l \in [1:L]$ we have $A \le m_l / m_L \le B$, and it is ill-conditioned otherwise.

Understanding the behaviour of gradients at initialization is thus crucial for SBP to be efficient. Using a mean-field approach, such an analysis has been carried out in (Schoenholz et al., 2017; Hayou et al., 2019; Xiao et al., 2018; Poole et al., 2016; Yang, 2019), where it has been shown that an initialization known as the EOC is beneficial for DNN training. The mean-field analysis of DNNs relies on two standard approximations that we will also use here.

Approximation 1 (Mean-Field Approximation). When $N_l \gg 1$ for FFNN or $n_l \gg 1$ for CNN, we use the approximation of infinitely wide NN. This means an infinite number of neurons per layer for fully connected layers and an infinite number of channels per layer for convolutional layers.

Approximation 2 (Gradient Independence). The weights used for forward propagation are independent from those used for back-propagation.

These two approximations are ubiquitous in the literature on the mean-field analysis of neural networks.
They have been used to derive theoretical results on signal propagation (Schoenholz et al., 2017; Hayou et al., 2019; Poole et al., 2016; Yang, 2019; Yang and Schoenholz, 2017; Yang et al., 2019) and are also key tools in the derivation of the Neural Tangent Kernel (Jacot et al., 2018; Arora et al., 2019; Hayou et al., 2020). Approximation 1 simplifies the analysis of the forward propagation as it allows the derivation of closed-form formulas for covariance propagation. Approximation 2 does the same for back-propagation. See Appendix A for a detailed discussion of these approximations. Throughout the paper, we provide numerical results that substantiate the theoretical results derived under these two approximations, and show that they lead to an excellent match between theory and experiment.

Edge of Chaos (EOC): For inputs $x, x'$, let $c^l(x, x')$ be the correlation between $y^l(x)$ and $y^l(x')$. From (Schoenholz et al., 2017; Hayou et al., 2019), there exists a so-called correlation function $f$ depending on $(\sigma_w, \sigma_b)$ such that $c^{l+1}(x, x') = f(c^l(x, x'))$. Let $\chi(\sigma_b, \sigma_w) = f'(1)$. The EOC is the set of hyperparameters $(\sigma_w, \sigma_b)$ satisfying $\chi(\sigma_b, \sigma_w) = 1$. When $\chi(\sigma_b, \sigma_w) > 1$, we are in the Chaotic phase: the gradient explodes, $c^l(x, x')$ converges exponentially to some $c < 1$ for $x \ne x'$, and the resulting output function is discontinuous everywhere. When $\chi(\sigma_b, \sigma_w) < 1$, we are in the Ordered phase, where $c^l(x, x')$ converges exponentially fast to 1 and the NN outputs constant functions. Initialization on the EOC allows for better information propagation (see Supplementary for more details). Hence, by leveraging the above results, we show that an initialization outside the EOC will lead to an ill-conditioned NN.

Theorem 1 (EOC Initialization is crucial for SBP). Consider a NN of type (2) or (3) (FFNN or CNN). Assume $(\sigma_w, \sigma_b)$ are chosen in the ordered phase, i.e.
$\chi(\sigma_b, \sigma_w) < 1$. Then the NN is ill-conditioned. Moreover, we have

$$\mathbb{E}[s_{cr}] \le \frac{1}{L}\Big(1 + \frac{\log(\kappa L N^2)}{\kappa}\Big) + O\Big(\frac{1}{\kappa^2 \sqrt{L} N^2}\Big),$$

where $\kappa = |\log \chi(\sigma_b, \sigma_w)|/8$. If $(\sigma_w, \sigma_b)$ are on the EOC, i.e. $\chi(\sigma_b, \sigma_w) = 1$, then the NN is well-conditioned. In this case, $\kappa = 0$ and the above upper bound no longer holds.

The proof of Theorem 1 relies on the behaviour of the gradient norm at initialization. In the ordered phase, the gradient norm vanishes exponentially quickly as it back-propagates, resulting in an ill-conditioned network. We use another approximation for the sake of simplifying the proof (Approximation 3 in the Supplementary), but the result holds without this approximation, although the resulting constants would be slightly different. Theorem 1 shows that the upper bound decreases the farther $\chi(\sigma_b, \sigma_w)$ is from 1, i.e. the farther the initialization is from the EOC. For a constant-width FFNN with $L = 100$, $N = 100$ and $\kappa = 0.2$, the theoretical upper bound is $\mathbb{E}[s_{cr}] \lesssim 27\%$, while we obtain $\mathbb{E}[s_{cr}] \approx 22\%$ based on 10 simulations. A similar result can be obtained when the NN is initialized in the chaotic phase; in this case too, the NN is ill-conditioned. To illustrate these results, Figure 1 shows the impact of the initialization with sparsity $s = 70\%$. The dark area in Figure 1(b) corresponds to layers that are fully pruned in the chaotic phase due to exploding gradients. Using an EOC initialization, Figure 1(a) shows that pruned weights are well distributed in the NN, ensuring that no layer is fully pruned.
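The quantity $\chi(\sigma_b, \sigma_w) = f'(1)$ can be estimated numerically. The sketch below (our own illustration, not the paper's code) iterates the variance recursion $q \leftarrow \sigma_b^2 + \sigma_w^2\, \mathbb{E}[\phi(\sqrt{q} Z)^2]$ to its fixed point and then evaluates $\chi = \sigma_w^2\, \mathbb{E}[\phi'(\sqrt{q} Z)^2]$, a standard mean-field formula, with Gauss-Hermite quadrature. For ReLU, $(\sigma_w, \sigma_b) = (\sqrt{2}, 0)$ lies on the EOC:

```python
import numpy as np

def chi(phi, phi_prime, sigma_w, sigma_b, n_iter=100, n_quad=80):
    """Estimate chi(sigma_b, sigma_w) = sigma_w^2 E[phi'(sqrt(q) Z)^2], with q the
    fixed point of q = sigma_b^2 + sigma_w^2 E[phi(sqrt(q) Z)^2], Z ~ N(0, 1)."""
    x, w = np.polynomial.hermite.hermgauss(n_quad)   # nodes/weights for weight e^{-x^2}
    gauss = lambda f: (w * f(np.sqrt(2.0) * x)).sum() / np.sqrt(np.pi)  # E[f(Z)]
    q = 1.0
    for _ in range(n_iter):                          # iterate the variance map to its fixed point
        q = sigma_b**2 + sigma_w**2 * gauss(lambda z: phi(np.sqrt(q) * z)**2)
    return sigma_w**2 * gauss(lambda z: phi_prime(np.sqrt(q) * z)**2)

relu = lambda z: np.maximum(z, 0.0)
relu_prime = lambda z: (z > 0).astype(float)
print(chi(relu, relu_prime, np.sqrt(2.0), 0.0))  # ≈ 1.0: (sqrt(2), 0) is on the ReLU EOC
```

Values below 1 indicate the ordered phase (vanishing gradients) and values above 1 the chaotic phase, matching the classification in the text.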

2.3. TRAINING PRUNED NETWORKS USING THE RESCALING TRICK

We have shown previously that an initialization on the EOC is crucial for SBP. However, we have not yet addressed the key problem of training the resulting pruned NN. This can be very challenging in practice (Li et al., 2018), especially for deep NN. Consider as an example a FFNN architecture. After pruning, we have for an input $x$

$$\hat{y}_i^l(x) = \sum_{j=1}^{N_{l-1}} W_{ij}^l \delta_{ij}^l \phi(\hat{y}_j^{l-1}(x)) + B_i^l, \quad \text{for } l \ge 2,$$

where $\delta$ is the pruning mask. While the original NN initialized on the EOC satisfied $c^{l+1}(x, x') = f(c^l(x, x'))$ with $f'(1) = \chi(\sigma_b, \sigma_w) = 1$, the pruned architecture leads to $\hat{c}^{l+1}(x, x') = f_{pruned}(\hat{c}^l(x, x'))$ with $f'_{pruned}(1) \ne 1$; hence pruning destroys the EOC. Consequently, the pruned NN will be difficult to train (Schoenholz et al., 2017; Hayou et al., 2019), especially if it is deep. Hence, we propose to bring the pruned NN back onto the EOC. This approach consists of rescaling the weights obtained after SBP in each layer by factors that depend on the pruned architecture itself.

Proposition 1 (Rescaling Trick). Consider a NN of type (2) or (3) (FFNN or CNN) initialized on the EOC. Then, after pruning, the pruned NN is no longer initialized on the EOC. However, the rescaled pruned NN

$$y^l(x) = F_l(\rho^l \odot \delta^l \odot W^l, y^{l-1}(x)) + B^l, \quad \text{for } l \ge 1,$$

where

$$\rho_{ij}^l = \big(\mathbb{E}[N_{l-1} (W_{i1}^l)^2 \delta_{i1}^l]\big)^{-1/2} \ \text{for FFNN}, \qquad \rho_{i,j,\beta}^l = \big(\mathbb{E}[n_{l-1} (W_{i,1,\beta}^l)^2 \delta_{i,1,\beta}^l]\big)^{-1/2} \ \text{for CNN}, \qquad (7)$$

is initialized on the EOC (the scaling is constant across $j$). The scaling factors in equation 7 are easily approximated using the weights kept after pruning. Algorithm 1 (see Appendix I) details a practical implementation of this rescaling technique for FFNN. We illustrate experimentally the benefits of this approach in Section 4.
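A minimal empirical version of the rescaling trick for one fully connected layer might look as follows. This is a sketch under our own normalization convention, not the paper's Algorithm 1: each output unit's masked row is rescaled so that its second moment matches its unpruned value, which is one way to approximate the factors of equation 7 from the surviving weights:

```python
import numpy as np

def rescale_pruned_rows(W, mask):
    """Sketch of the rescaling trick: after pruning, multiply each output unit i
    by a constant rho_i (constant across incoming index j) chosen so that the
    surviving row recovers its pre-pruning second moment."""
    total = (W**2).sum(axis=1)                    # unpruned second moment per row
    kept = (W**2 * mask).sum(axis=1)              # surviving second moment per row
    rho = np.sqrt(total / np.maximum(kept, 1e-12))
    return rho[:, None] * mask * W

rng = np.random.default_rng(1)
W = rng.normal(0.0, 1.0 / np.sqrt(256), size=(256, 256))   # EOC-style init, sigma_w = 1
mask = (rng.random(W.shape) > 0.7).astype(float)           # prune roughly 70% of weights
W_rescaled = rescale_pruned_rows(W, mask)
# row-wise second moment of the rescaled sparse layer matches the dense one
print(np.allclose((W_rescaled**2).sum(axis=1), (W**2).sum(axis=1)))  # True
```

Restoring the per-unit variance is what pushes the correlation map of the pruned network back towards the $f'(1) = 1$ regime of the original initialization.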

3. SENSITIVITY-BASED PRUNING FOR STABLE RESIDUAL NETWORKS

Resnets and their variants (He et al., 2015; Huang et al., 2017) are currently the best performing models on various classification tasks (CIFAR10, CIFAR100, ImageNet, etc. (Kolesnikov et al., 2019)). Thus, understanding Resnet pruning at initialization is of crucial interest. Yang and Schoenholz (2017) showed that Resnets naturally 'live' on the EOC. Using this result, we show that Resnets are actually better suited to SBP than FFNN and CNN. However, Resnets suffer from an exploding gradient problem (Yang and Schoenholz, 2017), which might affect the performance of SBP. We address this issue by introducing a new Resnet parameterization. Let a standard Resnet architecture be given by

$$y^1(x) = F(W^1, x), \qquad y^l(x) = y^{l-1}(x) + F(W^l, y^{l-1}(x)), \quad \text{for } l \ge 2, \qquad (8)$$

where $F$ defines the blocks of the Resnet. Hereafter, we assume that $F$ is either of the form (2) or (3) (FFNN or CNN). The next theorem shows that Resnets are well-conditioned independently of the initialization and are thus well suited for pruning at initialization.

Theorem 2 (Resnets are Well-Conditioned). Consider a Resnet with either Fully Connected or Convolutional layers and ReLU activation function. Then for all $\sigma_w > 0$, the Resnet is well-conditioned. Moreover, for all $l \in \{1, \dots, L\}$, we have $m_l = \Theta\big((1 + \frac{\sigma_w^2}{2})^L\big)$.

The above theorem proves that Resnets are always well-conditioned. However, taking a closer look at $m_l$, which represents the variance of the pruning criterion (Definition 1), we see that it grows exponentially in the number of layers $L$. This could lead to a 'higher variance of pruned networks' and hence high-variance test accuracy. To this end, we propose a Resnet parameterization which we call Stable Resnet. Stable Resnets prevent the second moment from growing exponentially, as shown below.

Proposition 2 (Stable Resnet). Consider the following Resnet parameterization:

$$y^l(x) = y^{l-1}(x) + \frac{1}{\sqrt{L}} F(W^l, y^{l-1}(x)), \quad \text{for } l \ge 2, \qquad (9)$$

Then the NN is well-conditioned for all $\sigma_w > 0$.
Moreover, for all $l \le L$, we have $m_l = \Theta(L^{-1})$.

In Proposition 2, $L$ is not the number of layers but the number of blocks (Hayou et al., 2021). In the next proposition, we establish that, unlike FFNN or CNN, we do not need to rescale the pruned Resnet for it to be trainable, as it lives naturally on the EOC before and after pruning.

Proposition 3 (Resnets live on the EOC even after pruning). Consider a Residual NN with blocks of type FFNN or CNN. Then, after pruning, the pruned Residual NN is initialized on the EOC.
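The effect of the $1/\sqrt{L}$ Stable Resnet scaling on forward variance can be checked with a toy simulation. The sketch below is our own illustration with fully connected ReLU blocks; the width, depth and $\sigma_w$ are arbitrary choices, not the paper's experimental settings:

```python
import numpy as np

def resnet_forward_variance(L, width=512, sigma_w=np.sqrt(2.0), stable=False, seed=0):
    """Propagate one input through L randomly initialized ReLU residual blocks
    and return the pre-activation variance at the output (toy FFNN blocks)."""
    rng = np.random.default_rng(seed)
    y = rng.normal(size=width)
    scale = 1.0 / np.sqrt(L) if stable else 1.0     # Stable Resnet branch scaling
    for _ in range(L):
        W = rng.normal(0.0, sigma_w / np.sqrt(width), size=(width, width))
        y = y + scale * (W @ np.maximum(y, 0.0))    # residual block with ReLU
    return y.var()

print(resnet_forward_variance(50, stable=False))  # grows roughly like (1 + sigma_w^2/2)^L
print(resnet_forward_variance(50, stable=True))   # stays O(1)
```

The unscaled network's variance blows up exponentially with depth, mirroring the $\Theta\big((1 + \sigma_w^2/2)^L\big)$ behaviour of Theorem 2, while the $1/\sqrt{L}$ branch scaling keeps it bounded.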

4. EXPERIMENTS

In this section, we illustrate empirically the theoretical results obtained in the previous sections. We validate the results on MNIST, CIFAR10, CIFAR100 and Tiny ImageNet.

4.1. INITIALIZATION AND RESCALING

According to Theorem 1, an EOC initialization is necessary for the network to be well-conditioned. We train FFNN with tanh activation on MNIST, varying the depth $L \in \{2, 20, 40, 60, 80, 100\}$ and sparsity $s \in \{10\%, 20\%, \dots, 90\%\}$. We use SGD with batch size 100 and learning rate $10^{-3}$, which we found to be optimal using a grid search on an exponential scale of 10. Figure 3 shows the test accuracy after 10k iterations for 3 different initialization schemes: Rescaled EOC, EOC, Ordered. In the Ordered phase, the model is untrainable when we choose sparsity $s > 40\%$ and depth $L > 60$, as one layer is fully pruned. For an EOC initialization, the set $(s, L)$ for which NN are trainable becomes larger. However, the model is still untrainable for highly sparse deep networks, as the sparse NN is no longer initialized on the EOC (see Proposition 1). As predicted by Proposition 1, after application of the rescaling trick to bring the pruned NN back onto the EOC, the pruned NN can be trained appropriately.

4.2. PRUNING RESNETS ON CIFAR AND TINY IMAGENET

For CIFAR10 and CIFAR100, we follow the experimental setup of (Wang et al., 2020), i.e. we use SGD for 160 and 250 epochs for CIFAR10 and CIFAR100, respectively. We use an initial learning rate of 0.1 and decay it by 0.1 at 1/2 and 3/4 of the total number of epochs. In addition, we run all our experiments 3 times to obtain more stable and reliable test accuracies. As in (Wang et al., 2020), we adopt Resnet architectures where we double the number of filters in each convolutional layer. As a baseline, we include pruning results with the classical OBD pruning algorithm (LeCun et al., 1990) for ResNet32 (train → prune → repeat). We compare our results against other algorithms that prune at initialization, such as SNIP (Lee et al., 2018), which is an SBP algorithm, GraSP (Wang et al., 2020), which is a Hessian based pruning algorithm, and SynFlow (Tanaka et al., 2020), which is an iterative data-agnostic pruning algorithm.
As we increase the depth, SBP-SR starts to outperform the other algorithms that prune at initialization (SBP-SR outperforms all of them with ResNet104 on CIFAR10 and CIFAR100). Furthermore, using GraSP on Stable Resnet did not improve over GraSP on standard Resnet, as our Stable Resnet analysis only applies to gradient based pruning. The analysis of Hessian based pruning could lead to similar techniques for improving trainability, which we leave for future work. To confirm these results, we also test SBP-SR against other pruning algorithms on Tiny ImageNet. We train the models for 300 training epochs to make sure all algorithms converge. Table 3 shows test accuracies for SBP-SR, SNIP, GraSP, and SynFlow for $s \in \{85\%, 90\%, 95\%\}$. Although SynFlow competes with or outperforms GraSP in many cases, SBP-SR has a clear advantage over SynFlow and the other algorithms, especially for deep networks, as illustrated on ResNet104. Additional results on the ImageNet dataset are provided in Appendix F.

4.3. RESCALING TRICK AND CNNS

The theoretical analysis of Section 2 is valid for Vanilla CNN, i.e. CNN without pooling layers. With pooling layers, the theory of signal propagation applies to sections between successive pooling layers; each of those sections can be seen as a Vanilla CNN. This applies to standard CNN architectures such as VGG. As a toy example, we show in Table 4 the test accuracy of a pruned V-CNN with sparsity $s = 50\%$ on the MNIST dataset. Similar to the FFNN results in Figure 3, the combination of the EOC initialization and the ReScaling trick allows for pruning deep V-CNN (depth 100) while ensuring their trainability. However, V-CNN is a toy example that is generally not used in practice. Standard CNN architectures such as VGG are popular among practitioners since they achieve SOTA accuracy on many tasks. Table 5 shows test accuracies for SNIP, SynFlow, and our EOC+ReScaling trick for VGG16 on CIFAR10. Our results are close to those presented by Frankle et al. (2020): these three algorithms perform similarly. From a theoretical point of view, our ReScaling trick applies to vanilla CNNs without pooling layers; hence, adding pooling layers might cause a deterioration. However, we know that the signal propagation theory applies to the vanilla blocks inside VGG (i.e. the sequences of convolutional layers between two successive pooling layers). The larger those vanilla blocks are, the better our ReScaling trick performs. We leverage this observation by training a modified version of VGG, called 3xVGG16, which has the same number of pooling layers as VGG16 and 3 times the number of convolutional layers inside each vanilla block. Numerical results in Table 5 show that the EOC initialization with the ReScaling trick outperforms the other algorithms, which confirms our hypothesis. However, the architecture 3xVGG16 is not a standard architecture, and it does not seem to improve much on the test accuracy of VGG16.
An adaptation of the ReScaling trick to standard VGG architectures would be of great value and is left for future work.

Summary of numerical results. We summarize our numerical results in Table 6. The letter 'C' refers to 'Competition' between algorithms in that setting, indicating no clear winner, while a dash means no experiment has been run with this setting. We observe that our algorithm SBP-SR consistently outperforms other algorithms in a variety of settings.

Table 6: Which algorithm performs better? (according to our results)

DATASET        ARCHITECTURE   85%        90%        95%        98%
CIFAR10        RESNET32       -          C          C          GRASP
               RESNET50       -          C          SBP-SR     GRASP
               RESNET104      -          SBP-SR     SBP-SR     SBP-SR
               VGG16          C          C          C          -
               3XVGG16        EOC+RESC   EOC+RESC   EOC+RESC   -
CIFAR100       RESNET32       -          SBP-SR     SBP-SR     SBP-SR
               RESNET50       -          SBP-SR     SBP-SR     SBP-SR
               RESNET104      -          SBP-SR     SBP-SR     SBP-SR
TINY IMAGENET  RESNET32       C          C          SYNFLOW    -
               RESNET50       SBP-SR     C          SBP-SR     -
               RESNET104      SBP-SR     SBP-SR     SBP-SR     -

5. CONCLUSION

In this paper, we have formulated principled guidelines for SBP at initialization. For FFNN and CNN, we have shown that an initialization on the EOC is necessary, followed by the application of a simple rescaling trick to train the pruned network. For Resnets, the situation is markedly different: there is no need for a specific initialization, but Resnets in their original form suffer from an exploding gradient problem.

A.1 APPROXIMATION 1: INFINITE WIDTH

Fully Connected Feedforward Neural Networks

Consider a FFNN of depth $L$ and widths $(N_l)_{0 \le l \le L}$, with weights $W_{ij}^l \overset{iid}{\sim} \mathcal{N}(0, \frac{\sigma_w^2}{N_{l-1}})$ and bias $B_i^l \overset{iid}{\sim} \mathcal{N}(0, \sigma_b^2)$, where $\mathcal{N}(\mu, \sigma^2)$ denotes the normal distribution of mean $\mu$ and variance $\sigma^2$. For some input $x \in \mathbb{R}^d$, the propagation of this input through the network is given by

$$y_i^1(x) = \sum_{j=1}^{d} W_{ij}^1 x_j + B_i^1, \qquad y_i^l(x) = \sum_{j=1}^{N_{l-1}} W_{ij}^l \phi(y_j^{l-1}(x)) + B_i^l, \quad \text{for } l \ge 2,$$

where $\phi : \mathbb{R} \to \mathbb{R}$ is the activation function. When we take the limit $N_{l-1} \to \infty$, the Central Limit Theorem implies that $y_i^l(x)$ is a Gaussian variable for any input $x$. This infinite-width approximation results in an error of order $O(1/\sqrt{N_{l-1}})$ (standard Monte Carlo error). More generally, an approximation of the random process $y_i^l(\cdot)$ by a Gaussian process was first proposed by Neal (1995) in the single layer case and has been recently extended to the multiple layer case by Lee et al. (2018) and Matthews et al. (2018). We recall here the expressions of the limiting Gaussian process kernels. For any input $x \in \mathbb{R}^d$, $\mathbb{E}[y_i^l(x)] = 0$, so that for any inputs $x, x' \in \mathbb{R}^d$

$$\kappa^l(x, x') = \mathbb{E}[y_i^l(x) y_i^l(x')] = \sigma_b^2 + \sigma_w^2 \mathbb{E}[\phi(y_i^{l-1}(x)) \phi(y_i^{l-1}(x'))] = \sigma_b^2 + \sigma_w^2 F_\phi\big(\kappa^{l-1}(x, x), \kappa^{l-1}(x, x'), \kappa^{l-1}(x', x')\big),$$

where $F_\phi$ is a function that only depends on $\phi$. This provides a simple recursive formula for the computation of the kernel $\kappa^l$; see, e.g., Lee et al. (2018) for more details.
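For ReLU, the expectation $\mathbb{E}[\phi(u)\phi(v)]$ in the recursion above has a closed form via the arc-cosine kernel of Cho and Saul (2009): $\frac{\sqrt{\kappa_{11}\kappa_{22}}}{2\pi}\big(\sin\theta + (\pi - \theta)\cos\theta\big)$ with $\theta = \arccos(c)$ and $c$ the correlation. The sketch below (our own illustration, not the paper's code) propagates the kernel of two inputs through 30 layers at the ReLU EOC $(\sigma_w, \sigma_b) = (\sqrt{2}, 0)$; the variance is preserved exactly while the correlation drifts slowly towards 1:

```python
import numpy as np

def relu_kernel_step(k11, k12, k22, sigma_w, sigma_b):
    """One step of the infinite-width kernel recursion for ReLU:
    kappa^l = sigma_b^2 + sigma_w^2 * E[phi(u) phi(v)], using the arc-cosine
    closed form for E[relu(u) relu(v)] (Cho & Saul, 2009)."""
    c = np.clip(k12 / np.sqrt(k11 * k22), -1.0, 1.0)
    theta = np.arccos(c)
    ev = np.sqrt(k11 * k22) / (2 * np.pi) * (np.sin(theta) + (np.pi - theta) * np.cos(theta))
    return sigma_b**2 + sigma_w**2 * ev

# two unit-variance inputs with initial correlation 0.2, 30 layers at the ReLU EOC
k11 = k22 = 1.0
k12 = 0.2
for _ in range(30):
    k11n = relu_kernel_step(k11, k11, k11, np.sqrt(2.0), 0.0)
    k22n = relu_kernel_step(k22, k22, k22, np.sqrt(2.0), 0.0)
    k12 = relu_kernel_step(k11, k12, k22, np.sqrt(2.0), 0.0)
    k11, k22 = k11n, k22n
print(round(k11, 6))  # ≈ 1.0: variance is preserved at the EOC
print(round(k12, 4))  # correlation has drifted towards 1, but only sub-exponentially
```

The sub-exponential drift of the correlation is exactly the hallmark of the EOC: in the ordered or chaotic phases the same recursion converges exponentially fast, which is what degrades trainability.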

Convolutional Neural Networks

Similar to the FFNN case, the infinite width approximation for 1D CNN (introduced in the main paper) yields a recursion for the kernel. However, infinite width here means an infinite number of channels, and results in an error $O(1/\sqrt{n_{l-1}})$. The kernel in this case depends on the choice of the neurons in the channel and is given by

$$\kappa_{\alpha,\alpha'}^l(x, x') = \mathbb{E}[y_{i,\alpha}^l(x) y_{i,\alpha'}^l(x')] = \sigma_b^2 + \frac{\sigma_w^2}{2k+1} \sum_{\beta \in ker} \mathbb{E}[\phi(y_{1,\alpha+\beta}^{l-1}(x)) \phi(y_{1,\alpha'+\beta}^{l-1}(x'))],$$

so that

$$\kappa_{\alpha,\alpha'}^l(x, x') = \sigma_b^2 + \frac{\sigma_w^2}{2k+1} \sum_{\beta \in ker} F_\phi\big(\kappa_{\alpha+\beta,\alpha'+\beta}^{l-1}(x, x), \kappa_{\alpha+\beta,\alpha'+\beta}^{l-1}(x, x'), \kappa_{\alpha+\beta,\alpha'+\beta}^{l-1}(x', x')\big).$$

The convolutional kernel $\kappa_{\alpha,\alpha'}^l$ has the 'self-averaging' property, i.e. it is an average over the kernels corresponding to different combinations of neurons in the previous layer. However, it is easy to simplify the analysis in this case by studying the average kernel per channel defined by $\bar{\kappa}^l = \frac{1}{N^2} \sum_{\alpha,\alpha'} \kappa_{\alpha,\alpha'}^l$. Indeed, by summing terms in the previous equation and using the fact that we use circular padding, we obtain

$$\bar{\kappa}^l(x, x') = \sigma_b^2 + \sigma_w^2 \frac{1}{N^2} \sum_{\alpha,\alpha'} F_\phi\big(\kappa_{\alpha,\alpha'}^{l-1}(x, x), \kappa_{\alpha,\alpha'}^{l-1}(x, x'), \kappa_{\alpha,\alpha'}^{l-1}(x', x')\big).$$

This expression is similar in nature to that of FFNN. We will use this observation in the proofs. Note that our analysis only requires the approximation that, in the infinite width limit, for any two inputs $x, x'$, the variables $y_i^l(x)$ and $y_i^l(x')$ are Gaussian with covariance $\kappa^l(x, x')$ for FFNN, and $y_{i,\alpha}^l(x)$ and $y_{i,\alpha'}^l(x')$ are Gaussian with covariance $\kappa_{\alpha,\alpha'}^l(x, x')$ for CNN. We do not need the much stronger approximation that the process $y_i^l(\cdot)$ ($y_{i,\alpha}^l(\cdot)$ for CNN) is a Gaussian process.

Residual Neural Networks

The infinite width limit approximation for ResNet yields similar results with an additional residual term. It is straightforward to see that, in the case of a ResNet with FFNN-type layers, we have

$$\kappa^l(x, x') = \kappa^{l-1}(x, x') + \sigma_b^2 + \sigma_w^2 F_\phi\big(\kappa^{l-1}(x, x), \kappa^{l-1}(x, x'), \kappa^{l-1}(x', x')\big),$$

whereas for a ResNet with CNN-type layers, we have

$$\kappa_{\alpha,\alpha'}^l(x, x') = \kappa_{\alpha,\alpha'}^{l-1}(x, x') + \sigma_b^2 + \frac{\sigma_w^2}{2k+1} \sum_{\beta \in ker} F_\phi\big(\kappa_{\alpha+\beta,\alpha'+\beta}^{l-1}(x, x), \kappa_{\alpha+\beta,\alpha'+\beta}^{l-1}(x, x'), \kappa_{\alpha+\beta,\alpha'+\beta}^{l-1}(x', x')\big).$$

A.2 APPROXIMATION 2: GRADIENT INDEPENDENCE

For gradient back-propagation, an essential assumption in the prior literature on mean-field analysis of DNNs is that of gradient independence, which is similar in nature to the practice of feedback alignment (Lillicrap et al., 2016). This approximation allows for the derivation of recursive formulas for gradient back-propagation, and it has been extensively used in the literature and verified empirically; see the references below.

Gradient Covariance back-propagation: this approximation was used to derive analytical formulas for gradient covariance back-propagation in (Hayou et al., 2019; Schoenholz et al., 2017; Yang and Schoenholz, 2017; Lee et al., 2018; Poole et al., 2016; Xiao et al., 2018; Yang, 2019). It was shown empirically through simulations that it is an excellent approximation for FFNN in Schoenholz et al. (2017), for Resnets in Yang and Schoenholz (2017), and for CNN in Xiao et al. (2018).

Neural Tangent Kernel (NTK): this approximation was implicitly used by Jacot et al. (2018) to derive the recursive formula of the infinite width Neural Tangent Kernel (see Jacot et al. (2018), Appendix A.1). The authors found that this approximation yields an excellent match with the exact NTK. It was also exploited later in (Arora et al., 2019; Hayou et al., 2020) to derive the infinite NTK for different architectures. The difference between the infinite width NTK $\Theta$ and the empirical (exact) NTK $\hat{\Theta}$ was studied in Lee et al. (2019), where the authors showed that $\|\hat{\Theta} - \Theta\|_F = O(N^{-1})$, where $N$ is the width of the NN.

More precisely, we use the approximation that, for wide neural networks, the weights used for forward propagation are independent from those used for back-propagation. When used for the computation of the gradient covariance and the Neural Tangent Kernel, this approximation was proven to give the exact computation for standard architectures such as FFNN, CNN and ResNets without BatchNorm in Yang (2019) (Section D.5). Even with BatchNorm, in Yang et al.
(2019), the authors found that the Gradient Independence approximation matches empirical results. This approximation can alternatively be formulated as an assumption rather than an approximation, as in Yang and Schoenholz (2017).

Assumption 1 (Gradient Independence). The gradients are computed using an i.i.d. version of the weights used for forward propagation.

B PRELIMINARY RESULTS

Let $x \in \mathbb{R}^d$ be an input. In its general form, a neural network of depth $L$ is given by the following set of forward propagation equations:
$$y^l(x) = \mathcal{F}_l(W^l, y^{l-1}(x)) + B^l, \quad 1 \le l \le L,$$
where $y^l(x)$ is the vector of pre-activations, $W^l$ and $B^l$ are respectively the weights and bias of the $l^{th}$ layer, and $\mathcal{F}_l$ is a mapping that defines the nature of the layer. The weights and bias are initialized with $W^l \overset{iid}{\sim} \mathcal{N}(0, \sigma_w^2 / v_l)$, where $v_l$ is a scaling factor used to control the variance of $y^l$, and $B^l \overset{iid}{\sim} \mathcal{N}(0, \sigma_b^2)$. Hereafter, we denote by $M_l$ the number of weights in the $l^{th}$ layer and by $\phi$ the activation function.

• Fully connected FeedForward Neural Network (FFNN). For a fully connected feedforward neural network of depth $L$ and widths $(N_l)_{0 \le l \le L}$, the forward propagation of the input through the network is given by
$$y^1_i(x) = \sum_{j=1}^{d} W^1_{ij} x_j + B^1_i, \qquad y^l_i(x) = \sum_{j=1}^{N_{l-1}} W^l_{ij} \phi(y^{l-1}_j(x)) + B^l_i, \quad l \ge 2. \tag{2}$$
Here, we have $v_l = N_{l-1}$ and $M_l = N_{l-1} N_l$.

• Convolutional Neural Network (CNN/ConvNet). For a 1D convolutional neural network of depth $L$, number of channels $(n_l)_{l \le L}$ and number of neurons per channel $(N_l)_{l \le L}$, we have
$$y^1_{i,\alpha}(x) = \sum_{j=1}^{n_0} \sum_{\beta \in \mathrm{ker}_1} W^1_{i,j,\beta} x_{j,\alpha+\beta} + b^1_i, \qquad y^l_{i,\alpha}(x) = \sum_{j=1}^{n_{l-1}} \sum_{\beta \in \mathrm{ker}_l} W^l_{i,j,\beta} \phi(y^{l-1}_{j,\alpha+\beta}(x)) + b^l_i, \quad l \ge 2, \tag{3}$$
where $i \in [1:n_l]$ is the channel index, $\alpha \in [0:N_l-1]$ is the neuron location, $\mathrm{ker}_l = [-k_l : k_l]$ is the filter range and $2k_l + 1$ is the filter size. To simplify the analysis, we assume hereafter that $N_l = N$ and $k_l = k$ for all $l$. Here, we have $v_l = n_{l-1}(2k+1)$ and $M_l = n_{l-1} n_l (2k+1)$. We assume periodic boundary conditions, so that $y^l_{i,\alpha} = y^l_{i,\alpha+N} = y^l_{i,\alpha-N}$. The generalization to multidimensional convolutions is straightforward.

Notation: Hereafter, for FFNN layers, we denote by $q^l(x)$ the variance of $y^l_1(x)$ (the choice of the index 1 is not crucial since, by the mean-field approximation, the random variables $(y^l_i(x))_{i \in [1:N_l]}$ are iid Gaussian variables).
We denote by $q^l(x,x')$ the covariance between $y^l_1(x)$ and $y^l_1(x')$, and by $c^l(x,x')$ the corresponding correlation. For gradient back-propagation, for some loss function $\mathcal{L}$, we denote by $\tilde{q}^l(x,x')$ the gradient covariance defined by $\tilde{q}^l(x,x') = \mathbb{E}\big[\frac{\partial \mathcal{L}}{\partial y^l_1(x)} \frac{\partial \mathcal{L}}{\partial y^l_1(x')}\big]$. Similarly, $\tilde{q}^l(x)$ denotes the gradient variance at point $x$. For CNN layers, we use similar notation across channels. More precisely, we denote by $q^l_\alpha(x)$ the variance of $y^l_{1,\alpha}(x)$ (the choice of the index 1 is not crucial here either since, by the mean-field approximation, the random variables $(y^l_{i,\alpha}(x))_{i \in [1:n_l]}$ are iid Gaussian variables). We denote by $q^l_{\alpha,\alpha'}(x,x')$ the covariance between $y^l_{1,\alpha}(x)$ and $y^l_{1,\alpha'}(x')$, and by $c^l_{\alpha,\alpha'}(x,x')$ the corresponding correlation. As in the FFNN case, we define the gradient covariance by $\tilde{q}^l_{\alpha,\alpha'}(x,x') = \mathbb{E}\big[\frac{\partial \mathcal{L}}{\partial y^l_{1,\alpha}(x)} \frac{\partial \mathcal{L}}{\partial y^l_{1,\alpha'}(x')}\big]$.

B.1 WARMUP: SOME RESULTS FROM THE MEAN-FIELD THEORY OF DNNS

We start by recalling some results from the mean-field theory of deep NNs.

Covariance propagation for FFNN:

In Section A.1, we presented the recursive formula for covariance propagation in a FFNN, which we derive using the Central Limit Theorem. More precisely, for two inputs $x, x' \in \mathbb{R}^d$, we have
$$q^l(x,x') = \sigma_b^2 + \sigma_w^2 \, \mathbb{E}[\phi(y^{l-1}_i(x)) \phi(y^{l-1}_i(x'))].$$
This can be rewritten as
$$q^l(x,x') = \sigma_b^2 + \sigma_w^2 \, \mathbb{E}\Big[\phi\big(\sqrt{q^{l-1}(x)}\, Z_1\big) \, \phi\big(\sqrt{q^{l-1}(x')}\,(c^{l-1} Z_1 + \sqrt{1-(c^{l-1})^2}\, Z_2)\big)\Big],$$
where $c^{l-1} := c^{l-1}(x,x')$ and $Z_1, Z_2$ are iid standard Gaussian variables. With a ReLU activation function, we have
$$q^l(x,x') = \sigma_b^2 + \frac{\sigma_w^2}{2} \sqrt{q^{l-1}(x)\, q^{l-1}(x')} \, f(c^{l-1}),$$
where $f$ is the ReLU correlation function given by (Hayou et al. (2019))
$$f(c) = \frac{1}{\pi}\big(c \arcsin c + \sqrt{1-c^2}\big) + \frac{c}{2}.$$

Covariance propagation for CNN: Similar to the FFNN case, it is straightforward to derive a recursive formula for the covariance. However, in this case, the independence is across channels and not neurons. Simple calculus yields
$$q^l_{\alpha,\alpha'}(x,x') = \mathbb{E}[y^l_{i,\alpha}(x)\, y^l_{i,\alpha'}(x')] = \sigma_b^2 + \frac{\sigma_w^2}{2k+1} \sum_{\beta \in \mathrm{ker}} \mathbb{E}[\phi(y^{l-1}_{1,\alpha+\beta}(x)) \, \phi(y^{l-1}_{1,\alpha'+\beta}(x'))].$$
Using a ReLU activation function, this becomes
$$q^l_{\alpha,\alpha'}(x,x') = \sigma_b^2 + \frac{\sigma_w^2}{2(2k+1)} \sum_{\beta \in \mathrm{ker}} \sqrt{q^{l-1}_{\alpha+\beta}(x)\, q^{l-1}_{\alpha'+\beta}(x')} \, f\big(c^{l-1}_{\alpha+\beta,\alpha'+\beta}(x,x')\big).$$
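As a quick numerical sanity check of the ReLU correlation function above, the following sketch (not part of the paper; it assumes NumPy and the $\sigma_b = 0$, $q = 1$ case) compares the closed form of $f$ against a Monte Carlo estimate of $2\,\mathbb{E}[\phi(Z_1)\phi(cZ_1 + \sqrt{1-c^2}\,Z_2)]$.

```python
import numpy as np

def relu_corr_map(c):
    # ReLU correlation function from the section above:
    # f(c) = (1/pi) * (c * arcsin(c) + sqrt(1 - c^2)) + c / 2
    c = np.asarray(c, dtype=float)
    return (c * np.arcsin(c) + np.sqrt(1.0 - c ** 2)) / np.pi + c / 2.0

# Monte Carlo check of f(c) = 2 * E[phi(Z1) * phi(c*Z1 + sqrt(1-c^2)*Z2)]
# for phi = ReLU and iid standard Gaussians Z1, Z2.
rng = np.random.default_rng(0)
c = 0.3
z1, z2 = rng.standard_normal((2, 1_000_000))
zc = c * z1 + np.sqrt(1.0 - c ** 2) * z2
mc_estimate = 2.0 * np.mean(np.maximum(z1, 0.0) * np.maximum(zc, 0.0))
```

Note that $f(1) = 1$ and $f(0) = 1/\pi$, consistent with $f$ having 1 as a fixed point.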

Covariance propagation for ResNet with ReLU:

This case is similar to the non-residual one, except that a residual term appears in the recursive formula. For a ResNet with FFNN layers, we have
$$q^l(x,x') = q^{l-1}(x,x') + \sigma_b^2 + \frac{\sigma_w^2}{2} \sqrt{q^{l-1}(x)\, q^{l-1}(x')} \, f(c^{l-1}),$$
and for a ResNet with CNN layers, we have
$$q^l_{\alpha,\alpha'}(x,x') = q^{l-1}_{\alpha,\alpha'}(x,x') + \sigma_b^2 + \frac{\sigma_w^2}{2(2k+1)} \sum_{\beta \in \mathrm{ker}} \sqrt{q^{l-1}_{\alpha+\beta}(x)\, q^{l-1}_{\alpha'+\beta}(x')} \, f\big(c^{l-1}_{\alpha+\beta,\alpha'+\beta}(x,x')\big).$$

B.1.2 GRADIENT COVARIANCE BACK-PROPAGATION

Gradient Covariance back-propagation for FFNN: Let $\mathcal{L}$ be the loss function and let $x$ be an input. The back-propagation of the gradient is given by the set of equations
$$\frac{\partial \mathcal{L}}{\partial y^l_i} = \phi'(y^l_i) \sum_{j=1}^{N_{l+1}} \frac{\partial \mathcal{L}}{\partial y^{l+1}_j} W^{l+1}_{ji}.$$
Using the approximation that the weights used for forward propagation are independent of those used in back-propagation, we have, as in Schoenholz et al. (2017),
$$\tilde{q}^l(x) = \tilde{q}^{l+1}(x) \frac{N_{l+1}}{N_l} \chi(q^l(x)), \qquad \text{where } \chi(q^l(x)) = \sigma_w^2 \, \mathbb{E}[\phi'(\sqrt{q^l(x)}\, Z)^2].$$

Gradient Covariance back-propagation for CNN: Similar to the FFNN case, we have
$$\frac{\partial \mathcal{L}}{\partial W^l_{i,j,\beta}} = \sum_\alpha \frac{\partial \mathcal{L}}{\partial y^l_{i,\alpha}} \phi(y^{l-1}_{j,\alpha+\beta}) \quad \text{and} \quad \frac{\partial \mathcal{L}}{\partial y^l_{i,\alpha}} = \sum_{j=1}^{n} \sum_{\beta \in \mathrm{ker}} \frac{\partial \mathcal{L}}{\partial y^{l+1}_{j,\alpha-\beta}} W^{l+1}_{i,j,\beta} \phi'(y^l_{i,\alpha}).$$
Using the Gradient Independence approximation and averaging over the number of channels (using the CLT), we have
$$\mathbb{E}\Big[\Big(\frac{\partial \mathcal{L}}{\partial y^l_{i,\alpha}}\Big)^2\Big] = \frac{\sigma_w^2 \, \mathbb{E}[\phi'(\sqrt{q^l_\alpha(x)}\, Z)^2]}{2k+1} \sum_{\beta \in \mathrm{ker}} \mathbb{E}\Big[\Big(\frac{\partial \mathcal{L}}{\partial y^{l+1}_{i,\alpha-\beta}}\Big)^2\Big].$$
We can obtain a recursion similar to that of the FFNN case by summing over $\alpha$ and using the periodic boundary conditions, which yields
$$\sum_\alpha \mathbb{E}\Big[\Big(\frac{\partial \mathcal{L}}{\partial y^l_{i,\alpha}}\Big)^2\Big] = \chi(q^l_\alpha(x)) \sum_\alpha \mathbb{E}\Big[\Big(\frac{\partial \mathcal{L}}{\partial y^{l+1}_{i,\alpha}}\Big)^2\Big].$$
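The recursion $\tilde{q}^l = \tilde{q}^{l+1} \frac{N_{l+1}}{N_l} \chi(q^l)$ is easy to illustrate numerically: for ReLU, $\phi'(z)^2 = \mathbb{1}\{z > 0\}$, so $\chi = \sigma_w^2/2$ regardless of $q$. The sketch below (not from the paper; it assumes equal widths and ReLU) shows the exponential decay of the gradient variance on the Ordered phase.

```python
import numpy as np

def chi_relu(sigma_w):
    # chi = sigma_w^2 * E[phi'(sqrt(q) Z)^2]; for ReLU phi'(z)^2 = 1{z > 0},
    # so the expectation is 1/2 independently of q.
    return sigma_w ** 2 / 2.0

# Gradient variance back-propagation with equal widths: q~^l = q~^L * chi^(L - l).
# With sigma_w < sqrt(2) (Ordered phase, chi < 1), the gradient variance
# vanishes exponentially as it back-propagates towards the first layers.
L, sigma_w = 50, 1.0
chi = chi_relu(sigma_w)
grad_var = np.array([chi ** (L - l) for l in range(1, L + 1)])  # q~^1, ..., q~^L
```

On the EOC ($\sigma_w = \sqrt{2}$ for ReLU), $\chi = 1$ and the profile is flat instead.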

B.1.3 EDGE OF CHAOS (EOC)

Let $x \in \mathbb{R}^d$ be an input. The convergence of $q^l(x)$ as $l$ increases has been studied by Schoenholz et al. (2017) and Hayou et al. (2019). In particular, under weak regularity conditions, it is proven that $q^l(x)$ converges to a point $q(\sigma_b, \sigma_w) > 0$ independent of $x$ as $l \to \infty$. The asymptotic behaviour of the correlation $c^l(x,x')$ between $y^l(x)$ and $y^l(x')$ for any two inputs $x$ and $x'$ is also driven by $(\sigma_b, \sigma_w)$: the dynamics of $c^l$ are controlled by a function $f$, i.e. $c^{l+1} = f(c^l)$, called the correlation function. The authors define the EOC as the set of parameters $(\sigma_b, \sigma_w)$ such that $\sigma_w^2 \, \mathbb{E}[\phi'(\sqrt{q(\sigma_b,\sigma_w)}\, Z)^2] = 1$, where $Z \sim \mathcal{N}(0,1)$. Similarly, the Ordered, resp. Chaotic, phase is defined by $\sigma_w^2 \, \mathbb{E}[\phi'(\sqrt{q(\sigma_b,\sigma_w)}\, Z)^2] < 1$, resp. $> 1$. On the Ordered phase, the gradient vanishes as it back-propagates through the network, and the correlation $c^l(x,x')$ converges exponentially to 1, so the output function becomes constant (hence the name 'Ordered phase'). On the Chaotic phase, the gradient explodes and the correlation converges exponentially to some limiting value $c < 1$, which results in the output function being discontinuous everywhere (hence the name 'Chaotic phase'). On the EOC, the second moment of the gradient remains constant throughout back-propagation and the correlation converges to 1 at a sub-exponential rate, which allows deeper information propagation. Hereafter, $f$ will always refer to the correlation function.
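For ReLU, the phase diagram collapses to a single line: since $\chi = \sigma_w^2/2$ does not depend on $q$, the EOC is exactly $\sigma_w = \sqrt{2}$. A minimal classifier sketch (illustrative only, not from the paper):

```python
import numpy as np

def relu_phase(sigma_w):
    # For ReLU, chi(sigma_b, sigma_w) = sigma_w^2 / 2 is independent of
    # (sigma_b, q), so the phase depends on sigma_w alone.
    chi = sigma_w ** 2 / 2.0
    if np.isclose(chi, 1.0):
        return "EOC"
    return "Ordered" if chi < 1.0 else "Chaotic"
```

For a generic activation $\phi$, one would instead solve the fixed-point equation for $q(\sigma_b, \sigma_w)$ and evaluate $\sigma_w^2 \mathbb{E}[\phi'(\sqrt{q}Z)^2]$ numerically.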

B.1.4 SOME RESULTS FROM THE MEAN-FIELD THEORY OF DEEP FFNNS

Let $\epsilon \in (0,1)$ and $B_\epsilon = \{(x,x') \in \mathbb{R}^d \times \mathbb{R}^d : c^1(x,x') < 1 - \epsilon\}$ (for now, $B_\epsilon$ is defined only for FFNN). Using Approximation 1, the following results have been derived by Schoenholz et al. (2017)

and Hayou et al. (2019):

• There exist $q, \lambda > 0$ such that $\sup_{x \in \mathbb{R}^d} |q^l(x) - q| \le e^{-\lambda l}$.
• On the Ordered phase, there exists $\gamma > 0$ such that $\sup_{x,x' \in \mathbb{R}^d} |c^l(x,x') - 1| \le e^{-\gamma l}$.
• On the Chaotic phase, for all $\epsilon \in (0,1)$ there exist $\gamma > 0$ and $c < 1$ such that $\sup_{(x,x') \in B_\epsilon} |c^l(x,x') - c| \le e^{-\gamma l}$.
• For a ReLU network on the EOC, we have $f(x) =_{x \to 1^-} x + \frac{2\sqrt{2}}{3\pi}(1-x)^{3/2} + O\big((1-x)^{5/2}\big)$.
• In general, we have
$$f(x) = \frac{\sigma_b^2 + \sigma_w^2 \, \mathbb{E}[\phi(\sqrt{q}\, Z_1)\, \phi(\sqrt{q}\, Z(x))]}{q}, \tag{15}$$
where $Z(x) = x Z_1 + \sqrt{1-x^2}\, Z_2$ and $Z_1, Z_2$ are iid standard Gaussian variables.
• On the EOC, we have $f'(1) = 1$.
• On the Ordered, resp. Chaotic, phase we have $f'(1) < 1$, resp. $f'(1) > 1$.
• For non-linear activation functions, $f$ is strictly convex and $f(1) = 1$.
• $f$ is increasing on $[-1, 1]$.
• On the Ordered phase and the EOC, $f$ has a unique fixed point, which is 1. On the Chaotic phase, $f$ has two fixed points: 1, which is unstable, and a stable fixed point $c \in (0,1)$.
• On the Ordered/Chaotic phase, the correlation between gradients computed with different inputs converges exponentially to 0 as we back-propagate the gradients.

Similar results exist for CNNs. Xiao et al. (2018) show that, similarly to the FFNN case, there exists $q$ such that $q^l_\alpha(x)$ converges exponentially to $q$ for all $x, \alpha$, and study the limiting behaviour of the correlation $c^l_{\alpha,\alpha'}(x,x)$ between neurons in the same channel (for the same input $x$). These correlations describe how features are correlated for the same input; however, they do not capture the behaviour of these features across different inputs (i.e. $c^l_{\alpha,\alpha'}(x,x')$ with $x \ne x'$). We establish this result in the next section.
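The sub-exponential EOC rate can be seen directly by iterating the ReLU correlation map. From the expansion above, the same computation as in the proof of Proposition 3 predicts $1 - c^l \sim \frac{9\pi^2}{2 l^2}$ (a derived constant, not stated in the paper); the sketch below checks this numerically.

```python
import numpy as np

def relu_corr_map(c):
    # EOC correlation function for ReLU (sigma_b = 0, sigma_w = sqrt(2)).
    return (c * np.arcsin(c) + np.sqrt(1.0 - c ** 2)) / np.pi + c / 2.0

# Iterate c^{l+1} = f(c^l). The gap 1 - c^l shrinks only polynomially:
# the expansion f(x) = x + (2*sqrt(2)/(3*pi)) * (1-x)^{3/2} + ... gives
# 1 - c^l ~ 9*pi^2 / (2*l^2).
c, L = 0.2, 2000
for _ in range(L):
    c = relu_corr_map(c)
predicted = 9.0 * np.pi ** 2 / (2.0 * L ** 2)
ratio = (1.0 - c) / predicted
```

On the Ordered or Chaotic phase, the same loop instead converges exponentially fast to the stable fixed point.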

B.2 CORRELATION BEHAVIOUR IN CNN IN THE LIMIT OF LARGE DEPTH

Appendix Lemma 1 (Asymptotic behaviour of the correlation in CNNs with smooth activation functions). We consider a 1D CNN. Let $(\sigma_b, \sigma_w) \in (\mathbb{R}^+)^2$ and let $x \ne x'$ be two inputs in $\mathbb{R}^d$. If $(\sigma_b, \sigma_w)$ is either on the Ordered or the Chaotic phase, then there exists $\beta > 0$ such that
$$\sup_{\alpha,\alpha'} |c^l_{\alpha,\alpha'}(x,x') - c| = O(e^{-\beta l}),$$
where $c = 1$ if $(\sigma_b, \sigma_w)$ is in the Ordered phase, and $c \in (0,1)$ if $(\sigma_b, \sigma_w)$ is in the Chaotic phase.

Proof. Let $x \ne x'$ be two inputs and $\alpha, \alpha'$ two nodes in the same channel $i$. From Section B.1, we have
$$q^l_{\alpha,\alpha'}(x,x') = \mathbb{E}[y^l_{i,\alpha}(x)\, y^l_{i,\alpha'}(x')] = \frac{\sigma_w^2}{2k+1} \sum_{\beta \in \mathrm{ker}} \mathbb{E}[\phi(y^{l-1}_{1,\alpha+\beta}(x))\, \phi(y^{l-1}_{1,\alpha'+\beta}(x'))] + \sigma_b^2.$$
This yields
$$c^l_{\alpha,\alpha'}(x,x') = \frac{1}{2k+1} \sum_{\beta \in \mathrm{ker}} f\big(c^{l-1}_{\alpha+\beta,\alpha'+\beta}(x,x')\big),$$
where $f$ is the correlation function. We prove the result in the Ordered phase; the proof in the Chaotic phase is similar. Let $(\sigma_b, \sigma_w)$ be in the Ordered phase and let $c^l_m = \min_{\alpha,\alpha'} c^l_{\alpha,\alpha'}(x,x')$. Using the fact that $f$ is non-decreasing (Section B.1), we have
$$c^l_{\alpha,\alpha'}(x,x') \ge \frac{1}{2k+1} \sum_{\beta \in \mathrm{ker}} f(c^{l-1}_m) = f(c^{l-1}_m).$$
Taking the minimum again over $\alpha, \alpha'$, we obtain $c^l_m \ge f(c^{l-1}_m)$; therefore $c^l_m$ is non-decreasing and converges to a stable fixed point of $f$. By the convexity of $f$, the limit is 1 (in the Chaotic phase, $f$ has two fixed points: a stable point $c_1 < 1$ and an unstable point $c_2 = 1$). Moreover, the convergence is exponential since $0 < f'(1) < 1$. We conclude using the fact that $\sup_{\alpha,\alpha'} |c^l_{\alpha,\alpha'}(x,x') - 1| = 1 - c^l_m$.
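The averaging recursion in the proof can be simulated directly. The sketch below (illustrative only; the Ordered-phase map $g(c) = \sigma_b^2/q + (\sigma_w^2/2) f(c)$ for ReLU with $\sigma_b = \sigma_w = 1$, $q = 2$, is derived from the covariance recursions above, not stated in the paper) checks that $\sup_{\alpha,\alpha'} |c^l_{\alpha,\alpha'} - 1|$ vanishes.

```python
import numpy as np

def relu_corr_map(c):
    c = np.clip(c, -1.0, 1.0)
    return (c * np.arcsin(c) + np.sqrt(1.0 - c ** 2)) / np.pi + c / 2.0

# Ordered-phase correlation map for ReLU with sigma_b = sigma_w = 1:
# variance fixed point q = sigma_b^2 / (1 - sigma_w^2/2) = 2, hence
# g(c) = sigma_b^2/q + (sigma_w^2/2) * f(c) = 0.5 + 0.5 * f(c), g'(1) = 1/2.
def g(c):
    return 0.5 + 0.5 * relu_corr_map(c)

# CNN recursion c^l_{a,a'} = (1/(2k+1)) * sum_b g(c^{l-1}_{a+b,a'+b}),
# periodic boundary conditions implemented via np.roll.
N, k = 16, 1
rng = np.random.default_rng(0)
C = rng.uniform(0.0, 0.9, size=(N, N))   # toy initial correlations
for _ in range(60):
    G = g(C)
    C = np.mean([np.roll(G, (b, b), axis=(0, 1)) for b in range(-k, k + 1)], axis=0)
sup_gap = float(np.max(np.abs(C - 1.0)))
```

Since $g'(1) = 1/2 < 1$, the gap contracts geometrically, matching the exponential rate in the lemma.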

C PROOFS FOR SECTION 2 : SBP FOR FFNN/CNN AND THE RESCALING TRICK

In this section, we prove Theorem 1 and Proposition 1. Before proving Theorem 1, we state the degeneracy approximation.

Approximation 3 (Degeneracy on the Ordered phase). On the Ordered phase, the correlation $c^l$ and the variance $q^l$ converge exponentially quickly to their limiting values 1 and $q$ respectively. The degeneracy approximation for FFNN states that
• $\forall x \ne x'$, $c^l(x,x') \approx 1$;
• $\forall x$, $q^l(x) \approx q$.
For CNN,
• $\forall x \ne x', \alpha, \alpha'$, $c^l_{\alpha,\alpha'}(x,x') \approx 1$;
• $\forall x$, $q^l_\alpha(x) \approx q$.
The degeneracy approximation is essential in the proof of Theorem 1 as it allows us to avoid many unnecessary complications. However, the result holds without this approximation, although the constants may differ slightly.

Theorem 1 (Initialization is crucial for SBP). Consider a FFNN (2) or a CNN (3). If $(\sigma_w, \sigma_b)$ is chosen on the Ordered phase, i.e. $\chi(\sigma_b, \sigma_w) < 1$, then the NN is ill-conditioned. Moreover, we have
$$\mathbb{E}[s_{cr}] \le \frac{1}{L}\Big(1 + \frac{\log(\kappa L N^2)}{\kappa}\Big) + O\Big(\frac{1}{\kappa^2 \sqrt{L N^2}}\Big), \qquad \text{where } \kappa = |\log \chi(\sigma_b, \sigma_w)|/8.$$
If $(\sigma_w, \sigma_b)$ is on the EOC, i.e. $\chi(\sigma_b, \sigma_w) = 1$, then the NN is well-conditioned. In this case, $\kappa = 0$ and the above upper bound no longer holds.

Proof. We prove the result using Approximation 3.

1. Case 1: Fully connected feedforward neural networks

To simplify the notation, we assume that $N_l = N$ and $M_l = N^2$ (i.e. $\alpha_l = 1$ and $\zeta_l = 1$) for all $l$. We prove the result for the Ordered phase; the proof for the Chaotic phase is similar. Let $L_0 \gg 1$, $\epsilon \in (0, 1 - \frac{1}{L_0})$, $L \ge L_0$ and $x \in (\frac{1}{L} + \epsilon, 1)$. With sparsity $x$, we keep $k_x = (1-x) L N^2$ weights. We have
$$\mathbb{P}(s_{cr} \le x) \ge \mathbb{P}\Big(\max_{i,j} |W^1_{ij}| \Big|\frac{\partial \mathcal{L}}{\partial W^1_{ij}}\Big| < t_{(k_x)}\Big),$$
where $t_{(k_x)}$ is the $k_x^{th}$ largest element of the sequence $\{|W^l_{ij}| |\frac{\partial \mathcal{L}}{\partial W^l_{ij}}|, \ l \ge 1, (i,j) \in [1:N]^2\}$. We have
$$\frac{\partial \mathcal{L}}{\partial W^l_{ij}} = \frac{1}{|D|} \sum_{x \in D} \frac{\partial \mathcal{L}}{\partial y^l_i(x)} \frac{\partial y^l_i(x)}{\partial W^l_{ij}} = \frac{1}{|D|} \sum_{x \in D} \frac{\partial \mathcal{L}}{\partial y^l_i(x)} \phi(y^{l-1}_j(x)).$$
On the Ordered phase, the variance $q^l(x)$ and the correlation $c^l(x,x')$ converge exponentially to their limiting values $q$ and 1 (Section B.1). Under the degeneracy Approximation 3, we have $c^l(x,x') \approx 1$ for all $x \ne x'$ and $q^l(x) \approx q$ for all $x$. Let $\tilde{q}^l(x) = \mathbb{E}[(\frac{\partial \mathcal{L}}{\partial y^l_i(x)})^2]$ (the choice of $i$ is not important since the $(y^l_i(x))_i$ are iid). Under these approximations, we have $y^l_i(x) = y^l_i(x')$ almost surely for all $x, x'$. Thus
$$\mathbb{E}\Big[\Big(\frac{\partial \mathcal{L}}{\partial W^l_{ij}}\Big)^2\Big] = \mathbb{E}[\phi(\sqrt{q}\, Z)^2] \, \tilde{q}^l(x),$$
where $x$ is an input; the choice of $x$ is not important under our approximation. From Section B.1.2, we have $\tilde{q}^l(x) = \tilde{q}^{l+1}(x) \frac{N_{l+1}}{N_l} \chi$, so that $\tilde{q}^l(x) = \frac{N_L}{N_l} \tilde{q}^L(x) \chi^{L-l} = \tilde{q}^L(x) \chi^{L-l}$, where $\chi = \sigma_w^2 \mathbb{E}[\phi'(\sqrt{q}\, Z)^2]$, since we have assumed $N_l = N$. Using this result, we have $\mathbb{E}[(\frac{\partial \mathcal{L}}{\partial W^l_{ij}})^2] = A \chi^{L-l}$, where $A = \mathbb{E}[\phi(\sqrt{q}\, Z)^2] \, \tilde{q}^L(x)$ for an input $x$. Recall that, by definition, $\chi < 1$ on the Ordered phase. In the general case, i.e. without the degeneracy approximation on $c^l$ and $q^l$, we can prove that $\mathbb{E}[(\frac{\partial \mathcal{L}}{\partial W^l_{ij}})^2] = \Theta(\chi^{L-l})$, which suffices for the rest of the proof; however, establishing this requires many unnecessary complications that add no intuitive value. In the general case of unequal widths, $\tilde{q}^l$ also scales as $\chi^{L-l}$, up to a different constant. We now lower bound the probability $\mathbb{P}(\max_{i,j} |W^1_{ij}| |\frac{\partial \mathcal{L}}{\partial W^1_{ij}}| < t_{(k_x)})$.
Let $t'_{(k_x)}$ be the $k_x^{th}$ largest element of the sequence $\{|W^l_{ij}| |\frac{\partial \mathcal{L}}{\partial W^l_{ij}}|, \ l > 1 + \epsilon L, (i,j) \in [1:N]^2\}$. It is clear that $t_{(k_x)} > t'_{(k_x)}$, therefore
$$\mathbb{P}\Big(\max_{i,j} |W^1_{ij}| \Big|\frac{\partial \mathcal{L}}{\partial W^1_{ij}}\Big| < t_{(k_x)}\Big) \ge \mathbb{P}\Big(\max_{i,j} |W^1_{ij}| \Big|\frac{\partial \mathcal{L}}{\partial W^1_{ij}}\Big| < t'_{(k_x)}\Big).$$
Using Markov's inequality, we have
$$\mathbb{P}\Big(\Big|\frac{\partial \mathcal{L}}{\partial W^1_{ij}}\Big| \ge \alpha\Big) \le \frac{\mathbb{E}[(\frac{\partial \mathcal{L}}{\partial W^1_{ij}})^2]}{\alpha^2}. \tag{16}$$
Note that $\mathrm{Var}(\chi^{\frac{l-L}{2}} \frac{\partial \mathcal{L}}{\partial W^l_{ij}}) = A$. In general, the random variables $\chi^{\frac{l-L}{2}} \frac{\partial \mathcal{L}}{\partial W^l_{ij}}$ have a density $f^l_{ij}$ for all $l > 1 + \epsilon L$, $(i,j) \in [1:N]^2$, with $f^l_{ij}(0) = 0$. Therefore, there exists a constant $\lambda$ such that, for $u$ small enough, $\mathbb{P}(\chi^{\frac{l-L}{2}} |\frac{\partial \mathcal{L}}{\partial W^l_{ij}}| \ge u) \ge 1 - \lambda u$. For $l > 1 + \epsilon L$, we have $\chi^{\frac{l-L}{2}} \chi^{\frac{(1-\epsilon/2)L-1}{2}} \le \chi^{\frac{(1+\epsilon L)-L}{2}} \chi^{\frac{(1-\epsilon/2)L-1}{2}} = \chi^{\epsilon L/4}$. Therefore, for $L$ large enough and all $l > 1 + \epsilon L$, $(i,j) \in [1:N_l] \times [1:N_{l-1}]$, we have
$$\mathbb{P}\Big(\Big|\frac{\partial \mathcal{L}}{\partial W^l_{ij}}\Big| \ge \chi^{\frac{(1-\epsilon/2)L-1}{2}}\Big) \ge 1 - \lambda \chi^{\frac{l - (\epsilon L/2 + 1)}{2}} \ge 1 - \lambda \chi^{\epsilon L/4}.$$
Choosing $\alpha = \chi^{\frac{(1-\epsilon/4)L-1}{2}}$ in inequality (16) yields $\mathbb{P}(|\frac{\partial \mathcal{L}}{\partial W^1_{ij}}| \le \chi^{\frac{(1-\epsilon/4)L-1}{2}}) \ge 1 - A \chi^{\epsilon L/4}$. Since we do not know the exact distribution of the gradients, the trick is to bound them using the previous concentration inequalities. We define the event
$$B := \Big\{\forall (i,j) \in [1:N] \times [1:d], \ \Big|\frac{\partial \mathcal{L}}{\partial W^1_{ij}}\Big| \le \chi^{\frac{(1-\epsilon/4)L-1}{2}}\Big\} \cap \Big\{\forall l > 1 + \epsilon L, (i,j) \in [1:N]^2, \ \Big|\frac{\partial \mathcal{L}}{\partial W^l_{ij}}\Big| \ge \chi^{\frac{(1-\epsilon/2)L-1}{2}}\Big\}.$$
We have $\mathbb{P}(\max_{i,j} |W^1_{ij}| |\frac{\partial \mathcal{L}}{\partial W^1_{ij}}| < t'_{(k_x)}) \ge \mathbb{P}(\max_{i,j} |W^1_{ij}| |\frac{\partial \mathcal{L}}{\partial W^1_{ij}}| < t'_{(k_x)} \mid B) \, \mathbb{P}(B)$. Moreover, conditionally on the event $B$, we have
$$\mathbb{P}\Big(\max_{i,j} |W^1_{ij}| \Big|\frac{\partial \mathcal{L}}{\partial W^1_{ij}}\Big| < t'_{(k_x)} \,\Big|\, B\Big) \ge \mathbb{P}\big(\max_{i,j} |W^1_{ij}| < \chi^{-\epsilon L/8} t''_{(k_x)}\big),$$
where $t''_{(k_x)}$ is the $k_x^{th}$ largest element of the sequence $\{|W^l_{ij}|, \ l > 1 + \epsilon L, (i,j) \in [1:N]^2\}$. Now, as in the proof of Proposition 4 in Appendix E (MBP section), define $x_{\zeta,\gamma_L} = \min\{y \in (0,1) : \forall u > y, \ \gamma_L Q_u > Q_{1-(1-u)^{\gamma_L^{2-\zeta}}}\}$, where $\gamma_L = \chi^{-\epsilon L/8}$. Since $\lim_{\zeta \to 2} x_{\zeta,\gamma_L} = 0$, there exists $\zeta' < 2$ such that $x_{\zeta',\gamma_L} = \epsilon + \frac{1}{L}$. As $L$ grows, $t''_{(k_x)}$ converges to the quantile of order $\frac{x - \epsilon - \frac{1}{L}}{1-\epsilon}$.
Therefore, with $\gamma_L = \chi^{-\epsilon L/8}$ and $t''_{(k_x)}$ the $k_x^{th}$ largest of the weight magnitudes $|W^l_{ij}|$ as above,
$$\mathbb{P}\big(\max_{i,j} |W^1_{ij}| < \gamma_L t''_{(k_x)}\big) \ge \mathbb{P}\Big(\max_{i,j} |W^1_{ij}| < Q_{1-(1-\frac{x-\epsilon-1/L}{1-\epsilon})^{\gamma_L^{2-\zeta'}}}\Big) + O\Big(\frac{1}{\sqrt{LN^2}}\Big) \ge 1 - N^2 \Big(1 - \frac{x-\epsilon-\frac{1}{L}}{1-\epsilon}\Big)^{\gamma_L^{2-\zeta'}} + O\Big(\frac{1}{\sqrt{LN^2}}\Big).$$
Using the above concentration inequalities on the gradient, we obtain $\mathbb{P}(B) \ge (1 - A \chi^{\epsilon L/4})^{N^2} (1 - \lambda \chi^{\epsilon L/4})^{LN^2}$; therefore, there exists a constant $\eta > 0$, independent of $\epsilon$, such that $\mathbb{P}(B) \ge 1 - \eta L N^2 \chi^{\epsilon L/4}$. Hence, we obtain
$$\mathbb{P}(s_{cr} \ge x) \le N^2 \Big(1 - \frac{x-\epsilon-\frac{1}{L}}{1-\epsilon}\Big)^{\gamma_L^{2-\zeta'}} + \eta L N^2 \chi^{\epsilon L/4} + O\Big(\frac{1}{\sqrt{LN^2}}\Big).$$
Integrating the previous inequality yields
$$\mathbb{E}[s_{cr}] \le \epsilon + \frac{1}{L} + \frac{N^2}{1 + \gamma_L^{2-\zeta'}} + \eta L N^2 \chi^{\epsilon L/4} + O\Big(\frac{1}{\sqrt{LN^2}}\Big).$$
Now let $\kappa = \frac{|\log \chi|}{8}$ and set $\epsilon = \frac{\log(\kappa L N^2)}{\kappa L}$. By the definition of $x_{\zeta'}$, we have $\gamma_L Q_{x_{\zeta',\gamma_L}} = Q_{1-(1-x_{\zeta',\gamma_L})^{\gamma_L^{2-\zeta'}}}$. For the left-hand side, we have $\gamma_L Q_{x_{\zeta',\gamma_L}} \sim \alpha \gamma_L \frac{\log(\kappa L N^2)}{\kappa L}$, where $\alpha > 0$ is the derivative at 0 of the function $u \mapsto Q_u$. Since $\gamma_L = \kappa L N^2$, we have $\gamma_L Q_{x_{\zeta',\gamma_L}} \sim \alpha N^2 \log(\kappa L N^2)$, which diverges as $L$ tends to infinity. In particular, this proves that the right-hand side diverges, and therefore $(1 - x_{\zeta',\gamma_L})^{\gamma_L^{2-\zeta'}}$ converges to 0 as $L$ tends to infinity. Using the asymptotic equivalent of the right-hand side as $L \to \infty$, we have
$$Q_{1-(1-x_{\zeta',\gamma_L})^{\gamma_L^{2-\zeta'}}} \sim \sqrt{-2 \log\big((1-x_{\zeta',\gamma_L})^{\gamma_L^{2-\zeta'}}\big)} = \gamma_L^{1-\zeta'/2} \sqrt{-2 \log(1-x_{\zeta',\gamma_L})}.$$
Therefore, we obtain $Q_{1-(1-x_{\zeta',\gamma_L})^{\gamma_L^{2-\zeta'}}} \sim \gamma_L^{1-\zeta'/2} \sqrt{2 \frac{\log(\kappa L N^2)}{\kappa L}}$. Combining this result with $\gamma_L Q_{x_{\zeta',\gamma_L}} \sim \alpha \gamma_L \frac{\log(\kappa L N^2)}{\kappa L}$, we obtain $\gamma_L^{-\zeta'} \sim \beta \frac{\log(\kappa L N^2)}{\kappa L}$, where $\beta$ is a positive constant. This yields
$$\mathbb{E}[s_{cr}] \le \frac{\log(\kappa L N^2)}{\kappa L} + \frac{1}{L} + \frac{\mu}{\kappa L N^2 \log(\kappa L N^2)}(1 + o(1)) + \frac{\eta}{\kappa^2 L N^2} + O\Big(\frac{1}{\sqrt{LN^2}}\Big) = \frac{1}{L}\Big(1 + \frac{\log(\kappa L N^2)}{\kappa}\Big) + O\Big(\frac{1}{\kappa^2 \sqrt{LN^2}}\Big),$$
where $\kappa = \frac{|\log \chi|}{8}$ and $\mu$ is a constant.

2. Case 2: Convolutional neural networks

The proof for CNNs is similar to that for FFNNs once we prove that $\mathbb{E}[(\frac{\partial \mathcal{L}}{\partial W^l_{i,j,\beta}})^2] = A' \chi^{L-l}$, where $A'$ is a constant. We have
$$\frac{\partial \mathcal{L}}{\partial W^l_{i,j,\beta}} = \sum_\alpha \frac{\partial \mathcal{L}}{\partial y^l_{i,\alpha}} \phi(y^{l-1}_{j,\alpha+\beta}) \quad \text{and} \quad \frac{\partial \mathcal{L}}{\partial y^l_{i,\alpha}} = \sum_{j=1}^{n} \sum_{\beta \in \mathrm{ker}} \frac{\partial \mathcal{L}}{\partial y^{l+1}_{j,\alpha-\beta}} W^{l+1}_{i,j,\beta} \phi'(y^l_{i,\alpha}).$$
Using the Gradient Independence approximation and averaging over the number of channels (using the CLT), we have
$$\mathbb{E}\Big[\Big(\frac{\partial \mathcal{L}}{\partial y^l_{i,\alpha}}\Big)^2\Big] = \frac{\sigma_w^2 \mathbb{E}[\phi'(\sqrt{q^l_\alpha(x)}\, Z)^2]}{2k+1} \sum_{\beta \in \mathrm{ker}} \mathbb{E}\Big[\Big(\frac{\partial \mathcal{L}}{\partial y^{l+1}_{i,\alpha-\beta}}\Big)^2\Big].$$
Summing over $\alpha$ and using the periodic boundary conditions, this yields
$$\sum_\alpha \mathbb{E}\Big[\Big(\frac{\partial \mathcal{L}}{\partial y^l_{i,\alpha}}\Big)^2\Big] = \chi \sum_\alpha \mathbb{E}\Big[\Big(\frac{\partial \mathcal{L}}{\partial y^{l+1}_{i,\alpha}}\Big)^2\Big].$$
Here also, on the Ordered phase, the variance $q^l$ and the correlation $c^l$ converge exponentially to their limiting values $q$ and 1 respectively. As for FFNNs, we use the degeneracy approximation: $c^l_{\alpha,\alpha'}(x,x') \approx 1$ for all $x \ne x', \alpha, \alpha'$, and $q^l_\alpha(x) \approx q$ for all $x$. Using these approximations, we have $\mathbb{E}[(\frac{\partial \mathcal{L}}{\partial W^l_{i,j,\beta}})^2] = \mathbb{E}[\phi(\sqrt{q}\, Z)^2] \, \tilde{q}^l(x)$, where $\tilde{q}^l(x) = \sum_\alpha \mathbb{E}[(\frac{\partial \mathcal{L}}{\partial y^l_{i,\alpha}(x)})^2]$ for an input $x$; the choice of $x$ is not important under our approximation. From the analysis above, we have $\tilde{q}^l(x) = \tilde{q}^L(x) \chi^{L-l}$, so we conclude that $\mathbb{E}[(\frac{\partial \mathcal{L}}{\partial W^l_{i,j,\beta}})^2] = A' \chi^{L-l}$ with $A' = \mathbb{E}[\phi(\sqrt{q}\, Z)^2] \, \tilde{q}^L(x)$.

After pruning, the network is typically 'deep' in the Ordered phase, in the sense that $\chi = f'(1) \ll 1$. To place it back on the Edge of Chaos, we use the Rescaling Trick.

Proposition 1 (Rescaling Trick). Consider a NN of the form (2) or (3) (FFNN or CNN) initialized on the EOC. Then, after pruning, the sparse network is no longer initialized on the EOC. However, the rescaled sparse network
$$\tilde{y}^l(x) = \mathcal{F}_l(\rho^l \circ \delta^l \circ W^l, \tilde{y}^{l-1}(x)) + B^l, \quad l \ge 1,$$
where
• $\rho^l_{ij} = \frac{1}{\sqrt{\mathbb{E}[N_{l-1} (W^l_{i1})^2 \delta^l_{i1}]}}$ for a FFNN of the form (2),
• $\rho^l_{i,j,\beta} = \frac{1}{\sqrt{\mathbb{E}[n_{l-1} (W^l_{i,1,\beta})^2 \delta^l_{i,1,\beta}]}}$ for a CNN of the form (3),
is initialized on the EOC.

Proof. For two inputs $x, x'$, the forward propagation of the covariance is given by
$$\hat{q}^l(x,x') = \mathbb{E}[\hat{y}^l_i(x)\, \hat{y}^l_i(x')] = \mathbb{E}\Big[\sum_{j,k}^{N_{l-1}} W^l_{ij} W^l_{ik} \delta^l_{ij} \delta^l_{ik} \phi(\hat{y}^{l-1}_j(x)) \phi(\hat{y}^{l-1}_k(x'))\Big] + \sigma_b^2.$$
We have
$$\frac{\partial \mathcal{L}}{\partial W^l_{ij}} = \frac{1}{|D|} \sum_{x \in D} \frac{\partial \mathcal{L}}{\partial y^l_i(x)} \frac{\partial y^l_i(x)}{\partial W^l_{ij}} = \frac{1}{|D|} \sum_{x \in D} \frac{\partial \mathcal{L}}{\partial y^l_i(x)} \phi(y^{l-1}_j(x)).$$
Under the assumption that the weights used for forward propagation are independent of the weights used for back-propagation, $W^l_{ij}$ and $\frac{\partial \mathcal{L}}{\partial y^l_i(x)}$ are independent for all $x \in D$. We also have that $W^l_{ij}$ and $\phi(y^{l-1}_j(x))$ are independent for all $x \in D$. Therefore, $W^l_{ij}$ and $\frac{\partial \mathcal{L}}{\partial W^l_{ij}}$ are independent for all $l, i, j$. This yields
$$\hat{q}^l(x,x') = \sigma_w^2 \alpha_l \, \mathbb{E}[\phi(\hat{y}^{l-1}_1(x))\, \phi(\hat{y}^{l-1}_1(x'))] + \sigma_b^2,$$
where $\alpha_l = \mathbb{E}[N_{l-1} (W^l_{11})^2 \delta^l_{11}]$ (the choice of $i, j$ does not matter because the weights are iid; here, as in Proposition 1, the expectation is taken with respect to standardized weights, so that $\alpha_l = 1$ in the absence of pruning). Unless no weight is pruned from the $l^{th}$ layer, we have $\alpha_l < 1$. These dynamics are the same as those of a FFNN with weight variance $\tilde{\sigma}_w^2 = \sigma_w^2 \alpha_l$. Since the EOC equation is $\sigma_w^2 \mathbb{E}[\phi'(\sqrt{q}\, Z)^2] = 1$, with the new variance we have, in general, $\tilde{\sigma}_w^2 \mathbb{E}[\phi'(\sqrt{\tilde{q}}\, Z)^2] \ne 1$. Hence, the network is no longer on the EOC, which can be problematic for training. With the rescaling, this becomes
$$\tilde{q}^l(x,x') = \sigma_w^2 \rho_l^2 \alpha_l \, \mathbb{E}[\phi(\tilde{y}^{l-1}_1(x))\, \phi(\tilde{y}^{l-1}_1(x'))] + \sigma_b^2 = \sigma_w^2 \, \mathbb{E}[\phi(\tilde{y}^{l-1}_1(x))\, \phi(\tilde{y}^{l-1}_1(x'))] + \sigma_b^2.$$
Therefore, the new variance after rescaling is $\tilde{\sigma}_w^2 = \sigma_w^2$, and the limiting variance $\tilde{q} = q$ also remains unchanged since the dynamics are the same. Therefore $\tilde{\sigma}_w^2 \mathbb{E}[\phi'(\sqrt{\tilde{q}}\, Z)^2] = \sigma_w^2 \mathbb{E}[\phi'(\sqrt{q}\, Z)^2] = 1$. Thus, the rescaled network is initialized on the EOC. The proof is similar for CNNs.
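A numerical sketch of the Rescaling Trick for one fully connected layer (illustrative only; it assumes ReLU EOC initialization $\sigma_w = \sqrt{2}$, magnitude pruning, and writes $W = \sigma_w \omega$ with standardized $\omega$ so that $\alpha_l = 1$ when nothing is pruned):

```python
import numpy as np

rng = np.random.default_rng(0)
N, sigma_w, sparsity = 512, np.sqrt(2.0), 0.7
omega = rng.normal(0.0, 1.0 / np.sqrt(N), size=(N, N))   # standardized weights
W = sigma_w * omega
delta = (np.abs(W) > np.quantile(np.abs(W), sparsity)).astype(float)  # pruning mask

# Pruning shrinks the effective variance by alpha_l = E[N * omega^2 * delta] < 1,
# pushing the layer off the EOC; rho = 1 / sqrt(alpha_l) undoes exactly that.
alpha = N * np.mean(omega ** 2 * delta)
rho = 1.0 / np.sqrt(alpha)
W_rescaled = rho * delta * W

# Effective variance N * E[(rho * delta * W)^2] is restored to sigma_w^2 = 2.
effective_var = N * np.mean(W_rescaled ** 2)
```

Note that magnitude pruning keeps the largest weights, so $\alpha_l$ stays well above the kept fraction $1 - s$; the rescaling only needs $\alpha_l$, not the mask pattern.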

D PROOF FOR SECTION 3 : SBP FOR STABLE RESIDUAL NETWORKS

Theorem 2 (ResNets are well-conditioned). Consider a ResNet with either fully connected or convolutional layers and ReLU activation. Then, for all $\sigma_w > 0$, the ResNet is well-conditioned. Moreover, for all $l \in \{1, \dots, L\}$, $m_l = \Theta\big((1 + \frac{\sigma_w^2}{2})^L\big)$.

Proof. Let us start with the case of a ResNet with fully connected layers. We have
$$\frac{\partial \mathcal{L}}{\partial W^l_{ij}} = \frac{1}{|D|} \sum_{x \in D} \frac{\partial \mathcal{L}}{\partial y^l_i(x)} \frac{\partial y^l_i(x)}{\partial W^l_{ij}} = \frac{1}{|D|} \sum_{x \in D} \frac{\partial \mathcal{L}}{\partial y^l_i(x)} \phi(y^{l-1}_j(x)),$$
and the back-propagation of the gradient is given by the set of equations
$$\frac{\partial \mathcal{L}}{\partial y^l_i} = \frac{\partial \mathcal{L}}{\partial y^{l+1}_i} + \phi'(y^l_i) \sum_{j=1}^{N_{l+1}} \frac{\partial \mathcal{L}}{\partial y^{l+1}_j} W^{l+1}_{ji}.$$
Recall that $q^l(x) = \mathbb{E}[y^l_i(x)^2]$ and $\tilde{q}^l(x,x') = \mathbb{E}[\frac{\partial \mathcal{L}}{\partial y^l_i(x)} \frac{\partial \mathcal{L}}{\partial y^l_i(x')}]$ for inputs $x, x'$. We have
$$q^l(x) = \mathbb{E}[y^{l-1}_i(x)^2] + \sigma_w^2 \mathbb{E}[\phi(y^{l-1}_1(x))^2] = \Big(1 + \frac{\sigma_w^2}{2}\Big) q^{l-1}(x)$$
and
$$\tilde{q}^l(x,x') = \big(1 + \sigma_w^2 \, \mathbb{E}[\phi'(y^l_i(x))\, \phi'(y^l_i(x'))]\big)\, \tilde{q}^{l+1}(x,x').$$
We also have $\mathbb{E}[(\frac{\partial \mathcal{L}}{\partial W^l_{ij}})^2] = \frac{1}{|D|^2} \sum_{x,x'} t^l_{x,x'}$, where
$$t^l_{x,x'} = \tilde{q}^l(x,x') \sqrt{q^{l-1}(x)\, q^{l-1}(x')}\, f(c^{l-1}(x,x'))$$
and $f$ is defined in the preliminary results (Eq. 15). Let $k \in \{1, 2, \dots, L\}$ be fixed. We compare the terms $t^l_{x,x'}$ for $l = k$ and $l = L$. After simplification, the ratio between the two terms is given by
$$\frac{t^k_{x,x'}}{t^L_{x,x'}} = \frac{\prod_{l=k}^{L-1} \big(1 + \frac{\sigma_w^2}{2} f'(c^l(x,x'))\big)}{\big(1 + \frac{\sigma_w^2}{2}\big)^{L-k}} \cdot \frac{f(c^{k-1}(x,x'))}{f(c^{L-1}(x,x'))}.$$
We have $f'(c^l(x,x)) = f'(1) = 1$. A Taylor expansion of $f$ near 1 yields $f(c^l(x,x')) = 1 - l^{-1} + o(l^{-1})$ and $f'(c^l(x,x')) = 1 - s l^{-2} + o(l^{-2})$ (see Hayou et al. (2019) for more details). Therefore, there exist two constants $A, B > 0$ such that
$$A < \frac{\prod_{l=k}^{L-1} \big(1 + \frac{\sigma_w^2}{2} f'(c^l(x,x'))\big)}{\big(1 + \frac{\sigma_w^2}{2}\big)^{L-k}} < B$$
for all $L$ and $k \in \{1, 2, \dots, L\}$. This yields $A \le \mathbb{E}[(\frac{\partial \mathcal{L}}{\partial W^k_{ij}})^2] \,/\, \mathbb{E}[(\frac{\partial \mathcal{L}}{\partial W^L_{ij}})^2] \le B$, which concludes the proof for fully connected layers. For a ResNet with convolutional layers, we have
$$\frac{\partial \mathcal{L}}{\partial W^l_{i,j,\beta}} = \frac{1}{|D|} \sum_{x \in D} \sum_\alpha \frac{\partial \mathcal{L}}{\partial y^l_{i,\alpha}(x)} \phi(y^{l-1}_{j,\alpha+\beta}(x)) \quad \text{and} \quad \frac{\partial \mathcal{L}}{\partial y^l_{i,\alpha}} = \frac{\partial \mathcal{L}}{\partial y^{l+1}_{i,\alpha}} + \sum_{j=1}^{n} \sum_{\beta \in \mathrm{ker}} \frac{\partial \mathcal{L}}{\partial y^{l+1}_{j,\alpha-\beta}} W^{l+1}_{i,j,\beta} \phi'(y^l_{i,\alpha}).$$
Recall the notation $\tilde{q}^l_{\alpha,\alpha'}(x,x') = \mathbb{E}[\frac{\partial \mathcal{L}}{\partial y^l_{i,\alpha}(x)} \frac{\partial \mathcal{L}}{\partial y^l_{i,\alpha'}(x')}]$. Using the hypothesis of independence of the forward and backward weights and averaging over the number of channels (using the CLT), we have
$$\tilde{q}^l_{\alpha,\alpha'}(x,x') = \tilde{q}^{l+1}_{\alpha,\alpha'}(x,x') + \frac{\sigma_w^2 f'(c^l_{\alpha,\alpha'}(x,x'))}{2(2k+1)} \sum_\beta \tilde{q}^{l+1}_{\alpha+\beta,\alpha'+\beta}(x,x').$$
Let $K^l = ((\tilde{q}^l_{\alpha,\alpha+\beta}(x,x'))_{\alpha \in [0:N-1]})_{\beta \in [0:N-1]}$ be a vector in $\mathbb{R}^{N^2}$. Writing the previous equation in matrix form, we obtain
$$K^l = \Big(I + \frac{\sigma_w^2 f'(c^l_{\alpha,\alpha'}(x,x'))}{2(2k+1)} U\Big) K^{l+1}$$
and $\mathbb{E}[(\frac{\partial \mathcal{L}}{\partial W^l_{i,j,\beta}})^2] = \frac{1}{|D|^2} \sum_{x,x' \in D} \sum_{\alpha,\alpha'} t^l_{\alpha,\alpha'}(x,x')$, where
$$t^l_{\alpha,\alpha'}(x,x') = \tilde{q}^l_{\alpha,\alpha'}(x,x') \sqrt{q^{l-1}_{\alpha+\beta}(x)\, q^{l-1}_{\alpha'+\beta}(x')}\, f(c^{l-1}_{\alpha+\beta,\alpha'+\beta}(x,x')).$$
Since $f'(c^l_{\alpha,\alpha'}(x,x')) \to 1$, by fixing $l$ and letting $L$ tend to infinity, it follows that $K^l \sim_{L \to \infty} (1 + \frac{\sigma_w^2}{2})^{L-l} e_1 e_1^T K^L$ and, from Appendix Lemma 2, we know that $\sqrt{q^l_{\alpha+\beta}(x)\, q^l_{\alpha'+\beta}(x')} = (1 + \frac{\sigma_w^2}{2})^{l-1} \sqrt{q_{0,x}\, q_{0,x'}}$. Therefore, for a fixed $k < L$, we have
$$t^k_{\alpha,\alpha'}(x,x') \sim \Big(1 + \frac{\sigma_w^2}{2}\Big)^{L-1} f(c^{k-1}_{\alpha+\beta,\alpha'+\beta}(x,x')) \, (e_1^T K^L) = \Theta\big(t^L_{\alpha,\alpha'}(x,x')\big).$$
This concludes the proof.

Proposition 2 (Stable ResNet). Consider the following ResNet parameterization:
$$y^l(x) = y^{l-1}(x) + \frac{1}{\sqrt{L}} \mathcal{F}_l(W^l, y^{l-1}(x)), \quad l \ge 2.$$
Then the network is well-conditioned for all choices of $\sigma_w > 0$. Moreover, for all $l \in \{1, \dots, L\}$ we have $m_l = \Theta(L^{-1})$.

Proof. The proof is similar to that of Theorem 2, with minor differences. Let us start with the case of a ResNet with fully connected layers. We have
$$\frac{\partial \mathcal{L}}{\partial W^l_{ij}} = \frac{1}{|D| \sqrt{L}} \sum_{x \in D} \frac{\partial \mathcal{L}}{\partial y^l_i(x)} \phi(y^{l-1}_j(x)),$$
and the back-propagation of the gradient is given by
$$\frac{\partial \mathcal{L}}{\partial y^l_i} = \frac{\partial \mathcal{L}}{\partial y^{l+1}_i} + \frac{1}{\sqrt{L}} \phi'(y^l_i) \sum_{j=1}^{N_{l+1}} \frac{\partial \mathcal{L}}{\partial y^{l+1}_j} W^{l+1}_{ji}.$$
Recall that $q^l(x) = \mathbb{E}[y^l_i(x)^2]$ and $\tilde{q}^l(x,x') = \mathbb{E}[\frac{\partial \mathcal{L}}{\partial y^l_i(x)} \frac{\partial \mathcal{L}}{\partial y^l_i(x')}]$. We have
$$q^l(x) = \mathbb{E}[y^{l-1}_i(x)^2] + \frac{\sigma_w^2}{L} \mathbb{E}[\phi(y^{l-1}_1(x))^2] = \Big(1 + \frac{\sigma_w^2}{2L}\Big) q^{l-1}(x)$$
and
$$\tilde{q}^l(x,x') = \Big(1 + \frac{\sigma_w^2}{L} \mathbb{E}[\phi'(y^l_i(x))\, \phi'(y^l_i(x'))]\Big)\, \tilde{q}^{l+1}(x,x').$$
We also have $\mathbb{E}[(\frac{\partial \mathcal{L}}{\partial W^l_{ij}})^2] = \frac{1}{L|D|^2} \sum_{x,x'} t^l_{x,x'}$, where $t^l_{x,x'} = \tilde{q}^l(x,x') \sqrt{q^{l-1}(x)\, q^{l-1}(x')}\, f(c^{l-1}(x,x'))$ and $f$ is defined in the preliminary results (Eq. 15). Let $k \in \{1, \dots, L\}$ be fixed. We compare the terms $t^l_{x,x'}$ for $l = k$ and $l = L$. After simplification, the ratio between the two terms is
$$\frac{t^k_{x,x'}}{t^L_{x,x'}} = \frac{\prod_{l=k}^{L-1} \big(1 + \frac{\sigma_w^2}{2L} f'(c^l(x,x'))\big)}{\big(1 + \frac{\sigma_w^2}{2L}\big)^{L-k}} \cdot \frac{f(c^{k-1}(x,x'))}{f(c^{L-1}(x,x'))}.$$
As in the proof of Theorem 2, we have $f'(c^l(x,x)) = 1$, $f(c^l(x,x')) = 1 - l^{-1} + o(l^{-1})$ and $f'(c^l(x,x')) = 1 - s l^{-2} + o(l^{-2})$. Therefore, there exist two constants $A, B > 0$ such that $A < \frac{\prod_{l=k}^{L-1}(1 + \frac{\sigma_w^2}{2L} f'(c^l(x,x')))}{(1 + \frac{\sigma_w^2}{2L})^{L-k}} < B$ for all $L$ and $k \in \{1, \dots, L\}$. This yields $A \le \mathbb{E}[(\frac{\partial \mathcal{L}}{\partial W^k_{ij}})^2] \,/\, \mathbb{E}[(\frac{\partial \mathcal{L}}{\partial W^L_{ij}})^2] \le B$. Moreover, since $(1 + \frac{\sigma_w^2}{2L})^L \to e^{\sigma_w^2/2}$, we have $m_l = \Theta(L^{-1})$ for all $l \in \{1, \dots, L\}$, which concludes the proof for fully connected layers. For a ResNet with convolutional layers, the proof is similar. With the scaling, we have
$$\frac{\partial \mathcal{L}}{\partial W^l_{i,j,\beta}} = \frac{1}{\sqrt{L}|D|} \sum_{x \in D} \sum_\alpha \frac{\partial \mathcal{L}}{\partial y^l_{i,\alpha}(x)} \phi(y^{l-1}_{j,\alpha+\beta}(x)) \quad \text{and} \quad \frac{\partial \mathcal{L}}{\partial y^l_{i,\alpha}} = \frac{\partial \mathcal{L}}{\partial y^{l+1}_{i,\alpha}} + \frac{1}{\sqrt{L}} \sum_{j=1}^{n} \sum_{\beta \in \mathrm{ker}} \frac{\partial \mathcal{L}}{\partial y^{l+1}_{j,\alpha-\beta}} W^{l+1}_{i,j,\beta} \phi'(y^l_{i,\alpha}).$$
Let $\tilde{q}^l_{\alpha,\alpha'}(x,x') = \mathbb{E}[\frac{\partial \mathcal{L}}{\partial y^l_{i,\alpha}(x)} \frac{\partial \mathcal{L}}{\partial y^l_{i,\alpha'}(x')}]$. Using the hypothesis of independence of the forward and backward weights and averaging over the number of channels (using the CLT), we have
$$\tilde{q}^l_{\alpha,\alpha'}(x,x') = \tilde{q}^{l+1}_{\alpha,\alpha'}(x,x') + \frac{\sigma_w^2 f'(c^l_{\alpha,\alpha'}(x,x'))}{2(2k+1)L} \sum_\beta \tilde{q}^{l+1}_{\alpha+\beta,\alpha'+\beta}(x,x').$$
Let $K^l = ((\tilde{q}^l_{\alpha,\alpha+\beta}(x,x'))_{\alpha \in [0:N-1]})_{\beta \in [0:N-1]}$, a vector in $\mathbb{R}^{N^2}$. Writing the previous equation in matrix form, we have
$$K^l = \Big(I + \frac{\sigma_w^2 f'(c^l_{\alpha,\alpha'}(x,x'))}{2(2k+1)L} U\Big) K^{l+1}, \qquad \mathbb{E}\Big[\Big(\frac{\partial \mathcal{L}}{\partial W^l_{i,j,\beta}}\Big)^2\Big] = \frac{1}{L|D|^2} \sum_{x,x' \in D} \sum_{\alpha,\alpha'} t^l_{\alpha,\alpha'}(x,x'),$$
where $t^l_{\alpha,\alpha'}(x,x') = \tilde{q}^l_{\alpha,\alpha'}(x,x') \sqrt{q^{l-1}_{\alpha+\beta}(x)\, q^{l-1}_{\alpha'+\beta}(x')}\, f(c^{l-1}_{\alpha+\beta,\alpha'+\beta}(x,x'))$.
Since $f'(c^l_{\alpha,\alpha'}(x,x')) \to 1$, by fixing $l$ and letting $L$ tend to infinity we obtain $K^l \sim_{L \to \infty} (1 + \frac{\sigma_w^2}{2L})^{L-l} e_1 e_1^T K^L$, and we know from Appendix Lemma 2 (using $\lambda_\beta = \frac{\sigma_w^2}{2L}$ for all $\beta$) that $\sqrt{q^l_{\alpha+\beta}(x)\, q^l_{\alpha'+\beta}(x')} = (1 + \frac{\sigma_w^2}{2L})^{l-1} \sqrt{q_{0,x}\, q_{0,x'}}$. Therefore, for a fixed $k < L$, we have
$$t^k_{\alpha,\alpha'}(x,x') \sim \Big(1 + \frac{\sigma_w^2}{2L}\Big)^{L-1} f(c^{k-1}_{\alpha+\beta,\alpha'+\beta}(x,x')) \, (e_1^T K^L) = \Theta\big(t^L_{\alpha,\alpha'}(x,x')\big),$$
which proves that the Stable ResNet is well-conditioned. Moreover, since $(1 + \frac{\sigma_w^2}{2L})^{L-1} \to e^{\sigma_w^2/2}$, we have $m_l = \Theta(L^{-1})$ for all $l$.

In the next lemma, we study the asymptotic behaviour of the variance $q^l_\alpha$. We show that, as $l \to \infty$, a self-averaging phenomenon makes $q^l_\alpha$ independent of $\alpha$.

Appendix Lemma 2. Let $x \in \mathbb{R}^d$. Assume the sequence $(a_{l,\alpha})_{l,\alpha}$ is given by the recursive formula $a_{l,\alpha} = a_{l-1,\alpha} + \sum_{\beta \in \mathrm{ker}} \lambda_\beta a_{l-1,\alpha+\beta}$, where $\lambda_\beta > 0$ for all $\beta$. Then there exists $\zeta > 0$ such that, for all $x \in \mathbb{R}^d$ and all $\alpha$,
$$a_{l,\alpha}(x) = \Big(1 + \sum_\beta \lambda_\beta\Big)^l a_0 + O\Big(\Big(1 + \sum_\beta \lambda_\beta\Big)^l e^{-\zeta l}\Big),$$
where $a_0$ is a constant and the $O$ is uniform in $\alpha$.

Proof. Recall that $a_{l,\alpha} = a_{l-1,\alpha} + \sum_{\beta \in \mathrm{ker}} \lambda_\beta a_{l-1,\alpha+\beta}$. We rewrite this expression in matrix form as $A_l = U A_{l-1}$, where $A_l = (a_{l,\alpha})_\alpha$ is a vector in $\mathbb{R}^N$ and $U$ is the convolution matrix. As an example, for $k = 1$, $U$ is given by
$$U = \begin{pmatrix} 1+\lambda_0 & \lambda_1 & 0 & \cdots & 0 & \lambda_{-1} \\ \lambda_{-1} & 1+\lambda_0 & \lambda_1 & 0 & \cdots & 0 \\ 0 & \lambda_{-1} & 1+\lambda_0 & \lambda_1 & \cdots & 0 \\ \vdots & & \ddots & \ddots & \ddots & \vdots \\ \lambda_1 & 0 & \cdots & 0 & \lambda_{-1} & 1+\lambda_0 \end{pmatrix}.$$
Letting $b = 1 + \sum_\beta \lambda_\beta$ denote the largest eigenvalue of $U$, we have $A_l = U^l A_0 = b^l (b^{-l} U^l) A_0 = b^l \big(e_1 e_1^T A_0 + O(e^{-\zeta l})\big)$, which concludes the proof.

Unlike for FFNNs or CNNs, we do not need to rescale the pruned network: the next proposition establishes that a ResNet lives on the EOC, in the sense that the correlation between $y^l_i(x)$ and $y^l_i(x')$ converges to 1 at a sub-exponential $O(l^{-2})$ rate.

Proposition 3 (ResNets live on the EOC, even after pruning). Let $x \ne x'$ be two inputs. The following statements hold:
1. For a ResNet with fully connected layers, let $\hat{c}^l(x,x')$ be the correlation between $\hat{y}^l_i(x)$ and $\hat{y}^l_i(x')$ after pruning the network. Then we have $1 - \hat{c}^l(x,x') \sim \frac{\kappa}{l^2}$, where $\kappa > 0$ is a constant.

2. For a ResNet with convolutional layers, let $\hat{c}^l(x,x') = \frac{\sum_{\alpha,\alpha'} \mathbb{E}[y^l_{1,\alpha}(x)\, y^l_{1,\alpha'}(x')]}{\sum_{\alpha,\alpha'} \sqrt{q^l_\alpha(x)} \sqrt{q^l_{\alpha'}(x')}}$ be an 'average' correlation after pruning the network. Then we have $1 - \hat{c}^l(x,x') \asymp l^{-2}$.

Proof. 1. Let $x$ and $x'$ be two inputs. The covariance of $\hat{y}^l_i(x)$ and $\hat{y}^l_i(x')$ is given by
$$\hat{q}^l(x,x') = \hat{q}^{l-1}(x,x') + \alpha \, \mathbb{E}_{(Z_1,Z_2) \sim \mathcal{N}(0,Q^{l-1})}[\phi(Z_1)\phi(Z_2)],$$
where $Q^{l-1} = \begin{pmatrix} \hat{q}^{l-1}(x) & \hat{q}^{l-1}(x,x') \\ \hat{q}^{l-1}(x,x') & \hat{q}^{l-1}(x') \end{pmatrix}$ and $\alpha = \mathbb{E}[N_{l-1} (W^l_{11})^2 \delta^l_{11}]$. Consequently, we have $\hat{q}^l(x) = (1 + \frac{\alpha}{2}) \hat{q}^{l-1}(x)$. Therefore, we obtain
$$\hat{c}^l(x,x') = \frac{1}{1+\lambda} \hat{c}^{l-1}(x,x') + \frac{\lambda}{1+\lambda} f(\hat{c}^{l-1}(x,x')),$$
where $\lambda = \frac{\alpha}{2}$, $f(x) = 2\,\mathbb{E}[\phi(Z_1)\phi(x Z_1 + \sqrt{1-x^2}\, Z_2)]$ and $Z_1, Z_2$ are iid standard normal variables. Using the fact that $f$ is increasing (Section B.1), it is easy to see that $\hat{c}^l(x,x') \to 1$. Let $\zeta_l = 1 - \hat{c}^l(x,x')$. Using a Taylor expansion of $f$ near 1 (Section B.1), $f(x) =_{x \to 1^-} x + \beta(1-x)^{3/2} + O((1-x)^{5/2})$, it follows that
$$\zeta_l = \zeta_{l-1} - \eta \zeta_{l-1}^{3/2} + O(\zeta_{l-1}^{5/2}), \qquad \text{where } \eta = \frac{\lambda \beta}{1+\lambda}.$$
Using the asymptotic expansion $\zeta_l^{-1/2} = \zeta_{l-1}^{-1/2} + \frac{\eta}{2} + O(\zeta_{l-1})$, this yields $\zeta_l^{-1/2} \sim_{l \to \infty} \frac{\eta}{2} l$. We conclude that $1 - \hat{c}^l(x,x') \sim \frac{4}{\eta^2 l^2}$.

2. Let $x$ be an input. Recall the forward propagation of a pruned 1D CNN:
$$y^l_{i,\alpha}(x) = y^{l-1}_{i,\alpha}(x) + \sum_{j=1}^{c} \sum_{\beta \in \mathrm{ker}} \delta^l_{i,j,\beta} W^l_{i,j,\beta} \phi(y^{l-1}_{j,\alpha+\beta}(x)) + b^l_i.$$
Unlike in a FFNN, neurons in the same channel are correlated since the same filters are used for all of them. Let $x, x'$ be two inputs and $\alpha, \alpha'$ two nodes in the same channel $i$. Using the Central Limit Theorem in the limit of a large number of channels $n_l$, we have
$$\mathbb{E}[y^l_{i,\alpha}(x)\, y^l_{i,\alpha'}(x')] = \mathbb{E}[y^{l-1}_{i,\alpha}(x)\, y^{l-1}_{i,\alpha'}(x')] + \frac{1}{2k+1} \sum_{\beta \in \mathrm{ker}} \alpha_\beta \, \mathbb{E}[\phi(y^{l-1}_{1,\alpha+\beta}(x))\, \phi(y^{l-1}_{1,\alpha'+\beta}(x'))],$$
where $\alpha_\beta = \mathbb{E}[n_{l-1} \delta^l_{i,1,\beta} (W^l_{i,1,\beta})^2]$.
Let $q^l_\alpha(x) = \mathbb{E}[y^l_{1,\alpha}(x)^2]$; the choice of the channel is not important since, for a given $\alpha$, the neurons $(y^l_{i,\alpha}(x))_{i \in [c]}$ are iid. Using the previous formula, we have
$$q^l_\alpha(x) = q^{l-1}_\alpha(x) + \frac{1}{2k+1} \sum_{\beta \in \mathrm{ker}} \alpha_\beta \, \mathbb{E}[\phi(y^{l-1}_{1,\alpha+\beta}(x))^2] = q^{l-1}_\alpha(x) + \frac{1}{2k+1} \sum_{\beta \in \mathrm{ker}} \alpha_\beta \frac{q^{l-1}_{\alpha+\beta}(x)}{2}.$$
Therefore, letting $\bar{q}^l(x) = \frac{1}{N} \sum_{\alpha \in [N]} q^l_\alpha(x)$ and $\sigma = \frac{\sum_\beta \alpha_\beta}{2k+1}$, we obtain
$$\bar{q}^l(x) = \bar{q}^{l-1}(x) + \frac{1}{N(2k+1)} \sum_{\beta \in \mathrm{ker}} \alpha_\beta \sum_{\alpha \in [N]} \frac{q^{l-1}_{\alpha+\beta}(x)}{2} = \Big(1 + \frac{\sigma}{2}\Big) \bar{q}^{l-1}(x) = \Big(1 + \frac{\sigma}{2}\Big)^{l-1} \bar{q}^1(x),$$
where we have used the periodicity $q^{l-1}_\alpha = q^{l-1}_{\alpha-N} = q^{l-1}_{\alpha+N}$. Moreover, we have $\min_\alpha q^l_\alpha(x) \ge (1 + \frac{\sigma}{2}) \min_\alpha q^{l-1}_\alpha(x) \ge (1 + \frac{\sigma}{2})^{l-1} \min_\alpha q^1_\alpha(x)$. The convolutional structure makes it hard to analyse the correlation between the values of a neuron for two different inputs. Xiao et al. (2018) studied the correlation between the values of two neurons in the same channel for the same input. Although this captures how the input structure (say, how different pixels propagate together) evolves inside the network, it does not provide any information on how structures from different inputs propagate. To resolve this situation, we study the 'average' correlation per channel defined as
$$c^l(x,x') = \frac{\sum_{\alpha,\alpha'} \mathbb{E}[y^l_{1,\alpha}(x)\, y^l_{1,\alpha'}(x')]}{\sum_{\alpha,\alpha'} \sqrt{q^l_\alpha(x)} \sqrt{q^l_{\alpha'}(x')}}$$
for any two inputs $x \ne x'$. We also define $\bar{c}^l(x,x')$ by
$$\bar{c}^l(x,x') = \frac{\frac{1}{N^2} \sum_{\alpha,\alpha'} \mathbb{E}[y^l_{1,\alpha}(x)\, y^l_{1,\alpha'}(x')]}{\sqrt{\frac{1}{N} \sum_\alpha q^l_\alpha(x)} \sqrt{\frac{1}{N} \sum_{\alpha'} q^l_{\alpha'}(x')}}.$$
Using the concavity of the square root function, we have
$$\sqrt{\frac{1}{N} \sum_\alpha q^l_\alpha(x)} \sqrt{\frac{1}{N} \sum_{\alpha'} q^l_{\alpha'}(x')} \ge \frac{1}{N^2} \sum_{\alpha,\alpha'} \sqrt{q^l_\alpha(x)} \sqrt{q^l_{\alpha'}(x')} \ge \frac{1}{N^2} \sum_{\alpha,\alpha'} \big|\mathbb{E}[y^l_{1,\alpha}(x)\, y^l_{1,\alpha'}(x')]\big|.$$
This yields $\bar{c}^l(x,x') \le c^l(x,x') \le 1$. Using Appendix Lemma 2 twice, with $a_{l,\alpha} = q^l_\alpha(x)$, $a_{l,\alpha} = q^l_\alpha(x')$ and $\lambda_\beta = \frac{\alpha_\beta}{2(2k+1)}$, there exists $\zeta > 0$ such that
$$c^l(x,x') = \bar{c}^l(x,x') \big(1 + O(e^{-\zeta l})\big). \tag{19}$$
This result shows that the limiting behaviour of $c^l(x,x')$ is equivalent to that of $\bar{c}^l(x,x')$ up to an exponentially small factor.
We study hereafter the behaviour of $\bar{c}^l(x,x')$ and use this result to conclude. Recall that
$$\mathbb{E}[y^l_{i,\alpha}(x)\, y^l_{i,\alpha'}(x')] = \mathbb{E}[y^{l-1}_{i,\alpha}(x)\, y^{l-1}_{i,\alpha'}(x')] + \frac{1}{2k+1}\sum_{\beta\in\mathrm{ker}} \alpha_\beta\, \mathbb{E}[\phi(y^{l-1}_{1,\alpha+\beta}(x))\,\phi(y^{l-1}_{1,\alpha'+\beta}(x'))].$$
Therefore,
$$\sum_{\alpha,\alpha'} \mathbb{E}[y^l_{1,\alpha}(x)\, y^l_{1,\alpha'}(x')] = \sum_{\alpha,\alpha'} \mathbb{E}[y^{l-1}_{1,\alpha}(x)\, y^{l-1}_{1,\alpha'}(x')] + \sigma \sum_{\alpha,\alpha'} \mathbb{E}[\phi(y^{l-1}_{1,\alpha}(x))\,\phi(y^{l-1}_{1,\alpha'}(x'))] = \sum_{\alpha,\alpha'} \mathbb{E}[y^{l-1}_{1,\alpha}(x)\, y^{l-1}_{1,\alpha'}(x')] + \frac{\sigma}{2}\sum_{\alpha,\alpha'} \sqrt{q^{l-1}_\alpha(x)}\sqrt{q^{l-1}_{\alpha'}(x')}\, f(c^{l-1}_{\alpha,\alpha'}(x,x')),$$
where $f$ is the correlation function of ReLU and $c^{l-1}_{\alpha,\alpha'}(x,x')$ denotes the correlation between $y^{l-1}_{1,\alpha}(x)$ and $y^{l-1}_{1,\alpha'}(x')$. Let us first prove that $\bar{c}^l(x,x')$ converges to 1. Using the fact that $f(z) \ge z$ for all $z\in(0,1)$ (Section B.1), we have
$$\sum_{\alpha,\alpha'} \mathbb{E}[y^l_{1,\alpha}(x)\, y^l_{1,\alpha'}(x')] \ge \sum_{\alpha,\alpha'} \mathbb{E}[y^{l-1}_{1,\alpha}(x)\, y^{l-1}_{1,\alpha'}(x')] + \frac{\sigma}{2}\sum_{\alpha,\alpha'} \sqrt{q^{l-1}_\alpha(x)}\sqrt{q^{l-1}_{\alpha'}(x')}\, c^{l-1}_{\alpha,\alpha'}(x,x') = \Big(1+\frac{\sigma}{2}\Big)\sum_{\alpha,\alpha'} \mathbb{E}[y^{l-1}_{1,\alpha}(x)\, y^{l-1}_{1,\alpha'}(x')].$$
Combining this result with the fact that $\sum_\alpha q^l_\alpha(x) = (1+\frac{\sigma}{2})\sum_\alpha q^{l-1}_\alpha(x)$, we have $\bar{c}^l(x,x') \ge \bar{c}^{l-1}(x,x')$. Therefore $\bar{c}^l(x,x')$ is non-decreasing and converges to a limit $c$. Let us prove that $c=1$. By contradiction, assume $c<1$. Using equation (19), $\frac{c^l(x,x')}{\bar{c}^l(x,x')}$ converges to 1 as $l$ goes to infinity. This yields $c^l(x,x') \to c$. Therefore, there exist $\alpha_0, \alpha'_0$ and a constant $\delta<1$ such that for all $l$, $c^l_{\alpha_0,\alpha'_0}(x,x') \le \delta < 1$. Knowing that $f$ is strongly convex and that $f(1)=1$, we have $f(c^l_{\alpha_0,\alpha'_0}(x,x')) \ge c^l_{\alpha_0,\alpha'_0}(x,x') + f(\delta) - \delta$. Therefore,
$$\bar{c}^l(x,x') \ge \bar{c}^{l-1}(x,x') + \frac{\sigma}{2}\,\frac{\sqrt{q^{l-1}_{\alpha_0}(x)\, q^{l-1}_{\alpha'_0}(x')}}{N^2\sqrt{\tilde{q}^l(x)}\sqrt{\tilde{q}^l(x')}}\,(f(\delta)-\delta) \ge \bar{c}^{l-1}(x,x') + \frac{\sigma}{2}\,\frac{\sqrt{\min_\alpha q^1_\alpha(x)\,\min_{\alpha'} q^1_{\alpha'}(x')}}{N^2\sqrt{\tilde{q}^1(x)}\sqrt{\tilde{q}^1(x')}}\,(f(\delta)-\delta).$$
Taking the limit $l\to\infty$, we find $c \ge c + \frac{\sigma}{2}\,\frac{\sqrt{\min_\alpha q^1_\alpha(x)\,\min_{\alpha'} q^1_{\alpha'}(x')}}{N^2\sqrt{\tilde{q}^1(x)}\sqrt{\tilde{q}^1(x')}}\,(f(\delta)-\delta)$. This cannot be true since $f(\delta)>\delta$. Thus we conclude that $c=1$. Now we study the asymptotic convergence rate.
From Section B.1, we have $f(x) \underset{x\to1^-}{=} x + \frac{2\sqrt{2}}{3\pi}(1-x)^{3/2} + O((1-x)^{5/2})$. Therefore, there exists $\kappa>0$ such that, close to $1$, $f(x) \le x + \kappa(1-x)^{3/2}$. Using this result, we can upper bound
$$\bar{c}^l(x,x') \le \bar{c}^{l-1}(x,x') + \kappa \sum_{\alpha,\alpha'} \frac{1}{N^2}\,\frac{\sqrt{q^{l-1}_\alpha(x)}\sqrt{q^{l-1}_{\alpha'}(x')}}{\sqrt{\tilde{q}^l(x)}\sqrt{\tilde{q}^l(x')}}\,(1-c^{l-1}_{\alpha,\alpha'}(x,x'))^{3/2}.$$
To get a polynomial convergence rate, we need an upper bound of the form $\bar{c}^l \le \bar{c}^{l-1} + \zeta(1-\bar{c}^{l-1})^{1+\epsilon}$ (see below). However, the function $x^{3/2}$ is convex, so the sum cannot be upper-bounded directly using Jensen's inequality. We use instead (Pečarić et al., 1992, Theorem 1), which states that for any $x_1, x_2, \dots, x_n > 0$ and $s > r > 0$,
$$\Big(\sum_i x_i^s\Big)^{1/s} \le \Big(\sum_i x_i^r\Big)^{1/r}. \qquad (20)$$
Let $z^l_{\alpha,\alpha'} = \frac{1}{N^2}\,\frac{\sqrt{q^{l-1}_\alpha(x)}\sqrt{q^{l-1}_{\alpha'}(x')}}{\sqrt{\tilde{q}^l(x)}\sqrt{\tilde{q}^l(x')}}$. We have
$$\sum_{\alpha,\alpha'} z^l_{\alpha,\alpha'}\,(1-c^{l-1}_{\alpha,\alpha'}(x,x'))^{3/2} \le \zeta_l \sum_{\alpha,\alpha'} \big[z^l_{\alpha,\alpha'}(1-c^{l-1}_{\alpha,\alpha'}(x,x'))\big]^{3/2}, \quad \text{where } \zeta_l = \max_{\alpha,\alpha'} (z^l_{\alpha,\alpha'})^{-1/2}.$$
Using inequality (20) with $s=3/2$ and $r=1$, we have
$$\sum_{\alpha,\alpha'} \big[z^l_{\alpha,\alpha'}(1-c^{l-1}_{\alpha,\alpha'}(x,x'))\big]^{3/2} \le \Big(\sum_{\alpha,\alpha'} z^l_{\alpha,\alpha'}(1-c^{l-1}_{\alpha,\alpha'}(x,x'))\Big)^{3/2} = \Big(\sum_{\alpha,\alpha'} z^l_{\alpha,\alpha'} - \bar{c}^{l-1}(x,x')\Big)^{3/2}.$$
Moreover, using the concavity of the square root function, we have $\sum_{\alpha,\alpha'} z^l_{\alpha,\alpha'} \le 1$. This yields
$$\bar{c}^l(x,x') \le \bar{c}^{l-1}(x,x') + \zeta\,(1-\bar{c}^{l-1}(x,x'))^{3/2},$$
where $\zeta$ is a constant. Letting $\gamma_l = 1-\bar{c}^l(x,x')$, we conclude using the following inequality (an equality in the FFNN case): $\gamma_l \ge \gamma_{l-1} - \zeta\gamma_{l-1}^{3/2}$, which leads to $\gamma_l^{-1/2} \le \gamma_{l-1}^{-1/2}(1-\zeta\gamma_{l-1}^{1/2})^{-1/2} = \gamma_{l-1}^{-1/2} + \frac{\zeta}{2} + o(1)$. Hence $\gamma_l \gtrsim l^{-2}$. Using this result combined with (19) again, we conclude that $1-c^l(x,x') \gtrsim l^{-2}$.
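The FFNN recursion in part 1 is easy to probe numerically. Below is a minimal sketch (our own illustration, not the paper's code) that iterates the map $\hat{c}^l = \frac{1}{1+\lambda}\hat{c}^{l-1} + \frac{\lambda}{1+\lambda} f(\hat{c}^{l-1})$ with the known ReLU correlation function $f(c) = \frac{1}{\pi}\big(\sqrt{1-c^2} + c(\pi-\arccos c)\big)$, and checks the $l^{-2}$ decay of $1-\hat{c}^l$ by verifying that doubling the depth divides $1-\hat{c}^l$ by roughly 4; the choice $\lambda = 1$ is an arbitrary assumption for illustration.

```python
import math

def relu_corr(c):
    # ReLU correlation function: f(c) = (sqrt(1 - c^2) + c * (pi - arccos(c))) / pi
    return (math.sqrt(max(0.0, 1.0 - c * c)) + c * (math.pi - math.acos(c))) / math.pi

def corr_after_depth(c0, depth, lam=1.0):
    # Iterate the pruned-ResNet correlation recursion and record the trajectory.
    c, trace = c0, []
    for _ in range(depth):
        c = c / (1.0 + lam) + lam / (1.0 + lam) * relu_corr(c)
        trace.append(c)
    return trace

trace = corr_after_depth(0.5, 4000)
zeta = [1.0 - c for c in trace]
# If 1 - c_l ~ 4 / (eta^2 l^2), doubling the depth divides 1 - c_l by ~4.
ratio = zeta[3999] / zeta[1999]
print(trace[-1], ratio)
```

The trajectory increases monotonically towards 1 (since $f(c) \ge c$), and the observed `ratio` is close to $1/4$, matching the $l^{-2}$ rate derived above.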

E THEORETICAL ANALYSIS OF MAGNITUDE BASED PRUNING (MBP)

In this section, we provide a theoretical analysis of MBP. The two approximations from Appendix A are not used here. MBP is a data-independent pruning algorithm (zero-shot pruning). The mask is given by
$$\delta^l_i = \begin{cases} 1 & \text{if } |W^l_i| \ge t_s,\\ 0 & \text{if } |W^l_i| < t_s,\end{cases}$$
where $t_s$ is a threshold that depends on the sparsity $s$. Defining $k_s = (1-s)\sum_l M_l$, the threshold is given by $t_s = |W|_{(k_s)}$, where $|W|_{(k_s)}$ is the $k_s$-th order statistic of the network weights $(|W^l_i|)_{1\le l\le L,\, 1\le i\le M_l}$ (with the convention $|W|_{(1)} > |W|_{(2)} > \dots$). With MBP, changing $\sigma_w$ does not impact the distribution of the resulting sparse architecture since it is a common factor for all the weights. However, with different scaling factors $v_l$, the variances $\sigma_w^2 v_l$ used to initialize the weights vary across layers. This may suggest, erroneously, that the layer with the smallest variance is highly likely to be fully pruned before the others as the sparsity $s$ increases. This is wrong in general, since layers with small variances might have more weights than other layers. However, we can prove a similar result by considering the limit of large depth with fixed widths.

Proposition 4 (MBP in the large depth limit). Assume $N$ is fixed and there exists $l_0 \in [1:L]$ such that $\alpha_{l_0} > \alpha_l$ for all $l \ne l_0$. Let $Q_x$ be the $x$-th quantile of $|X|$ where $X \sim \mathcal{N}(0,1)$, and let $\gamma = \min_{l\ne l_0} \frac{\alpha_{l_0}}{\alpha_l}$. For $\epsilon \in (0,2)$, define $x_{\epsilon,\gamma} = \inf\{y\in(0,1) : \forall x > y,\ \gamma Q_x > Q_{1-(1-x)^{\gamma^{2-\epsilon}}}\}$, with $x_{\epsilon,\gamma} = \infty$ if the set is empty. Then, for all $\epsilon\in(0,2)$, $x_{\epsilon,\gamma}$ is finite and there exists a constant $\nu>0$ such that
$$\mathbb{E}[s_{cr}] \le \inf_{\epsilon\in(0,2)}\Big\{x_{\epsilon,\gamma} + \frac{\zeta_{l_0} N^2}{1+\gamma^{2-\epsilon}}\,(1-x_{\epsilon,\gamma})^{1+\gamma^{2-\epsilon}}\Big\} + \frac{\nu}{\sqrt{L}\,N^2}.$$
Proposition 4 gives an upper bound on $\mathbb{E}[s_{cr}]$ in the large depth limit; the bound is easy to approximate numerically, and our experiments (Table 7) reveal that it can be tight.

Proof. Let $x \in (0,1)$ and $k_x = (1-x)\Gamma_L N^2$, where $\Gamma_L = \sum_{l\ne l_0}\zeta_l$.
We have $\mathbb{P}(s_{cr} \le x) \ge \mathbb{P}(\max_i |W^{l_0}_i| < |W|_{(k_x)})$, where $|W|_{(k_x)}$ is the $k_x$-th order statistic of the sequence $\{|W^l_i| : l\ne l_0,\, i\in[1:M_l]\}$, i.e. $|W|_{(1)} > |W|_{(2)} > \dots > |W|_{(k_x)}$. Let $(X_i)_{i\in[1:M_{l_0}]}$ and $(Z_i)_{i\in[1:\Gamma_L N^2]}$ be two sequences of iid standard normal variables. It is easy to see that $\mathbb{P}(\max_i |W^{l_0}_i| < |W|_{(k_x)}) \ge \mathbb{P}(\max_i |X_i| < \gamma\,|Z|_{(k_x)})$, where $\gamma = \min_{l\ne l_0}\frac{\alpha_{l_0}}{\alpha_l}$. Moreover, we have the following result from the theory of order statistics, which is a weak version of Theorem 3.1 in Puri and Ralescu (1986).

Appendix Lemma 3. Let $X_1, X_2, \dots, X_n$ be iid random variables with cdf $F$. Assume $F$ is differentiable, let $p\in(0,1)$ and let $Q_p$ be the order-$p$ quantile of the distribution $F$, i.e. $F(Q_p)=p$. Then $\sqrt{n}\,(X_{(pn)} - Q_p)\,F'(Q_p)\,\sigma_p^{-1} \to_D \mathcal{N}(0,1)$, where the convergence is in distribution and $\sigma_p = \sqrt{p(1-p)}$.

Using this result, we obtain $\mathbb{P}(\max_i |X_i| < \gamma\,|Z|_{(k_x)}) = \mathbb{P}(\max_i |X_i| < \gamma Q_x) + O(\frac{1}{\sqrt{L}N^2})$, where $Q_x$ is the order-$x$ quantile of the folded standard normal distribution. The next result shows that $x_{\epsilon,\gamma}$ is finite for all $\epsilon\in(0,2)$.

Appendix Lemma 4. Let $\gamma>1$. For all $\epsilon\in(0,2)$, there exists $x_\epsilon\in(0,1)$ such that, for all $x>x_\epsilon$, $\gamma Q_x > Q_{1-(1-x)^{\gamma^{2-\epsilon}}}$.

Proof. Let $\epsilon>0$, and recall the asymptotic equivalent of $Q_{1-x}$ given by $Q_{1-x} \underset{x\to0}{\sim} \sqrt{-2\log x}$. Therefore $\frac{\gamma Q_x}{Q_{1-(1-x)^{\gamma^{2-\epsilon}}}} \underset{x\to1}{\to} \gamma^{\epsilon/2} > 1$. Hence $x_\epsilon$ exists and is finite.

Let $\epsilon>0$. Using Appendix Lemma 4, there exists $x_\epsilon>0$ such that, for all $x > x_\epsilon$,
$$\mathbb{P}(\max_i |X_i| < \gamma Q_x) \ge \mathbb{P}(\max_i |X_i| < Q_{1-(1-x)^{\gamma^{2-\epsilon}}}) = \big(1-(1-x)^{\gamma^{2-\epsilon}}\big)^{\zeta_{l_0}N^2} \ge 1 - \zeta_{l_0}N^2\,(1-x)^{\gamma^{2-\epsilon}},$$
where we have used the inequality $(1-t)^z \ge 1-zt$ for all $(t,z)\in[0,1]\times(1,\infty)$. Using this last result, we have $\mathbb{P}(s_{cr} \ge x) \le \zeta_{l_0}N^2\,(1-x)^{\gamma^{2-\epsilon}} + O(\frac{1}{\sqrt{L}N^2})$.

Now we have
$$\mathbb{E}[s_{cr}] = \int_0^1 \mathbb{P}(s_{cr}\ge x)\,dx \le x_\epsilon + \frac{\zeta_{l_0}N^2}{1+\gamma^{2-\epsilon}}\,(1-x_\epsilon)^{\gamma^{2-\epsilon}+1} + O\Big(\frac{1}{\sqrt{L}\,N^2}\Big).$$
This is true for all $\epsilon\in(0,2)$, and the $O(\frac{1}{\sqrt{L}N^2})$ term does not depend on $\epsilon$. Therefore there exists a constant $\nu\in\mathbb{R}$ such that, for all $\epsilon$, $\mathbb{E}[s_{cr}] \le x_\epsilon + \frac{\zeta_{l_0}N^2}{1+\gamma^{2-\epsilon}}(1-x_\epsilon)^{\gamma^{2-\epsilon}+1} + \frac{\nu}{\sqrt{L}N^2}$. We conclude by taking the infimum over $\epsilon$.

Another interesting regime for MBP is when the depth is fixed and the width goes to infinity. The next result gives a lower bound on the probability of pruning at least one full layer.

Proposition 5 (MBP in the large width limit). Assume there exists $l_0\in[1:L]$ such that $\alpha_{l_0} > \alpha_l$ (i.e. $v_{l_0} > v_l$) for all $l\ne l_0$, and let $s_0 = \frac{M_{l_0}}{\sum_l M_l}$. For a sparsity $s$, let $PR_{l_0}(s)$ be the event that layer $l_0$ is fully pruned before the other layers, i.e. $PR_{l_0}(s) = \{|A_{l_0}| = M_{l_0}\} \cap \bigcap_{l\ne l_0} \{|A_l| < M_l\}$, and let $PR_{l_0} = \bigcup_{s\in(s_0,s_{max})} PR_{l_0}(s)$ be the event that there exists a sparsity $s$ such that layer $l_0$ is fully pruned before the other layers. Then we have
$$\mathbb{P}(PR_{l_0}) \ge 1 - \frac{L\pi^2}{4(\gamma-1)^2\log(N)^2} + o\Big(\frac{1}{\log(N)^2}\Big), \quad \text{where } \gamma = \min_{k\ne l_0}\frac{\alpha_{l_0}}{\alpha_k}.$$

Proof. The event $PR_{l_0}$ occurs if and only if $\max_i |W^{l_0}_i| < \max_i |W^k_i|$ for all $k\ne l_0$. Let us prove that $\mathbb{P}(PR_{l_0}) \ge \prod_{k\ne l_0} \mathbb{P}(\max_i |W^{l_0}_i| < \max_i |W^k_i|)$. Let $X = \max_i |W^{l_0}_i|$. We have $\mathbb{P}(PR_{l_0}) = \mathbb{E}\big[\prod_{k\ne l_0} \mathbb{P}(X < \max_i |W^k_i| \mid X)\big]$. Using the rearrangement inequality of Appendix Lemma 5 with the functions $f_k(x) = \mathbb{P}(x < \max_i |W^k_i|)$, which are all non-increasing, we obtain $\mathbb{P}(PR_{l_0}) \ge \prod_{k\ne l_0} \mathbb{E}[\mathbb{P}(X < \max_i |W^k_i| \mid X)] = \prod_{k\ne l_0} \mathbb{P}(\max_i |W^{l_0}_i| < \max_i |W^k_i|)$.

To deal with the probability $\mathbb{P}(\max_i |W^{l_0}_i| < \max_i |W^k_i|)$, we use Appendix Lemma 6, a result from Extreme Value Theory which provides the description of the law of $\max_i X_i$ needed in our analysis. In our case, we want to characterise the behaviour of $\max_i |X_i|$ where the $X_i$ are iid Gaussian random variables. Let $\Psi$ and $\psi$ be the cdf and density of a standard Gaussian variable $X$. The cdf of $|X|$ is given by $F = 2\Psi - 1$ and its density by $f = 2\psi$ on the positive real line.
Thus $\frac{1-F}{f} = \frac{1-\Psi}{\psi}$, and it is sufficient to verify the conditions of Appendix Lemma 6 for the standard Gaussian distribution. We have
$$\lim_{x\to F^{-1}(1)} \frac{d}{dx}\,\frac{1-\Psi(x)}{\psi(x)} = \lim_{x\to F^{-1}(1)} \Big(\frac{x\,(1-\Psi(x))}{\psi(x)} - 1\Big) = 0,$$
where we have used the fact that $x(1-\Psi(x)) \sim \psi(x)$ in the large $x$ limit. Let us now find the values of $a_n$ and $b_n$. In the large $x$ limit, we have
$$1-F(x) = 2\int_x^\infty \frac{e^{-t^2/2}}{\sqrt{2\pi}}\,dt = \sqrt{\frac{2}{\pi}}\, e^{-x^2/2}\Big(\frac{1}{x} - \frac{1}{x^3} + o\Big(\frac{1}{x^3}\Big)\Big).$$
Therefore $\log(1-F(x)) \sim -\frac{x^2}{2}$. This yields $b_n = F^{-1}(1-\frac{1}{n}) \sim \sqrt{2\log n}$. Using the same asymptotic expansion of $1-F(x)$, we can obtain a more precise approximation:
$$b_n = \sqrt{2\log n}\,\Big(1 - \frac{\log(\log n)}{4\log n} + \frac{\frac{1}{2}\log(\frac{\pi}{4})}{2\log n} - \frac{\log(\log n)}{8(\log n)^2} + o\Big(\frac{\log(\log n)}{(\log n)^2}\Big)\Big).$$
Now let us find an approximation for $a_n$. We have $\psi(b_n) \sim \frac{b_n}{2n}$, so that $a_n = \frac{1-F(b_n)}{f(b_n)} = \frac{1/n}{2\psi(b_n)} \sim \frac{1}{b_n} \sim \frac{1}{\sqrt{2\log n}}$.

We use these results to lower bound the probability $\mathbb{P}(\max_i |W^{l_0}_i| < \max_i |W^k_i|)$. We have $\mathbb{P}(\max_i |W^{l_0}_i| \ge \max_i |W^k_i|) = \mathbb{P}(\max_i |X_i| \ge \gamma_k \max_i |Y_i|)$, where $\gamma_k = \frac{\alpha_{l_0}}{\alpha_k} > 1$ and $(X_i)$, $(Y_i)$ are standard Gaussian random variables. Let $A_N = \max_i |X_i|$ and $B_N = \max_i |Y_i|$. By a Chebyshev-type bound applied to the event $A_N - \mathbb{E}[A_N] \ge \gamma_k(B_N - \mathbb{E}[B_N]) + \gamma_k\mathbb{E}[B_N] - \mathbb{E}[A_N]$, we have
$$\mathbb{P}(A_N \ge \gamma_k B_N) \le \frac{\mathbb{E}\big[(A_N - \mathbb{E}[A_N] - \gamma_k(B_N - \mathbb{E}[B_N]))^2\big]}{(\gamma_k\mathbb{E}[B_N] - \mathbb{E}[A_N])^2} \underset{N\to\infty}{\sim} \frac{\pi^2}{4(\gamma_k-1)^2\log(N)^2}.$$
We conclude that, for large $N$,
$$\mathbb{P}(PR_{l_0}) \ge 1 - \frac{L\pi^2}{4(\gamma-1)^2\log(N)^2} + o\Big(\frac{1}{\log(N)^2}\Big), \quad \text{where } \gamma = \min_{k\ne l_0}\frac{\alpha_{l_0}}{\alpha_k}.$$

F IMAGENET EXPERIMENTS

To validate our results on large-scale datasets, we prune ResNet50 using SNIP, GraSP, SynFlow and our algorithm SBP-SR, and train the pruned network on ImageNet. We train the pruned model for 90 epochs with SGD. Training starts with a learning rate of 0.1, which is divided by 10 at epochs 30, 60 and 80. We report in Table 8 the Top-1 test accuracy for different sparsities. Our algorithm SBP-SR has a clear advantage over the other algorithms. We are currently running extensive simulations on ImageNet to confirm these results.

G ADDITIONAL EXPERIMENTS

In Table 10, we present additional experiments with varying ResNet architectures (ResNet32/50) and sparsities (up to 99.9%), with ReLU and Tanh activation functions, on CIFAR10. We see that, overall, our proposed Stable ResNet performs better than standard ResNets. In addition, we plot the remaining weights for each layer to better understand the different pruning strategies, as well as why some of the ResNets with Tanh activations are untrainable. Furthermore, we added MNIST experiments with different activation functions (ELU, Tanh) and note that our rescaled version allows us to prune significantly more for deeper networks.



Figure 1: Percentage of weights kept after SBP applied to a randomly initialized FFNN with depth 100 and width 100 for 70% sparsity on MNIST. Each pixel (i, j) corresponds to a neuron and shows the proportion of connections to neuron (i, j) that have not been pruned. The EOC (a) allows us to preserve a uniform spread of the weights, whereas the Chaotic phase (b), due to exploding gradients, prunes entire layers.

Figure 2: Percentage of non-pruned weights per layer in ResNet32, for our Stable ResNet32 and a standard ResNet32 with Kaiming initialization, on CIFAR10. With Stable ResNet, weights in the deeper layers are pruned less aggressively than in the standard ResNet.

Figure 3: Accuracy on MNIST with different initialization schemes (EOC with rescaling, EOC without rescaling, Ordered phase) for varying depth and sparsity. This shows that rescaling on the EOC allows us to train not only much deeper but also sparser models.


a circulant symmetric matrix with eigenvalues $b_1 > b_2 \ge b_3 \ge \dots \ge b_N$. The largest eigenvalue of $U$ is given by $b_1 = 1 + \sum_\beta \lambda_\beta$ and the corresponding eigenspace is spanned by the vector $e_1 = \frac{1}{\sqrt{N}}(1, \dots, 1) \in \mathbb{R}^N$. This yields $b_1^{-l} U^l = e_1 e_1^T + O(e^{-\zeta l})$, where $\zeta = \log(\frac{b_1}{b_2})$.


Figure 4: Percentage of pruned weights per layer in a ResNet32 for our scaled ResNet32 and standard Resnet32 with Kaiming initialization

Classification accuracies on CIFAR10 for ResNet with varying depths and sparsities using SNIP (Lee et al., 2018) and our algorithm SBP-SR



Classification accuracies on Tiny ImageNet for Resnet with varying depths

Test accuracy on MNIST with V-CNN for different depths with sparsity 50% using SBP(SNIP)

Classification accuracy on CIFAR10 for VGG16 and 3xVGG16 with varying sparsities

We propose an alternative ResNet parameterization, called Stable ResNet, which allows for more stable pruning. Our theoretical results are validated by extensive experiments on MNIST, CIFAR10, CIFAR100, Tiny ImageNet and ImageNet. Compared to other available one-shot pruning algorithms, we achieve state-of-the-art results in many scenarios.

Table 7 compares the theoretical upper bound in Proposition 4 to the empirical value of $\mathbb{E}[s_{cr}]$.

Theoretical upper bound of Proposition 4 and empirical observations for a FFNN with N = 100 and L = 100

Classification accuracy on ImageNet (Top-1) for ResNet50 with varying sparsities (TODO: These results will be updated to include confidence intervals)

Test accuracy of pruned neural network on CIFAR10 with different activation functions

Test accuracy of pruned vanilla-CNN on CIFAR10 with different depth/sparsity levels


Proposition 5 shows that when the scaling is not the same for all layers, MBP will result in one layer being fully pruned with a probability that converges to 1 as the width goes to infinity. The larger the ratio $\gamma$ (the ratio between the largest and the second largest scaling factors), the faster this probability goes to 1. The intuition behind Proposition 5 comes from a result in Extreme Value Theory stated in Appendix Lemma 6. Indeed, the problem of pruning one whole layer before the others is essentially a problem of maxima: layer $l_0$ is fully pruned before the others if and only if $\max_i |W^{l_0}_i| < \max_i |W^k_i|$ for all $k \ne l_0$. The expected value of the maximum of $n$ iid standard Gaussian variables is known to scale as $\sqrt{\log n}$ for large $n$; see e.g. Van Handel (2016). The proof of Proposition 5 relies on the following two auxiliary results.

Appendix Lemma 5 (Rearrangement inequality (Hardy et al., 1952)). Let $f, g : \mathbb{R} \to \mathbb{R}_+$ be functions which are either both non-decreasing or both non-increasing, and let $X$ be a random variable. Then $\mathbb{E}[f(X)g(X)] \ge \mathbb{E}[f(X)]\,\mathbb{E}[g(X)]$.

Appendix Lemma 6 (Von Mises (1936)). Let $(X_i)_{1\le i\le n}$ be iid random variables with common density $f$ and cumulative distribution function $F$ such that $\lim_{x\to F^{-1}(1)} \frac{d}{dx}\frac{1-F(x)}{f(x)} = 0$. Then $\mathbb{P}\big(\frac{\max_i X_i - b_n}{a_n} \le x\big) \to G(x)$, where $G$ is the Gumbel cumulative distribution function and the sequences $a_n$ and $b_n$ are given by $b_n = F^{-1}(1-\frac{1}{n})$ and $a_n = \frac{1-F(b_n)}{f(b_n)}$.

We are now in a position to prove Proposition 5.

Proof. Assume there exists $l_0 \in [1:L]$ such that $\alpha_{l_0} > \alpha_l$ for all $l \ne l_0$. The trick is to see that $PR_{l_0}$ occurs if and only if $\max_i |W^{l_0}_i| < \max_i |W^k_i|$ for all $k \ne l_0$.
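The concentration of Gaussian maxima that drives Proposition 5 is easy to probe by Monte Carlo. The sketch below (our own illustration, not the paper's code) estimates $\mathbb{P}(\max_i |X_i| \ge \gamma \max_i |Y_i|)$ for iid standard Gaussians and a ratio $\gamma > 1$; the proof shows this probability vanishes at rate $1/\log(N)^2$, and empirically it is already tiny at moderate widths. The values $n = 2000$, $\gamma = 1.5$ and the trial count are arbitrary assumptions.

```python
import random

def prob_max_exceeds(n, gamma, trials=300, seed=1):
    # Monte Carlo estimate of P(max_i |X_i| >= gamma * max_i |Y_i|)
    # for X_i, Y_i iid N(0, 1); both maxima concentrate near sqrt(2 log(2n)),
    # so for gamma > 1 this probability vanishes as n grows.
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        a = max(abs(rng.gauss(0.0, 1.0)) for _ in range(n))
        b = max(abs(rng.gauss(0.0, 1.0)) for _ in range(n))
        if a >= gamma * b:
            hits += 1
    return hits / trials

p = prob_max_exceeds(2000, gamma=1.5)
print(p)  # essentially zero at this width
```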

H ON THE LOTTERY TICKET HYPOTHESIS

The Lottery Ticket Hypothesis (LTH) (Frankle and Carbin, 2019) states that "randomly initialized networks contain subnetworks that when trained in isolation reach test accuracy comparable to the original network". We have shown so far that pruning a NN initialized on the EOC outputs sparse NNs that can be trained after rescaling. Conversely, if we initialize a random NN with arbitrary hyperparameters $(\sigma_w, \sigma_b)$, then intuitively we can prune this network in a way that ensures that the pruned NN is on the EOC. This would theoretically make the sparse architecture trainable. We formalize this intuition as follows.

Weak Lottery Ticket Hypothesis (WLTH): For any randomly initialized network, there exists a subnetwork that is initialized on the Edge of Chaos.

In the next theorem, we prove that the WLTH is true for FFNN and CNN architectures initialized with a Gaussian distribution.

Theorem 3. Consider a FFNN or CNN with layers initialized with variance $\sigma_w^2 > 0$ for the weights and variance $\sigma_b^2$ for the bias. Let $\sigma_{w,EOC}$ be the value of $\sigma_w$ such that $(\sigma_{w,EOC}, \sigma_b) \in \mathrm{EOC}$. Then, for all $\sigma_w > \sigma_{w,EOC}$, there exists a subnetwork that is initialized on the EOC. Therefore the WLTH is true.

The idea behind the proof of Theorem 3 is that, by removing a fraction of the weights from each layer, we change the covariance structure of the next layer. By doing so in a precise way, we can find a subnetwork that is initialized on the EOC. We prove a slightly more general result than the one stated.

Theorem 4 (Winning Tickets on the Edge of Chaos). Consider a neural network with layer-dependent weight variances $\sigma_{w,l} \in \mathbb{R}_+$ and bias variance $\sigma_b^2 > 0$. We define $\sigma_{w,EOC}$ to be the value of $\sigma_w$ such that $(\sigma_{w,EOC}, \sigma_b) \in \mathrm{EOC}$. Then, for all sequences $(\sigma_{w,l})_l$ such that $\sigma_{w,l} > \sigma_{w,EOC}$ for all $l$, there exists a distribution of subnetworks initialized on the Edge of Chaos.

Proof. We prove the result for FFNN; the proof for CNN is similar. Let $x, x'$ be two inputs.
For all $l$, let $(\delta^l_{ij})$ be a collection of Bernoulli variables with parameter $p_l$. The forward propagation of the covariance is given by
$$\hat{q}^l(x,x') = \sigma_b^2 + \sigma_{w,l}^2\,\mathbb{E}[\delta^l_{11}]\,\mathbb{E}_{(Z_1,Z_2)\sim\mathcal{N}(0,\hat{Q}^{l-1})}[\phi(Z_1)\phi(Z_2)] = \sigma_b^2 + p_l\,\sigma_{w,l}^2\,\mathbb{E}_{(Z_1,Z_2)\sim\mathcal{N}(0,\hat{Q}^{l-1})}[\phi(Z_1)\phi(Z_2)].$$
This yields, choosing $p_l = \sigma_{w,EOC}^2/\sigma_{w,l}^2$, which lies in $(0,1)$ since $\sigma_{w,l} > \sigma_{w,EOC}$, that the new variance after pruning with the Bernoulli mask $\delta$ is $\tilde{\sigma}_w^2 = p_l\,\sigma_{w,l}^2 = \sigma_{w,EOC}^2$. Thus, the subnetwork defined by $\delta$ is initialized on the EOC. The distribution of these subnetworks is directly linked to the distribution of $\delta$. We can see this result as layer-wise pruning, i.e. pruning each layer separately. Theorem 3 is the special case of this result where the variances $\sigma_{w,l}$ are the same for all layers.
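The construction in the proof — prune each weight independently with keep-probability $p_l$ so that the surviving effective variance $p_l\,\sigma_{w,l}^2$ equals $\sigma_{w,EOC}^2$ — can be checked empirically. The sketch below is our own illustration; $\sigma_{w,EOC}^2 = 2$ is the known EOC weight variance for ReLU, while the oversized variance $\sigma_w^2 = 5$ and the sample size are arbitrary assumptions.

```python
import random

random.seed(0)

sigma2_eoc = 2.0   # EOC weight variance for ReLU networks
sigma2_w = 5.0     # oversized initialization variance, sigma2_w > sigma2_eoc
p = sigma2_eoc / sigma2_w  # Bernoulli keep-probability defining the subnetwork

n = 1_000_000
# Effective variance of a pruned weight: E[delta * W^2] = p * sigma2_w = sigma2_eoc.
acc = 0.0
for _ in range(n):
    w = random.gauss(0.0, sigma2_w ** 0.5)
    delta = 1 if random.random() < p else 0
    acc += delta * w * w
effective_var = acc / n
print(effective_var)  # close to 2.0, i.e. the subnetwork sits on the EOC
```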

I ALGORITHM FOR SECTION 2.3

Algorithm 1 Rescaling trick for FFNN
Input: pruned network of size $m$
for $l = 1$ to $L$ do
  for $i = 1$ to $N_l$ do
    $\alpha^l_i \leftarrow \sum_{j=1}^{N_{l-1}} (W^l_{ij})^2\, \delta^l_{ij}$
    $\rho^l_{ij} \leftarrow 1/\sqrt{\alpha^l_i}$ for all $j$
  end for
end for
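A minimal implementation sketch of the rescaling trick (our own illustration, not the paper's code): each surviving neuron's incoming weights are rescaled by $\rho^l_i = 1/\sqrt{\alpha^l_i}$, which normalizes its squared incoming norm to 1. The network sizes and the 50% random mask are arbitrary assumptions.

```python
import random

def rescale_factors(weights, masks):
    # For each layer l and neuron i: alpha_i = sum_j (W_ij)^2 * delta_ij,
    # and the rescaling factor for neuron i is rho_i = 1 / sqrt(alpha_i).
    rhos = []
    for W, D in zip(weights, masks):
        layer_rho = []
        for row_w, row_d in zip(W, D):
            alpha = sum(w * w * d for w, d in zip(row_w, row_d))
            layer_rho.append(alpha ** -0.5 if alpha > 0 else 0.0)
        rhos.append(layer_rho)
    return rhos

random.seed(0)
W = [[[random.gauss(0.0, 1.0) for _ in range(8)] for _ in range(4)]]  # one 4x8 layer
D = [[[1 if random.random() < 0.5 else 0 for _ in range(8)] for _ in range(4)]]
rhos = rescale_factors(W, D)
# After rescaling, each neuron with at least one surviving weight has unit
# squared incoming norm (fully pruned rows stay at 0).
norms = [sum((r * w) ** 2 * d for w, d in zip(row_w, row_d))
         for row_w, row_d, r in zip(W[0], D[0], rhos[0])]
print(norms)
```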

