STRUCTURED PRUNING OF CNNS AT INITIALIZATION

Abstract

Pruning-at-initialization (PAI) methods can prune the individual weights of a convolutional neural network (CNN) before training, thus avoiding expensive fine-tuning or retraining of the pruned model. While PAI shows promising results in reducing model size, the pruned model still requires unstructured sparse matrix computation, making it difficult to achieve a real speedup. In this work, we show both theoretically and empirically that the accuracy of CNN models pruned by a PAI method depends on the layer-wise density (i.e., the fraction of the remaining parameters in each layer), irrespective of the granularity of pruning. We formulate PAI as a convex optimization problem based on an expectation-based proxy for model accuracy, which can produce the optimal allocation of the layer-wise densities with respect to the proxy model. Using our formulation, we further propose a structured and hardware-friendly PAI method, named PreCrop, to prune or reconfigure CNNs in the channel dimension. Our empirical results show that PreCrop achieves a higher accuracy than existing PAI methods on several popular CNN architectures, including ResNet, MobileNetV2, and EfficientNet, on both CIFAR-10 and ImageNet. Notably, PreCrop achieves an accuracy improvement of up to 2.7% over a state-of-the-art PAI algorithm when pruning MobileNetV2 on ImageNet. PreCrop also improves the accuracy of EfficientNetB0 by 0.3% on ImageNet with only 80% of the parameters and the same FLOPs.

1. INTRODUCTION

Convolutional neural networks (CNNs) have achieved state-of-the-art accuracy in a wide range of machine learning (ML) applications. However, the massive computational and memory requirements of CNNs remain a major barrier to more widespread deployment on resource-limited edge and mobile devices. This challenge has motivated a large and active body of research on CNN compression, which attempts to simplify the original model without significantly compromising the accuracy. Weight pruning [15, 7, 17, 4, 8] has been extensively explored to reduce the computational and memory demands of CNNs. Existing methods create a sparse CNN model by iteratively removing ineffective weights/activations and training the resulting sparse model. Such an iterative pruning approach usually enjoys the least accuracy degradation but at the cost of a more computationally expensive training procedure. Moreover, training-based pruning methods introduce additional hyperparameters, such as the learning rate for fine-tuning and the number of epochs before rewinding [20] , which make the pruning process even more complicated and less reproducible. To minimize the cost of pruning, a new line of research proposes pruning-at-initialization (PAI) [16, 27, 24] , which identifies and removes unimportant weights in a CNN before training. Similar to training-based pruning, PAI assigns an importance score to each individual weight and retains only a subset of them by maximizing the sum of the importance scores of all remaining weights. The compressed model is then trained using the same hyperparameters (e.g., learning rate and the number of epochs) as the baseline model. Thus, the pruning and training of CNNs are cleanly decoupled, greatly reducing the complexity of obtaining a pruned model. 
Currently, SynFlow [24] is considered the state-of-the-art PAI technique: it eliminates the need for data during pruning, which prior work [16, 27] requires, and achieves a higher accuracy at the same compression ratio. However, existing PAI methods mostly focus on fine-grained weight pruning, which removes individual weights from the CNN model without preserving any structure. As a result, both inference and training of the pruned model require sparse matrix computation, which is challenging to accelerate on commercially available ML hardware that is optimized for dense computation (e.g., GPUs and TPUs [14]). According to a recent study [6], even with the NVIDIA cuSPARSE library, one can only achieve a meaningful speedup for sparse matrix multiplications on GPUs when the sparsity is over 98%. In practice, it is difficult to compress modern CNNs by more than 50× (i.e., to 98% sparsity) without a drastic degradation in accuracy [2]. Therefore, structural pruning patterns (e.g., pruning the weights of an entire output channel) are preferred, since they enable practical memory and computational savings by avoiding irregular sparse storage and computation.
In this work, we propose novel structured PAI techniques and demonstrate that they can achieve the same level of accuracy as the unstructured methods. We first introduce synaptic expectation (SynExp), a new proxy metric for accuracy, which is defined to be the expected sum of the importance scores of all the individual weights in the network. SynExp is invariant to weight shuffling and reinitialization, thus addressing some of the deficiencies of the fine-grained PAI approaches found in recent studies [22, 5]. We also show that SynExp does not vary as long as the layer-wise density remains the same, irrespective of the granularity of pruning. Based on this key observation, we formulate an optimization problem that maximizes SynExp to determine the layer-wise pruning ratios, subject to model size and/or FLOPs constraints.
We then propose PreCrop, a structured PAI method that prunes CNN models at the channel level so as to achieve the target layer-wise density determined by the SynExp optimization. PreCrop can effectively reduce the model size and computational cost without loss of accuracy compared to existing fine-grained PAI methods. Besides channel-level pruning, we further propose PreConfig, which can reconfigure the width dimension of a CNN to achieve a better accuracy-complexity trade-off with almost zero computational cost. Our empirical results show that the model after PreConfig can achieve higher accuracy with fewer parameters and FLOPs than the baseline for a variety of modern CNN architectures. We summarize our contributions as follows:
• We propose to use SynExp as a proxy for accuracy and formulate PAI as an optimization problem that maximizes SynExp under model size and/or FLOPs constraints. We also show that the accuracy of the CNN model pruned by solving the constrained optimization is independent of the pruning granularity.
• We introduce PreCrop, a channel-level structured pruning technique that builds on the proposed SynExp optimization. Our experiments show that CNN models pruned by PreCrop achieve a similar or better accuracy compared to the state-of-the-art unstructured PAI approaches. Compared to SynFlow, PreCrop achieves 2.7% and 0.9% higher accuracy on MobileNetV2 and EfficientNet on ImageNet with fewer parameters and FLOPs.
• We show that PreConfig can be used to optimize the width of each layer in the network with almost zero computational cost (e.g., within one second on CPU). Notably, PreConfig can effectively optimize the structure of EfficientNet and MobileNetV2, increasing the accuracy by 0.3% on ImageNet while using 20% fewer parameters and the same FLOPs.

2. RELATED WORK

Model Compression in General. Model compression can reduce the computational cost of large networks to ease their deployment on resource-constrained devices. Besides pruning, quantization [3, 30, 13], neural architecture search (NAS) [31, 23], and distillation [12, 28] are also commonly used to improve the efficiency of the model.
Training-Based Pruning. Training-based pruning methods use various heuristic criteria to prune unimportant weights. They typically employ an iterative train-prune-retrain process in which the pruning stage is intertwined with the training stage, which may increase the overall training cost severalfold. Existing training-based pruning methods can be either unstructured [7, 15] or structured [11, 18], depending on the granularity and regularity of the pruning scheme. Training-based unstructured pruning usually achieves better accuracy given the same model size budget, while structured pruning can achieve a more practical speedup and compression without special support from custom hardware.
(Unstructured) Pruning-at-Initialization (PAI). PAI methods [16, 27, 24] provide a promising approach to mitigating the high cost of training-based pruning. They identify and prune unimportant weights right after initialization and before training starts. Related to these efforts, the authors of [5] and [22] independently find that, for the existing PAI methods, randomly shuffling the weights within a layer or reinitializing the weights does not cause any accuracy degradation.
Neural Architecture Search (NAS). NAS [31, 26] automatically explores a large space of candidate models to achieve a better accuracy-efficiency trade-off. The typical dimensions of the NAS search space include the width, depth, resolution, and choice of building blocks. However, existing approaches can only search a small subset of the possible configurations due to the cost. For example, the search space of the channel width usually contains only a limited set of integer values. The cost of NAS is also orders of magnitude higher than that of training a single model. Some NAS algorithms [1, 29] use a cheap proxy instead of training the whole network, but an expensive reinforcement learning [1] or evolutionary algorithm [19] is still used to predict a good network.

3. PRUNING-AT-INITIALIZATION VIA SYNEXP OPTIMIZATION

In this section, we first review the preliminaries and deficiencies of existing PAI methods. To overcome the limitations, we introduce a new proxy for the accuracy of the PAI compressed model. We then propose a new formulation of PAI that maximizes the proxy metric using convex optimization.

3.1. PAI BACKGROUND

Preliminaries. PAI aims to prune a neural network after initialization but before training to avoid the time-consuming training-pruning-retraining process. Prior to training, PAI typically uses the magnitude of gradients (with respect to weights) to estimate the importance of individual weights. This requires both forward and backward propagation passes. PAI prunes the weights ($W$) with smaller importance scores by setting the corresponding entries in the binary weight mask ($M$) to zero. More concretely, to remove weights from $W$, $M$ is applied to $W$ in an element-wise manner as $W \odot M$, where $\odot$ denotes the Hadamard product. Popular PAI approaches, such as SNIP [16], GraSP [27], and SynFlow [24], employ different methods to estimate the importance of individual weights. Single-shot PAI algorithms, such as SNIP and GraSP, prune the model to the desired sparsity in a single pass. Alternatively, SynFlow, which represents the state-of-the-art PAI algorithm, repeats the process of pruning a small fraction of weights and re-evaluating the importance scores until the desired pruning rate is reached. Through the iterative process, the importance of each weight can be estimated more accurately. Specifically, the importance score for a fully connected network used in SynFlow is defined as:
$$S(W^l_{ij}) = \left[\mathbf{1}^T \prod_{k=l+1}^{N} \left|W^k \odot M^k\right|\right]_i \left|W^l_{ij} M^l_{ij}\right| \left[\prod_{k=1}^{l-1} \left|W^k \odot M^k\right| \mathbf{1}\right]_j, \tag{1}$$
where $N$ is the number of layers, $W^l$ and $M^l$ are the weight and weight mask of the $l$-th layer, $S(W^l_{ij})$ is the SynFlow score for a single weight $W^l_{ij}$, $|\cdot|$ is the element-wise absolute value operation, and $\mathbf{1}$ is an all-one vector. Here no training data or labels are required to compute the importance score, thus making SynFlow a data-agnostic algorithm.
Deficiencies. Similar to training-based fine-grained pruning [7, 15], existing PAI methods also use the sum of importance scores of the remaining weights as a proxy for model accuracy.
Specifically, PAI obtains a binary weight mask (i.e., the pruning decisions) by maximizing the following objective:
$$\underset{M^l}{\text{maximize}} \;\; \sum_{l=1}^{N} S^l \cdot M^l \quad \text{subject to} \quad \sum_{l=1}^{N} \|M^l\|_0 \le B_{params}, \tag{2}$$
where $S^l$ is the importance score matrix for the $l$-th layer, $\|\cdot\|_0$ is the number of nonzero entries in a matrix, and $B_{params}$ is the target size of the compressed model. Given the setup of this optimization, it is natural that a subset of the individual weights will be deemed more important than others. Moreover, existing methods for computing the importance scores all depend on the values of the weights, so any update to the weights (such as reinitialization) will easily change the accuracy metric (i.e., the sum of the individual importance scores). However, recent studies [22, 5] show that randomly shuffling the weight mask $M^l$ or reinitializing the weights $W^l$ does not affect the final accuracy of models compressed by any of the existing PAI methods. In addition, they show that different pruned models have a similar accuracy as long as they have the same layer-wise density. This finding suggests that the aforementioned metric is not a good proxy for the accuracy of the pruned model.
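To make Equation 1 concrete, the following is a minimal pure-Python sketch of the SynFlow-style score for a small bias-free fully connected network. The helper names (`matvec`, `transpose`, `synflow_scores`) and the (out, in) weight layout are illustrative assumptions, not taken from the SynFlow code base. The example also checks a property that follows from the formula: the scores of every layer sum to the same value.

```python
def matvec(a, v):
    """Matrix-vector product for nested lists."""
    return [sum(x * y for x, y in zip(row, v)) for row in a]


def transpose(a):
    """Transpose a nested-list matrix."""
    return [list(col) for col in zip(*a)]


def synflow_scores(weights, masks):
    """SynFlow-style scores (Equation 1) for a bias-free fully connected net.

    Assumption: weights[l] and masks[l] have shape (out_l, in_l), so that
    layer l computes y = W x. Returns one score matrix per layer.
    """
    # |W ⊙ M| for every layer
    absw = [[[abs(w * m) for w, m in zip(w_row, m_row)]
             for w_row, m_row in zip(W, M)]
            for W, M in zip(weights, masks)]
    n = len(absw)
    # fwd[l]: all-ones input pushed through layers 0 .. l-1
    fwd = [[1.0] * len(absw[0][0])]
    for a in absw:
        fwd.append(matvec(a, fwd[-1]))
    # bwd[l]: all-ones output pulled back through layers n-1 .. l
    bwd = [None] * (n + 1)
    bwd[n] = [1.0] * len(absw[-1])
    for l in range(n - 1, -1, -1):
        bwd[l] = matvec(transpose(absw[l]), bwd[l + 1])
    # score of weight (i, j) in layer l
    return [[[bwd[l + 1][i] * absw[l][i][j] * fwd[l][j]
              for j in range(len(absw[l][0]))]
             for i in range(len(absw[l]))]
            for l in range(n)]


# Toy 2-3-2 network with one pruned weight in the first layer.
W = [[[1, -2], [0, 3], [1, 1]], [[1, 0, -1], [2, 1, 0]]]
M = [[[1, 1], [1, 1], [0, 1]], [[1, 1, 1], [1, 1, 1]]]
scores = synflow_scores(W, M)
layer_sums = [sum(map(sum, s)) for s in scores]
print(layer_sums)  # both layers sum to the same total
```

No data or labels enter the computation, which matches the data-agnostic nature of the score described above.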

3.2. SYNEXP INVARIANCE THEOREM

In this section, we propose a new proxy metric called SynExp to address the deficiencies of the existing PAI approaches. We argue that a good accuracy proxy should enable PAI to achieve the following:
1. The pruning decision (i.e., weight mask $M$) can be made before the model is initialized.
2. Maximization of the proxy should output the layer-wise density $p_l$ as the result, as opposed to pruning decisions for individual weights.
For ease of later discussion, we formalize the weight matrix $W$ and the weight mask matrix $M$ as two random variables, given a fixed density $p_l$ for each layer, for random pruning before initialization. If $W^l$ contains $\alpha_l$ parameters, then
$$A_l = \left\{ M^l : M^l_i \in \{0, 1\} \;\; \forall\, 1 \le i \le \alpha_l, \;\; \sum_i M^l_i = p_l \times \alpha_l \right\}$$
is the set of all possible $M^l$ with the same shape as $W^l$ that satisfy the layer-wise density ($p_l$) constraint. The random weight mask $M^l$ for layer $l$ is then sampled uniformly from $A_l$. Also, each individual weight $W^l_i$ in layer $l$ is independently sampled from a given distribution $D_l$. The observations in Section 3.1 indicate that different values of these two random variables $M$ and $W$ result in similar final accuracy of the pruned model. However, different values do change the proxy value for the model accuracy in existing PAI methods. For example, the SynFlow score in Equation 2 may change under different instantiations of $M$ and $W$. Therefore, we propose a new proxy for the model accuracy in the context of PAI that is invariant to the instantiation of $M$ and $W$: the expectation of the sum of the importance scores of all unpruned (i.e., remaining) weights. The proposed proxy can be formulated as follows:
$$\underset{p_l}{\text{maximize}} \;\; \mathbb{E}_{M,W}[S] = \mathbb{E}_{M,W}\left[\sum_{l=1}^{N} S^l \cdot M^l\right] \quad \text{subject to} \quad \sum_{l=1}^{N} \alpha_l \cdot p_l \le B_{params}, \tag{3}$$
where $\mathbb{E}_{M,W}[S]$ stands for the expectation of the importance score $S$ over $W$ and $M$. In this new formulation, $p_l$ is optimized to maximize the proposed proxy for model accuracy.
Since the expectation is computed over $W$ and $M$, the instantiations of these two random variables do not affect the expectation. To evaluate the expectation before weight initialization, we adopt the importance metric proposed by SynFlow, i.e., we plug $S$ from Equation 1 into Equation 3. As a result, we can compute the expectation analytically without forward or backward propagation. This new expectation-based proxy is referred to as SynExp, i.e., synaptic expectation. We show that SynExp is invariant to the pruning granularity of PAI in the SynExp Invariance Theorem, which is stated as follows.
Theorem 1. SynExp Invariance Theorem. Given a specific CNN architecture, the SynExp ($\mathbb{E}_{M,W}[S_{SF}]$) of any randomly compressed model with the same layer-wise density $p_l$ is a constant, independent of the pruning granularity. The constant SynExp equals:
$$\mathbb{E}_{M,W}[S_{SF}] = N C_{N+1} \prod_{l=1}^{N} \left(p_l C_l \cdot \mathbb{E}_{x \sim D}[|x|]\right), \tag{4}$$
where $N$ is the number of layers in the network, $\mathbb{E}_{x \sim D}[|x|]$ is the expected magnitude under distribution $D$, $C_l$ is the input channel size of layer $l$ and is also the output channel size of layer $l-1$, and $p_l = \frac{1}{\alpha_l}\|M^l\|_0$ is the layer-wise density.
In Equation 4, $N$ and $C_l$ are hyperparameters of the CNN architecture and can be considered constants. $\mathbb{E}|D_l|$ (shorthand for $\mathbb{E}_{x \sim D_l}[|x|]$) is also a constant under a particular distribution $D_l$. The layer-wise density $p_l$ is the only variable in the equation. Thus, SynExp satisfies both of the aforementioned properties: 1) pruning is done prior to the weight initialization; 2) the layer-wise density can be directly optimized. Furthermore, Theorem 1 shows that the granularity of pruning has no impact on the proposed SynExp metric. In other words, a CNN model compressed using either an unstructured or a structured pruning method is expected to have a similar accuracy. The detailed proof of the SynExp Invariance Theorem can be found in Appendix A.
We also empirically verify the theorem by randomly pruning each layer of a CNN at three different granularity levels but with the same layer-wise density. Specifically, we perform random pruning at the (1) weight, (2) filter, and (3) channel level to achieve the desired layer-wise pruning ratios obtained from solving Equation 3. For weight and filter pruning, randomly pruning each layer to match the layer-wise density $p_l$ occasionally detaches some weights from the network, especially when the density is low. The detached weights do not contribute to the prediction but are counted as remaining parameters. Thus, we remove the detached weights for a fair comparison, following the approach described in [25]. For channel pruning, it is not trivial to achieve the target layer-wise density while satisfying the constraint that the number of output channels of the previous layer must equal the number of input channels of the next layer. Therefore, we employ PreCrop, proposed in Section 4.1. As shown in Figure 1, random pruning at different granularities obtains a similar accuracy compared to SynFlow, as long as the layer-wise density remains the same. The empirical results are consistent with the SynExp Invariance Theorem and also demonstrate the efficacy of the proposed SynExp metric. We include additional empirical results using different CNN architectures and other importance scores (e.g., SNIP, GraSP) in Appendix C.
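As a numerical sanity check of Equation 4, the sketch below compares the closed form with a Monte Carlo estimate on a tiny fully connected network, once with weight-level random masks and once with masks that remove whole output channels. All function names are hypothetical. The sketch assumes weights drawn uniformly from $[-1, 1]$ (so $\mathbb{E}[|x|] = 0.5$), weight matrices of shape $C_l \times C_{l+1}$, and masks sampled independently per layer; it also uses the fact that, under Equation 1, the scores of each layer sum to the same value.

```python
import random


def synexp_closed_form(channels, densities, mean_abs):
    """Equation 4: N * C_{N+1} * prod_l (p_l * C_l * E|x|)."""
    value = len(densities) * channels[-1]
    for c_in, p in zip(channels[:-1], densities):
        value *= p * c_in * mean_abs
    return value


def random_mask(c_in, c_out, density, structured, rng):
    """Random 0/1 mask of shape (c_in, c_out) with the given density."""
    if structured:  # keep whole output channels (columns)
        keep = set(rng.sample(range(c_out), round(density * c_out)))
        return [[1 if j in keep else 0 for j in range(c_out)]
                for _ in range(c_in)]
    keep = set(rng.sample(range(c_in * c_out), round(density * c_in * c_out)))
    return [[1 if i * c_out + j in keep else 0 for j in range(c_out)]
            for i in range(c_in)]


def synexp_monte_carlo(channels, densities, structured, samples=5000, seed=0):
    """Average the summed SynFlow score over random weights and masks."""
    rng = random.Random(seed)
    n = len(densities)
    total = 0.0
    for _ in range(samples):
        vec = [1.0] * channels[0]  # all-ones row vector, pushed left to right
        for l in range(n):
            c_in, c_out = channels[l], channels[l + 1]
            mask = random_mask(c_in, c_out, densities[l], structured, rng)
            vec = [sum(vec[i] * abs(rng.uniform(-1, 1)) * mask[i][j]
                       for i in range(c_in))
                   for j in range(c_out)]
        total += n * sum(vec)  # every layer contributes the same score sum
    return total / samples


channels, densities = [3, 4, 2], [0.5, 0.5]
closed = synexp_closed_form(channels, densities, mean_abs=0.5)
weight_level = synexp_monte_carlo(channels, densities, structured=False)
channel_level = synexp_monte_carlo(channels, densities, structured=True)
print(closed, weight_level, channel_level)
```

Both estimates should land close to the closed-form value, illustrating that only the layer-wise density matters for the expectation. The sketch says nothing about trained accuracy, which the paper evaluates empirically.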

3.3. OPTIMIZING SYNEXP

As discussed in Section 3.2, the layer-wise density matters for our proposed SynExp approach. Here, we show how to obtain the layer-wise density in Equation 3 that maximizes SynExp under model size and/or FLOPs constraints.

3.3.1. OPTIMIZING SYNEXP WITH PARAMETER COUNT CONSTRAINT

Given that the goal of PAI is to reduce the size of the model, we need to add a constraint on the total number of parameters $B_{params}$ (i.e., a parameter count constraint), where $B_{params}$ is typically greater than zero and less than the number of parameters in the original network. Since the layer-wise density $p_l$ is the only variable in Equation 3, we can simplify the problem by taking the logarithm of the objective and removing the constant terms, as follows:
$$\underset{p_l}{\text{maximize}} \;\; \sum_{l=1}^{N} \log p_l \quad \text{subject to} \quad \sum_{l=1}^{N} \alpha_l \cdot p_l \le B_{params}, \quad 0 < p_l \le 1, \;\; \forall\, 1 \le l \le N, \tag{5}$$
where $\alpha_l$ is the number of parameters in layer $l$. Equation 5 is a convex optimization problem that can be solved analytically (the analytical solutions are included in Appendix B). We compare the layer-wise density derived from solving Equation 5 with the density obtained using SynFlow. As shown in Figure 2, the layer-wise densities obtained by both approaches are nearly identical, while our new formulation eliminates the need for the iterative re-evaluation of the SynFlow scores as well as the pruning process in SynFlow. It is also worth noting that the proposed method can find the optimal layer-wise density even before the network is initialized.
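One way to solve Equation 5 in closed form is sketched below. It is derived here from the KKT conditions and is not necessarily the form given in Appendix B: stationarity gives $p_l = \min(1, t / \alpha_l)$ for a common threshold $t$ (the inverse of the multiplier), and $t$ is chosen so that the parameter budget is met exactly. The function name `solve_layer_density` is hypothetical.

```python
def solve_layer_density(alphas, budget):
    """Maximize sum(log p_l) s.t. sum(alpha_l * p_l) <= budget, 0 < p_l <= 1.

    Sketch based on the KKT conditions: p_l = min(1, t / alpha_l), where the
    threshold t is chosen so that the budget constraint is tight.
    """
    if budget <= 0:
        raise ValueError("budget must be positive")
    if budget >= sum(alphas):
        return [1.0] * len(alphas)  # nothing needs to be pruned
    n = len(alphas)
    kept_fully = 0.0  # parameters of layers already clipped to p_l = 1
    t = 0.0
    for k, a in enumerate(sorted(alphas)):
        # Split the remaining budget equally among the unclipped layers.
        t = (budget - kept_fully) / (n - k)
        if t <= a:
            break  # all remaining (larger) layers satisfy t / alpha <= 1
        kept_fully += a  # this layer is small enough to keep entirely
    return [min(1.0, t / a) for a in alphas]


alphas = [100, 400, 1600]  # parameters per layer (toy numbers)
p = solve_layer_density(alphas, budget=700)
print(p, sum(a * d for a, d in zip(alphas, p)))
```

In this toy case the smallest layer is kept entirely and the two larger layers each retain the same number of parameters, so larger layers end up with lower density.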

3.3.2. OPTIMIZING SYNEXP WITH PARAMETER COUNT AND FLOPS CONSTRAINTS

As discussed in Section 3.3.1, we can formulate PAI as a convex optimization problem with a constraint on the model size. However, the number of parameters does not necessarily reflect the performance (e.g., throughput) of the CNN model. In many cases, CNN models are compute-bound on commodity hardware [14, 9]. Therefore, we also introduce a FLOPs constraint in our formulation. With the existing PAI algorithms, it is not straightforward to directly constrain the optimization using a bound on the FLOP count. The savings in FLOPs are instead a byproduct of the weight pruning specified in Equation 2. In contrast, it is much easier to account for FLOPs in our formulation, which aims to determine the density of each layer as opposed to the inclusion of individual weights from different layers. Thus, for each layer, we can easily derive the required FLOP count based on the density ($p_l$). After incorporating the constraint on FLOPs ($B_{FLOPs}$), the convex optimization problem becomes:
$$\underset{p_l}{\text{maximize}} \;\; \sum_{l=1}^{N} \log p_l \quad \text{subject to} \quad \sum_{l=1}^{N} \alpha_l \cdot p_l \le B_{params}, \quad \sum_{l=1}^{N} \beta_l \cdot p_l \le B_{FLOPs}, \quad 0 < p_l \le 1, \;\; \forall\, 1 \le l \le N, \tag{6}$$
where $\beta_l$ is the number of FLOPs in the $l$-th layer. Since the additional FLOPs constraint is linear, the optimization problem in Equation 6 remains convex and has an analytical solution (included in Appendix B). By solving the SynExp optimization with a fixed $B_{params}$ but different $B_{FLOPs}$, we can obtain the layer-wise density for various models that have the same number of parameters but different FLOPs. Then, we perform random weight pruning on the CNN model to achieve the desired layer-wise density. We compare the proposed SynExp optimization (denoted as Ours) with other popular PAI methods. As depicted in Figure 3, given a fixed model size ($1.5 \times 10^4$ parameters in the figure), our method can generate a Pareto frontier that spans the spectrum of FLOPs, while each of the other methods yields only a single, fixed FLOP count.
Our method dominates all other methods in terms of both accuracy and FLOPs reduction.
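The sketch below illustrates why the FLOPs constraint in Equation 6 is linear in the densities. It assumes bias-free standard convolutions, counts one multiply-accumulate as one FLOP, and assumes that pruning a layer to density $p_l$ scales its FLOPs by $p_l$. The helper names and layer shapes are hypothetical and are not taken from the paper's experiments.

```python
import math


def conv_costs(kernel, c_in, c_out, h_out, w_out):
    """Return (alpha, beta): parameter count and FLOPs of one conv layer.

    Assumptions: no bias, dense kernel x kernel convolution, and one
    multiply-accumulate counted as one FLOP.
    """
    alpha = kernel * kernel * c_in * c_out
    beta = alpha * h_out * w_out  # each weight is reused at every output pixel
    return alpha, beta


def evaluate(layers, densities, b_params, b_flops):
    """Evaluate the objective and both linear constraints of Equation 6."""
    costs = [conv_costs(*layer) for layer in layers]
    params = sum(a * p for (a, _), p in zip(costs, densities))
    flops = sum(b * p for (_, b), p in zip(costs, densities))
    objective = sum(math.log(p) for p in densities)
    feasible = params <= b_params and flops <= b_flops
    return params, flops, objective, feasible


# (kernel, c_in, c_out, h_out, w_out) for two toy layers
layers = [(3, 16, 16, 32, 32), (3, 16, 32, 16, 16)]
params, flops, objective, feasible = evaluate(
    layers, densities=[0.5, 0.25], b_params=2500, b_flops=1_500_000)
print(params, flops, objective, feasible)
```

In this toy example both layers retain the same number of parameters, yet the first layer accounts for four times the FLOPs of the second because its feature map is larger. This is why a parameter budget alone does not pin down the FLOP count.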

4. STRUCTURED PRUNING-AT-INITIALIZATION

The SynExp Invariance Theorem shows that the pruning granularity of PAI methods should not affect the accuracy of the pruned model. Channel pruning, which prunes the weights of the CNN at the output channel granularity, is considered the most coarse-grained and hardware-friendly pruning technique. Therefore, applying the proposed PAI method to channel pruning can avoid both complicated retraining/re-tuning procedures and irregular computations. In this section, we propose a structured PAI method for channel pruning, named PreCrop, to prune CNNs in the channel dimension. In addition, we propose a variant of PreCrop with relaxed density constraints, called PreConfig, to reconfigure the width of each layer in the CNN model.

4.1. PRECROP

Applying the proposed PAI method to channel pruning requires a two-step procedure. First, the layer-wise density $p_l$ is obtained by solving the optimization problem shown in Equation 5 or 6. Second, we need to decide how many output channels of each layer should be pruned to satisfy the layer-wise density. However, it is not straightforward to compress each layer to match a given layer-wise density due to the additional constraint that the number of output channels of the current layer must match the number of input channels of the next layer. We introduce PreCrop, which compresses each layer to meet the desired layer-wise density. Let $C_l$ and $C_{l+1}$ be the number of input channels of layer $l$ and $l+1$, respectively. $C_{l+1}$ is also the number of output channels of layer $l$.
For layers with no residual connections, the number of output channels of layer $l$ is reduced to $\sqrt{p_l} \cdot C_{l+1}$. The number of input channels of layer $l+1$ needs to match the number of output channels of layer $l$, so it is also reduced to $\sqrt{p_l} \cdot C_{l+1}$. Therefore, the actual density of layer $l$ after PreCrop is $\sqrt{p_{l-1} \cdot p_l}$ instead of $p_l$. We empirically find that $\sqrt{p_{l-1} \cdot p_l}$ is close enough to $p_l$ because neighboring layers have similar layer-wise densities. Alternatively, one can obtain the exact layer-wise density $p_l$ by reducing only the number of input channels or only the number of output channels of a layer. However, this approach leads to a significant drop in accuracy, because the numbers of input and output channels can then differ dramatically (e.g., $p_l C_l \ll C_{l+1}$ or $C_l \gg p_l C_{l+1}$). This causes the shape of the feature map to change dramatically between adjacent layers, resulting in information loss.
For layers with residual connections, Figure 4 depicts an approach to circumvent the constraint on the number of channels of adjacent layers. We can reduce the number of input and output channels of layer $l$ from $C_l$ and $C_{l+1}$ to $\sqrt{p_l} C_l$ and $\sqrt{p_l} C_{l+1}$, respectively. In this way, the density of each layer matches the layer-wise density obtained from the proposed PAI method. Since the output of layer $l$ needs to be added element-wise to the original input of layer $l$, the output of layer $l$ is padded with zero-valued channels to match the shape of the original input. In our implementation, we simply add the output of layer $l$ to the first $\sqrt{p_l} C_{l+1}$ channels of the original input to layer $l$, thus requiring no extra memory or computation for zero padding. PreCrop eliminates the requirement for sparse computation in existing PAI methods and thus can be used to accelerate both training and inference of the pruned models.
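The channel arithmetic described above can be sketched as follows. This is an illustration with hypothetical helper names, not the authors' implementation. It assumes a plain chain of layers without residual connections, leaves the network input width untouched, rounds to the nearest integer, and keeps at least one channel per layer. A real model would typically also keep the classifier output width fixed. The `residual_add` helper models each channel as a single number to show the "add to the first channels" trick.

```python
import math


def precrop_channels(channels, densities):
    """Shrink output widths by sqrt(p_l) for a chain of layers.

    channels[l] is the input width of layer l and channels[l + 1] its output
    width; densities[l] is the target density p_l of layer l.
    """
    new_channels = [channels[0]]  # the network input is not pruned
    for c_out, p in zip(channels[1:], densities):
        new_channels.append(max(1, round(math.sqrt(p) * c_out)))
    return new_channels


def effective_density(channels, new_channels):
    """Actual per-layer density after cropping input and output widths."""
    return [(n_in * n_out) / (c_in * c_out)
            for c_in, c_out, n_in, n_out in zip(
                channels[:-1], channels[1:],
                new_channels[:-1], new_channels[1:])]


def residual_add(identity, branch_out):
    """Add a narrower branch output onto the first channels of the identity.

    Equivalent to zero-padding branch_out up to len(identity) before adding.
    """
    added = [x + y for x, y in zip(identity, branch_out)]
    return added + identity[len(branch_out):]


channels = [16, 32, 64]
new_channels = precrop_channels(channels, densities=[0.25, 0.25])
print(new_channels, effective_density(channels, new_channels))
print(residual_add([1, 2, 3, 4], [10, 20]))
```

In the toy run, the second layer reaches the geometric mean of its own and its predecessor's target density, while the first layer ends up denser than its target because the input width is untouched.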

4.2. PRECONFIG: PRECROP WITH RELAXED DENSITY CONSTRAINT

PreCrop uses the layer-wise density obtained from solving the convex optimization problem, which never exceeds 1, following the common setting for pruning (i.e., $p_l \le 1$). However, this constraint on layer-wise density is not necessary for our method, since we can increase the number of channels (i.e., expand the width of the layer) before initialization. By solving the problem in Equation 6 without the constraint $p_l \le 1$, we can expand the layers with a density greater than 1 ($p_l > 1$) and prune the layers with a density less than 1 ($p_l < 1$). We call this variant of PreCrop PreConfig (PreCrop-Reconfigure). If we set $B_{params}$ and $B_{FLOPs}$ to be the same as in the original network, we can essentially reconfigure the width of each layer of a given network architecture under certain constraints on model size and FLOPs. The width of each layer in a CNN is usually designed manually, which often relies on extensive experience and intuition. Using PreConfig, we can automatically determine the width of each layer in the network to achieve a better cost-accuracy trade-off. PreConfig can also be used as (a part of) an ultra-fast NAS. Compared to conventional NAS, which typically searches over the width, depth, resolution, and choice of building blocks, PreConfig only changes the width. However, PreConfig requires far less time and computation than NAS methods: it only needs to solve a relatively small convex optimization problem, which can finish within a second on a CPU.
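To illustrate how dropping the upper bound $p_l \le 1$ changes the solution, the sketch below considers a simplified special case with only the parameter count constraint (the FLOPs constraint of Equation 6 is omitted here as a simplifying assumption). In that case the KKT conditions give every layer an equal share of the budget, $\alpha_l p_l = B_{params} / N$, so small layers are widened and large layers are pruned. The function name is hypothetical, and this is not presented as the solution in Appendix B.

```python
def preconfig_density_params_only(alphas, budget):
    """Maximize sum(log p_l) s.t. sum(alpha_l * p_l) <= budget, p_l > 0.

    Simplified sketch: only the parameter constraint is used and there is no
    upper bound on p_l, so each layer receives budget / N parameters.
    """
    if budget <= 0:
        raise ValueError("budget must be positive")
    share = budget / len(alphas)
    return [share / a for a in alphas]


alphas = [100, 400, 1600]  # parameters per layer (toy numbers)
p = preconfig_density_params_only(alphas, budget=sum(alphas))
print(p)  # densities above 1 mean the layer is widened
```

With the budget set to the original parameter count, the total size is unchanged while the widths are redistributed across layers. Adding the FLOPs constraint would shift this allocation, which is why the paper relies on a convex solver for the full problem.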

5. EVALUATION

In this section, we empirically evaluate PreCrop and PreConfig. We first demonstrate the effectiveness of PreCrop by comparing it with SynFlow. We then use PreConfig to tune the width of each layer and compare the accuracy of the model after PreConfig with the original model. We perform experiments using various modern CNN models, including ResNet [10] , MobileNetV2 [21] , and EfficientNet [23] , on both CIFAR-10 and ImageNet. We set all hyperparameters used to train the models pruned by different PAI algorithms to be the same. See Appendix E for detailed experimental settings.

5.1. EVALUATION OF PRECROP

For CIFAR-10, we compare the accuracy of SynFlow (red line) and two variants of PreCrop: PreCrop-Params (blue line) and PreCrop-FLOPs (green line). PreCrop-Params adds the parameter count constraint to the convex optimization problem, whereas PreCrop-FLOPs imposes the FLOPs constraint. As shown in Figure 5a, PreCrop-Params achieves accuracy similar to or even better than that of SynFlow under a wide range of model size constraints, thus validating that PreCrop-Params can be as effective as the fine-grained PAI method. Considering the benefits of structured pruning, PreCrop-Params should be favored over existing PAI methods. Figure 5b further shows that PreCrop-FLOPs consistently outperforms SynFlow by a large margin, especially when the reduction in FLOPs is large. The experimental results show that PreCrop-FLOPs should be adopted when the performance of the model is limited by the computational cost.
Table 2 compares the accuracy of the reconfigured model with that of the original model under similar model size and FLOPs constraints. For ResNet34, with similar accuracy, we reduce the parameter count by 25%. For MobileNetV2, we achieve 0.3% higher accuracy than the baseline with 20% fewer parameters and 3% fewer FLOPs. For EfficientNet, we also achieve 0.3% higher accuracy than the baseline with only 80% of the parameters and the same FLOPs. Note that EfficientNet was itself identified by a NAS method. As PreConfig only changes the number of channels of the model before initialization, we believe it also applies to other compression techniques.

6. CONCLUSION

In this work, we theoretically and empirically show that the accuracy of the CNN models pruned using PAI methods depends on the layer-wise density.
$$\mathbb{E}_{M,W}[S_{SF}] = N C_{N+1} \prod_{l=1}^{N} \left(p_l C_l \cdot \mathbb{E}_{x \sim D}[|x|]\right),$$
where $N$ is the number of layers in the network, $\mathbb{E}_{x \sim D}[|x|]$ is the expected magnitude under distribution $D$, $C_l$ is the input channel size of layer $l$ and is also the output channel size of layer $l-1$, and $p_l = \frac{1}{\alpha_l}\|M^l\|_0$ is the layer-wise density.
Proof. Assume the network has $N$ layers, with weight matrix $W^l \in \mathbb{R}^{C_l \times C_{l+1}}$ and mask matrix $M^l \in \{0, 1\}^{C_l \times C_{l+1}}$, where $C_l$ and $C_{l+1}$ are the input and output channel sizes of layer $l$. As the output channel size of any layer $l$ equals the input channel size of the next layer $l+1$, the same symbol $C_{l+1}$ denotes both for all $l < N$. We first prove Theorem 1 for a fully connected network; the result can then be extended to CNNs easily. From Equation 1, in a fully connected network, the Synaptic Flow score for any parameter $W^l_{(i,j)}$ with mask $M^l_{(i,j)}$ in layer $l$ equals:
$$S_{SF}(W^l_{(i,j)}) = \left[\mathbf{1}^T \prod_{k=l+1}^{N} \left|W^k \odot M^k\right|\right]_i \left|W^l_{(i,j)} M^l_{(i,j)}\right| \left[\prod_{k=1}^{l-1} \left|W^k \odot M^k\right| \mathbf{1}\right]_j$$
We compute the SynExp of layer $l$, denoted $\mathbb{E}_{[M,W]}(S_{SF}^{[l]})$; the SynExp of the network is then simply the sum of the SynExp of all layers:
$$\mathbb{E}_{[M,W]}(S_{SF}) = \sum_{l=1}^{N} \mathbb{E}_{[M,W]}(S_{SF}^{[l]})$$
We define the expectation value for input channel $i$, output channel $j$, and the whole layer in layer $l$ as $E^l_{(i,*)}$, $E^l_{(*,j)}$, and $E^l_{(*,*)}$:
$$E^l_{(i,*)} = \frac{1}{C_{l+1}} \sum_{x} \left|W^l_{(i,x)} M^l_{(i,x)}\right|$$
$$E^l_{(*,j)} = \frac{1}{C_l} \sum_{x} \left|W^l_{(x,j)} M^l_{(x,j)}\right|$$
$$E^l_{(i,j)} = E^l_{(*,*)} = \frac{1}{C_l C_{l+1}} \sum_{i,j} \left|W^l_{(i,j)} M^l_{(i,j)}\right| = \frac{1}{\alpha_l} \sum_{i,j} \left|W^l_{(i,j)} M^l_{(i,j)}\right| = p_l \, \mathbb{E}|D_l|$$
Here we use $\mathbb{E}|D_l|$ to denote $\mathbb{E}_{x \sim D}[|x|]$. As the weights in layer $l$ are sampled from distribution $D$, and the mask matrices are also randomly sampled, we have
In practice, to avoid solving for $\mu$, we use a convex optimization solver, which can obtain the solution on a CPU within a second for such a small-scale convex optimization problem.

C MORE EMPIRICAL RESULTS ON SYNEXP INVARIANCE THEOREM

We show more empirical results that validate the SynExp Invariance Theorem. We first compare the performance of different pruning granularities on VGG16 using CIFAR-10. All the settings in this experiment are the same as in Figure 1, except that the experiment is done on VGG16. We then verify that the SynExp Invariance Theorem holds not only for SynFlow but also for other PAI algorithms. In this experiment, we first use other PAI methods (i.e., SNIP and GraSP) to obtain the layer-wise density $p_l$. Then we use random pruning to match $p_l$ at the channel level. The results are shown in Figure 7. As shown in all the above experiments, as long as the layer-wise density is the same, the pruning granularity does not affect the model accuracy.



We include analytical solutions for Equation 5 and Equation 6 in Appendix B for completeness.
https://github.com/osmr/imgclsmob
https://github.com/ganguli-lab/Synaptic-Flow



Figure 1: Comparison of the performance using different pruning granularities on ResNet20 using CIFAR-10.

Figure 2: Comparison of the layer-wise densities obtained by SynExp optimization with parameter count constraint and SynFlow. Higher transparency means that the problem is constrained by a smaller parameter count.

Figure 3: Comparison of our method with other PAI methods. We repeat the experiment using ResNet-20 on CIFAR-10 five times and report the mean and variance (error bars) of the accuracy. All the models in the figures have $1.5 \times 10^4$ parameters.

Figure 4: Illustration of PreCrop for layers with residual connections. $C_l$ and $C_{l+1}$ represent the number of input channels of layers $l$ and $l+1$, respectively. $p_l$ represents the density of layer $l$.

(a) PreCrop-Params vs. SynFlow. (b) PreCrop-FLOPs vs. SynFlow.

Figure 5: Comparison of PreCrop-Params and PreCrop-FLOPs with SynFlow. We repeat the experiment using ResNet20 (top), WideResNet20 (middle), and MobileNetV2 (bottom) on CIFAR-10 three times and report the mean and variance (error bars) of the accuracy.

Figure 6: Comparison of the performance using different pruning granularities on VGG16 using CIFAR-10.

Figure 7: Comparison of the performance using different pruning granularities on ResNet20 using CIFAR-10. SNIP (left) and GraSP (right) importance scores are used.

B.3 SOLUTION FOR PRECONFIG

Table 1: Comparison of PreCrop with SynFlow on ImageNet. The dagger (†) indicates that the numbers are theoretical and do not account for the overhead of storing and computing with sparse matrices.

Table 1 summarizes the comparison between PreCrop and SynFlow on ImageNet. For ResNet-34, PreCrop achieves 0.6% lower accuracy than SynFlow with a similar model size and FLOPs. For MobileNetV2 and EfficientNetB0, PreCrop achieves accuracy improvements of 1.2% and 0.9%, respectively, over SynFlow with strictly fewer FLOPs and parameters. The experimental results on ImageNet further support the SynExp Invariance Theorem: coarse-grained structured pruning (e.g., PreCrop) can perform as well as unstructured pruning. In conclusion, PreCrop achieves a favorable tradeoff between accuracy and model size/FLOPs compared to the state-of-the-art PAI algorithm.

5.2 EVALUATION OF PRECONFIG

As discussed in Section 4.2, PreConfig can be viewed as an ultra-fast NAS technique, which adjusts the width of each layer in the model even before the weights are initialized.

PreConfig on ImageNet.

A PROOF OF SYNEXP INVARIANCE THEOREM

Theorem 1. Given a specific CNN architecture, the SynExp ($\mathbb{E}_{[M,W]}[S_{SF}]$) of any randomly compressed model with the same layer-wise density $p_l$ is a constant, independent of the pruning granularity. The constant SynExp equals:

REPRODUCIBILITY STATEMENT

The proof of the SynExp Invariance Theorem is stated in the appendix with explanations. We provide the source code for the key experiments in the paper. We thoroughly checked the implementation and also verified empirically that the results in this paper are reproducible. The source code will be made available through GitHub.

Combining Equation 3 and 14, because the instantiations of the weight matrices and mask matrices for each layer are independent:

According to Equation 9,

The SynExp Invariance Theorem can also be extended to CNNs, since the SynExp of a CNN is proportional to that of a fully connected network. The SynExp of each layer differs between CNNs and fully connected networks only by a factor equal to $K^2 C_{l+1}$, where $K$ is the kernel size of the convolutional layer.
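The fully connected case of Theorem 1 can be sanity-checked numerically. The following is a minimal sketch, not the code used for the experiments in this paper. It assumes standard normal weights (so that $\mathbb{E}_{x \sim D}[|x|] = \sqrt{2/\pi}$), masks with exactly the requested layer-wise density, and a small hypothetical network. It relies on the fact that the SynFlow scores of each layer sum to $R = \mathbf{1}^T \prod_l |W^l \odot M^l| \mathbf{1}$, so the total score is $N R$. The function names are illustrative only.

```python
import numpy as np

def random_masks(rng, trials, rows, cols, density, granularity):
    """Sample `trials` binary masks of shape (rows, cols) with exact density."""
    if granularity == "weight":
        n = rows * cols
        keep = round(density * n)
        # argsort of uniform noise yields an independent random permutation per trial
        order = rng.random((trials, n)).argsort(axis=1)
        return (order < keep).reshape(trials, rows, cols).astype(float)
    # Channel granularity: keep or drop whole output channels (columns).
    keep = round(density * cols)
    order = rng.random((trials, cols)).argsort(axis=1)
    col_mask = (order < keep).astype(float)
    return np.repeat(col_mask[:, None, :], rows, axis=1)

def synexp_monte_carlo(channels, densities, granularity, trials=5000, seed=0):
    """Monte Carlo estimate of the expected sum of SynFlow scores."""
    rng = np.random.default_rng(seed)
    n_layers = len(densities)
    flow = np.ones((trials, 1, channels[0]))  # the all-ones vector 1^T
    for l in range(n_layers):
        rows, cols = channels[l], channels[l + 1]
        weights = np.abs(rng.standard_normal((trials, rows, cols)))
        masks = random_masks(rng, trials, rows, cols, densities[l], granularity)
        flow = flow @ (weights * masks)
    r = flow.sum(axis=(1, 2))  # R for each trial
    # Each layer's scores sum to R, so the network total is N * R.
    return n_layers * r.mean()

def synexp_closed_form(channels, densities, mean_abs):
    """N * C_{N+1} * prod_l (p_l * C_l * E|x|), as stated in Theorem 1."""
    n_layers = len(densities)
    product = 1.0
    for l in range(n_layers):
        product *= densities[l] * channels[l] * mean_abs
    return n_layers * channels[-1] * product

channels = [8, 16, 16, 4]      # C_1, ..., C_{N+1} (hypothetical sizes)
densities = [0.5, 0.25, 0.5]   # p_1, ..., p_N (hypothetical densities)
expected = synexp_closed_form(channels, densities, np.sqrt(2.0 / np.pi))
weight_level = synexp_monte_carlo(channels, densities, "weight", seed=0)
channel_level = synexp_monte_carlo(channels, densities, "channel", seed=1)
print(expected, weight_level, channel_level)
```

Under these assumptions, both estimates should agree with the closed form up to Monte Carlo noise, even though the two mask families differ in granularity.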

B SOLUTION OF THE OPTIMIZATION PROBLEM

For the convex optimization problems in Equation 5, Equation 6, and PreConfig, we can use the Karush-Kuhn-Tucker (KKT) conditions to solve them analytically. We include the solutions below for completeness. In practice, we use a convex solver to solve the problems, which avoids handling the piecewise function.
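As a rough illustration of how a KKT-based solution with a piecewise form can arise, the sketch below assumes a problem of the following shape, which is not restated in this appendix and may differ from Equation 5: maximize $\sum_l \log p_l$ subject to $\sum_l \alpha_l p_l \le B$ and $0 < p_l \le 1$, where $\alpha_l$ is the parameter count of layer $l$ and $B$ is a parameter budget. Under that assumption, stationarity gives $p_l = \min(1, 1/(\mu \alpha_l))$, with the multiplier $\mu$ chosen so that the budget is met. The sketch finds $\mu$ by bisection instead of calling a convex solver. The function name and the example numbers are hypothetical.

```python
import numpy as np

def kkt_layer_densities(alphas, budget, iterations=100):
    """Solve max sum(log p) s.t. sum(alpha * p) <= budget, 0 < p <= 1.

    Assumed problem form (see the lead-in). Uses p_l = min(1, 1 / (mu * alpha_l))
    and bisection on the multiplier mu.
    """
    alphas = np.asarray(alphas, dtype=float)
    if budget >= alphas.sum():
        return np.ones_like(alphas)  # budget is not binding; keep everything

    def densities(mu):
        return np.minimum(1.0, 1.0 / (mu * alphas))

    # At lo every layer is fully dense (usage >= budget);
    # at hi the usage is at most len(alphas) / hi = budget.
    lo = 1.0 / alphas.max()
    hi = len(alphas) / budget
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if np.sum(alphas * densities(mid)) > budget:
            lo = mid  # too many parameters kept: increase mu
        else:
            hi = mid
    return densities(hi)

# Hypothetical three-layer example with a budget of 3000 parameters.
p = kkt_layer_densities([100, 1000, 10000], budget=3000)
print(p)
```

In this example the two small layers stay fully dense and only the largest layer is pruned, which reflects the clipping at $p_l = 1$ that makes the analytical solution piecewise.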

B.1 SOLUTION FOR EQUATION 5

where $\mu$ satisfies:

B.2 SOLUTION FOR EQUATION 6

where $\mu_1$, $\mu_2$ satisfy:

D CHANNEL WIDTH COMPARISON

We also include a comparison of the channel width between the baseline EfficientNetB0 and PreConfig EfficientNetB0 in Figure 8.

E EXPERIMENT DETAILS

E.1 IMPLEMENTATION

We adapt model implementations of ResNet, ShuffleNet, and MobileNetV2 from imgclsmob². The implementations of SynFlow, SNIP, and GraSP are based on the codebase of SynFlow³.

E.2 HYPERPARAMETERS

Here we provide the hyperparameters used in training all models in Table 3. No AutoAugment, label smoothing, or stochastic depth is used during training. All the CIFAR-10 models are trained with the same hyperparameter settings.

