STRUCTURED PRUNING OF CNNS AT INITIALIZATION

Abstract

Pruning-at-initialization (PAI) methods can prune the individual weights of a convolutional neural network (CNN) before training, thus avoiding expensive fine-tuning or retraining of the pruned model. While PAI shows promising results in reducing model size, the pruned model still requires unstructured sparse matrix computation, making it difficult to achieve a real speedup. In this work, we show both theoretically and empirically that the accuracy of CNN models pruned by a PAI method depends on the layer-wise density (i.e., the fraction of the remaining parameters in each layer), irrespective of the granularity of pruning. We formulate PAI as a convex optimization problem based on an expectation-based proxy for model accuracy, which can produce the optimal allocation of the layer-wise densities with respect to the proxy model. Using our formulation, we further propose a structured and hardware-friendly PAI method, named PreCrop, to prune or reconfigure CNNs in the channel dimension. Our empirical results show that PreCrop achieves a higher accuracy than existing PAI methods on several popular CNN architectures, including ResNet, MobileNetV2, and EfficientNet, on both CIFAR-10 and ImageNet. Notably, PreCrop achieves an accuracy improvement of up to 2.7% over a state-of-the-art PAI algorithm when pruning MobileNetV2 on ImageNet. PreCrop also improves the accuracy of EfficientNetB0 by 0.3% on ImageNet with only 80% of the parameters and the same FLOPs.

1. INTRODUCTION

Convolutional neural networks (CNNs) have achieved state-of-the-art accuracy in a wide range of machine learning (ML) applications. However, the massive computational and memory requirements of CNNs remain a major barrier to more widespread deployment on resource-limited edge and mobile devices. This challenge has motivated a large and active body of research on CNN compression, which attempts to simplify the original model without significantly compromising its accuracy. Weight pruning [15, 7, 17, 4, 8] has been extensively explored to reduce the computational and memory demands of CNNs. Existing methods create a sparse CNN model by iteratively removing ineffective weights/activations and training the resulting sparse model. Such an iterative pruning approach usually incurs the least accuracy degradation, but at the cost of a more computationally expensive training procedure. Moreover, training-based pruning methods introduce additional hyperparameters, such as the learning rate for fine-tuning and the number of epochs before rewinding [20], which make the pruning process even more complicated and less reproducible.

To minimize the cost of pruning, a new line of research proposes pruning-at-initialization (PAI) [16, 27, 24], which identifies and removes unimportant weights in a CNN before training. Similar to training-based pruning, PAI assigns an importance score to each individual weight and retains only a subset of them by maximizing the sum of the importance scores of all remaining weights. The compressed model is then trained using the same hyperparameters (e.g., learning rate and the number of epochs) as the baseline model. Thus, the pruning and training of CNNs are cleanly decoupled, greatly reducing the complexity of obtaining a pruned model.
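The score-and-keep mechanism common to PAI methods can be sketched as follows. The magnitude-based score below is an illustrative stand-in for method-specific saliencies (e.g., SNIP or SynFlow scores); the function names and shapes are our own, not from any particular implementation.

```python
import numpy as np

def prune_at_init(weights, density):
    """Keep the top `density` fraction of weights, ranked by importance score.

    `weights` is a list of per-layer weight arrays at random initialization.
    Plain magnitude is used as the score here purely for illustration;
    real PAI methods (SNIP, GraSP, SynFlow) compute gradient-based scores.
    Returns one binary mask per layer.
    """
    scores = np.concatenate([np.abs(w).ravel() for w in weights])
    k = int(density * scores.size)
    # Global threshold chosen so that (approximately) k weights survive.
    threshold = np.sort(scores)[::-1][k - 1]
    return [(np.abs(w) >= threshold).astype(w.dtype) for w in weights]

rng = np.random.default_rng(0)
layers = [rng.standard_normal((16, 8)), rng.standard_normal((32, 16))]
masks = prune_at_init(layers, density=0.2)
kept = sum(int(m.sum()) for m in masks)
total = sum(m.size for m in masks)
print(kept / total)  # close to 0.2
```

Note that the surviving weights form no particular structure: each layer's mask is an irregular scatter, which is exactly the sparsity pattern that makes real speedups hard to obtain on dense-compute hardware.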
Currently, SynFlow [24] is considered the state-of-the-art PAI technique -- it eliminates the need for data during pruning as required in prior art [16, 27] and achieves a higher accuracy with the same compression ratio. However, existing PAI methods mostly focus on fine-grained weight pruning, which removes individual weights from the CNN model without preserving any structure. As a result, both inference and training of the pruned model require sparse matrix computation, which is challenging to accelerate on commercially available ML hardware that is optimized for dense computation (e.g., GPUs and TPUs [14]). According to a recent study [6], even with the NVIDIA cuSPARSE library, one can only achieve a meaningful speedup for sparse matrix multiplications on GPUs when the sparsity is over 98%. In practice, it is difficult to compress modern CNNs by more than 50× without a drastic degradation in accuracy [2]. Therefore, structured pruning patterns (e.g., pruning weights for an entire output channel) are preferred to enable practical memory and computational savings by avoiding irregular sparse storage and computation.

In this work, we propose novel structured PAI techniques and demonstrate that they can achieve the same level of accuracy as the unstructured methods. We first introduce synaptic expectation (SynExp), a new proxy metric for accuracy, which is defined to be the expected sum of the importance scores of all the individual weights in the network. SynExp is invariant to weight shuffling and reinitialization, thus addressing some of the deficiencies of the fine-grained PAI approaches found in recent studies [22, 5]. We also show that SynExp does not vary as long as the layer-wise density remains the same, irrespective of the granularity of pruning. Based on this key observation, we formulate an optimization problem that maximizes SynExp to determine the layer-wise pruning ratios, subject to model size and/or FLOPs constraints.
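The constrained-optimization formulation can be illustrated with a toy allocator. The concave per-layer objective below (a weighted log of the density) is our illustrative surrogate, not the paper's exact SynExp expression; it is chosen only because its KKT conditions admit a simple closed form, d_l = min(1, c_l / (mu * n_l)), with the multiplier mu found by bisection on the parameter budget.

```python
import numpy as np

def allocate_densities(param_counts, layer_scores, budget_fraction):
    """Toy layer-wise density allocation under a parameter budget.

    Solves   max  sum_l c_l * log(d_l)
             s.t. sum_l n_l * d_l <= B,   0 < d_l <= 1
    where n_l is layer l's parameter count and c_l a per-layer score.
    This is an assumed concave stand-in for the SynExp objective, used
    only to illustrate the convex formulation and its water-filling-style
    solution: larger layers receive lower densities.
    """
    n = np.asarray(param_counts, dtype=float)
    c = np.asarray(layer_scores, dtype=float)
    B = budget_fraction * n.sum()
    lo, hi = 1e-12, 1e12  # bracket for the KKT multiplier mu
    for _ in range(200):
        mu = np.sqrt(lo * hi)  # geometric bisection
        d = np.minimum(1.0, c / (mu * n))
        if (n * d).sum() > B:
            lo = mu  # over budget: increase mu to shrink densities
        else:
            hi = mu
    return np.minimum(1.0, c / (np.sqrt(lo * hi) * n))

counts = [10_000, 100_000, 1_000_000]
d = allocate_densities(counts, [1.0, 1.0, 1.0], budget_fraction=0.3)
```

With equal scores, the small layers saturate at density 1 while the largest layer absorbs the remaining budget, mirroring the empirical tendency of PAI methods to prune large layers more aggressively.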
We then propose PreCrop, a structured PAI method that prunes CNN models at the channel level to achieve the target layer-wise density determined by the SynExp optimization. PreCrop can effectively reduce the model size and computational cost without loss of accuracy compared to existing fine-grained PAI methods. Beyond channel-level pruning, we further propose PreConfig, which reconfigures the width dimension of a CNN to achieve a better accuracy-complexity trade-off at almost zero computational cost. Our empirical results show that the model after PreConfig can achieve higher accuracy with fewer parameters and FLOPs than the baseline for a variety of modern CNN architectures. We summarize our contributions as follows:

• We propose to use SynExp as a proxy for accuracy and formulate PAI as an optimization problem that maximizes SynExp under model size and/or FLOPs constraints. We also show that the accuracy of the CNN model pruned by solving the constrained optimization is independent of the pruning granularity.

• We introduce PreCrop, a channel-level structured pruning technique that builds on the proposed SynExp optimization. Our experiments show that CNN models pruned by PreCrop achieve a similar or better accuracy compared to the state-of-the-art unstructured PAI approaches. Compared to SynFlow, PreCrop achieves 2.7% and 0.9% higher accuracy on MobileNetV2 and EfficientNet, respectively, on ImageNet with fewer parameters and FLOPs.

• We show that PreConfig can be used to optimize the width of each layer in the network at almost zero computational cost (e.g., within one second on a CPU). Notably, PreConfig can effectively optimize the structure of EfficientNet and MobileNetV2, increasing the accuracy by 0.3% on ImageNet while using 20% fewer parameters and the same FLOPs.
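The step from optimized layer-wise densities to concrete channel counts can be sketched as below. Because a convolution's parameter count scales with both its own output channels and the previous layer's output channels, keeping a sqrt(d) fraction of the channels in each dimension roughly preserves a parameter density of d; this sqrt heuristic is our illustrative assumption, not the paper's exact rounding rule.

```python
import math

def precrop_widths(channel_counts, densities):
    """Sketch of PreCrop-style channel cropping.

    Turns per-layer target densities (from the SynExp optimization) into
    concrete output-channel counts. Keeping sqrt(d) of the channels in
    both the input and output dimensions of a conv layer scales its
    weight tensor by roughly d. The sqrt heuristic and rounding are
    assumptions for illustration only.
    """
    return [max(1, round(c * math.sqrt(d)))
            for c, d in zip(channel_counts, densities)]

widths = precrop_widths([64, 128, 256], [1.0, 0.5, 0.25])
print(widths)  # [64, 91, 128]
```

Unlike an unstructured mask, the result is simply a narrower dense network, so it trains and runs with ordinary dense kernels and needs no sparse-matrix support.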

2. RELATED WORK

Model Compression in General can reduce the computational cost of large networks to ease their deployment on resource-constrained devices. Besides pruning, quantization [3, 30, 13], neural architecture search (NAS) [31, 23], and distillation [12, 28] are also commonly used to improve the efficiency of the model.

Training-Based Pruning uses various heuristic criteria to prune unimportant weights. These methods typically employ an iterative train-prune-retrain process where the pruning stage is intertwined with the training stage, which may increase the overall training cost several fold. Existing training-based pruning methods can be either unstructured [7, 15] or structured [11, 18], depending on the granularity and regularity of the pruning scheme. Training-based unstructured pruning usually achieves better accuracy given the same model size budget, while structured pruning can achieve a more practical speedup and compression without special support from custom hardware.

(Unstructured) Pruning-at-Initialization (PAI) [16, 27, 24] methods provide a promising approach to mitigating the high cost of training-based pruning. They can identify and prune unimportant weights right after initialization and before training starts. Related to these efforts, authors of [5]

