THEORETICAL CHARACTERIZATION OF HOW NEURAL NETWORK PRUNING AFFECTS ITS GENERALIZATION

Abstract

It has been observed in practice that applying pruning-at-initialization methods to neural networks and training the sparsified networks can not only retain the test performance of the original dense models, but sometimes even slightly boost generalization. Theoretical understanding of such experimental observations is yet to be developed. This work makes a first attempt to study how different pruning fractions affect the model's gradient descent dynamics and generalization. Specifically, this work considers a classification task for overparameterized two-layer neural networks, where the network is randomly pruned at initialization according to different fractions. It is shown that as long as the pruning fraction is below a certain threshold, gradient descent can drive the training loss toward zero and the network exhibits good generalization performance. More surprisingly, the generalization bound improves as the pruning fraction grows. To complement this positive result, this work further shows a negative result: there exists a large pruning fraction such that, while gradient descent is still able to drive the training loss toward zero (by memorizing noise), the generalization performance is no better than random guessing. This further suggests that pruning can change the feature learning process, which leads to the performance drop of the pruned neural network. To our knowledge, this is the first generalization result for pruned neural networks, suggesting that pruning can improve a neural network's generalization.

1. INTRODUCTION

Neural network pruning dates back to the early stages of neural network development (LeCun et al., 1989). Since then, many works have focused on using pruning as a model compression technique, e.g., (Molchanov et al., 2019; Luo & Wu, 2017; Ye et al., 2020; Yang et al., 2021). However, all of these works pruned neural networks after training to reduce inference time, and thus the efficiency gain from pruning cannot be directly transferred to the training phase. Only recently did Frankle & Carbin (2018) show a surprising phenomenon: a neural network pruned at initialization can be trained to achieve performance competitive with the dense model. They called this phenomenon the lottery ticket hypothesis, which states that there exists a sparse subnetwork inside a dense network at random initialization such that, when trained in isolation, it can match the test accuracy of the original dense network after training for at most the same number of iterations. On the other hand, the algorithm Frankle & Carbin (2018) proposed to find the lottery ticket requires many rounds of pruning and retraining, which is computationally expensive. Many subsequent works focused on developing new methods to reduce the cost of finding such a network at initialization (Lee et al., 2018; Wang et al., 2019; Tanaka et al., 2020; Liu & Zenke, 2020; Chen et al., 2021b). A further investigation by Frankle et al. (2020) showed that some of these methods merely discover the layer-wise pruning ratio rather than the sparsity pattern.

The discovery of the lottery ticket hypothesis sparked further interest in understanding this phenomenon. Another line of research focused on finding a subnetwork inside a dense network at random initialization such that the subnetwork achieves good performance (Zhou et al., 2019; Ramanujan et al., 2020). Shortly after, Malach et al. (2020) formalized this phenomenon as the strong lottery ticket hypothesis: under certain assumptions on the weight initialization distribution, a sufficiently overparameterized neural network at initialization contains a subnetwork with roughly the same accuracy as the target network. Later, Pensia et al. (2020) improved the overparameterization requirements, and Sreenivasan et al. (2021) showed that this type of result holds even with binary weights. Unsurprisingly, as pointed out by Malach et al. (2020), finding such a subnetwork is computationally hard. Nonetheless, all of these analyses take a function approximation perspective, and none of the aforementioned works considered the effect of pruning on gradient descent dynamics, let alone the neural network's generalization. Interestingly, empirical studies have found that sparsity can further improve generalization in certain scenarios (Chen et al., 2021a; Ding et al., 2021; He et al., 2022). There have also been empirical works showing that random pruning can be effective (Frankle et al., 2020; Su et al., 2020; Liu et al., 2021b). However, theoretical understanding of such benefits of pruning is still limited.

In this work, we take a first step toward answering the following important open question from a theoretical perspective: How does the pruning fraction affect the training dynamics and the model's generalization, if the model is pruned at initialization and trained by gradient descent?

We study this question using random pruning. We consider a classification task where the input data consist of a class-dependent sparse signal and random noise, and we analyze the training dynamics of a two-layer convolutional neural network pruned at initialization. Specifically, this work makes the following contributions:

• Mild pruning.
We prove that there indeed exists a range of small pruning fractions in which the generalization error bound improves as the pruning fraction grows. In this regime, the signal in the feature is well preserved, and because pruning purifies the feature, the effect of noise is reduced. We provide a detailed explanation in Section 3. To our knowledge, this is the first theoretical result on generalization for pruned neural networks, and it suggests that pruning can improve generalization in some settings. We further conduct experiments to verify our results.

• Over pruning. To complement the above positive result, we also show a negative result: if the pruning fraction is larger than a certain threshold, the generalization performance is no better than simple random guessing, although gradient descent is still able to drive the training loss toward zero. This further suggests that the performance drop of the pruned neural network is not solely caused by the pruned network's own lack of trainability or expressiveness, but also by the change of gradient descent dynamics due to pruning.

• Technically, we develop a novel analysis to bound the effect of pruning on the weight-noise and weight-signal correlations. Further, in contrast to many previous works that considered only the binary case, our analysis handles multi-class classification with the general cross-entropy loss. Here, a key technical development is a gradient upper bound for the multi-class cross-entropy loss, which might be of independent interest.

Pictorially, our results are summarized in Figure 1. We point out that the neural network training we consider is in the feature learning regime, where the weight parameters can move far away from their initialization. This is fundamentally different from the popular neural tangent kernel regime, where the neural network essentially behaves like its linearization.

1.1 RELATED WORKS
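Before surveying prior work, the random pruning-at-initialization scheme discussed throughout this section can be sketched in a few lines. The snippet below is purely illustrative, not the paper's exact architecture or initialization: it draws a fixed random binary mask at a given pruning fraction, applies it to a randomly initialized weight matrix, and shows how gradients would be masked so that pruned weights stay zero during training. The function name, shapes, and Gaussian initialization are our own assumptions.

```python
import numpy as np

def prune_at_init(W, fraction, rng):
    """Randomly zero out approximately `fraction` of the entries of W.

    Returns the pruned weights and the binary mask; the mask is drawn
    once at initialization and kept fixed for all of training.
    """
    mask = (rng.random(W.shape) >= fraction).astype(W.dtype)
    return W * mask, mask

rng = np.random.default_rng(0)
d, m = 20, 50                               # input dim and hidden width (illustrative)
W1 = rng.normal(0.0, 1.0 / np.sqrt(d), size=(m, d))   # random initialization
W1_pruned, mask = prune_at_init(W1, fraction=0.4, rng=rng)

# During training, the gradient is masked so pruned weights remain zero:
#   W1_pruned -= lr * (grad_W1 * mask)
```

A larger `fraction` corresponds to the "over pruning" regime above; a small `fraction` corresponds to "mild pruning", where the analysis shows generalization can actually improve.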



The Lottery Ticket Hypothesis and Sparse Training. The discovery of the lottery ticket hypothesis (Frankle & Carbin, 2018) has inspired further investigation and applications. One line of research has focused on developing computationally efficient methods to enable sparse training: static sparse training methods aim to identify a sparse mask at initialization based on different criteria, such as SNIP (loss-based) (Lee et al., 2018), GraSP (gradient-based) (Wang et al., 2019), SynFlow (synaptic-strength-based) (Tanaka et al., 2020), a neural-tangent-kernel-based method (Liu & Zenke, 2020), and one-shot pruning (Chen et al., 2021b). Random pruning has also been considered in static sparse training, including uniform pruning (Mariet & Sra, 2015; He et al., 2017; Gale et al., 2019; Suau et al., 2018), non-uniform pruning (Mocanu et al., 2016), expander-graph-related techniques (Prabhu et al., 2018; Kepner & Robinett, 2019), Erdős-Rényi (Mocanu et al., 2018), and Erdős-Rényi-Kernel (Evci et al., 2020). On the other hand, dynamic sparse training allows the

