WHY LOTTERY TICKET WINS? A THEORETICAL PERSPECTIVE OF SAMPLE COMPLEXITY ON SPARSE NEURAL NETWORKS

Anonymous authors
Paper under double-blind review

Abstract

The lottery ticket hypothesis (LTH) (Frankle & Carbin, 2019) states that learning on a properly pruned network (the winning ticket) improves test accuracy over the original unpruned network. Although LTH has been justified empirically in a broad range of deep neural network (DNN) applications such as computer vision and natural language processing, the theoretical validation of the improved generalization of a winning ticket remains elusive. To the best of our knowledge, our work, for the first time, characterizes the performance of training a sparse neural network by analyzing the geometric structure of the objective function and the sample complexity needed to achieve zero generalization error. We show that the convex region near a desirable model with guaranteed generalization enlarges as the neural network model is pruned, indicating the structural importance of a winning ticket. Moreover, when the algorithm for training a sparse neural network is specified as an (accelerated) stochastic gradient descent algorithm, we show theoretically that the number of samples required to achieve zero generalization error is proportional to the number of non-pruned weights in the hidden layer. With a fixed number of samples, training a pruned neural network enjoys a faster convergence rate to the desirable model than training the original unpruned one, providing a formal justification of the improved generalization of the winning ticket. Our theoretical results are derived for learning a sparse neural network with one hidden layer, and experimental results are further provided to justify the implications in pruning multi-layer neural networks.

1. INTRODUCTION

Neural network pruning can significantly reduce the computational cost of training a model (LeCun et al., 1990; Hassibi & Stork, 1993; Dong et al., 2017; Han et al., 2015; Hu et al., 2016; Srinivas & Babu, 2015; Yang et al., 2017; Molchanov et al., 2017). The recent Lottery Ticket Hypothesis (LTH) (Frankle & Carbin, 2019) claims that a randomly initialized dense neural network always contains a so-called "winning ticket," a sub-network bundled with its corresponding initialization, such that when trained in isolation, this winning ticket achieves at least the same test accuracy as the original network within at most the same amount of training time. This so-called "improved generalization of winning tickets" is verified empirically in (Frankle & Carbin, 2019). LTH has attracted a significant amount of recent research interest (Ramanujan et al., 2020; Zhou et al., 2019; Malach et al., 2020). Despite the empirical success (Evci et al., 2020; You et al., 2019; Wang et al., 2019; Chen et al., 2020a), the theoretical justification of winning tickets remains elusive except for a few recent works. Malach et al. (2020) provide the first theoretical evidence that within a randomly initialized neural network there exists a good sub-network that can achieve the same test performance as the original network. Meanwhile, recent work (Neyshabur, 2020) trains neural networks with an ℓ1 regularization term to obtain relatively sparse networks, which perform better numerically. Arora et al. (2018) and Zhou et al. (2018) show that the expressive power of a compressed neural network is comparable to that of the original network and that both networks have the same generalization error. However, no theoretical explanation has been provided for the improved generalization of winning tickets, i.e., that winning tickets can achieve higher test accuracy after the same training time.
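The sub-network-plus-initialization pairs discussed above are typically found by iterative magnitude pruning (Frankle & Carbin, 2019): train the network, prune a fraction of the smallest-magnitude surviving weights, rewind the survivors to their initial values, and repeat. The following is a minimal sketch of that loop, not the paper's procedure; the function name `imp`, the `train_fn` placeholder (which stands in for any full training routine), and the per-round pruning fraction are all illustrative.

```python
import numpy as np

def imp(init_W, train_fn, rounds=3, prune_frac=0.2):
    """Sketch of iterative magnitude pruning (Frankle & Carbin, 2019).

    train_fn(W, mask) -> trained weights of the same shape as W. Each
    round prunes the prune_frac smallest-magnitude surviving weights,
    then rewinds the survivors to init_W, yielding the candidate
    "winning ticket" (sub-network + original initialization)."""
    mask = np.ones_like(init_W)
    for _ in range(rounds):
        W = train_fn(init_W * mask, mask)   # train the current sub-network
        # threshold at the prune_frac quantile of surviving magnitudes
        thresh = np.quantile(np.abs(W[mask == 1]), prune_frac)
        mask = mask * (np.abs(W) > thresh)  # prune the smallest survivors
    return init_W * mask, mask              # rewound ticket and its mask
```

After `rounds` rounds, roughly a `(1 - prune_frac) ** rounds` fraction of the weights survives, and the returned ticket keeps the original initialization on exactly those weights.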
Contributions: This paper provides the first systematic analysis of learning sparse neural networks with a finite number of training samples. Our analytical results also justify the LTH from the perspective of sample complexity. In particular, we provide the first theoretical justification of the improved generalization of winning tickets. Specific contributions include:

1. Sparse neural network learning via accelerated gradient descent (AGD): We propose an AGD algorithm with tensor initialization to learn the sparse model from training samples. Considering the scenario where there exists a ground-truth sparse one-hidden-layer neural network, we prove that our algorithm converges linearly to the ground-truth model, which has guaranteed generalization on testing data.

2. First sample complexity analysis for pruned networks: We characterize the required number of samples for successful convergence, termed the sample complexity. Our sample complexity depends linearly on the number of non-pruned weights of the sparse network and is a significant reduction from directly applying conventional complexity bounds in (Zhong et al., 2017; Zhang et al., 2020a; c).

3. Characterization of the benign optimization landscape of pruned networks: We show analytically that the empirical risk function has an enlarged convex region near the ground-truth model when the neural network is sparse, justifying the importance of a good sub-network (i.e., the winning ticket).

4. Characterization of the improved generalization of winning tickets: We show that gradient-descent methods converge faster to the ground-truth model when the neural network is properly pruned; equivalently, learning on a pruned network returns a model closer to the ground-truth model within the same number of iterations, indicating the improved generalization of winning tickets.
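To make the training setup in contributions 1 and 4 concrete, the sketch below runs momentum-accelerated gradient descent on a one-hidden-layer ReLU network whose pruned weights are held at zero by a fixed binary mask. This is a simplified illustration, not the paper's algorithm: it uses random rather than tensor initialization, heavy-ball momentum as the acceleration, average pooling of the hidden units, and squared loss; the function name `masked_agd` and all hyperparameters are assumptions for the example.

```python
import numpy as np

def masked_agd(X, y, mask, lr=0.2, beta=0.9, iters=800, seed=0):
    """Heavy-ball accelerated GD on f(x) = mean_j relu(w_j^T x), with the
    pruned entries of W (where mask == 0) fixed at zero throughout."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    K = mask.shape[0]                          # number of hidden units
    W = rng.normal(scale=0.5, size=mask.shape) * mask
    W_prev = W.copy()
    losses = []
    for _ in range(iters):
        V = W + beta * (W - W_prev)            # momentum extrapolation
        H = np.maximum(X @ V.T, 0.0)           # hidden activations, n x K
        r = H.mean(axis=1) - y                 # residuals of the pooled output
        losses.append(0.5 * np.mean(r ** 2))
        act = (X @ V.T > 0).astype(float)      # ReLU derivative, n x K
        G = ((X * r[:, None]).T @ act).T / (n * K)  # gradient w.r.t. V
        W_prev, W = W, (V - lr * G) * mask     # step only on the sub-network
    return W, losses
```

Multiplying the update by `mask` restricts learning to the non-pruned weights, so the iterate never leaves the sub-network; this is the setting in which the analysis predicts faster convergence as the mask becomes sparser.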

1.1. RELATED WORK

Winning tickets. Frankle & Carbin (2019) propose an Iterative Magnitude Pruning (IMP) algorithm to obtain the proper sub-network and initialization. IMP and its variants (Frankle et al., 2019a; Renda et al., 2019) succeed in deeper networks like ResNet-50 and the Bidirectional Encoder Representations from Transformers (BERT) network (Chen et al., 2020b). Frankle et al. (2019b) show that IMP succeeds in finding the "winning ticket" if the ticket is stable to stochastic gradient descent noise. In parallel, Liu et al. (2018) show numerically that the "winning ticket" initialization does not improve over a random initialization once the correct sub-networks are found, suggesting that the benefit of the "winning ticket" mainly comes from the sub-network structure.

Over-parameterized model. When the number of weights in a neural network is much larger than the number of training samples, the landscape of the objective function of the learning problem has no spurious local minima, and first-order algorithms converge to one of the global optima (Livni et al., 2014; Zhang et al., 2016; Soltanolkotabi et al., 2018). However, a global optimum is not guaranteed to generalize well on testing data (Yehudai & Shamir, 2019; Zhang et al., 2016).

Model estimation & generalization analysis. This framework assumes a ground-truth model that maps the input data to the output labels, and the learning objective is to estimate the ground-truth model, which has a generalization guarantee on testing data. The learning problem has intractably many spurious local minima even for one-hidden-layer neural networks (Shamir, 2018; Safran & Shamir, 2018; Zhang et al., 2016). Assuming an infinite number of training samples, (Brutzkus & Globerson, 2017; Du et al., 2018; Tian, 2017) develop learning methods to estimate the ground-truth model.
(Fu et al., 2018; Zhong et al., 2017; Zhang et al., 2020a; c) extend to the practical case of a finite number of samples and characterize the

