A UNIFIED VIEW OF FINDING AND TRANSFORMING WINNING LOTTERY TICKETS

Abstract

While over-parameterized deep neural networks obtain prominent results on various machine learning tasks, their superfluous parameters usually make model training and inference notoriously inefficient. The Lottery Ticket Hypothesis (LTH) addresses this issue from a novel perspective: it articulates that there always exist sparse and admirable subnetworks in a randomly initialized dense network, which can be realized by an iterative pruning strategy. The Dual Lottery Ticket Hypothesis (DLTH) further investigates sparse network training from a complementary view. Concretely, it introduces a gradually increased regularization term to transform a dense network into an ultra-light subnetwork without sacrificing learning capacity. After revisiting the success of LTH and DLTH, we unify these two research lines by coupling the stability of iterative pruning with the excellent performance of increased regularization, resulting in two new algorithms (UniLTH and UniDLTH) for finding and transforming winning tickets, respectively. Unlike either LTH, which uses no regularization, or DLTH, which applies regularization across the whole of training, our methods first train the network without any regularization force until the model reaches a certain point (i.e., the validation loss does not decrease for several epochs), and then employ increased regularization for information extrusion while iteratively performing magnitude pruning until the end. We theoretically prove that the early stopping mechanism acts analogously to regularization and helps the optimization trajectory stop at a better point in weight space than regularization does. This not only prevents the parameters from being excessively skewed toward the training distribution (over-fitting), but also better stimulates the network's potential to obtain more powerful subnetworks. Extensive experiments show the superiority of our methods in terms of accuracy and sparsity.

1. INTRODUCTION

As the saying goes, you can't have your cake and eat it too: though over-parameterized deep neural networks achieve encouraging performance on a wide range of machine learning tasks Zagoruyko & Komodakis (2016); Arora et al. (2019); Devlin et al. (2018); Brown et al. (2020), they usually suffer notoriously high computational costs and necessitate unaffordable storage resources Cheng et al. (2017); Deng et al. (2020); Wang et al. (2019a). To alleviate this issue, a stream of pruning approaches Han et al. (2015); Liu et al. (2017); He et al. (2017); Gale et al. (2019); Ding et al. (2019) tries to uncover a sparse subnetwork that retains the learning capacity of the original dense network as much as possible. While these algorithms seek a preferable trade-off between performance and sparsity, they fall short of jointly optimizing both. Recently, the Lottery Ticket Hypothesis (LTH) has provided a novel perspective on sparse network training Frankle & Carbin (2018). It articulates that there consistently exist sparse, high-performance subnetworks in a randomly initialized dense network, like winning tickets in a lottery pool. To identify such admirable sparse subnetworks (i.e., winning tickets), LTH trains an over-parameterized neural network from scratch and prunes its smallest-magnitude weights iteratively, a procedure known as iterative pruning. This repeated pruning, as opposed to one-shot pruning, allows the network to learn faster and achieve higher test accuracy at a smaller size. LTH innovatively exposes the internal relationship between a randomly initialized network and its subnetworks, inspiring a series of follow-ups that explore various iterative pruning and rewind criteria for training light-weight networks Morcos et al. (2019); Maene et al. (2021); Chen et al. (2021); Frankle et al. (2019; 2020); Ding et al. (2021); Ma et al. (2021); Chen et al. (2022). Though promising, LTH concentrates solely on identifying one sparse subnetwork by iterative pruning, which is not universal for either practical usage or investigating the relationship between a dense network and its subnetworks Bai et al. (2022). Hence, Bai et al. (2022) take a complementary direction and propose the Dual Lottery Ticket Hypothesis (DLTH), which studies a randomly selected subnetwork rather than a particular one.
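To make the iterative pruning loop concrete, here is a minimal NumPy sketch (not the paper's implementation): `train_fn` is a hypothetical stand-in for a full training run, and each round prunes the smallest-magnitude surviving weights before rewinding the survivors to their initial values, as LTH prescribes.

```python
import numpy as np

def magnitude_prune(weights, mask, rate):
    """Prune `rate` of the currently surviving (unmasked) weights
    by zeroing out those with the smallest magnitudes."""
    alive = weights[mask]
    k = int(len(alive) * rate)
    if k == 0:
        return mask
    threshold = np.sort(np.abs(alive))[k - 1]
    return mask & (np.abs(weights) > threshold)

def iterative_pruning(init_weights, train_fn, rounds=3, rate=0.2):
    """LTH-style loop: train, prune the lowest-magnitude weights,
    rewind the survivors to their initial values, and repeat."""
    mask = np.ones_like(init_weights, dtype=bool)
    for _ in range(rounds):
        # `train_fn` is a placeholder for training the masked network
        trained = train_fn(init_weights * mask, mask)
        mask = magnitude_prune(trained, mask, rate)
    # rewind: surviving weights return to their initialization
    return init_weights * mask, mask
```

With a pruning rate of 0.2 per round, three rounds leave roughly 0.8³ ≈ 51% of the weights, which is the compounding behavior that lets iterative pruning reach high sparsity gently rather than in one shot.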
As a dual problem of LTH, it hypothesizes that a randomly selected subnetwork of a randomly initialized dense network can be transformed into an appropriate condition with excellent performance, analogous to turning a random lottery ticket into a winning ticket. To validate this, DLTH trains a dense network and conducts one-shot pruning with a simple yet effective strategy: it realizes the sparse subnetwork by applying a gradually increased regularization term throughout the training phase, which extrudes information from unimportant weights (those to be pruned afterward) toward the targeted sparse structure. Although this hypothesis does not provide a theoretical bound on how much information extrusion can be achieved, it does offer a novel view on harnessing regularization terms to link the dense network with hidden winning tickets. As the key to DLTH's success, the regularization term realizes information extrusion from the unimportant weights that will be masked (i.e., discarded), but it may also become its undoing. During training, the equilibrium of the network weights is determined by two forces: the loss gradient and the regularization gradient. The latter is generally kept within a small regime, as an excessive weight penalty causes the network to collapse into a suboptimal local minimum corresponding to ill-conditioned small weights LeCun et al. (2015). Applying a regularization term in the early training phase, as DLTH does, may therefore cripple model performance, since it complicates network optimization and misleads the search for a reliable equilibrium. Meanwhile, regularization-based pruning approaches (e.g., DLTH) typically perform one-shot pruning, which exacerbates the instability of sparse network training. Given the efficacy of iterative pruning in LTH, transforming random tickets into winning tickets iteratively is appealing as well.
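A toy sketch of the gradually increased regularization idea follows, assuming a plain SGD update; `loss_grad`, `keep_mask`, and the linear ramp schedule are illustrative choices, not DLTH's exact recipe. The growing L2 penalty acts only on the weights destined to be pruned, squeezing their magnitudes toward zero so that the kept weights must absorb the network's function.

```python
import numpy as np

def extrusion_step(weights, keep_mask, loss_grad, lam, lr=0.1):
    """One SGD step where a growing L2 penalty acts only on the
    weights that will later be pruned (keep_mask == False),
    mimicking 'information extrusion': the penalty gradient drives
    the doomed weights toward zero while kept weights train freely."""
    reg_grad = 2.0 * lam * weights * (~keep_mask)
    return weights - lr * (loss_grad(weights) + reg_grad)

def train_with_increasing_reg(weights, keep_mask, loss_grad,
                              steps=200, lam_max=5.0):
    """Linearly ramp the regularization coefficient from 0 to
    lam_max, rather than applying a large constant penalty."""
    for t in range(steps):
        lam = lam_max * t / steps  # gradually increased, not fixed
        weights = extrusion_step(weights, keep_mask, loss_grad, lam)
    return weights
```

The gradual ramp is the point: starting the penalty at zero lets the weights first settle under the loss gradient alone, avoiding the abrupt pull toward an ill-conditioned small-weight minimum that a large constant penalty would cause.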
In this paper, we aim to present a resilient and unified paradigm for searching winning tickets in a dense network (LTH) or transforming random tickets into winning tickets (DLTH), leading to two new pruning algorithms termed UniLTH and UniDLTH. As illustrated in Fig. 1 (b), both UniLTH and UniDLTH decouple the pruning task into two separate stages. In the first stage, the two algorithms share an identical procedure: they impose no obstacle force (regularization) when training a randomly initialized network. Once the validation loss has not decreased for several training cycles, we halt training and rewind the network parameters to several epochs earlier. We demonstrate that this early stopping strategy overcomes the instability caused by regularization, achieving similar or even better performance without compromising the network's learning potential.
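The first stage described above can be sketched as follows on a toy scalar model; `train_epoch`, `val_loss`, and the `patience`/`rewind` values are hypothetical placeholders for the actual training routine and hyperparameters.

```python
def stage_one(init_weights, train_epoch, val_loss, patience=3, rewind=2):
    """Stage 1 of UniLTH/UniDLTH as described in the text: train with
    no regularization, stop once the validation loss fails to improve
    for `patience` epochs, then rewind to `rewind` epochs earlier."""
    weights = init_weights
    history = [weights]            # per-epoch checkpoints for rewinding
    best, stall = float("inf"), 0
    while stall < patience:
        weights = train_epoch(weights)   # one unregularized epoch
        history.append(weights)
        loss = val_loss(weights)
        if loss < best - 1e-8:
            best, stall = loss, 0        # validation loss improved
        else:
            stall += 1                   # no improvement this epoch
    # rewind the parameters to several epochs before the stop point
    return history[max(0, len(history) - 1 - rewind)]
```

For example, if the training trajectory moves a scalar weight toward and past the validation optimum, the loop stops after `patience` non-improving epochs and returns a checkpoint from before the stop, which is the starting point for the second stage of increased regularization and iterative pruning.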




Figure 1: Illustration of LTH/DLTH and our UniLTH/UniDLTH. In (c), the blue/green solid contour lines denote the contours of the training/validation negative log-likelihood. Our goal is to draw the weights closer to ŵ. The black line indicates the training trajectory taken by SGD. Our algorithm rewinds the training procedure (the yellow line) and adds increased regularization (the purple line) to move toward the validation-set distribution once training reaches the early stopping threshold.

