A UNIFIED VIEW OF FINDING AND TRANSFORMING WINNING LOTTERY TICKETS

Abstract

While over-parameterized deep neural networks obtain prominent results on various machine learning tasks, their superfluous parameters usually make model training and inference notoriously inefficient. The Lottery Ticket Hypothesis (LTH) addresses this issue from a novel perspective: it articulates that sparse, admirable subnetworks always exist in a randomly initialized dense network and can be found by an iterative pruning strategy. The Dual Lottery Ticket Hypothesis (DLTH) further investigates sparse network training from a complementary view. Concretely, it introduces a gradually increasing regularization term to transform a dense network into an ultra-light subnetwork without sacrificing learning capacity. After revisiting the success of LTH and DLTH, we unify these two research lines by coupling the stability of iterative pruning with the excellent performance of increasing regularization, resulting in two new algorithms (UniLTH and UniDLTH) for finding and transforming winning tickets, respectively. Unlike LTH, which uses no regularization, or DLTH, which applies regularization throughout training, our methods first train the network without any regularization force until the model reaches a certain point (i.e., the validation loss does not decrease for several epochs), and then employ increasing regularization for information extrusion while iteratively performing magnitude pruning until the end. We theoretically prove that the early stopping mechanism acts analogously to regularization and helps the optimization trajectory stop at a better point in parameter space than regularization alone. This not only prevents the parameters from being excessively skewed toward the training distribution (over-fitting), but also better stimulates the network's potential to obtain more powerful subnetworks. Extensive experiments show the superiority of our methods in terms of accuracy and sparsity.
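The two-phase schedule described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: `model_step` and `val_loss` are hypothetical callbacks standing in for one training epoch and a validation pass, and magnitude pruning (omitted here) would interleave with the regularization updates in phase two.

```python
def train_unified(model_step, val_loss, max_epochs=100, patience=5, reg_step=1e-4):
    """Sketch of the two-phase schedule: phase 1 trains with no
    regularization until validation loss stops improving for
    `patience` epochs; phase 2 then grows the penalty each epoch
    for information extrusion (pruning steps omitted)."""
    reg = 0.0                           # regularization strength
    best, stall = float("inf"), 0
    phase2 = False
    for epoch in range(max_epochs):
        model_step(reg)                 # one training epoch under penalty `reg`
        loss = val_loss()
        if not phase2:
            if loss < best - 1e-8:
                best, stall = loss, 0
            else:
                stall += 1
                if stall >= patience:   # early-stopping point reached
                    phase2 = True
        else:
            reg += reg_step             # gradually increase regularization
    return reg
```

The early-stopping trigger marks the switch point; afterwards the penalty grows monotonically instead of being applied across the whole of training as in DLTH.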

1. INTRODUCTION

Exactly as the saying goes, you can't have your cake and eat it: though over-parameterized deep neural networks achieve encouraging performance over widespread machine learning tasks Zagoruyko & Komodakis (2016); Arora et al. (2019); Devlin et al. (2018); Brown et al. (2020), they usually suffer notoriously high computational costs and necessitate unaffordable storage resources Cheng et al. (2017); Deng et al. (2020); Wang et al. (2019a). To alleviate this issue, a stream of pruning approaches Han et al. (2015); Liu et al. (2017); He et al. (2017); Gale et al. (2019); Ding et al. (2019) tries to uncover a sparse subnetwork that can retain the learning capacity of the original dense network as much as possible. While these algorithms seek to reach a preferable trade-off between performance and sparsity, they fall short of jointly optimizing both. Recently, the Lottery Ticket Hypothesis (LTH) has provided a novel perspective on sparse network training Frankle & Carbin (2018). It articulates that there consistently exist sparse, high-performance subnetworks in a randomly initialized dense network, like winning tickets in a lottery pool. To identify such admirable sparse subnetworks (i.e., winning tickets), LTH trains an over-parameterized neural network from scratch and prunes its smallest-magnitude weights iteratively, a procedure known as iterative pruning. This repeated pruning method, as opposed to one-shot pruning, allows us to learn faster and achieve higher test accuracy at smaller network sizes. LTH innovatively exposes the internal relationships between a randomly initialized network and its corresponding subnetworks, inspiring a series of follow-ups that explore various iterative pruning and rewind criteria for training lightweight networks Morcos et al. (2019); Maene et al. (2021); Chen et al. (2021); Frankle et al. (2019; 2020); Ding et al. (2021); Ma et al. (2021); Chen et al. (2022).
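The iterative magnitude pruning loop underlying LTH can be sketched as follows. This is a minimal illustration over a flat weight list with our own function names, not the original implementation; in LTH proper, the surviving weights are rewound to their initial values and retrained between rounds.

```python
def magnitude_prune(weights, mask, prune_frac):
    """One round of magnitude pruning: zero out the smallest-magnitude
    fraction of the currently surviving weights."""
    alive = sorted((abs(w), i) for i, (w, m) in enumerate(zip(weights, mask)) if m)
    n_prune = int(len(alive) * prune_frac)
    new_mask = list(mask)
    for _, i in alive[:n_prune]:
        new_mask[i] = 0
    return new_mask

def iterative_prune(weights, rounds, prune_frac):
    """Repeated pruning, as opposed to one-shot pruning: remove a
    fraction of the remaining weights each round."""
    mask = [1] * len(weights)
    for _ in range(rounds):
        mask = magnitude_prune(weights, mask, prune_frac)
        # (in LTH, surviving weights are rewound and retrained here)
    return mask
```

Pruning a fixed fraction of the *remaining* weights per round, rather than the target fraction all at once, is what distinguishes iterative from one-shot pruning.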

