A UNIFIED VIEW OF FINDING AND TRANSFORMING WINNING LOTTERY TICKETS

Abstract

While over-parameterized deep neural networks obtain prominent results on various machine learning tasks, their superfluous parameters usually make model training and inference notoriously inefficient. The Lottery Ticket Hypothesis (LTH) addresses this issue from a novel perspective: it articulates that there always exist sparse and admirable subnetworks in a randomly initialized dense network, which can be identified by an iterative pruning strategy. The Dual Lottery Ticket Hypothesis (DLTH) further investigates sparse network training from a complementary view. Concretely, it introduces a gradually increased regularization term to transform a dense network into an ultra-light subnetwork without sacrificing learning capacity. After revisiting the success of LTH and DLTH, we unify these two research lines by coupling the stability of iterative pruning with the excellent performance of increased regularization, resulting in two new algorithms (UniLTH and UniDLTH) for finding and transforming winning tickets, respectively. Unlike either LTH, which uses no regularization, or DLTH, which applies regularization throughout training, our methods first train the network without any regularization force until the model reaches a certain point (i.e., the validation loss does not decrease for several epochs), and then employ increased regularization for information extrusion and iteratively perform magnitude pruning until the end. We theoretically prove that the early stopping mechanism acts analogously to regularization and can help the optimization trajectory stop at a better point in parameter space than regularization does. This not only prevents the parameters from being excessively skewed toward the training distribution (over-fitting), but also better stimulates the network's potential to yield more powerful subnetworks. Extensive experiments show the superiority of our methods in terms of accuracy and sparsity.

1. INTRODUCTION

As the saying goes, you cannot have your cake and eat it too: though over-parameterized deep neural networks achieve encouraging performance across widespread machine learning tasks Zagoruyko & Komodakis (2016); Arora et al. (2019); Devlin et al. (2018); Brown et al. (2020), they usually suffer notoriously high computational costs and necessitate unaffordable storage resources Cheng et al. (2017); Deng et al. (2020); Wang et al. (2019a). To alleviate this issue, a stream of pruning approaches Han et al. (2015); Liu et al. (2017); He et al. (2017); Gale et al. (2019); Ding et al. (2019) tries to uncover a sparse subnetwork that retains the learning capacity of the original dense network as much as possible. While these algorithms seek a preferable trade-off between performance and sparsity, they fall short of jointly optimizing both. Recently, the Lottery Ticket Hypothesis (LTH) has provided a novel perspective on sparse network training Frankle & Carbin (2018). It articulates that there consistently exist sparse, high-performance subnetworks in a randomly initialized dense network, like winning tickets in a lottery pool. To identify such admirable sparse subnetworks (i.e., winning tickets), LTH trains an over-parameterized neural network from scratch and iteratively prunes its smallest-magnitude weights, a procedure known as iterative pruning. This repeated pruning, as opposed to one-shot pruning, allows us to learn faster and achieve higher test accuracy at a smaller network size. LTH innovatively exposes the internal relationships between a randomly initialized network and its subnetworks, inspiring a series of follow-ups that explore various iterative pruning and rewind criteria for training light-weight networks Morcos et al. (2019); Maene et al. (2021); Chen et al. (2021); Frankle et al. (2019; 2020); Ding et al. (2021); Ma et al. (2021); Chen et al. (2022).
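The iterative magnitude pruning underlying LTH can be sketched in a few lines. The snippet below is an illustrative sketch, not the paper's code: the function name `magnitude_mask`, the random weights, and the 20% per-round rate are our own choices.

```python
import numpy as np

def magnitude_mask(weights, prune_frac, mask):
    """Zero out the prune_frac smallest-magnitude weights among those
    still active under the current mask (global magnitude pruning)."""
    alive = np.abs(weights[mask])                    # magnitudes of survivors
    k = int(prune_frac * alive.size)                 # how many to remove this round
    if k == 0:
        return mask
    threshold = np.sort(alive)[k - 1]                # k-th smallest surviving magnitude
    return mask & (np.abs(weights) > threshold)      # drop weights at/below threshold

# One iterative-pruning sweep: each round removes 20% of the survivors.
rng = np.random.default_rng(0)
w = rng.normal(size=1000)
mask = np.ones_like(w, dtype=bool)
for _ in range(5):
    mask = magnitude_mask(w, 0.2, mask)
print(mask.mean())  # fraction of weights kept after 5 rounds (about 0.8 ** 5)
```

In the full LTH procedure each round would also retrain the masked network and rewind the surviving weights to their (early) initialization; the sketch only shows the mask arithmetic.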
Though promising, LTH concentrates solely on identifying one sparse subnetwork by iterative pruning, which is not universal for either practical usage or for investigating the relationship between a dense network and its subnetworks Bai et al. (2022). Hence, Bai et al. (2022) take a complementary direction and propose the Dual Lottery Ticket Hypothesis (DLTH), which studies a randomly selected subnetwork rather than a particular one. As a dual problem of LTH, it hypothesizes that a randomly selected subnetwork of a randomly initialized dense network can be brought into an appropriate condition with excellent performance, analogous to transforming a random lottery ticket into a winning ticket. To validate this, DLTH trains a dense network and conducts one-shot pruning with a simple yet effective strategy: it shapes the sparse subnetwork by applying a gradually increased regularization term throughout the training phase, which extrudes information from the unimportant weights (those to be pruned afterward) into the target sparse structure. Although this hypothesis provides no theoretical guarantee on how much information extrusion can be achieved, it does offer a novel view on harnessing regularization terms to link the dense network with hidden winning tickets. As the key element of DLTH's success, the regularization term realizes information extrusion from the unimportant weights that will be masked (i.e., discarded), but it may also become its undoing. During training, the equilibrium of the network weights is determined by two forces: the loss gradient force and the regularization gradient force. The latter is generally kept in a small regime, as an excessive weight penalty will cause the network to collapse into a suboptimal local minimum, corresponding to ill-conditioned small weights LeCun et al. (2015).
Using a regularization term in the early training phase, as DLTH does, may cripple model performance, since it complicates network optimization and misleads the search for a reliable equilibrium. Meanwhile, regularization-based pruning approaches (e.g., DLTH) typically perform one-shot pruning, which exacerbates the instability of sparse network training. Given the efficacy of iterative pruning in LTH, transforming random tickets into winning tickets iteratively is appealing as well. In this paper, we aim to present a resilient and unified paradigm for searching for winning tickets in a dense network (LTH) or transforming random tickets into winning tickets (DLTH), leading to two new pruning algorithms termed UniLTH and UniDLTH. As illustrated in Fig. 1(b), both UniLTH and UniDLTH decouple the pruning task into two separate stages. In the first stage, the two algorithms share an identical procedure: they do not set up any obstacle force (regularization) when training a randomly initialized network. Once the validation loss stops decreasing for several training cycles, we cut off the training and rewind the network parameters to several epochs earlier. We demonstrate that such an early stopping strategy defeats the instability caused by regularization, achieving similar or even better performance without compromising the learning potential. In the second stage, we integrate iterative pruning with increased regularization for searching or transforming winning tickets. More specifically, we alternately train the network with increased regularization and perform pruning to tilt the network distribution towards the validation distribution, until the network reaches the target sparsity. The major difference between UniLTH and UniDLTH lies in the weights to which we apply regularization (see Fig. 1(b)).
UniLTH differentiates the magnitudes of the weights by applying progressively increasing regularization to all the weights, while UniDLTH applies L2 regularization only to the unimportant weights for information extrusion. The contributions of this paper can be summarized as follows:
• We introduce a unified view of searching for and transforming winning lottery tickets. We find that removing the regularization force in the early training phase is more helpful for preserving network expressivity. This simple yet efficient winning-ticket search/transform paradigm seamlessly translates to arbitrary networks without being tied to a specific network structure.
• We provide a theoretical proof of the early stopping strategy's ability to substitute for L2 regularization, as well as an intuitive explanation of its benefits. This new training paradigm shows great promise in retrieving winning tickets from a large dense network. We also verify that networks can perform better under lower regularization pressure with a novel nonlinear regularization scheme.
• We conduct extensive experiments to evaluate our algorithms in terms of sparsity and performance. In particular, UniLTH outperforms LTH by 0.19%∼1.35% and UniDLTH surpasses DLTH by 0.29%∼0.56% in accuracy over four representative backbones on the CIFAR-10 dataset. Remarkably, we can even obtain 90% sparse winning tickets without performance degradation, including on the large-scale ImageNet dataset, which verifies the superiority of our algorithms.

2. RELATED WORK

Winning Lottery Tickets. LTH draws an analogy between uncovering admirable subnetworks in a dense network and finding winning tickets in a lottery pool. It articulates that a randomly initialized dense network contains a high-performance subnetwork which can be trained in isolation for at most the same number of iterations as the original network Frankle & Carbin (2018). In light of LTH, some follow-ups have explored the prospect of training sparse subnetworks in place of the entire models without sacrificing performance Malach et al. (2020). LTH has also been adopted to discover the presence of "supermasks", which can transform an untrained, randomly initialized network into a far higher-performance model Zhou et al. (2019). Furthermore, a stream of research has been dedicated to finding early-bird (EB) tickets You et al. (2019) (tickets which emerge in the very early training phase) to reduce computational overheads, e.g., in graph neural networks You et al. (2021); Chen et al. (2021) and in natural language processing Chen et al. (2020). To generalize lottery tickets across tasks, Zhou et al. (2019); Morcos et al. (2019) verify the performance of winning-ticket initializations generated on sufficiently large datasets, and discover that these subnetworks contain inductive biases generic to neural networks more broadly, which improve training across many tasks. In addition to the above work, DLTH considers a more general and challenging case of relating a dense network to its sparse counterparts Bai et al. (2022): it argues that a randomly selected subnetwork of a dense network can be transformed into a trainable condition and achieve admirable performance compared with the winning lottery pool Bai et al. (2022), which suggests a more adjustable way of investigating sparse neural networks.
Regularization-based Pruning.
Regularization has long been exploited for pruning deep neural networks by forcing a subset of the parameters of the original network to zero. The most popular approaches use the L0-norm or L1-norm Louizos et al. (2017); Liu et al. (2017); Ye et al. (2018). For instance, He et al. (2017) adopt LASSO to achieve channel pruning for accelerating very deep neural networks. Following this trend, the Group LASSO algorithm was further introduced to obtain a regular sparse subnetwork Lebedev & Lempitsky (2016); Wen et al. (2016). Ding et al. (2018) propose to employ different penalty factors for different weights. Among these works, the "regularization force" is kept in a small regime to avoid crippling model performance. To take advantage of the model sparseness brought by large penalty strengths, Wang et al. (2019b; 2020) present the first attempt to utilize gradually increased regularization terms to achieve high sparsity while preserving the admirable performance of the original model. However, the magnitude of the regularization term is extremely important and needs to be carefully controlled, since an excessive weight penalty may leave the model weights ill-conditioned. In this paper, we focus on a simple and efficient usage of the regularization term to discover more reliable subnetworks, i.e., winning tickets, while ensuring that the regularization term does not have a catastrophic influence on the training process.

3. METHODOLOGY

3.1. A UNIFIED APPROACH FOR SOLVING LTH AND DLTH

Here, we delineate our algorithm, which unifies the lines of LTH and DLTH to obtain winning lottery tickets. As depicted in Algo. 1, it can be flexibly adapted to the settings of LTH and DLTH with a few modifications. We denote in blue the procedures for uncovering winning tickets (marked as UniLTH), and in red the procedures for transforming a randomly selected subnetwork into winning tickets (i.e., UniDLTH). The other lines (in black) are shared by both.
UniDLTH. We first describe UniDLTH, which aims to transform randomly selected tickets into winning tickets. Unlike DLTH, we do not introduce any regularization force in the early training phase, ensuring that the model accurately learns the training data distribution. Instead, we perform training, early stopping, and rewinding in the first stage (Lines 1-6): when the validation loss does not drop for ϕ epochs, we stop training and rewind the parameters to ξ epochs earlier. We then produce the random tickets by randomly selecting a set of important parameters (the rest are unimportant) in Line 7. After rewinding all weights, we integrate the iterative pruning strategy with the gradually increased regularization (Lines 8-16) to extrude information from the unimportant weights Θun into the target sparse structure, i.e., the winning tickets Θim.
UniLTH. For UniLTH, our target is to find the winning tickets in a dense, randomly initialized neural network. The whole pipeline is shown in Algo. 1 and is nearly identical to UniDLTH, so we mainly highlight the differences. Firstly, UniLTH does not need to specify a random set of weights for information extrusion as UniDLTH does in the second stage (Line 7). Secondly, UniLTH applies regularization to all of the network's weights, while UniDLTH only decays the unimportant weights (see Line 9). Lastly, UniDLTH prunes only the unimportant weights at each iteration, whereas UniLTH has no such restriction (see Line 13).
As described, our pruning algorithm is iterative. Suppose a dense network is pruned over ψ rounds, each round cutting off p% of the weights that survived the previous round, and we target a sparse structure of r% of the original network size. In this case, p can be expressed as p% = 1 − (r%)^(1/ψ), and the number of weights removed per round decreases over iterations. Notably, the pruning rate becomes very low near the end of pruning, which is analogous to a fine-tuning technique. The rest of this section is organized as follows. In Sec. 3.2, we theoretically prove the equivalence between early stopping and regularization, and explain why we employ early stopping and rewinding rather than regularization in the early stage. In Sec. 3.3, we elaborate on the second stage, i.e., iterative pruning with three variants of nonlinearly increased regularization.

Algorithm 1 UniLTH and UniDLTH Algorithms (aligned with Fig. 1)
Require: Network f(X, Θ) with data X and parameters Θ; sparsity level S_f; mask matrix m_Θ with initialization m⁰_Θ = 1; pruning rate p%; pruning times ψ; step size η; patience ϕ for early stopping.
1: while 1 − ∥m_Θ∥₀ / |m⁰_Θ| < S_f do
2:   Forward f(X, Θ) to compute the loss L_s = L(X, Θ)
3:   Update Θ^(j+1) ← Θ^(j) − η ∇_{Θ^(j)} L_s
4:   if the validation loss has not decreased for ϕ epochs then
5:     Rewind Θ to several training epochs earlier (denote the weights as Θ^(E))
6:   Load the weights Θ^(E)
7:   Select unimportant weights Θ_un and important weights Θ_im
8:   for iteration i = 1, 2, ..., M do
9:     Forward f(X; m_Θ; Θ) to compute the loss L_s = L_s(X, m_Θ ⊙ Θ) + α∥Θ∥₂² (or L_s = L_s(X, m_Θ ⊙ Θ) + α∥Θ_un∥₂²), where α is a hyper-parameter to control sparsity
10:    Update Θ^(i+1) ← Θ^(i) − η ∇_{Θ^(i)} L_s
11:    Increase the regularization nonlinearly (i.e., LogLaw, TanLaw, or ExpLaw)
12:    if i == M/ψ · γ (γ = 1, 2, ..., ψ) then
13:      Prune the p% of parameters with the lowest magnitudes in Θ (or Θ_un)
14:      Update m_Θ by zeroing the elements of m_Θ corresponding to the p% pruned parameters
15:    if 1 − ∥m_Θ∥₀ / |m⁰_Θ| ≥ S_f (or Θ_un = 0) then
16:      Stop training and obtain Θ^(M)
17: return m_Θ ⊙ Θ^(M) or Θ_im
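The per-round rate implied by the formula above can be illustrated in a short sketch (the function name and example values are ours, not from the paper): given a target fraction r% of weights to keep and ψ pruning rounds, each round removes p% = 1 − (r%)^(1/ψ) of the survivors.

```python
def per_round_prune_rate(target_keep, num_rounds):
    """Per-round prune fraction p so that removing p of the survivors in each
    of num_rounds rounds leaves target_keep of the weights:
    (1 - p) ** num_rounds == target_keep  =>  p = 1 - target_keep ** (1 / num_rounds)."""
    return 1.0 - target_keep ** (1.0 / num_rounds)

# e.g. keep r% = 10% of the weights after psi = 10 pruning rounds
p = per_round_prune_rate(0.10, 10)
print(round(p, 4))               # fraction of survivors pruned each round
print(round((1 - p) ** 10, 4))   # sanity check: survivors after 10 rounds
```

Because p is applied to an ever-shrinking pool of survivors, the absolute number of weights removed per round decreases over iterations, matching the fine-tuning-like behavior described above.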

3.2. EARLY STOPPING VERSUS REGULARIZATION

As the key operation of DLTH, regularization borrows learning capacity and performs information extrusion from the to-be-pruned weights from beginning to end. However, both early and excessive regularization considerably impede model expressivity: 1) in the early stage, the model should focus on fitting the data distribution without restricting its capacity; 2) an excessive regularization force makes the training process hard to control, e.g., a slight excess may cause irreversible reactions or ill-conditioned weights. To this end, we show, with a theoretical proof, that the early stopping mechanism is a simple yet effective alternative to regularization in the early training phase. Compared to regularization, early stopping not only has no negative effect on the early-stage fitting of the data distribution, but is also more controllable due to its non-parametric nature. In the following, we mathematically analyze why regularization is equivalent to early stopping.
L2 Regularization. We first revisit the theory of regularization. Let J(w) denote an unregularized objective function and w* = argmin_w J(w) the weight vector at which J(w) achieves the minimum training error. Assuming the existence of second-order partial derivatives, we form a quadratic approximation of the unregularized objective in a small neighborhood of w*:
Ĵ(w) = J(w*) + (1/2)(w − w*)ᵀ H (w − w*),   (1)
where w lies in a neighborhood of w* and H is the Hessian matrix LeCun et al. (1990) at w*. Since w* is a local optimum, the first-order (Jacobian) term in Eq. 1 vanishes and H is positive semi-definite. When Ĵ(w) attains its minimum, we have ∇_w Ĵ(w) = H(w − w*) = 0. L2 regularization has long been the most popular form of regularization, also known as weight decay Cortes et al. (2012); it constrains the weights within a small range by adding a penalty term Ω(w) = (1/2)α∥w∥₂² to J(w).
Under this circumstance, the optimum of w is perturbed by the new force, causing the network to reach a new balance point ẃ:
α ẃ + H(ẃ − w*) = 0 ⇒ ẃ = (H + αI)⁻¹ H w*.   (2)
To study the effect of increasing α, we decompose H (which is real symmetric) into a diagonal matrix Λ and the orthonormal basis Q of its eigenvectors, H = QΛQᵀ:
ẃ = (QΛQᵀ + αI)⁻¹ QΛQᵀ w* = Q(Λ + αI)⁻¹ Λ Qᵀ w*.   (3)
As seen above, weight decay essentially rescales w* along the axes defined by the eigenvectors of H. This rescaling has less effect on directions with larger eigenvalues and more on directions with smaller eigenvalues, indicating that weights of various curvatures behave differently under such regularization.
Early Stopping. A fundamental principle is well recognized in deep neural networks: when the validation loss does not decrease for several training cycles, the model has begun to slightly over-fit. We usually stop training as soon as the error on the validation set is higher than it was the last time it was checked, which is known as early stopping. Compared with L2 regularization, early stopping is less intrusive, since it barely affects the training process, the objective function, or the set of allowable parameter values. Suppose J(w) reaches its minimum (see Eq. 1), so that ∇_w Ĵ(w) = H(w − w*) = 0. We study the parameter updates during training, assuming we optimize for τ steps with learning rate ϵ, and analyze gradient descent on Ĵ to approximately study gradient descent on J. The parameters at the τ-th step follow from the (τ−1)-th step as:
w^(τ) = w^(τ−1) − ϵH(w^(τ−1) − w*) ⇒ w^(τ) − w* = (I − ϵH)(w^(τ−1) − w*).   (4)
Substituting H = QΛQᵀ into the above equation, we easily obtain:
Qᵀ(w^(τ) − w*) = (I − ϵΛ) Qᵀ(w^(τ−1) − w*),
Qᵀ(w^(τ−1) − w*) = (I − ϵΛ) Qᵀ(w^(τ−2) − w*),
...
Qᵀ(w^(1) − w*) = (I − ϵΛ) Qᵀ(w^(0) − w*),
⇒ Qᵀ w^(τ) = (I − ϵΛ)^τ Qᵀ w^(0) + [I − (I − ϵΛ)^τ] Qᵀ w*.   (5)
For simplicity, we first prove the equivalence under w^(0) = 0, in which case Eq. 5 simplifies to Qᵀ w^(τ) = [I − (I − ϵΛ)^τ] Qᵀ w*. For L2 regularization (see Eq. 3), writing Λ = diag(λ₁, ..., λₙ), we obtain an expression similar to that for early stopping:
Qᵀ ẃ = (Λ + αI)⁻¹ Λ Qᵀ w* = diag(λ₁/(λ₁ + α), ..., λₙ/(λₙ + α)) Qᵀ w* = [I − (Λ + αI)⁻¹ α] Qᵀ w*.   (6)
Now we can link the formula of L2 regularization (Eq. 6) with that of early stopping (Eq. 5):
Qᵀ ẃ = [I − (Λ + αI)⁻¹ α] Qᵀ w*  (L2 regularization)  ⇔  Qᵀ w^(τ) = [I − (I − ϵΛ)^τ] Qᵀ w*  (early stopping).   (7)
As shown in Eq. 7, early stopping is equivalent to L2 regularization if (Λ + αI)⁻¹ α = (I − ϵΛ)^τ holds. In this case, taking the logarithm of both sides yields:
τ log(I − ϵΛ) = −log(I + Λ/α).   (8)
When τϵΛ ≈ Λ/α (i.e., α ≈ 1/(ϵτ)), L2 regularization is equivalent to the early stopping mechanism. More generally, when w^(0) ≠ 0, we can draw a similar conclusion (both proofs are detailed in Appendices A and B). The product ϵτ of the learning rate and the number of steps can intuitively be regarded as the effective capacity of the network; for changing learning rates, ϵτ can be viewed as the distance the optimization trajectory travels in the high-dimensional space.
Why is Early Stopping Preferable? Having demonstrated their equivalence, we articulate the major advantages of our early stopping strategy over regularization. With early stopping, parameters along large-curvature directions are learned earlier than those along small-curvature directions; the amount of effective regularization is determined automatically by when training terminates (one only needs to monitor how many times the validation loss fails to decrease), whereas weight decay requires separate training experiments with different penalty values for the weights.
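The equivalence can be checked numerically on a toy quadratic objective. The sketch below (all constants are illustrative choices of ours) runs τ plain gradient steps from w(0) = 0 on Ĵ(w) = ½(w − w*)ᵀH(w − w*) and compares the result with the weight-decay optimum ẃ = (H + αI)⁻¹Hw* at α = 1/(ϵτ); for small ϵτ·λ the two should nearly coincide.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
A = rng.normal(size=(n, n))
H = A @ A.T / n + 0.01 * np.eye(n)   # symmetric positive-definite Hessian
w_star = rng.normal(size=n)          # unregularized optimum w*

eps, tau = 1e-3, 50                  # learning rate and number of steps
w = np.zeros(n)
for _ in range(tau):                 # early-stopped gradient descent on the quadratic
    w -= eps * H @ (w - w_star)

alpha = 1.0 / (eps * tau)            # the predicted equivalent weight decay
w_ridge = np.linalg.solve(H + alpha * np.eye(n), H @ w_star)

print(np.abs(w - w_ridge).max())     # small gap between the two solutions
```

The gap shrinks as ϵτ·λ decreases (i.e., for earlier stopping or flatter curvature), consistent with the first-order Taylor argument used in Appendix A.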
We believe that excessive reliance on regularization in the early stage of training is not conducive to the network moving towards the training distribution. In the early stage, we therefore use only the early stopping strategy, which not only learns the data distribution better, but also does not disrupt the dynamic learning process of the network.

3.3. ITERATIVE PRUNING WITH INCREASED REGULARIZATION

Figure 2: Different nonlinear increased regularization methods. For simplicity, we draw these nonlinear growth curves in continuous form.
We perform our pruning procedure iteratively after rewinding the weights to several epochs earlier, i.e., we repeatedly train, prune, and reset the network across M rounds (i.e., iterations). Each round eliminates p% of the weights that survived the previous round, meaning that the weights remaining in the final round have been verified M times, which ensures the reliability of the retained weights. In comparison to one-shot pruning, this iterative strategy empowers us to learn faster and achieve higher test accuracy with a small-scale network (see our experiments in Sec. 4.1). During the pruning process, we apply increased regularization to tilt the network distribution towards the validation distribution. Existing regularization-based pruning approaches (e.g., DLTH) mostly utilize linearly increased regularization, where the coefficient α grows by a constant each round. As a result, the regularization increases continuously at equal intervals to mine out the diversity of parameters Bai et al. (2022) and enables us to obtain a more expressive subnetwork (see the proof in Appendix C). However, linearly increased regularization may not be conducive to the search for optimal parameters, since a large regularization magnitude may destroy the network dynamics in the late training phase. Instead, we expect the regularization term to be controlled in a relatively small range for a long time while continuously increasing. As opposed to linearly increased regularization, we argue that a nonlinearly increased regularization term is more beneficial to the expressiveness of the winning tickets. As shown in Fig. 2, we propose several nonlinear regularization schemes: (1) Logarithmic-Law (LogLaw): R = (1/ln 2) · R_ceil · ln(φ/E + 1); (2) Exponential-Law (ExpLaw): R = (1/(e − 1)) · R_ceil · (e^(φ/E) − 1); (3) Tangent-Law (TanLaw): R = (1/2) · R_ceil · (tan(πφ/(2E) − π/4) + 1). Here R_ceil is the ceiling value of the regularization term, φ is the current epoch, and E is the total number of epochs. Different growth strategies increase the degrees of freedom of such regularization methods, thereby enabling us to obtain more diverse optimal subnetworks.
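The three laws can be written directly from the formulas above (a minimal sketch; the function names are ours). Each schedule starts at 0 at epoch φ = 0 and reaches R_ceil at the final epoch φ = E.

```python
import math

def log_law(phi, E, R_ceil):
    """LogLaw: R = R_ceil * ln(phi/E + 1) / ln 2 -- grows fast early on."""
    return R_ceil * math.log(phi / E + 1) / math.log(2)

def exp_law(phi, E, R_ceil):
    """ExpLaw: R = R_ceil * (exp(phi/E) - 1) / (e - 1) -- stays small for long."""
    return R_ceil * (math.exp(phi / E) - 1) / (math.e - 1)

def tan_law(phi, E, R_ceil):
    """TanLaw: R = R_ceil * (tan(pi*phi/(2E) - pi/4) + 1) / 2 -- steep near the end."""
    return R_ceil * (math.tan(math.pi * phi / (2 * E) - math.pi / 4) + 1) / 2

# All three schedules interpolate between 0 and R_ceil over E epochs.
E, R_ceil = 200, 2e-3
for law in (log_law, exp_law, tan_law):
    print(law.__name__, law(0, E, R_ceil), law(E, E, R_ceil))
```

At mid-training (φ = E/2), LogLaw sits above the linear schedule and ExpLaw below it, matching the discussion in Sec. 4.2 of why ExpLaw keeps the penalty small for most of the run.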

4. EXPERIMENTS

In this section, we evaluate our pruning algorithms (UniLTH and UniDLTH) on several widely used datasets. Our goal is to answer the following research questions. RQ1: How effectively do our algorithms find and transform winning lottery tickets (Sec. 4.1)? RQ2: How does the subnetwork perform under different pruning rates (Sec. 4.1)? RQ3: Is nonlinearly increased regularization more powerful than its linear counterpart (Sec. 4.2)? RQ4: How should we specify the patience ϕ of early stopping (i.e., the number of epochs to wait before stopping if there is no progress on the validation set) and the rewind epochs ξ (Sec. 4.3)? RQ5: Can our algorithms generalize to more backbones and larger-scale datasets (Appendix E)? Datasets and Backbones. We first evaluate our algorithms on the CIFAR-10/100 datasets Krizhevsky et al. (2009) using VGG-19 Simonyan & Zisserman (2014) or ResNet-50 He et al. (2016) as the backbone. To further verify effectiveness on large-scale datasets, we conduct experiments on ImageNet with the ResNet-50 backbone. Meanwhile, we consider the following methods for comparison: 1) L1 is L1-regularization pruning based on a pretrained network Li et al. (2016). 2) LTH/LTH-Iter is the lottery ticket hypothesis with one-shot/iterative pruning strategies Frankle & Carbin (2018). 3) EB is the Early-Bird ticket (i.e., a lottery ticket which emerges in the very early training phase) for LTH with a one-shot pruning strategy You et al. (2019). 4) DLTH/DLTH-Iter is the dual lottery ticket hypothesis with one-shot/iterative pruning strategies Bai et al. (2022). Our experiments are run on two NVIDIA Tesla V100 GPUs. Due to the page limit, we describe the experimental settings in Appendix D.

4.1. PRUNING ALGORITHMS COMPARISON

In this part, we explore the efficacy of our proposed algorithms. For LTH, different iterative pruning schedules are included for comparison. For EB, we follow the original paper and set the early stopping point at 37 epochs (1/8 of the total epochs). For DLTH, the regularization coefficient starts from 0 with a ceiling of 2e-3. Here we adopt linearly growing regularization for comparison; nonlinear regularization is discussed in Sec. 4.2. In Tab. 1, we report the mean top-1 accuracy with its standard deviation over three runs (notation details are in Appendix D). We set the number of pruning iterations ψ to 10, the monitoring patience ϕ for the descent of the validation loss to 2, and the rewind epoch to 2, to keep these variables consistent. As shown in Fig. 3, we validate our methods under 50%/70% pruning ratios with ResNet-50 on CIFAR-10. Our unified algorithms (i.e., UniLTH and UniDLTH) clearly surpass the traditional algorithms at both pruning rates, which demonstrates their effectiveness. Additionally, we observe in Fig. 3 that in the early stage of training, abandoning regularization helps the network learn the data distribution faster.

Performance Comparison (RQ1)


4.2. LINEAR VERSUS NONLINEAR INCREASED REGULARIZATION

In this paper, we propose three forms of nonlinearly increased regularization (LogLaw, TanLaw, and ExpLaw), with different regularization intensities at the early, middle, and late stages, respectively. To answer RQ3, we compare them with their linear counterpart and meanwhile investigate the effect of different regularization ceilings R_ceil on model performance. In Tab. 2, when using R_ceil of the same magnitude, ExpLaw significantly outperforms the other three strategies under all pruning ratios. As mentioned before, increased regularization stimulates the diversity of parameters. Different growth strategies (linear or nonlinear) further increase the degrees of freedom of such regularization methods, i.e., they enable us to obtain more diverse optimal subnetworks. For example, LogLaw keeps the network under intense regularization during the whole training process (see Fig. 2), which might lead to ill-conditioned weights. Conversely, ExpLaw applies a smaller penalty for most of the training time while still ensuring that the discrepancy in weight magnitudes is magnified. Moreover, the small penalty also prevents the ill-conditioned weight issue, which is more conducive to the search/transformation of excellent subnetworks. We further explore the impact of different R_ceil values on the model results. As depicted on the left-hand side of Fig. 4, we find that UniLTH requires less penalty force than UniDLTH. This may be attributed to the uncertainty of the subnetwork structure at the beginning of LTH and the need to drop weights slowly to obtain a reliable subnetwork. An excessive regularization force would destroy the learning process and complicate the search for winning tickets.
In contrast, the discarded parameters in UniDLTH are given at the beginning, which requires a larger, gradually increasing regularization force to extrude as much information as possible, so a larger R_ceil is allowed. To answer RQ4, we first set the patience ϕ to 2 and monitor the top-1 accuracy under various rewind epochs ξ.

4.3. EFFECTS OF PATIENCE FOR EARLY STOPPING AND REWIND VALUES

After that, we fix ξ to 2 and test network performance under different early stopping patience values ϕ, at 30%, 50%, 70%, and 90% pruning ratios (denoted by p). The results are reported in Tab. 3 and Fig. 4 (right), from which we draw the following findings. The model exhibits strong expressiveness when ξ is kept in a small regime such as 1 or 2. In contrast, when the rewind epochs are enlarged to some extent, e.g., ξ = 4, 5, the model performance degrades by a considerable margin. Based on these observations, a smaller ξ is recommended to avoid crippling the model capacity. Meanwhile, we find that different ϕ values do not have much effect on the final model performance. Although the early training behavior differed across patience values, the model always reached its optimum at nearly 200 epochs with similar accuracy (see Fig. 4, right).

5. CONCLUSION

In this work, we present a unified paradigm for searching and transforming winning lottery tickets. In the early training phase, we replace regularization with early stopping to mitigate overfitting and ensure the stability of training. When the parameter distribution reaches a certain point, we rewind the weights and alternately apply increased regularization and pruning to search/transform subnetworks, jointly optimizing network capacity and sparsity. In addition, to compensate for the ill-conditioned small-weight problem caused by linearly increased regularization, we propose a variety of nonlinear regularization methods, among which the ExpLaw method improves the expressive power of the network by keeping the penalty force small for a long period. We have benchmarked our algorithms on extensive datasets and backbones, achieving competitive performance with ultra-lightweight subnetworks.

A PROOF OF EQ. 8

In Eq. 7, early stopping is equivalent to $L_2$ regularization when $(\Lambda + \alpha I)^{-1}\alpha = (I - \epsilon\Lambda)^{\tau}$. Taking the logarithm on both sides:

$$\log\left[(\Lambda + \alpha I)^{-1}\alpha\right] = \log\left[(I - \epsilon\Lambda)^{\tau}\right] \;\Rightarrow\; \log\left(\operatorname{diag}\left[\tfrac{\alpha}{\lambda_{1}+\alpha}, \ldots, \tfrac{\alpha}{\lambda_{n}+\alpha}\right]\right) = \tau \log\left(I - \epsilon\Lambda\right). \tag{9}$$

Substituting $\tfrac{\alpha}{\lambda_{i}+\alpha} = \left(1 + \tfrac{\lambda_{i}}{\alpha}\right)^{-1}$ into Eq. 9, we obtain:

$$-\log\left(I + \tfrac{\Lambda}{\alpha}\right) = \tau \log\left(I - \epsilon\Lambda\right).$$

By Taylor expansion, when $x$ approaches zero, $\log(1 - x)$ can be approximated as $-x$. Likewise, when the eigenvalues in $\Lambda$ approach zero, $-\log\left(I + \Lambda/\alpha\right) \approx -\Lambda/\alpha$ and $\tau\log\left(I - \epsilon\Lambda\right) \approx -\tau\epsilon\Lambda$. Therefore, the equivalence holds if $\alpha \approx \tfrac{1}{\epsilon\tau}$.

B PROOF THAT $L_2$ REGULARIZATION IS EQUIVALENT TO EARLY STOPPING WHEN $w^{(0)} \neq 0$

When $w^{(0)} \neq 0$, we have:

$$Q^{\top} w^{(\tau)} = (I - \epsilon\Lambda)^{\tau} Q^{\top} w^{(0)} + \left[I - (I - \epsilon\Lambda)^{\tau}\right] Q^{\top} w^{*}.$$

Writing $\tfrac{w^{(0)}}{w^{*}}$ for the element-wise ratio, this equation can also be written as:

$$Q^{\top} w^{(\tau)} = (I - \epsilon\Lambda)^{\tau} \tfrac{w^{(0)}}{w^{*}} Q^{\top} w^{*} + \left[I - (I - \epsilon\Lambda)^{\tau}\right] Q^{\top} w^{*} = \left[I - (I - \epsilon\Lambda)^{\tau}\left(I - \tfrac{w^{(0)}}{w^{*}}\right)\right] Q^{\top} w^{*}. \tag{12}$$

In practice $w^{(0)}$ is usually much smaller than $w^{*}$, which means that $\tfrac{w^{(0)}}{w^{*}}$ approaches $0$ (i.e., $I - \tfrac{w^{(0)}}{w^{*}} \approx I$). Comparing Eq. 12 with Eq. 7, the equivalence still holds if $\alpha \approx \tfrac{1}{\epsilon\tau}$.
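The correspondence α ≈ 1/(ϵτ) can be checked numerically on a diagonal quadratic objective, where both the early-stopped iterate and the $L_2$-regularized solution have closed forms. The eigenvalues, optimum, step size, and stopping step below are illustrative choices (not values from the paper), picked so that the eigenvalues of Λ are small and the Taylor approximation applies:

```python
import numpy as np

# Work directly in the eigenbasis (Q = I), so Lambda is diagonal.
lam = np.array([0.01, 0.02, 0.05])       # eigenvalues of the Hessian (illustrative)
w_star = np.array([1.0, -2.0, 0.5])      # unregularized optimum w*
eps, tau = 0.1, 50                       # step size and early-stopping step
alpha = 1.0 / (eps * tau)                # predicted equivalent L2 strength

# Early stopping: tau gradient steps from w(0) = 0 give w = [1 - (1 - eps*lam)^tau] w*.
w_early = (1.0 - (1.0 - eps * lam) ** tau) * w_star
# L2 regularization: closed-form solution (Lambda + alpha*I)^{-1} Lambda w*.
w_ridge = lam / (lam + alpha) * w_star

# Under alpha = 1/(eps*tau) the two solutions nearly coincide.
print(np.max(np.abs(w_early - w_ridge)))
```

The residual shrinks further as the products ϵλ_i τ decrease, matching the small-eigenvalue regime assumed in the proof.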

C PROOF FOR INCREASED REGULARIZATION

Increased Regularization. As the formula $\acute{w} = (H + \alpha I)^{-1} H w^{*}$ shows, after the regularization force increases (by $\delta\alpha$ on the original basis), the parameters in the network move to a new position. When the network converges again, we have $\acute{w} = (H + \delta\alpha I)^{-1} H w^{*}$. Since $(H + \delta\alpha I)^{-1}$ is very complicated or even intractable in general, to better analyze how the parameters change after the regularization force increases, we investigate two simplified forms of $H$ (Wang et al., 2020).

(1) $H$ is a diagonal matrix, the simplest form of Hessian information (LeCun et al., 1989). Assuming $H = \operatorname{diag}(h_{1,1}, \ldots, h_{n,n})$, then:

$$\acute{w} = (H + \delta\alpha I)^{-1} H w^{*} = \operatorname{diag}\left(\tfrac{h_{1,1}}{h_{1,1}+\delta\alpha}, \ldots, \tfrac{h_{n,n}}{h_{n,n}+\delta\alpha}\right) w^{*}.$$

For the $i$-th entry $w^{*}_{i}$ of $w^{*}$, we find $\tfrac{\acute{w}_{i}}{w^{*}_{i}} = \tfrac{h_{i,i}}{h_{i,i}+\delta\alpha} \in (0, 1)$ since $h_{i,i} > 0$ and $\delta\alpha > 0$. A larger Hessian element corresponds to a larger curvature, and the closer the ratio $\tfrac{\acute{w}_{i}}{w^{*}_{i}}$ is to 1, the less the weight moves.

(2) $H$ is not diagonal. Here we analyze the 2-dimensional case, with $w^{*} = \begin{pmatrix} w^{*}_{1} \\ w^{*}_{2} \end{pmatrix}$ and $H = \begin{pmatrix} h_{1,1} & h_{1,2} \\ h_{2,1} & h_{2,2} \end{pmatrix}$, then:

$$\begin{pmatrix} \acute{w}_{1} \\ \acute{w}_{2} \end{pmatrix} = \frac{1}{|H + \delta\alpha I|} \begin{pmatrix} (h_{1,1}h_{2,2} + h_{1,1}\delta\alpha - h_{1,2}^{2})\, w^{*}_{1} + \delta\alpha\, h_{1,2}\, w^{*}_{2} \\ (h_{1,1}h_{2,2} + h_{2,2}\delta\alpha - h_{1,2}^{2})\, w^{*}_{2} + \delta\alpha\, h_{1,2}\, w^{*}_{1} \end{pmatrix}. \tag{14}$$

Since $\delta\alpha$ is small, we can obtain:

$$\begin{pmatrix} \acute{w}_{1} \\ \acute{w}_{2} \end{pmatrix} \approx \frac{1}{|H + \delta\alpha I|} \begin{pmatrix} (h_{1,1}h_{2,2} + h_{1,1}\delta\alpha - h_{1,2}^{2})\, w^{*}_{1} \\ (h_{1,1}h_{2,2} + h_{2,2}\delta\alpha - h_{1,2}^{2})\, w^{*}_{2} \end{pmatrix}.$$

When $h_{1,1} > h_{2,2}$, we still obtain the similar conclusion that $\acute{w}_{1}/w^{*}_{1} > \acute{w}_{2}/w^{*}_{2}$. To summarize, due to different local second-order partial derivative structures, different weights respond differently to the increased regularization force. A larger curvature results in the weight moving relatively less towards the origin, and the magnitude of the difference among the weights increases as the regularization grows.
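The curvature-dependent shrinkage in case (2) can be verified numerically. The Hessian and optimum below are illustrative values chosen with $h_{1,1} > h_{2,2}$, not quantities from the paper:

```python
import numpy as np

# Hypothetical symmetric 2x2 Hessian with larger curvature in direction 1.
H = np.array([[2.0, 0.3],
              [0.3, 0.5]])
w_star = np.array([1.0, 1.0])
d_alpha = 0.05                                   # small increase in the penalty force

# New optimum after the regularization grows: (H + d_alpha*I)^{-1} H w*.
w_new = np.linalg.solve(H + d_alpha * np.eye(2), H @ w_star)
ratios = w_new / w_star                          # per-coordinate shrinkage ratios

# Larger curvature (h11 > h22) keeps the ratio closer to 1: the weight moves less.
print(ratios)
```

Repeating the computation with a larger `d_alpha` widens the gap between the two ratios, matching the claim that the differences among weights grow with the regularization force.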

D EXPERIMENTAL SETTINGS AND NOTATIONS

The experiments are conducted on two NVIDIA Tesla V100 GPUs (16GB each), and all selected backbones follow the same experimental settings for fairness. Specifically, experiments on CIFAR10/CIFAR100 are optimized by Stochastic Gradient Descent (SGD) with a learning rate of 0.1, momentum 0.9, and batch size 128. We use the Cosine Annealing Warm Restarts scheduler (Loshchilov & Hutter, 2016) with the maximum number of iterations T_max = 200. The models for ImageNet classification are optimized by SGD with lr = 0.1 and momentum 0.9, and follow the same learning rate scheduler. The results of the comparative evaluation on the MobileNets (v1) and EfficientNetB0 backbones are summarized in Table 5. Our proposed pruning algorithm achieves almost the best performance under the same pruning ratios, which demonstrates the effectiveness of our unified winning-lottery-ticket search/transform algorithms in a more general scenario.

Table 6: Performance on the ImageNet dataset with different backbones and the UniLTH pruning algorithm using 30%, 50%, 70%, and 90% sparsity ratios. Acc@1: a prediction is correct if the highest-probability class is the ground truth. Acc@5: a prediction is correct if the ground truth is among the top five probabilities.

From Table 6 and Table 7, we observe that our unified lottery pruning algorithm remains effective on the larger-scale ImageNet dataset: even when the model is pruned to one tenth of the original network, the backbone performance is still comparable to that of the original backbone. To summarize, our UniLTH/UniDLTH are adaptable and robust for general backbones and large-scale datasets, and even obtain better performance than LTH/DLTH using the same parameter settings, which can further search/transform ultra-lightweight subnetworks.
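For reference, the cosine annealing schedule that governs the learning rate in the settings above can be sketched in simplified, single-cycle form. This is my own illustrative helper, not the paper's code: it restarts every `t_max` epochs and omits the cycle-length multiplier of the full warm-restarts scheduler, with defaults mirroring the reported lr = 0.1 and T_max = 200:

```python
import math

def cosine_annealing_lr(epoch, eta_max=0.1, eta_min=0.0, t_max=200):
    """eta_t = eta_min + (eta_max - eta_min) * (1 + cos(pi * T_cur / T_max)) / 2,
    restarting from eta_max every t_max epochs (simplified: no cycle growth)."""
    t_cur = epoch % t_max                        # position within the current cycle
    return eta_min + (eta_max - eta_min) * (1 + math.cos(math.pi * t_cur / t_max)) / 2
```

The rate starts at 0.1, decays smoothly to eta_min over each cycle, then jumps back to 0.1 at the restart.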



Figure 1: Illustration of LTH/DLTH and our UniLTH/UniDLTH. In (c), the blue/green solid contour lines denote the contours of the training/validation negative log-likelihood. Our goal is to draw the weights closer to ŵ. The black line indicates the training trajectory taken by SGD. Our algorithm rewinds the training procedure (the yellow line) and adds increased regularization (the purple line) to move towards the validation set distribution once training reaches the early stopping threshold.

Figure 3: ResNet-50 on CIFAR-10 using 50%/70% sparsity with UniLTH/UniDLTH algorithms.

Figure 4: (Left) ExpLaw on ResNet-50 + CIFAR-10 + 50% sparsity experimental settings under different R ceil . (Right) Network performance under different validation loss monitor threshold ϕ (i.e., patience).

Performance comparison of Vgg-19/ResNet50 on CIFAR10/CIFAR100 datasets using 30%, 50%, 70%, and 90% sparsity ratios. Except for the $L_1$ row, the highest/second-highest performances are emphasized with red/blue fonts.

Performance comparison of different increased regularization strategies on the Vgg-19/CIFAR10 and ResNet-50/CIFAR100 experimental settings. For fairness, we set all R_ceil to be consistent, keeping them at 1e-3 (UniLTH/LTH) and 1e-2 (UniDLTH/DLTH).



The notations commonly used in this work are placed here.

EXPERIMENTS ON MORE BACKBONES AND LARGE-SCALE DATASETS

To answer RQ5, we here evaluate two additional backbones, MobileNets (v1) (Howard et al., 2017) and EfficientNetB0 (Tan & Le, 2019), on the CIFAR-10 dataset, and additionally use the ImageNet dataset to verify the performance of our proposed unified lottery ticket search/transform algorithms. The results are given in the tables below.

Performance comparison of different pruning algorithms with the MobileNets (v1) and EfficientNetB0 backbones on the CIFAR-10 dataset using 30%, 50%, 70%, and 90% sparsity ratios. For convenience, we highlight the highest/second-highest performances with red/blue fonts.

