CONSIDERING LAYERWISE IMPORTANCE IN THE LOTTERY TICKET HYPOTHESIS

Abstract

The recently introduced Lottery Ticket Hypothesis (LTH) posits that it is possible to extract a sparse trainable subnetwork from a dense network using iterative magnitude pruning. By iteratively training the model, removing the connections with the lowest global weight magnitude and rewinding the remaining connections, sparse networks can be extracted. These sparse networks are referred to as lottery tickets, and when fully trained they reach similar or better performance than their dense counterpart. Intuitively, comparing connection weights globally discards a lot of context about the relations between connection weights within their layer, as the weight distributions of layers throughout the network often differ significantly. In this paper, we study a number of approaches that aim to recover some of this layerwise distributional context by computing a connection importance value that depends on the weights of the other connections in the same layer. We then generalise the LTH to consider weight importance values rather than weight magnitudes. Experiments using these importance metrics on several architectures and datasets reveal interesting aspects of the structure and emergence of lottery tickets. We find that, given a repeatable training procedure, applying different importance metrics leads to distinct performant lottery tickets with few overlapping connections, which strongly suggests that lottery tickets are not unique.

1. INTRODUCTION

The recent trend in machine learning of chasing higher benchmark scores by adding ever more parameters has led to an explosive increase in the size of neural network architectures. A prime example of this phenomenon is the GPT family of models. While the first model in the family (Radford et al., 2018) has 117 million parameters, the latest model (Brown et al., 2020) already has a whopping 175 billion parameters, a more than 1000-fold increase. However, this explosive rise in parameters poses new problems. Training a single transformer model with a parameter count of 213 million (still orders of magnitude smaller than GPT-3) using Neural Architecture Search emits as much CO2 as five cars over their lifetimes (Strubell et al., 2020). Furthermore, larger models typically need specialised hardware and a lot of computing power for training and inference, which prevents them from running on mobile devices and limits powerful models to well-funded institutions. Finally, research has shown that these large models are typically overparameterized and encode a lot of redundant information that can be removed (Denil et al., 2013). To alleviate these issues, numerous approaches have been studied to scale down the number of parameters in a model while still preserving (roughly) the same performance. This can be achieved by, e.g., designing parameter-efficient network structures (Sandler et al., 2018), or sparsifying existing neural network structures via pruning (LeCun, 1990; Hassibi & Stork, 1993; Han et al., 2015; Louizos et al., 2018; Molchanov et al., 2017). Until recently, it was thought to be difficult to train sparse neural networks from scratch (Evci et al., 2019), a belief further strengthened by proofs that over-parameterized network architectures converge to a global minimum during training (Zou & Gu, 2019).
As such, the classical way to reduce the parameter count was the train-prune-finetune loop, in which a model is first trained to completion, redundant connections are then pruned, and the resulting network is finally finetuned. The recently introduced lottery ticket hypothesis (LTH) (Frankle & Carbin, 2019) challenged this notion and introduced a procedure to extract a sparse trainable network, a lottery ticket (LT), from a dense network. This is done using a pruning criterion based on the global weight magnitude, in combination with a gradual pruning procedure. In this paper we study a refinement of this criterion that adds a notion of layerwise importance, which we introduce in section 2. We do this by considering a number of different weight rescaling methods, such that comparisons between layers are better calibrated. Quantitative and qualitative comparisons between the different importance measures and the baseline are reported in section 3. In addition, we shed light on the observable differences in the generated LTs (section 4), and determine how LTs emerge and differ under identical training conditions (section 5). A brief overview of related work is given in section 6, and finally the paper is concluded in section 7. The key observations of our study are that: i) given a fixed weight initialization, it is possible to extract different lottery tickets that have similar performance but differ significantly in their structure, ii) these tickets share a noticeable number of common connections whose weights have low variance across tickets, and iii) these stable common connections survive the LTH procedure even when the other weights in the model are reinitialized. Together, these observations suggest that these connections might be a promising avenue towards finding LTs more efficiently.

2. THE LOTTERY TICKET HYPOTHESIS

The Lottery Ticket Hypothesis uses iterative Global Magnitude Pruning (GMP) (Han et al., 2015), which prunes the individual connections with the lowest weight magnitudes in a network. By repeating this process multiple times, it is possible to obtain a highly sparse network that, when trained, still reaches commensurate accuracy. Frankle et al. (2020) introduced a modification to the LTH procedure that rewinds to the parameters at an early training iteration t = k ≪ m (with m the total number of training iterations), rather than resetting to the initial parameters at t = 0. Rewinding to this later iteration improved the performance of the found lottery tickets for complex networks at high sparsities. In the literature, this procedure is usually referred to as Lottery Ticket Rewinding (LTR) rather than the LTH. We adopt this naming in the remainder of the paper.
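As an illustration, one iteration of the global magnitude pruning criterion can be sketched as follows. This is a minimal sketch assuming the network is represented as a dictionary of NumPy weight arrays with matching binary masks; the names `prune_step` and `frac` are illustrative, not from the original procedure.

```python
import numpy as np

def prune_step(weights, masks, frac=0.2):
    """Prune the fraction `frac` of surviving weights with the globally
    lowest magnitudes, returning updated binary masks."""
    # Pool the magnitudes of all still-active connections across layers.
    pooled = np.concatenate(
        [np.abs(w[m == 1]) for w, m in zip(weights.values(), masks.values())]
    )
    threshold = np.percentile(pooled, frac * 100)
    # A connection survives only if it was active and lies above the global
    # threshold; weights from different layers are compared directly.
    return {name: masks[name] * (np.abs(w) >= threshold)
            for name, w in weights.items()}

rng = np.random.default_rng(0)
weights = {"fc1": rng.standard_normal((4, 4)),
           "fc2": rng.standard_normal((4, 2))}
masks = {k: np.ones_like(v) for k, v in weights.items()}
masks = prune_step(weights, masks, frac=0.2)  # lowest ~20% removed globally
```

In the full LTR procedure this step is interleaved with training and rewinding, so the set of surviving connections is re-evaluated after every round.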

While the exact mechanism behind the success of lottery tickets is not fully understood, one hypothesis, posited by Evci et al. (2022), is that due to the LTH/LTR procedure the resulting networks already lie in the same loss basin as the fully trained dense network, and as such the ticket can still converge to a performant solution during training. Pseudocode for LTR in its initial form can be found in Algorithm 1.

Algorithm 1 The LTR procedure
1: Initialize a model M with parameters θ_0
2: Pretrain M for k iterations, resulting in parameters θ_{0,k}
3: for i ← 0, n do
4:     Train M for m − k iterations, resulting in parameters θ_{i,m}
5:     θ_pooled ← Pool(abs(θ_{i,m}))
6:     p ← j-th percentile of θ_pooled
7:     Prune all connections with abs(θ_{i,m}) < p
8:     Rewind the parameters of M to θ_{0,k}
9: end for

A weak aspect of global pruning is that the only factor determining whether a connection is pruned is the magnitude of the connection weight. As such, connection weights from different layers are compared on a global scale rather than within their layer. This disregards more complex factors such as the weight distribution within a layer and the number of remaining connections in the layer. In fact, due to the commonly used Kaiming Normal initialization (He et al., 2015), different layers are already initialized with different weight distributions, as the standard deviation is inversely proportional to the square root of the layer's fan-in.
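To illustrate how layerwise context could be recovered, the sketch below rescales each weight's magnitude by the standard deviation of its own layer before any global comparison. This is only one simple instantiation of the idea, assumed here for illustration; it is not necessarily one of the importance metrics evaluated in this paper.

```python
import numpy as np

def layerwise_importance(weights):
    """Map raw weights to importance scores that are comparable across
    layers by normalizing with the per-layer standard deviation."""
    return {name: np.abs(w) / (np.std(w) + 1e-12)
            for name, w in weights.items()}

rng = np.random.default_rng(0)
# Two layers initialized at very different scales, as happens under
# Kaiming Normal initialization when the fan-in differs strongly.
weights = {"wide": 0.05 * rng.standard_normal((8, 8)),
           "narrow": 0.5 * rng.standard_normal((4, 4))}
scores = layerwise_importance(weights)
# Raw magnitudes would let a global criterion prune the "wide" layer almost
# exclusively; the rescaled scores of both layers live on a comparable scale.
```

Under such a rescaling, the global percentile threshold of the LTR procedure operates on layer-calibrated scores instead of raw magnitudes, so a layer with a small initialization scale is no longer pruned disproportionately.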

