CONSIDERING LAYERWISE IMPORTANCE IN THE LOTTERY TICKET HYPOTHESIS

Abstract

The recently introduced Lottery Ticket Hypothesis (LTH) posits that a sparse, trainable subnetwork can be extracted from a dense network using iterative magnitude pruning: by iteratively training the model, removing the connections with the lowest global weight magnitude, and rewinding the remaining connections, sparse networks can be extracted. These sparse networks are referred to as lottery tickets and, when fully trained, they reach a performance similar to or better than that of their dense counterpart. Intuitively, comparing connection weights globally discards much of the context about how a connection's weight relates to the other weights within its layer, as weight distributions often differ significantly across the layers of a network. In this paper, we study a number of approaches that aim to recover some of this layerwise distributional context by computing a connection importance value that depends on the weights of the other connections in the same layer. We then generalise the LTH to consider weight importance values rather than weight magnitudes. Experiments using these importance metrics on several architectures and datasets reveal interesting aspects of the structure and emergence of lottery tickets. We find that, given a repeatable training procedure, applying different importance metrics leads to distinct performant lottery tickets with few overlapping connections, which strongly suggests that lottery tickets are not unique.
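The contrast between global magnitude ranking and a layer-local alternative can be sketched as follows. This is a minimal illustration, not the paper's implementation: the dict-of-lists weight representation and layer names are hypothetical, and real pruning operates on full tensors within a train-prune-rewind loop.

```python
def global_magnitude_mask(weights, sparsity):
    """Rank all weights across all layers by absolute magnitude and
    prune the lowest `sparsity` fraction globally.

    weights: dict mapping layer name -> flat list of weight values.
    Returns a dict of boolean keep-masks with the same shape.
    """
    mags = sorted(abs(w) for layer in weights.values() for w in layer)
    k = int(sparsity * len(mags))
    threshold = mags[k] if k > 0 else float("-inf")
    return {name: [abs(w) >= threshold for w in layer]
            for name, layer in weights.items()}


def layerwise_magnitude_mask(weights, sparsity):
    """Prune the lowest `sparsity` fraction of each layer independently,
    so a weight is only compared against others in the same layer."""
    masks = {}
    for name, layer in weights.items():
        mags = sorted(abs(w) for w in layer)
        k = int(sparsity * len(mags))
        threshold = mags[k] if k > 0 else float("-inf")
        masks[name] = [abs(w) >= threshold for w in layer]
    return masks
```

With a layer whose weights are uniformly small (e.g. an early convolution) and one whose weights are uniformly large, global ranking removes the small layer almost entirely, while the layerwise variant removes the same fraction from each layer; this is the loss of layer-local context the abstract refers to.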

1. INTRODUCTION

The recent trend in machine learning of chasing higher benchmark scores by adding additional parameters has led to an explosive increase in the size of neural network architectures. A prime example of this phenomenon is the GPT family of models. While the first model in the family (Radford et al., 2018) has 117 million parameters, the latest model (Brown et al., 2020) already has a whopping 175 billion parameters, a more than 1000-fold increase. However, this explosive rise in parameters poses new problems. Training a single transformer model with a parameter count of 213 million (still orders of magnitude smaller than GPT-3) using Neural Architecture Search emits as much CO2 as five cars over their lifetimes (Strubell et al., 2020). Furthermore, larger models typically need specialised hardware and substantial computing power for training and inference, which constrains their ability to run on mobile devices and limits powerful models to well-funded institutions. Finally, research has shown that these large models are typically overparameterized and encode a lot of redundant information that can be removed (Denil et al., 2013). To alleviate these issues, numerous approaches have been studied to scale down the number of parameters in a model while still (roughly) preserving its performance. This can be achieved by, e.g., designing parameter-efficient network structures (Sandler et al., 2018), or by sparsifying existing neural network structures via pruning (LeCun, 1990; Hassibi & Stork, 1993; Han et al., 2015; Louizos et al., 2018; Molchanov et al., 2017). Until recently, it was thought to be difficult to train sparse neural networks from scratch (Evci et al., 2019), a belief further strengthened by the finding that over-parameterized network architectures provably converge to a global minimum during training (Zou & Gu, 2019). As such, the classical way to reduce parameter count was via the train-prune-finetune loop, in which a model is

