UNMASKING THE LOTTERY TICKET HYPOTHESIS: WHAT'S ENCODED IN A WINNING TICKET'S MASK?

Abstract

As neural networks get larger and costlier, it is important to find sparse networks that require less compute and memory but can be trained to the same accuracy as the full network (i.e. matching). Iterative magnitude pruning (IMP) is a state-of-the-art algorithm that can find such highly sparse matching subnetworks, known as winning tickets. IMP iterates through cycles of training, pruning a fraction of the smallest-magnitude weights, rewinding the unpruned weights back to an early point in training, and repeating. Despite its simplicity, the principles underlying when and how IMP finds winning tickets remain elusive. In particular, what useful information does an IMP mask found at the end of training convey to a rewound network near the beginning of training? How does SGD allow the network to extract this information? And why is iterative pruning needed, i.e. why can't we prune to very high sparsities in one shot? We investigate these questions through the lens of the geometry of the error landscape. First, we find that, at higher sparsities, pairs of pruned networks at successive pruning iterations are connected by a linear path with zero error barrier if and only if they are matching. This indicates that masks found at the end of training convey to the rewind point the identity of an axial subspace that intersects a desired linearly connected mode of a matching sublevel set. Second, we show SGD can exploit this information due to a strong form of robustness: it can return to this mode despite strong perturbations early in training. Third, we show how the flatness of the error landscape at the end of training limits the fraction of weights that can be pruned at each iteration of IMP. This analysis yields a new quantitative link between IMP performance and the Hessian eigenspectrum. Finally, we show that the role of retraining in IMP is to find a network with new small weights to prune.
Overall, these results make progress toward demystifying the existence of winning tickets by revealing the fundamental role of error landscape geometry in the algorithms used to find them.

1. INTRODUCTION

Recent advances in deep learning have been driven by massively scaling both the size of networks and datasets (Kaplan et al., 2020; Hoffmann et al., 2022). But this scale comes at considerable resource costs. For example, when training or deploying networks on edge devices, memory and computational demands must remain small. This motivates the search for sparse trainable networks (Blalock et al., 2020), pruned datasets (Paul et al., 2021; Sorscher et al., 2022), or both (Paul et al., 2022) that can be used within these resource constraints. However, finding highly sparse, trainable networks is challenging. A state-of-the-art algorithm for doing so is iterative magnitude pruning (IMP) (Frankle et al., 2020a). IMP starts with a dense network, usually pretrained for a very short amount of time; the weights of this network are called the rewind point. IMP then repeatedly (1) trains this network to convergence; (2) prunes the trained network by computing a mask that zeros out a fraction (typically 20%) of the smallest-magnitude weights; (3) rewinds the nonzero weights back to their values at the rewind point, and then starts the next iteration by training the masked network to convergence. Each successive iteration yields a mask with higher sparsity. The final mask applied to the rewind point constitutes a highly sparse trainable subnetwork, called a winning ticket if it trains to the same accuracy as the full network, i.e. is matching.
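The pruning loop described above can be sketched in a few lines. This is a minimal numpy sketch, not the authors' implementation: `train_fn` is a hypothetical stand-in for training the masked subnetwork to convergence, and weights are flattened into a single vector for simplicity.

```python
import numpy as np

def prune_mask(weights, mask, frac=0.2):
    """Return a new mask that additionally zeros out the smallest-magnitude
    `frac` of the weights that survive the current mask."""
    alive = np.flatnonzero(mask)                # indices of unpruned weights
    k = int(np.ceil(frac * alive.size))         # number to prune this level
    order = np.argsort(np.abs(weights[alive]))  # smallest magnitudes first
    new_mask = mask.copy()
    new_mask[alive[order[:k]]] = 0
    return new_mask

def imp(rewind_weights, train_fn, levels=3, frac=0.2):
    """Iterative magnitude pruning: train, prune, rewind, repeat.
    `train_fn` is a placeholder for training to convergence."""
    mask = np.ones_like(rewind_weights)
    for _ in range(levels):
        # train the masked subnetwork from the (fixed) rewind point
        trained = train_fn(rewind_weights * mask) * mask
        # prune a fraction of the smallest surviving trained weights
        mask = prune_mask(trained, mask, frac)
    return mask  # final sparsity mask, applied to the rewind point
```

Each pass keeps roughly 80% of the surviving weights, so after L levels the subnetwork lives on an axial subspace of dimension about 0.8^L times the original.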

Figure 1: Error landscape of IMP. At iteration L, IMP trains the network from a pruned rewind point (circles), on an α^L D-dimensional axial subspace (colored planes), to a level-L pruned solution (triangles). The smallest (1 − α) fraction of weights are then pruned, yielding the level-L projection (×'s), whose zeroed weights form a sparsity mask corresponding to an α^{L+1} D-dimensional axial subspace. This mask, when applied to the rewind point, defines the level-(L + 1) initialization. Thus IMP moves through a sequence of nested axial subspaces of increasing sparsity. We find that when IMP finds a sequence of matching pruned solutions (triangles), there is no error barrier on the piecewise linear path between them. Thus the key information contained in an IMP mask is the identity of an axial subspace that intersects the connected matching sublevel set containing a well-performing overparameterized network.

Although IMP produces highly sparse matching networks, it is extremely resource intensive. Moreover, the principles underlying when and how IMP finds winning tickets remain quite mysterious. The goal of this work is therefore to develop a scientific understanding of the mechanisms governing the success or failure of IMP. By doing so, we hope to enable the design of improved pruning algorithms.

The operation of IMP raises four foundational puzzles. First, the mask that determines which weights to prune is identified from the smallest-magnitude weights at the end of training, yet at the next iteration this mask is applied to the rewind point found early in training. Precisely what information from the end of training does the mask convey to the rewind point? Second, how does SGD, starting from the masked rewind point, extract and use this information?
The mask indeed provides actionable information beyond that stored in the network weights at the rewind point: using a random mask, or even pruning the smallest-magnitude weights at the rewind point itself, leads to higher error after training (Frankle et al., 2021). Third, why are we forced to prune only a small fraction of weights at each iteration? Training and then pruning a large fraction of weights in one shot does not perform as well as iteratively pruning a small fraction and then retraining (Frankle & Carbin, 2018). Why does pruning a larger fraction in one iteration destroy the actionable information in the mask? Fourth, why does retraining allow us to prune more weights? A variant of IMP that uses a different retraining strategy (learning rate rewinding) also successfully identifies matching subnetworks, while another variant (finetuning) fails (Renda et al., 2020). What differentiates a successful retraining strategy from an unsuccessful one?

Understanding IMP through error landscape geometry. In this work, we provide insights into these questions by isolating important aspects of the error landscape geometry that govern IMP performance (Fig. 1). We do so through extensive empirical investigations on a range of benchmark datasets (CIFAR-10, CIFAR-100, and ImageNet) and modern network architectures. Our contributions are as follows:

• We find that, at higher sparsities, successive matching sparse solutions at levels L and L + 1 (adjacent pairs of triangles in Fig. 1) are linearly mode connected: there is no error barrier on the line connecting them in weight space. However, the dense solution and the sparsest matching solution may not be linearly mode connected. The transition from matching to nonmatching sparsity coincides with the failure of this successive-level linear mode connectivity.
This answers our first question of what information an IMP solution at the end of training conveys, via the mask, to the rewind point: the identity of a sparser axial subspace that intersects the linearly connected sublevel set containing the current matching IMP solution. IMP fails to find matching solutions when this information is not present. (Section 3.1)

• We show that, at higher sparsities, networks trained from rewind points that yield matching solutions exhibit a strong form of robustness. The linearly connected modes these networks train into are robust not only to SGD noise (Frankle et al., 2020a), but also to random perturbations whose length equals the distance between rewind points at successive levels (adjacent circles in Fig. 1). This answers our second question of how SGD extracts information from the mask: two pruned rewind points at successive sparsity levels are likely to navigate back to the same linearly connected mode, yielding matching solutions (adjacent triangles in Fig. 1). (Section 3.2)
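The linear mode connectivity test behind the first contribution can be made concrete: evaluate the error along the straight line between two solutions in weight space and measure how far it rises above the endpoints. A minimal sketch, where `error_fn` and the toy weight vectors are illustrative placeholders rather than the paper's trained networks:

```python
import numpy as np

def error_barrier(w_a, w_b, error_fn, n_points=11):
    """Height of the error barrier on the line segment between two
    solutions, relative to the mean error of the endpoints. A value of
    zero (up to noise) means the two networks are linearly mode connected."""
    alphas = np.linspace(0.0, 1.0, n_points)
    errs = np.array([error_fn((1 - a) * w_a + a * w_b) for a in alphas])
    baseline = 0.5 * (errs[0] + errs[-1])
    return max(errs.max() - baseline, 0.0)
```

In the paper's setting, `w_a` and `w_b` would be the pruned solutions at successive sparsity levels (adjacent triangles in Fig. 1), with pruned coordinates held at zero along the whole path, and `error_fn` the test error of the network.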

