SIGNS IN THE LOTTERY: STRUCTURAL SIMILARITIES BETWEEN WINNING TICKETS

Abstract

Winning tickets are sparse subnetworks of a deep network that can be trained in isolation to the same performance as the full network. Winning tickets have been found in many different contexts; however, their structural characteristics are not well understood. We propose that the signs of the connections in winning tickets play a crucial role. We support this claim by introducing a sign-based structural comparison metric that distinguishes winning tickets from other sparse networks. We further analyze typical (signed) patterns in convolutional kernels of winning tickets and find structures that resemble patterns found in trained networks.

1. INTRODUCTION

The lottery ticket hypothesis (Frankle & Carbin, 2019) claims the existence of sparse trainable subnetworks in a given deep network, the so-called winning tickets. While it has long been known that artificial neural networks can be pruned significantly (removing more than 90% of the parameters) after training without impacting their performance (Liang et al., 2021; Blalock et al., 2020), such sparse networks usually resist training when their weights are initialized randomly: they need longer to converge and usually reach lower performance.

One interpretation of winning tickets is that they form the trainable core of a dense network, with the other parameters essentially being dead weights. This overparameterization of the dense network is still useful, as it combinatorially expands the number of subnetworks and hence the chance of containing a winning ticket. Winning tickets have been reliably found for different types of deep networks by pruning connections from randomly initialized dense networks. However, the known pruning approaches are quite laborious, often requiring more resources than simply training the original dense network. A good characterization of what distinguishes winning tickets from other sparse networks still seems to be missing. A better understanding of these properties could not only help in designing more efficient pruning algorithms, but might also allow devising new initialization schemes for deep networks, leading to smaller networks and more efficient training.

In this paper, we address the question of whether winning tickets show specific structural characteristics that allow us to distinguish them from other sparse networks. We claim that it is not merely their sparse structure but also the signs of their connections that should be considered.
After briefly reviewing previous work and motivating our approach (section 2), we introduce a sign-aware structural distance metric for sparse networks (section 3), explain our experimental setup (section 4), and apply our metric to the generated winning tickets (section 5). We then complement this quantitative analysis with a more qualitative inspection of spatial structures found in winning tickets (section 6) and conclude by summarizing our findings (section 7).

2. RELATED WORK

Since their discovery by Frankle & Carbin (2019), winning tickets have attracted a lot of attention, and several works have shown their existence for different types of networks and datasets, although for larger networks some additional warmup training seems to be required (Frankle et al., 2019). On the other hand, Ramanujan et al. (2020) have shown that there exist subnetworks that already perform significantly above chance level without any training (specifically, they extracted a subnetwork of an untrained ResNet-50 that performs like a trained ResNet-34), and Malach et al. (2020) show, based on statistical arguments, that such networks should always exist. Frankle et al. (2021) claim that there are networks that not only show good initial performance but can also be trained. Chen et al. (2021) propose a method to transfer a winning ticket found in one network to another network architecture, suggesting an underlying winning structure, and Movva et al. (2021) analyze the structural overlap of winning tickets obtained from the same initial network. Closest related to our study is the work by Zhou et al. (2019), who conducted several experiments to analyze properties of winning tickets, including experiments investigating the role of connection signs. They tried different approaches to change the connection weights of a given winning ticket: sampling weights from the original initialization distribution, reshuffling the connection weights, or assigning a constant value to all connections. While such a reinitialization typically destroys the winning property of a ticket, they observed that if the signs of the connections are kept, the resulting network still shows performance comparable to the original winning ticket. It is this observation that motivates us to take the sign into account when comparing network structure.

3. STRUCTURAL COMPARISON OF SPARSE NETWORKS

In contrast to dense networks, sparse networks open a way to structural analysis based on their connection graph. For example, one can use methods from graph theory to compare the structure of two sparse networks N1 and N2. One such approach is the Neural Network Sparse Topology Distance (NNSTD) by Liu et al. (2020). This metric operates in a layer-wise fashion, considering for each pair of units the sets of incoming connections G1 and G2. It uses the normalized edit distance (NED), a normalized version of the graph edit distance, which counts the number of edits (additions and removals of connections) required to transform one set of connections into the other. To account for differences between the two networks, the units in two corresponding layers can be permuted so as to minimize the overall edit distance. The distance between N1 and N2 is then obtained by averaging over the layer distances.

While the NNSTD can be used to find structural similarities between sparse networks, the structure of a network cannot be the sole factor distinguishing a winning ticket from other sparse networks: reinitializing a winning ticket with random weights usually destroys its winning property, whereas a winning ticket whose connection signs are kept tends to stay trainable (Zhou et al., 2019). Hence we propose a structural similarity metric that takes the sign of a connection into account. To obtain such a sign-aware metric, we adapt the NNSTD in the following way: instead of considering the set G of all incoming connections to a unit, we split that set into two parts, G+ and G-, consisting of only the positive and negative connections, respectively. The sign-aware NED± is then defined as the arithmetic mean of the NED on the positive and negative parts:

NED±(G1, G2) = (NED(G1+, G2+) + NED(G1-, G2-)) / 2

This metric penalizes a connection that appears with different signs in N1 and N2 more than a connection that is simply missing in the other network.
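A minimal sketch of the NED± computation may clarify the definition. It assumes that the NED between two connection sets is the size of their symmetric difference normalized by the size of their union; the function names and the dict-based weight representation are ours, chosen for illustration:

```python
def ned(g1: set, g2: set) -> float:
    """Normalized edit distance between two connection sets:
    the number of additions/removals needed to turn g1 into g2,
    normalized by the size of the union (0 when both are empty)."""
    union = g1 | g2
    if not union:
        return 0.0
    return len(g1 ^ g2) / len(union)

def ned_signed(w1: dict, w2: dict) -> float:
    """Sign-aware NED±: the mean of the NED on the positive and
    negative connection subsets. w1 and w2 map connection ids to
    weights; pruned connections are simply absent from the dict."""
    g1_pos = {c for c, w in w1.items() if w > 0}
    g1_neg = {c for c, w in w1.items() if w < 0}
    g2_pos = {c for c, w in w2.items() if w > 0}
    g2_neg = {c for c, w in w2.items() if w < 0}
    return 0.5 * (ned(g1_pos, g2_pos) + ned(g1_neg, g2_neg))
```

A connection whose sign differs between the two networks contributes to both the positive and the negative term, while a connection missing in one network contributes to only one of them, which yields the stronger sign-mismatch penalty noted above.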
One may choose to put more emphasis on mismatches in positive or negative connections by taking a weighted average; indeed, in our experiments we observe that the two signs are usually not equally represented in winning tickets. However, for this paper we stick to the symmetric definition.

Convolutional layers require special treatment: we consider such a layer to consist of multiple filters, each a stack of 2-dimensional kernels that are moved over the input tensor. We treat each filter as a unit, taking its set of incoming connections to be the filter entries that have not been pruned. The NED± is then computed for each pair of filters from the corresponding layers of N1 and N2 to obtain the NNSTD± for that layer.
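The per-layer computation can be sketched as follows, treating each filter as a unit whose connections are its unpruned entries, and using an optimal assignment (here via SciPy's linear_sum_assignment) in place of the unit permutation that minimizes the average distance. The helper names and the convention that pruned entries are stored as exact zeros are our assumptions:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def ned(g1: set, g2: set) -> float:
    """Normalized edit distance between two connection sets."""
    union = g1 | g2
    return len(g1 ^ g2) / len(union) if union else 0.0

def filter_sets(filt: np.ndarray):
    """Split a (possibly pruned) filter into its positive and negative
    connection sets; pruned entries are assumed to be exact zeros."""
    idx = np.arange(filt.size)
    w = filt.ravel()
    return set(idx[w > 0]), set(idx[w < 0])

def layer_nnstd_signed(filters1, filters2) -> float:
    """NNSTD± for one conv layer: build the pairwise NED± cost matrix
    between the filters (units) of the two networks, then match the
    filters so that the average distance is minimized."""
    cost = np.zeros((len(filters1), len(filters2)))
    for i, f1 in enumerate(filters1):
        p1, n1 = filter_sets(f1)
        for j, f2 in enumerate(filters2):
            p2, n2 = filter_sets(f2)
            cost[i, j] = 0.5 * (ned(p1, p2) + ned(n1, n2))
    rows, cols = linear_sum_assignment(cost)  # optimal filter matching
    return cost[rows, cols].mean()
```

Averaging these layer values over all layers then gives the network-level NNSTD±, mirroring the layer-wise averaging of the original NNSTD.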

4. EXPERIMENTAL SETUP

In designing our experiments, we follow the setting in the original paper by Frankle & Carbin (2019), using the same Conv-2 architecture, consisting of two convolutional layers with 64 kernels of size

