HOW ERDÖS AND RÉNYI WIN THE LOTTERY

Abstract

Random masks define surprisingly effective sparse neural network models, as has been shown empirically. The resulting Erdös-Rényi (ER) random graphs can often compete with dense architectures, and even state-of-the-art lottery ticket pruning algorithms struggle to outperform them, although the random baselines do not rely on computationally expensive pruning-training iterations and can be drawn at initialization without significant computational overhead. We offer a theoretical explanation of how such ER masks can approximate arbitrary target networks if they are wider by a logarithmic factor in the inverse sparsity, 1/log(1/sparsity). While we are the first to show theoretically and experimentally that random ER source networks contain strong lottery tickets, we also prove the existence of weak lottery tickets that require a lower degree of overparameterization than strong lottery tickets. These unusual results are based on the observation that ER masks are well trainable in practice, which we verify in experiments with varied choices of random masks. Some of these data-free choices outperform previously proposed random approaches on standard image classification benchmark datasets.

1. INTRODUCTION

The impressive breakthroughs achieved by deep learning have largely been attributed to the extensive overparameterization of deep neural networks, as it seems to have multiple benefits for their representational power and optimization (Belkin et al., 2019). The resulting trend towards ever larger models and datasets, however, imposes increasing computational and energy costs that are difficult to meet. This raises the question: Is this high degree of overparameterization truly necessary? Training general small-scale or sparse deep neural network architectures from scratch remains a challenge for standard initialization schemes (Li et al., 2016; Han et al., 2015). However, Frankle & Carbin (2019) have recently demonstrated that there exist sparse architectures that can be trained to solve standard benchmark problems competitively. According to their Lottery Ticket Hypothesis (LTH), dense randomly initialized networks contain subnetworks that can be trained in isolation to a test accuracy comparable to that of the original dense network. Such subnetworks, the lottery tickets (LTs), have since been identified as weak lottery tickets (WLTs) by pruning algorithms that require computationally expensive pruning-retraining iterations (Frankle & Carbin, 2019; Tanaka et al., 2020) or mask learning procedures (Savarese et al., 2020; Sreenivasan et al., 2022b). While these can lead to computational gains at training and inference time and reduce memory requirements (Hassibi et al., 1993; Han et al., 2015), the real goal remains to identify good trainable architectures before training, as this could lead to significant computational savings. Yet, contemporary pruning-at-initialization approaches (Lee et al., 2018; Wang et al., 2020; Tanaka et al., 2020; Fischer & Burkholz, 2022; Frankle et al., 2021) achieve less competitive performance.
For that reason it is remarkable that even iterative state-of-the-art approaches struggle to outperform a simple, computationally cheap, and data-independent alternative: random pruning at initialization (Su et al., 2020). Liu et al. (2021) have provided systematic experimental evidence for its 'unreasonable' effectiveness in multiple settings, including complex, large-scale architectures and data. We explain theoretically why such networks can be effective by proving that randomly masked networks, so-called Erdös-Rényi (ER) networks, contain lottery tickets under realistic conditions. Our results imply that sparse ER networks are highly expressive and have the universal function approximation property like dense networks. This insight also provides a missing piece in the theoretical foundation for dynamic sparse training approaches (Evci et al., 2020a; Liu et al., 2021; Bellec et al., 2018) that start pruning from a random ER network instead of a dense one.

Most theoretical results pertaining to LTs focus on the existence of strong lottery tickets (SLTs) (Malach et al., 2020; Pensia et al., 2020; Fischer et al., 2021; da Cunha et al., 2022; Burkholz, 2022a;b). These are subnetworks of large, randomly initialized source networks, which do not require any further training after pruning. Ramanujan et al. (2020) have provided experimental evidence for their existence and suggested that training neural networks could be achieved by pruning alone. By modifying their proposed algorithm, edge-popup (EP), we show experimentally that SLTs are contained in randomly masked ER source networks. We also prove this existence rigorously by transferring the construction of Burkholz (2022a) to random networks. This introduces an additional factor 1/log(1/sparsity) in the lower bound on the width of the source network that guarantees existence with high probability. In contrast to previous works on SLTs, we also prove the existence of weak LTs.
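To make the setup concrete, the following minimal sketch draws a random ER mask in NumPy: every weight of a layer is kept independently with the same probability, the density (= 1 - sparsity). The function and variable names here are our own illustration, not taken from any released code.

```python
import numpy as np

def draw_er_mask(shape, density, rng=None):
    """Sample an Erdös-Rényi (ER) mask: each weight is kept
    independently with probability `density` = 1 - sparsity."""
    rng = np.random.default_rng() if rng is None else rng
    return rng.random(shape) < density

# A random ER network is simply a dense initialization times its fixed mask.
rng = np.random.default_rng(0)
w = rng.standard_normal((256, 256))               # dense init of one layer
mask = draw_er_mask(w.shape, density=0.1, rng=rng)
w_sparse = w * mask                               # roughly 90% of entries are zero
```

The mask is fixed before training; only the surviving weights are ever updated, which is what makes this baseline data independent and essentially free to construct.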
Since every strong LT is also a weak LT, the theory for strong LTs formally also covers the existence of weak LTs. However, experiments and theoretical derivations suggest that limiting ourselves to SLTs leads to LTs with lower sparsity than what can be achieved by WLT algorithms (Fischer & Burkholz, 2022). In line with this observation, we derive improved existence results for WLTs. However, these cannot overcome the overparameterization factor 1/log(1/sparsity), which is required even under ideal conditions, as we argue by deriving a lower bound on the required width. Our strategy relies on a property of ER networks that is crucial for their effectiveness: they are well trainable with standard initialization approaches. With various experiments on benchmark image data and commonly used neural network architectures, we verify the validity of this assumption, complementing experiments by Liu et al. (2021) for different choices of sparsity ratios. This demonstrates that multiple choices can lead to competitive results. Some of these choices outperform previously proposed random masks (Liu et al., 2021), highlighting the potential for tuning sparsity ratios in applications.

Contributions. In summary, our contributions are as follows: 1) We show theoretically and empirically that ER random networks contain LTs with high probability if the ER source network is wider than a target network by a factor 1/log(1/sparsity). 2) We prove the existence of strong as well as weak LTs in random ER source networks. 3) In support of our theory, we verify in experiments that ER networks are well trainable with standard initialization schemes for various choices of layerwise sparsity ratios. 4) We propose two data-independent, flow-preserving, and computationally cheap approaches to draw random ER masks with defined layerwise sparsity ratios, balanced and pyramidal. These can outperform previously proposed choices on standard architectures, highlighting the potential benefits of tuning ER sparsity ratios. 5) Our theory explains why ER networks are likely not competitive at extreme sparsities. This can be remedied, however, by targeted rewiring of random edges as proposed by dynamic sparse training, for which our analysis provides theoretical support.
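The exact definitions of the balanced and pyramidal ratios appear later in the paper; the sketch below is only our illustrative guess at the general idea behind such layerwise allocations (a uniform density per layer versus a density that decays with depth under a fixed global parameter budget), not the paper's actual formulas. Both helper names and the geometric decay are our own assumptions.

```python
import numpy as np

def balanced_densities(layer_sizes, global_density):
    # Illustrative "balanced" allocation: identical density in every layer.
    return [global_density] * len(layer_sizes)

def pyramidal_densities(layer_sizes, global_density, decay=0.5):
    # Illustrative "pyramidal" allocation: density shrinks geometrically with
    # depth, rescaled so the total kept-parameter budget matches the global one.
    raw = decay ** np.arange(len(layer_sizes))
    params = np.asarray(layer_sizes, dtype=float)
    scale = global_density * params.sum() / (raw * params).sum()
    return np.clip(raw * scale, 0.0, 1.0).tolist()
```

Note that the clipping step can slightly shrink the kept budget if an early layer saturates at density 1; a practical variant would redistribute that excess to later layers.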

1.1. RELATED WORK

Algorithms to prune neural networks can be broadly categorized into two groups: pruning after training and pruning before training. The first group of algorithms, which prune after training, is effective in speeding up inference, but these methods still rely on a computationally expensive training procedure (Hassibi et al., 1993; LeCun et al., 1989; Molchanov et al., 2016; Dong et al., 2017; Yu et al., 2022). The second group of algorithms prunes at initialization (Lee et al., 2018; Wang et al., 2020; Tanaka et al., 2020; Sreenivasan et al., 2022b) or follows a computationally expensive cycle of pruning and retraining for multiple iterations (Gale et al., 2019; Savarese et al., 2020; You et al., 2019; Frankle & Carbin, 2019; Renda et al., 2019; Dettmers & Zettlemoyer, 2019). These methods find trainable subnetworks, i.e., WLTs (Frankle & Carbin, 2019). Single-shot pruning approaches are computationally cheaper but are susceptible to problems like layer collapse, which renders the pruned network untrainable (Lee et al., 2018; Wang et al., 2020). Tanaka et al. (2020) address this issue by preserving flow in the network through their scoring mechanism. The best-performing WLTs are still obtained by expensive iterative pruning methods like Iterative Magnitude Pruning (IMP) and Iterative Synflow (Frankle & Carbin, 2019; Fischer & Burkholz, 2022), or by training mask parameters of dense networks (Sreenivasan et al., 2022b; Savarese et al., 2020).
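To make the prune-after-training paradigm concrete, here is a minimal sketch of one-shot global magnitude pruning in the spirit of Han et al. (2015); it is our own simplified illustration, not a reference implementation from any of the cited works. Iterative methods like IMP interleave this step with retraining and rewinding.

```python
import numpy as np

def magnitude_prune(weights, sparsity):
    """Zero out the globally smallest-magnitude fraction `sparsity`
    of all entries across the given list of weight arrays."""
    flat = np.concatenate([np.abs(w).ravel() for w in weights])
    k = int(sparsity * flat.size)     # number of weights to remove
    if k == 0:
        return [w.copy() for w in weights]
    # Magnitudes strictly below the k-th smallest survivor are pruned.
    threshold = np.partition(flat, k)[k]
    return [w * (np.abs(w) >= threshold) for w in weights]
```

Because the threshold is computed globally, layers with many small weights are pruned more aggressively, which is one source of the layer-collapse failure mode mentioned above.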

However, Su et al. (2020) found that ER masks can outperform expensive iterative pruning strategies in different situations. Inspired by this finding, Golubeva et al. (2021) and Chang et al. (2021) have hypothesized that sparse overparameterized networks are more effective than smaller networks with

