REVISITING PRUNING AT INITIALIZATION THROUGH THE LENS OF RAMANUJAN GRAPH

Abstract

Pruning neural networks at initialization (PaI) has received an upsurge of interest due to its end-to-end saving potential. PaI is able to find sparse subnetworks at initialization that can achieve performance comparable to the full networks. These methods can surpass the trivial baseline of random pruning but suffer from a significant performance gap compared to post-training pruning. Previous approaches firmly rely on weights, gradients, and sanity checks as primary signals when conducting PaI analysis. To better understand the underlying mechanism of PaI, we propose to interpret it through the lens of the Ramanujan graph, a class of expander graphs that are sparse while being highly connected. It is often believed that there should be a strong correlation between the Ramanujan graph and PaI, since both are about finding sparse and well-connected neural networks. However, the finer-grained link relating highly sparse and connected networks to their relative performance (i.e., the ranking of different sparse structures at the same global sparsity) is still missing. We observe that the Ramanujan property of sparse networks not only shows no significant relationship to PaI's relative performance, but maximizing it can also lead to the formation of pseudo-random graphs with no structural meaning. We reveal the underlying cause to be the Ramanujan graph's strong assumption on the upper bound of the largest nontrivial eigenvalue (μ) of layers belonging to highly sparse networks. We hence propose the Iterative Mean Difference of Bound (IMDB) as a means to relax the μ upper bound. Likewise, we also show there exists a lower bound for μ, which we call the Normalized Random Coefficient (NaRC), that gives us an accurate assessment of when a sparse but highly connected structure degenerates into naive randomness. Finally, we systematically analyze the behavior of various PaI methods and demonstrate the utility of our proposed metrics in characterizing PaI performance.
We show that subnetworks better preserving the IMDB property correlate with higher performance, while NaRC provides us with a possible means to locate the region where highly connected, highly sparse, and non-trivial Ramanujan expanders exist.

1. INTRODUCTION

Deep neural networks (DNNs) have demonstrated remarkable performance as they increase in size, i.e., test accuracy scales as a power law with respect to model size and training data size (Hestness et al., 2017; Kaplan et al., 2020; Brown et al., 2020; Srivastava et al., 2022). Yet, the memory requirements and computational costs associated with the increased model size also grow prohibitively. Modern DNNs are widely recognized to be over-parameterized, and it has been shown that eliminating a significant number of parameters in a trained DNN does not compromise its performance (Han et al., 2015c; He et al., 2017). This over-parameterization property enables researchers to continually propose increasingly effective DNN pruning approaches that dramatically shrink model size while maintaining performance. The resultant sparse models can then be used with software and hardware optimized for sparsity, leading to faster training and inference. Neural network pruning can generally be divided into three categories, depending on when pruning happens relative to training: post-training pruning (Mozer & Smolensky, 1989; Han et al., 2015a), during-training pruning (Gale et al., 2019; Louizos et al., 2018), and pre-training pruning (Lee et al., 2019; Wang et al., 2020). For instance, post-training pruning methods are generally effective when the primary goal is to reduce inference cost. However, these methods require fully training the dense model first, potentially multiple times if iterative pruning and retraining are used. With the prevalence of powerful foundation models such as GPT-3 (Brown et al., 2020), PaLM (Chowdhery et al., 2022), and DALL·E 2 (Ramesh et al., 2022), the prohibitively high cost of training these large models makes post-training pruning impractical.
Therefore, pre-training pruning or pruning at initialization (PaI) is becoming increasingly attractive due to its potential to save time and resources end-to-end by using a sparse DNN architecture from the outset. Since the limited information (e.g., magnitude, gradient, and Hessian) that PaI methods have access to can be very noisy (Frankle et al., 2020), pruning criteria based on such information are often ineffective. We conjecture that the graph topology of sparse neural networks, being relatively overlooked, can be an essential source of information for pruning at initialization. Graph theory has recently emerged as a particularly advantageous tool for analyzing DNN architectures. Prabhu et al. (2018) show that maximizing good graph connectivity, i.e., maximizing a graph's expansion ratio, correlates with higher performance in hardware-efficient structured masks and lottery tickets (Frankle & Carbin, 2019). Unfortunately, prior efforts did not consider the pseudo-randomness that naturally emerges from very good expander graphs. By maximizing sparse graph connectivity, they unwittingly prioritize the formation of naive random graphs with no intrinsic structural meaning.

In this paper, we study PaI from the perspective of Ramanujan bipartite graphs. The Ramanujan graph is a special member of the bounded-degree expander family whose spectral bound is optimal (Nilli, 1991), allowing the maximum possible sparsity of a network while preserving its connectivity. The Ramanujan graph is thus intuitively well aligned with the main goal of PaI, i.e., finding sparse and well-connected neural networks. However, we find that there is still a missing link correlating the degree of connectivity to relative performance ranking at a particular sparsity. In addition, we show that in situations where highly sparse and highly connected structures are demanded, it is easy to unwittingly generate pseudo-random graphs with no structural meaning.
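The Ramanujan condition referenced above can be made concrete with a few lines of linear algebra. The sketch below is an illustration rather than this paper's method: for a non-bipartite d-regular graph, it checks whether the largest nontrivial eigenvalue μ of the adjacency matrix stays under the 2√(d−1) bound, using the complete graph K_6 as a toy example (for bipartite graphs, −d is also a trivial eigenvalue and would need to be skipped as well).

```python
import numpy as np

def is_ramanujan(adj):
    """Check the Ramanujan condition for a non-bipartite d-regular graph:
    mu (largest nontrivial eigenvalue magnitude) <= 2 * sqrt(d - 1)."""
    degrees = adj.sum(axis=1)
    assert np.all(degrees == degrees[0]), "graph must be d-regular"
    d = int(degrees[0])
    # Adjacency matrix is symmetric; sort |eigenvalues| in descending order.
    eigs = np.sort(np.abs(np.linalg.eigvalsh(adj)))[::-1]
    mu = eigs[1]  # eigs[0] is the trivial eigenvalue d
    bound = 2.0 * np.sqrt(d - 1)
    return mu, bound, mu <= bound

# Complete graph K_6 (5-regular): spectrum {5, -1 x5}, so mu = 1, bound = 4.
n = 6
adj = np.ones((n, n)) - np.eye(n)
mu, bound, ok = is_ramanujan(adj)
```

K_6 comfortably satisfies the bound; graphs of interest in PaI sit much closer to it, which is why the tightness of the bound (and its behavior under irregularity) matters.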
We reveal the underlying cause for such undesirable situations to be the Ramanujan graph's strong assumption on the upper bound of the largest nontrivial eigenvalue (μ) of layers belonging to highly sparse networks. We hence propose the Iterative Mean Difference of Bound (IMDB) as a means to relax the μ upper bound. Likewise, we also show there exists a lower bound for μ, which we call the Normalized Random Coefficient (NaRC), that gives us an accurate assessment of when a sparse but highly connected structure deteriorates into randomness. Leveraging our (adjusted) Ramanujan graph-based framework, we then extensively investigate (1) whether the sparse structures generated by existing PaI approaches follow the Ramanujan characteristics, (2) whether there exists a correlation between the Ramanujan graph property and the inference performance of the sparse structure, and (3) whether the sparse structure is also a random graph. Ultimately, we aim to shed light on a new perspective on PaI effectiveness independent of weights, gradients, and losses. Our contributions are summarized as follows:

• We are the first to reveal that the utility of the Ramanujan property is largely limited in analyzing irregular graphs at high sparsity, which is often the case for PaI-generated subnetworks.
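To make the layer-wise quantity μ concrete for PaI analysis: the binary pruning mask of a fully connected layer can be viewed as the biadjacency matrix of a bipartite graph (input neurons on one side, output neurons on the other), whose adjacency eigenvalues are ± the singular values of the mask. The sketch below computes μ this way and compares it against an average-degree relaxation of the (c,d)-biregular bound √(c−1) + √(d−1); this is an illustrative assumption about how such a comparison could be set up, not the paper's exact IMDB or NaRC formulas.

```python
import numpy as np

def layer_spectral_gap(mask):
    """Treat a binary pruning mask as the biadjacency matrix of a
    bipartite graph and compare its largest nontrivial eigenvalue
    against an expander-style bound using average degrees."""
    mask = np.asarray(mask, dtype=float)
    # Adjacency eigenvalues of a bipartite graph are +/- the singular
    # values of the biadjacency matrix; sigma_1 is the trivial one.
    svals = np.linalg.svd(mask, compute_uv=False)
    mu = svals[1]
    # Average left/right degrees stand in for (c, d) of a biregular
    # graph, for which Ramanujan means mu <= sqrt(c-1) + sqrt(d-1).
    d_left = mask.sum(axis=1).mean()
    d_right = mask.sum(axis=0).mean()
    bound = np.sqrt(max(d_left - 1.0, 0.0)) + np.sqrt(max(d_right - 1.0, 0.0))
    return mu, bound

rng = np.random.default_rng(0)
mask = (rng.random((32, 64)) < 0.2).astype(float)  # ~80% sparse layer
mu, bound = layer_spectral_gap(mask)
```

For irregular masks like those PaI produces, the choice of degree statistic in the bound is exactly where the strong assumption discussed above bites, motivating the relaxation.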

2. RELATED WORK

The concept of pruning at initialization (PaI) was first introduced in SNIP (Lee et al., 2019), which removes structurally unimportant connections at initialization via the proposed connection sensitivity. Follow-up works (Wang et al., 2020; Tanaka et al., 2020; Patil & Dovrolis, 2021) propose advanced pruning criteria to improve the performance of PaI. GraSP (Wang et al., 2020) aims to maintain the weights that maximize gradient flow. SynFlow (Tanaka et al., 2020) finds that previous PaI methods are prone to layer collapse and adopts iterative pruning to address it. Despite these advances, PaI still lags behind post-training pruning in terms of performance. A study by Frankle et al. (2021) suggests that connection ambiguity may explain the performance deficit, based on the finding that PaI exhibits surprising resilience against layer-wise random mask shuffling and weight re-initialization. Prior works on PaI mainly focus on "training signals" such as gradient flow (Wang et al., 2020), layer collapse (Tanaka et al., 2020), or sanity checks (random weight shuffling and re-initialization) (Frankle et al., 2021).
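As a point of reference for these criteria, the SNIP-style connection sensitivity can be sketched in a few lines: score each weight by |w · ∂L/∂w| and keep the top fraction globally. This is a simplified single-batch version with hypothetical helper names, not the authors' implementation; the gradient here is a random stand-in for one mini-batch gradient.

```python
import numpy as np

def snip_scores(weight, grad):
    """SNIP-style connection sensitivity: |w * dL/dw|, normalized to sum to 1."""
    s = np.abs(weight * grad)
    return s / s.sum()

def prune_by_score(weight, scores, sparsity):
    """Keep the top-(1 - sparsity) fraction of connections by score."""
    k = int(round((1.0 - sparsity) * weight.size))
    thresh = np.sort(scores.ravel())[::-1][k - 1]  # k-th largest score
    return (scores >= thresh).astype(float)

rng = np.random.default_rng(1)
w = rng.normal(size=(16, 16))
g = rng.normal(size=(16, 16))   # stand-in for one mini-batch gradient
mask = prune_by_score(w, snip_scores(w, g), sparsity=0.9)
```

GraSP and SynFlow follow the same prune-by-score template but swap in gradient-flow and iterative synaptic-flow scores, respectively; the graph-based perspective of this paper instead asks what topology such masks end up with.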

For example, You et al. (2020) offer an efficient graph representation of DNNs using relational graphs to formulate an efficient model generator. Liu et al. (2020) analyze sparse neural networks with graph distance and show a plenitude of sparse sub-networks with very different topologies that achieve similar performance. Vooturi et al. (2020); Pal et al. (2022); Bhardwaj et al. (2021); and Prabhu et al. (2018) further relate graph-theoretic properties, such as a graph's expansion, to the performance of sparse networks.

