UNDERSTANDING PRUNING AT INITIALIZATION: AN EFFECTIVE NODE-PATH BALANCING PERSPECTIVE

Abstract

Pruning at initialization (PaI) methods aim to remove weights of neural networks before training in order to reduce training costs. While current PaI methods are promising and outperform random pruning, much work remains to understand and improve them so that they match the performance of pruning after training. In particular, recent studies (Frankle et al., 2021; Su et al., 2020) present empirical evidence for the potential of PaI and reveal intriguing properties, for instance that randomly shuffling the connections of pruned networks within each layer preserves or even improves performance. Our paper gives new perspectives on PaI from the geometry of subnetwork configurations. We propose two quantities to probe the shape of subnetworks: the number of effective paths and the number of effective nodes (or channels). Using these quantities, we provide a principled framework to better understand PaI methods. Our main findings are: (i) the width of subnetworks matters at regular sparsity levels (< 99%), which matches the competitive performance of layerwise-shuffled subnetworks; (ii) node-path balancing plays a critical role in the quality of PaI subnetworks, especially in extreme sparsity regimes. These findings open an important direction for network pruning that takes into account the subnetwork topology itself. To illustrate the promise of this direction, we present a fairly naive method based on SynFlow (Tanaka et al., 2020) and conduct extensive experiments on different architectures and datasets to demonstrate its effectiveness.

1. INTRODUCTION

Deep neural networks have achieved state-of-the-art performance in a wide range of machine learning applications (Brown et al., 2020; Dosovitskiy et al., 2021; Ramesh et al., 2021; Radford et al., 2021). However, their huge computational resource requirements limit their applications, especially in edge computing and other future smart cyber-physical systems (Hinton et al., 2015; Zhao et al., 2019; Price & Tanner, 2021; Yuan et al., 2021; Bithika et al., 2022). To overcome this issue, there have been a number of attempts to reduce the size of such networks without compromising their performance, among which pruning enjoys significant interest (Hoefler et al., 2021; Deng et al., 2020; Cheng et al., 2018). A rationale for this direction is the work of Frankle & Carbin (2018), in which the authors provide empirical evidence for the existence of sparse subnetworks that can be trained from scratch to achieve performance similar to the original network, referred to as Lottery Tickets. However, standard methods for finding such subnetworks typically involve a costly pre-training and iterative magnitude pruning process. This raises an intriguing research question: how can we identify sparse, trainable subnetworks at initialization, without pre-training? A successful pruning-before-training method can significantly reduce both memory cost and runtime without sacrificing much performance (Wang et al., 2022), which would make neural networks applicable even in scenarios with scarce computing resources (Alizadeh et al., 2022; Yuan et al., 2021). As such, many methods for PaI have been proposed (Lee et al., 2019; Tanaka et al., 2020; de Jorge et al., 2021; Wang et al., 2020; Alizadeh et al., 2022). While these methods are based on a number of intuitions (e.g., leveraging gradient information), they typically measure the importance of individual network parameters. More recently, Frankle et al. (2021) and Su et al.
(2020) observe a rather surprising phenomenon: for PaI methods, layerwise shuffling of the connections of pruned subnetworks does not reduce performance. A surprising consequence is that layerwise sparsity ratios matter more than the weight-level importance scores of the subnetwork (Frankle et al., 2021). This indicates that, in searching for good subnetworks at initialization, the topology of subnetworks, in particular the number of input-output paths and active nodes, plays a vital role and should be investigated more extensively. While the findings of Frankle et al. (2021) and Su et al. (2020) indicate that PaI methods are insensitive to random shuffling, we find this does not hold in the extreme sparsity regime (> 99%), in which the number of effective connections in subnetworks is highly sensitive to shuffling. In our layerwise shuffling experiments (see Section 3.3), shuffling connections yields more effective nodes but substantially fewer input-output paths. At regular sparsity levels, shuffling weights can maintain or even increase the number of effective parameters (Frankle et al., 2021) and of activated nodes (Patil & Dovrolis, 2021). Having more effective nodes after shuffling enhances the representation capacity of subnetworks, while the number of effective paths remains sufficient to preserve the information flow, leading to competitive or even better performance of shuffled subnetworks (Section 3.3). However, we show empirically that at extreme sparsity levels, while shuffling still largely preserves the number of remaining attached nodes, the performance of shuffled subnetworks may drop significantly compared to their unshuffled counterparts. This is because the information flow is hampered by the sharp decrease in input-output paths.
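The two quantities discussed here can be computed directly from a subnetwork's binary masks. The sketch below is our own illustration (not code from the paper): it counts input-output paths and on-path nodes of a masked MLP by propagating path counts forward and backward through the masks, and implements the kind of layerwise shuffling used in the sanity checks. For very deep or wide networks the path counts grow rapidly, so in practice floating-point or log-space accumulation may be needed.

```python
import numpy as np

def effective_paths(masks):
    """Number of input-output paths; masks[l] has shape (n_out_l, n_in_l)."""
    v = np.ones(masks[0].shape[1])
    for m in masks:
        v = m @ v  # v[i] = number of paths from the input layer to node i
    return float(v.sum())

def effective_nodes(masks):
    """Nodes lying on at least one input-output path (all layers counted)."""
    fwd = [np.ones(masks[0].shape[1])]           # path counts from the input
    for m in masks:
        fwd.append(m @ fwd[-1])
    bwd = [np.ones(masks[-1].shape[0])]          # path counts to the output
    for m in reversed(masks):
        bwd.append(m.T @ bwd[-1])
    bwd.reverse()
    # a node is effective iff it is reachable from the input AND reaches the output
    return sum(int(np.count_nonzero((f > 0) & (b > 0)))
               for f, b in zip(fwd, bwd))

def layerwise_shuffle(masks, seed=0):
    """Permute surviving connections within each layer (sparsity preserved)."""
    rng = np.random.default_rng(seed)
    return [rng.permutation(m.ravel()).reshape(m.shape) for m in masks]
```

On a small two-layer mask one can verify that shuffling leaves each layer's number of surviving weights unchanged, while the path and node counts generally change.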
These findings suggest that considering effective paths or nodes separately is inadequate to fully capture the behaviors of subnetworks generated by PaI methods. In addition, we design a simple toy experiment with an MLP as the base architecture. We randomly generate 100 subnetworks at the same sparsity level of 95%, while ensuring that all nodes in the input and output layers are activated; all networks are trained to convergence with the same setting and tested on the MNIST dataset (more details in Appendix I). As depicted in Figure 1, these subnetworks have different numbers of effective nodes and paths. We observe that subnetworks with higher numbers of both effective nodes and effective paths tend to perform better. This highlights the essential role of considering nodes and paths simultaneously when designing subnetworks at initialization. To address this, we introduce a novel framework that combines metrics from previous works (Tanaka et al., 2020; Patil & Dovrolis, 2021; Frankle et al., 2021) to provide a more accurate explanation of the performance of different approaches. In particular, we propose the joint use of the number of input-output paths (a.k.a. effective paths) and of activated (i.e., effective) nodes to explain the different behaviors of PaI methods more comprehensively (see Section 3.3 for more details). We also demonstrate the usefulness of this framework: with a simple modification at the base of the iterative pruning algorithm, we show that keeping both the effective path and effective node counts high simultaneously enhances the quality of subnetworks. In summary, our main contributions are:
• We propose to systematically use the topology of subnetworks, particularly the numbers of effective nodes and paths, as proxies for the performance of PaI methods (Section 3).
• We revisit the layerwise shuffling sanity check on subnetworks produced by existing PaI methods and provide unified explanations for their behaviors based on these metrics across a wide range of sparsities (Section 3.3).
• We discover a new relation between the proposed metrics and the performance of subnetworks, termed the Node-Path Balancing Principle, which suggests that a nontrivial balance between nodes and paths is necessary for optimal performance of PaI methods (Section 4). We introduce a simple modification of SynFlow (Tanaka et al., 2020) as a proof-of-concept for our principle, and we perform extensive experiments to show that better balancing this trade-off in pruned networks leads to improved performance (Section 5).
• Our framework opens a novel research direction that advocates taking into account both paths and nodes in the design of PaI methods. More precisely, while the regular sparsity regime gives effective nodes higher priority than effective paths, extremely sparse subnetworks demand a more delicate balance between the two metrics (Section 5).
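As background for the SynFlow-based modification studied later: SynFlow scores weights by the saliency |w · ∂R/∂w| of the objective R = 1ᵀ(∏_l |W_l|)1. For a purely linear chain of dense layers this gradient factorizes, so the score of weight w_ij in layer l is (number of paths from node i to the output) × (number of paths from the input to node j) × |w_ij|. The snippet below is a hedged NumPy illustration of this factorization, not the authors' modified algorithm; actual SynFlow computes the same gradient by autodiff on the full architecture.

```python
import numpy as np

def synflow_saliency(weights):
    """SynFlow scores |w * dR/dw| for a linear chain of dense layers,
    where R = 1^T (prod_l |W_l|) 1 and weights[l] has shape (n_out_l, n_in_l)."""
    A = [np.abs(w) for w in weights]
    fwd = [np.ones(A[0].shape[1])]
    for a in A:                        # forward pass on an all-ones input
        fwd.append(a @ fwd[-1])
    bwd = [np.ones(A[-1].shape[0])]
    for a in reversed(A):              # backward pass from an all-ones output
        bwd.append(a.T @ bwd[-1])
    bwd.reverse()
    # dR/d|w_ij| at layer l equals bwd[l + 1][i] * fwd[l][j]
    return [np.outer(bwd[l + 1], fwd[l]) * A[l] for l in range(len(A))]
```

Iteratively pruning the lowest-scoring weights under this objective tends to preserve input-output paths; the node-path balancing perspective above asks how such path-centric scores should be tempered to also keep enough effective nodes.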



Figure 1: Toy experiment with 100 subnetworks randomly generated from the base MLP network.

