UNDERSTANDING PRUNING AT INITIALIZATION: AN EFFECTIVE NODE-PATH BALANCING PERSPECTIVE

Abstract

Pruning at initialization (PaI) methods aim to remove weights of neural networks before training in pursuit of reducing training costs. While current PaI methods are promising and outperform random pruning, much work remains to be done to understand and improve PaI methods so that they match the performance of pruning after training. In particular, recent studies (Frankle et al., 2021; Su et al., 2020) present empirical evidence for the potential of PaI, and reveal intriguing properties, e.g., that randomly shuffling the connections of a pruned network within each layer preserves or even improves performance. Our paper gives new perspectives on PaI based on the geometry of subnetwork configurations. We propose two quantities to probe the shape of subnetworks: the numbers of effective paths and effective nodes (or channels). Using these quantities, we provide a principled framework for better understanding PaI methods. Our main findings are: (i) at moderate sparsity levels (< 99%), the width of subnetworks matters, which is consistent with the competitive performance of layerwise-shuffled subnetworks; (ii) node-path balancing plays a critical role in the quality of PaI subnetworks, especially in extreme sparsity regimes. These findings point to an important direction for network pruning that takes the subnetwork topology itself into account. To illustrate the promise of this direction, we present a fairly simple method based on SynFlow (Tanaka et al., 2020) and conduct extensive experiments on different architectures and datasets to demonstrate its effectiveness.
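To make the two probes concrete, the sketch below (illustrative code, not the paper's implementation) counts effective paths and effective nodes for a pruned fully-connected network represented by binary layer masks. An input-output path is "effective" if every weight on it survives pruning, and a node is "effective" if at least one effective path crosses it; the helper name `effective_stats` is our own.

```python
import numpy as np

def effective_stats(masks):
    """masks: list of binary matrices; masks[l] has shape (n_out, n_in).

    Returns (number of effective paths, number of effective nodes),
    where a path is effective if all of its weights survive pruning.
    """
    # Forward pass: fwd[l][j] = number of surviving paths from any
    # input unit to unit j of layer l (layer 0 = the input layer).
    f = np.ones(masks[0].shape[1])
    fwd = [f]
    for m in masks:
        f = m @ f
        fwd.append(f)
    # Backward pass: bwd[l][j] = number of surviving paths from unit j
    # of layer l to any output unit.
    b = np.ones(masks[-1].shape[0])
    bwd = [b]
    for m in reversed(masks):
        b = m.T @ b
        bwd.append(b)
    bwd = bwd[::-1]
    n_paths = fwd[-1].sum()  # = 1^T M_L ... M_1 1 on the binary masks
    # A node is effective iff it has surviving paths both to the input
    # and to the output, i.e., fwd * bwd > 0 at that node.
    n_eff_nodes = sum(int(np.count_nonzero(f * b))
                      for f, b in zip(fwd, bwd))
    return n_paths, n_eff_nodes
```

For instance, a 2-2-1 network where the hidden-to-output mask keeps only the first hidden unit has two effective paths and four effective nodes (both inputs, one hidden unit, one output), even though both hidden units retain incoming connections.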

1. INTRODUCTION

Deep neural networks have achieved state-of-the-art performance in a wide range of machine learning applications (Brown et al., 2020; Dosovitskiy et al., 2021; Ramesh et al., 2021; Radford et al., 2021). However, their huge computational resource requirements limit their applicability, especially in edge computing and other future smart cyber-physical systems (Hinton et al., 2015; Zhao et al., 2019; Price & Tanner, 2021; Yuan et al., 2021; Bithika et al., 2022). To overcome this issue, there have been a number of attempts to reduce the size of such networks without compromising their performance, among which pruning enjoys significant interest (Hoefler et al., 2021; Deng et al., 2020; Cheng et al., 2018). A rationale for this direction is the work of Frankle & Carbin (2018), in which the authors provide empirical evidence for the existence of sparse subnetworks, referred to as lottery tickets, that can be trained from scratch to achieve performance similar to the original network. However, standard methods for finding such subnetworks typically involve a costly pre-training and iterative magnitude pruning process. This issue raises an intriguing research question: how can we identify sparse, trainable subnetworks at initialization, without pre-training? A successful pruning-before-training method can significantly reduce both memory cost and runtime without sacrificing much performance (Wang et al., 2022), making neural networks applicable even in scenarios with scarce computing resources (Alizadeh et al., 2022; Yuan et al., 2021). As such, many methods for PaI have been proposed (Lee et al., 2019; Tanaka et al., 2020; de Jorge et al., 2021; Wang et al., 2020; Alizadeh et al., 2022). While these methods are motivated by a number of intuitions (e.g., leveraging gradient information), they typically measure the importance of individual network parameters. More recently, Frankle et al. (2021) and Su et al.
(2020) observed a rather surprising phenomenon: for PaI methods, randomly shuffling the connections of pruned subnetworks within each layer does not reduce performance. A surprising consequence is that layerwise sparsity ratios matter more than the weight-level importance scores of the subnetwork (Frankle et al., 2021). This indicates that in searching for good subnet-
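The layerwise shuffling operation referenced above is straightforward to state in code. The sketch below (our illustration, with the hypothetical helper name `layerwise_shuffle`) permutes the surviving connections within each layer's binary mask, which exactly preserves every layer's sparsity ratio while discarding the weight-level importance scores:

```python
import numpy as np

def layerwise_shuffle(masks, seed=0):
    """Randomly rearrange surviving connections within each layer.

    masks: list of binary NumPy arrays (1 = weight kept, 0 = pruned).
    Each layer's count of kept weights, and hence its sparsity ratio,
    is unchanged; only the positions of the kept weights move.
    """
    rng = np.random.default_rng(seed)
    shuffled = []
    for m in masks:
        flat = m.flatten()  # flatten() copies, so the input is untouched
        rng.shuffle(flat)
        shuffled.append(flat.reshape(m.shape))
    return shuffled
```

Because only the positions of surviving weights change, any difference in trained accuracy between the original and shuffled subnetworks isolates the contribution of weight-level scores beyond the layerwise sparsity ratios.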

