A UNIFIED PATHS PERSPECTIVE FOR PRUNING AT INITIALIZATION

Abstract

A number of recent approaches have been proposed for pruning neural network parameters at initialization with the goal of reducing the size and computational burden of models while minimally affecting their training dynamics and generalization performance. While each of these approaches has some well-founded motivation, a rigorous analysis of the effect of these pruning methods on network training dynamics, and of their formal relationship to one another, has thus far received little attention. Leveraging recent theoretical approximations provided by the Neural Tangent Kernel, we unify a number of popular approaches for pruning at initialization under a single path-centric framework. We introduce the Path Kernel as the data-independent factor in a decomposition of the Neural Tangent Kernel and show that the global structure of the Path Kernel can be computed efficiently. This Path Kernel decomposition separates the architectural effects from the data-dependent effects within the Neural Tangent Kernel, providing a means to predict the convergence dynamics of a network from its architecture alone. We analyze the use of this structure in approximating the training and generalization performance of networks in the absence of data across a number of initialization pruning approaches. Observing the relationship between input data and paths, and between the Path Kernel and its natural norm, we additionally propose two augmentations of the SynFlow algorithm for pruning at initialization.

1. INTRODUCTION

A wealth of recent work has been dedicated to characterizing the training dynamics and generalization bounds of neural networks under a linearized approximation of the network depending on its parameters at initialization (Jacot et al., 2018; Arora et al., 2019; Lee et al., 2019a; Woodworth et al., 2020). This approach makes use of the Neural Tangent Kernel, and under infinite-width assumptions, the training dynamics of gradient descent over the network become analytically tractable. In this paper, we make use of the Neural Tangent Kernel theory with the goal of approximating the effects of various initialization pruning methods on the resulting training dynamics and performance of the network. Focusing on networks with homogeneous activation functions (ReLU, Leaky-ReLU, Linear), we introduce a novel decomposition of the Neural Tangent Kernel which separates the effects of network architecture from the effects of the data on the training dynamics of the network. We find the data-independent factor of the Neural Tangent Kernel to have a particularly nice structure as a symmetric matrix representing the covariance of path values in the network, which we term the Path Kernel. We subsequently show that the Path Kernel offers a data-independent approximation of the network's convergence dynamics during training. To validate the empirical benefits of this theoretical approach, we turn to the problem of pruning at initialization. While the problem of optimally pruning deep networks is nearly as old as deep networks themselves (Reed, 1993), interest in this problem has experienced a revival in recent years. This revival is likely a product of a number of underlying factors, but much of the recent interest can be ascribed to the Lottery Ticket Hypothesis (Frankle & Carbin, 2018), which states that sparse, trainable networks, achieving task performance that matches or exceeds that of their dense counterparts, can be found at initialization.
The Lottery Ticket Hypothesis implies that the over-parameterization of neural networks is incidental in finding a trainable solution, the topology of which often exists at initialization. However, finding these lottery ticket networks currently requires some amount of iterative re-training of the network at increasing levels of sparsity, which is inefficient and difficult to analyze theoretically. The resurgence of interest in optimal pruning has spurred the development of a number of recent approaches for pruning deep neural networks at initialization (Lee et al., 2019b; Liu & Zenke, 2020; Wang et al., 2020; Tanaka et al., 2020) in supervised, semi-supervised, and unsupervised settings, borrowing theoretical motivation from linearized training dynamics (Jacot et al., 2018), mean-field isometry (Saxe et al., 2013), and saliency (Dhamdhere et al., 2019). While each of these methods has its own theoretical motivation, little work has been dedicated to formally describing the effect of these pruning methods on the expected performance of the pruned network. Moreover, the diversity in theoretical motivations that give rise to these pruning methods makes it difficult to observe their similarities. In this paper, we observe that a number of initialization pruning approaches are implicitly dependent on the path covariance structure captured by the Path Kernel which, in turn, affects the network's training dynamics. We show that we can approximate these training dynamics in general, and our approximation results for a number of initialization pruning approaches suggest that it is possible to estimate, prior to training, the efficacy of a particular initialization pruning approach on a given architecture by investigating the eigenstructure of its Path Kernel.
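To make the path covariance structure concrete, the following sketch is illustrative only: the tiny two-layer network, its random weights, and the reading of the Path Kernel as the outer product of the Jacobian of path values with respect to the parameters are assumptions for exposition, not the paper's implementation. It enumerates the input-output paths of a small network (a path's value being the product of the weights it traverses, which is well defined for homogeneous activations) and forms their covariance over parameters.

```python
import numpy as np
from itertools import product

# Tiny two-layer network: 2 inputs -> 3 hidden units -> 1 output.
rng = np.random.default_rng(0)
W1 = rng.normal(size=(3, 2))   # hidden x input weights
W2 = rng.normal(size=(1, 3))   # output x hidden weights

# A path picks one unit per layer; its value is the product of the
# weights it traverses.  Here: 1 output * 3 hidden * 2 inputs = 6 paths.
paths = list(product(range(1), range(3), range(2)))  # (out, hidden, in)
n_params = W1.size + W2.size   # flat parameter vector: W1 first, then W2

def path_value_jacobian(W1, W2):
    """Jacobian of each path's value with respect to all parameters."""
    J = np.zeros((len(paths), n_params))
    for p, (o, h, i) in enumerate(paths):
        # v_p = W2[o, h] * W1[h, i], so only two partials are nonzero.
        J[p, h * W1.shape[1] + i] = W2[o, h]            # d(v_p)/d(W1[h, i])
        J[p, W1.size + o * W2.shape[1] + h] = W1[h, i]  # d(v_p)/d(W2[o, h])
    return J

J = path_value_jacobian(W1, W2)
# Covariance of path values over parameters: a symmetric, positive
# semi-definite matrix with one row/column per path.
Pi = J @ J.T
```

Entry (p, q) of `Pi` is large when paths p and q share heavily-weighted edges, which is the sense in which the matrix is a data-independent summary of how the architecture couples its paths.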
Motivated by our theoretical results and the unification of a number of initialization pruning methods in this Path Kernel framework, we investigate the close relationship between the SynFlow (Tanaka et al., 2020) pruning approach and our path decomposition. This leads us to suggest two new initialization pruning approaches which we predict to perform well under various assumptions on the stability of the Path Kernel and the input distribution of the data. We then validate these predictions empirically by comparing the performance of these pruning approaches across a number of network architectures. Initialization pruning is only one of a number of potential application domains which could benefit from this path-centric framework. Importantly, the covariance structure over paths encoded by the Path Kernel is general and may be computed at any point in time, not just at initialization. We anticipate that this representation will provide insight into other application areas like model interpretation, model comparison, or transfer learning across domains. The sections of the paper proceed as follows. We start with a brief introduction to the Neural Tangent Kernel in Section 2 before introducing the Path Kernel decomposition in Section 3 and its relationship to approximations of network convergence properties. In Section 4, we reformulate in this path framework three popular initialization pruning approaches and introduce two additional initialization pruning approaches inspired by this path decomposition. We validate these convergence approximations and the behavior of these pruning approaches in Section 5 and conclude with a discussion of the results and opportunities for future work.
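For reference, SynFlow (Tanaka et al., 2020) scores each parameter by its synaptic saliency, the elementwise product of the parameter with the gradient of an objective R computed on an all-ones input with every parameter replaced by its absolute value. The sketch below is a minimal illustration under our own simplifications, not the authors' implementation: a two-layer linear network with hand-derived gradients of R, and arbitrary random weights.

```python
import numpy as np

def synflow_scores(W1, W2):
    """Synaptic-saliency scores for a two-layer linear network,
    using the data-free objective R = 1^T |W2| |W1| 1."""
    A1, A2 = np.abs(W1), np.abs(W2)
    ones_in = np.ones(W1.shape[1])
    ones_out = np.ones(W2.shape[0])
    # Gradients of R with respect to the absolute-valued weights,
    # derived by hand for this linear case.
    grad_A1 = np.outer(A2.T @ ones_out, ones_in)   # dR/dA1[h, i] = sum_o A2[o, h]
    grad_A2 = np.outer(ones_out, A1 @ ones_in)     # dR/dA2[o, h] = sum_i A1[h, i]
    # Score = |theta| * dR/d|theta|, elementwise; pruning then keeps the
    # globally highest-scoring weights at the target sparsity.
    return A1 * grad_A1, A2 * grad_A2

rng = np.random.default_rng(1)
W1 = rng.normal(size=(3, 2))
W2 = rng.normal(size=(1, 3))
s1, s2 = synflow_scores(W1, W2)
```

Note that every weight's score here is a sum over the values of the paths passing through it, which is the sense in which SynFlow is implicitly a path-based criterion.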

2. THE NEURAL TANGENT KERNEL

Recent work by Jacot et al. (2018) has shown that the exact dynamics of infinite-width network outputs through gradient descent training correspond to kernel gradient descent in function space with respect to the Neural Tangent Kernel. More formally, for a neural network $f$ parameterized by $\theta$ and loss function $\ell : \mathbb{R}^K \times \mathbb{R}^K \to \mathbb{R}$, let $\mathcal{L} = \sum_{(x, y) \in (\mathcal{X}, \mathcal{Y})} \ell(f_t(x, \theta), y)$ denote the empirical loss function. Here, $\mathcal{X}$ is the training set and $\mathcal{Y}$ is the associated set of class labels. For multiple inputs, denote by $f(\mathcal{X}, \theta) \in \mathbb{R}^{NK}$ the outputs of the network, where $K$ is the output dimension and $N$ is the number of training examples. In continuous-time gradient descent, the evolution of the parameters and outputs can be expressed as

$$\dot{\theta}_t = -\eta\, \nabla_\theta f(\mathcal{X}, \theta_t)^\top \nabla_{f(\mathcal{X}, \theta_t)} \mathcal{L} \quad (1)$$

$$\dot{f}(\mathcal{X}, \theta_t) = \nabla_\theta f(\mathcal{X}, \theta_t)\, \dot{\theta}_t = -\eta\, \Theta_t(\mathcal{X}, \mathcal{X})\, \nabla_{f(\mathcal{X}, \theta_t)} \mathcal{L} \quad (2)$$

where the matrix $\Theta_t(\mathcal{X}, \mathcal{X}) \in \mathbb{R}^{NK \times NK}$ is the Neural Tangent Kernel at time step $t$, defined as the covariance structure of the Jacobian of the network outputs with respect to the parameters over all training samples:

$$\Theta_t(\mathcal{X}, \mathcal{X}) = \nabla_\theta f(\mathcal{X}, \theta_t)\, \nabla_\theta f(\mathcal{X}, \theta_t)^\top. \quad (3)$$

For infinitely wide networks, the NTK exactly captures the output-space dynamics through training, and $\Theta_t(\mathcal{X}, \mathcal{X})$ remains constant throughout. Lee et al. (2019a) have shown that neural networks of

