A UNIFIED PATHS PERSPECTIVE FOR PRUNING AT INITIALIZATION

Abstract

A number of recent approaches have been proposed for pruning neural network parameters at initialization, with the goal of reducing the size and computational burden of models while minimally affecting their training dynamics and generalization performance. While each of these approaches has some well-founded motivation, a rigorous analysis of the effect of these pruning methods on network training dynamics, and of their formal relationships to one another, has thus far received little attention. Leveraging recent theoretical approximations provided by the Neural Tangent Kernel, we unify a number of popular approaches to pruning at initialization under a single path-centric framework. We introduce the Path Kernel as the data-independent factor in a decomposition of the Neural Tangent Kernel and show that the global structure of the Path Kernel can be computed efficiently. This decomposition separates the architectural effects from the data-dependent effects within the Neural Tangent Kernel, providing a means to predict the convergence dynamics of a network from its architecture alone. We analyze the use of this structure in approximating the training and generalization performance of networks in the absence of data across a number of initialization pruning approaches. Observing the relationship between input data and paths, and between the Path Kernel and its natural norm, we additionally propose two augmentations of the SynFlow algorithm for pruning at initialization.

1. INTRODUCTION

A wealth of recent work has been dedicated to characterizing the training dynamics and generalization bounds of neural networks under a linearized approximation of the network that depends on its parameters at initialization (Jacot et al., 2018; Arora et al., 2019; Lee et al., 2019a; Woodworth et al., 2020). This approach makes use of the Neural Tangent Kernel, under whose infinite-width assumptions the training dynamics of gradient descent over the network become analytically tractable. In this paper, we use Neural Tangent Kernel theory to approximate the effects of various initialization pruning methods on the resulting training dynamics and performance of the network. Focusing on networks with homogeneous activation functions (ReLU, Leaky-ReLU, linear), we introduce a novel decomposition of the Neural Tangent Kernel that separates the effects of network architecture from the effects of the data on the training dynamics of the network. We find that the data-independent factor of the Neural Tangent Kernel has a particularly simple structure: a symmetric matrix representing the covariance of path values in the network, which we term the Path Kernel. We subsequently show that the Path Kernel offers a data-independent approximation of the network's convergence dynamics during training.

To validate the empirical benefits of this theoretical approach, we turn to the problem of pruning at initialization. While the problem of optimally pruning deep networks is nearly as old as deep networks themselves (Reed, 1993), interest in this problem has experienced a revival in recent years. This revival is likely the product of a number of underlying factors, but much of the recent interest can be ascribed to the Lottery Ticket Hypothesis (Frankle & Carbin, 2018), which states that sparse, trainable networks that match or exceed the task performance of their dense counterparts can be found at initialization.
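As a concrete illustration of the object under study, the empirical Neural Tangent Kernel of a finite network is the Gram matrix of per-example parameter gradients. The following is a minimal sketch for a toy two-layer ReLU network; all names, dimensions, and the manual-gradient setup are illustrative assumptions, not the paper's implementation:

```python
# Sketch (assumed setup): empirical NTK of f(x) = w2 @ relu(W1 @ x),
# built from per-example gradients, Theta[i, j] = <grad f(x_i), grad f(x_j)>.
import numpy as np

rng = np.random.default_rng(0)
d_in, d_hid, n = 3, 5, 4            # input dim, hidden width, number of inputs
W1 = rng.normal(size=(d_hid, d_in)) / np.sqrt(d_in)
w2 = rng.normal(size=d_hid) / np.sqrt(d_hid)
X = rng.normal(size=(n, d_in))

def grad_f(x):
    """Gradient of the scalar output w.r.t. all parameters, flattened."""
    pre = W1 @ x                      # pre-activations
    gate = (pre > 0).astype(float)    # ReLU gates (0/1) at this input
    g_w2 = gate * pre                 # df/dw2 = hidden activations
    g_W1 = np.outer(w2 * gate, x)     # df/dW1[i, j] = w2[i] * gate[i] * x[j]
    return np.concatenate([g_W1.ravel(), g_w2])

J = np.stack([grad_f(x) for x in X])  # n x (num params) Jacobian at init
Theta = J @ J.T                       # empirical NTK Gram matrix

assert np.allclose(Theta, Theta.T)                  # symmetric
assert np.all(np.linalg.eigvalsh(Theta) >= -1e-9)   # positive semidefinite
```

Note that each entry of `g_W1` is a product of the weights along a path through the network (here, `w2[i]`, gated by the ReLU) times an input coordinate; this is the sense in which the Jacobian, and hence the kernel, factors into path values and data.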
The Lottery Ticket Hypothesis implies that the over-parameterization of neural networks is incidental to finding a trainable solution, whose topology often already exists at initialization. However, finding these lottery ticket networks currently requires

