A GENERAL FRAMEWORK FOR PROVING THE EQUIVARIANT STRONG LOTTERY TICKET HYPOTHESIS

Abstract

The Strong Lottery Ticket Hypothesis (SLTH) stipulates the existence of a subnetwork within a sufficiently overparametrized (dense) neural network that, when initialized randomly and without any training, achieves the accuracy of a fully trained target network. Recent works by da Cunha et al. (2022b); Burkholz (2022a) demonstrate that the SLTH can be extended to translation-equivariant networks, i.e. CNNs, with the same level of overparametrization as needed for SLTs in dense networks. Modern neural networks, however, can incorporate more than just translation symmetry, and designing general equivariant architectures that encode symmetries such as rotation and permutation has been a powerful design principle. In this paper, we generalize the SLTH to functions that preserve the action of a group G, i.e. G-equivariant networks, and prove, with high probability, that one can approximate any G-equivariant network of fixed width and depth by pruning a randomly initialized, overparametrized G-equivariant network to a G-equivariant subnetwork. We further prove that our prescribed overparametrization scheme is optimal, and we provide a lower bound on the number of effective parameters as a function of the error tolerance. We develop our theory for a large range of groups, including subgroups of the Euclidean group E(2) and of the symmetric group, G ≤ S_n, allowing us to find SLTs for MLPs, CNNs, E(2)-steerable CNNs, and permutation-equivariant networks as specific instantiations of our unified framework. Empirically, we verify our theory by pruning overparametrized E(2)-steerable CNNs, k-order GNNs, and message passing GNNs to match the performance of trained target networks.

1. INTRODUCTION

Many problems in deep learning benefit from massive amounts of annotated data and compute, enabling the training of models with in excess of a billion parameters. Despite this appeal of overparametrization, many real-world applications are resource-constrained (e.g., on device) and demand a reduced computational footprint for both training and deployment (Deng et al., 2020). A natural question that arises in these settings is: is it possible to marry the benefits of large models, which are empirically beneficial for effective training, with the computational efficiencies of smaller sparse models? A standard line of work for building compressed models from larger fully trained networks with minimal loss in accuracy is weight pruning (Blalock et al., 2020). There is, however, increasing empirical evidence to suggest that weight pruning can occur significantly prior to full model convergence. Frankle and Carbin (2019) postulate the extreme scenario, termed the lottery ticket hypothesis (LTH), in which a subnetwork extracted at initialization can be trained to the accuracy of the parent network, in effect "winning" the weight initialization lottery. In an even more striking phenomenon, Ramanujan et al. (2020) find that not only do such sparse subnetworks exist at initialization, but they already achieve impressive performance without any training. This remarkable occurrence, termed the strong lottery ticket hypothesis (SLTH), was proven for overparametrized dense networks with no biases (Malach et al., 2020; Pensia et al., 2020; Orseau et al., 2020), non-zero biases (Fischer and Burkholz, 2021), and vanilla CNNs (da Cunha et al., 2022b). Recently, Burkholz (2022b) extended the work of Pensia et al. (2020) to most activation functions that behave like ReLU around the origin, and adopted an overparametrization scheme different from that of Pensia et al. (2020), such that the overparametrized network has depth L + 1 (no longer 2L).
However, the optimality with respect to the number of parameters (Theorem 2 in Pensia et al. (2020)) is lost with this method. Moreover, Burkholz (2022a) extended the results of da Cunha et al. (2022b) on CNNs to non-positive inputs. Modern architectures, however, are more than just MLPs and CNNs, and many encode data-dependent inductive biases in the form of equivariances and invariances that are pivotal to learning smaller and more efficient networks (He et al., 2021). This raises an important question: can we simultaneously reap the benefits of equivariance and pruning? In other words, do winning tickets exist in the equivariant strong lottery for general equivariant networks given sufficient overparametrization?

Present Work. In this paper, we develop a unifying framework to study and prove the existence of strong lottery tickets (SLTs) for general equivariant networks. Specifically, in our main result (Thm. 1) we prove that any fixed-width, fixed-depth target G-equivariant network that uses a pointwise ReLU can be approximated, with high probability and to a pre-specified tolerance, by a subnetwork within a random G-equivariant network that is overparametrized by doubling the depth and increasing the width by a logarithmic factor. Such a theorem allows us to immediately recover the results of Pensia et al. (2020); Orseau et al. (2020) for MLPs and of Burkholz et al. (2022); da Cunha et al. (2022b) for CNNs as specific instantiations under our unified equivariant framework. Furthermore, we prove that a logarithmic overparametrization is optimal, by providing a lower bound in Thm. 2, as a function of the tolerance. Crucially, this holds irrespective of which overparametrization strategy is employed, which demonstrates the optimality of Theorem 1. Notably, the extracted subnetwork is also G-equivariant, preserving the desirable inductive biases of the target model; importantly, this is not achievable via a simple application of previous results (Pensia et al., 2020; da Cunha et al., 2022b).
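The logarithmic overparametrization at the heart of such results rests on a subset-sum argument (Pensia et al., 2020): with high probability, any target weight in [-1, 1] can be approximated to tolerance ϵ by the sum of a subset of O(log(1/ϵ)) uniform random weights. A minimal brute-force sketch of this idea (the sample count n = 16 and variable names are illustrative choices of ours, not taken from the paper):

```python
import itertools
import random

def best_subset_sum(samples, target):
    """Brute-force search for the subset of `samples` whose sum
    is closest to `target` (the empty subset is the baseline)."""
    best_err, best_subset = abs(target), ()
    for r in range(1, len(samples) + 1):
        for subset in itertools.combinations(samples, r):
            err = abs(sum(subset) - target)
            if err < best_err:
                best_err, best_subset = err, subset
    return best_subset, best_err

random.seed(0)
n = 16  # plays the role of the O(log(1/eps)) random weights
samples = [random.uniform(-1, 1) for _ in range(n)]
target = 0.739

subset, err = best_subset_sum(samples, target)
# err is tiny: the 2^16 candidate subset sums densely cover the target range
```

Pruning corresponds to keeping exactly the weights in the chosen subset and masking the rest, which is why a logarithmic widening of each layer suffices.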
Our theory is broadly applicable to any equivariant network that uses a pointwise ReLU nonlinearity. This includes the popular E(2)-steerable CNNs with regular representations (Weiler and Cesa, 2019) (Corollary 1), which model symmetries of the 2D plane, as well as subgroups of the symmetric group on n elements, S_n, allowing us to find SLTs for permutation-equivariant networks (Corollary 2) as a specific instantiation. We substantiate our theory with experiments that explicitly compute the pruning masks for randomly initialized, overparametrized E(2)-steerable networks, k-order GNNs, and MPGNNs, approximating fully trained target equivariant networks.
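The claim that the extracted subnetwork remains G-equivariant can be illustrated in miniature: if pruning zeroes out coefficients of an equivariant basis rather than arbitrary individual weights, equivariance survives. A sketch in NumPy using the standard two-parameter S_n-equivariant linear layer f(x) = a·x + b·mean(x)·1 (this layer is our illustrative assumption, not an architecture from the experiments):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
a, b = rng.normal(), rng.normal()

def layer(x, mask=(1, 1)):
    """S_n-equivariant linear layer in the two-element equivariant basis.
    Pruning zeroes a *basis coefficient* (m_a or m_b), so every masked
    layer is still a member of the equivariant family."""
    m_a, m_b = mask
    return m_a * a * x + m_b * b * x.mean() * np.ones_like(x)

x = rng.normal(size=n)
perm = rng.permutation(n)  # a group element g acting by coordinate permutation

for mask in [(1, 1), (1, 0), (0, 1)]:
    lhs = layer(x, mask)[perm]   # rho(g) f(x)
    rhs = layer(x[perm], mask)   # f(rho(g) x)
    assert np.allclose(lhs, rhs)
```

Masking individual entries of the n x n matrix a·I + b·11^T instead would generally break the weight sharing and hence the symmetry, which is why a naive application of the dense-network SLTH results does not yield an equivariant ticket.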

2. BACKGROUND AND RELATED WORK

Notation and Convention. For p ∈ N, [p] denotes {0, ..., p−1}. We assume that tensors (vectors, matrices, ...) are indexed starting from 0, e.g., W_{p,q}, p, q ∈ [d]. G is a group, and ρ is its representation. We use | · | for the cardinality of a set, ⊕ for the direct sum of vector spaces or group representations, and ⊗ for the Kronecker product. We use * to denote a convolution. We define x^+ = max(0, x) and x^- = min(0, x). ∥ · ∥ is an ℓ_p norm, while |||·||| is its operator norm.

Equivariance. We are interested in building equivariant networks that encode the symmetries induced by a given group G as inductive biases. To act with a group we require a group representation ρ : G → GL(R^D), which is a group homomorphism and thus satisfies ρ(g_1 g_2) = ρ(g_1)ρ(g_2), where GL(R^D) is the group of D × D invertible matrices with ordinary matrix multiplication as the group operation. Let us now recall the main definition of equivariance:

Definition 2.1. Let X ⊂ R^{D_x} and Y ⊂ R^{D_y} be two sets with an action of a group G. A map f : X → Y is called G-equivariant if it respects the action, i.e., ρ_Y(g)f(x) = f(ρ_X(g)x), ∀g ∈ G and x ∈ X. A map h : X → Y is called G-invariant if h(x) = h(ρ_X(g)x), ∀g ∈ G and x ∈ X.

Since a composition of equivariant functions is equivariant, to build an equivariant network it suffices to take each layer f_i to be G-equivariant and to use a G-equivariant nonlinearity (e.g. a pointwise ReLU). Given a vector space and a corresponding group representation, we can define a feature space F_i := (R^{D_i}, ρ_i). Note that we can stack multiple such feature spaces in a layer; for example, the input feature space to an equivariant layer i can be written as n_i blocks F_i^{n_i} := ⊕_{m=1}^{n_i} F_i.
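Definition 2.1, together with the fact that a pointwise ReLU is G-equivariant whenever ρ acts by permutation matrices, can be checked numerically. A small illustrative NumPy sketch (variable names are ours):

```python
import numpy as np

rng = np.random.default_rng(1)
D = 6
x = rng.normal(size=D)

# rho(g): a permutation matrix acting on R^D by permuting coordinates,
# so (rho_g @ x)[i] = x[g[i]].
g = rng.permutation(D)
rho_g = np.eye(D)[g]

relu = lambda v: np.maximum(v, 0.0)  # sigma(x) = x^+, applied pointwise

# Definition 2.1 with f = ReLU: rho(g) f(x) == f(rho(g) x).
# This holds because permuting coordinates commutes with any pointwise map.
assert np.allclose(rho_g @ relu(x), relu(rho_g @ x))
```

For representations that mix coordinates (rather than merely permuting them), a pointwise ReLU is in general not equivariant, which is why the theory is stated for representations, such as regular ones, under which pointwise nonlinearities commute with the group action.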




For a basis B = {b_1, ..., b_p}, we write |||B||| = max_{∥α∥_∞ ≤ 1} |||Σ_{k=1}^p α_k b_k|||. σ(x) = x^+ is the pointwise ReLU. Finally, we take (ϵ, δ) ∈ [0, 1/2]^2, and U([a, b]) is the uniform distribution on [a, b].
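For small bases, |||B||| can be computed by brute force: the operator norm is convex in α, so the maximum over the cube ∥α∥_∞ ≤ 1 is attained at a vertex α ∈ {−1, +1}^p. An illustrative sketch of this computation (ours, not from the paper):

```python
import itertools
import numpy as np

def basis_norm(basis):
    """|||B||| = max over ||alpha||_inf <= 1 of the spectral norm of
    sum_k alpha_k b_k.  By convexity of the norm in alpha, the maximum
    over the cube is attained at a sign vector alpha in {-1,+1}^p."""
    p = len(basis)
    return max(
        np.linalg.norm(sum(a * b for a, b in zip(alpha, basis)), ord=2)
        for alpha in itertools.product((-1.0, 1.0), repeat=p)
    )

rng = np.random.default_rng(2)
basis = [rng.normal(size=(3, 3)) for _ in range(4)]  # B = {b_1, ..., b_p}

val = basis_norm(basis)
# alpha = e_k is feasible in the cube, so |||B||| dominates each |||b_k|||
assert val >= max(np.linalg.norm(b, ord=2) for b in basis)
```

The 2^p enumeration is exponential in the basis size, but the bases appearing in equivariant layers are typically small (e.g. a handful of steerable filters), so this exact computation remains practical.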

