A GENERAL FRAMEWORK FOR PROVING THE EQUIVARIANT STRONG LOTTERY TICKET HYPOTHESIS

Abstract

The Strong Lottery Ticket Hypothesis (SLTH) stipulates the existence of a subnetwork within a sufficiently overparametrized (dense) neural network that, when initialized randomly and without any training, achieves the accuracy of a fully trained target network. Recent works by da Cunha et al. (2022b) and Burkholz (2022a) demonstrate that the SLTH can be extended to translation equivariant networks, i.e. CNNs, with the same level of overparametrization as needed for SLTs in dense networks. However, modern neural networks are capable of incorporating more than just translation symmetry, and developing general equivariant architectures, such as rotation and permutation equivariant ones, has been a powerful design principle. In this paper, we generalize the SLTH to functions that preserve the action of a group $G$, i.e. $G$-equivariant networks, and prove that, with high probability, one can approximate any $G$-equivariant network of fixed width and depth by pruning a randomly initialized overparametrized $G$-equivariant network to a $G$-equivariant subnetwork. We further prove that our prescribed overparametrization scheme is optimal and provide a lower bound on the number of effective parameters as a function of the error tolerance. We develop our theory for a large range of groups, including subgroups of the Euclidean group $E(2)$ and of the symmetric group, $G \le S_n$, allowing us to find SLTs for MLPs, CNNs, $E(2)$-steerable CNNs, and permutation equivariant networks as specific instantiations of our unified framework. Empirically, we verify our theory by pruning overparametrized $E(2)$-steerable CNNs, $k$-order GNNs, and message passing GNNs to match the performance of trained target networks.

1. INTRODUCTION

Many problems in deep learning benefit from massive amounts of annotated data and compute that enable the training of models with in excess of a billion parameters. Despite this appeal of overparametrization, many real-world applications are resource-constrained (e.g., on device) and demand a reduced computational footprint for both training and deployment (Deng et al., 2020). A natural question that arises in these settings is then: is it possible to marry the benefits of large models, which are empirically beneficial for effective training, to the computational efficiencies of smaller sparse models? A standard line of work for building compressed models from larger fully trained networks with minimal loss in accuracy is via weight pruning (Blalock et al., 2020). There is, however, increasing empirical evidence to suggest that weight pruning can occur significantly prior to full model convergence. Frankle and Carbin (2019) postulate the extreme scenario, termed the lottery ticket hypothesis (LTH), where a subnetwork extracted at initialization can be trained to the accuracy of the parent network, in effect "winning" the weight initialization lottery. In an even more striking phenomenon, Ramanujan et al. (2020) find that not only do such sparse subnetworks exist at initialization but they already achieve impressive performance without any training. This remarkable occurrence, termed the strong lottery ticket hypothesis (SLTH), was proven for overparametrized dense networks with no biases (Malach et al., 2020; Pensia et al., 2020; Orseau et al., 2020), non-zero biases (Fischer and Burkholz, 2021), and vanilla CNNs (da Cunha et al., 2022b). Recently, Burkholz (2022b) extended the work of Pensia et al. (2020) to most activation functions that behave like ReLU around the origin, and adopted a different overparametrization framework from that of Pensia et al. (2020), in which the overparametrized network has depth L + 1 (no longer 2L).
However, the optimality with respect to the number of parameters (Theorem 2 in Pensia et al. (2020)) is lost with this method. Moreover, Burkholz (2022a) extended the results of da Cunha et al. (2022b) on CNNs to non-positive inputs. Modern architectures, however, are more than just MLPs and CNNs, and many encode data-dependent inductive biases in the form of equivariances and invariances that are pivotal to learning smaller and more efficient networks (He et al., 2021). This raises the important question: can we simultaneously get the benefits of equivariance and pruning? In other words, do there exist winning tickets for the equivariant strong lottery for general equivariant networks given sufficient overparametrization?

Present Work. In this paper, we develop a unifying framework to study and prove the existence of strong lottery tickets (SLTs) for general equivariant networks. Specifically, in our main result (Thm. 1) we prove that any fixed width and depth target $G$-equivariant network that uses a point-wise ReLU can be approximated with high probability to a pre-specified tolerance by a subnetwork within a random $G$-equivariant network that is overparametrized by doubling the depth and increasing the width by a logarithmic factor. Such a theorem allows us to immediately recover the results of Pensia et al. (2020); Orseau et al. (2020) for MLPs and of Burkholz et al. (2022); da Cunha et al. (2022b) for CNNs as specific instantiations under our unified equivariant framework. Furthermore, we prove that a logarithmic overparametrization is necessarily optimal as a function of the tolerance, by providing a lower bound in Thm. 2. Crucially, this holds irrespective of which overparametrization strategy is employed, which demonstrates the optimality of Theorem 1.
Notably, the extracted subnetwork is also $G$-equivariant, preserving the desirable inductive biases of the target model; importantly, this is not achievable via a simple application of the previous results of Pensia et al. (2020) and da Cunha et al. (2022b). Our theory is broadly applicable to any equivariant network that uses a pointwise ReLU nonlinearity. This includes the popular $E(2)$-steerable CNNs with regular representations (Weiler and Cesa, 2019) (Corollary 1), which model symmetries of the 2d-plane, as well as subgroups of the symmetric group of $n$ elements $S_n$, allowing us to find SLTs for permutation equivariant networks (Corollary 2) as a specific instantiation. We substantiate our theory with experiments in which we explicitly compute the pruning masks for randomly initialized overparametrized $E(2)$-steerable networks, $k$-order GNNs, and MPGNNs to approximate another fully trained target equivariant network.

2. BACKGROUND AND RELATED WORK

Notation and Convention. For $p \in \mathbb{N}$, $[p]$ denotes $\{0, \dots, p-1\}$. We assume that the starting index of tensors (vectors, matrices, ...) is 0, e.g., $W_{p,q}$, $p, q \in [d]$. $G$ is a group, and $\rho$ is its representation. We use $|\cdot|$ for the cardinality of a set, while $\oplus$ represents the direct sum of vector spaces or group representations and $\otimes$ indicates the Kronecker product. We use $*$ to denote a convolution. We define $x^+, x^-$ as $x^+ = \max(0, x)$ and $x^- = \min(0, x)$. $\|\cdot\|$ is an $\ell_p$ norm while $|||\cdot|||$ is its operator norm.

Equivariance. We are interested in building equivariant networks that encode the symmetries induced by a given group $G$ as inductive biases. To act using a group we require a group representation $\rho : G \to GL(\mathbb{R}^D)$, which itself is a group homomorphism and satisfies $\rho(g_1 g_2) = \rho(g_1)\rho(g_2)$, where $GL(\mathbb{R}^D)$ is the group of $D \times D$ invertible matrices with group operation being ordinary matrix multiplication. Let us now recall the main definition for equivariance:

Definition 2.1. Let $\mathcal{X} \subset \mathbb{R}^{D_x}$ and $\mathcal{Y} \subset \mathbb{R}^{D_y}$ be two sets with an action of a group $G$. A map $f : \mathcal{X} \to \mathcal{Y}$ is called $G$-equivariant if it respects the action, i.e., $\rho_{\mathcal{Y}}(g) f(x) = f(\rho_{\mathcal{X}}(g) x)$, $\forall g \in G$ and $x \in \mathcal{X}$. A map $h : \mathcal{X} \to \mathcal{Y}$ is called $G$-invariant if $h(x) = h(\rho_{\mathcal{X}}(g) x)$, $\forall g \in G$ and $x \in \mathcal{X}$.

As a composition of equivariant functions is equivariant, to build an equivariant network it is sufficient to take each layer $f_i$ to be $G$-equivariant and utilize a $G$-equivariant non-linearity (e.g. pointwise ReLU). Given a vector space and a corresponding group representation we can define a feature space $F_i := (\mathbb{R}^{D_i}, \rho_i)$. Note that we can stack multiple such feature spaces in a layer; for example, the input feature space to an equivariant layer $i$ can be written as $n_i$ blocks $F_i^{n_i} := \bigoplus_{m=1}^{n_i} F_i$. A $G$-equivariant basis is a basis of the space of equivariant linear maps between two vector spaces.
We can decompose a $G$-equivariant linear map $f_i : F_i \to F_{i+1}$ in a corresponding equivariant basis $\mathcal{B}_{i\to i+1} = \{b_{i\to i+1,k} \in \mathbb{R}^{D_i \times D_{i+1}}, \forall k \in [|\mathcal{B}_{i\to i+1}|]\}$. When working with stacks of $n_i$ (resp. $n_{i+1}$) input (resp. output) feature spaces we may express the full equivariant basis by considering $\kappa_{n_i\to n_{i+1}} = \{\kappa^{p,q}_{n_i\to n_{i+1}} \in \mathbb{R}^{n_i \times n_{i+1}}, (p, q) \in [n_i] \times [n_{i+1}]\}$, where each element $\kappa^{p,q}_{n_i\to n_{i+1}}$ is a matrix with a single non-zero entry at position $(p, q)$. Then the basis for $G$-equivariant maps $F_i^{n_i} \to F_{i+1}^{n_{i+1}}$ can be written succinctly as the Kronecker product between the two bases, $\kappa_{n_i\to n_{i+1}} \otimes \mathcal{B}_{i\to i+1}$. Some instances of $G$ and $F_i$ are presented in Tab. 2. For example, in the case of CNNs with kernel size $d^2$, the linear map $f$ is a convolution where $n_i$ (resp. $n_{i+1}$) is the number of input (resp. output) channels and $\kappa_{n_i\to n_{i+1}} \otimes \mathcal{B}_{i\to i+1}$ is the basis of convolutions of size $d^2 \times n_i \times n_{i+1}$.

Related Work on Strong Lottery Tickets. Winning SLTs approximate a target ReLU network $f(x)$ by pruning an overparametrized ReLU network $g(x)$ with weights in any given layer drawn i.i.d. from $w_i \sim U([-1, 1])$. Our error metric of choice is the uniform approximation over a unit ball: $\max_{x \in \mathbb{R}^D : \|x\| \le 1} \|f(x) - \hat{g}(x)\| \le \epsilon$, where $\hat{g}(x)$ is the subnetwork constructed from pruning $g(x)$. Let us first consider the case of approximating a single neuron $w_i \in [-1, 1]$ in some layer of $f(x)$ with $n$ i.i.d. samples $X_1, \dots, X_n \sim U([-1, 1])$. If $n = O(1/\epsilon)$ then there exists an $X_i$ that is $\epsilon$-close to $w_i$ (Malach et al., 2020). A similar approximation fidelity can be achieved with an exponentially smaller number of samples by not relying on just a single $X_i$ but instead on a subset whose sum approximates the target weight. Lueker (1998); da Cunha et al. (2022a) proved that $n = O(\log(1/\epsilon))$ random variables are sufficient for the existence of a solution to the random SUBSET-SUM problem (a subset $S \subseteq \{1, \dots, n\}$ such that $|w_i - \sum_{j \in S} X_j| \le \epsilon$). Pensia et al. (2020) utilize the SUBSET-SUM approach for weights on dense networks, resulting in a logarithmic overparametrization of the width of a layer in $g(x)$. To bypass the non-linearity (ReLU), Pensia et al. (2020) decompose the output activation as $\sigma(wx) = w^+ x^+ + w^- x^-$ and approximate each term separately. With no additional assumption on the inputs (da Cunha et al. (2022b) assume positive entries), this approach fails for equivariant networks, as each entry of the output of an equivariant linear map is affected by multiple input entries.

Our approach builds on this line of work, in particular da Cunha et al. (2022b) and Burkholz (2022a). Specifically, we rely on the SUBSET-SUM algorithm (Lueker, 1998) to aid in approximating any given parameter of the target network. Departing from prior work, the main idea used in our technical analysis is to prune an overparametrized equivariant network in a way that preserves equivariance, as applying SUBSET-SUM with the construction of da Cunha et al. (2022b) may destroy the pruned network's equivariance.
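The gap between picking a single sample and picking a subset can be illustrated numerically. The following is a minimal sketch, not the solver used in the paper: it brute-forces all subsets of a small pool of uniform samples, and the helper name `best_subset_sum`, the target value, and the pool size are assumptions chosen for illustration.

```python
import numpy as np

def best_subset_sum(samples, target):
    """Exhaustively find the subset of `samples` whose sum is closest to `target`.

    Returns (mask, error), where mask[i] is True if samples[i] is kept.
    Brute force over all 2^n subsets, so only meant for small n.
    """
    sums = np.zeros(1)
    for x in samples:
        # Subsets without x, followed by the same subsets with x added.
        sums = np.concatenate([sums, sums + x])
    best = int(np.argmin(np.abs(sums - target)))
    # Bit i of the winning index records whether samples[i] is in the subset.
    mask = np.array([(best >> i) & 1 for i in range(len(samples))], dtype=bool)
    return mask, float(abs(target - samples[mask].sum()))

rng = np.random.default_rng(0)
target = 0.3721                        # hypothetical target weight in [-1, 1]
samples = rng.uniform(-1.0, 1.0, 16)   # n = 16 random candidate weights

mask, subset_error = best_subset_sum(samples, target)
single_error = float(np.min(np.abs(samples - target)))  # best single sample
print(f"best single sample error: {single_error:.2e}")
print(f"best subset-sum error:    {subset_error:.2e} using {mask.sum()} samples")
```

Since every single sample is itself a subset, the subset-sum error can never exceed the single-sample error; the logarithmic sample complexity is the much stronger statement proven by Lueker (1998).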

3. SLT FOR GENERAL EQUIVARIANT NETS

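[Figure 1: The "diamond" construction for $n_i = n_{i+1} = 1$. The input feature space $(\mathbb{R}^{D_i}, \rho_i)$ is mapped through $\tilde{n}_i$ identity branches $\lambda^{(i)}_{1\to q,1} \cdot I$ (from $\mathcal{B}_{i\to i} = \{I, b_{i\to i,2}, \dots, b_{i\to i,|\mathcal{B}_{i\to i}|}\}$) into copies of $(\mathbb{R}^{D_i}, \rho_i)$, and then through the maps $\sum_k \mu^{(i)}_{q\to 1,k} b_{i\to i+1,k}$ (with $\mathcal{B}_{i\to i+1} = \{b_{i\to i+1,1}, \dots, b_{i\to i+1,|\mathcal{B}_{i\to i+1}|}\}$) into $(\mathbb{R}^{D_{i+1}}, \rho_{i+1})$, approximating the target $\sum_k \alpha^{(i)}_{1\to 1,k} b_{i\to i+1,k}$.]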

Challenges in Adapting Proof Techniques. There are two major difficulties in adapting the tools first introduced in Pensia et al. (2020) to $G$-steerable networks. In proving the SLTH for dense networks, the relevant parameters that can be pruned are all the parameters of the weight matrices, which can be intuitively understood as pruning in a canonical basis. However, such a strategy immediately fails for $G$-equivariant maps, as the canonical basis is not generally $G$-equivariant; pruning in this basis thus breaks the structure of the network and its equivariance. In fact, as described in Weiler and Cesa (2019), a $G$-equivariant linear map consists of linearly combining the elements of the equivariant basis with learned combination coefficients, which are the effective parameters of the $G$-equivariant model. To preserve equivariance we may only prune these parameters and not any individual weight in $f_i$. However, this introduces a new complication, as the interaction with the ReLU becomes more challenging. da Cunha et al. (2022b) circumvent this in the special case of regular CNNs by assuming only positive inputs. In contrast, our main technical lemma (Lem. 1) introduces a construction that does not require such a restrictive assumption and generalizes the techniques of Burkholz (2022a) to $G$-equivariant networks.

Overparameterized Network Shape. We seek to approximate a single $G$-equivariant layer with two random overparameterized $G$-equivariant layers. We take the input $\|x\| \le 1$ to be in a bounded domain to control the error, which could diverge on unbounded domains. Let $\mathcal{F}_i$ be the set of $G$-equivariant linear maps $F_i^{n_i} \to F_{i+1}^{n_{i+1}}$ of the $i$-th layer in the target network. Then $f_i \in \mathcal{F}_i$ with $|||f_i||| \le 1$ is a specific realization of a target equivariant map that we will approximate, i.e. $f_i(x) = W^f_i x$. Without any loss of generality, let the coefficients of $W^f_i$ be such that $|\alpha_k| \le 1$ when decomposed in the basis $\kappa_{n_i\to n_{i+1}} \otimes \mathcal{B}_{i\to i+1}$.
Concretely, $f_i \in \mathcal{F}_i := \{ W^f_i = \sum_k \alpha_k b_k : b_k \in \kappa_{n_i\to n_{i+1}} \otimes \mathcal{B}_{i\to i+1},\ |\alpha_k| \le 1,\ |||W^f_i||| \le 1 \}$. We can now recursively apply the previous constructions to construct a desired $G$-equivariant target network $f \in \mathcal{F}$ of depth $l \in \mathbb{N}$. Analogously, we can define an atomic unit of our random overparameterized source model $\mathcal{H}_i$ as the set of $G$-equivariant maps with one intermediate feature space (layer) $F_i^{\tilde{n}_i}$ followed by a ReLU. That is, any $h_i \in \mathcal{H}_i$ applied to an input $x$ can be written as $h_i(x) = W^h_{2i+1} \sigma(W^h_{2i} x)$. In our construction, we choose $W^h_{2i}$ whose equivariant basis is $\kappa_{n_i\to\tilde{n}_i} \otimes \mathcal{B}_{i\to i}$, where $\tilde{n}_i$ is the overparametrization factor of the $i$-th layer. We assume $\mathcal{B}_{i\to i}$ contains the identity element, which is trivially equivariant. The basis coefficients of $W^h_{2i}$ are written as $\lambda^{(i)}_{p\to q,k}$, which refers to the coefficient of the $k$-th basis element in $\mathcal{B}_{i\to i}$ for the map between the $p$-th block of $F_i^{n_i}$ and the $q$-th block of $F_i^{\tilde{n}_i}$. Similarly, $W^h_{2i+1}$ can be decomposed in the basis $\kappa_{\tilde{n}_i\to n_{i+1}} \otimes \mathcal{B}_{i\to i+1}$ with coefficients $\mu^{(i)}_{p\to q,k}$. Fig. 1 illustrates this construction after pruning the first layer for $n_i = n_{i+1} = 1$, which leads to a "diamond" shape. We can finally apply the previous construction to build an overparametrized network $h \in \mathcal{H}$ of depth $2l$. We summarize all the notation used in the rest of the paper in Tab. 1.

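| G-equivariant map | Basis | Basis coefficients |
|---|---|---|
| $W^h_{2i} : F_i^{n_i} \to F_i^{\tilde{n}_i}$ | $\kappa_{n_i\to\tilde{n}_i} \otimes \mathcal{B}_{i\to i}$ | $\lambda^{(i)}_{p\to q,k}$, $p \in [n_i]$, $q \in [\tilde{n}_i]$, $k \in [\lvert\mathcal{B}_{i\to i}\rvert]$ |
| $W^h_{2i+1} : F_i^{\tilde{n}_i} \to F_{i+1}^{n_{i+1}}$ | $\kappa_{\tilde{n}_i\to n_{i+1}} \otimes \mathcal{B}_{i\to i+1}$ | $\mu^{(i)}_{p\to q,k}$, $p \in [\tilde{n}_i]$, $q \in [n_{i+1}]$, $k \in [\lvert\mathcal{B}_{i\to i+1}\rvert]$ |
| $W^f_i : F_i^{n_i} \to F_{i+1}^{n_{i+1}}$ | $\kappa_{n_i\to n_{i+1}} \otimes \mathcal{B}_{i\to i+1}$ | $\alpha^{(i)}_{p\to q,k}$, $p \in [n_i]$, $q \in [n_{i+1}]$, $k \in [\lvert\mathcal{B}_{i\to i+1}\rvert]$ |

Table 1: Summary of notation used to decompose each $G$-equivariant map in the source and target networks.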
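To make the Kronecker-product basis concrete, here is a small sketch for the cyclic group acting on $\mathbb{R}^4$ by shifts. All sizes and names (`D`, `n_in`, `n_out`, `kappa`, `full_basis`) are illustrative assumptions. Because the same representation is used on input and output, the commutation check below does not depend on whether maps act from the left or the right.

```python
import numpy as np
from itertools import product

D, n_in, n_out = 4, 2, 3   # hypothetical sizes: D_i = D_{i+1} = 4, n_i = 2, n_{i+1} = 3

shift = np.roll(np.eye(D), 1, axis=0)   # rho(g) for the generator g of the cyclic group
# Powers of the shift span the linear maps commuting with it (circulant matrices).
block_basis = [np.linalg.matrix_power(shift, k) for k in range(D)]

def kappa(p, q):
    """Matrix in R^{n_in x n_out} with a single 1 at position (p, q)."""
    m = np.zeros((n_in, n_out))
    m[p, q] = 1.0
    return m

# Full basis kappa (x) B: one element per triple (p, q, k).
full_basis = [np.kron(kappa(p, q), b)
              for p, q in product(range(n_in), range(n_out))
              for b in block_basis]

# Stacked representations act block-diagonally on the n_in / n_out copies.
rho_in = np.kron(np.eye(n_in), shift)
rho_out = np.kron(np.eye(n_out), shift)

def is_equivariant(W):
    return np.allclose(rho_in @ W, W @ rho_out)

rng = np.random.default_rng(0)
coeffs = rng.uniform(-1.0, 1.0, len(full_basis))   # effective parameters
mask = rng.integers(0, 2, len(full_basis))         # a pruning mask on the coefficients
W_pruned = sum(c * m * b for c, m, b in zip(coeffs, mask, full_basis))

rank = np.linalg.matrix_rank(np.stack([b.ravel() for b in full_basis]))
print(len(full_basis), rank, is_equivariant(W_pruned))
```

The number of effective parameters is $n_i n_{i+1} |\mathcal{B}_{i\to i+1}|$, and any mask applied to the coefficients leaves the map equivariant.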

3.1. THEORETICAL RESULTS

We first prove Lemma 1, which states that with high probability a random overparametrized $G$-equivariant network of depth $l = 2$ (Fig. 1) can $\epsilon$-approximate any target map in $\mathcal{F}_i$ via pruning.

Lemma 1. Let $h_i \in \mathcal{H}_i$ be a random overparametrized $G$-equivariant network as defined above, with coefficients $\lambda^{(i)}_{p\to q,k}$ and $\mu^{(i)}_{p\to q,k}$ drawn from $U([-1, 1])$. Further suppose that each $\tilde{n}_i = C_1 n_i \log\left(\frac{n_i n_{i+1} \max(|\mathcal{B}_{i\to i+1}|,\ |||\mathcal{B}_{i\to i+1}|||)}{\min(\epsilon, \delta)}\right)$, where $C_1$ is a constant. Then, with probability $1 - \delta$, for every target $G$-equivariant layer $f_i \in \mathcal{F}_i$, one can find two pruning masks $S_{2i}, S_{2i+1}$ on the coefficients $\lambda^{(i)}_{p\to q,k}$ and $\mu^{(i)}_{p\to q,k}$ respectively such that:

$\max_{x \in \mathbb{R}^{D_i \times n_i},\ \|x\| \le 1} \|(S_{2i+1} \odot W^h_{2i+1})\, \sigma((S_{2i} \odot W^h_{2i}) x) - f_i(x)\| \le \epsilon.$

Proof sketch. We prune all non-identity coefficients of the basis decomposition of the first layer, obtaining a "diamond" shape (see Fig. 1 for $n_i = n_{i+1} = 1$) that allows us to bypass the pointwise ReLU. The two layers can now be used to approximate every weight of the target by solving independent SUBSET-SUM problems on the coefficients of the second layer. The full proof is provided in §B.1.

To approximate any $f \in \mathcal{F}$, which is a $G$-equivariant target network of depth $l$ and fixed width, we can now apply Lemma 1 $l$ times to obtain our main theorem, whose proof is provided in §B.2.

Theorem 1. Let $h \in \mathcal{H}$ be a random overparametrized $G$-equivariant network with coefficients $\lambda^{(i)}_{p\to q,k}$ and $\mu^{(i)}_{p\to q,k}$, for $i \in [l]$ and indices $p, q, k$ as defined in Table 1, all drawn from $U([-1, 1])$. Suppose that $\tilde{n}_i = C_2 n_i \log\left(\frac{n_i n_{i+1} \max(|\mathcal{B}_{i\to i+1}|,\ |||\mathcal{B}_{i\to i+1}|||)\, l}{\min(\epsilon, \delta)}\right)$, where $C_2$ is a constant. Then with probability $1 - \delta$, for every $f \in \mathcal{F}$, one can find a collection of pruning masks $S_{2l-1}, \dots, S_0$ on the coefficients $\lambda^{(i)}_{p\to q,k}$ and $\mu^{(i)}_{p\to q,k}$ for every layer $i \in [l]$ such that:

$\max_{x \in \mathbb{R}^{D_0 \times n_0},\ \|x\| \le 1} \|(S_{2l-1} \odot W^h_{2l-1})\, \sigma \dots \sigma((S_0 \odot W^h_0) x) - f(x)\| \le \epsilon.$

We recover a similar overparametrization as Pensia et al. (2020) with respect to the width of $h$. However, the significant improvement provided by this result is that, since we do not prune dense nets but $G$-equivariant ones, the number of effective parameters in the overparametrized network is a factor $|\mathcal{B}_{i\to i+1}| / (D_i D_{i+1})$ of that of a dense net of the same width. In §3.2 we make this difference explicit and show that Theorem 1 is optimal up to log factors not only with respect to the tolerance $\epsilon$ but also with respect to $|\mathcal{B}_{i\to i+1}| / (D_i D_{i+1})$, which quantifies the expressiveness of $G$-equivariant networks.
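The way the "diamond" shape in the proof sketch of Lemma 1 bypasses the ReLU can be checked numerically. The sketch below is an illustration under simplifying assumptions rather than the paper's construction in full: it uses a single basis element (a random matrix `b` stands in for $b_{i\to i+1,k}$, since the identity holds for any linear map) and hand-picked coefficients instead of a SUBSET-SUM solution. The key observation is that $\sigma(\lambda x) = \lambda x^+$ for $\lambda > 0$ and $\sigma(\lambda x) = \lambda x^-$ for $\lambda < 0$, so the products $\mu_q \lambda_q$ over positive and over negative $\lambda_q$ must each sum to the target coefficient.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def diamond(x, b, lambdas, mus):
    """Two-layer map sum_q mu_q * b @ relu(lambda_q * x).

    The first layer keeps only identity coefficients lambda_q * I; the second
    layer applies mu_q times the basis element b to each intermediate block.
    """
    return sum(mu * (b @ relu(lam * x)) for lam, mu in zip(lambdas, mus))

rng = np.random.default_rng(0)
b = rng.normal(size=(5, 5))    # stand-in for one basis element (assumption)
alpha = 0.7                    # hypothetical target coefficient

# Hand-picked coefficients in [-1, 1]; in the paper they are random and the
# kept subset is found by solving SUBSET-SUM problems.
lambdas = np.array([0.8, 0.5, -0.8, -0.4])
mus = np.array([0.5, 0.6, -0.5, -0.75])

pos = lambdas > 0
coef_pos = float(np.sum(mus[pos] * lambdas[pos]))     # multiplies x^+ = max(0, x)
coef_neg = float(np.sum(mus[~pos] * lambdas[~pos]))   # multiplies x^- = min(0, x)

x = rng.normal(size=5)         # signed input, no positivity assumption
approx = diamond(x, b, lambdas, mus)
exact = alpha * (b @ x)
print(coef_pos, coef_neg, np.max(np.abs(approx - exact)))
```

Both partial sums equal the target coefficient here, so the two-layer ReLU map reproduces the linear map $\alpha\, b\, x$ on inputs of arbitrary sign.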

3.2. LOWER BOUND ON THE OVERPARAMETRIZATION

When searching for equivariant winning tickets, a natural question that arises is the optimality of the overparametrization factor $\tilde{n}_i$ with respect to the tolerance $\epsilon$. In the same vein as Pensia et al. (2020) for MLPs, we now prove under mild assumptions that, in the equivariant setting, $\tilde{n}_i$ is indeed optimal (Theorem 2). We will assume that our equivariant basis $\mathcal{B}_{i\to i+1}$ has the following property: for all $f_i \in \mathrm{Span}(\mathcal{B}_{i\to i+1})$ with $f_i = \sum_k \alpha_k b_{i\to i+1,k}$ we have $|||f_i||| \le 1 \implies |\alpha_k| \le 1$ for all $k \ge 0$. Note that this can be obtained by a rescaling of the basis elements. Lastly, we also assume the existence of positive constants $M_1$ and $M_2$ such that $|\mathcal{B}_{i\to i}| \le M_1 |\mathcal{B}_{i\to i+1}|$ and $n_i \le M_2 n_{i+1}$. These assumptions are relatively mild and hold in the practical situations described in Tab. 2 (cf. §B.3 for details). Under these assumptions we achieve the following (tight) lower bound.

Theorem 2. Let $\hat{h}_i$ be a network with $\Theta$ parameters such that, for all $f_i \in \mathcal{F}_i$, there exists $S_i \in \{0, 1\}^{\Theta}$ such that

$\max_{x \in \mathbb{R}^{D_i \times n_i},\ \|x\| \le 1} \|(S_i \odot \hat{h}_i)(x) - f_i(x)\| \le \epsilon.$

Then $\Theta$ is at least $\Omega\left(n_i n_{i+1} |\mathcal{B}_{i\to i+1}| \log\left(\frac{1}{\epsilon}\right)\right)$ and $\tilde{n}_i$ is at least $\Omega\left(n_i \log\frac{1}{\epsilon}\right)$ in Theorem 1.

Proof Idea. The full proof is provided in §B.3 and relies on a counting argument to compare the number of pruning masks and functions in $\mathcal{F}_i$ within a distance of at least $2\epsilon$ of each other.

Thm. 2 dictates that if we wish to approximate a $G$-equivariant target network to $\epsilon$-tolerance by pruning an arbitrary overparametrized network, the latter must have at least $\Omega\left(n_i n_{i+1} |\mathcal{B}_{i\to i+1}| \log\left(\frac{1}{\epsilon}\right)\right)$ parameters. Applying the above result to our prescribed overparametrization scheme in Thm. 1, we find that our proposed strategy is optimal with respect to $\epsilon$ and almost optimal with respect to $|\mathcal{B}_{i\to i+1}|$. We incur a small extra log factor whose origin is discussed in §B.3. In the equivariant setting, the result in Pensia et al. (2020) is far from optimal, as their result gives guarantees on the pruning of dense nets with a similar width as the $G$-equivariant targets, which incurs an increase by a factor $D_i D_{i+1} / |\mathcal{B}_{i\to i+1}|$ in the number of parameters. As a specific example, for overparametrized $G$-steerable networks (Tab. 2), we have $D_i D_{i+1} / |\mathcal{B}_{i\to i+1}| = d^2 |G|$. On images of shape $\mathbb{R}^{224 \times 224 \times 3}$ with $G = C_8$, this corresponds to a factor of $\approx 4 \cdot 10^5$ fewer "effective" parameters than a dense network. Finally, we note that Thm. 2 makes no statement on which overparametrization strategy achieves such a lower bound. Remarkably, the pruning strategy prescribed by Thm. 1 recovers this optimal lower bound on $\tilde{n}_i$, meaning that, unsurprisingly, $G$-equivariant nets are the most suitable structure to prune.
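The two quantities discussed above are easy to evaluate. The sketch below computes the ratio $d^2 |G|$ for the quoted example and the width prescribed by Theorem 1. The constant `C`, the channel counts, the value passed as `basis_term` (standing for $\max(|\mathcal{B}_{i\to i+1}|, |||\mathcal{B}_{i\to i+1}|||)$), and the use of the natural logarithm are assumptions made for illustration; a different log base only rescales the constant.

```python
import math

def overparam_width(C, n_in, n_out, basis_term, depth, eps, delta):
    """Width C * n_i * log(n_i * n_{i+1} * basis_term * l / min(eps, delta)).

    Returned as a float; in practice one would round up to an integer.
    """
    return C * n_in * math.log(n_in * n_out * basis_term * depth / min(eps, delta))

def dense_over_steerable(d, group_order):
    """Ratio D_i * D_{i+1} / |B_{i -> i+1}| = d^2 * |G| quoted for G-steerable nets."""
    return d * d * group_order

ratio = dense_over_steerable(d=224, group_order=8)   # 224 x 224 images, G = C_8
print(ratio)   # 401408, i.e. about 4e5

w1 = overparam_width(C=5, n_in=16, n_out=16, basis_term=100, depth=4, eps=1e-2, delta=1e-2)
w2 = overparam_width(C=5, n_in=16, n_out=16, basis_term=100, depth=4, eps=1e-3, delta=1e-2)
# Tightening eps tenfold adds only C * n_in * log(10) to the width.
print(w2 - w1)
```

The second print illustrates the logarithmic dependence on the tolerance: each tenfold reduction of $\epsilon$ adds the same constant amount to the width.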

4. SLT FOR SPECIFIC CHOICES OF G

In this section, we turn our focus to specific instantiations of our main theoretical results for different choices of groups. To apply Theorem 1, one simply needs to specify the group G, the group representation ρ(g), and finally the feature space F. For instance, we can immediately recover the results for dense networks (Pensia et al., 2020) by noticing G = {e} is the trivial group with a trivial action on R D (see the proof in §C). In Table 2 below we highlight different G-equivariant architectures through the framework provided in §3 before proving each setting in the remainder of the section.

4.1. A CASE STUDY WITH CNNS

As a warmup, let us consider the case of vanilla CNNs that possess translation symmetry. In this case, $G = (\mathbb{Z}^2, +)$ is the group of translations of the plane and $D_i = d^2$, where $d^2$ is the size of a feature map at layer $i$. Finally, $\rho_i$ acts on the feature space $\mathbb{R}^{d^2}$ by translating the coordinates of a point in the plane. The equivariant basis of $f_i$ in this setting (what we denoted $\mathcal{B}_{i\to i+1}$ in the general case) consists of convolutions with kernels $K^f_i \in \mathbb{R}^{d^2 \times n_i \times n_{i+1}}$ that are built using the canonical basis, where $n_i$ and $n_{i+1}$ are the numbers of input and output channels. We can apply Thm. 1 to achieve Cor. 4 (see §D for details), which recovers Burkholz (2022a, Thm. 3.1) and is a strict generalization of the result by da Cunha et al. (2022b).

| Architecture | $G$ | $\rho_i$ | $F_i$ | $\lvert\mathcal{B}_{i\to i+1}\rvert \vee \lvert\lvert\lvert\mathcal{B}_{i\to i+1}\rvert\rvert\rvert$ |
|---|---|---|---|---|
| MLP | $\{e\}$ | trivial | $\mathbb{R}$ | $1$ |
| CNN | $(\mathbb{Z}^2, +)$ | $f_i(x - t)$ | $(\mathbb{R}^{d^2}, \rho_i)$ | $d^2$ |
| E(2)-CNN | $(\mathbb{Z}^2, +) \rtimes O(2)$ | $\rho_{\mathrm{reg}}(g) f_i(g^{-1}(x - t))$ | $(\mathbb{R}^{d^2 \times \lvert G\rvert^2}, \rho_i)$ | $d^2 \lvert G\rvert^3$ |
| Permutation | $S(n)$ | $X_{i_{\sigma(1)}, \dots, i_{\sigma(k)}, j}$ | $(\mathbb{R}^{n^{k_i}}, \rho_i)$ | $b(k_i + k_{i+1}) \vee (n^{k_i} + 1)$ |

Table 2:
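The difference between pruning in the equivariant basis and pruning individual weights can be seen on a toy translation-equivariant layer. The sketch below is a one-dimensional simplification of the CNN setting, with circular boundary conditions (an assumption made so that the translation group is finite); the kernel values are hypothetical.

```python
import numpy as np

n = 6
shift = np.roll(np.eye(n), 1, axis=0)   # cyclic translation by one position

def conv_matrix(theta):
    """Circular convolution written as a matrix: sum_k theta[k] * shift^k."""
    return sum(t * np.linalg.matrix_power(shift, k) for k, t in enumerate(theta))

def is_translation_equivariant(W):
    return np.allclose(shift @ W, W @ shift)

theta = np.array([0.9, -0.4, 0.7])      # hypothetical kernel of size 3
W = conv_matrix(theta)

# Pruning a kernel coefficient (a coefficient in the equivariant basis).
W_coeff_pruned = conv_matrix(theta * np.array([1, 0, 1]))

# Pruning a single entry of the dense matrix (a weight in the canonical basis).
W_entry_pruned = W.copy()
W_entry_pruned[0, 0] = 0.0

print(is_translation_equivariant(W))               # True
print(is_translation_equivariant(W_coeff_pruned))  # True
print(is_translation_equivariant(W_entry_pruned))  # False
```

Masking a basis coefficient yields another convolution, whereas zeroing a single matrix entry destroys the circulant structure and with it the equivariance.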

4.2. SLT FOR E(2) STEERABLE NETS

The Euclidean group $E(2)$ is the group of isometries of the plane $\mathbb{R}^2$ and is defined as the semi-direct product between the translation and orthogonal groups of two dimensions, $(\mathbb{R}^2, +) \rtimes O(2)$, with elements $(t, g) \in E(2)$ being shifts and planar rotations or flips. The most general method to build equivariant networks for $E(2)$ is in the framework of steerable $G$-CNNs, where filters are designed to be steerable with respect to the action of $G$ (Cohen and Welling, 2017; Weiler et al., 2018). Concretely, steerable feature fields associate a $D$-dimensional feature vector to each point in a base space, $f : \mathbb{R}^2 \to \mathbb{R}^D$, and transform according to their induced representation $\mathrm{Ind}_G^{(\mathbb{R}^2 \rtimes G)} \rho$:

$f(x) \mapsto \mathrm{Ind}_G^{(\mathbb{R}^2 \rtimes G)} \rho\,(tg) \cdot f(x) := \rho(g) \cdot f(g^{-1}(x - t)).$

Clearly, an RGB image (a scalar field) transforms according to the trivial representation $\rho(g) = 1$, $\forall g \in G$, but intermediate layers may transform according to other representation types such as the regular representation. As proven in Cohen et al. (2019), any equivariant linear map between steerable feature spaces transforming under $\rho_i$ and $\rho_{i+1}$ must be a group convolution with $G$-steerable kernels satisfying the following constraint:

$\pi_i(g x) = \rho_{i+1}(g)\, \pi_i(x)\, \rho_i(g^{-1}) \quad \forall g \in G,\ x \in \mathbb{R}^2.$

An equivariant basis is then composed of convolutions with a basis of equivariant kernels that we compute next. One of the key ingredients needed to apply Theorem 1 is the availability of an equivariant basis with an identity element. One could in principle always take an existing equivariant basis, such as the one provided by Weiler and Cesa (2019), and include an identity element by replacing the first basis element, resulting in another equivariant basis with probability 1. In what follows, we show the generality of Theorem 1 by constructing a different equivariant basis from first principles via the canonical basis and then symmetrizing using the action of $G \le O(2)$.
As we show in our experiments, we can find winning tickets for both bases with negligible difference in performance.

Classification of Equivariant Maps for E(2). We now seek to precisely characterize which kernels satisfy the equivariance constraint. Let $\mathcal{R}$ be the equivalence relation on $\mathbb{R}^2$,

$\mathcal{R} := \forall (x, y) \in \mathbb{R}^2 \times \mathbb{R}^2,\ x \sim y \iff \exists g \in G \text{ such that } y = g \cdot x. \quad (5)$

The equivalence class of $x \in \mathbb{R}^2$, denoted $\mathcal{O}(x)$, is the orbit of $x$ under the action of $G$ on $\mathbb{R}^2$. Designate $\mathcal{A}_{\mathcal{R}} = \mathbb{R}^2 / \mathcal{R} \subset \mathbb{R}^2$ a set of representatives. Due to the equivariance constraint on the kernels $\pi(\cdot)$, once the value of $\pi(x)$ is chosen, it automatically fixes $\pi(g \cdot x)$ for $g \cdot x \in \mathcal{O}(x)$. Note that because $|\mathcal{O}(x)| = |G|$, all possible initial matrices in $\mathbb{R}^{|G| \times |G|}$ can be chosen at a point $x \ne 0$.

Remark. In practice, $G$-steerable equivariant networks do not operate on signals in $\mathbb{R}^2$ but on a fixed size pixelized grid $\{1, 2, \dots, d\}^2$, denoted as $[d]^2 \subset \mathbb{Z}^2$. Henceforth, we consider all our target networks as well as the overparameterized $G$-steerable network to be defined on input signals sampled on $[d]^2$. In appendix §E.3 we highlight two practical challenges that result from such a discretization; crucially, these disrupt neither our subsequent theory nor our pruning techniques.

Computing $\mathcal{B}$. To explicitly build a basis of the $G$-equivariant layers, it is illustrative to first consider the case of a single input-output pair of representations for a layer, i.e. $n_i = n_{i+1} = 1$.

[Figure 2: Constructing $K^{2,3}_{G,x} \in \mathcal{B}_x$ for $C_8$. The canonical basis element $\kappa^{2,3}_0$ is placed at a representative $x \in \mathcal{A}_{\mathcal{R}}$ and transported along the orbit $\mathcal{O}_x = \{x, g \cdot x, \dots, g^7 \cdot x\}$ via $\rho_{i+1}(g^k)\, \kappa^{2,3}_0\, \rho_i(g^{-k})$, giving $K^{2,3}_{G,x}(y) = \sum_{g \in G} \rho_{i+1}(g) K^{2,3}_{0,x}(g^{-1} y) \rho_i(g^{-1})$.]

We must first construct a basis of the equivariant kernels in our domain. Let $\mathcal{B} = \{\mathcal{B}_x, x \in \mathcal{A}_{\mathcal{R}}\}$ be a basis of equivariant kernels over the domain, where each $\mathcal{B}_x \subset \mathbb{R}^{d \times d \times |G| \times |G|}$.
A single basis element $b \in \mathcal{B}_x$ can be constructed by considering the canonical basis $\kappa_0 \subset \mathbb{R}^{|G| \times |G|}$ at each location $x \in \mathcal{A}_{\mathcal{R}}$ and evaluating it under the action of the group. One can freely choose both a starting point $x \in \mathcal{A}_{\mathcal{R}}$ and an element of the canonical basis $\kappa^{p,q}_0$. Let $K^{p,q}_{0,x} \in \mathbb{R}^{d \times d \times |G| \times |G|}$, $\forall (p, q) \in [|G|] \times [|G|]$, be the tensor of $\kappa^{p,q}_0$ stacked across the grid, i.e. it is 0 everywhere except at the index $(x, p, q)$ where it is 1. Then to get the equivariant basis we symmetrize by acting on $K^{p,q}_{0,x}$ while enforcing the equivariance constraint:

$\forall y \in [d]^2, \quad b(y) := K^{p,q}_{G,x}(y) = \sum_{g \in G} \rho_{i+1}(g)\, K^{p,q}_{0,x}(g^{-1} y)\, \rho_i(g^{-1}).$

Repeating this procedure for all elements $\kappa^{p,q}_0 \in \mathbb{R}^{|G| \times |G|}$ in the canonical basis completes the construction of our basis $\mathcal{B}_x = \{K^{p,q}_{G,x},\ p, q \in [|G|]\}$. We finally obtain a basis of the equivariant kernels as $\mathcal{B} = \bigcup_{x \in \mathcal{A}_{\mathcal{R}}} \mathcal{B}_x$. A $G$-steerable expanded kernel $K$ is then simply a linear combination with learned weights $\theta = [\theta_1, \dots, \theta_{|\mathcal{B}|}]$, one for each basis element: $K(x) = \sum_{k=1}^{|\mathcal{B}|} \theta_k b_k(x)$. In contrast, a standard convolution kernel has shape $\mathbb{R}^{d \times d \times c_i \times c_{i+1}}$, which means that the equivalent input/output channels for $G$-steerable convolutions are $c_i = |G| \times n_i$ and $c_{i+1} = |G| \times n_{i+1}$ respectively. Fig. 2 illustrates the above process for a basis element for the $C_8$ group. Equipped with this basis, which has an identity element at the origin (§E.2), we can now apply Thm. 1 to get:

Corollary 1. Let $h \in \mathcal{H}$ be a random $G$-steerable CNN with regular representation of depth $2l$, i.e., $h(x) = K^h_{2l-1} * \sigma \dots \sigma(K^h_0 * x)$, where $K^h_{2i} \in \mathbb{R}^{d^2 \times |G|^2 \times n_i \times \tilde{n}_i}$ and $K^h_{2i+1} \in \mathbb{R}^{d^2 \times |G|^2 \times \tilde{n}_i \times n_{i+1}}$ are equivariant kernels whose decompositions in $\mathcal{B}$ have coefficients drawn from $U([-1, 1])$. If $\tilde{n}_i = C_3 n_i \log\left(\frac{n_i n_{i+1} d^2 |G|^3 l}{\min(\epsilon, \delta)}\right)$, then with probability at least $1 - \delta$ we have that for all $f \in \mathcal{F}$ (whose kernels $K^f_i$ have parameters less than 1, and with $|||f_i||| \le 1$) there exists a collection of pruning masks $S_{2l-1}, \dots, S_0$ such that, defining $\hat{K}^h_i$ as the kernel associated with $S_i \odot W^h_i$,

$\max_{x \in \mathbb{R}^{d^2 \times n_0},\ \|x\| \le 1} \|\hat{K}^h_{2l-1} * \sigma \dots \sigma(\hat{K}^h_0 * x) - f(x)\| \le \epsilon.$

In Appendix §E, we compute $\max(|\mathcal{B}_{i\to i+1}|, |||\mathcal{B}_{i\to i+1}|||)$, which leads to the corollary above.
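The symmetrization step used to build each basis kernel can be sketched for the smallest interesting case. The code below is an illustration for $C_4$ with regular representations on both sides, not the paper's implementation: the grid size, the starting point, the canonical index $(p, q)$, and the orientation convention of `np.rot90` for the generator are all assumptions. The sum runs over the whole group, so the result is a fixed point of the group action whichever orientation is chosen.

```python
import numpy as np

G_ORDER = 4   # cyclic group C_4 of rotations by multiples of 90 degrees

def rho(k):
    """Regular representation of C_4: permutation matrix of a cyclic shift by k."""
    return np.roll(np.eye(G_ORDER), k, axis=0)

def act(K, k):
    """Act with the k-th rotation on a kernel K of shape (d, d, |G|, |G|).

    The grid is rotated with np.rot90 and the matrix at each location is
    conjugated by rho(k), mirroring rho_{i+1}(g) K(g^{-1} y) rho_i(g^{-1}).
    """
    rotated = np.rot90(K, k, axes=(0, 1))
    return np.einsum("ab,xybc,cd->xyad", rho(k), rotated, rho(k).T)

def symmetrize(K0):
    """Sum of the group action over all elements of C_4."""
    return sum(act(K0, k) for k in range(G_ORDER))

d, x, p, q = 5, (0, 1), 2, 3   # hypothetical grid size, start point, canonical index
K0 = np.zeros((d, d, G_ORDER, G_ORDER))
K0[x[0], x[1], p, q] = 1.0     # canonical element placed at location x

K_G = symmetrize(K0)
print(np.allclose(act(K_G, 1), K_G))   # the equivariance constraint holds
print(np.count_nonzero(K_G))           # one entry per point of the orbit of x
```

For a starting point away from the grid center, the orbit has $|G| = 4$ points and the symmetrized kernel has exactly one non-zero entry at each of them.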

4.3. SLT FOR PERMUTATION EQUIVARIANT NETS

The symmetric group $S_n$ consists of all permutations that can be enacted on a set of cardinality $n$. The action of $S_n$ on a tensor $X \in \mathbb{R}^{n^k \times m}$ is defined by permuting all but the last index: $(g \cdot X)_{i_1, \dots, i_k, j} = X_{g^{-1}(i_1), \dots, g^{-1}(i_k), j}$, $\forall g \in S_n$. Any general linear permutation equivariant map $W_i : \mathbb{R}^{n^{k_i}} \to \mathbb{R}^{n^{k_{i+1}}}$ must satisfy the following fixed point equation: $P^{\otimes(k_i + k_{i+1})} \mathrm{Vec}(W_i) = \mathrm{Vec}(W_i)$, where $P^{\otimes(k_i + k_{i+1})}$ is the $(k_i + k_{i+1})$-th Kronecker power of a permutation matrix $P$ (Maron et al., 2019). General permutation equivariant networks are the concatenation of linear equivariant layers followed by pointwise non-linearities, which aligns with the setting needed to apply Theorem 1.

Classification of all Linear Permutation Equivariant Maps. In Maron et al. (2019), the authors solve the above fixed point equation by first defining the equivalence relation $Q$ on $[n]^{k_i + k_{i+1}}$ as:

$Q := \forall a, b \in [n]^{k_i + k_{i+1}},\ a \sim b \iff (\forall i, j \in [k_i + k_{i+1}],\ a_i = a_j \iff b_i = b_j). \quad (8)$

Now for all $\mu \in [n]^{k_i + k_{i+1}} / Q$ define the matrix $B^{\mu} \in \mathbb{R}^{n^{k_i} \times n^{k_{i+1}}}$ such that each entry $B^{\mu}_{a,b} = \mathbb{1}_{(a,b) \in \mu}$. Then a basis for equivariant maps is $\mathcal{B}_{i\to i+1} = \{B^{\mu},\ \mu \in [n]^{k_i + k_{i+1}} / Q\}$. The cardinality of this basis, $|\mathcal{B}_{i\to i+1}| = b(k_i + k_{i+1})$, is known as the $(k_i + k_{i+1})$-th Bell number and can be understood as the number of ways to partition a set of $k_i + k_{i+1}$ elements. When $k_i = k_{i+1}$, the identity element is not in the basis; we therefore replace $B^{(1, \dots, 1)}$ by $\sum_{a \in [n]^k / Q} B^{(a,a)} = I$, which still yields a basis. We are now in a position to apply Theorem 1 to permutation equivariant networks.

Corollary 2. Let $h \in \mathcal{H}$ be a random permutation equivariant network of depth $2l$, i.e., $h(x) = W^h_{2l-1} \sigma \dots \sigma(W^h_0 x)$, where $W^h_{2i} \in \mathbb{R}^{n^{k_i} \times n_i \times n^{k_i} \times \tilde{n}_i}$ and $W^h_{2i+1} \in \mathbb{R}^{n^{k_i} \times \tilde{n}_i \times n^{k_{i+1}} \times n_{i+1}}$ are equivariant layers whose decompositions in $\mathcal{B}$ have coefficients drawn from $U([-1, 1])$. If $\tilde{n}_i = C_2 n_i \log\left(\frac{n_i n_{i+1} \max(b(k_i + k_{i+1}),\ n^{k_i} + 1)\, l}{\min(\epsilon, \delta)}\right)$, then with probability at least $1 - \delta$ we have that for all $f \in \mathcal{F}$ (with $|||f_i||| \le 1$ and parameters in the basis less than 1) there exists a collection of pruning masks $S_{2l-1}, \dots, S_0$ on the decomposition of the layers in the equivariant basis s.t.

$\max_{x \in \mathbb{R}^{n^{k_0} \times n_0},\ \|x\| \le 1} \|(S_{2l-1} \odot W^h_{2l-1})\, \sigma \dots \sigma((S_0 \odot W^h_0) x) - f(x)\| \le \epsilon. \quad (9)$

We discuss the computation of $|||\mathcal{B}_{i\to i+1}|||$ and provide the detailed proof in Appendix §F.1.

Message Passing GNNs. MPGNNs are networks that act on graphs with $n$ nodes by defining a feature vector for each node, which is updated based on "messages" received from its neighbors that are then combined. Given a node $v$ in a graph and its hidden representation $x^v_i$, the message passing update for a layer $i$ is governed by the following equation: $x^v_i = f^{\mathrm{up}}_i\left(x^v_{i-1},\ \sum_{u \in \mathcal{N}(v)} f^{\mathrm{agg}}_i(x^v_{i-1}, x^u_{i-1})\right)$. In its most general form, the aggregation function $f^{\mathrm{agg}}_i$ and the update function $f^{\mathrm{up}}_i$ are taken to be MLPs. In this case it is easy to see that Theorem 1 can be applied independently to $f^{\mathrm{agg}}_i$ and $f^{\mathrm{up}}_i$, as MLPs are captured under $G = \{e\}$. Permutation invariance/equivariance is trivially maintained in the pruned network: the aggregate function operates on a local neighborhood of $v$, and pruning does not impose any ordering over the nodes or the adjacency matrix of the graph.
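The classification of linear permutation equivariant maps via equality patterns can be reproduced by direct enumeration for small cases. The sketch below is illustrative and uses hypothetical helper names; it counts the equivalence classes of $Q$ (which matches the Bell number once $n \ge k_i + k_{i+1}$) and checks equivariance of the resulting basis for $k_i = k_{i+1} = 1$.

```python
import numpy as np
from itertools import product, permutations

def equality_pattern(index):
    """Canonical label of the class of `index` under Q: which entries coincide."""
    return tuple(index.index(v) for v in index)

def equivalence_classes(n, order):
    """Partition [n]^order into classes of multi-indices sharing an equality pattern."""
    classes = {}
    for index in product(range(n), repeat=order):
        classes.setdefault(equality_pattern(index), []).append(index)
    return list(classes.values())

def basis_matrices(n):
    """Basis matrices B^mu for equivariant maps R^n -> R^n (k_i = k_{i+1} = 1)."""
    mats = []
    for cls in equivalence_classes(n, 2):
        B = np.zeros((n, n))
        for a, b in cls:
            B[a, b] = 1.0
        mats.append(B)
    return mats

def is_permutation_equivariant(W):
    n = W.shape[0]
    for perm in permutations(range(n)):
        P = np.eye(n)[list(perm)]       # permutation matrix
        if not np.allclose(P @ W, W @ P):
            return False
    return True

# Number of classes for k_i + k_{i+1} = 2, 3, 4 with n = 4: the Bell numbers 2, 5, 15.
print([len(equivalence_classes(4, m)) for m in (2, 3, 4)])
print([is_permutation_equivariant(B) for B in basis_matrices(3)])
```

For $k_i = k_{i+1} = 1$ the two basis elements are the identity and the off-diagonal all-ones matrix, so here the identity already belongs to the basis.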

5. EXPERIMENTS

We substantiate our equivariant framework for finding winning SLTs by approximating target $G$-steerable networks, MPGNNs, and $k$-order GNNs on standard image classification, node classification, and graph classification tasks respectively. For steerable networks we consider $G \in \{C_4, C_8, D_4\}$, which are finite subgroups of $O(2)$. To show the generality of our framework, we experiment with two different equivariant bases for $E(2)$: the first one uses spherical harmonics and is taken from Weiler and Cesa (2019) (DEFAULT), while the second is the one we introduce in §4.2 (OURS). MPGNNs and $k$-order GNNs naturally operate on $S_n$, where permutation invariance is with respect to the node labels of a given graph. For $E(2)$-steerable networks, we experiment with the Rotation-MNIST and FlipRotation-MNIST datasets, which contain data augmentations from $G \le SO(2)$ and $G \le O(2)$ respectively (Weiler and Cesa, 2019). To evaluate MPGNNs and $k$-order GNNs we consider standard node classification benchmarks on the citation networks Cora and CiteSeer (Sen et al., 2008) and the real-world graph classification datasets Proteins and NCI1 (Yanardag and Vishwanathan, 2015). We find equivariant strong lottery tickets by utilizing our overparametrization strategy described in §3 and solving SUBSET-SUM problems using Gurobi (Gurobi Optimization, 2018). The definition of the SUBSET-SUM problems as mixed-integer optimization problems can be found in eq. 28 of §G. In Table 3 we report our main results for an overparametrization constant $C = 5$ (see Thm. 1) towards approximating a single target network, using 5 random seeds to construct our overparametrized network. Specifically, we report the number of parameters in the overparametrized network and in the final pruned network, each divided by the number of parameters of the original target network. We also report test accuracies for both the target and the pruned networks, the maximum absolute weight error over all SUBSET-SUM problems, and the maximum relative output error between pruned and target networks.
All model architectures and described in §G. For all equivariant architectures and datasets considered, we find that we are able to approximate the corresponding trained target networks sufficiently well. Specifically, we achieve sufficiently low maximum relative output error across test samples such that the test accuracy of the resulting pruned network matches essentially that of the target one for all random seeds of the pruning experiments. Finally, we conduct an ablation study on the effect of overparametrization constant factor C to the approximation accuracy with respect to the tolerance ϵ. We perform this study for the E(2) equivariant architectures for different subgroups. In Fig. 3 we plot this as a function of C ∈ {1, 2, 5, 10} for the groups C 4 , C 8 , D 4 using the basis construction from Weiler and Cesa (2019) . As observed, increasing our overparametrization factor leads, up to C = 5, to a lower maximum relative output error while the pruned accuracy marginally increases.  .4e -1 ± 0.9e -1 4.2e -2 ± 1.2e -2 Table 3 : Pruning random overparameterized G-equivariant networks to approximate G-equivariant targets. We report a) p /ptarget the parameter ratio of the number of parameters p of the overparametrized or the final pruned networks over ptarget, b) the test accuracy of the target and the pruned networks, c) the maximum absolute weight error over subset sum problems, d) and the relative output errors of the pruned network in contrast to the target over samples in the test set. † STDs are below 1e -4 . * Maximum time of MIP solver for SUBSET-SUM problems was thresholded to 600ms.

6. DISCUSSION

This paper introduces a unifying framework to prove the strong lottery ticket hypothesis for general equivariant networks. We prove the existence, with high probability, of winning tickets in randomly initialized, logarithmically overparametrized networks with double the depth. We also demonstrate theoretically that such an overparametrization scheme is optimal as a function of the tolerance. While our theory is built using overparametrized networks of depth $2L$, it may be possible to extend Theorem 1 to the setting where overparametrized networks have depth $L + 1$, as in Burkholz (2022b), by adapting the proof techniques. We leave this extension as future work. Our framework enjoys broad applicability to MLPs, CNNs, $E(2)$-steerable networks, general permutation equivariant networks, and MPGNNs, all of which become insightful corollaries of our main theoretical result. One limitation of our theory is the assumption of a point-wise ReLU as the non-linearity. As a result, a natural direction for future work is to consider extensions of the SUBSET-SUM problem beyond linear functions to more general non-linearities. In addition, our overparametrization strategy employs the "diamond shape" technique; however, other schemes might also yield an optimal upper bound. Characterizing these schemes is an exciting direction for future work.

7. ETHICS STATEMENT

The main contributions of this work are primarily theoretical in nature as we seek to provide a general framework to study equivariant lottery tickets. Consequently, any potential societal impact would necessarily be speculative in nature and deeply tied to a particular application domain. For example, one could consider the environmental cost savings from creating an overparametrized G-equivariant network that does not need any GPU hours to train, but instead CPU resources to solve SUBSET-SUM problems. Beyond these goals any application of our theory to actual practice is likely to inherit the complex broader impacts native to the problem domain and we encourage practitioners to exercise due caution in their efforts.

8. REPRODUCIBILITY STATEMENT

We provide complete proofs for all our theoretical results in the Appendix. In particular, the proof of Lemma 1 can be found in Appendix B.1; Theorem 1 is a direct application of this result $l$ times, and its proof is located in Appendix B.2. The proof of Theorem 2 is located in Appendix B.3. Furthermore, instantiating the framework for $E(2)$-steerable CNNs, permutation equivariant networks, MLPs, and vanilla CNNs results in Corollaries 1, 2, 3, and 4 respectively. The proofs of the corollaries are located in Appendices C (MLP), D (CNN), E.4 ($E(2)$-CNN), and F.1 (permutation equivariant networks). We provide full details on our experimental setup, including hyperparameter choices, architectures, and the exact SUBSET-SUM problem being solved for pruning, in Appendix G. Finally, code to reproduce our experimental results can be found in the submission's supplementary material.

A ADDITIONAL MATERIAL ON THE SUBSET SUM PROBLEM

We recall here some results on SUBSET-SUM, originally from Lueker (1998) and modified by Pensia et al. (2020) to better fit the proof.

Lemma 2 (SUBSET-SUM lemma). Let $U \sim U([0, 1])$ (or $U([-1, 0])$) and $V \sim U([-1, 1])$ be two independent random variables. Let $P$ be the distribution of $UV$. Let $\delta_0$ be the Dirac delta function. Define the distribution $D = \frac{1}{2}\delta_0 + \frac{1}{2}P$. Let $X_1, \dots, X_n$ be i.i.d. from the distribution $D$, where $n \ge C \log\left(\frac{2}{\epsilon}\right)$ (for some universal constant $C$). Then, with probability at least $1 - \epsilon$, we have

$$\forall z \in [-1, 1],\ \exists S \subset [n] \text{ such that } \Big| z - \sum_{i \in S} X_i \Big| \le \epsilon. \qquad (10)$$

This lemma is in fact a consequence of Corollary 3.3 from Lueker (1998), which states that as soon as a distribution contains a uniform distribution, one can achieve any target with exponentially small precision by SUBSET-SUM.

Extension to more general distributions. This allows us to extend Theorem 1 to a more general setting, where the distribution of the random coefficients is not $U([-1, 1])$ but contains a uniform distribution. We say that a distribution $Z$ contains a uniform distribution $U([a, b])$ if there exist a distribution $Z_1$ and a constant $\zeta \in [0, 1)$ such that

$$Z := \zeta Z_1 + (1 - \zeta)\, U([a, b]).$$

We want to extend the results of Theorem 1 to distributions containing $U([-a, a])$ for some $a > 0$. We therefore follow the same path as Pensia et al. (2020) to prove Lemma 2, but with more general distributions. Pensia et al. (2020) already remarked on this extension, which we state and prove here.

Lemma 3. Let $a > 0$. Let $X$ and $Y$ be two independent random variables such that $X$ contains $U([0, a])$ (or $U([-a, 0])$) and $Y$ contains $U([-a, a])$. Then the PDF of the random variable $XY$ satisfies:

$$\exists A_a > 0,\quad f_{XY}(z) \ge A_a \log\frac{a^2}{|z|} \quad \text{for } |z| < a^2.$$

Proof. By the change of variables $\frac{X}{a}$ and $\frac{Y}{a}$, one can apply Lemma 4 from Pensia et al. (2020) to get that if $\tilde{X} \sim U([0, a])$ (or $U([-a, 0])$) and $\tilde{Y} \sim U([-a, a])$, the PDF of $\tilde{X}\tilde{Y}$ is

$$f_{\tilde{X}\tilde{Y}}(z) = \frac{1}{2a^2} \log\frac{a^2}{|z|} \ \text{ if } |z| \le a^2, \quad \text{and } 0 \text{ otherwise.}$$

By hypothesis, there exist $\alpha_X, \alpha_Y > 0$ such that $f_X \ge \alpha_X f_{U([0,a])}$ and $f_Y \ge \alpha_Y f_{U([-a,a])}$. Therefore $f_{XY} \ge \alpha_X \alpha_Y f_{\tilde{X}\tilde{Y}}$, and finally

$$f_{XY}(z) \ge \frac{\alpha_X \alpha_Y}{2a^2} \log\frac{a^2}{|z|} \quad \text{if } |z| \le a^2.$$

Lemma 4. Let $X$ and $Y$ be two independent random variables such that $X$ contains $U([0, a])$ (or $U([-a, 0])$) and $Y$ contains $U([-a, a])$. Let $P$ be the distribution of $XY$. Then there exist a distribution $Q$ and a scalar $B_a > 0$ such that:

$$P = B_a\, U\left(\left[-\tfrac{a^2}{2}, \tfrac{a^2}{2}\right]\right) + (1 - B_a)\, Q.$$

Proof. This is a direct consequence of the lower bound on the PDF of $XY$ shown in the previous lemma.

Using Lemma 4 and Corollary 3.3 from Lueker (1998) leads immediately to the following result:

Lemma 5. Let $a > 0$, let $X$ be a random variable containing $U([0, a])$ (or $U([-a, 0])$), and let $Y$ be a random variable containing $U([-a, a])$. Let $X_1, \dots, X_n$ be $n$ i.i.d. random variables following the distribution $\frac{1}{2}\delta_0 + \frac{1}{2}P$, where $P$ is the distribution of $XY$. Then, if $n \ge C_a \log\frac{2}{\epsilon}$ (for some constant $C_a$ depending on $a$), with probability at least $1 - \epsilon$ we have:

$$\forall z \in [-1, 1],\ \exists S \subset \{1, \dots, n\},\quad \Big| z - \sum_{i \in S} X_i \Big| \le \epsilon.$$

Proof. This follows immediately from Corollary 3.3 in Lueker (1998) (by applying Markov's inequality) and Lemma 4.

Discussion. This allows us to generalize Theorem 1 to settings where the random overparametrized network has weights drawn from a distribution which contains $U([-a, a])$. This includes almost all the usual settings, such as Gaussian and uniform initializations. Indeed, the only change is to use Lemma 5 instead of Lemma 2 at the same place in the proof, and to assume that the distribution of the parameters of the overparametrized network contains $U([-a, a])$ for some $a > 0$.
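The density used in the proof of Lemma 3 can be sanity-checked by simulation. Integrating $\frac{1}{2a^2}\log\frac{a^2}{|z|}$ over $[-t, t]$ gives $P(|\tilde X \tilde Y| \le t) = \frac{t}{a^2}\left(1 + \log\frac{a^2}{t}\right)$ for $0 < t \le a^2$, and the sketch below compares this closed form with an empirical estimate. The function names, sample size, and seed are illustrative assumptions.

```python
import numpy as np

def product_abs_cdf(t, a=1.0):
    """P(|X*Y| <= t) implied by the density (1/(2a^2)) * log(a^2/|z|) on |z| <= a^2,
    valid for 0 < t <= a^2."""
    t = np.asarray(t, dtype=float)
    return (t / a**2) * (1.0 + np.log(a**2 / t))

def empirical_abs_cdf(t, a=1.0, n_samples=200_000, seed=0):
    """Monte Carlo estimate of P(|X*Y| <= t) for X ~ U([0, a]), Y ~ U([-a, a])."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(0.0, a, n_samples)
    y = rng.uniform(-a, a, n_samples)
    z = np.abs(x * y)
    return np.array([(z <= ti).mean() for ti in np.atleast_1d(t)])

def max_cdf_gap(a=1.0):
    """Largest gap between the closed form and the estimate at a few thresholds."""
    ts = np.array([0.1, 0.5, 0.9]) * a**2
    return float(np.max(np.abs(product_abs_cdf(ts, a) - empirical_abs_cdf(ts, a))))
```

The logarithmic blow-up of the density near zero is what allows a uniform component to be extracted in Lemma 4.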

B PROOF OF THE GENERAL SLT ON EQUIVARIANT NETWORKS USING POINTWISE RELU

B.1 APPROXIMATION OF AN EQUIVARIANT TARGET LAYER

We now prove Lemma 1, which is used to approximate a single layer of a $G$-equivariant target model.

Lemma 1. Let $h_i \in \mathcal{H}_i$ be a random overparametrized $G$-equivariant network as defined above, with coefficients $\lambda^{(i)}_{p \to q,k}$ and $\mu^{(i)}_{p \to q,k}$ drawn from $U([-1, 1])$, and with $\tilde{n}_i = C_1 n_i \log\left(\frac{n_i n_{i+1} \max(|B_{i \to i+1}|,\, |||B_{i \to i+1}|||)}{\min(\epsilon, \delta)}\right)$, where $C_1$ is a constant. Then, with probability $1 - \delta$, for every target $G$-equivariant layer $f_i \in \mathcal{F}_i$, one can find two pruning masks $S_{2i}, S_{2i+1}$ on the coefficients $\lambda^{(i)}_{p \to q,k}$ and $\mu^{(i)}_{p \to q,k}$ respectively such that:

$$\max_{x \in \mathbb{R}^{D_i \times n_i},\ \|x\| \le 1} \left\| (S_{2i+1} \odot W^h_{2i+1})\, \sigma\big((S_{2i} \odot W^h_{2i})\, x\big) - f_i(x) \right\| \le \epsilon.$$

Proof. Let us first recall that the main hypothesis needed for this lemma is to have an identity element in the basis, $I \in B_{i \to i}$. This is a very mild assumption, since the identity is trivially equivariant from $F_i$ to $F_i$ and one can always choose to incorporate it in the basis. Consequently, we designate the first element of our equivariant basis to be the identity, $b_{i \to i,1} = I$.

Remark. We choose $C_1 = 3C$ (where $C$ is the universal constant introduced in Lemma 2) to ensure that

$$C_1 \log\left(\frac{n_i n_{i+1} \max(|B_{i \to i+1}|, |||B_{i \to i+1}|||)}{\min(\epsilon, \delta)}\right) \ge C \log\left(\frac{4 n_i n_{i+1} \max(|B_{i \to i+1}|, |||B_{i \to i+1}|||)}{\min(\epsilon, \delta)}\right),$$

which is true on the entire domain of variables we are interested in ($n_i, n_{i+1} \ge 1$, $\delta, \epsilon \le \frac{1}{2}$ and $|B_{i \to i+1}| \ge 1$). It is easy to see that $3\log(x) \ge \log(4x)$ on $[2, +\infty)$, as $x^3 \ge 4x$ on this domain.

To begin, we introduce a function $\chi$ to identify blocks in our feature space $F_i^{n_i}$. In particular, we leverage the "diamond shape" structure (see Fig. 1) and define $\chi : [\tilde{n}_i] \to [n_i]$ such that it divides the intermediate layer of our overparametrized approximation into groups of $C_1 \log\left(\frac{n_i n_{i+1} \max(|B_{i \to i+1}|, |||B_{i \to i+1}|||)}{\min(\epsilon, \delta)}\right)$ blocks which are linked with the same block in the first ($i$-th) layer. In other words, $\chi$ associates, in a surjective manner, a block in $F^{\tilde{n}_i}$ to a block in $F^{n_i}$. As a last piece of notation, we write $x_\omega$ for the $\omega$-th block of the feature space for $x$. For example, if $x \in \mathbb{R}^{n_i \times D_i}$ is contained in the feature space $F_i^{n_i}$ of the $i$-th layer, then $\omega \in [n_i]$ and $x_\omega \in F_i$ denotes the $\omega$-th vector of dimension $D_i$ in $x$. Because $\omega$ is a dummy index, we will often replace it with an appropriate layer index, e.g. $p, q, r$. With this in hand we can write the function $\chi(q)$ as follows:

$$\chi(q) = \left\lfloor \frac{q - 1}{C_1 \log\left(\frac{n_i n_{i+1} \max(|B_{i \to i+1}|, |||B_{i \to i+1}|||)}{\min(\epsilon, \delta)}\right)} \right\rfloor + 1.$$

Before pruning, one has

$$W^h_{2i} = \sum_{p=1}^{n_i} \sum_{q=1}^{\tilde{n}_i} \sum_{k=1}^{|B_{i \to i}|} \left( \kappa^{p,q}_{n_i \to \tilde{n}_i} \otimes \lambda^{(i)}_{p \to q,k}\, b_{i \to i,k} \right).$$

We begin pruning by annihilating all first-layer coefficients $\lambda^{(i)}_{p \to q,k}$ that are not associated with the identity basis element ($k \ne 1$) or for which $q \notin \chi^{-1}(p)$. This yields the following decomposition post-pruning:

$$W^h_{2i} = \sum_{p=1}^{n_i} \sum_{q \in \chi^{-1}(p)} \kappa^{p,q}_{n_i \to \tilde{n}_i} \otimes \lambda^{(i)}_{p \to q,1}\, I.$$

Note that we can write $p = \chi(q)$, leading to the following:

$$(W^h_{2i}\, x)_q = \lambda^{(i)}_{\chi(q) \to q,1}\, x_{\chi(q)}.$$

After $\sigma$, i.e. the pointwise ReLU, one then has:

$$\sigma(W^h_{2i}\, x)_q = \sigma\big(\lambda^{(i)}_{\chi(q) \to q,1}\, x_{\chi(q)}\big) = \lambda^{(i)+}_{\chi(q) \to q,1}\, x^+_{\chi(q)} + \lambda^{(i)-}_{\chi(q) \to q,1}\, x^-_{\chi(q)},$$

where we used the fact that the ReLU is pointwise together with the scalar identity $\sigma(wx) = w^+ x^+ + w^- x^-$. Expanding the second layer in its equivariant basis and using the above equation we get:

$$\big(W^h_{2i+1}\, \sigma(W^h_{2i}\, x)\big)_r = \sum_{q=1}^{\tilde{n}_i} \Big( \sum_{k=1}^{|B_{i \to i+1}|} \mu^{(i)}_{q \to r,k}\, b_{i \to i+1,k} \Big)\, \sigma(W^h_{2i}\, x)_q \qquad (11)$$

$$= \sum_{p=1}^{n_i} \sum_{q \in \chi^{-1}(p)} \Big( \sum_{k=1}^{|B_{i \to i+1}|} \mu^{(i)}_{q \to r,k}\, b_{i \to i+1,k} \Big) \big( \lambda^{(i)+}_{p \to q,1}\, x^+_p + \lambda^{(i)-}_{p \to q,1}\, x^-_p \big) \qquad (12)$$

$$= \sum_{p=1}^{n_i} \sum_{k=1}^{|B_{i \to i+1}|} \Big( \sum_{q \in \chi^{-1}(p)} \mu^{(i)}_{q \to r,k}\, \lambda^{(i)+}_{p \to q,1} \Big)\, b_{i \to i+1,k}\, x^+_p \qquad (13)$$

$$\quad + \sum_{p=1}^{n_i} \sum_{k=1}^{|B_{i \to i+1}|} \Big( \sum_{q \in \chi^{-1}(p)} \mu^{(i)}_{q \to r,k}\, \lambda^{(i)-}_{p \to q,1} \Big)\, b_{i \to i+1,k}\, x^-_p. \qquad (14)$$

Our goal is to approximate the target model, whose $r$-th block can be written as:

$$f_i(x)_r = \underbrace{\sum_{p=1}^{n_i} \sum_{k=1}^{|B_{i \to i+1}|} \alpha^{(i)}_{p \to r,k}\, b_{i \to i+1,k}\, x^+_p}_{\text{term 1}} + \underbrace{\sum_{p=1}^{n_i} \sum_{k=1}^{|B_{i \to i+1}|} \alpha^{(i)}_{p \to r,k}\, b_{i \to i+1,k}\, x^-_p}_{\text{term 2}}.$$
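Two ingredients of the construction above are easy to verify in isolation: the block assignment $\chi$ and the scalar ReLU identity. The sketch below assumes the convention $w^+ = \max(w, 0)$, $w^- = \min(w, 0)$ (and likewise for $x$), which is the convention under which $x = x^+ + x^-$ and the target decomposition into term 1 and term 2 holds; the parameter `m` stands for the group size $C_1 \log(\cdot)$ and is an illustrative name.

```python
import numpy as np

def chi(q, m):
    """Block assignment chi(q) = floor((q - 1) / m) + 1 for 1-indexed q,
    where m is the number of intermediate blocks per input block."""
    return (q - 1) // m + 1

def relu(z):
    return np.maximum(z, 0.0)

def split_relu(w, x):
    """Right-hand side w^+ x^+ + w^- x^- of the scalar ReLU identity,
    with w^+ = max(w, 0) and w^- = min(w, 0) (same convention for x)."""
    return np.maximum(w, 0) * np.maximum(x, 0) + np.minimum(w, 0) * np.minimum(x, 0)

# chi maps intermediate blocks 1..9 onto input blocks 1..3 when m = 3
assignment = [chi(q, 3) for q in range(1, 10)]

# the identity holds for every sign combination of w and x
rng = np.random.default_rng(0)
w = rng.uniform(-1, 1, size=(1000, 1))
x = rng.uniform(-1, 1, size=(1000, 5))
identity_holds = bool(np.allclose(relu(w * x), split_relu(w, x)))
```

Since exactly one of $w^+$, $w^-$ is non-zero for any given coefficient, each intermediate block contributes to only one of the two sums in (13) and (14).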
To do so, we only have to approximate, for all $p, r, k$, the coefficient $\alpha^{(i)}_{p \to r,k}$ in term 1 by a subset sum of the products $\mu^{(i)}_{q \to r,k}\, \lambda^{(i)+}_{p \to q,1}$, $q \in \chi^{-1}(p)$, and the coefficient $\alpha^{(i)}_{p \to r,k}$ in term 2 by a subset sum of the products $\mu^{(i)}_{q \to r,k}\, \lambda^{(i)-}_{p \to q,1}$, $q \in \chi^{-1}(p)$. This can be achieved by judiciously choosing pruning masks that selectively include the $\mu^{(i)}_{q \to r,k}$, which is a by-product of solving independent SUBSET-SUM problems. The key insight powering our analysis is that the variables $\mu^{(i)}_{q \to r,k}$, $q \in \chi^{-1}(p)$, that appear in each approximation problem are different whenever $(p, r, k) \ne (p', r', k')$. Moreover, the two problems for fixed indices $(p, r, k)$ can be seen as using different variables: depending on whether it is positive or negative, $\lambda^{(i)}_{p \to q,1}$ is necessarily 0 in either the first or the second term. Therefore, in one of the two equations, $\mu^{(i)}_{q \to r,k}$ is not a variable of the SUBSET-SUM problem. We are then at liberty to decide whether or not to prune the variable in the equation where it appears, because pruning it will not affect the result of the other SUBSET-SUM problem. Following this approach, we can find a mask on the variables involved in each problem, solve the problems independently, and finally take the concatenation of all the masks in the second layer, which simultaneously solves all the problems. We now quantify this approach by showing that, with high probability, the $2 n_i n_{i+1} |B_{i \to i+1}|$ SUBSET-SUM problems (with independent variables) written below can all be solved by applying a pruning mask on the second layer. The pruning mask applied on the second layer is denoted $S^{2i+1} \in \{0, 1\}^{\tilde{n}_i \times n_{i+1} \times |B_{i \to i+1}|}$, with entries $S^{2i+1}_{q \to r,k}$.
The SUBSET-SUM problems are written below. For all $(p, r, k) \in [n_i] \times [n_{i+1}] \times [|B_{i \to i+1}|]$:

$$\big|err^{(i)}_{p \to r,k,+}\big| := \Big| \sum_{q \in \chi^{-1}(p)} \big(S^{2i+1}_{q \to r,k} \cdot \mu^{(i)}_{q \to r,k}\big)\, \lambda^{(i)+}_{p \to q,1} - \alpha^{(i)}_{p \to r,k} \Big| \le \frac{\epsilon}{2 n_i n_{i+1} \max(|B_{i \to i+1}|, |||B_{i \to i+1}|||)}$$

and

$$\big|err^{(i)}_{p \to r,k,-}\big| := \Big| \sum_{q \in \chi^{-1}(p)} \big(S^{2i+1}_{q \to r,k} \cdot \mu^{(i)}_{q \to r,k}\big)\, \lambda^{(i)-}_{p \to q,1} - \alpha^{(i)}_{p \to r,k} \Big| \le \frac{\epsilon}{2 n_i n_{i+1} \max(|B_{i \to i+1}|, |||B_{i \to i+1}|||)}.$$

We now use the SUBSET-SUM Lemma 2, which explains the overparametrization needed to solve the SUBSET-SUM problems. Since $\mu^{(i)}_{q \to r,k}$ and $\lambda^{(i)}_{p \to q,k}$ are i.i.d. following $U([-1, 1])$, $\lambda^{(i)+}_{p \to q,1}$ follows $\frac{1}{2}\delta_0 + \frac{1}{2}U$ in the notation of Lemma 2. We deduce that the products $\mu^{(i)}_{q \to r,k}\, \lambda^{(i)+}_{p \to q,1}$ are i.i.d. following the distribution $D = \frac{1}{2}\delta_0 + \frac{1}{2}P$. The same holds for the products $\mu^{(i)}_{q \to r,k}\, \lambda^{(i)-}_{p \to q,1}$, which are also i.i.d. following $D = \frac{1}{2}\delta_0 + \frac{1}{2}P$. Here one should note that $\forall p \in [n_i]$, $|\chi^{-1}(p)| = C_1 \log\left(\frac{n_i n_{i+1} \max(|B_{i \to i+1}|, |||B_{i \to i+1}|||)}{\min(\epsilon, \delta)}\right)$. Therefore, by Lemma 2 (see footnote 6), $\forall (p, r, k) \in [n_i] \times [n_{i+1}] \times [|B_{i \to i+1}|]$, the two SUBSET-SUM problems can be solved by pruning the coefficients $\mu^{(i)}_{q \to r,k}$ with probability at least $1 - \frac{\delta}{2 n_i n_{i+1} \max(|B_{i \to i+1}|, |||B_{i \to i+1}|||)}$. Call this event $E^{(i)}_{p \to r,k}$. By taking the intersection of the events (a union bound on their complements), we get that $E^{(i)} = \bigcap_{(p,r) \in [n_i] \times [n_{i+1}],\ k \in [|B_{i \to i+1}|]} E^{(i)}_{p \to r,k}$ holds with probability at least

$$p(E^{(i)}) \ge 1 - n_i n_{i+1} |B_{i \to i+1}|\, \frac{\delta}{2 n_i n_{i+1} \max(|B_{i \to i+1}|, |||B_{i \to i+1}|||)} \ge 1 - \delta.$$

In other words, with probability at least $1 - \delta$, all the SUBSET-SUM problems are solved.

Footnote 6: At this point one may prove the same lemma with more general distributions on the coefficients $\lambda^{(i)}_{p \to q,k}$ and $\mu^{(i)}_{q \to r,k}$, by assuming only that they contain $U([-a, a])$ for some $a > 0$ and by using Lemma 5 instead of Lemma 2.

Finally, it remains to check that the approximation holds with this pruning mask. Let $\Omega$ be defined as:

$$\Omega = \max_{\|x\| \le 1} \left\| (S_{2i+1} \odot W^h_{2i+1})\, \sigma\big((S_{2i} \odot W^h_{2i})\, x\big) - f_i(x) \right\|.$$
By applying the masks we get:

$$\Omega = \max_{r \in [n_{i+1}]} \max_{\|x\| \le 1} \Big\| \big((S_{2i+1} \odot W^h_{2i+1})\, \sigma((S_{2i} \odot W^h_{2i})\, x)\big)_r - f_i(x)_r \Big\|$$

$$= \max_{r \in [n_{i+1}]} \max_{\|x\| \le 1} \Big\| \sum_{p=1}^{n_i} \sum_{k=1}^{|B_{i \to i+1}|} \Big[ \big(\alpha^{(i)}_{p \to r,k} + err^{(i)}_{p \to r,k,+}\big)\, b_{i \to i+1,k}(x^+_p) + \big(\alpha^{(i)}_{p \to r,k} + err^{(i)}_{p \to r,k,-}\big)\, b_{i \to i+1,k}(x^-_p) \Big] - \sum_{p=1}^{n_i} \sum_{k=1}^{|B_{i \to i+1}|} \Big[ \alpha^{(i)}_{p \to r,k}\, b_{i \to i+1,k}(x^+_p) + \alpha^{(i)}_{p \to r,k}\, b_{i \to i+1,k}(x^-_p) \Big] \Big\|$$

$$\le \max_{r \in [n_{i+1}]} \sum_{p=1}^{n_i} \max_{\|x\| \le 1} \Big\| \sum_{k=1}^{|B_{i \to i+1}|} err^{(i)}_{p \to r,k,+}\, b_{i \to i+1,k}(x^+_p) + \sum_{k=1}^{|B_{i \to i+1}|} err^{(i)}_{p \to r,k,-}\, b_{i \to i+1,k}(x^-_p) \Big\|$$

$$\le \max_{r \in [n_{i+1}]} \sum_{p=1}^{n_i} \Big( \max_{\|x\| \le 1} \Big\| \sum_{k=1}^{|B_{i \to i+1}|} err^{(i)}_{p \to r,k,+}\, b_{i \to i+1,k}(x^+_p) \Big\| + \max_{\|x\| \le 1} \Big\| \sum_{k=1}^{|B_{i \to i+1}|} err^{(i)}_{p \to r,k,-}\, b_{i \to i+1,k}(x^-_p) \Big\| \Big)$$

$$\le \max_{r \in [n_{i+1}]} \sum_{p=1}^{n_i} \Big( \Big|\Big|\Big| \sum_{k=1}^{|B_{i \to i+1}|} err^{(i)}_{p \to r,k,+}\, b_{i \to i+1,k} \Big|\Big|\Big| + \Big|\Big|\Big| \sum_{k=1}^{|B_{i \to i+1}|} err^{(i)}_{p \to r,k,-}\, b_{i \to i+1,k} \Big|\Big|\Big| \Big)$$

$$\le \sum_{p=1}^{n_i} \frac{2\epsilon}{2 n_i n_{i+1} \max(|B_{i \to i+1}|, |||B_{i \to i+1}|||)} \times |||B_{i \to i+1}||| \le \epsilon.$$

Note. In the statement of the lemma we used a specific choice of norm ($\ell_p$), but our proof strategy works with any norm for which $\sigma$, the ReLU non-linearity, is 1-Lipschitz (which may not be the case for some esoteric norms). As a result, there is no need to restrict oneself to the $\ell_p$-norm; we made this choice above only for ease of exposition. Moreover, thanks to the flexibility of the SUBSET-SUM theorem, the proof can also be extended to a milder hypothesis on the distribution of the coefficients. Specifically, it is sufficient that the distribution contains a uniform distribution centered at 0 (see Lemma 5). The immediate consequence is that it is possible to accommodate other weight initialization schemes that are commonly used in practice; again, for ease of readability, we chose to use the uniform distribution.
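The whole construction of Lemma 1 can be traced on a tiny instance. The sketch below takes the simplest case, a trivial group with scalar blocks (so the only basis element is the identity and the target layer is a dense matrix `A`), and replaces the high-probability guarantee by an exhaustive SUBSET-SUM search. Everything here is an assumption made for illustration: the function names, the group size `m = 16`, the restriction of targets to $[-0.5, 0.5]$, and the use of the $\ell_\infty$ norm.

```python
import numpy as np

def subset_sum_mask(values, target):
    """Brute-force mask s in {0,1}^k minimising |target - s . values| (k small)."""
    k = len(values)
    masks = (np.arange(2 ** k)[:, None] >> np.arange(k)) & 1   # all 2^k binary vectors
    errs = np.abs(masks @ values - target)
    best = int(np.argmin(errs))
    return masks[best], float(errs[best])

def prune_two_layer(lam, mu, A, m):
    """lam: (n_in*m, n_in) first layer, mu: (n_out, n_in*m) second layer,
    A: (n_out, n_in) target. Returns masks S0, S1 and the worst weight error."""
    n_tilde, n_in = lam.shape
    n_out = mu.shape[0]
    group = np.arange(n_tilde) // m                 # 0-indexed chi(q)
    S0 = np.zeros_like(lam)
    S0[np.arange(n_tilde), group] = 1.0             # diamond shape: keep lambda_{chi(q) -> q}
    lam_kept = lam[np.arange(n_tilde), group]
    S1 = np.zeros_like(mu)
    max_err = 0.0
    for p in range(n_in):
        for sign in (1.0, -1.0):                    # "+" and "-" problems use disjoint q's
            idx = np.where((group == p) & (sign * lam_kept > 0))[0]
            for r in range(n_out):
                mask, err = subset_sum_mask(mu[r, idx] * lam_kept[idx], A[r, p])
                S1[r, idx] = mask
                max_err = max(max_err, err)
    return S0, S1, max_err

def demo(seed=0, n_in=3, n_out=2, m=16):
    rng = np.random.default_rng(seed)
    lam = rng.uniform(-1, 1, (n_in * m, n_in))
    mu = rng.uniform(-1, 1, (n_out, n_in * m))
    A = rng.uniform(-0.5, 0.5, (n_out, n_in))
    S0, S1, w_err = prune_two_layer(lam, mu, A, m)
    x = rng.uniform(-1, 1, (n_in, 1000))            # inputs with ||x||_inf <= 1
    out = (S1 * mu) @ np.maximum((S0 * lam) @ x, 0.0)
    return w_err, float(np.abs(out - A @ x).max())
```

As in the final chain of inequalities, the output error of the pruned two-layer network is controlled by the per-coefficient SUBSET-SUM errors: here it is at most `n_in` times the worst weight error.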

B.2 APPROXIMATION OF AN EQUIVARIANT TARGET NETWORK

We now prove Theorem 1, which approximates a full target model. We first recall the two (very mild) assumptions needed for the theorem statement:
• $I \in B_{i \to i}$.
• $\sigma$, the pointwise ReLU, is used as an equivariant nonlinearity and is 1-Lipschitz.

Theorem 1. Let $h \in \mathcal{H}$ be a random overparametrized $G$-equivariant network with coefficients $\lambda^{(i)}_{p \to q,k}$ and $\mu^{(i)}_{p \to q,k}$, for $i \in [l]$ and indices $p, q, k$ as defined in Table 1, all drawn from $U([-1, 1])$. Suppose that $\tilde{n}_i = C_2 n_i \log\left(\frac{n_i n_{i+1} \max(|B_{i \to i+1}|, |||B_{i \to i+1}|||)\, l}{\min(\epsilon, \delta)}\right)$, where $C_2$ is a constant. Then with probability $1 - \delta$, for every $f \in \mathcal{F}$, one can find a collection of pruning masks $S_{2l-1}, \dots, S_0$ on the coefficients $\lambda^{(i)}_{p \to q,k}$ and $\mu^{(i)}_{p \to q,k}$ for every layer $i \in [l]$ such that:

$$\max_{x \in \mathbb{R}^{D_0 \times n_0},\ \|x\| \le 1} \left\| (S_{2l-1} \odot W^h_{2l-1})\, \sigma\big( \dots \sigma((S_0 \odot W^h_0)\, x)\big) - f(x) \right\| \le \epsilon. \qquad (2)$$

Proof. We first note that we use a different constant $C_2 = 2C_1$ in the theorem as compared to Lemma 1, which ensures that

$$C_2 n_i \log\left(\frac{n_i n_{i+1} \max(|B_{i \to i+1}|, |||B_{i \to i+1}|||)\, l}{\min(\epsilon, \delta)}\right) \ge C_1 n_i \log\left(\frac{2 n_i n_{i+1} \max(|B_{i \to i+1}|, |||B_{i \to i+1}|||)\, l}{\min(\epsilon, \delta)}\right),$$

which is true on the domain of the variables considered ($n_i, n_{i+1}, l \ge 1$, $\delta, \epsilon \le \frac{1}{2}$, $|B_{i \to i+1}| \ge 1$), as $2\log(x) \ge \log(2x)$ on $[2, +\infty)$. We first apply Lemma 1 $l$ times, once for each layer of the target network, with $\epsilon$ replaced by $\frac{\epsilon}{2l}$ and $\delta$ replaced by $\frac{\delta}{l}$. With an overparametrization factor $\tilde{n}_i \ge C_1 n_i \log\left(\frac{2 n_i n_{i+1} \max(|B_{i \to i+1}|, |||B_{i \to i+1}|||)\, l}{\min(\epsilon, \delta)}\right)$ we get that, for each layer $i$,

$$\max_{\|x\| \le 1} \left\| (S_{2i+1} \odot W^h_{2i+1})\, \sigma\big((S_{2i} \odot W^h_{2i})\, x\big) - f_i(x) \right\| \le \frac{\epsilon}{2l} \qquad (15)$$

holds with probability at least $1 - \frac{\delta}{l}$. By taking a union bound, we get that this holds for every layer with probability at least $1 - \delta$. Now, let $x'_i$ be the input to the $(2i)$-th layer of the pruned overparametrized network $h$. Furthermore, let $x_i$ be the input to the $i$-th layer of the target network $f$.
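Then we have,
• $x'_0 = x_0 = x$
• $x'_{i+1} = \sigma\big( (S_{2i+1} \odot W^h_{2i+1})\, \sigma((S_{2i} \odot W^h_{2i})\, x'_i) \big)$ for $i \le l - 2$
• $x'_l = (S_{2l-1} \odot W^h_{2l-1})\, \sigma\big((S_{2l-2} \odot W^h_{2l-2})\, x'_{l-1}\big)$

Equation 15 implies that

$$\left\| (S_{2i+1} \odot W^h_{2i+1})\, \sigma\big((S_{2i} \odot W^h_{2i})\, x'_i\big) - f_i(x'_i) \right\| \le \|x'_i\|\, \frac{\epsilon}{2l}.$$

Passing through the pointwise ReLU, which is 1-Lipschitz for all the norms that we work with, and using $|||f_i||| \le 1$, we get:

$$\|x'_{i+1}\| \le \|x'_i\| \left(1 + \frac{\epsilon}{2l}\right).$$

By a recursive argument, and using the fact that $\|x'_0\| = \|x\| \le 1$, we then get that for all $i \in \{0, \dots, l - 1\}$, $\|x'_i\| \le \left(1 + \frac{\epsilon}{2l}\right)^i$. Then, for all $i \le l - 2$:

$$\|x'_{i+1} - x_{i+1}\| = \left\| \sigma\big((S_{2i+1} \odot W^h_{2i+1})\, \sigma((S_{2i} \odot W^h_{2i})\, x'_i)\big) - \sigma(f_i(x_i)) \right\|$$

$$\le \left\| \sigma\big((S_{2i+1} \odot W^h_{2i+1})\, \sigma((S_{2i} \odot W^h_{2i})\, x'_i)\big) - \sigma(f_i(x'_i)) \right\| + \left\| \sigma(f_i(x'_i)) - \sigma(f_i(x_i)) \right\|$$

$$\le \|x'_i\|\, \frac{\epsilon}{2l} + |||f_i|||\, \|x'_i - x_i\| \le \left(1 + \frac{\epsilon}{2l}\right)^i \frac{\epsilon}{2l} + \|x'_i - x_i\|,$$

where we used the fact that $\sigma$ is 1-Lipschitz. We then get that

$$\|x'_l - x_l\| = \left\| (S_{2l-1} \odot W^h_{2l-1})\, \sigma\big((S_{2l-2} \odot W^h_{2l-2})\, x'_{l-1}\big) - f_{l-1}(x_{l-1}) \right\|$$

$$\le \left\| (S_{2l-1} \odot W^h_{2l-1})\, \sigma\big((S_{2l-2} \odot W^h_{2l-2})\, x'_{l-1}\big) - f_{l-1}(x'_{l-1}) \right\| + \left\| f_{l-1}(x'_{l-1}) - f_{l-1}(x_{l-1}) \right\|$$

$$\le \|x'_{l-1}\|\, \frac{\epsilon}{2l} + |||f_{l-1}|||\, \|x'_{l-1} - x_{l-1}\| \le \left(1 + \frac{\epsilon}{2l}\right)^{l-1} \frac{\epsilon}{2l} + \|x'_{l-1} - x_{l-1}\|$$

$$\le \sum_{i=0}^{l-1} \left(1 + \frac{\epsilon}{2l}\right)^i \frac{\epsilon}{2l} \le \left(1 + \frac{\epsilon}{2l}\right)^l - 1 \le e^{\epsilon/2} - 1 \le \epsilon,$$

because $\epsilon \le \frac{1}{2}$.

B.3 LOWER BOUND

We now prove Theorem 2.

Theorem 2. Let $\hat{h}_i$ be a network with $\Theta$ parameters such that:

$$\forall f_i \in \mathcal{F}_i,\ \exists S_i \in \{0, 1\}^\Theta \text{ such that } \max_{x \in \mathbb{R}^{D_i \times n_i},\ \|x\| \le 1} \left\| (S_i \odot \hat{h}_i)(x) - f_i(x) \right\| \le \epsilon. \qquad (3)$$

Then $\Theta$ is at least $\Omega\left(n_i n_{i+1} |B_{i \to i+1}| \log\frac{1}{\epsilon}\right)$ and $\tilde{n}_i$ is at least $\Omega\left(n_i \log\frac{1}{\epsilon}\right)$ in Theorem 1.

Let us first recall the main assumptions of our setting:
• For all $f_i \in \mathrm{Span}(\kappa_{n_i \to n_{i+1}} \otimes B_{i \to i+1})$, where $f_i = \sum_{p \in [n_i]} \sum_{q \in [n_{i+1}]} \sum_k \alpha^{(i)}_{p \to q,k}\, b_{i \to i+1,k}$, we have: $|||f_i||| \le 1 \implies \|\alpha^{(i)}\|_\infty \le 1$. This assumption is extremely mild, as it can always trivially be satisfied by rescaling the basis elements.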
• There exists $M_1 \in \mathbb{R}_+$ such that, uniformly over $i$, for every possible building block in our equivariant feature spaces $F_i$ and $F_{i+1}$ we have $|B_{i \to i}| \le M_1 |B_{i \to i+1}|$. This assumption mainly guards against a scenario where the first layer would be able to carry "a lot of superfluous parameters". It is used only to obtain the lower bound on the overparametrization factor in Theorem 1, i.e. to prove $\tilde{n}_i \ge \Omega\left(n_i \log\frac{1}{\epsilon}\right)$, and is not used for the lower bound on $\Theta$. The assumption is very mild because in most of the usual cases (MLPs, CNNs, $E(2)$-steerable CNNs) the set of possible building blocks of each layer $F_i$ is finite (respectively $\mathbb{R}$, $\mathbb{R}^{d^2}$, and $\mathbb{R}^{d^2}$ or $\mathbb{R}^{d^2 \times |G|}$). Finiteness automatically implies the existence of such a constant $M_1$, as we can simply take the maximum over the possible values of $\frac{|B_{i \to i}|}{|B_{i \to i+1}|}$.
• Finally, we assume the existence of a constant $M_2 \in \mathbb{R}_+$ such that $n_i \le M_2 n_{i+1}$. This mild assumption, like the previous one, is used to ensure that $\tilde{n}_i \ge \Omega\left(n_i \log\frac{1}{\epsilon}\right)$.

Our proof relies on a counting argument that compares the number of pruning masks to the cardinality of a $2\epsilon$-separated net $P$ in the set of target networks with respect to the operator norm. A similar argument was used by Pensia et al. (2020) in the context of dense nets. We recall here the definition of a $2\epsilon$-separated net:

Definition B.1. Let $F$ be a normed vector space. A $2\epsilon$-separated net $P$ in $F$ is a subset $P \subset F$ such that:

$$\forall x_1, x_2 \in P,\quad x_1 \ne x_2 \implies \|x_1 - x_2\| \ge 2\epsilon.$$

In Lemma 1, we considered only a set of target networks $\mathcal{F}_i \subset \mathrm{Span}(\kappa_{n_i \to n_{i+1}} \otimes B_{i \to i+1})$ where each function has $|||f_i||| \le 1$ and $\|\alpha^{(i)}\|_\infty \le 1$. Combining this with the first assumption above, $\mathcal{F}_i$ is therefore the set of maps $f_i$ such that $|||f_i||| \le 1$. Now consider the isomorphism $I_i : \mathrm{Span}(\kappa \otimes B_{i \to i+1}) \to \mathbb{R}^{n_i n_{i+1} |B_{i \to i+1}|}$, which identifies a function with its coefficients in the equivariant basis.
$$I_i : f_i \mapsto \alpha^{(i)}$$

This isomorphism shows that $\mathcal{F}_i$ can be seen as the unit ball of $\mathbb{R}^{n_i n_{i+1} |B_{i \to i+1}|}$ with respect to the norm induced on $\mathbb{R}^{n_i n_{i+1} |B_{i \to i+1}|}$ by the isomorphism. Moreover, $P$ is a $2\epsilon$-separated net on $\mathcal{F}_i$ if and only if its image is a $2\epsilon$-separated net on $\mathbb{R}^{n_i n_{i+1} |B_{i \to i+1}|}$ with respect to the induced norm.

Lower bound on $|P|$. We just need to use Lemma 4.2.8 and extend Proposition 4.2.12 from Vershynin (2018) to non-Euclidean balls. The general idea is as follows. Denote by $B(x, R)$ the ball centered at $x$ of radius $R$ in $\mathbb{R}^{n_i n_{i+1} |B_{i \to i+1}|}$ and by $\mu$ the Lebesgue measure. We construct a $2\epsilon$-separated net as follows. As the first point we take the origin $0$ of the vector space. At step $n$, to construct the $(n+1)$-th point of $P$, we proceed as follows: if $B(0, 1) \subset \bigcup_{x \in P} B(x, 2\epsilon)$, we stop the process and do not add an $(n+1)$-th point. Otherwise, we take a point in $B(0, 1) \setminus \bigcup_{x \in P} B(x, 2\epsilon)$. This point is at distance at least $2\epsilon$ from the other points of $P$, and it lies in the unit ball. At the end of the process (which must end since the unit ball is compact), we get that the $2\epsilon$-separated net $P$ satisfies $B(0, 1) \subset \bigcup_{x \in P} B(x, 2\epsilon)$. Therefore,

$$\mu(B(0, 1)) \le \mu\Big(\bigcup_{x \in P} B(x, 2\epsilon)\Big) \le |P|\, \mu(B(0, 2\epsilon)).$$

Finally,

$$|P| \ge \frac{\mu(B(0, 1))}{\mu(B(0, 2\epsilon))} = \left(\frac{1}{2\epsilon}\right)^{n_i n_{i+1} |B_{i \to i+1}|}.$$

In the last step, we use the fact that the Lebesgue measure of a ball of radius $R$ in a vector space of dimension $n$ is $R^n V_n$, where $V_n$ is the Lebesgue measure of the unit ball. This allows us to choose $P$ with $|P| \ge \left(\frac{1}{2\epsilon}\right)^{n_i n_{i+1} |B_{i \to i+1}|}$.

Lower bound induced on $\Theta$. As the network that we seek to prune has $\Theta$ parameters, the number of binary pruning masks that can be constructed is $2^\Theta$. Moreover, due to the triangle inequality, each pruned network can approximate at most one element of $P$. Indeed, if $f^1_i \ne f^2_i \in P$ are approximated with the same pruning mask, then

$$\|f^2_i - f^1_i\| \le \|f^2_i - (S \odot \hat{h}_i)\| + \|(S \odot \hat{h}_i) - f^1_i\| \le 2\epsilon,$$

which contradicts the fact that $P$ is a $2\epsilon$-separated net.
This directly implies that the number of pruning masks must be at least the cardinality of $P$:

$$2^\Theta \ge \left(\frac{1}{2\epsilon}\right)^{n_i n_{i+1} |B_{i \to i+1}|},$$

and, by taking the logarithm,

$$\Theta \ge \frac{n_i n_{i+1} |B_{i \to i+1}|}{\log(2)} \log\frac{1}{2\epsilon},$$

which shows that $\Theta$ must be at least $\Omega\left(n_i n_{i+1} |B_{i \to i+1}| \log\frac{1}{\epsilon}\right)$.

Lower bound on $\tilde{n}_i$. We now seek a lower bound on $\tilde{n}_i$ such that Theorem 1 holds. Since our main claim requires that we approximate every target network with probability at least $1 - \delta > 0$, the set of parameters (drawn from any distribution) that can achieve this is non-empty. What remains is to count the number of parameters of the overparametrized $G$-equivariant network in $\mathcal{H}_i$ (see Lemma 1) as a function of the overparametrization factor $\tilde{n}_i$. This allows us to lower bound $\tilde{n}_i$ via the lower bound on the number of parameters established above. Any overparametrized $G$-equivariant network we construct has the following number of parameters:
• Number of parameters of the first layer: $n_i \tilde{n}_i |B_{i \to i}|$
• Number of parameters of the second layer: $\tilde{n}_i n_{i+1} |B_{i \to i+1}|$

Therefore the overparametrized network $h_i$ has $\Theta = \tilde{n}_i \left(n_i |B_{i \to i}| + n_{i+1} |B_{i \to i+1}|\right)$ parameters. Using the second and third assumptions, we get that:

$$\Theta \le \tilde{n}_i \left(M_1 M_2\, n_{i+1} |B_{i \to i+1}| + n_{i+1} |B_{i \to i+1}|\right) \le \tilde{n}_i (M_1 M_2 + 1)\, n_{i+1} |B_{i \to i+1}|.$$

Moreover, by Eq. 21, we know that $\Theta \ge \Omega\left(n_i n_{i+1} |B_{i \to i+1}| \log\frac{1}{\epsilon}\right)$. It therefore follows that:

$$\tilde{n}_i \ge \Omega\left(n_i \log\frac{1}{\epsilon}\right). \qquad (23)$$

Discussion. Theorem 2 shows that the overparametrization strategy of Theorem 1 is optimal with respect to the tolerance $\epsilon$ and almost optimal with respect to $n_i n_{i+1} |B_{i \to i+1}|$. In Theorem 1 we observe an additional factor of $\log\left(n_i n_{i+1} \max(|B_{i \to i+1}|, |||B_{i \to i+1}|||)\right)$ which appears in $\tilde{n}_i$. This term arises in the proof from two sources: our choice of the metric in which we approximate the target network, and the probabilistic setting of the SLTH.
Indeed, first we note that we chose to approximate each target layer to within $\epsilon$ with respect to the operator norm associated with the norms on the input and output spaces. Such a choice is arbitrary, and if we had chosen another metric, such as approximating each weight of the target network in the diamond shape structure to within $\epsilon$, then the term $n_i n_{i+1} |||B_{i \to i+1}|||$ might have been eliminated. The term $n_i n_{i+1} |B_{i \to i+1}|$ arises from the fact that the parameters of the overparametrized network are drawn from a random process. Specifically, a larger overparametrization is needed to handle the scenario where not all the SUBSET-SUM problems have solutions, whose probability grows with $n_i n_{i+1} |B_{i \to i+1}|$, i.e. with the complexity of the approximation. We could replace the probabilistic setting by an overparametrized network that is deterministically initialized with a smart initialization, such that with probability one all possible subnetworks can be obtained by pruning the overparametrized one. In this case, the overparametrization of the width would no longer contain the term $n_i n_{i+1} |B_{i \to i+1}|$ in the log. Such an initialization can be obtained, for example, by decomposing the overparametrized network into the different blocks of the diamond shape and taking the weights in each block to be $\pm 1, \pm\frac{1}{2}, \pm\frac{1}{4}, \dots, \pm\left(\frac{1}{2}\right)^{\log(1/\epsilon)/\log(2)} = \pm\epsilon$ (the weights that are not part of a diamond shape can be initialized freely). Each weight of the target network can then be approximated by pruning the diamond shape with a mask given by the binary expansion of the target weight. This is possible for every weight of the target network and for all target networks at once with probability one (with a different mask for each target network). We note the similarity of this construction with the one used in Sreenivasan et al. (2022), albeit under a different setting than the one considered here.
In conclusion, we give two hints for eliminating the term $n_i n_{i+1} \max(|B_{i \to i+1}|, |||B_{i \to i+1}|||)$: first, choosing another metric for approximating a layer, and second, moving to a non-probabilistic setting where the overparametrized network is smartly initialized.
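The deterministic initialization mentioned above can be illustrated directly: with weights $\pm 1, \pm\frac{1}{2}, \dots$ down to roughly $\pm\epsilon$, the pruning mask for a target weight is its truncated binary expansion. The sketch below is an assumption-laden illustration, not the construction of any cited work: the function name `binary_mask` is hypothetical, only the weights whose sign matches the target are listed (those of the opposite sign are understood to be pruned), and the number of bits is rounded up to an integer.

```python
import math

def binary_mask(alpha, eps):
    """Return (weights, mask) such that |alpha - sum_j mask_j * weights_j| <= eps
    for a target weight |alpha| <= 1, using weights sign * 2^{-j}."""
    n_bits = math.ceil(math.log2(1.0 / eps))
    sign = 1.0 if alpha >= 0 else -1.0
    # +-1, +-1/2, ..., +-2^{-n_bits}; the smallest weight is at most eps
    weights = [sign * 2.0 ** (-j) for j in range(n_bits + 1)]
    mask, remainder = [], abs(alpha)
    for j in range(n_bits + 1):
        bit = 1 if remainder >= 2.0 ** (-j) else 0   # greedy binary expansion
        remainder -= bit * 2.0 ** (-j)
        mask.append(bit)
    return weights, mask

def approx(alpha, eps):
    """Value realised by the pruned block for target alpha."""
    weights, mask = binary_mask(alpha, eps)
    return sum(m * w for m, w in zip(mask, weights))
```

The number of weights per block is of order $\log\frac{1}{\epsilon}$ and does not depend on $n_i n_{i+1} |B_{i \to i+1}|$, which is the point made in the discussion: no union bound over SUBSET-SUM problems is needed when the initialization is deterministic.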

C PROOF OF SLT ON MLP USING THEOREM 1

This corollary recovers the main result of Pensia et al. (2020). In this case, $G = \{e\}$ and the representation is trivial. The building block of a layer is $F_i = \mathbb{R}$ and each layer is composed of a stack of $n_i$ such blocks, i.e. $\mathbb{R}^{n_i} = F_i^{n_i}$. The norm that we use on $\mathbb{R}$ is the absolute value $|\cdot|$. Therefore, as explained above, the norm that we consider on $F_i^{n_i} = \mathbb{R}^{n_i}$ is $\|\cdot\|_\infty$. The pointwise ReLU is trivially equivariant, since $G$ is trivial; it is moreover 1-Lipschitz. All maps are equivariant, again because $G$ is trivial. An equivariant basis of the maps $F_i \to F_{i+1}$ and of the maps $F_i \to F_i$ is therefore a basis of the maps $\mathbb{R} \to \mathbb{R}$. It is of dimension 1 and is taken to be the identity. We therefore obtain that the identity is in $B_{i \to i}$. One has $|B_{i \to i+1}| = 1$ and $|||B_{i \to i+1}||| = \max_{|\alpha| \le 1} |||\alpha I||| = 1$. All the conditions are therefore satisfied and we are free to apply Theorem 1 in this setting with $\max(|B_{i \to i+1}|, |||B_{i \to i+1}|||) = 1$, which leads to the following corollary:

Corollary 3. Let $h \in \mathcal{H}$ be a random MLP of depth $2l$, i.e., $h(x) = W^h_{2l-1} \sigma \dots \sigma(W^h_0 x)$, where $W^h_{2i} \in \mathbb{R}^{n_i \times \tilde{n}_i}$ and $W^h_{2i+1} \in \mathbb{R}^{\tilde{n}_i \times n_{i+1}}$ are dense linear maps with weights drawn from $U([-1, 1])$. If $\tilde{n}_i = C_2 n_i \log\left(\frac{n_i n_{i+1} l}{\min(\epsilon, \delta)}\right)$, then with probability at least $1 - \delta$ we have that for all $f \in \mathcal{F}$, a target MLP with layers $W^f_i \in [-1, 1]^{n_i \times n_{i+1}}$ and $|||f_i||| \le 1$, there exists a collection of pruning masks $S_{2l-1}, \dots, S_0$ such that

$$\max_{x \in \mathbb{R}^{n_0},\ \|x\| \le 1} \left\| (S_{2l-1} \odot W^h_{2l-1})\, \sigma\big( \dots \sigma((S_0 \odot W^h_0)\, x)\big) - f(x) \right\| \le \epsilon.$$

D PROOF OF SLT ON CNN USING THEOREM 1

We now apply Theorem 1 to the case of regular translation-equivariant CNNs. We highlight that this is a strict generalization of the result obtained by da Cunha et al. (2022b), as we do not assume strictly positive inputs (recently extended in parallel in Burkholz (2022a)).

Corollary 4. Let $h \in \mathcal{H}$ be a random CNN of depth $2l$, i.e., $h(x) = K^h_{2l-1} * \sigma \dots \sigma(K^h_0 * x)$, where $K^h_{2i} \in \mathbb{R}^{d^2 \times n_i \times \tilde{n}_i}$ and $K^h_{2i+1} \in \mathbb{R}^{d^2 \times \tilde{n}_i \times n_{i+1}}$ are convolutional kernels with weights drawn from $U([-1, 1])$. If $\tilde{n}_i = C_2 n_i \log\left(\frac{d^2 n_i n_{i+1} l}{\min(\epsilon, \delta)}\right)$, then with probability at least $1 - \delta$ we have that for all $f \in \mathcal{F}$, a target CNN with kernels $K^f_i \in [-1, 1]^{d^2 \times n_i \times n_{i+1}}$ and $|||f_i||| \le 1$, there exists a collection of pruning masks $S_{2l-1}, \dots, S_0$ such that

$$\max_{x \in \mathbb{R}^{d^2 \times n_0},\ \|x\| \le 1} \left\| (S_{2l-1} \odot K^h_{2l-1}) * \sigma\big( \dots \sigma((S_0 \odot K^h_0) * x)\big) - f(x) \right\| \le \epsilon.$$

We now prove Corollary 4. In our case, the building blocks of every layer are $F_i = \mathbb{R}^{d^2}$, where $d^2$ is the size of an image. Therefore, $B_{i \to i+1}$ is a basis of the translation equivariant maps $\mathbb{R}^{d^2} \to \mathbb{R}^{d^2}$. When working with CNNs, the basis used in practice consists of the convolutions with the kernels $K_{p,q} \in \mathbb{R}^{d^2}$, where $K_{p,q}$ has a single 1 at index $(p, q)$ and zeros everywhere else on the $d \times d$ grid, for $(p, q) \in [d]^2$. It is therefore easy to see that $|B_{i \to i+1}| = d^2$. Let us choose $\|\cdot\|_\infty$ as the norm on $\mathbb{R}^{d^2}$. Applying Proposition 1 from da Cunha et al. (2022b), we get: $\forall K \in \mathbb{R}^{d \times d}$, $\forall X \in \mathbb{R}^{d \times d}$, $\|K * X\|_\infty \le \|K\|_1 \|X\|_\infty$. Using this basis we then get that

$$|||B_{i \to i+1}||| = \max_{K \in [-1,1]^{d^2}} \max_{X \in [-1,1]^{d^2}} \|K * X\|_\infty \le \max_{K \in [-1,1]^{d^2}} \max_{X \in [-1,1]^{d^2}} \|K\|_1 \|X\|_\infty \le d^2.$$

We then get that $\max(|B_{i \to i+1}|, |||B_{i \to i+1}|||) = d^2$. The pointwise ReLU is equivariant and 1-Lipschitz. Moreover, the identity is clearly in $B_{i \to i}$, obtained by taking the kernel with a single 1 at the origin. Therefore all the conditions are met and we can apply Theorem 1, which states that the overparametrization needed is:

$$\tilde{n}_i = C_2 n_i \log\left(\frac{n_i n_{i+1} \max(|B_{i \to i+1}|, |||B_{i \to i+1}|||)\, l}{\min(\epsilon, \delta)}\right) = C_2 n_i \log\left(\frac{d^2 n_i n_{i+1} l}{\min(\epsilon, \delta)}\right).$$
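The two facts used in the proof of Corollary 4, translation equivariance of convolution and the bound $\|K * X\|_\infty \le \|K\|_1 \|X\|_\infty \le d^2$, can be checked on random instances. The sketch below assumes a circular (periodic) convolution on the $d \times d$ grid, which is one convenient convention and not necessarily the one used in the experiments; the function names are illustrative.

```python
import numpy as np

def circ_conv2d(K, X):
    """Circular 2-D convolution on a d x d grid:
    (K * X)[a, b] = sum_{p, q} K[p, q] * X[a - p, b - q] (indices mod d)."""
    out = np.zeros_like(X, dtype=float)
    d = K.shape[0]
    for p in range(d):
        for q in range(d):
            # np.roll(X, (p, q))[a, b] == X[a - p, b - q]; this term is the
            # contribution of the one-hot basis kernel K_{p,q} scaled by K[p, q]
            out += K[p, q] * np.roll(X, shift=(p, q), axis=(0, 1))
    return out

def check_cnn_facts(seed=0, d=5):
    rng = np.random.default_rng(seed)
    K = rng.uniform(-1, 1, (d, d))
    X = rng.uniform(-1, 1, (d, d))
    Y = circ_conv2d(K, X)
    # norm bound used to show |||B||| <= d^2
    bound_ok = np.abs(Y).max() <= np.abs(K).sum() * np.abs(X).max() + 1e-12
    l1_ok = np.abs(K).sum() <= d * d
    # translation equivariance: shifting the input shifts the output
    shift = (2, 3)
    equiv_ok = np.allclose(
        circ_conv2d(K, np.roll(X, shift, axis=(0, 1))),
        np.roll(Y, shift, axis=(0, 1)),
    )
    return bool(bound_ok), bool(l1_ok), bool(equiv_ok)
```

Pruning individual kernel entries corresponds to zeroing coefficients of the one-hot basis kernels, so the pruned layer remains a convolution and hence translation equivariant.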
E ADDITIONAL MATERIAL ON E(2)-STEERABLE NETWORKS

E.1 GENERAL EQUIVARIANT LAYERS IN THE CASE OF FEATURE FIELDS DEFINED ON $\mathbb{R}^2$

In full generality, the theory of E(2)-steerable CNNs has been developed in the setting of continuous and infinite steerable fields defined on $\mathbb{R}^2$. The input and output of a layer are then respectively functions in $(\mathbb{R}^2 \to \mathbb{R}^{c_{in}})$ and in $(\mathbb{R}^2 \to \mathbb{R}^{c_{out}})$. The reader will immediately note that this does not correspond to the practical case of E(2)-steerable CNNs, since the inputs used in practice are not infinite dimensional. A layer between these two feature fields is equivariant if it can be written as a continuous convolution with a kernel satisfying the following condition (such kernels are called equivariant kernels):

$$\pi(g \cdot x) = \rho_{out}(g)\, \pi(x)\, \rho_{in}(g)^{-1}, \quad \forall g \in G,\ x \in X.$$

There are different methods to compute the kernels that satisfy this condition, and they lead to different bases. For example, Weiler and Cesa (2019) use polar coordinates to solve this condition. They have a free parameter, the frequency, and by varying this parameter they can compute a basis of the equivariant kernels. Our method to construct a basis of the equivariant kernels is different: we quotient the plane $\mathbb{R}^2$ by the equivalence relation induced by the orbits under the group $G$. For each point in the continuous quotient space $A_R$, we compute a basis of the equivariant kernels by putting an element of the canonical basis at this point and summing over the group $G$ the action of an element of $G$ on this element. More precisely, we place some matrix $K^{p,q}_{0,x}$ at the point $x \ne 0$ and, to obtain the full equivariant kernel, we apply the following formula:

$$\forall y \in \mathbb{R}^2, \quad b(y) := K^{p,q}_{G,x}(y) = \sum_{g \in G} \rho_{i+1}(g)\, K^{p,q}_{0,x}(g^{-1} y)\, \rho_i(g^{-1}).$$

This formula is well defined because, for subgroups of $O(2)$ and for all $x, y \in \mathbb{R}^2 \setminus \{0\}$, the set $\{g \in G,\ g \cdot x = y\}$ is finite, meaning that the above sum is finite for every $y \in \mathbb{R}^2$.
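The symmetrization formula above can be sketched on a discrete grid. The example below is a minimal illustration under stated assumptions, not the authors' implementation: it takes $G = C_4$ acting on integer grid points by rotations of 90° (so no interpolation is needed) and uses the regular representation for both $\rho_i$ and $\rho_{i+1}$. All function names are hypothetical.

```python
N = 4  # order of the cyclic group C_4 (assumption made for this sketch)


def rot90(point, times):
    """Rotate an integer grid point counter-clockwise by times * 90 degrees."""
    a, b = point
    for _ in range(times % N):
        a, b = -b, a
    return (a, b)


def regular_rep(g):
    """Permutation matrix of the regular representation: e_h -> e_{(g + h) mod N}."""
    return [[1 if r == (g + h) % N else 0 for h in range(N)] for r in range(N)]


def matmul(a, b):
    """Product of two N x N matrices given as lists of rows."""
    return [[sum(a[r][t] * b[t][s] for t in range(N)) for s in range(N)] for r in range(N)]


def symmetrized_kernel(x, p, q, y):
    """Evaluate K_{G,x}^{p,q}(y) = sum_g rho(g) K_{0,x}^{p,q}(g^{-1} y) rho(g^{-1}).

    K_{0,x}^{p,q} is the canonical matrix E_pq at the point x and zero elsewhere.
    """
    e_pq = [[1 if (r, s) == (p, q) else 0 for s in range(N)] for r in range(N)]
    total = [[0] * N for _ in range(N)]
    for g in range(N):
        if rot90(y, -g) != x:  # K_{0,x} vanishes away from x
            continue
        term = matmul(matmul(regular_rep(g), e_pq), regular_rep(-g))
        total = [[total[r][s] + term[r][s] for s in range(N)] for r in range(N)]
    return total


def is_equivariant(x, p, q, radius=2):
    """Check K(h . y) == rho(h) K(y) rho(h)^{-1} on a small grid, for all h in C_4."""
    points = [(a, b) for a in range(-radius, radius + 1) for b in range(-radius, radius + 1)]
    for y in points:
        k_y = symmetrized_kernel(x, p, q, y)
        for h in range(N):
            lhs = symmetrized_kernel(x, p, q, rot90(y, h))
            rhs = matmul(matmul(regular_rep(h), k_y), regular_rep(-h))
            if lhs != rhs:
                return False
    return True
```

Calling `is_equivariant((1, 0), 0, 1)` checks the equivariance constraint for a kernel seeded at a point with trivial stabilizer, while seeding at `(0, 0)` exercises the case where every group element fixes the point.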
It remains to check that the kernel $K^{p,q}_{G,x}$ satisfies the above condition on equivariant kernels. Indeed, for all $y \in \mathbb{R}^2$ and all $h \in G$:

$$
\begin{aligned}
K^{p,q}_{G,x}(h \cdot y) &= \sum_{g \in G} \rho_{i+1}(g)\, K^{p,q}_{0,x}(g^{-1} \cdot (h \cdot y))\, \rho_i(g^{-1}) \\
&= \sum_{g \in G} \rho_{i+1}(h)\, \rho_{i+1}(h^{-1} g)\, K^{p,q}_{0,x}((h^{-1} g)^{-1} \cdot y)\, \rho_i(g^{-1} h)\, \rho_i(h)^{-1} \\
&= \rho_{i+1}(h) \Big( \sum_{g \in G} \rho_{i+1}(h^{-1} g)\, K^{p,q}_{0,x}((h^{-1} g)^{-1} \cdot y)\, \rho_i(g^{-1} h) \Big) \rho_i(h)^{-1} \\
&= \rho_{i+1}(h)\, K^{p,q}_{G,x}(y)\, \rho_i(h)^{-1},
\end{aligned}
$$

where we used that $g \mapsto h^{-1} g$ is a bijection from $G$ to $G$.

One should note that for some groups $G$ and some $x \ne 0$ there may exist $g \in G$ with $g \ne e$ and $g \cdot x = x$. The set of all elements that leave the point unchanged is known as the stabilizer subgroup. For example, for $G$ a dihedral group and a point $x$ on a symmetry axis, $x$ is left unchanged by the reflection with respect to this axis. This is however not a problem, as the set of such $g$ is finite, and therefore the above formula is still valid, even at the point $x$. One will note however that $K^{p,q}_{G,x}(x) \ne K^{p,q}_{0,x}(x)$. This means that the set of equivariant kernels $\{K^{p,q}_{G,x},\ (p, q) \in [c_{in}] \times [c_{out}]\}$ may no longer be composed of independent vectors. It still spans the space of equivariant kernels, but it need not form a basis.

O(2) we consider, rotations of the base space are performed using bilinear interpolation. Finally, we downsample to the original size in order to obtain the discretized version of $K^{p,q}_{G,x}$.

E.4 PROOF OF SLT ON E(2)-STEERABLE CNNS USING THEOREM 1

We now prove Corollary 1. We work with trivial or regular representations of $G \le O(2)$ on top of feature fields. It is straightforward that the pointwise ReLU is equivariant. Moreover, to easily compute $|||B_{i \to i+1}|||$ we work with $\|\cdot\|_\infty$, for which the ReLU is 1-Lipschitz. Finally, the identity can trivially be written as the convolution with an equivariant kernel having the identity at the origin (the identity is trivially a circulant matrix). Therefore we have that $I \in B_{i \to i}$.

If we use a trivial representation on top of the feature field at layer $i$, then the building block of this layer is $F_i = \mathbb{R}^{d^2}$. If we instead use a regular representation, then the building block of this layer is $F_i = \mathbb{R}^{d^2 \times |G|}$. From the construction of the equivariant basis, we deduce that $|B_{i \to i+1}| \le d^2 |G|^2$ for each layer. Indeed, we must first choose a pixel in the set of representatives $A_R \subset \{-\frac{d}{2}, -\frac{d-2}{2}, \dots, \frac{d-2}{2}, \frac{d}{2}\}^2 \simeq [d] \times [d]$ grid, and then choose a subset of the canonical basis at this point. This canonical basis has $|G| \times |G|$ elements for regular to regular, $1 \times |G|$ elements for trivial to regular (or regular to trivial), and only $1 \times 1$ for trivial to trivial. There are even fewer at some points, such as the origin, because of the additional constraints. In all cases there are fewer than $d^2 \times |G|^2$ choices, which indicates that $|B_{i \to i+1}| \le d^2 |G|^2$.

In fact, since we can only choose $x \in A_R$ to obtain a set of independent elements, the true dependency will be $|B_{i \to i+1}| \simeq |A_R| \cdot |G|^2 \simeq \frac{d^2}{|G|} \cdot |G|^2 = d^2 |G|$. However, because of the discretization procedure, it is easier to upper bound by $d^2 |G|^2$, since the cardinality of the discretized version of $A_R$ is not easily computable. Moreover, the reader will note that the cardinality of the basis has no real significance by itself: the basis was computed with an arbitrary discretization procedure, and another procedure may have led to another cardinality.
Due to artifacts of the discretization procedure, the basis $B_{i \to i+1}$ that we construct can only approximate a subset of all equivariant maps. We now compute $|||B_{i \to i+1}|||$ when employing $\|\cdot\|_\infty$ on each feature space. Applying the triangle inequality we get:

$$|||B_{i \to i+1}||| \le |B_{i \to i+1}| \max_{b_{i \to i+1,k} \in B_{i \to i+1}} |||b_{i \to i+1,k}|||.$$

It then remains to upper-bound $|||b_{i \to i+1,k}|||$ for every element of the basis. For all $x \in A_R$ and for all $p, q \in [|G|]$, denote by $b_{i \to i+1,p,q,x}$ the convolution with the equivariant kernel $K^{p,q}_{G,x}$. Using a result from da Cunha et al. (2022b), we have $|||b_{i \to i+1,p,q,x}||| \le \|K^{p,q}_{G,x}\|_1$. Then, in a non-discretized kernel setting, noticing that the orbit of $x$ has $|G|$ elements, one has $\|K^{p,q}_{G,x}\|_1 \le |G|$, and therefore $|||b_{i \to i+1,p,q,x}||| \le |G|$. For the identity this remains true since, using circulant matrices, $\|K^{p,q}_{G,0}\|_1 = |G|$.

F ADDITIONAL MATERIAL ON THE PERMUTATION EQUIVARIANT NETWORKS

F.1 PROOF OF SLT ON PERMUTATION EQUIVARIANT NETWORKS

The aim of this appendix is to prove Corollary 2. The building blocks of the layers are here $F_i = \mathbb{R}^{n^{k_i}}$. Taking direct sums of them we obtain $F_i^{n_i} = \mathbb{R}^{n^{k_i} \times n_i}$. As in appendix section E.4, the pointwise ReLU is equivariant, and we facilitate the computation of $|||B_{i \to i+1}|||$ by working with $\|\cdot\|_\infty$, for which the ReLU is 1-Lipschitz. As explained above, the norm that we must consider on $F_i^{n_i} = \mathbb{R}^{n^{k_i} \times n_i}$ to apply Theorem 1 is the max of the norms across the blocks, i.e. still $\|\cdot\|_\infty$ on $\mathbb{R}^{n^{k_i} \times n_i}$. First, observe that $|||B_{i \to i+1}||| = n^{k_i} + 1$.

Proof. One can check that the worst-case scenario happens when making $\sum_{b_k \in B_{i \to i+1}} b_k$ act on a tensor $X \in \mathbb{R}^{n^{k_i}}$ full of 1. For $a \in [n]^{k_i}$, denote by $Y_a$ the tensor in $\mathbb{R}^{n^{k_i}}$ that has a 1 at the index $a$ and 0 everywhere else. The tensor full of 1 is therefore $\sum_{a \in [n]^{k_i}} Y_a$.

$$
\begin{aligned}
|||B_{i \to i+1}||| &= \max_{\|\alpha\|_\infty \le 1}\ \max_{\|X\|_\infty \le 1} \Big\| \sum_k \alpha_k b_k X \Big\|_\infty \\
&= \Big\| \Big( \sum_{b_k \in B_{i \to i+1}} b_k \Big) \Big( \sum_{a \in [n]^{k_i}} Y_a \Big) \Big\|_\infty \\
&\le \Big\| \Big( I + \sum_{\mu \in [n]^{k_i + k_{i+1}}/Q} B_\mu \Big) \Big( \sum_{a \in [n]^{k_i}} Y_a \Big) \Big\|_\infty \\
&\le \max_{b \in [n]^{k_{i+1}}} \Big( \Big( I + \sum_{\mu \in [n]^{k_i + k_{i+1}}/Q} B_\mu \Big) \sum_{a \in [n]^{k_i}} \cdots \Big)
\end{aligned}
$$

G EXPERIMENTAL DETAILS

The purpose of our experiments is to empirically validate the theory developed in the main text, and thereby show that by solving appropriate SUBSET-SUM problems one can prune an overparameterized random network to approximate a target one. In this section of the appendix, we describe the network architectures we use for our experiments, the overparameterization scheme we select in order to be compatible with our claims, and the linear program we solve for each target weight in order to find the sparsification mask which leads to the approximation of the target network by the overparameterized one.

For both MPGNN and E(2)-CNN experiments, we first train a single target network on the supervised tasks described in Table 3. The architecture we use for each of the target networks is described in Tables 4, 5, and 6 below. Notice that we do not use biases in the parameterized layers, nor learnable element-wise affine transformations in the batch normalization layers. We train for 50 epochs using AdamW as the optimizer with learning rate 0.015, default momentum parameters β = (0.9, 0.999), and a cosine scheduler. The weight decay coefficient is set to 5e-4. For the transductive learning tasks on Cora and CiteSeer with the MPGNN, we define an epoch as 10 parameter updates. For the image classification tasks on RotMNIST and FlipRotMNIST with the E(2)-CNN, the batch size is set to 64. Finally, the target model is selected as the one which achieves the best validation accuracy throughout training.

Afterwards, we define the overparameterized network. In particular, for each parameterized layer (linear or equivariant) of the target network we declare a module with which we are going to approximate it. The module is the composition of three layers: the first and the last are of the same type as the target layer, and the middle one is an element-wise ReLU activation function.
We make sure that the shapes of the input and output tensors match. We initialize these modules using i.i.d. samples drawn from U([-a, a]), where a is twice the maximum absolute parameter value of the target network. As we explain in appendix section A, this is compatible with our theorem. For each parameter in the target network we solve two SUBSET-SUM problems, one to approximate the positive input tensors and one to approximate the negative input tensors. This distinction is needed if we want to use a ReLU in the overparameterized layers. The width is overparameterized by multiplying the input tensor size by a number that scales proportionally to a constant hyperparameter C, and logarithmically in the input and output sizes, the number of layers to be approximated, and 1/ϵ, where ϵ is the desired network approximation error. For our experiments, we use ϵ = 1e-2. For further details, we refer the reader to the associated Python repository that we provide.

Finally, we solve each SUBSET-SUM problem by treating it as a mixed-integer linear program, similar to Pensia et al. (2020). Each of the problems amounts to a different constrained optimization problem of the following form:

In the optimization problem above, x is a vector resulting from the multiplication of the two weight matrices which participate in the diamond-shaped approximation scheme for each target weight y, as explained in 1. The optimization variables z and m are, respectively, the absolute weight approximation error and the part of the binary mask of the second layer in the overparameterized network that is responsible for approximating the particular weight y.
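To make the role of the binary mask m concrete, here is a minimal sketch that finds the mask minimizing the absolute error $|y - m^\top x|$ for one target weight. It is an illustration only: exhaustive search over all masks is swapped in for the mixed-integer linear program solver used in the experiments, so it is practical only for a handful of candidates, and the function name `best_subset_mask` is hypothetical.

```python
import itertools


def best_subset_mask(x, y):
    """Return (mask, error) minimizing |y - sum_i mask[i] * x[i]| over binary masks.

    Brute-force stand-in for the MILP described in the text: it enumerates all
    2^len(x) masks, so it should only be used with short candidate vectors x.
    """
    best_mask, best_err = None, float("inf")
    for mask in itertools.product((0, 1), repeat=len(x)):
        approx = sum(m * xi for m, xi in zip(mask, x))
        err = abs(y - approx)
        if err < best_err:
            best_mask, best_err = mask, err
    return best_mask, best_err


# Example: approximate the target weight 0.3 from five candidate products.
candidates = [0.9, -0.7, 0.4, 0.15, -0.05]
mask, err = best_subset_mask(candidates, 0.3)
```

The returned error corresponds to the variable z at the optimum, and the returned mask to m. A real run would replace the enumeration with a MILP solver, since the number of masks grows exponentially with the number of candidates.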



It will work with any distribution which contains a uniform distribution, e.g. Gaussian, see §A.

Care must be taken at the origin, since $\forall g \in G$, $g \cdot 0 = 0$, and the set of permissible matrices depends on $G$ as well as our choice of representations. We provide a thorough treatment of this case in §E.2.

$B$ is the basis of equivariant kernels. $B_{i \to i+1}$ is obtained by taking the 2D convolution with these elements.

Note that this canonical basis is of the same form (but different shape) as $\kappa_{n_i \to n_{i+1}}$ used for $F_i^{n_i} \to F_{i+1}^{n_{i+1}}$.

$1_{(a,b) \in \mu} = 1$ if $(a, b) \in \mu$ and 0 otherwise, for $a \in [n]^{k_i}$ and $b \in [n]^{k_{i+1}}$.



For a basis $B = \{b_1, \dots, b_p\}$, we write $|||B||| = \max_{\|\alpha\|_\infty \le 1} ||| \sum_{k=1}^{p} \alpha_k b_k |||$. $\sigma(x) = x^+$ is the pointwise ReLU. Finally, we take $(\epsilon, \delta) \in [0, \frac{1}{2}]^2$, and $U([a, b])$ is the uniform distribution on $[a, b]$.

Figure 1: General Equivariant Pruning Method

Our results and proof techniques build upon the line of work by Pensia et al. (2020), da Cunha et al. (2022b), and Burkholz (2022a). Specifically, we rely on the SUBSET-SUM algorithm (Lueker, 1998) to aid in approximating any given parameter of the target network. Departing from prior work, the main idea used in our technical analysis is to prune an overparametrized equivariant network in a way that preserves equivariance, as applying SUBSET-SUM using the construction of da Cunha et al. (2022b) may destroy the pruned network's equivariance.

Instantiations of Theorem 1 for different choices of G. The MLP case was proven in Pensia et al. (2020), and the CNN case in da Cunha et al. (2022b); Burkholz (2022a). We write a ∨ b := max(a, b).

drawn from U([-1, 1]). Further suppose that each ñi = C 1 n i log( nini+1 max(|Bi→i+1|,|||Bi→i+1|||)   min(ϵ,δ)

$n^{k_i}$. Moreover, $|B_{i \to i+1}| = b(k_i + k_{i+1})$ by definition of the Bell numbers. In fact, the interested reader can check that $|B_{i \to i+1}| \le b(k_i + k_{i+1})$ in general, with equality as soon as $n \ge k_i + k_{i+1}$ (for example, with $n = 1$ one cannot have a linearly independent family of $b(k_i + k_{i+1})$ vectors in $L(\mathbb{R}, \mathbb{R})$, which is of dimension 1). The argument in Maron et al. (2019) needs $n \ge k_i + k_{i+1}$ to ensure that all the equivalence classes $\mu$ have at least one element. Finally, all the conditions to apply Theorem 1 are satisfied, and one only needs to replace $\max(|B_{i \to i+1}|, |||B_{i \to i+1}|||)$ by $\max(b(k_i + k_{i+1}), n^{k_i} + 1)$.
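The counting claim above can be checked by brute force for small values. The sketch below is illustrative (the function names are not from the paper): it counts the equivalence classes of multi-indices in $[n]^k$, with $k = k_i + k_{i+1}$, where two multi-indices are equivalent when they have the same pattern of equal entries, and compares the count with the Bell number $b(k)$.

```python
from itertools import product


def bell(k):
    """Bell number b(k), computed with the Bell triangle."""
    row = [1]
    for _ in range(k):
        new_row = [row[-1]]
        for value in row:
            new_row.append(new_row[-1] + value)
        row = new_row
    return row[0]


def equality_pattern(index):
    """Canonical label of the equality pattern of a multi-index, e.g. (3, 1, 3) -> (0, 1, 0)."""
    seen = {}
    return tuple(seen.setdefault(value, len(seen)) for value in index)


def count_orbits(n, k):
    """Number of equality patterns realized by multi-indices in [n]^k."""
    return len({equality_pattern(index) for index in product(range(n), repeat=k)})
```

For instance, `count_orbits(4, 4)` and `bell(4)` agree, whereas `count_orbits(1, 2)` is smaller than `bell(2)`, which is the $n = 1$ situation mentioned in the text: some equivalence classes are empty when $n < k_i + k_{i+1}$.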

$$y - m^\top x \le z, \qquad m^\top x - y \le z$$

Figure 3: Ablation study of max. relative output error and pruned accuracy w.r.t. C for C4, C8, D4.

Target network architecture for the MPGNN experiments.

Target network architecture for the E2CNN experiments.

9. ACKNOWLEDGEMENTS

The authors would like to thank Louis Pascal Xhonneux, Mandana Samiei, Mehrnaz Mofakhami, and Tara Akhound-Sadegh for insightful feedback on early drafts of this work. In addition, the authors thank Riashat Islam, Manuel Del Verme, Mandana Samiei, and Andjela Mladenovic for their generous sharing of computational resources. AJB is supported by the IVADO Ph.D. Fellowship.


Published as a conference paper at ICLR 2023

E.2 CONSTRUCTION OF THE KERNEL AT THE ORIGIN

We would like to apply our basis construction formula to every point in the plane, including the origin, but the problem is that at the origin $\forall g \in G$, $g \cdot 0 = 0$. Therefore the above sum is not well defined for infinite groups, because it then has infinitely many terms. We can only apply this formula in the case of finite $G$. The usual way to solve the problem at the origin when dealing with infinite groups is to solve the linear constraints $\pi(g \cdot x) = \rho_{out}(g)\, \pi(x)\, \rho_{in}(g)^{-1}$ directly. However, in the setting of Corollary 1, we deal with finite subgroups of $O(2)$. Therefore we can apply the above formula:

$$K^{p,q}_{G,0}(y) = \sum_{g \in G} \rho_{i+1}(g)\, K^{p,q}_{0,0}(g^{-1} y)\, \rho_i(g^{-1}).$$

In our case, when dealing with the regular representation, if one takes $G = C_n$, the cyclic group of $n$ rotations, one can check that the $\rho_i(g)$ are the permutation matrices associated with the permutations $h \mapsto g \cdot h$ of $G$. One can then check that summing over $G$ leads to a circulant matrix.

We have thus computed the set of equivariant kernels at the origin by using the above formula. Alternatively, one could solve the linear constraints imposed by equivariance. Here they amount to requiring that the kernel at the origin commutes with all the matrices associated with the permutations $h \mapsto g \cdot h$ of $G$. Solving this leads to the set of circulant matrices.
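The commutation condition at the origin is easy to check numerically for $G = C_n$ with the regular representation. The following minimal sketch, with illustrative function names, builds the cyclic shift matrix that generates the regular representation of $C_n$ and tests whether a given matrix commutes with it. Commuting with the generator is enough, since every other group element is a power of it.

```python
def shift_matrix(n):
    """Permutation matrix of the generator of C_n: e_s -> e_{(s + 1) mod n}."""
    return [[1 if r == (s + 1) % n else 0 for s in range(n)] for r in range(n)]


def circulant(first_column):
    """Circulant matrix whose entry (r, s) is first_column[(r - s) mod n]."""
    n = len(first_column)
    return [[first_column[(r - s) % n] for s in range(n)] for r in range(n)]


def matmul(a, b):
    """Product of two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[r][t] * b[t][s] for t in range(n)) for s in range(n)] for r in range(n)]


def commutes(a, b):
    """True if a @ b == b @ a."""
    return matmul(a, b) == matmul(b, a)
```

For example, `commutes(circulant([1, 2, 3, 4]), shift_matrix(4))` holds, while a diagonal matrix with distinct entries fails the check.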

E.3 DISCRETIZATION OF R 2

We now highlight the practical challenges of building equivariant networks and their associated pruning when we discretize continuous signals on $\mathbb{R}^2$ to a pixelized grid.

The first problem we want to address is that we do not usually work on the plane $\mathbb{R}^2$ but on spatially delimited images on $[-\frac{d}{2}, \frac{d}{2}]^2$. This is problematic since, when $G$ acts on a square image, the result may no longer be a square image after a rotation. For example, $C_8$ does not always send $[-\frac{d}{2}, \frac{d}{2}]^2$ to $[-\frac{d}{2}, \frac{d}{2}]^2$ (take the rotation by 45° for instance). In the same way, restricting the equivariant kernel to a finite space $[-\frac{d}{2}, \frac{d}{2}]^2$, as is done in usual CNNs, would lead to problems since for some [...]

We overcome this issue by defining the kernels not on $\mathbb{R}^2$ but on a disk centered at the origin whose diameter equals the size of the image (see Figure 2). To implement this, we multiply with a mask which exponentially decays to zero for points with radius larger than the radius of the disk. This is permitted because the equivariance constraint only relates points within the same orbit, and for subgroups of $O(2)$ the disk is trivially stable under the action of the group. Therefore, the kernels that we obtain still satisfy the equivariance constraint.

The second problem that we must address is the discretization process. Indeed, we do not work with continuous feature fields f : [...]. This is a problem, because the equivariance constraint Eq. 26 puts constraints between $k(g \cdot x)$ and $k(x)$. Equation 26 cannot be used anymore because $g \cdot x$ is not always on the grid. For instance, if $x = (1, 1)$ and $g$ is the rotation by 45°, then g [...]

Moreover, note that it is not sufficient to discretize the equivariant kernels: one must choose only a finite subset of them. Indeed, the dimension of the space of equivariant maps must be finite in the discretized setting, as opposed to the continuous setting where it is infinite.
In practice, the network is not exactly equivariant, but almost equivariant due to a discretization error. However, this is not an issue in the setting of Theorem 1. Indeed, once we have chosen a basis of the "almost-equivariant" kernels, we can prove the SLTH for the class of such networks, which is exactly the result that we want in practice.

To choose a finite subset of the equivariant kernels, Weiler and Cesa (2019) upper-bound the frequency of the polar coordinate solution by an anti-aliasing condition. They then discretize the continuous kernels on the grid. For our basis construction, we choose a finite subset of the equivariant kernels by restricting $A_R$ to $A_R \cap \{-\frac{d}{2}, -\frac{d-2}{2}, \dots, \frac{d}{2}\}^2$. There are many different ways to discretize our kernels $K^{p,q}_{G,x}$ defined on $\mathbb{R}^2$. One way would be to send $g \cdot x$ to the nearest pixel if it is not on the grid. In order to decrease the discretization error, we first upsample the grid by a factor of 3 before we start applying actions of the group $G$ to the base space. For the subgroups of

