WHY ARE CONVOLUTIONAL NETS MORE SAMPLE-EFFICIENT THAN FULLY-CONNECTED NETS?

Abstract

Convolutional neural networks often dominate their fully-connected counterparts in generalization performance, especially on image classification tasks. This is often explained in terms of "better inductive bias." However, this explanation has never been made mathematically rigorous, and the hurdle is that a sufficiently wide fully-connected net can always simulate the convolutional net; thus the training algorithm must play a role. The current work describes a natural task on which a provable sample complexity gap can be shown, for standard training algorithms. We construct a single natural distribution on $\mathbb{R}^d \times \{\pm 1\}$ on which any orthogonal-invariant algorithm (i.e., fully-connected networks trained with most gradient-based methods from Gaussian initialization) requires $\Omega(d^2)$ samples to generalize, while $O(1)$ samples suffice for convolutional architectures. Furthermore, we demonstrate a single target function such that learning it on all possible distributions leads to an $O(1)$ vs. $\Omega(d^2/\varepsilon)$ gap. The proof relies on the fact that SGD on a fully-connected network is orthogonal equivariant. Similar results are obtained for $\ell_2$ regression and adaptive training algorithms, e.g., Adam and AdaGrad, which are only permutation equivariant.

1. INTRODUCTION

Deep convolutional nets ("ConvNets") are at the center of the deep learning revolution (Krizhevsky et al., 2012; He et al., 2016; Huang et al., 2017). For many tasks, especially in vision, convolutional architectures perform significantly better than their fully-connected ("FC") counterparts, at least given the same amount of training data. Practitioners explain this phenomenon at an intuitive level by pointing out that convolutional architectures have better "inductive bias", which means the following: (i) ConvNets are a better match to the underlying structure of image data, and thus are able to achieve low training loss with far fewer parameters; (ii) models with fewer total parameters generalize better. Surprisingly, the above intuition about the better inductive bias of ConvNets over FC nets has never been made mathematically rigorous. The natural way to make it rigorous would be to exhibit explicit learning tasks that require far more training samples for FC nets than for ConvNets. (Here "task" means, as usual in learning theory, a distribution on data points, with binary labels generated using a fixed labeling function.) Surprisingly, the standard repertoire of lower bound techniques in ML theory does not seem capable of demonstrating such a separation. The reason is that any ConvNet can be simulated by an FC net of sufficient width, since a training algorithm can just zero out unneeded connections and do weight sharing as needed. Thus the key issue is not expressiveness per se, but the combination of architecture plus training algorithm. But if the training algorithm must be accounted for, the usual hurdle arises that we lack good mathematical understanding of the dynamics of deep net training (whether FC or ConvNet). How then can one establish the limitations of "FC nets + current training algorithms"? (Indeed, many lower bound techniques in PAC learning theory are information-theoretic and ignore the training algorithm.)
The current paper makes significant progress on the above problem by exhibiting simple tasks that require an $\Omega(d^2)$ factor more training samples for FC nets than for ConvNets, where $d$ is the data dimension. (In fact this is shown even for 1-dimensional ConvNets; the lower bound easily extends to 2-D ConvNets.) The lower bound holds for FC nets trained with any of the popular algorithms listed in Table 1. (The reader can concretely think of vanilla SGD with Gaussian initialization of network weights, though the proof allows use of momentum, $\ell_2$ regularization, and various learning rate schedules.) Our proof relies on the fact that these popular algorithms lead to an orthogonal-equivariance property of the trained FC nets, which says that at the end of training the FC net, no matter how deep or how wide, will make the same predictions even if we apply an orthogonal transformation to all datapoints (i.e., both training and test).

Figure 1 caption: The input data are 3 × 32 × 32 RGB images, and the binary label indicates for each image whether the first channel has larger $\ell_2$ norm than the second one. The input images are drawn from an entry-wise independent Gaussian distribution (left) and … (right). In both cases, the 3-layer convolutional networks consist of two 3 × 3 convolutions with 10 hidden channels, and a 3 × 3 convolution with a single output channel followed by global average pooling. The 3-layer fully-connected networks consist of two fully-connected layers with 10000 hidden channels and another fully-connected layer with a single output. The 2-layer versions have one less intermediate layer and only 3072 hidden channels per layer. The hybrid networks consist of a single fully-connected layer with 3072 channels followed by two convolutional layers with 10 channels each. bn stands for batch normalization (Ioffe & Szegedy, 2015).
This notion is inspired by Ng (2004) (where it is named "orthogonal invariance"), which showed the power of logistic regression with $\ell_1$ regularization versus other learners. For a variety of learners (including kernels and FC nets) that paper described explicit tasks where the learner has $\Omega(d)$ higher sample complexity than logistic regression with $\ell_1$ regularization. The lower bound example and technique can also be extended to show a (weak) separation between FC nets and ConvNets (see Section 4.2). Our separation is quantitatively stronger than the result one gets using Ng (2004), because the sample complexity gap is $\Omega(d^2)$ vs. $O(1)$, not $\Omega(d)$ vs. $O(1)$. But in a more subtle way our result is conceptually far stronger: the technique of Ng (2004) seems incapable of exhibiting a sample gap of more than $O(1)$ between ConvNets and FC nets in our framework. The reason is that the technique of Ng (2004) can exhibit a hard task for FC nets only after fixing the training algorithm. But there are infinitely many training algorithms once we account for the hyperparameters associated in various epochs with learning rate schedules, $\ell_2$ regularizers, and momentum. Thus Ng (2004)'s technique cannot exclude the possibility that the hard task for "FC net + Algorithm 1" is easy for "FC net + Algorithm 2". Note that we do not claim any issues with the results of Ng (2004); merely that the technique cannot lead to a proper separation between ConvNets and FC nets when the FC nets are allowed to be trained with any of the infinitely many training algorithms. (Section 4.2 spells out in more detail the technical difference between our technique and Ng's idea.) The reader may now be wondering what the single task is that is easy for ConvNets but hard for FC nets trained with any standard algorithm. A simple example is the following: the data distribution in $\mathbb{R}^d$ is standard Gaussian, and the target labeling function is the sign of $\sum_{i=1}^{d/2} x_i^2 - \sum_{i=d/2+1}^{d} x_i^2$.
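This hard task is easy to instantiate. The following numpy sketch (sizes and seed are arbitrary illustrative choices, not from the paper) samples labeled data from it:

```python
import numpy as np

def sample_hard_task(n, d, rng):
    """Sample n points of the task: x ~ N(0, I_d), and the label is the sign
    of (sum of first d/2 squared coordinates) - (sum of last d/2)."""
    assert d % 2 == 0
    x = rng.standard_normal((n, d))
    y = np.sign((x[:, :d // 2] ** 2).sum(axis=1)
                - (x[:, d // 2:] ** 2).sum(axis=1))
    return x, y

rng = np.random.default_rng(0)
x, y = sample_hard_task(1000, 16, rng)
```

Since the two half-sums are identically distributed, the labels are balanced in expectation; on this task the paper's 2-layer ConvNet learns from $O(1)$ samples while any orthogonal-equivariant learner needs $\Omega(d^2)$.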
Figure 1 shows that this task is indeed much more difficult for FC nets. Furthermore, the task is also hard in practice for data distributions other than Gaussian; the figure shows that a sizeable performance gap exists even on CIFAR images with such a target label. Extension to a broader class of algorithms. The orthogonal-equivariance property holds for many types of practical training algorithms, but not all. Notable exceptions are adaptive gradient methods (e.g., Adam and AdaGrad), $\ell_1$ regularization, and initialization schemes that are not spherically symmetric. To prove a lower bound against FC nets with these algorithms, we identify a property, permutation equivariance, which is satisfied by nets trained using such algorithms. We then demonstrate a single and natural task on $\mathbb{R}^d \times \{\pm 1\}$ that resembles real-life image texture classification, on which we prove that any permutation-equivariant learning algorithm requires $\Omega(d)$ training examples to generalize, while Empirical Risk Minimization with $O(1)$ examples can learn a convolutional net. Paper structure. In Section 2 we discuss related work. In Section 3, we define the notation and terminology. In Section 4, we give two warm-up examples and an overview of the proof technique for the main theorem. In Section 5, we present our main results on the lower bounds for orthogonal and permutation equivariant algorithms. Du et al. (2018) attempted to investigate the reason why convolutional nets are more sample-efficient. Specifically, they proved that $O(1)$ samples suffice for learning a convolutional filter, and also proved an $\Omega(d)$ min-max lower bound for learning the class of linear classifiers. Their lower bound is against learning a class of distributions, and their work does not serve as a sample complexity separation, because their upper and lower bounds are proved on different classes of tasks. Arjevani & Shamir (2016) also considered the notion of distribution-specific hardness of learning neural nets.
They focused on proving running time complexity lower bounds against so-called "orthogonally invariant" and "linearly invariant" algorithms. However, here we focus on sample complexity.

2. RELATED WORKS

Recently, there has been progress in showing lower bounds against learning with kernels. Wei et al. (2019) constructed a single task on which they proved a sample complexity separation between learning with neural networks vs. with neural tangent kernels. Notably, the lower bound is specific to neural tangent kernels (Jacot et al., 2018). Relatedly, Allen-Zhu & Li (2019) showed a sample complexity lower bound against all kernels for a family of tasks, i.e., learning $k$-XOR on the hypercube.

3. NOTATION AND PRELIMINARIES

We will use $\mathcal{X} = \mathbb{R}^d$, $\mathcal{Y} = \{-1, 1\}$ to denote the domains of the data and label, and $\mathcal{H} = \{h \mid h: \mathcal{X} \to \mathcal{Y}\}$ to denote the hypothesis class. Formally, given a joint distribution $P$, the error of a hypothesis $h \in \mathcal{H}$ is defined as $\mathrm{err}_P(h) := \Pr_{(x,y)\sim P}[h(x) \neq y]$. If $h$ is a random hypothesis, we define $\mathrm{err}_P(h) := \Pr_{(x,y)\sim P,\,h}[h(x) \neq y]$ for convenience. A class of joint distributions supported on $\mathcal{X} \times \mathcal{Y}$ is referred to as a problem, $\mathcal{P}$. We use $\|\cdot\|_2$ to denote the spectral norm and $\|\cdot\|_F$ the Frobenius norm of a matrix. We use $A \preceq B$ to denote that $B - A$ is a positive semi-definite matrix. We also use $O(d)$ and $GL(d)$ to denote the $d$-dimensional orthogonal group and general linear group respectively. We use $B^{d^2}_p$ to denote the unit Schatten-$p$ norm ball in $\mathbb{R}^{d\times d}$. We use $N(\mu, \Sigma)$ to denote the Gaussian distribution with mean $\mu$ and covariance $\Sigma$. For random variables $X$ and $Y$, we write $X \overset{d}{=} Y$ when $X$ is equal to $Y$ in distribution. In this work, we always use $P_\mathcal{X}$ to denote distributions on $\mathcal{X}$ and $P$ to denote distributions supported jointly on $\mathcal{X} \times \mathcal{Y}$. Given an input distribution $P_\mathcal{X}$ and a hypothesis $h$, we define $P_\mathcal{X} h$ as the joint distribution on $\mathcal{X} \times \mathcal{Y}$ such that $(P_\mathcal{X} h)(S) = P_\mathcal{X}(\{x \mid (x, h(x)) \in S\})$ for all $S \subseteq \mathcal{X} \times \mathcal{Y}$. In other words, to sample $(X, Y) \sim P_\mathcal{X} h$ means to first sample $X \sim P_\mathcal{X}$ and then set $Y = h(X)$. For a family of input distributions $\mathcal{P}_\mathcal{X}$ and a hypothesis class $\mathcal{H}$, we define $\mathcal{P}_\mathcal{X}\mathcal{H} = \{P_\mathcal{X} h \mid P_\mathcal{X} \in \mathcal{P}_\mathcal{X}, h \in \mathcal{H}\}$. In this work every joint distribution $P$ can be written as $P_\mathcal{X} h$ for some $h$, i.e., $P_{\mathcal{Y}|\mathcal{X}}$ is deterministic. For a set $S \subseteq \mathcal{X}$ and a 1-1 map $g: \mathcal{X} \to \mathcal{X}$, we define $g(S) = \{g(x) \mid x \in S\}$. We use $\circ$ to denote function composition: $(f \circ g)(x)$ is defined as $f(g(x))$, and for function classes $\mathcal{F}, \mathcal{G}$, $\mathcal{F} \circ \mathcal{G} = \{f \circ g \mid f \in \mathcal{F}, g \in \mathcal{G}\}$. For any distribution $P_\mathcal{X}$ supported on $\mathcal{X}$, we define $P_\mathcal{X} \circ g$ as the distribution such that $(P_\mathcal{X} \circ g)(S) = P_\mathcal{X}(g(S))$. In other words, $X \sim P_\mathcal{X} \iff g^{-1}(X) \sim P_\mathcal{X} \circ g$, because for all $S \subseteq \mathcal{X}$, $\Pr_{X\sim P_\mathcal{X}}[g^{-1}(X) \in S] = \Pr_{X\sim P_\mathcal{X}}[X \in g(S)] = [P_\mathcal{X} \circ g](S)$.
Algorithm 1 Iterative algorithm $A$
Require: Initial parameter distribution $P_{\mathrm{init}}$ supported on $\mathcal{W} = \mathbb{R}^m$, total iterations $T$, training dataset $\{(x_i, y_i)\}_{i=1}^n$, parametric model $\mathcal{M}: \mathcal{W} \to \mathcal{H}$, iterative update rule $F(W, \mathcal{M}, \{(x_i, y_i)\}_{i=1}^n)$.
Ensure: Hypothesis $h: \mathcal{X} \to \mathcal{Y}$.
Sample $W^{(0)} \sim P_{\mathrm{init}}$.
for $t = 0$ to $T-1$ do $W^{(t+1)} = F(W^{(t)}, \mathcal{M}, \{(x_i, y_i)\}_{i=1}^n)$.
return $h = \mathrm{sign}(\mathcal{M}[W^{(T)}])$.

For any joint distribution $P$ of the form $P = P_\mathcal{X} h$, we define $P \circ g = (P_\mathcal{X} \circ g)(h \circ g)$. In other words, $(X, Y) \sim P \iff (g^{-1}(X), Y) \sim P \circ g$. For any distribution class $\mathcal{P}$ and group $G$ acting on $\mathcal{X}$, we define $\mathcal{P} \circ G$ as $\{P \circ g \mid P \in \mathcal{P}, g \in G\}$.

Definition 3.1. A deterministic supervised learning algorithm $A$ is a mapping from a sequence of training data, $\{(x_i, y_i)\}_{i=1}^n \in (\mathcal{X} \times \mathcal{Y})^n$, to a hypothesis $A(\{(x_i, y_i)\}_{i=1}^n) \in \mathcal{H} \subseteq \mathcal{Y}^\mathcal{X}$. The algorithm $A$ could also be randomized, in which case the output $A(\{(x_i, y_i)\}_{i=1}^n)$ is a distribution on hypotheses. Two randomized algorithms $A$ and $A'$ are the same if for any input their outputs have the same distribution in function space, which is denoted by $A(\{(x_i, y_i)\}_{i=1}^n) \overset{d}{=} A'(\{(x_i, y_i)\}_{i=1}^n)$.

Definition 3.2 (Equivariant Algorithms). A learning algorithm $A$ is equivariant under group $G_\mathcal{X}$ (or $G_\mathcal{X}$-equivariant) if and only if for any dataset $\{(x_i, y_i)\}_{i=1}^n \in (\mathcal{X} \times \mathcal{Y})^n$ and for all $g \in G_\mathcal{X}$, $x \in \mathcal{X}$, $A(\{(g(x_i), y_i)\}_{i=1}^n) \circ g = A(\{(x_i, y_i)\}_{i=1}^n)$, or equivalently $[A(\{(g(x_i), y_i)\}_{i=1}^n)](g(x)) = [A(\{(x_i, y_i)\}_{i=1}^n)](x)$.

Definition 3.3 (Sample Complexity). Given a problem $\mathcal{P}$, a randomized learning algorithm $A$, and $\delta, \varepsilon \in [0, 1]$, we define the $(\varepsilon, \delta)$-sample complexity, denoted $N(A, \mathcal{P}, \varepsilon, \delta)$, as the smallest number $n \in \mathbb{N}$ such that for all $P \in \mathcal{P}$, with probability $1 - \delta$ over the randomness of $\{(x_i, y_i)\}_{i=1}^n$, $\mathrm{err}_P(A(\{(x_i, y_i)\}_{i=1}^n)) \leq \varepsilon$. We also define the $\varepsilon$-expected sample complexity for a problem $\mathcal{P}$, denoted $N^*(A, \mathcal{P}, \varepsilon)$, as the smallest number $n \in \mathbb{N}$ such that for all $P \in \mathcal{P}$, $\mathbb{E}_{(x_i,y_i)\sim P}[\mathrm{err}_P(A(\{(x_i, y_i)\}_{i=1}^n))] \leq \varepsilon$.
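Algorithm 1 can be transcribed almost literally. In the sketch below, the linear-regression instantiation (data, step size, iteration count) is purely illustrative and not from the paper:

```python
import numpy as np

def iterative_algorithm(P_init, T, data, model, F, rng):
    """Generic form of Algorithm 1: sample W^(0) ~ P_init, apply the update
    rule F for T iterations, return the sign of the final model."""
    W = P_init(rng)                        # W^(0) ~ P_init
    for _ in range(T):                     # W^(t+1) = F(W^(t), M, {(x_i, y_i)})
        W = F(W, model, data)
    return lambda x: np.sign(model(W)(x))  # h = sign(M[W^(T)])

# Illustrative instantiation: linear model trained by GD on squared loss.
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])   # rows are data points
y = np.array([1.0, -1.0, 0.5])
model = lambda W: (lambda x: x @ W)
F = lambda W, M, d: W - 0.1 * d[0].T @ (d[0] @ W - d[1])
h = iterative_algorithm(lambda r: r.standard_normal(2), 500,
                        (X, y), model, F, np.random.default_rng(0))
```

Any of the algorithms in Table 1 fits this shape by swapping in a different update rule `F` and initialization `P_init`.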
By definition, we have $N^*(A, \mathcal{P}, \varepsilon + \delta) \leq N(A, \mathcal{P}, \varepsilon, \delta) \leq N^*(A, \mathcal{P}, \varepsilon\delta)$ for all $\varepsilon, \delta \in [0, 1]$.

3.1. PARAMETRIC MODELS AND ITERATIVE ALGORITHMS

A parametric model $\mathcal{M}: \mathcal{W} \to \mathcal{H}$ is a functional mapping from weights $W$ to a hypothesis $\mathcal{M}[W]: \mathcal{X} \to \mathcal{Y}$. Given a specific parametric model $\mathcal{M}$, a general iterative algorithm is defined as Algorithm 1. In this work, we will only use the two parametric models below, FC-NN and CNN. FC Nets: An $L$-layer fully-connected neural network parameterized by its weights $\mathbf{W} = (W_1, W_2, \ldots, W_L)$, where $W_i \in \mathbb{R}^{d_i \times d_{i-1}}$, $d_0 = d$, and $d_L = 1$, is a function $\mathrm{FC\text{-}NN}[\cdot]: \mathbb{R}^d \to \mathbb{R}$ given by $\mathrm{FC\text{-}NN}[\mathbf{W}](x) = W_L \sigma(W_{L-1} \cdots \sigma(W_2 \sigma(W_1 x)))$. Here, $\sigma: \mathbb{R} \to \mathbb{R}$ can be any function, and we abuse notation so that $\sigma$ is also defined for vector inputs, in the sense that $[\sigma(x)]_i = \sigma(x_i)$.

ConvNets (CNN):

In this paper we will only use two-layer convolutional neural networks with one channel. Suppose $d = d'r$ for some integers $d', r$. A 2-layer CNN parameterized by its weights $\mathbf{W} = (w, a, b) \in \mathbb{R}^k \times \mathbb{R}^r \times \mathbb{R}$ is a function $\mathrm{CNN}[\cdot]: \mathbb{R}^d \to \mathbb{R}$: $\mathrm{CNN}[\mathbf{W}](x) = \sum_{i=1}^{r} a_i\, \sigma([w * x]_{d'(i-1)+1:d'i}) + b$, where $*: \mathbb{R}^k \times \mathbb{R}^d \to \mathbb{R}^d$ is the (circular) convolution operator, defined as $[w * x]_i = \sum_{j=1}^{k} w_j x_{[(i-j-1) \bmod d]+1}$, and $\sigma: \mathbb{R}^{d'} \to \mathbb{R}$ is the composition of pooling and an element-wise non-linearity.
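The definition above can be written out directly. In this sketch, $\sigma$ is taken to be a ReLU followed by max-pooling over each length-$d'$ window; that is one concrete assumed choice, since the text leaves $\sigma$ abstract:

```python
import numpy as np

def cnn2(x, w, a, b, dprime):
    """2-layer, one-channel CNN from the text: circular convolution
    [w * x]_i = sum_j w_j x_{[(i-j-1) mod d]+1} (0-indexed below), then
    sigma on each of the r = d/d' patches, combined with weights a, bias b."""
    d, k = x.shape[0], w.shape[0]
    r = d // dprime
    conv = np.array([sum(w[j] * x[(i - j - 1) % d] for j in range(k))
                     for i in range(d)])
    sigma = lambda patch: np.max(np.maximum(patch, 0.0))  # assumed: ReLU + max-pool
    return b + sum(a[i] * sigma(conv[dprime * i: dprime * (i + 1)])
                   for i in range(r))
```

When all second-layer weights $a_i$ are equal, the output is invariant to cyclic shifts of $x$ by multiples of $d'$: circular convolution is shift-equivariant, so the pooled patches merely permute.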

3.2. EQUIVARIANCE AND TRAINING ALGORITHMS

This section gives an informal sketch of why FC nets trained with standard algorithms have certain equivariance properties. The high-level idea is that if the update rule of the network, or more generally of the parametrized model, exhibits a certain symmetry per step, i.e., property 2 in Theorem C.1, then by induction the symmetry holds through the last iteration. Taking linear regression as an example, let $x_i \in \mathbb{R}^d$, $i \in [n]$ be the data and $y \in \mathbb{R}^n$ the labels; the GD update for $L(w) = \frac{1}{2}\sum_{i=1}^n (x_i^\top w - y_i)^2 = \frac{1}{2}\|X^\top w - y\|^2$ is $w_{t+1} = F(w_t, X, y) := w_t - \eta X(X^\top w_t - y)$. Now suppose another person tries to solve the same problem using GD with the same initial linear function, but observes everything in a different basis, i.e., $\widetilde{X} = UX$ and $\widetilde{w}_0 = U w_0$ for some orthogonal matrix $U$. Not surprisingly, they would get the same solution from GD, just in a different basis. Mathematically, this is because $\widetilde{w}_t = U w_t \implies \widetilde{w}_{t+1} = F(\widetilde{w}_t, UX, y) = U F(w_t, X, y) = U w_{t+1}$. In other words, they would make the same prediction for unseen data. Thus if the initial distribution of $w_0$ is the same in all bases (i.e., under rotations), e.g., Gaussian $N(0, I_d)$, then $w_0 \overset{d}{=} U w_0 \implies F^t(w_0, UX, y) \overset{d}{=} U F^t(w_0, X, y)$ for any iteration $t$, which means GD for linear regression is orthogonal equivariant. To show orthogonal equivariance for gradient descent on general deep FC nets, it suffices to apply the above argument to each neuron in the first layer of the FC net. Equivariance for other training algorithms (see Table 1) can be derived in exactly the same way. The rigorous statements and proofs are deferred to Appendix C.
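The equivariance of the linear-regression GD run can be checked numerically. The sketch below (dimensions, step size, and seed are arbitrary) verifies that the rotated run tracks the original exactly, $\widetilde{w}_t = U w_t$:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, eta, T = 5, 8, 0.01, 100
X = rng.standard_normal((d, n))      # columns are the data points x_i
y = rng.standard_normal(n)

def gd(w, X, y, steps, eta):
    # w_{t+1} = w_t - eta * X (X^T w_t - y): GD on 0.5 * ||X^T w - y||^2
    for _ in range(steps):
        w = w - eta * X @ (X.T @ w - y)
    return w

U, _ = np.linalg.qr(rng.standard_normal((d, d)))   # a random orthogonal matrix
w0 = rng.standard_normal(d)
w_T = gd(w0, X, y, T, eta)
w_T_rot = gd(U @ w0, U @ X, y, T, eta)             # the same run in a rotated basis
assert np.allclose(w_T_rot, U @ w_T)               # tilde-w_T = U w_T
```

Consequently, predictions agree on rotated test points: $(Uw_T)^\top(Ux) = w_T^\top x$.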

4.1. EXAMPLE 1: Ω(d) LOWER BOUND AGAINST ORTHOGONAL EQUIVARIANT METHODS

We start with a simple but insightful example of how equivariance alone can suffice for some non-trivial lower bounds. We consider a task on $\mathbb{R}^d \times \{\pm 1\}$ given by the uniform distribution on the set $\{(e_i y, y) \mid i \in \{1, 2, \ldots, d\}, y = \pm 1\}$, denoted by $P$. Each sample from $P$ is a one-hot vector in $\mathbb{R}^d$, and the sign of the non-zero coordinate determines its label. Now imagine our goal is to learn this task using an algorithm $A$. After observing a training set of $n$ labeled points $S := \{(x_i, y_i)\}_{i=1}^n$, the algorithm is asked to make a prediction on unseen test data $x$, i.e., $A(S)(x)$. Here we are concerned with orthogonal equivariant algorithms: the prediction of the algorithm on the test point remains the same even if we rotate every $x_i$ and the test point $x$ by any orthogonal matrix $R$, i.e., $A(\{(Rx_i, y_i)\}_{i=1}^n)(Rx) \overset{d}{=} A(\{(x_i, y_i)\}_{i=1}^n)(x)$. The main idea here is that, for a fixed training set $S$, the prediction $A(\{(x_i, y_i)\}_{i=1}^n)(x)$ is determined solely by the inner products between $x$ and the $x_i$'s due to orthogonal equivariance, i.e., there exists a random function $f$ (which may depend on $S$) such that $A(\{(x_i, y_i)\}_{i=1}^n)(x) \overset{d}{=} f(x^\top x_1, \ldots, x^\top x_n)$. But the input distribution for this task is supported on one-hot vectors. Suppose $n < d/2$. Then at test time, with probability at least $1/2$, the new data point $(x, y) \sim P$ is such that $x$ has zero inner product with all $n$ points seen in the training set $S$. This fact alone fixes the prediction of $A$ to the value $f(0, \ldots, 0)$, whereas $y$ is independently and randomly chosen to be $\pm 1$. We conclude that $A$ outputs the wrong answer with probability at least $1/4$.
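The probability statement in the last step is easy to check by simulation. In this sketch (sizes arbitrary), the coordinates of the $n$ one-hot training samples are drawn i.i.d. as under $P$, and we estimate how often a fresh one-hot test point is orthogonal to the entire training set:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, trials = 100, 40, 2000       # n < d/2
orthogonal = 0
for _ in range(trials):
    train_coords = rng.integers(d, size=n)   # coordinates of n one-hot samples
    test_coord = rng.integers(d)             # coordinate of a fresh test point
    if test_coord not in train_coords:       # zero inner product with every x_i
        orthogonal += 1
frac = orthogonal / trials
assert frac >= 0.5   # at least 1/2, as claimed; here about (1 - 1/d)^n ~ 0.67
```

On such a test point, an orthogonal-equivariant learner is stuck predicting $f(0,\ldots,0)$ while the label is a fresh coin flip.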

4.2. EXAMPLE 2: Ω(d 2 ) LOWER BOUND IN THE WEAK SENSE

The warm-up example illustrates the main insight of Ng (2004), namely, that when an orthogonal equivariant algorithm is used to learn a certain task, it is actually being forced to simultaneously learn all orthogonal transformations of this task. Intuitively, this should make the learning much more sample-hungry compared to even simple SGD on ConvNets, which is not orthogonal equivariant. Now we sketch why the obvious way to make this intuition precise using VC dimension (Theorem B.1) does not give a proper separation between ConvNets and FC nets, as mentioned in the Introduction. We first fix the ground truth labeling function $h^* = \mathrm{sign}\left(\sum_{i=1}^{d} x_i^2 - \sum_{i=d+1}^{2d} x_i^2\right)$. That an algorithm $A$ is orthogonal equivariant (Definition 3.2) means that for any task $P = P_\mathcal{X} h^*$, where $P_\mathcal{X}$ is the input distribution and $h^*$ is the labeling function, $A$ must have the same performance on $P$ and its rotated version $P \circ U = (P_\mathcal{X} \circ U)(h^* \circ U)$, where $U$ can be any orthogonal matrix. Therefore, if there is an orthogonal equivariant learning algorithm $A$ that learns $h^*$ on all distributions, then $A$ will also learn every rotated copy $h^* \circ U$ of $h^*$ on every distribution $P_\mathcal{X}$, simply because $A$ learns $h^*$ on the distribution $P_\mathcal{X} \circ U^{-1}$. Thus $A$ learns the class of labeling functions $h^* \circ O(d) := \{h(x) = h^*(Ux) \mid U \in O(d)\}$. Formally, we have the following theorem, whose proof is deferred to Appendix D.2:

Theorem 4.1 (All distributions, single hypothesis). Let $\mathcal{P} = \{\text{all distributions}\}\{h^*\}$. For any orthogonal equivariant algorithm $A$, $N(A, \mathcal{P}, \varepsilon, \delta) = \Omega((d^2 + \ln\frac{1}{\delta})/\varepsilon)$, while there is a 2-layer ConvNet architecture such that $N(\mathrm{ERM}_{\mathrm{CNN}}, \mathcal{P}, \varepsilon, \delta) = O\left(\frac{1}{\varepsilon}\left(\log\frac{1}{\varepsilon} + \log\frac{1}{\delta}\right)\right)$.

As noted in the introduction, this does not imply there is some task hard for every training algorithm for the FC net.
The VC-dimension-based lower bound implies, for each algorithm $A$, the existence of a fixed distribution $P_\mathcal{X} \in \mathcal{P}_\mathcal{X}$ and some orthogonal matrix $U_A$ such that the task $(P_\mathcal{X} \circ U_A^{-1})h^*$ is hard for it. However, this does not preclude $(P_\mathcal{X} \circ U_A^{-1})h^*$ being easy for some other algorithm $A'$.

4.3. PROOF OVERVIEW FOR FIXED DISTRIBUTION LOWER BOUNDS

At first sight, the issue highlighted above (and in the Introduction) seems difficult to get around. One possible avenue would be if the hard input distribution $P_\mathcal{X}$ in the task were invariant under all orthogonal transformations, i.e., $P_\mathcal{X} = P_\mathcal{X} \circ U$ for all orthogonal matrices $U$. Unfortunately, the distribution constructed in the proof of the VC-dimension lower bound is inherently discrete and cannot be made invariant to orthogonal transformations. Our proof uses a fixed $P_\mathcal{X}$, the standard Gaussian distribution, which is indeed invariant under orthogonal transformations. The proof also uses Benedek-Itai's lower bound, Theorem 4.2, and the main technical part of our proof is the lower bound on the packing number $D(\mathcal{H}, \rho, \varepsilon)$ defined below (also see Equation (2)). For a function class $\mathcal{H}$, we use $\Pi_\mathcal{H}(n)$ to denote the growth function of $\mathcal{H}$, i.e., $\Pi_\mathcal{H}(n) := \sup_{x_1,\ldots,x_n\in\mathcal{X}} |\{(h(x_1), h(x_2), \ldots, h(x_n)) \mid h \in \mathcal{H}\}|$. Denoting the VC dimension of $\mathcal{H}$ by $\mathrm{VCdim}(\mathcal{H})$, by the Sauer-Shelah lemma we know $\Pi_\mathcal{H}(n) \leq \left(\frac{en}{\mathrm{VCdim}(\mathcal{H})}\right)^{\mathrm{VCdim}(\mathcal{H})}$ for $n \geq \mathrm{VCdim}(\mathcal{H})$. Let $\rho$ be a metric on $\mathcal{H}$. We define $N(\mathcal{H}, \rho, \varepsilon)$ as the $\varepsilon$-covering number of $\mathcal{H}$ w.r.t. $\rho$, and $D(\mathcal{H}, \rho, \varepsilon)$ as the $\varepsilon$-packing number of $\mathcal{H}$ w.r.t. $\rho$. For a distribution $P_\mathcal{X}$, we use $\rho_\mathcal{X}(h, h') := \Pr_{X\sim P_\mathcal{X}}[h(X) \neq h'(X)]$ to denote the discrepancy between hypotheses $h$ and $h'$ w.r.t. $P_\mathcal{X}$.

Theorem 4.2 (Benedek-Itai's lower bound). For any algorithm $A$ that $(\varepsilon, \delta)$-learns $\mathcal{H}$ with $n$ i.i.d. samples from a fixed distribution $P_\mathcal{X}$, it must hold that
$\Pi_\mathcal{H}(n) \geq (1 - \delta)D(\mathcal{H}, \rho_\mathcal{X}, 2\varepsilon)$. (1)
Since $\Pi_\mathcal{H}(n) \leq 2^n$, we have $N(A, P_\mathcal{X}\mathcal{H}, \varepsilon, \delta) \geq \log_2 D(\mathcal{H}, \rho_\mathcal{X}, 2\varepsilon) + \log_2(1 - \delta)$, which is the original bound from Benedek & Itai (1991). Later, Long (1995) improved this bound for the regime $n \geq \mathrm{VCdim}(\mathcal{H})$ using the Sauer-Shelah lemma, i.e.,
$N(A, P_\mathcal{X}\mathcal{H}, \varepsilon, \delta) \geq \frac{\mathrm{VCdim}(\mathcal{H})}{e}\left((1 - \delta)D(\mathcal{H}, \rho_\mathcal{X}, 2\varepsilon)\right)^{\frac{1}{\mathrm{VCdim}(\mathcal{H})}}$. (2)

Intuition behind Benedek-Itai's lower bound. We first fix the data distribution as $P_\mathcal{X}$. Suppose the $2\varepsilon$-packing is $\{h_1, \ldots, h_{D(\mathcal{H}, \rho_\mathcal{X}, 2\varepsilon)}\}$ and the ground truth is chosen from this $2\varepsilon$-packing. $(\varepsilon, \delta)$-learning the hypothesis class $\mathcal{H}$ then means the algorithm is able to recover the index of the ground truth with probability $1 - \delta$. Thus one can think of this learning process as a noisy channel which delivers $\log_2 D(\mathcal{H}, \rho_\mathcal{X}, 2\varepsilon)$ bits of information. Since the data distribution is fixed, the unlabeled data are independent of the ground truth, and the only information source is the labels. With some information-theoretic inequalities, we can show that the number of labels, i.e., samples, must satisfy $N(A, P_\mathcal{X}\mathcal{H}, \varepsilon, \delta) \geq \log_2 D(\mathcal{H}, \rho_\mathcal{X}, 2\varepsilon) + \log_2(1 - \delta)$. A closer look yields Equation (2), because when $\mathrm{VCdim}(\mathcal{H}) < \infty$, only $\log_2 \Pi_\mathcal{H}(n)$ rather than $n$ bits of information can be delivered.

5. LOWER BOUNDS

Below we first present a reduction from a special subclass of PAC learning to equivariant learning (Theorem 5.1), based on which we prove our main separation results, Theorems 4.1, 5.2, 5.3 and 5.4.

Theorem 5.1. If $\mathcal{P}_\mathcal{X}$ is a set of data distributions that is invariant under group $G_\mathcal{X}$, i.e., $\mathcal{P}_\mathcal{X} \circ G_\mathcal{X} = \mathcal{P}_\mathcal{X}$, then the following inequality holds (and it becomes an equality when $G_\mathcal{X}$ is a compact group):
$\inf_{A \in \mathcal{A}_{G_\mathcal{X}}} N^*(A, \mathcal{P}_\mathcal{X}\mathcal{H}, \varepsilon) \geq \inf_{A \in \mathcal{A}} N^*(A, \mathcal{P}_\mathcal{X}(\mathcal{H} \circ G_\mathcal{X}), \varepsilon)$

Remark 5.1. The sample complexity in standard PAC learning is usually defined against the hypothesis class $\mathcal{H}$ only, i.e., $\mathcal{P}_\mathcal{X}$ is the set of all possible input distributions. In that case, $\mathcal{P}_\mathcal{X}$ is always invariant under the group $G_\mathcal{X}$, and thus Theorem 5.1 says that $G_\mathcal{X}$-equivariant learning against hypothesis class $\mathcal{H}$ is as hard as learning against the hypothesis class $\mathcal{H} \circ G_\mathcal{X}$ without the equivariance constraint.

5.1. Ω(d 2 ) LOWER BOUND FOR ORTHOGONAL EQUIVARIANCE WITH A FIXED DISTRIBUTION

In this subsection we show the $\Omega(d^2)$ vs. $O(1)$ separation on a single task in our main theorem (Theorem 5.2). With the same proof technique, we further show that we can get the correct dependency on $\varepsilon$ for the lower bound, i.e., $\Omega(\frac{d^2}{\varepsilon})$, by considering a slightly larger function class, which can be learnt by ConvNets with $O(d)$ samples. We also generalize this $\Omega(d^2)$ vs. $O(d)$ separation to the case of $\ell_2$ regression with a different proof technique.

Theorem 5.2. There is a single task, $P_\mathcal{X} h^*$, where $h^* = \mathrm{sign}\left(\sum_{i=1}^{d} x_i^2 - \sum_{i=d+1}^{2d} x_i^2\right)$ and $P_\mathcal{X} = N(0, I_{2d})$, and a constant $\varepsilon_0 > 0$ independent of $d$, such that for any orthogonal equivariant algorithm $A$, we have $N^*(A, P_\mathcal{X} h^*, \varepsilon_0) = \Omega(d^2)$, while there is a 2-layer ConvNet such that $N(\mathrm{ERM}_{\mathrm{CNN}}, P_\mathcal{X} h^*, \varepsilon, \delta) = O\left(\frac{1}{\varepsilon}\left(\log\frac{1}{\varepsilon} + \log\frac{1}{\delta}\right)\right)$. Moreover, $\mathrm{ERM}_{\mathrm{CNN}}$ can be realized by gradient descent (on the second layer only).

By Benedek-Itai's lower bound (Benedek & Itai, 1991) (Equation (1)), we know $N(A, \mathcal{P}, \varepsilon_0, \delta) \geq \log_2((1 - \delta)D(\mathcal{H}, \rho_\mathcal{X}, 2\varepsilon_0))$. By Lemma D.4, there is some constant $C$ such that $D(\mathcal{H}, \rho_\mathcal{X}, \varepsilon) \geq (\frac{C}{\varepsilon})^{\frac{d(d-1)}{2}}$ for all $\varepsilon > 0$.

Published as a conference paper at ICLR 2021

The high-level idea for Lemma D.4 is to first show that $\rho_\mathcal{X}(h_U, h_V) \geq \Omega(\frac{\|U - V\|_F}{\sqrt{d}})$, and then to show that the packing number of orthogonal matrices in a small neighborhood of $I_d$ w.r.t. $\frac{\|\cdot\|_F}{\sqrt{d}}$ is roughly the same as that in the tangent space of the orthogonal manifold at $I_d$, i.e., the set of skew-symmetric matrices, which is of dimension $\frac{d(d-1)}{2}$ and has packing number $(\frac{C}{\varepsilon})^{\frac{d(d-1)}{2}}$. The advantage of working in the tangent space is that we can apply the standard volume argument. Setting $\delta = \frac{1}{2}$, we have $N^*(A, \mathcal{P}, \varepsilon_0) \geq N(A, \mathcal{P}, 2\varepsilon_0, \frac{1}{2}) \geq \frac{d(d-1)}{2}\log_2\frac{C}{4\varepsilon_0} - 1 = \Omega(d^2)$. Indeed, we can improve the above lower bound by applying Equation (2) and get $N(A, \mathcal{P}, \varepsilon, \frac{1}{2}) \geq \frac{d^2}{e}\left(\frac{1}{2}\right)^{\frac{1}{d^2}}\left(\frac{C}{\varepsilon}\right)^{\frac{1}{2} - \frac{1}{2d}} = \Omega\left(d^2 \varepsilon^{-\frac{1}{2} + \frac{1}{2d}}\right)$.
Note that the dependency on $\varepsilon$ above, $\varepsilon^{-\frac{1}{2}+\frac{1}{2d}}$, is not optimal, as opposed to the $\varepsilon^{-1}$ appearing in upper bounds and other lower bounds. A possible reason is that Theorem 4.2 (Long's improved version) is still not tight, and tightening it might require a sharper probabilistic upper bound for the growth function $\Pi_\mathcal{H}(n)$, at least one taking $P_\mathcal{X}$ into consideration, as opposed to the current upper bound using VC dimension only. We leave it as an open problem to exhibit a single task $P$ with $\Omega(\frac{d^2}{\varepsilon})$ sample complexity for all orthogonal equivariant algorithms. However, if the hypothesis class is of VC dimension $O(d)$, using a similar idea we can prove an $\Omega(d^2/\varepsilon)$ sample complexity lower bound for equivariant algorithms, and an $O(d)$ upper bound for ConvNets.

Theorem 5.3 (Single distribution, multiple functions). There is a problem with a single input distribution, $\mathcal{P} = \{P_\mathcal{X}\}\mathcal{H} = \{N(0, I_d)\}\{\mathrm{sign}(\sum_{i=1}^{d} \alpha_i x_i^2) \mid \alpha_i \in \mathbb{R}\}$, such that for any orthogonal equivariant algorithm $A$ and $\varepsilon > 0$, $N^*(A, \mathcal{P}, \varepsilon) = \Omega(d^2/\varepsilon)$, while there is a 2-layer ConvNet architecture such that $N(\mathrm{ERM}_{\mathrm{CNN}}, \mathcal{P}, \varepsilon, \delta) = O\left(\frac{d\log\frac{1}{\varepsilon} + \log\frac{1}{\delta}}{\varepsilon}\right)$.

Interestingly, we can show an analog of Theorem 5.3 for $\ell_2$ regression, i.e., when the algorithm observes not only the signs but also the values of the labels $y_i$. Here we define the $\ell_2$ loss of a function $h: \mathbb{R}^d \to \mathbb{R}$ as $\ell_P(h) = \mathbb{E}_{(x,y)\sim P}(h(x) - y)^2$, and the sample complexity $N^*(A, \mathcal{P}, \varepsilon)$ for $\ell_2$ loss similarly as the smallest number $n \in \mathbb{N}$ such that for all $P \in \mathcal{P}$, $\mathbb{E}_{(x_i,y_i)\sim P}[\ell_P(A(\{(x_i, y_i)\}_{i=1}^n))] \leq \varepsilon\, \mathbb{E}_{(x,y)\sim P}[y^2]$. The factor $\mathbb{E}_{(x,y)\sim P}[y^2]$ is added for normalization to avoid scaling issues; with it, any $\varepsilon \geq 1$ can be achieved trivially by predicting 0 for all data.

Theorem 5.4 (Single distribution, multiple functions, $\ell_2$ regression). There is a problem with a single input distribution, $\mathcal{P} = \{P_\mathcal{X}\}\mathcal{H} = \{N(0, I_d)\}\{\sum_{i=1}^{d} \alpha_i x_i^2 \mid \alpha_i \in \mathbb{R}\}$, such that for any orthogonal equivariant algorithm $A$ and $\varepsilon > 0$, $N^*(A, \mathcal{P}, \varepsilon) \geq \frac{d(d+3)}{2}(1 - \varepsilon) - 1$, while there is a 2-layer ConvNet architecture such that $N^*(\mathrm{ERM}_{\mathrm{CNN}}, \mathcal{P}, \varepsilon) \leq d$ for any $\varepsilon > 0$.

5.2. Ω(d) LOWER BOUND FOR PERMUTATION EQUIVARIANCE

In this subsection we present an $\Omega(d)$ lower bound for permutation equivariance via a different proof technique: direct coupling. The high-level idea of direct coupling is to show that, with constant probability over $(X_n, x)$, we can find a $g \in G_\mathcal{X}$ such that $g(X_n) = X_n$ but $x$ and $g(x)$ have different labels, in which case no equivariant algorithm can make the correct prediction.

Theorem 5.5. Let $t_i = e_i + e_{i+1}$ and $s_i = e_i + e_{i+2}$, and let $P$ be the uniform distribution on $\{(s_i, 1)\}_i \cup \{(t_i, -1)\}_i$, which is the classification problem for local textures in a 1-dimensional image with $d$ pixels. Then for any permutation equivariant algorithm $A$, $N(A, P, \frac{1}{8}, \frac{1}{8}) \geq N^*(A, P, \frac{1}{4}) \geq \frac{d}{10}$. Meanwhile, $N(\mathrm{ERM}_{\mathrm{CNN}}, P, 0, \delta) \leq \log_2\frac{1}{\delta} + 2$, where $\mathrm{ERM}_{\mathrm{CNN}}$ stands for empirical risk minimization over the function class of 2-layer ConvNets.

Remark 5.2. The task can be understood as detecting whether there are two consecutive white pixels against a black background. For proof simplicity, we take a texture of length 2 as an illustrative example. It is straightforward to extend the same proof to more sophisticated local pattern detection problems of any constant length, and to 2-dimensional images.
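The direct-coupling step can be made concrete with a tiny example (0-indexed, sizes arbitrary): a transposition of two pixels untouched by the training set fixes every training point, yet turns a $t_i$ into an $s_i$ and thereby flips its label, so no permutation-equivariant algorithm can predict that point correctly.

```python
import numpy as np

d = 12
e = np.eye(d)
s = lambda i: e[i] + e[i + 2]   # label +1: pixels two apart
t = lambda i: e[i] + e[i + 1]   # label -1: two consecutive pixels

# A small training set touching only coordinates 0..5
X_train = np.stack([t(0), s(1), t(3)])

# Transposition g swapping coordinates 9 and 10 (untouched by training data)
perm = np.arange(d)
perm[[9, 10]] = perm[[10, 9]]
g = lambda x: x[perm]

# g fixes every training point ...
assert all(np.array_equal(g(x), x) for x in X_train)
# ... but maps t_8 = e_8 + e_9 to e_8 + e_10 = s_8, flipping the label.
assert np.array_equal(g(t(8)), s(8))
```

By equivariance, the algorithm must give the same answer on $t_8$ and $g(t_8) = s_8$, so it errs on one of them.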

6. CONCLUSION

We rigorously justify the common intuition that ConvNets can have better inductive bias than FC nets, by constructing a single natural distribution on which any FC net requires $\Omega(d^2)$ samples to generalize if trained with most gradient-based methods starting from Gaussian initialization. On the same task, $O(1)$ samples suffice for convolutional architectures. We further extend our results to permutation equivariant algorithms, including adaptive training algorithms like Adam and AdaGrad, $\ell_1$ regularization, etc.

A SOME BASIC INEQUALITIES

Lemma A.1. For all $x \in [-1, 1]$, $\frac{\arccos x}{\sqrt{1-x}} \geq \sqrt{2}$.

Proof. Let $x = \cos t$, $t \in [0, \pi]$. We have $\frac{\arccos(x)}{\sqrt{1-x}} = \frac{t}{\sqrt{1-\cos t}} = \frac{t}{\sqrt{2}\sin(t/2)} \geq \sqrt{2}$.

Lemma A.2. There exists $C > 0$ such that for all $d \in \mathbb{N}_+$ and $M \in \mathbb{R}^{d\times d}$, $C\|M\|_F/\sqrt{d} \leq \mathbb{E}_{x\sim S^{d-1}}[\|Mx\|_2] \leq \|M\|_F/\sqrt{d}$.

Proof of Lemma A.2. Upper bound: By the Cauchy-Schwarz inequality, we have $\mathbb{E}_{x\sim S^{d-1}}[\|Mx\|_2] \leq \sqrt{\mathbb{E}_{x\sim S^{d-1}}\|Mx\|_2^2} = \sqrt{\mathrm{tr}\left(M\,\mathbb{E}_{x\sim S^{d-1}}[xx^\top]\,M^\top\right)} = \sqrt{\frac{\mathrm{tr}[MM^\top]}{d}} = \frac{\|M\|_F}{\sqrt{d}}$.
Lower bound: Let $M = U\Sigma V^\top$ be the singular value decomposition of $M$, where $U, V$ are orthogonal matrices and $\Sigma$ is diagonal. Since $\|M\|_F = \|\Sigma\|_F$ and $\mathbb{E}_{x\sim S^{d-1}}[\|Mx\|_2] = \mathbb{E}_{x\sim S^{d-1}}[\|\Sigma x\|_2]$, w.l.o.g. we only need to prove the lower bound for diagonal matrices. By Proposition 2.5.1 in Talagrand (2014), there is some constant $C$ such that $C\|\Sigma\|_F = C\sqrt{\sum_{i=1}^d \sigma_i^2} \leq \mathbb{E}_{x\sim N(0,I_d)}\left[\sqrt{\sum_{i=1}^d x_i^2\sigma_i^2}\right] = \mathbb{E}_{x\sim N(0,I_d)}[\|Mx\|_2]$. By the Cauchy-Schwarz inequality, we have $\mathbb{E}_{x\sim N(0,I_d)}[\|x\|_2] \leq \sqrt{\mathbb{E}_{x\sim N(0,I_d)}\|x\|_2^2} = \sqrt{d}$. Therefore, we have $C\|\Sigma\|_F \leq \mathbb{E}_{x\sim N(0,I_d)}[\|Mx\|_2] = \mathbb{E}_{x\sim S^{d-1}}[\|Mx\|_2]\,\mathbb{E}_{x\sim N(0,I_d)}[\|x\|_2] \leq \mathbb{E}_{x\sim S^{d-1}}[\|Mx\|_2]\sqrt{d}$, which completes the proof.

Lemma A.3. For any $z > 0$, we have $\Pr_{x\sim N(0,\sigma^2)}(|x| \leq z) \leq \sqrt{\frac{2}{\pi}}\frac{z}{\sigma}$.

Proof. $\Pr_{x\sim N(0,\sigma^2)}(|x| \leq z) = \int_{-z}^{z}\frac{1}{\sqrt{2\pi}\sigma}\exp\left(-\frac{x^2}{2\sigma^2}\right)dx \leq \sqrt{\frac{2}{\pi}}\frac{z}{\sigma}$.

B UPPER AND LOWER BOUNDS

Theorem B.1. Suppose $A$ always outputs a hypothesis consistent with its training data, i.e., $A(\{(x_i, y_i)\}_{i=1}^n)(x_i) = y_i$ for all $i \in [n]$. Then for any distribution $P_\mathcal{X}$ and $0 < \varepsilon, \delta < 1$, we have $N(A, P_\mathcal{X}\mathcal{H}, \varepsilon, \delta) = O\left(\frac{\mathrm{VCdim}(\mathcal{H})\ln\frac{1}{\varepsilon} + \ln\frac{1}{\delta}}{\varepsilon}\right)$. Meanwhile, there is a distribution $P_\mathcal{X}$ supported on any subset $\{x_0, \ldots, x_{d-1}\}$ which can be shattered by $\mathcal{H}$, such that for any $0 < \varepsilon, \delta < 1$ and any algorithm $A$, it holds that $N(A, P_\mathcal{X}\mathcal{H}, \varepsilon, \delta) = \Omega\left(\frac{\mathrm{VCdim}(\mathcal{H}) + \ln\frac{1}{\delta}}{\varepsilon}\right)$.

C EQUIVARIANCE IN ALGORITHMS

In this section, we give sufficient conditions for an iterative algorithm (as defined in Algorithm 1) to be equivariant.

Theorem C.1. Suppose $G_\mathcal{X}$ is a group acting on $\mathcal{X} = \mathbb{R}^d$. The iterative algorithm $A$ (as defined in Algorithm 1) is $G_\mathcal{X}$-equivariant if the following conditions are met:
1. There is a group $G_\mathcal{W}$ acting on $\mathcal{W}$ and a group isomorphism $\tau: G_\mathcal{X} \to G_\mathcal{W}$ such that $\mathcal{M}[\tau(g)(W)](g(x)) = \mathcal{M}[W](x)$ for all $x \in \mathcal{X}$, $W \in \mathcal{W}$, $g \in G_\mathcal{X}$. (One can think of $g$ as the rotation $U$ applied to the data $x$ in linear regression, and of $\tau(U)$ as the rotation $U$ applied to $w$.)
2. The update rule $F$ is invariant under any joint group action $(g, \tau(g))$, $g \in G_\mathcal{X}$. In other words, $[\tau(g)](F(W, \mathcal{M}, \{(x_i, y_i)\}_{i=1}^n)) = F([\tau(g)](W), \mathcal{M}, \{(g(x_i), y_i)\}_{i=1}^n)$.

3. The initialization distribution P_init is invariant under the group G_W, i.e., P_init = P_init ∘ g^{−1} for all g ∈ G_W.

We emphasize that the three conditions in Theorem C.1 are natural and almost necessary: Condition 1 is the minimal expressiveness requirement on the model M to allow equivariance, Condition 3 ensures equivariance at initialization, and Condition 2 carries equivariance through the induction.

Proof of Theorem C.1. Fix g ∈ G_X. Sample W^(0) ∼ P_init and set W̃^(0) = τ(g)(W^(0)). By Condition 3, W̃^(0) =_d W^(0) ∼ P_init. Let W^(t+1) = F(W^(t), M, {x_i, y_i}_{i=1}^n) and W̃^(t+1) = F(W̃^(t), M, {g(x_i), y_i}_{i=1}^n) for 0 ≤ t ≤ T−1; by induction using Condition 2, W̃^(t) = τ(g)(W^(t)). By the definition of Algorithm 1, A({x_i, y_i}_{i=1}^n) =_d M[W^(T)] and A({g(x_i), y_i}_{i=1}^n) =_d M[W̃^(T)]. By Condition 1, M[W̃^(T)](g(x)) = M[τ(g)(W^(T))](g(x)) = M[W^(T)](x). Therefore,
A({x_i, y_i}_{i=1}^n) =_d M[W^(T)] = M[W̃^(T)] ∘ g =_d A({g(x_i), y_i}_{i=1}^n) ∘ g,
meaning A is G_X-equivariant. ∎

Remark C.1. Theorem C.1 extends to the stochastic case and to adaptive algorithms that use information from the whole trajectory, i.e., update rules of the form W^(t+1) = F_t({W^(s)}_{s=1}^t, M, {x_i, y_i}_{i=1}^n), as long as (the distribution of) each F_t is invariant under the joint transformations.

Below are two example applications of Theorem C.1; the other results in Table 1 can be obtained in the same way. For classification tasks, optimization algorithms often work with a differentiable surrogate loss ℓ: R → R in place of the 0-1 loss, satisfying ℓ(yh(x)) ≥ 1[yh(x) ≤ 0]. The total loss for hypothesis M(W) on the training set, L(M(W); {x_i, y_i}_{i=1}^n), is defined as Σ_{i=1}^n ℓ(y_i [M(W)](x_i)); we also write L(W) when there is no confusion.

Definition C.1 (Gradient Descent for FC nets).
We call Algorithm 1 Gradient Descent if M = FC-NN and F = GD_L, where GD_L(W) = W − η∇L(W) is the one-step gradient descent update and η > 0 is the learning rate:

Sample W^(0) ∼ P_init.
for t = 0 to T − 1 do
    W^(t+1) = W^(t) − η Σ_{i=1}^n ∇ℓ(FC-NN[W^(t)](x_i), y_i)
return h = sign(FC-NN[W^(T)])

Corollary C.2. Fully-connected networks trained with (stochastic) gradient descent from i.i.d. Gaussian initialization are equivariant under the orthogonal group.

Proof of Corollary C.2. We verify the three conditions of Theorem C.1 one by one; the FC structure is used only for the first condition.

Lemma C.3. There is a subgroup G_W of O(m) and a group isomorphism τ: G_X = O(d) → G_W such that FC-NN[τ(R)(W)] ∘ R = FC-NN[W] for all W ∈ W, R ∈ G_X.

A notable property of gradient descent is that it is invariant under orthogonal re-parametrization. Formally, given a loss function L: R^m → R and parameters W ∈ R^m, an orthogonal re-parametrization of the problem replaces (L, W) by (L ∘ O^{−1}, OW), where O ∈ R^{m×m} is an orthogonal matrix.

Lemma C.4 (Gradient descent is invariant under orthogonal re-parametrization). For any L, W and orthogonal matrix O ∈ R^{m×m}, we have O·GD_L(W) = GD_{L∘O^{−1}}(OW).

Proof of Lemma C.4. By definition, it suffices to show that for each i ∈ [n] and every W, with W' = OW,
O ∇_W ℓ(FC-NN[W](x_i), y_i) = ∇_{W'} ℓ(FC-NN[O^{−1}W'](x_i), y_i),
which holds by the chain rule. ∎

For any R ∈ O(d), set O = τ(R) as given by Lemma C.3; then
[L ∘ O^{−1}](W) = Σ_{i=1}^n ℓ(y_i FC-NN[O^{−1}(W)](x_i)) = Σ_{i=1}^n ℓ(y_i FC-NN[W](Rx_i)).
Plugging this equality into Lemma C.4 yields the second condition of Theorem C.1. The third condition also holds, since the i.i.d. Gaussian initialization distribution is orthogonal invariant. In fact, from the proof, it suffices that the initialization of the first layer be invariant under G_X. ∎

Corollary C.5.
FC nets trained with Newton's method, with zero initialization for the first layer and any initialization for the remaining parameters, are GL(d)-equivariant, i.e., equivariant under the group of invertible linear transformations. Here Newton's method uses the update rule NT_L(W) = W − η(∇²L(W))^{−1}∇L(W), and we assume ∇²L(W) is invertible.

Proof of Corollary C.5. The proof is almost the same as that of Corollary C.2, with the following modifications. Condition 1: Lemma C.3 still holds if we replace O(d), O(m) by GL(d), GL(m) in its statement and proof. Condition 2: by the chain rule, one can verify that the Newton update rule is invariant under invertible linear re-parametrization, i.e., O·NT_L(W) = NT_{L∘O^{−1}}(OW) for every invertible matrix O. Condition 3: since the first layer is initialized to zero, it is invariant under any linear transformation. ∎

Remark C.2. The above results extend easily to momentum and to L_p regularization. For momentum, we only need the update rule W^(t+1) = GDM(W^(t), W^(t−1), M, {x_i, y_i}_{i=1}^n) = (1 + γ)W^(t) − γW^(t−1) − η∇L(W^(t)) to also satisfy the property in Lemma C.4. For L_p regularization, because ‖W‖_p is independent of {x_i, y_i}_{i=1}^n, we only need ‖W‖_p = ‖τ(R)(W)‖_p for all R ∈ G_X, which is easy to check when G_X contains only permutations or sign flips.
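The trajectory-matching argument behind Corollary C.2 (Lemmas C.3 and C.4) can be illustrated numerically: rotating the data while rotating the first layer of the initialization leaves the learned predictor, composed with the rotation, unchanged. The sketch below is our own minimal example (tanh activation, squared loss, hand-written gradients), not the paper's code.

```python
import numpy as np

def train(X, y, W0, a0, lr=0.002, steps=100):
    """Full-batch GD on L = 0.5 * sum_i (a . tanh(W x_i) - y_i)^2 for a 2-layer net."""
    W, a = W0.copy(), a0.copy()
    for _ in range(steps):
        H = np.tanh(X @ W.T)                          # n x m hidden activations
        err = H @ a - y                               # n residuals
        gW = ((err[:, None] * a) * (1 - H**2)).T @ X  # dL/dW via the chain rule
        ga = H.T @ err                                # dL/da
        W -= lr * gW
        a -= lr * ga
    return W, a

rng = np.random.default_rng(0)
d, m, n = 6, 8, 20
X = rng.standard_normal((n, d)); y = rng.standard_normal(n)
W0 = rng.standard_normal((m, d)); a0 = rng.standard_normal(m)
R, _ = np.linalg.qr(rng.standard_normal((d, d)))     # a random rotation of input space

W1, a1 = train(X, y, W0, a0)                         # train on the original data
W2, a2 = train(X @ R.T, y, W0 @ R.T, a0)             # rotated data, first-layer init W0 R^{-1}

x_test = rng.standard_normal(d)
f1 = a1 @ np.tanh(W1 @ x_test)
f2 = a2 @ np.tanh(W2 @ (R @ x_test))
assert abs(f1 - f2) < 1e-6   # the learned predictors agree up to the rotation
```

Here the equivariance is exact (not just in distribution) because we explicitly paired the data rotation R with the first-layer re-parametrization W0 ↦ W0 R^{−1}, as in the proof of Corollary C.2.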

C.1 EXAMPLES OF EQUIVARIANCE FOR NON-ITERATIVE ALGORITHMS

To demonstrate the wide applicability of our lower bounds, we give two more examples of algorithmic equivariance in which the algorithm is not iterative. The proofs are folklore.

Definition C.2. Given a positive semi-definite kernel K, the Kernel Regression algorithm REG_K is defined as
REG_K({x_i, y_i}_{i=1}^n)(x) := 1[K(x, X_n) · K(X_n, X_n)^† y ≥ 0],
where K(X_n, X_n) ∈ R^{n×n} with [K(X_n, X_n)]_{i,j} = K(x_i, x_j), y = [y_1, y_2, …, y_n]^⊤, and K(x, X_n) = [K(x, x_1), …, K(x, x_n)].

Kernel Regression: If the kernel K is G_X-equivariant, i.e., K(g(x), g(y)) = K(x, y) for all g ∈ G_X and x, y ∈ X, then the algorithm REG_K is G_X-equivariant.

ERM: If F = F ∘ G_X and argmin_{h∈F} Σ_{i=1}^n 1[h(x_i) ≠ y_i] is unique, then ERM_F is G_X-equivariant.

D OMITTED PROOFS

D.1 PROOFS OF SAMPLE COMPLEXITY REDUCTION FOR GENERAL EQUIVARIANCE

Given a G_X-equivariant algorithm A, by definition, N*(A, P, ε) = N*(A, P ∘ g^{−1}, ε) for all g ∈ G_X. Consequently,
N*(A, P, ε) = N*(A, P ∘ G_X, ε).   (12)

Lemma D.1. Let A be the set of all algorithms and A_{G_X} the set of all G_X-equivariant algorithms. The following inequality holds, with equality attained when G_X is a compact group:
inf_{A∈A_{G_X}} N*(A, P, ε) ≥ inf_{A∈A} N*(A, P ∘ G_X, ε).   (13)

Proof of Lemma D.1. Taking the infimum over A_{G_X} on both sides of Equation (12) and noting that A_{G_X} ⊆ A, Inequality (13) is immediate. Now suppose the group G_X is compact, and let μ be the Haar measure on it, i.e., μ(S) = μ(g ∘ S) for all S ⊆ G_X, g ∈ G_X. We claim that for each algorithm A, the sample complexity of the following equivariant algorithm A' is no higher than that of A on P ∘ G_X:
A'({x_i, y_i}_{i=1}^n) = A({g(x_i), y_i}_{i=1}^n) ∘ g, where g ∼ μ.
By the definition of the Haar measure, A' is G_X-equivariant.
Moreover, for any fixed n ≥ 0, we have
inf_{P∈P} E_{(x_i,y_i)∼P}[err_P(A'({x_i, y_i}_{i=1}^n))]
= inf_{P∈P} E_{g∼μ} E_{(x_i,y_i)∼P∘g^{−1}}[err_{P∘g^{−1}}(A({x_i, y_i}_{i=1}^n))]
≥ inf_{P∈P} inf_{g∈G_X} E_{(x_i,y_i)∼P∘g^{−1}}[err_{P∘g^{−1}}(A({x_i, y_i}_{i=1}^n))]
= inf_{P∈P∘G_X} E_{(x_i,y_i)∼P}[err_P(A({x_i, y_i}_{i=1}^n))],
which implies inf_{A∈A_{G_X}} N*(A, P, ε) ≤ inf_{A∈A} N*(A, P ∘ G_X, ε). ∎

Proof of Theorem 5.1. Simply note that (P_X ⊗ H) ∘ G_X = ∪_{g∈G_X} (P_X ∘ g) ⊗ (H ∘ g^{−1}) = ∪_{g∈G_X} P_X ⊗ (H ∘ g^{−1}) = P_X ⊗ (H ∘ G_X), where the second equality uses the invariance of P_X; the theorem is then immediate from Lemma D.1. ∎

D.2 PROOF OF THEOREM 4.1

Lemma D.2. Define h_U = sign(x_{1:d}^⊤ U x_{d+1:2d}) for U ∈ R^{d×d}. Then
H = {h_U | U ∈ O(d)} ⊆ sign(Σ_{i=1}^d x_i² − Σ_{i=d+1}^{2d} x_i²) ∘ O(2d).

Proof. Note that
[0 U; U^⊤ 0] = [I_d 0; 0 U^⊤] · [0 I_d; I_d 0] · [I_d 0; 0 U],
[0 I_d; I_d 0] = [ (√2/2)I_d  −(√2/2)I_d ; (√2/2)I_d  (√2/2)I_d ] · [I_d 0; 0 −I_d] · [ (√2/2)I_d  (√2/2)I_d ; −(√2/2)I_d  (√2/2)I_d ],
thus for any U ∈ O(d) and any x ∈ R^{2d},
h_U(x) = sign(x_{1:d}^⊤ U x_{d+1:2d}) = sign(x^⊤ [0 U; U^⊤ 0] x) = sign(g_U(x)^⊤ [I_d 0; 0 −I_d] g_U(x)) ∈ h* ∘ O(2d),
where g_U(x) = [ (√2/2)I_d  (√2/2)I_d ; −(√2/2)I_d  (√2/2)I_d ] · [I_d 0; 0 U] · x is an orthogonal transformation of R^{2d}. ∎

Lemma D.3. Define h_U = sign(x_{1:d}^⊤ U x_{d+1:2d}) for U ∈ R^{d×d}, and H = {h_U | U ∈ O(d)}. Then VCdim(H) ≥ d(d−1)/2.

Proof. We claim that H shatters {e_i + e_{d+j}}_{1≤i<j≤d}, i.e., O(d) shatters {e_i e_j^⊤}_{1≤i<j≤d}: for any sign pattern {σ_ij}_{1≤i<j≤d} there exists U ∈ O(d) with sign(⟨U, e_i e_j^⊤⟩) = σ_ij, which implies VCdim(H) ≥ d(d−1)/2. Let so(d) = {M ∈ R^{d×d} | M^⊤ = −M}; then exp(u) = I_d + u + u²/2 + ⋯ ∈ SO(d) for all u ∈ so(d). Thus, for any sign pattern {σ_ij}_{1≤i<j≤d}, let u = Σ_{1≤i<j≤d} σ_ij (e_i e_j^⊤ − e_j e_i^⊤); as λ → 0⁺,
sign(⟨exp(λu), e_i e_j^⊤⟩) = sign(0 + λσ_ij + O(λ²)) = sign(σ_ij + O(λ)) = σ_ij. ∎

Theorem 4.1 (All distributions, single hypothesis). Let P = {all distributions} ⊗ {h*}.
For any orthogonal equivariant algorithm A, N(A, P, ε, δ) = Ω((d² + ln(1/δ))/ε), while there is a 2-layer ConvNet architecture such that N(ERM_CNN, P, ε, δ) = O((1/ε)(log(1/ε) + log(1/δ))).

Proof of Theorem 4.1. Lower bound: Suppose d = 2d' for some integer d'. We construct P = P_X ⊗ H, where P_X is the set of all possible distributions on X = R^d and H = {sign(Σ_{i=1}^{d'} x_i² − Σ_{i=d'+1}^{2d'} x_i²)}. By Lemma D.2, H' = {sign(x_{1:d'}^⊤ U x_{d'+1:2d'}) | U ∈ O(d')} ⊆ H ∘ O(d). By Theorem 5.1,
inf_{A∈A_{G_X}} N*(A, P_X ⊗ H, ε) ≥ inf_{A∈A} N*(A, P_X ⊗ (H ∘ G_X), ε) ≥ inf_{A∈A} N*(A, P_X ⊗ H', ε).   (15)
By the lower bound in Theorem B.1, inf_{A∈A} N*(A, P_X ⊗ H', ε) ≥ (VCdim(H') + ln(1/δ))/ε, and by Lemma D.3, VCdim(H') ≥ d'(d'−1)/2 = Ω(d²).

Upper bound: Take the CNN defined in Section 3.1 with d = 2d', r = 2, k = 1, and σ: R^{d'} → R, σ(x) = Σ_{i=1}^{d'} x_i² (square activation + average pooling), so that
F_CNN = {sign(Σ_{i=1}^2 a_i (Σ_{j=1}^{d'} x²_{(i−1)d'+j}) w_1² + b) | a_1, a_2, w_1, b ∈ R}.
Note that min_{h∈F_CNN} err_P(h) = 0 for all P ∈ P, and the VC dimension of F_CNN is 3; by Theorem B.1, for all P ∈ P, w.p. 1 − δ, err_P(ERM_{F_CNN}({x_i, y_i}_{i=1}^n)) ≤ ε if n = Ω((1/ε)(log(1/ε) + log(1/δ))).

Convergence guarantee for Gradient Descent: We initialize all parameters i.i.d. standard Gaussian and train the second layer only by gradient descent, i.e., set the learning rate of w_1 to 0. (Training the second layer only is still an orthogonal-equivariant algorithm for FC nets, so it is a valid separation.) For any convex, non-increasing surrogate loss l of the 0-1 loss satisfying l(0) ≥ 1 and lim_{x→∞} l(x) = 0, e.g., the logistic loss, define the loss of the weights W as (x_{k,i} is the kth coordinate of x_i)
L(W) = Σ_{i=1}^n l(F_CNN[W](x_i) y_i) = Σ_{i=1}^n l((Σ_{k=1}^2 a_k (Σ_{j=1}^{d'} x²_{(k−1)d'+j,i}) w_1² + b) y_i),
which is convex in the a_k and b. Note that w_1 ≠ 0 with probability 1, which means the data are separable even with the first layer fixed, i.e., inf_{a,b} L(W) = lim_{λ→∞} L(W)|_{a=λa*, b=0} = 0, where a* is the ground truth.
Thus, with a sufficiently small step size, GD converges to a solution with loss approaching 0. By the definition of the surrogate loss, L(W) < 1 implies l(F_CNN[W](x_i) y_i) < 1 for every x_i, and thus the training error is 0. ∎

Proof of Lemma D.4. The key idea is to first lower bound ρ_X(h_U, h_V) by ‖U − V‖_F/√d and then apply a volume argument in the tangent space of I_d in O(d). We have
ρ(U, V) = Pr_{x∼N(0,I_{2d})}[h_U(x) ≠ h_V(x)]
= Pr_{x∼N(0,I_{2d})}[(x_{1:d}^⊤ U x_{d+1:2d})(x_{1:d}^⊤ V x_{d+1:2d}) < 0]
= (1/π) E_{x_{1:d}∼N(0,I_d)}[arccos(x_{1:d}^⊤ UV^⊤ x_{1:d} / ‖x_{1:d}‖₂²)]
≥ (1/π) E_{x_{1:d}∼N(0,I_d)}[√(2 − 2 x_{1:d}^⊤ UV^⊤ x_{1:d} / ‖x_{1:d}‖₂²)]   (by Lemma A.1)
= (1/π) E_{x∼S^{d−1}}[√(2 − 2 x^⊤ UV^⊤ x)]
= (1/π) E_{x∼S^{d−1}}[‖(U − V)^⊤ x‖₂]
≥ C₁ ‖U − V‖_F/√d.   (by Lemma A.2)

Below we show that it suffices to pack in the exp-image of the spectral-norm ball so(d) ∩ (π/4)B^{d²}_∞, a neighborhood of I_d. Let so(d) be the Lie algebra of SO(d), i.e., {M ∈ R^{d×d} | M^⊤ = −M}, and let exp: R^{d×d} → R^{d×d} be the matrix exponential, exp(A) = I_d + A + A²/2! + A³/3! + ⋯. It holds that exp(so(d)) = SO(d) ⊆ O(d). The benefit of covering in such a neighborhood is that it allows us to translate the problem into the tangent space of I_d by the following lemma.

Lemma D.5 (Implication of Lemma 4 in Szarek (1997)). For any matrices A, B ∈ so(d) satisfying ‖A‖_∞ ≤ π/4 and ‖B‖_∞ ≤ π/4, we have
0.4‖A − B‖_F ≤ ‖exp(A) − exp(B)‖_F ≤ ‖A − B‖_F.

Therefore,
D(H, ρ_X, ε) ≥ D(O(d), C₁‖·‖_F/√d, ε) ≥ D(so(d) ∩ (π/4)B^{d²}_∞, C₁‖·‖_F/√d, 2.5ε).
Note that so(d) is a d(d−1)/2-dimensional subspace of R^{d²}. By the inverse Santaló inequality (Lemma 3 in Ma & Wu (2015)),
(vol(so(d) ∩ B^{d²}_∞) / vol(so(d) ∩ B^{d²}_2))^{2/(d(d−1))} ≥ C₂ √(dim(so(d))) / E_{G∼N(0,I_{d²})}[‖Π_{so(d)}(G)‖_∞],
where vol(·) is the d(d−1)/2-dimensional volume in the space so(d) and Π_{so(d)}(G) = (G − G^⊤)/2 is the projection operator onto the subspace so(d). We further have
E_{G∼N(0,I_{d²})}[‖Π_{so(d)}(G)‖_∞] = E_{G∼N(0,I_{d²})}[‖(G − G^⊤)/2‖_∞] ≤ E_{G∼N(0,I_{d²})}[‖G‖_∞] ≤ C₃√d,
where the last inequality is by Theorem 4.4.5 of Vershynin (2018). Finally,
D(so(d) ∩ (π/4)B^{d²}_∞, C₁‖·‖_F/√d, 2.5ε) = D(so(d) ∩ B^{d²}_∞, ‖·‖_F, 10√d ε/(C₁π))
≥ (vol(so(d) ∩ B^{d²}_∞) / vol(so(d) ∩ B^{d²}_2)) × (C₁π/(10√d ε))^{d(d−1)/2}
≥ (C₁C₂π √(d(d−1)/2) / (10dε))^{d(d−1)/2} =: (C/ε)^{d(d−1)/2}. ∎

D.4 PROOF OF THEOREM 5.3

Theorem 5.3 (Single distribution, multiple functions). There is a problem with a single input distribution, P = {P_X} ⊗ H = {N(0, I_d)} ⊗ {sign(Σ_{i=1}^d α_i x_i²) | α_i ∈ R}, such that for any orthogonal equivariant algorithm A and ε > 0, N*(A, P, ε) = Ω(d²/ε), while there is a 2-layer ConvNet architecture such that N(ERM_CNN, P, ε, δ) = O((d log(1/ε) + log(1/δ))/ε).

Proof of Theorem 5.3. Upper bound: Take the CNN defined in Section 3.1 with d' = d, r = 1, k = 1, and σ: R → R, σ(x) = x² (square activation + no pooling), so that
F_CNN = {sign(Σ_{i=1}^d a_i w_1² x_i² + b) | a_i, w_1, b ∈ R} = {sign(Σ_{i=1}^d a_i x_i² + b) | a_i, b ∈ R}.
Note that min_{h∈F_CNN} err_P(h) = 0 for all P ∈ P, and the VC dimension of F_CNN is d + 1; by Theorem B.1, for all P ∈ P, w.p. 1 − δ, err_P(ERM_{F_CNN}({x_i, y_i}_{i=1}^n)) ≤ ε if n = Ω((d log(1/ε) + log(1/δ))/ε).

Convergence guarantee for Gradient Descent: We initialize all parameters i.i.d. standard Gaussian and train the second layer only by gradient descent, i.e., set the learning rate of w_1 to 0. (Training the second layer only is still an orthogonal-equivariant algorithm for FC nets, so it is a valid separation.) For any convex, non-increasing surrogate loss l of the 0-1 loss satisfying l(0) ≥ 1 and lim_{x→∞} l(x) = 0, e.g.,
the logistic loss, we define the loss of the weights W as (x_{k,i} is the kth coordinate of x_i)
L(W) = Σ_{i=1}^n l(F_CNN[W](x_i) y_i) = Σ_{i=1}^n l((Σ_{k=1}^d w_1² a_k x_{k,i}² + b) y_i),
which is convex in the a_k and b. Note that w_1 ≠ 0 with probability 1, which means the data are separable even with the first layer fixed, i.e., inf_{a,b} L(W) = lim_{λ→∞} L(W)|_{a=λa*, b=0} = 0, where a* is the ground truth. Thus, with a sufficiently small step size, GD converges to a solution with loss approaching 0. By the definition of the surrogate loss, L(W) < 1 implies l(F_CNN[W](x_i) y_i) < 1 for every x_i, and thus the training error is 0. ∎

D.5 PROOF OF LEMMA D.6

Lemma D.6. For A ∈ R^{d×d}, define M_A ∈ R^{2d×2d} as M_A = [A 0; 0 I_d], and h_A: R^{4d} → {−1, 1} as h_A(x) = sign(x_{1:2d}^⊤ M_A x_{2d+1:4d}). Then H = {h_A | A ∈ R^{d×d}} ⊆ {sign(x^⊤ Ã x) | Ã ∈ R^{4d×4d}} satisfies: for any d, any algorithm A, and any ε > 0, N*(A, {N(0, I_{4d})} ⊗ H, ε) = Ω(d²/ε).

Proof of Lemma D.6. Below we prove a (C/ε)^{d²}-type lower bound for the packing number D(H, ρ_X, 2ε₀) = D(R^{d×d}, ρ, 2ε₀), where ρ(U, V) = ρ_X(h_U, h_V). We then apply Long's improved version, Equation (2), of the Benedek–Itai lower bound to obtain an Ω(d²/ε) sample complexity lower bound. The reason we obtain the correct rate in ε is that VCdim(H) is exactly equal to the exponent in the packing number (cf. the proof of Theorem 5.2). Similar to the proof of Theorem 5.2, the key idea is to first lower bound ρ(U, V) by ‖U − V‖_F/√d and then apply a volume argument. Recall that for A ∈ R^{d×d} we defined M_A = [A 0; 0 I_d] ∈ R^{2d×2d} and h_A(x) = sign(x_{1:2d}^⊤ M_A x_{2d+1:4d}), with H = {h_A | A ∈ R^{d×d}}. Below we will see that it suffices to lower bound the packing number of a subset of R^{d×d}, namely I_d + 0.1B^{d²}_∞, where B^{d²}_∞ is the unit spectral-norm ball. Clearly, for all x with ‖x‖₂ = 1 and all U ∈ I_d + 0.1B^{d²}_∞, we have 0.9 ≤ ‖Ux‖₂ ≤ 1.1.
Thus, for all U, V ∈ I_d + 0.1B^{d²}_∞,
ρ_X(h_U, h_V) = Pr_{x∼N(0,I_{4d})}[h_U(x) ≠ h_V(x)]
= Pr_{x∼N(0,I_{4d})}[(x_{1:2d}^⊤ M_U x_{2d+1:4d})(x_{1:2d}^⊤ M_V x_{2d+1:4d}) < 0]
= (1/π) E_{x_{1:2d}∼N(0,I_{2d})}[arccos(x_{1:2d}^⊤ M_U M_V^⊤ x_{1:2d} / (‖M_U^⊤ x_{1:2d}‖₂ ‖M_V^⊤ x_{1:2d}‖₂))]
≥ (1/π) E_{x_{1:2d}∼N(0,I_{2d})}[√(2 − 2 x_{1:2d}^⊤ M_U M_V^⊤ x_{1:2d} / (‖M_U^⊤ x_{1:2d}‖₂ ‖M_V^⊤ x_{1:2d}‖₂))]   (by Lemma A.1)
≥ (√2/(1.1π)) E_{x_{1:2d}∼N(0,I_{2d})}[√(‖M_U^⊤ x_{1:2d}‖₂ ‖M_V^⊤ x_{1:2d}‖₂ − x_{1:2d}^⊤ M_U M_V^⊤ x_{1:2d})]
= (1/(1.1π)) E_{x_{1:2d}∼N(0,I_{2d})}[√(‖(M_U − M_V)^⊤ x_{1:2d}‖₂² − (‖M_U^⊤ x_{1:2d}‖₂ − ‖M_V^⊤ x_{1:2d}‖₂)²)]
≥ (1/(1.1π)) (E_{x_{1:2d}}[‖(M_U − M_V)^⊤ x_{1:2d}‖₂] − E_{x_{1:2d}}[|‖M_U^⊤ x_{1:2d}‖₂ − ‖M_V^⊤ x_{1:2d}‖₂|])
≥ (C₀/(1.1π)) E_{x_{1:2d}}[‖(M_U − M_V)^⊤ x_{1:2d}‖₂]   (by Lemma D.7)
≥ C₁ ‖M_U − M_V‖_F/√d   (by Lemma A.2)
= C₁ ‖U − V‖_F/√d.

It remains to lower bound the packing number. We have
M(0.1B^{d²}_∞, C₁‖·‖_F/√d, ε) ≥ (vol(B^{d²}_∞)/vol(B^{d²}_2)) × (0.1C₁/(√d ε))^{d²} ≥ (C/ε)^{d²}   (★)
for some constant C. The proof is completed by plugging this bound and VCdim(H) = d² into Equation (2). ∎

Lemma D.7. Suppose x, y ∼ N(0, I_d) are independent. Then for all R, S ∈ R^{d×d},
E_x[‖(R − S)x‖₂] − E_{x,y}[|√(‖Rx‖₂² + ‖y‖₂²) − √(‖Sx‖₂² + ‖y‖₂²)|] ≥ C₀ E_x[‖(R − S)x‖₂]
for some constant C₀ independent of R, S and d.

Lower bound for the expected loss (the computation used in the proof of Lemma D.8):
E_{x∼N(0,I_d)}[(x^⊤Mx)²] = Σ_{i,j,i',j'} E[x_i x_j x_{i'} x_{j'}] M_{ij} M_{i'j'}
= Σ_{i≠j} (M_ij² + M_ij M_ji + M_ii M_jj) (E_{x∼N(0,1)}[x²])² + Σ_i M_ii² E_{x∼N(0,1)}[x⁴]
= Σ_{i≠j} (M_ij² + M_ij M_ji + M_ii M_jj) + 3 Σ_i M_ii²
= ‖(M + M^⊤)/√2‖_F² + (tr[M])².
The infimum of the test loss over all possible algorithms A is



This can be made formal using the fact that the Gram matrix determines a set of vectors up to an orthogonal transformation. For a vector x ∈ R^d, we define x̄_i = x_{((i−1) mod d)+1}.



Figure 1: Comparison of generalization performance of convolutional versus fully-connected models trained by SGD. The grey dotted lines indicate the separation, and we can see that convolutional networks consistently outperform fully-connected networks. Here the input data are 3 × 32 × 32 RGB images, and the binary label indicates, for each image, whether the first channel has a larger ℓ₂ norm than the second one. The input images are drawn from entry-wise independent Gaussians (left) and CIFAR-10 (right). In both cases, the 3-layer convolutional networks consist of two 3 × 3 convolutions with 10 hidden channels and a 3 × 3 convolution with a single output channel followed by global average pooling. The 3-layer fully-connected networks consist of two fully-connected layers with 10000 hidden channels and another fully-connected layer with a single output. The 2-layer versions have one less intermediate layer and only 3072 hidden channels per layer. The hybrid networks consist of a single fully-connected layer with 3072 channels followed by two convolutional layers with 10 channels each. "bn" stands for batch normalization (Ioffe & Szegedy, 2015).
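For reference, the synthetic (Gaussian) task described in the Figure 1 caption can be generated along the following lines; this is our own hedged sketch of the described setup, and the function name is illustrative.

```python
import numpy as np

def make_channel_norm_task(n, rng):
    """Gaussian 3x32x32 'images'; label: does channel 0 have a larger l2 norm than channel 1?"""
    X = rng.standard_normal((n, 3, 32, 32))
    y = np.sign(np.linalg.norm(X[:, 0], axis=(1, 2)) - np.linalg.norm(X[:, 1], axis=(1, 2)))
    return X.reshape(n, -1), y   # flattened to 3072-dim vectors for fully-connected models

rng = np.random.default_rng(0)
X, y = make_channel_norm_task(1000, rng)
assert X.shape == (1000, 3072)
assert abs(y.mean()) < 0.2       # the two labels are (roughly) balanced by symmetry
```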

Now we show that this algorithm fails to generalize on task P if it observes only d/2 training examples.

on all distributions. (See the formal statement in Theorem 5.1.) By the standard VC-dimension lower bounds (see Theorem B.1), it takes at least Ω(VCdim(H ∘ O(d))/ε) samples for A to guarantee 1 − ε accuracy. Thus it suffices to show VCdim(H ∘ O(d)) = Ω(d²) to obtain an Ω(d²) sample complexity lower bound. (Ng (2004) picks a linear threshold function as h*, and thus VCdim(h* ∘ O(d)) is only O(d).)
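The shattering construction behind VCdim(H ∘ O(d)) = Ω(d²) — Lemma D.3's orthogonal matrices exp(λu) realizing an arbitrary off-diagonal sign pattern — can be checked numerically. The sketch below is illustrative; expm_taylor is our own helper, and λ is taken small so that the first-order term dominates.

```python
import numpy as np

def expm_taylor(A, terms=30):
    """Matrix exponential via truncated Taylor series (accurate for small ||A||)."""
    out, term = np.eye(A.shape[0]), np.eye(A.shape[0])
    for k in range(1, terms):
        term = term @ A / k
        out = out + term
    return out

rng = np.random.default_rng(0)
d, lam = 8, 1e-3
sigma = np.sign(rng.standard_normal((d, d)))     # an arbitrary sign pattern sigma_ij
u = np.triu(sigma, 1) - np.triu(sigma, 1).T      # u = sum_{i<j} sigma_ij (e_i e_j^T - e_j e_i^T)
U = expm_taylor(lam * u)                         # exp of skew-symmetric => special orthogonal
assert np.allclose(U @ U.T, np.eye(d), atol=1e-10)        # U is orthogonal
iu, ju = np.triu_indices(d, 1)
assert np.array_equal(np.sign(U[iu, ju]), sigma[iu, ju])  # above-diagonal signs match sigma
```

Since U ≈ I + λu, the entry U_ij for i < j equals λσ_ij up to O(λ²), so its sign reproduces the requested pattern, exactly as in the proof of Lemma D.3.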

Proof of Theorem 5.2. Upper bound: implied by the upper bound in Theorem 4.1. Lower bound: Note that P_X = N(0, I_{2d}) is invariant under O(2d); by Theorem 5.1, it suffices to show that there is a constant ε₀ > 0 (independent of d) such that any algorithm A takes Ω(d²) samples to learn the augmented function class h* ∘ O(2d) w.r.t. P_X = N(0, I_{2d}). Define h_U = sign(x_{1:d}^⊤ U x_{d+1:2d}) for U ∈ R^{d×d}; by Lemma D.2, we have H = {h_U | U ∈ O(d)} ⊆ h* ∘ O(2d). Thus it suffices to prove an Ω(d²) sample complexity lower bound for the sub-function-class H, i.e.,
N*(A, N(0, I_{2d}) ⊗ {sign(x_{1:d}^⊤ U x_{d+1:2d})}, ε₀) = Ω(d²).

Gradient Descent for FC-NN (FC networks)
Require: initial parameter distribution P_init, total iterations T, training dataset {x_i, y_i}_{i=1}^n, loss function ℓ.
Ensure: hypothesis h: X → Y.

Proof of Lemma C.3. By definition, FC-NN[W](x) can be written as FC-NN[W_{2:L}](σ(W_1 x)), which implies FC-NN[W](x) = FC-NN[W_1 R^{−1}, W_{2:L}](Rx) for all R ∈ O(d). Thus we can pick τ(R) = O ∈ O(m) with O(W) = [W_1 R^{−1}, W_{2:L}], and G_W = τ(O(d)). ∎

PROOFS OF LEMMAS FOR THEOREM 5.2

Lemma D.4. Define h_U = sign(x_{1:d}^⊤ U x_{d+1:2d}), H = {h_U | U ∈ O(d)}, and ρ(U, V) := ρ_X(h_U, h_V) = Pr_{x∼N(0,I_{2d})}[h_U(x) ≠ h_V(x)]. There exists a constant C such that the packing number satisfies
D(H, ρ_X, ε) = D(O(d), ρ, ε) ≥ (C/ε)^{d(d−1)/2}.

Lower bound (Theorem 5.3): Note P = {N(0, I_d)} ⊗ H, where H = {sign(Σ_{i=1}^d α_i x_i²) | α_i ∈ R}. Since N(0, I_d) is invariant under all orthogonal transformations, by Theorem 5.1,
inf_{equivariant A} N*(A, N(0, I_d) ⊗ H, ε₀) = inf_A N*(A, N(0, I_d) ⊗ (H ∘ O(d)), ε₀).
Furthermore, it can be shown that H ∘ O(d) = {sign(Σ_{i,j} β_ij x_i x_j) | β_ij ∈ R}, the sign functions of all quadratics in R^d. Thus it suffices to show that learning quadratic functions under the Gaussian distribution requires Ω(d²/ε) samples for any algorithm (see Lemma D.6, where we assume the dimension d is divisible by 4).

min_{h∈F_CNN} err_P(h) = 0 for all P ∈ P, and the VC dimension of F_CNN is d + 1; by Theorem B.1, for all P ∈ P, w.p. 1 − δ, err_P(ERM_{F_CNN}({x_i, y_i}_{i=1}^n)) ≤ ε.

Let F(x, d) be the cdf of the chi-square distribution, i.e., F(x, d) = Pr_x[‖x‖₂² ≤ x]. Letting z = x/d, we have F(zd, d) ≤ (z e^{1−z})^{d/2} ≤ (z e^{1−z})^{1/2} for z < 1. Thus Pr_y[‖y‖₂² ≤ d/2] < 1, which implies
E_x[‖(R − S)x‖₂ · 1[‖x‖₂ ≤ 10√d]] ≥ α₁ α₂ E_x[‖(R − S)x‖₂]
for some constant α₂ > 0. Here we use the other side of the chi-square cdf tail bound: for z > 1, 1 − F(zd, d) < (z e^{1−z})^{d/2} < (z e^{1−z})^{1/2}.

D.6 PROOFS OF THEOREM 5.4

Lemma D.8. Let M ∈ R^{d×d}. Then E_{x∼N(0,I_d)}[(x^⊤Mx)²] = ‖(M + M^⊤)/√2‖_F² + (tr[M])².

Theorem 5.4 (Single distribution, multiple functions, ℓ₂ regression). There is a problem with a single input distribution, P = {P_X} ⊗ H = {N(0, I_d)} ⊗ {Σ_{i=1}^d α_i x_i² | α_i ∈ R}, such that for any orthogonal equivariant algorithm A and ε > 0, N*(A, P, ε) ≥ d(d+3)/2 · (1 − ε) − 1, while there is a 2-layer ConvNet architecture such that N*(ERM_CNN, P, ε) ≤ d for any ε > 0.

Proof of Theorem 5.4. Lower bound: Similar to the proof of Theorem 5.3, it suffices to show that for any algorithm A, N*(A, H ∘ O(d), ε) ≥ d(d+3)/2 · (1 − ε) − 1. Note that H ∘ O(d) = {Σ_{i,j} β_ij x_i x_j | β_ij ∈ R} is the set of all quadratic functions. For convenience we write h_M(x) = x^⊤Mx for M ∈ R^{d×d}. We now claim that any learning algorithm A taking at most n samples must suffer at least d(d+1)/2 − n loss when the ground-truth quadratic function is sampled from an i.i.d. Gaussian, while the loss of the trivial algorithm that always predicts 0 is at most d(d+3)/2. In other words, if the expected relative error is at most ε, the expected sample complexity satisfies N*(A, P, ε) ≥ n; that is, N*(A, P, ε) ≥ d(d+3)/2 · (1 − ε) − 1.

E_{(x_i,y_i)∼P_X⊗h_M}[([A({x_i, h_M(x_i)}_{i=1}^n)](x) − h_M(x))²]
≥ E_{x_i,x∼P_X, M∼N(0,I_{d²})}[Var_{x,x_i,M}[h_M(x) | {x_i, h_M(x_i)}_{i=1}^n, x]]
= E_{x_i,x∼P_X, M∼N(0,I_{d²})}[Var_M[h_M(x) | {h_M(x_i)}_{i=1}^n]],
where the inequality is attained when [A({x_i, y_i}_{i=1}^n)](x) = E_M[h_M(x) | {x_i, y_i}_{i=1}^n]. Thus it suffices to lower bound Var_M[h_M(x) | {h_M(x_i)}_{i=1}^n] for fixed {x_i}_{i=1}^n and x. For convenience, let S_d = {A ∈ R^{d×d} | A = A^⊤} be the linear space of all d × d symmetric matrices, with inner product ⟨A, B⟩ := tr[A^⊤B], and let Π_n: R^{d×d} → R^{d×d} be the projection operator onto the orthogonal complement in S_d of the n-dimensional space spanned by the x_i x_i^⊤. By definition, we can expand xx^⊤ = Σ_{i=1}^n α_i x_i x_i^⊤ + Π_n(xx^⊤). Thus, even conditioned on {x_i, y_i}_{i=1}^n and x,
h_M(x) = tr[xx^⊤M] = Σ_{i=1}^n α_i tr[x_i x_i^⊤ M] + tr[Π_n(xx^⊤)M],
whose second term still follows a Gaussian distribution, N(0, ‖Π_n(xx^⊤)‖_F²). Note that we can always find symmetric matrices E_i with ‖E_i‖_F = 1 and tr[E_i^⊤ E_j] = 0 for i ≠ j such that Π_n(A) = Σ_{i=1}^k E_i tr[E_i^⊤ A], where k, the rank of Π_n, is at least d(d+1)/2 − n. Thus the infimum of the expected test loss is
inf_A E_{M∼N(0,I_{d²})} E_{(x_i,y_i)∼P_X⊗h_M}[err_P(A({x_i, y_i}_{i=1}^n))]

Examples of gradient-based equivariant training algorithms for FC networks. The initialization requirement is only for the first layer of the network.

The separation becomes Ω(d) vs. O(1) in this case.

Michel Talagrand. Upper and Lower Bounds for Stochastic Processes: Modern Methods and Classical Problems, volume 60. Springer Science & Business Media, 2014.

Roman Vershynin. High-Dimensional Probability: An Introduction with Applications in Data Science. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, 2018.



Upper bound: We use the same CNN construction as in the proof of Theorem 5.3.

Proof of Theorem 5.5. Lower bound: We further define the permutations g_i as … . Given X_n, y_n, we define B := {d(x, x_k) ≥ 3, ∀k ∈ [n]}, and we have P[B] … . Thus, for any permutation equivariant algorithm A, N*(A, {P}, 1/4) ≥ d/10.

Upper bound: Take the CNN defined in Section 3.1 with … ; the probability of ERM_{F_CNN} not achieving 0 error is at most the probability that all data in the training dataset are t_i or s_i (note the training error of ERM_{F_CNN} is 0).

Convergence guarantee for Gradient Descent: We initialize all parameters i.i.d. standard Gaussian and train the second layer only by gradient descent, i.e., set the learning rates of w_1, w_2 to 0. (Training the second layer only is still a permutation-equivariant algorithm for FC nets, so it is a valid separation.) For any convex, non-increasing surrogate loss l of the 0-1 loss satisfying l(0) ≥ 1 and lim_{x→∞} l(x) = 0, e.g., the logistic loss, we define the loss of the weights W as before. Note that w_1 w_2 ≠ 0 with probability 1, which means the data are separable even with the first layer fixed, i.e., inf_{a_1,b} L(W) = 0. Further, L(W) is convex in a_1 and b, which implies that with a sufficiently small step size, GD converges to a solution with loss approaching 0. By the definition of the surrogate loss, L(W) < 1 implies l(F_CNN[W](x_i) y_i) < 1 for every x_i, and thus the training error is 0. ∎
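The convergence argument used repeatedly above (GD on the convex second-layer problem drives the surrogate loss toward 0; once the total surrogate loss is below 1, every training point is classified correctly) can be illustrated on a toy separable instance. This is our own sketch with illustrative names (Phi, a_star), not the paper's construction.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, lr = 20, 5, 0.1
a_star = rng.standard_normal(d)
y = rng.choice([-1.0, 1.0], size=n)
Phi = y[:, None] * a_star + 0.1 * rng.standard_normal((n, d))  # well-separated features

a, b = np.zeros(d), 0.0                      # train only the (convex) top layer
for _ in range(500):
    m = y * (Phi @ a + b)                    # margins
    g = -y * 0.5 * (1.0 - np.tanh(m / 2.0))  # = -y / (1 + e^m): stable logistic gradient
    a -= lr * Phi.T @ g
    b -= lr * g.sum()

loss = np.logaddexp(0.0, -y * (Phi @ a + b)).sum()  # total logistic loss
assert loss < 1.0                            # surrogate loss driven below 1 ...
assert np.all(y * (Phi @ a + b) > 0)         # ... and the training error is 0
```

The fixed "first layer" is played by the random feature matrix Phi; only the convex top-layer problem in (a, b) is optimized, mirroring the training-second-layer-only arguments in the proofs above.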

