A LAW OF ROBUSTNESS FOR TWO-LAYERS NEURAL NETWORKS

Abstract

We initiate the study of the inherent tradeoffs between the size of a neural network and its robustness, as measured by its Lipschitz constant. We make a precise conjecture that, for any Lipschitz activation function and for most datasets, any two-layers neural network with k neurons that perfectly fits the data must have its Lipschitz constant larger (up to a constant) than √(n/k), where n is the number of datapoints. In particular, this conjecture implies that overparametrization is necessary for robustness, since it means that one needs roughly one neuron per datapoint to ensure a O(1)-Lipschitz network, while mere data fitting of d-dimensional data requires only one neuron per d datapoints. We prove a weaker version of this conjecture when the Lipschitz constant is replaced by an upper bound on it based on the spectral norm of the weight matrix. We also prove the conjecture in the high-dimensional regime n ≈ d (which we also refer to as the undercomplete case, since only k ≤ d is relevant here). Finally we prove the conjecture for polynomial activation functions of degree p when n ≈ d^p. We complement these findings with experimental evidence supporting the conjecture.

1. INTRODUCTION

We study two-layers neural networks with inputs in R^d, k neurons, and Lipschitz non-linearity ψ : R → R. These are functions of the form:

f(x) = Σ_{ℓ=1}^k a_ℓ ψ(w_ℓ · x + b_ℓ), (1)

with a_ℓ, b_ℓ ∈ R and w_ℓ ∈ R^d for any ℓ ∈ [k]. We denote by F_k(ψ) the set of functions of the form (1). When k is large enough and ψ is non-polynomial, this set of functions can be used to fit any given data set (Cybenko, 1989; Leshno et al., 1993). That is, given a data set (x_i, y_i)_{i∈[n]} ∈ (R^d × R)^n, one can find f ∈ F_k(ψ) such that

f(x_i) = y_i, ∀i ∈ [n]. (2)

In a variety of scenarios one is furthermore interested in fitting the data smoothly. For example, in machine learning, the data fitting model f is used to make predictions at unseen points x ∉ {x_1, ..., x_n}. It is reasonable to ask for these predictions to be stable, that is, a small perturbation of x should result in a small perturbation of f(x). A natural question is: how "costly" is this stability restriction compared to mere data fitting? In practice it seems much harder to find robust models for large scale problems, as first evidenced in the seminal paper (Szegedy et al., 2013). In theory the "cost" of finding robust models has been investigated from a computational complexity perspective in (Bubeck et al., 2019), from a statistical perspective in (Schmidt et al., 2018), and more generally from a model complexity perspective in (Degwekar et al., 2019; Raghunathan et al., 2019; Allen-Zhu and Li, 2020). We propose here a different angle of study within the broad model complexity perspective: does a model have to be larger for it to be robust? Empirical evidence (e.g., (Goodfellow et al., 2015; Madry et al., 2018)) suggests that bigger models (also known as "overparametrization") do indeed help for robustness.
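As a concrete illustration, a function of the form (1) can be evaluated in a few lines. This is a minimal sketch of our own (the helper names and the random weights are ours, not from the paper):

```python
import numpy as np

def two_layer_net(x, a, W, b, psi):
    # f(x) = sum_l a_l * psi(w_l . x + b_l), i.e. a member of F_k(psi)
    return float(a @ psi(W @ x + b))

relu = lambda t: np.maximum(t, 0.0)  # a 1-Lipschitz choice of psi

rng = np.random.default_rng(0)
d, k = 4, 3
a = rng.normal(size=k)
W = rng.normal(size=(k, d))  # row l of W is the weight vector w_l
b = rng.normal(size=k)

x = rng.normal(size=d)
fx = two_layer_net(x, a, W, b, relu)
```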
Our main contribution is a conjecture (Conjecture 1 and Conjecture 2) on the precise tradeoffs between the size of the model (i.e., the number of neurons k) and robustness (i.e., the Lipschitz constant of the data fitting model f ∈ F_k(ψ)) for generic data sets. We say that a data set (x_i, y_i)_{i∈[n]} is generic if it is i.i.d. with x_i uniform (or approximately so, see below) on the sphere S^{d-1} = {x ∈ R^d : ‖x‖ = 1} and y_i uniform on {-1, +1}. We give the precise conjecture in Section 2. We prove several weaker versions of Conjecture 1 and Conjecture 2 respectively in Section 4 and Section 3. We also give empirical evidence for the conjecture in Section 5.

A corollary of our conjecture. A key fact about generic data, established in Baum (1988); Yun et al. (2019); Bubeck et al. (2020), is that one can memorize arbitrary labels with k ≈ n/d, that is merely one neuron per d datapoints. Our conjecture implies that for such optimal-size neural networks it is impossible to be robust, in the sense that the Lipschitz constant must be of order √d. The conjecture also states that to be robust (i.e., attain Lipschitz constant O(1)) one must necessarily have k ≈ n, that is roughly each datapoint must have its own neuron. Therefore, we obtain a trade-off between size and robustness, namely to make the network robust it needs to be d times larger than for mere data fitting. We illustrate these two cases in Figure 1. We train a neural network to fit generic data, and plot the maximum gradient over several randomly drawn points (a proxy for the Lipschitz constant) for various values of √d, when either k = n (blue dots) or k = 10n/d (red dots). As predicted, for the large neural network (k = n) the Lipschitz constant remains roughly constant, while for the optimally-sized one (k = 10n/d) the Lipschitz constant increases roughly linearly in √d.

Notation.
For Ω ⊂ R^d we define Lip_Ω(f) = sup_{x ≠ x' ∈ Ω} |f(x) − f(x')| / ‖x − x'‖ (if Ω = R^d we omit the subscript and write Lip(f)), where ‖·‖ denotes the Euclidean norm. For matrices we use ‖·‖_op, ‖·‖_{op,*}, ‖·‖_F and ⟨·,·⟩ for respectively the operator norm, the nuclear norm (sum of singular values), the Frobenius norm, and the Frobenius inner product. We also use these notations for tensors of higher order, see Appendix A for more details on tensors. We denote c > 0 and C > 0 for universal numerical constants, respectively small enough and large enough, whose values can change in different occurrences. Similarly, by c_p > 0 and C_p > 0 we denote constants depending only on the parameter p. We also write ReLU(t) = max(t, 0) for the rectified linear unit.

Generic data. We give some flexibility in our definition of "generic data" in order to focus on the essence of the problem, rather than technical details. Namely, in addition to the spherical model mentioned above, where x_i is i.i.d. uniform on the sphere S^{d-1} = {x ∈ R^d : ‖x‖ = 1}, we also consider the very closely related model where x_i is i.i.d. from a centered Gaussian with covariance (1/d) I_d (in particular E[‖x_i‖^2] = 1, and in fact ‖x_i‖ is tightly concentrated around 1). In both cases we consider y_i to be i.i.d. random signs. We say that a property holds with high probability for generic data if it holds with high probability either for the spherical model or for the Gaussian model.
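For reference, sampling from the two generic-data models is immediate. A sketch (the function name `generic_data` is ours):

```python
import numpy as np

def generic_data(n, d, model="sphere", seed=0):
    # x_i uniform on S^{d-1}, or x_i ~ N(0, (1/d) I_d); y_i i.i.d. random signs
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, d))
    if model == "sphere":
        X /= np.linalg.norm(X, axis=1, keepdims=True)
    else:
        X /= np.sqrt(d)  # covariance (1/d) I_d, so E[||x_i||^2] = 1
    y = rng.choice([-1.0, 1.0], size=n)
    return X, y

X, y = generic_data(200, 500)
```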

2. A CONJECTURED LAW OF ROBUSTNESS

Our main contribution is the following conjecture, which asserts that, on generic data sets, increasing the size of a network is necessary to obtain robustness:

Conjecture 1 For generic data sets, with high probability[1], any f ∈ F_k(ψ) fitting the data[2] (i.e., satisfying (2)) must also satisfy: Lip_{S^{d-1}}(f) ≥ c √(n/k).

Note that for generic data, with high probability (for n = poly(d)), there exists a smooth interpolation. Namely there exists g : R^d → R with g(x_i) = y_i, ∀i ∈ [n] and Lip(g) = O(1). This follows easily from the fact that with high probability (for large d) one has ‖x_i − x_j‖ ≥ 1, ∀i ≠ j. Conjecture 1 puts restrictions on how smoothly one can interpolate data with small neural networks. A striking consequence of the conjecture is that for a two-layers neural network f ∈ F_k(ψ) to be as robust as this function g (i.e., Lip(f) = O(1)) and fit the data, one must have k = Ω(n), i.e., roughly one neuron per data point. On the other hand with that many neurons it is quite trivial to smoothly interpolate the data, as we explain in Section 3.2. Thus the conjecture makes a strong statement that essentially the trivial smooth interpolation is the best thing one can do. In addition to making the prediction that one neuron per datapoint is necessary for optimal smoothness, the conjecture also gives a precise prediction on the possible tradeoff between the size of the network and its robustness. We also conjecture that this whole range of tradeoffs is actually achievable:

Conjecture 2 Let n, d, k be such that C·n/d ≤ k ≤ C·n and n ≤ d^C, where C is an arbitrarily large constant in the latter occurrence. There exists ψ such that, for generic data sets, with high probability, there exists f ∈ F_k(ψ) fitting the data (i.e., satisfying (2)) and such that Lip_{S^{d-1}}(f) ≤ C √(n/k).

The condition k ≤ C·n in Conjecture 2 is necessary, for any interpolation of the data must have Lipschitz constant at least a constant.
The other condition on k, namely k ≥ C·n/d, is also necessary, for that many neurons are needed to merely guarantee the existence of a data-fitting neural network with k neurons (see Baum (1988); Yun et al. (2019); Bubeck et al. (2020)). Finally the condition n ≤ d^C is merely used to avoid explicitly stating a logarithmic term in our conjecture (indeed, equivalently one can replace this condition by adding a multiplicative polylogarithmic term in d in the claimed inequality).

Our results around Conjecture 2 (Section 3). We prove Conjecture 2 for both the optimal smoothness regime (which is quite straightforward, see Section 3.2) and for the optimal size regime (here more work is needed, and we use a certain tensor-based construction, see Section 3.4). In the latter case we only prove approximate data fitting (mostly to simplify the proofs), and more importantly we need to assume that n is of order d^p for some even integer p. It would be interesting to generalize the proof to any n. While the conjecture remains open between these two extreme regimes, we do give a construction in Section 3.3 which has the correct qualitative behavior (namely increasing k improves the Lipschitz constant), albeit the scaling we obtain is n/k instead of √(n/k), see Theorem 1.

Our results around Conjecture 1 (Section 4). We prove a weaker version of Conjecture 1 where the Lipschitz constant on the sphere is replaced by a proxy involving the spectral norm of the weight matrix, see Theorem 3. We also prove the conjecture in the optimal size regime, specifically when n ≈ d^p for an integer p and one uses a polynomial activation function of degree p, see Theorem 6. For p = 1 (i.e., n ≈ d) we in fact prove the conjecture for arbitrary non-linearities, see Theorem 4.

Further open problems. Our proposed law of robustness is a first mathematical formalization of the broader phenomenon that "overparametrization in neural networks is necessary for robustness".
Ideally one would like a much more refined understanding of the phenomenon than the one given in Conjecture 1. For example, one could imagine that in greater generality the law would read Lip_Ω(f) ≥ F(k, (x_i, y_i)_{i∈[n]}, Ω). That is, we would like to understand how the achievable level of smoothness depends on the particular data set at hand, but also on the set where we expect to be making predictions. Another direction to generalize the law would be to extend it to multilayer neural networks. In particular one could imagine that the most general law would replace the parameter k (number of neurons) by the type of architecture being used, and in turn predict the best architecture for a given data set and prediction set. Finally note that our proposed law applies to all neural networks, but it would also be interesting to understand how the law interacts with algorithmic considerations (for example in Section 5 we use Adam (Kingma and Ba, 2015) to find a set of weights that qualitatively match Conjecture 2).

3. SMOOTH INTERPOLATION

We start with a warm-up in Section 3.1 where we discuss the simplest case of interpolation with a linear model (k = 1, n ≤ d) and in Section 3.2 for the optimal smoothness regime (k = n). We generalize the construction of Section 3.2 in Section 3.3 to obtain the whole range of tradeoffs between k and Lip(f ), albeit with a suboptimal scaling, see Theorem 1. We also generalize the linear model calculations of Section 3.1 in Section 3.4 to obtain the optimal size regime for larger values of n via a certain tensor construction.

3.1. THE SIMPLEST CASE: OPTIMAL SIZE REGIME WHEN n ≤ c·d

Let us consider k = 1, n ≤ c·d and ψ(t) = t. Thus we are trying to find w ∈ R^d such that w · x_i = y_i for all i ∈ [n], or in other words Xw = Y with X the n × d matrix whose i-th row is x_i, and Y = (y_1, ..., y_n). The smoothest solution to this system (i.e., the one minimizing ‖w‖) is w = X^T (X X^T)^{-1} Y. Note that Lip(x ↦ w · x) = ‖w‖ = √(w^T w) = √(Y^T (X X^T)^{-1} Y). Using [Theorem 5.58, Vershynin (2012)] one has with probability at least 1 − exp(C − cd) (and using that n ≤ c·d) that X X^T ⪰ (1/2) I_n, and thus Lip(x ↦ w · x) ≤ √2 · ‖Y‖ = √(2n). This concludes the proof sketch of Conjecture 2 for the simplest case k = 1 and n ≤ c·d.
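This calculation is easy to check numerically. Below is a sketch in the Gaussian generic-data model (the specific n, d and seed are our own choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 50, 1000                              # undercomplete regime n <= c*d
X = rng.normal(size=(n, d)) / np.sqrt(d)     # rows x_i ~ N(0, (1/d) I_d)
y = rng.choice([-1.0, 1.0], size=n)

# minimum-norm solution of Xw = y: w = X^T (X X^T)^{-1} y
w = X.T @ np.linalg.solve(X @ X.T, y)

fit_error = np.max(np.abs(X @ w - y))
lipschitz = np.linalg.norm(w)                # Lip of the linear model x -> w . x
```

With n much smaller than d, the Gram matrix X X^T concentrates around I_n, so `lipschitz` indeed lands below the √(2n) bound from the text.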

3.2. ANOTHER SIMPLE CASE: OPTIMAL SMOOTHNESS REGIME

Next we consider the optimal smoothness regime in Conjecture 2, namely k = n. First note that, for generic data and n = poly(d), with high probability the caps C_i := {x ∈ S^{d-1} : x_i · x ≥ 0.9} are disjoint sets, and moreover they each contain a single data point (namely x_i). With a single ReLU unit it is then easy to make a smooth function (10-Lipschitz) which is 0 outside of C_i and equal to +1 at x_i (in other words the neuron activates for a single data point), namely x ↦ 10 · ReLU(x_i · x − 0.9). Thus one can fit the entire data set with the following ReLU network, which is 10-Lipschitz on the sphere:

f(x) = Σ_{i=1}^n 10 y_i · ReLU(x_i · x − 0.9).

This concludes the proof of Conjecture 2 for the optimal smoothness regime k = n.
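The cap construction can be verified numerically: with d large the caps are disjoint with overwhelming probability and the network below fits the labels exactly (sizes and seed are our own choices):

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 100, 2000
X = rng.normal(size=(n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)   # x_i uniform on the sphere
y = rng.choice([-1.0, 1.0], size=n)

relu = lambda t: np.maximum(t, 0.0)

def f(x):
    # one 10-Lipschitz neuron per data point, active only inside the cap C_i
    return float(np.sum(10.0 * y * relu(X @ x - 0.9)))

# cross inner products x_i . x_j are O(1/sqrt(d)) << 0.9, so neuron i fires only at x_i
preds = np.array([f(X[i]) for i in range(n)])
```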

3.3. INTERMEDIATE REGIMES VIA RELU NETWORKS

We now combine the two constructions above (the linear model of Section 3.1 and the "isolation" strategy of Section 3.2) to give a construction that can trade off size for robustness (albeit not optimally according to Conjecture 2), see Appendix C for the proof.

Theorem 1 Let n, d, k be such that C·n log(n)/d ≤ k ≤ C·n. For generic data sets, with probability at least 1 − 1/n^C, there exists f ∈ F_k(ReLU) fitting the data (i.e., satisfying (2)) and such that Lip_{S^{d-1}}(f) ≤ C·(n/k)·√(log d).

3.4. OPTIMAL SIZE NETWORKS VIA TENSOR INTERPOLATION

In this section we essentially prove Conjecture 2 in the optimal size regime (namely k·d ≈ n), with three caveats:

1. We allow a slack of a log n factor, by considering k·d = C n log(n) instead of the optimal k·d = C n as in Baum (1988); Bubeck et al. (2020).

2. We only prove approximate fit rather than exact fit. It is likely that with more work one can use the core of our argument to obtain exact fit. For that reason we did not make any attempt to optimize the dependency on ε in Theorem 2. For instance one could probably obtain log(1/ε) rather than 1/poly(ε) dependency by using an iterative scheme that fits the residuals, as in (Bresler and Nagaraj, 2020; Bubeck et al., 2020).

3. We assume that n is of order d^p for some even integer p. While it might be that one can apply the same proof for odd integers, the whole construction crucially relies on p being an even integer, as we essentially do a linear regression over the feature embedding x ↦ x^{⊗p}. A possible approach to extend the proof to other values of n would be to use the scheme of Section 3.3 with the linear regression there replaced by the tensor regression used below.

Theorem 2 Fix ε > 0, p an even integer, and let ψ(t) = t^p. Let n, d, k be such that n log(n) = ε^2 · d^p and k = C_p · d^{p-1}. Then for generic data, with probability at least 1 − 1/n^C, there exists f ∈ F_k(ψ) such that

|f(x_i) − y_i| ≤ C_p · ε, ∀i ∈ [n], (3)

and Lip_{S^{d-1}}(f) ≤ C_p √(n/k).

Proof. We propose to approximately fit with the following neural network: f(x) = Σ_{i=1}^n y_i (x_i · x)^p. Naively one might think that this neural network requires n neurons. However, it turns out that one can always decompose a symmetric tensor of order p into k = 2^p d^{p-1} rank-1 symmetric tensors of order p, so that in fact f ∈ F_k(ψ). For p = 2 this simply follows from the eigendecomposition, and for general p we give a simple proof in [Appendix A, Lemma 2].
One also has, by applying [Appendix B, Lemma 4] with τ = C_p log(n) and doing a union bound, that with probability at least 1 − 1/n^C, for any j ∈ [n],

|Σ_{i=1, i≠j}^n y_i (x_i · x_j)^p| ≤ C_p √(n log(n)/d^p) ≤ C_p ε.

In particular this proves (3). Thus it only remains to estimate the Lipschitz constant, which by [Appendix A, Lemma 1] is reduced to estimating the operator norm of the tensor Σ_{i=1}^n y_i x_i^{⊗p}. We do so in [Appendix B, Lemma 5].
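For p = 2 the construction and the scale of the fitting error are easy to check numerically. A sketch (the sizes, the seed, and the constant 5 standing in for C_p are our own choices):

```python
import numpy as np

rng = np.random.default_rng(3)
p, d, n = 2, 100, 400
X = rng.normal(size=(n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)   # spherical generic data
y = rng.choice([-1.0, 1.0], size=n)

# f(x) = sum_i y_i (x_i . x)^p evaluated at every data point at once
G = (X @ X.T) ** p
preds = G @ y

# at x_j the diagonal term contributes y_j exactly; the rest is cross-term noise
residuals = preds - y
bound = 5 * np.sqrt(n * np.log(n)) / d ** (p / 2)   # C_p sqrt(n log n / d^p), C_p = 5
```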

4. PROVABLE WEAKER VERSIONS OF CONJECTURE 1

Conjecture 1 can be made weaker along several directions. For example the quantity of interest Lip_{S^{d-1}}(f) can be replaced by various upper bound proxies for the Lipschitz constant. A mild weakening would be to replace it by the Lipschitz constant on the whole space (we shall in fact only consider this notion here). A much more severe weakening is to replace it by a quantity that depends on the spectral norm of the weight matrix (essentially ignoring the pattern of activation functions). For the latter proxy we actually give a complete proof, see Theorem 3, which in particular formally proves that "overparametrization is a law of robustness for generic data sets". Other interesting directions to weaken the conjecture include specializing it to common activation functions, or simply proving a smaller lower bound on the Lipschitz constant. In Section 4.2 we prove the conjecture when n is replaced by d in the lower bound. We say that this inequality is in the "very high-dimensional case", in the sense that it matches the conjecture for n ≈ d (alternatively we also refer to it as the "undercomplete case", in the sense that only k ≤ d is relevant in this very high-dimensional scenario). In the moderately high-dimensional case (n ≫ d) the proof strategy we propose in Section 4.2 cannot work. In Section 4.3 we give another argument for the latter case, specifically in the optimal size regime (i.e., k·d ≈ n) and for a power activation function, see Theorem 5. We generalize this to polynomial activation functions in Section D.1. In the specific case of a quadratic activation function we also show a lower bound that applies for any k and which is in fact larger than the one given in Conjecture 1, see Theorem 7 in Section D.2.

4.1. SPECTRAL NORM PROXY FOR THE LIPSCHITZ CONSTANT

We can rewrite (1) as

f(x) = a · ψ(W x + b), (4)

where a = (a_1, ..., a_k) ∈ R^k, b = (b_1, ..., b_k) ∈ R^k, W ∈ R^{k×d} is the matrix whose ℓ-th row is w_ℓ, and ψ is extended from R → R to R^k → R^k by applying it coordinate-wise. We prove here the following:

Theorem 3 Assume that ψ is L-Lipschitz. For f ∈ F_k(ψ) one has

Lip(f) ≤ L · ‖a‖ · ‖W‖_op. (5)

For a generic data set, if f(x_i) = y_i, ∀i ∈ [n] and f has no bias terms (i.e., b = 0 in (4)), then with positive probability one has:

L · ‖a‖ · ‖W‖_op ≥ √(n/k). (6)

Note that we prove the inequality (6) only with positive probability (i.e., there exists a data set where the inequality is true), but in fact it is easy to derive the statement with high probability using classical concentration inequalities.

Proof. Since ψ : R → R is L-Lipschitz, we have:

|f(x) − f(x')| ≤ ‖a‖ · ‖ψ(W x + b) − ψ(W x' + b)‖ ≤ L · ‖a‖ · ‖W x − W x'‖ ≤ L · ‖a‖ · ‖W‖_op · ‖x − x'‖,

which directly proves (5). Next, following the proof of [Proposition 1, Bubeck et al. (2020)] one obtains that for a generic data set, with positive probability, one has (without bias terms): Σ_{ℓ=1}^k |a_ℓ| · ‖w_ℓ‖ ≥ √n / L. It only remains to observe that, by the Cauchy–Schwarz inequality:

√n / L ≤ Σ_{ℓ=1}^k |a_ℓ| · ‖w_ℓ‖ ≤ √(Σ_{ℓ=1}^k |a_ℓ|^2) · √(Σ_{ℓ=1}^k ‖w_ℓ‖^2) = ‖a‖ · ‖W‖_F ≤ √k · ‖a‖ · ‖W‖_op,

which concludes the proof of (6).
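The upper bound (5) is deterministic and easy to sanity-check on random weights (dimensions and seed are our own choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(4)
d, k, L = 20, 15, 1.0
a = rng.normal(size=k)
W = rng.normal(size=(k, d))
b = rng.normal(size=k)
relu = lambda t: np.maximum(t, 0.0)   # 1-Lipschitz, so L = 1

def f(x):
    return float(a @ relu(W @ x + b))

bound = L * np.linalg.norm(a) * np.linalg.norm(W, ord=2)   # L * ||a|| * ||W||_op

# empirical difference quotients never exceed the bound
pairs = rng.normal(size=(200, 2, d))
ratios = np.array([abs(f(u) - f(v)) / np.linalg.norm(u - v) for u, v in pairs])
```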

4.2. UNDERCOMPLETE CASE

Next we prove the conjecture in the high-dimensional case n ≈ d. More precisely we replace n by d in the conjectured lower bound. Importantly, note that the resulting lower bound then becomes non-trivial only in the regime k ≤ d (the "undercomplete case"). We consider in fact a slightly more general scenario than interpolation with a neural network, namely we simply assume that one interpolates the data with a function f(x) = g(P x) where P is a linear projection on a k-dimensional subspace (this clearly generalizes f ∈ F_k(ψ), in fact it even allows for the non-linearity ψ to depend on the data[3], or to have a different non-linearity for each neuron).

Theorem 4 Let n ≥ d. Let f : R^d → R be a function such that f(x_i) = y_i, ∀i ∈ [n] and moreover f(x) = g(P x) for some differentiable function g : R^k → R and matrix P ∈ R^{k×d}. Then, for generic data, with probability at least 1 − exp(C − cd) one must have Lip(f) ≥ c √(d/k).

Proof. Let us modify g so that P is simply an orthogonal projection operator (i.e., P P^T = I_k). Let us also assume for the sake of notational simplicity that we have a balanced data set of size 2n, that is with: y_1, ..., y_n = +1 and y_{n+1}, ..., y_{2n} = -1. Let us denote x̃_i = x_i − x_{n+i} for i ∈ [n]. The sequence x̃_i is i.i.d. and satisfies E[x̃_i x̃_i^T] = (2/d) I_d. Now observe that on the segment [x_i, x_{n+i}] (whose length is less than 2), the function f changes value from +1 to -1, and thus there exists z_i ∈ [x_i, x_{n+i}] such that:

1 ≤ |∇f(z_i) · (x_i − x_{n+i})| = |∇f(z_i) · x̃_i|.

Moreover one has (using that ∇f(x) = P^T ∇g(P x), and thus ‖∇g(P x)‖ = ‖P ∇f(x)‖ ≤ ‖∇f(x)‖ ≤ Lip(f))

|∇f(z_i) · x̃_i| = |∇g(P z_i) · (P x̃_i)| ≤ Lip(f) · ‖P x̃_i‖.

Combining the two above displays one has:

n / Lip(f) ≤ Σ_{i=1}^n ‖P x̃_i‖ ≤ √(n Σ_{i=1}^n ‖P x̃_i‖^2) = √(n Σ_{i=1}^n x̃_i^T P^T P x̃_i) = √(n ⟨Σ_{i=1}^n x̃_i x̃_i^T, P^T P⟩_HS).

Using [Theorem 5.39, Vershynin (2012)] (specifically (5.23)) we know that with probability at least 1 − exp(C − cd) we have

4.3. POWER ACTIVATION

We prove here the conjecture for the power activation function ψ(t) = t^p with p an integer and with no bias terms (we deal with general polynomials, including with bias, in Appendix D). Without bias such a network can be written as:

f(x) = Σ_{ℓ=1}^k a_ℓ (w_ℓ · x)^p = ⟨T, x^{⊗p}⟩, (7)

where T = Σ_{ℓ=1}^k a_ℓ w_ℓ^{⊗p}. As we already saw in the proof of Theorem 2 (see specifically [Appendix A, Lemma 2]), without loss of generality we have k ≤ C_p d^{p-1}. We now prove that tensor networks of the form (7) cannot obtain a Lipschitz constant[4] better than √(n/d^{p-1}), in accordance with Conjecture 1 for full rank tensors (where k ≈ d^{p-1}).

Theorem 5 Assume that f of the form (7) satisfies f(x_i) = y_i, ∀i ∈ [n]. Then, for generic data, with probability at least 1 − C exp(−c_p d), one must have ‖T‖_op ≥ c_p √(n/d^{p-1}).

Proof. Denoting Ω = Σ_{i=1}^n y_i x_i^{⊗p}, we have (using y_i^2 = 1 for the first equality and [Appendix A, Lemma 3] for the last inequality):

n = ⟨T, Ω⟩ ≤ ‖Ω‖_op · ‖T‖_{op,*} ≤ d^{p-1} · ‖Ω‖_op · ‖T‖_op. (8)

Thus we obtain ‖T‖_op ≥ n / (d^{p-1} · ‖Ω‖_op)

5. EXPERIMENTS

We consider a generic dataset from the Gaussian model (i.e., x_1, ..., x_n i.i.d. from N(0, (1/d) I_d) and labels y_1, ..., y_n i.i.d. from the uniform distribution over {-1, 1} and independent of x_1, ..., x_n). For various values of (n, d, k) we train two-layers neural networks with k ReLU units and batch normalization (see Ioffe and Szegedy (2015)) between the linear layer and the ReLU layer, using the Adam optimizer (Kingma and Ba, 2015) on the least squares loss. We keep the values of (n, k, d) where the network successfully memorizes the random labels (possibly after a rounding to {-1, +1}, and such that prior to rounding the least squares loss is at most some small value ε to be specified later). Given a triple (n, d, k), suppose the output of the trained network is f_{n,d,k} : R^d → R. We then generate z_1, ..., z_T (where T = 1000) i.i.d. from the distribution N(0, (1/d) I_d), independently of everything else, and define the "maximum random gradient" to be max_{i∈[T]} ‖∇f_{n,d,k}(z_i)‖ (it is our proxy for the true Lipschitz constant sup_{z∈S^{d-1}} ‖∇f_{n,d,k}(z)‖). Our experimental results are as follows:

Experiment 1: We ran experiments with n between 100 and 2000, d between ∼50 and ∼n, and k between ∼10 and ∼n (we also choose ε = 0.02 for the thresholding). In Figure 2 we give a scatter plot of (√(n/k), max_{i∈[T]} ‖∇f_{n,d,k}(z_i)‖), and as predicted we see a linear trend, thus providing empirical evidence for Conjecture 1.

Experiment 2: In this experiment, we investigate the two extreme cases k ∼ n and k ∼ n/d. We fix n = 10^4 and sweep the value of d from 10 to 5000 (we also choose ε = 0.1 for the thresholding). In the first case, we let k = n and in the second case we let k = 10n/d. In Figure 3 we plot √d versus the maximum random gradient (as defined above) for both cases.
We observe a linear dependence between the maximum gradient value and √d when we have k = 10n/d, and roughly a constant maximum gradient value when k = n, thus providing again evidence for Conjecture 1.
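The "maximum random gradient" statistic itself is simple to compute. The sketch below evaluates it for a ReLU network with arbitrary untrained weights (our own illustrative stand-in for the trained networks f_{n,d,k} of the experiments):

```python
import numpy as np

rng = np.random.default_rng(5)
d, k, T = 50, 40, 1000
a = rng.normal(size=k)
W = rng.normal(size=(k, d)) / np.sqrt(d)
b = rng.normal(size=k)

def grad_f(z):
    # gradient of f(z) = a . ReLU(W z + b) wherever it is defined
    active = (W @ z + b > 0).astype(float)
    return W.T @ (a * active)

Z = rng.normal(size=(T, d)) / np.sqrt(d)        # z_t ~ N(0, (1/d) I_d)
max_random_gradient = max(np.linalg.norm(grad_f(z)) for z in Z)
```

By the argument of Theorem 3, this statistic can never exceed ‖a‖·‖W‖_op, which gives a quick internal consistency check.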

A RESULTS ON TENSORS

A tensor of order p is an array T = (T_{i_1,...,i_p})_{i_1,...,i_p ∈ [d]}. The Frobenius inner product for tensors is defined by: ⟨T, S⟩ = Σ_{i_1,...,i_p=1}^d T_{i_1,...,i_p} S_{i_1,...,i_p}, with the corresponding norm ‖·‖_F. A tensor is said to be of rank 1 if it can be written as: T = u_1 ⊗ ... ⊗ u_p, for some u_1, ..., u_p ∈ R^d. The operator norm ‖·‖_op is defined by: ‖T‖_op = sup_{S rank 1, ‖S‖_F ≤ 1} ⟨T, S⟩. For symmetric tensors (i.e., such that the entries of the array are invariant under permutation of the p indices), Banach's Theorem (see e.g., [(2.32), Nemirovski (2004)]) states that in fact one has

‖T‖_op = sup_{x ∈ S^{d-1}} ⟨T, x^{⊗p}⟩. (9)

We refer to Friedland and Lim (2018) for more details and background on tensors. We now list a couple of useful results, with short proofs.

Lemma 1 For a tensor T of order p, one has Lip_{S^{d-1}}(x ↦ ⟨T, x^{⊗p}⟩) ≤ p · ‖T‖_op.

Proof. One has for any x, y ∈ S^{d-1},

|⟨T, x^{⊗p}⟩ − ⟨T, y^{⊗p}⟩| ≤ Σ_{q=1}^p |⟨T, x^{⊗(p-q+1)} ⊗ y^{⊗(q-1)}⟩ − ⟨T, x^{⊗(p-q)} ⊗ y^{⊗q}⟩| ≤ p · ‖x − y‖ · sup_{x_1,...,x_p ∈ S^{d-1}} ⟨T, x_1 ⊗ ... ⊗ x_p⟩ = p · ‖x − y‖ · ‖T‖_op.

Lemma 2 For any tensor T of order p, there exist w_1, ..., w_{2^p d^{p-1}} ∈ R^d and ξ_1, ..., ξ_{2^p d^{p-1}} ∈ {-1, +1} such that for all x ∈ R^d, ⟨T, x^{⊗p}⟩ = Σ_{ℓ=1}^{2^p d^{p-1}} ξ_ℓ · (w_ℓ · x)^p.

Results like Lemma 2 go back at least to Reznick (1992). In fact much more precise results on minimal decompositions into rank-1 tensors are known thanks to the work of Alexander and Hirschowitz (1995). We refer to (Comon et al., 2008) for more discussion on this topic.

Proof. First note that trivially T can be written as:

T = Σ_{i_1,...,i_{p-1}=1}^d e_{i_1} ⊗ ... ⊗ e_{i_{p-1}} ⊗ T[i_1, ..., i_{p-1}, 1:d]. (10)

Thus one only needs to prove that a function of the form x ↦ Π_{q=1}^p (w_q · x) can be written as the sum of 2^p functions of the form ±(w · x)^p. To do so note that, with ε_q i.i.d. random signs,

E[Π_{q=1}^p ε_q · (Σ_{q=1}^p ε_q w_q · x)^p] = E[Π_{q=1}^p ε_q · Σ_{q_1,...,q_p=1}^p Π_{r=1}^p (ε_{q_r} w_{q_r} · x)] = p! · Π_{q=1}^p (w_q · x).

Lemma 3 For any tensor T of order p one has: ‖T‖_{op,*} ≤ d^{p-1} · ‖T‖_op.
The above result and its proof are directly taken from Li et al. (2018). We only repeat the argument here for the sake of completeness.

Proof. Note that the decomposition (10) is orthogonal, and thus for any tensor S of order p one has:

⟨T, S⟩ ≤ √(d^{p-1}) · √(Σ_{i_1,...,i_{p-1}=1}^d ⟨e_{i_1} ⊗ ... ⊗ e_{i_{p-1}} ⊗ T[i_1, ..., i_{p-1}, 1:d], S⟩^2) ≤ √(d^{p-1}) · √(‖S‖_op^2 · Σ_{i_1,...,i_{p-1}=1}^d ‖T[i_1, ..., i_{p-1}, 1:d]‖^2) = d^{(p-1)/2} · ‖S‖_op · ‖T‖_F.

Thus one has ‖T‖_{op,*} ≤ d^{(p-1)/2} · ‖T‖_F. By duality one also has ‖T‖_op ≥ d^{-(p-1)/2} · ‖T‖_F, which concludes the proof.
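For p = 2, the decomposition of Lemma 2 is just the eigendecomposition, and in fact d rank-1 terms suffice (better than the generic 2^p d^{p-1} count). A quick numerical check of our own:

```python
import numpy as np

rng = np.random.default_rng(6)
d = 8
A = rng.normal(size=(d, d))
T = (A + A.T) / 2                     # a symmetric tensor of order p = 2

# <T, x^{tensor 2}> = sum_l lambda_l (v_l . x)^2 = sum_l xi_l (w_l . x)^2
lam, V = np.linalg.eigh(T)
xi = np.sign(lam)
Wdec = np.sqrt(np.abs(lam)) * V       # column l is w_l = sqrt(|lambda_l|) v_l

x = rng.normal(size=d)
lhs = float(x @ T @ x)
rhs = float(np.sum(xi * (Wdec.T @ x) ** 2))
```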

B RESULTS ON RANDOM TENSORS

Lemma 4 For any fixed x ∈ S^{d-1} and generic data, with probability at least 1 − C exp(−c_p τ) one has: |Σ_{i=1}^n y_i (x_i · x)^p| ≤ C_p √(nτ/d^p).

Proof. Using [Theorem 1, Paouris et al. (2017)] one has, for any fixed x ∈ S^{d-1} and τ ≤ n,

P(|d^{p/2} Σ_{i=1}^n |x_i · x|^p − n σ_p| > C_p √(nτ)) ≤ C exp(−c_p τ),

where σ_p denotes the p-th moment of the standard Gaussian. Let us denote n_+ = |{i ∈ [n] : y_i = +1}| and T_+ = Σ_{i : y_i = +1} x_i^{⊗p}, and similarly for n_-, T_-. Now with probability 1 − C exp(−cτ) (with respect to the randomness of the y_i's) we have |n_+ − n_-| ≤ √(nτ). Thus combining the two above displays we obtain, with probability at least 1 − C exp(−c_p τ),

d^{p/2} |Σ_{i : y_i = +1} |x_i · x|^p − Σ_{i : y_i = -1} |x_i · x|^p| ≤ C_p (√(nτ) + σ_p |n_+ − n_-|) ≤ C_p √(nτ),

which concludes the proof after dividing by d^{p/2} (here we used that p is even, so that (x_i · x)^p = |x_i · x|^p).

Lemma 5 For generic data, with probability at least 1 − C exp(−c_p d) one has: ‖Σ_{i=1}^n y_i x_i^{⊗p}‖_op ≤ C_p √(n/d^{p-1}).

Proof. Let N be a (1/2p)-net of S^{d-1} (in particular |N| ≤ (Cp)^d). By a union bound and Lemma 4 one has:

P(∃x ∈ N : |Σ_{i=1}^n y_i (x_i · x)^p| > C_p √(n/d^{p-1})) ≤ C exp(−c_p d). (11)

Let T = Σ_{i=1}^n y_i x_i^{⊗p}. Note that T is symmetric, and thus thanks to (9) and Lemma 1 one has: ‖T‖_op ≤ max_{x∈N} ⟨T, x^{⊗p}⟩ + (1/2) ‖T‖_op, and in particular ‖T‖_op ≤ 2 max_{x∈N} ⟨T, x^{⊗p}⟩, which together with (11) concludes the proof.

C PROOF OF THEOREM 1

Let m = n/k (by assumption m ≤ c·d/log(n)) and assume it is an integer. Let us choose m points with the same label, say the points x_1, ..., x_m with label +1. As in Section 3.1, let w ∈ R^d be the minimal norm vector that satisfies w · x_i = 1 for i ∈ [m]; thus, as we proved there, with probability at least 1 − exp(C − cd) one has ‖w‖ ≤ √(2m). Crucially for the end of the proof, also note that the distribution of w is rotationally invariant. Next observe that with probability at least 1 − 1/n^C (with respect to the sampling of x_{m+1}, ..., x_n) one has

max_{i ∈ {m+1,...,n}} |w · x_i| ≤ C · ‖w‖ · √(log(n)/d) ≤ 1/2.

In particular the cap C := {x ∈ S^{d-1} : w · x ≥ 1/2} contains x_1, ..., x_m but does not contain any x_i, i > m. Thus the neuron x ↦ 2 · ReLU(w · x − 1/2) computes the value 1 at the points x_1, ..., x_m and the value 0 at the rest of the training set. One can now repeat this process and build the neurons w_1, ..., w_k (all with norm ≤ √(2m)), so that (with well-chosen signs ξ_ℓ ∈ {-1, 1}) the data is perfectly fitted by the function:

f(x) = Σ_{ℓ=1}^k 2 · ξ_ℓ · ReLU(w_ℓ · x − 1/2).

It only remains to estimate the Lipschitz constant. Note that if a point x ∈ S^{d-1} activates a certain subset A ⊂ {1, ..., k} of the neurons, then the gradient at this point is Σ_{ℓ∈A} w̃_ℓ with w̃_ℓ = 2 ξ_ℓ w_ℓ. Using that the w_ℓ are rotationally invariant, one also has with probability at least 1 − Cn exp(−cd) In particular we conclude that with a = C m log(d), the probability that a fixed point on the sphere activates more than a neurons is exponentially small in d log(d) (recall that m log(k) ≤ c d by assumption). Thus we can conclude via a union bound on an ε-net that the same holds for the entire sphere simultaneously. This concludes the proof.
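The grouped-cap construction is easy to instantiate numerically. In the sketch below (our own illustration) we assign block labels rather than random ones, which by exchangeability of generic data loses no generality for this construction; the sizes and seed are also our choices:

```python
import numpy as np

rng = np.random.default_rng(7)
n, d, k = 200, 4000, 40
m = n // k                                   # m = n/k = 5 points per neuron
X = rng.normal(size=(n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)
y = np.repeat([1.0, -1.0], n // 2)           # block labels (see lead-in)

relu = lambda t: np.maximum(t, 0.0)

Wn, xi = [], []
for s in range(0, n, m):
    Xg = X[s:s + m]                          # a group of m same-label points
    w = Xg.T @ np.linalg.solve(Xg @ Xg.T, np.ones(m))  # min-norm w with w . x_i = 1
    Wn.append(w)
    xi.append(y[s])
Wn, xi = np.array(Wn), np.array(xi)

def f(x):
    return float(np.sum(2.0 * xi * relu(Wn @ x - 0.5)))

preds = np.array([f(X[i]) for i in range(n)])
```

With d this large, each neuron's cap isolates exactly its own group, so the fit is exact and each ‖w_ℓ‖ stays below √(2m).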

D FURTHER RESULTS AROUND CONJECTURE 1 D.1 POLYNOMIAL ACTIVATION

We now observe that one can generalize Theorem 5 to handle biases (the parameters b_ℓ in (1)), and in fact even a general polynomial activation function. Indeed, observe that any polynomial of w · x + b must also be a polynomial in w · x, albeit with different coefficients.

Theorem 6 Let ψ(t) = Σ_{q=0}^p α_q t^q and assume that we have f ∈ F_k(ψ) such that f(x_i) = y_i, ∀i ∈ [n]. Then, for generic data, with probability at least 1 − C exp(−c_p d) one must have Lip_{{x : ‖x‖ ≤ 1}}(f) ≥ c_p √(n/d^{p-1}).

Proof. Note that for f ∈ F_k(ψ) there exist tensors T_0, ..., T_p, such that T_q is a tensor of order q, and f can be written as: f(x) = Σ_{q=0}^p ⟨T_q, x^{⊗q}⟩. Denoting Ω_q = Σ_{i=1}^n y_i x_i^{⊗q}, one has n = Σ_{i=1}^n y_i f(x_i) = Σ_{q=0}^p ⟨T_q, Ω_q⟩, and thus there exists q ∈ {1, ..., p} such that ⟨T_q, Ω_q⟩ ≥ c_p n (we ignore the term q = 0 by considering the largest balanced subset of the data, i.e., we assume Σ_{i=1}^n y_i = 0). Now one can repeat the proof of Theorem 5 to obtain that with probability at least 1 − C exp(−c_p d), one has ‖T_q‖_op ≥ c_p √(n/d^{p-1}). It only remains to observe that the Lipschitz constant of f on the unit ball is lower bounded by c_p ‖T_q‖_op. As we mentioned in Section 4.3, without loss of generality we can assume T_q is symmetric, and thus by (9) there exists x ∈ S^{d-1} such that ‖T_q‖_op = ⟨T_q, x^{⊗q}⟩. Now consider the univariate polynomial P(t) = f(tx). By the Markov brothers' inequality (applied to P') one has max_{t∈[-1,1]} |P'(t)| ≥ c_p |P^{(q)}(0)| = c_p · q! · |⟨T_q, x^{⊗q}⟩| = c_p · q! · ‖T_q‖_op, thus concluding the proof.

D.2 QUADRATIC ACTIVATION

In Section 4.3 we obtained a lower bound for tensor networks that matches Conjecture 1 only when the rank of the corresponding tensor is maximal. Here we show that for quadratic networks (i.e., p = 2) we can match Conjecture 1, and in fact even obtain a better bound, for any rank k:

Theorem 7 Assume that we have a matrix T ∈ R^{d×d} with rank k such that: ⟨T, x_i^{⊗2}⟩ = y_i, ∀i ∈ [n]. Then, for generic data, with probability at least 1 − C exp(−cd), one must have ‖T‖_op ≥ c √(nd)/k (≥ c √(n/k)).



[1] We do not quantify the "with high probability" in our conjecture. We believe the conjecture to be true except for an event of exponentially small probability with respect to the sampling of a generic data set, but even proving that the statement is true with strictly positive probability would be extremely interesting.

[2] We expect the same lower bound to hold even if one only asks f to approximately fit the data. In fact our provable variants of Conjecture 1 are based on proofs that are robust to only assuming an approximately fitting f.

[3] It would be interesting to study whether allowing data-dependent non-linearities could affect the conclusion of our conjectures. Such a study would need to crucially rely on having only one hidden layer, as it is known from the Kolmogorov-Arnold theorem that with two hidden layers and data-dependent non-linearities one can obtain perfect approximation properties with k ≤ d (albeit the non-linearities are non-smooth).

[4] Note that without loss of generality one can assume T to be symmetric, since we only consider how it acts on x^{⊗p}. For symmetric tensors one has that the Lipschitz constant on the unit ball is lower bounded by the operator norm of T thanks to (9).



Figure 1: See Section 5 for the details of this experiment.

‖Σ_{i=1}^n x̃_i x̃_i^T‖_op ≤ C n/d (here we use n ≥ d too). Moreover we have ‖P^T P‖_{op,*} = Tr(P^T P) = Tr(P P^T) = k. Thus we have ⟨Σ_{i=1}^n x̃_i x̃_i^T, P^T P⟩_HS ≤ C·n·k/d, so that with the above display one obtains n / Lip(f) ≤ n √(C k/d), which concludes the proof.

Figure 2: Scatter plot of the maximum random gradient with respect to √(n/k), with 906 data points (Experiment 1)

, and it only remains to apply [Appendix B, Lemma 5], which states that with probability at least 1 − C exp(−c_p d) one has ‖Ω‖_op ≤ C_p √(n/d^{p-1}).

that ‖Σ_{ℓ∈A} w̃_ℓ‖^2 ≤ C · |A| · m for all A ⊂ {1, ..., k}. Thus it only remains to control how large A can be. We show below that |A| ≤ C m log(d) with probability at least 1 − C exp(−cd log(d)), which will conclude the proof. If x activates neuron ℓ then w_ℓ · x ≥ 1/2 ≥ ‖w_ℓ‖ / (4√m). Now note that for any fixed x ∈ S^{d-1} and fixed A ⊂ [k],

P(∀ℓ ∈ A, w_ℓ · x ≥ ‖w_ℓ‖ / (4√m)) ≤ C exp(−c |A| d/m),

so that

P(∃A ⊂ [k] : |A| = a and ∀ℓ ∈ A, w_ℓ · x ≥ ‖w_ℓ‖ / (4√m)) ≤ exp(C a log(k) − c a d/m).

REFERENCES

Bruce Arie Reznick. Sums of even powers of real linear forms, volume 463. American Mathematical Society, 1992.

Ludwig Schmidt, Shibani Santurkar, Dimitris Tsipras, Kunal Talwar, and Aleksander Madry. Adversarially robust generalization requires more data. In Advances in Neural Information Processing Systems, 2018.

Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. In International Conference on Learning Representations, 2013.

Roman Vershynin. Introduction to the non-asymptotic analysis of random matrices. In Compressed Sensing: Theory and Practice, pages 210-268. Cambridge University Press, 2012.

Chulhee Yun, Suvrit Sra, and Ali Jadbabaie. Small relu networks are powerful memorizers: a tight analysis of memorization capacity. In Advances in Neural Information Processing Systems, pages 15532-15543, 2019.


Proof. The proof is exactly the same as for Theorem 5, except that in (8), instead of using Lemma 3 we use the fact that for a matrix T of rank k one has: ‖T‖_{op,*} ≤ k · ‖T‖_op.

