A LAW OF ROBUSTNESS FOR TWO-LAYERS NEURAL NETWORKS

Abstract

We initiate the study of the inherent tradeoffs between the size of a neural network and its robustness, as measured by its Lipschitz constant. We make a precise conjecture that, for any Lipschitz activation function and for most datasets, any two-layers neural network with k neurons that perfectly fits the data must have Lipschitz constant larger (up to a constant) than √(n/k), where n is the number of datapoints. In particular, this conjecture implies that overparametrization is necessary for robustness, since it means that one needs roughly one neuron per datapoint to ensure an O(1)-Lipschitz network, while mere data fitting of d-dimensional data requires only one neuron per d datapoints. We prove a weaker version of this conjecture when the Lipschitz constant is replaced by an upper bound on it based on the spectral norm of the weight matrix. We also prove the conjecture in the high-dimensional regime n ≈ d (which we also refer to as the undercomplete case, since only k ≤ d is relevant there). Finally, we prove the conjecture for polynomial activation functions of degree p when n ≈ d^p. We complement these findings with experimental evidence supporting the conjecture.

1. INTRODUCTION

We study two-layers neural networks with inputs in R^d, k neurons, and Lipschitz non-linearity ψ : R → R. These are functions of the form:

    x ↦ Σ_{ℓ=1}^{k} a_ℓ ψ(w_ℓ · x + b_ℓ),   (1)

with a_ℓ, b_ℓ ∈ R and w_ℓ ∈ R^d for any ℓ ∈ [k]. We denote by F_k(ψ) the set of functions of the form (1). When k is large enough and ψ is non-polynomial, this set of functions can be used to fit any given data set (Cybenko, 1989; Leshno et al., 1993). That is, given a data set (x_i, y_i)_{i∈[n]} ∈ (R^d × R)^n, one can find f ∈ F_k(ψ) such that f(x_i) = y_i for all i ∈ [n]. In a variety of scenarios one is furthermore interested in fitting the data smoothly. For example, in machine learning, the data fitting model f is used to make predictions at unseen points x ∉ {x_1, . . . , x_n}. It is reasonable to ask for these predictions to be stable, that is, a small perturbation of x should result in a small perturbation of f(x). A natural question is: how "costly" is this stability restriction compared to mere data fitting? In practice it seems much harder to find robust models for large scale problems, as first evidenced in the seminal paper (Szegedy et al., 2013). In theory, the "cost" of finding robust models has been investigated from a computational complexity perspective in (Bubeck et al., 2019), from a statistical perspective in (Schmidt et al., 2018), and more generally from a model complexity perspective in (Degwekar et al., 2019; Raghunathan et al., 2019; Allen-Zhu and Li, 2020). We propose here a different angle of study within the broad model complexity perspective: does a model have to be larger for it to be robust? Empirical evidence (e.g., (Goodfellow et al., 2015; Madry et al., 2018)) suggests that bigger models (also known as "overparametrization") do indeed help with robustness.
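As a concrete illustration of the function class (1), the following is a minimal NumPy sketch (the function name and argument layout are ours, not part of the paper) of evaluating a two-layers network with activation ψ:

```python
import numpy as np

def two_layer_net(x, a, W, b, psi=lambda t: np.maximum(t, 0.0)):
    """Evaluate f(x) = sum_{l=1}^{k} a_l * psi(<w_l, x> + b_l).

    x : (d,) input vector
    a, b : (k,) outer weights and biases
    W : (k, d) matrix whose rows are the inner weights w_l
    psi : Lipschitz activation (ReLU by default)
    """
    return float(a @ psi(W @ x + b))
```

With ψ the identity, f reduces to the affine map x ↦ a · (W x + b), which is a convenient sanity check.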
Our main contribution is a conjecture (Conjecture 1 and Conjecture 2) on the precise tradeoffs between the size of the model (i.e., the number of neurons k) and robustness (i.e., the Lipschitz constant of the data fitting model f ∈ F_k(ψ)) for generic data sets. We say that a data set (x_i, y_i)_{i∈[n]} is generic if it is i.i.d. with x_i uniform (or approximately so, see below) on the sphere S^{d-1} = {x ∈ R^d : ‖x‖ = 1} and y_i uniform on {-1, +1}. We give the precise conjecture in Section 2. We prove several weaker versions of Conjecture 1 and Conjecture 2 in Section 4 and Section 3, respectively. We also give empirical evidence for the conjecture in Section 5.

A corollary of our conjecture. A key fact about generic data, established in Baum (1988); Yun et al. (2019); Bubeck et al. (2020), is that one can memorize arbitrary labels with k ≈ n/d, that is, merely one neuron per d datapoints. Our conjecture implies that such optimal-size neural networks cannot be robust, in the sense that their Lipschitz constant must be of order √d. The conjecture also states that to be robust (i.e., attain Lipschitz constant O(1)) one must necessarily have k ≈ n, that is, roughly each datapoint must have its own neuron. Therefore, we obtain a tradeoff between size and robustness: to make the network robust, it needs to be d times larger than for mere data fitting. We illustrate these two cases in Figure 1. We train a neural network to fit generic data, and plot the maximum gradient norm over several randomly drawn points (a proxy for the Lipschitz constant) for various values of √d, when either k = n (blue dots) or k = 10n/d (red dots). As predicted, for the large neural network (k = n) the Lipschitz constant remains roughly constant, while for the optimally-sized one (k = 10n/d) the Lipschitz constant increases roughly linearly in √d.

Notation. For Ω ⊂ R^d we define Lip_Ω(f) = sup_{x ≠ x' ∈ Ω} |f(x) - f(x')| / ‖x - x'‖ (if Ω = R^d we omit the subscript and write Lip(f)), where ‖·‖ denotes the Euclidean norm.
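The Lipschitz proxy used in the experiment above (the maximum gradient norm over randomly drawn points) can be sketched as follows; this is our own minimal finite-difference version, not the authors' experimental code, and the function name and parameters are ours:

```python
import numpy as np

def lipschitz_proxy(f, d, n_probe=200, eps=1e-4, seed=0):
    """Lower-bound Lip(f) by the largest finite-difference gradient
    norm of f over n_probe random points on the sphere S^{d-1}."""
    rng = np.random.default_rng(seed)
    best = 0.0
    for _ in range(n_probe):
        x = rng.standard_normal(d)
        x /= np.linalg.norm(x)  # project onto the unit sphere
        # One-sided finite-difference estimate of the gradient at x.
        g = np.array([(f(x + eps * e) - f(x)) / eps for e in np.eye(d)])
        best = max(best, float(np.linalg.norm(g)))
    return best
```

Since ‖∇f(x)‖ ≤ Lip(f) at every differentiable point, this quantity never overestimates the true Lipschitz constant; it is exact, for instance, on linear functions.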
For matrices we use ‖·‖_op, ‖·‖_{op,*}, ‖·‖_F and ⟨·,·⟩ for, respectively, the operator norm, the nuclear norm (sum of singular values), the Frobenius norm, and the Frobenius inner product. We also use these notations for tensors of higher order; see Appendix A for more details on tensors. We denote by c > 0 and C > 0 universal numerical constants, respectively small enough and large enough, whose values can change between occurrences. Similarly, by c_p > 0 and C_p > 0 we denote constants depending only on the parameter p. We also write ReLU(t) = max(t, 0) for the rectified linear unit.

Generic data. We give some flexibility in our definition of "generic data" in order to focus on the essence of the problem rather than on technical details. Namely, in addition to the spherical model mentioned above, where the x_i are i.i.d. uniform on the sphere S^{d-1} = {x ∈ R^d : ‖x‖ = 1}, we also consider the very closely related model where the x_i are i.i.d. from a centered Gaussian with covariance (1/d) I_d (in particular E[‖x_i‖²] = 1, and in fact ‖x_i‖ is tightly concentrated around 1). In both cases we take the y_i to be i.i.d. random signs. We say that a property holds with high probability for generic data if it holds with high probability either for the spherical model or for the Gaussian model.
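The two generic-data models can be sampled as follows; this is a minimal NumPy sketch for illustration (the function name and signature are ours):

```python
import numpy as np

def generic_data(n, d, model="sphere", seed=0):
    """Sample a generic data set (x_i, y_i)_{i in [n]}.

    model="sphere":   x_i i.i.d. uniform on S^{d-1} = {x : ||x|| = 1}
    model="gaussian": x_i i.i.d. N(0, (1/d) I_d), so E[||x_i||^2] = 1
    In both cases y_i are i.i.d. uniform random signs in {-1, +1}.
    """
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, d))
    if model == "sphere":
        # Normalizing a standard Gaussian vector gives a uniform
        # point on the sphere.
        X /= np.linalg.norm(X, axis=1, keepdims=True)
    elif model == "gaussian":
        X /= np.sqrt(d)  # covariance (1/d) I_d
    else:
        raise ValueError(f"unknown model: {model}")
    y = rng.choice([-1.0, 1.0], size=n)
    return X, y
```

In the Gaussian model the norms ‖x_i‖ concentrate around 1 for large d, which is why the two models are essentially interchangeable.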



Figure 1: See Section 5 for the details of this experiment.


