A LAW OF ROBUSTNESS FOR TWO-LAYERS NEURAL NETWORKS

Abstract

We initiate the study of the inherent tradeoffs between the size of a neural network and its robustness, as measured by its Lipschitz constant. We make a precise conjecture that, for any Lipschitz activation function and for most datasets, any two-layers neural network with $k$ neurons that perfectly fits the data must have its Lipschitz constant larger (up to a constant) than $\sqrt{n/k}$, where $n$ is the number of datapoints. In particular, this conjecture implies that overparametrization is necessary for robustness, since it means that one needs roughly one neuron per datapoint to ensure a $O(1)$-Lipschitz network, while mere data fitting of $d$-dimensional data requires only one neuron per $d$ datapoints. We prove a weaker version of this conjecture when the Lipschitz constant is replaced by an upper bound on it based on the spectral norm of the weight matrix. We also prove the conjecture in the high-dimensional regime $n \approx d$ (which we also refer to as the undercomplete case, since only $k \le d$ is relevant here). Finally we prove the conjecture for polynomial activation functions of degree $p$ when $n \approx d^p$. We complement these findings with experimental evidence supporting the conjecture.
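To make the claimed tradeoff explicit, the following back-of-the-envelope rearrangement of the conjectured bound (our illustration, not an additional result) shows how the required network size scales with the target Lipschitz constant:

$$\mathrm{Lip}(f) \;\gtrsim\; \sqrt{\frac{n}{k}} \quad\Longleftrightarrow\quad k \;\gtrsim\; \frac{n}{\mathrm{Lip}(f)^2},$$

so achieving $\mathrm{Lip}(f) = O(1)$ forces $k = \Omega(n)$ neurons, whereas mere interpolation of generic $d$-dimensional data is already possible with $k \approx n/d$ neurons.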

1. INTRODUCTION

We study two-layers neural networks with inputs in $\mathbb{R}^d$, $k$ neurons, and Lipschitz non-linearity $\psi : \mathbb{R} \to \mathbb{R}$. These are functions of the form:

$$x \mapsto \sum_{\ell=1}^{k} a_\ell \, \psi(w_\ell \cdot x + b_\ell), \tag{1}$$

with $a_\ell, b_\ell \in \mathbb{R}$ and $w_\ell \in \mathbb{R}^d$ for any $\ell \in [k]$. We denote by $\mathcal{F}_k(\psi)$ the set of functions of the form (1). When $k$ is large enough and $\psi$ is non-polynomial, this set of functions can be used to fit any given dataset (Cybenko, 1989; Leshno et al., 1993). That is, given a dataset $(x_i, y_i)_{i \in [n]} \in (\mathbb{R}^d \times \mathbb{R})^n$, one can find $f \in \mathcal{F}_k(\psi)$ such that

$$f(x_i) = y_i, \quad \forall i \in [n].$$

In a variety of scenarios one is furthermore interested in fitting the data smoothly. For example, in machine learning, the data-fitting model $f$ is used to make predictions at unseen points $x \notin \{x_1, \ldots, x_n\}$. It is reasonable to ask for these predictions to be stable, that is, a small perturbation of $x$ should result in a small perturbation of $f(x)$. A natural question is: how "costly" is this stability restriction compared to mere data fitting?

In practice it seems much harder to find robust models for large-scale problems, as first evidenced in the seminal paper (Szegedy et al., 2013). In theory, the "cost" of finding robust models has been investigated from a computational complexity perspective in (Bubeck et al., 2019), from a statistical perspective in (Schmidt et al., 2018), and more generally from a model complexity perspective in (Degwekar et al., 2019; Raghunathan et al., 2019; Allen-Zhu and Li, 2020). We propose here a different angle of study within the broad model complexity perspective: does a model have to be larger for it to be robust? Empirical evidence (e.g., (Goodfellow et al., 2015; Madry et al., 2018)) suggests that bigger models (also known as "overparametrization") do indeed help with robustness. Our main contribution is a conjecture (Conjecture 1 and Conjecture 2) on the precise tradeoff between the size of the model (i.e., the number of neurons $k$) and its robustness (i.e., the Lipschitz constant of $f$).
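To fix ideas, here is a minimal numerical sketch (our illustration, not code from the paper's experiments) of a network of the form (1), together with two standard proxies for its Lipschitz constant: the spectral-norm upper bound $\|a\|_2 \|W\|_{\mathrm{op}}$, of the kind mentioned in the abstract, and an empirical lower bound from gradient norms at random inputs. The choices of $d$, $k$, the random weights, and the ReLU activation are arbitrary, made only for the example.

```python
# Sketch only: a two-layer network as in Eq. (1),
#   f(x) = sum_l a_l * psi(w_l . x + b_l),
# with a 1-Lipschitz activation, plus two proxies for Lip(f).
import numpy as np

rng = np.random.default_rng(0)
d, k = 50, 200                                # input dimension, neurons

W = rng.standard_normal((k, d)) / np.sqrt(d)  # rows are the w_l
b = rng.standard_normal(k)
a = rng.standard_normal(k) / np.sqrt(k)

def psi(z):
    return np.maximum(z, 0.0)                 # ReLU, 1-Lipschitz

def f(x):
    # f(x) = sum_l a_l * psi(w_l . x + b_l)
    return a @ psi(W @ x + b)

def grad_f(x):
    # Almost-everywhere gradient for ReLU: W^T (psi'(Wx+b) * a)
    active = (W @ x + b > 0).astype(float)
    return W.T @ (active * a)

# (i) Upper bound: Lip(f) <= ||a||_2 * ||W||_op, valid since the
# gradient is W^T D a with a diagonal D whose entries lie in [0, 1].
upper = np.linalg.norm(a) * np.linalg.norm(W, ord=2)

# (ii) Lower bound: largest gradient norm seen at random inputs.
lower = max(np.linalg.norm(grad_f(rng.standard_normal(d)))
            for _ in range(1000))

print(f"spectral upper bound: {upper:.3f}, empirical lower bound: {lower:.3f}")
```

The gap between the two printed numbers illustrates why the spectral-norm quantity is only a surrogate: it upper bounds the true Lipschitz constant, which is why the weaker version of the conjecture proved for it does not immediately imply the conjecture itself.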

