ON THE (NON-)ROBUSTNESS OF TWO-LAYER NEURAL NETWORKS IN DIFFERENT LEARNING REGIMES

Abstract

Neural networks are known to be highly sensitive to adversarial examples. Such examples may arise from different factors, such as random initialization or spurious correlations in the learning problem. To better understand these factors, we provide a precise study of adversarial robustness in different scenarios, from initialization to the end of training in different regimes, as well as intermediate scenarios where initialization still plays a role due to "lazy" training. We consider overparameterized networks in high dimensions with quadratic targets and infinite samples. Our analysis identifies new tradeoffs between approximation (as measured via test error) and robustness, whereby robustness can only get worse when test error improves, and vice versa. We also show how linearized lazy training regimes can worsen robustness due to improperly scaled random initialization. Our theoretical results are illustrated with numerical experiments.

1. INTRODUCTION

Deep neural networks have enjoyed tremendous practical success in many applications involving high-dimensional data, such as images. Yet, such models are highly sensitive to small perturbations known as adversarial examples (Szegedy et al., 2013), which are often imperceptible to humans. While strategies such as adversarial training (Madry et al., 2018) can mitigate this vulnerability empirically, the situation remains highly problematic for safety-critical applications like autonomous vehicles and healthcare, and motivates a better theoretical understanding of the underlying mechanisms.

Various factors are known to contribute to adversarial examples. In linear models, features that are only weakly correlated with the label, possibly in a spurious manner, may improve prediction accuracy but induce large sensitivity to adversarial perturbations (Tsipras et al., 2019; Sanyal et al., 2021). On the other hand, common neural networks may exhibit high sensitivity to adversarial perturbations at random initialization (Simon-Gabriel et al., 2019; Daniely & Shacham, 2020; Bubeck et al., 2021). While such settings already capture interesting phenomena behind adversarial examples, they are restricted to either trained linear models or nonlinear networks at initialization. Trained, nonlinear networks may thus involve multiple sources of vulnerability arising from initialization, training algorithms, and the data distribution. Capturing the interaction between these components is crucial for a more complete understanding of adversarial robustness.

In this paper, we study the interplay between these factors by analyzing approximation (i.e., how well the model fits the data) and robustness properties (i.e., the sensitivity of the model's outputs w.r.t. perturbations in test data) of two-layer neural networks in different learning regimes.
We consider two-layer finite-width networks in high dimensions with infinite training data, in asymptotic regimes inspired by Ghorbani et al. (2019). This allows us to focus on the effects inherent to the data distribution, the inductive bias of the architecture (choice of activation function, number of hidden neurons per input dimension, etc.), and training algorithms, while side-stepping issues due to finite samples. Following Ghorbani et al. (2019), we focus on nonlinear regression settings with structured quadratic target functions, and consider commonly studied training regimes for two-layer networks, namely (i) neural networks with quadratic activations trained with stochastic gradient descent on the population risk, which finds the global optimum; (ii) random features (RF, Rahimi & Recht, 2008); (iii) neural tangent (NT, Jacot et al., 2018); as well as (iv) "lazy" training regimes (Chizat et al., 2019) related to RF and NT, where we consider a first-order Taylor expansion of the network around its random initialization.
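To make the contrast between these regimes concrete, here is a minimal NumPy sketch (with hypothetical, small dimensions; this is an illustration, not the paper's experimental code) of a two-layer network with quadratic activation, its random-features variant where the first layer is frozen at initialization, and its first-order Taylor linearization around initialization as in lazy training:

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 20, 100                               # input dim, hidden width (illustrative)
W0 = rng.normal(size=(m, d)) / np.sqrt(d)    # first-layer weights at initialization
a0 = rng.normal(size=m) / np.sqrt(m)         # second-layer weights at initialization

def f(x, W, a):
    """Two-layer net with quadratic activation: f(x) = sum_j a_j (w_j . x)^2."""
    return a @ (W @ x) ** 2

def f_rf(x, a):
    """Random-features regime: first layer frozen at W0, only a is trained."""
    return f(x, W0, a)

def f_lazy(x, W):
    """Lazy/NT-style regime: first-order Taylor expansion of f in W around W0,
    with the second layer fixed at a0:
        f(x; W, a0) ~ f(x; W0, a0) + <grad_W f(x; W0, a0), W - W0>.
    For the quadratic activation, grad wrt row w_j is 2 a0_j (w_j . x) x."""
    pre = W0 @ x                                      # preactivations at init, shape (m,)
    grad_W = 2 * (a0 * pre)[:, None] * x[None, :]     # gradient of f wrt W at init
    return f(x, W0, a0) + np.sum(grad_W * (W - W0))
```

Note that `f_lazy` is linear in the trainable parameters `W`, so training it reduces to a kernel/linear regression problem; the random initialization `W0` still enters the predictions, which is why initialization scale can affect robustness in this regime.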

