ON THE (NON-)ROBUSTNESS OF TWO-LAYER NEURAL NETWORKS IN DIFFERENT LEARNING REGIMES

Abstract

Neural networks are known to be highly sensitive to adversarial examples. These may arise due to different factors, such as random initialization, or spurious correlations in the learning problem. To better understand these factors, we provide a precise study of the adversarial robustness in different scenarios, from initialization to the end of training in different regimes, as well as intermediate scenarios, where initialization still plays a role due to "lazy" training. We consider overparameterized networks in high dimensions with quadratic targets and infinite samples. Our analysis allows us to identify new tradeoffs between approximation (as measured via test error) and robustness, whereby robustness can only get worse when test error improves, and vice versa. We also show how linearized lazy training regimes can worsen robustness, due to improperly scaled random initialization. Our theoretical results are illustrated with numerical experiments.

1. INTRODUCTION

Deep neural networks have enjoyed tremendous practical success in many applications involving high-dimensional data, such as images. Yet, such models are highly sensitive to small perturbations known as adversarial examples (Szegedy et al., 2013), which are often imperceptible to humans. While various strategies such as adversarial training (Madry et al., 2018) can mitigate this vulnerability empirically, the situation remains highly problematic for many safety-critical applications like autonomous vehicles and healthcare, and motivates a better theoretical understanding of what mechanisms may be causing this. Various factors are known to contribute to adversarial examples. In linear models, features that are only weakly correlated with the label, possibly in a spurious manner, may improve prediction accuracy but induce large sensitivity to adversarial perturbations (Tsipras et al., 2019; Sanyal et al., 2021). On the other hand, common neural networks may exhibit high sensitivity to adversarial perturbations at random initialization (Simon-Gabriel et al., 2019; Daniely & Shacham, 2020; Bubeck et al., 2021). While such settings already capture interesting phenomena behind adversarial examples, they are restricted to either trained linear models or nonlinear networks at initialization. Trained, nonlinear networks may thus involve multiple sources of vulnerability arising from initialization, training algorithms, as well as the data distribution. Capturing the interaction between these different components is of crucial importance for a more complete understanding of adversarial robustness. In this paper, we study the interplay between these different factors by analyzing the approximation properties (i.e., how well the model fits the data) and robustness properties (i.e., the sensitivity of the model's outputs w.r.t. perturbations of test data) of two-layer neural networks in different learning regimes.
We consider two-layer finite-width networks in high dimensions with infinite training data, in asymptotic regimes inspired by Ghorbani et al. (2019). This allows us to focus on the effects inherent to the data distribution and the inductive bias of the architecture (choice of activation function, number of hidden neurons per input dimension, etc.) and training algorithms, while side-stepping issues due to finite samples. Following Ghorbani et al. (2019), we focus on nonlinear regression settings with structured quadratic target functions, and consider commonly studied training regimes for two-layer networks, namely (i) neural networks with quadratic activations trained with stochastic gradient descent on the population risk, which finds the global optimum; (ii) random features (RF, Rahimi & Recht, 2008); (iii) neural tangent (NT, Jacot et al., 2018); as well as (iv) "lazy" training (Chizat et al., 2019) regimes for basic RF and NT, where we consider a first-order Taylor expansion of the network around initialization, including the initialization term itself (in contrast to the standard RF and NT regimes, which ignore the offset due to initialization). Note that, though the theoretical setting is inspired by Ghorbani et al. (2019), our work differs from theirs in its focus and scope. Indeed, we are concerned with robustness and its interplay with approximation, in different learning regimes, while they are only concerned with approximation. We also note that the lazy/linearized regimes we study as part of this work were not considered by Ghorbani et al. (2019), and help us highlight the impact of initialization on robustness. Note that, unlike the other regimes, SGD exhibits a kind of feature learning, whereby the first-layer weights learn specific directions by approximating the matrix B.
In particular, this involves non-trivial feature selection via non-linear learning, while the other regimes (RF and NT) are linear estimators on top of non-linear but fixed features. Main contributions. Our work establishes theoretical results which uncover novel tradeoffs between approximation (as measured via test error) and robustness that are inherent to all the regimes considered. These tradeoffs appear to be due to misalignment between the target function and the input distribution (weight distribution) for random features (Section 4), or to the inductive bias of fully-trained networks (Section 3 and Appendix E). We also show that improperly scaled random initialization can further degrade robustness in lazy/linearized models (Section 5), since the resulting models might inherit the non-robustness already present at random initialization. This raises the question of how small the initialization should be in order to enhance the robustness of the trained model. Our theoretical results are empirically verified with extensive numerical experiments on simulated data. The setting of our work is regression in a student-teacher setup where the student is a two-layer feedforward neural network and the teacher is a quadratic form. We assume access to infinite training data. Thus, the only complexity parameters are the coefficient matrix of the teacher model, the input dimension d, and the width of the neural network m, both assumed to be "large" but proportional to one another. Refer to Section 2 for details. In Appendix I, we also show similar but weaker trade-offs for arbitrary student and teacher models. The infinite-sample setting allows us to focus on the effects inherent to the data distribution and the inductive bias of the architecture (choice of activation function) and different learning regimes, while side-stepping issues due to finite samples and label noise.
Also note that in this infinite-data setting, label noise provably has no influence on the learned model, in all the learning regimes considered. The observation that there is a tradeoff between robustness and approximation, even in this infinite-sample setting, is one of the surprising findings of our work. This complements related works such as (Bubeck et al., 2020b; Bubeck & Sellke, 2021), which show that finite training samples with label noise are a possible source of non-robustness in neural networks. Related works. Various works have theoretically studied adversarial examples and robustness in supervised learning, and the relationship to ordinary test error / accuracy. Tsipras et al. (2019) considers a specific data distribution where good test error implies poor robustness. Shafahi et al. (2018); Mahloujifar et al. (2018); Gilmer et al. (2018); Dohmatob (2019) show that for high-dimensional data distributions which have a concentration property (e.g., multivariate Gaussians, distributions satisfying log-Sobolev inequalities, etc.), an imperfect classifier will admit adversarial examples. On the other hand, Yang et al. (2020) observed empirically that natural images are well-separated, and so locally-Lipschitz classifiers shouldn't suffer any kind of test error vs robustness tradeoff. However, gradient descent is not likely to find such models. Our work studies regression problems with quadratic targets, and shows that there are indeed tradeoffs between test error and robustness which are controlled by the learning algorithm / regime and model. Schmidt et al. (2018); Khim & Loh (2018); Yin et al. (2019); Bhattacharjee et al. (2021); Min et al. (2021b; a) study the sample complexity of robust learning. In contrast, our work focuses on the case of infinite data, so that the only complexity parameters are the input dimension d and the network width m. Gao et al. (2019); Bubeck et al.
(2020b); Bubeck & Sellke (2021) show that over-parameterization may be necessary for robust interpolation in the presence of noise. In contrast, our paper considers a structured problem with noiseless signal and infinite training data, where the network width m and the input dimension d tend to infinity proportionately. In this under-complete asymptotic setting, our results show a systematic and precise tradeoff between approximation (test error) and robustness in different learning regimes. Thus, our work nuances the picture presented by previous works by exhibiting a nontrivial interplay between robustness and test error, which persists even in the case of infinite training data, where the resulting model isn't affected by label noise. Dohmatob (2021); Hassani & Javanmard (2022) study the tradeoffs between interpolation, predictive performance (test error), and robustness for finite-width over-parameterized networks in kernel regimes with noisy linear target functions. In contrast, we consider structured quadratic target functions and compare different learning settings, including SGD optimization in a non-kernel regime, as well as lazy/linearized models. Robustness has also been studied from a causality perspective. For example, Rothenhausler et al. (2021) studies tradeoffs between test error and robustness in linear regression under distributional shifts on the marginal distribution of the covariates. We provide a more detailed discussion of the related work in Appendix A. Remark 1.1. Note that the term "over-parametrization" is not used in our paper in the same sense as in Bubeck et al. (2020b); Bubeck & Sellke (2021); Hassani & Javanmard (2022). In those works, the setup is finite samples (n < ∞), and over-parametrization means that m is substantially larger than n/d, where d is the input dimension and m is the network width (i.e., the number of neurons in the hidden layer).
In our work, we focus on the infinite-sample case (n = ∞), and over-parametrization means that m/d is large. Finally, the findings of Bubeck et al. (2020b); Bubeck & Sellke (2021) (namely, that over-parametrization is beneficial for robustness) and of Hassani & Javanmard (2022) (namely, that over-parametrization is detrimental to robustness) are nuanced by Zhu et al. (2022), which shows that as the width m of a neural network is increased, there is a transition from over-parametrization being detrimental to being beneficial for robustness. More precisely, they derived upper bounds on the robustness error which show that there is a critical value m 0 such that the robustness error is an increasing function of width on the interval [1, m 0 ] (over-parametrization hurts) and a decreasing function of width on the interval [m 0 , ∞) (i.e., over-parametrization becomes beneficial). Overall, the exact role of over-parametrization in robustness remains partly unclear, even though progress is being made on the subject.

2. PRELIMINARIES

Notations. We use standard notations in our manuscript. A cheat sheet is provided in Appendix B.

2.1. THE TEACHER MODEL: A QUADRATIC FORM

We consider the following regression setup proposed by Ghorbani et al. (2019). Let B be a fixed d × d psd matrix and let b 0 ∈ R be a fixed unknown scalar. Consider the quadratic teacher model f ⋆ (x) := x ⊤ Bx + b 0 , for any x ∈ R d . (1) We assume the input data is distributed according to N (0, I d ), the standard Gaussian distribution in d dimensions. Thus, the structure of the problem of learning the teacher model f ⋆ in (1) is completely determined by the unknown d × d matrix B. We assume an idealized scenario where the learner has access to an infinite number of iid samples of the form (x, f ⋆ (x)) with x ∼ P x := N (0, I d ). For simplicity of analysis, we will further assume, as in Ghorbani et al. (2019), that the teacher model f ⋆ defined in (1) is centered, i.e., E x∼N (0,I d ) [f ⋆ (x)] = 0. This forces the offset b 0 = -tr(B).
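As a concrete (hypothetical) instantiation of this setup, the sketch below builds a random psd matrix B, forms the centered teacher f⋆(x) = x⊤Bx − tr(B), and checks by Monte Carlo that it indeed has mean approximately zero under x ∼ N(0, I d); the specific construction of B and the sample sizes are illustrative choices, not prescriptions from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 50

# Hypothetical psd coefficient matrix B (illustrative, not from the paper).
G = rng.standard_normal((d, d))
B = G @ G.T / d
b0 = -np.trace(B)  # centering offset: E_{x ~ N(0, I_d)}[x^T B x] = tr(B)

def f_star(X):
    """Quadratic teacher f*(x) = x^T B x + b0, applied row-wise to X."""
    return np.einsum("ni,ij,nj->n", X, B, X) + b0

# The empirical mean over Gaussian inputs is ~0, up to Monte-Carlo fluctuation.
X = rng.standard_normal((200_000, d))
print(abs(f_star(X).mean()))
```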

2.2. THE STUDENT MODEL: A TWO-LAYER NEURAL NETWORK

Consider a two-layer student neural network f W,z,s (x) := Σ m j=1 z j σ(x ⊤ w j ) + s, (2) where m is the network width, i.e., the number of hidden neurons, each with parameter vector w j ∈ R d , output weights z = (z 1 , . . . , z m ) ∈ R m , and activation function σ : R → R. We define W as the m × d matrix with jth row w j . The scalar s is an offset which we will sometimes set to 0, in which case we will simply write f W,z := f W,z,0 . Note that the teacher model f ⋆ is itself a two-layer neural network with output weights fixed to 1, quadratic activation function, and m = d hidden neurons with parameters W ⋆ := B 1/2 , where B 1/2 is the unique psd matrix such that (B 1/2 ) 2 = B. The aim of learning is to approximate the teacher model f ⋆ as closely as possible with the student model f W,z,s . We will consider the following high-dimensional setup: -Infinite training data, wherein the sample size n is equal to ∞, i.e., the learner has access to the entire data distribution, allowing us to side-step issues linked with finite samples. -Proportionate scaling of the input dimension d and the student network width m, wherein d and m are finite and large of the same order, i.e., m, d → ∞, m/d → ρ ∈ (0, ∞). (3) The parameterization rate ρ (which corresponds to the number of hidden neurons per input dimension) will play a crucial role in our analysis. The case ρ < 1 corresponds to under-parametrization, while ρ > 1 corresponds to over-parametrization. Occasionally, we will also consider the extreme over-parametrization regime corresponding to ρ ≫ 1, or more precisely, the limiting case ρ → ∞.
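The student architecture and the proportionate scaling m/d → ρ can be sketched as follows; the weight scales and the tanh activation are arbitrary illustrative choices, not the regimes analyzed later.

```python
import numpy as np

rng = np.random.default_rng(1)
d, rho = 50, 2.0
m = int(rho * d)  # proportionate scaling: m / d -> rho

# Hypothetical student parameters (scales are illustrative only).
W = rng.standard_normal((m, d)) / np.sqrt(d)  # hidden weights, row j = w_j
z = rng.standard_normal(m) / m                # output weights
s = 0.0                                       # offset

def f_student(X, sigma=np.tanh):
    """f_{W,z,s}(x) = sum_j z_j * sigma(x^T w_j) + s, row-wise over X."""
    return sigma(X @ W.T) @ z + s

out = f_student(rng.standard_normal((5, d)))
print(out.shape)  # one scalar prediction per input row
```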

2.3. METRICS FOR TEST ERROR AND ROBUSTNESS

Test error. The test / approximation error of a student model f : R d → R is defined by ε test (f ) := ∥f -f ⋆ ∥ 2 L 2 (Px) = E Px (f (x) -f ⋆ (x)) 2 , (4) where P x is the distribution of the features. (4) measures how well the student f approximates the teacher model f ⋆ . Unless otherwise explicitly stated, in this article we will always consider the isotropic case where the distribution of the features is P x = N (0, I d ), as in Ghorbani et al. (2019). It will be instructive to compare the test error of f to that of the null predictor which outputs 0 on every input, namely ∥f ⋆ ∥ 2 L 2 (Px) . Thus, consider the normalized test error, ε test (f ) := ε test (f )/∥f ⋆ ∥ 2 L 2 (Px) . This quantity was studied in Ghorbani et al. (2019), where explicit analytic formulae were obtained for two-layer networks in various regimes of interest: networks fully trained by stochastic gradient descent (SGD) on the population risk, random features (RF), and neural tangent (NT). We shall consider these same regimes and establish tradeoffs between the test error and the robustness of the corresponding models. This will paint a picture complementary to Ghorbani et al. (2019). Measure of robustness / sensitivity. We will measure the robustness of a smooth student model f : R d → R (e.g., the two-layer neural net (2)) by what we call its robustness error, defined as its Dirichlet energy ε rob (f ) w.r.t. a random test point x ∼ P x , that is ε rob (f ) := ∥∇ x f ∥ 2 L 2 (Px) = E Px ∥∇ x f (x)∥ 2 . Smoothness here is in the very general sense of Gigli & Ledoux (2013, Section 4.1) with euclidean structure. The smaller the value of ε rob (f ), the more robust / less sensitive f is to changes in a test data point, on average. We justify the choice of this quantity as a measure of robustness in Appendix D, where we link it to more classical notions of robustness error (Madry et al., 2018). In particular, the teacher model has ε rob (f ⋆ ) = 4∥B∥ 2 F .
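This value for the teacher can be checked directly: since ∇f⋆(x) = 2Bx, the Dirichlet energy is E∥2Bx∥² = 4 tr(B²) = 4∥B∥²F. A quick Monte-Carlo sketch with a hypothetical B:

```python
import numpy as np

rng = np.random.default_rng(2)
d = 40
G = rng.standard_normal((d, d))
B = G @ G.T / d  # hypothetical psd coefficient matrix

# grad f*(x) = 2 B x, so eps_rob(f*) = E||2 B x||^2 = 4 tr(B^2) = 4 ||B||_F^2.
X = rng.standard_normal((100_000, d))
grads = 2.0 * X @ B                   # row-wise gradients (B is symmetric)
mc = (grads ** 2).sum(axis=1).mean()  # Monte-Carlo Dirichlet energy
exact = 4.0 * np.linalg.norm(B, "fro") ** 2
print(mc / exact)  # close to 1
```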
Finally, note that measures of robustness based on notions of sensitivity have been considered in other works, like Bubeck et al. (2020b); Bubeck & Sellke (2021) for regression, and Wu et al. (2021) for classification settings. It will be convenient to compare the robustness of a student model f to that of the baseline quadratic teacher model f ⋆ defined in (1). To this end, consider the normalized robustness error of f defined by ε rob (f ) := ε rob (f )/ε rob (f ⋆ ), which measures the relative robustness error of the student. The objective of our paper is to study the quantity ε rob (f ) for neural networks (2) in various regimes in the limit (3), and bring to light interesting phenomena. In particular, we will establish tradeoffs between test error and robustness error, in the form of a nontrivial relationship ε test (f ) + ε rob (f ) = 1, for different learning regimes. This paints a picture complementary to Ghorbani et al. (2019). Remark 2.1. Note that it might be tempting to think that ∥f -f ⋆ ∥ L 2 (Px) ≈ 0 =⇒ ∥∇f ∥ L 2 (Px) ≈ ∥∇f ⋆ ∥ L 2 (Px) . (9) Such an implication would automatically lead to an (at least heuristic) explanation of our tradeoffs (8). However, (9) is false in general. Indeed, a smooth function (small ∥∇f ∥ L 2 (Px) ) can be approximated arbitrarily well in L 2 by very rough functions (large ∥∇f ∥ L 2 (Px) ). This point is elaborated in Appendix K. To establish our tradeoffs, we exploit the fine structure of two-layer neural networks in the different learning regimes considered.

3. RESULTS FOR TWO-LAYER NEURAL NETWORKS TRAINED VIA SGD

Consider a student neural network model with quadratic activation f SGD : R d → R, i.e., f SGD (x) := Σ m j=1 (x ⊤ w j ) 2 + s. Here, W = (w 1 , . . . , w m ) ∈ R m×d is a matrix of learnable parameters (one row per hidden neuron), and s ∈ R is a learnable offset. The output weights vector is fixed to z = 1 m := (1, . . .
, 1), while W and s are optimized via SGD (hence the subscript), where each update is on a single new sample point. It is shown in Theorem 3 of Ghorbani et al. (2019) that if W t ∈ R m×d is the matrix of hidden parameters after t steps of SGD, then in the limit (3), the matrix W ⊤ t W t ∈ R d×d converges a.s. to the best rank-m approximation of B. Thus, by continuity of matrix norms, we deduce that ∥W ⊤ t W t ∥ 2 F converges a.s. to ∥B∥ 2 F,m in the infinite-data limit t → ∞. Combining with Lemma G.1 establishes the following asymptotic formula for the (normalized) robustness of the resulting model f SGD , in the high-dimensional limit (3). Theorem 3.1. In the limit (3), it holds that ε rob (f SGD ) → a.s. ∥B∥ 2 F,m / ∥B∥ 2 F = Σ m∧d k=1 λ k (B) 2 / Σ d j=1 λ j (B) 2 ≤ 1, with equality iff rank(B) ≤ m. In particular, if ρ ≥ 1, then in the limit (3), it holds that ε rob (f SGD ) → a.s. 1. Tradeoff between approximation and robustness. We see from the above theorem that if m ≥ rank(B), then the robustness error of the learned student converges to that of the teacher model, namely ε rob (f SGD ) → p ε rob (f ⋆ ) = 4∥B∥ 2 F . This is the case if m ≥ d, for example. Otherwise (i.e., if m < rank(B)), the limiting value of ε rob (f SGD ) can be arbitrarily smaller than ε rob (f ⋆ ), i.e., the learned student will be much more robust (i.e., stable) than the ground-truth model. Comparing with Theorem 3 and Proposition 1 of Ghorbani et al. (2019), we can see that any decrease in robustness error of the learned student (compared to the teacher model) comes at the expense of increased test error, and vice versa. Indeed, it was shown in that paper that the normalized test error ε test (f SGD ) verifies ε test (f SGD ) → p 1 - lim ∥B∥ 2 F,m /∥B∥ 2 F in the limit (3). Combining with our Theorem 3.1 above, we deduce that ε test (f SGD ) + ε rob (f SGD ) → p 1, in the limit (3). (8) The above formula highlights a tradeoff between test error and robustness error.
Thus, we have identified a novel tradeoff between approximation and robustness for the neural network model (2) trained via SGD. In the sequel, we shall establish such tradeoffs for other learning regimes.
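The limiting tradeoff of this section is easy to evaluate numerically from the spectrum of B: ε rob (f SGD) tends to ∥B∥²F,m/∥B∥²F, ε test (f SGD) to one minus that, so the two sum to 1 for every width m. A sketch with a hypothetical B:

```python
import numpy as np

rng = np.random.default_rng(3)
d = 100
G = rng.standard_normal((d, d))
B = G @ G.T / d  # hypothetical psd coefficient matrix
lam = np.sort(np.linalg.eigvalsh(B))[::-1]  # eigenvalues of B, descending

# Limits from Theorem 3.1 and Ghorbani et al. (2019):
#   rob = ||B||_{F,m}^2 / ||B||_F^2,   test = 1 - rob.
for m in [10, 50, 100, 200]:
    rob = (lam[: min(m, d)] ** 2).sum() / (lam ** 2).sum()
    test = 1.0 - rob
    print(m, round(test, 3), round(rob, 3))  # test + rob = 1 for every m
```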

4. RESULTS FOR THE RANDOM FEATURES MODEL

Consider the two-layer model (2) with hidden neuron parameters w 1 , . . . , w m sampled iid from a d-dimensional multivariate Gaussian distribution N (0, Γ) with covariance matrix Γ. We denote this so-called random features (RF) student model by f RF , defined by f RF (x) = f W,z RF (x) = z ⊤ RF σ(W x), where z RF ∈ R m solves the linear regression problem arg min z∈R m E x∼N (0,I d ) [(z ⊤ σ(W x) - f ⋆ (x)) 2 ]. It is easily seen that for x ∼ N (0, I d ), one has z RF = U −1 v, with U jk := E x [σ(x ⊤ w j )σ(x ⊤ w k )] for all j, k ∈ [m], and v j := E x [f ⋆ (x)σ(x ⊤ w j )] for all j ∈ [m]. The covariance matrix Γ encompasses the inductive bias of the neurons at initialization towards different directions in feature space. Define the alignment α = α(B, Γ) of the hidden neurons with the task at hand, namely learning the teacher model f ⋆ , as follows: α := tr(BΓ) / (∥B∥ F ∥Γ∥ F ) ≤ 1. (17) As we shall see, the task-alignment α plays a crucial role in the dynamics of the prediction performance (test error) and robustness of f RF .
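The alignment α in (17) is a Cauchy-Schwarz-normalized inner product between B and Γ, so α = 1 exactly when Γ ∝ B. A small sketch with a hypothetical B:

```python
import numpy as np

rng = np.random.default_rng(4)
d = 80
G = rng.standard_normal((d, d))
B = G @ G.T / d  # hypothetical psd coefficient matrix

def alignment(B, Gamma):
    """alpha = tr(B Gamma) / (||B||_F ||Gamma||_F); at most 1 by Cauchy-Schwarz."""
    return np.trace(B @ Gamma) / (np.linalg.norm(B, "fro") * np.linalg.norm(Gamma, "fro"))

a_aligned = alignment(B, B / np.trace(B))  # Gamma prop. to B: alpha = 1 (scale cancels)
a_iso = alignment(B, np.eye(d) / d)        # isotropic neurons: alpha < 1
print(a_aligned, a_iso)
```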

4.1. ASSUMPTIONS AND KEY QUANTITIES

As in Ghorbani et al. (2019), we will need mild technical conditions on the covariance matrix Γ (Condition 4.1, not restated here). These conditions are quite reasonable, and moreover, they allow us to leverage standard tools from random matrix theory (RMT) in our analysis. We will also need the following technical condition on the activation function σ. Condition 4.2. σ is weakly continuously-differentiable and satisfies the growth condition σ(t)² ≤ c 0 e c 1 t² for some c 0 > 0 and c 1 < 1, and for all t ∈ R. Moreover, σ is not a purely affine function. The above growth condition is a classical condition usually imposed for the theoretical analysis of neural networks (see, e.g., Ghorbani et al., 2019; Mei & Montanari, 2019; Montanari & Zhong, 2020), and is satisfied by all popular activation functions used in practice. One of its main purposes is to ensure that all the Hermite coefficients (λ k ) k∈N of the activation function exist. Refer to Section G.2 for a precise definition of Hermite coefficients. We will also need the following condition. Condition 4.3. (A) λ 0 := λ 0 (σ) = 0. (B) λ 2 := λ 2 (σ) ̸ = 0. Part (A) of this condition was introduced by Ghorbani et al. (2019) to simplify the analysis of the test error of the random features model f RF . Part (B) ensures that the random features model f RF does not degenerate to the null predictor. Definition 4.1. With z ∼ N (0, 1), define the following scalars: λ := E[σ(z)²] − λ₁², κ := λ₂² ∥Γ∥²F d/2, τ := λ₂ tr(BΓ) √d, λ′ := E[σ′(z)²] − λ₁², κ′ := λ₃² ∥Γ∥²F d/2. (18) These coefficients will turn out to be "sufficient statistics" which completely capture the influence of the activation function σ on the robustness of the random features model f RF . Note that by construction, λ, κ, λ′, and κ′ are nonnegative. Now, consider the random psd matrices A 0 and D 0 defined by A 0 := λ I m + λ₁² Θ, D 0 := λ′ I m + (κ′/d + λ₁²) Θ, with Θ := W W ⊤ ∈ R m×m . (19)
These matrices appear upon linearizing the expressions for the test error and the robustness error of the RF model, using RMT techniques from El Karoui (2010). By employing the so-called Silverstein fixed-point equation (Silverstein & Choi, 1995; Ledoit & Péché, 2011; Dobriban & Wager, 2018), one can show that there exist positive constants ψ 1 and ψ 2 such that tr(A₀⁻¹)/d → a.s. ψ 1 and tr(A₀⁻² D 0 )/d → a.s. ψ 2 in the limit (3). (20) Moreover, the ψ k 's only depend on (i) the parametrization rate ρ, and (ii) the limiting eigenvalue distribution D of the rescaled covariance matrix d • Γ of the hidden neurons at initialization. Also, since λ and λ′ are strictly positive (thanks to Condition 4.2), so are the ψ k 's, by definition of D 0 and A 0 in (19). These scalars, together with those defined in (18), will play a crucial role in our analysis.
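At finite d, the limits ψ₁ and ψ₂ in (20) can be approximated by plugging a sampled Θ = WW⊤ into the definitions (18)-(19). The sketch below uses tanh as an illustrative activation (the ψ's involve λ, λ′, λ₁ but not λ₂) and the choice Γ = I d/d; the scalars of Definition 4.1 are estimated by Monte Carlo, and the κ′/d term, which is O(1/d) for this Γ, is dropped. All of these choices are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(5)
d, rho = 300, 2.0
m = int(rho * d)
sigma = np.tanh  # illustrative activation satisfying the growth condition

# Monte-Carlo estimates of the scalars in Definition 4.1, with z ~ N(0, 1).
z = rng.standard_normal(1_000_000)
lam1 = np.mean(z * sigma(z))                               # Hermite coefficient lambda_1
lam = np.mean(sigma(z) ** 2) - lam1 ** 2                   # lambda  > 0
lam_p = np.mean((1.0 - np.tanh(z) ** 2) ** 2) - lam1 ** 2  # lambda' > 0; sigma'(t) = 1 - tanh(t)^2

# Gamma = I_d / d, so rows w_j ~ N(0, I_d / d) and Theta = W W^T.
W = rng.standard_normal((m, d)) / np.sqrt(d)
Theta = W @ W.T
A0 = lam * np.eye(m) + lam1 ** 2 * Theta
D0 = lam_p * np.eye(m) + lam1 ** 2 * Theta  # kappa'/d = O(1/d) term dropped
psi1 = np.trace(np.linalg.inv(A0)) / d
psi2 = np.trace(np.linalg.solve(A0 @ A0, D0)) / d
print(psi1, psi2)  # finite-d proxies for the limits in (20); both positive
```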

4.2. TEST ERROR / PREDICTION PERFORMANCE IN RANDOM FEATURES REGIME

We recall that the (normalized) test error ε test (f RF ) of the random features model f RF was completely analyzed in Ghorbani et al. (2019). Indeed, it was established that ε test (f RF ) = 1 − ψ₁τ² / (∥B∥²F (2κψ₁ + 2)) + o d,P (1), in the limit (3). (21) See Theorem 1 of the said paper. Thus, the normalized test error only depends on the aspect ratio ρ, the limiting spectral distribution D of d • Γ, and the scale parameters (λ, κ, τ) defined in (18). It was further shown that, if the task-alignment α of the hidden neurons defined in (17) admits a limit α ∞ when d → ∞, then w.p. 1, lim ρ→∞ lim m,d→∞, m/d→ρ ε test (f RF ) = 1 − α∞². (22) Thus, (21) predicts that the normalized test error ε test (f RF ) vanishes if (i) Γ ∝ B and (ii) the number of neurons per input dimension m/d diverges, corresponding to extreme over-parametrization.

4.3. ANALYSIS OF ROBUSTNESS IN RANDOM FEATURES REGIME

The following theorem establishes an analytic formula for the robustness in the RF regime; it is proved in Appendix G.4. (A) In the limit (3), we have the approximation ε rob (f RF ) = τ²(2κψ₁² + ψ₂) / (∥B∥²F (2κψ₁ + 2)²) + o d,P (1). (B) Moreover, if lim d→∞ α = α ∞ , then lim ρ→∞ lim m,d→∞, m/d→ρ ε rob (f RF ) = α∞² w.p. 1. In particular, for the optimal choice of Γ in terms of test error, namely Γ ∝ B, one has lim ρ→∞ lim m,d→∞, m/d→ρ ε rob (f RF ) = 1 w.p. 1. Thus, the robustness only depends on the aspect ratio ρ, the limiting spectral distribution D of d • Γ (via ψ₁ and ψ₂), and the scale parameters defined in (18). Tradeoff between approximation and robustness. We deduce from the above theorem that in the limit (3), the random features model f RF is more robust (i.e., less sensitive to perturbations) than the teacher model f ⋆ . Interestingly, we see that this gap in robustness between the two models closes with increasing alignment α between the covariance matrix Γ of the random features and the coefficient matrix B. Comparing with (22), we obtain the following relationship (provided ∥Γ∥²F ≫ 1/m): ε test (f RF ) + ε rob (f RF ) → p (1 − α∞²) + α∞² = 1, in the limit (3), which trades off between the normalized test error ε test (f RF ) (see (21)) and the normalized robustness ε rob (f RF ) of f RF . Thus, we have identified another novel tradeoff between the test error and the robustness in random features models.
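To make the tradeoff concrete, the following end-to-end sketch fits z RF by Monte-Carlo population least squares, for a hypothetical anisotropic B with task-aligned features Γ ∝ B; a centered ReLU enforces λ₀ = 0 (Condition 4.3(A)). All sizes and the choice of B are illustrative assumptions; at finite d and moderate ρ the two normalized errors need not sum exactly to 1, but both typically stay below the teacher's baselines.

```python
import numpy as np

rng = np.random.default_rng(6)
d, m, n = 30, 60, 60_000

# Strongly anisotropic, hypothetical B; teacher centered as in (1).
lam_B = np.array([1.0] * 3 + [0.01] * (d - 3))
B = np.diag(lam_B)
f_star = lambda X: np.einsum("ni,ij,nj->n", X, B, X) - np.trace(B)

# Centered ReLU so that lambda_0 = 0 (Condition 4.3(A)).
c = 1.0 / np.sqrt(2.0 * np.pi)  # E[relu(z)] for z ~ N(0, 1)
sigma = lambda t: np.maximum(t, 0.0) - c
dsigma = lambda t: (t > 0).astype(float)

# Task-aligned random features: Gamma proportional to B, so alpha = 1.
Gamma = B / np.trace(B)
W = rng.multivariate_normal(np.zeros(d), Gamma, size=m)

# Monte-Carlo surrogate for z_RF = U^{-1} v (population least squares).
X = rng.standard_normal((n, d))
z_rf, *_ = np.linalg.lstsq(sigma(X @ W.T), f_star(X), rcond=None)

# Normalized errors on fresh data; ||f*||^2 = 2||B||_F^2, eps_rob(f*) = 4||B||_F^2.
Xt = rng.standard_normal((n, d))
resid = sigma(Xt @ W.T) @ z_rf - f_star(Xt)
grads = (dsigma(Xt @ W.T) * z_rf) @ W  # rows: grad f_RF(x) = sum_j z_j sigma'(x.w_j) w_j
fro2 = (lam_B ** 2).sum()
eps_test = (resid ** 2).mean() / (2.0 * fro2)
eps_rob = (grads ** 2).sum(axis=1).mean() / (4.0 * fro2)
print(eps_test, eps_rob)
```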

5. NEURAL TANGENT REGIME

Consider a two-layer network with output weights z 0 j fixed to 1, and hidden weights w j ∈ R d drawn from N (0, Γ). For the quadratic activation σ(t) := t 2 , the neural tangent (NT) approximation (Jacot et al., 2018; Chizat et al., 2019) w.r.t. the first-layer parameters is given by f W +A (x) ≈ f init (x) + tr(A ∇ W f W (x)) = f init (x) + 2 Σ m j=1 (x ⊤ (z 0 j a j ))(x ⊤ w j ), (25) where f init is the function computed by the neural network at initialization (see Appendix E for details), and A = (a 1 , . . . , a m ) = (∆w 1 , . . . , ∆w m ) ∈ R m×d is the change in W . We will see that the initialization term f init might have a drastic influence on the robustness of the resulting model.

5.1. NEURAL TANGENT APPROXIMATION WITHOUT INITIALIZATION TERM

We temporarily discard the initialization term f init (x) from the RHS of (25), and consider the simplified approximation f NT (x; A, c) := 2 Σ m j=1 (x ⊤ a j )(x ⊤ w j ) − c, (26) where, WLOG, we absorb the output weights z j into the parameters a j of the first-order term. In (26), A ∈ R m×d and c ∈ R are model parameters that are optimized. In terms of test error, let A NT and c NT be optimal in f NT (•; A, c), and write f NT = f NT (•; A NT , c NT ) for short. In Thm. 2 of Ghorbani et al. (2019), it is shown that the (normalized) test error of the linearized model f NT is given by E W [ ε test (f NT )] = (1 − ρ)₊² (1 − β) + (1 − ρ)₊ β + o d (1), (27) where (t)₊ := max(t, 0) and β = β(B) := tr(B)²/(d ∥B∥²F ) ∈ (0, 1]. We now establish an analytic formula for the robustness error of f NT . Theorem 5.1. Consider the neural tangent model f NT in (26). In the limit (3) it holds that E W [ ε rob (f NT )] = (ρ̃ + ρ̃²)/2 + (ρ̃ − ρ̃²)β/2 + o d (1), where ρ̃ := min(ρ, 1). (28) Further observe that, because 0 ≤ β ≤ 1, the RHS of (28) is upper-bounded by ρ̃ ≤ 1, with equality when β = 1 (e.g., for B ∝ I d ). We deduce that in the NT regime, the student neural network is at least as robust as the teacher model f ⋆ . Comparing with (27), we obtain the following tradeoff between test error and robustness, stated only for β = 1 for simplicity of presentation. Corollary 5.1. If β = 1 (i.e., if B ∝ I d ), then in the limit (3), it holds that E W [ ε test (f NT ) + ε rob (f NT )] = 1 + o d (1).
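The closed-form limits above are elementary to tabulate. The sketch below (using the positive-part reading (t)₊ = max(t, 0) of the test-error formula) checks Corollary 5.1 numerically: for β = 1 the two limiting errors sum to 1 at every ρ.

```python
def nt_test(rho, beta):
    """Limiting normalized NT test error, with (t)_+ = max(t, 0)."""
    p = max(1.0 - rho, 0.0)
    return p ** 2 * (1.0 - beta) + p * beta

def nt_rob(rho, beta):
    """Limiting normalized NT robustness error, with rho_tilde = min(rho, 1)."""
    r = min(rho, 1.0)
    return (r + r ** 2) / 2.0 + (r - r ** 2) * beta / 2.0

# Corollary 5.1: for beta = 1 (B proportional to I_d) the errors sum to 1.
for rho in [0.25, 0.5, 1.0, 2.0]:
    print(rho, nt_test(rho, 1.0) + nt_rob(rho, 1.0))  # 1.0 at every rho
```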

5.2. NEURAL TANGENT APPROXIMATION WITH INITIALIZATION TERM

We now consider the neural tangent approximation (25) without discarding the initialization term f init from the RHS of (25). Let z 0 ∈ R m be the output weights, drawn iid from N (0, 1/m) and frozen, and let Q be the m × m diagonal matrix with z 0 as its diagonal. This corresponds to the "lazy" training regime (Chizat et al., 2019). Let f NTL (x; A, c) be the RHS of (25), that is, f NTL (x; A, c) := f init (x) + f NT (x; A, c), where f init (x) := Σ m j=1 z 0 j (x ⊤ w j ) 2 = x ⊤ W ⊤ QW x defines the neural network at initialization. Theorem 5.2. Suppose the output weights z 0 at initialization are distributed as N (0, (1/m)I m ). Then, in the limit (3), the following identities hold: E {W,z 0 } [ ε test (f NTL )] = E W [ ε test (f NT )] + o d (1), and E {W,z 0 } [ ε rob (f NTL )] = E W [ ε rob (f NT )] + E {W,z 0 } [ ε rob (f init )] + o d (1). Thus, on average (over initialization): (i) f NTL and f NT have the same test error, i.e., the initialization term f init in f NTL does not affect its test error. (ii) On the other hand, f NTL is less robust than f NT ; the deficit in robustness, namely the term (1 + ∥Γ∥ 2 F )/∥B∥ 2 F , corresponds exactly to the contribution of the initialization. The situation is empirically illustrated in Fig. 2. Notice the perfect match between our theoretical results and experiments.
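The initialization term is itself a quadratic form, f init(x) = x⊤Mx with M = W⊤QW, so its Dirichlet energy is 4∥M∥²F, which stays of order one as d grows: this is the robustness deficit that f NTL inherits. A Monte-Carlo sketch with the illustrative choice Γ = I d/d:

```python
import numpy as np

rng = np.random.default_rng(7)
d, rho = 100, 2.0
m = int(rho * d)

# Frozen initialization: hidden weights with Gamma = I_d / d (illustrative),
# output weights z^0 with variance 1/m, and Q = diag(z^0).
W = rng.standard_normal((m, d)) / np.sqrt(d)
z0 = rng.standard_normal(m) / np.sqrt(m)
M = W.T @ (z0[:, None] * W)  # M = W^T Q W, so f_init(x) = x^T M x

# f_init is a quadratic form, hence eps_rob(f_init) = 4 ||M||_F^2 = Theta(1):
# the initialization contributes a robustness deficit that does not vanish.
X = rng.standard_normal((100_000, d))
mc = ((2.0 * X @ M) ** 2).sum(axis=1).mean()  # Monte-Carlo E||grad f_init||^2
exact = 4.0 * np.linalg.norm(M, "fro") ** 2
print(mc, exact)  # the two estimates agree
```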

6. CONCLUDING REMARKS

In this paper, we have studied the adversarial robustness of two-layer neural networks in different high-dimensional learning regimes, and established a number of new tradeoffs between prediction performance and robustness, in the form (8). Our analysis also shows that random initialization can further degrade the robustness in lazy training regimes: for "large" random initialization, the trained neural network inherits additional vulnerability already present at initialization. Our work can be seen as a first step towards a rigorous theoretical understanding of the robustness of trained neural networks, an important subject which is still understudied. 

A FURTHER RELATED WORK

Various works have theoretically studied adversarial examples and robustness in supervised learning, and the relationship to ordinary predictive performance / test error. We present a detailed list here. Tsipras et al. (2019) considers a specific data distribution where good accuracy implies poor robustness. Shafahi et al. (2018); Mahloujifar et al. (2018); Gilmer et al. (2018); Dohmatob (2019) show that for high-dimensional data distributions which have a concentration property (e.g., multivariate Gaussians, distributions satisfying log-Sobolev inequalities, etc.), an imperfect classifier will admit adversarial examples. Dobriban et al. (2020) studies tradeoffs in Gaussian mixture classification problems, highlighting the impact of class imbalance. On the other hand, Yang et al. (2020) observed empirically that natural images are well-separated, and so locally-Lipschitz classifiers shouldn't suffer any kind of test error vs robustness tradeoff. However, gradient descent is not likely to find such models. Our work studies regression problems with quadratic targets, and shows that there are indeed tradeoffs between test error and robustness which are controlled by the learning algorithm / regime and model. Another work (2021) studies robustness vs accuracy for data distributions which are well-separated (e.g., say the two classes are supported on disjoint balls). The main findings of that paper are that (i) the robustness vs accuracy tradeoff doesn't exist for well-separated datasets, and, as the work further posits, (ii) real-world datasets are well-separated. We think (i) is only an artifact of the well-separatedness assumption (an assumption which fails for Gaussians, as noted in the paper, say, due to infinite support). Also, (ii) is likely due to the fact that most real datasets are limited in sample size, and so deceptively appear to be well-separated.
Indeed, in the real world there are cats which look like dogs (e.g., Siamese cats), even though such data might be under-represented in ML datasets. Gao et al. (2019); Bubeck et al. (2020b); Bubeck & Sellke (2021) show that over-parameterization may be necessary for robust interpolation in the presence of noise. In contrast, our paper considers a structured problem with noiseless signal and infinite data (n = ∞), where the network width m and the input dimension d tend to infinity proportionally. In this under-complete asymptotic setting, our results paint a precise picture of the tradeoffs between approximation (test error) and robustness in different learning regimes. Our work nuances the existing picture by exhibiting a nontrivial interplay between robustness and test error which persists even in the infinite-sample case, where the model is not affected by label noise.

B NOTATIONS

Linear algebra. The set of integers from 1 through d will be denoted [d]. We will denote the identity matrix of size d by I_d. The Euclidean norm of a vector x ∈ R^d will be denoted ∥x∥. The kth largest singular value of a matrix A will be denoted s_k(A); it equals the positive square root of the kth largest eigenvalue of the positive semi-definite (psd) matrix AA^⊤. In particular, ∥A∥_op := s_1(A) is the spectral norm of A. If A is itself psd, then its singular values coincide with its eigenvalues. The Frobenius norm of A is denoted ∥A∥_F and defined by ∥A∥_F := (Σ_{k=1}^d s_k(A)²)^{1/2}. More generally, we define ∥A∥_{F,m} := (Σ_{k=1}^{m∧d} s_k(A)²)^{1/2}, so that ∥A∥_{F,d} = ∥A∥_F in particular. Note that m ↦ ∥A∥_{F,m} is a nondecreasing function which is upper-bounded by ∥A∥_F. The Hadamard / element-wise product of two matrices A_1 and A_2 of the same shape will be denoted A_1 ⊙ A_2. The squared L²-norm of a function f : R^d → R w.r.t. a distribution P on R^d will be denoted ∥f∥²_{L²(P)}, and defined by ∥f∥²_{L²(P)} := E_P[f(x)²], whenever this integral exists. Given a psd matrix Σ of size d, we denote by N(0, Σ) the d-dimensional multivariate Gaussian distribution with covariance matrix Σ.

Asymptotics. We write o_d(1) (resp. o_{d,P}(1)) for a quantity which goes to zero (resp. which goes to zero in probability) in the limit d → ∞. As usual, the acronym "a.s" stands for almost surely, while "w.p p" stands for with probability at least p.

C WARM-UP: AN INSIGHT FROM LINEAR REGRESSION

Before providing complete proofs in the sequel, in this section we use linear regression to develop an intuitive understanding of the tradeoffs established in the paper. Consider n sample points (x_1, y_1), ..., (x_n, y_n) in R^d × R. Let X = (x_1, ..., x_n) ∈ R^{n×d} be the design matrix and y := (y_1, ..., y_n) ∈ R^n the response vector. For any nonempty subset S of [n], let X_S ∈ R^{|S|×d} (resp. y_S ∈ R^{|S|}) be the version of X (resp. y) with all rows (resp. entries) not in S removed. For a (possibly random) sequence of nonempty subsets (S_t)_{t≥1} of [n], and a sequence of sufficiently small stepsizes (α_t)_{t≥1}, the discrete dynamics

w_t = w_{t-1} - α_t X_{S_t}^⊤ (X_{S_t} w_{t-1} - y_{S_t}) / |S_t|    (33)

represent GD or SGD initialized at w_0 ∈ R^d. In particular, GD corresponds to taking S_t = [n] for all t ≥ 1. By construction, it is clear from (33) that at any iteration t,

w_t - w_0 ∈ span(X) := {X^⊤ z | z ∈ R^n}.    (34)

On the other hand, suppose w_⋆ ∈ R^d is an interpolant, i.e., X w_⋆ = y. It is well known that w_t converges to a point w_∞ which is the orthogonal projection of w_0 onto the affine space of interpolants I := {w ∈ R^d | Xw = y} = w_⋆ + kern(X), where kern(X) ⊆ R^d is the kernel of X. Thanks to (34) and the closedness of the subspace span(X), it is clear that w_∞ - w_0 ∈ span(X). We conclude that w_∞ - w_⋆ is orthogonal to w_∞ - w_0, and so ∥w_⋆ - w_0∥² = ∥w_⋆ - w_∞∥² + ∥w_∞ - w_0∥²,

by

Pythagoras' theorem. In particular, taking w_0 = 0, the previous identity becomes

∥w_∞ - w_⋆∥² (test error of w_∞) + ∥w_∞∥² (rob. error of w_∞) = ∥w_⋆∥² (constant),

where the test error is w.r.t. test data generated according to the linear model x ∼ P_x := N(0, I_d), y = f_{w_⋆}(x), with f_w(x) := x^⊤ w for all w, x ∈ R^d. Dividing through by ∥w_⋆∥² = ∥f_{w_⋆}∥²_{L²(P_x)}, we deduce the following result, which can be seen as the inductive bias of GD and SGD on linear regression.

Proposition C.1. For GD or SGD started at w_0 = 0, it holds that ε_test(f_{w_∞}) + ε_rob(f_{w_∞}) = 1.

This is a tradeoff between the test error and the robustness error for GD and SGD! It is valid for all sample sizes n ≥ 1. In contrast, the tradeoffs established in the nonlinear settings of the previous sections hold even for infinite samples (n = ∞). Perhaps sufficiently large but finite n suffices there as well, but this investigation is left for future work.
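The identity in Proposition C.1 is straightforward to verify numerically. The following sketch (our own illustration, not code from the paper; all names are ours) runs full-batch GD from w_0 = 0 on a noiseless under-determined linear regression problem and checks that the two error terms sum to ∥w_⋆∥² = 1:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 40, 100                           # n < d, so exact interpolants exist
X = rng.standard_normal((n, d))
w_star = rng.standard_normal(d)
w_star /= np.linalg.norm(w_star)         # normalize so that ||w_star||^2 = 1
y = X @ w_star                           # noiseless labels from the linear model

w = np.zeros(d)                          # GD initialized at w0 = 0
alpha = 1.0 / np.linalg.norm(X, 2) ** 2  # stepsize below 2 / ||X||_op^2
for _ in range(5000):
    w -= alpha * X.T @ (X @ w - y)       # full-batch GD; converges to min-norm interpolant

eps_test = np.linalg.norm(w - w_star) ** 2   # test error of w_infty
eps_rob = np.linalg.norm(w) ** 2             # (non)robustness of w_infty
print(eps_test + eps_rob)                    # ~ 1 = ||w_star||^2
```

Running SGD (sampling random subsets S_t) from the same initialization converges to the same projection and gives the same identity.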

D JUSTIFICATION OF OUR PROPOSED MEASURE OF ROBUSTNESS

Let us begin by explaining why our proposed measure of robustness based on the Dirichlet energy (6) is indeed a measure of robustness. Unless otherwise stated, in this section the feature distribution is an arbitrary distribution P_x on R^d. Given a smooth f : R^d → R, consider the d × d psd matrix J(f) and the scalar S(f) ≥ 0 defined by

J(f) := E_{P_x}[∇f(x)∇f(x)^⊤],  S(f) := tr(J(f))^{1/2}.

Note that ε_rob(f) = S(f)². The following lemma shows that S(f) measures the sensitivity of f to random local fluctuations in test data, on average.

Lemma D.1 (Measure of local sensitivity). We have lim_{δ→0⁺} (1/δ) E_{P_x}[∆_f(x; δ)²]^{1/2} = S(f), where ∆_f(x; δ) := sup_{∥v∥≤δ} |f(x + v) - f(x)|.

This lemma is a direct corollary of Lemma D.3, proved later below. The next lemma shows that ∥J(f)∥_op measures the (non)robustness of f to universal adversarial perturbations, in the sense of Moosavi-Dezfooli et al. (2017).

Lemma D.2 (Measure of robustness to universal perturbations). We have the identity lim_{δ→0⁺} (1/δ) ∆_f(δ) = ∥J(f)∥_op^{1/2}, where ∆_f(δ)² := sup_{∥v∥≤δ} E_{P_x}[(f(x + v) - f(x))²].

In particular, the leading eigenvector of J(f) corresponds to (first-order) universal adversarial perturbations of f, in the sense of Moosavi-Dezfooli et al. (2017), and can be efficiently computed via the power method, for example. A rough sketch of the proof of the above lemma is as follows. To first order, we have f(x + v) - f(x) ≈ v^⊤∇f(x). Thus,

∆_f(δ)² := sup_{∥v∥≤δ} E_{P_x}[(f(x + v) - f(x))²] ≈ sup_{∥v∥≤δ} E_{P_x}[(v^⊤∇f(x))²] = sup_{∥v∥≤δ} v^⊤ J(f) v = δ² ∥J(f)∥_op.

The first lemma is proved via a similar argument.
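For a concrete sanity check (our own sketch, not the paper's experiments), take the hypothetical smooth model f(x) = (1/2) x^⊤Bx with B symmetric, so that ∇f(x) = Bx, J(f) = B², and S(f)² = ∥B∥_F². One can estimate J(f) by Monte Carlo over x ∼ N(0, I_d) and extract a universal perturbation direction with the power method, as suggested above:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 30
A = rng.standard_normal((d, d))
B = (A + A.T) / 2                  # symmetric, so grad f(x) = B x for f = 0.5 x'Bx

n = 200_000
X = rng.standard_normal((n, d))    # samples x ~ N(0, I_d)
G = X @ B                          # rows: grad f(x_i) = B x_i (B symmetric)
J = G.T @ G / n                    # Monte Carlo estimate of J(f) = E[grad grad^T]
S2 = np.trace(J)                   # Dirichlet energy estimate; theory: ||B||_F^2

v = rng.standard_normal(d)         # power method for the top eigenvector of J,
for _ in range(200):               # i.e. a (first-order) universal perturbation
    v = J @ v
    v /= np.linalg.norm(v)
```

The Rayleigh quotient v^⊤Jv then approximates ∥J(f)∥_op, and v is the direction along which a single shared perturbation degrades f the most, on average.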

D.1 WHY NOT USE LIPSCHITZ CONSTANTS TO MEASURE ROBUSTNESS?

Note that for any smooth function f, S(f) is a lower bound on the Lipschitz constant ∥f∥_Lip of f. Recall that ∥f∥_Lip is defined by

∥f∥_Lip := sup_{x ≠ x′} |f(x) - f(x′)| / ∥x - x′∥.

One special case where there is equality S(f) = ∥f∥_Lip is when f is a linear function. However, this is far from true in general: ∥f∥_Lip is a worst-case measure, while S(f) is an average-case measure. If ∥f∥_Lip is small, then a small perturbation of the input can only result in a mild change in the output of f. However, a large value of ∥f∥_Lip is uninformative regarding adversarial examples (for example, one can think of a function which is smooth everywhere except on a set of measure zero). In contrast, a large value of S(f) indicates that, on average, an adversary can drastically change the output of f via a small modification of its input.

An illustrative example. Consider a quadratic function f(x) := (1/2) x^⊤ B x + c with isotropic feature distribution P_x = N(0, I_d). Note that the teacher model f_⋆ defined in (1) is of this form. A direct computation reveals that ∇f(x) = Bx, and so S(f)² := E_{P_x}∥∇f(x)∥² = E_{P_x}∥Bx∥² = ∥B∥_F². However, the Lipschitz constant of f restricted to the ball of radius √d is

∥f∥_Lip = sup_{∥x∥≤√d} ∥∇f(x)∥ = sup_{∥x∥≤√d} ∥Bx∥ = √d ∥B∥_op,

which can be up to √d times larger than S(f) = ∥B∥_F; for example, take B to be an ill-conditioned, e.g., rank-1, matrix.
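The √d gap above is easy to exhibit numerically. The following sketch (our own; the rank-1 matrix is a hypothetical worst case, not from the paper) compares the average-case sensitivity S(f) with the Lipschitz constant on the ball of radius √d:

```python
import numpy as np

d = 100
u = np.ones(d) / np.sqrt(d)             # unit vector
B = np.outer(u, u)                      # rank-1: ||B||_op = ||B||_F = 1

S = np.linalg.norm(B, "fro")            # average-case sensitivity S(f) = ||B||_F
lip = np.sqrt(d) * np.linalg.norm(B, 2) # worst case on ||x|| <= sqrt(d): sqrt(d) ||B||_op
print(S, lip, lip / S)                  # gap is exactly sqrt(d) = 10
```

So a model can look catastrophically non-robust through the worst-case lens while being quite robust for typical inputs, which is exactly why the average-case measure S(f) is used.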

D.2 PROOFS FOR DIRICHLET ENERGY AS A MEASURE OF ADVERSARIAL VULNERABILITY

Let ∥•∥ be any norm on R^d with dual norm ∥•∥_⋆. Given functions f, g : R^d → R, a tolerance parameter δ ≥ 0 (the attack budget), and a scalar q ≥ 1, define

R_{q,δ}(f, g) := E_{P_x}[∆_{f,g}(x; δ)^q], where ∆_{f,g}(x; δ) := sup_{∥x′-x∥≤δ} |f(x′) - g(x)|

is the maximal deviation of f from g in a neighborhood of size δ around x; we write ∆_f(x; δ) := ∆_{f,f}(x; δ). For q = 2, we simply write R_δ(f, g) for R_{2,δ}(f, g). In particular, G_δ(f) := E[R_δ(f, f_⋆)] is the adversarial test error and G_0(f) := E[R_0(f, f_⋆)] is the ordinary test error of f, where the expectations are w.r.t. all sources of randomness in f and f_⋆. Of course, G_δ(f) is an increasing function of δ. Define R_{q,δ}(f) := R_{q,δ}(f, f) and R_δ(f) := R_{2,δ}(f, f), which measure the deviation of the outputs of f from themselves under adversarial attack; note that R_{q,0}(f) ≡ 0. Also note that, in the case where ∥•∥ is the Euclidean L²-norm, if f is a near-perfect model (in the classical sense), meaning that its ordinary test error G_0(f) is small, then R_δ(f) is a good approximation of G_δ(f). Finally, at least for small values of δ, we can further approximate R_δ(f) (and therefore G_δ(f), for near-perfect f) by δ² times the Dirichlet energy S(f)². Indeed:

Lemma D.3. Let q ∈ [1, ∞) and let f : R^d → R be a smooth function. Define S_q(f) by

S_q(f) := (E_{P_x}[∥∇f(x)∥_⋆^q])^{1/q}.

Note that, in particular, if ∥•∥ is the Euclidean L²-norm and q = 2, then S_q(f)² is the Dirichlet energy defined in (6) as our measure of robustness. We have the following.

(A) General case. S_q(f) is the right derivative of the mapping δ ↦ R_{q,δ}(f)^{1/q} at δ = 0. More precisely,

lim_{δ→0⁺} R_{q,δ}(f)^{1/q} / δ = S_q(f), or equivalently, R_{q,δ}(f) = δ^q · S_q(f)^q + higher-order terms in δ^q.

(B) Case of the Dirichlet energy. In particular, if ∥•∥ is the Euclidean L²-norm and we take q = 2,

R_δ(f) = δ² · S_2(f)² + higher-order terms in δ².

Remark D.1.
A heuristic argument was used in Simon-Gabriel et al. (2019) to justify the use of the average (dual) norm of the gradient E_{P_x}[∥∇f(x)∥_⋆] (i.e., the average local Lipschitz constant, corresponding to q = 1 above) as a proxy for adversarial generalization.

The proof of Lemma D.3 follows directly from Fubini's theorem and the following lemma.

Lemma D.4. If f is differentiable at x, then the function δ ↦ ∆_f(x; δ) := sup_{∥x′-x∥≤δ} |f(x′) - f(x)| is right-differentiable at 0, with derivative given by ∆′_f(x; 0) = ∥∇f(x)∥_⋆.

Proof. As f is differentiable at x, f(x′) = f(x) + ∇f(x)^⊤(x′ - x) + o(∥x′ - x∥) around x. Therefore, for sufficiently small δ, if B(x; δ) is the ball of radius δ around x, then

∆_f(x; δ) = sup_{x′∈B(x;δ)} |∇f(x)^⊤(x′ - x) + o(∥x′ - x∥)| ≤ sup_{x′∈B(x;δ)} |∇f(x)^⊤(x′ - x)| + sup_{x′∈B(x;δ)} o(∥x′ - x∥) = ∥∇f(x)∥_⋆ δ + sup_{x′∈B(x;δ)} o(∥x′ - x∥),

so that

∆_f(x; δ)/δ ≤ ∥∇f(x)∥_⋆ + sup_{x′∈B(x;δ)} o(∥x′ - x∥)/δ.    (44)

Note that sup_{x′∈B(x;δ)} o(∥x′ - x∥)/δ → 0. This proves lim sup_{δ→0⁺} (1/δ)∆_f(x; δ) ≤ ∥∇f(x)∥_⋆. Similarly, one computes

∆_f(x; δ) = sup_{x′∈B(x;δ)} |∇f(x)^⊤(x′ - x) + o(∥x′ - x∥)| ≥ sup_{x′∈B(x;δ)} |∇f(x)^⊤(x′ - x)| - sup_{x′∈B(x;δ)} o(∥x′ - x∥) = ∥∇f(x)∥_⋆ δ - sup_{x′∈B(x;δ)} o(∥x′ - x∥).    (45)

Hence lim inf_{δ→0⁺} (1/δ)∆_f(x; δ) ≥ ∥∇f(x)∥_⋆, and we conclude that δ ↦ ∆_f(x; δ) is right-differentiable at δ = 0, with derivative ∆′_f(x; 0) = ∥∇f(x)∥_⋆, as claimed.

Proof of Lemma D.3. By basic properties of limits, one has

(lim_{δ→0⁺} R_{q,δ}(f)^{1/q}/δ)^q = lim_{δ→0⁺} R_{q,δ}(f)/δ^q = lim_{δ→0⁺} E_{P_x}[|∆_f(x; δ)|^q]/δ^q = E_{P_x}[lim_{δ→0⁺} |∆_f(x; δ)|^q/δ^q] = E_{P_x}[(lim_{δ→0⁺} |∆_f(x; δ)|/δ)^q] = E_{P_x}[∥∇f(x)∥_⋆^q] =: S_q(f)^q,

where the third equality is thanks to Fubini's theorem, and the fifth is thanks to Lemma D.4 (and the fact that ∆_f(x; 0) ≡ 0). Noting that R_{q,0}(f) ≡ 0 then concludes the proof.
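Lemma D.4 can also be checked by finite differences. In the Euclidean case the dual norm is again the Euclidean norm, and to first order a step of size δ along the gradient attains the supremum. The sketch below (our own, with a hypothetical quadratic f) shows ∆_f(x; δ)/δ approaching ∥∇f(x)∥ as δ shrinks:

```python
import numpy as np

rng = np.random.default_rng(2)
d = 20
M = rng.standard_normal((d, d))
M = (M + M.T) / 2                      # symmetric, so grad f(x) = M x
f = lambda x: 0.5 * x @ M @ x
x = rng.standard_normal(d)
g = M @ x                              # gradient of f at x
gnorm = np.linalg.norm(g)

# Delta_f(x; delta)/delta, approximated by perturbing along the gradient direction
ratios = {delta: abs(f(x + delta * g / gnorm) - f(x)) / delta
          for delta in (1e-1, 1e-2, 1e-3)}
print(gnorm, ratios)                   # ratios approach ||grad f(x)|| as delta -> 0
```

The residual error at budget δ is of order δ∥M∥_op/2, matching the o(∥x′ - x∥) term in the proof.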

E NEURAL NETWORKS AT (RANDOM) INITIALIZATION

We now consider networks at initialization: the hidden weight matrix W = (w_1, ..., w_m) is a random m × d matrix with iid rows from N(0, Γ), as in the random features regime (13), but we freeze the output weight vector z = z⁰ ∈ R^m at random initialization, with iid entries from N(0, 1/m), following standard initialization procedures. Let f_init denote this random network, i.e.,

f_init(x) := (z⁰)^⊤ σ(Wx) = Σ_{j=1}^m z⁰_j σ(x^⊤w_j).    (47)

Theorem E.1. Under Conditions 4.1 and 4.2, we have the following identity in the limit (3):

ε_rob(f_init) = (∥σ′∥²_{L²(N(0,1))} + λ₃²∥Γ∥_F²/2 + λ₂²∥Γ∥_F²) / (4∥B∥_F²) + o_{d,P}(1),

where λ_k is the kth Hermite coefficient of the activation function σ. In particular, for the quadratic activation function σ(t) = t² - 1, we have ε_rob(f_init) = (1 + ∥Γ∥_F²)/∥B∥_F² + o_{d,P}(1).

Analogously, the test error of the network at initialization is given by the following result.

Theorem E.2. Under Conditions 4.1 and 4.2, we have the following identity in the limit (3):

ε_test(f_init) = 1 + (∥σ∥²_{L²(N(0,1))} + λ₂²∥Γ∥_F²/2) / (2∥B∥_F²) + o_{d,P}(1).

In particular, for the quadratic activation σ(t) := t² - 1, we have ε_test(f_init) = 1 + (1 + ∥Γ∥_F²)/∥B∥_F² + o_{d,P}(1).

Combining Thm. E.2 with formula (11), we deduce that training a randomly initialized neural network always improves its test error, as one would expect. On the other hand, combining Thm. 3.1 and Thm. E.1, we deduce that fully training the networks (10) via SGD: (1) degrades robustness if ∥B∥_F² ≳ ∥Γ∥_F² + 1; this is because the parameters of the model align with the signal matrix B, which has much larger energy than the parameters at initialization. Indeed, SGD tends to move the covariance structure of the hidden neurons from Γ to B. (2) Improves robustness if ∥B∥_F² ≲ ∥Γ∥_F² + 1.

F MISCELLANEOUS

F.1 LAZY TRAINING OF OUTPUT LAYER IN RF REGIME

We now study the influence of the initialization on the random features regime. Let W = (w_1, ..., w_m) ∈ R^{m×d} have iid rows drawn from N(0, Γ), as in the RF model (13), and let the output layer be initialized at z = z⁰ ∼ N(0, (1/m)I_m) and updated via single-pass gradient flow on the entire data distribution (infinite data). In this so-called random features lazy (RFL) regime, we posit the following approximation of the neural network (2):

f_RFL(x) := z_{RFL,λ}^⊤ σ(Wx) = f_init(x) + δ_λ^⊤ σ(Wx), where z_{RFL,λ} := z⁰ + δ_λ

and δ_λ ∈ R^m solves the ridge-regression problem

arg min_{δ∈R^m} E_{x∼N(0,I_d)}[(δ^⊤σ(Wx) + f_init(x) - f_⋆(x))²] + λ∥δ∥².

The ridge parameter here can be thought of as a proxy for early stopping at iteration t ∝ 1/λ (Ali et al., 2020); λ = 0 corresponds to training the output layer to optimality.

Theorem F.1. We have the following identities:

E_{z⁰}[ε_test(f_{RFL,λ})] = ε_test(f_RF) + tr(P_λ²U)/(2m∥B∥_F²) + o_{d,P}(1),
E_{z⁰}[ε_rob(f_{RFL,λ})] = ε_rob(f_RF) + tr(P_λ²C)/(4m∥B∥_F²) + o_{d,P}(1),

where U = U(W) and C = C(W) are the random matrices defined in (15) and (54), respectively.

Because P_λ², U, and C are psd matrices, the residual terms tr(P_λ²U)/m and tr(P_λ²C)/m in the above formulae are nonnegative. We deduce that random initialization of the output weights hurts both test error and robustness, whenever the RFL regime is valid.

Infinitely regularized case λ → ∞. Note that P_λ converges a.s. in spectral norm to the identity matrix I_m as λ → ∞. Thus, in this limit, z_{RF,λ} converges a.s. to the all-zero m-dimensional vector, and so, thanks to (91), the output weights z_{RFL,λ} of f_{RFL,λ} converge to their value at initialization, z⁰. Therefore, f_{RFL,λ} and all its derivatives converge a.s. pointwise to the state f_init of the network at initialization (47).
We deduce that in the λ → ∞ limit, the neural network in the lazy regime is equivalent to the untrained model f_init, in terms of both test error and robustness. This should not come as a surprise, since λ → ∞ corresponds to early stopping at t = 0, i.e., no optimization.

Unregularized case λ → 0⁺. By an analogous argument, P_λ converges a.s. to the all-zero m × m matrix as λ → 0⁺, and so, thanks to (91), we have the almost-sure convergence ∥z_{RFL,λ} - z_{RF,λ}∥ → 0. We deduce that in this limit, the unregularized lazy training regime is exactly equivalent to the unregularized vanilla RF regime. Thus, the random features lazy (RFL) regime corresponding to the approximation f_RFL interpolates between the random features regime (corresponding to f_RF) and the untrained regime (corresponding to f_init). Although this is not needed in our infinite-data setting, we remark that a nonzero amount of regularization is often crucial for good statistical performance with finite samples. In that case, P_λ is nonzero, and we expect both the test error and the robustness to become worse in this lazy RF approximation, compared to vanilla RF.
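The decomposition behind this interpolation, z_{RFL,λ} = z_{RF,λ} + P_λz⁰ with P_λ := I_m - U_λ⁻¹U (derived in the proof of Theorem F.1), is a pure linear-algebra fact that can be checked directly. The sketch below (our own; it uses an arbitrary synthetic psd matrix U and vector v as stand-ins for the RF quantities (15)-(16)) also checks both endpoints of the interpolation:

```python
import numpy as np

rng = np.random.default_rng(3)
m = 50
A = rng.standard_normal((m, m))
U = A @ A.T + np.eye(m)                    # synthetic psd stand-in for U in (15)
v = rng.standard_normal(m)                 # stand-in for the vector v in (16)
z0 = rng.standard_normal(m) / np.sqrt(m)   # random initial output weights

def lazy_weights(lam):
    U_lam = U + lam * np.eye(m)
    z_rf = np.linalg.solve(U_lam, v)               # ridge RF weights U_lam^{-1} v
    delta = np.linalg.solve(U_lam, v - U @ z0)     # lazy correction delta_lam
    P = np.eye(m) - np.linalg.solve(U_lam, U)      # P_lam = I - U_lam^{-1} U
    return z_rf, z0 + delta, P                     # z_RFL = z0 + delta

z_rf, z_rfl, P = lazy_weights(0.7)
print(np.allclose(z_rfl, z_rf + P @ z0))           # the decomposition (91)

z_rf0, z_rfl0, _ = lazy_weights(0.0)               # lambda -> 0: RFL = RF
_, z_rfl_inf, _ = lazy_weights(1e9)                # lambda -> inf: back to z0
```

As λ grows, P_λ = λU_λ⁻¹ swells toward I_m, so the contribution of the random initialization z⁰, and with it the extra non-robustness, survives training.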

F.2 EFFECT OF REGULARIZATION IN RF REGIME

Suppose the estimation of the output weights of the RF model is regularized, i.e., for fixed λ ≥ 0, consider instead the model f_{RF,λ}(x) := z_{RF,λ}^⊤ σ(Wx), where z_{RF,λ} solves the ridge-regularized problem

min_{z∈R^m} ∥f_{W,z} - f_⋆∥²_{L²(N(0,I_d))} + λ∥z∥².

A simple computation gives the explicit form z_{RF,λ} = U_λ⁻¹v, where U_λ := U + λI_m, U = U(W) is the random matrix defined in (15), and v ∈ R^m is the random vector defined in (16). An inspection of the proof of Theorem 4.1 (see Appendix G.4) reveals that the situation in the presence of ridge regularization is equivalent to the unregularized case in which the constant λ̄ appearing in the definition of the matrix A₀ in (20) is replaced by λ̄ + λ. This has the effect of decreasing ψ₁ and ψ₂, and thanks to (23), of decreasing the (non)robustness of the random features model. That is, ε_rob(f_{RF,λ}) is a decreasing function of the regularization parameter λ, and in fact, lim_{λ→∞} ε_rob(f_{RF,λ}) = 0.

G TECHNICAL PROOFS

Before proving the main results of the manuscript, we first state and prove some auxiliary results which will be instrumental.

G.1 A USEFUL LEMMA

Recall the definitions of the approximation error and robustness metrics from Section 2.3. The following lemma expresses the measure of (non)robustness ε_rob(f_{W,z,s}) of a two-layer neural network f_{W,z,s} as a quadratic form in the output weights, with a coefficient matrix that depends on the distribution of the hidden weights. We thus obtain an analytic formula for the robustness measure of the general neural network model (2), which will be exploited in the sequel in the analysis of the different learning regimes we consider.

Lemma G.1. For the neural net f_{W,z,s} defined in (2), we have the analytic formula ε_rob(f_{W,z,s}) = z^⊤Cz, where C = C(W) is the m × m psd matrix with entries given by (with x ∼ N(0, I_d))

c_{j,k} := (w_j^⊤w_k) E_x[σ′(x^⊤w_j)σ′(x^⊤w_k)], ∀j, k ∈ [m].

In particular, for a quadratic activation σ(t) ≡ t² + s, we have c_{j,k} = 4(w_j^⊤w_k)², ∀j, k ∈ [m].

Proof. One directly computes ∇_x f_{W,z,s}(x) = Σ_{j=1}^m z_j σ′(x^⊤w_j) w_j, and so the squared gradient norm of f_{W,z,s} at x is

∥∇_x f_{W,z,s}(x)∥² = Σ_{j,k=1}^m z_j z_k (w_j^⊤w_k) σ′(x^⊤w_j)σ′(x^⊤w_k).

Thus, S(f_{W,z,s})² evaluates to

ε_rob(f_{W,z,s}) := E_{x∼N(0,I_d)}∥∇_x f_{W,z,s}(x)∥² = Σ_{j,k=1}^m z_j z_k (w_j^⊤w_k) E_x[σ′(x^⊤w_j)σ′(x^⊤w_k)] = z^⊤C(W)z,

where the m × m psd matrix C(W) is as defined in the statement. In particular, for the activation function σ(t) := t² + s, one computes

c_{j,k} := (w_j^⊤w_k) E_{x∼N(0,I_d)}[σ′(x^⊤w_j)σ′(x^⊤w_k)] = 4(w_j^⊤w_k) E_{x∼N(0,I_d)}[(x^⊤w_j)(x^⊤w_k)] = 4(w_j^⊤w_k)²,

where the last step is due to the fact that E_{x∼N(0,I_d)}[(x^⊤w_j)(x^⊤w_k)] = E_x[x^⊤w_j w_k^⊤x] = tr(Cov(x) w_j w_k^⊤) = w_j^⊤w_k, by a standard result on the mean of a quadratic form.
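The closed form in Lemma G.1 is easy to validate by Monte Carlo for the quadratic activation. The sketch below (our own; the dimensions and weights are arbitrary) compares z^⊤Cz, with c_{jk} = 4(w_j^⊤w_k)², against a direct sample average of ∥∇f(x)∥²:

```python
import numpy as np

rng = np.random.default_rng(4)
d, m = 10, 8
W = rng.standard_normal((m, d)) / np.sqrt(d)   # rows w_j; any fixed weights work
z = rng.standard_normal(m) / m                 # output weights

C = 4.0 * (W @ W.T) ** 2                       # c_jk = 4 (w_j'w_k)^2 for sigma(t) = t^2 + s
rob_closed = z @ C @ z                         # Lemma G.1: E ||grad f||^2 = z'Cz

n = 400_000                                    # Monte Carlo over x ~ N(0, I_d)
X = rng.standard_normal((n, d))
pre = X @ W.T                                  # x'w_j per sample and neuron
grads = (2.0 * pre * z) @ W                    # grad f(x) = sum_j z_j sigma'(x'w_j) w_j
rob_mc = np.mean(np.sum(grads**2, axis=1))
print(rob_closed, rob_mc)
```

Note that σ′(t) = 2t does not depend on the offset s, which is why s does not appear in the code.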

Corollary G.1 (Robustness error of teacher model). It holds that ε_rob(f_⋆) = 4∥B∥_F².

Proof. The result follows directly from Lemma G.1 with activation function σ(t) := t² + b₀/d and fixed output weight vector z = 1_m := (1, ..., 1).

G.2 HERMITE COEFFICIENTS

For any nonnegative integer k, let He_k : R → R be the (probabilist's) kth Hermite polynomial; for example, He_0(t) := 1, He_1(t) := t, He_2(t) := t² - 1, He_3(t) := t³ - 3t, etc. The sequence (He_k)_k forms an orthonormal basis for the Hilbert space L² = L²(N(0,1)) of functions R → R which are square-integrable w.r.t. the standard normal distribution N(0,1). Under suitable integrability conditions (refer to Section 4.1), the coefficients of the activation function σ in this basis are called its Hermite coefficients, denoted λ_k, and are given by λ_k = λ_k(σ) := E_{G∼N(0,1)}[σ(G)He_k(G)]. Finally, ∥σ∥²_{L²(N(0,1))} = E_{G∼N(0,1)}[σ(G)²] defines the squared L²-norm of σ w.r.t. the standard Gaussian distribution N(0,1). Note that by construction, one has ∥σ∥²_{L²(N(0,1))} = Σ_{k=0}^∞ λ_k(σ)².
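These coefficients are easily computed numerically with Gauss-Hermite quadrature. The sketch below (our own, using NumPy's probabilists' Hermite module and the polynomials He_k as listed above) does this for the quadratic activation σ(t) = t² - 1 used throughout the paper, for which λ₂ = 2 and all other coefficients vanish:

```python
import numpy as np
from numpy.polynomial import hermite_e as H    # probabilists' Hermite polynomials He_k

x, w = H.hermegauss(60)                        # quadrature for the weight exp(-x^2/2)
w = w / np.sqrt(2 * np.pi)                     # normalize: weights now sum to 1 = E[1]

def hermite_coeff(sigma, k):
    """lambda_k = E[sigma(G) He_k(G)] for G ~ N(0, 1)."""
    He_k = H.hermeval(x, [0.0] * k + [1.0])    # evaluate He_k at the quadrature nodes
    return float(np.sum(w * sigma(x) * He_k))

sigma = lambda t: t**2 - 1                     # the paper's quadratic activation
coeffs = [hermite_coeff(sigma, k) for k in range(5)]
print(coeffs)                                  # ~ [0, 0, 2, 0, 0]
```

Since the quadrature is exact for polynomials of degree up to 119, these values are exact up to floating-point error.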

G.3 APPROXIMATION OF RANDOM MATRICES

This section establishes some technical results for "linearizing" a number of complicated random matrices which occur in our analysis. We make heavy use of random matrix theory (RMT) techniques developed in Silverstein & Choi (1995); El Karoui (2010); Ledoit & Péché (2011); Dobriban & Wager (2018). We begin by recalling the following definition for future reference.

Definition 4.1. With z ∼ N(0,1), define the following scalars:

λ := E[σ(z)²] - λ₁², κ := λ₂²∥Γ∥_F²d/2, τ := λ₂ tr(BΓ)√d, λ′ := E[σ′(z)²] - λ₁², κ′ := λ₃²∥Γ∥_F²d/2.

Let U be the random m × m psd matrix defined in (15) and let v ∈ R^m be the random vector defined in (16). Recall that λ_k = λ_k(σ) is the kth Hermite coefficient of the activation function σ, and recall the definition of the scalars λ, κ, τ, λ′, and κ′ from (18). The following result was established in Ghorbani et al. (2019).

Proposition G.1 (Lemma 2 of Ghorbani et al. (2019)). If λ₀ = 0 and Conditions 4.1 and 4.2 are in place, then in the limit (3) it holds that

∥U - U₀∥_op = o_{d,P}(1), ∥v - (τ/√d)1_m∥ = o_{d,P}(1),

where the random m × m psd matrix U₀ is defined by U₀ := λI_m + λ₁²WW^⊤ + (κ/d)1_m1_m^⊤ + µµ^⊤, and µ = (µ₁, ..., µ_m) ∈ R^m with µ_i := λ₂·(∥w_i∥² - 1)/2.

A careful inspection of the proof of the estimate (57) reveals that we can remove the condition λ₀ = 0, at the expense of incurring rank-1 perturbations of the matrix U₀. Indeed, let us write σ = σ̄ + λ₀, so that λ₀(σ̄) = E_G[σ̄(G)] = 0, with G ∼ N(0,1) independent of the w_i's. Let T₀ be the m × m matrix with entries (T₀)_{ij} := λ₀(σ_i)λ₀(σ_j), where σ_i is the function defined by σ_i(z) := σ(∥w_i∥z) = σ̄_i(z) + λ₀, with σ̄_i(z) := σ̄(∥w_i∥z). Thus, we have the decomposition T₀ = T̄₀ + λ₀(u1_m^⊤ + 1_mu^⊤) + λ₀²1_m1_m^⊤, where u = (λ₀(σ̄_i))_{i∈[m]} and T̄₀ is the m × m psd matrix with entries (T̄₀)_{ij} := λ₀(σ̄_i)λ₀(σ̄_j). Using the arguments from Ghorbani et al.
(2019) (since λ₀(σ̄) = 0), one has ∥T̄₀ - µµ^⊤∥_op = o_{d,P}(1). Furthermore, observe that one can write u1_m^⊤ = Rµ1_m^⊤, where R is the m × m diagonal matrix with R_{ii} := λ₀(σ̄_i)/µ_i. Now, for large d and any i ∈ [m], one computes

R_{ii} = E_G[σ(∥w_i∥G) - σ(G)] / (λ₂·(∥w_i∥² - 1)/2) = (E_G[σ(∥w_i∥G) - σ(G)] / (∥w_i∥ - 1)) · (1 / (λ₂·(∥w_i∥ + 1)/2)) → E_G[Gσ′(G)] / (λ₂·2/2) = λ₂/λ₂ = 1.

We deduce that ∥R - I_m∥_op = o_{d,P}(1), and so ∥u1_m^⊤ - µ1_m^⊤∥_op = o_{d,P}(1). This proves the following extension of the above result, which will be crucial in the sequel.

Lemma G.2 (Linearization of U without the condition λ₀(σ) = 0). Suppose Conditions 4.1 and 4.2 are in place. In the limit (3), it holds that ∥U - U₀∥_op = o_{d,P}(1), where U₀ is the m × m random psd matrix given by

U₀ := λ̄I_m + λ₁²WW^⊤ + (κ/d)1_m1_m^⊤ + µ̄µ̄^⊤,

with µ̄ := µ + λ₀1_m and λ̄ := λ - λ₀² = E_{G∼N(0,1)}[σ(G)²] - λ₀² - λ₁².

Let C = C(W) be the random m × m psd matrix with entries c_{ij} := (w_i^⊤w_j) E_{x∼N(0,I_d)}[σ′(x^⊤w_i)σ′(x^⊤w_j)]. Thanks to Lemma G.1, we know that ε_rob(f_RF) = z_RF^⊤Cz_RF = v^⊤U⁻¹CU⁻¹v, a random quadratic form in v. We start by linearizing the nonlinear random coefficient matrix C.

Lemma G.3 (Linearization of C). Suppose Conditions 4.1 and 4.2 are in place. Then, in the limit (3), we have the approximation

∥C - C₀∥_op = o_{d,P}(1),

where C₀ is the m × m random psd matrix given by C₀ := λ′I_m + (κ′/d + λ₁²)WW^⊤ + (2κ/d)1_m1_m^⊤, with κ′ := d·λ₃²∥Γ∥_F²/2 ≥ 0, and λ′ := ∥σ′∥²_{L²(N(0,1))} - λ₁².

Proof. Note that C = (WW^⊤) ⊙ U′, where U′ is the m × m random psd matrix with entries U′_{ij} := E_{x∼N(0,I_d)}[σ′(x^⊤w_i)σ′(x^⊤w_j)].

Step 1: Linearization.
Invoking the previous lemma with σ′ in place of σ, we know that ∥U′ - U′₀∥_op = o_{d,P}(1), where U′₀ is the m × m random matrix given by

U′₀ := λ̄′I_m + λ₁(σ′)²WW^⊤ + (κ(σ′)/d)1_m1_m^⊤ + (µ + λ₀(σ′)1_m)(µ + λ₀(σ′)1_m)^⊤ = λ̄′I_m + λ₂(σ)²WW^⊤ + (κ′/d)1_m1_m^⊤ + (µ + λ₁(σ)1_m)(µ + λ₁(σ)1_m)^⊤,

where we have used the fact that λ₀((σ′)²) - λ₀(σ′)² - λ₁(σ′)² = λ₀((σ′)²) - λ₁(σ)² - λ₂(σ)² = λ′ - λ₂(σ)² =: λ̄′. Now, since ∥WW^⊤∥_op = O_{d,P}(1) by standard RMT, we deduce from (68) that

∥C - (WW^⊤) ⊙ U′₀∥_op = ∥(WW^⊤) ⊙ (U′ - U′₀)∥_op ≤ ∥WW^⊤∥_op · ∥U′ - U′₀∥_op = o_{d,P}(1).    (70)

Step 2: Simplification. Let E := diag((∥w_i∥²)_{i∈[m]}) and F := (WW^⊤) ⊙ (WW^⊤). Then

(WW^⊤) ⊙ U′₀ = λ̄′E + λ₂(σ)²F + (κ′/d)WW^⊤ + 2λ₁(σ)diag(µ)WW^⊤ + λ₁(σ)²WW^⊤ = λ̄′E + λ₂(σ)²F + (κ′/d + λ₁(σ)²)WW^⊤ + 2λ₁(σ)diag(µ)WW^⊤.    (71)

Proof. By the Sherman-Morrison formula, we have

A₁⁻¹ = A₀⁻¹ - κ(A₀⁻¹1_m1_m^⊤A₀⁻¹/d) / (1 + κ1_m^⊤A₀⁻¹1_m/d),

and so

1_m^⊤A₁⁻¹D₀A₁⁻¹1_m/d = a - 2ab + ab² = a(1 - b)² = ac²,

where

a := 1_m^⊤A₀⁻¹D₀A₀⁻¹1_m/d, b := (κ1_m^⊤A₀⁻¹1_m/d) / (1 + κ1_m^⊤A₀⁻¹1_m/d), c := 1 - b = 1 / (1 + κ1_m^⊤A₀⁻¹1_m/d).

Now, one has 1_m^⊤A₀⁻¹1_m/d = tr(A₀⁻¹)/d + o_{d,P}(1), thanks to Lemmas 5 and 6 of Ghorbani et al. (2019). By an analogous argument, one can show that 1_m^⊤A₀⁻¹D₀A₀⁻¹1_m/d = tr(A₀⁻²D₀)/d + o_{d,P}(1). Finally, the fact that tr(A₀⁻¹)/d and tr(A₀⁻²D₀)/d converge to deterministic values ψ₁ and ψ₂, respectively, can be established via standard RMT arguments (Silverstein & Choi, 1995; Ledoit & Péché, 2011).

Proof.
From Lemmas G.1 and G.3, we know that

ε_rob(f_RF) = z_RF^⊤Cz_RF = u^⊤C₀u + o_{d,P}(1) = τ²·(1_m^⊤U₀⁻¹C₀U₀⁻¹1_m/d) + o_{d,P}(1),

where u := U₀⁻¹h, with h := (τ/√d)1_m, U₀ is as defined in Lemma G.2, and C, C₀ are as defined in Lemma G.3. Let A₁, A₀, and D₀ be the random matrices defined in (75). Since C₀ = D₀ + (2κ/d)1_m1_m^⊤, one computes

1_m^⊤U₀⁻¹C₀U₀⁻¹1_m/d = 1_m^⊤U₀⁻¹D₀U₀⁻¹1_m/d + 2κ·(1_m^⊤U₀⁻¹1_m/d)² = 1_m^⊤U₀⁻¹D₀U₀⁻¹1_m/d + 2κψ₁²/(1 + κψ₁)² + o_{d,P}(1),

where the last step is thanks to Lemma G.5. It remains to estimate the first term in the above display. Using the Sherman-Morrison formula, we have

U₀⁻¹ = A₁⁻¹ - A₁⁻¹µ̄µ̄^⊤A₁⁻¹ / (1 + µ̄^⊤A₁⁻¹µ̄).

We deduce that 1_m^⊤U₀⁻¹D₀U₀⁻¹1_m/d = a₁₁ - a₁₂ - a₂₁ + a₂₂ + o_{d,P}(1), where a₁₁, a₁₂, a₂₁, and a₂₂ are defined by

a₁₁ := 1_m^⊤A₁⁻¹D₀A₁⁻¹1_m/d, a₁₂ = a₂₁ := 1_m^⊤A₁⁻¹D₀A₁⁻¹µ̄µ̄^⊤A₁⁻¹1_m / ((1 + µ̄^⊤A₁⁻¹µ̄)d), a₂₂ := 1_m^⊤A₁⁻¹µ̄µ̄^⊤A₁⁻¹D₀A₁⁻¹µ̄µ̄^⊤A₁⁻¹1_m / ((1 + µ̄^⊤A₁⁻¹µ̄)²d).

Now, one easily computes

max(|a₁₂|, |a₂₁|) ≤ ∥D₀∥_op∥A₁⁻¹∥_op · 1_m^⊤A₁⁻¹µ̄µ̄^⊤A₁⁻¹1_m / ((1 + µ̄^⊤A₁⁻¹µ̄)d) ≲ (1_m^⊤A₁⁻¹µ̄/√d)² / (1 + µ̄^⊤A₁⁻¹µ̄) = o_{d,P}(1),

where we have used Lemma G.5 in the last two steps. Similarly, we have

|a₂₂| ≤ ∥D₀∥_op∥A₁⁻¹∥_op · (1_m^⊤A₁⁻¹µ̄/√d) · (µ̄^⊤A₁⁻¹µ̄ / (1 + µ̄^⊤A₁⁻¹µ̄)²) · (µ̄^⊤A₁⁻¹1_m/√d) = o_{d,P}(1),

again thanks to Lemma G.5: the middle factor is O_{d,P}(1) and the outer factors are o_{d,P}(1). We conclude from (85) that 1_m^⊤U₀⁻¹D₀U₀⁻¹1_m/d = a₁₁ + o_{d,P}(1). Finally, we know from Lemma G.6 that

a₁₁ := 1_m^⊤A₁⁻¹D₀A₁⁻¹1_m/d = ψ₂/(1 + κψ₁)² + o_{d,P}(1).

Putting things together gives

ε_rob(f_RF) = τ²(2κψ₁² + ψ₂) / (∥B∥_F²(2κψ₁ + 2)²) + o_{d,P}(1),

and part (A) of the theorem then follows, upon dividing (85) by ε_rob(f_⋆) = 4∥B∥_F².
For part (B), one notes that ψ₁ > 0, and so

τ²(2κψ₁² + ψ₂) / (∥B∥_F²(2κψ₁ + 2)²) = tr(BΓ)²·d(∥Γ∥_F²dψ₁ + ψ₂) / ((∥Γ∥_F²dψ₁ + 2)²∥B∥_F²) = tr(BΓ)²·d²∥Γ∥_F²dψ₁ / ((∥Γ∥_F²dψ₁ + 2)²∥B∥_F²) + o_d(1) = tr(BΓ)² / (∥Γ∥_F²∥B∥_F²) + o_d(1) → α∞²,

which completes the proof.

H PROOFS OF MAIN RESULTS

H.1 PROOF OF THEOREM E.1: ROBUSTNESS ERROR OF NEURAL NETWORKS AT INITIALIZATION

We restate the result here for convenience. Let f_init be the function computed by the neural network at initialization, as defined in (47).

Theorem E.1. Under Conditions 4.1 and 4.2, we have the following identity in the limit (3):

ε_rob(f_init) = (∥σ′∥²_{L²(N(0,1))} + λ₃²∥Γ∥_F²/2 + λ₂²∥Γ∥_F²) / (4∥B∥_F²) + o_{d,P}(1),

where λ_k is the kth Hermite coefficient of the activation function σ. In particular, for the quadratic activation function σ(t) = t² - 1, we have ε_rob(f_init) = (1 + ∥Γ∥_F²)/∥B∥_F² + o_{d,P}(1).

Proof. Thanks to Lemma G.1, we know that ε_rob(f_init) = z^⊤Cz, where C is the random m × m psd matrix defined in (65). By standard RMT, z^⊤Cz = tr(C)/m + o_{d,P}(1). Now, let C₀ be the random matrix introduced in Lemma G.3. Since ∥C - C₀∥_op = o_{d,P}(1) (thanks to the aforementioned lemma), one has tr(C)/m = tr(C₀)/m + o_{d,P}(1). Let D₀ := λ′I_m + (κ′/d + λ₁²)WW^⊤ be the matrix defined in (75), so that C₀ = D₀ + (2κ/d)1_m1_m^⊤. We deduce that in the limit (3), the unnormalized robustness is

z^⊤Cz = tr(D₀)/m + 2κ/d + o_{d,P}(1) = (κ′/d + λ₁²)tr(WW^⊤)/m + λ′ + 2κ/d + o_{d,P}(1) = κ′/d + λ₁² + λ′ + 2κ/d + o_{d,P}(1) = ∥σ′∥²_{L²(N(0,1))} + κ′/d + 2κ/d + o_{d,P}(1) = ∥σ′∥²_{L²(N(0,1))} + λ₃²∥Γ∥_F²/2 + λ₂²∥Γ∥_F² + o_{d,P}(1),

where the third equality holds because tr(WW^⊤)/m = (1/m)Σ_{j=1}^m ∥w_j∥², which converges in probability to tr(Γ) = 1 by the weak law of large numbers. Dividing both sides of the above display by ε_rob(f_⋆) = 4∥B∥_F² then gives the result. In particular, in the case of the quadratic activation σ(t) := t² - 1, we have λ₂ = 2, ∥σ′∥²_{L²(N(0,1))} = 4, λ₃ = 0, and so we deduce that ε_rob(f_init) = (4 + 4∥Γ∥_F²)/(4∥B∥_F²) = (1 + ∥Γ∥_F²)/∥B∥_F².

H.2 PROOF OF THEOREM E.2: TEST ERROR OF NEURAL NETWORK AT INITIALIZATION

Theorem E.2. Under Conditions 4.1 and 4.2, we have the following identity in the limit (3):

ε_test(f_init) = 1 + (∥σ∥²_{L²(N(0,1))} + λ₂²∥Γ∥_F²/2) / (2∥B∥_F²) + o_{d,P}(1).
In particular, for the quadratic activation σ(t) := t² - 1, we have ε_test(f_init) = 1 + (1 + ∥Γ∥_F²)/∥B∥_F² + o_{d,P}(1).

Proof. For random initial output weights z⁰ ∼ N(0, (1/m)I_m) independent of the (random) hidden weights matrix W, one computes

E_{z⁰}[ε_test(f_init)] := E_{z⁰}E_{x∼N(0,I_d)}[(f_init(x) - f_⋆(x))²] = E_{z⁰}E_x[f_init(x)²] + E_x[f_⋆(x)²],

where we have used the fact that E[z⁰] = 0, so that the cross term vanishes. The second term on the right-hand side equals ∥f_⋆∥²_{L²(N(0,I_d))} = 2∥B∥_F². For the first term, let U be the m × m matrix with entries U_{ij} := E_x[σ(x^⊤w_i)σ(x^⊤w_j)] introduced in (15), and let U₀ := λI_m + λ₁²WW^⊤ + (κ/d)1_m1_m^⊤ + µµ^⊤, with µ := (λ₂(∥w_j∥² - 1)/2)_{j∈[m]} ∈ R^m, be its approximation given in Proposition G.1. Then

E_{z⁰}E_x[f_init(x)²] = E_{z⁰}[(z⁰)^⊤E_x[σ(Wx)σ(Wx)^⊤]z⁰] = E_{z⁰}[(z⁰)^⊤Uz⁰] = tr(U)/m = tr(U₀)/m + o_{d,P}(1) = λ + λ₁²·tr(WW^⊤)/m + κ/d + λ₂²Σ_{i=1}^m(∥w_i∥² - 1)²/(4m) + o_{d,P}(1) = λ + λ₁² + κ/d + o_{d,P}(1) = ∥σ∥²_{L²(N(0,1))} + λ₂²∥Γ∥_F²/2 + o_{d,P}(1),    (90)

where we have used the facts that tr(WW^⊤)/m = 1 + o_{d,P}(1) and Σ_{i=1}^m(∥w_i∥² - 1)²/(4m) = o_{d,P}(1). The first part of the result then follows upon dividing through by ∥f_⋆∥²_{L²(N(0,I_d))} = 2∥B∥_F². In particular, if σ is the quadratic activation, then ∥σ∥²_{L²(N(0,1))} = λ₂ = 2, and the second part of the result follows.

H.3 THE SPECIAL CASE OF QUADRATIC ACTIVATIONS

We now specialize Theorem 4.1 to the case of the quadratic activation function and obtain more transparent formulae.

Corollary H.1. Consider the random features model $f_{\mathrm{RF}}$ with covariance matrix $\Gamma$ satisfying Condition 4.1 and quadratic activation function $\sigma(t) := t^2 - 1$. Then, in the limit (3), it holds that
$$\varepsilon_{\mathrm{rob}}(f_{\mathrm{RF}}) = \frac{\operatorname{tr}(B\Gamma)^2\|\Gamma\|_F^2}{(1/m + \|\Gamma\|_F^2)^2\|B\|_F^2} + o_{d,\mathbb{P}}(1).$$
Furthermore, part (B) of Theorem 4.1 holds.

Proof. For the quadratic activation, one easily computes $\lambda_1 = \lambda_0 = 0$, $\lambda_2 = 2$, $\lambda = 2$, $\lambda' = 4$, $\kappa = \lambda_2^2\|\Gamma\|_F^2 d/2 = 2\|\Gamma\|_F^2 d$, $\tau := 2\operatorname{tr}(B\Gamma)/\sqrt{d}$, $\kappa' = 0$, and so $A_0 = 2I_m$ and $D_0 = 4I_m$. In this case, one deduces
$$\psi_1 := \lim_{\substack{m,d\to\infty\\ m/d\to\rho}} \operatorname{tr}(A_0^{-1})/d = \rho/2 \quad\text{and}\quad \psi_2 := \lim_{\substack{m,d\to\infty\\ m/d\to\rho}} \operatorname{tr}(A_0^{-2}D_0)/d = \rho.$$
Plugging these into formula (23) of Theorem 4.1 yields
$$\varepsilon_{\mathrm{rob}}(f_{\mathrm{RF}}) = \frac{4\operatorname{tr}(B\Gamma)^2 d \cdot 2 \cdot 2\|\Gamma\|_F^2 d \cdot (\rho/2)^2}{(2 + 2\cdot 2\|\Gamma\|_F^2 d \cdot \rho/2)^2\,\|B\|_F^2} + o_{d,\mathbb{P}}(1) = \frac{4\operatorname{tr}(B\Gamma)^2\|\Gamma\|_F^2(\rho d)^2}{(2 + 2\|\Gamma\|_F^2\rho d)^2\,\|B\|_F^2} + o_{d,\mathbb{P}}(1) = \frac{\operatorname{tr}(B\Gamma)^2\|\Gamma\|_F^2}{(1/m + \|\Gamma\|_F^2)^2\,\|B\|_F^2} + o_{d,\mathbb{P}}(1),$$

Theorem F.1. We have the following identities:
$$\mathbb{E}_{z^0}[\varepsilon_{\mathrm{test}}(f_{\mathrm{RFL},\lambda})] = \varepsilon_{\mathrm{test}}(f_{\mathrm{RF}}) + \frac{\operatorname{tr}(P_\lambda^2 U)/m}{2\|B\|_F^2} + o_{d,\mathbb{P}}(1),$$
$$\mathbb{E}_{z^0}[\varepsilon_{\mathrm{rob}}(f_{\mathrm{RFL},\lambda})] = \varepsilon_{\mathrm{rob}}(f_{\mathrm{RF}}) + \frac{\operatorname{tr}(P_\lambda^2 C)/m}{4\|B\|_F^2} + o_{d,\mathbb{P}}(1),$$
where $U = U(W)$ and $C = C(W)$ are the random matrices defined in (15) and (54), respectively.

Proof. By construction, the vector $\delta_\lambda$ coincides with the output weights of an RF approximation with modified target $\tilde f_\star(x) := f_\star(x) - f_{\mathrm{init}}(x)$. If $U$ and $v$ are as defined in (15) and (16) respectively, then we have the closed-form solution (with $U_\lambda := U + \lambda I_m$)
$$\delta_\lambda = U_\lambda^{-1}\,\mathbb{E}_x[(f_\star(x) - f_{z^0}(x))\sigma(Wx)] = U_\lambda^{-1}\big(v - \mathbb{E}_x[\sigma(Wx)\sigma(Wx)^\top]z^0\big) = U_\lambda^{-1}(v - Uz^0) = z_{\mathrm{RF},\lambda} - U_\lambda^{-1}Uz^0.$$
Thus, for a fixed regularization parameter $\lambda > 0$, the output weights vector in this lazy training regime is given by
$$z_{\mathrm{RFL},\lambda} = \delta_\lambda + z^0 = z_{\mathrm{RF},\lambda} + P_\lambda z^0, \quad\text{where } P_\lambda := I_m - U_\lambda^{-1}U.$$
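The closed-form relationship $z_{\mathrm{RFL},\lambda} = z_{\mathrm{RF},\lambda} + P_\lambda z^0$ derived above is purely algebraic, so it can be checked directly; a sketch, where the PSD matrix $U$ and vector $v$ below are random synthetic stand-ins for the population quantities in (15)-(16):

```python
import numpy as np

rng = np.random.default_rng(2)
m, lam = 50, 0.1

# Random PSD "Gram" matrix U and moment vector v; the identity below is
# purely algebraic, so their precise values do not matter.
A = rng.standard_normal((m, m))
U = A @ A.T / m
v = rng.standard_normal(m)
z0 = rng.standard_normal(m) / np.sqrt(m)       # random initial output weights

U_lam = U + lam * np.eye(m)
z_RF = np.linalg.solve(U_lam, v)               # ridge solution from zero init
delta = np.linalg.solve(U_lam, v - U @ z0)     # correction fitted on top of z0
z_RFL = z0 + delta                             # lazy (RFL) solution

# Claimed identity: z_RFL = z_RF + P_lam z0 with P_lam = I - U_lam^{-1} U
P = np.eye(m) - np.linalg.solve(U_lam, U)
print(np.allclose(z_RFL, z_RF + P @ z0))       # True
```

Note also that for $\lambda = 0$ (and invertible $U$) one gets $P_0 = 0$, recovering $z_{\mathrm{RFL},0} = z_{\mathrm{RF},0}$.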
We deduce that in the presence of any amount of ridge regularization, the lazy random features (RFL) regime is equivalent to the vanilla random features (RF) regime, with an additive bias of $P_\lambda z^0 \in \mathbb{R}^m$ on the fitted output weights vector. In particular, note that if $\lambda = 0$, then $z_{\mathrm{RFL},0} = z_{\mathrm{RF},0}$; that is, in the absence of regularization, RFL and RF correspond to the same regime (i.e., the initialization has no impact on the final model).

Under review as a conference paper at ICLR 2023

- Test error. From formula (91), and noting that $z^0$ is independent of $W$, one computes the test error of $f_{\mathrm{lazy},\lambda}$ averaged over the initial output weights vector $z^0$ as
$$\begin{aligned}
\mathbb{E}_{z^0}[\varepsilon_{\mathrm{test}}(f_{\mathrm{lazy},\lambda})] &:= \mathbb{E}_{z^0}[\|f_{\mathrm{lazy},\lambda} - f_\star\|^2_{L^2(N(0,I_d))}]\\
&= \|f_{\mathrm{RF}} - f_\star\|^2_{L^2(N(0,I_d))} + \mathbb{E}_{z^0}[\|f_{W,P_\lambda z^0}\|^2_{L^2(N(0,I_d))}]\\
&= \varepsilon_{\mathrm{test}}(f_{\mathrm{RF},\lambda}) + \mathbb{E}_{z^0}[(z^0)^\top P_\lambda U P_\lambda z^0]\\
&= \varepsilon_{\mathrm{test}}(f_{\mathrm{RF},\lambda}) + \operatorname{tr}(P_\lambda^2 U)/m,
\end{aligned}$$
where $U = U(W)$ is the matrix defined in (15).

- (Non)robustness. From formula (91), one computes
$$S(f_{\mathrm{RFL},\lambda})^2 = z_{\mathrm{RFL},\lambda}^\top C z_{\mathrm{RFL},\lambda} = z_{\mathrm{RF},\lambda}^\top C z_{\mathrm{RF},\lambda} + 2z_{\mathrm{RF},\lambda}^\top C P_\lambda z^0 + (z^0)^\top P_\lambda C P_\lambda z^0 = S(f_{\mathrm{RF},\lambda})^2 + 2z_{\mathrm{RF},\lambda}^\top C P_\lambda z^0 + (z^0)^\top P_\lambda C P_\lambda z^0,$$
where $C = C(W)$ is the matrix defined in (65). Taking expectations w.r.t $z^0$, and noting that $P_\lambda$ and $C$ depend only on $W$ and are therefore independent of $z^0$ (so that the cross term has zero mean), we have
$$\mathbb{E}_{z^0}[S(f_{\mathrm{lazy},\lambda})^2] = S(f_{\mathrm{RF},\lambda})^2 + \operatorname{tr}(P_\lambda^2 C)/m.$$

H.5 PROOF OF THEOREM 5.1: NEURAL TANGENT (NT) REGIME

Theorem 5.1. Consider the neural tangent model $f_{\mathrm{NT}}$ in (26). In the limit (3), it holds that
$$\mathbb{E}_W[\varepsilon_{\mathrm{rob}}(f_{\mathrm{NT}})] = (\tilde\rho + \tilde\rho^2)/2 + (\tilde\rho - \tilde\rho^2)\beta/2 + o_d(1),$$
where $\tilde\rho := \min(\rho, 1)$.

Let $r \le \min(m,d)$ be the rank of $W$; clearly $r = \min(m,d)$ w.p. 1. Let
$$W^\top = P_1 S V^\top \qquad (93)$$
be the singular value decomposition of $W^\top$, where $P_1 \in \mathbb{R}^{d\times r}$ (resp. $V \in \mathbb{R}^{m\times r}$) is the column-orthogonal matrix of singular vectors of $W^\top$ (resp. $W$), and $S \in \mathbb{R}^{r\times r}$ is the diagonal matrix of nonzero singular values.
For any $A \in \mathbb{R}^{m\times d}$, set $G(A) := SV^\top A \in \mathbb{R}^{r\times d}$. In their proof of (27), Ghorbani et al. (2019) showed that it is optimal (in terms of test error) to choose $A_{\mathrm{NT}}$ such that $G(A_{\mathrm{NT}}) = P_1^\top B/2$. Multiplying through by $P_1$ gives
$$P_1P_1^\top B/2 = P_1 G(A_{\mathrm{NT}}) = P_1 S V^\top A_{\mathrm{NT}} = W^\top A_{\mathrm{NT}}.$$
For the proof of Theorem 5.1, we will need the following lemma.

Lemma H.1. $\varepsilon_{\mathrm{rob}}(f_{\mathrm{NT}}) = 4\|W^\top A + A^\top W\|_F^2$.

Proof. Note that we can rewrite $f_{\mathrm{NT}}(x) = 2\operatorname{tr}((W^\top A)xx^\top) - c$, which is linear in $xx^\top \in \mathbb{R}^{d\times d}$. One then readily computes $\nabla f_{\mathrm{NT}}(x) = 2(W^\top A + A^\top W)x$, from which we deduce that $\|\nabla f_{\mathrm{NT}}(x)\|^2 = 4x^\top(W^\top A + A^\top W)^2x$. Averaging over $x \sim N(0, I_d)$ then gives
$$\frac{\varepsilon_{\mathrm{rob}}(f_{\mathrm{NT}})}{4} = \frac{1}{4}\,\mathbb{E}_x\|\nabla f_{\mathrm{NT}}(x)\|^2 = \mathbb{E}_x[x^\top(W^\top A + A^\top W)^2x] = \operatorname{tr}\big((W^\top A + A^\top W)^2\big) = \|W^\top A + A^\top W\|_F^2,$$
which completes the proof.

- Robustness error. Proceeding in the same way as in the paragraph leading to (94), one has
$$\varepsilon_{\mathrm{rob}}(f_{\mathrm{NTL}}) = 2\|P_1P_1^\top \tilde B\|_F^2 + 2\|P_1^\top \tilde B P_1\|_F^2,$$
where $P_1 \in \mathbb{R}^{d\times r}$ is the column-orthogonal matrix in (93) and $r := \min(m,d)$ is the rank of $W$ (w.p. 1). Now, by definition of $\tilde B$, one has $\tilde B^2 = (B - W^\top QW)(B - W^\top QW)$, and so
$$P_1P_1^\top \tilde B^2 = P_1P_1^\top B^2 - P_1P_1^\top BW^\top QW - P_1P_1^\top W^\top QWB + P_1P_1^\top W^\top QWW^\top QW. \qquad (103)$$
We now take the expectation w.r.t $(W, z^0)$ of each term on the RHS. Thanks to Lemma H.2, the expectation w.r.t $W$ of the trace of the first term in (103) is
$$\mathbb{E}_W[\operatorname{tr}(P_1P_1^\top B^2)] = \mathbb{E}_W[\|P_1P_1^\top B\|_F^2] = \|B\|_F^2(\tilde\rho + o_d(1)).$$
Now, since $W$ and $z^0$ are independent and $z^0$ has zero mean, the second and third terms in (103) have zero expectation w.r.t $(W, z^0)$, because they are linear in $Q = \operatorname{diag}(z^0)$.
Finally, one notes that $P_1P_1^\top W^\top = W^\top$ (since $P_1P_1^\top$ is the orthogonal projector onto the column space of $W^\top$), and so $P_1P_1^\top W^\top QWW^\top QW = W^\top QWW^\top QW$. Taking expectations w.r.t $W$ and $z^0$ then yields
$$\mathbb{E}_{W,z^0}[\operatorname{tr}(P_1P_1^\top W^\top QWW^\top QW)] = \mathbb{E}_{W,z^0}[\operatorname{tr}(WW^\top QWW^\top Q)] = \mathbb{E}_{W,z^0}\big[(z^0)^\top\big((WW^\top)\odot(WW^\top)\big)z^0\big] = \frac{1}{4}\,\mathbb{E}_{W,z^0}[\varepsilon_{\mathrm{rob}}(f_{\mathrm{init}})],$$
where the last step is thanks to the second part of Lemma G.1. Putting things together, we have at this point established that
$$\mathbb{E}_{W,z^0}[\|P_1P_1^\top \tilde B\|_F^2] = \|B\|_F^2(\tilde\rho + o_d(1)) + \frac{1}{4}\,\mathbb{E}_{W,z^0}[\varepsilon_{\mathrm{rob}}(f_{\mathrm{init}})]. \qquad (107)$$
Similarly, noting that $P_1P_1^\top W^\top = W^\top$ by definition of $P_1$, one has
$$\begin{aligned}
\|P_1^\top \tilde B P_1\|_F^2 &= \operatorname{tr}(P_1P_1^\top \tilde B P_1P_1^\top \tilde B) = \operatorname{tr}\big((P_1P_1^\top B - W^\top QW)(P_1P_1^\top B - W^\top QW)\big)\\
&= \operatorname{tr}(P_1P_1^\top BP_1P_1^\top B) - \operatorname{tr}(P_1P_1^\top BW^\top QW) - \operatorname{tr}(P_1P_1^\top W^\top QWB) + \operatorname{tr}(W^\top QWW^\top QW). \qquad (108)
\end{aligned}$$
Taking expectations w.r.t $W$ and $z^0$ (the cross terms again vanish) then gives
$$\mathbb{E}_{W,z^0}[\|P_1^\top \tilde B P_1\|_F^2] = \mathbb{E}_W[\|P_1^\top B P_1\|_F^2] + \mathbb{E}_{W,z^0}[\operatorname{tr}(WW^\top QWW^\top Q)] = \|B\|_F^2\big(\tilde\rho^2(1-\beta) + \tilde\rho\beta + o_d(1)\big) + \frac{1}{4}\,\mathbb{E}_{W,z^0}[\varepsilon_{\mathrm{rob}}(f_{\mathrm{init}})]. \qquad (109)$$
Combining (102), (107), (109), and (28) then completes the proof of (32).

- Test error. The proof of formula (31) builds on the proof of Theorem 2 in Ghorbani et al. (2019). Let $P_2$ be a $d\times(d-\min(m,d))$ matrix such that the combined columns of $P_1$ and $P_2$ form an orthonormal basis for $\mathbb{R}^d$. Then, one computes
$$\varepsilon_{\mathrm{test}}(f_{\mathrm{NTL}}) := \|f_{\mathrm{NTL}} - f_\star\|^2_{L^2(N(0,I_d))} = \mathbb{E}_x[|f_{\mathrm{NTL}}(x) - f_\star(x)|^2] \overset{(a)}{=} \min_{A\in\mathbb{R}^{m\times d}} 2\|\tilde B - W^\top A - A^\top W\|_F^2 \overset{(b)}{=} 2\|P_2^\top \tilde B P_2\|_F^2 = 2\|P_2^\top(B - W^\top QW)P_2\|_F^2 \overset{(c)}{=} 2\|P_2^\top B P_2\|_F^2 = \varepsilon_{\mathrm{test}}(f_{\mathrm{NT}}),$$
where (a) and (b) are due to arguments analogous to those made at the beginning of the proof of Theorem 2 in Ghorbani et al. (2019) (except that our $\tilde B$ plays the role of $B$ therein), and (c) is because $P_2^\top P_1 = 0 \in \mathbb{R}^{(d-\min(m,d))\times r}$ by construction of $P_2$, so that $P_2^\top W^\top QW P_2 = 0$. Normalizing the above display by $\|f_\star\|^2_{L^2(N(0,I_d))} = 2\|B\|_F^2$ then gives (31).
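The Hadamard-product identity from Lemma G.1 invoked above, $\varepsilon_{\mathrm{rob}}(f_{\mathrm{init}}) = 4z^\top\big((WW^\top)\odot(WW^\top)\big)z$ for the quadratic activation, together with the (unnormalized) asymptotic value $4 + 4\|\Gamma\|_F^2$ from Theorem E.1, can be checked numerically. A minimal sketch with $\Gamma = (1/d)I_d$ (dimensions are our choice):

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 200, 400
Gamma_frob_sq = 1.0 / d                        # ||Γ||_F^2 for Γ = I_d / d
W = rng.standard_normal((m, d)) / np.sqrt(d)   # rows w_j ~ N(0, Γ)
z = rng.standard_normal(m) / np.sqrt(m)        # output weights z ~ N(0, (1/m) I_m)

# For sigma(t) = t^2 - 1: grad f_init(x) = 2 W^T diag(z) W x, so
# E_x ||grad f_init(x)||^2 = 4 ||W^T diag(z) W||_F^2 exactly.
M = W.T @ (z[:, None] * W)
exact = 4.0 * np.sum(M ** 2)

# Equivalent Hadamard-product quadratic form: 4 z^T ((W W^T) ⊙ (W W^T)) z
G = W @ W.T
quad_form = 4.0 * z @ ((G * G) @ z)

# Asymptotic prediction from Theorem E.1, before normalization by 4 ||B||_F^2
prediction = 4.0 + 4.0 * Gamma_frob_sq

print(exact, quad_form, prediction)
```

The two expressions for the Dirichlet energy agree to machine precision, and both are close to the asymptotic value up to finite-size fluctuations.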
function is fixed to quadratic), and so the residue function $h$ has no specific structure in general.

Mindful of the previous remark, to disprove (1) it is sufficient to construct a subspace $H$ of the weighted Sobolev space $W^{1,2}(N(0,I_d))$ (consisting of functions $g: \mathbb{R}^d \to \mathbb{R}$ which are square-integrable w.r.t $N(0,I_d)$ and whose weak derivatives are square-integrable w.r.t $N(0,I_d)$) such that:
$$\text{for every } \epsilon, C > 0, \text{ there exists } h_0 \in H \text{ with } \|h_0\|_{L^2(N(0,I_d))} \le \epsilon \text{ and } \|\nabla h_0\|_{L^2(N(0,I_d))} > C. \qquad (111)$$
Indeed, take $H = W^{1,2}(N(0,I_d))$, and for fixed $\alpha \in (0,1)$, consider the sequence of residue functions $(h_n)_n$ in $H$ given by $h_n(x) := n^{-\alpha}\sin(nx_1)$, for every positive integer $n$ and $x = (x_1,\ldots,x_d) \in \mathbb{R}^d$. Note that a constructive way of realizing such residue functions is to take $B = 0$ in the teacher model $f_\star$ and activation function $\sigma(t) \equiv \sin(t)$ in the student model $f$. Now, a simple computation gives $\nabla h_n(x) = n^{\beta}\cos(nx_1)e_1$, where $\beta := 1-\alpha \in (0,1)$ and $e_j$ is the $j$th standard basis vector in $\mathbb{R}^d$. Furthermore,



Here, derivatives are allowed to be defined only almost everywhere, as in neural networks with the ReLU activation function. This notion of smoothness is completely subsumed by the more general notion presented in Section 4.1 of Gigli & Ledoux (2013). For a fair comparison with our measure of robustness, we restrict the computation of the Lipschitz constant to this ball, since √d is the length of a typical random vector drawn from N(0, I_d).



Condition 4.1. The covariance matrix Γ satisfies: (A) tr(Γ) = 1 and d • ∥Γ∥ op = O(1). (B) The empirical eigenvalue distribution of d • Γ converges weakly to a probability distribution D on R + .
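Condition 4.1(A) is straightforward to check numerically for concrete choices of Γ. A sketch for the isotropic choice Γ = I_d/d and for a trace-normalized random PSD matrix standing in for an aligned choice Γ ∝ B (the matrix B below is illustrative, not the paper's teacher):

```python
import numpy as np

rng = np.random.default_rng(4)
d = 400

# Example 1: isotropic choice Γ = I_d / d
Gamma_iso = np.eye(d) / d

# Example 2: "aligned" choice Γ ∝ B for an illustrative random PSD matrix B
R = rng.standard_normal((d, d)) / np.sqrt(d)
B = R @ R.T
Gamma_B = B / np.trace(B)

for Gamma in (Gamma_iso, Gamma_B):
    tr = np.trace(Gamma)
    op = np.linalg.eigvalsh(Gamma).max()       # operator norm of a PSD matrix
    print(tr, d * op)                          # (A): tr = 1 and d * ||Γ||_op = O(1)
```

For the Wishart-type example, d·‖Γ‖_op stays bounded (near 4, by Marchenko-Pastur) as d grows, so Condition 4.1(A) holds for both choices.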

Theorem 4.1. Consider the random features model f RF (13), with covariance matrix Γ satisfying Condition 4.1 and activation function σ satisfying Conditions 4.2 and 4.3.

Theorem 4.1 and Corollary H.1 are empirically verified in Fig. 1. Results for the ReLU activation function are also shown. Notice the perfect match between our theoretical results and experiments.

Figure 1: Empirical validation of Theorem 4.1 and Corollary H.1, showing the (normalized) test error ε_test and robustness error ε_rob of the random features model (13) as a function of the network width, for different choices of the covariance matrix Γ of the random weights of the hidden neurons: the optimal choice Γ⋆ ∝ B and the naive choice I_d. Here, the input dimension is d = 450 and the regularization λ is zero. Horizontal broken lines correspond to asymptotes at α²_∞ and 1 − α²_∞, where α_∞ := lim_{d→∞} tr(BΓ)/(‖B‖_F‖Γ‖_F) is the level of task-alignment of the covariance matrix Γ of the hidden neurons w.r.t learning the teacher model f⋆ defined in (1). Broken curves are theoretical predictions, while solid curves correspond to actual experiments.

Figure 2: Curves of (normalized) test error ε_test and robustness error ε_rob for a two-layer neural network in different learning regimes of the hidden weights. Here, the input dimension is d = 450 and the width m sweeps a range of values from 10 to 1500. Dashed curves correspond to theoretical predictions, while solid curves correspond to actual values observed in the experiments (5 runs). We use n = 10⁶ training samples as a proxy for infinite data. The covariance matrix of the hidden neurons is fixed at Γ = (1/d)I_d. For simplicity of this experiment, we also take the coefficient matrix B of the teacher model to be proportional to I_d. (c) "Small random init" means B = (1/√d)I_d, so that B is much larger than Γ in Frobenius norm. (d) "Large random init" means B = (1/d)I_d, which is of the same order as Γ (in Frobenius norm); in this case, the initialization degrades robustness, as predicted by Thm. 5.2. Note that, also as predicted by Thm. 5.2, random initialization has no impact on the test error of the NT approximation. Results for (a) the neural network at initialization (Thms. E.1 and E.2) and (b) the NT regime (Thm. 5.1) are also depicted for reference.

up: An insight from linear regression

D Justification of our proposed measure of robustness
  D.1 Why not use Lipschitz constants to measure robustness?
  D.2 Proofs for Dirichlet energy as a measure of adversarial vulnerability

E Neural networks at (random) initialization

F Miscellaneous
  F.1 Lazy training of output layer in RF regime
  F.2 Effect of regularization in RF regime

G Technical proofs
  G.1 A useful lemma
  G.2 Hermite coefficients
  G.3 Approximation of random matrices
  G.4 Proof of Theorem 4.1: Analytic formula for robustness of RF model

H Proofs of main results
  H.1 Proof of Theorem E.1: Robustness error of neural networks at initialization
  H.2 Proof of Theorem E.2: Test error of neural networks at initialization
  H.3 The special case of quadratic activations
  H.4 Proof of Theorem F.1: Random features lazy (RFL) regime
  H.5 Proof of Theorem 5.1: Neural tangent (NT) regime
  H.6 Proof of Theorem 5.2: Neural tangent lazy (NTL) regime

I General nonlinear teacher and student models

J Proof of Theorem I.1

K Approximating function values doesn't amount to approximating gradients
  K.1 Disproving (9)
  K.2 Some exceptional cases where (1) holds

-Gabriel et al. (2019); Daniely & Shacham (2020); Bubeck et al. (2021); Bartlett et al. (2021) study the adversarial vulnerability of neural networks at initialization, but do not consider the effects of training the model, in contrast to our work. Schmidt et al. (2018); Khim & Loh (2018); Yin et al. (2019); Bhattacharjee et al. (2021); Min et al. (2021b;a) study the sample complexity of robust learning. In contrast, our work focuses on the case of infinite data, so that the only complexity parameters are the input dimension d and the network width m. Bhattacharjee et al. (

) Further, because $\max_{i\in[m]}|\|w_i\|^2 - 1| = o_{d,\mathbb{P}}(1)$ by basic concentration, we have $\|E - I_m\|_{op},\ \|\operatorname{diag}(\mu)\|_{op} = o_{d,\mathbb{P}}(1)$.
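The concentration step $\max_i|\|w_i\|^2 - 1| = o_{d,\mathbb{P}}(1)$ can be illustrated numerically: for rows $w_i \sim N(0, I_d/d)$, the squared norms are $\chi^2_d/d$ variables, whose maximal deviation from 1 decays on the order of $\sqrt{\log m / d}$. A sketch (dimensions are our choice):

```python
import numpy as np

rng = np.random.default_rng(8)
for d in (100, 500, 2000):
    m = 2 * d
    W = rng.standard_normal((m, d)) / np.sqrt(d)   # rows w_i ~ N(0, I_d / d)
    sq_norms = np.sum(W ** 2, axis=1)              # ||w_i||^2, each a chi^2_d / d
    dev = np.max(np.abs(sq_norms - 1.0))
    print(d, dev)                                  # dev shrinks as d grows
```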

G.4 PROOF OF THEOREM 4.1: ANALYTIC FORMULA FOR ROBUSTNESS OF RF MODEL We are now ready to prove Theorem 4.1, restated here for convenience. Theorem 4.1. Consider the random features model f RF (13), with covariance matrix Γ satisfying Condition 4.1 and activation function σ satisfying Conditions 4.2 and 4.3. (A) In the limit (3), we have the following approximation

$$\lim_{\rho\to\infty}\ \lim_{\substack{m,d\to\infty\\ m/d\to\rho}} \varepsilon_{\mathrm{rob}}(f_{\mathrm{RF}}) = \alpha_\infty^2 \quad\text{w.p. } 1.$$
In particular, for the optimal choice of $\Gamma$ in terms of test error, namely $\Gamma \propto B$, one has
$$\lim_{\rho\to\infty}\ \lim_{\substack{m,d\to\infty\\ m/d\to\rho}} \varepsilon_{\mathrm{rob}}(f_{\mathrm{RF}}) = 1 \quad\text{w.p. } 1.$$

by concentration of random quadratic forms; $= \operatorname{tr}(U_0)/m + o_{d,\mathbb{P}}(1)$, thanks to Proposition G.1.

which proves the first part of the corollary. The second part follows directly from the second part of Theorem 4.1.

H.4 PROOF OF THEOREM F.1: RANDOM FEATURES LAZY (RFL) REGIME

$$\|h_n\|^2_{L^2(N(0,I_d))} = n^{-2\alpha}\,\mathbb{E}_{x_1\sim N(0,1)}[\sin^2(nx_1)] = n^{-2\alpha}e^{-n^2}\sinh(n^2) = n^{-2\alpha}(1 - e^{-2n^2})/2 \xrightarrow{n\to\infty} 0,$$
$$\|\nabla h_n\|^2_{L^2(N(0,I_d))} = \mathbb{E}_{x_1\sim N(0,1)}[n^{2\beta}\cos^2(nx_1)] = n^{2\beta}e^{-n^2}\cosh(n^2) = n^{2\beta}(1 + e^{-2n^2})/2 \xrightarrow{n\to\infty} \infty.$$
Thus, (111) holds, and we conclude that the implication (1) claimed by the reviewer fails in general. □

K.2 SOME EXCEPTIONAL CASES WHERE (1) HOLDS

Let us round off by noting that (1) can be true in very specific circumstances, for example if the activation function $\sigma$ of the student is quadratic, just like that of the teacher model $f_\star$. Indeed, in this case the set of all residue functions $f - f_\star$ is contained in the set of polynomials of degree at most 2 in $d$ real variables; this is a finite-dimensional subspace $H = P_2(\mathbb{R}^d)$ of $L^2(N(0,I_d))$. In fact, it is easy to show via a simple counting argument that $\dim(H) = \dim(P_2(\mathbb{R}^d)) = \binom{d+2}{d} = (d+2)(d+1)/2 \lesssim d^2$. Thus, the gradient operator $\nabla$ is finite-rank, and therefore a compact operator on $H$. It follows that the $H$-restricted operator norm
$$\|\nabla|_H\|_{op} := \sup_{h\in H\setminus\{0\}} \frac{\|\nabla h\|_{L^2(N(0,I_d))}}{\|h\|_{L^2(N(0,I_d))}} \qquad (112)$$
is finite, and thus the implication (1) holds in this case. This argument is valid whenever the linear span $H$ of the residue functions $f - f_\star$ is a finite-dimensional subspace of $L^2(N(0,I_d))$: for example, for linear models or, more generally, polynomial models of degree $\le D$ (corresponding to the case where the activation function $\sigma$ of the student is a polynomial of degree $\le D$), for some fixed integer $D \ge 1$; indeed $H = P_D(\mathbb{R}^d)$ in this case, and has dimension $\binom{d+D}{d} = (d+D)(d+D-1)\cdots(d+1)/D! \lesssim d^D$.
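The closed forms $\mathbb{E}[\sin^2(nx_1)] = (1 - e^{-2n^2})/2$ and $\mathbb{E}[\cos^2(nx_1)] = (1 + e^{-2n^2})/2$ used above follow from $\mathbb{E}[\cos(tx)] = e^{-t^2/2}$ for $x \sim N(0,1)$, and are easy to confirm by Monte Carlo (a sketch, with $\alpha = 1/2$):

```python
import numpy as np

rng = np.random.default_rng(5)
alpha = 0.5
beta = 1.0 - alpha
x = rng.standard_normal(2_000_000)             # x_1 ~ N(0, 1)

for n in (1, 2, 5, 10):
    h_sq = n ** (-2 * alpha) * np.mean(np.sin(n * x) ** 2)      # ||h_n||^2
    g_sq = n ** (2 * beta) * np.mean(np.cos(n * x) ** 2)        # ||grad h_n||^2
    h_cf = n ** (-2 * alpha) * (1 - np.exp(-2 * n ** 2)) / 2    # closed forms
    g_cf = n ** (2 * beta) * (1 + np.exp(-2 * n ** 2)) / 2
    print(n, h_sq, h_cf, g_sq, g_cf)
```

As n grows, the function norm decays like $n^{-2\alpha}/2$ while the gradient norm grows like $n^{2\beta}/2$, which is exactly the mechanism behind (111).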


Dong Yin, Kannan Ramchandran, and Peter Bartlett. Rademacher complexity for adversarially robust generalization. In International Conference on Machine Learning, 2019.

Zhenyu Zhu, Fanghui Liu, Grigorios G Chrysos, and Volkan Cevher. Robustness in deep learning: The good (width), the bad (depth), and the ugly (initialization), 2022.

Also, thanks to (El Karoui, 2010, Theorem 2.3), we may linearize $F$ as follows.

Combining with (71) gives (recalling that $\kappa := \lambda_2^2\|\Gamma\|_F^2 d/2$):

where $\|\Delta\|_{op} = o_{d,\mathbb{P}}(1)$.

Let us rewrite U

We will need the following lemmas.

Lemma G.4. We have the following approximation.

Proof. Thanks to Proposition G.1, the fitted output-weight vector $z_{\mathrm{RF}} \in \mathbb{R}^m$ concentrates around $u := U_0^{-1}h$. On the other hand, we know from Lemma G.1 that $\varepsilon_{\mathrm{rob}}(f_{\mathrm{RF}}) = z_{\mathrm{RF}}^\top C z_{\mathrm{RF}}$. The result then follows from Lemma G.3.

Lemma G.5. Under Condition 4.3, the following holds in the limit (3), where $\psi_1 > 0$ is as defined in (20).

Proof. Formula (77) was established in the proof of (Ghorbani et al., 2019, Theorem 1), whilst (78) was established in the proof of Lemma 5 of the same paper. As for (79), it follows by standard RMT arguments (Vershynin, 2012).

We will need one final lemma.

Lemma G.6. Let $A_1$, $A_0$, and $D_0$ be the random matrices defined in (75), and let $\psi_1$ and $\psi_2$ be as defined in (20).

Lemma H.2. Let $P_1$ be as in (93) and let $\beta := \operatorname{tr}(B)^2/(d\|B\|_F^2)$ as usual. In the limit (3), we have the identities
$$\mathbb{E}_W[\|P_1P_1^\top B\|_F^2] = \|B\|_F^2(\tilde\rho + o_d(1)), \qquad (95)$$
$$\mathbb{E}_W[\|P_1^\top B P_1\|_F^2] = \|B\|_F^2\big(\tilde\rho^2(1-\beta) + \tilde\rho\beta + o_d(1)\big), \qquad (96)$$
where $\tilde\rho := \min(\rho, 1)$.

Proof. WLOG, let $B$ be a diagonal matrix, so that $B^2 = \sum_j \lambda_j^2 e_je_j^\top$, where $e_j$ is the $j$th standard unit vector in $\mathbb{R}^d$. Then, with $P := P_1P_1^\top$, we have
$$\|PB\|_F^2 = \operatorname{tr}(PB^2) = \sum_j \lambda_j^2 P_{jj}.$$
Therefore,
$$\mathbb{E}_W[\|PB\|_F^2] = \sum_j \lambda_j^2\,\mathbb{E}_W[P_{jj}] = \frac{\operatorname{tr}(B^2)}{d}\,\mathbb{E}_W[\operatorname{tr}(P)] = \|B\|_F^2\,\frac{\min(m,d)}{d},$$
where we have used the fact that $\mathbb{E}_W P_{jj} = (1/d)\,\mathbb{E}_W\operatorname{tr}(P)$ for all $j$, due to rotation-invariance. This proves (95). The proof of (96) is completely analogous to the proof of formula (69) in Ghorbani et al. (2019), with $\rho$ therein replaced by $1 - \rho$, and is thus omitted.

Proof of Theorem 5.1. From Lemma H.1 and formula (94), the result follows upon taking expectations w.r.t the hidden weights matrix $W$ and applying Lemma H.2.

H.6 PROOF OF THEOREM 5.2: NEURAL TANGENT LAZY (NTL) REGIME

Theorem 5.2. Suppose the output weights $z^0$ at initialization are drawn iid from $N(0, (1/m)I_m)$.
Then, in the limit (3), the identities (31) and (32) hold, where $Q := \operatorname{diag}(z^0)$ and the $d\times d$ matrix $\tilde B$ is defined by
$$\tilde B := B - W^\top QW.$$
Thus, fitting the model $f_{\mathrm{NTL}}(\cdot; A, c)$ to the teacher model $f_\star$ with coefficient matrix $B$ is equivalent to fitting $f_{\mathrm{NT}}(\cdot; A, c)$ to the modified teacher model $\tilde f_\star$ with coefficient matrix $\tilde B$. In terms of test error (4), let $A_{\mathrm{NTL}}$, $c_{\mathrm{NTL}}$ be optimal in $f_{\mathrm{NT}}(\cdot; A, c)$. We split the proof into two parts: in the first part, we establish (32); the second part handles (31).
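The rotation-invariance argument used in Lemma H.2, namely that $\mathbb{E}_W[P_1P_1^\top] = (r/d)I_d$ so that $\mathbb{E}_W\operatorname{tr}(P_1P_1^\top B^2) = (\min(m,d)/d)\|B\|_F^2$, can be checked at finite size by averaging over draws of $W$; a sketch (dimensions and the diagonal $B$ are our choices):

```python
import numpy as np

rng = np.random.default_rng(6)
d, m, trials = 60, 30, 300
B = np.diag(rng.standard_normal(d))            # WLOG diagonal, as in the proof
target = (min(m, d) / d) * np.sum(B ** 2)      # (min(m,d)/d) * ||B||_F^2

vals = []
for _ in range(trials):
    W = rng.standard_normal((m, d)) / np.sqrt(d)
    # P1 P1^T: orthogonal projector onto the row space of W
    P1 = np.linalg.svd(W.T, full_matrices=False)[0]   # d x r, column-orthogonal
    P = P1 @ P1.T
    vals.append(np.trace(P @ B @ B))
avg = np.mean(vals)
print(avg, target)                             # close, up to sampling noise
```

In the proportional limit $m/d \to \rho$, the factor $\min(m,d)/d$ becomes $\tilde\rho = \min(\rho, 1)$, matching (95).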

I GENERAL NONLINEAR TEACHER AND STUDENT MODELS

The results we established so far are for student-teacher models which are two-layer neural networks in certain learning regimes, with Gaussian data. In this section, we consider much more general scenarios, and show lower bounds that display similar tradeoffs between test error and robustness.

Suppose the distribution $P_x$ of the features is any distribution on $\mathbb{R}^d$ which satisfies a Poincaré inequality with constant $c^2 > 0$. This means that for any smooth function $f: \mathbb{R}^d \to \mathbb{R}$,
$$\operatorname{Var}_{P_x}(f) \le c^2\,\mathbb{E}_{P_x}[\|\nabla f(x)\|^2],$$
where $\operatorname{Var}_{P_x}(f)$ is the variance of $f$ and $\bar f := \mathbb{E}_{P_x} f \in \mathbb{R}$ is its mean w.r.t $P_x$. For example, $N(0,\Sigma)$ verifies a Poincaré inequality with $c^2 = \|\Sigma\|_{op}$. Consider a teacher model $f_\star$ which is now any function in $L^2(P_x)$ with mean $\bar f_\star = 0$.

Theorem I.1. For every smooth student model $f: \mathbb{R}^d \to \mathbb{R}$ (neural network or not!), it holds that

The nature of Theorem I.1 is a tradeoff, since it directly implies that the test error cannot be decreased without increasing the robustness error. In the particular case of isotropic features where $P_x = N(0, I_d)$, as considered in the preceding sections, a Poincaré inequality with constant $c^2 = 1$ is satisfied, and we deduce from the above theorem that, for any smooth student model $f$, one has the inequality (110).

Of course, apart from being only one-sided, the inequality (110) is weaker than the tradeoffs established in the preceding sections, due to the square roots in the former. However, (110) holds without any real restriction on the teacher model $f_\star$, the student model $f$, or the learning algorithm / regime; it is solely a consequence of the high-dimensional geometry of the distribution of the features, manifested via the Poincaré inequality. In contrast, the tradeoffs established in the preceding sections were for student-teacher models which were two-layer neural networks in various learning regimes.

J PROOF OF THEOREM I.1

Theorem I.1. For every smooth student model $f: \mathbb{R}^d \to \mathbb{R}$ (neural network or not!), it holds that

Proof.
WLOG, assume $\|f - f_\star\|_{L^2(P_x)} \le \|f_\star\|_{L^2(P_x)}$, since the claimed lower bound trivially holds otherwise. By the Poincaré inequality, we have
$$\|\nabla f\|_{L^2(P_x)} \ge \frac{1}{c}\operatorname{Var}_{P_x}(f)^{1/2} = \frac{1}{c}\|f - \bar f\|_{L^2(P_x)} \ge \frac{1}{c}\big(\|f_\star - \bar f\|_{L^2(P_x)} - \|f_\star - f\|_{L^2(P_x)}\big),$$
by the triangle inequality. In particular, we have
$$\|\nabla f\|_{L^2(P_x)} \ge \frac{1}{c}\big(\|f_\star\|_{L^2(P_x)} - \|f_\star - f\|_{L^2(P_x)}\big),$$
where the last line follows from $\|f_\star\|^2_{L^2(P_x)} = \operatorname{Var}_{P_x}(f_\star) \le \|f_\star - \bar f\|^2_{L^2(P_x)}$. The result then follows from a simple rearrangement of the terms in the above display.
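The mechanism behind the proof is transparent in the simplest case of linear models with $P_x = N(0, I_d)$ (so $c^2 = 1$): for $f(x) = \langle w, x\rangle$ and $f_\star(x) = \langle b, x\rangle$, one has $\|f - f_\star\|_{L^2} = \|w - b\|$, $\mathbb{E}\|\nabla f\|^2 = \|w\|^2$, and the triangle inequality gives $\|w\| \ge \|b\| - \|w - b\|$: the gradient norm cannot be small unless the approximation error is large. A numerical illustration (this linear construction is our own, for illustration only):

```python
import numpy as np

rng = np.random.default_rng(7)
d = 100
b = rng.standard_normal(d)                     # teacher f*(x) = <b, x>

for scale in (0.0, 0.25, 0.5, 1.0):
    w = scale * b                              # student f(x) = <w, x>
    test = np.linalg.norm(w - b)               # ||f - f*||_{L2(N(0,I))}
    grad = np.linalg.norm(w)                   # sqrt of E||grad f||^2
    # One-sided tradeoff from the triangle inequality (c^2 = 1):
    assert grad >= np.linalg.norm(b) - test - 1e-9
    print(scale, test, grad)
```

Along this family, the sum of the two norms is constant ($\|w\| + \|w - b\| = \|b\|$), so decreasing the test error necessarily increases the Dirichlet energy, exactly as Theorem I.1 predicts.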

K.1 DISPROVING (9)

Henceforth, for a student model $f$, consider the residue function $h := f - f_\star$; thus $\nabla f - \nabla f_\star = \nabla h$. In the student-teacher setup considered in our work, the student $f$ is in general mis-specified w.r.t the teacher $f_\star$ (for example, because the student's activation is arbitrary while the teacher's activation

