OPTIMAL RATES FOR AVERAGED STOCHASTIC GRADIENT DESCENT UNDER NEURAL TANGENT KERNEL REGIME

Abstract

We analyze the convergence of the averaged stochastic gradient descent for overparameterized two-layer neural networks for regression problems. It was recently found that a neural tangent kernel (NTK) plays an important role in showing the global convergence of gradient-based methods under the NTK regime, where the learning dynamics for overparameterized neural networks can be almost characterized by that for the associated reproducing kernel Hilbert space (RKHS). However, there is still room for a convergence rate analysis in the NTK regime. In this study, we show that the averaged stochastic gradient descent can achieve the minimax optimal convergence rate, with the global convergence guarantee, by exploiting the complexities of the target function and the RKHS associated with the NTK. Moreover, we show that the target function specified by the NTK of a ReLU network can be learned at the optimal convergence rate through a smooth approximation of a ReLU network under certain conditions.

1. INTRODUCTION

Recent studies have revealed why a stochastic gradient descent for neural networks converges to a global minimum and why it generalizes well under the overparameterized setting in which the number of parameters is larger than the number of given training examples. One prominent approach is to map the learning dynamics for neural networks into function spaces and exploit the convexity of the loss functions with respect to the function. The neural tangent kernel (NTK) (Jacot et al., 2018) has provided such a connection between the learning process of a neural network and a kernel method in a reproducing kernel Hilbert space (RKHS) associated with an NTK. The global convergence of the gradient descent was demonstrated in Du et al. (2019b); Allen-Zhu et al. (2019a); Du et al. (2019a); Allen-Zhu et al. (2019b) through the development of a theory of NTK with the overparameterization. In these theories, the positivity of the NTK on the given training examples plays a crucial role in exploiting the property of the NTK. Specifically, the positivity of the Gram-matrix of the NTK leads to a rapid decay of the training loss, and thus the learning dynamics can be localized around the initial point of a neural network with the overparameterization, resulting in the equivalence between two learning dynamics for neural networks and kernel methods with the NTK through a linear approximation of neural networks. Moreover, Arora et al. (2019a) provided a generalization bound of $O(T^{-1/2})$, where $T$ is the number of training examples, on a gradient descent under the positivity assumption of the NTK. These studies provided the first steps in understanding the role of the NTK.

However, the eigenvalues of the NTK converge to zero as the number of examples increases, as shown in Su & Yang (2019) (also see Figure 1), resulting in the degeneration of the NTK. This phenomenon indicates that the convergence rates in previous studies in terms of generalization are generally slower than $O(T^{-1/2})$ owing to the dependence on the minimum eigenvalue. Moreover, Bietti & Mairal (2019); Ronen et al. (2019); Cao et al. (2019) also supported this observation by providing a precise estimation of the decay of the eigenvalues, and Ronen et al. (2019); Cao et al. (2019) proved the spectral bias (Rahaman et al., 2019) for a neural network, where lower frequencies are learned first using a gradient descent.
By contrast, several studies showed faster convergence rates of the (averaged) stochastic gradient descent in the RKHS in terms of the generalization (Cesa-Bianchi et al., 2004; Smale & Yao, 2006; Ying & Zhou, 2006; Neu & Rosasco, 2018; Lin et al., 2020). In particular, by extending the results in a finite-dimensional case (Bach & Moulines, 2013), Dieuleveut & Bach (2016); Dieuleveut et al. (2017) showed convergence rates of $O(T^{-2r\beta/(2r\beta+1)})$ depending on the complexity $r \in [1/2, 1]$ of the target functions and the decay rate $\beta > 1$ of the eigenvalues of the kernel (a.k.a. the complexity of the hypothesis space). In addition, extensions to the random feature settings (Rahimi & Recht, 2007; Rudi & Rosasco, 2017; Carratino et al., 2018), to the multi-pass variant (Pillaud-Vivien et al., 2018b), and to the tail-averaging and mini-batching variant (Mücke et al., 2019) have been developed.
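As a small numerical illustration of how the exponent in the rate $O(T^{-2r\beta/(2r\beta+1)})$ depends on $r$ and $\beta$, the following sketch evaluates it on a few parameter pairs. The function name `rate_exponent` and the chosen values are illustrative assumptions, not part of the original analysis.

```python
def rate_exponent(r: float, beta: float) -> float:
    """Exponent e of the rate O(T^{-e}), with e = 2*r*beta / (2*r*beta + 1).

    The ranges r in [1/2, 1] and beta > 1 are the ones stated in the text.
    """
    if not 0.5 <= r <= 1.0:
        raise ValueError("r must lie in [1/2, 1]")
    if beta <= 1.0:
        raise ValueError("beta must be greater than 1")
    return 2.0 * r * beta / (2.0 * r * beta + 1.0)


# Illustrative (hypothetical) parameter choices.
for r, beta in [(0.5, 1.5), (0.5, 3.0), (1.0, 1.5), (1.0, 3.0)]:
    e = rate_exponent(r, beta)
    # Since 2*r*beta > 1 on the admissible range, e always exceeds 1/2.
    print(f"r={r}, beta={beta}: exponent={e:.3f}, exceeds 1/2: {e > 0.5}")
```

Because $2r\beta > 1$ whenever $r \ge 1/2$ and $\beta > 1$, the exponent is strictly larger than $1/2$ on the whole admissible range, and it approaches $1$ as $r\beta$ grows.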

Motivation. The convergence rate of $O(T^{-2r\beta/(2r\beta+1)})$ is always faster than $O(T^{-1/2})$ and is known as the minimax optimal rate (Caponnetto & De Vito, 2007; Blanchard & Mücke, 2018). Hence, a gap exists between the theories regarding NTK and kernel methods. In other words, there is still room for an investigation into a stochastic gradient descent due to a lack of specification of the complexities of the target function and the hypothesis space. That is, to obtain faster convergence rates, we should specify the eigenspaces of the NTK that mainly contain the target function (i.e., the complexity of the target function), and specify the decay rates of the eigenvalues of the NTK (i.e., the complexity of the hypothesis space), as studied in kernel methods (Caponnetto & De Vito, 2007; Steinwart et al., 2009; Dieuleveut & Bach, 2016). In summary, the fundamental question in this study is:

Can stochastic gradient descent for overparameterized neural networks achieve the optimal rate in terms of the generalization by exploiting the complexities of the target function and hypothesis space?

In this study, we answer this question in the affirmative, thereby bridging the gap between the theories of overparameterized neural networks and kernel methods.

Figure 1: An estimation of the eigenvalues of $\Sigma_\infty$ using two-layer ReLU networks with a width of $M = 2 \times 10^4$. The number of uniformly randomly generated samples on the unit sphere is $n = 10^4$ and the dimensionality of the input space is $d \in \{5, 10, 100\}$.

1.1. CONTRIBUTIONS

The connection between neural networks and kernel methods is being understood via the NTK, but it is still unknown whether the optimal convergence rate faster than $O(T^{-1/2})$ is achievable by a certain algorithm for neural networks. This is the first paper to overcome technical challenges of achieving the optimal convergence rate under the NTK regime. We obtain the minimax optimal convergence rates (Corollary 1), inherited from the learning dynamics in an RKHS, for an averaged stochastic gradient descent for neural networks. That is, we show that smooth target functions efficiently specified by the NTK are learned rapidly at faster convergence rates than $O(1/\sqrt{T})$. Moreover, we obtain an explicit optimal convergence rate of $O\big(T^{-2rd/(2rd+d-1)}\big)$ for a smooth approximation of the ReLU network (Corollary 2), where $d$ is the dimensionality of the data space and $r$ is the complexity of the target function specified by the NTK of the ReLU network.

1.2. TECHNICAL CHALLENGE

The key to showing a global convergence (Theorem 1) is making the connection between kernel methods and neural networks in some sense. Although this sort of analysis has been developed in several studies (Du et al., 2019b; Arora et al., 2019a; Weinan et al., 2019; Arora et al., 2019b; Lee et al., 2019; 2020), we would like to emphasize that our results cannot be obtained by direct application of their results. A naive idea is to simply combine their results with the convergence analysis of the stochastic gradient descent for kernel methods, but it does not work. The main reason is that we need the $L_2$-bound weighted by a true data distribution on the gap between dynamics of stochastic gradient descent for neural networks and kernel methods if we try to derive a convergence rate of population risks for neural networks from that for kernel methods. However, such a bound is not provided in related studies. Indeed, to the best of our knowledge, all related studies make this kind of connection regarding the gap on the training dataset or as a sample-wise high probability bound (Lee et al., 2019; Arora et al., 2019b). That is, a statement "for every input data $x$, with high probability $|g^{(t)}_{nn}(x) - g^{(t)}_{ntk}(x)| < \epsilon$" cannot yield a desired statement "$\|g^{(t)}_{nn} - g^{(t)}_{ntk}\|_{L_2(\rho_X)} < \epsilon$", where $g^{(t)}_{nn}$ and $g^{(t)}_{ntk}$ are the $t$-th iterate of gradient descent for a neural network and the corresponding iterate described by NTK, and $\|\cdot\|_{L_2(\rho_X)}$ is the $L_2$-norm weighted by a marginal data distribution $\rho_X$ over the input space. Moreover, we note that we cannot utilize the positivity of the Gram-matrix of NTK, which plays a crucial role in related studies, because we consider the population risk with respect to $\|\cdot\|_{L_2(\rho_X)}$ rather than the empirical risk. To overcome these difficulties, we develop a different proof strategy.

First, we make a bound on the gap between two dynamics of the averaged stochastic gradient descent for a two-layer neural network and its NTK with width $M$ (Proposition A), and obtain a generalization bound for this intermediate NTK (Theorem A in Appendix). Second, we remove the dependence on the width $M$ from the intermediate bound. These steps are not obvious because we need a detailed investigation to handle the misspecification of the target function by an intermediate NTK. Based on detailed analyses, we obtain a faster and more precise bound than those in previous results (Arora et al., 2019a). The following is an informal version of Proposition A providing a new connection between two-layer neural networks and the corresponding NTK with width $M$.

Proposition 1 (Informal). Under appropriate conditions we simultaneously run averaged stochastic gradient descent for a neural network with width $M$ and for its NTK. Assume they share the same hyper-parameters and examples to compute stochastic gradients. Then, for an arbitrary number of iterations $T \in \mathbb{Z}_+$ and $\epsilon > 0$, there exists $M \in \mathbb{Z}_+$ depending only on $T$ and $\epsilon$ such that for all $t \le T$,

$$\|g^{(t)}_{nn} - g^{(t)}_{ntk}\|_{L_\infty(\rho_X)} \le \epsilon,$$

where $g^{(t)}_{nn}$ and $g^{(t)}_{ntk}$ are iterates obtained by averaged stochastic gradient descent.

This proposition is the key because it connects two learning dynamics for a neural network and its NTK through overparameterization without the positivity of the NTK. Instead of the positivity, this proposition says that overparameterization increases the time spent in the NTK regime, where the learning dynamics for neural networks can be characterized by the NTK. As a result, the averaged stochastic gradient descent for the overparameterized two-layer neural networks can fully inherit preferable properties from learning dynamics in the NTK as long as the network width is sufficiently large. See Appendix A for details.

1.3. ADDITIONAL RELATED WORK

Besides the abovementioned studies, there are several works (Chizat & Bach, 2018b; Wu et al., 2019; Zou & Gu, 2019) that have shown the global convergence of (stochastic) gradient descent for overparameterized neural networks, essentially relying on the positivity condition of NTK. Moreover, faster convergence rates of second-order methods such as the natural gradient descent and Gauss-Newton method have been demonstrated (Zhang et al., 2019; Cai et al., 2019) in a similar setting, and a further improvement of the Gauss-Newton method with respect to the cost per iteration was obtained in Brand et al. (2020). There have been several attempts to improve the overparameterization size in the NTK theory. For the regression problem, Song & Yang (2019) succeeded in reducing the network width required in Du et al. (2019b) by utilizing a matrix Chernoff bound. For the classification problem, the positivity condition can be relaxed to a separability condition using another reference model (Cao & Gu, 2019a; b; Nitanda et al., 2019; Ji & Telgarsky, 2019), resulting in mild overparameterization and generalization bounds of $O(T^{-1/2})$ or $O(T^{-1/4})$ on classification errors. For an averaged stochastic gradient descent on classification problems in RKHSs, linear convergence rates of the expected classification errors have been demonstrated in Pillaud-Vivien et al. (2018a); Nitanda & Suzuki (2019). Although our study focuses on regression problems, we describe how to combine their results with our theory in the Appendix. The mean field regime (Nitanda & Suzuki, 2017; Mei et al., 2018; Chizat & Bach, 2018a), which is a different limit of neural networks from the NTK, is also important for the global convergence analysis of the gradient descent. In the mean field regime, the learning dynamics follows the Wasserstein gradient flow, which enables us to establish convergence analysis in the probability space.

Moreover, several studies (Allen-Zhu & Li, 2019; Bai & Lee, 2019; Ghorbani et al., 2019; Allen-Zhu & Li, 2020; Li et al., 2020; Suzuki, 2020) attempt to show the superiority of neural networks over kernel methods including the NTK. Although it is also very important to study the conditions beyond the NTK regime, they do not affect our contribution and vice versa. Indeed, which method is better depends on the assumptions on the target function and data distribution, so it is important to investigate the optimal convergence rate and optimal method in each regime. As shown in our study, the averaged stochastic gradient descent for learning a neural network achieves the optimal convergence rate if the target function is included in the RKHS associated with the NTK with a small norm. This means that no method outperforms the averaged stochastic gradient descent in this setting.

2. PRELIMINARY

Let $\mathcal{X} \subset \mathbb{R}^d$ and $\mathcal{Y} \subset \mathbb{R}$ be the measurable feature and label spaces, respectively. We denote by $\rho$ a data distribution on $\mathcal{X} \times \mathcal{Y}$, by $\rho_X$ the marginal distribution on $X$, and by $\rho(\cdot|X)$ the conditional distribution on $Y$, where $(X, Y) \sim \rho$. Let $\ell(z, y)$ ($z \in \mathbb{R}$, $y \in \mathcal{Y}$) be the squared loss function $\frac{1}{2}(z - y)^2$, and let $g: \mathcal{X} \to \mathbb{R}$ be a hypothesis. The expected risk function is defined as follows:

$$L(g) \overset{\mathrm{def}}{=} \mathbb{E}_{(X,Y)\sim\rho}[\ell(g(X), Y)]. \tag{1}$$

The Bayes rule $g_\rho: \mathcal{X} \to \mathbb{R}$ is a global minimizer of $L$ over all measurable functions. For the least squares regression, the Bayes rule is known to be $g_\rho(X) = \mathbb{E}_Y[Y|X]$ and the excess risk of a hypothesis $g$ (which is the difference between the expected risk of $g$ and the expected risk of the Bayes rule $g_\rho$) is expressed as a squared $L_2(\rho_X)$-distance between $g$ and $g_\rho$ (for details, see Cucker & Smale (2002)) up to a constant:

$$L(g) - L(g_\rho) = \frac{1}{2}\|g - g_\rho\|^2_{L_2(\rho_X)},$$

where $\|\cdot\|_{L_2(\rho_X)}$ is the $L_2$-norm weighted by $\rho_X$, defined as $\|g\|_{L_2(\rho_X)} \overset{\mathrm{def}}{=} \left(\int g^2(X)\,\mathrm{d}\rho_X(X)\right)^{1/2}$ ($g \in L_2(\rho_X)$). Hence, the goal of the regression problem is to approximate $g_\rho$ in terms of the $L_2(\rho_X)$-distance in a given hypothesis class.

Two-layer neural networks. The hypothesis class considered in this study is the set of two-layer neural networks, which is formalized as follows. Let $M \in \mathbb{Z}_+$ be the network width (number of hidden nodes). Let $a = (a_1, \ldots, a_M)^\top \in \mathbb{R}^M$ ($a_r \in \mathbb{R}$) be the parameters of the output layer, $B = (b_1, \ldots, b_M) \in \mathbb{R}^{d \times M}$ ($b_r \in \mathbb{R}^d$) be the parameters of the input layer, and $c = (c_1, \ldots, c_M)^\top \in \mathbb{R}^M$ ($c_r \in \mathbb{R}$) be the bias parameters. We denote by $\Theta$ the collection of all parameters $(a, B, c)$, and consider two-layer neural networks:

$$g_\Theta(x) = \frac{1}{\sqrt{M}} \sum_{r=1}^{M} a_r \sigma(b_r^\top x + \gamma c_r),$$

where $\sigma: \mathbb{R} \to \mathbb{R}$ is an activation function and $\gamma > 0$ is a scale of the bias terms.

Symmetric initialization. We adopt symmetric initialization for the parameters $\Theta$. Let $a^{(0)} = (a^{(0)}_1, \ldots, a^{(0)}_M)^\top$, $B^{(0)} = (b^{(0)}_1, \ldots, b^{(0)}_M)$, and $c^{(0)} = (c^{(0)}_1, \ldots, c^{(0)}_M)^\top$ denote the initial values for $a$, $B$, and $c$, respectively. Assume that the number of hidden units $M \in \mathbb{Z}_+$ is even. The parameters for the output layer are initialized as $a$ [...] are independently drawn from the distribution $\mu_0$. The bias parameters are initialized as $c^{(0)}_r = 0$ for $r \in \{1, \ldots, M\}$. The aim of the symmetric initialization is to make an initial function $g_{\Theta^{(0)}} = 0$, where $\Theta^{(0)} = (a^{(0)}, B^{(0)}, c^{(0)})$. This is just for theoretical simplicity. Indeed, we can relax the symmetric initialization by considering an additional error stemming from the nonzero initialization in the function space.

Regularized expected risk minimization. Instead of minimizing the expected risk (1) itself, we consider the minimization problem of the regularized expected risk around the initial values:

$$\min_\Theta \; L(g_\Theta) + \frac{\lambda}{2}\left(\|a - a^{(0)}\|_2^2 + \|B - B^{(0)}\|_F^2 + \|c - c^{(0)}\|_2^2\right), \tag{3}$$

where the last term is the $L_2$-regularization at an initial point with a regularization parameter $\lambda > 0$. This regularization forces iterations obtained by optimization algorithms to stay close to the initial value, which enables us to utilize the better convergence property of regularized kernel methods.

Averaged stochastic gradient descent. Stochastic gradient descent is the most popular method for solving large-scale machine learning problems, and its averaged variant is also frequently used to stabilize and accelerate the convergence. In this study, we analyze the generalization ability of an averaged stochastic gradient descent. The update rule is presented in Algorithm 1. Let $\Theta^{(t)} = (a^{(t)}, B^{(t)}, c^{(t)})$ denote the collection of $t$-th iterates of parameters $a \in \mathbb{R}^M$, $B \in \mathbb{R}^{d \times M}$, and $c \in \mathbb{R}^M$. At the $t$-th iterate, stochastic gradient descent using a learning rate $\eta_t$ for the problem (3) with respect to $a$, $B$, $c$ is performed on lines 4-6 for a randomly sampled example $(x_t, y_t) \sim \rho$. These updates can be rewritten in an element-wise fashion as follows. For $r \in \{1, \ldots, M\}$,

$$a^{(t+1)}_r - a^{(0)}_r = (1 - \eta_t\lambda)(a^{(t)}_r - a^{(0)}_r) - \eta_t M^{-1/2}(g_{\Theta^{(t)}}(x_t) - y_t)\,\sigma(b^{(t)\top}_r x_t + \gamma c^{(t)}_r),$$
$$b^{(t+1)}_r - b^{(0)}_r = (1 - \eta_t\lambda)(b^{(t)}_r - b^{(0)}_r) - \eta_t M^{-1/2}(g_{\Theta^{(t)}}(x_t) - y_t)\,a^{(t)}_r \sigma'(b^{(t)\top}_r x_t + \gamma c^{(t)}_r)\,x_t,$$
$$c^{(t+1)}_r - c^{(0)}_r = (1 - \eta_t\lambda)(c^{(t)}_r - c^{(0)}_r) - \eta_t M^{-1/2}(g_{\Theta^{(t)}}(x_t) - y_t)\,a^{(t)}_r \gamma \sigma'(b^{(t)\top}_r x_t + \gamma c^{(t)}_r),$$

where $a^{(t)} = (a^{(t)}_1, \ldots, a^{(t)}_M)^\top$, $B^{(t)} = (b^{(t)}_1, \ldots, b^{(t)}_M)$, and $c^{(t)} = (c^{(t)}_1, \ldots, c^{(t)}_M)^\top$. Finally, a weighted average using weights $\alpha_t$ of the history of parameters is computed on line 9. In our theory, we consider the constant learning rate $\eta_t = \eta$ and uniform averaging $\alpha_t = 1/(T + 1)$.

Algorithm 1 Averaged Stochastic Gradient Descent
 1: Input: number of iterations $T$, regularization parameter $\lambda$, learning rates $(\eta_t)_{t=0}^{T-1}$, averaging weights $(\alpha_t)_{t=0}^{T}$, initial values $\Theta^{(0)} = (a^{(0)}, B^{(0)}, c^{(0)})$
 2: for $t = 0$ to $T - 1$ do
 3:    Randomly draw a sample $(x_t, y_t) \sim \rho$
 4:    $a^{(t+1)} \leftarrow a^{(t)} - \eta_t \partial_a \ell(g_{\Theta^{(t)}}(x_t), y_t) - \eta_t\lambda(a^{(t)} - a^{(0)})$
 5:    $B^{(t+1)} \leftarrow B^{(t)} - \eta_t \partial_B \ell(g_{\Theta^{(t)}}(x_t), y_t) - \eta_t\lambda(B^{(t)} - B^{(0)})$
 6:    $c^{(t+1)} \leftarrow c^{(t)} - \eta_t \partial_c \ell(g_{\Theta^{(t)}}(x_t), y_t) - \eta_t\lambda(c^{(t)} - c^{(0)})$
 7:    $\Theta^{(t+1)} \leftarrow (a^{(t+1)}, B^{(t+1)}, c^{(t+1)})$
 8: end for
 9: $\overline{\Theta}^{(T)} = \left(\sum_{t=0}^{T} \alpha_t a^{(t)}, \sum_{t=0}^{T} \alpha_t B^{(t)}, \sum_{t=0}^{T} \alpha_t c^{(t)}\right)$
10: Return $g_{\overline{\Theta}^{(T)}}$

Integral and Covariance Operators. The integral and covariance operators associated with the kernels, which are the limit of the Gram-matrix as the number of examples goes to infinity, play a crucial role in determining the learning speed. For a given Hilbert space $H$, we denote by $\otimes_H$ the tensor product on $H$, that is, for all $(f, g) \in H^2$, $f \otimes_H g$ defines a linear operator $h \in H \mapsto (f \otimes_H g)h = \langle f, h\rangle_H\, g \in H$. Note that $f \otimes_H g$ naturally induces a bilinear function: $(h, h') \in H \times H \mapsto \langle (f \otimes_H g)h, h'\rangle_H = \langle f, h\rangle_H \langle g, h'\rangle_H$.
When $H$ is a reproducing kernel Hilbert space (RKHS) associated with a bounded kernel $k: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$, the covariance operator $\Sigma: H \to H$ is defined as follows: Set $K_X \overset{\mathrm{def}}{=} k(X, \cdot)$ and $\Sigma = \mathbb{E}_{X \sim \rho_X}[K_X \otimes_H K_X]$. Note that the covariance operator is a restriction of the integral operator on $L_2(\rho_X)$:

$$f \in L_2(\rho_X) \longmapsto \Sigma f = \int_{\mathcal{X}} f(X) K_X \,\mathrm{d}\rho_X \in L_2(\rho_X).$$

We use the same symbol as above for convenience with a slight abuse of notation. Because $\Sigma$ is a compact self-adjoint operator on $L_2(\rho_X)$, $\Sigma$ has the following eigendecomposition: $\Sigma f = \sum_{i=1}^{\infty} \lambda_i \langle f, \phi_i\rangle_{L_2(\rho_X)} \phi_i$ for $f \in L_2(\rho_X)$, where $\{(\lambda_i, \phi_i)\}_{i=1}^{\infty}$ is a pair of eigenvalues and orthogonal eigenfunctions in $L_2(\rho_X)$. For $s \in \mathbb{R}$, the power $\Sigma^s$ is defined as $\Sigma^s f = \sum_{i=1}^{\infty} \lambda_i^s \langle f, \phi_i\rangle_{L_2(\rho_X)} \phi_i$.
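The update rule of Algorithm 1 with a constant learning rate and uniform averaging can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions rather than the authors' implementation: the initialization in `symmetric_init` (output weights $+R$ on the first half and $-R$ on the second half, input weights drawn uniformly on the unit sphere and duplicated across the halves) is an assumed scheme chosen only so that the initial function is zero, the swish activation with $s = 10$ is one possible smooth activation, and all function names are hypothetical.

```python
import numpy as np


def sigmoid(z):
    # tanh form of the logistic function; avoids overflow for large |z|.
    return 0.5 * (1.0 + np.tanh(0.5 * z))


def act(u, s=10.0):
    # Swish activation u * sigmoid(s * u), a smooth stand-in for the ReLU.
    return u * sigmoid(s * u)


def act_grad(u, s=10.0):
    sig = sigmoid(s * u)
    return sig + s * u * sig * (1.0 - sig)


def symmetric_init(M, d, R, rng):
    # ASSUMPTION: a = (+R, ..., +R, -R, ..., -R), b_r shared between the two
    # halves and uniform on the unit sphere, c = 0. This gives g_Theta0 = 0.
    half = M // 2
    b_half = rng.standard_normal((d, half))
    b_half /= np.linalg.norm(b_half, axis=0, keepdims=True)
    a0 = np.concatenate([np.full(half, R), np.full(half, -R)])
    B0 = np.concatenate([b_half, b_half], axis=1)
    return a0, B0, np.zeros(M)


def predict(theta, x, gamma=1.0):
    # g_Theta(x) = M^{-1/2} * sum_r a_r * sigma(b_r^T x + gamma * c_r)
    a, B, c = theta
    return float(a @ act(B.T @ x + gamma * c) / np.sqrt(a.size))


def averaged_sgd(sample, T, lam, eta, M, d, R=1.0, gamma=1.0, seed=0):
    """Constant-step SGD on the regularized risk with uniform averaging."""
    if M % 2:
        raise ValueError("M must be even")
    rng = np.random.default_rng(seed)
    a0, B0, c0 = symmetric_init(M, d, R, rng)
    a, B, c = a0.copy(), B0.copy(), c0.copy()
    sums = [a.copy(), B.copy(), c.copy()]  # running sums, starting at t = 0
    for _ in range(T):
        x, y = sample(rng)  # one fresh example (x_t, y_t) per step
        pre = B.T @ x + gamma * c
        resid = a @ act(pre) / np.sqrt(M) - y  # derivative of the squared loss
        grad_a = resid * act(pre) / np.sqrt(M)
        common = resid * a * act_grad(pre) / np.sqrt(M)
        grad_B = np.outer(x, common)
        grad_c = gamma * common
        # Gradient step plus the pull back toward the initial values.
        a, B, c = (a - eta * grad_a - eta * lam * (a - a0),
                   B - eta * grad_B - eta * lam * (B - B0),
                   c - eta * grad_c - eta * lam * (c - c0))
        sums[0] += a
        sums[1] += B
        sums[2] += c
    theta_bar = tuple(v / (T + 1) for v in sums)  # alpha_t = 1 / (T + 1)
    return theta_bar, (a0, B0, c0)
```

A call such as `averaged_sgd(sample, T=2000, lam=1e-3, eta=0.04, M=200, d=2)`, where `sample(rng)` returns one pair `(x, y)` with `x` on the unit sphere, returns the averaged parameters and the initial parameters; `predict` then evaluates the corresponding network. The step size $0.04$ is chosen to respect the condition $4(6 + \lambda)\eta \le 1$ used later in the analysis.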

3. MAIN RESULTS: MINIMAX OPTIMAL CONVERGENCE RATES

In this section, we present the main results regarding the fast convergence rates of the averaged stochastic gradient descent under a certain condition on the NTK and target function $g_\rho$.

Neural tangent kernel. The NTK is a recently developed kernel function and has been shown to be extremely useful in demonstrating the global convergence of the gradient descent method for neural networks (cf. Jacot et al., 2018). The NTK in our setting is defined as follows: for all $x, x' \in \mathcal{X}$,

$$k_\infty(x, x') \overset{\mathrm{def}}{=} \mathbb{E}_{b^{(0)}}\big[\sigma(b^{(0)\top} x)\,\sigma(b^{(0)\top} x')\big] + R^2 (x^\top x' + \gamma^2)\, \mathbb{E}_{b^{(0)}}\big[\sigma'(b^{(0)\top} x)\,\sigma'(b^{(0)\top} x')\big], \tag{4}$$

where the expectation is taken with respect to $b^{(0)} \sim \mu_0$. The NTK is the key to the global convergence of a neural network because it makes a connection between the (averaged) stochastic gradient descent for a neural network and the RKHS associated with $k_\infty$ (see Proposition A). Although this type of connection has been shown in previous studies (Arora et al., 2019b; Weinan et al., 2019; Lee et al., 2019; 2020), note that their results are inapplicable to our theory because we consider the population risk. Indeed, our study is the first to establish this connection for an (averaged) stochastic gradient descent in terms of the uniform distance on the support of the data distribution, enabling us to obtain faster convergence rates. We note that the NTK $k_\infty$ is the sum of two NTKs, that is, the first and second terms in (4) are NTKs for the output layer and for the input layer with bias, respectively.
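The two terms of the NTK defined above can be approximated by Monte Carlo over random input weights, and the eigenvalues of the resulting Gram matrix divided by the sample size give a rough estimate of the spectrum of the associated integral operator. The sketch below uses the ReLU and inputs drawn uniformly on the unit sphere; the choice of $\mu_0$ as the uniform distribution on the sphere, the sample sizes, and the function names are assumptions made for illustration only.

```python
import numpy as np


def relu_ntk_gram(X, M=2000, R=1.0, gamma=1.0, seed=0):
    """Monte Carlo estimate of the kernel Gram matrix on the rows of X.

    ASSUMPTION: mu_0 is taken to be the uniform distribution on the unit
    sphere, sampled by normalizing Gaussian vectors.
    """
    rng = np.random.default_rng(seed)
    _, d = X.shape
    B = rng.standard_normal((d, M))
    B /= np.linalg.norm(B, axis=0, keepdims=True)
    pre = X @ B                       # pre-activations b_r^T x, shape (n, M)
    feat = np.maximum(pre, 0.0)       # sigma(b^T x) for the ReLU
    deriv = (pre > 0.0).astype(float) # sigma'(b^T x) for the ReLU
    k_out = feat @ feat.T / M         # output-layer term
    k_in = R**2 * (X @ X.T + gamma**2) * (deriv @ deriv.T / M)  # input-layer term
    return k_out + k_in


def estimate_eigenvalues(n=500, d=5, data_seed=1, **kernel_kwargs):
    """Eigenvalues of K / n, in decreasing order, as a spectrum estimate."""
    rng = np.random.default_rng(data_seed)
    X = rng.standard_normal((n, d))
    X /= np.linalg.norm(X, axis=1, keepdims=True)  # uniform on the sphere
    K = relu_ntk_gram(X, **kernel_kwargs)
    return np.linalg.eigvalsh(K / n)[::-1]


eigs = estimate_eigenvalues()
print(eigs[:10])
```

Plotting `eigs` on a log-log scale for several values of `d` would give a picture of the eigenvalue decay; the sizes used here are far smaller than those quoted for Figure 1 so that the sketch runs quickly.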

3.1. GLOBAL CONVERGENCE ANALYSIS

Let $\mathcal{H}_\infty$ be an RKHS associated with NTK $k_\infty$, and let $\Sigma_\infty$ be the corresponding integral operator. Let $\{\lambda_i\}_{i=1}^{\infty}$ denote the eigenvalues of $\Sigma_\infty$ sorted in decreasing order: $\lambda_1 \ge \lambda_2 \ge \cdots$.

Assumption 1.
(A1) There exists $C > 0$ such that $\|\sigma''\|_\infty \le C$, $\|\sigma'\|_\infty \le 2$, and $|\sigma(u)| \le 1 + |u|$ for all $u \in \mathbb{R}$.
(A2) $\mathrm{supp}(\rho_X) \subset \{x \in \mathbb{R}^d \mid \|x\|_2 \le 1\}$, $\mathcal{Y} \subset [-1, 1]$, $R = 1$, and $\gamma \in [0, 1]$.
(A3) There exists $r \in [1/2, 1]$ such that $g_\rho \in \Sigma_\infty^r(L_2(\rho_X))$, i.e., $\|\Sigma_\infty^{-r} g_\rho\|_{L_2(\rho_X)} < \infty$.
(A4) There exists $\beta > 1$ such that $\lambda_i = \Theta(i^{-\beta})$.

Remark.
• (A1): Typical smooth activation functions, such as sigmoid and tanh functions, and smooth approximations of the ReLU, such as swish (Ramachandran et al., 2017), which performs as well as or even better than the ReLU, satisfy Assumption (A1). This condition is used to relate the two learning dynamics between neural networks and kernel methods (see Proposition A).
• (A2): The boundedness (A2) of the feature space and label are often assumed for stochastic optimization and least squares regression for theoretical guarantees (see Steinwart et al. (2009)). Note that these constants in (A2) can be relaxed to arbitrary constants.
• (A3): Assumption (A3) measures the complexity of $g_\rho$ because $\Sigma_\infty$ can be considered as a smoothing operator using a kernel $k_\infty$. A larger $r$ indicates a faster decay of the coefficients of expansion of $g_\rho$ based on the eigenfunctions of $\Sigma_\infty$ and smoothens $g_\rho$. In addition, $\Sigma_\infty^r(L_2(\rho_X))$ shrinks with respect to $r$ and $\Sigma_\infty^{1/2}(L_2(\rho_X)) = \mathcal{H}_\infty$, resulting in $g_\rho \in \mathcal{H}_\infty$. This condition is used to control the bias of the estimators through $L_2$-regularization. The notation $\Sigma_\infty^{-r} g_\rho$ represents any function $G \in L_2(\rho_X)$ such that $g_\rho = \Sigma_\infty^r G$.
• (A4): Assumption (A4) controls the complexity of the hypothesis class $\mathcal{H}_\infty$. A larger $\beta$ indicates a faster decay of the eigenvalues and makes $\mathcal{H}_\infty$ smaller. This assumption is essentially needed to bound the variance of the estimators efficiently and derive a fast convergence rate. Theorem 1 and Corollaries 1 and 2 hold even if the condition in (A4) is relaxed to $\lambda_i = O(i^{-\beta})$; the lower bound $\lambda_i = \Omega(i^{-\beta})$ is necessary only for making the obtained rates minimax optimal.

Under these assumptions, we derive the convergence rate of the averaged stochastic gradient descent for an overparameterized two-layer neural network; the proof is provided in the Appendix.

Theorem 1. Suppose Assumptions (A1)-(A3) hold. Run Algorithm 1 with a constant learning rate $\eta$ satisfying $4(6 + \lambda)\eta \le 1$. Then, for any $\epsilon > 0$, $\|\Sigma_\infty\|_{\mathrm{op}} \ge \lambda > 0$, $\delta \in (0, 1)$, and $T \in \mathbb{Z}_+$, there exists $M_0 \in \mathbb{Z}_+$ such that for any $M \ge M_0$, the following holds with probability at least $1 - \delta$ over the random choice of features $\Theta^{(0)}$:

$$\mathbb{E}\Big[\|g_{\overline{\Theta}^{(T)}} - g_\rho\|^2_{L_2(\rho_X)}\Big] \le \epsilon + \alpha\left(\lambda^{2r}\|\Sigma_\infty^{-r} g_\rho\|^2_{L_2(\rho_X)} + \frac{1}{T+1}\|g_\rho\|^2_{\mathcal{H}_\infty} + \frac{1}{\lambda\eta^2(T+1)^2}\|g_\rho\|^2_{\mathcal{H}_\infty}\right) + \frac{\alpha}{T+1}\left(1 + \|g_\rho\|^2_{L_2(\rho_X)} + \|\Sigma_\infty^{-r} g_\rho\|^2_{L_2(\rho_X)}\right)\mathrm{Tr}\big(\Sigma_\infty(\Sigma_\infty + \lambda I)^{-1}\big),$$

where $\alpha > 0$ is a universal constant and $g_{\overline{\Theta}^{(T)}}$ is an iterate obtained through Algorithm 1.

Remark. The first term $\epsilon$ and the second term $\lambda^{2r}\|\Sigma_\infty^{-r} g_\rho\|^2_{L_2(\rho_X)}$ are the approximation error and bias, which can be chosen to be arbitrarily small. The first term comes from the approximation of the NTK using finite-sized neural networks, and the second term comes from the $L_2$-regularization, which coincides with a bias term in the theory of least squares regression (Caponnetto & De Vito, 2007). The third and fourth terms come from the convergence of the averaged semi-stochastic gradient descent (which is considered in the proof) in terms of the optimization. The appearance of an inverse dependence on $\lambda$ in the fourth term is common because a smaller $\lambda$ indicates a weaker strong convexity, which slows down the convergence speed of the optimization methods (Rakhlin et al., 2012).
The term $\mathrm{Tr}\big(\Sigma_\infty(\Sigma_\infty + \lambda I)^{-1}\big)$ is the variance from the stochastic approximation of the gradient, and it is referred to as the degree of freedom or the effective dimension, which is known to be unavoidable in kernel regression problems (Caponnetto & De Vito, 2007; Dieuleveut & Bach, 2016; Rudi & Rosasco, 2017).

Global convergence in NTK regime. This theorem shows the global convergence to the Bayes rule $g_\rho$, which is a minimizer over all measurable maps, because the approximation term $\epsilon$ can be made arbitrarily small by taking a sufficiently large network width $M$. The required value of $M$ has an exponential dependence on $T$; note, however, that reducing $M$ is not the main focus of the present study. The key technique is to relate two learning dynamics for two-layer neural networks and kernel methods in an RKHS approximating $\mathcal{H}_\infty$ up to a small error. Unlike existing studies (Du et al., 2019b; Arora et al., 2019a; b; Weinan et al., 2019; Lee et al., 2019; 2020) showing such connections, we establish this connection in terms of the $L_\infty(\rho_X)$-norm, which is more useful in a generalization analysis. Moreover, existing studies essentially rely on the strict positivity of the Gram-matrix to localize all iterates around an initial value, which can slow down the convergence rate in terms of the generalization because the convergence of the eigenvalues of the NTK to zero affects the Rademacher complexity. By contrast, our theory succeeds in demonstrating the global convergence in the NTK regime without the positivity of the NTK.
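To give a feel for the degree of freedom, the following sketch evaluates $\mathrm{Tr}(\Sigma(\Sigma + \lambda I)^{-1}) = \sum_i \lambda_i / (\lambda_i + \lambda)$ for a synthetic spectrum $\lambda_i = i^{-\beta}$ and compares it with the scaling $\lambda^{-1/\beta}$. The spectrum, the truncation level, and the function name are illustrative assumptions; the NTK spectrum itself is not computed here.

```python
import numpy as np


def degrees_of_freedom(eigs, lam):
    """Effective dimension sum_i eigs_i / (eigs_i + lam)."""
    eigs = np.asarray(eigs, dtype=float)
    return float(np.sum(eigs / (eigs + lam)))


beta = 2.0
# Synthetic polynomially decaying spectrum, truncated at 200000 terms.
eigs = np.arange(1, 200001, dtype=float) ** (-beta)

for lam in (1e-1, 1e-2, 1e-3):
    dof = degrees_of_freedom(eigs, lam)
    # If dof scales like lam^(-1/beta), this ratio stays bounded.
    ratio = dof * lam ** (1.0 / beta)
    print(f"lambda={lam:g}: dof={dof:.2f}, dof * lambda^(1/beta)={ratio:.3f}")
```

For this synthetic spectrum the ratio remains of order one as $\lambda$ shrinks, which is the behavior that makes the variance term scale like $\lambda^{-1/\beta}/T$ under a polynomial eigenvalue decay.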

3.2. OPTIMAL CONVERGENCE RATE

We derive the fast convergence rate from Theorem 1 by utilizing Assumption (A4), which defines the complexity of the NTK. The regularization parameter $\lambda$ mainly controls the trade-off within the generalization bound, that is, a smaller value decreases the bias term but increases the variance term including the degree of freedom. The degree of freedom $\mathrm{Tr}\big(\Sigma_\infty(\Sigma_\infty + \lambda I)^{-1}\big)$ can be specified by imposing Assumption (A4) because it determines the decay rate of the eigenvalues of $\Sigma_\infty$. As a result, this trade-off between bias and variance depending on the choice of $\lambda$ becomes clear, and we can determine the optimal value. Concretely, by setting $\lambda = T^{-\beta/(2r\beta+1)}$, the sum of the bias and variance terms is minimized, and these terms become asymptotically equivalent.

Corollary 1. Suppose Assumptions (A1)-(A4) hold. Run Algorithm 1 with the constant learning rate $\eta = O(1)$ satisfying $4(6 + \lambda)\eta \le 1$ and $\lambda = T^{-\beta/(2r\beta+1)}$. Then, for any $\epsilon > 0$, $\delta \in (0, 1)$ and $T \in \mathbb{Z}_+$ satisfying $\|\Sigma_\infty\|_{\mathrm{op}} \ge \lambda$, there exists $M_0 \in \mathbb{Z}_+$ such that for any $M \ge M_0$, the following holds with probability at least $1 - \delta$ over the random choice of random features $\Theta^{(0)}$:

$$\mathbb{E}\Big[\|g_{\overline{\Theta}^{(T)}} - g_\rho\|^2_{L_2(\rho_X)}\Big] \le \epsilon + \alpha T^{-\frac{2r\beta}{2r\beta+1}}\left(1 + \|\Sigma_\infty^{-r} g_\rho\|^2_{L_2(\rho_X)}\right),$$

where $\alpha > 0$ is a universal constant and $g_{\overline{\Theta}^{(T)}}$ is an iterate obtained by Algorithm 1.

The resulting convergence rate is $O(T^{-2r\beta/(2r\beta+1)})$ with respect to $T$ by considering a sufficiently large network width $M$ such that the error $\epsilon$ stemming from the approximation of NTK can be ignored. Because $T$ corresponds to the number of examples used to learn a predictor $g_{\overline{\Theta}^{(T)}}$, this convergence rate is simply the generalization error bound for the averaged stochastic gradient descent. This rate is always faster than $T^{-1/2}$ and is known to be the minimax optimal rate of estimation (Caponnetto & De Vito, 2007; Blanchard & Mücke, 2018) in $\mathcal{H}_\infty$ in the following sense. Let $\mathcal{P}(\beta, r)$ be a data distribution class satisfying Assumptions (A2)-(A4). Then,

$$\lim_{\tau \to 0} \liminf_{T \to \infty} \inf_{h^{(T)}} \sup_{\rho} \mathbb{P}\left[\|h^{(T)} - g_\rho\|^2_{L_2(\rho_X)} > \tau T^{-\frac{2r\beta}{2r\beta+1}}\right] = 1,$$

where $\rho$ is taken in $\mathcal{P}(\beta, r)$ and $h^{(T)}$ is taken over all mappings $(x_t, y_t)_{t=0}^{T-1} \mapsto h^{(T)} \in \mathcal{H}_\infty$.
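The choice of $\lambda$ in Corollary 1 can be recovered by a heuristic balance of the bias and variance terms of Theorem 1. The following derivation is a sketch that ignores constants and assumes the standard estimate $\mathrm{Tr}(\Sigma_\infty(\Sigma_\infty + \lambda I)^{-1}) = O(\lambda^{-1/\beta})$ under $\lambda_i = O(i^{-\beta})$; it is not a substitute for the proof in the Appendix.

```latex
\begin{align*}
\text{bias} &\asymp \lambda^{2r},
\qquad
\text{variance} \asymp \frac{1}{T}\,
  \mathrm{Tr}\big(\Sigma_\infty(\Sigma_\infty+\lambda I)^{-1}\big)
  \lesssim \frac{\lambda^{-1/\beta}}{T}, \\
\lambda^{2r} = \frac{\lambda^{-1/\beta}}{T}
  &\iff \lambda^{2r + 1/\beta} = T^{-1}
  \iff \lambda = T^{-\frac{\beta}{2r\beta+1}}, \\
\lambda^{2r} &= T^{-\frac{2r\beta}{2r\beta+1}}, \\
\frac{1}{\lambda T^{2}} &= T^{\frac{\beta}{2r\beta+1}-2}
  \le T^{-\frac{2r\beta}{2r\beta+1}}
  \iff \beta(1-2r) \le 2 .
\end{align*}
```

The last line indicates that, for constant $\eta$, the optimization term $1/(\lambda\eta^2(T+1)^2)$ does not dominate, since $\beta(1 - 2r) \le 0$ whenever $r \ge 1/2$; the term of order $1/(T+1)$ is of lower order because the exponent $2r\beta/(2r\beta+1)$ is less than $1$.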

3.3. EXPLICIT OPTIMAL CONVERGENCE RATE FOR SMOOTH APPROXIMATION OF RELU

For smooth activation functions that sufficiently approximate the ReLU, an optimal explicit convergence rate can be derived under the setting in which the target function is specified by NTK with the ReLU, and the data are distributed uniformly on a sphere. We denote the ReLU activation by $\sigma(u) = \max\{0, u\}$ and a smooth approximation of ReLU by $\sigma^{(s)}$, which converges to ReLU as $s \to \infty$ in the following sense. We make alternative assumptions to (A1), (A2), and (A3):

Assumption 2.
(A1') $\sigma^{(s)}$ satisfies (A1). $\sigma^{(s)}$ and $\sigma^{(s)\prime}$ converge pointwise almost surely to $\sigma$ and $\sigma'$ as $s \to \infty$.
(A2') $\rho_X$ is a uniform distribution on $\{x \in \mathbb{R}^d \mid \|x\|_2 = 1\}$. $\mathcal{Y} \subset [-1, 1]$, $R = 1$, and $\gamma \in (0, 1]$.

(A1') and (A2') are special cases of (A1) and (A2). There are several activation functions that satisfy this condition, including swish (Ramachandran et al., 2017): $\sigma^{(s)}(u) = \frac{u}{1 + \exp(-su)}$. Under these conditions, we can estimate the decay rate of the eigenvalues for the ReLU as $\beta = 1 + \frac{1}{d-1}$, yielding the explicit optimal convergence rate by adapting the proof of Theorem 1 to the current setting. Note that Algorithm 1 is run for a neural network with a smooth approximation $\sigma^{(s)}$ of the ReLU.

Corollary 2. Suppose Assumptions (A1'), (A2'), and (A3') hold. Run Algorithm 1 with the constant learning rate $\eta = O(1)$ satisfying $4(6 + \lambda)\eta \le 1$, and $\lambda = T^{-d/(2rd+d-1)}$. Given any $\epsilon > 0$, $\delta \in (0, 1)$ and $T \in \mathbb{Z}_+$ satisfying $\|\Sigma_\infty\|_{\mathrm{op}} \ge 2\lambda$, let $s$ be an arbitrary and sufficiently large positive value. Then, there exists $M_0 \in \mathbb{Z}_+$ such that for any $M \ge M_0$, the following holds with probability at least $1 - \delta$ over the random choice of random features $\Theta^{(0)}$:

$$\mathbb{E}\Big[\|g_{\overline{\Theta}^{(T)}} - g_\rho\|^2_{L_2(\rho_X)}\Big] \le \epsilon + \alpha T^{-\frac{2rd}{2rd+d-1}}\left(1 + \|\Sigma_\infty^{-r} g_\rho\|^2_{L_2(\rho_X)}\right),$$

where $\alpha > 0$ is a universal constant and $g_{\overline{\Theta}^{(T)}}$ is an iterate obtained by Algorithm 1.
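The sense in which swish approaches the ReLU as $s$ grows can be checked numerically. The sketch below evaluates the largest gap between $\sigma^{(s)}$ and the ReLU on a grid, together with the largest derivative magnitude, which bears on the bound $\|\sigma'\|_\infty \le 2$ in (A1). The grid, the values of $s$, and the function names are illustrative choices.

```python
import numpy as np


def sigmoid(z):
    # tanh form of the logistic function; avoids overflow for large |z|.
    return 0.5 * (1.0 + np.tanh(0.5 * z))


def swish(u, s):
    # sigma^(s)(u) = u / (1 + exp(-s * u))
    return u * sigmoid(s * u)


def swish_grad(u, s):
    sig = sigmoid(s * u)
    return sig + s * u * sig * (1.0 - sig)


def relu(u):
    return np.maximum(u, 0.0)


u = np.linspace(-3.0, 3.0, 6001)
for s in (1.0, 10.0, 100.0):
    gap = np.max(np.abs(swish(u, s) - relu(u)))   # sup-distance on the grid
    max_grad = np.max(np.abs(swish_grad(u, s)))   # compare with the bound 2
    print(f"s={s:g}: max gap to ReLU={gap:.4f}, max |derivative|={max_grad:.4f}")
```

On this grid the gap shrinks roughly in proportion to $1/s$, while the derivative stays bounded well below $2$ for every $s$ tried, consistent with swish being an admissible smooth approximation of the ReLU.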

4. EXPERIMENTS

We verify the importance of the specification of target functions by showing that misspecification significantly slows down the convergence speed. To evaluate the misspecification, we consider single-layer learning as well as two-layer learning, and we see the advantage of two-layer learning. Here, note that, with evident modification of the proofs, the counterparts of Corollaries 1 and 2 for learning a single layer also hold by replacing $\Sigma_\infty$ with the covariance operator $\Sigma_{a,\infty}$ ($\Sigma_{b,\infty}$) associated with $k_{a,\infty}$ ($k_{b,\infty}$), where

$$k_{a,\infty}(x, x') = \mathbb{E}_{b^{(0)}}\big[\sigma(b^{(0)\top} x)\,\sigma(b^{(0)\top} x')\big], \qquad k_{b,\infty}(x, x') = R^2(x^\top x' + \gamma^2)\,\mathbb{E}_{b^{(0)}}\big[\sigma'(b^{(0)\top} x)\,\sigma'(b^{(0)\top} x')\big],$$

which are components of $k_\infty = k_{a,\infty} + k_{b,\infty}$ corresponding to the output and input layers, respectively. Then, from Corollaries 1 and 2, a Bayes rule $g_\rho$ is learned efficiently by optimizing the layer which has a small norm $\|\Sigma^{-r} g_\rho\|_{L_2(\rho_X)}$ for $\Sigma \in \{\Sigma_{a,\infty}, \Sigma_{b,\infty}, \Sigma_\infty\}$.

Figure 2: Norms $\|\Sigma^{-r} g_\rho\|_{L_2(\rho_X)}$ for $\Sigma \in \{\Sigma_{a,\infty}, \Sigma_{b,\infty}, \Sigma_\infty\}$. Bayes rules $g_\rho$ are averages of eigenfunctions of $\Sigma_{a,\infty}$ (left), $\Sigma_{b,\infty}$ (middle), and $\Sigma_\infty$ (right) corresponding to the 10 largest eigenvalues excluding the first and second, with the setting: $R = 1/(20\sqrt{2})$, $\gamma = 10\sqrt{2}$, and $\rho_X$ is the uniform distribution on the unit sphere in $\mathbb{R}^2$. To estimate eigenvalues and eigenfunctions, we draw $10^4$ samples from $\rho_X$ and $M = 2 \times 10^4$ hidden nodes of a two-layer ReLU.

Empirical observations. We observe that $g_\rho$ has the smallest norm with respect to the integral operator which specifies $g_\rho$ and has a comparably small norm with respect to $\Sigma_\infty$, even for the cases where $g_\rho$ is specified by $\Sigma_{a,\infty}$ or $\Sigma_{b,\infty}$. This observation suggests the efficiency of learning the layer corresponding to $g_\rho$ and of learning both layers, and it is empirically verified. We run Algorithm 1 10 times with respect to the output layer (blue), the input layer (purple), and both layers (orange) of two-layer swish networks with $s = 10$. Figure 2 (Bottom) depicts the average and standard deviation of test errors. From the figure, we see that learning the layer corresponding to $g_\rho$ and learning both layers exhibit faster convergence, and that misspecification significantly slows down the convergence speed in all cases.

5. CONCLUSION

We analyzed the convergence of the averaged stochastic gradient descent for overparameterized two-layer neural networks for a regression problem. Through the development of a new proof strategy that does not rely on the positivity of the NTK, we proved that the global convergence (Theorem 1) relies only on the overparameterization. Moreover, we demonstrated the minimax optimal convergence rates (Corollary 1) in terms of the generalization error depending on the complexities of the target function and the hypothesis class, and showed the explicit optimal rate for the smooth approximation of the ReLU.

We consider a random feature approximation of the NTK $k_\infty$: for an initial value $B^{(0)} = (b_r^{(0)})_{r=1}^M$ and $\forall x, \forall x' \in \mathcal{X}$,
$$k_M(x, x') \overset{\mathrm{def}}{=} \frac{1}{M}\sum_{r=1}^M \sigma(b_r^{(0)\top} x)\sigma(b_r^{(0)\top} x') + \frac{x^\top x' + \gamma^2}{M}\sum_{r=1}^M \sigma'(b_r^{(0)\top} x)\sigma'(b_r^{(0)\top} x').$$
We can confirm that $k_M$ is an approximation of the NTK, that is, $k_M$ converges to $k_\infty$ uniformly over $\mathrm{supp}(\rho_X) \times \mathrm{supp}(\rho_X)$ almost surely by the uniform law of large numbers. We denote by $(\mathcal{H}_M, \langle\cdot,\cdot\rangle_{\mathcal{H}_M})$ the RKHS associated with $k_M$. By the assumptions, we see that $k_M(x, x') \le 12$ for $\forall (x, x') \in \mathrm{supp}(\rho_X) \times \mathrm{supp}(\rho_X)$.

We introduce the averaged stochastic gradient descent in $\mathcal{H}_M$ (see Algorithm 2) as a reference for Algorithm 1. The notation $G^{(t)}$ represents a stochastic gradient at the $t$-th iterate: $G^{(t)} \overset{\mathrm{def}}{=} \partial_z \ell(g^{(t)}(x_t), y_t)\, k_M(x_t, \cdot)$.

Algorithm 2 Reference ASGD in $\mathcal{H}_M$
1: Input: number of iterations $T$, regularization parameter $\lambda$, learning rates $(\eta_t)_{t=0}^{T-1}$, averaging weights $(\alpha_t)_{t=0}^{T}$
2: $g^{(0)} \leftarrow 0$
3: for $t = 0$ to $T - 1$ do
4:   Randomly draw a sample $(x_t, y_t) \sim \rho$
5:   $g^{(t+1)} \leftarrow (1 - \eta_t \lambda) g^{(t)} - \eta_t G^{(t)}$
6: end for
7: Return $\bar g^{(T)} = \sum_{t=0}^{T} \alpha_t g^{(t)}$

The following proposition shows the equivalence between the averaged stochastic gradient descent for two-layer neural networks and that in $\mathcal{H}_M$, up to a small constant depending on $M$.

Proposition A. Suppose Assumptions (A1) and (A2) hold. Run Algorithms 1 and 2 with the constant learning rate $\eta_t = \eta$ satisfying $\eta\lambda < 1$ and $\eta \le 1$. Moreover, assume that they share the same hyper-parameter settings and the same examples $(x_t, y_t)_{t=0}^{T-1}$ to compute the stochastic gradients. Then, for arbitrary $T \in \mathbb{Z}_+$ and $\epsilon > 0$, there exists $M \in \mathbb{Z}_+$ depending only on $T$ and $\epsilon$ such that $\forall t \le T$,
$$\|g_{\bar\Theta^{(t)}} - \bar g^{(t)}\|_{L_\infty(\rho_X)} \le \epsilon,$$
where $g_{\bar\Theta^{(t)}}$ and $\bar g^{(t)}$ are iterates obtained by Algorithms 1 and 2, respectively.

Remark. Note that this proposition also holds for non-averaged SGD because it is the special case of averaged SGD in which exactly one $\alpha_t$ is set to 1.

Key idea. This proposition is the key because it connects the learning dynamics of neural networks and of the RKHS $\mathcal{H}_M$ by utilizing overparameterization without the positivity of the NTK, unlike existing studies (Weinan et al., 2019; Arora et al., 2019b) that provide such a connection for continuous gradient flow with a positive NTK. Instead of relying on the positivity of the NTK, Proposition A says that overparameterization increases the time spent in the NTK regime, where the learning dynamics of neural networks can be characterized by the NTK. As a result, because $M$ is free from the other hyper-parameters, the averaged stochastic gradient descent for overparameterized two-layer neural networks can fully inherit the preferable properties of the learning dynamics in $\mathcal{H}_M$ with an appropriate choice of learning rates and regularization parameters, as long as the network width is sufficiently large depending only on the number of iterations and the required accuracy.
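The random-feature kernel $k_M$ and the reference ASGD of Algorithm 2 can be sketched in a few lines of NumPy. This is a minimal illustration under our own assumptions, not the authors' implementation: it uses the swish activation with $s = 10$, the squared loss (so that $\partial_z \ell(z, y) = z - y$), a constant step size, uniform averaging weights $\alpha_t = 1/(T+1)$, and it represents $g^{(t)}$ by its coefficients on $k_M(x_s, \cdot)$. All function names are hypothetical.

```python
import numpy as np

def swish(u, s=10.0):
    return u / (1.0 + np.exp(-s * u))

def swish_grad(u, s=10.0):
    sig = 1.0 / (1.0 + np.exp(-s * u))
    return sig + s * u * sig * (1.0 - sig)

def make_k_M(B0, gamma, s=10.0):
    """Random-feature kernel k_M built from initial weights B0 of shape (M, d)."""
    M = B0.shape[0]

    def k_M(X1, X2):
        Z1, Z2 = X1 @ B0.T, X2 @ B0.T  # pre-activations, shapes (n1, M) and (n2, M)
        out_layer = swish(Z1, s) @ swish(Z2, s).T / M
        in_layer = (X1 @ X2.T + gamma**2) * (swish_grad(Z1, s) @ swish_grad(Z2, s).T) / M
        return out_layer + in_layer

    return k_M

def reference_asgd(k_M, xs, ys, eta, lam):
    """Algorithm 2 with squared loss, constant step size, and uniform averaging."""
    T = len(ys)
    K = k_M(xs, xs)        # Gram matrix on the visited samples
    c = np.zeros(T)        # g^(t) = sum_s c[s] * k_M(x_s, .)
    c_sum = np.zeros(T)    # accumulates g^(1), ..., g^(T); g^(0) = 0 adds nothing
    for t in range(T):
        resid = K[t] @ c - ys[t]   # g^(t)(x_t) - y_t
        c = (1.0 - eta * lam) * c  # shrinkage from the regularizer
        c[t] = -eta * resid        # stochastic gradient step along k_M(x_t, .)
        c_sum += c
    c_bar = c_sum / (T + 1)
    return lambda X: k_M(X, xs) @ c_bar

# Toy usage on the unit circle; the target function is an arbitrary illustration.
rng = np.random.default_rng(0)
B0 = rng.standard_normal((200, 2))
B0 /= np.linalg.norm(B0, axis=1, keepdims=True)
xs = rng.standard_normal((300, 2))
xs /= np.linalg.norm(xs, axis=1, keepdims=True)
ys = xs[:, 0]
lam = 0.01
eta = 0.04  # satisfies 4 * (6 + lam) * eta <= 1
g_bar = reference_asgd(make_k_M(B0, gamma=1.0), xs, ys, eta, lam)
print(g_bar(xs[:3]))
```

Storing coefficients rather than functions keeps each step at the cost of one row of the Gram matrix; for long runs one would instead use the explicit finite-dimensional feature map of $k_M$.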

A.2 CONVERGENCE RATE OF THE REFERENCE ASGD

We give the convergence analysis of Algorithm 2 in $\mathcal{H}_M$, which forms part of the bound in Theorem 1. The proofs rely essentially on techniques developed in a series of studies (Bach & Moulines, 2013; Dieuleveut & Bach, 2016; Dieuleveut et al., 2017; Pillaud-Vivien et al., 2018a; Rudi & Rosasco, 2017; Carratino et al., 2018), with several adaptations to our setting.

Let $M \in \mathbb{Z}_+ \cup \{\infty\}$ be a positive integer or $\infty$. We set $K_{M,X} \overset{\mathrm{def}}{=} k_M(X, \cdot)$ and denote by $\Sigma_M$ the covariance operator defined by $k_M$:
$$\Sigma_M \overset{\mathrm{def}}{=} E_{X \sim \rho_X}\left[K_{M,X} \otimes_{\mathcal{H}_M} K_{M,X}\right].$$
We denote by $g_{M,\lambda}$ the minimizer of the regularized risk over $\mathcal{H}_M$:
$$g_{M,\lambda} \overset{\mathrm{def}}{=} \arg\min_{g \in \mathcal{H}_M} \left\{ L(g) + \frac{\lambda}{2}\|g\|^2_{\mathcal{H}_M} \right\}.$$
We remark that $\Sigma_M^{1/2} : L_2(\rho_X) \to \mathcal{H}_M$ is isometric (Cucker & Smale, 2002), that is, $\forall (f, g) \in L_2(\rho_X) \times L_2(\rho_X)$,
$$\langle \Sigma_M^{1/2} f, \Sigma_M^{1/2} g \rangle_{\mathcal{H}_M} = \langle f, g \rangle_{L_2(\rho_X)},$$
and we use this fact frequently. It is known that $g_{M,\lambda}$ is represented as follows (Caponnetto & De Vito, 2007):
$$g_{M,\lambda} = (\Sigma_M + \lambda I)^{-1} E_{(X,Y)}[Y K_{M,X}] = (\Sigma_M + \lambda I)^{-1} \Sigma_M g_\rho.$$
The following theorem provides a convergence rate of Algorithm 2 to the minimizer $g_{M,\lambda}$.

Theorem A. Suppose Assumptions (A1), (A2), and (A3) hold. Run Algorithm 2 with the constant learning rate $\eta_t = \eta$ satisfying $4(6 + \lambda)\eta \le 1$. Then, for $\forall \lambda > 0$ and $\forall \delta \in (0, 1)$, there exists $M_0 > 0$ such that for $\forall M \ge M_0$ the following holds with probability at least $1 - \delta$:
$$E\left[\|\bar g^{(T)} - g_{M,\lambda}\|^2_{L_2(\rho_X)}\right] \le \frac{4}{\eta^2 (T+1)^2}\|(\Sigma_M + \lambda I)^{-1} g_{M,\lambda}\|^2_{L_2(\rho_X)} + \frac{2 \cdot 24^2}{T+1}\|(\Sigma_M + \lambda I)^{-1/2} g_{M,\lambda}\|^2_{L_2(\rho_X)} + \frac{8}{T+1}\left(1 + \|g_\rho\|^2_{L_2(\rho_X)} + 24\|\Sigma_\infty^{-r} g_\rho\|^2_{L_2(\rho_X)}\right)\mathrm{Tr}\left(\Sigma_M(\Sigma_M + \lambda I)^{-1}\right),$$
where $\bar g^{(T)}$ is an iterate obtained by Algorithm 2.

Remark. The first and second terms stem from the optimization speed of the semi-stochastic part of the averaged stochastic gradient descent. The first term has a better dependency on $T$ than the second, but a worse dependency on $\lambda$. This kind of deterioration caused by weak strong convexity is common in first-order optimization methods.
However, as confirmed later, these two terms are dominated by the third term, which corresponds to the variance, when the hyper-parameters are set appropriately.

To make the bound in Theorem A free from the size of $M$, we introduce the following proposition.

Proposition B. Suppose $g_\rho \in \mathcal{H}_\infty$ holds. Under Assumptions (A1) and (A2), for any $\delta \in (0, 1)$, there exists $M_0 \in \mathbb{Z}_+$ such that for any $M \ge M_0$, the following holds with probability at least $1 - \delta$:
$$\|(\Sigma_M + \lambda I)^{-1} g_{M,\lambda}\|^2_{L_2(\rho_X)} \le 2\lambda^{-1}\|g_\rho\|^2_{\mathcal{H}_\infty}, \qquad \|(\Sigma_M + \lambda I)^{-1/2} g_{M,\lambda}\|^2_{L_2(\rho_X)} \le 2\|g_\rho\|^2_{\mathcal{H}_\infty},$$
and if $\lambda \le \|\Sigma_\infty\|_{op}$, then
$$\mathrm{Tr}\left(\Sigma_M(\Sigma_M + \lambda I)^{-1}\right) \le 3\,\mathrm{Tr}\left(\Sigma_\infty(\Sigma_\infty + \lambda I)^{-1}\right).$$

Remark. The last inequality on the degree of freedom was shown in Rudi & Rosasco (2017).

To show the convergence to $g_\rho$, we utilize the following decomposition:
$$\frac{1}{3}\|\bar g^{(T)} - g_\rho\|^2_{L_2(\rho_X)} \le \|\bar g^{(T)} - g_{M,\lambda}\|^2_{L_2(\rho_X)} + \|g_{M,\lambda} - g_{\infty,\lambda}\|^2_{L_2(\rho_X)} + \|g_{\infty,\lambda} - g_\rho\|^2_{L_2(\rho_X)}, \qquad (8)$$
where $g_{\infty,\lambda} \overset{\mathrm{def}}{=} \arg\min_{g \in \mathcal{H}_\infty}\{L(g) + \frac{\lambda}{2}\|g\|^2_{\mathcal{H}_\infty}\}$. The first term is the optimization speed evaluated in Theorem A, and the second and third terms are approximation errors arising from the random feature approximation of the NTK and from the $L_2$-regularization, respectively. These approximation terms can be evaluated by the following existing results. The next proposition is a simplified version of Lemma 8 in Carratino et al. (2018).

Proposition C (Carratino et al. (2018)). Under Assumptions (A1), (A2), and (A3), for any $\epsilon, \lambda > 0$ and $\delta \in (0, 1]$, there exists $M_0 \in \mathbb{Z}_+$ depending on $\epsilon, \lambda, \delta$ such that for any $M \ge M_0$, the following holds with probability at least $1 - \delta$:
$$\|g_{M,\lambda} - g_{\infty,\lambda}\|^2_{L_2(\rho_X)} \le \epsilon.$$

Proposition D (Caponnetto & De Vito (2007)). Under Assumption (A3), it follows that
$$\|g_{\infty,\lambda} - g_\rho\|^2_{L_2(\rho_X)} \le \lambda^{2r}\|\Sigma_\infty^{-r} g_\rho\|^2_{L_2(\rho_X)}.$$

By combining Theorem A and Propositions B, C, and D with the decomposition (8), we can establish the convergence rate of the reference ASGD to $g_\rho$, which is simply the generalization error bound.

Theorem B. Assume the same conditions as in Theorem A. Then, for $\forall \epsilon > 0$, $\forall \lambda$ with $0 < \lambda \le \|\Sigma_\infty\|_{op}$, and $\forall \delta \in (0, 1)$, there exists $M_0 \in \mathbb{Z}_+$ such that for $\forall M \ge M_0$, the following holds with probability at least $1 - \delta$ over the random choice of random features $\Theta^{(0)}$:
$$E\left[\|\bar g^{(T)} - g_\rho\|^2_{L_2(\rho_X)}\right] \le \epsilon + 3\lambda^{2r}\|\Sigma_\infty^{-r} g_\rho\|^2_{L_2(\rho_X)} + \frac{24}{T+1}\left(288 + \frac{1}{\lambda\eta^2(T+1)}\right)\|g_\rho\|^2_{\mathcal{H}_\infty} + \frac{24}{T+1}\left(1 + \|g_\rho\|^2_{L_2(\rho_X)} + 24\|\Sigma_\infty^{-r} g_\rho\|^2_{L_2(\rho_X)}\right)\mathrm{Tr}\left(\Sigma_\infty(\Sigma_\infty + \lambda I)^{-1}\right),$$
where $\bar g^{(T)}$ is an iterate obtained by Algorithm 2.

A.3 CONVERGENCE RATES OF ASGD FOR NEURAL NETWORKS

As explained earlier, the generalization bound for the reference ASGD is inherited by two-layer neural networks through Proposition A with the following decomposition: for an iterate $\bar\Theta^{(T)}$ obtained by Algorithm 1,
$$\|g_{\bar\Theta^{(T)}} - g_\rho\|_{L_2(\rho_X)} \le \|g_{\bar\Theta^{(T)}} - \bar g^{(T)}\|_{L_2(\rho_X)} + \|\bar g^{(T)} - g_\rho\|_{L_2(\rho_X)}.$$
These two terms are bounded by Proposition A (whose $L_\infty(\rho_X)$ bound dominates the $L_2(\rho_X)$ norm) and Theorem B, respectively, under Assumptions (A1)-(A3). This results in Theorem 1, which exhibits a generalization error comparable to that of Theorem B as long as the network width $M$ is sufficiently large.

Theorem 1 immediately leads to the fast convergence rate in Corollary 1 by setting $\eta_t = \eta = O(1)$ satisfying $4(6 + \lambda)\eta \le 1$ and $\lambda = T^{-\beta/(2r\beta+1)}$, together with bounds on $\|g_\rho\|^2_{\mathcal{H}_\infty}$, $\|g_\rho\|^2_{L_2(\rho_X)}$, and the degree of freedom. Because $\beta$ in Assumption (A4) controls the complexity of the hypothesis space $\mathcal{H}_\infty$, it yields a bound on the degree of freedom, as shown in Caponnetto & De Vito (2007):
$$\mathrm{Tr}\left(\Sigma_\infty(\Sigma_\infty + \lambda I)^{-1}\right) = O(\lambda^{-1/\beta}).$$
In addition, the boundedness $\|\Sigma_\infty\|_{op} \le O(1)$ gives
$$\|g_\rho\|_{\mathcal{H}_\infty} = \|\Sigma_\infty^{r - \frac{1}{2}}\Sigma_\infty^{-r} g_\rho\|_{L_2(\rho_X)} \le O\left(\|\Sigma_\infty^{-r} g_\rho\|_{L_2(\rho_X)}\right), \qquad \|g_\rho\|_{L_2(\rho_X)} \le \|\Sigma_\infty^{r}\Sigma_\infty^{-r} g_\rho\|_{L_2(\rho_X)} \le O\left(\|\Sigma_\infty^{-r} g_\rho\|_{L_2(\rho_X)}\right).$$
This finishes the proof of Corollary 1.
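To see why this choice of $\lambda$ balances the bound, it may help to track only the powers of $T$ in Theorem B. The following worked computation is a sketch: it drops constants, $\epsilon$, and the norms of $g_\rho$, takes $\eta = O(1)$, and assumes $r \ge 1/2$ (which is what the bound on $\|g_\rho\|_{\mathcal{H}_\infty}$ above requires).

```latex
% Terms of Theorem B, up to constants:
%   regularization bias:  \lambda^{2r}
%   variance:             T^{-1}\,\mathrm{Tr}(\Sigma_\infty(\Sigma_\infty+\lambda I)^{-1}) = O(T^{-1}\lambda^{-1/\beta})
%   optimization terms:   T^{-1}  and  \lambda^{-1}T^{-2}
\begin{align*}
\lambda = T^{-\frac{\beta}{2r\beta+1}}
\quad\Longrightarrow\quad
\lambda^{2r} &= T^{-\frac{2r\beta}{2r\beta+1}}, \\
T^{-1}\lambda^{-1/\beta} &= T^{-1+\frac{1}{2r\beta+1}} = T^{-\frac{2r\beta}{2r\beta+1}}, \\
\lambda^{-1}T^{-2} &= T^{-2+\frac{\beta}{2r\beta+1}} \le T^{-1}
  \quad\text{since } \tfrac{\beta}{2r\beta+1} \le 1 \text{ for } r \ge \tfrac12, \\
T^{-1} &\le T^{-\frac{2r\beta}{2r\beta+1}}.
\end{align*}
% Hence every term is O(T^{-2r\beta/(2r\beta+1)}), the rate stated in Corollary 1.
```

The bias and variance terms carry exactly the same exponent, so neither a smaller nor a larger $\lambda$ (as a power of $T$) improves the bound obtained this way.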

B PROOF OF PROPOSITION A

We first prove Proposition A, which states the equivalence between the averaged stochastic gradient descent for two-layer neural networks and that in the RKHS associated with $k_M$.

Proof of Proposition A.

Bound the growth of g Θ (t) L∞(ρ X ) . We first show that there exist increasing functions d(t) and M (t) depending only on t uniformly over the choice of the history of examples (x t , y t ) ∞ t=1 used in Algorithms such that g Θ (s) L∞(ρ X ) ≤ d(t) for ∀s ≤ t when M ≥ M (t). We show this statement by the induction. Without loss of generality, we assume that there is no bias term, b (0) r 2 = 1, and supp(ρ X ) ⊂ {x ∈ R d+1 | x 2 ≤ 2} by setting x ← (x, γ) (where γ ∈ (0, 1)). Hence, we consider the update only for parameters a and B. The above statement clearly holds for t = 0. Thus, we assume it holds for t. We recall the specific update rules of the stochastic gradient descent: a (t+1) r -a (0) r = (1 -ηλ)(a (t) r -a (0) r ) - η √ M (g Θ (t) (x t ) -y t )σ(b (t) r x t ), (t+1) r -b (0) r = (1 -ηλ)(b (t) r -b (0) r ) - η √ M (g Θ (t) (x t ) -y t )a (t) r σ (b (t) r x t )x t . Here, let us consider ∀M ≥ M (t). Set d M b (t) = max s≤t,1≤r≤M b (s) r 2 . Then, by expanding equation ( 9), we get |a (t+1) r -a (0) r | ≤ |a (t) r -a (0) r | + η √ M (d(t) + 1)(1 + 2d M b (t)) ≤ η √ M t s=0 (d(s) + 1)(1 + 2d M b (t)) ≤ η(t + 1) √ M (d(t) + 1)(1 + 2d M b (t)), where we used σ(u) ≤ 1 + u and |y t | ≤ 1. As for the term |a (s) r | (s ≤ t + 1), from the similar augment for s and the monotonicity, we have for s ≤ t + 1, |a (s) r | ≤ 1 + |a (s) r -a (0) r | ≤ 1 + η(t + 1) √ M (d(t) + 1)(1 + 2d M b (t)) ≤ 1 + η(t + 1)(d(t) + 1)(1 + 2d M b (t)). Published as a conference paper at ICLR 2021 We next give a bound on b (t+1) r -b (0) r 2 . By expanding equation ( 10), we get b (t+1) r -b (0) r 2 ≤ b (t) r -b (0) r 2 + 4η √ M |a (t) r |(d(t) + 1) ≤ 4η √ M t s=0 |a (s) r |(d(s) + 1) ≤ 4η(t + 1) √ M (d(t) + 1) 1 + η(t + 1) √ M (d(t) + 1)(1 + 2d M b (t)) , where we used σ ∞ ≤ 2 and x t 2 ≤ 2. Here, we evaluate d M b (t). 
From the similar augment for s ≤ t, the monotonicity, and b (s) r 2 ≤ 1 + b (s) r -b (0) r 2 , we get d M b (t) ≤ 1 + 4η(t + 1) √ M (d(t) + 1) 1 + η(t + 1) √ M (d(t) + 1)(1 + 2d M b (t)) . Let M (t + 1) be a positive integer depending on t and d(t) such that t+1 √ M (d(t) + 1) ≤ 1 4 . Let us reconsider ∀M ≥ M (t + 1). Then, since η ≤ 1, we have d M b (t) ≤ 5 2 , |a (s) r | ≤ 5 2 (∀s ≤ t + 1). From the derivation of ( 11) and ( 12) and since η ≤ 1, we have for 0 ≤ ∀s ≤ t + 1, |a (s) r -a (0) r | ≤ d 1 (t + 1) √ M , b (s) r -b (0) r 2 ≤ d 2 (t + 1) √ M , where d 1 (t + 1) and d 2 (t + 1) are set to d 1 (t + 1) def = 6(t + 1)(d(t) + 1), d 2 (t + 1) def = 10(t + 1)(d(t) + 1). We next bound |g Θ (t+1) (x)| for x ∈ ∀supp(ρ X ) as follows. Since g Θ (0) ≡ 0, |g Θ (t+1) (x)| = |g Θ (t+1) (x) -g Θ (0) (x)| ≤ 1 √ M M r=1 (a (t+1) r -a (0) r )σ(b (0) r x) + a (t+1) r (σ(b (t+1) r x)) -σ(b (0) r x) ≤ 1 √ M M r=1 2 a (t+1) r -a (0) r + 4 a (t+1) r b (t+1) r -b (0) r ≤ 2d 1 (t + 1) + 10d 2 (t + 1). In summary, by setting M (t + 1) = max{M (t), 16(t + 1) 2 (d(t) + 1) 2 } and d(t + 1) = 2d 1 (t + 1) + 10d 2 (t + 1), we get g Θ (t+1) L∞(ρ X ) ≤ d(t + 1) when M ≥ M (t + 1). We note that from the above construction, d(t), d 1 (t), d 2 (t) depend only on t and inequalities ( 13) are always hold for ∀t ∈ Z + when M ≥ M (t + 1). Linear approximation of the model. For a given T ∈ Z + , we consider ∀M ≥ M (T ) and define the neighborhood of Θ (0) = (a (0) r , b (0) r ) M r=1 : B T (Θ (0) ) def = (a r , b r ) M r=1 ∈ (R × R d+1 ) M | |a r | ≤ 5 2 , |a r -a (0) r | ≤ d 1 (T ) √ M , b r -b (0) r 2 ≤ d 2 (T ) √ M . 
From Taylor's formula |σ(b r x) -σ(b (0) r x) -σ (b (0) r x)(b r -b (0) r ) x| ≤ 2 σ ∞ b r -b (0) r 2 2 and the smoothness of σ, we get for Θ ∈ B T (Θ (0) ) and x ∈ supp(ρ X ), a r σ(b r x) -(a (0) r σ(b (0) r x) + (a r -a (0) r )σ(b (0) r x) + a (0) r σ (b (0) r x)(b r -b (0) r ) x) ≤ 4|a r -a (0) r | b r -b (0) r 2 + 2C|a r | b r -b (0) r 2 2 ≤ 2d 1 (T )d 2 (T ) M + 5Cd 2 2 (T ) M . ( ) We here define a linear model: h Θ (x) def = 1 √ M M r=1 (a r -a (0) r )σ(b (0) r x) + a (0) r σ (b (0) r x)(b r -b (0) r ) x . By taking the sum of ( 14) over r ∈ {1, . . . , M } and by g Θ (0) ≡ 0, |g Θ (x) -h Θ (x)| ≤ 1 √ M M r=1 2d 1 (T )d 2 (T ) M + 5Cd 2 2 (T ) M ≤ 1 √ M 2d 1 (T )d 2 (T ) + 5Cd 2 2 (T ) . We denote d 3 (T ) ) ). Thus, we get for ∀t ∈ {1, . . . , T }, def = 2d 1 (T )d 2 (T ) + 5Cd 2 2 (T ). Since iterates (Θ (t) ) T t=0 obtained by Algorithm 1 are contained in B T (Θ (0) ), weighted averages (Θ (t) ) T t=0 are also contained in B T (Θ ( |g Θ (t) (x) -h Θ (t) (x)| ≤ d 3 (T ) √ M , g Θ (t) (x) -h Θ (t) (x) ≤ d 3 (T ) √ M . ( ) Recursion of h Θ (t) using the random feature approximation of NTK. We here derive a recursion of h Θ (t) using k M . From the updates ( 9) and ( 10), we have h Θ (t+1) (x) = 1 √ M M r=1 (a (t+1) r -a (0) r )σ(b (0) r x) + a (0) r σ (b (0) r x)(b (t+1) r -b (0) r ) x = (1 -ηλ)h Θ (t) (x) - η M M r=1 (g Θ (t) (x t ) -y t )σ(b (t) r x t )σ(b (0) r x) - η M M r=1 a (0) r σ (b (0) r x)(g Θ (t) (x t ) -y t )a (t) r σ (b (t) r x t )x t x. ( ) Note that for t ∈ {0, . . . 
, T }, |(g Θ (t) (x t ) -y t )σ(b (t) r x t ) -(h Θ (t) (x t ) -y t )σ(b (0) r x t )| ≤ |(g Θ (t) (x t ) -h Θ (t) (x t ))σ(b (0) r x t )| + |(g Θ (t) (x t ) -y t )(σ(b (t) r x t ) -σ(b (0) r x t ))| ≤ 2d 3 (T ) √ M + 4(d(T ) + 1) b (t) r -b (0) r 2 ≤ 2 √ M (d 3 (T ) + 2(d(T ) + 1)d 2 (T )) , |(g Θ (t) (x t ) -y t )a (t) r σ (b (t) r x t ) -(h Θ (t) (x t ) -y t )a (0) r σ (b (0) r x t )| ≤ |(g Θ (t) (x t ) -h Θ (t) (x t ))a (0) r σ (b (0) r x t )| + |(g Θ (t) (x t ) -y t )(a (t) r σ (b (t) r x t ) -a (0) r σ (b (0) r x t ))| ≤ 2d 3 (T ) √ M + (d(T ) + 1)|a (t) r σ (b (t) r x t ) -a (0) r σ (b (0) r x t ))| ≤ 2d 3 (T ) √ M + (d(T ) + 1) |a (0) r (σ (b (t) r x t ) -σ (b (0) r x t ))| + |(a (t) r -a (0) r )σ (b (t) r x t )| ≤ 2d 3 (T ) √ M + 2(d(T ) + 1) C b (t) r -b (0) r 2 + |a (t) r -a (0) r | ≤ 2d 3 (T ) √ M + 2(d(T ) + 1) √ M (d 1 (T ) + Cd 2 (T )) = 2 √ M (d 3 (T ) + (d(T ) + 1)(d 1 (T ) + Cd 2 (T ))) . Plugging these two inequalities into (16), we have ∀t ∈ {1, . . . , T -1}, h Θ (t+1) (x) ≤ (1 -ηλ)h Θ (t) (x) - η M M r=1 (h Θ (t) (x t ) -y t )σ(b (0) r x t )σ(b (0) r x) - η M M r=1 σ (b (0) r x)(h Θ (t) (x t ) -y t )σ (b (0) r x t )x t x + 2η √ M (2d 3 (T ) + (d(T ) + 1) (d 1 (T ) + (C + 2)d 2 (T ))) = (1 -ηλ)h Θ (t) (x) -η(h Θ (t) (x t ) -y t ) 1 M M r=1 σ(b (0) r x t )σ(b (0) r x) + σ (b (0) r x)σ (b (0) r x t )x t x + 2η √ M (2d 3 (T ) + (d(T ) + 1) (d 1 (T ) + (C + 2)d 2 (T ))) = (1 -ηλ)h Θ (t) (x) -η(h Θ (t) (x t ) -y t )k M (x, x t ) + η √ M d 4 (T ) , where d 4 (T ) = 2d 3 (T ) + (d(T ) + 1) (d 1 (T ) + (C + 2)d 2 (T )). Clearly, the inverse inequality also holds: h Θ (t+1) (x) ≥ (1 -ηλ)h Θ (t) (x) -η(h Θ (t) (x t ) -y t )k M (x, x t ) - η √ M d 4 (T ). Thus, we get |h Θ (t+1) (x) -(1 -ηλ)h Θ (t) (x) + η(h Θ (t) (x t ) -y t )k M (x, x t )| ≤ η √ M d 4 (T ). Equivalence between Algorithm 1 and 2. We provide a bound between recursions of Algorithm 2 and (17). Noting that h Θ (0) ≡ g (0) ≡ 0, we have for ∀t ∈ {0, . . . 
, T -1}, |h Θ (t+1) (x) -g (t+1) (x)| ≤ (1 -ηλ)|h Θ (t) (x) -g (t) (x)| + η|h Θ (t) (x t ) -g (t) (x t )|k M (x, x t ) + η √ M d 4 (T ). Noting k M L∞(ρ X ) ≤ 12 and taking a supremum over x, x t ∈ supp(ρ X ) in both sides, we have h Θ (t+1) -g (t+1) L∞(ρ X ) ≤ (1 -ηλ) h Θ (t) -g (t) L∞(ρ X ) + η h Θ (t) -g (t) L∞(ρ X ) k M L∞(ρ X ) + η √ M d 4 (T ) ≤ (1 -ηλ + 12η) h Θ (t) -g (t) L∞(ρ X ) + η √ M d 4 (T ) ≤ t s=0 (1 + 12η) t-s η √ M d 4 (T ) ≤ T √ M (1 + 12η) T d 4 (T ). Since h Θ is a linear model, we have h Θ (T ) = T t=0 α t h Θ (t) and h Θ (T ) -g (T ) L∞(ρ X ) ≤ T t=0 α t h Θ (t) -g (t) L∞(ρ X ) ≤ T √ M (1 + 12η) T d 4 (T ). Combining this inequality with (15), we finally have g Θ (T ) -g (T ) L∞(ρ X ) ≤ g Θ (T ) -h Θ (T ) L∞(ρ X ) + h Θ (T ) -g (T ) L∞(ρ X ) ≤ 1 √ M (d 3 (T ) + 13 T T d 4 (T )). Because (d 3 (T ) + 13 T T d 4 (T )) depends only on T and C from the construction, g Θ (T )g (T ) L∞(ρ X ) → 0 as M → ∞. This finishes the proof of Proposition A.
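The linear-approximation step of the proof can be illustrated numerically. The sketch below perturbs an initial parameter $\Theta^{(0)}$ by at most $d_1/\sqrt{M}$ in each $a_r$ and $d_2/\sqrt{M}$ in each $b_r$, as in the neighborhood $B_T(\Theta^{(0)})$, and compares the network with its first-order expansion. It is our own illustration, not the paper's experiment: it uses swish with $s = 10$, subtracts $g_{\Theta^{(0)}}$ explicitly instead of assuming $g_{\Theta^{(0)}} \equiv 0$, and uses the swish-specific constants $\sup|\sigma'| \le 1.1$ and $\sup|\sigma''| = s/2$ rather than the constants $2$ and $C$ of the proof.

```python
import numpy as np

S = 10.0  # swish sharpness parameter s (our choice)

def swish(u):
    return u / (1.0 + np.exp(-S * u))

def swish_grad(u):
    sig = 1.0 / (1.0 + np.exp(-S * u))
    return sig + S * u * sig * (1.0 - sig)

def network(a, B, x):
    """g_Theta(x) = M^{-1/2} * sum_r a_r * sigma(b_r^T x)."""
    return a @ swish(B @ x) / np.sqrt(len(a))

def linear_model(a, B, a0, B0, x):
    """h_Theta(x): first-order expansion of g_Theta - g_{Theta^(0)} around (a0, B0)."""
    z0 = B0 @ x
    first_order = (a - a0) @ swish(z0) + (a0 * swish_grad(z0)) @ ((B - B0) @ x)
    return first_order / np.sqrt(len(a))

def gap_and_bound(M, d1=1.0, d2=1.0, dim=3, seed=0):
    rng = np.random.default_rng(seed)
    a0 = rng.choice([-1.0, 1.0], size=M)
    B0 = rng.standard_normal((M, dim))
    B0 /= np.linalg.norm(B0, axis=1, keepdims=True)
    x = rng.standard_normal(dim)
    x /= np.linalg.norm(x)
    U = rng.standard_normal((M, dim))
    U /= np.linalg.norm(U, axis=1, keepdims=True)
    a = a0 + d1 / np.sqrt(M) * rng.uniform(-1.0, 1.0, size=M)  # |a_r - a_r^(0)| <= d1/sqrt(M)
    B = B0 + d2 / np.sqrt(M) * U                               # ||b_r - b_r^(0)|| = d2/sqrt(M)
    gap = abs(network(a, B, x) - network(a0, B0, x) - linear_model(a, B, a0, B0, x))
    # Taylor bound per neuron: 1.1*|da|*|db^T x| + (S/4)*|a0|*(db^T x)^2, summed and scaled.
    bound = (1.1 * d1 * d2 + 0.25 * S * d2**2) / np.sqrt(M)
    return gap, bound

for M in (100, 1_000, 10_000):
    gap, bound = gap_and_bound(M)
    print(M, gap, bound)
```

The printed gap stays below a bound of order $1/\sqrt{M}$, mirroring the role of $d_3(T)/\sqrt{M}$ in the proof.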

C PROOF OF THEOREM A

In this section, we give the proof of the convergence theory for the reference ASGD (Algorithm 2). We introduce an auxiliary result for proving Theorem A. Lemma A. Suppose Assumption (A1), (A2), and (A3) hold. Set ξ def = Y K M,X -(K M,X ⊗ H M K M,X + λI)g M,λ . Then, for ∀λ > 0 and ∀δ ∈ (0, 1) there exists M 0 > 0 such that for ∀M ≥ M 0 the following holds with high probability at least 1 -δ: E (X,Y )∼ρ [ξ ⊗ H M ξ] 2(1 + g ρ 2 L2(ρ X ) + 24 Σ -r ∞ g ρ 2 L2(ρ X ) )Σ M . Proof. Since ξ = (Y -g M,λ (X))K M,X -λg M,λ , we get E[ξ ⊗ H M ξ] = E[(Y -g M,λ (X)) 2 K M,X ⊗ H M K M,X ] -λE[(Y -g M,λ (X))K M,X ] ⊗ H M g M,λ -λg M,λ ⊗ H M E[(Y -g M,λ (X))K M,X ] + λ 2 g M,λ ⊗ H M g M,λ . We evaluate an expectation in the second and third terms in the right hand side of the above equation as follows: E[(Y -g M,λ (X))K M,X ] = E[Y K M,X -(K M,X ⊗ H M K M,X )g M,λ ] = E[Y K M,X ] -Σ M g M,λ = E[Y K M,X ] -Σ M (Σ M + λI) -1 E[Y K M,X ] = E[Y K M,X ] -(Σ M + λI -λI)(Σ M + λI) -1 E[Y K M,X ] = λ(Σ M + λI) -1 E[Y K M,X ] = λg M,λ . Hence, we get E[ξ ⊗ H M ξ] E[(Y -g M,λ (X)) 2 K M,X ⊗ H M K M,X ]. For h ∈ H M , E[(Y -g M,λ (X)) 2 K M,X ⊗ H M K M,X ]h, h H M = E[(Y -g M,λ (X)) 2 (K M,X ⊗ H M K M,X )h, h H M ] ≤ Y -g M,λ (X) 2 L∞(ρ X ) E[ (K M,X ⊗ H M K M,X )h, h H M ] ≤ 2(1 + g M,λ 2 L∞(ρ X ) )E[ (K M,X ⊗ H M K M,X )h, h H M ], where we used Assumption (A2) for the last inequality. Finally, we provide an upper-bound on g M,λ L∞(ρ X ) . Since S -1 -T -1 = -S -1 (S -T )T -1 for arbitrary operators S and T , we get (Σ ∞ + λI) -1 -(Σ M + λI) -1 g ρ L2(ρ X ) = (Σ ∞ + λI) -1 (Σ ∞ -Σ M )(Σ M + λI) -1 g ρ L2(ρ X ) = (Σ ∞ + λI) -1 op Σ ∞ -Σ M op (Σ M + λI) -1 op g ρ L2(ρ X ) ≤ 1 λ 2 Σ ∞ -Σ M op g ρ L2(ρ X ) . We denote F ∞ = (Σ ∞ + λI) -1 g ρ and F M = (Σ M + λI) -1 g ρ . 
Noting g ∞,λ = Σ ∞ F ∞ and g M,λ = Σ M F M , we get for ∀x ∈ supp(ρ X ), |g ∞,λ (x) -g M,λ (x)| = |Σ ∞ F ∞ (x) -Σ M F M (x)| = X K ∞,x (X)F ∞ (X)dρ X - X K M,x (X)F M (X)dρ X = X (K ∞,x -K M,x )(X)F ∞ (X)dρ X - X K M,x (X)(F M (X) -F ∞ (X))dρ X ≤ K ∞,x -K M,x L2(ρ X ) F ∞ L2(ρ X ) + K M,x L2(ρ X ) F M -F ∞ L2(ρ X ) ≤ 1 λ k ∞ -k M L∞(ρ X ) 2 g ρ L2(ρ X ) + 12 λ 2 Σ ∞ -Σ M op g ρ L2(ρ X ) , where we used k M (x, x ) ≤ 12 for ∀(x, x ) ∈ supp(ρ X ) × supp(ρ X ) and inequality (19). Moreover, we get |g ∞,λ (x)| = | g ∞,λ , K x H∞ | ≤ K x H∞ g ∞,λ H∞ ≤ 2 √ 3 Σ ∞ (Σ ∞ + λI) -1 g ρ H∞ ≤ 2 √ 3 Σ 1+r ∞ (Σ ∞ + λI) -1 Σ -r ∞ g ρ H∞ ≤ 2 √ 3 Σ 1 2 +r ∞ (Σ ∞ + λI) -1 Σ -r ∞ g ρ L2(ρ X ) ≤ 2 √ 3 Σ 1 2 +r ∞ (Σ ∞ + λI) -1 op Σ -r ∞ g ρ L2(ρ X ) ≤ 2 √ 3 Σ -r ∞ g ρ L2(ρ X ) . where we used Assumption (A3) and the isometric map Σ 1/2 ∞ : L 2 (ρ X ) → H ∞ . Hence, we get g M,λ L∞(ρ X ) ≤ 1 λ k ∞ -k M L∞(ρ X ) 2 + 12 λ 2 Σ ∞ -Σ M op g ρ L2(ρ X ) + 2 √ 3 Σ -r ∞ g ρ L2(ρ X ) . By the uniform law of large numbers (Theorem 3.1 in Mohri et al. (2012) ) and the Bernstein's inequality (Proposition 3 in Rudi & Rosasco (2017)) to random operators, k ∞ -k M L∞(ρ X ×ρ X ) and Σ ∞ -Σ M op converge to zero as M → ∞ in probability. That is, for given λ > 0 and δ ∈ (0, 1), there exists M 0 such that for any M ≥ M 0 the following holds with high probability at least 1 -δ: g M,λ L∞(ρ X ) ≤ 1 2 g ρ L2(ρ X ) + 2 √ 3 Σ -r ∞ g ρ L2(ρ X ) . Combining with (18), we get E[(Y -g M,λ (X)) 2 K M,X ⊗ H M K M,X ]h, h H M ≤ 2(1 + g ρ 2 L2(ρ X ) + 24 Σ -r ∞ g ρ 2 L2(ρ X ) )E[ (K M,X ⊗ H M K M,X )h, h H M ]. Proof of Theorem A. Since, stochastic gradient in H M is described as G (t) = ∂ z (g (t) (x t ), y t )k M (x t , •) = K M,xt , g (t) H M -y t K M,xt , the update rule of Algorithm 2 is g (t+1) = (1 -ηλ)g (t) -η K M,xt , g (t) H M -y t K M,xt = (I -ηK M,xt ⊗ H M K M,xt -ηλI) g (t) + ηy t K M,xt . 
Hence, we get g (t+1) -g M,λ = (I -ηK M,xt ⊗ H M K M,xt -ηλI) =αt (g (t) -g M,λ ) =At -η(K M,xt ⊗ H M K M,xt + λI)g M,λ + ηy t K M,xt =βt . This leads to the following stochastic recursion: t ∈ {0, . . . , T -1}, A t+1 = α t A t + β t = t s=0 α s A 0 + t s=0 t l=s+1 α s β s . By taking the average, we get A T = 1 T + 1 T t=0 A t = 1 T + 1 T t=0 t s=0 α s A 0 Bias term + 1 T + 1 T t=0 t s=0 t l=s+1 α s β s Noise term . Thus, the average A T is composed of bias and noise terms. We next bound these two terms, separately. Bound the bias term. Note that the bias term exactly corresponds to the recursion (21) with β t = 0. Hence, we consider the case of β t = 0 and consider the following stochastic recursion in H M : A 0 = -g M,λ , A t+1 = (I -ηH t -ηλI)A t , where we define H t = K M,xt ⊗ H M K M,xt . In addition, we consider the deterministic recursion of this recursion: A 0 = A 0 , A t+1 = (I -ηΣ M -ηλI)A t . We set A T = 1 T + 1 T t=0 A t , A T = 1 T + 1 T t=0 A t . Then, the bias term we want to evaluate is decomposed as follows: by Minkowski's inequality, E[ A T 2 L2(ρ X ) ] 1 2 ≤ A T L2(ρ X ) + E[ A T -A T 2 L2(ρ X ) ] 1 2 . ( ) We here bound the first term in the right hand side of (23). Note that from 4(6 + λ)η ≤ 1 and k M L∞(ρ X ) 2 ≤ 12, we seefoot_0 η(Σ M +λI) η(12+λ)I ≺ 1 2 I. Since A t = (I -ηΣ M -ηλI) t g M,λ , its average is A T = 1 T + 1 T t=0 A t = 1 η(T + 1) (Σ M + λI) -1 (I -(I -ηΣ M -ηλ) T +1 )g M,λ . Therefore, A T L2(ρ X ) = 1 η(T + 1) (Σ M + λI) -1 (I -(I -ηΣ M -ηλ) T +1 )g M,λ L2(ρ X ) ≤ 1 η(T + 1) (Σ M + λI) -1 g M,λ L2(ρ X ) . We bound the second term in ( 23), which measures the gap between A T and A T . To do so, we consider the following recursion: A t+1 -A t+1 = A t -A t -η(H t + λI)(A t -A t ) + η(Σ M -H t )A t . Hence, we have A t+1 -A t+1 2 H M = A t -A t 2 H M -η A t -A t , (H t + λI)(A t -A t ) -(Σ M -H t )A t H M -η (H t + λI)(A t -A t ) -(Σ M -H t )A t , A t -A t H M + η 2 (H t + λI)(A t -A t ) -(Σ M -H t )A t 2 H M . 
Let (F t ) T -1 t=0 be a filtration. We take a conditional expectation given F t : E[ A t+1 -A t+1 2 H M | F t ] ≤ A t -A t 2 H M -2η (Σ M + λI)(A t -A t ), A t -A t H M + 2η 2 E[ (H t + λI)(A t -A t ) 2 H M | F t ] (25) + 2η 2 E[ (Σ M -H t )A t 2 H M | F t ], where we used g + h 2 H M ≤ 2( g 2 H M + h 2 H M ). For g ∈ H M , we have E[(K M,xt ⊗ H M K M,xt ) 2 ]g, g H M = E (K M,xt ⊗ H M K M,xt ) 2 g, g H M = E K M,xt , g H M (K M,xt ⊗ H M K M,xt )K M,xt , g H M = E K M,xt , g H M k M (x t , x t )K M,xt , g H M = E K M,xt , g H M k M (x t , x t ) ≤ 12E K M,xt , g 2 H M = 12 E [K M,xt ⊗ H M K M,xt ] g, g H M . ( ) where we used k M (x t , x t ) ≤ 12 which is confirmed from the definition of k M and Assumption (A2). This means that E[H 2 t ] = E[(K M,xt ⊗ H M K M,xt ) 2 ] 12Σ M on H M × H M . Hence, we get a bound on (25) as follows: E[ (H t + λI)(A t -A t ) 2 H M | F t ] = E[ (H t + λI) 2 (A t -A t ), A t -A t H M | F t ] = λ 2 I + 2λΣ M + E[(K M,xt ⊗ H M K M,xt ) 2 ] (A t -A t ), A t -A t H M ≤ λ 2 I + 2(6 + λ)Σ M (A t -A t ), A t -A t H M . Next, we bound a term ( 26): E[ (Σ M -H t )A t 2 H M | F t ] = E[ (Σ M -H t ) 2 A t , A t H M | F t ] = E[ (Σ 2 M -Σ M H t -H t Σ M + H 2 t )A t , A t H M | F t ] = (E[H 2 t ] -Σ 2 M )A t , A t H M ≤ E[H 2 t ]A t , A t H M ≤ 12 Σ M A t , A t H M . Combining these inequalities, we get E[ A t+1 -A t+1 2 H M | F t ] ≤ A t -A t 2 H M -2η (Σ M + λI)(A t -A t ), A t -A t H M + 2η 2 λ 2 I + 2(6 + λ)Σ M (A t -A t ), A t -A t H M + 24η 2 Σ M A t , A t H M = (1 -2λη + 2λ 2 η 2 ) A t -A t 2 H M -2η Σ M (A t -A t ), A t -A t H M + 4η 2 (6 + λ) Σ M (A t -A t ), A t -A t H M + 24η 2 Σ M A t , A t H M ≤ A t -A t 2 H M -η Σ M (A t -A t ), A t -A t H M + 24η 2 Σ M A t , A t H M , where for the last inequality we used 4η(6 + λ) ≤ 1. By taking the expectation and the average of (28) over t ∈ {0, . . . , T -1}, we get 1 T + 1 T t=0 Σ M (A t -A t ), A t -A t H M ≤ 24η T + 1 T t=0 Σ M A t , A t H M . 
Since Σ 1/2 M : L 2 (ρ X ) → H M is isometric, we see Σ 1/2 M (A t -A t ) H M = A t -A t L2(ρ X ) . Thus, the second term in (23) can be bounded as follows: E[ A T -A T 2 L2(ρ X ) ] = E[ Σ 1/2 M (A T -A T ) 2 H M ] ≤ 1 T + 1 T t=0 E[ Σ 1/2 M (A t -A t ) 2 H M ] ≤ 24η T + 1 T t=0 Σ 1/2 M A t 2 H M = 24η T + 1 T t=0 Σ 1/2 M (I -ηΣ M -ηλI) t g M,λ 2 H M = 24η T + 1 T t=0 (I -ηΣ M -ηλI) 2t Σ 1/2 M g M,λ , Σ 1/2 M g M,λ H M ≤ 24 T + 1 (Σ M + λI) -1 Σ 1/2 M g M,λ , Σ 1/2 M g M,λ H M = 24 T + 1 (Σ M + λI) -1/2 g M,λ 2 L2(ρ X ) where we used the convexity for the first inequality and we used the following inequality for the last inequality: since k M L∞(ρ X ) 2 ≤ 12 and η(Σ M + λI) η(12 + λ)I 1 2 I, T t=0 (I -ηΣ M -ηλI) 2t 1 η (Σ M + λI) -1 . By plugging ( 24) and ( 29) into (23), we get the bound on the bias term: E[ A T 2 L2(ρ X ) ] ≤ 2 A T 2 L2(ρ X ) + 2E[ A T -A T 2 L2(ρ X ) ] ≤ 2 η 2 (T + 1) 2 (Σ M + λI) -1 g M,λ 2 L2(ρ X ) + 24 2 T + 1 (Σ M + λI) -1/2 g M,λ 2 L2(ρ X ) . ( ) Bound the noise term. Note that the noise term in ( 22) exactly corresponds to the recursion (21) with A 0 = 0. Hence, it is enough to consider the case of A 0 = 0 to evaluate the noise term. In this case, the average A T can be rewritten as follows: A T = 1 T + 1 T t=0 t s=0 t l=s+1 α l β s = η T + 1 T s=0 T t=s t l=s+1 α l β s η =Zs . We here evaluate the noise term. We set z s = (x s , y s ). Note that since E zs [β s ] = 0, we have for s < s , E (zs,...,z T ) Z s , Z s L2(ρ X ) = X E (zs,...,z T ) T t=s t l=s+1 α l β s η T t=s t l=s +1 α l β s η dρ X = X E zs [β s ]E (zs+1,...,z T ) β s T t=s t l=s+1 α l η T t=s t l=s +1 α l η dρ X = 0. Therefore, we have E[ A T 2 L2(ρ X ) ] = η 2 (T + 1) 2 E   T s=0 Z s 2 L2(ρ X )   = η 2 (T + 1) 2 E   T s,s =0 Z s , Z s L2(ρ X )   = η 2 (T + 1) 2 T s=0 E Z s , Z s L2(ρ X ) = η 2 (T + 1) 2 T s=0 E Σ 1/2 M Z s 2 H M . Here, we apply Lemma 21 in Pillaud-Vivien et al. 
(2018a) with A = Σ M , H = Σ M + λI, C = 2(1 + g ρ 2 L2(ρ X ) + 24 Σ -r ∞ g ρ 2 L2(ρ X ) )Σ M . One of required conditions in this lemma is verified by Lemma A. We verify the other required condition described below: E (K M,X ⊗ H K M,X + λI)CH -1 (K M,X ⊗ H K M,X + λI) 1 η C. Indeed, we have E (K M,X ⊗ H K M,X + λI)CH -1 (K M,X ⊗ H K M,X + λI) = E K M,X ⊗ H K M,X CH -1 K M,X ⊗ H K M,X + 2λΣ M CH -1 + λ 2 CH -1 E K M,X ⊗ H K M,X CH -1 K M,X ⊗ H K M,X + 6λ(1 + g ρ 2 L2(ρ X ) + 24 Σ -r ∞ g ρ 2 L2(ρ X ) )Σ M , where we used Σ M (Σ M + λI) -1 I and λ(Σ M + λI) -1 I. Moreover, we see E K M,X ⊗ H K M,X CH -1 K M,X ⊗ H K M,X = 2(1 + g ρ 2 L2(ρ X ) + 24 Σ -r ∞ g ρ 2 L2(ρ X ) )E K M,X ⊗ H K M,X Σ M (Σ M + λI) -1 K M,X ⊗ H K M,X 2(1 + g ρ 2 L2(ρ X ) + 24 Σ -r ∞ g ρ 2 L2(ρ X ) )E (K M,X ⊗ H K M,X ) 2 24(1 + g ρ 2 L2(ρ X ) + 24 Σ -r ∞ g ρ 2 L2(ρ X ) )Σ M , where we used (27) for the last inequality. Hence, we get E (K M,X ⊗ H K M,X + λI)CH -1 (K M,X ⊗ H K M,X + λI) (24 + 6λ)(1 + g ρ 2 L2(ρ X ) + 24 Σ -r ∞ g ρ 2 L2(ρ X ) )Σ M . Since, 4η(6 + λ) ≤ 1, the condition (32) is verified. We apply Lemma 21 in Pillaud-Vivien et al. (2018a) to (31), yielding the following inequality: E[ A T 2 L2(ρ X ) ] ≤ 4 T + 1 1 + g ρ 2 L2(ρ X ) + 24 Σ -r ∞ g ρ 2 L2(ρ X ) Tr Σ 2 M (Σ M + λI) -2 ≤ 4 T + 1 1 + g ρ 2 L2(ρ X ) + 24 Σ -r ∞ g ρ 2 L2(ρ X ) Tr Σ M (Σ M + λI) -1 . (33) Convergence rate in terms of the optimization. Finally, by combining ( 30) and ( 33) with ( 22), we get the convergence rate of averaged stochastic gradient descent to g M,λ : t) . This finishes the proof. E g (T ) -g M,λ 2 L2(ρ X ) ≤ 4 η 2 (T + 1) 2 (Σ M + λI) -1 g M,λ 2 L2(ρ X ) + 2 • 24 2 T + 1 (Σ M + λI) -1/2 g M,λ 2 L2(ρ X ) + 8 T + 1 1 + g ρ 2 L2(ρ X ) + 24 Σ -r ∞ g ρ 2 L2(ρ X ) Tr Σ M (Σ M + λI) -1 , where g (T ) def = 1 T +1 T t=0 g
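The bias part of the proof uses the closed form of the averaged deterministic recursion, namely $\frac{1}{T+1}\sum_{t=0}^{T}(I - \eta S)^t g = \frac{1}{\eta(T+1)} S^{-1}\left(I - (I - \eta S)^{T+1}\right) g$ with $S = \Sigma_M + \lambda I$, and the resulting norm bound. The following finite-dimensional check is a sketch in which a random positive semi-definite matrix with largest eigenvalue 12 stands in for $\Sigma_M$; this stand-in and the function names are our assumptions.

```python
import numpy as np

def averaged_iterate(S, g, eta, T):
    """Direct average (1/(T+1)) * sum_{t=0}^{T} (I - eta*S)^t g."""
    P = np.eye(len(g)) - eta * S
    total, cur = np.zeros_like(g), g.copy()
    for _ in range(T + 1):
        total += cur
        cur = P @ cur
    return total / (T + 1)

def closed_form(S, g, eta, T):
    """(1/(eta*(T+1))) * S^{-1} (I - (I - eta*S)^{T+1}) g."""
    P = np.eye(len(g)) - eta * S
    return np.linalg.solve(S, g - np.linalg.matrix_power(P, T + 1) @ g) / (eta * (T + 1))

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5))
Sigma = A @ A.T
Sigma *= 12.0 / np.linalg.eigvalsh(Sigma).max()  # mimic k_M <= 12
lam, T = 0.1, 50
eta = 1.0 / (4 * (6 + lam))                      # so that 4 * (6 + lam) * eta <= 1
S = Sigma + lam * np.eye(5)
g = rng.standard_normal(5)

direct = averaged_iterate(S, g, eta, T)
assert np.allclose(direct, closed_form(S, g, eta, T))
# Norm bound used for the first term of the bias.
assert np.linalg.norm(direct) <= np.linalg.norm(np.linalg.solve(S, g)) / (eta * (T + 1)) + 1e-12
```

The step-size condition guarantees $\eta S \preceq \frac{1}{2} I$, so $I - (I - \eta S)^{T+1}$ has eigenvalues in $[0, 1]$ and can be dropped in the norm bound.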

D PROOF OF PROPOSITION B

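We prove Proposition B, which bounds the quantities appearing in Theorem A.

Proof of Proposition B. By Bernstein's inequality for random operators (Proposition 3 in Rudi & Rosasco (2017)), the covariance operator $\Sigma_M$ converges to $\Sigma_\infty$ in probability as $M \to \infty$. In particular, there exists $M_0 \in \mathbb{Z}_+$ such that for any $M \ge M_0$, with probability at least $1 - \delta$,
$$\Sigma_\infty - \Sigma_M \preceq \frac{1}{2}(\Sigma_\infty + \lambda I) \quad \text{in } L_2(\rho_X).$$
Thus, for $\forall f \in L_2(\rho_X)$, we see
$$\langle (\Sigma_\infty + \lambda I)^{-1/2}(\Sigma_\infty - \Sigma_M)(\Sigma_\infty + \lambda I)^{-1/2} f, f \rangle_{L_2(\rho_X)} = \langle (\Sigma_\infty - \Sigma_M)(\Sigma_\infty + \lambda I)^{-1/2} f, (\Sigma_\infty + \lambda I)^{-1/2} f \rangle_{L_2(\rho_X)} \le \frac{1}{2}\langle (\Sigma_\infty + \lambda I)(\Sigma_\infty + \lambda I)^{-1/2} f, (\Sigma_\infty + \lambda I)^{-1/2} f \rangle_{L_2(\rho_X)} = \frac{1}{2}\|f\|^2_{L_2(\rho_X)}.$$
Hence, we have $(\Sigma_\infty + \lambda I)^{-1/2}(\Sigma_\infty - \Sigma_M)(\Sigma_\infty + \lambda I)^{-1/2} \preceq \frac{1}{2} I$. Following the argument in Bach (2017b), we have for $\forall f \in L_2(\rho_X)$,
$$\langle (\Sigma_M + \lambda I)^{-1} f, f \rangle_{L_2(\rho_X)} = \langle (\Sigma_\infty + \lambda I + \Sigma_M - \Sigma_\infty)^{-1} f, f \rangle_{L_2(\rho_X)} = \left\langle \left(I + (\Sigma_\infty + \lambda I)^{-1/2}(\Sigma_M - \Sigma_\infty)(\Sigma_\infty + \lambda I)^{-1/2}\right)^{-1}(\Sigma_\infty + \lambda I)^{-1/2} f, (\Sigma_\infty + \lambda I)^{-1/2} f \right\rangle_{L_2(\rho_X)} \le 2\langle (\Sigma_\infty + \lambda I)^{-1/2} f, (\Sigma_\infty + \lambda I)^{-1/2} f \rangle_{L_2(\rho_X)} = 2\langle (\Sigma_\infty + \lambda I)^{-1} f, f \rangle_{L_2(\rho_X)}.$$
Thus, we confirm that with probability at least $1 - \delta$,
$$(\Sigma_M + \lambda I)^{-1} \preceq 2(\Sigma_\infty + \lambda I)^{-1}.$$
Utilizing this inequality, we show the first and second inequalities in Proposition B as follows. It is sufficient to prove the second inequality because
$$\|(\Sigma_M + \lambda I)^{-1} g_{M,\lambda}\|^2_{L_2(\rho_X)} \le \frac{1}{\lambda}\|(\Sigma_M + \lambda I)^{-1/2} g_{M,\lambda}\|^2_{L_2(\rho_X)}.$$
Noting that $g_\rho \in \mathcal{H}_\infty$ and $g_{M,\lambda} = (\Sigma_M + \lambda I)^{-1}\Sigma_M g_\rho$, we get
$$\|(\Sigma_M + \lambda I)^{-1/2} g_{M,\lambda}\|^2_{L_2(\rho_X)} = \|(\Sigma_M + \lambda I)^{-1/2}(\Sigma_M + \lambda I)^{-1}\Sigma_M g_\rho\|^2_{L_2(\rho_X)} \le \|(\Sigma_M + \lambda I)^{-1/2} g_\rho\|^2_{L_2(\rho_X)} \le 2\|(\Sigma_\infty + \lambda I)^{-1/2} g_\rho\|^2_{L_2(\rho_X)} \le 2\|\Sigma_\infty^{-1/2} g_\rho\|^2_{L_2(\rho_X)} = 2\|g_\rho\|^2_{\mathcal{H}_\infty}.$$
The third inequality, on the degree of freedom, is a result obtained by Rudi & Rosasco (2017).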
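The operator inequality at the heart of this proof, that $\Sigma_\infty - \Sigma_M \preceq \frac{1}{2}(\Sigma_\infty + \lambda I)$ implies $(\Sigma_M + \lambda I)^{-1} \preceq 2(\Sigma_\infty + \lambda I)^{-1}$, can be checked on matrices. The sketch below is a finite-dimensional stand-in of our own: it builds a symmetric matrix playing the role of $\Sigma_M$ so that the premise holds by construction, then verifies both the premise and the conclusion through smallest eigenvalues.

```python
import numpy as np

def sym_sqrt(R):
    """Symmetric square root of a positive definite matrix."""
    w, Q = np.linalg.eigh(R)
    return (Q * np.sqrt(w)) @ Q.T

def min_eig(A):
    return np.linalg.eigvalsh((A + A.T) / 2).min()

def make_pair(n=6, lam=0.1, seed=0):
    """Return (sig_inf, sig_M) with sig_inf - sig_M <= (1/2)(sig_inf + lam*I)."""
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((n, n))
    sig_inf = A @ A.T / n
    R = sig_inf + lam * np.eye(n)
    Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
    W = (Q * rng.uniform(-1.0, 0.5, size=n)) @ Q.T  # eigenvalues of W are at most 1/2
    Rh = sym_sqrt(R)
    return sig_inf, sig_inf - Rh @ W @ Rh

def check(sig_inf, sig_M, lam):
    """Smallest eigenvalues of the premise gap and the conclusion gap (both should be >= 0)."""
    I = np.eye(len(sig_inf))
    R = sig_inf + lam * I
    premise = min_eig(0.5 * R - (sig_inf - sig_M))
    conclusion = min_eig(2 * np.linalg.inv(R) - np.linalg.inv(sig_M + lam * I))
    return premise, conclusion

lam = 0.1
premise, conclusion = check(*make_pair(lam=lam), lam)
print(premise, conclusion)
```

Both printed values should be non-negative up to floating-point rounding.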

E EIGENVALUE ANALYSIS OF NEURAL TANGENT KERNEL

E.1 REVIEW OF SPHERICAL HARMONICS

We briefly review spherical harmonics, which are useful in analyzing the eigenvalues of dot-product kernels. For references, see Atkinson & Han (2012); Bach (2017a); Bietti & Mairal (2019); Cao et al. (2019). Here, we denote by $\tau_{d-1}$ the uniform distribution on the sphere $S^{d-1} \subset \mathbb{R}^d$. The surface area of $S^{d-1}$ is $\omega_{d-1} = \frac{2\pi^{d/2}}{\Gamma(d/2)}$, where $\Gamma$ is the Gamma function. In $L_2(\tau_{d-1})$, there is an orthonormal basis consisting of the constant $1$ and the spherical harmonics $Y_{kj}(x)$, $k \in \mathbb{Z}_{\geq 1}$, $j = 1, \dots, N(d,k)$, where $N(d,k) = \frac{2k+d-2}{k}\binom{k+d-3}{d-2}$. That is, $\langle Y_{ki}, Y_{sj}\rangle_{L_2(\tau_{d-1})} = \delta_{ks}\delta_{ij}$ and $\langle Y_{ki}, 1\rangle_{L_2(\tau_{d-1})} = 0$. The spherical harmonics $Y_{kj}$ are homogeneous functions of degree $k$, and clearly $Y_{kj}$ have the same parity as $k$.

The Legendre polynomial $P_k(t)$ of degree $k$ and dimension $d$ (a.k.a. Gegenbauer polynomial) is defined by Rodrigues' formula:
$$P_k(t) = \left(-\frac{1}{2}\right)^k \frac{\Gamma\left(\frac{d-1}{2}\right)}{\Gamma\left(k + \frac{d-1}{2}\right)} (1-t^2)^{(3-d)/2} \left(\frac{d}{dt}\right)^k (1-t^2)^{k+(d-3)/2}.$$
Legendre polynomials have the same parity as $k$. This polynomial is very useful in describing several formulas regarding the spherical harmonics.

Addition formula. We have the following addition formula:
$$\sum_{j=1}^{N(d,k)} Y_{kj}(x)Y_{kj}(y) = N(d,k)P_k(x^\top y), \quad \forall x, \forall y \in S^{d-1}.$$
Hence, we see that $P_k(x^\top \cdot)$ is a spherical harmonic of degree $k$. Using the addition formula and the orthogonality of spherical harmonics, we have
$$\int_{S^{d-1}} P_j(Z^\top x)P_k(Z^\top y)\,d\tau_{d-1}(Z) = \frac{\delta_{jk}}{N(d,k)} P_k(x^\top y). \tag{36}$$
Combining this with the following equation: for $x = te_d + \sqrt{1-t^2}\,x'$ ($x \in S^{d-1}$, $x' \in S^{d-2}$, $t \in [-1,1]$),
$$\frac{\omega_{d-1}}{\omega_{d-2}}\,d\tau_{d-1}(x) = (1-t^2)^{(d-3)/2}\,dt\,d\tau_{d-2}(x'),$$
we see the orthogonality of Legendre polynomials in $L_2([-1,1], (1-t^2)^{(d-3)/2}dt)$, and since $P_k(1) = 1$ we see
$$\int_{-1}^{1} P_k^2(t)(1-t^2)^{(d-3)/2}\,dt = \frac{\omega_{d-1}}{\omega_{d-2}} \frac{1}{N(d,k)}.$$

Recurrence relation. We have the following relation:
$$tP_k(t) = \frac{k}{2k+d-2}P_{k-1}(t) + \frac{k+d-2}{2k+d-2}P_{k+1}(t), \tag{37}$$
for $k \geq 1$, and for $k = 0$ we have $tP_0(t) = P_1(t)$.

Funk-Hecke formula. The following formula is useful in computing Fourier coefficients with respect to spherical harmonics via Legendre polynomials. For any linear combination $Y_k$ of $Y_{kj}$ ($j \in \{1, \dots, N(d,k)\}$) and any $f \in L_2([-1,1], (1-t^2)^{(d-3)/2}dt)$, we have for $\forall x$,
$$\int_{S^{d-1}} f(x^\top y)Y_k(y)\,d\tau_{d-1}(y) = \frac{\omega_{d-2}}{\omega_{d-1}}\, Y_k(x) \int_{-1}^{1} f(t)P_k(t)(1-t^2)^{(d-3)/2}\,dt. \tag{38}$$
This formula says that spherical harmonics are eigenfunctions of the integral operator defined by $f(x^\top y)$ and that each eigenspace is spanned by spherical harmonics of the same degree. Moreover, it also provides a way of computing the corresponding eigenvalues.

where we used equation (36). By the rotational invariance, we can show
$$\int_{S^{d-1}} \hat{k}(x^\top Z)\,d\tau_{d-1}(Z) = \int_{S^{d-1}} \sigma(Z^\top x)\,d\tau_{d-1}(Z) \int_{S^{d-1}} \sigma(Z^\top x')\,d\tau_{d-1}(Z).$$
Thus, comparing (40) with (41), we get $\hat{\lambda}_k = \hat{\mu}_k^2$.
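As a hedged numerical illustration of the review above, the following Python sketch evaluates the Legendre polynomials in dimension $d$ through the recurrence relation and checks the normalization identity for $\int_{-1}^{1} P_k^2(t)(1-t^2)^{(d-3)/2}dt$ by quadrature. The function names and the midpoint rule in $\theta = \arccos t$ are illustrative assumptions made here, not part of the original analysis.

```python
import math

def legendre_d(k_max, t, d):
    """Return [P_0(t), ..., P_{k_max}(t)] for Legendre polynomials in dimension d >= 2.

    Starts from P_0 = 1, P_1 = t and solves the recurrence
    t P_k = k/(2k+d-2) P_{k-1} + (k+d-2)/(2k+d-2) P_{k+1} for P_{k+1}.
    """
    p = [1.0, t]
    for k in range(1, k_max):
        p.append(((2 * k + d - 2) * t * p[k] - k * p[k - 1]) / (k + d - 2))
    return p[: k_max + 1]

def num_harmonics(d, k):
    """N(d, k) = (2k+d-2)/k * binom(k+d-3, d-2) for k >= 1, and N(d, 0) = 1."""
    if k == 0:
        return 1
    return (2 * k + d - 2) * math.comb(k + d - 3, d - 2) // k

def area_ratio(d):
    """omega_{d-1} / omega_{d-2}, using omega_{d-1} = 2 pi^{d/2} / Gamma(d/2)."""
    return math.sqrt(math.pi) * math.gamma((d - 1) / 2) / math.gamma(d / 2)

def weighted_integral(f, d, n=4000):
    """Midpoint rule for int_{-1}^{1} f(t) (1-t^2)^{(d-3)/2} dt after t = cos(theta)."""
    h = math.pi / n
    total = 0.0
    for i in range(n):
        theta = (i + 0.5) * h
        total += f(math.cos(theta)) * math.sin(theta) ** (d - 2)
    return total * h

def squared_norm(k, d):
    """Numerical value of int P_k(t)^2 (1-t^2)^{(d-3)/2} dt."""
    return weighted_integral(lambda t: legendre_d(k, t, d)[k] ** 2, d)

# Compare the quadrature value with (omega_{d-1}/omega_{d-2}) / N(d, k).
for k in range(4):
    print(k, squared_norm(k, 4), area_ratio(4) / num_harmonics(4, k))
```

The two printed columns should agree up to quadrature error, which gives a quick sanity check on the normalization used in the formulas of this subsection.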

E.3 EIGENVALUES OF NEURAL TANGENT KERNELS

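Utilizing the relation $\hat{\lambda}_k = \hat{\mu}_k^2$, we derive a way of computing the eigenvalues of the integral operator $\Sigma_\infty$ associated with the activation $\sigma$. Recall the definition of the neural tangent kernel:
$$k_\infty(x,x') \overset{\mathrm{def}}{=} \mathbb{E}_{b^{(0)}\sim\mu_0}\left[\sigma(b^{(0)\top}x)\sigma(b^{(0)\top}x')\right] + (x^\top x' + \gamma^2)\,\mathbb{E}_{b^{(0)}\sim\mu_0}\left[\sigma'(b^{(0)\top}x)\sigma'(b^{(0)\top}x')\right].$$
A neural tangent kernel consists of three kernels:
$$h_1(x,x') = \mathbb{E}_{b^{(0)}\sim\mu_0}\left[\sigma(b^{(0)\top}x)\sigma(b^{(0)\top}x')\right],$$
$$h_2(x,x') = \mathbb{E}_{b^{(0)}\sim\mu_0}\left[\sigma'(b^{(0)\top}x)\sigma'(b^{(0)\top}x')\right],$$
$$h_3(x,x') = x^\top x'\,\mathbb{E}_{b^{(0)}\sim\mu_0}\left[\sigma'(b^{(0)\top}x)\sigma'(b^{(0)\top}x')\right].$$
By the argument in the previous subsection, $h_1$ and $h_2$ are dot-product kernels, that is, there exist $\hat{h}_1$ and $\hat{h}_2$ such that $h_1(x,x') = \hat{h}_1(x^\top x')$ and $h_2(x,x') = \hat{h}_2(x^\top x')$. Moreover, $h_3$ is a dot-product kernel as well because we get $h_3(x,x') = \hat{h}_3(x^\top x')$ by setting $\hat{h}_3(t) = t\hat{h}_2(t)$. Hence, the theory explained earlier is applicable to these kernels. The coefficients $\hat{\mu}_k$ for the kernels $h_1$ and $h_2$ are described as follows:
$$\hat{\mu}^{(1)}_k = \frac{\omega_{d-2}}{\omega_{d-1}} \int_{-1}^{1} \sigma(t)P_k(t)(1-t^2)^{(d-3)/2}\,dt, \tag{42}$$
$$\hat{\mu}^{(2)}_k = \frac{\omega_{d-2}}{\omega_{d-1}} \int_{-1}^{1} \sigma'(t)P_k(t)(1-t^2)^{(d-3)/2}\,dt, \tag{43}$$
yielding eigenvalues $\hat{\lambda}^{(1)}_k = (\hat{\mu}^{(1)}_k)^2$ and $\hat{\lambda}^{(2)}_k = (\hat{\mu}^{(2)}_k)^2$ for $h_1$ and $h_2$, respectively. As for the eigenvalues $\hat{\lambda}^{(3)}_k$ for $h_3$, we have
$$\begin{aligned}
\hat{\lambda}^{(3)}_k &= \frac{\omega_{d-2}}{\omega_{d-1}} \int_{-1}^{1} t\hat{h}_2(t)P_k(t)(1-t^2)^{(d-3)/2}\,dt \\
&= \frac{k}{2k+d-2}\,\frac{\omega_{d-2}}{\omega_{d-1}} \int_{-1}^{1} \hat{h}_2(t)P_{k-1}(t)(1-t^2)^{(d-3)/2}\,dt + \frac{k+d-2}{2k+d-2}\,\frac{\omega_{d-2}}{\omega_{d-1}} \int_{-1}^{1} \hat{h}_2(t)P_{k+1}(t)(1-t^2)^{(d-3)/2}\,dt \\
&= \frac{k}{2k+d-2}\hat{\lambda}^{(2)}_{k-1} + \frac{k+d-2}{2k+d-2}\hat{\lambda}^{(2)}_{k+1},
\end{aligned}$$
where we used the recurrence relation (37). Since $h_1$, $h_2$, and $h_3$ have the same eigenfunctions, the eigenvalues $\hat{\lambda}_{\infty,k}$ of $k_\infty$ are
$$\hat{\lambda}_{\infty,k} = \hat{\lambda}^{(1)}_k + \gamma^2\hat{\lambda}^{(2)}_k + \frac{k}{2k+d-2}\hat{\lambda}^{(2)}_{k-1} + \frac{k+d-2}{2k+d-2}\hat{\lambda}^{(2)}_{k+1}.$$
Hence, the calculation of $\{\hat{\lambda}_{\infty,k}\}_{k=1}^{\infty}$ reduces to computing $\hat{\mu}^{(1)}_k$ and $\hat{\mu}^{(2)}_k$ for a given activation $\sigma$.

where the limit is taken with respect to $M \to \infty$ and the notation $\xrightarrow{p}$ denotes the convergence in probability.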
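The eigenvalue formula of this subsection can be evaluated numerically. The sketch below is a minimal illustration for ReLU with $d \geq 3$: it computes the coefficients for $\sigma$ and $\sigma'$ by a midpoint rule in $\theta = \arccos t$ and combines them as in the expression for the eigenvalues of $k_\infty$. All function names and the quadrature scheme are assumptions introduced here for illustration; they are not taken from the original experiments.

```python
import math

def legendre_d(k_max, t, d):
    """[P_0(t), ..., P_{k_max}(t)] in dimension d via the recurrence relation."""
    p = [1.0, t]
    for k in range(1, k_max):
        p.append(((2 * k + d - 2) * t * p[k] - k * p[k - 1]) / (k + d - 2))
    return p[: k_max + 1]

def mu_hat(sigma, k, d, n=4000):
    """(omega_{d-2}/omega_{d-1}) * int_{-1}^{1} sigma(t) P_k(t) (1-t^2)^{(d-3)/2} dt.

    n is kept even so that the kink of ReLU at t = 0 falls on a cell boundary.
    """
    coeff = math.gamma(d / 2) / (math.sqrt(math.pi) * math.gamma((d - 1) / 2))
    h = math.pi / n
    total = 0.0
    for i in range(n):
        theta = (i + 0.5) * h
        t = math.cos(theta)
        total += sigma(t) * legendre_d(k, t, d)[k] * math.sin(theta) ** (d - 2)
    return coeff * total * h

def relu(t):
    return max(t, 0.0)

def relu_grad(t):
    return 1.0 if t > 0 else 0.0

def ntk_eigenvalue(k, d, gamma, act=relu, act_grad=relu_grad):
    """lambda^(1)_k + gamma^2 lambda^(2)_k + recurrence-weighted lambda^(2)_{k-1}, lambda^(2)_{k+1}."""
    lam1 = mu_hat(act, k, d) ** 2
    lam2 = lambda j: mu_hat(act_grad, j, d) ** 2
    # For k = 0 the coefficient k/(2k+d-2) vanishes, so the (k-1) term is dropped.
    lower = k / (2 * k + d - 2) * lam2(k - 1) if k >= 1 else 0.0
    upper = (k + d - 2) / (2 * k + d - 2) * lam2(k + 1)
    return lam1 + gamma ** 2 * lam2(k) + lower + upper

for k in range(5):
    print(k, ntk_eigenvalue(k, d=3, gamma=1.0))
```

For $d = 3$ the weight $(1-t^2)^{(d-3)/2}$ is constant, so the first few coefficients can be checked by hand, for example $\frac{1}{2}\int_0^1 t\,dt = \frac{1}{4}$ for the degree-0 coefficient of ReLU.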
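Next, we have the following convergence:
$$\begin{aligned}
\|k^{(s)}_\infty - k_\infty\|_{L_\infty(\rho_X)^2}
&\leq \sup_{x,x' \in S^{d-1}} \left|\mathbb{E}_{b^{(0)}}\left[\sigma^{(s)}(b^{(0)\top}x)\sigma^{(s)}(b^{(0)\top}x') - \sigma(b^{(0)\top}x)\sigma(b^{(0)\top}x')\right]\right| \\
&\quad + (1+\gamma^2) \sup_{x,x' \in S^{d-1}} \left|\mathbb{E}_{b^{(0)}}\left[\sigma^{(s)\prime}(b^{(0)\top}x)\sigma^{(s)\prime}(b^{(0)\top}x') - \sigma'(b^{(0)\top}x)\sigma'(b^{(0)\top}x')\right]\right| \\
&\leq 4\sup_{x \in S^{d-1}} \mathbb{E}_{b^{(0)}}\left|\sigma^{(s)}(b^{(0)\top}x) - \sigma(b^{(0)\top}x)\right| + 4(1+\gamma^2)\sup_{x \in S^{d-1}} \mathbb{E}_{b^{(0)}}\left|\sigma^{(s)\prime}(b^{(0)\top}x) - \sigma'(b^{(0)\top}x)\right| \\
&= 4\,\mathbb{E}_{b^{(0)}}\left|\sigma^{(s)}(b^{(0)\top}e_1) - \sigma(b^{(0)\top}e_1)\right| + 4(1+\gamma^2)\,\mathbb{E}_{b^{(0)}}\left|\sigma^{(s)\prime}(b^{(0)\top}e_1) - \sigma'(b^{(0)\top}e_1)\right| \to 0,
\end{aligned}$$
where for the first inequality we used the boundedness of $\sigma$, $\sigma'$, $\sigma^{(s)}$, and $\sigma^{(s)\prime}$ on $[-1,1]$, for the equality we used the rotational invariance of the measure $\mu_0$, and the limit is taken with respect to $s \to \infty$. For the final convergence in the above expression we used Assumption (A1') and boundedness with Lebesgue's convergence theorem.

In general, for a kernel $k$ and the associated integral operator $\Sigma$ with $\rho_X$, we have $\mathrm{Tr}(\Sigma) = \int_{S^{d-1}} k(X,X)\,d\rho_X$. Hence,
$$\left|\mathrm{Tr}\left(\Sigma^{(s)}_M - \Sigma^{(s)}_\infty\right)\right| \leq \|k^{(s)}_M - k^{(s)}_\infty\|_{L_\infty(\rho_X)^2} \quad \text{and} \quad \left|\mathrm{Tr}\left(\Sigma^{(s)}_\infty - \Sigma_\infty\right)\right| \leq \|k^{(s)}_\infty - k_\infty\|_{L_\infty(\rho_X)^2},$$
and the second statement holds immediately.

Finally, we show the third statement. In the same manner as the derivation of inequality (19), we get
$$\left\|\left((\Sigma^{(s)}_M + \lambda I)^{-1} - (\Sigma^{(s)}_\infty + \lambda I)^{-1}\right)g_\rho\right\|_{L_2(\rho_X)} \leq \frac{1}{\lambda^2}\|\Sigma^{(s)}_M - \Sigma^{(s)}_\infty\|_{op}\|g_\rho\|_{L_2(\rho_X)}.$$
We denote $F^{(s)}_M = (\Sigma^{(s)}_M + \lambda I)^{-1}g_\rho$ and $F^{(s)}_\infty = (\Sigma^{(s)}_\infty + \lambda I)^{-1}g_\rho$. Then, for any $x$,
$$\begin{aligned}
\left|\Sigma^{(s)}_M F^{(s)}_M(x) - \Sigma^{(s)}_\infty F^{(s)}_\infty(x)\right|
&= \left|\int_X K^{(s)}_{M,x}(X)F^{(s)}_M(X)\,d\rho_X - \int_X K^{(s)}_{\infty,x}(X)F^{(s)}_\infty(X)\,d\rho_X\right| \\
&= \left|\int_X (K^{(s)}_{M,x} - K^{(s)}_{\infty,x})(X)F^{(s)}_M(X)\,d\rho_X + \int_X K^{(s)}_{\infty,x}(X)(F^{(s)}_M - F^{(s)}_\infty)(X)\,d\rho_X\right| \\
&\leq \|K^{(s)}_{M,x} - K^{(s)}_{\infty,x}\|_{L_2(\rho_X)}\|F^{(s)}_M\|_{L_2(\rho_X)} + \|K^{(s)}_{\infty,x}\|_{L_2(\rho_X)}\|F^{(s)}_M - F^{(s)}_\infty\|_{L_2(\rho_X)} \\
&\leq \frac{1}{\lambda}\|k^{(s)}_M - k^{(s)}_\infty\|_{L_\infty(\rho_X)^2}\|g_\rho\|_{L_2(\rho_X)} + \frac{12}{\lambda^2}\|\Sigma^{(s)}_M - \Sigma^{(s)}_\infty\|_{op}\|g_\rho\|_{L_2(\rho_X)}. 
\end{aligned} \tag{49}$$
Both terms in the last expression (49) converge to 0 in probability because of the first statement of this proposition and Bernstein's inequality (Proposition 3 in Rudi & Rosasco (2017)) applied to random operators. This finishes the proof of the former part of the third statement.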
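In the same manner, we have
$$\|g^{(s)}_{\infty,\lambda} - g_{\infty,\lambda}\|_{L_\infty(\rho_X)} \leq \frac{1}{\lambda}\|k_\infty - k^{(s)}_\infty\|_{L_\infty(\rho_X)^2}\|g_\rho\|_{L_2(\rho_X)} + \frac{12}{\lambda^2}\|\Sigma_\infty - \Sigma^{(s)}_\infty\|_{op}\|g_\rho\|_{L_2(\rho_X)}.$$
The first term on the right-hand side converges to 0 because of the first statement of this proposition. We next show the convergence $\|\Sigma_\infty - \Sigma^{(s)}_\infty\|_{op} \to 0$ as $s \to \infty$. As seen in the previous section, $\Sigma_\infty$ and $\Sigma^{(s)}_\infty$ share the same eigenfunctions and every eigenvalue of $\Sigma^{(s)}_\infty$ converges to that of $\Sigma_\infty$ as $s \to \infty$. Let $\{\lambda^{(s)}_{\infty,i}\}_{i=1}^{\infty}$ and $\{\lambda_{\infty,i}\}_{i=1}^{\infty}$ be the eigenvalues of $\Sigma^{(s)}_\infty$ and $\Sigma_\infty$, respectively. For an arbitrary $\epsilon > 0$, we can take $i_\epsilon$ such that $\sum_{i=i_\epsilon}^{\infty} \lambda_{\infty,i} < \epsilon$. From the convergence $\lambda^{(s)}_{\infty,i} \to \lambda_{\infty,i}$ and $\mathrm{Tr}\,\Sigma^{(s)}_\infty \to \mathrm{Tr}(\Sigma_\infty)$ as $s \to \infty$, we see that for arbitrarily sufficiently large $s$, $|\lambda^{(s)}_{\infty,i} - \lambda_{\infty,i}| < \epsilon$ ($i < i_\epsilon$) and $\sum_{i=i_\epsilon}^{\infty} \lambda^{(s)}_{\infty,i} < 2\epsilon$. Clearly, for $i \geq i_\epsilon$, $|\lambda^{(s)}_{\infty,i} - \lambda_{\infty,i}| \leq \sum_{i=i_\epsilon}^{\infty} (\lambda^{(s)}_{\infty,i} + \lambda_{\infty,i}) < 3\epsilon$.

… for an arbitrarily sufficiently large $s$, and the bound on the degree of freedom in Proposition B is applicable. We get
$$\mathrm{Tr}\,\Sigma^{(s)}_M(\Sigma^{(s)}_M + \lambda I)^{-1} \leq 3\,\mathrm{Tr}\,\Sigma^{(s)}_\infty(\Sigma^{(s)}_\infty + \lambda I)^{-1}.$$
Let us consider upper bounding the right-hand side:
$$\begin{aligned}
\mathrm{Tr}\,\Sigma^{(s)}_\infty(\Sigma^{(s)}_\infty + \lambda I)^{-1}
&= \sum_{i=1}^{i_\lambda - 1} \frac{\lambda^{(s)}_{\infty,i}}{\lambda + \lambda^{(s)}_{\infty,i}} + \sum_{i=i_\lambda}^{\infty} \frac{\lambda^{(s)}_{\infty,i}}{\lambda + \lambda^{(s)}_{\infty,i}} \\
&\leq \sum_{i=1}^{i_\lambda - 1} \frac{\lambda^{(s)}_{\infty,i}}{\lambda + \lambda^{(s)}_{\infty,i}} + \frac{1}{\lambda}\sum_{i=i_\lambda}^{\infty} \lambda^{(s)}_{\infty,i} \\
&= \sum_{i=1}^{i_\lambda - 1} \frac{\lambda^{(s)}_{\infty,i}}{\lambda + \lambda^{(s)}_{\infty,i}} - \frac{1}{\lambda}\sum_{i=1}^{i_\lambda - 1} \lambda^{(s)}_{\infty,i} + \frac{1}{\lambda}\mathrm{Tr}\,\Sigma^{(s)}_\infty.
\end{aligned}$$
On the other hand, by the definition of $i_\lambda$,
$$\begin{aligned}
2\,\mathrm{Tr}\,\Sigma_\infty(\Sigma_\infty + \lambda I)^{-1}
&= 2\sum_{i=1}^{i_\lambda - 1} \frac{\lambda_{\infty,i}}{\lambda + \lambda_{\infty,i}} + 2\sum_{i=i_\lambda}^{\infty} \frac{\lambda_{\infty,i}}{\lambda + \lambda_{\infty,i}} \\
&\geq 2\sum_{i=1}^{i_\lambda - 1} \frac{\lambda_{\infty,i}}{\lambda + \lambda_{\infty,i}} + \frac{1}{\lambda}\sum_{i=i_\lambda}^{\infty} \lambda_{\infty,i} \\
&= 2\sum_{i=1}^{i_\lambda - 1} \frac{\lambda_{\infty,i}}{\lambda + \lambda_{\infty,i}} - \frac{1}{\lambda}\sum_{i=1}^{i_\lambda - 1} \lambda_{\infty,i} + \frac{1}{\lambda}\mathrm{Tr}(\Sigma_\infty) \\
&\geq \sum_{i=1}^{i_\lambda - 1} \frac{\lambda_{\infty,i}}{\lambda + \lambda_{\infty,i}} - \frac{1}{\lambda}\sum_{i=1}^{i_\lambda - 1} \lambda_{\infty,i} + \frac{1}{\lambda}\mathrm{Tr}(\Sigma_\infty).
\end{aligned}$$
Therefore, by inequality (55), the convergence $\lambda^{(s)}_{\infty,i} \to \lambda_{\infty,i}$ for $i \in \{1, \dots, i_\lambda - 1\}$ as $s \to \infty$, and the second statement in Proposition E, we have, in the limit $M \to \infty$ (in probability) and $s \to \infty$,
$$\mathrm{Tr}\,\Sigma^{(s)}_M(\Sigma^{(s)}_M + \lambda I)^{-1} \leq 9\,\mathrm{Tr}\,\Sigma_\infty(\Sigma_\infty + \lambda I)^{-1}. \tag{56}$$

Combining (45)-(48) with (51), (53), (54), and (56), we establish the counterpart of Theorem B. For given $\epsilon$, $\lambda$, and $\delta$, there exist sufficiently large $s$ and $M$ such that with high probability $1 - \delta$,
$$\begin{aligned}
\mathbb{E}\|g^{(T)} - g_\rho\|^2_{L_2(\rho_X)} \leq{}& \epsilon + \alpha\lambda^{2r}\|\Sigma_\infty^{-r}g_\rho\|^2_{L_2(\rho_X)} + \frac{\alpha}{T+1}\left(1 + \frac{1}{\lambda\eta^2(T+1)}\right)\|g_\rho\|^2_{\mathcal{H}_\infty} \\
&+ \frac{\alpha}{T+1}\left(1 + \|g_\rho\|^2_{L_2(\rho_X)} + \|\Sigma_\infty^{-r}g_\rho\|^2_{L_2(\rho_X)}\right)\mathrm{Tr}\,\Sigma_\infty(\Sigma_\infty + \lambda I)^{-1},
\end{aligned}$$
where $\alpha > 0$ is a universal constant.

Proof of Corollary 2. Since conditions (A1') and (A2') are special cases of (A1) and (A2), we can apply Proposition A to Algorithm 1 for the neural network with the smooth approximation $\sigma^{(s)}$ of ReLU. Hence, by setting $\eta_t = \eta = O(1)$ satisfying $4(6+\lambda)\eta \leq 1$ and $\lambda = T^{-\beta/(2r\beta+1)}$ where $\beta = 1 + \frac{1}{d-1}$, and by applying $\mathrm{Tr}\,\Sigma_\infty(\Sigma_\infty + \lambda I)^{-1} = O(\lambda^{-1/\beta})$ (Caponnetto & De Vito, 2007) and $\|g_\rho\|_{\mathcal{H}_\infty}, \|g_\rho\|_{L_2(\rho_X)} \leq O\left(\|\Sigma_\infty^{-r}g_\rho\|_{L_2(\rho_X)}\right)$ because of $\|\Sigma_\infty\|_{op} \leq O(1)$, we finish the proof of Corollary 2.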
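To make the parameter choice in this proof concrete, the following sketch computes $\beta$, the regularization $\lambda = T^{-\beta/(2r\beta+1)}$, and the two leading terms $\lambda^{2r}$ and $\lambda^{-1/\beta}/T$ of the bound, with all constants dropped. The helper name `corollary2_schedule` is hypothetical, and the simplification to these two terms is an assumption made here for illustration; under it, both terms equal $T^{-2r\beta/(2r\beta+1)}$.

```python
def corollary2_schedule(T, r, d):
    """Regularization and leading-order terms for T iterations, smoothness r, input dimension d.

    Constants and lower-order terms of the bound are ignored (illustrative only).
    """
    beta = 1.0 + 1.0 / (d - 1)                        # eigenvalue decay exponent used in the proof
    lam = T ** (-beta / (2.0 * r * beta + 1.0))       # lambda = T^{-beta/(2 r beta + 1)}
    bias = lam ** (2.0 * r)                           # term lambda^{2r}
    variance = lam ** (-1.0 / beta) / T               # term Tr(...) / T with Tr(...) ~ lambda^{-1/beta}
    rate_exponent = 2.0 * r * beta / (2.0 * r * beta + 1.0)
    return {"beta": beta, "lam": lam, "bias": bias,
            "variance": variance, "rate_exponent": rate_exponent}

for T in (10**3, 10**4, 10**5):
    s = corollary2_schedule(T, r=0.5, d=3)
    print(T, s["lam"], s["bias"], s["variance"], T ** (-s["rate_exponent"]))
```

The printed bias and variance columns coincide with $T^{-2r\beta/(2r\beta+1)}$, which shows that this choice of $\lambda$ balances the two terms.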



In general, for any operator $F : L_2(\rho_X) \to L_2(\rho_X)$ that commutes with $\Sigma_M$ and has a common eigenbasis with $\Sigma_M$, it follows that $F(\mathcal{H}_M) \subset \mathcal{H}_M$ and the inequality $F \succeq 0$ in $L_2(\rho_X)$ is equivalent to $F|_{\mathcal{H}_M} \succeq 0$. Hence, for simplicity, we do not specify the Hilbert space under consideration in such a case.



$a^{(0)}_r = R$ for $r \in \{1, \dots, \frac{M}{2}\}$ and $a^{(0)}_r = -R$ for $r \in \{\frac{M}{2}+1, \dots, M\}$, where $R > 0$ is a positive constant. Let $\mu_0$ be the uniform distribution on the sphere $S^{d-1} = \{b \in \mathbb{R}^d \mid \|b\|_2 = 1\} \subset \mathbb{R}^d$ used to initialize the parameters for the input layer. The parameters for the input layer are initialized as $b^{(0)}_r = b^{(0)}_{r+\frac{M}{2}}$ for $r \in \{1, \dots, \frac{M}{2}\}$, where (b


(A3') The condition (A3) is satisfied by the NTK associated with the ReLU activation $\sigma$.

Figure 2: Top: Estimation of $\|\Sigma^{-r}g_\rho\|_{L_2(\rho_X)}$ ($r \in [0.5, 1]$) for integral operators $\Sigma \in \{\Sigma_{a,\infty}, \Sigma_{b,\infty}, \Sigma_\infty\}$ of two-layer ReLU networks. Bayes rules $g_\rho$ are set to the average eigenfunctions of $\Sigma_{a,\infty}$ (left), $\Sigma_{b,\infty}$ (middle), and $\Sigma_\infty$ (right). Bottom: Learning curves of test errors for Algorithm 1 with two-layer swish networks.

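share the same eigenfunctions and every eigenvalue of $\Sigma^{(s)}_\infty$ converges to that of $\Sigma_\infty$ as $s \to \infty$. Let $\{\lambda^{(s)}_{\infty,i}\}_{i=1}^{\infty}$ and $\{\lambda_{\infty,i}\}_{i=1}^{\infty}$ be eigenvalues of $\Sigma^{(s)}$

$\to \lambda_{\infty,i}$ and $\mathrm{Tr}\,\Sigma^{(s)}_\infty \to \mathrm{Tr}(\Sigma_\infty)$ as $s \to \infty$, we see that for arbitrarily sufficiently large $s$, $|\lambda^{(s)}$

$\to \lambda_{\infty,i}$ for $i \in \{1, \dots, i_\lambda - 1\}$ as $s \to \infty$, and the second statement in Proposition E, we have plim $s \to \infty$ lim … $+ \lambda I)^{-1} \leq 9\,\mathrm{Tr}\,\Sigma_\infty(\Sigma_\infty + \lambda I)^{-1}$.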

Taiji Suzuki. Generalization bound of globally optimal non-convex neural network training: Transportation map estimation by infinite dimensional Langevin dynamics. In Advances in Neural Information Processing Systems, 2020.

E Weinan, Chao Ma, and Lei Wu. A comparative analysis of optimization and generalization properties of two-layer neural network and random feature models under gradient descent dynamics. Science China Mathematics, pp. 1-24, 2019.

Xiaoxia Wu, Simon S Du, and Rachel Ward. Global convergence of adaptive gradient methods for an over-parameterized neural network. arXiv preprint arXiv:1902.07111, 2019.

Yiming Ying and D-X Zhou. Online regularized classification algorithms. IEEE Transactions on Information Theory, 52(11):4775-4788, 2006.

Guodong Zhang, James Martens, and Roger B Grosse. Fast convergence of natural gradient descent for over-parameterized neural networks. In Advances in Neural Information Processing Systems, pp. 8082-8093, 2019.

Tong Zhang. Statistical behavior and consistency of classification methods based on convex risk minimization. The Annals of Statistics, 32(1):56-134, 2004.

Difan Zou and Quanquan Gu. An improved analysis of training over-parameterized deep neural networks. In Advances in Neural Information Processing Systems, pp. 2053-2062, 2019.

ACKNOWLEDGMENTS

AN was partially supported by JSPS KAKENHI (19K20337) and JST-PRESTO. TS was partially supported by JSPS KAKENHI (18K19793, 18H03201, and 20H00576), Japan Digital Design, and JST CREST.

A PROOF SKETCH OF THE MAIN RESULTS

We provide several key results and a proof sketch of Theorem 1 and Corollary 1. We first recall the definition of stochastic gradients of $L$ in a general RKHS $(\mathcal{H}, \langle\cdot,\cdot\rangle_{\mathcal{H}})$ associated with a uniformly bounded real-valued kernel function $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$. We set $K_X = k(X, \cdot)$. Then, it follows that for $\forall g, \forall h \in \mathcal{H}$, … which is confirmed by the following equations: $h(X) = \langle h, k(X,\cdot)\rangle_{\mathcal{H}}$ and $|h(X)| \leq \|h\|_{\mathcal{H}}\sqrt{k(X,X)}$. This means that the stochastic gradient of $L$ in $\mathcal{H}$ is given by $\partial_\zeta \ell(g(X), Y)k(X, \cdot)$ for $(X, Y) \sim \rho$. In addition, the stochastic gradient of the $L_2$-regularized risk is given by $\partial_\zeta \ell(g(X), Y)k(X, \cdot) + \lambda g$.
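The stochastic gradient of the regularized risk described above can be turned into a small functional SGD loop. The sketch below is only an illustration under stated assumptions: it uses the squared loss, a Gaussian kernel instead of the NTK, a constant step size, and a uniform average of the iterates $g_0, \dots, g_T$; the function names and the toy target $\sin(3x)$ are hypothetical and are not the reference ASGD of the paper.

```python
import math
import random

def gaussian_kernel(x, z, bandwidth=0.5):
    """Illustrative bounded kernel with k(x, x) = 1 (an assumption, not the NTK)."""
    return math.exp(-((x - z) ** 2) / (2.0 * bandwidth ** 2))

def predict(centers, coef, x, kernel=gaussian_kernel):
    """Evaluate g(x) = sum_i coef[i] * k(centers[i], x)."""
    return sum(c * kernel(z, x) for c, z in zip(coef, centers))

def averaged_kernel_sgd(samples, eta, lam, kernel=gaussian_kernel):
    """Run g_{t+1} = g_t - eta * ((g_t(x_t) - y_t) k(x_t, .) + lam * g_t) from g_0 = 0.

    Returns the kernel centers and the coefficients of the uniform average of g_0, ..., g_T.
    """
    centers, coef, running_sum = [], [], []
    for x, y in samples:
        # Add the current iterate g_t to the running sum before updating it.
        for i, c in enumerate(coef):
            running_sum[i] += c
        residual = predict(centers, coef, x, kernel) - y   # derivative of 0.5*(y - z)^2 at z = g_t(x)
        coef = [(1.0 - eta * lam) * c for c in coef]       # shrinkage from the lam * g_t term
        centers.append(x)
        coef.append(-eta * residual)                       # new coefficient on k(x_t, .)
        running_sum.append(0.0)
    for i, c in enumerate(coef):
        running_sum[i] += c                                # include the final iterate g_T
    n_iterates = len(samples) + 1
    return centers, [v / n_iterates for v in running_sum]

random.seed(0)
target = lambda x: math.sin(3.0 * x)
samples = [(x, target(x)) for x in (random.uniform(-1.0, 1.0) for _ in range(400))]
centers, avg_coef = averaged_kernel_sgd(samples, eta=0.5, lam=1e-3)
grid = [-1.0 + 0.02 * i for i in range(101)]
mse_avg = sum((predict(centers, avg_coef, x) - target(x)) ** 2 for x in grid) / len(grid)
mse_zero = sum(target(x) ** 2 for x in grid) / len(grid)
print(f"averaged iterate MSE: {mse_avg:.4f}, zero function MSE: {mse_zero:.4f}")
```

Each update adds one multiple of $k(X,\cdot)$ and shrinks the existing coefficients by $1 - \eta\lambda$, which is exactly the form of the regularized stochastic gradient stated above.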

E.2 EIGENVALUES OF DOT-PRODUCT KERNELS

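Let $\mu_0$ be the uniform distribution on $S^{d-1}$. Note that although $\tau_{d-1}$ and $\mu_0$ are the same distribution, we use the two symbols $\tau_{d-1}$ and $\mu_0$ depending on the random variables. First, we consider any activation function $\sigma : \mathbb{R} \to \mathbb{R}$ and the kernel function $k(x,x') = \mathbb{E}_{b\sim\mu_0}[\sigma(b^\top x)\sigma(b^\top x')]$ on the sphere $S^{d-1}$. We show that this kernel function is a dot-product kernel, that is, there is $\hat{k} : \mathbb{R} \to \mathbb{R}$ such that $k(x,x') = \hat{k}(x^\top x')$. In fact, it can be confirmed as follows. For any $x, x' \in S^{d-1}$, we take $\theta \in [0,\pi]$ so that $x^\top x' = \cos\theta$, and an orthogonal matrix $A \in \mathbb{R}^{d\times d}$ so that $Ax = (1, 0, \dots, 0)$ and $Ax' = (\cos\theta, \sin\theta, 0, \dots, 0)$, which is possible because $A$ preserves the value of $x^\top x'$. Then, since $\mu_0$ is rotationally invariant, we see
$$k(x,x') = \mathbb{E}_{b\sim\mu_0}\left[\sigma(b_1)\sigma(b_1\cos\theta + b_2\sin\theta)\right], \quad b = (b_1, \dots, b_d).$$
In other words, we see that $k$ is a function of $\theta = \arccos(x^\top x')$, and is a dot-product kernel $k(x,x') = \hat{k}(x^\top x')$. Hence, we can apply the Funk-Hecke formula (38) to $k(x,\cdot)$.

The derivation of the eigenvalues of the integral operator follows the approach developed by Bach (2017a); Bietti & Mairal (2019); Cao et al. (2019). In general, $g \in L_2(\tau_{d-1})$ can be decomposed by spherical harmonics as follows:
$$g = \sum_{k=0}^{\infty}\sum_{j=1}^{N(d,k)} \langle g, Y_{kj}\rangle_{L_2(\tau_{d-1})} Y_{kj} = \sum_{k=0}^{\infty} N(d,k)\int_{S^{d-1}} g(Z)P_k(Z^\top \cdot)\,d\tau_{d-1}(Z), \tag{39}$$
where we applied the addition formula to obtain the last equality.

Here, we apply this decomposition (39) to $k(x,\cdot) = \hat{k}(x^\top \cdot)$. Since $P_k(Z^\top \cdot)$ is a linear combination of spherical harmonics of degree $k$ (see the addition formula), we get
$$\hat{k}(x^\top x') = \sum_{k=0}^{\infty} N(d,k)\hat{\lambda}_k P_k(x^\top x'), \tag{40}$$
where we used the Funk-Hecke formula (38) and we set $\hat{\lambda}_k = \frac{\omega_{d-2}}{\omega_{d-1}}\int_{-1}^{1} \hat{k}(t)P_k(t)(1-t^2)^{(d-3)/2}\,dt$. We note that $\hat{\lambda}_k$ is an eigenvalue with multiplicity $N(d,k)$ of the integral operator defined by $k$.

Next, we derive another expression of $\hat{k}$. In a similar way, we obtain the following equation:
$$\sigma(Z^\top x) = \sum_{k=0}^{\infty} N(d,k)\hat{\mu}_k P_k(Z^\top x), \quad \hat{\mu}_k = \frac{\omega_{d-2}}{\omega_{d-1}}\int_{-1}^{1} \sigma(t)P_k(t)(1-t^2)^{(d-3)/2}\,dt.$$
By the definition of $k$ and the orthogonality of spherical harmonics, we get
$$\hat{k}(x^\top x') = \sum_{k=0}^{\infty} N(d,k)\hat{\mu}_k^2 P_k(x^\top x'). \tag{41}$$

Eigenvalues for ReLU and smooth approximations of ReLU. As for the ReLU activation, its eigenvalues were derived in Bach (2017a). Let $\sigma$ be ReLU. Then, $\hat{\mu}^{(1)}_k$ … We note that the multiplicity of $\hat{\lambda}_{\infty,k}$ is $N(d,k)$, so that we should take the multiplicity into account to derive the decay order of the eigenvalues $\lambda$. As a result, Assumption (A4) is verified with $\beta = 1 + \frac{1}{d-1}$ for ReLU.

For the smooth approximation $\sigma^{(s)}$ of ReLU satisfying Assumption (A1'), we can show that every eigenvalue of $\Sigma^{(s)}_\infty$ derived from $\sigma^{(s)}$ converges to that for ReLU as $s \to \infty$ because of (42) and (43) with Lebesgue's convergence theorem.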
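The convergence of the coefficients for a smooth approximation can be observed numerically. The sketch below uses the softplus function $\sigma^{(s)}(t) = \frac{1}{s}\log(1 + e^{st})$ as a stand-in smooth approximation of ReLU; this choice is an assumption made here for illustration, and whether it satisfies (A1') is not checked. Since $0 \leq \sigma^{(s)}(t) - \max(t,0) \leq \log 2 / s$ and $|P_k| \leq 1$ on $[-1,1]$, the gap between the coefficients is at most $\log 2 / s$.

```python
import math

def legendre_d(k_max, t, d):
    """[P_0(t), ..., P_{k_max}(t)] in dimension d via the recurrence relation."""
    p = [1.0, t]
    for k in range(1, k_max):
        p.append(((2 * k + d - 2) * t * p[k] - k * p[k - 1]) / (k + d - 2))
    return p[: k_max + 1]

def relu(t):
    return max(t, 0.0)

def softplus(t, s):
    """(1/s) * log(1 + exp(s t)), written in a numerically stable form."""
    return max(t, 0.0) + math.log1p(math.exp(-s * abs(t))) / s

def mu_hat(sigma, k, d, n=2000):
    """(omega_{d-2}/omega_{d-1}) * int_{-1}^{1} sigma(t) P_k(t) (1-t^2)^{(d-3)/2} dt (midpoint rule)."""
    coeff = math.gamma(d / 2) / (math.sqrt(math.pi) * math.gamma((d - 1) / 2))
    h = math.pi / n
    total = 0.0
    for i in range(n):
        theta = (i + 0.5) * h
        t = math.cos(theta)
        total += sigma(t) * legendre_d(k, t, d)[k] * math.sin(theta) ** (d - 2)
    return coeff * total * h

def coefficient_gap(k, d, s):
    """Absolute difference between the degree-k coefficients of softplus_s and ReLU."""
    return abs(mu_hat(lambda t: softplus(t, s), k, d) - mu_hat(relu, k, d))

for s in (1.0, 10.0, 100.0):
    print(s, [coefficient_gap(k, 3, s) for k in range(4)])
```

The printed gaps shrink as $s$ grows, in line with the dominated-convergence argument used in the text.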

F EXPLICIT CONVERGENCE RATES FOR SMOOTH APPROXIMATION OF RELU

For convenience, we here list the notation used in this section. In this section, let $\sigma$ and $\sigma^{(s)}$ be the ReLU activation and its smooth approximation satisfying (A1'), respectively, and for … $_{M,\lambda}$ be the corresponding kernel, integral operators, and minimizers of the regularized expected risk functions. Let $g^{(T)}$ be the iterates obtained by the reference ASGD (Algorithm 2) in the RKHS associated with $k^{(s)}_M$. We consider the following decomposition: … These terms can be made arbitrarily small by taking large $M$ and $s$. As for (48), this property is a direct consequence of Proposition D. Note that Proposition C is not applicable to (46) because this proposition requires the specification of the target function by $k^{(s)}_\infty$, which does not hold in general. In the following, we treat the remaining terms.

Proposition E. Suppose (A1') and (A2') hold. Then, we have … where plim denotes the convergence in probability.

Proof. We show the first statement. By the uniform law of large numbers (Theorem 3.1 in Mohri et al. (2012)), we see the convergence in probability: …

Therefore, we conclude the uniform convergence $\sup_{i \in \{1,\dots,\infty\}} |\lambda^{(s)}_{\infty,i} - \lambda_{\infty,i}| \to 0$ as $s \to \infty$.

So far, we have shown that (46), (47), and (48) can be made arbitrarily small by taking large $s$ and $M$ depending on $\lambda$. The remaining problem is to show the convergence of (45). To do so, we establish the counterpart of Theorem B by adapting Theorem A and Proposition B to the current setting.

The counterpart of Theorem B. In Theorem A, the condition (A3) is required for the NTK associated with the smooth activation $\sigma^{(s)}$, and it is not satisfied in general. Note that (A3) is used for bounding $\|g^{(s)}_{M,\lambda}\|_{L_\infty(\rho_X)}$ uniformly, as seen in the proof of Lemma A. Let us consider the decomposition: … Here, for the second inequality we used (20). Note that the inequality (20) holds for $\Sigma_\infty$ because the condition (A3) is supposed for ReLU. For the last inequality we used Proposition E. Hence, Theorem A is applicable and the same convergence as in Theorem A holds for $\sigma^{(s)}$.

For arbitrarily sufficiently large $s$ and $M$, with high probability, we have … Next, we adapt Proposition B to the current setting. By inequality (34), there exists $M_0 \in \mathbb{Z}_+$ such that $\forall M \geq M_0$, with high probability, (Σ∞ + λI)$^{-1} \preceq 2(\Sigma_\infty + \lambda I)^{-1}$ holds; then we have the counterpart of the second inequality in Proposition B because … where we used the fact that $g_\rho$ is contained in $\mathcal{H}_\infty$ because of (A3'). Note that the first inequality in Proposition B is a direct consequence of the second one. We consider eigenvalues $\{\lambda$ … $_\infty$ and $\Sigma_\infty$, respectively. Let $i_\lambda$ be an index such that for $\forall i$, … for an arbitrarily sufficiently large $s$, we have $|\lambda$ … These are the counterparts of the first and second inequalities in Proposition B.

Next, we consider the bound on the degree of freedom in this proposition. Assume $\lambda \leq \frac{1}{2}\|\Sigma_\infty\|_{op}$. As seen earlier, an operator Σ

G APPLICATION TO BINARY CLASSIFICATION PROBLEMS

In this paper, we mainly focused on regression problems, but our idea can be applied to other settings. We briefly discuss its application to binary classification problems. The label space is set to $Y = \{-1, 1\}$ and the loss function is set to be the squared loss: $\ell(z,y) = 0.5(y-z)^2$. The ultimate goal of the binary classification problem is to obtain the Bayes classifier that minimizes the expected classification error, $R(g) = \mathbb{P}_{(X,Y)\sim\rho}[\mathrm{sgn}(g(X)) \neq Y]$, over all measurable maps. It is known that the Bayes classifier is expressed as $\mathrm{sgn}(g_\rho(X))$, where $g_\rho$ is the Bayes rule of $L(g) = \mathbb{E}_\rho[\ell(g(X), Y)]$ (see Zhang (2004); Bartlett et al. (2006)). Therefore, if $g_\rho$ satisfies a margin condition, i.e., $|g_\rho(x)| \geq \exists\tau > 0$ on $\mathrm{supp}(\rho_X)$, then this goal is achieved by obtaining a $\tau/2$-accurate solution of $g_\rho$ in terms of the uniform norm on $\mathrm{supp}(\rho_X)$. That is, the required optimization accuracy on $\|g_{\Theta^{(T)}} - g_\rho\|_{L_\infty(\rho_X)}$ to obtain the Bayes classifier depends only on the margin $\tau$, unlike regression problems. Due to this property, averaged stochastic gradient descent in RKHSs can achieve the linear convergence rate demonstrated in Pillaud-Vivien et al. (2018a). To leverage this theory in our problem setting, we consider the following decomposition: …

The last term (61) can be made arbitrarily small by $\lambda \to 0$, as shown in Pillaud-Vivien et al. (2018a). The term (60) can be bounded in the same manner as the third statement of Proposition E, yielding the convergence to 0 as $M \to \infty$ with high probability. The convergence of (59) was shown in Pillaud-Vivien et al. (2018a) and the convergence of (58) is guaranteed by Proposition A. As a result, we can show the following exponential convergence of the classification error $R(g)$ for two-layer neural networks with a sufficiently small $\lambda$, as demonstrated in Pillaud-Vivien et al. (2018a):
$$\mathbb{E}[R(g_{\Theta^{(T)}}) - R(g_\rho)] \leq 2\exp(-O(\lambda^2\tau^2 T)).$$

In Nitanda & Suzuki (2019), an exponential convergence was shown for the logistic loss $\ell(z,y) = \log(1 + \exp(-yz))$ as well. Proposition A also holds for the logistic loss, with an easier proof than for the squared loss because of the boundedness of stochastic gradients of the loss. Hence, their theory is also applicable to the reference ASGD in an RKHS. In summary, (58), (59), and (61) can be bounded by the above argument. However, we note that bounding (60) is not obvious and is left for future work.
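The margin argument for the squared loss in this section can be spelled out in one line. The following is a sketch of the standard reasoning, where $g$ denotes any estimate of $g_\rho$ (a symbol introduced here for illustration): if $g$ is $\tau/2$-accurate in the uniform norm and $g_\rho$ has margin $\tau$, then $g$ and $g_\rho$ have the same sign on $\mathrm{supp}(\rho_X)$.

```latex
\[
|g(x) - g_\rho(x)| \le \tfrac{\tau}{2}, \quad |g_\rho(x)| \ge \tau
\;\Longrightarrow\;
g(x)\,g_\rho(x) = g_\rho(x)^2 + \bigl(g(x) - g_\rho(x)\bigr)g_\rho(x)
\ge |g_\rho(x)|\left(|g_\rho(x)| - \tfrac{\tau}{2}\right)
\ge \tfrac{\tau^2}{2} > 0 .
\]
```

Hence $\mathrm{sgn}(g(x)) = \mathrm{sgn}(g_\rho(x))$ for every $x \in \mathrm{supp}(\rho_X)$, so the excess classification error of such a $g$ is zero.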

