BENEFIT OF DEEP LEARNING WITH NON-CONVEX NOISY GRADIENT DESCENT: PROVABLE EXCESS RISK BOUND AND SUPERIORITY TO KERNEL METHODS

Abstract

Establishing a theoretical analysis that explains why deep learning can outperform shallow learning such as kernel methods is one of the biggest issues in the deep learning literature. Towards answering this question, we evaluate the excess risk of a deep learning estimator trained by noisy gradient descent with ridge regularization on a mildly overparameterized neural network, and discuss its superiority to a class of linear estimators that includes the neural tangent kernel approach, random feature models, other kernel methods, the k-NN estimator, and so on. We consider a teacher-student regression model, and eventually show that any linear estimator can be outperformed by deep learning in the sense of the minimax optimal rate, especially in high dimensional settings. The obtained excess risk bounds are so-called fast learning rates, faster than the $O(1/\sqrt{n})$ rate obtained by the usual Rademacher complexity analysis. This discrepancy is induced by the non-convex geometry of the model; moreover, the noisy gradient descent used for neural network training provably reaches a nearly global optimal solution even though the loss landscape is highly non-convex. Although the noisy gradient descent does not employ any explicit or implicit sparsity inducing regularization, it achieves generalization performance that dominates that of linear estimators.

1. INTRODUCTION

In the deep learning theory literature, clarifying the mechanism by which deep learning can outperform shallow approaches has attracted much attention for a long time. In particular, it is quite important to show that a tractable algorithm for deep learning can provably achieve better generalization performance than shallow methods. Towards that goal, we study the rate of convergence of the excess risk of both deep and shallow methods in a nonparametric regression setting. One of the difficulties in showing the generalization ability of deep learning under certain optimization methods is that the solution is likely to be stuck in a bad local minimum, which prevents us from showing its preferable performance. Recent studies tackled this problem by considering optimization on overparameterized networks as in the neural tangent kernel (NTK) (Jacot et al., 2018; Du et al., 2019a) and mean field analysis (Nitanda & Suzuki, 2017; Chizat & Bach, 2018; Rotskoff & Vanden-Eijnden, 2018; 2019; Mei et al., 2018; 2019), or by analyzing noisy gradient descent such as stochastic gradient Langevin dynamics (SGLD) (Welling & Teh, 2011; Raginsky et al., 2017; Erdogdu et al., 2018). The NTK analysis deals with a relatively large scale initialization so that the model is well approximated by the tangent space at the initial solution, and eventually all analyses can be reduced to those of kernel methods (Jacot et al., 2018; Du et al., 2019b; Allen-Zhu et al., 2019; Du et al., 2019a; Arora et al., 2019; Cao & Gu, 2019; Zou et al., 2020). Although this regime is useful for showing global convergence, the obtained estimator loses a large advantage of deep learning approaches because its estimation ability is reduced to that of the corresponding kernel method. To overcome this issue, there are several "beyond-kernel" type analyses. For example, Allen-Zhu & Li (2019; 2020) showed the benefit of depth by analyzing ResNet type networks. Li et al.
(2020) showed global optimality of gradient descent by reducing the optimization problem to a tensor decomposition problem for a specific regression problem, and showed that the "ideal" estimator on a linear model has worse dependency on the input dimensionality. Bai & Lee (2020) considered a second order Taylor expansion and showed that the sample complexity of deep approaches has better dependency on the input dimensionality than kernel methods. Chen et al. (2020) derived a similar conclusion by considering a hierarchical representation. The analyses mentioned above do show some superiority of deep learning, but all of these bounds are essentially $\Omega(1/\sqrt{n})$, where $n$ is the sample size, which is not optimal for regression problems with squared loss (Caponnetto & de Vito, 2007). The reason such sub-optimal rates appear is that these analyses mostly bound the generalization gap via the Rademacher complexity of the set in which the estimators lie. However, to derive a tight excess risk bound instead of a generalization gap bound, we need to evaluate the so-called local Rademacher complexity (Mendelson, 2002; Bartlett et al., 2005; Koltchinskii, 2006) (see Eq. (2) for the definition of excess risk). Moreover, some of the existing analyses must change the target function class as the sample size $n$ increases, for example, by increasing the input dimensionality with the sample size, which makes it difficult to see how the rate of convergence is affected by the choice of estimator. Another promising approach is the mean field analysis. There is also some work showing the superiority of deep learning over kernel methods: Ghorbani et al. (2019) showed that, when the dimensionality $d$ of the input grows polynomially in $n$, kernel methods are outperformed by neural network approaches.
Although the setting of increasing $d$ models modern high dimensional situations well, it blurs the rate of convergence. Actually, we can show the superiority of deep learning even in a fixed dimension setting. There are several studies on the approximation abilities of deep and shallow models. Ghorbani et al. (2020) showed adaptivity of kernel methods to the intrinsic dimensionality in terms of approximation error and discussed the difference between deep and kernel methods. Yehudai & Shamir (2019) showed that the random feature method requires a number of nodes exponential in the input dimension to obtain a good approximation of a single neuron target function. These results concern only approximation errors; estimation errors are not compared. Recently, the superiority of deep learning over kernel methods has also been discussed in the nonparametric statistics literature, where the minimax optimality of deep learning in terms of excess risk is shown. In particular, it has been shown that deep learning achieves a better rate of convergence than linear estimators in several settings (Schmidt-Hieber, 2020; Suzuki, 2019; Imaizumi & Fukumizu, 2019; Suzuki & Nitanda, 2019; Hayakawa & Suzuki, 2020). Here, linear estimators form a general class of estimators that includes kernel ridge regression, k-NN regression, and the Nadaraya-Watson estimator. Although these analyses give a clear statistical characterization of the estimation ability of deep learning, they are not compatible with tractable optimization algorithms. In this paper, we give a theoretical analysis that unifies these analyses and shows the superiority of a deep learning method trained by a tractable noisy gradient descent algorithm. We evaluate the excess risks of the deep learning approach and of linear estimators in a nonparametric regression setting, and show that the minimax optimal convergence rate of linear estimators can be dominated by noisy gradient descent on neural networks.
In our analysis, the model is fixed and no explicit sparse regularization is employed. Our contributions can be summarized as follows:

• A refined analysis of excess risks for a fixed model with a fixed input dimension is given to compare deep and shallow estimators. Although several studies pointed out that the curse of dimensionality is a key factor separating shallow and deep approaches, we show that such a separation appears even in a rather low dimensional setting and, more importantly, that the non-convexity of the model is what essentially separates the two regimes.

• A lower bound on the excess risk that is valid for any linear estimator is derived. The analysis is considerably general because the class of linear estimators includes kernel ridge regression with any kernel, and thus it also covers estimators in the NTK regime.

• All derived convergence rates are fast learning rates, faster than $O(1/\sqrt{n})$. We show that simple noisy gradient descent on a sufficiently wide two-layer neural network achieves a fast learning rate by using the fact that the solution converges to a Bayes estimator with a Gaussian process prior, and the derived convergence rate can be faster than that of linear estimators. This is quite different from existing work that compared only the coefficients of bounds sharing the same rate of convergence with respect to the sample size $n$.

Other related work Bach (2017) analyzed the model capacity of neural networks and the corresponding reproducing kernel Hilbert space (RKHS), and showed that the RKHS is much larger than the neural network model. However, a separation of the estimation abilities of shallow and deep methods is not proven there. Moreover, the analyzed algorithm is basically a Frank-Wolfe type method, which is not typically used in practical deep learning. The same technique is also employed by Barron (1993).
The Frank-Wolfe algorithm is a kind of sparsity inducing algorithm that is effective for estimating a function in a model with an $L_1$-norm constraint. It has been shown that explicit or implicit sparse regularization such as $L_1$-regularization is beneficial for obtaining better performance of deep learning in certain situations (Chizat & Bach, 2020; Chizat, 2019; Gunasekar et al., 2018; Woodworth et al., 2020; Klusowski & Barron, 2016). For example, E et al. (2019b; a) showed that the approximation error of a linear model suffers from the curse of dimensionality when the target function is in the Barron class (Barron, 1993), and that an $L_1$-type regularization avoids the curse of dimensionality. However, our analysis goes in a different direction in which sparse regularization is not required.

2. PROBLEM SETTING AND MODEL

In this section, we give the problem setting and notations used in the theoretical analysis. We consider the standard nonparametric regression problem in which data are generated from the following model for an unknown true function $f^o : \mathbb{R}^d \to \mathbb{R}$:
$$y_i = f^o(x_i) + \epsilon_i \quad (i = 1, \dots, n),$$
where $x_i$ is independently identically distributed from $P_X$ whose support is included in $\Omega = [0,1]^d$, and $\epsilon_i$ is an observation noise that is independent of $x_i$ and satisfies $\mathbb{E}[\epsilon_i] = 0$ and $\epsilon_i \in [-U, U]$ almost surely. The $n$ i.i.d. observations are denoted by $D_n = (x_i, y_i)_{i=1}^n$. We want to estimate the true function $f^o$ from the training data $D_n$. To this end, we employ the squared loss $\ell(y, f(x)) = (y - f(x))^2$ and accordingly define the expected and empirical risks as $L(f) := \mathbb{E}_{Y,X}[\ell(Y, f(X))]$ and $\widehat{L}(f) := \frac{1}{n}\sum_{i=1}^n \ell(y_i, f(x_i))$, respectively. Throughout this paper, we are interested in the excess (expected) risk of an estimator $\widehat{f}$ defined by
$$(\text{Excess risk}) \quad L(\widehat{f}) - \inf_{f:\,\text{measurable}} L(f). \qquad (2)$$
Since the loss function is the squared loss, the infimum $\inf_{f:\,\text{measurable}} L(f)$ is achieved by $f^o$: $\inf_{f:\,\text{measurable}} L(f) = L(f^o)$. The population $L_2(P_X)$-norm is denoted by $\|f\|_{L_2(P_X)} := \sqrt{\mathbb{E}_{X \sim P_X}[f(X)^2]}$ and the sup-norm on the support of $P_X$ by $\|f\|_\infty := \sup_{x \in \mathrm{supp}(P_X)} |f(x)|$. One can easily check that, for an estimator $\widehat{f}$, the squared $L_2$-distance $\|\widehat{f} - f^o\|_{L_2(P_X)}^2$ between the estimator and the true function is identical to the excess risk: $L(\widehat{f}) - L(f^o) = \|\widehat{f} - f^o\|_{L_2(P_X)}^2$. Note that the excess risk is different from the generalization gap $L(\widehat{f}) - \widehat{L}(\widehat{f})$. Indeed, the generalization gap typically converges at the rate $O(1/\sqrt{n})$, which is optimal in a typical setting (Mohri et al., 2012). On the other hand, the excess risk can converge faster than $O(1/\sqrt{n})$, which is known as a fast learning rate (Mendelson, 2002; Bartlett et al., 2005; Koltchinskii, 2006; Giné & Koltchinskii, 2006).
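As a sanity check of the identity $L(\widehat{f}) - L(f^o) = \|\widehat{f} - f^o\|_{L_2(P_X)}^2$, the following sketch computes both sides in closed form on a hypothetical toy distribution (a discrete $P_X$ on three points and a symmetric two-point noise, both chosen purely for illustration):

```python
import itertools

# Hypothetical toy setting: P_X uniform on three points, noise eps = +/- u with
# probability 1/2 each, so E[eps] = 0 and |eps| <= u almost surely, as assumed above.
xs = [0.0, 0.5, 1.0]
p_x = [1 / 3] * 3
u = 0.3

f_true = lambda x: x ** 2          # the (hypothetical) true function f^o
f_hat = lambda x: 0.8 * x          # an arbitrary estimator

def expected_risk(f):
    """L(f) = E_{X,eps}[(f^o(X) + eps - f(X))^2] under the discrete toy distribution."""
    return sum(p * 0.5 * (f_true(x) + e - f(x)) ** 2
               for (x, p), e in itertools.product(zip(xs, p_x), [u, -u]))

excess = expected_risk(f_hat) - expected_risk(f_true)
l2_dist_sq = sum(p * (f_hat(x) - f_true(x)) ** 2 for x, p in zip(xs, p_x))
# squared loss: the excess risk equals the squared L2(P_X) distance exactly
assert abs(excess - l2_dist_sq) < 1e-12
```

The cross term $2\mathbb{E}[\epsilon(f^o(X) - \widehat{f}(X))]$ vanishes because the noise has mean zero, which is why the identity holds exactly rather than only approximately.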

2.1. MODEL OF TRUE FUNCTIONS

To analyze the excess risk, we need to specify a function class (in other words, a model) in which the true function $f^o$ is included. In this paper, we consider only a two-layer neural network model, whereas the techniques adopted here can be directly extended to deeper neural network models. We consider a teacher-student setting; that is, the true function $f^o$ can be represented by a neural network defined as follows. For $w \in \mathbb{R}$, let $\bar{w}$ be a "clipping" of $w$ defined as $\bar{w} := R \tanh(w/R)$, where $R \ge 1$ is a fixed constant, and let $[x; 1] := [x^\top, 1]^\top$ for $x \in \mathbb{R}^d$. Then, the teacher network is given by
$$f_W(x) = \sum_{m=1}^\infty a_m \bar{w}_{2,m}\, \sigma_m(w_{1,m}^\top [x; 1]),$$
where $w_{1,m} \in \mathbb{R}^{d+1}$ and $w_{2,m} \in \mathbb{R}$ ($m \in \mathbb{N}$) are the trainable parameters (with $W = (w_{1,m}, w_{2,m})_{m=1}^\infty$), $a_m \in \mathbb{R}$ ($m \in \mathbb{N}$) is a fixed scaling parameter, and $\sigma_m : \mathbb{R} \to \mathbb{R}$ is the activation function of the $m$-th node. The clipping operation on the second-layer parameters is applied only for a technical reason, to ensure convergence of the Langevin dynamics; the dynamics is bounded with high probability in practical situations, and the boundedness condition could be removed given further theoretical development of infinite dimensional Langevin dynamics. Let $\mathcal{H}$ be the set of parameters $W$ with bounded squared norm: $\mathcal{H} := \{W = (w_{1,m}, w_{2,m})_{m=1}^\infty \mid \sum_{m=1}^\infty (\|w_{1,m}\|^2 + w_{2,m}^2) < \infty\}$, and define $\|W\|_{\mathcal{H}} := [\sum_{m=1}^\infty (\|w_{1,m}\|^2 + w_{2,m}^2)]^{1/2}$ for $W \in \mathcal{H}$. Let $(\mu_m)_{m=1}^\infty$ be a sequence of regularization parameters such that $\mu_m \downarrow 0$. Accordingly, for a given $\gamma > 0$, we define $\mathcal{H}^\gamma := \{W \in \mathcal{H} \mid \|W\|_{\mathcal{H}^\gamma} < \infty\}$, where $\|W\|_{\mathcal{H}^\gamma} := [\sum_{m=1}^\infty \mu_m^{-\gamma} (\|w_{1,m}\|^2 + w_{2,m}^2)]^{1/2}$. Throughout this paper, we analyze an estimation problem in which the true function is included in the following model:
$$\mathcal{F}_\gamma = \{f_W \mid W \in \mathcal{H}^\gamma,\ \|W\|_{\mathcal{H}^\gamma} \le 1\}.$$
This is basically a two-layer neural network with infinite width. As assumed later, $a_m$ decreases as $m \to \infty$; its decay rate controls the capacity of the model.
If the first-layer parameters $(w_{1,m})_m$ are fixed, this model can be regarded as a variant of the unit ball of some reproducing kernel Hilbert space (RKHS) with basis functions $a_m \sigma_m(w_{1,m}^\top [x; 1])$. However, since the first layer $(w_{1,m})_m$ is also trainable, a significant difference between deep and kernel approaches arises. The Barron class (Barron, 1993; E et al., 2019b) is relevant to this function class: it is defined as the convex hull of $w_2 \sigma(w_1^\top [x; 1])$ with norm constraints on $(w_1, w_2)$, where $\sigma$ is an activation function. In contrast, we will impose an explicit decay rate on $a_m$, and the parameter $W$ has an $L_2$-norm constraint, which makes the model $\mathcal{F}_\gamma$ smaller than the Barron class.
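A minimal sketch of a finite truncation of the teacher model above. All concrete choices here (sigmoid activations, $\mu_m = m^{-2}$ and $a_m = \mu_m^{\alpha_1}$ as in the later assumptions, Gaussian random parameters, and the width $M$) are illustrative assumptions, not the paper's actual experimental setup:

```python
import math, random

R = 2.0        # clipping level (R >= 1)
alpha1 = 1.0   # decay exponent for a_m = mu_m^{alpha1} with mu_m = m^{-2}
M = 50         # truncation width of the infinite sum (illustrative)
d = 3

clip = lambda w: R * math.tanh(w / R)       # the "clipping" w-bar = R tanh(w/R)
sigmoid = lambda u: 1 / (1 + math.exp(-u))

rng = random.Random(0)
a = [(m + 1) ** (-2 * alpha1) for m in range(M)]   # a_m = m^{-2 alpha1}, m = 1..M
W1 = [[rng.gauss(0, 1) for _ in range(d + 1)] for _ in range(M)]
W2 = [rng.gauss(0, 1) for _ in range(M)]

def f_W(x):
    """Truncated teacher f_W(x) = sum_m a_m * clip(w_{2,m}) * sigma(w_{1,m}^T [x; 1])."""
    xb = list(x) + [1.0]  # the augmented input [x; 1]
    return sum(a[m] * clip(W2[m]) * sigmoid(sum(w * z for w, z in zip(W1[m], xb)))
               for m in range(M))

# the clipped second layer is bounded by R and |sigma| <= 1,
# so the truncated output is bounded by R * sum_m a_m
x = [0.2, 0.5, 0.9]
assert all(abs(clip(w)) <= R for w in W2)
assert abs(f_W(x)) <= R * sum(a)
```

Because $a_m$ decays polynomially, the tail of the sum is small and the truncation at width $M$ is a reasonable finite-width stand-in for the infinite-width teacher.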

3. ESTIMATORS

We consider two classes of estimators and discuss their differences: linear estimators and a deep learning estimator trained by noisy gradient descent (NGD).

Linear estimator The class of linear estimators, which we consider as a representative of "shallow" learning approaches, consists of all estimators of the form
$$\widehat{f}(x) = \sum_{i=1}^n y_i \varphi_i(x_1, \dots, x_n, x).$$
Here, $(\varphi_i)_{i=1}^n$ can be any measurable functions (that are $L_2(P_X)$-integrable so that the excess risk is well defined). Thus, they could be selected "optimally" so that the corresponding linear estimator minimizes the worst case excess risk; even for such an optimal choice, the worst case excess risk is lower bounded by our bound in Theorem 1. It should be noted that a linear estimator does not necessarily imply a "linear model." The most relevant linear estimator in the machine learning literature is kernel ridge regression:
$$\widehat{f}(x) = Y^\top (K_X + \lambda I)^{-1} k(x),$$
where $K_X = (k(x_i, x_j))_{i,j=1}^n \in \mathbb{R}^{n \times n}$, $k(x) = [k(x, x_1), \dots, k(x, x_n)]^\top \in \mathbb{R}^n$, and $Y = [y_1, \dots, y_n]^\top \in \mathbb{R}^n$ for a kernel function $k : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$. Therefore, the ridge regression estimator in the NTK regime and the random feature model are also included in the class of linear estimators, as is the solution obtained by an early stopping criterion (instead of regularization) in the NTK regime under the squared loss. Other examples include the k-NN estimator and the Nadaraya-Watson estimator. None of these trains the basis functions in a nonlinear way, which is what distinguishes them from the deep learning approach. In the nonparametric statistics literature, linear estimators have been studied for estimating a wavelet series model: Donoho et al. (1990; 1996) showed that a wavelet shrinkage estimator can outperform any linear estimator by proving suboptimality of linear estimators.
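To see concretely that kernel ridge regression has the linear-estimator form $\widehat{f}(x) = \sum_i y_i \varphi_i(x_1, \dots, x_n, x)$, the following sketch (with a hypothetical RBF kernel, bandwidth, and synthetic data, all chosen for illustration) extracts the weights $\varphi_i(x) = [(K_X + \lambda I)^{-1} k(x)]_i$ explicitly and checks linearity in the outputs:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, lam = 20, 2, 0.1
X = rng.uniform(0, 1, size=(n, d))  # inputs x_1, ..., x_n in [0, 1]^d
y = np.sin(X.sum(axis=1)) + 0.1 * rng.standard_normal(n)

def rbf(A, B, h=0.5):
    """Gaussian (RBF) kernel matrix k(a, b) = exp(-||a - b||^2 / (2 h^2))."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * h ** 2))

K = rbf(X, X)
x_new = np.array([[0.3, 0.7]])
k_x = rbf(X, x_new)[:, 0]  # the vector (k(x, x_1), ..., k(x, x_n))

# kernel ridge prediction f_hat(x) = Y^T (K_X + lam I)^{-1} k(x):
# phi depends only on the inputs (x_1, ..., x_n, x), never on the outputs y
phi = np.linalg.solve(K + lam * np.eye(n), k_x)
pred = y @ phi

# linearity in y: scaling the outputs scales the prediction by the same factor
assert np.isclose((2 * y) @ phi, 2 * pred)
```

This linearity in $(y_i)_i$, with $y$-independent weights $\varphi_i$, is exactly the property the convex-hull argument in Section 4.1 exploits.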
Suzuki (2019) utilized such an argument to show the superiority of deep learning but did not present any tractable optimization algorithm.

Noisy Gradient Descent with regularization For the neural network approach, we consider a noisy gradient descent algorithm. Basically, we minimize the following regularized empirical risk:
$$\widehat{L}(f_W) + \frac{\lambda}{2}\|W\|_{\mathcal{H}^1}^2.$$
Here, we employ the $\mathcal{H}^1$-norm as the regularizer. We note that the constant $\gamma$ controls the relative complexity of the true function $f^o$ compared with the typical solution obtained under this regularization. We define a linear operator $A$ by $\lambda \|W\|_{\mathcal{H}^1}^2 = \langle W, AW \rangle$, that is, $AW = (\lambda \mu_m^{-1} w_{1,m}, \lambda \mu_m^{-1} w_{2,m})_{m=1}^\infty$. The regularized empirical risk can be minimized by the noisy gradient descent
$$W_{k+1} = W_k - \eta \nabla\Big(\widehat{L}(f_{W_k}) + \frac{\lambda}{2}\|W_k\|_{\mathcal{H}^1}^2\Big) + \sqrt{\tfrac{2\eta}{\beta}}\,\xi_k,$$
where $\eta > 0$ is a step size and $\xi_k = (\xi_{k,(1,m)}, \xi_{k,(2,m)})_{m=1}^\infty$ is an infinite dimensional Gaussian noise, i.e., $\xi_{k,(1,m)}$ and $\xi_{k,(2,m)}$ are independently identically distributed from the standard normal distribution (Da Prato & Zabczyk, 1996). Here,
$$\nabla \widehat{L}(f_W) = \frac{1}{n}\sum_{i=1}^n 2\big(f_W(x_i) - y_i\big)\Big(\bar{w}_{2,m}\, a_m\, [x_i; 1]\,\sigma_m'(w_{1,m}^\top [x_i; 1]),\ a_m \tanh'(w_{2,m}/R)\,\sigma_m(w_{1,m}^\top [x_i; 1])\Big)_{m=1}^\infty.$$
However, since $\nabla \|W_{k-1}\|_{\mathcal{H}^1}^2$ is unbounded, which makes it difficult to show convergence, we employ the semi-implicit Euler scheme defined by
$$W_{k+1} = W_k - \eta \nabla \widehat{L}(f_{W_k}) - \eta A W_{k+1} + \sqrt{\tfrac{2\eta}{\beta}}\,\xi_k \;\Longleftrightarrow\; W_{k+1} = S_\eta\Big(W_k - \eta \nabla \widehat{L}(f_{W_k}) + \sqrt{\tfrac{2\eta}{\beta}}\,\xi_k\Big),$$
where $S_\eta := (I + \eta A)^{-1}$. It is easy to check that this is equivalent to the following update rule:
$$W_k = W_{k-1} - \eta\big(\nabla \widehat{L}(f_{W_{k-1}}) + S_\eta A W_{k-1}\big) + \sqrt{\tfrac{2\eta}{\beta}}\,\xi_{k-1}.$$
Therefore, the semi-implicit Euler scheme can be seen as a naive noisy gradient descent for minimizing the empirical risk with a slightly modified ridge regularization.
This can be interpreted as a discrete time approximation of the following infinite dimensional Langevin dynamics:
$$\mathrm{d}W_t = -\nabla\Big(\widehat{L}(f_{W_t}) + \frac{\lambda}{2}\|W_t\|_{\mathcal{H}^1}^2\Big)\mathrm{d}t + \sqrt{2/\beta}\,\mathrm{d}\xi_t, \qquad (4)$$
where $(\xi_t)_{t \ge 0}$ is the so-called cylindrical Brownian motion (see Da Prato & Zabczyk (1996) for the details). Its application and analysis for machine learning problems with non-convex objectives have recently been studied by, for example, Muzellec et al. (2020) and Suzuki (2020). The algorithm above is executed on an infinite dimensional parameter space; in practice, we must deal with a finite width network. To do so, we approximate the solution by a finite dimensional one, $W^{(M)} = (w_{1,m}, w_{2,m})_{m=1}^M$, where $M$ corresponds to the width of the network. We identify $W^{(M)}$ with the "zero-padded" infinite dimensional parameter $W = (w_{1,m}, w_{2,m})_{m=1}^\infty$ with $w_{1,m} = 0$ and $w_{2,m} = 0$ for all $m > M$, and accordingly write $f_{W^{(M)}}$ for $f_W$ with the zero-padded $W$. Then, the finite dimensional version of the update rule is given by
$$W^{(M)}_{k+1} = S^{(M)}_\eta\Big(W^{(M)}_k - \eta \nabla \widehat{L}\big(f_{W^{(M)}_k}\big) + \sqrt{\tfrac{2\eta}{\beta}}\,\xi^{(M)}_k\Big),$$
where $\xi^{(M)}_k$ is the Gaussian noise vector obtained by projecting $\xi_k$ onto the first $M$ components, and $S^{(M)}_\eta$ is obtained in the same way.
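A minimal finite-width sketch of the semi-implicit update $W^{(M)}_{k+1} = S^{(M)}_\eta\big(W^{(M)}_k - \eta \nabla \widehat{L} + \sqrt{2\eta/\beta}\,\xi^{(M)}_k\big)$. Since $A$ is diagonal with entries $\lambda/\mu_m$, the preconditioner $S_\eta = (I + \eta A)^{-1}$ is a coordinate-wise shrinkage that damps high-index (small $\mu_m$) coordinates most strongly. The parameters are flattened into one vector, the empirical-risk gradient is a zero placeholder, and all constants are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
M, eta, beta, lam = 100, 0.01, 1000.0, 0.1

mu = np.arange(1, M + 1) ** (-2.0)    # mu_m = m^{-2}
A_diag = lam / mu                     # A W = (lam / mu_m) w_m  (diagonal operator)
S_eta = 1.0 / (1.0 + eta * A_diag)    # S_eta = (I + eta A)^{-1}, also diagonal

def ngd_step(W, grad):
    """One semi-implicit Euler step of the noisy gradient descent."""
    xi = rng.standard_normal(M)       # Gaussian noise xi_k (first M components)
    return S_eta * (W - eta * grad + np.sqrt(2 * eta / beta) * xi)

W = rng.standard_normal(M)
for _ in range(200):
    W = ngd_step(W, grad=np.zeros(M))  # placeholder gradient: pure "prior" dynamics

# with zero gradient the chain samples (approximately) from the Gaussian prior
# nu_beta, whose coordinate-wise variance (beta * A_diag)^{-1} decays fast in m;
# high-index coordinates are damped far more strongly than low-index ones
assert S_eta[-1] < S_eta[0] < 1.0
assert abs(W[-1]) < 1.0
```

In a full implementation, `grad` would be the gradient $\nabla \widehat{L}(f_{W^{(M)}_k})$ of the empirical risk written out in the previous subsection; the point of the sketch is only the structure of the preconditioned, noise-injected step.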

4. CONVERGENCE RATE OF ESTIMATORS

In this section, we present the excess risk bounds for linear estimators and the deep learning estimator: a lower bound for the linear estimators and an upper bound for the deep learning approach. To obtain the results, we set up some assumptions on the model.

Assumption 1. (i) There exists a constant $c_\mu$ such that $\mu_m \le c_\mu m^{-2}$ ($m \in \mathbb{N}$). (ii) There exists $\alpha_1 > 1/2$ such that $a_m \le \mu_m^{\alpha_1}$ ($m \in \mathbb{N}$). (iii) The activation functions $(\sigma_m)_m$ are bounded as $\|\sigma_m\|_\infty \le 1$. Moreover, they are three times differentiable and their derivatives up to third order are uniformly bounded: there exists $C_\sigma$ such that $\|\sigma_m\|_{1,3} := \max\{\|\sigma_m'\|_\infty, \|\sigma_m''\|_\infty, \|\sigma_m'''\|_\infty\} \le C_\sigma$ ($\forall m \in \mathbb{N}$).

The first assumption (i) controls the strength of the regularization, and combined with the second assumption (ii) and the definition of the model $\mathcal{F}_\gamma$, the complexity of the model is controlled: if $\alpha_1$ and $\gamma$ are large, the model is less complicated. Indeed, the convergence rate of the excess risk becomes faster as these parameters grow, as seen later. The decay rate $\mu_m \le c_\mu m^{-2}$ could be generalized to $m^{-p}$ with $p > 1$, but we employ this setting for technical simplicity in ensuring convergence of the Langevin dynamics. The third assumption (iii) is satisfied by several activation functions such as the sigmoid function and the hyperbolic tangent. The assumption $\|\sigma_m\|_\infty \le 1$ could be replaced by $\|\sigma_m\|_\infty \le C$; we fix this scaling for simplicity of presentation.

4.1. MINIMAX LOWER BOUND FOR LINEAR ESTIMATORS

Here, we analyze a lower bound on the excess risk of linear estimators, and eventually show that any linear estimator suffers from the curse of dimensionality. To show this rigorously, we consider the following minimax excess risk over the class of linear estimators:
$$R^{\mathrm{lin}}(\mathcal{F}_\gamma) := \inf_{\widehat{f}:\,\text{linear}}\ \sup_{f^o \in \mathcal{F}_\gamma}\ \mathbb{E}_{D_n}\big[\|\widehat{f} - f^o\|_{L_2(P_X)}^2\big],$$
where the infimum is taken over all linear estimators and $\mathbb{E}_{D_n}[\cdot]$ is taken with respect to the training data $D_n$. This expresses the best achievable worst case error over the class of linear estimators for estimating a function in $\mathcal{F}_\gamma$. To evaluate it, we additionally assume the following condition.

Assumption 2. We assume that $\mu_m = m^{-2}$ and $a_m = \mu_m^{\alpha_1}$ ($m \in \mathbb{N}$), and hence $c_\mu = 1$. There exist a monotonically decreasing sequence $(b_m)_{m=1}^\infty$ and $s \ge 3$ such that $b_m = \mu_m^{\alpha_2}$ ($\forall m$) with $\alpha_2 > \gamma/2$ and $\sigma_m(u) = b_m^s \sigma(b_m^{-1} u)$ ($u \in \mathbb{R}$), where $\sigma$ is the sigmoid function $\sigma(u) = 1/(1 + e^{-u})$.

Intuitively, the parameter $s$ controls the "resolution" of each basis function $\sigma_m$, and the relation between the parameters $\alpha_1$ and $\alpha_2$ controls the magnitude of the coefficient of each basis $\sigma_m$. Note that the condition $s \ge 3$ ensures that $\|\sigma_m\|_{1,3}$ is uniformly bounded, and $0 < b_m \le 1$ ensures $\|\sigma_m\|_\infty \le 1$. Our main strategy for obtaining the lower bound is the so-called convex hull argument. That is, it is known that, for a function class $\mathcal{F}$, the minimax risk over the class of linear estimators is identical for $\mathcal{F}$ and for the convex hull of $\mathcal{F}$ (Hayakawa & Suzuki, 2020; Donoho et al., 1990):
$$R^{\mathrm{lin}}(\mathcal{F}) = R^{\mathrm{lin}}(\overline{\mathrm{conv}}(\mathcal{F})),$$
where $\mathrm{conv}(\mathcal{F}) = \{\sum_{i=1}^N \lambda_i f_i \mid f_i \in \mathcal{F},\ \sum_{i=1}^N \lambda_i = 1,\ \lambda_i \ge 0,\ N \in \mathbb{N}\}$ and $\overline{\mathrm{conv}}(\cdot)$ is the closure of $\mathrm{conv}(\cdot)$ with respect to the $L_2(P_X)$-norm.
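The Jensen step behind this identity can be sketched as follows (an informal sketch for a finite convex combination; see Hayakawa & Suzuki (2020) for the rigorous argument). For data generated from $f^o = \sum_j \lambda_j f_j \in \mathrm{conv}(\mathcal{F})$ with noise $(\epsilon_i)_i$, we may write $y_i = \sum_j \lambda_j (f_j(x_i) + \epsilon_i)$, so linearity of the estimator in $(y_i)_i$ gives $\widehat{f} = \sum_j \lambda_j \widehat{f}^{(j)}$, where $\widehat{f}^{(j)}$ denotes the same estimator applied to the data $(x_i, f_j(x_i) + \epsilon_i)_{i=1}^n$. Convexity of the squared norm then yields

```latex
\bigl\| \widehat{f} - f^o \bigr\|_{L_2(P_X)}^2
  = \Bigl\| \sum_{j} \lambda_j \bigl( \widehat{f}^{(j)} - f_j \bigr) \Bigr\|_{L_2(P_X)}^2
  \le \sum_{j} \lambda_j \bigl\| \widehat{f}^{(j)} - f_j \bigr\|_{L_2(P_X)}^2
  \le \max_{j} \bigl\| \widehat{f}^{(j)} - f_j \bigr\|_{L_2(P_X)}^2,
```

so the worst case risk over $\mathrm{conv}(\mathcal{F})$ does not exceed that over $\mathcal{F}$; the reverse inequality is trivial since $\mathcal{F} \subset \mathrm{conv}(\mathcal{F})$.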
Intuitively, since a linear estimator is linear in the output observations $(y_i)_{i=1}^n$, a simple application of Jensen's inequality shows that its worst case error on the convex hull of the function class $\mathcal{F}$ does not increase compared with that on the original $\mathcal{F}$ (see Hayakawa & Suzuki (2020) for the details). This indicates that linear estimators cannot distinguish the original hypothesis class $\mathcal{F}$ from its convex hull. Therefore, if the class $\mathcal{F}$ is highly non-convex, linear estimators suffer from a much slower convergence rate because the convex hull $\overline{\mathrm{conv}}(\mathcal{F})$ becomes much "fatter" than the original $\mathcal{F}$. To make use of this argument, for each sample size $n$, we pick an appropriate $m_n$ and consider the subset generated by the basis function $\sigma_{m_n}$, i.e., $\mathcal{F}^{(n)}_\gamma := \{a_{m_n} \bar{w}_{2,m_n} \sigma_{m_n}(w_{1,m_n}^\top [x; 1]) \in \mathcal{F}_\gamma\}$. Applying the convex hull argument to this set, we obtain the relation $R^{\mathrm{lin}}(\mathcal{F}_\gamma) \ge R^{\mathrm{lin}}(\mathcal{F}^{(n)}_\gamma) = R^{\mathrm{lin}}(\overline{\mathrm{conv}}(\mathcal{F}^{(n)}_\gamma))$. Since $\mathcal{F}^{(n)}_\gamma$ is highly non-convex, its convex hull $\overline{\mathrm{conv}}(\mathcal{F}^{(n)}_\gamma)$ is much larger than the original set, and thus the minimax risk over linear estimators can be much larger than that over all estimators including deep learning. More intuitively, linear estimators do not adaptively select basis functions and therefore must prepare a redundantly large class of basis functions to approximate functions in the target class. The following theorem gives the lower bound on the minimax excess risk over the class of linear estimators.

Theorem 1. Suppose that $\mathrm{Var}(\epsilon) > 0$, $P_X$ is the uniform distribution on $[0,1]^d$, and Assumption 2 is satisfied. Let $\tilde{\beta} = \frac{\alpha_1 + (s+1)\alpha_2}{\alpha_2 - \gamma/2}$. Then, for arbitrarily small $\kappa > 0$, we have
$$R^{\mathrm{lin}}(\mathcal{F}_\gamma) \gtrsim n^{-\frac{2\tilde{\beta}+d}{2\tilde{\beta}+2d}}\, n^{-\kappa}. \qquad (5)$$
The proof is given in Appendix A. We utilized the Irie-Miyake integral representation (Irie & Miyake, 1988; Hornik et al., 1990) to show there exists a "complicated" function in the convex hull, and then adopted the technique of Zhang et al.
(2002) to show the lower bound. The lower bound is characterized by the decay rate ($\alpha_1$) of $a_m$ relative to that ($\alpha_2$) of the scaling factor $b_m$: the faster $a_m$ decays as $m$ increases, the faster the minimax lower bound decays. We can see that the minimax rate of linear estimators is quite sensitive to the dimension $d$. Indeed, for relatively high dimensional settings, this lower bound becomes close to the slow rate $\Omega(1/\sqrt{n})$, which corresponds to the curse of dimensionality. It has been pointed out that the sample complexity of kernel methods suffers from the curse of dimensionality while deep learning can avoid it with tractable algorithms (e.g., Ghorbani et al. (2019); Bach (2017)). Among them, Ghorbani et al. (2019) showed that if the dimensionality $d$ grows polynomially in $n$, then the excess risk of kernel methods is bounded away from 0 for all $n$. In contrast, our analysis applies to any linear estimator, including kernel methods, and shows that even when the dimensionality $d$ is fixed, the convergence rate of their excess risk suffers from the curse of dimensionality. This is made possible by a careful analysis of the rate of convergence. Bach (2017) derived an upper bound on the Rademacher complexity of the unit ball of the RKHS corresponding to a neural network model; however, it is only an upper bound, and there remains a large gap to excess risk estimates. Allen-Zhu & Li (2019; 2020), Bai & Lee (2020), and Chen et al. (2020) also analyzed lower bounds on the sample complexity of kernel methods, but their lower bounds are not for the excess risk under the squared loss. Consequently, the sample complexities of all methods including deep learning take the form $O(C/\sqrt{n})$, and only the dependency of the coefficient $C$ on the dimensionality or other factors, such as the magnitude of residual components, is compared.
On the other hand, our lower bound properly exploits the properties of the squared loss, such as strong convexity and smoothness, and shows that the curse of dimensionality appears in the rate of convergence itself rather than just in the coefficient. Finally, we point out that several existing works (e.g., Ghorbani et al. (2019); Allen-Zhu & Li (2019)) considered a situation where the target function class changes as the sample size $n$ increases. In contrast, our analysis reveals that the separation between deep and shallow methods occurs even when the target function class $\mathcal{F}_\gamma$ is fixed.

4.2. UPPER BOUND FOR DEEP LEARNING

Here, we analyze the excess risk of deep learning trained by NGD together with its algorithmic convergence rate. Our analysis relies heavily on the weak convergence of the discrete time gradient Langevin dynamics to the stationary distribution of the continuous time dynamics (Eq. (4)). Under some assumptions, the continuous time dynamics has a stationary distribution (Da Prato & Zabczyk, 1992; Maslowski, 1989; Sowers, 1992; Jacquot & Royer, 1995; Shardlow, 1999; Hairer, 2002). If we denote by $\pi_\infty$ the probability measure on $\mathcal{H}$ corresponding to the stationary distribution, then it satisfies
$$\frac{\mathrm{d}\pi_\infty}{\mathrm{d}\nu_\beta}(W) \propto \exp\big(-\beta \widehat{L}(f_W)\big),$$
where $\nu_\beta$ is the Gaussian measure on $\mathcal{H}$ with mean 0 and covariance $(\beta A)^{-1}$ (see Da Prato & Zabczyk (1996) for the rigorous definition of a Gaussian measure on a Hilbert space). Remarkably, $\pi_\infty$ can be seen as the Bayes posterior for the prior distribution $\nu_\beta$ and the "likelihood" $\exp(-\beta \widehat{L}(f_W))$. Through this viewpoint, we can obtain an excess risk bound for the solution $W_k$. The proofs of all theorems in this section are given in Appendix B. Under Assumption 1, the distribution of $W_k$ generated by the discrete time gradient Langevin dynamics converges weakly to the stationary distribution $\pi_\infty$, as follows; this convergence rate analysis builds on the techniques of Bréhier & Kopec (2016) and Muzellec et al. (2020).

Proposition 1. Assume Assumption 1 holds and $\beta > \eta$. Then, there exist spectral gaps $\Lambda^*_\eta$ and $\Lambda^*_0$ (defined in Eq. (10) of Appendix B.1) and a constant $C_0$ such that, for any $0 < a < 1/4$, the following convergence bound holds for almost sure observation $D_n$:
$$\Big|\mathbb{E}_{W_k}\big[L(f_{W_k}) \mid D_n\big] - \mathbb{E}_{W \sim \pi_\infty}\big[L(f_W) \mid D_n\big]\Big| \le C_0 \exp(-\Lambda^*_\eta \eta k) + C_1 \frac{\sqrt{\beta}}{\Lambda^*_0}\,\eta^{1/2 - a} =: \Xi_k,$$
where $C_1$ is a constant depending only on $c_\mu, R, \alpha_1, C_\sigma, U, a$ (and independent of $\eta, k, \beta, \lambda, n$).
This proposition indicates that, after sufficiently many iterations $k$ with a sufficiently small step size $\eta$, the expected risk of $W_k$ is almost identical to that of the "Bayes posterior solution" drawn from $\pi_\infty$, even though $\widehat{L}(f_W)$ is not convex. The definition of $\Lambda^*_\eta$ can be found in Eq. (10); we should note that its dependency on $\beta$ is exponential. Thus, if we take $\beta = \Omega(n)$, the computational cost needed to reach a sufficiently small error could be exponential in the sample size $n$. The same convergence holds for the finite dimensional iterate $W^{(M)}_k$ with a correspondingly modified stationary distribution, and the constants appearing in the bound are independent of the model size $M$ (see the proof of Proposition 1 in Appendix B). In particular, convergence is guaranteed even when $W$ is infinite dimensional. This is quite different from the usual finite dimensional analyses (Raginsky et al., 2017; Erdogdu et al., 2018; Xu et al., 2018), which require exponential dependency on the dimension; thanks to the regularization term, we obtain a model size independent convergence rate. Xu et al. (2018) also analyzed finite dimensional gradient Langevin dynamics and obtained a similar bound in which $O(\eta)$ appears in place of the second term $\eta^{1/2-a}$, corresponding to the time discretization error. In our setting the regularization term is $\|W\|_{\mathcal{H}^1}^2 = \sum_m (\|w_{1,m}\|^2 + w_{2,m}^2)/\mu_m$ with $\mu_m \simeq m^{-2}$, but if we instead employed $\|W\|_{\mathcal{H}^{p/2}}^2 = \sum_m (\|w_{1,m}\|^2 + w_{2,m}^2)/\mu_m^{p/2}$ for $p > 1$, the time discretization error term would become $\eta^{(p-1)/p - a}$ (Andersson et al., 2016). We can interpret the finite dimensional setting as the limit $p \to \infty$, which gives $\eta^{(p-1)/p} \to \eta$ and recovers the finite dimensional result $O(\eta)$ of Xu et al. (2018). In addition to the above algorithmic convergence, we also have the following excess risk bound for the finite dimensional solution $W^{(M)}_k$.

Theorem 2.
Assume Assumption 1 holds, $\eta < \beta \le \min\{n/(2U^2), n\}$, and $0 < \gamma < 1/2 + \alpha_1$. Then, if the width satisfies $M \ge \min\big\{\lambda^{\frac{1}{4\gamma(\alpha_1+1)}} \beta^{\frac{1}{2\gamma}},\ \lambda^{-\frac{1}{2(\alpha_1+1)}},\ n^{\frac{1}{2\gamma}}\big\}$, the expected excess risk of $W^{(M)}_k$ is bounded as
$$\mathbb{E}_{D_n}\mathbb{E}_{W^{(M)}_k}\big[\|f_{W^{(M)}_k} - f^o\|_{L_2(P_X)}^2 \mid D_n\big] \le C \max\Big\{(\lambda\beta)^{\frac{1/\gamma}{1+1/2\gamma}}\, n^{-\frac{1}{1+1/2\gamma}},\ \lambda^{-\frac{1}{2(\alpha_1+1)}}\beta^{-1},\ \lambda^{\frac{\gamma}{1+\alpha_1}}\Big\} + \Xi_k,$$
where $C$ is a constant independent of $n, \beta, \lambda, \eta, k$. In particular, if we set $\beta = \min\{n/(2U^2), n\}$ and $\lambda = \beta^{-1}$, then for $M \ge n^{\frac{1}{2(\alpha_1+1)}}$ we obtain
$$\mathbb{E}_{D_n}\mathbb{E}_{W^{(M)}_k}\big[\|f_{W^{(M)}_k} - f^o\|_{L_2(P_X)}^2 \mid D_n\big] \lesssim n^{-\frac{\gamma}{\alpha_1+1}} + \Xi_k.$$

In addition to this theorem, if we further assume Assumption 2, we obtain the following refined bound.

Corollary 1. Assume Assumptions 1 and 2 hold and $\eta < \beta$, and let $\beta = \min\{n/(2U^2), n\}$ and $\lambda = \beta^{-1}$. Suppose there exists $0 \le q \le s - 3$ such that $0 < \gamma < 1/2 + \alpha_1 + q\alpha_2$. Then, the excess risk bound of $W^{(M)}_k$ for $M \ge n^{\frac{1}{2(\alpha_1 + q\alpha_2 + 1)}}$ can be refined as
$$\mathbb{E}_{D_n}\mathbb{E}_{W^{(M)}_k}\big[\|f_{W^{(M)}_k} - f^o\|_{L_2(P_X)}^2 \mid D_n\big] \lesssim n^{-\frac{\gamma}{\alpha_1 + q\alpha_2 + 1}} + \Xi_k. \qquad (7)$$

This theorem and corollary show that the tractable NGD algorithm achieves a fast convergence rate for the excess risk. Indeed, if $q$ is chosen so that $\gamma > (\alpha_1 + q\alpha_2 + 1)/2$, then the excess risk bound converges faster than $O(1/\sqrt{n})$. Remarkably, the convergence rate is not affected by the input dimension $d$, which creates the discrepancy from linear estimators. The bound of Theorem 2 is tightest when $\gamma$ is close to $1/2 + \alpha_1$ ($\gamma \approx 1/2 + \alpha_1 + 3\alpha_2$ for Corollary 1), and a smaller $\gamma$ yields a looser bound. This relation between $\gamma$ and $\alpha_1$ reflects misspecification of the "prior" distribution: when $\gamma$ is small, the regularization $\lambda \|W\|_{\mathcal{H}^1}^2$ is not strong enough, so the variance of the posterior distribution becomes unnecessarily large for estimating the true function $f^o \in \mathcal{F}_\gamma$. Therefore, the best achievable bound is obtained when the regularization is correctly specified.
Our fast-rate analysis is in contrast to some existing work (Allen-Zhu & Li, 2019; 2020; Li et al., 2020; Bai & Lee, 2020) that basically evaluated the (global) Rademacher complexity; we obtain a faster rate because we essentially evaluate a local Rademacher complexity instead.

4.3. COMPARISON BETWEEN LINEAR ESTIMATORS AND DEEP LEARNING

Here, we compare the convergence rates of the excess risks of linear estimators and of the neural network trained by NGD, using the bounds obtained in Theorem 1 and Corollary 1, respectively. We write the lower bound (5) of the minimax excess risk of linear estimators as $R^*_{\mathrm{lin}}$ and the excess risk of the neural network approach (7) as $R^*_{\mathrm{NN}}$. To make the discussion concise, we consider a specific situation where $s = 3$ and $\alpha_1 = \gamma = \frac{1}{4}\alpha_2$. In this case, $\tilde{\beta} = 17/3 \approx 5.667$, which gives
$$R^*_{\mathrm{lin}} \gtrsim n^{-1+\frac{d}{2\tilde{\beta}+d}} - n^{-\kappa} \approx n^{-1+\frac{d}{11.3+d}} - n^{-\kappa}.$$
On the other hand, by setting $q = 0$, we have $R^*_{\mathrm{NN}} \lesssim n^{-\frac{\alpha_1}{\alpha_1+1}} = n^{-1+\frac{1}{\alpha_1+1}}$. Thus, as long as $\alpha_1 > 11.3/d + 1 \approx 2\tilde{\beta}/d + 1$, we have $R^*_{\mathrm{lin}} \gg R^*_{\mathrm{NN}}$ and $\lim_{n\to\infty} R^*_{\mathrm{NN}}/R^*_{\mathrm{lin}} = 0$. In particular, as $d$ gets larger, $R^*_{\mathrm{lin}}$ approaches $\Omega(n^{-1/2})$, while $R^*_{\mathrm{NN}}$ is not affected by $d$ and gets close to $O(n^{-1})$ as $\alpha_1$ gets larger. Moreover, the inequality $\alpha_1 > 11.3/d + 1$ can be satisfied in a relatively low dimensional setting; for example, $d = 10$ is sufficient when $\alpha_1 = 3$. As $\alpha_1$ becomes large, the model becomes "simpler" because $(a_m)_{m=1}^\infty$ decays faster. However, linear estimators cannot take advantage of this information, whereas deep learning can. By the convex hull argument, this discrepancy stems from the non-convexity of the model. We also note that the superiority of deep learning is shown without sparse regularization, while several existing works showed favorable estimation properties of deep learning through sparsity inducing regularization (Bach, 2017; Chizat, 2019; Hayakawa & Suzuki, 2020). Our analysis indicates that sparse regularization is not necessary as long as the model has non-convex geometry; that is, sparsity is just one sufficient condition for non-convexity, not a necessary condition. The parameter setting above is merely a sufficient condition, and the lower bound $R^*_{\mathrm{lin}}$ would not be tight; the superiority of deep learning should hold in much wider situations.
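The comparison above can be checked numerically. The following sketch computes the two rate exponents under the simplified setting of this section ($s = 3$, $\gamma = \alpha_1$, $2\tilde{\beta} \approx 11.3$) and verifies that the example $d = 10$, $\alpha_1 = 3$ satisfies the sufficient condition; the function names are illustrative.

```python
# Compare the rate exponents of R*_lin (lower bound for linear estimators)
# and R*_NN (upper bound for the NGD-trained network) in the simplified
# setting of Section 4.3: s = 3, gamma = alpha1, 2*beta_tilde = 34/3.

TWO_BETA_TILDE = 34 / 3  # 2 * beta_tilde ~ 11.33

def linear_exponent(d: int) -> float:
    # R*_lin >~ n^{-(1 - d/(2*beta_tilde + d))}: exponent of the lower bound.
    return 1 - d / (TWO_BETA_TILDE + d)

def nn_exponent(alpha1: float) -> float:
    # R*_NN <~ n^{-alpha1/(alpha1+1)} (Corollary 1 with q = 0, gamma = alpha1).
    return alpha1 / (alpha1 + 1)

# Example from the text: d = 10, alpha1 = 3 satisfies alpha1 > 11.3/d + 1,
# so the network's rate exponent exceeds the linear lower-bound exponent.
d, alpha1 = 10, 3.0
print(linear_exponent(d), nn_exponent(alpha1),
      nn_exponent(alpha1) > linear_exponent(d))
```

A larger exponent means faster decay of the excess risk, so the final comparison being `True` corresponds to $R^*_{\mathrm{lin}} \gg R^*_{\mathrm{NN}}$ for large $n$.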

5. CONCLUSION

In this paper, we studied the excess risks of linear estimators, as representatives of shallow methods, and of a neural network estimator trained by a noisy gradient descent, where the model is fixed and no sparsity inducing regularization is imposed. Our analysis revealed that deep learning can outperform any linear estimator even in a relatively low dimensional setting. Essentially, the non-convexity of the model induces this difference, and the curse of dimensionality for linear estimators is a consequence of the fact that the geometry of the model becomes more "non-convex" as the input dimension gets higher. All derived bounds are fast rates because the analyses concern the excess risk with the squared loss, which made it possible to compare the rates of convergence. The fast learning rate of the deep learning approach is derived from the fact that the noisy gradient descent behaves like a Bayes estimator with a model-size independent convergence rate.

A PROOF OF THEOREM 1

We basically combine the "convex hull argument" with the minimax optimal rate analysis for linear estimators developed by Zhang et al. (2002). Zhang et al. (2002) essentially showed the following statement in their Theorem 1.

Proposition (Theorem 1 of Zhang et al. (2002)). Let $\mu$ be the Lebesgue measure. Suppose that the space $\Omega$ has an even partition $\mathcal{A}$ such that $|\mathcal{A}| = 2^K$ for an integer $K \in \mathbb{N}$, each $A \in \mathcal{A}$ has equal measure $\mu(A) = 2^{-K}$, and $\mathcal{A}$ is indeed a partition of $\Omega$, i.e., $\cup_{A\in\mathcal{A}} A = \Omega$ and $A \cap A' = \emptyset$ for $A, A' \in \mathcal{A}$ with $A \ne A'$. Then, if $K$ is chosen so that $n^{-\gamma_1} \le 2^{-K} \le n^{-\gamma_2}$ for constants $\gamma_1, \gamma_2 > 0$ independent of $n$, there exists an event $E$ such that, for a constant $\tilde{C} > 0$, $P(E) \ge 1 - o(1)$ and $|\{i \in \{1,\dots,n\} \mid x_i \in A\}| \le \tilde{C} n 2^{-K}$ $(\forall A \in \mathcal{A})$. Moreover, suppose that, for a class $\mathcal{F}^\bullet$ of functions on $\Omega$, there exists $\Delta > 0$ satisfying the following conditions:
1. There exists $F > 0$ such that, for any $A \in \mathcal{A}$, there exists $g \in \mathcal{F}^\bullet$ that satisfies $g(x) \ge \frac{1}{2}\Delta F$ for all $x \in A$.
2. There exists $\tilde{C}' > 0$ such that $\frac{1}{n}\sum_{i=1}^n g(x_i)^2 \le \tilde{C}'\Delta^2 2^{-K}$ for any $g \in \mathcal{F}^\bullet$ on the event $E$.
Then, there exists a constant $F_1$ such that at least one of the following inequalities holds for sufficiently large $n$:
$$\frac{F^2}{4F_1\tilde{C}}\,\frac{2^K}{n} \le R_{\mathrm{lin}}(\mathcal{F}^\bullet), \qquad (8a)$$
$$\frac{F^3}{32}\,\Delta^2 2^{-K} \le R_{\mathrm{lin}}(\mathcal{F}^\bullet). \qquad (8b)$$

Before we show the main assertion, we prepare some additional lemmas. For a sigmoid function $\sigma$, let
$$\mathcal{F}(\sigma)_{C,\tau} := \{x \in \mathbb{R}^d \mapsto a\sigma(\tau(w^\top x + b)) \mid |a| \le 2C,\ \|w\| \le 1,\ |b| \le 2\ (a, b \in \mathbb{R},\ w \in \mathbb{R}^d)\}$$
for $C > 0$, $\tau > 0$.

Lemma 1. Let $\psi(x) = \frac{1}{2}(\sigma(x+1) - \sigma(x-1))$ and let $\hat\psi$ be its Fourier transform: $\hat\psi(\omega) := (2\pi)^{-1}\int e^{-i\omega x}\psi(x)dx$. Let $h > 0$ and $D_w > 0$. Then, by setting $\tau = h^{-1}(2\sqrt{d}+1)D_w$ and $C = \frac{(2\sqrt{d}+1)D_w}{\pi h|\hat\psi(1)|}$, the Gaussian RBF kernel can be approximated as
$$\inf_{\check{g}\in\mathrm{conv}(\mathcal{F}(\sigma)_{C,\tau})}\ \sup_{x\in[0,1]^d}\Big|\check{g}(x) - \exp\Big(-\frac{\|x-c\|^2}{2h^2}\Big)\Big| \le \frac{4}{|2\pi\hat\psi(1)|}\Big[C_d D_w^{2(d-2)}\exp(-D_w^2/2) + \exp(-D_w)\Big]$$
for any $c \in [0,1]^d$, where $C_d$ is a constant depending only on $d$.
In particular, the right hand side is $O(\exp(-n^\kappa))$ if $D_w = n^\kappa$.

Proof of Lemma 1. Let $\psi_h(x) = \psi(h^{-1}x)$. Suppose that, for $f \in L_1(\mathbb{R}^d)$, its Fourier transform $\hat{f}(\omega) = (2\pi)^{-d}\int e^{-i\omega^\top x}f(x)dx$ $(\omega \in \mathbb{R}^d)$ satisfies the inversion formula
$$\int_{\mathbb{R}^d}\exp(i\omega^\top x)\hat{f}(\omega)d\omega = f(x) \quad \text{for every } x \in \mathbb{R}^d.$$
(If $\hat{f}$ is integrable, this inversion formula holds for almost every $x \in \mathbb{R}^d$ (Rudin, 1987); here we assume the stronger condition that it holds for every $x \in \mathbb{R}^d$.) Then the Irie-Miyake integral representation (Irie & Miyake (1988); see also the proof of Theorem 3.1 in Hornik et al. (1990)) gives
$$f(x) = \int_{a\in\mathbb{R}^d}\int_{b\in\mathbb{R}}\psi(a^\top x + b)\,d\nu(a,b) \quad (\text{a.e.}),$$
where $d\nu(a,b)$ is given by
$$d\nu(a,b) = \mathrm{Re}\Big[\frac{|\omega|^d e^{-i\omega b}}{2\pi\hat\psi(\omega)}\hat{f}(\omega a)\Big]\,da\,db$$
for any $\omega \ne 0$. Since the characteristic function of the multivariate normal distribution gives
$$\int_{\mathbb{R}^d}\exp(i\omega^\top(x-c))\underbrace{\frac{h^{d}}{\sqrt{(2\pi)^{d}}}\exp\Big(-\frac{h^2\|\omega\|^2}{2}\Big)}_{=\hat{f}(\omega)}\,d\omega = \exp\Big(-\frac{\|x-c\|^2}{2h^2}\Big) =: f(x) \quad (\forall x \in \mathbb{R}^d),$$
we have that
$$\exp\Big(-\frac{\|x-c\|^2}{2h^2}\Big) = \int_{a\in\mathbb{R}^d}\int_{b\in\mathbb{R}}\psi_h(a^\top(x-c)+b)\,\mathrm{Re}\Big[\frac{e^{-i\omega b}}{2\pi\hat\psi_h(\omega)}\Big]\frac{|\omega h|^{d}}{\sqrt{(2\pi)^{d}}}\exp\Big(-\frac{(\omega h)^2\|a\|^2}{2}\Big)dadb$$
for all $x \in \mathbb{R}^d$. Since $\psi_h(\cdot) = \psi(h^{-1}\cdot)$ and $\hat\psi_h(\cdot) = h\hat\psi(h\,\cdot)$ by definition, the right hand side is equivalent to
$$\int_{a\in\mathbb{R}^d}\int_{b\in\mathbb{R}}\psi(h^{-1}[a^\top(x-c)+b])\,\mathrm{Re}\Big[\frac{e^{-i\omega b}}{2\pi h\hat\psi(h\omega)}\Big]\frac{|\omega h|^{d}}{\sqrt{(2\pi)^{d}}}\exp\Big(-\frac{(\omega h)^2\|a\|^2}{2}\Big)dadb.$$
Here, we set $\omega = h^{-1}$. Let $N_{\sigma^2}$ be the probability measure corresponding to the multivariate normal distribution with mean $0$ and covariance $\sigma^2 I$, and let $A_D := \{w \in \mathbb{R}^d \mid \|w\| \le D\}$. Let $D_a > 0$ and $D_b = D_a(\sqrt{2d}+1)$, and define
$$f_{D_a}(x) := \frac{1}{2D_b N_1(A_{D_a})}\int_{\|a\|\le D_a,\,|b|\le D_b}\psi(h^{-1}[a^\top(x-c)+b])\,\mathrm{Re}\Big[\frac{e^{-ib/h}}{2\pi h\hat\psi(1)}\Big]\frac{1}{\sqrt{(2\pi)^{d}}}\exp\Big(-\frac{\|a\|^2}{2}\Big)dadb.$$
Then, we can see that, for any $x \in [0,1]^d$, it holds that
$$\Big|\frac{f(x)}{2D_b N_1(A_{D_a})} - f_{D_a}(x)\Big| \le \frac{1}{2D_b N_1(A_{D_a})|2\pi h\hat\psi(1)|}\Big[N_1(A_{D_a}^c)\int 2\exp(-h^{-1}|x|)dx + \int_{|b|>D_b}2\exp\big(-h^{-1}(|b|-2\sqrt{d}D_a)\big)db\Big]$$
$$\le \frac{1}{2D_b N_1(A_{D_a})|2\pi h\hat\psi(1)|}\big[4hN_1(A_{D_a}^c) + 4h\exp(-D_a)\big] = \frac{4h}{2D_b N_1(A_{D_a})|2\pi h\hat\psi(1)|}\big[C_d D_a^{2(d-2)}\exp(-D_a^2/2) + \exp(-D_a)\big],$$
where $C_d > 0$ is a constant depending only on $d$, and we used $|a^\top(x-c)+b| \ge |b| - |a^\top(x-c)| \ge |b| - 2\sqrt{d}D_a$ and $\psi(x) \le 2\exp(-|x|)$. Note that if $D_a = n^\kappa$, then the right hand side is $O(h\exp(-n^\kappa))$. Therefore, since $N_1(A_{D_a}) \le 1$, by setting $\tau = h^{-1}D_b$ and $C = \frac{D_b}{\pi h|\hat\psi(1)|}$, we have that
$$\inf_{\check{g}\in\mathrm{conv}(\mathcal{F}(\sigma)_{C,\tau})}\ \sup_{x\in[0,1]^d}\Big|\check{g}(x) - \exp\Big(-\frac{\|x-c\|^2}{2h^2}\Big)\Big| \le \frac{4}{|2\pi\hat\psi(1)|}\big[C_d D_a^{2(d-2)}\exp(-D_a^2/2) + \exp(-D_a)\big].$$
Hence, rewriting $D_w \leftarrow D_a$, we obtain the assertion. As noted above, the right hand side is $O(\exp(-n^\kappa))$ if $D_a = n^\kappa$.

Proof of Theorem 1. For a sample size $n$, we fix $m_n$ (to be determined later) and use Proposition 2 with $\mathcal{F}^\bullet = \mathrm{conv}(\mathcal{F}_\gamma^{(n)})$. By Lemma 1, there exists $g \in \mathrm{conv}(\mathcal{F}_\gamma^{(n)})$ such that
$$\Big\|\mu_{m_n}^{\alpha_1+\gamma/2+s\alpha_2}\Big(\frac{2(\sqrt{2d}+1)D_w}{\pi h|\hat\psi(1)|}\Big)^{-1}\exp\Big(-\frac{\|\cdot-c\|^2}{2h^2}\Big) - g\Big\|_\infty$$
$$\le \mu_{m_n}^{\alpha_1+\gamma/2+s\alpha_2}\Big(\frac{2(\sqrt{2d}+1)D_w}{\pi h|\hat\psi(1)|}\Big)^{-1}\frac{4}{|2\pi\hat\psi(1)|}\big[C_d D_w^{2(d-2)}\exp(-D_w^2/2) + \exp(-D_w)\big]$$
$$= \mu_{m_n}^{\alpha_1+\gamma/2+s\alpha_2}\frac{h}{(\sqrt{2d}+1)D_w}\big[C_d D_w^{2(d-2)}\exp(-D_w^2/2) + \exp(-D_w)\big].$$
We let $D_w = n^\kappa$ for any $\kappa > 0$ and choose $\mu_{m_n}$ so that $\tau = \mu_{m_n}^{-(\alpha_2-\gamma/2)} = D_w h^{-1} = h^{-1}n^\kappa$. We write
$$\Delta := \mu_{m_n}^{\alpha_1+\gamma/2+s\alpha_2}(2C)^{-1} \asymp h^{\frac{\alpha_1+s\alpha_2+\gamma/2}{\alpha_2-\gamma/2}+1}\,n^{-\kappa\big(\frac{\alpha_1+s\alpha_2+\gamma/2}{\alpha_2-\gamma/2}+1\big)}.$$
Then, it holds that
$$\Big\|\Delta\exp\Big(-\frac{\|\cdot-c\|^2}{2h^2}\Big) - g\Big\|_\infty \lesssim \Delta\exp(-n^\kappa). \qquad (9)$$
Here, we set $h = 2^{-k}$ for a positive integer $k$. Accordingly, we define a partition $\mathcal{A}$ of $\Omega$ so that each $A \in \mathcal{A}$ can be represented as $A = [2^{-k}j_1, 2^{-k}(j_1+1)] \times \cdots \times [2^{-k}j_d, 2^{-k}(j_d+1)]$ for non-negative integers $0 \le j_i \le 2^k - 1$ $(i = 1,\dots,d)$. Note that $|\mathcal{A}| = 2^{dk} = h^{-d}$. For each $A \in \mathcal{A}$, we define $c_A = (2^{-k}(j_1+1/2), \dots, 2^{-k}(j_d+1/2))$, where $(j_1,\dots
Here $(j_1,\dots,j_d)$ are the indices specifying $A = [2^{-k}j_1, 2^{-k}(j_1+1)] \times \cdots \times [2^{-k}j_d, 2^{-k}(j_d+1)]$ as above. For each $A \in \mathcal{A}$, we define $g_A \in \mathrm{conv}(\mathcal{F}_\gamma^{(n)})$ as a function satisfying Eq. (9) for $c = c_A$. Now, we apply Proposition 2 with $\mathcal{F}^\bullet = \mathrm{conv}(\mathcal{F}_\gamma^{(n)})$ and $K = dk$. Let $R^* := R_{\mathrm{lin}}(\mathrm{conv}(\mathcal{F}_\gamma^{(n)}))$. First, we can see that there exists a constant $F > 0$ such that $g_A(x) \ge F\Delta$ $(\forall x \in A)$, where we used $\exp(-n^\kappa) \ll 1$. Second, on the event $E$ introduced in the statement of Proposition 2, there exists $\tilde{C}$ such that $|\{i \in \{1,\dots,n\} \mid x_i \in A\}| \le \tilde{C}n2^{-dk}$ for all $A \in \mathcal{A}$. In this case, we can check that
$$\frac{1}{n}\sum_{i=1}^n\Big[\Delta\exp\Big(-\frac{\|x_i-c_A\|^2}{2h^2}\Big)\Big]^2 \lesssim \Delta^2 h^d = \Delta^2 2^{-kd},$$
by the uniform continuity of the Gaussian RBF. Therefore, we also have
$$\frac{1}{n}\sum_{i=1}^n g_A(x_i)^2 \le \frac{2}{n}\sum_{i=1}^n\Big[\Delta\exp\Big(-\frac{\|x_i-c_A\|^2}{2h^2}\Big)\Big]^2 + c\Delta^2\exp(-2n^\kappa) \lesssim \Delta^2(h^d + \exp(-2n^\kappa)),$$
where $c > 0$ is a constant. Thus, as long as $h$ is polynomial in $n$, i.e., $h = \Theta(n^{-a})$, the right hand side is $O(\Delta^2 h^d)$.

(iii) For any data $D_n$, $\hat{\mathcal{L}}$ is three times differentiable. Let $\nabla^3\hat{\mathcal{L}}(W)$ be the third-order derivative of $\hat{\mathcal{L}}(W)$; this can be identified with a third-order linear form, and $\nabla^3\hat{\mathcal{L}}(W)\cdot(h,k)$ denotes the Riesz representer of $l \in \mathcal{H} \mapsto \nabla^3\hat{\mathcal{L}}(W)\cdot(h,k,l)$. There exist $\alpha \in [0,1)$ and $C_\alpha \in (0,\infty)$ such that for all $W, h, k \in \mathcal{H}$,
$$\|\nabla^3\hat{\mathcal{L}}(W)\cdot(h,k)\|_{\mathcal{H}_{-\alpha}} \le C_\alpha\|h\|_{\mathcal{H}}\|k\|_{\mathcal{H}}, \qquad \|\nabla^3\hat{\mathcal{L}}(W)\cdot(h,k)\|_{\mathcal{H}} \le C_\alpha\|h\|_{\mathcal{H}_\alpha}\|k\|_{\mathcal{H}} \quad (\text{a.s.}).$$

Remark 1. In the analyses of Bréhier & Kopec (2016), Muzellec et al. (2020) and Suzuki (2020), Assumption 3-(iii) is imposed on every finite dimensional projection $\hat{\mathcal{L}}(W^{(M)})$ as a function on $\mathcal{H}^{(M)}$ for all $M \ge 1$, instead of on $\hat{\mathcal{L}}(W)$ as a function on $\mathcal{H}$. However, the condition on $\hat{\mathcal{L}}(W)$ gives a sufficient condition for any finite dimensional projection in our setting. Thus, we employ the current version.

Assumption 4. For the loss function $\ell(y, f(x)) = (y - f(x))^2$, the following conditions hold:
(i) There exists $\bar{C} > 0$ such that for any $f_W$ $(W \in \mathcal{H})$, it holds that
$$\mathbb{E}_{X,Y}[(\ell(Y, f_W(X)) - \ell(Y, f^*(X)))^2] \le \bar{C}(\mathcal{L}(f_W) - \mathcal{L}(f^*)).$$
(ii) $\beta > 0$ is chosen so that, for any $h : \mathbb{R}^d \to \mathbb{R}$ and $x \in \mathrm{supp}(P_X)$, it holds that
$$\mathbb{E}_{Y|X=x}\Big[\exp\Big(-\frac{\beta}{n}(\ell(Y, h(x)) - \ell(Y, f^*(x)))\Big)\Big] \le 1.$$
(iii) There exists $L_h > 0$ such that $\|\nabla_W\ell(Y, h_W(X)) - \nabla_W\ell(Y, h_{W'}(X))\|_{\mathcal{H}} \le L_h\|W - W'\|_{\mathcal{H}}$ $(\forall W, W' \in \mathcal{H})$ almost surely.
(iv) There exists $C_h$ such that $\|h_W - h_{W'}\|_\infty \le C_h\|W - W'\|_{\mathcal{H}}$ $(W, W' \in \mathcal{H})$.

Proposition 3. Assume Assumption 3 holds and $\beta > \eta$. Suppose that there exists $\bar{R} > 0$ such that $0 \le \ell(Y, f_W(X)) \le \bar{R}$ for any $W \in \mathcal{H}$ (a.s.). Let $\rho = \frac{1}{1+\lambda\eta/\mu_1}$ and $b = \frac{\mu_1}{\lambda}B + \frac{c_\mu}{\beta\lambda}$. Accordingly, let $\bar{b} = \max\{b, 1\}$, $\kappa = \bar{b} + 1$ and $\bar{V} = 4\bar{b}/\big(\sqrt{(1+\rho^{1/\eta})/2} - \rho^{1/\eta}\big)$. Then, the spectral gap of the dynamics is given by
$$\Lambda^*_\eta = \min\Big\{\frac{\lambda}{2\mu_1},\ \frac{1}{2}\Big\}\,\frac{\delta}{4\log(\kappa(\bar{V}+1)/(1-\delta))}, \qquad (10)$$
where $0 < \delta < 1$ is a real number satisfying $\delta = \Omega(\exp(-\Theta(\mathrm{poly}(\lambda^{-1})\beta)))$. We define $\Lambda^*_0 = \lim_{\eta\to 0}\Lambda^*_\eta$ (i.e., $\bar{V}$ is replaced by $4\bar{b}/\big(\sqrt{(1+\exp(-\lambda/\mu_1))/2} - \exp(-\lambda/\mu_1)\big)$). We also define $C_{W_0} = \kappa[\bar{V}+1] + \frac{\sqrt{2}(\bar{R}+b)}{\sqrt{\delta}}$. Then, for any $0 < a < 1/4$, the following convergence bound holds for almost sure observation $D_n$: for either $L = \mathcal{L}$ or $L = \hat{\mathcal{L}}$,
$$|\mathbb{E}_{W_k}[L(W_k)|D_n] - \mathbb{E}_{W\sim\pi_\infty}[L(W)|D_n]| \le C_1\Big[C_{W_0}\exp(-\Lambda^*_\eta\eta k) + \frac{\sqrt{\beta}}{\Lambda^*_0}\eta^{1/2-a}\Big] =: \Xi_k, \qquad (11)$$
where $C_1$ is a constant depending only on $c_\mu, B, L_h, C_\alpha, a, \bar{R}$ (independent of $\eta, k, \beta, \lambda$).

Proposition 4. Assume that Assumptions 3 and 4 hold. Let $\tilde{\alpha} := 1/\{2(\alpha+1)\}$ for a given $\alpha > 0$, and let $\theta$ be an arbitrary real number satisfying $0 < \theta < 1 - \tilde{\alpha}$. Assume that the true function $f^o$ can be represented by $h_{W^*} = f^o$ for $W^* \in \mathcal{H}_{\theta(\alpha+1)}$. Then, if $M \ge \min\{\lambda^{\tilde\alpha/[2\theta(\alpha+1)]}\beta^{1/[2\theta(\alpha+1)]},\ \lambda^{-1/[2(\alpha+1)]},\ n^{1/[2\theta(\alpha+1)]}\}$, the expected excess risk is bounded by
$$\mathbb{E}_{D_n}\mathbb{E}_{W^{(M)}_k}[\mathcal{L}(h_{T_M^{\alpha/2}W^{(M)}_k})|D_n] - \mathcal{L}(f^o) \le C\max\Big\{(\lambda\beta)^{\frac{2\tilde\alpha/\theta}{1+\tilde\alpha/\theta}}n^{-\frac{1}{1+\tilde\alpha/\theta}},\ \lambda^{-\tilde\alpha}\beta^{-1},\ \lambda^\theta,\ 1/n\Big\} + \Xi_k, \qquad (13)$$
where $C$ is a constant independent of $n, \beta, \lambda, \eta, k$.

Proof. Repeating the same argument as in Proposition 1 and using the same notation, Proposition 3 gives
$$|\mathbb{E}_{W^{(M)}_k}[\hat{\mathcal{L}}(W^{(M)}_k)|D_n] - \mathbb{E}_{W\sim\pi^{(M)}_\infty}[\hat{\mathcal{L}}(W)|D_n]| \le \Xi_k,$$
since the finite dimensional dynamics can be seen as a specific case of the infinite dimensional one.
Actually, the dynamics of $W^{(M)}_k$ is the same as that of $\iota(\tilde{W}_k)$, where $\tilde{W}_k \in \mathcal{H}$ obeys the following dynamics:
$$\tilde{W}_{k+1} = S_\eta\Big(\tilde{W}_k - \eta\nabla\hat{\mathcal{L}}(f_{\iota(\tilde{W}_k)}) + \sqrt{\frac{2\eta}{\beta}}\xi_k\Big).$$
This is because $f_{\iota(\tilde{W}_k)}$ is determined only by the first $M$ components $\iota(\tilde{W}_k)$, $\iota(\nabla\hat{\mathcal{L}}(f_{\iota(\tilde{W}_k)})) = \nabla_{W^{(M)}}\hat{\mathcal{L}}(f_W)|_{W^{(M)}=\iota(\tilde{W}_k)}$, and $S_\eta$ is a diagonal operator. Since the components of $\tilde{W}_k$ with indices higher than $M$ do not affect the objective, smoothness of the objective is not lost. The stationary distribution $\pi^{(M)}_\infty$ of the continuous dynamics corresponding to $W^{(M)}$ is a probability measure on $\mathcal{H}^{(M)}$ that satisfies $\frac{d\pi^{(M)}_\infty}{d\nu^{(M)}_\beta}(W^{(M)}) \propto \exp(-\beta\hat{\mathcal{L}}(f_{W^{(M)}}))$, where $\nu^{(M)}_\beta$ is the Gaussian measure on $\mathbb{R}^{M\times(d+2)}$ with mean $0$ and covariance $(\beta A^{(M)})^{-1}$. We can see that this is the marginal distribution of the stationary distribution of the continuous time counterpart of $\tilde{W}_k$: $d\tilde\pi_\infty(\tilde{W}) \propto \exp(-\beta\hat{\mathcal{L}}(f_{\iota(\tilde{W})}))d\nu_\beta$. Therefore, we just need to consider the infinite dimensional dynamics. For this reason, we show the convergence for the original infinite dimensional dynamics $(W_k)_{k=1}^\infty$; the convergence of the finite dimensional one $(W^{(M)}_k)_{k=1}^\infty$ can be shown in the same manner using the argument above. To show Proposition 1, we use Proposition 3. To do so, we need to check the validity of Assumption 3. The boundedness of the gradient follows (since $|f_W(x_i) - y_i| \le \bar{R}$, $\|\sigma_m\|_\infty \le C_\sigma$ and $\|\tanh'\|_\infty \le 1$) from
$$\|\nabla\hat{\mathcal{L}}(f_W)\|_{\mathcal{H}}^2 \le 4\bar{R}\big[\bar{R}^2 C_\sigma^2(d+1) + 1\big]\sum_{m=1}^\infty a_m^2 < \infty.$$
Similarly, we can show the Lipschitz continuity of the gradient:
$$\|\nabla\hat{\mathcal{L}}(f_W) - \nabla\hat{\mathcal{L}}(f_{W'})\|_{\mathcal{H}}^2 \le \sum_{m=1}^\infty \mu_m^{-2\alpha_1}[\cdots] \lesssim \|W - W'\|_{\mathcal{H}_{-\alpha_1}}^2.$$
We can also verify Assumption 3-(iii) in a similar way. We have thus verified Assumption 3; therefore, we may apply Proposition 3, and we obtain Proposition 1. Next, we show Theorem 2 by using Proposition 4. For that purpose, we need to verify Assumption 4.
The first condition can be verified as follows. Since $Y = f^o(X) + \epsilon$ with $\mathbb{E}[\epsilon|X] = 0$ and $|\epsilon| \le U$,
$$\mathbb{E}_{X,Y}[((Y - f_W(X))^2 - (Y - f^o(X))^2)^2] = \mathbb{E}_{X,\epsilon}[((f^o(X) + \epsilon - f_W(X))^2 - \epsilon^2)^2] = \mathbb{E}_{X,\epsilon}[((f^o(X) - f_W(X))^2 + 2\epsilon(f^o(X) - f_W(X)))^2]$$
$$= \mathbb{E}_{X,\epsilon}[(f^o(X) - f_W(X))^4 + 4\epsilon(f^o(X) - f_W(X))^3 + 4\epsilon^2(f^o(X) - f_W(X))^2]$$
$$\le \big(\|f^o - f_W\|_\infty^2 + 4U^2\big)\mathbb{E}_X[(f^o(X) - f_W(X))^2] \lesssim \bar{C}\,\mathbb{E}_X[(f^o(X) - f_W(X))^2] = \bar{C}(\mathcal{L}(f_W) - \mathcal{L}(f^o)),$$
where the cross term vanishes because $\mathbb{E}[\epsilon|X] = 0$. The second condition can be checked as follows. Note that
$$\mathbb{E}_{Y|X=x}\Big[\exp\Big(-\frac{\beta}{n}[(Y - f_W(x))^2 - (Y - f^o(x))^2]\Big)\Big] = \mathbb{E}_\epsilon\Big[\exp\Big(-\frac{\beta}{n}\big[(f^o(x) - f_W(x))^2 - 2\epsilon(f_W(x) - f^o(x))\big]\Big)\Big]$$
$$= \exp\Big(-\frac{\beta}{n}(f^o(x) - f_W(x))^2\Big)\mathbb{E}_\epsilon\Big[\exp\Big(\frac{2\beta}{n}\epsilon(f_W(x) - f^o(x))\Big)\Big] \le \exp\Big(-\frac{\beta}{n}(f^o(x) - f_W(x))^2\Big)\exp\Big(\frac{1}{8}\frac{4\beta^2}{n^2}4U^2(f_W(x) - f^o(x))^2\Big).$$
Thus, under the condition $\beta \le n/(2U^2)$, the right hand side can be upper bounded by
$$\exp\Big(-\frac{\beta}{n}\Big[1 - \frac{2U^2\beta}{n}\Big](f_W(x) - f^o(x))^2\Big) \le 1.$$
Next, we check the third and fourth conditions. Noting that
$$\|\nabla_W h_W(X)\|_{\mathcal{H}}^2 \le \sum_{m=1}^\infty a_m^2\mu_m^{-\alpha}\big[(d+1)\bar{R}^2 C_\sigma^2 + 1\big],$$
we also have
$$|h_W(X) - h_{W'}(X)|^2 \le c_\mu^{2\alpha_1}\max_m\{\mu_m^{2(\alpha_1-\alpha)}\}\big[(d+1)(1+\bar{R}^2) + 1/\bar{R}^2 + C_\sigma^2(d+1)\big]\|W - W'\|_{\mathcal{H}}^2 =: C_1^2\|W - W'\|_{\mathcal{H}}^2,$$
which yields the fourth condition, and
$$\|\nabla_W\ell(Y, h_W(X)) - \nabla_W\ell(Y, h_{W'}(X))\|_{\mathcal{H}}^2 = \|2(h_W(X) - Y)\nabla_W h_W(X) - 2(h_{W'}(X) - Y)\nabla_W h_{W'}(X)\|_{\mathcal{H}}^2$$
$$\le 2\|2(h_W(X) - Y)(\nabla_W h_W(X) - \nabla_W h_{W'}(X))\|_{\mathcal{H}}^2 + 2\|2(h_W(X) - h_{W'}(X))\nabla_W h_{W'}(X)\|_{\mathcal{H}}^2 \le 8\bar{R}C_2\|W - W'\|_{\mathcal{H}}^2 + 8C_1^2 C_2\|W - W'\|_{\mathcal{H}}^2,$$
which yields the third condition. Since $f^o \in \mathcal{F}_\gamma$, there exists $W^* \in \mathcal{H}_\gamma$ such that $f^o = f_{W^*}$. Therefore, applying Proposition 4 with $\alpha = \alpha_1$ ($\tilde\alpha = 1/[2(\alpha_1+1)]$) and $\theta = \gamma/(1+\alpha_1)$ (since $\gamma < 1/2 + \alpha_1$, the condition $\theta < 1 - \tilde\alpha$ is satisfied), we obtain that for $M \ge \min\{\lambda^{1/(4\gamma(\alpha_1+1))}\beta^{1/(2\gamma)},\ \lambda^{-1/(2(\alpha_1+1))},\ n^{1/(2\gamma)}\}$, the following excess risk bound holds:
$$\mathbb{E}_{D_n}\mathbb{E}_{W^{(M)}_k}[\mathcal{L}(W^{(M)}_k)|D_n] - \mathcal{L}(f^*) \lesssim \max\Big\{(\lambda\beta)^{\frac{2\tilde\alpha/\theta}{1+\tilde\alpha/\theta}}n^{-\frac{1}{1+\tilde\alpha/\theta}},\ \lambda^{-\tilde\alpha}\beta^{-1},\ \lambda^\theta,\ 1/n\Big\} + \Xi_k.$$
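The exponential-moment bound used in the verification of the second condition above is Hoeffding's lemma applied to the mean-zero bounded noise $\epsilon \in [-U, U]$:

```latex
% Hoeffding's lemma: for a mean-zero random variable \epsilon with
% |\epsilon| \le U and any t \in \mathbb{R},
\mathbb{E}\big[\exp(t\epsilon)\big]
  \le \exp\!\Big(\frac{t^2 (2U)^2}{8}\Big),
\qquad \text{with } t = \frac{2\beta}{n}\,\big(f_W(x) - f^o(x)\big),
```

which produces exactly the factor $\exp\big(\frac{1}{8}\frac{4\beta^2}{n^2}4U^2(f_W(x)-f^o(x))^2\big)$ in the display, since $t^2(2U)^2/8 = \frac{1}{8}\cdot\frac{4\beta^2}{n^2}(f_W(x)-f^o(x))^2\cdot 4U^2$.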
Finally, by noting that $\mathcal{L}(W^{(M)}_k) - \mathcal{L}(f^*) = \|f_{W^{(M)}_k} - f^*\|_{L_2(P_X)}^2$, we obtain the assertion.

We now give the proof of Corollary 1.

Proof of Corollary 1. Note that $f_W(x) = \sum_{m=1}^\infty a_m\bar{w}_{2,m}\sigma_m(w_{1,m}^\top[x; 1])$.



If we set $\bar{w}_{2,m_n} = \bar{b}\mu_{m_n}^{\gamma/2}$ with $|\bar{b}| \le 1$ and $\bar{w}_{1,m_n} = \mu_{m_n}^{\gamma/2}[u; -u^\top c]/\sqrt{2(d+1)}$ for $u \in \mathbb{R}^d$ with $\|u\| \le 1$ and $c \in [0,1]^d$, then
$$\|(\bar{w}_{1,m_n}, \bar{w}_{2,m_n})\|^2 \le \mu_{m_n}^\gamma\Big(\frac{1}{2} + \frac{1 + |u^\top c|^2}{2(d+1)}\Big) \le \mu_{m_n}^\gamma.$$
Therefore, $\bar\phi_{u,c}(x) = a_{m_n}\bar{w}_{2,m_n}\sigma_{m_n}(\bar{w}_{1,m_n}^\top[x; 1])$ is contained in the model class, for all $\bar{b} \in \mathbb{R}$ with $|\bar{b}| \le 1$, $u \in \mathbb{R}^d$ with $\|u\| \le 1$, and $c \in [0,1]^d$. In other words, with $C = \mu_{m_n}^{\alpha_1}D_w/(\pi h|\hat\psi(1)|)$ for $D_w > 0$, Lemma 1 yields that for any $c \in [0,1]^d$ and given $h > 0$, there exists $g \in \mathrm{conv}(\mathcal{F}_\gamma^{(n)})$ such that

First, we check Assumption 3. Assumption 3-(i) is ensured by Assumption 1. Next, we check Assumption 3-(ii). The boundedness of the gradient can be shown by bounding its components with respect to $w_{1,m}$ and $w_{2,m}$, which take the forms
$$(f_W(x_i) - y_i)\,\bar{w}_{2,m}a_m[x_i; 1]\,\sigma_m'(w_{1,m}^\top[x_i; 1]) \quad\text{and}\quad (f_W(x_i) - y_i)\,a_m\tanh'(w_{2,m}/\bar{R})\,\sigma_m(w_{1,m}^\top[x_i; 1]),$$
respectively.

$$\big[(w_{2,m} - w'_{2,m})^2 + \bar{R}^2\|w_{1,m} - w'_{1,m}\|^2\big] + 4\bar{R}a_m^2\big[(w_{2,m} - w'_{2,m})^2/\bar{R}^2 + C_\sigma^2(d+1)\|w_{1,m} - w'_{1,m}\|^2\big] \lesssim (w_{2,m} - w'_{2,m})^2 + \|w_{1,m} - w'_{1,m}\|^2.$$

ACKNOWLEDGMENTS

TS was partially supported by JSPS Kakenhi (18K19793, 18H03201, and 20H00576), Japan Digital Design and JST-CREST.


Published as a conference paper at ICLR 2021

Now, if we write $\tilde{\beta}$ for the exponent satisfying $\Delta \asymp h^{\tilde{\beta}}n^{-\kappa\tilde{\beta}}$ (which holds by the definition of $\Delta$), we proceed as follows. We choose $k$ as the maximum integer that satisfies $\frac{F^3}{32}\Delta^2 2^{-dk} > R^*$. In this situation, it holds that $h^{2\tilde{\beta}+d} \asymp n^{-2\kappa\tilde{\beta}}R^*$. Since Eq. (8b) is not satisfied, Eq. (8a) must hold, and hence we obtain the stated lower bound on $R^*$. This gives the assertion.

B PROOFS OF PROPOSITION 1, THEOREM 2 AND COROLLARY 1

Proposition 1, Theorem 2 and Corollary 1 can be shown by using Propositions 3 and 4, given in Appendix B.1 below. Let $T^{\alpha/2}$ denote the diagonal operator with entries $(\mu_m^{\alpha/2})_{m=1}^\infty$ for $\alpha > 0$, and let us consider the model $h_W := f_{T^{-\alpha/2}W}$; the training error can be rewritten accordingly. For notational simplicity, we let $\hat{\mathcal{L}}(W) := \hat{L}(f_W)$. We write $\iota(\cdot)$ for the map that extracts the first $M$ components; by abuse of notation, we apply it to both parameters and gradients.

B.1 AUXILIARY LEMMAS

First, we show some key propositions used to prove the main results. To do so, we utilize the results of Muzellec et al. (2020) and Suzuki (2020).

Assumption 3.

(i) There exists a constant $c_\mu$ such that $\mu_m \le c_\mu m^{-2}$.
(ii) There exist $B, U > 0$ such that the following two inequalities (moment bounds on the gradient) hold for some $a \in (1/4, 1)$ almost surely, for any $1 \le M \le \infty$.

Therefore, we just need to bound the corresponding quantity. For $a > 0$, we define $\mathcal{H}^{(M)}_a$ to be the projection of $\mathcal{H}_a$ onto the first $M$ components. Let the concentration function $\phi_{W^*}$ be defined as usual, where, if there does not exist $W \in \mathcal{H}_{\alpha+1}$ satisfying the condition in the infimum, we set $\phi_{W^*} = \infty$. Then, Suzuki (2020) showed the bound (13). They also showed that, for $M = \infty$, it holds that
$$\epsilon^{*2} \lesssim \max\big\{(\lambda\beta)^{\frac{2\tilde\alpha/\theta}{1+\tilde\alpha/\theta}}n^{-\frac{1}{1+\tilde\alpha/\theta}},\ \lambda^{-\tilde\alpha}\beta^{-1},\ \lambda^\theta,\ 1/n\big\}.$$
Substituting this bound on $\epsilon^*$ into Eq. (14), we obtain Eq. (13) for $M = \infty$. Moreover, in their proof, if $M \ge (\epsilon^*)^{-1/[\theta(\alpha+1)]}$, then the rate of $\epsilon^*$ is not deteriorated compared with $M = \infty$. In other words, if $M \ge (\epsilon^*)^{-1/[\theta(\alpha+1)]}$, the bound (13) holds.

Remark 2. Suzuki (2020) showed Proposition 4 under the condition $\alpha > 1/2$. However, this condition is used only to ensure Assumption 3. In our setting, we can show Assumption 3 holds directly, and thus we may omit the condition $\alpha > 1/2$.

B.2 PROOFS OF PROPOSITION 1, THEOREM 2 AND COROLLARY 1

Here, we give the proofs of Proposition 1 and Theorem 2 simultaneously.

Proof of Proposition 1 and Theorem 2. Let $\bar{R} = (2\sum_{m=1}^\infty a_m R + U)^2$. Then, we can easily check that $(y_i - f_W(x_i))^2 \le \bar{R}$. As stated above, we use Propositions 3 and 4 to show the statements. First, we show Proposition 1 for the dynamics of $W^{(M)}_k$ for any $1 \le M \le \infty$. However, it suffices to show the statement only for $M = \infty$, because the finite dimensional version can be seen as a specific case of the infinite dimensional one.

