BENEFIT OF DEEP LEARNING WITH NON-CONVEX NOISY GRADIENT DESCENT: PROVABLE EXCESS RISK BOUND AND SUPERIORITY TO KERNEL METHODS

Abstract

Establishing a theoretical analysis that explains why deep learning can outperform shallow learning such as kernel methods is one of the biggest issues in the deep learning literature. Towards answering this question, we evaluate the excess risk of a deep learning estimator trained by noisy gradient descent with ridge regularization on a mildly overparameterized neural network, and discuss its superiority to a class of linear estimators that includes the neural tangent kernel approach, the random feature model, other kernel methods, the k-NN estimator, and so on. We consider a teacher-student regression model and show that any linear estimator can be outperformed by deep learning in the sense of the minimax optimal rate, especially in high-dimensional settings. The obtained excess risk bounds are so-called fast learning rates, faster than the O(1/√n) rate obtained by the usual Rademacher complexity analysis. This discrepancy is induced by the non-convex geometry of the model; the noisy gradient descent used for neural network training provably reaches a near-global optimal solution even though the loss landscape is highly non-convex. Although the noisy gradient descent does not employ any explicit or implicit sparsity-inducing regularization, it achieves a generalization performance that dominates that of linear estimators.

1. INTRODUCTION

In the deep learning theory literature, clarifying the mechanism by which deep learning can outperform shallow approaches has long attracted considerable attention. In particular, it is quite important to show that a tractable algorithm for deep learning can provably achieve better generalization performance than shallow methods. Towards that goal, we study the rate of convergence of the excess risk of both deep and shallow methods in a nonparametric regression setting. One of the difficulties in showing the generalization ability of deep learning under a given optimization method is that the solution is likely to get stuck at a bad local minimum, which prevents us from showing its favorable performance. Recent studies tackled this problem by considering optimization on overparameterized networks, as in the neural tangent kernel (NTK) (Jacot et al., 2018; Du et al., 2019a) and mean field analyses (Nitanda & Suzuki, 2017; Chizat & Bach, 2018; Rotskoff & Vanden-Eijnden, 2018; 2019; Mei et al., 2018; 2019), or by analyzing noisy gradient descent such as stochastic gradient Langevin dynamics (SGLD) (Welling & Teh, 2011; Raginsky et al., 2017; Erdogdu et al., 2018). The NTK analysis deals with a relatively large-scale initialization so that the model is well approximated by the tangent space at the initial solution, and eventually all analyses reduce to those of kernel methods (Jacot et al., 2018; Du et al., 2019b; Allen-Zhu et al., 2019; Du et al., 2019a; Arora et al., 2019; Cao & Gu, 2019; Zou et al., 2020). Although this regime is useful for showing global convergence, the obtained estimator loses a large advantage of deep learning approaches because its estimation ability is reduced to that of the corresponding kernel method. To overcome this issue, there are several "beyond-kernel" type analyses. For example, Allen-Zhu & Li (2019; 2020) showed the benefit of depth by analyzing ResNet-type networks. Li et al.
(2020) showed the global optimality of gradient descent by reducing the optimization problem to a tensor decomposition problem for a specific regression problem, and showed that the "ideal" estimator on a linear model has a worse dependency on the input dimensionality. Bai & Lee (2020) considered a second-order Taylor expansion and showed that the sample complexity of deep approaches has a better dependency on the input dimensionality than that of kernel methods. Chen et al. (2020) derived a similar conclusion by considering a hierarchical representation. The analyses mentioned above do show some superiority of deep learning, but all of these bounds are essentially Ω(1/√n), where n is the sample size, which is not optimal for regression problems with squared loss (Caponnetto & de Vito, 2007). The reason only such a sub-optimal rate is obtained is that the target of these analyses is mostly the Rademacher complexity of the set in which the estimators lie, which bounds the generalization gap. However, to derive a tight excess risk bound instead of a generalization gap bound, we need to evaluate the so-called local Rademacher complexity (Mendelson, 2002; Bartlett et al., 2005; Koltchinskii, 2006) (see Eq. (2) for the definition of excess risk). Moreover, some of the existing analyses must change the target function class as the sample size n increases, for example by increasing the input dimensionality with the sample size, which makes it difficult to see how the rate of convergence is affected by the choice of estimator. Another promising approach is the mean field analysis. There is also some work in this line that shows the superiority of deep learning over kernel methods. Ghorbani et al. (2019) showed that, when the dimensionality d of the input grows polynomially with n, kernel methods are outperformed by neural network approaches.
Although the setting of increasing d reflects modern high-dimensional situations well, it blurs the rate of convergence. In fact, we can show the superiority of deep learning even in a fixed-dimension setting. There are also several studies on the approximation abilities of deep and shallow models. Ghorbani et al. (2020) showed the adaptivity of kernel methods to the intrinsic dimensionality in terms of approximation error and discussed the difference between deep and kernel methods. Yehudai & Shamir (2019) showed that the random feature method requires an exponentially large number of nodes in the input dimension to obtain a good approximation of a single-neuron target function. However, these results concern only approximation errors; estimation errors are not compared. Recently, the superiority of deep learning over kernel methods has also been discussed in the nonparametric statistics literature, where the minimax optimality of deep learning in terms of excess risk is shown. In particular, it has been shown that deep learning achieves a better rate of convergence than linear estimators in several settings (Schmidt-Hieber, 2020; Suzuki, 2019; Imaizumi & Fukumizu, 2019; Suzuki & Nitanda, 2019; Hayakawa & Suzuki, 2020). Here, the linear estimators form a general class of estimators that includes kernel ridge regression, k-NN regression, and the Nadaraya-Watson estimator. Although these analyses give a clear statistical characterization of the estimation ability of deep learning, they are not compatible with tractable optimization algorithms. In this paper, we give a theoretical analysis that unifies these analyses and shows the superiority of a deep learning method trained by a tractable noisy gradient descent algorithm. We evaluate the excess risks of the deep learning approach and of linear estimators in a nonparametric regression setting, and show that the minimax optimal convergence rate of the linear estimators can be dominated by noisy gradient descent on neural networks.
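To make the object of study concrete, the noisy gradient descent referred to above (SGLD with ridge regularization on a two-layer network, in a teacher-student regression setting) can be sketched as follows. This is a minimal illustrative sketch only: the teacher function, the network width, the step size η, the ridge strength λ, and the inverse temperature β below are hypothetical choices for a toy 1-D problem, not the constants prescribed by the theory.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical teacher-student setup: 1-D regression, teacher f*(x) = sin(pi x).
n, d, width = 200, 1, 64
X = rng.uniform(-1.0, 1.0, size=(n, d))
y = np.sin(np.pi * X[:, 0]) + 0.1 * rng.standard_normal(n)

# Mildly overparameterized two-layer network f(x) = a^T tanh(W x + b).
W = 0.5 * rng.standard_normal((width, d))
b = 0.5 * rng.standard_normal(width)
a = rng.standard_normal(width) / np.sqrt(width)

def mse(W, b, a):
    return 0.5 * np.mean((np.tanh(X @ W.T + b) @ a - y) ** 2)

eta, lam, beta = 0.05, 1e-3, 1e4   # step size, ridge strength, inverse temperature
loss_init = mse(W, b, a)

for _ in range(2000):
    H = np.tanh(X @ W.T + b)                          # hidden activations, (n, width)
    r = H @ a - y                                     # residuals
    S = (r[:, None] * a[None, :]) * (1.0 - H ** 2)    # backprop signal, (n, width)
    ga = H.T @ r / n + lam * a                        # gradients of ridge-regularized loss
    gW = S.T @ X / n + lam * W
    gb = S.sum(axis=0) / n + lam * b
    # SGLD update: gradient step plus Gaussian noise of scale sqrt(2 * eta / beta).
    sigma = np.sqrt(2.0 * eta / beta)
    a = a - eta * ga + sigma * rng.standard_normal(a.shape)
    W = W - eta * gW + sigma * rng.standard_normal(W.shape)
    b = b - eta * gb + sigma * rng.standard_normal(b.shape)

print(f"training MSE: {loss_init:.3f} -> {mse(W, b, a):.3f}")
```

The injected Gaussian perturbation is what lets the iterate escape bad local minima of the non-convex loss landscape; taking β → ∞ recovers plain gradient descent, while finite β makes the iterate approximately sample from a Gibbs distribution over the ridge-regularized empirical risk.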
In our analysis, the model is fixed and no explicit sparse regularization is employed. Our contributions can be summarized as follows:

• A refined analysis of excess risks for a fixed model with a fixed input dimension is given to compare deep and shallow estimators. Although several studies pointed out that the curse of dimensionality is a key factor separating shallow and deep approaches, we point out that such a separation appears even in a rather low-dimensional setting, and, more importantly, that the non-convexity of the model is what essentially makes the two regimes different.

• A lower bound of the excess risk that is valid for any linear estimator is derived. The analysis is considerably general because the class of linear estimators includes kernel ridge regression with any kernel, and thus it also includes estimators in the NTK regime.

• All derived convergence rates are fast learning rates, faster than O(1/√n). We show that simple noisy gradient descent on a sufficiently wide two-layer neural network achieves a fast

