BENEFIT OF DEEP LEARNING WITH NON-CONVEX NOISY GRADIENT DESCENT: PROVABLE EXCESS RISK BOUND AND SUPERIORITY TO KERNEL METHODS

Abstract

Establishing a theoretical analysis that explains why deep learning can outperform shallow learning such as kernel methods is one of the biggest issues in the deep learning literature. Towards answering this question, we evaluate the excess risk of a deep learning estimator trained by noisy gradient descent with ridge regularization on a mildly overparameterized neural network, and discuss its superiority over a class of linear estimators that includes the neural tangent kernel approach, the random feature model, other kernel methods, the k-NN estimator, and so on. We consider a teacher-student regression model and show that any linear estimator can be outperformed by deep learning in the sense of the minimax optimal rate, especially in a high-dimensional setting. The obtained excess risk bounds are so-called fast learning rates, which are faster than the O(1/√n) rate obtained by the usual Rademacher complexity analysis. This discrepancy is induced by the non-convex geometry of the model, and the noisy gradient descent used for neural network training provably reaches a near-global optimal solution even though the loss landscape is highly non-convex. Although the noisy gradient descent does not employ any explicit or implicit sparsity-inducing regularization, it exhibits preferable generalization performance that dominates linear estimators.
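The training procedure referenced above, noisy gradient descent (in the style of stochastic gradient Langevin dynamics) applied to a ridge-regularized objective, can be sketched as follows. This is a minimal illustration, not the paper's actual experimental setup: the toy quadratic objective, step size `eta`, inverse temperature `beta`, and regularization weight `lam` are all illustrative assumptions.

```python
import numpy as np

def sgld(grad, theta0, eta=1e-2, beta=1e4, n_steps=5000, seed=0):
    """Noisy gradient descent (SGLD-style update):
        theta <- theta - eta * grad(theta) + sqrt(2*eta/beta) * xi,
    with xi ~ N(0, I). A large inverse temperature beta means the
    injected Gaussian noise is small."""
    rng = np.random.default_rng(seed)
    theta = np.array(theta0, dtype=float)
    for _ in range(n_steps):
        noise = rng.standard_normal(theta.shape)
        theta = theta - eta * grad(theta) + np.sqrt(2.0 * eta / beta) * noise
    return theta

# Toy ridge-regularized least-squares objective (an illustrative stand-in
# for the neural network training loss in the paper):
#     L(theta) = 0.5 * ||A theta - b||^2 + 0.5 * lam * ||theta||^2
A = np.array([[1.0, 0.0], [0.0, 2.0]])
b = np.array([1.0, 2.0])
lam = 0.1

def grad(theta):
    return A.T @ (A @ theta - b) + lam * theta

theta_hat = sgld(grad, theta0=np.zeros(2))
# For comparison: the exact regularized minimizer, around which the
# stationary distribution of the dynamics concentrates.
theta_star = np.linalg.solve(A.T @ A + lam * np.eye(2), A.T @ b)
```

On this convex toy loss the iterate simply hovers near the unique minimizer; the point of the paper's analysis is that the same injected noise lets the dynamics escape bad local minima of a highly non-convex neural network loss and reach a near-global optimum.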

1. INTRODUCTION

In the deep learning theory literature, clarifying the mechanism by which deep learning can outperform shallow approaches has attracted considerable attention for a long time. In particular, it is quite important to show that a tractable algorithm for deep learning can provably achieve better generalization performance than shallow methods. Towards that goal, we study the rate of convergence of the excess risk of both deep and shallow methods in the setting of a nonparametric regression problem. One of the difficulties in showing the generalization ability of deep learning with certain optimization methods is that the solution is likely to get stuck in a bad local minimum, which prevents us from showing its preferable performance. Recent studies tackled this problem by considering optimization on overparameterized networks as in the neural tangent kernel (NTK) (Jacot et al., 2018; Du et al., 2019a) and mean field analysis (Nitanda & Suzuki, 2017; Chizat & Bach, 2018; Rotskoff & Vanden-Eijnden, 2018; 2019; Mei et al., 2018; 2019), or by analyzing noisy gradient descent such as stochastic gradient Langevin dynamics (SGLD) (Welling & Teh, 2011; Raginsky et al., 2017; Erdogdu et al., 2018). The NTK analysis deals with a relatively large-scale initialization so that the model is well approximated by the tangent space at the initial solution, and eventually, all analyses can be reduced to those of kernel methods (Jacot et al., 2018; Du et al., 2019b; Allen-Zhu et al., 2019; Du et al., 2019a; Arora et al., 2019; Cao & Gu, 2019; Zou et al., 2020). Although this regime is useful to show its global

