EXCESS RISK OF TWO-LAYER RELU NEURAL NETWORKS IN TEACHER-STUDENT SETTINGS AND ITS SUPERIORITY TO KERNEL METHODS

Abstract

While deep learning has outperformed other methods on various tasks, theoretical frameworks that explain why have not been fully established. We investigate the excess risk of two-layer ReLU neural networks in a teacher-student regression model, in which a student network learns an unknown teacher network through its outputs. In particular, we consider a student network that has the same width as the teacher network and is trained in two phases: first by noisy gradient descent and then by vanilla gradient descent. Our result shows that the student network provably reaches a near-global optimal solution and outperforms any kernel method estimator (more generally, any linear estimator), including the neural tangent kernel approach, the random feature model, and other kernel methods, in the sense of the minimax optimal rate. The key concept behind this superiority is the non-convexity of the neural network model: even though the loss landscape is highly non-convex, the student network adaptively learns the teacher neurons.
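The two-phase training procedure described above can be illustrated with a minimal numerical sketch. This is our own toy construction for intuition only — the width, step size, noise scale, and number of iterations are arbitrary choices, not the paper's actual algorithm or theoretical scaling: a same-width student is first trained by noisy gradient descent (Langevin-type updates) and then by vanilla gradient descent.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m, n = 5, 4, 200  # input dimension, network width, sample size

def relu(z):
    return np.maximum(z, 0.0)

def net(X, W, a):
    # Two-layer ReLU network: f(x) = sum_j a_j * relu(w_j . x)
    return relu(X @ W.T) @ a

# Teacher network and noisy observations of its outputs
W_teacher = rng.standard_normal((m, d))
a_teacher = rng.standard_normal(m)
X = rng.standard_normal((n, d))
y = net(X, W_teacher, a_teacher) + 0.1 * rng.standard_normal(n)

# Student network with the same width as the teacher
W = rng.standard_normal((m, d))
a = rng.standard_normal(m)
init_mse = np.mean((net(X, W, a) - y) ** 2)

def grads(X, y, W, a):
    # Gradients of the empirical squared loss (1/2n) * sum_i (f(x_i) - y_i)^2
    H = X @ W.T                      # pre-activations, shape (n, m)
    A = relu(H)                      # activations
    r = A @ a - y                    # residuals
    gH = np.outer(r, a) * (H > 0)    # backprop through ReLU
    return gH.T @ X / n, A.T @ r / n # gradients w.r.t. W and a

eta, beta = 0.01, 1e4  # step size; inverse temperature for the noisy phase

# Phase 1: noisy gradient descent (helps escape bad regions of the landscape)
for _ in range(1000):
    gW, ga = grads(X, y, W, a)
    W -= eta * gW + np.sqrt(2 * eta / beta) * rng.standard_normal(W.shape)
    a -= eta * ga + np.sqrt(2 * eta / beta) * rng.standard_normal(a.shape)

# Phase 2: vanilla gradient descent (local convergence near the solution)
for _ in range(3000):
    gW, ga = grads(X, y, W, a)
    W -= eta * gW
    a -= eta * ga

train_mse = np.mean((net(X, W, a) - y) ** 2)
print(init_mse, train_mse)
```

In this toy run the training error drops well below its value at initialization; the paper's analysis concerns the population excess risk of this kind of two-phase scheme, not the training error of any particular instance.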

1. INTRODUCTION

Explaining why deep learning empirically outperforms other methods has been one of the most significant issues. In particular, from the theoretical viewpoint, it is important to reveal the mechanism by which deep learning trained by an optimization method such as gradient descent can achieve superior generalization performance. To this end, we focus on the excess risk of two-layer ReLU neural networks in a nonparametric regression problem and compare its rate to that of kernel methods. One of the difficulties in showing the generalization ability of deep learning is the non-convexity of the associated optimization problem (Li et al., 2018), which may cause the solution to get stuck in a bad local minimum. To alleviate the non-convexity of neural network optimization, recent studies have focused on over-parameterization as a promising approach. Indeed, it is fully exploited by (i) the neural tangent kernel (NTK) (Jacot et al., 2018; Allen-Zhu et al., 2019; Arora et al., 2019; Du et al., 2019; Weinan et al., 2020; Zou et al., 2020) and (ii) mean-field analysis (Nitanda & Suzuki, 2017; Chizat & Bach, 2018; Mei et al., 2019; Tzen & Raginsky, 2020; Chizat, 2021; Suzuki & Akiyama, 2021). In the NTK regime, a relatively large-scale initialization is considered. Then gradient descent over the parameters of the neural network can be reduced to a convex optimization problem in an RKHS, which is easier to analyze. However, in this regime, it is hard to explain the superiority of deep learning because the estimation ability of the obtained estimator is reduced to that of the corresponding kernel. From this perspective, recent works focus on "beyond kernel" type analyses (Allen-Zhu & Li, 2019; Bai & Lee, 2020; Li et al., 2020; Chen et al., 2020; Refinetti et al., 2021; Abbe et al., 2022). Although these analyses show the superiority of deep learning to kernel methods in their respective settings, in terms of the sample size (n), all derived bounds are essentially Ω(1/√n). This bound is known to be sub-optimal for regression problems (Caponnetto & De Vito, 2007).
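For concreteness, the quantities being compared can be written as follows. This is the standard definition of excess risk under the squared loss in a well-specified teacher-student model; the notation here is ours and need not match the paper's:

```latex
% Model: Y = f^{\circ}(X) + \varepsilon, where f^{\circ} is the teacher network.
% Excess risk of an estimator \hat{f} under the squared loss:
\mathcal{R}(\hat{f})
  = \mathbb{E}\bigl[(Y - \hat{f}(X))^2\bigr]
    - \mathbb{E}\bigl[(Y - f^{\circ}(X))^2\bigr]
  = \mathbb{E}_{X}\bigl[(\hat{f}(X) - f^{\circ}(X))^2\bigr].

% The minimax optimal rate compares estimators by their worst-case risk
% over a class \mathcal{F}^{\circ} of teacher networks:
\inf_{\hat{f}}\ \sup_{f^{\circ} \in \mathcal{F}^{\circ}}
  \mathbb{E}\bigl[\mathcal{R}(\hat{f})\bigr].
```

A rate of Ω(1/√n) for the excess risk is thus sub-optimal whenever the minimax rate over the relevant function class decays faster in n, which is the sense in which the bounds cited above fall short.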

