EXCESS RISK OF TWO-LAYER RELU NEURAL NETWORKS IN TEACHER-STUDENT SETTINGS AND ITS SUPERIORITY TO KERNEL METHODS

Abstract

While deep learning has outperformed other methods on various tasks, theoretical frameworks that explain why have not been fully established. We investigate the excess risk of two-layer ReLU neural networks in a teacher-student regression model, in which a student network learns an unknown teacher network through its outputs. In particular, we consider a student network that has the same width as the teacher network and is trained in two phases: first by noisy gradient descent and then by vanilla gradient descent. Our result shows that the student network provably reaches a near-global optimal solution and outperforms any kernel method estimator (more generally, any linear estimator), including the neural tangent kernel approach, the random feature model, and other kernel methods, in the sense of the minimax optimal rate. The key concept inducing this superiority is the non-convexity of the neural network models: even though the loss landscape is highly non-convex, the student network adaptively learns the teacher neurons.
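To fix the setting schematically (the notation below is introduced only for illustration and may differ from the formal definitions given in the body of the paper), the teacher-student regression model with a two-layer ReLU network can be written as

f^{\circ}(x) = \sum_{j=1}^{m} a_j^{\circ} \, \sigma(\langle w_j^{\circ}, x \rangle), \qquad \sigma(u) = \max\{u, 0\}, \qquad y_i = f^{\circ}(x_i) + \varepsilon_i \quad (i = 1, \dots, n),

and the student network is of the same form and width, f_{\theta}(x) = \sum_{j=1}^{m} a_j \, \sigma(\langle w_j, x \rangle) with trainable parameters \theta = (a_j, w_j)_{j=1}^{m}. The excess risk of an estimator \hat{f} under the squared loss is then \mathbb{E}_x[(\hat{f}(x) - f^{\circ}(x))^2], the quantity whose convergence rate is compared with that of kernel methods.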

1. INTRODUCTION

Explaining why deep learning empirically outperforms other methods has been one of the most significant issues. In particular, from the theoretical viewpoint, it is important to reveal the mechanism by which deep learning trained with an optimization method such as gradient descent can achieve superior generalization performance. To this end, we focus on the excess risk of two-layer ReLU neural networks in a nonparametric regression problem and compare its rate to that of kernel methods. One of the difficulties in showing the generalization ability of deep learning is the non-convexity of the associated optimization problem (Li et al., 2018), which may cause the solution to get stuck at a bad local minimum. To alleviate the non-convexity of neural network optimization, recent studies have focused on over-parameterization as a promising approach. Indeed, over-parameterization is fully exploited by (i) the Neural Tangent Kernel (NTK) approach (Jacot et al., 2018; Allen-Zhu et al., 2019; Arora et al., 2019; Du et al., 2019; Weinan et al., 2020; Zou et al., 2020) and (ii) the mean field analysis (Nitanda & Suzuki, 2017; Chizat & Bach, 2018; Mei et al., 2019; Tzen & Raginsky, 2020; Chizat, 2021; Suzuki & Akiyama, 2021). In the NTK regime, a relatively large-scale initialization is considered, so that gradient descent on the network parameters reduces to a convex optimization problem in an RKHS, which is easier to analyze. However, in this regime, it is hard to explain the superiority of deep learning because the estimation ability of the obtained estimator is reduced to that of the corresponding kernel. From this perspective, recent works have pursued "beyond kernel" type analyses (Allen-Zhu & Li, 2019; Bai & Lee, 2020; Li et al., 2020; Chen et al., 2020; Refinetti et al., 2021; Abbe et al., 2022). Although these analyses show the superiority of deep learning to kernel methods in each setting, in terms of the sample size n, all derived bounds are essentially Ω(1/√n), which is known to be sub-optimal for regression problems (Caponnetto & De Vito, 2007).

In the mean field analysis, a kind of continuous limit of the neural network is considered, and its convergence to specific target functions has been analyzed. This regime is more suitable from the "beyond kernel" perspective, but it essentially deals with a continuous limit, and hence it is difficult to control the discretization error when considering a teacher network with a finite width. Indeed, the optimization complexity has been studied in recent research, but it still requires an exponential time complexity in the worst case (Mei et al., 2018b; Hu et al., 2019; Nitanda et al., 2021a). This problem is mainly due to the lack of a landscape analysis that exploits the problem structure more closely. For example, we may consider the teacher-student setting, in which the true function is itself represented as a neural network. This allows us to use a landscape analysis in the optimization analysis and gives a more precise analysis of the statistical performance; in particular, we can obtain a more precise characterization of the excess risk (e.g., Suzuki & Akiyama (2021)).

More recently, some studies have focused on the feature learning ability of neural networks (Abbe et al., 2021; 2022; Chizat & Bach, 2020; Ba et al., 2022; Nguyen, 2021). Among them, Abbe et al. (2021) considers estimation of functions with the staircase property over multi-dimensional Boolean inputs and shows that neural networks can learn that structure through stochastic gradient descent.
Moreover, Abbe et al. (2022) studies a similar setting and shows that, in a high-dimensional setting, two-layer neural networks with a sufficiently smooth activation can outperform the kernel method. However, the obtained bound is still O(1/√n), and a higher smoothness of the activation is required as the dimensionality of the Boolean inputs increases.

The teacher-student setting is one of the most common settings in theoretical studies, e.g., Tian (2017); Safran & Shamir (2018); Goldt et al. (2019); Zhang et al. (2019); Safran et al. (2021); Tian (2020); Yehudai & Shamir (2020); Suzuki & Akiyama (2021); Zhou et al. (2021); Akiyama & Suzuki (2021), to name a few. Zhong et al. (2017) studies the case where the teacher and student have the same width, shows that strong convexity holds around the parameters of the teacher network, and proposes a special tensor method for initialization that achieves convergence to the global optimum. However, global convergence is guaranteed only for this special initialization, which excludes a pure gradient descent method. Safran & Shamir (2018) empirically shows that gradient descent is likely to converge to non-global local minima, even if we prepare a student that has the same size as the teacher. More recently, Yehudai & Shamir (2020) shows that, even in the simplest case where the teacher and student have width one, there exist distributions and activation functions for which gradient descent fails to learn. Safran et al. (2021) shows strong convexity around the parameters of the teacher network in the case where the teacher and student have the same width and the inputs are Gaussian. They also study the effect of over-parameterization and show that over-parameterization turns the spurious local minima into saddle points; it should be noted, however, that this does not imply that gradient descent can reach the global optimum. Akiyama & Suzuki (2021) shows that gradient descent with a sparse regularization can achieve the global optimal solution for an over-parameterized student network; thanks to the sparse regularization, the global optimal solution can exactly recover the teacher network. However, this analysis requires a highly over-parameterized network, whose width is exponentially large in the dimensionality and the sample size, and it imposes quite strong assumptions, such as the absence of observation noise and the orthogonality of the teacher neurons.

The superiority of deep learning over kernel methods has also been discussed in the nonparametric statistics literature, where the minimax optimality of deep learning in terms of the excess risk is established. In particular, a line of research (Schmidt-Hieber, 2020; Suzuki, 2018; Hayakawa & Suzuki, 2020; Suzuki & Nitanda, 2021; Suzuki & Akiyama, 2021) shows that deep learning achieves faster rates of convergence than linear estimators in several settings. Here, linear estimators form a general class of estimators that includes kernel ridge regression, k-NN regression, and the Nadaraya-Watson estimator. Among them, Suzuki & Akiyama (2021) treats a tractable optimization algorithm in a teacher-student setting, but it requires an exponential computational complexity and a smooth activation function, which excludes ReLU.
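For concreteness, a linear estimator here is any estimator whose prediction depends linearly on the observed outputs; in a generic form (with weight functions \varphi_i introduced only for illustration),

\hat{f}(x) = \sum_{i=1}^{n} \varphi_i(x; x_1, \dots, x_n) \, y_i,

where the weights \varphi_i may depend on x and on the inputs x_1, \dots, x_n but not on the outputs y_1, \dots, y_n. Kernel ridge regression, k-NN regression, and the Nadaraya-Watson estimator are all of this form.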
In this paper, we consider a gradient descent method with two phases: a noisy gradient descent first and a vanilla gradient descent next. Our analysis shows that, through this method, the student network recovers the teacher network with a computational complexity that is polynomial in the sample size, without requiring an exponentially wide network, and without such strong assumptions as the absence of observation noise or the orthogonality of the teacher neurons. Moreover, we evaluate the excess risk of the trained network and show that it can outperform any linear estimator.
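As an informal illustration of this two-phase procedure, the following minimal sketch (in Python/NumPy) trains a two-layer ReLU student by a Langevin-type noisy gradient descent followed by vanilla gradient descent; the width, step sizes, noise scale, iteration counts, and the omission of any regularization terms are placeholders rather than the choices prescribed by our theory.

import numpy as np

def forward(X, W, a):
    # Two-layer ReLU student: f(x) = sum_j a_j * max(<w_j, x>, 0)
    return np.maximum(X @ W.T, 0.0) @ a

def gradients(X, y, W, a):
    # Gradients of the empirical squared loss (1/(2n)) * sum_i (f(x_i) - y_i)^2
    n = X.shape[0]
    H = np.maximum(X @ W.T, 0.0)              # hidden activations, shape (n, m)
    r = H @ a - y                             # residuals, shape (n,)
    grad_a = H.T @ r / n                      # shape (m,)
    D = (X @ W.T > 0).astype(X.dtype)         # ReLU derivative (subgradient 0 at 0)
    grad_W = ((D * r[:, None]) * a[None, :]).T @ X / n   # shape (m, d)
    return grad_W, grad_a

def train_two_phase(X, y, m, eta1=0.05, eta2=0.05, beta=1e4, T1=2000, T2=2000, seed=0):
    # Phase 1: noisy gradient descent (gradient Langevin dynamics, inverse temperature beta).
    # Phase 2: vanilla gradient descent for local refinement.
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    W = rng.normal(scale=1.0 / np.sqrt(d), size=(m, d))
    a = rng.normal(scale=1.0 / np.sqrt(m), size=m)
    for _ in range(T1):
        gW, ga = gradients(X, y, W, a)
        W = W - eta1 * gW + np.sqrt(2.0 * eta1 / beta) * rng.normal(size=W.shape)
        a = a - eta1 * ga + np.sqrt(2.0 * eta1 / beta) * rng.normal(size=a.shape)
    for _ in range(T2):
        gW, ga = gradients(X, y, W, a)
        W = W - eta2 * gW
        a = a - eta2 * ga
    return W, a

Intuitively, the Gaussian noise injected in the first phase helps the iterate escape bad regions of the non-convex landscape, while the second, noiseless phase refines the solution locally once it is close to the teacher parameters.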


