OPTIMAL RATES FOR AVERAGED STOCHASTIC GRADIENT DESCENT UNDER NEURAL TANGENT KERNEL REGIME

Abstract

We analyze the convergence of averaged stochastic gradient descent for overparameterized two-layer neural networks on regression problems. It was recently found that the neural tangent kernel (NTK) plays an important role in showing the global convergence of gradient-based methods under the NTK regime, where the learning dynamics of overparameterized neural networks can be almost fully characterized by those in the associated reproducing kernel Hilbert space (RKHS). However, there is still room for a convergence rate analysis in the NTK regime. In this study, we show that averaged stochastic gradient descent can achieve the minimax optimal convergence rate, with a global convergence guarantee, by exploiting the complexities of the target function and of the RKHS associated with the NTK. Moreover, we show that the target function specified by the NTK of a ReLU network can be learned at the optimal convergence rate through a smooth approximation of the ReLU network under certain conditions.

1. INTRODUCTION

Recent studies have revealed why stochastic gradient descent for neural networks converges to a global minimum and why it generalizes well under the overparameterized setting, in which the number of parameters is larger than the number of given training examples. One prominent approach is to map the learning dynamics of neural networks into function spaces and exploit the convexity of the loss functions with respect to the function. The neural tangent kernel (NTK) (Jacot et al., 2018) has provided such a connection between the learning process of a neural network and a kernel method in the reproducing kernel Hilbert space (RKHS) associated with the NTK. By contrast, several studies showed faster convergence rates of the (averaged) stochastic gradient descent in the RKHS in terms of generalization (Cesa-Bianchi et al., 2004; Smale & Yao, 2006; Ying & Zhou, 2006; Neu & Rosasco, 2018; Lin et al., 2020). In particular, by extending the results in the finite-dimensional case (Bach & Moulines, 2013), Dieuleveut & Bach (2016); Dieuleveut et al. (2017) showed convergence rates of O(T^{-2rβ/(2rβ+1)}) depending on the complexity r ∈ [1/2, 1] of the target function and the decay rate β > 1 of the eigenvalues of the kernel (a.k.a. the complexity of the hypothesis space). This rate is always faster than O(T^{-1/2}) and is known as the minimax optimal rate (Caponnetto & De Vito, 2007; Blanchard & Mücke, 2018). Hence, a gap exists between the theories of the NTK and of kernel methods. In other words, there is still room for an investigation of stochastic gradient descent because the complexities of the target function and the hypothesis space have not been fully exploited. That is, to obtain faster convergence rates, we should specify the eigenspaces of the NTK that mainly contain the target function (i.e., the complexity of the target function) and the decay rate of the eigenvalues of the NTK (i.e., the complexity of the hypothesis space), as studied for kernel methods (Caponnetto & De Vito, 2007; Steinwart et al., 2009; Dieuleveut & Bach, 2016). In summary, the fundamental question in this study is: Can stochastic gradient descent for overparameterized neural networks achieve the optimal rate in terms of generalization by exploiting the complexities of the target function and the hypothesis space? In this study, we answer this question in the affirmative, thereby bridging the gap between the theories of overparameterized neural networks and kernel methods.

1.1. CONTRIBUTIONS

The connection between neural networks and kernel methods is increasingly well understood via the NTK, but it was still unknown whether an optimal convergence rate faster than O(T^{-1/2}) is achievable by any algorithm for neural networks. This is the first paper to overcome the technical challenges of achieving the optimal convergence rate under the NTK regime. We obtain the minimax optimal convergence rates (Corollary 1), inherited from the learning dynamics in the RKHS, for averaged stochastic gradient descent on neural networks. That is, we show that smooth target functions that are efficiently specified by the NTK are learned at convergence rates faster than O(1/√T). Moreover, we obtain an explicit optimal convergence rate of O(T^{-2rd/(2rd+d-1)}) for a smooth approximation of the ReLU network (Corollary 2), where d is the dimensionality of the data space and r is the complexity of the target function specified by the NTK of the ReLU network.
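To make the algorithmic setting concrete, the following is a minimal sketch (not the paper's implementation) of averaged stochastic gradient descent, i.e., SGD with Polyak-Ruppert averaging of the iterates, for an overparameterized two-layer network with a smooth activation trained on streaming data with the squared loss. The width M, step size eta, activation, noise level, and target function are illustrative assumptions, not the quantities analyzed in the paper.

import numpy as np

rng = np.random.default_rng(0)
d, M, T, eta = 5, 1000, 20000, 0.25

def sample_sphere(n):
    x = rng.standard_normal((n, d))
    return x / np.linalg.norm(x, axis=1, keepdims=True)   # uniform samples on the unit sphere

f_star = lambda x: np.sin(3.0 * x[:, 0])                   # assumed target function

W = rng.standard_normal((M, d))                            # first-layer initialization
a = rng.choice([-1.0, 1.0], size=M)                        # second-layer initialization
W_bar, a_bar = W.copy(), a.copy()                          # Polyak-Ruppert running averages

act, dact = np.tanh, lambda z: 1.0 - np.tanh(z) ** 2       # smooth activation (illustrative)

for t in range(1, T + 1):
    x = sample_sphere(1)                                   # one fresh example per step
    y = f_star(x) + 0.1 * rng.standard_normal()
    z = (x @ W.T).ravel()                                  # pre-activations, shape (M,)
    residual = act(z) @ a / np.sqrt(M) - y                 # prediction error for the squared loss
    grad_a = residual * act(z) / np.sqrt(M)
    grad_W = (residual * a * dact(z) / np.sqrt(M))[:, None] * x
    a -= eta * grad_a                                      # SGD step on the second layer
    W -= eta * grad_W                                      # SGD step on the first layer
    a_bar += (a - a_bar) / (t + 1)                         # running average of the iterates;
    W_bar += (W - W_bar) / (t + 1)                         # the averaged network is the output

X_test = sample_sphere(2000)
pred = np.tanh(X_test @ W_bar.T) @ a_bar / np.sqrt(M)
print("test MSE of the averaged iterate:", np.mean((pred - f_star(X_test)) ** 2))

The generalization error of the averaged iterate, rather than of the last iterate, is the quantity whose convergence rate is analyzed.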

1.2. TECHNICAL CHALLENGE

The key to showing a global convergence (Theorem 1) is making the connection between kernel methods and neural networks in some sense. Although this sort of analysis has been developed in several studies (Du et al., 2019b; Arora et al., 2019a; Weinan et al., 2019; Arora et al., 2019b;  



The global convergence of gradient descent was demonstrated in Du et al. (2019b); Allen-Zhu et al. (2019a); Du et al. (2019a); Allen-Zhu et al. (2019b) by developing the theory of the NTK with overparameterization. In these theories, the positivity of the NTK on the given training examples plays a crucial role in exploiting the properties of the NTK. Specifically, the positive definiteness of the Gram matrix of the NTK leads to a rapid decay of the training loss, and thus the learning dynamics can be localized around the initial point of the neural network under overparameterization; this yields the equivalence between the learning dynamics of neural networks and those of kernel methods with the NTK through a linear approximation of the network. Moreover, Arora et al. (2019a) provided a generalization bound of O(T^{-1/2}), where T is the number of training examples, for gradient descent under the positivity assumption on the NTK. These studies provided the first steps in understanding the role of the NTK. However, the eigenvalues of the NTK converge to zero as the number of examples increases, as shown in Su & Yang (2019) (see also Figure 1), resulting in the degeneration of the NTK. This phenomenon indicates that the convergence rates in previous studies are, in terms of generalization, generally slower than O(T^{-1/2}) owing to their dependence on the minimum eigenvalue. Moreover, Bietti & Mairal (2019); Ronen et al. (2019); Cao et al. (2019) supported this observation by providing precise estimates of the eigenvalue decay, and Ronen et al. (2019); Cao et al. (2019) proved the spectral bias (Rahaman et al., 2019) for neural networks, whereby lower-frequency components are learned first by gradient descent.
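To make this argument concrete, the following display sketches the standard linearization step behind the cited results; the notation θ_0, H, u(t) is ours, and the bound is the usual one for the linearized model, not a statement taken from this paper. Around the initialization θ_0, the network is approximated by its first-order expansion

f(x; θ) ≈ f(x; θ_0) + ⟨∇_θ f(x; θ_0), θ − θ_0⟩,    K(x, x') = ⟨∇_θ f(x; θ_0), ∇_θ f(x'; θ_0)⟩.

Writing H = (K(x_i, x_j))_{i,j=1}^n for the Gram matrix of the NTK on the training examples and u(t) for the vector of predictions at iteration t, gradient descent on the linearized model with a step size η ≤ 1/λ_max(H) satisfies

‖u(t) − y‖² ≤ (1 − η λ_min(H))^{2t} ‖u(0) − y‖²,

so a strictly positive λ_min(H) yields the rapid decay of the training loss described above, which in turn keeps the parameters close to θ_0 and justifies the linear approximation.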

In addition to the convergence rates of averaged stochastic gradient descent in RKHSs mentioned above, extensions to the random feature setting (Rahimi & Recht, 2007; Rudi & Rosasco, 2017; Carratino et al., 2018), to the multi-pass variant (Pillaud-Vivien et al., 2018b), and to the tail-averaging and mini-batching variant (Mücke et al., 2019) have been developed.

Figure 1: An estimation of the eigenvalues of Σ_∞ using two-layer ReLU networks with a width of M = 2 × 10^4. The number of uniformly randomly generated samples on the unit sphere is n = 10^4, and the dimensionality of the input space is d ∈ {5, 10, 100}.
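For reference, the following is a minimal sketch (not the code behind Figure 1) of how such an estimate can be produced: sample points uniformly on the unit sphere, form the empirical NTK Gram matrix of a randomly initialized two-layer ReLU network with both layers trainable, and take the eigenvalues of the Gram matrix divided by n as estimates of the eigenvalues of Σ_∞. The parameterization, the initialization scheme, and the smaller values of n and M used below are illustrative assumptions.

import numpy as np

def ntk_gram_relu(X, M, rng):
    """Empirical NTK Gram matrix of f(x) = (1/sqrt(M)) * sum_m a_m * relu(w_m^T x)."""
    W = rng.standard_normal((M, X.shape[1]))    # first-layer weights w_m ~ N(0, I)
    a = rng.choice([-1.0, 1.0], size=M)         # second-layer weights a_m in {-1, +1}
    Z = X @ W.T                                  # pre-activations, shape (n, M)
    act = np.maximum(Z, 0.0)                     # relu(w_m^T x)
    der = (Z > 0.0).astype(X.dtype)              # relu'(w_m^T x)
    K_first = (X @ X.T) * (der @ der.T) / M      # contribution of first-layer gradients (a_m^2 = 1)
    K_second = (act @ act.T) / M                 # contribution of second-layer gradients
    return K_first + K_second

rng = np.random.default_rng(0)
d, n, M = 10, 2000, 4000                         # the caption uses n = 10^4 and M = 2 x 10^4
X = rng.standard_normal((n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)    # uniform samples on the unit sphere
K = ntk_gram_relu(X, M, rng)
eigvals = np.linalg.eigvalsh(K)[::-1] / n        # eigenvalues of K / n estimate those of Sigma_infty
print(eigvals[:10])                              # the decay of these values is what Figure 1 plots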

