EXCESS RISK OF TWO-LAYER RELU NEURAL NETWORKS IN TEACHER-STUDENT SETTINGS AND ITS SUPERIORITY TO KERNEL METHODS

Abstract

While deep learning has outperformed other methods on various tasks, theoretical frameworks that explain why have not been fully established. We investigate the excess risk of two-layer ReLU neural networks in a teacher-student regression model, in which a student network learns an unknown teacher network through its outputs. In particular, we consider a student network that has the same width as the teacher network and is trained in two phases: first by noisy gradient descent and then by vanilla gradient descent. Our result shows that the student network provably reaches a near-global optimal solution and outperforms any kernel method estimator (more generally, any linear estimator), including the neural tangent kernel approach, the random feature model, and other kernel methods, in the sense of the minimax optimal rate. The key concept behind this superiority is the non-convexity of the neural network models: even though the loss landscape is highly non-convex, the student network adaptively learns the teacher neurons.

1. INTRODUCTION

Explaining why deep learning empirically outperforms other methods has been one of the most significant issues. In particular, from the theoretical viewpoint, it is important to reveal the mechanism by which deep learning trained by an optimization method such as gradient descent can achieve superior generalization performance. To this end, we focus on the excess risk of two-layer ReLU neural networks in a nonparametric regression problem and compare its rate to that of kernel methods. One of the difficulties in showing the generalization ability of deep learning is the non-convexity of the associated optimization problem (Li et al., 2018), which may cause the optimization to get stuck in a bad local minimum. To alleviate this non-convexity, recent studies have focused on over-parameterization as a promising approach. Indeed, over-parameterization is fully exploited by (i) the Neural Tangent Kernel (NTK) (Jacot et al., 2018; Allen-Zhu et al., 2019; Arora et al., 2019; Du et al., 2019; Weinan et al., 2020; Zou et al., 2020) and (ii) mean field analysis (Nitanda & Suzuki, 2017; Chizat & Bach, 2018; Mei et al., 2019; Tzen & Raginsky, 2020; Chizat, 2021; Suzuki & Akiyama, 2021). In the NTK regime, a relatively large-scale initialization is considered, and gradient descent on the network parameters reduces to convex optimization in an RKHS, which is easier to analyze. However, in this regime, it is hard to explain the superiority of deep learning because the estimation ability of the obtained estimator is reduced to that of the corresponding kernel. From this perspective, recent works have pursued "beyond kernel" analyses (Allen-Zhu & Li, 2019; Bai & Lee, 2020; Li et al., 2020; Chen et al., 2020; Refinetti et al., 2021; Abbe et al., 2022). Although these analyses show the superiority of deep learning to kernel methods in each setting, in terms of the sample size $n$, all derived bounds are essentially $\Omega(1/\sqrt{n})$.
This bound is known to be sub-optimal for regression problems (Caponnetto & De Vito, 2007). In the mean field analysis setting, a kind of continuous limit of the neural network is considered, and its convergence to specific target functions has been analyzed. This regime is more suitable from a "beyond kernel" perspective, but it essentially deals with a continuous limit, and hence it is difficult to control the discretization error when considering a teacher network with a finite width. Indeed, the optimization complexity has been studied recently, but it still requires exponential time complexity in the worst case (Mei et al., 2018b; Hu et al., 2019; Nitanda et al., 2021a). This problem is mainly due to the lack of a landscape analysis that closely exploits the problem structure. For example, we may consider the teacher-student setting, where the true function is represented as a neural network. This allows us to use landscape analysis in the optimization analysis and give a more precise analysis of the statistical performance. In particular, we can obtain a more precise characterization of the excess risk (e.g., Suzuki & Akiyama (2021)). More recently, some studies have focused on the feature learning ability of neural networks (Abbe et al., 2021; 2022; Chizat & Bach, 2020; Ba et al., 2022; Nguyen, 2021). Among them, Abbe et al. (2021) consider estimation of functions with the staircase property and multi-dimensional Boolean inputs and show that neural networks can learn that structure through stochastic gradient descent. Moreover, Abbe et al. (2022) study a similar setting and show that, in a high-dimensional setting, two-layer neural networks with sufficiently smooth activation can outperform the kernel method. However, the obtained bound is still $O(1/\sqrt{n})$ and requires higher smoothness of the activation as the dimensionality of the Boolean inputs increases.
The teacher-student setting is one of the most common settings for theoretical studies, e.g., (Tian, 2017; Safran & Shamir, 2018; Goldt et al., 2019; Zhang et al., 2019; Safran et al., 2021; Tian, 2020; Yehudai & Shamir, 2020; Suzuki & Akiyama, 2021; Zhou et al., 2021; Akiyama & Suzuki, 2021) to name a few. Zhong et al. (2017) study the case where the teacher and student have the same width, show that strong convexity holds around the parameters of the teacher network, and propose a special tensor method for initialization to achieve convergence to the global optimum. However, its global convergence is guaranteed only for this special initialization, which excludes a pure gradient descent method. Safran & Shamir (2018) empirically show that gradient descent is likely to converge to non-global local minima, even if we prepare a student that has the same size as the teacher. More recently, Yehudai & Shamir (2020) show that even in the simplest case where the teacher and student have width one, there exist distributions and activation functions for which gradient descent fails to learn. Safran et al. (2021) show strong convexity around the parameters of the teacher network in the case where the teacher and student have the same width for Gaussian inputs. They also study the effect of over-parameterization and show that over-parameterization turns the spurious local minima into saddle points. However, it should be noted that this does not imply that gradient descent can reach the global optimum. Akiyama & Suzuki (2021) show that gradient descent with sparse regularization can achieve the global optimal solution for an over-parameterized student network. Thanks to the sparse regularization, the global optimal solution can exactly recover the teacher network. However, this approach requires a highly over-parameterized network.
Indeed, it requires a width exponentially large in the dimensionality and the sample size. Moreover, they impose quite strong assumptions: there is no observation noise, and the parameters of the neurons in the teacher network must be orthogonal to each other. The superiority of deep learning over kernel methods has also been discussed in the nonparametric statistics literature, where the minimax optimality of deep learning in terms of excess risk is established. In particular, a line of research (Schmidt-Hieber, 2020; Suzuki, 2018; Hayakawa & Suzuki, 2020; Suzuki & Nitanda, 2021; Suzuki & Akiyama, 2021) shows that deep learning achieves faster rates of convergence than linear estimators in several settings. Here, the linear estimators form a general class of estimators that includes kernel ridge regression, k-NN regression, and the Nadaraya-Watson estimator. Among them, Suzuki & Akiyama (2021) treat a tractable optimization algorithm in a teacher-student setting, but they require exponential computational complexity and a smooth activation function, which excludes ReLU. In this paper, we consider gradient descent with two phases: noisy gradient descent first and vanilla gradient descent next. Our analysis shows that through this method the student network recovers the teacher network with polynomial computational complexity (with respect to the sample size) without using an exponentially wide network, and without strong assumptions such as the absence of noise or orthogonality of the teacher parameters. Moreover, we evaluate the excess risk of the trained network and show that it can outperform any linear estimator, including kernel methods, in terms of its dependence on the sample size. More specifically, our contributions can be summarized as follows: • We show that by two-phase gradient descent, the student network, which has the same width as the teacher network, provably reaches a near-optimal solution.
Moreover, we conduct a refined analysis of the excess risk and provide an upper bound for the excess risk of the student network that is much faster than the bound obtained by a generalization-bound analysis via the Rademacher complexity argument. Throughout this paper, our analysis requires neither heavy over-parameterization nor any special initialization scheme. • We compare the excess risk of the student network with that of linear estimators and show that while linear estimators suffer heavily from the curse of dimensionality, the student network suffers much less. In particular, in high-dimensional settings, the convergence rate of the excess risk of any linear estimator approaches $O(n^{-1/2})$, which coincides with the classical bound derived by the Rademacher complexity argument. • The lower bound of the excess risk derived in this paper is valid for any linear estimator. The analysis is considerably general because the class of linear estimators includes kernel ridge regression with any kernel. This generality implies that the derived upper bound cannot be obtained by any argument that relies on a fixed kernel, including the neural tangent kernel.

2. PROBLEM SETTINGS

Notations For $m \in \mathbb{N}$, let $[m] := \{1, \ldots, m\}$. For $x \in \mathbb{R}^d$, $\|x\|$ denotes its Euclidean norm. We denote the inner product between $x, y \in \mathbb{R}^d$ by $\langle x, y \rangle = \sum_{j=1}^d x_j y_j$. $S^{d-1}$ denotes the unit sphere in $\mathbb{R}^d$. For a matrix $W$, we denote its operator and Frobenius norms by $\|W\|_2$ and $\|W\|_F$, respectively. Here, we introduce the problem setting and the model considered in this paper. We focus on a regression problem in which we observe $n$ training examples $D_n = (x_i, y_i)_{i=1}^n$ generated by the model $y_i = f^\circ(x_i) + \epsilon_i$ for an unknown measurable function $f^\circ: \mathbb{R}^d \to \mathbb{R}$, where $(x_i)_{i=1}^n$ is an i.i.d. sequence drawn from $P_X$, the uniform distribution over $\Omega = S^{d-1}$, and the $\epsilon_i$ are i.i.d. random variables satisfying $\mathbb{E}[\epsilon_i] = 0$, $\mathbb{E}[\epsilon_i^2] = v^2$, and $|\epsilon_i| \le U$ a.s. Our goal is to estimate the true function $f^\circ$ from the training data. To this end, we consider the squared loss $\ell(y, f(x)) = (y - f(x))^2$ and define the expected risk and the empirical risk as $L(f) := \mathbb{E}_{X,Y}[\ell(Y, f(X))]$ and $\widehat{L}(f) := \frac{1}{n}\sum_{i=1}^n \ell(y_i, f(x_i))$, respectively. We measure the performance of an estimator $\hat f$ by the excess risk $L(\hat f) - \inf_{f:\,\mathrm{measurable}} L(f)$. Since the infimum is attained by $f^\circ$, the excess risk coincides with $\|\hat f - f^\circ\|^2_{L^2(P_X)}$, the $L^2$-distance between $\hat f$ and $f^\circ$. We remark that the excess risk is different from the generalization gap $L(\hat f) - \widehat{L}(\hat f)$. Indeed, in terms of the convergence rate with respect to $n$, the generalization gap typically converges to zero at rate $O(1/\sqrt{n})$ (Wainwright, 2019). The excess risk, on the other hand, can converge at a rate faster than $O(1/\sqrt{n})$, which is known as a fast learning rate.
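As a concrete illustration of the quantities above, the following sketch (our own toy example, not taken from the paper; the choice of true function is arbitrary) approximates the excess risk $\|\hat f - f^\circ\|^2_{L^2(P_X)}$ of an estimator by Monte Carlo over fresh draws from $P_X$, the uniform distribution on the sphere.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5

def sample_sphere(n):
    """Draw n points uniformly on S^{d-1} by normalizing Gaussian vectors."""
    X = rng.standard_normal((n, d))
    return X / np.linalg.norm(X, axis=1, keepdims=True)

def f_true(X):
    # a placeholder true function (hypothetical choice for this illustration)
    return np.sin(3 * X[:, 0])

def excess_risk(f_hat, n_mc=200_000):
    """Monte Carlo estimate of ||f_hat - f_true||^2_{L2(P_X)}."""
    X = sample_sphere(n_mc)
    return np.mean((f_hat(X) - f_true(X)) ** 2)

print(excess_risk(f_true))                      # 0: the true function itself
print(excess_risk(lambda X: np.zeros(len(X))))  # E[f_true(x)^2] for the zero estimator
```

Note that the excess risk is a population quantity: it is evaluated on fresh samples from $P_X$, not on the training data.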

2.1. MODEL OF TRUE FUNCTIONS

To evaluate the excess risk, we introduce a function class in which the true function $f^\circ$ is included. In this paper, we focus on the teacher-student setting with two-layer ReLU neural networks, in which the true function (called the teacher) is given by $f_{a^\circ, W^\circ}(x) = \sum_{j=1}^m a^\circ_j \sigma(\langle w^\circ_j, x \rangle)$, where $\sigma(u) = \max\{u, 0\}$ is the ReLU activation, $m$ is the width of the teacher model satisfying $m \le d$, and $a^\circ_j \in \mathbb{R}$, $w^\circ_j \in \mathbb{R}^d$ for $j \in [m]$ are its parameters. We impose several conditions on the parameters of the teacher network. Let $W^\circ = (w^\circ_1\ w^\circ_2\ \cdots\ w^\circ_m) \in \mathbb{R}^{d \times m}$ and let $\sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_m$ be the singular values of $W^\circ$. First, we assume that $a^\circ_j \in \{\pm 1\}$ for every $j \in [m]$. By the 1-homogeneity of the ReLU activation, this condition does not restrict the generality of the teacher networks. Moreover, we assume that there exists $\sigma_{\min} > 0$ such that $\sigma_m > \sigma_{\min}$. If $\sigma_m = 0$, there exists an example in which $f_{a^\circ, W^\circ}$ has multiple representations. Indeed, Zhou et al. (2021) show that when $a^\circ_j = 1$ for all $j \in [m]$ and $\sum_{j=1}^m w^\circ_j = 0$, it holds that $f_{a^\circ, W^\circ} = \sum_{j=1}^m \sigma(\langle w^\circ_j, x \rangle) = \sum_{j=1}^m \sigma(\langle -w^\circ_j, x \rangle)$. Hence, throughout this paper, we focus on the estimation problem in which the true function is included in the class $\mathcal{F}^\circ := \{ f_{a^\circ, W^\circ} \mid a^\circ \in \{\pm 1\}^m,\ \|W^\circ\|_2 \le 1,\ \sigma_m > \sigma_{\min} \}$ (1). This class represents two-layer ReLU networks whose width is at most the dimensionality of the inputs. The constraint $\|W^\circ\|_2 \le 1$ is assumed only for analytical simplicity and can be relaxed to any positive constant.
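A member of $\mathcal{F}^\circ$ can be constructed explicitly. The sketch below (our own illustration) builds a teacher with $a^\circ_j \in \{\pm 1\}$, $\|W^\circ\|_2 \le 1$, and $\sigma_m$ bounded away from zero by using scaled orthonormal columns; the scale $0.5$ is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(1)
d, m = 10, 4  # width m <= d, as the function class requires

# a_j in {±1}; W with orthonormal columns scaled by 0.5, so that
# ||W||_2 = 0.5 <= 1 and sigma_m = 0.5 > 0 (all singular values equal 0.5)
a = rng.choice([-1.0, 1.0], size=m)
W = 0.5 * np.linalg.qr(rng.standard_normal((d, m)))[0]

def teacher(X):
    """f(x) = sum_j a_j * relu(<w_j, x>), evaluated row-wise on X."""
    return np.maximum(X @ W, 0.0) @ a

s = np.linalg.svd(W, compute_uv=False)
print(s.max() <= 1.0, s.min() > 0.0)  # True True: the class constraints hold
```

The 1-homogeneity mentioned above means $\sigma(cu) = c\,\sigma(u)$ for $c \ge 0$, so any nonzero $a^\circ_j$ can be absorbed into $\|w^\circ_j\|$, leaving only its sign.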

3. ESTIMATORS

In this section, we introduce the two classes of estimators: linear estimators and neural networks (student networks) trained by two-phase gradient descent. The linear estimator is introduced as a generalization of the kernel method. We will establish a separation between linear estimators and neural networks by giving a suboptimal rate of the excess risk for the linear estimators (Theorem 4.1), which simultaneously separates the kernel methods from the neural network approach. A detailed comparison of the excess risks of these estimators is conducted in Section 4.

3.1. LINEAR ESTIMATORS

Given observations $(x_1, y_1), \ldots, (x_n, y_n)$, an estimator $\hat f$ is called linear if it can be represented as $\hat f(x) = \sum_{i=1}^n y_i \varphi_i(x_1, \ldots, x_n, x)$, where $(\varphi_i)_{i=1}^n$ is a sequence of measurable, $L^2(P_X)$-integrable functions. The most important example in this study is the kernel ridge regression estimator, which is given by $\hat f(x) = Y^\top (K_X + \lambda I)^{-1} k(x)$, where $K_X = (k(x_i, x_j))_{i,j=1}^n \in \mathbb{R}^{n \times n}$, $k(x) = [k(x, x_1), \ldots, k(x, x_n)]^\top \in \mathbb{R}^n$, and $Y = [y_1, \ldots, y_n]^\top \in \mathbb{R}^n$ for a kernel function $k: \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$; this estimator is linear in the observation vector $Y$. Since this form matches the definition above, kernel ridge regression with any kernel function is a linear estimator. The choice of the $\varphi_i$ is arbitrary, and thus the choice of the kernel function is also arbitrary; we may even choose the best kernel function before observing the data. Nevertheless, as we show in Theorem 4.1, any such estimator suffers from a suboptimal rate. Other examples include the k-NN estimator and the Nadaraya-Watson estimator; thus our analysis establishes the suboptimality not only of kernel methods but also of these well-known linear estimators. Suzuki (2018) and Hayakawa & Suzuki (2020) utilized such an argument to show the superiority of deep learning but did not present a tractable optimization algorithm.
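To make the definition concrete, the following sketch (our own, with an RBF kernel as an arbitrary choice) implements kernel ridge regression and checks that the fitted prediction is linear in the observation vector $Y$, which is exactly the property that places it in the class of linear estimators.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, lam = 40, 3, 1e-2

def rbf(A, B, gamma=1.0):
    """Gaussian (RBF) kernel matrix between row sets A and B."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

X = rng.standard_normal((n, d))
K = rbf(X, X)

def krr_predict(y, X_new):
    """Kernel ridge regression: f(x) = Y^T (K_X + lam I)^{-1} k(x)."""
    alpha = np.linalg.solve(K + lam * np.eye(n), y)
    return rbf(X_new, X) @ alpha

# Linearity in Y: the prediction for y1 + y2 equals the sum of predictions,
# i.e., f(x) = sum_i y_i * phi_i(x_1, ..., x_n, x) with phi_i independent of y.
y1, y2 = rng.standard_normal(n), rng.standard_normal(n)
X_test = rng.standard_normal((5, d))
print(np.allclose(krr_predict(y1 + y2, X_test),
                  krr_predict(y1, X_test) + krr_predict(y2, X_test)))  # True
```

The same linearity check applies verbatim to k-NN and Nadaraya-Watson predictions, since their weights also depend only on the inputs.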

3.2. STUDENT NETWORKS TRAINED BY TWO-PHASE GRADIENT DESCENT

We prepare the neural network trained on the observed data (called the student), defined by $f(x; \theta) = \sum_{j=1}^m a_j \sigma(\langle w_j, x \rangle)$, where $\theta = ((a_1, w_1), \ldots, (a_m, w_m)) \in \mathbb{R}^{(d+1)m} =: \Theta$. We assume that the student and teacher networks have the same width. Based on this formulation, we aim to train a parameter $\theta$ that is provably close to that of the teacher network. To this end, we introduce the training algorithm considered in this paper: two-phase gradient descent. Phase I: noisy gradient descent. For $r \in \mathbb{R}$, let $\bar r := R \tanh(r|r|/2R)$ be a clipping of $r$, where $R > 1$ is a fixed constant. In the first phase, we conduct noisy gradient descent with weight-decay regularization. The objective function used to train the student network is $\widehat{R}_\lambda(\theta) := \frac{1}{2n} \sum_{i=1}^n (y_i - f(x_i; \bar\theta))^2 + \lambda \sum_{j=1}^m (|a_j|^2 + \|w_j\|^2)$, where $\bar\theta$ is the element-wise clipping of $\theta$ and $\lambda > 0$ is a regularization parameter. The parameter clipping ensures a bounded objective value and smoothness of the expected risk around the origin, which will be helpful in our analysis. The parameters of the student network are updated by $\theta^{(k+1)} = \theta^{(k)} - \eta^{(1)} \nabla \widehat{R}_\lambda(\theta^{(k)}) + \sqrt{2\eta^{(1)}/\beta}\, \zeta^{(k)}$, where $\eta^{(1)} > 0$ is a step size, $(\zeta^{(k)})_{k=1}^\infty$ are i.i.d. noises drawn from the standard normal distribution, and $\beta > 0$ is a constant called the inverse temperature. This type of noisy gradient descent is called gradient Langevin dynamics. It is known that by taking $\beta$ large, we can ensure that a smooth objective function will decrease. However, because of the non-smoothness of the ReLU activation, the objective $\widehat{R}_\lambda$ is also non-smooth, and hence it is difficult to guarantee a small objective value directly. To overcome this problem, our theoretical analysis evaluates the expected objective instead, given by $R_\lambda(\theta) := \frac{1}{2} \mathbb{E}_x\big[(f_{a^\circ, W^\circ}(x) - f(x; \bar\theta))^2\big] + \lambda \sum_{j=1}^m (|a_j|^2 + \|w_j\|^2)$.
We can ensure a small $R_\lambda(\theta)$ after a sufficient number of iterations (see Section 4.2 for details). Phase II: vanilla gradient descent. After phase I, we can ensure that for each neuron of the student network there is a neuron of the teacher network that is relatively close to it. We then conduct vanilla gradient descent to estimate the parameters of the teacher more precisely. Before running gradient descent, we rescale the parameters as $a^{(k)}_j \leftarrow \mathrm{sgn}(\bar a^{(k)}_j)$, $w^{(k)}_j \leftarrow |\bar a^{(k)}_j|\, \bar w^{(k)}_j$ for all $j \in [m]$. This transformation does not change the output of the student network, thanks to the 1-homogeneity of the ReLU activation. After that, we update the parameters of the first layer by $W^{(k+1)} = W^{(k)} - \eta^{(2)} \nabla_W \widehat{R}(W^{(k)})$, where $\eta^{(2)} > 0$ is a step size different from $\eta^{(1)}$ and $\widehat{R}(W) := \frac{1}{2n} \sum_{i=1}^n (y_i - f(x_i; \theta))^2$. In this phase, we no longer need to update the second-layer parameters; moreover, the regularization term and the gradient noise added in phase I are unnecessary. These simplifications of the optimization algorithm are justified by the strong convexity of $\widehat{R}(W)$ around $W^\circ$, the parameters of the teacher network. The analysis of this local convergence property is based on that of Zhang et al. (2019), and it eventually allows us to evaluate the excess risk of the student network. The overall training procedure is given in Algorithm 1. In summary, the role of each phase is as follows: in phase I, the student network explores the parameter space globally and finds parameters relatively close to the teacher's, and in phase II, vanilla gradient descent on the first layer refines those parameters, as analyzed in Section 4.2. Remark 3.1. Akiyama & Suzuki (2021) also considered the convergence of gradient descent in a teacher-student model.
They considered a sparse regularization, $\sum_{j=1}^m |a_j| \|w_j\|$, for the ReLU activation, while we consider the $L^2$-regularization $\sum_{j=1}^m (|a_j|^2 + \|w_j\|^2)$. These two regularizations are essentially the same, since the minimum of the latter under the constraint $|a_j| \|w_j\| = \mathrm{const.}$ is $2 \sum_{j=1}^m |a_j| \|w_j\|$ by the arithmetic-geometric mean inequality. On the other hand, Akiyama & Suzuki (2021) considered vanilla gradient descent instead of noisy gradient descent. This makes it difficult to reach the local region around the optimal solution, and their analysis required an exponentially large width to find that region. In contrast, we may use a narrow network with the same width as the teacher network. This is due to the ability of gradient Langevin dynamics to explore the entire space and find a near-global optimal solution.

Algorithm 1 Two-Phase Gradient Descent

Input: max iterations $k^{(1)}$ and $k^{(2)}$, step sizes $\eta^{(1)}, \eta^{(2)} > 0$, regularization parameter $\lambda > 0$, inverse temperature $\beta > 0$.
1: Initialization: $\theta^{(0)} \sim \rho_0$.
2: for $k = 1, 2, \ldots, k^{(1)}$ do
3:   $\zeta^{(k)} \sim N(0, I_{m(d+1)})$
4:   $\theta^{(k+1)} = \theta^{(k)} - \eta^{(1)} \nabla \widehat{R}_\lambda(\theta^{(k)}) + \sqrt{2\eta^{(1)}/\beta}\, \zeta^{(k)}$
5: end for
6: Reparameterization: $a^{(k)}_j = \mathrm{sgn}(\bar a^{(k)}_j)$, $w^{(k)}_j = |\bar a^{(k)}_j|\, \bar w^{(k)}_j$
7: for $k = k^{(1)} + 1, k^{(1)} + 2, \ldots, k^{(2)}$ do
8:   $W^{(k+1)} = W^{(k)} - \eta^{(2)} \nabla_W \widehat{R}(W^{(k)})$
9: end for
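A minimal numerical sketch of Algorithm 1 follows (our own toy instantiation: it uses finite-difference gradients for brevity, whereas the paper's analysis concerns exact (sub)gradients, and the problem size and hyperparameters are illustrative, not the paper's).

```python
import numpy as np

rng = np.random.default_rng(3)
d, m, n = 4, 2, 100
R, lam, beta = 5.0, 1e-3, 500.0

def net(theta, X):
    a, W = theta[:m], theta[m:].reshape(d, m)
    return np.maximum(X @ W, 0.0) @ a

# toy teacher: a_j in {±1}, two orthogonal weight vectors of norm 0.5
theta_t = np.concatenate([np.array([1.0, -1.0]), (0.5 * np.eye(d)[:, :m]).ravel()])
X = rng.standard_normal((n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)  # inputs uniform on the sphere
y = net(theta_t, X)

def clip(t):
    return R * np.tanh(t * np.abs(t) / (2 * R))

def obj(theta):
    # phase-I objective: clipped squared loss + weight decay
    return 0.5 * np.mean((y - net(clip(theta), X)) ** 2) + lam * np.sum(theta ** 2)

def num_grad(f, t, eps=1e-5):
    g = np.zeros_like(t)
    for i in range(t.size):
        e = np.zeros_like(t); e[i] = eps
        g[i] = (f(t + e) - f(t - e)) / (2 * eps)
    return g

theta, eta1 = 0.1 * rng.standard_normal((d + 1) * m), 0.05
for _ in range(300):  # phase I: gradient Langevin dynamics
    theta = (theta - eta1 * num_grad(obj, theta)
             + np.sqrt(2 * eta1 / beta) * rng.standard_normal(theta.shape))

# reparameterization: a_j <- sgn(a_j), w_j <- |a_j| w_j (output-preserving)
a = np.sign(theta[:m])
W = theta[m:].reshape(d, m) * np.abs(theta[:m])

def phase2_obj(w_flat):
    return 0.5 * np.mean((y - net(np.concatenate([a, w_flat]), X)) ** 2)

w_flat, eta2 = W.ravel(), 0.1
start = phase2_obj(w_flat)
for _ in range(400):  # phase II: vanilla gradient descent on the first layer
    w_flat = w_flat - eta2 * num_grad(phase2_obj, w_flat)
print(start, phase2_obj(w_flat))  # phase II decreases the empirical risk
```

Note how the sketch mirrors the two roles described above: the Langevin noise in phase I drives global exploration, while phase II is plain gradient descent on the first layer only, with no noise and no regularization.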

4. EXCESS RISK ANALYSIS AND ITS COMPARISON

This section provides the excess risk bounds for linear estimators and for the deep learning estimator (the trained student network). More precisely, we give a lower bound for linear estimators and an upper bound for the student network. As a consequence, we show that the student network achieves a faster learning rate and suffers less from the curse of dimensionality than linear estimators.

4.1. MINIMAX LOWER BOUND FOR LINEAR ESTIMATORS

Here, we analyze the excess risk of linear estimators and give a lower bound on it. More specifically, we consider the minimax excess risk over the class of linear estimators: $R^{\mathrm{lin}}(\mathcal{F}^\circ) = \inf_{\hat f:\,\mathrm{linear}} \sup_{f^\circ \in \mathcal{F}^\circ} \mathbb{E}_{D_n}\big[\|\hat f - f^\circ\|^2_{L^2(P_X)}\big]$, where the infimum is taken over all linear estimators and the expectation is taken over the training data. This expresses the infimum of the worst-case error over the class of linear estimators for estimating the function class $\mathcal{F}^\circ$; namely, no linear estimator can achieve an excess risk faster than $R^{\mathrm{lin}}(\mathcal{F}^\circ)$. Based on this concept, we provide our result on the excess risk of linear estimators. Under the definition of $\mathcal{F}^\circ$ in Eq. (1), we obtain the following lower bound. Theorem 4.1. For arbitrarily small $\kappa > 0$, we have $R^{\mathrm{lin}}(\mathcal{F}^\circ) \gtrsim n^{-\frac{d+2}{2d+2}} n^{-\kappa}$. The proof is given in Appendix A. This theorem implies that for $d \ge 2$, the convergence rate of the excess risk is no faster than $n^{-\frac{2+2}{2 \cdot 2 + 2}} = n^{-2/3}$. Moreover, since $-\frac{d+2}{2d+2} \to -1/2$ as $d \to \infty$, the convergence rate of the excess risk approaches $n^{-1/2}$ in high-dimensional settings, which coincides with the generalization bound derived by the Rademacher complexity argument. Hence, we can conclude that linear estimators suffer from the curse of dimensionality. The key tool behind this theorem is the following "convex hull argument": $R^{\mathrm{lin}}(\mathcal{F}^\circ) = R^{\mathrm{lin}}(\overline{\mathrm{conv}}(\mathcal{F}^\circ))$, where $\mathrm{conv}(\mathcal{F}^\circ) := \{\sum_{j=1}^N \lambda_j f_j \mid N \in \mathbb{N},\ f_j \in \mathcal{F}^\circ,\ \lambda_j \ge 0,\ \sum_{j=1}^N \lambda_j = 1\}$ and $\overline{\mathrm{conv}}(\cdot)$ is the closure of $\mathrm{conv}(\cdot)$ in $L^2(P_X)$. Combining this argument with the minimax optimal rate analysis for linear estimators developed by Zhang et al. (2002), we obtain the rate in Theorem 4.1. This equality implies that linear estimators cannot distinguish the original class $\mathcal{F}^\circ$ from its convex hull $\overline{\mathrm{conv}}(\mathcal{F}^\circ)$.
Therefore, if the function class $\mathcal{F}^\circ$ is highly non-convex, linear estimators incur a much slower convergence rate, since $\overline{\mathrm{conv}}(\mathcal{F}^\circ)$ can be much larger than the original class $\mathcal{F}^\circ$. Indeed, we can show that the convex hull of the teacher network class is considerably larger than the original function class, which causes the curse of dimensionality. For example, the mean of two teacher networks of width $m$ can be a network of width $2m$, which shows that $\overline{\mathrm{conv}}(\mathcal{F}^\circ)$ can consist of much wider networks. See Appendix A for details.
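The width-doubling phenomenon is easy to verify numerically. In this sketch (our own illustration; the teacher construction is an arbitrary choice), the average of two width-$m$ teachers is represented exactly by a width-$2m$ network, so the convex hull contains functions outside the width-$m$ class:

```python
import numpy as np

rng = np.random.default_rng(4)
d, m = 6, 3

def net(a, W, X):
    return np.maximum(X @ W, 0.0) @ a

# two teachers of width m (scaled orthonormal columns, an arbitrary choice)
a1 = rng.choice([-1.0, 1.0], m); W1 = 0.5 * np.linalg.qr(rng.standard_normal((d, m)))[0]
a2 = rng.choice([-1.0, 1.0], m); W2 = 0.5 * np.linalg.qr(rng.standard_normal((d, m)))[0]

# their average is a width-2m network: concatenate the neurons, halve the signs
a_mix = 0.5 * np.concatenate([a1, a2])
W_mix = np.concatenate([W1, W2], axis=1)

X = rng.standard_normal((1000, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)
lhs = 0.5 * (net(a1, W1, X) + net(a2, W2, X))
print(np.allclose(lhs, net(a_mix, W_mix, X)))  # True: the mixture needs width 2m
```

Iterating this argument, $N$-fold mixtures require width $Nm$, which is why the convex hull grows so much richer than the original class.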

4.2. EXCESS RISK OF THE NEURAL NETWORKS

Here, we give an upper bound on the excess risk of the student network trained by Algorithm 1. The main result is Theorem 4.6, which states that the student network achieves an excess risk of $O(n^{-1})$ up to logarithmic factors. This consequence is obtained by a three-step analysis. First, (1) we provide convergence guarantees for phases I and II of Algorithm 1: we show that phase I makes the value of $R_\lambda(\theta^{(k)})$ sufficiently small (Proposition 4.3). Then, (2) we show that the parameters of the student network and the teacher network are close to each other (Proposition 4.4); using the strong convexity around the parameters of the teacher network, the convergence of phase II is ensured. By combining these results, (3) we obtain the excess risk bound of Theorem 4.6. (1) Convergence in phase I: First, we provide the convergence result and the proof strategy for phase I. Since the ReLU activation is non-smooth, the empirical objective $\widehat{R}_\lambda(\cdot)$ is also non-smooth, which makes it difficult to ensure the convergence of gradient Langevin dynamics directly. To overcome this problem, we instead evaluate the value of $R_\lambda(\cdot)$ along the update $\theta^{(k+1)} = \theta^{(k)} - \eta^{(1)} \nabla \widehat{R}_\lambda(\theta^{(k)}) + \sqrt{2\eta^{(1)}/\beta}\, \zeta^{(k)}$, and bound the residual caused by using the gradient of $\widehat{R}_\lambda(\cdot)$ in place of that of $R_\lambda(\cdot)$. This update can be interpreted as a discretization of the stochastic differential equation $d\theta_t = -\beta \nabla R_\lambda(\theta_t)\, dt + \sqrt{2}\, dB_t$, where $(B_t)_{t \ge 0}$ is the standard Brownian motion in $\Theta(= \mathbb{R}^{(d+1)m})$. It is known that this process has a unique invariant distribution $\pi_\infty$ satisfying $\frac{d\pi_\infty}{d\theta}(\theta) \propto \exp(-\beta R_\lambda(\theta))$. Intuitively, as $\beta \to \infty$, this invariant measure concentrates around the minimizer of $R_\lambda$. Hence, by taking $\beta$ sufficiently large, obtaining a near-optimal solution is guaranteed. Such optimization guarantees appear in recent works (Raginsky et al., 2017; Erdogdu et al., 2018). However, as stated above, they require a smooth objective function.
Therefore we cannot apply the same technique here directly. To overcome this difficulty, we evaluate the difference between $\nabla \widehat{R}_\lambda$ and $\nabla R_\lambda$ as follows. Lemma 4.2. There exists a constant $C > 0$ such that, with probability at least $1 - \delta$, it holds that $V_{\mathrm{grad}} := \sup_\theta \|\nabla R_\lambda(\theta) - \nabla \widehat{R}_\lambda(\theta)\| \le C R^3 m \sqrt{\frac{d \log(mdn/\delta)}{n}}$. This lemma implies that, with high probability, the difference between $\nabla \widehat{R}_\lambda$ and $\nabla R_\lambda$ vanishes as $n \to \infty$. Thanks to this lemma, we can connect the dynamics of the non-smooth objective with those of the smooth objective and import the convergence analysis developed for smooth objectives. In particular, we utilize the technique developed by Vempala & Wibisono (2019) (see Appendix C for details). We note that our result extends that of Vempala & Wibisono (2019) in that it gives convergence for the non-differentiable objective $\widehat{R}_\lambda(\cdot)$. As a consequence, we obtain the following convergence result for phase I. Proposition 4.3. Let $R^*_\lambda$ be the minimum value of $R_\lambda$ over $\Theta$, let $q$ be the density of $\pi_\infty$ (i.e., $q(\theta) \propto \exp(-\beta R_\lambda(\theta))$), and let $H_q(p) := \int p(\theta) \log \frac{p(\theta)}{q(\theta)}\, d\theta$ be the KL divergence. There exist constants $c, C > 0$ and a log-Sobolev constant $\alpha$ (defined in Lemma C.4) such that, with step size $0 < \eta^{(1)} < c \frac{\delta \lambda \alpha}{\beta R^3 m^3 d}$, after $k^{(1)} \ge \frac{\beta}{\alpha \eta^{(1)}} \log \frac{2 H_q(\rho_0)}{\delta}$ iterations, the output $\theta^{(k)}$ satisfies $\mathbb{E}[R_\lambda(\theta^{(k)})] - R^*_\lambda \le C \big[ (\lambda + m) \exp(m^2 \beta) \delta + \frac{1}{3 n \lambda} + \frac{d}{2\beta} \log \frac{m^3 d \beta}{\lambda} \big]$ with probability at least $1 - \delta$, where the expectation is taken over the initialization and the Gaussian noise added in the algorithm. Therefore, phase I can find a near-optimal solution with polynomial (in $n$) time complexity even though the objective function is non-smooth. One might also consider using gradient Langevin dynamics with a higher $\beta$ to reach the global optimal solution directly.
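The role of the inverse temperature can be seen already in one dimension. The sketch below (a standalone illustration with a hypothetical double-well objective, not the paper's $R_\lambda$) runs the discretized Langevin update and shows that a larger $\beta$ concentrates the iterates near the global minimizer, as predicted by $\pi_\infty \propto \exp(-\beta R_\lambda)$:

```python
import numpy as np

rng = np.random.default_rng(5)

# a smooth double-well with global minimum near x = -1 (the +0.3x term
# tilts the well at x = +1 upward, making it a spurious local minimum)
def R(x):  return (x**2 - 1.0)**2 + 0.3 * x
def dR(x): return 4.0 * x * (x**2 - 1.0) + 0.3

def langevin_fraction_negative(beta, eta=1e-3, burn=20_000, keep=30_000):
    """Run x <- x - eta*dR(x) + sqrt(2*eta/beta)*xi and report the fraction
    of post-burn-in iterates in the basin of the global minimum (x < 0)."""
    x, neg = 0.0, 0
    for k in range(burn + keep):
        x = x - eta * dR(x) + np.sqrt(2.0 * eta / beta) * rng.standard_normal()
        if k >= burn and x < 0.0:
            neg += 1
    return neg / keep

for beta in (1.0, 20.0):
    print(beta, langevin_fraction_negative(beta))
```

At small $\beta$ the chain spends a substantial fraction of time in the spurious well, while at large $\beta$ it concentrates near the global minimizer; the flip side, as discussed next, is that large $\beta$ slows mixing, which is exactly why the paper switches to vanilla gradient descent in phase II.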
However, that would require an inverse temperature $\beta$ exponentially large in $n$ and the other parameters, leading to exponential computational complexity. To overcome this difficulty, we utilize the local landscape of the objective function. We can show that the objective function is strongly convex around the teacher parameters, so we no longer need the gradient noise or any regularization there. Indeed, vanilla gradient descent reaches the global optimal solution in phase II, as shown in the following. (2) Convergence in phase II: Next, we prove the convergence guarantee of phase II and provide an upper bound of the excess risk. The convergence result is based on the fact that when $R_\lambda(\theta)$ is small enough (as guaranteed by Proposition 4.3), the parameters of the student network are close to those of the teacher network, as in the following proposition. Proposition 4.4. There exists a threshold $\epsilon_0 = \mathrm{poly}(m^{-1}, \sigma_{\min})$ such that, taking $\lambda \le \epsilon_0 / m$, if $\epsilon = R_\lambda(\theta) - R^*_\lambda \le \epsilon_0$, then for every $j \in [m]$ there exists $k_j \in [m]$ such that $\mathrm{sgn}(a_{k_j}) = a^\circ_j$ and $\| |a_{k_j}| w_{k_j} - w^\circ_j \| \le c\, \sigma_m / (\kappa^3 m^3)$. The proof of this proposition is given in Appendix D. We utilize the technique of Zhou et al. (2021), which gives the same result in the case where the activation is the absolute value function. In this proposition, we compare the parameters of the teacher network with the normalized student parameters; this normalization is needed because of the 1-homogeneity of the ReLU activation. The inequality $\| |a_{k_j}| w_{k_j} - w^\circ_j \| \le c\, \sigma_m / (\kappa^3 m^3)$ ensures the closeness of the parameters in both direction and amplitude. Combining this with the equality $\mathrm{sgn}(a_{k_j}) = a^\circ_j$, we can conclude the closeness and move on to ensuring local convergence. Thanks to this closeness and the local strong convexity, we can ensure the convergence of phase II as follows. Lemma 4.5. Let $\kappa := \sigma_1 / \sigma_m$ and $\bar\sigma := (\sum_{j=1}^m \sigma_j) / (\sigma_m m)$.
Suppose that the condition of Proposition 4.4 holds. Then there exist absolute constants $c_1, c_2, c_3, c_4, c_5$ such that, under $n \ge c_1 \frac{\kappa^{10} m^9 d}{\sigma_m} \log \frac{\kappa m d}{\sigma_m} \cdot (\|W^\circ\|_F^2 + v^2)$, the output of gradient descent with step size $\eta \le \frac{1}{c_2 \kappa m^2}$ satisfies $\|\hat f - f^\circ\|^2_{L^2(P_X)} \lesssim \bar\sigma^2 \sigma_{\min}^{-4} \frac{m^5 \log n}{n} + c_4 \big(1 - \frac{c_3 \eta}{\bar\sigma \kappa^2}\big)^k \cdot \frac{\sigma_m}{\kappa^3 m^2}$ after $k$ iterations, with probability at least $1 - c_5 d^{-10}$. (3) Unified risk bound: By combining (1) and (2), we obtain a unified result as follows. Theorem 4.6. There exist $\epsilon_0 = \mathrm{poly}(m^{-1}, \sigma_{\min})$ and constants $C, C' > 0$ such that for any $0 < b < 1$, letting $\sigma_b := b\, \epsilon_0$, under $n \ge \lambda \sigma_b^{-3} \exp(\sigma_b^{-1} m^2)$, taking $k^{(1)} = C \lambda^{-2} \beta^{-1} \exp(m^2 \beta)$ and $k^{(2)} = k^{(1)} + \log(C' n \eta^{(2)-2})$, the output of Algorithm 1 with $\lambda = \sigma_b d^{-1}$, $\beta = \Omega(\sigma_b^{-1} d)$, $\eta^{(1)} = O(\lambda \sigma_b^3 \exp(\sigma_b^{-1} m^2))$, and $\eta^{(2)} = O(\sigma_{\min} m^{-2})$ satisfies $\|\hat f - f^\circ\|^2_{L^2(P_X)} \lesssim \bar\sigma^2 \sigma_{\min}^{-4} \frac{m^5 \log n}{n}$ with probability at least $1 - b - d^{-10}$, where $\bar\sigma = (\sum_{j=1}^m \sigma_j) / (\sigma_m m)$. The proof of this theorem is also given in Appendix D. This theorem implies that, for fixed $m$, the excess risk of the student network is bounded as $\mathbb{E}_{D_n}[\|\hat f - f^\circ\|^2_{L^2(P_X)}] \lesssim n^{-1}$ up to logarithmic factors. Compared with the lower bound derived for linear estimators in Theorem 4.1, we obtain the faster rate $n^{-1}$ in terms of the sample size. Moreover, the dependence of the excess risk on the dimensionality $d$ does not appear explicitly. Therefore, the student network suffers less from the curse of dimensionality than linear estimators. As pointed out in the previous subsection, the convex hull argument causes the curse of dimensionality for linear estimators because they can only prepare a fixed basis. The student network, in contrast, can "find" the basis of the teacher network via noisy gradient descent in phase I and thereby avoids the curse of dimensionality. Remark 4.7. Akiyama & Suzuki (2021) established a local convergence theory for students wider than the teacher.
However, their result does not apply here, since they only consider teachers whose parameters are orthogonal to each other. Suzuki & Akiyama (2021) also showed the benefit of neural networks and the superiority of deep learning in a teacher-student setting where the teacher has infinite width. They assume that the teacher has decaying importance; that is, the teacher can be written as $f^\circ(x) = \sum_{j=1}^\infty a^\circ_j \sigma(\langle w^\circ_j, x \rangle)$, where $a^\circ_j \lesssim j^{-a}$ and $\|w^\circ_j\| \lesssim j^{-b}$ (with exponents $a, b > 0$) for a bounded smooth activation $\sigma$. Our analysis does not assume decaying importance, and our activation function is the non-differentiable ReLU. Moreover, Suzuki & Akiyama (2021) considered pure gradient Langevin dynamics instead of a two-stage algorithm, which results in exponential computational complexity, in contrast to our analysis.

5. NUMERICAL EXPERIMENT

In this section, we conduct a numerical experiment to support our theoretical results. We apply Algorithm 1 to the setting $d = m = 50$. For the teacher network, we employ $a_j^\circ = 1$ for $1 \le j \le 25$, $a_j^\circ = -1$ for $26 \le j \le 50$, and $(w_1^\circ, \dots, w_{50}^\circ) = I_{50}$ as its parameters. The parameters of the student network are initialized by $\theta^{(0)} \sim N(0, I_{m(d+1)})$. We use a sample of size $n = 1000$ as the training data. Hyperparameters are set to $\eta^{(1)} = \eta^{(2)} = 0.01$, $\beta = 100$, $\lambda = 0.01$, $k^{(1)}_{\max} = 1000$, and $k^{(2)}_{\max} = 2000$. Figure 1 shows the experimental result. The orange line represents the training loss without the regularization term. The blue line represents the test loss. Since we can compute the generalization error analytically (see Appendix B), we use its value as the test loss. We can see that in phase I, both the training and test losses decrease at first and then flatten out. At the beginning of phase II, both the training and test losses decrease linearly. This reflects the strong convexity around the parameters of the teacher network, as stated in the convergence guarantee for phase II. While the training loss keeps fluctuating, the curve of the test loss is relatively smooth. This difference is due to the smoothness of the generalization loss (or $\bar{R}_\lambda$), which we use in the convergence analysis of phase I. The test loss does not keep decreasing but converges to a constant. The sample noise causes this phenomenon: even if the parameters of the student coincide with those of the teacher, its training loss will not be zero. Thus the numerical experiment is consistent with our theoretical results.
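The two-phase procedure above can be sketched in a few dozen lines. The following NumPy script is our own scaled-down illustration, not the authors' code: it uses smaller $d$, $m$, and iteration counts than the experiment above and omits the parameter clipping. Phase I runs noisy gradient descent, i.e., gradient descent plus Gaussian noise of scale $\sqrt{2\eta/\beta}$, and phase II runs vanilla gradient descent.

```python
import numpy as np

rng = np.random.default_rng(0)

d = m = 8          # scaled down from d = m = 50 in the paper
n = 500
eta, beta, lam = 0.01, 100.0, 0.01
k1, k2 = 300, 300  # phase I / phase II iteration counts (reduced)

# Teacher: a_j = +1 for the first half, -1 for the second half; W = I.
a_t = np.concatenate([np.ones(m // 2), -np.ones(m - m // 2)])
W_t = np.eye(m, d)

def forward(a, W, X):
    # f(x; theta) = sum_j a_j * relu(<w_j, x>)
    return np.maximum(W @ X.T, 0.0).T @ a

def grads(a, W, X, y, lam):
    # gradients of mean squared error + lam * (||a||^2 + ||W||_F^2)
    pre = X @ W.T                      # (n, m) pre-activations
    act = np.maximum(pre, 0.0)
    r = act @ a - y                    # residuals
    ga = 2 * act.T @ r / len(y) + 2 * lam * a
    gW = 2 * (a[:, None] * ((r[:, None] * (pre > 0)).T @ X)) / len(y) + 2 * lam * W
    return ga, gW

# Training data: inputs on the unit sphere, noiseless teacher outputs.
X = rng.standard_normal((n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)
y = forward(a_t, W_t, X)

a = rng.standard_normal(m)
W = rng.standard_normal((m, d))

losses = []
for k in range(k1 + k2):
    ga, gW = grads(a, W, X, y, lam)
    if k < k1:  # phase I: noisy (Langevin-type) gradient descent
        a -= eta * ga + np.sqrt(2 * eta / beta) * rng.standard_normal(m)
        W -= eta * gW + np.sqrt(2 * eta / beta) * rng.standard_normal((m, d))
    else:       # phase II: vanilla gradient descent
        a -= eta * ga
        W -= eta * gW
    losses.append(np.mean((forward(a, W, X) - y) ** 2))
```

As in the figure described above, `losses` (tracked without the regularization term) drops during phase I and decreases further once phase II removes the injected noise.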

6. CONCLUSION

In this paper, we focused on a nonparametric regression problem in which the true function is given by a two-layer neural network with the ReLU activation, and evaluated the excess risks of linear estimators and of neural networks trained by two-phase gradient descent. Our analysis revealed that while any linear estimator suffers from the curse of dimensionality, deep learning can avoid it and outperform linear estimators, which include the neural tangent kernel approach, random feature models, and other kernel methods. Essentially, the non-convexity of the model induces this difference.

A PROOF OF THEOREM 4.1

First, we introduce the formal statement of the "convex hull argument" stated in Section 4.1.

Proposition A.1 (Hayakawa & Suzuki (2020)). The minimax optimal rate of linear estimators on a target function class $F^\circ$ is the same as that on the convex hull of $F^\circ$: $R_{\mathrm{lin}}(F^\circ) = R_{\mathrm{lin}}(\overline{\mathrm{conv}}(F^\circ))$, where $\mathrm{conv}(F^\circ) := \{\sum_{j=1}^N \lambda_j f_j \mid N \in \mathbb{N},\ f_j \in F^\circ,\ \lambda_j \ge 0,\ \sum_{j=1}^N \lambda_j = 1\}$ and $\overline{\mathrm{conv}}(\cdot)$ is the closure of $\mathrm{conv}(\cdot)$ in $L_2(P_X)$.

For the proof, we use this convex hull argument and the minimax optimal rate analysis for linear estimators developed by Zhang et al. (2002). They essentially showed the following statement in their Theorem 1. Note that they consider the class of linear estimators on the Euclidean space, but the same argument applies to the class of linear estimators on $S^{d-1}$.

Proposition A.2 (Theorem 1 of Zhang et al. (2002)). Let $\mu$ be the uniform measure on $S^{d-1}$ with $\mu(S^{d-1}) = 1$. Suppose that the space $\Omega$ has an even partition $\mathcal{A}$ such that $|\mathcal{A}| = 2^K$ for an integer $K \in \mathbb{N}$, each $A \in \mathcal{A}$ has measure $\alpha_1 2^{-K} \le \mu(A) \le \alpha_2 2^{-K}$ for constants $\alpha_1, \alpha_2 > 0$, and $\mathcal{A}$ is indeed a partition of $\Omega$, i.e., $\cup_{A \in \mathcal{A}} A = \Omega$ and $A \cap A' = \emptyset$ for $A, A' \in \mathcal{A}$ with $A \ne A'$. Then, if $K$ is chosen so that $n^{-\gamma_1} \le 2^{-K} \le n^{-\gamma_2}$ for constants $\gamma_1, \gamma_2 > 0$ independent of $n$, there exists an event $E$ such that, for a constant $C > 0$, $P(E) \ge 1 - o(1)$ and $|\{x_i \mid x_i \in A\ (i \in \{1, \dots, n\})\}| \le C \alpha_2 n 2^{-K}$ $(\forall A \in \mathcal{A})$. Moreover, suppose that, for a class $F^\circ$ of functions on $\Omega$, there exists $\Delta > 0$ that satisfies the following conditions: 1. There exists $F > 0$ such that, for any $A \in \mathcal{A}$, there exists $g \in F^\circ$ that satisfies $g(x) \ge \frac12 \Delta F$ for all $x \in A$; 2. There exist $K$ and $C' > 0$ such that $\frac{1}{n}\sum_{i=1}^n g(x_i)^2 \le C' \Delta^2 2^{-K}$ for any $g \in F^\circ$ on the event $E$. Then, there exists a constant $F_1$ such that at least one of the following inequalities holds for sufficiently large $n$:
$$\frac{F^2}{4 F_1 C'}\, \frac{2^K}{n} \le R_{\mathrm{lin}}(F^\circ), \qquad \frac{F^3}{32}\, \Delta^2 2^{-K} \le R_{\mathrm{lin}}(F^\circ).$$

Lemma A.3. Let $0 < \Delta \le 1/2$ and let $g: S^{d-1} \to \mathbb{R}$ be the function defined by
$$g(x) = \frac{1}{d-1} \sum_{j=2}^{d} \Big[-\sigma(x_j) + \frac12 \sigma(x_j + 2\Delta x_1) + \frac12 \sigma(x_j - 2\Delta x_1)\Big].$$
Then it holds that $g(x) \ge \Delta/2$ for $x \in B^\infty_\Delta(e_1)$ and $g(x') = 0$ for $x' \notin B^\infty_{2\Delta}(e_1)$, where $e_1 := (1, 0, \dots, 0) \in S^{d-1}$ and $B^\infty_r(e_1) := \{x \in S^{d-1} \mid \|x - e_1\|_\infty \le r\}$ for $r > 0$.

Proof. Let $g_j(x) = -\sigma(x_j) + \frac12 \sigma(x_j + 2\Delta x_1) + \frac12 \sigma(x_j - 2\Delta x_1)$. First, suppose that $x \in B^\infty_\Delta(e_1)$. Then we have $x_1 \ge 1 - \Delta$ and $|x_j| \le \Delta$ for any $j \in \{2, \dots, d\}$. If $0 \le x_j \le \Delta$, it holds that
$$g_j(x) = -\sigma(x_j) + \frac12 \sigma(x_j + 2\Delta x_1) + \frac12 \sigma(x_j - 2\Delta x_1) = \frac12 (2\Delta x_1 - x_j) \ge \Delta(1 - \Delta) \ge \frac12 \Delta.$$
Moreover, if $-\Delta \le x_j \le 0$, we get
$$g_j(x) = -\sigma(x_j) + \frac12 \sigma(x_j + 2\Delta x_1) + \frac12 \sigma(x_j - 2\Delta x_1) = \frac12 (x_j + 2\Delta x_1) \ge \Delta(1 - \Delta) \ge \frac12 \Delta.$$
Hence, we get the first assertion by $g(x) = \frac{1}{d-1} \sum_{j=2}^d g_j(x) \ge \frac{\Delta}{2}$.

Published as a conference paper at ICLR 2023

Next, suppose $x \notin B^\infty_{2\Delta}(e_1)$. Then it holds that $|x_j| \ge 2\Delta \ge 2\Delta x_1$ for any $j \in \{2, \dots, d\}$. Hence $|x_j / x_1| \ge 2\Delta$, and we obtain that $\mathrm{sgn}(x_j + 2\Delta x_1) = \mathrm{sgn}(x_j) = \mathrm{sgn}(x_j - 2\Delta x_1) \in \{\pm 1\}$. We can check $g_j(x) = 0$ in each case, and hence it holds that $g(x) = 0$. Thus we get the second assertion.

proof of Theorem 4.1. Let us consider a covering of $S^{d-1}$ by spherical caps, i.e., sets $B_r(x) \cap S^{d-1}$ for $x \in S^{d-1}$ with radius $r \in (0, 1)$.
It is known that there is such a covering $\mathcal{A}$ with $|\mathcal{A}| \sim r^{-d}$ (ignoring logarithmic terms). Then, by letting $r \sim 2^{-K/d}$, there exists a covering $\mathcal{A}$ satisfying $|\mathcal{A}| = 2^K$. For each $A \in \mathcal{A}$, we define a function $g_A$ in the same manner as in Lemma A.3: for $A \in \mathcal{A}$ written as $B_r(x_A) \cap S^{d-1}$ with $x_A \in S^{d-1}$, we take an orthogonal basis including $x_A$ and define $g_A$ regarding $x_A$ as $e_1$. Define $F^\circ_{\mathcal{A}} := \{g_A / 2 \mid A \in \mathcal{A}\}$. It is not difficult to check that $F^\circ_{\mathcal{A}} \subset \overline{\mathrm{conv}}(F^\circ)$. Then, by Proposition A.1, it holds that $R_{\mathrm{lin}}(F^\circ) = R_{\mathrm{lin}}(\overline{\mathrm{conv}}(F^\circ)) \ge R_{\mathrm{lin}}(F^\circ_{\mathcal{A}})$, where the inequality follows from $F^\circ_{\mathcal{A}} \subset \overline{\mathrm{conv}}(F^\circ)$. Hence, it suffices to lower-bound the right-hand side. Now, we apply Proposition A.2 with $F^\circ = F^\circ_{\mathcal{A}}$ and $K = K'$, applying Lemma A.3 with $\Delta = 2^{-K/d}$. On the event $E$ introduced in Proposition A.2, there exists a constant $C > 0$ such that $|\{x_i \mid x_i \in A\ (i \in \{1, \dots, n\})\}| \le C \alpha_2 n 2^{-K}$ for all $A \in \mathcal{A}$. Therefore, we obtain that
$$\frac{1}{n} \sum_{i=1}^n g_A(x_i)^2 \lesssim \frac{1}{n} \cdot n 2^{-K} \cdot \Delta^2 = 2^{-K} \Delta^2.$$
Therefore, Proposition A.2 gives $R_{\mathrm{lin}}(F^\circ_{\mathcal{A}}) \gtrsim \min\{2^K / n,\ 2^{-(1+2/d)K}\}$. By letting $2^K \sim n^{d/(2(d+1))}$, we get the assertion.
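The function $g$ from Lemma A.3 is simple enough to check numerically. The snippet below is an illustration we added (not part of the paper): it evaluates $g$ at a point in the sup-norm ball around $e_1$ and at a point far from it, confirming the two assertions of the lemma.

```python
import numpy as np

def relu(u):
    return np.maximum(u, 0.0)

def g(x, delta):
    # g(x) = (1/(d-1)) * sum_{j>=2} [ -relu(x_j)
    #        + relu(x_j + 2*delta*x_1)/2 + relu(x_j - 2*delta*x_1)/2 ]
    d = len(x)
    s = -relu(x[1:]) + 0.5 * relu(x[1:] + 2 * delta * x[0]) \
        + 0.5 * relu(x[1:] - 2 * delta * x[0])
    return s.sum() / (d - 1)

d, delta = 5, 0.25
e1 = np.zeros(d); e1[0] = 1.0
near = g(e1, delta)               # e_1 lies in B^inf_delta(e_1): expect >= delta/2

far = np.zeros(d); far[1] = 1.0   # x_2 = 1 >= 2*delta: expect g = 0
far_val = g(far, delta)
```

At $x = e_1$ every summand equals $\frac12\sigma(2\Delta) = \Delta$, so `near` equals $\Delta$, and at the far point every summand cancels to zero, as the lemma's proof argues.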

B EXPLICIT FORM OF THE OBJECTIVE FUNCTION AND ITS GRADIENT

In this section, we derive the explicit form of $\bar{R}_\lambda(\cdot)$ and its gradient, which we utilize in our analysis (especially that of the convergence in phase I). First, for $w, v \in \mathbb{R}^d \setminus \{0\}$, we have
$$\mathrm{E}_{x \sim P_X}[\sigma(\langle w, x\rangle)\sigma(\langle v, x\rangle)] = \frac{\mathrm{E}_{x \sim N(0, I_d)}[\sigma(\langle w, x\rangle)\sigma(\langle v, x\rangle)]}{\mathrm{E}_{x \sim N(0, I_d)}[\|x\|^2]} = \frac{\sin\phi(w, v) + (\pi - \phi(w, v))\cos\phi(w, v)}{2\pi d}\,\|w\|\|v\|,$$
where $\phi(w, v) := \arccos(\langle w, v\rangle / \|w\|\|v\|)$. The second equality follows from $\mathrm{E}_{x \sim N(0, I_d)}[\|x\|^2] = d$ and
$$\mathrm{E}_{x \sim N(0, I_d)}[\sigma(\langle w, x\rangle)\sigma(\langle v, x\rangle)] = \frac{\sin\phi(w, v) + (\pi - \phi(w, v))\cos\phi(w, v)}{2\pi}\,\|w\|\|v\|$$
(see Cho & Saul (2009) or Safran & Shamir (2018)). Moreover, the first equality follows from the fact that, for $x \sim N(0, I_d)$, $r^2 := \|x\|^2$ and $\varphi := x / \|x\|$ are random variables that independently follow the chi-squared distribution and the uniform distribution on $S^{d-1}$, respectively, and therefore
$$\mathrm{E}_{x \sim N(0, I_d)}[\sigma(\langle w, x\rangle)\sigma(\langle v, x\rangle)] = \mathrm{E}\big[r^2 \sigma(\langle w, \varphi\rangle)\sigma(\langle v, \varphi\rangle)\big] = \mathrm{E}_{x \sim P_X}[\sigma(\langle w, x\rangle)\sigma(\langle v, x\rangle)] \cdot \mathrm{E}_{x \sim N(0, I_d)}[\|x\|^2].$$
By using Eq. (2), we get
$$\bar{R}_\lambda(\theta) = \frac12 \mathrm{E}_x\big[\big(f_{a^\circ, W^\circ}(x) - f(x; \theta)\big)^2\big] + \lambda \|\theta\|^2 = \frac12 \mathrm{E}_x\big[(f_{a^\circ, W^\circ}(x))^2\big] - \sum_{i,j=1}^m \bar{a}_i a^\circ_j I(\bar{w}_i, w^\circ_j) + \frac12 \sum_{i,j=1}^m \bar{a}_i \bar{a}_j I(\bar{w}_i, \bar{w}_j) + \lambda \|\theta\|^2,$$
where $\bar{w}$ is the element-wise clipping of $w \in \mathbb{R}^d$ and
$$I(w, v) = \frac{\sin\phi(w, v) + (\pi - \phi(w, v))\cos\phi(w, v)}{2\pi d}\,\|w\|\|v\|.$$
Next, we move to derive the gradient of $\bar{R}_\lambda(\cdot)$. Note that $\frac{d\bar{r}}{dr} = |r| / \cosh^2(r|r|/2R)$. Then, since $e^r + e^{-r} \ge 2 + |r|$ for $r \in \mathbb{R}$, we have $\cosh(r|r|/2R) \ge 1 + r^2/4R$, and hence $\big|\frac{d\bar{r}}{dr}\big| \le \frac{|r|}{(1 + r^2/4R)^2} \le \min\{|r|,\ 16R^2|r|/r^4\} \le 4R$. Moreover, through a straightforward calculation, we can show that $\frac{d\bar{r}}{dr}$ is 1-Lipschitz (in other words, the mapping $r \mapsto \bar{r}$ is 1-smooth).
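The closed form for $I(w,v)$ is easy to verify by Monte Carlo. The snippet below is our own illustration (variable names are ours): it compares $I(w, v)$ against an empirical average of $\sigma(\langle w,x\rangle)\sigma(\langle v,x\rangle)$ over inputs drawn uniformly from the sphere, obtained by normalizing Gaussian samples.

```python
import numpy as np

def I(w, v):
    # I(w, v) = [sin(phi) + (pi - phi) * cos(phi)] / (2*pi*d) * |w| * |v|
    d = len(w)
    nw, nv = np.linalg.norm(w), np.linalg.norm(v)
    phi = np.arccos(np.clip(w @ v / (nw * nv), -1.0, 1.0))
    return (np.sin(phi) + (np.pi - phi) * np.cos(phi)) / (2 * np.pi * d) * nw * nv

rng = np.random.default_rng(0)
d = 6
w, v = rng.standard_normal(d), rng.standard_normal(d)

# Monte Carlo estimate of E_{x ~ Unif(S^{d-1})}[relu(<w,x>) * relu(<v,x>)]
X = rng.standard_normal((400_000, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)
mc = np.mean(np.maximum(X @ w, 0.0) * np.maximum(X @ v, 0.0))
```

The Monte Carlo estimate agrees with the closed form up to sampling error, which also checks the $1/d$ factor that converts the Gaussian expectation into the spherical one.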
Using this, each component of the gradient of $\bar{R}_\lambda(\cdot)$ can be written as follows:
$$\nabla_{a_j} \bar{R}_\lambda(\theta) = \Big(\sum_{i=1}^m \bar{a}_i I(\bar{w}_i, \bar{w}_j) - \sum_{i=1}^m a^\circ_i I(w^\circ_i, \bar{w}_j)\Big) \frac{d\bar{a}_j}{da_j} + 2\lambda a_j, \qquad (3)$$
$$\nabla_{w_j} \bar{R}_\lambda(\theta) = -\sum_{i=1}^m \bar{a}_j a^\circ_i J(\bar{w}_j, w^\circ_i) \odot \frac{d\bar{w}_j}{dw_j} + \sum_{i=1}^m \bar{a}_j \bar{a}_i J(\bar{w}_j, \bar{w}_i) \odot \frac{d\bar{w}_j}{dw_j} + 2\lambda w_j, \qquad (4)$$
where $\odot$ denotes the Hadamard product and
$$J(w, v) = \frac{\frac{\|v\|}{\|w\|} \sin\phi(w, v)\, w + (\pi - \phi(w, v))\, v}{2\pi d},$$
which is the gradient of $I(w, v)$ with respect to $w$ (see Brutzkus & Globerson (2017) or Safran & Shamir (2018)).
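The claim that $J(w,v)$ is the gradient of $I(w,v)$ with respect to $w$ can likewise be sanity-checked numerically. The sketch below (our own check, with $I$ redefined so the block is self-contained) compares $J$ against a central finite-difference gradient of $I$.

```python
import numpy as np

def I(w, v):
    d = len(w)
    nw, nv = np.linalg.norm(w), np.linalg.norm(v)
    phi = np.arccos(np.clip(w @ v / (nw * nv), -1.0, 1.0))
    return (np.sin(phi) + (np.pi - phi) * np.cos(phi)) / (2 * np.pi * d) * nw * nv

def J(w, v):
    # J(w, v) = [ (|v|/|w|) * sin(phi) * w + (pi - phi) * v ] / (2*pi*d)
    d = len(w)
    nw, nv = np.linalg.norm(w), np.linalg.norm(v)
    phi = np.arccos(np.clip(w @ v / (nw * nv), -1.0, 1.0))
    return ((nv / nw) * np.sin(phi) * w + (np.pi - phi) * v) / (2 * np.pi * d)

rng = np.random.default_rng(1)
d = 5
w, v = rng.standard_normal(d), rng.standard_normal(d)

# central finite-difference gradient of I with respect to w
eps = 1e-6
num = np.array([(I(w + eps * e, v) - I(w - eps * e, v)) / (2 * eps)
                for e in np.eye(d)])
err = np.linalg.norm(num - J(w, v))
```

The agreement holds because the $\nabla_w \phi$ contributions cancel: $\|w\|\|v\|\cos\phi = \langle w, v\rangle$, so only the $\sin\phi$ and $(\pi-\phi)v$ terms survive.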

C PROOF OF PROPOSITION 4.3

This section provides the convergence guarantee for phase I. Our objective is to prove Proposition 4.3. To this end, we first introduce the theory of gradient Langevin dynamics developed in Vempala & Wibisono (2019).

C.1 A BRIEF NOTE ON THE GRADIENT LANGEVIN DYNAMICS

In their analysis, the following notion of the log-Sobolev inequality plays an essential role.

Definition C.1. A probability distribution with a density function $q$ satisfies the log-Sobolev inequality (LSI) if there exists a constant $\alpha > 0$ such that, for all smooth functions $g$, it holds that
$$\mathrm{E}_q[g^2 \log g^2] - \mathrm{E}_q[g^2] \log \mathrm{E}_q[g^2] \le \frac{2}{\alpha} \mathrm{E}_q[\|\nabla g\|^2].$$
$\alpha$ is called a log-Sobolev constant. It is known that the LSI is equivalent to the following inequality:
$$H_q(p) \le \frac{1}{2\alpha} J_q(p) \quad (\forall p \in \mathcal{P}), \qquad (5)$$
where $H_q(p) := \int p(\theta) \log \frac{p(\theta)}{q(\theta)}\, d\theta$ is the KL divergence, $J_q(p) := \int p(\theta) \big\|\nabla \log \frac{p(\theta)}{q(\theta)}\big\|^2 d\theta$ is the relative Fisher information, and $\mathcal{P}$ is the set of all probability density functions.

Now we consider sampling from a probability distribution $q$ over $\mathbb{R}^d$. We assume that $-\log q(\cdot): \mathbb{R}^d \to \mathbb{R}$ is differentiable. One well-known and promising approach is to update the parameter $\theta^{(0)}$, sampled from an initial distribution $\rho_0$, as follows:
$$\theta^{(k+1)} = \theta^{(k)} - \eta \nabla(-\log q)(\theta^{(k)}) + \sqrt{2\eta}\, \zeta^{(k)}, \qquad (6)$$
where $\eta > 0$ is a constant and $\zeta^{(k)} \sim N(0, I_d)$ are independent standard Gaussian random variables. Vempala & Wibisono (2019) show that if the LSI holds and $-\log q$ is smooth, a sufficient number of updates (6) actually achieves sampling from $q$, in the sense that the KL divergence between the distribution of $\theta^{(k)}$ and $q$ becomes small.

Theorem C.2 ((Vempala & Wibisono, 2019, Theorem 1)). Suppose that a probability measure with a density function $q$ satisfies the LSI with a constant $\alpha$ and that $-\log q$ is $L$-smooth. Then for any $\theta^{(0)} \sim p_0$ with $H_q(p_0) < +\infty$, the sequence $(\theta^{(k)})_{k=0}^\infty$ with step-size $0 < \eta < \frac{\alpha}{4L^2}$ satisfies
$$H_q(p_k) \le \exp(-\alpha \eta k) H_q(p_0) + \frac{8 \eta d L^2}{\alpha}.$$
Hence, for any $\delta > 0$, the output of the update (6) with step-size $\eta \le \frac{\alpha}{4L^2} \min\{1, \frac{\delta}{4d}\}$ achieves $H_q(p_k) < \delta$ after $k \ge \frac{1}{\alpha \eta} \log \frac{2 H_q(p_0)}{\delta}$ iterations.

C.2 PROOF OF LEMMA 4.2

The goal of this section is to prove Proposition 4.3, the convergence of gradient Langevin dynamics.
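Update (6) is easy to exercise on a toy target. The sketch below is our own illustration, not part of the paper: it runs many independent copies of the discretized Langevin update with target $q \propto \exp(-\|\theta\|^2/2)$, for which $\nabla(-\log q)(\theta) = \theta$, and checks that the chains approach the standard Gaussian. The small residual bias in the variance is the discretization error, consistent with the $\eta d L^2/\alpha$ term in Theorem C.2.

```python
import numpy as np

rng = np.random.default_rng(0)
d, chains, eta, steps = 3, 4000, 0.01, 800

theta = rng.standard_normal((chains, d)) * 3.0   # overdispersed initialization

for _ in range(steps):
    # theta_{k+1} = theta_k - eta * grad(-log q)(theta_k) + sqrt(2*eta) * zeta
    grad = theta                                  # -log q(theta) = ||theta||^2 / 2
    theta = theta - eta * grad + np.sqrt(2 * eta) * rng.standard_normal((chains, d))

emp_mean = theta.mean()
emp_var = theta.var()
```

After 800 steps the initialization bias has contracted by a factor $(1-\eta)^{800} \approx e^{-8}$, so the empirical mean is near 0 and the empirical variance near 1.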
As we stated in Section 4.2, we consider the value of $\bar{R}_\lambda(\cdot)$ instead of $\hat{R}_\lambda(\cdot)$ and ensure that its value decreases sufficiently. To this end, we first prove Lemma 4.2, which uniformly evaluates the difference between the empirical gradient and the population gradient over $B(0, \sqrt{D}R)$. Let $\Theta_\epsilon = \{\tilde{\theta}_1, \dots, \tilde{\theta}_N\}$ be an $\epsilon$-net of $B(0, \sqrt{D}R)$ and, for $\theta \in B(0, \sqrt{D}R)$, write $\tilde{\theta} := \tilde{\theta}_{j(\theta)}$ for the closest point of the net. We consider the following decomposition:
$$\frac{1}{n}\sum_{i=1}^n \nabla \ell(y_i, f(x_i; \theta)) - \nabla \mathrm{E}[\ell(y, f(x; \theta))] = \Big(\frac{1}{n}\sum_{i=1}^n \nabla \ell(y_i, f(x_i; \theta)) - \nabla \ell(y_i, f(x_i; \tilde{\theta}))\Big) + \Big(\frac{1}{n}\sum_{i=1}^n \nabla \ell(y_i, f(x_i; \tilde{\theta})) - \nabla \mathrm{E}[\ell(y, f(x; \tilde{\theta}))]\Big) + \Big(\nabla \mathrm{E}[\ell(y, f(x; \tilde{\theta}))] - \nabla \mathrm{E}[\ell(y, f(x; \theta))]\Big).$$
This gives that
$$\Big\|\frac{1}{n}\sum_{i=1}^n \nabla \ell(y_i, f(x_i; \theta)) - \nabla \mathrm{E}[\ell(y, f(x; \theta))]\Big\| \le \Big\|\frac{1}{n}\sum_{i=1}^n \nabla \ell(y_i, f(x_i; \theta)) - \nabla \ell(y_i, f(x_i; \tilde{\theta}))\Big\| + \Big\|\frac{1}{n}\sum_{i=1}^n \nabla \ell(y_i, f(x_i; \tilde{\theta})) - \nabla \mathrm{E}[\ell(y, f(x; \tilde{\theta}))]\Big\| + \big\|\nabla \mathrm{E}[\ell(y, f(x; \tilde{\theta}))] - \nabla \mathrm{E}[\ell(y, f(x; \theta))]\big\|,$$
and hence it holds that
$$P\Big(\sup_{\theta \in B(0, \sqrt{D}R)} \Big\|\frac{1}{n}\sum_{i=1}^n \nabla \ell(y_i, f(x_i; \theta)) - \nabla \mathrm{E}[\ell(y, f(x; \theta))]\Big\| \ge t\Big) \le \underbrace{P\Big(\sup_{\theta} \Big\|\frac{1}{n}\sum_{i=1}^n \nabla \ell(y_i, f(x_i; \theta)) - \nabla \ell(y_i, f(x_i; \tilde{\theta}))\Big\| \ge \frac{t}{3}\Big)}_{(\mathrm{I})} + \underbrace{P\Big(\sup_{\theta} \Big\|\frac{1}{n}\sum_{i=1}^n \nabla \ell(y_i, f(x_i; \tilde{\theta})) - \nabla \mathrm{E}[\ell(y, f(x; \tilde{\theta}))]\Big\| \ge \frac{t}{3}\Big)}_{(\mathrm{II})} + \underbrace{P\Big(\sup_{\theta} \big\|\nabla \mathrm{E}[\ell(y, f(x; \tilde{\theta}))] - \nabla \mathrm{E}[\ell(y, f(x; \theta))]\big\| \ge \frac{t}{3}\Big)}_{(\mathrm{III})}$$
for any $t > 0$, where every supremum is over $\theta \in B(0, \sqrt{D}R)$. We then evaluate each term of the right-hand side.

Upper bound on (I): Since $\nabla \ell(y_i, f(x_i; \theta)) = 2(f(x_i; \theta) - y_i) \nabla f(x_i; \theta)$, it holds that
$$\nabla \ell(y_i, f(x_i; \theta)) - \nabla \ell(y_i, f(x_i; \tilde{\theta})) = 2\big(f(x_i; \theta) - f(x_i; \tilde{\theta})\big) \nabla f(x_i; \theta) - 2\big(f(x_i; \tilde{\theta}) - y_i\big)\big(\nabla f(x_i; \tilde{\theta}) - \nabla f(x_i; \theta)\big).$$
Therefore, we have that
$$(\mathrm{I}) \le P\Big(\sup_{\theta} \frac{2}{n}\sum_{i=1}^n \big|f(x_i; \theta) - f(x_i; \tilde{\theta})\big|\, \|\nabla f(x_i; \theta)\| \ge \frac{t}{6}\Big) + P\Big(\sup_{\theta} \frac{2}{n}\sum_{i=1}^n \big|f(x_i; \tilde{\theta}) - y_i\big|\, \big\|\nabla f(x_i; \tilde{\theta}) - \nabla f(x_i; \theta)\big\| \ge \frac{t}{6}\Big).$$
Since the mapping $\theta \mapsto f(x; \theta)$ is $2R$-Lipschitz and $\|\nabla f(x; \theta)\| \le 2mR$ for any $\theta \in \Theta$, the first term must be zero as long as $t \ge 4mR^2 \epsilon$.
As for the second term, since $|f(x; \theta) - y_i| \le mR^2 + U + 1$ for any $x_i, y_i$ and $\theta \in \Theta$, it holds that
$$(\mathrm{I}) \le P\Big(\sup_{\theta \in B(0, \sqrt{D}R)} \frac{2}{n}\sum_{i=1}^n \big\|\nabla f(x_i; \theta) - \nabla f(x_i; \tilde{\theta})\big\| \ge \frac{t}{6(mR^2 + U + 1)}\Big).$$
Hence, we move to evaluate $\sup_{\theta} \frac{2}{n}\sum_{i=1}^n \|\nabla f(x_i; \theta) - \nabla f(x_i; \tilde{\theta})\|$. To this end, we consider the decomposition
$$\frac{2}{n}\sum_{i=1}^n \big\|\nabla f(x_i; \theta) - \nabla f(x_i; \tilde{\theta})\big\| \le \sum_{j=1}^m \Big(\frac{2}{n}\sum_{i=1}^n \big|\nabla_{a_j} f(x_i; \theta) - \nabla_{a_j} f(x_i; \tilde{\theta})\big| + \frac{2}{n}\sum_{i=1}^n \big\|\nabla_{w_j} f(x_i; \theta) - \nabla_{w_j} f(x_i; \tilde{\theta})\big\|\Big),$$
where $\nabla_{a_j} f(x_i; \theta) = \sigma(\langle \bar{w}_j, x_i \rangle) \frac{d\bar{a}_j}{da_j}$ and $\nabla_{w_j} f(x_i; \theta) = \bar{a}_j \mathbb{1}\{\langle \bar{w}_j, x_i \rangle \ge 0\}\, x_i \odot \frac{d\bar{w}_j}{dw_j}$. This decomposition implies that
$$(\mathrm{I}) \le P\Big(\max_{j \in [m]} \sup_{\theta} \frac{2}{n}\sum_{i=1}^n \big|\nabla_{a_j} f(x_i; \theta) - \nabla_{a_j} f(x_i; \tilde{\theta})\big| \ge \frac{t}{12m(mR^2 + U + 1)}\Big) + P\Big(\max_{j \in [m]} \sup_{\theta} \frac{2}{n}\sum_{i=1}^n \big\|\nabla_{w_j} f(x_i; \theta) - \nabla_{w_j} f(x_i; \tilde{\theta})\big\| \ge \frac{t}{12m(mR^2 + U + 1)}\Big).$$
Writing $\hat{a}_j, \hat{w}_j$ for the (clipped) parameters of $\tilde{\theta}$, it holds for each term that
$$\big|\nabla_{a_j} f(x_i; \theta) - \nabla_{a_j} f(x_i; \tilde{\theta})\big| \le \big|\sigma(\langle \bar{w}_j, x_i \rangle) - \sigma(\langle \hat{w}_j, x_i \rangle)\big| \Big|\frac{d\bar{a}_j}{da_j}\Big| + \big|\sigma(\langle \hat{w}_j, x_i \rangle)\big| \Big|\frac{d\bar{a}_j}{da_j} - \frac{d\hat{a}_j}{da_j}\Big| \le \|\bar{w}_j - \hat{w}_j\| \Big|\frac{d\bar{a}_j}{da_j}\Big| + \Big|\frac{d\bar{a}_j}{da_j} - \frac{d\hat{a}_j}{da_j}\Big| \le 4R \|\bar{w}_j - \hat{w}_j\| + 2|\bar{a}_j - \hat{a}_j| \le 4R\epsilon + 2\epsilon$$
and
$$\big\|\nabla_{w_j} f(x_i; \theta) - \nabla_{w_j} f(x_i; \tilde{\theta})\big\| \le \Big\|\bar{a}_j \big(\mathbb{1}\{\langle \bar{w}_j, x_i \rangle \ge 0\} - \mathbb{1}\{\langle \hat{w}_j, x_i \rangle \ge 0\}\big)\, x_i \odot \frac{d\bar{w}_j}{dw_j}\Big\| + \Big\|(\bar{a}_j - \hat{a}_j)\, \mathbb{1}\{\langle \hat{w}_j, x_i \rangle \ge 0\}\, x_i \odot \frac{d\bar{w}_j}{dw_j}\Big\| + \Big\|\hat{a}_j\, \mathbb{1}\{\langle \hat{w}_j, x_i \rangle \ge 0\}\, x_i \odot \Big(\frac{d\bar{w}_j}{dw_j} - \frac{d\hat{w}_j}{dw_j}\Big)\Big\| \le R \Big\|\big(\mathbb{1}\{\langle \bar{w}_j, x_i \rangle \ge 0\} - \mathbb{1}\{\langle \hat{w}_j, x_i \rangle \ge 0\}\big)\, x_i \odot \frac{d\bar{w}_j}{dw_j}\Big\| + 4R|\bar{a}_j - \hat{a}_j| + 4R\|\bar{w}_j - \hat{w}_j\| \le R \Big\|\big(\mathbb{1}\{\langle \bar{w}_j, x_i \rangle \ge 0\} - \mathbb{1}\{\langle \hat{w}_j, x_i \rangle \ge 0\}\big)\, x_i \odot \frac{d\bar{w}_j}{dw_j}\Big\| + 8R\epsilon.$$
The first term can be bounded by
$$\Big\|\big(\mathbb{1}\{\langle \bar{w}_j, x_i \rangle \ge 0\} - \mathbb{1}\{\langle \hat{w}_j, x_i \rangle \ge 0\}\big)\, x_i \odot \frac{d\bar{w}_j}{dw_j}\Big\| \le \big|\mathbb{1}\{\langle \bar{w}_j, x_i \rangle \ge 0\} - \mathbb{1}\{\langle \hat{w}_j, x_i \rangle \ge 0\}\big| \cdot \|\hat{w}_j\| \le \mathbb{1}\{|\langle \hat{w}_j, x_i \rangle| \le \epsilon\} \cdot \|\hat{w}_j\|,$$
where the last inequality follows from $|\langle \bar{w}_j, x_i \rangle - \langle \hat{w}_j, x_i \rangle| \le \|\bar{w}_j - \hat{w}_j\|\, \|x_i\| \le \epsilon$. Therefore, we obtain that
$$(\mathrm{I}) \le P\Big(\max_{j \in [m]} \sup_{\theta \in B(0, \sqrt{D}R)} \frac{\#\{i \in [n] \mid |\langle \hat{w}_j, x_i \rangle| \le \epsilon\} \cdot \|\hat{w}_j\|}{n} \ge \frac{t}{24mR(mR^2 + U + 1)}\Big) = P\Big(\max_{\tilde{\theta} \in \Theta_\epsilon,\ j \in [m]} \frac{\#\{i \in [n] \mid |\langle \hat{w}_j, x_i \rangle| \le \epsilon\} \cdot \|\hat{w}_j\|}{n} \ge \frac{t}{24mR(mR^2 + U + 1)}\Big)$$
as long as $\frac{t}{24mR(mR^2 + U + 1)} \ge \max\{4R\epsilon,\ \epsilon,\ 8R\epsilon\} = 8R\epsilon$.
We have that
$$P\Big(\frac{\#\{i \in [n] \mid |\langle \hat{w}_j, x_i \rangle| \le \epsilon\} \cdot \|\hat{w}_j\|}{n} \ge \frac{t}{24mR(mR^2 + U + 1)}\Big) = P\Big(\frac{\#\{i \in [n] \mid |\langle \hat{w}_j, x_i \rangle| \le \epsilon\}}{n} \ge \frac{t}{24mR(mR^2 + U + 1)\|\hat{w}_j\|}\Big)$$
when $\hat{w}_j \ne 0$. If $\hat{w}_j = 0$, the left-hand side must be zero as long as $t > 0$. Lemma 12 in Cai et al. (2013) shows that, for each $j$ and $i$, the angle between $\hat{w}_j$ and $x_i$ is distributed with density function
$$h(\phi) = \frac{1}{\sqrt{\pi}} \frac{\Gamma(\frac{d}{2})}{\Gamma(\frac{d-1}{2})} (\sin\phi)^{d-2}, \qquad \phi \in [0, \pi].$$
Since $|\frac{\pi}{2} - \phi| \le \Delta$ implies $|\cos\phi| = |\sin(\frac{\pi}{2} - \phi)| \le \Delta$ for any $\Delta > 0$, and $h(\phi) \le \frac{1}{\sqrt{\pi}} \frac{\Gamma(d/2)}{\Gamma((d-1)/2)}$ for any $\phi \in [0, \pi]$, it holds that
$$P\big(|\langle \hat{w}_j, x_i \rangle| \le \epsilon\big) \le P\Big(\Big|\frac{\pi}{2} - \phi_{ij}\Big| \le \frac{\epsilon}{\|\hat{w}_j\|}\Big) \le \frac{2\epsilon}{\|\hat{w}_j\|} \cdot \frac{1}{\sqrt{\pi}} \frac{\Gamma(\frac{d}{2})}{\Gamma(\frac{d-1}{2})} \le \frac{2\sqrt{d}}{\sqrt{\pi}} \frac{\epsilon}{\|\hat{w}_j\|},$$
where $\phi_{ij}$ is the angle between $\hat{w}_j$ and $x_i$. Therefore, $\#\{i \in [n] \mid |\langle \hat{w}_j, x_i \rangle| \le \epsilon\}$ follows the binomial distribution $B(n, p)$ with $p \le \frac{2\sqrt{d}}{\sqrt{\pi}} \frac{\epsilon}{\|\hat{w}_j\|}$. Since a random variable following the binomial distribution is bounded and in particular sub-Gaussian (Wainwright, 2019), it holds that
$$P\Big(\frac{\#\{i \in [n] \mid |\langle \hat{w}_j, x_i \rangle| \le \epsilon\}}{n} \ge s + \frac{2\sqrt{d}}{\sqrt{\pi}} \frac{\epsilon}{\|\hat{w}_j\|}\Big) \le P\Big(\frac{\#\{i \in [n] \mid |\langle \hat{w}_j, x_i \rangle| \le \epsilon\}}{n} \ge s + p\Big) \le \exp(-2ns^2)$$
for arbitrary $s > 0$. By taking the union bound, we obtain that
$$P\Big(\max_{\tilde{\theta} \in \Theta_\epsilon,\ j \in [m]} \frac{\#\{i \in [n] \mid |\langle \hat{w}_j, x_i \rangle| \le \epsilon\}}{n} \ge s + \frac{2\sqrt{d}}{\sqrt{\pi}} \frac{\epsilon}{\|\hat{w}_j\|}\Big) \le N \exp(-2ns^2).$$
Hence, as long as $\epsilon \le \frac{\sqrt{\pi}\, t}{96\sqrt{d}\, mR(mR^2 + U + 1)}$ (verified later in this proof), by letting $s = \frac{t}{48mR(mR^2 + U + 1)\|\hat{w}_j\|}$, we obtain that
$$P\Big(\max_{\tilde{\theta} \in \Theta_\epsilon,\ j \in [m]} \frac{\#\{i \in [n] \mid |\langle \hat{w}_j, x_i \rangle| \le \epsilon\}}{n} \ge s + \frac{2\sqrt{d}}{\sqrt{\pi}} \frac{\epsilon}{\|\hat{w}_j\|}\Big) \le mN \exp\Big(-2n \Big(\frac{t}{24mR(mR^2 + U + 1)\|\hat{w}_j\|}\Big)^2\Big) \le mN \exp\Big(-\frac{2n}{dR^2} \Big(\frac{t}{24mR(mR^2 + U + 1)}\Big)^2\Big),$$
where the last inequality follows from $\|\hat{w}_j\|^2 \le dR^2$. As a result, the term (I) can be bounded by
$$(\mathrm{I}) \le mN \exp\Big(-\frac{2n}{dR^2} \Big(\frac{t}{24mR(mR^2 + U + 1)}\Big)^2\Big).$$
Upper bound on (II): First, we observe that the term (II) is equal to
$$P\Big(\max_{j \in [N]} \Big\|\frac{1}{n}\sum_{i=1}^n \nabla \ell(y_i, f(x_i; \tilde{\theta}_j)) - \nabla \mathrm{E}[\ell(y, f(x; \tilde{\theta}_j))]\Big\| \ge \frac{t}{3}\Big).$$
For each $j \in [N]$, a straightforward calculation gives $\|\nabla \ell(y_i, f(x_i; \tilde{\theta}_j))\| \le 2R(mR^2 + 1)$, and hence the vector $\nabla \ell(y_i, f(x_i; \tilde{\theta}_j))$ is sub-Gaussian with parameter $G = R(mR^2 + 1)$; that is, it holds that
$$P\Big(\Big\|\frac{1}{n}\sum_{i=1}^n \nabla \ell(y_i, f(x_i; \tilde{\theta}_j)) - \nabla \mathrm{E}[\ell(y, f(x; \tilde{\theta}_j))]\Big\| \ge \frac{t}{3}\Big) \le 2 e^{-\frac{n t^2}{18 G^2}}$$
for arbitrary $t > 0$. By taking the union bound, we obtain
$$(\mathrm{II}) = P\Big(\max_{j \in [N]} \Big\|\frac{1}{n}\sum_{i=1}^n \nabla \ell(y_i, f(x_i; \tilde{\theta}_j)) - \nabla \mathrm{E}[\ell(y, f(x; \tilde{\theta}_j))]\Big\| \ge \frac{t}{3}\Big) \le 2N e^{-\frac{n t^2}{18 G^2}}.$$
Upper bound on (III): The goal is to obtain $(\mathrm{III}) = 0$ for sufficiently small $\epsilon$; in particular, we assume $\epsilon < 1$ here. To this end, we aim to show
$$\big\|\nabla \mathrm{E}[\ell(y, f(x; \theta))] - \nabla \mathrm{E}[\ell(y, f(x; \tilde{\theta}))]\big\| \le c L \epsilon^{1/2} \qquad (7)$$
with a constant $c > 0$ and $L = O(m^2 R^3)$. First, we consider the case where the absolute value of each component of $\theta$ is bounded by $1/2$. By Lemma C.5, it holds that
$$\big\|\nabla \mathrm{E}[\ell(y, f(x; \theta))] - \nabla \mathrm{E}[\ell(y, f(x; \tilde{\theta}))]\big\| \le L \big\|\bar{\theta} - \bar{\theta}_{j(\theta)}\big\| = L \big(\big\|\bar{\theta} - \bar{\theta}_{j(\theta)}\big\|^2\big)^{1/2}$$
for any $\theta \in \Theta$ with $L = O(m^2 R^3)$. Moreover, since a straightforward calculation shows that the mapping $r \mapsto 2R \tanh^{-1}(r/R)$ (the inverse of the mapping $r \mapsto R\tanh(r/2R)$) is 8-Lipschitz on $[0, 1/2]$, we have $\|\bar{\theta} - \bar{\theta}_{j(\theta)}\|^2 \le 8\|\theta - \tilde{\theta}\| \le 8\epsilon$. Therefore, we obtain
$$\big\|\nabla \mathrm{E}[\ell(y, f(x; \theta))] - \nabla \mathrm{E}[\ell(y, f(x; \tilde{\theta}))]\big\| \le L (8\epsilon)^{1/2},$$
i.e., Eq. (7) with $c = 8$. Now assume that there is a component of $\theta$ whose absolute value is greater than $1/2$. First, suppose that a component of $w_j$ is greater than $1/2$ for some $j \in [m]$. We consider the decomposition
$$\big\|\nabla_{w_j} \mathrm{E}[\ell(y, f(x; \theta))] - \nabla_{w_j} \mathrm{E}[\ell(y, f(x; \tilde{\theta}))]\big\| \le \Big\|\big(\nabla_{\bar{w}_j} \mathrm{E}[\ell(y, f(x; \theta))] - \nabla_{\bar{w}_j} \mathrm{E}[\ell(y, f(x; \tilde{\theta}))]\big) \odot \frac{d\bar{w}_j}{dw_j}\Big\| + \Big\|\nabla_{\bar{w}_j} \mathrm{E}[\ell(y, f(x; \tilde{\theta}))] \odot \Big(\frac{d\bar{w}_j}{dw_j} - \frac{d\hat{w}_j}{dw_j}\Big)\Big\|.$$
Since $\|w_j\| > 1/2$, the mapping $\bar{w}_j \mapsto \mathrm{E}[\ell(y, f(x; \theta))]$ is $L'$-smooth with $L' = O(mR^2 d^{-1/2})$ according to its Hessian (see Safran & Shamir (2018)). Since $\|\frac{d\bar{w}_j}{dw_j}\| \le 4\sqrt{d}R$, the first term is at most $O(mR^3) \cdot \epsilon$. Since $\|\nabla \mathrm{E}[\ell(y, f(x; \theta))]\| \le 2R(mR^2 + 1)$ and $r \mapsto \bar{r}$ is 1-smooth, the second term is at most $O(mR^3) \cdot \epsilon$. Hence we get
$$\big\|\nabla_{w_j} \mathrm{E}[\ell(y, f(x; \theta))] - \nabla_{w_j} \mathrm{E}[\ell(y, f(x; \tilde{\theta}))]\big\| \le O(mR^3) \cdot \epsilon.$$
In the case $|a_j| > 1/2$ for some $j \in [m]$, the same bound also holds for $\|\nabla_{a_j} \mathrm{E}[\ell(y, f(x; \theta))] - \nabla_{a_j} \mathrm{E}[\ell(y, f(x; \tilde{\theta}))]\|$. By using these bounds instead of Lemma C.5, together with $\epsilon < 1$, we obtain the same bound Eq. (7) in this case. Eq. (7) implies $(\mathrm{III}) = 0$ as long as $\frac{t}{3} \ge c L \epsilon^{1/2}$, which gives the assertion.

Combining (I)-(III):

Combining these bounds, we get that
$$P\Big(\sup_{\theta \in B(0, \sqrt{D}R)} \Big\|\frac{1}{n}\sum_{i=1}^n \nabla \ell(y_i, f(x_i; \theta)) - \nabla \mathrm{E}[\ell(y, f(x; \theta))]\Big\| \ge t\Big) \le mN \exp\Big(-\frac{2n}{dR^2}\Big(\frac{t}{24mR(mR^2+U+1)}\Big)^2\Big) + 2N e^{-\frac{nt^2}{18G^2}} + 0 = \exp\Big(D \log \frac{3\sqrt{D}R}{\epsilon}\Big)\Big[m \exp\Big(-\frac{2n}{dR^2}\Big(\frac{t}{24mR(mR^2+U+1)}\Big)^2\Big) + 2\exp\Big(-\frac{nt^2}{18R^2(mR^2+U+1)^2}\Big)\Big]$$
as long as $t \ge C_0 \max\{mR^2 \epsilon,\ mR(mR^2+U)\epsilon,\ L\epsilon^{1/2}\}$ holds with a constant $C_0 > 0$. By letting $t = C_1 L \epsilon^{1/2}$ and $\epsilon = C_2 \frac{d \log \delta}{n m^2}$ with constants $C_1 > 0$ and $C_2 > 0$, we obtain the conclusion.

C.3 PROOF OF THE CONVERGENCE IN PHASE I

Based on the results so far, we move to the proof of Proposition 4.3. The proof consists of two steps. First, we evaluate the "distance" between $\pi_\infty$ and the distribution of $\theta^{(k)}$. Second, we ensure that the function value $\bar{R}_\lambda(\theta)$, where $\theta$ is sampled from $\pi_\infty$, is small for sufficiently large $\beta$. Combining these two facts, we can guarantee that the function value $\bar{R}_\lambda(\theta^{(k)})$ is also small, which yields Proposition 4.3. The following proposition ensures the convergence of the marginal distribution of $\theta^{(k)}$ to the invariant measure $\pi_\infty$:

Proposition C.3. Suppose that the probability measure $\pi_\infty$ satisfies the LSI with a constant $\alpha$ and that $\bar{R}_\lambda(\cdot)$ is $L$-smooth with $L > 1$. Let $q$ be the density function of $\pi_\infty$ (i.e., $q(\theta) \propto \exp(-\beta \bar{R}_\lambda(\theta))$) with $\beta > 2$. For any $\theta^{(0)} \sim \rho_0$ with $H_q(\rho_0) < +\infty$, the sequence $(\theta^{(k)})_{k=0}^\infty$ with step-size $0 < \eta^{(1)} < \frac{\alpha}{4\beta L^2}$ satisfies
$$H_q(\rho_k) \le \exp\Big(-\frac{\alpha \eta^{(1)}}{\beta} k\Big) H_q(\rho_0) + \frac{16 \beta \eta^{(1)} D L^2}{\alpha} + \frac{32 \beta V_{\mathrm{grad}}^2}{3\alpha},$$
where $D := m(d+1)$, $\rho_k$ is the density function of the marginal distribution of $\theta^{(k)}$, and $V_{\mathrm{grad}}$ is the constant introduced in Lemma 4.2. In particular, for any $\delta > 0$, the output of phase I with step-size $\eta^{(1)} \le \frac{\delta \alpha}{32 \beta L^2 D}$ achieves $H_q(\rho_k) < \delta + \frac{32 \beta V_{\mathrm{grad}}^2}{3\alpha}$ after $k \ge \frac{\beta}{\alpha \eta^{(1)}} \log \frac{2 H_q(\rho_0)}{\delta}$ iterations.

As we stated in Section 4.2, our result extends that of Vempala & Wibisono (2019) in the sense that it gives convergence for the non-differentiable objective function $\hat{R}_\lambda(\cdot)$. Indeed, this difference appears in the last term, $\frac{32 \beta V_{\mathrm{grad}}^2}{3\alpha}$. Since $V_{\mathrm{grad}}^2 \lesssim n^{-1}$ by Lemma 4.2, this error vanishes as the sample size $n$ increases. To apply this result to the convergence of phase I, we only need to check that the invariant measure $\pi_\infty$ satisfies the LSI and that $\bar{R}_\lambda$ is smooth, which we establish as follows:

Lemma C.4 (log-Sobolev inequality). The invariant measure $\pi_\infty$ satisfies the LSI with a constant $\alpha = 2\beta\lambda \exp(-8\beta m^2 R^4)$.

Lemma C.5 (smoothness).
$\bar{R}_\lambda(\cdot)$ is $L$-smooth, i.e., for any $\theta, \theta' \in \Theta$, $\|\nabla \bar{R}_\lambda(\theta) - \nabla \bar{R}_\lambda(\theta')\| \le L \|\theta - \theta'\|$ holds with $L = O(m^2 R^3 + \lambda)$.

The proofs of these lemmas can be found in Appendix E. Remark C.6. In Lemma C.4, the LSI constant $\alpha$ depends exponentially on $m$, which results in the exponential dependency of the phase I convergence on $m$. This is caused by the fact that the sup-norm of the student network depends on $m$ (see the proof of Lemma C.4). We can remove this dependency by considering the following settings: (1) utilizing the mean field network (multiplying the output of the student by $1/m$); (2) assuming the $w_j$'s are directed in sufficiently different directions, so that the sup-norm can be bounded.

To prove Proposition C.3, we first show the following lemma, which evaluates each step of the gradient Langevin dynamics.

Lemma C.7. Suppose that $\pi_\infty$ satisfies the LSI with a constant $\alpha$, $\bar{R}_\lambda(\cdot)$ is $L$-smooth with $L > 1$, and $\beta > 2$. Then, for any $\theta^{(0)} \sim \rho_0$ with $H_q(\rho_0) < +\infty$, if $0 < \eta < \frac{\alpha}{4\beta L^2}$, it holds that
$$H_q(\rho_{k+1}) \le e^{-\alpha\eta/\beta} H_q(\rho_k) + 12\eta^2 D L^2 + 8\eta V_{\mathrm{grad}}^2,$$
where $\rho_k$ is the density function of the marginal distribution of $\theta^{(k)}$ and $V_{\mathrm{grad}}$ is the constant defined in Lemma 4.2.

Proof. The proof of Lemma C.7 is basically based on that of Lemma 3 in Vempala & Wibisono (2019). For notational simplicity, suppose $k = 0$ and let $\theta_0 = \theta^{(0)}$. One step of the gradient Langevin dynamics, $\theta^{(1)} = \theta^{(0)} - \eta \nabla \hat{R}_\lambda(\theta^{(0)}) + \sqrt{\frac{2\eta}{\beta}}\, \zeta^{(0)}$, can be seen as the output at time $\eta\beta^{-1}$ of the following SDE:
$$d\theta_t = -\beta \nabla \hat{R}_\lambda(\theta_0)\, dt + \sqrt{2}\, dB_t,$$
where $\{B_t\}_{t \ge 0}$ is the standard Brownian motion in $\Theta$ ($= \mathbb{R}^{(d+1)\times m}$). As in Vempala & Wibisono (2019), it holds that
$$\frac{\partial \rho_{t|0}(\theta_t | \theta_0)}{\partial t} = \nabla \cdot \big(\rho_{t|0}(\theta_t | \theta_0)\, \beta \nabla \hat{R}_\lambda(\theta_0)\big) + \Delta \rho_{t|0}(\theta_t | \theta_0),$$
and therefore,
$$\frac{d}{dt} H_q(\rho_t) = -J_q(\rho_t) + \beta \cdot \mathrm{E}_{\rho_{0t}}\Big[\Big\langle \nabla \bar{R}_\lambda(\theta_t) - \nabla \hat{R}_\lambda(\theta_0),\ \nabla \log \frac{\rho_t(\theta_t)}{q(\theta_t)}\Big\rangle\Big],$$
where $\rho_{t|0}(\cdot|\theta_0)$ is the conditional density and $\rho_{0t}$ is the density of the joint distribution of $\theta_0$ and $\theta_t$. We then evaluate the second term.
The inner product in this term can be bounded by
$$\Big\langle \nabla \bar{R}_\lambda(\theta_t) - \nabla \hat{R}_\lambda(\theta_0),\ \nabla \log \frac{\rho_t(\theta_t)}{q(\theta_t)}\Big\rangle \le \big\|\nabla \bar{R}_\lambda(\theta_t) - \nabla \hat{R}_\lambda(\theta_0)\big\|^2 + \frac14 \Big\|\nabla \log \frac{\rho_t(\theta_t)}{q(\theta_t)}\Big\|^2 \le 2\big\|\nabla \bar{R}_\lambda(\theta_t) - \nabla \bar{R}_\lambda(\theta_0)\big\|^2 + 2\big\|\nabla \bar{R}_\lambda(\theta_0) - \nabla \hat{R}_\lambda(\theta_0)\big\|^2 + \frac14 \Big\|\nabla \log \frac{\rho_t(\theta_t)}{q(\theta_t)}\Big\|^2.$$
In the above bound, we use $\langle a, b\rangle \le \|a\|^2 + \|b\|^2/4$ for $a, b \in \mathbb{R}^D$ in the first inequality and $\|a - b\|^2 \le 2\|a\|^2 + 2\|b\|^2$ for $a, b \in \mathbb{R}^D$ in the second inequality. Therefore, by using Lemma 4.2, we get that
$$\mathrm{E}_{\rho_{0t}}\Big[\Big\langle \nabla \bar{R}_\lambda(\theta_t) - \nabla \hat{R}_\lambda(\theta_0),\ \nabla \log \frac{\rho_t(\theta_t)}{q(\theta_t)}\Big\rangle\Big] \le 2 V_{\mathrm{grad}}^2 + 2 \mathrm{E}_{\rho_{0t}}\big[\|\nabla \bar{R}_\lambda(\theta_t) - \nabla \bar{R}_\lambda(\theta_0)\|^2\big] + \frac14 J_q(\rho_t).$$
The second term is bounded by
$$\mathrm{E}_{\rho_{0t}}\big[\|\nabla \bar{R}_\lambda(\theta_t) - \nabla \bar{R}_\lambda(\theta_0)\|^2\big] \le L^2 \mathrm{E}_{\rho_{0t}}\big[\|\theta_t - \theta_0\|^2\big] = L^2 \mathrm{E}_{\rho_{0t}}\Big[\Big\|-t \nabla \hat{R}_\lambda(\theta_0) + \sqrt{\tfrac{2t}{\beta}}\, \zeta^{(0)}\Big\|^2\Big] = t^2 L^2 \mathrm{E}_{\rho_{0t}}\big[\big\|\nabla \bar{R}_\lambda(\theta_0) + \big(\nabla \hat{R}_\lambda(\theta_0) - \nabla \bar{R}_\lambda(\theta_0)\big)\big\|^2\big] + L^2 \mathrm{E}_{\rho_{0t}}\Big[\Big\|\sqrt{\tfrac{2t}{\beta}}\, \zeta^{(0)}\Big\|^2\Big] \le 2t^2 L^2 \big(\mathrm{E}_{\rho_{0t}}[\|\nabla \bar{R}_\lambda(\theta_0)\|^2] + V_{\mathrm{grad}}^2\big) + L^2 \frac{2t}{\beta} D \le \frac{1}{\beta}\Big(\frac{4t^2 L^4}{\alpha} H_q(\rho_0) + 2t^2 L^3 D\Big) + 2\eta^2 L^2 V_{\mathrm{grad}}^2 + tL^2 D.$$
In the last inequality, we use Lemma 10 in Vempala & Wibisono (2019) and $\beta > 2$. Thus we obtain
$$\frac{d}{dt} H_q(\rho_t) \le -\frac34 J_q(\rho_t) + \frac{8\beta t^2 L^4}{\alpha} H_q(\rho_0) + 4\beta t^2 L^3 D + 2\beta t L^2 D + (2\beta \eta^2 L^2 + 2) V_{\mathrm{grad}}^2 \le -\frac{3\alpha}{2} H_q(\rho_t) + \frac{8\beta t^2 L^4}{\alpha} H_q(\rho_0) + 6\beta t L^2 D + 4\beta V_{\mathrm{grad}}^2,$$
since the LSI (Eq. (5)) holds and $tL \le \eta L \le 1$. Multiplying both sides by $e^{3\alpha t/2}$ and integrating them from $t = 0$ to $t = \eta\beta^{-1}$, we get
$$e^{3\alpha\eta/2\beta} H_q(\rho_\eta) - H_q(\rho_0) \le \frac{2(e^{3\alpha\eta/2\beta} - 1)}{3\alpha}\Big(\frac{4\beta \eta^2 L^4}{\alpha} H_q(\rho_0) + 6\beta\eta D L^2 + 4\beta V_{\mathrm{grad}}^2\Big) \le 2\eta\Big(\frac{8\eta^2 L^4}{\alpha} H_q(\rho_0) + 6\eta D L^2 + 4 V_{\mathrm{grad}}^2\Big),$$
where we use the inequality $e^a \le 1 + 2a$ for $a \in [0, 1]$ and $3\alpha\eta/2\beta \le 1$ (derived from the assumption on $\eta$). Rearranging this inequality, we have
$$H_q(\rho_\eta) \le e^{-3\alpha\eta/2\beta}\Big(1 + \frac{16\eta^3 L^4}{\alpha}\Big) H_q(\rho_0) + e^{-3\alpha\eta/2\beta}\big(12\eta^2 D L^2 + 8 V_{\mathrm{grad}}^2 \eta\big) \le e^{-\alpha\eta/\beta} H_q(\rho_0) + 12\eta^2 D L^2 + 8\eta V_{\mathrm{grad}}^2,$$
where the last inequality follows from $1 + \frac{16\eta^3 L^4}{\alpha} \le 1 + \frac{\alpha\eta}{16\beta^2} \le 1 + \frac{\alpha\eta}{2\beta} \le e^{\alpha\eta/2\beta}$.
By replacing $\rho_0$ with $\rho_k$ and $\rho_\eta$ with $\rho_{k+1}$, we get the conclusion.

proof of Proposition C.3. By Lemma C.7, it holds that
$$H_q(\rho_k) \le e^{-\alpha\eta k/\beta} H_q(\rho_0) + \big(12\eta^2 D L^2 + 8\eta V_{\mathrm{grad}}^2\big) \sum_{k'=1}^{k} e^{-\alpha\eta k'/\beta} \le e^{-\alpha\eta k/\beta} H_q(\rho_0) + \frac{12\eta^2 D L^2 + 8\eta V_{\mathrm{grad}}^2}{1 - e^{-\alpha\eta/\beta}} \le e^{-\alpha\eta k/\beta} H_q(\rho_0) + \frac{16\beta\eta D L^2}{\alpha} + \frac{32\beta V_{\mathrm{grad}}^2}{3\alpha},$$
where the last inequality follows from $L > 1$ (derived from Lemma C.5), $1 - e^{-c} \ge \frac34 c$ for $c \in [0, \frac14]$, and $\frac{\alpha\eta}{\beta} < \frac{1}{4L^2} < \frac14$. Thus we get the assertion.

proof of Proposition 4.3. By the Otto–Villani theorem, it holds that $W_2(\rho_k, q)^2 \le \frac{2}{\alpha} H_q(\rho_k)$. Therefore, Proposition C.3 implies that, after $k \ge \frac{\beta}{\alpha\eta} \log \frac{2 H_q(\rho_0)}{\delta}$ iterations, it holds that
$$W_2(\rho_k, q) \le \sqrt{\frac{2}{\alpha}\Big(\delta + \frac{32\beta V_{\mathrm{grad}}^2}{3\alpha}\Big)}.$$
Then we obtain that
$$\mathrm{E}[\bar{R}_\lambda(\theta^{(k)})] - \bar{R}_\lambda^* \le \big(\mathrm{E}[\bar{R}_\lambda(\theta^{(k)})] - \mathrm{E}_{\pi_\infty}[\bar{R}_\lambda(\theta)]\big) + \big(\mathrm{E}_{\pi_\infty}[\bar{R}_\lambda(\theta)] - \bar{R}_\lambda^*\big) \le C(\lambda + m) \sqrt{\frac{2}{\alpha}\Big(\delta + \frac{32\beta V_{\mathrm{grad}}^2}{3\alpha}\Big)} + \frac{D}{2\beta} \log\Big(\frac{eL}{M}\Big(\frac{b\beta}{D} + 1\Big)\Big),$$
where the first term is bounded by Lemma E.2 and the second by Lemma E.1.

D PROOF OF THEOREM 4.6

The objective of this section is to prove Theorem 4.6. First, by the noisy gradient descent, the objective value decreases sufficiently, and we can ensure that, for each node of the teacher network, there exists a node of the student network that is close to it. Then we prove the local convergence property based on the strong convexity around the parameters of the teacher network. The proof of the local convergence relies on that of Zhang et al. (2019). They consider the setting where the parameters of the second layer are all positive, i.e., $a_j = a_j^\circ = 1$ for all $j \in [m]$, and provide the following proposition:

Proposition D.1 (Theorem 4.2 of Zhang et al. (2019)). Let $f^\circ: x \mapsto \sum_{j=1}^m \sigma(\langle w_j^\circ, x\rangle)$ be a teacher network with parameters $W^\circ = (w_1^\circ\ w_2^\circ \cdots w_m^\circ) \in \mathbb{R}^{d\times m}$ and let $\kappa = \sigma_1/\sigma_m$. Suppose that the initialization $W^{(0)}$ satisfies $\|W^{(0)} - W^\circ\|_F \le c\,\sigma_m/(\kappa^3 m^2)$, where $c > 0$ is a small enough absolute constant.
Then there exist absolute constants $c_1, c_2, c_3, c_4$, and $c_5$ such that, under $n \ge c_1 \frac{\kappa^{10} m^9 d}{\sigma_m} \log\big(\frac{\kappa m d}{\sigma_m}\big)\big(\|W^*\|_F^2 + v^2\big)$, the output of the gradient descent with step-size $\eta \le \frac{1}{c_2 \kappa m^2}$ satisfies
$$\|W^{(k)} - W^\circ\|_F^2 \le \Big(1 - \frac{c_3 \eta \sigma}{\kappa^2}\Big)^k \|W^{(0)} - W^\circ\|_F^2 + c_4\, \sigma^2 \kappa^4 m^5\, \frac{d \log n}{n}\, \big(\|W^\circ\|_F^2 + v^2\big)$$
with probability at least $1 - c_5 d^{-10}$.

Their proof can also be applied to the setting of this paper, i.e., where $a_j, a_j^\circ \in \{\pm 1\}$, provided that, when a teacher node $j$ and a student node $k_j$ are close to each other, $a_j^\circ = a_{k_j}$ holds. In Proposition 4.3, if it holds that $\mathrm{E}[\bar{R}_\lambda(\theta^{(k)})] - \bar{R}_\lambda^* \le b\epsilon_0$, we obtain that
$$\bar{R}_\lambda(\theta^{(k)}) - \bar{R}_\lambda^* \le \epsilon_0 \qquad (8)$$
with probability at least $1 - b$, by the Markov inequality. In the rest of this section, we assume that (8) is satisfied. In this case, once Proposition 4.4 is established, we can apply Proposition D.1 and Theorem 4.6 is proved. We give its proof in the rest of this section.

proof of Proposition 4.4. Let $\theta^\circ = ((a_1^\circ, w_1^\circ), \dots, (a_m^\circ, w_m^\circ))$. Then, by $\bar{R}_\lambda(\theta) - \bar{R}_\lambda(\theta^\circ) \le \bar{R}_\lambda(\theta) - \bar{R}_\lambda^* \le \epsilon_0$, it holds that
$$\frac12 \mathrm{E}_x\big[(f_{a^\circ, W^\circ}(x) - f(x; \theta))^2\big] + \lambda \sum_{j=1}^m \big(|a_j|^2 + \|w_j\|^2\big) \le \lambda \sum_{j=1}^m \big(|a_j^\circ|^2 + \|w_j^\circ\|^2\big) + \epsilon_0,$$
and therefore,
$$\frac12 \mathrm{E}_x\big[(f_{a^\circ, W^\circ}(x) - f(x; \theta))^2\big] \le \frac{\epsilon_0}{m} \sum_{j=1}^m \big(|a_j^\circ|^2 + \|w_j^\circ\|^2\big) + \epsilon_0 \le \frac{\epsilon_0}{m} \big(m + \|W^\circ\|_F^2\big) + \epsilon_0 \le 3\epsilon_0,$$
where we use $|a_j^\circ|^2 = 1$ for all $j \in [m]$ and $\sum_{j=1}^m \|w_j^\circ\|^2 = \|W^\circ\|_F^2 \le m\|W^\circ\|_2^2 \le m$. We then move to evaluate the left-hand side.
Since $\sigma(u) = \frac{u + |u|}{2}$ for $u \in \mathbb{R}$, it holds that
$$f(x; \theta) = \sum_{j=1}^m a_j \sigma(\langle w_j, x\rangle) = \frac12 \sum_{j=1}^m a_j \big(|\langle w_j, x\rangle| + \langle w_j, x\rangle\big), \qquad f_{a^\circ, W^\circ}(x) = \sum_{j=1}^m a_j^\circ \sigma(\langle w_j^\circ, x\rangle) = \frac12 \sum_{j=1}^m a_j^\circ \big(|\langle w_j^\circ, x\rangle| + \langle w_j^\circ, x\rangle\big).$$
Hence we have that
$$\frac12 \mathrm{E}_x\big[(f_{a^\circ, W^\circ}(x) - f(x; \theta))^2\big] = \frac18 \mathrm{E}_x\Big[\Big(\sum_{j=1}^m a_j^\circ |\langle w_j^\circ, x\rangle| - \sum_{j=1}^m a_j |\langle w_j, x\rangle|\Big)^2\Big] + \frac18 \mathrm{E}_x\Big[\Big\langle \sum_{j=1}^m a_j^\circ w_j^\circ - \sum_{j=1}^m a_j w_j,\ x\Big\rangle^2\Big],$$
where the last equality follows from $\mathrm{E}_x[|\langle w_1, x\rangle|\, \langle w_2, x\rangle] = 0$ for all $w_1, w_2 \in \mathbb{R}^d$, which in turn follows from the symmetry of the distribution $P_X$. Then Eq. (9) gives that
$$\mathrm{E}_x\Big[\Big(\sum_{j=1}^m a_j^\circ |\langle w_j^\circ, x\rangle| - \sum_{j=1}^m a_j |\langle w_j, x\rangle|\Big)^2\Big] \le 24\epsilon_0, \qquad (10)$$
$$\mathrm{E}_x\Big[\Big\langle \sum_{j=1}^m a_j^\circ w_j^\circ - \sum_{j=1}^m a_j w_j,\ x\Big\rangle^2\Big] \le 24\epsilon_0. \qquad (11)$$
The analysis based on Eq. (10), the error analysis of student networks with the absolute value activation, is conducted in Zhou et al. (2021); here we import Lemma D.2 from their technique. They focus on the setting where $a_j^\circ = 1$ for all $j \in [m]$, but we can apply it here. Then we get that, for every $j \in [m]$, there exist $k_j \in [m]$ and a constant $C > 0$ such that $\arccos\big(|\langle w_j^\circ, w_{k_j}\rangle| / \|w_j^\circ\|\|w_{k_j}\|\big) \le C m \sigma_{\min}^{-5/3} \epsilon_0^{1/3}$ and $\|a_{k_j} w_{k_j} - w_j^\circ\| \le \mathrm{poly}(m, \sigma_{\min}^{-1})\, \epsilon_0^{3/8}$. We simply denote $k_j$ by $j$. Since Zhou et al. (2021) use the absolute value as the activation, it may hold that $\arccos\big(\langle w_j^\circ, w_j\rangle / \|w_j^\circ\|\|w_j\|\big) > \pi/2$ (i.e., $w_j^\circ$ and $w_j$ have "opposite" directions). From now on, we exclude such cases by Eq. (11). Let $a = (a_1, \dots, a_m) = (a_1^\circ, \dots, a_m^\circ)$ and $W_\Delta = (w_1^\circ - w_1, \dots, w_m^\circ - w_m)$, and denote the angle between $w_j^\circ$ and $w_j$ by $\phi_j$. Then, Eq. (11) can be rewritten as
$$\mathrm{E}_{x \sim P_X}\big[(a^\top W_\Delta^\top x)^2\big] \le 24\epsilon_0.$$
For $x \sim N(0, I_d)$, $r^2 := \|x\|^2$ and $\varphi := x/\|x\|$ are random variables that independently follow the chi-squared distribution and the uniform distribution on $S^{d-1}$, respectively.
Hence it holds that
$$\mathrm{E}_{x \sim P_X}\big[(a^\top W_\Delta^\top x)^2\big] = \frac{\mathrm{E}_{x \sim N(0, I_d)}\big[(a^\top W_\Delta^\top x)^2\big]}{\mathrm{E}_{x \sim N(0, I_d)}[\|x\|^2]} = \frac{\mathrm{E}_{r \sim N(0, \|W_\Delta a\|^2)}[r^2]}{d} = \frac{\|W_\Delta a\|^2}{d} \le 24\epsilon_0.$$
This implies $\|W_\Delta a\|^2 \le 24\epsilon_0 d$. Since $w_j^\circ - w_j = (1 - \langle w_j^\circ, w_j\rangle) w_j^\circ + (\langle w_j^\circ, w_j\rangle w_j^\circ - w_j)$ and $\|\langle w_j^\circ, w_j\rangle w_j^\circ - w_j\| = \sin\phi_j$, we have that
$$W_\Delta a = W^\circ \big((1 - \langle w_1^\circ, w_1\rangle) a_1, \dots, (1 - \langle w_m^\circ, w_m\rangle) a_m\big)^\top + \sum_{j=1}^m a_j \big(\langle w_j^\circ, w_j\rangle w_j^\circ - w_j\big),$$
and the second term is at most $O\big(m^{3/2} \sigma_{\min}^{-5/3} \epsilon_0^{1/3}\big)$ in norm. As for the first term, its norm is at least $\sigma_{\min} \sqrt{\sum_{j=1}^m (1 - \langle w_j^\circ, w_j\rangle)^2}$. Hence, by letting $\epsilon_0 = o(d^{-1} m^{-3/2} \sigma_{\min}^8)$, it must hold that $\langle w_j^\circ, w_j\rangle > 0$, which gives the assertion.

Lemma D.2 (Lemma 9 and Lemma 10 in Zhou et al. (2021)). Assume that $x \sim N(0, I_d)$ and that $f^\circ: x \mapsto \sum_{j=1}^m |\langle w_j^\circ, x\rangle|$ is a teacher network with parameters $w_1^\circ, \dots, w_m^\circ \in \mathbb{R}^d$ satisfying $\min_{j_1 \ne j_2} \arccos\big(\langle w_{j_1}^\circ, w_{j_2}^\circ\rangle / \|w_{j_1}^\circ\|\|w_{j_2}^\circ\|\big) \ge \Delta$ for $\Delta > 0$ and $0 < w_{\min} \le \|w_j^\circ\| \le w_{\max}$ for all $j \in [m]$. Then there exists a threshold $\epsilon_0 = \mathrm{poly}(\Delta, m^{-1}, w_{\max}^{-1}, w_{\min})$ such that, if a student network $f: x \mapsto \sum_{j=1}^m |\langle w_j, x\rangle|$ satisfies $\mathrm{E}_x[(f^\circ - f)^2] \le \epsilon \le \epsilon_0$, then for every $j \in [m]$ there exist $k_j \in [m]$ and a constant $C > 0$ such that $\arccos\big(|\langle w_j^\circ, w_{k_j}\rangle| / \|w_j^\circ\|\|w_{k_j}\|\big) \le C m w_{\max} w_{\min}^{-5/3} \epsilon^{1/3}$ and $\|a_{k_j} w_{k_j} - w_j^\circ\| \le \mathrm{poly}(m, \Delta^{-1}, w_{\max})\, \epsilon^{3/8}$.

E AUXILIARY LEMMAS

E.1 EVALUATION OF THE INVARIANT MEASURE

This subsection provides lemmas for evaluating function values sampled from the invariant measure $\pi_\infty$. These are utilized in the proof of Proposition 4.3 (see Appendix C). First, we introduce two results from Raginsky et al. (2017), and then we prove the dissipativity, which is imposed as an assumption in these results.

Lemma E.1 (Proposition 11 in Raginsky et al. (2017)). Suppose that $f\colon \Theta \to \mathbb{R}$ satisfies the following conditions:
• $f$ is $L$-smooth.
• $f$ is $(M, b)$-dissipative, i.e., it holds that $\langle \theta, \nabla f(\theta) \rangle \ge M\|\theta\|^2 - b$ for any $\theta \in \Theta$.
Then, for any $\beta \ge 2/M$, it holds that
$$\mathbb{E}_{\theta \sim \pi_\infty}[f(\theta)] - \min_{\theta \in \Theta} f(\theta) \le \frac{d}{2\beta}\log\!\left(\frac{eL}{M}\Big(\frac{b\beta}{d} + 1\Big)\right).$$

Lemma E.2 (Lemma 2 and Lemma 6 in Raginsky et al. (2017)). Let $\mu_1, \mu_2$ be two probability measures on $\Theta$ with finite second moments, and let $f\colon \Theta \to \mathbb{R}$ be an $(M, b)$-dissipative function satisfying $\|\nabla f(0)\| \le B$ for some $B \ge 0$. Then it holds that
$$\left|\int_\Theta f\,d\mu_1 - \int_\Theta f\,d\mu_2\right| \le (M\sigma + B)\,W_2(\mu_1, \mu_2),$$
where $\sigma^2 := \max\big\{\int_\Theta \|\theta\|^2 d\mu_1,\ \int_\Theta \|\theta\|^2 d\mu_2\big\}$.

We next give a proof of Lemma C.4, the LSI for the invariant measure $\pi_\infty$. The key observation is that $R_\lambda$ can be decomposed into a bounded term (the $L^2$-distance) and a strongly convex term (the regularization term). Combining this fact with the following lemma, we can ensure the LSI.

Lemma E.4 (Holley & Stroock (1987); Nitanda et al. (2021b)). Let a probability measure on $\Theta$ with a density function $q$ satisfy the LSI with a constant $\alpha$. For a function $f\colon \Theta \to \mathbb{R}$ that satisfies $|f(\theta)| \le B$ for any $\theta \in \Theta$, the probability measure defined by
$$Q(\theta)d\theta := \frac{\exp(f(\theta))\,q(\theta)}{\mathbb{E}_q[\exp(f(\theta))]}\,d\theta$$
satisfies the LSI with a constant $\alpha\exp(-4B)$.

Proof of Lemma C.4. First, we note that
$$\exp(-\beta R_\lambda(\theta))\,d\theta = \exp\big(-\beta\lambda\|\theta\|^2\big)\cdot\exp\Big(-\frac{\beta}{2}\mathbb{E}_x\big[(f_{a^\circ,W^\circ}(x) - f(x;\theta))^2\big]\Big)\,d\theta.$$
Since the function $\theta \mapsto \beta\lambda\|\theta\|^2$ is $2\beta\lambda$-strongly convex, the measure with density proportional to $\exp(-\beta\lambda\|\theta\|^2)$ satisfies the LSI with a constant $2\beta\lambda$ (Bakry & Émery, 1985).
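A quick sanity check on Lemma E.1: for the quadratic $f(\theta) = (\gamma/2)\|\theta\|^2$, we have $L = M = \gamma$ and $b = 0$, the Gibbs measure is Gaussian, and $\mathbb{E}[f] - \min f = d/(2\beta)$ coincides with the bound $\frac{d}{2\beta}\log e = \frac{d}{2\beta}$. The sketch below (our own toy constants) verifies this numerically by sampling the Gaussian Gibbs measure directly.

```python
import numpy as np

rng = np.random.default_rng(2)
d, beta, gamma = 3, 10.0, 2.0

# f(theta) = (gamma/2)||theta||^2 is L-smooth (L = gamma) and
# (M, b)-dissipative with M = gamma, b = 0.  Its Gibbs measure
# pi ∝ exp(-beta f) is N(0, (beta*gamma)^{-1} I_d), sampled directly.
theta = rng.standard_normal((200_000, d)) / np.sqrt(beta * gamma)
gap = np.mean(0.5 * gamma * np.sum(theta**2, axis=1))   # E[f] - min f
bound = d / (2 * beta) * np.log(np.e * (gamma / gamma) * (0.0 * beta / d + 1))
```

For this quadratic the bound holds with near equality, which is a useful check that no constant has been dropped.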
Moreover, since a straightforward calculation shows that
$$\frac{\beta}{2}\mathbb{E}_x\big[(f_{a^\circ,W^\circ}(x) - f(x;\theta))^2\big] \le 2\beta m^2 R^4,$$
Lemma E.4 implies that $\pi_\infty$ satisfies the LSI with a constant $2\beta\lambda\exp\big(-8\beta m^2 R^4\big)$, which gives the conclusion.

For the proof of Lemma C.5 below, recall that the $w_j$-gradient difference decomposes termwise into
$$\bar a_i\bar a_j J(\bar w_i, \bar w_j)\frac{d\bar w_j}{dw_j} - \bar a'_i\bar a'_j J(\bar w'_i, \bar w'_j)\frac{d\bar w'_j}{dw_j}$$
(see Eq. (3) and Eq. (4)). The following lemma gives an upper bound for each term.

Proof (of Lemma E.5). The proof is based on a straightforward calculation. As for the first inequality, for every $i \in [m]$, it holds that
$$\Big|\bar a_i I(\bar w_i, \bar w_j)\frac{d\bar a_j}{da_j} - \bar a'_i I(\bar w'_i, \bar w'_j)\frac{d\bar a'_j}{da_j}\Big| \le \Big|(\bar a_i - \bar a'_i)\, I(\bar w_i, \bar w_j)\frac{d\bar a_j}{da_j}\Big| + \Big|\bar a'_i\big(I(\bar w_i, \bar w_j) - I(\bar w_i, \bar w'_j)\big)\frac{d\bar a_j}{da_j}\Big|$$
$$+ \Big|\bar a'_i\big(I(\bar w_i, \bar w'_j) - I(\bar w'_i, \bar w'_j)\big)\frac{d\bar a_j}{da_j}\Big| + \Big|\bar a'_i I(\bar w'_i, \bar w'_j)\Big(\frac{d\bar a_j}{da_j} - \frac{d\bar a'_j}{da_j}\Big)\Big|,$$
and each term is bounded by using $|I(w, v)| \le \|w\|\|v\|/(2d)$ together with the boundedness of the barred parameters. Then the triangle inequality gives the first assertion. As for the second inequality, for every $i \in [m]$, it holds that
$$\Big\|\bar a_i\bar a_j J(\bar w_i, \bar w_j)\frac{d\bar w_j}{dw_j} - \bar a'_i\bar a'_j J(\bar w'_i, \bar w'_j)\frac{d\bar w'_j}{dw_j}\Big\| \le \Big\|(\bar a_j - \bar a'_j)\bar a_i\, J(\bar w_i, \bar w_j)\frac{d\bar w_j}{dw_j}\Big\| + \Big\|(\bar a_i - \bar a'_i)\bar a'_j\, J(\bar w_i, \bar w_j)\frac{d\bar w_j}{dw_j}\Big\|$$
$$+ \Big\|\bar a'_i\bar a'_j\big(J(\bar w_i, \bar w_j) - J(\bar w_i, \bar w'_j)\big)\frac{d\bar w_j}{dw_j}\Big\| + \Big\|\bar a'_i\bar a'_j\big(J(\bar w_i, \bar w'_j) - J(\bar w'_i, \bar w'_j)\big)\frac{d\bar w_j}{dw_j}\Big\| + \Big\|\bar a'_i\bar a'_j J(\bar w'_i, \bar w'_j)\Big(\frac{d\bar w_j}{dw_j} - \frac{d\bar w'_j}{dw_j}\Big)\Big\|$$
$$\le \frac{R|a_j - a'_j|\sqrt{d}R}{2d}\cdot 4\sqrt{d}R + \frac{R|a_i - a'_i|\sqrt{d}R}{2d}\cdot 4\sqrt{d}R + \frac{R^2\sqrt{d}}{2d}\|w_j - w'_j\|\cdot 4\sqrt{d}R + \frac{R^2\sqrt{d}}{2d}\|w_i - w'_i\|\cdot 4\sqrt{d}R + \frac{2R^2\sqrt{d}R}{2d}\|w_j - w'_j\|$$
$$\le 2R^3\big(|a_j - a'_j| + \|w_j - w'_j\|\big) + 2R^3\big(|a_i - a'_i| + \|w_i - w'_i\|\big).$$
Again, by using the triangle inequality, we obtain the conclusion.



$\sigma(\langle w, x\rangle) = \|w\|\,\sigma(\langle w/\|w\|, x\rangle)$ for any $w \in \mathbb{R}^d \setminus \{0\}$ and $x \in \mathbb{R}^d$.
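This positive 1-homogeneity of the ReLU can be checked directly; a minimal sketch with arbitrary toy dimensions:

```python
import numpy as np

rng = np.random.default_rng(3)
relu = lambda z: np.maximum(z, 0.0)

w, x = rng.standard_normal(6), rng.standard_normal(6)
lhs = relu(w @ x)                                           # sigma(<w, x>)
rhs = np.linalg.norm(w) * relu((w / np.linalg.norm(w)) @ x) # ||w|| sigma(<w/||w||, x>)
```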



Figure 1: Convergence of the training loss and test loss.

Proof of Lemma 4.2. The proof of Lemma 4.2 basically follows those of Theorem 1 in Mei et al. (2018a) and Lemma 5.3 in Zhang et al. (2019). For notational simplicity, we write $D := m(d+1)$. Let $N_\epsilon$ be the $\epsilon$-covering number of $B(0, \sqrt{D}R)$ with respect to the $\ell_2$-distance, and let $\Theta_\epsilon = \{\tilde\theta_1, \dots, \tilde\theta_{N_\epsilon}\}$ be a corresponding $\epsilon$-cover with $|\Theta_\epsilon| = N_\epsilon$. It is known that $\log N_\epsilon = D\log\big(3\sqrt{D}R/\epsilon\big)$ is sufficient to ensure the existence of such a cover. First, we note that
$$\nabla\widehat{R}_\lambda(\theta) - \nabla R_\lambda(\theta) = \frac{1}{n}\sum_{i=1}^n\nabla\ell(y_i, f(x_i;\theta)) - \nabla\mathbb{E}\big[\ell(y, f(x;\theta))\big].$$
For each $\theta \in \Theta$, let $j(\theta) \in \arg\min_{j\in[N_\epsilon]}\|\theta - \tilde\theta_j\|$.
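To get a feel for the covering bound $\log N_\epsilon = D\log(3\sqrt{D}R/\epsilon)$ with $D = m(d+1)$, here is the arithmetic for some purely illustrative sizes; note the linear growth in $D$ and only logarithmic growth in $R/\epsilon$.

```python
import math

def log_cover(m: int, d: int, R: float, eps: float) -> float:
    """Log eps-covering number of the ball B(0, sqrt(D) R), D = m(d+1)."""
    D = m * (d + 1)
    return D * math.log(3 * math.sqrt(D) * R / eps)

base = log_cover(m=10, d=20, R=2.0, eps=0.01)
finer = log_cover(m=10, d=20, R=2.0, eps=0.001)  # 10x smaller eps: +D*log(10)
wider = log_cover(m=20, d=20, R=2.0, eps=0.01)   # 2x more neurons: > doubles
```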

where we used Lemma E.1 and Lemma E.2 for the inequality. By specifying $\alpha$, $L$, $M$, and $b$ through Lemma C.4, Lemma C.5, and Lemma E.3, we get the conclusion.

D PROOF OF THEOREM 4.6

the condition number of $W^\circ$, and $\bar\sigma = \big(\prod_{j=1}^m \sigma_j\big)/\sigma_m^m$. Assume that the inputs $(x_i)_{i=1}^n$ are sampled from $N(0, I_d)$ and that the outputs $(y_i)_{i=1}^n$ are generated by the teacher network. Suppose that the initial estimator $W$

Lemma E.3 (dissipativity). $R_\lambda(\cdot)$ is $(M, b)$-dissipative with $M = 2\lambda$ and $b = 8m^2R^3$.

Proof. By a straightforward calculation, we have that
$$\langle\theta, \nabla R_\lambda(\theta)\rangle = 2\lambda\|\theta\|^2 + \sum_{j=1}^m a_j\nabla_{a_j}L(\theta) + \sum_{j=1}^m\big\langle w_j, \nabla_{w_j}L(\theta)\big\rangle.$$
As for the second term and the third term, since $|I(w, v)| \le \|w\|\|v\|/(2d)$ and $\|J(w, v)\| \le \|v\|/(2d)$ for any $w, v \in \mathbb{R}^d$, we have that their sum is bounded below by $-8m^2R^3$. Combining these inequalities, we get that
$$\langle\theta, \nabla R_\lambda(\theta)\rangle \ge 2\lambda\|\theta\|^2 - 8m^2R^3,$$
which gives the conclusion.

E.2 PROOF OF LEMMA C.4
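The structure behind this dissipativity claim, a $2\lambda$-strongly-convex regularizer plus a perturbation whose radial derivative is uniformly bounded, can be illustrated on a toy objective. The bounded part below, $c\sum_i \tanh(\theta_i)$, is our own stand-in rather than the actual loss; for it, $\langle\theta, \nabla g(\theta)\rangle = c\sum_i \theta_i\,\mathrm{sech}^2(\theta_i) \ge -0.5\,c\,d$.

```python
import numpy as np

rng = np.random.default_rng(4)
lam, c, d = 0.5, 1.0, 4

# Toy stand-in for R_lambda: lam*||t||^2 + c*sum(tanh(t_i)).
# Since |t * sech(t)^2| <= 0.5 for all real t, this objective is
# (2*lam, 0.5*c*d)-dissipative, mirroring the (2*lambda, 8 m^2 R^3) claim.
grad = lambda t: 2 * lam * t + c / np.cosh(t) ** 2
b = 0.5 * c * d

worst = min(
    t @ grad(t) - 2 * lam * (t @ t)
    for t in (rng.standard_normal(d) * rng.uniform(0.1, 10) for _ in range(2000))
)
```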

E.3 PROOF OF LEMMA C.5

In this subsection we write $L(\theta) := \frac{1}{2}\mathbb{E}_x\big[(f_{a^\circ,W^\circ}(x) - f(x;\theta))^2\big]$, i.e., $R_\lambda(\theta) = L(\theta) + \lambda\|\theta\|^2$. Since $\theta \mapsto \lambda\|\theta\|^2$ is $2\lambda$-smooth, it suffices to show that $L(\cdot)$ is $L'$-smooth with $L' = O(m^2R^3)$ in order to prove Lemma C.5. To this end, let $\theta, \theta' \in \Theta$. We consider the decomposition
$$\|\nabla L(\theta) - \nabla L(\theta')\|^2 = \sum_{j=1}^m\Big(\big|\nabla_{a_j}L(\theta) - \nabla_{a_j}L(\theta')\big|^2 + \big\|\nabla_{w_j}L(\theta) - \nabla_{w_j}L(\theta')\big\|^2\Big),$$
where
$$\nabla_{a_j}L(\theta) - \nabla_{a_j}L(\theta') = \sum_{i=1}^m\Big(\bar a_i I(\bar w_i, \bar w_j)\frac{d\bar a_j}{da_j} - \bar a'_i I(\bar w'_i, \bar w'_j)\frac{d\bar a'_j}{da_j}\Big)$$

Lemma E.5. For any $\theta, \theta' \in \Theta$ and $j \in [m]$, it holds that
$$\big|\nabla_{a_j}L(\theta) - \nabla_{a_j}L(\theta')\big| \le m\Big(\frac{5R^3}{2} + \frac{2\sqrt{d}R}{d}\Big)\big(|a_j - a'_j| + \|w_j - w'_j\|\big) + 2R^3\sum_{i=1}^m\|\bar w_i - \bar w'_i\|,$$
$$\big\|\nabla_{w_j}L(\theta) - \nabla_{w_j}L(\theta')\big\| \le m\Big(\frac{2R}{d} + 2R^3\Big)\big(|a_j - a'_j| + \|w_j - w'_j\|\big) + 2R^3\sum_{i=1}^m\big(|a_i - a'_i| + \|w_i - w'_i\|\big).$$



Published as a conference paper at ICLR 2023

Proof of Lemma C.5. By using Lemma E.5, $L(\cdot)$ is $L'$-smooth with $L' = O(m^2R^3)$. Combining this with the fact that the mapping $\theta \mapsto \lambda\|\theta\|^2$ is $2\lambda$-smooth and the triangle inequality, we obtain that $R_\lambda(\cdot)$ is $L$-smooth with $L = O(m^2R^3 + \lambda)$, which gives the conclusion.
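The final composition step uses that the gradient of the regularizer $\theta \mapsto \lambda\|\theta\|^2$ is $2\lambda$-Lipschitz, and that smoothness constants add under summation. A small numerical illustration of the first fact (toy dimensions of our own choosing):

```python
import numpy as np

rng = np.random.default_rng(5)
lam = 0.3
grad = lambda t: 2 * lam * t   # gradient of t -> lam * ||t||^2

# The gradient-difference ratio is exactly 2*lam for every pair,
# i.e., the regularizer is 2*lam-smooth.
ratios = []
for _ in range(1000):
    t1, t2 = rng.standard_normal(4), rng.standard_normal(4)
    ratios.append(np.linalg.norm(grad(t1) - grad(t2)) / np.linalg.norm(t1 - t2))
max_ratio = max(ratios)
```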

